AI transcription: lost in translation?

Just the other day I came across an article on a topic that has piqued my interest before: speech recognition software misrecognising words.

While this isn’t “new” news, the fact that this article centred on the children’s version of YouTube was interesting. We’ve discussed the accuracy of AI transcription in a number of previous blog posts, but this was a new angle on an old concept.

What’s the story with YouTube Kids?

YouTube Kids was first launched as an app in 2015, as a version of the original YouTube service aimed specifically at children. It allows much more control over access to material that may be sensitive or inappropriate for children under the age of 13 to view. With curated selections of content, parental control features, and filtering of inappropriate videos, it offers a more child-friendly interface for a younger audience.

Many parents have grown to trust the extra level of scrutiny that YouTube Kids affords them.

Which leads us to lost in translation with AI…

AI is used by the likes of YouTube to transcribe the spoken word into text. Closed captions on YouTube videos are generated by Google Speech-to-Text, while Amazon Transcribe is a leading commercial ASR system. Creators can use Amazon Transcribe to generate subtitles for their videos and import them into YouTube when uploading the file.
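As a rough illustration of that workflow (the job name and S3 location below are hypothetical placeholders, but the parameters follow the documented boto3 `start_transcription_job` call), a creator asking Amazon Transcribe for an SRT subtitle file might build a request like this:

```python
# Sketch of requesting SRT subtitles from Amazon Transcribe.
# The job name and S3 URI are placeholders, not real resources.

def build_subtitle_job(job_name, media_uri):
    """Build the parameter dict for boto3's start_transcription_job."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": "mp4",
        "LanguageCode": "en-US",
        # Ask Transcribe to emit an SRT subtitle file alongside the
        # transcript, ready to upload to YouTube with the video.
        "Subtitles": {"Formats": ["srt"]},
    }

params = build_subtitle_job("kids-craft-video", "s3://example-bucket/craft.mp4")
# In a real pipeline you would then call:
# boto3.client("transcribe").start_transcription_job(**params)
```

The subtitle file that comes back is whatever the ASR model heard, which is exactly where the errors discussed below creep in.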

However, what this research noted was that children’s content on YouTube Kids sometimes suffers what’s known as “inappropriate content hallucination”, where an innocuous spoken word is transcribed as a highly inappropriate one. The analyses suggest that such hallucinations are far from occasional, and that ASR systems often produce them with high confidence.

There were some very specific cases highlighted in the research:

. . if you like this craft keep on watching until the end so you can see related videos. . .

was transcribed to

. . if you like this cra* keep on watching until the end so you can see related videos. . .

Or there was this one

. . . in order to be strong and brave like heracles. . .

became

. . . in order to be strong and r**e like heracles. . .

Other rogue words involved “beach” becoming “b**ch”, “buster” turning into “ba***rd”, or “combo” morphing into “condom”.

It certainly highlights the need to be more vigilant about having checks and balances in place whenever an AI application modifies the source.

What’s the answer?!

One suggestion by the authors of the research is to introduce a human element into the transcription process. This would involve having a human in the loop to check on transcription errors, with someone watching the video and manually confirming whether a flagged word was actually spoken.
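A minimal sketch of what that check might look like (the word list and similarity cutoff here are illustrative assumptions, not taken from the research): flag any transcribed word that is an exact or near match for a known-problematic term, and queue those segments for a human reviewer to verify against the audio.

```python
import difflib

# Illustrative blocklist only; a real deployment would use a curated taxonomy.
SENSITIVE_WORDS = {"bitch", "bastard", "condom"}

def flag_for_review(transcript_words, cutoff=0.6):
    """Return transcribed words a human should double-check against the audio.

    Flags exact matches and near-matches (e.g. "beach" is close to a
    sensitive word), since ASR hallucinations are often one sound away.
    """
    flagged = []
    for word in transcript_words:
        close = difflib.get_close_matches(word.lower(), SENSITIVE_WORDS,
                                          n=1, cutoff=cutoff)
        if close:
            flagged.append(word)
    return flagged

print(flag_for_review(["fun", "at", "the", "beach"]))   # → ['beach']
print(flag_for_review(["keep", "watching", "this", "craft"]))  # → []
```

The point of the sketch is that the machine only narrows the search; the final call on whether the word was really spoken still rests with a person.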

The Fiona Shipley take on being lost in translation

We firmly believe that, while there have been significant advancements in AI capabilities over the last few years, humans still stand out as the best transcribers!

It’s been suggested that AI transcription is around 88% accurate, but for most (if not all!) of our clients, this is just too far off the mark. It means a client would need to read through the transcript and make corrections before it could be used as a proper document. At Fiona Shipley we take care of all of this before the document gets anywhere near a client – clients receive a totally accurate transcript that properly reflects what was said.

What’s more, when a person transcribes a recording with accented speech, background noise or multiple speakers, it might take a little longer. With AI, you’ll most likely end up with total nonsense!

Be sure to contact us via alex@fionashipley.com for your next transcription need. 
