Why using artificial intelligence for transcription is a false economy.

It sounds great. And incredibly easy. Using artificial intelligence (AI) for your transcription needs.

It’s great fun for adults and children alike to ask Alexa what the weather is going to be like today. Or get Google Assistant to add something to your shopping list. But how many times have you asked Siri to call someone on your phone, only for it not to recognise the name you’re asking for – despite four attempts at saying it very clearly?!

All of the above are just everyday examples of AI in action – and how it can go wrong. AI does a commendable job of attempting to understand what’s being said, but it just isn’t 100 per cent accurate 100 per cent of the time. When it comes to the business world, getting it right is essential. There’s no room for error.

The simple fact is that whilst the world of AI has come on in leaps and bounds since the early days, it most definitely isn’t there yet. We’ve touched on this before in our blog “Where humans make a difference in transcription”. In theory it should be simple enough – all the AI needs to do is convert one kind of data (sound) into another (text). Yet even in the simplest scenario – a single speaker in a silent room, speaking very clearly – AI fails to recognise some words and can’t get the context right. With multiple speakers in a large room with an audience, it becomes almost impossible.
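On paper, that sound-to-text conversion really does look trivial. Here’s a minimal sketch in Python – purely illustrative, using the open-source Whisper library as one example of a speech-to-text model, with a hypothetical file name – showing just how little code the “simple” version takes:

    import whisper  # open-source speech-to-text library (pip install openai-whisper)

    # Load a small, general-purpose pretrained model.
    model = whisper.load_model("base")

    # One call: sound in, text out. "meeting.mp3" is a hypothetical recording.
    result = model.transcribe("meeting.mp3")

    # The raw transcript - with no guarantee it caught every word or nuance.
    print(result["text"])

Those few lines are exactly why AI transcription sounds so appealing. Everything that follows in this post is what they quietly get wrong.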

Clients have on occasion sent us transcripts of recordings that were digitally transcribed. One job we worked on had to be completely redone – it had reached the client in a completely verbatim format with no punctuation throughout, which made it unusable. Not only that, the machine had replaced the key word running through the entire piece with something completely different. You could see how a machine ended up interpreting the word as it did, because it sounded very similar, but viewed with human knowledge and understanding it made no sense whatsoever!

Here are some of the ways we’ve seen AI get it wrong in the digitally transcribed examples we’ve received at FSTL:

INTONATION

Just recognising sound isn’t enough. When interpreting the meaning behind something, you need to understand how it’s being said. Think of it as “the music of speech”. A speaker’s intonation indicates how they are feeling, shows emphasis, conveys surprise or irony, and signals whether a phrase is a statement or a question. Missing those cues can completely change the meaning of the transcribed text.

PUNCTUATION AND GRAMMAR

We’ve received digitally transcribed documents that contain no punctuation. Pages and pages of text with not a full stop or comma in sight! Attempting to make sense of that endless stream is impossible. AI doesn’t fare much better with grammar. A good example is using Google Translate to convert what you’ve said in English into another language: it might choose the correct words, but it doesn’t quite arrange them the way a native speaker would.

ACCENTS

The UK is awash with different accents, and the way certain words are said can vary hugely. From the West Country to the North East, Birmingham to Northern Ireland, the difference in how words are pronounced is phenomenal. AI finds it incredibly difficult to recognise each accent and the vocabulary that goes with it.

GESTURES

This comes into play particularly with video transcription, where you can see how people are behaving. Gestures can indicate emotion and the attitudes a speaker wants to convey. AI can’t see what actions are taking place or what facial expressions a speaker is using, so it becomes much harder for it to place the message in its correct context. Sometimes what’s being said doesn’t quite match the gestures being used, because what is said and how it’s said are two very distinct things. What we say is cognitive – we think about it before, or as, we say it. Our gestures and facial expressions are far more subconscious: unfiltered expressions of emotion. Building up the whole picture by putting both elements together is a step too far for AI.

FALSE STARTS

A false start happens when a speaker begins a sentence, then breaks off and starts again – often when a discussion becomes heated or intense and the train of thought jumps around. The speaker may substitute a new word or phrase, but the correction can seem out of place against the original subject. AI can’t recognise a false start or put it into the correct context, so the transcription becomes very muddled.

The bottom line is that yes, AI has moved on considerably. It’s better than it was even a year ago. But it’s not there yet! It still has a long way to go before it’s 100 per cent accurate. Just ask yourself: would you trust Alexa with a life-or-death decision? That final stretch to 100 per cent accuracy makes all the difference when your business relies on its transcription. Our advice would be: next time you think AI can do your transcription job, you might want to think twice!
