7 December 2016

It's no secret that automated speech transcriptions are low quality to the point of being almost unreadable. Understanding such a transcription often comes down to imagining saying the written words and then fudging the sounds themselves until they resemble a reasonable sentence. So the machine took sounds and changed them to text, but then you took that text and changed it back to (imagined) sound before actually understanding it. How awful!

You might have noticed that this effect is even more pronounced when the original speech includes unusual words, such as topic-specific jargon. From the perspective of someone who has looked at speech recognition, this is unsurprising. Modern speech recognition relies not only on information from the input sound itself, but also on the structure of the words and sentences of the language being spoken.

To an outsider, it's not immediately clear why you need to know the structure of the language. Since each word has just a couple of pronunciations made up of recognizable phonemes, it seems like you should be able to take a spoken word and work backwards to figure out what the word is, or to determine that it matches no word you know.

However, we take the concept of a word for granted. Spoken language doesn't actually have pauses between words the way that written English has spaces. A prototypical example is the phrase "gas station". A typical pronunciation will only have a single 's' phoneme, although depending on how fast you are talking it might be a bit longer than usual. So anyone, human or machine, trying to understand the phrase "gas station" needs to know that the phrase exists in order to distinguish it from an unknown word "gastation".
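To make the "gas station" point concrete, here is a toy sketch of splitting an unbroken stream of sounds into words. It is not how a real speech recognizer works: plain letters stand in for phonemes, the function name `segmentations` and the vocabularies are made up for illustration, and the only sound-merging rule modeled is that a word may share its first sound with the last sound of the word before it.

```python
def segmentations(stream, vocab):
    """Return every way to split `stream` into words from `vocab`.

    Letters stand in for phonemes (a simplifying assumption). A word may
    share its first symbol with the last symbol of the previous word,
    which models the single 's' in a spoken "gas station".
    """
    results = []

    def walk(pos, words):
        if pos == len(stream):
            results.append(words)
            return
        for word in sorted(vocab):
            # Normal case: the whole word starts at the current position.
            if stream.startswith(word, pos):
                walk(pos + len(word), words + [word])
            # Merged case: the word's first symbol was already consumed
            # as the last symbol of the previous word.
            if (
                words
                and len(word) > 1
                and words[-1][-1] == word[0]
                and stream.startswith(word[1:], pos)
            ):
                walk(pos + len(word) - 1, words + [word])

    walk(0, [])
    return results


known = {"the", "gas", "station", "is", "closed"}
print(segmentations("gastation", known))
print(segmentations("thegastationisclosed", known))
print(segmentations("thegastationisclosed", known - {"station"}))
```

With the full vocabulary, both streams split in exactly one way. Once "station" is removed, the last call returns an empty list: this strict matcher cannot tell where the unknown stretch begins or ends, so it finds no segmentation at all, even though every other word in the stream is known.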

The other day, I realized that this phenomenon is not unique to computers. In the past I had observed that it's hard to understand Japanese sentences when I don't know all the words. At first that might sound meaningless: why would you expect to understand sentences with words you don't know? But it goes further than that: the presence of words that I don't know prevents me from recognizing the words and sentence structures that I do know, because it is not immediately apparent where they begin and end.

The lesson I took from this is that having a good vocabulary is essential for understanding a language. The difference between knowing all of the words in a sentence and knowing all but one can be the difference between full understanding and being completely lost. While it's easy to imagine hearing such a sentence and understanding it as a full sentence with a blank in it, in practice it's much worse.