Collecting Names by Voice

Accurate name collection is one of the most notoriously tough challenges we can present to an automatic speech recognition (ASR) system, such as those used in IVRs and voicebots. Sometimes this is because the speaker’s name doesn’t conform to standard English phonetics; sometimes it’s because of an unconventional spelling of a common name (for example, “Denice”, “Denise”, or even “Deneese”). Far more often, however, the problem comes down to simple acoustics: some sounds are just easily confused with others.

“I know! We’ll just have the caller spell their name for us, to make sure we get it right!” On first thought, that seems like a great idea, but it’s actually not, and here’s why: the spoken names of the letters of the English alphabet are notoriously easy to confuse, so having the user spell their name can actually make things even worse. Consider the “e-set” of English letters: b, c, d, e, g, p, t, v, and (in American English) z. They differ from each other only in the initial consonant (or, in the case of e, the lack of one), and they’re each only one syllable, so there’s almost no context the speech recognition can go on to narrow things down. And this isn’t the only set of similar-sounding letters; there’s also the “a-set” (a, h, j, k), as well as several easily confused pairs (u / q, f / s, m / n).
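To get a rough feel for how much ambiguity spelling introduces, here is a minimal Python sketch. It assumes a deliberately crude model in which any letter can be misheard as any other letter in its group, and letters outside the groups listed above are always heard correctly. The names CONFUSABLE_SETS, confusable_with, and plausible_spellings are made up for this illustration and don’t come from any ASR toolkit.

```python
from math import prod

# Confusable groups taken from the paragraph above. Letters that are not
# listed are (optimistically) assumed to be recognized without confusion.
CONFUSABLE_SETS = [
    set("bcdegptvz"),  # the "e-set" (z in American English)
    set("ahjk"),       # the "a-set"
    set("uq"),
    set("fs"),
    set("mn"),
]

def confusable_with(letter: str) -> set[str]:
    """Return every letter that could be heard in place of `letter`."""
    letter = letter.lower()
    for group in CONFUSABLE_SETS:
        if letter in group:
            return group
    return {letter}

def plausible_spellings(name: str) -> int:
    """Count the letter sequences a spelled-out name could be mistaken for."""
    return prod(len(confusable_with(ch)) for ch in name if ch.isalpha())

print(plausible_spellings("Denise"))  # 9 * 9 * 2 * 1 * 2 * 9 = 2916
```

Under this simplified model, a six-letter name like “Denise” has 2,916 candidate spellings. A real recognizer weights those candidates rather than treating them as equally likely, but the count shows how quickly the uncertainty multiplies from one letter to the next.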

The bottom line is that asking the user to spell their name instead of simply speaking it doesn’t gain us much, and can even make things worse.

So why not use the military alphabet (alpha, bravo, charlie, delta…) and substitute words for letters? That would be a great idea if everyone knew the military alphabet, which deliberately uses words so acoustically different from each other that they won’t be confused. But not everyone knows the military alphabet, and those who don’t will instead substitute words that do sound similar enough to be easily confused (e.g., bat, cat, hat, fat, mat, sat). This is also why it still tends to fail if we ask users to speak any word that begins with the correct letter: they’ll pick exactly these easily confused words.

Users also tend to become confused or overly verbose if the same letter occurs more than once in their name. My own name, for example, ends in “-rr”. Faced with something like this, many users will either try to come up with a different word for each occurrence, or create constructions such as, “Again, R as in Romeo”, which can trip up the ASR. No matter how a prompt is written, we simply cannot predict how users will attempt to respond (except that it will probably not be whatever the prompt requests), and it’s simply not practical to try to design an ASR that’s able to handle every possible response pattern.

Is there any solution at all to this issue? The best answer I can give is, “Mmmmmaybe. Sometimes. Sort of…” It starts by looking at why you’re collecting the name, and what you expect the bot to do with it. It can be very risky to rely on a name as the primary identifier or authentication factor, so you’ll usually want, instead, to use some kind of numeric identifier to locate the record, account, or whatever it is you’re looking for. This may narrow things down to a single record, which may even contain a name already. If it doesn’t (for example, if you’re collecting information for making a reservation), just have the user speak their first name and then last name, without spelling it, and then let the backend code attempt to make the closest match in post-processing. Yes, this can result in a listing of “Jon Smith” vs. “John Smith”, but the combination of the unique numeric identifier plus the name can help eliminate this in many cases.
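As a rough illustration of that “numeric identifier first, closest match second” idea, here is a minimal Python sketch using the standard library’s difflib. It assumes the record found by the numeric identifier already holds a name to compare against. The RESERVATIONS table, the confirm_name function, and the 0.75 threshold are all hypothetical; a real system would tune the threshold against its own data, and would likely compare sounds rather than letters.

```python
from difflib import SequenceMatcher

# Hypothetical records keyed by a numeric confirmation code.
RESERVATIONS = {
    "48213": "John Smith",
    "48214": "Denise Carter",
}

def similarity(heard: str, on_file: str) -> float:
    """Return a 0.0-1.0 similarity score, ignoring case and outer whitespace."""
    return SequenceMatcher(
        None, heard.strip().lower(), on_file.strip().lower()
    ).ratio()

def confirm_name(code: str, heard_name: str, threshold: float = 0.75) -> bool:
    """Locate the record by its numeric code, then check that the name the
    ASR heard is close enough to the name on file."""
    on_file = RESERVATIONS.get(code)
    if on_file is None:
        return False  # no such record, so the name is irrelevant
    return similarity(heard_name, on_file) >= threshold

print(confirm_name("48213", "Jon Smith"))      # True: one letter off
print(confirm_name("48213", "Denise Carter"))  # False: wrong person entirely
```

The point of the design is that the name never has to identify the record by itself. The number does that job, and the name only has to be close enough to confirm we’ve found the right one.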

It’s worth noting, too, that you’ll need this post-ASR matching step anyway, so the system will recognize “Liz”, “Betty”, and “Beth” as valid, even if the listed name is “Elizabeth”.
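One simple way to handle this is a lookup table of accepted short forms, sketched below in Python. The NICKNAMES table and the first_name_matches function are hypothetical, and the table holds only the example from the text; a production system would need a far larger list.

```python
# Hypothetical table mapping a listed first name to accepted short forms.
NICKNAMES = {
    "elizabeth": {"liz", "betty", "beth"},
}

def first_name_matches(heard: str, on_file: str) -> bool:
    """Accept the listed first name itself or any known nickname for it."""
    heard = heard.strip().lower()
    on_file = on_file.strip().lower()
    return heard == on_file or heard in NICKNAMES.get(on_file, set())

print(first_name_matches("Beth", "Elizabeth"))  # True
print(first_name_matches("Bob", "Elizabeth"))   # False
```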

But in the reservation case just mentioned, let a human sort it out if needed. If it’s a restaurant reservation, for instance, the host is simply going to speak the name anyway, so the spelling difference is irrelevant. There will still be cases where the ASR just plain mangles a name beyond recognition, and mine is one of those names that’s especially subject to this! But I’ve learned that “all-JIRE”, “ul-grrr”, “alger”, and a host of other strange variations all mean “Aelgyrr” (and for the record, it’s just “ale-GEAR”). Humans are still much better at this than automation, even AI-enhanced automation; they’ll figure it out.