Voice User Interfaces (VUIs)

Check out this post on Voice User Interfaces that I wrote for the University of Basel blog Sci Five.



Wearables, Speech Recognition, & UX Research: Deixis a Problem


How many computers are on your body right now?

Your cell phone’s in your hand or back pocket. One. Do you have a smartwatch or fitbit? That’s two. In a short period of time, your clothes will be computers. Your rings might be computers. Even your piercings could be.

And these computers will need to do more than just tell us the weather. For them to be worth their salt, they’ll need to register our physical gestures in concert with our speech.

Say I’m downtown. I point to a cafe. “What’s the Yelp rating for this cafe?”

Simple question. Massively complex calculation.

Gesture + Speech

Wearables that register relatively complex physical gesture are a reality. My Force Band from Sphero, for example, allows me to control BB-8 with an outstretched hand like Luke Frigging Skywalker. So yes. I’m on board.

But what about integrating speech recognition into the mix? This has some huge problems. The biggest one: deixis.

Deixis is a big deal. It's our ability to refer to objects and events whose interpretation depends on context. Think about the word 'today,' for example. Out of context, 'today' means nothing. It makes reference to the present only within the present. In other words, tomorrow, 'today' will no longer be the same 'today.'

So, how do you teach a machine to disambiguate between different referents in context?

The short answer: We’ve got a long way to go. But one robot is taking on the job.

Iorek at Brown University is able to disambiguate between two simple objects in a particular context. So, if a researcher asks it to hand her ‘that bowl,’ Iorek can track her gaze and then confirm. “This bowl?”

That’s pretty amazing, but it also gives you a sense of how early we are in solving this problem with regard to human/computer interaction.

There are several reasons for this. We regularly shift between using 'this' and 'that' to refer to physical objects in our environment and to objects or topics in the conversation itself (a phenomenon called anaphora).

For example:

Speaker 1: “Trump left the clean dishes in the dishwasher again.”

Speaker 2: “This is exactly what I’m talking about. Something needs to be done.”


Speaker 1: “Trump is threatening nuclear war again.”

Speaker 2: “This is exactly what I’m talking about. Something needs to be done.”

Notice that the ‘this’ in these cases can refer either to a physical thing (dishes) or a topic (a threat of nuclear war).

Matters are further complicated by the textbook rule itself. English as a Foreign Language learners are taught that 'this' and 'that' indicate proximity to the speaker: 'this' is the word you use for something close to you, and 'that' is the word for something farther away. The truth is not so simple.

A rudimentary analysis shows that English speakers do not obey this 'rule' very closely. 'This' does not necessarily indicate proximity to the speaker, and 'that' does not necessarily indicate distance. First, the speaker may choose one word or the other to signal 'social distance' from an object. "That bowl? That bowl isn't mine." Second, speakers excel at shifting what's called the 'deictic center' in conversation, that is, the primary reference point. So 'this' could be proximal to the speaker, the hearer, or something else entirely, and speakers and hearers regularly shift the deictic center, even for time expressions (like the 'today' example above).

Linguists to the Rescue

Linguists are trained to help with this problem.

But, first off, designers, roboticists, and computer scientists need to be aware that this is a problem. A Google search of 'this' vs. 'that' returns the proximal/distal story outlined above. If a team takes that story at face value, they'll start the entire project on the wrong foot.

Second, depending on the product being developed, a linguist would excel at developing a functional set of the possible referents that the product needs to take into account. Take our disambiguating robot. A linguist could help develop a set of referents based on the robot’s environment and help conduct a study to determine how humans in that particular environment negotiate that particular space. What words do humans use most frequently? What kinds of gestures do they use? How do they construct meaning from a particular context?

Let’s return to our wearable example. We want our wearable ring to tell us the Yelp rating of ‘this’ cafe. I’m pointing to Cafe A, and it’s right next to Cafe B. The computer’s calculation should be based on some UX research that takes into account the most relevant contextual data. For example:

What distance am I from the cafe?

Where am I? On the sidewalk? On a bike? In a car? Etc.

Which side of the street am I on relative to the cafe?

How many people are with me?

Was I just talking about another cafe? If so, which one?
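The factors above could feed something like a referent-scoring function. Here's a hypothetical sketch: the features and weights are invented for illustration, not derived from any real UX study, which is exactly the research a linguist would need to do.

```python
# Rank candidate cafes for "this cafe" by combining contextual signals.
# All feature names and weights here are made up for illustration.

def referent_score(candidate, context):
    """Higher score = more likely referent of 'this cafe'."""
    score = 0.0
    # Pointing: a small angle between the gesture and the cafe wins big.
    score += 2.0 * (1.0 - candidate["pointing_angle_deg"] / 180.0)
    # Proximity: nearer cafes are (weakly) better candidates.
    score += 1.0 / (1.0 + candidate["distance_m"] / 10.0)
    # Same side of the street as the speaker is a mild boost.
    if candidate["same_side_of_street"]:
        score += 0.5
    # Anaphora: a cafe mentioned in the last turn is a strong cue.
    if candidate["name"] == context.get("last_mentioned_cafe"):
        score += 1.5
    return score

cafes = [
    {"name": "Cafe A", "pointing_angle_deg": 10, "distance_m": 20,
     "same_side_of_street": True},
    {"name": "Cafe B", "pointing_angle_deg": 40, "distance_m": 25,
     "same_side_of_street": True},
]
best = max(cafes, key=lambda c: referent_score(c, {"last_mentioned_cafe": None}))
print(best["name"])
```

Note what the sketch makes concrete: change a single contextual feature, say, which cafe came up in the last turn of conversation, and the most likely referent can flip, even though the pointing gesture hasn't moved.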

Without doubt, a given speaker’s use of ‘this’ or ‘that’ will vary based on the manipulation of any of the above factors. And even when all factors are controlled for, there will never be 100% usage of ‘this’ or ‘that’ in any context. After all, that’s how we language.

And this point is important, because it means that there’s plenty of UX research to be done, and results will vary based on the product, the environment, and the people involved.

These kinds of fine-grained studies and analyses are exactly what linguists excel at, and they will be critical to the next phase of wearables.

Speech Interfaces like Alexa are trying to change the way we use language. They won’t succeed.


Alexa, are you listening?

She is.

She’s listening so closely, in fact, that you can’t talk about her.

I was over at a good buddy’s house, and he has an Amazon Echo. I had the gall to ask: “How do you like Alex—“

“Shhh!” he interrupted me, darting his finger to his mouth.


“You can’t say her name. The other day, she mistakenly ordered me some chocolate because we were talking about her.”

“We just call her ‘the robot,’” his wife added. “I’m uncomfortable with robots getting more advanced than her. In fact, I don’t like assigning her a gender. It.”

There's a lot to unpack here. First and foremost, at least in my friend's house, Alexa is a presence. She's a family member—the family member that nobody wants to talk about.

Let me phrase this another way. Alexa has changed the language that my friends use at home. She's their exciting new toy, and they can't talk about her in their own house. This is a linguistic user experience problem.

The problem stems from the trigger word: Alexa.

It’s a problem on two levels. First, it prevents people from being able to talk about Alexa. Second, it’s extremely unnatural from a linguistic perspective.

The Alexa Elephant in the Room Problem

Users want to talk about their toys. Amazon wants users to talk about their toy. Amazon has given her a name, an identity. But that same identity is now a pain point. Nobody wants to tiptoe around a subject in their own home, let alone a presence, ESPECIALLY if it's something they're excited to talk about.

In essence, then, Amazon is shooting itself in the foot.

Is this problem hurting Amazon's sales right now? It sure doesn't sound like it. But as customers become more aware of the power of Alexa (and of other speech interfaces), their expectations will grow more sophisticated. They will be less and less willing, in other words, to tolerate having to avoid talking about something in their own home.

This segues nicely into the second problem.

The Name Problem

On the surface, using the name “Alexa” as a trigger word makes sense. After all, when I want someone’s attention, I use their name.

Consequently, designers on the Amazon UX team are doubling down on this idea, encouraging all developers who integrate Alexa into their apps to use "Alexa" as a trigger word and so create a consistent user experience across platforms. From a recent Wired article:

“That’s why Amazon is developing guidelines for third party developers. It already requires everyone to use the wake word ‘Alexa.’ It also encourages simple, explicit language in their commands.”

Developing a seamless user experience is, of course, a great idea. However, it comes at the cost of our natural linguistic experience. Beyond a few specific purposes, we simply don’t use people’s names very often. Think about it. How often do you really use the names of people around you?

Here’s a daily scenario. You’re sitting on the couch, watching a movie.

“Chuck, can you hand me the remote? Chuck, I can’t find anything to watch. Chuck, what do you suggest we watch? Chuck, can you grab me something while you’re in the kitchen? Chuck, is there any ice cream left?”


In reality, the above conversation plays out more like this:

“Remote? Nothing on. Me, too? Ice cream?”

We use context, expectations, routine, and even intonation when engaging with people. Names? Not so much.

So, the problem is that speech interfaces like Alexa are actually encouraging us to change our fundamental conversational habits. This trend comes in a long line of platforms that are trying to change the way we use language—just think about the ridiculous queries that you type into Google.

This battle, I predict, will not be won by machines.

Why? We are language experts. We love talking. Some people think that language is what separates humans from other animals. So, while we will certainly make some concessions to get a new app to work, we are unlikely to change the way we speak to accommodate it—at least not for long.

In other words, after a lifetime of becoming professionals at not using people's names when we talk to them, we are not likely to decide that we enjoy using names to talk to machines. It's unnatural. It's clunky. It's a bad user experience.

Moreover, this fact directly contradicts the goals of Amazon’s user experience team. From the same Wired article:

“‘Our core goal is to make Alexa’s interactions with a customer seamless and easy,’ says Brian Kralyevich, vice president of Amazon’s user experience design for digital products. ‘A customer shouldn’t have to learn a new language or style of speaking in order to interact with her. They should be able to speak naturally, as they would to a human, and she should be able to answer.’”

If this is the case, and Amazon does not want people to have to adopt a new style of speaking, they've got to drop the name as a trigger word.

A Possible Solution

A suggestion: drop the trigger word 'Alexa.' From a marketing perspective, the name is brilliant. Everyone knows who—sorry—what Alexa is. From a functional language perspective, however, using a name as a trigger word is a terrible idea, creating both the Alexa Elephant in the Room Problem and the Name Problem.

So, what should Amazon do instead?

People should choose their own trigger words. This solution is not dissimilar from creating an avatar when you start a video game, and it makes intuitive sense.

First, if people are already treating Alexa like a person in the home, letting them choose how to activate her gives them some affection for that person.

Second, this strategy would tap into people's limitless linguistic creativity. For example, people could coin trigger words like safe words. Pumpernickel. Wobblegones. This would at least solve the first problem.

Others might opt for discourse markers, like ‘dude’ or ‘yo.’ These might not solve both problems, but they’d be more natural.
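Mechanically, the proposal is simple. Here's a minimal sketch, assuming a setup step where the household registers its own trigger words (the class name, the matching rule, and the example words are all invented, and real wake-word detection runs on audio, not text):

```python
# Sketch: a device wakes only when an utterance begins with one of the
# household's own registered trigger words.

class TriggerWordMatcher:
    def __init__(self, trigger_words):
        # Store lowercase forms so matching ignores capitalization.
        self.triggers = {w.lower() for w in trigger_words}

    def should_wake(self, utterance):
        """Wake only if the utterance's first word is a registered trigger."""
        words = utterance.lower().split()
        return bool(words) and words[0].strip(",.!?") in self.triggers

matcher = TriggerWordMatcher(["pumpernickel", "yo"])
print(matcher.should_wake("Pumpernickel, what's the weather?"))  # wakes
print(matcher.should_wake("We just call her the robot."))        # stays asleep
```

The key property is that "Alexa" never appears in the trigger set, so the family can gossip about Alexa all they like without waking her.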

Most importantly, I think this approach would essentially crowdsource the problem. People would naturally arrive at a solution that works best for them, and maybe even for other people. After all, that's how language works. It's creative. It's adaptive. It's fun.

It seems pretty clear that we are about to be surrounded by speech interfaces. Let’s start this conversation about how to integrate them into our basic linguistic habits, not how to adapt our habits to them.