Wearables, Speech Recognition, & UX Research: Deixis a Problem


How many computers are on your body right now?

Your cell phone’s in your hand or back pocket. One. Do you have a smartwatch or Fitbit? That’s two. Before long, your clothes will be computers. Your rings might be computers. Even your piercings could be.

And these computers will need to do more than just tell us the weather. For them to be worth their salt, they’ll need to register our physical gestures in concert with our speech.

Say I’m downtown. I point to a cafe. “What’s the Yelp rating for this cafe?”

Simple question. Massively complex calculation.

Gesture + Speech

Wearables that register relatively complex physical gestures are a reality. My Force Band from Sphero, for example, allows me to control BB-8 with an outstretched hand like Luke Frigging Skywalker. So yes. I’m on board.

But what about integrating speech recognition into the mix? This has some huge problems. The biggest one: deixis.

Deixis is a big deal. Deixis refers to words and expressions whose referents depend on the context in which they are used. Think about the word ‘today,’ for example. Out of context, ‘today’ means nothing. It makes reference to the present only within the present. In other words, tomorrow, ‘today’ will no longer be the same ‘today.’
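To make the ‘today’ point concrete, here is a minimal Python sketch. The helper name resolve_time_word and the little offset table are my own illustrations, not part of any real speech system. The only thing it shows is that the word cannot be resolved without knowing when it was said.

```python
from datetime import date, timedelta

# Hypothetical lookup table: how many days each deictic time word is
# offset from the moment of speaking.
DEICTIC_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}

def resolve_time_word(word: str, utterance_date: date) -> date:
    """Map a deictic time word to a calendar date, given when it was spoken."""
    return utterance_date + timedelta(days=DEICTIC_OFFSETS[word])

# The same word, spoken on two different (arbitrarily chosen) days,
# picks out two different dates.
monday = date(2017, 5, 1)
tuesday = date(2017, 5, 2)
print(resolve_time_word("today", monday))
print(resolve_time_word("today", tuesday))
```

Notice that the function needs two arguments. The word alone is not enough, and that is the whole problem in miniature.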

So, how do you teach a machine to disambiguate between different referents in context?

The short answer: We’ve got a long way to go. But one robot is taking on the job.

Iorek at Brown University is able to disambiguate between two simple objects in a particular context. So, if a researcher asks it to hand her ‘that bowl,’ Iorek can track her gaze and then confirm. “This bowl?”

That’s pretty amazing, but it also gives you a sense of how early we are in solving this problem with regard to human/computer interaction.

There are several reasons for this. We regularly switch between using ‘this’ and ‘that’ to refer to physical objects in our environment and using them to refer to objects or topics in the conversation itself (also called anaphora).

For example:

Speaker 1: “Trump left the clean dishes in the dishwasher again.”

Speaker 2: “This is exactly what I’m talking about. Something needs to be done.”


Speaker 1: “Trump is threatening nuclear war again.”

Speaker 2: “This is exactly what I’m talking about. Something needs to be done.”

Notice that the ‘this’ in these cases can refer either to a physical thing (dishes) or a topic (a threat of nuclear war).

Matters are further complicated by what English as a Foreign Language learners are taught: that ‘this’ and ‘that’ indicate proximity to the speaker. That is, ‘this’ is the word you use for something close to you, and ‘that’ is the word for something further away. The truth is not so simple.

A rudimentary analysis shows that English speakers do not obey this ‘rule’ very closely. ‘This’ does not necessarily indicate proximity to the speaker, and ‘that’ does not necessarily indicate distance. First, the speaker may choose one word or the other to indicate ‘social distance’ from an object. “That bowl? That bowl isn’t mine.” Second, speakers excel at shifting what’s called the ‘deictic center,’ or the primary reference point, in conversation. In other words, ‘this’ could be proximal to the speaker, the hearer, or something else entirely, and speakers and hearers regularly shift the deictic center, even with time expressions (like the ‘today’ example above).

Linguists to the Rescue

Linguists are trained to help with this problem.

But, first off, designers, roboticists, and computer scientists need to be aware that this is a problem. A Google search of ‘this’ vs. ‘that’ provides the proximal/distal story outlined above. If someone goes to this resource first, they’ll start the entire project on the wrong foot.

Second, depending on the product being developed, a linguist would excel at developing a functional set of the possible referents that the product needs to take into account. Take our disambiguating robot. A linguist could help develop a set of referents based on the robot’s environment and help conduct a study to determine how humans in that particular environment negotiate that particular space. What words do humans use most frequently? What kinds of gestures do they use? How do they construct meaning from a particular context?

Let’s return to our wearable example. We want our wearable ring to tell us the Yelp rating of ‘this’ cafe. I’m pointing to Cafe A, and it’s right next to Cafe B. The computer’s calculation should be based on some UX research that takes into account the most relevant contextual data. For example:

What distance am I from the cafe?

Where am I? On the sidewalk? On a bike? In a car? Etc.

Which side of the street am I on relative to the cafe?

How many people are with me?

Was I just talking about another cafe? If so, which one?
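Here is a rough Python sketch of how a few of these questions could feed a single guess. Everything in it is an assumption made for illustration: the Cafe class, the flat coordinates centered on the speaker, and above all the weights, which are invented and would need to come from the UX research itself. It simply scores each nearby cafe on how closely my pointing arm lines up with it, how far away it is, and whether I was just talking about it.

```python
import math
from dataclasses import dataclass

@dataclass
class Cafe:
    name: str
    x: float  # meters east of the speaker (hypothetical flat map)
    y: float  # meters north of the speaker

def bearing_deg(x: float, y: float) -> float:
    """Compass bearing from the speaker: 0 = north, 90 = east."""
    return math.degrees(math.atan2(x, y)) % 360

def angular_gap(a: float, b: float) -> float:
    """Smallest angle in degrees between two compass bearings."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def score(cafe, pointing_bearing, last_mentioned=None,
          w_angle=1.0, w_dist=0.1, mention_bonus=5.0):
    """Higher is better. The weights are placeholders, not research findings."""
    gap = angular_gap(bearing_deg(cafe.x, cafe.y), pointing_bearing)
    dist = math.hypot(cafe.x, cafe.y)
    s = -w_angle * gap - w_dist * dist
    if cafe.name == last_mentioned:
        s += mention_bonus  # a cafe already under discussion is more salient
    return s

def resolve_this(cafes, pointing_bearing, last_mentioned=None):
    """Pick the cafe that 'this cafe' most plausibly refers to."""
    return max(cafes, key=lambda c: score(c, pointing_bearing, last_mentioned))

cafes = [Cafe("Cafe A", 10, 20), Cafe("Cafe B", 20, 20)]
print(resolve_this(cafes, 25).name)                           # clear point
print(resolve_this(cafes, 36).name)                           # sloppy point
print(resolve_this(cafes, 36, last_mentioned="Cafe B").name)  # sloppy point, B just discussed
```

With a clear point, the gesture decides. With a sloppy point that lands between the two cafes, the conversation I was just having can tip the answer the other way. How much weight each factor deserves is exactly the kind of thing a study would have to establish.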

Without doubt, a given speaker’s use of ‘this’ or ‘that’ will vary as any of the above factors change. And even when all factors are controlled for, no context will produce ‘this’ or ‘that’ 100% of the time. After all, that’s how we language.

And this point is important, because it means that there’s plenty of UX research to be done, and results will vary based on the product, the environment, and the people involved.

These kinds of fine-grained studies and analyses are exactly what linguists excel at, and they will be critical to the next phase of wearables.