Wearables, Speech Recognition, & UX Research: Deixis Is a Problem


How many computers are on your body right now?

Your cell phone’s in your hand or back pocket. One. Do you have a smartwatch or Fitbit? That’s two. Before long, your clothes will be computers. Your rings might be computers. Even your piercings could be.

And these computers will need to do more than just tell us the weather. For them to be worth their salt, they’ll need to register our physical gestures in concert with our speech.

Say I’m downtown. I point to a cafe. “What’s the Yelp rating for this cafe?”

Simple question. Massively complex calculation.

Gesture + Speech

Wearables that register relatively complex physical gestures are a reality. My Force Band from Sphero, for example, allows me to control BB-8 with an outstretched hand like Luke Frigging Skywalker. So yes. I’m on board.

But what about integrating speech recognition into the mix? This has some huge problems. The biggest one: deixis.

Deixis is a big deal. It’s our ability to refer to objects and events whose meaning depends on context. Think about the word ‘today,’ for example. Out of context, ‘today’ means nothing. It makes reference to the present only within the present. In other words, tomorrow, ‘today’ will no longer be the same ‘today.’

So, how do you teach a machine to disambiguate between different referents in context?

The short answer: We’ve got a long way to go. But one robot is taking on the job.

Iorek, a robot at Brown University, can disambiguate between two simple objects in a particular context. So, if a researcher asks it to hand her ‘that bowl,’ Iorek can track her gaze and then confirm: “This bowl?”
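To make the problem concrete, here’s a minimal sketch of gaze-based referent disambiguation. Everything here is hypothetical (the function, the angle-based scoring, the threshold); it’s not how Iorek actually works, just an illustration of the logic: rank candidates by how close they sit to the gaze line, and ask for confirmation when two candidates are nearly tied.

```python
def pick_referent(gaze_angle_deg, candidates, confirm_threshold_deg=15.0):
    """Rank candidate objects by angular distance from the speaker's gaze.

    gaze_angle_deg: estimated direction of the speaker's gaze.
    candidates: list of (name, bearing_deg) pairs from the robot's viewpoint.
    Returns the best candidate and whether a spoken confirmation is needed
    (i.e., a second candidate lies almost as close to the gaze line).
    All thresholds are made up for illustration.
    """
    def angular_distance(bearing_deg):
        # Smallest absolute difference between two compass angles.
        return abs((bearing_deg - gaze_angle_deg + 180) % 360 - 180)

    ranked = sorted(candidates, key=lambda c: angular_distance(c[1]))
    best = ranked[0]
    # If the runner-up is nearly as plausible, ask "This bowl?" before acting.
    needs_confirmation = (
        len(ranked) > 1
        and angular_distance(ranked[1][1]) - angular_distance(best[1])
        < confirm_threshold_deg
    )
    return best, needs_confirmation

# Two bowls on a table; the speaker gazes at roughly 10 degrees.
best, confirm = pick_referent(10.0, [("red bowl", 8.0), ("blue bowl", 40.0)])
```

Here the red bowl wins outright, so no confirmation is needed; move the blue bowl closer to the gaze line and the function starts asking.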

That’s pretty amazing, but it also gives you a sense of how early we are in solving this problem with regard to human/computer interaction.

There are several reasons for this. We regularly shift between using ‘this’ and ‘that’ to refer to physical objects in our environment and to objects or topics in the conversation itself (the latter is called anaphora).

For example:

Speaker 1: “Trump left the clean dishes in the dishwasher again.”

Speaker 2: “This is exactly what I’m talking about. Something needs to be done.”


Speaker 1: “Trump is threatening nuclear war again.”

Speaker 2: “This is exactly what I’m talking about. Something needs to be done.”

Notice that the ‘this’ in these cases can refer either to a physical thing (dishes) or a topic (a threat of nuclear war).

Matters are further complicated by the fact that the rule taught to learners of English as a foreign language, that ‘this’ indicates something close to the speaker and ‘that’ indicates something farther away, is not so simple in practice.

A rudimentary analysis shows that English speakers do not obey this ‘rule’ very closely. ‘This’ does not necessarily indicate proximity to the speaker, and ‘that’ does not necessarily indicate distance. First, the speaker may choose one word or the other to indicate ‘social distance’ from an object. “That bowl? That bowl isn’t mine.” Second, speakers excel at shifting what’s called the ‘deictic center,’ or the primary reference point, in conversation. In other words, ‘this’ could be proximal to the speaker, the hearer, or something else entirely, and speakers and hearers regularly shift the deictic center, even with time expressions (like the ‘today’ example above).

Linguists to the Rescue

Linguists are trained to help with this problem.

But, first off, designers, roboticists, and computer scientists need to be aware that this is a problem. A Google search of ‘this’ vs. ‘that’ provides the proximal/distal story outlined above. If someone goes to that resource first, they’ll start the entire project on the wrong foot.

Second, depending on the product being developed, a linguist would excel at developing a functional set of the possible referents that the product needs to take into account. Take our disambiguating robot. A linguist could help develop a set of referents based on the robot’s environment and help conduct a study to determine how humans in that particular environment negotiate that particular space. What words do humans use most frequently? What kinds of gestures do they use? How do they construct meaning from a particular context?

Let’s return to our wearable example. We want our wearable ring to tell us the Yelp rating of ‘this’ cafe. I’m pointing to Cafe A, and it’s right next to Cafe B. The computer’s calculation should be based on some UX research that takes into account the most relevant contextual data. For example:

What distance am I from the cafe?

Where am I? On the sidewalk? On a bike? In a car? Etc.

Which side of the street am I on relative to the cafe?

How many people are with me?

Was I just talking about another cafe? If so, which one?
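One way to picture how those contextual factors could feed the computer’s calculation is a simple weighted score per candidate cafe. The feature names and weights below are entirely made up; in practice, they are exactly what the UX research described above would have to determine.

```python
# Hypothetical feature weights; a real system would derive these from
# UX studies of how people actually use 'this' in context.
WEIGHTS = {
    "pointed_at": 3.0,          # a pointing gesture beats everything else
    "proximity": 1.5,           # nearer cafes are likelier referents
    "same_side_of_street": 1.0,
    "recently_mentioned": 2.0,  # anaphora: 'this' can mean the cafe we just discussed
}

def score(features):
    """Weighted sum of contextual cues for one candidate referent.

    `features` maps each feature name to a value in [0, 1].
    """
    return sum(WEIGHTS[f] * features.get(f, 0.0) for f in WEIGHTS)

def resolve_this_cafe(candidates):
    """Return the (name, features) candidate with the highest score."""
    return max(candidates, key=lambda c: score(c[1]))

# I'm pointing at Cafe A, but Cafe B came up in conversation a minute ago.
cafe_a = {"pointed_at": 1.0, "proximity": 0.8, "same_side_of_street": 1.0}
cafe_b = {"proximity": 0.9, "same_side_of_street": 1.0, "recently_mentioned": 1.0}
best = resolve_this_cafe([("Cafe A", cafe_a), ("Cafe B", cafe_b)])
```

With these (invented) weights, the pointing gesture wins out over the recent mention, which matches the intuition that gesture anchors deixis more strongly than conversation history. Change the weights and the answer flips, and that is precisely why the empirical studies matter.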

Without doubt, a given speaker’s use of ‘this’ or ‘that’ will vary based on the manipulation of any of the above factors. And even when all factors are controlled for, there will never be 100% usage of ‘this’ or ‘that’ in any context. After all, that’s how we language.

And this point is important, because it means that there’s plenty of UX research to be done, and results will vary based on the product, the environment, and the people involved.

These kinds of fine-grained studies and analyses are exactly what linguists excel at, and they will be critical to the next phase of wearables.


Gesture and User Experience

My company, Inherent Games, has published the first-ever language-learning app to incorporate gesture into its lessons. Here’s a little of what we’ve learned.

Gestures + Technology = Difficult.

We naturally gesture when we talk. It’s a huge part of how we convey meaning to people. But no one taught us gestures. We watched other people. We make our own in the moment. Oftentimes, we’re not even aware that we’re gesturing (unless you’re running for political office, where you spend the better part of the day practicing how to press your thumb on your fist to emphasize words).

Consequently, it’s a little strange for a device to tell you how to perform a particular gesture in order to trigger a function.

Our goal was to teach people Spanish verbs by having them perform those verbs with their devices, because research shows that learners retain words better when they perform a gesture while learning them. Using the device’s built-in functionality, like the gyroscope and accelerometer, users could hold the device like a steering wheel and ‘drive’ when learning the Spanish word for drive, conducir.
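As a rough illustration of what recognizing that ‘steering’ gesture involves, here is a hedged sketch. It assumes we’ve already sampled the device’s roll angle (derivable from the gyroscope/accelerometer) while the prompt is on screen; the function, thresholds, and sample values are all invented for this example, not taken from the shipped app.

```python
def looks_like_steering(roll_samples_deg, min_swing_deg=20.0):
    """Crude check for a 'steering wheel' gesture from device roll angles.

    roll_samples_deg: roll readings (degrees) sampled while the prompt is up.
    Returns True if the user rotated the device back and forth enough,
    the way you'd turn a wheel. Thresholds are illustrative only.
    """
    if not roll_samples_deg:
        return False
    # Total range of rotation must be big enough to count as a turn.
    swing = max(roll_samples_deg) - min(roll_samples_deg)
    # Require the roll to cross back over its starting point at least once,
    # so a single tilt-and-hold doesn't count as steering.
    start = roll_samples_deg[0]
    crossings = sum(
        1
        for a, b in zip(roll_samples_deg, roll_samples_deg[1:])
        if (a - start) * (b - start) < 0
    )
    return swing >= min_swing_deg and crossings >= 1

# A back-and-forth roll, like turning a wheel left and then right.
steering = looks_like_steering([0, 15, 30, 10, -20, -5])
```

Even this toy version shows the design tension: the oscillating sequence passes, but a single tilt held in place does not, so the prompt has to get users to move their whole hand through space rather than just reposition the phone once.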

But here’s the rub: Prompting the user to engage in full-device motion requires you to convince the user to think of the world outside the device. This isn’t easy.

Contrast full-device motion with a screen swipe. Showing people how to swipe onscreen is easy: provide a prompt with an arrow or a hand icon and show the result of performing the swipe.

However, in order to get people to perform actual hand gestures, not just swipes, we needed them to think of the world outside of their phone. As it turns out, people don’t intuitively do this.

The Process

All our prototypes were tested in a Denver area high school.

Prototype 1 had a prompt: “Gesture like you’re driving!” It seems obvious now, but this version crashed and burned. Badly. No one knew what we wanted.

We changed it.

Prototype 2 showed a picture of the device—and the device alone—and the desired motion. Again, users were baffled.

We changed it again.

Prototype 3 showed an icon of a user moving the device, coupled with a prompt: “Move your device like the picture shows!” Users were hesitant, but they finally got it. “Oh! I move it like that!” They then performed the gesture and got the payoff (a character named Jumbo Nano clears the screen of bad guys).


[Small note: I still think the name Jumbo Nano is hilarious.]

Why is this Important?

I happen to believe that full-body gesture is the next critical step in user interfaces. But there are two big problems:

1. Developing technology that registers gestures.

2. Convincing users that they want to perform gestures.

Lots of people are working on Problem 1, so I’ll leave them to that.

Problem 2, however, is a domain for linguists and cognitive scientists (as others, like Microsoft, are beginning to point out).

My doctoral research shows that English speakers are particularly adept at projecting their ego onto an external locus (like a man in a picture). This process is called deictic projection. I think this is exactly what’s happening with devices. People project their egos onto the device and ignore their bodies beyond the device. In order to really get people to meaningfully interact with devices that register full-body gesture, we need to remind them that their bodies can be the center of the action.

In other words, the device can’t be the center of the action; it has to be a conversant. After all, we’re used to dealing with conversants, and we naturally gesture with them.

In other words, it’s robot time.

Microsoft thinks so, too.