One morning recently, I was writing to a friend in academia who was teasing me about his smart speaker not responding the way he’d like. Going off on a tangent, I suggested that it might not be long before people would be taking his classes to get a good background for getting into the virtual agent design game, since that’s how I had done it, as had many of my colleagues.
“We have to take seriously that there’s a theory of mind entailed in how these systems respond and are built, especially if your kid’s growing up in a world where talking to Google Assistant or Siri is as expected as pressing a button on a remote control,” or something like that.
As I wrote it, though, I realized I hadn’t said “theory of mind” very much at all outside of a classroom, even though it informs my work in a primary way. These days, a good many of the folks designing experiences for speech don’t have a technical background in speech technology, linguistics, cognitive science, or another discipline where this concept is foundational to the kind of work you do every day.
Basically, the major difference between a speech interface and any other digital interface you click, or tap, or select with a remote or pointer is a theory of mind.
So what the hell is that?
Basically, when you and I talk to each other, part of the success of our communication and its general ease and facilitation is based on our human self-awareness. I’m aware of my own thinking, and my words, and generally my own consciousness. I also know that you have the same awareness.
That self-consciousness, and knowledge of my own mental processes and states, is a theory of mind. Because I identify you as a human and a fellow mind-haver, I know you have the same knowledge and awareness.
That seems simple, but compare it to the simplest computer program you can think of and you’ll quickly discover that even the concept of “knowing” is so complex and abstract there’s not really any correlate. And yet it enables virtually all of the richness, detail, and ability to easily move back and forth in time in any discussion, not only over the duration of the discussion but over the entirety of your relationship with the person you’re talking to.
Reference is a function of shared information, and the information being shared is implicit because of the theory of mind. It also allows you to do super cool things like not assume that the exact pitch or timbre or duration or power of any given utterance is meaning-bearing. Except, of course, when we’re doing cool stuff to do just that, like being playful by mimicking someone, or displaying emotion or need, and so on.
So, even more basically:
I know.
You know.
I know I know.
I know you know.
You know you know.
You know I know.
I know you know I know.
You know I know you know.
And so on.
Not everyone agrees with Cliff Nass, but I find it difficult to get away from the general idea much of his work is based on, which is this: because we’ve basically never talked to a thing that has language but doesn’t have a mind, we can’t anticipate responses that are flawed because they don’t meet the expectations a theory of mind entails.
In other words, when we talk to robots that say or do something that does not follow directly from what happened previously, or contextually, the interaction will fail. When we don’t understand and have no idea what to say next, or the robot doesn’t understand and we don’t understand why the robot doesn’t understand, it’s probably because the robot wasn’t designed to account for theory of mind.
For example, imagine a bank makes an agent that says “Go ahead, ask me anything,” and you say, “How’s the weather up there in the cloud?” If the agent replies, “Sorry I didn’t understand,” you may conclude it doesn’t answer general questions. OR you may assume it didn’t hear you, so you should repeat yourself. OR you may guess it did hear you, but banks and robots have crappy senses of humor, so you should dial it back to something plainer, like “what’s the weather?” or “Can you tell me a joke?” OR you may conclude that it’s an extension of the bank’s digital presence and you should only ask things about the bank and your account.
Each of these responses represents an extension of a theory of mind to the agent. But “Sorry I didn’t understand” gives you no way to know which, and that’s very different from interacting with a human, where usually if you don’t understand, there’s an obvious paring down of possibilities. On the other hand, if the agent’s response was, “Sorry, I can only answer questions about your account and the bank itself. So go ahead, ask me anything,” you can reduce the theory of mind and corresponding mental model of the interaction to something finite and straightforward, even if it’s still relatively abstract.
When the agent says that, it’s kind of saying, “I know you know I know X” and “You know I know X,” which establishes a point of reference a human can deal with.
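To make that concrete, here’s a minimal sketch of the two fallback strategies in plain Python, not tied to any real voice platform or bank. The names (KNOWN_INTENTS, SCOPED_FALLBACK, respond) are invented for illustration; the only point is that the scoped fallback hands the user the “I know you know I know X” anchor, while a generic “Sorry I didn’t understand” doesn’t.

```python
from typing import Optional

# Hypothetical canned responses for the handful of intents this toy agent knows.
KNOWN_INTENTS = {
    "branch_hours": "Branches are open 9 to 5, Monday through Friday.",
    "check_balance": "Your checking balance is [balance].",
}

# The scoped fallback tells the user what the agent "knows it knows,"
# which is the point of reference discussed above.
SCOPED_FALLBACK = (
    "Sorry, I can only answer questions about your account and the bank itself. "
    "So go ahead, ask me anything."
)

def respond(intent: Optional[str]) -> str:
    """Return the canned response for a recognized intent, or a scoped fallback."""
    if intent in KNOWN_INTENTS:
        return KNOWN_INTENTS[intent]
    # An unscoped fallback ("Sorry I didn't understand.") would leave the user
    # guessing which theory of mind applies; the scoped one bounds it.
    return SCOPED_FALLBACK

print(respond(None))  # unrecognized utterance -> the scoped fallback
```

The design choice is entirely in that last return: the fallback is where the agent either declares its boundaries or leaves the user to invent them.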
I don’t have a really good conclusion for this post, and it’s already a bit too long, so I’ll just type a few more representations of how minds consider other minds, because it’s funny:
I know you know I know you know I know.
You know I know you know I know you know.
I know you know I know you know I know you know.
You know I know you know I know you know I know you know.