Leonardo – Social Cognition

Social Cognition Overview

Socially intelligent robots need to understand “people as people.” Whereas research with modern autonomous robots has largely focused on their ability to interact with inanimate objects whose behavior is governed by the laws of physics (objects to be manipulated, navigated around, etc.), socially intelligent robots must understand and interact with animate entities (e.g. people, animals, and other social robots) whose behavior is governed by having a mind and body. How might we endow robots with sophisticated social skills and social understanding of others?

Coupled minds in coupled bodies are a powerful force on human social intelligence and its development. Minds are in bodies with a particular morphological structure. A body’s momentary disposition in space reflects and projects to others the internal state of the system that generated those bodily gestures. Correlations emerging from coupled like bodies with like internal cognitive systems can create, through the body’s external behaviors, higher-order correlations that may lead to inferences about the internal states of self and other.

The key correlations are these: correlations between the appearance of the self and the appearance of others (e.g., hands to hands, feet to feet); correlations between the behavior of the self and the behavior of others (e.g., looking at an object); correlations between one’s own bodily behaviors and one’s internal states (e.g., looking left and remembering what was on the left, or maintaining the memory of a goal and looking in the direction of that goal); and correlations between the external states of others and one’s own internal states (e.g., where they look, where as a consequence one looks oneself, and thus what one sees and thinks about).

The dynamic, socially embedded coupling of two intelligent systems to each other, through a similar body with similar body parts capable of doing similar things in the world, seems the likely origin of the very idea of mind. Can one, through this kind of coupling, build an artificial device with “human-like intuitions” about the internal states of others? Is it possible for a robot to acquire an “empathic” understanding of people, to go beyond recognizing a happy facial expression per se to inferring the underlying valence in that expression? We are optimistic that the answers are “yes.”

One way robots might develop socially adept responses that seem to reflect beliefs about the internal states of others is by attempting to simulate, in their own cognitive systems, the behaviors of others. We have been developing a cognitive-affective architecture based on embodied cognition theories from psychology that emphasize the coupling of minds through bodies (supported by recent neuroscientific and brain imaging data) to give our robots, such as Leonardo, a variety of socio-cognitive skills and abilities. These include competencies such as shared attention, mirror-neuron-inspired mechanisms for recognition and generation of observable behavior, visual and mental perspective-taking abilities to support mind-reading skills, and simple models of emotional empathy.


L. Barsalou, C. Breazeal & L. Smith (2007) “Cognition as Coordinated non-cognition.” Cognitive Processing 8: 79-91.

L. Smith and C. Breazeal (2007) “The Dynamic Life of Developmental Process.” Developmental Science, 10(1), 61-68.

Shared Attention

To implement shared attention, the robot’s attentional state must be modeled with two related but distinct foci: the current attentional focus (what is being looked at right now) and the referential focus (the current topic of shared focus, i.e., what the communication, activity, etc. is about). Furthermore, the robot must not only have a model of its own attentional state, but it must also have a model of the attentional state of the human. Thus there are three foci of interest: the robot’s attentional focus, the human’s attentional focus, and the referential focus shared by the two.

To compute the robot’s attentional focus, Leonardo’s attentional system computes the level of saliency (a measure of “interest”) for objects and events in the robot’s perceivable space. The contributing factors to an object’s overall saliency fall into three categories: its perceptual properties (its proximity to the robot, its color, whether it is moving, etc.), the internal state of the robot (i.e., whether this is a familiar object, what the robot is currently searching for, and other goals), and social reference (whether something is pointed to, looked at, talked about, or is the referential focus). For each item in the perceivable space, the overall saliency at each time step is a weighted sum of these factors. The item with the highest saliency becomes the current attentional focus of the robot, and also determines where the robot’s gaze is directed. The gaze direction of the robot is an important communication device to the human, verifying for the human partner what the robot is attending to and thinking about.
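The weighted-sum saliency computation described above can be sketched roughly as follows. This is a minimal illustration, not Leonardo’s actual implementation: the `Item` fields, the assumption that each factor category has already been reduced to a single score in [0, 1], and the weight values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    perceptual: float  # assumed combined score for proximity, color, motion
    internal: float    # assumed score for familiarity, match to current goals
    social: float      # assumed score for being pointed to, looked at, etc.


# Hypothetical weights; the source does not state the values used.
WEIGHTS = {"perceptual": 0.3, "internal": 0.3, "social": 0.4}


def saliency(item, weights=WEIGHTS):
    """Overall saliency as a weighted sum of the three factor categories."""
    return (weights["perceptual"] * item.perceptual
            + weights["internal"] * item.internal
            + weights["social"] * item.social)


def attentional_focus(items, weights=WEIGHTS):
    """Return the most salient item, or None if nothing is perceivable."""
    return max(items, key=lambda it: saliency(it, weights), default=None)


items = [
    Item("ball", perceptual=0.9, internal=0.2, social=0.0),
    Item("button", perceptual=0.4, internal=0.3, social=1.0),
]
focus = attentional_focus(items)
```

In this toy scene the button wins despite being less perceptually striking, because the social-reference term dominates; recomputing at every time step would let the focus shift as these scores change.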

The human’s attentional focus is determined by what he or she is currently looking at. Leonardo calculates this using head pose tracking data, assuming that the person’s head orientation is a good estimate of their gaze direction. By following the person’s gaze direction, the shared attention system determines which object (if any) is the attentional focus of the human’s gaze.
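One simple way to turn a head pose into an attended object is to pick the object whose direction from the head deviates least from the head’s forward direction, within some tolerance cone. The sketch below takes that approach; the function names, the 3D tuple representation, and the 15-degree cone are assumptions for illustration, not details from the system described here.

```python
import math


def angle_between(u, v):
    """Angle in radians between two 3D vectors (pi if either is zero-length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return math.pi
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def human_attentional_focus(head_pos, head_dir, objects, max_angle_deg=15.0):
    """Return the name of the object closest to the gaze ray, or None.

    head_dir is the head's forward vector, used as a stand-in for gaze.
    max_angle_deg is a hypothetical tolerance cone around that vector.
    """
    best_name, best_angle = None, math.radians(max_angle_deg)
    for name, pos in objects.items():
        to_object = tuple(p - h for p, h in zip(pos, head_pos))
        angle = angle_between(head_dir, to_object)
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name


objects = {"ball": (2.0, 0.1, 0.0), "button": (0.0, 2.0, 0.0)}
looking_at = human_attentional_focus((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), objects)
```

Returning `None` when no object falls inside the cone matches the “(if any)” case in the text: the person may be looking at nothing the robot knows about.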

The mechanism by which infants track the referential focus of communication is still an open question, but a number of sources, such as word learning studies, indicate that looking time is a key factor. For example, when a child is playing with one object and hears an adult say “It’s a modi”, the child does not attach the label to the object the child happens to be looking at. Instead the child redirects his or her attention to look at what the adult is looking at, and attaches the label to that object.

To robustly track the referential focus, we use a simple voting mechanism to track a relative-looking-time for each of the objects in the robot’s and human’s shared environment. An object receives x votes for each time step that it is the attentional focus of either the human or the robot; it loses y votes for each time step that it is not the current focus; and it loses z votes when another object is the attentional focus of either the human or the robot (x, y, and z are determined empirically). The object with the highest accumulated relative-looking-time is identified as the referent of the communication between the human and the robot.
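The voting scheme can be sketched as a small accumulator. This is one possible reading of the rule, not the published implementation: here an object that is nobody’s focus loses y, and any object loses z whenever some other object is a focus of the human or the robot. The class name and the default values of x, y, and z are placeholders, since the source only says they are determined empirically.

```python
class ReferentialFocusTracker:
    """Accumulates relative-looking-time votes for each known object."""

    def __init__(self, objects, x=3.0, y=1.0, z=1.0):
        # x, y, z are hypothetical defaults, not values from the source.
        self.x, self.y, self.z = x, y, z
        self.votes = {name: 0.0 for name in objects}

    def step(self, robot_focus=None, human_focus=None):
        """Update votes for one time step given both attentional foci."""
        foci = {f for f in (robot_focus, human_focus) if f is not None}
        for name in self.votes:
            if name in foci:
                self.votes[name] += self.x   # attended by robot or human
            else:
                self.votes[name] -= self.y   # not currently attended
            if foci - {name}:
                self.votes[name] -= self.z   # a different object is attended

    def referent(self):
        """Object with the highest accumulated votes, or None if empty."""
        if not self.votes:
            return None
        return max(self.votes, key=self.votes.get)


tracker = ReferentialFocusTracker(["ball", "button"])
tracker.step(robot_focus="ball", human_focus="button")
tracker.step(robot_focus="button", human_focus="button")
```

Because votes accumulate over time, a brief glance away does not immediately change the referent, which is the robustness the looking-time approach is after.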

These two videos show Leonardo and a human sharing joint visual attention. In the top video, Leonardo has tracked the human’s head pose (this could also be a pointing gesture) to determine that this object is the human’s attentional focus. This in turn made the object more salient to the robot, and therefore the robot’s own attentional focus. Together, these cast that object as the referential focus as well. In the bottom video, Leonardo not only follows the gaze direction of the person, but also actively looks back to monitor the attention of the person.


Thomaz, A. L., Berlin, M. and Breazeal, C. (2005), “An Embodied Computational Model of Social Referencing.” Proceedings of Fourteenth IEEE Workshop on Robot and Human Interactive Communication (Ro-Man-05), Nashville, TN. 591-598.

Perspective Taking

For robots to cooperate with people in a human-like way, they must be able to infer the mental states of others (e.g., their thoughts, intents, beliefs, desires, etc.) from observable behavior (e.g., their gestures, facial expressions, speech, actions, etc.). In humans, this competence is referred to as a theory of mind (ToM), mindreading, mind perception, or social commonsense to name a few.

In humans, this ability is accomplished in part by each participant treating the other as a conspecific, viewing the other as being “like me”. Perceiving similarities between self and other is an important part of the ability to take the role or perspective of another, allowing people to relate to and to empathize with their social partners. This sort of perspective shift may help us to predict and explain others’ emotions, behaviors and other mental states such as beliefs and desires, and to formulate appropriate responses based on this understanding. For instance, it enables us to infer the intent or goal enacted by another’s behavior, an important skill for enabling richly cooperative behavior.

Simulation Theory (ST) is one of the dominant hypotheses about the nature of the cognitive mechanisms that underlie theory of mind. Simulation Theory posits that by simulating another person’s actions and the stimuli they are experiencing using our own behavioral and stimulus processing mechanisms, humans can make predictions about the behaviors and mental states of others based on the mental states and behaviors that we would possess in their situation. In short, by thinking “as if” we were the other person, we can use our own cognitive, behavioral, and motivational systems to understand what is going on in the heads of others.

From a design perspective, Simulation Theory is appealing because it suggests that instead of requiring a separate set of mechanisms for simulating other persons, we can make predictions about others by using our own cognitive mechanisms to recreate how we would think, feel, and act in their situation, thereby giving us some insight into their emotions, beliefs, desires, intentions, etc.

We have developed a cognitive-affective learning architecture that incorporates ST-based mechanisms to enable our robots to understand people in a similar way. Importantly, it is a strategy that naturally lends itself to representing the internal state of the robot and human in comparable terms. This facilitates our robot’s ability to compare its own internal state to that of the person it is interacting with in order to infer the human’s mental states (e.g., beliefs, intents) to better collaborate with people, and to learn from observing people’s behavior and demonstrations. Such theories could provide a foothold for ultimately endowing machines with human-style social skills, learning abilities, and social understanding.

We have shown how Leonardo’s perspective taking skill enables the robot to learn from ambiguous human demonstrations, infer the beliefs of others even when they diverge from the robot’s own beliefs (false beliefs — see “Leo False Belief Task”), and infer the diverging beliefs and goals of different people to provide each with appropriate assistance in a collaborative task (see “Leo False Belief with Goal Inference”).

Leo False Belief Task
Our benchmark tasks are variants of the classic false belief task (Wimmer & Perner, 1983) from developmental psychology, used to assess the development of a child’s mindreading abilities. In the classic task, subjects are told a story with pictorial aids or puppets that typically proceeds as follows: two children, Sally and Anne, are playing together in a room. Sally places a toy in one of two containers. Sally then leaves the room, and while she is gone, Anne moves the toy into the other container. Sally returns, and the subject is asked: where will Sally look for the toy? This test probes the child’s ability to realize that people may hold different beliefs about the same situation, and may hold beliefs that are different from the child’s. Normally developing children begin to successfully perform this task around four years of age.

The top video shows Leonardo using our perspective taking architecture to perform the false belief task.
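The bookkeeping behind the Sally and Anne scenario can be illustrated with a small belief tracker that applies the same update rule to every agent, but only for events that agent was present to perceive. This is a simplified sketch in the spirit of the simulation-theoretic idea, not Leonardo’s perspective taking architecture; the class, its method names, and the presence-based notion of perception are all assumptions.

```python
class BeliefTracker:
    """Tracks the true world state and each agent's possibly stale beliefs."""

    def __init__(self, agents):
        self.world = {}                       # object -> true location
        self.present = set(agents)            # agents who can see events
        self.beliefs = {a: {} for a in agents}

    def leave(self, agent):
        self.present.discard(agent)

    def enter(self, agent):
        # Re-entering does not update beliefs: the containers are closed.
        self.present.add(agent)

    def move(self, obj, location):
        """Move an object; only agents who are present update their belief."""
        self.world[obj] = location
        for agent in self.present:
            self.beliefs[agent][obj] = location

    def predicted_search(self, agent, obj):
        """Where the agent is expected to look, based on their own belief."""
        return self.beliefs[agent].get(obj)

    def has_false_belief(self, agent, obj):
        believed = self.beliefs[agent].get(obj)
        return believed is not None and believed != self.world.get(obj)


tracker = BeliefTracker(["Sally", "Anne"])
tracker.move("toy", "basket")   # Sally places the toy; both see it
tracker.leave("Sally")
tracker.move("toy", "box")      # Anne moves the toy while Sally is away
tracker.enter("Sally")
sally_looks = tracker.predicted_search("Sally", "toy")
```

Passing the task amounts to answering from Sally’s belief table rather than from the world state, which is exactly the distinction the test probes.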

Leo False Belief with Goal Inference
This task examines goal inference with multiple people and false beliefs within a collaborative task setting. In this task, Leonardo is introduced to two collaborative partners, Matt and Jesse. In order to successfully assist both people, Leonardo must keep track of Matt’s (black shirt) false beliefs about the object locations as well as Jesse’s (red shirt) correct beliefs about these locations. The bottom video explains the implementation and demonstrates this task.


Gray, J., Breazeal, C., Berlin, M., Brooks, A. and Lieberman, J. (2005), “Action Parsing and Goal Inference using Self as Simulator.” Proceedings of Fourteenth IEEE Workshop on Robot and Human Interactive Communication (Ro-Man05), Nashville, TN. 202-209.

C. Breazeal, J. Gray, M. Berlin (2007). “Mindreading as a foundational skill for socially intelligent robots.” In Proceedings of the 2007 International Symposium on Robotics Research (ISRR-07). Hiroshima, Japan.


By using a simulation theory inspired mechanism, our robot learns to decode emotional messages conveyed through facial expressions by leveraging its early facial imitation capability to bootstrap a primitive form of emotional empathy.

We are inspired by various experiments with humans that have shown a dual affect-body connection whereby posing one’s face into a specific emotive facial expression actually elicits the feeling associated with that emotion. Hence, imitating the facial expressions of others could cause a person to feel what the other is feeling. This same dual affect-body pathway coupled with early facial imitation could allow human infants to learn the association of observed emotive expressions of others with their own internal affective states. Other time-locked multi-modal cues may facilitate learning this mapping, such as affective speech that accompanies emotive facial expressions during social encounters between caregivers and infants.

In a similar way, a robot could learn the affective meaning of emotive expressions signaled through another person’s facial expressions and body language. We have based our robot’s affective system on computational models of infant-inspired emotions. In humans, emotions are centrally involved in appraising environmental and internal events that are significant to the needs and goals of a creature. The robotic implementation includes a simple appraisal process based on Damasio’s theory of somatic markers that tags the robot’s incoming perceptual and internal states with affective information, such as valence (positive or negative), arousal (high or low), and whether or not something is novel.

For the robot, certain kinds of stimuli, such as pleasing or soothing tones of speech, have hardwired affective appraisals with respect to arousal and valence. This computational model is based on the developmental findings of Fernald (1989), which showed that certain prosodic contours are indicative of different affective intents in infant-directed speech. We found that even simple acoustic features, such as pitch mean and energy variance, can be used by the robot to classify the affective prosody of an utterance along the valence and arousal dimensions.
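To make the feature idea concrete, the sketch below computes pitch mean and energy variance from per-frame measurements and assigns a (valence, arousal) label by nearest centroid. The nearest-centroid classifier is a stand-in chosen for brevity, not the classifier used on the robot, and the centroid values, scale factors, and sample frames are invented placeholders rather than measured data.

```python
import math
from statistics import fmean, pvariance


def prosody_features(pitch_hz, frame_energy):
    """Return (pitch mean over voiced frames, variance of frame energy).

    Frames with pitch 0 are assumed to be unvoiced and are skipped.
    """
    voiced = [p for p in pitch_hz if p > 0]
    pitch_mean = fmean(voiced) if voiced else 0.0
    energy_var = pvariance(frame_energy) if frame_energy else 0.0
    return (pitch_mean, energy_var)


def classify(features, centroids, scales=(1.0, 1.0)):
    """Pick the label whose centroid is nearest in scaled feature space."""
    def distance(centroid):
        return math.sqrt(sum(((f - c) / s) ** 2
                             for f, c, s in zip(features, centroid, scales)))
    return min(centroids, key=lambda label: distance(centroids[label]))


# Placeholder centroids keyed by (valence, arousal); not measured values.
centroids = {
    ("positive", "high"): (320.0, 3.0),
    ("negative", "low"): (150.0, 0.5),
}
features = prosody_features([0, 300, 340, 0, 320], [1.0, 3.0, 5.0])
label = classify(features, centroids, scales=(100.0, 1.0))
```

In practice the centroids would be estimated from labeled utterances, and the scale factors keep the pitch feature (in Hz) from swamping the much smaller energy variance.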

The tasks that couple these heterogeneous processes, and in so doing drive developmental change, are the face-to-face interactions (and imitations) described in the previous section. Via dual body-affect pathways, when the robot imitates the emotive facial expressions of others, it evokes the corresponding affective state (in terms of arousal and valence variables) that would ordinarily give rise to the same expression during an emotive response. This is reinforced by affective information coming from the person’s speech signal. These time-locked multi-modal states occur because of the similarity in bodies and body-affect mappings, and they enable the robot to learn to associate its internal affective state with the corresponding observed expression. Thus, through this “empathic” or direct experiential approach to social understanding, the robot uses its own cognitive and affective mechanisms as a simulator for inferring the human’s affective state as conveyed through behavior.
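The association step can be pictured as a running average: each time the robot imitates an observed expression, it records the valence and arousal evoked in itself, and later uses the stored average to infer what a person showing that expression is likely feeling. The sketch below is only a schematic of that idea; the class name, the use of discrete expression labels, and the numeric ranges are assumptions, not details of the robot’s learning mechanism.

```python
class ExpressionAffectMap:
    """Learns a mean (valence, arousal) for each observed expression label."""

    def __init__(self):
        self._sums = {}    # expression -> [valence sum, arousal sum]
        self._counts = {}  # expression -> number of paired observations

    def observe(self, expression, valence, arousal):
        """Record the robot's own affect while imitating an expression."""
        sums = self._sums.setdefault(expression, [0.0, 0.0])
        sums[0] += valence
        sums[1] += arousal
        self._counts[expression] = self._counts.get(expression, 0) + 1

    def infer(self, expression):
        """Estimate the other's affect from the learned association."""
        count = self._counts.get(expression, 0)
        if count == 0:
            return None
        valence_sum, arousal_sum = self._sums[expression]
        return (valence_sum / count, arousal_sum / count)


affect_map = ExpressionAffectMap()
affect_map.observe("smile", valence=0.8, arousal=0.6)
affect_map.observe("smile", valence=0.6, arousal=0.4)
estimate = affect_map.infer("smile")
```

An expression that has never been imitated yields `None`, reflecting that the robot has no experiential basis yet for interpreting it.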

This is one example of our approach of endowing our robot with social understanding – and how it arises from heterogeneous processes that are time-locked in a shared task with bodily and mentally like others.


L. Smith and C. Breazeal (2007). “The Dynamic Life of Developmental Process.” Developmental Science, 10(1), 61-68.

Deictic Object Reference

Robust joint visual attention is necessary for achieving a common frame of reference between humans and robots using multi-modal cues while working together on real-world object-based spatial tasks. In this work, we make a comprehensive examination of one component of this process that is often otherwise implemented in an ad hoc fashion: the ability to correctly determine the object referent from deictic reference including pointing gestures and speech. We develop a modular spatial reasoning framework based around decomposition and re-synthesis of speech and gesture into a language of pointing and object labeling that supports multi-modal and uni-modal access in both real-world and mixed-reality workspaces, accounts for the need to discriminate and sequence identical and proximate objects, assists in overcoming inherent precision limitations in deictic gesture, and assists in the extraction of those gestures. We have implemented our approach on two humanoid robot platforms to date: Leonardo and NASA JSC’s Robonaut.


Brooks, A. G. and Breazeal, C. (2006). “Working with Robots and Objects: Revisiting Deictic Reference for Achieving Spatial Common Ground.” In Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction (Salt Lake City, Utah, USA, March 02 – 03, 2006). HRI ’06. ACM Press, New York, NY, 297-304.