Learning Faces
We have developed a real-time face recognition system for Leonardo that can be trained on the fly via a simple social interaction with the robot. The interaction allows people to introduce themselves and others to Leonardo, who tries to memorize their faces for use in subsequent interactions. Our face recognition technology is based on the appearance manifold approach described in Nayar, Nene, and Murase, "Real-Time 100 Object Recognition System," 1996.

The system receives images from the camera mounted in Leo's right eye. Faces are isolated from these images using data provided by a facial feature tracker developed by Neven Vision. Isolated face images are resampled to small (25 x 25 pixel) greyscale images and projected onto the first 40 principal components of the face image data set. These projections are matched against appearance manifold splines to produce a classification, retrieving the name associated with the given face.
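
A minimal sketch of this matching step, assuming a precomputed PCA model: `mean_face`, `pca_basis`, and `manifolds` are illustrative names, and each appearance manifold is approximated here by densely sampled points along its fitted spline.

```python
import numpy as np
import cv2

PATCH = 25          # faces are resampled to 25 x 25 pixels
N_COMPONENTS = 40   # the first 40 principal components are used

def project_face(face_bgr, mean_face, pca_basis):
    """Resample a cropped face image and project it onto the PCA basis.

    mean_face: (625,) mean of the face image data set
    pca_basis: (N_COMPONENTS, 625) top principal components as rows
    """
    grey = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(grey, (PATCH, PATCH)).astype(np.float64).ravel()
    return pca_basis @ (small - mean_face)    # 40-d projection

def classify(projection, manifolds):
    """Return the name whose appearance manifold lies closest to the
    projected face; each manifold is a (k, 40) array of spline samples."""
    best_name, best_dist = None, np.inf
    for name, samples in manifolds.items():
        dist = np.linalg.norm(samples - projection, axis=1).min()
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```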

To learn new faces, Leo keeps a buffer of up to 200 temporally contiguous or near-contiguous views of the currently tracked face. This buffer is used to create a new face model whenever the person introduces themselves via speech. When a new model is created, principal component analysis (PCA) is performed on the entire face image data set, and a spline manifold is fitted to the images of the new face. The appearance manifold splines for the other face models are also recomputed at this time. The full model-building process takes about 15 seconds. Since the face recognition module runs as a separate process from Leo's other cognitive modules, a new face model can be added without stalling the robot or the interaction.
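
A hedged sketch of this rebuild step, assuming each person's buffered views are stored as flattened 25 x 25 images; the SVD-based PCA and SciPy's vector-valued spline interpolation stand in for whatever the actual system uses, and `rebuild_models` is an illustrative name.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

MAX_VIEWS = 200   # per-person buffer of temporally contiguous views

def rebuild_models(image_sets):
    """image_sets: dict of name -> (n_i, 625) array of flattened views.
    Recomputes PCA over the entire data set and refits every manifold.
    Each person needs at least 4 views for the cubic spline fit."""
    data = np.vstack(list(image_sets.values()))
    mean_face = data.mean(axis=0)
    # PCA over all stored face images via SVD
    _, _, vt = np.linalg.svd(data - mean_face, full_matrices=False)
    pca_basis = vt[:40]                              # first 40 components
    manifolds = {}
    for name, images in image_sets.items():
        proj = (images - mean_face) @ pca_basis.T    # (n_i, 40)
        # fit a vector-valued cubic spline through the projected views,
        # then sample it densely for nearest-point matching at run time
        t = np.linspace(0.0, 1.0, len(proj))
        spline = make_interp_spline(t, proj, k=3)
        manifolds[name] = spline(np.linspace(0.0, 1.0, 500))  # (500, 40)
    return mean_face, pca_basis, manifolds
```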

The face recognition module receives speech input from the Sphinx-4 speech recognition system, which allows people to introduce themselves via simple phrases: "My name is Marc" or "Leo, this is Dan." Speech input also lets us test Leo's recall: "Leo, can you find Andrea?"
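
The introduction and query phrases can be illustrated with simple pattern matching over the recognizer's text output; the sketch below is a toy stand-in (the `face_module` interface is hypothetical), whereas the real system drives Sphinx-4 with a grammar.

```python
import re

INTRO = re.compile(r"(?:my name is|leo,? this is)\s+(\w+)", re.IGNORECASE)
QUERY = re.compile(r"leo,? can you find\s+(\w+)", re.IGNORECASE)

def handle_utterance(text, face_module):
    """Route a recognized utterance to the face recognition module."""
    if match := INTRO.search(text):
        # trigger model building from the buffered views of the
        # currently tracked face (hypothetical interface)
        face_module.learn_current_face(name=match.group(1))
    elif match := QUERY.search(text):
        face_module.search_for(name=match.group(1))   # test recall
```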

The full system provides face recognition information at approximately 13 Hz, running on a dual-2GHz G5 Macintosh.

Visual Tracking
A necessary sensory aptitude for a sociable robot is knowing where people are and what they are doing. Our robot therefore needs to monitor the humans in its environment and interpret their activities, such as gesture-based communication.

The robot must also understand aspects of its inanimate environment, such as how its toys behave as it plays with them. Vision is an important sensory modality for making these kinds of observations. The robot needs a collection of visual abilities, each closely tied to the specific kind of information it must extract from its interactions.

Towards this goal, we are developing a suite of visual capabilities as we investigate the use of Intel's OpenCV library (supplementing its routines with Mac G4 AltiVec operations). The suite includes visual feature detectors for objects (e.g., color, shape, and motion) and for people (e.g., skin tone, eye detection, and facial feature tracking), the ability to designate and track a target of attention, and stereo depth estimation.
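
As one concrete example of such a detector, the sketch below runs OpenCV's stock Haar-cascade eye detector on a camera frame. It is a generic OpenCV illustration rather than the project's own detector, and assumes the cascade files shipped with current OpenCV distributions.

```python
import cv2

# stock cascade bundled with OpenCV; the project's detectors differ
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_eyes(frame_bgr):
    """Return candidate eye bounding boxes as (x, y, w, h) tuples."""
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return eye_cascade.detectMultiScale(
        grey, scaleFactor=1.1, minNeighbors=5, minSize=(20, 20))
```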

Active vision behaviors include saccading to the locus of attention, smooth pursuit of a moving object, establishing and maintaining eye contact, and vergence on objects at varying depths. The movie shows Leonardo tracking a red Elmo plush doll.
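
A minimal sketch of this style of color-based tracking: threshold red in HSV, take the centroid of the red pixels, and report its offset from the image center as the error signal a gaze controller would drive toward zero for saccades and smooth pursuit. The threshold values are illustrative.

```python
import cv2

def red_target_offset(frame_bgr):
    """Locate a red target; return its pixel offset from image center,
    or None if no red pixels are visible."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # red wraps around hue 0 in HSV, so combine two hue ranges
    mask = cv2.inRange(hsv, (0, 120, 70), (10, 255, 255)) | \
           cv2.inRange(hsv, (170, 120, 70), (180, 255, 255))
    m = cv2.moments(mask)
    if m["m00"] == 0:
        return None                       # no red target in view
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
    h, w = mask.shape
    return cx - w / 2.0, cy - h / 2.0     # error for the gaze controller
```
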
The face recognition video shows a single interaction in which Leo is introduced to two new people for the first time. Leo learns both of their names and builds a model of each of their faces. Leo's recognition abilities are then tested by asking him to find each person as they move to different locations in the scene. Leo scans the scene, and when he finds a face that matches the query, he points to it. When asked to find someone who is absent from the scene, Leo looks around for a while, then shrugs to indicate that he cannot find them.