Vision System
This real-time vision system consists of a pair of fixed-baseline stereo cameras. One camera is mounted behind the robot, facing the audience (see movie); the other is located overhead, looking down at the terrarium. Custom software processes the two stereo video feeds separately, running at approximately 15 frames per second. The cameras are IEEE 1394 digital cameras, synchronized to capture their images at precisely the same time and mounted in a hard case that keeps their axes parallel.
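For concreteness, below is a minimal sketch of the per-frame disparity step using OpenCV's semi-global block matcher. The matcher choice and all parameters are illustrative assumptions, not the system's actual implementation, which was custom software.

```python
import cv2
import numpy as np

# Hypothetical stand-in for the stereo correlation step: compute a
# disparity map from a rectified, hardware-synchronized stereo pair.
# numDisparities and blockSize are assumptions tuned per rig, not
# values from the original system.
stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,   # disparity search range; depends on baseline and scene depth
    blockSize=9,         # correlation window size
)

def disparity_map(left_gray: np.ndarray, right_gray: np.ndarray) -> np.ndarray:
    """Return a float disparity map; larger disparity means closer to the camera."""
    # OpenCV returns fixed-point disparities with 4 fractional bits.
    return stereo.compute(left_gray, right_gray).astype(np.float32) / 16.0
```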

The software consists of several modules that perform low-level detection, plus a tracking system that merges their output. A stereo correlation engine compares the two images for stereo correspondence, computing a 3-D depth (disparity) map. This map is then compared against a background depth estimate to produce a foreground depth map. Simultaneously, the color images are normalized and analyzed with a probabilistic model of human skin chromaticity to segment out regions likely to correspond to human skin. The foreground depth map and the skin probability map are then filtered and combined, positive regions are extracted, and an optimal bounding ellipse is computed for each region. For the camera behind the robot facing the audience, a Viola-Jones face detector runs on each region to determine whether or not it corresponds to a face. Finally, the regions are tracked over time based on their position, size, orientation, and velocity; connected components are examined to match hands and faces to a single owner; and all of this information is transmitted to the behavior engine controlling the robot.
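As a sketch of the fusion and region-extraction steps, the snippet below combines a foreground depth mask with a skin-probability mask and fits a bounding ellipse to each connected region. The thresholds, the Gaussian chromaticity parameters, and the helper names are hypothetical stand-ins for the trained probabilistic model described above.

```python
import cv2
import numpy as np

def foreground_depth(disp: np.ndarray, background_disp: np.ndarray,
                     margin: float = 2.0) -> np.ndarray:
    """Mark pixels whose disparity is notably larger (closer) than the
    background depth estimate. The margin is an illustrative threshold."""
    return (disp - background_disp) > margin

def skin_probability(bgr: np.ndarray) -> np.ndarray:
    """Toy stand-in for the skin chromaticity model: normalize out
    intensity, then score (r, g) chromaticity against a Gaussian.
    The mean and covariance here are illustrative, not the trained model."""
    rgb = bgr[..., ::-1].astype(np.float32) + 1e-6
    chroma = rgb[..., :2] / rgb.sum(axis=-1, keepdims=True)  # (r, g) chromaticity
    mean = np.array([0.46, 0.31])
    inv_cov = np.diag([400.0, 700.0])
    d = chroma - mean
    maha = np.einsum('...i,ij,...j->...', d, inv_cov, d)     # Mahalanobis distance
    return np.exp(-0.5 * maha)

def extract_regions(disp, background_disp, bgr, skin_thresh=0.5):
    """Fuse the two masks, clean up speckle, and fit one ellipse per region."""
    mask = (foreground_depth(disp, background_disp) &
            (skin_probability(bgr) > skin_thresh)).astype(np.uint8)
    mask = cv2.medianBlur(mask * 255, 5)          # simple speckle filter
    n, labels = cv2.connectedComponents(mask)
    ellipses = []
    for i in range(1, n):                         # label 0 is background
        pts = np.column_stack(np.nonzero(labels == i)[::-1]).astype(np.float32)
        if len(pts) >= 5:                         # fitEllipse needs >= 5 points
            ellipses.append(cv2.fitEllipse(pts))  # ((cx, cy), (axes), angle)
    return ellipses
```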
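Similarly, the following sketches the per-region face check and a greedy nearest-neighbor association over time. The stock OpenCV Haar cascade is not necessarily the detector the system used, and the constant-velocity matcher is a toy stand-in for the actual tracker.

```python
import cv2
import numpy as np

# Stock Viola-Jones cascade shipped with OpenCV; the original detector's
# training data and parameters are not specified in the source.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def is_face(gray_patch: np.ndarray) -> bool:
    """Run the cascade on a cropped candidate region (audience camera only)."""
    return len(cascade.detectMultiScale(gray_patch,
                                        scaleFactor=1.1, minNeighbors=3)) > 0

def associate(tracks, detections, max_dist=40.0):
    """Greedily match each track's predicted position (last position plus
    velocity) to the nearest new detection center, within max_dist pixels.
    tracks: list of dicts with 'pos' and 'vel' as 2-vectors;
    detections: list of 2-vector region centers."""
    unmatched = list(range(len(detections)))
    pairs = []
    for ti, t in enumerate(tracks):
        if not unmatched:
            break
        pred = t['pos'] + t['vel']                # constant-velocity prediction
        j = min(unmatched, key=lambda k: np.linalg.norm(detections[k] - pred))
        if np.linalg.norm(detections[j] - pred) < max_dist:
            pairs.append((ti, j))
            unmatched.remove(j)
    return pairs, unmatched
```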

The vision system was developed in collaboration with David Demirdjian of the Vision Interfaces Group at MIT CSAIL.