Changhai Xu. 2011. Steps Towards the Object Semantic Hierarchy.
Doctoral dissertation, Computer Science Department, University of Texas at Austin.


An intelligent robot must be able to perceive and reason robustly about its world in terms of objects, among other foundational concepts. The robot can draw on rich data for object perception from continuous sensory input, in contrast to the usual formulation that focuses on objects in isolated still images. Additionally, the robot needs multiple object representations to deal with different tasks and/or different classes of objects. We propose the Object Semantic Hierarchy (OSH), which consists of multiple representations with different ontologies. The OSH factors the problems of object perception so that intermediate states of knowledge about an object have natural representations, with relatively easy transitions from less structured to more structured representations. Each layer in the hierarchy builds an explanation of the sensory input stream, in terms of a stochastic model consisting of a deterministic model and an unexplained "noise" term. Each layer is constructed by identifying new invariants from the previous layer. In the final model, the scene is explained in terms of constant background and object models, and low-dimensional dynamic poses of the observer and objects.

The OSH contains two types of layers: object layers and model layers. The object layers describe how the static background and each foreground object are individuated, and the model layers describe how the model for the static background or each foreground object evolves from less structured to more structured representations. Each object or background model contains the following layers: (1) 2D object in 2D space (2D2D): a sparse set of constant 2D object views and the time-variant 2D object poses; (2) 2D object in 3D space (2D3D): a small collection of constant 2D components with their individual time-variant 3D poses; and (3) 3D object in 3D space (3D3D): the same collection of constant 2D components, but with invariant relations among their 3D poses, and the time-variant 3D pose of the object as a whole.
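
As a rough illustration of the three model layers listed above, the following Python sketch shows one way the constant and time-variant parts of each layer could be stored. The class names, field names, and the use of 4x4 homogeneous transforms for 3D poses are assumptions made for this sketch and are not taken from the dissertation.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Model2D2D:
    """2D object in 2D space (hypothetical layout)."""
    views: List[np.ndarray]      # constant: sparse set of 2D object views
    poses_2d: List[np.ndarray]   # time-variant: one 2D pose per frame


@dataclass
class Model2D3D:
    """2D object in 3D space (hypothetical layout)."""
    components: List[np.ndarray]                # constant: 2D components
    component_poses: List[List[np.ndarray]]     # time-variant: per frame, one 4x4 pose per component


@dataclass
class Model3D3D:
    """3D object in 3D space (hypothetical layout)."""
    components: List[np.ndarray]        # constant: the same 2D components
    relative_poses: List[np.ndarray]    # constant: 4x4 pose of each component in the object frame
    object_poses: List[np.ndarray]      # time-variant: one 4x4 object pose per frame

    def component_pose(self, frame: int, index: int) -> np.ndarray:
        """Pose of one component in a frame, derived from the single object pose."""
        return self.object_poses[frame] @ self.relative_poses[index]


def translation(x: float, y: float, z: float) -> np.ndarray:
    """Helper for the toy example: a 4x4 pure-translation transform."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T


# Toy example: one component fixed 2 units along y in the object frame,
# and an object that has moved 1 unit along x in frame 0.
model = Model3D3D(
    components=[np.zeros((8, 8))],
    relative_poses=[translation(0.0, 2.0, 0.0)],
    object_poses=[translation(1.0, 0.0, 0.0)],
)
pose = model.component_pose(frame=0, index=0)
```

In this layout, moving from the 2D3D layer to the 3D3D layer replaces one time-variant pose per component with a single time-variant pose per object, which is the sense in which the later layer is more structured.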

In building 2D2D object models, a fundamental problem is to segment foreground objects from the background environment in the pixel-level sensory input, and motion is an important cue for this segmentation. Traditional approaches to moving object segmentation usually rely on motion analysis of image information alone, without exploiting the robot's motor signals. We observe, however, that the background motion (from the robot's egocentric view) correlates more strongly with the robot's motor signals than the motion of foreground objects does. Based on this observation, we propose a novel approach to segmenting moving objects by learning homography and fundamental matrices from motor signals.
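
The idea of separating foreground from background by how well the motion is predicted can be sketched as follows. This is a minimal illustration, not the dissertation's method: it assumes that a background homography `H` has already been predicted from the motor signals by some learned mapping (not shown), and that point correspondences between two frames are available. The function names and the pixel threshold are hypothetical.

```python
import numpy as np


def transfer_error(H, pts_prev, pts_curr):
    """Pixel distance between observed points and points predicted by homography H."""
    ones = np.ones((pts_prev.shape[0], 1))
    projected = np.hstack([pts_prev, ones]) @ H.T      # map homogeneous points through H
    projected = projected[:, :2] / projected[:, 2:3]   # back to inhomogeneous coordinates
    return np.linalg.norm(projected - pts_curr, axis=1)


def segment_foreground(H, pts_prev, pts_curr, threshold=2.0):
    """Mark points whose motion the predicted background motion fails to explain."""
    return transfer_error(H, pts_prev, pts_curr) > threshold


# Toy example: the predicted background motion is a pure 5-pixel shift in x.
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
pts_prev = np.array([[10.0, 10.0], [50.0, 20.0], [30.0, 40.0], [70.0, 70.0]])
pts_curr = pts_prev + [5.0, 0.0]   # every point follows the background motion...
pts_curr[3] += [12.0, -9.0]        # ...except the last, which also moves on its own

mask = segment_foreground(H, pts_prev, pts_curr)
```

Points that follow the predicted motion have a small transfer error and are treated as background, while the independently moving point has a large residual and is flagged as foreground. A homography is exact only for planar scenes or purely rotating cameras, so a fuller treatment would also need an epipolar test based on a fundamental matrix for general scenes.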

In building 2D3D and 3D3D object models, estimating camera motion parameters plays a key role. We propose a novel method for camera motion estimation that takes advantage of both planar features and point features and fuses constraints from both homography and essential matrices in a single probabilistic framework. Using planar features greatly improves estimation accuracy over using point features only, and with the help of point features, the solution ambiguity from a planar feature is resolved. Compared to the two classic approaches, which apply either the homography constraint or the essential matrix constraint alone, the proposed method gives more accurate estimates and avoids the drawbacks of both.
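
For reference, the two constraints mentioned above can be written in their standard textbook form, which shows why they can be combined: both depend on the same camera rotation and translation. The notation is an assumption of this note and may differ from the dissertation's. Here a 3D point $X$ in the first camera frame maps to $X' = RX + t$ in the second, $\hat{x}$ and $\hat{x}'$ are the corresponding normalized image points, and the plane satisfies $n^{\top} X = d$ in the first camera frame.

```latex
% Homography constraint for points on the plane n^T X = d
\hat{x}' \sim H \hat{x}, \qquad H = R + \frac{t\, n^{\top}}{d}

% Essential matrix (epipolar) constraint for arbitrary scene points
\hat{x}'^{\top} E\, \hat{x} = 0, \qquad E = [t]_{\times} R
```

Decomposing a single homography into $(R, t, n)$ generally leaves more than one physically plausible solution, which is consistent with the ambiguity that the point features are said to resolve.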