Contextual Image Parsing via Panoptic Segment Sorting
Jyh-Jing Hwang and Tsung-Wei Ke and Stella X. Yu
ACM Multimedia 2021 Workshop on Multimedia Understanding with Less Labeling, Online, 20-24 October 2021
Paper

Abstract

Real-world visual recognition is far more complex than object recognition: There is stuff without distinctive shape or appearance, and the same object appearing in different contexts calls for different actions. While we need context-aware visual recognition, visual context is hard to describe and impossible to label manually.

We consider visual context as semantic correlations between objects and their surroundings that include both object instances and stuff categories. We approach contextual object recognition as a pixel-wise feature representation learning problem that accomplishes supervised panoptic segmentation while discovering and encoding visual context automatically.

Panoptic segmentation is a dense image parsing task that segments an image into regions with both semantic category and object instance labels. These two aspects can conflict with each other: two adjacent cars share the same semantic label but carry different instance labels. Whereas most existing approaches handle the two labeling tasks separately and then fuse the results, we propose a single pixel-wise feature learning approach that unifies semantic segmentation and instance segmentation.
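
For concreteness, panoptic labels are often packed into a single per-pixel id by combining the semantic category with the instance index, e.g. the Cityscapes-style convention panoptic_id = semantic_id * 1000 + instance_index. The minimal sketch below (the helper name is ours, not from the paper) shows how two adjacent cars share a semantic label yet receive distinct panoptic ids:

    import numpy as np

    def encode_panoptic(semantic_map, instance_map):
        """Pack per-pixel semantic ids and instance indices into panoptic ids,
        following the Cityscapes-style convention
        panoptic_id = semantic_id * 1000 + instance_index."""
        return semantic_map.astype(np.int64) * 1000 + instance_map.astype(np.int64)

    # Two adjacent cars: same Cityscapes semantic id (26 = car),
    # different instance indices, hence different panoptic ids.
    semantic = np.array([[26, 26], [26, 26]])
    instance = np.array([[1, 1], [2, 2]])
    print(encode_panoptic(semantic, instance))
    # [[26001 26001]
    #  [26002 26002]]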

Our work takes the metric learning perspective of SegSort but extends it non-trivially to panoptic segmentation, as we must merge segments into proper instances and handle instances of various scales. Our most exciting result is the emergence of visual context in the feature space through contrastive learning between pixels and segments, such that we can retrieve a person "crossing a somewhat empty street" without any such context labeling.
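
To illustrate the pixel-to-segment contrastive idea, here is a minimal sketch in the spirit of SegSort, not the authors' exact loss: each segment is summarized by the normalized mean of its pixel embeddings, and each pixel is classified among these segment prototypes with a softmax over scaled cosine similarities, pulling pixels toward their own segment and away from all others.

    import torch
    import torch.nn.functional as F

    def pixel_segment_contrastive_loss(embeddings, segment_ids, temperature=0.1):
        """Sketch of a SegSort-style pixel-to-segment contrastive loss.
        embeddings: (N, D) pixel features; segment_ids: (N,) segment index per pixel."""
        embeddings = F.normalize(embeddings, dim=1)
        num_segments = int(segment_ids.max()) + 1
        # Segment prototype = normalized mean of its pixel embeddings.
        prototypes = torch.zeros(num_segments, embeddings.size(1), dtype=embeddings.dtype)
        prototypes.index_add_(0, segment_ids, embeddings)
        counts = torch.bincount(segment_ids, minlength=num_segments).clamp(min=1)
        prototypes = F.normalize(prototypes / counts.unsqueeze(1), dim=1)
        # Softmax over cosine similarities: each pixel should "retrieve" its own segment.
        logits = embeddings @ prototypes.t() / temperature
        return F.cross_entropy(logits, segment_ids)

    # Toy usage: 6 pixels, 8-dim features, 3 segments.
    feats = torch.randn(6, 8, requires_grad=True)
    segs = torch.tensor([0, 0, 1, 1, 2, 2])
    pixel_segment_contrastive_loss(feats, segs).backward()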

Our experimental results on Cityscapes and PASCAL VOC demonstrate that, in terms of surround-semantics distributions, our retrievals are much more consistent with the query than those of the state-of-the-art segmentation method, validating our pixel-wise representation learning approach for the unsupervised discovery and learning of visual context.
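
One way such surround-semantics consistency could be scored (the band width and distance measure below are our assumptions, not the paper's exact protocol) is to histogram the semantic labels in a dilated band around a segment and compare the query and retrieval distributions:

    import numpy as np
    from scipy.ndimage import binary_dilation

    def surround_histogram(semantic_map, segment_mask, num_classes, band=10):
        """Distribution of semantic labels in a `band`-pixel ring around a segment.
        semantic_map: integer label map; segment_mask: boolean mask of the segment."""
        ring = binary_dilation(segment_mask, iterations=band) & ~segment_mask
        hist = np.bincount(semantic_map[ring], minlength=num_classes).astype(float)
        return hist / max(hist.sum(), 1.0)

    def surround_consistency(query_hist, retrieved_hist):
        """1 minus total-variation distance: 1.0 means identical surround semantics."""
        return 1.0 - 0.5 * np.abs(query_hist - retrieved_hist).sum()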


Keywords
contrastive learning, context encoding, context discovery, image parsing, panoptic segmentation