Humans can easily segment moving objects without knowing what they are. The possibility that objectness could emerge from continuous visual observation motivates us to model segmentation and movement concurrently from {\it unlabeled} videos. Our premise is that a video contains different views of the same scene related by moving components, and that the right region segmentation and region flow allow view synthesis, which can be checked against the data itself without any external supervision.
Our model first deconstructs video frames through two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images. It then binds them in a conjoint region flow feature representation and predicts {\em segment flow}, a gross characterization of moving regions for the entire scene.
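Schematically, this binding and its self-supervised check can be sketched as follows (a simplified illustration, not the full formulation; the symbols $M_k$, $f_k$, and $\mathcal{W}$ are introduced here only for exposition):
\[
F(p) \;=\; \sum_{k=1}^{K} M_k(p)\, f_k ,
\qquad
\mathcal{L}_{\text{recon}} \;=\; \big\lVert I_1 - \mathcal{W}(I_2, F) \big\rVert_1 ,
\]
where $M_k(p)$ is the soft mask of segment $k$ at pixel $p$ from the appearance pathway, $f_k$ is a single flow vector pooled from the motion features for that segment, and $\mathcal{W}$ warps one frame by the segment flow $F$ so that the synthesized view can be compared against the observed frame without external labels.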
Our model demonstrates the surprising emergence of objectness in the appearance pathway, surpassing prior work on {\bf 1)} zero-shot object segmentation from a single image, {\bf 2)} moving object segmentation from a video with unsupervised test-time adaptation, and {\bf 3)} semantic image segmentation with supervised fine-tuning. Ours is the first zero-shot object segmentation model learned truly end-to-end from unlabeled videos. It not only develops generic objectness for segmentation and tracking, but also outperforms image-based contrastive representation learning without requiring augmentation engineering.