Research Scientist

@ Google Brain

and

Sloan Research Fellow &

Associate Professor

@ CSE Division,
University of Michigan, Ann Arbor

Address (UM):

3773 Bob and Betty Beyster Building

2260 Hayward Street

Ann Arbor, MI 48109

Email:

(* Link to: NSF CAREER Project website)

One of the most fundamental challenges in representation learning is how to learn features that are highly specific (e.g., discriminative) yet robust and invariant. Learning invariant representations that are robust to variability in high-dimensional data (e.g., images and speech) enables machine learning systems to generalize well while using a relatively small number of labeled training examples. Over the years, we have developed techniques for learning robust and invariant representations.

** Relevant Work/Publications: **

- efficient sparse coding algorithms (NIPS 2007, IJCAI 2009)
- sparsity regularization for RBMs, autoencoders, and deep belief networks (NIPS 2008, NIPS 2009a)
- convolutional deep belief networks (ICML 2009, NIPS 2009b, Comm. ACM 2011)
- efficient learning methods for sparse RBMs and convolutional RBMs (ICCV 2011)
- scalable unsupervised feature learning (AISTATS 2011)
- transformation-invariant feature learning (ICML 2012)
- locally-convolutional deep architectures (CVPR 2012)
- invariant feature learning (NIPS 2012)
- deep learning with adaptive model capacity (AISTATS 2012)
- weakly-supervised deep learning by jointly learning and selecting features via a gating mechanism (ICML 2013)
- invertibility of convolutional neural networks and improving large-scale CNN classification (ICML 2016)

In this research thrust, I am interested in the following question: how can we learn better shared representations from multimodal data? We have investigated learning algorithms for multimodal representations that make predictions more robust by learning associations between different input modalities (e.g., audio and video, or image and text). One of the key ideas is to learn an abstract shared representation that encourages association between input modalities by generating multimodal data from latent features. We proposed one of the first approaches for multimodal deep learning, showing that learning from multiple modalities yields significantly better features than learning from a single modality. More recently, we derived a better learning algorithm with rigorous theoretical analysis. In terms of applications, I have also demonstrated the effectiveness of multimodal deep learning in many domains, such as audio-visual data, robotic sensors, and image and text.

** Relevant Work/Publications: **

- multimodal deep autoencoders (ICML 2011)
- improved learning objective with conditional prediction (NIPS 2014)
- audio-visual emotion recognition (ICASSP 2013)
- multimodal feature learning for robotic grasping (RSS 2013, IJRR 2015)
- structural joint embedding from multiple input sources (e.g., text, attribute, and images) (CVPR 2015)
- multimodal embedding for fine-grained recognition (CVPR 2016)
- conditional image generation from semantic attributes (arXiv)
- conditional text-to-image generation with generative adversarial networks (ICML 2016)

In real-world tasks with complex outputs, a learner should make predictions that exploit prior domain knowledge and the global correlation structure of the outputs (e.g., segmentation or localization of objects in images or videos, or sequence labeling for time-series data). Typically, the output space in structured prediction problems is huge, which makes inference and learning very challenging. Deep architectures with distributed representations are a very promising way of tackling this problem, and we have developed models that combine bottom-up feature learning with top-down structured priors.

** Relevant Work/Publications: **

- conditional random fields with output prior from deep generative models (CVPR 2013)
- structured prediction with Bayesian optimization and structured objective (CVPR 2015)
- convolutional neural networks for medical image segmentation (EMBC 2015)
- structured prediction with deep stochastic generative models (conditional VAE) (NIPS 2015)
- contour detection and segmentation with convolutional neural networks (CVPR 2016)
- transfer learning for semantic segmentation (CVPR 2016)

In this research thrust, we try to address the following question: how can we tease out factors of variation from complex data with deep generative models? Many latent factors of variation interact to generate sensory data; for example, pose, morphology, and viewpoint for images of 3D objects. As a general framework, we propose to learn manifold coordinates for the relevant factors of variation and to model their joint interaction. Many feature learning algorithms focus on a single task and extract only task-relevant features that are invariant to other factors. However, models that extract just a single set of invariant features do not exploit the relationships among the latent factors. To address this, we develop deep generative models with higher-order interactions among groups of hidden units, where each group learns to encode a distinct factor of variation.

** Relevant Work/Publications: **

- Disentangling factors of variation with manifold interaction (ICML 2014)
- Action-conditional video prediction for Reinforcement Learning (NIPS 2015)
- Deep visual analogy making (NIPS 2015)
- Weakly-supervised learning of disentangled representations with transformations (NIPS 2015)
- style-content disentangling and conditional image generation from semantic attributes (arXiv)
- style-content disentangling and conditional text-to-image generation with generative adversarial networks (ICML 2016)
- Understanding and improving convolutional neural networks with concatenated ReLU (ICML 2016)
- Invertibility of convolutional neural networks and improving large-scale CNN classification (ICML 2016)
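
As a rough illustration of the higher-order interaction idea described above, the sketch below combines two groups of hidden units multiplicatively through a three-way weight tensor. It is a minimal toy example, not the model from the listed papers: the function name `interact`, the variable names, the dimensions, and the random weights are all assumptions made for illustration, and no learning is performed.

```python
import numpy as np


def interact(W, h_a, h_b):
    """Combine two groups of hidden units through a three-way tensor.

    W has shape (n_visible, n_a, n_b); h_a and h_b are the activations
    of the two groups. Each output unit k receives
    sum over i, j of W[k, i, j] * h_a[i] * h_b[j],
    so the two groups interact multiplicatively rather than additively.
    """
    return np.einsum("kij,i,j->k", W, h_a, h_b)


# Hypothetical sizes: one group for "pose", one for "identity".
rng = np.random.default_rng(0)
n_visible, n_pose, n_identity = 6, 3, 4

W = rng.normal(size=(n_visible, n_pose, n_identity))
pose = rng.normal(size=n_pose)
identity = rng.normal(size=n_identity)

x = interact(W, pose, identity)
```

Holding `identity` fixed and changing only `pose` (or the reverse) varies one factor at a time in the output, which is the property that motivates giving each factor its own group of units.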

In this research thrust, we develop algorithms for learning deep representations using weak supervision. Our approaches tightly link input data, weak supervision (such as coarse-grained or partial labels), and an auxiliary target task (with no explicit labels). Examples include learning to localize objects using only class labels and learning to discover semantic attributes without explicit supervision. This approach can significantly reduce the burden of detailed data annotation (e.g., of structured outputs) by leveraging a potentially large amount of implicit supervision.

** Relevant Work/Publications: **

- weakly-supervised deep learning by jointly learning and selecting features via a gating mechanism, with application to weakly-supervised object localization and segmentation (ICML 2013)
- weakly-supervised learning of semantic attributes with class-level supervision (CVPR 2013)
- deep learning with noisy or partly labeled data (ICLR 2015)
- Weakly-supervised learning of disentangled representations with transformations (NIPS 2015)
- transfer learning for semantic segmentation via weak supervision (CVPR 2016)

A long-standing challenge in artificial intelligence is to develop an intelligent agent that is capable of interacting (i.e., “taking actions”) in a physical and/or virtual environment, such as a personal assistant robot or an autonomous car. This is believed to be the next critical step toward breakthroughs in building AI systems. Reinforcement learning attempts to address this problem, but its progress has been hampered by the challenge of perception from complex real-world data. I am interested in the combination of reinforcement learning (RL) and deep learning, which holds promise for challenging AI problems that need both rich perception and policy selection. We are using the video game domain (e.g., Atari games) as a testbed; it is a challenging RL domain that requires agents to address sequential decision making, partial observability, delayed reward, high-dimensional observations, and multiple interacting objects.

** Relevant Work/Publications: **

- Convolutional neural networks for approximating offline Monte-Carlo policy (NIPS 2014)
- Action-conditional video prediction for Reinforcement Learning (NIPS 2015)
- Learning reward design for improving Monte Carlo tree search (IJCAI 2016)
- Active perception (first-person view) and memory-based deep RL architectures (ICML 2016)