@ Google Brain
Sloan Research Fellow &
@ CSE Division, University of Michigan, Ann Arbor
3773 Bob and Betty Beyster Building
2260 Hayward Street
Ann Arbor, MI 48109
(* Link to: NSF CAREER Project website)
One of the most fundamental challenges in representation learning is how to learn features that are highly specific (e.g., discriminative) yet robust and invariant. Learning invariant representations that are robust to variability in high-dimensional data (e.g., images and speech) enables machine learning systems to generalize well from a relatively small number of labeled training examples. Over the years, we have developed techniques for learning robust and invariant representations.
In this research thrust, I am interested in the following question: how can we learn better shared representations from multimodal data? We have investigated learning algorithms for multimodal representations that make predictions more robust by learning associations between different input modalities (e.g., audio and video, or image and text). One of the key ideas is to learn an abstract shared representation that encourages association between input modalities by generating multimodal data from latent features. We proposed one of the first approaches to multimodal deep learning, showing that learning from multiple modalities yields significantly better features than learning from a single modality. More recently, we derived a better learning algorithm with rigorous theoretical analysis. In terms of applications, I have also demonstrated the effectiveness of multimodal deep learning in many domains, such as audio-visual data, robotic sensors, and image-text data.
In real-world tasks with complex outputs, a learner should make predictions that exploit prior domain knowledge and the global correlation structure of the outputs (e.g., segmentation or localization of objects in images or videos, or sequence labeling for time-series data). In structured prediction problems the output space is typically huge, which makes inference and learning very challenging. Deep architectures with distributed representations are a very promising way of tackling this problem, and we have developed models that combine bottom-up feature learning with top-down structured priors.
In this research thrust, we try to address the following question: how can we tease out factors of variation from complex data with deep generative models? Many latent factors of variation interact to generate sensory data; for example, pose, morphology, and viewpoint for 3D object images. As a general framework, we propose to learn manifold coordinates for the relevant factors of variation and to model their joint interaction. Many feature learning algorithms focus on a single task and extract only task-relevant features that are invariant to other factors. However, models that extract only a single set of invariant features do not exploit the relationships among the latent factors. To address this, we develop deep generative models with higher-order interactions among groups of hidden units, where each group learns to encode a distinct factor of variation.
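To make the idea of higher-order interactions concrete, here is a minimal sketch of a three-way multiplicative interaction between an input vector and two groups of hidden units. It is an illustration only, not the exact model used in our work: the function names, the tensor W, the dimensions, and the random values are all assumptions chosen for the example.

```python
import numpy as np

def three_way_score(v, h, m, W):
    """Multiplicative interaction score: sum over i, j, k of W[i,j,k] * v[i] * h[j] * m[k]."""
    return float(np.einsum("ijk,i,j,k->", W, v, h, m))

def input_to_group_h(v, m, W):
    """Total input to each unit of group h, given the visible vector v and group m.

    Because the interaction is multiplicative, the input to h depends on the
    current state of m: the two groups gate each other instead of acting independently.
    """
    return np.einsum("ijk,i,k->j", W, v, m)

# Hypothetical sizes: 6 visible units, 3 units in group h, 2 units in group m.
rng = np.random.default_rng(0)
n_v, n_h, n_m = 6, 3, 2
W = rng.normal(scale=0.1, size=(n_v, n_h, n_m))    # three-way weight tensor (assumed)
v = rng.normal(size=n_v)                           # visible (input) vector
h = rng.integers(0, 2, size=n_h).astype(float)     # binary states of hidden group h
m = rng.integers(0, 2, size=n_m).astype(float)     # binary states of hidden group m

score = three_way_score(v, h, m, W)
h_input = input_to_group_h(v, m, W)
```

In this sketch the score decomposes as the dot product of `h_input` and `h`, which shows how changing the state of one group changes what the other group responds to.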
In this research thrust, we develop algorithms for learning deep representations using weak supervision. Our approaches tightly link the input data, weak supervision (such as coarse-grained or partial labels), and an auxiliary target task (with no explicit labels). Examples include learning to localize objects using only class labels and learning to discover semantic attributes without explicit supervision. This approach can significantly reduce the burden of detailed data annotation (e.g., for structured outputs) by leveraging a potentially large amount of implicit supervision.
A long-standing challenge in artificial intelligence is to develop an intelligent agent that is capable of interacting (i.e., “taking actions”) in a physical and/or virtual environment, such as a personal assistant robot or an autonomous car. This is believed to be the next critical step toward breakthroughs in building AI systems. Reinforcement learning attempts to address this problem, but its progress has been hampered by the challenge of perception from complex real-world data. I am interested in the combination of reinforcement learning (RL) and deep learning, which holds promise for challenging AI problems that need both rich perception and policy selection. We are using the video game domain (e.g., Atari games) as a testbed, which is a challenging RL domain that requires agents to address the issues of sequential decision making, partial observability, delayed reward, high-dimensional observations, and multiple interacting objects.
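As a small illustration of the delayed-reward issue mentioned above, the following sketch shows the standard tabular Q-learning update on a toy problem. It is a textbook technique used here only as an example, not the deep RL method used in our work; the state and action counts, the learning rate `alpha`, and the discount `gamma` are assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return Q.

    The target is the immediate reward plus the discounted value of the best
    action in the next state (or the reward alone if the episode has ended).
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy table with 3 states and 2 actions (sizes are hypothetical).
Q = np.zeros((3, 2))

# A rewarded, terminal transition from state 1 updates Q[1, 0] directly.
Q = q_learning_update(Q, s=1, a=0, r=1.0, s_next=2, done=True)

# An unrewarded transition from state 0 into state 1 still gains value,
# because the reward seen later is propagated backward through the target.
Q = q_learning_update(Q, s=0, a=1, r=0.0, s_next=1, done=False)
```

In deep RL the table `Q` is replaced by a neural network that maps high-dimensional observations (such as game frames) to action values, which is where perception and decision making meet.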