I do research in three main areas of data management.
- Systems and algorithms for "messy" data management, including work on information extraction (from spreadsheets or from Web pages of different kinds), data integration (whether from Web pages or from more traditional sources), machine learning workloads (such as feature engineering), and top-k ranking.
- Novel data applications, especially for social science use cases in economics and in fighting human trafficking (a technical paper is coming soon; in the meantime, read this article in Scientific American).
- Data systems infrastructure, including systems work that can undergird very general-purpose data management methods. My work on Hadoop is the best-known example, but this area also includes research into optimization for MapReduce programs and hardware support for text analytics (accepted for ICDE 2016).
In addition to writing papers, we build real systems that aim to have a large, concrete, real-world impact:
Data, code, and other resources
We are grateful to many different organizations for helping to fund our research:
- The Census Bureau
- General Electric
- The National Science Foundation