Michael J. Cafarella

Mike Cafarella
Associate Professor
Computer Science and Engineering
2260 Hayward St.
University of Michigan
Ann Arbor, MI 48109-2121

Office: 4709 Beyster
Phone: 734-764-9418
Fax: 734-763-8094
Send email to me at michjc, found at umich dot edu

My students and I currently work on projects in four areas of data management:

Tools for Dataset Construction, including information extraction (from tables, spreadsheets, and text of various kinds) and data transformation.
Data-Intensive Programming and Debugging, such as creating data transformation programs, exploiting code corpora, building large-scale debugging systems or debugging in the face of data-quality tradeoffs.
Data Management for Economics, such as data systems for managing raw nowcasting evidence, using information extraction to attack trafficking crimes, or investigating data integration issues in macroeconomic statistics.
System Support for Machine Learning Development, such as systems for feature engineering, efficiently querying image corpora for training set construction, and even some hardware ideas (and, a while ago, Hadoop).

In addition to writing papers, we build real systems that aim to have a large, concrete, real-world impact:

Our extracted data about potential human trafficking activity has been used by law enforcement in investigations.
Our deployed system for predicting unemployment using social media data has been running and producing novel economics data since early 2014. For an extended period of time this data was regularly downloaded by a range of banks and governmental institutions.
Much of our system for spreadsheet data extraction, as well as a corpus of test spreadsheets, is available to download.
With Doug Cutting, I cofounded Hadoop, which is deployed at Facebook, Twitter, Yahoo!, and more than half of the Fortune 50.

Videos, data, code, and papers for Senbazuru spreadsheet extraction
The website, data, and papers for our social media system for economics prediction
The WebTables metadata corpus
Details about DeepDive, which I contribute to
I created the RecordBreaker open source project for inducing structured data from unstructured log files.
Hadoop, etc.

We are grateful to many different organizations for helping to fund our research: