Building and Searching a Structured Web Database

Mike Cafarella
Associate Professor
Computer Science and Engineering
2260 Hayward St.
University of Michigan
Ann Arbor, MI 48109-2121

Office: 4709 Beyster
Phone: 734-764-9418
Fax: 734-763-8094
Send email to me at michjc, found at umich dot edu

Building and Searching a Structured Web Database

This material describes work supported by the National Science Foundation under Grant No. IIS 1054913, Building and Searching a Structured Web Database.

This document constitutes the final report of the grant. To see the version of this document that covers year 1 (from June, 2012), click here. To see the version of this document that covers year 2 (from June, 2013), click here. To see the version of this document that covers year 3 (from June, 2014), click here. To see the version of this document that covers year 4 (from May, 2015), click here. To see the version of this document that covers year 5 (from November, 2016), click here.

A portion of this work has also been supported by an award by the Dow Chemical Company, by Google, by DARPA, and by other NSF grants.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Award Number	IIS 1054913
Duration	Five years
Award Title	CAREER: Building and Searching a Structured Web Database
PI	Michael Cafarella
Students	Shirley Zhe Chen, Dolan Antenucci, Michael Anderson, Zhongjun Jin.
Collaborators	Prof. Eytan Adar, for the work with scientific diagrams in year 1 and years 4 and 5. Professors Margaret Levenstein, Matthew Shapiro, Christopher Re for the work on nowcasting in years 2-5, H.V. Jagadish for data transformation work in final year, and (indirectly, through paper coauthorship) Sasha Dadiomov, Richard Wesley, Gang Xiao, and Jock Mackinlay for spreadsheet extraction work.
Project Goals	Our initial research proposal focused on the following work for the final year: A: Data Extraction: Run combined content-human extraction cycle to extract entire Web dataset B: Search System: Optimize keywords and result presentation using additional query data C: Broader Impacts: Final release of query data, extracted Web dataset, and search engine As mentioned in previous years' reports, the growth of high-quality structured online entity-centric datasets (such as Google’s Knowledge Graph) led us to change the focus of our work several years into the project. We have been focusing on “somewhat structured” numeric datasets, such as spreadsheet data. We have also added efforts to pursue social media extractions, which are traditionally ill-served by Web extraction systems. Further, we have added work on data transformation systems, which are often key in processing extracted data. These extracted datasets remain hard to obtain or manage, and so the core motives of our research plan continue. Over the course of this project, we have completed a large number of research projects with accompanying software or datasets: A tool for spreadsheet extraction (published at KDD 2014), and a system for spreadsheet property discovery that aids with advanced extraction problems (CIKM 2017) A user-facing tool for social media extraction (published at ICDE 2016), and a query optimization system for social media extractions (VLDB 2016) A prototype system for feature engineering infrastructure that is useful for extraction systems (published at ICDE 2016) A system for long-tail extraction (WSDM 2016) A system for synthesizing data transformation programs, which are often critical in making extracted data useful (SIGMOD 2017) Several published datasets concerning spreadsheet and social media extraction (avaialble at http://dbgroup.eecs.umich.edu/project/sheets/index.html and http://econprediction.eecs.umich.edu). The Data Extraction Effort (A) is complete. Our previous work on spreadsheet extraction (published in KDD 2014) enabled extraction of numeric and other relational-style data from spreadsheets, with little user effort. Our new work (CIKM 2017) pushes spreadsheet extraction further, by automatically identifying spreadsheet-specific properties that enable easy downstream extraction. These properties (examples include “rows that constitute aggregations” and “columns that represent crosstabs”) enable extraction from a wider variety of structural spreadsheet forms than the past work. This system used a rule-assisted active learning approach to dramatically reduce the number of manual labels that are required: 34-44% fewer manually-labeled examples when a user provides high-accuracy rules. The Search System Effort (B) was previously complete in the spreadsheet and long-tail extraction domains and is now complete for the nowcasting domain. We published details of our user query-driven demonstration system around nowcasting extractions in ICDE 2016, VLDB 2016, and the December 2016 issue of the IEEE Data Engineering Bulletin. The interface for the nowcasting work is available at http://econprediction.eecs.umich.edu/. The Broader Impacts Effort (C) is now complete for all projects. All code and data for this project can be found at http://wwweb.eecs.umich.edu/db/sheets/index.html, https://github.com/umich-dbgroup/foofah, and http://econprediction.eecs.umich.edu/.
Research Challenges and Results	There are three major lines of work performed in the last period, supported by this grant: Extraction from “somewhat structured” spreadsheets containing numeric data, and which would be useful if in relational form. We worked on this topic area previously, but on a more constrained type of spreadsheet file. We released a paper about a tool for spreadsheet extraction in CIKM 2017. Some extraction and data transformation research to date has focused on designing domain-specific languages to describe transformation processes or auto-matically converting a specific type of spreadsheets. To handle a larger variety of spreadsheets than previous extraction systems enabled, we had to identify various spread-sheet properties, which correspond to a series of transformation programs that contribute towards a general framework that converts spreadsheets to relational tables. In this work, we focused on the problem of spreadsheet property detection. We proposed a hybrid approach of building a variety of spreadsheet property detectors to reduce the amount of required human labeling effort. Our approach integrates an active learning framework with crude, easy-to-write, user-provided rules to save human labeling effort by generating additional high-quality labeled data, especially in the initial training stage. Using a bagging-like technique, our approach can also tolerate lower-quality user-provided rules. Our experiments show that when compared to a standard active learning approach, we reduced the training data needed to reach the performance plateau by 34–44% when a human provides relatively high-quality rules, and by a comparable amount with low-quality rules. A study on a large-scale web-crawled spreadsheet dataset demonstrated that it is crucial to detect a variety of spreadsheet properties in order to transform a large portion of the spreadsheets online into a relational form. Extraction from social media datasets, such as Twitter text messages. We have released several papers on this topic (published at VLDB 2016, ICDE 2016, and the IEE Data Bulletin in December 2016). The most exciting portion of this work is the data system RaccoonDB, described in the VLDB paper. Nowcasting is the practice of using social media data to quantify ongoing real-world phenomena. It has been used by researchers to measure flu activity, unemployment behavior, and more. However, the typical nowcasting workflow requires either slow and tedious manual searching of relevant social media messages or automated statistical approaches that are prone to spurious and low-quality results. In this VLDB paper, we proposed a method for declaratively specifying a nowcasting model; this method involves processing a user query over a very large social media database, which can take hours. Due to the human-in-the-loop nature of constructing nowcasting models, slow runtimes place an extreme burden on the user. Thus we also proposed a novel set of query optimization techniques, which allow users to quickly construct nowcasting models over very large datasets. Further, we propose a novel query quality alarm that helps users estimate phenomena even when historical ground truth data is not available. These contributions allowed us to build a declarative nowcasting data management system, RaccoonDB. RaccoonDB yields high-quality results in interactive time. We evaluated RaccoonDB using 40 billion tweets collected over five years. We showed that our automated system saves work over traditional manual approaches while improving result quality—57% more accurate in our user study—and that its query optimizations yield a 424x speedup, allowing it to process queries 123x faster than a 300-core Spark cluster, using only 10% of the computational resources. Data transformation systems, that automatically convert messy database data into high-quality structured databases. We have published two papers on this topic (both published at SIGMOD 2017). This work has synergy with the above extraction efforts; data transformation systems are often useful for data that was extracted imperfectly. Moreover, data transformation can itself be seen as a form of information extraction. Data transformation is not just useful for extracted data. It is often a critical first step in modern data analysis: before any analysis can be done, data from a variety of sources must be wrangled into a uniform format that is amenable to the intended analysis and analytical software package. This data transformation task is tedious, time-consuming, and often requires programming skills beyond the expertise of data analysts. In this work, we developed a technique to synthesize data transformation programs by example, reducing this burden by allowing the analyst to describe the transformation with a small input-output example pair, without being concerned with the transformation steps required to get there. We implemented our technique in a system, Foofah, that efficiently searches the space of possible data transformation operations to generate a program that will perform the desired transformation. We experimentally showed that data transformation programs can be created quickly with Foofah for a wide variety of cases, with 60% less user effort than the well-known Wrangler system.
Publications	Zhe Chen, Sasha Dadiomov, Richard Wesley, Gang Xiao, Daniel Cory, Michael Cafarella, Jock Mackinlay: Spreadsheet Property Detection With Rule-Assisted Active Learning. CIKM 2017. Dolan Antenucci, Michael R. Anderson, Michael Cafarella: A Declarative Query Processing System for Nowcasting. VLDB 10(3) 2016. Michael R. Anderson, Dolan Antenucci, Michael Cafarella: Runtime Support for Human-in-the-Loop Feature Engineering Systems. IEEE Data Engineering 39(4), December 2016. Dolan Antenucci, Michael R. Anderson, Penghua Zhao, Michael Cafarella: A Query System for Social Media Signals. Demonstration system, ICDE 2016. Zhongjun Jin, Michael R. Anderson, Michael Cafarella, H.V. Jagadish: Foofah: Transforming Data By Example. SIGMOD 2017. Zhongjun Jin, Michael R. Anderson, Michael Cafarella, H.V. Jagadish: Foofah: A Programming-By-Example System for Synthesizing Data Transformation Programs. SIGMOD 2017.
Presentations	Spreadsheet Property Detection With Rule-Assisted Active Learning, presented at CIKM 2017. A Declarative Query Processing System for Nowcasting, presented at VLDB 2016. Foofah: A Programming-By-Example System for Synthesizing Data Transformation Programs, presented at SIGMOD 2017. Foofah: Transforming Data By Example, presented at SIGMOD 2017.
Images/Videos	Please see the economics social media site and the spreadsheet extraction site for more media. We also have a video of our data transformation system at work here.
Data	A lot of our work over the course of this project has thrown off datasets that are suitable for use by other researchers. Here is the data we have collected so far. Scientific Diagrams -- We obtained 300,000 diagrams from over 150,000 sources that we crawled online. Unfortunately, we believe we do not have the legal right to distribute the extracted diagrams themselves. (Live search engines are relatively free to summarize content as needed to present results.) However, you can download the URLs where we obtained diagrams here. Spreadsheet Extraction -- We have collected several spreadsheet corpora for testing the extraction system. The first is the corpus of spreadsheets associated with the Statistical Abstract of the United States. As a series of US government publications, there are no copyright issues involved in redistributing it. You can download our crawled corpus of SAUS data here. We have also crawled the Web to discover a large number of spreadsheets posted online. Unfortunately, copyright issues apply here, too. So here, too, we have assembled a downloadable list of the URLs where we found spreadsheets online. This should enable other researchers to find the resources we used and duplicate our results. Economic Data -- Our economics website also has historical social media signals available. Available at http://econprediction.eecs.umich.edu. Data Transformation Programs -- The nascent field of synthesizing programs requires good benchmark sets. Our work in data transformation program synthesis can help. Code for our system and the benchmarks can be found at https://github.com/umich-dbgroup/foofah.
Demos	Spreadsheet Extraction -- Video of our system available at http://www.eecs.umich.edu/db/sheets/index.html. Economic Prediction -- Weekly unemployment predictions at http://econprediction.eecs.umich.edu. Data Transformation SynthesisVideo of our data transformation synthesis system can be seen at https://www.youtube.com/watch?v=Ura2pxez_Bo.
Software downloads	You can find code for the data transformation system here
Patents	None
Broader Impacts	The spreadsheet work has been commercialized as part of the Tableau Software big data visualization suite. The economics data is used by Wall Street banks and federal economics departments.
Educational Material	None
Highlights and Press	Past coverage of the economics work: Washington Post's Wonkblog: Twitter is surprisingly accurate at predicting unemployment Slate: Can Twitter Predict Economic Data? The Boston Globe: Can Twitter Predict the Economy? The Wall Street Journal: Using Twitter to Forecast New Applications for Unemployment The Economist: Separating Tweet from Chaff
Point of Contact	Michael Cafarella
Last Updated	March, 2018