Building and Searching a Structured Web Database

Mike Cafarella
Assistant Professor
Computer Science and Engineering
2260 Hayward St.
University of Michigan
Ann Arbor, MI 48109-2121

Office: 4709 Beyster
Phone: 734-764-9418
Fax: 734-763-8094
Send email to me at michjc, found at umich dot edu

This material describes work supported by the National Science Foundation under Grant No. IIS 1054913, Building and Searching a Structured Web Database.

This document describes work done in the fourth year of the grant, 2014-2015. Earlier versions of this document cover year 1 (from June 2012), year 2 (from June 2013), and year 3 (from June 2014).

A portion of this work has also been supported by an award from the Dow Chemical Company.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Award Number	IIS 1054913
Duration	Five years
Award Title	CAREER: Building and Searching a Structured Web Database
PI	Michael Cafarella
Students	Shirley Zhe Chen, Dolan Antenucci, Jun Chen, Guan Wang, Bochun Zhang.
Collaborators	Prof. Eytan Adar, for the work with scientific diagrams in years 1 and 4. Professors Margaret Levenstein, Matthew Shapiro, and Christopher Re, for the work on nowcasting in years 2-4.
Project Goals	Our initial research proposal focused on the following work for Year 4:
A. Data Extraction: Improve content-driven extraction with large seed set
B. Search System: Query-time reference reconciliation
C. Broader Impacts: Release search data and cost-based extraction tools
As mentioned in previous years' reports, the growth of high-quality structured online entity-centric datasets (such as Google's Knowledge Graph) led us to change the focus of our work. We have been focusing on "somewhat structured" numeric datasets, such as data-driven diagrams and spreadsheet data. We have also added efforts to pursue "long-tail extractions" that are traditionally ill-served by Web extraction systems, to process statistical diagrams, and to add statistical extraction from social media. These datasets remain hard to obtain or manage, and so the core motives of our research plan continue. We have completed a tool for spreadsheet extraction (published at KDD 2014), have published extensive datasets for social media extraction (available at http://econprediction.eecs.umich.edu), have submitted a manuscript on long-tail extraction methods, and have completed initial work on a tool for managing statistical diagrams (published at WWW 2015).
Effort A is complete. Our work on "long-tail" extraction improves content-driven extraction, though not in the manner we originally anticipated. "Long-tail" extraction tasks are ones where many target extractions appear very rarely; consider the set of all camera manufacturers, rather than the set of all US states. Rather than laboriously constructing a large seed dataset, we created a low-overhead method for users to create synthetic seed sets. When extracting the top-100 items from a target set, our system obtains an accuracy improvement of 23% compared to competing systems. This work is under submission to TACL.
Effort B is complete in the spreadsheet and long-tail extraction domains and is still ongoing in the scientific diagram domain. Our deployed spreadsheet/database search system uses a large number of features to identify when two spreadsheets refer to the same entity. This process enables the system to obtain far higher-accuracy spreadsheet extractions than would otherwise be possible (described in the KDD 2014 paper). We use "noisy reference matching" to identify positive and negative training examples in the long-tail extraction system (submitted to TACL 2015). We have also begun to apply this work to the problem of searching and managing scientific diagrams (early work described in the WWW 2015 paper).
Effort C is complete for the spreadsheet and social media projects, and will be complete pending publication of the long-tail extraction and scientific diagram projects. The spreadsheet code and data are now publicly available via http://wwweb.eecs.umich.edu/db/sheets/index.html. The social media website allows the user to download all datasets from http://econprediction.eecs.umich.edu/; unfortunately, the query traffic there has been too small to warrant release. We will release code and data for the long-tail and diagram search systems after all the relevant academic papers are complete.
Research Challenges and Results	Please see description in above text.
Publications	Zhe Chen, Michael Cafarella, H.V. Jagadish: Long-tail Vocabulary Dictionary Extraction from the Web. Submitted to TACL 2015.
Zhe Chen, Eytan Adar, Michael Cafarella: DiagramFlyer: A Search Engine for Data-Driven Diagrams. Published at WWW 2015 Demonstration Track.
Zhe Chen, Michael Cafarella: Integrating Social Data via Low-Effort Spreadsheet Extraction. Knowledge Discovery and Data Mining 2014.
Dolan Antenucci, Michael Cafarella, Margaret Levenstein, Christopher Re, Matthew Shapiro: Using Social Media to Measure Labor Market Flows. Submitted to American Economic Review, 2014.
Presentations	Integrating Social Data via Low-Effort Spreadsheet Extraction, presented at KDD 2014
Using Social Media to Measure Labor Market Flows, presented at NSF Census event, May 2015
DiagramFlyer: A Search Engine for Data-Driven Diagrams, presented at WWW 2015, May 2015
Images/Videos	Please see the economics social media site and the spreadsheet extraction site for more media.
Data	Much of our work produces datasets that are suitable for use by other researchers. Here is the data we have collected so far.
Scientific Diagrams -- We obtained 300,000 diagrams from over 150,000 sources that we crawled online. Unfortunately, we believe we do not have the legal right to distribute the extracted diagrams themselves. (Live search engines are relatively free to summarize content as needed to present results.) However, the list of URLs where we obtained the diagrams is available for download.
Spreadsheet Extraction -- We have collected several spreadsheet corpora for testing the extraction system. The first is the corpus of spreadsheets associated with the Statistical Abstract of the United States (SAUS). Because it is a series of US government publications, there are no copyright issues involved in redistributing it, and our crawled corpus of SAUS data is available for download. We have also crawled the Web to discover a large number of spreadsheets posted online. Unfortunately, copyright issues apply to these spreadsheets, so here, too, we have assembled a downloadable list of the URLs where we found them. This should enable other researchers to find the resources we used and reproduce our results.
Economic Data -- Our economics website also makes historical social media signals available at http://econprediction.eecs.umich.edu.
Demos	Spreadsheet Extraction -- Video of our system available at http://www.eecs.umich.edu/db/sheets/index.html.
Economic Prediction -- Weekly unemployment predictions at http://econprediction.eecs.umich.edu.
Software downloads	Our work so far involves fairly elaborate installations, so we have made the systems available for online use rather than download.
Patents	None
Broader Impacts	The spreadsheet work has been commercialized as part of the Tableau Software big data visualization suite. This is a very important big data tool and a substantial accomplishment for the student.
Educational Material	None
Highlights and Press	Coverage of the economics work:
Washington Post's Wonkblog: Twitter is surprisingly accurate at predicting unemployment
Slate: Can Twitter Predict Economic Data?
The Boston Globe: Can Twitter Predict the Economy?
The Wall Street Journal: Using Twitter to Forecast New Applications for Unemployment
The Economist: Separating Tweet from Chaff
Point of Contact	Michael Cafarella
Last Updated	May, 2015