Mike Cafarella
Associate Professor
Computer Science and Engineering
2260 Hayward St.
University of Michigan
Ann Arbor, MI 48109-2121

Office: 4709 Beyster
Phone: 734-764-9418
Fax: 734-763-8094
Send email to me at michjc, found at umich dot edu

Building and Searching a Structured Web Database

This material describes work supported by the National Science Foundation under Grant No. IIS 1054913, Building and Searching a Structured Web Database.

This document describes work done in the fifth year of the grant, 2015-2016. To see the version of this document that covers year 1 (from June, 2012), click here. To see the version of this document that covers year 2 (from June, 2013), click here. To see the version of this document that covers year 3 (from June, 2014), click here. To see the version of this document that covers year 4 (from May, 2015), click here.

A portion of this work has also been supported by an award by the Dow Chemical Company, by Google, by DARPA, and by other NSF grants.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Award Number IIS 1054913
Duration Five years
Award Title CAREER: Building and Searching a Structured Web Database
PI Michael Cafarella
Students Shirley Zhe Chen, Dolan Antenucci, Jun Chen, Guan Wang, Bochun Zhang.
Collaborators Prof. Eytan Adar, for the work on scientific diagrams in year 1 and years 4 and 5; Professors Margaret Levenstein, Matthew Shapiro, and Christopher Re for the work on nowcasting in years 2-5.
Project Goals Our initial research proposal focused on the following work for Year 5:
  • A: Data Extraction: Run combined content-human extraction cycle to extract entire Web dataset
  • B: Search System: Optimize keywords and result presentation using additional query data
  • C: Broader Impacts: Final release of query data, extracted Web dataset, and search engine

As mentioned in previous years' reports, the growth of high-quality structured online entity-centric datasets (such as Google’s Knowledge Graph) led us to change the focus of our work several years into the project. We have been focusing on “somewhat structured” numeric datasets, such as data-driven diagrams and spreadsheet data. We have also added efforts to pursue "long-tail" extractions that are traditionally ill-served by Web extraction systems, to process statistical diagrams, and to perform statistical extraction from social media. These datasets remain hard to obtain and manage, so the core motivations of our research plan still apply.

We have completed a tool for spreadsheet extraction (published at KDD 2014) and a user-facing tool for social media extraction (published at ICDE 2016), built a prototype system for feature engineering infrastructure that supports extraction systems (published at ICDE 2016), built a prototype system for long-tail extraction (published at WSDM 2016), and published extensive datasets for social media extraction (available at http://econprediction.eecs.umich.edu).

Effort A is complete. Our work on "long-tail" extraction (published at WSDM 2016) improves content-driven extraction. "Long-tail" extraction tasks are ones in which many of the target extractions appear very rarely; consider the set of all camera manufacturers, rather than the set of all US states. Rather than forcing human developers to construct a large training set by hand, we created a low-overhead method for users to create synthetic "seed" sets. This method yields a 17% improvement over competing page-specific extraction methods. Because it uses very modest amounts of human-generated labels, the approach scales to Web-sized corpora.

Effort B was previously complete in the spreadsheet and long-tail extraction domains and is now complete for the nowcasting domain. We published details of our query-driven demonstration system for nowcasting extraction at ICDE 2016. The backend infrastructure required by that user-facing component was published in PVLDB in 2016.

Effort C is complete for all projects except the social media query data. All code and data for spreadsheets are now publicly available via http://wwweb.eecs.umich.edu/db/sheets/index.html. The social media website allows users to download all datasets from http://econprediction.eecs.umich.edu/; unfortunately, query traffic there has been too small to warrant releasing the query data.

Research Challenges and Results Three major lines of work were performed in the last period with support from this grant:
  • Systems for extracting web data, including information from long-tail websites and web-published files such as spreadsheets. (In past reports, we described our progress on the spreadsheet extraction problem and our posting of datasets for download at http://dbgroup.eecs.umich.edu/sheets/index.html.)

    Our most recent work is a system for “Long-tail Vocabulary Dictionary Extraction from the Web”. A paper with that title described a system that attempts to extract complete dictionaries from the Web with little user effort. A dictionary — a set of instances belonging to the same conceptual class — is central to information extraction and is a useful primitive for many applications, including query log analysis and document categorization. Considerable work has focused on generating accurate dictionaries given a few example seeds, but methods to date cannot obtain long-tail (rare) items with high accuracy and recall.

    We developed a novel method to construct high-quality dictionaries, especially for long-tail vocabularies, using just a few user-provided seeds for each topic. Our algorithm obtains long-tail (i.e., rare) items by building and executing high-quality webpage-specific extractors. We use webpage-specific structural and textual information to build more accurate per-page extractors in order to detect the long-tail items from a single webpage. These webpage-specific extractors are obtained via a co-training procedure using distantly-supervised training data. By aggregating the page-specific dictionaries of many webpages, the system is able to output a high-quality comprehensive dictionary.
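
    To make the co-training step concrete, the following is a minimal, illustrative Python sketch of the general idea rather than the published implementation: candidates from one page are distantly labeled with the user's seeds, a "structural" classifier and a "textual" classifier then exchange their most confident predictions through a shared labeled pool, and the resulting page-level dictionaries are merged by cross-page agreement. The feature dictionaries, negative-sampling shortcut, and thresholds are assumptions made purely for illustration.

      from collections import Counter
      from sklearn.feature_extraction import DictVectorizer
      from sklearn.linear_model import LogisticRegression

      def cotrain_page(candidates, seeds, rounds=3, conf=0.9):
          """candidates: list of (text, structural feature dict, textual feature dict)."""
          texts = [c[0] for c in candidates]
          # Two "views" of each candidate: structural cues (e.g., DOM path) and textual cues.
          views = [DictVectorizer().fit_transform([c[1] for c in candidates]),
                   DictVectorizer().fit_transform([c[2] for c in candidates])]
          # Distant supervision: candidates matching a seed start out as positives.
          labels = {i: 1 for i, t in enumerate(texts) if t.lower() in seeds}
          # Crude stand-in for negative examples (an assumption for this sketch).
          for i in list(set(range(len(texts))) - set(labels))[:5]:
              labels[i] = 0
          for _ in range(rounds):
              for v in (0, 1):
                  idx = sorted(labels)
                  if len({labels[i] for i in idx}) < 2:
                      break  # need both classes to train
                  clf = LogisticRegression(max_iter=1000)
                  clf.fit(views[v][idx], [labels[i] for i in idx])
                  probs = clf.predict_proba(views[v])[:, 1]
                  for i, p in enumerate(probs):
                      # Confident predictions from one view enter the shared pool,
                      # so the other view benefits on the next pass.
                      if i not in labels and (p > conf or p < 1 - conf):
                          labels[i] = int(p > conf)
          return {texts[i] for i, y in labels.items() if y == 1}

      def aggregate_dictionary(pages, seeds, min_pages=2):
          """Merge page-specific dictionaries, keeping items that multiple pages agree on."""
          counts = Counter(item for page in pages for item in cotrain_page(page, seeds))
          return {item for item, n in counts.items() if n >= min_pages}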

    We evaluated the method on 11 distinct dictionary categories ranging from “small” to “large”, including country, disease, mlb-team, nba-team, us-president, camera-maker, mattress-maker, and others. We implemented the system in Python and Java, using a range of relevant machine learning libraries. We ran it over a set of roughly 450 Web pages focused on these topics, with no prior cleaning or normalization.

    We showed that in long-tail vocabulary settings where the user input is small (and thus low effort for the user), our system achieves a 17.3% improvement in mean average precision for the dictionary generation process and a 30.7% improvement in F1 for the page-specific extractors, compared to state-of-the-art methods such as SEAL.

    In addition to better pure F1 results, our mechanism enables dynamic adjustment of the “granularity” of extraction results. The granularity problem arises when the user gives a very small amount of input to describe a target extraction goal. With just a few user-given seeds, the extraction target may be ambiguous. For example, given the seeds “Atlanta Braves” and “Chicago Cubs”, it is not clear what the ideal target is: the user may intend to get all MLB teams or all U.S. sports teams. After building a user’s target extracted set, our algorithm allows the user to dynamically turn a “knob” and get smaller, tighter dictionaries (e.g., National League Major League Baseball teams) or larger, more expansive dictionaries (e.g., all sports teams).
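
    As a rough illustration of the knob (not the actual mechanism from the paper), one can think of each extracted item as carrying an aggregate confidence score, with the knob acting as a cutoff that can be moved after extraction without re-running the pipeline; the scores below are invented for the example.

      def dictionary_at(scored_items, knob):
          """scored_items: item -> aggregate confidence in [0, 1]; knob in [0, 1].
          A high knob yields a tight dictionary, a low knob an expansive one."""
          return {item for item, score in scored_items.items() if score >= knob}

      scores = {"Atlanta Braves": 0.95, "Chicago Cubs": 0.93,
                "New York Yankees": 0.78, "Dallas Cowboys": 0.31}
      print(dictionary_at(scores, 0.90))  # tight: only the seeds' closest National League matches
      print(dictionary_at(scores, 0.20))  # expansive: every sports team in the candidate pool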

    We also started work on extracting data from web-embedded plots and images, described in “DiagramFlyer: A Search Engine for Data-Driven Diagrams.” This work is just getting started, but it suggests a future in which experimental datasets can be recovered from images embedded in scientific or governmental reports online.

  • Development tools for building trained systems, including extraction systems, but applicable to any supervised machine learning training task.

    Our most recent work is a tool for enabling easier feature engineering. Features are signals extracted from the data that distill complicated raw data objects into a small number of salient values. A trained system’s success depends substantially on the quality of its features.

    Unfortunately, feature engineering—the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm—is a tedious, time-consuming experience. Because “big data” inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes. Because the inputs are so large, each code change can involve a time-consuming data processing task (over each page in a Web crawl, for example). We built a data-centric system that accelerates feature engineering through intelligent input selection, optimizing the “inner loop” of the feature engineering process. This work was described in the paper “Input Selection for Fast Feature Engineering”.
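
    The published system is a Java/Weka application (described below), but the core idea can be pictured with a simplified Python sketch: before recomputing a candidate feature over the full corpus, estimate its value on a small, informative subset of inputs. The subset-selection heuristic, function names, and sizes below are assumptions for illustration, not the algorithm from the paper.

      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      def select_inputs(model, X, k=500):
          """Pick the k inputs the current model is least certain about."""
          margin = np.abs(model.decision_function(X))
          return np.argsort(margin)[:k]

      def estimate_feature_gain(featurize, raw_inputs, X, y, model):
          """Cheaply estimate a new feature's value before a full-corpus run.
          X, y are numpy arrays; model is a classifier already fitted on X."""
          idx = select_inputs(model, X, k=min(500, len(y)))
          new_col = np.array([featurize(raw_inputs[i]) for i in idx], dtype=float).reshape(-1, 1)
          base = cross_val_score(LogisticRegression(max_iter=1000), X[idx], y[idx], cv=3).mean()
          richer = cross_val_score(LogisticRegression(max_iter=1000),
                                   np.hstack([X[idx], new_col]), y[idx], cv=3).mean()
          return richer - base  # positive: worth running over the whole corpus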

    This work has a strong intellectual synergy with the “Long-Tail” activity described above. The feature engineering work was motivated by a need for lower-cost and higher-quality features for extraction tasks.

    We implemented a novel feature evaluation system and a set of test machine learning tasks (using Weka) as a Java application of about 17,500 lines of code. This prototype takes the role of Spark or MapReduce in a feature engineering pipeline. We deployed the system on an Amazon EC2 r3.xlarge instance with 30 GB of RAM. Compared to conventional methods, it can speed up the feature evaluation loop by up to 8x in some settings and has reduced engineer wait times from 8 hours to 5 hours in others. On the extremely common multiway document classification task, our system accelerated feature evaluation from an average of 28.3 minutes to just 5.6 minutes. We tested the method on a range of machine learning tasks, including common ones such as document classification and linear regression, and our system was effective on all of them.

    Indeed, our system can be effective even when the machine learning “statistical” component is massively more expensive than is commonplace (or, put another way, when the featurization task that we focus on is comparatively unimportant). This situation can arise when using deep learning or LSTM-style learning architectures. Even when the learning procedure is up to 50x more expensive than is commonplace, our data system remains a useful addition to the machine learning developer’s toolstack.

  • Systems for extracting information from social media data. Unlike the web pages, spreadsheets, and data-driven images described above, social media extraction rarely entails finding a single piece of data in a single message online. Rather, the user wants to extract a time-varying signal suggested by a large collection of social media messages. This task is sometimes called “nowcasting” in the literature. We described an overall application for supporting this type of extraction in “A Query System for Social Media Signals.” Unfortunately, that application required very long waits for the user; we described a solution for efficient execution of these extraction tasks in “A Declarative Query Processing System for Nowcasting.”

    We focused on two goals: removing the need for standard supervision data when extracting data from social media, and improving runtime performance. We built both an interactive user tool and a query system that supports efficient execution of that interactive cycle. These tools do not require any conventional supervision data, relying instead on interactive exploration with the user. Specifically, our system exploits a user-provided multi-part query consisting of semantic and signal components. The user can explore in real time the tradeoff between these two components to find the social media messages most useful for extracting a target signal. Our system lets users search for signals within a large Twitter corpus using a dynamic web-based interface. Additionally, users of our system can share results with the general public, review and comment on others’ shared results, and clone these results as starting points for further exploration and querying.
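
    A toy Python sketch of the two-part query follows; it illustrates the concept rather than the system's actual scoring functions. Each candidate n-gram carries a text form and a frequency time series; candidates are ranked by a weighted blend of topical relatedness and correlation with the user's example signal, and the blending weight is the knob the user moves interactively.

      import numpy as np

      def semantic_score(ngram, topic_terms):
          """Crude lexical-overlap stand-in for a real semantic similarity measure."""
          tokens = set(ngram.split())
          return len(tokens & set(topic_terms)) / max(len(tokens), 1)

      def signal_score(series, target):
          """Pearson correlation between a candidate's frequency series and the target."""
          return float(np.corrcoef(series, target)[0, 1])

      def rank_candidates(candidates, topic_terms, target, alpha=0.5, k=10):
          """candidates: n-gram -> frequency series aligned with target; alpha is the knob."""
          scored = sorted(((alpha * semantic_score(g, topic_terms)
                            + (1 - alpha) * signal_score(s, target), g)
                           for g, s in candidates.items()), reverse=True)
          return [g for _, g in scored[:k]]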

    In order to support efficient execution, we created a method for declaratively specifying a social media signal extraction model. Processing such a query naively over a very large social media database can take hours, and because constructing signal extraction models is a human-in-the-loop process, slow runtimes place an extreme burden on the user. Thus we also built a novel set of query optimization techniques, which allow users to quickly construct models over very large datasets. Further, we built a novel query quality alarm that helps users estimate phenomena even when historical ground truth data is not available. These contributions allowed us to build a declarative signal extraction data management system, called RaccoonDB, which yields high-quality results in interactive time. RaccoonDB uses a set of optimization methods, some previously known but never applied successfully in this setting. These include approximate candidate pruning methods that could also be broadly useful in general feature selection workloads.
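
    One of the pruning ideas can be illustrated with a short Python sketch; this shows the general technique, not RaccoonDB's actual optimizer. Every candidate is scored cheaply on a coarse, downsampled version of its time series, only a small fraction is kept, and exact correlations are computed for the survivors alone. The downsampling factor and keep fraction are arbitrary choices for this example.

      import numpy as np

      def coarsen(series, factor=8):
          """Average consecutive blocks to obtain a much shorter series."""
          series = np.asarray(series, dtype=float)
          n = (len(series) // factor) * factor
          return series[:n].reshape(-1, factor).mean(axis=1)

      def prune_then_rank(candidates, target, keep=0.05, k=10):
          """candidates: n-gram -> frequency series aligned with target."""
          coarse_target = coarsen(target)
          # Cheap approximate pass over every candidate.
          approx = {g: abs(np.corrcoef(coarsen(s), coarse_target)[0, 1])
                    for g, s in candidates.items()}
          n_keep = max(1, int(keep * len(approx)))
          survivors = sorted(approx, key=approx.get, reverse=True)[:n_keep]
          # Exact pass only over the surviving candidates.
          exact = {g: np.corrcoef(candidates[g], target)[0, 1] for g in survivors}
          return sorted(exact, key=exact.get, reverse=True)[:k]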

    We evaluated RaccoonDB using 40 billion tweets collected over five years. We showed that our automated system saves work over traditional manual approaches while improving result quality—57% more accurate in our user study—and that its query optimizations yield a 424x speedup, allowing it to process queries 123x faster than a 300-core Spark cluster, using only 10% of the computational resources.

    This result is tremendously useful in building a practical social media signal extraction tool. We are in the process of deploying the resulting query system in a setting that can be accessed by general-purpose users in the social sciences and other non-computational professions.

Publications
  • Dolan Antenucci and Michael R. Anderson and Penghua Zhao and Michael Cafarella: A Query System for Social Media Signals. ICDE 2016.
  • Dolan Antenucci and Michael R. Anderson and Michael Cafarella: A Declarative Query Processing System for Nowcasting. PVLDB 10(3), 2016.
  • Michael R. Anderson and Michael Cafarella: Input Selection for Fast Feature Engineering. ICDE 2016.
  • Zhe Chen and Michael Cafarella and H.V. Jagadish: Long-tail Vocabulary Dictionary Extraction from the Web. WSDM 2016.
  • Zhe Chen and Michael Cafarella and Eytan Adar: DiagramFlyer: A Search Engine for Data-Driven Diagrams. WWW 2015.
Presentations
  • Using Social Media to Measure Labor Market Flows, presented at NSF Census event, May 2015
  • DiagramFlyer: A Search Engine for Data-Driven Diagrams, presented at WWW 2015, May 2015
  • A Query System for Social Media Signals, presented at ICDE 2016.
  • Input Selection for Fast Feature Engineering, presented at ICDE 2016.
  • Long-tail Vocabulary Dictionary Extraction from the Web, presented at WSDM 2016.
Images/Videos Please see the economics social media site and the spreadsheet extraction site for more media.
Data Much of our work produces datasets that are suitable for use by other researchers. Here is the data we have collected so far.

  1. Scientific Diagrams -- We obtained 300,000 diagrams from over 150,000 sources that we crawled online. Unfortunately, we believe we do not have the legal right to distribute the extracted diagrams themselves. (Live search engines are relatively free to summarize content as needed to present results.) However, you can download the URLs where we obtained diagrams here.
  2. Spreadsheet Extraction -- We have collected several spreadsheet corpora for testing the extraction system.

    The first is the corpus of spreadsheets associated with the Statistical Abstract of the United States (SAUS). Because these are US government publications, there are no copyright issues involved in redistributing them. You can download our crawled corpus of SAUS data here.

    We have also crawled the Web to discover a large number of spreadsheets posted online. Unfortunately, copyright issues apply here as well, so we have instead assembled a downloadable list of the URLs where we found spreadsheets online. This should enable other researchers to find the resources we used and reproduce our results.

  3. Economic Data -- Our economics website also makes historical social media signals available at http://econprediction.eecs.umich.edu.
Demos
  1. Spreadsheet Extraction -- Video of our system available at http://www.eecs.umich.edu/db/sheets/index.html.
  2. Economic Prediction -- Weekly unemployment predictions at http://econprediction.eecs.umich.edu.
Software downloads Our work so far involves fairly elaborate installations, so we have made the systems available for online use rather than download.
Patents None
Broader Impacts The spreadsheet work has been commercialized as part of the Tableau Software big data visualization suite. This is a very important big data tool and a substantial accomplishment for the student.
Educational Material None
Highlights and Press Coverage of the economics work:
Point of Contact Michael Cafarella
Last Updated November, 2016