Building and Searching a Structured Web Database

Mike Cafarella
Assistant Professor
Computer Science and Engineering
2260 Hayward St.
University of Michigan
Ann Arbor, MI 48109-2121

Office: 4709 Beyster
Phone: 734-764-9418
Fax: 734-763-8094
Send email to me at michjc, found at umich dot edu

Building and Searching a Structured Web Database

This material describes work supported by the National Science Foundation under Grant No. IIS 1054913, Building and Searching a Structured Web Database.

A portion of this work has also been supported by an award by the Dow Chemical Company.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Award Number	IIS 1054913
Duration	Five years
Award Title	CAREER: Building and Searching a Structured Web Database
PI	Michael Cafarella
Students	Shirley Zhe Chen, Matthew Burgess
Collaborators	Prof. Eytan Adar, for the work with scientific diagrams.
Project Goals	A major goal of the first year of this award has been to extract and obtain different structured Web datasets for use in our structured search tool. We focused on two main datasets. Scientific Diagrams -- Our first project focused on structured scientific diagrams. Scientific diagrams are interesting structured objects that are often a reader's only access to the data that undergirds a research paper. Existing image search engines are designed primarily for photos or illustrations, so they do not exploit the comparatively small image differences that mark one diagram as more relevant than another. For example, diagrams include textual annotations in the title, legend, axis labels. The graphical marks themselves carry meaning through mechanisms other than the color, shade, and shape associated with most online images. The goal of this project is to automatically extract and create a scientific diagrams, along with extracted metadata about each one. This would be a unique collection of data that forms a high-value target for a structured data search engine. Spreadsheet Extraction -- Our second project has focused on extracting relational-style data from spreadsheets. Spreadsheets are an important, but often overlooked, part of data management. Spreadsheets are very popular and hold a lot of important data that often follows the relational model. However, any metadata in a spreadsheet is implicit, making them a challenge for long-term data management. For example, a single spreadsheet's embedded data cannot be integrated with other sources using standard tools that assume relational schemas. Further, a collection of spreadsheets may together make up an interesting relational database, but the lack of explicit metadata means there is no way to process a relational query. For example, consider an engineering company that has accumulated tens of thousands of spreadsheets describing experimental results; they cannot ask even basic queries, such as "Has experiment X ever been run?" Our system automatically transforms spreadsheets into their relational equivalent so they can be queried as if the data were simply contained in a traditional database. In case of errors, users can manually repair the extracted data. Effectively converting spreadsheet data to relational form would allow serve as a novel source of structured data for a structured data search engine. Indeed, we have constructed a prototype search engine that currently targets only spreadsheet-extracted data, but we are working to extend to any relational-style dataset.
Research Challenges	The research challenges for each project are as follows: Scientific Diagrams Diagram Corpus Extraction -- In contrast to HTML files, diagrams require special processing. Diagrams are only semi-explicitly declared in PDF files and must be distinguished from surrounding document text, tables, and other material. Text in one location (such as the axis label) has quite a different meaning from text in another location (such as the caption). Also, because diagrams are often an integral part of a highly-engineered document, they can have extensive ``implicit hyperlinks'' in the form of figure references from the body of the surrounding text. Ranking Quality -- All search engines must figure out how to score an object's relevance to a search query, but scoring diagrams is surprisingly difficult. One main reason is the paucity of text and the strong role that position plays in the meaning of different diagram components Snippet Generation -- For traditional search engines, small summaries of the searched-for content, usually called snippets, allow users to quickly scan a large number of results before actually selecting one. Conventional search engines select regions of text from the original documents, while image search engines generally scale down the original image to a small thumbnail. Neither technique can be directly applied to data-driven diagrams, either because there is so little text, or because the image makes no sense at smaller scales. Spreadsheet Extraction Spreadsheet Metadata Hierarchy -- Spreadsheets contain complicated data warehouse-style metadata that describes various potential sensible aggregations for the data. For example, a spreadsheet might indicate data that matches a hierarchy of values for ethnicity, age range, and gender. This metadata is implicitly described, but is critical for understanding the spreadsheet data's relational model. Brittle Extraction -- Unlike many information extraction projects, the output of metadata recovery is used as a data input to a secondary stage (that of recovering the actual spreadsheet contents). Even a single flaw in metadata recovery can be catastrophic for downstream processing, so some amount of direct human supervision for each task is unavoidable. However, the diversity of spreadsheets is so large that humans will never be able to spend more than a tiny amount of time repairing each extracted spreadsheet.
Current results	Research results so far are as follows: Scientific Diagrams A working search engine that searches 319k diagrams crawled from the Web. Novel techniques for performing diagram metadata extraction, diagram relevance ranking and snippet generation. As far as we know, we are the first to propose a diagram metadata extractor working on a large scale of PDFs across the web. User-study experiments that show our search system obtains a 52% improvement in search quality (as measured by mean reciprocal rank) over naive approaches. Further, we show that our system's hybrid snippet generator allows users to find results 33% more accurately than with a standard image-driven snippet. An exploration of the diagram metadata that shows the most popular diagram types and axis labels. We also show sample results from a diagram-specific form of similarity search enabled by this data. Spreadsheet Extraction Creation of two separate spreadsheet corpora: one based on the Statistical Abstract of the United States, and another based on a large general Web crawl. These data will serve as a general purpose dataset for many researchers. A statistical mechanism for automatically converting spreadsheet data into a high-quality relation. In order to obtain a perfect relational database, this process requires, on average, about 5.4 explicit user repairs per spreadsheet. A technique for exploiting metadata duplication among spreadsheets in order to amortize user repairs over a much larger set of spreadsheets. The result is fewer explicit user repairs per spreadsheet, on average. (We are in the midst of quantifying the size of this benefit right now.) A prototype structured search system that uses this extracted spreadsheet data. It enables a user to search through a large collection of spreadsheet-derived databases using text, as with a Web search engine. The user can then pick a good "hit" and further analyze the data using tools designed for structured data. In the coming year, we will extend this system to offer more structured operators (such as join and union). We will also include structured datasets beyond the extracted spreadsheet data.
Publications	All of this work is either under submission or being prepared for submission.
Presentations	Searching for Statistical Diagrams, presented at 2011 US Frontiers of Engineering event in September, 2011. Searching for Statistical Diagrams, presented at Brigham Young University in November, 2011.
Images/Videos	No videos currently. Images will appear as part of publications.
Data	A lot of our work throws off datasets that are suitable for use by other researchers. Here is the data we have collected so far. Scientific Diagrams -- We obtained 300,000 diagrams from over 150,000 sources that we crawled online. Unfortunately, we believe we do not have the legal right to distribute the extracted diagrams themselves. (Live search engines are relatively free to summarize content as needed to present results.) However, you can download the URLs where we obtained diagrams here. Spreadsheet Extraction -- We have collected several spreadsheet corpora for testing the extraction system. The first is the corpus of spreadsheets associated with the Statistical Abstract of the United States. As a series of US government publications, there are no copyright issues involved in redistributing it. You can download our crawled corpus of SAUS data here. We have also crawled the Web to discover a large number of spreadsheets posted online. Unfortunately, copyright issues apply here, too. So here, too, we have assembled a downloadable list of the URLs where we found spreadsheets online. This should enable other researchers to find the resources we used and duplicate our results.
Demos	Scientific Diagrams -- You can use our dedicated diagram search engine, DiagramFlyer, here. Spreadsheet Extraction -- You can use the prototype spreadsheet data search system, SheetsNest, here.
Software downloads	Our work so far involves fairly elaborate installations, so we have made the systems available for online use rather than download.
Patents	None
Broader Impacts	The PI discussed part of this work as part of a short class presented to high school students at Manchester High School in May, 2012. It had five components: PageRank and the Web Introduction to Data Mining Introduction to Information Extraction and Tables Time-Series Extraction from Social Media Sources Spreadsheet Extraction
Educational Material	None yet, but this will be distributed as the research results are integrated into curricula in academic year 2012-2013.
Highlights and Press	Work on diagram extraction featured in 2011 US Frontiers of Engineering Symposium, held by National Academy of Engineering.
Point of Contact	Michael Cafarella
Last Updated	July, 2012