Award Number | IIS 1054913 |
Duration | Five years |
Award Title | CAREER: Building and Searching a Structured Web Database |
PI | Michael Cafarella |
Students | Shirley Zhe Chen, Dolan Antenucci, Jun Chen, Guan Wang, Bochun Zhang. |
Collaborators | Prof. Eytan Adar, for the work with scientific diagrams in years 1 and 4; Professors Margaret Levenstein, Matthew Shapiro, and Christopher Re, for the work on nowcasting in years 2-4. |
Project Goals | Our initial research proposal focused on the following work for Year 4:
As mentioned in previous years' reports, the growth of high-quality structured online entity-centric datasets (such as Google’s Knowledge Graph) led us to change the focus of our work. We have been focusing on “somewhat structured” numeric datasets, such as data-driven diagrams and spreadsheet data. We have also added efforts to pursue "long-tail extractions" that are traditionally ill-served by Web extraction systems, to process statistical diagrams, and to add statistical extraction from social media. These datasets remain hard to obtain or manage, and so the core motives of our research plan continue. We have completed a tool for spreadsheet extraction (published at KDD 2014), have published extensive datasets for social media extraction (available at http://econprediction.eecs.umich.edu), have submitted a manuscript for long-tail extraction methods, and have completed initial work on a tool for managing statistical diagrams (published at WWW 2015).
Effort A is complete. Our work on "long-tail" extraction improves content-driven extraction, though not in the manner we originally anticipated. "Long-tail" extraction tasks are ones where many target extractions appear very rarely; consider the set of all camera manufacturers, rather than the set of all US states. Rather than laboriously constructing a large seed dataset, we created a low-overhead method for users to create synthetic seed sets. When extracting the top-100 items from a target set, our system obtains an accuracy improvement of 23% compared to competing systems. This work is under submission at TACL.
Effort B is complete in the spreadsheet and long-tail extraction domains and still ongoing in the scientific diagram domain. Our deployed spreadsheet/database search system uses a large number of features to identify when two spreadsheets are referring to the same entity. This process enables the system to obtain far higher-accuracy spreadsheet extractions than would otherwise be possible. (Described in the KDD 2014 paper.) We use "noisy reference matching" to identify positive and negative training examples in the long-tail extraction system. (Submitted to TACL 2015.) We have also begun to apply this work to the problem of searching and managing scientific diagrams. (Early work here is described in the WWW 2015 paper.)
Effort C is complete for the spreadsheet and social media projects, and will be complete pending publication of the long-tail extraction and scientific diagram projects. The spreadsheet code and data are now publicly available via http://wwweb.eecs.umich.edu/db/sheets/index.html. The social media website allows the user to download all datasets from http://econprediction.eecs.umich.edu/; unfortunately, the query traffic there has been too small to warrant release. We will release code and data for the long-tail and diagram search systems after all the relevant academic papers are complete.
|
Research Challenges and Results | Please see description in above text. |
Publications |
|
Presentations |
|
Images/Videos | Please see the economics social media site and the spreadsheet extraction site for more media. |
Data |
Much of our work produces datasets that are suitable for use by other researchers. Here is the data we have collected so far.
|
Demos |
|
Software downloads | Our work so far involves fairly elaborate installations, so we have made the systems available for online use rather than download. |
Patents | None |
Broader Impacts | The spreadsheet work has been commercialized as part of the Tableau Software big data visualization suite. This is a very important big data tool and a substantial accomplishment for the student. |
Educational Material | None |
Highlights and Press | Coverage of the economics work:
|
Point of Contact | Michael Cafarella |
Last Updated | May 2015 |