ACSDb Download
The ACSDb is a dataset discussed in several papers I helped to write:
- WebTables: Exploring the Power of Tables on the Web. Michael J. Cafarella, Alon Halevy, Yang Zhang, Daisy Zhe Wang, Eugene Wu. Proceedings of VLDB 2008, August 2008. Auckland, New Zealand.
- Uncovering the Relational Web. Michael J. Cafarella, Alon Halevy, Yang Zhang, Daisy Zhe Wang, Eugene Wu. Proceedings of the Eleventh International Workshop on the Web and Databases (WebDB), June 2008. Vancouver, Canada.
- Structured Data on the Web. Michael J. Cafarella, Alon Halevy, and Jayant Madhavan. Communications of the ACM 54(2): 72-79, 2011.
It contains the schema information derived from many millions of structured data tables we recovered from a large general web crawl. This work was done while I was at Google; Google has since released this data for researchers. You can download it here.
It is a single compressed text file. Here are two sample lines:
combo_make_model_year = 13
single_make = 3068
The first line indicates that a schema with exactly three elements (make, model, year) was seen in 13 different tables. The second line indicates that the attribute make was seen in 3068 different tables. The prefix combo or single indicates whether the line describes an entire schema or just the count for a single attribute. Attribute labels are separated by underscores. The right-hand side of the equals sign is always an integer.
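The line format described above can be read with a few lines of Python. This is only a minimal sketch, not an official parser: the names ACSDbEntry and parse_acsdb_line are made up for illustration, and it assumes that no attribute label in a combo line contains an underscore of its own, which the description above does not guarantee.

```python
from typing import NamedTuple, Tuple


class ACSDbEntry(NamedTuple):
    """One parsed line (hypothetical helper type, not part of the dataset)."""
    kind: str                    # "combo" (whole schema) or "single" (one attribute)
    attributes: Tuple[str, ...]  # attribute labels in the order they appear
    count: int                   # the integer to the right of the equals sign


def parse_acsdb_line(line: str) -> ACSDbEntry:
    # Split on the last "=" so the integer count is isolated on the right.
    left, sep, right = line.rpartition("=")
    if not sep:
        raise ValueError(f"no '=' found in line: {line!r}")

    # The prefix is everything before the first underscore.
    kind, _, labels = left.strip().partition("_")
    if kind == "combo":
        # Assumption: labels never contain underscores themselves.
        attributes = tuple(labels.split("_"))
    elif kind == "single":
        # A single line names exactly one attribute, so keep it whole.
        attributes = (labels,)
    else:
        raise ValueError(f"unexpected prefix {kind!r} in line: {line!r}")

    # int() tolerates the surrounding whitespace and a trailing newline.
    return ACSDbEntry(kind, attributes, int(right))
```

If the download turns out to be gzip-compressed (an assumption; the compression format is not stated here), the lines could be fed to this function from Python's standard gzip.open(path, "rt").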
The data in this file is unique by domain, meaning that a single schema cannot be counted more than once from a single DNS domain, regardless of how many times a table with that schema appears at the domain.
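One way to picture the by-domain rule is the sketch below. It is not the code used to build the ACSDb. The function name, the (domain, schema) input format, and the example domains are all invented for illustration, and it assumes a schema is identified by its attribute labels in order.

```python
from collections import defaultdict


def count_schemas_by_domain(observations):
    """Count each schema at most once per DNS domain.

    `observations` is a hypothetical iterable of (domain, schema) pairs,
    one pair per table found in the crawl.
    """
    domains_per_schema = defaultdict(set)
    for domain, schema in observations:
        # A set ignores repeats, so many tables with the same schema
        # on one domain still contribute only one entry.
        domains_per_schema[tuple(schema)].add(domain)
    return {schema: len(domains) for schema, domains in domains_per_schema.items()}


# Made-up data: three tables share a schema, two of them on the same domain.
observations = [
    ("cars.example", ("make", "model", "year")),
    ("cars.example", ("make", "model", "year")),
    ("autos.example", ("make", "model", "year")),
]
counts = count_schemas_by_domain(observations)
# counts == {("make", "model", "year"): 2}
```

Here the schema is counted twice rather than three times, because two of the three tables come from the same domain.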
If you use this data in an academic publication, please cite the VLDB 2008 paper listed above:
- WebTables: Exploring the Power of Tables on the Web. Michael J. Cafarella, Alon Halevy, Yang Zhang, Daisy Zhe Wang, Eugene Wu. Proceedings of VLDB 2008, August 2008. Auckland, New Zealand.
Thanks to Google for releasing this data, and to my colleagues Alon Halevy, Daisy Wang, Eugene Wu, and Yang Zhang for collaborating on a great project.
Go to Michael Cafarella's homepage