ACSDb Download

The ACSDb is a dataset discussed in several papers I helped to write:

It contains schema information derived from many millions of structured data tables that we recovered from a large, general web crawl. This work was done while I was at Google; Google has since released this data for researchers. You can download it here. It is a single compressed text file. Here are two sample lines:

combo_make_model_year = 13
single_make = 3068

The first line indicates that a schema with exactly three elements (make, model, year) was seen in 13 different tables. The second line indicates that the attribute make was seen in 3068 different tables. The prefix combo or single indicates whether the line describes an entire schema or just the count of a single attribute. Attribute labels are separated by underscores. The right-hand side of the equals sign is always an integer.
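To make the line format concrete, here is a minimal Python sketch of a parser for one line of the file. It is an illustration rather than an official tool: the names AcsdbEntry and parse_acsdb_line are hypothetical, and the sketch assumes that attribute labels in combo lines do not themselves contain underscores, which the description above does not guarantee.

```python
from typing import NamedTuple, Tuple


class AcsdbEntry(NamedTuple):
    """One parsed line of the ACSDb file (hypothetical helper type)."""
    kind: str                    # "combo" (whole schema) or "single" (one attribute)
    attributes: Tuple[str, ...]  # attribute labels, in the order they appear
    count: int                   # the integer to the right of the equals sign


def parse_acsdb_line(line: str) -> AcsdbEntry:
    """Parse a line such as 'combo_make_model_year = 13'."""
    # Split on the last '=' so the key stays in one piece.
    key, sep, value = line.rpartition("=")
    if not sep:
        raise ValueError(f"no '=' found in line: {line!r}")
    count = int(value.strip())

    # The prefix before the first underscore says what the line describes.
    kind, sep, rest = key.strip().partition("_")
    if kind not in ("combo", "single") or not sep or not rest:
        raise ValueError(f"unexpected key format: {key!r}")

    if kind == "combo":
        # Assumption: labels contain no underscores, so every '_' is a separator.
        attributes = tuple(rest.split("_"))
    else:
        # A single line carries exactly one attribute, so keep the rest whole.
        attributes = (rest,)
    return AcsdbEntry(kind, attributes, count)


schema_entry = parse_acsdb_line("combo_make_model_year = 13")
attribute_entry = parse_acsdb_line("single_make = 3068")
```

With the two sample lines, schema_entry holds the attributes make, model, and year with a count of 13, and attribute_entry holds the single attribute make with a count of 3068.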

The data in this file is unique by domain, meaning that a single schema cannot be counted more than once from a single DNS domain, regardless of how many times a table with that schema appears at the domain.
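The unique-by-domain rule can be illustrated with a short Python sketch. This is not the code that produced the file; the function count_schemas_by_domain, the (domain, schema) input pairs, and the example domains are all hypothetical, and the sketch assumes a schema is identified by its attribute labels in the order given.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def count_schemas_by_domain(
    observations: Iterable[Tuple[str, Sequence[str]]]
) -> Counter:
    """Count each schema at most once per DNS domain.

    `observations` yields (domain, schema) pairs, one per table found.
    """
    seen = set()        # (domain, schema) pairs already counted
    counts = Counter()  # schema -> number of distinct domains it appeared at
    for domain, schema in observations:
        key = (domain, tuple(schema))
        if key in seen:
            continue    # repeated table at the same domain: ignore it
        seen.add(key)
        counts[tuple(schema)] += 1
    return counts


# Hypothetical input: the same schema appears twice at one domain, once at another.
tables = [
    ("example.com", ["make", "model", "year"]),
    ("example.com", ["make", "model", "year"]),
    ("example.org", ["make", "model", "year"]),
]
schema_counts = count_schemas_by_domain(tables)
```

Here the schema (make, model, year) gets a count of 2, not 3, because the two tables at example.com contribute only once.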

If you use this data in an academic publication, please cite the VLDB 2008 paper listed above:

Thanks to Google for releasing this data, and to my colleagues Alon Halevy, Daisy Wang, Eugene Wu, and Yang Zhang for collaborating on a great project.


Go to Michael Cafarella's homepage