Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery
Creators
- 1. UHasselt, Hasselt University
Description
This collection consists of ten open-access relations commonly used by the data management community. In addition to the relations themselves (please note the references to the original sources below), the collection includes three lists that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process in which two independent annotators consulted the schema of each relation and identified column combinations where one column implies another based on its semantics. For example, in the claims.csv file, AirportCode implies AirportName, as each code should identify exactly one airport.
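To make the notion of "one column implies another" concrete, the following minimal sketch lists the left-hand-side values that map to more than one right-hand-side value, i.e. the places where a dependency such as AirportCode implies AirportName is violated in the data. The function name `fd_violations` and the small in-memory sample are illustrative assumptions and not part of the collection; real rows would come from a file such as claims.csv.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs_value: set_of_rhs_values} for lhs values with more than one rhs value.

    Rows where either attribute is missing (None or empty string) are skipped.
    """
    seen = defaultdict(set)
    for row in rows:
        left, right = row.get(lhs), row.get(rhs)
        if left in (None, "") or right in (None, ""):
            continue
        seen[left].add(right)
    return {value: targets for value, targets in seen.items() if len(targets) > 1}


# Hypothetical sample rows; the values are made up for illustration only.
sample = [
    {"AirportCode": "AAA", "AirportName": "Alpha Airport"},
    {"AirportCode": "AAA", "AirportName": "Alpha Airport"},
    {"AirportCode": "BBB", "AirportName": "Beta Airport"},
    {"AirportCode": "BBB", "AirportName": "Beta Intl Airport"},  # conflicting spelling
    {"AirportCode": "CCC", "AirportName": ""},                   # missing value, skipped
]

violations = fd_violations(sample, "AirportCode", "AirportName")
print(violations)
```

A dependency that holds semantically but shows a few such conflicts in the data is what makes it approximate rather than exact.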
The file ground_truth.csv is a comma-separated file containing approximate functional dependencies. The column table names the relation, and lhs and rhs name two columns of that relation for which we found that lhs semantically implies rhs.
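As a rough sketch of how the annotated dependencies might be read, the snippet below groups the (lhs, rhs) pairs by relation. It assumes that ground_truth.csv has a header row with the column names table, lhs and rhs; the helper name `load_ground_truth`, the in-memory stand-in for the file, and the way relation names are spelled in the table column are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def load_ground_truth(handle):
    """Group (lhs, rhs) pairs by relation name from an open CSV file handle."""
    by_table = defaultdict(list)
    for row in csv.DictReader(handle):
        by_table[row["table"]].append((row["lhs"], row["rhs"]))
    return dict(by_table)


# Stand-in for open("ground_truth.csv", newline=""); the rows are illustrative,
# and the second one is purely hypothetical.
example = io.StringIO(
    "table,lhs,rhs\n"
    "claims.csv,AirportCode,AirportName\n"
    "example.csv,colA,colB\n"
)

deps = load_ground_truth(example)
for table, pairs in deps.items():
    for lhs, rhs in pairs:
        print(f"{table}: {lhs} -> {rhs}")
```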
The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded from or included in the manual annotation, respectively. We excluded a candidate if there was no tuple in which both attributes had a value or if its g3_prime value was too small.
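The two exclusion criteria can be sketched as follows. The first helper counts the tuples in which both attributes have a value. The second computes the classic g3 error (the minimum fraction of tuples that must be removed for lhs to determine rhs exactly). This is only a related stand-in: the exact definition of g3_prime and the threshold used are not given here, and the classic g3 error is lower-is-better, so the sketch should not be read as the annotators' actual procedure. All function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def is_missing(value):
    """Treat None and the empty string as a missing value (an assumption)."""
    return value is None or value == ""


def co_occurrence_count(rows, lhs, rhs):
    """Number of tuples in which both attributes have a value."""
    return sum(
        1 for r in rows
        if not is_missing(r.get(lhs)) and not is_missing(r.get(rhs))
    )


def g3_error(rows, lhs, rhs):
    """Classic g3 error of lhs -> rhs over tuples where both values are present.

    Returns None when no tuple has both values.
    """
    groups = defaultdict(Counter)
    for r in rows:
        left, right = r.get(lhs), r.get(rhs)
        if is_missing(left) or is_missing(right):
            continue
        groups[left][right] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return None
    # Keep the most frequent rhs value per lhs value; everything else is removed.
    kept = sum(max(counts.values()) for counts in groups.values())
    return (total - kept) / total


# Hypothetical rows for illustration.
rows = [
    {"code": "A", "name": "x", "note": ""},
    {"code": "A", "name": "x", "note": ""},
    {"code": "A", "name": "y", "note": ""},
    {"code": "B", "name": "z", "note": ""},
    {"code": "", "name": "z", "note": "n1"},
]

print(co_occurrence_count(rows, "code", "name"))  # 4
print(g3_error(rows, "code", "name"))             # 0.25
print(co_occurrence_count(rows, "code", "note"))  # 0, so this pair has no shared tuple
```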
Dataset References
- adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
- claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security.
- dblp10k.csv: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. Made available as DBLP Dataset 2.
- hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection accompanying that paper.
- t_biocase_... files: t_bioc_... files used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection accompanying that paper.
- tax.csv: Tax dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection accompanying that paper.
Files
Total size: 250.9 MB
MD5 checksum | Size |
---|---|
06a12d8b45ce8d320f62f30ef2339e4c | 4.0 MB |
0e933659996d859144ae45e9c6fdd7be | 17.4 MB |
4d9e63e1ace2166a11bcc81103d93328 | 5.0 MB |
afc8eec78c8ca755a066a9a06ab954b5 | 270.0 kB |
0a44bb6e5aa445f8c37125f52aab4ff2 | 6.0 kB |
bea9c52f8dd6d2187276e8f52edaa285 | 30.6 MB |
b53cfbd72cc3c927ec166de5bc93ad23 | 79.1 kB |
6e2b4906d63949c18cf1aedb1fb63386 | 14.1 MB |
1117a398aa162add9b1ddd5605923508 | 21.2 MB |
7f871fffe3fb8c7c82969578a77b4327 | 24.9 MB |
b02f7a46c5961addbec7c20a6fcf53d1 | 30.5 MB |
5ebcbef341a17c4b1320c3dc887845f5 | 29.8 MB |
ccaec8c35740f0f640fc59a675b6097d | 73.0 MB |
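The MD5 checksums in the file listing can be used to check that a download is intact. The sketch below streams a file through Python's `hashlib.md5` and compares the result with an expected checksum. The helper names are illustrative, and which checksum belongs to which file is not stated in the listing, so the pairing in the usage comment is a placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value, with or without the 'md5:' prefix."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (placeholder pairing of file name and checksum):
# matches_checksum("some_downloaded_file.csv", "md5:<checksum from the listing>")
```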
Additional details
References
- Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824