Published June 30, 2023 | Version 1.0.0
Dataset Open

Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery

Description

This collection consists of ten open-access relations commonly used by the data management community. In addition to the relations themselves (please note the references to the original sources below), the collection includes three lists that describe approximate functional dependencies found in the relations. These lists result from a manual annotation process in which two independent annotators consulted the schema of each relation and identified column combinations where one column implies another based on its semantics. For example, in claims.csv, AirportCode implies AirportName, since each code should identify exactly one airport.

The file ground_truth.csv is a comma-separated file containing the annotated approximate functional dependencies. The table column names the relation; lhs and rhs name two columns of that relation for which we found that, semantically, lhs implies rhs.
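To make the layout of ground_truth.csv concrete, here is a minimal Python sketch that groups the annotated dependencies by relation. It assumes the file has a header row with exactly the column names table, lhs and rhs; the sample row is illustrative (taken from the claims.csv example above) and the function name load_ground_truth is hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_ground_truth(handle):
    """Group annotated dependencies by relation: {table: [(lhs, rhs), ...]}."""
    by_table = defaultdict(list)
    for row in csv.DictReader(handle):
        by_table[row["table"]].append((row["lhs"], row["rhs"]))
    return dict(by_table)


# Illustrative stand-in for ground_truth.csv; the real file may differ in
# details such as how the relation is spelled in the table column.
SAMPLE = "table,lhs,rhs\nclaims,AirportCode,AirportName\n"

ground_truth = load_ground_truth(io.StringIO(SAMPLE))
print(ground_truth)
```

For the real file, the same function can be called with open("ground_truth.csv", newline="") in place of the in-memory sample.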

The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded from or included in the manual annotation, respectively. We excluded a candidate if there was no tuple in which both attributes had a value or if its g3_prime value was too small.
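The two exclusion criteria can be illustrated with a small Python sketch. This description does not define g3_prime, so the function below computes the classic g3 error instead (the smallest fraction of tuples that must be removed for lhs to determine rhs exactly); g3_prime may differ in normalization or orientation. Treating empty strings and None as missing values, restricting the computation to tuples where both attributes are present, and the toy rows are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Classic g3 error of lhs -> rhs over tuples where both values are present.

    Returns None when no tuple has a value for both attributes, which mirrors
    the first exclusion criterion described above.
    """
    groups = defaultdict(Counter)
    for row in rows:
        left, right = row.get(lhs), row.get(rhs)
        if left in (None, "") or right in (None, ""):
            continue  # assumption: empty string / None means "no value"
        groups[left][right] += 1

    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return None
    # Keep the most frequent rhs value per lhs value; all other tuples violate.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / total


# Toy rows, not taken from claims.csv.
rows = [
    {"AirportCode": "AAA", "AirportName": "Alpha Airport"},
    {"AirportCode": "AAA", "AirportName": "Alpha Airport"},
    {"AirportCode": "AAA", "AirportName": "Alpha Intl"},
    {"AirportCode": "BBB", "AirportName": "Beta Airport"},
]
error = g3_error(rows, "AirportCode", "AirportName")
print(error)
```

In the toy data, one of the four tuples must be removed for the dependency to hold exactly.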

Dataset References

  • adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
  • claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security.
  • dblp10k.csv: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. Made available as DBLP Dataset 2.
  • hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.
  • t_biocase_... files: t_bioc_... files used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.
  • tax.csv: Tax dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.

Files

Files (250.9 MB)

  • md5:06a12d8b45ce8d320f62f30ef2339e4c (4.0 MB)
  • md5:0e933659996d859144ae45e9c6fdd7be (17.4 MB)
  • md5:4d9e63e1ace2166a11bcc81103d93328 (5.0 MB)
  • md5:afc8eec78c8ca755a066a9a06ab954b5 (270.0 kB)
  • md5:0a44bb6e5aa445f8c37125f52aab4ff2 (6.0 kB)
  • md5:bea9c52f8dd6d2187276e8f52edaa285 (30.6 MB)
  • md5:b53cfbd72cc3c927ec166de5bc93ad23 (79.1 kB)
  • md5:6e2b4906d63949c18cf1aedb1fb63386 (14.1 MB)
  • md5:1117a398aa162add9b1ddd5605923508 (21.2 MB)
  • md5:7f871fffe3fb8c7c82969578a77b4327 (24.9 MB)
  • md5:b02f7a46c5961addbec7c20a6fcf53d1 (30.5 MB)
  • md5:5ebcbef341a17c4b1320c3dc887845f5 (29.8 MB)
  • md5:ccaec8c35740f0f640fc59a675b6097d (73.0 MB)
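
The MD5 checksums listed above can be used to verify a download. The following Python sketch uses only the standard library; the helper names md5_of_file and matches_listing are hypothetical, and which checksum belongs to which file has to be taken from the record itself, since the listing here does not pair checksums with file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed checksum such as 'md5:06a1...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as matches_listing("adult.csv", "md5:...") returns True only if the local copy is byte-identical to the published file with that checksum.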

Additional details

References

  • Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824