Published October 31, 2025 | Version 1
Dataset Open

OCR-D-GT-VD-SBB

Description

A ground truth (GT) dataset created within the OCR-D project, consisting of 348 pages extracted from historical documents pertaining to the "Verzeichnis der im deutschen Sprachraum erschienenen Drucke" (VD), all of which have been digitised by Staatsbibliothek zu Berlin – Berlin State Library (SBB). The data publication consists of 348 .xml files with transcriptions for 348 .tif facsimile image files. The image files pertain to 67 distinct works: four images were extracted from each of 65 works, and 49 and 39 images respectively were extracted from two further works (65 × 4 + 49 + 39 = 348). The dataset is complemented by a .csv file that maps the identifiers used in this dataset to the unique identifiers used in the digitised collections of Staatsbibliothek zu Berlin – Berlin State Library, as well as a file listing in .csv format.

Data selection was performed within the OCR-D project at Staatsbibliothek zu Berlin – Berlin State Library. The project is funded by the German Research Foundation (DFG), project grant no. 460675868. Ground truth data were established by a digitisation service provider and post-corrected by staff members of the Berlin State Library. Data curation and publication were carried out by two members of the team of the research project "Mensch.Maschine.Kultur – Künstliche Intelligenz für das Digitale Kulturelle Erbe" at Staatsbibliothek zu Berlin – Berlin State Library. The research project was funded by the Federal Government Commissioner for Culture and the Media (BKM), project grant no. 2522DIG002.
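The one-to-one relationship between transcriptions and facsimiles described above can be checked after download with a short script. The following is only a sketch: it assumes that each .xml file shares its path and base name with the .tif image it transcribes, which the description does not state, and the function name `pair_transcriptions` is hypothetical.

```python
from pathlib import PurePosixPath

# Page count as stated in the description:
# 65 works with 4 images each, plus two works with 49 and 39 images.
EXPECTED_PAGES = 65 * 4 + 49 + 39


def pair_transcriptions(filenames):
    """Pair .tif images with .xml transcriptions.

    ASSUMPTION (not confirmed by the dataset description): an image and
    its transcription differ only in their file extension.
    Returns (paired, unpaired), where `paired` maps the extension-less
    path to an (image, transcription) tuple and `unpaired` lists the
    extension-less paths that have only one of the two files.
    """
    images = {}
    transcriptions = {}
    for name in filenames:
        path = PurePosixPath(name)
        key = str(path.with_suffix(""))
        suffix = path.suffix.lower()
        if suffix == ".tif":
            images[key] = name
        elif suffix == ".xml":
            transcriptions[key] = name
    paired = {
        key: (images[key], transcriptions[key])
        for key in images.keys() & transcriptions.keys()
    }
    unpaired = sorted(images.keys() ^ transcriptions.keys())
    return paired, unpaired
```

If the naming assumption holds, running this over the extracted file list should yield `len(paired) == EXPECTED_PAGES` and an empty `unpaired` list. A different result would suggest either an incomplete download or a different naming scheme, in which case the accompanying file listing in .csv format is the better reference.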

Files

OCR-D-GT-VD-SBB.md

Files (796.8 MB)

Checksum | Size
md5:76270a4c4f5f6c69fadea7c42249800e | 7.3 kB
md5:21074b512adcdfc4f40ad681c73aa9b7 | 38.9 kB
md5:e0390697ddf97e6efd465ed7b4b5f945 | 16.5 kB
md5:757b1fb86979b97847ef86795f06d660 | 796.7 MB
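The MD5 checksums listed above can be used to confirm that a downloaded file is intact. The sketch below uses Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and the local file path has to be supplied by the user, since the file names are not given in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that large archives do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a local file against a checksum as listed on the record
    page, for example 'md5:757b1fb86979b97847ef86795f06d660'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("path/to/download", "md5:757b1fb86979b97847ef86795f06d660")` would return `True` for an uncorrupted copy of the 796.7 MB file, where `"path/to/download"` is a placeholder for the local file location.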

Additional details

Funding

Deutsche Forschungsgemeinschaft
Koordinierte Förderinitiative zur Weiterentwicklung von Verfahren der Optical Character Recognition (OCR) [Phase 3], grant no. 460675868