OCR-D-GT-VD-SBB

Baierer, Konstantin; Federbusch, Maria; Gerber, Mike; Lehmann, Jörg; Neudecker, Clemens

doi:10.5281/zenodo.17395956

Published October 31, 2025 | Version 1

Dataset Open

1. Berlin State Library

A ground truth (GT) dataset created within the OCR-D project, consisting of 348 pages extracted from historical documents pertaining to the "Verzeichnis der im deutschen Sprachraum erschienenen Drucke" (VD), all of which have been digitised by Staatsbibliothek zu Berlin – Berlin State Library (SBB). The data publication consists of 348 .xml files with transcriptions for 348 .tif facsimile image files. The image files belong to 67 distinct works: four images were extracted from each of 65 works, and 49 and 39 images respectively from two further works (65 × 4 + 49 + 39 = 348). The dataset is complemented by two .csv files: one maps the identifiers used in this dataset to the unique identifiers used in the digitised collections of Staatsbibliothek zu Berlin – Berlin State Library, and the other is a file listing. Data selection was performed within the OCR-D project at Staatsbibliothek zu Berlin – Berlin State Library. The project is funded by the German Research Foundation (DFG), project grant no. 460675868. The ground truth data were produced by a digitisation service provider and post-corrected by staff members of the Berlin State Library. Data curation and publication were carried out by two members of the team of the research project "Mensch.Maschine.Kultur – Künstliche Intelligenz für das Digitale Kulturelle Erbe" at Staatsbibliothek zu Berlin – Berlin State Library. That research project was funded by the Federal Government Commissioner for Culture and the Media (BKM), project grant no. 2522DIG002.

Files (796.8 MB)

Name	MD5	Size
MappingDirectoryName-PPN.csv	76270a4c4f5f6c69fadea7c42249800e	7.3 kB
OCR-D-GT-VD-SBB-filelisting.csv	21074b512adcdfc4f40ad681c73aa9b7	38.9 kB
OCR-D-GT-VD-SBB.md	e0390697ddf97e6efd465ed7b4b5f945	16.5 kB
OCR-D-GT-VD-SBB.zip	757b1fb86979b97847ef86795f06d660	796.7 MB
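
The MD5 checksums listed in the file table above can be used to check that a download completed intact. The following is a minimal sketch, not part of the dataset: it assumes the four files were saved under their original names in a single local directory. The function names `md5_of` and `verify_downloads` are hypothetical helpers introduced here for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file table of this record.
EXPECTED_MD5 = {
    "MappingDirectoryName-PPN.csv": "76270a4c4f5f6c69fadea7c42249800e",
    "OCR-D-GT-VD-SBB-filelisting.csv": "21074b512adcdfc4f40ad681c73aa9b7",
    "OCR-D-GT-VD-SBB.md": "e0390697ddf97e6efd465ed7b4b5f945",
    "OCR-D-GT-VD-SBB.zip": "757b1fb86979b97847ef86795f06d660",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Return {filename: status}; status is 'ok', 'mismatch' or 'missing'.

    Assumption: the files sit directly in `directory` under the names
    shown in the file table.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    for name, status in verify_downloads().items():
        print(f"{status:8}  {name}")
```

Reading in chunks keeps memory use flat for the 796.7 MB archive. A `mismatch` status most likely indicates an incomplete or corrupted download, in which case the file should be fetched again.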

Additional details

Funding

Deutsche Forschungsgemeinschaft
Koordinierte Förderinitiative zur Weiterentwicklung von Verfahren der Optical Character Recognition (OCR) [Phase 3], grant no. 460675868

	All versions	This version
Views	54	54
Downloads	13	13
Data volume	3.2 GB	3.2 GB
