AI research for public benefit

Independent research · Open source · From Madrid

Current Initiative

Opening Historical Archives

Libraries and museums hold vast collections of scanned documents that can be searched by metadata but not by content. We are changing that by applying state-of-the-art OCR models to extract the full text of these archives, opening them up for research, exploration, and AI training.

Our first release covers over 830,000 pages from the Biblioteca Nacional de España, drawn from 19th-century publications in science, medicine, literature, and more. Next, we plan to expand to other archives across Spain.

BNE Hemeroteca OCR Dataset

830K+ Pages

800M+ Tokens

Collections

🤗 Dataset · Code