Deutsches Textarchiv

The DFG-funded project Deutsches Textarchiv (DTA) started in 2007 and is located at the Berlin-Brandenburgische Akademie der Wissenschaften (BBAW) in Germany. The DTA digitizes a large cross-section of printed works in New High German, dating from ca. 1600 to 1900. Images and electronic full text are available online; the full text can be downloaded as HTML, XML or TCF. The DTA presents almost exclusively the first editions of the respective works. Currently (January 2016), 2331 texts dating from 1600–1900 are online, and ca. 500 more are being prepared for publication, comprising a total of ca. 650,000 digitized pages with around 1.1 billion characters and ca. 155 million tokens.

The majority of the DTA's texts are transcribed by non-native speakers using the double-keying method (vendors guarantee a character accuracy of 99.9% or higher). The DTA provides linguistic applications for its corpus, i.e. tokenization, lemmatization, lemma-based and phonetic search, and rewrite rules for historical spelling. All DTA texts are freely available (CC BY-NC) for download in different formats: the original XML/TEI texts, an HTML-rendered version, two different kinds of TCF versions, and the raw text transcription. Moreover, CMDI metadata comprising TEI header information may be harvested via OAI-PMH.
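As a rough illustration of what harvesting via OAI-PMH involves, the following Python sketch builds ListRecords requests and follows resumption tokens across response pages. The verb, argument names and XML namespace come from the OAI-PMH 2.0 protocol. The base URL and the metadata prefix "cmdi" are placeholder assumptions and are not taken from the DTA documentation, and the helper names are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical placeholder; substitute the repository's real OAI-PMH endpoint.
BASE_URL = "https://example.org/oai"


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # In the protocol, resumptionToken is an exclusive argument.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        header.findtext(OAI_NS + "identifier")
        for header in root.iter(OAI_NS + "header")
    ]
    token_el = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    # An absent or empty token element marks the last page.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


def harvest(base_url, metadata_prefix, fetch):
    """Yield record identifiers from all pages; fetch(url) must return XML text."""
    token = None
    while True:
        url = list_records_url(base_url, metadata_prefix, token)
        identifiers, token = parse_page(fetch(url))
        yield from identifiers
        if token is None:
            break
```

The fetch argument is kept as a parameter so that the paging logic can be tried out without network access. For a live harvest it could be a small function that calls urllib.request.urlopen on the URL and decodes the response body.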

Website

Partners

CLARIN-ERIC
CLARIN-D
Digitales Wörterbuch der Deutschen Sprache
Cooperation with several libraries, universities and research institutions

Project Team

Director: Prof. Dr. Wolfgang Klein
Project Leader: Dr. Alexander Geyken
Matthias Boenig (né Schulz) (Coordination)
Susanne Haaf (Coordination)
Dr. Bryan Jurish (Computational Linguistics)
Christian Thomas (Coordination)
Frank Wiegand (Software Development and Web Application)
Kay-Michael Würzner (Computational Linguistics)
Kai Zimmer (System Administration)

Funders

German Research Foundation