Corpora and Annotation Technology

Computational-linguistic analyses and natural language applications need data. We work on creating and annotating natural language corpora from different genres, with a focus on German data. Most of our corpus and annotation work concentrates on discourse-related phenomena.

Distributed Corpora

Annotation Technology

In the early 2000s, we designed the PAULA standoff XML format (Dipper 2005) and the ANNIS linguistic database (Dipper et al. 2004), which supports querying and visualizing multi-layer corpora. The most recent version of ANNIS, ANNIS3, was built by our project partners at HU Berlin (see below). We also developed a few other layer-specific annotation tools and a framework for format conversion, especially for discourse-level annotation.

Related publications: