Corpora and Annotation Technology

Computational-linguistic analyses and natural language applications need data. We work on creating and annotating natural language corpora from different genres, with a focus on German data. Most of our corpus and annotation work concentrates on discourse-related phenomena.

Distributed Corpora

The Potsdam Commentary Corpus (PCC): A corpus of multi-level annotated German newspaper commentaries
arg-microtexts: A German-English parallel corpus of 112 short argumentative texts annotated with argumentation structures
The Potsdam Twitter Sentiment Corpus (PotTS): A collection of 8,000 German tweets manually annotated with fine-grained sentiment relations

Annotation Technology

In the early 2000s, we designed the PAULA standoff XML format (Dipper 2005) and the ANNIS linguistic database (Dipper et al. 2004), which allows querying and visualizing multi-layer corpora. The most recent version of ANNIS, ANNIS3, was built by our project partners at HU Berlin (see below). We also developed a few other layer-specific annotation tools and a framework for format conversion, especially for discourse-level annotation.

ANNIS3: An open-source linguistic database and query tool for multi-layer-annotated corpora
discoursegraphs: A converter and merging library for syntactic and discourse-related annotation formats (Tiger, PTB, RSTTool, MMAX, ConnAnno, EXMARaLDA) with output support for generic graph formats (neo4j, dot, GEXF, GML, GraphML)
ConnAnno: A Java tool for semi-manually annotating connectives and their arguments
GraPAT: A graph-based, web-based annotation tool suited for sentiment and argumentation structure annotation
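
The idea behind standoff annotation and graph-based merging, as used by the tools above, can be illustrated with a minimal Python sketch. It is not the PAULA schema and not the discoursegraphs API; the names `Span`, `merge_layers`, and `to_dot` are hypothetical and introduced only for this illustration. The sketch assumes that each annotation layer refers to a shared token sequence by token index, merges two such layers into one graph, and writes the result in Graphviz DOT syntax.

```python
# Illustrative sketch only: NOT the PAULA schema and NOT the discoursegraphs API.
# All names below (Span, merge_layers, to_dot) are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A standoff annotation: a label attached to a range of token indices."""
    layer: str   # name of the annotation layer, e.g. "syntax"
    label: str   # annotation value, e.g. "NP"
    start: int   # index of the first token covered
    end: int     # index one past the last token covered


def merge_layers(tokens, layers):
    """Merge independently produced layers into one node/edge graph."""
    # Every token becomes a node; annotations point at tokens, never copy them.
    nodes = {f"tok{i}": tok for i, tok in enumerate(tokens)}
    edges = []
    for spans in layers:
        for span in spans:
            node_id = f"{span.layer}{len(nodes)}"
            nodes[node_id] = f"{span.layer}:{span.label}"
            edges.extend((node_id, f"tok{i}") for i in range(span.start, span.end))
    return nodes, edges


def to_dot(nodes, edges):
    """Serialize the merged graph in Graphviz DOT syntax.

    Labels containing double quotes would need escaping; omitted for brevity.
    """
    lines = ["digraph annotations {"]
    lines += [f'  {node_id} [label="{label}"];' for node_id, label in nodes.items()]
    lines += [f"  {src} -> {dst};" for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


tokens = ["Aber", "das", "stimmt", "nicht"]
syntax = [Span("syntax", "NP", 1, 2)]
discourse = [Span("discourse", "connective", 0, 1)]

nodes, edges = merge_layers(tokens, [syntax, discourse])
dot = to_dot(nodes, edges)
print(dot)
```

Because both layers address the same token indices, they can be created by different tools and combined later without touching the primary text. The printed output can be rendered with Graphviz.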

Related publications:

Manfred Stede. Das Potsdamer Kommentarkorpus. In Hartmut E.H. Lenk, editor, Persuasionsstile in Europa II. Olms, Hildesheim, 2016. [Bibtex]
Manfred Stede, editor. Handbuch Textannotation: Potsdamer Kommentarkorpus 2.0. Volume 8 of Potsdam Cognitive Science Series. Universitätsverlag, Potsdam, 2016. URL: http://nbn-resolving.de/urn:nbn:de:kobv:517-opus4-82761. [Bibtex]
Arne Neumann. Discoursegraphs: a graph-based merging tool and converter for multilayer annotated corpora. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), 309–312. 2015. [Bibtex]
Jonathan Sonntag and Manfred Stede. GraPAT: a tool for graph annotations. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland, May 2014. [Bibtex] [PDF]
Manfred Stede and Arne Neumann. Potsdam Commentary Corpus 2.0: annotation for discourse research. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland, May 2014. [Bibtex] [PDF]
Christian Chiarcos, Julia Ritz, and Manfred Stede. By all these lovely tokens... merging conflicting tokenizations. Language Resources and Evaluation, 46(1):53–74, 2012. [Bibtex]
Christian Chiarcos, Stefanie Dipper, Michael Götze, Ulf Leser, Anke Lüdeling, Julia Ritz, and Manfred Stede. A flexible framework for integrating annotations from different tools and tagsets. Traitement Automatique des Langues, 49(2):271–293, 2008. URL: http://www.atala.org./A-Flexible-Framework-for. [Bibtex]
Manfred Stede. The Potsdam Commentary Corpus. In Proceedings of the 2004 ACL Workshop on Discourse Annotation, 96–102. Association for Computational Linguistics, 2004. [Bibtex] [PDF]
Manfred Stede and Silvan Heintze. Machine-assisted rhetorical structure annotation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), 425–431. 2004. [Bibtex] [PDF]
S. Dipper, M. Götze, M. Stede, and T. Wegst. ANNIS: A linguistic database for exploring information structure. In Interdisciplinary Studies on Information Structure Vol. 1 - Working Papers of the SFB 632. Universitätsverlag, Potsdam, 2004. [Bibtex] [PDF]