Potsdam Commentary Corpus
The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2,900 sentences, 44,000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus). It is annotated with several layers of linguistic information.
The central subcorpus that we are making publicly available consists of 176 MAZ texts, which are annotated with
- Sentence Syntax
- Discourse Structure (RST)
- Connectives and their arguments
The corpus is released under a Creative Commons Attribution-NonCommercial-ShareAlike license and can be freely downloaded here. The publication to cite when using the data is Stede & Neumann 2014.
All the annotation guidelines (in German) have been published as an open access book, which can be found here.
A sample of two commentaries drawn from the corpus can be queried online as part of our ANNIS demo.
Fig. 1. Visualization of several annotation layers of the PCC in ANNIS.
Morphosyntactic and Syntactic Annotations
Fig. 2. Parts of speech and syntax annotations as visualized by TIGERSearch.
The entire corpus was annotated semi-manually for constituent syntax in accordance with the specifications of the TIGER corpus using the @nnotate tool (Brants et al. 2004). [Guidelines]
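Because the syntax layer is provided in TIGER XML, a short sketch may help to show how such a file can be read with the Python standard library alone. The toy sentence below is hand-written for illustration and is not taken from the corpus; the function name `read_sentences` and the returned dictionary layout are assumptions, not part of any PCC tooling.

```python
import xml.etree.ElementTree as ET

# Hand-written toy sentence in TIGER XML style (NOT from the PCC).
SAMPLE = """\
<corpus id="toy">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Das" pos="PDS"/>
          <t id="s1_2" word="reicht" pos="VVFIN"/>
          <t id="s1_3" word="nicht" pos="PTKNEG"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SB" idref="s1_1"/>
            <edge label="HD" idref="s1_2"/>
            <edge label="NG" idref="s1_3"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>
"""


def read_sentences(xml_text):
    """Return one dict per <s> element with its tokens and constituents."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # Terminals carry the word form and the part-of-speech tag.
        tokens = [(t.get("id"), t.get("word"), t.get("pos")) for t in s.iter("t")]
        # Nonterminals carry a category; their edges point to child node ids.
        constituents = {
            nt.get("id"): {
                "cat": nt.get("cat"),
                "children": [(e.get("label"), e.get("idref")) for e in nt.findall("edge")],
            }
            for nt in s.iter("nt")
        }
        sentences.append({
            "id": s.get("id"),
            "root": s.find("graph").get("root"),
            "tokens": tokens,
            "constituents": constituents,
        })
    return sentences


sentences = read_sentences(SAMPLE)
for node_id, node in sentences[0]["constituents"].items():
    print(node_id, node["cat"], node["children"])
```

Real corpus files are larger and may carry further attributes (for example morphology) on the terminals, so this reader should be treated as a starting point rather than a complete parser.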
Fig. 3. Coreference annotation with MMAX.
The corpus is annotated for nominal and pronominal coreference according to guidelines that build upon the Potsdam Coreference Scheme (PoCoS core scheme, Krasavina & Chiarcos 2007) using the MMAX2 tool (Müller & Strube 2001). Currently, the annotations cover strict coreference (identity) only. Indirect anaphora (bridging) has not been annotated yet.
Discourse Structure and Connectives
The PCC is one of very few corpora with annotations for Discourse Structure, i.e., the hierarchical and relational structure of entire texts (or other discourse types). The MAZ subcorpus (176 texts) has been annotated in accordance with Rhetorical Structure Theory (RST, Mann & Thompson 1988) using the RSTTool (O'Donnell 2000, Version 3.1).
Connectives are the most important surface signals for RST annotations, but their behavior need not coincide completely with the overall rhetorical structure of a text. We therefore introduced an independent annotation layer for connectives and their scopes (quite similar to the approach of the Penn Discourse Treebank). For semi-automatic connective annotation, we developed ConAno (Stede & Heintze 2004), a tool that identifies potential German connectives in a text and suggests the two arguments of each connective (suggestions that the annotator can of course overwrite). The tool is available for download here.
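To make the idea of a connective with two arguments concrete, here is a minimal sketch of how one such annotation could be represented over a tokenized sentence. It is an illustration only: the lexicon lookup is a deliberately naive stand-in and not ConAno's actual method, and the names `ConnectiveAnnotation`, `arg1`, `arg2`, and `suggest_connectives` as well as the example sentence are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)

# Tiny illustrative lexicon; a real connective lexicon is far larger.
LEXICON: Set[str] = {"weil", "aber", "denn", "obwohl"}

# Hand-written example sentence (NOT from the PCC).
TOKENS: List[str] = "Die Stadt spart , weil die Einnahmen sinken .".split()


def suggest_connectives(tokens: List[str], lexicon: Set[str]) -> List[int]:
    """Return indices of tokens found in the lexicon (candidates only)."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]


@dataclass(frozen=True)
class ConnectiveAnnotation:
    """One connective plus its two argument spans (hypothetical layout)."""
    connective: Span
    arg1: Span
    arg2: Span

    def render(self, tokens: List[str]) -> Dict[str, str]:
        """Map each span back to its surface string."""
        def text(span: Span) -> str:
            start, end = span
            return " ".join(tokens[start:end])

        return {
            "connective": text(self.connective),
            "arg1": text(self.arg1),
            "arg2": text(self.arg2),
        }


candidates = suggest_connectives(TOKENS, LEXICON)
# An annotator would confirm or correct the suggested argument spans.
annotation = ConnectiveAnnotation(connective=(4, 5), arg1=(0, 3), arg2=(5, 8))
print(candidates, annotation.render(TOKENS))
```

Storing arguments as plain token spans keeps this layer independent of the RST tree, which mirrors the motivation for a separate connective layer given above.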
Fig. 4. RST annotation with the RSTTool.
The annotations of the Potsdam Commentary Corpus are provided in their various source formats:
|Annotation layer|Format|Tool|
|Parts of Speech, Morphology, Syntax|TIGER XML, NEGRA export format|TIGERSearch|
|Rhetorical Structure Theory|RS3|RSTTool|
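For readers who want to inspect the RS3 files directly, the following sketch reads a small RS3 document with the Python standard library. The sample is hand-written and not taken from the corpus, and the function name `read_rs3` and its return values are assumptions. It relies on the usual RS3 convention that a satellite points to its nucleus via `parent` and names the relation in `relname`, while a nucleus points to its enclosing group with `relname="span"`.

```python
import xml.etree.ElementTree as ET

# Hand-written toy RS3 document (NOT from the PCC).
SAMPLE_RS3 = """\
<rst>
  <header>
    <relations>
      <rel name="reason" type="rst"/>
    </relations>
  </header>
  <body>
    <segment id="1" parent="3" relname="span">The budget is tight.</segment>
    <segment id="2" parent="1" relname="reason">Tax revenue fell last year.</segment>
    <group id="3" type="span"/>
  </body>
</rst>
"""


def read_rs3(xml_text):
    """Return (relation types, nodes, edges) from an RS3 string."""
    root = ET.fromstring(xml_text)
    # Declared relations and their type ("rst" or "multinuc").
    rel_types = {r.get("name"): r.get("type") for r in root.iter("rel")}
    nodes, edges = {}, []
    for elem in root.find("body"):
        node_id = elem.get("id")
        # Segments hold text; groups are internal nodes without text.
        nodes[node_id] = {"kind": elem.tag, "text": (elem.text or "").strip()}
        if elem.get("parent") is not None:
            edges.append((node_id, elem.get("relname"), elem.get("parent")))
    return rel_types, nodes, edges


rel_types, nodes, edges = read_rs3(SAMPLE_RS3)
for child, relation, parent in edges:
    print(f"{child} --{relation}--> {parent}")
```

A node that never appears as the child in an edge is a root of the tree; in the toy document this is group `3`.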
For programmatic access to the corpus, we developed discoursegraphs, a graph-based converter and merging library. The tool parses all the annotation formats used in the PCC and merges them into a single NetworkX-based graph representation. The graph can either be queried directly or exported to various generic graph formats (Neo4j, DOT, GEXF, GML, GraphML).
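The general idea of merging annotation layers into one graph can be sketched with plain NetworkX. The example below does not use the discoursegraphs API; the node ids, the `layer` and `label` attributes, and the helper `parents` are all assumptions chosen for illustration. Two toy layers share the same token nodes, are combined with `nx.compose`, queried by layer, and serialized to GraphML.

```python
import networkx as nx


def syntax_layer():
    """Toy syntax layer: one S node dominating three tokens."""
    g = nx.MultiDiGraph()
    for i, word in enumerate(["Das", "reicht", "nicht"], start=1):
        g.add_node(f"tok{i}", layer="token", label=word)
    g.add_node("S1", layer="syntax", label="S")
    for i, func in enumerate(["SB", "HD", "NG"], start=1):
        g.add_edge("S1", f"tok{i}", layer="syntax", label=func)
    return g


def rst_layer():
    """Toy discourse layer: one elementary unit spanning the same tokens."""
    g = nx.MultiDiGraph()
    g.add_node("edu1", layer="rst", label="EDU 1")
    for i in (1, 2, 3):
        g.add_edge("edu1", f"tok{i}", layer="rst", label="spans")
    return g


def parents(graph, node, layer):
    """Return the nodes that dominate `node` within a given layer."""
    return [u for u, _, data in graph.in_edges(node, data=True)
            if data["layer"] == layer]


# Shared token ids let both layers attach to the same nodes.
merged = nx.compose(syntax_layer(), rst_layer())

print(parents(merged, "tok2", "syntax"))
print(parents(merged, "tok2", "rst"))

# Serialize to GraphML text; nx.write_gexf or nx.write_gml work similarly.
graphml_text = "\n".join(nx.generate_graphml(merged))
```

Keeping a `layer` attribute on every node and edge is one simple way to let a single graph be filtered back into its original annotation layers.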
[Brants et al. 2004] Brants, Sabine, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith and Hans Uszkoreit (2004). "TIGER: Linguistic Interpretation of a German Corpus". Research on Language and Computation 2(4): 597-620.
[O'Donnell 2000] O'Donnell, Michael (2000). "RSTTool 2.4 - a markup tool for Rhetorical Structure Theory". In: Proceedings of the 1st International Natural Language Generation Conference, Mitzpe Ramon, Israel.
[Krasavina & Chiarcos 2007] Krasavina, Olga and Christian Chiarcos (2007). "PoCoS: Potsdam Coreference Scheme". In: Proceedings of the First Linguistic Annotation Workshop (LAW), held in conjunction with ACL-2007, Prague, pp. 156-163.
[Mann & Thompson 1988] Mann, William C. and Sandra A. Thompson (1988). "Rhetorical Structure Theory: Toward a functional theory of text organization". In: Text 8 (3), pp. 243-281.
[Müller & Strube 2001] Müller, Christoph and Michael Strube (2001). "Annotating Anaphoric and Bridging Relations with MMAX". In: Proceedings of the 2nd SIGdial Workshop on Discourse and Dialogue, Aalborg, Denmark, September 1-2, 2001, pp. 90-95.
[Reitter & Stede 2003] Reitter, David and Manfred Stede (2003). "Step by step: underspecified markup in incremental rhetorical analysis". In: Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) (at EACL 2003), Budapest, 2003.
[Stede 2004] Stede, Manfred (2004). "The Potsdam Commentary Corpus". Proc. of the ACL 2004 Workshop on Discourse Annotation, pp. 96-102.
[Stede 2016] Stede, Manfred, ed. (2016). "Handbuch Textannotation: Potsdamer Kommentarkorpus 2.0". Potsdam Cognitive Science Series Vol. 8. Universitätsverlag Potsdam. Online edition: urn:nbn:de:kobv:517-opus4-82761
[Stede & Heintze 2004] Stede, Manfred and S. Heintze (2004). "Machine-assisted rhetorical structure annotation". Proc. of the 20th International Conference on Computational Linguistics (COLING), Geneva.
[Stede & Neumann 2014] Stede, Manfred and A. Neumann (2014). "Potsdam Commentary Corpus 2.0: Annotation for Discourse Research". Proc. of the Language Resources and Evaluation Conference (LREC), Reykjavik.