Potsdam Commentary Corpus
The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2,900 sentences, 44,000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus). It is annotated with several types of linguistic information.
The central subcorpus that we are making publicly available consists of 176 MAZ texts, which are annotated with
- Sentence Syntax
- Coreference
- Discourse Structure (RST & PDTB)
- Aboutness topics
- Summaries
The corpus is released under a Creative Commons Attribution-NonCommercial-ShareAlike license and can be freely downloaded here. The publication to cite when using the data is Bourgonje & Stede 2020 (see below for publications relating to earlier iterations of the corpus).
All the annotation guidelines (in German) have been published as an open access book, which can be found here.
The corpus can also be queried online at this ANNIS instance. The ANNIS version of the corpus is available here. Note that it is based on version 2.1 of the corpus.
Fig. 1. Visualization of several annotation layers of the PCC in ANNIS.
Morphosyntactic and Syntactic Annotations
Fig. 2. Parts of speech and syntax annotations as visualized by TIGERSearch
The entire corpus was annotated semi-manually for constituent syntax in accordance with the specifications of the TIGER corpus using the @nnotate tool (Brants et al. 2004). [Guidelines]
Coreference
Fig. 3. Coreference annotation with MMAX
The corpus is annotated for nominal and pronominal coreference according to guidelines that build upon the Potsdam Coreference Scheme (PoCoS core scheme, Krasavina & Chiarcos 2007) using the MMAX2 tool (Müller & Strube 2001). Currently, the annotations cover strict coreference (identity) only. Indirect anaphora (bridging) has not yet been annotated.
Discourse Structure and Connectives
The PCC is one of very few corpora annotated for discourse structure, i.e., the hierarchical and relational structure of entire texts (or other discourse types). The MAZ subcorpus (176 texts) has been annotated in accordance with Rhetorical Structure Theory (RST, Mann & Thompson 1988) using the RSTTool (O'Donnell 2000, Version 3.1).
Connectives are the most important surface signals for RST annotations, but their behavior need not coincide completely with the overall rhetorical structure of a text. We therefore introduced an independent annotation layer following the approach of the Penn Discourse Treebank (PDTB). In this layer, connectives and their arguments, along with their relation sense, have been annotated. To support these semi-automatic, connective-centered annotations, we developed ConAno (Stede & Heintze 2004), a tool that identifies potential German connectives in a text, suggests the two arguments (which the annotator can override), and presents the possible relation senses in a drop-down menu. The tool is available for download here.
These annotations have been merged with annotations for the remaining relation types of the PDTB (PDTB3 Annotation Manual), i.e., implicit relations (without a connective), alternative lexicalisations, 'entity relations' and 'no relation' instances. The merge is documented in Bourgonje & Stede 2020.
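The relation types described above can be pictured as a small data structure. The following is only an illustrative sketch: the class and field names are hypothetical and do not reflect the corpus file format, the German sentence is invented rather than taken from the PCC, and the type labels and the sense label follow common PDTB usage.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RelationType(Enum):
    """The five PDTB-style relation types mentioned in the text."""
    EXPLICIT = "Explicit"   # signalled by a connective
    IMPLICIT = "Implicit"   # no connective present
    ALTLEX = "AltLex"       # alternative lexicalisation
    ENTREL = "EntRel"       # entity relation
    NOREL = "NoRel"         # no relation


@dataclass
class DiscourseRelation:
    # Hypothetical field names, chosen for illustration only.
    rel_type: RelationType
    arg1: str
    arg2: str
    connective: Optional[str] = None
    sense: Optional[str] = None

    def __post_init__(self):
        # An explicit relation is by definition anchored on a connective.
        if self.rel_type is RelationType.EXPLICIT and not self.connective:
            raise ValueError("explicit relations need a connective")


# Invented example sentence:
# "Die Straße ist gesperrt, weil die Brücke saniert wird."
rel = DiscourseRelation(
    rel_type=RelationType.EXPLICIT,
    arg1="Die Straße ist gesperrt",
    arg2="die Brücke saniert wird",
    connective="weil",
    sense="Contingency.Cause.Reason",
)
print(rel.rel_type.value, rel.connective, rel.sense)
```

Keeping the connective optional lets the same record type cover implicit, entity and 'no relation' instances, which have no connective to anchor on.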
Fig. 4. RST annotation with the RSTTool.
Summaries
The PCC also contains extractive summaries that consist of the three “most important” sentences of a text. The syntax annotations were used to split the texts into sentences. We then asked annotators to choose the three sentences that “represent the core of the text”. The three sentences are ranked by importance, with the most important sentence labelled “1”. The summaries are available in JSON and .txt format and can be found here. More information can be found in Hewett & Stede 2022.
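A ranked summary of this kind could be read as sketched below. The JSON layout is an assumption made for illustration: the text ID and the field names `rank` and `sentence` are hypothetical, and the released files may be organised differently.

```python
import json

# Hypothetical layout: the text ID and the field names "rank" and "sentence"
# are assumptions for illustration, not the documented schema of the PCC files.
SAMPLE = """
{
  "example-text": [
    {"rank": 2, "sentence": "Second most important sentence."},
    {"rank": 1, "sentence": "Most important sentence."},
    {"rank": 3, "sentence": "Third most important sentence."}
  ]
}
"""


def ranked_summary(entries):
    """Return the summary sentences ordered by rank (1 = most important)."""
    return [e["sentence"] for e in sorted(entries, key=lambda e: e["rank"])]


summaries = json.loads(SAMPLE)
for text_id, entries in summaries.items():
    for position, sentence in enumerate(ranked_summary(entries), start=1):
        print(f"{text_id} [{position}] {sentence}")
```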
Formats
The annotations of the Potsdam Commentary Corpus are provided in their various source formats:
Annotation Layers | Formats | Tool |
---|---|---|
Parts of Speech, Morphology, Syntax | TIGER XML, NEGRA export format | TIGERSearch |
Coreference | MMAX2 | MMAX2 |
Connectives | inline XML | ConAno |
Rhetorical Structure Theory | RS3 | RSTTool |
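To illustrate one of the source formats listed above, here is a minimal sketch that reads an RS3 document with Python's standard library. The XML fragment is hand-written and simplified, based on the general layout of RSTTool files (`segment` and `group` elements linked through `parent` and `relname` attributes); it is not taken from the corpus, and real files carry more detail. The function name `read_rs3_links` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written RS3 fragment (not from the PCC).
RS3_SAMPLE = """
<rst>
  <header>
    <relations>
      <rel name="evidence" type="rst"/>
    </relations>
  </header>
  <body>
    <segment id="1" parent="3" relname="span">Claim.</segment>
    <segment id="2" parent="1" relname="evidence">Support.</segment>
    <group id="3" type="span"/>
  </body>
</rst>
""".strip()


def read_rs3_links(xml_text):
    """Collect segment texts and (child, relation, parent) links."""
    root = ET.fromstring(xml_text)
    segments = {}
    links = []
    for node in root.find("body"):
        if node.tag == "segment":
            segments[node.get("id")] = (node.text or "").strip()
        parent = node.get("parent")
        if parent is not None:  # the root node has no parent attribute
            links.append((node.get("id"), node.get("relname"), parent))
    return segments, links


segments, links = read_rs3_links(RS3_SAMPLE)
for child, relname, parent in links:
    print(f"{child} --{relname}--> {parent}")
```

In this fragment, segment 2 is a satellite attached to segment 1 by an `evidence` relation, and segment 1 is linked to the root group by a `span` edge.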
For programmatic access to the corpus, we developed discoursegraphs, a graph-based converter and merging library. It parses all the annotation formats used in the PCC and merges them into a single NetworkX-based graph representation. The graph can either be queried directly or exported to various generic graph formats (neo4j, dot, GEXF, GML, GraphML).
References
[Brants et al. 2004] Brants, Sabine, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith and Hans Uszkoreit (2004). "TIGER: Linguistic Interpretation of a German Corpus". Research on Language and Computation 2(4): 597-620.
[O'Donnell 2000] O'Donnell, Mich (2000). "RSTTool 2.4 - a markup tool for Rhetorical Structure Theory". In Proceedings of the 1st International Natural Language Generation Conference, Mitzpe Ramon, Israel.
[Krasavina & Chiarcos 2007] Krasavina, Olga and Christian Chiarcos (2007). "PoCoS: Potsdam Coreference Scheme". Proceedings of the First Linguistic Annotation Workshop (LAW). Held in conjunction with ACL-2007. Prague, 2007, pp. 156-163.
[Mann & Thompson 1988] Mann, William C. and Sandra A. Thompson (1988). "Rhetorical Structure Theory: Toward a functional theory of text organization". In: Text 8 (3), pp. 243-281.
[Müller & Strube 2001] Müller, Christoph and Michael Strube (2001). "Annotating Anaphoric and Bridging Relations with MMAX". In: Proceedings of the 2nd SIGdial Workshop on Discourse and Dialogue, Aalborg, Denmark, September 1-2, 2001, pp. 90-95.
[Reitter & Stede 2003] Reitter, David and Manfred Stede (2003). "Step by step: underspecified markup in incremental rhetorical analysis". In: Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) (at EACL 2003), Budapest, 2003.
[Stede 2004] Stede, Manfred (2004). "The Potsdam Commentary Corpus". Proc. of the ACL 2004 Workshop on Discourse Annotation, pp. 96-102.
[Stede 2016a] Stede, Manfred, ed. (2016). "Handbuch Textannotation: Potsdamer Kommentarkorpus 2.0". Potsdam Cognitive Science Series Vol. 8. Universitätsverlag Potsdam, 2016. Online edition: urn:nbn:de:kobv:517-opus4-82761
[Stede & Heintze 2004] Stede, Manfred and S. Heintze (2004). "Machine-assisted rhetorical structure annotation". Proc. of the 20th International Conference on Computational Linguistics (COLING), Geneva.
[Stede & Neumann 2014] Stede, Manfred and A. Neumann (2014). "Potsdam Commentary Corpus 2.0: Annotation for Discourse Research". Proc. of the Language Resources and Evaluation Conference (LREC), Reykjavik.
[Stede 2016b] Stede, Manfred, ed. (2016). "Das Potsdamer Kommentarkorpus". In: H. Lenk (ed.): Persuasionsstile in Europa II. Hildesheim: Olms.
[Bourgonje & Stede 2020] Bourgonje, Peter and Manfred Stede (2020). "The Potsdam Commentary Corpus 2.2: Extending Annotations for Shallow Discourse Parsing". Proc. of the Language Resources and Evaluation Conference (LREC), Marseille.
[Hewett & Stede 2022] Hewett, Freya and Manfred Stede (2022). "Extractive summarisation for German-language data: a text-level approach with discourse features". Proceedings of the 29th International Conference on Computational Linguistics (COLING), Gyeongju, Republic of Korea.