Discourse Structure

In computational linguistics, the term discourse refers to a communicative event such as a text or a dialog. In recent years, the work in our lab has focused on monologue text. We study text structure both from a theoretical perspective and with the goal of automatic analysis. We can give only a very brief summary here. One of our favorite phenomena is connectives, which are covered in a separate section below.

The Structure of Discourse

In a text, structure arises on multiple levels of description (cf. the German monograph Stede 2018, or Stede 2008), for example:

A fair amount of our work has dealt with Rhetorical Structure Theory (RST; Mann/Thompson 1988), both for text analysis and for text generation. We provide annotation guidelines for RST, which originated in collaboration with Maite Taboada (SFU, Vancouver) (Stede et al. 2017). We have also suggested certain modifications to the theory, especially concerning nuclearity (Stede 2008). Our Potsdam Commentary Corpus (PCC) has RST trees as one annotation layer.

We also provide these guidelines for RST annotation on Twitter.

A view of discourse analysis that makes fewer commitments to an overarching text structure is embodied in the Penn Discourse Treebank (PDTB), and it led to the computational task of shallow discourse parsing (SDP). To help bootstrap SDP work on German, we created a machine-translated German version of the PDTB texts and automatically projected the discourse annotations onto it (Sluyter-Gaethje et al. 2020).
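To make the shallow view more concrete, here is a minimal Python sketch of how a single PDTB-style relation might be represented: a connective, two argument spans and a sense label, with no tree built on top. The class name, the field names and the helper functions are hypothetical, and the sense label is only illustrative; this is not the PDTB file format and not the data model of any of our tools.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]  # character offsets [start, end) into the text


@dataclass(frozen=True)
class ShallowRelation:
    """One discourse relation in a flat, PDTB-like representation (hypothetical)."""
    arg1: Span
    arg2: Span
    sense: str
    connective: Optional[Span] = None  # None stands for an implicit relation

    @property
    def is_explicit(self) -> bool:
        # A relation is explicit when it is signalled by a connective.
        return self.connective is not None


def find_span(text: str, substring: str) -> Span:
    """Locate the first occurrence of substring and return its offsets."""
    start = text.index(substring)
    return (start, start + len(substring))


def span_text(text: str, span: Span) -> str:
    """Return the stretch of text covered by a span."""
    return text[span[0]:span[1]]


text = "The match was cancelled because it rained."
relation = ShallowRelation(
    arg1=find_span(text, "The match was cancelled"),
    arg2=find_span(text, "it rained"),
    sense="Contingency.Cause",  # illustrative label only
    connective=find_span(text, "because"),
)
```

Each relation stands on its own here: a text is simply a list of such records, which is what distinguishes the shallow approach from a representation that builds one tree over the whole text.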

Other levels of analysis are equally important to us; see, e.g., our separate page on Coreference.

Discourse Connectives

Discourse connectives are lexical items that encode semantic or pragmatic relations, such as causality or contrast, between adjacent spans of text. We have developed analyses of particular connectives and groups of connectives (especially causal, contrastive and concessive ones), as well as comparative analyses across several languages. One result of our work is DiMLex, a computer-readable discourse connective lexicon for German (Stede 2002). More recently, we contributed to building an Italian version (Feltracco et al. 2016), a Dutch version (Bourgonje et al. 2018), an English version (Das et al. 2018) and a Bangla version (Das et al. 2020), and we released a multilingual connective database (see "Resources" below) that was built in collaboration with partners from the TextLink network.

Also, one of the annotation layers in our Potsdam Commentary Corpus (PCC) follows the Penn Discourse Treebank framework, including annotations for connectives, their arguments and relation sense.
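As a rough illustration of what a connective lexicon is used for, the following Python sketch looks up the tokens of a sentence in a tiny table that maps connectives to the relations they can signal. The entries, the relation labels and the function name are invented for this example; they do not reflect the actual format or contents of DiMLex.

```python
import re
from typing import Dict, List, Tuple

# Toy entries for illustration only; NOT the DiMLex format or its contents.
TOY_LEXICON: Dict[str, List[str]] = {
    "weil": ["cause"],
    "aber": ["contrast"],
    "obwohl": ["concession"],
    "während": ["contrast", "temporal"],  # one form, more than one relation
}


def find_connective_candidates(sentence: str) -> List[Tuple[str, int, List[str]]]:
    """Return (token, character offset, possible relations) for each lexicon hit."""
    hits = []
    for match in re.finditer(r"\w+", sentence):
        token = match.group().lower()
        if token in TOY_LEXICON:
            hits.append((token, match.start(), TOY_LEXICON[token]))
    return hits


sentence = "Sie blieb zu Hause, weil es regnete."
candidates = find_connective_candidates(sentence)
```

A lookup of this kind only yields candidates. Many word forms also have non-connective readings, and some connectives can signal several relations, so a separate disambiguation step is needed before the matches can be treated as discourse relations.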

Discourse Parsing

Our work on automatic analysis started with the first SVM-based RST parser (for German) by Reitter (2003). More recently, our focus has shifted to shallow discourse parsing in the style of the Penn Discourse Treebank (PDTB). With colleagues in Oslo and Teesside, we built the best-performing English discourse parser for the CoNLL 2016 Shared Task (Oepen et al. 2016). Furthermore, we built the first shallow discourse parser for German (Bourgonje/Stede 2019/20) and made a beta version available for research purposes (see below).

Other topics that we have addressed include the analysis of genre-specific zones (e.g., Bieler et al. 2007) and the disambiguation of German connectives (Dipper/Stede 2006, Schneider/Stede 2012, Bourgonje/Stede 2018).

For a general overview of the subtopics of discourse processing, see the monograph Discourse Processing (Stede 2011).

Related Projects

Related Resources

Related publications: