The project "Anaphoricity in connectives: From corpus analysis to lexical description and consequences for discourse parsing" deals with non-structural discourse connectives in German and English. Discourse connectives are linguistic elements that connect two propositions (for example but, because, and however) and are thus essentially two-place predicates. Connectives are typically divided into structural and non-structural ones: structural connectives take their arguments based on syntactic constraints, whereas for non-structural connectives one argument can be inferred from the discourse and is hence anaphoric. This project focuses on the latter group and addresses the following key problems:
- Problem 1: Lexical ambiguity. Connectives in general are ambiguous with respect to (1a) non-connective readings (e.g., German da, which is a connective in the sense of 'since' but can also be a locative anaphor, 'there') and with respect to (1b) connective sense (e.g., nämlich: Reason, which is difficult to render with an English adverbial but is similar to 'after all', versus Elaboration/Specification, 'in particular').
- Problem 2: Non-adjacent external arguments (extargs). Contrary to simplifying assumptions in Rhetorical Structure Theory (RST) and in implemented discourse parsers, the two arguments of a connective need not be adjacent. Corpus evidence shows that non-adjacent arguments are in fact quite frequent.
- Problem 3: Vague boundaries of internal and external arguments (intargs and extargs). Given a non-structural connective, the precise boundaries of both types of arguments (but more often of extargs) are often difficult to agree on.
- Problem 4: Non-explicit extargs. In some cases, the extarg is not given explicitly in the text but must be inferred by the reader. This is an unresolved problem for manual annotation.
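Problems 2 and 4 can be made concrete with a small data structure. The following is a minimal sketch, not the project's annotation format: the class name ConnectiveInstance, the use of sentence indices as span units, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical span unit: (first_sentence, last_sentence), inclusive indices.
Span = Tuple[int, int]


@dataclass
class ConnectiveInstance:
    """One connective occurrence with its two arguments (illustrative only)."""
    connective: str
    intarg: Span                 # argument that hosts the connective
    extarg: Optional[Span]       # None models a non-explicit extarg (Problem 4)

    def extarg_is_explicit(self) -> bool:
        return self.extarg is not None

    def sentence_gap(self) -> Optional[int]:
        """Number of sentences lying between the two arguments.

        Returns None if the extarg is not explicit; values <= 0 mean the
        arguments touch or overlap.
        """
        if self.extarg is None:
            return None
        int_start, int_end = self.intarg
        ext_start, ext_end = self.extarg
        return max(int_start - ext_end, ext_start - int_end) - 1

    def extarg_is_adjacent(self) -> bool:
        gap = self.sentence_gap()
        return gap is not None and gap <= 0


# Invented examples, not corpus data.
adjacent = ConnectiveInstance("nämlich", intarg=(5, 5), extarg=(4, 4))
distant = ConnectiveInstance("nämlich", intarg=(5, 5), extarg=(2, 2))   # Problem 2
implicit = ConnectiveInstance("nämlich", intarg=(5, 5), extarg=None)    # Problem 4
```

A parser that assumes adjacency would handle only the first of the three instances correctly; the other two need a search over earlier discourse or an inference step. Problem 3 (vague boundaries) is not modelled here, since it concerns disagreement about the span values themselves.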
The computational-linguistic interest in connectives stems from the task of 'Shallow Discourse Parsing', which automatically detects the presence of coherence relations (such as those signalled by connectives) in text. Our project addresses this task as well and aims at making available the first such discourse parser for German.
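A shallow discourse parser typically starts by spotting connective candidates in running text. The sketch below illustrates only that first step under simplifying assumptions: the two-word candidate set, the regular-expression tokenizer, and the function name find_candidates are hypothetical and are not taken from the project's parser.

```python
import re
from typing import List, Set, Tuple

# Hypothetical toy candidate set; a real system would draw on a full lexicon.
TOY_CANDIDATES: Set[str] = {"da", "nämlich"}


def find_candidates(text: str, candidates: Set[str]) -> List[Tuple[int, str]]:
    """Return (token_index, token) for every token found in the candidate set.

    Matching is purely on the lowercased surface form, so the result still
    contains non-connective readings.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [(i, tok) for i, tok in enumerate(tokens) if tok in candidates]


# 'da' as a locative anaphor ('there') ...
locative = find_candidates("Da steht das Haus.", TOY_CANDIDATES)
# ... and 'da' as a causal connective ('since').
causal = find_candidates("Er blieb zu Hause, da er krank war.", TOY_CANDIDATES)
```

Both sentences yield a candidate for da, although only the second uses it as a connective. Surface matching alone therefore cannot resolve the lexical ambiguity; a subsequent disambiguation step is needed before arguments are extracted.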
Project Goals and Interim Results
On the linguistic side, the project seeks solutions for the above-mentioned problems through systematic studies of non-structural connectives in authentic contexts, i.e., based on corpora.
We consider the bilingual approach of our project important: subclassifications of connectives in terms of discourse-structural features, for example, are much more informative when carried out in parallel on more than one language. Compared to English, the relatively free word order of German makes many phenomena involving non-structural connectives more challenging. Specifically, the goals of this project are the following. Goals (1) and (2) are the core tasks of linguistic investigation, which address Problems 1-4 summarized above, and the computational work described after the list seeks to exploit the results for discourse parsing.
- 1) The first core research task within the project is the cross-lingual comparison and corpus-based lexical description of the specified target set of non-structural connectives in German and English, including the description of their translation constraints given by morphology, syntax, semantic class and further contextual features. Initial experiments with connective projection working with German-Italian are described in Bourgonje et al. (2017). These and further projection experiments led to several additions to DiMLex (Stede, 2002), our German lexicon of connectives.
- 2) The second core task is a detailed study of the argument assignment problems by means of providing corpus evidence and linguistic explanation. In Bourgonje and Stede (2019), we describe strategies for finding arguments of both structural and non-structural connectives and provide an overview (Table 1) of the distribution of argument positions in the Potsdam Commentary Corpus.
- 3) A bilingual connective database has been designed and implemented. The database is available at connective-lex.info and currently includes connective lexicons for nine different languages, four of which were actively worked on in the course of this project (Das et al., 2018; Bourgonje et al., 2018; Mendes et al., 2018; and DiMLex). The design of the database itself is detailed in Scheffler et al. (2018).
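To illustrate what a lexical description of a connective might record, here is a minimal sketch of a lexicon entry and a lookup index. The schema is hypothetical and is not the format of DiMLex or of connective-lex.info; the field names and the language code "de" are assumptions, and the entries carry only information stated in the problem list above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class LexiconEntry:
    """Illustrative connective entry (hypothetical schema)."""
    word: str
    language: str
    senses: List[str] = field(default_factory=list)
    non_connective_reading: bool = False   # Problem 1a

    def is_sense_ambiguous(self) -> bool:
        # Problem 1b: more than one connective sense is listed.
        return len(self.senses) > 1


def build_index(entries: Iterable[LexiconEntry]) -> Dict[Tuple[str, str], LexiconEntry]:
    """Index entries by (language, lowercased word) for direct lookup."""
    return {(e.language, e.word.lower()): e for e in entries}


entries = [
    LexiconEntry("nämlich", "de", senses=["Reason", "Elaboration/Specification"]),
    # Sense labels for 'da' are left empty because the text does not name them.
    LexiconEntry("da", "de", non_connective_reading=True),
]
index = build_index(entries)
```

Keying the index on the language as well as the word keeps identical strings from different languages apart, which matters once several lexicons share one database.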
On the computational side, the findings on argument assignment will be translated into consequences for the application of discourse structure annotation and for automatic discourse parsing. We aim at building a parser that, compared to the state of the art, uses more sophisticated ways of associating adverbial connectives with their arguments. Given the bilingual database, our approach will be applicable to both English and German, but the primary goal is to construct the first shallow discourse parser for German. Bourgonje and Stede (2018, 2019) describe experiments working toward this goal. The code for this parser, currently under development, is available here.
Furthermore, to support data-oriented exploration and analysis related to the phenomena explained above, in the course of this project, the Potsdam Commentary Corpus has been made publicly available through the ANNIS3 web corpus browser (Bourgonje and Stede, 2018), and a large Wikipedia dump (January 2019) has been indexed for efficient and convenient querying.
Peter Bourgonje, Prof. Dr. Manfred Stede, Dr. Yulia Grishina (2018)