Resources
Our research projects and some students' term projects have led to several NLP resources, which we are making available.
Corpora
- The Potsdam Commentary Corpus (PCC): A corpus of multi-level annotated German newspaper commentaries
- arg-microtexts: A German-English parallel corpus of 112 short argumentative texts (plus 178 in English only) annotated with argumentation structures
- The Potsdam Twitter Sentiment Corpus (PotTS): A collection of 8,000 German tweets manually annotated with fine-grained sentiment relations
- German text complexity levels: A collection of encyclopedia-style texts for readers of different age groups. (See: Hewett/Stede 2021: Automatically evaluating the conceptual complexity of German texts)
- APA-RST: A text simplification corpus with RST annotations
Grammars and Lexica
- DiMLex: A lexicon of German discourse connectives
- connective-lex.info: A web interface to connective lexicons in nine languages
- klimadiskurs.info: A linguistically oriented online glossary of 250 German climate compound nouns used in politically oriented discourse
- A fragment of an OpenCCG grammar for German
Tools
- discopy: A shallow discourse parser for English, developed by Rene Knaebel
- GermanShallowDiscourseParser: An end-to-end shallow discourse parser for German that operates on plain text and returns the identified discourse relations as PDTB-style JSON output
- ANNIS3: An open-source linguistic database and query tool for multi-layer-annotated corpora (developed in the SFB D1 project with Anke Lüdeling's group at HU Berlin)
- ConnAnno: A Java tool for semi-automatically annotating connectives and their arguments
- GraPAT: A graph-based, web-based annotation tool suited for sentiment and argumentation structure annotation
- discoursegraphs: A converter and merging library for syntactic and discourse-related annotation formats (Tiger, PTB, RSTTool, MMAX, Connanno, EXMARaLDA) with output support for generic graph formats (neo4j, dot, GEXF, GML, GraphML)
- CRFSuite-0.13: An updated version of Naoaki Okazaki's CRFSuite that was extended with tree-structured and higher-order linear-chain and semi-Markov CRF models
- DiscourseSegmenter: A Python package providing rule-based and machine learning discourse segmenters
- DiscourseSenser: A Python package for sense disambiguation of discourse relations in PDTB-style discourse parsing
- OsloPots: A Docker image of the shallow discourse parser created for the CoNLL 2016 Shared Task competition
- SentiLex: A collection of tools for generating sentiment lexicons from neural word embeddings, corpora, and lexical taxonomies
- NarraSpeech: A tool for recognizing direct and indirect speech, thought and writing in German narrative