Working package 3: Content analysis in German text corpora

In recent years, it has become apparent that the extensive and diverse international efforts in automatic text recognition at the level of single characters (the usual approach) have reached a quality that leaves room for improvement only in very difficult scenarios or at very great expense. In particular, the training data required for such models is either unavailable or can be obtained only at disproportionate cost, so that this area is currently dominated by large players such as Google, Microsoft and Facebook.


On the other hand, it has also become apparent that the actual use of text recognition results is now gaining importance. This can mean the semantic interpretation of what has been read, for the respective application, at all textual levels: words, phrases, sentences, paragraphs, pages, articles, documents, books, and even entire corpora. From a technical point of view, this higher-level reading of texts is today understood as Natural Language Processing (NLP), which ranges from technologically fundamental methods to complex tasks such as topic modeling, the recognition of values and opinions (stance detection, sentiment analysis), or text summarization.


In doing so, we assume that a novel combination of basic ideas from different areas will be necessary. So far, NLP methods have (mostly) been applied to texts that have already been fully read. However, since the results of text reading at the word level (and above) are still error-prone, the output of automated text recognition is widely corrected, sometimes with great manual effort, sometimes with (semi-)automated so-called post-OCR. If, however, stochastic reading results, such as those produced by neural networks, are used instead of conventional texts, then many probable text variants become available to the subsequent processing stages, variants that have so far been ignored.
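The idea of keeping stochastic reading results instead of a single corrected text can be sketched as follows. This is a minimal illustration, not the project's actual method: it assumes a hypothetical toy representation in which the recognizer returns, for each token position, a list of alternative readings with probabilities (akin to a confusion network), and it enumerates the most probable full-text variants under a simplifying independence assumption.

```python
import heapq
from itertools import product
from typing import List, Tuple

# Hypothetical toy representation: for each token position, the recognizer
# returns alternative readings with probabilities (a "confusion network").
TokenAlternatives = List[Tuple[str, float]]

def top_k_variants(positions: List[TokenAlternatives],
                   k: int = 3) -> List[Tuple[str, float]]:
    """Enumerate the k most probable full-text variants by joint
    probability, assuming independence between token positions."""
    variants = []
    for combo in product(*positions):
        text = " ".join(token for token, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        variants.append((text, prob))
    return heapq.nlargest(k, variants, key=lambda v: v[1])

# Example: two uncertain token positions from a (hypothetical) OCR run.
net = [
    [("Haus", 0.7), ("Hans", 0.3)],
    [("kaufen", 0.6), ("laufen", 0.4)],
]
for text, p in top_k_variants(net, k=2):
    print(f"{p:.2f}  {text}")  # prints "0.42  Haus kaufen" then "0.28  Haus laufen"
```

Downstream NLP stages could then score each variant semantically (e.g. against a language model or a topic model) rather than committing early to a possibly wrong single reading; exhaustive enumeration as above is only feasible for short spans, so beam search or lattice-based decoding would be used in practice.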


The further development of such an approach is also of great practical relevance, because it eliminates the need, and the enormous effort, to create and store text that is as close to perfect as possible for the huge text holdings of libraries and archives. Instead, the unfinished intermediate stage of stochastic reading results would be evaluated only in light of concrete goals.