OPUS 4 | Search

22 search hits

11 to 20

Sort by

OCR Nachkorrektur des Royal Society Corpus (2019)

Klaus, Carsten ; Fankhauser, Peter ; Klakow, Dietrich

We present an approach for automatic detection and correction of OCR-induced misspellings in historical texts. The main objective is the post-correction of the digitized Royal Society Corpus, a set of historical documents from 1665 to 1869. Due to the aged material the OCR procedure has made mistakes, thus leading to files corrupted by thousands of misspellings. This motivates a post processing step. The current correction technique is a pattern-based approach which due to its lack of generalization suffers from bad recall. To generalize from the patterns we propose to use the noisy channel model. From the pattern based substitutions we train a corpus specific error model complemented with a language model. With an F1-Score of 0.61 the presented technique significantly outperforms the pattern based approach which has an F1-score of 0.28. Due to its more accurate error model it also outperforms other implementations of the noisy channel model.

OCR post-correction of the Royal Society Corpus based on the noisy channel model (2019)

Klaus, Carsten ; Klakow, Dietrich ; Fankhauser, Peter

How to get the computation near the data: improving data accessibility to, and reusability of analysis functions in corpus query platforms (2018)

Kupietz, Marc ; Diewald, Nils ; Fankhauser, Peter

The paper discusses use cases and proposals to increase the flexibility and reusability of components for analysis and further processing of analysis results in corpus query platforms by providing standardized interfaces to access data at multiple levels.

Visualizing Language Change in a Corpus of Contemporary German (2017)

Fankhauser, Peter ; Kupietz, Marc

Visual Correlation for Detecting Patterns in Language Change (2017)

Fankhauser, Peter ; Kupietz, Marc

Topical Diversification Over Time In The Royal Society Corpus (2016)

Fankhauser, Peter ; Knappen, Jörg ; Teich, Elke

The linguistic construal of disciplinarity: A data-mining approach using register features (2015)

Teich, Elke ; Degaetano-Ortlieb, Stefania ; Fankhauser, Peter ; Kermes, Hannah ; Lapshinova-Koltunski, Ekaterina

We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question whether these disciplines develop a distinctive language use—both individually and collectively—over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.

Combining macro- and microanalysis for exploring the construal of scientific disciplinarity (2014)

Fankhauser, Peter ; Kermes, Hannah ; Teich, Elke

CLARIN’s virtual language observatory (VLO) under scrutiny - the VLO taskforce of the CLARIN-D centres (2014)

Haaf, Susanne ; Fankhauser, Peter ; Trippel, Thorsten ; Eckart, Kerstin ; Hedeland, Hanna ; Herold, Axel ; Knappen, Jörg ; Schiel, Florian ; Stegmann, Jens ; Van Uytvanck, Dieter

Exploring and visualizing variation in language resources (2014)

Fankhauser, Peter ; Knappen, Jörg ; Teich, Elke

Language resources are often compiled for the purpose of variational analysis, such as studying differences between genres, registers, and disciplines, regional and diachronic variation, influence of gender, cultural context, etc. Often the sheer number of potentially interesting contrastive pairs can get overwhelming due to the combinatorial explosion of possible combinations. In this paper, we present an approach that combines well understood techniques for visualization heatmaps and word clouds with intuitive paradigms for exploration drill down and side by side comparison to facilitate the analysis of language variation in such highly combinatorial situations. Heatmaps assist in analyzing the overall pattern of variation in a corpus, and word clouds allow for inspecting variation at the level of words.

11 to 20

Open Access

Refine

Author

Year of publication

Document Type

Language

Has Fulltext

Is part of the Bibliography

Keywords

Publicationstate

Reviewstate

Publisher

22 search hits