OPUS 4 | Search

7 search hits

1 to 7

Sort by

Matrix and double-array representations for efficient finite state tokenization (2022)

Diewald, Nils

This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.

The use of graphs as a data structure for reusing lexical resources (2007)

Trippel, Thorsten

Lexical resources are often represented in table form, e. g., in relational databases, or represented in specially marked up texts, for example, in document based XML models. This paper describes how it is possible to model lexical structures as graphs and how this model can be used to exploit existing lexical resources and even how different types of lexical resources can be combined.

Modeling and annotating complex data structures (2019)

Bański, Piotr ; Witt, Andreas

Datenmodellierung (2016)

Herold, Axel ; Meyer, Peter ; Müller-Spitzer, Carolin

The Impact of Lacking Metadata for the Measurement of Cultural and Linguistic Change Using the Google Ngram Data Sets—Reconstructing the Composition of the German Corpus in Times of WWII (2017)

Koplenig, Alexander

The Google Ngram Corpora seem to offer a unique opportunity to study linguistic and cultural change in quantitative terms. To avoid breaking any copyright laws, the data sets are not accompanied by any metadata regarding the texts the corpora consist of. Some of the consequences of this strategy are analyzed in this article. I chose the example of measuring censorship in Nazi Germany, which received widespread attention and was published in a paper that accompanied the release of the Google Ngram data (Michel et al. (2010): Quantitative analysis of culture using millions of digitized books. Science, 331(6014): 176–82). I show that without proper metadata, it is unclear whether the results actually reflect any kind of censorship at all. Collectively, the findings imply that observed changes in this period of time can only be linked directly to World War II to a certain extent. Therefore, instead of speaking about general linguistic or cultural change, it seems to be preferable to explicitly restrict the results to linguistic or cultural change ‘as it is represented in the Google Ngram data’. On a more general level, the analysis demonstrates the importance of metadata, the availability of which is not just a nice add-on, but a powerful source of information for the digital humanities.

Introduction: Modeling, Learning and Processing of Text-Technological Data Structures (2011)

Mehler, Alexander ; Kühnberger, Kai-Uwe ; Lobin, Henning ; Lüngen, Harald ; Storrer, Angelika ; Witt, Andreas

Researchers in many disciplines, sometimes working in close cooperation, have been concerned with modeling textual data in order to account for texts as the prime information unit of written communication. The list of disciplines includes computer science and linguistics as well as more specialized disciplines like computational linguistics and text technology. What many of these efforts have in common is the aim to model textual data by means of abstract data types or data structures that support at least the semi-automatic processing of texts in any area of written communication.

The impact of lacking metadata and data truncation for the measurement of cultural and linguistic change using the Google Ngram datasets (2014)

Koplenig, Alexander

As a result of legal restrictions the Google Ngram Corpora datasets are a) not accompanied by any metadata regarding the texts the corpora consist of and the data are b) truncated to prevent an indirect conclusion from the n-gram to the author of the text. Some of the consequences of this strategy are discussed in this article.

1 to 7

Open Access

Refine

Author

Year of publication

Document Type

Language

Has Fulltext

Is part of the Bibliography

Keywords

Publicationstate

Reviewstate

Publisher

7 search hits