410 Linguistics
Linguistic usage patterns are not just coincidental phenomena on the textual surface but constitute a fundamental constructional principle of language. At the same time, however, linguistic patterns are highly idiosyncratic in the sense that they tend to be item-specific and unpredictable, thus defying all attempts at capturing them by general abstract rules. […] What all these approaches [that deal with constructions, collocations, patterns, etc. K.S.] share, in addition to their interest in recurrent patterns, is a strong commitment to the value of usage, be it in the wider sense of usage as an empirical basis for sound linguistic analysis and description or in the narrower sense of usage as constituting the basis for the emergence and consolidation of linguistic knowledge. (Herbst et al. 2014: 1)
Because language data can now be studied in new quantitative dimensions, phraseology faces a paradigm shift. The traditional focus on strongly lexicalized, often idiomatic multi-word expressions (MWEs) has led to an overestimation of their unique status in the mental lexicon. The majority of MWEs are typical lexical realisations of templates ('MW patterns') that emerged from repeated usage and can be instantiated with ever-changing lexical elements. The pattern restrictions, which are primarily functional, cannot always be predicted by rules but result from recurring contextual factors. This article first shows the nature and interrelations of MW patterns as reconstructed with complex corpus-driven methods. It then discusses a vision of a new phraseography of MW patterns that describes their hierarchies and functions on the basis of authentic corpus data such as KWIC bundles, slot-filler tables and collocation profiles.
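As a rough illustration of what a slot-filler table and KWIC lines for an MW pattern might look like, the following minimal Python sketch counts the fillers of one open slot. The toy corpus, the pattern `in the X of`, and all function names are hypothetical and are not taken from the article, which works with large authentic corpora and more complex corpus-driven methods.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
CORPUS = [
    "the plan changed in the course of time",
    "she spoke in the name of the group",
    "in the course of the debate he left",
    "prices rose in the wake of the crisis",
    "they met in the course of the week",
]

# A multi-word pattern with one open slot; None marks the slot.
PATTERN = ("in", "the", None, "of")


def find_matches(tokens, pattern):
    """Yield (start_index, slot_filler) for each match of pattern in tokens."""
    n = len(pattern)
    slot = pattern.index(None)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(p is None or p == w for p, w in zip(pattern, window)):
            yield i, window[slot]


def slot_filler_table(corpus, pattern):
    """Count how often each word fills the open slot of the pattern."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for _, filler in find_matches(tokens, pattern):
            counts[filler] += 1
    return counts


def kwic(corpus, pattern, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for sentence in corpus:
        tokens = sentence.lower().split()
        for start, _ in find_matches(tokens, pattern):
            end = start + len(pattern)
            left = " ".join(tokens[max(0, start - width):start])
            node = " ".join(tokens[start:end])
            right = " ".join(tokens[end:end + width])
            lines.append(f"{left:>25} | {node} | {right}")
    return lines


if __name__ == "__main__":
    for filler, freq in slot_filler_table(CORPUS, PATTERN).most_common():
        print(f"{filler:<10}{freq}")
    print("\n".join(kwic(CORPUS, PATTERN)))
```

In this sketch the slot is matched on surface tokens only; a real analysis would presumably also draw on lemmas, part-of-speech information and statistical association measures.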
We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question of whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based feature extraction (various aggregated part-of-speech-based features, n-grams, lexico-grammatical patterns) with automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
Recently, a claim was made on the basis of the German Google Books 1-gram corpus (Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books. Science 2010; 331: 176–82) that there was a linear relationship between six non-technical, non-Nazi words and three ‘explicitly Nazi words’ during World War II (Caruana-Galizia. 2015. Politics and the German language: Testing Orwell’s hypothesis using the Google N-Gram corpus. Digital Scholarship in the Humanities [Online]. http://dsh.oxfordjournals.org/cgi/doi/10.1093/llc/fqv011 (accessed 15 April 2015)). Here, I try to show that apparent relationships like this are the result of misspecified models that do not take into account the temporal structure of time-series data. The main point of this article is to demonstrate why such analyses risk incorrect statistical inference, in which apparent effects are meaningless and can lead to wrong conclusions.
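One well-known way in which such apparent relationships can arise is sketched below in Python, using simulated data rather than the Google Books corpus. Two series are generated independently as random walks with drift, so any correlation between their levels reflects only their shared dependence on time. Correlating the first differences is one simple way to take the temporal structure into account. The drift value, series length, seed and function names are assumptions chosen for illustration and do not reproduce the article's own analysis.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def random_walk_with_drift(n, drift, rng):
    """Simulate x_t = x_{t-1} + drift + e_t with standard normal e_t."""
    level, series = 0.0, []
    for _ in range(n):
        level += drift + rng.gauss(0.0, 1.0)
        series.append(level)
    return series


def first_differences(series):
    """Return x_t - x_{t-1} for each consecutive pair of observations."""
    return [b - a for a, b in zip(series, series[1:])]


# Two series generated independently of each other (illustrative values).
rng = random.Random(42)
x = random_walk_with_drift(500, 0.5, rng)
y = random_walk_with_drift(500, 0.5, rng)

r_levels = pearson(x, y)
r_diffs = pearson(first_differences(x), first_differences(y))

if __name__ == "__main__":
    print(f"correlation of levels:            {r_levels:.3f}")
    print(f"correlation of first differences: {r_diffs:.3f}")
```

Under these assumptions the levels are strongly correlated even though the series are unrelated by construction, whereas the differenced series show no comparable association.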