This paper describes general requirements for evaluating and documenting NLP tools, with a focus on morphological analysers and the design of a Gold Standard. It is argued that any evaluation must be measurable and that its documentation must be accessible to any user of the tool. The documentation must enable the user to compare different tools offering the same service; hence, the descriptions must contain measurable values. A Gold Standard is a vital part of any measurable evaluation process; therefore, the corpus-based design of a Gold Standard, its creation, and the problems that occur are reported here. Our project concentrates on SMOR, a morphological analyser for German that is to be offered as a web service. We not only use this analyser for designing the Gold Standard, but also evaluate the tool itself at the same time. Note that the project is ongoing; therefore, we cannot present final results.
Corpus-based identification and disambiguation of reading indicators for German nominalizations
(2010)
Corpus data is often structurally and lexically ambiguous; corpus extraction methodologies must therefore be made aware of ambiguities. Given an extraction task, all relevant ambiguities must be identified. To resolve these ambiguities, the contextual data responsible for one reading or another must be considered. In our present work, German -ung-nominalizations and their sortal readings are under examination. A number of these nominalizations may be read as an event or a result, depending on the semantic group they belong to. Here, we concentrate on nominalizations of verbs of saying (henceforth: "verba dicendi") and identify their context partners and their influence on the sortal reading of the nominalizations in question. We present a tool which calculates the sortal reading of such nominalizations and thus may improve not only corpus extraction, but also, e.g., machine translation. Lastly, we describe successful attempts to identify the correct sortal reading and present conclusions and future work.
This paper describes the application of probabilistic part-of-speech taggers to the Dzongkha language. A tag set of 66 tags, based on the Penn Treebank, is designed. A training corpus of 40,247 tokens is used to train the model. Using a lexicon extracted from the training corpus together with a lexicon from an available word list, we compared two statistical taggers. The best result achieved was 93.1% accuracy in a 10-fold cross-validation on the training set. The winning tagger was thereafter applied to annotate a 570,247-token corpus.
This study examines what kinds of cues and constraints for discourse interpretation can be derived from the logical and generic document structure of complex texts, using scientific journal articles as an example. We performed statistical analyses on a corpus of scientific articles annotated on different annotation layers within the framework of XML-based multi-layer annotation. We introduce different discourse segment types that constrain the textual domains in which to identify rhetorical relation spans, and we show how a canonical sequence of text type structure categories is derived from the corpus annotations. Finally, we demonstrate which text type structure categories assigned to complex discourse segments of the type “block” statistically constrain the occurrence of rhetorical relation types, and how they do so.
The possibility of electronically searching very large corpora of texts has opened up ways in which we can truly evaluate the rules through which grammarians have tried and continue to try to simulate natural languages. However, the ability to handle very large amounts of text might lead to problems with the assessment of certain phenomena that are hardly ever represented in those corpora and yet have always been regarded as grammatically correct elements of a given language. In German, typical phenomena of this kind are forms like betrögest or erwögest, i.e. the second person singular of the so-called strong verbs in the subjunctive mood. Should we see them merely as grammarians’ inventions? Before doing so, we should reconsider the nature of these phenomena. They may appear to be isolated word forms but are, in fact, compact realizations of syntactic constructions, and it is the frequency of these constructions that should be evaluated, not the frequency of their specific realizations.
Within cognitive linguistics, there is an increasing awareness that the study of linguistic phenomena needs to be grounded in usage. Ideally, research in cognitive linguistics should be based on authentic language use, its results should be replicable, and its claims falsifiable. Consequently, more and more studies now turn to corpora as a source of data. While corpus-based methodologies have increased in sophistication, the use of corpus data is also associated with a number of unresolved problems. The study of cognition through off-line linguistic data is, arguably, indirect, even if such data fulfils desirable qualities such as being natural, representative and plentiful. Several topics in this context stand out as particularly pressing issues. This discussion note addresses (1) converging evidence from corpora and experimentation, (2) whether corpora mirror psychological reality, (3) the theoretical value of corpus linguistic studies of ‘alternations’, (4) the relation of corpus linguistics and grammaticality judgments, and, lastly, (5) the nature of explanations in cognitive corpus linguistics. We do not claim to resolve these issues nor to cover all possible angles; instead, we strongly encourage reactions and further discussion.
This paper describes work directed towards the development of a syllable prominence-based prosody generation functionality for a German unit selection speech synthesis system. A general concept for syllable prominence-based prosody generation in unit selection synthesis is proposed. As a first step towards its implementation, an automated syllable prominence annotation procedure based on acoustic analyses has been performed on the BOSS speech corpus. The prominence labeling has been evaluated against an existing annotation of lexical stress levels and manual prominence labeling on a subset of the corpus. We discuss methods and results and give an outlook on further implementation steps.
This paper shows how corpora and related tools can be used to analyse and present significant colligational patterns lexicographically. In German, patterns such as das nötige Wissen vermitteln and sein Wissen unter Beweis stellen play a vital role when learning the language, as they exhibit relevant idiomatic usage and lexical and syntactic rules of combination. Each item has specific semantic and grammatical functions and particular preferences with respect to position and distribution. An analysis of adjectives, for example, identifies preferences in adverbial, attributive, or predicative functions.
Traditionally, corpus analyses of syntagmatic constructions have not been conducted for lexicographic purposes. This paper shows how to utilise corpora to extract and examine typical syntagms and how the results of such an analysis are documented systematically in ELEXIKO, a large-scale corpus-based Internet reference work of German. It also demonstrates how this dictionary accounts for the lexical and grammatical interplay between units in a syntagm and how authentic corpus material and complementary prose-style usage notes are a useful guide to text production or reception.
Some grammatical phenomena that only seldom appear in corpora of written language often coincide with speakers' uncertainty about a given form's grammatical status. Such displays of uncertainty are often subject to prescriptive criticism, which pays little attention to actual usage. However, thorough and discriminating corpus analyses can help to describe various low-frequency phenomena properly and to situate them more adequately in the grammatical system, against the background of different contexts, communicative situations, and language varieties. To exemplify this potential, this study examines three linguistic phenomena in German, using a corpus-based approach: the dative singular ending -e; the construction aus aller Herren Länder, which lacks the dative plural ending -n; and the non-standard preterite form frug. The results can be seen as a contribution to a more precise grammatical description on the one hand and, on the other, as a basis for an improved, more usage-oriented approach to providing practical advice to language users.