Corpus linguistics (Korpuslinguistik)
This paper presents the current results of an ongoing research project on the corpus distribution of prepositions and pronouns within Polish preposition-pronoun contractions. The goal of the project is to provide a quantitative description of Polish preposition-pronoun contractions that takes into account the morphosyntactic properties of their components. The results are expected to provide a basis for revising the traditionally assumed inflectional paradigms of Polish pronouns and, thus, for a possible remodeling of these paradigms. The results of corpus-based investigations of the distribution of prepositions within preposition-pronoun contractions can also be used for grammar-theoretical and lexicographic purposes.
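A quantitative description of this kind starts from simple frequency counts. The following is a minimal sketch, not the project's actual method: the lookup table `CONTRACTIONS` covers only four well-known contracted forms, and `TOY_TOKENS` is invented illustration data, not corpus material.

```python
from collections import Counter

# Assumption: a tiny hand-written lookup from contracted form to its preposition.
# A real study would derive this from morphosyntactic corpus annotation.
CONTRACTIONS = {"nań": "na", "doń": "do", "zań": "za", "weń": "w"}


def preposition_counts(tokens):
    """Count how often each preposition occurs inside a known contraction."""
    return Counter(
        CONTRACTIONS[token.lower()]
        for token in tokens
        if token.lower() in CONTRACTIONS
    )


# Invented toy input, used only to show the call.
TOY_TOKENS = ["Spojrzał", "nań", "i", "podszedł", "doń", ",",
              "potem", "znów", "nań", "spojrzał"]

counts = preposition_counts(TOY_TOKENS)
```

Normalising such counts by the overall frequency of each preposition would be the natural next step before comparing distributions.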
The aim of the paper is twofold. First, an approach is presented for selecting the correct antecedent of an anaphoric element according to the kind of text segments in which both occur. Basically, information on logical text structure (e.g. chapters, sections, paragraphs) is used to determine the antecedent life span of a linguistic expression, i.e. some linguistic expressions are more likely than others to be chosen as an antecedent throughout the whole text. In addition, an appropriate search scope for an anaphoric expression can be defined according to the document structuring elements that contain it. Corpus investigations suggest that logical text structure influences the search scope of antecedent candidates. Second, a solution is presented for integrating the resources used for anaphora resolution: multi-layered XML annotation makes a set of resources accessible to the anaphora resolution system.
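The idea of a structure-dependent search scope can be illustrated with a minimal sketch. It is not the system described in the paper: the `Mention` record, the three scope names, and the sample mentions are all hypothetical, and a real resolver would read the structure from the XML annotation layers rather than from hard-coded indices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    section: int    # index of the enclosing section (assumed layout)
    paragraph: int  # index of the enclosing paragraph within that section
    position: int   # token offset in the whole document


def candidates_in_scope(anaphor, mentions, scope):
    """Return preceding mentions inside the anaphor's structural unit, nearest first."""
    def same_unit(m):
        if scope == "document":
            return True
        if scope == "section":
            return m.section == anaphor.section
        if scope == "paragraph":
            return (m.section, m.paragraph) == (anaphor.section, anaphor.paragraph)
        raise ValueError(f"unknown scope: {scope}")

    preceding = [m for m in mentions
                 if m.position < anaphor.position and same_unit(m)]
    return sorted(preceding, key=lambda m: anaphor.position - m.position)


# Invented example mentions, for illustration only.
MENTIONS = [
    Mention("the corpus", 0, 0, 3),
    Mention("the parser", 0, 1, 20),
    Mention("the annotation", 1, 0, 41),
]
ANAPHOR = Mention("it", 1, 0, 47)

in_paragraph = [m.text for m in candidates_in_scope(ANAPHOR, MENTIONS, "paragraph")]
in_document = [m.text for m in candidates_in_scope(ANAPHOR, MENTIONS, "document")]
```

Widening the scope from paragraph to document enlarges the candidate list, which is where an antecedent life span per expression type would come in as a further filter.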
Linguistic corpora have been annotated by means of SGML-based markup languages for almost 20 years. We can, very roughly, differentiate between three distinct evolutionary stages of markup technologies. (1) Originally, single SGML tree-based document instances were deemed sufficient for the representation of linguistic structures. (2) Linguists began to realize that alternatives and extensions to the traditional model are needed. Formalisms such as NITE were proposed: the NITE Object Model (NOM) consists of multi-rooted trees. (3) We are now on the threshold of the third evolutionary stage: even NITE's very flexible approach is not suited for all linguistic purposes. As some structures cannot be modeled by multi-rooted trees, an even more flexible approach is needed in order to provide a generic annotation format that is able to represent genuinely arbitrary linguistic data structures.
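The step from stage (1) to stage (2) can be made concrete with a small sketch. It assumes annotations are half-open token spans `(start, end)` over a shared token sequence; the two layers below are invented, and the code does not reproduce NITE's actual data model. Spans that partially overlap cannot be nested in one tree, whereas each layer can still form its own tree over the same tokens.

```python
def crosses(a, b):
    """True if two half-open spans overlap partially, i.e. neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def fits_single_tree(spans):
    """True if all spans are properly nested or disjoint, as a single tree requires."""
    return not any(
        crosses(a, b)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
    )


# Hypothetical annotation layers over the same six tokens.
SYNTAX = [(0, 6), (0, 2), (2, 6)]   # e.g. a clause and two phrases
PROSODY = [(0, 3), (3, 6)]          # e.g. two intonation units

each_layer_is_a_tree = fits_single_tree(SYNTAX) and fits_single_tree(PROSODY)
merged_is_a_tree = fits_single_tree(SYNTAX + PROSODY)
```

Here each layer passes the check on its own, but the merged span set does not, because the phrase `(2, 6)` and the intonation unit `(0, 3)` cross.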
Discourse segmentation is the division of a text into minimal discourse segments, which form the leaves of the trees used to represent discourse structures. A definition of elementary discourse segments in German is provided by adapting widely used segmentation principles for English minimal units, while considering punctuation, morphology, syntax, and aspects of the logical document structure of a complex text type, namely scientific articles. The algorithm and implementation of a discourse segmenter based on these principles are presented, as well as an evaluation of test runs.
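As a rough illustration of punctuation-based segmentation only, the sketch below splits a German sentence at a comma that is followed by a subordinating conjunction. It is not the segmenter described in the paper: the cue list `CONJUNCTIONS` is a hypothetical, deliberately tiny sample, and morphology, syntax, and document structure are ignored entirely.

```python
import re

# Assumption: a very small sample of German subordinating conjunctions.
CONJUNCTIONS = ("weil", "obwohl", "dass", "während")

# Split at a comma plus whitespace, but only if a cue word follows.
# The lookahead keeps the conjunction at the start of the next segment.
_BOUNDARY = re.compile(r",\s+(?=(?:%s)\b)" % "|".join(CONJUNCTIONS))


def segment(sentence):
    """Split one sentence into candidate minimal discourse segments."""
    parts = _BOUNDARY.split(sentence.strip())
    return [part.strip() for part in parts if part.strip()]


segments = segment("Wir nutzen XML, weil die Struktur wichtig ist.")
unsplit = segment("Punkte, Kommas und Syntax zählen.")
```

The second call stays unsplit because its comma separates list items rather than clauses, which is exactly the kind of case where punctuation alone is insufficient and further linguistic information is needed.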
An appropriate, objective discussion of the strengths and weaknesses, possibilities and limits of corpus linguistics is overshadowed by many myths that have become established over time and that come up again and again in many discussions, especially among linguists. Here we would like to compile some of the most widespread myths and discuss their background from a corpus-linguistic perspective.