This paper deals with multiword lexemes (MWLs), focussing on two types of verbal MWLs: verbal idioms and support verb constructions. We discuss the characteristic properties of MWLs, namely nonstandard compositionality, restricted substitutability of components, and restricted morpho-syntactic flexibility, and we show how these properties may cause serious problems during the analysis, generation, and transfer steps of machine translation systems. In order to cope with these problems, MT lexicons need to provide detailed descriptions of MWL properties. We list the types of information which we consider the necessary minimum for a successful processing of MWLs, and report on some feasibility studies aimed at the automatic extraction of German verbal multiword lexemes from text corpora and machine-readable dictionaries.
The Partitur Format at BAS
(1997)
Most spoken language resources are produced and disseminated together with symbolic information relating to the speech signal. These include, for instance, orthographic transcripts and labeling and segmentation at the phonological, phonetic, prosodic, and phrasal levels. Most of the known formats for these symbolic data are defined in a ‘closed form’ that is not flexible enough to allow simple, platform-independent processing and easy extension.
At the Bavarian Archive for Speech Signals (BAS), a new format has been developed and used over the last few years that shows some significant advantages over other existing formats. This paper describes the basic principles behind this format, briefly discusses its advantages, and gives detailed definitions of the description levels used so far.
In the context of the HyTex project, our goal is to convert a corpus into a hypertext, basing conversion strategies on annotations which explicitly mark up the text-grammatical structures and relations between text segments. Domain-specific knowledge is represented in the form of a knowledge net, using topic maps. We use XML as an interchange format. In this paper, we focus on a declarative rule language designed to express conversion strategies in terms of text-grammatical structures and hypertext results. The strategies can be formulated in a concise formal syntax which is independent of the markup, and which can be transformed automatically into executable program code.
Online Access Tools for Spoken German: The Resources of the Deutsches Spracharchiv in a Database
(2002)
This paper presents some details of the modernization of the Deutsches Spracharchiv (DSAv) and explores some future possibilities for linguistic documentation and analysis using the Web. The Institut für Deutsche Sprache (IDS) in Mannheim is the central institution for linguistic research in Germany. The DSAv at the IDS is the center for documentation and research of spoken German. Its archives include the largest collection of sound recordings of spoken German (dialects and colloquial speech, including, for example, many extinct dialects of former German territories in Eastern Europe), altogether more than 15,000 sound recordings. The lack of clarification and accessibility of this material has been felt to be an essential deficit. The opportunity to edit the sound signal digitally offers much easier access to spoken language. Integrating the existing information about the corpora and the transcribed texts into an information and full-text database, and linking the data with the acoustic signal (alignment), yields a data pool with considerably better documentation of the materials and fast, direct access to the recorded sounds. Thus, the DSAv opens up entirely new research questions for the work at the IDS, as well as for linguistics as a whole.
In this paper, we investigate the practical applicability of Co-Training for the task of building a classifier for reference resolution. We are concerned with the question of whether Co-Training can significantly reduce the amount of manual labeling work and still produce a classifier with acceptable performance.
We describe a simple and efficient Java object model and application programming interface (API) for (possibly multi-modal) annotated natural language corpora. Corpora are represented as elements like Sentences, Turns, Utterances, Words, Gestures and Markables. The API allows linguists to access corpora in terms of these discourse-level elements, i.e. at a conceptual level they are familiar with, with the flexibility offered by a general purpose programming language. It is also a contribution to corpus standardization efforts because it is based on a straightforward and easily extensible data model which can serve as a target for conversion of different corpus formats.