Based on specific linguistic landmarks in the speech signal, this study investigates pitch level and pitch span differences in English, German, Bulgarian and Polish. The analysis is based on 22 speakers per language (11 males and 11 females). Linear mixed models were computed that include various linguistic measures of pitch level and span, revealing characteristic differences across languages and between language groups. Pitch level appeared to have significantly higher values for the female speakers in the Slavic group than in the Germanic group. The male speakers showed slightly different results, with only the Polish speakers displaying significantly higher mean values for pitch level than the German males. Overall, the results show that the Slavic speakers tend to have a wider pitch span than the German speakers. However, for the linguistic measure of span between the initial peaks and the non-prominent valleys, we find a difference only between Polish and German speakers. We found a flatter intonation contour in German than in Polish, Bulgarian and English male and female speakers, as well as differences in the frequency of the landmarks between languages. Concerning “speaker liveliness”, we found that the speakers from the Slavic group are significantly livelier than the speakers from the Germanic group.
An integrated database, search and tagging tool (IDaSTo) is presented that is particularly suited to variable analyses, parallel texts and diachronic studies. Relevant categories or variables can be defined individually, tags can be set freely in the text and in various ways, and their frequencies can be retrieved directly from the linked statistics.
The task-oriented and format-driven development of corpus query systems has led to the creation of numerous corpus query languages (QLs) that vary strongly in expressiveness and syntax. This is a severe impediment for the interoperability of corpus analysis systems, which lack a common protocol. In this paper, we present KoralQuery, a JSON-LD-based general corpus query protocol, aiming to be independent of particular QLs, tasks and corpus formats. In addition to describing the system of types and operations that KoralQuery is built on, we exemplify the representation of corpus queries in the serialized format and illustrate use cases in the KorAP project.
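To give a rough idea of what a QL-independent, JSON-LD-style query serialization could look like, the following Python sketch builds a single-token query as a nested dictionary and serializes it. The `@context` URL is a placeholder, and the `koral:`-prefixed type names and the field names (`wrap`, `layer`, `key`, `match`) are illustrative assumptions that should be checked against the KoralQuery specification; the function `token_query` is hypothetical.

```python
import json


def token_query(key, layer="orth"):
    """Build a JSON-LD-style document for a query matching one token.

    All field names and type identifiers here are assumptions made for
    illustration, not a verified rendering of the KoralQuery vocabulary.
    """
    return {
        # Placeholder context URL, not the real KoralQuery context.
        "@context": "http://example.org/context.jsonld",
        "query": {
            "@type": "koral:token",
            "wrap": {
                "@type": "koral:term",
                "layer": layer,       # annotation layer to search in
                "key": key,           # value the token must carry
                "match": "match:eq",  # equality match (assumed identifier)
            },
        },
    }


# Serialize a query for the surface form "Baum".
serialized = json.dumps(token_query("Baum"), indent=2)
print(serialized)
```

Because the query is plain structured data, front ends for different query languages could in principle emit the same document, which is the interoperability argument the abstract makes.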
With an increasing amount of text data available it is possible to automatically extract a variety of information about language. One way to obtain knowledge about subtle relations and analogies between words is to observe words which are used in the same context. Recently, Mikolov et al. proposed a method to efficiently compute Euclidean word representations which seem to capture subtle relations and analogies between words in the English language. We demonstrate that this method also captures analogies in the German language. Furthermore, we show that we can transfer information extracted from large non-annotated corpora into small annotated corpora, which are then, in turn, used for training NLP systems.
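The analogy property mentioned above is usually tested with vector arithmetic: for "a is to b as c is to d", the vector `b - a + c` should lie close to the vector of `d`. The following minimal Python sketch illustrates this with cosine similarity. The three-dimensional vectors are invented toy values chosen so that the example works, not embeddings trained on any corpus, and the function names are hypothetical.

```python
import math

# Toy vectors, invented for illustration only (not trained embeddings).
VECTORS = {
    "Mann": [1.0, 0.0, 0.2],
    "Frau": [1.0, 1.0, 0.2],
    "Onkel": [0.1, 0.0, 1.0],
    "Tante": [0.1, 1.0, 1.0],
    "Apfel": [0.5, 0.5, -1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is common in analogy evaluation.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("Mann", "Frau", "Onkel", VECTORS)
print(result)
```

With real embeddings, `VECTORS` would be replaced by the trained model's vocabulary and the search would run over all words in it.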
We investigate whether non-configurational languages, which display more word order variation than configurational ones, require more training data for a phenomenon to be parsed successfully. We perform a tightly controlled study comparing the dative alternation for English (a configurational language), German, and Russian (both non-configurational). More specifically, we compare the performance of a dependency parser when only canonical word order is present with its performance on data sets when all word orders are present. Our results show that for all languages, canonical data not only is easier to parse, but there exists no direct correspondence between the size of training sets containing free(er) word order variation and performance.
Ph@ttSessionz and Deutsch heute are two large German speech databases. They were created for different purposes: Ph@ttSessionz to test Internet-based recordings and to adapt speech recognizers to the voices of adolescent speakers, Deutsch heute to document regional variation of German. The databases differ in their recording technique, the selection of recording locations and speakers, elicitation mode, and data processing.
In this paper, we outline how the recordings were performed, how the data was processed and annotated, and how the two databases were imported into a single relational database system. We present acoustical measurements on the digit items of both databases. Our results confirm that the elicitation technique affects the speech produced, that f0 is quite comparable despite different recording procedures, and that large speech technology databases with suitable metadata may well be used for the analysis of regional variation of speech.
The IMS Open Corpus Workbench (CWB) software currently uses a simple tabular data model with proven limitations. We outline and justify the need for a new data model to underlie the next major version of CWB. This data model, dubbed Ziggurat, defines a series of types of data layer to represent different structures and relations within an annotated corpus; each such layer may contain variables of different types. Ziggurat will allow us to gradually extend and enhance CWB’s existing CQP-syntax for corpus queries, and also make possible more radical departures relative not only to the current version of CWB but also to other contemporary corpus-analysis software.
The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, if we can exploit the data properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsing facilitates precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly interconnected data is stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches. Such queries in the format of generalised treebank query languages will be automatically translated into SQL queries.
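The idea of translating a treebank-style pattern into SQL can be sketched as follows, using Python's built-in `sqlite3` module. The `token` table layout, the dictionary-based pattern format and the function `build_sql` are all hypothetical and chosen only for illustration; the abstract does not describe the actual schema or query language.

```python
import sqlite3


def build_sql(head, dependent, relation):
    """Translate a head/dependent pattern into a parameterised SQL query.

    `head` and `dependent` are dicts of token attributes (assumed format).
    """
    allowed = {"form", "lemma", "pos"}  # whitelist of column names
    clauses = ["d.head_id = h.id", "d.deprel = ?"]
    params = [relation]
    for alias, constraints in (("h", head), ("d", dependent)):
        for attr, value in constraints.items():
            if attr not in allowed:
                raise ValueError(f"unsupported attribute: {attr}")
            clauses.append(f"{alias}.{attr} = ?")
            params.append(value)
    sql = ("SELECT h.form, d.form FROM token AS h JOIN token AS d ON "
           + " AND ".join(clauses) + " ORDER BY h.id, d.id")
    return sql, params


# Hypothetical one-sentence corpus with dependency annotation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE token (id INTEGER PRIMARY KEY, form TEXT, "
             "lemma TEXT, pos TEXT, head_id INTEGER, deprel TEXT)")
conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "She", "she", "PRON", 2, "nsubj"),
    (2, "reads", "read", "VERB", None, "root"),
    (3, "books", "book", "NOUN", 2, "obj"),
])

# Pattern: a verb governing a noun through an "obj" relation.
sql, params = build_sql({"pos": "VERB"}, {"pos": "NOUN"}, "obj")
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Attribute values are passed as bound parameters and column names are checked against a whitelist, so user-supplied patterns cannot inject SQL. A multi-parallel corpus would additionally need alignment tables linking tokens across languages.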
Centering on German self-motion verbs, this paper demonstrates the advantages of free-sorting over creating and delineating word fields with more traditional methods. In particular, I draw a comparison to Snell-Hornby’s (1983) work on German descriptive verbs, which produces lexical fields with the help of dictionary entries, a thesaurus, a small corpus of written text and limited speaker feedback. While these methods have benefits, they are limited in their ability to represent the average organization of semantic fields in the mind of everyday speakers. Free-sorting, by contrast, does not rely on academic resources, corpora or singular speaker judgments. In sorting, a group of informants creates visible sets of items according to perceived similarity. Psycholinguists have used the method to quantitatively explore the perception of color terms across cultures (cf. Roberson et al. 2005). With a sufficiently large number of informants, one can generate lexical sorting data that is apt for cluster analysis, the results of which are represented by dendrograms. The experiment I conducted involved 33 school children from a middle-class neighborhood in Braunschweig, Northern Germany. My experiment shows that Snell-Hornby’s (1983) representation of the self-motion field can be improved by integrating further dimensions of meaning, such as body-space relations and sound, that young speakers find salient in the grouping procedure.
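How sorting data can feed a cluster analysis may be sketched as follows: for each pair of verbs, the dissimilarity is one minus the share of informants who placed both verbs in the same pile, and an agglomerative procedure then merges the closest clusters step by step. The merge sequence is what a dendrogram would display. The three sortings below are invented for illustration and are not data from the study. Average linkage is an assumption, since the abstract does not name the clustering method, and the function names are hypothetical.

```python
from itertools import combinations


def dissimilarity(sortings, items):
    """1 minus the share of informants who put both items in one pile."""
    dist = {}
    for a, b in combinations(items, 2):
        together = sum(
            any(a in pile and b in pile for pile in piles)
            for piles in sortings
        )
        dist[frozenset((a, b))] = 1 - together / len(sortings)
    return dist


def average_linkage(items, dist):
    """Return the merges in order as (cluster, cluster, distance) tuples."""
    def link(pair):
        # Mean pairwise dissimilarity between the members of two clusters.
        x, y = pair
        total = sum(dist[frozenset((a, b))] for a in x for b in y)
        return total / (len(x) * len(y))

    clusters = [frozenset([item]) for item in items]
    merges = []
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=link)
        merges.append((set(x), set(y), link((x, y))))
        clusters = [c for c in clusters if c not in (x, y)] + [x | y]
    return merges


# Invented sortings from three hypothetical informants.
verbs = ["gehen", "laufen", "rennen", "kriechen"]
sortings = [
    [{"gehen", "laufen"}, {"rennen"}, {"kriechen"}],
    [{"gehen", "laufen", "rennen"}, {"kriechen"}],
    [{"gehen", "laufen"}, {"rennen", "kriechen"}],
]

dist = dissimilarity(sortings, verbs)
merges = average_linkage(verbs, dist)
for left, right, height in merges:
    print(sorted(left), sorted(right), round(height, 3))
```

In this toy data, all three informants group "gehen" and "laufen" together, so that pair merges first at distance 0.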