Data-driven identification of German phrasal compounds

  • We present a method to identify and document a phenomenon on which there is very little empirical data: German phrasal compounds occurring as a single token (without punctuation between their components). Relying on linguistic criteria, our approach requires an operational notion of compounds that can be applied systematically, as well as (web) corpora that are large and diverse enough to contain rarely seen phenomena. The method is based on word segmentation and morphological analysis, and it takes advantage of a data-driven learning process. Our results show that coarse-grained identification of phrasal compounds is best performed with empirical data, whereas fine-grained detection could be improved with a combination of rule-based and frequency-based word lists. Along with the characteristics of web texts, the orthographic realizations seem to be linked to the degree of expressivity.
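
    To make the idea of segmentation-based candidate detection concrete, here is a minimal Python sketch. It is not the authors' implementation: the toy lexicon, the function-word list, the thresholds, and the names `segment` and `is_phrasal_candidate` are all hypothetical, and the example token is constructed for illustration. It only shows how a word list can be used to split a single token and flag it when the split contains a function word in non-final position, which may hint at an embedded phrase.

```python
from functools import lru_cache

# Hypothetical toy lexicon; a real run would rely on a large,
# frequency-based word list derived from corpora.
LEXICON = {"mann", "von", "welt", "gehabe", "haus", "tür", "schlüssel"}

# Assumption: a function word inside a token hints at an embedded phrase.
FUNCTION_WORDS = {"von", "und", "in", "zu", "für"}


def segment(token, lexicon=LEXICON, min_len=2):
    """Split `token` into lexicon words using as few parts as possible.

    Returns a list of parts, or None if no complete split exists.
    Linking elements and inflection are deliberately not handled here.
    """
    token = token.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # An empty remainder is a valid (empty) segmentation.
        if start == len(token):
            return ()
        candidates = []
        for end in range(start + min_len, len(token) + 1):
            piece = token[start:end]
            if piece in lexicon:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        # Prefer the split with the fewest parts.
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


def is_phrasal_candidate(token, min_parts=3):
    """Flag tokens that split into several words, one of them a
    function word in non-final position."""
    parts = segment(token)
    if parts is None or len(parts) < min_parts:
        return False
    return any(part in FUNCTION_WORDS for part in parts[:-1])
```

    With this toy lexicon, `is_phrasal_candidate("MannvonWeltGehabe")` is true, while an ordinary noun compound such as `"Haustürschlüssel"` is segmented but not flagged. A heuristic this coarse would need morphological analysis and corpus frequencies to be useful on real web data.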

Download full text files

  • Barbaresi_Hein_Data_driven_2017.pdf
    eng

    (Print version - IDS internal)

  • Barbaresi_Hein_Data_driven_2017_FINAL-DRAFT.pdf
    eng

    (Postprint - IDS internal)

Metadata
Author: Adrien Barbaresi, Katrin Hein
URN: urn:nbn:de:bsz:mh39-63888
URL: https://hal.archives-ouvertes.fr/hal-01575651/document
DOI: https://doi.org/10.1007/978-3-319-64206-2_22
ISBN: 978-3-319-64205-5
Parent Title (English): Text, Speech, and Dialogue. 20th International Conference, TSD 2017, Prague, Czech Republic, August 27-31, 2017, proceedings
Series (Serial Number): Lecture Notes in Computer Science (10415)
Publisher: Springer
Place of publication: Cham
Editor: Kamil Ekštein, Václav Matoušek
Document Type: Part of a Book
Language: English
Year of first Publication: 2017
Date of Publication (online): 2017/08/30
Publication state: Postprint
Review state: Peer-Review
Tag: corpus linguistics; morphological analysis; web corpora; word segmentation
GND Keyword: Deutsch; Korpus <Linguistik>; Morphemanalyse; Segmentierung; Wort
First Page: 192
Last Page: 200
Dewey Decimal Classification: 400 Language / 430 German
Open Access?: No
Licence: The German Copyright Act (UrhG) applies