Refine
Year of publication
Document Type
- Article (103)
- Part of a Book (99)
- Conference Proceeding (30)
- Review (2)
- Preprint (1)
Language
- English (235) (remove)
Has Fulltext
- yes (235) (remove)
Keywords
- Deutsch (65)
- Konversationsanalyse (36)
- Korpus <Linguistik> (23)
- Interaktion (21)
- Computerlinguistik (20)
- Semantik (15)
- Mehrsprachigkeit (13)
- Englisch (12)
- Sprachpolitik (12)
- Automatische Sprachanalyse (11)
Publicationstate
- Postprint (235) (remove)
Reviewstate
- Peer-Review (112)
- (Verlags)-Lektorat (83)
- Peer-review (8)
- Verlags-Lektorat (5)
- Peer-Revied (2)
Publisher
- Benjamins (49)
- Springer (30)
- Oxford University Press (19)
- Elsevier (13)
- Equinox (6)
- Wiley (6)
- Sage (5)
- Edinburgh University Press (4)
- Palgrave Macmillan (4)
- Routledge (4)
Question Answering Systems for retrieving information from Knowledge Graphs (KG) have become a major area of interest in recent years. Current systems search for words and entities but cannot search for grammatical phenomena. The purpose of this paper is to present our research on developing a QA System that answers natural language questions about German grammar.
Our goal is to build a KG which contains facts and rules about German grammar, and is also able to answer specific questions about a concrete grammatical issue. An overview of the current research in the topic of QA systems and ontology design is given and we show how we plan to construct the KG by integrating the data in the grammatical information system Grammis, hosted by the Leibniz-Institut für Deutsche Sprache (IDS). In this paper, we describe the construction of the initial KG, sketch our resulting graph, and demonstrate the effectiveness of such an approach. A grammar correction component will be part of a later stage. The paper concludes with the potential areas for future research.
Linguistic relativists have traditionally asked 'how language influences thought', but conversation analysts and anthropological linguists have moved the focus from thought to social action. We argue that 'social action' should in this context not become simply a new dependent variable, because the formulation 'does language influence action' suggests that social action would already be meaningfully constituted prior to its local (verbal and multi-modal) accomplishment. We draw on work by the gestalt psychologist Karl Duncker to show that close attention to action-in-a-situation helps us ground empirical work on cross-cultural diversity in an appreciation of the invariances that make culture-specific elements of practice meaningful.
Are borrowed neologisms accepted more slowly into the German language than German words resulting from the application of wrd formation rules? This study addresses this question by focusing on two possible indicators for the acceptance of neologisms: a) frequency development of 239 German neologisms from the 1990s (loanwords as well as new words resulting from the application of word formation rules) in the German reference corpus DEREKO and b) frequency development in the use of pragmatic markers (‘flags’, namely quotation marks and phrases such as sogenannt ‘so-called’) with these words. In the second part of the article, a psycholinguistic approach to evaluating the (psychological) status of different neologisms and non-words in an experimentally controlled study and plans to carry out interviews in a field test to collect speakers’ opinions on the acceptance of the analysed neologisms are outlined. Finally, implications for the lexicographic treatment of both types of neologisms are discussed.
Language attitudes matter; they influence people’s behaviour and decisions. Therefore, it is crucial to learn more about patterns in the way that languages are evaluated. One means of doing so is using a quantitative approach with data representative of a whole population, so that results mirror dispositions at a societal level. This kind of approach is adopted here, with a focus on the situation in Germany. The article consists of two parts. First, I will present some results of a new representative survey on language attitudes in Germany (the Germany Survey 2017). Second, I will show how language attitudes penetrate even seemingly objective data collection processes by examining the German Microcensus. In 2017, for the first time in eighty years, the German Microcensus included a question on language use ‘at home’. Unfortunately, however, the question was clearly tainted by language attitudes instead of being objective. As a result, the Microcensus significantly misrepresents the linguistic reality of different migrant languages spoken in Germany.
The recognizability of a stretch of conduct as social action depends on details of turn construction as well as the turn’s context. We examine details of turn construction as they enter into actions offering interpretations of prior talk. Such actions either initiate repair or formulate a conclusion from prior talk. We focus on how interpretation markers (das heißt [“that means”] vs. du meinst [“you mean”]) and interpretation formats (phrasal vs. clausal turn completions) each make their invariant contribution to specific interpreting practices. Interpretation marker and turn format go hand in hand, which leads to distinct patterns of interpreting practices: Das heißt+clause is especially apt for formulations, du meinst+phrase for repair. The results suggest that details of turn construction can systematically enter into the constitution of social action. Data are in German with English translation.
This paper discusses German neologisms in the so-called “new-media” and presents a German corpus-based online dictionary of neologisms. Several neological morphemes and lexemes, as well as their meaning will be presented, showing that these new modes of communication are an important source of enrichment of German lexicon.
How Do Speakers Define the Meaning of Expressions? The Case of German x heißt y (“x means y”)
(2020)
To secure mutual understanding in interaction, speakers sometimes explain or negotiate expressions. Adopting a conversation analytic and interaction linguistic approach, I examine how participants explain which kinds of expressions in different sequential environments, using the format x heißt y (“x means y”). When speakers use it to clarify technical terms or foreign words that are unfamiliar to co-participants, they often provide a situationally anchored definition that however is rather context-free and therefore transferable to future situations. When they explain common (but indexical, ambiguous, polysemous, or problematic) expressions instead, speakers always design their explanation strongly connected to the local context, building on situational circumstances. I argue that x heißt y definitions in interaction do not meet the requirements of scientific or philosophical definitions but that this is irrelevant for the situational exigencies speakers face.
Social perception studies have revealed that smiling individuals are perceived more favourably on many communion dimensions in comparison to nonsmiling individuals. Research on gender differences in smiling habits showed that women smile more than men. In our study, we investigated this phenomena further and hypothesised that women perceive smiling individuals as more honest than men. An experiment conducted in seven countries (China, Germany, Mexico, Norway, Poland, Republic of South Africa and USA) revealed that gender may influence the perception of honesty in smiling individuals. We compared ratings of honesty made by male and female participants who viewed photos of smiling and nonsmiling people. While men and women did not differ on ratings of honesty in nonsmiling individuals, women assessed smiling individuals as more honest than men did. We discuss these results from a social norms perspective.
Objective: Discrimination against nonnative speakers is widespread and largely socially acceptable. Nonnative speakers are evaluated negatively because accent is a sign that they belong to an outgroup and because understanding their speech requires unusual effort from listeners. The present research investigated intergroup bias, based on stronger support for hierarchical relations between groups (social dominance orientation [SDO]), as a predictor of hiring recommendations of nonnative speakers.
Method: In an online experiment using an adaptation of the thin-slices methodology, 65 U.S. adults (54% women; 80% White; M[age] = 35.91, range = 18–67) heard a recording of a job applicant speaking with an Asian (Mandarin Chinese) or a Latino (Spanish) accent. Participants indicated how likely they would be to recommend hiring the speaker, answered questions about the text, and indicated how difficult it was to understand the applicant.
Results: Independent of objective comprehension, participants high in SDO reported that it was more difficult to understand a Latino speaker than an Asian speaker. SDO predicted hiring recommendations of the speakers, but this relationship was mediated by the perception that nonnative speakers were difficult to understand. This effect was stronger for speakers from lower status groups (Latinos relative to Asians) and was not related to objective comprehension.
Conclusions: These findings suggest a cycle of prejudice toward nonnative speakers: Not only do perceptions of difficulty in understanding cause prejudice toward them, but also prejudice toward low-status groups can lead to perceived difficulty in understanding members of these groups.
Most research on ethnicity has focused on visual cues. However, accents are strong social cues that can match or contradict visual cues. We examined understudied reactions to people whose one cue suggests one ethnicity, whereas the other cue contradicts it. In an experiment conducted in Germany, job candidates spoke with an accent either congruent or incongruent with their (German or Turkish) appearance. Based on ethnolinguistic identity theory, we predicted that accents would be strong cues for categorization and evaluation. Based on expectancy violations theory we expected that incongruent targets would be evaluated more extremely than congruent targets. Both predictions were confirmed: accents strongly influenced perceptions and Turkish-looking German-accented targets were perceived as most competent of all targets (and additionally most warm). The findings show that bringing together visual and auditory information yields a more complete picture of the processes underlying impression formation.
This paper discusses the semi-formal language of mathematics and presents the Naproche CNL, a controlled natural language for mathematical authoring. Proof Representation Structures, an adaptation of Discourse Representation Structures, are used to represent the semantics of texts written in the Naproche CNL. We discuss how the Naproche CNL can be used in formal mathematics, and present our prototypical Naproche system, a computer program for parsing texts in the Naproche CNL and checking the proofs in them for logical correctness.
Growing globalisation of the world draws attention to cultural differences between people from different countries or from different cultures within the countries. Notwithstanding the diversity of people’s worldviews, current cross-cultural research still faces the challenge of how to avoid ethnocentrism; comparing Western-driven phenomena with like variables across countries without checking their conceptual equivalence clearly is highly problematic. In the present article we argue that simple comparison of measurements (in the quantitative domain) or of semantic interpretations (in the qualitative domain) across cultures easily leads to inadequate results. Questionnaire items or text produced in interviews or via open-ended questions have culturally laden meanings and cannot be mapped onto the same semantic metric. We call the culture-specific space and relationship between variables or meanings a ’cultural metric’, that is a set of notions that are inter-related and that mutually specify each other’s meaning. We illustrate the problems and their possible solutions with examples from quantitative and qualitative research. The suggested methods allow to respect the semantic space of notions in cultures and language groups and the resulting similarities or differences between cultures can be better understood and interpreted.
Prejudice against a social group may lead to discrimination of members of this group. One very strong cue of group membership is a (non)standard accent in speech. Surprisingly, hardly any interventions against accent-based discrimination have been tested. In the current article, we introduce an intervention in which what participants experience themselves unobtrusively changes their evaluations of others. In the present experiment, participants in the experimental condition talked to a confederate in a foreign language before the experiment, whereas those in the control condition received no treatment. Replicating previous research, participants in the control condition discriminated against Turkish-accented job candidates. In contrast, those in the experimental condition evaluated Turkish- and standard-accented candidates as similarly competent. We discuss potential mediating and moderating factors of this effect.
Distributional models of word use constitute an indispensable tool in corpus based lexicological research for discovering paradigmatic relations and syntagmatic patterns (Belica et al. 2010). Recently, word embeddings (Mikolov et al. 2013) have revived the field by allowing to construct and analyze distributional models on very large corpora. This is accomplished by reducing the very high dimensionality of word cooccurrence contexts, the size of the vocabulary, to few dimensions, such as 100-200. However, word use and meaning can vary widely along dimensions such as domain, register, and time, and word embeddings tend to represent only the most prevalent meaning. In this paper we thus construct domain specific word embeddings to allow for systematically analyzing variations in word use. Moreover, we also demonstrate how to reconstruct domain specific co-occurrence contexts from the dense word embeddings.
Aversion to loanwords may express itself in various ways: deliberate and motivated by ideology of linguistic purism or more implicit and motivated by the strength of one’s national identification and ethnolinguistic vitality. A study of Polish philology students assessed their tendency to choose loanwords versus synonymous native words. The results supported a two-path model of linguistic purism. Social identity (strength of identification) directly predicted avoidance of loanwords, whereas ideological concerns (conservative political views) predicted it indirectly, through purist ideology.
Psychological research has neglected people whose accent does not match their appearance. Most research on person perception has focused on appearance, overlooking accents that are equally important social cues. If accents were studied, it was often done in isolation (i.e., detached from appearance). We examine how varying accent and appearance information about people affects evaluations. We show that evaluations of expectancy-violating people shift in the direction of the added information. When a job candidate looked foreign, but later spoke with a native accent, his evaluations rose and he was evaluated best of all candidates (Experiment 1a). However, the sequence in which information was presented mattered: When heard first and then seen, his evaluations dropped (Experiment 1b). Findings demonstrate the importance of studying the combination and sequence of different types of information in impression formation. They also allow predicting reactions to ethnically mixed people, who are increasingly present in modern societies.
Feminine forms of job titles raise great interest in many countries. However, it is still unknown how they shape stereotypical impressions on warmth and competence dimensions among female and male listeners. In an experiment with fictitious job titles men perceived women described with feminine job titles as significantly less warm and marginally less competent than women with masculine job titles, which led to lower willingness to employ them. No such effects were observed among women.
Introduction
(2010)
In this article, we examine the effectiveness of bootstrapping supervised machine-learning polarity classifiers with the help of a domain-independent rule-based classifier that relies on a lexical resource, i.e., a polarity lexicon and a set of linguistic rules. The benefit of this method is that though no labeled training data are required, it allows a classifier to capture in-domain knowledge by training a supervised classifier with in-domain features, such as bag of words, on instances labeled by a rule-based classifier. Thus, this approach can be considered as a simple and effective method for domain adaptation. Among the list of components of this approach, we investigate how important the quality of the rule-based classifier is and what features are useful for the supervised classifier. In particular, the former addresses the issue in how far linguistic modeling is relevant for this task. We not only examine how this method performs under more difficult settings in which classes are not balanced and mixed reviews are included in the data set but also compare how this linguistically-driven method relates to state-of-the-art statistical domain adaptation.
Based on German speaking data from various activity types, the range of multimodal resources used to construct turn-beginnings is reviewed. It is claimed that participants in talk-in-interaction need to deal with four tasks in order to construct a turn which precisely fits the interactional moment of its production:
1. Achieve joint orientation: The accomplishment of the socio-spatial prerequisites necessary for producing a turn which is to become part of the participants’ common ground.
2. Display uptake: Next speaker needs to display his/her understanding of the interaction so far as the backdrop on which the production of the upcoming turn is based.
3. Deal with projections from prior talk: The speaker has to deal with projections which have been established by (the) previous turn(s) with respect to the upcoming turn.
4. Project properties of turn-in-progress: The speaker needs to orient the recipient to properties of the turn s/he is about to produce.
Turn-design thus can be seen to be informed by tasks related to the multimodal, embodied, and interactive contingencies of online-construction of turns. The four tasks are ordered in terms of prior tasks providing the prerequisite for accomplishing a later task.
In this article, we explore the feasibility of extracting suitable and unsuitable food items for particular health conditions from natural language text. We refer to this task as conditional healthiness classification. For that purpose, we annotate a corpus extracted from forum entries of a food-related website. We identify different relation types that hold between food items and health conditions going beyond a binary distinction of suitability and unsuitability and devise various supervised classifiers using different types of features. We examine the impact of different task-specific resources, such as a healthiness lexicon that lists the healthiness status of a food item and a sentiment lexicon. Moreover, we also consider task-specific linguistic features that disambiguate a context in which mentions of a food item and a health condition co-occur and compare them with standard features using bag of words, part-of-speech information and syntactic parses. We also investigate in how far individual food items and health conditions correlate with specific relation types and try to harness this information for classification.
Just like most varieties of West Germanic, virtually all varieties of German use a construction in which a cognate of the English verb 'do' (standard German 'tun') functions as an auxiliary and selects another verb in the bare infinitive, a construction known as 'do'-periphrasis or 'do'-support. The present paper provides an Optimality Theoretic (OT) analysis of this phenomenon. It builds on a previous analysis by Bader and Schmid (An OT-analysis of 'do'-support in Modern German, 2006) but (i) extends it from root clauses to subordinate clauses and (ii) aims to capture all of the major distributional patterns found across (mostly non-standard) varieties of German. In so doing, the data are used as a testing ground for different models of German clause structure. At first sight, the occurrence of 'do' in subordinate clauses, as found in many varieties, appears to support the standard CP-IP-VP analysis of German. In actual fact, however, the full range of data turn out to challenge, rather than support, this model. Instead, I propose an analysis within the IP-less model by Haider (Deutsche Syntax - generativ. Vorstudien zur Theorie einer projektiven Grammatik, Narr, Tübingen, 1993 et seq.). In sum, the 'do'-support data will be shown to have implications not only for the analysis of clause structure but also for the OT constraints commonly assumed to govern the distribution of 'do', for the theory of non-projecting words (Toivonen in Non-projecting words, Kluwer, Dordrecht, 2003) as well as research on grammaticalization.
The analysis which we present in this paper is part of an ethnographically based sociolinguistic study of various immigrant youth groups and their social style of communication. The study describes the wide variety of migrant groups and their socio-cultural orientation in relation to different migrant worlds as well as to different social worlds of the dominant German society. The development of a social style of communication is grounded in the groups’ socio-cultural orientation as well as in the perception of themselves in relation to relevant others. The main purpose of our study is to analyse the construction of the groups’ social identity in terms of their social style of communication.
The main point of this chapter is to demonstrate how a speaker’s concept of his/her professional role can be inferred from his/her perspectival work (perspective setting and relating different perspectives to one another) in professional encounters. Thereby some risks of complex perspectival work in discourse will become manifest which result - at one point in the talk - in perspectival inconsistency, revealing a deeply grounded social problem for the speaker. This will be examined in the framework of a rhetorical conversation analysis.
How do people communicate in mobile settings of interaction? How does mobility affect the way we speak? How does mobility exert influence on the manner in which talk itself is consequential for how we move in space? Recently, questions of this sort have attracted increasing attention in the human and social sciences. This Special Issue contributes to the emerging body of studies on mobility and talk by inspecting an ordinary and ubiquitous phenomenon in which communication among mobile participants is paramount: participation in traffic. This editorial presents previous work on mobility in natural settings, as carried out by interactionally oriented researchers. It also shows how the investigation into traffic participation adds new perspectives to research on language and communication.
This paper asks whether and in which ways managing coordination tasks in traffic involve the accomplishment of intersubjectivity. Taking instances of coordinating passing an obstacle with oncoming traffic as the empirical case, four different practices were found.
1. Intersubjectivity can be presupposed by expecting others to stick to the traffic code and other mutually shared expectations.
2. Intersubjective solutions emerge step by step by mutual responsive-anticipatory adaptation of driving decisions.
3. Intersubjectivity can be accomplished by explicit interactive negotiation of passages.
4. Coordination problems can be solved without relying on intersubjectivity by unilateral, responsive-anticipatory adaptation to others’ behaviors.
We present a technique called event mapping that allows to project text representations into event lists, produce an event table, and derive quantitative conclusions to compare the text representations. The main application of the technique is the case where two classes of text representations have been collected in two different settings (e.g., as annotations in two different formal frameworks) and we can compare the two classes with respect to their systematic differences in the event table. We illustrate how the technique works by applying it to data collected in two experiments (one using annotations in Vladimir Propp’s framework, the other using natural language summaries).
Based on the empirical data of 97 fourth-graders from three districts of Braunschweig in Germany, this paper investigates the possibility of changing semantic frames in multilingual communities. The focus of study is the verb field of self-motion. In a free-sorting task involving 52 verbs, Turkish-speaking students, in particular, placed the verbs schleichen (‘to sneak’) and kommen (‘to come’) in the same group. When explaining the perceived similarity they also used the word schleichen (‘to sneak’), in a specific grammatical construction that is not found in Standard German. This paper suggests that semantic frames may change along with grammatical constructions when typologically distinct languages come into close contact.
Terminological resources play a central role in the organization and retrieval of scientific texts. Both simple keyword lists and advanced modelings of relationships between terminological concepts can make a most valuable contribution to the analysis, classification, and finding of appropriate digital documents, either on the web or within local repositories. This seems especially true for long-established scientific fields with elusive theoretical and historical branches, where the use of terminology within documents from different origins is often far from being consistent. In this paper, we report on the progress of a linguistically motivated project on the onomasiological re-modeling of the terminological resources for the grammatical information system grammis. We present the design principles and the results of their application. In particular, we focus on new features for the authoring backend and discuss how these innovations help to evaluate existing, loosely structured terminological content, as well as to efficiently deal with automatic term extraction. Furthermore, we introduce a transformation to a future SKOS representation. We conclude with a positioning of our resources with regard to the Knowledge Organization discourse and discuss how a highly complex information environment like grammis benefits from the re-designed terminological KOS.
We continue the study of the reproducibility of Propp’s annotations from Bod et al. (2012). We present four experiments in which test subjects were taught Propp’s annotation system; we conclude that Propp’s system needs a significant amount of training, but that with sufficient time investment, it can be reliably trained for simple tales.
The article presents the results of a survey on dictionary use in Europe, focusing on general monolingual dictionaries. The survey is the broadest survey of dictionary use to date, covering close to 10,000 dictionary users (and non-users) in nearly thirty countries. Our survey covers varied user groups, going beyond the students and translators who have tended to dominate such studies thus far. The survey was delivered via an online survey platform, in language versions specific to each target country. It was completed by 9,562 respondents, over 300 respondents per country on average. The survey consisted of the general section, which was translated and presented to all participants, as well as country-specific sections for a subset of 11 countries, which were drafted by collaborators at the national level. The present report covers the general section.
This article examines a recurrent format that speakers use for defining ordinary expressions or technical terms. Drawing on data from four different languages - Flemish, French, German, and Italian - it focuses on definitions in which a definiendum is first followed by a negative definitional component (‘definiendum is not X’), and then by a positive definitional component (‘definiendum is Y’). The analysis shows that by employing this format, speakers display sensitivity towards a potential meaning of the definiendum that recipients could have taken to be valid. By negating this meaning, speakers discard this possible, yet unintended understanding. The format serves three distinct interactional purposes: (a) it is used for argumentation, e.g. in discussions and political debates, (b) it works as a resource for imparting knowledge, e.g. in expert talk and instructions, and (c) it is employed, in ordinary conversation, for securing the addressee's correct understanding of a possibly problematic expression. The findings contribute to our understanding of how epistemic claims and displays relate to the turn-constructional and sequential organization of talk. They also show that the much quoted ‘problem of meaning’ is, first and foremost, a participant's problem.
To date, little is known about prosodic accommodation and its conversational functions in instances of overlapping talk in conversation. A major conversational action that happens in overlap is turn competition. It is not known whether participants accommodate prosodic parameters locally in the overlapped turn (initialisation) or access a repertoire of prosodic patterns that refer to general prosodic parameter norms (normalisation) when competing for the turn in overlap. This paper investigates the initialisation and normalisation of fundamental frequency (f0) and assesses its role as a resource for turn competition in overlap. We drew instances of overlapping talk from a corpus of conversational multi-party interactions in British English. We annotated the overlaps on a competitiveness scale and categorised them by overlap onset position and conversational function. We automatically extracted f0 parameters from the speech signal and processed them into f0 accommodation features that represent the normalising or the initialising use of f0. Using decision tree classification we found that f0 accommodation is only relevant as a turn competitive resource in overlaps that start clearly before a speaker transition. In this turn context, we found that normalising and initialising f0 features can both be relevant turn competitive resources. Their deployment depends on the conversational function of overlap.
Consistency of reference structures is an important issue in lexicography and dictionary research, especially with respect to information on sense-related items. In this paper, the systematic challenges of this area (e.g. ‘non-reversed reference’, bidirectional linking being realised as unidirectional structures) will be outlined, and the problems which can be caused by these challenges for both lexicographers and dictionary users will be discussed. The paper also discusses how text-technological Solutions may help to provide Support for the consistency of sense-related pairings during the process of compiling a dictionary.
Present-day German uses two formally different patterns of compounding in N+N compounds. The first combines bare stems (e.g. Tisch+decke ‘tablecloth’) while the second contains an intervening linking element (LE) as in Geburt-s-ort ‘birth-LE-place’. The linked compounding type developed in Early New High German (1350–1650) from phrasal constructions by reanalyzing genitive attributes as first constituents of compounds. The present paper uses corpus data to explore three key stages in this development: In the initial stage, it shows how prenominal non-specific genitive constructions lent themselves to reanalysis due to their functional overlap and formal similarity. Additionally, compounds seem to have replaced not only prenominal genitives, but also structurally different postnominal genitives. In the second stage, the new compounding pattern increases in productivity between 1500 and 1710, especially compared to the older pattern without linking elements. The last stage pertains to changes in spelling practice. It shows that linked compounds were written separately in the beginning. Their gradual graphematic integration into directly connected words was reversed by a century of hyphenation (1650–1750). This is strikingly different from present-day spelling practice and shows that the linked pattern was still perceived as marked.
This study examines what kind of cues and constraints for discourse interpretation can be derived from the logical and generic document structure of complex texts by the example of scientific journal articles. We performed statistical analysis on a corpus of scientific articles annotated on different annotations layers within the framework of XML-based multi-layer annotation. We introduce different discourse segment types that constrain the textual domains in which to identify rhetorical relation spans, and we show how a canonical sequence of text type structure categories is derived from the corpus annotations. Finally, we demonstrate how and which text type structure categories assigned to complex discourse segments of the type “block” statistically constrain the occurrence of rhetorical relation types.
Germany's (single) national official language is German. The dominance of German in schools, politics, the legal system, administration and the entire written public domain is so great that for a long time the lack of a coherent language policy was not seen as a problem. State restraint in this area is due, on the one hand, to historical reasons; on the other hand, it has been promoted by the federal system in Germany, which grants the federal states far-reaching responsibilities in the fields of education and culture. More recently, multilingualism among the population has increased and has resulted in a growing interest in understanding the language situation in Germany and (in particular) taking a closer look at the different minority languages. In 2017, for the first time in about 80 years, there is a question on the language of the population in the German micro census. The Institute for the German Language has also carried out various representative surveys; in the winter of 2017/201, a large representative survey with questions on the language repertoire and language attitudes is in the field.
Response particles manage intersubjectivity. This conversation analytic study describes German eben (“exactly”). With eben, speaker A locally agrees with the immediately prior turn of B (the “confirmable”) and establishes a second indexical link: A relates B’s confirmable to a position A herself had already displayed (the “anchor”). Through claiming temporal priority, eben speakers treat a just-formulated position as self-evident and mark independence. Further evidence for the three-part structure “anchor-confirmable-eben” that eben sets in motion retrospectively comes from instances where eben speakers supply a missing/opaque anchor via a postpositioned display of independent access. Data are in German with English translation.
Content analysis provides a useful and multifaceted, methodological framework for Twitter analysis. CAQDAS tools support the structuring of textual data by enabling categorising and coding. Depending on the research objective, it may be appropriate to choose a mixed-methods approach that combines quantitative and qualitative elements of analysis and plays out their respective advantages to the greatest possible extent while minimising their shortcomings. In this chapter, we will discuss CAQDAS speech act analysis of tweets as an example of software-assisted content analysis. We start with some elementary thoughts on the challenges of the collection and evaluation of Twitter data before we give a brief description of the potentials and limitations of using the software QDA Miner (as one typical example for possible analysis programmes). Our focus will lie on analytical features that can be particularly helpful in speech act analysis of tweets.
We present an empirical study addressing the question whether, and to which extent, lexicographic writing aids improve text revision results. German university students were asked to optimise two German texts using (1) no aids at all, (2) highlighted problems, or (3) highlighted problems accompanied by lexicographic resources that could be used to solve the specific problems. We found that participants from the third group corrected the largest number of problems and introduced the fewest semantic distortions during revision. Also, they reached the highest overall score and were most efficient (as measured in points per time). The second group with highlighted problems lies between the two other groups in almost every measure we analysed. We discuss these findings in the scope of intelligent writing environments, the effectiveness of writing aids in practical usage situations and teaching dictionary skills.
Instruction practices in German driving lessons: Differential uses of declaratives and imperatives
(2018)
Building on а corpus of 70 hours of German driving lessons, this paper studies the use of declaratives vs. imperatives for instruction. It shows how these linguistic resources are adapted to different praxeological, temporal and participant-related environments. Declaratives are used for first instructions, task-setting and post- trial discussions. They exhibit complex syntax and do not call for immediate compliance. Their high degree of explicitness conveys how the action is to be carried out. Imperative instructions overwhelmingly correct ongoing actions of students or respond to their failure to produce expected actions. They exhibit minimal argument structure. They are reminders which presuppose that the student monitors the scene and can perform the action unproblematically. They index that requests have to be complied with immediately or even urgently.
Our paper deals with the use of ICH WEIß NICHT (‘I don’t know’) in German talk-in-interaction. Pursuing an Interactional Linguistics approach, we identify different interactional uses of ICH WEIß NICHT and discuss their relationship to variation in argument structure (SV (O), (O)VS, V-only). After ICH WEIß NICHT with full complementation, speakers emphasize their lack of knowledge or display reluctance to answer. In contrast, after variants without an object complement, in contrast, speakers display uncertainty about the truth of the following proposition or about its sufficiency as an answer. Thus, while uses with both subject and object tend to close a sequence or display lack of knowledge, responses without an object, in contrast, function as a prepositioned epistemic hedge or a pragmatic marker framing the following TCU. When ICH WEIß NICHT is used in response to a statement, it indexes disagreement (independently from all complementation patterns).
Wolfgang von Kempelen's book "The Mechanism of Human Speech" from 1791 is a famous milestone in the history of speech communication research. It has an enormous relevance for the phonetic sciences and it marks an important turning point for the development of the (mechanical) speech synthesis. So far no English version of this work was available, which excludes many interested researchers. Access to the original versions in German and French is restricted for various reasons. For example the blackletter script of the German version is troublesome for most of today's readers. We report here on a new edition of Kempelen's book which unites a better readable German version and its English translation. It will now also be in a searchable electronic format and has been enriched with many commentaries, which aid in the understanding of details of the late 18th century that are little known or unknown to many researchers today.
There are a number of recent replicas of Wolfgang von Kempelen's speaking machine. Although all of them are explicitly based on Kempelen's own description nearly none of them are identical in construction and sound. In this paper we want to illustrate some of these differences and their reasons for five replicas built by ourselves.
Reformulating place
(2013)
This report examines what can be accomplished in conversation by reformulating a reference to a place using the practices of repair. It is based on an analysis of a collection of place references situated in second pair parts of adjacency pairs taken from a wide range of field recordings of talk-in-interaction. Not surprisingly, place references are sometimes reformulated so as to indicate a misspeaking or in pursuit of recipient recognition. At other times, however, we show that place references can be reformulated to more adequately implement the action of a turn in prosecuting the course of action of which it is a part. In these cases repairing a place reference can target a source of trouble associated with implementing the action of a turn at talk, and thus reformulating place can serve as a practical resource for accomplishing a range of interactional tasks. We conclude with a more complex case in which two reformulations are deployed in responding to a so-called ‘double-barrelled’ initiating action.
This paper is concerned with a novel methodology for generating phonetic questions used in tree-based state tying for speech recognition. In order to implement a speech recognition system, language-dependent knowledge which goes beyond annotated material is usually required. The approach presented here generates phonetic questions for decision trees are based on a feature table that summarizes the articulatory characteristics of each sound. On the one hand, this method allows better language-specific triphone models to be defined given only a feature-table as linguistic input. On the other hand, the feature-table approach facilitates efficient definition of triphone models for other languages since again only a feature table for this language is required. The approach is exemplified with speech recognition systems for English and Thai.
Pseudoclefts in Hungarian
(2013)
Based on novel data from Hungarian, this paper makes the case that in at least some languages specificational pseudocleft sentences must receive a ‘what-you- see-is-what-you-get’ syntactic analysis. More specifically, it is argued that the clefted constituent is the subject of predication (underlyingly base-generated in Spec, Pr), whereas the cleft clause acts as a predicate in the structure. Alongside connectivity effects characteristic of specificational pseudoclefts, we also discuss a range of anti-connectivity effects, which we show to receive a straightforward explanation under the proposed analysis. It follows that attested connectivity effects, in turn, require a semantic, rather than a syntactic account, along the lines of Jacobson (1994) and Sharvit (1999).
Language Change
(2017)
The present chapter outlines a research program for historical linguistics based on the idea that the object of the formal study of language change should be defined as grammar change, that is, a set of discrete differences between the target grammar and the grammar acquired by the learner (Hale 2007). This approach is shown to offer new answers to some classical problems of historical linguistics (Weinreich et al. 1968), concerning, specifically, the actuation of changes and the observation that the transition from one historical state to another proceeds gradually. It is argued that learners are highly sensitive to small fluctuations in the linguistic input they receive, making change inevitable, while the impression of gradualness is linked to independent factors (diffusion in a speech community, and grammar competition). Special attention is paid to grammaticalization phenomena, which offer insights into the nature of functional categories, the building blocks of clause structure.
In spite of the obvious importance that is accorded to the notion grammatical construction in any approach that sees itself as a construction grammar (CxG), there is as yet no generally accepted definition of the term across different variants of the framework. In particular, there are different assumptions about which additional requirements a given structure has to meet in order to be recognized as a construction besides being a ‘form-meaning pair’. Since the choice of a particular definition will determine the range of both relevant phenomena and concrete observations to be considered in empirical research within the framework, the issue is not just a mere terminological quibble but has important methodological repercussions especially for quantitative research in areas such as corpus linguistics. The present study illustrates some problems in identifying and delimiting such patterns in naturally occurring text and presents arguments for a usage-based interpretation of the term grammatical construction.
In the context of the HyTex project, our goal is to convert a corpus into a hypertext, basing conversion strategies on annotations which explicitly mark up the text-grammatical structures and relations between text segments. Domain-specific knowledge is represented in the form of a knowledge net, using topic maps. We use XML as an interchange format. In this paper, we focus on a declarative rule language designed to express conversion strategies in terms of text-grammatical structures and hypertext results. The strategies can be formulated in a concise formal syntax which is independend of the markup, and which can be transformed automatically into executable program code.
In the management of cooperation, the fit of a requested action with what the addressee is presently doing is a pervasively relevant consideration. We present evidence that imperative turns are adapted to, and reflexively create, contexts in which the other person is committed to the course of action advanced by the imperative. This evidence comes from systematic variation in the design of imperative turns, relative to the fittedness of the imperatively mandated action to the addressee’s ongoing trajectory of actions, what we call the “dine of commitment”. We present four points on this dine: Responsive imperatives perform an operation on the deontic dimension of what the addressee has announced or already begun to do (in particular its permissibility); local-project imperatives formulate a new action advancing a course of action in which the addressee is already actively engaged; global-project-imperatives target a next task for which the addressee is available on the grounds of their participation in the overall event, and in the absence of any competing work; and competitive imperatives draw on a presently otherwise engaged addressee on the grounds of their social commitment to the relevant course of actions. These four turn shapes are increasingly complex, reflecting the interactional work required to bridge the increasing distance between what the addressee is currently doing, and what the imperative mandates. We present data from German and Polish informal and institutional settings.
The lexicography of German
(2020)
This chapter discusses the main dictionaries of the German language as it is spoken and written in Germany, and also German as it is spoken and written in Austria, Switzerland, the eastern fringes of Belgium, and South Tyrol. It also briefly describes Pennsylvania German. Corpora and other language resources used in German dictionary-making are also presented. Finally, there is a discussion of some current issues in German lexicography, as well as future prospects.
Within cognitive linguistics, there is an increasing awareness that the study of linguistic phenomena needs to be grounded in usage. Ideally, research in cognitive linguistics should be based on authentic language use, its results should be replicable, and its claims falsifiable. Consequently, more and more studies now turn to corpora as a source of data. While corpus-based methodologies have increased in sophistication, the use of corpus data is also associated with a number of unresolved problems. The study of cognition through off-line linguistic data is, arguably, indirect, even if such data fulfils desirable qualities such as being natural, representative and plentiful. Several topics in this context stand out as particularly pressing issues. This discussion note addresses (1) converging evidence from corpora and experimentation, (2) whether corpora mirror psychological reality, (3) the theoretical value of corpus linguistic studies of ‘alternations’, (4) the relation of corpus linguistics and grammaticality judgments, and, lastly, (5) the nature of explanations in cognitive corpus linguistics. We do not claim to resolve these issues nor to cover all possible angles; instead, we strongly encourage reactions and further discussion.
We present a method to identify and document a phenomenon on which there is very little empirical data: German phrasal compounds occurring in the form of as a single token (without punctuation between their components). Relying on linguistic criteria, our approach implies to have an operational notion of compounds which can be systematically applied as well as (web) corpora which are large and diverse enough to contain rarely seen phenomena. The method is based on word segmentation and morphological analysis, it takes advantage of a data-driven learning process. Our results show that coarse-grained identification of phrasal compounds is best performed with empirical data, whereas fine-grained detection could be improved with a combination of rule-based and frequency-based word lists. Along with the characteristics of web texts, the orthographic realizations seem to be linked to the degree of expressivity.
In this paper, I argue against the analyses of the there-construction by Moro (1997) and Hoekstra & Mulder (1990) and for an analysis in the frame of Williams (1994), Hazout (2004) from two angles. First of all, Moro and Hoekstra & Mulder do not correctly predict the behaviour of the there-construction under wh-movement; second, from a semantic point of view, the predicate in the small clause structure is the postverbal DP and not there. Alternatively, I follow the proposal by Williams (1994) in which there is the subject of predication and I will point out a direction to analyse the problematic wh-movement data within this framework.
Precise multimodal studies require precise synchronisation between audio and video signals. However, raw audio and audio from video recordings can be out of sync for several reasons. In order to re-synchronise them, a dynamic programming (DP) approach is presented here. Traditionally, DP is performed on the rectangular distance matrix comparing each value in signal A with each value in signal B. Previous work limited the search space using for example the Sakoe Chiba Band (Sakoe and Chiba, 1978). However, the overall space of the distance matrix remains identical. Here, a tunnel matrix and its according DP-algorithm are presented. The matrix contains merely the computed distance of two signals to a pre-specified bandwidth and the computational cost is equally reduced. An example implementation demonstrates the functionality on artificial data and on data from real audio and video recordings.
HMMs are the dominating technique used in speech recognition today since they perform well in overall phone recognition. In this paper, we show the comparison of HMM methods and machine learning techniques, such as neural networks, decision trees and ensemble classifiers with boosting and bagging in the task of articulatory-acoustic feature classification. The experimental results show that HMM methods work well for the classification of such features as vocalic. However, decision tree and bagging outperform HMMs for the fricative classification task since the data skewness is much higher than for the feature vocalic classification task. This demonstrates that HMMs do not perform as well as decision trees and bagging in highly skewed data settings.
MRI data of German vowels and consonants was acquired for 9 speakers. In this paper tongue contours for the vowels were analyzed using the three-mode factor analysis technique PARAFAC. After some difficulties, probably related to what constitutes an adequate speaker sample for this three-mode technique to work, a stable two-factor solution was extracted that explained about 90% of the variance. Factor 1 roughly captured the dimension low back to high front; Factor 2 that from mid front to high back. These factors are compared with earlier models based on PARAFAC. These analyses were based on midsagittal contours; the paper concludes by illustrating from coronal and axial sections how non-midline information could be incorporated into this approach.
Analyses of jaw movement(obtained by Electromagnetic Articulography) and acoustics show that loud speech is an intricate phenomenon. Besides involving higher intensity and subglottal pressure it affects jaw movements as well as fundamental frequency and especially first formants. It is argued that all these effects serve the purpose of enhancing perceptual salience.
The vowel quality in some diphthongs of Swabian (an upper german dialect) was determined by measurement of first and second formant values. A minimal contrast could be shown between two different diphthong qualities […], where for Standard German only one is assumed, viz. /ai/. The two diphthong qualities differ only slightly in onset and offset vowel quality, so a better understanding of their relationship was expected from an examination of their dynamic aspects. Our preliminary results suggest that there is indeed a difference in the temporal structure of the two diphthongs.
The paper presents the results of a joint effort of a group of multimodality researchers and tool developers to improve the interoperability between several tools used for the annotation and analysis of multimodality. Each of the tools has specific strengths so that a variety of different tools, working on the same data, can be desirable for project work. However this usually requires tedious conversion between formats. We propose a common exchange format for multimodal annotation, based on the annotation graph (AG) formalism, which is supported by import and export routines in the respective tools. In the current version of this format the common denominator information can be reliably exchanged between the tools, and additional information can be stored in a standardized way.
The English language has taken advantage of the Digital Revolution to establish itself as the global language; however, only 28.6 %of Internet users speak English as their native language. Machine Trans-lation (MT) is a powerful technology that can bridge this gap. In devel-opment since the mid-20th century, MT has become available to every Internet user in the last decade, due to free online MT services. This paper aims to discuss the implications that these tools may have for the privacy of their users and how they are addressed by EU data protec-tion law. It examines the data-flows in respect of the initial processing (both from the perspective of the user and the MT service provider) and potential further processing that may be undertaken by the MT service provider.
This paper provides a unified semantic and discourse pragmatic analysis of the German particle nämlich, traditionally described as having a specificational and an explanative reading. Our claim is that nämlich is a discourse marker which signals that the expression it is attached to is a short (elliptic) answer to a salient implicit question about the previous utterance. We show how both the explanative and the specificational reading can be derived from this more general semantic contribution. In addition we discuss some cross linguistic consequences of our analysis.
A polarity-sensitive item (PSI), as traditionally defined, is an expression that is restricted to either an affirmative or negative context. PSIs like ‘lift a finger’ and ‘all the time in the world’ sub-serve discourse routines like understatement and emphasis. Lexical–semantic classes are increasingly invoked in descriptions of the properties of PSIs. Here, we use English corpus data and the tools of Frame Semantics (Fillmore, 1982, 1985) to explore Israel’s (2011) observation that the semantic role of a PSI determines how the expression fits into a contextually constructed scalar model. We focus on a class of exceptions implied by Israel’s model: cases in which a given PSI displays two countervailing patterns of polarity sensitivity, with attendant differences in scalar entailments. We offer a set of case studies of polaritysensitive expressions – including verbs of attraction and aversion like ‘can live without’, monetary units like ‘a red cent’, comparative adjectives and time-span adverbials – that demonstrate that the interpretation of a given PSI in a given polar context is based on multiple factors. These factors include the speaker’s perspective on and affective stance towards the described event, available inferences about causality and, perhaps most critically, particulars of the predication, including the verb or adjective’s frame membership, the presence or absence of an ability modal like can, the grammatical construction used and the range of contingencies evoked by the utterance.
When a noise verb is used to indicate verbal communication, factors from both the source domain of the verb (perception) and the target domain (communication) play a role in determining the argument structure of the sentence. While the target domain supplies a syntactic structure, the source domain’s semantics constrain the degree to which that syntactic structure can be exploited. This can be determined by comparing noise verbs in this use with manner-of-communication verbs, which are superficially similar, but native to communication. Data for these two classes of verbs were drawn from the British National Corpus. The data were annotated with frame-semantic markup, as described in the Berkeley FrameNet Project. We compared the presence, type of syntactic realization, and position of the semantically annotated arguments for both classes of verbs. We found that noise and manner verbs show statistically significant differences in these three areas. For instance, noise verbs are more focused on the form of the message than manner verbs: noise verbs appear more frequently with a quoted message. In addition, there are differences other than the complementation patterns: certain noise verbs are biased with respect to speakers’ genders, message types, and even orthography in quoted messages
We provide a unified account of semantic effects observable in attested examples of the German applicative (‘be-’) construction, e.g. Rollstuhlfahrer Poul Sehachsen aus Kopenhagen will den 1997 erschienenen Wegweiser Handiguide Europa fortführen und zusammen mit Movado Berlin berollen (‘Wheelchair user Poul Schacksen from Copenhagen wants to continue the guide ‘Handiguide Europe’, which came out in 1997, and roll Berlin together with Movado.’). We argue that these effects do not come from lexico-semantic operations on ‘input’ verbs, but are instead the products of a reconciliation procedure in which the meaning of the verb is integrated into the event-structure schema denoted by the applicative construction. We analyze the applicative pattern as an argument-structure construction, in terms of Goldberg (1995). We contrast this approach with that of Brinkmann (1997), in which properties associated with the applicative pattern (e.g. omissibility of the theme argument, holistic interpretation of the goal argument, and planar construal of the location argument) are attributed to general semantico-pragmatic principles. We undermine the generality of the principles as stated, and assert that these properties are instead construction-particular. We further argue that the constructional account provides an elegant model of the valence-creation and valence-augmentation functions of the prefix. We describe the constructional semantics as prototype-based: diverse implications of fee-predications, including iteration, transfer, affectedness, intensity and saturation, derive via regular patterns of semantic extension from the topological concept of coverage.
Contemporary studies on the characteristics of natural language benefit enormously from the increasing amount of linguistic corpora. Aside from text and speech corpora, corpora of computer-mediated communication (CMC) Position themselves between orality and literacy, and beyond that provide in- sight into the impact of "new", mainly intemet-based media on language beha- viour. In this paper, we present an empirical attempt to work with annotated CMC corpora for the explanation of linguistic phenomena. In concrete terms, we implement machine leaming algorithms to produce decision trees that reveal rules and tendencies about the use of genitive markers in German.
Discourse parsing of complex text types such as scientific research articles requires the analysis of an input document on linguistic and structural levels that go beyond traditionally employed lexical discourse markers. This chapter describes a text-technological approach to discourse parsing. Discourse parsing with the aim of providing a discourse structure is seen as the addition of a new annotation layer for input documents marked up on several linguistic annotation levels. The discourse parser generates discourse structures according to the Rhetorical Structure Theory. An overview of the knowledge sources and components for parsing scientific joumal articles is given. The parser’s core consists of cascaded applications of the GAP, a Generic Annotation Parser. Details of the chart parsing algorithm are provided, as well as a short evaluation in terms of comparisons with reference annotations from our corpus and with recently developed Systems with a similar task.
This chapter addresses the requirements and linguistic foundations of automatic relational discourse analysis of complex text types such as scientific journal articles. It is argued that besides lexical and grammatical discourse markers, which have traditionally been employed in discourse parsing, cues derived from the logical and generical document structure and the thematic structure of a text must be taken into account. An approach to modelling such types of linguistic information in terms of XML-based multi-layer annotations and to a text-technological representation of additional knowledge sources is presented. By means of quantitative and qualitative corpus analyses, cues and constraints for automatic discourse analysis can be derived. Furthermore, the proposed representations are used as the input sources for discourse parsing. A short overview of the projected parsing architecture is given.
Linguistic variation and linguistic virtuosity of young “Ghetto”-migrants in Mannheim, Germany
(2011)
In this paper, we provide an insight into the life world and social experiences of young Turkish migrants who are categorised by German society as “social problem cases”. Based on natural conversational data, we describe the communicative repertoire of one migrant adolescent and that of his friends. Our aims are (a) to isolate those linguistic features that convey the impression of “foreignness”, and stand out among other German speakers’ features, and (b) to analyse the variability in our informants’ discursive practices - i.e. code- or style-switching, as it is commonly referred to in the literature - in order to show how variation serves as a communicative resource. Our findings show that these adolescents’ remarkable linguistic proficiency and communicative competence contrast markedly to their low educational and professional status.
Complex common names such as Indian elephant or green tea denote a certain type of entity, viz. kinds. Moreover, those kinds are always subkinds of the kind denoted by their head noun. Establishing such subkinds is essentially the task of classifying modifiers that are a defining trait of endocentrically structured complex common names. Examining complex common names of different lexico-syntactic types(NN compounds, N+N syntagmas, NP/PP syntagmas, A+N syntagmas) and from different languages (particularly English, German and French) it can be shown that complex common names are subject to language- independent formal and semantic constraints. In particular, complex common names qualify as name-like expressions in that they tend to be deficient in terms of formal complexity and semantic compositionality.
In this chapter, emotions are not regarded primarily as internal-psychological phenomena, but as socially proscribed and formed entities, which are constituted in accordance with social rules of emotionality and which are manifested, interpreted, and processed together communicatively in the interaction for definite purposes by the persons involved. In the elaboration of such an interactive conception of emotionality, the following aspects are treated: the value of emotionality in linguistic theories; emotions as a specific form of experiencing; the rules of emotionality; communication of emotions as transmission of evaluations; practices of manifestation, interpretation and processing of emotions in the communication process; fundamental interrelations between emotions and communication behavior; and methodology of the analysis of emotions and emotionality in specific conversation types. Finally, the developed theoretical apparatus in the analysis of two short conversation sections is elucidated.
Integrated Linguistic Annotation Models and Their Application in the Domain of Antecedent Detection
(2011)
Seamless integration of various, often heterogeneous linguistic resources in terms of their output formats and a combined analysis of the respective annotation layers are crucial tasks for linguistic research. After a decade of concentration on the development of formats to structure single annotations for specific linguistic issues, in the last years a variety of specifications to store multiple annotations over the same primary data has been developed. The paper focuses on the integration of the knowledge resource logical document structure information into a text document to enhance the task of automatic anaphora resolution both for the task of candidate detection and antecedent selection. The paper investigates data structures necessary for knowledge integration and retrieval.
Different Views on Markup
(2010)
In this chapter, two different ways of grouping information represented in document markup are examined: annotation levels, referring to conceptual levels of description, and annotation layers, referring to the technical realisation of markup using e.g. document grammars. In many current XML annotation projects, multiple levels are integrated into one layer, often leading to the problem of having to deal with overlapping hierarchies. As a solution, we propose a framework for XML-based multiple, independent XML annotation layers for one text, based on an abstract representation of XML documents with logical predicates. Two realisations of the abstract representation are presented, a Prolog fact base format together with an application architecture, and a specification for XML native databases. We conclude with a discussion of projects that have currently adopted this framework.
An approach to the unification of XML (Extensible Markup Language) documents with identical textual content and concurrent markup in the framework of XML-based multi-layer annotation is introduced. A Prolog program allows the possible relationships between element instances on two annotation layers that share PCDATA to be explored and also the computing of a target node hierarchy for a well-formed, merged XML document. Special attention is paid to identity conflicts between element instances, for which a default solution that takes into account metarelations that hold between element types on the different annotation layers is provided. In addition, rules can be specified by a user to prescribe how identity conflicts should be solved for certain element types.
This article introduces the topic of ‘‘Multilingual language resources and interoperability’’. We start with a taxonomy and parameters for classifying language resources. Later we provide examples and issues of interoperatability, and resource architectures to solve such issues. Finally we discuss aspects of linguistic formalisms and interoperability.
Researchers in many disciplines, sometimes working in close cooperation, have been concerned with modeling textual data in order to account for texts as the prime information unit of written communication. The list of disciplines includes computer science and linguistics as well as more specialized disciplines like computational linguistics and text technology. What many of these efforts have in common is the aim to model textual data by means of abstract data types or data structures that support at least the semi-automatic processing of texts in any area of written communication.
We report on finished work in a project that is concerned with providing methods, tools, best practice guidelines, and solutions for sustainable linguistic resources. The article discusses several general aspects of sustainability and introduces an approach to normalizing corpus data and metadata records. Moreover, the architecture of the sustainability platform implemented by the authors is described.
This article shows that the TEI tag set for feature structures can be adopted to represent a heterogeneous set of linguistic corpora. The majority of corpora is annotated using markup languages that are based on the Annotation Graph framework, the upcoming Linguistic Annotation Format ISO standard, or according to tag sets defined by or based upon the TEI guidelines. A unified representation comprises the separation of conceptually different annotation layers contained in the original corpus data (e.g. syntax, phonology, and semantics) into multiple XML files. These annotation layers are linked to each other implicitly by the identical textual content of all files. A suitable data structure for the representation of these annotations is a multi-rooted tree that again can be represented by the TEI and ISO tag set for feature structures. The mapping process and representational issues are discussed as well as the advantages and drawbacks associated with the use of the TEI tag set for feature structures as a storage and exchange format for linguistically annotated data.
We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question whether these disciplines develop a distinctive language use—both individually and collectively—over the given time period. The data set is the English Scientific Text Corpus (scitex), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
Recently, a claim was made, on the basis of the German Google Books 1-gram corpus (Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books. Science 2010; 331: 176–82), that there was a linear relationship between six non-technical non-Nazi words and three ‘explicitly Nazi words’ in times of World War II (Caruana-Galizia. 2015. Politics and the German language: Testing Orwell’s hypothesis using the Google N-Gram corpus. Digital Scholarship in the Humanities [Online]. http://dsh.oxfordjournals.org/cgi/doi/10.1093/llc/fqv011 (accessed 15 April 2015)). Here, I try to show that apparent relationships like this are the result of misspecified models that do not take into account the temporal aspect of time-series data. The main point of this article is to demonstrate why such analyses run the risk of incorrect statistical inference, where potential effects are both meaningless and can potentially lead to wrong conclusions.
Using the Google Ngram Corpora for six different languages (including two varieties of English), a large-scale time series analysis is conducted. It is demonstrated that diachronic changes of the parameters of the Zipf–Mandelbrot law (and the parameter of the Zipf law, all estimated by maximum likelihood) can be used to quantify and visualize important aspects of linguistic change (as represented in the Google Ngram Corpora). The analysis also reveals that there are important cross-linguistic differences. It is argued that the Zipf–Mandelbrot parameters can be used as a first indicator of diachronic linguistic change, but more thorough analyses should make use of the full spectrum of different lexical, syntactical and stylometric measures to fully understand the factors that actually drive those changes.
Learning from Errors. Systematic Analysis of Complex Writing Errors for Improving Writing Technology
(2015)
In this paper, we describe ongoing research on writing errors with the ultimate goal to develop error-preventing editing functions in word-processors. Drawing from the state-of-the-art research in errors carried out in various fields, we propose the application of a general concept for action-slips as introduced by Norman. We demonstrate the feasibility of this approach by using a large corpus of writing errors in published texts. The concept of slips considers both the process and the product: some failure in a procedure results in an error in the product, i.e., is visible in the written text. In order to develop preventing functions, we need to determine causes of such visible errors.