10: Text

    As mentioned repeatedly, linguistic corpora, by their nature, consist of word forms, while other levels of linguistic representation are not represented unless the corresponding annotations are added. In written corpora, there is one level other than the lexical that is (or can be) directly represented: the text. Well-constructed linguistic corpora typically consist of (samples from) individual texts whose meta-information (author, title, original place and context of publication, etc.) is known. There is a substantial body of corpus-linguistic research based on designs that combine the two inherently represented variables \(\mathrm{Word}\) (\(\mathrm{Form}\)) and \(\mathrm{Text}\); such designs may be concerned with the occurrence of words in individual texts, or, more typically, with the occurrence of words in clusters of texts belonging to the same language variety (defined by topic, genre, function, etc.).
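    As a rough illustration of such a design, the following sketch cross-tabulates word forms against clusters of texts. Everything in it is an assumption made for illustration: the four-sentence toy corpus, the `variety` labels, the whitespace tokenization, and the helper names `word_by_variety` and `relative_frequency` are hypothetical and do not come from any actual corpus or tool.

```python
from collections import Counter

# Hypothetical toy corpus: each text carries a variety label as meta-information.
corpus = [
    {"variety": "news", "text": "the minister said the budget would rise"},
    {"variety": "news", "text": "the court said the appeal would fail"},
    {"variety": "fiction", "text": "she said nothing and walked to the door"},
    {"variety": "fiction", "text": "he would not look at the sea"},
]


def word_by_variety(texts):
    """Count word forms per variety and the total number of tokens per variety."""
    counts = {}
    totals = Counter()
    for t in texts:
        # Naive tokenization (lowercase, split on whitespace); an assumption
        # that a real study would replace with a proper tokenizer.
        tokens = t["text"].lower().split()
        counts.setdefault(t["variety"], Counter()).update(tokens)
        totals[t["variety"]] += len(tokens)
    return counts, totals


counts, totals = word_by_variety(corpus)


def relative_frequency(word, variety):
    """Occurrences of a word form per token in one cluster of texts."""
    return counts[variety][word] / totals[variety]


for variety in totals:
    print(variety, counts[variety]["said"], totals[variety],
          round(relative_frequency("said", variety), 3))
```

    Relative rather than raw frequencies are reported here because clusters of texts will rarely contain the same number of tokens, so raw counts are not directly comparable across varieties.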

    Texts are, of course, produced by speakers, and depending on how much and what kind of information about these speakers is available, we can also cluster texts according to demographic variables such as dialect, socioeconomic status, gender, age, political or religious affiliation, etc. (as we have done in many of the examples in earlier chapters). In these cases, quantitative corpus linguistics is essentially a variant of sociolinguistics, differing mainly in that the linguistic phenomena it pays most attention to are not necessarily those most central to sociolinguistic research in general.

