12: Epilogue


In this book, I have focused on corpus linguistics as a methodology, more precisely, as an application of a general observational scientific procedure to large samples of linguistic usage. I have refrained from placing this method in a particular theoretical framework for two reasons.

The first reason is that I am not convinced that linguistics should be focusing quite as much on theoretical frameworks as it does; I think it should focus instead on linguistic description based on data. Edward Sapir famously said that “unfortunately, or luckily, no language is tyrannically consistent. All grammars leak” (Sapir 1921: 39). This is all the more true of formal models, which have a tendency to attempt to achieve tyrannical consistency by pretending those leaks do not exist or, if they do exist, are someone else’s problem. To me, and to many others whose studies I discussed in this book, the ways grammars leak are simply more interesting than the formalisms that help us ignore these leaks.

The second reason is that I believe that corpus linguistics has a place in any theoretical linguistic framework, as long as that framework has some commitment to modeling linguistic reality. Obviously, the precise place, or rather, the distance between the data analyzed using this method and the consequences of this analysis for the model, depends on the kind of linguistic reality that is being modeled. If it is language use, as it usually is in historically or sociolinguistically oriented studies, the distance is relatively short, requiring the researcher to discover the systematicity behind the usage patterns observed in the data. If it is the mental representation of language, the length of the distance depends on your assumptions about those representations.

Traditionally, those representations have been argued to be something fundamentally different from linguistic usage. It has been claimed that they are an ephemeral “competence” based on a “universal” grammar. There is disagreement as to the nature of this universal grammar – some claim that it is a “mental organ” (Chomsky 1980), some imagine it as an evolved biological instinct (Pinker 1994). But all proponents of a universal grammar are certain that mental representations of language are dependent on and responsible for linguistic usage only in the most indirect ways imaginable, making corpora largely useless for the study of language. As I have argued in Chapters 1 and 2, the only methodological alternative to corpus data that proponents of this view offer – i.e. introspective grammaticality judgments – suffers from all the same problems as corpus data, without offering any of the advantages.

However, more recent models do not draw as strict a line between usage and mental representations. The Usage-Based Model (Langacker 1991) is a model of linguistic knowledge based on the assumption that speakers initially learn language as a set of unanalyzed chunks of various sizes (“established units”), from which they derive linguistic representations of varying degrees of abstractness and complexity based on formal and semantic correspondences across these units (cf. Langacker 1991: 266f). The Emergent Grammar model is based on similar assumptions but eschews abstractness altogether, viewing language as “built up out of combinations of [...] prefabricated parts”, as “a kind of pastiche, pasted together in an improvised way out of ready-made elements” (Hopper 1987: 144).

In these models, the corpus becomes more than just a research tool; it becomes an integral part of a model of linguistic competence (cf. Stefanowitsch 2011). This view is most radically expressed in the notion of “lexical priming” developed in Hoey (2005), in which linguistic competence is seen as a mental concordance over linguistic experience:

The notion of priming as here outlined assumes that the mind has a mental concordance of every word it has encountered, a concordance that has been richly glossed for social, physical, discoursal, generic and interpersonal context. This mental concordance is accessible and can be processed in much the same way that a computer concordance is, so that all kinds of patterns, including collocational patterns, are available for use. It simultaneously serves as a part, at least, of our knowledge base. (Hoey 2005: 11)

Obviously, this mental concordance would not correspond exactly to any concordance derived from an actual linguistic corpus. First, because – as discussed in Chapters 1 and 2 – no linguistic corpus captures the linguistic experience of a given individual speaker or the “average” speaker in a speech community; second, because the concordance that Hoey envisions is not a concordance of linguistic forms but of contextualized linguistic signs – it contains all the semantic and pragmatic information that corpus linguists have to reconstruct laboriously in their analyses. Still, an appropriately annotated concordance from a balanced corpus would be a reasonable operationalization of this mental concordance (cf. also Taylor 2012).
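
As a rough illustration of the kind of computer concordance that this analogy draws on, a minimal Python sketch of a keyword-in-context listing might look as follows. Everything in it is an assumption made for illustration and is not taken from Hoey or from this book: the function name `concordance`, the `context` dictionary standing in for the rich glosses mentioned in the quotation, and the two sample sentences. A real annotated concordance would of course carry far more information than a single genre label.

```python
import re
from typing import NamedTuple


class ConcordanceLine(NamedTuple):
    left: str      # tokens preceding the keyword
    keyword: str   # the matched token
    right: str     # tokens following the keyword
    context: dict  # hypothetical annotation, e.g. genre or speaker


def concordance(texts, keyword, window=3):
    """Return keyword-in-context lines for `keyword`.

    `texts` is an iterable of (text, context) pairs; `context` is an
    illustrative metadata dict attached unchanged to every hit.
    """
    lines = []
    for text, context in texts:
        # Crude tokenization: lowercase runs of word characters.
        tokens = re.findall(r"\w+", text.lower())
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(ConcordanceLine(left, token, right, context))
    return lines


# Invented sample data, for illustration only.
sample = [
    ("All grammars leak, and the leaks are interesting.", {"genre": "academic"}),
    ("The pipes leak whenever it rains.", {"genre": "conversation"}),
]

for line in concordance(sample, "leak"):
    print(f"{line.left:>20} [{line.keyword}] {line.right:<20} {line.context}")
```

Note that the sketch matches word forms only (it finds *leak* but not *leaks*), which mirrors the point made above: a concordance of forms is much poorer than a concordance of contextualized signs.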

In less radical usage-based models of language, such as Langacker’s, the corpus is not a model of linguistic competence – the latter is seen as a consequence of linguistic input perceived and organized by human minds with a particular structure (such as the capacity for figure-ground categorization). The corpus is, however, a reasonable model (or at least an operationalization) of this linguistic input. Many of the properties of language that guide the storage of units and the abstraction of schemas over these stored units can be derived from corpora – frequencies, associations between units of linguistic structure, distributions of these units across grammatical and textual contexts, the internal variability of these units, etc. (cf. Stefanowitsch & Flach 2016 for discussion).

This view is explicitly taken in language acquisition research conducted within the Usage-Based Model (e.g. Tomasello 2003, cf. also Dabrowska 2001, Diessel 2004), where children’s expanding grammatical abilities, as reflected in their linguistic output, are investigated against the input they get from their caretakers as recorded in large corpora of caretaker-child interactions. The view of the corpus as a model of linguistic input is less explicit in the work of the major theoretical proponents of the Usage-Based Model, who connect the notion of usage to the notion of linguistic corpora only in theory. However, it is a view that offers a tremendous potential to bring together two broad strands of research – cognitive-functional linguistics (including some versions of construction grammar) and corpus linguistics (including attempts to build theoretical models on corpus data, such as pattern grammar (Hunston & Francis 2000) and Lexical Priming (Hoey 2005)). These strands have developed more or less independently and their proponents are sometimes mildly hostile toward each other over small but fundamental differences in perspective (see McEnery & Hardie 2012, Section 8.3 for discussion). If they could overcome these differences, they could complement each other in many ways, cognitive linguistics providing a more explicitly psychological framework than most corpus linguists adopt, and corpus linguistics providing a methodology that cognitive linguists serious about usage urgently need.

Finally, in usage-based models as well as in models of language in general, corpora can be treated as models (or operationalizations) of the typical linguistic output of the members of a speech community, i.e. the language produced based on their internalized linguistic knowledge. This is the least controversial view, and the one that I have essentially adopted throughout this book. Even under this view, corpus data remain one of the best sources of linguistic data we have – one that can only keep growing, providing us with ever deeper insights into the leaky, intricate, ever-changing signature activity of our species.

I hope this book has inspired you and I hope it will help you produce research that inspires all of us.