1.3: Intuition data vs. corpus data

    As the preceding section has shown, intuited judgments are just as vulnerable as corpus data as far as the major points of criticism leveled at the latter are concerned. In fact, I have tried to argue that they are, in some respects, more vulnerable to these criticisms. For those readers who are not yet convinced of the need for corpus data, let me compare the quality of intuited “data” and corpus data in terms of two aspects that are considered much more crucial in methodological discussions outside of linguistics than those discussed above:

    1. data reliability (roughly, how sure can we be that other people will arrive at the same set of data using the same procedures);
    2. data validity, or the epistemological status of the data (roughly, how well do we understand what real-world phenomenon the data correspond to).5

    As to the first criterion, note that the problem is not that intuition “data” are necessarily wrong. Very often, intuitive judgments turn out to agree very well with more objective kinds of evidence, and this should not come as a surprise. After all, as native speakers of a language, or even as advanced foreign-language speakers, we have considerable experience with using that language actively (speaking and writing) and passively (listening and reading). It would thus be surprising if we were categorically unable to make statements about the probability of occurrence of a particular expression.

    Instead, the problem is that we have no way of determining introspectively whether a particular piece of intuited “data” is correct or not. To decide this, we need objective evidence, obtained either by serious experiments (including elicitation experiments) or by corpus-linguistic methods. But if that is the case, the question is why we need intuition “data” in the first place. In other words, intuition “data” are simply not reliable.

    The second criterion provides an even more important argument, perhaps the most important argument, against the practice of intuiting. Note that even if we manage to solve the problem of reliability (as systematic elicitation from a representative sample of speakers does to some extent), the epistemological status of intuitive data remains completely unclear. This is particularly evident in the case of grammaticality judgments: we simply do not know what it means to say that a sentence is “grammatical” or “ungrammatical”, i.e., whether grammaticality is a property of natural languages or of their mental representations in the first place. It is not entirely implausible to doubt this (cf. Sampson 1987), and even if one does not, one would have to offer a theoretically well-founded definition of what grammaticality is and show how grammaticality judgments measure it. Neither task has been accomplished satisfactorily.

    In contrast, the epistemological status of a corpus datum is crystal clear: it is (a graphemic representation of) something that a specific speaker has said or written on a specific occasion in a specific situation. Statements that go beyond a specific speaker, a specific occasion or a specific situation must, of course, be inferred from these data; this is difficult and there is a constant risk that we get it wrong. However, inferring general principles from specific cases is one of the central tasks of all scientific research and the history of any discipline is full of inferences that turned out to be wrong. Intuited data may create the illusion that we can jump to generalizations directly and without the risk of errors. The fact that corpus data do not allow us to maintain this illusion does not make them inferior to intuition, it makes them superior. More importantly, it makes them normal observational data, no different from observational data in any other discipline.

    To put it bluntly, then, intuition “data” are less reliable and less valid than corpus data, and they are just as incomplete and in need of interpretation. Does this mean that intuition “data” should be banned completely from linguistics? The answer is no, but not without qualification.

    On the one hand, we would deprive ourselves of a potentially very rich source of information by dogmatically abandoning the use of our linguistic intuition (native-speaker or not). On the other hand, given the unreliability and questionable epistemological status of intuition data, we cannot simply use them, as some corpus linguists suggest (e.g. McEnery & Wilson 2001: 19), to augment our corpus data. The problem is that any mixed data set (i.e. any set containing both corpus data and intuition “data”) will only be as valid, reliable, and complete as the weakest subset of data it contains. We have already established that intuition “data” and corpus data are both incomplete, so a mixed set will still be incomplete (albeit perhaps less incomplete than a pure set), and nothing much is gained. Instead, the mixed set will simply inherit the lack of validity and reliability from the intuition “data”, and thus its quality will actually be lowered by their inclusion.

    The solution to this problem is quite simple. While intuited information about linguistic patterns fails to meet even the most basic requirements for scientific data, it meets every requirement for scientific hypotheses. A hypothesis need be neither reliable, nor valid (in the sense of the term used here), nor complete; in fact, these words have no meaning when applied to hypotheses. The only requirement a hypothesis must meet is that of testability (as discussed further in Chapter 3). There is nothing wrong with introspectively accessing our experience as a native speaker of a language (or as an advanced non-native speaker, for that matter), provided we treat the results of our introspection as hypotheses about meaning or probability of occurrence rather than as facts.

    Since there are no standards of purity for hypotheses, it is also unproblematic to mix intuition and corpus data in order to come up with more fine-grained hypotheses (cf. in this context Aston & Burnard 1998: 143), as long as we then test our hypotheses on a pure data set that does not include the corpus data used in generating them in the first place.
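
    To make this last requirement concrete, the following is a minimal sketch in Python (not part of the original text; the toy sentences, the fifty-fifty split and the crude 'to'-dative detector are all illustrative assumptions). The corpus is divided into an exploration set, which may be freely combined with intuition to generate hypotheses, and a held-out test set, which is consulted only once, to test them.

        import random

        # Hypothetical toy "corpus"; in practice these sentences would come
        # from a real corpus query, not a hard-coded list.
        sentences = [
            "I gave the book to Mary.",
            "I gave Mary the book.",
            "She sent a letter to her friend.",
            "She sent her friend a letter.",
            "He showed the photo to his sister.",
            "He showed his sister the photo.",
        ]

        # Reproducible split into an exploration set (hypothesis generation)
        # and a held-out test set (hypothesis testing).
        random.seed(42)
        shuffled = random.sample(sentences, k=len(sentences))
        half = len(shuffled) // 2
        exploration_set = shuffled[:half]  # browse freely, combine with intuition
        test_set = shuffled[half:]         # consulted only once, for the test

        def has_to_dative(sentence: str) -> bool:
            """Crude stand-in for real annotation: does the sentence contain
            a 'to'-dative (e.g. 'gave the book to Mary')?"""
            return " to " in f" {sentence.lower()} "

        # Suppose our (intuition- and exploration-based) hypothesis is that the
        # prepositional dative accounts for about half of the ditransitive
        # examples; its fate is decided by the held-out data alone.
        hits = sum(has_to_dative(s) for s in test_set)
        print(f"'to'-datives in held-out test set: {hits} of {len(test_set)}")

    The split itself does the methodological work: whatever patterns we notice while browsing the exploration set (or while consulting our intuitions) count as hypotheses, and only the data we have not yet looked at decides whether they survive.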

    5 Readers who are well-versed in methodological issues are asked to excuse this somewhat abbreviated use of the term validity; the term has, of course, a range of uses in the philosophy of science and in methodological theory (we will encounter a use different from the one employed here in Chapters 2.3 and 4).


