6.1: Statistical hypothesis testing


    As discussed extensively in Chapter 3, scientific hypotheses that are stated in terms of universal statements can only be falsified (proven to be false), but never verified (proven to be true). This insight is the basis for the Popperian idea of a research cycle where the researcher formulates a hypothesis and then attempts to falsify it. If they manage to do so, the hypothesis has to be rejected and replaced by a new hypothesis. As long as they do not manage to do so, they may continue to treat it as a useful working hypothesis. They may even take the repeated failure to falsify a hypothesis as corroborating evidence for its correctness. If the hypothesis can be formulated in such a way that it could be falsified by a counterexample (and if it is clear what would count as a counterexample), this procedure seems fairly straightforward.

    However, as also discussed in Chapter 3, many if not most hypotheses in corpus linguistics have to be formulated in relative terms – like those introduced in Chapter 5. As discussed in Section 3.1.2, individual counterexamples are irrelevant in this case: if my hypothesis is that most swans are white, this does not preclude the existence of differently-colored swans, so the hypothesis is not falsified if we come across a black swan in the course of our investigation. In this chapter, we will discuss how relative statements can be investigated within the scientific framework introduced in Chapter 3.

    6.1 Statistical hypothesis testing

    Obviously, if our hypothesis is stated in terms of proportions rather than absolutes, we must also look at our data in terms of proportions rather than absolutes. A single counterexample will not disprove our hypothesis, but what if the majority of cases we come across are counterexamples? For example, if we found more black swans than white swans, would this not falsify our hypothesis that most swans are white? The answer is: not quite. With a hypothesis stated in absolute terms, it is easy to specify how many counterexamples we need to disprove it: one. If we find just one black swan, then it cannot be true that all swans are white, regardless of how many swans we have looked at and how many swans there are.

    But with a hypothesis stated in terms of proportions, matters are different: even if the majority or even all of the cases in our data contradict it, this does not preclude the possibility that our hypothesis is true – our data will always just constitute a sample, and there is no telling whether this sample corresponds to the totality of cases from which it was drawn. Even if most or all of the swans we observe are black, this may simply be an unfortunate accident – in the total population of swans, the majority could still be white. (By the same reasoning, of course, a hypothesis is not verified if our sample consists exclusively of cases that corroborate it, since this does not preclude the possibility that in the total population, counterexamples are the majority).
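    To make the role of sampling variation concrete, here is a minimal simulation sketch in Python (my illustration, not part of the original text; the population proportion and sample size are invented purely for the example). It draws many small samples from a hypothetical swan population that is mostly white and counts how often black swans nevertheless form the majority of a sample:

        import random

        random.seed(1)  # fixed seed so the illustration is reproducible

        # Invented figures: suppose 60 percent of all swans are white.
        PROPORTION_WHITE = 0.6
        SAMPLE_SIZE = 10
        N_SAMPLES = 10_000

        # Draw many small samples and count how often black swans are in the majority.
        black_majority_samples = 0
        for _ in range(N_SAMPLES):
            white = sum(random.random() < PROPORTION_WHITE for _ in range(SAMPLE_SIZE))
            if white < SAMPLE_SIZE - white:  # more black than white swans in this sample
                black_majority_samples += 1

        print(f"Share of samples with a black-swan majority: "
              f"{black_majority_samples / N_SAMPLES:.1%}")

    With these invented figures, roughly one small sample in six contains more black than white swans, even though white swans clearly predominate in the hypothetical population; this is precisely why a single sample can neither verify nor falsify a relative statement.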

    So if relative statements cannot be falsified, and if (like universal statements) they cannot be verified, what can we do? There are various answers to this question, all based in probability theory (i.e., statistics). The most widely-used and broadly-accepted of these, and the one we adopt in this book, is an approach sometimes referred to as “Null Hypothesis Significance Testing”.1

    In this approach, which I will refer to simply as statistical hypothesis testing, the problem of the non-falsifiability of quantitative hypotheses is solved in an indirect but rather elegant way. Note that with respect to any two variables, there are two broad possibilities concerning their distribution in a population: the distribution could be random (meaning that there is no relationship between the values of the two variables), or it could be non-random (meaning that particular values of one variable are more likely to co-occur with particular values of the other variable). For example, it could be the case that swans are randomly black or white, or it could be the case that they are more likely to have one of these colors than the other. If the latter is true, there are, again, two broad possibilities: the data could agree with our hypothesis, or they could disagree with it. For example, it could be the case that there are more white swans than black swans (corroborating our hypothesis), or that there are more black swans than white swans (falsifying our hypothesis).

    Unless we have a very specific prediction as to exactly what proportion of our data should consist of counterexamples, we cannot draw any conclusions from a sample. For most research hypotheses, we cannot specify such an exact proportion – if our hypothesis is that MOST SWANS ARE WHITE, then “most” could mean anything from 50.01 percent to 99.99 percent. But as we will see in the next subsection, we can always specify the exact proportion of counterexamples that we would expect to find if there were a random relationship between our variables, and we can then use a sample to determine whether such a random relationship holds (or rather, how probable it is to hold).
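    As a brief illustration of this point (my addition, using standard notation that is not introduced in the original text): if two nominal variables A and B are statistically unrelated, the expected frequency of each intersection of their values follows directly from the marginal frequencies,

    \[ E_{ij} = N \cdot \frac{R_i}{N} \cdot \frac{C_j}{N} = \frac{R_i \times C_j}{N}, \]

    where \(R_i\) is the overall frequency of the \(i\)-th value of A, \(C_j\) is the overall frequency of the \(j\)-th value of B, and \(N\) is the total number of observations. This is what allows us to state exactly what a “random relationship” would look like, even when the research hypothesis itself is vague about proportions.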

    Statistical hypothesis testing utilizes this fact by formulating not one, but two hypotheses – first, a research hypothesis postulating a relationship between two variables (like “Most swans are white” or like the hypotheses introduced in Chapter 5), also referred to as H1 or alternative hypothesis; second, the hypothesis that there is a random relationship between the variables mentioned in the research hypothesis, also referred to as H0 or null hypothesis. We then attempt to falsify the null hypothesis and to show that the data conform to the alternative hypothesis.

    In a first step, this involves turning the null hypothesis and the alternative hypothesis into quantitative predictions concerning the intersections of the variables, as schematically shown in (1a, b):

    (1) a. Null hypothesis (H0): There is no relationship between Variable A and Variable B.

    Prediction: The data should be distributed randomly across the intersections of A and B; i.e., the frequencies/medians/means of the intersections should not differ from those expected by chance.

    b. Alternative hypothesis (H1): There is a relationship between Variable A and Variable B such that some value(s) of A tend to co-occur with some value(s) of B.

    Prediction: The data should be distributed non-randomly across the intersections of A and B; i.e., the frequencies/medians/means of some of the intersections should be higher and/or lower than those expected by chance.

    Once we have formulated our research hypothesis and the corresponding null hypothesis in this way (and once we have operationalized the constructs used in formulating them), we collect, annotate and quantify the relevant data, as discussed in the preceding chapter.

    The crucial step in terms of statistical significance testing then consists in determining whether the observed distribution differs from the distribution we would expect if the null hypothesis were true – if the values of our variables were distributed randomly in the data. Of course, it is not enough to observe a difference – a certain amount of variation is to be expected even if there is no relationship between our variables. As will be discussed in detail in the next section, we must determine whether the difference is large enough to assume that it does not fall within the range of variation that could occur randomly. If we are satisfied that this is the case, we can (provisionally) reject the null hypothesis. If not, we must (provisionally) reject our research hypothesis.
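    As a minimal sketch of what such a comparison might look like in practice (my illustration, not the author's own worked example; the counts are invented, the scipy library is assumed to be available, and the chi-square test used here is only one of several possible tests), consider a simple design with two binary variables:

        # Compare an observed 2-by-2 distribution with the distribution expected
        # under the null hypothesis. Counts are invented; scipy is assumed available.
        from scipy.stats import chi2_contingency

        observed = [
            [40, 10],   # rows: values of Variable A
            [20, 30],   # columns: values of Variable B
        ]

        chi2, p_value, dof, expected = chi2_contingency(observed)

        print("Expected frequencies under the null hypothesis:")
        print(expected)
        print(f"Chi-square statistic: {chi2:.2f}, p-value: {p_value:.4f}")

        # A sufficiently small p-value (conventionally below 0.05) indicates that the
        # observed distribution is unlikely to fall within the range of purely random
        # variation, so the null hypothesis can be (provisionally) rejected.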

    In a third step (or in parallel with the second step), we must determine whether the data conform to our research hypothesis, or, more precisely, whether they differ from the prediction of H0 in the direction predicted by H1. If they do (for example, if there are more white swans than black swans), we can (provisionally) accept our research hypothesis, i.e., we can continue to use it as a working hypothesis, in the same way that we would continue to use an absolute hypothesis as long as we do not find a counterexample. If the data differ from the prediction of H0 in the opposite direction to that predicted by our research hypothesis – for example, if there are more black than white swans – we must, of course, also reject our research hypothesis, and treat the unexpected result as a new problem to be investigated further.
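    The role of direction can be sketched in the same hedged way (invented counts again, scipy assumed to be available): a one-sided test asks not merely whether the data deviate from the null hypothesis, but whether they deviate in the direction predicted by H1.

        # Directional test for the swan example, with invented sample counts.
        from scipy.stats import binomtest

        white, black = 70, 30          # hypothetical sample
        n = white + black

        # H0: white and black swans are equally frequent (p = 0.5).
        # H1: white swans are more frequent than black swans (one-sided, "greater").
        result = binomtest(white, n, p=0.5, alternative="greater")

        print(f"Observed proportion of white swans: {white / n:.2f}")
        print(f"One-sided p-value: {result.pvalue:.4f}")

        # Had the sample contained 30 white and 70 black swans instead, the deviation
        # from H0 would have been just as large, but in the wrong direction, so the
        # research hypothesis would have to be rejected regardless of the outcome of
        # a two-sided test.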

    Let us now turn to a more detailed discussion of probabilities, random variation and how statistics can be used to (potentially) reject null hypotheses.

    1 It should be mentioned that there is a small but vocal group of critics who have pointed out a range of real and apparent problems with Null Hypothesis Significance Testing. In my view, there are three reasons that justify ignoring their criticism in a textbook like this. First, they have not managed to convince a significant (pun intended) number of practitioners in any field using statistics, which may not constitute a theoretical argument against the criticism, but certainly a practical one. Second, most, if not all, of the criticisms pertain to the way in which Null Hypothesis Significance Testing is used and to the way in which the results are (mis-)interpreted in the view of the critics. Along with many other practitioners, and even some of the critics, I believe that the best response to this is to make sure we apply the method appropriately and interpret the results carefully, rather than to give up a near-universally used and fruitful set of procedures. Third, it is not clear to me that the alternatives suggested by the critics are, on the whole, less problematic or less prone to abuse and misinterpretation.


    This page titled 6.1: Statistical hypothesis testing is shared under a CC BY-SA license and was authored, remixed, and/or curated by Anatol Stefanowitsch (Language Science Press).
