5.1: Types of data
Recall, once again, that at the end of Chapter 2, we defined corpus linguistics as
the investigation of linguistic research questions that have been framed in terms of the conditional distribution of linguistic phenomena in a linguistic corpus.
We discussed the fact that this definition covers cases of hypotheses phrased in absolute terms, i.e. cases where the distribution of a phenomenon across different conditions is a matter of all or nothing (as in “All speakers of American English refer to the front window of a car as windshield; all speakers of British English refer to it as windscreen”) as well as cases where the distribution is a matter of more-or-less (as in “British English speakers prefer the word railway over railroad when referring to train tracks; American English speakers prefer railroad over railway” or “More British speakers refer to networks of train tracks as railway instead of railroad; more American English speakers refer to them as railroad instead of railway”).
In the case of hypotheses stated in terms of more-or-less, predictions must be stated in quantitative terms, which in turn means that our data have to be quantified in some way so that we can compare them to our predictions. In this chapter, we will discuss in more detail how this is done when dealing with different types of data.
Specifically, we will discuss three types of data (or levels of measurement) that we might encounter in the process of quantifying the (annotated) results of a corpus query (Section 5.1): nominal data (discussed in more detail in Section 5.2), ordinal (or rank) data (discussed in more detail in Section 5.3), and cardinal data (discussed in more detail in Section 5.4). These discussions, summarized in Section 5.5, will lay the groundwork for the introduction to statistical hypothesis testing presented in the next chapter.
5.1 Types of data
In order to illustrate these types of data, let us turn to a linguistic phenomenon that is more complex than the distribution of words across varieties, and close to the kind of phenomenon actually of interest to corpus linguists: that of the two English possessive constructions introduced in Section 4.2.3 of Chapter 4 above. As discussed there, the two constructions can often be used seemingly interchangeably, as in (1a, b):
(1) a. The city’s museums are treasure houses of inspiring objects from all eras and cultures. (www.res.org.uk)
b. Today one can find the monuments and artifacts from all of these eras in the museums of the city. (www.travelhouseuk.co.uk)
However, there are limits to this interchangeability. First, there are a number of relations that are exclusively encoded by the of-construction, such as quantities (both generic, as in a couple/bit/lot of, and in terms of measures, as in six miles/years/gallons of), type relations (a kind/type/sort/class of) and composition or constitution (a mixture of water and whisky, a dress of silk, etc.) (cf., e.g., Stefanowitsch 2003).
Second, and more interestingly, even where a relation can be expressed by both constructions, there is often a preference for one or the other in a given context. A number of factors underlying these preferences have been suggested and investigated using quantitative corpus-linguistic methods. Among these, there are three that are widely agreed upon to have an influence, namely the givenness, animacy and weight of the modifier. These three factors nicely illustrate the levels of measurement mentioned above, so we will look at each of them in some detail.
(a) Givenness Following the principle of Functional Sentence Perspective, the s-possessive will be preferred if the modifier (the phrase marked by ’s or of) refers to given information, while the construction with of will be preferred if the modifier is new (Standwell 1982). Thus, (2a) and (3a) sound more natural than (2b) and (3b) respectively:
(2) a. In New York, we visited the city’s many museums.
b. ?? In New York, we visited the many museums of the city.
(3) a. The Guggenheim is much larger than the museums of other major cities.
b. ?? The Guggenheim is much larger than other major cities’ museums.
(b) Animacy Since animate referents tend to be more topical than inanimate ones and more topical elements tend to precede less topical ones, if the modifier is animate, the s-possessive will be preferred, if it is inanimate, the construction with of will be preferred (cf. Quirk et al. 1972: 192–203; Deane 1987):
(4) a. Solomon R. Guggenheim’s collection contains some fine paintings.
b. ?? The collection of Solomon R. Guggenheim contains some fine paintings.
(5) a. The collection of the Guggenheim museum contains some fine paintings.
b. ?? The Guggenheim museum’s collection contains some fine paintings.
(c) Length Since short constituents generally precede long constituents, if the modifier is short, the s-possessive will be preferred, if it is long, the construction with of will be preferred (Altenberg 1980):
(6) a. The museum’s collection is stunning.
b. ?? The collection of the museum is stunning.
(7) a. The collection of the most famous museum in New York is stunning.
b. ?? The most famous museum in New York’s collection is stunning.
In the case of all three factors, we are dealing with hypotheses concerning preferences rather than absolute differences. None of the examples with question marks are ungrammatical and all of them could conceivably occur; they just sound a little bit odd. Thus, the predictions we can derive from each hypothesis must be stated and tested in terms of relative rather than absolute differences: they all involve predictions stated in terms of more-or-less rather than all-or-nothing. Relative quantitative differences are expressed and dealt with in different ways depending on the type of data they involve.
5.1.1 Nominal data
A nominal variable is a variable whose values are labels for categories that have no intrinsic order with respect to each other (i.e., there is no aspect of their definition that would allow us to put them in a natural order), for example SEX, NATIONALITY or NATIVE LANGUAGE. If we categorize data in terms of such a nominal variable, the only way to quantify them is to count the number of observations of each category in a given sample and express the result in absolute frequencies (i.e., raw numbers) or relative frequencies (such as percentages). For example, in the population of the world in 2005, there were 92 million native speakers of GERMAN and 75 million native speakers of FRENCH.
We cannot rank the values of nominal variables based on intrinsic criteria. For example, we cannot rank the German language higher than the French language on the basis of any intrinsic property of German and French. They are simply two different manifestations of the phenomenon LANGUAGE, part of an unordered set including all human languages.
That we cannot rank them based on intrinsic criteria does not mean that we cannot rank them at all. For example, we could rank them by number of native speakers worldwide (in which case, as the numbers cited above show, German ranks above French). We could also rank them by the number of countries in which they are an official language (in which case French, which has official status in 29 countries, ranks above German, which has official status in only 6 countries). But the number of native speakers or the number of countries where a language has an official status is not an intrinsic property of that language: German would still be German if its number of speakers was reduced by half by an asteroid strike, and French would still be French if it lost its official status in all 29 countries. In other words, we are not really ranking FRENCH and GERMAN as values of LANGUAGE at all; instead, we are ranking values of the variables SIZE OF NATIVE SPEECH COMMUNITY and NUMBER OF COUNTRIES WITH OFFICIAL LANGUAGE X respectively.
We also cannot calculate mean values (“averages”) between the values of nominal variables. We cannot claim, for example, that Javanese is the mean of German and French because the number of Javanese native speakers falls (roughly) halfway between that of German and French native speakers. Again, what we would be averaging are the values of the variable SIZE OF NATIVE SPEECH COMMUNITY, and while it makes a sort of sense to say that the mean of the values NUMBER OF FRENCH NATIVE SPEAKERS and NUMBER OF GERMAN NATIVE SPEAKERS was 83.5 million in 2005, it does not make sense to refer to this mean as NUMBER OF JAVANESE SPEAKERS.
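The difference between ranking the labels and ranking values of some other variable can be illustrated with a minimal Python sketch. It uses only the figures cited above (92 and 75 million native speakers in 2005, official status in 6 and 29 countries); the data layout and all names in it are assumptions made for illustration.

```python
from statistics import mean

# Figures as cited in the text; the dictionary layout is an illustrative assumption.
languages = {
    "GERMAN": {"native_speakers_millions": 92, "official_countries": 6},
    "FRENCH": {"native_speakers_millions": 75, "official_countries": 29},
}


def rank_by(variable):
    """Order the language labels by the values of an extrinsic variable."""
    return sorted(languages, key=lambda lang: languages[lang][variable], reverse=True)


# The same two labels come out in a different order for each extrinsic variable.
by_speakers = rank_by("native_speakers_millions")
by_countries = rank_by("official_countries")

# This is a mean of speaker counts (in millions), not a "mean language".
mean_speakers = mean(
    props["native_speakers_millions"] for props in languages.values()
)

print(by_speakers)
print(by_countries)
print(mean_speakers)
```

Note that the labels GERMAN and FRENCH are never compared with each other directly: every ordering is computed from one of the numerical variables attached to them.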
With respect to the three hypotheses concerning the distribution of the s-possessive and the of-possessive, it is obvious that they all involve at least one nominal variable: the constructions themselves. These are essentially values of a variable we could call TYPE OF POSSESSIVE CONSTRUCTION. We could categorize all grammatical expressions of possession in a corpus in terms of the values S-POSSESSIVE and OF-POSSESSIVE, count them and express the result in terms of absolute or relative frequencies. For example, the s-possessive occurs 22,193 times in the BROWN corpus (excluding proper names and instances of the double s-possessive), and the of-possessive occurs 17,800 times. 1
As with the example of the variable NATIVE LANGUAGE above, we can rank the constructions (i.e., the values of the variable TYPE OF POSSESSIVE CONSTRUCTION) in terms of their frequency (the s-possessive is more frequent), but again we are not ranking these values based on an intrinsic criterion but on an extrinsic one: their corpus frequency in one particular corpus. We can also calculate their mean frequency (19,996.50), but again, this is not a mean of the two constructions, but of their frequencies in one particular corpus.
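As a minimal sketch of how such counts are turned into relative frequencies, the following Python snippet uses the two BROWN figures cited above (the second of which is itself an estimate). The variable names are illustrative assumptions.

```python
# Counts as cited in the text for the BROWN corpus;
# the of-possessive figure is an estimate.
counts = {"S-POSSESSIVE": 22193, "OF-POSSESSIVE": 17800}

total = sum(counts.values())

# Relative frequency: the share of each value among all possessives counted.
relative = {value: n / total for value, n in counts.items()}

# A mean of the two frequencies, not a mean of the two constructions.
mean_frequency = total / len(counts)

for value, n in counts.items():
    print(f"{value}: {n} ({relative[value]:.1%})")
print(f"mean frequency: {mean_frequency}")
```

Counting and dividing by the total are the only operations applied to the category labels themselves; the mean is computed over their frequencies.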
5.1.2 Ordinal data
An ordinal variable is a variable whose values are labels for categories that do have an intrinsic order with respect to each other but that cannot be expressed in terms of natural numbers. In other words, ordinal variables are variables that are defined in such a way that some aspect of their definition allows us to order them without reference to an extrinsic criterion, but that does not give us any information about the distance (or degree of difference) between one category and the next. If we categorize data in terms of such an ordinal variable, we can treat them accordingly (i.e., we can rank them), or we can treat them like nominal data by simply ignoring their inherent order (i.e., we can still count the number of observations for each value and report absolute or relative frequencies). We cannot calculate mean values.
Some typical examples of ordinal variables are demographic variables like EDUCATION or (in the appropriate sub-demographic) MILITARY RANK, but also SCHOOL GRADES and the kind of ratings often found in questionnaires (both of which are, however, often treated as though they were cardinal data, see below).
For example, academic degrees are intrinsically ordered: it is part of the definition of a PhD degree that it ranks higher than a master’s degree, which in turn ranks higher than a bachelor’s degree. Thus, we can easily rank speakers in a sample of university graduates based on the highest degree they have completed. We can also simply count the number of PhDs, MAs, and BAs and ignore the ordering of the degrees. But we cannot calculate a mean: if five speakers in our sample of ten speakers have a PhD and five have a BA, this does not allow us to claim that all of them have an MA degree on average. The first important reason for this is that the size of the difference in terms of skills and knowledge that separates a BA from an MA is not the same as that separating an MA from a PhD: in Europe, one typically studies two years for an MA, but it typically takes three to five years to complete a PhD. The second important reason is that the values of ordinal variables typically differ along more than one dimension: while it is true that a PhD is a higher degree than an MA, which is a higher degree than a BA, the three degrees also differ in terms of specialization (from a relatively broad BA to a very narrow PhD), and the PhD degree differs from the two other degrees qualitatively: a BA and an MA primarily show that one has acquired knowledge and (more or less practical) skills, but a PhD primarily shows that one has acquired research skills.
With respect to the three hypotheses concerning the distribution of the s-possessive and the of-possessive, clearly ANIMACY is an ordinal variable, at least if we think of it in terms of a scale, as we did in Chapter 3, Section 3.2. Recall that a simple animacy scale might look like this:
(8) ANIMATE > INANIMATE > ABSTRACT
On this scale, ANIMATE ranks higher than INANIMATE which ranks higher than ABSTRACT in terms of the property we are calling ANIMACY, and this ranking is determined by the scale itself, not by any extrinsic criteria.
This means that we could categorize and rank all nouns in a corpus according to their animacy. But again, we cannot calculate a mean. If we have 50 ANIMATE nouns and 50 ABSTRACT nouns, we cannot say that we have 100 nouns with a mean value of INANIMATE. Again, this is because we have no way of knowing whether, in terms of animacy, the difference between ANIMATE and INANIMATE is the same quantitatively as that between INANIMATE and ABSTRACT, but also because we are, again, dealing with qualitative as well as quantitative differences: the difference between animate and inanimate on the one hand and abstract on the other is that the first two have physical existence; and the difference between animate on the one hand and inanimate and abstract on the other is that animates are potentially alive and the other two are not. In other words, our scale is really a combination of at least two dimensions.
Again, we could ignore the intrinsic order of the values on our ANIMACY scale and simply treat them as nominal data, i.e., count them and report the frequency with which each value occurs in our data. Potentially ordinal data are actually frequently treated like nominal data in corpus linguistics (cf. Section 5.3.2), and with complex scales combining a range of different dimensions, this is probably a good idea; but ordinal data also have a useful place in quantitative corpus linguistics.
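The two ways of treating ordinal data can be sketched in a few lines of Python. The scale is the one given in (8); the sample of annotated nouns is invented purely for illustration, and all names are assumptions.

```python
from collections import Counter

# The scale from (8), highest rank first.
ANIMACY_SCALE = ["ANIMATE", "INANIMATE", "ABSTRACT"]

# Rank positions only encode order; the distance between
# positions 1, 2 and 3 has no meaning, so they must not be averaged.
RANK = {value: position for position, value in enumerate(ANIMACY_SCALE, start=1)}

# Invented annotations, for illustration only.
sample = ["ABSTRACT", "ANIMATE", "INANIMATE", "ANIMATE", "ABSTRACT", "ANIMATE"]

# Ordinal treatment: order the observations by their position on the scale.
ranked = sorted(sample, key=RANK.get)

# Nominal treatment: ignore the order and simply count each value.
frequencies = Counter(sample)

print(ranked)
print(frequencies)
```

The order comes from the definition of the scale itself, which is what makes this an ordinal rather than an extrinsic ranking.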
5.1.3 Cardinal data
Cardinal variables are variables whose values are numerical measurements along a particular dimension. In other words, they are intrinsically ordered (like ordinal data), but not because some aspect of their definition allows us to order them, but because of their nature as numbers. Also, the distance between any two measurements is precisely known and can directly be expressed as a number itself. This means that we can perform any arithmetic operation on cardinal data – crucially, we can calculate means. Of course, we can also treat cardinal data like rank data by ignoring all of their mathematical properties other than their order, and we can also treat them as nominal data.
Typical cases of cardinal variables are demographic variables like AGE or INCOME. For example, we can categorize a sample of speakers by their age and then calculate the mean age of our sample. If our sample contains five 50-year-olds and five 30-year-olds, it makes perfect sense to say that the mean age in our sample is 40; we might need additional information to distinguish between this sample and another sample that consists of five 41-year-olds and five 39-year-olds, which would also have a mean age of 40 (cf. Chapter 6), but the mean itself is meaningful, because the distance between 30 and 40 is the same as that between 40 and 50 and all measurements involve just a single dimension (age).
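The two age samples just described can be compared in a short Python sketch. The text does not say here which additional information would distinguish them; the use of the population standard deviation below is an assumption made for illustration.

```python
from statistics import mean, pstdev

# The two samples described in the text.
sample_a = [50] * 5 + [30] * 5
sample_b = [41] * 5 + [39] * 5

# Both samples have the same mean age.
mean_a, mean_b = mean(sample_a), mean(sample_b)

# Assumption: the standard deviation is used here as one possible piece of
# "additional information" that tells the two samples apart.
spread_a, spread_b = pstdev(sample_a), pstdev(sample_b)

print(mean_a, mean_b)
print(spread_a, spread_b)
```

Both calculations are meaningful only because the differences between ages are themselves numbers on a single dimension.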
With respect to the two possessives, the variables LENGTH and GIVENNESS are cardinal variables (the latter at least if givenness is measured numerically, for example as referential distance). It should be obvious that we can calculate the mean length of words or other constituents in a corpus, a particular sample, a particular position in a grammatical construction, etc.
As mentioned above, we can also treat cardinal data like ordinal data. This may sometimes actually be necessary for mathematical reasons (see Chapter 6 below); in other cases, we may want to transform cardinal data to ordinal data based on theoretical considerations.
For example, the measure of Referential Distance discussed in Chapter 3, Section 3.2 yields cardinal data ranging from 0 to whatever maximum distance we decide on, and it would be possible, and reasonable, to calculate the mean referential distance of a particular type of referring expression. However, Givón (1992: 20f) argues that we should actually think of referential distance as ordinal data: as most referring expressions consistently have a referential distance of either 0–1, or 2–3, or larger than 3, he suggests converting measures of REFERENTIAL DISTANCE into just three categories: MINIMAL GAP (0–1), SMALL GAP (2–3) and LONG GAP (> 3). Once we have done this, we can no longer calculate a mean, because the categories are no longer equivalent in size or distance, but we can still rank them. Of course, we can also treat them as nominal data, simply counting the number of referring expressions in the categories MINIMAL GAP, SMALL GAP and LONG GAP.
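The conversion from cardinal to ordinal data can be sketched as a simple binning function in Python. The category boundaries are those given above; the function name, the assumption that distances are non-negative whole numbers, and the sample distances are all illustrative.

```python
from collections import Counter


def gap_category(distance):
    """Map a referential distance (a non-negative whole number) to a gap category."""
    if distance < 0:
        raise ValueError("referential distance cannot be negative")
    if distance <= 1:
        return "MINIMAL GAP"  # 0-1
    if distance <= 3:
        return "SMALL GAP"  # 2-3
    return "LONG GAP"  # larger than 3


# Invented distances, for illustration only.
distances = [0, 1, 2, 5, 3, 0, 12]

# Ordinal treatment: each measurement is replaced by its category.
categories = [gap_category(d) for d in distances]

# Nominal treatment: count the referring expressions in each category.
frequencies = Counter(categories)

print(categories)
print(frequencies)
```

After the conversion, the distances 5 and 12 fall into the same category, which shows how the information needed to calculate a mean is lost.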
5.1.4 Interim summary
In the preceding three subsections, we have repeatedly mentioned concepts like frequency, percentage, rank and mean. In the following three sections, we will introduce these concepts in more detail, providing a solid foundation of descriptive statistical measures for nominal, ordinal and cardinal data.
Note, however, that most research designs, including those useful for investigating the hypotheses about the two possessive constructions, involve (at least) two variables: (at least) one independent variable and (at least) one dependent variable. Even our definition of corpus linguistics makes reference to this fact when it states that research questions should be framed in such a way that we can answer them by looking at the distribution of linguistic phenomena across different conditions.
Since such conditions are most likely to be nominal in character (a set of language varieties, groups of speakers, grammatical constructions, etc.), we will limit the discussion to combinations of variables where at least one variable is nominal, i.e., (a) designs with two nominal variables, (b) designs with one nominal and one ordinal variable, and (c) designs with one nominal and one cardinal variable. Logically, there are three additional designs, namely designs with (d) two ordinal variables, (e) two cardinal variables or (f) one ordinal and one cardinal variable. For such cases, we would need different variants of correlation analysis, which we will not discuss in this book in any detail (but there are pointers to the relevant literature in the Study Notes to Chapter 6 and we will touch upon such designs in some of the Case Studies in Part II of this book).
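That there are exactly six two-variable designs, three of which involve a nominal variable, can be checked with a small Python sketch. It simply enumerates unordered pairs of the three levels of measurement; the variable names are illustrative.

```python
from itertools import combinations_with_replacement

LEVELS = ["nominal", "ordinal", "cardinal"]

# All unordered pairs of levels, including pairs of the same level.
designs = list(combinations_with_replacement(LEVELS, 2))

# Designs (a)-(c): at least one variable is nominal.
with_nominal = [design for design in designs if "nominal" in design]

# Designs (d)-(f): no nominal variable.
without_nominal = [design for design in designs if "nominal" not in design]

print(with_nominal)
print(without_nominal)
```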
1 This is an estimate; it would take too long to go through all 36,406 occurrences of of and identify those that occur in the structure relevant here, so I categorized a random subsample of 500 hits of of and generalized the proportion of of -possessives vs. other uses of of to the total number of hits for of.