Social Sci LibreTexts

10.1: Keyword analysis


    In the investigation of relationships between words (or other units of language structure) and texts (or clusters of texts), researchers frequently use a method referred to as keyword analysis.¹ The term was originally used in contexts where cultural values and practices were studied through particular lexical items (cf. Williams 1976, Wierzbicka 2003); in corpus linguistics, it is used in a related but slightly broader sense of words that are characteristic of a particular text, language variety or demographic in the sense that they occur with “unusual frequency in a given text” or set of texts, where “unusual” means high “by comparison with a reference corpus of some kind” (Scott 1997: 236).

    In other words, the corpus-linguistic identification of keywords is analogous to the identification of differential collocates, except that it analyses the association of a word W to a particular text (or collection of texts) T in comparison to the language as a whole (as represented by the reference corpus, which is typically a large, balanced corpus). Table 10.1 shows this schematically.

    Table 10.1: A generic 2-by-2 table for keyword analysis

    [table image not reproduced]
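The 2-by-2 design described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to produce the tables in this chapter: the class name `KeywordTable`, the helper `make_table` and all counts are hypothetical. The sketch cross-tabulates a word W against all other words, in the text T and in the reference corpus, and derives the frequencies expected if W were distributed evenly across both.

```python
from dataclasses import dataclass


@dataclass
class KeywordTable:
    """2-by-2 table: word W vs. all other words, text T vs. reference corpus."""
    word_in_text: int    # occurrences of W in T
    word_in_ref: int     # occurrences of W in the reference corpus
    other_in_text: int   # all other tokens in T
    other_in_ref: int    # all other tokens in the reference corpus

    def expected(self):
        """Expected cell frequencies under independence:
        row total * column total / grand total."""
        rows = (self.word_in_text + self.word_in_ref,
                self.other_in_text + self.other_in_ref)
        cols = (self.word_in_text + self.other_in_text,
                self.word_in_ref + self.other_in_ref)
        n = sum(rows)
        return [[r * c / n for c in cols] for r in rows]


def make_table(word_text, size_text, word_ref, size_ref):
    """Build the table from the word's frequency and the size of each corpus."""
    return KeywordTable(word_text, word_ref,
                        size_text - word_text, size_ref - word_ref)


# Hypothetical counts, chosen only for illustration.
table = make_table(word_text=50, size_text=10_000,
                   word_ref=100, size_ref=1_000_000)
exp = table.expected()

# W is a keyword candidate if it occurs more often in T than expected.
overrepresented = table.word_in_text > exp[0][0]
print(f"observed {table.word_in_text}, expected {exp[0][0]:.2f}")
```

Comparing the observed frequency with the expected one only gives the direction of the difference (over- or underrepresentation); an association measure is still needed to decide how strong the difference is.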

    Just like collocation analysis, keyword analysis is most often applied inductively, but there is nothing that precludes a deductive design if we have hypotheses about the over- or underrepresentation of particular lexical items in a particular text or collection of texts. In either case, we have two nominal variables: \(\mathrm{Keyword}\) (with the individual words as values) and \(\mathrm{Text}\) (with the values \(\mathrm{text}\) and \(\mathrm{reference \space corpus}\)).

    If keyword analysis is applied to a single text, the aim is typically to identify either the topic area or some stylistic property of that text. When applied to text categories, the aim is typically to identify general lexical and/or grammatical properties of the language variety represented by the text categories.

    As a first example of the kind of results that keyword analysis yields, consider Table 10.2, which shows the 20 most frequent tokens (including punctuation marks) in the LOB corpus and two individual texts (all words were converted to lower case).

    As we can see, the differences are relatively small, as all lists are dominated by frequent function words and punctuation marks. Ten of these occur on all three lists (a, and, in, of, that, the, to, was, the comma and the period), and another six occur on two of them (as, he, it, on, and opening and closing quotation marks – although the latter are single quotation marks in the case of LOB and double quotation marks in the case of Text B). Even the types that occur only once are mostly uninformative with respect to the language variety (or text category) we may be dealing with (1959, at, by, for, had, is, with, the hyphen and opening and closing parentheses). The only exceptions are four content words in Text A: Neosho, river, species, station – these suggest that the text is about the Neosho river and perhaps that it deals with biology (as suggested by the word species).

    Table 10.2: Most frequent words in three texts (relative frequencies)

    [table image not reproduced]
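A frequency list of this kind is easy to compute. The following is a minimal sketch under simplifying assumptions: the function names `tokenize` and `top_tokens` are made up for this example, and the regular expression is a crude stand-in for a proper tokenizer (it treats every punctuation character as a separate token and splits words at apostrophes and hyphens).

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def top_tokens(text, n=20):
    """Return the n most frequent tokens with their relative frequencies."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return [(token, count / total) for token, count in counts.most_common(n)]


# Toy input; a real study would read in the full text of a corpus file.
sample = "The river is wide, and the river is deep."
for token, rel_freq in top_tokens(sample, n=3):
    print(f"{token}\t{rel_freq:.3f}")
```

Even on real data, such a list will be dominated by function words and punctuation marks, which is why raw frequency alone says little about what sets a text apart.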

    Applying keyword analysis to each text or collection of texts allows us to identify the words that differ most significantly in frequency from the reference corpus, telling us how the text in question differs lexically from the (written) language of its time as a whole. Table 10.3 lists the keywords for Text A.
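The ranking step can be sketched as follows. This section does not name the association measure behind the keyword tables, so the sketch assumes the log-likelihood ratio G², a common choice in keyword analysis; the function names and the toy counts are likewise hypothetical. Each word's 2-by-2 table is scored, and only words that are more frequent in the text than expected are kept.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 for the 2-by-2 table [[a, b], [c, d]], where
    a = word in text, b = word in reference corpus,
    c = other words in text, d = other words in reference corpus."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    cells = ((a, rows[0], cols[0]), (b, rows[0], cols[1]),
             (c, rows[1], cols[0]), (d, rows[1], cols[1]))
    total = 0.0
    for observed, row, col in cells:
        if observed > 0:  # a cell with zero observations contributes nothing
            total += observed * math.log(observed * n / (row * col))
    return 2 * total


def keywords(text_counts, ref_counts, top=20):
    """Rank the words that are overrepresented in the text by G2."""
    n_text = sum(text_counts.values())
    n_ref = sum(ref_counts.values())
    scored = []
    for word, a in text_counts.items():
        b = ref_counts.get(word, 0)
        expected = (a + b) * n_text / (n_text + n_ref)
        if a > expected:  # skip words that are underrepresented in the text
            scored.append((word, log_likelihood(a, b, n_text - a, n_ref - b)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Toy frequency lists, invented for illustration.
text_counts = Counter({"the": 50, "river": 30, "fish": 20})
ref_counts = Counter({"the": 6000, "of": 3985, "river": 10, "fish": 5})
for word, g2 in keywords(text_counts, ref_counts):
    print(f"{word}\t{g2:.1f}")
```

In this toy example, the is the most frequent word in the text but drops out of the ranking, because it is no more frequent there than the reference corpus leads us to expect.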

    The keywords now convey a very specific idea of what the text is about: there are two proper names of rivers (the Neosho already seen on the frequency list and the Marais des Cygnes, represented by its constituents Cygnes, Marais and des), and there are a number of words for specific species of fish as well as the words river and channel.

    Table 10.3: Keywords in a report on fish populations

    [table image not reproduced]

    The text is clearly about fish in the two rivers. The occurrence of the words station and abundance suggests a research context, which is supported by the occurrence of two dates and opening and closing parentheses (which are often used in scientific texts to introduce references). The text in question is indeed a scientific report on fish populations: Fish Populations, Following a Drought, In the Neosho and Marais des Cygnes Rivers of Kansas (available via Project Gutenberg and in the Supplementary Online Material, file TXQP). Note that the occurrence of some tokens (such as the dates and the parentheses) may be characteristic of a language variety rather than an individual text, a point we will return to below.

    Next, consider Table 10.4, which lists the keywords for Text B. Three things are noticeable: the keyness of a number of words that are most likely proper names (Hume, Vye, Rynch, Wass, Brodie and Jumala), pronouns (he, his) and punctuation marks indicative of direct speech (the quotation marks and the exclamation mark).

    Table 10.4: Keywords in a science fiction novel

    [table image not reproduced]

    This does not tell us anything about this particular text, but taken together, these pieces of evidence point to a particular genre: narrative text (novels, short stories, etc.). The few potential content words suggest a particular sub-genre: the archaic hunter in combination with the unusual word flitter is suggestive of fantasy or science fiction. If we were to include the next twenty most strongly associated nouns, we would find patrol, camp, needler, safari, guild, tube, planet and out-hunter, which corroborate the impression that we are dealing with a science-fiction novel. And indeed, the text in question is the science-fiction novel Starhunter by Andre Alice Norton (available via Project Gutenberg and in the Supplementary Online Material, file TXQP).

    Again, the keywords identified are a mixture of topical markers and markers for the language variety (in this case, the genre) of the text, so even a study of the keywords of single texts provides information about more general linguistic properties of the text in question as well as its specific topic. But keyword analysis reveals its true potential when we apply it to clusters of texts, as in the case studies in the next section.

    _______________________________

    ¹ The term keyword is frequently spelled as two words (key word) or with a hyphen (key-word). I have chosen the spelling as a single word here because it seems simplest (at least to me, as a native writer of German, where compounds are always spelled as single words).


    This page titled 10.1: Keyword analysis is shared under a CC BY-SA license and was authored, remixed, and/or curated by Anatol Stefanowitsch (Language Science Press).
