Social Sci LibreTexts

10.5: Evidence for phonemes as mental categories

  • Page ID
    • Catherine Anderson, Bronwyn Bjorkman, Derek Denis, Julianne Doner, Margaret Grant, Nathan Sanders, and Ai Taniguchi
    • eCampusOntario

    In Chapter 4, you learned that every human language has a phonology, but that the phonology of each language is distinct. For example, two sounds that are allophones of a single phoneme in one language might be separate phonemes in another language. The conclusion that we reach, then, is that each language has its own way of organizing speech sounds into a phonological system. This system is part of the mental grammar of a speaker of that language. In this section, we will examine some evidence from psycholinguistic and neurolinguistic experiments that provide further support for phonology as something that the mind and brain do.

    Chapter 4 defined a phoneme as the smallest unit in a language that can create contrast, such that exchanging one phoneme for another can create a minimal pair. The English words pat (/pæt/) and bat (/bæt/) differ only in the initial phoneme (/p/ or /b/), but have different meanings. This makes them a minimal pair, and the fact that we can make a minimal pair shows us that /p/ and /b/ are separate phonemes in English. Another way of thinking about phonemes is to say that a phoneme is a mental category of speech sounds (signed languages also have categories that permit variation, so this isn’t something special about spoken language). It was also noted in Chapter 4 that English voiceless stops like /p/ are produced with aspiration (written [pʰ]) at the beginning of a stressed syllable. But the difference between an aspirated and unaspirated voiceless bilabial stop ([pʰ] vs. [p]) cannot create contrast in English. This is because [pʰ] and [p] are two variants of a single phoneme. When a speaker of English hears either [pʰ] or [p], the mind of that speaker maps the sound onto the same category /p/. Of course there are many other languages that treat [pʰ] and [p] as separate phonemes, for example Hindi and Thai.
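
    The idea that several sounds map onto one category, and that only a change of category can create a minimal pair, can be sketched in a few lines of Python. This is a toy illustration, not a model of perception: the table ENGLISH_CATEGORIES and the helper functions are hypothetical names introduced here, and the inventory covers only the sounds in the examples above.

```python
# Hypothetical, heavily simplified table: each surface sound (allophone) is
# mapped to the phoneme category an English listener is assumed to assign it to.
ENGLISH_CATEGORIES = {"pʰ": "p", "p": "p", "b": "b", "æ": "æ", "t": "t"}


def to_phonemes(segments, categories):
    """Map each surface segment to its phoneme category."""
    return [categories[s] for s in segments]


def is_minimal_pair(word_a, word_b):
    """True if two equal-length phoneme sequences differ in exactly one position."""
    if len(word_a) != len(word_b):
        return False
    differences = sum(1 for a, b in zip(word_a, word_b) if a != b)
    return differences == 1


pat = to_phonemes(["pʰ", "æ", "t"], ENGLISH_CATEGORIES)
bat = to_phonemes(["b", "æ", "t"], ENGLISH_CATEGORIES)
unaspirated_pat = to_phonemes(["p", "æ", "t"], ENGLISH_CATEGORIES)

print(is_minimal_pair(pat, bat))              # True: /p/ vs. /b/ contrast
print(is_minimal_pair(pat, unaspirated_pat))  # False: [pʰ] and [p] are both /p/
```

    A table for Hindi or Thai would map [pʰ] and [p] to two different categories, so the second comparison would come out as a minimal pair in those languages.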

    We can draw conclusions about the categories a speaker of a given language has in their mind by doing phonological analysis. We can look for minimal pairs, for example, or try to characterize the phonological environments in which a given speech sound appears. But in this chapter we are exploring the idea that these categories exist in the mind and brain of a speaker. We might ask, then, whether we can find evidence that our brains map sounds to the category they belong to. In other words, we might ask whether our brains treat sounds that vary a little bit in their acoustic qualities as the same, because they are examples of one phoneme.

    One group of researchers (Phillips et al., 2000) looked for this kind of evidence by examining whether our brains show a ‘surprise’ response to a new sound that is in a different phonological category to the others.

    How to be a linguist: The use of electro/magneto-encephalography in linguistics

    You have likely seen visual representations of electroencephalographic (EEG) recordings in medical or scientific settings. These look like wavy lines and are recordings of electrical activity from electrodes placed on the surface of the scalp. The overall character of the wavy lines varies depending on a number of factors, for example whether the person whose scalp is being recorded is awake, asleep, or having a neurological problem like a seizure. Psycholinguists, however, are typically interested not in these overall differences but in very small changes in the electrical field generated by the brain in response to a stimulus. These are called Event-Related Potentials, or ERPs. To compare ERPs to different stimuli, researchers must typically collect a number of responses from one participant, and collect responses from a number of participants. In the end, an average response to a stimulus might look something like the following:

    A sample diagram of an Event-Related Potential wave. Beginning at stimulus onset, the averaged electrical potential at a given electrode proceeds in a series of peaks and valleys over the course of several hundred milliseconds.
    Sample averaged ERP wave from a single electrode.

    Notice that in this diagram, negative electrical potentials are plotted up; this is merely a convention in this type of research. The horizontal axis represents time, beginning at the time the stimulus is presented to participants.
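
    The averaging step described above can be sketched in Python. This is only an illustration under made-up assumptions: the simulated signal shape, the noise level, and the number of trials are hypothetical, and real ERP analysis involves further steps (such as filtering and artifact rejection) that are not shown.

```python
import math
import random


def simulate_trial(n_samples, rng, noise_sd=5.0):
    """One simulated trial: a small negative deflection buried in noise.

    The signal shape and the noise level are invented for illustration only.
    """
    trial = []
    for t in range(n_samples):
        # Small negative-going bump centred on sample 50.
        signal = -2.0 * math.exp(-((t - 50) ** 2) / (2 * 10 ** 2))
        trial.append(signal + rng.gauss(0.0, noise_sd))
    return trial


def average_trials(trials):
    """Point-by-point mean across trials, all time-locked to stimulus onset."""
    n = len(trials)
    return [sum(samples) / n for samples in zip(*trials)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = [simulate_trial(100, rng) for _ in range(400)]
erp = average_trials(trials)

# In a single trial the bump is hidden by noise; in the average, noise that is
# unrelated to the stimulus tends to cancel out, leaving the event-related part.
print(len(erp))
```

    The design point is that every trial is aligned to the moment the stimulus was presented, so activity that is consistently related to the stimulus survives averaging while unrelated activity shrinks.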

    Several decades of research into Event-Related Potentials have shown that there are characteristic brain responses to, for example, seeing a printed word on a computer screen or hearing a spoken word of a vocal language. Researchers have shown that ERPs are sensitive to, for example, whether a word is expected versus unexpected in a sentence (see e.g., DeLong, Urbach and Kutas 2005), or when it is an ungrammatical continuation of a sentence versus a grammatical continuation (see e.g., Friederici, Hahne and Mecklinger, 1996), etc.

    ERPs are a useful source of information to psycho- and neurolinguists because they record the brain’s activity with high temporal resolution: they record the brain’s responses as they are happening with timing accurate down to the millisecond. However, although differences in ERPs can be found at different electrode locations on the scalp, typical ERP studies cannot tell us very much about where in the brain the critical response is happening. In other words, their spatial resolution is poor.

    Another method that, like EEG, has excellent temporal resolution is magnetoencephalography (MEG). This method examines changes in the magnetic field generated by the brain (which is of course related to the electrical field). The main advantage of MEG over EEG is that MEG allows researchers to draw better conclusions about where in the brain the response of interest originated.

    Previous research in ERPs had shown that there is a measurable brain response to auditory stimuli (sounds) that stand out from the rest. For example, if you have people listen to tones, where most of the tones have an identical frequency but a small proportion have a different frequency, the tones in the minority are associated with a specific brain reaction called the Mismatch Negativity: Mismatch because the tone mismatches what is normally heard, and Negativity because the measured brain reaction is a negative-going wave in the measured electrical signal. The Mismatch Negativity can be measured even if a person is not really paying attention to the sounds, for example if they are watching a silent movie during the experiment (see Näätänen and Kreegipuu, 2012 for a review of findings regarding the Mismatch Negativity).

    Using magnetoencephalography (MEG), Phillips and colleagues (2000) asked whether stimuli that had this many-to-one structure in terms of phonological category, but importantly not in terms of a mere acoustic difference, would elicit the MEG version of the Mismatch Negativity, called the Mismatch Field. The specific contrast they examined was a voicing contrast: whether a sound would be categorized as /dæ/ or /tæ/. The difference between these two comes down to a difference in the time between the release of the stop consonant and the beginning of voicing for the vowel, known as the Voice Onset Time (VOT). Critically, speakers of English perceive a /t/ sound when Voice Onset Time is above about 25 ms, and a /d/ sound when it is below that value; the switch is sharp rather than gradual. But within those categories, the millisecond value of VOT can vary.

    Take a look at this figure. Each dot represents one syllable sound, with the vertical axis representing Voice Onset Time. You can see that there is a variety of Voice Onset Times in the diagram. None of them would stand out in particular if we hadn’t marked the perceptual boundary between /ta/ and /da/ with a dotted line. The green dots represent sounds that would be identified as /ta/ and the blue dots as /da/. This diagram shows that the important many-to-one relationship isn’t there when considering acoustic values only. But from a phonological point of view, there are a lot of sounds in the /d/ category and only a couple in the /t/ category.

    Syllable sounds that vary in Voice Onset Time. The dotted line represents the perceptual boundary between the syllables /da/ and /ta/. Green dots represent syllables likely to be identified as /ta/, and blue dots represent syllables likely to be identified as /da/.

    By presenting their participants with sounds that varied in their millisecond value of VOT, but where only a small subset crossed the boundary to be perceived as /t/, Phillips et al. were able to test whether a mismatch effect would occur at the phonological level. This is because the critical many-to-one relationship that leads to a Mismatch Negativity only existed at the phonological level, not at a purely acoustic level. Phillips et al. found a phonological mismatch response (the Mismatch Field, the MEG counterpart of the Mismatch Negativity), and they further showed that this effect originated in a part of the brain that processes auditory information. The fact that the mismatch response originated in this part of the brain shows that the brain processes phonological contrasts quite ‘early’ in perceptual processing, before other brain areas more typically associated with language processing get involved.
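
    The logic of this design can be made concrete with a short Python sketch. The VOT values below are hypothetical and are not the stimuli used in the study; the only figure taken from the text is the approximate 25 ms boundary. The point is that counting raw acoustic values gives no clear majority, whereas counting categories gives many /d/ tokens and only a few /t/ tokens.

```python
from collections import Counter

VOT_BOUNDARY_MS = 25  # approximate English /d/-/t/ boundary mentioned in the text


def categorize(vot_ms):
    """Assign a syllable to /t/ or /d/ based on its Voice Onset Time."""
    return "t" if vot_ms > VOT_BOUNDARY_MS else "d"


# Hypothetical stimulus sequence (VOT in ms), invented for illustration.
stimuli = [0, 8, 16, 24, 8, 0, 48, 16, 24, 0, 8, 56, 16, 24, 0, 8]

acoustic_counts = Counter(stimuli)
category_counts = Counter(categorize(v) for v in stimuli)

# Acoustically, no single value is a clear 'standard': the most frequent
# value accounts for only 4 of the 16 stimuli.
print(max(acoustic_counts.values()))  # 4

# Phonologically, /d/ is the standard and /t/ is the rare deviant.
print(category_counts["d"], category_counts["t"])  # 14 2
```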


    DeLong, K. A., Urbach, T. P., & Kutas, M. (2005). Probabilistic word pre-activation during language comprehension inferred from electrical brain activity. Nature Neuroscience, 8(8), 1117–1121.

    Friederici, A. D., Hahne, A., & Mecklinger, A. (1996). Temporal structure of syntactic parsing: Early and late event-related brain potential effects. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(5), 1219–1248.

    Näätänen, R., & Kreegipuu, K. (2011). The Mismatch Negativity (MMN). Oxford University Press.

    Phillips, C., Pellathy, T., Marantz, A., Yellin, E., Wexler, K., Poeppel, D., McGinnis, M., & Roberts, T. (2000). Auditory Cortex Accesses Phonological Categories: An MEG Mismatch Study. Journal of Cognitive Neuroscience, 12(6), 1038–1055.

    13.3: Evidence for language-specific phonology

    In the previous section, we examined some evidence that the part of the brain that processes auditory information is sensitive to phonological categories. Critically, in the study by Phillips and colleagues (2000), the participants were English speakers, who have separate /t/ and /d/ phonemes as a part of their mental grammar. We could expect that for a speaker of a language that doesn’t have that distinction, the pattern of brain reactions could be quite different.

    Researchers have studied how a person’s native language could influence their processing of vocal language. For example, Lahiri and Marslen-Wilson (1991) asked whether Bengali speakers and English speakers would process nasal and non-nasal vowels differently. Both English and Bengali have nasal vowels, but the nasal/oral distinction is phonemic (in other words, it can create contrast) only in Bengali.

    For example, the English word ban is typically pronounced with a nasal vowel ([bæ̃n]) because of a phonological process called nasalization. The vowel becomes nasal because of the influence of the upcoming nasal consonant /n/. The vowel in bad is not nasalized because /d/ is an oral consonant. So in English, nasal vowels are predictable based on the phonological environment they are in: before a nasal consonant, the vowel is nasalized, and elsewhere the vowel is oral. Therefore, [æ̃] and [æ] are variants — allophones — of one phoneme.
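
    Because English vowel nasality is predictable, it can be written as a rule that derives the surface form from the phonemic form. The following Python sketch is a simplified illustration: the segment sets and the function name nasalize are hypothetical, and only a handful of sounds are listed.

```python
NASAL_CONSONANTS = {"m", "n", "ŋ"}
VOWELS = {"æ", "ɛ", "ɪ", "i", "u", "ɑ"}  # partial list, for illustration only
TILDE = "\u0303"  # combining tilde, the IPA diacritic for nasalization


def nasalize(phonemes):
    """Apply the allophonic rule: a vowel is nasalized before a nasal consonant."""
    surface = []
    for i, seg in enumerate(phonemes):
        following = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if seg in VOWELS and following in NASAL_CONSONANTS:
            surface.append(seg + TILDE)
        else:
            surface.append(seg)
    return surface


print("".join(nasalize(["b", "æ", "n"])))  # bæ̃n
print("".join(nasalize(["b", "æ", "d"])))  # bæd
```

    Since the rule alone decides where nasal vowels occur, the phonemic forms never need to store nasality on the vowel.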

    Bengali also has a rule nasalizing vowels before nasal consonants, but is different from English in that having a nasal versus an oral vowel is not completely predictable based on the phonological environment. For example, Bengali has the minimal pair /bãd/ (which means ‘dam’) and /bad/ (which means ‘difference’) that differ only in the nasal/oral status of the vowel /a/ and yet have different meanings. This means that in Bengali, /a/ and /ã/ are separate phonemes.

    Lahiri and Marslen-Wilson showed that this difference in the phonemic status of nasal and oral vowels between English and Bengali has an influence on how speakers of these languages recognize spoken words. Before we get to their experiment, let us introduce some background about spoken word recognition and their experimental method, the gating task.

    Spoken words unfold over time. The human mind doesn’t wait for a word to be over before recognizing it, but rather activates potential matches from the very beginning of hearing the word. Upon hearing the first sound of a word, there will be a large number of potential matches. This number will get smaller and smaller as more of the word is heard, because potential candidates will be ruled out. For example, imagine that a listener hears the word report (/ɹipoʊɹt/). The first phoneme, /ɹ/, is compatible with lots of words: report, red, reach, robot, etc. Once /i/ is heard, red and robot are ruled out because they are no longer compatible with the input. The set of all potential matches that overlap with the beginning of a word up to a given point is called an onset cohort. One influential model of spoken word perception, the Cohort Model (see Marslen-Wilson & Tyler, 1980), claims that members of the onset cohort of a word become active during the hearing of the word, but that the activation for a potential match drops off once the evidence is no longer compatible with that word.
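
    The shrinking of the onset cohort can be sketched in Python. This is a toy illustration of the filtering idea only, not an implementation of the Cohort Model: the four-word LEXICON and its rough transcriptions are assumptions made for the example, and activation levels are not modelled.

```python
# Hypothetical mini-lexicon with rough transcriptions, one tuple item per segment.
LEXICON = {
    "report": ("ɹ", "i", "p", "oʊ", "ɹ", "t"),
    "red": ("ɹ", "ɛ", "d"),
    "reach": ("ɹ", "i", "tʃ"),
    "robot": ("ɹ", "oʊ", "b", "ɑ", "t"),
}


def onset_cohort(heard, lexicon):
    """Words whose transcription begins with the segments heard so far."""
    n = len(heard)
    return sorted(w for w, segs in lexicon.items() if segs[:n] == tuple(heard))


target = LEXICON["report"]
for i in range(1, 4):
    print("".join(target[:i]), onset_cohort(target[:i], LEXICON))
# ɹ ['reach', 'red', 'report', 'robot']
# ɹi ['reach', 'report']
# ɹip ['report']
```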

    At a certain point in each spoken word, listeners (on average) will be able to correctly identify what the word will be. This is called the recognition point for that word. One way to determine a word’s recognition point is through an experimental method called a gating task. In the gating task, a recording of a word is presented to experimental participants in progressively bigger fragments. After hearing a fragment of the recording, participants are asked to guess what the word is, perhaps by writing down their guess. As you might imagine, these guesses become more accurate as the fragments become longer. Eventually a particular fragment length will provide enough information to reach a threshold where most people correctly identify the word, so the end of that fragment can be said to be the word’s recognition point.
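
    Finding a recognition point from gating data can be sketched as follows. The responses are invented, and the 80% threshold is an assumption made for this example; the text above only says that "most people" must identify the word, and individual studies may define the criterion differently.

```python
def recognition_point(gates, target, threshold=0.8):
    """Return the end time (ms) of the first gate at which the proportion of
    correct guesses reaches the threshold, or None if it never does."""
    for gate_end_ms, guesses in gates:
        correct = sum(1 for g in guesses if g == target)
        if guesses and correct / len(guesses) >= threshold:
            return gate_end_ms
    return None


# Hypothetical guesses from five participants after each fragment (gate).
gates = [
    (50, ["red", "reach", "rope", "read", "robot"]),
    (100, ["reach", "report", "reap", "report", "read"]),
    (150, ["report", "report", "repeat", "report", "report"]),
    (200, ["report"] * 5),
]

print(recognition_point(gates, "report"))  # 150
```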

    Lahiri and Marslen-Wilson asked whether a listener’s knowledge of the phonology of their native language would influence their ability to recognize words as they unfold. They found that English listeners could identify whether a word was ban or bad before they heard the last consonant, because the nasal or oral quality of the vowel helped them predict what the upcoming consonant would be. Bengali listeners, on the other hand, needed more information before identifying a word with a nasal vowel, leading to a later recognition point for those words. This is presumably because, in Bengali, a word with a nasal vowel could end in a nasal consonant, like /n/, or an oral consonant, like /d/. Bengali speakers do not use the nasal or oral quality of the vowel to predict the upcoming consonant because, in their mental grammars, nasal and oral vowels are separate phonemes.

    Further language-specific phonological knowledge has been found using ERPs and, again, the Mismatch Negativity. Dehaene-Lambertz and colleagues (2000) asked whether sequences of syllables would be processed similarly by speakers of languages with different phonotactic constraints. Remember from Chapter 4, Section 4.2 that languages have restrictions on what syllables they allow. In Japanese, for example, nasals are the only consonants allowed at the end of a syllable; in other words, oral consonants cannot be syllable codas. English and French, however, allow a variety of consonants in coda position. So what happens when a Japanese speaker listens to sequences of syllables that have an illegal coda consonant?

    Following up on earlier work by Dupoux and colleagues (1999), Dehaene-Lambertz et al. presented French native speakers and Japanese native speakers with fake words like igumo and igmo. The first one, igumo, is possible with either Japanese or French phonotactics because it can be split into the syllables i.gu.mo (here I have used ‘.’ to indicate a syllable boundary). The second one, igmo, only fits the phonotactics of French. The sequence /gm/ is not a good syllable onset in either language, so the only potential syllabification is ig.mo. This is a possible word of French but not Japanese, because it has a /g/ as a coda consonant.

    In their experiment, participants listened to sequences of fake words while the electrical signal from the surface of the scalp was recorded (EEG). The participants heard one word several times, which was then followed by either the same word again, a word that differed only in the presence or absence of /u/, or a completely different word, /igimo/. They found that for the cases that differed only in the presence of /u/, French speakers indicated that the last word in the sequence was different from the rest, whereas Japanese speakers largely thought they were the same. The brain’s response echoed these judgments: the French speakers showed a response that can be interpreted as a Mismatch Negativity for the ‘deviant’ items, but the Japanese speakers did not. So why would the Japanese speakers not notice a difference between /igmo/ and /igumo/? One interpretation of this finding is that because /igmo/ doesn’t fit with the phonotactic constraints of their language, Japanese speakers mentally insert a vowel to correct the illegal coda. In other words, Japanese speakers “hear” /igumo/ rather than /igmo/. So our mental grammar can influence the way we perceive speech.
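
    The vowel-insertion interpretation can be sketched as a small repair function in Python. This is a deliberately crude illustration of the idea, not a model of Japanese phonology or of the study's analysis: the segment sets, the choice to treat every nasal as a permitted coda, and the function name japanese_like_percept are all assumptions made for the example.

```python
VOWELS = set("aiueo")
NASALS = set("nm")  # simplification: treated here as the only permitted codas


def japanese_like_percept(segments):
    """Insert /u/ after any oral consonant that is not followed by a vowel."""
    percept = []
    for i, seg in enumerate(segments):
        percept.append(seg)
        following = segments[i + 1] if i + 1 < len(segments) else None
        is_oral_consonant = seg not in VOWELS and seg not in NASALS
        if is_oral_consonant and following not in VOWELS:
            percept.append("u")  # repair the illegal coda
    return "".join(percept)


print(japanese_like_percept("igmo"))   # igumo
print(japanese_like_percept("igumo"))  # igumo
print(japanese_like_percept("igimo"))  # igimo
```

    Under this sketch, igmo and igumo end up as the same percept, while igimo stays distinct, which mirrors the pattern reported for the Japanese listeners.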

    This experiment is part of a body of evidence demonstrating that our knowledge of the phonology of our native language, as a part of mental grammar, has an influence on how our brains process language.


    Dehaene-Lambertz, G., Dupoux, E., & Gout, A. (2000). Electrophysiological Correlates of Phonological Processing: A Cross-Linguistic Study. Journal of Cognitive Neuroscience, 12(4), 635–647.

    Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., & Mehler, J. (1999). Epenthetic vowels in Japanese: A perceptual illusion? Journal of Experimental Psychology: Human Perception and Performance, 25(6), 1568–1578.

    Lahiri, A., & Marslen-Wilson, W. (1991). The mental representation of lexical form: A phonological approach to the recognition lexicon. Cognition, 38(3), 245–294.

    Marslen-Wilson, W., & Tyler, L. K. (1980). The temporal structure of spoken language understanding. Cognition, 8(1), 1–71.

    Phillips, C., Pellathy, T., Marantz, A., Yellin, E., Wexler, K., Poeppel, D., McGinnis, M., & Roberts, T. (2000). Auditory Cortex Accesses Phonological Categories: An MEG Mismatch Study. Journal of Cognitive Neuroscience, 12(6), 1038–1055.

    This page titled 10.5: Evidence for phonemes as mental categories is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Catherine Anderson, Bronwyn Bjorkman, Derek Denis, Julianne Doner, Margaret Grant, Nathan Sanders, and Ai Taniguchi (eCampusOntario) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.