Now let’s quantify these effects and then see whether they’re statistically significant. In this exercise, we’ll take the approach that many studies (including my own) used years ago but that has become less common as researchers figured out an approach that is both simpler and better. We’ll start with the outdated approach because that will allow you to see more clearly (a) why the newer approach is better and (b) how the newer approach provides equivalent information about the key questions.
Let’s start by scoring the LRP amplitude from each participant’s ERP waveforms. There are many different ways to score the amplitude or latency of an ERP component, as described in detail in Chapter 9 of Luck (2014). In most cases, mean amplitude is the best way to quantify ERP amplitudes. In the present exercise, we’ll measure the mean amplitude from 200-250 ms. This just means that the scoring routine sums the voltage values at each time point in this latency range and then divides the sum by the number of time points. It’s that simple! The simplicity of mean amplitude makes it very easy to understand (and even to prove mathematical properties about), and it has some very nice properties that we’ll see as we go through the next few exercises.
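If it helps to see the arithmetic in code, here is a minimal Matlab sketch of what the mean amplitude calculation does. The variable names (waveform, times) are hypothetical placeholders for illustration, not ERPLAB’s actual code:

    % Hypothetical variables: 'waveform' is a vector of voltages for one channel in
    % one bin, and 'times' is the matching vector of sample times in milliseconds.
    winStart = 200;                                      % start of measurement window (ms)
    winEnd   = 250;                                      % end of measurement window (ms)
    inWindow = times >= winStart & times <= winEnd;      % samples that fall in the window
    meanAmp  = sum(waveform(inWindow)) / sum(inWindow);  % sum the voltages, divide by the number of points
    % This is equivalent to: meanAmp = mean(waveform(inWindow));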
One of the most important issues involved in scoring ERPs is the choice of the time window. I chose 200-250 ms for this exercise because this is the approximate time range in which the opposite-polarity effect for incompatible trials is typically seen.
Quit and restart EEGLAB, make sure that Chapter_10 is Matlab’s current folder, and then load all 40 ERPsets from the Chapter_10 > Data > ERPsets_CI folder. Then select EEGLAB > ERPLAB > ERP Measurement Tool and enter the parameters shown in Screenshot 10.1. The left side of the GUI is used to indicate which ERPsets should be scored. We’re going to measure from all 40 ERPsets that you just loaded. The right side of the GUI controls the scoring algorithm. You’ll specify that the basic algorithm is Mean amplitude between two fixed latencies, and you’ll indicate that the starting and stopping latencies are 200 250. This is the measurement window. We’re going to measure from the C3 and C4 channels (12 14) in all four bins (1:4). We’re going to save the scores in a text file named mean_amplitude.txt.
It’s really tempting to hit the RUN button and get the scores, but you should always check the measurements against the ERP waveforms first. You can do this by clicking the Viewer button. The Viewer tool will open, and you’ll see the ERP waveform from the first bin and first channel in the first ERPset. The measurement window is indicated by the yellow region, and the value produced by the scoring algorithm (5.628 µV) is shown in the window at the bottom.
You can then step through the different bins, channels, and ERPsets to verify that the algorithm is working sensibly. You may find it convenient to look at multiple waveforms per screen. In the two cases shown in Screenshot 10.2, for example, I clicked the all box for Bin and Channel to overlay the two bins and the two channels.
Not much can go wrong with the algorithm for measuring mean amplitude, but you may find surprising and problematic scores for some participants when you use other algorithms (e.g., peak amplitude or peak latency). Even with mean amplitude, it’s humbling and informative to see how much the ERP waveforms vary across participants. For example, the participant on the right in Screenshot 10.2 has waveforms that are similar to those in the grand average (Figure 10.2)—with distinct P2, N2, and P3 peaks—and the measurement window runs right through the N2 peak. By contrast, the participant on the left doesn’t have very distinct peaks, and the measurement window is at the time of a positive peak.
This brings up an important point about ERPs (and most other methods used in the mind and brain sciences): Averages are a convenient fiction. The ERP waveforms we get by averaging together multiple single-trial epochs may not be a good representation of what happened on the single trials, and a grand average waveform across participants may not be a good representation of the individual participants. However, it is difficult to avoid averaging (or methods that are generalizations of the same underlying idea, such as regression). Chapters 2 and 8 in Luck (2014) discuss this issue in more detail.
Once you’ve finished scanning through all the ERP waveforms using the Viewer, click the Measurement Tool button to go back to the Measurement Tool, and then click RUN to get the scores. Assuming that Chapter_10 is still the current folder in Matlab, a file named mean_amplitude.txt should now be present in the Chapter_10 folder. Double-click on this file in Matlab’s Current Folder pane to open it in the Matlab text editor. You’ll see that it consists of a set of tab-separated columns. Matlab’s text editor doesn’t handle the tabs very well, so the column headings may not line up properly. I recommend opening it instead in a spreadsheet program like Excel. Here’s what the first few lines should look like:
bin1_C3    bin1_C4    bin2_C3    bin2_C4    bin3_C3    bin3_C4    bin4_C3    bin4_C4    ERPset
5.628      4.124      6.818      5.741      6.623      5.66       7.247      6.607      1_LRP
6.902      7.534      4.72       8.7        3.629      6.546      6.178      5.47       2_LRP
3.149      5.122      1.19       1.962      4.361      4.309      4.441      4.638      3_LRP
Each row contains the data from one participant, and each column holds the score (mean amplitude value) for a bin/channel combination for that participant. The Measurement Tool can also output the measurements in a “long” format in which each score is on a separate line. This long format is particularly good for using pivot tables to summarize the data in Excel, and it works well with some statistical packages. The “wide” format shown in the table above is ideal for statistical packages in which all the data for a given participant are expected to be in a single row (e.g., SPSS, JASP).
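If you’d rather work with the scores in Matlab than in Excel, here is a hedged sketch of how you might read the wide-format file and, if needed, reshape it to the long format. It uses only base Matlab table functions, and the file name is the one from this exercise:

    % Read the tab-separated scores: one row per participant, one column per
    % bin/channel combination, plus the ERPset name in the last column.
    scores = readtable('mean_amplitude.txt', 'Delimiter', '\t', 'FileType', 'text');
    disp(scores(1:3, :))

    % Reshape to long format: one row per participant x bin/channel score.
    valueCols  = scores.Properties.VariableNames(startsWith(scores.Properties.VariableNames, 'bin'));
    longScores = stack(scores, valueCols, ...
        'NewDataVariableName', 'MeanAmplitude', ...
        'IndexVariableName',   'BinChannel');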
Now that we have the scores, let’s do a statistical analysis using a traditional ANOVA. You can use any statistical package you like. As I mentioned earlier, I recommend JASP if you don’t already have a package that can do basic t tests and within-subjects ANOVAs.
The ANOVA should have three within-subjects factors, each with two levels: Electrode Hemisphere (left or right), Response Hand (left or right), and Compatibility (Compatible or Incompatible). When you load the data into your statistical software and specify the variables, it’s really easy to get the columns in the wrong order. Your first step in the statistical analysis should therefore be to examine the table or plot of the descriptive statistics provided by your statistical software so that you can make sure that the data were organized correctly. Figure 10.3 shows what I obtained in JASP.
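If you prefer scripting to a point-and-click package, here is a hedged sketch of the same 2 x 2 x 2 within-subjects ANOVA using fitrm and ranova from Matlab’s Statistics and Machine Learning Toolbox. The mapping of bins to conditions in the within-subjects design table is my assumption for illustration; check it against your own bin descriptors before trusting the factor labels:

    % Read the single-subject scores (one row per participant).
    scores = readtable('mean_amplitude.txt', 'Delimiter', '\t', 'FileType', 'text');

    % One row per measured variable (bin1_C3 ... bin4_C4), giving its level on each
    % within-subjects factor. Assumed mapping: bins 1-2 = Compatible (left hand,
    % right hand), bins 3-4 = Incompatible (left hand, right hand); C3 = left
    % hemisphere, C4 = right hemisphere.
    within = table( ...
        categorical({'Comp';'Comp';'Comp';'Comp';'Incomp';'Incomp';'Incomp';'Incomp'}), ...
        categorical({'Left';'Left';'Right';'Right';'Left';'Left';'Right';'Right'}), ...
        categorical({'LeftHem';'RightHem';'LeftHem';'RightHem';'LeftHem';'RightHem';'LeftHem';'RightHem'}), ...
        'VariableNames', {'Compatibility', 'Hand', 'Hemisphere'});

    % Fit the repeated-measures model and get the within-subjects ANOVA table.
    rm      = fitrm(scores, 'bin1_C3-bin4_C4 ~ 1', 'WithinDesign', within);
    results = ranova(rm, 'WithinModel', 'Compatibility*Hand*Hemisphere');
    disp(results)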
But how do you know what the correct values should be? It turns out that with mean amplitude scores (but not most other scores), you get the same result by averaging the single-subject scores and by obtaining the scores from the grand average waveforms (see the Appendix in Luck, 2014 for details). Load the grand average you created earlier (grand.erp) and run the Measurement Tool again, this time specifying that it should measure only from this ERPset and save the results in a file named mean_amplitude_grand.txt. You can then compare those numbers to the values in the table or figure of descriptive statistics. Here are the values I obtained:
bin1_C3    bin1_C4    bin2_C3    bin2_C4    bin3_C3    bin3_C4    bin4_C3    bin4_C4    ERPset
2.143      1.04       1.348      2.396      1.379      1.649      2.002      1.159      grand
These values exactly match the means shown in Figure 10.3. Success! Note that if you use some other scoring algorithm (e.g., peak amplitude) in your own studies, the values won’t match exactly. However, you can at least make sure that the pattern is the same.
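If you’d like to automate this check, here is a quick Matlab sketch (using the two file names from this exercise). The column means of the single-subject scores should match the scores measured from the grand average, apart from rounding:

    % Read both sets of scores.
    subjScores  = readtable('mean_amplitude.txt',       'Delimiter', '\t', 'FileType', 'text');
    grandScores = readtable('mean_amplitude_grand.txt', 'Delimiter', '\t', 'FileType', 'text');

    % Average the single-subject scores in each bin/channel column and compare
    % them to the scores measured from the grand average waveforms.
    binCols        = subjScores.Properties.VariableNames(startsWith(subjScores.Properties.VariableNames, 'bin'));
    meanOfSubjects = mean(subjScores{:, binCols});   % 1 x 8 vector of column means
    fromGrand      = grandScores{1, binCols};        % 1 x 8 vector from the grand average
    disp([meanOfSubjects; fromGrand])                % the two rows should be (nearly) identical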
This verification process is very, very, very important! I estimate that you will find an error at least 10% of the time if you have three or more factors in your design.
Before we look at the inferential statistics, let’s think about what main effects and interactions we would expect to see. First consider the Compatible condition, in which the voltage should be more negative for the contralateral hemisphere than for the ipsilateral hemisphere. This gives us a more negative voltage for left-hand responses than for right-hand responses over the right hemisphere, and the reverse pattern over the left hemisphere. In other words, the presence of the LRP is captured in the ANOVA as an interaction between Hemisphere and Hand. During this 200-250 ms time period, we expect to see an opposite effect for the Incompatible trials (because the voltage is more negative over the hemisphere contralateral to the incorrect response, which makes it more positive contralateral to the correct response). Consequently, the difference between the Compatible and Incompatible trials should lead to a three-way interaction between Compatibility, Hemisphere, and Hand.
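To make this concrete, here is an illustrative Matlab sketch that computes the C3-minus-C4 difference for each bin from the grand-average scores, using the same assumed bin-to-condition mapping as in the ANOVA sketch above (again, an assumption, so check your bin descriptors). If the logic above is right, the sign of the difference should flip between left-hand and right-hand bins, and that pattern should itself reverse for the Incompatible bins:

    % Read the grand-average scores measured earlier.
    grandScores = readtable('mean_amplitude_grand.txt', 'Delimiter', '\t', 'FileType', 'text');

    % C3 minus C4 for each bin (left hemisphere minus right hemisphere).
    c3MinusC4 = [grandScores.bin1_C3 - grandScores.bin1_C4, ...
                 grandScores.bin2_C3 - grandScores.bin2_C4, ...
                 grandScores.bin3_C3 - grandScores.bin3_C4, ...
                 grandScores.bin4_C3 - grandScores.bin4_C4];
    disp(c3MinusC4)   % expect opposite signs for left- vs. right-hand bins,
                      % with the whole pattern reversed for the Incompatible bins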
Table 10.1 shows the inferential statistics I obtained from JASP. You can see that the main effects of Hand and Hemisphere are not significant, consistent with the fact that Figure 10.3 shows little or no overall difference between left-hand and right-hand responses or between the left and right hemispheres. The main effect of Compatibility is also not significant, consistent with the fact that the average voltage across cells for the Compatible condition was about the same as the average voltage across cells for the Incompatible condition.
By contrast, the interaction between Hemisphere and Hand was significant. This interaction is equivalent to asking about the contralaterality of the voltage if we averaged across Compatible and Incompatible trials. These two conditions yielded opposite-direction effects that partially cancel each other out. However, the contralateral negativity for the Compatible trials was larger than the contralateral positivity for the Incompatible trials, and this gives us an overall significant interaction. But this interaction is meaningless at best and misleading at worst, because the patterns were opposite for the Compatible and Incompatible trials, as indicated by the significant three-way interaction between Hemisphere, Hand, and Compatibility. This kind of complication is one of the reasons why many researchers have stopped using this approach and have shifted to the simpler approach described in the next exercise.
The next step in our statistical analysis would be to perform specific contrasts so that we can see, for example, if the Hemisphere x Hand interaction is significant when the Compatible and Incompatible trials are analyzed separately. However, we’re not going to take that next step, because this way of analyzing the data is less than ideal. First, the size of the LRP is captured by the Hemisphere x Hand interaction rather than a main effect, which makes things difficult to understand. Second, this approach generates a lot of p values, which means that the probability that we obtain one or more bogus-but-significant effects is quite high. If you run a three-way ANOVA, you get 7 p values (as shown in Table 10.1), and you’ll have about a 30% chance of getting at least one bogus-but-significant effect (if the null hypothesis is actually true for all 7 effects). So, it’s important to minimize the number of factors in your analyses (see Luck & Gaspelin, 2017 for a detailed discussion of this issue). The next exercise will show you a better approach.
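In case you’re wondering where the “about a 30% chance” figure comes from, here is the quick calculation, treating the 7 tests as independent (an approximation):

    alpha  = 0.05;                               % per-test false-positive rate
    nTests = 7;                                  % p values produced by a three-way ANOVA
    familywiseError = 1 - (1 - alpha)^nTests     % probability of at least one bogus-but-significant
                                                 % effect when all 7 null hypotheses are true (~0.30)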
Bogus Effects
When an effect in the data is just a result of random variation and does not reflect a true effect in the population, I like to refer to that effect as bogus. And if the effect is statistically significant, I refer to it as a bogus-but-significant effect. The technical term for this is a Type I error. But that’s a dry, abstract, and hard-to-remember way of describing an incorrect conclusion that might be permanently etched into the scientific literature.