13.6: A2.6 Avoiding Bugs in Your Scripts with Good Programming Practices

The first step in debugging a script is to get in a time machine, go back to the moment when you started writing the script, and tell yourself that you’re now wasting a huge amount of time trying to debug the script. “Please,” you should tell your earlier self, “follow good programming practices while writing this script so that I won’t need to waste so much time. I’m even busier now than I was when I first wrote the script.”

If you don’t have a time machine, you should resolve to follow good programming practices from now on. It will take a little more time now, but you will be giving a great gift to your future self. As Ben Franklin famously said, “An ounce of prevention is worth a pound of cure.”

This section contains a set of good programming practices that I find to be particularly relevant for scientists who are analyzing EEG/ERP data in Matlab and are relatively new to coding.

Rapid cycling between coding and testing

Perhaps the most common mistake that novice coders make is trying to write an entire script without doing any testing along the way. If you write a 30-line script, you will probably have 8 different errors in the script, and it will be really hard to figure out what’s going wrong.

As I mentioned before, the best approach is to write a small amount of code, test it, debug it if necessary, and then add more code. When you’re new to programming, this might be only 1-3 lines of code at a time. As you gain experience, you can write more lines before testing, but even an experienced programmer usually does some testing after every 20-40 new lines of code.
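
As a minimal sketch of this write-a-little, test-a-little cycle, the fragment below adds one small step and checks it before adding the next. The particular checks and the placeholder loop body are illustrative assumptions, not part of the actual N170 scripts.

```matlab
% Step 1: write one small piece of code...
SUB = [ 1 2 3 4 6 7 8 9 10 ]; % Array of subject IDs, excluding subject 5
num_subjects = length(SUB);   % Number of subjects

% ...and test it right away, before adding anything else
assert(num_subjects == 9, 'Expected 9 subjects');
assert(~ismember(5, SUB), 'Subject 5 should be excluded');

% Step 2: add the next small piece (here, just the skeleton of the loop)
for subject_num = 1:num_subjects
    subject = SUB(subject_num);
    fprintf('Would process subject %d\n', subject); % Placeholder for real processing
end
```

Once the skeleton prints the subject IDs you expect, you can replace the placeholder line with a small amount of real processing code and test again.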

Define all values as variables at the top of the script

If you’ve already read the chapter on scripting, you’ll know that this is my #1 principle of writing good code. For example, in the N170 experiment that is the focus of the chapter, we analyzed the data from subjects 1-10, leaving out subject 5. When we looped through the subjects, we needed a line of code like this:

for subject = [ 1 2 3 4 6 7 8 9 10 ] % Note that 5 is missing from this list

Imagine that we need to loop through the subjects in three different parts of the script (e.g., once for pre-ICA EEG processing, once for post-ICA EEG processing, and once for ERP processing). We could just repeat that same loop in each of these three different parts of the script. But now imagine that, a year after we’ve analyzed the data, we get reviews back from a journal and a reviewer wants us to reanalyze the data without excluding subject 5. Now we need to find all the parts of the script with this loop and modify them. Will we remember that we had three loops? There’s a good chance that we will have forgotten and won’t find all three of them. As a result, we will have a bug. And we will either end up with the wrong result or waste hours of time trying to find the problem.

To avoid this problem, you should always, always, ALWAYS use a variable at the top of the script to define a list like this. Here’s an example:

% This line is in the top section of the script
SUB = [ 1 2 3 4 6 7 8 9 10 ]; % Array of subject IDs, excluding subject 5

% This is how we use the list later in the script
for subject = SUB

The same principle applies to individual numbers (e.g., the number of subjects) and strings (e.g., a filename).
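
Here is a rough sketch of how a string might be handled in the same way. The variable name task_suffix is made up for this illustration, and the naming pattern (subject ID followed by _N170.set) is an assumption based on the single filename 1_N170.set; your own files may be named differently.

```matlab
% Values defined in the top section of the script
SUB = [ 1 2 3 4 6 7 8 9 10 ]; % Array of subject IDs, excluding subject 5
task_suffix = '_N170.set';    % Assumed ending shared by all dataset filenames

% Later in the script: build each filename from the variables above
for subject = SUB
    setname = [ num2str(subject) task_suffix ]; % e.g., '1_N170.set' for subject 1
    disp(setname); % Placeholder for code that would load this dataset
end
```

If the file naming scheme changes, only the one line at the top needs to be edited.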

Even if you understand and appreciate this advice, it’s easy to ignore it by saying to yourself, “This script is just a few lines. I don’t need to worry about putting the values into variables at the top.” That is shortsighted, because most long scripts start as short scripts. So, at the risk of repeating myself, you should always, always, ALWAYS use a variable at the top of the script to define values.

Note that zeros and ones can be an exception to this rule when they are being used more conceptually. For example, zero and one are sometimes used to mean FALSE and TRUE. Or you might use a one as the starting point of a loop, like this:

% This line is in the top section of the script
SUB = [ 1 2 3 4 6 7 8 9 10 ]; % Array of subject IDs, excluding subject 5
num_subjects = length(SUB); % Number of subjects

% This is how we use the list later in the script
for subject_num = 1:num_subjects
    subject = SUB(subject_num);
    % More code here to process the data from this subject
end

Make your code readable

The reality of science is that you will often start a script, come back to it a few weeks later to finish it, but then modify it 18 months later (after you get the reviews for a manuscript). And someone else may get a copy of your script and modify it for their own studies. If the code isn’t easily readable, bugs are likely to be introduced at these times. Here are a few simple things you can do to make your code more readable:

Include lots of internal documentation in your scripts. It’s a great gift to your future self.
Define all values as variables at the top, as noted before, but also make sure there is a comment indicating the purpose of each variable.
Divide your code into small, modular sections (or separate functions), with a comment at the beginning of each section or function that explains what that section or function does.
Use intrinsically meaningful variable names (e.g., num_subjects instead of ns) and function names (e.g., ploterps instead of npbd). I wasted a couple hours one night in my first year of graduate school because someone had used the name npbd for a function that plotted ERP waveforms, and I’m still bitter…

You can find more discussion of the importance of readability in the scripting chapter.
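
To make these suggestions concrete, here is a toy sketch of what a readable script might look like. The data are made-up placeholder numbers (each subject's "data" is just that subject's ID repeated), and the names num_values, placeholder_data, and subject_means are invented for this illustration.

```matlab
% Purpose: Toy example that computes one mean value per subject
% Note: The data are placeholder numbers, not real ERP measurements

% --- Values defined at the top, each with a comment ---
SUB = [ 1 2 3 4 6 7 8 9 10 ]; % Array of subject IDs, excluding subject 5
num_subjects = length(SUB);   % Number of subjects
num_values = 4;               % Number of placeholder values per subject

% --- Section 1: Create placeholder data (one row per subject) ---
placeholder_data = repmat(SUB', 1, num_values);

% --- Section 2: Compute the mean of each subject's values ---
subject_means = mean(placeholder_data, 2); % Column vector, one mean per subject
```

Each section starts with a comment saying what it does, and each variable name says what the variable holds.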

Make your code modular

If you have a single script that is more than 200 lines long, it should probably be broken into a sequence of multiple scripts. It’s a lot harder to find problems in a long script than in a short script. And it’s a lot easier to introduce problems into a long script (e.g., by adding code to the wrong section). For example, the EEG/ERP processing pipeline in Chapter 10 consists of a series of 7 scripts. In the ERP CORE experiments, we had about 20 different scripts for each individual experiment.

Make your code portable by using relative paths

Almost all EEG/ERP processing scripts need to access files via a path. The worst way to handle this is something like this (for loading a dataset):

EEG = pop_loadset('filename', '/Users/luck/ERP_Analysis_Book/Appendix_2_Troubleshooting/Exercises/1_N170.set');

This violates the principle of defining all values at the top of the script. A better, but still problematic, approach is this:

% Variables defined at the top of the script
Data_DIR = '/Users/luck/ERP_Analysis_Book/Appendix_2_Troubleshooting/Exercises/';
setname = '1_N170.set';

% Loading the data later in the script
EEG = pop_loadset('filename', setname, 'filepath', Data_DIR);

The problem with this approach is that it will break if you switch to a different computer, move your data to a different location, or share your script with someone else.

A better approach is to use Matlab's current folder as the path (assuming that the script is kept with the data and that you run the script from that folder):

Data_DIR = pwd; %Current folder (where the script should be located)

Chapter 10 describes this in much more detail.
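
As a hedged alternative sketch, Matlab's built-in mfilename function can report where the script file itself is stored, so the path does not depend on which folder happens to be current. This assumes the code is run from a saved script file; mfilename('fullpath') can come back empty when the code is run from the command line, so the sketch falls back to pwd in that case. The variable names script_path and full_name are invented for this illustration.

```matlab
% Folder that contains this script file (empty if not run from a saved file)
script_path = mfilename('fullpath');
if isempty(script_path)
    Data_DIR = pwd;                    % Fall back to the current folder
else
    Data_DIR = fileparts(script_path); % Keep only the folder part of the path
end

setname = '1_N170.set';
full_name = fullfile(Data_DIR, setname); % fullfile adds the right separator for the OS
disp(full_name);

% Loading the data later in the script (requires EEGLAB, so commented out here)
% EEG = pop_loadset('filename', setname, 'filepath', Data_DIR);
```

Building paths with fullfile rather than typing slashes by hand also helps when a script moves between Windows and Mac or Linux computers.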

Code review

It is becoming very common (and sometimes required) for researchers to post their data analysis code online along with their data when publishing a paper. That way, other researchers can verify that they get the same results with your data and can use your code in their own studies. I think this is a wonderful trend.

When you realize that other people will be looking at and running your code, this tends to increase the pressure to make sure that the code actually works correctly. In theory, you should already be highly motivated to make sure that your code works, because your findings depend on code that works correctly. But public scrutiny is often an even stronger motivator.

A really good way to make sure that your code works correctly is to use code review. This is just a fancy term for having someone else go through your code to make sure it’s correct. Of course, code review is a lot easier and more effective if you’ve made your code readable and portable. The person who reviews your code will also likely have suggestions for making your code even more readable and will provide a good test of whether your code is portable (i.e., whether it works on the reviewer’s computer).