Chapter 10. Scatterplots and Correlation
The last two chapters dealt with relationships between categorical variables. But what about when you have ratio-level variables? The next six chapters all deal with ratio-level or ratio-like dependent variables. Chapter 10 focuses on bivariate (two-variable) relationships and covers scatterplots and correlation, as well as introducing regression lines.
Scatterplots
Have you heard the notion of "health = wealth"?
Below we will compare U.S. states' estimated poverty rates and life expectancies to evaluate if they have a relationship, and if so, what the relationship is.
As of 2022, 37.9 million people in the United States were officially in poverty, more than one in ten (11.5%) U.S.-Americans. This is based on the government's official poverty measure, whose 2022 thresholds ranged from under $14,036 for a single-person household with an adult over the age of 65, to under $20,172 for a two-person household with both people under the age of 65, to under $35,801 for a five-person household with two related children under 17 years old. Many people with family incomes above the poverty thresholds constructed from the government's official poverty measure still struggle.
Someone born in 2020 in the United States is predicted to live 77.0 years, assuming continued patterns of mortality at that time.
Here is a scatterplot of life expectancy and official poverty rate by U.S. state (as well as Washington, D.C.):
A scatterplot is a graph that shows pairs of quantitative data using a rectangular coordinate system.
Check out this annotated version of the scatterplot:
The independent variable, x, goes on the horizontal axis, also called the x-axis.
(An axis is a fixed reference line, in this case the number line for the independent variable. The plural of axis is axes, with a long e like in cheese.)
- In this example, the independent variable is official poverty rate. You can see a number line for poverty rate on the horizontal axis.
The dependent variable, y, goes on the vertical axis, also called the y-axis.
- In this example, the dependent variable is life expectancy. You can see a number line for life expectancy on the vertical axis.
Each case has a point (marked with a dot) at a particular coordinate (x, y), where x is the value of the independent variable for that case and y is the value of the dependent variable for that case.
- In this example, there are 51 cases: each U.S. state and Washington, D.C. Each case has both a particular poverty rate and a life expectancy. The scatterplot shows their intersection.
- Take a look at the purple dotted lines. This is the state with the lowest life expectancy: Mississippi. Mississippi has a poverty rate of 17.8% and a life expectancy of 71.9 years. Notice how the point is located at those (x, y) coordinates: (17.8%, 71.9 years).
- Take a look at the red dotted lines. This is the state with the highest life expectancy: Hawaii. Hawaii has a poverty rate of 10.2% and a life expectancy of 80.8 years.
Once these points are all mapped out, we can begin to analyze the graph for patterns/relationships.
The first step is to determine whether a relationship exists.
If there is a relationship, the two variables should co-vary. Changes to the independent variable would be correlated with patterned changes in the dependent variable. In this case, the independent variable is poverty rates. As we observe different poverty rates for different states, does our dependent variable, life expectancy, change in any patterned way?
If there were no relationship, then as the poverty rate increased, there would be no change to life expectancy.
Here is a scatterplot where every state is put at the U.S. mean life expectancy of 77.0 years old.
As you go from left to right, as poverty rates increase, there is no change to life expectancy. It's a flat horizontal line. This indicates no relationship, as changes in the independent variable are not associated with changes in the dependent variable.
As you go from left to right, if there is a clear general patterned trend to the data in terms of changes to the dependent variable values, there is a relationship.
For the scatterplot of poverty rate and life expectancy by state, we can see that as we go from left to right (as the independent variable values increase, as the poverty rate increases, as we go from states with lower to states with higher poverty rates), there is a general trend where the points go down (the dependent variable values decrease, life expectancy decreases, states tend to have lower life expectancies).
Now that we have determined there is a relationship, we have four other questions to ask:
1. Is it a linear relationship?
A linear relationship is a relationship between two variables that takes the form of a line when graphed.
(Notice the word line in linear.)
Not all relationships are linear.
For example, check out the graph below of average body temperature throughout the day. It follows a pattern, but it's not a linear pattern.
As the x-axis values increase (as you go from left to right), if the y-axis values consistently go in one general direction (either up or down), and do at a generally consistent rate, the relationship is linear.
Our data has a linear relationship. You can see how the data generally converges onto the line drawn in the graph above.
It's not a perfect relationship --- meaning not all points are on the line. But the general pattern is linear.
The line drawn on the scatterplot above is a "line of best fit," also called a "regression line" and "least squares line."
The process of fitting a line to a set of data is called linear regression, "Ordinary Least Squares regression," or "OLS regression." The next five chapters all focus on linear regression. The line that is the "best fit" is the one that minimizes how far each case's dependent-variable value is from the line, calculated as the sum of the squared differences between the data points and the line.
The regression line is the line that, when drawn, is closest to the observed data. This is measured by minimizing the sum of the squared distance of each point's y-coordinate from a potential line's corresponding y-coordinate.
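In symbols (a standard formulation, not specific to this chapter's data): if each case has coordinates \((x_i, y_i)\) and a candidate line makes the prediction \(\hat{y}_i = a + b x_i\), the least squares line is the one whose intercept \(a\) and slope \(b\) minimize the sum of squared residuals:
\[ \min_{a,\,b} \sum_{i=1}^{n} \left( y_i - (a + b x_i) \right)^2 \]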
In the scatterplot above, not all points are on the line. Unless there is a perfect relationship, no single straight line can be drawn that would have all points on it. Some points are closer to the line and some points are further away.
Moving left to right, let's take the first five points.
- Utah has an official poverty rate of 7.1% and a life expectancy of 78.6 years. The regression line predicts that a state with a poverty rate of 7.1% would have a life expectancy of 78.8 years, 0.2 years off.
- New Hampshire has an official poverty rate of 7.1% and a life expectancy of 79.0 years. The regression line predicts that a state with a poverty rate of 7.1% would have a life expectancy of 78.8 years, 0.2 years off.
- Minnesota has an official poverty rate of 7.7% and a life expectancy of 79.1 years. The regression line predicts that a state with a poverty rate of 7.7% would have a life expectancy of 78.5 years, 0.6 years off.
- Wisconsin has an official poverty rate of 8.0% and a life expectancy of 77.7 years. The regression line predicts that a state with a poverty rate of 8.0% would have a life expectancy of 78.3 years, 0.6 years off.
- Nebraska has an official poverty rate of 8.1% and a life expectancy of 77.7 years. The regression line predicts that a state with a poverty rate of 8.1% would have a life expectancy of 78.3 years, 0.6 years off. (See the worked check below.)
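To make each "years off" figure concrete: it is a residual, the observed value minus the value the line predicts. For Minnesota, for example:
\[ \text{residual} = y - \hat{y} = 79.1 - 78.5 = 0.6 \text{ years} \]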
Many lines could be drawn on top of a scatterplot:
However, if you look at these lines, some are a better fit for the data than others. If you consider how far each point is from a line, square those differences, and add them up, the lines that are worse fits will have greater sums, and the lines that are better fits will have smaller sums. SPSS finds the line that minimizes the sum of these squared deviations:
2. If it is linear, what is its direction? Is it negative or positive?
As you move from left to right, there will be a generally upward or downward movement. Upward movement reflects a positive relationship and downward movement reflects a negative relationship.
In a positive relationship: as x (independent variable) increases, y (dependent variable) increases.
In a negative relationship: as x (independent variable) increases, y (dependent variable) decreases.
Our variables have a negative relationship. As you move from left to right, the data generally goes down. As x increases, y decreases.
Interpreting direction in real-life context:
What does it mean to say: as x increases, y decreases? Here x is the official poverty rate and y is life expectancy. So we're saying, among U.S. states (and Washington D.C.), as poverty rate increases, life expectancy decreases. Higher poverty rates are associated with lower life expectancy. States with higher poverty rates tend to have lower life expectancies.
Template:
Among [unit of analysis/target population], as [IV] increases, [DV] increases/decreases
Example:
Among U.S. states, as poverty rates increase, life expectancy decreases.
Be careful not to use causal language in describing relationships. We're just describing the pattern or co-variance of the two variables. The pattern just tells us that there is a relationship, not that the IV causes the DV. More poverty may decrease how long people live (and conversely less poverty may increase how long people live), but the scatterplot is just showing us if and how the two variables co-vary, not telling us that the IV causes change in the DV. (Remember that causality requires a significant and substantive relationship that makes theoretical sense, has the correct causal order, and is nonspurious).
3. If it is linear, how substantive is the relationship?
This is asking about how steep the line is. What is the pattern? Does a 5% increase in a state's poverty rate seem to be associated with a drop of 1 year in average life expectancy? Of 10 years? Or do you need a 10% or 20% increase in the poverty rate before mean life expectancy changes at all? To evaluate how substantive a linear relationship is, we'll look at its slope (rise over run, the change in y associated with a one-unit change in x). In this case, the slope of the line drawn above is -0.54, which means that the general pattern is that for every 1% increase in a state's poverty rate, there is about a half-year decrease in the state's life expectancy. So, for example, a state with a 5% higher poverty rate (1 in 20 more people in poverty) than another would be predicted to have a life expectancy about 2.7 years lower. Of the 13 states with a life expectancy over 78 years old (78.2 to 80.7), 9 have poverty rates below 9% and all have poverty rates below 12%. Of the 10 states with life expectancies below 75 years old (71.9 to 74.8), 8 have poverty rates above 15% and all have poverty rates above 11%.
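In equation form, the predicted difference in the dependent variable is just the slope times the difference in the independent variable:
\[ \Delta \hat{y} = b \,\Delta x = (-0.54)(5) = -2.7 \text{ years} \]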
You need to be careful analyzing how substantive a relationship is when you look at scatterplots, because like other graphs placed on number lines, the scale may make the relationship seem more or less steep. Check out the two scatterplots below. Notice how one makes it seem like the relationship is a lot more substantive than the other.
These scatterplots show the same data. The only difference is the y-axes: the scatterplot on the left shows life expectancy ranging from 71 to 81 years old and the scatterplot on the right shows life expectancy ranging from 60 to 90 years old.
We'll look more closely at the slope and substantiveness in Chapter 11.
4. If it is linear, how strong is the relationship?
For linear relationships, the term strength has a particular statistical meaning.
The strength of a linear relationship is the extent to which the points converge into a common pattern, into a particular linear relationship.
If they were scattered all over with no general trend, we would have no relationship and therefore no strength.
Here is an example of a scatterplot with no relationship:
If a relationship were perfect, then all points would fall on the line. For example, suppose a high school required all students to take seven classes each year. Here would be a graph of how many years a student has been at the high school and how many classes they have taken.
You can imagine drawing a line through the five points. Every point would fall on it. It is a perfect relationship because the points converge into a common linear pattern perfectly.
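A minimal sketch of that hypothetical school's data, with x as years at the school and y as classes taken: every point falls exactly on the line \(y = 7x\).
\[ (1, 7),\ (2, 14),\ (3, 21),\ (4, 28),\ (5, 35) \quad \Rightarrow \quad y = 7x, \quad r = 1 \]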
Strength can be characterized as no relationship, as a perfect relationship, and in many ways in between the two.
The figure below shows six scatterplots. The ones on the left have strong relationships, the ones in the middle have moderate relationships, and the ones to the right have weak relationships. Take note of how the data converges closer to the line of best fit or further away from it. Consider how you could make more accurate predictions with the ones on the left, and as you move to the scatterplots to the right, as your strength decreases, how your ability to make accurate predictions also decreases.
Let's return to our example. Take a look at the line and the points. Does this seem like no relationship, a very weak relationship, a moderate to strong relationship, or a very strong relationship?
Here, there is a general pattern, but there is also variation away from the line. There are a few states on the line, but most are not. We cannot simply use poverty rate to make accurate predictions of life expectancy. But how far away are states from the line? Are most pretty close? Pretty far away? How well could we use a state's official poverty rate to predict the state's life expectancy? We would definitely make better predictions than if we only knew that the overall U.S. life expectancy was 77 years old, but most of the time our predictions would not be accurate. However, while in some cases (like with Hawaii) we would have worse predictions, it looks like most of our predictions of life expectancy would be less than one year off.
Rather than just form impressions of the strength of a bivariate linear relationship, we can quantify it. This is done using a scale from -1 for a perfect negative relationship to 0 for no relationship to 1 for a perfect positive relationship. The measure is called correlation.
Correlation measures the strength and direction of linear relationships.
The measure can be referred to as Pearson's correlation coefficient, Pearson's r, r, correlation, or by similar terms.
The correlation between two variables will be the same regardless of the direction of variables. It doesn't matter which is the independent variable and which is the dependent variable.
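For reference, here is the standard formula for Pearson's correlation coefficient (the chapter otherwise leaves the computation to software). With means \(\bar{x}\) and \(\bar{y}\):
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\ \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
Notice that the formula is symmetric in x and y, which is why swapping the independent and dependent variables leaves r unchanged.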
Direction:
If r is negative (less than 0), this indicates a negative relationship.
If r is positive (more than 0), this indicates a positive relationship.
Strength:
The measure of strength, of how well the data converges into a common linear pattern, ranges from 0 (no relationship) to an absolute value of 1 (perfect relationship). For example, relationships with correlation coefficients of -0.5 and +0.5 have the same strength, with the former relationship being negative and the latter positive.
Here is a table with a guide to interpreting relationship strength given a correlation coefficient:
(For direction, the interpretation is: "Among [unit of analysis/target population], as [IV] increases, [DV] increases/decreases.")
| Correlation Coefficient | Interpretation (of strength) | Story (Real-life context) |
|---|---|---|
| 0.00 | No relationship. There is no (linear) relationship between [IV] and [DV]. | There is no relationship between [IV] and [DV]. There is no tendency among [target population] for [DV] to either increase or decrease based on [IV]. (Knowing [IV] does not improve our predictions of [DV] over just knowing the overall average [DV] among all [unit of analysis].) |
| ±0.01 to ±0.09 | Negligible, very weak. There is close to no (linear) relationship between [IV] and [DV]. A line of best fit would be close to a horizontal line, but would not be completely flat, as some co-variance or pattern can be detected (within the sample, if sample data). | There is barely a relationship between [IV] and [DV]. There is a very weak, though still faintly discernible, trend among [target population] that [describe direction of relationship]. If we try to make predictions of [dependent variable] for [unit of analysis] in the sample [delete "in the sample" if it's population data] based on knowing [independent variable], our predictions are likely to be very inaccurate. |
| ±0.10 to ±0.29 | Weak. There is a weak (linear) relationship between [IV] and [DV], which means the data does not converge well onto the line of best fit, though there is still a definite observable pattern. | There is a weak relationship between [IV] and [DV]. While there is an overall observable trend among [target population] that [describe direction of relationship], there is substantial variability. We can only make weak predictions of [dependent variable] for [unit of analysis] in the sample [delete "in the sample" if it's population data] based on knowing [independent variable], though our predictions will overall be more accurate than if we ignored [IV]. |
| ±0.30 to ±0.39 | Weak to moderate. There is a weak to moderate (linear) relationship between [IV] and [DV], which means the data converges into an observable pattern, though with a moderate to substantial amount of variability from a line of best fit. | There is a weak to moderate relationship between [IV] and [DV]. There is an overall observable trend among [target population] that [describe direction of relationship], though there is some variability. We can make weak to moderate predictions of [dependent variable] for [unit of analysis] in the sample [delete "in the sample" if it's population data] based on knowing [independent variable], with our predictions overall noticeably more accurate than if we ignored [IV]. |
| ±0.40 to ±0.59 | Moderate. There is a moderate relationship between [IV] and [DV], which means the data converges moderately well onto a line of best fit, with a clear observable pattern that has a moderate amount of variability. | There is a moderate relationship between [IV] and [DV]. There is an overall trend among [target population] where [describe direction of relationship], though there is a fair amount of variability. We can do a moderately good job making predictions of [dependent variable] for [unit of analysis] in the sample [delete "in the sample" if it's population data] based on knowing [independent variable]. |
| ±0.60 to ±0.69 | Moderate to strong. There is a moderate to strong relationship between [IV] and [DV], which means the data converges pretty well onto a line of best fit, with a clear observable pattern that has a small to moderate amount of variability. | There is a moderate to strong relationship between [IV] and [DV]. There is an often consistent trend among [target population] where [describe direction of relationship], though there is a small to fair amount of variability. We can do a moderately to pretty good job making predictions of [dependent variable] for [unit of analysis] in the sample [delete "in the sample" if it's population data] based on knowing [independent variable]. |
| ±0.70 to ±0.89 | Strong. There is a strong relationship between [IV] and [DV], which means the data converges well onto a line of best fit, with a clear observable pattern that has a small amount of variability. | There is a strong relationship between [IV] and [DV]. There is a consistent trend among [target population] where [describe direction of relationship], though there is a small amount of variability. We can do a pretty good job making predictions of [dependent variable] for [unit of analysis] in the sample [delete "in the sample" if it's population data] based on knowing [independent variable]. |
| ±0.90 to ±0.99 | Very strong. There is a very strong relationship between [IV] and [DV], which means the data converges extremely well onto the line of best fit, with a clear and relatively uniform observable pattern. | There is a very strong relationship between [IV] and [DV]. There is a consistent trend among [target population] where [describe direction of relationship], though there is still a bit of variability. We can do a quite good job making predictions of [dependent variable] for [unit of analysis] in the sample [delete "in the sample" if it's population data] based on knowing [independent variable]. |
| ±1.00 | Perfect. There is a perfect relationship between [IV] and [DV], which means all data points are on the same line. The data converges into a perfect linear pattern. | [Describe direction of relationship.] If you know [unit of analysis's] [IV], you can predict [DV] with perfect accuracy. |
Begin to get a feel for correlation coefficients and how they map onto scatterplots by playing a few rounds of the game Guess the Correlation. Go to http://guessthecorrelation.com/ to play. Note that there are only positive correlations in the game.
After playing a few rounds, take a guess at the correlation coefficient for the scatterplot from our example.
Scroll down to see how accurate your guess is.
The correlation coefficient for our example is -0.808.
-0.808 indicates a strong negative relationship. There is a strong relationship between poverty rate and life expectancy, which means the data converges well onto a line of best fit, with a clear observable pattern that has a small amount of variability.
Here was the template language to interpret a correlation of -0.808: There is a strong relationship between [IV] and [DV]. There is a consistent trend among [target population] where [describe direction of relationship], though there is a small amount of variability. We can do a pretty good job making predictions of [dependent variable] for [unit of analysis] in the sample [delete "in the sample" if it's population data] based on knowing [independent variable].
Using the template, here is what I wrote to explain the correlation coefficient in context, telling its story:
There is a strong negative relationship between poverty rates and life expectancy. There is a consistent trend among U.S. states (and DC) where, as poverty rates increase, life expectancy decreases, though there is a small amount of variability. We can do a pretty good job making predictions of states' average life expectancy based on knowing states' poverty rates.
The word strength has many meanings and interpretations. But when discussing correlation and linear relationships, it has a very specific and particular definition: how well the data fits the line. How well the data converges into a common linear pattern. How consistent the pattern is across the data points.
A common mistake for people first encountering this term is to confuse strength with substantiveness, to think about how steep a line's slope is. It's important to be able to differentiate these two constructs. Below are a series of scatterplots with their correlation coefficient values marked. Notice how the slopes/steepness of the patterns of these lines have no impact on the correlation coefficient values. Substantiveness and slope will be discussed more in the next chapter. For now, with the focus on strength, remember to stick to evaluating how well the data conforms into a common linear pattern.
Pearson's correlation coefficient is only for linear relationships. You could have a correlation coefficient close to 0 and still have a strong relationship between your two variables if you have a curvilinear relationship rather than a linear relationship (e.g., if as the independent variable increases, the dependent variable first increases and then decreases).
Example:
Check out this scatterplot. There is clearly a pattern between the independent and dependent variables. However, r = -0.042.
Check out the line of best fit below.
The data does not converge well into a linear pattern. When you find that there is a low correlation coefficient value, it does not mean there's necessarily a non-existent or weak relationship. It means there is a weak or non-existent linear relationship. There are more advanced statistical techniques that can be used to explore non-linear relationships.
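A minimal illustration with hypothetical data (not from this chapter): take five points on the parabola \(y = x^2\), with x-values symmetric around zero: (-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4). The pattern is perfect but curved, and because \(\bar{x} = 0\) and \(\bar{y} = 2\), the numerator of Pearson's r cancels out exactly:
\[ \sum_i (x_i - \bar{x})(y_i - \bar{y}) = (-2)(2) + (-1)(-1) + (0)(-2) + (1)(-1) + (2)(2) = 0 \quad \Rightarrow \quad r = 0 \]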
The scatterplot below is also of all 50 states as well as Washington, D.C. The independent variable is the percent of the state's population that lives in an urban area as of 2020, a measure defined by the U.S. Census based on housing unit density. The dependent variable is the percent of the state's voters who voted for Joe Biden for president in 2020.
Here is the scatterplot with a line of best fit added:
Pearson's r is 0.54, a moderate, positive relationship. There is an overall trend where states with higher percentages of their population living in urban areas have higher percentages of voters voting for Biden, though there is a fair amount of variability. We can do a moderately good job making predictions of what percent of a state voted for Biden based on knowing the percent of the state that is living in an urban area.
However, notice that while r = 0.54, jurisdictions generally seem to be pretty well-clustered around the line, with the exception of three points (shaded red, green, and pink above), which are pretty far away from the line. They do not fit the general pattern.
Outliers have a substantive impact on correlation coefficients, as well as on lines of best fit. The squared differences of these jurisdictions from the line are considerable, requiring the line to adjust to account for them, and their distance from the pattern lowers the correlation coefficient value.
If you have one or more outliers, the first thing to do is to look them up and investigate what's going on. You want to make sure there was not a data entry error, and if the data is accurate, investigate what makes these cases unique.
In this case, the outliers from the overall pattern include:
- Vermont (red point): 35.1% of Vermont's population is urban, but 66.1% of Vermont voters voted for Biden.
- Maine (green point): 38.6% of Maine's population is urban, but a majority of Maine voters voted for Biden (53.1%).
- Washington, D.C. (pink point): 100% of DC's population is urban, and 92.2% of voters voted for Biden (the next highest was Vermont, more than 25 percentage points lower).
Additionally, you may want to consider whether or not it's appropriate to run a second model that excludes the outliers, to use in concert with the first model. We can't just remove data we don't like --- without these three outliers, the edited scatterplot only makes predictions for the remaining 48 states, and the predictions don't hold for Vermont, Maine, and Washington, D.C. However, we may be able to do a much better job understanding the patterns and making predictions if we leave those three jurisdictions out. For the remaining 48 states, r increases from 0.54 to 0.73.
Here is a scatterplot without those three jurisdictions, of 48 states:
Notice how the line of best fit converges better into a common linear pattern that fits more points here than it did when the outliers were included.
Using SPSS
Here is how to make a scatterplot using SPSS.
Syntax:
*Replace the IVName and DVName with the names of your independent and dependent variables.
GRAPH
/SCATTERPLOT(BIVAR)=IVName WITH DVName
/MISSING=LISTWISE.
Menu:
Go to: Graphs → Scatter/Dot
Click on the scatterplot picture to the left of "Simple Scatter" and then click Define.
Put your independent variable in the box labeled "X Axis:" and your dependent variable in the box labeled "Y Axis:", then click OK.
Once you have generated a scatterplot and it is in your output, double-click on the scatterplot, or right-click and select "Edit."
Go to: Elements → Fit line at total
Alternatively, you can click on the icon on the menu bar right above the graph that looks like a scatterplot with a fit line:
Select the circle for "Linear" under "Fit Method". Click "Apply" then "Close".
Click on the r-squared statistic in the top right corner and then press delete.
Either click on and then delete the equation (y=...), or click and drag it so it is not on top of the data.
What would you do if you were making a scatterplot of age and income and two respondents had the same age and income? Scatterplots can work really well for datasets with a smaller number of cases and more specificity and variation in values. However, if too many cases share the same coordinates, you are prone to misinterpret your data, and if the dataset also has a lot of cases, you may not even be able to begin an interpretation. This is because 1 case and 100 cases at a particular coordinate may look exactly the same.
The scatterplot below is from an American National Election Studies survey of U.S. eligible voters following the 2020 election. The variables are feelings towards Joe Biden and feelings towards Kamala Harris, measured using a feeling thermometer where 0 is very cold, 50 is neutral, and 100 is very warm.
The variables have a correlation coefficient of 0.91, a very strong positive relationship. Can you tell? I notice that there is an increasing sparseness of data when you get to warm feelings towards one candidate being paired with cold feelings towards the other, and I can see there are more pairings in terms of warm-warm and cold-cold groupings, so I get a sense of the positive direction, but not a correlation coefficient as high as 0.91.
This is especially because, out of the 2,578 cases, many are invisible on the scatterplot. Can you tell that there are 294 cases at (100, 100) where individuals rated both Biden and Harris at 100? How about 436 cases at (0,0)? There are over 100 cases at: (15,0), (50,50), (70,70), and (85,85). Meanwhile, there are many coordinates that only have 1 case, such as (0,85), (1,0), (35,0), (50,85), and (100,0). Do the coordinates with 1 case look any different than the coordinates with over 100 cases?
Here's another scatterplot, this one of age and feelings towards police.
Here r=0.289, a weak positive relationship. Do you see an observable pattern? If you were to try to draw a line of best fit for that pattern, is the line you would have come up with the one pictured above? Can you see 2,550 cases?
Scatterplots often do not work well for displaying data when there are repeated coordinates. You can still calculate correlation.
One way some people try to deal with this is through messier scatterplots that show the repeated coordinates. The first approach is "jittering." Think of jittering as taking the scatterplot and giving it a little shake, so that points are no longer exactly on their coordinates, reducing overlap and allowing us to see more of the data. The easiest way I have found to add jittering in SPSS is to use "Chart Builder" under "Graphs" to generate the scatterplot, copy the syntax, and then run the syntax after changing "ELEMENT: point" to "ELEMENT: point.jitter".
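Here is a sketch of what that edited syntax might look like. (Your Chart Builder output will differ in its details; IVName and DVName are placeholders for your own variable names.)
*Change "point" to "point.jitter" on the ELEMENT line to jitter the points.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=IVName DVName MISSING=LISTWISE
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: IVName=col(source(s), name("IVName"))
DATA: DVName=col(source(s), name("DVName"))
GUIDE: axis(dim(1), label("IVName"))
GUIDE: axis(dim(2), label("DVName"))
ELEMENT: point.jitter(position(IVName*DVName))
END GPL.
Here are the scatterplots from above, this time with jittering: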
Another way people try to deal with the problem of multiple points sharing coordinate locations is by showing density of each location. For example, a sunflower plot (the first one below) has little "petals" on each point to indicate how many cases are at each coordinate. The scatterplots below were all produced in SPSS by double-clicking on the initial scatterplot and then going to Options → Bin Element.
For any pair of ratio-like variables, you can figure out the correlation coefficient value, and for representative samples evaluate significance, by creating a correlation matrix.
Syntax
*You only need to list two variables, replacing VariableOneName etc with your variable names, but you can list as many variables as you want.
*Remember that even if you list multiple variables, the matrix will just be comparing two variables at a time, all possible pairings.
CORRELATIONS
/VARIABLES=VariableOneName VariableTwoName VariableThreeNameEtc
/PRINT=TWOTAIL NOSIG FULL
/STATISTICS DESCRIPTIVES /CI CILEVEL(95)
/MISSING=PAIRWISE.
Menu
Go to Analyze → Correlate → Bivariate.
Add at least 2 variables to the "Variables:" box on the right. (You can add as many as you want, but remember that the matrix will be giving you bivariate correlations, all possible pairings of the variables.)
Click on "Options...", in the pop-up window check the box for "Means and standard deviations" and click "Continue."
Click on "Confidence Interval...", in the pop-up window check the box for "Esimate confidence interval of bivariate correlation parameter" and click "Continue."
Click OK.
Here is an SPSS output, based on one of the examples above:
The descriptive statistics at the top give you summary statistics for each of the variables you selected.
The confidence intervals at the bottom tell you which 2-variable pair was analyzed, give you the correlation coefficient (r=0.909), the p-value (p<0.001), and a 95% confidence interval for the correlation coefficient (we are 95% confident that among all U.S. eligible voters following the 2020 election, the correlation coefficient would be somewhere between 0.902 and 0.916, a positive and very strong relationship). Correlation coefficients themselves are not inferential statistics; they are a summary of the relationship in the sample data (or population data, if that is what you have).
Let's take a look at the correlation matrix in the middle. We'll go through each part. I've added letters to cells for reference.
A. This is the correlation coefficient value for the variable with itself. Because (x,y) are always identical and always match, this will always be a 1, a perfect relationship. This does not give you any useful information and can be ignored.
B. This is the correlation coefficient value for the relationship between the two variables. It is identical in both places because direction does not matter for correlation (this is about correlation, not causation). Here r=0.909, a very strong positive relationship following the 2020 election between feelings towards Biden and feelings towards Harris.
C. N is the total number of cases (the sample size). Here we can observe there were 2,617 valid cases evaluating Joe Biden, 2,598 cases evaluating Kamala Harris, and 2,578 cases that had evaluations of both Biden and Harris. (The correlation matrix above is weighted, and the n here reflects weighted sample sizes. If I ran with weighting off, these numbers respectively would be 2,643, 2,635, and 2,613).
D. This is the p-value for the relationship. If we have a representative sample, we can use the p-value to evaluate significance. Here p<0.001, so we are over 99.9% confident that, among all U.S. eligible voters following the 2020 election, there was a relationship between how people felt about Biden and how people felt about Harris. This is our level of confidence that the correlation coefficient value is not zero, which would indicate no relationship.
E. The ** relates to the p-value from D. SPSS uses * for p<0.05 and ** for p<0.01. Oftentimes journal articles will use similar notation for significance levels, most commonly * for p<0.05, ** for p<0.01, and *** for p<0.001.
Correlation Matrices with three or more variables
The correlation matrix above only had two variables. However, you can also run a correlation matrix with more variables.
Here I put in 5 variables, using the data about U.S. states I was using earlier in the chapter.
Here are descriptive statistics for each variable:
Here is the correlation matrix:
This can be a lot to look at, but it's just more pairings. For example, the 0.724 is the r (correlation coefficient value) for the relationship between the official poverty rate and supplemental poverty rate, while -0.412 is the r for the relationship between supplemental poverty rate and life expectancy.
When you are in SPSS, you can check "Show only the lower triangle" at the bottom, or, in syntax, replace the word "FULL" with "LOWER" in the "/PRINT" line.
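For example, here is the earlier syntax with only that one change:
CORRELATIONS
/VARIABLES=VariableOneName VariableTwoName VariableThreeNameEtc
/PRINT=TWOTAIL NOSIG LOWER
/STATISTICS DESCRIPTIVES /CI CILEVEL(95)
/MISSING=PAIRWISE.
Either way, you will get a table that instead looks like this: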
This is the same information as the table above, just without duplications, and without showing a correlation coefficient for a variable with itself. I prefer the full table, but this is personal preference. Sometimes you will see correlation matrices like this one in journal articles.
Here is the confidence intervals output:
You can see that for each row it tells you what the two variable pair consists of (e.g., the first one is official poverty rate and supplemental poverty rate). I included the confidence intervals output so you could get a sense of it, but I would not use these confidence intervals nor the p-values in the correlation matrix above for this particular data, because the cases (50 states and DC) are not a representative sample generalizable to any broader target population.
Scatterplots are good for ratio-like variables only. For example, check out this "scatterplot" of poverty rate and life expectancy by U.S. state, where poverty rate is an indicator variable (0 = under 10% poverty rate, 1 = 10% or higher poverty rate):
You're 10 chapters in, and we've now covered a whole host of different descriptive statistics, inferential statistics, charts, and graphs. Let's review when each should be used:
Descriptive statistics:
Mean: Needs ratio-like variable, or indicator variable (interpreted as proportion with value of 1)
Standard deviation: Needs ratio-like variable
Index of Qualitative Variation: Needs nominal variable
5 Point Summary (Median, Minimum, Maximum, Lower Quartile, Upper Quartile): Needs ordinal, ratio-like, or interval variable
Mode: Any level of measurement (but not useful)
Range and InterQuartile Range: Needs ratio-like variable; for ordinal variables, the range can be described but usually not calculated
Gamma: Needs two ordinal-level variables (can use indicator variables)
Correlation: Needs two ratio-like variables (can use indicator variables, but they will not reflect a linear pattern)
Tables:
Frequency Distribution: Needs categorical variable
Crosstabs: Needs 2 categorical variables (can use ratio-like if a small discrete number of answers)
Elaborated Crosstabs: Needs 3+ categorical variables (can use ratio-like if a small discrete number of answers)
Graphs:
Pie chart: Needs categorical variable (with not too many categories)
Bar Graph: Needs categorical variable
Histogram: Needs ratio-like variable
Box and Whisker Plot: Needs ratio-like variable
Scatterplot: Needs 2 ratio-like variables (with unique coordinates unless adjusting scatterplot)
Inferential statistics:
Only proceed if data is from a representative sample of target population.
Confidence Interval: Needs ratio-like variable, or indicator variable (interpreted as proportion with value of 1)
Two-sample T-test: Needs 1 ratio-like variable, or 1 indicator variable (interpreted as proportion with value of 1); and 1 categorical variable with 2 selected categories
One-way ANOVA: Needs 1 ratio-like variable, or 1 indicator variable (interpreted as proportion with value of 1); and 1 categorical variable with multiple categories
Chi-square: Needs 2 categorical variables (can use ratio-like if a small discrete number of answers)
Elaborated chi-square: Needs 3+ categorical variables (can use ratio-like if a small discrete number of answers)