📈 Linear regression & correlation

Ever noticed that taller players tend to score more in basketball? Or that the more you practice, the better you get? Data often follows patterns like this — and linear regression is the math tool that helps you find, describe, and use those patterns.

Here’s what you’ll learn:

  • What correlation means — when two things tend to go up or down together
  • The correlation coefficient r — a number that tells you how strong the pattern is
  • What a line of best fit is and how to use it to make predictions

Quick check

What does linear regression help you do?   (linear regression = the method from this lesson)

What is correlation?

Correlation describes whether two things tend to move together. When one goes up, does the other usually go up too? Go down? Or is there no connection at all?

Here are three types of patterns you might see:

Three possible patterns

  • Positive: Both go up together, like hours of sleep and how energized you feel.
  • Negative: One goes up while the other goes down, like hours of screen time before bed and how rested you feel in the morning.
  • No pattern: No real connection, like shoe size and favorite color.

Quick check

If one variable goes up and the other usually goes down, the correlation is:   (correlation = how x and y move together)

The correlation coefficient r

Scientists summarize correlation with a single number called r. It’s always between −1 and +1. You don’t calculate it by hand — you use it to read and describe how strong a pattern is.

What does r mean?

r | What it means | What the graph looks like
close to +1 | Strong positive: when x goes up, y usually goes up too | Dots cluster near a line going up-right
close to 0 | No clear straight-line pattern: knowing x tells you little about y | Dots scattered everywhere, no clear line
close to −1 | Strong negative: when x goes up, y usually goes down | Dots cluster near a line going down-right
r ≈ +1 → strong positive  |  r ≈ −1 → strong negative  |  r ≈ 0 → no pattern

Think of r as a "pattern score." The closer to +1 or −1, the stronger and clearer the pattern.
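
Since r isn't worked out by hand, here is a minimal Python sketch of how a computer might calculate it, using the standard Pearson formula. The function name pearson_r and the five data points are made up for illustration; they are not from the lesson.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How far each value sits from its own average
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Positive when x and y tend to be above (or below) average together
    together = sum(a * b for a, b in zip(dx, dy))
    # Scales the result so it always lands between -1 and +1
    spread = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return together / spread

# Hypothetical data: hours studied and test scores for five students
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 71, 75, 88]

r = pearson_r(hours, scores)
print(round(r, 2))  # close to +1, so a strong positive pattern
```

Reversing the list of scores, so that more hours go with lower scores, flips the sign and gives an r close to −1.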

Quick check

When r is close to 0, what does that mean?   (r = correlation coefficient)

Scatter plot: positive correlation

A scatter plot shows two things plotted against each other — each dot is one data point. This one shows hours studied (x) vs test score (y). You can see right away that more study time tends to mean a higher score.

Dots going up and to the right = positive correlation. The pattern doesn’t need to be a perfect line — as long as the trend is clear, it counts.

Quick check

On a scatter plot, dots going up and to the right suggest:   (scatter plot = graph of (x,y) points)

Line of best fit

The line of best fit is a straight line drawn through the middle of your scatter plot — it’s the line that gets as close as possible to all the dots at once. We use it to make predictions.

Its equation looks like y = mx + b, where m is the slope (how steep the line is, or how much y changes when x goes up by 1) and b is the y-intercept (the value of y when x is 0, where the line crosses the y-axis).

Example: If the line is score = 15 × hours + 40, then 3 hours of studying predicts a score of 15(3) + 40 = 85. Pretty useful!
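
The same prediction can be written as a tiny Python function. This is only a sketch: the name predict_score is made up, and the slope 15 and intercept 40 come straight from the example line above.

```python
def predict_score(hours, slope=15, intercept=40):
    """Predict a test score using score = slope * hours + intercept."""
    return slope * hours + intercept

print(predict_score(3))  # 15 * 3 + 40 = 85
```

Calling predict_score(0) returns 40, which is the intercept b: the predicted score for zero hours of studying.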

The red line is the line of best fit. The closer the dots are to the line, the stronger the correlation — and the more reliable your predictions.
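
The lesson doesn't say how the line of best fit is chosen. One common way to make "as close as possible" precise is the least-squares method, and that is what the sketch below assumes. The function name best_fit_line and the data points are made up for illustration.

```python
def best_fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope
    b = mean_y - m * mean_x  # intercept: the line passes through the averages
    return m, b

# Hypothetical data: hours studied and test scores for five students
hours = [1, 2, 3, 4, 5]
scores = [57, 68, 86, 99, 115]

m, b = best_fit_line(hours, scores)
print(round(m, 1), round(b, 1))  # roughly 14.7 and 40.9
```

With this made-up data the fitted line comes out close to the lesson's example, score = 15 × hours + 40, even though no single dot has to sit exactly on it.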

Quick check

What is the line of best fit used for?   (line of best fit = the straight line through the data)

When r is close to 0

Sometimes two things have no real connection. The dots on the scatter plot are all over the place — no clear direction up or down. That’s when r is close to 0.

Real-life example: Hours you spend on a hobby vs your math grade. For most people, there’s no consistent link — one doesn’t predict the other.

You can still draw a line of best fit, but it won’t mean much. A nearly flat line surrounded by scattered dots is the graph version of “I don’t see a pattern here.”
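
To see what "a nearly flat line" looks like in numbers, here is a small Python sketch. The data is invented so that the ups and downs cancel out, the function name slope_and_r is made up, and the slope is calculated with the least-squares method (an assumption, since the lesson doesn't name one).

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Return the least-squares slope and the correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)

# Hypothetical data: weekly hobby hours and math grades for five students
hobby_hours = [1, 2, 3, 4, 5]
math_grades = [82, 74, 90, 74, 82]

slope, r = slope_and_r(hobby_hours, math_grades)
# Both values come out at essentially zero: a flat line and no pattern
print(abs(round(slope, 2)), abs(round(r, 2)))
```

A slope near 0 means the line predicts almost the same grade no matter how many hobby hours you plug in, which is why the prediction isn't useful here.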

Important: just because two things happen at the same time doesn’t mean one causes the other. Correlation isn’t causation!

Quick check

Correlation tells you that two variables are related. Does that mean one causes the other?   (causation = one thing causing the other)

Nice work — here's a recap

  • Correlation: Two things moving together — positive (both up), negative (one up, one down), or no pattern at all.
  • Correlation coefficient r: A number from −1 to +1. Close to ±1 = strong pattern. Close to 0 = no straight-line pattern.
  • Line of best fit: The straight line closest to all the dots. Equation: y = mx + b. Use it to predict y when you know x.
  • Use a scatter plot to see the relationship. Use r to measure how strong it is. Use the line to make predictions.

Head to the quiz and see what you remember!

Quick check

To predict y when you know x, you use:   (x and y = the variables from the lesson)

End of lesson test

Pick an answer for each question, then click Submit answers to see how you did — with explanations!

A scatter plot of car age (years) vs resale value ($) shows a clear downward trend. As age increases, value tends to decrease. What does this imply about r?

A study of daily rainfall (inches) vs umbrella sales finds r = 0.02. What does this most likely indicate?

Why are predictions from the line of best fit unreliable when r ≈ 0?

In many cities, air conditioning usage and electricity bills are strongly correlated. Someone concludes that AC usage causes high bills. What principle are they overlooking?

The line of best fit for advertising spend (x, in $1000s) vs unsold inventory (y, units) is y = −4x + 60. When x increases by 1 unit ($1000), what happens to the predicted y?

A scatter plot of altitude (feet) vs air temperature (°F) shows points tightly clustered along a line sloping down from left to right. What does this suggest?

What does linear regression assume about the relationship between x and y?

For coffee price ($/lb) and quantity sold (lbs), a researcher reports r = 0.94. Which interpretation is correct?

When is it risky to use the line of best fit for a prediction?

In the equation y = mx + b, what does b represent?

Your data for distance from downtown (miles) vs rent ($) covers x from 2 to 15 miles. You use the line to predict rent for x = 25 miles. What is this an example of?

Which r value would yield the most reliable predictions from the line of best fit?

A scatter plot of driving speed (mph) vs fuel consumption (gallons per 100 miles) shows dots trending up and to the right. What does this imply about r?

If the slope m of your line of best fit is negative, what can you conclude about the correlation?

For movie runtime (minutes) and box office revenue ($), r = 0.88. Which conclusion is not justified by r alone?

For study hours (x) vs exam score (y), the line predicts 78 for a student who studied 5 hours. The student actually scored 85. What is the residual?

For fertilizer amount (lbs/acre) vs crop yield (bushels), r = 0.8. What does r² = 0.64 tell you?

A scatter plot of exercise minutes per week vs resting heart rate (bpm) has one point far from the rest. What is such a point called, and why does it matter?

Your data for engine size (liters) vs fuel economy (mpg) covers x from 1.5 to 4.0. You predict mpg for x = 2.5. What is this?

Why might r be close to 0 even when x and y are clearly related?