Pearson r, R² & Spearman ρ

Three correlation concepts — built up from scratch
Section 1 Pearson r — measuring linear correlation
You have two lists of numbers — call them X and Y. The question: when X is large, does Y tend to be large too? Pearson r gives you a single number between −1 and +1 that answers exactly that — assuming the relationship between them is a straight line.
01
Mean-center both variables
Subtract each variable's mean from its values. Now: positive = above average, negative = below average. Units don't matter anymore.
02
Multiply each pair of deviations
For each data point i, compute (xᵢ−x̄)·(yᵢ−ȳ). Both above average → positive product. One above, one below → negative product.
03
Sum and normalize
Sum all those products. Then divide by √[Σ(xᵢ−x̄)² · Σ(yᵢ−ȳ)²] — equivalently, divide the covariance by the product of the two standard deviations. This rescales everything to [−1, +1] no matter what the original units were.
04
Read the result
r = +1: perfect line up ↗
r = −1: perfect line down ↘
r = 0: no linear trend
Weakness: outliers and curves break it.
r = Σ(xᵢ−x̄)(yᵢ−ȳ) / √[ Σ(xᵢ−x̄)² · Σ(yᵢ−ȳ)² ]
numerator: how much X and Y deviate together · denominator: normalises the result to [−1, +1]
The critical weakness of Pearson r:
It operates on raw values. One extreme outlier can yank the numerator massively, making r look huge even if the rest of the data shows no trend at all — or it can kill a real trend. It also completely misses relationships that are curved rather than straight.
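The four steps above can be sketched directly in Python, using only the standard library (the example data is illustrative, not from the text):

```python
import math

def pearson_r(x, y):
    """Pearson r, computed exactly as in the steps above:
    mean-center, multiply paired deviations, sum, normalise."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    dx = [xi - mx for xi in x]                    # step 1: mean-center X
    dy = [yi - my for yi in y]                    # step 1: mean-center Y
    num = sum(a * b for a, b in zip(dx, dy))      # steps 2-3: sum of products
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den                              # step 3: normalise to [-1, +1]

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear → 1.0
print(pearson_r([1, 2, 3], [3, 2, 1]))         # perfectly reversed → -1.0
```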
Section 2 What is a rank?
A rank is a value's position when you sort the list from smallest to largest. Smallest value → rank 1. Next smallest → rank 2. And so on. The actual numeric distances between values are discarded — only the ordering survives.
[Diagram: ① original values → sort, smallest first → ② sorted order, where position = rank → ③ ranks written back in the original order]
Key insight
4000 is a wild outlier — 85× bigger than 47. But its rank is just 5 — the same as if it were 50 or 500. Once you rank, the size of the gap between values no longer exists. Only "it's the biggest" survives. This is what protects Spearman from outliers.
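The ranking step can be sketched in Python. The data below is hypothetical, chosen only to echo the 47 and the outlier 4000 mentioned above; note how 4000 becomes plain rank 5:

```python
def ranks(values):
    """Rank values 1..n smallest-first; ties get the average
    of the positions they would have occupied."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

# hypothetical data: 4000 is a wild outlier, but its rank is just 5
print(ranks([12, 47, 3, 29, 4000]))  # → [2.0, 4.0, 1.0, 3.0, 5.0]
print(ranks([1, 2, 2, 3]))           # tie: both 2s get rank 2.5
```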
Section 3 Spearman ρ — Pearson r applied to ranks
Spearman ρ is not a fundamentally new idea. It's just Pearson r applied to the ranked versions of X and Y instead of the raw values. That's the whole trick. Because ranks neutralise outliers and compress any monotone curve into a straight line, ρ works on any relationship that only goes one direction — not just linear ones.
01
Rank your X values
Replace each raw X with its rank (1 = smallest). Ties get the average of the positions they would have occupied.
02
Rank your Y values
Same for Y. Now both variables live on the same 1…n scale. No units. No outliers. Just order.
03
Run Pearson r on the ranks
Feed rank(X) and rank(Y) into the exact same Pearson formula from Section 1. The result is ρ.
04
Interpret ρ
ρ = +1: same rank order ↗
ρ = −1: reversed rank order ↘
ρ = 0: no monotone trend
Measures any monotone shape, not just lines.
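Putting the steps together, Spearman ρ is literally "rank, then Pearson." A minimal self-contained sketch (the cubic example data is my own, chosen to show a monotone-but-curved relationship):

```python
import math

def spearman_rho(x, y):
    """Spearman ρ = Pearson r computed on ranks instead of raw values."""
    def rank(v):
        # 1 = smallest; ties get the average of first and last tied position
        s = sorted(v)
        return [(s.index(t) + len(s) - s[::-1].index(t) + 1) / 2 for t in v]

    def pearson(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        da = [t - ma for t in a]
        db = [t - mb for t in b]
        num = sum(p * q for p, q in zip(da, db))
        den = math.sqrt(sum(p * p for p in da) * sum(q * q for q in db))
        return num / den

    return pearson(rank(x), rank(y))

# A monotone but curved relationship: y = x³. Pearson on the raw values
# is dragged around by the outlier at x = 40, but the rank order of X and Y
# is identical, so Spearman is exactly 1.
xs = [1, 2, 3, 4, 5, 40]
ys = [x ** 3 for x in xs]
print(spearman_rho(xs, ys))  # → 1.0
```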
Section 4 R² — the fraction of variance explained
Historical background. Francis Galton coined "regression" in the 1880s studying how children's heights regress toward the population mean. Karl Pearson formalised it and defined r as the correlation coefficient. R² fell out naturally: once you have r, squaring it tells you how much of the variation in one variable is accounted for by the other. It became the standard goodness-of-fit measure for linear regression — so standard that most people encounter R² before they ever think carefully about r.
What R² literally is. R² is exactly r squared. More precisely: it is the square of the Pearson correlation between the predicted values (ŷ, from your regression line) and the observed values (y). In simple linear regression those two definitions are identical.
Why squaring gives a clean interpretation. r can be negative (−1 to +1). Squaring it does two things: it removes the sign (direction no longer matters — only strength does), and it maps everything to [0, 1]. That [0, 1] range has a precise meaning: it is the fraction of the total variance in Y that your model explains. The rest — (1 − R²) — is variance left unexplained, driven by noise or variables you didn't measure.
Definition
R² = r² = SSreg / SStot
SSreg = variance explained by the model · SStot = total variance in Y
Equivalent form
R² = 1 − SSres / SStot
SSres = variance not explained (residuals) · same result, different framing
R² = 1.0 perfect fit
Every point sits exactly on the regression line. The model explains 100% of the variance in Y — knowing X tells you Y perfectly. Example: distance = speed × time, no measurement error.
R² = 0.80 strong but imperfect
The line captures the trend well, but points scatter around it. 80% of variation in Y is explained by X. The remaining 20% is noise, measurement error, or unmeasured variables. Example: study hours predicting exam score.
R² = 0.0 no linear relationship
The regression line is flat. X explains none of the variance in Y — knowing X tells you nothing about Y. Example: shoe size predicting salary.
[Interactive demo: drag points, click to add, right-click to remove, or load a preset. Left panel: Pearson r on the raw data. Right panel: Spearman ρ on the same points after rank transformation, which is what Spearman actually correlates.]