Historical background.
Francis Galton coined "regression" in the 1880s studying how children's heights regress toward the population mean.
Karl Pearson formalised it and defined r as the correlation coefficient.
R² fell out naturally: once you have r, squaring it tells you how much of the variation
in one variable is accounted for by the other. It became the standard goodness-of-fit
measure for linear regression — so standard that most people encounter R² before they
ever think carefully about r.
What R² literally is.
R² is exactly what the name says: r squared.
More precisely: it is the square of the Pearson correlation between the
predicted values (ŷ, from your regression line) and the observed values (y).
In simple linear regression those two definitions are identical.
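A quick numerical check of that equivalence. This is an illustrative sketch, not anything from the text: the synthetic data, the use of np.polyfit for the fit, and all variable names are assumptions.

```python
import numpy as np

# Hypothetical synthetic data: a linear signal plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Simple linear regression via least squares (degree-1 polynomial fit).
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

r_xy = np.corrcoef(x, y)[0, 1]          # Pearson r between x and y
r_yhat_y = np.corrcoef(y_hat, y)[0, 1]  # Pearson r between ŷ and y

# The two squared correlations coincide in simple linear regression.
print(np.isclose(r_xy**2, r_yhat_y**2))  # True
```

Because ŷ is just a linear function of x, its correlation with y equals that of x with y up to sign, so the squares agree exactly (up to floating-point tolerance).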
Why squaring gives a clean interpretation.
r ranges from −1 to +1, so it can be negative. Squaring it does two things:
it removes the sign (direction no longer matters, only strength does),
and it maps every value into [0, 1].
That [0, 1] range has a precise meaning: it is the
fraction of the total variance in Y that your model explains.
The remainder, 1 − R², is the fraction of variance left unexplained, driven by noise or variables you didn't measure.
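That variance-explained reading can be verified directly. The sketch below, under the same kind of assumed synthetic setup as before, computes R² from the sum-of-squares decomposition (1 − SS_res / SS_tot) and confirms it matches r² and that the explained and unexplained fractions sum to one.

```python
import numpy as np

# Hypothetical synthetic data and a simple least-squares fit.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)       # residual (unexplained) variation
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
r_squared = 1.0 - ss_res / ss_tot       # fraction of variance explained

r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_squared, r**2))      # True
```

Here ss_res / ss_tot is exactly the unexplained fraction 1 − R², so the two pieces always add up to the total variance.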