Pearson Correlation Calculator (r)
Measure the linear relationship between two paired numeric variables.
Paste your X and Y data below. Values can be separated by spaces, commas, tabs, or new lines. Pairs match by index.
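The input rule above can be sketched in a few lines. This is a minimal illustration, not the tool's actual parser: the names `parse_values` and `pair_by_index` are hypothetical, and rejecting unequal-length inputs (rather than truncating) is an assumption.

```python
import re

def parse_values(text):
    """Split pasted text on commas and any whitespace, then convert to floats."""
    tokens = [t for t in re.split(r"[,\s]+", text) if t]
    return [float(t) for t in tokens]

def pair_by_index(x_text, y_text):
    """Pair the i-th X value with the i-th Y value."""
    xs = parse_values(x_text)
    ys = parse_values(y_text)
    if len(xs) != len(ys):
        # Assumption: unequal lengths are rejected rather than silently truncated.
        raise ValueError(f"X has {len(xs)} values but Y has {len(ys)}")
    return list(zip(xs, ys))

# Mixed separators (comma, tab, new line, space) in X; spaces only in Y.
pairs = pair_by_index("2, 4\t6\n8 10", "58 65 72 78 87")
print(pairs)
```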
How is this calculated?
Formula: r = sum((x_i - mean_x) * (y_i - mean_y)) / sqrt(sum((x_i - mean_x)^2) * sum((y_i - mean_y)^2)). The result ranges from -1 (perfect negative) to +1 (perfect positive). r squared is the share of Y variance explained by X under a linear fit. Source: Pearson 1895, standard statistics textbooks.
About
The Pearson correlation coefficient r is a standardised measure of the linear relationship between two paired numeric variables. It ranges from -1 (perfect inverse) to +1 (perfect direct), with 0 indicating no linear relationship. The tool computes r, r squared, and basic descriptive statistics from any two equal-length arrays.
How it works
r is the covariance of two variables divided by the product of their standard deviations. Geometrically, it is the cosine of the angle between the mean-centred X and Y vectors. Algebraically:
r = cov(X, Y) / (sigma_X * sigma_Y)
  = sum_i [(x_i - mean(X)) * (y_i - mean(Y))] / sqrt( [sum_i (x_i - mean(X))^2] * [sum_i (y_i - mean(Y))^2] )

r squared = proportion of Y variance explained by a linear fit on X
Both X and Y must be on interval or ratio scales (numeric, evenly spaced). The formula is symmetric: r(X,Y) = r(Y,X). It assumes the relationship is linear; a perfect quadratic curve can produce r ~ 0.
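The definition, the cosine interpretation, and the symmetry property can be checked with a short sketch. It uses only the standard library; the function names and the five-point sample are illustrative assumptions, not part of the tool.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from the definition: cross-product sum over the root of the two sums of squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)

def centred_cosine(xs, ys):
    """Cosine of the angle between the mean-centred X and Y vectors."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    dot = sum(a * b for a, b in zip(dx, dy))
    norm_x = math.sqrt(sum(a * a for a in dx))
    norm_y = math.sqrt(sum(b * b for b in dy))
    return dot / (norm_x * norm_y)

# Illustrative sample (not from the tool).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)
cosine = centred_cosine(xs, ys)
print(r_xy, r_yx, cosine)
```

All three printed values should agree up to floating-point rounding, which is what the symmetry and cosine statements predict.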
Worked example
A teacher records hours of weekly study and final exam scores for 5 students: study X = [2, 4, 6, 8, 10]; scores Y = [58, 65, 72, 78, 87].
- Means: mean(X) = 6, mean(Y) = 72.
- Deviations: X - mean = [-4, -2, 0, 2, 4]; Y - mean = [-14, -7, 0, 6, 15].
- Cross-product sum: 56 + 14 + 0 + 12 + 60 = 142.
- Sum of squares X: 16 + 4 + 0 + 4 + 16 = 40. Sum of squares Y: 196 + 49 + 0 + 36 + 225 = 506.
- Apply formula: r = 142 / sqrt(40 x 506) = 142 / sqrt(20,240) = 142 / 142.27 = 0.9981.
- r squared: 0.9962, so 99.6 percent of score variance is linearly explained by study hours.
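The arithmetic above can be reproduced step by step. The sketch below is a plain-Python check of the same numbers; the variable names are illustrative.

```python
import math

study_hours = [2, 4, 6, 8, 10]
exam_scores = [58, 65, 72, 78, 87]

n = len(study_hours)
mean_x = sum(study_hours) / n
mean_y = sum(exam_scores) / n

# Deviations from the mean for each student.
dev_x = [x - mean_x for x in study_hours]
dev_y = [y - mean_y for y in exam_scores]

# Cross-product sum and the two sums of squares.
cross_sum = sum(a * b for a, b in zip(dev_x, dev_y))
ss_x = sum(a * a for a in dev_x)
ss_y = sum(b * b for b in dev_y)

r = cross_sum / math.sqrt(ss_x * ss_y)
r_squared = r * r

print(mean_x, mean_y)            # means
print(cross_sum, ss_x, ss_y)     # 142.0 40.0 506.0
print(round(r, 4), round(r_squared, 4))  # 0.9981 0.9962
```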
Reference table
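Common interpretation bands for the absolute value |r|. The small (0.10), medium (0.30), and large (0.50) cut-offs follow Cohen (1988); the remaining bands are widely used extensions rather than Cohen's own labels:
| \|r\| | Verbal label | r squared (variance explained) | Field example |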
|---|---|---|---|
| 0.00-0.09 | Negligible | < 1 percent | Coin flips vs weather |
| 0.10-0.29 | Small / weak | 1 to 8 percent | Personality trait predicting outcomes |
| 0.30-0.49 | Medium / moderate | 9 to 24 percent | Education and income |
| 0.50-0.69 | Large / strong | 25 to 48 percent | Height of parents vs children |
| 0.70-0.89 | Very strong | 49 to 79 percent | SAT verbal vs SAT math |
| 0.90-0.99 | Near perfect | 81 to 98 percent | Twin IQs (monozygotic) |
| 1.00 | Perfect linear | 100 percent | F = C x 9/5 + 32 |
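
The table can be turned into a small lookup. This is a sketch only: `interpret_r` is a hypothetical helper, and treating each band's lower bound as inclusive (so that values such as 0.095 fall into "Negligible") is an assumption made to close the gaps between the printed ranges.

```python
def interpret_r(r):
    """Map a correlation to the verbal label of its |r| band in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size >= 1.0:
        return "Perfect linear"
    # Lower bounds of each band, checked from strongest to weakest.
    bands = [
        (0.90, "Near perfect"),
        (0.70, "Very strong"),
        (0.50, "Large / strong"),
        (0.30, "Medium / moderate"),
        (0.10, "Small / weak"),
    ]
    for lower_bound, label in bands:
        if size >= lower_bound:
            return label
    return "Negligible"

for value in (0.05, -0.35, 0.62, 0.9981, -1.0):
    print(value, interpret_r(value))
```

The sign is ignored on purpose: the labels describe strength, while direction is read from the sign of r itself.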
Common pitfalls
- Confusing correlation with causation. Two variables can correlate strongly because both depend on a third (the "lurking variable"). Ice-cream sales and drownings correlate via summer temperature.
- Missing non-linear patterns. r near 0 only rules out a linear pattern. A perfect U-shape (y = x^2 with x values symmetric about zero) gives r = 0.
- Outliers. One extreme point can drag r from 0 to 0.9 or vice versa. Plot the scatter first; consider Spearman or robust alternatives if outliers are real.
- Aggregating to ecological correlations. Group-level r is often much larger than individual-level r (the ecological fallacy, a close relative of Simpson's paradox). Always check the unit of analysis.
- Statistical significance vs effect size. With n = 10,000, r = 0.02 is "statistically significant" but explains 0.04 percent of variance. Report both r and the confidence interval.
- Truncated range. Selecting only top scorers compresses Y's variance and shrinks r toward 0 (range restriction).
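
Three of the pitfalls above are easy to reproduce numerically. The sketch below uses made-up data chosen for illustration: a symmetric U-shape, four uncorrelated points plus one extreme point, and the usual t statistic for testing r against zero, t = r * sqrt((n - 2) / (1 - r^2)). The critical value 1.96 is the large-sample two-sided 5 percent cut-off.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from the definition (standard library only)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# 1. Non-linear pattern: y = x^2 on x values symmetric about zero.
u_x = [-2, -1, 0, 1, 2]
u_y = [x * x for x in u_x]
r_u = pearson_r(u_x, u_y)

# 2. Outlier: four uncorrelated points, then the same points plus (10, 10).
clean_x, clean_y = [1, 1, 2, 2], [1, 2, 1, 2]
r_clean = pearson_r(clean_x, clean_y)
r_outlier = pearson_r(clean_x + [10], clean_y + [10])

# 3. Significance vs effect size: a tiny r with a very large sample.
n, r_small = 10_000, 0.02
t_stat = r_small * math.sqrt((n - 2) / (1 - r_small ** 2))
variance_explained_pct = 100 * r_small ** 2

print(r_u, r_clean, r_outlier)
print(t_stat, variance_explained_pct)
```

The U-shape and the four clean points both give r = 0, the single extreme point pushes r above 0.9, and the large-sample t statistic just clears 1.96 even though r squared is only about 0.04 percent.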
Frequently asked questions
What does the Pearson r value mean?
Pearson r ranges from -1 to +1 and measures the strength and direction of a linear relationship between two variables. r = +1 is a perfect positive linear fit, 0 is no linear relationship, and -1 is a perfect negative fit. Cohen's 1988 conventions classify |r| as small (0.10), medium (0.30), and large (0.50), but the practical interpretation depends on the field.
What is the difference between r and r squared?
r is the correlation coefficient. r squared (the coefficient of determination) is the share of variance in Y explained by a linear fit on X. r = 0.7 implies r squared = 0.49, so 49 percent of Y variance is linearly explained by X, leaving 51 percent unexplained or noise.
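The "share of variance explained" reading can be verified directly: for a simple least-squares line, 1 - SSE/SST equals r squared. The sketch below assumes an ordinary least-squares fit of Y on X and reuses the study-hours data from the worked example; the variable names are illustrative.

```python
import math

xs = [2, 4, 6, 8, 10]
ys = [58, 65, 72, 78, 87]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
sst = sum((y - mean_y) ** 2 for y in ys)   # total sum of squares of Y

# Correlation coefficient from the definition.
r = sxy / math.sqrt(sxx * sst)

# Ordinary least-squares line y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual (unexplained) sum of squares around the fitted line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
r_squared_from_fit = 1 - sse / sst

print(r * r, r_squared_from_fit)
```

Both printed values should match up to rounding, which is why r squared is read as the explained share of Y's variance.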
Does a high r mean X causes Y?
No. Correlation is symmetric (r(X,Y) = r(Y,X)) and reflects only co-variation. Causal claims require a research design that rules out confounders, reverse causation, and selection (e.g. randomized experiment, instrumental variable, or natural experiment). Ice-cream sales and drowning rates correlate strongly through summer temperature.
When should I use Spearman or Kendall instead of Pearson?
Use Spearman's rho or Kendall's tau when the relationship is monotonic but not linear, when data are ordinal, or when outliers dominate Pearson r. Pearson r itself can be computed for any paired numeric data, but its usual significance tests and confidence intervals assume bivariate normality and roughly constant variance; both are often violated by heavy-tailed financial data, where rank-based correlations tend to give more stable estimates.
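To make the contrast concrete, Spearman's rho can be computed as Pearson r applied to ranks. The sketch below is a standard-library illustration with hypothetical helper names; ties receive average ranks, and the cubic data set is made up to show a monotonic but non-linear relationship.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from the definition (standard library only)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho as Pearson r of the rank-transformed data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Monotonic but non-linear: y = x^3.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]

r_pearson = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
print(r_pearson, rho)
```

Because the ordering of Y follows X exactly, rho is 1 while Pearson r stays below 1 on the same data.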
Sources
- Pearson K. (1895), Note on regression and inheritance in the case of two parents, Proceedings of the Royal Society of London.
- Cohen J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed., Lawrence Erlbaum (effect-size thresholds).
- Wasserman L. (2004), All of Statistics, Springer (chapter 14: linear regression and correlation).
- NIST/SEMATECH e-Handbook of Statistical Methods, section 7.2.6 (correlation interpretation).
