Python SciPy stats tutorial shows how to perform advanced statistical analysis using scipy.stats.
last modified March 8, 2025
This tutorial explores statistical analysis in Python using the scipy.stats module, part of the SciPy library, ideal for advanced data science tasks.
The scipy.stats module offers tools for descriptive statistics, probability distributions, and hypothesis testing, far exceeding the basic capabilities of Python’s statistics module.
$ pip install scipy
Install SciPy using pip to access the stats module and its powerful statistical functions.
The mean is the average of a dataset, while the median is the middle value of the sorted data and is robust against outliers.
mean_median.py
#!/usr/bin/python
import numpy as np

sales = [1200, 1500, 1300, 1700, 5000]  # Monthly sales in USD

mean = np.mean(sales)
median = np.median(sales)

print(f"Mean Sales: ${mean:.2f}")
print(f"Median Sales: ${median:.2f}")
We import numpy for efficient array operations. The sales list represents monthly revenue, and np.mean and np.median perform the calculations.
The mean ($2,140) is pulled upward by the outlier ($5,000), while the median ($1,500) better reflects typical sales, showing why the median is the safer summary for skewed data.
$ ./mean_median.py
Mean Sales: $2140.00
Median Sales: $1500.00
The mode identifies the most frequent value in a dataset, useful for categorical or discrete data analysis.
mode.py
#!/usr/bin/python
from scipy import stats
ratings = [5, 4, 5, 3, 4, 5, 2, 4, 5]  # Customer satisfaction scores

mode_result = stats.mode(ratings)

print(f"Mode: {mode_result.mode}, Count: {mode_result.count}")
The ratings list simulates customer feedback on a 1-5 scale. stats.mode returns a ModeResult object with mode (the most frequent value) and count (its frequency); in recent SciPy versions both are scalars rather than arrays. Here, 5 appears 4 times.
$ ./mode.py
Mode: 5, Count: 4
The variance measures data spread as the average squared difference from the mean. The standard deviation, its square root, indicates typical deviation in original units.
var_stdev.py
#!/usr/bin/python
import numpy as np

temps = [22.5, 23.0, 21.8, 24.1, 22.9]  # Daily temperatures (°C)

variance = np.var(temps, ddof=1)  # Sample variance
stdev = np.std(temps, ddof=1)  # Sample standard deviation

print(f"Variance: {variance:.2f} °C²")
print(f"Standard Deviation: {stdev:.2f} °C")
We use temperature data to compute the sample variance and standard deviation (ddof=1 for a sample rather than a population). The variance (0.70 °C²) shows spread in squared units, while the standard deviation (0.84 °C) is more interpretable.
$ ./var_stdev.py
Variance: 0.70 °C²
Standard Deviation: 0.84 °C
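To see why ddof matters, here is a minimal sketch contrasting the population formula (ddof=0, divide by n) with the sample formula (ddof=1, divide by n - 1) on the same temperatures.

import numpy as np

temps = [22.5, 23.0, 21.8, 24.1, 22.9]

# Population variance divides by n; sample variance divides by n - 1
pop_var = np.var(temps)  # ddof=0 is the NumPy default
samp_var = np.var(temps, ddof=1)

print(f"Population variance: {pop_var:.2f} °C²")  # 0.56
print(f"Sample variance: {samp_var:.2f} °C²")  # 0.70

The sample formula divides by a smaller number, so it always yields a slightly larger estimate, correcting the bias introduced by estimating the mean from the same data.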
The normal distribution models continuous data with a bell-shaped curve, defined by mean and standard deviation, common in natural phenomena.
normal_dist.py
#!/usr/bin/python
from scipy import stats
iq_dist = stats.norm(loc=100, scale=15)

prob_above_120 = 1 - iq_dist.cdf(120)
sample = iq_dist.rvs(size=5)

print(f"P(IQ > 120): {prob_above_120:.3f}")
print(f"Random IQs: {sample}")
We model IQ scores with a normal distribution (loc = mean, scale = std). cdf(120) gives the cumulative probability up to 120, so 1 - cdf is the chance of exceeding 120. rvs generates random samples.
This is useful for simulations or probability assessments in psychology or education.
$ ./normal_dist.py
P(IQ > 120): 0.091
Random IQs: [ 95.2 112.7  88.4 104.1  99.6]  # Values may vary
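The distribution object also supports the inverse lookup: ppf (percent point function) turns a cumulative probability back into a score. A short sketch finding the 95th percentile of this IQ model:

from scipy import stats

iq_dist = stats.norm(loc=100, scale=15)

# ppf is the inverse of cdf: the score below which 95% of values fall
p95 = iq_dist.ppf(0.95)

print(f"95th percentile IQ: {p95:.1f}")  # about 124.7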
A t-test compares means to assess if differences are statistically significant, widely used in hypothesis testing.
ttest.py
#!/usr/bin/python
from scipy import stats
group1 = [85, 88, 90, 87, 86]  # Test scores, method A
group2 = [90, 92, 89, 94, 91]  # Test scores, method B

t_stat, p_val = stats.ttest_ind(group1, group2)

print(f"T-Statistic: {t_stat:.2f}")
print(f"P-Value: {p_val:.3f}")
We compare test scores from two teaching methods. ttest_ind performs an independent t-test, returning the t-statistic and p-value. A low p-value (<0.05) suggests a significant difference.
Here, p=0.011 indicates method B likely improves scores, a common analysis in educational research.
$ ./ttest.py
T-Statistic: -3.29
P-Value: 0.011
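The plain independent t-test assumes roughly equal variances in the two groups. When that assumption is doubtful, passing equal_var=False runs Welch's t-test instead, as in this sketch:

from scipy import stats

group1 = [85, 88, 90, 87, 86]  # Test scores, method A
group2 = [90, 92, 89, 94, 91]  # Test scores, method B

# Welch's t-test does not assume equal group variances
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

print(f"Welch T-Statistic: {t_stat:.2f}, P-Value: {p_val:.3f}")

Here the two groups happen to have similar spread, so both variants give nearly identical results.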
The correlation coefficient measures the linear relationship between two variables, ranging from -1 (perfect negative) to 1 (perfect positive).
correlation.py
#!/usr/bin/python
from scipy import stats
hours = [2, 3, 4, 5, 6]  # Study hours
scores = [65, 70, 75, 85, 90]  # Exam scores

r, p = stats.pearsonr(hours, scores)

print(f"Pearson Correlation: {r:.2f}")
print(f"P-Value: {p:.3f}")
We test whether study hours correlate with exam scores. pearsonr computes Pearson's correlation coefficient and a p-value. A high r (0.99) and low p-value (0.001) confirm a strong positive relationship.
$ ./correlation.py
Pearson Correlation: 0.99
P-Value: 0.001
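Pearson's r captures only linear association. For monotonic but nonlinear relationships, or data with outliers, the rank-based Spearman correlation is a common alternative; a quick sketch on the same data:

from scipy import stats

hours = [2, 3, 4, 5, 6]  # Study hours
scores = [65, 70, 75, 85, 90]  # Exam scores

# Spearman correlates the ranks of the values rather than the raw values
rho, p = stats.spearmanr(hours, scores)

print(f"Spearman Correlation: {rho:.2f}, P-Value: {p:.3f}")

Because scores rise strictly with hours here, the rank correlation is a perfect 1.00.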
The kurtosis measures the “tailedness” of a distribution, indicating whether data has heavy or light tails compared to a normal distribution. Positive kurtosis means heavy tails (more outliers), while negative means light tails.
kurtosis.py
#!/usr/bin/python
from scipy import stats

returns = [0.5, -0.3, 1.2, -2.5, 0.8, 3.1, -1.8, 0.2, 4.0, -3.2]

kurt = stats.kurtosis(returns)

print(f"Kurtosis of Returns: {kurt:.2f}")
print(f"Kurtosis of Returns: {kurt:.2f}")
We analyze daily stock returns, a common financial dataset. The returns list simulates percentage changes in stock price over 10 days. stats.kurtosis calculates the excess kurtosis (relative to a normal distribution, where kurtosis = 0).
A kurtosis of -0.85 indicates lighter tails than a normal distribution, suggesting fewer extreme price swings in this small sample; tail behavior in either direction matters for risk assessment in finance.
$ ./kurtosis.py
Kurtosis of Returns: -0.85
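Note that stats.kurtosis reports excess kurtosis by default (Fisher's definition, where a normal distribution scores 0). Passing fisher=False switches to Pearson's definition, which is simply the excess value plus 3, as in this sketch:

from scipy import stats

returns = [0.5, -0.3, 1.2, -2.5, 0.8, 3.1, -1.8, 0.2, 4.0, -3.2]

# Pearson kurtosis = excess kurtosis + 3; a normal distribution scores 3
pearson_kurt = stats.kurtosis(returns, fisher=False)

print(f"Pearson Kurtosis: {pearson_kurt:.2f}")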
The skew measures the asymmetry of a distribution. Positive skew means a longer right tail (e.g., income data), while negative skew indicates a longer left tail (e.g., time to failure).
skew.py
#!/usr/bin/python
from scipy import stats

incomes = [25, 30, 35, 40, 45, 50, 60, 80, 120, 200]

skewness = stats.skew(incomes)

print(f"Skewness of Incomes: {skewness:.2f}")
print(f"Skewness of Incomes: {skewness:.2f}")
The incomes list represents annual earnings, typical of economic data with a few high earners. stats.skew computes the skewness; a positive value (1.62) confirms a right-skewed distribution, common in income studies.
This helps economists understand wealth distribution and identify inequality trends in the population.
$ ./skew.py
Skewness of Incomes: 1.62
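Right-skewed data such as incomes is often log-transformed before modeling. A minimal sketch showing how the transform pulls in the long right tail (the exact value depends on the data):

from scipy import stats
import numpy as np

incomes = [25, 30, 35, 40, 45, 50, 60, 80, 120, 200]

# A log transform compresses large values, reducing right skew
log_skew = stats.skew(np.log(incomes))

print(f"Skewness after log transform: {log_skew:.2f}")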
The find_repeats function identifies repeated values in an array and their counts, useful for spotting patterns or anomalies in discrete data.
find_repeats.py
#!/usr/bin/python
from scipy import stats

purchases = [1, 2, 3, 2, 4, 1, 2, 5, 1, 3]

repeats = stats.find_repeats(purchases)

print(f"Repeated Values: {repeats.values}")
print(f"Counts: {repeats.counts}")
The purchases array tracks how many items each customer bought. stats.find_repeats returns a FindRepeatsResult object with values (repeated numbers) and counts (how often they repeat).
Here, 1 and 2 each appear three times and 3 appears twice, indicating frequent small purchases. This is valuable for retail analysis to optimize inventory or promotions.
$ ./find_repeats.py
Repeated Values: [1. 2. 3.]
Counts: [3 3 2]
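If you need counts for every value, not just the repeated ones, NumPy's unique with return_counts=True covers the same ground, as in this sketch:

import numpy as np

purchases = [1, 2, 3, 2, 4, 1, 2, 5, 1, 3]

# Counts for all distinct values, including those appearing only once
values, counts = np.unique(purchases, return_counts=True)

print(f"Values: {values}")  # [1 2 3 4 5]
print(f"Counts: {counts}")  # [3 3 2 1 1]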
The describe function provides a summary of a dataset, including count, mean, variance, skew, kurtosis, and min/max, offering a quick overview.
describe.py
#!/usr/bin/python
from scipy import stats

wait_times = [10, 15, 12, 20, 25, 18, 30, 22, 35, 40]

summary = stats.describe(wait_times)

print(f"Number of Observations: {summary.nobs}")
print(f"Mean: {summary.mean:.2f} minutes")
print(f"Variance: {summary.variance:.2f} min²")
print(f"Skewness: {summary.skewness:.2f}")
print(f"Kurtosis: {summary.kurtosis:.2f}")
print(f"Min: {summary.minmax[0]}, Max: {summary.minmax[1]}")
The wait_times list simulates patient wait times in a healthcare setting. stats.describe returns a DescribeResult object with key statistics, accessed via attributes like nobs (count) and mean.
The summary shows a mean of 22.7 minutes, slight positive skew (0.44), and negative kurtosis (-0.93), suggesting a flatter distribution than normal. This helps clinics assess service efficiency and patient experience.
$ ./describe.py
Number of Observations: 10
Mean: 22.70 minutes
Variance: 97.12 min²
Skewness: 0.44
Kurtosis: -0.93
Min: 10, Max: 40
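DescribeResult is a named tuple, so, assuming the standard namedtuple interface, you can also loop over every field at once rather than printing them individually:

from scipy import stats

wait_times = [10, 15, 12, 20, 25, 18, 30, 22, 35, 40]

summary = stats.describe(wait_times)

# _asdict is the standard namedtuple method mapping field names to values
for field, value in summary._asdict().items():
    print(f"{field}: {value}")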
Linear regression models the relationship between a dependent variable and one or more independent variables, predicting trends.
regression.py
#!/usr/bin/python
from scipy import stats
ad_spend = [100, 200, 300, 400, 500]  # Ad budget ($)
sales = [1200, 1500, 1800, 2100, 2400]  # Sales ($)

slope, intercept, r, p, se = stats.linregress(ad_spend, sales)

print(f"Slope: {slope:.2f}, Intercept: {intercept:.2f}")
print(f"R²: {r**2:.3f}, P-Value: {p:.3f}")
We model sales against ad spending. linregress returns the slope, intercept, correlation coefficient (r), p-value, and standard error. The slope (3.0) suggests each extra $1 in ads is associated with $3.00 in additional sales.
R² (1.0) indicates a perfect fit, though real data would show more variation. This is useful for marketing analysis.
$ ./regression.py
Slope: 3.00, Intercept: 900.00
R²: 1.000, P-Value: 0.000
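Once fitted, the slope and intercept define a prediction equation, predicted = intercept + slope * x. A sketch predicting sales for a new budget (the $600 figure is illustrative):

from scipy import stats

ad_spend = [100, 200, 300, 400, 500]  # Ad budget ($)
sales = [1200, 1500, 1800, 2100, 2400]  # Sales ($)

result = stats.linregress(ad_spend, sales)

# Predict sales for a $600 budget using the fitted line
new_budget = 600
predicted = result.intercept + result.slope * new_budget

print(f"Predicted sales for ${new_budget}: ${predicted:.2f}")  # $2700.00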
Use NumPy Arrays: Convert lists to arrays for efficiency, as shown in the sketch below.
Check Assumptions: Verify normality before relying on t-test results (see the sketch below).
Interpret P-Values: A small p-value (<0.05) suggests statistical significance, not practical importance.
Document Models: Record the distribution parameters you used.
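A minimal sketch combining the first two tips: convert the list to a NumPy array, then run the Shapiro-Wilk normality test before reaching for a t-test.

from scipy import stats
import numpy as np

scores = np.asarray([85, 88, 90, 87, 86])  # array, not list, for efficiency

# Shapiro-Wilk tests the null hypothesis that the data is normally distributed
stat, p = stats.shapiro(scores)

print(f"Shapiro-Wilk p-value: {p:.3f}")  # a p-value above 0.05 gives no evidence against normality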
This tutorial showcased scipy.stats for advanced statistical analysis, from distributions to regression, with practical examples.
My name is Jan Bodnar, and I am a passionate programmer with extensive programming experience. I have been writing programming articles since 2007. To date, I have authored over 1,400 articles and 8 e-books. I possess more than ten years of experience in teaching programming.