two sample z test scipy

To start we’re going to import a few python packages.

ztest: Impors the function to perform a Z-test for comparing means.

numpy as np: for numberical operations.

from scipy.stats import norm: Imports the normal distribution functions.

				
					from statsmodels.stats.weightstats import ztest
import numpy as np
from scipy.stats import norm
#import math
				
			

Then we set the significance level to 0.05.

				
					alpha = 0.05
				
			

We set the random seed to 10 for reproducibility.

				
					np.random.seed(10)  # For reproducibility
				
			

Example 1

we create a list of numbers called sample_1

				
					sample_1 = [370, 395, 400, 405, 390, 385, 410, 395, 400, 380, 390, 400, 410, 415, 395, 405, 390, 400, 420, 375, 400, 385, 390, 395, 410, 405, 400, 395, 380, 400]
				
			

We also create a list of numbers called sample_2

				
					sample_2 = [360, 375, 385, 390, 370, 380, 395, 390, 385, 375, 380, 395, 400, 405, 385, 395, 375, 385, 395, 370, 380, 395, 390, 385, 375, 380, 395, 400, 385, 395]
				
			

We calculate the mean of sample_1 by dividing the sum of sample_1 by the length of sample_1

				
					mean_sample_1 = sum(sample_1) / len(sample_1)  # Mean of Sample 1

				
			

We also calculate the mean of sample_2 by dividing sum of sample_2 by the length of sample_2

				
					mean_sample_2 = sum(sample_2) / len(sample_2)  # Mean of Sample 2
				
			
				
					print("Sample 1 Mean:", mean_sample_1)  # Expected: 396.3
				
			
				
					print("Sample 2 Mean:", mean_sample_2)  # Expected: 385.5

				
			

we set a standard deviation of 15

				
					std_dev = 15  # Given: standard deviation for both samples

				
			

Next, we calculate the number of observations in each sample.

				
					n1, n2 = len(sample_1), len(sample_2)
				
			

Then we calculate the pooled standard error.

 

 

				
					pooled_se = np.sqrt((std_dev**2 / n1) + (std_dev**2 / n2))
				
			
				
					print("Pooled Standard Error:", round(pooled_se, 2))  # Expected: 3.87
				
			

Here, we calculate the Z-statistic for a two-sample Z-test.

  • mean_sample_1 - mean_sample_2: Difference between the two sample means.

  • pooled_se: Standard error of the difference in means (computed earlier).

				
					z_statistic = (mean_sample_1 - mean_sample_2) / pooled_se
				
			
				
					print("Z-Statistic:", round(z_statistic, 2))  # Expected: 2.79
				
			
  • abs(z_statistic): Takes the absolute value of the Z-score.

  • norm.cdf(...): Calculates the cumulative probability up to that Z-score under the standard normal distribution.

  • 1 - norm.cdf(...): Gets the probability in the tail beyond the Z-score.

  • 2 * (...): Doubles it for the two-tailed test (since the difference could be in either direction).

				
					p_value = 2 * (1 - norm.cdf(abs(z_statistic)))  # Two-tailed test

				
			
				
					print("P-Value:", round(p_value, 4))  # Expected: 0.0052

				
			
				
					if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

				
			
Quicker way to test it – Not Entirely precise
The ztest function in statsmodels.stats.weightstats
does not explicitly allow for directly
passing the population standard deviation. Instead, it estimates the standard error based on the
sample standard deviations unless the sample variance is explicitly pooled
				
					z_stat, p_value = ztest(sample_1, sample_2, alternative='two-sided')
				
			
				
					print(z_stat)
				
			
				
					print(p_value)
				
			

Example 2 marathon times of two running clubs

				
					marathon_std = 30
				
			
				
					sample1 = np.random.normal(loc=272, scale=25, size=50)  # Sample 1: Mean = 270, Std = 30
				
			
				
					sample2 = np.random.normal(loc=255, scale=25, size=50)  # Sample 2: Mean = 260, Std = 30
				
			
				
					# Calculate means and standard deviations for both samples
mean1, size1 = np.mean(sample1), len(sample1)
				
			
				
					mean2, size2 = np.mean(sample2), len(sample2)
				
			
				
					mean2, size2 = np.mean(sample2), len(sample2)
				
			
				
					pooled_se = np.sqrt((marathon_std**2 / size1) + (marathon_std**2 / size2))
				
			
				
					z_score = (mean1 - mean2) / pooled_se
				
			
				
					p_value = 2 * (1 - norm.cdf(abs(z_score)))
				
			
				
					if p_value < alpha:
    print("Reject the null hypothesis: The sample mean is significantly different from the population mean.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the sample mean and population mean.")
				
			

Reject the null hypothesis: The sample mean is significantly different from the population mean.

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.

Leave a Reply

Your email address will not be published. Required fields are marked *