SciPy Chi-Squared Test of Independence

#The Chi-Squared test determines whether there’s a significant association between categorical variables.
#It compares the observed frequencies (counts) to the expected frequencies, calculated under the null hypothesis that the variables are independent or that the observed distribution fits a given distribution.

#Chi-Squared Test for Independence
#This test checks if there’s an association between two categorical variables by comparing observed frequencies in a contingency table.
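
#For reference, the comparison described above can be written as a formula. This is a sketch using symbol names chosen here for illustration: O_ij and E_ij are the observed and expected counts in row i and column j, R_i and C_j are the row and column totals, N is the grand total, and r and c are the numbers of rows and columns.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\mathrm{df} = (r - 1)(c - 1)
```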

#Imagine we have a survey with data on the preferred coffee types by gender.
#We want to determine if there’s a relationship between gender and coffee preference.

#Step 1 – Create the Contingency Table

#         Espresso  Latte  Black
#Male        30       20     50
#Female      35       25     30

#observed = np.array([[30, 20, 50], [35, 25, 30]])


#Step 2 – Run the Chi-Squared Test for Independence
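
#A minimal sketch of this step, assuming the coffee table from Step 1 and SciPy's stats.chi2_contingency. The variable names and print formatting are illustrative choices.

```python
import numpy as np
from scipy import stats

# Observed counts from the contingency table in Step 1
# rows: Male, Female; columns: Espresso, Latte, Black
observed = np.array([[30, 20, 50],
                     [35, 25, 30]])

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the expected frequencies under the independence hypothesis
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-Squared: {chi2:.4f}")
print(f"p-value: {p:.4f}")
print("Degrees of Freedom:", dof)
print("Expected Frequencies:")
print(np.round(expected, 2))
```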

#Output
#Chi-Squared: 5.4289
#p-value: 0.0662
#Degrees of Freedom: 2
#Expected Frequencies:
#[[34.21 23.68 42.11]
# [30.79 21.32 37.89]]

#Since the p-value (0.0662) is greater than 0.05, we fail to reject the null hypothesis:
#the data do not show a statistically significant association between gender and coffee preference.
				
import numpy as np
import scipy.stats as stats

# Significance level used in the examples below
alpha = 0.05

Example 1 Manual Calculation #morning drink choice

				
# Observed counts (rows: gender, columns: morning drink choice)
observed = np.array([
    [40, 30, 10],  # Male
    [80, 20, 20]   # Female
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()
print(grand_total)

# Step 1: Calculate expected values
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)

# Step 2: Calculate the Chi-Squared statistic
chi_squared = ((observed - expected) ** 2 / expected).sum()
print(chi_squared)

# Step 3: Calculate the p-value
# Degrees of freedom = (rows - 1) * (columns - 1)
degrees_of_freedom = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(degrees_of_freedom)

# Survival function: equivalent to 1 - cdf, but more accurate for tiny p-values
p_value = stats.chi2.sf(chi_squared, degrees_of_freedom)
print(p_value)

if p_value > alpha:
    print("Fail to reject H0: no significant evidence of an association between the two variables.")
else:
    print("Reject H0: the two variables are dependent (there is an association between them).")

Example 2 Shortcut

				
chi2, p, dof, expected = stats.chi2_contingency(observed)

print("Chi-Squared:", chi2)
print("p-value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:\n", expected)
			
#p-value: If this value is less than our significance level (e.g., 0.05), we reject the null hypothesis
#and conclude an association exists

#expected: These are the expected frequencies under the null hypothesis
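
#As a sanity check, the manual calculation and the shortcut should agree. The sketch below repeats both on the Example 1 table. Names such as chi2_manual and expected_manual are illustrative, not part of SciPy.

```python
import numpy as np
from scipy import stats

# Same morning-drink table as Example 1 (rows: Male, Female)
observed = np.array([[40, 30, 10],
                     [80, 20, 20]])

# Manual route: expected counts, statistic, degrees of freedom, p-value
expected_manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

# Shortcut route
chi2, p, dof, expected = stats.chi2_contingency(observed)

# Each comparison should print True
print(np.isclose(chi2_manual, chi2))
print(np.isclose(p_manual, p))
print(dof_manual == dof)
print(np.allclose(expected_manual, expected))
```

#Note that for 2x2 tables (one degree of freedom) chi2_contingency applies Yates' continuity correction by default, so the two routes would only match there if correction=False is passed.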

Example X 3 different Groups

#Suppose we want to investigate whether race distance preference is associated with age group.
#We surveyed 300 people and categorized them by their age group and preferred race distance.

#          5k  Marathon  Ultra marathon
#Under 30  55     30          25
#30-50     35     30          25
#Over 50   30     40          30
				
observed = np.array([[55, 30, 25],   # Under 30
                     [35, 30, 25],   # 30-50
                     [30, 40, 30]])  # Over 50

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print("Chi-Squared Statistic:", chi2)
# Print p_value from this test, not the leftover p from Example 2
print("p-value:", p_value)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:\n", expected)

if p_value > alpha:
    print("Fail to reject H0: no significant evidence of an association between the two variables.")
else:
    print("Reject H0: the two variables are dependent (there is an association between them).")

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
