One-Way ANOVA
#Use ANOVA to compare the means of three or more groups.
#Ensure the assumptions of ANOVA are satisfied before applying the test.
#One Way
#Independent Variable: Only one factor is analyzed (e.g., different treatments or conditions).
#Example: testing the effect of different diets (Diet A, Diet B, Diet C) on weight loss. Here, diet is the single factor.
#Null Hypothesis (H₀): All group means are equal.
#Alternative Hypothesis (H₁): At least one group mean is different.
#One Way Assumptions
#Each group should consist of independent observations. This means that the data points within each group and across different groups should not influence one another.
#The data within each group should be approximately normally distributed. This assumption is especially important when the sample sizes are small. For larger sample sizes, ANOVA is fairly robust to violations of normality due to the Central Limit Theorem.
#The variances across the different groups should be approximately equal. This can be checked using tests like Levene’s test.
#If the variances are unequal, a more robust version of ANOVA (like Welch’s ANOVA) may be used.
#The variable being measured (dependent variable) should be continuous, interval- or ratio-level data (e.g., height, weight, score).
#The independent variable (or factor) must consist of two or more categorical, independent groups (e.g., different diets, treatments, or experimental conditions).
#The data should ideally be collected through random sampling to avoid biases that could distort the results.
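#If Levene's test suggests unequal variances, Welch's ANOVA is one alternative. The sketch below does not rely on a dedicated library function; it computes Welch's F statistic by hand from the standard formulas (groups weighted by n / variance, with adjusted denominator degrees of freedom). The function name welch_anova is hypothetical, and the marathon times are the ones used in Example 2 of this document.

```python
import numpy as np
from scipy import stats


def welch_anova(*groups):
    """Welch's one-way ANOVA (does not assume equal variances).

    Hypothetical helper for illustration. Returns (F, df1, df2, p_value).
    """
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])  # sample variances

    # Each group is weighted by n_i / s_i^2, so noisier groups count less
    w = n / variances
    w_sum = w.sum()
    weighted_mean = np.sum(w * means) / w_sum

    # Correction term shared by the F statistic and the denominator df
    lam = np.sum((1 - w / w_sum) ** 2 / (n - 1))

    numerator = np.sum(w * (means - weighted_mean) ** 2) / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    f_stat = numerator / denominator

    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    p_value = stats.f.sf(f_stat, df1, df2)
    return f_stat, df1, df2, p_value


# Marathon times in minutes, reused from Example 2
pegasus = [220, 215, 225, 230, 235, 240]
vaporfly = [210, 205, 215, 200, 195, 220]
speedgoats = [250, 245, 255, 260, 240, 270]

f_welch, df1, df2, p_welch = welch_anova(pegasus, vaporfly, speedgoats)
print(f_welch, df1, df2, p_welch)
```

#With only two groups, this statistic reduces to the square of Welch's t statistic, which gives a quick sanity check against stats.ttest_ind(..., equal_var=False).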
#Two Way
#Used to compare the means of groups that differ based on two independent variables (or factors), and it can also examine interactions between the two factors.
#Two factors are analyzed, and it assesses both their individual effects and how they interact with each other.
#Example: testing the effect of different diets (Diet A, Diet B, Diet C) and exercise levels (Low, Medium, High) on weight loss. Here, both diet and exercise are factors.
#Main effects hypotheses:
#H₀: The means for the different levels of the first factor are equal.
#H₀: The means for the different levels of the second factor are equal.
#Interaction effect hypothesis:
#H₀: There is no interaction between the two factors.
#In short, two-way ANOVA tests the effect of each independent variable (main effects) and the interaction between the two variables (interaction effect).
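#The three two-way hypotheses can be illustrated with a short sketch. This is a minimal version that assumes a balanced design (the same number of observations in every diet/exercise cell) and computes the sums of squares by hand. The function name two_way_anova_balanced and the weight-loss numbers are made up for illustration only; they are not real data.

```python
import numpy as np
from scipy import stats


def two_way_anova_balanced(data):
    """Two-way ANOVA for a balanced design.

    data has shape (a, b, r): a levels of factor A, b levels of factor B,
    r replicates per cell. Returns {effect: (F, p_value)}.
    Hypothetical helper for illustration.
    """
    data = np.asarray(data, dtype=float)
    a, b, r = data.shape
    grand_mean = data.mean()
    mean_a = data.mean(axis=(1, 2))   # one mean per level of factor A
    mean_b = data.mean(axis=(0, 2))   # one mean per level of factor B
    mean_cell = data.mean(axis=2)     # one mean per (A, B) cell

    # Sums of squares for main effects, interaction, and error
    ss_a = b * r * np.sum((mean_a - grand_mean) ** 2)
    ss_b = a * r * np.sum((mean_b - grand_mean) ** 2)
    ss_ab = r * np.sum(
        (mean_cell - mean_a[:, None] - mean_b[None, :] + grand_mean) ** 2
    )
    ss_error = np.sum((data - mean_cell[:, :, None]) ** 2)

    df_a, df_b = a - 1, b - 1
    df_ab = (a - 1) * (b - 1)
    df_error = a * b * (r - 1)
    ms_error = ss_error / df_error

    results = {}
    for name, ss, df in [
        ("factor_a", ss_a, df_a),
        ("factor_b", ss_b, df_b),
        ("interaction", ss_ab, df_ab),
    ]:
        f_stat = (ss / df) / ms_error
        results[name] = (f_stat, stats.f.sf(f_stat, df, df_error))
    return results


# Made-up weight loss values (kg): rows = Diet A/B/C,
# columns = Low/Medium/High exercise, 2 people per cell
weight_loss = np.array([
    [[2.0, 2.4], [3.1, 2.9], [4.0, 4.4]],
    [[1.5, 1.9], [2.2, 2.6], [3.0, 3.2]],
    [[3.0, 3.4], [4.1, 4.5], [6.0, 6.6]],
])

for effect, (f_stat, p_value) in two_way_anova_balanced(weight_loss).items():
    print(effect, f_stat, p_value)
```

#Each p-value is compared with alpha in the same way as in one-way ANOVA: one decision for diet, one for exercise, and one for the diet-by-exercise interaction.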
import numpy as np
import scipy.stats as stats
alpha = 0.05
#Example 1 – Manual
# Baseball team batting averages
team_A = [0.285, 0.270, 0.290, 0.300, 0.275, 0.295, 0.280, 0.265, 0.285]
team_B = [0.260, 0.250, 0.245, 0.255, 0.270, 0.265, 0.250, 0.240, 0.255]
team_C = [0.305, 0.295, 0.310, 0.320, 0.300, 0.315, 0.290, 0.305, 0.315]
# Step 1: Shapiro-Wilk test for normality (one test per group)
_, p_value_shapiro_A = stats.shapiro(team_A)
_, p_value_shapiro_B = stats.shapiro(team_B)
_, p_value_shapiro_C = stats.shapiro(team_C)
print(p_value_shapiro_A)

print(p_value_shapiro_B)

print(p_value_shapiro_C)

# Step 2: Levene's test for equal variances
_, p_value_levene_test = stats.levene(team_A, team_B, team_C)
print(p_value_levene_test)

# Step 3: Calculate group means and overall mean
mean_A = np.mean(team_A)
mean_B = np.mean(team_B)
mean_C = np.mean(team_C)
overall_mean = np.mean(team_A + team_B + team_C)  # list concatenation pools all samples
print(mean_A)

print(mean_B)

print(mean_C)

print(overall_mean)

# Step 4: Calculate SSB (between-group sum of squares)
SSB = (
    len(team_A) * (mean_A - overall_mean)**2
    + len(team_B) * (mean_B - overall_mean)**2
    + len(team_C) * (mean_C - overall_mean)**2
)
print(SSB)

# Step 5: Calculate SSW (within-group sum of squares)
SSW_A = sum((x - mean_A)**2 for x in team_A)
SSW_B = sum((x - mean_B)**2 for x in team_B)
SSW_C = sum((x - mean_C)**2 for x in team_C)
SSW = SSW_A + SSW_B + SSW_C
print(SSW_A)

print(SSW_B)

print(SSW)

# Step 6: Degrees of freedom
df_between = 3 - 1  # number of groups - 1
df_within = len(team_A + team_B + team_C) - 3  # total samples - number of groups
# Step 7: Calculate MSB and MSW (mean squares)
MSB = SSB / df_between
MSW = SSW / df_within
print(MSB)

print(MSW)

# Step 8: Calculate F-statistic
F_statistic_manual = MSB / MSW
print(F_statistic_manual)

# Step 9: Calculate the p-value manually using the F-distribution's survival function
p_value_manual = stats.f.sf(F_statistic_manual, df_between, df_within)
print(p_value_manual)
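#As a sanity check, the manual steps above can be compared against SciPy's built-in stats.f_oneway, which should give the same F statistic and p-value for the same data. The sketch below repeats the manual calculation in compact form so it runs on its own; the variable names ending in _check are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

# Same batting averages as in Example 1
team_A = [0.285, 0.270, 0.290, 0.300, 0.275, 0.295, 0.280, 0.265, 0.285]
team_B = [0.260, 0.250, 0.245, 0.255, 0.270, 0.265, 0.250, 0.240, 0.255]
team_C = [0.305, 0.295, 0.310, 0.320, 0.300, 0.315, 0.290, 0.305, 0.315]

groups = [team_A, team_B, team_C]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares
ssb_check = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
ssw_check = sum(np.sum((np.asarray(g) - np.mean(g)) ** 2) for g in groups)

df_between_check = len(groups) - 1
df_within_check = len(all_values) - len(groups)

F_check = (ssb_check / df_between_check) / (ssw_check / df_within_check)
p_check = stats.f.sf(F_check, df_between_check, df_within_check)

# Built-in one-way ANOVA on the same groups
F_scipy, p_scipy = stats.f_oneway(*groups)

print("F matches:", np.isclose(F_check, F_scipy))
print("p matches:", np.isclose(p_check, p_scipy, rtol=1e-6, atol=0))
```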

#Example 2
# Sample data: 3 groups with different means
pegasus = [220, 215, 225, 230, 235, 240] # Marathon times in minutes for Pegasus shoes
vaporfly = [210, 205, 215, 200, 195, 220] # Marathon times in minutes for Vaporfly shoes
speedgoats = [250, 245, 255, 260, 240, 270] # Marathon times in minutes for Speedgoats shoes
#The data within each group should be approximately normally distributed. This assumption is especially important when the sample sizes are small.
#For larger sample sizes, ANOVA is fairly robust to violations of normality due to the Central Limit Theorem.
stat, shapiro_pvalue_pegasus = stats.shapiro(pegasus)
print(shapiro_pvalue_pegasus)

if shapiro_pvalue_pegasus > alpha:
    print("The data is likely normally distributed (fail to reject H0).")
else:
    print("The data is NOT normally distributed (reject H0).")

stat, shapiro_pvalue_vaporfly = stats.shapiro(vaporfly)
print(shapiro_pvalue_vaporfly)

if shapiro_pvalue_vaporfly > alpha:
    print("The data is likely normally distributed (fail to reject H0).")
else:
    print("The data is NOT normally distributed (reject H0).")

stat, shapiro_pvalue_speedgoats = stats.shapiro(speedgoats)
print(shapiro_pvalue_speedgoats)

if shapiro_pvalue_speedgoats > alpha:
    print("The data is likely normally distributed (fail to reject H0).")
else:
    print("The data is NOT normally distributed (reject H0).")

#Since all three Shapiro-Wilk p-values are above 0.05, we fail to reject normality for each group. Next, check the equal-variance assumption with Levene's test.
# Perform Levene's test for homogeneity of variances
stat, pvalue_levene = stats.levene(pegasus, vaporfly, speedgoats)
print(pvalue_levene)

if pvalue_levene < alpha:
    print("Reject the null hypothesis, different variance")
else:
    print("Fail to reject the null hypothesis, same variance")

#Assumptions done, now do the One-Way ANOVA Test
# Perform one-way ANOVA
F_statistic, pvalue_anova = stats.f_oneway(pegasus, vaporfly, speedgoats)
print(pvalue_anova)

if pvalue_anova < alpha:
    print("Reject the null hypothesis, different means")
else:
    print("Fail to reject the null hypothesis, same mean")
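#Example 2 repeats the same Shapiro-Wilk check for every group, so the whole workflow (normality per group, Levene's test, then one-way ANOVA) can be wrapped in one function. This is only a sketch: the function name one_way_anova_report and the keys of the returned dictionary are hypothetical choices, not part of SciPy.

```python
from scipy import stats


def one_way_anova_report(groups, alpha=0.05):
    """Run the assumption checks and one-way ANOVA for named groups.

    groups: dict mapping a group name to its list of observations.
    Hypothetical helper for illustration.
    """
    # Shapiro-Wilk normality test, one p-value per group
    shapiro_pvalues = {}
    for name, values in groups.items():
        _, p_value = stats.shapiro(values)
        shapiro_pvalues[name] = p_value

    # Levene's test for equal variances across all groups
    _, levene_pvalue = stats.levene(*groups.values())

    # One-way ANOVA across all groups
    f_statistic, anova_pvalue = stats.f_oneway(*groups.values())

    return {
        "shapiro_pvalues": shapiro_pvalues,
        "normality_ok": all(p > alpha for p in shapiro_pvalues.values()),
        "levene_pvalue": levene_pvalue,
        "equal_variance_ok": bool(levene_pvalue > alpha),
        "F_statistic": f_statistic,
        "anova_pvalue": anova_pvalue,
        "reject_h0": bool(anova_pvalue < alpha),
    }


# Marathon times in minutes, same values as Example 2
shoes = {
    "pegasus": [220, 215, 225, 230, 235, 240],
    "vaporfly": [210, 205, 215, 200, 195, 220],
    "speedgoats": [250, 245, 255, 260, 240, 270],
}

report = one_way_anova_report(shoes)
for key, value in report.items():
    print(key, value)
```

#The ANOVA result in the report should only be trusted when normality_ok and equal_variance_ok are both True; otherwise a more robust alternative such as Welch's ANOVA is worth considering.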

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.