Python Z-Score

We are going to be looking at Python Z-score.

Z-score tells us how far a data poin is from the mean.

To start we’re going to import the necessary libraries.

numpy as np, stats and pandas

				
					import numpy as np
from scipy import stats
import pandas as pd
				
			

Example 1 Numpy

Let’s create a list of numbers called data containing a sequence of integers.

				
					data = [10, 12, 15, 18, 20, 21, 23, 25, 30, 32]
				
			

Next we calculate the average mean of the numbers in the data list using Numpy’s mean() function

				
					mean = np.mean(data)
				
			

Here we compute the standard deviation of the data list using Numpy, measuring how spread out the numbers are from the mean

				
					std_dev = np.std(data)
				
			

Here we calculate the z-scores for each value in data, showing how many standard deviations each value is from the mean

				
					z_scores_manual = (data - mean) / std_dev
				
			
				
					print(z_scores_manual)
				
			

Example 2 scipy much easier

Here we use SciPy’s built-in function to calculate the z-scores for all values in data

				
					z_scores_scipy = stats.zscore(data)
				
			
				
					print(z_scores_scipy)
				
			

Example 3 detect outliers with Pandas

Here we set the random seed in NumPy to 7, ensuring that any random numbers generated afterwards are reproducible.

				
					np.random.seed(7) 
				
			

This line generates 1,000 random numbers from a normal distribution with

loc=0:  mean of 0

scale=1: standard deviation of 1

size=1000: total 1,000 values

				
					data = np.random.normal(loc=0, scale=1, size=1000)
				
			

Next we create a Pandas DataFrame from the data array with a single column named “Values”, allowing for easier data analysis and manipulation.

				
					df = pd.DataFrame(data, columns=['Values'])
				
			
				
					df.head()
				
			

Herre we add a new column “Z-Score” to the DataFrame df, containing the z-score for each value in “Values”, showing how far each value is from the mean in terms of standard deviations.

				
					df['Z-Score'] = (df['Values'] - df['Values'].mean()) / df['Values'].std()
				
			

Here we calculate the percentage of values in the DataFrame df that fall within

1 standard deviation of the mean within_1_std”

2 standard deviations “within_2_std”

3 standard deviations “within_3_std”

				
					# Calculate the percentage of data within each standard deviation range
within_1_std = len(df[(df['Z-Score'] >= -1) & (df['Z-Score'] <= 1)]) / len(df) * 100
within_2_std = len(df[(df['Z-Score'] >= -2) & (df['Z-Score'] <= 2)]) / len(df) * 100
within_3_std = len(df[(df['Z-Score'] >= -3) & (df['Z-Score'] <= 3)]) / len(df) * 100
				
			

This code creates a summary called summary_df that nearly displays the standard deviation and percentage pf data within each range from previous calculations.

				
					# Display the results in a DataFrame
summary_df = pd.DataFrame({
    'Standard Deviations': ['±1σ', '±2σ', '±3σ'],
    'Percentage of Data': [within_1_std, within_2_std, within_3_std]
})
				
			
				
					summary_df
				
			

Here we create a new column called “Outlier” in the df DataFrame, marking True for values whose z-score is greater than 3 or less than -3

				
					df['Outlier'] = (df['Z-Score'] > 3) | (df['Z-Score'] < -3)
				
			

Here we sort th DataFrame df ny thr “Z-score” column in descending order and selects the top 5 rows, which represent the 5 most extreme high values in the dataset

				
					top_5_highest = df.sort_values(by='Z-Score', ascending=False).head(5)
				
			
				
					print(top_5_highest)
				
			

This line sorts df by ‘Z-Score’ in ascending order and selects the top 5 rows, showing the 5 most extreme low values

				
					top_5_lowest = df.sort_values(by='Z-Score', ascending=True).head(5)
				
			
				
					print(top_5_lowest)
				
			

This line removes outliers from the DataFrame df and saves the result in a new DataFrame “df_no_outliers”

				
					df_no_outliers = df[df['Outlier'] == False].copy()
				
			
				
					print(top_5_highest_no_outlier)
				
			

This line extracts the top 5 lowest Z-Score entries excluding the outliers from the DataFrame

				
					top_5_lowest_no_outlier = df_no_outliers.sort_values(by='Z-Score', ascending=True).head(5)
				
			
				
					print(top_5_lowest_no_outlier)
				
			

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.

Leave a Reply

Your email address will not be published. Required fields are marked *