Python Cumulative Distribution Function

In this article, we’ll explore how to use the Cumulative Distribution Function (CDF) in Python through several practical examples.

First, we’ll calculate the CDF manually to build intuition for the concept. Then, we’ll use NumPy and SciPy to compute it more efficiently. Lastly, we’ll use Matplotlib and Seaborn to plot the CDF.

What Is the Cumulative Distribution Function (CDF)?

The Cumulative Distribution Function (CDF) describes the probability that a random variable X will take a value less than or equal to a specific value x.

Formally, for a continuous random variable X:

F(x) = P(X ≤ x)

This means that F(x) gives the area under the probability density curve up to x.
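To make the "area under the curve" reading concrete, here is a minimal sketch that integrates the standard normal density up to a point and compares the result with the CDF at that point. The choice of the standard normal and of x = 1.0 is an assumption made only for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x = 1.0  # illustrative point, not taken from the article's dataset

# Numerically integrate the standard normal density from -infinity up to x
area, abs_error = quad(norm.pdf, -np.inf, x)

print(area)         # area under the density curve up to x
print(norm.cdf(x))  # CDF evaluated directly at x
```

The two printed numbers should agree to within numerical integration error, which is what F(x) = P(X ≤ x) says for a continuous variable.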

Example 1 Manual Calculation

Let’s start with a simple dataset and calculate the CDF manually.

First, we import the necessary libraries.

				
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
				
			

Example dataset

				
					data = [2, 3, 3, 5, 7]
				
			

Next, we sort the data.

				
					sorted_data = np.sort(data)
				
			

Then we find the total number of data points.

				
					data_len = len(sorted_data)
				
			

Then we manually compute the CDF

				
					cdf_values = []
				
			
				
for i in range(data_len):
    # sorted_data <= sorted_data[i] is a Boolean array: True where an element
    # is less than or equal to sorted_data[i], and False otherwise.
    # Summing it counts those elements; dividing by data_len gives the CDF.
    cdf_value = np.sum(sorted_data <= sorted_data[i]) / data_len
    cdf_values.append(cdf_value)
				
			
				
					print(cdf_values)
				
			

Interpretation:

  • P(X≤2)=0.2

  • P(X≤3)=0.6

  • P(X≤5)=0.8

  • P(X≤7)=1.0

This manual approach is useful for learning, but in practice, we’ll use built-in functions for efficiency.
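As one possible loop-free version of the same calculation, the sketch below uses np.unique and np.cumsum. This is an alternative technique rather than the method used above; the variable names values, counts, and ecdf are illustrative.

```python
import numpy as np

data = [2, 3, 3, 5, 7]

# Unique sorted values and how many times each one occurs
values, counts = np.unique(data, return_counts=True)

# Running total of the counts divided by the sample size gives P(X <= value)
ecdf = np.cumsum(counts) / len(data)

for value, prob in zip(values, ecdf):
    print(f"P(X <= {value}) = {prob:.1f}")
```

Unlike the loop, this returns one probability per distinct value, so the repeated 3 appears only once.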

Example 2 CDF at a single point

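A Cumulative Distribution Function (CDF) can be evaluated with either a raw value from the distribution (together with its mean and standard deviation) or a Z-score (against the standard normal distribution), depending on the context.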
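As a small illustration of the value versus Z-score distinction, the sketch below evaluates norm.cdf() both ways. The mean of 50, standard deviation of 10, and value of 40 are hypothetical numbers chosen only for this example.

```python
from scipy.stats import norm

mu, sigma = 50.0, 10.0  # hypothetical distribution parameters
x = 40.0                # hypothetical raw value from that distribution

# Option 1: pass the raw value along with the mean and standard deviation
p_from_value = norm.cdf(x, loc=mu, scale=sigma)

# Option 2: convert to a Z-score and use the standard normal (loc=0, scale=1)
z = (x - mu) / sigma
p_from_z = norm.cdf(z)

print(p_from_value, p_from_z)
```

Both calls should print the same probability, since standardizing the value and standardizing the distribution are equivalent.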
 

Now let’s use the SciPy norm.cdf() function, which simplifies the process significantly.

 
 

We create a random, normally distributed dataset:

				
					np.random.seed(12)
				
			
				
					mean = 0
std_dev = 1
size = 1000
				
			
				
					data = np.random.normal(loc=mean, scale=std_dev, size=size)
				
			

Then we calculate the CDF at x = -1

				
					cdf_neg_one = norm.cdf(-1, loc=data.mean(), scale=data.std())
				
			
				
					print(cdf_neg_one)
				
			

This means that, under a normal distribution with the sample’s mean and standard deviation, about 16.7% of values are expected to be less than or equal to -1.

				
					cdf_one = norm.cdf(1, loc=data.mean(), scale=data.std())
				
			
				
					print(cdf_one)
				
			

This means that, under the same fitted normal distribution, about 83% of values are expected to be less than or equal to 1.
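Because norm.cdf() evaluates a normal model rather than counting data points, it can be useful to compare its result with the observed proportion in the sample. The sketch below regenerates the data with the same seed and parameters; the names model_prob and empirical_prob are illustrative.

```python
import numpy as np
from scipy.stats import norm

np.random.seed(12)
data = np.random.normal(loc=0, scale=1, size=1000)

x = -1

# Probability under a normal distribution with the sample mean and std
model_prob = norm.cdf(x, loc=data.mean(), scale=data.std())

# Fraction of the actual data points that are less than or equal to x
empirical_prob = np.mean(data <= x)

print(model_prob, empirical_prob)
```

For data that really is close to normal, the two numbers should be similar but not identical.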

Example 3 CDF Range

To find the probability that a value lies between two points (e.g., between -2 and 2), subtract the CDF at the lower bound from the CDF at the upper bound:

				
					Upper_CDF = norm.cdf(2, loc=data.mean(), scale=data.std())
Lower_CDF = norm.cdf(-2, loc=data.mean(), scale=data.std())
				
			
				
					cdf_range = Upper_CDF - Lower_CDF
				
			
				
					print(cdf_range)
				
			

So, approximately 95% of the data lies between -2 and 2, consistent with the empirical rule for normal distributions (about 95% of values fall within two standard deviations of the mean).
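The range calculation can be wrapped in a small helper. The function prob_between below is a hypothetical name introduced for this sketch, and it defaults to the standard normal distribution rather than the sample statistics used above.

```python
from scipy.stats import norm

def prob_between(lower, upper, mean=0.0, std=1.0):
    """Return P(lower < X <= upper) for a normal distribution."""
    upper_cdf = norm.cdf(upper, loc=mean, scale=std)
    lower_cdf = norm.cdf(lower, loc=mean, scale=std)
    return upper_cdf - lower_cdf

# Probability within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    print(k, prob_between(-k, k))
```

For the standard normal, these three values correspond to the familiar 68-95-99.7 rule.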

Example 4 CDF Right Side, Probability Greater Than a Value

If you want the probability that X is greater than a value, for example P(X > 2), subtract the CDF at that value from 1:

				
					value_greater_2 = 1 - norm.cdf(2, loc=data.mean(), scale=data.std())
				
			
				
					print(value_greater_2)
				
			

So, about 2.5% of the data is greater than 2.
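SciPy also exposes the right-tail probability directly through the survival function, norm.sf(). The sketch below compares it with 1 minus the CDF, using the standard normal as an assumption for simplicity.

```python
from scipy.stats import norm

x = 2

# Right-tail probability computed as the complement of the CDF
tail_from_cdf = 1 - norm.cdf(x)

# The same quantity from the survival function, sf(x) = 1 - cdf(x)
tail_from_sf = norm.sf(x)

print(tail_from_cdf, tail_from_sf)
```

The two results should match here; far out in the tail, norm.sf() is usually the more numerically accurate choice because it avoids subtracting a number very close to 1.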

Example 5 Plotting the CDF with Seaborn

Finally, let’s visualize the CDF using Seaborn and Matplotlib.

				
					sns.ecdfplot(data, label='CDF')
plt.title('CDF of Normally Distributed Data')
plt.xlabel('Data Values')
plt.ylabel('Cumulative Probability')
plt.legend()
plt.grid(True)
plt.show()
				
			

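This produces an S-shaped curve, typical for normal distributions. Because sns.ecdfplot() draws the empirical CDF, the curve is a step function that looks nearly smooth with 1,000 data points.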
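To see how closely the empirical curve follows a normal model, one option is to overlay the fitted normal CDF on a hand-built empirical CDF. The sketch below uses Matplotlib only and regenerates the data with the same seed; it is an illustrative alternative, not what sns.ecdfplot() does internally.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

np.random.seed(12)
data = np.random.normal(loc=0, scale=1, size=1000)

# Empirical CDF: the i-th smallest value has cumulative probability i / n
x_sorted = np.sort(data)
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

# CDF of a normal distribution with the sample mean and standard deviation
fitted_cdf = norm.cdf(x_sorted, loc=data.mean(), scale=data.std())

fig, ax = plt.subplots()
ax.step(x_sorted, ecdf, where="post", label="Empirical CDF")
ax.plot(x_sorted, fitted_cdf, linestyle="--", label="Fitted normal CDF")
ax.set_xlabel("Data Values")
ax.set_ylabel("Cumulative Probability")
ax.legend()
ax.grid(True)
# plt.show()  # uncomment to display the figure in an interactive session
```

If the data is approximately normal, the dashed line should sit almost on top of the step curve.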

 

You can easily read off probabilities:

  • At x=-1, the CDF ≈ 0.17

  • At x=0, the CDF ≈ 0.5

  • At x=1, the CDF ≈ 0.83

 

In this tutorial, we learned:

  • What the Cumulative Distribution Function (CDF) represents.

  • How to compute it manually to understand the underlying math.

  • How to efficiently compute CDFs using NumPy and SciPy.

  • How to visualize the CDF using Seaborn and Matplotlib.

The CDF is a powerful tool in data science, statistics, and machine learning, helping you understand the cumulative probability and distribution of your data.

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
