Pandas Sample

This article looks at the pandas sample() method.

The sample() method returns a specified number of randomly selected rows.

If no number is specified, it returns a single row.
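As a quick illustration of that default, here is a minimal sketch on a small toy frame. The name toy and its values are made up for this example and are not part of the dataset built later.

```python
import pandas as pd

# Hypothetical toy frame, used only for this illustration
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

one_row = toy.sample()        # n defaults to 1 when neither n nor frac is given
three_rows = toy.sample(n=3)  # three distinct rows, because replace=False by default

print(len(one_row), len(three_rows))
```

Without a random_state, the rows picked will differ from run to run, but the row counts will not.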

Import the Libraries

To start, we import the following libraries:

pandas as pd

random

string

numpy as np

				
					#DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)

				
					import pandas as pd
import random
import string
import numpy as np
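The signature above lists both n and frac. They are alternatives, and pandas raises a ValueError if both are passed. A small sketch, using a made-up toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})  # hypothetical toy frame

try:
    toy.sample(n=2, frac=0.5)  # asking for a count and a fraction at once
    both_allowed = True
except ValueError as err:
    both_allowed = False
    print(f"ValueError: {err}")
```

In practice you choose one: n for an exact row count, frac for a share of the rows.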

Prep The Dataframe

Next, we create a function that generates n random 5-character alphanumeric strings.

It draws from uppercase letters and digits, using a dedicated random generator with a fixed seed (42) for reproducibility.

				
					# Function to generate random 5-character alphanumeric strings
def generate_all_merchant_ids(n):
    random_gen = random.Random(42)  # Independent random generator
    return [''.join(random_gen.choices(string.ascii_uppercase + string.digits, k=5)) for _ in range(n)]
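Because the generator is re-created with seed 42 on every call, the function should return the same IDs each time. The sketch below repeats the function so it runs on its own; the variable names first_call and second_call are only for illustration.

```python
import random
import string

def generate_all_merchant_ids(n):
    random_gen = random.Random(42)  # fresh generator with the same seed on every call
    alphabet = string.ascii_uppercase + string.digits
    return ["".join(random_gen.choices(alphabet, k=5)) for _ in range(n)]

first_call = generate_all_merchant_ids(3)
second_call = generate_all_merchant_ids(3)

print(first_call == second_call)  # True
```

A side effect of this design is that a longer list always starts with the same IDs as a shorter one.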

Here we calculate how many of the 500 records are fraudulent and how many are not.

20% (100 records) are fraud.

80% (400 records) are not.

				
					# Total number of records
n_records = 500
n_fraud = int(n_records * 0.2)
n_non_fraud = n_records - n_fraud

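Next, we generate 500 random 5-character alphanumeric merchant IDs using the generate_all_merchant_ids() function. The IDs are not guaranteed to be unique, although with 36^5 (about 60 million) possible values a collision among 500 draws is unlikely.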

				
					merchant_ids = generate_all_merchant_ids(n_records)
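If uniqueness matters, it can be checked rather than assumed. The sketch below uses a short, made-up list with one deliberate repeat to show the idea; the same two calls could be pointed at merchant_ids.

```python
import pandas as pd

# Hypothetical IDs with one deliberate repeat, for illustration only
sample_ids = ["A1B2C", "ZZ9X0", "A1B2C", "Q7W8E"]
ids = pd.Series(sample_ids)

has_repeats = not ids.is_unique
repeated = ids[ids.duplicated()].tolist()  # second and later occurrences

print(has_repeats)  # True
print(repeated)     # ['A1B2C']
```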

This code creates a list of fraud labels:

1 for fraud

0 for non-fraud

Then we shuffle the labels randomly using a separate random generator (seed 11).

				
					# Generate fraud and non-fraud labels
label_rng = random.Random(11)  # Independent RNG
fraud_labels = [1] * n_fraud + [0] * n_non_fraud
label_rng.shuffle(fraud_labels)

Here we create a pandas DataFrame with two columns, “merchant_id” and “is_fraud”.

The merchant_id column stores the merchant_ids,

while the is_fraud column stores the fraud_labels.

				
					# Create the DataFrame
df = pd.DataFrame({
    'merchant_id': merchant_ids,
    'is_fraud': fraud_labels
})
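Before sampling, it can help to confirm the frame has the expected shape and fraud rate. This sketch rebuilds a stand-in frame so it runs on its own; the placeholder IDs (M0000, M0001, ...) and the name check_df are assumptions, not the article's data.

```python
import random
import pandas as pd

labels = [1] * 100 + [0] * 400
random.Random(11).shuffle(labels)  # shuffles in place, counts are unchanged

check_df = pd.DataFrame({
    "merchant_id": [f"M{i:04d}" for i in range(500)],  # placeholder IDs
    "is_fraud": labels,
})

print(check_df.shape)               # (500, 2)
print(check_df["is_fraud"].mean())  # 0.2
```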

Example 1 Sample 5 Random Rows With Reproducible Results Using random_state

This line of code randomly selects 5 rows from the DataFrame df, using a fixed random seed (random_state=11) so that the same sample is returned every time it’s run.

				
					df.sample(n=5, random_state=11)
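One way to see what random_state buys you is to sample twice with the same seed and compare. A minimal sketch on a made-up 100-row frame:

```python
import pandas as pd

toy = pd.DataFrame({"value": range(100)})  # hypothetical toy frame

first = toy.sample(n=5, random_state=11)
second = toy.sample(n=5, random_state=11)

print(first.equals(second))  # True
```

Dropping random_state from either call would make the two samples differ on most runs.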

Example 2 Random Sample With a Percentage

This line randomly selects 10% of the rows from the DataFrame df using random_state=11 to make the sampling reproducible.

				
					df.sample(frac=0.1, random_state=11)
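With 500 rows, frac=0.1 should give 50 rows. The sketch below checks that on a stand-in frame of the same length, and also shows frac=1, a common way to shuffle every row. The name toy is an assumption for this example.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(500)})  # hypothetical stand-in with 500 rows

tenth = toy.sample(frac=0.1, random_state=11)  # 10% of the rows
shuffled = toy.sample(frac=1, random_state=11)  # all rows, in random order

print(len(tenth))     # 50
print(len(shuffled))  # 500
```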

Example 3 Ignore Index

This line randomly selects 4 rows from the DataFrame df, resets their index to 0 through 3 using ignore_index=True, and uses random_state=11 to ensure reproducibility.

				
					df.sample(n=4, ignore_index=True, random_state=11)
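To make the effect of ignore_index visible, the sketch below draws the same rows with and without it. It assumes pandas 1.3 or later, where the ignore_index parameter was added, and uses a made-up toy frame.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(100)})  # hypothetical toy frame

kept = toy.sample(n=4, random_state=11)                      # original labels kept
reset = toy.sample(n=4, random_state=11, ignore_index=True)  # relabelled from 0

print(kept.index.tolist())   # original row labels
print(reset.index.tolist())  # [0, 1, 2, 3]
```

The rows themselves are the same in both results; only the index labels differ.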

Example 4 Sample With Replace

The replace parameter allows or disallows sampling the same row more than once. If it is set to True, the same row can appear twice in the sample.

This line randomly samples 10% of df’s rows with replacement, so duplicates are possible.

It uses random_state=11 for reproducibility,

then it sorts the sampled rows by their original index.

				
					df.sample(frac=0.1, replace=True, random_state=11).sort_index()
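A related consequence of replace=True is that the sample can be larger than the frame itself. The sketch below, on a made-up 4-row frame, draws 10 rows with replacement and shows that the same request without replacement raises a ValueError.

```python
import pandas as pd

toy = pd.DataFrame({"value": [1, 2, 3, 4]})  # hypothetical 4-row frame

oversampled = toy.sample(n=10, replace=True, random_state=11)
print(len(oversampled))                      # 10
print(oversampled.index.duplicated().any())  # True, 10 draws from 4 rows must repeat

try:
    toy.sample(n=10)  # replace=False by default
    raised = False
except ValueError:
    raised = True
```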

Example 5 Assign Weights

Here we give more weight to fraud rows and less to non-fraud rows. Remember that fraud is only 20% of the dataset. With these weights, each fraud row is 5 times as likely to be drawn as a non-fraud row.

Here we create a new column called “sampling_weight” in the DataFrame df, assigning

a weight of 5 to rows where is_fraud == 1,

and a weight of 1 to non-fraud rows.

				
					df['sampling_weight'] = df['is_fraud'].apply(lambda x: 5 if x == 1 else 1)
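The apply call works row by row. As an alternative sketch, not the article's method, np.where builds the same weights in one vectorized step. The frame toy below is made up for the comparison.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"is_fraud": [1, 0, 0, 1, 0]})  # hypothetical toy frame

via_apply = toy["is_fraud"].apply(lambda x: 5 if x == 1 else 1)
via_where = np.where(toy["is_fraud"] == 1, 5, 1)  # 5 where fraud, otherwise 1

print(via_apply.tolist())  # [5, 1, 1, 5, 1]
print(via_where.tolist())  # [5, 1, 1, 5, 1]
```

On 500 rows the difference in speed is negligible, so this is mainly a matter of style.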

Next, we randomly select 100 rows from df using weighted sampling based on the sampling_weight column. pandas normalizes the weights so they sum to 1. We set random_state=11 to ensure reproducibility.

				
					sampled_df = df.sample(n=100, weights='sampling_weight', random_state=11)
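To get a feel for what these weights imply, the sketch below normalizes them by hand and adds up the probability that the first row drawn is a fraud row. It uses a stand-in Series with 100 fraud and 400 non-fraud labels. Because sampling is without replacement, the probabilities shift after each draw, so this is only a rough guide to the final mix.

```python
import pandas as pd

is_fraud = pd.Series([1] * 100 + [0] * 400)  # stand-in labels, 20% fraud
weights = is_fraud.map({1: 5, 0: 1})

probs = weights / weights.sum()  # normalized so the weights sum to 1
fraud_share = float(probs[is_fraud == 1].sum())

print(round(fraud_share, 3))  # 0.556, i.e. 500 / 900
```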

This line counts how many fraud and non-fraud rows are in sampled_df, showing the distribution of the is_fraud labels in the sampled data.

				
					sampled_df['is_fraud'].value_counts()

				
For comparison, this line randomly selects 100 rows from df without any weights, using random_state=11 for reproducibility.

				
					sampled_df_no_weights = df.sample(n=100, random_state=11)

Here, we count how many fraud and non-fraud rows are in the unweighted sample, sampled_df_no_weights, showing the distribution of is_fraud in that sample.

				
					sampled_df_no_weights['is_fraud'].value_counts()
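The two counts are easier to compare side by side. The sketch below rebuilds a stand-in frame with the same 20% fraud rate so it runs on its own; the names demo and comparison are assumptions, and the exact counts depend on the data and seed.

```python
import pandas as pd

demo = pd.DataFrame({"is_fraud": [1] * 100 + [0] * 400})  # stand-in frame
demo["sampling_weight"] = demo["is_fraud"].map({1: 5, 0: 1})

weighted = demo.sample(n=100, weights="sampling_weight", random_state=11)
unweighted = demo.sample(n=100, random_state=11)

# One column of label counts per sampling strategy
comparison = pd.DataFrame({
    "weighted": weighted["is_fraud"].value_counts(),
    "unweighted": unweighted["is_fraud"].value_counts(),
})
print(comparison)
```

The weighted column should show a much larger share of fraud rows than the unweighted one.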

Example 6 Sample Columns

Here we create a dictionary, data, with three band names as keys, each mapped to a list of sales numbers, some of which are missing and recorded as np.nan.

				
					data = {
    'The Who': [np.nan, 25000, np.nan, 42000],  # Large arena/stadium sales
    'All Them Witches': [np.nan, 1500, 2200, 2800],  # Theater/club-level sales
    'Goose': [3200, 5000, 7200, 8800]  # Large theaters and small arenas
}

Next, we create a pandas DataFrame from data.

				
					df2 = pd.DataFrame(data)

				
					df2.head()

This line randomly selects 2 columns (axis=1) from the DataFrame df2, using random_state=11 to ensure the selection is reproducible.

				
					df2.sample(n=2, axis=1, random_state=11)
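The axis argument also accepts the string "columns", which some readers find clearer than 1. A small sketch on a made-up three-column frame:

```python
import pandas as pd

# Hypothetical three-column frame, for illustration only
bands = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

two_cols = bands.sample(n=2, axis=1, random_state=11)
same_cols = bands.sample(n=2, axis="columns", random_state=11)

print(two_cols.shape)  # (2, 2)
print(list(two_cols.columns) == list(same_cols.columns))  # True
```

All rows are kept; only the set of columns is sampled.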

Example 7 Both Rows and Columns

This code does two things in sequence.

First, it randomly samples 2 rows from df2 (the default, axis=0) with random_state=11.

Then, from the sampled rows, it randomly selects 2 columns (axis=1) with the same seed (random_state=11).

				
					df2.sample(n=2, random_state=11).sample(n=2, axis=1, random_state=11)
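The chained call can also be written as two separate steps, which makes the intermediate row sample easy to inspect. A sketch on a made-up frame; the names row_sample and stepwise are only for illustration.

```python
import pandas as pd

# Hypothetical 4x3 frame, for illustration only
toy = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

chained = toy.sample(n=2, random_state=11).sample(n=2, axis=1, random_state=11)

row_sample = toy.sample(n=2, random_state=11)                # step 1: rows
stepwise = row_sample.sample(n=2, axis=1, random_state=11)   # step 2: columns

print(stepwise.equals(chained))  # True
print(stepwise.shape)            # (2, 2)
```

The two steps are independent draws along different axes, so they could just as well use different seeds.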

Ryan Nolan

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
