Pandas Sample

In this post we look at the pandas sample() method.

The sample() method returns a specified number of randomly selected rows.

If no number is specified, it returns a single row.
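
As a quick check of that default, the sketch below calls sample() on a small made-up DataFrame. The toy frame and its column names are assumptions for illustration only and are not part of the dataset built later.

```python
import pandas as pd

# Hypothetical toy DataFrame, used only to illustrate the defaults
toy = pd.DataFrame({"letter": list("abcdef"), "value": range(6)})

one_row = toy.sample()        # no n or frac given: a single random row
three_rows = toy.sample(n=3)  # three distinct rows (replace defaults to False)

print(len(one_row), len(three_rows))
```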

Import The Libraries

To start, we import the libraries we need:

pandas as pd

random

string

numpy as np

  # DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
  import pandas as pd
  import random
  import string
  import numpy as np

Prep The Dataframe

Next we create a function that generates n random 5-character alphanumeric strings.

We use uppercase letters and digits, with a fixed random seed (42) for reproducibility.

  # Function to generate random 5-character alphanumeric strings
  def generate_all_merchant_ids(n):
      random_gen = random.Random(42)  # Independent random generator
      return [''.join(random_gen.choices(string.ascii_uppercase + string.digits, k=5)) for _ in range(n)]
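
To see what the fixed seed buys us, here is a minimal sketch that calls the function twice and compares the results. It repeats the function so the snippet runs on its own; the alphabet variable is just a local name added for readability.

```python
import random
import string

def generate_all_merchant_ids(n):
    random_gen = random.Random(42)  # a fresh generator, seeded on every call
    alphabet = string.ascii_uppercase + string.digits
    return [''.join(random_gen.choices(alphabet, k=5)) for _ in range(n)]

first = generate_all_merchant_ids(3)
second = generate_all_merchant_ids(3)

print(first == second)                        # True: same seed, same IDs
print(all(len(mid) == 5 for mid in first))    # True: every ID has 5 characters
```

Because the generator is created inside the function, the seed does not leak into any other random calls in the notebook.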

Here we calculate how many of the 500 records are fraudulent and how many are not.

20% are fraud (100 records)

80% are not (400 records).

  # Total number of records
  n_records = 500
  n_fraud = int(n_records * 0.2)
  n_non_fraud = n_records - n_fraud

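Next we generate 500 random 5-character alphanumeric merchant IDs using the generate_all_merchant_ids() function. The function does not check for duplicates, but with 36^5 (about 60 million) possible strings, collisions among 500 IDs are unlikely.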

  merchant_ids = generate_all_merchant_ids(n_records)

This code creates a list of fraud labels:

1 for fraud

0 for non-fraud

Then we shuffle the labels randomly using a separate random generator.

  # Generate fraud and non-fraud labels
  label_rng = random.Random(11)  # Independent RNG
  fraud_labels = [1] * n_fraud + [0] * n_non_fraud
  label_rng.shuffle(fraud_labels)
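
A short sanity check can confirm that shuffling only changes the order of the labels, not the 20/80 split. The make_labels helper below is a hypothetical wrapper added for this sketch; it is not part of the original code.

```python
import random

n_records = 500
n_fraud = int(n_records * 0.2)
n_non_fraud = n_records - n_fraud

def make_labels(seed):
    """Build the 0/1 label list and shuffle it with its own seeded generator."""
    rng = random.Random(seed)
    labels = [1] * n_fraud + [0] * n_non_fraud
    rng.shuffle(labels)  # shuffles the list in place
    return labels

labels_a = make_labels(11)
labels_b = make_labels(11)

print(sum(labels_a), len(labels_a))   # 100 500: counts are unchanged by shuffling
print(labels_a == labels_b)           # True: same seed, same order
```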

Here we create a pandas DataFrame with two columns, “merchant_id” and “is_fraud”.

The merchant_id column stores the merchant_ids,

while the is_fraud column stores the fraud_labels.

  # Create the DataFrame
  df = pd.DataFrame({
      'merchant_id': merchant_ids,
      'is_fraud': fraud_labels
  })
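
Before sampling, it can help to confirm the DataFrame looks the way we expect. The sketch below rebuilds df in compact form so it runs on its own, then checks its shape and fraud rate. The id_rng name is a local stand-in for the generator inside generate_all_merchant_ids().

```python
import random
import string
import pandas as pd

n_records, n_fraud = 500, 100

# Same seeds as in the article: 42 for the IDs, 11 for the label shuffle
id_rng = random.Random(42)
merchant_ids = [''.join(id_rng.choices(string.ascii_uppercase + string.digits, k=5))
                for _ in range(n_records)]

label_rng = random.Random(11)
fraud_labels = [1] * n_fraud + [0] * (n_records - n_fraud)
label_rng.shuffle(fraud_labels)

df = pd.DataFrame({'merchant_id': merchant_ids, 'is_fraud': fraud_labels})

print(df.shape)                      # (500, 2)
print(df['is_fraud'].mean())         # 0.2
print(df['merchant_id'].is_unique)   # uniqueness is likely but not enforced
```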

Example 1 Sample 5 Random Rows

Reproducible results using random_state.

This line of code randomly selects 5 rows from the DataFrame df, using a fixed random seed (11) to ensure the same sample is returned every time it’s run.

  df.sample(n=5, random_state=11)
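
The effect of random_state can be checked directly by sampling twice with the same seed. This sketch uses a simplified stand-in for df (sequential merchant IDs and unshuffled labels are assumptions made to keep it short).

```python
import pandas as pd

# Simplified stand-in for the article's df (sequential IDs are an assumption)
df = pd.DataFrame({
    'merchant_id': [f'M{i:04d}' for i in range(500)],
    'is_fraud': [1] * 100 + [0] * 400,
})

seeded_a = df.sample(n=5, random_state=11)
seeded_b = df.sample(n=5, random_state=11)

print(seeded_a.equals(seeded_b))   # True: identical rows in identical order

unseeded = df.sample(n=5)          # no seed: rows will usually differ between runs
```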

Example 2 Random Sample By Percentage

This line randomly selects 10% of the rows from the DataFrame df using random_state=11 to make the sampling reproducible.

  df.sample(frac=0.1, random_state=11)
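
With 500 rows, frac=0.1 should return 50 rows. The sketch below checks that, again on a simplified stand-in for df (an assumption), and also shows that n and frac cannot be combined in one call.

```python
import pandas as pd

# Simplified stand-in for the article's df (sequential IDs are an assumption)
df = pd.DataFrame({
    'merchant_id': [f'M{i:04d}' for i in range(500)],
    'is_fraud': [1] * 100 + [0] * 400,
})

ten_percent = df.sample(frac=0.1, random_state=11)
print(len(ten_percent))   # 50, i.e. 10% of 500 rows

# n and frac are mutually exclusive: passing both raises a ValueError
try:
    df.sample(n=5, frac=0.1)
except ValueError as err:
    print("ValueError:", err)
```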

Example 3 Ignore Index

This line randomly selects 4 rows from the DataFrame df,

resets their index using ignore_index=True, and uses random_state=11 to ensure reproducibility.

  df.sample(n=4, ignore_index=True, random_state=11)
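
To make the effect of ignore_index concrete, this sketch draws the same 4 rows with and without it. The DataFrame is a simplified stand-in for df (an assumption); only the index labels should differ between the two results.

```python
import pandas as pd

# Simplified stand-in for the article's df (sequential IDs are an assumption)
df = pd.DataFrame({
    'merchant_id': [f'M{i:04d}' for i in range(500)],
    'is_fraud': [1] * 100 + [0] * 400,
})

kept = df.sample(n=4, random_state=11)                       # original index labels
reset = df.sample(n=4, ignore_index=True, random_state=11)   # relabeled 0..3

print(list(reset.index))   # [0, 1, 2, 3]
# Same seed, so the same rows are drawn in the same order
print(kept['merchant_id'].tolist() == reset['merchant_id'].tolist())   # True
```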

Example 4 Sample With Replace

The replace parameter allows or disallows sampling the same row more than once. If it is set to True, we can get the same row twice in our sample.

This line randomly samples 10% of df’s rows with replacement, so duplicates are possible.

It uses random_state=11 for reproducibility,

then sorts the sampled rows by their original index.

  df.sample(frac=0.1, replace=True, random_state=11).sort_index()
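
Duplicates are easiest to see when we ask for more rows than exist. In this sketch, small is a hypothetical 5-row slice of a simplified stand-in for df. Drawing 8 rows fails without replacement and must repeat rows with it.

```python
import pandas as pd

# Simplified stand-in for the article's df (sequential IDs are an assumption)
df = pd.DataFrame({
    'merchant_id': [f'M{i:04d}' for i in range(500)],
    'is_fraud': [1] * 100 + [0] * 400,
})
small = df.head(5)

# Without replacement we cannot ask for more rows than exist
try:
    small.sample(n=8, random_state=11)
except ValueError as err:
    print("ValueError:", err)

# With replacement the same request works, and some rows must repeat
oversampled = small.sample(n=8, replace=True, random_state=11).sort_index()
print(len(oversampled))                        # 8
print(oversampled.index.duplicated().any())    # True: 8 draws from only 5 rows
```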

Example 5 Assign Weights

Here we give more weight to fraud rows and less weight to non-fraud rows. Remember, fraud is only 20% of the dataset. With these weights, each fraud row is 5x more likely to be picked than each non-fraud row.

Here we create a new column called “sampling_weight” in the DataFrame df, assigning

a weight of 5 to rows where is_fraud == 1,

and a weight of 1 to non-fraud rows.

  df['sampling_weight'] = df['is_fraud'].apply(lambda x: 5 if x == 1 else 1)

Next we randomly select 100 rows from df using weighted sampling based on the sampling_weight column. We set random_state=11 to ensure reproducibility.

  sampled_df = df.sample(n=100, weights='sampling_weight', random_state=11)

This line counts how many fraud and non-fraud rows are in sampled_df, showing the distribution of the sampled data’s is_fraud labels.

  sampled_df['is_fraud'].value_counts()
  #no weights

This line randomly selects 100 rows from df without any weights, using random_state=11 for reproducibility.

  sampled_df_no_weights = df.sample(n=100, random_state=11)

Here, we count how many fraud and non-fraud rows are in the unweighted sample “sampled_df_no_weights”, showing the distribution of “is_fraud” in that sample.

  sampled_df_no_weights['is_fraud'].value_counts()
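
It may help to work out what the 5-to-1 weights imply before looking at the counts. The sketch below uses a simplified stand-in for df (an assumption) and np.where as a vectorized alternative to the apply/lambda step. It computes the chance that the first draw is a fraud row, then compares the weighted and unweighted samples.

```python
import numpy as np
import pandas as pd

# Simplified stand-in for the article's df (sequential IDs are an assumption)
df = pd.DataFrame({
    'merchant_id': [f'M{i:04d}' for i in range(500)],
    'is_fraud': [1] * 100 + [0] * 400,
})

# Vectorized alternative to apply/lambda: 5 for fraud rows, 1 otherwise
df['sampling_weight'] = np.where(df['is_fraud'] == 1, 5, 1)

# Weights are relative, so the first-draw probability is a simple ratio
fraud_weight = df.loc[df['is_fraud'] == 1, 'sampling_weight'].sum()   # 100 * 5 = 500
total_weight = df['sampling_weight'].sum()                            # 500 + 400 = 900
first_draw_fraud_prob = fraud_weight / total_weight
print(round(first_draw_fraud_prob, 3))   # 0.556, versus 0.2 without weights

weighted = df.sample(n=100, weights='sampling_weight', random_state=11)
unweighted = df.sample(n=100, random_state=11)
print(weighted['is_fraud'].mean(), unweighted['is_fraud'].mean())
```

Because rows are drawn without replacement, the fraud share in the weighted sample drifts below 0.556 as fraud rows are used up, but it should still sit well above the 20% base rate.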

Example 6 Sample Columns

Here we create a dictionary “data” containing three keys, each mapped to a list of sales numbers, with some values missing (np.nan).

  data = {
      'The Who': [np.nan, 25000, np.nan, 42000],        # Large arena/stadium sales
      'All Them Witches': [np.nan, 1500, 2200, 2800],   # Theater/club-level sales
      'Goose': [3200, 5000, 7200, 8800]                 # Large theaters and small arenas
  }

Next, using “data”, we create a pandas DataFrame.

  df2 = pd.DataFrame(data)
  df2.head()

This line randomly selects 2 columns (axis=1) from the DataFrame df2, using random_state=11 to ensure the selection is reproducible.

  df2.sample(n=2, axis=1, random_state=11)
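
Column sampling keeps every row and only narrows the set of columns. The sketch below rebuilds df2 so it runs on its own, checks the shape of the result, and shows that axis='columns' is accepted as an alias for axis=1.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'The Who': [np.nan, 25000, np.nan, 42000],
    'All Them Witches': [np.nan, 1500, 2200, 2800],
    'Goose': [3200, 5000, 7200, 8800],
})

two_cols = df2.sample(n=2, axis=1, random_state=11)
same_cols = df2.sample(n=2, axis='columns', random_state=11)   # alias for axis=1

print(two_cols.shape)                                       # (4, 2): all rows, two columns
print(list(two_cols.columns) == list(same_cols.columns))    # True: same seed, same columns
```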

Example 7 Both Rows And Columns

This code does two things in sequence:

It randomly samples 2 rows from df2 (the default, axis=0) with random_state=11.

From the sampled rows, it then randomly selects 2 columns (axis=1) with the same seed (random_state=11).

  df2.sample(n=2, random_state=11).sample(n=2, axis=1, random_state=11)
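
The chained call can be easier to follow when split into its two steps. In this sketch, row_sample, both, and cols_first are names introduced for illustration. It also checks that swapping the order of the two steps picks the same row and column labels, since each draw depends only on the seed and the length of the axis being sampled.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'The Who': [np.nan, 25000, np.nan, 42000],
    'All Them Witches': [np.nan, 1500, 2200, 2800],
    'Goose': [3200, 5000, 7200, 8800],
})

# Step 1: pick 2 of the 4 rows
row_sample = df2.sample(n=2, random_state=11)
# Step 2: pick 2 of the 3 columns from those rows
both = row_sample.sample(n=2, axis=1, random_state=11)
print(both.shape)   # (2, 2)

# Columns first, then rows: the same labels are selected
cols_first = df2.sample(n=2, axis=1, random_state=11).sample(n=2, random_state=11)
print(list(both.index) == list(cols_first.index))       # True
print(list(both.columns) == list(cols_first.columns))   # True
```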

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
