Principal Component Analysis with Scikit-learn

PCA (Principal Component Analysis) is a technique used to reduce the number of features in a dataset while preserving most of the variance (information). In this post we apply it in Python using Scikit-learn.

It works by:

  1. Finding new axes (principal components) that capture the most variance.

  2. Projecting the data onto these fewer dimensions.

It’s useful for visualization, speeding up models, and removing noise.
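As a quick illustration of the idea (a minimal sketch on made-up data, separate from the MLB example below), PCA finds the new axes and projects the data onto the strongest of them:

  import numpy as np
  from sklearn.decomposition import PCA

  # Made-up 2D data stretched along a diagonal direction
  rng = np.random.default_rng(0)
  data = rng.normal(size=(100, 2)) @ np.array([[3, 1], [1, 0.5]])

  pca = PCA(n_components=1)              # keep only the strongest axis
  projected = pca.fit_transform(data)

  print(pca.components_)                 # direction of the first principal component
  print(pca.explained_variance_ratio_)   # share of the variance it captures
  print(projected.shape)                 # (100, 1): data reduced to one feature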

We start by importing pandas and numpy.

  import pandas as pd
  import numpy as np

Next we read our dataset, ‘2022mlbteams.csv’.

  df = pd.read_csv('2022mlbteams.csv')

Then we remove the column ‘Tm’ from the DataFrame df.

axis=1 means we are dropping a column, not a row.

inplace=True means the change is made directly to df without needing to assign it back.

  df.drop('Tm', axis=1, inplace=True)

Here we select columns 0 to 26 from the DataFrame df and assign them to X.

This will be used as the feature matrix, or input data.

  X = df.iloc[:, 0:27]

Here we select column 27, i.e. the 28th column, from df and assign it to y, which will be used as the target or output variable.

  y = df.iloc[:,27]

Next we import train_test_split from sklearn.model_selection, which is used to split X and y into training and testing sets.

  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
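As a quick sanity check (not in the original code), the shapes confirm the 80/20 split:

  print(X_train.shape, X_test.shape)   # roughly 80% / 20% of the rows
  print(y_train.shape, y_test.shape)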

Here we import StandardScaler, which is used to standardize or normalize features so that they have a mean of 0 and a standard deviation of 1.

  from sklearn.preprocessing import StandardScaler

Then we create a StandardScaler instance and assign it to ‘scaleStandard’.

  scaleStandard = StandardScaler()

Next we use the .fit_transform() method to calculate the mean and standard deviation of each feature in X_train, and then standardize the data with those statistics.

  X_train = scaleStandard.fit_transform(X_train)

Here, we convert the standardized X_train back into a pandas DataFrame and assign column names to it.

  X_train = pd.DataFrame(X_train, columns =['BatAge', 'R/G', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB', 'LOB'])
  X_train.head(10)
  X_train.describe().round(3)
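Note that only X_train is scaled above. If X_test is used for evaluation later, it would normally be scaled with the same fitted scaler; a small sketch, using .transform() (not .fit_transform()) so the test set reuses the training mean and standard deviation:

  # Scale the test set with the statistics learned from the training set
  X_test = scaleStandard.transform(X_test)
  X_test = pd.DataFrame(X_test, columns=X_train.columns)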

Next we import PCA (Principal Component Analysis) from sklearn.decomposition.

Then we assign PCA() to pca1.

  from sklearn.decomposition import PCA
  pca1 = PCA()

Here we fit the model to X_train so it learns the directions of maximum variance.

Then it transforms X_train into a new set of features (principal components).

  X_pca1 = pca1.fit_transform(X_train)

The .explained_variance_ratio_ attribute shows the proportion of variance each principal component explains.

  pca1.explained_variance_ratio_
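To see how many components are needed to reach a given amount of variance, the ratios can be accumulated with np.cumsum (a small check, assuming a 95% target):

  # Running total of explained variance across components
  cumulative = np.cumsum(pca1.explained_variance_ratio_)
  print(cumulative)

  # Smallest number of components that reaches 95% of the variance
  print(np.argmax(cumulative >= 0.95) + 1)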


Here we import matplotlib.pyplot as plt.

  import matplotlib.pyplot as plt

This block of code visualizes how much variance each PCA component explains and how much is cumulatively explained.

  plt.bar(range(1, len(pca1.explained_variance_) + 1), pca1.explained_variance_)
  plt.ylabel('Explained variance')
  plt.xlabel('Components')
  plt.plot(range(1, len(pca1.explained_variance_) + 1), np.cumsum(pca1.explained_variance_), c='red', label="Cumulative Explained Variance")
  plt.legend(loc='upper left')
  plt.show()

  plt.plot(np.cumsum(pca1.explained_variance_ratio_))
  plt.xlabel('number of components')
  plt.ylabel('cumulative explained variance')
  plt.show()
Next we want to keep 95% of the variance within the data. This creates a PCA object that will automatically choose the number of components needed to retain 95% of the total variance in the dataset:
  pca2 = PCA(0.95)

Once again we fit_transform().

  X_pca2 = pca2.fit_transform(X_train)
  X_pca2.shape
  pca2.explained_variance_ratio_
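After fitting, the PCA object also records how many components it actually kept, which is a quick way to confirm what PCA(0.95) selected:

  print(pca2.n_components_)                     # number of components retained
  print(pca2.explained_variance_ratio_.sum())   # should be at least 0.95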

This line creates a PCA object that will reduce the data to exactly 2 principal components:

  • To simplify the data to just 2 dimensions.

  • Commonly used for visualization (e.g., 2D scatter plots of high-dimensional data).

  • Retains the 2 directions with the most variance in the dataset.

  pca2c = PCA(n_components=2)

We also call fit_transform().

  X_pca2c = pca2c.fit_transform(X_train)
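It can also be useful to see which of the original stats drive the two components. A short sketch (assuming X_train is still the DataFrame built above): pca2c.components_ holds the loadings of each feature on PC1 and PC2.

  # Loadings: rows are components, columns are the original features
  loadings = pd.DataFrame(pca2c.components_, columns=X_train.columns, index=['PC1', 'PC2'])

  # Features that contribute most to the first component
  loadings.T.sort_values('PC1', ascending=False).head()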

Next we get the coolwarm colormap from matplotlib so we can use it in the plot.

  colormap = plt.cm.get_cmap('coolwarm')
  plt.figure()
  scatter = plt.scatter(X_pca2c[:, 0], X_pca2c[:, 1], c=y_train, cmap=colormap)
  plt.xlabel('PC1')
  plt.ylabel('PC2')
  plt.colorbar(scatter, label='Playoffs')
  plt.show()

Here we create a PCA object that will reduce the dataset to 3 principal components.


  • To simplify the data while keeping the 3 directions with the most variance.

  • Useful for 3D visualization and for further dimensionality reduction before modeling.

  • Preserves more variance than 2 components while still reducing complexity.

  pca3c = PCA(n_components=3)

We fit_transform() as usual.

  X_pca3c = pca3c.fit_transform(X_train)
  pca3c.explained_variance_ratio_
  from mpl_toolkits.mplot3d import Axes3D  # Import the 3D plotting toolkit

  # Create a figure and a 3D axis
  fig = plt.figure()
  ax = fig.add_subplot(111, projection='3d')

  # Create the 3D scatter plot
  scatter3d = ax.scatter(X_pca3c[:, 0], X_pca3c[:, 1], X_pca3c[:, 2], c=y_train, cmap=colormap)

  # Set labels for the axes
  ax.set_xlabel('PC1')
  ax.set_ylabel('PC2')
  ax.set_zlabel('PC3')

  # Show the plot
  plt.colorbar(scatter3d, label='Playoffs')
  plt.show()

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
