K-Nearest Neighbors

A Comprehensive Understanding of K-Nearest Neighbors (KNN) in Supervised Machine Learning.

K-Nearest Neighbors (KNN) is a simple, widely used supervised learning algorithm in data science and machine learning.

It was developed by Evelyn Fix and Joseph Hodges in 1951. Known for its usefulness and versatility, KNN can handle both classification and regression tasks.

The K-Nearest Neighbors algorithm works by finding the K nearest neighbors to a given data point based on a distance metric, such as Euclidean distance.

The class or value of the data point is then determined by a majority vote or by the mean of the K neighbors. This allows the model to adapt to different patterns and make predictions.
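
To make that concrete, here is a minimal, from-scratch sketch of the idea (not the scikit-learn implementation we use later in this guide); the points and labels are made up purely for illustration:

import numpy as np
from collections import Counter

def knn_predict(points, labels, x_new, k=3):
    # Euclidean distance from the new point to every known point
    distances = np.linalg.norm(points - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return Counter(labels[nearest]).most_common(1)[0][0]

# Tiny made-up example: two features, two classes
points = np.array([[1, 2], [2, 1], [8, 9], [9, 8]])
labels = np.array([0, 0, 1, 1])
print(knn_predict(points, labels, np.array([1.5, 1.5]), k=3))  # -> 0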

In this guide, we’ll explore an application of KNN using a baseball dataset.

We will classify baseball players into Hall of Fame inductees or non-inductees. By following the beginner-friendly steps below, you’ll learn how to preprocess data, train the model, and evaluate its performance effectively.

To simplify how K-Nearest Neighbors (KNN) works, imagine a painting club with two groups: one focused on painting nature and the other on painting people.

Each painter’s preference can be identified by their uniform color and hat style.

When a new painter arrives, you’re unsure which group they belong to. To decide, you observe their outfit and compare it to the nearby painters.

If most of the nearby painters are wearing blue overalls and beret hats, you’d guess the new painter enjoys painting people.

This is how KNN works: it checks the “nearest neighbors” to make a prediction. The “K” in KNN refers to the number of neighbors considered. For example, if K is 3, the decision is based on the 3 closest painters.

KNN is essentially a smart guessing game that predicts based on the closest associations.

The Dataset

The dataset used in this example focuses on baseball players and their statistical attributes, with the target of predicting whether a player is inducted into the Hall of Fame.

This classification problem is well suited to demonstrating the strengths of the KNN algorithm.

Step 1: Importing Required Libraries

To begin, we import the essential Python libraries that help us work with the data and perform the required tasks:

import pandas as pd  # For data manipulation and analysis
from sklearn.preprocessing import MinMaxScaler  # For scaling data
from sklearn.model_selection import train_test_split  # For splitting the dataset
from sklearn.neighbors import KNeighborsClassifier  # The KNN algorithm
from sklearn.metrics import confusion_matrix  # For model evaluation
from sklearn.metrics import classification_report  # For model evaluation

Step 2: Data Preprocessing

Preprocessing is a critical step in any machine learning project. It ensures the data is clean, relevant, and properly formatted for the model.

df = pd.read_csv('500hits.csv', encoding='latin-1')
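
After loading, it is worth taking a quick look at the data before making any changes; the exact columns you see will depend on the CSV file, so the output here is only illustrative:

print(df.shape)   # Number of rows and columns
print(df.head())  # First five rows, useful for spotting obvious issues
df.info()         # Column types and missing-value counts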

Dropping Irrelevant Columns

The dataset may contain columns that are not useful for classification. Removing these columns simplifies the data, reduces noise, and makes it easier to work with:

df = df.drop(columns=['player', 'cs'])

Splitting the Dataset into Features and Target

Next, we separate the dataset into features (X) and target labels (y) so that they can be split into training and testing sets:

X = df.iloc[:, 0:13]
y = df.iloc[:, 13]

Splitting Data into Training and Testing Sets

To evaluate the model, we split the data into training and testing sets. An 80-20 split is used here, though a 70-30 split is also common:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

This ensures the model is trained on one subset of the data and evaluated on unseen data, providing a reliable measure of its performance.

Step 3: Scaling the Data

KNN calculates the distance between data points to make predictions, so all features must contribute equally. Several scaling methods are available; here we use MinMaxScaler:

scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

The MinMaxScaler scales each feature to a range between 0 and 1, ensuring that features with larger scales do not dominate the distance calculations.
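
The transformation applied to each feature is (x - min) / (max - min). A tiny worked example with made-up hit totals shows the arithmetic:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up column of career hit totals, just to show the arithmetic
hits = np.array([[500.0], [1200.0], [3000.0]])
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(hits)
print(scaled)  # [[0.], [0.28], [1.]] -- (x - min) / (max - min)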

Step 4: Training the KNN Model

With the data prepared, we can now train the KNN model. The key parameter in KNN is the number of neighbors ‘n_neighbors’, which determines how many nearby points influence the prediction:

knn = KNeighborsClassifier(n_neighbors=8)  # Using 8 neighbors
knn.fit(X_train, y_train)  # Training the model

Choosing the right number of neighbors is crucial. A small value may lead to overfitting, while a large value may oversimplify the model. Hyperparameter tuning can help find the optimal value.
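
One simple way to explore this trade-off is to train the model over a range of K values and compare accuracy on the test set. A rough sketch, reusing the split and scaled data from above:

for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    print(k, round(model.score(X_test, y_test), 3))  # Test accuracy for each K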

Step 5: Evaluating the Model

Evaluating the model is essential for understanding its performance and identifying areas for improvement.

Making Predictions

First, we use the trained model to make predictions on the test set:

y_pred = knn.predict(X_test)  # Predicted labels for the test set
print(knn.score(X_test, y_test))  # Overall accuracy on the test set

Confusion Matrix

The confusion matrix provides a detailed breakdown of correct and incorrect predictions:

cm = confusion_matrix(y_test, y_pred)
print(cm)
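
For a binary target such as Hall of Fame induction (assuming it is encoded as 1 for inductees and 0 otherwise), the matrix can be unpacked into its four cells, which makes the counts easier to read:

tn, fp, fn, tp = cm.ravel()
print(f"True negatives: {tn}, False positives: {fp}")
print(f"False negatives: {fn}, True positives: {tp}")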

Classification Report

The classification report includes metrics like precision, recall, and F1-score for a more comprehensive evaluation:

cf = classification_report(y_test, y_pred)
print(cf)

These metrics help assess the model’s strengths and weaknesses in classifying Hall of Fame players.

Insights and Possible Improvements

After evaluating the model, there are several strategies to improve its performance:

Feature Selection

Analyze the impact of each feature and remove those with little contribution to the target variable.

Tools like feature importance scores or recursive feature elimination can assist in this process.
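
As one illustration, scikit-learn's RFE can rank features. Because KNN itself does not expose feature importances, this sketch wraps a logistic regression instead; it assumes the unscaled feature DataFrame X and target y from earlier, and the choice of 8 features is arbitrary:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Keep the 8 features ranked highest by the logistic regression wrapper
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8)
selector.fit(X, y)
print(X.columns[selector.support_])  # Names of the retained features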

Hyperparameter Tuning

Optimize the value of n_neighbors and other parameters, such as the distance metric, using techniques like grid search or random search.
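
A grid search over the number of neighbors, the weighting scheme, and the distance metric might look like the sketch below; the parameter ranges are illustrative, not prescriptive:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_neighbors': list(range(1, 21)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)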

Cross-Validation

Use k-fold cross-validation to ensure the model’s performance is consistent across different subsets of the data.
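
scikit-learn's cross_val_score makes this straightforward; the sketch below runs 5 folds on the scaled training data:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(KNeighborsClassifier(n_neighbors=8), X_train, y_train, cv=5)
print(scores.mean(), scores.std())  # Average accuracy and its spread across folds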

Experiment with Data Scaling Methods

While MinMaxScaler is used here, experimenting with other scaling methods like StandardScaler may yield better results depending on the dataset.
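
Swapping in StandardScaler only changes the preprocessing step; everything downstream stays the same. This sketch assumes you re-create the unscaled train/test split first, since X_train and X_test were overwritten with min-max scaled values above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # Centers each feature and scales it to unit variance
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)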

Conclusion

In this guide, we demonstrated how to implement the K-Nearest Neighbors algorithm to classify baseball players as Hall of Fame inductees or non-inductees. By following the steps of data preprocessing, scaling, model training, and evaluation, you can apply KNN to a variety of classification tasks, and even to regression tasks.

KNN’s simplicity makes it an excellent algorithm for beginners, while its effectiveness ensures it remains a valuable tool for professionals.

By experimenting with different datasets, tuning hyperparameters, and employing advanced evaluation techniques, you can maximize the potential of this versatile algorithm.

Whether you’re tackling sports analytics, healthcare data, or customer segmentation, KNN offers a solid foundation for solving classification problems.
