Voting Classifier

Boosting Accuracy with Voting Classifiers

In machine learning, combining multiple models often leads to better performance than relying on a single one. A Voting Classifier is a simple ensemble method that does just that — it aggregates predictions from several models to improve accuracy.

There are two types:

Hard Voting: Takes the majority vote from all classifiers (the most common prediction).

Soft Voting: Averages predicted probabilities and picks the class with the highest average. This requires models that support predict_proba().

Even if individual models are imperfect, their combined output can reduce errors and generalize better. You can tune a Voting Classifier by:

Choosing between "hard" and "soft" voting.

Assigning weights to give more influence to stronger models.

Voting Classifiers are a practical way to build more reliable systems by leveraging the strengths of multiple models.
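To make the two voting modes concrete, here is a minimal standalone sketch; the three models and their class-1 probabilities are invented for illustration.

import numpy as np

# invented class-1 probabilities from three hypothetical models for one sample
probas = np.array([0.9, 0.4, 0.45])

# hard voting: each model casts a label and the majority wins -> [1, 0, 0] -> class 0
labels = (probas >= 0.5).astype(int)
hard_vote = np.bincount(labels).argmax()

# soft voting: average the probabilities -> 0.583 -> class 1
soft_vote = int(probas.mean() >= 0.5)

print(hard_vote, soft_vote)  # 0 1

A single confident model can swing a soft vote it would lose under hard voting, which is why soft voting rewards well-calibrated classifiers.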

We import pandas as pd, along with make_classification from sklearn.datasets to generate synthetic data. Here is what each setting does:

n_samples=2000: Generates 2000 rows.

n_features=10: Each sample has 10 features.

n_informative=8: 8 features are actually useful for prediction.

n_redundant=2: 2 features are linear combinations of the informative ones.

train_test_split: Splits data into 80% training and 20% testing.

 

				
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=8, n_redundant=2, random_state=11)

Next we import train_test_split from sklearn.model_selection so we can split X and y into training and test sets.

				
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)
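As a quick sanity check, with test_size=0.2 the split should leave 1,600 of the 2,000 samples for training and 400 for testing:

print(X_train.shape, X_test.shape)  # (1600, 10) (400, 10)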
				
			

Next we import cross_val_score to evaluate each model's performance using cross-validation.

				
from sklearn.model_selection import cross_val_score

We also import Gaussian Naive Bayes, a fast and simple probabilistic classifier that works well on many problems, especially when the features are independent and roughly normally distributed.

				
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
cross_val_score(gnb, X_train, y_train, cv=3).mean()
				
We repeat the same steps with Logistic Regression, a simple linear baseline.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
cross_val_score(lr, X_train, y_train, cv=3).mean()
				
Next comes a Random Forest, an ensemble of decision trees that typically performs well out of the box.

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
cross_val_score(rfc, X_train, y_train, cv=3).mean()
				
With all three models trained, we combine them into a VotingClassifier and score the ensemble the same way.

from sklearn.ensemble import VotingClassifier

vc = VotingClassifier([('NaiveBayes', gnb), ('LogisticRegression', lr), ('RandomForestClassifier', rfc)])
cross_val_score(vc, X_train, y_train, cv=3).mean()
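To see whether combining actually helps here, a quick comparison (a minimal sketch reusing the models defined above) prints each cross-validation mean side by side:

for name, model in [('NaiveBayes', gnb), ('LogisticRegression', lr),
                    ('RandomForestClassifier', rfc), ('VotingClassifier', vc)]:
    print(name, cross_val_score(model, X_train, y_train, cv=3).mean())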
				
			
From the scikit-learn docs: voting is {'hard', 'soft'}, default 'hard'. If 'hard', the ensemble uses predicted class labels for majority-rule voting; if 'soft', it predicts the class label from the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.

To tune the ensemble, we search over both the voting mode and the weights given to each model:
				
param_grid = {'voting': ['hard', 'soft'],
              'weights': [(1, 1, 1), (2, 1, 1), (1, 2, 1), (1, 1, 2)]}
			
				
We wrap the ensemble in GridSearchCV, fit it on the training data, and check the best parameters and score.

from sklearn.model_selection import GridSearchCV

vc2 = GridSearchCV(vc, param_grid, cv=5, n_jobs=-1)
vc2.fit(X_train, y_train)
vc2.best_params_
vc2.best_score_
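This is also where the earlier pandas import can earn its keep: GridSearchCV exposes a cv_results_ attribute, and a quick sketch like the one below tabulates every combination tried (the column names assume the param_grid above).

results = pd.DataFrame(vc2.cv_results_)
print(results[['param_voting', 'param_weights', 'mean_test_score']])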
				
			
				
Finally, we rebuild the Voting Classifier with soft voting and extra weight on the random forest, then confirm its cross-validated score.

vc3 = VotingClassifier([('NaiveBayes', gnb), ('LogisticRegression', lr), ('RandomForestClassifier', rfc)],
                       voting='soft', weights=[1, 1, 2])
vc3.fit(X_train, y_train)
cross_val_score(vc3, X_train, y_train, cv=3).mean()
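As a last step, the 20% test split from earlier gives an unbiased check; a minimal sketch scoring the tuned ensemble on held-out data with scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, vc3.predict(X_test)))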
				
			

Ryan is a Data Scientist at a fintech company, where he focuses on fraud prevention in underwriting and risk. Before that, he worked as a Data Analyst at a tax software company. He holds a degree in Electrical Engineering from UCF.
