# Dealing with Imbalanced Data

*One of the main challenges faced across many domains, when using machine learning, is data imbalance.*

Machine learning algorithms are susceptible to returning unsatisfactory predictions when trained on imbalanced datasets. This is because the classifier often learns to simply predict the majority class all of the time. Taking an extreme example of a two-class dataset where the minority class accounts for only 5% of the data, a classifier that always predicts the majority class would report an accuracy of 95%. Taken in isolation this result looks good; in reality, however, we have produced a useless classifier.
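To make the arithmetic concrete, the sketch below scores a hypothetical "classifier" that ignores its input and always predicts the majority class. The 5/95 split mirrors the example above; the function name `always_majority` is purely illustrative and not part of any library.

```python
# Labels for a 100-sample dataset: 5 minority (0) and 95 majority (1) examples
y_true = [0] * 5 + [1] * 95

def always_majority(n_samples, majority_label=1):
    """A degenerate 'model' that predicts the majority class for every sample."""
    return [majority_label] * n_samples

y_pred = always_majority(len(y_true))

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of true minority samples predicted as minority
minority_total = sum(t == 0 for t in y_true)
minority_hits = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / minority_total

print(f"Accuracy: {accuracy:.2%}")                # Accuracy: 95.00%
print(f"Minority recall: {minority_recall:.2%}")  # Minority recall: 0.00%
```

The accuracy looks excellent, yet not a single minority example is ever identified.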

In this post we will consider some strategies for dealing with imbalanced data. We will be using a synthetic dataset generated using `sklearn.datasets`:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification

    # Generate the dataset
    X, y = make_classification(n_classes=2, class_sep=0.5, weights=[0.05, 0.95],
                               n_informative=1, n_redundant=1, flip_y=0,
                               n_features=5, n_clusters_per_class=1,
                               n_samples=100, random_state=2)
    train_data = pd.DataFrame(np.column_stack((X, y)))

    # Count classes and plot
    target_count = train_data.iloc[:, -1].value_counts()
    print('Class 0:', target_count[0])
    print('Class 1:', target_count[1])
    target_count.plot(kind='bar', title='Count (target)');

Output:

    Class 0: 5
    Class 1: 95

**Model Evaluation**

One of the most common mistakes that beginners in machine learning make when working with imbalanced data is not properly considering their model evaluation metrics. Simply using a metric like `accuracy_score` can produce misleading results, as it oversimplifies model evaluation and provides results which look good but in fact are not.

Consider the following example, using a simple SGD Classifier with the default values and no feature manipulation.

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # scipy.stats.itemfreq has been removed from recent SciPy releases;
    # this small stand-in returns the same [value, count] rows via np.unique
    def itemfreq(values):
        return np.column_stack(np.unique(values, return_counts=True))

    # Separate the features from the target (last) column
    labels = train_data.columns[:-1]
    X = train_data[labels]
    y = train_data.iloc[:, -1]

    # Split the data into train:test @ ratio 80:20
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=1)

    # Define the classifier model
    model = SGDClassifier(max_iter=1000, tol=1e-3)

    # Fit data to the model (train)
    model.fit(X_train, y_train)

    # Predict results of test data
    y_pred_raw = model.predict(X_test)

    # Get accuracy score
    accuracy = accuracy_score(y_test, y_pred_raw)
    print('Accuracy: %.2f%%' % (accuracy * 100.0))

    # Show predicted classes
    print(itemfreq(y_pred_raw))

Output:

    Accuracy: 90.00%
    [[ 1. 20.]]

We can see that this model produces an accuracy of 90%, but this is achieved simply by predicting class 1 every time. Let's further prove the point by slightly modifying the classifier. This time we will train using only a single feature, which should have a negative impact on the accuracy, as the classifier will have less information about each data point.

*NOTE: The accuracy shown deviates from the expected 95% because there are 2 examples of the minority class in the test set of 20 observations.*

    # Fit single feature data to the model (train)
    model.fit(X_train.iloc[:, -1].values.reshape(-1, 1), y_train)
    # Predict results of test data
    y_pred_single = model.predict(
        X_test.iloc[:, -1].values.reshape(-1, 1))

    # Get accuracy score
    accuracy = accuracy_score(y_test, y_pred_single)
    print('Accuracy: %.2f%%' % (accuracy * 100.0))

    # Show predicted classes
    print(itemfreq(y_pred_single))

Output:

    Accuracy: 90.00%
    [[ 1. 20.]]

As we can see, the high accuracy rate is a very misleading metric.

A better way to evaluate the results is through the use of a confusion matrix. In a confusion matrix the numbers of correct and incorrect predictions are summarised as counts and broken down by class, showing what each target was predicted to be by the classifier. High diagonal values indicate a high number of correct predictions across all classes.

    from sklearn.metrics import confusion_matrix
    from matplotlib import pyplot as plt

    # Build confusion matrix from ground truth labels and model predictions
    conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred_single)
    print('Confusion matrix:\n', conf_mat)

    # Plot matrix
    plt.matshow(conf_mat)
    plt.colorbar()
    plt.ylabel('Real Class')
    plt.xlabel('Predicted Class')
    plt.show()

Output:

    Confusion matrix:
     [[ 0  2]
     [ 0 18]]

Here we can clearly see that every prediction was for class = 1.

Choosing the right evaluation metric is dependent on understanding your data and model. More information about various metrics can be found here.
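As an illustration of how a confusion matrix feeds into other metrics, the sketch below derives per-class recall, per-class precision, and balanced accuracy (the mean of the per-class recalls) by hand from the matrix shown above. The helper names are hypothetical, and the code assumes every true class has at least one sample.

```python
# Confusion matrix from the single-feature model above:
# rows are true classes, columns are predicted classes
conf_mat = [[0, 2],
            [0, 18]]

def per_class_recall(matrix):
    """Recall per class: correct predictions divided by the row total."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def per_class_precision(matrix):
    """Precision per class: correct predictions divided by the column total.
    Returns None for a class that was never predicted (precision undefined)."""
    precisions = []
    for j in range(len(matrix)):
        col_total = sum(row[j] for row in matrix)
        precisions.append(matrix[j][j] / col_total if col_total else None)
    return precisions

recall = per_class_recall(conf_mat)
precision = per_class_precision(conf_mat)
balanced_accuracy = sum(recall) / len(recall)

print("Recall per class:", recall)              # Recall per class: [0.0, 1.0]
print("Precision per class:", precision)        # Precision per class: [None, 0.9]
print("Balanced accuracy:", balanced_accuracy)  # Balanced accuracy: 0.5
```

A balanced accuracy of 0.5 is what a classifier with no skill would achieve on two classes, which is a far more honest summary than the 90% accuracy reported earlier.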

Now that we understand how to better evaluate the results from a given model, we can consider how to manipulate our data to give the classifier a better chance of behaving as we expect.

**Using Weights**

Many machine learning algorithms have an option to associate a weight with each class. This allows us to increase, or decrease, the importance of each class during training. Below we will use the `class_weight` parameter with the `'balanced'` keyword on the previously used model. The `'balanced'` option uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data.

    # Redefine classifier model to use class_weight
    model = SGDClassifier(max_iter=1000, tol=1e-3, class_weight='balanced')
    # Train
    model.fit(X_train, y_train)
    # Predict
    y_pred_wtd = model.predict(X_test)

    # Get accuracy score
    accuracy = accuracy_score(y_test, y_pred_wtd)
    print('Accuracy: %.2f%%' % (accuracy * 100.0))
    print(itemfreq(y_pred_wtd))

    # Build confusion matrix
    conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred_wtd)
    print('Confusion matrix:\n', conf_mat)

    # Plot matrix
    plt.matshow(conf_mat)
    plt.colorbar()
    plt.ylabel('Real Class')
    plt.xlabel('Predicted Class')
    plt.show()

Output (the predicted-class counts are the column sums of the confusion matrix):

    Accuracy: 45.00%
    [[ 0. 13.]
     [ 1.  7.]]
    Confusion matrix:
     [[ 2  0]
     [11  7]]

Here we can see that, even though our accuracy has dropped, the predictions are no longer always for a single class. However, although we now have a model which will predict more than one class it’s still not very good at predicting the correct class. This is because there is still a real lack of data for the minority class.
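For reference, the `'balanced'` heuristic is documented by scikit-learn as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula by hand; the helper name `balanced_class_weights` is hypothetical, and the 5/95 label list stands in for the full synthetic dataset rather than the training split the model was actually fitted on.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Class distribution of the full synthetic dataset: 5 minority, 95 majority
y_all = [0] * 5 + [1] * 95
weights = balanced_class_weights(y_all)

for cls in sorted(weights):
    print(f"Class {cls}: weight {weights[cls]:.3f}")
# Class 0: weight 10.000
# Class 1: weight 0.526
```

Each minority example therefore counts roughly 19 times as much as a majority example in the loss, which explains why the weighted model stops ignoring class 0.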

**Using Sampling Strategies**

Another way to approach the problem of imbalance is to use some form of sampling, in order to balance the classes before giving them to the model. This allows for greater control of the data and domain appropriate strategy selection. We will explore some of the possible options in the following sections.

### Oversampling

In oversampling we create additional data for the minority class either by making duplicates from the minority class or by some method to make additional synthetic data that is representative of the minority class.

### Undersampling

In undersampling we remove data for the majority class either randomly or by some method to choose the most ‘appropriate’ points to remove.

### Random undersampling

In random undersampling we simply choose random data points from within the majority class and delete them until both classes are the same size.

    # Class counts (value_counts sorts by frequency, so the majority class 1 comes first)
    target_1_count, target_0_count = train_data.iloc[:, -1].value_counts()

    # Separate classes
    target_0 = train_data[train_data.iloc[:, -1] == 0]
    target_1 = train_data[train_data.iloc[:, -1] == 1]

    # Resample target 1 to match target 0 count
    target_1_undersample = target_1.sample(target_0_count)

    # Merge back to single df
    test_undersample = pd.concat([target_1_undersample, target_0], axis=0)

    # Show counts and plot
    print('Random under-sampling:')
    print(test_undersample.iloc[:, -1].value_counts())
    test_undersample.iloc[:, -1].value_counts().plot(kind='bar', title='Count (target)');

Output:

    Random under-sampling:
    0.0    5
    1.0    5
    Name: 5, dtype: int64

While undersampling is generally the preferred sampling option, as all the data remains ‘real’, it can lead to the situation where there is not enough data to train a model.

### Random oversampling

The easiest, and most naive, option is random oversampling, where we randomly duplicate data points within the minority class until both classes are the same size. Here we need to use `replace=True` in order to sample with replacement, which allows the method to duplicate data.

    # Resample target 0 to match target 1 count
    target_0_oversample = target_0.sample(target_1_count, replace=True)
    # Merge back to single df
    test_oversample = pd.concat([target_0_oversample, target_1], axis=0)

    # Show counts and plot
    print('Random over-sampling:')
    print(test_oversample.iloc[:, -1].value_counts())
    test_oversample.iloc[:, -1].value_counts().plot(kind='bar', title='Count (target)');

Output:

    Random over-sampling:
    1.0    95
    0.0    95
    Name: 5, dtype: int64

### Data Aware Sampling

Data aware methods attempt to improve sampling by making decisions based on relationships within the data. This is easier to visualise in two dimensions. Therefore we will use PCA to select 2 principal components from our data and define a helper function for plotting.

    from sklearn.decomposition import PCA

    # Define PCA model, specifying 2 dimensions
    pca = PCA(n_components=2)
    # Fit data
    X_2d = pca.fit_transform(X)

    # Plot helper function
    def draw_plot(X, y, label):
        for l in np.unique(y):
            plt.scatter(
                X[y==l, 0],
                X[y==l, 1],
                label=l
            )
        plt.title(label)
        plt.legend()
        plt.show()

    # Plot raw PCA
    draw_plot(X_2d, y, 'Raw Data')

### SMOTE

The Synthetic Minority Over-sampling TEchnique (SMOTE) generates new points for the minority class by interpolating between existing ones. For each minority data point, SMOTE finds its nearest neighbours within the minority class (the number considered is set by `k_neighbors`), picks one of them, and places a new synthetic point somewhere along the straight line connecting the two.

NOTE: By default `k_neighbors=5`. Here the minority class has only 5 members in total. Therefore, the maximum number of neighbours any single point can have is 4, so we use `k_neighbors=4`.

    from imblearn.over_sampling import SMOTE

    # Define SMOTE model and specify minority class for oversample
    # (older imbalanced-learn releases spelled these `ratio` and `fit_sample`)
    smote = SMOTE(sampling_strategy='minority', k_neighbors=4)
    # Fit data
    X_smote, y_smote = smote.fit_resample(X_2d, y)
    # Plot
    draw_plot(X_smote, y_smote, 'SMOTE')
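The interpolation step at the heart of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration rather than the imbalanced-learn implementation: the helper names `k_nearest` and `smote_point`, the toy coordinates, and the choice of one synthetic point per minority sample are all assumptions made for the example.

```python
import random

def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (squared Euclidean distance)."""
    def sq_dist(other):
        return sum((a - b) ** 2 for a, b in zip(point, other))
    return sorted(candidates, key=sq_dist)[:k]

def smote_point(point, neighbours, rng):
    """Create one synthetic point on the segment between `point`
    and a randomly chosen neighbour."""
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(0)

# Five toy minority points, mirroring the size of our minority class
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 2.0]]

synthetic = []
for i, point in enumerate(minority):
    others = minority[:i] + minority[i + 1:]
    neighbours = k_nearest(point, others, k=4)
    synthetic.append(smote_point(point, neighbours, rng))

for original, new in zip(minority, synthetic):
    print(original, "->", new)
```

Because every synthetic point lies between two real minority points, the new data never leaves the region already occupied by the minority class.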

### ADASYN

The ADAptive SYNthetic (ADASYN) sampling approach is essentially an extension of SMOTE. ADASYN attempts to infer which points in the minority class would be the most difficult for a model to learn and attempts to place a higher ratio of synthetic data close to these points.

    from imblearn.over_sampling import ADASYN

    # Define ADASYN model and specify minority class for oversample
    # (older imbalanced-learn releases spelled these `ratio` and `fit_sample`)
    adasyn = ADASYN(sampling_strategy='minority', n_neighbors=4)
    # Fit data
    X_adasyn, y_adasyn = adasyn.fit_resample(X_2d, y)
    # Plot
    draw_plot(X_adasyn, y_adasyn, 'ADASYN')
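How ADASYN decides where to concentrate synthetic data can be sketched as follows: for each minority point, measure the share of majority points among its `k` nearest neighbours, then allocate new samples in proportion to that share. This is a simplified illustration, not the imbalanced-learn code; the function name `adasyn_allocation` and the toy data are hypothetical, and because of rounding the allocations are not guaranteed to sum exactly to `n_synthetic` in general.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def adasyn_allocation(data, labels, minority_label, k, n_synthetic):
    """Split `n_synthetic` new samples across minority points in proportion
    to the fraction of majority points among each one's k nearest neighbours."""
    ratios = {}
    for i, (point, label) in enumerate(zip(data, labels)):
        if label != minority_label:
            continue
        others = [j for j in range(len(data)) if j != i]
        nearest = sorted(others, key=lambda j: sq_dist(point, data[j]))[:k]
        ratios[i] = sum(labels[j] != minority_label for j in nearest) / k
    total = sum(ratios.values())
    if total == 0:
        return {i: 0 for i in ratios}
    return {i: round(n_synthetic * r / total) for i, r in ratios.items()}

# Three minority points in a tight cluster, one surrounded by majority points
data = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.2], [3.0, 0.0],
        [3.2, 0.0], [2.8, 0.1], [3.1, 0.3], [1.0, 0.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

allocation = adasyn_allocation(data, labels, minority_label=0, k=3, n_synthetic=6)
print(allocation)  # {0: 1, 1: 1, 2: 1, 3: 3}
```

The isolated minority point at index 3, whose neighbours are all from the majority class, receives three times as many synthetic samples as each point in the well-separated cluster.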

### Tomek Links

The Tomek Links algorithm removes majority-class data points that form Tomek links. A Tomek link is a pair of data points from different classes that are each other's nearest neighbour.

    from imblearn.under_sampling import TomekLinks
    from collections import Counter

    # Define model (older imbalanced-learn releases used `ratio`,
    # `return_indices=True` and `fit_sample` instead)
    tome = TomekLinks(sampling_strategy='auto')
    # Fit data
    X_tome, y_tome = tome.fit_resample(X_2d, y)
    # Indices of the samples that were kept
    id_tome = tome.sample_indices_
    # Find removed indices
    idx_samples_removed = np.setdiff1d(np.arange(X_2d.shape[0]), id_tome)

    # Show result
    print('Removed indexes:', idx_samples_removed)
    print('Original dataset shape {}'.format(Counter(y)))
    print('Resampled dataset shape {}'.format(Counter(y_tome)))
    draw_plot(X_tome, y_tome, 'Tomek links under-sampling')

Output:

    Removed indexes: [16]
    Original dataset shape Counter({1.0: 95, 0.0: 5})
    Resampled dataset shape Counter({1.0: 94, 0.0: 5})
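The definition of a Tomek link translates almost directly into code. The sketch below is a brute-force illustration on toy data, assuming Euclidean distance; the helper names are hypothetical and it is not how imbalanced-learn implements the search.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbour(i, data):
    """Index of the point closest to data[i], excluding i itself."""
    others = (j for j in range(len(data)) if j != i)
    return min(others, key=lambda j: sq_dist(data[i], data[j]))

def find_tomek_links(data, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    nn = [nearest_neighbour(i, data) for i in range(len(data))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]

# One minority point (label 0) sitting right next to a majority point (label 1)
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [3.0, 3.0], [3.2, 3.0]]
labels = [0, 1, 1, 1, 1, 1]

links = find_tomek_links(data, labels)
print("Tomek links:", links)  # Tomek links: [(0, 1)]

# Under-sampling keeps the minority member and drops the majority member
majority_to_remove = [i for pair in links for i in pair if labels[i] == 1]
print("Majority indices to remove:", majority_to_remove)  # Majority indices to remove: [1]
```

Only pairs that straddle the class boundary are affected, which is why just a single majority point was removed from our dataset above.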

### Mixed Strategies

We can also use mixed, or ensemble, approaches such as performing oversampling using SMOTE and cleaning the data using Tomek links.

    from imblearn.combine import SMOTETomek

    # Define model using previous SMOTE model
    # (older imbalanced-learn releases spelled these `ratio` and `fit_sample`)
    smto = SMOTETomek(sampling_strategy='auto', smote=smote)
    # Fit data
    X_smto, y_smto = smto.fit_resample(X_2d, y)
    # Plot
    draw_plot(X_smto, y_smto, 'SMOTE + TOMEK')

Let’s try and use the data generated through `SMOTETomek` with our classifier from earlier and see if the results have improved.

    # Create train:test split for new data
    X_train_st, X_test_st, y_train_st, y_test_st = \
        train_test_split(X_smto, y_smto, test_size=0.2, random_state=1)

    # Define model
    model = SGDClassifier(max_iter=1000, tol=1e-3, class_weight='balanced')
    # Fit data
    model.fit(X_train_st, y_train_st)
    # Predict
    y_pred_st = model.predict(X_test_st)

    # Get accuracy score
    accuracy = accuracy_score(y_test_st, y_pred_st)
    # Build confusion matrix
    conf_mat = confusion_matrix(y_true=y_test_st, y_pred=y_pred_st)

    # Output results
    print('Accuracy: %.2f%%' % (accuracy * 100.0))
    print(itemfreq(y_pred_st))
    print('Confusion matrix:\n', conf_mat)
    plt.matshow(conf_mat)
    plt.colorbar()
    plt.ylabel('Real Class')
    plt.xlabel('Predicted Class')
    plt.show()

Output:

    Accuracy: 80.56%
    [[ 0. 16.]
     [ 1. 20.]]
    Confusion matrix:
     [[13  4]
     [ 3 16]]

We can see now that we have an accuracy of around 80% and the confusion matrix shows high diagonal values. Taking both of these results together, we can show that we now have a classifier which is capable of predicting both classes with a good degree of accuracy.