12/10/2018

Dealing with Imbalanced Data

Author: John Brennan Company: Newcastle University Blog

One of the main challenges faced across many domains, when using machine learning, is data imbalance.

Machine learning algorithms are prone to returning unsatisfactory predictions when trained on imbalanced datasets, because the classifier often learns to simply predict the majority class all of the time. Take an extreme example: in a two-class dataset where the minority class accounts for only 5% of the data, a classifier that always predicts the majority class would report an accuracy of 95%. Taken in isolation this result looks good; in reality, we have produced a useless classifier.
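
To make the 95% figure concrete, here is a minimal sketch using hypothetical labels (not the dataset generated later in this post) and a "classifier" that always predicts the majority class. Alongside accuracy it reports recall on the minority class, which exposes the problem.

```python
import numpy as np

# Hypothetical labels: 5 minority (class 0) and 95 majority (class 1) samples
y_true = np.array([0] * 5 + [1] * 95)

# A "classifier" that ignores its input and always predicts the majority class
y_pred = np.ones_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the minority class: share of real class-0 samples predicted as 0
minority_recall = (y_pred[y_true == 0] == 0).mean()

print(f"Accuracy: {accuracy:.2%}")
print(f"Minority recall: {minority_recall:.2%}")
```

The accuracy comes out at 95% while not a single minority sample is identified.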

In this post we will consider some strategies for dealing with imbalanced data. We will be using a synthetic dataset generated using `sklearn.datasets`:

```import numpy as np
import pandas as pd
from sklearn.datasets import make_classification```
```# Generate the dataset
X, y = make_classification(n_classes=2,
class_sep=0.5,
weights=[0.05, 0.95],
n_informative=1,
n_redundant=1,
flip_y=0,
n_features=5,
n_clusters_per_class=1,
n_samples=100,
random_state=2)
train_data = pd.DataFrame(np.column_stack((X,y)))```
```# Count classes and plot
target_count = train_data.iloc[:,-1].value_counts()
print('Class 0:', target_count[0])
print('Class 1:', target_count[1])
target_count.plot(kind='bar', title='Count (target)');```
```Class 0: 5
Class 1: 95```

Model Evaluation

One of the most common mistakes that beginners in machine learning make when working with imbalanced data is not properly considering their model evaluation metrics. A single metric like `accuracy_score` can be misleading because it oversimplifies model evaluation, producing results which look good but are not.

Consider the following example, using a simple SGD Classifier with the default values and no feature manipulation.

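```from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
try:
    from scipy.stats import itemfreq  # removed in SciPy 1.3
except ImportError:
    def itemfreq(a):  # equivalent built on np.unique
        return np.column_stack(np.unique(a, return_counts=True))```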
```# Separate the features from the target (the last column)
labels = train_data.columns[:-1]
X = train_data[labels]
y = train_data.iloc[:,-1]```
```# Split the data into train:test @ ratio 80:20
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=1)```
```# Define the classifier model
model = SGDClassifier(max_iter=1000, tol=1e-3)```
```# Fit data to the model (train)
model.fit(X_train, y_train)```
```# Predict results of test data
y_pred_raw = model.predict(X_test)```
```# Get accuracy score
accuracy = accuracy_score(y_test, y_pred_raw)
print('Accuracy: %.2f%%' % (accuracy * 100.0))```
```# Show predicted classes
print(itemfreq(y_pred_raw))

```
```Accuracy: 90.00%
[[ 1. 20.]]```

We can see that this model produces an accuracy of 90%, but this is achieved simply by predicting class 1 every time. Let’s further prove the point by slightly modifying the classifier. This time we will train using only a single feature, which should have a negative impact on the accuracy, as the classifier will have less information about each data point.

NOTE: The accuracy shown deviates from the expected 95% because there are 2 examples of the minority class in the test set of 20 observations.

```# Fit single feature data to the model (train)
model.fit(X_train.iloc[:,-1].values.reshape(-1, 1), y_train)
# Predict results of test data
y_pred_single = model.predict(
X_test.iloc[:,-1].values.reshape(-1, 1))```
```# Get accuracy score
accuracy = accuracy_score(y_test, y_pred_single)
print('Accuracy: %.2f%%' % (accuracy * 100.0))```
```# Show predicted classes
print(itemfreq(y_pred_single))```
```Accuracy: 90.00%
[[ 1. 20.]]```

As we can see, the high accuracy rate is a very misleading metric.

A better way to evaluate the results is through the use of a confusion matrix. In a confusion matrix the numbers of correct and incorrect predictions are summarised as counts and broken down by class, showing what each target was predicted to be by the classifier. High diagonal values indicate a high number of correct predictions across all classes.

```from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt```
```# Build confusion matrix from ground truth labels and model predictions
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred_single)
print('Confusion matrix:\n', conf_mat)```
```# Plot matrix
plt.matshow(conf_mat)
plt.colorbar()
plt.ylabel('Real Class')
plt.xlabel('Predicted Class')
plt.show()```
```Confusion matrix:
[[ 0 2]
[ 0 18]]```

Here we can clearly see that every prediction was for class = 1.
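
The confusion matrix also lets us derive per-class metrics that accuracy hides. The following is a minimal sketch that hard-codes the matrix printed above and computes recall, precision and balanced accuracy (the mean of the per-class recalls) by hand; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix from the single-feature model above
# (rows = real class, columns = predicted class)
conf_mat = np.array([[0, 2],
                     [0, 18]])

# Recall per class: correct predictions divided by the number of real samples
recall = np.diag(conf_mat) / conf_mat.sum(axis=1)

# Precision per class: correct predictions divided by the number of predictions;
# a class that is never predicted gets 0 here instead of a division by zero
predicted = conf_mat.sum(axis=0)
precision = np.divide(np.diag(conf_mat), predicted,
                      out=np.zeros(len(conf_mat)), where=predicted > 0)

accuracy = np.trace(conf_mat) / conf_mat.sum()
balanced_accuracy = recall.mean()

print('Accuracy:', accuracy)
print('Recall per class:', recall)
print('Precision per class:', precision)
print('Balanced accuracy:', balanced_accuracy)
```

Recall for class 0 is zero, and the balanced accuracy of 0.5 is what we would expect from a model that carries no information about the minority class.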

Now that we understand how to better evaluate the results from a given model, we can consider how to manipulate our data to give the classifier a better chance of behaving as we expect.

Using Weights

Many machine learning algorithms have an option to associate weights with each class. This allows us to increase, or decrease, the importance of any class within our dataset, so that errors on the minority class cost more during training. Below we will use the `class_weight` parameter with the `'balanced'` keyword on the previously used model. The `'balanced'` option uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data.
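
As a rough illustration of what `'balanced'` does, scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to a hypothetical label vector with the same 5:95 split as our full dataset; the weights used by the model itself are computed from `y_train`, so they will differ slightly.

```python
import numpy as np

# Hypothetical label vector with the same 5:95 split as our dataset
y_example = np.array([0] * 5 + [1] * 95)

classes, counts = np.unique(y_example, return_counts=True)
# 'balanced' heuristic: n_samples / (n_classes * count of each class)
weights = len(y_example) / (len(classes) * counts)

for c, w in zip(classes, weights):
    print(f"class {c}: weight {w:.3f}")
```

Each minority sample is weighted at 10 and each majority sample at roughly 0.526, so both classes contribute the same total weight.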

```# Redefine classifier model to use class_weight
model = SGDClassifier(max_iter=1000,
                      tol=1e-3,
                      class_weight='balanced')
# Train
model.fit(X_train, y_train)
# Predict
y_pred_wtd = model.predict(X_test)```
```# Get accuracy score
accuracy = accuracy_score(y_test, y_pred_wtd)
print('Accuracy: %.2f%%' % (accuracy * 100.0))
print(itemfreq(y_pred_wtd))```
```# Build confusion matrix
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred_wtd)
print('Confusion matrix:\n', conf_mat)```
```# Plot matrix
plt.matshow(conf_mat)
plt.colorbar()
plt.ylabel('Real Class')
plt.xlabel('Predicted Class')
plt.show()```
```Accuracy: 45.00%
[[ 0. 13.]
[ 1. 7.]]
Confusion matrix:
[[ 2 0]
[11 7]]```

Here we can see that, even though our accuracy has dropped, the predictions are no longer always for a single class. However, although we now have a model which will predict more than one class it’s still not very good at predicting the correct class. This is because there is still a real lack of data for the minority class.

Using Sampling Strategies

Another way to approach the problem of imbalance is to use some form of sampling, in order to balance the classes before giving them to the model. This allows for greater control of the data and domain appropriate strategy selection. We will explore some of the possible options in the following sections.

Oversampling

In oversampling we create additional data for the minority class either by making duplicates from the minority class or by some method to make additional synthetic data that is representative of the minority class.

Undersampling

In undersampling we remove data for the majority class either randomly or by some method to choose the most ‘appropriate’ points to remove.

Random undersampling

In random undersampling we simply choose random data points from within the majority class and delete them until both classes are the same size.

```# Class count (value_counts sorts by frequency, so majority class 1 comes first)
target_1_count, target_0_count = train_data.iloc[:,-1].value_counts()```
```# Separate classes
target_0 = train_data[train_data.iloc[:,-1] == 0]
target_1 = train_data[train_data.iloc[:,-1] == 1]```
```# Resample target 1 to match target 0 count
target_1_undersample = target_1.sample(target_0_count)```
```# Merge back to single df
test_undersample = pd.concat([target_1_undersample, target_0],
                             axis=0)```
```# Show counts and plot
print('Random under-sampling:')
print(test_undersample.iloc[:,-1].value_counts())
test_undersample.iloc[:,-1].value_counts().plot(kind='bar', title='Count (target)');```
```Random under-sampling:
0.0 5
1.0 5
Name: 5, dtype: int64```

While undersampling is often the preferred sampling option, as all the data remains ‘real’, it can leave too little data to train a model. Here only 10 observations remain.

Random oversampling

The easiest, and most naive, option is random oversampling, where we randomly duplicate data points within the minority class until both classes are the same size. Here we need to use `replace=True` in order to sample with replacement, which allows the method to duplicate data.

```# Resample target0 to match target 1 count
target_0_oversample = target_0.sample(target_1_count, replace=True)
# Merge back to single df
test_oversample = pd.concat([target_0_oversample, target_1], axis=0)```
```# Show counts and plot
print('Random over-sampling:')
print(test_oversample.iloc[:,-1].value_counts())
test_oversample.iloc[:,-1].value_counts().plot(kind='bar', title='Count (target)');```
```Random over-sampling:
1.0 95
0.0 95
Name: 5, dtype: int64```

Data Aware Sampling

Data aware methods attempt to improve sampling by making decisions based on relationships within the data. This is easier to visualise in two dimensions, so we will use PCA to project our data onto 2 principal components and define a helper function for plotting.

`from sklearn.decomposition import PCA`
```# Define PCA model, specifying 2 dimensions
pca = PCA(n_components=2)
# Fit data
X_2d = pca.fit_transform(X)```
```# Plot helper function: one scatter series per class
def draw_plot(X, y, label):
    for l in np.unique(y):
        plt.scatter(
            X[y==l, 0],
            X[y==l, 1],
            label=l
        )
    plt.title(label)
    plt.legend()
    plt.show()```
```# plot raw PCA
draw_plot(X_2d, y, 'Raw Data')```

SMOTE

The Synthetic Minority Over-sampling TEchnique (SMOTE) generates new points for the minority class by interpolation. For each minority point, SMOTE finds its nearest neighbours within the minority class (`k_neighbors` of them), picks one of those neighbours at random, and places a new synthetic point somewhere on the straight line connecting the two.

NOTE: By default `k_neighbors=5`. Here the minority class has only 5 members in total, so the maximum number of neighbours any single point can have is 4; we therefore use `k_neighbors=4`.

`from imblearn.over_sampling import SMOTE`
```# Define SMOTE model and specify minority class for oversample
# (imbalanced-learn >= 0.4; older releases used ratio= and fit_sample)
smote = SMOTE(sampling_strategy='minority', k_neighbors=4)
# Fit data
X_smote, y_smote = smote.fit_resample(X_2d, y)
# Plot
draw_plot(X_smote, y_smote, 'SMOTE')```
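
To make the interpolation step concrete, here is a simplified NumPy sketch of the idea behind SMOTE. It is not the imbalanced-learn implementation: `smote_sketch` and its parameters are hypothetical names, the minority points are made up, and base points are chosen at random rather than processed one by one.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=4, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours per point

    base = rng.integers(0, len(X_min), size=n_new)              # random base points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                 # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority class of 5 points in 2 dimensions
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_sketch(X_min, n_new=90)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the new data never leaves the region already spanned by the minority class.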

ADASYN

The ADAptive SYNthetic (ADASYN) sampling approach is essentially an extension of SMOTE. ADASYN attempts to infer which points in the minority class would be the most difficult for a model to learn and attempts to place a higher ratio of synthetic data close to these points.
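
The "difficulty" of a minority point is usually measured as the share of majority-class points among its nearest neighbours. The sketch below shows only that weighting step, on made-up data; `adasyn_allocation` is a hypothetical helper, not part of imbalanced-learn, and it assumes there are no duplicate points.

```python
import numpy as np

def adasyn_allocation(X, y, minority=0, k=2, n_new=90):
    """Return how many synthetic points to create around each minority point."""
    X_min = X[y == minority]
    # Distances from every minority point to every point in the dataset
    dists = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    # k nearest neighbours, skipping column 0 (the point itself, distance 0)
    nn_idx = np.argsort(dists, axis=1)[:, 1:k + 1]
    # Difficulty r_i: share of majority-class points among the k neighbours
    r = (y[nn_idx] != minority).mean(axis=1)
    if r.sum() == 0:
        raise ValueError("no minority point has a majority-class neighbour")
    # Normalise to a distribution and scale to the number of points wanted
    return np.rint(r / r.sum() * n_new).astype(int)

X_demo = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],  # majority (class 1)
    [0.5, 0.5],                                       # minority, inside the majority cluster
    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],               # minority, far from the majority
    [2.0, 2.0],                                       # minority, at the edge of the majority
])
y_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])

allocation = adasyn_allocation(X_demo, y_demo)
print(allocation)
```

In this toy example the point surrounded by the majority class receives two thirds of the synthetic points, the point at the edge receives one third, and the three minority points far from the majority receive none.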

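`from imblearn.over_sampling import ADASYN`
```# Define ADASYN model and specify minority class for oversample
# (n_neighbors=4 is assumed here, for the same reason as k_neighbors=4 above)
adasyn = ADASYN(sampling_strategy='minority', n_neighbors=4)
# Fit Data
X_adasyn, y_adasyn = adasyn.fit_resample(X_2d, y)
# Plot
draw_plot(X_adasyn, y_adasyn, 'ADASYN')```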

Tomek Links

The Tomek Links algorithm removes data from the majority class that form Tomek links. A Tomek link is a pair of points from different classes that are each other’s nearest neighbours.
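
The definition translates almost directly into code. The following brute-force sketch finds Tomek links on a small made-up dataset; `find_tomek_links` is a hypothetical helper written for illustration and is not how imbalanced-learn implements the search.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours of different classes."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)      # a point is not its own neighbour
    nearest = dists.argmin(axis=1)       # index of each point's nearest neighbour
    links = []
    for i, j in enumerate(nearest):
        # i < j reports each pair once; nearest[j] == i makes the relation mutual
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Hypothetical data: three majority points (class 1) and two minority points (class 0)
X_demo = np.array([[0.0, 0.0], [1.1, 0.0], [2.0, 0.0], [2.4, 0.0], [5.0, 0.0]])
y_demo = np.array([1, 1, 1, 0, 0])

links = find_tomek_links(X_demo, y_demo)
# Undersampling keeps the minority point and drops the majority member of each link
majority_to_drop = [i if y_demo[i] == 1 else j for i, j in links]
print(links, majority_to_drop)
```

Only points 2 and 3 form a link here, so undersampling would remove point 2, the majority-class member of the pair.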

```from imblearn.under_sampling import TomekLinks
from collections import Counter```
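```# Define model (default settings: only majority-class members of a link are removed)
tome = TomekLinks()
# Fit data (imbalanced-learn >= 0.4; kept indices are exposed as sample_indices_)
X_tome, y_tome = tome.fit_resample(X_2d, y)
id_tome = tome.sample_indices_
# Find removed indices
idx_samples_removed = np.setdiff1d(np.arange(X_2d.shape[0]), id_tome)```
```# Show result
print('Removed indexes:', idx_samples_removed)
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_tome)))```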
```Removed indexes: 
Original dataset shape Counter({1.0: 95, 0.0: 5})
Resampled dataset shape Counter({1.0: 94, 0.0: 5})```

Mixed Strategies

We can also combine strategies, for example by oversampling with SMOTE and then cleaning the data using Tomek links.

`from imblearn.combine import SMOTETomek`
```# Define model using previous SMOTE model
smto = SMOTETomek(sampling_strategy='auto', smote=smote)
# Fit data
X_smto, y_smto = smto.fit_resample(X_2d, y)
# Plot
draw_plot(X_smto, y_smto, 'SMOTE + TOMEK')```

Let’s try and use the data generated through `SMOTETomek` with our classifier from earlier and see if the results have improved.

```# Create train: test split for new data
X_train_st, X_test_st, y_train_st, y_test_st = \
train_test_split(X_smto, y_smto, test_size=0.2, random_state=1)```
```# Define model
model = SGDClassifier(max_iter=1000,
tol=1e-3,
class_weight='balanced')
# Fit data
model.fit(X_train_st, y_train_st)
# Predict
y_pred_st = model.predict(X_test_st)```
```# Get accuracy score
accuracy = accuracy_score(y_test_st, y_pred_st)
# Build confusion matrix
conf_mat = confusion_matrix(y_true=y_test_st, y_pred=y_pred_st)```
```# Output results
print('Accuracy: %.2f%%' % (accuracy * 100.0))
print(itemfreq(y_pred_st))
print('Confusion matrix:\n', conf_mat)
plt.matshow(conf_mat)
plt.colorbar()
plt.ylabel('Real Class')
plt.xlabel('Predicted Class')
plt.show()```
```Accuracy: 80.56%
[[ 0. 16.]
[ 1. 20.]]
Confusion matrix:
[[13 4]
[ 3 16]]```

We can now see that we have an accuracy of around 80% and that the confusion matrix shows high diagonal values. Taking both of these results together, we can show that we now have a classifier which is capable of predicting both classes with a good degree of accuracy.