This is a thorough analysis of the diabetes dataset from Kaggle. In this article, I will demonstrate how to carry out an end-to-end machine learning project using a range of classification models, including K-Nearest Neighbors, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest.
For everything you may need, visit:
Source Data
Hope this helps!


I cannot emphasize enough the importance of data preprocessing. It is the process of transforming raw data into a cleaner, more useful, and more efficient format. It is a must and one of the most important steps in a machine learning project: without it, our models will probably break or fail to generate good results.
There are several steps in data preprocessing: data cleaning, data reduction and data transformation. In the following paragraphs, I'll show you how to perform the steps one by one.

Data Cleaning

First, I load the data from the .csv file and print the first ten rows to get a basic understanding of what the data looks like.

# packages needed
import numpy as np
import pandas as pd

# load data
data = pd.read_csv('diabetes.csv')
data.head(10)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
5 5 116 74 0 0 25.6 0.201 30 0
6 3 78 50 32 88 31.0 0.248 26 1
7 10 115 0 0 0 35.3 0.134 29 0
8 2 197 70 45 543 30.5 0.158 53 1
9 8 125 96 0 0 0.0 0.232 54 1

The meaning of each column is as follows:

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 or 1)
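
Before checking for missing values, it can help to confirm the shape and column dtypes. This is an optional sanity check that is not part of the original walkthrough; print(data.shape) and data.info() are standard pandas calls.

# optional sanity check: the dataset should have 768 rows and 9 numeric columns
print(data.shape)
data.info()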

Next, we need to examine whether there are any null values in the dataset.

data.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

It seems there are no null values. Shall we pop the champagne now? If we examine the data further, we will find that the situation is not as good as we thought.

data.describe().round(2).T
count mean std min 25% 50% 75% max
Pregnancies 768.0 3.85 3.37 0.00 1.00 3.00 6.00 17.00
Glucose 768.0 120.89 31.97 0.00 99.00 117.00 140.25 199.00
BloodPressure 768.0 69.11 19.36 0.00 62.00 72.00 80.00 122.00
SkinThickness 768.0 20.54 15.95 0.00 0.00 23.00 32.00 99.00
Insulin 768.0 79.80 115.24 0.00 0.00 30.50 127.25 846.00
BMI 768.0 31.99 7.88 0.00 27.30 32.00 36.60 67.10
DiabetesPedigreeFunction 768.0 0.47 0.33 0.08 0.24 0.37 0.63 2.42
Age 768.0 33.24 11.76 21.00 24.00 29.00 41.00 81.00
Outcome 768.0 0.35 0.48 0.00 0.00 0.00 1.00 1.00
We can see from the summary above that the minimum value of Glucose, BloodPressure, SkinThickness, Insulin and BMI is zero. Common sense tells us that these measurements cannot be zero for a living person. Apparently, missing values were recorded as zeros when the data was entered. Here we count the zeros in each column so that we can better decide how to deal with them.

for col in list(data.columns):
    zero = len(data[data[col] == 0])
    print('0s in %s: %d' % (col, zero))
0s in Pregnancies: 111
0s in Glucose: 5
0s in BloodPressure: 35
0s in SkinThickness: 227
0s in Insulin: 374
0s in BMI: 11
0s in DiabetesPedigreeFunction: 0
0s in Age: 0
0s in Outcome: 500

There are nearly 400 missing values in the Insulin column (zeros in Pregnancies and Outcome are legitimate values, so they don't count as missing), so it's not realistic to drop all the rows containing missing values. Normally, under such circumstances, we replace them with statistics such as the mean, median, or mode. Here I choose the median.

# replace 0s in the affected columns (Glucose through BMI) with NaN
data.iloc[:, 1:6] = data.iloc[:, 1:6].replace({0: np.nan})
# fill the NaNs with each column's median
data = data.fillna(data.median())
data.describe().round(2).T
count mean std min 25% 50% 75% max
Pregnancies 768.0 3.85 3.37 0.00 1.00 3.00 6.00 17.00
Glucose 768.0 121.66 30.44 44.00 99.75 117.00 140.25 199.00
BloodPressure 768.0 72.39 12.10 24.00 64.00 72.00 80.00 122.00
SkinThickness 768.0 29.11 8.79 7.00 25.00 29.00 32.00 99.00
Insulin 768.0 140.67 86.38 14.00 121.50 125.00 127.25 846.00
BMI 768.0 32.46 6.88 18.20 27.50 32.30 36.60 67.10
DiabetesPedigreeFunction 768.0 0.47 0.33 0.08 0.24 0.37 0.63 2.42
Age 768.0 33.24 11.76 21.00 24.00 29.00 41.00 81.00
Outcome 768.0 0.35 0.48 0.00 0.00 0.00 1.00 1.00

Pairplot & Correlation Matrix -- Imbalanced Dataset

Now we can visualize the dataset and see if we can find something else.

import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(data, hue="Outcome", corner=True)
Pairplot - Imbalanced Dataset
corr_mat = data.corr().round(2)
plt.figure(figsize = (15,10))
sns.heatmap(corr_mat, annot=True)

Correlation Matrix - Imbalanced Dataset
From the pairplot above, we discover that this is an imbalanced dataset: there are 500 observations whose outcome is 0 and only 268 whose outcome is 1. If we train our model on such data, the results may be misleading. There are several methods to deal with imbalanced data:

  1. Collect more data
  2. Under-Sampling
  3. Over-Sampling
  4. Use a confusion matrix or other suitable metrics to evaluate the performance

As the number of observations is only 768, it's unrealistic to under-sample. Here I will illustrate how to do over-sampling using SMOTE(); first, a quick check of the class counts below confirms the imbalance.
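
This is just value_counts() on the Outcome column, a standard pandas call that is not in the original write-up; the counts match the 500/268 split mentioned above.

# count the observations in each outcome class
print(data["Outcome"].value_counts())
# 0    500
# 1    268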

Oversampling with SMOTE()

We can use SMOTE() to over-sample the minority class so that the number of observations in each outcome class is the same. After doing that, we describe the data again.

from imblearn.over_sampling import SMOTE 
# get X, y
X          = data.iloc[:, 0:8]
y          = data.iloc[:, -1]
# perform SMOTE()
oversample = SMOTE()
X, y       = oversample.fit_resample(X, y)
# join X, y
y          = pd.DataFrame(y)
data       = pd.concat([X, y], axis=1)
print(data.shape)
data.describe().round(2).T
count mean std min 25% 50% 75% max
Pregnancies 1000.0 3.99 3.33 0.00 1.00 3.00 6.00 17.00
Glucose 1000.0 126.40 31.39 44.00 102.82 122.00 148.00 199.00
BloodPressure 1000.0 72.97 11.71 24.00 65.75 72.00 80.00 122.00
SkinThickness 1000.0 29.64 8.22 7.00 27.00 29.00 33.00 99.00
Insulin 1000.0 146.58 90.69 14.00 125.00 125.00 136.25 846.00
BMI 1000.0 33.01 6.68 18.20 28.40 32.66 36.80 67.10
DiabetesPedigreeFunction 1000.0 0.49 0.32 0.08 0.26 0.40 0.65 2.42
Age 1000.0 33.88 11.34 21.00 25.00 31.00 41.00 81.00
Outcome 1000.0 0.50 0.50 0.00 0.00 0.50 1.00 1.00

Pairplot & Correlation Matrix -- Balanced Dataset

Once again, we visualize the dataset to see how the distribution of the data has changed.

sns.pairplot(data, hue="Outcome", corner=True)

Pairplot - Balanced Dataset
We have now transformed the original imbalanced dataset into a balanced one, with 500 observations in each outcome class. There is not much difference in distribution between diabetics and non-diabetics for most features, but Glucose should have the highest correlation with Outcome among them. We can verify this using a correlation matrix heatmap.

corr_mat = data.corr().round(2)
plt.figure(figsize = (15,10))
sns.heatmap(corr_mat, annot=True)
Correlation Matrix - Balanced Dataset

The features with a higher correlation with Outcome are more important and more informative for our modeling process.
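
To read those correlations off directly instead of scanning the heatmap, we can sort the Outcome column of the correlation matrix. This is a small optional sketch that reuses the corr_mat already computed above.

# rank features by the absolute value of their correlation with Outcome
corr_with_outcome = corr_mat["Outcome"].drop("Outcome")
print(corr_with_outcome.abs().sort_values(ascending=False))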

Data Transformation

If the ranges of two features differ a lot, for example, one runs from 0 to 1 and the other from 0 to 10,000, models obviously won't treat the two features equally. Normally, the feature with the larger range influences the result more. Nevertheless, this doesn't mean it is more important than the other one. That's why we need to scale the features. Frequently used scaling techniques are:

  • MinMaxScaler()
  • StandardScaler()
  • MaxAbsScaler()
  • RobustScaler()

I usually choose MinMaxScaler() or StandardScaler() for most datasets. You can also use others according to the characteristics of the dataset at hand.
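
To see the difference between the two, here is a tiny illustrative sketch on a single made-up column; the values are hypothetical and only meant to show how the two transforms behave.

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

toy = np.array([[44.0], [99.0], [117.0], [140.0], [199.0]])  # hypothetical glucose-like values
print(MinMaxScaler().fit_transform(toy).ravel())    # squeezed into the [0, 1] range
print(StandardScaler().fit_transform(toy).ravel())  # centered to mean 0, unit variance

With that in mind, here is the scaling step on the actual dataset, using StandardScaler().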

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# get X, y
X  = data.drop(["Outcome"], axis=1)
y  = data["Outcome"]

X2 = X.copy()
y2 = y.copy()

# split X, y into a training set and a validation set
X2_train, X2_trainValid, y2_train, y2_trainValid = train_test_split(X2, y2, test_size=0.5)
# perform feature scaling
scaler   = StandardScaler()

X2_train = scaler.fit_transform(X2_train)
print("The training set:")
print(X2_train)

X2_trainValid = scaler.transform(X2_trainValid)
print("The validation set:")
print(X2_trainValid)
The training set:
[[ 0.91861577  1.16692299  1.18785055 ... -0.37252998 -1.00546758
   1.12702163]
 [-0.31857719  2.14006051 -0.37791467 ... -0.29655845 -0.59256699
  -0.00540078]
 [-1.2464719  -0.54138673  1.3123182  ... -0.07795059  0.92337497
   0.25592747]
 ...
 [ 0.60931753  0.71279215  0.66592881 ... -0.50927873 -0.96541006
   1.38834988]
 [ 0.91861577  0.69093641  0.49301433 ...  0.36541497  0.17583587
   1.4754593 ]
 [-1.2464719  -1.00641748 -0.20394076 ...  1.61792412 -0.44466231
  -1.1378232 ]]
  
The validation set:
[[-0.93717366 -1.13616915 -0.89983642 ... -0.85874777  0.2732917
  -0.87649495]
 [-0.62787543 -1.03885539 -0.7258625  ...  0.78223728  0.56293838
  -0.96360437]
 [ 0.30001929 -1.20104498 -0.55188859 ... -1.28418834 -0.46006904
  -0.35383845]
 ...
 [-0.62787543 -0.94154164 -0.37791467 ... -1.78560044  0.47974199
  -0.70227612]
 [ 0.30001929  0.61547839  0.83990272 ... -0.12942108 -0.1211208
   2.08522521]
 [ 0.60931753 -0.58472455  1.36182446 ...  0.59990561  0.72624981
  -0.26672903]]

Data Reduction

With clean data at hand, we can visualize it, which may reveal relations between features or between features and the target. Knowing these relations is useful for data reduction: for example, if a feature is highly correlated with the target, we had better retain it, and if two features are highly correlated with each other, it is easier to drop one of them. Here I use PCA (principal component analysis) as a demonstration.

from sklearn.decomposition import PCA

pca      = PCA()   # no n_components specified, so all eight components are kept

X2_train = pca.fit_transform(X2_train)
print("The training set:")
print(X2_train)

X2_trainValid = pca.transform(X2_trainValid)
print("The validation set:")
print(X2_trainValid)
The training set:
[[ 1.11094069  1.71567637  0.06106008 ... -0.41295852  0.00731795
  -0.44959938]
 [-0.35117318  0.31229466  1.6917659  ... -1.61109567 -0.49354468
   0.3728086 ]
 [ 0.16735037 -0.4001708  -0.38262459 ...  0.82802415  0.95458562
  -0.10766268]
 ...
 [ 0.66464415  1.65681528  0.09661504 ... -0.34339558  0.49824185
  -0.21578121]
 [ 1.89124805  0.9469956   0.91665363 ...  0.40528492  0.16728102
   0.47349385]
 [-0.77480095 -1.71331367 -1.18697513 ...  0.26478707 -0.22960668
   1.25339887]]
The validation set:
[[-2.92512666 -0.09229361  0.31554059 ... -0.10102492 -0.22291171
   0.68989167]
 [-0.55971697 -1.49656753 -0.53947916 ...  0.41658342 -0.12598596
   0.48539238]
 [-2.27295931  1.03707233 -0.42829519 ... -0.03041183 -0.23491582
  -0.2007138 ]
 ...
 [-2.79570134  0.52778715  0.64923584 ...  0.15441194 -0.14783737
  -0.13731851]
 [ 1.54801396  1.38771828  1.28965127 ...  0.67922686  0.85859194
   0.66502772]
 [ 0.65313216  0.2231839  -0.87818227 ...  0.76275166 -0.74002896
   0.09427758]]
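
Since PCA() was called without arguments, all eight components are kept and nothing has actually been dropped yet. One common optional next step, not part of the walkthrough above, is to inspect how much variance each component explains and keep only the leading ones; explained_variance_ratio_ is a standard attribute of a fitted scikit-learn PCA.

# proportion of variance explained by each principal component, and the running total
print(pca.explained_variance_ratio_.round(3))
print(pca.explained_variance_ratio_.cumsum().round(3))
# to actually reduce dimensionality, we could refit on the scaled training data with,
# e.g., PCA(n_components=0.95) to keep about 95% of the variance (an illustrative threshold)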