This is a thorough analysis of the diabetes dataset from Kaggle. In this article, I will demonstrate how to carry out an end-to-end machine learning project using diverse classification models, including K-Nearest Neighbors, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest.
For everything you may need, visit:
Source Data
Hope this helps!
I cannot emphasize enough the importance of data preprocessing. It is the process of transforming raw data into a more reasonable, useful and efficient format. It is a must and one of the most important steps in a machine learning project. Without it, our models will probably struggle and fail to generate good results.
There are several steps in data preprocessing: data cleaning, data reduction and data transformation. In the following paragraphs, I'll show you how to perform the steps one by one.
Data Cleaning
First, I load the data from the .csv file and print the first ten rows to get a basic understanding of what the data looks like.
# packages needed
import numpy as np
import pandas as pd
# load data
data = pd.read_csv('diabetes.csv')
data.head(10)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |
The meaning of each column is as follows:
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)
Next, we need to examine whether there are any null values in the dataset.
data.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
It seems there are no null values. Shall we pop the champagne now? If we examine the data further, we will find that the situation is not as good as we thought.
data.describe().round(2).T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 |
| Glucose | 768.0 | 120.89 | 31.97 | 0.00 | 99.00 | 117.00 | 140.25 | 199.00 |
| BloodPressure | 768.0 | 69.11 | 19.36 | 0.00 | 62.00 | 72.00 | 80.00 | 122.00 |
| SkinThickness | 768.0 | 20.54 | 15.95 | 0.00 | 0.00 | 23.00 | 32.00 | 99.00 |
| Insulin | 768.0 | 79.80 | 115.24 | 0.00 | 0.00 | 30.50 | 127.25 | 846.00 |
| BMI | 768.0 | 31.99 | 7.88 | 0.00 | 27.30 | 32.00 | 36.60 | 67.10 |
| DiabetesPedigreeFunction | 768.0 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 |
| Age | 768.0 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 |
| Outcome | 768.0 | 0.35 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
# count the zeros in every column
for col in data.columns:
    zero = len(data[data[col] == 0])
    print('0s in %s: %d' % (col, zero))
0s in Pregnancies: 111
0s in Glucose: 5
0s in BloodPressure: 35
0s in SkinThickness: 227
0s in Insulin: 374
0s in BMI: 11
0s in DiabetesPedigreeFunction: 0
0s in Age: 0
0s in Outcome: 500
There are 374 zeros in the Insulin column alone, and values of 0 for glucose, blood pressure, skin thickness, insulin, or BMI are physiologically impossible, so these zeros are really missing values in disguise. Dropping every affected row is not realistic, as it would discard too much data. Under such circumstances, we normally replace the missing values with a statistic such as the mean, median, or mode. Here I choose the median.
# replace 0s in Glucose, BloodPressure, SkinThickness, Insulin and BMI with NaN
data.iloc[:, 1:6] = data.iloc[:, 1:6].replace({0: np.nan})
# fill null values with medians
data = data.fillna(data.median())
data.describe().round(2).T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 |
| Glucose | 768.0 | 121.66 | 30.44 | 44.00 | 99.75 | 117.00 | 140.25 | 199.00 |
| BloodPressure | 768.0 | 72.39 | 12.10 | 24.00 | 64.00 | 72.00 | 80.00 | 122.00 |
| SkinThickness | 768.0 | 29.11 | 8.79 | 7.00 | 25.00 | 29.00 | 32.00 | 99.00 |
| Insulin | 768.0 | 140.67 | 86.38 | 14.00 | 121.50 | 125.00 | 127.25 | 846.00 |
| BMI | 768.0 | 32.46 | 6.88 | 18.20 | 27.50 | 32.30 | 36.60 | 67.10 |
| DiabetesPedigreeFunction | 768.0 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 |
| Age | 768.0 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 |
| Outcome | 768.0 | 0.35 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
Pairplot & Correlation Matrix -- Imbalanced Dataset
Now we can visualize the dataset and see if we can find something else.
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(data, hue="Outcome", corner=True)

corr_mat = data.corr().round(2)
plt.figure(figsize = (15,10))
sns.heatmap(corr_mat, annot=True)

From the pairplot above, we discover that this is an imbalanced dataset: there are 500 observations whose outcome is 0 and only 268 whose outcome is 1 (a quick way to verify these counts is shown after the list below). If we train our model on such data, the results may be misleading. There are several methods to deal with imbalanced data:
- Collect more data
- Under-Sampling
- Over-Sampling
- Use a confusion matrix or other suitable ways to evaluate performance
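For reference, here is a quick way to verify the class counts mentioned above; this is just a one-liner that assumes `data` is the cleaned DataFrame from the previous step:
# number of observations in each Outcome class (0 = non-diabetic, 1 = diabetic)
print(data['Outcome'].value_counts())
It should report 500 observations with Outcome 0 and 268 with Outcome 1.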
As there are only 768 observations, it's unrealistic to under-sample. Here I will illustrate how to oversample using SMOTE().
Oversampling with SMOTE()
We can use SMOTE() to oversample the minority class so that both outcome classes contain the same number of observations. After doing that, we describe the data again.
from imblearn.over_sampling import SMOTE
# get X, y
X = data.iloc[:, 0:8]
y = data.iloc[:, -1]
# perform SMOTE()
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
# join X, y
y = pd.DataFrame(y)
data = pd.concat([X, y], axis=1)
print(data.shape)
data.describe().round(2).T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1000.0 | 3.99 | 3.33 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 |
| Glucose | 1000.0 | 126.40 | 31.39 | 44.00 | 102.82 | 122.00 | 148.00 | 199.00 |
| BloodPressure | 1000.0 | 72.97 | 11.71 | 24.00 | 65.75 | 72.00 | 80.00 | 122.00 |
| SkinThickness | 1000.0 | 29.64 | 8.22 | 7.00 | 27.00 | 29.00 | 33.00 | 99.00 |
| Insulin | 1000.0 | 146.58 | 90.69 | 14.00 | 125.00 | 125.00 | 136.25 | 846.00 |
| BMI | 1000.0 | 33.01 | 6.68 | 18.20 | 28.40 | 32.66 | 36.80 | 67.10 |
| DiabetesPedigreeFunction | 1000.0 | 0.49 | 0.32 | 0.08 | 0.26 | 0.40 | 0.65 | 2.42 |
| Age | 1000.0 | 33.88 | 11.34 | 21.00 | 25.00 | 31.00 | 41.00 | 81.00 |
| Outcome | 1000.0 | 0.50 | 0.50 | 0.00 | 0.00 | 0.50 | 1.00 | 1.00 |
Pairplot & Correlation Matrix -- Balanced Dataset
Once again, we visualize the dataset to see how the distribution of the data has changed.
sns.pairplot(data, hue="Outcome", corner=True)

We have now transformed the original imbalanced dataset into a balanced one with 500 observations in each outcome class. There is not much difference in distribution between diabetics and non-diabetics. Glucose appears to have the highest correlation with Outcome among all the features; we can verify this with a correlation matrix heatmap.
corr_mat = data.corr().round(2)
plt.figure(figsize = (15,10))
sns.heatmap(corr_mat, annot=True)

The features with higher correlation with Outcome are more important and more informative for our modeling process.
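As a quick, heatmap-free sanity check of that ranking, we can sort the features by the absolute value of their correlation with Outcome; this is a small sketch that assumes `data` is the balanced DataFrame built with SMOTE() above:
# correlation of every feature with the target, strongest first
corr_with_outcome = data.corr()['Outcome'].drop('Outcome')
print(corr_with_outcome.abs().sort_values(ascending=False).round(2))
Glucose should come out on top, consistent with the heatmap.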
Data Transformation
If the ranges of two features differ a lot, for example, one goes from 0 to 1 and the other from 0 to 10,000, models obviously won't treat the two features equally. Normally, the feature with the larger range influences the result more. Nevertheless, this doesn't mean it's more important than the other one. That's why we need to scale the features. Frequently used scaling techniques are:
- MinMaxScaler()
- StandardScaler()
- MaxAbsScaler()
- RobustScaler()
I usually choose MinMaxScaler() or StandardScaler() for most datasets. You can also use the others according to the characteristics of the dataset at hand, as illustrated in the short sketch below.
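To make the difference between the two scalers concrete, here is a minimal sketch on a made-up two-column array (not the diabetes data):
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np
# toy data: one small-range column and one large-range column
toy = np.array([[0.2, 100.0],
                [0.5, 5000.0],
                [0.9, 10000.0]])
print(MinMaxScaler().fit_transform(toy))    # rescales each column to [0, 1]
print(StandardScaler().fit_transform(toy))  # centers each column to mean 0, unit variance
Below, I split the data into a training set and a validation set and standardize both with StandardScaler(), fitting the scaler on the training set only.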
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# get X, y
X = data.drop(["Outcome"], axis=1)
y = data["Outcome"]
X2 = X.copy()
y2 = y.copy()
# split X, y into training set and validation set
X2_train, X2_trainValid, y2_train, y2_trainValid = train_test_split(X2, y2, test_size=0.5)
# perform feature scaling
scaler = StandardScaler()
X2_train = scaler.fit_transform(X2_train)
print("The training set:")
print(X2_train)
X2_trainValid = scaler.transform(X2_trainValid)
print("The validation set:")
print(X2_trainValid)
The training set:
[[ 0.91861577  1.16692299  1.18785055 ... -0.37252998 -1.00546758  1.12702163]
 [-0.31857719  2.14006051 -0.37791467 ... -0.29655845 -0.59256699 -0.00540078]
 [-1.2464719  -0.54138673  1.3123182  ... -0.07795059  0.92337497  0.25592747]
 ...
 [ 0.60931753  0.71279215  0.66592881 ... -0.50927873 -0.96541006  1.38834988]
 [ 0.91861577  0.69093641  0.49301433 ...  0.36541497  0.17583587  1.4754593 ]
 [-1.2464719  -1.00641748 -0.20394076 ...  1.61792412 -0.44466231 -1.1378232 ]]
The validation set:
[[-0.93717366 -1.13616915 -0.89983642 ... -0.85874777  0.2732917  -0.87649495]
 [-0.62787543 -1.03885539 -0.7258625  ...  0.78223728  0.56293838 -0.96360437]
 [ 0.30001929 -1.20104498 -0.55188859 ... -1.28418834 -0.46006904 -0.35383845]
 ...
 [-0.62787543 -0.94154164 -0.37791467 ... -1.78560044  0.47974199 -0.70227612]
 [ 0.30001929  0.61547839  0.83990272 ... -0.12942108 -0.1211208   2.08522521]
 [ 0.60931753 -0.58472455  1.36182446 ...  0.59990561  0.72624981 -0.26672903]]
Data Reduction
With clean data at hand, we can visualize the data, which may reveal relations between the features or between features and the target. Knowing these is useful for data reduction: for example, if a feature is highly correlated with the target, we had better retain it, and if two features are highly correlated with each other, we can drop one of them. Here I use PCA (principal component analysis) for a demonstration.
from sklearn.decomposition import PCA
pca = PCA()
X2_train = pca.fit_transform(X2_train)
print("The training set:")
print(X2_train)
X2_trainValid = pca.transform(X2_trainValid)
print("The validation set:")
print(X2_trainValid)
The training set:
[[ 1.11094069  1.71567637  0.06106008 ... -0.41295852  0.00731795 -0.44959938]
 [-0.35117318  0.31229466  1.6917659  ... -1.61109567 -0.49354468  0.3728086 ]
 [ 0.16735037 -0.4001708  -0.38262459 ...  0.82802415  0.95458562 -0.10766268]
 ...
 [ 0.66464415  1.65681528  0.09661504 ... -0.34339558  0.49824185 -0.21578121]
 [ 1.89124805  0.9469956   0.91665363 ...  0.40528492  0.16728102  0.47349385]
 [-0.77480095 -1.71331367 -1.18697513 ...  0.26478707 -0.22960668  1.25339887]]
The validation set:
[[-2.92512666 -0.09229361  0.31554059 ... -0.10102492 -0.22291171  0.68989167]
 [-0.55971697 -1.49656753 -0.53947916 ...  0.41658342 -0.12598596  0.48539238]
 [-2.27295931  1.03707233 -0.42829519 ... -0.03041183 -0.23491582 -0.2007138 ]
 ...
 [-2.79570134  0.52778715  0.64923584 ...  0.15441194 -0.14783737 -0.13731851]
 [ 1.54801396  1.38771828  1.28965127 ...  0.67922686  0.85859194  0.66502772]
 [ 0.65313216  0.2231839  -0.87818227 ...  0.76275166 -0.74002896  0.09427758]]
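One thing worth noting: PCA() with no arguments keeps all eight components, so by itself the step above doesn't reduce anything yet. A minimal follow-up sketch, assuming we (arbitrarily) decide to retain about 95% of the variance, is to inspect the explained variance ratio and keep only the leading components; the 0.95 threshold and the n_keep / *_reduced names here are illustrative choices, not part of the original pipeline:
# cumulative share of variance explained by the leading principal components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var.round(3))
# keep just enough leading components to explain ~95% of the variance;
# X2_train and X2_trainValid already hold the PCA scores, so we can simply slice columns
n_keep = int(np.argmax(cum_var >= 0.95)) + 1
X2_train_reduced = X2_train[:, :n_keep]
X2_trainValid_reduced = X2_trainValid[:, :n_keep]
print(n_keep, X2_train_reduced.shape)
In practice, you would pick the threshold based on how much information loss your downstream models can tolerate.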