In the previous article, I covered data preprocessing, often seen as one of the most important steps in a machine learning project. One must be patient with data manipulation, no matter how troublesome it gets. In this article, we finally move on to model training: I will demonstrate how to build several types of classification models and how to evaluate their performance.
For everything you may need, visit:
Source Data
Hope this helps!

In Diabetes Data Analytics (1) -- Data Preprocessing, I demonstrated how to do feature scaling and feature reduction separately. This time I'll show how to combine these steps with model training using Pipeline.
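If you have not met Pipeline before, the idea is simple: chain the preprocessing steps and the final estimator into a single object, so that fit runs every step in order on the training data and predict applies the same fitted transformations to new data. A minimal sketch (the step names are arbitrary labels):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# each step is a (name, transformer/estimator) pair; only the last step predicts
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),    # feature scaling
        ("pca", PCA()),                  # feature reduction
        ("knn", KNeighborsClassifier()),
    ]
)
# pipe.fit(X_train, y_train) scales, reduces, then fits KNN in one call;
# pipe.predict(X_test) reuses the fitted scaler and PCA automatically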

Model Training and Evaluation

There are many classification models out there; some you may have heard of or used, others perhaps not. This article covers nearly all the classification models you may need as a beginner. I will show you how to build them and how different hyperparameters affect each model.
Note that the X and y used in train_test_split below are the same ones prepared in the previous article.
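If you are starting from this article instead, here is a hedged sketch of how X and y might be rebuilt (the file name diabetes.csv and the label column Outcome are assumptions; match them to your own preprocessed data from part 1):

import pandas as pd

# hypothetical file/column names -- adjust to your copy of the dataset
df = pd.read_csv("diabetes.csv")
X  = df.drop(columns=["Outcome"])  # feature matrix
y  = df["Outcome"]                 # binary label: 1 = diabetic, 0 = not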

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate

# split the data into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size    = 0.25,
    random_state = 42
)

# Since this is a clinical dataset, accuracy alone is not informative enough:
# recall and precision matter more, so we use the f1 score to evaluate models

# cross validation
def Cval(X_train, y_train, modelObj):
    # repeatedly split the training set into a smaller training set & a validation set
    shuffle = ShuffleSplit(n_splits=100, test_size=0.25, random_state=10)
    CVInfo  = cross_validate(
        modelObj,
        X_train,
        y_train,
        cv                 = shuffle,
        scoring            = 'f1',
        return_train_score = True,
        n_jobs             = -1
    )
    mean_train = np.mean(CVInfo['train_score'])
    mean_test  = np.mean(CVInfo['test_score'])
    return f"Mean Train: {mean_train}, Mean Test: {mean_test}"
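# Why f1? It is the harmonic mean of precision and recall:
#     f1 = 2 * P * R / (P + R)
# so a model cannot score well by ignoring the positive (diabetic) class.
# A toy check with made-up labels, just to illustrate the formula:
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat  = [1, 1, 0, 0, 1, 0, 0, 0]      # 2 TP, 1 FP, 2 FN
print(precision_score(y_true, y_hat))  # 2/3
print(recall_score(y_true, y_hat))     # 2/4 = 0.5
print(f1_score(y_true, y_hat))         # ~0.571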

# grid search
def Gsearch(X, y, modelObj):
    # repeatedly split the training set into a smaller training set & a validation set
    # NOTE: param_grid is read from the enclosing scope, so set it before each call
    shuffle     = ShuffleSplit(n_splits=100, test_size=0.25, random_state=10)
    grid_search = GridSearchCV(
        modelObj,
        param_grid,
        cv                 = shuffle,
        scoring            = 'f1',
        return_train_score = True,
        n_jobs             = -1
    )
    grid_search.fit(X, y)
    results = pd.DataFrame(grid_search.cv_results_)
    return results
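One convention worth spelling out before the first grid search: when tuning an estimator that lives inside a Pipeline, the keys of param_grid take the form <step name>__<parameter name>, with a double underscore, which is how GridSearchCV routes each candidate value to the right step. For example:

# tunes the n_neighbors parameter of the pipeline step named "knn"
param_grid = {"knn__n_neighbors": [1, 5, 11]}
# outside a pipeline the key would simply be "n_neighbors"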

K-Nearest Neighbors Classifier

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# grid search: K
klist      = list(range(1, 20))
fullModel  = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("knn", KNeighborsClassifier()),
    ]
)
param_grid = {'knn__n_neighbors':klist}
print(
    Gsearch(X_train, y_train, fullModel)[
        [
            'rank_test_score',
            'mean_train_score',
            'mean_test_score',
            'param_knn__n_neighbors'
        ]
    ]
)
    rank_test_score  mean_train_score  mean_test_score param_knn__n_neighbors
0                17          1.000000         0.767296                      1
1                19          0.905732         0.705512                      2
2                15          0.876778         0.778121                      3
3                18          0.846312         0.749821                      4
4                14          0.845480         0.782354                      5
5                16          0.829343         0.775068                      6
6                 2          0.834093         0.789197                      7
7                12          0.831851         0.785771                      8
8                 4          0.830803         0.787731                      9
9                 9          0.829832         0.786638                     10
10                1          0.825560         0.790609                     11
11               13          0.823216         0.785650                     12
12                3          0.820105         0.788285                     13
13               11          0.818992         0.786504                     14
14                5          0.816281         0.787701                     15
15                7          0.815164         0.786984                     16
16                6          0.812760         0.787472                     17
17                8          0.813046         0.786949                     18
18               10          0.810027         0.786559                     19
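k = 11 ranks first, and the table tells the usual bias-variance story: k = 1 memorizes the training set (train score 1.0) yet validates worse, while larger k smooths the decision boundary. To see the two curves, a quick sketch that plots the Gsearch output over klist:

import matplotlib.pyplot as plt

results = Gsearch(X_train, y_train, fullModel)
plt.plot(klist, results["mean_train_score"], label="train")
plt.plot(klist, results["mean_test_score"], label="validation")
plt.xlabel("n_neighbors")
plt.ylabel("mean f1")
plt.legend()
plt.show()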

Logistic Regression

Without Penalty

from sklearn.linear_model import LogisticRegression

fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        # penalty="none" works on older scikit-learn; on >= 1.2 use penalty=None
        ("lr", LogisticRegression(penalty="none")),
    ]
)
print(Cval(X_train, y_train, fullModel))
Mean Train: 0.7489133929356746, Mean Test: 0.7334838216167244
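With no penalty this is plain logistic regression on the principal components. If you want to inspect the fitted coefficients (they belong to PCA components here, not the original features), you can reach into the pipeline after fitting; a short sketch:

fullModel.fit(X_train, y_train)
lr = fullModel.named_steps["lr"]
print(lr.coef_)       # one weight per principal component
print(lr.intercept_)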

With "l2" Penalty

clist = [0.1, 0.5, 1., 2., 5., 10., 50., 100., 200.]
fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("lr", LogisticRegression(penalty="l2")),
    ]
)
param_grid = {"lr__C": clist}
print(
    Gsearch(X_train, y_train, fullModel)[
        ["rank_test_score", "mean_train_score", "mean_test_score", "param_lr__C"]
    ]
)
   rank_test_score  mean_train_score  mean_test_score param_lr__C
0                9          0.746639         0.732332         0.1
1                1          0.748524         0.734091         0.5
2                2          0.748749         0.733897         1.0
3                4          0.748788         0.733610         2.0
4                3          0.748850         0.733614         5.0
5                8          0.748915         0.733403        10.0
6                7          0.748927         0.733416        50.0
7                5          0.748927         0.733484       100.0
8                5          0.748913         0.733484       200.0

Naive Bayes Classifier

from sklearn.naive_bayes import GaussianNB

fullModel = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("gnb", GaussianNB())]
)
print(Cval(X_train, y_train, fullModel))
Mean Train: 0.7156724204600774, Mean Test: 0.6935958660647001
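GaussianNB has no hyperparameter worth tuning here, which is why we go straight to Cval. It assumes the features are conditionally independent Gaussians, and the PCA step helps by decorrelating the inputs first. If you are curious about the fitted per-class statistics, a small sketch:

fullModel.fit(X_train, y_train)
gnb = fullModel.named_steps["gnb"]
print(gnb.class_prior_)  # estimated P(class) from the training labels
print(gnb.theta_)        # per-class mean of each (PCA) feature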

Support Vector Machine

Linear

from sklearn.svm import SVC

C         = [0.01, 0.1, 1.0, 1.5, 2.0, 5.0, 10.0, 20.0, 50.0]
fullModel = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("lsv", SVC(kernel="linear"))]
)
param_grid = {"lsv__C": C}
print(
    Gsearch(X_train, y_train, fullModel)[
        ["rank_test_score", "mean_train_score", "mean_test_score", "param_lsv__C"]
    ]
)
   rank_test_score  mean_train_score  mean_test_score param_lsv__C
0                9          0.736142         0.721251         0.01
1                8          0.748459         0.732040          0.1
2                7          0.751446         0.734082          1.0
3                3          0.751417         0.734420          1.5
4                4          0.751544         0.734324          2.0
5                1          0.751747         0.734826          5.0
6                2          0.751812         0.734491         10.0
7                6          0.751675         0.734110         20.0
8                5          0.751708         0.734141         50.0

Nonlinear

C         = [0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0]
gamma     = [0.0001, 0.001, 0.1, 0.5, 1.0, 1.5, 2.0]
fullModel = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("nlsv", SVC(kernel="rbf"))]
)
param_grid = {"nlsv__C": C, "nlsv__gamma": gamma}
print(
    Gsearch(X_train, y_train, fullModel)[
        [
            "rank_test_score",
            "mean_train_score",
            "mean_test_score",
            "param_nlsv__C",
            "param_nlsv__gamma",
        ]
    ]
)
    rank_test_score  mean_train_score  mean_test_score param_nlsv__C  param_nlsv__gamma 
0                44          0.372066         0.351788           0.1            0.0001  
1                43          0.372777         0.352342           0.1             0.001   
2                16          0.794937         0.769077           0.1               0.1
3                19          0.796141         0.759841           0.1               0.5
4                38          0.506495         0.426928           0.1               1.0
5                41          0.389886         0.354617           0.1               1.5
6                49          0.389178         0.338997           0.1               2.0
7                44          0.372066         0.351788          0.25            0.0001 
8                37          0.512152         0.495792          0.25             0.001
9                 9          0.819667         0.787810          0.25               0.1
10               12          0.865541         0.782728          0.25               0.5
11               20          0.898090         0.747326          0.25               1.0
12               36          0.816284         0.555991          0.25               1.5
13               39          0.589468         0.395044          0.25               2.0
14               44          0.372066         0.351788           0.5            0.0001
15               33          0.715235         0.699347           0.5             0.001
16                7          0.832132         0.790794           0.5               0.1
17                3          0.906015         0.792512           0.5               0.5
18               14          0.953103         0.775851           0.5               1.0
19               25          0.978297         0.724190           0.5               1.5
20               35          0.989296         0.616334           0.5               2.0
21               44          0.372066         0.351788          0.75            0.0001
22               31          0.723086         0.707281          0.75             0.001
23                6          0.840006         0.791039          0.75               0.1
24                5          0.920507         0.791277          0.75               0.5
25               13          0.965587         0.777566          0.75               1.0
26               24          0.986410         0.732414          0.75               1.5
27               34          0.995198         0.669428          0.75               2.0
28               44          0.372066         0.351788           1.0            0.0001
29               28          0.728932         0.711763           1.0             0.001
30                4          0.845415         0.791687           1.0               0.1
31                8          0.930660         0.790740           1.0               0.5
32               15          0.975305         0.774491           1.0               1.0
33               21          0.992273         0.741896           1.0               1.5
34               32          0.996811         0.706107           1.0               2.0
35               42          0.372209         0.352824           1.5            0.0001
36               26          0.736768         0.720368           1.5             0.001
37                2          0.852950         0.793025           1.5               0.1
38               10          0.947417         0.787222           1.5               0.5
39               17          0.987013         0.768417           1.5               1.0
40               22          0.996314         0.741427           1.5               1.5
41               29          0.998739         0.710655           1.5               2.0
42               40          0.405951         0.387968           2.0            0.0001
43               27          0.735862         0.718574           2.0             0.001
44                1          0.858846         0.793716           2.0               0.1
45               11          0.958640         0.783840           2.0               0.5
46               18          0.992780         0.765268           2.0               1.0
47               23          0.998327         0.740466           2.0               1.5
48               30          0.999947         0.709506           2.0               2.0
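With two hyperparameters the flat table is hard to scan: the strong region sits around gamma 0.1-0.5 with C >= 0.5, and a very small or very large gamma collapses the score. A pivot-table heatmap makes this obvious; a sketch built from the Gsearch output:

import matplotlib.pyplot as plt

results = Gsearch(X_train, y_train, fullModel)
scores  = results.pivot(
    index="param_nlsv__C", columns="param_nlsv__gamma", values="mean_test_score"
)
plt.imshow(scores, cmap="viridis")
plt.xticks(range(len(scores.columns)), scores.columns)
plt.yticks(range(len(scores.index)), scores.index)
plt.xlabel("gamma")
plt.ylabel("C")
plt.colorbar(label="mean validation f1")
plt.show()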

Decision Tree

from sklearn.tree import DecisionTreeClassifier

max_tree_depth = list(np.arange(1, 11, 1))
fullModel      = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("dtc", DecisionTreeClassifier())]
)
param_grid = {"dtc__max_depth": max_tree_depth}
print(
    Gsearch(X_train, y_train, fullModel)[
        [
            "rank_test_score",
            "mean_train_score",
            "mean_test_score",
            "param_dtc__max_depth",
        ]
    ]
)
   rank_test_score  mean_train_score  mean_test_score param_dtc__max_depth
0                3          0.769956         0.752233                    1
1                5          0.766799         0.739621                    2
2               10          0.777782         0.728209                    3
3                1          0.822327         0.759595                    4
4                2          0.851185         0.753298                    5
5                4          0.882829         0.746050                    6
6                6          0.912364         0.739208                    7
7                7          0.937076         0.737403                    8
8                9          0.956295         0.729259                    9
9                8          0.971882         0.730771                   10
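max_depth = 4 ranks first: shallower trees underfit slightly, and past depth 5 the train score climbs toward 1 while the validation score drops, the classic overfitting signature. To see what the winning tree actually learned (remember that after PCA the split features are principal components, not the original measurements), scikit-learn can draw it:

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

fullModel = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("dtc", DecisionTreeClassifier(max_depth=4))]
)
fullModel.fit(X_train, y_train)
plot_tree(fullModel.named_steps["dtc"], filled=True)
plt.show()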

Random Forest

from sklearn.ensemble import RandomForestClassifier

max_tree_depth = list(np.arange(1, 21, 1))
fullModel      = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("forest", RandomForestClassifier())]
)
param_grid = {"forest__max_depth": max_tree_depth}
print(
    Gsearch(X_train, y_train, fullModel)[
        [
            "rank_test_score",
            "mean_train_score",
            "mean_test_score",
            "param_forest__max_depth",
        ]
    ]
)
    rank_test_score  mean_train_score  mean_test_score param_forest__max_depth
0                20          0.780006         0.743256                       1
1                19          0.810001         0.760425                       2
2                18          0.832622         0.770300                       3
3                17          0.857490         0.776023                       4
4                16          0.885679         0.782947                       5
5                14          0.916162         0.789779                       6
6                15          0.946883         0.789271                       7
7                13          0.970423         0.793335                       8
8                 7          0.985119         0.795747                       9
9                 6          0.992331         0.795925                      10
10               12          0.996899         0.793636                      11
11                9          0.999103         0.795075                      12
12                8          0.999682         0.795129                      13
13                1          0.999947         0.798139                      14
14                2          0.999982         0.796966                      15
15                5          0.999982         0.796176                      16
16               10          1.000000         0.794725                      17
17               11          1.000000         0.794472                      18
18                4          1.000000         0.796287                      19
19                3          1.000000         0.796903                      20
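Depth 14 ranks first, though everything from depth 8 upward is close; averaging many trees keeps the variance in check even when individual trees are deep. If you want to know which inputs drive the forest, keep in mind the importances refer to PCA components here; a sketch:

fullModel.fit(X_train, y_train)
forest = fullModel.named_steps["forest"]
print(forest.feature_importances_)  # one value per principal component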

Model Testing

Now that we have trained a set of models, KNN, the non-linear SVC, and Random Forest stand out as the best performers. So we take these three and see how they do on the testing set.
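A single f1 number is a thin summary, so alongside the scores below it is worth printing a confusion matrix and classification report for each finalist; a sketch that applies to any of the fitted pipelines:

from sklearn.metrics import classification_report, confusion_matrix

# after fullModel.fit(X_train, y_train) and y_pred = fullModel.predict(X_test):
print(confusion_matrix(y_test, y_pred))       # rows: true class, columns: predicted
print(classification_report(y_test, y_pred))  # precision/recall/f1 per class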

K-Nearest Neighbors Classifier

from sklearn.metrics import f1_score

fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("knn", KNeighborsClassifier(n_neighbors=11)),
    ]
)
fullModel.fit(X_train, y_train)
y_pred = fullModel.predict(X_test)
f1_score(y_test, y_pred)
0.793103448275862

Non-linear Support Vector Classifier

fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("nlsv", SVC(kernel="rbf", C=1, gamma=0.5)),
    ]
)
fullModel.fit(X_train, y_train)
y_pred = fullModel.predict(X_test)
f1_score(y_test, y_pred)
0.8060836501901141

Random Forest

fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("forest", RandomForestClassifier(max_depth=14)),
    ]
)
fullModel.fit(X_train, y_train)
y_pred = fullModel.predict(X_test)
f1_score(y_test, y_pred)
0.8057553956834532