In the previous article, I covered data preprocessing. It is often seen as one of the most important steps in a machine learning project, and one must be patient with data manipulation no matter how troublesome it is. In this article, we finally get to the model training procedure. I will demonstrate how to build different types of classification models and how to evaluate their performance.
For everything you may need, visit:
Source Data
Hope this helps!
In Diabetes Data Analytics (1) -- Data Preprocessing, I demonstrated how to do feature scaling and feature reduction separately. This time I'll show how to combine these steps with model training using Pipeline.
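As a minimal sketch of what Pipeline does, the snippet below chains a scaler, PCA and a classifier into one estimator. The synthetic data, the step name "clf" and the choice of 5 components are assumptions for illustration, not part of the original project. The point to notice is that each step's hyperparameters are addressed as `<step name>__<parameter>`, which is the naming the grid searches in this article rely on.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption, illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=5)),
        ("clf", KNeighborsClassifier()),
    ]
)

# Nested parameters are exposed as "<step>__<param>"
param_names = sorted(pipe.get_params().keys())
print([name for name in param_names if name.startswith("clf__")])

# set_params uses the same naming; fit() then runs scaler -> pca -> clf in order
pipe.set_params(clf__n_neighbors=7)
pipe.fit(X_demo, y_demo)
n_components_out = pipe.named_steps["pca"].n_components_
print(n_components_out)
```

Because the scaler and PCA are fitted inside the pipeline, cross-validation refits them on each training fold, so no information from the validation fold leaks into preprocessing.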
Model Training And Evaluating
There are many classification models. You may have heard of or used some of them, but not others. This article covers almost all the classification models you may need as a beginner. I will show you how to build them and how different hyperparameters affect the models.
Note that the X and y used in train_test_split are the same as the ones from here.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
# get training set and testing set
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size = 0.25,
    random_state = 42
)
# Since this is a clinical dataset, we don't care too much about accuracy.
# Recall and precision are more important. So we use f1 score to evaluate models
# cross validation
def Cval(X_train, y_train, modelObj):
    # split training set into training set & validation set (100 random splits)
    shuffle = ShuffleSplit(n_splits=100, test_size=0.25, random_state=10)
    CVInfo = cross_validate(
        modelObj,
        X_train,
        y_train,
        cv = shuffle,
        scoring = 'f1',
        return_train_score = True,
        n_jobs = -1
    )
    # average the f1 score over all splits
    mean_train = np.mean(CVInfo['train_score'])
    mean_test = np.mean(CVInfo['test_score'])
    return f"Mean Train: {mean_train}, Mean Test: {mean_test}"
# grid search
def Gsearch(X, y, modelObj):
    # note: param_grid is read from the global scope,
    # so it must be defined before calling Gsearch
    # split training set into training set & validation set (100 random splits)
    shuffle = ShuffleSplit(n_splits=100, test_size=0.25, random_state=10)
    grid_search = GridSearchCV(
        modelObj,
        param_grid,
        cv = shuffle,
        scoring = 'f1',
        return_train_score = True,
        n_jobs = -1
    )
    grid_search.fit(X, y)
    # one row per hyperparameter combination
    results = pd.DataFrame(grid_search.cv_results_)
    return results
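To make the choice of scoring = 'f1' concrete, here is a small worked example. The labels are made up for illustration (they are not from the diabetes data). It shows how accuracy can look acceptable while most positive cases are missed, and that F1, the harmonic mean of precision and recall, reflects that.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = diabetic, 0 = not diabetic (illustration only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)    # 7 of 10 correct
prec = precision_score(y_true, y_pred)  # 1 true positive, 0 false positives
rec = recall_score(y_true, y_pred)      # 1 of 4 positives found
f1 = f1_score(y_true, y_pred)           # 2 * prec * rec / (prec + rec)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# prints: accuracy=0.70 precision=1.00 recall=0.25 f1=0.40
```

A model that misses three of four patients still reaches 70% accuracy here, while its F1 score drops to 0.40.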
K-Nearest Neighbors Classifier
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# grid search: K
klist = list(range(1, 20))
fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("knn", KNeighborsClassifier())
    ]
)
param_grid = {'knn__n_neighbors': klist}
print(
    Gsearch(X_train, y_train, fullModel)[
        [
            'rank_test_score',
            'mean_train_score',
            'mean_test_score',
            'param_knn__n_neighbors'
        ]
    ]
)
    rank_test_score  mean_train_score  mean_test_score  param_knn__n_neighbors
0                17          1.000000         0.767296                       1
1                19          0.905732         0.705512                       2
2                15          0.876778         0.778121                       3
3                18          0.846312         0.749821                       4
4                14          0.845480         0.782354                       5
5                16          0.829343         0.775068                       6
6                 2          0.834093         0.789197                       7
7                12          0.831851         0.785771                       8
8                 4          0.830803         0.787731                       9
9                 9          0.829832         0.786638                      10
10                1          0.825560         0.790609                      11
11               13          0.823216         0.785650                      12
12                3          0.820105         0.788285                      13
13               11          0.818992         0.786504                      14
14                5          0.816281         0.787701                      15
15                7          0.815164         0.786984                      16
16                6          0.812760         0.787472                      17
17                8          0.813046         0.786949                      18
18               10          0.810027         0.786559                      19
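One way to read a result table like this is to look at the gap between the training and validation scores in addition to the rank. The sketch below rebuilds a few rows of the KNN output by hand (the values are copied from the table, while the "gap" column is an added illustration) and picks the best K by mean validation F1.

```python
import pandas as pd

# A few rows copied from the KNN grid-search output above
results = pd.DataFrame(
    {
        "param_knn__n_neighbors": [1, 3, 7, 11, 19],
        "mean_train_score": [1.000000, 0.876778, 0.834093, 0.825560, 0.810027],
        "mean_test_score": [0.767296, 0.778121, 0.789197, 0.790609, 0.786559],
    }
)

# Gap between training and validation F1: a large gap suggests overfitting
results["gap"] = results["mean_train_score"] - results["mean_test_score"]

# Row with the highest mean validation score
best = results.sort_values("mean_test_score", ascending=False).iloc[0]
best_k = int(best["param_knn__n_neighbors"])

print(results.round(3))
print("best K:", best_k)
```

With K = 1 the model memorizes the training folds (train F1 of 1.0) and the gap is the largest. The gap shrinks as K grows, while the validation score peaks at K = 11.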
Logistic Regression
Without Penalty
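from sklearn.linear_model import LogisticRegression
fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        # penalty=None disables regularization (scikit-learn >= 1.2);
        # the string "none" was removed in scikit-learn 1.4
        ("lr", LogisticRegression(penalty=None)),
    ]
)
print(Cval(X_train, y_train, fullModel))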
Mean Train: 0.7489133929356746, Mean Test: 0.7334838216167244
With "l2" Penalty
clist = [0.1, 0.5, 1., 2., 5., 10., 50., 100., 200.]
fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("lr", LogisticRegression(penalty="l2")),
    ]
)
param_grid = {"lr__C": clist}
print(
    Gsearch(X_train, y_train, fullModel)[
        ["rank_test_score", "mean_train_score", "mean_test_score", "param_lr__C"]
    ]
)
   rank_test_score  mean_train_score  mean_test_score  param_lr__C
0                9          0.746639         0.732332          0.1
1                1          0.748524         0.734091          0.5
2                2          0.748749         0.733897          1.0
3                4          0.748788         0.733610          2.0
4                3          0.748850         0.733614          5.0
5                8          0.748915         0.733403         10.0
6                7          0.748927         0.733416         50.0
7                5          0.748927         0.733484        100.0
8                5          0.748913         0.733484        200.0
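In scikit-learn, C is the inverse of the regularization strength, so a smaller C means a stronger L2 penalty. A minimal sketch of that effect, on synthetic data rather than the diabetes dataset (the data and the three C values are assumptions for illustration), is to compare the size of the fitted coefficients:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = Pipeline(
        [
            ("scaler", StandardScaler()),
            # L2 is the default penalty of LogisticRegression
            ("lr", LogisticRegression(C=C, max_iter=1000)),
        ]
    )
    model.fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector
    coef_norms[C] = float(np.linalg.norm(model.named_steps["lr"].coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.3f}")
```

The coefficients shrink toward zero as C gets smaller. In the table above the scores barely move across C, which suggests that regularization strength is not the limiting factor for this model on this dataset.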
Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB
fullModel = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("gnb", GaussianNB())]
)
print(Cval(X_train, y_train, fullModel))
Mean Train: 0.7156724204600774, Mean Test: 0.6935958660647001
Support Vector Machine
Linear
from sklearn.svm import SVC
C = [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 5.0, 10.0]
fullModel = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("lsv", SVC(kernel="linear"))]
)
param_grid = {"lsv__C": C}
print(
    Gsearch(X_train, y_train, fullModel)[
        ["rank_test_score", "mean_train_score", "mean_test_score", "param_lsv__C"]
    ]
)
   rank_test_score  mean_train_score  mean_test_score  param_lsv__C
0                9          0.736142         0.721251          0.01
1                8          0.748459         0.732040           0.1
2                7          0.751446         0.734082           1.0
3                3          0.751417         0.734420           1.5
4                4          0.751544         0.734324           2.0
5                1          0.751747         0.734826           5.0
6                2          0.751812         0.734491          10.0
7                6          0.751675         0.734110          20.0
8                5          0.751708         0.734141          50.0
Nonlinear
C = [0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0]
gamma = [0.0001, 0.001, 0.1, 0.5, 1.0, 1.5, 2.0]
fullModel = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("nlsv", SVC(kernel="rbf"))]
)
param_grid = {"nlsv__C": C, "nlsv__gamma": gamma}
print(
    Gsearch(X_train, y_train, fullModel)[
        [
            "rank_test_score",
            "mean_train_score",
            "mean_test_score",
            "param_nlsv__C",
            "param_nlsv__gamma",
        ]
    ]
)
    rank_test_score  mean_train_score  mean_test_score  param_nlsv__C  param_nlsv__gamma
0                44          0.372066         0.351788            0.1             0.0001
1                43          0.372777         0.352342            0.1              0.001
2                16          0.794937         0.769077            0.1                0.1
3                19          0.796141         0.759841            0.1                0.5
4                38          0.506495         0.426928            0.1                1.0
5                41          0.389886         0.354617            0.1                1.5
6                49          0.389178         0.338997            0.1                2.0
7                44          0.372066         0.351788           0.25             0.0001
8                37          0.512152         0.495792           0.25              0.001
9                 9          0.819667         0.787810           0.25                0.1
10               12          0.865541         0.782728           0.25                0.5
11               20          0.898090         0.747326           0.25                1.0
12               36          0.816284         0.555991           0.25                1.5
13               39          0.589468         0.395044           0.25                2.0
14               44          0.372066         0.351788            0.5             0.0001
15               33          0.715235         0.699347            0.5              0.001
16                7          0.832132         0.790794            0.5                0.1
17                3          0.906015         0.792512            0.5                0.5
18               14          0.953103         0.775851            0.5                1.0
19               25          0.978297         0.724190            0.5                1.5
20               35          0.989296         0.616334            0.5                2.0
21               44          0.372066         0.351788           0.75             0.0001
22               31          0.723086         0.707281           0.75               0.01
23                6          0.840006         0.791039           0.75                0.1
24                5          0.920507         0.791277           0.75                0.5
25               13          0.965587         0.777566           0.75                1.0
26               24          0.986410         0.732414           0.75                1.5
27               34          0.995198         0.669428           0.75                2.0
28               44          0.372066         0.351788            1.0             0.0001
29               28          0.728932         0.711763            1.0               0.01
30                4          0.845415         0.791687            1.0                0.1
31                8          0.930660         0.790740            1.0                0.5
32               15          0.975305         0.774491            1.0                1.0
33               21          0.992273         0.741896            1.0                1.5
34               32          0.996811         0.706107            1.0                2.0
35               42          0.372209         0.352824            1.5             0.0001
36               26          0.736768         0.720368            1.5               0.01
37                2          0.852950         0.793025            1.5                0.1
38               10          0.947417         0.787222            1.5                0.5
39               17          0.987013         0.768417            1.5                1.0
40               22          0.996314         0.741427            1.5                1.5
41               29          0.998739         0.710655            1.5                2.0
42               40          0.405951         0.387968            2.0             0.0001
43               27          0.735862         0.718574            2.0               0.01
44                1          0.858846         0.793716            2.0                0.1
45               11          0.958640         0.783840            2.0                0.5
46               18          0.992780         0.765268            2.0                1.0
47               23          0.998327         0.740466            2.0                1.5
48               30          0.999947         0.709506            2.0                2.0
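A two-parameter grid is easier to read as a C-by-gamma matrix than as a long list. The sketch below pivots a subset of the rows above (nine combinations copied by hand; the short column names "C" and "gamma" are shorthand introduced here) and also shows the train/validation gap per cell.

```python
import pandas as pd

# (C, gamma, mean_train_score, mean_test_score), copied from the RBF SVM output above
rows = [
    (0.1, 0.1, 0.794937, 0.769077),
    (0.1, 0.5, 0.796141, 0.759841),
    (0.1, 2.0, 0.389178, 0.338997),
    (1.0, 0.1, 0.845415, 0.791687),
    (1.0, 0.5, 0.930660, 0.790740),
    (1.0, 2.0, 0.996811, 0.706107),
    (2.0, 0.1, 0.858846, 0.793716),
    (2.0, 0.5, 0.958640, 0.783840),
    (2.0, 2.0, 0.999947, 0.709506),
]
results = pd.DataFrame(
    rows, columns=["C", "gamma", "mean_train_score", "mean_test_score"]
)

# Validation F1 as a matrix: one row per C, one column per gamma
test_grid = results.pivot(index="C", columns="gamma", values="mean_test_score")

# Train minus validation F1 per cell: large values indicate overfitting
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
gap_grid = results.pivot(index="C", columns="gamma", values="gap")

print(test_grid.round(3))
print(gap_grid.round(3))
```

Read this way, the pattern is visible at a glance: with C = 1 or 2, increasing gamma from 0.1 to 2.0 pushes the training score toward 1.0 while the validation score falls, which is the signature of overfitting.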
Decision Tree
from sklearn.tree import DecisionTreeClassifier
max_tree_depth = list(np.arange(1, 11, 1))
fullModel = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("dtc", DecisionTreeClassifier())]
)
param_grid = {"dtc__max_depth": max_tree_depth}
print(
    Gsearch(X_train, y_train, fullModel)[
        [
            "rank_test_score",
            "mean_train_score",
            "mean_test_score",
            "param_dtc__max_depth",
        ]
    ]
)
   rank_test_score  mean_train_score  mean_test_score  param_dtc__max_depth
0                3          0.769956         0.752233                     1
1                5          0.766799         0.739621                     2
2               10          0.777782         0.728209                     3
3                1          0.822327         0.759595                     4
4                2          0.851185         0.753298                     5
5                4          0.882829         0.746050                     6
6                6          0.912364         0.739208                     7
7                7          0.937076         0.737403                     8
8                9          0.956295         0.729259                     9
9                8          0.971882         0.730771                    10
Random Forest
from sklearn.ensemble import RandomForestClassifier
max_tree_depth = list(np.arange(1, 21, 1))
fullModel = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA()), ("forest", RandomForestClassifier())]
)
param_grid = {"forest__max_depth": max_tree_depth}
print(
    Gsearch(X_train, y_train, fullModel)[
        [
            "rank_test_score",
            "mean_train_score",
            "mean_test_score",
            "param_forest__max_depth",
        ]
    ]
)
    rank_test_score  mean_train_score  mean_test_score  param_forest__max_depth
0                20          0.780006         0.743256                        1
1                19          0.810001         0.760425                        2
2                18          0.832622         0.770300                        3
3                17          0.857490         0.776023                        4
4                16          0.885679         0.782947                        5
5                14          0.916162         0.789779                        6
6                15          0.946883         0.789271                        7
7                13          0.970423         0.793335                        8
8                 7          0.985119         0.795747                        9
9                 6          0.992331         0.795925                       10
10               12          0.996899         0.793636                       11
11                9          0.999103         0.795075                       12
12                8          0.999682         0.795129                       13
13                1          0.999947         0.798139                       14
14                2          0.999982         0.796966                       15
15                5          0.999982         0.796176                       16
16               10          1.000000         0.794725                       17
17               11          1.000000         0.794472                       18
18                4          1.000000         0.796287                       19
19                3          1.000000         0.796903                       20
Model Testing
We have now trained a set of models, among which KNN, non-linear SVC, and Random Forest achieved the best mean validation F1 scores (about 0.79 to 0.80). So we refit these three models on the whole training set and see how they perform on the testing set.
K-Nearest Neighbors Classifier
from sklearn.metrics import f1_score
fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        # n_neighbors=11 ranked first in the grid search
        ("knn", KNeighborsClassifier(n_neighbors=11)),
    ]
)
fullModel.fit(X_train, y_train)
y_pred = fullModel.predict(X_test)
f1_score(y_test, y_pred)
0.793103448275862
Non-linear Support Vector Classifier
fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("nlsv", SVC(kernel="rbf", C=1, gamma=0.5)),
    ]
)
fullModel.fit(X_train, y_train)
y_pred = fullModel.predict(X_test)
f1_score(y_test, y_pred)
0.8060836501901141
Random Forest
fullModel = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("forest", RandomForestClassifier(max_depth=14)),
    ]
)
fullModel.fit(X_train, y_train)
y_pred = fullModel.predict(X_test)
f1_score(y_test, y_pred)
0.8057553956834532
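The three test runs above repeat the same fit, predict and score steps. Since precision and recall are the quantities we care about, it can help to report them next to F1. The sketch below wraps those steps in two helpers; the names evaluate_on_test and build_model are hypothetical, and the data is synthetic, so the printed numbers say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_model(classifier):
    """Wrap a classifier in the scaler -> PCA -> classifier pipeline used in this article."""
    return Pipeline(
        [("scaler", StandardScaler()), ("pca", PCA()), ("clf", classifier)]
    )


def evaluate_on_test(model, X_train, y_train, X_test, y_test):
    """Fit on the training set and report precision, recall and F1 on the test set."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "precision": float(precision_score(y_test, y_pred)),
        "recall": float(recall_score(y_test, y_pred)),
        "f1": float(f1_score(y_test, y_pred)),
    }


# Synthetic stand-in data (assumption, illustration only)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Hyperparameters taken from the test runs above
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=11),
    "rbf_svc": SVC(kernel="rbf", C=1, gamma=0.5),
}

scores = {
    name: evaluate_on_test(build_model(clf), X_tr, y_tr, X_te, y_te)
    for name, clf in candidates.items()
}
for name, result in scores.items():
    print(name, {metric: round(value, 3) for metric, value in result.items()})
```

Two models with similar F1 can trade precision against recall quite differently, and for a screening task a missed patient (low recall) is usually the costlier error.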