Similar question, but with a different dataset
confusion matrix and classification report of StratifiedKFold
I'm trying to create a confusion matrix using the Titanic dataset. I followed this Kaggle notebook, which has a part that runs StratifiedKFold with a RandomForest model. I modified the code slightly to get a confusion matrix; see the ADDED lines. On the other hand, I don't understand how the train/test split works with StratifiedKFold here, since in this notebook the train and test sets are already separated, and I don't know where to get y_test to feed into the confusion matrix.
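For reference, my current understanding is that skf.split only yields row indices into the training data, roughly like this (a minimal sketch I wrote to check my understanding, not code from the notebook; the variable names are just placeholders):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X_train = np.random.rand(10, 3)      # 10 training rows, 3 features
y_train = np.array([0, 1] * 5)       # binary target used for stratification

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=2)
for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train), 1):
    # trn_idx and val_idx both index into X_train / y_train only;
    # the separate X_test never appears here
    print(fold, trn_idx, val_idx)

Here is the code from the notebook with my changes: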
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import StratifiedKFold

N = 2
oob = 0
probs = pd.DataFrame(np.zeros((len(X_test), N * 2)), columns=['Fold_{}_Prob_{}'.format(i, j) for i in range(1, N + 1) for j in range(2)])
importances = pd.DataFrame(np.zeros((X_train.shape[1], N)), columns=['Fold_{}'.format(i) for i in range(1, N + 1)], index=df_all.columns)
fprs, tprs, scores, conf_matrix = [], [], [], []

skf = StratifiedKFold(n_splits=N, random_state=N, shuffle=True)

for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train), 1):
    print('Fold {}\n'.format(fold))

    # Fitting the model
    single_best_model.fit(X_train[trn_idx], y_train[trn_idx])

    # ADDED
    y_pred = single_best_model.predict(X_test)
    conf_matrix.append(confusion_matrix(y_test, y_pred))

    # Computing Train AUC score
    trn_fpr, trn_tpr, trn_thresholds = roc_curve(y_train[trn_idx], single_best_model.predict_proba(X_train[trn_idx])[:, 1])
    trn_auc_score = auc(trn_fpr, trn_tpr)
    # Computing Validation AUC score
    val_fpr, val_tpr, val_thresholds = roc_curve(y_train[val_idx], single_best_model.predict_proba(X_train[val_idx])[:, 1])
    val_auc_score = auc(val_fpr, val_tpr)
    scores.append((trn_auc_score, val_auc_score))
    fprs.append(val_fpr)
    tprs.append(val_tpr)

    # X_test probabilities
    probs.loc[:, 'Fold_{}_Prob_0'.format(fold)] = single_best_model.predict_proba(X_test)[:, 0]
    probs.loc[:, 'Fold_{}_Prob_1'.format(fold)] = single_best_model.predict_proba(X_test)[:, 1]
    importances.iloc[:, fold - 1] = single_best_model.feature_importances_

    oob += single_best_model.oob_score_ / N
    print('Fold {} OOB Score: {}\n'.format(fold, single_best_model.oob_score_))

print('Average OOB Score: {}'.format(oob))
The first error I'm getting is that y_test is not defined, and I don't know how the notebook author defined y_test; I cannot find it anywhere in the notebook.
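The only workaround I can think of (my own guess, not something the notebook does) is to evaluate on the validation fold inside the loop, since those labels do exist:

    # ADDED (my guess, not from the notebook): use the validation fold
    # of the training data, where labels are available
    y_val_pred = single_best_model.predict(X_train[val_idx])
    conf_matrix.append(confusion_matrix(y_train[val_idx], y_val_pred))
    print(classification_report(y_train[val_idx], y_val_pred))

Is that the right way to get a confusion matrix and classification report out of StratifiedKFold, or is there a proper y_test I'm missing?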
After I get the matrix, I want to produce a plot like the one in this article:
https://medium.com/analytics-vidhya/evaluating-a-random-forest-model-9d165595ad56
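Something along these lines is what I have in mind (just a rough sketch with hard-coded placeholder numbers and my own guess at the Titanic class labels; the real cm would come from the loop above):

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

cm = np.array([[95, 12], [18, 54]])  # placeholder values, not real results
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Died', 'Survived'], yticklabels=['Died', 'Survived'])
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion matrix')
plt.show()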