The goal of this notebook is to familiarize the reader with the powerful package XGBoost for constructing gradient boosted trees. We will discuss how to visualize feature importances as well as techniques for optimizing XGBoost.
In this notebook, we will focus on using Gradient Boosted Trees (in particular XGBoost) to classify the supersymmetry (SUSY) dataset, first introduced by Baldi et al., Nature Communications (2014), arXiv:1402.4735. The supersymmetry dataset consists of 5,000,000 Monte-Carlo samples of supersymmetric and non-supersymmetric collisions with 18 features. The signal process is the production of electrically-charged supersymmetric particles which decay to W bosons and an electrically-neutral supersymmetric particle that is invisible to the detector.
The first 8 features are "raw" (low-level) kinematic features that can be directly measured from collisions. The final 10 features are "hand constructed" (high-level) features that have been chosen using physical knowledge, and are known to be important in distinguishing supersymmetric and non-supersymmetric collision events. More specifically, they are given by the column names below.
We will draw upon useful blog posts that compactly introduce many of the basics of hyperparameter tuning. As we will see, there are many practical trade-offs we have to worry about. Unlike with Random Forests, overfitting can be a major worry for more complicated algorithms such as XGBoost. Hyperparameter searches are also extremely computationally expensive. However, this notebook should get you started on these interesting methods.
The supersymmetry dataset can be downloaded from the UCI Machine Learning repository at https://archive.ics.uci.edu/ml/machine-learning-databases/00279/SUSY.csv.gz. The dataset is quite large. Download the dataset and unzip it in a directory. We will be using this dataset with gradient boosted trees, focusing on the XGBoost algorithm.
If you have not already done so, you will also have to install the XGBoost python package, see e.g. https://github.com/dmlc/xgboost/tree/master/python-package. The easiest way to do this is with the pip command: pip install xgboost. Alternatively, one can use Anaconda: conda install -c conda-forge xgboost.
#Load the dataset using pandas and numpy
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
#Filename [CHANGE THIS TO YOUR FILENAME FOR SUSY]
filename='/Users/chinghao/Downloads/SUSY.csv'
#Read in SUSY File. We will only work with subset of data for demonstration purposes.
features=['SUSY','lepton 1 pT', 'lepton 1 eta', 'lepton 1 phi', 'lepton 2 pT', 'lepton 2 eta', 'lepton 2 phi',
'missing energy magnitude', 'missing energy phi', 'MET_rel', 'axial MET', 'M_R', 'M_TR_2', 'R', 'MT2',
'S_R', 'M_Delta_R', 'dPhi_r_b', 'cos(theta_r1)']
low_features=['lepton 1 pT', 'lepton 1 eta', 'lepton 1 phi', 'lepton 2 pT', 'lepton 2 eta', 'lepton 2 phi',
'missing energy magnitude', 'missing energy phi']
high_features=['MET_rel', 'axial MET', 'M_R', 'M_TR_2', 'R', 'MT2','S_R', 'M_Delta_R', 'dPhi_r_b', 'cos(theta_r1)']
#Number of datapoints to work with
N = 100000
print("Size of dataset : %i"%N)
df = pd.read_csv(filename, header=None,nrows=N,engine='python')
df.columns=features
y = df['SUSY'].values
X = df[[col for col in df.columns if col!="SUSY"]]
#Make datasets using only the 8 low-level features and 10 high-level features
X_low=X[low_features]
X_high=X[high_features]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=0)
X_low_train, X_low_test, y_low_train, y_low_test = train_test_split(X_low, y, test_size=.1, random_state=0)
X_high_train, X_high_test, y_high_train, y_high_test = train_test_split(X_high, y, test_size=.1, random_state=0)
# For next cell
from sklearn.metrics import roc_auc_score
import time
import xgboost as xgb
import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
print("Training on %i examples with %i features"%X_train.shape)
#Use default parameters and train on full dataset
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
#Train and time classifier
start_time = time.time()
XGBclassifier.fit(X_train, y_train)
run_time = time.time() - start_time
#Make Predictions
print("Predicting on %i examples with %i features\n"%X_test.shape)
y_pred= XGBclassifier.predict(X_test)
#Print Results
print("Model Accuracy with all features: {:.2f}%".format(100*XGBclassifier.score(X_test, y_test)))
print("The AUC score with all features is {:.2f}".format(roc_auc_score(y_test,y_pred)))
print("Run time with all features: {:.2f} sec\n\n".format(run_time))
#Rerun with just low-level kinematic features with default parameters
print("Training on %i examples with %i features"%X_low_train.shape)
XGBclassifier_low = xgb.sklearn.XGBClassifier(nthread=-1, seed=1)
#Train and time classifier
start_time = time.time()
XGBclassifier_low.fit(X_low_train, y_low_train)
run_time = time.time() - start_time
#Make Predictions
print("Predicting on %i examples with %i features\n"%X_low_test.shape)
y_low_pred = XGBclassifier_low.predict(X_low_test)
#Print Results
print("Model Accuracy with just low-level kinematic features: {:.2f}%".format(100*XGBclassifier_low.score(X_low_test, y_low_test)))
print("The low-level AUC score is {:.2f}".format(roc_auc_score(y_test,y_low_pred)))
print("Run time with low-level features: {:.2f} sec\n\n".format(run_time))
#Rerun with just high-level kinematic features with default parameters
print("Training on %i examples with %i features\n"%X_high_train.shape)
XGBclassifier_high = xgb.sklearn.XGBClassifier(nthread=-1, seed=1)
#Train and time classifier
start_time = time.time()
XGBclassifier_high.fit(X_high_train, y_high_train)
run_time = time.time() - start_time
print("Training on %i examples with %i features"%X_high_test.shape)
#Make Predictions
y_high_pred = XGBclassifier_high.predict(X_high_test)
#Print Results
print("Model Accuracy with just high-level features: {:.2f}%".format(100*XGBclassifier_low.score(X_low_test, y_low_test)))
print("The high-level AUC score is {:.2f}".format(roc_auc_score(y_test,y_high_pred)))
print("Run time with high-level features: {:.2f} sec\n\n".format(run_time))
One nice aspect of XGBoost (and ensemble methods in general) is that it is easy to visualize feature importances. XGBoost provides handy built-in functions for plotting these (similar functions also exist for the scikit-learn implementation of Random Forests). One quantity we can compute is the feature importance score (F score), which measures how many times each feature was split on. The higher this number, the more finely the partitions divide along that direction, and presumably the more informative the feature is for our classification task.
#import ml_style as style
import matplotlib as mpl
#mpl.rcParams.update(style.style)
import matplotlib.pyplot as plt
%matplotlib inline
fig=plt.figure()
xgb.plot_importance(XGBclassifier, ax=plt.gca())
fig.subplots_adjust(left=0.4)
#fig.savefig('SUSYXGBoost1.pdf')
fig=plt.figure()
xgb.plot_importance(XGBclassifier_low, ax=plt.gca())
fig.subplots_adjust(left=0.4)
#fig.savefig('SUSYXGBoost2.pdf')
fig=plt.figure()
xgb.plot_importance(XGBclassifier_high, ax=plt.gca())
fig.subplots_adjust(left=0.4)
#fig.savefig('SUSYXGBoost3.pdf')
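If you prefer the raw numbers behind these plots, the fitted booster exposes them directly. Below is a minimal sketch, assuming a reasonably recent XGBoost version in which XGBClassifier provides get_booster(); the feature names come from the pandas column names used above.
#Extract the raw F scores (how many times each feature was split on) from the trained booster
fscores = XGBclassifier.get_booster().get_score(importance_type='weight')
#Print features from most to least frequently used in splits
for feat, score in sorted(fscores.items(), key=lambda kv: kv[1], reverse=True):
    print("%s: %i" % (feat, score))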
This simple example shows that with the default parameters one can already achieve an accuracy of about 80% using all the features (kinematic and hand-crafted), and a slightly smaller accuracy of about 78.75% using just the kinematic features. Both achieve a very respectable AUC (area under the ROC curve, see https://en.wikipedia.org/wiki/Receiver_operating_characteristic) score of around 0.78 or 0.79. This is significantly better than what was achieved with Boosted Decision Trees (though not deep neural networks) in the original paper, even without tuning hyperparameters. Furthermore, we are using only a small subset of the data (100,000 out of a total of 5,000,000 datapoints), so this performance is a lower bound on what can be accomplished with XGBoost. Note that because we compute the ROC curve from hard class predictions rather than predicted probabilities, it contains only three points and carries little information beyond the accuracy.
We can summarize this by plotting the ROC curves for these three models. Recall that an ROC curve plots the true positive rate against the false positive rate. Here, we will use the modified version common in high-energy physics, which plots the background rejection (true negative rate) against the signal efficiency (true positive rate).
from sklearn.metrics import roc_curve
%matplotlib inline
fpr, tpr, _ = roc_curve(y_test, y_pred)
fpr_low, tpr_low, _ = roc_curve(y_test, y_low_pred)
fpr_high, tpr_high, _ = roc_curve(y_test, y_high_pred)
plt.figure(1)
plt.plot(tpr, 1-fpr, label='Full')
plt.plot(tpr_low, 1-fpr_low, label='Low')
plt.plot(tpr_high, 1-fpr_high, label='High')
plt.legend(loc=1)
plt.xlabel('Signal efficiency')
plt.ylabel('Background Rejection')
plt.savefig("SUSY_roc_XGBoost.pdf")
We will now optimize the parameters of the XGBoost algorithm by performing a grid search, using the very useful scikit-learn function GridSearchCV(). This function allows you to specify grids of parameter values to search over.
Let us briefly discuss the main parameters we can tune to improve performance (a short sketch after this list shows how to set them):

max_depth [default=6]: The maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.

eta or learning_rate [default=0.3]: Step size shrinkage used in the update to prevent overfitting. After each boosting step we obtain the weights of the new features; eta shrinks these feature weights to make the boosting process more conservative.

gamma or min_split_loss [default=0]: The penalty that regularizes the number of leaves. The larger gamma is, the more conservative the algorithm will be.

min_child_weight [default=1]: The minimum sum of instance weights (Hessian) needed in a child. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node (min $B_j$ in the notation of the manuscript). If the tree partition step results in a leaf node whose instance weight sum is less than min_child_weight, the building process gives up further partitioning. The larger this value, the more conservative the algorithm will be.
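As an illustration of how these knobs are set in practice, here is a minimal sketch of a classifier with hand-chosen values for the parameters above; the numbers are placeholders for experimentation, not tuned recommendations.
#Example: an XGBoost classifier with explicitly chosen regularization parameters
XGBclassifier_reg = xgb.XGBClassifier(max_depth=4, learning_rate=0.1, gamma=1.0,
                                      min_child_weight=5, n_estimators=100,
                                      nthread=-1, seed=1)
XGBclassifier_reg.fit(X_train, y_train)
print("Accuracy with hand-chosen parameters: {:.2f}%".format(100*XGBclassifier_reg.score(X_test, y_test)))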
As you can see, this cross-validation procedure is quite computationally expensive: with the parameters below, it takes somewhere between two and five minutes on a powerful laptop. In the cell below we perform the search, and we examine the results in the subsequent cells.
from sklearn.model_selection import GridSearchCV
#Create values to search over
cv_params = {'max_depth': [3,4,6], 'min_child_weight': [1,3,5], 'learning_rate':[0.1,0.3]}
ind_params = {'n_estimators': 100, 'seed':1, 'colsample_bytree': 1,
'objective': 'binary:logistic'}
opt_XGBclassifier = GridSearchCV(xgb.XGBClassifier(**ind_params),
cv_params,
scoring = 'accuracy', cv = 5, n_jobs = -1, verbose=3)
opt_XGBclassifier.fit(X_train, y_train)
#Inspect the cross-validation results (grid_scores_ does not exist in sklearn.model_selection; use cv_results_)
pd.DataFrame(opt_XGBclassifier.cv_results_)[['params', 'mean_test_score', 'std_test_score']]
#Print scores
print('The optimal score on training set is {:0.3f}'.format(opt_XGBclassifier.best_score_))
#Find optimal parameters
print('The optimal parameters for the classifier are:')
print(opt_XGBclassifier.best_params_)
#Fit performance on the test set
XGBclassifier_final=opt_XGBclassifier.best_estimator_
y_pred_final=XGBclassifier_final.predict(X_test)
print("Model Accuray with optimal parameters: {:.2f}%".format(100*XGBclassifier_final.score(X_test, y_test)))
print("The AUC score is {:.2f}".format(roc_auc_score(y_test,y_pred_final)))
We see that we have slightly improved our performance; the default parameters already work quite well. Here, we have used relatively small ensembles (100 trees) to make things faster. In practice, it is worth using much bigger ensembles, in which case overfitting and hyperparameter optimization are likely to have larger effects.
Following this very nice blog post, https://jessesw.com/XG-Boost/, we can further optimize the XGBoost algorithm. Part of this is computational and has to do with how we interface with XGBoost, and part is due to another regularization technique that we discussed in the gradient descent chapter: early stopping. Early stopping is now considered one of the most important regularization techniques; the basic idea is simply to stop training once some measure of validation error stops improving significantly.
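To make this concrete, here is a minimal sketch of early stopping with the XGBoost scikit-learn wrapper. For illustration only, we reuse the test set as the monitored validation set; in practice one should carve out a separate validation split. The exact interface depends on the XGBoost version: recent versions (>= 1.6) take eval_metric and early_stopping_rounds in the constructor as shown here, while older versions accept them as arguments to fit().
#Early stopping: halt boosting once the validation AUC has not improved for 10 consecutive rounds
XGBclassifier_es = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.1, random_state=1,
                                     eval_metric='auc', early_stopping_rounds=10)
XGBclassifier_es.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print("Boosting stopped at iteration %i" % XGBclassifier_es.best_iteration)
print("Accuracy with early stopping: {:.2f}%".format(100*XGBclassifier_es.score(X_test, y_test)))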
You are invited to play with the code and experiment with this yourself.