

5.5 Boosted Decision Trees Analysis

variables, x_i, x_j, and x_k. The algorithm starts from the root node. An event is classified into one of the leaf nodes at the bottom of the tree, labelled S in the figure for signal or B in the figure for background. Each branching provides the best separation criterion between the background events and the signal events for one discriminating variable; the criteria are determined by training the tree on a Monte Carlo simulation sample of well-known composition. The same discriminating variable can be used at several splitting nodes, but with a different separation point at each node. A ranking of the BDT input variables is obtained by counting how many times each variable is used to split a decision tree node. When the ranking is produced, each split occurrence is weighted by the squared separation gain it achieves and by the number of events in the node.
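
To make the classification step concrete, the following minimal C++ sketch walks one event from the root node down to a signal or background leaf. This is our own illustration, not the TMVA implementation; all type and function names are chosen here purely for illustration.

```cpp
// Minimal sketch (not the TMVA implementation) of classifying an event by
// descending a binary decision tree from the root node to a terminal leaf.
#include <iostream>
#include <vector>

struct Node {
  bool   isLeaf   = false;
  bool   isSignal = false;   // label of a terminal node: S (true) or B (false)
  int    varIndex = -1;      // which input variable x_i is cut on at this node
  double cutValue = 0.0;     // cut position chosen during the training
  Node*  left     = nullptr; // events failing the cut
  Node*  right    = nullptr; // events passing the cut
};

// Apply the single-variable cut stored at each splitting node until a leaf
// is reached; the leaf label is the classification of the event.
bool classify(const Node* node, const std::vector<double>& x) {
  while (!node->isLeaf)
    node = (x[node->varIndex] > node->cutValue) ? node->right : node->left;
  return node->isSignal;
}

int main() {
  // Tiny hand-built tree: a single splitting node cutting on x_0 at 0.5.
  Node sig{true, true}, bkg{true, false};
  Node root;
  root.varIndex = 0; root.cutValue = 0.5; root.left = &bkg; root.right = &sig;
  std::cout << std::boolalpha << classify(&root, {0.7}) << std::endl;  // true
  return 0;
}
```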

In the first step of the decision tree algorithm, the sample is separated into two sub-samples, one background-like and the other signal-like. In the second step, each of the two sub-samples is again separated into two sub-samples. The algorithm keeps splitting the sub-samples into background-like and signal-like components until the number of events in a sub-sample reaches a minimum number, the minimum leaf size. The repeated splitting allows signal events that were wrongly separated by a poorly discriminating variable at an earlier node to be recovered at a later one. At the end, the tree consists of leaves corresponding to background-like and signal-like regions, and these regions provide event weights corresponding to scores. All events are used in the BDT implementation, with these event weights applied.
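
The splitting procedure can be summarised by the sketch below, again our own simplified illustration rather than the actual TMVA code. The cut search is replaced by a trivial midpoint cut on the first variable as a stand-in for the Gini-index scan described later in this section.

```cpp
// Sketch (ours, not the TMVA code) of growing a decision tree: the sample is
// split recursively into background-like and signal-like sub-samples until
// the minimum leaf size is reached.
#include <algorithm>
#include <cstddef>
#include <memory>
#include <tuple>
#include <utility>
#include <vector>

struct Event { std::vector<double> x; bool isSignal = false; double weight = 1.0; };

struct Node {
  bool isLeaf = false, isSignal = false;
  int varIndex = -1;
  double cutValue = 0.0;
  std::unique_ptr<Node> fail, pass;
};

// Stand-in for the real cut search: midpoint of variable 0.
std::pair<int, double> findBestSplit(const std::vector<Event>& ev) {
  double lo = ev.front().x[0], hi = lo;
  for (const auto& e : ev) { lo = std::min(lo, e.x[0]); hi = std::max(hi, e.x[0]); }
  return {0, 0.5 * (lo + hi)};
}

std::unique_ptr<Node> grow(std::vector<Event> ev, std::size_t minLeafSize) {
  auto node = std::make_unique<Node>();
  double ws = 0.0, wb = 0.0;
  for (const auto& e : ev) (e.isSignal ? ws : wb) += e.weight;
  // Stop once the sub-sample reaches the minimum leaf size or becomes pure.
  if (ev.size() <= minLeafSize || ws == 0.0 || wb == 0.0) {
    node->isLeaf = true;
    node->isSignal = (ws > wb);  // the weighted majority labels the leaf S or B
    return node;
  }
  std::tie(node->varIndex, node->cutValue) = findBestSplit(ev);
  std::vector<Event> pass, fail;
  for (auto& e : ev)
    (e.x[node->varIndex] > node->cutValue ? pass : fail).push_back(std::move(e));
  if (pass.empty() || fail.empty()) {  // degenerate split: stop here
    node->isLeaf = true; node->isSignal = (ws > wb); return node;
  }
  node->pass = grow(std::move(pass), minLeafSize);
  node->fail = grow(std::move(fail), minLeafSize);
  return node;
}

int main() {
  std::vector<Event> ev = {{{0.1}, false}, {{0.2}, false}, {{0.8}, true}, {{0.9}, true}};
  std::unique_ptr<Node> tree = grow(std::move(ev), 1);  // minimum leaf size of one event
  return tree ? 0 : 1;
}
```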

Boosting means that the decision trees are re-trained repeatedly using a re-weighted training sample, in which an additional weight is applied to the events mis-classified in the previous training. The additional weight is calculated from the mis-classification rate of each tree. We use the average of the scores of the boosted trees. In other words, the BDT algorithm produces a forest of decision trees with different classifier responses, originating from statistical fluctuations, from the same training sample, and then classifies an event by a majority vote of the classifications made by the individual trees in the forest.
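
A hedged sketch of the re-weighting step follows: events mis-classified by the previous tree receive the additional weight α introduced in Eq. (5.16) below before the next tree is trained. The event structure, the simple stump classifier, and the omission of weight renormalisation are simplifications of ours, not the TMVA bookkeeping.

```cpp
// Sketch (ours) of the boosting re-weighting step: events mis-classified by
// the previous tree are given the additional weight alpha of Eq. (5.16);
// weight renormalisation and other TMVA details are omitted.
#include <iostream>
#include <vector>

struct Event { std::vector<double> x; bool isSignal; double weight; };

// Weighted mis-classification rate of the previous tree on the sample.
template <class Classifier>
double misclassificationRate(const std::vector<Event>& ev, const Classifier& predict) {
  double wrong = 0.0, total = 0.0;
  for (const auto& e : ev) {
    total += e.weight;
    if (predict(e.x) != e.isSignal) wrong += e.weight;
  }
  return wrong / total;
}

// Multiply the weight of every mis-classified event by alpha = (1 - err)/err
// before the next tree in the forest is trained.
template <class Classifier>
void reweight(std::vector<Event>& ev, const Classifier& predict) {
  const double err   = misclassificationRate(ev, predict);
  const double alpha = (1.0 - err) / err;
  for (auto& e : ev)
    if (predict(e.x) != e.isSignal) e.weight *= alpha;
}

int main() {
  std::vector<Event> ev = {{{0.2}, false, 1.0}, {{0.9}, true, 1.0}, {{0.4}, true, 1.0}};
  auto previousTree = [](const std::vector<double>& x) { return x[0] > 0.5; };
  reweight(ev, previousTree);  // err = 1/3, so the mis-classified event gets alpha = 2
  std::cout << "boosted weight: " << ev[2].weight << std::endl;  // prints 2
  return 0;
}
```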

When the BDT is optimised in the training, attention must be paid to over-training. An over-trained BDT is tuned too closely to the training sample and becomes sensitive to differences between the training and test samples that have no physical origin. For example, final leaf nodes that are too small are sensitive to statistical fluctuations of the events in the training sample. In order to check whether the trained trees are over-trained, they are examined using test samples that are statistically independent of the training samples. The BDT output of the training sample is compared with that of the test sample using a Kolmogorov-Smirnov test, and the BDT parameters are then optimised.
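
Such an over-training check can be sketched with ROOT as follows. The function name is illustrative and the filling of the output histograms from the actual training and test samples is omitted; only the use of TH1::KolmogorovTest is the point of the sketch.

```cpp
// Sketch (ours) of the over-training check with ROOT: the BDT output of the
// training sample is compared with that of the independent test sample using
// a Kolmogorov-Smirnov test. Filling the histograms is omitted here.
#include <iostream>
#include "TH1F.h"

void checkOvertraining(const TH1F* trainOutput, const TH1F* testOutput) {
  // TH1::KolmogorovTest returns a compatibility probability; values close to
  // zero indicate that the two distributions differ, i.e. possible over-training.
  const double prob = trainOutput->KolmogorovTest(testOutput);
  std::cout << "KS probability (train vs. test) = " << prob << std::endl;
}
```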

Table 5.2 shows the options for BDT training.

Table 5.2: Options for the BDT training.

BDT option            Setting
--------------------  -------------
BoostType             GradientBoost
Shrinkage             0.10
GradBaggingFraction   0.5
nCuts                 20
NNodesMax             5
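
For orientation, the options of Table 5.2 would be passed to TMVA roughly as in the following sketch. A ROOT 5-era Factory interface is assumed; the job name, input files, tree names, and variables are placeholders, further options such as the number of trees are omitted, and the exact option names depend on the TMVA version. This is not the actual analysis configuration.

```cpp
// Illustrative TMVA booking of a gradient-boosted BDT with the options of
// Table 5.2. File names, tree names, variables, and the job name are
// placeholders, not the analysis code.
#include "TCut.h"
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void bookBDT() {
  TFile* outFile = TFile::Open("TMVA_BDT.root", "RECREATE");
  TMVA::Factory factory("HplusBDT", outFile, "!V:!Silent:AnalysisType=Classification");

  // A few discriminating variables as placeholders for the selected inputs.
  factory.AddVariable("foxWolfram2", 'F');
  factory.AddVariable("avgDeltaRbb", 'F');
  factory.AddVariable("leadJetPt",   'F');

  TFile* sigFile = TFile::Open("signal.root");
  TFile* bkgFile = TFile::Open("background.root");
  factory.AddSignalTree    (static_cast<TTree*>(sigFile->Get("nominal")), 1.0);
  factory.AddBackgroundTree(static_cast<TTree*>(bkgFile->Get("nominal")), 1.0);
  factory.PrepareTrainingAndTestTree(TCut(""), "SplitMode=Random:NormMode=NumEvents");

  // Options of Table 5.2; BoostType=Grad selects GradientBoost.
  factory.BookMethod(TMVA::Types::kBDT, "BDT",
                     "BoostType=Grad:Shrinkage=0.10:GradBaggingFraction=0.5:"
                     "nCuts=20:NNodesMax=5");

  factory.TrainAllMethods();
  factory.TestAllMethods();
  factory.EvaluateAllMethods();
  outFile->Close();
}
```

The trained weights would then be read back with TMVA::Reader when evaluating the BDT in the analysis; that step is not shown here.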

BoostType indicates the boosting type of the trees. There are several options such as AdaBoost, Bagging, and Grad (GradientBoost). During the boosting procedure, a boost weight α is applied to the misclassified events. The weight is expressed by

\alpha = \frac{1 - \mathrm{err}}{\mathrm{err}} \,, \qquad (5.16)

where err is the misclassification rate of the previous tree. We define the result of an individual classifier as h(X), where X is the tuple of input variables, X = (x_1, ..., x_{n_var}), with n_var the number of input variables. The response h(X) takes the value -1 for background and +1 for signal. The boosted event classification y(X) is expressed by

y(X) = \frac{1}{N_{\mathrm{collection}}} \cdot \sum_{i}^{N_{\mathrm{collection}}} \ln(\alpha_i) \cdot h_i(X) \,, \qquad (5.17)

where the sum runs over all classifiers in the collection. We obtain y(X) from the BDT training.
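
Equations (5.16) and (5.17) amount to the following short sketch (our own minimal C++), where the individual tree responses h_i(X) are represented as ±1 and each tree carries its own misclassification rate.

```cpp
// Sketch (ours) of Eqs. (5.16) and (5.17): the boost weight is computed from
// the mis-classification rate of each tree, and the boosted classification is
// the ln(alpha)-weighted average of the individual tree responses h_i(X) = +/-1.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

double boostWeight(double err) { return (1.0 - err) / err; }           // Eq. (5.16)

double bdtResponse(const std::vector<int>& h, const std::vector<double>& err) {
  double sum = 0.0;                                                     // Eq. (5.17)
  for (std::size_t i = 0; i < h.size(); ++i)
    sum += std::log(boostWeight(err[i])) * h[i];
  return sum / static_cast<double>(h.size());
}

int main() {
  // Toy forest of three trees: their votes and mis-classification rates.
  std::vector<int>    h   = {+1, +1, -1};
  std::vector<double> err = {0.30, 0.40, 0.45};
  std::cout << "y(X) = " << bdtResponse(h, err) << std::endl;  // > 0: signal-like
  return 0;
}
```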

We also define the function F(X), which is a weighted sum of parametrised base functions f(x; a_m), by

F(X; P) = \sum_{m=0}^{M} \beta_m f(x; a_m) \,; \qquad P \in \{\beta_m; a_m\}_0^M \,, \qquad (5.18)

where f(x; a_m) denotes any TMVA classifier that can act as a weak learner. A weak learner is a classifier with only a slight correlation with the true classification. In the case of decision trees, the weak classifiers are small individual decision trees with a depth often as small as two or three, which therefore have very little discrimination power by themselves. The boosting procedure adjusts the parameters P such that the deviation between the model response F(X) and the true value y is minimised. The deviation is measured by a loss function L(F, y), and the boosting procedure depends on the choice of this loss function. AdaBoost uses the exponential loss function L(F, y) = e^{-F(X)y}, whereas GradientBoost uses the binomial log-likelihood loss function L(F, y) = \ln(1 + e^{-2F(X)y}). The minimisation of the GradientBoost loss function requires an iterative procedure: the algorithm calculates the current gradient of the loss function and then grows a regression tree whose leaf values are adjusted to match the mean value of the gradient in each region defined by the tree structure.
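
One such GradientBoost iteration can be pictured with the following simplified sketch of ours: for each training event the current gradient of the binomial log-likelihood loss is evaluated, and the value of each leaf of the new regression tree is set to the mean gradient of the events in that leaf, scaled by the learning rate. The grouping of events into leaves is assumed to be given, and the actual TMVA implementation differs in detail.

```cpp
// Sketch (ours) of one GradientBoost iteration: the pseudo-residual is the
// negative gradient of the binomial log-likelihood loss L(F,y) = ln(1+e^{-2Fy}),
// and each leaf of the new regression tree takes the mean gradient of the
// events in its region, scaled by the learning rate ("Shrinkage").
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Negative gradient of L(F, y) with respect to F, for truth label y = +/-1.
double pseudoResidual(double F, int y) {
  return 2.0 * y / (1.0 + std::exp(2.0 * F * y));
}

// One boosting step; 'leaves' holds the event indices falling into each leaf
// (region) of the regression tree, which is assumed to have been grown already.
void gradientBoostStep(std::vector<double>& F, const std::vector<int>& y,
                       const std::vector<std::vector<std::size_t>>& leaves,
                       double shrinkage) {
  for (const auto& leaf : leaves) {
    double mean = 0.0;
    for (std::size_t i : leaf) mean += pseudoResidual(F[i], y[i]);
    mean /= static_cast<double>(leaf.size());
    for (std::size_t i : leaf) F[i] += shrinkage * mean;  // adjusted leaf value
  }
}

int main() {
  std::vector<double> F(4, 0.0);                            // start from F(X) = 0
  std::vector<int>    y = {+1, +1, -1, -1};
  std::vector<std::vector<std::size_t>> leaves = {{0, 1}, {2, 3}};
  gradientBoostStep(F, y, leaves, 0.10);                    // Shrinkage of Table 5.2
  std::cout << "updated F: " << F[0] << ", " << F[2] << std::endl;  // +0.1, -0.1
  return 0;
}
```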

By iterating this procedure, the algorithm obtains the set of decision trees that minimises the loss function. The Shrinkage parameter controls the weight of the individual trees and thereby reduces the learning rate of the algorithm; a small shrinkage allows more trees to be grown. GradBaggingFraction is the fraction of events used in a bagging-like resampling procedure, in which random sub-samples of the training events are used to grow the trees. Bagging denotes a resampling technique in which a classifier is repeatedly trained on resampled training events, such that the combined classifier represents an average of the individual classifiers.

At the training stage, each splitting node picks the single input variable with the best discriminating power and places a single cut on it to separate the background and the signal events. The cut value is chosen by maximising the separation gain computed from the Gini index, the default option of "SeparationType", which is defined as p·(1−p), where p is the purity of the node. The purity is the ratio of signal events to all events in the node, so a node containing only background events has p equal to zero. The cut value is optimised by scanning over the variable range with a granularity set via the nCuts option, the number of grid points in the variable range used to find the optimal cut in the node splitting. NNodesMax limits the tree size. Limiting the tree depth during the tree-building process helps to avoid over-training, because over-trained trees are typically grown to a large depth. These settings were chosen by considering not only good separation but also the reduction of the anti-correlation between the signal normalisation and the tt+bb cross-section uncertainty. We trained the BDT for each charged Higgs boson mass hypothesis using the selected variables in the signal region. Figure 5.31 shows the distributions of the background-like events and of the signal-like events, originating from charged Higgs bosons with a mass of 500 GeV, in the BDT input variables.
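
The single-node cut optimisation just described can be sketched as follows; all names are ours and the weighting conventions are simplified relative to TMVA.

```cpp
// Sketch (ours) of the cut search at one splitting node: the range of a single
// variable is scanned over nCuts grid points, and the cut maximising the
// separation gain, computed from the Gini index p(1-p), is kept.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

struct Event { std::vector<double> x; bool isSignal; double weight; };

// Gini index p(1-p) of a weighted signal/background mixture; p is the purity.
double gini(double ws, double wb) {
  const double w = ws + wb;
  return (w > 0.0) ? (ws / w) * (1.0 - ws / w) : 0.0;
}

// Return the best cut value on variable 'var' and its separation gain, defined
// as the parent criterion minus the weighted criteria of the two daughters.
std::pair<double, double> bestCut(const std::vector<Event>& ev, int var, int nCuts) {
  double lo = ev.front().x[var], hi = lo, ws = 0.0, wb = 0.0;
  for (const auto& e : ev) {
    lo = std::min(lo, e.x[var]);
    hi = std::max(hi, e.x[var]);
    (e.isSignal ? ws : wb) += e.weight;
  }
  const double parent = (ws + wb) * gini(ws, wb);
  double bestGain = -1.0, bestValue = lo;
  for (int i = 1; i <= nCuts; ++i) {
    const double cut = lo + (hi - lo) * i / (nCuts + 1.0);  // equidistant grid point
    double wsR = 0.0, wbR = 0.0;
    for (const auto& e : ev)
      if (e.x[var] > cut) (e.isSignal ? wsR : wbR) += e.weight;
    const double gain = parent - (wsR + wbR) * gini(wsR, wbR)
                               - (ws - wsR + wb - wbR) * gini(ws - wsR, wb - wbR);
    if (gain > bestGain) { bestGain = gain; bestValue = cut; }
  }
  return {bestValue, bestGain};
}

int main() {
  std::vector<Event> ev = {{{0.1}, false, 1.0}, {{0.2}, false, 1.0},
                           {{0.8}, true, 1.0},  {{0.9}, true, 1.0}};
  const auto result = bestCut(ev, 0, 20);  // nCuts = 20 as in Table 5.2
  std::cout << "best cut at x = " << result.first
            << ", separation gain = " << result.second << std::endl;
  return 0;
}
```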

Figure 5.31: Distributions of background-like events and signal-like events, which originate from charged Higgs bosons with mass of 500 GeV, in the BDT input variables.

In addition to the control plots, the variables are validated by comparing the BDT responses between the real data and the MC simulation in the four control regions. Figures 5.32 - 5.36 show the distributions of the BDT input variables in the signal region. The red dashed line indicates the signal distribution with m_H+ of 300 GeV, normalised to the number of events in the real data. The blue dashed histogram indicates the signal distribution with m_H+ of 500 GeV, normalised to the number of events in the real data. We find good agreement between the real data and the MC simulation.

Figures 5.37 and 5.38 show the distributions of the BDT output trained on the charged Higgs boson signal with m_H± of 300 or 500 GeV. The blue dashed lines in the figures indicate the signal distribution normalised to the number of events in the data. The closer the output is to −1, the larger the contribution from background-like events; the closer it is to +1, the larger the contribution from signal-like events. We find that the BDT discrimination improves for the higher charged Higgs boson mass point.

Figure 5.32: Second Fox-Wolfram moment calculated from the jets.

Figure 5.33: Average ΔR_bb.

Figure 5.34: p_T of the leading jet.

Figure 5.35: m_bb for the b-jet pair that is closest in ΔR.

Figure 5.36: Hadronic H_T.

Figure 5.37: Distribution of the BDT output trained on the m_H+ = 300 GeV signal.

Figure 5.38: Distribution of the BDT output trained on the m_H+ = 500 GeV signal.
