4.7 Multivariate Analysis
4.7.2 BDT Optimization
This analysis uses a BDT with the GradientBoost method as implemented in the Toolkit for MultiVariate Data Analysis (TMVA) package [120]. Two BDTs are prepared and optimized separately for the VBF and the Boosted categories, using different input variables and BDT parameters. Only the VBF signal process is used in the training for the VBF category, while all signal production processes are considered for the Boosted category. The Higgs boson mass is set to mH = 125 GeV for both categories.
BDT Training Sample
The BDT training requires samples that have large statistics and whose physics processes are well understood. In order to maximize the statistics of the training samples, the mT < 70 GeV requirement and the ∆η(jet1, jet2) > 3.0 requirement are removed. In addition, the Z → ττ and the fake τhad backgrounds are modeled differently from the methods described in Section 4.6. The Z → ττ background is modeled by simulation samples (see Section 4.3.1) instead of the embedding sample, and the fake τhad background is estimated with a different data-driven technique, referred to as the OS-SS method [121, 122].
Furthermore, a cross-evaluation technique [123] is applied in the training procedure. The concept is shown in Fig. 4.17 (b). Events are split randomly into sample A and sample B, and two independent BDTs, denoted BDT A and BDT B, are trained using sample A and sample B, respectively. Since sample A is statistically independent of sample B, BDT A is tested with sample B and BDT B is tested with sample A. Finally, the BDT score distribution is constructed as the sum of the BDT A and BDT B distributions. With the cross-evaluation technique, every event is used once for training and once for testing, so the usable statistics are doubled for the signal and background events estimated from simulation samples.
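The cross-evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `cross_evaluate` and `train_midpoint_cut` are hypothetical, and the toy "model" (a cut at the midpoint of the class means of a single variable) merely stands in for a TMVA BDT.

```python
import random

def cross_evaluate(events, train, seed=0):
    """Two-fold cross-evaluation (illustrative sketch).

    events: list of (feature, label) pairs, label 1 = signal, 0 = background.
    train:  callable taking a list of events and returning a scoring function.
    Returns a list of (score, label) pairs, one per event, where each event is
    scored by the model that was NOT trained on it.
    """
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)      # random split into A and B
    half = len(shuffled) // 2
    sample_a, sample_b = shuffled[:half], shuffled[half:]

    model_a = train(sample_a)                  # "BDT A"
    model_b = train(sample_b)                  # "BDT B"

    # Test A on B and B on A, then merge the two score distributions.
    scored = [(model_a(x), y) for x, y in sample_b]
    scored += [(model_b(x), y) for x, y in sample_a]
    return scored

def train_midpoint_cut(sample):
    """Toy stand-in for a BDT (assumption): score = distance from the
    midpoint between the signal and background means."""
    sig = [x for x, y in sample if y == 1]
    bkg = [x for x, y in sample if y == 0]
    cut = 0.5 * (sum(sig) / len(sig) + sum(bkg) / len(bkg))
    return lambda x: x - cut
```

Every event receives exactly one score, so the merged distribution has the full sample size while no event is evaluated by a model that saw it during training.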
Input Variables and BDT Parameters
Although the BDT algorithm places no limit on the number of input variables, it is difficult to understand and model the correlations among a large number of them. The number and choice of input variables are therefore optimized according to the BDT performance, which is evaluated using the discrimination significance:
\[ \mathrm{Significance} = \frac{\langle S \rangle - \langle B \rangle}{\sqrt{\sigma_S^2 + \sigma_B^2}} , \qquad (4.17) \]
where ⟨S⟩ and ⟨B⟩ denote the means of the BDT output distributions for signal and background, and σS and σB denote their root-mean-square (RMS) values. At first, the BDT is trained with a large number of input variables, and the variable with the lowest importance is discarded. The importance of each input variable is quantified by the number of times it is used for node splitting in the BDT training. This procedure is repeated until the discrimination significance is maximized. Table 4.9 lists the input variables for the VBF and the Boosted categories after the optimization, together with their definitions and importances.
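The variable selection loop can be illustrated with the following Python sketch. It makes several assumptions that are not taken from the text: the significance is computed as the difference of the means over the quadratic sum of the RMS spreads around the means; `train_and_evaluate` is a hypothetical user-supplied callable that trains a BDT on the given variables and returns the significance together with a per-variable importance; and the loop scans down to a single variable and keeps the best set, which is one possible reading of "repeated until the significance is maximized".

```python
import math
import statistics

def discrimination_significance(sig_scores, bkg_scores):
    """Difference of the mean scores divided by the quadratic sum of the
    spreads (population standard deviations) of the two distributions."""
    mean_s = statistics.fmean(sig_scores)
    mean_b = statistics.fmean(bkg_scores)
    var_s = statistics.pvariance(sig_scores, mu=mean_s)
    var_b = statistics.pvariance(bkg_scores, mu=mean_b)
    return (mean_s - mean_b) / math.sqrt(var_s + var_b)

def backward_elimination(variables, train_and_evaluate):
    """Drop the least important variable one at a time and remember the
    variable set that gave the highest significance.

    train_and_evaluate(list_of_names) -> (significance, {name: importance})
    is a hypothetical callback standing in for a full BDT training.
    """
    current = list(variables)
    best_vars, best_sig = None, float("-inf")
    while current:
        significance, importance = train_and_evaluate(current)
        if significance > best_sig:
            best_vars, best_sig = list(current), significance
        if len(current) == 1:
            break
        # Discard the variable used least often in node splitting.
        weakest = min(current, key=lambda name: importance[name])
        current.remove(weakest)
    return best_vars, best_sig
```

Stopping at the first drop in significance instead of scanning all the way down would be an equally reasonable variant; the text does not specify which was used.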
[Figure 4.18: four panels showing the fraction of events versus (a) mττ (MMC) [GeV], (b) ∆R(ℓ, τhad), (c) ETmiss-ϕ centrality and (d) mT [GeV], in the VBF category for the e τhad + µ τhad channels at √s = 8 TeV, 20.3 fb−1.]
Fig. 4.18: Input variable distributions of mττ, ∆R(ℓ, τhad), ETmiss-ϕ centrality, and mT in the VBF category, for the signal (red), the Z → ττ background (blue) and the fake τhad background (green). The signal represents only the VBF signal process.
[Figure 4.19: four panels showing the fraction of events versus (a) mττ (MMC) [GeV], (b) ∆R(ℓ, τhad), (c) ETmiss-ϕ centrality and (d) mT [GeV], in the Boosted category for the e τhad + µ τhad channels at √s = 8 TeV, 20.3 fb−1.]
Fig. 4.19: Input variable distributions of mττ, ∆R(ℓ, τhad), ETmiss-ϕ centrality, and mT in the Boosted category, for the signal (red), the Z → ττ background (blue) and the fake τhad background (green). The signal represents the sum of all signal processes.
The input variables mττ, ∆R(ℓ, τhad) and ETmiss-ϕ centrality characterize the reconstructed H → τℓτhad final state and are used in both the VBF and the Boosted categories. Figures 4.18 and 4.19 show their distributions for the VBF and the Boosted categories, respectively. The mττ and ∆R(ℓ, τhad) variables, which reflect the difference in invariant mass between the Higgs and the Z boson, are useful to distinguish the Z → ττ background events from the signal events. The ETmiss-ϕ centrality has a relatively complex definition compared with the other variables. It quantifies the relative angular position of the ETmiss with respect to the lepton and the τhad in the transverse plane, expressed by:
\[ E_T^{miss}\text{-}\phi\ \mathrm{centrality} = \frac{A + B}{\sqrt{A^2 + B^2}}, \quad A = \frac{\sin(\phi_{E_T^{miss}} - \phi_{\ell})}{\sin(\phi_{\tau_{had}} - \phi_{\ell})}, \quad B = \frac{\sin(\phi_{E_T^{miss}} - \phi_{\tau_{had}})}{\sin(\phi_{\ell} - \phi_{\tau_{had}})} . \qquad (4.18) \]
It takes the value √2 when the ETmiss vector lies exactly midway between the lepton and the τhad, while it takes a value less than 1 when the ETmiss vector is not between them. The mT is used to discriminate against the fake τhad background, especially the W+jets background. Although most of the W+jets phase space is already rejected by the requirement of mT < 70 GeV in the signal region, the remaining events are still one of the dominant backgrounds. The mT therefore remains useful as an input variable, exploiting the shape difference between the signal and the W+jets background.
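As a numerical check of Eq. (4.18), the following minimal Python sketch evaluates the ETmiss-ϕ centrality from three azimuthal angles. The function name `met_phi_centrality` and the choice to raise an error when the lepton and the τhad are collinear or back-to-back in ϕ (where the expression is undefined) are assumptions made for this illustration.

```python
import math

def met_phi_centrality(phi_met, phi_lep, phi_tau):
    """ETmiss-phi centrality of Eq. (4.18); all angles in radians."""
    denom = math.sin(phi_tau - phi_lep)
    if denom == 0.0:
        # Lepton and tau_had collinear or back-to-back in phi: A and B diverge.
        raise ValueError("centrality undefined for sin(phi_tau - phi_lep) = 0")
    a = math.sin(phi_met - phi_lep) / denom
    b = math.sin(phi_met - phi_tau) / math.sin(phi_lep - phi_tau)
    return (a + b) / math.hypot(a, b)
```

For example, with the lepton at ϕ = 0 and the τhad at ϕ = 1, an ETmiss at ϕ = 0.5 gives A = B and hence √2, an ETmiss aligned with the lepton gives 1, and an ETmiss at ϕ = −0.5 (outside the two) gives a value below 1.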
[Figure 4.20: five panels showing the fraction of events versus (a) ∆η(jet1, jet2), (b) ηjet1 × ηjet2, (c) ℓ-η centrality, (d) mj1,j2 [GeV] and (e) pTtotal [GeV], in the VBF category for the e τhad + µ τhad channels at √s = 8 TeV, 20.3 fb−1.]
Fig. 4.20: Input variable distributions used in the VBF category for the signal (red), the Z → ττ background (blue) and the fake τhad background (green). The signal represents only the VBF signal process.
For the VBF category, variables built from the two high-momentum jets with a large pseudorapidity gap are input to the training. Figure 4.20 shows the distributions of the input variables that are used only in the VBF category. Both ∆η(jet1, jet2) and ηjet1 × ηjet2 describe the angular separation of the two jets in η. The VBF signal has higher ∆η(jet1, jet2) values and a long negative tail in ηjet1 × ηjet2 compared with the backgrounds. The mj1,j2 variable combines the information of the two jet momenta and their angular separation, and the VBF signal has higher mj1,j2 values than the backgrounds.
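To make the relation between these three dijet variables concrete, the sketch below computes them from the jet kinematics. It assumes massless jets, for which the invariant mass reduces to m² = 2 pT1 pT2 (cosh ∆η − cos ∆ϕ); the analysis itself presumably uses the full jet four-momenta, and the function name `dijet_variables` is hypothetical.

```python
import math

def dijet_variables(pt1, eta1, phi1, pt2, eta2, phi2):
    """Dijet input variables from (pT [GeV], eta, phi) of the two jets.

    The invariant mass uses the massless-jet approximation (assumption).
    """
    delta_eta = abs(eta1 - eta2)
    eta_product = eta1 * eta2          # negative when jets are in opposite hemispheres
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    return {
        "delta_eta": delta_eta,
        "eta_product": eta_product,
        "mjj": math.sqrt(max(m2, 0.0)),  # guard against tiny negative rounding
    }
```

The cosh ∆η term shows why mj1,j2 grows quickly with the pseudorapidity gap, which is the feature that separates the VBF topology from the backgrounds.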
The ℓ-η centrality variable quantifies the η position of the lepton with respect to the two jets, expressed by:
\[ \ell\text{-}\eta\ \mathrm{centrality} = \exp\left[ \frac{-4}{(\eta_{jet1} - \eta_{jet2})^2} \left( \eta_{\ell} - \frac{\eta_{jet1} + \eta_{jet2}}{2} \right)^2 \right] . \qquad (4.19) \]
It takes the value 1 when the lepton is exactly midway between the two jets in η, and a value less than 1/e when the lepton is outside the two jets. The pTtotal is the magnitude of the vector sum of the transverse momenta of the objects in the VBF signal process: the lepton, the τhad, the two jets and the ETmiss. This variable is sensitive to additional activity beyond the VBF objects.
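The two variables just described can be evaluated with the short Python sketch below. The function names are hypothetical. The prefactor 4 in the exponent is what makes the centrality equal to 1/e when the lepton sits at the η of one of the jets, in line with the limiting values stated in the text, and pTtotal is taken here as the magnitude of the two-dimensional vector sum, which is an interpretation of the description above.

```python
import math

def lepton_eta_centrality(eta_lep, eta_jet1, eta_jet2):
    """l-eta centrality: 1 at the midpoint of the jets, 1/e at a jet, smaller outside."""
    gap = eta_jet1 - eta_jet2
    if gap == 0.0:
        raise ValueError("centrality undefined for jets at identical eta")
    midpoint = 0.5 * (eta_jet1 + eta_jet2)
    return math.exp(-4.0 / gap**2 * (eta_lep - midpoint) ** 2)

def pt_total(objects):
    """Magnitude of the vector sum of transverse momenta.

    objects: iterable of (pT [GeV], phi) for the lepton, tau_had, the two
    jets and the ETmiss.
    """
    px = sum(pt * math.cos(phi) for pt, phi in objects)
    py = sum(pt * math.sin(phi) for pt, phi in objects)
    return math.hypot(px, py)
```

A perfectly balanced VBF event with no extra radiation gives pTtotal close to zero, so large values indicate additional activity.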
[Figure 4.21: two panels showing the fraction of events versus (a) pT(ℓ)/pT(τhad) and (b) ∑pT [GeV], in the Boosted category for the e τhad + µ τhad channels at √s = 8 TeV, 20.3 fb−1.]
Fig. 4.21: Input variable distributions used in the Boosted category for the signal (red), the Z → ττ background (blue) and the fake τhad background (green). The signal represents the sum of all signal processes.
For the Boosted category, two additional variables characterizing the boosted topology are used. The ∑pT represents the total activity in an event, and the signal has much higher ∑pT values than the backgrounds. The pT(ℓ)/pT(τhad) ratio is especially useful to discriminate against the fake τhad background, based on the asymmetry between pT(ℓ) and pT(τhad) that arises from the different number of neutrinos in the τ decays (two in the leptonic decay, one in the hadronic decay). As a result, the signal has lower pT(ℓ)/pT(τhad) values than the fake τhad background. Figure 4.21 shows the ∑pT and the pT(ℓ)/pT(τhad) distributions.
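A minimal sketch of the two Boosted-category variables follows. The exact set of objects entering ∑pT is defined in Table 4.9 and is not reproduced here; the sketch assumes a scalar sum over the lepton, the τhad and the jets, and the function names are hypothetical.

```python
def sum_pt(lep_pt, tau_pt, jet_pts):
    """Scalar sum of transverse momenta [GeV] (assumed object list:
    lepton, tau_had and all selected jets)."""
    return lep_pt + tau_pt + sum(jet_pts)

def pt_ratio(lep_pt, tau_pt):
    """pT(lepton) / pT(tau_had); low values are signal-like."""
    if tau_pt <= 0.0:
        raise ValueError("tau_had pT must be positive")
    return lep_pt / tau_pt
```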
The BDT with GradientBoost has several parameters that must be tuned to maximize its performance: MaxDepth, MinNodeSize, Ntrees and Shrinkage, as described in Section 4.7.1. MaxDepth and MinNodeSize control how far each decision tree grows, while Ntrees and Shrinkage steer the boosting algorithm. A two-dimensional scan of MaxDepth and Ntrees is performed to maximize the significance of the BDT output, and the remaining parameters are then optimized by scanning them one at a time. This optimization is performed separately for the VBF and the Boosted categories. The parameter values are summarized in Table 4.10.
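The two-stage scan can be sketched as follows. This is an illustration only: `evaluate` is a hypothetical callable that would train a BDT with the given parameters and return the significance of its output, and the grid values in the usage example are placeholders, not the values scanned in the analysis.

```python
import itertools

def scan_parameters(evaluate, grid_2d, sequential, defaults):
    """Stage 1: joint grid scan of MaxDepth and Ntrees.
    Stage 2: scan each remaining parameter in turn, keeping improvements.

    evaluate(params_dict) -> significance (hypothetical training callback).
    """
    best_params, best_score = dict(defaults), float("-inf")

    # Stage 1: two-dimensional scan.
    for depth, ntrees in itertools.product(grid_2d["MaxDepth"], grid_2d["Ntrees"]):
        trial = dict(defaults, MaxDepth=depth, Ntrees=ntrees)
        score = evaluate(trial)
        if score > best_score:
            best_params, best_score = trial, score

    # Stage 2: step-by-step scan of the remaining parameters.
    for name, values in sequential.items():
        for value in values:
            trial = dict(best_params, **{name: value})
            score = evaluate(trial)
            if score > best_score:
                best_params, best_score = trial, score

    return best_params, best_score
```

A call might look like `scan_parameters(evaluate, {"MaxDepth": [2, 3, 4], "Ntrees": [200, 400, 600]}, {"MinNodeSize": [1, 5, 10], "Shrinkage": [0.05, 0.1, 0.2]}, {"MinNodeSize": 5, "Shrinkage": 0.1})`. Scanning MaxDepth and Ntrees jointly accounts for their interplay, while the one-at-a-time second stage keeps the number of trainings manageable at the cost of possibly missing correlated optima.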