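BDT Optimization - Multivariate Analysis - Yuki Sakurai, October 2015, Evidence for the Higgs boson in the ττ final state and its CP measurement in proton-proton collisions with the ATLAS detector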

4.7 Multivariate Analysis

4.7.2 BDT Optimization

This analysis uses a BDT with the GradientBoost method, as implemented in the Toolkit for MultiVariate Data Analysis (TMVA) package [120]. Two BDTs are prepared and optimized separately for the VBF and the Boosted categories, using different input variables and BDT parameters. Only the VBF signal process is used in the training for the VBF category, while all signal production processes are considered for the Boosted category. The Higgs boson mass is set to m_H = 125 GeV for both categories.

BDT Training Sample

The BDT training requires samples with large statistics and well-understood physics processes. To maximize the statistics of the training samples, the m_T < 70 GeV requirement and the ∆η(jet1, jet2) < 3.0 requirement are removed. In addition, the Z → ττ and fake τ_had backgrounds are modeled differently from the methods described in Section 4.6. The Z → ττ background is modeled by simulation samples (see Section 4.3.1) instead of the embedding sample, and the fake τ_had background is estimated with a different data-driven technique, referred to as the OS-SS method [121, 122].

Furthermore, a cross-evaluation technique [123] is applied in the training procedure. The concept is illustrated in Fig. 4.17 (b). Events are randomly split into sample A and sample B, and two independent BDTs, denoted BDT A and BDT B, are trained using sample A and sample B, respectively. Because sample A is statistically independent of sample B, BDT A is tested with sample B and BDT B is tested with sample A. Finally, the BDT score distribution is constructed as the sum of the BDT A and BDT B distributions. With the cross-evaluation technique, the statistics of the training samples are doubled for the signal and background events estimated by simulation samples.
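The cross-evaluation logic can be illustrated with a minimal Python sketch. It is not the TMVA-based implementation used in the analysis: the function names are hypothetical, each event is assumed to be a pair (feature value, is_signal), and a simple one-variable threshold stands in for the BDT training so that the example stays self-contained.

```python
import random


def split_events(events, seed=0):
    """Randomly assign each event to sample A or sample B."""
    rng = random.Random(seed)
    sample_a, sample_b = [], []
    for event in events:
        (sample_a if rng.random() < 0.5 else sample_b).append(event)
    return sample_a, sample_b


def train_threshold(sample):
    """Stand-in for BDT training (assumption): a cut placed midway
    between the signal and background means of a single feature."""
    sig = [x for x, is_signal in sample if is_signal]
    bkg = [x for x, is_signal in sample if not is_signal]
    cut = 0.5 * (sum(sig) / len(sig) + sum(bkg) / len(bkg))
    return lambda x: x - cut  # positive score = signal-like


def cross_evaluate(events, train=train_threshold, seed=0):
    """Train on A and score B, train on B and score A, then merge."""
    sample_a, sample_b = split_events(events, seed)
    model_a = train(sample_a)
    model_b = train(sample_b)
    # Each model only scores events it has not seen during training.
    scores = [(model_a(x), is_signal) for x, is_signal in sample_b]
    scores += [(model_b(x), is_signal) for x, is_signal in sample_a]
    return scores
```

Every event receives exactly one score, always from the model trained on the other half, so the merged score distribution is built from the full sample without reusing training events for testing.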

Input Variables and BDT Parameters

Although the BDT algorithm does not limit the number of input variables, the correlations among a large number of input variables are difficult to understand and model. The number and kind of input variables are therefore optimized according to the BDT performance, which is evaluated using the discrimination significance:

\[
\mathrm{Significance} = \frac{\langle S \rangle + \langle B \rangle}{\sqrt{\sigma_S^2 + \sigma_B^2}}, \qquad (4.17)
\]

where <S> and <B> denote the means of the BDT output distributions for signal and background, and σ_S and σ_B denote the corresponding root-mean-square (RMS) values. At first, a large number of input variables are used to train the BDT, and the variable with the lowest importance is then discarded. The importance of each input variable is quantified by the number of times it is used for a node splitting in the BDT training.

This optimization procedure is repeated until the discrimination significance is maximized. Table 4.9 lists the input variables for the VBF and the Boosted categories after the optimization. The definitions of the input variables and their importances are also shown in the table.
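The iterative pruning described above can be sketched as a backward-elimination loop. This is an illustration only: `train_and_score` is a hypothetical callback standing in for a full BDT training that returns the discrimination significance and the per-variable importances, and the stopping rule (keep the best set seen along the whole elimination path) is an assumption.

```python
def prune_variables(variables, train_and_score):
    """Repeatedly drop the least important variable and keep the set
    of variables that gave the highest significance.

    train_and_score(variables) is a hypothetical callback returning
    (significance, {variable: importance}).
    """
    current = list(variables)
    best_sig, importance = train_and_score(current)
    best_vars = list(current)
    while len(current) > 1:
        # Discard the variable with the lowest importance.
        weakest = min(current, key=lambda v: importance[v])
        current = [v for v in current if v != weakest]
        sig, importance = train_and_score(current)
        if sig > best_sig:
            best_sig, best_vars = sig, list(current)
    return best_vars, best_sig
```

In a real workflow each call to the callback would retrain the BDT on the reduced variable list, so the cost grows with the number of starting variables.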

[Figure 4.18: four panels showing the fraction of events versus (a) m_{ττ} (MMC) [GeV], (b) ∆R(ℓ, τ_had), (c) E_T^miss-ϕ centrality and (d) m_T [GeV] in the VBF category; √s = 8 TeV, 20.3 fb^-1.]

Fig. 4.18: Input variable distributions of m_{ττ}, ∆R(ℓ, τ_had), E_T^miss-ϕ centrality, and m_T in the VBF category, for the signal (red), the Z → ττ background (blue) and the fake τ_had background (green). The signal represents only the VBF signal process.

[Figure 4.19: four panels showing the fraction of events versus (a) m_{ττ} (MMC) [GeV], (b) ∆R(ℓ, τ_had), (c) E_T^miss-ϕ centrality and (d) m_T [GeV] in the Boosted category; √s = 8 TeV, 20.3 fb^-1.]

Fig. 4.19: Input variable distributions of m_{ττ}, ∆R(ℓ, τ_had), E_T^miss-ϕ centrality, and m_T in the Boosted category, for the signal (red), the Z → ττ background (blue) and the fake τ_had background (green). The signal represents the sum of all signal processes.

The input variables m_{ττ}, ∆R(ℓ, τ_had) and E_T^miss-ϕ centrality characterize the reconstructed H → τ_ℓ τ_had final state and are used in both the VBF and the Boosted categories. Figures 4.18 and 4.19 show their distributions for the VBF and the Boosted category, respectively. The m_{ττ} and ∆R(ℓ, τ_had) variables, which carry information on the mass difference between the Higgs and the Z boson, are useful to distinguish the Z → ττ background events from the signal events. The E_T^miss-ϕ centrality has a relatively complex definition compared with the other variables. It quantifies the angular position of the E_T^miss relative to the lepton and the τ_had in the transverse plane, and is expressed by:

\[
E_T^{\mathrm{miss}}\text{-}\phi\ \mathrm{centrality} = \frac{A + B}{\sqrt{A^2 + B^2}}, \quad
A = \frac{\sin(\phi_{E_T^{\mathrm{miss}}} - \phi_\ell)}{\sin(\phi_{\tau_{\mathrm{had}}} - \phi_\ell)}, \quad
B = \frac{\sin(\phi_{E_T^{\mathrm{miss}}} - \phi_{\tau_{\mathrm{had}}})}{\sin(\phi_\ell - \phi_{\tau_{\mathrm{had}}})}. \qquad (4.18)
\]

It takes the value √2 when the E_T^miss vector lies exactly midway between the lepton and the τ_had, and a value less than 1 when the E_T^miss vector does not lie between them.

The m_T is used to discriminate against the fake τ_had background, especially the W+jets background. Although most of the W+jets phase space is already rejected by the requirement m_T < 70 GeV in the signal region, the remaining events are still one of the dominant backgrounds. The m_T therefore remains a meaningful input variable, exploiting the shape difference between the signal and the W+jets background.
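The E_T^miss-ϕ centrality of Eq. (4.18) can be evaluated numerically as in the following sketch. The function name and argument order are illustrative choices, angles are assumed to be azimuthal angles in radians, and the handling of the degenerate case (lepton and τ_had collinear or back-to-back in ϕ) is an assumption, since the text does not discuss it.

```python
import math


def met_phi_centrality(phi_met, phi_lep, phi_tau):
    """E_T^miss-phi centrality following Eq. (4.18); angles in radians."""
    denom = math.sin(phi_tau - phi_lep)
    if denom == 0.0:
        # A and B are undefined when the two visible objects are
        # collinear or back-to-back in phi (handling is an assumption).
        raise ValueError("lepton and tau_had are degenerate in phi")
    a = math.sin(phi_met - phi_lep) / denom
    b = math.sin(phi_met - phi_tau) / math.sin(phi_lep - phi_tau)
    return (a + b) / math.hypot(a, b)
```

For example, with the lepton at ϕ = 0 and the τ_had at ϕ = 1, an E_T^miss direction of ϕ = 0.5 gives A = B and hence the maximum value √2, while a direction aligned with either object gives 1 and a direction outside the opening angle gives a value below 1.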

[Figure 4.20: five panels showing the fraction of events versus (a) ∆η(jet1, jet2), (b) η_jet1 × η_jet2, (c) ℓ-η centrality, (d) m_{j1,j2} [GeV] and (e) p_T^total [GeV] in the VBF category; e τ_had + μ τ_had, √s = 8 TeV, 20.3 fb^-1.]

Fig. 4.20: Input variable distributions used in the VBF category for the signal (red), the Z → ττ background (blue) and the fake τ_had background (green). The signal represents only the VBF signal process.

For the VBF category, variables built from the two high-momentum jets with a large pseudo-rapidity gap are input to the training. Figure 4.20 shows the distributions of the input variables that are used only in the VBF category. Both ∆η(jet1, jet2) and η_jet1 × η_jet2 describe the angular separation of the two jets in η. Compared with the backgrounds, the VBF signal has higher ∆η(jet1, jet2) values and a long negative tail in η_jet1 × η_jet2. The m_{j1,j2} variable combines the information of the two jet momenta and their angular separation, and the VBF signal has higher m_{j1,j2} values than the backgrounds.
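As an illustration of how these jet variables relate to the jet kinematics, the sketch below computes them from two jets given as (p_T, η, ϕ). The tuple layout and function name are hypothetical, and the jets are treated as massless (an assumption), in which case the dijet invariant mass is m² = 2 p_T1 p_T2 [cosh(∆η) − cos(∆ϕ)].

```python
import math


def vbf_jet_variables(jet1, jet2):
    """Return (delta_eta, eta_product, m_jj) for two jets.

    Each jet is a tuple (pt [GeV], eta, phi); jets are assumed massless.
    """
    pt1, eta1, phi1 = jet1
    pt2, eta2, phi2 = jet2
    delta_eta = abs(eta1 - eta2)
    eta_product = eta1 * eta2
    # Massless two-body invariant mass; max() guards against rounding.
    m_sq = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    m_jj = math.sqrt(max(m_sq, 0.0))
    return delta_eta, eta_product, m_jj
```

Two jets in opposite hemispheres give a negative η product, and the cosh(∆η) term makes the invariant mass grow rapidly with the pseudo-rapidity gap, which is why both variables favor the VBF topology.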

The ℓ-η centrality variable quantifies the η position of the lepton relative to the two jets in pseudo-rapidity, and is expressed by:

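\[
\ell\text{-}\eta\ \mathrm{centrality} = \exp\left[ \frac{-4}{(\eta_{\mathrm{jet1}} - \eta_{\mathrm{jet2}})^2} \left( \eta_\ell - \frac{\eta_{\mathrm{jet1}} + \eta_{\mathrm{jet2}}}{2} \right)^2 \right]. \qquad (4.19)
\]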

It takes a value of 1 when the lepton lies exactly midway between the two jets, and a value less than 1/e when the lepton lies outside the two jets. The p_T^total is the vector sum of the transverse momenta of the objects in the VBF signal process: the lepton, the τ_had, the two jets and the E_T^miss. This variable represents the activity in the event in addition to the VBF objects.
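The two quantities can be sketched in Python as follows. The function names are hypothetical. For the centrality, the normalization is assumed to be the one for which the variable equals 1/e when the lepton is aligned with one of the jets (a factor 4 in the exponent), consistent with the limiting values quoted above. For p_T^total, the magnitude of the vector sum is returned, with each object given as (p_T, ϕ).

```python
import math


def lepton_eta_centrality(eta_lep, eta_jet1, eta_jet2):
    """l-eta centrality: 1 at the midpoint of the two jets,
    1/e when aligned with a jet (assumed normalization)."""
    gap = eta_jet1 - eta_jet2
    if gap == 0.0:
        raise ValueError("the two jets have the same pseudo-rapidity")
    offset = eta_lep - 0.5 * (eta_jet1 + eta_jet2)
    return math.exp(-4.0 * offset ** 2 / gap ** 2)


def pt_total(objects):
    """Magnitude of the vector sum of transverse momenta.

    objects: iterable of (pt [GeV], phi) for the lepton, the tau_had,
    the two jets and the missing transverse energy.
    """
    objects = list(objects)
    px = sum(pt * math.cos(phi) for pt, phi in objects)
    py = sum(pt * math.sin(phi) for pt, phi in objects)
    return math.hypot(px, py)
```

In an event with no additional activity the listed objects balance in the transverse plane, so `pt_total` is small; extra radiation that is not included in the sum shifts it to larger values.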

[Figure 4.21: two panels showing the fraction of events versus (a) p_T^ℓ / p_T^{τ_had} and (b) ∑p_T [GeV] in the Boosted category; √s = 8 TeV, 20.3 fb^-1.]

Fig. 4.21: Input variable distributions used in the Boosted category for the signal (red), the Z → ττ background (blue) and the fake τ_had background (green). The signal represents the sum of all signal processes.

For the Boosted category, two additional variables that characterize the boosted topology are used. The ∑p_T represents the total activity in an event, and the signal has much higher ∑p_T values than the backgrounds. The p_T^ℓ / p_T^{τ_had} ratio is especially useful to discriminate against the fake τ_had background. It exploits the asymmetry between p_T^ℓ and p_T^{τ_had} that arises from the different number of neutrinos in the leptonic and hadronic τ decays. As a result, the signal has lower p_T^ℓ / p_T^{τ_had} values than the fake τ_had background. Figure 4.21 shows the ∑p_T and the p_T^ℓ / p_T^{τ_had} distributions.

The BDT with GradientBoost has several parameters that must be optimized to maximize its performance: MaxDepth, MinNodeSize, Ntrees and Shrinkage, as described in Section 4.7.1. MaxDepth and MinNodeSize control how far each decision tree grows, while Ntrees and Shrinkage determine the boosting algorithm. A two-dimensional scan of MaxDepth and Ntrees is performed to maximize the significance of the BDT output, and the remaining parameters are then optimized by scanning them one at a time. This optimization procedure is performed separately for the VBF and the Boosted categories. The parameter values are summarized in Table 4.10.
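The two-stage scan can be sketched as below. This is a schematic illustration rather than the procedure actually run with TMVA: `evaluate` is a hypothetical callback that would train a BDT with the given parameters and return the significance of its output, the order in which the remaining parameters are scanned is an assumption, and the grids in any usage are placeholders, not the values of Table 4.10.

```python
from itertools import product


def scan_parameters(evaluate, max_depths, n_trees, min_node_sizes, shrinkages):
    """Two-dimensional scan of (MaxDepth, Ntrees), followed by
    one-at-a-time scans of MinNodeSize and Shrinkage.

    evaluate(max_depth, n_trees, min_node_size, shrinkage) is a
    hypothetical callback returning the significance to maximize.
    """
    # Start the remaining parameters at the first value of each grid.
    node, shrink = min_node_sizes[0], shrinkages[0]
    # Step 1: joint scan of MaxDepth and Ntrees.
    depth, trees = max(
        product(max_depths, n_trees),
        key=lambda p: evaluate(p[0], p[1], node, shrink),
    )
    # Step 2: scan the remaining parameters one at a time.
    node = max(min_node_sizes, key=lambda v: evaluate(depth, trees, v, shrink))
    shrink = max(shrinkages, key=lambda v: evaluate(depth, trees, node, v))
    return {
        "MaxDepth": depth,
        "Ntrees": trees,
        "MinNodeSize": node,
        "Shrinkage": shrink,
    }
```

A staged scan of this kind needs far fewer trainings than a full four-dimensional grid, at the price of possibly missing an optimum that depends on correlations between the parameters scanned in different stages.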
