Support Vector Machines in
Analysis of Top Quark Production
Anthony Vaiciulis
University of Rochester
Advanced Computing and Analysis Techniques
June 24, 2002, Moscow

- introduction
- SVM method
- separable linear problem (hyperplane classifiers)
- separable nonlinear problem (feature space and kernels)
- nonseparable nonlinear problem
- top quark study

Introduction
A common HEP problem is classification:
- signal vs. background event
- particle identification
There is no explicit method to determine the correct output from the input data.
Devise an algorithm that exploits complex patterns in input/output examples (labeled training data) to learn the solution to the problem
→ "supervised learning"
Map each training input/output pair onto:
- two categories → "binary classification"
- more than two categories → "multi-class classification"
- continuous, real-valued output → "regression"

Introduction
possible goal: modify the algorithm until all training data are classified correctly
potential problems:
- noisy training data
  → no correct underlying classification function?
  → similar training examples may be in different categories
- misclassification of unseen data
  → "overfitting"

Introduction
better goal: optimize "generalization", the ability to correctly classify unseen data
Keep it simple: extra complexity must cause significant improvement in performance.
potential problems:
- small training set → overfitting (when using a complex mapping function)
- large number of tunable parameters
- mistakes due to local minima
Support Vector Machines (SVMs) address these problems.

History
history
- early 1960s: support vector method developed to construct separating hyperplanes for pattern recognition problems (Vapnik and Lerner 1963; Vapnik and Chervonenkis 1964)
- early 1990s: generalized for constructing non-linear separating functions (but linear in feature space) (Boser et al. 1992; Cortes and Vapnik 1995)
- 1995: generalized for estimating real-valued functions (regression) (Vapnik 1995)
- development continues today
current events
- special SVM issue of the Neural Computation journal, 2002
- NATO Advanced Study Institute on Learning Theory and Practice, July 2002
- International Workshop on Practical Application of Support Vector Machines in Pattern Recognition, August 2002
- International Conference on Neural Information Processing, Special Session on Support Vector Machines, November 2002

SVM Overview
SVM quick overview
- map training data into a high-dimensional feature space
- determine the decision boundary in feature space by constructing the optimal separating hyperplane
- avoid computations in feature space by using a kernel function
- use concepts from statistical learning theory to describe which factors have to be controlled for good generalization
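A minimal sketch of this workflow, assuming scikit-learn's SVC (a wrapper around LIBSVM, which appears later in this talk); the toy data and parameter values are illustrative only:

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative toy data: two overlapping classes in two input variables.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),    # "background"
               rng.normal(1.5, 1.0, size=(100, 2))])   # "signal"
y = np.hstack([-np.ones(100), np.ones(100)])

# The kernel implicitly maps the data into a feature space; the optimal
# separating hyperplane is constructed there, but all computations use
# the kernel function on the input vectors.
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

print(clf.predict([[1.0, 1.0]]))   # classify an unseen example
```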

Classes of Problems
[Diagram: three classes of problems. A separable linear problem; a separable nonlinear problem, handled via a nonlinear map to a feature space F; and a nonseparable nonlinear problem, also treated via a nonlinear map to F.]

Separable Linear
Formal goal: estimate a function f : ℝ^N → {±1} using input/output training data
(x_1, y_1), ..., (x_ℓ, y_ℓ) ∈ ℝ^N × {±1}
such that f will correctly classify unseen examples (x, y), i.e. f(x) = y.
ℓ = number of training examples.
For good generalization, restrict the class of functions from which f is chosen. Minimization of the training error,
(1/ℓ) Σ_i |f(x_i) − y_i|,
does not imply good generalization.
More precisely: restrict the class of functions to one with a "capacity" suitable for the amount of available training data.
"capacity" is the richness or flexibility of the function class
low capacity → good generalization,
regardless of the dimensionality of the space (assuming the function describes the data well)

Capacity
a measure of capacity: the "VC dimension" = the largest number of points, d, that can be separated in all 2^d possible classifications for a given set of functions
Example: the set of oriented lines in two dimensions can separate three points in all 2^3 possible ways, but not four points.
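This example can be checked by brute force: for every ±1 labeling of the points, fit a (nearly) hard-margin linear classifier and test whether it reproduces the labels exactly. The sketch below assumes scikit-learn's SVC as the linear classifier; the point coordinates are arbitrary.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def shatterable(points):
    """Can oriented lines (linear classifiers) realize every +/-1 labeling?"""
    for labels in itertools.product([-1, 1], repeat=len(points)):
        if len(set(labels)) < 2:
            continue  # a one-class labeling is trivially separable
        y = np.array(labels)
        clf = SVC(kernel="linear", C=1e6).fit(points, y)  # large C ~ hard margin
        if (clf.predict(points) != y).any():
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # not collinear
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shatterable(three))  # True: all 2^3 labelings achievable
print(shatterable(four))   # False: e.g. the XOR labeling is not separable
```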

Separable Linear
Support Vector classifiers are based on the class of hyperplanes
(w · x) + b = 0
with w ∈ ℝ^N, b ∈ ℝ, corresponding to the decision function
f(x) = sign[(w · x) + b]
w is the "weight vector", b is the "threshold"
w and b control the function and must be learned from the data.
Note: here we consider the linear case.
Find the (unique) optimal hyperplane:
- the hyperplane with maximal margin of separation between the two classes (by definition)
- has the lowest capacity of all hyperplanes
- minimizes the risk of overfitting
(different from the "intuitive" way of decreasing capacity by reducing degrees of freedom)

Separable Linear
geometric interpretation: the hyperplane splits the input space into two parts, each corresponding to a different class.
margin size = 2/||w||
To find the optimal hyperplane, minimize ||w||² subject to the constraints (for i = 1, ..., ℓ)
y_i((w · x_i) + b) ≥ 1
[Figure: the separating hyperplane (w · x) + b = 0 in the (x_1, x_2) plane, with the margin bounded by the hyperplanes (w · x) + b = +1 (y_i = +1 side) and (w · x) + b = −1 (y_i = −1 side); w is normal to the hyperplane.]
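This margin maximization is a small quadratic program and can be solved directly. A sketch using cvxpy (a choice made here, not a tool from the talk) on illustrative toy points:

```python
import numpy as np
import cvxpy as cp

# Toy separable data in two dimensions (illustrative values only).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],     # y = +1
              [0.0, 0.5], [0.5, 0.0], [-0.5, 0.5]])   # y = -1
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
# Minimize ||w||^2 subject to y_i((w . x_i) + b) >= 1 for all i.
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w.value))
```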

Separable Linear
The capacity of the hyperplane decreases as the margin size increases.

Separable Linear
The x_i for which equality holds are the "support vectors":
- they carry all information about the problem
- their removal would change the solution
- they lie on a hyperplane defining the margin
To solve the constrained quadratic optimization problem, first reformulate it in terms of a Lagrangian,
L(w, b, α) = (1/2)||w||² − Σ_i α_i [y_i((x_i · w) + b) − 1]
The solution has an expansion in terms of a subset of the input vectors (those with α_i ≠ 0), called support vectors:
w = Σ_i α_i y_i x_i
Problem in dual form: find the α_i which maximize
L = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
subject to the constraints α_i ≥ 0 and Σ_i α_i y_i = 0.
The decision function is
f(x) = sign[Σ_i y_i α_i (x · x_i) + b]
Input vectors enter as dot products only.
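The decision function can be rebuilt from a fitted model's dual coefficients as a check on these formulas. The sketch assumes scikit-learn's SVC, whose dual_coef_ attribute stores y_i α_i for the support vectors; the toy data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.5], [0.5, 0.0], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

# Rebuild f(x) = sign[sum_i y_i alpha_i (x . x_i) + b] from the fitted model.
sv = clf.support_vectors_
coef = clf.dual_coef_.ravel()        # = y_i * alpha_i for each support vector
b = clf.intercept_[0]

x_new = np.array([1.5, 1.5])
f = np.sign(coef @ (sv @ x_new) + b)
print(f, clf.predict([x_new])[0])    # the two classifications agree
```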

Separable Nonlinear
Possible methods if f(x) is a nonlinear function of x:
- neural network: use a network of simple linear classifiers
  problems: many parameters, local minima, ...
- SVM: map the input data into a high-dimensional feature space, F, via a nonlinear map Φ : ℝ^N → F. Use the optimal hyperplane algorithm in F.
high dimensionality → computational problem in feature space
Since input vectors appear only as dot products, we only need dot products in feature space. Find a kernel function, K, such that
K(x_1, x_2) = Φ(x_1) · Φ(x_2)
Then we don't need to know Φ explicitly.
Examples of kernel functions:
- K(x, y) = (x · y)^d (polynomial of degree d)
- K(x, y) = exp(−||x − y||² / (2σ²)) (Gaussian radial basis function)
- K(x, y) = tanh(κ(x · y) + θ) (sigmoid)
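Written out as code, the three example kernels are straightforward; the parameter defaults below are arbitrary assumptions.

```python
import numpy as np

def poly_kernel(x, y, d=3):
    """K(x, y) = (x . y)^d"""
    return np.dot(x, y) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=0.01, theta=0.0):
    """K(x, y) = tanh(kappa (x . y) + theta)"""
    return np.tanh(kappa * np.dot(x, y) + theta)

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])
print(poly_kernel(a, b), gaussian_kernel(a, b), sigmoid_kernel(a, b))
```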

Separable Nonlinear
- substitute Φ(x_i) for each training example x_i
- substitute the kernel for dot products of Φ
decision function:
f(x) = sign[Σ_i y_i α_i K(x, x_i) + b]
optimization problem: maximize
L = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
subject to the constraints α_i ≥ 0 and Σ_i α_i y_i = 0
The form of the optimization problem guarantees a global minimum.
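As a check, the kernelized decision function can be reconstructed from a fitted model. The sketch assumes scikit-learn's SVC with its Gaussian (RBF) kernel, exp(−γ||x − x_i||²), and illustrative toy data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

gamma = 0.5                                   # gamma = 1 / (2 sigma^2)
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

# Rebuild f(x) = sign[sum_i y_i alpha_i K(x, x_i) + b] with the Gaussian kernel.
x_new = np.array([1.0, 1.0])
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f = clf.dual_coef_.ravel() @ K + clf.intercept_[0]
print(f, clf.decision_function([x_new])[0])   # the two values agree
```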

Nonseparable Nonlinear
Real-world applications → no linear separation in feature space. The optimization problem will not converge.
Introduce slack variables (ξ_i ≥ 0) to allow points to violate the constraints. (There is no training error in the separable case.)
Redo the Lagrangian formulation, adding a term
C Σ_i ξ_i, where the parameter C controls the error penalty
(large C → larger penalty)
→ same dual optimization problem and constraints as for the separable case, except
α_i ≥ 0 → 0 ≤ α_i ≤ C
We now have nonlinear mappings for nonseparable data.
[Figure: a nonseparable problem in the input variables (x_1, x_2) mapped to feature-space coordinates (z_1, z_2), where a margin separates the y_i = +1 and y_i = −1 classes.]
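The effect of the error penalty C can be seen on a small nonseparable toy sample (illustrative only, using scikit-learn's SVC): a larger C makes margin violations more expensive, so the boundary follows the training data more closely.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping (nonseparable) Gaussian classes.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(1.5, 1.0, size=(200, 2))])
y = np.hstack([-np.ones(200), np.ones(200)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    print(f"C = {C:6}: {clf.n_support_.sum():3d} support vectors, "
          f"training accuracy = {clf.score(X, y):.3f}")
```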

Top Quark Study
A possible application of SVMs in HEP:
improve the signal/background discrimination in the tt̄ dilepton channel
Small cross section: σ(pp̄ → tt̄) = 5.0 pb (theory)
[Diagram: the tt̄ decay chain, t → W⁺b and t̄ → W⁻b̄, with W⁺ → l⁺ν_l or qq̄ and W⁻ → l⁻ν̄_l or qq̄.]
Top channels are distinguished by the W decay modes.
One possible event topology:
tt̄ → bb̄ l⁺ν_l l⁻ν̄_l   5% (4/81)   "dilepton", l = e, μ
Signature: 2 high-p_T isolated leptons, 2 b jets, missing E_T
Goal: increase the signal efficiency in the eμ dilepton channel (low background)
Backgrounds include WW production and Z → τ⁺τ⁻.
Consider only WW in this study.

Top Quark Study
Four input variables (based on conventional cuts)
[Plots: normalized distributions of the four input variables, jet 1 E_T, maximum lepton E_T, sum of jet E_T, and missing E_T (GeV), for top signal and WW background.]

Top Quark Study
SVM performance ≈ performance of best cuts
[Plots: SVM output distribution for top and WW, and signal efficiency vs. background efficiency curves comparing the SVM with conventional cuts.]
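Efficiency curves like these can be produced by scanning a threshold on the SVM output. A sketch, with random stand-ins for the four kinematic input variables (not the actual analysis samples) and scikit-learn's SVC:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Stand-ins for (jet 1 E_T, max lepton E_T, sum jet E_T, missing E_T).
top = rng.normal(1.0, 1.0, size=(500, 4))   # "signal"
ww = rng.normal(0.0, 1.0, size=(500, 4))    # "background"
X = np.vstack([top, ww])
y = np.hstack([np.ones(500), -np.ones(500)])

clf = SVC(kernel="rbf").fit(X, y)
score = clf.decision_function(X)            # continuous SVM output
# (In practice the efficiencies would be evaluated on an independent sample.)

for cut in np.percentile(score, [10, 30, 50, 70, 90]):
    sig_eff = np.mean(score[y == 1] > cut)
    bkg_eff = np.mean(score[y == -1] > cut)
    print(f"cut = {cut:6.2f}: signal eff. = {sig_eff:.2f}, "
          f"background eff. = {bkg_eff:.2f}")
```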

Top Quark Study
performance of the Gaussian kernel ≈ performance of the sigmoid kernel
[Plots: SVM output distributions for the Gaussian and sigmoid kernels (top vs. WW), and the corresponding signal efficiency vs. background efficiency curves.]

Top Quark Study
Little performance difference vs. the number of training examples
[Plots: SVM output distributions (top vs. WW) for 2000, 1000, and 200 training events, and the corresponding signal efficiency vs. background efficiency curves.]

Training Time
- for a high-statistics sample, use a toy Monte Carlo
  → two variables (one Gaussian, the other exponential)
- training done with a Gaussian kernel
- 1000 MHz Pentium III PC (Linux)
- LIBSVM uses a modified Sequential Minimal Optimization (SMO) algorithm for training

number of events   toy MC time (sec)   ttbar MC time (sec)
           200            < 1                 < 1
           400            < 1                 < 1
          1000            < 1                   1
          2000            1.5                   2
          3000              4                   6
          4000              8
          8000             38
         16000            149

For the top quark study, the training time was negligible.
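A sketch of how such a timing scan might look today, assuming scikit-learn's SVC (which also trains with LIBSVM's SMO-style algorithm); the toy MC follows the slide's description (one Gaussian and one exponential variable), but the exact shapes and parameters are assumptions:

```python
import time
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)

def toy_mc(n_per_class):
    """One Gaussian and one exponential variable, signal shifted vs. background."""
    sig = np.column_stack([rng.normal(1.0, 1.0, n_per_class),
                           rng.exponential(2.0, n_per_class)])
    bkg = np.column_stack([rng.normal(0.0, 1.0, n_per_class),
                           rng.exponential(1.0, n_per_class)])
    X = np.vstack([sig, bkg])
    y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
    return X, y

for n_events in (200, 1000, 2000, 4000):
    X, y = toy_mc(n_events // 2)          # n_events training examples in total
    t0 = time.perf_counter()
    SVC(kernel="rbf").fit(X, y)            # Gaussian kernel
    print(f"{n_events:6d} events: {time.perf_counter() - t0:.2f} s")
```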

References
- An Introduction to Support Vector Machines. Nello Cristianini and John Shawe-Taylor. Cambridge University Press, 2000.
- Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- http://www.kernel-machines.org/
- A Tutorial on Support Vector Machines for Pattern Recognition. Christopher J.C. Burges.
- Advances in Kernel Methods, edited by Bernhard Scholkopf, Christopher J.C. Burges, and Alexander J. Smola. MIT Press, 1999.

Summary
- a new way to tackle complicated problems in HEP and other fields
- based on statistical learning theory
  → optimize generalization
  → intuitive
- all computations are performed directly in the input space (via kernels)
- corresponds to a linear method in the feature space
  → theoretically easy to analyze
  → allows us to generalize some existing linear methods to non-linear methods
- only one user-chosen parameter, the error penalty (aside from kernel parameters)
- well-defined quadratic optimization problem → global minimum
- SVM development can build on existing work on other kernel-based methods
→ another method for our multivariate analysis toolbox