Automatic classification of variables in catalogs with missing data
Pavlos Protopapas
Institute for Applied Computational Science, Harvard-Smithsonian Center for Astrophysics

January 29, 2013

Outline

Problem
Bayesian Networks
Inferences using Bayesian Networks
Impute data
Classification
Experiments

Problem

Classifying objects based on their features (color and magnitude) dates back to the 19th century.
Automatic classification has recently become sophisticated, and is necessary due to the exponential growth of astronomical data.

In time-domain astronomy, classification of light-curves:
Use features of the light-curves. Apply sophisticated machine learning to classify objects in a multidimensional feature space, provided there are enough examples to learn from (training). Almost a decade after the first appearance of automatic classification methods, many of them have produced, and continue to produce, high-fidelity catalogs.
Kim et al. 2011, 2012; Bloom 2011; Richards 2011; Debosscher 2007; Wachman 2009; Wang 2010
Problem

It is best to use as many available catalogs as possible (X-ray, u-band, time-series catalogs). Catalogs are taken with different instruments, bandwidths, locations, times, etc. The intersection of these catalogs is smaller than any single catalog, so the resulting multi-catalog contains missing values. Traditional classification methods cannot deal with such catalogs.

Introduction
Fill missing data using Monte Carlo approaches: each missing value is drawn from a distribution determined from all objects in the training set. This approach totally ignores the relationships among the variables.
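As a minimal sketch of this naive per-variable fill-in (assumed details: NumPy arrays with NaN for missing entries; not the talk's code):

```python
import numpy as np

def impute_from_marginals(X, rng):
    """Fill NaNs in each column by sampling from that column's observed values.

    Every missing entry is drawn from the per-variable marginal, ignoring
    dependencies between variables -- exactly the weakness the Bayesian
    network approach below addresses.
    """
    X = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        X[missing, j] = rng.choice(X[~missing, j], size=missing.sum())
    return X

# toy catalog with ~20% missing entries
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.2] = np.nan
X_filled = impute_from_marginals(X, rng)
```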


Bayesian Networks (BN)

A Bayesian network (BN) is a probabilistic graphical model that represents dependency relationships among a set of variables. One of the main advantages of the BN factorization is that each of the factors involves a smaller number of variables, which makes it easier to estimate.

Bayesian Networks (BN)
Let $S = \{x_1, \ldots, x_n\}$ be a set of data instances, each one described with a set of $D$ variables $\{F_1, \ldots, F_D\}$. Each instance $x_i$ is represented as a vector $x_i = \{F_1^i, \ldots, F_D^i\}$. BNs can represent the joint probability distribution $P(F_1, \ldots, F_D)$ of dataset $S$ as a product of factors, where each factor is a conditional probability distribution of each node given its parents in the BN:

$$P(S) = \prod_{i=1}^{n} P(x_i) = \prod_{i=1}^{n} P(F_1^i, \ldots, F_D^i) = \prod_{i=1}^{n} \prod_{j=1}^{D} P\!\left(F_j^i \mid \mathrm{Pa}_{BN}^{i}(F_j)\right)$$

where $\mathrm{Pa}_{BN}(F_j)$ represents the set of parents of variable $F_j$ in the BN, and $\mathrm{Pa}_{BN}^{i}(F_j)$ indicates that the parents of feature $F_j$ are instantiated at the values of $x_i$.

For a five-variable example, the joint probability distribution can be factorized according to the BN as:

$$P(F_1, \ldots, F_5) = P(F_1 \mid F_4)\, P(F_2)\, P(F_3 \mid F_5)\, P(F_4)\, P(F_5 \mid F_2, F_4)$$
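A toy check of this factorization (hypothetical binary conditional probability tables; not from the talk):

```python
import numpy as np

# Hypothetical binary CPTs for the five-node example above; indices mirror
# the conditioning, e.g. pF5[f2, f4, f5] = P(F5 = f5 | F2 = f2, F4 = f4).
rng = np.random.default_rng(1)
norm = lambda a, ax: a / a.sum(axis=ax, keepdims=True)
pF2 = norm(rng.random(2), 0)            # P(F2)
pF4 = norm(rng.random(2), 0)            # P(F4)
pF1 = norm(rng.random((2, 2)), 1)       # P(F1 | F4)
pF3 = norm(rng.random((2, 2)), 1)       # P(F3 | F5)
pF5 = norm(rng.random((2, 2, 2)), 2)    # P(F5 | F2, F4)

def joint(f1, f2, f3, f4, f5):
    """P(F1..F5) as the product of the five BN factors."""
    return pF1[f4, f1] * pF2[f2] * pF3[f5, f3] * pF4[f4] * pF5[f2, f4, f5]

# Sanity check: the factorized joint sums to one over all 2^5 assignments.
assert abs(sum(joint(*b) for b in np.ndindex(2, 2, 2, 2, 2)) - 1.0) < 1e-12
```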
Inferences

Suppose we have an object with missing values for $F_5$ and $F_2$. To estimate the most probable value for variable $F_5$ given the observed values for $F_1$, $F_3$, $F_4$, we can calculate $P(F_5 \mid F_1, F_3, F_4)$ as follows.

Factorizing

$$P(F_5 \mid F_1, F_3, F_4) = \frac{P(F_1, F_3, F_4, F_5)}{P(F_1, F_3, F_4)} = \frac{\sum_{F_2} P(F_1, F_2, F_3, F_4, F_5)}{\sum_{F_2, F_5} P(F_1, F_2, F_3, F_4, F_5)}$$

$$= \frac{\sum_{F_2} P(F_1 \mid F_4)\, P(F_2)\, P(F_3 \mid F_5)\, P(F_4)\, P(F_5 \mid F_2, F_4)}{\sum_{F_2, F_5} P(F_1 \mid F_4)\, P(F_2)\, P(F_3 \mid F_5)\, P(F_4)\, P(F_5 \mid F_2, F_4)}$$

$$= \frac{P(F_4)\, P(F_1 \mid F_4)\, P(F_3 \mid F_5) \sum_{F_2} P(F_2)\, P(F_5 \mid F_2, F_4)}{P(F_4)\, P(F_1 \mid F_4) \sum_{F_2, F_5} P(F_3 \mid F_5)\, P(F_2)\, P(F_5 \mid F_2, F_4)}$$

Pushing in

"Pushing in" the factors in the sums is known as variable elimination (Pearl et el 94) The simplest exact inference algorithm.

Junction Tree

A junction tree is an undirected graph where each node corresponds to a set of variables in the original BN that forms a family (a node with its parents).
The moral graph is such that two nodes are connected if either they are directly connected in the BN or they have a common child in the BN. Identify all the cliques (sets of nodes where every pair inside the set is connected by an edge) in the moral graph. The junction tree is a new data structure with one node per clique in the moral graph; two nodes are connected if they share at least one common feature.
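A small sketch of moralization and clique-finding on the five-node example (assumes networkx is available; not the talk's code):

```python
import networkx as nx

# The five-node example BN; edges point from parent to child.
bn = nx.DiGraph([("F4", "F1"), ("F2", "F5"), ("F4", "F5"), ("F5", "F3")])

# Moral graph: drop edge directions, then "marry" parents sharing a child.
moral = nx.Graph(bn.edges())
for child in bn.nodes:
    parents = list(bn.predecessors(child))
    moral.add_edges_from((p, q) for i, p in enumerate(parents)
                                for q in parents[i + 1:])

# The maximal cliques of the moral graph become the junction-tree nodes.
print(list(nx.find_cliques(moral)))
# e.g. [['F1', 'F4'], ['F2', 'F4', 'F5'], ['F3', 'F5']]
```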
Junction Tree

Nodes send to other nodes the value of their own probability, summing out the variables that are not included in the destination node:

$\{F_2, F_4, F_5\}$ sends $\sum_{F_2, F_5} P(F_5 \mid F_2, F_4)$ to $\{F_4, F_1\}$
$\{F_2, F_4, F_5\}$ sends $\sum_{F_2, F_4} P(F_5 \mid F_2, F_4)$ to $\{F_5, F_3\}$
$\{F_5, F_3\}$ sends $\sum_{F_3} P(F_3 \mid F_5)$ to $\{F_2, F_4, F_5\}$
$\{F_4, F_1\}$ sends $\sum_{F_1} P(F_1 \mid F_4)$ to $\{F_2, F_4, F_5\}$

This process is known as "junction tree calibration" (Cowell 2002).
Gaussian Nodes inference

For continuous data: Gaussian nodes inference (Shachter 1989). Variables not involved in the calculation of a probability can be eliminated from the Bayesian network (barren nodes), e.g. leaf nodes. In general it is not so simple: arc reversal is needed.
Learning Bayesian Networks with complete data

Learning the network involves learning the structure (edges) and the parameters (probability distributions on each of the factors).

Structure Learning with complete data
The number of possible network structures grows exponentially, so we use a greedy search strategy. Starting with an initial random order of variables (nodes), we add parents to the nodes incrementally and keep the parent(s) that generate the network with the highest score. The score of a network structure is related to how well the structure fits the data: the probability of the structure given the data, which corresponds to applying the same factorization imposed by the structure and using multinomial distributions over each factor. We estimate each probability by first discretizing the possible values that each feature $F_i$ can take and then creating a multi-dimensional histogram.
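One plausible reading of such a per-family score as code (a hypothetical stand-in: uniform binning, no penalty terms a full Bayesian score would carry):

```python
import numpy as np

def family_score(data, child, parents, bins=5):
    """Multinomial log-likelihood of P(child | parents) after discretization.

    data: (n, D) float array; child and parents are column indices.
    """
    counts, _ = np.histogramdd(data[:, [child] + list(parents)], bins=bins)
    parent_counts = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        cond = counts / parent_counts                 # P(child | parents)
        terms = np.where(counts > 0, counts * np.log(cond), 0.0)
    return terms.sum()

# One greedy step: try each candidate parent for column 0, keep the best.
rng = np.random.default_rng(3)
data = rng.normal(size=(1000, 4))
best = max(range(1, 4), key=lambda p: family_score(data, 0, [p]))
print(best)
```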

Parameter Learning with complete data
Learning the parameters of a BN means learning the distribution of each of the factors given by the structure of the BN. To learn the parameters, it is necessary first to know the structure.
Parameter estimation: use Gaussians to model these probability distributions. We model $F_j$ as a linear Gaussian of its parents:

$$P(F_j) = N\!\left(\beta_0 + \beta^T \mu;\ \sigma^2 + \beta^T \Sigma\, \beta\right) \qquad (1)$$

where the set of parents $\mathrm{Pa}_{BN}(F_j)$ are jointly Gaussian $N(\mu; \Sigma)$. Note that $\mu$ and $\beta$ are $k$-dimensional vectors, and the matrix $\Sigma$ is $k \times k$.

To learn a Gaussian node, we first learn (i) the Gaussian distribution of each of the parent nodes and then (ii) the set of parameters $\{\beta_0, \ldots, \beta_k; \sigma\}$ of the linear combination.

Learning Gaussians for each parent node

We model each parent node as a mixture of Gaussians. To estimate their distribution we use the Expectation Maximization (EM) algorithm. EM optimizes the likelihood function of a model that depends on latent or unobserved variables.
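For instance, with scikit-learn's EM-based mixture fitter (synthetic data; the component count here is a guess):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a univariate Gaussian mixture to one node's observed values via EM.
rng = np.random.default_rng(2)
values = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.0, 1.0, 700)])

gm = GaussianMixture(n_components=2, random_state=0).fit(values.reshape(-1, 1))
print(gm.weights_, gm.means_.ravel(), gm.covariances_.ravel())

# Sampling from the fitted mixture is what the imputation below relies on.
draws, _ = gm.sample(5)
```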

Learning the linear combination of parents
Log-likelihood. Having determined the distribution of each parent node, we can now estimate the distribution of each Gaussian node $F_j$ given its parents. Let $\{U_1, \ldots, U_k\}$ be the parent nodes with respective means $\{\mu_1, \ldots, \mu_k\}$; then $P(F_j \mid \mathrm{Pa}_{BN}(F_j)) = N(\beta_0 + \beta_1 \mu_1 + \cdots + \beta_k \mu_k;\ \sigma^2)$. Our task is to learn the set of parameters $\theta_{F_j} = \{\beta_0, \ldots, \beta_k; \sigma\}$. To learn those parameters we optimize the log-likelihood, expressed as:

$$L_{F_j}(\theta_{F_j} \mid D) = \sum_{i=1}^{n} \left[ -\frac{1}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \left(\beta_0 + \beta_1 \mu_1 + \cdots + \beta_k \mu_k - x_i\right)^2 \right]$$

Setting the derivative with respect to $\beta_1, \ldots, \beta_k$ to zero, we have the following $k$ equations:

$$E[F_j \cdot U_1] = \beta_0 E[U_1] + \beta_1 E[U_1 \cdot U_1] + \cdots + \beta_k E[U_k \cdot U_1]$$
$$\vdots$$
$$E[F_j \cdot U_k] = \beta_0 E[U_k] + \beta_1 E[U_1 \cdot U_k] + \cdots + \beta_k E[U_k \cdot U_k]$$

and

$$\sigma^2 = \mathrm{cov}[F_j, F_j] - \sum_{p=1}^{k} \sum_{q=1}^{k} \beta_p \beta_q \,\mathrm{cov}[U_p, U_q]$$
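A compact sketch of solving these moment equations (hypothetical helper; equivalent to ordinary least squares):

```python
import numpy as np

def fit_linear_gaussian(U, f):
    """Solve the k+1 normal equations above from sample moments.

    U: (n, k) parent values; f: (n,) child values. Returns the betas and
    the residual variance sigma^2.
    """
    n, k = U.shape
    A = np.empty((k + 1, k + 1))
    b = np.empty(k + 1)
    A[0, 0], A[0, 1:] = 1.0, U.mean(axis=0)              # E[1], E[U_q]
    A[1:, 0], A[1:, 1:] = U.mean(axis=0), U.T @ U / n    # E[U_q], E[U_p U_q]
    b[0], b[1:] = f.mean(), U.T @ f / n                  # E[F_j], E[F_j U_q]
    beta = np.linalg.solve(A, b)
    sigma2 = np.var(f - (beta[0] + U @ beta[1:]))
    return beta, sigma2

rng = np.random.default_rng(4)
U = rng.normal(size=(500, 2))
f = 1.0 + 2.0 * U[:, 0] - 0.5 * U[:, 1] + rng.normal(0, 0.3, 500)
print(fit_linear_gaussian(U, f))   # betas near (1, 2, -0.5), sigma^2 near 0.09
```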

Learning Bayesian Networks with missing data

How can we learn both the structure and the parameters under incomplete data? To learn the parameters with missing data we need to know the structure beforehand, and to learn the structure we need to guess the missing values. This is done with an iterative method.

Learning parameters with missing data
We assume that we already know the structure of the Bayesian network before we start learning the distribution parameters with missing data. We optimize the log-likelihood given the data.

Likelihood: each data point $x_i$ can be written as $x_i = \{x_i^o, x_i^m\}$ (observed and missing parts). Rewriting the log-likelihood we have:

$$l(\theta \mid x^o, x^m, z) = \sum_{i=1}^{n} \sum_{j=1}^{C} z_{ij} \Big[ -\tfrac{1}{2}\big(n \log 2\pi + \log \sigma_j^2\big) - \tfrac{1}{2}\, \sigma_j^{2,o\,-1} \big(x_i^o - \mu_j^o\big)^2 - \sigma_j^{2,om\,-1} \big(x_i^o - \mu_j^o\big)\big(x_i^m - \mu_j^m\big) - \tfrac{1}{2}\, \sigma_j^{2,m\,-1} \big(x_i^m - \mu_j^m\big)^2 \Big]$$

where $z_{ij}$ indicates membership of $x_i$ in mixture component $j$, and the superscripts $o$, $om$, $m$ denote the observed, observed-missing, and missing blocks of component $j$'s covariance.

Learning parameters with missing data

Note that in the E-step for the missing-data case we need to estimate three unknown terms: $z_{ij}$, $z_{ij} x_i^m$, and $z_{ij} (x_i^m)^2$. The E-step then computes:

$$E[x_i^m \mid z_{ij} = 1, x_i^o, \theta_k] = \mu_j^m + \sigma_j^{2,om}\, \sigma_j^{2,o\,-1} \big(x_i^o - \mu_j^o\big)$$
$$E[z_{ij} x_i^m \mid x_i^o, \theta] = P(z_{ij} \mid x_i^o, \theta)\, \hat{x}_{ij}^m$$
$$E[z_{ij} (x_i^m)^2 \mid x_i^o, \theta] = P(z_{ij} \mid x_i^o, \theta)\, \big(\hat{x}_{ij}^m\big)^2$$

where $\hat{x}_{ij}^m$ denotes the conditional expectation in the first line. The M-step for the missing case is the same as in the complete-data case; the main difference is just that the unknown values are replaced by the expectations in the equations above.
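The first expectation is plain Gaussian conditioning; as a sketch (hypothetical helper name):

```python
import numpy as np

def e_step_fill(x, mu, Sigma):
    """E[x_m | x_o] for one Gaussian component: mu_m + S_mo S_oo^{-1}(x_o - mu_o).

    x: 1-D sample with np.nan at missing positions; mu, Sigma are the
    component's mean and covariance (blocks selected by indexing).
    """
    m = np.isnan(x)
    o = ~m
    S_oo = Sigma[np.ix_(o, o)]
    S_mo = Sigma[np.ix_(m, o)]
    return mu[m] + S_mo @ np.linalg.solve(S_oo, x[o] - mu[o])

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
x = np.array([0.8, np.nan, np.nan])
print(e_step_fill(x, mu, Sigma))   # expected values of the two missing entries
```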

Learning the network structure with missing data
To learn the structure of a Bayesian network with missing data, we complete the missing values and then iterate to improve these values using the structure learned so far.
Algorithm:
  Learn for each variable in $\{F_1, \ldots, F_D\}$ a univariate Gaussian mixture $GM(i)$, $i \in [1 \ldots D]$.
  Create $M$ complete datasets $D_s^{(1)}$, $s \in [1 \ldots M]$, filling the missing values of each variable $F_i$ with values sampled from $GM(i)$.
  $t = 1$
  while convergence criterion is not achieved do
      for $s = 1$ to $M$ do
          From each complete dataset $D_s^{(t)}$, learn a BN, $B_s^{(t)}$.
      end for
      Create one Bayesian network structure $B^{(t)}$ as the union of all the BNs.
      Learn the parameters $\theta^{(t)}$ using the original incomplete data and the network structure $B^{(t)}$.
      Use the network $B^{(t)}, \theta^{(t)}$ to sample new values and create new completed datasets $D_s^{(t+1)}$.
      $t = t + 1$
  end while
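To make the loop concrete, here is a heavily simplified, runnable variant in which a single multivariate Gaussian stands in for the Bayesian network and conditional means stand in for sampling (an illustration under those assumptions, not the talk's method):

```python
import numpy as np

def iterative_impute(X, T=20):
    """Impute/relearn loop: fill, refit the model, re-complete, repeat."""
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = col_means[np.where(miss)[1]]          # initial univariate fill
    for _ in range(T):                              # while not converged
        mu = X.mean(axis=0)                         # "learn" the model
        Sigma = np.cov(X, rowvar=False)
        for i in np.where(miss.any(axis=1))[0]:     # re-complete the data
            m, o = miss[i], ~miss[i]
            S_oo = Sigma[np.ix_(o, o)]
            S_mo = Sigma[np.ix_(m, o)]
            X[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, X[i, o] - mu[o])
    return X
```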

The automatic classification model

So far we have inferred the missing values. Next: an automatic classifier using the newly completed training set. We use a Random Forest (RF) classifier:
A very efficient algorithm based on decision-tree models and bagging for classification problems. An ensemble method, appearing in the machine learning literature at the end of the nineties and used recently in astronomical journals.
Quinlan 1993, Breiman 2001, Pichara 2012

Random Forest

The process of training or building an RF given training data is as follows. Let $P$ be the number of trees in the forest and $F$ the number of features in each tree; both values are model parameters. Build $P$ sets of $n$ samples taken with replacement from the training set. Note that each of the $P$ bags has the same number of elements as the training set but fewer distinct examples, since the samples are taken with replacement (the training set also has $n$ samples). For each of the $P$ sets, train a decision tree using a random sample of $F$ features from the set of $q$ possible features.
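With scikit-learn this is a few lines (synthetic stand-in data; note that scikit-learn draws the feature subset per split rather than per tree, a common RF variant):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: in the talk, rows would be the completed multi-catalog
# features and y the variability classes (QSO, Be, Cepheid, ...).
X, y = make_classification(n_samples=1000, n_features=14, n_informative=6,
                           n_classes=4, random_state=0)

# P trees, each grown on a bootstrap sample of the training set (bagging).
rf = RandomForestClassifier(n_estimators=500,     # P
                            max_features="sqrt",  # size of the feature sample
                            bootstrap=True,
                            random_state=0).fit(X, y)
print(rf.score(X, y), rf.predict(X[:3]))
```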

Random Forest

RF creates many linear separators, attempting to separate elements of different classes using some features and some data points (the ones given by each of the bags).

Each of the decision trees casts one decision. The final decision is the most-voted class among the set of $P$ decision trees.

As the number of trees goes to infinity, the classification error of the RF becomes bounded and the classifier does not overfit the data.

Experimental Results

To test the advantage of the model, we ask the following questions:
Is it possible to learn an automatic classification model that is able to deal with missing data?
Does the model outperform a model that uses only the subset of the training set with complete data?
Does the proposed model overcome the case where missing data are filled using models that treat each variable independently?

Experimental Results

As the base catalog we use the MACHO catalog and extract 14 features from each light-curve. We combine the MACHO catalog with other catalogs containing magnitudes at different wavelengths:
SAGE (Meixner et al. 2006), UBVI (Piatti 2011), 2MASS (Skrutskie 2006)

Data
Percentage of missing values in the 2MASS/SAGE catalogs:

Var    % missing
J         54%
H         54%
K         58%
m36        1%
m45       13%
m58       68%
m80       74%

Percentage of missing values in the UBVI catalog:

Var    % missing
U         49%
B          0%
V          0%
I         14%
Data

Number of objects per class in the SAGE-2MASS-UBVI training set:

Class           Training objects
Non-Variables   1136
QSO               45
Be star           76
Cepheid           70
RR Lyrae          69
EB               100
LPV              337

Results

                   Gaussian mixture                Our model
Class              Precision  Recall  F-Measure    Precision  Recall  F-Measure
Non-Variables      0.857      0.952   0.902        0.878      0.942   0.909
Quasar             0.9        0.8     0.847        0.878      0.956   0.915
Be                 0.679      0.5     0.576        0.724      0.553   0.627
Cepheid            0.805      0.886   0.844        0.785      0.886   0.832
RR-Lyrae           0.333      0.014   0.028        0.583      0.203   0.301
EB                 0.5        0.25    0.333        0.525      0.31    0.39
LPV                0.919      0.938   0.928        0.925      0.947   0.935
Weighted Average   0.821      0.851   0.826        0.846      0.863   0.848

Moreover, after training the model, we ran it on the whole MACHO catalog in order to generate a new quasar candidate list. To evaluate the quality of the new list, we calculate the matching level of our candidates against previously known lists.

                   Adding SAGE-2MASS-UBVI features   Only MACHO features
Quasar Precision   0.858                             0.857
Quasar Recall      0.956                             0.8
Quasar F-Score     0.896                             0.828

Conclusions

A new way of dealing with missing data, tested on real astronomical datasets:
Catalogs with missing data can be useful for automatic classification.
Integrating catalogs improves performance.
We improve the accuracy of our previous results on quasar detection.

The model considers probability dependencies between variables, which makes it possible to take advantage of the observed values and to increase the accuracy of the estimation as the number of observed values increases.
