Документ взят из кэша поисковой машины. Адрес оригинального документа : http://hea-www.harvard.edu/AstroStat/Stat310MM_VI_VII/hsl_20060907.pdf
Дата изменения: Mon Sep 11 03:35:52 2006
Дата индексирования: Tue Oct 2 05:48:19 2012
Кодировка:
A Convex Hull Peeling Depth Approach to Nonparametric Massive Multivariate Data Analysis with Applications
Hy u n s o o k L e e
. hlee@stat.psu.edu

Depar tment of Statistics The Pennsylvania State University

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 1


Outlines
Convex Hull Peeling (CHP) and Multivariate Data Analysis Definitions on CHP Data Depth (Ordering Multivariate Data) Quantiles and Density Estimation Color Magnitude (CM) Diagram and Sloan Digital Sky Survey Nonparametric Descriptive Statistics with CHP Multivariate Median Skewness and Kur tosis of a Multivariate Distribution Outlier Detection with CHP Level ; Shape Distor tion; Balloon Plot Concluding Remarks

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 2


Definitions
Convex Set

A set C Rd is convex if for every two points x, y C the whole segment xy is also contained in C.

Convex Hull

The convex hull of a set of points X in Rd is denoted by C H (X ), is the intersection of all convex sets in Rd containing X. In
algorithms, a convex hull indicates points of a shape invariant minimal subset of C H (X ) (ver tices, extreme points), connecting these points produces a wrap of C H (X ).
2 2 y -2 -1 0 x 1 2 -2 -2 -1 0 1

y

-2

-1

0

1

-1

0 x

1

2

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 3


Convex Hull Peeling

Before
2 2

After

1

0

y

-1

y -2 -1 0 x 1 2

-2

-2 -2

-1

0

1

-1

0 x

1

2

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 4


Convex Hull Peeling Depth (CHPD)
[CHPD:] For a point x Rd and the data set X = {X1 , ..., XN -1 }, let C1 = C H {x, X } and denote a set of its ver tices V1 . We can get Cj = Cj -1 \Vj -1 through CHP until x Vj (j = 2, ...). Then, CHPD(x) = is 0.
(
k i=1

Vi )

N

for k s.t. k = minj {j : x Vj } ; otherwise CHPD

Tukey (1974): Locating data center (median) by the Convex Hull Peeling Process. Barnett (1976): Ordering based on Depth pth quantiles are 1 - pth CHPDs. ^ ^ Hyper-polygons of 1 - pth depth obtainable from any dimensional data. ^ QHULL(Barber et. al., 1996) works for general dimensions (http://qhull.org). Why CHPD...

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 5


Challenges in Nonparametric Multivariate Analysis
How to Order Multivariate Data?

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 6


Challenges in Nonparametric Multivariate Analysis
How to Order Multivariate Data?

Ordering Multivariate Data Data Depth
Mahalanobis Depth : Mahalanobis (1936) Convex Hull Peeling Depth: Barnett (1976) Half Space Depth: Tukey (1975) Simplical Depth : Liu (1990) Oja Depth : Oja (1983) Majority Depth : Singh (1991) Ordering is not uniformly defined

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 6


Statistical Data Depth
(Zuo and Serfling, 2000)

(P1) (Affine invariance) D (Ax + b; FAX +b ) = D (x; FX ) for all X (A nonsingular matrix) holds for any random vector X in Rd , any d в d nonsingular matrix A, and any d-vector b; (P2) (Maximality at center) D ( ; F ) = supx F F having center ;
R
d

D (x; F ) holds for any

(P3) (Monotonicity) for any F F having deepest point , D (x; F ) D ( + (x - ); F ) holds for [0, 1]; and (P4) D (x; F ) 0 as ||x|| , for each F F .

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 7


Convex Hull Peeling Depth
affine invariance maximality at center monotonicity relative to deepest point vanishing at infinity CHPD has these proper ties and points of smallest depth are possible outliers

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 8


Quantile Estimation
Median: A point(s) left after peeling (will show robustness of this estimator later) pth Quantile: Level set whose central region contains 100p% data (will define the level set and the central region later) No Closed Form; Empirical Process

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 9


Empirical Density Estimation
Density Estimation with CHPD on Bivariate Normal Data 100000 Bivariate Normal Sample Quantiles={0.99,0.95,0.90,0.80,...0.20,0.10,0.05,0.01}
4

(McDermott, 2003)

2

density

0

y

-2

0.05

0.10

0.15

0.20

0.00

23 01 -1 -2 -3 0 x 1 2 3 4

-4

-4 -3 -2 -1

-4

-2

0
x

2

4

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 10

y


Lessons and Further Studies
Sample from a convex distribution (no doughnut shape) Works on Massive data - Sequential Method Without previous knowledge, no model or prior is known to star t an analysis. Exploratory data analysis for a large database Nonparametric and non-distance based approach Where CHP can be applied and how? - Multi-color diagram from astronomy, where a plethora of free data archives is available.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 11


Color Magnitude diagram
Two dimensional Color-Color diagram or Celebrated Her tzsprung-Russell diagram (swi

tch

)

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 12


Color Magnitude diagram
Two dimensional Color-Color diagram or Celebrated Her tzsprung-Russell diagram (swi

tch

)

What if we can see beyond 2 dimensions without bias (projection) Then, 3 or higher dimensional color diagrams might have popularity.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 12


Color Magnitude diagram
Two dimensional Color-Color diagram or Celebrated Her tzsprung-Russell diagram (swi

tch

)

What if we can see beyond 2 dimensions without bias (projection) Then, 3 or higher dimensional color diagrams might have popularity. CHP may assist analyzing multi-color diagrams. Need a suitable data set with colors.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 12


Sloan Digital Sky Survey: SDSS
Commissioned 2000, now Data Release 5 is available. 5 bands; 4 variables (u-g, g-r, r-i, i-z) Studies on analyzing astronomical massive data received spotlights recently. http://www.sdss.org July, 2005: Data Release Four 6670 square degrees, 180 million objects Available from http://www.sdss.org/dr4 From SpecPhotoAll with SQL: Attributes of photometric data are color indices, u,b,g,i,z along with coordinates.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 13


SQL for SDSS

select psf from Sp where s

r M e p

a a c e

, g P c

dec, z, psfMag_u, psfMag_g, psfMag_r, _i, psfMag_z hotoAll class= 2

Note -- 2: galaxies, 3: QSO, 4: HighZ QSO Galaxies: 499043 Quasars: 70204

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 14


Multivariate Descriptive Statistics
CHP Median CHP Skewness CHP Kur tosis with bivariate simulated data and SDSS DR4

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 15


Convex Hull Peeling Median (CHPM)
Multivariate Median: the inner most point among data
Sur vey of Multivariate Median (Small, 1990)

CHPM: recursive peeling leads to the inner most point(s). The average of these largest depth points is the median of a data set.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 16


Convex Hull Peeling Median (CHPM)
Multivariate Median: the inner most point among data
Sur vey of Multivariate Median (Small, 1990)

CHPM: recursive peeling leads to the inner most point(s). The average of these largest depth points is the median of a data set. Simulations: Sample from standard bivariate normal distribution
n 104 106 me a n (0.001338, -0.02232) (0.000072, 0.000114) me d i a n (-0.005305, -0.01643) (0.001185, -0.000717) Sequential CHPM CHPM (0.000918, -0.010589) (0.002455, -0.000456) (0.004741, -0.004111)

Setting for the sequential method: m=10000 and d=0.05

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 16


Application: Median

Quasars Mean Median CHPM Galaxies Mean Median CHPM Seq. CHPM

u-g 0 .4 6 1 9 0 .2 5 2 0 0 .2 5 3 0 u-g 1 .6 2 2 1 .6 8 0 1 .7 9 0 1 .7 7 2

g-r 0 .2 4 8 4 0 .1 7 5 0 0 .1 6 4 0 g-r 0 .9 2 1 1 0 .8 9 3 0 0 .9 5 7 0 .9 5 0

r-i 0 .1 6 4 9 0 .1 5 2 0 0 .1 9 1 3 r-i 0 .4 2 2 6 0 .4 2 0 0 0 .4 2 4 0 .4 2 2 8

i-z 0 .1 0 0 8 0 .0 7 7 0 0 .0 7 0 0 i-z 0 .3 4 3 9 0 .3 5 4 0 0 .3 6 7 0 .3 6 3

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 17


Robustness of Convex Hull Peeling Median
Breakdown point of a convex hull peeling median goes to zero as n (Donoho, 1982). Outliers are necessarily located at infinity.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 18


Robustness of Convex Hull Peeling Median
Breakdown point of a convex hull peeling median goes to zero as n (Donoho, 1982). Outliers are necessarily located at infinity. Empirical mean square error (EMSE) and Relative Efficiency (RE): Model:(1 - )N ((0, 0), I) + N (·, 4I) n = 5000, m = 500, Tj =(CHPM, Mean) m 1 E M S E = m i = 1 | |T j - µ | |2
N ((5, 5)t , 4I) CHPM 0 0.005 0.05 0.2 0.002178 0.0028521 0.016842 0.139215 Me a n 0.000417 0.001682 0.125522 2.00109 RE 0.191689 0.589961 7.45262 14.37612 CHPM 0.002178 0.002891 0.017824 0.1435910 N ((10, 10)t , 4I) Me a n 0.000417 0.005444 0.500610 8.0017 RE 0.191689 1.88291 28.08597 55.7264

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 18


Generalized Quantile Process
EinMahl and Mason (1992) Un (t) = inf {(A) : Pn (A) t, A A}, 0 < t < 1. Central Region: R Level Set: B
CH CH

(t) = {x Rd : C H P D (x) t} = R
CH

(t)

(t)

= {x Rd : C H P D (x) = t} Volume Functional: VC H (t) = V olume(R - One dimensional mapping.
CH

(t))

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 19


4

0

y

2

y
-4 -2 0 x 2 4

-2

-4

0

2

4

6

8

-4

-2

0 x

2

4

3

1

2

-3 -2 -1

-3

-2

-1

0 x

1

2

3

-200

0 100

y

0

y

300

-50

0 x

50

100

not equi-probability contours, assume smooth convex distributions
Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 20


Skewness Measure
Let xj,i be the ith ver tex in a level set BC H,j comprised by the j peeling process. A measure of skewness: Rj = maxi ||xj,i - C H P M || - mini ||xj,i - C H P M || mini ||xj,i - C H P M ||
th

Not only a sequence of Rj visualizes but also quantizes the skewness along depths. Denominator for the regularization affine invariant Rj
symmetric: skewed:

flat Rj along convex hull peels
j

fluctuating R

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 21


Simulation: Skewness Measure
N(0, I) N(0, )
3

non normal

2

2

N(0, 1)
-4 -2 0 2

0

0

-2

-2

-3 -2 -1

0

1

2

3

-3

-2

-1

0

1

2

5

10

15

20
2 10

25

30



2.5

2.5

2.0

1.5

2.0

Rj

Rj

1.0

1.5

Rj
1.0 0 20 40 60 80 2 0 3

0.0

0.5

0

20

40

60

80

4

5

6

20

40

60

80

jthconvex hull level set

jthconvex hull level set

jthconvex hull level set

2000

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 22


Application:Skewness Measure (Quasars)

j

R

2 0

4

6

8

10

12

14

20

40

60

80

jthconvex hull level set
Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 23


Application:Skewness Measure (Galaxies)

R
2 0 4 6

j

8

10

12

50

100

150

200

jthconvex hull level set
Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 24


Kurtosis Measure
Quantile (Depth) based Kur tosis: KC Tailweight: t(r, s) = VC H (r ) VC H (s)
H 1 r 1 VC H ( 2 - r ) + VC H ( 1 + 2 ) - 2VC H ( 2 ) 2 2 (r ) = r r 1 VC H ( 2 - 2 ) - VC H ( 1 + 2 ) 2

for 0 < s < r 1. Here, VC H (r ) indicates the volume functional at depth r.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 25


Simulation: Kurtosis Measure (Tailweight)

1.0

t(r, r

min

)

0.0

0.2

0.4

0.6

0.8

uniform normal t10 cauchy

1.0

0.8

0.6

0.4

0.2

0.0

r : depth
Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 26


Application: Kurtosis Measure (Quasars)

v(r, rmin
0.0 1.0 0.2 0.4

)

0.6

0.8

1.0

0.8

0.6

0.4

0.2

0.0

depth
Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 27


Multivariate Outlier Detection
What are Outliers? Detecting Algorithms Level Shape Distor tion Balloon Plot

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 28


What are Outliers?
Outliers are... Cumbersome Observations Lead to New Scientific Discoveries Improve Models (Robust Statistics) ... No Clear Objectives but Come Along Often CHP: Experience and relative Robustness suppor t the Idea of Outlier Detection. We need a clear definition on outliers; especially, outliers of the 21st century. And outlier detecting methods.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 29


Outliers are observations....
Huber (1972): unlikely to belong to the main population. Barnett and Leroy (1994): appear inconsistent with the remainder. Hawkins (1980): deviated so much to arouse suspicion. Beckman and Cook (1983): surprising and discrepant to the investigator. Discordant Observations or Contaminants Rohlf (1975): somewhat isolated from the main cloud of points.

Yet, somewhat VAGUE!

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 30


Some Outlier Detection Methods
Univariate: Box-and-Whisker plot, Order statistics, ...

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 31


Some Outlier Detection Methods
Univariate: Box-and-Whisker plot, Order statistics, ... Multivariate: Mostly bivariate applications Generalized Gap Test (Rolhf, 1975) Bivariate Box Plot (Zani et. al, 1999) Sunburst Plot (Liu et. al., 1999) Bag plot (Miller et. al., 2003) and Mahalanobis distance D (x) = (x - µ)-1 (x - µ). ^^ ^

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 31


Some Outlier Detection Methods
Univariate: Box-and-Whisker plot, Order statistics, ... Multivariate: Mostly bivariate applications Generalized Gap Test (Rolhf, 1975) Bivariate Box Plot (Zani et. al, 1999) Sunburst Plot (Liu et. al., 1999) Bag plot (Miller et. al., 2003) and Mahalanobis distance D (x) = (x - µ)-1 (x - µ). ^^ ^ Difficulties of multivariate analysis arise from the complexity of ordering multivariate data.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 31


Quantile Based Outlier Detection
bivariate standard normal
4 20

bivariate t5 with =-0.5

0

y

y

-2

-4

-4

-2

0
x

2

4

-20

-10

0

10

2

-20

-10

0
x

10

20

0.20

0.15

density

density

0.10

0.05

0.00

-4 -3 -2 -1

0 x

1

2

3

4

0.00

23 01 -1 -2 -3

0.05

0.10

0.15

0.20

-6

-4

-2

0 x

2

4

6

46 02 -2 -4 -6

y

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 32

y


Contour Shape Changes
Bivariate Normal Sample Outliers Added
2 4

y

0

y
-3 -2 -1 0 1 2 3 -2 0

-2

2

-2

0

2

4

6

x

x

2

y

0

y

-2

-3

-2

-1

0

1

2

3

-2

0

2

4

-2

0
x

2

4

6

x

50

40

30

Volj

Volj

30

40

50

GAP
20

10

20

0

0

20

40

60

80

0 0

10

20

40

60

80

jthconvex hull level set

jthconvex hull level set

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 33


Balloon Plot
Bivariate Normal Sample Outliers Added

2

y

0

y -3 -2 -1 0 1 2 3 -2 0

-2

2

4

-2

0

2

4

6

x x A Balloon Plot is obtained by blowing .5th CHPD polyhedron by 1.5 times (lengthwise). Let V.5 be a set of ver tices of .5th CHPD hull. The balloon for outlier detection is

B

1 .5

= {yi : yi = xi + 1.5(xi - C H P M ), xi V.5 }.

In other words, blow the balloon of IQR 1.5 times larger.
Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 34


Outliers in Quasar Population

400

300

200

Vol0.25 j 0 20 40 60 80

Vol

j

100

0

0 0

1

2

3

4

20

40

60

80

jthconvex hull level set

jthconvex hull level set

Volumes of 1st CH, .01 Depth CH, .05 Depth CH: (474.134, 14.442, 4 .3 5 3 )
Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 35


Outliers in Galaxy Population

5000

4000

3000

Vol0.25 j 0 50 100 j (peel) 150 200

Vol

j

1000

2000

0

0 0

2

4

6

8

50

100 j(peel)

150

200

Volumes of 1st CH, .01 Depth CH, .05 Depth CH: (4919.492, 4.310, 1 .0 7 5 )
Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 36


Discussion on CHP
Convex Hull Peeling is.. a robust location estimator. a tool for descriptive statistics. Skewness and Kur tosis measure. a reasonable approach for detecting multivariate outliers. a star ter for clustering. Our methods help to characterize multivariate distributions and identify outlier candidates from multivariate massive data; therefore, the results initiate scientists to study fur ther with less bias. CHP as Exploratory Data Analysis and Data Mining Tools.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 37


Concluding Remarks
Drawbacks of CHPD Limited to moderate dimension data. CHPD estimates depths inward not outward. Convexity of a data set. No population/theoretical counterpar t.

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 38


Concluding Remarks
Drawbacks of CHPD Limited to moderate dimension data. CHPD estimates depths inward not outward. Convexity of a data set. No population/theoretical counterpar t.

No assumption on data distribution, Non-distance based, Affine invariant, Applicable to streaming data, Detecting Outliers, Providing Multivariate Descriptive Statistics, Explorator y data analysis

Hyunsook Lee, Depar tment of Statistics, Penn State Univ ­ p. 38