Original document: http://hea-www.harvard.edu/AstroStat/Stat310_0910/dr_20100323_mono.pdf
Last modified: Mon Mar 22 20:30:30 2010
Statistical Inference with Monotone Incomplete Multivariate Normal Data

This talk is based on joint work with my wonderful co-authors:
Wan-Ying Chang (US Census Bureau)
Megan Romer (Penn State University)
Tomoya Yamada (Sapporo Gakuin University)


Background
We have a population of "patients"
We draw a random sample of N patients, and measure m variables on each patient:

1. Visual acuity
2. LDL (low-density lipoprotein) cholesterol
3. Systolic blood pressure
4. Glucose intolerance
5. Insulin response to oral glucose
6. Actual weight ÷ Expected weight
...
m. White blood cell count


We obtain data:
\[
\begin{array}{c|cccc}
\text{Patient} & & & & \\
1 & v_{1,1} & v_{1,2} & \cdots & v_{1,m} \\
2 & v_{2,1} & v_{2,2} & \cdots & v_{2,m} \\
3 & v_{3,1} & v_{3,2} & \cdots & v_{3,m} \\
\vdots & \vdots & \vdots & & \vdots \\
N & v_{N,1} & v_{N,2} & \cdots & v_{N,m}
\end{array}
\]
Vector notation: $V_1, V_2, \ldots, V_N$
$V_1$: The measurements on patient 1, stacked into a column
etc.


Classical multivariate analysis
Statistical analysis of N m-dimensional data vectors
Common assumption: The population has a multivariate normal distribution
$V$: The vector of measurements on a randomly chosen patient

Multivariate normal populations are characterized by:
$\mu$: The population mean vector
$\Sigma$: The population covariance matrix

For a given data set, $\mu$ and $\Sigma$ are unknown


We wish to perform inference about $\mu$ and $\Sigma$
Construct confidence regions for, and test hypotheses about, $\mu$ and $\Sigma$

Anderson (2003). An Introduction to Multivariate Statistical Analysis
Eaton (1984). Multivariate Statistics: A Vector-Space Approach
Johnson and Wichern (2002). Applied Multivariate Statistical Analysis
Muirhead (1982). Aspects of Multivariate Statistical Theory


Standard notation: $V \sim N_m(\mu, \Sigma)$
The probability density function of $V$: For $v \in \mathbb{R}^m$,
\[
f(v) = (2\pi)^{-m/2} \, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(v - \mu)' \Sigma^{-1} (v - \mu)\right)
\]
$V_1, V_2, \ldots, V_N$: Measurements on N randomly chosen patients

Estimate $\mu$ and $\Sigma$ using Fisher's maximum likelihood principle
Likelihood function:
\[
L(\mu, \Sigma) = \prod_{j=1}^{N} f(v_j)
\]
Maximum likelihood estimator: The value of $(\mu, \Sigma)$ that maximizes $L$


\[
\hat\mu = \frac{1}{N} \sum_{j=1}^{N} V_j = \bar V
\]
The sample mean and MLE of $\mu$
\[
\hat\Sigma = \frac{1}{N} \sum_{j=1}^{N} (V_j - \bar V)(V_j - \bar V)'
\]
The MLE of $\Sigma$

What are the probability distributions of $\hat\mu$ and $\hat\Sigma$?
$\hat\mu \sim N_m(\mu, \tfrac{1}{N}\Sigma)$
Law of Large Numbers: $\hat\mu \to \mu$, a.s., as $N \to \infty$
$N\hat\Sigma$ has a Wishart distribution, a generalization of the $\chi^2$
$\hat\mu$ and $\hat\Sigma$ also are mutually independent
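The classical estimators above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the complete data are stored as an N × m array with one row per patient; the function name `mvn_mle` and the simulated data are illustrative and not part of the talk.

```python
import numpy as np

def mvn_mle(V):
    """MLEs of the mean vector and covariance matrix from complete data.

    V is an (N, m) array: one row per patient, one column per variable.
    """
    V = np.asarray(V, dtype=float)
    N = V.shape[0]
    mu_hat = V.mean(axis=0)                  # sample mean, the MLE of mu
    centered = V - mu_hat
    sigma_hat = centered.T @ centered / N    # divisor N (not N - 1) gives the MLE
    return mu_hat, sigma_hat

# Illustrative data: 200 simulated patients, 3 variables.
rng = np.random.default_rng(0)
V = rng.multivariate_normal([1.0, -2.0, 0.5], np.eye(3), size=200)
mu_hat, sigma_hat = mvn_mle(V)
```

Note the divisor N: the unbiased sample covariance uses N - 1, whereas the maximum likelihood estimator uses N.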


Monotone incomplete data
Some patients were not measured completely
The resulting data set, with $*$ denoting a missing observation:
\[
\begin{array}{ccccc}
v_{1,1} & v_{1,2} & v_{1,3} & \cdots & v_{1,m} \\
* & v_{2,2} & v_{2,3} & \cdots & v_{2,m} \\
* & v_{3,2} & \cdots & \cdots & v_{3,m} \\
\vdots & & & & \vdots \\
* & * & \cdots & * & v_{N,m}
\end{array}
\]
Monotone data: Each $*$ is followed by $*$'s only
We may need to renumber patients to display the data in monotone form
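Checking whether a data set can be displayed in monotone form is mechanical. The sketch below assumes missing observations are coded as NaN in an N × m NumPy array; the function name `to_monotone_form` is hypothetical. It sorts variables from least to most observed and patients from most to least complete, then tests for the staircase.

```python
import numpy as np

def to_monotone_form(data):
    """Reorder rows and columns, then report whether the pattern is monotone.

    data: (N, m) array with np.nan marking missing observations.
    Returns (reordered_data, is_monotone).
    """
    data = np.asarray(data, dtype=float)
    observed = ~np.isnan(data)
    col_order = np.argsort(observed.sum(axis=0), kind="stable")    # least-observed variables first
    row_order = np.argsort(-observed.sum(axis=1), kind="stable")   # most complete patients first
    reordered = data[np.ix_(row_order, col_order)]
    mask = ~np.isnan(reordered)
    # Staircase check: within each row, once a value is observed,
    # every later variable must be observed as well.
    is_monotone = bool(np.all(mask[:, :-1] <= mask[:, 1:]))
    return reordered, is_monotone

nan = np.nan
sample = np.array([[nan, 2.0, 3.0],
                   [1.0, 2.5, 3.5],
                   [nan, nan, 4.0]])
reordered, ok = to_monotone_form(sample)
```

With this ordering the incomplete variables come first and the complete ones last, which matches the layout used in the rest of the talk.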


Physical Fitness Data
A well-known data set from a SAS manual on missing data
Patients: Men taking a physical fitness course at NCSU
Three variables were measured:
Oxygen intake rate (ml per kg body weight per minute)
RunTime (time taken, in minutes, to run 1.5 miles)
RunPulse (heart rate while running)


Oxygen   RunTime  RunPulse
44.609   11.37    178
45.313   10.07    185
54.297    8.65    156
51.855   10.33    166
49.156    8.95    180
40.836   10.95    168
44.811   11.63    176
45.681   11.95    176
39.203   12.88    168
45.790   10.47    186
50.545    9.93    148
48.673    9.40    186
47.920   11.50    170
47.467   10.50    170
50.388   10.08    168
39.407   12.63    174
46.080   11.17    156
45.441    9.63    164
54.625    8.92    146
39.442   13.08    174
60.055    8.63    170
37.388   14.03    186
44.754   11.12    176
46.672   10.00    *
46.774   10.25    *
45.118   11.08    *
49.874    9.22    *
49.091   10.85    *
59.571    *       *
50.541    *       *
47.273    *       *


Monotone data have a staircase pattern; we will consider the two-step pattern
Partition $V$ into an incomplete part of dimension $p$ and a complete part of dimension $q$:
\[
\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix},
\begin{pmatrix} X_2 \\ Y_2 \end{pmatrix}, \ldots,
\begin{pmatrix} X_n \\ Y_n \end{pmatrix},
\begin{pmatrix} * \\ Y_{n+1} \end{pmatrix},
\begin{pmatrix} * \\ Y_{n+2} \end{pmatrix}, \ldots,
\begin{pmatrix} * \\ Y_N \end{pmatrix}
\]
Assume that the individual vectors are independent and are drawn from $N_m(\mu, \Sigma)$
Goal: Maximum likelihood inference for $\mu$ and $\Sigma$, with analytical results as extensive and as explicit as in the classical setting


Where do monotone incomplete data arise?
Panel survey data (Census Bureau, Bureau of Labor Statistics)
Astrophysics
Early detection of diseases
Wildlife survey research
Covert communications
Mental health research
Climate and atmospheric studies
...


We have $n$ observations on $\begin{pmatrix} X \\ Y \end{pmatrix}$ and $N - n$ additional observations on $Y$

Difficulty: The likelihood function is more complicated
\[
\begin{aligned}
L &= \prod_{i=1}^{n} f_{X,Y}(x_i, y_i) \cdot \prod_{i=n+1}^{N} f_Y(y_i) \\
  &= \prod_{i=1}^{n} f_Y(y_i) \, f_{X|Y}(x_i) \cdot \prod_{i=n+1}^{N} f_Y(y_i) \\
  &= \prod_{i=1}^{N} f_Y(y_i) \cdot \prod_{i=1}^{n} f_{X|Y}(x_i)
\end{aligned}
\]


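Partition $\mu$ and $\Sigma$ similarly:
\[
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\qquad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\]
Let
\[
\mu_{1\cdot 2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(Y - \mu_2),
\qquad
\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
\]
\[
Y \sim N_q(\mu_2, \Sigma_{22}),
\qquad
X \mid Y \sim N_p(\mu_{1\cdot 2}, \Sigma_{11\cdot 2})
\]
$\hat\mu$ and $\hat\Sigma$: Wilks, Anderson, Morrison, Olkin, Jinadasa, Tracy, ...

Anderson and Olkin (1985): An elegant derivation of $\hat\Sigma$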
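The factorization of the likelihood into a marginal part for Y and a conditional part for X given Y can be checked numerically. The sketch below is illustrative only: the function names and the simulated parameters are assumptions, and the conditional mean and covariance used are the standard ones for a partitioned multivariate normal vector.

```python
import numpy as np

def mvn_logpdf(Z, mean, cov):
    """Row-wise log density of N(mean, cov)."""
    D = np.atleast_2d(Z) - mean
    k = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ij,ij->i", D, np.linalg.solve(cov, D.T).T)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + quad)

def loglik_joint_form(X, Y, mu, sigma):
    """log L as joint density for rows 1..n plus marginal density of Y for the rest."""
    n, p = X.shape
    return (mvn_logpdf(np.hstack([X, Y[:n]]), mu, sigma).sum()
            + mvn_logpdf(Y[n:], mu[p:], sigma[p:, p:]).sum())

def loglik_factored_form(X, Y, mu, sigma):
    """log L as marginal density of Y for all N rows plus conditional density of X given Y."""
    n, p = X.shape
    S12, S22 = sigma[:p, p:], sigma[p:, p:]
    coef = np.linalg.solve(S22, S12.T).T               # Sigma_12 Sigma_22^{-1}
    cond_cov = sigma[:p, :p] - coef @ S12.T            # Sigma_{11.2}
    resid = X - (mu[:p] + (Y[:n] - mu[p:]) @ coef.T)   # X_i minus its conditional mean
    return (mvn_logpdf(Y, mu[p:], S22).sum()
            + mvn_logpdf(resid, np.zeros(p), cond_cov).sum())

# Illustrative check with p = 1, q = 2, n = 6, N = 10.
rng = np.random.default_rng(4)
sigma = np.array([[2.0, 0.5, 0.3], [0.5, 1.0, 0.2], [0.3, 0.2, 1.5]])
mu = np.array([1.0, 0.0, -1.0])
Z = rng.multivariate_normal(mu, sigma, size=10)
X, Y = Z[:6, :1], Z[:, 1:]
same = np.isclose(loglik_joint_form(X, Y, mu, sigma),
                  loglik_factored_form(X, Y, mu, sigma))
```

The factored form is what makes explicit maximization possible: the Y-part involves only $(\mu_2, \Sigma_{22})$ and all N observations, while the conditional part is a regression of X on Y using the first n observations.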


Sample means:
\[
\bar X = \frac{1}{n}\sum_{j=1}^{n} X_j, \qquad
\bar Y_1 = \frac{1}{n}\sum_{j=1}^{n} Y_j, \qquad
\bar Y_2 = \frac{1}{N-n}\sum_{j=n+1}^{N} Y_j, \qquad
\bar Y = \frac{1}{N}\sum_{j=1}^{N} Y_j
\]
Sample covariance matrices:
\[
A_{11} = \sum_{j=1}^{n} (X_j - \bar X)(X_j - \bar X)', \qquad
A_{12} = \sum_{j=1}^{n} (X_j - \bar X)(Y_j - \bar Y_1)'
\]
\[
A_{22,n} = \sum_{j=1}^{n} (Y_j - \bar Y_1)(Y_j - \bar Y_1)', \qquad
A_{22,N} = \sum_{j=1}^{N} (Y_j - \bar Y)(Y_j - \bar Y)'
\]


The MLEs of $\mu$ and $\Sigma$
Notation: $\tau = n/N$, $\bar\tau = 1 - \tau$
\[
\hat\mu_1 = \bar X - \bar\tau A_{12} A_{22,n}^{-1} (\bar Y_1 - \bar Y_2), \qquad
\hat\mu_2 = \bar Y
\]
$\hat\mu_1$ is called the regression estimator of $\mu_1$

In sample surveys, extra observations on a subset of variables are used to improve estimation of a parameter
$\hat\Sigma$ is more complicated:
\[
\begin{aligned}
\hat\Sigma_{11} &= \frac{1}{n}\left(A_{11} - A_{12} A_{22,n}^{-1} A_{21}\right) + \frac{1}{N} A_{12} A_{22,n}^{-1} A_{22,N} A_{22,n}^{-1} A_{21}, \\
\hat\Sigma_{12} &= \frac{1}{N} A_{12} A_{22,n}^{-1} A_{22,N}, \\
\hat\Sigma_{22} &= \frac{1}{N} A_{22,N}
\end{aligned}
\]
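The closed-form estimators above translate directly into code. The following is a minimal sketch, assuming X holds the incomplete block for the first n patients and Y holds the complete block for all N patients, with rows aligned; the function name `two_step_mle` and the simulated data are illustrative.

```python
import numpy as np

def two_step_mle(X, Y):
    """MLEs of mu and Sigma from two-step monotone incomplete normal data.

    X: (n, p) array, the incomplete block observed on the first n patients.
    Y: (N, q) array, the complete block; rows 0..n-1 pair with the rows of X.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, N = X.shape[0], Y.shape[0]
    tau_bar = 1.0 - n / N

    X_bar = X.mean(axis=0)
    Y1_bar = Y[:n].mean(axis=0)
    Y2_bar = Y[n:].mean(axis=0) if N > n else Y1_bar
    Y_bar = Y.mean(axis=0)

    Xc, Y1c, Yc = X - X_bar, Y[:n] - Y1_bar, Y - Y_bar
    A11 = Xc.T @ Xc
    A12 = Xc.T @ Y1c
    A22n = Y1c.T @ Y1c
    A22N = Yc.T @ Yc

    R = np.linalg.solve(A22n, A12.T).T        # A12 A22,n^{-1}
    mu1 = X_bar - tau_bar * (R @ (Y1_bar - Y2_bar))
    mu2 = Y_bar
    S11 = (A11 - R @ A12.T) / n + R @ A22N @ R.T / N
    S12 = R @ A22N / N
    S22 = A22N / N

    mu_hat = np.concatenate([mu1, mu2])
    sigma_hat = np.block([[S11, S12], [S12.T, S22]])
    return mu_hat, sigma_hat

# Illustrative data: p = 1, q = 2, X observed on the first 20 of 30 patients.
rng = np.random.default_rng(0)
Z = rng.multivariate_normal([0.0, 1.0, 2.0], np.eye(3), size=30)
mu_hat, sigma_hat = two_step_mle(Z[:20, :1], Z[:, 1:])
```

A useful sanity check is that when n = N (no missing data) the formulas collapse to the classical sample mean and the divisor-N sample covariance.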


Seventy-year-old unsolved problems
Explicit confidence levels for elliptical confidence regions for $\mu$
In testing hypotheses on $\mu$ or $\Sigma$, are the LRT statistics unbiased?
Calculate the higher moments of the components of $\hat\mu$
Determine the asymptotic behavior of $\hat\mu$ as $n$ or $N \to \infty$
The Stein phenomenon for $\hat\mu$?
The crucial obstacle: The exact distribution of $\hat\mu$


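The exact distribution of $\hat\mu$
Chang and D.R. (J. Multivariate Analysis, 2009): For $n > p + q$,
\[
\hat\mu \stackrel{\mathcal{L}}{=} \mu + V_1 + \left(\frac{1}{n} - \frac{1}{N}\right)^{1/2} \left(\frac{Q_2}{Q_1}\right)^{1/2} \begin{pmatrix} V_2 \\ 0 \end{pmatrix}
\]
where $V_1$, $V_2$, $Q_1$, and $Q_2$ are independent;
\[
V_1 \sim N_{p+q}(0, \Omega), \qquad V_2 \sim N_p(0, \Sigma_{11\cdot 2}), \qquad Q_1 \sim \chi^2_{n-q}, \qquad Q_2 \sim \chi^2_q;
\]
\[
\Omega = \frac{1}{N}\Sigma + \left(\frac{1}{n} - \frac{1}{N}\right) \begin{pmatrix} \Sigma_{11\cdot 2} & 0 \\ 0 & 0 \end{pmatrix}.
\]
Consequences: $\hat\mu$ is an unbiased estimator of $\mu$. Also, $\hat\mu_1$ and $\hat\mu_2$ are independent iff $\Sigma_{12} = 0$.
Romer and D.R. (2009): Explicit formulas for the $V$'s and $Q$'s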


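Computation of the higher moments of $\hat\mu$ now is straightforward
Due to the term $1/Q_1$, even moments exist only up to order $n - q$
The covariance matrix of $\hat\mu$:
\[
\mathrm{Cov}(\hat\mu) = \frac{1}{N}\Sigma + \frac{(n-2)\,\bar\tau}{n(n-q-2)} \begin{pmatrix} \Sigma_{11\cdot 2} & 0 \\ 0 & 0 \end{pmatrix}
\]
Asymptotics for $\hat\mu$: If $n, N \to \infty$ with $N/n \to \delta \ge 1$ then
\[
\sqrt{N}\,(\hat\mu - \mu) \xrightarrow{\mathcal{L}} N_{p+q}\left(0,\; \Sigma + (\delta - 1) \begin{pmatrix} \Sigma_{11\cdot 2} & 0 \\ 0 & 0 \end{pmatrix}\right)
\]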
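The covariance formula is easy to evaluate for given parameters. The sketch below is an illustration only; the function name `cov_mu_hat` is hypothetical, and the guard `n > q + 2` is an added assumption reflecting that the constant in the formula is finite and positive only in that case.

```python
import numpy as np

def cov_mu_hat(sigma, p, n, N):
    """Cov(mu_hat) = Sigma/N + (n-2) tau_bar / (n (n-q-2)) * diag(Sigma_{11.2}, 0)."""
    sigma = np.asarray(sigma, dtype=float)
    q = sigma.shape[0] - p
    if n <= q + 2:
        raise ValueError("the covariance formula needs n > q + 2")
    S12, S22 = sigma[:p, p:], sigma[p:, p:]
    S11_2 = sigma[:p, :p] - S12 @ np.linalg.solve(S22, S12.T)   # Sigma_{11.2}
    tau_bar = 1.0 - n / N
    extra = np.zeros_like(sigma)
    extra[:p, :p] = S11_2                                        # only the X-block is inflated
    return sigma / N + (n - 2) * tau_bar / (n * (n - q - 2)) * extra

# Illustrative values: Sigma = I_2, p = q = 1, n = 5, N = 10.
C = cov_mu_hat(np.eye(2), p=1, n=5, N=10)
```

With complete data (n = N) the second term vanishes and the result is the classical $\Sigma/N$; the extra term quantifies the price paid for the missing X observations.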


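The analog of Hotelling's $T^2$-statistic
\[
T^2 = (\hat\mu - \mu)' \, \widehat{\mathrm{Cov}}(\hat\mu)^{-1} (\hat\mu - \mu)
\]
where
\[
\widehat{\mathrm{Cov}}(\hat\mu) = \frac{1}{N}\hat\Sigma + \frac{(n-2)\,\bar\tau}{n(n-q-2)} \begin{pmatrix} \hat\Sigma_{11\cdot 2} & 0 \\ 0 & 0 \end{pmatrix}
\]
An obvious ellipsoidal confidence region for $\mu$ is
\[
\left\{ \nu \in \mathbb{R}^{p+q} : (\hat\mu - \nu)' \, \widehat{\mathrm{Cov}}(\hat\mu)^{-1} (\hat\mu - \nu) \le c \right\}
\]
What is the corresponding confidence level?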
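Given the estimates, the statistic is a plug-in quadratic form. The sketch below assumes the MLEs have already been computed and are passed in; the function name `t2_statistic` and the numbers in the example are illustrative.

```python
import numpy as np

def t2_statistic(mu_hat, sigma_hat, mu, p, n, N):
    """T^2 = (mu_hat - mu)' Cov_hat(mu_hat)^{-1} (mu_hat - mu) for two-step data."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    q = sigma_hat.shape[0] - p
    S12, S22 = sigma_hat[:p, p:], sigma_hat[p:, p:]
    S11_2 = sigma_hat[:p, :p] - S12 @ np.linalg.solve(S22, S12.T)   # Sigma_hat_{11.2}
    cov_hat = sigma_hat / N
    cov_hat[:p, :p] += (n - 2) * (1.0 - n / N) / (n * (n - q - 2)) * S11_2
    diff = mu_hat - mu
    return float(diff @ np.linalg.solve(cov_hat, diff))

# A candidate mean vector nu lies in the ellipsoidal region when T^2(nu) <= c.
t2 = t2_statistic([0.5, 0.1], np.eye(2), [0.0, 0.0], p=1, n=5, N=10)
```

The function only evaluates the statistic; choosing the cutoff c for a stated confidence level is exactly the question the talk goes on to address.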


Theorem: For $t \ge 0$, $P(T^2 \le t)$ is bounded above by
\[
P\left(F_{q,\,N-q} \le (q^{-1} - N^{-1})\, t\right)
\]
and bounded below by an explicit probability of the form $P(\,\cdot \le t)$ involving $n$, $N$, $q$, $\bar\tau$, and chi-squared variables $Q_1, \ldots, Q_5$, where $Q_1 \sim \chi^2_{n-p-q}$, the degrees of freedom of $Q_2, \ldots, Q_5$ depend on $p$ and $q$, and $Q_1, \ldots, Q_5$ are mutually independent.

Romer (2009) has now derived the exact distribution of $T^2$
Shrinkage estimation for $\mu$ when $\Sigma$ is unknown


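A decomposition of $\hat\Sigma$
Notation: $A_{11\cdot 2,n} := A_{11} - A_{12} A_{22,n}^{-1} A_{21}$
\[
n\hat\Sigma = \tau \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22,n} \end{pmatrix}
+ \bar\tau \begin{pmatrix} A_{11\cdot 2,n} & 0 \\ 0 & 0 \end{pmatrix}
+ \tau \begin{pmatrix} A_{12} A_{22,n}^{-1} & 0 \\ 0 & I_q \end{pmatrix}
\begin{pmatrix} B & B \\ B & B \end{pmatrix}
\begin{pmatrix} A_{22,n}^{-1} A_{21} & 0 \\ 0 & I_q \end{pmatrix}
\]
where
\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22,n} \end{pmatrix} \sim W_{p+q}(n - 1, \Sigma)
\quad \text{and} \quad
B \sim W_q(N - n, \Sigma_{22})
\]
are independent. Also, $N\hat\Sigma_{22} \sim W_q(N - 1, \Sigma_{22})$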


\[
A_{22,N} = \sum_{j=1}^{n} (Y_j - \bar Y_1 + \bar Y_1 - \bar Y)(Y_j - \bar Y_1 + \bar Y_1 - \bar Y)'
+ \sum_{j=n+1}^{N} (Y_j - \bar Y_2 + \bar Y_2 - \bar Y)(Y_j - \bar Y_2 + \bar Y_2 - \bar Y)'
\]
\[
A_{22,N} = A_{22,n} + B
\]
\[
B = \sum_{j=n+1}^{N} (Y_j - \bar Y_2)(Y_j - \bar Y_2)' + \frac{n(N-n)}{N} (\bar Y_1 - \bar Y_2)(\bar Y_1 - \bar Y_2)'
\]
Verify that the terms in the decomposition of $\hat\Sigma$ are independent
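The identity $A_{22,N} = A_{22,n} + B$ is the usual within-group plus between-group split of a scatter matrix, and it can be confirmed numerically. The helper names `scatter` and `b_matrix` and the simulated data below are illustrative assumptions.

```python
import numpy as np

def scatter(Z):
    """Sum of outer products of the rows of Z about their own mean."""
    Zc = Z - Z.mean(axis=0)
    return Zc.T @ Zc

def b_matrix(Y, n):
    """B: scatter of the last N - n rows plus the between-group term."""
    N = Y.shape[0]
    d = Y[:n].mean(axis=0) - Y[n:].mean(axis=0)     # Y1_bar - Y2_bar
    return scatter(Y[n:]) + n * (N - n) / N * np.outer(d, d)

rng = np.random.default_rng(2)
Y = rng.normal(size=(20, 3))    # N = 20 observations on q = 3 variables
n = 12
A22N = scatter(Y)
A22n = scatter(Y[:n])
B = b_matrix(Y, n)
identity_holds = np.allclose(A22N, A22n + B)
```

This checks only the algebraic identity; the independence of $A_{22,n}$ and $B$ is a distributional statement that a single data set cannot confirm.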


The marginal distribution of $\hat\Sigma_{11}$ is non-trivial
If $\Sigma_{12} = 0$ then $A_{22,n}$, $B$, $A_{11\cdot 2,n}$, $A_{12} A_{22,n}^{-1} A_{21}$, $\bar X$, $\bar Y_1$, and $\bar Y_2$ are independent
Matrix $F$-distribution:
\[
F^{(q)}_{a,b} = W_2^{-1/2} \, W_1 \, W_2^{-1/2}
\]
where $W_1 \sim W_q(a, \Sigma_{22})$ and $W_2 \sim W_q(b, \Sigma_{22})$


Theorem: Suppose that $\Sigma_{12} = 0$. Then
\[
\Sigma_{11}^{-1/2} \, \hat\Sigma_{11} \, \Sigma_{11}^{-1/2} \stackrel{\mathcal{L}}{=} \frac{1}{n} W_1 + \frac{1}{N} W_2^{1/2} (I_p + F) W_2^{1/2}
\]
where $W_1$, $W_2$, and $F$ are independent, and
\[
W_1 \sim W_p(n - q - 1, I_p), \qquad W_2 \sim W_p(q, I_p), \qquad F \sim F^{(p)}_{N-n,\; n-q+p-1}
\]
\[
N \, \Sigma_{11}^{-1/2} \hat\Sigma_{11} \Sigma_{11}^{-1/2} \stackrel{\mathcal{L}}{=} \frac{N}{n} \Sigma_{11}^{-1/2} A_{11\cdot 2,n} \Sigma_{11}^{-1/2} + \Sigma_{11}^{-1/2} A_{12} A_{22,n}^{-1} (A_{22,n} + B) A_{22,n}^{-1} A_{21} \Sigma_{11}^{-1/2}
\]


Theorem: With no assumptions on $\Sigma_{12}$,
\[
\hat\Sigma_{12} \hat\Sigma_{22}^{-1} \stackrel{\mathcal{L}}{=} \Sigma_{12} \Sigma_{22}^{-1} + \Sigma_{11\cdot 2}^{1/2} \, W^{-1/2} K \, \Sigma_{22}^{-1/2}
\]
where $W$ and $K$ are independent, and
\[
W \sim W_p(n - q + p - 1, I_p), \qquad K \sim N_{pq}(0, I_p \otimes I_q)
\]
In particular, $\hat\Sigma_{12} \hat\Sigma_{22}^{-1}$ is an unbiased estimator of $\Sigma_{12} \Sigma_{22}^{-1}$

The general distribution of $\hat\Sigma$ requires the hypergeometric functions of matrix argument
Saddlepoint approximations


The distribution of $|\hat\Sigma|$ is much simpler:
\[
|\hat\Sigma| = |\hat\Sigma_{11\cdot 2}| \cdot |\hat\Sigma_{22}|
\]
$|\hat\Sigma_{11\cdot 2}|$ and $|\hat\Sigma_{22}|$ are independent; each is a product of independent $\chi^2$ variables

Hao and Krishnamoorthy (2001):
\[
|\hat\Sigma| \stackrel{\mathcal{L}}{=} n^{-p} N^{-q} \, |\Sigma| \cdot \prod_{j=1}^{p} \chi^2_{n-q-j} \cdot \prod_{j=1}^{q} \chi^2_{N-j}
\]
It now is plausible that tests of hypothesis on $\Sigma$ are unbiased


Testing $\Sigma = \Sigma_0$
Data: Two-step, monotone incomplete sample
$\Sigma_0$: A given, positive definite matrix

Test $H_0: \Sigma = \Sigma_0$ vs. $H_a: \Sigma \ne \Sigma_0$ (WLOG, $\Sigma_0 = I_{p+q}$)

Hao and Krishnamoorthy (2001): The LRT statistic is
\[
\lambda_1 \propto |A_{22,N}|^{N/2} \exp\left(-\tfrac{1}{2} \operatorname{tr} A_{22,N}\right)
\times |A_{11\cdot 2,n}|^{n/2} \exp\left(-\tfrac{1}{2} \operatorname{tr} A_{11\cdot 2,n}\right)
\times \exp\left(-\tfrac{1}{2} \operatorname{tr} A_{12} A_{22,n}^{-1} A_{21}\right).
\]
Is the LRT unbiased? If $C$ is a critical region of size $\alpha$, is
\[
P(\lambda_1 \in C \mid H_a) \ge P(\lambda_1 \in C \mid H_0)?
\]


Pitman (1939): Even with 1-d data, $\lambda_1$ is not unbiased
Bartlett: $\lambda_1$ becomes unbiased if sample sizes are replaced by degrees of freedom
With two-step monotone data, perhaps a similarly modified statistic, $\lambda_2$, is unbiased?
Answer: Still unknown.
Theorem: If $|\Sigma_{11}| < 1$ then $\lambda_2$ is unbiased
With monotone incomplete data, further modification is needed


Theorem: The modified LRT,
\[
\lambda_3 \propto |A_{22,N}|^{(N-1)/2} \exp\left(-\tfrac{1}{2} \operatorname{tr} A_{22,N}\right)
\times |A_{11\cdot 2,n}|^{(n-q-1)/2} \exp\left(-\tfrac{1}{2} \operatorname{tr} A_{11\cdot 2,n}\right)
\times |A_{12} A_{22,n}^{-1} A_{21}|^{q/2} \exp\left(-\tfrac{1}{2} \operatorname{tr} A_{12} A_{22,n}^{-1} A_{21}\right),
\]
is unbiased. Also, $\lambda_1$ is not unbiased
For diagonal $\Sigma = \operatorname{diag}(\sigma_{jj})$, the power function of $\lambda_3$ increases monotonically as any $|\sigma_{jj} - 1|$ increases, $j = 1, \ldots, p + q$.


With monotone two-step data, test
\[
H_0: (\mu, \Sigma) = (\mu_0, \Sigma_0) \quad \text{vs.} \quad H_a: (\mu, \Sigma) \ne (\mu_0, \Sigma_0)
\]
where $\mu_0$ and $\Sigma_0$ are given. The LRT statistic is
\[
\lambda_4 = \lambda_1 \exp\left(-\tfrac{1}{2}\left(n \bar X' \bar X + N \bar Y' \bar Y\right)\right)
\]
Remarkably, $\lambda_4$ is unbiased
The sphericity test, $H_0: \Sigma \propto I_{p+q}$ vs. $H_a: \Sigma \not\propto I_{p+q}$
The unbiasedness of the LRT statistic is an open problem


The Stein phenomenon for $\hat\mu$
$\hat\mu$: The mean of a complete sample from $N_m(\mu, I_m)$

Quadratic loss function: $L(\hat\mu, \mu) = \|\hat\mu - \mu\|^2$
Risk function: $R(\hat\mu) = E \, L(\hat\mu, \mu)$
C. Stein: $\hat\mu$ is inadmissible for $m \ge 3$

James-Stein estimator for shrinking $\hat\mu$ to $\nu \in \mathbb{R}^m$:
\[
\hat\mu_c = \left(1 - \frac{c}{\|\hat\mu - \nu\|^2}\right)(\hat\mu - \nu) + \nu
\]
Baranchik's positive-part shrinkage estimator:
\[
\hat\mu_c^+ = \left(1 - \frac{c}{\|\hat\mu - \nu\|^2}\right)_+ (\hat\mu - \nu) + \nu
\]
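Both shrinkage estimators differ only in whether the shrinkage factor is truncated at zero. A minimal sketch, with an illustrative function name and illustrative inputs; it assumes the estimate does not coincide exactly with the target, since the factor is undefined there.

```python
import numpy as np

def james_stein(mu_hat, nu, c, positive_part=False):
    """Shrink mu_hat toward the target nu.

    With positive_part=True the shrinkage factor is truncated at zero
    (Baranchik's positive-part estimator).
    """
    mu_hat = np.asarray(mu_hat, dtype=float)
    nu = np.asarray(nu, dtype=float)
    diff = mu_hat - nu
    factor = 1.0 - c / np.dot(diff, diff)   # assumes mu_hat != nu
    if positive_part:
        factor = max(factor, 0.0)
    return factor * diff + nu

origin = np.zeros(3)
mild = james_stein([3.0, 4.0, 0.0], origin, c=5.0)                          # factor 0.8
overshoot = james_stein([3.0, 4.0, 0.0], origin, c=50.0)                    # factor -1
clipped = james_stein([3.0, 4.0, 0.0], origin, c=50.0, positive_part=True)  # factor 0
```

The third call shows why the positive-part version is preferred: when c exceeds the squared distance to the target, the plain estimator overshoots past the target, whereas the truncated one stops at it.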


We collect a monotone incomplete sample from $N_{p+q}(\mu, \Sigma)$

Does the Stein phenomenon hold for $\hat\mu$, the MLE of $\mu$?
The phenomenon seems almost universal: It holds for many loss functions, inference problems, and distributions
Various results available on shrinkage estimation of $\Sigma$ with incomplete data, but no such results available for $\mu$
The crucial impediment: The distribution of $\hat\mu$ was unknown


Theorem (Yamada and D.R.): For $p \ge 2$, $n \ge q + 3$, and $\Sigma = I_{p+q}$, both $\hat\mu$ and $\hat\mu_c$ are inadmissible:
\[
R(\hat\mu) > R(\hat\mu_c) > R(\hat\mu_c^+)
\]
for all $\nu \in \mathbb{R}^{p+q}$ and all $c \in (0, 2c^*)$, where
\[
c^* = \frac{p - 2}{n} + \frac{q}{N}.
\]
Non-radial loss functions
Replace $\|\hat\mu - \nu\|^2$ by non-radial functions of $\hat\mu - \nu$

Shrinkage to a random vector $\nu$, calculated from the data


Kurtosis tests for multivariate normality
m-dimensional complete, random sample: $V_1, \ldots, V_N$

Extensive literature on testing for multivariate normality
Mardia's statistic for testing for kurtosis:
\[
b_{2,m} = \sum_{j=1}^{N} \left[ (V_j - \bar V)' S^{-1} (V_j - \bar V) \right]^2
\]
Invariance under nonsingular affine transformations of the data
Asymptotic distribution of $b_{2,m}$
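The statistic is a sum of squared Mahalanobis distances, and its affine invariance can be seen numerically. The sketch below takes $S$ to be the divisor-N sample covariance matrix, which is an assumption since the slide does not say which divisor is used; the function name and test data are illustrative. The usual textbook form of Mardia's kurtosis divides this sum by N.

```python
import numpy as np

def mardia_kurtosis(V):
    """Sum over patients of the squared Mahalanobis distance, squared again.

    V is an (N, m) array of complete data. S is taken to be the divisor-N
    sample covariance matrix (an assumption made for this sketch).
    """
    V = np.asarray(V, dtype=float)
    N = V.shape[0]
    Vc = V - V.mean(axis=0)
    S = Vc.T @ Vc / N
    d2 = np.einsum("ij,ij->i", Vc, np.linalg.solve(S, Vc.T).T)   # (V_j - V_bar)' S^{-1} (V_j - V_bar)
    return float(np.sum(d2 ** 2))

rng = np.random.default_rng(3)
V = rng.normal(size=(50, 3))
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])            # nonsingular: det(A) = 5
shift = np.array([1.0, -2.0, 0.5])
b_original = mardia_kurtosis(V)
b_transformed = mardia_kurtosis(V @ A.T + shift)   # data mapped to A V_j + shift
```

The two values agree up to rounding because the Mahalanobis distances themselves are unchanged by any nonsingular affine map of the data.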


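Monotone incomplete data, i.i.d., unknown population:
\[
\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix},
\begin{pmatrix} X_2 \\ Y_2 \end{pmatrix}, \ldots,
\begin{pmatrix} X_n \\ Y_n \end{pmatrix},
\begin{pmatrix} * \\ Y_{n+1} \end{pmatrix},
\begin{pmatrix} * \\ Y_{n+2} \end{pmatrix}, \ldots,
\begin{pmatrix} * \\ Y_N \end{pmatrix}
\]
A generalization of Mardia's statistic:
\[
\hat\beta = \sum_{j=1}^{n} \left[ \left( \begin{pmatrix} X_j \\ Y_j \end{pmatrix} - \hat\mu \right)' \hat\Sigma^{-1} \left( \begin{pmatrix} X_j \\ Y_j \end{pmatrix} - \hat\mu \right) \right]^2
+ \sum_{j=n+1}^{N} \left[ (Y_j - \hat\mu_2)' \hat\Sigma_{22}^{-1} (Y_j - \hat\mu_2) \right]^2
\]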


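An alternative to $\hat\beta$
Impute each missing $X_j$ using linear regression:
\[
\hat X_j =
\begin{cases}
X_j, & 1 \le j \le n \\
\hat\mu_1 + \hat\Sigma_{12} \hat\Sigma_{22}^{-1} (Y_j - \hat\mu_2), & n + 1 \le j \le N
\end{cases}
\]
Construct
\[
\tilde\beta = \sum_{j=1}^{N} \left[ \left( \begin{pmatrix} \hat X_j \\ Y_j \end{pmatrix} - \hat\mu \right)' \hat\Sigma^{-1} \left( \begin{pmatrix} \hat X_j \\ Y_j \end{pmatrix} - \hat\mu \right) \right]^2
\]
A remarkable result: $\tilde\beta$ is invariant under nonsingular affine transformations of the data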
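The regression imputation step can be sketched as follows, assuming the MLEs $\hat\mu$ and $\hat\Sigma$ have already been computed and are passed in. The function name `impute_by_regression` and the small numerical example are illustrative only.

```python
import numpy as np

def impute_by_regression(X, Y, mu_hat, sigma_hat):
    """Fill in the missing X_j (rows n+1..N) by linear regression on Y_j.

    X: (n, p) observed incomplete block; Y: (N, q) complete block.
    mu_hat, sigma_hat: estimates of the (p+q)-dimensional mean and covariance.
    Returns an (N, p) array: the observed X rows followed by imputed rows.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    n, p = X.shape
    mu1, mu2 = mu_hat[:p], mu_hat[p:]
    S12, S22 = sigma_hat[:p, p:], sigma_hat[p:, p:]
    coef = np.linalg.solve(S22, S12.T).T            # Sigma_hat_12 Sigma_hat_22^{-1}
    X_missing = mu1 + (Y[n:] - mu2) @ coef.T        # fitted values for patients n+1..N
    return np.vstack([X, X_missing])

# p = q = 1: one complete patient, one patient with X missing.
X_full = impute_by_regression(
    X=[[1.0]],
    Y=[[0.0], [4.0]],
    mu_hat=[1.0, 2.0],
    sigma_hat=[[1.0, 0.5], [0.5, 1.0]],
)
```

Observed values of X are left untouched; only the rows for patients n+1, ..., N are filled in with their fitted values.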


Yamada, Romer, and D.R. (2010): Under certain regularity conditions,
\[
(\tilde\beta - c_1)/c_2 \xrightarrow{\mathcal{L}} N(0, 1)
\]
as $n, N \to \infty$
The constants $c_1$, $c_2$ depend on $n$, $N$ and the underlying population distribution
In the normal case, $c_1$, $c_2$ depend only on $n$, $N$, $p$, $q$


References
Chang and D.R. (2009). Finite-sample inference with monotone incomplete multivariate normal data, I. J. Multivariate Analysis.
Chang and D.R. (2009). Finite-sample inference with monotone incomplete multivariate normal data, II. J. Multivariate Analysis.
D.R. and Yamada (2010). The Stein phenomenon for monotone incomplete multivariate normal data. J. Multivariate Analysis.
Yamada (2009). The asymptotic expansion of the distribution of the canonical correlations with monotone incomplete multivariate normal data. Preprint, Sapporo Gakuin University.
Romer (2009). The Statistical Analysis of Monotone Incomplete Multivariate Normal Data. Doctoral Dissertation, Penn State University.
Romer and D.R. (2010). Maximum likelihood estimation of the mean of a multivariate normal population with monotone incomplete data. Preprint, Penn State University.
Yamada, Romer, and D.R. (2010). Kurtosis tests for multivariate normality with monotone incomplete data. Preprint, Penn State University.


Astrostatistics research problems
K. R. Lang, Astrophysical Formulae, Vol. II: Space, Time, Matter and Cosmology, 3rd ed., Springer, 2006
Numerous monotone incomplete data sets
Is it true that astrophysicists often discard incomplete data?
Incomplete longitudinal data (light curves, luminosity data)
Incomplete time series
Small-sample distributions of test statistics, e.g., Mardia's statistic, often are unexplored even with complete data
How to relax the MCAR assumption to MAR?


A. Izenman, "Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning," Springer, 2008
COMBO-17 Survey
Apply classical multivariate statistical procedures (principal components, MANOVA, ...) to the COMBO-17 survey
I hear that some variables in the survey "are not of scientific interest," e.g., the absence of high-redshift (i.e. distant) high-absolute-magnitude (i.e. faint) galaxies, the dropoff in flux with redshift, the dropoff in image size with redshift, ...


Carry out a statistical analysis of the variables which "are not of scientific interest"
Discover amazing results that put you in the NY Times and
Get an invitation to Stockholm
Send me 10% of the prize money (I'm a reasonable guy)