Big Data: Handle With Care!
Tom Loredo
Dept. of Astronomy, Cornell University

17 Feb 2012 -- SolarStat @ CfA


"Big data" problems are ubiquitous

[Figures: ATST, SDO; JASON 2008 report (for DoD)]


Solar/LSST points of contact

· LSST will produce high-dimensional data products:
  · Multicolor images
  · Functional data for objects (stars, galaxies, minor planets)
· Real-time/streaming analysis crucial
· Co-located processing resources for the user community
· Changing perspectives:
  · Populations of light curves
  · Feature-based data processing


UC Berkeley group (Richards et al. 2011)

· Used to compare various classifiers (random forest most accurate, but too slow!)
· Spurring growth of emerging subdisciplines: astrostatistics, astroinformatics



Asymptopia is tantalizing

Asymptopia, by Peter Sprangers

Come with me and we shall go
a place that only n has known
a kingdom distant and sublime
whose ruler is the greatest prime
a land where infinite sums can rest
and undergrads shall take no test
a place where every child you see
writes poems about the C.L.T.
where cdf's converge to one
and every day is filled with sun.
where we can jump time's famous hurdle
and watch Achilles beat the turtle
and every stat plucked from a tree
is, without proof, U.M.V.U.E.
where joy o'erflows the cornucopia
in this, the land of Asymptopia.

[Image: The Punishment of Tantalus]


Agenda

1 Big is never big enough

2 Measurement error in number-size distributions

3 Bayesian computation for streaming data




Is N ever large?

[Figures: Cyg X-1 spectrum (Bassani+ 1989); BeppoSAX spectrum of GX 349+2 (Di Salvo et al. 2001)]


Spectrum of Ultrahigh-Energy Cosmic Rays

Nagano & Watson 2000; HiRes Team 2007

[Figure: Flux × E^3 / 10^24 (eV^2 m^-2 s^-1 sr^-1) vs. log10(E) (eV) from 17 to 21, comparing HiRes-2 Monocular, HiRes-1 Monocular, and AGASA data]




N is never large (enough)
Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is 'large enough,' you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc etc). N is never enough because if it were 'enough' you'd already be on to the next problem for which you need more data.

Similarly, you never have quite enough money. But that's another story.

-- Andrew Gelman (blog entry, 31 July 2005)


Outline

1 Big is never big enough

2 Measurement error in number-size distributions

3 Bayesian computation for streaming data


We survey everything!
[Figure montage: lunar craters, solar flares, TNOs, stars & galaxies, GRBs]


Number-size distributions
aka size-frequency distributions, number counts, log N - log S, ...

[Figure montage: lunar craters, solar flares, TNOs, quasars, GRBs]


Selection effects and measurement error

· Selection effects (truncation, censoring) -- obvious (usually)
  · Typically treated by "correcting" data
  · Most sophisticated: product-limit estimators
· "Scatter" effects (measurement error, etc.) -- insidious
  · Typically ignored (average out?)


Measurement error for line & curve fitting
QSO hardness vs. luminosity (Kelly 2007, 2011)

"Regression with easurement error" in statistics refers to the case of errors in both x and y


Accounting For Measurement Error
Introduce latent/hidden/incidental parameters.

Suppose f(x|θ) is a distribution for an observable x. From N precisely measured samples {x_i}, we can infer θ from

$$L(\theta) \equiv p(\{x_i\}|\theta) = \prod_i f(x_i|\theta)$$

Bayesian graphical model

· Nodes/vertices = uncertain quantities
· Edges specify conditional dependence
· Absence of an edge denotes conditional independence

[Figure: DAG with θ at the top and children x_1, x_2, ..., x_N]

$$p(\theta, \{x_i\}) = p(\theta) \prod_i f(x_i|\theta)$$


But what if the x data are noisy, D_i = {x_i + ε_i}?

We should somehow incorporate ℓ_i(x_i) = p(D_i|x_i):

$$L(\theta, \{x_i\}) \equiv p(\{D_i\}|\theta, \{x_i\}) = \prod_i f(x_i|\theta)\,\ell_i(x_i)$$

Marginalize over {x_i} to summarize inferences for θ. Marginalize over θ to summarize inferences for {x_i}.

Key point: Maximizing over x_i and integrating over x_i can give very different results!
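A quick numerical illustration of that key point (my own sketch, not from the talk): take a Gaussian population of unknown width τ with known Gaussian measurement noise. Maximizing over the latent x_i drives the inferred τ to zero (the profile likelihood is unbounded there), while integrating them out gives a well-behaved marginal likelihood. All parameter values below are made up.

```python
# Sketch: profile vs. marginal likelihood for a population width tau.
# Population: x_i ~ N(mu, tau); data: D_i = x_i + N(0, sigma) noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, tau_true, sigma = 0.0, 1.0, 1.0
x = rng.normal(mu, tau_true, size=50)      # latent true values
D = x + rng.normal(0.0, sigma, size=50)    # noisy measurements

def profile_loglike(tau):
    # Maximize over each latent x_i (analytic in the Gaussian case):
    xi = (D / sigma**2 + mu / tau**2) / (1 / sigma**2 + 1 / tau**2)
    return (stats.norm.logpdf(xi, mu, tau)
            + stats.norm.logpdf(D, xi, sigma)).sum()

def marginal_loglike(tau):
    # Integrate the latent x_i out (also analytic for Gaussians):
    return stats.norm.logpdf(D, mu, np.sqrt(tau**2 + sigma**2)).sum()

for tau in [1.0, 0.3, 0.1, 0.01]:
    print(f"tau={tau}: profile={profile_loglike(tau):9.1f}, "
          f"marginal={marginal_loglike(tau):9.1f}")
# The profile log-likelihood grows without bound as tau -> 0 (it "likes"
# x_i pinned at mu exactly), while the marginal peaks near the true tau.
```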


Graphical representation

[Figure: DAG with θ at the top, latent x_1, x_2, ..., x_N in the middle, and data D_1, D_2, ..., D_N at the bottom]

$$p(\theta, \{x_i\}, \{D_i\}) = p(\theta) \prod_i f(x_i|\theta)\, p(D_i|x_i) = p(\theta) \prod_i f(x_i|\theta)\,\ell_i(x_i)$$

A two-level multilevel model (MLM), or hierarchical model.
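For concreteness, a minimal sketch (mine, not the talk's) of this joint density written as a log density; the Gaussian choices at all three levels and the hyperparameters are illustrative assumptions.

```python
# Two-level MLM joint density p(theta, {x_i}, {D_i}) as a log density.
import numpy as np
from scipy import stats

def log_joint(theta, x, D, sigma):
    """theta: population location; x: latent true values; D: noisy data."""
    log_prior = stats.norm.logpdf(theta, 0.0, 10.0)     # p(theta), assumed
    log_pop = stats.norm.logpdf(x, theta, 1.0).sum()    # f(x_i|theta), assumed
    log_meas = stats.norm.logpdf(D, x, sigma).sum()     # ell_i(x_i), assumed
    return log_prior + log_pop + log_meas

# Example evaluation with made-up numbers:
print(log_joint(0.5, np.array([0.4, 1.1]), np.array([0.2, 1.5]), 0.3))
```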


Example -- Distribution of Source Fluxes

Measure m = -2.5 log(flux) from sources following a "rolling power law" distribution (inspired by trans-Neptunian objects):

$$f(m) \propto 10^{\,\alpha(m-23) + \beta(m-23)^2}$$

[Figure: f(m) vs. m, with the roll-over near m = 23]

· Simulate 100 surveys of populations drawn from the same dist'n
· Simulate data for a photon-counting instrument with a fixed count threshold
· Measurements have uncertainties from 1% (bright) to 30% (dim)
· Analyze the simulated data with maximum ("profile") likelihood and Bayes
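A sketch of how such a toy survey might be simulated (my own reconstruction; the slope, curvature, magnitude range, and threshold values below are guesses, not the talk's):

```python
# Draw magnitudes from f(m) ∝ 10**(alpha*(m-23) + beta*(m-23)**2) via
# inverse-CDF sampling on a grid, add magnitude-dependent noise, and
# apply a hard cut standing in for the photon-count threshold.
import numpy as np

rng = np.random.default_rng(42)
alpha, beta = 0.7, -0.05                    # hypothetical parameters

grid = np.linspace(18.0, 26.0, 4001)
pdf = 10.0 ** (alpha * (grid - 23) + beta * (grid - 23) ** 2)
cdf = np.cumsum(pdf)
cdf /= cdf[-1]

m_true = np.interp(rng.random(5000), cdf, grid)   # inverse-CDF draws
# ~1% flux error (bright) to ~30% (dim), i.e. roughly 0.01-0.3 mag:
sig = 0.01 + 0.29 * (m_true - grid[0]) / (grid[-1] - grid[0])
m_obs = m_true + rng.normal(0.0, sig)             # noisy "measurements"
detected = m_obs < 24.5                           # toy detection cut
print(f"{detected.sum()} of {m_true.size} sources detected")
```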


[Figure: parameter estimates from Bayes (circles) and maximum likelihood (crosses) for the simulated surveys]

Uncertainties don't average out!

Applications: GRBs (TL & Wasserman 1993+); TNOs (Gladman+ 1998+, Petit+ 2006+)


Smaller measurement error only postpones the inevitable: a similar toy survey, with parameters to mimic SDSS QSO surveys (few % errors at the dim end):

[Figure: parameter estimates for the QSO-like survey]


What's going on?

· Implicitly or explicitly, each datum brings with it a new parameter
· The middle-level distribution f(x_i|θ) acts as a prior on the lower-level latent parameters x_i (observables)
· Separate "copies" of the prior -- data never overwhelm the effect of the prior
· Marginalization accounts for volume in {x_i} parameter space
· "Mustering and borrowing of strength" (Tukey):
  · Pool information from individuals about the upper (pop'n) level
  · Pooled information feeds back; each x_i estimate is affected by all the other {x_i} through what they say about θ
  · Shrinkage: the resulting x_i estimates are biased but better (lower MSE), as the sketch below illustrates
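Shrinkage is easy to see in the Gaussian-Gaussian case. A tiny illustration (mine, with made-up numbers), assuming for clarity that the population level is known:

```python
# Posterior-mean estimates of the latent x_i pull noisy measurements D_i
# toward the population center theta; noisier data are shrunk more.
import numpy as np

theta, tau = 0.0, 1.0                 # population N(theta, tau), assumed known
D = np.array([2.5, -1.8, 0.4])        # made-up noisy measurements
sigma = np.array([0.3, 1.0, 2.0])     # per-object noise levels

w = (1 / sigma**2) / (1 / sigma**2 + 1 / tau**2)
x_hat = w * D + (1 - w) * theta       # shrunken (biased, lower-MSE) estimates
print(np.round(x_hat, 3))             # the noisiest datum is shrunk the most
```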


Outline

1 Big is never big enough

2 Measurement error in number-size distributions

3 Bayesian computation for streaming data


Posterior sampling
Monte Carlo integration w.r.t. the posterior distribution:

$$\int d\theta\, g(\theta)\, p(\theta|D) \approx \frac{1}{n} \sum_{\theta_i \sim p(\theta|D)} g(\theta_i) + O(n^{-1/2})$$

When p(θ) is a posterior distribution, drawing samples from it is called posterior sampling:

· One set of samples can be used for many different calculations (so long as they don't depend on low-probability events)
· This is the most promising and general approach for Bayesian computation in high dimensions -- though with a twist (MCMC!)

Challenge: How to build a RNG that samples from a posterior?
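To make the identity above concrete, a toy sketch (mine) in which the "posterior" is a unit normal, so the answer is known exactly:

```python
# Monte Carlo estimate of a posterior expectation from posterior samples.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=100_000)   # stand-in posterior draws

g = lambda th: th**2                           # any integrand of interest
print(g(samples).mean())                       # ~1.0 = E[theta^2], O(n^-1/2) error
```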


Markov chain Monte Carlo (MCMC)

[Figure: contours of π(θ)L(θ) in the (θ_1, θ_2) plane, showing an initial point and the resulting Markov chain]

$$p(\theta|D) = \frac{q(\theta)}{Z}, \qquad q(\theta) \equiv p(\theta)\, p(D|\theta)$$

Metropolis-Hastings algorithms (a minimal sketch follows):

· Propose a candidate θ' from a proposal dist'n
· Accept the new point or repeat the old one depending on q(θ')/q(θ) (and the proposal ratio)
· Requires evaluating q(θ) for every candidate
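A minimal random-walk Metropolis sketch (generic textbook form, not the talk's code); note that only the unnormalized q(θ) is ever evaluated:

```python
# Random-walk Metropolis targeting an unnormalized density q(theta).
# The toy target (a normal centered at 3) is my own assumption.
import numpy as np

rng = np.random.default_rng(1)

def log_q(theta):                      # log prior + log likelihood, up to Z
    return -0.5 * (theta - 3.0) ** 2

theta, chain = 0.0, []
for _ in range(20_000):
    prop = theta + rng.normal(0.0, 1.0)            # symmetric proposal
    if np.log(rng.random()) < log_q(prop) - log_q(theta):
        theta = prop                               # accept candidate...
    chain.append(theta)                            # ...else repeat old value

print(np.mean(chain[5_000:]))                      # ~3.0 after burn-in
```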


Importance sampling

$$\int d\theta\, g(\theta)\, q(\theta) = \int d\theta\, \frac{g(\theta)\, q(\theta)}{P(\theta)}\, P(\theta) \approx \frac{1}{n} \sum_{\theta_i \sim P(\theta)} g(\theta_i)\, \frac{q(\theta_i)}{P(\theta_i)}$$

Choose P to make the variance small. (Not easy!)

[Figure: sketch of the integrand g(x), target q(x), and proposal P(x)]
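A self-normalized importance-sampling sketch (mine; the toy target and proposal densities are assumptions):

```python
# Estimate E_q[g] from draws of a wide proposal P, weighting by q/P.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
q = lambda th: stats.norm.pdf(th, 0.0, 1.0)    # target density (toy)
P = stats.norm(0.0, 3.0)                        # broad proposal

th = P.rvs(size=100_000, random_state=rng)
w = q(th) / P.pdf(th)                           # importance weights
print(np.sum(w * th**2) / np.sum(w))            # ~1.0 = E_q[theta^2]
```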


Sequential Monte Carlo for streaming data
Balakrishnan and Madigan (2006)

· Get posterior samples {θ_i} from an initial chunk D_<
· Revise the samples by bringing in the remaining data D_> in small chunks d:
  · Weight by the new-data likelihood p(d|θ_i)
  · Resample via the weights (like the bootstrap) -- leads to degeneracy
  · Jitter via MCMC, using a modified KDE surrogate for the current posterior

Applied to Bayesian logistic regression (7 predictors) and 4×4 Markov transition regression (5-20 samples), with N ~ 10^6


[Flow diagram: "Posterior sampling using D_<" → "Likelihood using d" → "Weight using d" → "Resample" → "Shrink to build surrogate" → "Jitter using KDE surrogate" → updated posterior; repeat for each new chunk]
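One weight/resample/jitter cycle might look like the rough sketch below. It is in the spirit of Balakrishnan & Madigan (2006), but the plain Gaussian jitter is a simplification of their MCMC move against a shrunken-KDE surrogate, and all toy values are mine.

```python
# One sequential Monte Carlo update for a new data chunk d.
import numpy as np

rng = np.random.default_rng(3)

def smc_update(particles, loglike_chunk, jitter_scale=0.1):
    """particles: draws from the current posterior;
    loglike_chunk(theta_array) -> log p(d|theta) per particle."""
    logw = loglike_chunk(particles)                 # 1. weight by new chunk
    w = np.exp(logw - logw.max())
    w /= w.sum()
    idx = rng.choice(particles.size, size=particles.size, p=w)
    resampled = particles[idx]                      # 2. resample (-> degeneracy)
    return resampled + rng.normal(0.0, jitter_scale, particles.size)  # 3. jitter

# Toy usage: start from N(0,1) "posterior" draws, absorb a Gaussian chunk.
particles = rng.normal(0.0, 1.0, size=2000)
particles = smc_update(particles, lambda th: -0.5 * (th - 0.2) ** 2)
print(particles.mean())
```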


Main Points

· Solar, stellar & extragalactic astronomy share "big data" problems; cross-fertilization opportunities?
· Statistics may not become simpler with "large N"
· The relative importance of measurement error grows with sample size; it doesn't "average out"
· MLMs handle such complications by marginalizing over latent parameters
· Streaming Bayesian computation via sequential Monte Carlo -- a research front

Note: 2012/2013 SAMSI Program on Statistical and Computational Methodology for Massive Datasets (www.samsi.info)