Big Data: Handle With Care!
Tom Loredo
Dept. of Astronomy, Cornell University

17 Feb 2012 -- SolarStat @ CfA


"Big data" problems are ubiquitous

[Figures: ATST, SDO; JASON 2008 report (for DoD)]


Solar/LSST points of contact

· LSST will produce high-dimensional data products:
  · Multicolor images
  · Functional data for objects (stars, galaxies, minor planets)
· Real-time/streaming analysis crucial
· Co-located processing resources for the user community
· Changing perspectives:
  · Populations of light curves
  · Feature-based data processing


UC Berkeley group (Richards et al. 2011)

· Used to compare various classifiers (random forest most accurate, but too slow!)
· Spurring growth of emerging subdisciplines: astrostatistics, astroinformatics



Asymptopia is tantalizing

Asymptopia, by Peter Sprangers

Come with me and we shall go
a place that only n has known
a kingdom distant and sublime
whose ruler is the greatest prime
a land where infinite sums can rest
and undergrads shall take no test
a place where every child you see
writes poems about the C.L.T.
where cdf's converge to one
and every day is filled with sun.
where we can jump time's famous hurdle
and watch Achilles beat the turtle
and every stat plucked from a tree
is, without proof, U.M.V.U.E.
where joy o'erflows the cornucopia
in this, the land of Asymptopia.

[Image: The Punishment of Tantalus]


Agenda

1 Big is never big enough

2 Measurement error in number-size distributions

3 Bayesian computation for streaming data




Is N ever large?

[Figures: Cyg X-1 spectrum (Bassani+ 1989); BeppoSAX spectrum of GX 349+2 (Di Salvo et al. 2001)]


Spectrum of Ultrahigh-Energy Cosmic Rays

Nagano & Watson 2000; HiRes Team 2007

[Figure: Flux × E^3 / 10^24 (eV^2 m^-2 s^-1 sr^-1) vs. log10(E) (eV) from 17 to 21, comparing HiRes-2 Monocular, HiRes-1 Monocular, and AGASA data]




N is never large (enough)
Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is 'large enough,' you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc etc). N is never enough because if it were 'enough' you'd already be on to the next problem for which you need more data.

Similarly, you never have quite enough money. But that's another story.

-- Andrew Gelman (blog entry, 31 July 2005)


Outline

1 Big is never big enough

2 Measurement error in number-size distributions

3 Bayesian computation for streaming data


We survey everything!
[Figure montage: lunar craters, solar flares, TNOs, stars & galaxies, GRBs]


Number-size distributions
aka size-frequency distributions, number counts, log N - log S, ...

[Figure montage: lunar craters, solar flares, TNOs, quasars, GRBs]


Selection effects and measurement error

· Selection effects (truncation, censoring) -- obvious (usually)
  · Typically treated by "correcting" data
  · Most sophisticated: product-limit estimators
· "Scatter" effects (measurement error, etc.) -- insidious
  · Typically ignored (average out?)


Measurement error for line & curve fitting
QSO hardness vs. luminosity (Kelly 2007, 2011)

"Regression with easurement error" in statistics refers to the case of errors in both x and y


Accounting For Measurement Error
Introduce latent/hidden/incidental parameters.

Suppose f(x|θ) is a distribution for an observable x. From N precisely measured samples {x_i}, we can infer θ from

$$L(\theta) \equiv p(\{x_i\}|\theta) = \prod_i f(x_i|\theta)$$

Bayesian graphical model

· Nodes/vertices = uncertain quantities
· Edges specify conditional dependence
· Absence of an edge denotes conditional independence

[Figure: DAG with θ at the top and children x_1, x_2, ..., x_N]

$$p(\theta, \{x_i\}) = p(\theta) \prod_i f(x_i|\theta)$$


But what if the x data are noisy, D_i = {x_i + ε_i}?

We should somehow incorporate ℓ_i(x_i) = p(D_i|x_i):

$$L(\theta, \{x_i\}) \equiv p(\{D_i\}|\theta, \{x_i\}) = \prod_i f(x_i|\theta)\,\ell_i(x_i)$$

Marginalize over {x_i} to summarize inferences for θ. Marginalize over θ to summarize inferences for {x_i}.

Key point: Maximizing over x_i and integrating over x_i can give very different results!
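A quick numerical illustration of that key point (my own sketch, not from the talk): take a Gaussian population of unknown width τ with known Gaussian measurement noise. Maximizing over the latent x_i drives the inferred τ to zero (the profile likelihood is unbounded there), while integrating them out gives a well-behaved marginal likelihood. All parameter values below are made up.

```python
# Sketch: profile vs. marginal likelihood for a population width tau.
# Population: x_i ~ N(mu, tau); data: D_i = x_i + N(0, sigma) noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, tau_true, sigma = 0.0, 1.0, 1.0
x = rng.normal(mu, tau_true, size=50)      # latent true values
D = x + rng.normal(0.0, sigma, size=50)    # noisy measurements

def profile_loglike(tau):
    # Maximize over each latent x_i (analytic in the Gaussian case):
    xi = (D / sigma**2 + mu / tau**2) / (1 / sigma**2 + 1 / tau**2)
    return (stats.norm.logpdf(xi, mu, tau)
            + stats.norm.logpdf(D, xi, sigma)).sum()

def marginal_loglike(tau):
    # Integrate the latent x_i out (also analytic for Gaussians):
    return stats.norm.logpdf(D, mu, np.sqrt(tau**2 + sigma**2)).sum()

for tau in [1.0, 0.3, 0.1, 0.01]:
    print(f"tau={tau}: profile={profile_loglike(tau):9.1f}, "
          f"marginal={marginal_loglike(tau):9.1f}")
# The profile log-likelihood grows without bound as tau -> 0 (it "likes"
# x_i pinned at mu exactly), while the marginal peaks near the true tau.
```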


Graphical representation

[Figure: DAG with θ at the top, latent x_1, x_2, ..., x_N in the middle, and data D_1, D_2, ..., D_N at the bottom]

$$p(\theta, \{x_i\}, \{D_i\}) = p(\theta) \prod_i f(x_i|\theta)\, p(D_i|x_i) = p(\theta) \prod_i f(x_i|\theta)\,\ell_i(x_i)$$

A two-level multilevel model (MLM), or hierarchical model.
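For concreteness, a minimal sketch (mine, not the talk's) of this joint density written as a log density; the Gaussian choices at all three levels and the hyperparameters are illustrative assumptions.

```python
# Two-level MLM joint density p(theta, {x_i}, {D_i}) as a log density.
import numpy as np
from scipy import stats

def log_joint(theta, x, D, sigma):
    """theta: population location; x: latent true values; D: noisy data."""
    log_prior = stats.norm.logpdf(theta, 0.0, 10.0)     # p(theta), assumed
    log_pop = stats.norm.logpdf(x, theta, 1.0).sum()    # f(x_i|theta), assumed
    log_meas = stats.norm.logpdf(D, x, sigma).sum()     # ell_i(x_i), assumed
    return log_prior + log_pop + log_meas

# Example evaluation with made-up numbers:
print(log_joint(0.5, np.array([0.4, 1.1]), np.array([0.2, 1.5]), 0.3))
```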


Example -- Distribution of Source Fluxes

Measure m = -2.5 log(flux) from sources following a "rolling power law" distribution (inspired by trans-Neptunian objects):

$$f(m) \propto 10^{\,\alpha(m-23) + \beta(m-23)^2}$$

[Figure: f(m) vs. m, with the roll-over near m = 23]

· Simulate 100 surveys of populations drawn from the same dist'n
· Simulate data for a photon-counting instrument with a fixed count threshold
· Measurements have uncertainties from 1% (bright) to 30% (dim)
· Analyze the simulated data with maximum ("profile") likelihood and Bayes
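A sketch of how such a toy survey might be simulated (my own reconstruction; the slope, curvature, magnitude range, and threshold values below are guesses, not the talk's):

```python
# Draw magnitudes from f(m) ∝ 10**(alpha*(m-23) + beta*(m-23)**2) via
# inverse-CDF sampling on a grid, add magnitude-dependent noise, and
# apply a hard cut standing in for the photon-count threshold.
import numpy as np

rng = np.random.default_rng(42)
alpha, beta = 0.7, -0.05                    # hypothetical parameters

grid = np.linspace(18.0, 26.0, 4001)
pdf = 10.0 ** (alpha * (grid - 23) + beta * (grid - 23) ** 2)
cdf = np.cumsum(pdf)
cdf /= cdf[-1]

m_true = np.interp(rng.random(5000), cdf, grid)   # inverse-CDF draws
# ~1% flux error (bright) to ~30% (dim), i.e. roughly 0.01-0.3 mag:
sig = 0.01 + 0.29 * (m_true - grid[0]) / (grid[-1] - grid[0])
m_obs = m_true + rng.normal(0.0, sig)             # noisy "measurements"
detected = m_obs < 24.5                           # toy detection cut
print(f"{detected.sum()} of {m_true.size} sources detected")
```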


[Figure: parameter estimates from Bayes (circles) and maximum likelihood (crosses) for the simulated surveys]

Uncertainties don't average out!

Applications: GRBs (TL & Wasserman 1993+); TNOs (Gladman+ 1998+, Petit+ 2006+)


Smaller measurement error only postpones the inevitable: a similar toy survey, with parameters to mimic SDSS QSO surveys (few % errors at the dim end):

[Figure: parameter estimates for the QSO-like survey]


What's going on?

· Implicitly or explicitly, each datum brings with it a new parameter
· The middle-level distribution f(x_i|θ) acts as a prior on the lower-level latent parameters x_i (observables)
· Separate "copies" of the prior -- data never overwhelm the effect of the prior
· Marginalization accounts for volume in {x_i} parameter space
· "Mustering and borrowing of strength" (Tukey):
  · Pool information from individuals about the upper (pop'n) level
  · Pooled information feeds back; each x_i estimate is affected by all the other {x_i} through what they say about θ
  · Shrinkage: the resulting x_i estimates are biased but better (lower MSE), as the sketch below illustrates
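Shrinkage is easy to see in the Gaussian-Gaussian case. A tiny illustration (mine, with made-up numbers), assuming for clarity that the population level is known:

```python
# Posterior-mean estimates of the latent x_i pull noisy measurements D_i
# toward the population center theta; noisier data are shrunk more.
import numpy as np

theta, tau = 0.0, 1.0                 # population N(theta, tau), assumed known
D = np.array([2.5, -1.8, 0.4])        # made-up noisy measurements
sigma = np.array([0.3, 1.0, 2.0])     # per-object noise levels

w = (1 / sigma**2) / (1 / sigma**2 + 1 / tau**2)
x_hat = w * D + (1 - w) * theta       # shrunken (biased, lower-MSE) estimates
print(np.round(x_hat, 3))             # the noisiest datum is shrunk the most
```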


Outline

1 Big is never big enough

2 Measurement error in number-size distributions

3 Bayesian computation for streaming data


Posterior sampling
Monte Carlo integration w.r.t. the posterior distribution:

$$\int d\theta\, g(\theta)\, p(\theta|D) \approx \frac{1}{n} \sum_{\theta_i \sim p(\theta|D)} g(\theta_i) + O(n^{-1/2})$$

When p(θ) is a posterior distribution, drawing samples from it is called posterior sampling:

· One set of samples can be used for many different calculations (so long as they don't depend on low-probability events)
· This is the most promising and general approach for Bayesian computation in high dimensions -- though with a twist (MCMC!)

Challenge: How to build a RNG that samples from a posterior?
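To make the identity above concrete, a toy sketch (mine) in which the "posterior" is a unit normal, so the answer is known exactly:

```python
# Monte Carlo estimate of a posterior expectation from posterior samples.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=100_000)   # stand-in posterior draws

g = lambda th: th**2                           # any integrand of interest
print(g(samples).mean())                       # ~1.0 = E[theta^2], O(n^-1/2) error
```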


Markov chain Monte Carlo (MCMC)

[Figure: contours of π(θ)L(θ) in the (θ_1, θ_2) plane, showing an initial point and the resulting Markov chain]

$$p(\theta|D) = \frac{q(\theta)}{Z}, \qquad q(\theta) \equiv p(\theta)\, p(D|\theta)$$

Metropolis-Hastings algorithms (a minimal sketch follows):

· Propose a candidate θ' from a proposal dist'n
· Accept the new point or repeat the old one depending on q(θ')/q(θ) (and the proposal ratio)
· Requires evaluating q(θ) for every candidate
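A minimal random-walk Metropolis sketch (generic textbook form, not the talk's code); note that only the unnormalized q(θ) is ever evaluated:

```python
# Random-walk Metropolis targeting an unnormalized density q(theta).
# The toy target (a normal centered at 3) is my own assumption.
import numpy as np

rng = np.random.default_rng(1)

def log_q(theta):                      # log prior + log likelihood, up to Z
    return -0.5 * (theta - 3.0) ** 2

theta, chain = 0.0, []
for _ in range(20_000):
    prop = theta + rng.normal(0.0, 1.0)            # symmetric proposal
    if np.log(rng.random()) < log_q(prop) - log_q(theta):
        theta = prop                               # accept candidate...
    chain.append(theta)                            # ...else repeat old value

print(np.mean(chain[5_000:]))                      # ~3.0 after burn-in
```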


Importance sampling

$$\int d\theta\, g(\theta)\, q(\theta) = \int d\theta\, \frac{g(\theta)\, q(\theta)}{P(\theta)}\, P(\theta) \approx \frac{1}{n} \sum_{\theta_i \sim P(\theta)} g(\theta_i)\, \frac{q(\theta_i)}{P(\theta_i)}$$

Choose P to make the variance small. (Not easy!)

[Figure: sketch of the integrand g(x), target q(x), and proposal P(x)]
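A self-normalized importance-sampling sketch (mine; the toy target and proposal densities are assumptions):

```python
# Estimate E_q[g] from draws of a wide proposal P, weighting by q/P.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
q = lambda th: stats.norm.pdf(th, 0.0, 1.0)    # target density (toy)
P = stats.norm(0.0, 3.0)                        # broad proposal

th = P.rvs(size=100_000, random_state=rng)
w = q(th) / P.pdf(th)                           # importance weights
print(np.sum(w * th**2) / np.sum(w))            # ~1.0 = E_q[theta^2]
```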


Sequential Monte Carlo for streaming data
Balakrishnan and Madigan (2006)

· Get posterior samples {θ_i} from an initial chunk D_<
· Revise the samples by bringing in the remaining data D_> in small chunks d:
  · Weight by the new-data likelihood p(d|θ_i)
  · Resample via the weights (like the bootstrap) -- leads to degeneracy
  · Jitter via MCMC, using a modified KDE surrogate for the current posterior

Applied to Bayesian logistic regression (7 predictors) and 4×4 Markov transition regression (5-20 samples), with N ~ 10^6


[Flow diagram: "Posterior sampling using D_<" → "Likelihood using d" → "Weight using d" → "Resample" → "Shrink to build surrogate" → "Jitter using KDE surrogate" → updated posterior; repeat for each new chunk]
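One weight/resample/jitter cycle might look like the rough sketch below. It is in the spirit of Balakrishnan & Madigan (2006), but the plain Gaussian jitter is a simplification of their MCMC move against a shrunken-KDE surrogate, and all toy values are mine.

```python
# One sequential Monte Carlo update for a new data chunk d.
import numpy as np

rng = np.random.default_rng(3)

def smc_update(particles, loglike_chunk, jitter_scale=0.1):
    """particles: draws from the current posterior;
    loglike_chunk(theta_array) -> log p(d|theta) per particle."""
    logw = loglike_chunk(particles)                 # 1. weight by new chunk
    w = np.exp(logw - logw.max())
    w /= w.sum()
    idx = rng.choice(particles.size, size=particles.size, p=w)
    resampled = particles[idx]                      # 2. resample (-> degeneracy)
    return resampled + rng.normal(0.0, jitter_scale, particles.size)  # 3. jitter

# Toy usage: start from N(0,1) "posterior" draws, absorb a Gaussian chunk.
particles = rng.normal(0.0, 1.0, size=2000)
particles = smc_update(particles, lambda th: -0.5 * (th - 0.2) ** 2)
print(particles.mean())
```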


Main Points

· Solar, stellar & extragalactic astronomy share "big data" problems; cross-fertilization opportunities?
· Statistics may not become simpler with "large N"
· The relative importance of measurement error grows with sample size; it doesn't "average out"
· MLMs handle such complications by marginalizing over latent parameters
· Streaming Bayesian computation via sequential Monte Carlo -- a research front

Note: 2012/2013 SAMSI Program on Statistical and Computational Methodology for Massive Datasets (www.samsi.info)