Statistics for Astronomy V: Bayesian Statistics (+A bit more on ML)
B. Nikolic b.nikolic@mrao.cam.ac.uk
Astrophysics Group, Cavendish Laboratory, University of Cambridge http://www.mrao.cam.ac.uk/~bn204/lecture/astrostats.html

5 November 2008


Goals for this Lecture

ML Example: Monte-Carlo Bias; Covariance of parameters

Introduction to Bayesian Inference

Features of Bayesian Inference


Outline

ML Example: Monte-Carlo Bias; Covariance of parameters

Introduction to Bayesian Inference

Features of Bayesian Inference


Monte Carlo simulation
Investigate the bias of the maximum-likelihood analysis of a Gaussian-shaped source in a noisy image, by sub-sampling Monte-Carlo data sets from the sets that were analysed:
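A minimal sketch of how such a Monte-Carlo bias test could be set up; the grid size, noise level, number of trials and the use of scipy's curve_fit are assumptions for illustration, not the exact set-up behind the results shown below.

# Illustrative Monte-Carlo bias test for ML fitting of a Gaussian source.
import numpy as np
from scipy.optimize import curve_fit

def gauss2d(xy, A, x0, y0, sx, sy):
    x, y = xy
    return A * np.exp(-0.5 * ((x - x0) / sx) ** 2 - 0.5 * ((y - y0) / sy) ** 2)

rng = np.random.default_rng(0)
x, y = np.meshgrid(np.arange(32.0), np.arange(32.0))
true = (1.0, 16.0, 16.0, 3.0, 3.0)              # A, x0, y0, sigma_x, sigma_y
noise_sigma = 0.5                               # S/N of 2:1 per pixel at the peak

amps = []
for _ in range(2000):                           # Monte-Carlo realisations
    img = gauss2d((x, y), *true) + rng.normal(0.0, noise_sigma, x.shape)
    p, _ = curve_fit(gauss2d, (x.ravel(), y.ravel()), img.ravel(), p0=true)
    amps.append(p[0])

print("mean fitted amplitude:", np.mean(amps))
print("bias:", np.mean(amps) - true[0])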


Monte Carlo result

[Figure: histogram of the fitted amplitude, N versus Amplitude, range 0.8 to 1.3.]

Solid line: S/N of 2:1 per pixel. Dashed line: 2× larger pixels and 2× smaller noise.

Both are biased: the solid line has a mean bias of about 0.3%.


Monte Carlo result
[Figure: cumulative distribution C versus Amplitude; left panel over 0.8 to 1.2, right panel zoomed to 0.99 to 1.015.]

Solid line: S/N of 2:1 per pixel. Dashed line: 2× larger pixels and 2× smaller noise.

Both are biased: the solid line has a mean bias of about 0.3%.


Error Matrix for ML Estimate
Recall that the ML error analysis showed a high covariance between A and σx, σy. The parameter vector is

θ = {A, x0, y0, σx, σy, …}    (1)

The error matrix is built from the curvature of the log-likelihood h = log P(x|θ) at its maximum,

D_ij = − ∂²h / ∂θi ∂θj    (2)

D is nearly diagonal, but the elements coupling A to σx and to σy are non-zero; these off-diagonal terms are the origin of the covariance.
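A sketch of how this covariance structure could be checked numerically: build the Fisher (curvature) matrix from finite-difference derivatives of the model and invert it to get the parameter covariance. The grid, noise level and true parameter values below are assumptions for illustration only.

import numpy as np

def model(p, x, y):
    A, x0, y0, sx, sy = p
    return A * np.exp(-0.5 * ((x - x0) / sx) ** 2 - 0.5 * ((y - y0) / sy) ** 2)

x, y = np.meshgrid(np.arange(32.0), np.arange(32.0))
p0 = np.array([1.0, 16.0, 16.0, 3.0, 3.0])      # A, x0, y0, sigma_x, sigma_y
noise_sigma = 0.5

# Fisher matrix F_ij = sum over pixels of (dm/dp_i)(dm/dp_j) / noise_sigma^2,
# with the model derivatives taken by central finite differences.
grads = []
for i in range(len(p0)):
    dp = np.zeros_like(p0)
    dp[i] = 1e-6 * max(abs(p0[i]), 1.0)
    grads.append((model(p0 + dp, x, y) - model(p0 - dp, x, y)) / (2 * dp[i]))

F = np.array([[np.sum(gi * gj) for gj in grads] for gi in grads]) / noise_sigma ** 2
cov = np.linalg.inv(F)                          # covariance of the ML parameter estimates
print(cov[0, 3], cov[0, 4])                     # A couples to sigma_x and sigma_y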


Illustration: Covariance in ML parameters
10⁵ Monte-Carlo simulations:

[Figure: two-dimensional histogram of the fitted parameters, Amplitude (horizontal) versus Width (vertical).]

The colour scale indicates the number of parameter results falling into each cell. The horizontal axis is the inferred peak value; the vertical axis is the inferred width of the Gaussian (in a slightly different parametrisation, unfortunately!).


Outline

ML Example: Monte-Carlo Bias; Covariance of parameters

Introduction to Bayesian Inference

Features of Bayesian Inference


Bayes Theorem

Bayes theorem

P(A|B) = P(B|A) P(A) / P(B)    (3)

Allows inversion of probability:
If you know P(B|A), the probability of B given A,
then you can calculate P(A|B), the probability of A given B.


Bayes Theorem – simple example

A: the roll of two dice adds to 10
B: the first die rolled a 5
P(B) = 1/6,  P(A) = 1/12,  P(A|B) = 1/6

What is P(B|A), i.e., the probability that the first die rolled a 5 given that the sum is 10?

P(B|A) = P(A|B) P(B) / P(A) = 1/3    (4)
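As a brute-force check of (4), one can simply enumerate all 36 outcomes of two dice (a minimal illustration):

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 rolls of two dice
A = [o for o in outcomes if sum(o) == 10]          # event A: the sum is 10
B_and_A = [o for o in A if o[0] == 5]              # ... and the first die shows 5
print(len(B_and_A) / len(A))                       # P(B|A) = 1/3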


Bayesian inference
The basic premise is that both the experiments and the theories we are investigating can be described by a single coherent system of probabilities; Bayes' theorem connects the two aspects.

P(x|θ): what we know about the experiment
P(θ): prior knowledge about the model
P(x): the "evidence" – did we observe the data we expected?
P(θ|x): what we know about the model after the experiment

P(θ|x) = P(x|θ) P(θ) / P(x)    (5)


Bayesian inference: classic example
How likely is it that somebody committed a crime, given that DNA was found at the scene?

A: the suspect is guilty
B: the suspect is innocent, so P(A) + P(B) = 1
C: the suspect's DNA was found at the scene

P(A|C): this is what we want: the probability that the suspect is guilty given that the DNA was found at the scene
P(C|B): the probability of an error in DNA testing, "very unlikely", e.g. 10⁻⁶
P(C|A): the probability that the DNA matches the person who left it, close to 1


Bayes Inf example: applying the theorem
P(A|C) = P(C|A) P(A) / P(C)    (6)

We need two quantities that we do not already have:

P(A): the a-priori probability that the suspect is guilty. Without other relevant information, assign about 10⁻⁷.

P(C): the probability that the suspect's DNA is found at the scene. We know either A or B is true, so

P(C) = P(C|A) P(A) + P(C|B) P(B)    (7)
     ≈ 10⁻⁷ + 10⁻⁶ ≈ 10⁻⁶    (8)

Therefore we find P(A|C) ≈ 10⁻¹, i.e., only 10%.
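A minimal numeric check of (6)-(8), using the same figures as above:

p_A = 1e-7                        # prior probability of guilt
p_B = 1.0 - p_A                   # prior probability of innocence
p_C_given_A = 1.0                 # DNA matches if the suspect is guilty
p_C_given_B = 1e-6                # probability of a false DNA match
p_C = p_C_given_A * p_A + p_C_given_B * p_B
print(p_C_given_A * p_A / p_C)    # P(A|C): about 0.09, i.e. roughly 10%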


Bayes Inf example: what happened?

As you'd expect, in this case the result is dominated by the two small probabilities:

P(A|C) ≈ P(A) / P(C|B)

P(A): the probability of guilt without any evidence – what we knew before    (9)
P(C|B): the probability of an error in testing – a mistake in our clever experiment    (10)


Outline

ML Example: Monte-Carlo Bias; Covariance of parameters

Introduction to Bayesian Inference

Features of Bayesian Inference


Features of Bayesian Inference



Incorporates the likelihood calculation
Incorporates information about all the ways we can get the observed data ("evidence") – what did this experiment add to our knowledge?
Incorporates prior information – what did we know before the experiment?
Results are probability distributions – this makes it possible to concentrate on the parameters of the model that are important to us ("marginalisation")






About priors



The choice of priors appears tricky in some very general analytical calculations – in practice they are much more apparent
Priors are important when the experiment adds little to your pre-experiment knowledge – usually the opposite is true; this is quantified by the evidence value
Priors are naturally subjective: if there is significant literature in a field, you should consider what others would regard as acceptable priors






Worst case scenario? I
You make some observations x. You give your prior P(θ|A) but your colleague has a somewhat different P(θ|B). You are afraid that P(θ|x, A) ≠ P(θ|x, B).

If your experiment is very powerful
Priors cannot make much of a difference. Problem solved – find a new area to work in.

If your experiment is not quite that powerful...
Real life! This single experiment doesn't completely solve the problem; people use their background knowledge to help them:

More experiments
Different experiments
Talk to colleagues


Some common prior choices


Flat prior: we know the parameter must lie between θ_l and θ_h:

P(θ) = 1 / (θ_h − θ_l)    (11)

Log-flat prior:

P(θ) = 1 / [θ (log θ_h − log θ_l)]    (12)

Conjugate priors: families of analytically convenient priors whose form depends on the form of the likelihood, e.g.

P(x|µ) = (1/√(2πσ²)) exp[−(x − µ)²/(2σ²)],   P(µ) = (1/√(2πσ²)) exp[−µ²/(2σ²)]    (13)

P(x|µ) P(µ) ∝ exp[−(x − µ)²/(2σ²) − µ²/(2σ²)] – another Gaussian!    (14)
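To see why (14) is again a Gaussian, complete the square in µ (with the same σ in likelihood and prior, as on the slide):

\[
-\frac{(x-\mu)^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2}
  = -\frac{(\mu - x/2)^2}{\sigma^2} - \frac{x^2}{4\sigma^2},
\]

so P(µ|x) ∝ exp[−(µ − x/2)²/σ²], a Gaussian with mean x/2 and variance σ²/2.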


About evidence
P(x): the prior probability that you would observe the data.

Design of the experiment, P(x|θ): what are the possible outcomes, and how do they relate to the parameters?

Prior information, P(θ): what prior information on the parameters do we have?

The result is often a computationally very demanding integral:

P(x) = ∫ dθ P(x|θ) P(θ)    (15)
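A minimal sketch of evaluating the evidence integral (15) numerically for a one-parameter model; the Gaussian likelihood, the flat prior and the grid below are assumptions chosen purely for illustration.

import numpy as np

x_obs = 1.2                                            # a single observed datum
theta = np.linspace(-5.0, 5.0, 2001)                   # parameter grid
dtheta = theta[1] - theta[0]
likelihood = np.exp(-0.5 * (x_obs - theta) ** 2) / np.sqrt(2 * np.pi)
prior = np.where(np.abs(theta) < 3.0, 1.0 / 6.0, 0.0)  # flat prior on (-3, 3)

evidence = np.sum(likelihood * prior) * dtheta         # P(x) = integral of P(x|theta) P(theta)
posterior = likelihood * prior / evidence              # P(theta|x), normalised
print(evidence)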


Worst case scenario? II
How much did the observation change our knowledge?

P(θ|x, B) / P(θ|B) = P(x|θ) / P(x|B) = P(x|θ) / ∫ dθ′ P(x|θ′) P(θ′|B)    (16)

Observation made θ less believable: this must come from the likelihood P(x|θ).

Observation made θ more believable: generally due to ∫ dθ′ P(x|θ′) P(θ′|B), in fact generally due to what you thought was true [P(θ|B) large] but is not supported by the data [P(x|θ) small].

Conclusion
The greatest increase in knowledge comes precisely when the peaks of the likelihood and the prior do not overlap.


Occam's Razor I

Occam's Razor
In general:
Prefer simpler solutions to a problem

In statistics:
By adding parameters to a model/distribution you can always get a better fit to the observations
How do you compare models with different numbers of parameters?





Occam's Razor II
Example
In our Gaussian ML example, take fitting for position (x0, y0): solid lines include these as fitted parameters, dashed lines have them fixed. 10⁵ MC simulations.

[Figure: left panel, histogram N versus −log P(x|θ); right panel, cumulative distribution C versus −log P(x|θ), zoomed.]

The two extra parameters in the solid-line histogram make this model look more likely, even though I know the actual source was always centred at the same position.


Occam's Razor in Bayesian inference
Automatic Occam's Razor
Bayesian statistics has an automatic Occam's razor in the evidence calculation. Consider n parameters:

P(x) = ∫ dⁿθ P(x|θ) P(θ)    (17)

If P(θ) is non-informative with some scale Δθ, then

P(θ) ∝ 1 / (Δθ)ⁿ    (18)

As a result,

P(x) ∝ (δθ)ⁿ / (Δθ)ⁿ    (19)

where δθ is the width of the region over which the likelihood is high.
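A small numerical illustration of this automatic Occam penalty; the single-datum Gaussian model, the fixed-mean alternative and the prior widths below are assumptions chosen for illustration. As the prior width grows without improving the fit, the evidence of the more flexible model falls relative to the simpler one.

import numpy as np

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

x_obs, sigma = 0.5, 1.0
Z0 = gauss(x_obs, 0.0, sigma)                      # evidence of the model with the mean fixed at 0

for Delta in (2.0, 10.0, 100.0):                   # widths of the flat prior on the free mean
    mu = np.linspace(-Delta / 2, Delta / 2, 10001)
    prior = np.full_like(mu, 1.0 / Delta)
    Z1 = np.sum(gauss(x_obs, mu, sigma) * prior) * (mu[1] - mu[0])
    print(Delta, Z1 / Z0)                          # evidence ratio falls as Delta grows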


Nuisance parameters
Parameters of a model that we are not interested in, but have to infer anyway because we have poor prior constraints on them, are called nuisance parameters.

Example
In the ML Gaussian fitting example, we may only be interested in A but have to fit for x0, y0 too, because we do not know the position of the source a priori.

[Figure: cumulative distribution C versus Amplitude; left panel: fitting for position; right panel: not fitting for position.]

In ML the penalty for this is bias!


Marginalisation
Bayesian inference deals with nuisance parameters through marginalisation. Consider observations {xi} that have allowed inference of the joint probability of two parameters y and z, P(y, z|{xi}):


P(y) = ∫_{−∞}^{+∞} dz P(y, z|{xi})    (20)
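A minimal sketch of marginalisation for a posterior tabulated on a grid; the correlated Gaussian used for P(y, z|{xi}) below is an assumption, purely for illustration.

import numpy as np

y = np.linspace(-3.0, 3.0, 201)
z = np.linspace(-3.0, 3.0, 201)
dy, dz = y[1] - y[0], z[1] - z[0]
Y, Z = np.meshgrid(y, z, indexing="ij")
rho = 0.6                                                  # assumed correlation between y and z
joint = np.exp(-0.5 * (Y**2 - 2 * rho * Y * Z + Z**2) / (1 - rho**2))
joint /= joint.sum() * dy * dz                             # normalise P(y, z|{x_i}) on the grid

marginal_y = joint.sum(axis=1) * dz                        # P(y): integrate the joint over z
print(np.sum(marginal_y) * dy)                             # ~1, still normalised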

Advice
With the nuisance parameter problem solved, you can afford to put more free parameters into your model:

Parameters that are unknown and uninteresting, but relevant
Sometimes also parameters you think you know exactly – if the result is surprising, the model is wrong


What to do with posterior distribution

You've got P(θ|{xi}); what do you do with it?

Plot it (only works well up to 2 parameters...)
Marginalise all combinations of parameters and then plot
Find the most likely θ, the mean θ, etc.
Estimate confidence intervals and errors
Publish it electronically
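A minimal sketch of this kind of post-processing for a posterior tabulated on a one-dimensional grid; the Gaussian-shaped posterior below is an assumption for illustration only.

import numpy as np

theta = np.linspace(0.5, 1.5, 1001)
dtheta = theta[1] - theta[0]
post = np.exp(-0.5 * ((theta - 1.02) / 0.05) ** 2)   # assumed posterior P(theta|{x_i}) on a grid
post /= post.sum() * dtheta                          # normalise

mean = np.sum(theta * post) * dtheta                 # posterior mean
mode = theta[np.argmax(post)]                        # most likely value
cdf = np.cumsum(post) * dtheta
lo, hi = np.interp([0.16, 0.84], cdf, theta)         # central 68% credible interval
print(mean, mode, lo, hi)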