BayeSys and MassInf

[Cover illustration: the birth and death composite diagrams of the LifeStory engines, and a Peano-curve schematic showing a selected location X between its left (L) and right (R) neighbours.]

John Skilling

Maximum Entropy Data Consultants Ltd (est.1981)

February 2003


2. Overview of Inference

For us, inference is the art of recovering an object ψ from data D. Usually, the object will be something quite complicated, perhaps a spectrum or an image, having at least several and maybe thousands or millions of degrees of freedom. The data, acquired through some experiment R so that D ≈ R(ψ), will nearly always be incomplete (fewer components than ψ has) or noisy (inexact), or somehow inadequate to fix ψ unambiguously. Methods of inference fall into three increasingly sophisticated classes, "inversion", "regularisation" and "probabilistic". I can illustrate these with the population of the four countries (England, Scotland, Wales and Ireland) that comprise the British Isles. I have received data (mythical, of course) that of the entire population of 10000, 8000 live in the east (England and Scotland), and 7500 live in the south (England and Wales). From these three numbers, I have to estimate the populations ψ = (ψ1, ψ2, ψ3, ψ4) of the four countries.

[Map sketch: England = ψ1, Scotland = ψ2, Wales = ψ3, Ireland = ψ4, with marginal totals 8000 (east), 2000 (west), 7500 (south), 2500 (north) and 10000 overall.]

In algebraic terms,

        ( 2500 )   ( 0 1 0 1 ) ( ψ1 )
    D = ( 7500 ) = ( 1 0 1 0 ) ( ψ2 ) = R ψ.
        ( 2000 )   ( 0 0 1 1 ) ( ψ3 )
        ( 8000 )   ( 1 1 0 0 ) ( ψ4 )

2.1. INVERSION

Inversion in its pure form inverts R to obtain the estimate ψ̂ = R⁻¹D, which might work except that R is here (and in general) singular, so that there is no inverse. A common fixup is to use a pseudo-inverse instead, so that ψ̂ = XD where X approximates the inverse of R. But the data which we would have obtained if this ψ̂ were true are RXD which, insofar as X is not the inverse of R, differ from the true data, showing ψ̂ cannot be correct. So much for inversion. This method should only be used when the results are essentially unambiguous, or don't matter much.

2.2. REGULARISATION

In my illustrative problem, simple logic shows that the data leave only one degree of freedom. Let this be the Irish population x. Then

    ψ4 = x,   ψ3 = 2000 − x,   ψ2 = 2500 − x,   ψ1 = 5500 + x,

and this satisfies the data for any x. Regularisation consists of finding the "best" result from among all those that agree with the data, as defined by maximising some regularisation function Φ(ψ) representing quality. The commonest such method is least squares, which here fixes x by minimising

    ψ1² + ψ2² + ψ3² + ψ4² = −Φ(ψ)

subject to agreement with the data. Unfortunately the selected value of x is −250 (minimising (5500+x)² + (2500−x)² + (2000−x)² + x² gives 4x + 1000 = 0), indicating that the population of Ireland is negative. Really? Least squares is not always best! For the sort of positive distribution being considered, maximum entropy (maximise Φ = −Σ_i ψ_i log ψ_i) is useful. Negative populations are prohibited by the logarithm, and the result

    Ireland = 500    Wales = 1500    Scotland = 2000    England = 6000

has the 1:3 north:south ratio nicely independent of longitude, and the 1:4 east:west ratio nicely independent of latitude. In fact, entropy is the only function yielding this general symmetry, which may well account for


the high quality of many maximum-entropy reconstructions. But questions remain. How should we account for noise in the data? What about nuisance parameters? How uncertain is the "best" result?

2.3. PROBABILITIES

In 1946, R.T. Cox [ref] proved that the only way of doing inference consistently is through probability calculus and seeking the posterior probability distribution Pr(ψ | D) of result ψ, given data D. We start by assuming a prior distribution representing what we think ψ might be, and modulate this with how likely the data measurements would then be. This gives the joint distribution of ψ and D, which can be alternatively expanded as the posterior we want, scaled by a coefficient called the "evidence".

    Pr(ψ, D | I)  =  Pr(ψ | I) × Pr(D | ψ, I)  =  Pr(ψ | D, I) × Pr(D | I)
       Joint      =    Prior   ×   Likelihood  =    Posterior  ×  Evidence
                       (assumptions & measurements)        (inference)

That's the famous, and stunningly elementary, Bayes' Theorem. In writing it, I have explicitly included the dependence on whatever ancillary information I is to hand (though in practice one often omits the symbol when the context is understood). Observe that the evidence amounts to the likelihood for this information. We would use exactly the same Bayes' Theorem to compare different ancillary assumptions I (under more general background information J) as we do here to compare different parameter values ψ: it is only the identifications of the symbols that would change. If we don't have any alternatives for I, the evidence is useless to us, though in principle we should still calculate it and publish it in case somebody else can think of an alternative that might explain the data better. Or worse, if the alternative evidence was smaller. Always calculate the evidence (its units are inverse data) as an integral part of Bayesian inference. Few people do this, which is a pity.

2.4. PRIOR PROBABILITIES

Before we even start to use the data, we need to assign a prior probability distribution π(ψ) = Pr(ψ | I) to everything relevant that we don't know. I sometimes remark that I don't know what I am talking about -- until I have defined a prior. We have complete freedom to set π, subject only to it being non-negative and summing to unity (so that it can actually be a probability distribution). In practice, we will assign π by using such desiderata as symmetry, simplicity, and reasonable behaviour, according to our experience and judgment. We can also use the posterior distribution from previous observations as our current prior. (That's what "doing inference consistently" means. It doesn't matter whether we use dataset D1 first then D2, or the other way round, or use them both together. Each yields the same posterior.) Yes, the assignment of prior is subjective. That's the way it is! But, after acquiring data, we can compare evidence values, so the subjective assumptions can be objectively compared. In my illustrative population example, we need to assign a prior over the four non-negative populations. In fact, we may as well split the British Isles into arbitrary fractions fi instead of restricting to four quarters. One choice, which can be used at any resolution and is quite often suggested (though seldom by me), is a gamma distribution

    π(ψ) ∝ ∏_i ψ_i^(c f_i − 1) e^(−ψ_i / s)

where s and c are constants. Priors nearly always involve constants like these (grandiosely called "hyperparameters"). Interpreting them and setting them appropriately is part of the art. Given the success of maximum entropy in regularisation, it is tempting to try

    π(ψ) ∝ exp(−ψ1 log ψ1 − ψ2 log ψ2 − ψ3 log ψ3 − ψ4 log ψ4)

or some close variant. However, integration shows that the implied prior on the total population ψ = ψ1 + ψ2 + ψ3 + ψ4 is nothing like exp(−ψ log ψ), whereas the gamma distribution would have remained intact. Worse, the entropy prior cannot be subdivided. There is no distribution p(·) which could be applied independently to northern and southern England and then integrated to give a prior like exp(−ψ log ψ) for the population of England as a whole. Entropy is a good regulariser but, like most functions, it does not translate into a good prior.

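Why would "the gamma distribution have remained intact"? Under the shape/scale reading sketched above (my reconstruction of the garbled formula, with shape c f_i and a common scale s), independent gamma variates with a shared scale simply add their shapes:

\[
\psi_i \sim \mathrm{Gamma}(c f_i,\; s)\ \text{independently}
\;\Longrightarrow\;
\psi = \sum_i \psi_i \sim \mathrm{Gamma}\!\Big(c \sum_i f_i,\; s\Big) = \mathrm{Gamma}(c,\; s),
\]

since the fractions f_i sum to one. The entropy-style prior admits no such closure, which is the subdivision failure just described.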

2.4.1. Atomic priors

One useful way of breaking down the complexity of a problem is to construct the object as some number of a-priori-equivalent "atoms", which are scattered randomly over the domain of interest. These atoms are given whatever extra attributes are needed to model the object. In the population example, an atom might represent a person, in which case those particular data would require 10000 atoms. Or an atom might represent a census unit of 500 people, in which case only 20 would be needed. Or an atom might represent a tribe whose size was drawn from some distribution such as an exponential: this would often be better because of the flexibility involved, and might require even fewer atoms. Whatever an atom represents, and however many there are, letting them fall randomly over the domain ensures that we can compute on any scale. Atomic priors give a structure-based description. The computer only needs to deal with the amount of structure that is actually required, because the number of atoms is usually allowed to vary. Spatial resolution, in the form of accuracy of location of the atoms, is "free", because each atom is automatically held to the arithmetical precision of the hardware. Algebraically-continuous priors like the gamma distribution, by contrast, give a cell-based description. To use them, we have to divide the spatial domain into as many cells as we are ever likely to need, and be prepared to compute them all. Resolution is directly limited by the computer memory and processor time. This comparison between atomic and continuous priors is rather analogous to the comparison between a Monte Carlo representation by samples and a full covering of the entire space. In each case, the former is practical, while the latter is likely not. And there is no loss of generality. If required, we could let the whole object be a single atom having attributes for every cell, which would reproduce a cell-based prior. To define an atomic prior, we assign distributions for the number n of atoms, and for their attributes. Typical priors for n are

    π(n) = constant between a minimum and maximum number (where equality makes n fixed),
    or Poisson   π(n) = e^(−α) α^n / n!   (binomial if a maximum is imposed),
    or geometric π(n) ∝ c^n with c < 1

(which is wider than the Poisson). As a technical note, only the Poisson assignment is "infinitely divisible", meaning that it can be accumulated from arbitrarily small but fully independent subdivisions of the domain -- remember that Poisson distributions combine into another Poisson distribution with summed mean. If computed at small scale, the other forms need small correlations to make the total number correctly distributed. I don't think that matters at all. Indeed, I usually prefer the less-committal geometric assignment.

2.4.2. Coordinates

As for the attribute coordinates, their priors will depend on what the attributes are. Importantly, there are many applications where an atom has very few attributes, such as image reconstruction where an atom has only position (x, y), intensity, and possibly shape. It is then much easier to find an acceptable and useful new position for an atom than it would be for the object as a whole, simply because of the huge reduction in dimensionality. (Divide and conquer, as the slogan has it.) An important case is where an attribute measures an additive quantity z such as population number, or power of signal, or brightness of radiation. For such intensities, an exponential distribution is often convenient,

    π(z) = c e^(−cz),   c = constant.

Equivalently, the cumulant distribution u = ∫₀ᶻ π(z′) dz′ = 1 − e^(−cz) is (by construction) uniformly distributed with π(u) = 1 in [0, 1]. This form can be imposed quite generally. Even if there are several (d) attributes of an atom, it is always possible to squash the original prior into uniformity π(u) = 1 over the unit hypercube
[0, 1]^d. I recommend this discipline: there is no loss of generality and numerical exploration is likely to be easier.

2.5. SAMPLING

Only when ψ has very few degrees of freedom can we hope to explore "all" ψ to find the posterior "everywhere". Instead, the modern approach is to characterise the posterior in a Monte Carlo sense, by taking a dozen or more random-sample objects ψ̃ from it. It is fortunate but true that these dozen samples will very likely answer any particular question about ψ with adequate precision. The point is that each sample object ψ̃ yields an independent value Q̃ = Q(ψ̃) for the scalar quantity Q being sought. Only occasionally will these be badly biassed overall. For example, there is only about a 1 in 2000 chance that their average Q̃ will be more than one standard deviation from the true mean of Q (for Gaussian statistics: with a dozen samples, one standard deviation is √12 ≈ 3.5 standard errors), and this chance drops sharply as more samples are taken. Note that any single ψ selected from the posterior should be treated with care. Suppose for instance that the posterior requires ψ to lie on a circle. Then the mean (an obvious candidate for presentation) will lie inside the circle, which is supposed to be prohibited! A median is predicated upon being able to order the values of ψ, which really only makes sense if ψ is restricted to one dimension. The mode, obtained by maximising the posterior (in a method sometimes glorified with the acronym MAP for maximum a-posteriori probability), moves if ψ is re-parameterised, because squeezing ψ somewhere increases the posterior there to compensate. Hence the mode lacks an invariance we often want. In fact the dozen or more sample objects ψ̃ seem to be the only faithful representation that is generally accessible.



3. Markov chain Monte Carlo (MCMC)

Markov chain Monte Carlo algorithms are the preferred method for practical inference. The only generally faithful way of representing the posterior distribution is through sampling, which implies a Monte Carlo method. And the posterior can only be reached in practice through a Markov chain of intermediate states, each learning from the previous one(s). Hence MCMC. Let our object ψ have available states i = 1, 2, . . . . A MCMC algorithm for exploring ψ is identified by the transitions T that it can make, with their probabilities:

    i → j   with probability T_ji = Pr(j | i).

Technically, a Markov transition could also involve a memory of earlier states. However, transitions may (and often do) already involve a sophisticated history of intermediate trial or actual states, so the restriction is more apparent than real. Suppose our current knowledge of ψ is probabilistic: Pr(i) = p_i. Repeatedly applying the algorithm will yield p → Tp → T²p → T³p → · · · . Eventually, p will converge to the principal eigenvector of T with greatest eigenvalue (which, because the components of p always sum to the same unit total, must be 1). Provided the algorithm is aperiodic (so that it doesn't just bounce) and "irreducible" (so that every state is eventually accessible from any other), this principal eigenvector is unique. Elements of randomness soon ensure that an algorithm is aperiodic, and if it happens that chains of successive p reduce into disjoint un-mixed domains, we will include extra transition routes to join all such domains. Hence it is usually straightforward to satisfy these conditions. We will start by designing algorithms which target the prior Pr(ψ) = π(ψ), for which we need a transition matrix whose principal eigenvector is π. Eigenvectors are somewhat remote from the components of a matrix, being expensive to compute. However, any matrix with the property of "detailed balance"

    T_ji / T_ij = π_j / π_i    for all i, j

will suffice. Thinking physically, we see that if we start correctly with π_i objects in state i and π_j in state j, then on applying T the same number (T_ji π_i) will pass from i to j as pass from j to i (T_ij π_j), leaving the correct assignment intact:

    state i  --- T_ji π_i --->  state j,        state j  --- T_ij π_j --->  state i.

The corresponding algebraic proof is (Tπ)_j = Σ_i T_ji π_i = Σ_i T_ij π_j = (Σ_i T_ij) π_j = π_j. Although not essential, detailed balance is the key to constructing useful transition matrices. We start with algorithms that explore the prior faithfully, leaving the extra complication of likelihood factors until later.

3.1. THE NUMBER OF ATOMS

Typical priors Pr(n) for the number of atoms are

    π(n) = 1/N,                for 0 ≤ n < N     (uniform);
    π(n) = e^(−α) α^n / n!,    for n ≥ 0         (Poisson);
    π(n) ∝ c^n,                for n ≥ 0         (geometric),

and we seek an algorithm that faithfully targets any such prior. The natural unit of change is just one atom at a time, so the only non-zero transitions T_ji will be j = i + 1 ("birth" of an atom), j = i − 1 ("death" of an atom), and j = i (no change). Detailed balance fixes the ratios between birth and death but leaves their overall magnitudes free. I choose to let each atom decay with unit mean lifetime, so that the death rate is set as

    T_{n−1,n} = n dt

in infinitesimal interval dt of artificial time. The rationale for this is that most of the atoms will have been changed after unit time. Regular O(1) timesteps thus give a natural sampling period for our multi-atom


object. Sampling more frequently would make successive objects too similar, whereas sampling less often might equilibrate wastefully between samples. Detailed balance implies a corresponding birth rate

    T_{n+1,n} = β_n dt,    β_n = (n + 1) π_{n+1} / π_n.

For the three typical priors, these birth rates are

    β_n = n + 1, for n < N − 1, else 0    (uniform);
    β_n = α                               (Poisson);
    β_n = (n + 1) c                       (geometric).

Starting with (say) n atoms, the time to the next event is exponentially distributed,

    Pr(t) = (β_n + n) e^(−(β_n + n) t),

and when that event occurs, it is either birth or death in ratio β_n : n. Thus, in the following example, samples are taken at uniform times τ, 2τ, 3τ, 4τ, . . . when n happens to be 3, 3, 2, 2, . . . :

    n=3 ──|τ── Birth → n=4 ── Death → n=3 ──|2τ── Death → n=2 ──|3τ──────|4τ── Birth → n=3

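As an illustration of this birth/death process (a minimal sketch of my own, not BayeSys code), the following C fragment runs the scheme for the geometric prior π(n) ∝ c^n, with birth rate β_n = (n+1)c and unit-lifetime deaths, and checks that the regular samples reproduce the geometric distribution; the constants c and τ are arbitrary illustrative choices.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Uniform(0,1), excluding 0 so that log() is safe */
static double urand(void) { return (rand() + 1.0) / ((double)RAND_MAX + 2.0); }

int main(void)
{
    double c = 0.8;              /* geometric ratio, prior pi(n) ~ c^n        */
    double tau = 1.0;            /* sampling period in artificial time        */
    int    n = 0;                /* current number of atoms                   */
    double t_next_sample = tau;  /* time remaining until the next sample      */
    long   count[50] = {0}, nsamples = 100000;

    for (long s = 0; s < nsamples; ) {
        double birth = (n + 1) * c;          /* beta_n = (n+1) c              */
        double death = n;                    /* unit mean lifetime per atom   */
        double rate  = birth + death;
        double dt    = -log(urand()) / rate; /* exponential waiting time      */

        /* record regular samples that fall before the next event */
        while (s < nsamples && t_next_sample <= dt) {
            if (n < 50) count[n]++;
            s++;
            t_next_sample += tau;
        }
        t_next_sample -= dt;                 /* advance clock to the event    */

        if (urand() * rate < birth) n++;     /* birth : death in ratio beta_n : n */
        else                        n--;
    }
    for (int k = 0; k < 10; k++)             /* should approach (1-c) c^k     */
        printf("n=%d  freq=%.4f  prior=%.4f\n",
               k, (double)count[k] / nsamples, (1 - c) * pow(c, k));
    return 0;
}
```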
3.2. COORDINATES

As recommended above, we require the prior to be uniform π(u) = 1 over the unit hypercube [0, 1]^d. In one dimension, this domain is just the unit interval 0 < u < 1. Within the computer, of course, the coordinate will come from a finite set determined by the finite precision of the hardware. This observation suggests a "modern digital" style of treatment. My convention, taking the computer word length to be B bits (usually 32), is to let the available values be odd multiples of 2^(−(B+1)), labelled by unsigned (i.e. non-negative) integers from 0 to 2^B − 1:

    u_k = 2^(−B) (k + ½).

This makes all states a-priori-equivalent, and keeps 1 − u always paired with u. Helpfully, the states are symmetrically away from the boundaries u = 0 and u = 1, and centre u = ½, which might be special. The rules of integer arithmetic (modulo 2^B) ensure that u is wraparound continuous, which is never harmful and sometimes appropriate. In one dimension, the obvious "analogue" transition scheme is to select Δ at some appropriate scale (to be assigned somehow), and then use it to either increment or decrement u:

    u ← u ± Δ   (mod 1).

In the underlying integer representation, with modulo 2^B arithmetic understood, k ← k ± Δk. With increment and decrement being equi-probable, the algorithm is clearly in detailed balance over a uniform prior, no matter how Δ was set. An alternative "digital" transition scheme is to decide on some number b up to B, and then randomise the low order b bits of k:

    k ← k ⊕ Uniform[0, 2^b)

where "⊕" represents binary exclusive-or, here with a random integer less than 2^b. Extra randomisation at the same scale is achieved by taking the coordinate relative to some non-zero origin o:

    k ← ((k − o) ⊕ Uniform[0, 2^b)) + o


Advantages of this digital algorithm will become clear later. When u has d dimensions (more than one), we can fill the unit hypercube with a space-filling Peano curve, thus reducing the topology to a one-dimensional coordinate along the curve. The hypercube contains (2^B)^d digital points, so each point along the curve is labelled by an integer with B d bits (d words). The diagrams below show Peano curves in two dimensions. A Peano curve can be constructed recursively from its generator. At the top level the generator is a path around the 2^d corners of a hypercube, using segments directed parallel to the axes. In computer parlance, this path is a "Gray code" for d-bit integers. At the next (second) level, smaller copies of the generator are placed at each first-level point, oriented to keep the ends of successive generators adjacent. There are now 2^(2d) points along the path. At the third level, yet smaller copies of the generator are placed at each second-level point, again oriented to keep the ends of successive generators adjacent. By now, the path has 2^(3d) points.

And so on, at finer and finer scales until all B d bits have been used. Each stage adds resolution without changing the existing larger-scale pattern: with 4-bit axes 0, 1, . . . , 15 there are 16^d = 16^2 = 256 points.

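The "Gray code" generator mentioned above is easy to state in code. The sketch below is illustrative only (it is not the Peano machinery inside BayeSys): the standard binary-reflected Gray code and its inverse, whose consecutive values differ in exactly one bit and so trace a path around the hypercube corners edge by edge.

```c
#include <stdio.h>

/* Binary-reflected Gray code: consecutive integers differ in exactly one bit. */
unsigned gray(unsigned i)        { return i ^ (i >> 1); }

/* Inverse: recover the index from the Gray code. */
unsigned gray_inverse(unsigned g)
{
    unsigned i = g;
    while (g >>= 1)
        i ^= g;
    return i;
}

int main(void)
{
    for (unsigned i = 0; i < 8; i++)            /* d = 3 bits: 8 corners */
        printf("%u -> %u%u%u\n", i,
               (gray(i) >> 2) & 1, (gray(i) >> 1) & 1, gray(i) & 1);
    return 0;
}
```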
A Peano curve preserves locality, meaning that contiguity along the line implies contiguity in space (though not conversely). In fact, the Peano curve has the greatest degree of locality possible for a space-filling curve. It follows from locality that a Peano curve also preserves continuity, meaning that a function that is continuous in space must also be continuous along the line (though not conversely). Differentiability, though, is not preserved. Even so, there are useful methods for one-dimensional exploration which can immediately be applied in d dimensions, simply by invoking extended-precision integer arithmetic. A length of line that occupies a fraction f of the total necessarily fills that same fraction f of the hypercube volume. The winding pattern spirals out very roughly isotropically, so that each coordinate ranges over something like f^(1/d) as the length is traversed. Winding a one-dimensional line into a d-dimensional volume, though, necessarily places some points that are close in space far apart on the line. With the very precise and binary Peano pattern, there is only one crossing of the central line u1 = ½ (vertical in the diagrams), so that points on opposite sides of that line are mostly far apart on the Peano curve. Similarly, there is only one crossing of the (wraparound continuous) abscissa u1 = 0, and none at all of the base u2 = 0. These barriers to free spatial movement are made harmless by moving them around after each observation interval τ, re-randomising the origin of the Peano curve within the hypercube and re-permuting its orientation. One-dimensional representation along a line implies an ordering of locations, so that an atom's neighbours can be quickly identified by keeping a linked list. Neighbouring atoms along the line may not be quite


the closest in space, but they still turn out to be useful because they identify pairs of atoms that may be similarly related to the data, and whose behaviour may thereby be correlated.

[Diagram: atoms strung along the one-dimensional line, each linked to its left and right neighbours.]

In order to keep the ordering of atoms unambiguous, I accept a restriction that not more than one atom may occupy a single point. There is a huge number of points, so this is only a technicality.

3.3. ROLE OF LIKELIHOOD

We have, so far, discussed the form of the prior, and settled on a flat prior over the unit hypercube for the attributes of each of a variable number of atoms. For exploration of the prior by MCMC methods, we have introduced birth and death rates for changing the number of atoms, and movement along a Peano curve to change their positions. It is now time to introduce the data. To correct a MCMC algorithm and make it converge to the required posterior instead of merely to the prior, its transition probabilities need to be adjusted. Suppose that transitions are performed, not definitively, but probabilistically according to acceptance probabilities A. This will reduce an effective transition rate from T_ji to A_ji T_ji. Detailed balance should conform to the posterior ∝ L(ψ) π(ψ), so we require

    A_ji T_ji / (A_ij T_ij) = L(ψ_j) π(ψ_j) / (L(ψ_i) π(ψ_i)).

Because the basic transition scheme T will already conform to the prior, T_ji / T_ij = π(ψ_j) / π(ψ_i), the acceptance probabilities must simply be in ratio of the likelihoods

    A_ji / A_ij = L(ψ_j) / L(ψ_i).

To avoid wasting resources, we want to accept as often as possible, so we set whichever acceptance would be larger to its greatest permitted value of 1. Hence

    A_ji = min( 1, L(ψ_j) / L(ψ_i) ),

which is equivalent to the rule: "Accept transition i → j if and only if L(ψ_j) ≥ Uniform(0, L(ψ_i))". Although delightfully simple, this method (due to Metropolis [ref] as generalised by Hastings [ref]) can be inefficient. The worst difficulty lies in how to choose the magnitude Δ of change in the transitions. If Δ is set too small, diffusion of position will be un-necessarily (and quadratically) slow. If Δ is set too large, nearly every proposal will be rejected, and the procedure effectively stops -- without having converged. We need a way of ensuring that Δ has the correct scale. Incidentally, it is now apparent why attention is universally focussed on pair-wise transitions rather than on longer cycles i → j → k → l → · · · → i


that might seem to offer systematic exploration instead of diffusion. The flux of samples around such a cycle is limited by the relative probability of its least likely state. With larger cycles, this will be an ever-smaller fraction of the highest probability where most of the samples will reside, so the flux decreases and the cycle offers no gain. The smallest non-trivial cycle (length 2) is best.

3.4. SLICE SAMPLING

Let the initial state, with likelihood L0, be represented by a B-bit integer k from domain D0. Set an acceptance level a ← Uniform(0, L0) and randomise all B bits of k to reach

    j0 = k ⊕ Uniform[0, 2^B)

in domain D0. According to Metropolis-Hastings, we are entitled to accept any new trial state j whose likelihood exceeds a, provided it was generated symmetrically with the reverse transition j → k being just as probable as the forwards k → j. So, if the likelihood for j0 is greater than a, accept it. If not, halve the domain size to D1 ⊂ D0 by randomising only the lowest B − 1 bits,

    j1 = k ⊕ Uniform[0, 2^(B−1)),

keeping the top bit intact. The transition is symmetric because both k and j1 lie in the same domain D1, so j1 → k is just as likely as k → j1 (with probability 1/2^(B−1) as it happens). So, if the likelihood for j1 is greater than a, accept it. If not, halve the domain size again to D2 by randomising only the lowest B − 2 bits,

    j2 = k ⊕ Uniform[0, 2^(B−2)).

This transition too is symmetric, because k and j2 both lie in D2 so j2 → k is just as likely as k → j2, and each direction had the same chance of being aborted at the earlier stages. (This would no longer be true if j2 had been obtained by addition or subtraction instead of bit-randomisation, because k and j2 would have gone to each other via different ranges and would hence have had different chances of aborting.) So, if the likelihood for j2 is greater than a, accept it. If not, keep going by randomising fewer and fewer low-order bits, until an acceptable likelihood is reached. At worst, this procedure terminates (after B trials) with the original state, necessarily acceptable because L0 ≥ a. More likely, an acceptable state will be found somewhere around the scale of the controlling likelihood function, at which changes in log L are O(1). All we ask of the likelihood is that it be a reasonably continuous function of position, so that there is an acceptable scale.

[Diagram: nested domains D0 ⊃ D1 ⊃ D2 ⊃ D3 shrinking around the initial state k0, with the accepted trial k3 lying inside D3.]

There appear to be barriers to free movement at special places such as u = ½ (where passage between adjacent integers 0111111··· and 1000000··· requires all B − 1 low bits to change). As elsewhere, these can be made harmless by carrying out the procedure with respect to a randomly offset origin. There is nothing special about the halving procedure just described, other than being simple to implement in integer arithmetic. Any set of pre-defined nested domains would suffice: slice sampling was introduced by Neal [ref] in this more general form. Actually, Neal started his slice-sampling algorithm from some intermediate length scale, and could step outward by widening the domain as well as inward by shrinkage. In a multi-atom environment, that is less important. Each atom has left and right neighbours along the


Peano curve, and their location will often give sensible limits on that atom's movement. A location beyond can be considered "out-of-range", and rejected at once. Stepping out is no longer particularly necessary, because the early steps inward from the full number of bits mostly involve merely the trivial overhead of checking whether a trial location is in range. The following fragment of pseudo-code illustrates the basic implementation of slice-sampling. The entry position is represented by the B d-bit integer k.

    b ← B × d                                       full number of bits
    o ← Uniform[0, 2^b)                             random origin
    a ← Uniform(0, L(k))                            level of acceptance probability
    do {                                            loop
        j ← ((k − o) ⊕ Uniform[0, 2^b)) + o             trial position around k
        b ← b − 1                                       shrink interval around k
    } while( j is out-of-range or L(j) < a )        until j is in range and, if so, is acceptably probable

The fact that the position almost certainly changes (j ≠ k) more than justifies the mild extra cost of slice sampling, as opposed to straightforward Metropolis-Hastings rejection.

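For concreteness, here is a minimal self-contained sketch of this bit-shrinking slice sampler for a single 32-bit coordinate. The toy Gaussian likelihood, the xorshift generator and the absence of an out-of-range test are my own illustrative choices, not part of BayeSys.

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

static uint32_t state = 2463534242u;          /* xorshift32 PRNG, seed nonzero */
static uint32_t rand32(void)
{
    state ^= state << 13;  state ^= state >> 17;  state ^= state << 5;
    return state;
}
static double urand(void) { return (rand32() + 0.5) / 4294967296.0; }

/* toy likelihood on the unit interval: a Gaussian bump at u = 0.3 */
static double L(uint32_t k)
{
    double u = (k + 0.5) / 4294967296.0;      /* u_k = 2^-B (k + 1/2), B = 32 */
    return exp(-0.5 * pow((u - 0.3) / 0.05, 2.0));
}

/* one slice-sampling update of the coordinate k */
static uint32_t slice_update(uint32_t k)
{
    int      b = 32;                          /* full number of bits          */
    uint32_t o = rand32();                    /* random origin                */
    double   a = urand() * L(k);              /* acceptance level             */
    uint32_t j;
    do {
        uint32_t mask = (b >= 32) ? 0xFFFFFFFFu : ((1u << b) - 1u);
        j = ((k - o) ^ (rand32() & mask)) + o;  /* trial position around k    */
        b--;                                    /* shrink interval around k   */
    } while (L(j) < a);                         /* no out-of-range test here  */
    return j;
}

int main(void)
{
    uint32_t k = rand32();
    double mean = 0.0;
    for (int i = 0; i < 100000; i++) {
        k = slice_update(k);
        mean += (k + 0.5) / 4294967296.0;
    }
    printf("sample mean %.4f (bump is at 0.30)\n", mean / 100000.0);
    return 0;
}
```

When the trial shrinks all the way back to zero randomised bits, j equals k, which is always acceptable, so the loop is guaranteed to terminate.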


4. Annealing

In any large application, it is true almost by definition that the significant "volume" of the posterior occupies only a tiny fraction of the volume of the prior. Technically, the information, or negative entropy,

    H = + ∫ P(ψ) log[ P(ψ) / π(ψ) ] dψ,    P = posterior, π = prior,

is likely to be much bigger than unity, in keeping with the dimensionality of the problem. It is then difficult to jump directly from the prior to the posterior, which is why we need a Markov chain. Practically all samples from the prior will lie out on the far tails of the posterior, where the local structure may give poor guidance about how to approach the isolated peak(s). To avoid this, we divide the application into subsidiary steps within which H does not increase too much. We arrange that there is significant overlap between the initial and final distributions of each step, and the computation can then proceed in reasonable safety. The traditional problem of locating a needle in a haystack is analogous: the needle occupies 1 unit in a haystack of volume V = exp(H). Instead of a direct search, which would take V or so trials, we divide the haystack into halves, quarters etc, thus managing to locate the needle in log(V) = H steps instead. Although there are many possible paths between the prior and the posterior, one particular path is specially sympathetic to the formalism. I have tried a couple of others, but found no advantage. Let L(ψ) be the likelihood Pr(D | ψ) = L(ψ) and let β be a numerical coefficient. Instead of working with L directly, we work with a modified likelihood L(ψ)^β, which induces a modified posterior proportional to L(ψ)^β π(ψ). Setting β = 0 switches the likelihood off (L^0 = 1), so the modified posterior is simply the prior. Setting β = 1 switches the likelihood on (L^1 = likelihood), so the modified posterior is simply the true posterior. And, by increasing β gently from 0 to 1, we can divide the application into small, safe steps. In fact we can continue further: making β even larger makes the modified posterior sharpen around the point(s) of maximum likelihood.

    β = 0     →     β = 1     →     β = ∞
    prior           posterior       maximum

We can thus use the same program for both Bayesian sampling of the posterior and maximum-likelihood, or indeed for maximising an arbitrary function. In a thermodynamic analogy that is productive of terminology, β is inverse temperature 1/T (so we call it "coolness"), and −log L is energy E, whence the modified likelihood factor becomes the familiar exp(−E/kT). This is why passage along this particular path is called "annealing". We also talk of a set of objects {ψ} being "at equilibrium" if they sample the modified posterior faithfully. Annealing also allows us to calculate the values of evidence E and information H. We generalise the evidence

    E = Pr(D) = ∫ Pr(D | ψ) Pr(ψ) dψ = ∫ L π dψ

to

    E(β) = ∫ L^β π dψ,

which differentiates to give

    d(log E) / dβ  =  ∫ L^β log L π dψ / ∫ L^β π dψ  =  ⟨ log L ⟩_β .

As the central expression defines, the angle-brackets here denote averaging over the posterior as modified to coolness β. The purpose of annealing is to calculate the posterior (by sampling) at unit coolness, so a fortiori we can sample at intermediate coolnesses to pick up the required averages. And, because E(0) = ∫ π dψ = 1, the evidence value we seek is simply their sum

    E = E(1) = exp( ∫₀¹ [d(log E)/dβ] dβ ) = exp( ∫₀¹ ⟨ log L ⟩_β dβ ).

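In code, this "thermodynamic integration" amounts to accumulating ⟨log L⟩_β over the cooling schedule. A minimal sketch follows; the β grid and ⟨log L⟩ values are placeholders, whereas a real run would obtain each average from equilibrated MCMC samples at that coolness.

```c
#include <stdio.h>

/* log E = integral over beta in [0,1] of <log L>_beta,
   approximated here by the trapezoid rule over a stored schedule. */
double log_evidence(const double *beta, const double *mean_logL, int nsteps)
{
    double logE = 0.0;
    for (int i = 1; i < nsteps; i++)
        logE += 0.5 * (mean_logL[i] + mean_logL[i - 1]) * (beta[i] - beta[i - 1]);
    return logE;
}

int main(void)
{
    /* illustrative schedule: <log L> typically rises towards beta = 1 */
    double beta[]      = { 0.0,   0.1,   0.3,   0.6,   1.0 };
    double mean_logL[] = { -40.0, -25.0, -18.0, -14.0, -12.0 };
    printf("log E approx %.2f\n", log_evidence(beta, mean_logL, 5));
    return 0;
}
```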

(I was startled by this result when I stumbled across it, and enthusiastically took it to the late -- and widely missed -- Edwin Jaynes. He looked at it and remarked that the formula was "well known to those who know these things". Thank you Ed. Such identities are indeed commonplace in thermodynamics. In numerical work, the method is called "thermodynamic integration".) A similarly short derivation yields

    H(β) = β ⟨ log L ⟩_β − log E(β),

with β = 1 being the case Bayesians usually want. How should we control β? One way of proceeding is to set up some chain of values of β, and let the object(s) diffuse up and down, as well as equilibrating in ψ at the current coolness [ref from Julian Besag]. If the chain is appropriate (so that the beginning and end of each link overlap well enough), and if the links are weighted by their correct evidence values, then any object that successfully wanders from the prior end of the chain to the posterior end, and back, is guaranteed to have given a faithful sample of the posterior. That sounds attractive. But the evidence values are numerical, and rely upon adequate equilibration of ψ, so the guarantee is not what it seems. If an implementation failed to move ψ at all, it would still appear to work. A variant, "replica exchange", starts by putting an object on each link of the chain, and letting them exchange position as they equilibrate. This avoids having to pre-acquire evidence values, but it starts far from equilibrium because at the beginning we only know how to sample at the prior end. Replica exchange, like everything else, relies on successful equilibration in ψ. I prefer not to control β by diffusion. The number of coolness values will be H or so, so that an object will take H² iterates to diffuse along the chain. This seems wasteful by a large factor H. I think it is better to cool systematically. And I think it wise to take a hint from physics, where the efficient route to equilibrium is usually through slow cooling, trying to remain always close to equilibrium. All too often, systems lock into metastable states if they are quenched too rapidly. A successful program, then, needs an "annealing schedule" that defines how fast the coolness may increase. Crudely, the change in β per cooling step should presumably allow the relative information between its beginning and end to be O(1), hence the suggestion [Otten & van Ginneken 1984 quoted in Radford Neal (1993) technical report p.90] that entropy should decrease at a roughly constant rate, whatever rate that may be. But I think that you, the user, should have control of the overall numerical rate, if only because applications differ in their difficulty.

4.1. SELECTIVE ANNEALING

Let us use an ensemble of N member objects, perhaps 10 - 100 of them, with the k-th having likelihood L_k. Suppose this is in equilibrium at coolness β. We now wish to cool by an additional Δβ. This could be done by weighting the objects by w_k ∝ L_k^Δβ or, with normalisation to ⟨w⟩ = 1,

    w_k = (L_k / L̄)^Δβ,    L̄ = ⟨ L^Δβ ⟩^(1/Δβ).

Weighted averages over the objects would then be faithful to the new coolness. However, log-likelihoods are usually large, so that the weights would become very non-uniform after significant cooling, leading to gross numerical inefficiency. Instead, draw N new samples from the weighted ensemble, ensuring a mean multiplicity ⟨n_k⟩ = w_k for each original object. The members of the new ensemble will then have equal weight again. To reduce changes in the ensemble, it is desirable to keep the (integer) number of copies n_k as close as possible to its mean, being either the integer immediately below or immediately above ⟨n_k⟩. It is further desirable to treat similar objects as similarly as possible. For example, if 10 objects have ⟨n⟩ = 0·3, we should keep exactly 3 of them, omitting the other 7. Likewise, if 10 objects have ⟨n⟩ = 2·4, we should keep exactly 24 of them, taking either 2 or 3 copies of each. These desiderata can be satisfied by ordering the L_k into increasing (or decreasing) order: for example with N = 4 we might have:


    Object k          1     2     3     4
    Weight w          0·5   0·7   1·0   1·8
    Cumulant weight   0·5   1·2   2·2   4·0

We then draw an initialising random variable r ← Uniform(0, 1) to set a sequence {r, r + 1, r + 2, . . .}. Whenever one of the sequence intersects the cumulant weight, we draw that one as a new object. For example, if r = 0·4 the sequence would be {0·4, 1·4, 2·4, 3·4}, which intersects the cumulants at k = {1, 3, 4, 4}. The least likely object 1 is drawn once, object 2 is deleted, object 3 is taken once, and the most likely object 4 is taken twice. The net effect is to copy object 4 onto object 2, duplicating the former while deleting the latter. Depending on the value of r (random between 0 and 1), the various multiplicities would have been:

    Object k                    1     2     3     4
    Actual multiplicities n
        r = 1                   0     1     1     2
        r = 0·5                 1     0     1     2
        r = 0·2                 1     1     1     1
        r = 0                   1     1     1     1
    Mean multiplicities ⟨n⟩     0·5   0·7   1·0   1·8

Necessarily, the mean multiplicities agree with the quoted weights. And, provided the mean is correct, there is no harm in thus imposing some correlations.
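The copying scheme just described is, in code, a systematic resampling over the cumulant weights. A minimal sketch (not BayeSys code; the function and names are mine), assuming the weights have already been scaled to sum to N:

```c
#include <stdio.h>
#include <stdlib.h>

/* Draw N copies from N weighted objects using one random offset r in (0,1):
   targets r, r+1, ..., r+N-1 are located within the cumulant weights,
   so an object of weight w receives either floor(w) or ceil(w) copies. */
void systematic_resample(const double *weight, int *copies, int N)
{
    double r = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* r in (0,1)     */
    double cumulant = 0.0;
    int    drawn = 0;                       /* how many targets passed so far */

    for (int k = 0; k < N; k++) {
        cumulant += weight[k];              /* weights assumed to sum to N    */
        copies[k] = 0;
        while (drawn + r <= cumulant) {     /* target intersects this object  */
            copies[k]++;
            drawn++;
        }
    }
}

int main(void)
{
    double weight[4] = { 0.5, 0.7, 1.0, 1.8 };   /* the worked example above  */
    int    copies[4];
    systematic_resample(weight, copies, 4);
    for (int k = 0; k < 4; k++)
        printf("object %d: %d copies\n", k + 1, copies[k]);
    return 0;
}
```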
I suggest that Δβ should be chosen so that in one cooling step no object is copied by more than some maximum -- to be supplied by you the user as a floating-point number Rate (it need not be an integer, and will often be set less than 1 for safety):

    w_max = 1 + Rate.





w_max is a monotonically increasing function of Δβ, so this equation is easily soluble numerically. However, accidental coalescence of all the likelihoods L_k could make Δβ damagingly large. To make this less likely even when the ensemble is small, we can include a dozen or so extra likelihoods taken from recent ensembles. Even if these are a little atypical for the current coolness, it won't matter because the enhanced range will merely slow the cooling down a little, adding to safety. As a second, probably un-necessary, line of defence, Δβ can be prevented from more than doubling between cooling steps. Such reductions in Δβ stabilise the annealing schedule. They only matter when N is fairly small, but they enable the schedule to operate correctly even if N is only 1 (when copying objects is impossible). Finally, the user can impose whatever limit on coolness is appropriate, either 1 for Bayesian calculations, or no limit for maximisation.

4.1.1. Imperfections

Selective annealing cannot cool perfectly, because the ensemble is only finite. For a start, the member objects must lose some independence through the duplications, so that after 2 or more cooling steps the ensemble is no longer quite at equilibrium. So annealing must still be accompanied by equilibration over ψ: there is no escape from that. If Δβ were to be set large, the ensemble still could not anneal further than to the largest likelihood in its current membership, and this limits the amount of annealing that appears to take place. Also, the estimate of L̄ that underlies the weights fluctuates statistically by N^(−1/2), and this induces a O(N^(−1)) systematic reduction in apparent cooling. Just as in elementary statistics when we estimate variance as Σ(x − x̄)²/(N − 1) instead of naïvely dividing by N, we have the fixup

    Δβ(apparent) = Δβ (N − 1) / N.

Presumably it is slightly better to use this apparent cooling in the actual updating of β, though the effect is invisible if the exploration of ψ is reasonably efficient.


4.1.2. Properties

The most important feature of selective annealing is that objects in the ensemble jump from "bad" low-likelihood positions to such "good" high-likelihood places as have currently been found. Hence objects can escape from local maxima of likelihood without having to find an exit geometrically, or by tunnelling. In a problem having several likelihood peaks, only one object needs to find the "right" peak for all the others to copy across in due course. And, because of the regular nature of copying in this implementation, no "good" object (L ≥ L̄) is ever destroyed. Qualitatively, the cooling speed per step varies inversely with the log-likelihood range. The general formula is non-linear, but the small-speed limit is

    Δβ = Rate / (log L_max − ⟨log L⟩).

Thus, whenever the likelihoods are widely separated because of a phase change or other difficulty in the application, the cooling rate automatically slows in proportion to compensate. Cooling also slows if one object suddenly acquires a much higher likelihood than the others. Again, the program "recognises" that something interesting has happened, and takes care to slow down. In the small-speed limit, the relative information between the beginning and end of a step depends on the standard deviation of the log-likelihood as

    ΔH ∼ [ (log L)_r.m.s. Δβ ]²  ∝  Rate².

Thus selective annealing is broadly in line with the hope that this relative information should be held more-or-less constant. However, I prefer to fix the cooling by selective annealing because (a) it has a direct computational interpretation in terms of copy operations, and (b) it focusses on the single most probable object, which must be right when that one happens to be unusual.

4.2. COMPARISON WITH STATISTICAL THERMODYNAMICS

Statistical thermodynamics is the application of probabilistic methods to large physical systems, and some of its formulas correspond closely with ours. Our basic Bayesian formulas are:

    Prior                   :  π = π(ψ), where symbol ψ includes number of atoms N and all attributes
    Likelihood              :  L = L(ψ), associated with coolness β
    Evidence                :  E = E(β) = ∫ L^β π dψ
    Annealed posterior      :  P = P(ψ) = L^β π / E
    Information = −Entropy  :  H = H(β) = −log E + β d log E/dβ = ∫ P log(P/π) dψ
                            :  ⟨log L⟩ = d log E/dβ = ∫ P log L dψ
                            :  d log E = ⟨log L⟩ dβ
                            :  H = −log E + β ⟨log L⟩

These are strikingly similar to formulas for a canonical (fixed N) ensemble of V-state systems.

    List of states                :  j = 1, 2, . . . , V
    Energy states                 :  E_j, associated with temperature T
    Canonical partition function  :  Q = Q(T) = Σ_j exp(−E_j / T)
    Probability of state          :  P = P(j | T) = exp(−E_j / T) / Q
    Entropy                       :  S = S(T) = log Q + T d log Q/dT = −Σ_j P(j) log P(j)
    Energy                        :  ⟨E⟩ = T² d log Q/dT = Σ_j P(j) E_j
                                  :  d log Q = ⟨E⟩ dT / T²
                                  :  S = log Q + ⟨E⟩ / T

This invites the following identifications:

    Thermodynamic :  j    E_j          ⟨E⟩         1/T    Q/V    S − log V
    Bayesian      :  ψ    −log L(ψ)    −⟨log L⟩    β      E      −H

Statistical thermodynamics being a well-developed discipline, it may contain other techniques of use to us. In particular, there is a grand canonical ensemble which explicitly allows N to vary according to a chemical potential µ. Actually, the list of states in the canonical ensemble could already include differing numbers N of component parts, so from our point of view the grand canonical ensemble may offer nothing new. Let's see. The thermodynamic formulas for the grand canonical ensemble are:

    List of states                      :  j = 1, 2, . . . for each number of atoms N
    Energy states                       :  E_j(N), with temperature T
    Grand canonical partition function  :  Ξ = Ξ(T, µ) = Σ_{N,j} exp(−(E_j(N) − Nµ)/T)
    Probability of state                :  P = P(N, j | T, µ) = exp(−(E_j(N) − Nµ)/T) / Ξ
    Entropy                             :  S = S(T, µ) = log Ξ + T ∂log Ξ/∂T = −Σ P(N, j) log P(N, j)
    Number                              :  ⟨N⟩ = T ∂log Ξ/∂µ = Σ P(N, j) N
    Energy                              :  ⟨E⟩ = T (µ ∂log Ξ/∂µ + T ∂log Ξ/∂T) = Σ P(N, j) E_j(N)
                                        :  S = log Ξ + (⟨E⟩ − µ⟨N⟩)/T

The two "potentials" T and µ allow a variety of "annealing" paths through the two-dimensional (T, µ) space, and it looks as if this could enable more direct control over the number of atoms. The geometric prior Pr(N) = π(N) ∝ c^N in particular looks attractively similar to the chemical potential factor e^(Nµ/T). However, in a Bayesian calculation powering the prior π is quite different from powering the likelihood L. For a start, it interferes with normalisation, which matters for a prior but not for a likelihood. At a more formal level, the prior represents our prior assumptions and is a measure. It is an intrinsically additive quantity, which can only appear linearly, as in ∫ π(ψ) · · · dψ = ∫ · · · dπ. If we were, regardless, to raise π to (say) the zeroth power, we would in effect be assigning uniform weight with respect to the coordinates. But coordinates are arbitrary, raising a difficulty about consistent treatment. And, in the case of a number of atoms, we certainly do not want to assign uniform weight over a possibly unlimited number. In our Bayesian context, different numbers of atoms are not a-priori-equivalent. So we do not want the powering coefficient µ/T to vanish. That sits uneasily with wanting T to anneal downwards from ∞, and suggests annealing with µ/T held constant. Which removes the whole point of considering the grand canonical ensemble in the first place, because we already have µ/T = log c = constant as one of our priors. Out of interest, I have tried "annealing" µ instead of β, starting with µ = −∞ at which no atoms are present, and gradually allowing more atoms to appear. The method was hopelessly inefficient, because the full force of the likelihood was seen by each fresh atom, and this was wholly unsympathetic to easy movement. Finally, it would be permissible to factorise the likelihood into two or more factors L = L1 L2 · · ·, each with its own coolness β1, β2, · · ·, and to take an arbitrary path from coolnesses all 0 to all 1. But I see no advantage.



5. The BayeSys program

The following parts of this manual document what is actually implemented, which for various reasons can differ in detail from the foregoing tutorial material. BayeSys (an acronym for Bayesian System) is a program for sampling the posterior of a system having an atomic prior, along with calculating the evidence value. It incorporates several engines for creating, destroying, and moving the atoms efficiently. As it attempts to modify the atoms, it passes each suggested modification to your user-procedures, which inform it about the associated likelihood change. In the light of this, BayeSys accepts, rejects, or changes its suggestion. After each complete iterate, when each atom has likely been changed, BayeSys passes all its atoms back to your procedure UserMonitor, for display of the current ensemble of sample objects {ψ}, and collection of any statistics you want. At this point, you may want to re-calibrate any "nuisance" parameters that had to be assigned. More properly, you should re-sample them from their posterior probability as calculated in accordance with the current object they are related to. This alternate sampling of relevant parameters (here the atoms and the nuisance parameters) is known as "Gibbs sampling" [ref. Geman and Geman]. It ensures that the joint posterior of atoms and nuisance parameters will be explored faithfully, which is how nuisance parameters are correctly estimated and eliminated.

[Flow diagram: Start → Initialise parameters → BayeSys/MassInf proposes the atoms of a new object ψ → User procedures respond to new or deleted atoms and/or positions → UserMonitor displays the object, collects statistics and returns any new parameters → back to BayeSys/MassInf, until Finish.]

It is also your responsibility to demand that the computation finish -- by signalling from UserMonitor.

5.1. THE BAYESYS PRIOR

As programmed in BayeSys, the prior Pr(n) for the number of atoms can be styled "uniform", "Poisson" or "geometric", according to the sign (0, +, −) of the input parameter Alpha = α. In each case, the distribution is qualified by a minimum number M and a maximum number N supplied through parameters MinAtoms and MaxAtoms,

    M ≤ n ≤ N.

M is given directly by MinAtoms. It must be positive (1, 2, . . .) and not 0 (or negative). This means that BayeSys excludes the null object with no atoms -- a restriction imposed for the convenience of both author and user. If you the user really want to include the null object, you can treat it as an alternative prior hypothesis, weighted as always through Bayes' theorem with

    Pr(D | n > 0) = Evidence from BayeSys (given as its logarithm);
    Pr(D | n = 0) = Likelihood(Data | null object).

N is given by MaxAtoms, usually directly, except that MaxAtoms = 0 codes for N = ∞ (absence of a maximum). Omitting the maximum is prohibited for a uniform prior, which would be improper. N must, obviously, be at least as large as M. Equality is allowed, and forces BayeSys to use exactly that number of atoms.


The "uniform" prior (Alpha = 0) is Pr(n | = 0) = (N - M + 1) with mean and variance n = 1 (N + M ), 2 var(n) =
1 12 -1

for M n N

(N - M )(N - M + 2).

The "Poisson" prior is specified by Alpha = > 0. With finite maximum, the implementation is binomial, offset by the minimum number. Pr(n | > 0) = (N - M )! q (n - M )! (N - n)!
n-M

(1 - q )N

-n

,

q=

+N -M

Supplying the binomial rate through Alpha = rather than directly as q makes it impossible to supply an unusable value q > 1. The binomial mean and variance are n = (1 - q )M + q N , var(n) = (N - M ) q (1 - q ).

If the maximum is omitted, the binomial reduces to Poisson, offset by the minimum number:

    Pr(n | α > 0) = e^(−α) α^(n−M) / (n − M)!,    ⟨n⟩ = M + α,    var(n) = α.

The "geometric" prior is specified by Alpha = < 0, exhausting all the setting possibilities. It is Pr(n | < 0) = 1-c 1 - cN -M cn-M , c= | | . || + 1

+1

Again, supplying the ratio through Alpha in this way makes it impossible to supply a ratio c > 1, which could be unusable. The limit α → −∞ is the uniform case alternatively coded as Alpha = 0. If the maximum is omitted, the geometric prior is no longer truncated and is

    Pr(n | α < 0) = (1 − c) c^(n−M),    c = |α| / (|α| + 1),    ⟨n⟩ = M + |α|,    var(n) = |α| (|α| + 1).

Thus, as for the Poisson prior, α directly yields the suggested number of atoms. Your selection of prior should be influenced by your expectation of the degree of complexity n that you expect in the object, and by the uncertainty var(n) of that estimate. For many applications, I prefer to keep options open by omitting the maximum number of atoms and setting the minimum to 1. Then Alpha can be set to the number of atoms I expect to see, usually made negative because the geometric distribution is less committal than the Poisson (α ± α instead of α ± √α). However, the precise settings ought not to influence the results very much. If they do, seek a reason because something may be amiss. The last parameter needed to define the prior is Ndim, the number of attributes per atom, otherwise known as the dimensionality d. This has to be a positive integer. Each attribute or coordinate is restricted to the range [0, 1], and the prior is uniform over the unit hypercube [0, 1]^d. The Peano-curve transformations of the hypercube are handled within BayeSys, and need not concern the applications programmer. BayeSys will provide you with suggested positions for its atoms, and in fact each coordinate will always be an odd multiple of 2^(−33) (assuming a 32-bit word length). You do not have to provide any positions yourself, but you can instruct your user-procedures to reject particular positions. If you do this, BayeSys will omit those parts of the hypercube from your prior.

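As a quick numerical cross-check of these formulas, here is a small sketch of my own (not part of the BayeSys API, though the parameter names mirror the manual's Alpha, MinAtoms and MaxAtoms) that returns the mean and variance of the prior number of atoms:

```c
#include <stdio.h>
#include <math.h>

/* Mean and variance of the prior on the number of atoms, per the formulas above.
   MaxAtoms = 0 means "no maximum" (not allowed for the uniform prior). */
void prior_moments(double Alpha, int MinAtoms, int MaxAtoms,
                   double *mean, double *var)
{
    double M = MinAtoms, N = MaxAtoms;
    if (Alpha == 0.0) {                       /* uniform, maximum required    */
        *mean = 0.5 * (N + M);
        *var  = (N - M) * (N - M + 2) / 12.0;
    } else if (Alpha > 0.0) {                 /* Poisson, binomial if capped  */
        if (MaxAtoms == 0) { *mean = M + Alpha;  *var = Alpha; }
        else {
            double q = Alpha / (Alpha + N - M);
            *mean = (1 - q) * M + q * N;
            *var  = (N - M) * q * (1 - q);
        }
    } else {                                  /* geometric                    */
        double a = fabs(Alpha), c = a / (a + 1);
        if (MaxAtoms == 0) { *mean = M + a;  *var = a * (a + 1); }
        else {                                /* truncated: sum it directly   */
            double norm = 0, m1 = 0, m2 = 0;
            for (int n = MinAtoms; n <= MaxAtoms; n++) {
                double p = pow(c, n - MinAtoms);
                norm += p;  m1 += n * p;  m2 += (double)n * n * p;
            }
            *mean = m1 / norm;
            *var  = m2 / norm - (*mean) * (*mean);
        }
    }
}

int main(void)
{
    double mean, var;
    prior_moments(-10.0, 1, 0, &mean, &var);  /* geometric, about 10 atoms expected */
    printf("geometric: mean %.1f, sd %.1f\n", mean, sqrt(var));
    return 0;
}
```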

5.2. THE BAYESYS ENGINES

BayeSys has several MCMC exploration algorithms or "engines", currently LifeStory1, LifeStory2, GuidedWalk, Leapfrog1, Leapfrog2, Chameleon1 and Chameleon2. The LifeStory engines control the birth and death of atoms, and allow their movement along the space-filling curve. The GuidedWalk and Leapfrog engines allow atoms to move geometrically within the coordinate hypercube, thus letting the ensemble learn about and use the shape of the likelihood function. The Chameleon engines let atoms jump between the objects in your ensemble, thus allowing the objects to communicate. You can control the engines by setting the 7-bit-integer input parameter Method as follows (high bits on the left, low bits on the right):

    Bit value    64           32          16          8            4            2            1
    1 = ON       GuidedWalk   Leapfrog2   Leapfrog1   Chameleon2   Chameleon1   LifeStory2   Peano
    0 = OFF      off          off         off         off          off          LifeStory1   raster

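In code, Method is just a bit mask. A small sketch follows; the macro names are my own shorthand, only the numerical values come from the table above.

```c
#include <stdio.h>

/* Bit values of the Method parameter, per the table above */
#define PEANO       1   /* else raster topology      */
#define LIFESTORY2  2   /* else LifeStory1           */
#define CHAMELEON1  4
#define CHAMELEON2  8
#define LEAPFROG1  16
#define LEAPFROG2  32
#define GUIDEDWALK 64

int main(void)
{
    /* Leapfrog1 and Chameleon2, with LifeStory2 and Peano topology */
    int Method = LEAPFROG1 | CHAMELEON2 | LIFESTORY2 | PEANO;
    printf("Method = %d\n", Method);    /* prints 27 = 16 + 8 + 2 + 1 */
    return 0;
}
```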
The lowest bit of Method switches between topologies. If the bit is ON the Peano curve is used, otherwise a simple raster is used (with attribute coordinates 0, 1, 2, . . . taking progressively lower precedence). In one dimension, the two topologies are equivalent, so the switch has no effect. Generally, I recommend using the Peano topology unless you have a specific contrary need. The next-to-lowest bit of Method switches between LifeStory1 and its more sophisticated form LifeStory2. These two are the only engines in BayeSys that are guaranteed to mix all the states irreducibly, so one of them must be present, and you are given no way of switching them both off. Generally, I recommend the extra power of LifeStory2, although its iterations are more expensive. The highest five bits of Method switch the other individual engines on or off. For example, Method=27 (= 16+8+2+1) uses Leapfrog1 and Chameleon2, with LifeStory2 but not LifeStory1, in the context of Peano topology. Future development may yield new engines switched by higher bits. None of the Method settings should change the ultimate statistical properties of the results, because BayeSys always aims to explore the posterior. It is only the efficiency of that exploration which changes. I suggest the strategy of developing an application with all bits ON (i.e. Method = 127 or equivalently Method = -1). Then, if you need to save computer resources, try switching bits OFF as long as your results are undamaged. Remember that you can control the speed of annealing with parameter Rate. All the engines operate in the environment of an ensemble of N objects (input as parameter ENSEMBLE), each being a sample from the posterior distribution, as currently annealed. The LifeStory engines operate on just one object at a time, but the others can use two or even three. To control this, we consider the ensemble to be a single supersystem {ψ1, ψ2, . . . , ψN} with its own prior π(ψ1) π(ψ2) · · · π(ψN) and its own likelihood L(ψ1) L(ψ2) · · · L(ψN), giving its own posterior P(ψ1) P(ψ2) · · · P(ψN), which factorises cleanly into the required posteriors of the constituent objects. Equilibrating just one object at a time is like exploring by simple Gibbs sampling, but the supersystem allows more general exploration. Dimensionality need not be a curse; here it is an opportunity. There is a reason for having several engines. Any individual engine can be thwarted by a particular form of likelihood. Even if provably convergent, it may be impractically slow. But a different engine might overcome the defect, only to be thwarted elsewhere. The combination of both engines will only be defeated by likelihoods that defeat each. BayeSys runs with up to six engines, and roughly balances its computation time between them. At worst, the cost of running six engines instead of the best alone is only a factor of six. The payoff comes when one of the engines rescues the system from a location that would trap the others. We give no general guarantee of convergence. A sufficiently perverse likelihood will defeat any engine, even if composite, and you may never be aware of it. Indeed the scope for difficulty is so wide that "most"


conceivable likelihoods will defeat any engine we are ever likely to build. On the other hand, our actual applications involve data that we feel will be interpretable, otherwise the data would not have been collected in the first place. We specialise in soluble problems!

5.2.1. LifeStory1

The LifeStory1 engine operates on just one object in the ensemble, and combines the processes of birth, death, and movement -- hence the name -- in a rather natural way. As explained above, the transition scheme for exploring the prior number n of atoms uses a birth rate β_n and death rate n per unit artificial time. When an event occurs, being birth or death in ratio β_n : n, the atomic number is to be incremented or decremented as appropriate:

    n --(birth)--> n+1,        n --(death)--> n−1.

More slowly, but with equal validity, we can change the atomic number with probability 50%, leaving the indeterminate composite to be resolved afterwards:

    n --(birth)--> { n+1 and/or n },        n --(death)--> { n and/or n−1 }.

With likelihood factors L_j for j atoms, we proceed to sample from a birth composite, which has likelihood L_birth = ½ (L_n + L_{n+1}), according to the relative individual likelihoods:

    Pr(n+1 | birth composite) = L_{n+1} / (L_n + L_{n+1}),
    Pr( n  | birth composite) = L_n / (L_n + L_{n+1}).

Likewise for a death composite of likelihood L_death = ½ (L_n + L_{n−1}):

    Pr( n  | death composite) = L_n / (L_n + L_{n−1}),
    Pr(n−1 | death composite) = L_{n−1} / (L_n + L_{n−1}).

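A small sketch of resolving such a composite (my own illustration, not BayeSys code): given the two likelihood factors, the outcome is drawn in proportion to them.

```c
#include <stdio.h>
#include <stdlib.h>

/* Resolve a birth composite: return n+1 with probability
   L_{n+1}/(L_n + L_{n+1}), otherwise stay at n.
   A death composite works the same way with L_{n-1} in place of L_{n+1}. */
int resolve_birth_composite(int n, double L_n, double L_n1)
{
    double u = rand() / ((double)RAND_MAX + 1.0);
    return (u * (L_n + L_n1) < L_n1) ? n + 1 : n;
}

int main(void)
{
    /* if the extra atom doubles the likelihood, birth succeeds 2/3 of the time */
    int births = 0;
    for (int i = 0; i < 30000; i++)
        if (resolve_birth_composite(5, 1.0, 2.0) == 6) births++;
    printf("birth accepted in %.2f of trials (expect 0.67)\n", births / 30000.0);
    return 0;
}
```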
When atoms have attributes, birth of an atom is accompanied by selection of a random coordinate u, and death is accompanied by selection of a random identifiable-by-location atom, as flagged by downward arrows in the diagram below.


[Diagram: a birth composite inserts a new atom at the randomly selected coordinate (n → n+1 or n), while a death composite selects an existing atom (n → n or n−1); downward arrows mark the selected position or atom.]

The likelihoods are promoted to functions of the position of the atom being born or killed, in the presence of the other atoms, and the reason for introducing intermediate composite states now appears. The new position, whether that selected for birth or that of the atom under sentence of death, need not be immediately accepted or rejected by Metropolis-Hastings on the basis of that position only. Instead, movement is almost guaranteed by using slice sampling to adjust the composite state's position, using whichever of L_birth and L_death is appropriate. Slice sampling centres on the original selected position, and for efficiency it is carried out between the left and right neighbour atoms of that initial choice.
[Diagram: birth and death composites with attributes; the composite position is slice-sampled between the Left and Right neighbour atoms.]

The birth or death of a complete atom might well cause significant change in the likelihood, and it is wise to couple the opportunity with a wider view of the possibilities by allowing movement. If a birth event succeeds in incrementing the number of atoms, all well and good. A change has been made, and slice sampling has likely improved the suggested location. If it fails, then there has been no change, which is regrettable. If a death event succeeds in decrementing the number of atoms, a change has been made. If not, then at least the selected atom will almost certainly have been moved. So, overall, LifeStory1 should make useful changes at least half the time. The illustration below shows the range allowed to a selected location X, constrained to move along the Peano curve (with origin randomised to (5, 7) on a wraparound 2^6 x 2^6 grid) between its left and right neighbours L and R. With respect to the coordinates, this range is very roughly circular out to the distance of the neighbours.

[Illustration: the range allowed to the selected location X along the Peano curve, bounded by its left neighbour L and right neighbour R.]
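The slice-sampling step itself can be sketched as below: a standard shrinking-bracket slice sampler confined between the positions of the left and right neighbours. This is an illustrative stand-in assuming a one-dimensional position along the Peano line and a log-likelihood callable; it is not the BayeSys routine itself.

    import math, random

    def slice_move(log_like, x0, left, right, rng=random):
        # draw the slice level: log of Uniform(0, L(x0))
        level = log_like(x0) + math.log(rng.random())
        lo, hi = left, right            # bracket fixed by the neighbour atoms
        while True:
            x1 = rng.uniform(lo, hi)    # propose within the current bracket
            if log_like(x1) >= level:
                return x1               # accepted: lies inside the slice
            if x1 < x0:                 # rejected: shrink towards the start point
                lo = x1
            else:
                hi = x1

    # example: likelihood peaked at 0.3, neighbours at 0.1 and 0.7
    print(slice_move(lambda x: -100.0 * (x - 0.3) ** 2, 0.25, 0.1, 0.7))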

5.2.2. LifeStory2

The LifeStory2 engine, like LifeStory1, operates on just one object, and combines the processes of birth, death, and movement. The difference is that one of the neighbouring atoms is also allowed to move. Because they are indivisible changes, birth and death are often discriminated against by the associated changes in likelihood. The existing n atoms may have equilibrated as best they can, so that inserting or deleting a complete atom might always be unlikely, even though re-equilibration with n ± 1 atoms might make the change acceptable or even preferable. In this situation, LifeStory1 is thwarted. LifeStory2 allows a neighbouring atom to move aside to make room for the insertion, or move closer to compensate the deletion, so that the engine can jump around the barrier and equilibrate the number of atoms more effectively. In this way, a locally dominant constraint (such as mean position or total intensity) can remain satisfied even while an atom is created or destroyed.
[Diagram: LifeStory2 birth composite (n+1 atoms or n) and death composite (n atoms or n-1), in which a neighbouring atom "+" joins the selected position in the move; horizontal arrows show the slice-sampling ranges between the Left and Right neighbours.]

The diagram shows the leftward neighbour "+" of the selected birth or death position being included in the process: inclusion of the rightward neighbour would be equally likely. In the upper state of either composite there are now two atoms that can be moved around, and in the lower state there is one. Slice sampling can equally well be carried out on two coordinates as on one. It centres on the original selected positions, and for efficiency it is done between the left and right neighbour atoms of those initial choices, as shown by the horizontal range arrows. In the 2-dimensional illustration, both X and its left neighbour (now called Y) can move between the further-left neighbour F and the original right neighbour R, allowing synergy between X and Y across a somewhat greater range.

[Illustration: the positions X and Y can both move along the Peano curve between the further-left neighbour F and the right neighbour R.]

If a birth event succeeds in incrementing the number of atoms, a good change has been made, with likely improvement to the suggested random location, and compensatory movement of a neighbour. Even if it fails, the neighbour will almost certainly have moved, so some change will have occurred. If a death event succeeds in decrementing the number of atoms, a neighbour will have moved to compensate, allowing the change to be more sympathetic to the data (hence more favoured). If not, then at least the selected atom and a neighbour will almost certainly have both moved. So, overall, LifeStory2 should almost always make useful changes, and thereby should be at least twice as powerful as LifeStory1.

One can envisage generalising LifeStory2 by allowing yet more atoms to move. However, bringing in more atoms will expand the domain they cover, and may well bring in additional constraints. Slice sampling would have to work harder to find an acceptable pattern, and the movements would be correspondingly smaller. My judgment is that LifeStory2 is often likely to be about best.

5.2.3. GuidedWalk

The LifeStory engines explore a very roughly spherical domain about one or two selected positions. Especially in many dimensions, where the likelihood function is increasingly capable of constraining an atom anisotropically, this can be inefficient. An atom is allowed to move only a short distance along strongly-constrained directions, and this restriction carries over to weakly-constrained directions if exploration is isotropic, making their exploration slow. As it happens, the number of attributes (dimensions) ascribed to an atom is often small, so the mere fact of using an atomic prior much reduces the difficulty: we seek to control only one or two atoms at a time, not the entire object. Even so, it may be useful to avoid the inefficiency by using the local shape of the likelihood function. In "classical analogue" style, high-order methods such as conjugate gradient (for maximisation) and hybrid Monte Carlo (for probabilistic exploration) come to mind. These methods use a sequence of intermediate evaluations (of gradient as well as value of likelihood) to discover and incorporate the local shape.

Alternatively, we could simply use the ensemble. Let X be a randomly-selected atom that we wish to move, in accordance with the local likelihood function. We may presume that the existence of X at its particular location already represents some feature of the likelihood, in which case other objects should also have atoms in similar location representing the same feature. Let L and R be left and right neighbour atoms of the position of X, drawn from different objects. If, indeed, each object contains just one atom for the feature in question, then that will lie fairly closely left of X along the Peano curve half the time, and fairly closely right of X the other half. So L from its object should be the appropriate corresponding atom about half the time, and so should R from its object. Even if the precise correspondence and symmetry are relaxed, there should remain an appreciable O(1) probability P that L and R have a similar environment to X. The offset vector (R - L) will then be obedient to the extent and shape of the local likelihood function. Hence it can be suggested as an appropriate increment for the position of X, yielding a new trial location

    X_trial = X + s (R - L),    s = O(1).
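A minimal sketch of the trial step (not the library's code), assuming positions are held as coordinate vectors in the unit hypercube with wraparound:

    import numpy as np

    def guided_trial(X, L, R, s):
        # step along the ensemble-derived direction (R - L); in BayeSys the
        # scale s is effectively chosen by slice sampling along the discrete
        # "staircase" described below
        return (X + s * (R - L)) % 1.0   # wraparound assumed here for illustration

    X = np.array([0.42, 0.61])
    L = np.array([0.40, 0.58])
    R = np.array([0.45, 0.66])
    print(guided_trial(X, L, R, s=1.0))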

Instead of a quasi-isotropic random walk, we have a guided walk in the direction ±(R - L). As usual, the scale s of the change cannot be too large. Larger values of |s| increase the risk that the neighbours of X_trial will no longer be L and R, breaking detailed balance. With randomly scattered atoms and |s| = 1, the neighbours will only be correct with probability O(2^-d), basically because there are 2^d quadrants in which other atoms might intervene. That would limit |s| to 1/2 or so, which reduces the potentially interfering atoms by the compensating volumetric factor 2^d. Moreover, if there are ν Gaussian constraints active on an atom, then |s| is limited to about ν^(-1/2), otherwise the trial location will usually be rejected. However, if the local atoms do suffice to represent a local likelihood feature, then we might be allowed to go further, half way out to the next nearby feature. It is hard to make this sort of argument precise, and fortunately we can use slice sampling to keep s nearly optimal, whatever the restrictions.

Suppose there is one (or more) direction that is completely unconstrained, so that the likelihood function is extremely anisotropic. At any stage in the computation, the ensemble will have its corresponding atoms X, L, R, . . . distributed somehow along this axis, say with variance <x^2> = σ^2. After a step of the guided walk with s = ν^(-1/2), the variance of the newly accepted X_trial will be (1 + 2s^2) σ^2 = (1 + 2/ν) σ^2, greater than before by a factor 1 + 2/ν. Hence the spread of atoms along the unconstrained direction(s) increases exponentially, with σ_x gaining a factor like e^P after ν ensemble-wide applications of the guided walk. Exponential growth occurs because the algorithm is invariant under affine transformation, so it automatically adjusts to whatever scales are operative (until supposedly-corresponding atoms become so distant that they can no longer be identified as such). Of course, we do not expect to recognise exponential behaviour in realistic applications. What we may see, though, is easy exploration of badly conditioned likelihood functions.

To implement GuidedWalk, we extend the vector (R - L) across the unit hypercube, starting at the arbitrary origin of the current space-filling Peano curve and continuing until the most-rapidly-changing coordinate increments or decrements by 1. On the digital B-bit grid, this defines a "staircase" of 2^B points parameterised by that coordinate. This is displaced along the less-rapidly-changing axes until it passes through X, as shown by the circles in the illustration below (note the horizontal wraparound of 5 points caused by the Peano origin being randomised to (5, 7)).




[Illustration: the GuidedWalk staircase of trial points, displaced to pass through X, with the neighbour atoms L and R; filled circles mark the trial points for which L and R remain the left and right neighbours.]

Filled circles identify those points of the staircase which lie on that part of the Peano curve for which L and R remain the left and right neighbours in their own objects. These are the trial points that are in detailed balance with X. To select one, we use slice sampling along the staircase, which almost guarantees acceptable movement along what is appreciably often a useful direction.

5.2.4. Leapfrog1 and Leapfrog2

There are simpler ways of using the neighbours L and R of a selected atom X. Leapfrog1 uses just one leftward or rightward neighbour, L or R, preferably from an object different from X though it need not be. It takes

    X_(1) = 2L - X    or    X_(1) = 2R - X

as a trial location, inverting X through L or R without any tunable coefficient. The trial location must be in detailed balance with X. Hence if L or R, in its member, is the left (right) neighbour of the location of X, then it must also be the right (left) neighbour of X_(1) without any other intervening atoms. Whether this is likely depends on how atoms are distributed locally. Provided this neighbourhood condition holds, the Metropolis-Hastings rule can be used to accept or reject the suggestion on the basis of its likelihood relative to X.


Leapfrog2 uses two neighbours, L to the left and R to the right, preferably drawn from objects other than X, but they could be both from the same as X. It takes

    X_(2) = L + R - X

as a trial location, inverting X through the midpoint of L and R (which may be usefully closer to the centre of the local feature than either L or R individually). The trial location will be in detailed balance if L remains its leftward and R remains its rightward neighbour. Provided this neighbourhood condition holds, the Metropolis-Hastings rule can again be used to accept or reject the suggestion on the basis of its likelihood relative to X. In the illustration below, Leapfrog1 can invert X to either of the points marked 1, and Leapfrog2 can invert X to 2.

[Illustration: the Leapfrog1 trial points (marked 1), obtained by inverting X through L or R, and the Leapfrog2 trial point (marked 2), obtained by inverting X through the midpoint of L and R.]
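A minimal sketch of the two trial constructions and the accept/reject step, assuming coordinate vectors and likelihood values; the neighbourhood condition, which must hold before acceptance, is not checked here.

    import numpy as np

    rng = np.random.default_rng(0)

    def leapfrog1_trial(X, N):
        return 2.0 * N - X          # invert X through one neighbour N (= L or R)

    def leapfrog2_trial(X, L, R):
        return L + R - X            # invert X through the midpoint of L and R

    def metropolis_accept(like_new, like_old, rng=rng):
        return like_new >= rng.uniform(0.0, like_old)   # accept if L_new >= Uniform(0, L_old)

    X, L, R = np.array([0.42, 0.61]), np.array([0.40, 0.58]), np.array([0.45, 0.66])
    print(leapfrog1_trial(X, L), leapfrog2_trial(X, L, R))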

Like GuidedWalk, Leapfrog1 and Leapfrog2 are capable of increasing the spread of atoms along unconstrained directions exponentially. And, lacking the slice-sampling loop, they are simpler and faster to compute. However, if the atoms are subject to several constraints, the suggested changes will usually be rejected. Supposing that the local atoms are distributed in accordance with a locally Gaussian likelihood having ν constraints, each atom's chisquared misfit should be about ν. But, according to the generating formulas, the misfits at the trial locations are likely to be larger: 5ν for Leapfrog1 and 3ν for Leapfrog2. The net effect is that the acceptance rate drops exponentially with the number ν of constraints, roughly like e^(-0.4ν) for Leapfrog1 and e^(-0.25ν) for Leapfrog2. This means that the Leapfrog engines can only operate well when the individual atoms have just a few constraints, but that will certainly be true if their number of attributes d is small.

No matter how long they are run for, the Leapfrog engines can only move atoms around the discrete lattice of points being integer combinations of the original positions. Whether or not this pattern of points is technically irreducible depends on Diophantine subtleties of whether the offsets in this lattice are co-prime in each dimension with the number of grid points, here 2^B. So the test of irreducibility may be less clear


than one might suppose. In practice, we play safe and consider the Leapfrog engines to lack the irreducible property, resolving always to use them alongside irreducible engines like LifeStory which do provide full exploration.

5.2.5. Chameleon1

The Chameleon1 engine operates on two objects. It tries to make a randomly chosen atom jump from one object to another.

[Diagram: a randomly chosen atom jumps from the source object (n atoms) to the destination object (m atoms).]

The idea is that an atom might be better placed in an object other than its source, especially if a useful patch of the hypercube is only just being discovered. Occasionally this could help the destination object out of a trap, so the engine could be worth including even if most of its transitions were rejected. Chameleon1 cannot be used on its own because its transitions are not irreducible: it can only use existing atom positions and cannot generate all the other possibilities. Hence it must be used with an irreducible engine such as LifeStory.

Care is needed with detailed balance. Let the original ensemble state "i" have n atoms in the source object and m in the destination. This occupancy has prior probability π(n) π(m). The destination state "j" will have n-1 and m+1 atoms respectively, with prior probability π(n-1) π(m+1). We require balance between the forward and backward transitions between these states.

[Diagram: state i, with n and m atoms and prior π(n) π(m), linked to state j, with n-1 and m+1 atoms and prior π(n-1) π(m+1), by the forward rate T_ji and the backward rate T_ij.]

Hence the transition ratios must obey

    T_ji / T_ij = π(n-1) π(m+1) / [ π(n) π(m) ] = n β_m / [ (m+1) β_{n-1} ]

(remembering the birth rates β_n and death rates δ_n = n that faithfully sample the prior, so that π(n-1) β_{n-1} = π(n) n). This is implemented with a forwards rate T_ji = n β_m dt in infinitesimal interval dt of artificial time, implying the balancing backwards rate T_ij = (m+1) β_{n-1} dt. Starting with two objects with n and m atoms, jumps of an atom from n to m occur at rate n β_m, whereas jumps the other way from m to n occur at rate m β_n.

[Diagram: from occupancy (n, m), a jump at rate n β_m leads to (n-1, m+1), and a jump at rate m β_n leads to (n+1, m-1).]

The time to the next event is exponentially distributed as

    Pr(t) = (n β_m + m β_n) exp[ -(n β_m + m β_n) t ].

When that event occurs, it is either n-to-m or m-to-n in ratio n β_m : m β_n. Using a regular interval of O((n + m)^-1) between observations offers each atom an O(1) chance of jumping. This is the natural "period" for which to run Chameleon1 between a given pair of objects, randomly selected from the ensemble. A suggested jump is accepted or rejected on the basis of the ensemble likelihood

    L = ∏_objects L(object)

in which the only changeable factors are the likelihoods of the source and destination. According to the Metropolis-Hastings rule, a jump is accepted if

    L_new >= Uniform(0, L_old)

and rejected otherwise.
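A minimal sketch of one Chameleon1 event under these rates, assuming the birth rates β (written beta below) and the ensemble likelihoods are available; this is illustrative, not the library's own routine.

    import random
    rng = random.Random(1)

    def next_jump(n, m, beta_m, beta_n, rng=rng):
        rate_fwd, rate_bwd = n * beta_m, m * beta_n   # n-to-m and m-to-n rates
        total = rate_fwd + rate_bwd
        t = rng.expovariate(total)                    # exponential waiting time
        forward = rng.random() < rate_fwd / total     # direction in ratio n*beta_m : m*beta_n
        return t, forward

    def accept(L_new, L_old, rng=rng):
        return L_new >= rng.uniform(0.0, L_old)       # Metropolis-Hastings rule

    print(next_jump(n=7, m=4, beta_m=3.0, beta_n=3.0))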

It would be possible to use slice sampling to merge an atom's jump between objects with movement along the Peano line, as was done in LifeStory for birth and death. However, even that version would still be constrained to a fixed total number of atoms in the ensemble, so the engine by itself would only be cleanly irreducible and faithful to the prior for individual objects if the ensemble was very large. I do not expect the extra complication would yield any worthwhile advantage in power.

5.2.6. Chameleon2

In the Chameleon2 engine, pairs of atoms exchange their ensemble membership. Equivalently, they exchange positions.
[Diagram: an atom is selected in one object and exchanged with a neighbouring atom in another object.]

To encourage synergy between the jumps, the chosen atoms should be close together, so that the exchange atom should be a neighbour (in its own object) of the position of the atom first selected. Detailed balance is then trivially assured. The transitions do not affect the number of atoms in either object, so they stay faithful to the prior without needing to adjust the relative rates. As was done for Chameleon1, it seems appropriate for an iterate to operate the engine on a given pair of objects for long enough that each atom is offered an O(1) chance of jumping. Suggested jumps are again accepted or rejected on the basis of the ensemble likelihood, according to whether or not

    L_new >= Uniform(0, L_old).



6. Massive Inference (MassInf)

MassInf (an acronym for point-Mass atoms in Inference, originally with particular reference to mass spectrometry) is an extension to BayeSys, for use when attributes include intensity parameters which relate to linear data having Gaussian errors. In this special but quite common case, the intensity parameters can be processed semi-analytically, which means less load on the program, and enhanced power. Each atom in the prior model has intensity coordinates z which are treated separately from the other attributes. For example, the atoms for a black-and-white image would have a single intensity coordinate representing brightness, whereas atoms for a colour image might have three, one for each primary colour red, green, blue. The number of such intensities is called Valency as a mnemonic for the number of ways the atom can bond to data. For imagery the intensities would be positive, though in other applications z might take either sign.

6.1. MASSINF PRIORS

Massive inference allows a choice of four priors, selected by the MassInf parameter. In each case, the intensity is factorised as z = q u, where q is a dimensional unit of intensity (common to all the atomic intensities in any single object) and u is the individual dimensionless coefficient. The prior on q is assigned as

    π(q) = q0^-2 q e^(-q/q0)

where q0 is an initial global guess about the expected magnitudes of the atomic intensities, and the "x e^-x" form is chosen to be tractable whilst keeping q away from both 0 and infinity.
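Note that this prior is exactly a Gamma distribution with shape 2 and scale q0, so drawing from it is a single call. A sketch, not MassInf's internal code:

    import numpy as np
    rng = np.random.default_rng(2)

    q0 = 1.0                                  # initial global guess at the intensity scale
    q = rng.gamma(shape=2.0, scale=q0)        # pdf  q0^-2 * q * exp(-q/q0)
    print(q)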

The hyperparameter q0 should have little effect, because q will usually be influenced much more strongly by the numerous data. But it must be given, and MassInf obtains it by peeking at the magnitude of the data. To be absolutely consistent with the strictest Bayesian paradigm, any dimensional dataset should be accompanied by a similarly dimensioned constant giving the expected size of phenomenon being observed. Nobody ever provides this, so nobody ought to object to the Bayesian analyst cheating a little, and peeking at the data to guess it.

The four MassInf priors for the relative intensities u are:

    π(u) = δ(u - 1),                 MassInf = 0  ("monkeys", degenerate case with u fixed);
    π(u) = exp(-u),                  MassInf = 1  ("positive", u > 0, the commonest case);
    π(u) = (1/2) exp(-|u|),          MassInf = 2  ("positive-negative", u has either sign);
    π(u) = exp(-u^2/2) / sqrt(2π),   MassInf = 3  ("Gaussian", u has either sign).

The "monkey" prior (MassInf = 0) models a team of monkeys throwing balls of equal magnitude q at the unit box defined by the remaining coordinates. It was the original motivation for maximum entropy image reconstruction, as developed with the Cambridge group. On dividing the box into cells, the probability that N balls will be distributed as (n1, n2, . . .) is given by the degeneracy

    Pr(n | N) ∝ N! / (n1! n2! · · ·)

which suggests using the entropy S = log(degeneracy) as a prior for intensity. Taking the model literally, though, imposes an unfortunate digitisation of intensity (to integer multiples of q), and as noted in the Overview the smoothed entropy form does not quite work as a Bayesian prior. Maximum entropy is a regularisation method, albeit with strong Bayesian connections, whilst the monkey prior is fully Bayesian.

The "positive" prior (MassInf = 1) avoids the monkey model's digitisation by letting the balls be of variable magnitude, as described by their exponential prior. Although the BayeSys/MassInf system need not and does not do this, the unit box of remaining coordinates can be divided into cells. Choosing a Poisson prior for the total number of atoms, a cell will contain r atoms, themselves distributed as Poisson with mean µ (the overall mean times the fraction of the box occupied by the cell):

    Pr(r) = e^-µ µ^r / r!,    r = 0, 1, 2, . . .


With the intensity of each of the r atoms distributed as

    Pr(u_i) = exp(-u_i),    i = 1, 2, . . . , r,

the distribution for the total intensity τ = u_1 + · · · + u_r in the cell is

    Pr(τ | r) = δ(τ)                      for r = 0,
    Pr(τ | r) = e^-τ τ^(r-1) / (r-1)!     for r > 0,

and the net effect is that the total intensity in the cell is distributed as

    Pr(τ) = Σ_{r=0}^∞ Pr(τ | r) Pr(r) = e^-µ [ δ(τ) + e^-τ sqrt(µ/τ) I1(2 sqrt(µτ)) ]

where I1 is the modified Bessel function of order 1.
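This result is easy to check numerically. The sketch below (an illustration only, using an assumed cell mean µ = 2.5) simulates the compound Poisson-exponential intensity and compares it with the formula.

    import numpy as np
    from scipy.special import i1              # modified Bessel function I_1

    rng = np.random.default_rng(3)
    mu = 2.5                                  # assumed Poisson mean number of atoms in the cell
    r = rng.poisson(mu, size=200_000)
    tau = np.array([rng.exponential(size=k).sum() for k in r])   # total cell intensity

    print((tau == 0).mean(), "vs", np.exp(-mu))   # weight of the delta function at zero

    edges = np.linspace(0.0, 12.0, 49)
    centres = 0.5 * (edges[1:] + edges[:-1])
    empirical, _ = np.histogram(tau[tau > 0], bins=edges, density=True)
    analytic = np.exp(-mu - centres) * np.sqrt(mu / centres) * i1(2.0 * np.sqrt(mu * centres))
    print(np.abs(empirical - analytic / (1.0 - np.exp(-mu))).max())   # should be small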

It is intellectually satisfying to find that the number of atoms can be summed away analytically, albeit at the cost of doing a cell-wise computation afterwards. It is also amusing to note that, almost regardless of the data, the most probable (MAPP) inferred intensity would thereby become precisely zero because of the sole surviving delta function -- a delicious counter-example to MAPP estimation. Pursuing the algebraic development beyond these remarks takes us too far afield, into the realms of measure theory and Lévy-Khinchin representations.

The "positive-negative" prior (MassInf = 2) is the natural generalisation of the positive-only prior for applications where z may take either sign. Its cusp at zero is no disadvantage; it helps to reduce the intensity of faint atoms of doubtful significance.

The "Gaussian" prior (MassInf = 3) is an alternative prior when z may be of either sign. Its main disadvantage over "positive-negative" is that the expected intensity q is close to the r.m.s. value, so tends to be dragged up by any unusually large intensities. This leaves the fainter bulk of the intensities less well controlled, whereas the largest intensities are over-controlled and pulled back by the severe Gaussian tail. Objects of large dynamic range are thus recovered less well. The effect shows up quantitatively as a poorer (lower) value of the evidence for such objects, reflecting the improbability of having most intensities well below the r.m.s. magnitude, and some well above. On the other hand, the Gaussian prior exp(-x^2 - y^2) on two intensities x and y serves as a circularly symmetric prior for a complex z = x + iy, whereas the positive-negative exp(-|x| - |y|) does not.

6.2. MASSINF LIKELIHOOD

For MassInf to be used, the likelihood should factorise as

    Pr(D | atoms) = L_Bayes(x) L_MassInf(z | x)

where x stands collectively for the non-intensity attributes and z for the intensities,

so that the intensity-dependent part L_MassInf is Gaussian, as from linear data. An atom of unit intensity at position x has a footprint (or "point-spread-function" or "Green's function") f(x) over the data, several such if it has several intensities. The footprints of the constituent atoms combine linearly to produce the mock data

    F = Σ_atoms z_i f(x_i)

as a list of numbers which compare with the actual data D. To keep the notation clean, we incorporate the standard deviation uncertainties σ_k into the definitions

    X · Y = Σ_k X_k Y_k / σ_k^2,    |X|^2 = X · X

of inner product and 2-norm. We then write the chisquared misfit as

    χ^2(F) = |F - D|^2

and this gives the likelihood

    L_MassInf(F) = Z^-1 e^(-χ^2/2)


where Z = ∏_k sqrt(2π σ_k^2) is the dimensional normalisation. MassInf uses these formulas to calculate the likelihood efficiently. When an atom is created (with new intensity z) or destroyed (by removing its intensity) or moved (by destroying it at the old location then creating it at the new), the program updates the mock data by

    ∆F = f(x) ∆z

and the likelihood by

    ∆(log L_MassInf) = -∆F · ( ∆F/2 + F - D )
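A sketch of that incremental update, with the σ-weighted inner product written out explicitly (illustrative helper names, not the program's own):

    import numpy as np

    def delta_logL(F, dF, D, sigma):
        # ∆(log L) = -∆F · (∆F/2 + F - D), using the sigma-weighted inner product
        w = 1.0 / sigma**2
        return -np.sum(w * dF * (0.5 * dF + F - D))

    F = np.array([1.0, 2.0]); D = np.array([1.1, 1.8]); sigma = np.array([0.1, 0.2])
    dF = 0.5 * np.array([0.3, -0.1])         # dF = f(x) * dz for the changed atom
    print(delta_logL(F, dF, D, sigma))
    F = F + dF                               # keep the mock data in step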

which should be faster than re-calculating the factor from scratch. The accompanying factor L_Bayes is often just 1 and ignorable, but it need not be. For example, the data might be intensities whose locations are themselves observed uncertainly, in which case the identification of an atom's footprint would become probabilistic, involving a L_Bayes "evidence" factor for its plausibility.

6.3. MASSINF UNIT OF INTENSITY

MassInf uses Gibbs sampling to explore the joint distribution of q and the other variables, alternating between re-calibrating q and performing the rest of its calculation to re-sample the atoms. At the end of each iterate, before returning the updated ensemble to you as a new set of objects, MassInf re-samples q. Temporarily write the mock data as F(z) = q F(u) to make explicit its dimensional scaling with q. With all other variables (including u) fixed, the q-dependence of the likelihood is

    Pr(D | q, . . .) ∝ exp( q F(u) · D - (1/2) q^2 F(u) · F(u) )

and this combines with the prior for q to give a joint distribution

    Pr(D, q | . . .) ∝ q exp( -q/q0 + q F(u) · D - (1/2) q^2 F(u) · F(u) )

which in turn is proportional to the posterior Pr(q | D, . . .). To re-calibrate q, MassInf re-samples from this "x × Gaussian(x)" distribution. That deals with q.
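One hedged way to draw from this one-dimensional "x × Gaussian(x)" density is a simple discretised inverse-CDF, sketched below; this is an illustration only, not MassInf's own sampler. Here a and b stand for F(u)·D and F(u)·F(u).

    import numpy as np

    def resample_q(a, b, q0, rng, n=4096):
        # unnormalised log-density: log q - q/q0 + a*q - b*q^2/2, for q > 0
        hi = max((a - 1.0 / q0) / b, 0.0) + 10.0 / np.sqrt(b)   # generous upper bound
        q = np.linspace(hi / n, hi, n)
        logp = np.log(q) - q / q0 + a * q - 0.5 * b * q * q
        p = np.exp(logp - logp.max())
        cdf = np.cumsum(p)
        return float(np.interp(rng.random() * cdf[-1], cdf, q))

    rng = np.random.default_rng(5)
    print(resample_q(a=20.0, b=8.0, q0=1.0, rng=rng))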

For the rest of the calculation, q is merely a constant that can be included in the appropriate prior for z:

    π(z) = δ(z - q),                          MassInf = 0;
    π(z) = q^-1 exp(-z/q),                    MassInf = 1;
    π(z) = (1/2) q^-1 exp(-|z|/q),            MassInf = 2;
    π(z) = exp(-z^2/(2q^2)) / sqrt(2π q^2),   MassInf = 3.

6.4. MASSINF INTENSITIES

The BayeSys engines modify only one, or sometimes two, atoms at a time. For exposition, we first consider modifying only one atom (say the k'th), in the single-intensity case Valency = 1. Let

    F_- = Σ_{i≠k} z_i f(x_i)

be the mock data from all the other atoms, with the selected k'th removed. With all other variables fixed, the dependence of the likelihood on the selected intensity z_k is

    Pr(D | z_k, x_k, . . .) ∝ exp[ -z_k (F_- - D) · f(x_k) - (1/2) z_k^2 f(x_k) · f(x_k) ]

and this combines with the "positive" prior for z_k to give a joint distribution

    Pr(D, z_k | x_k, . . .) ∝ exp[ -z_k/q - z_k (F_- - D) · f(x_k) - (1/2) z_k^2 f(x_k) · f(x_k) ]

which in turn is proportional to the posterior Pr(z_k | D, x_k, . . .). Hence we can reset z_k by sampling directly from this "truncated Gaussian" distribution: we don't have to guess trial values which may be rejected. The other MassInf priors give similarly tractable distributions.
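For the "positive" prior this is a Gaussian in z_k, with curvature f(x_k)·f(x_k) and linear coefficient 1/q + (F_- - D)·f(x_k), truncated to z_k > 0, so it can be drawn directly. A sketch with the σ-weighted sums written out, not MassInf's internal code:

    import numpy as np
    from scipy.stats import truncnorm

    def sample_zk(F_minus, D, f_k, sigma, q, rng):
        w = 1.0 / sigma**2                              # incorporate the data uncertainties
        A = np.sum(w * f_k * f_k)                       # quadratic (curvature) coefficient
        B = 1.0 / q + np.sum(w * f_k * (F_minus - D))   # linear coefficient
        mean, sd = -B / A, 1.0 / np.sqrt(A)             # untruncated Gaussian parameters
        a = (0.0 - mean) / sd                           # truncate at z_k > 0
        return truncnorm.rvs(a, np.inf, loc=mean, scale=sd, random_state=rng)

    rng = np.random.default_rng(4)
    D = np.array([1.0, 0.5]); F_minus = np.array([0.2, 0.1])
    f_k = np.array([0.8, 0.3]); sigma = np.array([0.1, 0.1])
    print(sample_zk(F_minus, D, f_k, sigma, q=1.0, rng=rng))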


Moreover, the joint distribution is integrable to an error function:

    Pr(D | x_k, . . .) = ∫_0^∞ Pr(D, z_k | x_k, . . .) dz_k = ∫_0^∞ Gaussian(z) dz ∝ erf(coefficients).

This is an explicit expression for the likelihood (or, if you prefer, the evidence) of x_k alone. Hence we can explore the location on its own, with the standard BayeSys engines, without having to find the intensity z at the same time. Dimensionality is reduced and exploratory power is improved. Implicitly, all intensity values are explored in parallel, and we delay picking an intensity until after a new acceptable location has been found.

[Diagram: the old intensity z is killed, the location is explored with all intensities z summed over, and a new intensity z is picked only after an acceptable location has been found.]

Also, Gaussian forms over the positive quadrant remain tractable in 2 dimensions, so that the LifeStory2 engine (which moves two atoms in the same object) still works. The intensities of both atoms integrate out together, and are picked together after the new locations have been found. If an atom has several intensities (Valency > 1), the analysis continues to hold, with the z-integrals being promoted to more dimensions. However, the analytic evaluation of these relies on separability. So, for practical implementation, an atom's footprints over the data must not overlap. Thus you could not use MassInf if you had red, green, blue footprints from the same atom blurred into the same measurements. If you want to use the LifeStory2 engine with Valency > 1, there is another restriction, that a single footprint from one atom must not overlap more than one of the footprints from another.

To run MassInf, all you need to supply are the data D and their uncertainties σ, along with a procedure to give the footprint(s) f(x) at a trial position x. The program takes care of the rest. MassInf can also auto-scale your uncertainties σ for you, if their absolute scale is unknown.



7. Display of results

A posterior object derived from an atomic prior is intrinsically spiky. As a function of position it is a sum of delta functions, one at each location of an atom. Likewise, any accumulation of objects remains a set of delta functions: summing over many objects merely produces more. Yet most likelihood functions are spatially smooth because data have finite resolution. So the actual posterior ought to be correspondingly smooth. It is true that, for any pre-specified resolution, long-term averaging will eventually produce a smooth result, but only after large numbers of atoms have chanced to fall into each (small) cell. This can take far too long to compute directly. For display purposes, we want to show a limited number of atoms as a smooth function. Note that this is not part of Bayesian analysis: it is openly a matter of professional communication and aesthetics.

BayeSys gives you some guidance. Each atom it produces is accompanied by an estimate of its width, in the form of that fraction f of the hypercube's volume that the atom could plausibly range through. You may use this estimate as you please, depending upon your current application. In one dimension (d = 1), the hypercube is just the unit interval [0, 1], and f is a range within it. For display, an atom will presumably be shown as some sort of bell-shaped curve of correct area, correct centre, and width related to f. In two dimensions (d = 2), it would be natural to give each coordinate a width f^(1/2). Generally, in d dimensions, a width f^(1/d) might be appropriate so that the volume remains f. On the other hand, you may wish to divide f differently among the various dimensions, if these represent intrinsically different quantities. You decide.

To find f, BayeSys keeps a historical log of its atoms, biassed towards the more recent and presumably better-equilibrated ones. These many atoms are all put onto a standard Peano curve, along which their mean density

    ρ(x) = (number of atoms in interval) / (length of interval),    normalised to ∫ ρ(x) dx = 1,

is tolerably well defined over intervals wide enough to cover at least several atoms. When a new object is produced, having n atoms, each may be expected to range over a fraction 1/n of the atoms in the log. Perhaps half of this may be assigned to randomness, yielding a width

    f = ∆x = 1 / (2nρ)

expressed as a fraction of the length of the Peano line, or equivalently as a fraction of the volume of the hypercube. This can conveniently be computed as a fixed atom-count across the accumulated log. In "important" parts of the posterior, where accumulated atoms congregate closely, the width is small. It is less than the size of the feature, so does not appreciably degrade the delineation of that part. In unimportant parts of the posterior, where a few atoms may wander about weakly constrained in some broad background, the width becomes large. Such stray atoms are displayed as wide and shallow, so that they no longer obtrude upon the eye as isolated sharp spikes. As a technical refinement, BayeSys uses whichever of the leftward-looking and rightward-looking estimates of ρ is larger. This narrows the width so as to prevent atoms near the edge of an important domain from spilling out into the background, which could degrade the display's visible resolution if it were allowed. The diagram below illustrates a log of atoms "o" placed on the Peano line ".".
Below this are five atoms of a current object "|", and sufficient atoms are in the log that the fixed atom-count is 2 (in practice the count would be more because the log would be fuller, and the log would be sparse among the huge number of points on the Peano line). Arrows in the diagram show how the widths are constructed, by counting 2 atoms across the log to whichever side is closer.

    ..o..oo....oo.oo.....oo.ooo....ooooooo..ooooo...ooo.o...o...o....o...
    -|- -- -|- -- -|- -|- - -|- - -- --

Atoms in or at the edge of a dense region are narrow, whereas atoms in sparse regions are wide.
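A minimal sketch of this fixed-atom-count width estimate, assuming the logged atoms and the current object's atoms have already been mapped to fractional positions along the Peano line (wraparound is ignored here for brevity):

    import bisect

    def atom_widths(current_atoms, logged_atoms, count=2):
        # width of each current atom = distance along the line to the count'th
        # logged atom on whichever side is closer (the denser side)
        log = sorted(logged_atoms)
        widths = []
        for x in current_atoms:
            i = bisect.bisect_left(log, x)
            left  = x - log[i - count] if i - count >= 0 else float("inf")
            right = log[i + count - 1] - x if i + count - 1 < len(log) else float("inf")
            widths.append(min(left, right))
        return widths

    print(atom_widths([0.30, 0.62], [0.05, 0.10, 0.28, 0.31, 0.33, 0.60, 0.61, 0.90]))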
