Astron. Astrophys. 325, 961-971 (1997)

## 2. Bayesian statistics

In the Bayesian approach to statistics, probability is interpreted as a measure of credibility rather than as the frequency of occurrence, as in the classical approach (see e.g. O'Hagan 1994). This makes it possible to assign a probability to the validity of a physical theory, in our case a certain law of star formation.

### 2.1. Probability of a theory

Let us consider a set of $N$ hypotheses $H_1, \ldots, H_N$, only one of which can be true. Then the probability of the $k$-th hypothesis, given the data $D$, is computed from Bayes' Theorem:

$$P(H_k \mid D) = \frac{P(H_k)\,P(D \mid H_k)}{\sum_{i=1}^{N} P(H_i)\,P(D \mid H_i)} \qquad (1)$$

The prior probabilities $P(H_k)$ represent the investigator's degree of belief or his knowledge from previous measurements. The likelihood $P(D \mid H_k)$ is a measure of how well the predictions of $H_k$ match the data. Thus, Eq. (1) describes how a measurement or observation improves our knowledge: it states that the posterior probability for a certain hypothesis to be true is proportional to the product of the probability assigned to it before seeing the data $D$ and the likelihood $P(D \mid H_k)$.

In case a hypothesis $H_k$ contains free parameters $\vec{a}$, the above likelihood is replaced by the mean likelihood with respect to some prior information about the parameters:

$$P(D \mid H_k) = \int P(D \mid \vec{a}, H_k)\, p(\vec{a} \mid H_k)\, \mathrm{d}\vec{a}, \qquad (2)$$

where $p(\vec{a} \mid H_k)$ is the (joint) prior probability density. Since this prior is normalised over the parameter space,

$$\int p(\vec{a} \mid H_k)\, \mathrm{d}\vec{a} = 1, \qquad (3)$$

the mean likelihood diminishes if the parameter space is blown up to a volume much larger than the one in which the likelihood $P(D \mid \vec{a}, H_k)$ contributes significantly. Likewise, the inclusion of an additional free parameter leads to a smaller mean likelihood. In either case one might achieve a better fit of the data, but if one allows oneself too much freedom, this 'dilution' of the mean likelihood may more than compensate for any increase of the likelihood itself due to the improved fit. Since in Bayesian reasoning a hypothesis is to be preferred when its posterior probability exceeds that of any other competing hypothesis, it is this feature of 'dilution' which allows a mathematically consistent formulation of Occam's razor.
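The dilution effect can be illustrated with a one-parameter toy model; the sharply peaked Gaussian likelihood and the numerical ranges below are illustrative assumptions, not taken from the text:

```python
import numpy as np

def mean_likelihood(width):
    """Likelihood averaged over a normalised uniform prior on [0, width]."""
    a = np.linspace(0.0, width, 100001)
    like = np.exp(-0.5 * ((a - 1.0) / 0.1) ** 2)   # peaked near a = 1
    prior = 1.0 / width                            # normalised uniform prior density
    return np.trapz(like * prior, a)

# Blowing the prior range up far beyond the region where the likelihood
# contributes dilutes the mean likelihood in proportion to the volume:
print(mean_likelihood(2.0))     # ~ 0.125
print(mean_likelihood(20.0))    # ~ 0.0125, ten times smaller
```

A tenfold larger prior range reduces the mean likelihood tenfold, even though the best-fit likelihood itself is unchanged.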

When further data become available, they can easily be incorporated by applying Bayes' theorem sequentially, using the old posterior probabilities as the new prior.
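Sequential updating can be sketched in a few lines; the two hypotheses and the likelihood values are invented for illustration:

```python
import numpy as np

# Two mutually exclusive hypotheses with equal priors, and the likelihoods
# of two independent data sets under each hypothesis (illustrative numbers).
prior = np.array([0.5, 0.5])
L1 = np.array([0.8, 0.3])   # P(D1 | H_k)
L2 = np.array([0.6, 0.9])   # P(D2 | H_k)

def update(prior, like):
    """One application of Bayes' theorem over a discrete hypothesis set."""
    post = prior * like
    return post / post.sum()

# Sequential: the posterior after D1 serves as the prior for D2 ...
sequential = update(update(prior, L1), L2)
# ... which equals processing both data sets at once.
batch = update(prior, L1 * L2)
assert np.allclose(sequential, batch)
```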

If one has only weak prior information about a parameter, there are rules for constructing the prior distribution (Jeffreys 1983). Such a rule ensures, for instance, that for parameter estimation the confidence region drawn from the posterior density distribution reproduces the 'classical' confidence region: if the experiment could be repeated very often, the true value of the parameter would indeed lie within the interval as often as the chosen confidence level states.

In practice, especially for good data, the actual prior may be quite unimportant: the truth then has a good chance against our prejudice (as long as an inappropriate prior does not prevent it altogether). Only if the data are so poor or scant that they do not add to our knowledge will the posterior merely reflect our prejudice.

Often one has 'nuisance parameters', i.e. formal parameters in a hypothesis whose true values are of no interest or relevance, such as offsets and factors of proportionality. The likelihood is then also integrated over the space of these nuisance parameters.

### 2.2. Simulations with artificial data

As a demonstration of how Occam's razor is at work in the outlined method, and to show some general features, we apply the method to a simpler problem: we wish to decide whether a measured profile of the optical surface brightness of a galaxy is better represented by an exponential law

$$I(r) = I_0\, e^{-r/a} \qquad (4)$$

with a scale length $a$ and a factor of proportionality $I_0$, or rather by a more complicated law

$$I(r) = I_0\, e^{-(r/a)^b} \qquad (5)$$

that has the exponent $b$ as an additional parameter. As parameter space we allow finite ranges for the scale length $a$ and the exponent $b$.

The factors of proportionality are considered to be nuisance parameters. The 'observational' data are random realizations of the more complicated law (5), sampled at $n$ radii $r_i$. The noise applied is Gaussian in the logarithm, with zero mean and variance $\sigma^2$. The likelihood is then:

$$P(D \mid a, b, I_0) = \left(2\pi\sigma^2\right)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(d_i - \log I(r_i)\bigr)^2\right), \qquad (7)$$

where $d_i$ is the logarithm of the simulated surface brightness at $r_i$.

For each of the two laws (4) and (5) we compute the mean likelihood by integrating Eq. (7) over the parameter space. Following Jeffreys, the parameter priors are assumed to be uniform in $b$, and uniform in the logarithms of the amplitude $I_0$ and the scale length $a$. With this logarithmic prior, the integration over the amplitude can be done analytically. This would seem to imply a divergent normalisation integral, but as the same parameter space is used for both laws, the normalisation integral cancels out, because we consider only the ratio of the mean likelihoods. This has been done 50 times with different realizations of the noise. For the prior probabilities $P(H_k)$ we assume that we have no prior knowledge, and thus both laws are considered equally probable a priori.
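The toy comparison can be sketched numerically. Everything below is an illustrative assumption: the radii, the very low noise level, the grid ranges, and the true exponent $b = 0.5$ of law (5); also, the amplitude is marginalised by brute force on a grid rather than analytically:

```python
import numpy as np

rng = np.random.default_rng(1)
r = np.linspace(0.1, 5.0, 30)
sigma = 0.03                                     # very low noise, in dex
# Artificial data from the complicated law (5), illustrative a = 1, b = 0.5
logI = -(r ** 0.5) / np.log(10) + rng.normal(0.0, sigma, r.size)

def mean_log_like(b_grid):
    """Log of the likelihood averaged over uniform grids in log amplitude,
    log scale length and (optionally) the exponent b."""
    logL = []
    for b in b_grid:
        for a in np.exp(np.linspace(np.log(0.3), np.log(3.0), 40)):
            for c in np.linspace(-0.5, 0.5, 81):         # log10 amplitude
                model = c - (r / a) ** b / np.log(10)
                logL.append(-0.5 * np.sum((logI - model) ** 2) / sigma ** 2)
    logL = np.array(logL)
    m = logL.max()
    return m + np.log(np.exp(logL - m).mean())           # stable log-mean-exp

simple = mean_log_like([1.0])                            # law (4), fixed b = 1
complex_ = mean_log_like(np.linspace(0.2, 2.0, 20))      # law (5), free b
print(complex_ - simple)    # clearly positive: the true law is preferred
```

At this low noise level the simple exponential cannot absorb the systematic curvature of the data, so the mean likelihood of law (5) wins despite the dilution from its extra parameter, mirroring the cloud of circles in Fig. 1.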

In Fig. 1 we compare the mean likelihoods for the two fitting laws (4) and (5), as obtained from the individual data configurations for several values of the noise level, indicated by different symbols. For very low noise (the crowded cloud of open circles), the probability for the more complex law is much larger than that for the simple law. Hence, the true law underlying the data is unambiguously found. As the noise level increases, the cloud of points shows a larger scatter (triangles), and the mean probability for the complex law drops. At 0.2 dex the small dots are almost evenly clustered around the diagonal, i.e. on average both laws are equally probable. It then depends on the particular configuration of the observational data which fitting law happens to come out as the better one. At 0.3 dex the cloud of crosses has become elongated along the diagonal, but most points are found in the regime with a higher probability for the simpler law. This is the action of Occam's razor. At still higher noise levels, the cloud of black squares is more strongly concentrated towards the diagonal, and in the limit of infinite noise both laws are equally probable: the likelihood is then constant over the whole parameter space.

Fig. 1. Simulations with noisy artificial data; shown are the logarithms of the mean likelihoods for a simple law (exponential decay; abscissa) and a more complicated one (law (5); ordinate), for several noise levels: (circles), 0.05 (triangles), 0.2 (small dots), 0.3 (crosses), and 1 dex (black squares). The full line refers to equal probabilities, and the dashed lines to the probability levels of 1 and 99 per cent

Application of Bayes' Theorem requires that both laws be mutually exclusive; one can then assign to the two laws probabilities which add up to unity. Points lying above the 99 per cent line mean that the probability of law (5) is higher than 0.99. That there are no points lying below the 1 per cent line is due to the small parameter space allowed for the exponent $b$. Enlarging this range would make law (5) more 'unreasonable', and all points would move down.
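Converting the two mean likelihoods into probabilities that add up to unity is straightforward; the log values below are illustrative:

```python
import numpy as np

def posterior_probs(log_mean_likelihoods):
    """Posterior probabilities for mutually exclusive laws with equal priors."""
    x = np.asarray(log_mean_likelihoods, dtype=float)
    w = np.exp(x - x.max())       # subtract the maximum for numerical stability
    return w / w.sum()

# A lead of 5 in the log mean likelihood already gives a probability > 0.99:
p = posterior_probs([-120.0, -115.0])
print(p)    # ~ [0.0067, 0.9933]
```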

This example shows the workings of Occam's razor quite well. It also demonstrates that the scatter in the mean likelihoods due to different noise realizations can be quite appreciable. In practice, however, one often has but one set of data.

### 2.3. Application to our problem

Now we apply this method to assess models of the SFR. A nested hierarchy of SFR laws is considered, with the most general one being

$$\mathrm{SFR}(r) = c\, \Sigma_{\rm gas}^{\,x}(r)\, r^{\,y}, \qquad (8)$$

with the gas surface density $\Sigma_{\rm gas}$ and the distance $r$ from the centre. We prefer to explore such an explicit dependence on $r$ instead of one on the local angular velocity or the epicyclic frequency, as this avoids additional errors from the uncertainties of rotation curves. But since many galaxies show a fairly constant rotational velocity, so that the angular velocity varies essentially as $r^{-1}$, their SFR should be well represented by the form (8).

Apart from the most general law (8), several simpler SFR prescriptions, as well as combinations of them, are considered, in order to cover the SF hypotheses already proposed. Since we shall assume that we have no preference for any one of these laws, we assign equal prior probabilities to all models, irrespective of the number of free parameters.

In what follows, the random noise due to observational errors, as well as any intrinsic scatter, is taken to be normally distributed in the logarithm, with zero mean and variance $\sigma^2$. Hence, as suggested by published error bars, it is assumed that the relative errors in the SFR as well as in the gas surface densities are normally distributed and do not depend on the brightness level (i.e. constant signal-to-noise ratio). The physical origin of the scatter in the data - observational uncertainties, azimuthal and radial inhomogeneities in the disks, a genuine intrinsic spread of the SFR itself - shall not be addressed here, and we shall not make any distinction between the various contributions.

Assuming that the data points are independent of each other, the likelihood is given by:

$$P(D \mid c, x, y, \sigma) = \left(2\pi\sigma^2\right)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(\log f_i - \log \mathrm{SFR}(r_i)\bigr)^2\right), \qquad (9)$$

where $f_i$ is the observed SFR indicator, i.e. the H$\alpha$ surface brightness, and the summation is over all $n$ points. We note that because the errors are assumed to be distributed log-normally, there is no difference between fitting with $f$ and with $\log f$: either way, we retain the full information.

Unfortunately, information about the error bars of the surface brightnesses at the various wavelengths is not only rather scarce and difficult to compare, due to the different angular resolutions used, but one must also keep in mind that there may be quite a large genuine scatter which is averaged out by the azimuthal integration. In principle, a thorough discussion of all the scatter involved should enable one to construct a more informative prior. Since the necessary information is not available, only weak prior information about the true error level is assumed: we treat $\sigma$ as an additional nuisance parameter and apply a prior uniform in $\log \sigma$. The use of a prior uniform in $\sigma$ itself does not affect the posterior inference markedly.

The factor of proportionality $c$, a measure of the efficiency of the star formation, would be a rather interesting parameter to determine, since it would enable us to compute gas depletion timescales. However, obtaining its value requires the knowledge not only of the absolute surface brightnesses for all objects, but also of their corrections for internal extinction in the galactic disks. Since the data come from different sources and usually are not absolutely calibrated, we must consider $c$ here as a nuisance parameter, too.

Hence, the likelihood is integrated over $c$, assuming a prior uniform in $\log c$ over a finite interval. The integration of the likelihood can be performed analytically:

$$\Lambda(x, y, \sigma) = \int P(D \mid c, x, y, \sigma)\, \mathrm{d}\log c = \frac{1}{\sqrt{n}}\left(2\pi\sigma^2\right)^{-(n-1)/2} \exp\!\left(-\frac{n S^2}{2\sigma^2}\right), \qquad (10)$$

where $S^2$ is the variance of the residuals $\Delta_i = \log f_i - x\,\log \Sigma_{\rm gas}(r_i) - y\,\log r_i$ about their mean. The normalization integral for this prior diverges as the range in $\log c$ tends to infinity, but since we consider only ratios of mean likelihoods, and since the same nuisance parameter with the same range is present in all the laws considered, the normalisation factors cancel.
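The analytic marginalisation over the offset $\log c$ amounts to completing the square in a Gaussian integral; it can be checked numerically. The residuals, error level, and sample size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
resid = rng.normal(0.0, 0.2, 12)   # illustrative residuals before the offset fit
sigma, n = 0.15, resid.size

def like(mu):
    """Gaussian likelihood as a function of the offset mu (the log amplitude)."""
    return (2 * np.pi * sigma**2) ** (-n / 2) \
        * np.exp(-0.5 * np.sum((resid - mu) ** 2) / sigma**2)

# Brute-force marginalisation over the offset ...
mu = np.linspace(-5.0, 5.0, 20001)
numeric = np.trapz([like(m) for m in mu], mu)

# ... agrees with completing the square: only the scatter about the mean survives
S2 = resid.var()    # mean-square deviation about the mean residual
analytic = n ** -0.5 * (2 * np.pi * sigma**2) ** (-(n - 1) / 2) \
    * np.exp(-0.5 * n * S2 / sigma**2)
assert np.isclose(numeric, analytic, rtol=1e-4)
```

The best-fitting offset drops out, and only the scatter of the residuals about their mean enters the marginal likelihood.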

Integrating (10) afterwards over $\sigma$, with the prior uniform in $\log\sigma$, results in a likelihood that depends only on the model parameters:

$$\lambda(x, y) = \int \Lambda(x, y, \sigma)\, \mathrm{d}\log\sigma \;\propto\; S^{-(n-1)}. \qquad (11)$$
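The integration over the error level, assuming a prior uniform in $\log\sigma$ (a conventional weak prior for a scale parameter), reduces to a Gamma-function integral; a numerical check, with an illustrative sample size and mean-square scatter:

```python
import numpy as np
from math import gamma

n, S2 = 12, 0.04    # illustrative sample size and mean-square scatter

def Lambda(sig):
    """sigma-dependent part of the likelihood after marginalising the offset."""
    return (2 * np.pi * sig**2) ** (-(n - 1) / 2) * np.exp(-0.5 * n * S2 / sig**2)

# Numerical integration with a prior uniform in log(sigma) ...
logs = np.linspace(np.log(1e-3), np.log(1e3), 200001)
numeric = np.trapz(Lambda(np.exp(logs)), logs)

# ... against the closed form (substitute u = n*S2 / (2 sigma^2)):
analytic = 0.5 * gamma((n - 1) / 2) * (np.pi * n * S2) ** (-(n - 1) / 2)
assert np.isclose(numeric, analytic, rtol=1e-5)
# Note the power-law dependence on the scatter: the result scales as
# S2 ** (-(n - 1) / 2), i.e. as S to the power -(n - 1).
```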

Finally, integration over the parameter space gives the mean likelihood, which is a measure for the posterior probability of a SFR law:

$$\bar{\lambda} = \iint \lambda(x, y)\, p(x)\, p(y)\, \mathrm{d}x\, \mathrm{d}y, \qquad (12)$$

where $p(x)$ and $p(y)$ are the normalised prior distributions for the model parameters $x$ and $y$.

Since the mean likelihood depends on the volume of the parameter space, the choice of the ranges for $x$ and $y$ might pose a problem of principle: if we really knew nothing about $x$ and $y$, why not allow an infinitely large space? This would dilute the mean likelihoods beyond any limit. However, very large exponents would lead to extremely steep functions and are therefore quite unreasonable. We decided to choose a finite parameter space

which would contain all reasonable situations, but would not be too restrictive. Within these boundaries the prior density distribution is assumed to be uniform. If the likelihood 'mountain' occupies only a small area well within this region, any further increase of the parameter space decreases the mean likelihood by the factor with which the volume grows.

This analysis is performed for each galaxy separately, yielding for each object mean-likelihood values for each of the different SFR prescriptions. These values are suitably normalised to facilitate comparison of the SFR laws, and the resulting Bayes factors are collected in Table 2.

To assess how well the various laws reproduce the data of the complete sample of galaxies, there are two possibilities. The probability for any law, allowing the best combination of its parameters to vary from object to object, is obtained by simply multiplying the Bayes factors from Table 2. On the other hand, the density distribution of the joint mean likelihood for a single parameter combination common to all objects can be computed by taking the product over all $N$ galaxies, $\lambda_{\rm tot}(x, y) = \prod_{j=1}^{N} \lambda_j(x, y)$. Integration over the parameter space then gives the posterior probability of each law. One also obtains the optimal values for the two parameters, and one can extract confidence regions in parameter space.
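The two possibilities can be sketched side by side; the Bayes factors and the per-galaxy likelihood densities below are invented for illustration, with a one-dimensional parameter grid standing in for the $(x, y)$ plane:

```python
import numpy as np

# Illustrative log Bayes factors of one SFR law for five galaxies
log_bayes = np.array([1.2, -0.3, 0.8, 2.1, 0.5])

# Possibility 1: parameters free from object to object ->
# multiply the Bayes factors, i.e. add their logarithms:
log_joint_free = log_bayes.sum()

# Possibility 2: one common parameter value for all galaxies ->
# multiply the likelihood densities on a common grid before integrating
# (one-dimensional sketch with invented per-galaxy densities):
x = np.linspace(0.0, 3.0, 301)
centres = (1.4, 1.5, 1.3, 1.6, 1.45)
dens = [np.exp(-0.5 * ((x - x0) / 0.4) ** 2) for x0 in centres]
joint = np.prod(dens, axis=0)
posterior = joint / np.trapz(joint, x)     # normalised joint posterior
x_best = x[np.argmax(posterior)]
print(round(x_best, 2))    # the common optimum, here the mean of the centres
```

The joint posterior is narrower than any individual density, which is how the common best values and their confidence regions emerge from the whole sample.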

© European Southern Observatory (ESO) 1997

Online publication: April 28, 1998