On the “Expectation” of Bayesian Methods: How the Prior Is Chosen in “Objective” Manner in Practices of Statistics

YUSAKU OHKUBO 1, a)

1. Center for Human Nature, Artificial Intelligence, and Neuroscience, Hokkaido University, Sapporo, Japan. a) ohkubo.yusaku1989@gmail.com

Abstract: The philosophical foundation of statistical methods has long been discussed among statisticians, scientists, and philosophers. While traditional accounts have assumed epistemological differences between Frequentist and subjective Bayesian methods, some recent practitioners are adopting the Bayesian method for different reasons. This paper aims to give a brief review of the epistemological justification behind Frequentist and subjective Bayesian methods and of how the prior distribution is chosen in the current practice of statistics. Then, I propose further examination of the philosophical aspects of these activities.
1. Introduction

Statistical data analysis is an important process for inference in empirical science. Its theories are said to be divided into different "schools", namely frequentism and Bayesianism, whose validity, importance, problems, and limitations have been discussed by statisticians, scientists, and philosophers for over 100 years (e.g., Fisher 1925, Neyman & Pearson 1928a,b, Lindley 1957, Hacking 1965, Mayo 1994, Royall 1997, Sober 2008).

However, in recent years, the theory and practice of statistical data analysis have changed significantly, due in part to improvements in data measurement systems, numerical algorithms, and computer performance. As a result, there is a large gap between the issues discussed in the existing philosophical literature and the issues that arise in practical scientific applications (e.g., Gelman et al. 2012). This gap was a major motivation for planning a symposium at the 16th Congress on Logic, Methodology and Philosophy of Science and Technology (CLMPST), held in Prague in August 2019, which is the background of this special issue. In this paper, I point out this gap and look forward to progress in philosophical discussions and further exchanges with other fields.

This article is organized as follows. First, I introduce the maximum likelihood method, which has historically been the main tool of data analysis in empirical studies of biology, and briefly summarize how this method can be justified from a frequentist perspective. Second, I introduce the idea of "updating of beliefs by data" in traditional Bayesianism and the typical criticism concerning the "subjectivity of prior probabilities and prior distributions". Third, I argue that recent Bayesian statistics includes aspects that cannot be explained by traditional Bayesian "ism" and point out the need for further philosophical discussion.


2. Frequentism and the maximum likelihood method

Data analysis in scientific fields covers a wide variety of tasks, including parameter estimation, interval estimation, and testing. However, many of the methodologies that have historically occupied the mainstream in science are based on a method of parameter estimation called the maximum likelihood method (MLE). Thus, here I explain in what sense maximum likelihood is "good" statistical inference from the viewpoint of frequentism.[1]

[1] Note that there are several other approaches to justifying maximum likelihood.

Suppose now that there is a probability distribution $p(x \mid \theta)$ whose shape is determined by a parameter vector $\theta$. Suppose further that we do not know what value $\theta$ takes, but that a finite number of realized values $x_1, \ldots, x_n$ (that is, the data at hand) generated from $p(x \mid \theta)$ are available. The problem of estimating $\theta$ using the data at hand (that is, recovering the original distribution $p(x \mid \theta)$) is called the problem of parameter estimation in statistics. The maximum likelihood method is one of the methods for parameter estimation.

The maximum likelihood estimate is defined as follows:

$$\hat{\theta}_{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta} L(\theta)$$

Here, $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$ is a function of $\theta$ called the likelihood function. Consider the following example.

Now suppose that, in a dataset of sample size $n$, there is an objective variable $y$ (an $n$-dimensional vector) and explanatory variables $X$ (an $n \times d$ matrix), and that we want to know the relationship between them. For example, if the value of $y$ follows a normal distribution and its mean is considered to be in a linear relationship with a known explanatory variable $x$, the following linear regression model is often used:

$$y_i \sim \mathrm{Normal}(\beta_0 + \beta_1 x_i,\ \sigma)$$



Here, $\mathrm{Normal}(\mu, \sigma)$ is a normal distribution with mean $\mu$ and standard deviation $\sigma$. If you know $\beta_0$, $\beta_1$, and $\sigma$ in this model, you can infer how $y$ changes with changes in $x$. The maximum likelihood method is one way to estimate $\beta_0$, $\beta_1$, and $\sigma$ using the data at hand. The (joint) likelihood function of this model can be written as follows, and the $(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma})$ that maximizes this function is the maximum likelihood estimate:

$$L(\beta_0, \beta_1, \sigma) = \prod_{i=1}^{n} \mathrm{Normal}(y_i \mid \beta_0 + \beta_1 x_i,\ \sigma)$$
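As a concrete illustration, the following sketch (entirely illustrative; the toy data, variable names, and the choice of a `scipy` optimizer are assumptions of this example, not part of the original text) maximizes this joint log-likelihood numerically on simulated data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=100)  # toy data with known truth

def negative_log_likelihood(params):
    beta0, beta1, log_sigma = params  # log-parameterize sigma so it stays positive
    return -np.sum(norm.logpdf(y, loc=beta0 + beta1 * x, scale=np.exp(log_sigma)))

result = minimize(negative_log_likelihood, x0=np.zeros(3))
beta0_hat, beta1_hat, sigma_hat = result.x[0], result.x[1], np.exp(result.x[2])
print(beta0_hat, beta1_hat, sigma_hat)  # should be close to 1.0, 2.0, 0.5
```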

Why is the maximum likelihood method good inference? Although the mathematical details are omitted, it has been shown that the maximum likelihood estimate has the following property under some regularity conditions (see Watanabe 2010 for exact formulations of regularity, for example):

$$\sqrt{n}\,\big(\hat{\theta}_{\mathrm{MLE}} - \theta_0\big) \;\xrightarrow{\ d\ }\; \mathrm{Normal}\big(0,\ I(\theta_0)^{-1}\big)$$

where $\theta_0$ is the true parameter value and $I(\theta)$ is a metric called the Fisher information matrix, defined as follows:

$$I(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}\!\left[\left(\frac{\partial \log p(x \mid \theta)}{\partial \theta}\right)\left(\frac{\partial \log p(x \mid \theta)}{\partial \theta}\right)^{\!\top}\right]$$

That is, even with the maximum likelihood method, the true value $\theta_0$ cannot be hit reliably: the estimate fluctuates stochastically, as the right-hand side of the equation above shows. What is important from the frequentist viewpoint, however, is that the "deviation" from the true value follows a normal distribution whose mean is the zero vector. In other words, if a finite number of samples are repeatedly drawn from the same distribution and $\hat{\theta}_{\mathrm{MLE}}$ is computed each time, the expected difference between $\hat{\theta}_{\mathrm{MLE}}$ and $\theta_0$ is zero. This is the way the use of the maximum likelihood method is justified from a frequentist perspective: it has good estimation performance in the long run.
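The following minimal simulation sketch (my own illustration, not from the paper) shows this long-run property: over many repeated samples from the same data-generating process, the maximum likelihood estimate of a regression slope deviates from the true value with mean approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0_true, beta1_true, sigma_true = 1.0, 2.0, 0.5
n, n_replications = 50, 5000

deviations = []
for _ in range(n_replications):
    x = rng.uniform(0.0, 1.0, size=n)
    y = beta0_true + beta1_true * x + rng.normal(0.0, sigma_true, size=n)
    # For a Gaussian linear model, the MLE of the coefficients coincides
    # with the ordinary least-squares solution.
    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    deviations.append(beta_hat[1] - beta1_true)

print("mean deviation of the MLE slope:", np.mean(deviations))  # approximately 0
print("spread of the deviations:", np.std(deviations))          # sampling fluctuation
```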

So far, we have discussed the idea of frequentism and its justification, using the maximum likelihood method as an example. But is this a "good" justification for scientists? What scientists want to know is arguably not whether an estimator is unbiased on average over hypothetical repetitions, but what can be concluded from this particular dataset. Bayesianism, introduced below, has been expected to provide an answer to such questions.

3. Bayesianism

Bayesianism, in philosophy, is a movement in epistemology that evaluates hypotheses based on a rule of probability theory called Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Here, $P(A \mid B)$ denotes the probability of A conditional on B. It should be noted that Bayes' theorem itself is derived from Kolmogorov's axioms of probability and the definition of conditional probability; it is independent of Bayesian "ism". But if A and B are interpreted as a hypothesis H and data D, respectively, Bayes' theorem relates "the probability that H is true before D is obtained" to "the probability that H is true after D is obtained". These two probabilities are called the "prior probability" and the "posterior probability", respectively.

Traditionally, Bayesians have supported the use of Bayes' theorem to evaluate the plausibility of a hypothesis or theory, or to evaluate how this plausibility changes as new data are obtained. Although its application is very broad, it has also had a significant impact on statistics, where the relationship between data and hypotheses is considered on the basis of probability theory. For example, the Bayesian method can be applied to the linear regression model described above, where the parameters are estimated using Bayes' theorem:

$$p(\beta_0, \beta_1, \sigma \mid y) = \frac{p(y \mid \beta_0, \beta_1, \sigma)\, p(\beta_0, \beta_1, \sigma)}{\displaystyle\int p(y \mid \beta_0, \beta_1, \sigma)\, p(\beta_0, \beta_1, \sigma)\, d\beta_0\, d\beta_1\, d\sigma}$$

where $p(\beta_0, \beta_1, \sigma)$ is a joint prior distribution for the unknown parameters $\beta_0$, $\beta_1$, and $\sigma$.

It has been argued that the appeal of Bayesian statistical methods to empirical researchers is that Bayesian analysis offers a straightforward way to evaluate hypotheses. In various introductory textbooks, one finds explanations such as:

"Unlike the frequentist methods, the question of the Bayesian methods asks is:" assuming that data has been observed, then what is the probability that the hypothesis of interest is true?" We are interested in the truth of the hypothesis, not the probability of obtaining observed data under the hypothesis (McCarthy 2007, with modification). "

"Rather than saying, like frequentism, if the hypothesis H: μ1 = μ2 is true, a rare (i.e. the probability is 1%) event happened, Bayesian methods offer a way to say "the hypothesis is true with 99% confidence.”

So why are Bayesian methods so easy to understand? It is because the degree of belief that the hypothesis is true can be expressed by the same index, probability, both before and after data collection: P(H) and P(H|D), respectively. Here, Bayes' theorem can be seen as the updating rule for this probability.
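As a toy numerical illustration (the numbers are chosen purely for exposition and are not from the paper): if the prior belief is $P(H) = 0.5$ and the likelihoods are $P(D \mid H) = 0.8$ and $P(D \mid \neg H) = 0.2$, then observing $D$ updates the belief to

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D \mid H)\,P(H) + P(D \mid \neg H)\,P(\neg H)} = \frac{0.8 \times 0.5}{0.8 \times 0.5 + 0.2 \times 0.5} = 0.8.$$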

Early attempts to justify this idea of expressing the degree of belief that a hypothesis is true by probability came from the philosophy of science in the first half of the twentieth century. Here, due to limitations of space, only two major arguments are outlined.[2]

[2] See, for example, Sprenger and Hartmann (2019) for other arguments.

de Finetti (1972) considered decision making under uncertainty. He took such situations as a kind of betting game and examined the conditions under which a bettor behaves "rationally". He proved that, in order to avoid a combination of bets that is sure to lose, called a "Dutch Book" (i.e., an irrational option), the agent must express her beliefs in accordance with the axioms of probability.

According to Savage's expected utility theory (Savage 1954), when an agent "rationally" chooses an option that maximizes her utility under a set of rationality axioms (for example, transitivity of preference: if a ≻ b and b ≻ c, then a ≻ c), there exists a probability function by which her epistemic state can be represented (Sprenger and Hartmann 2019).

These arguments have tried to justify the idea of expressing the degree of belief that a hypothesis is true by probability. However, until the late 1980s, Bayesian methodologies were rarely used in practice. One technical reason was the difficulty of calculating the posterior distribution and its summary statistics (e.g., the posterior mean or variance). Since it is difficult to obtain an exact solution for the multiple integral in the denominator of Bayes' theorem, $P(D) = \int P(D \mid \theta)\, p(\theta)\, d\theta$ (called the "normalizing constant" or the "marginal likelihood" of the model), the form of the posterior distribution cannot be obtained analytically except in a few simple cases.

Another reason is that Bayesianism has been criticized as "subjective". Two kinds of subjectivity should be distinguished here. One is the subjectivity of the interpretation of probability. Bayesianism has been founded on the subjective account of probability (Hájek 2019), and this has irritated some scientists and statisticians who advocate that the whole of science should be based on "objectivity". However, as we have seen, the term "belief" in Bayesianism does not mean "anything goes" as in everyday language. Bayesianism aims to model an idealized form of rational belief, though, of course, what criterion counts as "rationality" is controversial.

The other kind of subjectivity concerns how to give prior probabilities. As is clear from Bayes' theorem, the posterior depends on how the prior is set, but how to set the prior is left open to the analyst. Although there have been many efforts to pursue a "non-informative" prior, one that does not greatly influence the posterior, the concept of "non-informativeness" has long been debated both in philosophy (e.g., Bertrand's paradox and the principle of indifference) and in statistics (e.g., the uniform distribution, the Jeffreys prior, the unit-information prior).

4. Recent “Bayesian” methods

In the previous section, I briefly introduced what makes Bayesianism attractive from an empirical biologist's point of view, together with its philosophical justification and its difficulties in application. While Bayesianism offers an intuitive way to evaluate a hypothesis, it had long been difficult to calculate the posterior distribution. More importantly, there was the problem of subjectivity in how to give the prior distribution.

However, great progress was made on the former problem around 1990. Markov chain Monte Carlo (MCMC) methods, such as the Gibbs sampling algorithm, and many other techniques were developed to obtain approximations. Even if the exact form of the posterior distribution $p(\theta \mid D)$ is not obtained, its mean and variance can be computed from a large number of random samples drawn from $p(\theta \mid D)$, which is often sufficient for the daily practice of data analysis. Since the advent of MCMC, "Bayesian" methods have been applied in a wide range of fields, including the machine learning techniques of recent years.
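To make the idea concrete, here is a minimal random-walk Metropolis sketch (an illustration under assumed toy data, not a method discussed in the paper): it draws samples from a posterior known only up to its normalizing constant and then summarizes the posterior by the sample mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=20)  # toy data; sigma = 1 assumed known

def log_unnormalized_posterior(theta):
    # Normal(0, 10) prior plus Normal(theta, 1) likelihood; the intractable
    # normalizing constant p(D) is omitted, which is exactly why MCMC helps.
    log_prior = -0.5 * (theta / 10.0) ** 2
    log_lik = -0.5 * np.sum((data - theta) ** 2)
    return log_prior + log_lik

samples, theta = [], 0.0
for _ in range(20_000):
    proposal = theta + rng.normal(0.0, 0.5)  # random-walk proposal
    if np.log(rng.uniform()) < log_unnormalized_posterior(proposal) - log_unnormalized_posterior(theta):
        theta = proposal  # accept; otherwise keep the current value
    samples.append(theta)

draws = np.array(samples[5_000:])  # discard burn-in
print("posterior mean:", draws.mean(), " posterior sd:", draws.std())
```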

So what happened to the philosophical problem of subjectivity in prior distributions? Several methods have been proposed for selecting priors in a "non-subjective" manner. In this section, I introduce two cases: the "half-Cauchy prior" for random-effect models and the "information criterion" for evaluating the predictive accuracy of a Bayesian model. Then, I discuss how the choice of prior distribution is justified.

4.1 half-Cauchy prior for random-effect models

In the linear regression model introduced in Section 2, it is assumed that the slope of the regression line is uniformly determined by $\beta_1$. However, this assumption may be too strong when data are sampled from various backgrounds of subjects, regions, or species. In such cases, the quality of the estimation is often improved by using the following random-effect model:

$$y_i \sim \mathrm{Normal}\big(\beta_0 + (\beta_1 + u_{g[i]})\, x_i,\ \sigma\big), \qquad u_g \sim \mathrm{Normal}(0, \tau)$$

where $g[i]$ is a new explanatory variable that contains grouping information such as subject or observation site. The new model assumes that (1) the slope varies depending on $g$ and (2) its variation follows a normal distribution with mean zero and standard deviation $\tau$. This makes it possible to obtain a better estimate of the slope while accounting for differences between subjects or regions. Such a model is called a "random-effect model" in the sense that the slope of the regression line fluctuates stochastically, and it is a kind of a broader class of models called hierarchical models. This model has been widely applied to the analysis of time-series data, longitudinal analyses of patients, and multi-site observation data.
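For concreteness, the following short generative sketch (with assumed toy parameter values, not from the paper) shows how data from such a random-effect model arise: each group g receives its own slope $\beta_1 + u_g$ with $u_g \sim \mathrm{Normal}(0, \tau)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n_groups, n_per_group = 10, 20
beta0, beta1, tau, sigma = 1.0, 2.0, 0.3, 0.5  # assumed "true" values

u = rng.normal(0.0, tau, size=n_groups)          # group-level slope deviations
g = np.repeat(np.arange(n_groups), n_per_group)  # group index g[i] for each row
x = rng.uniform(0.0, 1.0, size=g.size)
y = beta0 + (beta1 + u[g]) * x + rng.normal(0.0, sigma, size=g.size)
```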

However, problems arise when estimating $\tau$. The parameters of the random-effect model can be estimated either by maximum likelihood or by Bayesian estimation. However, the maximum likelihood method often fails when the true $\tau$ is not very large. On the other hand, when implementing Bayesian estimation, it is necessary to give a prior distribution to $\tau$. Gamma distributions are often used because of their analytical convenience, but this does not tell us how to set the shape and rate parameters of the gamma distribution. Of course, other classes of distribution are also possible. Since it is known that the posterior distribution of $\tau$ is more sensitive to the given prior distribution than those of ordinary regression coefficients, what prior distribution should be given has long been discussed.

Gelman (2006) proposed a way to dissolve this dilemma, reporting that Bayesian estimation with a half-Cauchy prior yields "good" estimation performance. If the researcher community uses the half-Cauchy distribution in common, then the prior distribution is not chosen arbitrarily, which avoids the criticism that "subjectivity" affects the analysis. But why should the half-Cauchy distribution be used among the many candidate distributions? Polson & Scott (2012) justified the half-Cauchy prior by means of a risk function. Suppose that $\tau$ is estimated from the data $D$ by a certain estimation method, yielding the estimate $\hat{\tau}(D)$. Then the risk function is defined as the expected loss, for instance the squared error, over the true data-generating process:

$$R(\hat{\tau}, \tau_0) = \mathbb{E}_{D \sim p(D \mid \tau_0)}\!\left[\big(\hat{\tau}(D) - \tau_0\big)^2\right]$$

Polson & Scott (2012) compared the maximum likelihood estimation with Bayesian estimation using the half-Cauchy prior and found that the latter yield less risk. Note that, at the right side of the risk function, it takes the expectation over the true data generating process . In other words, in some cases, the maximum likelihood method might yield a closer estimate of the true value than the Bayesian method based on the half-Cauchy distribution. However, taking the expected value, frequency of better estimate by half-Cauchy Bayes estimate is more often than the maximum likelihood method. This can be viewed as a frequentist justification of a "Bayesian" method.

4.2 Information criterion and the evaluation of a Bayesian predictive distribution

In this section, I introduce another example: an information criterion used to evaluate the predictive accuracy of a Bayesian model. In the area of machine learning, the focus is often on using the data at hand to predict future observations, rather than on analyzing the data at hand or estimating model parameters (in this sense, a set of data, the objective and explanatory variables in the above examples, is sometimes referred to as "training data"). Bayesian methods are also widely applied in such fields for prediction.

Now, for example, suppose that there are training data $D$ and that we want to predict unobserved data $\tilde{x}$ using a trained model, by the following Bayesian posterior predictive distribution:

$$p(\tilde{x} \mid D) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid D)\, d\theta$$

In other words, the posterior predictive distribution is the average of the distributions $p(\tilde{x} \mid \theta)$, which describe how the unobserved data would be distributed if the parameter were $\theta$, weighted by the posterior distribution of the parameter computed from the data at hand.
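In practice, one rarely evaluates this integral analytically; instead, predictive draws are generated from posterior samples. A minimal sketch (assuming posterior draws such as those produced by the Metropolis example above, here replaced by a stand-in):

```python
import numpy as np

rng = np.random.default_rng(5)
theta_draws = rng.normal(3.0, 0.2, size=4000)     # stand-in for MCMC output
x_tilde = rng.normal(loc=theta_draws, scale=1.0)  # one x-draw per posterior draw
print("predictive mean:", x_tilde.mean(), " predictive sd:", x_tilde.std())
```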

Given the training data, the likelihood function, and the prior distribution, the posterior predictive distribution is determined. But this does not mean that it gives a "good" prediction.

Therefore, from a practical point of view, there are cases where one wants to evaluate how "good" a posterior predictive distribution is, or to compare the accuracy of several posterior predictive distributions. By its nature, however, the ability to predict unobserved data cannot be known directly without the unobserved data; it must be inferred in some statistical way.

Suppose that there exists a true data-generating distribution $q(x)$, which is common to the training data $D$ and the unobserved data $\tilde{x}$. Then we can reinterpret $D$ as a random sample from $q(x)$. The Widely Applicable Information Criterion (WAIC) is an estimator of the "goodness" of the Bayesian predictive distribution based only on the "training loss" $T_n$, that is, how well the model fits the training data $D$, where goodness is defined by the "generalization loss" $G_n$ between the true data-generating distribution and the posterior predictive distribution (Watanabe 2010):

$$G_n = -\int q(\tilde{x}) \log p(\tilde{x} \mid D)\, d\tilde{x}$$

where $q(\tilde{x})$ is the true data-generating distribution of $\tilde{x}$.

Watanabe (2010) revealed the following property:

$$\mathbb{E}\big[\mathrm{WAIC}\big] = \mathbb{E}\big[G_n\big] + o\!\left(\frac{1}{n}\right)$$

where the expectations are taken over the data-generating distribution of the training data.

Though we do not examine the theoretical validity of this statement here, it indicates that, by using WAIC, the fit to unknown data (the generalization loss) can be estimated up to an error of order $o(1/n)$. But note again that a frequentist justification plays an important role here: in the above equation, as in the risk analysis of the half-Cauchy prior, an expectation over the data-generating distribution appears. In other words, WAIC is a random variable that "sometimes succeeds, sometimes fails": any particular value of WAIC may hit or miss $G_n$, but WAIC can evaluate the posterior predictive distribution in the sense that its expected value equals the expected value of the generalization loss.
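A minimal sketch of the WAIC computation (on the per-data-point loss scale), assuming Watanabe's formulation of WAIC as the training loss plus a functional-variance term; the model, toy data, and stand-in posterior draws are illustrative assumptions:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def waic(data, theta_draws):
    # log p(x_i | theta_s): one row per data point, one column per posterior draw
    log_lik = norm.logpdf(data[:, None], loc=theta_draws[None, :], scale=1.0)
    S = theta_draws.size
    # training loss: minus the mean log pointwise predictive density
    lppd = logsumexp(log_lik, axis=1) - np.log(S)
    train_loss = -np.mean(lppd)
    # functional-variance term: posterior variance of log p(x_i | theta),
    # averaged over the data points
    func_var = np.mean(np.var(log_lik, axis=1))
    return train_loss + func_var

rng = np.random.default_rng(3)
data = rng.normal(3.0, 1.0, size=20)
theta_draws = rng.normal(3.0, 0.2, size=4000)  # stand-in for MCMC output
print("WAIC (per-point loss scale):", waic(data, theta_draws))
```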

Here, even if the same data and the same likelihood function are given, a different posterior predictive distribution is obtained if the prior distribution is changed. Thus, if WAIC can quantitatively evaluate multiple posterior predictive distributions, then we have a statistical method to determine, among the candidates, which prior gives the best predictive accuracy for a given combination of data and likelihood function. It can be said that we have a method of "objectively" selecting the prior distribution, instead of a subjective one based on background knowledge. Before concluding this section, I quote a standard textbook on machine learning, Pattern Recognition and Machine Learning (Bishop 2006; emphasis added):

"from a Bayesian perspective the uncertainty in our model is expressed through a posterior distribution over [parameters].” “Bayesian methods based on poor choices of prior can give poor results with high confidence. Frequentist evaluation methods offer some protection from such problems, and techniques […] remain useful in areas such as model comparison.”

Here, the concept of "evaluating" the prior distribution described above is mentioned, and it is noted that this method rests on a frequentist mode of justification.
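As a concrete illustration of this "evaluating the prior" idea, the following sketch (a toy conjugate model of my own construction, reusing the `waic` function sketched in Section 4.2) compares two candidate priors by their WAIC and would keep the one attaining the lower value:

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(3.0, 1.0, size=20)  # toy data; sigma = 1 assumed known

def posterior_draws_under_prior(prior_sd, size=4000):
    # Conjugate Normal(0, prior_sd^2) prior gives a Normal posterior for theta.
    post_var = 1.0 / (len(data) + 1.0 / prior_sd ** 2)
    post_mean = post_var * data.sum()
    return rng.normal(post_mean, np.sqrt(post_var), size=size)

for prior_sd in (0.1, 10.0):  # an over-concentrated prior vs. a diffuse one
    draws = posterior_draws_under_prior(prior_sd)
    print(f"prior sd = {prior_sd:5.1f}  WAIC = {waic(data, draws):.4f}")
```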

4.3 Summary: frequentist justifications for Bayesian methods

So far, we have seen two cases where the prior distribution is not chosen by beliefs or background information. Since, as we saw in Section 3, the biggest concern about Bayesianism has been the subjective choice of priors, can scientists now enjoy the benefits of Bayesianism without any charge of subjectivity? Unfortunately, this is not always the case.

When researchers apply a common half-Cauchy prior distribution to a random-effects model, subjectivity seems to be eliminated because the choice is not a matter of individual arbitrariness but is inter-subjective. Note, however, that the justification for the half-Cauchy prior is based on expected risk, a frequentist measure. In the evaluation of the Bayesian predictive distribution, on the other hand, the choice of the prior is left open to the scientist. Although the choice remains arbitrary in this sense, which prior distribution gives good predictive accuracy can be "objectively" compared and evaluated, so that subjectivity can be eliminated. But, again, note that the prediction error is estimated in a frequentist way.

Again, remember why the Bayesian posterior is easy to interpret. It is because, following the discussions of de Finetti, Savage, and other eminent philosophers, it was deemed justified to express our beliefs both before and after the data as probability distributions (the prior and posterior distributions, respectively). However, if the prior is not given as a degree of belief, neither de Finetti's Dutch-book argument nor Savage's rational-preference axioms apply. How, then, can we interpret the posterior as a "post-data belief"? Of course, it is still possible to calculate P(H|D); it is simply an application of Bayes' theorem. It is important to note, however, that the discussions of de Finetti and Savage play a major role in interpreting P(H|D) as "the degree to which the hypothesis H is true conditional on the data collected."

5. Closing remarks

At present, Bayesian methods, including Bayesian statistics and prediction by machine learning, are used in a very wide range of fields, and there seems to be no doubt that these methodologies are truly useful in the real world of science and society. However, how and why these methods are useful, and how they are justified, may not have been discussed well enough. For example, if the Bayesian posterior distribution does not give "the degree to which a hypothesis H is true", how should we interpret the posterior distribution? Or, in place of de Finetti's Dutch-book argument and Savage's rational-preference theory, can we construct a new argument that justifies interpreting a posterior as "the degree to which hypothesis H is true"? All these questions are difficult to pursue in depth without philosophy. I expect further discussion on these issues.

Acknowledgments

The main idea of this paper was presented at the 16th Congress on Logic, Methodology and Philosophy of Science and Technology (Prague, Czech Republic, 2019) and at the 5th seminar of the Center for Human Nature, Artificial Intelligence, and Neuroscience (CHAIN), Hokkaido University. I thank the participants for all their feedback.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

This research is partly funded by the Grant for Groundbreaking Young Researchers of the Suntory Foundation and by a grant-in-aid from the Ministry of Education, Culture, Sports, Science and Technology of Japan to M. M. (no. 16H03050).

References

Bishop, C. M. (2006). Pattern recognition and machine learning, Springer, New York.

de Finetti, B. (1972). Probability, Induction and Statistics, Wiley, New York.

Fisher, R. A. (1925). Statistical methods for research workers. Genesis Publishing, Guildford.

Gelman, A. (2006). ‘Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper),’ Bayesian analysis, 1(3), 515-534.

Hacking, I. (1965). Logic of Statistical Inference, Cambridge University Press, Cambridge.

Hájek, A. (2019). ‘Interpretations of Probability,’ In: The Stanford Encyclopedia of Philosophy (Ed. Edward N. Zalta), The Metaphysics Research Lab, Stanford. (visited 1 Mar. 2020)

Lindley, D. V., & Phillips, L. D. (1976). ‘Inference for a Bernoulli process (a Bayesian view),’ The American Statistician, 30(3), 112-119.

McCarthy, M. A. (2007). Bayesian methods for ecology, Cambridge University Press, Cambridge.

Neyman, J., & Pearson, E. S. (1928). ‘On the use and interpretation of certain test criteria for purposes of statistical inference: Part I,’ Biometrika, 175-240.

Neyman, J., & Pearson, E. S. (1928). ‘On the use and interpretation of certain test criteria for purposes of statistical inference: Part II,’ Biometrika, 263-294.

Polson, N. G., & Scott, J. G. (2012). ‘On the half-Cauchy prior for a global scale parameter,’ Bayesian Analysis, 7(4), 887-902. Royall, R. (1997). Statistical evidence : a likelihood paradigm,

Chapman & Hall/CRC, London.

Savage, L. J. (1954). The Foundations of Statistics, Wiley, New York (second edition 1972, Dover, New York).

Sober, E. (2008). Evidence and Evolution: The Logic Behind the Science, Cambridge University Press, Cambridge.

Sprenger, J., & Hartmann, S. (2019). Bayesian philosophy of science, Oxford University Press, Oxford.

Watanabe, S. (2010). ‘Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory,’ Journal of Machine Learning Research, 11(Dec), 3571-3594.
