Academic year: 2021


Multimodel Inference via the Extended Maximum Likelihood Principle When Candidate Models Are Fully Specified

Masato Kagihara

Faculty of Economics, Fukuoka University; E-mail: kagihara@fukuoka-u.ac.jp

1 Introduction

In studying statistical theory for economic analysis, it is important to take the following two problems into consideration. First, economic statistics, the fundamental information for rational decision making, can contain significant and skewed measurement error, as implied by the study of Stiglitz, Sen, and Fitoussi eds. (2009) on the difficulty of measuring economic phenomena. This means that economic statistics tend to be biased, as exemplified by the Japanese consumer price index studied by Ariga and Matsui (2003) and Shiratsuka (2005), and by the 2005 International Comparison Program price data studied by Chen and Ravallion (2008). Second, economic models may contain significant approximation error relative to the true phenomenon being modeled. This may be inevitable in social sciences such as economics and politics. It follows that the development of statistical theory for economic analysis must take these problems into account. The latter, the approximation error problem of economic models, has been a prime motivation for the writing of this paper; the former, the measurement error problem of economic statistics, has motivated the writing of Kagihara (2009) and Kagihara (forthcoming).

For this purpose, this paper extracts the essence of Akaike's extended maximum likelihood principle by considering a simple case in which candidate models are fully specified. The principle advocated by Akaike (1973) is free from the true model assumption, and is therefore suitable for a social science such as economics. It is unrealistic in economics to expect that theoretical models will have little approximation error relative to the true economic phenomena.

Akaike (1973) used the Kullback-Leibler information to measure discrepancies between theoretical models and the unknown true phenomenon being modeled, and derived the Akaike Information Criterion (AIC). The AIC is a bias-corrected estimator of the essential part of the Kullback-Leibler information. He showed that, in problems of estimating unknown parameters in a fixed model, the classical maximum likelihood principle arises as a natural consequence of his approach. Akaike (1973) therefore called the approach an extension of the maximum likelihood principle, one that covers statistical estimation problems under a variety of different models. This approach can be considered to combine statistical estimation with model selection, and is referred to as multimodel inference. Detailed expositions of the approach advocated by Akaike (1973) can be found in Burnham and Anderson (2002), Konishi and Kitagawa (2004), and Konishi and Kitagawa (2008). The Kullback-Leibler information of a probability density function (pdf) $g$ from another pdf $f$, $D(f, g)$, is defined as follows:

\[
D(f, g) := \int f(x) \log\frac{f(x)}{g(x)}\,dx
= E_f\left[\log\frac{f(x)}{g(x)}\right]
= E_f[\log f(x)] - E_f[\log g(x)], \tag{1}
\]

where $E_f$ stands for the expectation over the pdf $f$. This quantity has the following properties: $D(f, g) \ge 0$, and $D(f, g) = 0 \iff f(x) = g(x)$. Hence, $D(f, g)$ can be interpreted as a discrepancy measure of a pdf $g$ from another pdf $f$. Furthermore, if $f$ is the pdf describing the unknown true phenomenon, then the pdf $g$ attaining the smallest $D(f, g)$ is the best approximate model for the unknown true pdf $f$ in the sense of the Kullback-Leibler information.
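As a numerical illustration of definition (1), the following sketch approximates $D(f, g)$ for two normal densities with a midpoint rule and compares the result with the well-known closed form for a pair of normal distributions. The two densities, the integration range, and the helper name `kl_divergence` are assumptions chosen for illustration only; they are not taken from the paper.

```python
import math
from statistics import NormalDist

def kl_divergence(f, g, lo=-12.0, hi=12.0, steps=20_000):
    """Approximate D(f, g) = integral of f(x) * log(f(x) / g(x)) dx (midpoint rule)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        fx, gx = f.pdf(x), g.pdf(x)
        if fx > 0.0 and gx > 0.0:      # skip points where a density underflows to 0
            total += fx * math.log(fx / gx) * h
    return total

f = NormalDist(0.0, 1.0)   # plays the role of the "true" pdf (assumed for illustration)
g = NormalDist(1.0, 2.0)   # an approximating pdf (assumed for illustration)

d_fg = kl_divergence(f, g)
d_ff = kl_divergence(f, f)  # D(f, f) should be zero

# Closed form for normals: log(s_g / s_f) + (s_f^2 + (m_f - m_g)^2) / (2 s_g^2) - 1/2
closed_form = math.log(2.0 / 1.0) + (1.0 + 1.0) / (2.0 * 4.0) - 0.5

print(f"numerical D(f, g) = {d_fg:.6f}, closed form = {closed_form:.6f}, D(f, f) = {d_ff}")
```

The check `D(f, f) = 0` and the positivity of `d_fg` mirror the two properties stated above.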

The rest of this paper is structured as follows. After describing the setting of the problem and the ideas of the extended maximum likelihood principle in Section 2, we develop procedures for multimodel inference via this principle in Section 3. Concluding remarks are given in Section 4.

2 Extended Maximum Likelihood Principle

Assume that we have independently and identically distributed (i.i.d.) observations $\{X_i \in \mathbb{R}\}_{i=1}^{N}$, and that we want to know the data generating process (DGP) producing these observations. The strategy considered here is to approximate the true but ultimately unknown DGP by a known model which is expected to be closest to the true DGP. Suppose that we have a set of models explaining the DGP of $\{X_i\}$, the candidate models, denoted by $\{g_k\}_{k=1}^{K}$ for $K \in \mathbb{N}$. These candidate models $\{g_k\}$ should be chosen based on our scientific knowledge about the phenomenon producing the observations $\{X_i\}$: economics, politics, biology, physics, and so on.

Throughout this paper, we suppose that the pdf $f$ in (1) is the unknown true DGP generating the observations $\{X_i\}$, and that the candidate models $\{g_k\}$ are given in fully specified forms without unknown parameters.

The closest model to the true DGP $f$ in the sense of the Kullback-Leibler information (1) is obtained by selecting, as the best, the model which minimizes (1) within the candidates $\{g_k\}$. This approach is equivalent to the minimization of

\[
CE(f, g_k) := -E_f[\log g_k(X)] \tag{2}
\]

within $\{g_k\}_{k=1}^{K}$, because the first term of the last expression in (1) is the same for all the candidate models $\{g_k\}$ and can therefore be ignored. Note that the first term $E_f[\log f(x)]$ is the negative of the Shannon information, or entropy, of the pdf $f$ and is the expected log-likelihood of the true model $f$. The quantity (2) is called cross entropy¹ and is the negative of the expected log-likelihood of the model $g_k$ under the true model $f$. Hence, the best approximate model $g^*$ within the candidate models in the population (the true best approximate model within the candidates) is defined as

\[
g^* := \arg\min_{g_k \in \{g_k\}_{k=1}^{K}} D(f, g_k)
= \arg\min_{g_k \in \{g_k\}_{k=1}^{K}} CE(f, g_k). \tag{3}
\]

When comparing two models $g_k$ and $g_l$, the difference in the Kullback-Leibler information (KLI-difference) $\Delta(g_k, g_l)$, defined as follows, is a key quantity, because the model $g_k$ is better than, worse than, or indifferent to $g_l$ when this quantity is positive, negative, or zero, respectively:

\[
\Delta(g_k, g_l) := D(f, g_l) - D(f, g_k)
= E_f[\log g_k(X)] - E_f[\log g_l(X)]
= E_f\left[\log\frac{g_k(X)}{g_l(X)}\right]. \tag{4}
\]

Equation (4) shows that the KLI-difference is the expected log-likelihood ratio between the models $g_k$ and $g_l$ under the true DGP $f$.
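To make the population quantities (2) to (4) concrete, here is a minimal sketch for a case in which the true DGP is known, which is never the situation in practice. The true density, the three candidates, and the function names `cross_entropy` and `kli_difference` are hypothetical choices for illustration.

```python
import math
from statistics import NormalDist

def cross_entropy(f, g, lo=-12.0, hi=12.0, steps=20_000):
    """CE(f, g) = -E_f[log g(X)] as in (2), approximated by a midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total -= f.pdf(x) * math.log(g.pdf(x)) * h
    return total

def kli_difference(f, g_k, g_l):
    """Delta(g_k, g_l) = CE(f, g_l) - CE(f, g_k) as in (4); positive means g_k is closer to f."""
    return cross_entropy(f, g_l) - cross_entropy(f, g_k)

f = NormalDist(0.0, 1.0)  # hypothetical true DGP, known only in this toy setting
candidates = {            # hypothetical fully specified candidate models
    "g1": NormalDist(0.5, 1.0),
    "g2": NormalDist(0.0, 2.0),
    "g3": NormalDist(-1.0, 1.5),
}

ce = {name: cross_entropy(f, g) for name, g in candidates.items()}
best = min(ce, key=ce.get)  # population best approximate model, as in (3)
delta_12 = kli_difference(f, candidates["g1"], candidates["g2"])

print(ce, best, delta_12)
```

Because the entropy term $E_f[\log f(x)]$ is common to all candidates, ranking by `ce` gives the same ordering as ranking by the Kullback-Leibler information itself.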

The problem arising here is that we cannot observe the quantities (1), (2), and (4) directly. Therefore, we have to estimate them based on the observed sample $\{X_i\}$. We discuss this matter in Section 3.

¹ For terms related to entropy, see Soofi (1994).


3 Multimodel Inference via the Extended Maximum Likelihood Principle

This section develops procedures for multimodel inference via the extended maximum likelihood principle when the candidate models are fully specified, i.e., the candidate models are given in fully known forms without unknown parameters. Subsection 3.1 discusses how to estimate the cross entropy (2) and the KLI-difference (4), and investigates their statistical properties. Subsection 3.2 presents a procedure for estimating the best approximate model and provides a statistical foundation for it.

3.1 Asymptotic Properties of Sample Cross Entropy and Sample Difference of Kullback-Leibler Information

Equation (2) shows that the cross entropy is the negative expected log-likelihood of a candidate model $g_k$ when the expectation is evaluated under the true DGP $f$. A natural estimator of the cross entropy is its sample analog, the sample cross entropy, which replaces the true $f$ in equation (2) with the empirical distribution $\hat f$:

\[
\widehat{CE}(f, g_k) := CE(\hat f, g_k) = -\frac{1}{N}\sum_{i=1}^{N} \log g_k(X_i). \tag{5}
\]

The sample cross entropy (5) is proportional to the log-likelihood of the model $g_k$ if the data $\{X_i\}$ are i.i.d., with proportionality factor $-1/N$. Its asymptotic properties are obtained by the law of large numbers and the central limit theorem.
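The estimator (5) is simple to compute. The sketch below is one possible implementation for a model with a known density, using `statistics.NormalDist` as a stand-in for a fully specified candidate; the simulated data, the seed, and the function name `sample_cross_entropy` are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def sample_cross_entropy(xs, g):
    """Sample cross entropy (5): minus the average log-density of model g over the data."""
    return -sum(math.log(g.pdf(x)) for x in xs) / len(xs)

# Tiny hand-checkable case: two observations under a standard normal model.
tiny = sample_cross_entropy([0.0, 1.0], NormalDist(0.0, 1.0))

# Larger simulated case: data from a hypothetical DGP, evaluated under a candidate g1.
true_dgp = NormalDist(0.0, 1.0)
xs = true_dgp.samples(5000, seed=12345)
g1 = NormalDist(0.5, 1.0)

ce_hat = sample_cross_entropy(xs, g1)
log_likelihood = sum(math.log(g1.pdf(x)) for x in xs)
ce_pop = 0.5 * math.log(2.0 * math.pi) + 0.625  # population CE(f, g1) for this normal pair

print(tiny, ce_hat, ce_pop)
```

In this setting `ce_hat` equals `-log_likelihood / len(xs)`, which is the proportionality noted above.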

Proposition 1 (Properties of Sample Cross Entropy).

The sample cross entropy (5) is a consistent and asymptotically normal estimator of the population cross entropy (2).

1. $\widehat{CE}(f, g_k) \xrightarrow{p} CE(f, g_k)$.

2. Assume that $E_f[\{\log g_k(X)\}^2] < \infty$; then

\[
\sqrt{N}\,\bigl(\widehat{CE}(f, g_k) - CE(f, g_k)\bigr) \xrightarrow{d} N\bigl(0, V_f[\log g_k]\bigr),
\]

where $V_f[\log g_k] = E_f[\{\log g_k(X) - E_f[\log g_k(X)]\}^2]$.

Proof. 1. Since $\{X_i\}$ are i.i.d., so are $\{\log g_k(X_i)\}$ for all $k$. Because $\widehat{CE}(f, g_k) = -\bigl(\sum_{i=1}^{N} \log g_k(X_i)\bigr)/N$ and $-E_f[\log g_k(X_i)] = CE(f, g_k)$ for all $k$ and $i$, we get the result owing to Khinchine's weak law of large numbers.

2. Since $\{X_i\}$ is an i.i.d. sequence, so is $\{\log g_k(X_i)\}$ for all $k$. Because $\sqrt{N}\,(\widehat{CE}(f, g_k) - CE(f, g_k)) = -\bigl[\sum_{i=1}^{N} (\log g_k(X_i) - E_f[\log g_k(X)])\bigr]/\sqrt{N}$, and the limiting normal distribution is symmetric about zero, the result follows from the Lindeberg-Levy central limit theorem.
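A small Monte Carlo experiment can illustrate the second part of Proposition 1. The sketch below assumes a standard normal DGP and a single normal candidate, for which the population cross entropy and the variance $V_f[\log g_k]$ are available in closed form; the sample size, the number of replications, and the seeds are arbitrary choices.

```python
import math
from statistics import NormalDist, fmean, pvariance

def sample_cross_entropy(xs, g):
    """Sample cross entropy (5): minus the average log-density of g over xs."""
    return -fmean([math.log(g.pdf(x)) for x in xs])

true_dgp = NormalDist(0.0, 1.0)  # hypothetical true DGP f
g_k = NormalDist(0.5, 1.0)       # hypothetical fully specified candidate

# Population values for this pair: E_f[(X - 0.5)^2] = 1.25 and Var_f[(X - 0.5)^2 / 2] = 0.75.
ce_pop = 0.5 * math.log(2.0 * math.pi) + 0.625
v_pop = 0.75

n, reps = 200, 400
scaled_errors = []
for r in range(reps):
    xs = true_dgp.samples(n, seed=r)  # one simulated sample per replication
    scaled_errors.append(math.sqrt(n) * (sample_cross_entropy(xs, g_k) - ce_pop))

mc_mean = fmean(scaled_errors)      # should be close to 0
mc_var = pvariance(scaled_errors)   # should be close to v_pop
print(mc_mean, mc_var, v_pop)
```

The Monte Carlo mean and variance of the scaled estimation error should be roughly 0 and 0.75, up to simulation noise.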

From this result, we construct a confidence interval for the cross entropy of each model $g_k$. For this purpose, we must estimate $V_f[\log g_k(X)]$ consistently, because we cannot observe this quantity in practice. Such an estimator is given by, say, the sample variance²,

\[
\hat V_k := \frac{1}{N}\sum_{i=1}^{N} \bigl(\log g_k(X_i) - \hat E_k\bigr)^2, \tag{6}
\]

where $\hat E_k := \bigl(\sum_{i=1}^{N} \log g_k(X_i)\bigr)/N = -\widehat{CE}(f, g_k)$. It follows that an asymptotic $(1-\alpha)$ confidence interval for the cross entropy (2) of the model $g_k$ is

\[
\left[\, \widehat{CE}(f, g_k) - z_{\alpha/2}\sqrt{\frac{\hat V_k}{N}},\ \ \widehat{CE}(f, g_k) + z_{\alpha/2}\sqrt{\frac{\hat V_k}{N}} \,\right] \tag{7}
\]

for $\alpha \in [0, 1]$, where $z_{\alpha/2}$ denotes the $100(1-\alpha/2)$th percentile of the standard normal distribution.

² An analytically equivalent expression for the sample variance (6) is $\hat V_k = \sum_{i=1}^{N} (\log g_k(X_i))^2/N - (\hat E_k)^2$. However, in practice, the values computed from the sample can differ between the two expressions because of rounding errors. Appendix A shows that the expression (6) produces a more accurate value.
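The interval (7) can be sketched in a few lines. The function below uses the two-pass variance (6) and `NormalDist().inv_cdf` for the normal percentile; it requires $0 < \alpha < 1$, since the percentile is not finite at the endpoints. The data, the candidate model, and the name `cross_entropy_interval` are illustrative assumptions.

```python
import math
from statistics import NormalDist

def cross_entropy_interval(xs, g, alpha=0.05):
    """Return (lower, estimate, upper) of the asymptotic (1 - alpha) interval (7)."""
    n = len(xs)
    logs = [math.log(g.pdf(x)) for x in xs]
    e_hat = sum(logs) / n                              # \hat E_k
    v_hat = sum((v - e_hat) ** 2 for v in logs) / n    # sample variance (6), two-pass form
    ce_hat = -e_hat                                    # sample cross entropy (5)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)        # 100(1 - alpha/2)th percentile
    half_width = z * math.sqrt(v_hat / n)
    return ce_hat - half_width, ce_hat, ce_hat + half_width

xs = NormalDist(0.0, 1.0).samples(5000, seed=12345)    # simulated data, hypothetical DGP
lower, ce_hat, upper = cross_entropy_interval(xs, NormalDist(0.5, 1.0))
print(lower, ce_hat, upper)
```

With 5000 observations the interval is narrow, reflecting the $1/\sqrt{N}$ rate in Proposition 1.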

To estimate, or select, the best approximate model within the candidate models, comparing their sample cross entropies (5) or confidence intervals (7) is insufficient. Co-variation among the models should be taken into account by examining the KLI-difference (4) between them. Because the KLI-difference is an unknown quantity, it must be estimated. For this purpose, consider the sample KLI-difference, an average empirical log-likelihood ratio, which replaces the true $f$ in equation (4) with the empirical distribution $\hat f$:

\[
\Lambda_N(g_k, g_l) := E_{\hat f}\left[\log\frac{g_k(X)}{g_l(X)}\right]
= \frac{1}{N}\sum_{i=1}^{N} \log\frac{g_k(X_i)}{g_l(X_i)}
= \hat E_k - \hat E_l
= \widehat{CE}(f, g_l) - \widehat{CE}(f, g_k). \tag{8}
\]
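The chain of equalities in (8) can be checked numerically. In this sketch the data are simulated from a hypothetical standard normal DGP and the two candidates are arbitrary normal densities; the function names are not from the paper.

```python
import math
from statistics import NormalDist

def sample_cross_entropy(xs, g):
    """Sample cross entropy (5)."""
    return -sum(math.log(g.pdf(x)) for x in xs) / len(xs)

def sample_kli_difference(xs, g_k, g_l):
    """Lambda_N(g_k, g_l) in (8): average log-likelihood ratio of g_k to g_l over the data."""
    return sum(math.log(g_k.pdf(x)) - math.log(g_l.pdf(x)) for x in xs) / len(xs)

xs = NormalDist(0.0, 1.0).samples(2000, seed=7)  # simulated data, hypothetical DGP
g_k = NormalDist(0.5, 1.0)
g_l = NormalDist(0.0, 2.0)

lam = sample_kli_difference(xs, g_k, g_l)
via_ce = sample_cross_entropy(xs, g_l) - sample_cross_entropy(xs, g_k)
print(lam, via_ce)  # the two numbers agree up to floating-point error
```

Swapping the roles of the two models flips the sign of the statistic, matching the antisymmetry of $\Delta(g_k, g_l)$ in (4).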

Its asymptotic properties are established as follows.

Proposition 2 (Properties of Sample KLI-Difference).

The sample KLI-difference (8) is a consistent and asymptotically normal estimator of the population KLI-difference (4).

1. $\Lambda_N(g_k, g_l) \xrightarrow{p} \Delta(g_k, g_l)$.

2. Assume that $E_f[\{\log(g_k(X)/g_l(X))\}^2] < \infty$; then

\[
\sqrt{N}\,\bigl(\Lambda_N(g_k, g_l) - \Delta(g_k, g_l)\bigr) \xrightarrow{d} N\left(0,\ V_f\left[\log\frac{g_k(X)}{g_l(X)}\right]\right),
\]

where $V_f[\log(g_k(X)/g_l(X))] = V_f[\log g_k] + V_f[\log g_l] - 2\,Cov[\log g_k, \log g_l]$, and $Cov[\log g_k, \log g_l] = E_f[(\log g_k - E_f[\log g_k])(\log g_l - E_f[\log g_l])]$.

Proof. 1. Since $\{X_i\}$ are i.i.d., so are $\{\log(g_k(X_i)/g_l(X_i))\}$ for all $k$ and $l$. Because $\Lambda_N(g_k, g_l) = \bigl[\sum_{i=1}^{N} \log(g_k(X_i)/g_l(X_i))\bigr]/N$ and $E_f[\log(g_k(X_i)/g_l(X_i))] = \Delta(g_k, g_l)$ for all $k$, $l$, and $i$, we get the result owing to Khinchine's weak law of large numbers.

2. Since $\{X_i\}$ are i.i.d., so are $\{\log(g_k(X_i)/g_l(X_i))\}$ for all $k$ and $l$. Because $\sqrt{N}\,(\Lambda_N(g_k, g_l) - \Delta(g_k, g_l)) = \sum_{i=1}^{N} \{\log(g_k(X_i)/g_l(X_i)) - E_f[\log(g_k(X)/g_l(X))]\}/\sqrt{N}$, the stated result follows from the Lindeberg-Levy central limit theorem.

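Because we cannot observe $V_f[\log(g_k(X)/g_l(X))]$ in practice, we must estimate this quantity consistently by

\[
\hat V_{kl} := \frac{1}{N}\sum_{i=1}^{N} \left(\log\frac{g_k(X_i)}{g_l(X_i)} - \hat E_{kl}\right)^2
= \hat V_k + \hat V_l - 2\,\widehat{Cov}_{kl}, \tag{9}
\]

where $\hat E_{kl} := \bigl[\sum_{i=1}^{N} \log(g_k(X_i)/g_l(X_i))\bigr]/N = \hat E_k - \hat E_l = \widehat{CE}(f, g_l) - \widehat{CE}(f, g_k)$, and $\widehat{Cov}_{kl} = \bigl[\sum_{i=1}^{N} (\log g_k(X_i) - \hat E_k)(\log g_l(X_i) - \hat E_l)\bigr]/N$. From the above, we can construct an asymptotic $(1-\alpha)$ confidence interval for the KLI-difference (4) as follows:

\[
\left[\, \Lambda_N(g_k, g_l) - z_{\alpha/2}\sqrt{\frac{\hat V_{kl}}{N}},\ \ \Lambda_N(g_k, g_l) + z_{\alpha/2}\sqrt{\frac{\hat V_{kl}}{N}} \,\right]. \tag{10}
\]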

If all the points within this interval are positive, then we conclude that, at significance level $\alpha$, the model $g_k$ is a significantly better approximation of the true DGP $f$ than the model $g_l$. If all the points within this interval are negative, then we reach the opposite conclusion. If this interval includes 0, then we conclude that the models $g_k$ and $g_l$ are equally supported by the sample $\{X_i\}$.
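The three-way decision rule above can be sketched as follows. The code computes the variance estimator (9) directly from the log-likelihood ratios, forms the interval (10), and returns a label; it assumes $0 < \alpha < 1$. The labels, the function names, and the simulated data are hypothetical.

```python
import math
from statistics import NormalDist

def kli_difference_interval(xs, g_k, g_l, alpha=0.05):
    """Asymptotic (1 - alpha) interval (10) for Delta(g_k, g_l)."""
    n = len(xs)
    ratios = [math.log(g_k.pdf(x)) - math.log(g_l.pdf(x)) for x in xs]
    lam = sum(ratios) / n                               # Lambda_N in (8)
    v_kl = sum((r - lam) ** 2 for r in ratios) / n      # \hat V_kl in (9)
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * math.sqrt(v_kl / n)
    return lam - half_width, lam + half_width

def compare_models(xs, g_k, g_l, alpha=0.05):
    """Return 'g_k better', 'g_l better', or 'equally supported'."""
    lower, upper = kli_difference_interval(xs, g_k, g_l, alpha)
    if lower > 0.0:
        return "g_k better"
    if upper < 0.0:
        return "g_l better"
    return "equally supported"

xs = NormalDist(0.0, 1.0).samples(2000, seed=7)  # simulated data, hypothetical DGP
g_k = NormalDist(0.5, 1.0)
g_l = NormalDist(0.0, 2.0)

print(kli_difference_interval(xs, g_k, g_l), compare_models(xs, g_k, g_l))
```

Comparing a model with itself gives a degenerate interval at 0, so the rule reports equal support, as it should.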

Note that the sample KLI-difference (8), which is a key quantity in the multimodel inference procedure described above, is proportional to the log-likelihood ratio statistic in classical procedures of statistical hypothesis testing. However, the meanings behind them are different. In multimodel inference via the extended maximum likelihood principle, the true DGP $f$ remains unknown, whereas in the classical testing procedure it is necessary to assume that some known model is the true DGP. In multimodel inference via the extended maximum likelihood principle, the null hypothesis is “the expected values of the log-likelihood (i.e., the cross entropies) of the two models are exactly the same”, and the alternative hypothesis is its negation. In other words, this multimodel inference procedure tests the null hypothesis “the degrees of approximation of the models $g_k$ and $g_l$ to the true DGP $f$ are exactly the same in the population” against the alternative hypothesis “one model is a more accurate approximation to the true DGP than the other”. Hence, although the classical test statistic seems similar to the sample KLI-difference considered in this paper, the philosophies behind them are different. This is a motivating insight for this paper³.

3.2 Estimated Best Approximate Model and its Properties

The best approximate model estimated (selected) by the procedure considered in Subsection 3.1 is defined as

\[
\hat g := \arg\min_{g_k \in \{g_k\}_{k=1}^{K}} \widehat{CE}(f, g_k), \tag{11}
\]

where $\widehat{CE}(f, g_k)$ is defined in (5). This is the extended maximum likelihood principle for estimating the best approximate model $g^*$ defined in (3). If there is only one candidate model and it has unknown parameters, the principle reduces to the classical maximum likelihood principle for estimating the parameters in the fixed model. We can easily calculate the sample cross entropy (5) for each candidate model, because the candidate models are given in fully known forms by the assumptions stated in Section 2. Hence $\hat g$ is also obtained easily.

However, whether the estimated best model $\hat g$ is truly the best in the population, i.e., whether $\hat g = g^*$ or $\hat g \neq g^*$, remains an important question. To answer it, the statistical properties of the estimated best model $\hat g$ must be investigated. For this purpose, note that the first part of Proposition 1 implies that we can estimate the rankings of the candidate models consistently via the extended maximum likelihood principle if all the candidates are given in fully known forms.

³ Mori, Nishikimi, and Smith (2005) also studied the distributional properties of a Kullback-Leibler information statistic, with a different purpose from ours.

Before proceeding, let us make the following assumption about the true best approximate model in the population.

Assumption 1 (Best Approximate Model in Population).

There exists only one model $g^*$, the true best approximate model in the population, which satisfies (3), i.e.,

\[
\Delta(g^*, g_k) > 0 \quad \text{for all } g_k \neq g^*.
\]

To understand the implications of this assumption, consider the following simple problem of comparing nested models:

Model (i): $y = a + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$,
Model (ii): $y = a + bx + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$,

where the parameter values $a$ and $b$ are specified, and the explanatory variable $x$ is non-stochastic. It follows that $y \sim N(a, \sigma^2)$ in Model (i) and that $y \sim N(a + bx, \sigma^2)$ in Model (ii).

Assuming that the true DGP is Model (i), as in the classical hypothesis testing procedure, we can easily calculate $CE(f, g_1) = -E_f[\log g_1] \propto 1$ and $CE(f, g_2) = -E_f[\log g_2] \propto 1 + (bx/\sigma)^2$, where $\propto$ is read up to a common additive constant and a common positive scale factor. Therefore, $CE(f, g_1) < CE(f, g_2)$ whenever $bx \neq 0$, which satisfies Assumption 1. On the other hand, if the parameter values $a$ and $b$ are not specified, which is the usual case in statistical analyses, then $CE(f, g_{10}) = CE(f, g_{20})$ can be concluded by setting $b = 0$, in spite of the assumption that Model (i) is the true DGP. This can disagree with Assumption 1, a subject related to the study by Shibata (1976).
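The two cross entropies in this example can be verified by a short calculation. The derivation below assumes, as in the text, that the true DGP is Model (i), so that $E_f[(y-a)^2] = \sigma^2$ and $E_f[(y-a-bx)^2] = \sigma^2 + (bx)^2$, and it treats $x$ as a fixed number.

```latex
\begin{align*}
-\log g_1(y) &= \tfrac{1}{2}\log(2\pi\sigma^2) + \frac{(y-a)^2}{2\sigma^2}, \\
-\log g_2(y) &= \tfrac{1}{2}\log(2\pi\sigma^2) + \frac{(y-a-bx)^2}{2\sigma^2}, \\
CE(f, g_1) &= \tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2}, \\
CE(f, g_2) &= \tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2}\left\{1 + \left(\frac{bx}{\sigma}\right)^2\right\}, \\
\Delta(g_1, g_2) &= CE(f, g_2) - CE(f, g_1) = \tfrac{1}{2}\left(\frac{bx}{\sigma}\right)^2 \ge 0 .
\end{align*}
```

The KLI-difference is strictly positive exactly when $bx \neq 0$, which is the case in which Model (i) is the unique best approximate model.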

However, in this paper, for the purpose of extracting the essence of Akaike's extended maximum likelihood principle, the candidate models are fully specified without unknown parameters. Specifying the candidate models without unknown parameters makes the use of Assumption 1 reasonable. This leads us to Proposition 3⁴.

Proposition 3. Under Assumption 1,

\[
CE(\hat f, \hat g) \xrightarrow{p} CE(f, g^*),
\]

where $\hat g = \arg\min_{g_k} CE(\hat f, g_k)$ and $g^* = \arg\min_{g_k} CE(f, g_k)$.

Proof. Assume without loss of generality that the candidate models $\{g_k\}$ are ordered as $CE(f, g_1) \le CE(f, g_2) \le \cdots \le CE(f, g_K)$, so that $g^* = g_1$. Furthermore, under Assumption 1, we get $CE(f, g_1) < CE(f, g_2) \le CE(f, g_3) \le \cdots \le CE(f, g_K)$. From the first part of Proposition 1, $CE(\hat f, g_1) \xrightarrow{p} CE(f, g_1) < CE(f, g_k) \xleftarrow{p} CE(\hat f, g_k)$ for every $k \neq 1$. Because $\hat g = \arg\min_{g_k} CE(\hat f, g_k)$ and $\min_{g_k} CE(\hat f, g_k) \xrightarrow{p} CE(f, g_1) < CE(f, g_k)$ for all $k \neq 1$, it is concluded that $CE(\hat f, \hat g) = \min_{g_k} CE(\hat f, g_k) \xrightarrow{p} CE(f, g_1) = CE(f, g^*)$.

This implies that we can asymptotically estimate the true best approximate model via the extended maximum likelihood principle under Assumption 1 if the candidate models are given in fully specified forms. The above results justify descriptive comparisons of the sample cross entropy, because such comparisons lead to an asymptotically correct estimate of the best approximate model. However, in a finite sample, a problem caused by sampling errors remains. Because this problem is usually not considered in descriptive comparisons, widely used descriptive comparisons can result in statistically insignificant conclusions in the light of statistical inference, which considers the effect of sampling errors on conclusions; this subject is studied by Vuong (1989) and Shimodaira (1998), for example. To resolve the difficulty, this paper simply applies Proposition 2 to $\hat g$ and every $g_k \neq \hat g$ and constructs the asymptotic $(1-\alpha)$ confidence intervals. If none of the intervals for the other models $g_k \neq \hat g$ includes 0, then we can conclude that, within the candidate models and at significance level $\alpha$, $\hat g$ is the true best approximate model $g^*$. On the other hand, if some intervals include 0, then it should be concluded that, besides the estimated best approximate model $\hat g$, other models within the candidate models are equally supported by the observations $\{X_i\}$. In this way, we can construct a multimodel inference procedure that takes the effect of sampling errors into consideration.

⁴ Asymptotic properties of the estimated best model $\hat g$ are related to the distributional properties of $CE(\hat f, \hat g)$: $\sqrt{N}\,(CE(\hat f, \hat g) - CE(f, g^*)) = -\sum_{i=1}^{N} (\log \hat g(X_i) - E_f[\log g^*(X_i)])/\sqrt{N}$, which is to be examined.
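Putting the pieces together, the procedure described above might be sketched as follows: select $\hat g$ by the smallest sample cross entropy, then form the interval (10) for $\hat g$ against every other candidate and keep the models whose interval includes 0. The candidate set, the simulated data, and the function name `select_models` are hypothetical, and the pairwise intervals are used without any multiplicity adjustment, following the simple approach in the text. The code assumes $0 < \alpha < 1$.

```python
import math
from statistics import NormalDist

def select_models(xs, candidates, alpha=0.05):
    """Return (name of g-hat, sample cross entropies, names of equally supported models)."""
    n = len(xs)
    logs = {name: [math.log(g.pdf(x)) for x in xs] for name, g in candidates.items()}
    ce_hat = {name: -sum(v) / n for name, v in logs.items()}   # (5) for every candidate
    best = min(ce_hat, key=ce_hat.get)                         # g-hat as in (11)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)

    supported = [best]
    for name in candidates:
        if name == best:
            continue
        ratios = [a - b for a, b in zip(logs[best], logs[name])]
        lam = sum(ratios) / n                                  # Lambda_N(g-hat, g_k) >= 0
        v_kl = sum((r - lam) ** 2 for r in ratios) / n         # (9)
        lower = lam - z * math.sqrt(v_kl / n)
        if lower <= 0.0:       # interval (10) includes 0: g_k is equally supported
            supported.append(name)
    return best, ce_hat, supported

xs = NormalDist(0.0, 1.0).samples(2000, seed=7)  # simulated data, hypothetical DGP
candidates = {
    "g1": NormalDist(0.5, 1.0),
    "g2": NormalDist(0.0, 2.0),
    "g3": NormalDist(-1.0, 1.5),
}
best, ce_hat, supported = select_models(xs, candidates)
print(best, supported)
```

Because $\hat g$ minimizes the sample cross entropy, every $\Lambda_N(\hat g, g_k)$ is nonnegative, so only the lower end of each interval needs to be checked.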

4 Concluding Remarks

The objective of this paper was to extract the essence of the extended maximum likelihood principle advocated by Akaike (1973). One of the distinctive features of the principle is that it is free from the true model assumption, and is therefore suitable for economic analysis in cases where it is difficult to approximate true phenomena. This paper studied statistical aspects of the extended maximum likelihood principle and, via this principle, formulated a procedure for multimodel inference when the candidate models are fully specified without unknown parameters.

Acknowledgements

The author would like to pay tribute to the late Professor Toshihiro Tanaka, who was the then Dean of the Faculty of Economics when I was interviewed at Fukuoka University in Autumn 2001 for joining the faculty.


References

Akaike, H. (1973), “Information theory and an extension of the maximum likelihood principle”, In Proceedings of the Second International Symposium on Information Theory, 267–281.

Ariga, K. and K. Matsui (2003), “Mismeasurement of the CPI”, In Structural Impediments to Growth in Japan, eds. M. Blomström, J. Corbett, F. Hayashi, and A. Kashyap, National Bureau of Economic Research, 89–128.

Burnham, K. P. and D. R. Anderson (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed., Springer.

Chen, S. and M. Ravallion (2008), “The developing world is poorer than we thought, but no less successful in the fight against poverty”, Policy Research Working Paper, WPS4703, The World Bank, Development Research Group.

Kagihara, M. (2009), “Semilogarithmic error based estimation and its related distributions”, In JSM 2009 Proceedings, American Statistical Association, 3214–3227.

Kagihara, M. (forthcoming), “On potential applications of the method of least rectangles to biased data”, The Economic Review, 183, No. 2, Kyoto University Economic Society (In Japanese).

Konishi, S. and G. Kitagawa (2004), Information Criteria, Asakura-Shoten (In Japanese).

Konishi, S. and G. Kitagawa (2008), Information Criteria and Statistical Modeling, Springer.

Mori, T., K. Nishikimi, and T. E. Smith (2005), “A divergence statistic for industrial localization”, Review of Economics and Statistics, 87, 635–651.

Shibata, R. (1976), “Selection of the order of an autoregressive model by Akaike's information criterion”, Biometrika, 63, 117–126.

Shimodaira, H. (1998), “An application of multiple comparison techniques to model selection”, Annals of the Institute of Statistical Mathematics, 50, 1–13.

Shiratsuka, S. (2005), “Measurement error of Japanese CPI: the state of so-called upper bias”, Bank of Japan Review Series, 2005-J-14, Bank of Japan (In Japanese).

Soofi, E. S. (1994), “Capturing the intangible concept of information”, Journal of the American Statistical Association, 89, 1243–1254.

Stiglitz, J., A. Sen, and J. Fitoussi eds. (2009), Report by the Commission on the Measurement of Economic Performance and Social Progress, Commission on the Measurement of Economic Performance and Social Progress.

Vuong, Q. H. (1989), “Likelihood ratio tests for model selection and non-nested hypotheses”, Econometrica, 57, 307–333.

Appendix

A The Effect of Rounding Errors on Sample Variance

In practical computation, we cannot always obtain the exact value of the sample mean $\bar X$, but only an approximate value $\hat X = \bar X + o(X)$, where $o(X)$ stands for the rounding error. Hence, we get $\sum_{i=1}^{N} (X_i - \hat X)^2/N \neq \sum_{i=1}^{N} X_i^2/N - \hat X^2$, in spite of $\sum_{i=1}^{N} (X_i - \bar X)^2/N = \sum_{i=1}^{N} X_i^2/N - \bar X^2$. Note the following:

\begin{align}
\frac{1}{N}\sum_{i=1}^{N} (X_i - \hat X)^2 &= \frac{1}{N}\sum_{i=1}^{N} (X_i - \bar X)^2 + o(X)^2, \tag{12} \\
\frac{1}{N}\sum_{i=1}^{N} X_i^2 - \hat X^2 &= \frac{1}{N}\sum_{i=1}^{N} (X_i - \bar X)^2 - 2\bar X \cdot o(X) - o(X)^2. \tag{13}
\end{align}

It is concluded from the above expressions that the first expression (12) is more accurate than the second one (13). This is because the approximation error in the first expression is the square of the rounding error, $o(X)^2$, while the error in the second expression includes the rounding error $o(X)$ multiplied by the sample mean $\bar X$.

Therefore, when the true value of the sample mean $\bar X$ is sufficiently large compared with its rounding error $o(X)$, the second expression may cause severe bias in computing the sample variance. The difference between the two expressions is

\[
\frac{1}{N}\sum_{i=1}^{N} (X_i - \hat X)^2 - \left(\frac{1}{N}\sum_{i=1}^{N} X_i^2 - \hat X^2\right) = 2\hat X \cdot o(X).
\]
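The effect can be reproduced with exact rational arithmetic by perturbing the mean deliberately. In the sketch below the data values and the perturbation `o`, which mimics a rounding error in the mean, are hypothetical; `fractions.Fraction` is used so that the error terms themselves are computed without further rounding.

```python
from fractions import Fraction

xs = [Fraction(v) for v in (1_000_001, 1_000_002, 1_000_003, 1_000_004)]  # hypothetical data
n = len(xs)

mean_exact = sum(xs) / n
var_exact = sum((x - mean_exact) ** 2 for x in xs) / n   # exact sample variance

o = Fraction(1, 1000)       # artificial "rounding error" added to the mean
mean_hat = mean_exact + o   # the perturbed mean, \hat X

two_pass = sum((x - mean_hat) ** 2 for x in xs) / n      # squared deviations from \hat X
one_pass = sum(x * x for x in xs) / n - mean_hat ** 2    # mean of squares minus \hat X squared

err_two_pass = two_pass - var_exact   # equals o^2
err_one_pass = one_pass - var_exact   # equals -2 * mean * o - o^2

print(float(var_exact), float(err_two_pass), float(err_one_pass))
```

With a mean of about one million and a perturbation of one thousandth, the squared-deviation form is off by about $10^{-6}$ while the mean-of-squares form is off by about 2000, against a true variance of 1.25.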
