
Multimodel Inference via the Extended Maximum Likelihood Principle When Candidate Models Are Fully Specified

Masato Kagihara∗

∗ Faculty of Economics, Fukuoka University; E-mail: kagihara@fukuoka-u.ac.jp

1 Introduction

In studying statistical theory for economic analysis, two problems must be taken into consideration. First, economic statistics, the fundamental information for rational decision making, can contain significant and skewed measurement error, as implied by the recent study of Stiglitz, Sen, and Fitoussi eds. (2009) on the difficulty of measuring economic phenomena. This means that economic statistics tend to be biased, as exemplified by the Japanese consumer price index studied by Ariga and Matsui (2003) and Shiratsuka (2005), and by the 2005 International Comparison Program price data studied by Chen and Ravallion (2008). Second, economic models may contain significant approximation error relative to the true phenomenon being modeled. This may be inevitable in social sciences such as economics and politics. It follows that the development of statistical theory for economic analysis must take these problems into account. The latter, the approximation error problem of economic models, has been a prime motivation for the writing of this paper; the former, the measurement error problem of economic statistics, has motivated the writing of Kagihara (2009) and Kagihara (forthcoming).

For this purpose, this paper extracts the essence of Akaike's extended maximum likelihood principle by considering a simple case in which candidate models are fully specified. The principle advocated by Akaike (1973) is free from the true model assumption, and is therefore suitable for a social science such as economics. It is unrealistic in economics to expect that theoretical models will have little approximation error relative to the true economic phenomena.

Akaike (1973) used the Kullback-Leibler information to measure discrepancies between theoretical models and the unknown true phenomenon being modeled, and derived the Akaike Information Criterion (AIC). The AIC is a bias-corrected estimator of the essential part of the Kullback-Leibler information. He showed that, in problems of estimating unknown parameters in a fixed model, the classical maximum likelihood principle arises as a natural result of his approach. Akaike (1973) therefore called the approach an extension of the maximum likelihood principle, covering statistical estimation problems under a variety of different models. This approach can be considered to combine statistical estimation with model selection, and is referred to as multimodel inference. Detailed expositions of the approach advocated by Akaike (1973) can be found in Burnham and Anderson (2002), Konishi and Kitagawa (2004), and Konishi and Kitagawa (2008).

The Kullback-Leibler information of a probability density function (pdf) $g$ from another pdf $f$, $D(f, g)$, is defined as follows:

$$
\begin{aligned}
D(f, g) &:= \int f(x) \log \frac{f(x)}{g(x)} \, dx \\
&= E_f\left[\log \frac{f(x)}{g(x)}\right] \\
&= E_f[\log f(x)] - E_f[\log g(x)],
\end{aligned} \tag{1}
$$


where $E_f$ stands for the expectation over the pdf $f$. This quantity has the following properties: $D(f, g) \ge 0$, and $D(f, g) = 0 \iff f(x) = g(x)$. Hence, $D(f, g)$ can be interpreted as a measure of the discrepancy of a pdf $g$ from another pdf $f$. Furthermore, if $f$ is the pdf describing the unknown true phenomenon, then the pdf $g$ attaining the smallest $D(f, g)$ is the best approximate model for the unknown true pdf $f$ in the sense of Kullback-Leibler information.
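As a numerical illustration of definition (1) and its two properties, the following minimal sketch approximates $D(f, g)$ by a midpoint rule. The normal densities, the integration range, and the helper names `normal_pdf` and `kl_divergence` are assumptions made for this example only; they do not appear in the paper.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kl_divergence(f, g, lo, hi, n=20000):
    """Approximate D(f, g) = integral of f(x) log(f(x)/g(x)) dx on [lo, hi]
    with the midpoint rule (n equal sub-intervals)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        fx = f(x)
        if fx > 0.0:  # points where f vanishes contribute nothing
            total += fx * math.log(fx / g(x)) * h
    return total


def f(x):
    # Hypothetical "true" pdf: N(0, 1)
    return normal_pdf(x, 0.0, 1.0)


def g_shift(x):
    # Hypothetical candidate pdf: N(1, 1)
    return normal_pdf(x, 1.0, 1.0)


d_same = kl_divergence(f, f, -12.0, 12.0)         # D(f, f)
d_shift = kl_divergence(f, g_shift, -12.0, 12.0)  # D(f, g) for a shifted mean
print(d_same, d_shift)
```

For two unit-variance normal densities whose means differ by one, the closed-form value of the Kullback-Leibler information is 1/2, while the discrepancy of $f$ from itself is zero, in line with the two properties stated above.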

The rest of this paper is structured as follows. After describing the setting of the problem and ideas of the extended maximum likelihood principle in Section 2, we develop procedures for multimodel inference via this principle in Section 3. Concluding remarks are given in Section 4.

2 Extended Maximum Likelihood Principle

Assume that we have independently and identically distributed (i.i.d.) observations $\{X_i \in \mathbb{R}\}_{i=1}^{N}$, and that we want to know the data generating process (DGP) producing these observations. The strategy considered here is to approximate the true but ultimately unknown DGP by a known model which is expected to be closest to the true DGP. Suppose that we have a set of models explaining the DGP of $\{X_i\}$, the candidate models, denoted by $\{g_k\}_{k=1}^{K}$ for $K \in \mathbb{N}$. These candidate models $\{g_k\}$ should be chosen based on our scientific knowledge about the phenomenon producing the observations $\{X_i\}$: economics, politics, biology, physics, and so on.

Throughout this paper, we suppose that the pdf $f$ in (1) is the unknown true DGP generating the observations $\{X_i\}$, and that the candidate models $\{g_k\}$ are given as fully specified forms without unknown parameters.

The closest model to the true DGP $f$ in the sense of Kullback-Leibler information (1) is obtained by selecting, as the best, the model which minimizes (1) within the candidates $\{g_k\}$. This approach is equivalent to the minimization of

$$
CE(f, g_k) := -E_f[\log g_k(X)] \tag{2}
$$

within $\{g_k\}_{k=1}^{K}$, because the first term of the last expression in (1) is the same for all the candidate models $\{g_k\}$ and can therefore be ignored. Note that the first term $E_f[\log f(x)]$ is the negative of the Shannon information, or entropy, of the pdf $f$ and is the expected log-likelihood of the true model $f$. The quantity (2) is called the cross entropy¹ and is the negative of the expected log-likelihood of the model $g_k$ under the true model $f$. Hence, the best approximate model $g^*$ within the candidate models in the population (the true best approximate model within the candidates) is defined as

$$
\begin{aligned}
g^* &:= \arg\min_{g_k \in \{g_k\}_{k=1}^{K}} \{D(f, g_k)\} \\
&= \arg\min_{g_k \in \{g_k\}_{k=1}^{K}} \{CE(f, g_k)\}.
\end{aligned} \tag{3}
$$

When comparing two models $g_k$ and $g_l$, the difference in the Kullback-Leibler information (KLI-difference), $\Delta(g_k, g_l)$, defined as follows, is a key quantity, because the model $g_k$ is better than, worse than, or indifferent to $g_l$ when this quantity is positive, negative, or zero, respectively:

$$
\begin{aligned}
\Delta(g_k, g_l) &:= D(f, g_l) - D(f, g_k) \\
&= E_f[\log g_k(X)] - E_f[\log g_l(X)] \\
&= E_f\left[\log \frac{g_k(X)}{g_l(X)}\right].
\end{aligned} \tag{4}
$$

Equation (4) shows that the KLI-difference is the expected log-likelihood ratio between the models $g_k$ and $g_l$ under the true DGP $f$.

The problem arising here is that we cannot observe the quantities (1), (2), and (4) directly. Therefore, we have to estimate them based on the observed sample $\{X_i\}$. We discuss this matter in Section 3.

¹ For terms related to entropy, see Soofi (1994).


3 Multimodel Inference via the Extended Maximum Likelihood Principle

This section develops procedures for multimodel inference via the extended maximum likelihood principle when the candidate models are fully specified, i.e., the candidate models are given as fully known forms without unknown parameters. Subsection 3.1 discusses how to estimate the cross entropy (2) and the KLI-difference (4), and investigates their statistical properties. Subsection 3.2 presents a procedure for estimating the best approximate model and provides a statistical foundation for it.

3.1 Asymptotic Properties of Sample Cross Entropy and Sample Difference of Kullback-Leibler Information

Equation (2) shows that the cross entropy is the negative expected log-likelihood of a candidate model $g_k$ when the expectation is evaluated under the true DGP $f$. A natural estimator of the cross entropy is its sample analog, the sample cross entropy, which replaces the true $f$ in equation (2) with the empirical distribution $\hat{f}$:

$$
\widehat{CE}(f, g_k) := CE(\hat{f}, g_k) = -\frac{1}{N} \sum_{i=1}^{N} \log g_k(X_i). \tag{5}
$$

The sample cross entropy (5) is proportional to the log-likelihood of the model $g_k$ if the data $\{X_i\}$ are i.i.d. Its asymptotic properties are obtained by the law of large numbers and the central limit theorem.
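The estimator (5) is simple to compute when each candidate is a fully known density. The following is a minimal sketch, not the author's implementation: the simulated data, the three normal candidates, and the names `normal_logpdf` and `sample_cross_entropy` are all hypothetical choices made for illustration.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def sample_cross_entropy(log_g, xs):
    """Sample cross entropy (5): minus the average log-density of the model."""
    return -sum(log_g(x) for x in xs) / len(xs)


# Hypothetical data: in this toy example the true DGP happens to be N(0, 1).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Fully specified candidate models (no unknown parameters).
candidates = {
    "g1: N(0, 1)": lambda x: normal_logpdf(x, 0.0, 1.0),
    "g2: N(1, 1)": lambda x: normal_logpdf(x, 1.0, 1.0),
    "g3: N(0, 4)": lambda x: normal_logpdf(x, 0.0, 2.0),
}

ce_hat = {name: sample_cross_entropy(lg, xs) for name, lg in candidates.items()}
for name, value in ce_hat.items():
    print(name, round(value, 4))
```

In this toy setting the population cross entropy of `g1` is $\tfrac{1}{2}\log(2\pi) + \tfrac{1}{2}$, and the sample value should be close to it for a sample of this size.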

Proposition 1 (Properties of Sample Cross Entropy).

The sample cross entropy (5) is a consistent and asymptotically normal estimator of the population cross entropy (2).

1. $\widehat{CE}(f, g_k) \xrightarrow{p} CE(f, g_k)$.

2. Assume that $E_f[\{\log g_k(X)\}^2] < \infty$. Then

$$
\sqrt{N}\,\bigl(\widehat{CE}(f, g_k) - CE(f, g_k)\bigr) \xrightarrow{d} N(0, V_f[\log g_k]),
$$

where $V_f[\log g_k] = E_f[\{\log g_k(X) - E_f[\log g_k(X)]\}^2]$.

Proof. 1. Since $\{X_i\}$ are i.i.d., so are $\{\log g_k(X_i)\}$ for all $k$. Because $\widehat{CE}(f, g_k) = -\bigl(\sum_{i=1}^{N} \log g_k(X_i)\bigr)/N$ and $-E_f[\log g_k(X_i)] = CE(f, g_k)$ for all $k$ and $i$, we get the result owing to Khinchine's weak law of large numbers.

2. Since $\{X_i\}$ are an i.i.d. sequence, so are $\{\log g_k(X_i)\}$ for all $k$. Because $\sqrt{N}\,(\widehat{CE}(f, g_k) - CE(f, g_k)) = -\bigl[\sum_{i=1}^{N} (\log g_k(X_i) - E_f[\log g_k(X)])\bigr]/\sqrt{N}$, the result follows from the Lindeberg-Lévy central limit theorem.

From this result, we can construct a confidence interval for the cross entropy of each model $g_k$. For this purpose, we must estimate $V_f[\log g_k(X)]$ consistently because we cannot observe this quantity in practice. Such an estimator is given by, say, the sample variance²,

$$
\hat{V}_k := \frac{1}{N} \sum_{i=1}^{N} (\log g_k(X_i) - \hat{E}_k)^2, \tag{6}
$$

where $\hat{E}_k := \bigl(\sum_{i=1}^{N} \log g_k(X_i)\bigr)/N = -\widehat{CE}(f, g_k)$. It follows from the above that an asymptotic $(1-\alpha)$ confidence interval for the cross entropy (2) of the model $g_k$ is

$$
\left[\widehat{CE}(f, g_k) - z_{\alpha/2} \sqrt{\frac{\hat{V}_k}{N}},\ \ \widehat{CE}(f, g_k) + z_{\alpha/2} \sqrt{\frac{\hat{V}_k}{N}}\right] \tag{7}
$$

for $\alpha \in [0, 1]$, where $z_{\alpha/2}$ denotes the $100(1-\alpha/2)$th percentile of the standard normal distribution.

² The analytically equivalent expression of the sample variance (6) is $\hat{V}_k = \sum_{i=1}^{N} (\log g_k(X_i))^2/N - (\hat{E}_k)^2$. However, in practice, the values computed from the sample can differ from each other because of rounding errors. Appendix A shows that the expression (6) produces a more accurate value.
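The interval (7) can be sketched in a few lines. The code below is an illustration under stated assumptions: the data are simulated, the single candidate model is a hypothetical $N(0, 1)$ density, and `cross_entropy_ci` is a name introduced here. The variance is computed in the two-pass form (6), and the percentile $z_{\alpha/2}$ is taken from the standard library's `statistics.NormalDist`.

```python
import math
import random
from statistics import NormalDist


def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def cross_entropy_ci(log_g, xs, alpha=0.05):
    """Return (lower, estimate, upper) for the cross entropy, following (5)-(7)."""
    n = len(xs)
    logs = [log_g(x) for x in xs]
    e_hat = sum(logs) / n                                # E-hat_k
    v_hat = sum((v - e_hat) ** 2 for v in logs) / n      # eq. (6), two-pass form
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)          # z_{alpha/2}
    half_width = z * math.sqrt(v_hat / n)
    ce_hat = -e_hat                                      # eq. (5)
    return ce_hat - half_width, ce_hat, ce_hat + half_width


# Hypothetical data and model, for illustration only.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]


def log_g(x):
    return normal_logpdf(x, 0.0, 1.0)


lo95, ce, hi95 = cross_entropy_ci(log_g, xs, alpha=0.05)
lo99, _, hi99 = cross_entropy_ci(log_g, xs, alpha=0.01)
print((lo95, ce, hi95), (lo99, hi99))
```

Lowering $\alpha$ widens the interval, since $z_{\alpha/2}$ grows while $\hat{V}_k$ and $N$ stay fixed.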

To estimate, or select, the best approximate model within the candidate models, comparing their sample cross entropies (5) or confidence intervals (7) is insufficient. Co-variation among the models should be taken into account by examining the KLI-difference (4) between them. Because the KLI-difference is an unknown quantity, it must be estimated. For this purpose, consider the sample KLI-difference, an average empirical log-likelihood ratio, which replaces the true $f$ in equation (4) with the empirical distribution $\hat{f}$:

$$
\begin{aligned}
\Lambda_N(g_k, g_l) &:= E_{\hat{f}}\left[\log \frac{g_k(X)}{g_l(X)}\right] = \frac{1}{N} \sum_{i=1}^{N} \log \frac{g_k(X_i)}{g_l(X_i)} \\
&= \hat{E}_k - \hat{E}_l = \widehat{CE}(f, g_l) - \widehat{CE}(f, g_k).
\end{aligned} \tag{8}
$$

Its asymptotic properties are established as follows.

Proposition 2 (Properties of Sample KLI-Difference).

The sample KLI-difference (8) is a consistent and asymptotically normal estimator of the population KLI-difference (4).

1. $\Lambda_N(g_k, g_l) \xrightarrow{p} \Delta(g_k, g_l)$.

2. Assume that $E_f[\{\log(g_k(X)/g_l(X))\}^2] < \infty$. Then

$$
\sqrt{N}\,\bigl(\Lambda_N(g_k, g_l) - \Delta(g_k, g_l)\bigr) \xrightarrow{d} N\left(0,\ V_f\left[\log \frac{g_k(X)}{g_l(X)}\right]\right),
$$

where $V_f[\log(g_k(X)/g_l(X))] = V_f[\log g_k] + V_f[\log g_l] - 2\,\mathrm{Cov}[\log g_k, \log g_l]$, and $\mathrm{Cov}[\log g_k, \log g_l] = E_f[(\log g_k - E_f[\log g_k])(\log g_l - E_f[\log g_l])]$.

Proof. 1. Since $\{X_i\}$ are i.i.d., so are $\{\log(g_k(X_i)/g_l(X_i))\}$ for all $k$ and $l$. Because $\Lambda_N(g_k, g_l) = \bigl[\sum_{i=1}^{N} \log(g_k(X_i)/g_l(X_i))\bigr]/N$ and $E_f[\log(g_k(X_i)/g_l(X_i))] = \Delta(g_k, g_l)$ for all $k$, $l$, and $i$, we get the result owing to Khinchine's weak law of large numbers.

2. Since $\{X_i\}$ are i.i.d., so are $\{\log(g_k(X_i)/g_l(X_i))\}$ for all $k$ and $l$. Because $\sqrt{N}\,(\Lambda_N(g_k, g_l) - \Delta(g_k, g_l)) = \sum_{i=1}^{N} \{\log(g_k(X_i)/g_l(X_i)) - E_f[\log(g_k(X)/g_l(X))]\}/\sqrt{N}$, the stated result follows from the Lindeberg-Lévy central limit theorem.

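Because we cannot observe $V_f[\log(g_k(X)/g_l(X))]$ in practice, we must estimate this quantity consistently by

$$
\begin{aligned}
\hat{V}_{kl} &:= \frac{1}{N} \sum_{i=1}^{N} \left(\log \frac{g_k(X_i)}{g_l(X_i)} - \hat{E}_{kl}\right)^2 \\
&= \hat{V}_k + \hat{V}_l - 2\,\widehat{\mathrm{Cov}}_{kl},
\end{aligned} \tag{9}
$$

where $\hat{E}_{kl} := \bigl[\sum_{i=1}^{N} \log(g_k(X_i)/g_l(X_i))\bigr]/N = \hat{E}_k - \hat{E}_l = \widehat{CE}(f, g_l) - \widehat{CE}(f, g_k)$, and $\widehat{\mathrm{Cov}}_{kl} = \bigl[\sum_{i=1}^{N} (\log g_k(X_i) - \hat{E}_k)(\log g_l(X_i) - \hat{E}_l)\bigr]/N$. From the above, we can construct an asymptotic $(1-\alpha)$ confidence interval for the KLI-difference (4) as follows:

$$
\left[\Lambda_N(g_k, g_l) - z_{\alpha/2} \sqrt{\frac{\hat{V}_{kl}}{N}},\ \ \Lambda_N(g_k, g_l) + z_{\alpha/2} \sqrt{\frac{\hat{V}_{kl}}{N}}\right]. \tag{10}
$$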

If all the points within this interval are positive, then we conclude that, at significance level $\alpha$, the model $g_k$ is a significantly better approximation of the true DGP $f$ than the model $g_l$. If all the points within this interval are negative, then we reach the opposite conclusion. If this interval includes 0, then we conclude that the models $g_k$ and $g_l$ are equally supported by the sample $\{X_i\}$.
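The pairwise comparison based on (8), (9), and (10) can be sketched as follows. This is an illustrative sketch only: the data are simulated, the two normal candidates are hypothetical, and the names `kli_difference_ci` and `compare_models` are introduced here rather than taken from the paper.

```python
import math
import random
from statistics import NormalDist


def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def kli_difference_ci(log_gk, log_gl, xs, alpha=0.05):
    """Return (lower, estimate, upper) for the KLI-difference, following (8)-(10)."""
    n = len(xs)
    ratios = [log_gk(x) - log_gl(x) for x in xs]       # log(g_k(X_i) / g_l(X_i))
    lam = sum(ratios) / n                              # eq. (8)
    v_kl = sum((r - lam) ** 2 for r in ratios) / n     # eq. (9), first line
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(v_kl / n)
    return lam - half_width, lam, lam + half_width


def compare_models(log_gk, log_gl, xs, alpha=0.05):
    """Apply the decision rule stated after interval (10)."""
    lower, _, upper = kli_difference_ci(log_gk, log_gl, xs, alpha)
    if lower > 0.0:
        return "g_k significantly better"
    if upper < 0.0:
        return "g_l significantly better"
    return "equally supported"


# Hypothetical data and candidates, for illustration only.
rng = random.Random(2)
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]


def log_g1(x):
    return normal_logpdf(x, 0.0, 1.0)


def log_g2(x):
    return normal_logpdf(x, 1.0, 1.0)


print(compare_models(log_g1, log_g2, xs))
print(compare_models(log_g2, log_g1, xs))
```

Working with the per-observation log ratios directly means the covariance term in (9) never has to be computed separately; the first line of (9) already accounts for it.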

Note that the sample KLI-difference (8), which is a key quantity in the multimodel inference procedure described above, is proportional to the log-likelihood ratio statistic in classical procedures of statistical hypothesis testing. However, the meanings behind them are different. In multimodel inference via the extended maximum likelihood principle, the true DGP $f$ remains unknown, but in the classical testing procedure, it is necessary to assume that some known model is the true DGP. In multimodel inference via the extended maximum likelihood principle, the null hypothesis is "the expected values of the log-likelihood (i.e., the negatives of the cross entropies) of both models are exactly the same", and the alternative hypothesis is that they are not. In other words, this multimodel inference procedure tests the null hypothesis "the degrees of approximation of the models $g_k$ and $g_l$ to the true DGP $f$ are exactly the same in the population" against the alternative hypothesis "one model is a more accurate approximation to the true DGP than the other". Hence, although the classical test statistic seems similar to the sample KLI-difference considered in this paper, the philosophies behind them are different. This is a motivating insight for this paper³.

3.2 Estimated Best Approximate Model and its Properties

The best approximate model estimated (selected) by the procedure considered in Subsection 3.1 is defined as

$$
\hat{g}^* := \arg\min_{g_k \in \{g_k\}_{k=1}^{K}} \{\widehat{CE}(f, g_k)\}, \tag{11}
$$

where $\widehat{CE}(f, g_k)$ is defined in (5). This is the extended maximum likelihood principle for estimating the best approximate model $g^*$ defined in (3). If there is only one candidate model with unknown parameters, the principle results in the classical maximum likelihood principle for estimating the parameters in the fixed model. We can easily calculate the sample cross entropy (5) for each candidate model because the candidate models are given in fully known forms by the assumptions stated in Section 2. Hence $\hat{g}^*$ is also easily obtained.

However, whether the estimated best model $\hat{g}^*$ is truly best in the population or not, i.e., $\hat{g}^* = g^*$ or $\hat{g}^* \neq g^*$, remains an important question. To answer it, statistical properties of the estimated best model $\hat{g}^*$ must be investigated. For this purpose, note that the first part of Proposition 1 implies that we can estimate the rankings of the candidate models consistently via the extended maximum likelihood principle if all the candidates are given as fully known forms.

³ Mori, Nishikimi, and Smith (2005) also studied the distributional properties of a Kullback-Leibler information statistic, with a purpose different from ours.

Before proceeding, let us make the following assumption about the true best approximate model in the population.

Assumption 1 (Best Approximate Model in Population).

There exists only one model $g^*$, the true best approximate model in the population, which satisfies (3), i.e.,

$$
\Delta(g^*, g_k) > 0 \quad \text{for all } g_k \neq g^*.
$$

To understand the implications of this assumption, consider the following simple problem of comparing nested models:

Model (i): $y = a + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$,
Model (ii): $y = a + bx + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$,

where the parameter values $a$ and $b$ are specified, and the explanatory variable $x$ is non-stochastic. From the above, it follows that $y \sim N(a, \sigma^2)$ in Model (i) and that $y \sim N(a + bx, \sigma^2)$ in Model (ii).

Assuming that the true DGP is Model (i), as in the classical hypothesis testing procedure, we can easily calculate $CE(f, g_1) = -E_f[\log g_1] \propto 1$ and $CE(f, g_2) = -E_f[\log g_2] \propto 1 + (bx/\sigma)^2$, where $\propto$ is understood up to a common additive constant and a common positive factor. Therefore, it is concluded that $CE(f, g_1) < CE(f, g_2)$ whenever $bx \neq 0$, which satisfies Assumption 1. On the other hand, if the parameter values $a$ and $b$ are not specified, which is the usual case in statistical analyses, then $CE(f, g_{10}) = CE(f, g_{20})$ can be concluded by setting $b = 0$, in spite of the assumption that Model (i) is the true DGP. This can disagree with Assumption 1, a subject related to a study by Shibata (1976).
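The two cross entropies in this example can be worked out explicitly. The derivation below is a sketch under the assumptions that Model (i), with known $a$ and $\sigma^2$, is the true DGP $f$, and that a single observation $y$ at a fixed value of $x$ is considered; the constant $c$ is notation introduced here.

```latex
\begin{align*}
-\log g_1(y) &= c + \frac{(y-a)^2}{2\sigma^2},
  \qquad c := \tfrac{1}{2}\log(2\pi\sigma^2), \\
CE(f, g_1) &= c + \frac{E_f[(y-a)^2]}{2\sigma^2} = c + \frac{1}{2}, \\
-\log g_2(y) &= c + \frac{(y-a-bx)^2}{2\sigma^2}, \\
CE(f, g_2) &= c + \frac{E_f[(y-a)^2] - 2bx\,E_f[y-a] + (bx)^2}{2\sigma^2}
            = c + \frac{1}{2}\left\{1 + \left(\frac{bx}{\sigma}\right)^2\right\}, \\
\Delta(g_1, g_2) &= CE(f, g_2) - CE(f, g_1)
            = \frac{1}{2}\left(\frac{bx}{\sigma}\right)^2 \ge 0 .
\end{align*}
```

The strict inequality required by Assumption 1 therefore holds exactly when $bx \neq 0$.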

However, in this paper, for the purpose of extracting the essence of Akaike's extended maximum likelihood principle, the candidate models are fully specified without unknown parameters. Specifying the candidate models without unknown parameters makes the use of Assumption 1 reasonable. This leads us to Proposition 3⁴.

Proposition 3. Under Assumption 1,

$$
CE(\hat{f}, \hat{g}^*) \xrightarrow{p} CE(f, g^*),
$$

where $\hat{g}^* = \arg\min_{g_k} \{CE(\hat{f}, g_k)\}$ and $g^* = \arg\min_{g_k} \{CE(f, g_k)\}$.

Proof. Assume without loss of generality that the candidate models $\{g_k\}$ are ordered as $CE(f, g_1) \le CE(f, g_2) \le \cdots \le CE(f, g_K)$, so that $g^* = g_1$. Furthermore, under Assumption 1, we get $CE(f, g_1) < CE(f, g_2) \le CE(f, g_3) \le \cdots \le CE(f, g_K)$. From the first part of Proposition 1, $CE(\hat{f}, g_1) \xrightarrow{p} CE(f, g_1) < CE(f, g_k) \xleftarrow{p} CE(\hat{f}, g_k)$ for every $k \neq 1$. Because $\hat{g}^* = \arg\min_{g_k} \{CE(\hat{f}, g_k)\}$ and $\min_{g_k} \{CE(\hat{f}, g_k)\} \xrightarrow{p} CE(f, g_1) < CE(f, g_k)$ for all $k \neq 1$, it is concluded that $CE(\hat{f}, \hat{g}^*) = \min_{g_k} \{CE(\hat{f}, g_k)\} \xrightarrow{p} CE(f, g_1) = CE(f, g^*)$.

This implies that we can asymptotically estimate the true best approximate model via the extended maximum likelihood principle under Assumption 1 if the candidate models are given as fully specified forms. The above results can justify descriptive comparisons of the sample cross entropy, because such comparisons lead to an asymptotically correct estimation of the best approximate model. However, in a finite sample, a problem caused by sampling errors remains. Because this problem is usually not considered in descriptive comparisons, widely used descriptive comparisons can result in conclusions that are insignificant in the light of statistical inference, which considers the effect of sampling errors on conclusions, a subject studied by Vuong (1989) and Shimodaira (1998), for example. To resolve the difficulty, this paper simply applies Proposition 2 to $\hat{g}^*$ and every $g_k \neq \hat{g}^*$, and constructs the asymptotic $(1-\alpha)$ confidence intervals. If none of the intervals for the other models $g_k \neq \hat{g}^*$ includes 0, then we can conclude that, within the candidate models and at significance level $\alpha$, $\hat{g}^*$ is the true best approximate model $g^*$. On the other hand, if some intervals include 0, then it should be concluded that, besides the estimated best approximate model $\hat{g}^*$, other models within the candidate models are equally supported by the observations $\{X_i\}$. In this way, we can construct a multimodel inference procedure that takes the effect of sampling errors into consideration.

⁴ Asymptotic properties of the estimated best model $\hat{g}^*$ are related to the distributional properties of $CE(\hat{f}, \hat{g}^*)$: $\sqrt{N}\,(CE(\hat{f}, \hat{g}^*) - CE(f, g^*)) = -\sum_{i=1}^{N} (\log \hat{g}^*(X_i) - E_f[\log g^*(X_i)])/\sqrt{N}$, which is to be examined.
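The whole procedure, selection by (11) followed by pairwise intervals (10) against the selected model, can be sketched as below. This is a minimal illustration under stated assumptions, not the author's code: the data are simulated, the candidate set is hypothetical, `select_best_model` is a name introduced here, and, following the description above, the pairwise intervals are used as they are, without any adjustment for multiple comparisons.

```python
import math
import random
from statistics import NormalDist


def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def select_best_model(log_models, xs, alpha=0.05):
    """Return (best, ce_hat, ties).

    best   : name of the model minimizing the sample cross entropy, eq. (11)
    ce_hat : dict of sample cross entropies, eq. (5)
    ties   : models whose interval (10) against `best` includes 0
    """
    n = len(xs)
    logs = {name: [lg(x) for x in xs] for name, lg in log_models.items()}
    ce_hat = {name: -sum(vals) / n for name, vals in logs.items()}
    best = min(ce_hat, key=ce_hat.get)

    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    ties = []
    for name, vals in logs.items():
        if name == best:
            continue
        ratios = [b - o for b, o in zip(logs[best], vals)]
        lam = sum(ratios) / n                            # eq. (8), lam >= 0 here
        v_kl = sum((r - lam) ** 2 for r in ratios) / n   # eq. (9)
        lower = lam - z * math.sqrt(v_kl / n)
        # lam is non-negative because `best` minimizes the sample cross
        # entropy, so the interval includes 0 exactly when lower <= 0.
        if lower <= 0.0:
            ties.append(name)
    return best, ce_hat, ties


# Hypothetical data and fully specified candidates, for illustration only.
rng = random.Random(3)
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
models = {
    "N(0, 1)": lambda x: normal_logpdf(x, 0.0, 1.0),
    "N(1, 1)": lambda x: normal_logpdf(x, 1.0, 1.0),
    "N(0, 4)": lambda x: normal_logpdf(x, 0.0, 2.0),
}

best, ce_hat, ties = select_best_model(models, xs)
print(best, ties)
```

An empty `ties` list corresponds to the case in which none of the intervals includes 0; a non-empty list names the models that the sample supports as well as the selected one.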

4 Concluding Remarks

The objective of this paper was to extract the essence of the extended maximum likelihood principle advocated by Akaike (1973).

One of the distinctive features of the principle is that it is free from the true model assumption and is, therefore, suitable for economic analysis in cases where it is difficult to approximate true phenomena. This paper studied statistical aspects of the extended maximum likelihood principle and, via this principle, formulated a procedure for multimodel inference when the candidate models are fully specified without unknown parameters.

Acknowledgements

The author would like to pay tribute to the late Professor Toshihiro Tanaka, who was Dean of the Faculty of Economics when the author was interviewed at Fukuoka University in autumn 2001 to join the faculty.


References

Akaike, H. (1973), “Information theory and an extension of the maximum likelihood principle”, In Proceedings of the Second International Symposium on Information Theory, 267–281.

Ariga, K. and K. Matsui (2003), “Mismeasurement of the CPI”, In Structural Impediments to Growth in Japan, eds. M. Blomström, J. Corbett, F. Hayashi, and A. Kashyap, National Bureau of Economic Research, 89–128.

Burnham, K. P. and D. R. Anderson (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed., Springer.

Chen, S. and M. Ravallion (2008), “The developing world is poorer than we thought, but no less successful in the fight against poverty”, Policy Research Working Paper, WPS4703, The World Bank, Development Research Group.

Kagihara, M. (2009), “Semilogarithmic error based estimation and its related distributions”, In JSM 2009 Proceedings, American Statistical Association, 3214–3227.

Kagihara, M. (forthcoming), “On potential applications of the method of least rectangles to biased data”, The Economic Review, 183, No. 2, Kyoto University Economic Society (In Japanese).

Konishi, S. and G. Kitagawa (2004), Information Criteria, Asakura-Shoten (In Japanese).

Konishi, S. and G. Kitagawa (2008), Information Criteria and Statistical Modeling, Springer.

Mori, T., K. Nishikimi, and T. E. Smith (2005), “A divergence statistic for industrial localization”, Review of Economics and Statistics, 87, 635–651.

Shibata, R. (1976), “Selection of the order of an autoregressive model by Akaike's information criterion”, Biometrika, 63, 117–126.

Shimodaira, H. (1998), “An application of multiple comparison techniques to model selection”, Annals of the Institute of Statistical Mathematics, 50, 1–13.

Shiratsuka, S. (2005), “Measurement error of Japanese CPI: the state of so-called upper bias”, Bank of Japan Review Series, 2005-J-14, Bank of Japan (In Japanese).

Soofi, E. S. (1994), “Capturing the intangible concept of information”, Journal of the American Statistical Association, 89, 1243–1254.

Stiglitz, J., A. Sen, and J. Fitoussi eds. (2009), Report by the Commission on the Measurement of Economic Performance and Social Progress, Commission on the Measurement of Economic Performance and Social Progress.

Vuong, Q. H. (1989), “Likelihood ratio tests for model selection and non-nested hypotheses”, Econometrica, 57, 307–333.

Appendix

A The Effect of Rounding Errors on Sample Variance

In practical computation, we cannot always obtain the exact value of a sample mean $\bar{X}$ but only its approximate value $\hat{X} = \bar{X} + o(X)$, where $o(X)$ stands for the rounding error. Hence, we get $\sum_{i=1}^{N} (X_i - \hat{X})^2/N \neq \sum_{i=1}^{N} X_i^2/N - \hat{X}^2$ in spite of $\sum_{i=1}^{N} (X_i - \bar{X})^2/N = \sum_{i=1}^{N} X_i^2/N - \bar{X}^2$. Note the following:

$$
\frac{1}{N} \sum_{i=1}^{N} (X_i - \hat{X})^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})^2 + o(X)^2, \tag{12}
$$

$$
\frac{1}{N} \sum_{i=1}^{N} X_i^2 - \hat{X}^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})^2 - 2 \bar{X} \cdot o(X) - o(X)^2. \tag{13}
$$

It is concluded from the above expressions that the first expression (12) is more accurate than the second one (13). This is because the approximation error in the first expression is the square of the rounding error, $o(X)^2$, while the error in the second expression includes the rounding error $o(X)$ multiplied by the sample mean $\bar{X}$. Therefore, when the true value of the sample mean $\bar{X}$ is sufficiently large compared with its rounding error $o(X)$, the second expression may cause severe biases in computing the sample variance. The difference between them is

$$
\left[\frac{1}{N} \sum_{i=1}^{N} (X_i - \hat{X})^2\right] - \left[\frac{1}{N} \sum_{i=1}^{N} X_i^2 - \hat{X}^2\right] = 2 \hat{X} \cdot o(X).
$$
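The error terms of the two variance expressions can be checked with exact rational arithmetic, so that the only "rounding error" present is the one introduced on purpose. The sketch below is an illustration under stated assumptions: the four data values and the size of the perturbation `err` (playing the role of $o(X)$) are hypothetical. Direct expansion of $\hat{X}^2 = (\bar{X} + o(X))^2$ gives an error of $o(X)^2$ for the two-pass form and $-2\bar{X} \cdot o(X) - o(X)^2$ for the one-pass form, and the code confirms both.

```python
from fractions import Fraction

# Hypothetical data with a large mean relative to its spread.
xs = [Fraction(v) for v in (1001, 1002, 1004, 1007)]
n = len(xs)

mean = sum(xs) / n             # exact sample mean, X-bar
err = Fraction(1, 1000)        # hypothetical rounding error o(X)
mean_hat = mean + err          # perturbed mean, X-hat = X-bar + o(X)

true_var = sum((x - mean) ** 2 for x in xs) / n

# Two-pass form: average squared deviation from the perturbed mean.
two_pass = sum((x - mean_hat) ** 2 for x in xs) / n
# One-pass form: mean of squares minus the square of the perturbed mean.
one_pass = sum(x * x for x in xs) / n - mean_hat ** 2

err_two_pass = two_pass - true_var   # equals o(X)^2
err_one_pass = one_pass - true_var   # equals -2 * X-bar * o(X) - o(X)^2

print(float(err_two_pass), float(err_one_pass))
print(two_pass - one_pass == 2 * mean_hat * err)
```

With a mean of about one thousand and a perturbation of one thousandth, the one-pass error is larger in magnitude than the two-pass error by roughly six orders of magnitude, which is the effect the appendix describes.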
