Model comparison for locally asymptotically quadratic models with dependent observations


Kyushu University Institutional Repository


Eguchi, Shoichi

https://doi.org/10.15017/1931722

Publication information: Kyushu University, 2017, Doctoral degree (Mathematics), course doctorate. Version:

Rights:

Model comparison for LAQ models with dependent observations

Doctoral dissertation February 14, 2018

Shoichi Eguchi

Graduate School of Mathematics, Kyushu University.

744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan.


Abstract

This doctoral dissertation is based on three original papers [21], [22], and [23]. Throughout this thesis, we consider Bayesian model comparison and parametric estimation of ergodic diffusion processes. Numerical examples and real-data examples are given in order to check whether the proposed method performs well.

For the purpose of model comparison, we study the asymptotic behavior of the marginal quasi-log likelihood associated with a family of locally asymptotically quadratic (LAQ) statistical experiments. Our result entails a far-reaching extension of the applicable scope of the classical approximate Bayesian model comparison due to Schwarz, with a frequentist-view theoretical foundation. In particular, the proposed statistics can treat possibly misspecified generalized linear models with dependent observations. Furthermore, the proposed statistics can deal with both ergodic and non-ergodic stochastic process models, where the corresponding M-estimator may be of multi-scaling type and the asymptotic quasi-information matrix may be random. We deduce the consistency of the multistage optimal-model selection, where we select an optimal sub-model structure step by step, so that the computational cost can be much reduced.

Further, we study parametric estimation of ergodic diffusions observed at high frequency. Unlike previous studies, we suppose that the model-time scale (sampling stepsize) is unknown, thereby making the conventional Gaussian quasi-likelihood not directly applicable. In this situation, we construct estimators of both the model parameters and the model-time scale in a fully explicit way and prove that they are jointly asymptotically normally distributed. The $L^q$-boundedness of the obtained estimator is also derived. Moreover, we propose BIC type statistics for model selection and show their model-selection consistency.


Acknowledgements

First, I would like to express my deepest gratitude to Professor Hiroki Masuda for his valuable guidance, suggestions and encouragement. I am grateful to Professor Ryuei Nishii, Professor Yoshihiko Maesono, Associate Professor Yoshiyuki Ninomiya and Associate Professor Kei Hirose for their valuable comments. I also thank my friends for their kind support during my studies. Last but not least, my gratitude goes to my family for their heartfelt support in my daily life.

Shoichi Eguchi February, 2018


Chapter 1 Introduction

The objective of this thesis is Bayesian model comparison for a general class of statistical models, which includes various kinds of stochastic process models that cannot be handled by preceding results. There are two classical principles of model selection: the Kullback–Leibler divergence (KL divergence) principle and the Bayesian one, embodied in the Akaike information criterion (AIC, [1, 2]) and the Schwarz or Bayesian information criterion (BIC, [57]), respectively. It is commonly understood that neither AIC nor BIC type statistics is universally preferable; they are indeed used for different purposes. On the one hand, the AIC is a predictive model selection criterion minimizing the KL divergence between the prediction and true models, and is not intended to pick the true model consistently even if it does exist in the candidate-model set. On the other hand, the BIC is used to look for a better model description, attaching importance to avoiding not only underfitting but also overfitting. The BIC usually takes the form

\[ \mathrm{BIC}_n = -2\ell_n(\hat\theta_n^{\mathrm{MLE}}) + p\log n, \]

where $\ell_n$, $\hat\theta_n^{\mathrm{MLE}}$, and $p$ denote the log-likelihood function, the maximum-likelihood estimator (MLE), and the dimension of the parameter space of the statistical model to be assessed, respectively. The model selection consistency via BIC type statistics has been studied by many authors in several different model setups, for example, [8, 11], and [55], to mention just a few old ones. An extension of the BIC-derivation logic to subsume smoothly regularized likelihood estimation can be found in [41].

There also exist many studies of the BIC methodology in the time series context. The underlying principles, such as maximization of the posterior model selection probability, remain the same in this case. It should be mentioned that [12] demonstrated that the derivation of the classical BIC could be generalized to a general $\sqrt{n}$-consistent framework with constant asymptotic information. Their argument supposes the almost-sure behaviors of the likelihood characteristics, especially of the observed information matrix. Our stance is similar to theirs, but more general so as to subsume a much broader spectrum of models that cannot be handled by [12]. We note that much less is known about theoretically guaranteed information criteria concerning sampled data from stochastic process models; to mention some of them, we refer to [59, 60, 61, 62] and [66].

Our primary interest is to extend the range of application of Schwarz's BIC to a large degree in a unified way, so as to be able to target a wide class of dependent data models, especially including the locally asymptotically mixed-normal family of statistical experiments. The Bayesian principle of model selection amounts to choosing the model that is most likely in terms of the posterior model selection probability, which is typically measured by approximating the (expected) marginal quasi-log likelihood.

Unfortunately, a mathematically rigorous derivation of BIC type statistics is sometimes missing in the literature, especially when the underlying model is non-ergodic. In this thesis, we will focus on locally asymptotically quadratic (LAQ) statistical models. We will introduce the quasi-BIC (QBIC) through the stochastic expansion of the marginal quasi-likelihood. Here, we use the terminology “quasi” to mean that the model may be misspecified in the sense that possibly none of the candidate models includes the true one; see [48] for information criteria for a class of generalized linear models for independent data. Our proof of the expansion essentially utilizes the polynomial type large deviation inequality of [71]; quite importantly, the asymptotic information matrix then may be random (i.e., the suitably scaled observed information (a random bilinear form) has a random limit in probability), enabling us to deal with non-ergodic models in a unified way. We note two things, though we do not go into any detail in this thesis: the popular cointegration models (see [7] and the references therein) would be in the scope of the QBIC as well; and the QBIC may be closely related to the correct BIC in the context of non-stationary time series models [40], where the observed information matrix is involved in the bias-correction term. Further, it is worth mentioning that the QBIC may be used even for semiparametric models, which possibly involve an infinite-dimensional nuisance element, whenever a suitable quasi-likelihood is available.

There are many other works on model selection, including the risk information criterion [27], the generalized information criterion [42], the “parametricness” index [47], and many extensions of AIC and BIC including [15, 48]. We refer to [10, 16], and [43] for comprehensive accounts of information criteria, and also to [20] for an illustration from a practical point of view.

This thesis is organized as follows. In Chapter 2, we describe some related background and present the asymptotic expansions of the marginal quasi-log likelihood (equivalently, the Bayes factor or the Kullback–Leibler divergence). We also discuss the model selection consistency with respect to the optimal model, which is naturally defined to be a minimal model among those minimizing the quasi-entropy quantities. When in particular the quasi-maximum likelihood estimator is of multi-scaling type, we prove the consistency of the multistage optimal model selection procedure, where we partially select an optimal model structure step by step, resulting in a reduced computational cost. Chapter 3 presents the asymptotic properties of the quasi-maximum likelihood estimator and the BIC type statistics in possibly misspecified generalized linear models for dependent data. In Chapter 4, we illustrate the proposed model selection method by the Gaussian quasi-likelihoods, with a focus on estimation of an ergodic diffusion process and volatility-parameter estimation for a class of continuous semimartingales, both based on high-frequency sampling; to the best of our knowledge, this is the first place that mathematically validates Schwarz's methodology of model comparison for high-frequency data from a stochastic process. In Chapter 5, we propose the modified logarithmic Gaussian quasi-likelihood and parameter estimation method and present asymptotic properties of the estimators of the model parameter and the model-time scale. We also give sufficient conditions for the polynomial type large deviation inequality under the modified logarithmic Gaussian quasi-likelihood. Furthermore, we derive the BIC type statistics in the case where the model-time scale is unknown and discuss the model selection consistency with respect to the true model. Chapter 6 shows the specification of the model selection function IC in the R package yuima.

We introduce some basic notation used throughout this thesis. Convergence in probability and convergence in distribution are denoted by $\xrightarrow{P}$ and $\xrightarrow{\mathcal{L}}$, respectively. We denote by $|A|$ the Frobenius norm of a tensor $A$. If in particular $A$ is a square matrix, then $|A|$ is also used for the determinant; no confusion will arise from this double use. The notation $'$ means the transpose, and the symbol $\partial_a^k$ stands for $k$-times partial differentiation with respect to the variable $a$. We denote by $C$ a universal positive constant, which may change at each appearance, and write $A_n \lesssim B_n$ if $A_n \le C B_n$ a.s. for every $n$ large enough.


Chapter 2 Quasi-Bayesian information criterion

2.1 Setup

We begin by describing our basic model setup for this thesis. Denote by $X_n$ an observation random variable defined on an underlying probability space $(\Omega, \mathcal{F}, P)$, and by $G_n(dx) = g_n(x)\mu_n(dx)$ the true distribution $\mathcal{L}(X_n)$, where $\mu_n$ is a $\sigma$-finite dominating measure on a Borel state space of $X_n$, that is, $G_n(dx) = P \circ X_n^{-1}(dx)$.

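Suppose that we are given a set of $M$ candidate models $\mathcal{M}_1, \dots, \mathcal{M}_M$:

\[ \mathcal{M}_m = \left\{ \left( p_m, \pi_{m,n}(\theta_m), H_{m,n}(\cdot|\theta_m) \right) \,\middle|\, \theta_m \in \Theta_m \right\}, \qquad m = 1, \dots, M, \]

where the ingredients in each $\mathcal{M}_m$ are given as follows.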

• $p_m > 0$ denotes the relative likeliness of the occurrence of model $\mathcal{M}_m$ among $\mathcal{M}_1, \dots, \mathcal{M}_M$; we have $\sum_{m=1}^{M} p_m = 1$.

• $\pi_{m,n} : \Theta_m \to (0, \infty)$ is the prior distribution $\mathcal{L}(\theta_m)$ of the $m$th-model parameter $\theta_m$, here defined to be a probability density function, possibly depending on the sample size $n$, with respect to the Lebesgue measure on a bounded convex domain $\Theta_m \subset \mathbb{R}^{p_m}$.

• The measurable function $x \mapsto H_{m,n}(x|\theta_m)$ for each $\theta_m \in \Theta_m$ defines a logarithmic regular conditional probability density of $\mathcal{L}(X_n|\theta_m)$ with respect to $\mu_n(dx)$.

Each $\mathcal{M}_m$ may be misspecified in the sense that the true data generating model $g_n(x)$ does not belong to the family $\{\exp\{H_{m,n}(\cdot|\theta_m)\} \,|\, \theta_m \in \Theta_m\}$; we will, however, assume suitable regularity conditions for the associated statistical random fields.

Concerning the model $\mathcal{M}_m$, the random function $\theta_m \mapsto \exp\{H_{m,n}(X_n|\theta_m)\}$, assumed to be a.s. well-defined, is referred to as the quasi-likelihood of $\mathcal{L}(X_n|\theta_m)$. The quasi-maximum likelihood estimator (QMLE) $\hat\theta_{m,n}$ associated with $H_{m,n}$ is defined to be any maximizer of $H_{m,n}$:

\[ \hat\theta_{m,n} \in \operatorname*{argmax}_{\theta \in \bar\Theta_m} H_{m,n}(X_n|\theta). \]

We will assume the a.s. continuity of $H_{m,n}$ over the compact set $\bar\Theta_m$, so that $\hat\theta_{m,n}$ always exists.

Our objective includes estimators of multi-scaling type, meaning that the components of $\hat\theta_{m,n}$ converge at different rates, which can often occur when considering high-frequency asymptotics. A typical example is the Gaussian quasi-likelihood estimation of an ergodic diffusion process: see [39], also Section 4.2. Let $K_m \in \mathbb{N}$ be a given number, which represents the number of components having different convergence rates in $\mathcal{M}_m$, and assume that the $m$th-model parameter vector is divided into $K_m$ parts:

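\[ \theta_m = (\theta_{m,1}, \dots, \theta_{m,K_m}) \in \prod_{k=1}^{K_m} \Theta_{m,k} = \Theta_m, \]

with each $\Theta_{m,k}$ being a bounded convex domain in $\mathbb{R}^{p_{m,k}}$, $k \in \{1, \dots, K_m\}$, where $p_m = \sum_{k=1}^{K_m} p_{m,k}$. Then, the QMLE in the $m$th model takes the form $\hat\theta_{m,n} = (\hat\theta_{m,1,n}, \dots, \hat\theta_{m,K_m,n})$.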

The optimal value of $\theta_m$ associated with $H_{m,n}$, to be precisely defined later on, is denoted by $\theta_{m,0} = (\theta_{m,1,0}, \dots, \theta_{m,K_m,0})$, $\theta_{m,k,0} \in \Theta_{m,k}$. The rate matrix in the model $\mathcal{M}_m$ is then given in the form

\[ R_{m,n} = R_{m,n}(\theta_{m,0}) = \operatorname{diag}\left( r_{m,1,n}(\theta_{m,0}) I_{p_{m,1}}, \dots, r_{m,K_m,n}(\theta_{m,0}) I_{p_{m,K_m}} \right), \tag{2.1.1} \]

where $I_p$ denotes the $p$-dimensional identity matrix and $r_{m,k,n}(\theta_{m,0})$ are deterministic positive sequences satisfying

\[ r_{m,k,n}(\theta_{m,0}) \to 0, \qquad r_{m,i,n}(\theta_{m,0}) / r_{m,j,n}(\theta_{m,0}) \to 0 \quad (i < j), \qquad n \to \infty. \tag{2.1.2} \]

The diagonality of $R_{m,n}(\theta_{m,0})$ is just for simplicity.

Since we are allowing not only data dependency but also the possibility of model misspecification, we may deal with a wide range of quasi-likelihoods $H_{m,n}$, even including semiparametric situations such as the Gaussian quasi-likelihood; see Section 4.3 for related models.

2.2 Bayesian model selection principle

The quasi-marginal distribution of $X_n$ in the $m$th model $\mathcal{M}_m$ is given by the density

\[ x \mapsto f_{m,n}(x) := \int_{\Theta_m} \exp\{H_{m,n}(x|\theta_m)\}\, \pi_{m,n}(\theta_m)\, d\theta_m. \]

Typical reasoning in the Bayesian principle of model selection among $\mathcal{M}_1, \dots, \mathcal{M}_M$ is to choose the model that is most likely to occur in terms of the posterior probability, namely to choose the model maximizing

\[ \log\left( \frac{f_{m,n}(x)\, p_m}{\sum_{i=1}^{M} f_{i,n}(x)\, p_i} \right) = \log f_{m,n}(x) + \log p_m - \log\left( \sum_{i=1}^{M} f_{i,n}(x)\, p_i \right) \]

over $m = 1, \dots, M$. This is equivalent to finding

\[ \operatorname*{argmax}_{m \le M} \left\{ \log f_{m,n}(x) + \log p_m \right\}. \]

Then, one proceeds with a suitable almost-sure ($\Omega \ni \omega$-wise) asymptotic expansion of the logarithm of the quasi-marginal likelihood $\log f_{m,n}(x)$ for $n \to \infty$ around a suitable estimator, a measurable function of $x = x_n$ for each $n$: when $\sqrt{n}(\hat\theta_{m,n} - \theta_{m,0}) = O_p(1)$, the resulting form is quite often given by

\[ \log f_{m,n}(x) + \log p_m \approx H_{m,n}(x|\hat\theta_{m,n}) - \frac{p_m}{2} \log n + O(1) \quad \text{a.s.} \tag{2.2.1} \]

Here the $p_m$ multiplying $\log n$ is the dimension of $\Theta_m$. This is the usual scenario for deriving classical BIC type statistics; see [12] and [48] as well as [57].

We recall that the expansion (2.2.1) is also used to approximate the Bayes factor. The logarithmic Bayes factor of $\mathcal{M}_i$ against $\mathcal{M}_j$ is defined by the (random) ratio of posterior and prior odds: letting $P(\mathcal{M}_i|X_n)$ denote the posterior probability of the $i$th model, we have

\[ \log \mathrm{BF}_n(i, j) := \log \frac{P(\mathcal{M}_i|X_n) / P(\mathcal{M}_j|X_n)}{p_i / p_j} = \log \frac{f_{i,n}(X_n)}{f_{j,n}(X_n)}. \tag{2.2.2} \]

The Bayes factor measures the change in model selection odds between $\mathcal{M}_i$ and $\mathcal{M}_j$ when observing $X_n$. We prefer $\mathcal{M}_i$ to $\mathcal{M}_j$ if $\log \mathrm{BF}_n(i, j) > 0$, and vice versa.

A model selected via the Bayes factor minimizes the total error rate compounding false-positive and false-negative probabilities, while, unlike the AIC, it has no theoretical implication for the predictive performance of the selected model. For a more detailed account of the philosophy of the Bayes factor, we refer to [45].
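As a small numerical illustration of the selection rule and of (2.2.2), the following Python sketch turns given values of $\log f_{m,n}(x)$ and prior weights $p_m$ into log posterior model probabilities and a log Bayes factor. The function names and the numerical inputs are hypothetical, chosen only for illustration; in practice the log marginal quasi-likelihoods would have to be computed or approximated for the data at hand.

```python
import math

def log_posterior_model_probs(log_marginals, prior_probs):
    """Log posterior model probabilities from log f_m(x) and prior weights p_m.

    Uses the log-sum-exp trick so that very negative log marginals
    do not underflow when exponentiated.
    """
    scores = [lm + math.log(p) for lm, p in zip(log_marginals, prior_probs)]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]

def log_bayes_factor(log_marginals, i, j):
    """log BF(i, j) = log f_i(x) - log f_j(x), as in (2.2.2)."""
    return log_marginals[i] - log_marginals[j]

# Hypothetical log marginal quasi-likelihoods for three candidate models.
log_f = [-1520.3, -1498.7, -1501.2]
prior = [1 / 3, 1 / 3, 1 / 3]

log_post = log_posterior_model_probs(log_f, prior)
# Maximizing log f_m + log p_m is equivalent to maximizing the posterior probability.
best = max(range(len(log_f)), key=lambda m: log_f[m] + math.log(prior[m]))
```

The normalizing term $\log(\sum_i f_{i,n}(x) p_i)$ is common to all models, which is why it can be dropped when only the maximizing index is needed.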

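As was explained in [48], we have yet another interpretation based on the Kullback–Leibler (KL) divergence between the true distribution $g_n$ and the $m$th quasi-marginal distribution $f_{m,n}$:

\[ \begin{aligned} \mathrm{KL}(f_{m,n}; g_n) &:= -\int \log\left( \frac{f_{m,n}(x)}{g_n(x)} \right) g_n(x)\, \mu_n(dx) \\ &= \int \{\log g_n(x)\}\, g_n(x)\, \mu_n(dx) - \int \{\log f_{m,n}(x)\}\, g_n(x)\, \mu_n(dx); \end{aligned} \tag{2.2.3} \]

recall that in the classical AIC methodology we instead look at $\mathrm{KL}\{f_{m,n}(\cdot; \hat\theta_{m,n}); g_n\}$, where $\hat\theta_{m,n} = \hat\theta_{m,n}(\tilde X_n)$ denotes the MLE in the $m$th correctly specified model, constructed from an i.i.d. copy $\tilde X_n$ of $X_n$. Based on (2.2.3), we choose a relatively optimal one among $\mathcal{M}_1, \dots, \mathcal{M}_M$, the model index of which equals

\[ \operatorname*{argmin}_{m \le M} \mathrm{KL}(f_{m,n}; g_n) = \operatorname*{argmax}_{m \le M} \int \{\log f_{m,n}(x)\}\, g_n(x)\, \mu_n(dx), \]

since the first term on the right-hand side of (2.2.3) does not depend on $m$.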

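Comparison of $f_{i,n}$ and $f_{j,n}$ is equivalent to looking at the sign of

\[ \begin{aligned} \mathrm{KL}(f_{j,n}; g_n) - \mathrm{KL}(f_{i,n}; g_n) &= \int \log\left( \frac{f_{i,n}(x)}{f_{j,n}(x)} \right) g_n(x)\, \mu_n(dx) \\ &= E\left\{ \log\left( \frac{f_{i,n}(X_n)}{f_{j,n}(X_n)} \right) \right\}. \end{aligned} \tag{2.2.4} \]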

As was noted in [48], it is important to notice that this reasoning remains valid even when none of the candidate models coincides with the true model. We also refer to [31] for another Bayesian variable selection device based on the KL projection.

2.3 Quasi-Bayesian information criterion (QBIC)

We here focus on a single model $\mathcal{M}_m$ and consider the asymptotic expansion related to $H_{m,n}$. From now on, we omit the model index “$m$” from the notation, simply denoting the prior density and the quasi-log likelihood by $\pi_n(\theta)$ and $H_n(\theta) = H_n(X_n|\theta)$, respectively. The parameter $\theta \in \Theta \subset \mathbb{R}^p$ is graded into $K$ parts, say $\theta = (\theta_1, \dots, \theta_K)$ with $\theta_k \in \mathbb{R}^{p_k}$, where $p = \sum_{k=1}^{K} p_k$. Let $\theta_0 \in \Theta$ be a constant, which will serve as the optimal parameter defined in Section 2.4. We are thinking of situations where the contrast function $H_n$ provides an $M$-estimator $\hat\theta_n$ such that $R_n(\theta_0)^{-1}(\hat\theta_n - \theta_0)$ converges in distribution to a non-trivial limit. The rate matrix $R_n(\theta_0)$ is of the form (2.1.1) satisfying (2.1.2):

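\[ R_n = R_n(\theta_0) = \operatorname{diag}\left( r_{1,n}(\theta_0) I_{p_1}, \dots, r_{K,n}(\theta_0) I_{p_K} \right), \]

where $r_{k,n}(\theta_0)$ are positive decreasing sequences such that $r_{k,n}^{-1}(\theta_0) / r_{l,n}^{-1}(\theta_0) \to 0$ for $k > l$; we will assume that $R_n(\hat\theta_n) - R_n(\theta_0) \xrightarrow{P} 0$ (see Theorem 2.3.8(ii)), so that $\log|R_n(\hat\theta_n)| = \sum_{k=1}^{K} p_k \log r_{k,n}(\hat\theta_n)$. The statistical random field associated with $H_n$ is given by

\[ Z_n(u) = Z_n(u; \theta_0) := \exp\left\{ H_n(\theta_0 + R_n u) - H_n(\theta_0) \right\}, \tag{2.3.1} \]

which is defined on the admissible domain $U_n = U_n(\theta_0) = \{ u \in \mathbb{R}^p ;\ \theta_0 + R_n u \in \Theta \}$.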

The objective here is to deduce the asymptotic behavior of the marginal quasi-log likelihood function

\[ \log\left( \int_\Theta \exp\{H_n(\theta)\}\, \pi_n(\theta)\, d\theta \right), \]

and then derive an extension of the classical BIC.

2.3.1 Stochastic expansion

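We begin with the stochastic expansion of the marginal quasi-log likelihood function. Denote by $\theta^j$ the $j$th element of $\theta$ and by $R_{n,ii}$ the $(i, i)$th element of $R_n$ (i.e., $R_{n,ii} = r_{j,n}(\theta_0)$ for some $j \in \{1, \dots, K\}$). We write $\underline{\theta}_k = (\theta_1, \dots, \theta_k)$ and $\overline{\theta}_k = (\theta_k, \dots, \theta_K)$, with $\underline{\theta}_{k,0}$ and $\overline{\theta}_{k,0}$ defined in a similar manner. Let

\[ Y_{k,n}(\overline{\theta}_k; \underline{\theta}_{k-1}) = r_{k,n}^2(\theta_0) \left( H_n(\underline{\theta}_{k-1}, \theta_k, \overline{\theta}_{k+1}) - H_n(\underline{\theta}_{k-1}, \theta_{k,0}, \overline{\theta}_{k+1}) \right) \]

and let $Y_{k,0}(\theta_k)$ be a random function for each $k \in \{1, \dots, K\}$. By convention, we neglect symbols with index $K+1$ such as $\overline{\theta}_{K+1}$ and ones with index $0$ such as $\underline{\theta}_0$.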

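Assumption 2.3.1. $H_n(\theta)$ is of class $C^3(\Theta)$ and satisfies the following conditions:

(i) $\Delta_n = \Delta_n(\theta_0) := R_n \partial_\theta H_n(\theta_0) = O_p(1)$;

(ii) $\Gamma_n = \Gamma_n(\theta_0) := -R_n \partial_\theta^2 H_n(\theta_0) R_n = \Gamma_0 + o_p(1)$, where $\Gamma_0 = \operatorname{diag}(\Gamma_{1,0}, \dots, \Gamma_{K,0})$ is a.s. positive definite, with each $\Gamma_{k,0} \in \mathbb{R}^{p_k} \otimes \mathbb{R}^{p_k}$ a.s. positive definite for $k = 1, \dots, K$;

(iii) $\sup_\theta \left| R_n \partial_\theta^3 H_n(\theta) R_n \right| = O_p(1)$.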

Assumption 2.3.1 implicitly sets down the optimal value $\theta_0$; of course, as in the usual $M$-estimation theory (e.g. [67], Chapter 5), it is possible to put more specific conditions in terms of the uniform-in-$\theta$ limits of suitably scaled quasi-log likelihood functions, but we omit them. The quadratic form $\Gamma_0$ is the asymptotic quasi-Fisher information matrix, which may be random. A truly random example is the volatility-parameter estimation of a continuous semimartingale (see Section 4.3). In particular, Assumption 2.3.1 leads to the LAQ approximation of $\log Z_n$:

\[ \sup_{u \in A} \left| \log Z_n(u) - \left( \Delta_n[u] - \frac{1}{2} \Gamma_0[u, u] \right) \right| = o_p(1) \tag{2.3.2} \]

for each compact set $A \subset \mathbb{R}^p$.

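Assumption 2.3.2. For every $k = 1, \dots, K-1$, the following conditions are satisfied:

(i) $\sup_{\overline{\theta}_{k+1}} \left| r_{k,n}(\theta_0)\, \partial_{\theta_k} H_n(\underline{\theta}_{k,0}, \overline{\theta}_{k+1}) \right| = O_p(1)$;

(ii) $\sup_{\overline{\theta}_{k+1}} \left| -r_{k,n}^2(\theta_0)\, \partial_{\theta_k}^2 H_n(\underline{\theta}_{k,0}, \overline{\theta}_{k+1}) - \Gamma_{k,0} \right| = O_p(1)$.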

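Assumption 2.3.3. There exists an a.s. positive definite random matrix $\Sigma_0 \in \mathbb{R}^p \otimes \mathbb{R}^p$ such that

\[ (\Delta_n, \Gamma_n) \xrightarrow{\mathcal{L}} \left( \Sigma_0^{-1/2} \eta, \Gamma_0 \right), \]

where $\eta \sim N_p(0, I_p)$ is a random variable defined on an extension of the original probability space.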

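Assumption 2.3.4. There exists an a.s. positive random variable $\chi_0$ such that for each $\kappa > 0$,

\[ \sup_{\theta_1;\, |\theta_1 - \theta_{1,0}| \ge \kappa} Y_{1,0}(\theta_1) \vee \sup_{\theta_2;\, |\theta_2 - \theta_{2,0}| \ge \kappa} Y_{2,0}(\theta_2) \vee \cdots \vee \sup_{\theta_K;\, |\theta_K - \theta_{K,0}| \ge \kappa} Y_{K,0}(\theta_K) \le -\chi_0 \kappa^2 \quad \text{a.s.} \]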

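Assumption 2.3.5. There exists a constant $q \in (0, 1)$ for which

\[ \begin{aligned} & \left( r_{1,n}(\theta_0) \right)^{-q} \sup_{\theta} \left| Y_{1,n}(\theta) - Y_{1,0}(\theta_1) \right| \vee \left( r_{2,n}(\theta_0) \right)^{-q} \sup_{\overline{\theta}_2} \left| Y_{2,n}(\overline{\theta}_2; \theta_{1,0}) - Y_{2,0}(\theta_2) \right| \\ & \qquad \vee \cdots \vee \left( r_{K,n}(\theta_0) \right)^{-q} \sup_{\theta_K} \left| Y_{K,n}(\theta_K; \underline{\theta}_{K-1,0}) - Y_{K,0}(\theta_K) \right| \xrightarrow{P} 0. \end{aligned} \]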

Let Π(dθ) denote the prior distribution over Θ.

Assumption 2.3.6. The distribution $\Pi$ admits a bounded Lebesgue density $p(\theta)$ which is continuous and positive at $\theta_0$.

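Assumption 2.3.4 ensures that $\theta_{k,0}$ is the unique maximizer of $Y_{k,0}(\theta_k)$ for all $k$, that is,

\[ \{\theta_{k,0}\} = \operatorname*{argmax}_{\theta_k} Y_{k,0}(\theta_k), \qquad k = 1, \dots, K. \]

Moreover, Assumption 2.3.5 implies that for every $k = 1, \dots, K$, $Y_{k,n}(\theta_k, \hat\theta_{k+1,n}, \dots, \hat\theta_{K,n}; \underline{\theta}_{k-1,0}) \xrightarrow{P} Y_{k,0}(\theta_k)$ uniformly in $\theta_k$.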

Theorem 2.3.7. Under Assumptions 2.3.4 and 2.3.5, we have $\hat\theta_n \xrightarrow{P} \theta_0$.

Under Assumptions 2.3.4 and 2.3.5, we can apply the argmax theorem (see for example [67]) to prove the consistency of $\hat\theta_{k,n}$, since $\hat\theta_{k,n} \in \operatorname*{argmax}_{\theta_k} Y_{k,n}(\theta_k, \hat\theta_{k+1,n}, \dots, \hat\theta_{K,n}; \hat\theta_{1,n}, \dots, \hat\theta_{k-1,n})$. Hence, Theorem 2.3.7 is established. The next theorem shows the asymptotic expansion of the marginal quasi-log likelihood function.

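Theorem 2.3.8. Suppose that Assumptions 2.3.1 to 2.3.6 are satisfied.

(i) We have the asymptotic expansion

\[ \begin{aligned} \log\left( \int_\Theta \exp\{H_n(\theta)\}\, \pi_n(\theta)\, d\theta \right) &= H_n(\theta_0) + \sum_{k=1}^{K} p_k \log r_{k,n}(\theta_0) + \frac{p}{2} \log 2\pi + \log \pi_n(\theta_0) \\ &\quad - \frac{1}{2} \log|\Gamma_0| + \frac{1}{2} \Gamma_0^{-1}\left[ \Delta_n^{\otimes 2} \right] + o_p(1). \end{aligned} \]

(ii) If further $\log r_{k,n}(\hat\theta_n) = \log r_{k,n}(\theta_0) + o_p(1)$ for all $k = 1, \dots, K$, then

\[ \begin{aligned} \log\left( \int_\Theta \exp\{H_n(\theta)\}\, \pi_n(\theta)\, d\theta \right) &= H_n(\hat\theta_n) + \sum_{k=1}^{K} p_k \log r_{k,n}(\hat\theta_n) + \frac{p}{2} \log 2\pi + \log \pi_n(\hat\theta_n) \\ &\quad - \frac{1}{2} \log\left| -R_n(\hat\theta_n)\, \partial_\theta^2 H_n(\hat\theta_n)\, R_n(\hat\theta_n) \right| + o_p(1). \end{aligned} \]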
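Heuristically, the terms in part (i) can be read off from a Laplace-type computation. The following is only a sketch: it ignores the tail control and the uniformity that the assumptions are there to provide, and simply combines the change of variables $\theta = \theta_0 + R_n u$ (so that $d\theta = |R_n|\, du$) with the LAQ approximation (2.3.2), the continuity of the prior density at $\theta_0$, and the standard Gaussian integral.

```latex
\begin{align*}
\int_\Theta \exp\{H_n(\theta)\}\,\pi_n(\theta)\,d\theta
  &= \exp\{H_n(\theta_0)\}\,|R_n| \int_{U_n} Z_n(u)\,\pi_n(\theta_0 + R_n u)\,du \\
  &\approx \exp\{H_n(\theta_0)\}\,|R_n|\,\pi_n(\theta_0)
     \int_{\mathbb{R}^p} \exp\Big( \Delta_n[u] - \tfrac{1}{2}\Gamma_0[u,u] \Big)\,du \\
  &= \exp\{H_n(\theta_0)\}\,|R_n|\,\pi_n(\theta_0)\,
     (2\pi)^{p/2}\,|\Gamma_0|^{-1/2}
     \exp\Big( \tfrac{1}{2}\Gamma_0^{-1}\big[\Delta_n^{\otimes 2}\big] \Big).
\end{align*}
```

Taking logarithms and using $\log|R_n| = \sum_{k=1}^{K} p_k \log r_{k,n}(\theta_0)$ gives the right-hand side of (i) term by term.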

Remark 2.3.9. Assume that Assumptions 2.3.1 and 2.3.6 hold and that $\log r_{k,n}(\hat\theta_n) = \log r_{k,n}(\theta_0) + o_p(1)$ for all $k = 1, \dots, K$. Then the conclusion of Theorem 2.3.8 holds if the following condition is satisfied: for any $\epsilon > 0$ there exist $M > 0$ and $N \in \mathbb{N}$ such that

\[ \sup_{n \ge N} P\left( \int_{U_n \cap \{|u| \ge M\}} Z_n(u)\, du > \epsilon \right) < \epsilon. \tag{2.3.3} \]

See [23] for details.

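In view of Theorem 2.3.8 (ii), we obtain

\[ \begin{aligned} \log\left( \int_\Theta \exp\{H_n(\theta)\}\, \pi_n(\theta)\, d\theta \right) &= H_n(\hat\theta_n) - \frac{1}{2} \log\left| -\partial_\theta^2 H_n(\hat\theta_n) \right| + O_p(1) \\ &= H_n(\hat\theta_n) - \frac{1}{2} \sum_{k=1}^{K} p_k \log r_{k,n}^{-2}(\hat\theta_n) + O_p(1). \end{aligned} \]

Ignoring the $O_p(1)$ parts, we define the quasi-Bayesian information criterion (QBIC) and the Bayesian information criterion (BIC) by

\[ \mathrm{QBIC}_n = -2 H_n(\hat\theta_n) + \log\left| -\partial_\theta^2 H_n(\hat\theta_n) \right|, \tag{2.3.4} \]

\[ \mathrm{BIC}_n = -2 H_n(\hat\theta_n) + \sum_{k=1}^{K} p_k \log r_{k,n}^{-2}(\hat\theta_n), \tag{2.3.5} \]

respectively. Note that in the classical case of single $\sqrt{n}$-scaling, (2.3.5) reduces to the familiar form

\[ \mathrm{BIC}_n = -2 H_n(\hat\theta_n) + p \log n. \]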

The statistic $\mathrm{QBIC}_n$ thus provides us with a far-reaching extension of the derivation machinery of the classical BIC. Although the QBIC (2.3.4) may have a higher computational load than the BIC (2.3.5), it enables us to incorporate a model-complexity bias correction that takes the volume of the observed information into account. In particular, to reflect data information for dependent-data models, (2.3.4) would be more suitable than (2.3.5), whose bias correction is based only on the rate of convergence.
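To make the difference between (2.3.4) and (2.3.5) concrete, here is a minimal Python sketch for the simplest possible case: a correctly specified i.i.d. Gaussian model $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$, where $H_n$ is the exact log-likelihood and there is a single rate $r_n = n^{-1/2}$. This toy model, the simulated data and the function names `bic` and `qbic` are assumptions made for illustration and do not come from the thesis. At the MLE the negative Hessian is $\operatorname{diag}(n/\hat\sigma^2,\ n/(2\hat\sigma^4))$, so its log-determinant is available in closed form.

```python
import math
import random

def gaussian_loglik(xs, mu, s2):
    """Exact log-likelihood of i.i.d. N(mu, s2) observations."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * s2) - sum((x - mu) ** 2 for x in xs) / (2 * s2)

def bic(h_max, dims, rates):
    """(2.3.5): -2 H_n(theta_hat) + sum_k p_k log r_{k,n}^{-2}."""
    return -2.0 * h_max + sum(p * math.log(r ** -2) for p, r in zip(dims, rates))

def qbic(h_max, logdet_neg_hessian):
    """(2.3.4): -2 H_n(theta_hat) + log|-d^2 H_n(theta_hat)|."""
    return -2.0 * h_max + logdet_neg_hessian

random.seed(0)
n = 500
xs = [random.gauss(1.0, 2.0) for _ in range(n)]  # simulated data, illustration only

mu_hat = sum(xs) / n
s2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
h_max = gaussian_loglik(xs, mu_hat, s2_hat)

# Negative Hessian at the MLE is diag(n / s2_hat, n / (2 * s2_hat**2)).
logdet = math.log(n / s2_hat) + math.log(n / (2 * s2_hat ** 2))

bic_value = bic(h_max, dims=[2], rates=[n ** -0.5])
qbic_value = qbic(h_max, logdet)
```

In this toy case the two criteria differ by $-\log(2\hat\sigma^6)$, which stays bounded in probability as $n$ grows; this is the $O_p(1)$ gap between the two penalties. With several scaling rates, `bic` would be called with one entry of `dims` and `rates` per block of the parameter.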

Let $\mathrm{QBIC}^{(1)}_n, \dots, \mathrm{QBIC}^{(M)}_n$ be the QBIC values of the candidate models. We compute $\mathrm{QBIC}^{(1)}_n, \dots, \mathrm{QBIC}^{(M)}_n$ and select the best model $\mathcal{M}_{m_0}$ in the sense of approximate Bayesian model description:

\[ \{m_0\} = \operatorname*{argmin}_{1 \le m \le M} \mathrm{QBIC}^{(m)}_n, \]

the uniqueness being implicitly assumed. The best model can be selected using the BIC in a similar manner.

Remark 2.3.10. Making use of the observed information matrix, as in (2.3.4), for regularization has already been mentioned in the literature; for example, [8], [38], and [58] contain such statistics for some variants of the AIC. Further, it is worth mentioning that using the observed information is the right approach for some non-stationary models (see [40]).

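Remark 2.3.11 (Variants of QBIC). In practice, we may conveniently consider several variants of the QBIC (2.3.4). When $\Gamma_0$ takes the form $\Gamma_0 = \operatorname{diag}(\Gamma_{1,0}, \dots, \Gamma_{K,0})$ with each $\Gamma_{k,0} \in \mathbb{R}^{p_k} \otimes \mathbb{R}^{p_k}$ being a.s. positive definite, we may slightly simplify the form of the QBIC as follows. We can see that under Assumption 2.3.1,

\[ -r_{k,n}(\theta_0)\, r_{l,n}(\theta_0)\, \partial_{\theta_k} \partial_{\theta_l} H_n(\hat\theta_n) = o_p(1), \qquad k \ne l. \]

Since the logarithmic determinant is continuous on the set of positive definite matrices, the asymptotic expansion in Theorem 2.3.8 (ii) becomes

\[ \log\left( \int_\Theta \exp\{H_n(\theta)\}\, \pi_n(\theta)\, d\theta \right) = H_n(\hat\theta_n) - \frac{1}{2} \sum_{k=1}^{K} \log\left| -\partial_{\theta_k}^2 H_n(\hat\theta_n) \right| + O_p(1), \]

resulting in the QBIC of the form

\[ -2 H_n(\hat\theta_n) + \sum_{k=1}^{K} \log\left| -\partial_{\theta_k}^2 H_n(\hat\theta_n) \right|. \tag{2.3.6} \]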

In particular, this is the case if $R_n^{-1}(\hat\theta_n - \theta_0)$ is asymptotically mixed normally distributed, with a block diagonal asymptotic (random) covariance matrix $\Sigma_0 = \operatorname{diag}(\Sigma_{1,0}, \dots, \Sigma_{K,0})$, where each $\Sigma_{k,0} \in \mathbb{R}^{p_k} \otimes \mathbb{R}^{p_k}$ is a.s. positive definite. We will deal with such an example in Section 4.2.

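We may also consider finite-sample manipulations of the QBIC without breaking its asymptotic behavior. For example, the problem caused by $|-\partial_\theta^2 H_n(\hat\theta_n)| \le 0$ can be avoided by using

\[ -2 H_n(\hat\theta_n) + I\left\{ |-\partial_\theta^2 H_n(\hat\theta_n)| > 0 \right\} \log\left| -\partial_\theta^2 H_n(\hat\theta_n) \right| + I\left\{ |-\partial_\theta^2 H_n(\hat\theta_n)| \le 0 \right\} \sum_{k=1}^{K} p_k \log\left( r_{k,n}^{-2}(\hat\theta_n) \right) \]

instead of (2.3.4); obviously, the difference between this quantity and $\mathrm{QBIC}_n$ is $o_p(1)$. Further, we may use any $\hat\Gamma_n$ such that $\hat\Gamma_n \xrightarrow{P} \Gamma_0$:

\[ -2 H_n(\hat\theta_n) - 2 \log\left| R_n(\hat\theta_n) \right| + \log|\hat\Gamma_n|, \]

which would be convenient if $\hat\Gamma_n$ is more likely to be stable than $-R_n(\hat\theta_n)\, \partial_\theta^2 H_n(\hat\theta_n)\, R_n(\hat\theta_n)$; for example, if we know the specific form of $\Gamma_0 = \Gamma_0(\theta)$ beforehand, then it would be numerically more stable to use $\Gamma_0(\hat\theta_n)$ instead of $-R_n(\hat\theta_n)\, \partial_\theta^2 H_n(\hat\theta_n)\, R_n(\hat\theta_n)$.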
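The first safeguard above can be sketched in a few lines of Python: use the log-determinant penalty when the determinant of the negative Hessian is positive, and fall back to the rate-based penalty otherwise. The function names and the small hand-written determinant routine are hypothetical and kept dependency-free for illustration; for large or ill-conditioned Hessians a log-determinant from a numerical linear algebra library would be preferable.

```python
import math

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    size = len(a)
    d = 1.0
    for c in range(size):
        pivot = max(range(c, size), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d  # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, size):
            factor = a[r][c] / a[c][c]
            for k in range(c, size):
                a[r][k] -= factor * a[c][k]
    return d

def safeguarded_qbic(h_max, neg_hessian, dims, rates):
    """QBIC with the rate-based penalty as a fallback.

    h_max       : H_n evaluated at the estimator
    neg_hessian : matrix of -d^2 H_n at the estimator (list of lists)
    dims, rates : p_k and r_{k,n} for each block of the parameter
    """
    d = det(neg_hessian)
    if d > 0.0:
        return -2.0 * h_max + math.log(d)
    return -2.0 * h_max + sum(p * math.log(r ** -2) for p, r in zip(dims, rates))

# Illustrative inputs only.
regular = safeguarded_qbic(-10.0, [[4.0, 0.0], [0.0, 9.0]], dims=[2], rates=[0.1])
fallback = safeguarded_qbic(-10.0, [[1.0, 0.0], [0.0, -1.0]], dims=[2], rates=[0.1])
```

Note that the switch is on the sign of the determinant, as in the displayed formula, and not on positive definiteness of the matrix.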

2.3.2 Convergence of the expected values

From the frequentist point of view, where $X_n$ is regarded as a random element, it is desirable to verify the convergence of the expected marginal quasi-log likelihood, which follows from the asymptotic uniform integrability of the sequence

\[ \left\{ -2 \log\left( \int_\Theta \exp\{H_n(\theta)\}\, \pi_n(\theta)\, d\theta \right) - \mathrm{QBIC}^\sharp_n \right\}_n, \]

where $\mathrm{QBIC}^\sharp_n = -2 H_n(\hat\theta_n) + \log\left| -\partial_\theta^2 H_n(\hat\theta_n) \right| - 2 \log \pi_n(\hat\theta_n) - p \log 2\pi$.

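Assumption 2.3.12. The random function $H_n$ is of class $C^3(\Theta)$ a.s. and for every $r > 0$

\[ \sup_n E\left( |\Delta_n|^r + \sup_\theta |\Gamma_n(\theta)|^r + \sum_{i=1}^{p} \sup_\theta \left| R_n(\theta_0)\, \partial_\theta^3 H_n(\theta)\, R_n(\theta_0) \right|^r \right) < \infty. \]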

Assumption 2.3.13. There exists an a.s. positive definite random matrix $\Gamma_0$ such that $\Gamma_n(\theta_0) \xrightarrow{P} \Gamma_0$, and for some $q > 3p$ we have

\[ \limsup_n E\left( \sup_\theta \lambda_{\min}^{-q}\left( \Gamma_n(\theta) \right) \right) < \infty, \]

where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue of a given matrix.

The moment bounds in Assumption 2.3.13 were studied in [13] and [14] for some time series models, with a view toward prediction. The integrability in Assumption 2.3.13 is related to the key index $\chi_0$ of [65] in the case of volatility estimation of a continuous Itô process.

Under Assumptions 2.3.12 and 2.3.13, we have $\lambda_{\min}^{-q}(\Gamma_n(\theta_0)) \xrightarrow{P} \lambda_{\min}^{-q}(\Gamma_0)$ by the continuous mapping theorem, and also $\lambda_{\min}^{-1}(\Gamma_0) \in L^q(P)$ as well as $\Gamma_0 \in \bigcap_{r > 0} L^r(P)$.

Finally, we impose the boundedness of moments of the normalized estimator $\hat u_n := R_n^{-1}(\hat\theta_n - \theta_0)$.

Assumption 2.3.14. $\sup_n E(|\hat u_n|^r) < \infty$ for some $r > 3$.

We can now state the $L^1(P)$-convergence result.

Theorem 2.3.15. If Assumptions 2.3.2 to 2.3.6, 2.3.12, and 2.3.14 hold, then we have

\[ \lim_{n \to \infty} E\left\{ -2 \log\left( \int_\Theta \exp\{H_n(\theta)\}\, \pi_n(\theta)\, d\theta \right) - \mathrm{QBIC}^\sharp_n \right\} = 0. \]

In particular, $\mathrm{QBIC}^\sharp_n$ is an asymptotically unbiased estimator of the logarithmic quasi-marginal likelihood.

2.4 Model selection consistency

As long as one is concerned only with good prediction performance, model selection consistency itself does not matter in an essential way. Given a set of models, it does matter when attempting to find the one “closest” (in the sense of KL divergence) to the true data-generating model structure itself. For example, estimation of daily integrated volatility in econometrics would be such a case, for econometricians usually build a daily-volatility prediction model through a time series model such as, among others, an ARFIMA model; the underlying continuous-time dynamics and the daily-volatility time series are modeled separately. This section is devoted to studying the validity of model selection consistency in our general setting. In particular, we propose an adaptive (stepwise) model selection strategy when we have more than one scaling rate. We start with the single-scaling case, and then, before moving on to the multi-scaling case, we look at the case of ergodic diffusions since it illustrates the proposed method well.

2.4.1 Single-scaling case

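We first consider cases where

\[ r_n = r_{m,k,n}(\theta_0) \to 0 \]

for each $m \in \{1, \dots, M\}$ and $k \in \{1, \dots, K_m\}$. Suppose that there exists a random function $H_{m,0}$ such that

\[ r_n^2 H_{m,n}(\theta_m) \xrightarrow{P} H_{m,0}(\theta_m) \tag{2.4.1} \]

uniformly in $\theta_m \in \bar\Theta_m$ as $n \to \infty$ ($m = 1, \dots, M$). Moreover, we assume that the optimal parameter $\theta_{m,0} \in \Theta_m$ in the model $\mathcal{M}_m$ is the unique maximizer of $H_{m,0}$:

\[ \{\theta_{m,0}\} = \operatorname*{argmax}_{\theta_m \in \Theta_m} H_{m,0}(\theta_m) \quad \text{a.s.} \]

If $m_0$ satisfies

\[ \{m_0\} = \operatorname*{argmin}_{m \in \mathfrak{M}} \dim(\Theta_m), \]

where $\mathfrak{M} = \operatorname*{argmax}_{1 \le m \le M} H_{m,0}(\theta_{m,0})$, we say that $\mathcal{M}_{m_0}$ is the optimal model. That is, the optimal model is, if it exists, the element of the optimal model set $\mathfrak{M}$ that has the smallest dimension.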

Remark 2.4.1. If we consider the correctly specified model, the optimal parameter and true parameter are equal.

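Let $\Theta_i \subset \mathbb{R}^{p_i}$ and $\Theta_j \subset \mathbb{R}^{p_j}$ be the parameter spaces associated with $\mathcal{M}_i$ and $\mathcal{M}_j$, respectively. We say that $\Theta_i$ is nested in $\Theta_j$ when $p_i < p_j$ and there exist a matrix $F \in \mathbb{R}^{p_j \times p_i}$ with $F'F = I_{p_i \times p_i}$ and a constant $c \in \mathbb{R}^{p_j}$ such that $H_{i,n}(\theta_i) = H_{j,n}(F\theta_i + c)$ for all $\theta_i \in \Theta_i$. That is, when $\Theta_i$ is nested in $\Theta_j$, any model given by a parameter in $\Theta_i$ can also be generated by a parameter in $\Theta_j$, so that $\mathcal{M}_j$ includes $\mathcal{M}_i$. Denote by $\mathrm{QBIC}^{(m)}_n$ the QBIC in $\mathcal{M}_m$.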

Theorem 2.4.2. Assume that (2.4.1) is satisfied and that $\mathcal{M}_{m_0}$ is the optimal model. Let $m \in \{1, \dots, M\} \setminus \{m_0\}$, let Assumptions 2.3.1 to 2.3.6 hold, and suppose that either

(i) $\Theta_{m_0}$ is nested in $\Theta_m$, or

(ii) $H_{m,0}(\theta_m) \ne H_{m_0,0}(\theta_{m_0,0})$ a.s. for any $\theta_m \in \Theta_m$.

Then we have

\[ \lim_{n \to \infty} P\left( \mathrm{QBIC}^{(m_0)}_n - \mathrm{QBIC}^{(m)}_n < 0 \right) = 1, \tag{2.4.2} \]

\[ \lim_{n \to \infty} P\left( \mathrm{BIC}^{(m_0)}_n - \mathrm{BIC}^{(m)}_n < 0 \right) = 1. \tag{2.4.3} \]

This theorem indicates that the probability that the QBIC and the BIC choose the optimal model tends to 1 as $n \to \infty$.
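The nested case (i) can be illustrated by a small simulation. The sketch below is a toy setting assumed for illustration, not an example from the thesis: data are i.i.d. $N(0, 1)$, the smaller model fixes the mean at zero (one free parameter, the variance), and the larger model leaves the mean free (two parameters). Both models contain the truth, so the smaller one plays the role of the optimal model, and the classical BIC with $r_n = n^{-1/2}$ is used.

```python
import math
import random

def max_loglik_fixed_mean(xs):
    """Maximized Gaussian log-likelihood with the mean fixed at 0."""
    n = len(xs)
    s2 = sum(x * x for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * s2) + 1.0)

def max_loglik_free_mean(xs):
    """Maximized Gaussian log-likelihood with free mean and variance."""
    n = len(xs)
    mu = sum(xs) / n
    s2 = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * s2) + 1.0)

def bic(h_max, p, n):
    """Classical single-scaling BIC: -2 H_n(theta_hat) + p log n."""
    return -2.0 * h_max + p * math.log(n)

def selection_frequency(n, replications, seed=0):
    """Fraction of replications in which BIC picks the smaller model."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        small = bic(max_loglik_fixed_mean(xs), 1, n)
        large = bic(max_loglik_free_mean(xs), 2, n)
        if small < large:
            hits += 1
    return hits / replications

freq_small_n = selection_frequency(n=50, replications=200)
freq_large_n = selection_frequency(n=2000, replications=200)
```

The larger model always attains at least as high a maximized log-likelihood, so the smaller model wins only through the $\log n$ penalty; the selection frequency is expected to approach 1 as $n$ increases, in line with (2.4.3).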


2.4.2 Multi-scaling case: Adaptive model comparison

For simplicity of exposition, we consider the two-scaling case, that is, $K = 2$. We propose a multi-step model selection procedure, which seems natural and more effective especially when an adaptive estimation procedure is possible in such a way that we can estimate a first component $\theta_{m_1}$ without knowledge of a second one $\theta_{m_2}$. That is to say, it should be possible to select an optimal “partial” model structure associated with $\theta_{m_1}$, regarding $\theta_{m_2}$ as a nuisance element.

We suppose that the full model is “decomposed” into two parts, consisting of $M_1$ and $M_2$ candidates respectively, resulting in $M_1 \times M_2$ models in total. Write $(\mathcal{M}_{m_1,m_2})_{m_1 \le M_1;\, m_2 \le M_2}$ for the set of all the candidate models. We are given the “full” quasi-log likelihood function $H_{m_1,m_2,n}(\theta_{m_1}, \theta_{m_2})$. Roughly speaking, we proceed as follows.

• First, introducing an auxiliary quasi-log likelihood which is associated only with the first-component parameter $\theta_{m_1}$ and does not involve $\theta_{m_2}$, we obtain an estimate $\hat\theta_{m_1,n}$ of $\theta_{m_1}$. Then we compare the corresponding (Q)BICs to select a first-stage optimal index, say $m^*_{1,n} \in \{1, \dots, M_1\}$; note that this strategy reduces the model-candidate set from $\{H_{m_1,m_2,n}(\theta_{m_1}, \theta_{m_2})\}_{m_1,m_2}$ to $\{H_{m^*_{1,n},m_2,n}(\hat\theta_{m^*_{1,n},n}, \theta_{m_2})\}_{m_2}$.

• Second, based on the “partly optimized” full quasi-log likelihoods $H_{m^*_{1,n},1,n}, \dots, H_{m^*_{1,n},M_2,n}$, we find a second-stage optimal index $m^*_{2,n} \in \{1, \dots, M_2\}$ through the (Q)BIC again.

• Finally, we pick the model $\mathcal{M}_{m^*_{1,n},m^*_{2,n}}$ as our optimal model.

This adaptive procedure reduces the computational cost (the number of comparisons) considerably compared with the joint-(Q)BIC case, that is, from “$O(M_1 \times M_2)$” to “$O(M_1 + M_2)$”; needless to say, the amount of reduction becomes larger for $K \ge 3$.
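The control flow of the three steps above can be sketched as follows. This is only a schematic under stated assumptions: the first-stage criterion values are taken as already computed from the auxiliary quasi-log likelihoods, and the second-stage criterion is supplied as a callable; the function `two_stage_select`, the toy table of criterion values and the call counter are all hypothetical and serve only to show that the second stage is evaluated $M_2$ times instead of $M_1 \times M_2$ times.

```python
from typing import Callable, List, Sequence, Tuple

def two_stage_select(
    stage1_criteria: Sequence[float],
    stage2_criterion: Callable[[int, int], float],
    n_stage2: int,
) -> Tuple[int, int]:
    """Return (m1*, m2*) by minimizing a criterion stage by stage.

    stage1_criteria  : (Q)BIC values of the M1 first-stage candidates
    stage2_criterion : (m1, m2) -> (Q)BIC of the second-stage candidate m2,
                       computed with the first-stage choice m1 plugged in
    n_stage2         : number M2 of second-stage candidates
    """
    m1_star = min(range(len(stage1_criteria)), key=lambda m1: stage1_criteria[m1])
    m2_star = min(range(n_stage2), key=lambda m2: stage2_criterion(m1_star, m2))
    return m1_star, m2_star

# Toy criterion values (0-based indices), for illustration only.
stage1_values = [5.1, 3.4]
stage2_table = [[12.0, 11.5, 13.0], [9.0, 8.2, 8.9]]
calls: List[Tuple[int, int]] = []

def toy_stage2(m1: int, m2: int) -> float:
    calls.append((m1, m2))  # record which second-stage fits were needed
    return stage2_table[m1][m2]

selected = two_stage_select(stage1_values, toy_stage2, n_stage2=3)
```

In this toy run only the second-stage candidates paired with the selected first-stage index are ever evaluated, which is where the reduction from $M_1 \times M_2$ to $M_1 + M_2$ comparisons comes from.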

Remark 2.4.3. It is not essential in the above argument that the final step is based on the original quasi-log likelihood $H_{m_1,m_2,n}$. What is essential for the model selection consistency is that at each stage we have a suitable auxiliary quasi-likelihood function, based on which we can estimate a suitably separated optimal model. We do not pursue this direction here.