Regularized estimation under multiple and mixed-rates asymptotics


Kyushu University Institutional Repository


清水, 優祐

https://doi.org/10.15017/1806828

Publication information: Kyushu University, 2016, Doctor (Mathematics), course doctorate

Version:

Rights: Fulltext available.


Regularized estimation under multiple and mixed-rates asymptotics

Doctoral dissertation February 14, 2017

Yusuke Shimizu

Graduate School of Mathematics, Kyushu University.

744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan.


Abstract

This doctoral dissertation is based on two original papers, [22] and [28]. In $M$-estimation under standard asymptotics, weak convergence combined with a large deviation estimate of the associated statistical random field provides a general tool for deriving not only the asymptotic distribution of the associated $M$-estimator but also the convergence of its moments; the latter plays an important role in theoretical statistics. Here, standard asymptotics refers to the situation where the statistical random field can be analyzed through a single matrix norming, which may be impossible in several situations, including sparsely regularized $M$-estimation.

Throughout this thesis, we consider the uniform tail-probability estimate of a class of scaled $M$-estimators under multiple and mixed-rates asymptotics in the sense of [26], where the associated statistical random fields may be non-differentiable and may fail to be partially locally asymptotically quadratic, so that the conventional approach through the polynomial type large deviation inequality (PLDI) developed by [36] does not work directly. To the best of my knowledge, there are no existing results deriving the PLDI under multiple and mixed-rates asymptotics.

In particular, our results are applied to regularized estimation of an ergodic diffusion process observed at high frequency. The model is described by the Wiener-driven stochastic differential equation

$$dX_t = a(X_t, \alpha)\,dw_t + b(X_t, \beta)\,dt,$$

and, for the qualitatively different parameters $\alpha$ and $\beta$, we can estimate the former more quickly than the latter; hence the estimator we deal with is of multiple-scaling type. In the literature, [8] studied an adaptive-lasso type regularized estimation of the same model, taking the quadratic approximation of the quasi-likelihood function into account, and deduced the oracle property of their estimator. In this study, we derive the asymptotic behavior of a regularized estimator of the diffusion process under a general regularization term, without resorting to convexity at all.

Further, our results enable us to deduce convergence of moments of a wide range of regularized M-estimators, which especially serves as a critical tool when, for example, analyzing the mean squared prediction error and bias correction for using AIC-type information criteria for tuning-parameter selection in sparse estimation (we refer to, for example, [7], [11], [16], [24], [27], [28], [29], [30], as well as [36]).


Acknowledgements

First, I would like to express my deepest gratitude to Professor Hiroki Masuda for his valuable guidance, suggestions and encouragement. I am grateful to Professor Ryuei Nishii, Professor Yoshihiko Maesono, Associate Professor Yoshiyuki Ninomiya and Associate Professor Kei Hirose for their valuable comments. I also thank my friends for their kind support during my studies. Last but not least, my gratitude goes to my family for their heartfelt support in my daily life.

Yusuke Shimizu January, 2017


1 Introduction

Suppose that we observe data $X_n$, the distribution of which is indexed by a finite-dimensional parameter $\theta \in \Theta \subset \mathbb{R}^p$. In order to estimate $\theta$ based on $X_n$, we usually introduce an appropriate (quasi-)likelihood or contrast function $\mathbb{H}_n : \Omega \times \Theta \to \mathbb{R}$, and estimate an optimal parameter value $\theta_0$ by any point $\hat\theta_n \in \operatorname{argmin} \mathbb{H}_n$. To assess the asymptotic performance of $\hat\theta_n$ quantitatively, we look at the statistical random fields

$$\mathbb{M}_n(u;\theta_0) = \mathbb{H}_n(\theta_0 + A_n(\theta_0)u) - \mathbb{H}_n(\theta_0), \tag{1.1}$$

where $A_n(\theta_0)$ denotes the rate matrix such that $|A_n(\theta_0)| \to 0$ as $n \to \infty$ and whose components may decrease at different rates; estimation with multiple rates of convergence has appeared in the literature of, for example, econometrics [4]. Throughout this thesis, we use the notation $|A|^2 = \operatorname{tr}(AA^\top)$ for a matrix $A$, with $\top$ denoting the transpose. As is well known, the weak convergence of $\mathbb{M}_n$ to some $\mathbb{M}_0$ over compact sets, the identifiability condition on $\mathbb{M}_0$, and the tightness of the scaled estimator $\hat u_n := A_n(\theta_0)^{-1}(\hat\theta_n - \theta_0)$ make the "argmin" functional continuous for $\mathbb{M}_n$:

$$\hat u_n \in \operatorname{argmin} \mathbb{M}_n \xrightarrow{\mathcal{L}} \operatorname{argmin} \mathbb{M}_0.$$

See e.g. [33, Section 5]. Further, when we are concerned with moments of $\hat u_n$-dependent statistics, such as the mean squared error, more than weak convergence is required. Then the polynomial type large deviation inequality (PLDI) of [36], which estimates the tail of the law $\mathcal{L}(\hat u_n)$ in such a way that

$$\sup_{r>0}\,\sup_{n>0}\, r^{L} P(|\hat u_n| \ge r) < \infty \tag{1.2}$$

for a given $L > 0$, plays an important role. Let $\hat u_0$ denote a random variable such that $\hat u_0 \in \operatorname{argmin} \mathbb{M}_0$. The moment convergence

$$E[|\hat u_n|^{q}] \to E[|\hat u_0|^{q}], \qquad q > 0, \tag{1.3}$$

holds if there exists a $q' > q$ such that $\sup_{n>0} E[|\hat u_n|^{q'}] < \infty$. Let us assume that the PLDI (1.2) holds for some $L > q'$. Then we obtain

$$\sup_{n>0} E[|\hat u_n|^{q'}] = \sup_{n>0} \int_0^\infty P(|\hat u_n|^{q'} > s)\,ds < \infty.$$
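The finiteness claimed in the last display can be checked directly. The following is a sketch under the assumption that (1.2) holds with $L > q'$; the constant name $C_L$ is introduced only for this illustration and does not appear in the original text.

```latex
\begin{align*}
C_L &:= \sup_{r>0}\sup_{n>0} r^{L}\,P(|\hat u_n|\ge r) < \infty
  \quad\text{(assumed; this is (1.2))},\\
E\bigl[|\hat u_n|^{q'}\bigr]
  &= \int_0^\infty P\bigl(|\hat u_n|^{q'} > s\bigr)\,ds
   \le 1 + \int_1^\infty P\bigl(|\hat u_n| \ge s^{1/q'}\bigr)\,ds\\
  &\le 1 + C_L \int_1^\infty s^{-L/q'}\,ds
   = 1 + \frac{C_L\,q'}{L-q'} .
\end{align*}
```

The bound does not depend on $n$, and the last integral converges precisely because $L/q' > 1$, which is where the requirement $L > q'$ enters.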

It is known that the PLDI can be proved under modest conditions when $\mathbb{M}_n$ admits a locally asymptotically quadratic (LAQ) structure, which is satisfied in many situations, including asymptotically mixed-normal type models under multi-scaling. Here, in the multi-scaling case where the random vector $\hat\theta_n$ converges at different rates, the LAQ structure at the "first" step takes the form¹

$$\mathbb{M}_n(u,\tau;\theta_0) = \Delta_n(\tau;\theta_0)[u] + \frac{1}{2}\Gamma_0(\tau;\theta_0)[u,u] + r_n(u,\tau;\theta_0), \tag{1.4}$$

where we are required to verify, among others, the following conditions, which are to hold uniformly in the "second- and subsequent-step" parameter $\tau$, regarded as a nuisance parameter in the first step: sufficient integrability of the random linear form $\Delta_n(\tau;\theta_0)$; the non-degeneracy of the possibly random bilinear form $\Gamma_0(\tau;\theta_0)$; and a kind of "non-explosiveness" of the scaled remainder term $(1+|u|^2)^{-1} r_n(u,\tau;\theta_0)$, where the $u$-pointwise limit of $r_n(u,\tau;\theta_0)$, whenever it exists, typically equals zero. For notational convenience, here and in the sequel we write

$$A[b_1,\dots,b_m] = \sum_{i_1,\dots,i_m} A_{i_1\dots i_m}\, b_{1 i_1}\cdots b_{m i_m}$$

for multilinear forms $A = \{A_{i_1\dots i_m}\}_{i_1,\dots,i_m}$ and $b_j = \{b_{j i_k}\}_{i_k}$; sometimes the $b_j$ themselves may be tensors, in which case the resulting form is also a multilinear form, e.g. $A[B,C] = \{\sum_{i,j} A_{ij}B_{ik}C_{jl}\}_{k,l}$ for $A = \{A_{ij}\}$, $B = \{B_{ik}\}$, and $C = \{C_{jl}\}$. See [36, Section 5] for a detailed account of the above-mentioned multistep procedures. In many standard statistical models, the form (1.1) is enough to find the asymptotic distribution of all the components of $\hat\theta_n$; this case may be called the standard asymptotics, to be briefly discussed in Section 3.2.
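The bracket notation for multilinear forms can be made concrete with a small computation. The sketch below is only an illustration: the function names `bilinear` and `bilinear_tensor` and the numerical values are hypothetical and not part of the thesis. It evaluates $A[u,u]$ for vectors and $A[B,C]$ for matrices using plain nested lists.

```python
def bilinear(A, u, v):
    """A[u, v] = sum_{i,j} A_ij u_i v_j for a matrix A and vectors u, v."""
    return sum(A[i][j] * u[i] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def bilinear_tensor(A, B, C):
    """A[B, C] = { sum_{i,j} A_ij B_ik C_jl }_{k,l} for matrices B and C."""
    rows, cols = len(B[0]), len(C[0])
    return [[sum(A[i][j] * B[i][k] * C[j][l]
                 for i in range(len(B)) for j in range(len(C)))
             for l in range(cols)]
            for k in range(rows)]


A = [[2.0, 1.0], [1.0, 3.0]]
u = [1.0, -1.0]
quad = bilinear(A, u, u)           # quadratic form A[u, u]
I2 = [[1.0, 0.0], [0.0, 1.0]]
same = bilinear_tensor(A, I2, I2)  # plugging in identity matrices recovers A
print(quad, same)
```

In matrix terms, $A[u,u]$ is the quadratic form $u^\top A u$ and $A[B,C]$ is $B^\top A C$, which is why inserting identity matrices returns $A$ itself.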

¹ The sign in front of the quadratic term $(1/2)\Gamma_0(\tau;\theta_0)[u,u]$ differs from that in the original LAQ of [36] since we consider minimization of (1.1).

In principle, any $M$-estimation procedure, typically producing an asymptotically mixed-normally distributed estimator, may have its "regularized" counterpart; we refer to [5] for general background on statistical regularization. We are concerned here with extending the random-field structure to deal with possibly dependent data and a broader class of regularized $M$-estimation under the "mixed-rates" asymptotics. In particular, we will show how the PLDI of [36] carries over to mixed-rates $M$-estimation where the target statistical random fields may have components converging at different rates; we refer to [22] and [28] for details in the case of linear regression with a general regularization term. We adopt the very general theoretical framework developed by [26, Sections 2 and 3]. It will be shown that the PLDI criterion of [36] applies to some mixed-rates cases, while it may require some modification when the key LAQ structure of the original statistical random field fails to hold; it may even happen that $r_n(u,\tau;\theta_0)$ diverges in probability. Indeed, most of the existing sparse estimation procedures may fall into this type of asymptotics. Consequently, with a true parameter being fixed, our moment-convergence result provides yet another theoretical insight into regularized estimation, the well-established methodology especially in variable and/or model selection.² The logic of sparse and, more generally, shrinkage estimation is perhaps most clearly described in the context of multiple linear regression, with many deep theoretical interpretations such as geometrical (projection) characterization, variable selection, stabilized prediction performance, etc. See e.g. [14, Chapter 3].

There is a large body of previous work on moment convergence of estimators. It serves as a fundamental tool when analyzing the asymptotic behavior of expectations of statistics depending on the estimator, such as asymptotic bias and mean squared prediction error; to mention just a few, we refer to [7], [11], [16], [24], [27], [28], [29], [30], as well as [36]. Also, the convergence of moments of the regularized sparse maximum-likelihood estimator of the generalized linear model was deduced in [32] to verify AIC-type variable selection. Further, [1] recently discussed optimal selection of random and k-fold cross-validation estimators, the theoretical backbone of which involves some moment bounds of the estimators used; the related paper [2] studied the uniform integrability of the ordinary least-squares estimator in the linear regression setting.

² It should be noted that sparse estimation has received a mixed reception because of a kind of estimation singularity similar to that of the classical Hodges superefficient estimator. The unpleasant feature of the sparse-type estimator essentially stems from non-uniformity in weak convergence with respect to the true value of the parameters; see [24] for details.

This thesis is organized as follows. Section 2 describes our model setup, for which a series of basic asymptotic statements are given in Section 3; there, in particular, the polynomial type large deviation estimate of the underlying statistical random fields plays a crucial role in the uniform tail-probability estimate for the scaled $M$-estimator. Although the asymptotics is classical, there seem to exist no unified tools in the literature that can handle general $M$-estimation of multiple-rates and possibly mixed-rates type and, importantly, of possibly non-differentiable and non-convex type. In Section 3.3.1, we briefly discuss a naive yet formal example of component-wise tuning-parameter choice. The shrinkage effect is still useful for dependent-data models: it can effectively diminish non-significant factors involved in the model, resulting in a model-complexity assessment and/or selection. In Section 4 we apply the foregoing results to regularized estimation of an ergodic diffusion process observed at high frequency. The model is described by the Wiener-driven stochastic differential equation

$$dX_t = a(X_t, \alpha)\,dw_t + b(X_t, \beta)\,dt,$$

and, for the qualitatively different parameters $\alpha$ and $\beta$, we can estimate the former more quickly than the latter; hence the estimator we deal with is of multiple-scaling type. In the literature, [8] previously studied an adaptive-lasso type regularized estimation of the same model, taking the quadratic approximation of the quasi-likelihood function into account, and deduced the oracle property of their estimator. In this study, we derive the asymptotic behavior of a regularized estimator of the diffusion process under a general regularization term, without resorting to convexity at all. In Section 5, we generalize the regularization terms and give some examples by considering the regularized least-squares estimator for the linear regression model. The moment convergence of the estimator can be derived from our results.
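The high-frequency observation scheme for the diffusion model above can be illustrated with a small simulation. The sketch below is not the estimator studied in this thesis: it assumes, purely for illustration, the Ornstein-Uhlenbeck-type coefficients $a(x,\alpha)=\alpha$ and $b(x,\beta)=-\beta x$, an Euler-Maruyama discretization with step `h`, and two naive unregularized estimators. The names `simulate_ou` and `naive_estimates` and all numerical values are hypothetical.

```python
import math
import random


def simulate_ou(alpha, beta, n, h, x0=0.0, seed=0):
    """Euler-Maruyama path of dX = alpha dw - beta * X dt on the grid t_j = j*h."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n):
        x = path[-1]
        dw = rng.gauss(0.0, math.sqrt(h))  # Brownian increment over one step
        path.append(x + alpha * dw - beta * x * h)
    return path


def naive_estimates(path, h):
    """Illustrative unregularized estimators (an assumption, not the thesis's method)."""
    n = len(path) - 1
    incs = [path[j + 1] - path[j] for j in range(n)]
    # Diffusion parameter from the realized quadratic variation.
    alpha_hat = math.sqrt(sum(d * d for d in incs) / (n * h))
    # Drift parameter from least squares of increments on the current state.
    num = sum(path[j] * incs[j] for j in range(n))
    den = h * sum(path[j] ** 2 for j in range(n))
    beta_hat = -num / den
    return alpha_hat, beta_hat


path = simulate_ou(alpha=1.0, beta=1.0, n=20000, h=0.01)
alpha_hat, beta_hat = naive_estimates(path, h=0.01)
print(alpha_hat, beta_hat)
```

In runs of this kind the diffusion estimate is typically much closer to its target than the drift estimate, which is in line with the statement that $\alpha$ can be estimated more quickly than $\beta$.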

2 Setup

Let us begin with a description of the basic model setup for Section 3. Throughout, we are given an underlying probability space $(\Omega, \mathcal{F}, P)$. For the purpose of accelerating estimation performance, we consider $M$-estimation of an additive regularization type. We focus on the case of two-scaling, where the target statistical parameter $\theta \in \Theta$ is divided into two parts, say $\theta = (\alpha, \beta)$; an extension to cases of more than two scalings is straightforward but makes the notation messy. We set $\alpha \in \mathbb{R}^p$ and $\beta \in \mathbb{R}^q$, and $\Theta = \Theta_\alpha \times \Theta_\beta$ to be a bounded convex domain in $\mathbb{R}^{p+q}$.

We are given a function $M_n : \Omega \times \Theta \to \mathbb{R}$, and (possibly random) regularization functions $R^a_n(\alpha)$ and $R^b_n(\beta)$. We then consider contrast functions $\mathbb{H}_n : \Omega \times \Theta \to \mathbb{R}$ of the form

$$\mathbb{H}_n(\theta) = \mathbb{H}_n(\alpha, \beta) = M_n(\alpha, \beta) + R^a_n(\alpha) + R^b_n(\beta). \tag{2.1}$$

The associated regularized $M$-estimator is defined to be any element (for brevity, implicitly assumed to exist)

$$\hat\theta_n \in \operatorname*{argmin}_{\theta \in \Theta} \mathbb{H}_n(\theta).$$

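We quantitatively distinguish zero parameters from non-zero ones. We denote by $\theta_0 = (\alpha_0, \beta_0)$ the value we want to estimate (typically the true value of $\theta$) and assume that it takes the form $\alpha_0 = (\alpha^\circ_0, \alpha^*_0) = ((\alpha^\circ_{0,k'})_{k'}, (\alpha^*_{0,k''})_{k''})$ and $\beta_0 = (\beta^\circ_0, \beta^*_0) = ((\beta^\circ_{0,l'})_{l'}, (\beta^*_{0,l''})_{l''})$ with

$$\alpha^\circ_{0,k'} = 0, \quad \beta^\circ_{0,l'} = 0, \quad \alpha^*_{0,k''} \neq 0, \quad \beta^*_{0,l''} \neq 0.$$

We set $\alpha^\circ_0 \in \mathbb{R}^{p^\circ}$, $\beta^\circ_0 \in \mathbb{R}^{q^\circ}$, $\alpha^*_0 \in \mathbb{R}^{p^*}$ and $\beta^*_0 \in \mathbb{R}^{q^*}$ with $p^\circ, q^\circ, p^*, q^* \in \mathbb{N}$; then $p = p^\circ + p^*$ and $q = q^\circ + q^*$. Correspondingly, we write $\theta = (\theta^\circ, \theta^*)$ with $\theta^\circ = (\alpha^\circ, \beta^\circ)$ and $\theta^* = (\alpha^*, \beta^*)$ in the obvious manner. We also write $\hat\theta_n = (\hat\alpha_n, \hat\beta_n) = (\hat\alpha^\circ_n, \hat\alpha^*_n, \hat\beta^\circ_n, \hat\beta^*_n)$ with $\hat\theta^\circ_n = (\hat\alpha^\circ_n, \hat\beta^\circ_n)$ and $\hat\theta^*_n = (\hat\alpha^*_n, \hat\beta^*_n)$. For clarity we focus on the following regularization terms: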

$$R^a_n(\alpha) = \sum_{k=1}^{p} \lambda^a_{n,k} R^a(\alpha_k), \qquad R^b_n(\beta) = \sum_{l=1}^{q} \lambda^b_{n,l} R^b(\beta_l). \tag{2.2}$$

This form subsumes many of the existing types, e.g., [12] and [37] for the linear regression model, although it is not essential for our basic asymptotic results given in Sections 3.1 and 3.2. In fact, we will generalize the regularization terms in Section 5. For convenience of reference in the regularity conditions given later, we write:

$$R^a_n(\alpha) = R^{a\circ}_n(\alpha^\circ) + R^{a*}_n(\alpha^*) = \sum_{k'=1}^{p^\circ} \lambda^{a\circ}_{n,k'} R^a(\alpha^\circ_{k'}) + \sum_{k''=1}^{p^*} \lambda^{a*}_{n,k''} R^a(\alpha^*_{k''}), \tag{2.3}$$

$$R^b_n(\beta) = R^{b\circ}_n(\beta^\circ) + R^{b*}_n(\beta^*) = \sum_{l'=1}^{q^\circ} \lambda^{b\circ}_{n,l'} R^b(\beta^\circ_{l'}) + \sum_{l''=1}^{q^*} \lambda^{b*}_{n,l''} R^b(\beta^*_{l''}), \tag{2.4}$$

where:

• $\lambda^{a\circ}_{n,k'}$, $\lambda^{a*}_{n,k''}$, $\lambda^{b\circ}_{n,l'}$ and $\lambda^{b*}_{n,l''}$ are non-negative random variables;

• $R^a(\cdot)$ and $R^b(\cdot)$ are non-random non-negative functions on $\mathbb{R}$ such that $R^a(0) = R^b(0) = 0$;

• For all $a_0, b_0 \neq 0$ and $k > 0$, there exists a constant $C = C(a_0, b_0, k) > 0$ such that

$$\sup_{(a',b'):\,|a'| \vee |b'| \le k} \frac{|R^a(a') - R^a(a_0)| + |R^b(b') - R^b(b_0)|}{|a' - a_0| + |b' - b_0|} \le C. \tag{2.5}$$

The last condition (local Lipschitz continuity) is a technical one. Further conditions on the ingredients $M_n$, $R^a_n(\alpha)$ and $R^b_n(\beta)$ will be imposed later on; in Section 3.3.1, we briefly discuss how to set the regularization terms in naive yet specific ways.
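To make the additive structure (2.1) and (2.2) concrete, here is a minimal sketch. The two penalty functions, $R(x)=|x|$ and $R(x)=|x|^{q}$ with $0<q<1$, are chosen only as illustrations of non-negative functions vanishing at the origin that are Lipschitz around non-zero points; the function names, the weights and the value standing in for $M_n(\alpha,\beta)$ are all hypothetical.

```python
def lasso(x):
    """R(x) = |x|: non-negative, R(0) = 0, Lipschitz everywhere."""
    return abs(x)


def bridge(x, q=0.5):
    """R(x) = |x|**q with 0 < q < 1: non-convex, locally Lipschitz away from 0."""
    return abs(x) ** q


def penalty(coefs, weights, R):
    """Additive penalty sum_k weights[k] * R(coefs[k]), the form used in (2.2)."""
    if len(coefs) != len(weights):
        raise ValueError("coefs and weights must have the same length")
    return sum(w * R(c) for w, c in zip(weights, coefs))


def contrast(M_value, alpha, beta, lam_a, lam_b, Ra=lasso, Rb=lasso):
    """H_n = M_n + R^a_n + R^b_n as in (2.1), given the number M_n(alpha, beta)."""
    return M_value + penalty(alpha, lam_a, Ra) + penalty(beta, lam_b, Rb)


alpha = [0.0, 4.0]   # first component is a zero parameter
beta = [0.0, -1.0]
h = contrast(1.5, alpha, beta, lam_a=[2.0, 0.5], lam_b=[1.0, 1.0], Ra=bridge)
print(h)
```

Zero components contribute nothing to the penalty because $R^a(0)=R^b(0)=0$, so only the weights attached to non-zero components affect the value of the contrast at a sparse point.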

We deal with a situation where the non-zero part of the first component $\alpha$ can be estimated faster than that of the second component $\beta$; more specifically, we suppose that the sequence

$$\bigl( s_n^{-1}(\hat\alpha^*_n - \alpha^*_0),\ t_n^{-1}(\hat\beta^*_n - \beta^*_0) \bigr)$$

has a non-trivial asymptotic distribution for some possibly different positive sequences $(s_n)$ and $(t_n)$, both tending to zero and satisfying $s_n = o(t_n)$.

Although not explicitly mentioned, we presuppose that the "principal" part $M_n(\theta)$ reasonably makes sense even without the regularization terms $R^a_n(\alpha) + R^b_n(\beta)$; most typically, the un-regularized case, where $\mathbb{H}_n(\theta) = M_n(\theta)$, corresponds to the negative of a (quasi) log-likelihood. We note that the additive regularization can be interpreted as incorporating prior information about the parameter of interest; see Section 3.3.1.

3 Basic asymptotics

Under the setting described in Section 2, Section 3.1 focuses on the sparse asymptotics, where the underlying statistical random field is of mixed-rates type. On the other hand, it is possible to treat the standard asymptotics as well, where the localization via the matrix norming in (1.1) completely determines the asymptotic distribution of $\hat\theta_n$; we consider this standard case in Section 3.2.

3.1 Sparse case

In this section, we consider regularity conditions under which the following properties hold without assuming the convexity of $\mathbb{H}_n$.

(1) The (weak) consistency of $\hat\theta_n = (\hat\theta^\circ_n, \hat\theta^*_n)$.

(2) The asymptotic distributions:

(a) The sparse consistency of $\hat\theta^\circ_n$, i.e. $P(\hat\theta^\circ_n = 0) \to 1$;

(b) The asymptotic distribution of $\hat\theta^*_n$ at possibly multiple rates of convergence (via a matrix norming).

(3) The uniform tail-probability estimate of $\hat\theta_n = (\hat\theta^\circ_n, \hat\theta^*_n)$.

3.1.1 Consistency

We impose the uniform law of large numbers plus identiﬁability condition, and additionally some stochastic-order conditions on the regularization terms.

Recall that the parameter space Θ is bounded.

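Assumption 3.1

(1) $(s_n)$ and $(t_n)$ are positive nonrandom sequences such that $\max(s_n, t_n) \to 0$ and $s_n = o(t_n)$.

(2) There exist continuous random functions $M^a_0 : \Omega \times \Theta_\alpha \to \mathbb{R}$ and $M^b_0 : \Omega \times \Theta \to \mathbb{R}$ such that:

(a) $$\sup_{\alpha} \bigl| s_n^2 \{M_n(\alpha, \beta_0) - M_n(\alpha_0, \beta_0)\} - M^a_0(\alpha) \bigr| + \sup_{\theta} \bigl| t_n^2 \{M_n(\alpha, \beta) - M_n(\alpha, \beta_0)\} - M^b_0(\theta) \bigr| \xrightarrow{p} 0;$$

(b) $\operatorname*{argmin}_{\alpha} M^a_0(\alpha) = \{\alpha_0\}$ a.s. and $\operatorname*{argmin}_{\beta} M^b_0(\alpha_0, \beta) = \{\beta_0\}$ a.s.

(3) $$\sup_{\alpha} s_n^2 R^a_n(\alpha) + \sup_{\beta} t_n^2 R^b_n(\beta) \xrightarrow{p} 0.$$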

We will take advantage of the general results given in [26] concerning mixed-rates asymptotics.

Lemma 3.2 Assume that a random function

$$\mathbb{M}_n(u, v) = \bar a_n f_n(u) + \bar b_n g_n(u, v), \qquad (u, v) \in \mathbb{R}^p \times \mathbb{R}^q,$$

and random variables $(\hat u_n, \hat v_n)$ and $(\hat u_0, \hat v_0)$ satisfy the following conditions:

(L1) $\bar a_n$ and $\bar b_n$ are positive numbers such that $\bar b_n = o(\bar a_n)$;

(L2) $(f_n(\cdot), g_n(\cdot,\cdot)) \xrightarrow{\mathcal{L}} (f_0(\cdot), g_0(\cdot,\cdot))$ in $C(K_f \times K_g)$ for every compact $K_f \times K_g \subset \mathbb{R}^p \times \mathbb{R}^q$;

(L3) $\mathbb{M}_n(\hat u_n, \hat v_n) \le \inf_{(u,v)} \mathbb{M}_n(u, v) + o_p(\bar b_n)$;

(L4) $(\hat u_n, \hat v_n) = O_p(1)$;

(L5) $u \mapsto f_0(u)$ has a.s. a unique minimum at $u = \hat u_0$;

(L6) $v \mapsto g_0(\hat u_0, v)$ has a.s. a unique minimum at $v = \hat v_0$.

Then we have the following.

(1) $(\hat u_n, \hat v_n) \xrightarrow{\mathcal{L}} (\hat u_0, \hat v_0)$ under the conditions (L1) to (L6).

(2) $\hat u_n \xrightarrow{\mathcal{L}} \hat u_0$ under the conditions (L1) to (L5).

Proof The first claim is just a simplified version of [26, Theorem 1]. The second one follows on applying the usual argmax theorem to the rescaled function $u \mapsto \bar a_n^{-1} \mathbb{M}_n(u, \hat v_n) = f_n(u) + (\bar b_n / \bar a_n) g_n(u, \hat v_n)$, which admits an approximate minimizer $\hat u_n$ and weakly converges to $f_0$ in $C(K_f)$ for every compact $K_f \subset \mathbb{R}^p$. □

Remark 3.3 Although we do not use the second claim of Lemma 3.2 in this study, it may be useful when considering stepwise estimation, where there is an original contrast (or quasi-likelihood) function but some step-by-step strategy is taken for estimating parameter components that can be estimated more quickly than the others. See [31] for an ergodic diffusion model. □

To apply the first claim of Lemma 3.2 under Assumption 3.1 with $(u, v) = (\alpha, \beta)$, we set $\mathbb{M}_n(\alpha, \beta) = \mathbb{H}_n(\alpha, \beta) - \mathbb{H}_n(\alpha_0, \beta_0)$, $\bar a_n = s_n^{-2}$ and $\bar b_n = t_n^{-2}$:

$$f_n(\alpha) = s_n^2 \bigl( M_n(\alpha, \beta_0) - M_n(\alpha_0, \beta_0) + R^a_n(\alpha) - R^a_n(\alpha_0) \bigr),$$

$$g_n(\alpha, \beta) = t_n^2 \bigl( M_n(\alpha, \beta) - M_n(\alpha, \beta_0) + R^b_n(\beta) - R^b_n(\beta_0) \bigr).$$

According to the a.s. continuity of the random functions $\alpha \mapsto M^a_0(\alpha)$ and $\theta \mapsto M^b_0(\theta)$, it is straightforward to verify all the conditions in Lemma 3.2 under Assumption 3.1, hence the following claim:

Theorem 3.4 We have $\hat\theta_n \xrightarrow{p} \theta_0$ under Assumption 3.1.
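As a quick check that the choice of $f_n$, $g_n$, $\bar a_n = s_n^{-2}$ and $\bar b_n = t_n^{-2}$ made before Theorem 3.4 fits the form required in Lemma 3.2, the following sketch expands $\bar a_n f_n + \bar b_n g_n$ using only (2.1) and the definitions of $f_n$ and $g_n$; no assumption beyond those already stated is involved.

```latex
\begin{align*}
\bar a_n f_n(\alpha) + \bar b_n g_n(\alpha,\beta)
 &= \{M_n(\alpha,\beta_0) - M_n(\alpha_0,\beta_0)\}
   + \{M_n(\alpha,\beta) - M_n(\alpha,\beta_0)\} \\
 &\quad + \{R^a_n(\alpha) - R^a_n(\alpha_0)\}
   + \{R^b_n(\beta) - R^b_n(\beta_0)\} \\
 &= \mathbb{H}_n(\alpha,\beta) - \mathbb{H}_n(\alpha_0,\beta_0)
  = \mathbb{M}_n(\alpha,\beta), \\
\frac{\bar b_n}{\bar a_n} &= \frac{s_n^2}{t_n^2} \to 0
 \quad\text{since } s_n = o(t_n), \text{ which is (L1).}
\end{align*}
```

The terms $M_n(\alpha,\beta_0)$ cancel, which is the reason the first-step function $f_n$ can be defined with $\beta$ frozen at $\beta_0$.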

3.1.2 Rates of convergence

Next we prove $(\hat u_n, \hat v_n) = O_p(1)$, where

$$\hat u_n := s_n^{-1}(\hat\alpha_n - \alpha_0), \qquad \hat v_n := t_n^{-1}(\hat\beta_n - \beta_0), \tag{3.1}$$

under additional conditions. To this end we introduce the following general result, which is a simplified version of [26, Lemma 1], to deduce the "correct" convergence rate of $\hat\theta^*_n$ and a "preliminary" convergence rate of $\hat\theta^\circ_n$. Moreover, the latter preliminary rate may also be used to verify the conditions for the sparse consistency of $\hat\theta^\circ_n$ (Section 3.1.3). Let $[a]_+ := \max(a, 0)$ for $a \in \mathbb{R}$.

Lemma 3.5 Let $\xi$ denote either $\alpha$ or $\beta$, and assume that the real-valued random function $\mathbb{H}_n(\xi)$ satisfies the following conditions:

(1) $\mathbb{H}_n(\hat\xi_n) \le \mathbb{H}_n(\xi_0)$ a.s.

(2) There exist random functions $U_n(\xi)$ and $V_n(\xi)$ such that

$$\mathbb{H}_n(\xi) - \mathbb{H}_n(\xi_0) = U_n(\xi) - V_n(\xi),$$

where, for some random variable $U_0 > 0$ a.s., constants $0 \le \gamma < \rho$, and a positive nonrandom sequence $(k_n)$ such that $k_n \to 0$, we have:

(a) $P\bigl( U_n(\hat\xi_n) \ge |\hat\xi_n - \xi_0|^{\rho} U_0 \bigr) \to 1$;

(b) $[V_n(\hat\xi_n)]_+ = O_p(k_n |\hat\xi_n - \xi_0|^{\gamma})$.

Then $k_n^{-1/(\rho - \gamma)} (\hat\xi_n - \xi_0) = O_p(1)$.

To deduce Lemma 3.5, observe that on the set $\{ |\hat\xi_n - \xi_0|^{\rho} \le U_0^{-1} U_n(\hat\xi_n) \}$ we have

$$|\hat\xi_n - \xi_0|^{\rho} \le U_0^{-1} \{ V_n(\hat\xi_n) + \mathbb{H}_n(\hat\xi_n) - \mathbb{H}_n(\xi_0) \} \le U_0^{-1} V_n(\hat\xi_n) \le U_0^{-1} [V_n(\hat\xi_n)]_+,$$

which, combined with (b), yields $|\hat\xi_n - \xi_0|^{\rho - \gamma} = O_p(k_n)$.

In what follows, for any square matrix $A$ we write $A^{\otimes 2} = AA^\top$ and denote by $\lambda_{\min}(A)$ the smallest eigenvalue of $A$. Coming back to our model, we next impose:

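Assumption 3.6

(1) $M_n \in C^3(\Theta)$ a.s., and it holds that:

(a) $\sup_{\beta} |s_n \partial_\alpha M_n(\alpha_0, \beta)| + |t_n \partial_\beta M_n(\theta_0)| = O_p(1)$;

(b) $\sup_{\alpha} |s_n t_n \partial_\alpha \partial_\beta M_n(\alpha, \beta_0)| = O_p(1)$;

(c) $\sup_{\theta} |s_n^2 \partial_\zeta \partial_\alpha^2 M_n(\theta)| + \sup_{\theta} |t_n^2 \partial_\zeta \partial_\beta^2 M_n(\theta)| = O_p(1)$ for $\zeta = \alpha, \beta$;

(d) There exist symmetric random functions $\Gamma^\alpha_0 : \Omega \times \Theta_\alpha \to \mathbb{R}^p \otimes \mathbb{R}^p$ and $\Gamma^\beta_0 : \Omega \times \Theta \to \mathbb{R}^q \otimes \mathbb{R}^q$ such that

$$\bigl| s_n^2 \partial_\alpha^2 M_n(\theta_0) - \Gamma^\alpha_0(\alpha_0) \bigr| + \bigl| t_n^2 \partial_\beta^2 M_n(\theta_0) - \Gamma^\beta_0(\theta_0) \bigr| \xrightarrow{p} 0,$$

with $\lambda_{\min}\bigl(\Gamma^\alpha_0(\alpha_0)\bigr) \wedge \lambda_{\min}\bigl(\Gamma^\beta_0(\theta_0)\bigr) > 0$ a.s.

(2) $s_n \lambda^{a*}_{n,k''} = O_p(1)$ and $t_n \lambda^{b*}_{n,l''} = O_p(1)$ for each $k''$ and $l''$.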

Remark 3.7 Note that Assumption 3.6 (2) is only concerned with the non-zero parameter parts, which of course are unknown a priori in practice; such a situation will appear a few times later. Hence, as is well recognized as a common situation in adaptive-type sparse estimation, some appropriate data-driven choices of the weights are desirable. There would be many possibilities for this issue. See Section 3.3.1 for more discussion. □

Let Assumptions 3.1 and 3.6 hold; then $\hat\theta_n \xrightarrow{p} \theta_0$ by Theorem 3.4. We apply Lemma 3.5 separately to prove $\hat u_n = O_p(1)$ and $\hat v_n = O_p(1)$. To show $\hat u_n = O_p(1)$, set

$$\mathbb{H}_n(\alpha) = s_n^2\, \mathbb{H}_n(\alpha, \hat\beta_n),$$

where the left-hand side denotes the one-argument random function to which Lemma 3.5 is applied with $\xi = \alpha$.

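For notational convenience, for a random function $F_n(\theta)$ we write $F_n(\theta) = O^*_p(1)$ and $F_n(\theta) = o^*_p(1)$ if $\sup_\theta |F_n(\theta)| = O_p(1)$ and $\sup_\theta |F_n(\theta)| = o_p(1)$, respectively. The first condition in Lemma 3.5 is trivial. To deduce the second one, making use of a third-order Taylor expansion we derive $\mathbb{H}_n(\hat\alpha_n) - \mathbb{H}_n(\alpha_0) = U_n(\hat\alpha_n) - V_n(\hat\alpha_n)$ for

$$
\begin{aligned}
U_n(\hat\alpha_n) &:= \frac{1}{2} s_n^2 \partial_\alpha^2 M_n(\tilde\alpha_n, \hat\beta_n)\bigl[(\hat\alpha_n - \alpha_0)^{\otimes 2}\bigr] + s_n^2 \sum_{k'=1}^{p^\circ} \lambda^{a\circ}_{n,k'} R^a(\hat\alpha^\circ_{n,k'}) \\
&= \frac{1}{2} \Bigl( s_n^2 \partial_\alpha^2 M_n(\alpha_0, \hat\beta_n) + s_n^2 \partial_\alpha^3 M_n(\check\alpha_n, \hat\beta_n)[\tilde\alpha_n - \alpha_0] \Bigr) \bigl[(\hat\alpha_n - \alpha_0)^{\otimes 2}\bigr] + s_n^2 \sum_{k'=1}^{p^\circ} \lambda^{a\circ}_{n,k'} R^a(\hat\alpha^\circ_{n,k'}) \\
&= \frac{1}{2} \bigl\{ \Gamma^\alpha_0(\alpha_0) + O^*_p(|\hat\alpha_n - \alpha_0| \vee |\hat\beta_n - \beta_0|) \bigr\} \bigl[(\hat\alpha_n - \alpha_0)^{\otimes 2}\bigr] + s_n^2 \sum_{k'=1}^{p^\circ} \lambda^{a\circ}_{n,k'} R^a(\hat\alpha^\circ_{n,k'}) \\
&= \frac{1}{2} \bigl\{ \Gamma^\alpha_0(\alpha_0) + o^*_p(1) \bigr\} \bigl[(\hat\alpha_n - \alpha_0)^{\otimes 2}\bigr] + s_n^2 \sum_{k'=1}^{p^\circ} \lambda^{a\circ}_{n,k'} R^a(\hat\alpha^\circ_{n,k'}),
\end{aligned}
\tag{3.2}
$$

$$V_n(\hat\alpha_n) := - s_n^2 \partial_\alpha M_n(\alpha_0, \hat\beta_n)[\hat\alpha_n - \alpha_0] - s_n^2 \sum_{k''=1}^{p^*} \lambda^{a*}_{n,k''} \bigl( R^a(\hat\alpha^*_{n,k''}) - R^a(\alpha^*_{0,k''}) \bigr),$$

where the points $\tilde\alpha_n$ and $\check\alpha_n$ are located on the segments connecting $\alpha_0$ and $\hat\alpha_n$, and $\alpha_0$ and $\tilde\alpha_n$, respectively. Note that the non-negativity of $R^a$ enables us to ignore the second term on the right-hand side of (3.2) when estimating $U_n(\hat\alpha_n)$ from below. Also, under the local Lipschitz continuity (2.5) of $R^a$, we see that the conditions of Lemma 3.5 are satisfied with $\gamma = 1$, $\rho = 2$, (e.g.) $U_0 = \lambda_{\min}(\Gamma^\alpha_0(\alpha_0))/4$, and $k_n = s_n$: we have $U_n(\hat\alpha_n) \ge (1/2)\{o_p(1) + \lambda_{\min}(\Gamma^\alpha_0(\alpha_0))\} |\hat\alpha_n - \alpha_0|^2$ and $|V_n(\hat\alpha_n)| \le O_p(s_n |\hat\alpha_n - \alpha_0|)$. Hence $\hat u_n = O_p(1)$ is proved. To deduce $\hat v_n = O_p(1)$, we can proceed in the same way with

$$\mathbb{H}_n(\beta) = t_n^2 \{ \mathbb{H}_n(\hat\alpha_n, \beta) - \mathbb{H}_n(\hat\alpha_n, \beta_0) \}$$

in place of $\mathbb{H}_n(\alpha) = s_n^2\, \mathbb{H}_n(\alpha, \hat\beta_n)$.

Theorem 3.8 We have $(\hat u_n, \hat v_n) = O_p(1)$ under Assumptions 3.1 and 3.6.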
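The order of $V_n(\hat\alpha_n)$ used in the argument leading to Theorem 3.8 can be traced term by term. The following is a sketch, assuming Assumption 3.6 (1)(a) and (2) together with (2.5); the symbol $C_{k''}$ is introduced here only for illustration and stands for the constant in (2.5) with $a_0 = \alpha^*_{0,k''}$, $b' = b_0$, and $k$ a bound for the bounded parameter space $\Theta$.

```latex
\begin{align*}
\bigl|s_n^2\,\partial_\alpha M_n(\alpha_0,\hat\beta_n)[\hat\alpha_n-\alpha_0]\bigr|
 &\le s_n \Bigl(\sup_\beta |s_n\,\partial_\alpha M_n(\alpha_0,\beta)|\Bigr)
      |\hat\alpha_n-\alpha_0|
  = O_p\bigl(s_n|\hat\alpha_n-\alpha_0|\bigr),\\
\Bigl|s_n^2\sum_{k''=1}^{p^*}\lambda^{a*}_{n,k''}
   \bigl(R^a(\hat\alpha^*_{n,k''})-R^a(\alpha^*_{0,k''})\bigr)\Bigr|
 &\le s_n\sum_{k''=1}^{p^*}\bigl(s_n\lambda^{a*}_{n,k''}\bigr)\,
      C_{k''}\,|\hat\alpha^*_{n,k''}-\alpha^*_{0,k''}|
  = O_p\bigl(s_n|\hat\alpha_n-\alpha_0|\bigr).
\end{align*}
```

Both terms therefore carry one factor of $s_n$ and one factor of $|\hat\alpha_n-\alpha_0|$, which corresponds to $k_n = s_n$ and $\gamma = 1$ in Lemma 3.5, and hence to the rate $k_n^{-1/(\rho-\gamma)} = s_n^{-1}$ when $\rho = 2$.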


3.1.3 Sparse consistency

The sparse consistency of $\hat\theta^\circ_n$ refers to the property $P(\hat\theta^\circ_n = 0) \to 1$; the asymptotic distribution of $\hat\theta^\circ_n$ then degenerates at the origin, with an arbitrarily fast rate of convergence, i.e. $R_n \hat\theta^\circ_n = o_p(1)$ for any $R_n \to \infty$. The next general result, a variant of [26, Theorem 2], is a tailor-made tool for establishing this property.

Lemma 3.9 Let $\xi$ denote either $\alpha$ or $\beta$ (so $\xi = (\xi^\circ, \xi^*)$ and $\xi_0 = (0, \xi^*_0)$), and assume that the real-valued random function $\mathbb{H}_n(\xi) = \mathbb{H}_n(\xi^\circ, \xi^*)$ satisfies the following conditions.

(1) $\mathbb{H}_n(\hat\xi^\circ_n, \hat\xi^*_n) \le \mathbb{H}_n(0, \hat\xi^*_n)$ a.s.

(2) There exist random functions $U_n(\xi)$ and $V_n(\xi)$ such that

$$\mathbb{H}_n(\xi) - \mathbb{H}_n(0, \xi^*) = U_n(\xi) - V_n(\xi),$$

where, for some random variable $U_0 > 0$ a.s. and a constant $\rho > 0$,

$$P\Bigl[ \bigl\{ [V_n(\hat\xi_n)]_+ = 0 \bigr\} \cap \bigl\{ U_n(\hat\xi_n) \ge |\hat\xi^\circ_n|^{\rho} U_0 \bigr\} \Bigr] \to 1.$$

Then $P(\hat\xi^\circ_n = 0) \to 1$.

Lemma 3.9 follows on observing the following: on the event $A_n := \{ [V_n(\hat\xi_n)]_+ = 0 \} \cap \{ U_n(\hat\xi^\circ_n, \hat\xi^*_n) \ge |\hat\xi^\circ_n|^{\rho} U_0 \}$, we have

$$|\hat\xi^\circ_n|^{\rho} \le U_0^{-1} \{ V_n(\hat\xi_n) + \mathbb{H}_n(\hat\xi_n) - \mathbb{H}_n(0, \hat\xi^*_n) \} \le U_0^{-1} [V_n(\hat\xi_n)]_+ = 0.$$

Hence $P(|\hat\xi^\circ_n| = 0)$ is bounded below by $P(A_n) \to 1$.
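Sparse consistency can be seen numerically in a toy model. The sketch below is an illustration under assumptions that are not taken from the thesis: i.i.d. $N(0,1)$ observations with true mean $0$, the contrast $\frac12\sum_i (X_i - x)^2 + \lambda_n |x|$ whose minimizer is the soft-thresholded sample mean, and the weight $\lambda_n = n^{3/4}$, chosen so that the threshold $\lambda_n/n$ dominates the $n^{-1/2}$ fluctuation of the sample mean. The names `soft_threshold` and `zero_frequency` are hypothetical.

```python
import math
import random


def soft_threshold(z, t):
    """Minimizer of 0.5*(z - x)**2 + t*|x| over x."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def zero_frequency(n, reps=500, seed=0):
    """Fraction of replications in which the penalized estimate is exactly 0.

    Toy model (an assumption for illustration): X_1..X_n i.i.d. N(0, 1) with
    true mean 0, contrast 0.5*sum (X_i - x)^2 + lam_n*|x|, lam_n = n**0.75.
    The minimizer is soft_threshold(sample mean, lam_n / n).
    """
    rng = random.Random(seed)
    lam_n = n ** 0.75
    hits = 0
    for _ in range(reps):
        # The sample mean is drawn directly from its exact N(0, 1/n) law.
        xbar = rng.gauss(0.0, 1.0 / math.sqrt(n))
        if soft_threshold(xbar, lam_n / n) == 0.0:
            hits += 1
    return hits / reps


freqs = {n: zero_frequency(n) for n in (1, 16, 10000)}
print(freqs)
```

The frequency of exact zeros approaches one as $n$ grows, which is the behavior $P(\hat\xi^\circ_n = 0) \to 1$ described by Lemma 3.9, here in a setting simple enough that the minimizer is available in closed form.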

Remark 3.10 To conclude $P(\hat\xi^\circ_n = 0) \to 1$ we may replace the second condition of Lemma 3.9 by

$$P\Bigl[ \bigl\{ [V_n(\hat\xi_n)]_+ = 0 \bigr\} \cap \bigl\{ U_n(\hat\xi_n) \ge |\hat\xi^\circ_{n,m'}|^{\rho} U_0 \bigr\} \Bigr] \to 1$$

for each $m'$ running through $\{1, \dots, p^\circ\}$ or $\{1, \dots, q^\circ\}$ according as $\xi = \alpha$ or $\beta$; for each $m'$, the same proof as above leads to $P(\hat\xi^\circ_{n,m'} = 0) \to 1$. □

Let us go back to our main context. We keep imposing Assumptions 3.1 and 3.6. Denoting by $\Gamma^{\alpha\circ}_0(\alpha)$ (resp. $\Gamma^{\beta\circ}_0(\theta)$) the upper left $p^\circ \times p^\circ$ part of
