Combinatorial bandit prediction with relaxation-based approximation algorithm
全文
(2) Vol.2018-AL-167 No.4 2018/3/8. IPSJ SIG Technical Report. rize the previous results and ours in Table 1. Table 1 Kakade et al. [5] Garber [4] ours. Summary of results.. α-regret ((α + ϵ) for ours ) 2 O(αT 3 ) 2 O(αT 3 ) 2 O((α + ϵ)T 3 ). time per trial 2 O(poly(n)T 3 ) O(poly(n) log T ) O(poly(n, 1/ϵ)). Other related work There are several previous results for the conversion approach in the full information setting. The first result is the Follow the Perturbed Leader (FPL) proposed by Kalai and Vempala [6]. Kakade et al. [5] proposed a strategy which achieves α-regret = √ O(α T ) with O(poly(n)T ) time at each trial. Garber [4] game a better algorithm whose running time is O(poly(n) log T ) with the same regret bound. In the bandit setting, Bubeck et al. [1] proposed the OSMD algorithm for convex decision sets. This algorithm achieves optimal regret bounds for many classes of decision sets.. 2.. Preliminaries. In this section, we define remark notations, definitions and problem setting. 2.1 Notations We denote scalars with lower case letters and vectors with bold case letters, matrix with uppercase letters. The ith element of vector x is denoted by x(i). Vectors are column in the standard, so inner product between vector x and y is denoted by x⊤ y with transpose symbol ⊤. Let R be a whole set of real number and R+ be a non-negative set of R. We denote a hypercube whose vertices are in {−a, a}n as B(a),i.e.,B(a) = [−a, a]n . The expectation operator of a distribution D is denoted by ED [·]. Let [n] be a set of all positive integers less than n + 1 (i.e.[n] = {1, 2, . . . , n}). 2.2 Definitions Definition 2.1 (Bregman Divergence). Let F : Γ → R be a strictly convex function. We denote a Bregman divergence DF with respect to F and define below. DF (x, y) = F(x) − F(y) − ∇F(y)⊤ (x − y) Definition 2.2 (Legendre-Fenchel dual). We define the LegendreFenchel dual of a function F as ( ) F ⋆ (y) = min x⊤ y − F(x) . x Note that (∇F ⋆ )−1 = ∇F. Definition 2.3 (β-Barycentric Spanner (β-BS)). A set {q1 , q2 , . . . , qn } ⊂ C is a β-Barycentric Spanner (β-BS) for C if ∑ ∀c ∈ C, ∃λ1 , . . . , λn ∈ [−β, β], c = λi qi i∈[n]. 3.. Algorithm. In this section, we propose a bandit algorithm described in Algorithm 1. As a preprocessing, the algorithm calculates β-BS for. c 2018 Information Processing Society of Japan ⃝. a real decision set C and make a mapping matrix Q(i)( j) = q j (i) for every i, j ∈ [n]. We consider a mapped set Cˆ = Q−1 C because it becomes easy to analyze regret thanks to some propositions described in appendix ˆ a mapped space of P by Q, i.e. P ˆ = Q−1 P. P ˆ We define P also consists of poly(n) linear constraint. Due to proposition in appendix, it is clearly that Cˆ is contained by a n-dimensional hyˆ and B(β) is denoted by B. ˆ percube B(β). Intersection between P ˆ ˆ Note that B has poly(n) linear constraint since P and B(β) also have. When we update online combinatorial prediction problems over pseudo decision space, then algorithm solve the problem over real decision set. Framework of our algorithm is as follows: In each iteration t = 1, . . . , T , at first, the algorithm decides to do either exploration or exploitation with probability γ, 1 − γ respectively . The algorithm choose it uniformly at random and outputs eit in exploration term. Else, algorithm solves α-metarouding problem with input xˆ t and play cˆ distributed by some metarouding algorithm. After outputting decision vector, the algorithm receives incur its loss lt . Next, algorithm estimates a pseudo loss vector. If it explored in first step, estimate loss vector is calculated as s˜ t = lt γn eit . Else s˜ t = 0. This method makes s˜ t unbiased estimator of sˆ t described in lemma3.1 in detail. Then the algorithm updates xˆ t OSMD-like with regarding estimate vector s˜ t as a true ˆ Last, algorithm finds the most clospseudo loss vector sˆ t over P. est point in Bˆ with respect to Bregman divergence. We remark pseudo code of our algorithm below.. Algorithm 1 Proposed algorithm ( 1 ) Parameters: learning ratio η > 0, exploration parameter 0 < γ < 1 ˆ ∩ ( 2 ) Calculate β-BS(C) = {q1 , . . . , qn } and a mapping matrix Q, Bˆ = P B(β). ˆ be any initial point. ( 3 ) Set x1 ∈ P ( 4 ) For t = 1, .. . , T 0 (with probability γ) ( a ) bt = 1 (with probability 1 − γ) ( b ) if bt = 0(exploration) then ( i ) The algorithm chooses it ∈ [n] uniformly at random and play cˆ t = eit . ( ii ) Incur loss lt and calculate s˜ t = γn lt eit . ( c ) else(exploitation) ( i ) Solves α-metarouding problem with input xˆ t . ( ii ) Incur loss lt and calculate s˜ t = 0. ( d ) update xˆ t+ 21 = ∇F ⋆ (∇F( xˆ t ) − η˜st ). ˆ That is xˆ t+1 = arg min ˆ ˆ DF ( xˆ , xˆ t+1/2 ). ( e ) Project xˆ t+1/2 onto B. x∈B. ∑. We use a Euclidean norm squared regularizer F(x) 2 i∈[n] x(i) in OSMD-like update, so. =. DF (x, y) = ∥x − y∥22 In our algorithm, it holds the following lemmas. Lemma 3.1. s˜ t is a unbiased estimater of sˆ t , i.e., E [˜st ] = sˆ t Proof. Since s˜ t is independent of randomness of exploitation, 2.
(3) Vol.2018-AL-167 No.4 2018/3/8. IPSJ SIG Technical Report. E [(ˆct − (α + ϵ) xˆt )⊤ st ] 1∑ ⊤ = γ e st + E MBB [ˆc⊤ st ] − γE MBB [ˆc⊤ st ] − (α + ϵ) xˆ ⊤ st n k∈[n] k 1∑ ⊤ ≤ γ e st + (α + ϵ) xˆ ⊤ st − γE MBB [ˆc⊤ st ] − (α + ϵ) xˆ ⊤ st n k∈[n] k 1∑ ≤ γ D (∵ lt ≤ D) n k∈[n]. E [˜st ] = Ebt ,it [˜st ] = γEit [˜st |bt = 0] + (1 − γ)E [˜st |bt = 1] 1∑ n = γ lt e k + 0 n k∈[n] γ ∑ = ek lt k∈[n]. =. ∑. ek e⊤k sˆ t. = Dγ. k∈[n]. = sˆ t □. Then the first term of (1) is bounded by T Dγ. lemma3.1, the second term of (2) is. Lemma 3.2. Euclidean norm squared of s˜ t. T T T T ∑ ∑ ∑ ⊤ ∑ ⊤ E xˆ t sˆ t − xˆ ⊤ sˆ t = E xˆ t E [˜st ] − xˆ ⊤ E [˜st ]. ] n2 D 2 [ E ∥˜st ∥22 ≤ γ. t=1. t=1. t=1. t=1. T T ∑ ∑ ⊤ = E xˆ t s˜ t − xˆ ⊤ s˜ t . Proof.. t=1. E[∥˜st ∥22 ]. =. E[˜s⊤t s˜ t ]. DF ( xˆ t+1 , xˆ t+1/2 ) + DF ( xˆ , xˆ t+1 ) ≤ DF ( xˆ , xˆ t+1/2 ) Then, η˜s⊤t ( xˆ t − xˆ ) = ( xˆ t − xˆ )⊤ (∇F( xˆ t+1/2 ) − ∇F( xˆ t )) (∵ ∇F( xˆ t+1/2 ) = ∇F( xˆ t ) − η˜st ). □. = DF ( xˆ , xˆ t ) + DF ( xˆ t , xˆ t+1/2 ) − DF ( xˆ , xˆ t+1/2 ) ≤ DF ( xˆ , xˆ t ) + DF ( xˆ t , xˆ t+1/2 ). Main Results. −DF ( xˆ t+1 , xˆ t+1/2 ) − DF ( xˆ , xˆ t+1 ). Theorem 4.1. Using MBB algorithm [3] in line (4)-(c)-(i) in algorithm 3 and setting η = O(βD−1 T −2/3 ), γ = O(nT −1/3 ), our algorithm runs at each trial poly(n, 1/ϵ) time and achieves the expected regret below.. ≤ DF ( xˆ , xˆ t ) + DF ( xˆ t , xˆ t+1/2 ) − DF ( xˆ , xˆ t+1 ) (∵ DF ( xˆ t+1 , xˆ t+1/2 ) ≥ 0) Summing up this inequality for t = 1, . . . , T , we get. 2 3. (α + ϵ) − regret = O((α + ϵ)βnDT ). T ∑. ˆ Proof. For any xˆ ∈ P, E. [∑. T t=1. t=1. From Generalized Pythagorean Theorem (see e.g. [2]),. ( )⊤ ( ) 1∑ n n = γ lt e k lt ek + (1 − γ)0 n k∈[n] γ γ n = lt2 n γ n2 D2 (∵ lt ≤ D) ≤ γ. 4.. Thanks to. η˜s⊤t ( xˆ t − xˆ ) ≤ DF ( xˆ , xˆ 1 ) − DF ( xˆ , xˆ T +1 ) +. DF ( xˆ t , xˆ t+1/2 ). t=1. t=1. ] ∑ c⊤t st − (α + ϵ) Tt=1 x⊤ st. T ∑. ≤ DF ( xˆ , xˆ 1 ) +. T ∑. DF ( xˆ t , xˆ t+1/2 ) (∵ DF ( xˆ , xˆ T +1 ) ≥ 0). t=1. Here,. T T ∑ ∑ ⊤ = E cˆ t sˆ t − (α + ϵ) xˆ ⊤ sˆ t t=1. DF ( xˆ t , xˆ t+1/2 ) = ∥ xˆ t − xˆ t+1/2 ∥22. t=1. = ∥η˜st ∥22. T T T T ∑ ∑ ∑ ∑ ⊤ ⊤ ⊤ ⊤ xˆ sˆ t xˆ t sˆ t − (α + ϵ) xˆ t sˆ t + (α + ϵ) = E cˆ t sˆ t − (α + ϵ) t=1. t=1. t=1. t=1. t=1. T T ∑ ∑ ⊤ ⊤ = E cˆ t sˆ t − (α + ϵ) xˆ t sˆ t . t=1. t=1. Since E[ˆct ] = γ n. k∈[n] ek. (∵ lemma3.2). DF ( xˆ , xˆ 1 ) ≤ 4nβ2 ˆ we have (α + ϵ)−regret upper bounded Since Cˆ ⊂ P,. t=1. T T ∑ ∑ ⊤ ⊤ ˆ ˆ ˆ ˆ x st ≤ T Dγ + (α + ϵ)E xt st − 1 ∑. n2 D2 γ. (1)And,. T T ∑ ∑ ⊤ ⊤ xˆ t sˆ t − (α + ϵ) xˆ sˆ t +E (α + ϵ) t=1. ≤ η2. t=1. + (1 − γ)E MBB [ˆc], so we obtain. c 2018 Information Processing Society of Japan ⃝. (2). T Dγ + (α + ϵ). 4nβ2 T ηn2 D2 + (α + ϵ) η γ. By tuning parameters properly, we complete proof.. □ 3.
(4) Vol.2018-AL-167 No.4 2018/3/8. IPSJ SIG Technical Report. 5.. Concluding Remarks. In this paper, we propose a bandit algorithm using an approximation algorithm based on LP relaxation. Our algorithm is superior to previous work in the two points. One is that our algorithm’s computation time at each iteration doesn’t depend on T . The other is that the player does not need to know the approximation ratio α. One of the future works is to improve the regret √ bound to (α + ϵ)-regret = O((α + ϵ) T ).. Proposition 3. A loss between decision vector c and loss vector s equals to a loss between corresponding pseudo decision and loss vectors. Proof. cˆ ⊤ sˆ = (Q−1 c)⊤ Q⊤ s = cQ−⊤ Q⊤ s = c⊤ s □. References [1]. [2] [3]. [4]. [5] [6]. S´ebastien Bubeck, Nicol`o Cesa-Bianchi, and Sham M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pages 41.1–41.14, 2012. Nicol`o Cesa-Bianchi and G´abor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. Takahiro Fujita, Kohei Hatano, and Eiji Takimoto. Combinatorial online prediction via metarounding. In Algorithmic Learning Theory - 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Proceedings, pages 68–82, 2013. Dan Garber. Efficient online linear optimization with approximation algorithms. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 627–635, 2017. Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. SIAM J. Comput., 39(3):1088– 1106, 2009. Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291–307, 2005.. Appendix We define a mapping matrix Q ∈ Cn which have ith β−BS qi in its ith column i.e. Q(i)( j) = q j (i). We consider a mapped concept Cˆ of C, which is Cˆ = {Qc|c ∈ C}. Then the following propositions holds. Proposition 1. Cˆ includes standard basis e1 , . . . , en . Proof. By definition of Q, ∀i ∈ [n].qi = Qei . Q is inverticle for almost natural discrete concepts, so ∀i ∈ [n]. ei = Q−1 qi ∈ Cˆ □ Proposition 2. Suppose that a discrete set C has β-BS ˆ {q1 , . . . , qn }, then {e1 , . . . .en } are β-BS of C. Proof. By definition of β-BS for C, ∑ ∀c ∈ C. c = λ i qi i∈[n]. Multiplying the matrix Q−1 to both sides from left side of the above equation, we obtain ∑ ˆ cˆ = ∀ˆc ∈ C. λi ei i∈[n]. □ Considering pseudo adversary which outputs pseudo loss vector sˆ = Q⊤ s, the following proposition holds.. c 2018 Information Processing Society of Japan ⃝. 4.
(5)
関連したドキュメント
13 proposed a hybrid multiobjective evolutionary algorithm HMOEA that incorporates various heuristics for local exploitation in the evolutionary search and the concept of
VRP is an NP-hard problem [7]; heuristics and evolu- tionary algorithms are used to solve VRP. In this paper, mutation ant colony algorithm is used to solve MRVRP with
5 used an improved version of particle swarm optimization algorithm in order to solve the economic emissions load dispatch problem for a test system of 6 power generators, for
First algorithm of our classifier is classif yCeq, which classifies the simple sin- gularities in characteristic p > 0 with respect to the contact equivalence.. This
We start with some introductory materials to the over-relaxed η- proximal point algorithm based on the notion of maximal η-monotonicity, and recall some investigations on
We present a Sobolev gradient type preconditioning for iterative methods used in solving second order semilinear elliptic systems; the n-tuple of independent Laplacians acts as
Using right instead of left singular vectors for these examples leads to the same number of blocks in the first example, although of different size and, hence, with a different
We presented simple and data-guided lexisearch algorithms that use path representation method for representing a tour for the benchmark asymmetric traveling salesman problem to