Combinatorial bandit prediction with relaxation-based approximation algorithm

全文

(1)Vol.2018-AL-167 No.4 2018/3/8. IPSJ SIG Technical Report. Combinatorial bandit prediction with relaxation-based approximation algorithm Ryohei Nagaura1,a). Kohei Hatano1,2,b). Eiji Takimoto1,c). Abstract: We consider combinatorial online prediction problems with bandit feedbacks. In particular, we take an approach of converting an offline approximation algorithm to an online one. Under the natural assumption that the offline combinatorial problem has a relaxation-based approximation scheme, we show an efficient and robust conversion method.. 1.. Introduction. Combinatorial online decision making problems often arise in various situations such as scheduling, ranking, routing, and so on, where decisions are combinatorial objects, e.g., permutations, spanning trees, and paths. More precisely, the combinatorial online decision making problem we consider is formulated as a repeated game between the algorithm and the adversary. The algorithm is given a fixed decision set C ⊂ Rn+ of decisions. The protocol of the game is defined for each trial t = 1, . . . , T : ( 1 ) The algorithm chooses a decision vector ct ∈ C from the decision set C. ( 2 ) The adversary chooses a loss vector st ∈ Rn+ . ( 3 ) The adversary returns a feedback to the algorithm: • (Full information setting) The feedback is the loss vector st . • (Bandit setting) The feedback is the values of the loss lt = c⊤t st ( 4 ) The algorithm incurs the loss lt . In particular, we focus on the bandit setting of the game, where the algorithm can observe only the loss at each trial and it does not know the loss vector. Further assumptions are that the algorithm is possibly randomized and that the adversary possibly know how the algorithm works, but does not know the results of random variables used by the algorithm. The goal of the algorithm is to minimize the α-regret for sufficiently small α ≥ 1,   T T ∑ ∑ ⊤   c ⊤ st , α-regret = E  ct st  − α min c∈C t=1. t=1. where the expectation is taken with respect to the randomness of the algorithm. There are two approaches to the problem. The first approach 1 2 a) b) c). Kyushu University, Japan RIKEN AIP, Japan [email protected] [email protected] [email protected]. c 2018 Information Processing Society of Japan ⃝. is to design algorithms for specific classes of decision sets such as permutations, k-sets, paths, spanning trees and so on. This approach, however, needs careful elaboration for each specific classes. The other approach is to “convert” an offline approximation algorithm to a corresponding online algorithm. This approach is advantageous to the first one in that it is generic; we can apply many existing offline approximation algorithms without designing specific online algorithms. In this paper, we consider the second approach to the online combinatorial bandit problem. Previous work assumes that for the decision set C, there exists an α-approximation algorithm. More precisely, their assumption is that there exists an algorithm, with input s ∈ Rn+ , outputs c ∈ C such that c⊤ s ≤ α min c∗ ∈C c∗⊤ s . For simplicity, the running time of the approximation algorithm is constant. Kakade et al. [5] proposed a strategy which achieves α-regret = O(αT 2/3 ) with O(poly(n)T ) time at each trial. Garber [4] proposed a more efficient algorithm which runs in time O(poly(n) log T ) with the same regret bound. In this paper, we assume a stronger but natural condition for the offline approximation algorithm. We assume the approximation algorithm is based on a relaxation scheme. That is, it first solves a continuous linear optimization problem over a relaxed set relax(C), where relax(C) ⊃ C is a convex polytope defined by polynomially many linear constraints. Then it obtains a decision c from the relaxed solution satisfying c⊤ s ≤ α min x∗ ∈relax(C) x∗⊤ s. Under this assumption, we obtain a bandit online algorithm which achieves (α + ϵ)-regret O((α + ϵ)T 2/3 ) with O(poly(n, 1/ϵ)) computation time per trial. Our result is obtained by extending the work of Fujita et al. [3] for the full information setting to the bandit setting. The key technique of Fujita et al. is to construct a “metarounding” algorithm from the relaxation-based approximation algorithm. An αmetarounding algorithm for C, when given x ∈ relax(C) as input, outputs c ∈ C such that for any s ∈ Rn+ , c⊤ s ≤ αx⊤ s. In particular, they show a construction method of (α + ϵ)-metarounding algorithm whose running time is O(poly(n, 1/ϵ)). We will use the metarounding algorithm while omitting the details. We summa-. 1.

(2) Vol.2018-AL-167 No.4 2018/3/8. IPSJ SIG Technical Report. rize the previous results and ours in Table 1. Table 1 Kakade et al. [5] Garber [4] ours. Summary of results.. α-regret ((α + ϵ) for ours ) 2 O(αT 3 ) 2 O(αT 3 ) 2 O((α + ϵ)T 3 ). time per trial 2 O(poly(n)T 3 ) O(poly(n) log T ) O(poly(n, 1/ϵ)). Other related work There are several previous results for the conversion approach in the full information setting. The first result is the Follow the Perturbed Leader (FPL) proposed by Kalai and Vempala [6]. Kakade et al. [5] proposed a strategy which achieves α-regret = √ O(α T ) with O(poly(n)T ) time at each trial. Garber [4] game a better algorithm whose running time is O(poly(n) log T ) with the same regret bound. In the bandit setting, Bubeck et al. [1] proposed the OSMD algorithm for convex decision sets. This algorithm achieves optimal regret bounds for many classes of decision sets.. 2.. Preliminaries. In this section, we define remark notations, definitions and problem setting. 2.1 Notations We denote scalars with lower case letters and vectors with bold case letters, matrix with uppercase letters. The ith element of vector x is denoted by x(i). Vectors are column in the standard, so inner product between vector x and y is denoted by x⊤ y with transpose symbol ⊤. Let R be a whole set of real number and R+ be a non-negative set of R. We denote a hypercube whose vertices are in {−a, a}n as B(a),i.e.,B(a) = [−a, a]n . The expectation operator of a distribution D is denoted by ED [·]. Let [n] be a set of all positive integers less than n + 1 (i.e.[n] = {1, 2, . . . , n}). 2.2 Definitions Definition 2.1 (Bregman Divergence). Let F : Γ → R be a strictly convex function. We denote a Bregman divergence DF with respect to F and define below. DF (x, y) = F(x) − F(y) − ∇F(y)⊤ (x − y) Definition 2.2 (Legendre-Fenchel dual). We define the LegendreFenchel dual of a function F as ( ) F ⋆ (y) = min x⊤ y − F(x) . x Note that (∇F ⋆ )−1 = ∇F. Definition 2.3 (β-Barycentric Spanner (β-BS)). A set {q1 , q2 , . . . , qn } ⊂ C is a β-Barycentric Spanner (β-BS) for C if ∑ ∀c ∈ C, ∃λ1 , . . . , λn ∈ [−β, β], c = λi qi i∈[n]. 3.. Algorithm. In this section, we propose a bandit algorithm described in Algorithm 1. As a preprocessing, the algorithm calculates β-BS for. c 2018 Information Processing Society of Japan ⃝. a real decision set C and make a mapping matrix Q(i)( j) = q j (i) for every i, j ∈ [n]. We consider a mapped set Cˆ = Q−1 C because it becomes easy to analyze regret thanks to some propositions described in appendix ˆ a mapped space of P by Q, i.e. P ˆ = Q−1 P. P ˆ We define P also consists of poly(n) linear constraint. Due to proposition in appendix, it is clearly that Cˆ is contained by a n-dimensional hyˆ and B(β) is denoted by B. ˆ percube B(β). Intersection between P ˆ ˆ Note that B has poly(n) linear constraint since P and B(β) also have. When we update online combinatorial prediction problems over pseudo decision space, then algorithm solve the problem over real decision set. Framework of our algorithm is as follows: In each iteration t = 1, . . . , T , at first, the algorithm decides to do either exploration or exploitation with probability γ, 1 − γ respectively . The algorithm choose it uniformly at random and outputs eit in exploration term. Else, algorithm solves α-metarouding problem with input xˆ t and play cˆ distributed by some metarouding algorithm. After outputting decision vector, the algorithm receives incur its loss lt . Next, algorithm estimates a pseudo loss vector. If it explored in first step, estimate loss vector is calculated as s˜ t = lt γn eit . Else s˜ t = 0. This method makes s˜ t unbiased estimator of sˆ t described in lemma3.1 in detail. Then the algorithm updates xˆ t OSMD-like with regarding estimate vector s˜ t as a true ˆ Last, algorithm finds the most clospseudo loss vector sˆ t over P. est point in Bˆ with respect to Bregman divergence. We remark pseudo code of our algorithm below.. Algorithm 1 Proposed algorithm ( 1 ) Parameters: learning ratio η > 0, exploration parameter 0 < γ < 1 ˆ ∩ ( 2 ) Calculate β-BS(C) = {q1 , . . . , qn } and a mapping matrix Q, Bˆ = P B(β). ˆ be any initial point. ( 3 ) Set x1 ∈ P ( 4 ) For t = 1, .. . , T   0 (with probability γ) ( a ) bt =   1 (with probability 1 − γ) ( b ) if bt = 0(exploration) then ( i ) The algorithm chooses it ∈ [n] uniformly at random and play cˆ t = eit . ( ii ) Incur loss lt and calculate s˜ t = γn lt eit . ( c ) else(exploitation) ( i ) Solves α-metarouding problem with input xˆ t . ( ii ) Incur loss lt and calculate s˜ t = 0. ( d ) update xˆ t+ 21 = ∇F ⋆ (∇F( xˆ t ) − η˜st ). ˆ That is xˆ t+1 = arg min ˆ ˆ DF ( xˆ , xˆ t+1/2 ). ( e ) Project xˆ t+1/2 onto B. x∈B. ∑. We use a Euclidean norm squared regularizer F(x) 2 i∈[n] x(i) in OSMD-like update, so. =. DF (x, y) = ∥x − y∥22 In our algorithm, it holds the following lemmas. Lemma 3.1. s˜ t is a unbiased estimater of sˆ t , i.e., E [˜st ] = sˆ t Proof. Since s˜ t is independent of randomness of exploitation, 2.

(3) Vol.2018-AL-167 No.4 2018/3/8. IPSJ SIG Technical Report. E [(ˆct − (α + ϵ) xˆt )⊤ st ] 1∑ ⊤ = γ e st + E MBB [ˆc⊤ st ] − γE MBB [ˆc⊤ st ] − (α + ϵ) xˆ ⊤ st n k∈[n] k 1∑ ⊤ ≤ γ e st + (α + ϵ) xˆ ⊤ st − γE MBB [ˆc⊤ st ] − (α + ϵ) xˆ ⊤ st n k∈[n] k 1∑ ≤ γ D (∵ lt ≤ D) n k∈[n]. E [˜st ] = Ebt ,it [˜st ] = γEit [˜st |bt = 0] + (1 − γ)E [˜st |bt = 1] 1∑ n = γ lt e k + 0 n k∈[n] γ ∑ = ek lt k∈[n]. =. ∑. ek e⊤k sˆ t. = Dγ. k∈[n]. = sˆ t □. Then the first term of (1) is bounded by T Dγ. lemma3.1, the second term of (2) is. Lemma 3.2. Euclidean norm squared of s˜ t.  T   T  T T ∑ ∑ ∑ ⊤  ∑ ⊤  E  xˆ t sˆ t − xˆ ⊤ sˆ t  = E  xˆ t E [˜st ] − xˆ ⊤ E [˜st ]. ] n2 D 2 [ E ∥˜st ∥22 ≤ γ. t=1. t=1. t=1. t=1.  T  T ∑ ∑ ⊤  = E  xˆ t s˜ t − xˆ ⊤ s˜ t . Proof.. t=1. E[∥˜st ∥22 ]. =. E[˜s⊤t s˜ t ]. DF ( xˆ t+1 , xˆ t+1/2 ) + DF ( xˆ , xˆ t+1 ) ≤ DF ( xˆ , xˆ t+1/2 ) Then, η˜s⊤t ( xˆ t − xˆ ) = ( xˆ t − xˆ )⊤ (∇F( xˆ t+1/2 ) − ∇F( xˆ t )) (∵ ∇F( xˆ t+1/2 ) = ∇F( xˆ t ) − η˜st ). □. = DF ( xˆ , xˆ t ) + DF ( xˆ t , xˆ t+1/2 ) − DF ( xˆ , xˆ t+1/2 ) ≤ DF ( xˆ , xˆ t ) + DF ( xˆ t , xˆ t+1/2 ). Main Results. −DF ( xˆ t+1 , xˆ t+1/2 ) − DF ( xˆ , xˆ t+1 ). Theorem 4.1. Using MBB algorithm [3] in line (4)-(c)-(i) in algorithm 3 and setting η = O(βD−1 T −2/3 ), γ = O(nT −1/3 ), our algorithm runs at each trial poly(n, 1/ϵ) time and achieves the expected regret below.. ≤ DF ( xˆ , xˆ t ) + DF ( xˆ t , xˆ t+1/2 ) − DF ( xˆ , xˆ t+1 ) (∵ DF ( xˆ t+1 , xˆ t+1/2 ) ≥ 0) Summing up this inequality for t = 1, . . . , T , we get. 2 3. (α + ϵ) − regret = O((α + ϵ)βnDT ). T ∑. ˆ Proof. For any xˆ ∈ P, E. [∑. T t=1. t=1. From Generalized Pythagorean Theorem (see e.g. [2]),. ( )⊤ ( ) 1∑ n n = γ lt e k lt ek + (1 − γ)0 n k∈[n] γ γ n = lt2 n γ n2 D2 (∵ lt ≤ D) ≤ γ. 4.. Thanks to. η˜s⊤t ( xˆ t − xˆ ) ≤ DF ( xˆ , xˆ 1 ) − DF ( xˆ , xˆ T +1 ) +. DF ( xˆ t , xˆ t+1/2 ). t=1. t=1. ] ∑ c⊤t st − (α + ϵ) Tt=1 x⊤ st. T ∑. ≤ DF ( xˆ , xˆ 1 ) +. T ∑. DF ( xˆ t , xˆ t+1/2 ) (∵ DF ( xˆ , xˆ T +1 ) ≥ 0). t=1. Here,.  T  T ∑ ∑ ⊤   = E  cˆ t sˆ t  − (α + ϵ) xˆ ⊤ sˆ t t=1. DF ( xˆ t , xˆ t+1/2 ) = ∥ xˆ t − xˆ t+1/2 ∥22. t=1. = ∥η˜st ∥22.   T T T T ∑ ∑ ∑  ∑ ⊤ ⊤  ⊤ ⊤  xˆ sˆ t  xˆ t sˆ t − (α + ϵ) xˆ t sˆ t + (α + ϵ) = E  cˆ t sˆ t − (α + ϵ) t=1. t=1. t=1. t=1. t=1.  T  T ∑  ∑ ⊤ ⊤  = E  cˆ t sˆ t − (α + ϵ) xˆ t sˆ t . t=1. t=1. Since E[ˆct ] = γ n. k∈[n] ek. (∵ lemma3.2). DF ( xˆ , xˆ 1 ) ≤ 4nβ2 ˆ we have (α + ϵ)−regret upper bounded Since Cˆ ⊂ P,. t=1.   T T ∑  ∑ ⊤ ⊤   ˆ ˆ ˆ ˆ x st  ≤ T Dγ + (α + ϵ)E  xt st − 1 ∑. n2 D2 γ. (1)And,.   T T ∑ ∑   ⊤ ⊤   xˆ t sˆ t − (α + ϵ) xˆ sˆ t  +E (α + ϵ) t=1. ≤ η2. t=1. + (1 − γ)E MBB [ˆc], so we obtain. c 2018 Information Processing Society of Japan ⃝. (2). T Dγ + (α + ϵ). 4nβ2 T ηn2 D2 + (α + ϵ) η γ. By tuning parameters properly, we complete proof.. □ 3.

(4) Vol.2018-AL-167 No.4 2018/3/8. IPSJ SIG Technical Report. 5.. Concluding Remarks. In this paper, we propose a bandit algorithm using an approximation algorithm based on LP relaxation. Our algorithm is superior to previous work in the two points. One is that our algorithm’s computation time at each iteration doesn’t depend on T . The other is that the player does not need to know the approximation ratio α. One of the future works is to improve the regret √ bound to (α + ϵ)-regret = O((α + ϵ) T ).. Proposition 3. A loss between decision vector c and loss vector s equals to a loss between corresponding pseudo decision and loss vectors. Proof. cˆ ⊤ sˆ = (Q−1 c)⊤ Q⊤ s = cQ−⊤ Q⊤ s = c⊤ s □. References [1]. [2] [3]. [4]. [5] [6]. Sébastien Bubeck, Nicolò Cesa-Bianchi, and Sham M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pages 41.1–41.14, 2012. Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. Takahiro Fujita, Kohei Hatano, and Eiji Takimoto. Combinatorial online prediction via metarounding. In Algorithmic Learning Theory - 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Proceedings, pages 68–82, 2013. Dan Garber. Efficient online linear optimization with approximation algorithms. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 627–635, 2017. Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. SIAM J. Comput., 39(3):1088– 1106, 2009. Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291–307, 2005.. Appendix We define a mapping matrix Q ∈ Cn which have ith β−BS qi in its ith column i.e. Q(i)( j) = q j (i). We consider a mapped concept Cˆ of C, which is Cˆ = {Qc|c ∈ C}. Then the following propositions holds. Proposition 1. Cˆ includes standard basis e1 , . . . , en . Proof. By definition of Q, ∀i ∈ [n].qi = Qei . Q is inverticle for almost natural discrete concepts, so ∀i ∈ [n]. ei = Q−1 qi ∈ Cˆ □ Proposition 2. Suppose that a discrete set C has β-BS ˆ {q1 , . . . , qn }, then {e1 , . . . .en } are β-BS of C. Proof. By definition of β-BS for C, ∑ ∀c ∈ C. c = λ i qi i∈[n]. Multiplying the matrix Q−1 to both sides from left side of the above equation, we obtain ∑ ˆ cˆ = ∀ˆc ∈ C. λi ei i∈[n]. □ Considering pseudo adversary which outputs pseudo loss vector sˆ = Q⊤ s, the following proposition holds.. c 2018 Information Processing Society of Japan ⃝. 4.

(5)