Hierarchical Policy Gradient Reinforcement Learning: Two-layer Model

(1)

Hierarchical Policy Gradient Reinforcement Learning:

Two-layer Model

Harukazu Igarashi* and Seiji Ishihara**

Hierarchical reinforcement learning is a framework that accelerates learning and represents abstract knowledge about the states of environments and the actions of an agent. We propose a two-layer hierarchical model with two characteristics. First, general mapping from lower-level states to upper-level states allows the hierarchical model to represent such abstract knowledge as temporal abstraction, coarse graining, and state abstraction. Using a policy gradient method, we can derive learning rules for parameters included in policy functions that determine the primitive and abstract actions of an agent. The results show that each policy parameter is updated at the end of each episode and that the update quantity is proportional to the product of the reward and the sum of the characteristic eligibilities. Second, the policy gradient learning used in our approach does not require Markovian properties on the environmental dynamics, the expected rewards, or the agent policies. This second characteristic is different from most studies on hierarchical reinforcement learning, where Markov Decision Processes are assumed for using value-based reinforcement learning as Q-learning. The second characteristic of our model will provide more flexibility for representing abstract knowledge at multiple levels.

Keywords: machine learning, artificial intelligence, reinforcement learning, policy gradient

1. Introduction

Much work has been done on hierarchical reinforcement learning [1]-[5]. The following are the two key points in hierarchical reinforcement learning. First, why is a hierarchical structure necessary for reinforcement learning? Second, what should a hierarchical structure have for reinforcement learning? Hierarchical reinforcement learning is a framework that accelerates learning and represents abstract knowledge about the states of environments and the actions of an agent. In this paper, we briefly consider these two points and propose a two-layer hierarchical model for reinforcement learning.

Our proposed model has two characteristics. First, a region of states in the lower-level layer is mapped to a state in the upper-level layer by a general mapping function. Sutton et al. proposed options formalism as a framework for temporal abstraction in reinforcement learning [3], where option is a kind of macro action. However, they did not propose any solutions to other issues such as state abstraction or hierarchical execution [3]. Dietterich proposed a framework called MAXQ Value Function Decomposition, where the task to be solved is decomposed into subtasks. The MAXQ framework uses a hierarchical architecture of tasks. On the

other hand, we consider a simpler two-layer model with a hierarchical architecture of state spaces. A set of states in the lower layer is mapped to a state in the upper layer by a general mapping. The generality of mapping allows the hierarchical model to represent such abstract knowledge as temporal abstraction, coarse graining, task decomposition, and state abstraction.

Second, we use a policy gradient method [6]-[8] and derive learning rules for parameters included in policy functions that determine an agent’s primitive and abstract actions. The policy gradient learning used in our approach does not require Markovian properties on the environmental dynamics, the expected rewards, or the agent policies. This second characteristic is different from most studies on hierarchical reinforcement learning [2]-[5], where Markov Decision Processes (MDPs) are assumed for such value-based reinforcement learning as Q-learning. The second characteristic of our model provides greater flexibility for representing abstract knowledge at multiple levels.

2. Hierarchical Reinforcement Learning

Why is a hierarchical structure of states or actions necessary for reinforcement learning? The following three reasons are

(2)

considered. First, the coarse graining of the state space of environments and defining macro actions increase the grain size for information processing and reduce the computation time for learning. The dimension of the state space becomes smaller as the layer becomes higher. No matter which reinforcement learning approach we take, decreasing the space dimension saves learning time. Options, an example of macro actions, are proposed for temporal abstraction to link a state with another state obtained by executing a series of actions produced by the option’s policy. The MAXQ framework is also classified to an approach of using a macro action, because a task is decomposed into subtasks that are accomplished by executing a series of actions.

The second reason for using a hierarchical structure is to analyze and understand the properties of environmental dynamics and state transitions caused by an agent policy from a strategic standpoint. The analysis results give efficient information on the validity of the current agent’s policy from a strategic standpoint. Moreover, reinforcement learning in higher layers contributes to finding the optimal path to the goal with less computation time than that spent on learning in lower layers. The path to the goal obtained by learning at high layers helps us understand the strategical meaning of the path obtained by learning on the primitive state-action space. That gives not only a series of primitive state transitions but also a structural explanation of the policy achieved by an agent by learning from an abstract and global standpoint.

The third reason is that modularizing a policy function makes it possible to represent a complex mapping from a state to an action [9] [10] and to reuse the knowledge and experiences obtained by reinforcement learning. Our hierarchical reinforcement learning is proposed from the above three reasons. We consider a hierarchical structure of the state of an agent or its environments. That structure is simple but basic and essential for hierarchical reinforcement learning.

3. Multi-layer Model for Hierarchical

Reinforcement Learning

3.1 Model Structure: Two-layer model

We propose a two-layer model for hierarchical reinforcement learning that can be easily extended to a multi-

Fig.1. Two-layer models for hierarchical reinforcement learning

Fig.2. Structure of two-layer model

layer model with any number of layers. The lower-level layer is a primitive state space where the original learning problem is defined. The upper-level layer is a coarse-grained state space or an abstract state space. A region of states in the lower layer is mapped to a state in the upper layer by a general mapping function. We do not require Markovian properties on the environmental dynamics, the expected rewards, or the agent policies of both layers. Fig.1 shows a schematic diagram of the two-layer model and mapping between two layers. Fig.1a is an example of coarse graining, and Fig.1b is an example of mapping an original state space to an abstract space representing the characteristics of states in the lower

High-resolution space

Low-resolution space Characteristics space

State space

(a)

(b)

layer.

Figure 2 shows the structure of our two-layer model that consists of upper- and lower-level layers. s



S

, action aA, and policy



π are defined in the lower layer. The corresponding

quantities defined in the upper layer are represented by dashed symbols such as s’



S

, a’



A

, and π’. State s is mapped to state s’ as s’=sp(s). We call mapping function sp(s) a state filter. In Fig. 2, states s1, s2, and s3 are mapped to state s’ ,

while s4 is mapped to another state s’+1. Symbols t and 

denote discrete time steps defined in the lower and upper layers, respectively. The relation between t and  is described in Section 3.3.

3.2 State-Transition Structure of the Two-layer

Model

The state-transition algorithm of the two-layer model is defined as follows:

1. Map state st to state s’ by state filter sp(s) as s’ =sp(st).

2. The system stops if st or s’ satisfies termination conditions.

3. Select action a’ by policy ’(a’|s’, h’) in the upper layer.

4. Set sub goal g



g(s’, a’, h’)



S

.

5. Select an action by policy (a t |s t , h t , g) in the lower

layer to get to sub goal g.

6. The lower-layer system transits from st to st+1 by

state-transition probability function P(st+1|ht+1).

7. If s’ =sp(st+1), then go to step 5 with t=t+1.

8. Return to step 1 with =+1.

Symbols ht and h’ represent the histories of an agent’s

states and actions. For example, history ht in the lower layer is

defined by



 









0, 0 , ,1 1 ,..., 1, 1



t t t

h  s a s a s a  . ···(1)

Sub goal g



g(s’, a’, h’)



S

in step 4 means an intention

of an agent on the state or the action at the strategy level. That is called abstract action or intention [2].

As the algorithm shows, a state in the lower layer is mapped to a state in the upper layer. A strategic action based on the abstract state is selected in the upper layer to give a sub goal to an agent. The agent selects a primitive action in the lower layer to get to the sub goal. It is assumed that state filter sp(s)

Fig.3. State-Transition Structure of Two-layer Model and sub goal function g(s’, a’, h’) are given by a system

designer. Fig. 3 shows the state transition algorithm from steps 1 to 8.

3.3 Representation of Time and Episode

Time goes forward by one step in the upper layer when an agent moves to another abstract state. Time t in the lower layer is mapped to timein the upper layer by time filter tm(t) as 

=tm(t). We write inverse map tm-1() of the time filter as t() for simplicity.

In this paper, we only deal with episodic learning that increases the expectation of a reward function that evaluates the states and actions taken by an agent in an episode. Episode

’ in the upper layer is defined by



 





   



 



s a0, 0 , ,s a1 1 ,..., sL 1,aL 1 ,sL



      __  __  _ , ··· (2)

where L(’) denotes the length of episode ’. Episode  in the lower layer is defined by a time-series of partial episodes as

 



0, ,...,1 L  1



    _{ } ··· (3) where is a partial episode in the lower layer defined by

   







   





       



   



st ,at , st 1,at 1 ,..., st L  1,at L  1 ,st L 



          

  _ _ _ _ _ _ _ .(4)

In (4), it is assumed that the last state of partial episode

satisfies the termination conditions of . Note that

t()+L()=t(+1). Fig. 4 shows the relations of the times and

episodes defined in the two layers.

st st+1 s’ s’1 Upper layer Lower layer ’：a’ ：at _s  

 

sp t s   s ①_② ③



, ,



g s a h    ④ t=t() t=t(1) ⑧ ⑤ ⑥ ⑦

(3)

considered. First, the coarse graining of the state space of environments and defining macro actions increase the grain size for information processing and reduce the computation time for learning. The dimension of the state space becomes smaller as the layer becomes higher. No matter which reinforcement learning approach we take, decreasing the space dimension saves learning time. Options, an example of macro actions, are proposed for temporal abstraction to link a state with another state obtained by executing a series of actions produced by the option’s policy. The MAXQ framework is also classified to an approach of using a macro action, because a task is decomposed into subtasks that are accomplished by executing a series of actions.

3. Multi-layer Model for Hierarchical

Reinforcement Learning

3.1 Model Structure: Two-layer model

State space

(a)

(b)

layer.



S





S

, a’



A

3.2 State-Transition Structure of the Two-layer

Model



g(s’, a’, h’)



S

.

defined by



 









0, 0 , ,1 1 ,..., 1, 1



t t t

h  s a s a s a  . ···(1)

Sub goal g



g(s’, a’, h’)



S

3.3 Representation of Time and Episode



 





   



 



s a0, 0 , ,s a1 1 ,..., sL 1,aL 1 ,sL



      __  __  _ , ··· (2)

 



0, ,...,1 L  1



   







   





       



   





          

  _ _ _ _ _ _ _ .(4)

 

sp t s   s ①_② ③



, ,



g s a h    ④ t=t() t=t(1) ⑧ ⑤ ⑥ ⑦ considered. First, the coarse graining of the state space of

environments and defining macro actions increase the grain size for information processing and reduce the computation time for learning. The dimension of the state space becomes smaller as the layer becomes higher. No matter which reinforcement learning approach we take, decreasing the space dimension saves learning time. Options, an example of macro actions, are proposed for temporal abstraction to link a state with another state obtained by executing a series of actions produced by the option’s policy. The MAXQ framework is also classified to an approach of using a macro action, because a task is decomposed into subtasks that are accomplished by executing a series of actions.

3. Multi-layer Model for Hierarchical

Reinforcement Learning

3.1 Model Structure: Two-layer model

State space

(a)

(b)

layer.



S





S

, a’



A

3.2 State-Transition Structure of the Two-layer

Model



g(s’, a’, h’)



S

.

defined by



 









0, 0 , ,1 1 ,..., 1, 1



t t t

h  s a s a s a  . ···(1)

Sub goal g



g(s’, a’, h’)



S

3.3 Representation of Time and Episode



 





   



 



s a0, 0 , ,s a1 1 ,..., sL 1,aL 1 ,sL



      __  __  _ , ··· (2)

 



0, ,...,1 L  1



   







   





       



   





          

  _ _ _ _ _ _ _ .(4)

 

sp t s   s ①_② ③



, ,



g s a h    ④ t=t() t=t(1) ⑧ ⑤ ⑥ ⑦

(4)

Fig.4. Episodes and times defined in two layers

3.4 Joint Distribution Function of Episodes

Generated in the Two Layers

Assume that the system generates episode ’ in the upper

layer and episode  in the lower layer, as shown in (2) and (3). Joint distribution function P,’₍__,__{’) of}__and__{’ can be}

written as





 





 







 



1 , 0 0 0 0 , L , , _t P  P s P s s  a s h P g s                 



       







s 1 sp st L



      , ···(5)

where P(s0) is a distribution function of initial state s0 in the

lower layer and (A) is a function that takes 1 if proposition A is true or else it takes 0. P₍___|g__,s_t()_{) denotes the probability}

with which partial episode  is generated in the lower layer.

It depends on initial state st() and sub goal g of partial episode

 as  







 



     1 1 1 , t t L t t, ,t t t t t P g s   a s h g P s h            



. ···(6)

Eqs. (5) and (6) are used in the next section to derive the learning rules of parameters in policies ’(a’|s’,h’) and

(at|st,ht,g).

4. Hierarchical Reinforcement Learning

4.1 On-line and Off-line Learning

Joint distribution function P,’₍__,__{’) includes policy}

’(a’|s’,h’) for selecting an abstract action in the upper

layer and policy(at|st,ht,g) for selecting a primitive action in

the lower layer. These two policies include parameters ’ and

, respectively, that can be adjusted by a policy gradient method [7][8].

Learning in the two-layer model (Fig. 3) is divided into two types. The first is on-line learning, which uses the real environments of an agent. Parameters ’ and are learned through the real experiences of an agent by running the whole system. The second type of learning is off-line learning, which completely separates the two layers from each other. If system users want to learn policy parameters ’ in the upper layer,

they only run the upper layer. However, the environmental dynamics in the upper layer must be given or assumed by someone. That is a tabletop exercise or a virtual simulation for learning strategic policy ’(a’|s’,h’) using only the upper

layer. On the other hand, if system users want to learn policy parameters  in the lower layer, they only use the lower layer. They can make an agent learn policy(at|st,ht,g) by setting

the initial states and the sub goals that the agent must reach. In this paper, we derive learning rules of on-line learning.

4.2 Learning Rules in a Policy Gradient Method

We take a policy gradient approach for on-line learning in the two-layer model. Williams proposed a policy gradient approach [6] whose learning rules are shown [7] as





_

_

 1

_{ }

0 , _, L t Y _Y  _{e t} 

 





 



       _  _  



, ··· (7)

where function Y(,) can include policy parameters , states, actions, and rewards. The value of Y(,) is determined by observing these variables in episode . This rule is given because the expectation E[Y(,)] of function Y(,) can be differentiated with respect to policy parameters  as













 1

 

0 , , , L t Y E Y E Y  e t

 



     _ __ _      _ 



_. ·· (8)

Eq. (8) does not require Markovian properties on the environmental dynamics, the expected rewards, or the agent policies [7]. These characteristics are different from most studies on hierarchical reinforcement learning, where Markov Decision Processes are assumed for using value-based reinforcement learning as Q-learning. These characteristics give more flexibility for representing abstract knowledge at

0 1 2  1 L

 

  0 t t 1 t

 

2 t

 

 t



1



t L



 





 

L Upper layer Lower layer   1 L 

 



1



t L 

 

t  L   time   1 L  _{ }   1  0 



 time t multiple levels.

4.3

Gradients of Joint Distribution Functions of Episodes



and



’

We can derive the following theorem from (5): [Theorem 1]

Gradients of Joint Distribution Functions P,’₍__,__{’) with}

respect to policy parameters ’ and  are shown as





  1

 





, , 0

,

L

,

P

  

e

P

   

 



 



     







_

 







_



_

, ···(9) and









       





1 1 , , 0 , L t L ; , t t P     e t g P     

 



          _ _ _   



__



__ . (10)

e’’() and e(t;g) are characteristic eligibilities [6] extended

to the case of our two-layer model and are defined by

 

ln



,



e



_



a s h            , ··· (11) and



;



ln



t t, ,t



e t g 

_



a s h g    . ··· (12) [End] Proofs of (9) and (10) in Theorem 1 are given in Appendix A.

4.4 Learning Rules of Policy Parameters

Let X(,’) be a function to be maximized such as the

expectation of rewards given to an agent in episodes and ’.

It is also assumed that X(,’) does not depend on policy

parameters  and ’. A special case where X(,’)=r() means that the objective of reinforcement learning is maximizing the expectation of rewards per episode in the lower layer. To evaluate only the trajectories in the upper layer from strategic viewpoints, set X(,’)=r(’).

We can easily derive the policy gradients of the expectation of X(,’) from (9) and (10) in Theorem 1 as





,



 



, , , , E X P  X                 _ ___  _ _ _     



 ·· (13)

 



 



1 , 0 , , L e P X       



 

            _ _   

 





 1

 

, 0 , L E  X  e               _ _ 



, ··· (14) and





,



 



, , , , E X P  X    

 



    _ ___  _ _ _   



 · (15)





       



 



1 1 , 0 ; , , L t L t t e t g P X                                   

 



(16)









       1 1 , 0 , L t L ; t t E  X    e t g   

 

                   



. ··· (17)

The learning rules of on-line learning are obtained by (14) and (17) as





  1

 

0

,

L

X



e

 





 

  







   



, ··· (18) and









        1 1 0

,

L t L

;

t t

X

  

e t g

   





 

    







   















. ··· (19)

These rules mean that the amount of updating policy parameters at the end of episodes  and ’ is proportional to

the product of a function to be maximized and the sum of characteristic eligibilities in episodes  and ’. Our two-layer

model can be extended to a general multi-layer model (See Appendix B). On-line learning rules as (18) and (19) can be derived in the same way as described in this section.

5. Conclusion

We proposed a two-layer hierarchical model for reinforcement learning that has two characteristics. First, a region of states in the lower layer is mapped to a state in the upper layer by a general mapping function. The generality of mapping allows the hierarchical model to represent such abstract knowledge as temporal abstraction, coarse graining, task decomposition, and state abstraction. Second, we used a policy gradient method and derived on-line learning rules for the parameters included in the policy functions that determine the abstract and primitive actions of an agent. The policy gradient learning used in our approach does not require

(5)

Fig.4. Episodes and times defined in two layers

3.4 Joint Distribution Function of Episodes

Generated in the Two Layers

written as





 





 







 



1 , 0 0 0 0 , L , , _t P  P s P s s  a s h P g s                 



       









      , ···(5)

 as  







 





. ···(6)

(at|st,ht,g).

4. Hierarchical Reinforcement Learning

4.1 On-line and Off-line Learning

4.2 Learning Rules in a Policy Gradient Method





_

_

  1

_{ }

0 , _, L t Y _Y  _{e t} 

 





 



       _  _  



, ··· (7)













 1

 

0 , , , L t Y E Y E Y  e t

 



     _ __ _      _ 



_. ·· (8)

0 1 2  1 L

 

  0 t t 1 t

 

2 t

 

 t



1



t L



 





 

L  Upper layer Lower layer   1 L 

 



1



t L 

 

t  L   time   1 L  _{ }   1  0 



4.3 

and



’





 1

 





, , 0

,

L

,

P

  

e

P

   

 



 



     







_

 







_



_

, ···(9) and









       





1 1 , , 0 , L t L ; , t t P     e t g P     

 



          _ _ _   



__



__ . (10)

 

ln



,



e



_



a s h            , ··· (11) and



;



ln



t t, ,t



e t g 

_



4.4 Learning Rules of Policy Parameters





,



 



, , , , E X P  X                 _ ___  _ _ _     



 ·· (13)

 



 



1 , 0 , , L e P X       



 

            _ _   

 





 1

 

, 0 , L E  X  e               _ _ 



, ··· (14) and





,



 



, , , , E X P  X    

 



    _ ___  _ _ _   



 · (15)





       



 



1 1 , 0 ; , , L t L t t e t g P X                                   

 



(16)









       1 1 , 0 , L t L ; t t E  X    e t g   

 

                   



. ··· (17)





  1

 

0

,

L

X



e

 





 

  







   



, ··· (18) and









        1 1 0

,

L t L

;

t t

X

  

e t g

   





 

    







   















. ··· (19)

5. Conclusion

We proposed a two-layer hierarchical model for reinforcement learning that has two characteristics. First, a region of states in the lower layer is mapped to a state in the upper layer by a general mapping function. The generality of mapping allows the hierarchical model to represent such abstract knowledge as temporal abstraction, coarse graining, task decomposition, and state abstraction. Second, we used a policy gradient method and derived on-line learning rules for the parameters included in the policy functions that determine the abstract and primitive actions of an agent. The policy gradient learning used in our approach does not require Fig.4. Episodes and times defined in two layers

3.4 Joint Distribution Function of Episodes

Generated in the Two Layers

written as





 





 







 



1 , 0 0 0 0 , L , , _t P  P s P s s  a s h P g s                 



       









      , ···(5)

 as  







 





. ···(6)

(at|st,ht,g).

4. Hierarchical Reinforcement Learning

4.1 On-line and Off-line Learning

4.2 Learning Rules in a Policy Gradient Method





_

_

  1

_{ }

0 , _, L t Y _Y  _{e t} 

 





 



       _  _  



, ··· (7)













 1

 

0 , , , L t Y E Y E Y  e t

 



     _ __ _      _ 



_. ·· (8)

0 1 2  1 L

 

  0 t t 1 t

 

2 t

 

 t



1



t L



 





 

L  Upper layer Lower layer   1 L 

 



1



t L 

 

t  L   time   1 L  _{ }   1  0 



4.3 

and



’





 1

 





, , 0

,

L

,

P

  

e

P

   

 



 



     







_

 







_



_

, ···(9) and









       





1 1 , , 0 , L t L ; , t t P     e t g P     

 



          _ _ _   



__



__ . (10)

 

ln



,



e



_



a s h            , ··· (11) and



;



ln



t t, ,t



e t g 

_



4.4 Learning Rules of Policy Parameters





,



 



, , , , E X P  X                 _ ___  _ _ _     



 ·· (13)

 



 



1 , 0 , , L e P X       



 

            _ _   

 





 1

 

, 0 , L E  X  e               _ _ 



, ··· (14) and





,



 



, , , , E X P  X    

 



    _ ___  _ _ _   



 · (15)





       



 



1 1 , 0 ; , , L t L t t e t g P X                                   

 



(16)









       1 1 , 0 , L t L ; t t E  X    e t g   

 

                   



. ··· (17)





  1

 

0

,

L

X



e

 





 

  







   



, ··· (18) and









        1 1 0

,

L t L

;

t t

X

  

e t g

   





 

    







   















. ··· (19)

5. Conclusion

We proposed a two-layer hierarchical model for reinforcement learning that has two characteristics. First, a region of states in the lower layer is mapped to a state in the upper layer by a general mapping function. The generality of mapping allows the hierarchical model to represent such abstract knowledge as temporal abstraction, coarse graining, task decomposition, and state abstraction. Second, we used a policy gradient method and derived on-line learning rules for the parameters included in the policy functions that determine the abstract and primitive actions of an agent. The policy gradient learning used in our approach does not require