The ability to reuse or transfer knowledge from one task to another in lifelong learning problems, such as Minecraft, is one of the major challenges faced in AI. Reusing knowledge across tasks is crucial to solving tasks efficiently with lower sample complexity. We provide a Reinforcement Learning agent with the ability to transfer knowledge by learning reusable skills, a type of temporally extended action (also known as options (Sutton et al. 1999)). The agent learns reusable skills using Deep Q-Networks (Mnih et al. 2015) to solve tasks in Minecraft, a popular video game which is an unsolved, high-dimensional lifelong learning problem. These reusable skills, which we refer to as Deep Skill Networks (DSNs), are then incorporated into our novel Hierarchical Deep Reinforcement Learning Network (H-DRLN) architecture. The H-DRLN is a hierarchical version of Deep Q-Networks and learns to efficiently solve tasks by reusing knowledge from previously learned DSNs. The H-DRLN exhibits superior performance and lower learning sample complexity (by taking advantage of temporal extension) compared to the regular Deep Q-Network (Mnih et al. 2015) in sub-domains of Minecraft. We also show the potential to transfer knowledge between related Minecraft tasks without any additional learning.
1 Introduction
Lifelong learning is defined as the ability to accumulate knowledge across multiple tasks and then reuse or transfer this knowledge in order to solve subsequent tasks Eaton and Ruvolo (2013). This is one of the fundamental learning problems in AI Thrun and Mitchell (1995); Eaton and Ruvolo (2013). There are different ways of performing knowledge transfer across tasks. For example, Ammar et al. (2014) and Ammar et al. (2015) transfer knowledge via a latent basis, whereas Brunskill and Li (2014) perform batch optimization across all encountered tasks.
Lifelong learning in real-world domains suffers from the curse of dimensionality. That is, as the state and action spaces increase, it becomes more and more difficult to model and solve new tasks as they are encountered. A challenging, high-dimensional domain that incorporates many of the elements found in lifelong learning is Minecraft (https://minecraft.net/). Minecraft is a popular video game whose goal is to build structures, travel on adventures, hunt for food and avoid zombies. An example screenshot from the game is seen in Figure 1. Minecraft is an open research problem as it is impossible to solve the entire game using a single AI technique. Instead, the solution to Minecraft may lie in solving sub-problems, using a divide-and-conquer approach, and then providing a synergy between the various solutions. Once an agent learns to solve a sub-problem, it has acquired a skill that can then be reused when a similar sub-problem is subsequently encountered.
Many of the tasks that are encountered by an agent in a lifelong learning setting can be naturally decomposed into skill hierarchies Stone et al. (2000, 2005); Bai et al. (2015). In Minecraft, for example, consider building a wooden house as seen in Figure 1. This task can be decomposed into sub-tasks (a.k.a. skills) such as chopping trees, sanding the wood, cutting the wood into boards and finally nailing the boards together. Here, the knowledge gained from chopping trees can also be partially reused when cutting the wood into boards. In addition, if the agent receives a new task to build a small city, then the agent can reuse the skills it acquired during the ‘building a house’ task.
In a lifelong learning setting such as Minecraft, learning skills, and learning when to reuse them, is key to increasing exploration, efficiently solving tasks and advancing the capabilities of the Minecraft agent. As mentioned previously, Minecraft and other lifelong learning problems suffer from the curse of dimensionality. Therefore, as the dimensionality of the problem increases, it becomes increasingly non-trivial to learn reasonable skills as well as when to reuse them.
Reinforcement Learning (RL) provides a generalized approach to skill learning through the options framework Sutton et al. (1999). Options are Temporally Extended Actions (TEAs) and are also referred to as skills da Silva et al. (2012) and macro-actions Hauskrecht et al. (1998). Options have been shown both theoretically Precup and Sutton (1997); Sutton et al. (1999) and experimentally Mann and Mannor (2013) to speed up the convergence rate of RL algorithms. From here on, we refer to options as skills.
Recent work in RL has provided insights into learning reusable skills Mankowitz et al. (2016a, b), but this has been limited to low-dimensional problems. In high-dimensional lifelong learning settings (e.g., Minecraft), learning from visual experiences provides a potential solution to learning reusable skills. With the emergence of Deep Reinforcement Learning, specifically Deep Q-Networks (DQNs) Mnih et al. (2015), RL agents are now equipped with a powerful non-linear function approximator that can learn rich and complex policies. Using these networks, the agent learns policies (or skills) from raw image pixels, requiring less domain-specific knowledge to solve complicated tasks (e.g., Atari video games Mnih et al. (2015)). While different variations of the DQN algorithm exist Van Hasselt et al. (2015); Schaul et al. (2015); Wang et al. (2015); Bellemare et al. (2015), we refer to the vanilla DQN Mnih et al. (2015) unless otherwise stated.
In our paper, we focus on learning reusable RL skills using Deep Q-Networks Mnih et al. (2015) by solving sub-problems in Minecraft. These reusable skills, which we refer to as Deep Skill Networks (DSNs), are then incorporated into our novel Hierarchical Deep Reinforcement Learning (RL) Network (H-DRLN) architecture. The H-DRLN, which is a hierarchical version of the DQN, learns to solve more complicated tasks by reusing knowledge from the pre-learned DSNs. By taking advantage of temporal extension, the H-DRLN learns to solve tasks with lower sample complexity and superior performance compared to vanilla DQNs.
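To make the idea of reusing DSNs inside a hierarchical controller concrete, the sketch below shows one plausible way such a controller could treat each pre-trained skill as an additional, temporally extended "action" alongside the primitive actions. It is an illustrative sketch only, not the H-DRLN implementation; all names (DSN, HDRLNController, the env.step interface) are hypothetical.

```python
import random

class DSN:
    """A pre-trained Deep Skill Network: runs its intra-skill policy until termination."""
    def __init__(self, policy_fn, termination_fn, gamma=0.99):
        self.policy_fn = policy_fn            # state -> primitive action
        self.termination_fn = termination_fn  # state -> True when the skill should end
        self.gamma = gamma

    def run(self, env, state):
        total, discount, done = 0.0, 1.0, False
        while not done and not self.termination_fn(state):
            state, reward, done = env.step(self.policy_fn(state))
            total += discount * reward
            discount *= self.gamma
        return state, total, done

class HDRLNController:
    """Chooses among primitive actions and DSNs (random placeholder policy)."""
    def __init__(self, primitive_actions, skills):
        self.primitive_actions = primitive_actions
        self.skills = skills  # list of DSN objects

    def act(self, env, state):
        n = len(self.primitive_actions)
        choice = random.randrange(n + len(self.skills))
        if choice < n:                                   # primitive, single-step action
            return env.step(self.primitive_actions[choice])
        # temporally extended step: execute the chosen skill until it terminates
        return self.skills[choice - n].run(env, state)
```

In the architecture described in this paper, the choice among primitives and skills is made by a learned hierarchical Q-network rather than the random placeholder used in this sketch.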
Contributions: (1) A novel Hierarchical Deep Reinforcement Learning Network (H-DRLN) architecture. (2) We show the potential to learn reusable Deep Skill Networks (DSNs) and perform knowledge transfer of the learned DSNs to a new task to obtain an optimal solution. (3) Empirical results for learning an H-DRLN in a sub-domain of Minecraft which outperforms the vanilla DQN. (4) We verify empirically the improved convergence guarantees for utilizing reusable DSNs (a.k.a. options) within the H-DRLN, compared to the vanilla DQN. (5) The potential to transfer knowledge between related tasks without any additional learning.
2 Background
Reinforcement Learning: The goal of an RL agent is to maximize its expected return by learning a policy $\pi : S \rightarrow \Delta_A$, which is a mapping from states $s \in S$ to a probability distribution over the action space $A$. At time $t$ the agent observes a state $s_t \in S$, selects an action $a_t \in A$, and receives a bounded reward $r_t \in [0, R_{\max}]$, where $R_{\max}$ is the maximum attainable reward and $\gamma \in [0,1]$ is the discount factor. Following the agent's action choice, it transitions to the next state $s_{t+1} \in S$. We consider infinite-horizon problems where the cumulative return at time $t$ is given by $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$. The action-value function $Q^{\pi}(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$ represents the expected return after observing state $s$ and taking an action $a$ under a policy $\pi$. The optimal action-value function obeys a fundamental recursion known as the Bellman equation:
$$Q^*(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \right]$$
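As a concrete illustration of this recursion (not taken from the paper), the short sketch below runs tabular value iteration on a made-up two-state, two-action MDP; repeatedly applying the Bellman backup converges to the optimal action-value function $Q^*$ of this toy problem.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                  # R[s, a] expected immediate rewards
              [0.0, 2.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(200):
    # Bellman backup: Q*(s, a) = E[r + gamma * max_a' Q*(s', a')]
    Q = R + gamma * (P @ Q.max(axis=1))

print(np.round(Q, 3))  # converged optimal action-value function of the toy MDP
```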
Deep Q Networks: The DQN algorithm Mnih et al. (2015) approximates the optimal Q-function with a Convolutional Neural Network (CNN) Krizhevsky et al. (2012), by optimizing the network weights such that the expected Temporal Difference (TD) error of the optimal Bellman equation (Equation 1) is minimized:
$$\mathbb{E}_{s_t, a_t, r_t, s_{t+1}} \left\| Q_{\theta}(s_t, a_t) - y_t \right\|_2^2 \, , \qquad (1)$$
where
$$y_t = \begin{cases} r_t & \text{if } s_{t+1} \text{ is terminal} \\ r_t + \gamma \max_{a'} Q_{\theta_{target}}(s_{t+1}, a') & \text{otherwise.} \end{cases}$$
Notice that this is an off-line learning algorithm, meaning that the tuples $(s_t, a_t, r_t, s_{t+1}, \gamma)$ are collected from the agent's experience and are stored in the Experience Replay (ER) Lin (1993). The ER is a buffer that stores the agent's experiences at each time-step $t$, for the purpose of ultimately training the DQN parameters to minimize the loss function. When we apply minibatch training updates, we sample tuples of experience at random from the pool of stored samples in the ER. The DQN maintains two separate Q-networks: the current Q-network with parameters $\theta$, and the target Q-network with parameters $\theta_{target}$. The parameters $\theta_{target}$ are set to $\theta$ every fixed number of iterations. In order to capture the game dynamics, the DQN represents the state by a sequence of image frames.
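The sketch below (a minimal, hypothetical PyTorch-style implementation, not the one used in the paper) ties these pieces together: transitions are sampled at random from the ER buffer, the target $y_t$ is built with the frozen target network, and the squared TD error of Equation 1 is minimized; a separate helper copies $\theta$ into $\theta_{target}$ every fixed number of iterations.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# Experience Replay: a bounded buffer of (s, a, r, s_next, done) tuples,
# where each element is assumed to already be a torch tensor.
replay = deque(maxlen=100_000)
gamma = 0.99

def dqn_update(q_net, target_net, optimizer, batch_size=32):
    batch = random.sample(replay, batch_size)             # sample a minibatch at random
    s, a, r, s_next, done = map(torch.stack, zip(*batch))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q_theta(s_t, a_t)
    with torch.no_grad():
        max_next = target_net(s_next).max(dim=1).values           # max_a' Q_target(s_{t+1}, a')
        y = r + gamma * (1.0 - done.float()) * max_next           # y_t: no bootstrap on terminal s_{t+1}
    loss = F.mse_loss(q_sa, y)                                    # squared TD error (Equation 1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(q_net, target_net):
    # theta_target <- theta, performed every fixed number of iterations
    target_net.load_state_dict(q_net.state_dict())
```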
Skills, Options, Macro-actions Sutton et al. (1999): A skill $\sigma$ is a temporally extended control structure defined by a triple $\sigma = \langle I, \pi, \beta \rangle$, where $I$ is the set of states in which the skill can be initiated, $\pi$ is the intra-skill policy, which determines how the skill behaves in encountered states, and $\beta$ is the set of termination probabilities determining when a skill will stop execution. $\beta$ is typically either a function of state $s$ or time $t$.
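For readers who prefer code to notation, one straightforward (and purely illustrative) way to encode the triple $\langle I, \pi, \beta \rangle$ is with three callables, as in the sketch below; the class name and interface are our own and are not part of the options framework itself.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Skill:
    initiation: Callable[[Any], bool]    # I: may the skill start in state s?
    policy: Callable[[Any], Any]         # pi: intra-skill policy, state -> primitive action
    termination: Callable[[Any], float]  # beta: probability of terminating in state s

    def can_start(self, state) -> bool:
        return self.initiation(state)

    def should_stop(self, state) -> bool:
        # Sample the Bernoulli termination event with probability beta(state).
        return random.random() < self.termination(state)
```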
Semi-Markov Decision Process (SMDP): Planning with skills can be performed using SMDP theory. For each state in the skill initiation set $I$, the SMDP model predicts the state in which the skill will terminate and the total reward received along the way. More formally, an SMDP can be defined by a five-tuple $\langle S, \Sigma, P, R, \gamma \rangle$, where $S$ is a set of states, $\Sigma$ is a set of skills, and $P$ is the transition probability kernel. We assume rewards received at each timestep are bounded by $[0, R_{\max}]$. $R : S \times \Sigma \rightarrow [0, \frac{R_{\max}}{1-\gamma}]$ represents the expected discounted sum of rewards received during the execution of a skill $\sigma$ initialized from a state $s$.
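Continuing the illustrative Skill encoding above, executing a skill in an environment yields exactly the quantities the SMDP model reasons about: the state in which the skill terminates, the discounted reward accumulated along the way, and the number of primitive steps taken. The helper below is a hypothetical sketch (the env.step interface is assumed, not defined by the paper).

```python
def execute_skill(env, skill, state, gamma=0.99):
    """Run a skill to termination; return (terminal state, discounted reward R, duration k)."""
    R, k, done = 0.0, 0, False
    while not done and not skill.should_stop(state):
        state, reward, done = env.step(skill.policy(state))
        R += (gamma ** k) * reward
        k += 1
    return state, R, k
```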
Skill Policy: A skill policy $\mu : S \rightarrow \Delta_{\Sigma}$ is a mapping from states to a probability distribution over skills $\Sigma$. The action-value function $Q_{\mu} : S \times \Sigma \rightarrow \mathbb{R}$ represents the long-term value of taking a skill $\sigma \in \Sigma$ from a state $s \in S$ and thereafter always selecting skills according to policy $\mu$, and is defined by $Q_{\mu}(s, \sigma) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \mid (s, \sigma), \mu \right]$.
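To connect the skill-level value function to the SMDP quantities above, the sketch below shows a standard SMDP Q-learning update (a generic textbook-style rule, not the paper's H-DRLN update): after a skill runs for $k$ primitive steps and accumulates discounted reward $R$, the bootstrap term is scaled by $\gamma^k$. The function name and the dictionary representation of $Q_{\mu}$ are our own.

```python
def smdp_q_update(Q, s, sigma, R, k, s_next, skills, gamma=0.99, alpha=0.1):
    """One SMDP Q-learning step.

    Q : dict mapping (state, skill) -> value estimate
    R : discounted reward accumulated while skill sigma executed
    k : number of primitive steps the skill ran for
    """
    best_next = max(Q.get((s_next, sig), 0.0) for sig in skills)
    target = R + (gamma ** k) * best_next      # bootstrap scaled by gamma^k
    old = Q.get((s, sigma), 0.0)
    Q[(s, sigma)] = old + alpha * (target - old)
    return Q
```

Using execute_skill from the previous sketch, a single skill execution produces exactly the (s_next, R, k) triple this update consumes.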