Deep reinforcement learning hands-on: apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more

This practical guide will teach you how deep learning (DL) can be used to solve complex real-world problems.

About This Book:
  • Explore deep reinforcement learning (RL), from the first principles to the latest algorithms
  • Evaluate high-profile RL methods, including value iteration, deep Q-networks, poli...

Bibliographic Details
Other Authors: Lapan, Maxim (author)
Format: eBook
Language: English
Published: Birmingham, England : Packt Publishing, 2018.
Edition: 1st edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630632506719
Table of Contents:
  • Cover
  • Copyright
  • Packt upsell
  • Contributors
  • Table of Contents
  • Preface
  • Chapter 1 - What is Reinforcement Learning?
  • Learning - supervised, unsupervised, and reinforcement
  • RL formalisms and relations
  • Reward
  • The agent
  • The environment
  • Actions
  • Observations
  • Markov decision processes
  • Markov process
  • Markov reward process
  • Markov decision process
  • Summary
  • Chapter 2 - OpenAI Gym
  • The anatomy of the agent
  • Hardware and software requirements
  • OpenAI Gym API
  • Action space
  • Observation space
  • The environment
  • Creation of the environment
  • The CartPole session
  • The random CartPole agent
  • The extra Gym functionality - wrappers and monitors
  • Wrappers
  • Monitor
  • Summary
  • Chapter 3 - Deep Learning with PyTorch
  • Tensors
  • Creation of tensors
  • Scalar tensors
  • Tensor operations
  • GPU tensors
  • Gradients
  • Tensors and gradients
  • NN building blocks
  • Custom layers
  • Final glue - loss functions and optimizers
  • Loss functions
  • Optimizers
  • Monitoring with TensorBoard
  • TensorBoard 101
  • Plotting stuff
  • Example - GAN on Atari images
  • Summary
  • Chapter 4 - The Cross-Entropy Method
  • Taxonomy of RL methods
  • Practical cross-entropy
  • Cross-entropy on CartPole
  • Cross-entropy on FrozenLake
  • Theoretical background of the cross-entropy method
  • Summary
  • Chapter 5 - Tabular Learning and the Bellman Equation
  • Value, state, and optimality
  • The Bellman equation of optimality
  • Value of action
  • The value iteration method
  • Value iteration in practice
  • Q-learning for FrozenLake
  • Summary
  • Chapter 6 - Deep Q-Networks
  • Real-life value iteration
  • Tabular Q-learning
  • Deep Q-learning
  • Interaction with the environment
  • SGD optimisation
  • Correlation between steps
  • The Markov property
  • The final form of DQN training
  • DQN on Pong
  • Wrappers
  • DQN model
  • Training
  • Running and performance
  • Your model in action
  • Summary
  • Chapter 7 - DQN Extensions
  • The PyTorch Agent Net library
  • Agent
  • Agent's experience
  • Experience buffer
  • Gym env wrappers
  • Basic DQN
  • N-step DQN
  • Implementation
  • Double DQN
  • Implementation
  • Results
  • Noisy networks
  • Implementation
  • Results
  • Prioritized replay buffer
  • Implementation
  • Results
  • Dueling DQN
  • Implementation
  • Results
  • Categorical DQN
  • Implementation
  • Results
  • Combining everything
  • Implementation
  • Results
  • Summary
  • References
  • Chapter 8 - Stocks Trading Using RL
  • Trading
  • Data
  • Problem statements and key decisions
  • The trading environment
  • Models
  • Training code
  • Results
  • The feed-forward model
  • The convolution model
  • Things to try
  • Summary
  • Chapter 9 - Policy Gradients - An Alternative
  • Values and policy
  • Why policy?
  • Policy representation
  • Policy gradients
  • The REINFORCE method
  • The CartPole example
  • Results
  • Policy-based versus value-based methods
  • REINFORCE issues
  • Full episodes are required
  • High gradients variance
  • Exploration
  • Correlation between samples
  • PG on CartPole
  • Results
  • PG on Pong
  • Results
  • Summary
  • Chapter 10 - The Actor-Critic Method
  • Variance reduction
  • CartPole variance
  • Actor-critic
  • A2C on Pong
  • A2C on Pong results
  • Tuning hyperparameters
  • Learning rate
  • Entropy beta
  • Count of environments
  • Batch size
  • Summary
  • Chapter 11 - Asynchronous Advantage Actor-Critic
  • Correlation and sample efficiency
  • Adding an extra A to A2C
  • Multiprocessing in Python
  • A3C - data parallelism
  • Results
  • A3C - gradients parallelism
  • Results
  • Summary
  • Chapter 12 - Chatbots Training with RL
  • Chatbots overview
  • Deep NLP basics
  • Recurrent Neural Networks
  • Embeddings
  • Encoder-Decoder
  • Training of seq2seq
  • Log-likelihood training
  • Bilingual evaluation understudy (BLEU) score
  • RL in seq2seq
  • Self-critical sequence training
  • The chatbot example
  • The example structure
  • Modules: cornell.py and data.py
  • BLEU score and utils.py
  • Model
  • Training: cross-entropy
  • Running the training
  • Checking the data
  • Playing with the trained model
  • Training: SCST
  • Running the SCST training
  • Results
  • Telegram bot
  • Summary
  • Chapter 13 - Web Navigation
  • Web navigation
  • Browser automation and RL
  • Mini World of Bits benchmark
  • OpenAI Universe
  • Installation
  • Actions and observations
  • Environment creation
  • MiniWoB stability
  • Simple clicking approach
  • Grid actions
  • Example overview
  • Model
  • Training code
  • Starting containers
  • Training process
  • Checking the learned policy
  • Issues with simple clicking
  • Human demonstrations
  • Recording the demonstrations
  • Recording format
  • Training using demonstrations
  • Results
  • TicTacToe problem
  • Adding text description
  • Results
  • Things to try
  • Summary
  • Chapter 14 - Continuous Action Space
  • Why a continuous space?
  • Action space
  • Environments
  • The Actor-Critic (A2C) method
  • Implementation
  • Results
  • Using models and recording videos
  • Deterministic policy gradients
  • Exploration
  • Implementation
  • Results
  • Recording videos
  • Distributional policy gradients
  • Architecture
  • Implementation
  • Results
  • Things to try
  • Summary
  • Chapter 15 - Trust Regions - TRPO, PPO and ACKTR
  • Introduction
  • Roboschool
  • A2C baseline
  • Results
  • Videos recording
  • Proximal Policy Optimisation
  • Implementation
  • Results
  • Trust Region Policy Optimisation
  • Implementation
  • Results
  • A2C using ACKTR
  • Implementation
  • Results
  • Summary
  • Chapter 16 - Black-Box Optimization in RL
  • Black-box methods
  • Evolution strategies
  • ES on CartPole
  • Results
  • ES on HalfCheetah
  • Results
  • Genetic algorithms
  • GA on CartPole
  • Results
  • GA tweaks
  • Deep GA
  • Novelty search
  • GA on Cheetah
  • Results
  • Summary
  • References
  • Chapter 17 - Beyond Model-Free - Imagination
  • Model-based versus model-free
  • Model imperfections
  • Imagination-augmented agent
  • The environment model
  • The rollout policy
  • The rollout encoder
  • Paper results
  • I2A on Atari Breakout
  • The baseline A2C agent
  • EM training
  • The imagination agent
  • The I2A model
  • The Rollout encoder
  • Training of I2A
  • Experiment results
  • The baseline agent
  • Training EM weights
  • Training with the I2A model
  • Summary
  • References
  • Chapter 18 - AlphaGo Zero
  • Board games
  • The AlphaGo Zero method
  • Overview
  • Monte-Carlo Tree Search
  • Self-play
  • Training and evaluation
  • Connect4 bot
  • Game model
  • Implementing MCTS
  • Model
  • Training
  • Testing and comparison
  • Connect4 results
  • Summary
  • References
  • Book summary
  • Other Books You May Enjoy
  • Index.