Deep Reinforcement Learning Hands-On: apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more
This practical guide will teach you how deep learning (DL) can be used to solve complex real-world problems.
About This Book:
- Explore deep reinforcement learning (RL), from the first principles to the latest algorithms
- Evaluate high-profile RL methods, including value iteration, deep Q-networks, poli...
Other Authors: | |
---|---|
Format: | eBook |
Language: | English |
Published: | Birmingham, England : Packt Publishing, 2018 |
Edition: | 1st edition |
Subjects: | |
View at Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630632506719 |
Table of Contents:
- Cover
- Copyright
- Packt upsell
- Contributors
- Table of Contents
- Preface
- Chapter 1 - What is Reinforcement Learning?
- Learning - supervised, unsupervised, and reinforcement
- RL formalisms and relations
- Reward
- The agent
- The environment
- Actions
- Observations
- Markov decision processes
- Markov process
- Markov reward process
- Markov decision process
- Summary
- Chapter 2 - OpenAI Gym
- The anatomy of the agent
- Hardware and software requirements
- OpenAI Gym API
- Action space
- Observation space
- The environment
- Creation of the environment
- The CartPole session
- The random CartPole agent
- The extra Gym functionality - wrappers and monitors
- Wrappers
- Monitor
- Summary
- Chapter 3 - Deep Learning with PyTorch
- Tensors
- Creation of tensors
- Scalar tensors
- Tensor operations
- GPU tensors
- Gradients
- Tensors and gradients
- NN building blocks
- Custom layers
- Final glue - loss functions and optimizers
- Loss functions
- Optimizers
- Monitoring with TensorBoard
- TensorBoard 101
- Plotting stuff
- Example - GAN on Atari images
- Summary
- Chapter 4 - The Cross-Entropy Method
- Taxonomy of RL methods
- Practical cross-entropy
- Cross-entropy on CartPole
- Cross-entropy on FrozenLake
- Theoretical background of the cross-entropy method
- Summary
- Chapter 5 - Tabular Learning and the Bellman Equation
- Value, state, and optimality
- The Bellman equation of optimality
- Value of action
- The value iteration method
- Value iteration in practice
- Q-learning for FrozenLake
- Summary
- Chapter 6 - Deep Q-Networks
- Real-life value iteration
- Tabular Q-learning
- Deep Q-learning
- Interaction with the environment
- SGD optimisation
- Correlation between steps
- The Markov property
- The final form of DQN training
- DQN on Pong
- Wrappers
- DQN model
- Training
- Running and performance
- Your model in action
- Summary
- Chapter 7 - DQN Extensions
- The PyTorch Agent Net library
- Agent
- Agent's experience
- Experience buffer
- Gym env wrappers
- Basic DQN
- N-step DQN
- Implementation
- Double DQN
- Implementation
- Results
- Noisy networks
- Implementation
- Results
- Prioritized replay buffer
- Implementation
- Results
- Dueling DQN
- Implementation
- Results
- Categorical DQN
- Implementation
- Results
- Combining everything
- Implementation
- Results
- Summary
- References
- Chapter 8 - Stocks Trading Using RL
- Trading
- Data
- Problem statements and key decisions
- The trading environment
- Models
- Training code
- Results
- The feed-forward model
- The convolution model
- Things to try
- Summary
- Chapter 9 - Policy Gradients - An Alternative
- Values and policy
- Why policy?
- Policy representation
- Policy gradients
- The REINFORCE method
- The CartPole example
- Results
- Policy-based versus value-based methods
- REINFORCE issues
- Full episodes are required
- High gradients variance
- Exploration
- Correlation between samples
- PG on CartPole
- Results
- PG on Pong
- Results
- Summary
- Chapter 10 - The Actor-Critic Method
- Variance reduction
- CartPole variance
- Actor-critic
- A2C on Pong
- A2C on Pong results
- Tuning hyperparameters
- Learning rate
- Entropy beta
- Count of environments
- Batch size
- Summary
- Chapter 11 - Asynchronous Advantage Actor-Critic
- Correlation and sample efficiency
- Adding an extra A to A2C
- Multiprocessing in Python
- A3C - data parallelism
- Results
- A3C - gradients parallelism
- Results
- Summary
- Chapter 12 - Chatbots Training with RL
- Chatbots overview
- Deep NLP basics
- Recurrent Neural Networks
- Embeddings
- Encoder-Decoder
- Training of seq2seq
- Log-likelihood training
- Bilingual evaluation understudy (BLEU) score
- RL in seq2seq
- Self-critical sequence training
- The chatbot example
- The example structure
- Modules: cornell.py and data.py
- BLEU score and utils.py
- Model
- Training: cross-entropy
- Running the training
- Checking the data
- Playing with the trained model
- Training: SCST
- Running the SCST training
- Results
- Telegram bot
- Summary
- Chapter 13 - Web Navigation
- Web navigation
- Browser automation and RL
- Mini World of Bits benchmark
- OpenAI Universe
- Installation
- Actions and observations
- Environment creation
- MiniWoB stability
- Simple clicking approach
- Grid actions
- Example overview
- Model
- Training code
- Starting containers
- Training process
- Checking the learned policy
- Issues with simple clicking
- Human demonstrations
- Recording the demonstrations
- Recording format
- Training using demonstrations
- Results
- TicTacToe problem
- Adding text description
- Results
- Things to try
- Summary
- Chapter 14 - Continuous Action Space
- Why a continuous space?
- Action space
- Environments
- The Actor-Critic (A2C) method
- Implementation
- Results
- Using models and recording videos
- Deterministic policy gradients
- Exploration
- Implementation
- Results
- Recording videos
- Distributional policy gradients
- Architecture
- Implementation
- Results
- Things to try
- Summary
- Chapter 15 - Trust Regions - TRPO, PPO and ACKTR
- Introduction
- Roboschool
- A2C baseline
- Results
- Videos recording
- Proximal Policy Optimisation
- Implementation
- Results
- Trust Region Policy Optimisation
- Implementation
- Results
- A2C using ACKTR
- Implementation
- Results
- Summary
- Chapter 16 - Black-Box Optimization in RL
- Black-box methods
- Evolution strategies
- ES on CartPole
- Results
- ES on HalfCheetah
- Results
- Genetic algorithms
- GA on CartPole
- Results
- GA tweaks
- Deep GA
- Novelty search
- GA on Cheetah
- Results
- Summary
- References
- Chapter 17 - Beyond Model-Free - Imagination
- Model-based versus model-free
- Model imperfections
- Imagination-augmented agent
- The environment model
- The rollout policy
- The rollout encoder
- Paper results
- I2A on Atari Breakout
- The baseline A2C agent
- EM training
- The imagination agent
- The I2A model
- The Rollout encoder
- Training of I2A
- Experiment results
- The baseline agent
- Training EM weights
- Training with the I2A model
- Summary
- References
- Chapter 18 - AlphaGo Zero
- Board games
- The AlphaGo Zero method
- Overview
- Monte-Carlo Tree Search
- Self-play
- Training and evaluation
- Connect4 bot
- Game model
- Implementing MCTS
- Model
- Training
- Testing and comparison
- Connect4 results
- Summary
- References
- Book summary
- Other Books You May Enjoy
- Index