Let's Run Jinyeah

Integrating Learning and Planning

jinyeah 2021. 8. 31. 18:48

This post looks at the simulation-based, model-based RL methods used in AlphaGo.

 

Outline


1. Introduction

  • What is Model?
  • Model-Free RL vs Model-Based RL
  • Advantages of Model-Based RL

2. Model-Based RL

  • Model Learning
  • Planning with a Model

3. Integrated Architectures

  • Dyna

4. Simulation-Based Search

  • Simple Monte-Carlo Search
  • Monte-Carlo Tree Search(MCTS)
  • Temporal-Difference Search

Introduction

What is Model?

  • A model is the agent's representation of the MDP: the state transition probabilities and the reward function
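
Written out, the definition above can be stated as follows. This is a sketch in the notation of the lecture listed in the references; the parameter symbol \eta and the calligraphic names are assumptions that do not appear in the original post.

```latex
\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle, \qquad
S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t), \qquad
R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)
```

Here \mathcal{P}_\eta approximates the true transition probabilities and \mathcal{R}_\eta approximates the true reward function.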

Model-Free RL vs Model-Based RL

Model-Free RL

  • No model
  • Learn value function (and/or policy) from experience

Model-Based RL

  • Learn a model from experience
  • Plan value function (and/or policy) from model

Advantages of Model-Based RL

  • The model can be learned efficiently with supervised learning methods
  • The agent can reason about model uncertainty

Model-Based RL

Model Learning

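  • Estimate the model from real experience {S1, A1, R2, ..., ST}
  • This is a supervised learning problem: learning s, a → r is regression, and learning s, a → s' is density estimation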
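
One simple way to make this concrete is a table lookup model, which counts observed transitions and averages observed rewards for each state-action pair. The following is a minimal sketch under that assumption; the class name `TableLookupModel`, its methods, and the toy experience tuples are hypothetical and not from the original post.

```python
import random


class TableLookupModel:
    """Tabular model: empirical transition frequencies and mean rewards."""

    def __init__(self):
        self.counts = {}      # (s, a) -> {s_next: number of times observed}
        self.reward_sum = {}  # (s, a) -> sum of observed rewards

    def update(self, s, a, r, s_next):
        """Record one real transition (s, a, r, s_next)."""
        successors = self.counts.setdefault((s, a), {})
        successors[s_next] = successors.get(s_next, 0) + 1
        self.reward_sum[(s, a)] = self.reward_sum.get((s, a), 0.0) + r

    def visits(self, s, a):
        return sum(self.counts.get((s, a), {}).values())

    def transition_prob(self, s, a, s_next):
        n = self.visits(s, a)
        return self.counts[(s, a)].get(s_next, 0) / n if n else 0.0

    def expected_reward(self, s, a):
        n = self.visits(s, a)
        return self.reward_sum[(s, a)] / n if n else 0.0

    def sample(self, s, a, rng):
        """Draw a next state from the learned frequencies (KeyError if unseen).

        Simplification: returns the mean reward instead of sampling a reward.
        """
        successors = self.counts[(s, a)]
        s_next = rng.choices(list(successors), weights=list(successors.values()))[0]
        return self.expected_reward(s, a), s_next


# Toy experience: (state, action, reward, next_state), purely illustrative
experience = [
    ("A", "go", 0.0, "B"),
    ("A", "go", 0.0, "B"),
    ("A", "go", 0.0, "B"),
    ("A", "go", 1.0, "C"),
]
model = TableLookupModel()
for s, a, r, s_next in experience:
    model.update(s, a, r, s_next)

print(model.transition_prob("A", "go", "B"))  # 0.75
print(model.expected_reward("A", "go"))       # 0.25
```

A function approximator (for example a neural network) could replace the tables; the table version is used here only because it is the easiest to inspect.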

Planning with a model (Sample-Based Planning)

  • use the model only to generate samples
  • apply model-free RL to samples
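
The two steps above can be sketched as follows: the model is used only as a sample generator, and Q-learning (one possible choice of model-free method) is applied to the generated transitions. The two-state `MODEL`, the state and action names, and the hyperparameters are all hypothetical.

```python
import random

# Hypothetical learned model:
# (state, action) -> list of (probability, reward, next_state)
MODEL = {
    ("A", "left"): [(1.0, 0.0, "end")],
    ("A", "right"): [(1.0, 0.0, "B")],
    ("B", "left"): [(1.0, 0.0, "A")],
    ("B", "right"): [(1.0, 1.0, "end")],
}
ACTIONS = ["left", "right"]


def sample_model(s, a, rng):
    """Sample one (reward, next_state) outcome from the model."""
    outcomes = MODEL[(s, a)]
    weights = [p for p, _, _ in outcomes]
    _, r, s_next = rng.choices(outcomes, weights=weights)[0]
    return r, s_next


def plan_with_q_learning(n_samples=5000, alpha=0.1, gamma=0.9, seed=0):
    """Apply Q-learning to transitions sampled from the model, not the real world."""
    rng = random.Random(seed)
    q = {sa: 0.0 for sa in MODEL}
    pairs = list(MODEL)
    for _ in range(n_samples):
        s, a = rng.choice(pairs)              # pick a state-action pair to simulate
        r, s_next = sample_model(s, a, rng)   # simulated experience
        best_next = 0.0 if s_next == "end" else max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q


q = plan_with_q_learning()
for sa in sorted(q):
    print(sa, round(q[sa], 3))
```

Because the agent never touches the real environment here, the quality of the resulting values is bounded by the quality of the model.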

Integrated Architectures

Dyna

  • Learn a model from real experience
  • Learn and plan value function (and/or policy) from real and simulated experience
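
A minimal Dyna-Q style sketch of this architecture is shown below, assuming a deterministic table lookup model and Q-learning updates. The corridor environment `env_step`, the number of planning steps, and the other hyperparameters are hypothetical choices for illustration.

```python
import random

ACTIONS = ["left", "right"]


def env_step(state, action):
    """Hypothetical real environment: 5-state corridor, reward 1 at the right end."""
    next_state = max(0, min(4, state + (1 if action == "right" else -1)))
    reward = 1.0 if next_state == 4 else 0.0
    return reward, next_state, next_state == 4


def dyna_q(episodes=30, planning_steps=10, alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}
    model = {}  # (s, a) -> (reward, next_state, done), learned from real experience

    def greedy(s):
        best = max(q[(s, a)] for a in ACTIONS)
        return rng.choice([a for a in ACTIONS if q[(s, a)] == best])

    def q_update(s, a, r, s_next, done):
        target = r if done else r + gamma * max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS) if rng.random() < epsilon else greedy(s)
            r, s_next, done = env_step(s, a)
            q_update(s, a, r, s_next, done)    # learn from real experience
            model[(s, a)] = (r, s_next, done)  # learn the model
            for _ in range(planning_steps):    # plan with simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps_next, pdone = model[(ps, pa)]
                q_update(ps, pa, pr, ps_next, pdone)
            s = s_next
    return q


q = dyna_q()
for s in range(4):
    print(s, round(q[(s, "left")], 3), round(q[(s, "right")], 3))
```

Setting `planning_steps=0` reduces this sketch to plain Q-learning on real experience, which makes the contribution of the simulated updates easy to compare.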

Simulation-Based Search

  • Plan by simulating episodes forward from the current state, rather than by sampling arbitrary states from the model
  • Forward search paradigm using sample-based planning
    1. Simulate episodes of experience from now with the model
    2. Apply model-free RL to simulated episodes 

Simple Monte-Carlo Search

  • Uses Monte-Carlo evaluation under a fixed simulation policy
  • Runs forward simulations for every action available in the current state
  • Estimates each Q-value as the mean return and selects the action with the highest Q-value
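
The procedure above can be sketched as follows, assuming a uniformly random simulation policy. The noisy corridor in `model_step`, the rollout count `k`, and all function names are hypothetical.

```python
import random

ACTIONS = ["left", "right"]


def model_step(state, action, rng):
    """Hypothetical model: corridor over states 0..6 with 10% reversed moves.

    Reaching state 0 ends the episode with reward -1, state 6 with reward +1.
    """
    move = 1 if action == "right" else -1
    if rng.random() < 0.1:
        move = -move
    next_state = state + move
    if next_state <= 0:
        return -1.0, 0, True
    if next_state >= 6:
        return 1.0, 6, True
    return 0.0, next_state, False


def rollout(state, first_action, rng, gamma=1.0, max_steps=200):
    """Return of one simulated episode: first_action, then the random policy."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(max_steps):
        reward, state, done = model_step(state, action, rng)
        total += discount * reward
        discount *= gamma
        if done:
            break
        action = rng.choice(ACTIONS)  # fixed simulation policy
    return total


def simple_mc_search(state, k=2000, seed=0):
    """Evaluate each action by the mean return of k rollouts, then act greedily."""
    rng = random.Random(seed)
    q = {a: sum(rollout(state, a, rng) for _ in range(k)) / k for a in ACTIONS}
    return max(q, key=q.get), q


best_action, q_values = simple_mc_search(state=3)
print(best_action, q_values)
```

The simulation policy never improves here; only the single real action at the root is chosen greedily.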

Monte-Carlo Tree Search(MCTS)

  • Simulate from the current state using tree search
  • Apply MC control to the sub-MDP from now
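
A compact sketch of this idea is given below, assuming a UCB1 tree policy inside the tree and a uniformly random default policy outside it. For brevity the "tree" is keyed by state rather than by the path from the root, which is a simplification; `model_step`, the constant `c`, and the other parameters are hypothetical.

```python
import math
import random

ACTIONS = ["left", "right"]


def model_step(state, action, rng):
    """Hypothetical model: corridor 0..6, 10% reversed moves, -1 at 0 and +1 at 6."""
    move = 1 if action == "right" else -1
    if rng.random() < 0.1:
        move = -move
    next_state = state + move
    if next_state <= 0:
        return -1.0, 0, True
    if next_state >= 6:
        return 1.0, 6, True
    return 0.0, next_state, False


def mcts(root, n_simulations=2000, c=1.4, gamma=1.0, seed=0, max_steps=200):
    rng = random.Random(seed)
    tree = {root}  # states currently in the search tree
    n, q = {}, {}  # visit counts N(s, a) and mean returns Q(s, a)

    def tree_policy(s):
        # UCB1: try untried actions first, then maximise Q plus an exploration bonus
        untried = [a for a in ACTIONS if (s, a) not in n]
        if untried:
            return rng.choice(untried)
        total = sum(n[(s, a)] for a in ACTIONS)
        return max(ACTIONS, key=lambda a: q[(s, a)] + c * math.sqrt(math.log(total) / n[(s, a)]))

    for _ in range(n_simulations):
        state, in_tree = root, True
        path, rewards = [], []  # in-tree (step, state, action) visits; all rewards
        for step in range(max_steps):
            if in_tree and state in tree:
                action = tree_policy(state)   # selection inside the tree
                path.append((step, state, action))
            else:
                if in_tree:
                    tree.add(state)           # expansion: add the first new state
                    in_tree = False
                action = rng.choice(ACTIONS)  # default (rollout) policy
            reward, state, done = model_step(state, action, rng)
            rewards.append(reward)
            if done:
                break
        # Backup: move each in-tree Q(s, a) toward the return observed after it
        returns = [0.0] * (len(rewards) + 1)
        for t in reversed(range(len(rewards))):
            returns[t] = rewards[t] + gamma * returns[t + 1]
        for step, s, a in path:
            n[(s, a)] = n.get((s, a), 0) + 1
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + (returns[step] - old) / n[(s, a)]

    best = max(ACTIONS, key=lambda a: q.get((root, a), float("-inf")))
    return best, {a: q.get((root, a)) for a in ACTIONS}


best_action, root_q = mcts(root=1)
print(best_action, root_q)
```

Unlike simple Monte-Carlo search, the policy used inside the tree improves as the Q estimates improve, so later simulations concentrate on the more promising actions.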

Temporal-Difference Search

  • Using TD instead of MC (bootstrapping)
  • applies Sarsa to sub-MDP from now
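
The following sketch applies Sarsa with an epsilon-greedy policy to episodes simulated from the current state, so each step bootstraps from the next Q estimate instead of waiting for the full return. The corridor in `model_step` and all hyperparameters are hypothetical.

```python
import random

ACTIONS = ["left", "right"]


def model_step(state, action, rng):
    """Hypothetical model: corridor 0..6, 10% reversed moves, -1 at 0 and +1 at 6."""
    move = 1 if action == "right" else -1
    if rng.random() < 0.1:
        move = -move
    next_state = state + move
    if next_state <= 0:
        return -1.0, 0, True
    if next_state >= 6:
        return 1.0, 6, True
    return 0.0, next_state, False


def td_search(root, n_episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1,
              seed=0, max_steps=200):
    rng = random.Random(seed)
    q = {}  # Q(s, a) estimates for the sub-MDP reachable from root

    def eps_greedy(s):
        if rng.random() < epsilon:
            return rng.choice(ACTIONS)
        best = max(q.get((s, a), 0.0) for a in ACTIONS)
        return rng.choice([a for a in ACTIONS if q.get((s, a), 0.0) == best])

    for _ in range(n_episodes):
        s = root                  # every simulated episode starts from "now"
        a = eps_greedy(s)
        for _ in range(max_steps):
            r, s_next, done = model_step(s, a, rng)
            if done:
                target = r
            else:
                a_next = eps_greedy(s_next)
                target = r + gamma * q.get((s_next, a_next), 0.0)  # bootstrap
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (target - old)  # Sarsa update
            if done:
                break
            s, a = s_next, a_next

    best = max(ACTIONS, key=lambda b: q.get((root, b), 0.0))
    return best, q


best_action, q = td_search(root=1)
print(best_action, round(q[(1, "left")], 3), round(q[(1, "right")], 3))
```

Bootstrapping tends to reduce variance at the cost of some bias, which is the same trade-off as between TD and MC learning in the model-free setting.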

References

RL Course by David Silver - Lecture 8: Integrating Learning and Planning
