Value Function Approximation

Let's Run Jinyeah

jinyeah 2021. 8. 30. 23:06

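In the previous post (Model-Free Control), we looked at Sarsa and Q-learning, which select actions based on a value function. But when the state and action spaces are so large that the values cannot all be stored in a lookup table, how should we compute the value of every state and action? In this post, we look for a solution by combining neural networks with reinforcement learning.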

Outline

1. Incremental Methods

Stochastic Gradient Descent
Control with Value Function Approximation
Incremental Control Algorithms

2. Batch Methods

Stochastic Gradient Descent with Experience Replay
Deep Q-Networks(DQN)

Incremental Methods

Stochastic Gradient Descent

[Goal]

Find the parameter vector w that minimises the mean-squared error between the approximate value and the true value

[Loss function]

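It is not known in advance which states s the loss should be computed over, and not every existing state can be visited
Using the expectation operator, the loss is computed over the states s that were visited while following the policy π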
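
Written out in the notation of the lecture listed in the references (an assumption, since the text here only names the quantities), with v̂(S, w) the approximate value and v_π(S) the true value under the policy π, the loss would read:

```latex
J(\mathbf{w}) = \mathbb{E}_{\pi}\left[ \left( v_{\pi}(S) - \hat{v}(S, \mathbf{w}) \right)^{2} \right]
```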

[Gradient of the loss function with respect to w]

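Apply the chain rule (the true value is treated as a constant)
With enough sampled data the expectation can be dropped: each sample gives one stochastic gradient step, and those steps average out to the full gradient update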
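
As a worked version of these two points, here is the derivation with w denoting the parameter vector from the Goal above and α a step size (a symbol assumed here, following the lecture listed in the references). The first line is the chain rule with the true value held constant, the second is the gradient-descent step, and the third replaces the expectation with a single sampled state:

```latex
\begin{aligned}
\nabla_{\mathbf{w}} J(\mathbf{w})
  &= -2\, \mathbb{E}_{\pi}\left[ \left( v_{\pi}(S) - \hat{v}(S,\mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S,\mathbf{w}) \right] \\
\Delta \mathbf{w}
  &= -\tfrac{1}{2}\,\alpha\, \nabla_{\mathbf{w}} J(\mathbf{w})
   = \alpha\, \mathbb{E}_{\pi}\left[ \left( v_{\pi}(S) - \hat{v}(S,\mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S,\mathbf{w}) \right] \\
\Delta \mathbf{w}
  &= \alpha \left( v_{\pi}(S) - \hat{v}(S,\mathbf{w}) \right) \nabla_{\mathbf{w}} \hat{v}(S,\mathbf{w})
  \qquad \text{(single sample)}
\end{aligned}
```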

Control(Model-Free Policy Iteration) with Value Function Approximation

[Policy evaluation] Approximate policy evaluation with Learnt Value Function

[Policy improvement] ϵ-greedy Exploration
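
A minimal sketch of the ϵ-greedy improvement step, assuming the approximate action values for the current state are already available as a plain list. The function name `epsilon_greedy` and its parameters are illustrative and do not come from the lecture or the book.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon an action is chosen uniformly at random
    (exploration); otherwise the action with the highest estimate is
    chosen (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Ties are broken in favour of the first maximal action.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the choice is purely greedy, and with `epsilon=1.0` it is purely random.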

Incremental Control Algorithms

Substitute a target (MC, TD(0), or TD(λ)) for the true value, since the true value is not available during learning
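
The substitution can be sketched for a linear approximator, where the gradient with respect to w is simply the feature vector. This is an illustration under assumptions: the helper names (`q_hat`, `sgd_step`, `mc_target`, `td0_target`), the feature vectors and the `terminal` flag are all hypothetical, and TD(λ) is left out for brevity.

```python
def q_hat(w, x):
    """Linear approximation: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_step(w, x, target, alpha):
    """Move w toward `target` for the feature vector x.

    For a linear q_hat the gradient with respect to w is x itself.
    The target is treated as a constant, exactly as the true value
    was in the original loss.
    """
    error = target - q_hat(w, x)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]


def mc_target(rewards, gamma):
    """Monte-Carlo target: discounted return from the current step on."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td0_target(reward, gamma, w, x_next, terminal=False):
    """TD(0) target: reward plus the discounted estimate of the next value."""
    return reward if terminal else reward + gamma * q_hat(w, x_next)
```

Only the target changes between the variants; the update in `sgd_step` stays the same.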

Batch Methods

Stochastic Gradient Descent with Experience Replay

Sample (state, value) pairs from the stored experience (experience replay)
Averaging the gradient over the sampled pairs removes the expectation from the gradient of the loss function
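
A minimal sketch of such an experience store, assuming uniform random sampling and a fixed capacity. The class name `ReplayBuffer` and its methods are illustrative, not taken from the lecture or the book.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past experience with uniform sampling."""

    def __init__(self, capacity, seed=None):
        # A bounded deque silently discards the oldest item when full.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, item):
        """Store one item, e.g. a (state, value) pair or a transition."""
        self.memory.append(item)

    def sample(self, batch_size):
        """Draw batch_size distinct items uniformly at random."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

Sampling at random rather than in arrival order reduces the correlation between consecutive training examples.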

Deep Q-Networks

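Uses Q-learning, an off-policy temporal-difference (TD) method, to compute the target that stands in for the true value
Features
1. Transitions collected through experience replay are sampled and reused for training
2. For training stability, a separate target network (a frozen copy of the Q-network) is kept and used to compute the target
Procedure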

Take an action according to the ϵ-greedy policy
Store the transition (s,a,r,s') in replay memory D
Sample a random mini-batch of transitions (s,a,r,s') from D
Compute the Q-learning targets w.r.t. the target network
Optimise the MSE between the Q-network and the Q-learning targets using stochastic gradient descent
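
The five steps can be sketched in plain Python. This is a toy illustration, not DQN as published: a linear model with one weight row per action stands in for the neural network, each transition carries an extra `done` flag that the (s,a,r,s') tuples above do not have, and every function name is hypothetical.

```python
import random


def q_values(weights, state):
    """One linear output per action: weights[a] dotted with the state."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def select_action(weights, state, epsilon, rng):
    """Step 1: epsilon-greedy action from the online Q-function."""
    if rng.random() < epsilon:
        return rng.randrange(len(weights))
    q = q_values(weights, state)
    return q.index(max(q))


def train_step(weights, target_weights, memory, batch_size, gamma, alpha, rng):
    """Steps 3 to 5: sample a mini-batch, build targets, take SGD steps."""
    batch = rng.sample(memory, batch_size)  # step 3
    for state, action, reward, next_state, done in batch:
        # Step 4: the target uses the frozen weights, not the online ones.
        bootstrap = 0.0 if done else gamma * max(q_values(target_weights, next_state))
        target = reward + bootstrap
        # Step 5: gradient step on the squared error for the taken action.
        error = target - q_values(weights, state)[action]
        weights[action] = [
            w + alpha * error * s for w, s in zip(weights[action], state)
        ]


def sync_target(weights):
    """Copy the online weights into a new frozen target."""
    return [row[:] for row in weights]
```

Step 2 is a plain `memory.append((s, a, r, s_next, done))` on a list. Calling `sync_target` only every so many updates is what keeps the targets fixed between refreshes.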

References

[Lecture] RL Course by David Silver - Lecture 6: Value Function Approximation

[Book] 바닥부터 배우는 강화학습
