Distilling the Knowledge in a Neural Network
[Paper] Distilling the Knowledge in a Neural Network
- An early Knowledge Distillation paper (NIPS 2014 workshop)
- A model compression technique for bringing computation to edge devices: the goal is to have a small, compact model mimic the performance of a cumbersome model
- Supervised learning (labels are available)
- Response-based knowledge (the teacher's soft targets) + offline distillation (a pre-trained teacher model)
Knowledge
Conventional approach
- Uses hard targets: one-hot encoding in which the class with the highest softmax probability becomes 1 and all other classes become 0
- Problem: the standard softmax function maps the largest logit to a probability close to 1 and the rest to values close to 0, so little information about class similarity survives
Proposed method
- Soft targets: add a temperature hyperparameter (T) to the softmax function (see the sketch below)
- When Temperature = 1, it is identical to the standard softmax function
- The higher the Temperature, the softer the resulting probability distribution
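A minimal numpy sketch of the temperature-scaled softmax, q_i = exp(z_i / T) / sum_j exp(z_j / T); the logit values are made up purely for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()              # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [10.0, 5.0, 1.0]        # hypothetical teacher logits for 3 classes

print(softmax_with_temperature(logits, T=1))   # ~[0.993, 0.007, 0.0001] -> nearly one-hot
print(softmax_with_temperature(logits, T=4))   # ~[0.72, 0.21, 0.08]     -> much softer
```

With T = 1 the function reduces to the standard softmax and the output is nearly a hard target; raising T spreads probability mass over the non-target classes.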
Distillation
Transfer the knowledge learned by the teacher model to the student model (both loss terms are sketched below)
- Distillation loss: the difference between the teacher's soft labels and the student's soft predictions, measured with the Kullback-Leibler divergence
- Student loss: the cross-entropy between the student's hard predictions and the hard (ground-truth) labels
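A minimal PyTorch sketch of how the two terms are commonly combined: KL divergence between the temperature-softened teacher and student distributions (scaled by T², following the paper) plus cross-entropy against the hard labels, mixed with a weight alpha. The specific values of T and alpha and the weighting convention here are assumptions for illustration, not taken from this post:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Distillation loss: KL divergence between teacher and student soft distributions.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    distill = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T ** 2)

    # Student loss: ordinary cross-entropy between student predictions and hard labels.
    student = F.cross_entropy(student_logits, labels)

    return alpha * distill + (1 - alpha) * student

# Usage with random tensors (batch of 8 examples, 10 classes):
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```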
What is Kullback-Leibler Divergence?
2022.05.10 - [Deep Learning/Basic] - Objective function
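For quick reference, the KL divergence from q to p is D_KL(p || q) = sum_i p(i) * log(p(i) / q(i)); a minimal numpy sketch with made-up distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); assumes all p_i, q_i > 0."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]   # e.g. teacher's soft targets (illustrative values)
q = [0.5, 0.3, 0.2]   # e.g. student's soft predictions (illustrative values)
print(kl_divergence(p, q))   # small positive value; 0 only when p == q
```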