
Distilling the Knowledge in a Neural Network

jinyeah 2022. 5. 10. 13:06

[Paper] Distilling the Knowledge in a Neural Network

  • An early paper on Knowledge Distillation (2014 NIPS workshop)
  • A model compression technique for bringing computation to edge devices: the goal is to train a small, compact model that mimics the performance of a cumbersome model
  • Supervised learning (labels are available)
  • Response-based knowledge (the teacher's soft targets) + offline distillation (a pre-trained teacher model)

Knowledge

Conventional approach

  • Uses hard targets: with one-hot encoding, the class with the highest softmax probability becomes 1 and all other classes become 0
  • The standard softmax function has the problem that the largest probability is mapped close to 1 and the rest close to 0, so the relative similarities between the non-target classes are mostly lost

Proposed method

  • Soft targets: add a temperature hyperparameter (T) to the softmax function (see the sketch after this list)
    • When the temperature T = 1, it is identical to the standard softmax function
    • The larger the temperature, the softer the resulting probability distribution
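
A minimal sketch of the temperature-scaled softmax from the paper, q_i = exp(z_i / T) / Σ_j exp(z_j / T); the function name, example logits, and temperature values below are illustrative, not taken from the paper.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                      # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [8.0, 3.0, 1.0]
print(softmax_with_temperature(logits, T=1.0))  # near one-hot: ~[0.992, 0.007, 0.001]
print(softmax_with_temperature(logits, T=5.0))  # softer:       ~[0.62, 0.23, 0.15]
```

With T = 1 the output is nearly one-hot, while T = 5 spreads probability mass over the non-maximal classes, which is exactly the "soft" information the student is meant to learn from.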

Distillation

The knowledge learned by the teacher model is transferred to the student model.

  • Distillation loss: the difference between the soft labels (teacher outputs at temperature T) and the soft predictions (student outputs at temperature T), measured with the Kullback-Leibler divergence
  • Student loss: the cross-entropy between the student's hard predictions and the hard (ground-truth) labels (a combined-loss sketch follows this list)
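
A minimal PyTorch-style sketch of how the two losses could be combined, assuming a weighting factor alpha between the student loss and the distillation loss (alpha, T, and the function name are illustrative choices, not values from the paper); the T² factor follows the paper's note that soft-target gradients scale as 1/T².

```python
import torch
import torch.nn.functional as F

def knowledge_distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.1):
    """Weighted sum of the student (hard-label) loss and the distillation (soft-label) loss."""
    # Student loss: cross-entropy between the student's predictions and the ground-truth labels
    student_loss = F.cross_entropy(student_logits, labels)

    # Distillation loss: KL divergence between teacher and student soft distributions at temperature T
    soft_predictions = F.log_softmax(student_logits / T, dim=1)   # student, log-probabilities
    soft_labels = F.softmax(teacher_logits / T, dim=1)            # teacher, probabilities
    distillation_loss = F.kl_div(soft_predictions, soft_labels, reduction="batchmean")

    # Multiply the distillation term by T^2 since its gradients scale as 1/T^2
    return alpha * student_loss + (1.0 - alpha) * (T ** 2) * distillation_loss
```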

What is Kullback-Leibler Divergence?

2022.05.10 - [Deep Learning/Basic] - Objective function
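
For quick reference, the standard discrete form of the Kullback-Leibler divergence between a target distribution P (here, the teacher's soft labels) and an approximation Q (the student's soft predictions):

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}
```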

 

References

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).

https://towardsdatascience.com/distilling-knowledge-in-neural-network-d8991faa2cdc

https://intellabs.github.io/distiller/knowledge_distillation.html

https://dsbook.tistory.com/324