[Paper Notes] Distilling the Knowledge in a Neural Network

November 2, 2019

Author: Guofei

Category: 0-Paper Reading, article number: 1

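Copyright notice: This article was written by Guo Fei. Feel free to repost it, but please cite the original link and notify the author.
Original link: https://www.guofei.site/2019/11/02/distilling_knowledge.html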


Distilling the knowledge in a neural network (2015), G. Hinton et al.

Abstract & Introduction

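Ensemble methods do work well, but they consume too much compute. The paper proposes a method called Distilling the Knowledge, which compresses what an ensemble has learned into a single model. It also introduces specialist models that can be trained quickly and in parallel.

We can train a cumbersome model, then use Distilling the Knowledge so that only a lightweight model is deployed in production.

In a multi-class model with many classes, some classes receive very low probabilities, yet the objective function still has to account for them.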

Model

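First fit the cumbersome model and take the output of its softmax layer, where $v_i$ are its logits and $T$ is the temperature:

$$p_i = \frac{\exp(v_i/T)}{\sum_j \exp(v_j/T)}$$

Apart from the temperature, this step is the softmax we are all used to.

The next step is distillation: the $p_i$ from the previous step are used as the targets y for the small model, whose logits are $z_i$ and whose softened outputs are $q_i$. Because both steps use cross-entropy as the cost function, we get

$$\frac{\partial C}{\partial z_i} = \frac{1}{T}(q_i - p_i) = \frac{1}{T}\left(\frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} - \frac{\exp(v_i/T)}{\sum_j \exp(v_j/T)}\right)$$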
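The gradient formula above can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names (`softmax_T`, `distill_loss`, `distill_grad`, `numeric_grad`) and the random logits are assumptions made only for illustration. It compares the analytic gradient $(q_i - p_i)/T$ with a finite-difference estimate of the cross-entropy.

```python
import numpy as np

def softmax_T(logits, T):
    """Softmax with temperature T over a 1-D array of logits."""
    scaled = logits / T
    scaled = scaled - scaled.max()  # shift for numerical stability; result is unchanged
    e = np.exp(scaled)
    return e / e.sum()

def distill_loss(z, v, T):
    """Cross-entropy between the soft targets p (from v) and the student output q (from z)."""
    p = softmax_T(v, T)
    q = softmax_T(z, T)
    return -np.sum(p * np.log(q))

def distill_grad(z, v, T):
    """Analytic gradient dC/dz_i = (q_i - p_i) / T."""
    return (softmax_T(z, T) - softmax_T(v, T)) / T

def numeric_grad(z, v, T, eps=1e-6):
    """Central finite-difference gradient, used only as a sanity check."""
    g = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        g[i] = (distill_loss(z + step, v, T) - distill_loss(z - step, v, T)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
v = rng.normal(size=5) * 3  # hypothetical logits of the cumbersome model
z = rng.normal(size=5) * 3  # hypothetical logits of the small model
T = 4.0

# Largest absolute difference between the analytic and numeric gradients
max_gap = np.max(np.abs(distill_grad(z, v, T) - numeric_grad(z, v, T)))
print(max_gap)
```

If the formula is right, the printed gap should be tiny compared with the gradient entries themselves.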

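Approximation

When T is large compared with the magnitude of the logits, we can apply $\exp(x) \approx 1 + x$ to every exponential, where $N$ is the number of classes:

$$\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right)$$

Assuming in addition that the logits are zero-meaned, i.e. $\sum_j z_j = \sum_j v_j = 0$, this simplifies to

$$\frac{\partial C}{\partial z_i} \approx \frac{1}{NT^2}(z_i - v_i)$$

So in the high-temperature limit, distillation is equivalent to minimizing $\frac{1}{2}(z_i - v_i)^2$, that is, matching the logits of the two models.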
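To get a feel for how large T has to be, here is a small numerical sketch under stated assumptions: the logits are random draws that are explicitly zero-meaned, and `exact_grad` and `approx_grad` are hypothetical helper names. It measures the relative error of the approximation $(z_i - v_i)/(NT^2)$ against the exact gradient at several temperatures.

```python
import numpy as np

def softmax_T(logits, T):
    """Softmax with temperature T; the max is subtracted for numerical stability."""
    e = np.exp((logits - logits.max()) / T)
    return e / e.sum()

def exact_grad(z, v, T):
    """Exact gradient (q_i - p_i) / T."""
    return (softmax_T(z, T) - softmax_T(v, T)) / T

def approx_grad(z, v, T):
    """High-temperature approximation (z_i - v_i) / (N * T^2)."""
    return (z - v) / (z.size * T ** 2)

rng = np.random.default_rng(1)
N = 10
v = rng.normal(size=N)
v = v - v.mean()  # enforce the zero-mean assumption on the teacher logits
z = rng.normal(size=N)
z = z - z.mean()  # enforce the zero-mean assumption on the student logits

rel_errors = {}
for T in (1.0, 10.0, 100.0):
    exact = exact_grad(z, v, T)
    approx = approx_grad(z, v, T)
    # Relative error of the approximation, measured in the Euclidean norm
    rel_errors[T] = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
    print(T, rel_errors[T])
```

The relative error should shrink as T grows. At T = 1 the approximation is not expected to be accurate at all.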

Experiments

The paper runs experiments on MNIST and on speech recognition, with good results.

My Understanding

The model is actually simple to describe, but understanding why it works takes some effort.

About T (temperature)

I wrote some code to simulate it:

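import numpy as np
import matplotlib.pyplot as plt

# Five random logits in [0, 20), stored as a column vector
z = np.random.rand(5, 1) * 20
# Temperatures 1..9 as a row vector, so z / T broadcasts to shape (5, 9)
T = np.arange(1, 10).reshape(1, 9)

# Softmax with temperature: each column is the distribution for one value of T
tmp = np.exp(z / T)
p = tmp / tmp.sum(axis=0)

# One curve per class, showing how its probability changes as T grows
plt.plot(T.ravel(), p.T)
plt.xlabel('T')
plt.ylabel('p')
plt.show()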

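Why it works

Below is my rough understanding.

It can be viewed as a kind of data augmentation. For example, suppose the classes are [BMW, truck, cat] and the label of one training sample is [1, 0, 0]. The large model outputs [0.9, 0.099, 0.001] instead, which captures the similarity between classes. In other words, it carries more information, so less information is lost when training the small network.

The temperature trick then makes this information more pronounced.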
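The effect can be seen on the [0.9, 0.099, 0.001] example. This is a rough sketch under one assumption: since the large model's logits are not given, `np.log` of the probabilities is used as one valid choice of logits (logits are only defined up to an additive constant). The helper `softmax_T` is a hypothetical name.

```python
import numpy as np

classes = ["BMW", "truck", "cat"]
p_T1 = np.array([0.9, 0.099, 0.001])  # large-model output at T = 1, taken from the example
logits = np.log(p_T1)                 # assumed logits; any constant shift gives the same softmax

def softmax_T(logits, T):
    """Softmax with temperature T; the max is subtracted for numerical stability."""
    e = np.exp((logits - logits.max()) / T)
    return e / e.sum()

# Print the softened distribution over (BMW, truck, cat) for several temperatures
for T in (1, 2, 5, 20):
    print(T, np.round(softmax_T(logits, T), 3))
```

As T grows, the probabilities of "truck" and "cat" rise while the ranking of the classes stays the same, so the small model receives a stronger signal that a BMW looks more like a truck than like a cat.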
