IJCV2021: Knowledge Distillation: A Survey

Abstract: This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher–student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.

Section 1: Introduction

Techniques for making models lightweight:
1. Designing efficient deep models
2. Pruning: parameter pruning and sharing
3. Low-rank factorization
4. Transferred compact convolutional filters (I am not very familiar with this one)
5. Knowledge distillation
Main idea of knowledge distillation: the student model mimics the teacher model to obtain competitive or even superior performance.
Key problem of knowledge distillation: how to represent and transfer knowledge from a large teacher model to a small student model.
Three components of a knowledge distillation system: knowledge, distillation algorithm, and teacher-student architecture.

Section 2: Knowledge

There are several categories of knowledge in knowledge distillation. The most basic form of knowledge is the teacher model's prediction logits; beyond that, the features of the teacher model's intermediate layers can also serve as representational knowledge to guide the student network. The relational information among different neurons and different feature layers of the teacher network, as well as the teacher model's parameters, also carries knowledge. The survey divides knowledge into three categories: response-based knowledge, feature-based knowledge, and relation-based knowledge.

Response-based knowledge: refers to the logits of the teacher model's output classes.

Response-based knowledge usually refers to the neural response of the last output layer of the teacher model. The main idea is to directly mimic the final prediction of the teacher model, which is simple yet effective. The idea of the response-based knowledge is straightforward and easy to understand, especially in the context of “dark knowledge”

The most typical response-based method is the temperature-scaled soft target approach proposed by Hinton et al. in 2015.
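As a concrete illustration, here is a minimal PyTorch-style sketch of the soft-target loss; the temperature value, the weighting factor `alpha`, and the function names are my own illustrative assumptions, not a reference implementation from the paper.

```python
# A minimal sketch of response-based distillation with temperature-scaled soft targets.
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student distributions.

    The T^2 factor keeps the gradient magnitude comparable to the hard-label loss,
    as suggested by Hinton et al. (2015).
    """
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Total loss = distillation term + ordinary cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    kd = soft_target_loss(student_logits, teacher_logits, T)
    return alpha * kd + (1.0 - alpha) * ce
```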
However, response-based knowledge usually relies on the output of the last layer, e.g., soft targets, and thus fails to address the intermediate-level supervision from the teacher model, which turns out to be very important for representation learning with very deep neural networks.
Feature-based knowledge: uses the features of the teacher's intermediate layers as knowledge.

A typical method is the hints approach of FitNets: the output of a teacher's hidden layer supervises the student's learning, i.e., the teacher's feature maps are used as knowledge.
The main idea is to directly match the feature activations of the teacher and the student. Inspired by this, a variety of other methods have been proposed to match the features indirectly, e.g., deriving attention maps from feature maps to represent knowledge, matching probability distributions in feature space, using the activation boundaries of hidden neurons for knowledge transfer, and cross-layer knowledge distillation, which adaptively assigns proper teacher layers to each student layer via attention allocation.
Loss functions used to supervise feature-based knowledge distillation include: L1 loss, L2 loss, cross-entropy (CE) loss, and maximum mean discrepancy (MMD) loss.
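For intuition, a minimal sketch of a FitNets-style hint loss with an L2 objective is given below; the 1x1-conv regressor, the channel sizes, and the choice of hint/guided layers are assumptions made for illustration.

```python
# A minimal sketch of feature-based (hint) distillation in the spirit of FitNets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # A regressor maps the student's guided-layer features into the teacher's
        # hint-layer feature space when their channel widths differ.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # L2 distance between the projected student features and the teacher's hint
        # features (treated as fixed targets); assumes matching spatial sizes.
        return F.mse_loss(self.regressor(student_feat), teacher_feat.detach())

# Usage: given feature maps of shape (N, C_s, H, W) and (N, C_t, H, W)
# hint = HintLoss(student_channels=64, teacher_channels=256)
# loss = hint(student_feature_map, teacher_feature_map)
```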
Open problems in feature-based knowledge distillation: though feature-based knowledge transfer provides favorable information for the learning of the student model, how to effectively choose the hint layers from the teacher model and the guided layers from the student model remains to be further investigated (Romero et al. 2015). Due to the significant size differences between hint and guided layers, how to properly match the feature representations of teacher and student also needs to be explored.
Relation-based knowledge: further explores the relationships between different layers or between data samples of the teacher-student models.
Common representations include the FSP matrix (flow of solution process), singular value decomposition (SVD), graphs, mutual information flow, and instance relationship graphs.
Relation-based distillation loss based on relations between feature maps (see the formula in the paper).

Relation-based distillation loss based on relations between instances (see the formula in the paper).

Different kinds of relation-based knowledge are summarized in the paper.

What is the difference between relations of feature maps and relations of instances? As I understand it, relations of feature maps capture relationships between different layers within a model (e.g., the FSP matrix), while relations of instances capture relationships between different data samples (e.g., pairwise distances between their embeddings); the sketch below illustrates both.
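A minimal sketch of both flavors, under assumed shapes and layer choices (an FSP-style Gram matrix for feature-map relations, RKD-style pairwise distances for instance relations):

```python
# A minimal sketch contrasting the two flavors of relation-based knowledge.
import torch
import torch.nn.functional as F

def fsp_matrix(feat_a, feat_b):
    """Relations BETWEEN FEATURE MAPS of one model (FSP-style).

    feat_a: (N, C1, H, W) and feat_b: (N, C2, H, W) come from two layers with the
    same spatial size; returns a (N, C1, C2) Gram-like matrix per sample.
    """
    n, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(n, c1, h * w)
    b = feat_b.reshape(n, c2, h * w)
    return torch.bmm(a, b.transpose(1, 2)) / (h * w)

def fsp_loss(t_feat_a, t_feat_b, s_feat_a, s_feat_b):
    # The student's FSP matrices are matched to the teacher's.
    return F.mse_loss(fsp_matrix(s_feat_a, s_feat_b),
                      fsp_matrix(t_feat_a, t_feat_b).detach())

def instance_relation_loss(teacher_emb, student_emb):
    """Relations BETWEEN INSTANCES in a batch (RKD-style pairwise distances).

    teacher_emb, student_emb: (N, D) embeddings of the same N samples.
    """
    t_dist = torch.cdist(teacher_emb, teacher_emb)
    s_dist = torch.cdist(student_emb, student_emb)
    # Normalize by the mean distance so the scales of the two models are comparable.
    t_dist = t_dist / (t_dist.mean() + 1e-8)
    s_dist = s_dist / (s_dist.mean() + 1e-8)
    return F.smooth_l1_loss(s_dist, t_dist.detach())
```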

Section 3: Distillation Schemes: offline distillation, online distillation, self-distillation

Offline distillation: the teacher model is trained first and then frozen, after which it supervises the training of the student model.
Online distillation: both the teacher model and the student model are updated simultaneously, and the whole knowledge distillation framework is end-to-end trainable.
Self-distillation: there is only one model; knowledge from the deeper sections of the network is distilled into its shallower sections.
Besides, offline, online and self-distillation can also be understood intuitively from the perspective of human teacher-student learning: offline distillation means a knowledgeable teacher teaches a student; online distillation means the teacher and student study together; self-distillation means the student learns by itself. Moreover, just like human learning, these three kinds of distillation can complement one another when combined, thanks to their respective advantages.
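To make the online scheme concrete, here is a minimal sketch in the spirit of deep mutual learning, one common online setup in which two peer networks train simultaneously and each uses the other's softened prediction as an extra target; the models, optimizers and temperature are assumptions for illustration.

```python
# A minimal sketch of online distillation via mutual learning between two peers.
import torch.nn.functional as F

def mutual_learning_step(model_a, model_b, opt_a, opt_b, images, labels, T=2.0):
    logits_a = model_a(images)
    logits_b = model_b(images)

    # Each peer combines its own supervised loss with a KL term toward the
    # other's softened prediction (treated as a fixed target via detach).
    kl_a = F.kl_div(F.log_softmax(logits_a / T, dim=1),
                    F.softmax(logits_b.detach() / T, dim=1),
                    reduction="batchmean") * T * T
    kl_b = F.kl_div(F.log_softmax(logits_b / T, dim=1),
                    F.softmax(logits_a.detach() / T, dim=1),
                    reduction="batchmean") * T * T
    loss_a = F.cross_entropy(logits_a, labels) + kl_a
    loss_b = F.cross_entropy(logits_b, labels) + kl_b

    # Both networks are updated in the same step: the framework is end-to-end trainable.
    opt_a.zero_grad()
    opt_b.zero_grad()
    loss_a.backward()
    loss_b.backward()
    opt_a.step()
    opt_b.step()
    return loss_a.item(), loss_b.item()
```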

Section 4: Teacher-Student Architecture

The complexity of deep neural networks mainly comes from two dimensions: depth and width. It is usually required to transfer knowledge from deeper and wider neural networks to shallower and thinner neural networks.
The student network is usually chosen to be:
1) a simplified version of the teacher network with fewer layers and fewer channels in each layer;
2) a quantized version of the teacher network in which the structure of the network is preserved;
3) a small network with efficient basic operations;
4) a small network with an optimized global network structure;
5) the same network as the teacher.

Section 5: Distillation Algorithms

1. Adversarial Distillation: inspired by GANs, many adversarial knowledge distillation methods have been proposed to enable the teacher and student networks to gain a better understanding of the true data distribution.
Adversarial distillation methods can be divided into three categories (see the figure in the paper).

I don't fully understand adversarial distillation yet.
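My current reading: in one common variant, a discriminator is trained to tell teacher outputs from student outputs, and the student is trained to fool it in addition to its normal loss. A minimal sketch under that assumption (the discriminator architecture, the weight `beta` and the step ordering are illustrative, not the survey's exact formulation):

```python
# A minimal sketch of one flavor of adversarial distillation on output logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LogitDiscriminator(nn.Module):
    """Scores how 'teacher-like' a logit vector looks."""
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_classes, 128),
                                 nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, logits):
        return self.net(logits)

def adversarial_distillation_step(student, teacher, disc, opt_student, opt_disc,
                                  images, labels, beta=0.1):
    with torch.no_grad():
        t_logits = teacher(images)  # frozen teacher
    s_logits = student(images)

    # 1) Discriminator update: teacher logits labeled 1, student logits labeled 0.
    t_score = disc(t_logits)
    s_score = disc(s_logits.detach())
    d_loss = (F.binary_cross_entropy_with_logits(t_score, torch.ones_like(t_score)) +
              F.binary_cross_entropy_with_logits(s_score, torch.zeros_like(s_score)))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # 2) Student update: classification loss plus an adversarial term that rewards
    #    outputs the discriminator mistakes for the teacher's. (Any gradients that
    #    reach the discriminator here are cleared at its next zero_grad.)
    adv_score = disc(s_logits)
    adv_loss = F.binary_cross_entropy_with_logits(adv_score, torch.ones_like(adv_score))
    s_loss = F.cross_entropy(s_logits, labels) + beta * adv_loss
    opt_student.zero_grad()
    s_loss.backward()
    opt_student.step()
    return d_loss.item(), s_loss.item()
```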

2. Multi-Teacher Distillation: Different teacher architectures can provide their own useful knowledge for a student network.
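A minimal sketch of one simple aggregation strategy, averaging the teachers' softened predictions into a single target for the student; equal teacher weights and the temperature are assumptions made for illustration.

```python
# A minimal sketch of multi-teacher distillation with averaged soft targets.
import torch
import torch.nn.functional as F

def multi_teacher_soft_target(teachers, images, T=4.0):
    # Average the temperature-softened probability distributions of all teachers.
    with torch.no_grad():
        probs = [F.softmax(t(images) / T, dim=1) for t in teachers]
    return torch.stack(probs, dim=0).mean(dim=0)  # (N, num_classes)

def multi_teacher_kd_loss(student_logits, teachers, images, T=4.0):
    target = multi_teacher_soft_target(teachers, images, T)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, target, reduction="batchmean") * T * T
```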

3. Cross-Modal Distillation

Related work:

[1] CVPR 2016: Cross modal distillation for supervision transfer
[2] ECCV 2018: Modality distillation with multiple stream networks for action recognition
[3] CVPR 2018: Through-wall human pose estimation using radio signals
[4] ICASSP 2018: Cross-modality distillation: A case for conditional generative adversarial networks
[5] CVPR 2020: Knowledge as Priors: Cross-Modal Knowledge Generalization for Datasets without Superior Knowledge
[6] ACM MM 2018: Emotion recognition in speech using cross-modal transfer in the wild
[7] ICIP 2019: Cross-modal knowledge distillation for action recognition
[8] ICLR 2020: Contrastive representation distillation
[9] ICCV 2019: Compact trilinear interaction for visual question answering
[10] ECCV 2018: Learning deep representations with probabilistic knowledge transfer
[11] CVPR 2016: Learning with side information through modality hallucination
[12] CVPR 2019: UM-Adapt: Unsupervised multi-task adaptation using adversarial cross-task distillation
[13] CVPR 2019: CrDoCo: Pixel-level domain transfer with cross-domain consistency
[14] BMVC 2017: Adapting models to signal degradation using distillation
[15] ECCV 2018: Graph distillation for action detection with privileged modalities
[16] PR 2019: Spatiotemporal distilled dense-connectivity network for video action recognition
[17] ICASSP 2019: Multi-teacher knowledge distillation for compressed video action recognition on deep neural networks
[18] AAAI 2020: Knowledge integration networks for action recognition

4. Graph-based Distillation: explores intra-data relationships using graphs. The main ideas of these graph-based distillation methods are 1) to use the graph as the carrier of teacher knowledge; or 2) to use the graph to control the message passing of the teacher knowledge. A generic framework for graph-based distillation is shown in Fig. 13 of the paper.
5. Attention-based Distillation: The core of attention transfer is to define the attention maps for feature embedding in the layers of a neural network. That is to say, knowledge about feature embedding is transferred using attention map functions.
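A minimal sketch of attention transfer in this spirit: attention maps are obtained by aggregating squared activations over channels, and the student matches the teacher's normalized maps. The aggregation (mean of squares) and the requirement that paired layers share the same spatial size are my assumptions.

```python
# A minimal sketch of attention-based distillation (attention transfer).
import torch
import torch.nn.functional as F

def attention_map(feature_map):
    # (N, C, H, W) -> (N, H*W): channel-wise mean of squared activations,
    # flattened and L2-normalized so only the spatial attention pattern matters.
    a = feature_map.pow(2).mean(dim=1).flatten(1)
    return F.normalize(a, p=2, dim=1)

def attention_transfer_loss(teacher_feats, student_feats):
    # Sum the attention-map discrepancies over the selected layer pairs.
    return sum(F.mse_loss(attention_map(s), attention_map(t.detach()))
               for t, s in zip(teacher_feats, student_feats))
```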
6. Data-Free Distillation: no training data is available; instead, the data is newly or synthetically generated.
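A minimal sketch of one data-free strategy: synthesize pseudo-inputs by optimizing noise so that the frozen teacher predicts chosen pseudo-labels confidently, then run ordinary distillation on those samples. The optimization schedule, learning rate, and the absence of extra regularizers (e.g., batch-norm statistics matching) are simplifying assumptions.

```python
# A minimal sketch of synthesizing pseudo-data from a frozen teacher.
import torch
import torch.nn.functional as F

def synthesize_batch(teacher, batch_size, num_classes, image_shape, steps=200, lr=0.1):
    teacher.eval()
    # Start from random noise images and random pseudo-labels.
    x = torch.randn(batch_size, *image_shape, requires_grad=True)
    targets = torch.randint(0, num_classes, (batch_size,))
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Push the teacher toward confident predictions on the chosen pseudo-labels.
        loss = F.cross_entropy(teacher(x), targets)
        loss.backward()
        opt.step()
    # The synthetic batch can then be fed into a standard distillation loss.
    return x.detach(), targets
```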

7. Quantized Distillation: applies quantization within the teacher–student framework: the full-precision teacher network is first quantized on the feature maps, and then the knowledge is transferred from the quantized teacher to a quantized student network.

I don't fully understand this one yet.
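To unpack it a bit, here is a rough sketch of feature-map fake quantization with a straight-through estimator, plus a distillation loss between the quantized teacher and student features. This is my own simplified reading, not the exact procedure from the cited works; the bit-width and per-tensor scaling are assumptions.

```python
# A rough sketch: uniformly fake-quantize feature maps and distill between them.
import torch
import torch.nn.functional as F

def fake_quantize(x, num_bits=8):
    # Per-tensor symmetric quantization grid; values stay float but are rounded
    # onto 2^num_bits levels, with the scale derived from the current max magnitude.
    scale = x.detach().abs().max().clamp(min=1e-8) / (2 ** (num_bits - 1) - 1)
    q = torch.round(x / scale) * scale
    # Straight-through estimator: quantized values forward, identity backward.
    return x + (q - x).detach()

def quantized_feature_distillation_loss(teacher_feat, student_feat, num_bits=8):
    # Assumes the two feature maps have matching shapes.
    q_teacher = fake_quantize(teacher_feat.detach(), num_bits)
    q_student = fake_quantize(student_feat, num_bits)
    return F.mse_loss(q_student, q_teacher)
```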
8. Lifelong Distillation: based on lifelong learning, which includes continual learning, continuous learning and meta-learning.
9. NAS-Based Distillation

Section 6: Performance Comparison

The survey reports many comparisons of teacher–student pairs on CIFAR-10 and CIFAR-100; see the original paper for details.

Section 7: Application

Section 8: Conclusion and Discussion

Challenges: 1) the quality of knowledge, 2) the types of distillation, 3) the design of the teacher-student architectures, 4) the theory behind knowledge distillation

Future Directions: see the original paper.
