ScaleKD: Strong Vision Transformers Could Be Excellent Teachers

¹ Intel Labs China ² iMotion Automotive Technology
Thirty-eighth Conference on Neural Information Processing Systems (NeurIPS 2024)
* Core authors contributed to method formulation, experimental design, and analysis. † Corresponding author.

Abstract

In this paper, we ask whether well pre-trained vision transformer (ViT) models can serve as teachers that exhibit scalable properties and thereby advance cross-architecture knowledge distillation research, with evaluation on mainstream large-scale visual recognition datasets. To make this possible, our analysis underlines the importance of effective strategies for aligning (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three closely coupled components, namely a cross attention projector, dual-view feature mimicking, and teacher parameter perception, each tailored to the alignment problems stated above, we present a simple and effective knowledge distillation method called ScaleKD. Our method can train student backbones that span a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets, achieving state-of-the-art knowledge distillation performance. For instance, taking a well pre-trained Swin-L as the teacher model, our method achieves 75.15%|82.03%|84.16%|78.63%|81.96%|83.93%|83.80%|85.53% top-1 accuracies for MobileNet-V1|ResNet-50|ConvNeXt-T|Mixer-S/16|Mixer-B/16|ViT-S/16|Swin-T|ViT-B/16 models trained on the ImageNet-1K dataset from scratch, showing 3.05%|3.39%|2.02%|4.61%|5.52%|4.03%|2.62%|3.73% absolute gains over the individually trained counterparts. Intriguingly, when scaling up the size of the teacher models or their pre-training datasets, our method showcases the desired scalable properties, bringing increasingly larger gains to student models. We also empirically show that the student backbones trained by our method transfer well to the downstream MS-COCO and ADE20K datasets.
More importantly, when a strong pre-trained ViT is available, our method can serve as a more efficient alternative to the time-intensive pre-training paradigm for any target student model on large-scale datasets, reducing the number of viewed training samples by up to 195×.
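The accuracy and gain figures quoted in the abstract can be paired up per student model to make them easier to read. The short Python sketch below does this and also derives the baseline accuracy implied by each pair (ScaleKD accuracy minus absolute gain). The implied baselines are computed here by subtraction and are not quoted from the paper's tables; the variable and function names are illustrative only.

```python
# Figures quoted in the abstract (Swin-L teacher, students trained on ImageNet-1K).
students = ["MobileNet-V1", "ResNet-50", "ConvNeXt-T", "Mixer-S/16",
            "Mixer-B/16", "ViT-S/16", "Swin-T", "ViT-B/16"]
scalekd_top1 = [75.15, 82.03, 84.16, 78.63, 81.96, 83.93, 83.80, 85.53]
absolute_gain = [3.05, 3.39, 2.02, 4.61, 5.52, 4.03, 2.62, 3.73]


def implied_baselines(accuracies, gains):
    """Return accuracy minus gain for each model, rounded to two decimals.

    This is a derived quantity (an assumption of this sketch), not a
    number reported in the text above.
    """
    return [round(acc - gain, 2) for acc, gain in zip(accuracies, gains)]


baselines = implied_baselines(scalekd_top1, absolute_gain)

# Print one row per student: implied baseline, ScaleKD accuracy, gain.
for name, base, acc, gain in zip(students, baselines, scalekd_top1, absolute_gain):
    print(f"{name:<13} {base:6.2f} -> {acc:6.2f}  (+{gain:.2f})")
```

Read this way, the largest quoted gain is for Mixer-B/16 (+5.52) and the smallest is for ConvNeXt-T (+2.02).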

Experiments

Table 1: Main results of ScaleKD on 11 teacher-student network pairs. † denotes a model pre-trained on IN-22K and ‡ denotes a model pre-trained by EVA, which carries knowledge learned from LAION-2B.

Table 2: Experiments exploring scalable properties arising from the teacher’s pre-training data. We use the best reported models with different pre-training methods as our baselines to examine whether our student model has learned the teacher’s pre-training knowledge. We use Swin-L as the teacher for the first two experiments and BEiT-L/14 as the teacher for the remaining two. ⇒ denotes transfer learning and * denotes that the model is trained and tested at 384×384 input resolution.

Table 3: Transfer learning results (%) on MS-COCO.

Table 4: Transfer learning results (%) on ADE20K.

@article{fan2024scalekd,
  title={ScaleKD: Strong Vision Transformers Could Be Excellent Teachers},
  author={Fan, Jiawei and Li, Chao and Liu, Xiaolong and Yao, Anbang},
  journal={Thirty-eighth Conference on Neural Information Processing Systems},
  year={2024}
}