Multi-aspect Knowledge Distillation with Large Language Model
Taegyeong Lee, Jinsik Bang, Soyeong Kwon, and Taehwan Kim

TL;DR
This paper introduces a multi-aspect knowledge distillation approach using Multimodal Large Language Models to transfer complex, diverse visual knowledge to improve image classification performance.
Contribution
The paper proposes a novel multi-aspect knowledge distillation method leveraging MLLMs, enabling models to learn diverse and complex visual aspects beyond traditional class labels.
Findings
Improved performance on image classification benchmarks.
Effective transfer of multi-aspect visual knowledge.
Enhanced understanding of complex visual features.
Abstract
Recent advancements in deep learning have significantly improved performance on computer vision tasks. Previous image classification methods primarily modify model architectures or add features, and they optimize models using cross-entropy loss on class logits. Because they focus on classifying images by considering class labels, these methods may struggle to learn various aspects of classes (e.g., natural positions and shape changes). Rethinking the previous approach from a novel view, we propose a multi-aspect knowledge distillation method using Multimodal Large Language Models (MLLMs). Our approach involves: 1) querying a Large Language Model with multi-aspect questions relevant to the knowledge we want to transfer to the model, 2) extracting the corresponding logits from the MLLM, and 3) expanding the model's output dimensions to distill these multi-aspect logits. We then apply…
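The three steps listed in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `yes_probability` assumes each aspect question is answered yes/no and that the MLLM exposes logits for those two tokens, and `multi_aspect_loss` assumes the extra output dimensions are trained with a binary cross-entropy against those soft targets, weighted by an illustrative `alpha`. The abstract is cut off before it states the actual loss, so treat this as one plausible reading rather than the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def yes_probability(yes_logit, no_logit):
    """Soft target for one aspect question (assumed yes/no token logits)."""
    return softmax([yes_logit, no_logit])[0]


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, written in its stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def multi_aspect_loss(student_logits, label, aspect_targets, num_classes, alpha=1.0):
    """Class cross-entropy plus an aspect distillation term.

    The student emits num_classes + num_aspects logits: the first block is
    trained on the class label, the second on the MLLM-derived soft targets.
    The BCE term and the alpha weight are assumptions of this sketch.
    """
    class_logits = student_logits[:num_classes]
    aspect_logits = student_logits[num_classes:]
    if len(aspect_logits) != len(aspect_targets):
        raise ValueError("one aspect target is needed per extra output dimension")

    class_loss = -math.log(softmax(class_logits)[label])
    aspect_loss = sum(
        bce_with_logits(z, t) for z, t in zip(aspect_logits, aspect_targets)
    ) / len(aspect_targets)
    return class_loss + alpha * aspect_loss


# Example: 3 classes and 2 aspect questions, so the student has 5 outputs.
targets = [yes_probability(2.0, -1.0), yes_probability(-0.5, 1.5)]
loss = multi_aspect_loss(
    [1.2, 0.1, -0.3, 0.8, -0.6], label=0, aspect_targets=targets, num_classes=3
)
```

At inference time, a model trained this way would presumably use only the first `num_classes` outputs for prediction; the aspect outputs act as an auxiliary training signal in this sketch.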
Peer Reviews
Decision·Submitted to ICLR 2025
1. This paper is written in a clear and straightforward manner, making it easy to quickly grasp the method's approach. 2. The paper conducts many experiments, and the figures and tables are well organized. 3. The authors claim to be the first to offer a novel perspective on distilling multi-aspect knowledge regarding abstract and complex concepts. I have seen the authors' efforts in the design of knowledge transfer.
1. The proposed method shows some improvement on some classic CNN-based models but lacks experiments on ViT-based models. 2. In the knowledge distillation task, the comparison is only done with KD, lacking comparisons with other knowledge distillation methods [1,2]. 3. The improvement in object detection tasks in Table 7 is very limited, and there is no comparison with currently well-performing object detection methods. Object detection is inherently a more fine-grained visual task than classif
The paper is well-written, with a clear and logical flow from the introduction through to the conclusion. The authors present simple ideas in a straightforward manner, making the paper accessible to readers from diverse backgrounds. The experimental setup is meticulously organized, with each step of the process described in a way that facilitates reproducibility. The authors outline the methodologies, datasets, and evaluation metrics in clear subsections, allowing readers to follow the experimen
1. Limited Experimental Setting: The experimental setting is narrow, which restricts the generalizability of the findings. The scale of datasets is small and may not be sufficient to demonstrate the robustness of the proposed method across different scenarios. Expanding the experimental scope to include more varied or challenging datasets such as the full ImageNet would significantly strengthen the paper. 2. Lack of novelty: The proposed method directly adopts the MLLM’s output logit to perform
1. The core idea is simple but looks effective. 2. The paper writing is fluent and easy to follow. 3. The paper conducts experiments on six different fine-grained datasets and two different coarse-grained datasets. The results show that the proposed method achieves stable performance improvement, especially on the fine-grained datasets. 4. The ablation studies and related visualization are comprehensive and insightful.
1. The evaluation datasets in the paper are relatively small, and the model sizes appear insufficient for 2024. Using ResNet18/34 as the primary model limits the assessment of the framework’s scalability. It would be valuable to test the framework on a larger dataset, such as ImageNet, and with a more complex model like ResNet101, to assess its effectiveness in a more challenging setting. 2. The paper lacks comparisons with other knowledge distillation (KD) baselines, which would provide a
Taxonomy
Topics
Text and Document Classification Technologies
Methods
Focus · Knowledge Distillation
