TL;DR
This paper introduces a multimodal document classification method using side-tuning, which effectively combines different data sources like text and images, surpassing current accuracy benchmarks.
Contribution
It applies the side-tuning framework to multimodal data, enabling better model adaptation and avoiding issues like model rigidity and catastrophic forgetting.
Findings
Achieves higher accuracy than existing methods
Successfully combines text and image data for classification
Demonstrates effectiveness of side-tuning in multimodal settings
Abstract
In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a recently introduced methodology for network adaptation that addresses some of the problems of previous approaches. In particular, it makes it possible to overcome the model rigidity and catastrophic forgetting that affect transfer learning by fine-tuning. The proposed solution uses off-the-shelf deep learning architectures and leverages the side-tuning framework to combine a base model with a tandem of two side networks. We show that side-tuning can also be employed successfully when different data sources are considered, e.g. text and images in document classification. The experimental results show that this approach pushes the limit of document classification accuracy beyond the state of the art.
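The combination of a base model with two side networks described in the abstract can be illustrated with a minimal sketch. It assumes the outputs are merged by a convex weighted sum, which is one common way to blend a base and a side network in side-tuning; the function name `side_tune_blend`, the toy vectors, and the weights are hypothetical and are not taken from the paper.

```python
def side_tune_blend(base_out, side_outs, weights):
    """Blend a frozen base output with side-network outputs.

    Assumption: the merge is a convex combination (non-negative weights
    summing to 1). The paper's exact merge rule may differ.
    """
    outputs = [base_out, *side_outs]
    if len(weights) != len(outputs):
        raise ValueError("need one weight per network output")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if any(len(o) != len(base_out) for o in outputs):
        raise ValueError("all outputs must have the same dimension")
    # Element-wise weighted sum across the networks.
    return [
        sum(w * o[i] for w, o in zip(weights, outputs))
        for i in range(len(base_out))
    ]


# Toy feature vectors standing in for real network outputs (hypothetical).
base_features = [1.0, 0.0]        # frozen base model
image_side_features = [0.0, 1.0]  # trainable image side network
text_side_features = [0.5, 0.5]   # trainable text side network

blended = side_tune_blend(
    base_features,
    [image_side_features, text_side_features],
    weights=(0.2, 0.3, 0.5),
)
print(blended)
```

In this sketch the base output is never modified, only reweighted, which mirrors the idea that the base model stays fixed while the side networks adapt to the new task.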
Taxonomy
Methods: Residual Connection · Depthwise Convolution · Pointwise Convolution · Batch Normalization · Depthwise Separable Convolution · Max Pooling · Global Average Pooling · Bottleneck Residual Block · Residual Block
