Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta
Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham

TL;DR
This paper introduces a framework that combines the CoAtNet architecture with model soups to improve classification of scarce, visually similar Intangible Cultural Heritage images, achieving state-of-the-art accuracy while reducing variance.
Contribution
It presents a hybrid model soup approach that ensembles diverse checkpoints to enhance generalization in low-resource cultural heritage image classification.
Findings
Model soups reduce variance and stabilize predictions.
Achieves 72.36% top-1 accuracy on the ICH-17 dataset.
Outperforms baseline models like ResNet-50 and ViT.
Abstract
The classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges due to limited annotated data, high visual similarity among classes, and domain heterogeneity. In such low-resource settings, conventional deep learning models often suffer from high variance or overfit to spurious correlations, leading to poor generalization. To address these limitations, we propose a robust framework that integrates the hybrid CoAtNet architecture with model soups, a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory without increasing inference cost. CoAtNet captures both local and global patterns through stage-wise fusion of convolution and self-attention. We apply two ensembling strategies - greedy and uniform soup - to selectively combine diverse checkpoints into a final model. Beyond performance…
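The two soup strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each checkpoint is a plain dictionary of flattened parameter lists, that `evaluate` is a hypothetical callback returning held-out accuracy (higher is better), and that the greedy variant keeps a candidate whenever the averaged score does not drop, which is the usual model-soup recipe rather than a detail stated in this summary.

```python
from typing import Callable, Dict, List

# Assumed representation: parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def uniform_soup(checkpoints: List[Weights]) -> Weights:
    """Average every parameter element-wise across all checkpoints."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


def greedy_soup(
    checkpoints: List[Weights],
    evaluate: Callable[[Weights], float],  # hypothetical held-out scorer
) -> Weights:
    """Add checkpoints in order of individual score, keeping each one only
    if the averaged soup scores at least as well as the current best."""
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best = evaluate(ranked[0])
    for candidate in ranked[1:]:
        score = evaluate(uniform_soup(ingredients + [candidate]))
        if score >= best:
            ingredients.append(candidate)
            best = score
    return uniform_soup(ingredients)
```

Either function returns a single set of weights, so the resulting model costs the same at inference time as any one checkpoint. With a real network, the dictionaries would be replaced by the framework's own parameter containers.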
Taxonomy
Topics: Archaeological Research and Protection · Aesthetic Perception and Analysis · Generative Adversarial Networks and Image Synthesis
