# Hybrid lightweight vision transformers with attention mechanism for feature extraction and classification of product designs

**Authors:** Abdul Wahid, Hikmat Ullah Khan, Anam Naz, Fawaz Khaled Alarfaj

PMC · DOI: 10.1371/journal.pone.0343510 · PLOS One · 2026-03-03

## TL;DR

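This paper proposes LeViT, a hybrid model that combines convolutional neural networks with vision transformers to classify product packaging designs. It reports 95% classification accuracy on a packaging image dataset, ahead of ResNet-50, RegNet, and ConvNeXt.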

## Contribution

A novel hybrid model combining CNNs and vision transformers for efficient and accurate packaging design classification.

## Key findings

- The proposed LeViT model achieves 95% classification accuracy on packaging design images.
- LeViT outperforms CNN-based models like ResNet-50, RegNet, and ConvNeXt in packaging classification tasks.
- The model effectively captures both local and global visual features for improved performance.
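
The headline metric above is classification accuracy, the fraction of images whose predicted class matches the true class. The following is a minimal sketch of that calculation; the function name `accuracy` and the example labels are illustrative assumptions, not data or code from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels, for illustration only (not taken from the paper).
y_true = ["box", "bottle", "box", "can"]
y_pred = ["box", "bottle", "can", "can"]
print(accuracy(y_true, y_pred))  # 3 of 4 correct -> 0.75
```

Accuracy alone can hide per-class weaknesses when classes are imbalanced, so it is usually read alongside per-class metrics where those are available.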

## Abstract

In modern consumer markets, product packaging strongly influences customer attention and buying decisions. Attractive and informative designs help brands stand out in competitive environments. Recently, Artificial Intelligence (AI) has been widely used to support packaging evaluation, especially for design analysis, personalized user experiences, and product recommendation systems. However, traditional deep learning models, such as CNN-based ResNet-50 architectures, often fail to capture long-range relationships and global visual context. These limitations reduce their effectiveness in complex visual tasks like packaging classification. To address this issue, this study investigates the use of vision transformer-based models for packaging design analysis. We propose LeViT, an efficient hybrid architecture that combines convolutional neural networks with vision transformers. This design enables the model to learn both local visual details and global contextual features. The proposed approach improves feature representation while maintaining computational efficiency. Experiments were conducted on an image dataset of packaging designs. The performance of LeViT was compared with state-of-the-art models, including CNN-ResNet-50, RegNet, and ConvNeXt. The results show that the proposed model achieves the highest classification accuracy of 95%, outperforming all comparison methods. These findings demonstrate the effectiveness of transformer-based architectures for packaging classification. The proposed approach offers practical benefits for retail analytics, brand assessment, and marketing decision-making.
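
The local-then-global idea described in the abstract can be illustrated with a small, dependency-free sketch: a strided convolution extracts local features, the resulting grid is flattened into tokens, and self-attention mixes information across all tokens. This is not the authors' implementation. The function names, the single-channel input, the random kernels, and the use of identity query/key/value projections (no learned weights) are all simplifying assumptions made for illustration.

```python
import math
import random


def conv2d_relu(image, kernel, stride):
    """Valid strided 2D cross-correlation of a single-channel image, then ReLU."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - k + 1, stride):
        row = []
        for j in range(0, w - k + 1, stride):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(k) for b in range(k))
            row.append(max(s, 0.0))
        out.append(row)
    return out


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained model would learn these projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens))
                    for c in range(d)])
    return out


def hybrid_features(image, kernels, stride=2):
    """Convolution (local detail) followed by attention (global context)."""
    # Local stage: one feature map per kernel.
    maps = [conv2d_relu(image, k, stride) for k in kernels]
    rows, cols = len(maps[0]), len(maps[0][0])
    # One token per spatial location, one channel per kernel.
    tokens = [[m[i][j] for m in maps] for i in range(rows) for j in range(cols)]
    # Global stage: every token attends to every other token.
    attended = self_attention(tokens)
    # Global average pooling gives a fixed-length feature vector.
    return [sum(t[c] for t in attended) / len(attended)
            for c in range(len(kernels))]


random.seed(0)
image = [[random.random() for _ in range(8)] for _ in range(8)]
kernels = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(4)]
features = hybrid_features(image, kernels)
print(len(features))  # one value per kernel -> 4
```

In a classifier, the pooled feature vector would feed a final linear layer over the packaging classes. Downsampling with convolutions before attention also shrinks the token count (here an 8×8 image becomes 9 tokens), which keeps the quadratic cost of attention small.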

## Full-text entities

- **Diseases:** brain tumors (MESH:D001932), plant diseases (MESH:D010939), breast cancer (MESH:D001943), fruit and vegetable disease (MESH:D018458), crack (MESH:D003387), seizure (MESH:D012640), lung cancer (MESH:D008175), Parkinson Disease (MESH:D010300)
- **Chemicals:** LeViT (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12956104/full.md

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12956104/full.md

## References

62 references — full list in the complete paper: https://tomesphere.com/paper/PMC12956104/full.md

---
Source: https://tomesphere.com/paper/PMC12956104