CTRL-F: Pairing Convolution with Transformer for Image Classification via Multi-Level Feature Cross-Attention and Representation Learning Fusion

Hosam S. EL-Assiouti; Hadeer El-Saadawy; Maryam N. Al-Berry; Mohamed F. Tolba

arXiv:2407.06673 · cs.CV · August 26, 2025

TL;DR

CTRL-F is a hybrid convolution-transformer model that leverages multi-level feature cross-attention and novel fusion techniques to enhance image classification performance, especially in limited data scenarios.

Contribution

The paper introduces a lightweight hybrid network combining convolution and transformer modules with novel fusion techniques and multi-level feature cross-attention for improved image classification.

Findings

1. Achieves state-of-the-art accuracy on benchmark datasets.
2. Performs well in both large-data and low-data regimes.
3. Demonstrates robustness and superior generalization.

Abstract

Transformers have captured growing attention in computer vision, thanks to their large capacity and global processing capabilities. However, transformers are data hungry, and their ability to generalize is constrained compared to Convolutional Neural Networks (ConvNets), especially when trained with limited data, due to the absence of the built-in spatial inductive biases present in ConvNets. In this paper, we strive to optimally combine the strengths of both convolution and transformers for image classification tasks. To this end, we present a novel lightweight hybrid network, named CTRL-F, that pairs Convolution with Transformers via Representation Learning Fusion and Multi-Level Feature Cross-Attention. Our network comprises a convolution branch and a novel transformer module named multi-level feature cross-attention (MFCA). The MFCA module operates on multi-level feature…
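As a rough illustration of the general idea behind cross-attention between feature levels, here is a minimal sketch in plain Python. It is not the authors' MFCA module: the abstract is truncated before it describes the design, so the token values, the choice of which level supplies the queries, and the omission of learned projection matrices are all assumptions made for illustration. The names `cross_attention`, `fine_tokens`, and `coarse_tokens` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    Each query token produces a weighted average of the value tokens,
    with weights given by its similarity to the key tokens.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Hypothetical toy tokens standing in for two feature levels (assumption:
# both levels have already been embedded to the same dimension).
fine_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # higher-resolution level
coarse_tokens = [[0.5, 0.5], [2.0, -1.0]]           # lower-resolution level

# Assumption: the coarse level queries the fine level; the paper may differ.
fused = cross_attention(coarse_tokens, fine_tokens, fine_tokens)
print(fused)
```

Each row of `fused` corresponds to one coarse token and is a convex combination of the fine tokens, which is what lets information flow from one feature level to the other.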

Code & Models

Repositories

hosamsherif/ctrl-f (PyTorch, official)

Taxonomy

Topics: Face and Expression Recognition

Methods: Softmax · Attention Is All You Need · Convolution