EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval

Yuhan Chen; Pengwen Dai; Chuan Wang; Dayan Wu; Xiaochun Cao

arXiv:2603.25267 · cs.CV · April 2, 2026

TL;DR

EagleNet is an energy-aware, fine-grained relationship learning network that improves text-video retrieval by capturing contextual information across a video's frames and tightening cross-modal alignment.

Contribution

The paper proposes a Fine-Grained Relationship Learning (FRL) mechanism and energy-aware matching (EAM) to generate context-aware text embeddings and model text-video interaction energy, improving text-video retrieval accuracy.

Findings

01

EagleNet outperforms existing methods on MSRVTT, DiDeMo, MSVD, and VATEX datasets.

02

The energy-aware matching improves the modeling of real text-video pair distributions.

03

Replacing softmax contrastive loss with sigmoid loss stabilizes training and enhances performance.
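To make the third finding concrete, the sketch below contrasts the two loss families on a small text-video similarity matrix. It is an illustration under stated assumptions, not the paper's implementation: the sigmoid variant follows the general pairwise form popularized by SigLIP, and the names `softmax_contrastive_loss`, `sigmoid_pairwise_loss`, `temperature`, `scale`, and `bias`, along with their default values, are hypothetical choices made here.

```python
import math


def softmax_contrastive_loss(sim, temperature=0.05):
    """Symmetric softmax (InfoNCE-style) loss.

    sim[i][j] is the similarity of text i and video j; matched pairs
    sit on the diagonal. `temperature` is an assumed value.
    """
    n = len(sim)

    def directional(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # -log softmax at the matched index
        return total / n

    cols = [list(c) for c in zip(*sim)]  # video-to-text direction
    return 0.5 * (directional(sim) + directional(cols))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_pairwise_loss(sim, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss: each (text, video) pair is an independent
    binary decision, so no normalization over the batch is needed.
    `scale` and `bias` are assumed values, not taken from the paper.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = scale * sim[i][j] + bias
            total += softplus(-label * logit)  # equals -log(sigmoid(label * logit))
    return total / n


if __name__ == "__main__":
    aligned = [[0.9, 0.1], [0.2, 0.8]]
    print(softmax_contrastive_loss(aligned))
    print(sigmoid_pairwise_loss(aligned))
```

The practical difference is that the softmax loss couples every pair in a batch through its normalizing sum, whereas the sigmoid loss scores each pair on its own; whether this is the mechanism behind the stability the authors report is not stated in this summary.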

Abstract

Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text…

Code & Models

Repositories

draym28/EagleNet
github
