# Text-Guided Visual Representation Optimization for Sensor-Acquired Video Temporal Grounding

**Authors:** Yun Tian, Xiaobo Guo, Jinsong Wang, Xinyue Liang

PMC · DOI: 10.3390/s25154704 · Sensors (Basel, Switzerland) · 2025-07-30

## TL;DR

This paper introduces a text-guided framework for video temporal grounding that refines visual representations along both spatial and temporal dimensions, improving the alignment between video content and natural language queries.

## Contribution

A text-guided framework, built on CLIP's unified cross-modal embedding space, with a spatial module (SVRO) and a temporal module (TVRO) that refine visual representations for video temporal grounding.
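
The spatial idea, keeping only the patches of a frame that relate to the query, can be illustrated with a minimal sketch. It assumes patch and text embeddings already live in a shared space (as CLIP provides) and ranks patches by cosine similarity; the function name `select_salient_patches`, the top-`k` rule, and the toy vectors are illustrative assumptions, not the paper's actual SVRO design.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_salient_patches(patch_embs, text_emb, k):
    """Return indices of the k patches most similar to the text (hypothetical helper)."""
    scores = [cosine(p, text_emb) for p in patch_embs]
    ranked = sorted(range(len(patch_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # keep the original spatial order of the survivors

# Toy 2-D embeddings standing in for real patch and query features (assumed values).
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text = [1.0, 0.1]
kept = select_salient_patches(patches, text, k=2)
```

Here `kept` holds the indices of the two patches pointing most nearly in the same direction as the query embedding; a learned scoring function could replace the plain cosine ranking.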

## Key findings

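- The proposed framework outperforms state-of-the-art methods across multiple metrics on Charades-STA, ActivityNet Captions, and TACoS.
- The SVRO module selects text-relevant patches within frames, and the TVRO module uses a temporal triplet loss to focus attention on text-relevant frames; together they narrow the cross-modal gap.
- A self-supervised contrastive loss at the clip–text level improves inter-clip discrimination by maximizing semantic variance during training.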
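
The clip–text contrastive objective mentioned above can be sketched with a generic InfoNCE-style loss. This summary does not give the paper's exact formulation, so the function `info_nce`, the temperature value, and the similarity matrices below are assumptions chosen only to show how such a loss rewards clips that are distinguishable from one another.

```python
import math

def info_nce(sim, temperature=0.07):
    """Mean InfoNCE loss; sim[i][j] is the similarity of clip i to text j.

    Matched clip-text pairs are assumed to sit on the diagonal.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Assumed toy similarities: well-separated clips vs. indistinguishable clips.
aligned = [[0.9, 0.1], [0.2, 0.8]]
confused = [[0.5, 0.5], [0.5, 0.5]]
loss_aligned = info_nce(aligned)
loss_confused = info_nce(confused)
```

When every clip looks equally similar to every query, the loss sits at log(N) for N candidates; it falls toward zero as matched pairs separate from mismatched ones.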

## Abstract

Video temporal grounding (VTG) aims to localize a semantically relevant temporal segment within an untrimmed video based on a natural language query. The task continues to face challenges arising from cross-modal semantic misalignment, which is largely attributed to redundant visual content in sensor-acquired video streams, linguistic ambiguity, and discrepancies in modality-specific representations. Most existing approaches rely on intra-modal feature modeling, processing video and text independently throughout the representation learning stage. However, this isolation undermines semantic alignment by neglecting the potential of cross-modal interactions. In practice, a natural language query typically corresponds to spatiotemporal content in video signals collected through camera-based sensing systems, encompassing a particular sequence of frames and its associated salient subregions. We propose a text-guided visual representation optimization framework tailored to enhance semantic interpretation over video signals captured by visual sensors. This framework leverages textual information to focus on spatiotemporal video content, thereby narrowing the cross-modal gap. Built upon the unified cross-modal embedding space provided by CLIP, our model leverages video data from sensing devices to structure representations and introduces two dedicated modules to semantically refine visual representations across spatial and temporal dimensions. First, we design a Spatial Visual Representation Optimization (SVRO) module to learn spatial information within individual frames. It selects salient patches related to the text, capturing more fine-grained visual details. Second, we introduce a Temporal Visual Representation Optimization (TVRO) module to learn temporal relations across frames. A temporal triplet loss is employed in TVRO to enhance attention on text-relevant frames and capture clip semantics. Additionally, a self-supervised contrastive loss is introduced at the clip–text level to improve inter-clip discrimination by maximizing semantic variance during training. Experiments on Charades-STA, ActivityNet Captions, and TACoS, widely used benchmark datasets, demonstrate that our method outperforms state-of-the-art methods across multiple metrics.
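
To make the temporal triplet loss more concrete, here is a minimal sketch of a standard margin-based triplet loss. It assumes the query embedding acts as the anchor, a frame inside the target segment as the positive, and a frame outside it as the negative; that pairing, the Euclidean distance, the margin of 0.2, and the toy vectors are all assumptions, since the abstract does not specify how TVRO forms its triplets.

```python
import math

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the positive is closer than the negative by `margin`."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

# Assumed toy embeddings: a query, a frame inside the segment, a frame outside it.
query = [1.0, 0.0]
inside_frame = [0.9, 0.1]
outside_frame = [0.0, 1.0]

loss_good = triplet_loss(query, inside_frame, outside_frame)  # text-relevant frame is closer
loss_bad = triplet_loss(query, outside_frame, inside_frame)   # roles swapped: penalized
```

The loss vanishes when text-relevant frames already sit closer to the query than irrelevant ones, so gradients come only from triplets that violate the margin.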

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12349264/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12349264/full.md

## References

61 references — full list in the complete paper: https://tomesphere.com/paper/PMC12349264/full.md

---
Source: https://tomesphere.com/paper/PMC12349264