Text-Guided Visual Representation Optimization for Sensor-Acquired Video Temporal Grounding

Yun Tian; Xiaobo Guo; Jinsong Wang; Xinyue Liang

PMC · DOI: 10.3390/s25154704 · July 30, 2025

Open Access

TL;DR

This paper introduces a framework that uses text to improve the alignment between video content and language queries by optimizing visual representations in both space and time.

Contribution

A novel text-guided framework with spatial and temporal modules to refine visual representations for video temporal grounding.

Findings

01. The proposed framework outperforms state-of-the-art methods on benchmark datasets.

02. The SVRO and TVRO modules effectively enhance cross-modal alignment by focusing on relevant spatiotemporal content.

03. Self-supervised contrastive loss improves inter-clip discrimination and semantic variance.

Abstract

Video temporal grounding (VTG) aims to localize a semantically relevant temporal segment within an untrimmed video based on a natural language query. The task continues to face challenges arising from cross-modal semantic misalignment, which is largely attributed to redundant visual content in sensor-acquired video streams, linguistic ambiguity, and discrepancies in modality-specific representations. Most existing approaches rely on intra-modal feature modeling, processing video and text independently throughout the representation learning stage. However, this isolation undermines semantic alignment by neglecting the potential of cross-modal interactions. In practice, a natural language query typically corresponds to spatiotemporal content in video signals collected through camera-based sensing systems, encompassing a particular sequence of frames and its associated salient subregions.…

Keywords

video temporal grounding; cross-modal learning; cross-attention; contrastive learning; representation optimization

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Full text

1. Introduction

The rapid proliferation of video data, largely collected through camera-based sensing systems, has elevated video analysis to a central focus in computer vision research. While conventional tasks such as video classification [1,2] and action localization [3,4] have advanced semantic understanding, they depend heavily on predefined label sets and are ill-suited for open-ended querying. Recent advances have introduced natural language as a flexible query modality, enabling users to articulate semantic intent with greater freedom. This conceptual shift has given rise to the research task of video temporal grounding (VTG), which aims to semantically localize segments from untrimmed video signals acquired via visual sensors. As illustrated in Figure 1a, the goal is to identify the temporal segment within an untrimmed video that best aligns with a user-provided textual query [5]. As a canonical task in multimodal learning, VTG has found widespread application in video retrieval, event parsing, and intelligent surveillance. It also serves as a structural foundation for higher-level vision–language tasks such as video dialogue [6], video relationship detection [7,8,9], and video question answering [10,11,12]. By grounding semantic queries in real-world sensor-acquired video data, VTG constitutes a critical bridge between computer vision and natural language processing, promoting intelligent scene perception.

Most existing VTG methods [13,14,15,16,17] construct feature representations independently within either the video or text modality. As illustrated in Figure 1b, these intra-modal approaches employ separate encoders for each modality, extracting unimodal features that are passed to downstream modules for segment generation and boundary prediction. However, because visual features are extracted in isolation from sensor data without semantic conditioning, aligning them effectively with textual queries becomes inherently challenging. The absence of cross-modal interaction prevents an effective representation of semantically discriminative sensor data, limiting the model’s responsiveness to subtle variations in natural scenes. Consequently, models become overly dependent on downstream modules for semantic compensation, weakening early-stage perceptual sensitivity and ultimately undermining the interpretability and precision of cross-modal alignment.

Unlike conventional VTG methods that construct representations independently within each modality, our approach adopts a cross-modal paradigm to more effectively bridge the semantic gap between video and text. Given that textual descriptions often provide more precise semantic cues than raw visual signals captured by sensors [18], we leverage textual guidance to semantically reshape cross-modal visual representations via a text-guided optimization module, as illustrated in Figure 1c. By introducing cross-modal interaction during the visual feature construction stage, our model strengthens the semantic alignment of representations with respect to the query’s intent. This mechanism enables sensor-acquired visual features to undergo early semantic filtering and exhibit stronger task awareness prior to fusion.

Natural language descriptions inherently encode rich temporal structures and fine-grained spatial interactions, making them especially effective for guiding semantic interpretation of sensory visual data. Conventional cross-modal approaches often fall short in capturing such nuanced semantics. Specifically, a natural language query typically refers to a continuous sequence of frames while also conveying intent about inter-frame transitions and the localization of salient regions. As shown in Figure 1d, different textual descriptions of the same sensor video stream can induce distinct patterns of attention. For example, the blue query highlights a specific action unit, “a dog biting the frisbee,” concentrating attention on the moment of contact. In contrast, the brown query emphasizes “dog and woman scrambling for the frisbee,” leading to broader attention over dynamic interactions. Without explicit textual guidance, these distinctions are often lost in low-level encodings of sensor inputs. Therefore, to improve the perceptual accuracy of intelligent sensing systems, visual representation learning must incorporate mechanisms that allow sensor-derived features to adapt dynamically to query-level semantics.

Based on the preceding analysis, we propose a text-guided visual representation optimization framework that incorporates two specialized modules to identify text-relevant video frames and internal patches for constructing discriminative visual representations tailored to video temporal grounding. The Spatial Visual Representation Optimization (SVRO) module identifies spatially salient regions in sensor-acquired video frames that exhibit strong semantic correspondence with the textual query. Unlike prior approaches [19,20,21] that rely solely on visual-domain saliency detection, SVRO conditions patch selection on textual input, computing attention scores between visual patches and query tokens using a cross-attention mechanism. The top-K semantically relevant patches are retained to form the final representation, suppressing irrelevant background and noise. The Temporal Visual Representation Optimization (TVRO) module learns temporal alignment between candidate frames and the query using a contrastive learning strategy. A temporal triplet loss enforces semantic separation between text-relevant and -irrelevant video frames, encouraging the model to emphasize continuous key frames while discarding semantic outliers. These modules work in tandem to model the spatiotemporal semantics embedded in the video signals acquired through sensing systems, producing task-aware features for candidate moment generation and boundary prediction.

We conduct systematic evaluations of the proposed method on three widely used benchmark datasets: Charades-STA [5], ActivityNet-Captions [22], and TACoS [23]. The results demonstrate that our method consistently surpasses existing baselines across multiple standard evaluation metrics. Notably, on the ActivityNet-Captions dataset, our method reaches 49.18% Rank@1 at IoU = 0.5 and an mIoU of 45.76%. Additional ablation studies validate the effectiveness of the proposed modules and confirm the essential role of text-guided optimization in enhancing the semantic quality of sensor-derived video representations.

The main contributions of this work are summarized as follows:

We propose a text-guided visual representation optimization framework for video temporal grounding, which introduces cross-modal interaction during the visual encoding stage. This framework reshapes visual representations derived from sensor-acquired video signals under semantic conditioning by natural language queries.
We design spatial and temporal optimization modules that identify semantically relevant patches and frames from sensed video streams. These modules enhance the discriminability of visual features at both intra-frame and inter-frame levels, contributing to more precise semantic grounding.
Extensive experiments on three public benchmarks demonstrate that our method consistently outperforms existing approaches. Furthermore, the proposed modules exhibit strong generalization and can be seamlessly integrated into various VTG architectures, offering performance improvements without modifying the core structure. These results highlight the method’s potential in intelligent visual sensing and semantic-level video interpretation.

2. Related Work

2.1. Video Temporal Grounding

Temporal localization in untrimmed videos primarily encompasses two subfields: temporal action localization and video temporal grounding. The former aims to identify the temporal boundaries of predefined action categories, making it suitable for action detection within a closed category space, yet it faces considerable challenges when handling open-ended natural language instructions [3,24]. To address this constraint, Gao et al. [5] and Hendricks et al. [25] first proposed the video temporal grounding task in 2017, with the goal of precisely locating temporal boundaries in video that align with a natural language query. Most existing VTG methods [13,14,15,16,17] follow an intra-modal modeling paradigm, where visual and textual features are independently encoded during the representation stage and subsequently passed to downstream modules for candidate moment modeling and boundary prediction. While this strategy allows for coarse-grained semantic alignment via temporal modeling or matching functions, the absence of cross-modal interaction makes it difficult for visual representations to respond dynamically to language queries and ultimately restricts the model’s capacity for fine-grained detail modeling and precise grounding. These untrimmed video signals are typically captured through camera-based sensing systems, making VTG a critical task for interpreting and structuring sensor-acquired visual data using high-level textual supervision.

In light of these limitations, recent studies have increasingly turned to cross-modal modeling paradigms that incorporate language-guided supervision and symmetric interaction mechanisms to enhance the semantic fidelity of visual representations. One line of research focuses on language-guided visual representation learning, such as LGI [26], PS-VTG [27], and VDI [28], which segment natural language queries into semantic phrases or sub-queries to guide the visual encoding of candidate clips, thereby achieving fine-grained alignment between textual cues and spatial regions. Another line of work explores bidirectional regulation mechanisms, enabling visual features to modulate the language encoding process in reverse. Representative methods such as CBLN [29] and SeqPAN [30] employ symmetric attention and gated control strategies to refine the semantic mapping process. While these methods enhance the semantic consistency of visual representations, two critical limitations persist. First, their modeling granularity is typically restricted to the frame level, lacking intra-frame spatial saliency modeling. This omission hinders the identification of text-relevant regions within individual frames and results in blurred semantic distinctions. Second, they fail to effectively filter redundant frames, allowing irrelevant content to persist in the visual representation. This inclusion leads to semantic misalignment and reduces the discriminative capacity of the model. In contrast, our work extends the modeling mechanism along two key dimensions. In the spatial dimension, we introduce a salient patch selection strategy that leverages a text-guided mechanism to identify semantically relevant intra-frame regions and achieve fine-grained spatial alignment. 
In the temporal dimension, we build a redundancy suppression path that reduces the influence of semantically irrelevant frames and enhances the temporal discriminability of visual representations, thereby improving overall focus on the textual semantics.

2.2. Cross-Modal Learning for VTG

In real-world scenarios, actions and events are often conveyed through multiple modalities, with vision–language correspondence playing a central role in semantic modeling. As natural language increasingly emerges as the dominant medium of human–computer interaction, the collaborative modeling of visual and textual modalities has become a cornerstone in cross-modal learning research [31,32,33]. In sensor-acquired video data, semantic structures, such as object interactions and temporal scene transitions, are often entangled within raw pixel streams. The ability to structure such information via text-guided modeling significantly enhances semantic-level sensing and scene understanding. In recent years, the widespread adoption of Transformer architectures has significantly advanced language understanding and representation learning, and their core attention mechanisms have been progressively integrated into multimodal tasks. These mechanisms have shown strong efficacy in fine-grained problems such as video temporal grounding. The majority of existing studies adopt the cross-attention mechanism [34] as the foundational framework for vision–language fusion, leveraging the modeling of inter-modal dependencies to enhance semantic coherence and alignment precision. Representative methods such as VSLNet [35], GDP [36], CBLN [29], SeqPAN [30], and MRTNet [16] utilize cross-attention strategies during modality fusion, effectively enabling semantic alignment between visual and textual features.

Many existing methods apply the cross-attention mechanism only after visual feature encoding is complete. This “post hoc interaction” paradigm improves matching to some extent but provides no textual semantic guidance in the early stages of visual encoding, so visual features respond poorly to key semantic details. As a result, it is difficult to model spatial saliency structurally or to suppress temporal redundancy, which limits overall cross-modal expressiveness. To address these limitations, we move the cross-attention mechanism into the visual encoding stage and couple it with text to form a unified text-guided aggregator. With respect to spatiotemporal video content, the text-guided aggregator fuses semantically salient intra-frame patches with textual context to construct informed visual representations, suppressing irrelevant frame sequences and preventing redundant information from contaminating the visual feature construction. This early-stage guidance is particularly valuable when interpreting complex dynamic scenes recorded by visual sensors in real-world environments. Compared with conventional approaches that focus on post-fusion semantic alignment, our method introduces early-stage structural alignment and embeds semantic constraints during initial encoding. Concentrating the structural changes in low-level representation generation enhances the responsiveness of visual features to textual semantics and provides a unified modeling solution for spatial saliency encoding and temporal noise suppression.

2.3. Contrastive Learning for VTG

Contrastive learning, as a representative self-supervised learning paradigm, aims to reduce the distance between semantically aligned samples while increasing the distance between misaligned ones. This approach has demonstrated strong effectiveness in representation learning and has been widely adopted in cross-modal retrieval [37,38] and spatiotemporal modeling tasks [39,40]. Among the most influential contrastive learning frameworks, CLIP [41] constructs a joint visual–textual embedding space by training on large-scale image–caption pairs, exhibiting strong capabilities in cross-modal alignment and generalization. CLIP4Clip [42] extends CLIP to video–language tasks by generating frame-level visual representations conditioned on textual semantics. CLIP-ViP [43] further introduces a staged training strategy to address semantic shifts between video captions and natural language queries. These studies collectively suggest that incorporating contrastive learning into video–text joint modeling can significantly enhance cross-modal consistency and improve downstream discrimination performance. This paradigm is especially effective when learning from video signals captured by camera-based sensing systems, where supervised labels are often sparse or absent, enabling semantic representations to emerge directly from raw sensory input.

In video temporal grounding, contrastive learning is widely employed to enhance the discriminative power of cross-modal alignment and to sharpen semantic focus. For example, MMN constructs a 2D candidate moment feature map and introduces a contrastive loss to strengthen text-to-candidate matching, thereby improving the model’s response to semantically salient segments [13]. D3G proposes a semantic-aligned grouped contrastive learning mechanism (SA-GCL), which partitions intra-video and inter-video sample groups and utilizes a Gaussian adjustment module for dynamic modeling of the visual–semantic distribution, building hierarchical contrastive paths across videos to enhance segment-level discrimination [17]. These methods validate the feasibility and effectiveness of contrastive learning in VTG from different perspectives. Building upon the contrastive capabilities of CLIP, our method further embeds contrastive learning into the Temporal Visual Representation Optimization (TVRO) module. This module constructs a contrastive structure based on the semantic relation between candidate frames and text, treating semantically relevant frames as positive samples and irrelevant ones as negative samples. A temporal triplet loss is introduced to enlarge the semantic response gap between the two frame groups, guiding the model to focus on key frame sequences while suppressing redundant frames. In addition, we design a clip–text self-supervised contrastive loss, aimed at capturing semantic distinctions among candidate segments throughout training. By leveraging contrastive supervision over sensor-acquired video signals, our model constructs semantically aligned representations with minimal reliance on annotated data, enabling efficient learning from raw visual inputs.

3. Methodology

This section systematically introduces the proposed text-guided visual representation optimization method. As illustrated in Figure 2, the architecture consists of three main stages: First, the CLIP encoder is used to embed both video and text inputs into a shared semantic space, generating a unified visual–textual representation. Second, the visual representation is optimized in both spatial and temporal dimensions under the guidance of textual semantics. Finally, based on the optimized visual representations, candidate moments are generated, and temporal boundaries are localized.

3.1. Problem Definition

Given an untrimmed video V and a natural language query S, the objective is to retrieve a temporal segment M from V that is semantically aligned with S. Formally, the query sentence is denoted as $[eqn]$ , where $[eqn]$ denotes the $[eqn]$ word, and $[eqn]$ is the total number of words in the query. The video V is represented as a sequence of frames, denoted as $[eqn]$ , where $[eqn]$ is the $[eqn]$ video frame, and $[eqn]$ denotes the total number of frames. The target temporal segment is denoted as $[eqn]$ with $[eqn]$ , and it is required to be semantically consistent with the given query S.

3.2. Preparation

Given an input video stream, we first segment it into N consecutive video clips, denoted as $[eqn]$ , where each clip $[eqn]$ consists of T consecutive frames, formally defined as $[eqn]$ . These clips serve as the fundamental units for candidate moment construction. To construct a unified multimodal representation space, CLIP [41] is adopted as the backbone to encode both video and text modalities. For the textual modality, the given natural language query S is encoded using the CLIP text encoder, producing a token sequence $[eqn]$ , where $[eqn]$ and $[eqn]$ denote the start and end tokens, respectively. We use $[eqn]$ as the global text representation of the sentence, denoted as t. For each segmented video clip $[eqn]$ , we uniformly sample m frames and divide each frame into non-overlapping patches. These patches are then fed into the CLIP visual encoder to extract local visual features. The $[eqn]$ tokens $[eqn]$ generated from each frame are aggregated into the set $[eqn]$ . Here, $[eqn]$ denotes the global visual semantic vector of the $[eqn]$ frame. This set forms the initial visual representation of the video clip for subsequent cross-modal alignment.
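The clip segmentation and uniform frame sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the helper names `split_into_clips` and `uniform_sample`, the choice to drop trailing frames that do not fill a whole clip, and the bin-midpoint sampling rule are all assumptions.

```python
def split_into_clips(num_frames, frames_per_clip):
    """Return N lists of frame indices, each holding T consecutive frames.

    Assumption: trailing frames that do not fill a whole clip are dropped.
    """
    n_clips = num_frames // frames_per_clip
    return [
        list(range(c * frames_per_clip, (c + 1) * frames_per_clip))
        for c in range(n_clips)
    ]


def uniform_sample(clip, m):
    """Pick m frame indices spread evenly over one clip.

    Assumption: each sample sits at the midpoint of one of m equal bins.
    """
    t = len(clip)
    return [clip[int((j + 0.5) * t / m)] for j in range(m)]


# Toy example: a 64-frame video, T = 16 frames per clip, m = 4 samples per clip.
clips = split_into_clips(num_frames=64, frames_per_clip=16)
sampled = [uniform_sample(clip, m=4) for clip in clips]
```

Each sampled frame would then be split into patches and passed to the CLIP visual encoder, which is outside the scope of this sketch.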

However, the initial visual representations encoded by CLIP often contain redundant regions that are semantically unrelated to the query, which may hinder accurate alignment if directly used in cross-modal fusion. Inspired by [18], we adopt a text-guided mechanism to enhance the semantic relevance of visual representations. We design a text-guided aggregator to perform cross-modal-guided aggregation over frame-level visual representations, which combines cross-attention with a feed-forward network. Specifically, given a paired text and video clip, with the text representation t serving as the query and the visual features treated as the keys and values, cross-attention is applied as shown in Equation (1) to capture the most text-relevant visual regions:

[eqn]

where $[eqn]$ , $[eqn]$ , and $[eqn]$ are the projection matrices in cross-attention. The attention-weighted features are then passed through a feed-forward network to further enrich their representational capacity, producing the final visual representation $[eqn]$ :

[eqn]

where $[eqn]$ represents the text-guided aggregator and $[eqn]$ denotes a feed-forward neural network. In contrast to conventional visual encoders that extract static representations, the text-guided aggregator performs text-guided selection and reweighting of visual patches, thereby directing the visual representation toward regions that exhibit high semantic relevance to the query. By retaining essential visual content and suppressing irrelevant regions, the model achieves more precise semantic alignment in the early encoding stage.
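To make the aggregator concrete, here is a minimal single-head sketch in plain Python: the text vector acts as the query, the visual tokens act as keys and values, and the attended vector passes through a small feed-forward network. Several details are assumptions because the exact equations are not reproduced here: scaled dot-product scoring, a two-layer ReLU feed-forward network without residual connections or normalization, and the function and parameter names (`text_guided_aggregate`, `w_q`, `w_k`, `w_v`, `w_1`, `w_2`).

```python
import math


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_guided_aggregate(t, feats, w_q, w_k, w_v, w_1, w_2):
    """Aggregate visual tokens `feats` into one vector, guided by text `t`."""
    q = matvec(w_q, t)
    keys = [matvec(w_k, f) for f in feats]
    values = [matvec(w_v, f) for f in feats]
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    # Attention-weighted sum of the value vectors.
    attended = [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]
    # Assumed FFN: linear -> ReLU -> linear.
    hidden = [max(0.0, h) for h in matvec(w_1, attended)]
    return matvec(w_2, hidden), weights


# Toy example with identity projections: token 0 points the same way as the text.
identity = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
out, attn = text_guided_aggregate(
    text, tokens, identity, identity, identity, identity, identity
)
```

In the toy example the first token receives the largest attention weight, so the aggregated vector is dominated by the text-aligned direction.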

3.3. Spatial Visual Representation Optimization

In previous work [44,45,46], the global features of each frame are typically used to construct visual representations, which hinders the model’s capacity to capture fine-grained semantic cues within localized spatial regions. In practice, each frame output by the visual encoder contains multiple patch tokens, many of which are either underused or entirely disregarded. These patch tokens, however, encapsulate rich local visual information and offer a more detailed depiction of the scene. Given the large number of patches per frame, most of which are semantically redundant, selectively identifying informative patches becomes essential for enhancing spatial representations. A straightforward approach is to use the global $[eqn]$ token to select the most informative patch tokens within each frame. As shown in Figure 3a, patches with high similarity to the $[eqn]$ token are considered important and are retained as the visual representation of the frame. However, natural language queries typically focus on specific patches of a frame, so selecting patches that are semantically relevant to the text is crucial for improving cross-modal alignment. To this end, we propose a text-guided patch selection mechanism, as illustrated in Figure 3b. It enhances the semantic responsiveness and spatial discrimination of visual features while mitigating the impact of redundant content.

Specifically, for the $[eqn]$ frame within a given video clip, the corresponding patch feature set is denoted as $[eqn]$ , where $[eqn]$ denotes the j-th patch of the i-th frame, and k is the number of patches per frame. To evaluate the semantic relevance between each patch and the textual query, we adopt a cross-modal attention mechanism to compute attention scores, which are then normalized and ranked. The top K patches most relevant to the query are selected and defined as

[eqn]

where $[eqn]$ denotes the $[eqn]$ patch tokens from the i-th frame that are most semantically aligned with the global textual feature t. This mechanism filters out background patches that are semantically irrelevant to the query while retaining only the most informative and query-relevant patches. However, although relying solely on local patch features improves semantic focus, it may compromise the global structural integrity within each frame, resulting in fragmented visual representations and weakened contextual awareness. To address this, we concatenate the global $[eqn]$ token $[eqn]$ of each frame with its $[eqn]$ selected patch features to form a fused frame visual representation $[eqn]$ , denoted as

[eqn]

where the $[eqn]$ token $[eqn]$ serves as a global semantic summary, which complements the local patches with missing contextual structural information, achieving a unified modeling of both global and local semantics. Based on this, to further enhance the responsiveness of visual features to textual semantics, we input the concatenated frame visual feature sequence $[eqn]$ into the text-guided aggregator and perform cross-modal interaction with the text representation t to generate a spatial-enhanced visual representation $[eqn]$ :

[eqn]

The optimized visual representation is not only more sensitive to query semantics at the local level but also maintains contextual consistency at the global level. In Section 4.5.2, we further conduct experiments to demonstrate the effectiveness of incorporating text-related patches.
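The text-guided patch selection of this section can be sketched as below: score every patch against the global text feature, keep the top K, and prepend the frame's global token. The scoring function (scaled dot product followed by softmax), the choice to keep the selected patches in their original spatial order, and the name `select_top_k_patches` are assumptions made for illustration.

```python
import math


def select_top_k_patches(t, cls_token, patches, top_k):
    """Return ([cls_token] + top_k text-aligned patches, kept patch indices)."""
    # Assumed scoring: scaled dot product with t, normalized by softmax.
    scale = math.sqrt(len(t))
    raw = [sum(a * b for a, b in zip(t, p)) / scale for p in patches]
    peak = max(raw)
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    scores = [e / total for e in exps]
    # Rank patch indices by score and keep the best top_k.
    ranked = sorted(range(len(patches)), key=lambda j: scores[j], reverse=True)
    # Assumption: selected patches are returned in their original spatial order.
    kept = sorted(ranked[:top_k])
    return [cls_token] + [patches[j] for j in kept], kept


# Toy frame with four 2-D patch tokens and a text feature along the first axis.
text = [1.0, 0.0]
cls_token = [0.5, 0.5]
patches = [[0.1, 0.9], [2.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
frame_repr, kept = select_top_k_patches(text, cls_token, patches, top_k=2)
```

The resulting `frame_repr` corresponds to the fused frame representation (global token plus selected patches) that is subsequently fed to the text-guided aggregator.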

3.4. Temporal Visual Representation Optimization

Although the text-guided patch selection in the spatial dimension effectively reduces redundant patches, the visual representations still include numerous frames irrelevant to the query in the temporal dimension, such as static backgrounds, action transition frames, or irrelevant event segments. These redundant frames, when incorporated into temporal modeling, compromise the semantic purity of the resulting visual representations, thereby impairing the cross-modality alignment accuracy. To address this, the construction of video clip features should emphasize frames that are semantically consistent with the query. Specifically, query-relevant frames are regarded as positive samples, while the remaining ones are treated as negatives. The final visual representation is then aggregated from the positive frames only. As shown in Figure 4, the proposed TVRO module enhances attention toward positive frames while suppressing the influence of negatives, guiding the visual representation to concentrate on temporally salient segments and improving temporal semantic alignment.

The precise identification of key frames typically requires extensive manual annotation, which is impractical in real-world applications. To address this issue, we propose a frame selection strategy based on Intersection over Union (IoU) consistency, enabling the approximate identification of key frames without additional labeling. First, we map the clip index to a temporal interval, as shown in Equation (6):

[eqn]

where L is the total duration of the video, N is the number of equally divided video clips, and $[eqn]$ is the short duration of each clip, determined by the video length and sampling rate. We construct positive and negative clip sets based on their IoU overlap with the ground-truth interval $[eqn]$ . The positive clip set $[eqn]$ and negative clip set $[eqn]$ are defined as follows:

[eqn]

where $[eqn]$ is an empirically determined threshold to ensure a distinct semantic gap between positive and negative samples. Subsequently, we uniformly sample f frames from the positive and negative clip sets, constructing a positive frame sequence $[eqn]$ and a negative frame sequence $[eqn]$ . We apply the text-guided aggregator to both the positive and negative frame sequences conditioned on the global text feature t, resulting in the following cross-modal visual representations:

[eqn]

where $[eqn]$ and $[eqn]$ represent the cross-modal features of positive and negative frame sequences. We then concatenate the positive and negative frame sequences to form a hybrid sequence $[eqn]$ . The text-guided aggregator is further applied to this hybrid sequence to generate the fusion features as follows:

[eqn]

where $[eqn]$ should be predominantly generated by positive frames, minimizing the influence of negative frames. To achieve this, we introduce temporal triplet loss as follows:

[eqn]

where $[eqn]$ denotes the $[eqn]$ -norm distance between $[eqn]$ and $[eqn]$ . This loss treats $[eqn]$ as the anchor, minimizing its distance to the positive frames while maximizing its distance to the negative ones. This guides the model to focus on key frames that align with the query semantics. The inter-frame visualization presented in Section 4.5.1 demonstrates the effectiveness of TVRO.
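The two steps of this section, IoU-based clip partitioning and the temporal triplet loss, can be sketched as follows. The equations themselves are not reproduced here, so several choices are assumptions: clips are indexed from 0 and cover equal intervals of length L/N, a clip is positive when its IoU with the ground truth is at least `theta` and negative otherwise, the distance is Euclidean, and the loss is the standard triplet hinge with an illustrative `margin` of 0.2.

```python
import math


def clip_interval(index, duration, n_clips):
    """Map a 0-based clip index to its (start, end) time in seconds."""
    length = duration / n_clips
    return index * length, (index + 1) * length


def temporal_iou(a, b):
    """Intersection over Union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    # The span max-min equals the true union whenever the intervals overlap;
    # for disjoint intervals inter is 0, so the result is 0 either way.
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def split_clips(gt, duration, n_clips, theta):
    """Assumed rule: IoU >= theta is positive, everything else is negative."""
    pos, neg = [], []
    for i in range(n_clips):
        iou = temporal_iou(clip_interval(i, duration, n_clips), gt)
        (pos if iou >= theta else neg).append(i)
    return pos, neg


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def temporal_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet hinge: pull the anchor toward positive, push it from negative."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)


# Toy example: 40 s video, 8 clips, ground-truth moment from 10 s to 20 s.
pos_clips, neg_clips = split_clips(
    gt=(10.0, 20.0), duration=40.0, n_clips=8, theta=0.3
)
# The fused feature (anchor) matches the positive feature: no penalty.
loss_good = temporal_triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# The fused feature matches the negative feature: large penalty.
loss_bad = temporal_triplet_loss([0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

The loss is zero once the fused feature is closer to the positive feature than to the negative feature by at least the margin, which is the behavior the text describes for suppressing negative frames.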

3.5. Grounding Head

We follow [47] and construct a prediction head that predicts the best candidate moment based on the optimized visual representations. First, we build a feature map of candidate moments using the video clip representations $[eqn]$ . For each candidate moment spanning clip $[eqn]$ to clip $[eqn]$ , we max-pool the corresponding clip features across that time span and obtain its feature:

[eqn]

We restructure all candidate moments into a 2D temporal feature map, denoted as $[eqn]$ , where the first two dimensions N represent the start and end clip indices, respectively, while the third one $[eqn]$ indicates the feature dimension. Then, both $[eqn]$ and the text feature t are projected into a unified subspace by fully connected layers, and they are fused through a Hadamard product and $[eqn]$ normalization to obtain a cross-modal 2D feature map F:

[eqn]

where $[eqn]$ and $[eqn]$ are learnable projection matrices, $[eqn]$ is the transpose of an all-ones vector, ⊙ is the Hadamard product, and $[eqn]$ denotes Frobenius normalization. On the fused 2D feature map F, we stack L layers of 2D convolution with H kernels per layer to gradually perceive more context of adjacent candidate moments while learning the difference between candidate moments. The feature updates for each convolutional layer follow the form

[eqn]

Finally, the output 2D temporal map goes through a fully connected layer and a sigmoid function to generate a 2D score map. According to the candidate indicator, all the valid scores on the map are then collected, denoted as $[eqn]$ , where C is the total number of moment candidates. Each value $[eqn]$ on the map represents the matching score between a moment candidate and the queried sentence. The moment with the highest score is selected as the final grounding result.
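The construction of the candidate moment map and the final selection step can be sketched as follows. This is a simplified illustration under stated assumptions: it enumerates every span with start index no greater than end index (a real implementation may sample candidates sparsely), it omits the text fusion and convolutional layers, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start clip index, end clip index), inclusive


def build_candidate_map(clips: List[List[float]]) -> Dict[Span, List[float]]:
    """Max-pool clip features over every span (a, b) with a <= b."""
    n = len(clips)
    moments: Dict[Span, List[float]] = {}
    for a in range(n):
        pooled = list(clips[a])
        for b in range(a, n):
            # A running element-wise max extends span (a, b-1) to (a, b).
            pooled = [max(p, c) for p, c in zip(pooled, clips[b])]
            moments[(a, b)] = pooled
    return moments


def best_moment(scores: Dict[Span, float]) -> Span:
    """Return the candidate span with the highest matching score."""
    return max(scores, key=lambda span: scores[span])
```

Only the upper triangle of the N × N map (start ≤ end) holds valid candidates, which is why the sketch stores spans in a dictionary rather than a dense array.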

3.6. Loss Function

We construct a multi-branch joint loss function designed to improve localization accuracy, cross-modal alignment, and key frame semantic selectivity. The overall loss comprises three components: binary cross-entropy loss, cross-modal matching loss, and temporal triplet loss (see Section 3.4).

3.6.1. Binary Cross-Entropy

Following the approach in [47], we adopt the scaled IoU score of each candidate moment as the supervision signal. More precisely, for each candidate moment, we first compute its temporal overlap with the ground-truth interval, denoted as $[eqn]$ . The overlap score is then linearly scaled into a soft supervision label $[eqn]$ according to predefined lower and upper thresholds $[eqn]$ and $[eqn]$ as follows:

[eqn]

In this way, the label reflects the proximity between the candidate moment and the ground-truth interval. During training, a standard binary cross-entropy loss is used for optimization, defined as

[eqn]

where $[eqn]$ is the output score of a moment, and C is the total number of valid candidates.
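As a worked illustration of the soft labeling described above, the sketch below linearly rescales an IoU value between a lower and an upper threshold and then evaluates a standard binary cross-entropy. The piecewise form (0 below the lower threshold, 1 above the upper threshold, linear in between) is an assumption based on the description of linear scaling; the function names and the clamping constant are illustrative.

```python
import math
from typing import Sequence


def scaled_iou_label(iou: float, t_min: float, t_max: float) -> float:
    """Map an IoU in [0, 1] to a soft label in [0, 1] (assumed piecewise form)."""
    if iou <= t_min:
        return 0.0
    if iou >= t_max:
        return 1.0
    return (iou - t_min) / (t_max - t_min)


def bce_loss(scores: Sequence[float], labels: Sequence[float],
             eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over all valid candidates."""
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(scores)
```

For example, with the thresholds 0.5 and 1.0 reported for Charades-STA in Section 4.2.2, a candidate with an IoU of 0.75 would receive a label of 0.5 under this scheme, while any candidate with an IoU of at most 0.5 would receive a label of 0.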

3.6.2. Cross-Modal Matching

To further enhance semantic discrimination across candidate clips, we introduce a self-supervised contrastive learning strategy centered on clip–text alignment. This approach guides the model to highlight semantically relevant clips while suppressing irrelevant ones, thereby improving the discriminative quality and robustness of temporal localization. Concretely, given the visual representation $[eqn]$ of each clip and the global query representation t, we compute their cosine similarity to quantify the degree of the cross-modal matching score:

[eqn]

To obtain a probability distribution over all candidate clips, the similarity scores are normalized using a softmax function:

[eqn]

where $[eqn]$ is a temperature parameter that controls the sharpness of the distribution. Lower values emphasize fine-grained differences among segments, whereas higher values produce a more uniform distribution. Following previous work, we set $[eqn]$ to 0.07 in all experiments. Based on this distribution, we define the cross-modal matching loss as

[eqn]

This loss encourages the model to attend to clips that are most relevant to the query and suppress those that are irrelevant. The overall loss function consolidates three complementary optimization goals: moment localization, clip–text alignment, and key frame semantic sensitivity. It is defined as

[eqn]

The hyperparameters $[eqn]$ and $[eqn]$ regulate the relative contributions of the cross-modal matching loss and the temporal triplet loss within the unified training objective.
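The clip–text similarity, the temperature-scaled softmax, and the combination of the three loss terms can be sketched as below. This is an illustrative sketch only: the weighted-sum form of the overall objective is inferred from the description of the two weighting hyperparameters, the cross-modal matching loss itself is not reproduced here, and all function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def clip_text_distribution(clips: Sequence[Sequence[float]],
                           text: Sequence[float],
                           tau: float = 0.07) -> List[float]:
    """Softmax over clip-text cosine similarities with temperature tau."""
    logits = [cosine_similarity(c, text) / tau for c in clips]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def total_loss(l_bce: float, l_match: float, l_triplet: float,
               lambda_match: float, lambda_triplet: float) -> float:
    """Assumed weighted sum of the three loss terms."""
    return l_bce + lambda_match * l_match + lambda_triplet * l_triplet
```

With a temperature of 0.07, small differences in cosine similarity are amplified roughly fourteen-fold before the softmax, which is what makes the resulting distribution sharp.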

4. Experiments

4.1. Datasets

To evaluate the performance of our proposed method, we conduct experiments on three challenging video temporal grounding datasets. Notably, all datasets consist of real-world videos collected via camera-based sensing systems, reflecting the typical input form of vision-based sensors in practical scenarios. These video signals serve as representative sensing data sources and are subsequently used for semantic grounding under textual guidance.

Charades-STA [5] is an extended version of the Charades [48] action recognition dataset, tailored explicitly for the task of video temporal grounding. It comprises 5338 videos and 12,408 query–moment pairs in the training set and 1334 videos with 3720 query–moment pairs in the test set. The original Charades dataset was recorded using RGB cameras in indoor environments with participants performing scripted activities. A salient characteristic of this dataset lies in its high semantic density, with each video associated with multiple overlapping or conflicting queries. This results in ambiguous event boundaries and presents a formidable challenge for precise cross-event alignment. On this dataset, we focus on evaluating the discriminative capabilities of the spatial (SVRO) and temporal (TVRO) optimization modules in handling semantically overlapping scenarios.

ActivityNet-Captions [22] is built upon the large-scale video recognition dataset ActivityNet v1.3 [49], originally designed for video captioning tasks and now widely adopted in temporal grounding research. The videos were collected from the internet and were captured using consumer-grade cameras or mobile devices under unconstrained conditions, making the dataset representative of open-world sensor-acquired video content. There are 37,417, 17,505, and 17,031 moment–sentence pairs for training, validation, and testing, respectively. Following the setting of 2D-TAN [47], we report the evaluation results on the val2 set. Characterized by broad coverage, extended event durations, and ambiguous temporal boundaries, the dataset provides a rigorous benchmark for assessing a model’s ability to identify semantically salient segments amid complex event distributions. It offers a particularly suitable testbed for evaluating the robustness of our text-guided patch selection strategy under diverse contextual conditions.

TACoS [23] is constructed from videos of human activities in kitchen environments, comprising 127 long videos and 18,818 video–language pairs, annotated with high-quality labels from [50]. The videos were originally recorded in a controlled kitchen setting using fixed-angle RGB cameras, capturing continuous cooking activities from a first-person view. A standard split [5] consists of 10,146, 4589, and 4083 moment–sentence pairs for training, validation, and testing, respectively. Despite its domain specificity, the dataset is characterized by a high semantic density, continuous yet low-variance activity sequences, and substantial visual redundancy with minimal inter-frame semantic variation. This benchmark is primarily utilized to examine the effectiveness of our TVRO module in scenarios with high visual redundancy and low temporal dynamics. Special emphasis is placed on evaluating its capacity for key frame awareness and redundancy suppression through the proposed temporal triplet supervision strategy.

4.2. Experimental Settings

4.2.1. Evaluation Metrics

Following existing video grounding works, we evaluate the performance on two main metrics:

mIoU: “mIoU” is the average Intersection over Union (IoU) between the predictions and the ground truth over all testing samples. The mIoU metric is particularly challenging for short video moments.

Recall: We adopt “ $[eqn]$ ” as the evaluation metric, following [5]. “ $[eqn]$ ” represents the percentage of language queries for which at least one of the $[eqn]$ predictions has an IoU with the ground truth larger than m. In our experiments, we report the results of $[eqn]$ and $[eqn]$ .

The mIoU metric emphasizes the overall quality of temporal alignment, reflecting the boundary precision gains achieved through the spatial and temporal optimization of visual representations. The use of IoU thresholds at 0.3, 0.5, and 0.7 captures varying levels of semantic tolerance, where lower thresholds evaluate general relevance, and higher ones assess fine-grained localization. This design effectively measures how well our method captures both the coarse and precise temporal extents implied by the query.
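The two metrics can be illustrated with a short sketch for the top-1 case. It assumes that each query has a single predicted span and a single ground-truth span given as (start, end) times, and it uses a strict comparison because the definition above says "larger than m"; the function names are illustrative.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start time, end time)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over Union of two temporal intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Span], gts: Sequence[Span], m: float) -> float:
    """Percentage of queries whose top-1 prediction has IoU larger than m."""
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) > m)
    return 100.0 * hits / len(gts)


def mean_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Average IoU over all queries, expressed as a percentage."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(ious) / len(ious)
```

For two queries with IoU values of 1 and 1/3, this gives a recall of 100% at m = 0.3, 50% at m = 0.5, and an mIoU of about 66.7%.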

4.2.2. Implementation Details

To enable a fair comparison with existing baselines in terms of both architecture and training pipeline, we follow the 2D convolutional configuration adopted in [47] for candidate moment modeling and temporal boundary prediction. Specifically, the number of sampled clips N is set to 16 for Charades-STA, 64 for ActivityNet Captions, and 128 for TACoS. The number of frames in a clip T is set to 4 for Charades-STA and 16 for ActivityNet Captions and TACoS. For the convolutional structure, Charades-STA and TACoS employ an eight-layer network, with a kernel size of 5, whereas ActivityNet Captions adopts a four-layer network, with a kernel size of 9. The dimensionalities of all hidden states (i.e., $[eqn]$ and $[eqn]$ ) are set to 512. The scaling thresholds $[eqn]$ and $[eqn]$ are set to 0.5 and 1.0 for Charades-STA and ActivityNet Captions and to 0.3 and 0.7 for TACoS.

For feature encoding, we adopt the pretrained CLIP (ViT-B/32) to jointly encode video and text. All video frames are resized to 224 × 224 prior to encoding. These frames, originally captured by camera-based sensing systems, are treated as visual sensor signals and form the core input to our perception model. The $[eqn]$ token generated by the text encoder serves as the global textual representation, while the visual input for each frame consists of the $[eqn]$ token along with its corresponding patch tokens extracted from the visual encoder.

We use Adam [51] with a learning rate of $[eqn]$ and a batch size of 32 for optimization. We adopt the cosine learning rate schedule with a linear warm-up [52] for 5 epochs. In the SVRO module, the $[eqn]$ parameter is set to 10. In the TVRO module, both positive and negative frame sequences are sampled with a fixed length of 8. Regarding loss formulation, the cross-modal matching loss and temporal triplet loss are weighted by $[eqn]$ and $[eqn]$ , respectively.
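The learning rate schedule can be sketched as a linear warm-up over the first 5 epochs followed by cosine decay. This is a minimal sketch under stated assumptions: the rate is updated once per epoch (an implementation may instead update per iteration), the decay ends at zero, and `base_lr` and `total_epochs` are placeholder parameters whose values are not taken from this section.

```python
import math


def lr_at_epoch(epoch: int, total_epochs: int, base_lr: float,
                warmup_epochs: int = 5) -> float:
    """Learning rate for a 0-indexed epoch under linear warm-up + cosine decay."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up phase completed, in [0, 1).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The warm-up keeps early updates small while the randomly initialized heads stabilize on top of the pretrained CLIP features, and the cosine phase then lowers the rate smoothly toward the end of training.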

For the cross-attention module, we adopt a standard multi-head attention mechanism with 8 heads and an attention dimensionality of 512. The query, key, and value projections are implemented as linear layers with identical dimensionality. The subsequent feed-forward network (FFN) consists of a single-layer MLP with a hidden size of 2048, followed by residual connections and layer normalization.

All experiments were conducted using a workstation equipped with an NVIDIA RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The implementation was based on Python 3.8 and PyTorch 1.13.0 (Meta Platforms, Inc., Menlo Park, CA, USA). Video frames were extracted using FFmpeg (Version 4.3.1).

4.3. Comparison to State-of-the-Art Methods

4.3.1. Comparative Methods

We compare our proposed method with state-of-the-art video temporal grounding methods on three public datasets. These methods can be grouped into two categories according to the intra-modal and cross-modal viewpoints: (1) The intra-modal methods are 2D-TAN [47], MMN [13], LCNet [14], WSTAN [15], MRTNet [16], D3G [17], CPL [53], CPN [54], DRN [55], GDP [36], and CMIN [56]. (2) The cross-modal methods are LGI [26], PS-VTG [27], VDI [28], CBLN [29], SeqPAN [30], MESM [57], MMDist [58], and VSLNet [35].

4.3.2. Quantitative Comparison

The results are summarized in Table 1, Table 2, and Table 3. Overall, our method outperforms representative state-of-the-art approaches under the most challenging evaluation conditions and consistently ranks among the top three across multiple metrics. On the Charades-STA dataset, it achieves 28.91% on Rank@1 at IoU = 0.7 and 42.55% on mIoU, surpassing recent methods such as MMDist and MRTNet. On ActivityNet Captions, our method obtains the best performance on the Rank@1 metric, particularly achieving 30.36% at IoU = 0.7, underscoring its effectiveness in precise boundary localization. On the TACoS dataset, which is characterized by dense semantic content and significant frame redundancy, our method achieves 31.81% on Rank@1 at IoU = 0.5 and 28.27% on mIoU, ranking second only to CPN and outperforming strong baselines such as SeqPAN and MMN.

To better isolate the source of performance improvements, we conduct a direct comparison with 2D-TAN [47]. Although our method retains the same 2D convolutional architecture for candidate moment modeling and boundary prediction, it consistently outperforms 2D-TAN across all three benchmark datasets. This performance gain cannot be attributed to architectural changes but rather to the enhanced visual representations derived from text-guided optimization during feature construction. By incorporating textual guidance at the encoding stage, our method produces candidate moment representations with improved semantic consistency prior to entering the 2D convolutional network. As a result, the generated 2D temporal feature maps exhibit clearer semantic alignment, leading to overall gains in temporal localization accuracy.

Moreover, we conduct comparative evaluations against several representative intra-modal methods, including MMN, LCNet, WSTAN, MRTNet, and D3G. Our method consistently outperforms these methods across multiple evaluation metrics. This advantage primarily stems from the fact that intra-modal methods do not incorporate semantic priors during visual feature construction. Consequently, their representations tend to lack adaptive sensitivity to textual semantics and rely heavily on late-stage fusion for semantic compensation. Such limitations hinder their ability to effectively handle queries with complex or subtle meanings. In contrast, our method introduces cross-modal semantic guidance at the feature construction stage, enhancing alignment precision and improving temporal localization performance.

Finally, we compare our method with several cross-modal modeling approaches, including CBLN, LGI, PS-VTG, SeqPAN, and VDI. As shown in Table 2, our method achieves superior performance on the ActivityNet Captions dataset. This improvement can be attributed to the inherent complexity of the dataset, which features ambiguous event boundaries, diverse semantic contexts, and significant background clutter. These characteristics pose considerable challenges to a model’s ability to distinguish semantic cues and suppress redundant information. Competing methods rely on frame-level alignment strategies but fail to attend to the most semantically salient subregions within frames. Moreover, they lack dedicated mechanisms for suppressing redundant frames, increasing the likelihood of noise from irrelevant content. In contrast, our method integrates text-guided strategies during feature construction to select semantically aligned patches and emphasize key frames while suppressing irrelevant ones. This leads to more consistent visual representations and improved accuracy in both temporal localization and semantic alignment under complex scenarios.

4.3.3. Plug and Play

We incorporate the proposed visual representation optimization modules into two representative VTG frameworks, MMN [13] and LCNet [14], serving as front-end feature construction components. As reported in Table 4, both models exhibit consistent performance improvements across multiple evaluation metrics on the Charades-STA dataset. These findings demonstrate the modularity and generalizability of our method, confirming its ability to operate as a plug-and-play module that enhances various VTG architectures without requiring architectural redesign. Specifically, MMN constructs a dual-stream semantic matching graph and establishes inter-clip semantic propagation pathways to facilitate the ranking of candidate moments. Its performance is highly dependent on the semantic fidelity of moment-level representations and the discriminative capacity across clips. However, the static visual features utilized by the original MMN often incorporate semantically irrelevant content during candidate moment generation, thereby limiting the representational expressiveness of the constructed graph. By integrating our approach, the SVRO and TVRO modules suppress visual redundancy during the feature construction phase, resulting in candidate moment representations that are more semantically focused. This, in turn, enhances the discriminability of nodes within the semantic graph and improves the effectiveness of the matching process.

As for LCNet, it relies on a weakly supervised cross-frame attention mechanism and a local alignment strategy for clip–text pairs. Through cross-frame semantic aggregation, it learns to approximate the latent temporal boundaries of relevant segments. In this framework, the frame-level semantic granularity of visual features plays a critical role in shaping the attention distribution. However, due to semantic ambiguity and weak activation of salient frames in the original features, the attention mechanism often fails to focus precisely on language-relevant regions. Upon integrating our approach, the TVRO module enhances the alignment between frame-level features and the target query through contrastive supervision. This integration improves the sensitivity of the attention mechanism to text-relevant frames and stabilizes the temporal boundary prediction process.

4.4. Ablation Study

4.4.1. Effect of Individual Components

We conduct an ablation study to assess the contribution of each component under the standard experimental setting. Table 5 summarizes the performance variations resulting from the removal of individual modules. The results show that excluding either component leads to a measurable decline in performance, highlighting their respective and complementary roles in spatial and temporal modeling. Specifically, removing the TVRO module causes the mIoU to drop from 44.76% to 41.18%, and it results in a decline of over 4 percentage points in Rank@1 at IoU = 0.7. This underscores the critical role of TVRO in attending to text-relevant frames and suppressing redundant ones. Without it, the model struggles with precise boundary localization and dense semantic alignment. When only SVRO is removed, Rank@1, IoU = 0.3 drops by 3.86%, reflecting a noticeable reduction in the model’s ability to concentrate on salient visual patches within each frame. Its removal directly impairs the spatial grounding of visual representations with respect to the query. When both modules are removed, the model reverts to using unoptimized CLIP features for visual representation construction, resulting in a further mIoU reduction to 39.21%, along with declines in several other metrics. These results indicate that, while CLIP exhibits certain semantic capabilities, its raw visual features still suffer from intra-frame redundancy and limited inter-frame discriminability. By contrast, the combined optimization from SVRO and TVRO produces spatially refined and temporally aligned representations, ultimately yielding superior overall performance.

4.4.2. Effect of Text-Guided Mechanisms in Different Components

To further evaluate the impact of text-guided mechanisms on visual representation construction, we progressively disable three key pathways that incorporate textual supervision. The corresponding results are reported in Table 6. First, removing text guidance in the text-guided patch selection (TGPS) leads to a 3.9% decrease in Rank@1 at IoU = 0.5. This suggests that, without semantic-aligned filtering at the spatial level, redundant visual regions unrelated to the textual query persist, thereby weakening the discriminability of local semantic representations. Additionally, excluding the text-guided aggregator from SVRO results in a further 1.6% performance drop, indicating that patch-level selection alone is insufficient for constructing effective intra-frame semantic focus. The text-guided aggregator plays a critical role in reorganizing spatial semantics and enhancing semantic responsiveness. In the temporal dimension, disabling textual supervision in the TVRO module causes a 4.3% decrease in Rank@1 at IoU = 0.3, reflecting the model’s reduced ability to capture inter-frame semantic density and localize text-relevant key frames. The complete removal of all three text-guided mechanisms causes a substantial drop in overall performance, clearly demonstrating the essential role of text-driven guidance in optimizing visual representations.

4.4.3. Effect of the Number of Salient Patches

As shown in Table 7, we investigate the effect of varying the number of salient patches retained in the SVRO module. The model achieves optimal performance across all evaluation metrics when the Top-K value is set to 10. Under this setting, the selected patches sufficiently capture semantically salient regions while effectively filtering out redundant visual content. When K is further reduced to 5, some key semantic regions are lost, undermining the spatial representational capacity and resulting in a marked decline in model performance. Conversely, increasing K expands semantic coverage but introduces excessive redundancy, which disrupts semantic focus and leads to a consistent decline in precision under stricter IoU thresholds. In the extreme case of retaining all 49 patches without selection, the mIoU drops by 7.6% compared to the optimal setting, indicating that unfiltered input severely harms the model’s discriminative capacity.
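The Top-K selection studied above can be sketched as follows. This is an illustrative sketch, not the selection rule used in SVRO: it assumes that patches are ranked by cosine similarity to the global text feature, and the function names are hypothetical. With CLIP ViT-B/32 and 224 × 224 inputs, each frame yields 7 × 7 = 49 patch tokens, of which K = 10 would be kept.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def select_top_k_patches(patches: Sequence[Sequence[float]],
                         text: Sequence[float],
                         k: int = 10) -> List[int]:
    """Indices of the k patches most similar to the text feature,
    returned in their original spatial order."""
    scores = [cosine_similarity(p, text) for p in patches]
    ranked = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Returning the indices in their original order keeps the retained patches aligned with their positions in the frame, which matters if positional information is used downstream.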

4.4.4. Parameter Sensitivity

We conduct two sets of parameter sensitivity experiments to assess the regulatory impact of the cross-modal matching loss weight ( $[eqn]$ ) and the temporal triplet loss weight ( $[eqn]$ ) in Equation (19). Parameter $[eqn]$ is varied from 0.0 to 1.0 in increments of 0.1 to examine its influence on the semantic alignment between language queries and clips. Parameter $[eqn]$ is tested over the range [0.0, 0.1] with finer granularity to precisely evaluate its effect on temporal modeling through inter-frame contrastive loss. As illustrated by the red curve in Figure 5, setting $[eqn]$ eliminates semantic discrimination among clips, leading to a reduction in the mIoU to 42.6%. As $[eqn]$ increases to 0.3, the model progressively improves its ability to differentiate between relevant and irrelevant clips, thereby enhancing semantic alignment and raising the mIoU to 44.9%. Further increasing $[eqn]$ leads the model to overemphasize semantic similarity at the expense of structural boundary modeling, resulting in the erroneous retention of redundant segments and an increased boundary prediction error, causing a drop in the mIoU. Thus, $[eqn]$ represents the optimal balance, offering sufficient semantic discrimination without compromising structural fidelity.

The blue curve shows that small values of $[eqn]$ fail to enforce sufficient contrast between positive and negative frame sequences, limiting the model’s capacity to identify salient temporal regions, with the mIoU stagnating around 43.1%. Raising $[eqn]$ to 0.04 effectively increases the inter-frame representational gap, enabling the model to focus on key semantic frames while suppressing redundant information, thereby improving the mIoU to 44.9%. When $[eqn]$ exceeds 0.07, excessive suppression disrupts temporal continuity, causing overly narrow segment localization and boundary misalignment, which ultimately degrades performance.

4.4.5. Effect of Different Visual Features

We conduct comparative experiments on the ActivityNet Captions dataset using three different visual feature configurations within a unified 2D-TAN framework. The results are shown in Figure 6. The first setting employs C3D as the video encoder. Due to the absence of a semantic alignment mechanism, it produces coarse motion features that fail to capture fine-grained correspondences with the query, leading to poor segment localization. The second configuration uses raw features extracted from CLIP. Benefiting from its cross-modal pretraining, it improves overall performance, indicating that semantic embeddings contribute positively to matching quality. However, since CLIP does not filter features based on the specific query, it still suffers from dispersed semantics and redundant visual patches. The third configuration optimizes the CLIP representation with our proposed SVRO and TVRO modules. Without modifying the 2D-TAN architecture, the overall performance is further enhanced. Specifically, Rank@1 at IoU = 0.5 improves by 5.1 percentage points compared to using raw CLIP features. These results underscore the importance of text-guided optimization in enhancing visual representations for the VTG task.

4.5. Qualitative Analysis

4.5.1. Inter-Frame Visualization

The TVRO module introduces a temporal triplet loss designed to sharpen the model’s focus on frames relevant to the textual query. Figure 7 depicts the frame-wise attention distribution across a single video, evaluated under two different query conditions. The two curves represent queries that are semantically similar yet diverge in their focal intent. The results show that, although both descriptions refer to the same scene, the attention responses vary significantly along the timeline. For the blue query, attention peaks between frames 4 and 6, corresponding closely to the drinking action. In contrast, the pink query elicits heightened attention between frames 9 and 10, highlighting the subject’s facial expression. These results demonstrate the TVRO module’s temporal sensitivity and its ability to adapt the attention distribution based on query semantics, thereby facilitating the precise identification of key semantic frames.

4.5.2. Intra-Frame Visualization

We conducted a visualization comparing text-guided patch selection with self-guided patch selection derived from image self-attention, as shown in Figure 8. The figure presents three representative vision–language mismatch scenarios, including target omission, weak semantic saliency, and subject ambiguity, which are commonly encountered in real applications. In case (a), the image contains both an adult and a child, and visual self-attention predominantly focuses on facial regions. However, the text description refers only to the “child” and a “pumpkin.” Text-guided patch selection redirects attention from visually dominant regions to semantically aligned areas, effectively filtering out irrelevant content. In case (b), the text highlights the micro-action “pouring coffee,” but the corresponding region is not visually salient. While visual self-attention fails to yield a distinct activation, cross-modal attention amplifies focus on the hand and cup regions, thereby enhancing action localization. In case (c), the scene includes two individuals, but the query refers only to “a girl in a dress.” Visual self-attention distributes attention nearly equally, resulting in reduced discriminability. In contrast, our method suppresses attention to non-target areas via textual cues, ensuring precise alignment with the semantically relevant object.

4.5.3. Grounding Results

Figure 9 presents visualizations of the temporal localization results produced by our method on three benchmark datasets. It can be observed that, under various scenes and action semantics, the temporal boundaries predicted by our method align closely with the ground-truth, with boundary positions approximating the actual start and end times of the target described in the text. By contrast, training without the temporal triplet loss causes the model to produce wider temporal spans, with boundaries spilling over into irrelevant frames. This difference indicates that, in the absence of inter-frame semantic contrastive supervision, the model struggles to distinguish between relevant and irrelevant frames in the feature space, resulting in broader, less accurate predictions.

To further analyze the limitations of our method under challenging conditions, we present two representative failure cases in Figure 10. In the first case, the query describes “a person in a dog costume walking outside a snowy house.” While the sentence conveys a clear action, the textual reference to “a person” conflicts with the animal-like visual appearance, causing a semantic shift across modalities. This mismatch hinders the model’s ability to align the textual and visual representations effectively, leading to incorrect segment selection. In the second case, the query refers to “ingredients displayed next to actions mixing ingredients.” Although this describes a plausible visual scene, the lack of clear temporal cues or visual anchors in the sentence makes it difficult for the model to pinpoint when the actual mixing occurs. As a result, the predicted span fails to capture the intended semantic moment. These cases reveal that, despite the robustness of our method in most scenarios, it still struggles when confronted with semantic ambiguity, cross-modal inconsistency, or underspecified queries. Future work may explore structured semantic decomposition or salient phrase extraction to enhance the model’s grounding capability under complex or underspecified input conditions.

4.6. Case Study

4.6.1. Case Study in Challenging Scenarios

In complex multi-activity scenarios, the query often contains multiple semantic units, each referring to a distinct subject and action. This imposes additional demands on the model’s ability to identify and temporally align multiple target behaviors within a densely populated action timeline. As illustrated in the first case of Figure 11, the query describes “a boy in a black shirt making funny faces” and “another boy playing with the insides of the pumpkin,” where the two actions are semantically parallel and temporally overlapping. In such cases, prior methods like 2D-TAN and VSLNet tend to exhibit partial omissions or temporal drift, failing to capture the full scope of the composite semantic expression. By contrast, our method accurately localizes the full extent of the described actions. This improvement is primarily attributed to the semantic alignment capability inherited from CLIP, which offers a unified multimodal embedding space to bridge visual and textual representations. Furthermore, the SVRO module enhances spatial discriminability by filtering out text-irrelevant regions, while the TVRO module suppresses temporally irrelevant frames, allowing the model to retain only the segments that collectively fulfill the query intent. This progressive optimization architecture significantly improves boundary precision in multi-action environments.

In rapidly changing scenes characterized by frequent shot transitions and disrupted action continuity, conventional methods are often prone to spurious responses and imprecise boundaries due to transient distractions. As shown in the second case of Figure 11, our method maintains stable and accurate temporal localization even under such challenging conditions. In comparison, 2D-TAN relies on the direct fusion of visual and textual features without effective temporal disentanglement, limiting its ability to focus on the intended query semantics. VSLNet incorporates fine-grained query guidance, but its uniform frame processing fails to discriminate between informative and redundant content. Our method addresses these challenges through the TVRO module, which dynamically adjusts frame-wise attention weights in accordance with the semantic intensity of the query, thereby reinforcing consistent responses to meaningful segments. Simultaneously, the SVRO module preserves highly correlated regions within individual frames, further sharpening the representational distinctiveness. These mechanisms collectively enable our model to achieve tighter alignment with ground-truth boundaries in rapidly evolving scenes, demonstrating enhanced robustness and generalization capacity.

4.6.2. Case Study Under Environmental Perturbations

Environmental degradation in sensor-acquired videos, such as low illumination and motion blur, presents significant challenges to vision–language alignment. In low-light scenarios, underexposed frames often exhibit severe detail loss and diminished contrast, which obscure object boundaries and reduce the availability of semantic cues. As shown in the first case of Figure 12, the scene involves a performer under dim indoor lighting, with the background rendered almost completely dark. This results in a substantial loss of spatial information, making it difficult for visual encoders to capture discriminative features. Both 2D-TAN and VSLNet exhibit marked temporal deviations, either truncating the segment prematurely or overshooting the ground-truth boundaries. While our method does not fully recover the precise extent of the target interval under such adverse sensing conditions, it nonetheless produces significantly closer predictions. This relative robustness stems from the prior alignment knowledge encoded by CLIP during large-scale pretraining, which enhances the model’s sensitivity to semantically relevant frames even when raw sensor input is compromised.

In the second case, motion blur caused by fast object movement leads to widespread contour smearing and a loss of inter-frame consistency. This form of perceptual degradation is common in consumer-grade or mobile sensor setups. As illustrated in Figure 12, the skateboarding sequence suffers from visual ambiguity and reduced spatial clarity. In this setting, 2D-TAN misidentifies the starting point, and VSLNet fails to respond to the final portion of the action. Our method demonstrates more stable temporal predictions by suppressing irrelevant frames and amplifying semantic salience. However, boundary predictions still exhibit slight misalignment, indicating room for improvement in handling ambiguous or low-quality inputs. These cases collectively suggest that, while the proposed framework improves robustness against sensor-induced perturbations compared to prior work, its performance remains sensitive to the severity of visual degradation. Future enhancements may explore adaptive feature compensation strategies or external quality-aware priors to further mitigate sensing limitations.

4.7. Computational Complexity Analysis

To gain a more comprehensive understanding of the proposed method, we further analyze its computational complexity by estimating the theoretical floating-point operations (FLOPs) of the two most resource-intensive components: the frame-level cross-modal attention module and the triplet mining mechanism. This analysis not only enables a quantitative assessment of inference load but also provides practical guidance for deployment and resource allocation.

We first examine the computational cost of the frame-level cross-attention mechanism. The module takes as input a sampled visual sequence $[eqn]$ and a global textual feature $[eqn]$, where $T$ is the number of sampled frames and $D$ is the feature dimension. The multi-head attention mechanism consists of $h$ parallel subspaces, with each head of dimension $[eqn]$. According to the standard formulation of attention complexity in Transformer models introduced by Vaswani et al. [59] and further elaborated for vision tasks by Dosovitskiy et al. [60], the total FLOPs of the cross-attention operation (including the linear projections for Query, Key, and Value; the attention weight computation with softmax; and the subsequent aggregation) can be approximated by

[eqn]

Under the experimental setup of $[eqn]$, $[eqn]$, and thus $[eqn]$, with sampled sequences comprising 8 frames from each of the positive, negative, and mixed types, we have $[eqn]$. Substituting into the equation yields approximately $[eqn]$ FLOPs. The subsequent feed-forward network (FFN), consisting of two linear transformations with a hidden dimension of 2048, incurs an additional cost of

[eqn]

Thus, the total cost for a single invocation of the frame-level cross-attention, including the FFN, amounts to approximately 0.063 GFLOPs. It is worth noting that this module operates only over a fixed set of sampled frames rather than the full video, ensuring that the overall computational burden remains manageable.

We now turn to the triplet mining module, which aims to enhance the discriminability of temporal feature alignment. Each training instance generates $[eqn]$ triplets based on the sampled positive and negative frame sequences, each comprising a hybrid frame feature $[eqn]$, a positive sample $[eqn]$, and a negative sample $[eqn]$. Following the design of FaceNet by Schroff et al. [61], each triplet involves two Euclidean distance computations and a subsequent normalization operation, with the overall complexity approximately given by

[eqn]

For $[eqn]$ and $[eqn]$, the total cost is below $[eqn]$ FLOPs (less than 0.001 GFLOPs) and can thus be regarded as negligible overhead.
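To make the per-triplet cost concrete, the following is a minimal sketch of a FaceNet-style triplet loss together with a rough operation count. It assumes L2-normalized features and squared Euclidean distances; the `margin` value, the function names, and the "about 3D operations per distance" estimate are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss over N triplets of D-dimensional features.

    anchor/positive/negative: (N, D) arrays. Here the anchor plays the role
    of the hybrid frame feature. `margin` is a hypothetical value.
    """
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a, p, n = (l2_normalize(x) for x in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2, axis=1)   # squared distance anchor-positive
    d_neg = np.sum((a - n) ** 2, axis=1)   # squared distance anchor-negative
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

def triplet_distance_ops(num_triplets, dim):
    """Rough count: two distances per triplet, about 3*dim operations each
    (subtract, square, accumulate). Normalization cost is ignored."""
    return num_triplets * 2 * 3 * dim

# Toy check: an easy triplet set gives zero loss, a hard one does not.
a = np.array([[1.0, 0.0], [0.0, 2.0]])
easy = triplet_margin_loss(a, a, -a)   # positive equals anchor
hard = triplet_margin_loss(a, -a, a)   # negative equals anchor
```

Because the count grows only linearly in both the number of triplets and the feature dimension, it stays orders of magnitude below the cost of the projection layers for any plausible setting.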

In summary, the combined computational cost of the frame-level cross-attention module (including its FFN) and the triplet mining module remains under 0.064 GFLOPs, significantly lower than that of visual encoders, proposal generators, or common multi-scale attention structures. More importantly, the proposed framework adopts a sparsely sampled frame interaction strategy with a local attention design, whose cost scales linearly with the video length. This design avoids the quadratic or cubic growth in cost typically observed in dense temporal modeling. As a result, even when applied to long-duration videos or batch-level inference, the system keeps inference cost and memory usage controllable by adjusting the sampling interval and interaction granularity.
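The scaling argument can be checked with a small back-of-the-envelope estimator. This is a sketch under stated assumptions, not a reconstruction of the paper's formulas: it counts multiply-accumulate operations (MACs) only, ignores softmax and normalization, includes an output projection, and uses a hypothetical feature dimension `DIM = 512`. The frame count `T = 24` (8 frames for each of three sequence types) and the FFN hidden size of 2048 come from the text above.

```python
def cross_attention_macs(n_query, n_kv, dim):
    """MACs for one multi-head cross-attention layer.

    The total does not depend on the number of heads, since the heads
    partition `dim`. Softmax cost is ignored.
    """
    q_proj = n_query * dim * dim        # query projection
    kv_proj = 2 * n_kv * dim * dim      # key and value projections
    scores = n_query * n_kv * dim       # query-key dot products
    aggregate = n_query * n_kv * dim    # attention-weighted sum of values
    out_proj = n_query * dim * dim      # output projection (assumed present)
    return q_proj + kv_proj + scores + aggregate + out_proj

def ffn_macs(n_tokens, dim, hidden):
    """MACs for a two-layer feed-forward network applied per token."""
    return 2 * n_tokens * dim * hidden

T = 24          # 8 sampled frames x 3 sequence types (from the text)
DIM = 512       # hypothetical feature dimension
HIDDEN = 2048   # FFN hidden dimension (from the text)

# T frame tokens interacting with a single global text token.
total = cross_attention_macs(T, 1, DIM) + ffn_macs(T, DIM, HIDDEN)
print(f"{total / 1e9:.3f} G multiply-accumulates")

# Cost versus number of frames: cross-attention against one text token
# grows linearly, dense frame-to-frame self-attention grows quadratically.
cross = [cross_attention_macs(t, 1, DIM) for t in (12, 24, 36)]
dense = [cross_attention_macs(t, t, DIM) for t in (12, 24, 36)]
```

Under these assumed values the sketch yields about 0.063 G MACs, the same order as the figure quoted above, although that agreement should not be read as confirmation of the elided equations. The count is unchanged if the roles are swapped so that the text token is the query and the frames are keys and values. The equal step between successive entries of `cross` reflects linear growth in the number of frames, whereas the steps in `dense` increase.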

5. Conclusions

This study investigates the challenge of inadequate cross-modal alignment and deficient semantic responsiveness in visual representations for video temporal grounding and proposes a text-guided visual representation optimization framework to address these limitations. The framework builds upon CLIP to construct a unified visual–text embedding space and incorporates two modules, SVRO and TVRO, to identify spatially and temporally salient content that aligns with the semantic intent of the query. These visual signals originate from camera-based sensing systems, and our method enhances their interpretability through multimodal semantic conditioning.

In particular, the SVRO module performs fine-grained patch-level filtering to extract text-relevant spatial regions, while the TVRO module dynamically suppresses temporally irrelevant frames, ensuring precise alignment with the query’s semantic focus. This dual-stage optimization significantly improves the expressiveness and discriminability of visual features across challenging scenarios, including multi-activity compositions, rapid scene transitions, and sensor-induced degradation. The incorporation of cross-modal contrastive loss further boosts the discriminative capability of the segment-level representations, enhancing query-aware localization accuracy. Extensive experiments on three public benchmarks demonstrate consistent improvements over representative methods, validating the effectiveness of the proposed framework. Furthermore, its integration into diverse VTG architectures confirms its adaptability and transferability. This work contributes to intelligent visual sensing by optimizing the semantic understanding of sensor-acquired video content through a novel combination of pre-trained multimodal encoders and task-specific semantic filtering strategies.

Future work may explore structured semantic decomposition or salient phrase extraction to enhance the model’s grounding capability under complex or underspecified input conditions. Moreover, continued progress may focus on adaptive feature compensation strategies or external quality-aware priors to improve robustness under visual degradation, thereby broadening applicability in real-world scenarios involving ambiguous, incomplete, or low-quality sensor inputs.

Bibliography

1. Wang L., Li W., Li W., Van Gool L. Appearance-and-Relation Networks for Video Classification. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018, pp. 1430–1439. doi:10.1109/CVPR.2018.00155
2. Feichtenhofer C., Fan H., Malik J., He K. SlowFast Networks for Video Recognition. Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019, pp. 6201–6210. doi:10.1109/ICCV.2019.00630
3. Zhao Y., Xiong Y., Wang L., Wu Z., Tang X., Lin D. Temporal Action Detection with Structured Segment Networks. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017, pp. 2933–2942. doi:10.1109/ICCV.2017.317
4. Lin Z., Zhao Z., Zhang Z., Zhang Z., Cai D. Moment Retrieval via Cross-Modal Interaction Networks with Query Reconstruction. IEEE Trans. Image Process. 2020, 29, 3750–3762. doi:10.1109/TIP.2020.2965987. PMID: 31976894
5. Gao J., Sun C., Yang Z., Nevatia R. TALL: Temporal Activity Localization via Language Query. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017, pp. 5277–5285. doi:10.1109/ICCV.2017.563
6. Chu Y.W., Lin K.Y., Hsu C.C., Ku L.W. End-to-End Recurrent Cross-Modality Attention for Video Dialogue. IEEE/ACM Trans. Audio, Speech, Lang. Process. 2021, 29, 2456–2464. doi:10.1109/TASLP.2021.3065852
7. Ji W., Li Y., Wei M., Shang X., Xiao J., Ren T., Chua T.S. VidVRD 2021: The Third Grand Challenge on Video Relation Detection. Proceedings of the 29th ACM International Conference on Multimedia (MM '21), Chengdu, China, 20–24 October 2021, pp. 4779–4783. doi:10.1145/3474085.3479232
8. Shang X., Li Y., Xiao J., Ji W., Chua T.S. Video Visual Relation Detection via Iterative Inference. Proceedings of the 29th ACM International Conference on Multimedia (MM '21), Chengdu, China, 20–24 October 2021, pp. 3654–3663. doi:10.1145/3474085.3475263