Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding
Erica K. Shimomoto, Edison Marrese-Taylor, Hiroya Takamura, Ichiro Kobayashi, Hideki Nakayama, Yusuke Miyao

TL;DR
This paper investigates the integration of pre-trained language models (PLMs) into Temporal Video Grounding (TVG). It shows that PLMs improve model performance without altering the visual inputs, and that NLP adapters make fine-tuning these models parameter-efficient.
Contribution
The paper provides a systematic study of how PLMs affect TVG and evaluates parameter-efficient NLP adapters, highlighting their effectiveness and practical benefits.
Findings
PLMs significantly improve TVG performance without changing visual inputs.
NLP adapters can match state-of-the-art results at a lower computational cost.
No single adapter is best everywhere; the best-performing adapter varies across scenarios.
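To make "parameter-efficient" concrete, here is a minimal sketch of one common NLP adapter design, a residual bottleneck adapter in the style of Houlsby et al. The summary does not say which adapters the paper tests, so this is an illustrative assumption, not the authors' method; the class name `BottleneckAdapter` and the sizes used are hypothetical. It is written in plain Python so it runs without a deep learning framework; in practice the adapter would sit inside each layer of a frozen PLM and only its weights would be trained.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector, b: Vector) -> Vector:
    # w has shape (out, in); returns w @ x + b
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter:
    h + W_up . ReLU(W_down . h + b_down) + b_up
    """

    def __init__(self, d_model: int, bottleneck: int, seed: int = 0):
        rng = random.Random(seed)
        # Small random down-projection: d_model -> bottleneck
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Zero-initialised up-projection: the adapter starts as the identity map,
        # so inserting it does not disturb the frozen PLM's behaviour at step 0
        self.w_up = [[0.0] * bottleneck for _ in range(d_model)]
        self.b_up = [0.0] * d_model

    def __call__(self, h: Vector) -> Vector:
        z = [max(0.0, v) for v in matvec(self.w_down, h, self.b_down)]  # ReLU
        delta = matvec(self.w_up, z, self.b_up)
        return [hi + di for hi, di in zip(h, delta)]  # residual connection

    def num_params(self) -> int:
        # Trainable parameters: two weight matrices plus two bias vectors
        d_model, r = len(self.b_up), len(self.b_down)
        return 2 * d_model * r + d_model + r


adapter = BottleneckAdapter(d_model=768, bottleneck=64)
print(adapter.num_params())       # 99136
print(24 * adapter.num_params())  # 2379264
```

Assuming a BERT-base-sized encoder (hidden size 768, 12 layers, roughly 110M parameters) with two adapters per layer and a bottleneck of 64, the adapters add 24 × 99,136 = 2,379,264 trainable parameters, about 2% of the full model, while the PLM weights themselves stay frozen.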
Abstract
This paper explores the task of Temporal Video Grounding (TVG) where, given an untrimmed video and a natural language sentence query, the goal is to recognize and determine temporal boundaries of action instances in the video described by the query. Recent works tackled this task by improving query inputs with large pre-trained language models (PLM) at the cost of more expensive training. However, the effects of this integration are unclear, as these works also propose improvements in the visual inputs. Therefore, this paper studies the effects of PLMs in TVG and assesses the applicability of parameter-efficient training with NLP adapters. We couple popular PLMs with a selection of existing approaches and test different adapters to reduce the impact of the additional parameters. Our results on three challenging datasets show that, without changing the visual inputs, TVG models greatly…
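TVG outputs a (start, end) segment per query, and such predictions are commonly scored by temporal intersection-over-union (tIoU) with the annotated segment, often reported as Recall@1 at a tIoU threshold. The summary does not state the paper's exact metrics, so the following is a minimal sketch of that standard scoring under this assumption; the function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds, start <= end


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Fraction of queries whose top-1 prediction reaches the tIoU threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= threshold)
    return hits / len(gts)


print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
print(recall_at_1([(2.0, 8.0), (0.0, 3.0)], [(4.0, 10.0), (5.0, 9.0)], 0.5))  # 0.5
```

In the first call the segments overlap for 4 s out of an 8 s union, giving tIoU 0.5; the second query's prediction does not overlap its ground truth at all, so only one of two queries counts as a hit.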
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Test
