G$^{2}$TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models
Riya Arora, Niveditha Narendranath, Aman Tambi, Sandeep S. Zachariah, Souvik Chakraborty, Rohan Paul

TL;DR
This paper introduces G$^{2}$TR, a method that combines large pre-trained models to enable robots to perform temporal reasoning for understanding and executing complex instructions involving past interactions.
Contribution
The paper factors temporal reasoning into three steps (interval estimation, spatial inference, and semantic tracking) and uses large pre-trained models for each, improving the grounding of references to past interactions in robot instruction-following tasks.
Findings
Achieves 70.10% average accuracy on a complex robot video-language dataset.
Effectively grounds past interactions to current scenes for improved instruction following.
Demonstrates the potential of combining large pre-trained models for temporal reasoning in robotics.
Abstract
Consider a scenario in which a human cleans a table and a robot observing the scene is given the instruction "Remove the cloth using which I wiped the table". Instruction following with temporal reasoning requires the robot to identify the relevant past object interaction, ground the object of interest in the present scene, and execute the task as instructed. Directly mapping utterances that reference past interactions to grounded objects is challenging because such references are multi-hop and because a video stream of the robot's workspace contains a large space of possible object groundings. Our key insight is to factor the temporal reasoning task as (i) estimating the video interval associated with the event reference, (ii) performing spatial reasoning over the interaction frames to infer the intended object, (iii) semantically tracking the object's location…
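The factored pipeline described in the abstract can be sketched as three composable functions. This is a toy illustration under our own assumptions, not the authors' implementation. Per-frame captions with keyword matching stand in for the pre-trained model that localizes the event interval. Choosing the labeled object nearest the hand stands in for the spatial reasoning step. A persistent track_id supplied by an assumed upstream tracker stands in for semantic tracking. All class and function names (Frame, Detection, estimate_interval, infer_object, track_to_present) are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Detection:
    track_id: int                   # hypothetical persistent ID from an upstream tracker
    label: str                      # object class, e.g. "cloth"
    center: tuple[float, float]     # (x, y) image coordinates


@dataclass
class Frame:
    t: int                          # frame index; frames are assumed to be in temporal order
    caption: str                    # hypothetical per-frame caption (stand-in for a pre-trained model)
    hand: tuple[float, float] | None  # hand position, or None if no hand is visible
    detections: list[Detection] = field(default_factory=list)


def estimate_interval(frames, event_keywords):
    """Stage (i): span (start, end) of frames whose caption contains every keyword."""
    hits = [f.t for f in frames if all(k in f.caption for k in event_keywords)]
    return (min(hits), max(hits)) if hits else None


def infer_object(frames, interval, label):
    """Stage (ii): vote for the object with the given label that is nearest the hand
    in each frame of the interval; return the track_id with the most votes."""
    start, end = interval
    votes = Counter()
    for f in frames:
        if start <= f.t <= end and f.hand is not None:
            candidates = [d for d in f.detections if d.label == label]
            if candidates:
                nearest = min(candidates, key=lambda d: math.dist(d.center, f.hand))
                votes[nearest.track_id] += 1
    return votes.most_common(1)[0][0] if votes else None


def track_to_present(frames, track_id):
    """Stage (iii): location of the object in the most recent frame where it is detected."""
    for f in reversed(frames):
        for d in f.detections:
            if d.track_id == track_id:
                return d.center
    return None


# Toy version of the cloth example: two cloths, only cloth 1 is used to wipe the table.
frames = [
    Frame(0, "a person stands by the table", None,
          [Detection(1, "cloth", (10, 10)), Detection(2, "cloth", (50, 50))]),
    Frame(1, "a person wipes the table", (12, 11),
          [Detection(1, "cloth", (12, 12)), Detection(2, "cloth", (50, 50))]),
    Frame(2, "a person wipes the table", (30, 20),
          [Detection(1, "cloth", (31, 21)), Detection(2, "cloth", (50, 50))]),
    Frame(3, "a person walks away", None,
          [Detection(1, "cloth", (80, 5)), Detection(2, "cloth", (50, 50))]),
]

interval = estimate_interval(frames, ["wipes", "table"])  # (1, 2)
target = infer_object(frames, interval, "cloth")          # 1
location = track_to_present(frames, target)               # (80, 5)
print(interval, target, location)
```

Keeping the stages as separate functions mirrors the decomposition in the abstract: each stage consumes only the previous stage's output, so any one of them could be swapped for a stronger model without changing the others.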
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning
