Open-Ended Multi-Modal Relational Reasoning for Video Question Answering
Haozheng Luo, Ruiyang Qin, Chenwei Xu, Guo Ye, and Zening Luo

TL;DR
This paper presents a robotic agent that analyzes video environments and answers questions by integrating video recognition with natural language processing. Its experiments find a positive relationship between trust and interaction efficiency.
Contribution
It introduces a novel robotic agent that combines multi-modal reasoning for video question answering, enhancing performance and understanding in human-robot interactions.
Findings
Positive correlation between trust and interaction efficiency
2-3% performance improvement over benchmark methods
Effective integration of video recognition and NLP models
Abstract
In this paper, we introduce a robotic agent specifically designed to analyze external environments and address participants' questions. The primary focus of this agent is to assist individuals using language-based interactions within video-based scenes. Our proposed method integrates video recognition technology and natural language processing models within the robotic agent. We investigate the crucial factors affecting human-robot interactions by examining pertinent issues arising between participants and robot agents. Our experimental findings reveal a positive relationship between trust and interaction efficiency. Furthermore, our model demonstrates a 2% to 3% performance enhancement in comparison to other benchmark methods.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling
