PolySmart @ TRECVid 2024 Medical Video Question Answering
Jiaxin Wu, Yiyang Jiang, Xiao-Yong Wei, Qing Li

TL;DR
This paper presents a system for medical video question answering that combines text-to-text video retrieval, visual answer localization, and instructional step captioning with GPT-4. Its single submitted run to the TRECVid 2024 challenge obtained an F-score of 11.92 and a mean IoU of 9.6527.
Contribution
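It integrates GPT-4 into both tasks: GPT-4-generated answers drive transcript-based video retrieval for VCVAL, and GPT-4 writes the QFISC step captions from LLaVA-Next-Video captions and timestamped subtitles. Only one run was submitted to TRECVid 2024.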
Findings
Achieved an F-score of 11.92 on the QFISC task.
Obtained a mean IoU of 9.6527 for visual answer localization.
Showed that GPT-4 can serve as both the answer generator for retrieval and the step-caption writer in a medical video QA pipeline.
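For readers unfamiliar with the localization metric, the following is a minimal sketch of how a mean temporal IoU is commonly computed. It assumes predictions and ground truths are (start, end) spans in seconds and that the score is a plain average over questions. The function names are illustrative, and the official TRECVid evaluation script may differ in details such as scaling or how missing predictions are handled.

```python
def temporal_iou(pred, gold):
    """Intersection over union of two (start, end) time spans in seconds."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    # Overlap is zero when the spans do not intersect.
    intersection = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, golds):
    """Average temporal IoU over paired predicted and gold spans."""
    scores = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical spans, not taken from the paper.
    predictions = [(0.0, 10.0), (30.0, 40.0)]
    references = [(5.0, 15.0), (50.0, 60.0)]
    print(mean_iou(predictions, references))
```

A predicted span that misses the reference entirely contributes zero, so a low mean IoU can reflect a few complete misses as much as many loose boundaries.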
Abstract
Video Corpus Visual Answer Localization (VCVAL) comprises question-related video retrieval and visual answer localization within the videos. For retrieval, we use text-to-text matching: relevant videos for a medical question are found based on the similarity between video transcripts and answers generated by GPT-4. For visual answer localization, the start and end timestamps of the answer are predicted by aligning both the visual content and the subtitles with the query. For the Query-Focused Instructional Step Captioning (QFISC) task, the step captions are generated by GPT-4: we provide the video captions generated by the LLaVA-Next-Video model and the video subtitles with timestamps as context, and ask GPT-4 to generate step captions for the given medical query. We submitted only one run for evaluation; it obtained an F-score of 11.92 and a mean IoU of 9.6527.
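The retrieval step described in the abstract can be sketched as ranking transcripts by their similarity to a generated answer. The abstract does not say which similarity model is used, so this sketch substitutes a simple bag-of-words cosine similarity as a stand-in. The GPT-4 answer is represented by a plain string argument, and `tokenize`, `cosine`, and `rank_videos` are hypothetical names introduced only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_videos(generated_answer, transcripts):
    """Rank video ids by similarity of their transcript to the answer.

    generated_answer: text assumed to come from an LLM such as GPT-4.
    transcripts: dict mapping video id to transcript text.
    """
    query_vec = Counter(tokenize(generated_answer))
    scored = [
        (video_id, cosine(query_vec, Counter(tokenize(text))))
        for video_id, text in transcripts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy transcripts and answer, invented for this example.
    corpus = {
        "vid_a": "apply firm pressure to the wound using a clean bandage",
        "vid_b": "stretch your shoulder slowly and hold for ten seconds",
    }
    for video_id, score in rank_videos("Press a clean bandage on the wound", corpus):
        print(video_id, round(score, 3))
```

Matching against a generated answer rather than the raw question gives the query vocabulary closer to what a transcript is likely to contain. A production system would more plausibly use dense sentence embeddings in place of the word counts used here.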
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Machine Learning in Healthcare
