Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation

Enrico Guerriero; Kjersti Engan; Øyvind Meinich-Bache

arXiv:2602.12002·cs.CV·February 13, 2026

TL;DR

This study evaluates local vision-language models (VLMs), combined with large language models, for activity recognition in simulated newborn resuscitation videos. It reports that VLMs fine-tuned with LoRA outperform the supervised Vision Transformer (TimeSFormer) baseline, reaching an F1 score of 0.91.

Contribution

The paper applies local vision-language models with LoRA fine-tuning to activity recognition and reports improvements over a baseline Vision Transformer model.

Findings

01

VLMs fine-tuned with LoRA achieve an F1 score of 0.91.

02

Local VLMs initially struggle with hallucinations.

03

VLMs outperform Vision Transformers in this task.

Abstract

Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSFormer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation…
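
For readers unfamiliar with Low-Rank Adaptation, the sketch below illustrates the general idea in plain Python: a frozen weight matrix W is augmented by a trainable low-rank product, giving an effective weight W + (alpha / r) * B * A. This is only an illustration of the standard LoRA formulation, not the authors' implementation. The function names, the matrix sizes, and the values of `alpha` and `r` are assumptions chosen for the example; the excerpt above does not state the rank, scaling, or layers used in the paper.

```python
# Minimal LoRA sketch (illustrative only; names and sizes are assumptions).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A).

    w: frozen weight, shape (d_out, d_in)
    b: trainable matrix, shape (d_out, r)
    a: trainable matrix, shape (r, d_in)
    """
    delta = matmul(b, a)          # low-rank update, shape (d_out, d_in)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical 2x2 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[3.0, 4.0]]                  # shape (1, 2)
B_ZERO = [[0.0], [0.0]]           # B starts at zero, so the update is zero
B_TRAINED = [[1.0], [2.0]]        # stand-in for B after some training

initial = lora_effective_weight(W, A, B_ZERO, alpha=2, r=1)
adapted = lora_effective_weight(W, A, B_TRAINED, alpha=2, r=1)
```

Because B is initialized to zero, the adapted model starts out identical to the pretrained one, and only the small matrices A and B need to be trained and stored.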

Taxonomy

Topics: Infant Development and Preterm Care · Healthcare Technology and Patient Monitoring · Neonatal Respiratory Health Research