AiSciVision: A Framework for Specializing Large Multimodal Models in Scientific Image Classification

Brendan Hogan; Anmol Kabra; Felipe Siqueira Pacheco; Laura Greenstreet; Joshua Fan; Aaron Ferber; Marta Ummus; Alecsander Brito; Olivia Graham; Lillian Aoki; Drew Harvell; Alex Flecker; Carla Gomes

arXiv:2410.21480·cs.LG·October 30, 2024

TL;DR

AiSciVision is a framework that specializes large multimodal models for scientific image classification by mimicking expert reasoning, improving interpretability and performance in niche scientific domains.

Contribution

The paper introduces AiSciVision, combining visual retrieval-augmented generation and domain-specific tools to enhance interpretability and accuracy of large multimodal models in scientific image classification.

Findings

01. Outperforms fully supervised models in both low-labeled and full-labeled data settings
02. Provides an interpretable reasoning transcript for each prediction
03. Successfully deployed in real-world aquaculture research applications

Abstract

Trust and interpretability are crucial for the use of Artificial Intelligence (AI) in scientific research, but current models often operate as black boxes offering limited transparency and justifications for their outputs. We introduce AiSciVision, a framework that specializes Large Multimodal Models (LMMs) into interactive research partners and classification models for image classification tasks in niche scientific domains. Our framework uses two key components: (1) Visual Retrieval-Augmented Generation (VisRAG) and (2) domain-specific tools utilized in an agentic workflow. To classify a target image, AiSciVision first retrieves the most similar positive and negative labeled images as context for the LMM. Then the LMM agent actively selects and applies tools to manipulate and inspect the target image over multiple rounds, refining its analysis before making a final prediction. These…
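The two steps described in the abstract (retrieving the nearest positive and negative labeled examples, then running a multi-round tool loop) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity over precomputed embeddings stands in for whatever retrieval the framework actually uses, and `retrieve_context`, `classify`, `choose_tool`, `predict`, and the two toy tools are hypothetical names. In the real framework an LMM would make the tool choices and the final prediction; here simple stub functions take its place.

```python
import numpy as np

def cosine_similarity(query, bank):
    """Cosine similarity between one query vector and every row of `bank`."""
    query = query / np.linalg.norm(query)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return bank @ query

def retrieve_context(query_emb, bank_embs, bank_labels):
    """Return indices of the most similar positive and negative labeled examples.

    Assumption: labels are binary (1 = positive, 0 = negative) and embeddings
    were computed beforehand by some image encoder.
    """
    sims = cosine_similarity(query_emb, bank_embs)
    pos_idx = np.flatnonzero(bank_labels == 1)
    neg_idx = np.flatnonzero(bank_labels == 0)
    best_pos = int(pos_idx[np.argmax(sims[pos_idx])])
    best_neg = int(neg_idx[np.argmax(sims[neg_idx])])
    return best_pos, best_neg

def classify(image, context, choose_tool, tools, predict, max_rounds=3):
    """Multi-round loop: pick a tool, record its output, then predict.

    `choose_tool` and `predict` are hypothetical stand-ins for LMM calls.
    The returned transcript records every step that led to the prediction.
    """
    transcript = [("context", context)]
    for _ in range(max_rounds):
        name = choose_tool(transcript)
        if name is None:  # the agent decides it has seen enough
            break
        transcript.append((name, tools[name](image)))
    return predict(transcript), transcript

# Toy usage with made-up data and stub "agent" functions.
bank_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
bank_labels = np.array([1, 1, 0, 0])
query_emb = np.array([1.0, 0.2])
context = retrieve_context(query_emb, bank_embs, bank_labels)

tools = {
    "brightness": lambda img: float(img.mean()),  # toy inspection tool
    "contrast": lambda img: float(img.std()),     # toy inspection tool
}

def choose_tool(transcript):
    """Stub policy: try each tool once, in order, then stop."""
    used = {name for name, _ in transcript}
    for name in tools:
        if name not in used:
            return name
    return None

def predict(transcript):
    """Stub decision rule: positive if the image looked bright."""
    observations = dict(transcript)
    return 1 if observations.get("brightness", 0.0) > 0.5 else 0

image = np.full((4, 4), 0.8)
label, transcript = classify(image, context, choose_tool, tools, predict)
```

Keeping the transcript as a plain list of `(step, observation)` pairs mirrors the idea of an inspectable record of each prediction; how the actual framework stores and presents that record is not specified in this listing.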

Code & Models

Repositories

gomes-lab/AiSciVision
PyTorch · Official

Taxonomy

Topics: Image Retrieval and Classification Techniques