TL;DR
SciTune is a framework that fine-tunes large language models with human-curated scientific multimodal instructions, significantly improving their performance on science-related visual and language tasks.
Contribution
This work introduces SciTune, a tuning framework that improves LLMs' ability to follow scientific multimodal instructions. The resulting model, LLaMA-SciTune, outperforms state-of-the-art models and surpasses human performance on some benchmarks.
Findings
LLaMA-SciTune outperforms state-of-the-art models at generating figure types and captions on the SciCap and VisText benchmarks.
LLaMA-SciTune surpasses human performance on average and in many sub-categories of the ScienceQA benchmark.
Human-generated scientific instructions are highly valuable despite their lower volume.
Abstract
Instruction finetuning is a popular paradigm to align large language models (LLMs) with human intent. Despite its popularity, this idea is less explored in improving LLMs to align existing foundation models with scientific disciplines, concepts and goals. In this work, we present SciTune as a tuning framework to improve the ability of LLMs to follow multimodal instructions generated from scientific publications. To test our methodology, we train a large multimodal model LLaMA-SciTune that connects a vision encoder and LLM for science-focused visual and language understanding. LLaMA-SciTune significantly outperforms the state-of-the-art models in the generated figure types and captions in SciCap and VisText benchmarks. In comparison to the models that are finetuned with synthetic data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories on the…
