Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition

David F. Ramirez; Tim L. Overman; Kristen Jaskie; Marv Kleine; Andreas Spanias

arXiv:2605.10772 · cs.CV · May 12, 2026

TL;DR

This paper explores applying large language-vision models (LLVMs) to synthetic aperture radar (SAR) imagery for automatic target recognition. It builds a benchmark from the MSTAR dataset and fine-tunes LLVMs to identify military vehicles, reporting 98% accuracy.

Contribution

The paper introduces a SAR-specific benchmark dataset with captions and question-answer pairs, and shows that fine-tuned LLVMs reach 98% accuracy in target identification.

Findings

01. Developed a SAR benchmark with captions and VQA pairs from the MSTAR dataset.

02. Fine-tuned LLVMs achieve 98% accuracy in target recognition.

03. Addressed challenges in applying language-vision models to SAR imagery.

Abstract

Large language-vision models (LLVMs), such as OpenAI's ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. Merging these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements on geospatial perception tasks. Our study examines the application of LLVMs to remote sensing image captioning and visual question answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine recently published LLVM methods, including the CLIP and LLaVA transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This benchmark has been extended to include descriptive text captions and question-answer pairs for VQA…
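To make the idea of extending a labeled SAR dataset with captions and question-answer pairs concrete, the following is a minimal sketch of one way such text could be generated from per-image labels. It is not the authors' pipeline: the `SarChip` record, the `make_caption` and `make_qa_pairs` helpers, the sentence templates, the file path, and the example class names and depression angle are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SarChip:
    """Hypothetical metadata record for one labeled SAR image chip."""
    image_path: str        # illustrative path, not a real dataset layout
    target_class: str      # vehicle class label, e.g. "T72"
    depression_deg: float  # sensor depression angle in degrees


def make_caption(chip: SarChip) -> str:
    """Fill a fixed sentence template with the chip's labels (assumed template)."""
    return (
        f"A synthetic aperture radar image of a vehicle of type "
        f"{chip.target_class}, collected at a {chip.depression_deg:g} "
        f"degree depression angle."
    )


def make_qa_pairs(chip: SarChip, all_classes: List[str]) -> List[Tuple[str, str]]:
    """Build open-ended and yes/no question-answer pairs from the labels."""
    pairs = [
        ("What type of vehicle is shown in this SAR image?", chip.target_class),
        ("What is the depression angle of this image?",
         f"{chip.depression_deg:g} degrees"),
        (f"Is this a vehicle of type {chip.target_class}?", "Yes"),
    ]
    # Add one negative yes/no question for every other class in the label set.
    for other in all_classes:
        if other != chip.target_class:
            pairs.append((f"Is this a vehicle of type {other}?", "No"))
    return pairs


if __name__ == "__main__":
    chip = SarChip("chips/example.png", "T72", 17.0)
    print(make_caption(chip))
    for question, answer in make_qa_pairs(chip, ["BMP2", "BTR70", "T72"]):
        print(f"Q: {question}  A: {answer}")
```

Template-based generation like this keeps every answer traceable to a ground-truth label, which makes exact-match scoring straightforward, at the cost of less varied language than human-written captions.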
