Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition
David F. Ramirez, Tim L. Overman, Kristen Jaskie, Marv Kleine, Andreas Spanias

TL;DR
This paper explores the application of large language-vision models to synthetic aperture radar imagery for automatic target recognition, developing a benchmark and fine-tuning methods to identify military vehicles with high accuracy.
Contribution
It introduces a SAR-specific benchmark dataset with captions and questions, and demonstrates fine-tuning of LLVMs achieving 98% accuracy in target identification.
Findings
Developed a SAR benchmark with captions and VQA pairs derived from the MSTAR Public Dataset.
Fine-tuned LLVMs achieve 98% accuracy in target recognition.
Addressed challenges in applying language-vision models to SAR imagery.
Abstract
Large language-vision models (LLVM), such as OpenAI's ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA…
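As a rough illustration of how class-labeled SAR image chips might be extended with captions and question-answer pairs, and how identification accuracy could then be scored, here is a minimal Python sketch. It is not the authors' pipeline: the caption and question templates, the `VEHICLE_TYPES` map, the `VQAExample` record, and the exact-match scoring rule are all assumptions made for illustration. The three labels shown are commonly cited MSTAR target classes, but the abstract does not list which classes the benchmark uses.

```python
from dataclasses import dataclass

# Hypothetical label-to-description map (an assumption, not from the paper).
VEHICLE_TYPES = {
    "T72": "main battle tank",
    "BMP2": "infantry fighting vehicle",
    "ZSU23-4": "self-propelled anti-aircraft gun",
}


@dataclass
class VQAExample:
    image_id: str
    caption: str
    question: str
    answer: str


def make_example(image_id: str, label: str) -> VQAExample:
    """Build one templated caption and question-answer pair from a class label."""
    description = VEHICLE_TYPES[label]
    caption = f"A SAR image chip of a {label} {description}."
    question = "What type of vehicle is shown in this SAR image?"
    return VQAExample(image_id, caption, question, label)


def accuracy(predictions: list[str], examples: list[VQAExample]) -> float:
    """Fraction of predictions that match the reference answer, ignoring case."""
    if not examples or len(predictions) != len(examples):
        raise ValueError("predictions and examples must be non-empty and equal in length")
    correct = sum(
        pred.strip().upper() == ex.answer.upper()
        for pred, ex in zip(predictions, examples)
    )
    return correct / len(examples)


examples = [make_example("chip_0001", "T72"), make_example("chip_0002", "BMP2")]
```

In practice, model outputs are free-form text, so a real evaluation would likely need a more forgiving answer-matching rule than the exact string comparison assumed here.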
