MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

Kai Sun; Yushi Bai; Zhen Yang; Jiajie Zhang; Ji Qi; Lei Hou; Juanzi Li

arXiv:2505.20152·cs.CV·October 2, 2025

TL;DR

MMGeoLM introduces a hard negative contrastive learning framework for large multimodal models. Training the vision encoder with generation-based, rule-based, and retrieval-based hard negatives significantly improves fine-grained geometric reasoning, and the resulting model outperforms other open-source models on geometric reasoning benchmarks.

Contribution

The paper proposes a novel hard negative contrastive learning approach for vision encoders, enhancing geometric understanding in large multimodal models, and demonstrates its effectiveness on multiple benchmarks.
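
As a rough illustration of the idea, not the authors' implementation, the sketch below computes an InfoNCE-style contrastive loss for a single anchor whose negatives are supplied explicitly rather than drawn at random from the batch. The function names, the toy 2-D embeddings, and the temperature of 0.07 are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_info_nce(anchor, positive, hard_negatives, temperature=0.07):
    """InfoNCE loss for one anchor.

    The loss is -log softmax of the positive's similarity, where the softmax
    runs over the positive and every explicitly supplied negative.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in hard_negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings (hypothetical values, chosen only for illustration).
image = [1.0, 0.0]            # anchor: diagram embedding
caption = [0.9, 0.1]          # positive: matching caption embedding
easy_negative = [-1.0, 0.0]   # far from the anchor
hard_negative = [0.8, 0.3]    # close to the positive

loss_easy = hard_negative_info_nce(image, caption, [easy_negative])
loss_hard = hard_negative_info_nce(image, caption, [hard_negative])
print(loss_easy < loss_hard)
```

In this toy setup a negative that lies close to the positive yields a much larger loss than a distant one, which is the training signal that hard negatives are meant to provide.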

Findings

1. MMGeoLM outperforms other open-source models on geometric reasoning benchmarks.
2. Hard negative training improves the model's ability to distinguish fine-grained geometric differences.
3. Even a 7B model rivals larger closed-source models like GPT-4o.

Abstract

Large Multimodal Models (LMMs) typically build on ViTs (e.g., CLIP), yet their training with simple random in-batch negatives limits the ability to capture fine-grained visual differences, particularly in geometric scenarios. To address this challenge, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train a vision encoder (CLIP) using our hard negative training method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms…
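
To make the rule-based text negatives concrete, here is a minimal sketch, assuming a small hand-written table of contradicting geometric terms. The swap table, the +10 numeric offset, and the function name are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical swap rules: each maps a geometric term to a contradicting one.
RELATION_SWAPS = {
    "parallel": "perpendicular",
    "perpendicular": "parallel",
    "acute": "obtuse",
    "obtuse": "acute",
    "inscribed": "circumscribed",
    "circumscribed": "inscribed",
}


def rule_based_negatives(caption):
    """Return captions that differ from `caption` by exactly one term or number."""
    negatives = []
    # Rule 1: replace one relation word with its opposite.
    for match in re.finditer(r"[A-Za-z]+", caption):
        swap = RELATION_SWAPS.get(match.group(0).lower())
        if swap is not None:
            negatives.append(caption[:match.start()] + swap + caption[match.end():])
    # Rule 2: change one standalone integer (an assumed offset of +10).
    for match in re.finditer(r"\b\d+\b", caption):
        changed = str(int(match.group(0)) + 10)
        negatives.append(caption[:match.start()] + changed + caption[match.end():])
    return negatives


caption = "Line AB is parallel to line CD, and angle ABC is 60 degrees."
for negative in rule_based_negatives(caption):
    print(negative)
```

Each output stays almost identical to the original caption while contradicting the diagram, so a text encoder cannot separate it from the positive using coarse cues alone.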

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper presents a novel hard negative contrastive learning framework that effectively addresses the challenge of capturing fine-grained visual differences in geometric scenarios. By combining image-based and text-based contrastive learning, the approach enhances the model's ability to understand complex geometric relationships.
2. The integration of generation-based hard negatives and rule-based and retrieval-based text negatives is a unique contribution.

Weaknesses

1. While the model performs well on the selected benchmarks, the paper does not discuss its performance on a broader range of geometric reasoning tasks. It would be beneficial to understand how MMGeoLM generalizes to other problem domains.
2. The paper lacks comparisons with other state-of-the-art models in the field. Including such comparisons would provide a clearer context for evaluating MMGeoLM's performance and highlight its relative advantages.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper introduces a practical method for improving geometric reasoning in MLLMs. The core idea of using perturbed diagram generation code to create high-quality image-based hard negatives is technically sound.
- The proposed approach (MMGeoLM) demonstrates significant performance gains, reportedly outperforming similarly sized open-source models on three relevant geometric reasoning benchmarks.

Weaknesses

- The core novelty is a clever data generation/curation technique. Perturbing code to create image negatives and using rules for text negatives is, while effective, fundamentally a sophisticated form of data augmentation or data engineering, which limits its conceptual novelty and contribution.
- The empirical evaluation could be significantly strengthened. The chosen baselines are not fully representative of the current state-of-the-a

Reviewer 03 · Rating 2 · Confidence 5

Strengths

The paper targets an important challenge—enhancing geometry-specific visual encoders for reasoning tasks. The proposed hard negative generation approach is conceptually sound and improves the visual encoder’s fine-grained discrimination.

Weaknesses

The main contribution lies in applying hard negative contrastive learning to geometric visual encoders, which, while useful, represents an incremental improvement (more like a technical report) rather than a fundamentally new research direction. The approach primarily involves building a domain-specific dataset, generating hard negatives, and training with standard architectures and loss functions, without methodological innovation in model design or learning objectives. Moreover, geometry-specific

Code & Models

Repositories

thu-keg/mmgeolm
PyTorch · Official

Datasets

THU-KEG/MM-Math-Align
dataset · 27 dl

Taxonomy

Topics: Image Processing and 3D Reconstruction · Domain Adaptation and Few-Shot Learning · Neural Networks and Applications

Methods: Contrastive Language-Image Pre-training · Contrastive Learning