MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models
Kai Sun, Yushi Bai, Zhen Yang, Jiajie Zhang, Ji Qi, Lei Hou, Juanzi Li

TL;DR
MMGeoLM introduces a hard negative contrastive learning framework for large multimodal models. Training the vision encoder with generation-based, rule-based, and retrieval-based hard negatives significantly improves fine-grained geometric reasoning, and the resulting model outperforms other open-source models on geometric reasoning benchmarks.
Contribution
The paper proposes a novel hard negative contrastive learning approach for vision encoders, enhancing geometric understanding in large multimodal models, and demonstrates its effectiveness on multiple benchmarks.
Findings
MMGeoLM outperforms other open-source models on geometric reasoning benchmarks.
Hard negative training improves the model's ability to distinguish fine-grained geometric differences.
Even a 7B model rivals larger closed-source models like GPT-4o.
Abstract
Large Multimodal Models (LMMs) typically build on ViTs (e.g., CLIP), yet their training with simple random in-batch negatives limits the ability to capture fine-grained visual differences, particularly in geometric scenarios. To address this challenge, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train a vision encoder (CLIP) using our hard negative training method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms…
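To make the role of hard negatives concrete, here is a minimal sketch of a standard InfoNCE-style contrastive loss in which the negatives are supplied explicitly instead of being drawn at random from the batch. It is not the authors' implementation: the function name `hard_negative_info_nce`, the toy 2-D embeddings, and the temperature of 0.07 are all assumptions chosen for illustration. The point is only that negatives lying close to the positive (as a perturbed diagram or a minimally edited caption would) produce a much larger loss, and hence a stronger training signal, than unrelated negatives.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def hard_negative_info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for one anchor with an explicit list of negatives.

    anchor:    embedding of, e.g., a diagram image
    positive:  embedding of its matching caption
    negatives: embeddings of non-matching captions (hard or random)
    The name and the default temperature are illustrative assumptions.
    """
    a = normalize(anchor)
    # The positive's logit goes first; the negatives follow.
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive.
    return log_denominator - logits[0]


# Toy 2-D embeddings, made up for illustration only.
image = [1.0, 0.0]
caption = [0.9, 0.1]
random_negatives = [[-1.0, 0.2], [0.0, -1.0]]   # unrelated captions
hard_negatives = [[0.85, 0.2], [0.8, -0.15]]    # near-miss captions

loss_random = hard_negative_info_nce(image, caption, random_negatives)
loss_hard = hard_negative_info_nce(image, caption, hard_negatives)
print(f"loss with random negatives: {loss_random:.6f}")
print(f"loss with hard negatives:   {loss_hard:.6f}")
```

With the toy values above, the random negatives are already far from the image embedding, so the loss is close to zero and contributes almost nothing to training, whereas the near-miss negatives keep the loss well above zero. The same function would apply symmetrically to the image-based case, with a caption as the anchor and perturbed diagrams as negatives.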
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper presents a novel hard negative contrastive learning framework that effectively addresses the challenge of capturing fine-grained visual differences in geometric scenarios. By combining image-based and text-based contrastive learning, the approach enhances the model's ability to understand complex geometric relationships.
2. The integration of generation-based hard negatives and rule-based and retrieval-based text negatives is a unique contribution.
1. While the model performs well on the selected benchmarks, the paper does not discuss its performance on a broader range of geometric reasoning tasks. It would be beneficial to understand how MMGeoLM generalizes to other problem domains.
2. The paper lacks comparisons with other state-of-the-art models in the field. Including such comparisons would provide a clearer context for evaluating MMGeoLM's performance and highlight its relative advantages.
- The paper introduces a practical method for improving geometric reasoning in MLMMs. The core idea of using perturbed diagram generation code to create high-quality image-based hard negatives is technically sound.
- The proposed approach (MMGeoLM) demonstrates significant performance gains, reportedly outperforming similarly sized open-source models on three relevant geometric reasoning benchmarks.
- The core novelty is a clever data generation/curation technique. The method of perturbing code to create image negatives and using rules for text negatives is a form of data augmentation. While effective, the contribution is fundamentally a sophisticated form of data augmentation or data engineering, which limits its conceptual novelty.
- The empirical evaluation could be significantly strengthened. The chosen baselines are not fully representative of the current state-of-the-a
The paper targets an important challenge—enhancing geometry-specific visual encoders for reasoning tasks. The proposed hard negative generation approach is conceptually sound and improves the visual encoder’s fine-grained discrimination.
The main contribution lies in applying hard negative contrastive learning to geometric visual encoders, which, while useful, represents an incremental improvement (closer to a technical report) rather than a fundamentally new research direction. The approach primarily involves building a domain-specific dataset, generating hard negatives, and training with standard architectures and loss functions, without methodological innovation in model design or learning objectives. Moreover, geometry-specific
Taxonomy
Topics: Image Processing and 3D Reconstruction · Domain Adaptation and Few-Shot Learning · Neural Networks and Applications
Methods: Contrastive Language-Image Pre-training · Contrastive Learning
