Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition

Haocheng Tang; Xingyu Dang; Junmei Wang

arXiv:2604.03476 · cs.CV · April 22, 2026


TL;DR

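This paper adapts DeepSeek-OCR-2 to molecular structure recognition by framing the task as image-conditioned SMILES generation. It uses a two-stage fine-tuning strategy (LoRA first, then selective full-parameter fine-tuning) and trains on synthetic PubChem renderings combined with patent images from USPTO-MOL.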

Contribution

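The paper proposes a two-stage progressive supervised fine-tuning strategy to overcome the training instabilities of direct full-parameter fine-tuning: it starts with parameter-efficient LoRA and then moves to selective full-parameter fine-tuning with split learning rates, trained on a large-scale corpus of synthetic and patent images.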

Findings

01. MolSeek-OCR achieves competitive exact-match accuracy.

02. Reinforcement-style post-training does not improve sequence fidelity.

03. Data curation alone is insufficient for perfect SMILES matching.
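
The findings above are stated in terms of exact-match accuracy on SMILES strings. As a minimal sketch of how such a metric could be computed, the snippet below assumes plain string equality after trimming whitespace. The evaluation protocol used in the paper (for example, whether SMILES are canonicalized before comparison) is not given on this page, so the function name `exact_match_accuracy` and the toy data are hypothetical and are not results from the paper.

```python
def exact_match_accuracy(predictions, references):
    """Return the fraction of predictions identical to their reference string.

    Assumption: a "match" is plain string equality after stripping
    surrounding whitespace. No chemical canonicalization is applied.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical toy data for illustration only.
refs = ["CCO", "c1ccccc1", "CC(=O)O"]
preds = ["CCO", "C1=CC=CC=C1", "CC(=O)O"]  # second item: Kekule form of benzene

score = exact_match_accuracy(preds, refs)
print(f"exact-match accuracy: {score:.3f}")
```

The second prediction denotes the same molecule as its reference (benzene) but is written differently, so a string-level comparison counts it as a miss. This strictness is one reason sequence-level exact matching is a demanding target.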

Abstract

Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation. To overcome training instabilities, we propose a two-stage progressive supervised fine-tuning strategy: starting with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates. We train our model on a large-scale corpus combining synthetic renderings from PubChem and realistic patent images from USPTO-MOL to improve coverage and robustness. Our fine-tuned model,…
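
The "split learning rates" mentioned in the abstract can be illustrated with per-group learning rates. The sketch below is not the authors' implementation: the module prefix `vision_encoder.`, the parameter names, and both learning-rate values are hypothetical placeholders, and which parts of the model receive which rate is an assumption. It only shows the general pattern of routing parameters into groups by name, producing the list-of-dicts layout (`"params"` and `"lr"` keys) that PyTorch optimizers accept for per-parameter options.

```python
def build_param_groups(named_params, slow_prefixes, slow_lr, fast_lr):
    """Split (name, param) pairs into two groups with different learning rates.

    Parameters whose name starts with any prefix in `slow_prefixes` get
    `slow_lr`; all remaining parameters get `fast_lr`.
    """
    prefixes = tuple(slow_prefixes)
    slow, fast = [], []
    for name, param in named_params:
        (slow if name.startswith(prefixes) else fast).append(param)

    groups = []
    if slow:
        groups.append({"params": slow, "lr": slow_lr})
    if fast:
        groups.append({"params": fast, "lr": fast_lr})
    return groups


# Hypothetical parameter names; strings stand in for real tensors here.
named = [
    ("vision_encoder.layer0.weight", "w0"),
    ("decoder.layer0.weight", "w1"),
    ("decoder.lm_head.weight", "w2"),
]

# Assumed values: a smaller rate for the encoder, a larger one elsewhere.
groups = build_param_groups(named, ["vision_encoder."], slow_lr=1e-6, fast_lr=1e-5)
for g in groups:
    print(g["lr"], g["params"])
```

With a real model, the `(name, param)` pairs would typically come from the model's named parameters, and the resulting groups would be passed to the optimizer in place of a flat parameter list.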
