Modular Multimodal Machine Learning for Extraction of Theorems and Proofs in Long Scientific Documents (Extended Version)
Shrey Mishra, Antoine Gauquier, Pierre Senellart

TL;DR
This paper introduces a modular multimodal machine learning approach for extracting theorems and proofs from long scientific PDFs. It combines text, font, and image features and requires neither OCR preprocessing nor LaTeX sources at inference time.
Contribution
It presents a cross-modal attention mechanism that builds multimodal paragraph embeddings, together with a novel sliding window transformer architecture that captures sequential information across paragraphs in multi-page documents.
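To make the cross-modal attention step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`cross_modal_attention`, `fuse_paragraph`), the absence of learned projection matrices, and the choice to concatenate the attended vectors are all assumptions made for illustration. Each modality's vector acts as a query over the other two modalities, using standard scaled dot-product attention.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(query, keys_values):
    # Scaled dot-product attention: one modality's vector (query) attends
    # over the vectors of the other modalities (used as both keys and values).
    # Assumption: no learned projections, all vectors share one dimension.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, kv)) / math.sqrt(d)
              for kv in keys_values]
    weights = softmax(scores)
    fused = [sum(w * kv[i] for w, kv in zip(weights, keys_values))
             for i in range(d)]
    return fused, weights

def fuse_paragraph(text_vec, font_vec, image_vec):
    # Hypothetical fusion: each modality queries the other two, and the
    # attended vectors are concatenated into one paragraph embedding.
    modalities = [text_vec, font_vec, image_vec]
    embedding = []
    for i, query in enumerate(modalities):
        others = modalities[:i] + modalities[i + 1:]
        attended, _ = cross_modal_attention(query, others)
        embedding.extend(attended)
    return embedding
```

Calling `fuse_paragraph` on three vectors of dimension `d` yields one vector of dimension `3 * d` per paragraph. A trained model would add learned query, key, and value projections in front of the attention step.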
Findings
Improved extraction accuracy with multimodal data over unimodal methods
Effective handling of multi-page PDFs and page breaks
No need for OCR or LaTeX during inference
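The handling of page breaks can be illustrated with a small sketch of how paragraph windows might be built. This is an assumption about the general sliding window idea, not the paper's code: the function name `sliding_windows`, the `radius` parameter, and the use of `None` as padding are hypothetical. Paragraphs from all pages are flattened into one sequence, so a window centred on the last paragraph of one page still sees the first paragraphs of the next.

```python
def sliding_windows(pages, radius=2, pad=None):
    # pages: list of pages, each a list of paragraph items (e.g. embeddings).
    # Flatten first so that windows can span page breaks.
    paragraphs = [p for page in pages for p in page]
    # Pad both ends so every paragraph gets a full-size window.
    padded = [pad] * radius + paragraphs + [pad] * radius
    # One window of size 2 * radius + 1 centred on each paragraph.
    return [padded[i:i + 2 * radius + 1] for i in range(len(paragraphs))]

pages = [["p1", "p2"], ["p3"]]
windows = sliding_windows(pages, radius=1)
# The window centred on "p2" includes "p3" from the following page.
```

Each window would then be passed to a sequence model that classifies its centre paragraph, which is what lets context from a neighbouring page inform the prediction.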
Abstract
We address the extraction of mathematical statements and their proofs from scholarly PDF articles as a multimodal classification problem, utilizing text, font features, and bitmap image renderings of PDFs as distinct modalities. We propose a modular sequential multimodal machine learning approach specifically designed for extracting theorem-like environments and proofs. This is based on a cross-modal attention mechanism to generate multimodal paragraph embeddings, which are then fed into our novel multimodal sliding window transformer architecture to capture sequential information across paragraphs. Our document AI methodology stands out as it eliminates the need for OCR preprocessing, LaTeX sources during inference, or custom pre-training on specialized losses to understand cross-modality relationships. Unlike many conventional approaches that operate at a single-page level, ours can…
Taxonomy
Topics: Advanced Text Analysis Techniques · Advanced Computational Techniques and Applications · Computational Physics and Python Applications
Methods: Depthwise Convolution · Pointwise Convolution · Depthwise Separable Convolution · Batch Normalization · 1x1 Convolution · Inverted Residual Block · Sigmoid Activation · Conditional Random Field · EfficientNetV2 · Tanh Activation
