MultiSChuBERT: Effective Multimodal Fusion for Scholarly Document Quality Prediction
Gideon Maillette de Buy Wenniger, Thomas van Dongen, Lambert Schomaker

TL;DR
This paper introduces MultiSChuBERT, a multimodal model combining text and visual features to improve scholarly document quality prediction, demonstrating significant performance gains over text-only models across multiple datasets and embeddings.
Contribution
The paper presents a novel multimodal fusion approach for SDQP, highlighting the impact of embedding choice and training strategies like gradual unfreezing for better performance.
Findings
Multimodal fusion improves SDQP accuracy.
Gradual unfreezing reduces overfitting of visual models.
Advanced embeddings like SPECTER2.0 enhance prediction results.
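As a rough illustration of the gradual-unfreezing finding, the sketch below computes which layer groups of a visual model would be trainable at a given epoch. The function name, the group names, and the one-group-per-stage schedule are assumptions made for illustration; the summary above does not state the schedule the authors used.

```python
from typing import List


def trainable_groups(groups: List[str], epoch: int, epochs_per_stage: int = 1) -> List[str]:
    """Return the layer groups that are trainable at `epoch` (0-based).

    `groups` is ordered from input (first) to output (last). Training starts
    with only the last group unfrozen; one more group, moving toward the
    input, is unfrozen every `epochs_per_stage` epochs. This schedule is an
    assumed one, not the schedule from the paper.
    """
    if epochs_per_stage < 1:
        raise ValueError("epochs_per_stage must be >= 1")
    if epoch < 0:
        raise ValueError("epoch must be >= 0")
    n_unfrozen = min(len(groups), 1 + epoch // epochs_per_stage)
    return groups[len(groups) - n_unfrozen:]


if __name__ == "__main__":
    # Hypothetical layer groups for a convolutional image model.
    layer_groups = ["stem", "middle_blocks", "top_blocks", "classifier_head"]
    for epoch in range(5):
        print(epoch, trainable_groups(layer_groups, epoch))
```

In a real training loop, the returned names would decide which parameters receive gradient updates in each epoch. The early layers stay frozen longest, which is the usual motivation for this kind of schedule when a pretrained visual model is fine-tuned on a small dataset.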
Abstract
Automatic assessment of the quality of scholarly documents is a difficult task with high potential impact. Multimodality, in particular the addition of visual information next to text, has been shown to improve the performance on scholarly document quality prediction (SDQP) tasks. We propose the multimodal predictive model MultiSChuBERT. It combines a textual model based on chunking full paper text and aggregating computed BERT chunk-encodings (SChuBERT), with a visual model based on Inception V3. Our work contributes to the current state-of-the-art in SDQP in three ways. First, we show that the method of combining visual and textual embeddings can substantially influence the results. Second, we demonstrate that gradual unfreezing of the weights of the visual sub-model reduces its tendency to overfit the data, improving results. Third, we show the retained benefit of multimodality when…
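The chunk-then-aggregate-then-fuse pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_encoder` is a hypothetical stand-in for a BERT chunk encoder, mean pooling stands in for whatever aggregation the model actually uses, and concatenation is only one of the possible ways of combining the textual and visual embeddings.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def chunk_tokens(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most `chunk_size` tokens."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Aggregate chunk encodings by element-wise mean (an assumed aggregation)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def fuse_by_concatenation(text_vec: Vector, visual_vec: Vector) -> Vector:
    """Combine the two modality embeddings by simple concatenation."""
    return list(text_vec) + list(visual_vec)


def toy_encoder(chunk: Sequence[str]) -> Vector:
    """Hypothetical stand-in for a BERT chunk encoder: two hand-made features."""
    return [float(len(chunk)), float(sum(len(tok) for tok in chunk))]


def embed_document(tokens: Sequence[str], chunk_size: int,
                   encoder: Callable[[Sequence[str]], Vector]) -> Vector:
    """Chunk the text, encode each chunk, and aggregate into one document vector."""
    return mean_pool([encoder(chunk) for chunk in chunk_tokens(tokens, chunk_size)])


if __name__ == "__main__":
    tokens = "automatic assessment of scholarly document quality is difficult".split()
    text_vec = embed_document(tokens, chunk_size=3, encoder=toy_encoder)
    visual_vec = [0.25, 0.75]  # placeholder for an image-model embedding
    print(fuse_by_concatenation(text_vec, visual_vec))
```

Because the abstract reports that the combination method substantially influences results, `fuse_by_concatenation` is the natural place to swap in alternatives when experimenting.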
Taxonomy
Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Adam · Attention Dropout · Residual Connection · Dense Connections
