LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, Zheng-Jun Zha

TL;DR
This paper introduces LoTLIP, a method that enhances language-image pre-training to better understand long texts by incorporating long captions and corner tokens, validated on a large-scale dataset for improved retrieval performance.
Contribution
The paper proposes a novel approach combining long caption relabeling and corner tokens to improve long text understanding in language-image pre-training models.
Findings
Enhanced long text understanding in LIP models.
Trade-off identified between caption length and efficiency.
Superior performance demonstrated on large-scale dataset.
Abstract
Understanding long text is in great demand in practice but beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key reason for this issue is that training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient tokens. To address this problem, our initial attempt is to relabel the data with long captions; however, learning directly from them may degrade performance in understanding short text (e.g., in the image classification task). Then, after incorporating corner tokens to aggregate diverse textual information, we manage to help the model catch up to its original level of short text understanding yet greatly enhance its capability of long text understanding. We further look into whether the model can continuously benefit from longer captions and notice a…
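The abstract says only that corner tokens "aggregate diverse textual information"; it does not specify where they sit in the sequence or how their outputs are combined. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the corner tokens are extra placeholder ids inserted right after [CLS], and that the text feature is the mean of the encoder outputs at the [CLS] and corner positions. The names `CLS_ID`, `CORNER_BASE_ID`, `add_corner_tokens`, and `aggregate_text_feature` are hypothetical.

```python
from typing import List

# Hypothetical vocabulary ids, chosen only for illustration.
CLS_ID = 101
CORNER_BASE_ID = 30000


def add_corner_tokens(token_ids: List[int], num_corner: int, max_len: int) -> List[int]:
    """Insert `num_corner` placeholder ids right after [CLS], then truncate.

    Assumption: corner tokens follow [CLS]; the source does not state this.
    """
    if not token_ids or token_ids[0] != CLS_ID:
        raise ValueError("expected a sequence that starts with [CLS]")
    corner_ids = [CORNER_BASE_ID + i for i in range(num_corner)]
    return ([token_ids[0]] + corner_ids + token_ids[1:])[:max_len]


def aggregate_text_feature(hidden: List[List[float]], num_corner: int) -> List[float]:
    """Mean-pool the encoder outputs at the [CLS] and corner positions.

    Assumption: mean pooling is a stand-in for whatever aggregation the
    paper actually uses.
    """
    slots = hidden[: 1 + num_corner]
    dim = len(slots[0])
    return [sum(vec[d] for vec in slots) / len(slots) for d in range(dim)]


# Usage: a 4-token caption gains two corner slots after [CLS].
ids = add_corner_tokens([CLS_ID, 7, 8, 9], num_corner=2, max_len=8)
# Toy 2-dimensional encoder outputs, one vector per position in `ids`.
feature = aggregate_text_feature(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0], [9.0, 9.0], [9.0, 9.0]],
    num_corner=2,
)
```

In this sketch, only the first three positions ([CLS] plus two corner slots) contribute to the pooled feature, so a long caption has several output slots to carry information rather than a single one.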
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques
