3rd Place Solution for Google Universal Image Embedding
Nobuaki Aoki, Yasumasa Namba

TL;DR
This paper describes an image embedding solution that uses ViT-H/14 from OpenCLIP with a two-stage training process, achieving 0.692 mean Precision @5 on the private leaderboard of the Google Universal Image Embedding Competition on Kaggle.
Contribution
The paper describes a two-stage training approach for image embedding that uses a ViT-H/14 backbone with ArcFace, which placed third in the Google Universal Image Embedding Competition on Kaggle.
Findings
Achieved 0.692 mean Precision @5 on the private leaderboard
Utilized ViT-H/14 from OpenCLIP as the backbone of ArcFace
Implemented a two-stage training process: first with the backbone frozen, then training the whole model
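To make the reported metric concrete, here is a minimal sketch of a mean Precision @5 computation. It is not the competition's scoring code. It assumes one common formulation in which, for each query, hits are counted among the first min(n_q, 5) predictions and divided by min(n_q, 5), where n_q is the number of relevant index images for that query. The function name, argument names, and data layout are hypothetical.

```python
def mean_precision_at_k(predictions, relevant, k=5):
    """Mean Precision @k over all queries (assumed formulation, see lead-in).

    predictions: dict mapping query id -> ranked list of predicted index ids
    relevant:    dict mapping query id -> set of relevant index ids
    """
    if not predictions:
        raise ValueError("no queries given")
    total = 0.0
    for query_id, ranked in predictions.items():
        rel = relevant[query_id]
        if not rel:
            raise ValueError(f"query {query_id!r} has no relevant items")
        # Assumption: normalise by min(n_q, k) so that queries with fewer
        # than k relevant items can still reach a precision of 1.0.
        cutoff = min(len(rel), k)
        hits = sum(1 for idx in ranked[:cutoff] if idx in rel)
        total += hits / cutoff
    return total / len(predictions)


if __name__ == "__main__":
    preds = {"q1": ["a", "b", "c", "d", "e"], "q2": ["x", "y", "z", "w", "v"]}
    rels = {"q1": {"a", "c", "r1", "r2", "r3", "r4"}, "q2": {"x"}}
    print(mean_precision_at_k(preds, rels))
```

In the toy data, q1 has two hits in its top five predictions and q2 has its single relevant item ranked first, so the two per-query precisions are 2/5 and 1/1.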
Abstract
This paper presents the 3rd place solution to the Google Universal Image Embedding Competition on Kaggle. We use ViT-H/14 from OpenCLIP as the backbone of an ArcFace model and train it in two stages. The first stage trains with the backbone frozen, and the second stage trains the whole model. We achieve 0.692 mean Precision @5 on the private leaderboard. Code is available at https://github.com/YasumasaNamba/google-universal-image-embedding
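As an illustration of the ArcFace (additive angular margin) head mentioned in the abstract, the following is a minimal pure-Python sketch for a single sample. It is not the authors' implementation: the function names are hypothetical, the `scale` and `margin` defaults are illustrative values rather than the settings used in the paper, and the sketch omits the handling that practical implementations add for angles where theta + margin exceeds pi.

```python
import math


def l2_normalize(vec):
    """Return vec scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def arcface_logits(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Scaled cosine logits with an angular margin added to the target class.

    embedding:     list of floats, e.g. the backbone output for one image
    class_weights: one weight vector per class (the classifier "centres")
    target:        index of the ground-truth class
    scale, margin: illustrative hyperparameters (assumptions, not the paper's)
    """
    emb = l2_normalize(embedding)
    logits = []
    for j, weight in enumerate(class_weights):
        w = l2_normalize(weight)
        cos_theta = sum(a * b for a, b in zip(emb, w))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos domain
        if j == target:
            # Penalise the target class by enlarging its angle by the margin.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]
    logits = arcface_logits([0.9, 0.1], weights, target=0)
    print(logits, cross_entropy(logits, 0))
```

In a two-stage setup like the one described, a head of this kind would be trained first on top of frozen backbone features, after which the backbone weights would also be updated; how the stages are configured in practice is specified in the authors' repository, not in this sketch.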
Taxonomy
Topics: AI in cancer detection · Advanced Image and Video Retrieval Techniques · Brain Tumor Detection and Classification
Methods: Additive Angular Margin Loss
