Exploring Visual Embedding Spaces Induced by Vision Transformers for Online Auto Parts Marketplaces
Cameron Armijo, Pablo Rivas

TL;DR
This paper investigates the use of Vision Transformers to generate and analyze visual embeddings of auto parts from online marketplaces, aiming to detect illicit activities through pattern recognition in image data.
Contribution
It demonstrates the application of ViT-based embeddings combined with dimensionality reduction and clustering to analyze visual patterns in online auto parts listings, highlighting both strengths and limitations.
Findings
ViT effectively isolates visual patterns in auto parts images.
Clustering reveals meaningful groupings but is hampered by overlapping clusters and outliers.
The single-modality (image-only) approach has limitations on complex marketplace data.
Abstract
This study examines the capabilities of the Vision Transformer (ViT) model in generating visual embeddings for images of auto parts sourced from online marketplaces, such as Craigslist and OfferUp. By focusing exclusively on single-modality data, the analysis evaluates ViT's potential for detecting patterns indicative of illicit activities. The workflow involves extracting high-dimensional embeddings from images, applying dimensionality reduction techniques like Uniform Manifold Approximation and Projection (UMAP) to visualize the embedding space, and using K-Means clustering to categorize similar items. Representative posts nearest to each cluster centroid provide insights into the composition and characteristics of the clusters. While the results highlight the strengths of ViT in isolating visual patterns, challenges such as overlapping clusters and outliers underscore the limitations…
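The clustering and representative-selection steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. Synthetic 8-dimensional vectors stand in for ViT embeddings, the UMAP step is left out because it serves visualization only, and the farthest-point centroid initialization is an assumption made here so that the run is deterministic. All function and variable names are hypothetical, and only the Python standard library is used.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def farthest_point_init(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from all centroids chosen so far (an assumption of this sketch)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def k_means(points, k, max_iterations=100):
    """Plain Lloyd's algorithm: alternate assignment and centroid update
    until the labels stop changing or max_iterations is reached."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = mean_vector(members)
    return labels, centroids


def nearest_to_centroids(points, labels, centroids):
    """For each cluster, return the index of the member closest to its centroid."""
    representatives = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            representatives[c] = min(
                members, key=lambda i: squared_distance(points[i], centroid)
            )
    return representatives


# Synthetic stand-in for image embeddings: three well-separated groups of
# 20 vectors each (real ViT embeddings are far higher-dimensional).
rng = random.Random(42)
embeddings = [
    [rng.gauss(10.0 * group, 1.0) for _ in range(8)]
    for group in range(3)
    for _ in range(20)
]

labels, centroids = k_means(embeddings, k=3)
representatives = nearest_to_centroids(embeddings, labels, centroids)

for cluster, index in sorted(representatives.items()):
    print(f"cluster {cluster}: representative item index {index}")
```

In a real pipeline, each representative index would map back to a marketplace post, so that the post nearest each centroid can be inspected by hand. A library implementation such as scikit-learn's `KMeans` would normally replace the hand-written loop.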
Taxonomy
Topics: Image and Video Quality Assessment · Visual Attention and Saliency Detection
