A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata
Azrin Sultana, Firoz Ahmed

TL;DR
This paper introduces a lightweight multimodal framework combining visual UI features and semantic data to accurately predict app ratings, improving over models that use only one data type.
Contribution
It presents a novel fusion of visual and textual features using MobileNetV3, DistilBERT, and a gated fusion module for app rating prediction, optimized for edge deployment.
Findings
Achieved a low mean absolute error (MAE) of 0.1060 in rating prediction
Demonstrated a high correlation coefficient of 0.9251
Validated the framework's effectiveness through extensive ablation studies
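As a rough illustration of how figures like these are computed, the sketch below evaluates MAE, MSE, RMSE, and a correlation coefficient on a handful of ratings. The helper name `regression_metrics` is hypothetical, the sample ratings are made up for the example, and the use of Pearson's r is an assumption, since this summary does not say which correlation coefficient the authors report.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and Pearson's r for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # root mean square error

    # Pearson's r (assumed here; the source does not name the coefficient).
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    denom = math.sqrt(var_t * var_p)
    r = cov / denom if denom > 0 else float("nan")

    return {"mae": mae, "mse": mse, "rmse": rmse, "pearson_r": r}


# Made-up ratings for illustration only; these are not data from the paper.
actual = [4.5, 3.8, 4.1, 2.9, 4.7]
predicted = [4.4, 3.9, 4.0, 3.1, 4.6]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same units as the ratings themselves, so an MAE near 0.1 would mean predictions are off by about a tenth of a star on average.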
Abstract
App ratings are among the most significant indicators of the quality, usability, and overall user satisfaction of mobile applications. However, existing app rating prediction models are largely limited to textual data or user interface (UI) features, overlooking the importance of jointly leveraging UI and semantic information. To address these limitations, this study proposes a lightweight vision-language framework that integrates both mobile UI and semantic information for app rating prediction. The framework combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features. These multimodal features are fused through a gated fusion module with Swish activations, followed by a multilayer perceptron (MLP) regression head. The proposed model is evaluated using mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE),…
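The fusion step named in the abstract (a gated fusion module with Swish activations feeding an MLP regression head) might look roughly like the NumPy forward pass below. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `GatedFusionRegressor`, the gating formula (a sigmoid gate that blends the two projected modalities), the layer sizes, and the random weights are all hypothetical, and random vectors stand in for the MobileNetV3 and DistilBERT encoder outputs.

```python
import numpy as np


def swish(x):
    # Swish (also called SiLU): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GatedFusionRegressor:
    """Hypothetical gated fusion + MLP head; forward pass only, untrained."""

    def __init__(self, visual_dim, text_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)

        def init(rows, cols):
            # Small random weights; a real model would learn these.
            return rng.normal(0.0, 0.02, size=(rows, cols))

        self.w_v, self.b_v = init(visual_dim, hidden_dim), np.zeros(hidden_dim)
        self.w_t, self.b_t = init(text_dim, hidden_dim), np.zeros(hidden_dim)
        self.w_g, self.b_g = init(2 * hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.w_1, self.b_1 = init(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.w_2, self.b_2 = init(hidden_dim, 1), np.zeros(1)

    def fuse(self, visual, text):
        # Project each modality into a shared space with a Swish activation.
        h_v = swish(visual @ self.w_v + self.b_v)
        h_t = swish(text @ self.w_t + self.b_t)
        # Assumed gating rule: a per-dimension sigmoid gate computed from both
        # modalities decides how much of each projected feature to keep.
        gate = sigmoid(np.concatenate([h_v, h_t], axis=-1) @ self.w_g + self.b_g)
        fused = gate * h_v + (1.0 - gate) * h_t
        return fused, gate, h_v, h_t

    def predict(self, visual, text):
        fused = self.fuse(visual, text)[0]
        hidden = swish(fused @ self.w_1 + self.b_1)         # MLP hidden layer
        return (hidden @ self.w_2 + self.b_2).squeeze(-1)   # one rating per app


# Placeholder feature sizes and random inputs standing in for encoder outputs.
rng = np.random.default_rng(42)
visual = rng.normal(size=(4, 960))   # stand-in for pooled UI image features
text = rng.normal(size=(4, 768))     # stand-in for pooled text features

model = GatedFusionRegressor(visual_dim=960, text_dim=768, hidden_dim=128)
fused, gate, h_v, h_t = model.fuse(visual, text)
ratings = model.predict(visual, text)
print(fused.shape, ratings.shape)
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the visual and textual projections, which lets the model lean on whichever modality is more informative for a given app.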
Taxonomy
Topics: Persona Design and Applications · Advanced Malware Detection Techniques · Web Data Mining and Analysis
