Performance Trade-offs of Optimizing Small Language Models for E-Commerce
Josip Tomo Licardo, Nikola Tankovic

TL;DR
This paper demonstrates that small, optimized open-weight language models can achieve near state-of-the-art accuracy in e-commerce tasks while significantly reducing computational costs and latency, making them practical for domain-specific applications.
Contribution
The paper introduces a methodology for fine-tuning (QLoRA) and quantizing (GPTQ, GGUF) a one-billion-parameter Llama 3.2 model for multilingual e-commerce intent recognition, matching larger models' performance with lower resource requirements.
Findings
A specialized 1B-parameter model reaches 99% accuracy on e-commerce intent recognition, matching larger models.
Quantization techniques significantly reduce memory usage.
The inference speed and efficiency of the quantized models depend on the target hardware, so the trade-offs differ between GPU and CPU deployment.
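To make the memory finding concrete, the following back-of-the-envelope sketch estimates the weights-only footprint of a nominal one-billion-parameter model at several precisions. The bit-widths and the round parameter count are illustrative assumptions, not figures reported in the paper, and the estimate ignores quantization metadata (scales, zero points), activations, and the KV cache.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Weights-only footprint in GiB (ignores scales, activations, KV cache)."""
    return num_params * bits_per_weight / 8 / 2**30


# Nominal "1B" parameter count; an assumption, the real model differs slightly.
NUM_PARAMS = 1_000_000_000

# Illustrative precisions; the paper's exact bit-widths are not assumed here.
for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the weight storage by a factor of four; real quantized files are somewhat larger because of the per-group metadata they carry.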
Abstract
Large Language Models (LLMs) offer state-of-the-art performance in natural language understanding and generation tasks. However, the deployment of leading commercial models for specialized tasks, such as e-commerce, is often hindered by high computational costs, latency, and operational expenses. This paper investigates the viability of smaller, open-weight models as a resource-efficient alternative. We present a methodology for optimizing a one-billion-parameter Llama 3.2 model for multilingual e-commerce intent recognition. The model was fine-tuned using Quantized Low-Rank Adaptation (QLoRA) on a synthetically generated dataset designed to mimic real-world user queries. Subsequently, we applied post-training quantization techniques, creating GPU-optimized (GPTQ) and CPU-optimized (GGUF) versions. Our results demonstrate that the specialized 1B model achieves 99% accuracy, matching the…
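The low-rank adaptation idea behind QLoRA can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' training code: the layer sizes, rank, and alpha are hypothetical, and the 4-bit quantization of the frozen base weights that QLoRA adds on top of LoRA is omitted. It shows the adapted forward pass y = W x + (alpha / rank) * B (A x), where only A and B would be trained.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, A and B are the adapter."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical toy sizes chosen only for illustration.
d_in, d_out, rank, alpha = 8, 6, 2, 4.0
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
B = [[0.0] * rank for _ in range(d_out)]  # trainable, zero-initialized
x = [rng.gauss(0, 1) for _ in range(d_in)]

full_params = d_in * d_out
adapter_params = rank * (d_in + d_out)
print(f"trainable: {adapter_params} of {full_params} base weights")

# With B initialized to zero the adapter contributes nothing before training.
print(lora_forward(W, A, B, x, alpha, rank) == matvec(W, x))
```

Because B starts at zero, the adapted layer initially reproduces the base model exactly, and training only has to learn the small A and B matrices rather than the full weight matrix.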
