Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale
David Noever, Forrest McKee

TL;DR
This paper introduces a scalable benchmark for evaluating large language models as autonomous freelance programmers, measuring their task success and earnings on synthetic economic data tasks.
Contribution
It presents a novel, automated benchmarking framework for assessing LLMs on freelance programming tasks with monetary valuation, enabling scalable performance analysis.
Findings
Claude 3.5 Haiku earns approximately $1.52 million
GPT-4o-mini earns approximately $1.49 million
Models rarely fail completely on tasks
Abstract
This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. This work presents a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. We construct the benchmark using synthetic tasks created from a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (median fixed-project price around $306). Each task is accompanied by structured input-output test cases and an estimated price tag, enabling automated correctness checking and a monetary performance valuation. This approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1M total). Our framework, however, simplifies evaluation by using programmatically testable tasks and predicted price values, making it highly scalable and…
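The scoring idea in the abstract (input-output test cases plus a price tag per task) can be sketched roughly as follows. This is a minimal illustration, not the authors' harness: the `Task` class, the `passes_all` and `total_earnings` helpers, the two example tasks, and their prices are all hypothetical. It also assumes that a task's price is credited only when every test case passes, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Task:
    """Hypothetical task record: a name, a price tag, and I/O test cases."""
    name: str
    price_usd: float
    test_cases: List[Tuple[Any, Any]]  # (input, expected output) pairs


def passes_all(solution: Callable[[Any], Any], task: Task) -> bool:
    """Return True only if the solution matches every expected output."""
    for given, expected in task.test_cases:
        try:
            if solution(given) != expected:
                return False
        except Exception:
            # A crash on any test case counts as a failure.
            return False
    return True


def total_earnings(solutions: Dict[str, Callable[[Any], Any]],
                   tasks: List[Task]) -> float:
    """Sum the prices of tasks whose submitted solution passes all tests.

    Assumption: all-or-nothing credit per task (not confirmed by the source).
    """
    return sum(
        task.price_usd
        for task in tasks
        if task.name in solutions and passes_all(solutions[task.name], task)
    )


# Illustrative tasks and prices, invented for this sketch.
tasks = [
    Task("sum_list", 250.0, [([1, 2, 3], 6), ([], 0)]),
    Task("reverse_text", 100.0, [("abc", "cba")]),
]

# Stand-ins for model-generated solutions; the second is deliberately wrong.
solutions = {
    "sum_list": sum,
    "reverse_text": lambda s: s,
}

earned = total_earnings(solutions, tasks)
print(f"Earned ${earned:.2f} of ${sum(t.price_usd for t in tasks):.2f}")
```

A partial-credit variant (for example, paying in proportion to the fraction of test cases passed) would change only `total_earnings`. Which rule the benchmark actually uses is not determined by the abstract.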
Taxonomy
Topics: Digital Economy and Work Transformation · Retirement, Disability, and Employment · AI and HR Technologies
