When Career Data Runs Out: Structured Feature Engineering and Signal Limits for Founder Success Prediction
Yagiz Ihlamur

TL;DR
This paper develops structured feature engineering from raw JSON data to predict startup success, demonstrating the limits of current signals and the need for richer datasets.
Contribution
It introduces a structured feature engineering approach and benchmarks the signal limits in founder success prediction using JSON data and LLM features.
Findings
A model built on 28 features engineered from raw JSON fields reaches Val F0.5 = 0.3030, a +17.7pp improvement over the zero-shot LLM baseline.
LLM-derived prose features capture 26.4% of model importance but add no cross-validation signal (delta = -0.05pp).
The dataset's information content caps performance (CV ~= 0.25, Val ~= 0.30), indicating the need for richer data.
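As a rough illustration of what engineering structured features from raw JSON fields can look like, here is a minimal sketch. The record layout (top-level `jobs`, `education`, and `exits` lists) and the four feature names are assumptions made for this example; they are not the paper's actual schema or its 28 features.

```python
import json


def extract_features(record: dict) -> dict:
    """Turn one founder record into flat numeric features.

    The keys read here ("jobs", "education", "exits") are a hypothetical
    schema chosen for illustration only.
    """
    jobs = record.get("jobs", [])
    education = record.get("education", [])
    exits = record.get("exits", [])
    return {
        "num_jobs": len(jobs),
        "num_degrees": len(education),
        "num_exits": len(exits),
        "has_prior_exit": int(len(exits) > 0),
    }


# Hypothetical raw record, not taken from the paper's dataset.
raw = (
    '{"jobs": [{"title": "Engineer"}, {"title": "CTO"}], '
    '"education": [{"degree": "BSc"}], "exits": []}'
)
features = extract_features(json.loads(raw))
print(features)
```

Counts and binary flags like these suit shallow tree models such as boosted stumps, since each stump splits on a single feature threshold.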
Abstract
Predicting startup success from founder career data is hard. The signal is weak, the labels are rare (9%), and most founders who succeed look almost identical to those who fail. We engineer 28 structured features directly from raw JSON fields -- jobs, education, exits -- and combine them with a deterministic rule layer and XGBoost boosted stumps. Our model achieves Val F0.5 = 0.3030, Precision = 0.3333, Recall = 0.2222 -- a +17.7pp improvement over the zero-shot LLM baseline. We then run a controlled experiment: extract 9 features from the prose field using Claude Haiku, at 67% and 100% dataset coverage. LLM features capture 26.4% of model importance but add zero CV signal (delta = -0.05pp). The reason is structural: anonymised_prose is generated from the same JSON fields we parse directly -- it is a lossy re-encoding, not a richer source. The ceiling (CV ~= 0.25, Val ~= 0.30) reflects…
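The reported validation metrics can be cross-checked against the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), which with beta = 0.5 weights precision more heavily than recall. The sketch below is a consistency check on the numbers quoted in the abstract, not the author's evaluation code; the function name `f_beta` is introduced here for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favours precision over recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # conventional value when precision and recall are both zero
    return (1 + b2) * precision * recall / denom


# Precision and recall as reported in the abstract.
val_f05 = f_beta(precision=0.3333, recall=0.2222)
print(round(val_f05, 4))
```

Plugging in the reported precision and recall gives a value that agrees with the stated Val F0.5 of 0.3030 to four decimal places.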
