# Predicting with Proxies: Transfer Learning in High Dimension

**Authors:** Hamsa Bastani

arXiv: 1812.11097 · 2020-05-07

## TL;DR

This paper proposes a two-step estimator for high-dimensional transfer learning that combines a large amount of proxy data with a small amount of true data. It corrects for bias in the proxy, needs far less true data to reach a given accuracy, and improves predictions on e-commerce and healthcare datasets.

## Contribution

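The paper proposes a two-step estimator that uses techniques from high-dimensional statistics to correct bias in proxy data. It proves upper bounds on the estimator's error and lower bounds on several heuristics used by data scientists, relying on a new LASSO tail inequality for approximately sparse vectors, and it validates the approach on real e-commerce and healthcare datasets.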
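The two steps can be written compactly under a linear model. The formulation below is a sketch, not a quotation from this summary: the notation ($X_{\text{proxy}}$, $Y_{\text{proxy}}$, $X_{\text{true}}$, $Y_{\text{true}}$, $n_{\text{proxy}}$, $n_{\text{true}}$, $\lambda$) and the use of ordinary least squares in the first step are assumptions made for illustration.

```latex
% Step 1: fit the abundant proxy data (least squares assumed here).
\hat{\beta}_{\text{proxy}}
  = \arg\min_{\beta} \frac{1}{n_{\text{proxy}}}
    \left\| Y_{\text{proxy}} - X_{\text{proxy}} \beta \right\|_2^2

% Step 2: fit the scarce true data, penalizing deviation from the
% proxy fit in the L1 norm.
\hat{\beta}_{\text{joint}}
  = \arg\min_{\beta} \frac{1}{n_{\text{true}}}
    \left\| Y_{\text{true}} - X_{\text{true}} \beta \right\|_2^2
    + \lambda \left\| \beta - \hat{\beta}_{\text{proxy}} \right\|_1
```

The $\ell_1$ penalty encodes the assumption that the bias $\beta_{\text{true}} - \beta_{\text{proxy}}$ is sparse, so only a few coordinates of the proxy fit need correcting from the true data.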

## Key findings

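- The proposed estimator can achieve the same accuracy with exponentially less true data (in the number of features)
- On real datasets, the bias in the proxy can often be captured by a sparse function of the features
- On e-commerce and healthcare datasets, the approach yields significantly better predictive accuracy, along with managerial insights into the nature of the proxy bias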

## Abstract

Predictive analytics is increasingly used to guide decision-making in many applications. However, in practice, we often have limited data on the true predictive task of interest, and must instead rely on more abundant data on a closely-related proxy predictive task. For example, e-commerce platforms use abundant customer click data (proxy) to make product recommendations rather than the relatively sparse customer purchase data (true outcome of interest); alternatively, hospitals often rely on medical risk scores trained on a different patient population (proxy) rather than their own patient population (true cohort of interest) to assign interventions. Yet, not accounting for the bias in the proxy can lead to sub-optimal decisions. Using real datasets, we find that this bias can often be captured by a sparse function of the features. Thus, we propose a novel two-step estimator that uses techniques from high-dimensional statistics to efficiently combine a large amount of proxy data and a small amount of true data. We prove upper bounds on the error of our proposed estimator and lower bounds on several heuristics used by data scientists; in particular, our proposed estimator can achieve the same accuracy with exponentially less true data (in the number of features). Our proof relies on a new LASSO tail inequality for approximately sparse vectors. Finally, we demonstrate the effectiveness of our approach on e-commerce and healthcare datasets; in both cases, we achieve significantly better predictive accuracy as well as managerial insights into the nature of the bias in the proxy data.
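The idea of combining abundant proxy data with scarce true data through a sparse bias correction can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's implementation: it assumes a linear model, uses ordinary least squares for the proxy step, and fits the bias with a hand-written coordinate-descent LASSO. All names (`lasso_cd`, `two_step_estimate`, `simulate`), the synthetic data, and the penalty value `lam` are hypothetical choices made for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X d||^2 + lam * ||d||_1."""
    n, p = X.shape
    d = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = np.asarray(y, dtype=float).copy()  # running residual y - X d
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * d[j]            # remove coordinate j's contribution
            rho = X[:, j] @ r / n
            d[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * d[j]            # add the updated contribution back
    return d


def two_step_estimate(X_proxy, y_proxy, X_true, y_true, lam):
    """Assumed two-step scheme: proxy fit, then a sparse correction."""
    # Step 1: least squares on the abundant proxy data.
    beta_proxy, *_ = np.linalg.lstsq(X_proxy, y_proxy, rcond=None)
    # Step 2: LASSO on what the proxy fit fails to explain in the true data.
    resid = y_true - X_true @ beta_proxy
    delta = lasso_cd(X_true, resid, lam)
    return beta_proxy + delta, beta_proxy, delta


def simulate(seed=0, p=100, n_proxy=2000, n_true=50, noise=0.3, lam=0.15):
    """Synthetic check: the true coefficients differ from the proxy's in 3 of p entries."""
    rng = np.random.default_rng(seed)
    beta_proxy_star = rng.normal(size=p)
    delta_star = np.zeros(p)
    delta_star[:3] = 1.0                   # sparse bias (an assumption of this demo)
    beta_true_star = beta_proxy_star + delta_star

    X_proxy = rng.normal(size=(n_proxy, p))
    y_proxy = X_proxy @ beta_proxy_star + noise * rng.normal(size=n_proxy)
    X_true = rng.normal(size=(n_true, p))  # fewer true samples than features
    y_true = X_true @ beta_true_star + noise * rng.normal(size=n_true)

    beta_hat, beta_proxy_hat, _ = two_step_estimate(
        X_proxy, y_proxy, X_true, y_true, lam
    )
    err_proxy = np.linalg.norm(beta_proxy_hat - beta_true_star)
    err_two_step = np.linalg.norm(beta_hat - beta_true_star)
    return err_proxy, err_two_step


if __name__ == "__main__":
    err_proxy, err_two_step = simulate()
    print(f"proxy-only L2 error: {err_proxy:.3f}")
    print(f"two-step L2 error:   {err_two_step:.3f}")
```

In this setup the true data alone cannot identify all `p` coefficients because `n_true < p`, while the proxy-only fit inherits the full bias. The correction step only has to recover a few nonzero entries. Note that the target of the second step is only approximately sparse, since it also absorbs the small estimation error of the proxy fit.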

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.11097/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/1812.11097/full.md

## References

42 references — full list in the complete paper: https://tomesphere.com/paper/1812.11097/full.md

---
Source: https://tomesphere.com/paper/1812.11097