# Scalable Collaborative Targeted Learning for High-Dimensional Data

**Authors:** Cheng Ju, Susan Gruber, Samuel D. Lendle, Antoine Chambaz, Jessica M. Franklin, Richard Wyss, Sebastian Schneeweiss, Mark J. van der Laan

arXiv: 1703.02237 · 2017-03-08

## TL;DR

This paper introduces two scalable instantiations of collaborative targeted minimum loss-based estimation (C-TMLE) for high-dimensional data. They reduce time complexity in the number $p$ of covariates from $\mathcal{O}(p^2)$ to $\mathcal{O}(p)$ and work well in simulation studies.

## Contribution

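The paper proposes a pre-ordered C-TMLE algorithm whose time complexity is $\mathcal{O}(p)$ in the number $p$ of covariates, as opposed to $\mathcal{O}(p^2)$ for the original greedy forward stepwise algorithm. It also proposes two pre-ordering strategies, a rule of thumb for developing others, and SL-C-TMLE, an $\mathcal{O}(p)$ algorithm that makes a data-driven choice of the better pre-ordering strategy.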
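The gap between linear and quadratic cost can be illustrated by counting candidate evaluations. The following is a minimal sketch, not the paper's implementation: `greedy_forward_order`, `preordered_order`, `loss`, and `rank_key` are hypothetical names, and `loss` is a toy stand-in for whatever criterion scores a candidate set of covariates. It only shows that a greedy forward search tries every remaining covariate at each step, while a pre-ordered search evaluates one nested candidate per step.

```python
from typing import Callable, Hashable, Iterable, List, Tuple

Loss = Callable[[List[Hashable]], float]


def greedy_forward_order(covariates: Iterable[Hashable],
                         loss: Loss) -> Tuple[List[Hashable], int]:
    """Greedy forward stepwise search (illustrative only).

    At each step, every remaining covariate is tried and the one giving
    the lowest loss is added. Returns the selection order and the number
    of loss evaluations, which is p + (p - 1) + ... + 1 = p(p + 1) / 2.
    """
    selected: List[Hashable] = []
    remaining = list(covariates)
    evaluations = 0
    while remaining:
        best, best_loss = None, None
        for c in remaining:
            current = loss(selected + [c])  # one candidate fit
            evaluations += 1
            if best_loss is None or current < best_loss:
                best, best_loss = c, current
        selected.append(best)
        remaining.remove(best)
    return selected, evaluations


def preordered_order(covariates: Iterable[Hashable],
                     rank_key: Callable[[Hashable], float],
                     loss: Loss) -> Tuple[List[Hashable], int]:
    """Pre-ordered search (illustrative only).

    Covariates are ranked once with `rank_key` (a hypothetical ordering
    rule), then nested prefixes are evaluated in that order. Returns the
    order and the number of loss evaluations, which is p.
    """
    ordered = sorted(covariates, key=rank_key)
    evaluations = 0
    for k in range(1, len(ordered) + 1):
        loss(ordered[:k])  # one candidate fit per prefix
        evaluations += 1
    return ordered, evaluations
```

With 10 covariates, the greedy search performs 55 evaluations and the pre-ordered search performs 10. The count here is in candidate evaluations only; the cost of each evaluation and of the initial ranking depends on the estimator and ordering rule actually used.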

## Key findings

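- In simulation studies on fully synthetic and partially synthetic data, the scalable C-TMLE and SL-C-TMLE algorithms work well.
- The algorithms were also compared in analyses of three real, large electronic health databases.
- In all analyses involving electronic health databases, the original greedy C-TMLE algorithm is unacceptably slow.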

## Abstract

Robust inference of a low-dimensional parameter in a large semi-parametric model relies on external estimators of infinite-dimensional features of the distribution of the data. Typically, only one of the latter is optimized for the sake of constructing a well behaved estimator of the low-dimensional parameter of interest. Optimizing more than one of them for the sake of achieving a better bias-variance trade-off in the estimation of the parameter of interest is the core idea driving the general template of the collaborative targeted minimum loss-based estimation (C-TMLE) procedure. The original implementation/instantiation of the C-TMLE template can be presented as a greedy forward stepwise C-TMLE algorithm. It does not scale well when the number $p$ of covariates increases drastically. This motivates the introduction of a novel instantiation of the C-TMLE template where the covariates are pre-ordered. Its time complexity is $\mathcal{O}(p)$ as opposed to the original $\mathcal{O}(p^2)$, a remarkable gain. We propose two pre-ordering strategies and suggest a rule of thumb to develop other meaningful strategies. Because it is usually unclear a priori which pre-ordering strategy to choose, we also introduce another implementation/instantiation called SL-C-TMLE algorithm that enables the data-driven choice of the better pre-ordering strategy given the problem at hand. Its time complexity is $\mathcal{O}(p)$ as well. The computational burden and relative performance of these algorithms were compared in simulation studies involving fully synthetic data or partially synthetic data based on a real world large electronic health database; and in analyses of three real, large electronic health databases. In all analyses involving electronic health databases, the greedy C-TMLE algorithm is unacceptably slow. Simulation studies indicate our scalable C-TMLE and SL-C-TMLE algorithms work well.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.02237/full.md

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/1703.02237/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/1703.02237/full.md

---
Source: https://tomesphere.com/paper/1703.02237