Stepwise regression for unsupervised learning

Jonathan Landy

arXiv:1706.03265·cs.LG·June 13, 2017

Stepwise regression for unsupervised learning

Jonathan Landy

PDF

Open Access

TL;DR

This paper extends the fast stepwise linear regression algorithm to unsupervised learning, enabling efficient identification of representative feature subsets for dimensionality reduction in large datasets.

Contribution

It introduces an unsupervised version of stepwise regression with optimized runtime, pseudocode for various feature selection strategies, and applications to financial data analysis.

Findings

01

The algorithm achieves $O(n^2 m)$ runtime, matching a single regression.

02

It effectively identifies representative stocks in financial markets.

03

Forward and reverse algorithms produce inverted feature orderings in weakly-correlated data.

Abstract

I consider unsupervised extensions of the fast stepwise linear regression algorithm \cite{efroymson1960multiple}. These extensions allow one to efficiently identify highly-representative feature variable subsets within a given set of jointly distributed variables. This in turn allows for the efficient dimensional reduction of large data sets via the removal of redundant features. Fast search is effected here through the avoidance of repeat computations across trial fits, allowing for a full representative-importance ranking of a set of feature variables to be carried out in $O (n^{2} m)$ time, where $n$ is the number of variables and $m$ is the number of data samples available. This runtime complexity matches that needed to carry out a single regression and is $O (n^{2})$ faster than that of naive implementations. I present pseudocode suitable for efficient forward, reverse, and…

Equations68

F (s) \equiv X^{T} - P^{(s)} X^{T}^{2},

F (s) \equiv X^{T} - P^{(s)} X^{T}^{2},

P^{(s)} = \tilde{X}^{T} (\tilde{X} \tilde{X}^{T})^{- 1} \tilde{X} .

P^{(s)} = \tilde{X}^{T} (\tilde{X} \tilde{X}^{T})^{- 1} \tilde{X} .

d = argmin_{j} F (s + {j}) .

d = argmin_{j} F (s + {j}) .

d = argmin_{j} F (s - {j}) .

d = argmin_{j} F (s - {j}) .

M_{ij} \equiv X_{i} \cdot X_{j}^{T} .

M_{ij} \equiv X_{i} \cdot X_{j}^{T} .

F_{j} = 1 - M_{j} \cdot R \cdot M_{j}^{T} .

F_{j} = 1 - M_{j} \cdot R \cdot M_{j}^{T} .

R_{ij} = ⎩ ⎨ ⎧ 0, (\tilde{X} \tilde{X}^{T})_{ij}^{- 1}, if i or j \neq \in s else .

R_{ij} = ⎩ ⎨ ⎧ 0, (\tilde{X} \tilde{X}^{T})_{ij}^{- 1}, if i or j \neq \in s else .

U_{ij} \equiv M_{i} \cdot R \cdot M_{j}^{T},

U_{ij} \equiv M_{i} \cdot R \cdot M_{j}^{T},

D_{ij} \equiv R_{i} \cdot M_{j} .

D_{ij} \equiv R_{i} \cdot M_{j} .

R \leftarrow R - \frac{R _{d} \otimes R _{d}}{R _{dd}},

R \leftarrow R - \frac{R _{d} \otimes R _{d}}{R _{dd}},

R \leftarrow R + \frac{( D _{d}^{T} - e _{d} ) \otimes ( D _{d}^{T} - e _{d} )}{1 - U _{dd}},

R \leftarrow R + \frac{( D _{d}^{T} - e _{d} ) \otimes ( D _{d}^{T} - e _{d} )}{1 - U _{dd}},

U

U

D

U

U

D

F_{j}

F_{j}

F \leftarrow F + \frac{∥ D _{d} ∥ ^{2}}{R _{dd}} .

F \leftarrow F + \frac{∥ D _{d} ∥ ^{2}}{R _{dd}} .

F_{j}

F_{j}

F \leftarrow F - \frac{∥ U _{d} - M _{d} ∥ ^{2}}{1 - U _{dd}} .

F \leftarrow F - \frac{∥ U _{d} - M _{d} ∥ ^{2}}{1 - U _{dd}} .

X_{ij} \equiv \frac{p _{i, j} - p _{i, j - 1}}{p _{i, j - 1}} .

X_{ij} \equiv \frac{p _{i, j} - p _{i, j - 1}}{p _{i, j - 1}} .

(M + u v^{T})^{- 1} =

(M + u v^{T})^{- 1} =

M = M_{11} 3 M_{n 1} \dots \dots M_{1 n} M_{nn},

M = M_{11} 3 M_{n 1} \dots \dots M_{1 n} M_{nn},

u = - 00201 M_{1 n} M_{2 n} M_{(n - 1) n} 0,, v = M_{1 n} M_{2 n} 2 M_{(n - 1) n} 0 0001,

u = - 00201 M_{1 n} M_{2 n} M_{(n - 1) n} 0,, v = M_{1 n} M_{2 n} 2 M_{(n - 1) n} 0 0001,

M + u v^{T} = M_{11} M_{21} 4 M_{(n - 1) 1} 0 \dots \dots \dots \dots M_{1 (n - 1)} M_{2 (n - 1)} M_{(n - 1) (n - 1)} 0 0001 .

M + u v^{T} = M_{11} M_{21} 4 M_{(n - 1) 1} 0 \dots \dots \dots \dots M_{1 (n - 1)} M_{2 (n - 1)} M_{(n - 1) (n - 1)} 0 0001 .

M^{- 1} u = - M_{1 n}^{- 1} - M_{2 n}^{- 1} 2 - M_{(n - 1) n}^{- 1} - M_{nn}^{- 1} M_{1 n}^{- 1} M_{2 n}^{- 1} M_{(n - 1) n}^{- 1} M_{nn}^{- 1} - 1,

M^{- 1} u = - M_{1 n}^{- 1} - M_{2 n}^{- 1} 2 - M_{(n - 1) n}^{- 1} - M_{nn}^{- 1} M_{1 n}^{- 1} M_{2 n}^{- 1} M_{(n - 1) n}^{- 1} M_{nn}^{- 1} - 1,

v^{T} M^{- 1} = [- M_{1 n}^{- 1} M_{1 n}^{- 1} \dots \dots - M_{(n - 1) n}^{- 1} M_{(n - 1) n}^{- 1} - M_{nn}^{- 1} + 1 M_{nn}^{- 1}] .

v^{T} M^{- 1} = [- M_{1 n}^{- 1} M_{1 n}^{- 1} \dots \dots - M_{(n - 1) n}^{- 1} M_{(n - 1) n}^{- 1} - M_{nn}^{- 1} + 1 M_{nn}^{- 1}] .

(I + v^{T} M^{- 1} u)^{- 1} = [11 1 - \frac{1}{( M ^{- 1} ) _{nn}} 1] .

(I + v^{T} M^{- 1} u)^{- 1} = [11 1 - \frac{1}{( M ^{- 1} ) _{nn}} 1] .

(M + u v^{T})^{- 1} = M^{- 1} - \frac{M _{n}^{- 1} \otimes M _{n}^{- 1}}{M _{nn}^{- 1}} + e_{nn},

(M + u v^{T})^{- 1} = M^{- 1} - \frac{M _{n}^{- 1} \otimes M _{n}^{- 1}}{M _{nn}^{- 1}} + e_{nn},

M^{- 1} \leftarrow

M^{- 1} \leftarrow

μ^{- 1} = M^{- 1} + ϵ,

μ^{- 1} = M^{- 1} + ϵ,

μ

μ

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSparse and Compressive Sensing Techniques · Blind Source Separation Techniques · Neural Networks and Applications

Full text

Stepwise regression for unsupervised learning

Jonathan Landy The author is a member of the data-science group at Square Inc., which is located in San Francisco, CA. (email: [email protected])

Abstract

I consider unsupervised extensions of the fast stepwise linear regression algorithm [5]. These extensions allow one to efficiently identify highly-representative feature variable subsets within a given set of jointly distributed variables. This in turn allows for the efficient dimensional reduction of large data sets via the removal of redundant features. Fast search is effected here through the avoidance of repeat computations across trial fits, allowing for a full representative-importance ranking of a set of feature variables to be carried out in $O(n^{2}m)$ time, where $n$ is the number of variables and $m$ is the number of data samples available. This runtime complexity matches that needed to carry out a single regression and is $O(n^{2})$ faster than that of naive implementations. I present pseudocode suitable for efficient forward, reverse, and forward-reverse unsupervised feature selection. To illustrate the algorithm’s application, I apply it to the problem of identifying representative stocks within a given financial market index – a challenge relevant to the design of Exchange Traded Funds (ETFs). I also characterize the growth of numerical error with iteration step in these algorithms, and finally demonstrate and rationalize the observation that the forward and reverse algorithms return exactly inverted feature orderings in the weakly-correlated feature set regime.

I Introduction

Raw data sets sometimes contain a significant number of redundant or nearly-redundant measurement features – feature variables that can be accurately predicted given the values taken on by each of the others. The pruning of redundant features from a large data set can result in a number of desirable consequences. Most obviously, their removal allows for a smaller, compressed representation of the original data set. Redundant feature removal can also result in improved interpretability of data sets as well as reduced training time, size, and over-fitting of models built on top of them [3, 8]. These benefits motivate the search for general, automated feature selection algorithms that can be applied at scale.

The fundamental challenge associated with selecting optimal feature subsets is that the number of candidates of a given size can be very large – given a set of $n$ feature vectors, there are $n\choose k$ vector subsets of size $k$ . This combinatorial scaling implies that we must often trade-off guaranteed optimality for improved run time in our search strategies. In the supervised regime – where one attempts to obtain a best fit to a target vector $y$ using at most $k$ of the available predictors - two linear regression-based approaches have proven to provide an excellent balance of speed and accuracy: First, a greedy, step-wise regression method has been introduced that can be made to run very quickly via certain matrix inverse update identities [5]. In the forward variant of this algorithm, features are added to the predictor set one at a time. At each step, the feature added is that which results in the best improvement to the regression on the target variable. In this way, an approximate importance ranking of the full set of features is obtained – with the first feature added considered most important, etc. Dropping all but the top $k$ features provides an approximate solution to the optimal selection problem. The second method that has been pursued is an approximate mapping onto a convex optimization problem [3]. This approach does not provide a ranking of the features, but instead simply outputs the feature subset that minimizes the sum of the regression squared error and a sparsity-enforcing regularization term. Though both algorithms can be carried out in polynomial time, each is known to provide globally optimal solutions under certain conditions – a boon given that optimal subset selection is an NP-hard task, in general [12].

In this paper, I consider the challenge of feature selection in the unsupervised regime – that where one wishes to identify a representative subset of features in a given data set, not having any particular target variable in mind. A natural strategy for addressing this challenge is to attempt a translation of the successful linear methods noted above to the unsupervised task of feature set partitioning. Here, the goal is to segment a given feature set into a retained set and a rejected set, with the optimal target rejection set of a given size being that having the least squared error when projected onto its complement – i.e., that most accurately approximated by a linear fit constructed from the features in the corresponding retained set. Extensions of the convex mapping approach that address this problem have been detailed elsewhere [11]. However, prior consideration of the unsupervised stepwise approach appears to be limited to a forward selection variant [6]. Here, I show that the original supervised formalism [5] can be directly extended to the unsupervised regime, and I use this approach to write down and characterize simple implementations of the unsupervised forward, reverse, and hybrid forward-reverse algorithms. Each provides an efficient routine, specialized for optimization in a different limit: The forward algorithm provides near optimal solutions when the subset selected is small, while the reverse algorithm performs best in the opposite limit. Finally, the hybrid forward-reverse algorithm can provide significantly better results than the other two when a subset of intermediate size is selected. I illustrate the application of the algorithms here and also present pseudo-code implementations for each.

The remainder of the paper proceeds as follows: In Sec. II, I (i) formally define the selection problem considered here and outline the greedy strategies for addressing it, (ii) discuss a naive implementation of the greedy strategies, and (iii) provide an overview of their efficient implementations. In Sec. III, I present pseudocode for the forward, reverse, and forward-reverse algorithms. In Sec. IV, I illustrate application of the algorithms through consideration of a stock market fluctuation analysis. Sec. V contains a brief discussion of the results. I also include three appendices. Appendix A contains derivations relating to the update procedure. Appendix B contains an analysis of the growth in numerical round off error in the greedy algorithms. Finally, Appendix C relates to the relationship between the forward and reverse algorithms – here, I demonstrate that they return exactly opposite orderings in certain limits.

II Problem definition and algorithm overview

Problem statement and greedy solution strategies

The unsupervised feature selection problem that I consider here centers around the characterization of a given $n\times m$ data matrix $X$ , with element $X_{jk}$ holding the value of the $j$ -th feature in the $k$ -th data instance. For simplicity, I assume throughout this paper that the rows of $X$ are linearly independent, which requires $m>n$ , and I also assume that the features have been shifted and normalized to each have mean [math] and variance $1$ . The features can be formally considered to be jointly distributed random variables. To score the representative quality of a given feature subset $s$ , I take the squared residual error associated with projecting $X$ onto the rows of $s$ – i.e., the squared error of the best linear fit to the whole data set using only the features within $s$ . This is,

[TABLE]

where $\mathcal{P}^{(s)}$ is the projection operator [1],

[TABLE]

Here, $\tilde{X}$ is defined analogously to $X$ , but with the rows outside of $s$ removed. Notice that the inverse above exists, given the assumption of linear independence, and that the common normalization of the features results in each being weighted equally in the projection error.

Although linear fits are relatively easy to carry out, global minimization of (1) is challenging due to the combinatorial number of candidate subsets. A greedy search procedure can be applied to obtain low-cost subsets in an efficient manner. This approach centers around iterative addition or removal of features from $s$ , one at a time, at each step selecting that single feature that results in the best outcome. In the forward process, one begins with no features in $s$ . Features are then iteratively added in, one at a time, always selecting that feature that reduces $\mathcal{F}$ the most. That is, at a given step one identifies and adds to $s$ that feature $d$ satisfying

[TABLE]

In the reverse process, one begins with all features in $s$ . Features are then iteratively removed from $s$ , one at a time, at each step selecting and removing that feature $d$ satisfying

[TABLE]

The forward-reverse greedy algorithm is a hybrid of the two above: One sometimes adds features to $s$ and sometimes removes them. Various protocols can be considered for determining when a step forward is taken, rather than a step backward – I discuss and outline one choice below.

The change in $\mathcal{F}$ that occurs in a given step of these algorithms can be decomposed into two parts: The change in the projection error of $d$ and the change in the projection error associated with the remaining features outside of $s$ – when $d$ is either added or removed from $s$ , the fit to these other features may improve or worsen. Both contributions need to be considered in order to select the optimal feature at each step.

The forward and reverse greedy algorithms are very similar to gradient descent, in that one always takes a step in the direction of least cost. The subsets obtained in this way provide approximate minimizations of (1). These can be obtained relatively quickly because the greedy search protocol is restricted. For example, in the forward approach, once a feature $d$ is added to $s$ , the algorithm will only explore predictor subsets that include it from that point forward. Although restricted, the greedy strategies contain mechanisms for identifying competitive subsets: If clusters of the features exist, with each feature in a given cluster very similar, the algorithms will generally work towards identifying an $s$ that contains a single representative from each. For example, in the reverse process, pruning features from a cluster of very similar features will come at very little cost, since each can be well-predicted from each of the others. However, a significant cost would be incurred upon removal of the last feature of a cluster from $s$ , as doing so would simultaneously raise the loss on the full similar subset. Because of this, one can expect the greedy pruning process to effectively work towards selecting single representative features from each well-defined cluster that appears in the predictor set – an approach that intuitively suggests we can expect well-optimized feature orderings. Note that the forward-reverse hybrid approach allows for a more exhaustive search, and for this reason often identifies lower cost subsets of a given size. However, these improvements come at the cost of increased run time.

Naive greedy implementations

To initiate the reverse greedy process described above, the first step one must take is to determine which of the $n$ features should be removed from $s$ first. This is that feature that is most easily predicted from each of the others. To identify this feature, a natural first approach would be to develop individual linear fits to each of the $n$ separate features, in each case using the remaining $(n-1)$ features as predictor variables. For example, one would first fit the feature $x_{1}$ using the variables $\{x_{2},x_{3},\ldots,x_{n}\}$ . Next, one would fit $x_{2}$ , using the predictors $\{x_{1},x_{3},\ldots,x_{n}\}$ , etc. The feature whose regression returned the smallest projection error would then be that most accurately predicted by each of the other features. With this feature identified, it can then be pruned from $s$ . In the next iteration of the process, one additional feature must be removed from $s$ . This will be that satisfying (4), leaving the projection error on the full pruned set – now of size two – minimized. The process would then continue from there.

The problem with this naive implementation strategy is that it will run very slowly at large $n$ : Evaluation of $\mathcal{F}$ for a single subset requires evaluation of the projection operator (2), which requires $O(n^{2}m)$ computations. Each iteration of the algorithm requires $O(n)$ candidate subsets to be considered, which will therefore require $O(n)\times O(n^{2}m)=O(n^{3}m)$ time. The full runtime complexity required to obtain a ranking of the full set of $n$ features then scales like $O(n^{4}m)$ . This scaling precludes applications of the naive approach at large $n$ .

Efficient greedy implementations

The fast, stepwise linear regression algorithm [5] is typically applied to the problem of minimizing the squared regression error of a fixed target variable. However, I show here that the algorithm can be simply extended to also allow for the efficient implementation of the greedy minimization of (1). The crux of this algorithm is an application of the Woodbury matrix inverse update formula [15]. This identity allows one to efficiently update the matrix inverse that appears in (1) as features are either added or subtracted from $s$ – thereby avoiding a costly, fresh inverse reevaluation with each iteration. The result is a significant speedup.

To simplify the presentation of the procedure, it is convenient to introduce a few matrices related to the feature correlation matrix, $M$ , which has $i-j$ component

[TABLE]

Recalling the assumed normalization of the features, and writing $\mathcal{F}_{j}$ for the squared projection error of the $j$ -th feature, combining (1), (2), and (5) gives

[TABLE]

Here, $M_{j}$ is the $j$ -th row of the original correlation matrix and $R$ is the matrix having elements

[TABLE]

Notice that projecting $R$ onto $s$ gives the inverse of $\tilde{M}$ , the projection of $M$ onto $s$ . Working with $R$ rather than $\tilde{M}^{-1}$ allows one to update this matrix in place – i.e., it avoids the need to copy the inverse over to one having a different dimension with each iteration. The two final matrices that must be introduced are $U$ and $D$ . These have elements

[TABLE]

and

[TABLE]

I outline a derivation for the Woodbury update expressions for (7) in appendix A. These read as follows: When feature $d$ is removed from $s$ , $R$ is updated to

[TABLE]

where $\otimes$ represents the outer product and $R_{d}$ is the $d$ -th row of $R$ . Similarly, when feature $d$ is added to $s$ , $R$ is updated to

[TABLE]

where $e_{d}$ is the vector that is $1$ at index $d$ and [math] elsewhere. The update formulas for $U$ and $D$ are obtained by combining their definitions with the update formulas for $R$ : If feature $d$ is removed from $s$ , combining (8), (9), and (10) gives

[TABLE]

Similarly, when feature $d$ is added to $s$ , combining (8), (9), and (11) gives

[TABLE]

The matrices $M$ , $R$ , $U$ , and $D$ above allow for a speedup in the greedy selection process because they contain all of the information needed to determine the change in projection error associated with single variable adjustments to $s$ . For example, if feature $d$ is removed from $s$ , combining (6) and (10) gives

[TABLE]

Summing over all $j$ gives the full projection error update that results from the removal of feature $d$ ,

[TABLE]

If $R$ and $D$ are evaluated and updated as the algorithm is carried out, the right side of this equation can be evaluated to check how $\mathcal{F}$ would change with the removal of $d$ – all without actually evaluating the new projection operator. Further, this check simply involves the evaluation of a dot product and can be carried out in $O(n)$ time – a significant speedup over the $O(n^{2}m)$ time taken in the naive approach to score a given candidate feature. Scoring all features then requires only $O(n^{2})$ time per iteration. Once the optimal feature is identified, it can be removed from $s$ , and the updates (10), (12) and (13) can be applied. Each of the three updates also requires only $O(n^{2})$ computations. Once these have been carried out, the process can be iterated from there.

Similarly, if in the forward process the feature $d$ is added to $s$ , combining (6) and (11) gives

[TABLE]

Summing over all $j$ gives the full cost function increase,

[TABLE]

Using this expression, the optimal feature for addition can again be evaluated in $O(n^{2})$ time. Once identified, the updates (11), (14), and (15) can be carried out, and the process continued. Again, a full iteration only requires $O(n^{2})$ computations.

In summary then, the processes above each allow for either a forward or a reverse step to be carried out in $O(n^{2})$ time. A full ranking of the $n$ features can then be carried out in $O(n^{3})$ time. Because $m>n$ , this is necessarily smaller than initial the time required to evaluate $M$ , $O(n^{2}m)$ time. This initial computation sets the scaling for a full run of the fast forward and reverse greedy algorithms. The result is a process that is a factor of $O(n^{2})$ faster than the naive implementations discussed above. For example, if $n\sim O(10^{3})$ , this means that the inverse update procedure provides an $O(10^{6})$ speedup, which is quite significant.

The matrices $R$ , $U$ , and $D$ – and their update formulas presented above – are identical to those evaluated in the supervised stepwise linear regression algorithm [5]. The central difference between the supervised algorithm and those considered here is the cost function that determines the optimal feature for selection at each step. In the supervised case, the regression error on the target variable defines the cost, whereas in the unsupervised case, the cost is given by the squared regression error on all features not in $s$ – (17) in the reverse case and (19) in the forward case. The results above show that this extension can be carried out without increasing the runtime complexity of the algorithms.

The next section contains summary pseudocode for each algorithm.

III Algorithm pseudocode

III-A Forward process

The forward selection algorithm repeatedly applies (19) to evaluate the cost of selecting any given feature for inclusion in $s$ at a given step. Once the optimal feature for an iteration is identified, it is added to $s$ , the current squared projection error (1) is evaluated and stored, and the matrices $R$ and $U$ are updated – again, using the forward update expressions (11) and (14). The summary pseudocode for this process is given in Algorithm 1. As written, this accepts a data matrix $X$ and returns a list of the features, ordered as selected by the algorithm. The algorithm also returns a list of the squared projection error realized at each iteration of the algorithm.

III-B Reverse process

This reverse procedure is very similar to the forward selection process, but begins with all features included. The algorithm then iteratively removes features from $s$ one at a time. The optimal feature for removal is identified using (17). Once this is found, the selected feature set, cost function, and the matrices $R$ and $D$ are updated – the latter using (10) and (13). The summary pseudo-code for the reverse process is given in Algorithm 2. As written, the algorithm returns a list of the features in the order in which they were pruned from $s$ , as well as a list of the cost function that results at each step of the algorithm.

III-C Forward-Reverse selection

Many variants of the forward-reverse selection process can be considered. In general, each presents a strategy that determines when a forward step is taken and when a reverse step is taken in the greedy process. Allowing for removal of features from the retained set $s$ is useful, as a feature that can be very useful when very few features are included may represent a poor feature for inclusion once a number of others have been added to $s$ .

One variant of the hybrid algorithm in presented in Algorithm 3. As written, this approach always proceeds by taking $steps$ forward steps followed by $steps-1$ backward steps. In this way, the algorithm proceeds to move one effective step forward with each iteration of the algorithm, but does so in a seesaw fashion – allowing for features to be included in the forward process but then removed if it turns out they are suboptimal once others are included. In practice, I have found setting $steps>2$ tends to produce smoother results than those obtained at $steps=2$ , in the sense that the best cost (1) identified tends to vary more smoothly with size $|s|$ . Increasing the $steps$ parameter requires increased computational time.

IV Selecting representative stocks

Stock Index Exchange Traded Funds (ETFs) are funds that own a portfolio of stocks that together are selected so as to track a complete index, or sector of the market. These funds divide themselves into shares, allowing for individuals to buy and sell partial ownership of the funds through brokers, much like any other public company stock. The value of an ETF share is locked to the underlying value of its target index through portfolio transparency, which enables financial institutions to carry out arbitrage conversions between an ETF’s shares and the individual stocks comprising its portfolio. Such conversions will be carried out whenever sufficiently large price discrepancies arise between the two, resulting in price discrepancy suppression. Unfortunately, these arbitrage trades become more costly to carry out as the number of stocks in an ETF grows, resulting in a breakdown of the price-locking mechanism. To mitigate this issue, ETFs targeting large indices are now often developed through sampling: Rather than purchase each stock in the index, the ETF will instead develop a portfolio comprised of only a representative subset of the stocks in the index [7].

The unsupervised stepwise selection algorithms provide a natural method for developing sampled portfolios: To characterize the performance of an index of $n$ stocks over $m$ days, one can consider a data matrix $X$ containing the trace of each stock’s daily percentage change in price111The daily percentage change in price is a summary statistic motivated by the logarithmic random walk model of stock price dynamics. In this model, stocks are modeled to increase in price at an average multiplicative rate, but are subject to random fluctuations that are also multiplicative [2, 10].,

[TABLE]

Here, the row index $i$ identifies the stock and the column index $j$ the date. Plugging (20) into the unsupervised selection algorithms, one obtains subset selections $s$ of the rows that approximately minimize the cost function (1). The small $|s|$ subsets obtained in this way provide excellent candidates for inclusion in an index ETF, in that they will each contain stock selections that can be used to develop linear fits to each of the other stocks in $X$ with relatively high accuracy. The greedy selection processes will naturally work towards selecting highly-representative subsets characterized my minimal redundancy, resulting in well-diversified portfolios.

To illustrate the above strategy, I have applied the unsupervised selection algorithms to characterize the performance of $n=200$ of the highest market-cap health-care related stocks that traded on NASDAQ throughout the $m=252$ trading days of 2016222The 200 health-sector stock tickers included in the analysis are: JNJ, PFE, NVS, MRK, UNH, AMGN, MMM, SNY, MDT, GSK, ABBV, CELG, NVO, WBA, LLY, BMY, GILD, CVS, AGN, AZN, ABT, BIIB, SYK, ANTM, AET, ESRX, CI, BDX, REGN, BSX, TEVA, HCA, HUM, ISRG, VRTX, MCK, BAX, ZTS, LUX, ALXN, FMS, INCY, ZBH, CAH, EW, MYL, Q, BCR, ABC, BMRN, LH, SNN, XRAY, DGX, SHPG, IDXX, HSIC, DVA, ANTX, CNC, HOLX, UHS, RMD, COO, PRGO, ALGN, SGEN, JAZZ, TFX, ALKS, VAR, TSRO, RDY, WCG, QGEN, EVHC, DXCM, EXEL, STE, MD, HLF, IONS, ABMD, UTHR, MASI, TARO, HRC, MNK, NBIX, KITE, ICLR, ALNY, ALR, CRL, PDCO, OPK, AKRX, RAD, PRAH, ACAD, HLS, NUVA, TECH, ACHC, MDCO, CTLT, BLUE, HCSG, PRXL, WMGI, CBPO, CMD, CHE, GMED, PBH, ICUI, NUS, VRX, GRFS, PEN, NKTR, MOH, SAGE, ICPT, EXAS, GWPH, MSA, RARE, JUNO, NVRO, LIVN, HZNP, LPNT, CLVS, BKD, PODD, XON, INCR, IRWD, NEOG, AXON, ENDP, ZLTQ, LGND, PTLA, HAE, OMI, TBPH, PRTA, AGIO, CBM, NXTM, SEM, PCRX, SRPT, AMED, HYH, HALO, FGEN, MGLN, RDUS, ONCE, INGN, SUPN, GKOS, LXRX, THC, BPMC, AAAP, MMSI, TDOC, ARRY, AERI, PBYI, INVA, EGRX, DERM, BABY, CNMD, MDXG, GBT, LOXO, MYGN, SPNC, RGEN, EBS, FMI, INSM, TVTY, NHC, PAHC, XNCR, XLRN, GHDX, DPLO, FOLD, ALDR, CORT, NRCIB, NVCR. Historical closing values were obtained from http://real-chart.finance.yahoo.com.. The results of this analysis are summarized in Fig. 1. In the main plot of Fig. 1a, I show the squared projection error that results when only the top $x$ stocks are retained, as output by the UFS model. This shows that if one selects only the top 10, 25, 50, or 100 stocks ranked by this model, 41%, 53%, 68%, and 83% of the variance within the full 200 stock index is captured, respectively. Elbows appear in this plot around the $10$ and $150$ stock selection marks. This means that the improvement in variance captured is somewhat incremental after the first $10$ stocks, and is minimal in the last $50$ . The inset to Fig. 1a shows a plot of the improvement in $\mathcal{F}$ that results if we switch from the UFS algorithm to either the UFRS (dashed) or URS algorithms (solid) at different subset sizes. As expected, the URS algorithm performs better at large $|s|$ , and the hybrid UFRS algorithm performs better throughout.

The results in Fig. 1b summarize the results that are obtained if one applies the algorithms to a five-day moving average of (20). This smoothing serves to reduce short time-scale noise that is often fairly idiosyncratic. In this case, retaining the top 10, 25, 50, or 100 stocks from the UFS procedure results in capturing 47%, 66%, 84%, and 96% of the full index variance, respectively. Averaging over longer periods results in further noise reduction, allowing for even better performance at small $|s|$ . For example, averaging over a 30 day window, the top 10, 25, 50, or 100 stocks selected by the forward procedure capture 80%, 92%, 97%, 99% of the index variance, respectively. An elbow in the plot (not shown) now appears around the 25 stock mark, motivating a portfolio selection of this size.

The example of sampled, representative portfolio development illustrates well the benefits one receives through application of the unsupervised selection procedures: Primarily, these procedures allow for an automated dimensional reduction to be effected, reducing the number of features needed to accurately characterize the behavior of a system. Automated feature-engineering-based dimensional reduction methods are also available for this purpose, but these are not always appropriate. For example, in the case of ETF development, we explicitly require the selection of a representative subset of the stocks – the identification of a set of informative, hybridized stocks that include each of those present would provide little utility. The analysis considered here has also demonstrated the potential for improved fits to be obtained through use of the URS and UFRS algorithms in different limits.

V Discussion

In this paper, I have presented unsupervised extensions of the fast stepwise linear regression algorithm. These extensions allow one to quickly identify feature subsets that are highly representative of the whole. In colloquial terms, one can think of these algorithms as feature selection analogs of the popular principal component analysis (PCA) algorithm. Whereas PCA identifies the unrestricted linear combinations of features that capture the most variance within a data set, the algorithms presented here identify subsets of the original features that do so. This approach retains the original interpretability of the resulting features within the dimensionally-reduced data set, which can be important for certain applications. Like PCA, the greedy stepwise algorithms center around consideration of the feature set correlation matrix. The number of computations required to evaluate this correlation matrix sets the shared runtime for each, which is quadratic in the number of features and linear in the number of data samples available. This efficiency makes the fast stepwise algorithms compelling candidates for feature selection tasks at scale.

I discuss two interesting properties relating to these algorithms in the appendices. In Appendix B, I have shown that the error in the matrices determining which feature should next be pruned or added to a predictor set grows slowly with iteration – this error scales like the square root of the iteration index. This slow growth of error allows one to work with relatively low numerical precision in many applications, which can result in both reduced run-times and reduced memory requirements. Second, in Appendix C, I have shown that the feature orderings returned by the forward and reverse algorithms are exactly opposite whenever the features present are all the same, up to some fluctuations that are only weakly correlated. The result is somewhat surprising, but represents the correct greedy behavior for the optimization of (1) in the highly-redundant limit. When the features present are not weakly correlated, or are not all highly-redundant, the relationship between the two protocols is certainly more complex.

An alternative formalism for carrying out unsupervised forward stepwise regression was introduced in [6]. This prior work also introduced a distributed implementation of the forward algorithm that allows for applications to very large data sets. It would be useful to develop a distributed version of the formalism considered here, or alternatively to extend the formalism of [6] to support the reverse and hybrid forward-reverse protocols. One benefit favoring the formalism considered here is that it allows for the possibility of quick development of high quality implementations via simple extensions of already existing supervised implementations.

Appendix A Correlation matrix inverse updates

Efficient stepwise regression centers around fast updating of the correlation matrix inverse as the predictor set considered is iteratively altered: Typically, an $n\times n$ matrix requires $O(n^{3})$ operations to invert, but using the update approach, the correlation matrix inverse can be corrected at each step of these algorithms in only $O(n^{2})$ time [5]. For completeness, I present a derivation of the update formulas needed to do this here. This is based upon application of the Woodbury inverse update formula [15]: Given an initial $n\times n$ matrix $M$ and a second $n\times n$ matrix $uv^{T}$ – with $u$ and $v$ both $n\times k$ matrices – the formula states

[TABLE]

The correctness of this expression can be confirmed through direct multiplication.

To apply (A) here, I take $M$ to be our correlation matrix,

[TABLE]

with $M_{ij}$ given by (5). Writing

[TABLE]

one obtains

[TABLE]

The upper-left sub-matrix here is the correlation matrix of the first $(n-1)$ features. Its inverse can be obtained from the right side of (A). To evaluate this expression, first note that the definition of $M^{-1}$ gives

[TABLE]

and

[TABLE]

Similarly, we have

[TABLE]

Using these last three lines, and plugging into (A) gives

[TABLE]

where $M^{-1}_{n}$ is the $n$ -th row of $M^{-1}$ , $\otimes$ represents the outer product, and $e_{nn}$ is the $n\times n$ matrix has a $1$ in the $n$ -th diagonal and is zero elsewhere. The components outside of $n$ give the inverse of the correlation matrix of the first $n-1$ features. Note that (28) can be evaluated in $O(n^{2})$ time, given knowledge of $M^{-1}$ , the correlation matrix of the first $n$ components.

One can obtain the inverse update formula that applies when a feature is added to the retained set by reversing the above steps. This is done by changing the sign of $u$ in (23). Carrying out similar algebra to that above gives the update for adding feature $n$ back to $s$ ,

[TABLE]

Here, $e_{n}$ is the vector that is zero everywhere except index $n$ where it takes value $1$ , $M^{-1}$ is now the inverse correlation matrix of the first $n-1$ features, and $M$ is the full correlation matrix. These results are equivalent to (10) and (11).

Appendix B Error propagation

A potential concern associated with applying iterative algorithms is the possibility of compounding growth of numerical error with each iteration. However, I show here that error grows slowly with iteration count in the unsupervised selection algorithms. To illustrate, consider the situation where the reverse process is initialized with a numerical approximation to the true inverse of $M$ given by

[TABLE]

where $\epsilon$ is a small error term. The error at later steps in the process will contain contributions from this initial error, as well as contributions from error introduced through later numerical roundoff mistakes. Consider first the idealized error that is a consequence of the initial error in (30) only. The correlation matrix associated with (30) is given by its inverse. To leading order in $\epsilon$ , this is

[TABLE]

a result following from the Taylor expansion of the inverse [1]. At a later step in the reverse process – again, ignoring round-off error for the moment – the inverse that obtained will be equal to the exact inverse of a feature-pruned version of (31). That is, a matrix $\mu^{\prime}$ obtained from (31) by simply setting some its rows and columns to zero. The exact inverse of this matrix will be given by

[TABLE]

This result shows that the leading error in the output of the idealized algorithm – i.e. that not subject to continued introduction of roundoff error – will remain small after any number of iterations. In other words, there is no build-up, or concentration of the initial error in the final components of the pruned correlation matrix inverse.

To relate (32) to the error of a realistic implementation, one needs to consider the numerical roundoff introduced with each iteration. To leading order, these errors are additive. Further, one can expect these errors to be relatively uncorrelated across iterations, implying that the net error growth after $k$ steps should scale like that for a random walk [13],

[TABLE]

Here, I now use $\epsilon$ for the scale of the round off error. Numerical experiments confirm this expected scaling – see Fig. 2. Similar square-root scaling also characterizes the error growth in the iterative $D$ and $U$ updates returned by the algorithm.

A consequence of the small error growth rate (33) is that accurate feature orderings can be obtained from the algorithm without requiring high precision variable types. This can be very helpful when working with systems having a great many features, as working with smaller precision floats can significantly reduce run time and memory requirements. In general, one can use (33) to select the precision level needed for a given procedure. For example, if $N\sim 10^{4}$ and the maximum desired error is $E=O(10^{-5})$ , then floating point arithmetic valid at $\epsilon\sim O(E/\sqrt{N})\sim O(10^{-7})$ will suffice. A trade-off between precision and speed can also be made: Smaller precision variables can be used if one is willing to occasionally reset the error in the inverse through direct – i.e. $O(n^{3})$ complexity – re-evaluation.

Appendix C Relation between forward and reverse solutions

In the supervised context – where one attempts to find the optimal feature subset of size $k$ for a linear fit to a target variable $y$ – both the forward and reverse greedy algorithms have been analyzed in detail. The reverse algorithm has been shown to always return the optimal feature subset, provided the target variable $y$ is not too far from the space spanned by the optimal $k$ regressors [9, 4]. The same conclusion does not hold for the forward algorithm. However, it has been shown that if an exact, sufficiently sparse representation of the target variable exists within the space spanned by the available features, the forward algorithm will find it [14, 3]. Both of these results are quite significant, given that the challenge of identifying the global optimum is in general an NP-hard task [12]. These results suggest that both the forward and reverse greedy approaches might also be expected to perform well in the unsupervised learning context, provided the features in question are relatively redundant – so that each can be well fit by some subset of the others.

An interesting limiting case is provided by the situation where all feature fluctuations are only weakly correlated. In this regime, one can show that the forward and reverse algorithms return exactly opposite feature orderings: Consider a correlation matrix of the form

[TABLE]

where $\epsilon$ is a small number and $N$ is a symmetric, fully off-diagonal matrix with elements of size $O(1)$ . If some subset $s$ of the features is selected, the regression error on the remaining features can be read off from (6). Expanding, this is

[TABLE]

In the first line here, $N$ is again the full matrix and $N^{-1}$ is the updated inverse, which is non-zero only over the columns within $s$ .

The result (35) shows that to leading order the projection error is the sum of the squared elements of the block off-diagonal elements of $M$ – those elements where one index is in $s$ and one is not. By symmetry – at the same order of approximation – this must also be the squared projection error observed when $s$ is projected onto its complement. If the forward process is run from this starting point – from the perspective of $s$ – the next feature selected will be that that returns the largest sum over the resulting block off-diagonal elements squared. But by the symmetry noted above, if the reverse process is run from this same starting point – from the perspective of the complement of $s$ – the same feature would also be selected for the next removal from the complement, as this element will also minimize the squared error of projecting $s$ onto its complement. In this way, the two algorithms will result in completely opposite orderings: The first feature selected by the forward process will also be the first pruned by the reverse process, etc. It’s important to stress that these opposite orderings are not a consequence of errors in the algorithms in question, but instead represent the correct greedy responses to weakly correlated features. A similar result will also occur if all features take the form $X_{i}=\overline{X}+\delta X_{i}$ , where $\overline{X}$ is a vector common to all features, but the $\delta X_{i}$ vectors are only weakly correlated. This is a model that may be appropriate for many feature clusters.

Acknowledgement

The author thanks J. Bergknoff, E. Jeske, P. Spanoudes, D. Thorpe, C. Yeh, and R. Zhou for helpful discussions relating to this work.

Bibliography15

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] T. M. Apostol. Calculus: Multivariable calculus and linear algebra, with applications to differential equations and probability, 1969.
2[2] L. Bachelier. Théorie de la spéculation . Gauthier-Villars, 1900.
3[3] A. M. Bruckstein, D. L. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM review , 51(1):34–81, 2009.
4[4] C. Couvreur and Y. Bresler. On the optimality of the backward greedy algorithm for the subset selection problem. SIAM Journal on Matrix Analysis and Applications , 21(3):797–808, 2000.
5[5] M. Efroymson. Multiple regression analysis. Mathematical methods for digital computers , 1:191–203, 1960.
6[6] A. K. Farahat, A. Ghodsi, and M. S. Kamel. Efficient greedy feature selection for unsupervised learning. Knowledge and information systems , 35(2):285–310, 2013.
7[7] S. L. Fuller. The evolution of actively managed exchange-traded funds. The Review of Securities and Commodities Regulation , 41(8):89–96, 2008.
8[8] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of machine learning research , 3(Mar):1157–1182, 2003.