The $L_2$ regularized log conditional likelihood for kernel logistic regression is given as $\mathcal{L}(\mathbf{w}) = \sum_i \log(\sigma(y_i\mathbf{K}_i^T\mathbf{w})) - \frac{\lambda}{2}\mathbf{w}^T\mathbf{K}\mathbf{w}$ where $\mathbf{K}$is a Gramm matrix composed of the products $\mathbf{K}(i,j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$, $\lambda$represents a weight on the regularization penalty, and $\sigma(a) = (1 + \exp(-a))^{-1}$is the sigmoid logistic function.
The IVM [1] iteratively increase the number of basis functions $K_i$ used in the kernel expansion until $\mathcal{L}$converges. Letting $k$index the iterations, this requires $mn$solutions to $n - m^{(k)}$smaller KLR problems of size on $O((m^{(k)} + 1)^2)$, where $m$is the number of basis functions used in the current expansion. One way to view the ridge penalty is as a MAP estimate using a Gaussian prior. That is $\mathcal{L}$represents an unscaled posterior probability, where $\lambda/2$is the precision parameter for a Gaussian prior on $\Phi\mathbf{w}$, centered at 0, where $\Phi$is the possibility infinite dimensional design matrix $(\phi(\mathbf{x}_1), \phi(\mathbf{x}_2), ..., \phi(\mathbf{x}_n))^T$.
Using this as a guide, it is possible to test $\mathbf{w}_i = 0$, by estimating the precision parameter under this prior under the empirical Bayes framework. Thus choosing $i$to be included in the next model will lower our uncertainty $\mathbf{w}_i$, providing a heuristic for the basis pursuit. The following reviews the method on which this heuristic is based.
One algorithm, the Relevance Vector Machine [2], instead of optimizing the MAP estimate, treats the precision of this Gaussian as a parameter to be learned, by iteratively optimizing $\mathbf{w}$and $\alpha$.
$\mathcal{L}(\alpha) = \log \int \sigma(\mathbf{Y}\mathbf{K}\mathbf{w}) N(\mathbf{w}|\alpha) d\mathbf{w}$
where $Y = \text{diag}(y)$, and $\mathcal{L}$ is the marginal log likelihood after summing out $\mathbf{w}$i.e., $p(\mathbf{y}|\alpha)$. $\mathcal{L}(\alpha)$ cannot be evaluated analytically, and so one way to accomplish this is using a Laplace approximation [2]. Another way is to approximate $\sigma(\mathbf{Y}\mathbf{K}\mathbf{w})$with a Gaussian distribution by linearizing around a mode in the distribution of $\mathbf{w}$, and let $\mathcal{L}$represent the convolution of two Gaussians. This is known as the sequential RVM [3].
The key to the Gaussian approximation used in the sequential RVM is in IRLS, in which successive weighted least squares problems are solved of the form
$(\mathbf{K}^T\mathbf{B}\mathbf{K}+ \mathbf{A})^{-1}\mathbf{K}^T\mathbf{B}\mathbf{z} $
$\mathbf{A}= \text{diag}((\alpha_i)_n)$,$\mathbf{B} = \text{diag}((\sigma(y_i\mathbf{K}_i^T\mathbf{w})(1 - \sigma(y_i\mathbf{K}_i^T\mathbf{w})))_n)$, and $z_i = \mathbf{K}_i^T\mathbf{w}+ \mathbf{B}^{-1}(y_i - \sigma(y_i\mathbf{K}_i\mathbf{w}))$
One way of thinking about the above is that it represents a linear model (of $\mathbf{z}$, in which the variance depends on $(y_i, \mathbf{x}_i)$. That is, $z_i = f(\mathbf{x}_i) + e_i(y_i, \mathbf{x}_i)$, where $e_i$ is uncorrelated with $e_j$, by assumption, and centered at 0. As such this represents a generalized linear model, with $p(\mathbf{z}|\mathbf{w}) \approx N(\mathbf{z}; \mathbf{K}\mathbf{w}, \mathbf{B}^{-1})$. Given this, the resulting approximate marginal likelihood is a convolution of two Gaussians:
$\int N(\mathbf{z};\mathbf{K}\mathbf{w},\mathbf{B}^{-1})N(\mathbf{w};0,\mathbf{A}^{-1})d\mathbf{w}= N(\mathbf{z}|0, \mathbf{C})$,
which is Gaussian again, with $\mathbf{C} = \mathbf{K}\mathbf{A}^{-1}\mathbf{K}^T + \mathbf{B}$.The approximate log marginal likelihood is then $-\frac{1}{2}[ n\log(2\pi) + \log(|\mathbf{C}|) + \mathbf{z}^T\mathbf{C}^{-1}\mathbf{z}]$
Noting that $\mathbf{C} = \mathbf{B}+ \mathbf{K}\mathbf{A}^{-1}\mathbf{K}^T$, this can be rewritten this as
| $ \mathbf{C}$ | $ = \mathbf{B}+ \alpha_i\mathbf{K}_i\mathbf{K}_i^T + \sum_{j \neq i} \alpha_j\mathbf{K}_j\mathbf{K}_j^T$ | (1) | 
| $= \mathbf{C}_{-i} + \alpha_i\mathbf{K}_i\mathbf{K}_i^T$ | (2) | 
Using Sherman-Morrison-Woodbury matrix inversion lemmas, it can be shown that
| $\vert\mathbf{C}\vert$ | $= \vert\mathbf{C}_{-i}\vert(1 + \alpha_i^{-1}\mathbf{K}_i^T\mathbf{C}_{-i}\mathbf{K}_i)$ | (3) | 
| $\mathbf{C}^{-1}$ | $= \mathbf{C}_{-i}^{-1} - \frac{\mathbf{C}_{-i}^{-1}\mathbf{K}_i\mathbf{K}_i^T\mathbf{C}_{-i}^{-1}}{\alpha_i + \mathbf{K}_i^T\mathbf{C}^{-1}_{-i}\mathbf{K}_i}$ | (4) | 
Thus the approximate log marginal likelihood can be rewritten in two parts
| $-\frac{1}{2}[ n\log(2\pi) + \log(\vert\mathbf{C}^{-1}_{-i}\vert) + \mathbf{z}^T\mathbf{C}_{-i}^{-1}\mathbf{z}]$ | (5) | |
| $-\frac{1}{2}[\log(\alpha_i + \mathbf{K}_i^T\mathbf{C}_{-i}^{-1}\mathbf{K}_i) - \log(\alpha_i) - \mathbf{z}^T\frac{\mathbf{C}_{-i}^{-1}\mathbf{K}_i\mathbf{K}_i^T\mathbf{C}_{-i}^{-1}}{\alpha_i + \mathbf{K}_i^T\mathbf{C}^{-1}_{-i}\mathbf{K}_i}\mathbf{z}]$ | (6) | 
Thus the log marginal likelihood can be decomposed into that dependent only on $i$, and that which is global to the model. Letting $q_i = \mathbf{K}_i^T\mathbf{C}_{-i}^{-1}\mathbf{z}$, and $s_i = \mathbf{K}_i^T\mathbf{C}^{-1}_{-i}\mathbf{K}_i$, the log marginal likelihood as a function of $\alpha_i$is $\lambda(\alpha_i) = \frac{1}{2}[\log(\alpha_i) - \log(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i}]$
Taking derivatives and setting to 0 gives the analytical update $\alpha_i = \frac{s_i^2}{q_i^2 - s_i}$
The RVM training procedure requires computing $q_i$, and $s_i$for all $K_i$, which is computationally expensive. Used as a heuristic estimate of the variance in an assumed zero-mean Gaussian distribution, however, we only need to check these values for $i$ which are not already in the current model.
There is a relationship between the above quantities $q$and $s$and the geometric interpretations of sparsity and fit, which are laid out elsewhere [3]. For now, a brief sketch was necessary.
One remaining issue is the computation of $\mathbf{C}_{-i}^{-1}$. As pointed out, this is equivalent to $(\mathbf{B}+ \mathbf{K}\mathbf{A}\mathbf{K}^T)^{-1}$,
Now using the matrix identities again, we have that $\mathbf{C}_{-i}^{-1} = \mathbf{B}^{-1} - \mathbf{B}^{-1}\mathbf{K}(\mathbf{A}^{-1} + \mathbf{K}^T\mathbf{B}^{-1}\mathbf{K})^{-1}\mathbf{K}^T\mathbf{A}^{-1}$, Where now we can use the truncated $n\times m$matrix $K$filled with only those inner products that are in the current model. This turns an inversion of an $n \times n$ matrix into that of a $m \times m$ matrix, with inversions of diagonal matrices being possible in linear time.
Now, considering the above, the algorithm is to combine IVM with an estimate of the empirical variance, rather than solving KLR for the augmented matrices at each iteration. The algorithm chooses the basis corresponding to the smallest possible positive $\alpha$.
This is still way too heuristic, but I thought it was a decent rough sketch, I'd like to formalize these ideas abit more. One thing is that it is alot faster, having tested this on a dataset or two, however the problem is that it isn't actually performing very well. Oh well, either a nice idea that doesn't work, or I'm missing a few things here that I need to be aware of.
[1] J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14(1):185{205, 2005.
[2] ME Tipping. The Relevance Vector Machine. Advances in Neural Information Processing Systems, 12, 2000.
[3] A. Faul and M. Tipping. Analysis of sparse bayesian learning, 2002.
 
No comments:
Post a Comment