I decided to convert my KLR code from BFGS to LBFGS in order to scale. Which of course means (since I try not to implement what I don't understand) that I had a little extra reading to do. Here's the nutshell.
I'm assuming you're familiar with Newton's method for calculating a zero of a function, which is given by iterating $x = x - \frac{f(x)}{f^\prime(x)}$. Since optimizing essentially means finding zeros of the first derivative of a function, applying the Newton-Raphson method to it requires the second derivative. In the case where $x$ is a vector, this involves computing the matrix of second derivatives, known as the Hessian (I've always found this name a little scary).
Besides the size and complexity of this matrix for the typically large $d$ or $n$ we might have, there is also the fact that in Newton's method above the derivative appears in the denominator; in the optimization setting that derivative is the Hessian, which means we actually need to invert it ($O(n^3)$). For smaller datasets it's usually advisable to use Newton's method, since it gives a quadratic rate of convergence. In everything I've ever seen or done (not too much), you usually jump to a very good solution within 2 iterations.
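As a minimal sketch of the iteration (on a toy one-dimensional function I made up, nothing to do with the KLR objective), applying Newton's root-finding update to $f^\prime$ looks like this:

```python
# Toy example (not the KLR objective): minimize f(x) = x^4 - 3x^2 + x by
# applying Newton's root-finding iteration to the first derivative f'(x).
def grad(x):
    return 4 * x**3 - 6 * x + 1   # f'(x)

def hess(x):
    return 12 * x**2 - 6          # f''(x)

x = 2.0                           # starting point
for _ in range(10):
    x = x - grad(x) / hess(x)     # x <- x - f'(x) / f''(x)

print(x, grad(x))                 # f'(x) should be ~0 at the limit point
```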
Of course, someone (not Nick) recently handed me a pixel-labelled dataset of 100 images of size 320x240, and wants me to run KLR on the pixel labels using some pretty wacky features (something like 128-vectors of pixel data). Still, that's 7,680,000 $(y, x)$ pairs, so Newton's method is pretty much out of the question. Of course, there's this problem with the kernel matrix, but I'll tell you about that trick later.
Now, ok, so anyways, for this post I just want to fit vanilla LR with LBFGS so that I can test it against the Newton's method implementation I already have. Then I'll work on my kernel matrix trick, which I'll describe in another post.
LBFGS is based on BFGS, of course, in which you never invert the Hessian at all; instead you store an approximation to the inverse Hessian and update it using only first-order information. As in the Newton-Raphson method, the trick is to compute $\theta = \theta - \alpha H^{-1}\nabla f$, where I'm using $\theta$ to indicate that it is a parameter, and $\alpha$ is a step size, which you choose generally by some line search method. In my app, I choose $\alpha$ as the one-dimensional minimizer of $f(\theta - \alpha H^{-1}\nabla f)$ (by taking the derivative with respect to $\alpha$ and setting it to 0). This is a simple line search.
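Here's a minimal sketch of that kind of one-dimensional search, using SciPy's scalar minimizer rather than solving the derivative by hand; `f`, `theta`, and `direction` are hypothetical names (with `direction` standing in for $-H^{-1}\nabla f$), not anything from my actual code:

```python
from scipy.optimize import minimize_scalar

def exact_line_search(f, theta, direction, max_alpha=10.0):
    """Pick the step size alpha minimizing f along the given search direction.

    `direction` is assumed to be a descent direction such as -H^{-1} grad f.
    """
    phi = lambda alpha: f(theta + alpha * direction)  # one-dimensional slice of f
    result = minimize_scalar(phi, bounds=(0.0, max_alpha), method="bounded")
    return result.x                                   # the chosen step size
```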
The general idea behind Newton's method is this: suppose you have a function $f$. If you remember Calc I, you can approximate $f$ using a Taylor expansion to arbitrary precision. In Newton's method we expand up to the second derivative, which gives a quadratic polynomial; this is sometimes called the quadratic approximation.
$f(x) \approx f(a) + f^\prime(a)(x - a) + \frac{1}{2}f^{\prime\prime}(a)(x - a)^2$.
For vector calculus, we have the same idea
$f(x) \approx f(a) + \nabla f(a)^T (x - a) + \frac{1}{2}(x - a)^T H (x - a)$.
So at the current iterate $\theta_t$ we can minimize the quadratic model rather than the function itself,
$m(p) = f(\theta_t) + \nabla f(\theta_t)^Tp + \frac{1}{2}p^T H_t p$, where I have used the notation $p = \theta - \theta_t$ for the step away from the current iterate.
Minimizing this function with respect to $p$ approximately minimizes the function $f$. Taking derivatives and setting them to zero, we have $p = -(H_t)^{-1}\nabla f(\theta_t)$, which is just what I said above. Thus, since $p$ is the step away from the current iterate, the Newton update is $\theta_{t+1} = \theta_t + \alpha_t p$.
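To make the algebra concrete, here's a tiny check (a made-up 2-D quadratic, nothing from my KLR code) that with $\alpha_t = 1$ the Newton step $p = -H^{-1}\nabla f(\theta_t)$ lands exactly on the minimizer of a quadratic in a single iteration:

```python
import numpy as np

# Made-up quadratic f(theta) = 0.5 * theta^T A theta - b^T theta,
# whose exact minimizer is A^{-1} b and whose Hessian is A everywhere.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

theta = np.zeros(2)                      # current iterate theta_t
grad = A @ theta - b                     # gradient at theta_t
p = -np.linalg.solve(A, grad)            # Newton step p = -H^{-1} grad
theta_next = theta + p                   # alpha_t = 1

print(np.allclose(theta_next, np.linalg.solve(A, b)))  # True: one-step solve
```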
BFGS (not LBFGS) operates under the idea that we would like not to calculate $H_t$ at all. Instead we maintain an approximation and update it so that it stays close to the real $H_t$. To avoid confusion, let $B_t$ be the approximation of $H_t$ at iteration $t$. Then we impose some conditions on $B_{t+1}$ based on the results of the previous iteration.
Imagine that you are standing on a hill. One reasonable requirement is that the new model reproduce the gradient you actually observed at your old position. That is, if after taking a step you use the model built where you are currently standing to estimate the gradient back at the old position, it should agree with the gradient you measured there. That is,
$\nabla m_{t+1}(-\alpha_tp_t) = \nabla f(\theta_t)$ (the old position is $p = -\alpha_t p_t$ in the new model's coordinates, since $\theta_t = \theta_{t+1} - \alpha_t p_t$).
Taking the gradient of the new model $m_{t+1}(p) = f(\theta_{t+1}) + \nabla f(\theta_{t+1})^Tp + \frac{1}{2}p^TB_{t+1}p$, we see that
$\nabla m_{t+1}(-\alpha_tp_t) = \nabla f(\theta_{t+1}) - B_{t+1} \alpha_t p_t$,
which implies that
$\nabla f(\theta_{t+1}) - B_{t+1} \alpha_t p_t = \nabla f(\theta_t)$,
and rearranging, $\nabla f(\theta_{t+1}) - \nabla f(\theta_t) = B_{t+1}( \alpha_t p_t )$. Since $\theta_{t+1} - \theta_{t} = \alpha_t p_t$, we have that
$y_t = B_{t+1}s_t$,
where $y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t)$ and $s_t = \theta_{t+1} - \theta_{t}$. This is the secant equation.
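As a quick sanity check on what this condition buys us (a worked special case I'm adding, not part of the derivation above): for a quadratic $f(\theta) = \frac{1}{2}\theta^T A \theta + b^T\theta$ the gradient is $\nabla f(\theta) = A\theta + b$, so
$y_t = \nabla f(\theta_{t+1}) - \nabla f(\theta_t) = A(\theta_{t+1} - \theta_t) = As_t$,
and the secant equation $B_{t+1}s_t = y_t = As_t$ says exactly that the approximation must agree with the true Hessian $A$ along the step $s_t$ we just took.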
Now the remaining constraints are simpler: of course we want $B$ to be symmetric positive definite, and we want $B_{t+1}$ to be not too different from $B_t$. This gives the program $\min_B ||B - B_t||$ subject to $B = B^T$, $Bs_t = y_t$. Using a weighted Frobenius norm with weight matrix given by the average Hessian, the unique solution is
$B_{t+1} = U_t^TB_tU_t + \frac{y_ty_t^T}{y_t^Ts_t}$
where $U_t = (I - \frac{s_ty_t^T}{y_t^Ts_t})$. Noting that $B_{t+1}s_t = y_t$ is equivalent to $B_{t+1}^{-1}y_t = s_t$, the same construction applied directly to the inverse (which is what we actually use in practice) gives the BFGS update for the inverse approximate Hessian $B_{t+1}^{-1}$:
$B_{t+1}^{-1} = V_t^TB_t^{-1}V_t + \frac{s_ts_t^T}{y_t^Ts_t}$
where $V_t = (I - \frac{y_ts_t^T}{y_t^Ts_t})$.
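Here's a minimal NumPy sketch of that inverse update (just the formula above written out, with my own function and variable names; not my actual KLR code):

```python
import numpy as np

def bfgs_inverse_update(H_inv, s, y):
    """BFGS update of the inverse approximate Hessian.

    H_inv : current B_t^{-1}, shape (n, n)
    s     : theta_{t+1} - theta_t
    y     : grad f(theta_{t+1}) - grad f(theta_t)
    """
    rho = 1.0 / (y @ s)                        # 1 / (y_t^T s_t)
    I = np.eye(len(s))
    V = I - rho * np.outer(y, s)               # V_t = I - rho y_t s_t^T
    return V.T @ H_inv @ V + rho * np.outer(s, s)
```

Everything here is just outer products, so applying the update is $O(n^2)$, and storing the dense $H^{-1}$ is exactly the $O(n^2)$ memory that L-BFGS is designed to avoid.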
So this relatively simple desire and simple condition result in a useful algorithm where we do not have to compute the Hessian and still obtain super-linear rates of convergence. L-BFGS is based on the observation that we don't actually ever want the Hessian itself, we just need it to compute $p_t$. So if we have a lot of variables to optimize over ($\theta$ is high dimensional, as it is in KLR, where it has size $n$, one entry per training point), we would prefer never to store it.
Looking at the math so far, we originally had $p_t = -B_t^{-1} \nabla f(\theta_t)$. At this point I'll just call $B_t^{-1}$ $H_t$, although this is NOT the Hessian, but rather the inverse approximate Hessian. So, dropping the overall minus sign (we can re-apply it at the end) and expanding $H_t$, we have $p_t = (V_{t-1}^TH_{t-1}V_{t-1} + \frac{s_{t-1}s_{t-1}^T}{y_{t-1}^Ts_{t-1}}) \nabla f(\theta_t)$. If we continue to expand, we have
$p_t = (V_{t-1}^TV_{t-2}^TH_{t-2}V_{t-2}V_{t-1} + V_{t-1}^T\frac{s_{t-2}s_{t-2}^T}{y_{t-2}^Ts_{t-2}}V_{t-1} + \frac{s_{t-1}s_{t-1}^T}{y_{t-1}^Ts_{t-1}}) \nabla f(\theta_t)$
Expanding until some initial $H_k$ is reached (we initially set $H_k = \beta I$ for some constant scalar $\beta$, with $t > k$, but on subsequent iterations use $\gamma I$ with $\gamma = \frac{s_{t-1}^Ty_{t-1}}{y_{t-1}^Ty_{t-1}}$), we have a recursive definition of $p_t$ at time $t$ in terms of $H_k$ and the pairs $s_i$, $y_i$ for each $k \le i < t$. Thus we only need to store $t - k$ vector pairs, and if we choose a diagonal $H_k$, then we can store that efficiently as well.
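Writing the pattern out in general, with $\rho_j = 1/(y_j^Ts_j)$, the fully expanded inverse approximation is
$H_t = (V_{t-1}^T\cdots V_k^T)\,H_k\,(V_k\cdots V_{t-1}) + \sum_{j=k}^{t-1}\rho_j\,(V_{t-1}^T\cdots V_{j+1}^T)\,s_js_j^T\,(V_{j+1}\cdots V_{t-1})$,
so the only things we ever need to keep around are the $(s_j, y_j)$ pairs and the diagonal $H_k$.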
In MATLAB-ish notation, the core of the LBFGS algorithm (the two-loop recursion) is incredibly easy to implement:
$q = \nabla f_t$
for $i = t-1:-1:k$
    $\alpha_i = \frac{s_i^Tq}{y_i^Ts_i}$
    $q = q - \alpha_i y_i$
end
$p_t = H_k q$
for $i = k:t-1$
    $\beta = \frac{y_i^Tp_t}{y_i^Ts_i}$
    $p_t = p_t + s_i(\alpha_i - \beta)$
end
with the result $p_t = H_t \nabla f_t$ (negate it to get the descent direction), so that the algorithm consists of $4nm + n$ multiplications, i.e. $O(nm)$, where $m = t-k$, as long as $H_k$ is diagonal. As such, we have efficiently produced a very good update direction $p_t$ in time and memory scaling in $nm$, rather than the $n^2$ needed to store the inverse Hessian explicitly. Of course, the price is a reduced rate of convergence, but I'm willing to sacrifice a little to get my app to actually, you know, work.
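For reference, here's how I'd sketch that two-loop recursion in NumPy (hypothetical names `s_list`, `y_list`, `gamma`, not my actual code; it returns $H_t\nabla f_t$, so the search direction is its negative):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, gamma=1.0):
    """Two-loop recursion: returns H_t @ grad, where H_t is the implicit
    L-BFGS inverse-Hessian approximation built from the stored (s, y) pairs.

    grad   : gradient at the current iterate, float array of shape (n,)
    s_list : [s_k, ..., s_{t-1}], each theta_{i+1} - theta_i
    y_list : [y_k, ..., y_{t-1}], each grad_{i+1} - grad_i
    gamma  : scalar so that H_k = gamma * I (e.g. s^T y / y^T y)
    """
    q = grad.copy()
    alphas = []
    # First loop: newest pair back to oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        alpha = (s @ q) / (y @ s)
        q -= alpha * y
        alphas.append(alpha)
    p = gamma * q                      # apply the diagonal H_k
    # Second loop: oldest pair forward to newest, reusing the stored alphas.
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):
        beta = (y @ p) / (y @ s)
        p += s * (alpha - beta)
    return p                           # search direction is -p
```

The only state carried between iterations is the list of $m$ recent $(s, y)$ pairs, which is exactly the memory-for-curvature trade described above.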
Based on reading (there is a downloadable copy at Scribd, ssssh):
1. Nocedal, J., and S. J. Wright. Numerical Optimization. Springer, 2000.