Thursday, January 7, 2010
More Research
$\max_w \frac{w^T S_b w}{w^T S_w w}$, where $S_b$ is the between-class scatter matrix and $S_w$ is the within-class scatter matrix. Note that this finds a linear transformation, and so is called a linear discriminant.
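For concreteness, here's a minimal numpy sketch of the two-class case, where the maximizer of this ratio has the familiar closed form $w \propto S_w^{-1}(m_1 - m_2)$. The function and variable names are just mine for illustration.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher LDA: maximize w^T S_b w / w^T S_w w.

    X1, X2: (n_i x d) arrays of examples from each class.
    For two classes the maximizer is proportional to S_w^{-1} (m1 - m2).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter: sum of the two per-class scatter matrices
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_w, m1 - m2)
    return w / np.linalg.norm(w)
```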
On the other hand, DCCA is concerned with correlations, so we attempt to find the pair of transformations that maximizes the within-class correlation while minimizing the between-class correlation. Again, let $X$ and $Y$ be the two sources of information, with labels indexed by $c \in \{1, 2, ..., K\}$. And so we want
$\max_{a,b} \frac{a^TC_{xy}b}{(a^T S_{xx} a b^T S_{yy} b)^{1/2}}$,
where $C_{xy} = C_w - C_b$, and
$C_w = \sum_c \sum_{\{i \mid x_i \in c\}} \sum_{\{j \mid y_j \in c\}} x_i y_j^T$
and
$C_b = \sum_{c_1} \sum_{c_2 \neq c_1} \sum_{\{i|x_i \in c_1\}} \sum_{\{j|y_j \in c_2\}} x_iy_j^T$.
It is easy to see that here we are maximizing the difference between the within-class and the between-class canonical correlations.
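To make the construction concrete, here's a minimal numpy sketch that builds $C_w$ and $C_b$ exactly as summed above and evaluates the objective for a given pair $(a, b)$. The function name, the label arrays, and the use of uncentered scatter matrices for $S_{xx}$ and $S_{yy}$ are my own assumptions for illustration, not something pinned down by the formulas.

```python
import numpy as np

def dcc_objective(a, b, X, Y, labels_x, labels_y):
    """Evaluate a^T C_xy b / sqrt(a^T S_xx a * b^T S_yy b), with C_xy = C_w - C_b.

    X: (n x dx) and Y: (m x dy) hold the two sources of information;
    labels_x, labels_y are integer arrays giving each example's class.
    """
    labels_x, labels_y = np.asarray(labels_x), np.asarray(labels_y)
    classes = np.unique(np.concatenate([labels_x, labels_y]))
    # sum_i sum_j x_i y_j^T factorizes as (sum_i x_i)(sum_j y_j)^T per class pair
    sx = {c: X[labels_x == c].sum(axis=0) for c in classes}
    sy = {c: Y[labels_y == c].sum(axis=0) for c in classes}
    Cw = sum(np.outer(sx[c], sy[c]) for c in classes)
    Cb = sum(np.outer(sx[c1], sy[c2])
             for c1 in classes for c2 in classes if c1 != c2)
    Cxy = Cw - Cb
    Sxx, Syy = X.T @ X, Y.T @ Y  # assumed scatter matrices; center X, Y first if needed
    return (a @ Cxy @ b) / np.sqrt((a @ Sxx @ a) * (b @ Syy @ b))
```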
I'm not going to get into the optimization, mostly because it involves lots of specially crafted matrices, which are pretty much impossible to format in Blogger, but you can go to the reading I cribbed this from (below) for the details.
Based on readings:
Research
I decided to convert my KLR code from BFGS to L-BFGS in order to scale. Which of course means (since I try not to implement what I don't understand) that I had a little extra reading to do. Here it is in a nutshell.
$q = \nabla f_t$
for $i = t:-1:k$
    $\alpha_i = \frac{s_i^T q}{y_i^T s_i}$
    $q = q - \alpha_i y_i$
end
$p_t = H_k q$
for $i = k:t-1$
    $\beta = \frac{y_i^T p_t}{y_i^T s_i}$
    $p_t = p_t + s_i(\alpha_i - \beta)$
end
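And here's roughly what that looks like in numpy, written as the usual two-loop recursion over the stored $(s_i, y_i)$ pairs. The function name, the list-based memory, and the scalar initial Hessian approximation $H_k = \gamma I$ are my own choices for the sketch.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, gamma=1.0):
    """Two-loop recursion: approximately apply the inverse Hessian to grad,
    using the stored pairs s_i = x_{i+1} - x_i, y_i = g_{i+1} - g_i
    (oldest first). gamma * I plays the role of H_k above; a common choice
    is gamma = (s^T y) / (y^T y) for the newest pair.
    """
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: newest pair back to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y
    p = gamma * q  # apply the initial (diagonal) inverse Hessian approximation
    # second loop: oldest pair forward to newest
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * (y @ p)
        p += s * (alpha - beta)
    return p  # the quasi-Newton step is then -p, scaled by a line search
```

The point of the exercise is that you only keep the last handful of $(s, y)$ pairs, so the memory and per-step cost stay linear in the dimension instead of quadratic as in full BFGS.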
Montreal
This is my new home (I never thought I'd live in a smaller apartment than the one I left in Tribeca). It is tiny but bright and I can't complain.
Well, anyway, it's pretty crazy that nobody ever gets hurt on these things, so I hope I'm not the first.
Wednesday, January 6, 2010
Tuesday, January 5, 2010
Research
Learned about something today called Canonical Correlation Analysis (CCA). There's a ton of stuff out there if you're interested, but here's my take on it. First, you can think of CCA as a version of factor analysis (think PCA), used mostly where you have two sets of variables. One example would be multiple views of the same object: in vision, for instance, suppose you had a bunch of images of your face. You could run a wavelet transform and keep only the low-frequency components, assuming the high-frequency stuff is noise, but you'd be throwing away data, which never feels right. What you'd like is some principled way of deciding what is redundant or irrelevant. CCA is an analysis of the correlation between the two sets of data, and as such can be used for this purpose (feature extraction).
One way of thinking about it is as a problem of dimensionality reduction. In the case of multiple views, we have some extra information that we can use. First, we know that the data in set $X$ and data in set $Y$ are probably highly correlated. In fact, we can imagine some "latent" variables (like factors) that are responsible (cause?) for the observed correlation. In terms of face recognition, think of features of my real face (a gift to you, my reader) as the factors.
The simple intuition behind CCA is that since the data are multiple views of the same object, there is a set of latent or canonical variates which are highly correlated with the observed features. Ultimately, we would like to use these variates, not the highly variable observed data. Thus, by finding a linear combination of the data in each set which maximizes the between-set correlation while controlling for the within-set variance, we attempt to uncover this relationship, assuming...
NB: Of course there are a few strong assumptions here, as CCA assumes that the relationship between the canonical variates and $X$ and $Y$ is linear. Moreover, there are implicit assumptions arising from using the sample covariance as an estimate of the covariance (i.e. Gaussianity). The latter can be fixed by using alternative estimates of the covariance. On the other hand, we do have to assume that whatever drives the cross-correlation is what we're actually interested in observing (my face). End assumptions (I'm not too picky about assumptions, especially when the ideas are just plain cool).
So the simple objective in CCA is just to max out the correlation:
$\max_{a, b} R = \frac{a^TS_{xy}b}{(a^TS_{xx}a)^{1/2}(b^TS_{yy}b)^{1/2}}$
where $S_{xy}$ is the cross-covariance matrix, and $S_{xx}$ and $S_{yy}$ are the covariance matrices of $X$ and $Y$. Noting that rescaling $a$ or $b$ does not affect the solution, CCA is scale-invariant. The math at this point is not overly complicated, but it is somewhat involved, so I'll just say check out the Wikipedia page on it. The basic idea is that the vectors $a$ and $b$ define a pair of canonical variates, the projections $a^TX$ and $b^TY$. There are $\min(m, d)$ of these pairs, sorted in decreasing order of correlation, with each successive pair uncorrelated with the previous ones. The idea is that the first pair explains most of the correlation between $X$ and $Y$, with each subsequent pair explaining less.
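If you want to play with it, a standard way to compute all the pairs at once is to whiten each set and take an SVD. Here's a minimal numpy sketch under that formulation; the function name and the small ridge term (to keep $S_{xx}$, $S_{yy}$ invertible) are my additions, not part of the derivation above.

```python
import numpy as np

def cca(X, Y, reg=1e-8):
    """Plain CCA: returns canonical directions A, B (columns a_i, b_i)
    and the canonical correlations, sorted in decreasing order.

    X: (n x dx), Y: (n x dy) are paired observations (rows correspond);
    reg is a small ridge term added to the covariance matrices.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    # whiten: the singular values of Lx^{-1} Sxy Ly^{-T} are the correlations
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, np.linalg.solve(Ly, Sxy.T).T)
    U, corrs, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U)     # a_i = Lx^{-T} u_i
    B = np.linalg.solve(Ly.T, Vt.T)  # b_i = Ly^{-T} v_i
    return A, B, corrs
```

The number of pairs returned is $\min(d_x, d_y)$, matching the count above, and `corrs[0]` is the correlation achieved by the first canonical pair.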
New problem: suppose you were given subspaces, or found them using PCA. Suppose you had images of your face indoors and outdoors. You want to reduce the dimensionality, but using PCA will almost certainly treat the illumination as a major source of variation (which we do not want). Let's call the two subspaces $U_1$ and $U_2$, which maximize the variance after projection. Now, finding the directions that minimize the angle between the two subspaces can be seen as finding the vectors $v_1$ and $v_2$ which maximize the correlation $corr(U_1v_1, U_2v_2)$, which is identical to the canonical correlation analysis procedure I described above, except that here we are working with subspaces rather than variables.
One way to think about this is as a way to stabilize subspace matching, by comparing subspaces rather than comparing an individual example to a subspace. For example, suppose you had the aforementioned images of your face and your task was to program a video camera attached to your car to open the door when you appeared in the front window. You could learn the subspace of your face first and then compute the angle to a single frame from the car camera, OR you could use a face detector and use the subspace found by capturing 10 seconds of video. Clearly there will be a lot less variance in the second method. By recovering the canonical vectors, we can also compute the canonical correlations, and a low correlation can be read as a low level of similarity.
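Here's a quick sketch of that comparison, assuming $U_1$ and $U_2$ are orthonormal bases (say, the top PCA eigenvectors of the gallery face set and of the 10-second clip): the canonical correlations are just the singular values of $U_1^T U_2$. The function and variable names are mine.

```python
import numpy as np

def subspace_canonical_correlations(U1, U2):
    """Canonical correlations (cosines of the principal angles) between the
    subspaces spanned by the columns of U1 (d x k1) and U2 (d x k2).
    """
    # re-orthonormalize defensively in case the bases are only approximately orthonormal
    Q1, _ = np.linalg.qr(U1)
    Q2, _ = np.linalg.qr(U2)
    V1, corrs, V2t = np.linalg.svd(Q1.T @ Q2, full_matrices=False)
    # canonical vectors in each subspace, paired column by column
    return corrs, Q1 @ V1, Q2 @ V2t.T
```

A similarity score for the video-matching setup above could then be the largest correlation, or the average of the top few.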
It is natural to ask: if CCA is an extension of PCA to multiple modalities, then what is the discriminative version (PCA is to CCA as LDA is to ?)? Suppose, for some reason, some common thing appeared in every image in the dataset, such as the moon. However, we do not want the presence of the moon to be discriminative. Using CCA alone would not rectify the situation.
As such, we would need some sort of LDA-like between-class/within-class conditions to be met when determining the canonical vectors. LDA does this by maximizing the ratio of the between-class variance to the within-class variance of the linear transformation. A natural extension, then, is to maximize the ratio of the within-class canonical correlations to the between-class canonical correlations after transformation (called discriminative canonical correlation).
More on that tomorrow.
Based on readings for today --
- Kim, T., J. Kittler, and R. Cipolla. "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations." IEEE Transactions on Pattern Analysis and Machine Intelligence 29.6 (2007): 1005.
- Chaudhuri, K., et al. "Multi-View Clustering via Canonical Correlation Analysis." ACM, New York, NY, USA, 2009.
- Yan, S., D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. "Graph Embedding and Extensions: A General Framework for Dimensionality Reduction." IEEE Transactions on Pattern Analysis and Machine Intelligence 29.1 (2007): 40-51.
- Yamaguchi, O., K. Fukui, and K.-i. Maeda. "Face Recognition Using Temporal Image Sequence." Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998: 318-323.
- Sun, Q. S., et al. "A New Method of Feature Fusion and Its Application in Image Recognition." Pattern Recognition 38.12 (2005): 2437-48.
Labels: Correlation, Cosine, CCA