Thursday, January 7, 2010
More Research
$\max_w \frac{w^T S_b w}{w^T S_w w}$, where $S_b$ is the between-class scatter matrix and $S_w$ is the within-class scatter matrix. Note that this finds a linear transformation, and so is called a linear discriminant.
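For concreteness, here's a minimal numpy sketch of the two-class case, where the maximizer of this ratio has the familiar closed form $w \propto S_w^{-1}(m_1 - m_2)$. The function and variable names are just mine for illustration.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher LDA: maximize w^T S_b w / w^T S_w w.

    X1, X2: (n_i x d) arrays of examples from each class.
    For two classes the maximizer is proportional to S_w^{-1} (m1 - m2).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter: sum of the two per-class scatter matrices
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_w, m1 - m2)
    return w / np.linalg.norm(w)
```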
On the other hand, DCCA is concerned with correlations, so we attempt to find the pair of transformations that maximizes the within-class correlation while minimizing the between-class correlation. Again, let $X$ and $Y$ be the two sources of information, with labels indexed by $c \in \{1, 2, ..., K\}$. And so we want
$\max_{a,b} \frac{a^TC_{xy}b}{(a^T S_{xx} a b^T S_{yy} b)^{1/2}}$,
where $C_{xy} = C_w - C_b$, and
$C_w = \sum_c \sum_{\{i \mid x_i \in c\}} \sum_{\{j \mid y_j \in c\}} x_i y_j^T$
and
$C_b = \sum_{c_1} \sum_{c_2 \neq c_1} \sum_{\{i|x_i \in c_1\}} \sum_{\{j|y_j \in c_2\}} x_iy_j^T$.
It is easy to see that here we are maximizing the difference between the within-class and the between-class canonical correlations.
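To make the construction concrete, here's a minimal numpy sketch that builds $C_w$ and $C_b$ exactly as summed above and evaluates the objective for a given pair $(a, b)$. The function name, the label arrays, and the use of uncentered scatter matrices for $S_{xx}$ and $S_{yy}$ are my own assumptions for illustration, not something pinned down by the formulas.

```python
import numpy as np

def dcc_objective(a, b, X, Y, labels_x, labels_y):
    """Evaluate a^T C_xy b / sqrt(a^T S_xx a * b^T S_yy b), with C_xy = C_w - C_b.

    X: (n x dx) and Y: (m x dy) hold the two sources of information;
    labels_x, labels_y are integer arrays giving each example's class.
    """
    labels_x, labels_y = np.asarray(labels_x), np.asarray(labels_y)
    classes = np.unique(np.concatenate([labels_x, labels_y]))
    # sum_i sum_j x_i y_j^T factorizes as (sum_i x_i)(sum_j y_j)^T per class pair
    sx = {c: X[labels_x == c].sum(axis=0) for c in classes}
    sy = {c: Y[labels_y == c].sum(axis=0) for c in classes}
    Cw = sum(np.outer(sx[c], sy[c]) for c in classes)
    Cb = sum(np.outer(sx[c1], sy[c2])
             for c1 in classes for c2 in classes if c1 != c2)
    Cxy = Cw - Cb
    Sxx, Syy = X.T @ X, Y.T @ Y  # assumed scatter matrices; center X, Y first if needed
    return (a @ Cxy @ b) / np.sqrt((a @ Sxx @ a) * (b @ Syy @ b))
```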
I'm not going to get into the optimization, mostly because it involves lots of specially crafted matrices, which are pretty much impossible to format in Blogger, but you can go to the reading I cribbed this from (below) for the details.
Based on readings:
Research
I decided to convert my KLR code from BFGS to L-BFGS in order to scale. Which of course means (since I try not to implement what I don't understand) that I had a little extra reading to do. Here it is in a nutshell.
$q = \nabla f_t$
for $i = t:-1:k$
    $\alpha_i = \frac{s_i^T q}{y_i^T s_i}$
    $q = q - \alpha_i y_i$
end
$p_t = H_k q$
for $i = k:t-1$
    $\beta = \frac{y_i^T p_t}{y_i^T s_i}$
    $p_t = p_t + s_i(\alpha_i - \beta)$
end
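And here's roughly what that looks like in numpy, written as the usual two-loop recursion over the stored $(s_i, y_i)$ pairs. The function name, the list-based memory, and the scalar initial Hessian approximation $H_k = \gamma I$ are my own choices for the sketch.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, gamma=1.0):
    """Two-loop recursion: approximately apply the inverse Hessian to grad,
    using the stored pairs s_i = x_{i+1} - x_i, y_i = g_{i+1} - g_i
    (oldest first). gamma * I plays the role of H_k above; a common choice
    is gamma = (s^T y) / (y^T y) for the newest pair.
    """
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: newest pair back to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y
    p = gamma * q  # apply the initial (diagonal) inverse Hessian approximation
    # second loop: oldest pair forward to newest
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * (y @ p)
        p += s * (alpha - beta)
    return p  # the quasi-Newton step is then -p, scaled by a line search
```

The point of the exercise is that you only keep the last handful of $(s, y)$ pairs, so the memory and per-step cost stay linear in the dimension instead of quadratic as in full BFGS.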
Montreal
This is my new home (I never thought I'd live in a smaller apartment than the one I left in Tribeca). It is tiny but bright and I can't complain.
Well, anyway, it's pretty crazy that nobody ever gets hurt on these things, so I hope I'm not the first.
Wednesday, January 6, 2010
Tuesday, January 5, 2010
Research
Learned about something today called Canonical Correlation Analysis (CCA). There's a ton of stuff out there if you're interested, but here's my take on it. First, you can think of CCA as a version of factor analysis (think PCA), used mostly where you have two sets of variables. One example would be multiple views of the same object: in vision, for instance, suppose you had a bunch of images of your face. You could run a wavelet transform and keep only the low-frequency components, assuming the high-frequency stuff is noise, but you'd be throwing away data, which never feels right. What you'd like is some principled way of deciding what is redundant or irrelevant. CCA is an analysis of the correlation between the two sets of data, and as such can be used for this purpose (feature extraction).
One way of thinking about it is as a problem of dimensionality reduction. In the case of multiple views, we have some extra information that we can use. First, we know that the data in set $X$ and data in set $Y$ are probably highly correlated. In fact, we can imagine some "latent" variables (like factors) that are responsible (cause?) for the observed correlation. In terms of face recognition, think of features of my real face (a gift to you, my reader) as the factors.
The simple intuition behind CCA is that since the data are multiple views of the same object, there is a set of latent or canonical variates which are highly correlated with the observed features. Ultimately, we would like to use these variates, not the highly variable observed data. Thus, by finding a linear combination of the data in each set which maximizes the between-set correlation while controlling for the within-set variance, we attempt to uncover this relationship, assuming...
NB: Of course there are a few strong assumptions here, as CCA assumes that the relationship between the canonical variates and $X$ and $Y$ is linear. Moreover, there are implicit assumptions arising from using the sample covariance as an estimate of the covariance (i.e. Gaussianity). The latter can be fixed by using alternative estimates of the covariance. On the other hand, we do have to assume that whatever drives the cross-correlation is what we're actually interested in observing (my face). End assumptions (I'm not too picky about assumptions, especially when the ideas are just plain cool).
So the simple objective in CCA is just to max out the correlation:
$\max_{a, b} R = \frac{a^TS_{xy}b}{(a^TS_{xx}a)^{1/2}(b^TS_{yy}b)^{1/2}}$
where $S_{xy}$ is the cross-covariance matrix, and $S_{xx}$ and $S_{yy}$ are the covariance matrices of $X$ and $Y$. Noting that rescaling $a$ or $b$ does not affect the solution, CCA is scale-invariant. The math at this point is not overly complicated, but it is somewhat involved, so I'll just say check out the Wikipedia page on it. The basic idea is that the vectors $a$ and $b$ define a pair of canonical variates, the projections $a^TX$ and $b^TY$. There are $\min(m, d)$ of these pairs, sorted in decreasing order of correlation, with each successive pair uncorrelated with the previous ones. The idea is that the first pair explains most of the correlation between $X$ and $Y$, with each subsequent pair explaining less.
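If you want to play with it, a standard way to compute all the pairs at once is to whiten each set and take an SVD. Here's a minimal numpy sketch under that formulation; the function name and the small ridge term (to keep $S_{xx}$, $S_{yy}$ invertible) are my additions, not part of the derivation above.

```python
import numpy as np

def cca(X, Y, reg=1e-8):
    """Plain CCA: returns canonical directions A, B (columns a_i, b_i)
    and the canonical correlations, sorted in decreasing order.

    X: (n x dx), Y: (n x dy) are paired observations (rows correspond);
    reg is a small ridge term added to the covariance matrices.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    # whiten: the singular values of Lx^{-1} Sxy Ly^{-T} are the correlations
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, np.linalg.solve(Ly, Sxy.T).T)
    U, corrs, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U)     # a_i = Lx^{-T} u_i
    B = np.linalg.solve(Ly.T, Vt.T)  # b_i = Ly^{-T} v_i
    return A, B, corrs
```

The number of pairs returned is $\min(d_x, d_y)$, matching the count above, and `corrs[0]` is the correlation achieved by the first canonical pair.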
New problem: suppose you were given subspaces, or found them using PCA. Suppose you had images of your face indoors and outdoors. You want to reduce the dimensionality, but using PCA will almost certainly treat the illumination as a major source of variation (which we do not want). Let's call the two subspaces $U_1$ and $U_2$, which maximize the variance after projection. Now, finding the directions that minimize the angle between the two subspaces can be seen as finding the vectors $v_1$ and $v_2$ which maximize the correlation $corr(U_1v_1, U_2v_2)$, which is identical to the canonical correlation analysis procedure I described above, except that here we are working with subspaces rather than variables.
One way to think about this is as a way to stabilize subspace matching, by comparing subspaces rather than comparing an individual example to a subspace. For example, suppose you had the aforementioned images of your face and your task was to program a video camera attached to your car to open the door when you appeared in the front window. You could learn the subspace of your face first and then compute the angle to a single frame from the car camera, OR you could use a face detector and use the subspace found by capturing 10 seconds of video. Clearly there will be a lot less variance in the second method. By recovering the canonical vectors, we can also compute the canonical correlations, and a low correlation can be read as a low level of similarity.
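Here's a quick sketch of that comparison, assuming $U_1$ and $U_2$ are orthonormal bases (say, the top PCA eigenvectors of the gallery face set and of the 10-second clip): the canonical correlations are just the singular values of $U_1^T U_2$. The function and variable names are mine.

```python
import numpy as np

def subspace_canonical_correlations(U1, U2):
    """Canonical correlations (cosines of the principal angles) between the
    subspaces spanned by the columns of U1 (d x k1) and U2 (d x k2).
    """
    # re-orthonormalize defensively in case the bases are only approximately orthonormal
    Q1, _ = np.linalg.qr(U1)
    Q2, _ = np.linalg.qr(U2)
    V1, corrs, V2t = np.linalg.svd(Q1.T @ Q2, full_matrices=False)
    # canonical vectors in each subspace, paired column by column
    return corrs, Q1 @ V1, Q2 @ V2t.T
```

A similarity score for the video-matching setup above could then be the largest correlation, or the average of the top few.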
It is natural to ask: if CCA is an extension of PCA to multiple modalities, then what is the discriminative version (PCA is to CCA as LDA is to ?)? Suppose, for some reason, some common thing appeared in every image in the dataset, such as the moon. However, we do not want the presence of the moon to be discriminative. Using CCA alone would not rectify the situation.
As such, we would need some sort of LDA-like between-class/within-class conditions to be met when determining the canonical vectors. LDA does this by maximizing the ratio of the between-class variance to the within-class variance of the linear transformation. A natural extension, then, is to maximize the ratio of the within-class canonical correlations to the between-class canonical correlations after transformation (called discriminative canonical correlation).
More on that tomorrow.
Based on readings for today --
- Kim, T., J. Kittler, and R. Cipolla. "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations." IEEE Transactions on Pattern Analysis and Machine Intelligence 29.6 (2007): 1005.
- Chaudhuri, K., et al. "Multi-View Clustering via Canonical Correlation Analysis." ACM, New York, NY, USA, 2009.
- Yan, S., D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. "Graph Embedding and Extensions: A General Framework for Dimensionality Reduction." IEEE Transactions on Pattern Analysis and Machine Intelligence 29.1 (2007): 40-51.
- Yamaguchi, O., K. Fukui, and K.-i. Maeda. "Face Recognition Using Temporal Image Sequence." Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998: 318-323.
- Sun, Q. S., et al. "A New Method of Feature Fusion and Its Application in Image Recognition." Pattern Recognition 38.12 (2005): 2437-48.
Labels: Correlation, Cosine, CCA