Saturday, January 16, 2010
Montreal
A good, but not great, sandwich: basically a pastrami on rye with yellow mustard. Not too thrilled with the mustard, which was Kraftish. The meat was mostly fresh if a bit dry; I'm used to the juicy chewiness of the New York variety. Pretty authentic NYC deli feel, though (although those are a dying breed). The staff was friendly, and I ate at the counter: "Anywhea you like, buddy." (Weird Canadian-plus-Jewish? accents.) The pickle was fantastic. I might go back just for the pickles. Also, no Doc Brown's black cherry, boo.
Friday, January 15, 2010
Research
I know there are faster linear algebra libraries out there than ublas, and I may end up converting this C++ code to use one of them (atlas, for example), but I took a look at the syntaxes and found that ublas is much easier to work with from a matlab->C++ conversion perspective. Of course, I may do the conversion again later just to gain some experience.
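For instance, here's a toy sketch of my own (the matrix A, the values, and all names are made up) showing how closely a Matlab line like y = A*x + b carries over to ublas:

// Toy example of how closely Boost uBLAS syntax tracks the original Matlab.
// The Matlab statement  y = A*x + b;  becomes roughly the following.
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <boost/numeric/ublas/io.hpp>
#include <iostream>

namespace ublas = boost::numeric::ublas;

int main() {
    ublas::matrix<double> A(3, 3);
    ublas::vector<double> x(3), b(3);
    for (unsigned i = 0; i < 3; ++i) {
        x(i) = i;
        b(i) = 1.0;
        for (unsigned j = 0; j < 3; ++j)
            A(i, j) = (i == j) ? 2.0 : 0.0;      // A = 2*eye(3)
    }
    // y = A*x + b  (prod() stands in for Matlab's '*'; indexing and vector
    // arithmetic otherwise read much like the Matlab original)
    ublas::vector<double> y = ublas::prod(A, x) + b;
    std::cout << y << std::endl;                 // prints [3](1,3,5)
    return 0;
}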
When I'm done with this, I really want to try building a matlab->C interpreter, just so I don't have to waste so much time doing this messy first step by hand. I've never built an interpreter before, but I can't think of anything that so many people, myself included, would find as useful. I found a matlab yacc grammar out there, so some part of it is already done; I suspect that's the easiest part.
Thursday, January 14, 2010
Later, Sam
Reading:
1. Roweis, S. T., and L. K. Saul. "Nonlinear Dimensionality Reduction by Locally Linear Embedding." Science 290.5500 (2000): 2323.
Wednesday, January 13, 2010
Later, Jay
Research Reading (Convex Clustering?)
Research odds and ends.
Suppose you want to cluster a dataset, which is a fairly common task. You have many, many choices for how to do this, depending on the assumptions you want to make and the amount of computational resources you want to spend. The obvious choice is K-means, and the next obvious choice is soft K-means (a.k.a. Gaussian mixture modeling). Then you can get into more esoteric options: kernel K-means, kernel GMM, other types of mixtures, K-medoids, etc., not to mention discriminative or semi-supervised methods. However, very few of these choices will lead to the same results every time, which is problematic from the standpoint of reproducing results.
Now, I'm not really all that concerned about this issue myself, as I've read (and experienced) that local optima are usually not terrible solutions, and the global optimum tends not to be too far from the local one. One real cost is simply the time spent choosing initial conditions, as most of these clustering algorithms are iterative, gradient-based techniques. (Yes, EM is a gradient-based technique in my book.)
So, to avoid that, it would be nice to have a convex optimization objective, so that we always get the same result; the optimization will also be fast, since we can use any number of convex optimization algorithms that converge to the solution at (hopefully) super-linear rates.
The following describes a convex clustering procedure, presented by Lashkari and Golland, which, although limited, may have some uses.
First, a short description of kernel density estimation (KDE). As usual, by convention we are given vectors $x_1, x_2, \ldots, x_n$, with $x_i \in R^d$, which we wish to assign to one of $K$ clusters.
Suppose we use kernel density estimation to estimate the density at $x_j$ from the full set of vectors we were given, that is,
$\hat{P}(x_j) = \sum_i \frac{1}{nh} K(\frac{x_j - x_i}{h})$
Aside: here $K(.)$ refers to a kernel function rather than the number of clusters $K$. Confusing, I know, but I'm trying to stick with convention.
Anyways, the KDE estimate is a totally reasonable thing to do. In fact, if we let $K(.)$ be the Gaussian kernel $K(\frac{u - v}{h}) = \beta \exp(-\frac{||u - v||_2^2}{h})$, then $h/2$ is the isotropic variance, with $\beta$ a normalization constant (chosen so that each Gaussian bump integrates to one). The thing to notice about this formulation is that it is a mixture-of-Gaussians distribution with $K = n$ components and mixing proportions $\frac{1}{n}$. If we take the log of this sum for each data point and add these up over the whole dataset, we have a log likelihood of the data. From a coding-theory standpoint, its negative is an estimate of the number of "nats" needed to encode the given dataset under the KDE estimate of the distribution.
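In other words, absorbing $\beta$ and the $\frac{1}{h}$ factor into the normalization of each Gaussian, the KDE estimate is just

$\hat{P}(x) = \sum_{i=1}^{n} \frac{1}{n} \mathcal{N}(x \,|\, x_i, \frac{h}{2} I)$,

a mixture of $n$ Gaussians, each pinned to a data point, with uniform mixing proportions.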
Of course, in this case there are no unknowns (as $h$ and $\beta$ are scalar constants), and there is no assignment of cluster labels, so there's really no point in taking the log likelihood, right?
Now, one intuition I've seen given a lot regarding KDE is to imagine a Gaussian bump sitting on each data point. (Extending this to bumps from other exponential family distributions is straightforward when the distribution is completely determined by a location parameter and a scale parameter.) In contrast, the GMM model is that each point is assigned to a Gaussian bump (responsible for some particular cluster $k$) that need not coincide with any particular data point.
As $h$ is fixed in the above, and each such bump is completely parameterized by its location and scale, we can merge the two ideas (KDE and GMM) by looking at the KDE log likelihood objective
$\sum_j \log \sum_i \frac{1}{nh} K(\frac{x_j - x_i}{h})$,
and treating $\frac{1}{nh}$ as a mixing coefficient that is allowed to depend on $i$, which I'll write $q_i$. Thus the objective to maximize is
$\sum_j \log \sum_i q_i K(\frac{x_j - x_i}{h})$.
Think of $x_i$ as a parameter rather than a data point, and you see that $x_i$ is the location of the distribution specified by $K$. To make this clearer, I'll use the notation $f(.;\theta)$ instead of $K$.
$\sum_j \log \sum_i q_i f(x_j; x_i, h)$.
Now, maximizing this over the mixing weights is a convex optimization problem: each term is the log of a function that is linear in $q = (q_1, \ldots, q_n)$, hence concave, and the constraint set $\sum_i q_i = 1$, $q_i \geq 0$ is a simplex (equivalently, minimizing the negative log likelihood is convex).
To find $q_i$, we take the gradient of the log likelihood with a Lagrange multiplier enforcing the equality constraint, and find the fixed-point condition
$q_i = \frac{q_i}{n} \sum_j \frac{f(x_j; x_i, h)}{\sum_k q_k f(x_j; x_k, h)}$,
which is exactly the update for the mixing coefficients in a GMM. For a stable solution, it is easiest to treat this as a recurrence: substitute the current $q$ into the right hand side, take the result as the new $q$, and iterate until the $q_i$ stop changing.
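To fill in the intermediate step (my own quick sketch of the derivation): setting the gradient of the Lagrangian to zero gives

$\sum_j \frac{f(x_j; x_i, h)}{\sum_k q_k f(x_j; x_k, h)} = \lambda$ for every $i$ with $q_i > 0$.

Multiplying by $q_i$, summing over $i$, and using $\sum_i q_i = 1$ gives $\lambda = n$; multiplying the condition through by $q_i / n$ then yields the fixed-point form above.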
Any $q_i$ close to 0 implies that the $i$th cluster is not effective in "reconstructing" the dataset, and convexity implies that we can safely eliminate that cluster from consideration once it gets small enough. We continue until the log likelihood has converged, giving a convex clustering procedure in which $K$, the number of clusters, is also learned.
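To make the procedure concrete, here is a rough C++ sketch of the whole loop using boost's ublas (my own illustration, not the authors' code; the Gaussian kernel form, the pruning threshold, the tolerances, and all names are my choices):

// Rough sketch of the exemplar-based convex clustering loop described above.
// Illustrative only: the bandwidth h, tolerances, and pruning threshold are
// arbitrary; the kernel normalization is omitted since it cancels in the
// update and only shifts the log likelihood by a constant.
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <cmath>
#include <cstddef>
#include <vector>

namespace ublas = boost::numeric::ublas;

// f(x_j; x_i, h): isotropic Gaussian bump located at x_i.
double kernel(const ublas::vector<double>& xj,
              const ublas::vector<double>& xi, double h) {
    ublas::vector<double> d = xj - xi;
    return std::exp(-ublas::inner_prod(d, d) / h);
}

// Returns the mixing weights q (one per candidate exemplar); the surviving
// q_i > 0 define the clusters, so K is learned along the way.
std::vector<double> convex_cluster(const std::vector<ublas::vector<double> >& x,
                                   double h, int max_iter = 1000,
                                   double prune_tol = 1e-8, double ll_tol = 1e-9) {
    const std::size_t n = x.size();
    ublas::matrix<double> F(n, n);               // F(j,i) = f(x_j; x_i, h)
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            F(j, i) = kernel(x[j], x[i], h);

    std::vector<double> q(n, 1.0 / n);           // start from uniform weights
    double prev_ll = -1e300;
    for (int it = 0; it < max_iter; ++it) {
        std::vector<double> q_new(n, 0.0);
        double ll = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            double denom = 0.0;                  // sum_k q_k f(x_j; x_k, h)
            for (std::size_t k = 0; k < n; ++k) denom += q[k] * F(j, k);
            ll += std::log(denom);               // log likelihood term for x_j
            for (std::size_t i = 0; i < n; ++i)
                q_new[i] += q[i] * F(j, i) / (n * denom);   // fixed-point update
        }
        for (std::size_t i = 0; i < n; ++i)
            q[i] = (q_new[i] < prune_tol) ? 0.0 : q_new[i]; // drop dead exemplars
        if (ll - prev_ll < ll_tol) break;        // log likelihood has converged
        prev_ll = ll;
    }
    return q;
}

Each $x_j$ can then be assigned to the surviving exemplar $i$ that maximizes $q_i f(x_j; x_i, h)$.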
Of course, there are several issues. First, and probably most problematic, is the assumption of identical isotropic variance for all clusters. This could be fixed to some degree, possibly by learning the $h_i$ as well; however, that would not be a convex procedure, and intuitively it would depend heavily on the initial state of $h = (h_i)_{i=1,2,...,n}$.
The next issue is low-density regions: a GMM deals with these by increasing the variance of some nearby cluster center. Here the variance is fixed, so there is a strong incentive to place a cluster in a low-density region with the same confidence as in a high-density region, despite the fact that the support for that cluster is not the same. In a high-dimensional setting this becomes an even worse problem.
Based on reading:
1. D. Lashkari and P. Golland. Convex Clustering with Exemplar-Based Models. Advances in Neural Information Processing Systems, 20:825-832, 2008. http://people.csail.mit.edu/polina/papers/LashkariGolland_NIPS07.pdf
Tuesday, January 12, 2010
Montreal
I had a chance over the weekend to see the Montreal Museum of Fine Arts. I love museums; they are basically a safe haven for thoughtful solitude (I never go in a group, and it can be annoying as a couple as well). Looking at new things, especially beautiful or amazing things, is a great way to spend an afternoon. The MMFA is big, larger than I first thought, housed in two buildings that contain a pretty diverse collection: European 18th/19th-century paintings, contemporary art, decorative arts, and a very, very nice collection of Inuit art.
I think I may have found a new tattoo.