Saturday, January 16, 2010

Montreal

Schwartz's Deli, Montreal:



Good, but not great, sandwich: basically a pastrami on rye with yellow mustard. Not too thrilled with this mustard -- Kraftish. Meat was mostly fresh if a bit dry; I'm used to the juicy chewiness of the New York variety. Pretty authentic NYC deli feel though (although those are a dying breed). Staff was friendly; I ate at the counter: "Anywhea you like, buddy." (Weird Canadian+Jewish? accents.) The pickle was fantastic. I might go back for the pickles. Also no Doc Brown's black cherry, boo.

Track of The Day

Owen Pallett – "The Great Elsewhere"

Montreal



Portrait via Sean Tan's "Story in pictures".

Friday, January 15, 2010

Track of The Day

Sharon Jones and the Dap-Kings, "Make It Good To Me."

Research

So I finally finished converting my matlab code to use LBFGS, and then finally finished converting my matlab code to C++, using boost::ublas. The results are, of course, encouraging: I'm seeing much faster times on much larger datasets. So far a 10,000-example set takes about four minutes to compute an L^0 regularized solution. I haven't optimized this code yet, only used g++ optimization flags. (Thanks Chen for your class on optimizers.) I'll spend today profiling with cachegrind and poring over my code, before getting to the real optimization issue, which is performing the optimization on different data structures.

I know there are faster linear algebra libraries out there than ublas, and I may end up converting this C++ code to use one of them (atlas, for example), but I took a look at the syntax and found that ublas is much easier to work with from a matlab->C++ conversion perspective. Of course, I may do the conversion again just to gain some experience.
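To give a flavor of what I mean by the conversion, here is a hypothetical snippet (not my actual research code): the matlab one-liner g = X'*(X*w - y), the gradient of a least-squares objective, translated to ublas. The names X, w, and y are made up for illustration.

// Hypothetical example: the matlab expression g = X'*(X*w - y), i.e. the
// gradient of a least-squares objective, translated to boost::ublas.
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <boost/numeric/ublas/io.hpp>
#include <iostream>

namespace ublas = boost::numeric::ublas;

// g = X' * (X*w - y)
ublas::vector<double> least_squares_gradient(const ublas::matrix<double>& X,
                                             const ublas::vector<double>& w,
                                             const ublas::vector<double>& y)
{
    ublas::vector<double> r = ublas::prod(X, w) - y;   // residual X*w - y
    return ublas::prod(ublas::trans(X), r);            // X' * r
}

int main()
{
    ublas::matrix<double> X(2, 2);
    X(0, 0) = 1.0; X(0, 1) = 2.0;
    X(1, 0) = 3.0; X(1, 1) = 4.0;

    ublas::vector<double> w(2), y(2);
    w(0) = 0.5;  w(1) = -0.5;
    y(0) = 1.0;  y(1) = 0.0;

    // for this toy data the gradient is (-3, -5)
    std::cout << least_squares_gradient(X, w, y) << std::endl;
    return 0;
}

Not exactly matlab-terse, but close enough that the translation is mechanical.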

When I'm done with this, I really want to try building a matlab->C interpreter, just so I don't have to waste so much time doing this messy first step by hand. I've never built an interpreter before, but I can't think of anything that so many people, myself included, would find as useful. I found a matlab yacc grammar out there, so some part of it is done, though I suspect the grammar is the easiest part.

Thursday, January 14, 2010

Later, Sam

Second RIP post in less than 24 hours, sad; hopefully the last one of these for a while. RIP Sam Roweis, one of my heroes in computer science. Just watch the video below, hot-linked from http://videolectures.net/sam_roweis/, to see what I mean. I never got the chance to take a course from him (being rejected from UofT, damnit), but I remember watching this video and thinking "I want to be you." I hope he's still working on machine learning wherever he is. Most of what I know about Sam Roweis is through his work; honestly I'm only really familiar with the locally linear embedding (LLE) algorithm, which was in Science (which is the serious hotness). I may get around to summarizing that paper at some point, since I've been really interested in manifolds and computational geometry as they relate to learning algorithms as of late.

Reading:
1. Roweis, S. T., and L. K. Saul. "Nonlinear Dimensionality Reduction by Locally Linear Embedding." Science 290.5500 (2000): 2323-2326.



Wednesday, January 13, 2010

Later, Jay

Via Mishka, a video that seemed appropriate for my birthday, celebrated in a new city, country, and school (all by myself). I was all "meh" on Jay Reatard until today; hope you're killing it wherever you are.

Research Reading (Convex Clustering?)

Research odds and ends.

Suppose you wanted to cluster a dataset, which is a fairly common task. You have many, many choices for how to do this, depending on the assumptions you want to make and the amount of computational resources you want to spend. The obvious choice is K-means, and the next obvious choice is soft K-means (a.k.a. Gaussian Mixture Modeling). Then you can get into more esoteric options: kernel K-means, kernel GMM, other types of mixtures, K-medoids, etc., not to mention discriminative or semi-supervised methods. However, very few of these choices will lead to the same results every time, which is problematic from the standpoint of reproducing results.

Now, myself, I'm not really concerned about this issue, as I've read and experienced that local optima are usually not terrible solutions, and the global optimum tends not to be too far from the local one. One issue is simply the time spent choosing initial conditions, as most of these clustering algorithms are iterative gradient-based techniques. (Yes, EM is a gradient-based technique in my book.)

So, to avoid all that, it would be nice to have a convex optimization objective, so that we always get the same result, and so that the optimization is fast: we can use any number of convex optimization algorithms that converge to the solution at (hopefully) superlinear rates.

The following describes a convex clustering procedure, presented by Lashkari and Golland, which, although limited, may have some uses.

First, I begin with a (short) description of kernel density estimation. As usual, by convention we are given vectors $(x_1, x_2, ..., x_n)$, with $x_i \in R^d$, each of which we want to assign to one of $K$ clusters.

Suppose that we used kernel density estimation to estimate the probability of $x_j$ from the vectors we were given, so that

$\hat{P}(x_j) = \sum_i \frac{1}{nh} K(\frac{x_j - x_i}{h})$

Aside: Here $K(.)$ refers to a kernel function rather than the number of clusters $K$. Confusing, I know, but I'm trying to stick with convention.

Anyways, the KDE estimate is a totally reasonable thing to do. In fact, if we let $K(.)$ be the Gaussian kernel $K(\frac{u - v}{h}) = \beta \exp(-\frac{||u - v||_2^2}{h})$, then $h/2$ is the isotropic variance, with $\beta$ the corresponding Gaussian normalization constant ($(\pi h)^{-d/2}$ in $d$ dimensions). The thing to notice about this formulation is that it is a mixture-of-Gaussians distribution with $K = n$ components and mixing proportions $\frac{1}{n}$. If we sum the log of this estimate over all the individual data points, then we have a log likelihood of the data. From a coding theory standpoint, its negative can also be interpreted as an estimate of the number of "nats" needed to encode the given dataset according to the KDE estimate of the distribution.
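As a toy sketch (my own, not from any paper), here is that KDE log likelihood with a Gaussian kernel in ublas. I'm assuming the rows of X are the data points and folding the bandwidth into the normalization constant beta, so each term is a proper Gaussian with variance $h/2$.

// Toy sketch: KDE log likelihood of a dataset under a Gaussian kernel with
// bandwidth h. Each row of X is a data point x_i; beta normalizes the kernel
// so that every term is a Gaussian density with isotropic variance h/2.
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/matrix_proxy.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <cmath>
#include <cstddef>

namespace ublas = boost::numeric::ublas;

double kde_log_likelihood(const ublas::matrix<double>& X, double h)
{
    const std::size_t n = X.size1();
    const std::size_t d = X.size2();
    const double pi = 3.14159265358979323846;
    const double beta = std::pow(pi * h, -0.5 * static_cast<double>(d));

    double ll = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
        double p = 0.0;                                   // \hat{P}(x_j)
        for (std::size_t i = 0; i < n; ++i) {
            ublas::vector<double> diff = ublas::row(X, j) - ublas::row(X, i);
            p += (1.0 / n) * beta * std::exp(-ublas::inner_prod(diff, diff) / h);
        }
        ll += std::log(p);                                // sum of logs over data points
    }
    return ll;   // negate this to get the "nats to encode the data" reading
}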

Of course, in this case there are no unknowns (as $h$ and $\beta$ are scalar constants), and there is no assignment of cluster labels, so there's really no point in taking the log likelihood, right?

Now, one intuition I've seen given a lot regarding KDE is to imagine a Gaussian bump centered on each data point. (Extending this to any exponential-family bump works as well, since here each bump is completely determined by a location parameter (its mean, placed at a data point) and a fixed scale (a measure of variance).) In contrast, the GMM model is that each point is assigned to a Gaussian bump (one responsible for each cluster $k$) whose center may not coincide with any particular data point.

As $h$ is fixed in the above, and each bump is completely parameterized by its scale and location, we can merge the two ideas (KDE and GMM) by looking at the KDE log likelihood objective

$\sum_j \log \sum_i \frac{1}{nh} K(\frac{x_j - x_i}{h})$.

and treating $\frac{1}{nh}$ as a mixing coefficient that is allowed to depend on $x_i$, i.e., $q_i$. Thus the objective to maximize is

$\sum_j \log \sum_i q_i K(\frac{x_j - x_i}{h})$.

Think of $x_i$ as a parameter rather than a data point, and you see that $x_i$ is the location of the distribution specified by $K$. To make this clearer, I'll use the notation $f(\cdot; \theta)$, with $\theta = (x_i, h)$, instead of $K$.

$\sum_j \log \sum_i q_i f(x_j; x_i, h)$.

Now this is a convex optimization problem: each term is the log of a linear function of $q$, so the objective is concave, and we are maximizing it over the simplex $\sum_i q_i = 1$, $q_i \ge 0$.

To find $q_i$, we set to zero the gradient of the log likelihood plus a Lagrange multiplier term enforcing the equality constraint, and find that

$q_i = \frac{1}{n} \sum_j \frac{q_i f(x_j; x_i, h)}{\sum_{i'} q_{i'} f(x_j; x_{i'}, h)}$

which is identical to the update for the mixing coefficients in a GMM. In practice we solve for $q_i$ by treating this as a fixed-point recurrence: substitute the current values of $q$ into the right-hand side, read off new values on the left, and iterate.
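For completeness, here is my reconstruction of the step being skipped. The Lagrangian and its stationarity condition are

$L(q, \lambda) = \sum_j \log \sum_i q_i f(x_j; x_i, h) + \lambda (\sum_i q_i - 1)$

$\frac{\partial L}{\partial q_i} = \sum_j \frac{f(x_j; x_i, h)}{\sum_{i'} q_{i'} f(x_j; x_{i'}, h)} + \lambda = 0$.

Multiplying the stationarity condition by $q_i$ and summing over $i$ gives $n + \lambda = 0$ (using $\sum_i q_i = 1$), so $\lambda = -n$; multiplying it by $q_i / n$ instead gives exactly the fixed-point form above.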

Any $q_i$ close to 0 implies that the $i$th cluster is not effective in "reconstructing" the dataset, and convexity implies that we can safely eliminate this cluster from consideration when it gets small enough. Thus we continue until the log likelihood has converged, giving a convex clustering procedure in which $K$, the number of clusters, is also learned.
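And here is a rough sketch of the whole loop (my own toy version, not the authors' code), assuming a precomputed kernel matrix F with F(j, i) = f(x_j; x_i, h), for instance from the Gaussian kernel above: start $q$ uniform, iterate the fixed-point update, prune exemplars whose weight effectively vanishes, and stop when the log likelihood converges.

// Toy sketch of the exemplar-based convex clustering iteration. F(j, i) is
// the precomputed kernel value f(x_j; x_i, h); the returned q has nonzero
// entries only at the surviving exemplars, so K is learned as a byproduct.
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <cmath>
#include <cstddef>

namespace ublas = boost::numeric::ublas;

ublas::vector<double> convex_cluster(const ublas::matrix<double>& F,
                                     double prune_tol = 1e-8,
                                     double converge_tol = 1e-6,
                                     std::size_t max_iter = 1000)
{
    const std::size_t n = F.size1();
    ublas::vector<double> q = ublas::scalar_vector<double>(n, 1.0 / n);
    double prev_ll = -1e300;

    for (std::size_t it = 0; it < max_iter; ++it) {
        ublas::vector<double> q_new = ublas::zero_vector<double>(n);
        double ll = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            double denom = 0.0;                        // sum_i q_i f(x_j; x_i, h)
            for (std::size_t i = 0; i < n; ++i) denom += q(i) * F(j, i);
            ll += std::log(denom);
            for (std::size_t i = 0; i < n; ++i)        // fixed-point update of q
                q_new(i) += q(i) * F(j, i) / (n * denom);
        }

        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i) {          // prune vanished exemplars
            if (q_new(i) < prune_tol) q_new(i) = 0.0;
            total += q_new(i);
        }
        q = q_new / total;                             // renormalize onto the simplex

        if (ll - prev_ll < converge_tol) break;        // log likelihood converged
        prev_ll = ll;
    }
    return q;
}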

Of course, there are several issues. First, and probably most problematic, is the assumption of identical isotropic variance for all clusters. This can be fixed to some degree, possibly by learning $h_i$ as well; however, this will no longer be a convex procedure, and intuitively it will depend heavily on the initial state of $h = (h_i)_{i=1,2,...,n}$.

The next issue is low-density regions: a GMM deals with them by increasing the variance of some nearby cluster center. Here the variance is fixed, so there will be a strong incentive to place a cluster on a low-density region with the same confidence as in a high-density region, despite the fact that the support for this cluster is not the same. In a high-dimensional setting, this will be an even worse problem.

Based on reading:

1. D. Lashkari and P. Golland. Convex Clustering with Exemplar-Based Models. Advances in Neural Information Processing Systems, 20:825-832, 2008. http://people.csail.mit.edu/polina/papers/LashkariGolland_NIPS07.pdf

Lust

It is definitely possible to feel lust for an inanimate object. Here are two of the things that I have an unhealthy obsession with. Totally unrequited. Reality will totally fail to meet expectations. I will be bored after a month and a half and look for a way out. I'll never feel more alive than anticipating the moment. Yes. Yes. All that. But still, an incredible overwhelming desire. For....

Sodastream.



Oh how I want you, free sodawater anytime I want it, I will carbonate with you until you run out of gas. You'll never be so happy as with me. I don't care what anyone says, my teeth will forgive me.

The other objet d'lust is

The Brooks Leather Barbican bag.





Hell yes, my brown beauty, you are the perfect match for me, composed enough for Per Se, tomboy enough for a midnight ride on the streets of Montreal. We'll study together during the day, and then we'll party all night. Please be mine.

Yours, truly.

Tuesday, January 12, 2010

Montreal


I had a chance over the weekend to see the Montreal Museum of Fine Arts. I love museums; they are basically a safe haven for thoughtful solitude (I never go in a group, and it can be annoying as a couple as well). Looking at new things, especially beautiful or amazing things, is a great way to spend an afternoon. The MMFA is big, larger than I first thought, housed in two buildings which contain a pretty diverse collection: European 18th/19th-century paintings, contemporary art, decorative arts, and a very, very nice collection of Inuit art.



I think I may have found a new tattoo.

In any case, I have a particular fondness for the Inuit, partly because of my second grade teacher Mr. Euster, who lived in Alaska and had his Inuit friends stop by and tell Inuit stories about whale hunting, carve rocks and do other things second graders generally like. I wish I could remember their names.

In any case, I'd recommend the museum; it's not as big as some I've been to, but it has all the right touches.

Well, at least you'll get to see this:


Monday, January 11, 2010

Homesickness


I have been homesick for the last two years. This blog post, via Cara Heller via Facebook via http://niemann.blogs.nytimes.com/2009/02/02/i-lego-ny/, captures it nicely, in Lego form