Research odds and ends.
Suppose you wanted to cluster a dataset, which is a fairly common task. You have many, many choices for how to do this, depending on the assumptions you want to make and the amount of computation you want to spend. The obvious choice is K-means, and the next obvious choice is soft K-means (a.k.a. Gaussian mixture modeling). Then you can get into more esoteric options: kernel K-means, kernel GMM, other types of mixtures, K-medoids, etc., not to mention discriminative or semi-supervised methods. However, very few of these choices will lead to the same results every time, which is problematic from the standpoint of reproducing results.
Now, I'm not really concerned about this issue myself, as I've read and experienced that local optima are usually not terrible solutions, and the global optimum tends not to be too far from a local one. The real cost is the time spent choosing initial conditions, since most of these clustering algorithms are iterative, gradient-based techniques. (Yes, EM is a gradient-based technique in my book.)
To avoid that, it would be nice to have a convex objective, so that we always get the same result; the optimization should also be fast, since we can use any of the many convex optimization algorithms that converge to the solution at (hopefully) super-linear rates.
The following describes a convex clustering procedure, presented by Lashkari and Golland, which, although limited, may have some uses.
First, a short description of kernel density estimation (KDE). As usual, by convention we are given vectors $(x_1, x_2, ..., x_n)$, with $x_i \in R^d$, each of which we wish to assign to one of $K$ clusters.
Suppose we used kernel density estimation to estimate the probability density at $x_j$ given the dataset, i.e.,
$\hat{P}(x_j) = \sum_i \frac{1}{nh} K(\frac{x_j - x_i}{h})$
Aside: here $K(\cdot)$ refers to a kernel function rather than the number of clusters $K$. Confusing, I know, but I'm trying to stick with convention.
Anyway, the KDE estimate is a totally reasonable thing to do. In fact, if we let $K(\cdot)$ be the Gaussian kernel, $K(\frac{u - v}{h}) = \beta \exp(-\frac{||u - v||_2^2}{h})$, then $h/2$ is the isotropic variance, with $\beta$ being the Gaussian normalization constant ($(2\pi\sigma^2)^{-d/2}$ for variance $\sigma^2 = h/2$). The thing to notice about this formulation is that it is a mixture-of-Gaussians distribution with $K = n$ components and mixing proportions $\frac{1}{n}$. If we sum the log of this estimate over all the individual data points, then we have a log likelihood of the data. This can also be interpreted, from a coding theory standpoint, as an estimate of the number of "nats" needed to encode the given dataset according to the KDE estimate of the distribution.
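To make this concrete, here is a minimal numpy sketch of that log likelihood, treating the KDE as a uniform mixture of $n$ isotropic Gaussians with variance $h/2$ (the $\frac{1}{n}$ mixing weight and the normalization $\beta$ are folded in directly); the function name and toy usage are my own choices for illustration.

```python
import numpy as np

def kde_log_likelihood(X, h):
    """Log likelihood of the data under its own Gaussian KDE.

    X is an (n, d) array of data points; h is the bandwidth (twice the variance).
    """
    n, d = X.shape
    # Pairwise squared Euclidean distances, shape (n, n).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian bump of variance h/2 centered on each data point.
    beta = (np.pi * h) ** (-d / 2.0)
    bumps = beta * np.exp(-sq_dists / h)
    # Mixture of n Gaussians with uniform mixing proportions 1/n; log, then sum over points.
    return np.log(bumps.mean(axis=1)).sum()

# Toy usage: 200 two-dimensional points.
X = np.random.randn(200, 2)
print(kde_log_likelihood(X, h=0.5))
```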
Of course, in this case there are no unknowns (as $h$ and $\beta$ are scalar constants), and there is no assignment of cluster labels, so there's really no point in taking the log likelihood, right?
Now, one intuition I've seen given a lot regarding KDE is to imagine a Gaussian bump centered on each data point. (Extending this to bumps from other exponential family distributions is straightforward: each bump is centered at a data point, with a location parameter and a fixed scale, a measure of spread, shared by all bumps.) In contrast, the GMM model is that each point is assigned to a Gaussian bump (responsible for a particular cluster $k$) that may not coincide with any particular data point.
Since $h$ is fixed in the above, and each bump is completely specified by its location and scale, we can merge the two ideas (KDE and GMM) by starting from the KDE log-likelihood objective
$\sum_j \log \sum_i \frac{1}{nh} K(\frac{x_j - x_i}{h})$.
and replacing the constant $\frac{1}{nh}$ with a mixing coefficient $q_i$ that is allowed to depend on the component $i$. Thus the objective to maximize is
$\sum_j \log \sum_i q_i K(\frac{x_j - x_i}{h})$.
Think of $x_i$ as a parameter rather than a data point, and you see that $x_i$ is the location of the distribution specified by $K$. To make this clearer, I'll use the notation $f(\cdot\,;\theta)$ instead of $K$, with $\theta = (x_i, h)$:
$\sum_j \log \sum_i q_i f(x_j; x_i, h)$.
Now this is a convex optimization problem: for fixed data and bandwidth, each $f(x_j; x_i, h)$ is a constant, so each term $\log \sum_i q_i f(x_j; x_i, h)$ is the log of a linear function of $q$ and hence concave. Maximizing the sum of these concave terms subject to the simplex constraints ($q_i \ge 0$, $\sum_i q_i = 1$) is therefore a convex problem.
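As a sanity check on the objective, here is a small sketch (the names are hypothetical) that builds the matrix $F_{ji} = f(x_j; x_i, h)$ once and then evaluates $\sum_j \log \sum_i q_i F_{ji}$ for any $q$ on the simplex; since $F$ is constant, the objective is concave in $q$.

```python
import numpy as np

def exemplar_matrix(X, h):
    """F[j, i] = f(x_j; x_i, h): isotropic Gaussian of variance h/2 centered on x_i."""
    n, d = X.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return (np.pi * h) ** (-d / 2.0) * np.exp(-sq_dists / h)

def log_likelihood(q, F):
    """Objective sum_j log sum_i q_i * F[j, i]; concave in q for fixed F."""
    return np.sum(np.log(F @ q))
```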
To find $q_i$, we set the gradient of the likelihood to zero, with a Lagrange multiplier enforcing the equality constraint. Multiplying the stationarity condition by $q_i$ and summing over $i$ shows that the multiplier equals $-n$, which gives the fixed-point condition
$q_i = \frac{1}{n} \sum_j \frac{q_i f(x_j; x_i, h)}{\sum_{i'} q_{i'} f(x_j; x_{i'}, h)}$
which is identical to the update for the mixing coefficients in a GMM (with the component parameters held fixed). In practice, we solve for $q$ by iterating this recurrence, substituting the current estimate of $q$ into the right-hand side, which is a stable way to reach the fixed point.
Any $q_i$ close to 0 implies that the $i$th cluster is not effective in "reconstructing" the dataset, and convexity implies that we can safely eliminate this cluster from consideration once it gets small enough. Thus we iterate until the log likelihood has converged, giving a convex clustering procedure in which $K$, the number of clusters, is also learned.
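Putting the pieces together, here is my sketch of the whole procedure under the assumptions above (Gaussian $f$, with a pruning threshold and convergence tolerance of my own choosing); it is a reading of the algorithm rather than the authors' reference implementation.

```python
import numpy as np

def convex_cluster(X, h, tol=1e-8, prune=1e-6, max_iter=1000):
    """Exemplar-based convex clustering sketch: every data point starts as a candidate
    cluster center, and the mixing weights q are updated by the fixed-point iteration."""
    n, d = X.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    F = (np.pi * h) ** (-d / 2.0) * np.exp(-sq_dists / h)  # F[j, i] = f(x_j; x_i, h)

    q = np.full(n, 1.0 / n)  # start with uniform weight on all n exemplars
    prev_ll = -np.inf
    for _ in range(max_iter):
        # Responsibilities r[j, i] = q_i F[j, i] / sum_{i'} q_{i'} F[j, i'].
        weighted = F * q
        r = weighted / weighted.sum(axis=1, keepdims=True)
        q = r.mean(axis=0)        # fixed point: q_i = (1/n) sum_j r[j, i]
        q[q < prune] = 0.0        # eliminate exemplars that explain almost nothing
        q /= q.sum()
        ll = np.sum(np.log(F @ q))
        if ll - prev_ll < tol:    # stop once the log likelihood has converged
            break
        prev_ll = ll

    labels = np.argmax(F * q, axis=1)  # hard assignment to the most responsible exemplar
    return q, labels
```

The number of surviving nonzero entries of $q$ then plays the role of the learned $K$.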
Of course, there are several issues. First, and probably most problematic, is the assumption of identical isotropic variance for all clusters. This can be fixed to some degree, possibly by learning a per-cluster bandwidth $h_i$ as well; however, that will no longer be a convex procedure, and intuitively it will depend heavily on the initial state of $h = (h_i)_{i=1,2,...,n}$.
The next issue concerns areas of low density: a GMM deals with these by increasing the variance of some nearby cluster center. Here the variance is fixed, so there is a strong incentive to place a cluster on a low-density region with the same confidence as in a high-density region, despite the fact that the support for this cluster is not the same. In a high-dimensional setting, this is an even worse issue.
Based on reading:
1. D. Lashkari and P. Golland. Convex Clustering with Exemplar-Based Models. Advances in Neural Information Processing Systems, 20:825-832, 2008. http://people.csail.mit.edu/polina/papers/LashkariGolland_NIPS07.pdf