So I had been avoiding installing OpenMPI on the Rochester system because I figured it would be difficult at best and impossible at worst. But eventually I gave up on getting Hadoop to play nice and found that it is remarkably easy to set up OpenMPI on a given cluster. If you have a cluster with a common NFS mount, all you have to do is download, unpack, and point your path at wherever you installed it. Let me blog this in case you're wondering how to go about it.
First, I downloaded the tarball from here and unpacked it into ~/openmpi. Then in that directory,
$ ./configure --prefix=/u/rim/opt/openmpi
$ make all install
That's pretty much it, other than setting
PATH=$PATH:/u/rim/opt/openmpi/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/u/rim/opt/openmpi/lib
in my .bashrc file. YMMV depending on your favorite shell.
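If the paths took, a quick sanity check never hurts (both binaries ship with Open MPI, so once your PATH is right they should resolve to your install):

$ which mpic++
$ ompi_info | head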
Next I had to write my MPI program, which I had actually already done, but here it is for completeness, or at least its main function. It generates a kernel matrix, which is relatively easy to parallelize: I just hand each node a range of row indices of the test matrix and politely ask it to kernelize that part of the final matrix.
When they're done, they send their blocks back to the master node, which copies those rows into the big matrix. The only tricky part with uBLAS is making sure the MPI send and receive copy the right parts of the matrix. Here, since we're copying rows of the final kernel matrix, we want everything to be row major.
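To make the split concrete, here's a toy standalone version of the partitioning arithmetic used in the program below (the numbers are made up for illustration: 10 test rows across 3 ranks):

#include <cmath>
#include <cstdio>

int main(){
    int N = 10, size = 3;                               // 10 test rows, 3 MPI ranks
    int offset = (int)floor((double)N / (double)size);  // 3 rows per rank
    for(int rank = 0; rank < size; rank++){
        int start = rank*offset;
        int end = (rank+1)*offset;
        end = end > N ? N : end;
        if(rank + 1 == size) end = N;                   // last rank absorbs the remainder
        printf("rank %d gets rows [%d, %d)\n", rank, start, end);
    }
    return 0;
}

That prints rank 0 gets rows [0, 3), rank 1 gets [3, 6), and rank 2 gets [6, 10); the last rank picks up whatever doesn't divide evenly.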
int main(int argc, char *argv[]){
    matrix<double> Xtrain;
    matrix<double> Ytrain;
    matrix<double> Xtest;
    matrix<double> Ytest;
    MPI_Status status;
    int size = 1;
    int rank = 0;

    MPI_Init(&argc, &argv);
    size = MPI::COMM_WORLD.Get_size();
    rank = MPI::COMM_WORLD.Get_rank();

    // Grab the processor name; it's only used for progress printouts below.
    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    memset(name, 0, MPI_MAX_PROCESSOR_NAME);
    MPI::Get_processor_name(name, len);
    memset(name+len, 0, MPI_MAX_PROCESSOR_NAME-len);

    if(!load_data(argv[1], Xtrain, Ytrain)){
        cout << "klr_test (error): Bad train filename" << endl;
        return 0;
    }
    if(!load_data(argv[2], Xtest, Ytest)){
        cout << "klr_test (error): Bad test filename" << endl;
        return 0;
    }

    kernelType kernel = RBF;
    if(string(argv[3]) == "CHI2"){ kernel = CHI2; }
    if(string(argv[3]) == "POLY"){ kernel = POLY; }
    if(string(argv[3]) == "LINEAR"){ kernel = LINEAR; }
    if(string(argv[3]) == "EUCLIDEAN"){ kernel = EUCLIDEAN; }
    double param1 = atof(argv[4]);

    int N = Xtest.size1();
    int D = Xtrain.size1();

    // Split the final matrix into row blocks, one block per rank;
    // the last rank absorbs any leftover rows.
    int offset = floor((double)N / (double)size);
    int start = rank*offset;
    int end = (rank+1)*offset;
    end = end > N ? N : end;
    int M = end - start;
    if(rank + 1 == size){
        end = N;
        M = end - start;
    }

    // Rank 0 holds the full N x D matrix; workers only need their own
    // M x D slice. Row-major storage keeps each block contiguous in
    // memory, which the MPI calls below rely on.
    ublas::matrix<double, ublas::row_major> K(N, D);
    if(rank == 0){
        K.assign(ublas::zero_matrix<DBL_TYPE>(N, D));
    }else{
        K.resize(M, D, false);
        K.assign(ublas::zero_matrix<DBL_TYPE>(M, D));
    }

    // Kernelize this rank's assigned test rows against every training row.
    for(int k = start; k < end; k++){
        matrix_row< matrix<double> > u1(Xtest, k);
        int j = k - start;
        for(int i = 0; i < D; i++){
            matrix_row< matrix<double> > u2(Xtrain, i);
            K(j, i) = kernelize(u1, u2, kernel, param1);
        }
        if(VERBOSE && (k % 100 == 0))
            std::cout << k << ": " << rank << " " << name << std::endl;
    }

    // Workers ship their blocks back; rank 0 drops each block into the
    // right row offset of the full matrix.
    if(size > 1){
        if(rank == 0){
            for(int i = 1; i < size; i++){
                start = i*offset;
                end = (i+1)*offset;
                end = end > N ? N : end;
                M = end - start;
                if(i+1 == size){
                    end = N;
                    M = end - start;
                }
                MPI_Recv((void *)&K.data()[start*D], M*D, MPI_DOUBLE, i,
                         TAG_MATRIX_PARTITION, MPI_COMM_WORLD, &status);
            }
        }else{
            MPI_Send((void *)&K.data()[0], M*D, MPI_DOUBLE, 0,
                     TAG_MATRIX_PARTITION, MPI_COMM_WORLD);
        }
    }

    // Rank 0 writes the assembled kernel matrix out as CSV.
    if(rank == 0){
        ofstream output(argv[5], ios::out);
        for(unsigned int i = 0; i < K.size1(); i++){
            for(unsigned int j = 0; j < K.size2(); j++){
                output << K(i, j);
                if(j < (K.size2()-1)){ output << ","; }
            }
            output << std::endl;
        }
    }

    MPI_Finalize();
    return 0;
}
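As an aside, the hand-rolled send/receive loop could probably be collapsed into a single collective call. This is just a sketch I haven't run, not what's in the program above, but something along these lines should work (same variables as in main, plus #include <vector> and <algorithm>), again because row-major storage keeps each rank's block contiguous:

// Sketch only: gather every rank's row block with one MPI_Gatherv.
// counts/displs are in units of doubles (rows * D).
std::vector<int> counts(size), displs(size);
for(int i = 0; i < size; i++){
    int s = i*offset;
    int e = (i+1 == size) ? N : std::min((i+1)*offset, N);
    counts[i] = (e - s)*D;
    displs[i] = s*D;
}
// Rank 0 already computed its rows in place at the top of K,
// so it passes MPI_IN_PLACE instead of a send buffer.
MPI_Gatherv(rank == 0 ? MPI_IN_PLACE : (void *)&K.data()[0], M*D, MPI_DOUBLE,
            &K.data()[0], &counts[0], &displs[0], MPI_DOUBLE,
            0, MPI_COMM_WORLD);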
Now to make and run, just edit your Makefile to do something like
MPICC=mpic++
...
$(MPICC) kernelize_mat2.cpp -o kernelize_mat2 $(LFLAGS) $(CFLAGS) $(INC) $(OLIBS)
where everything to the right of -o kernelize_mat2 is just settings and flags I otherwise needed for this project. The only real change from my ordinary build was swapping in mpic++, the MPI wrapper around the g++ compiler.
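For reference, a minimal standalone Makefile along these lines would do it; the include path is made up here (point it at wherever your Boost uBLAS headers live), so treat this as a template rather than gospel:

MPICC  = mpic++
CFLAGS = -O2
INC    = -I/path/to/boost

kernelize_mat2: kernelize_mat2.cpp
	$(MPICC) kernelize_mat2.cpp -o kernelize_mat2 $(LFLAGS) $(CFLAGS) $(INC) $(OLIBS)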
After building, you can run. On the cluster, they use Torque for scheduling, so you can either submit a batch script or start an interactive session. I generally prefer the batch script since it's less wasteful (there's an example at the end of this post), but here I used the interactive startup because I wanted to make sure it was all kosher.
$ qsub -I -lnodes=5,walltime=3600,mem=1000M
I like to use funny names for my PBS jobs, but I'll skip that here. That command requests 5 nodes, an hour of walltime, and a gig of memory, which is about what I wanted.
$ mpirun -np 5 --hostfile $PBS_NODEFILE /path/to/kernelize_mat2 train.csv test.csv RBF .7 kernel.csv
-np says how many processes to launch, and --hostfile $PBS_NODEFILE says which hosts to run them on. You can, of course, use fewer, but that's a little silly, no? You can also run on specific machines instead of the whole hostfile; that is, you can specify something like
-np 4 --host node33

In any case, at each step I was expecting the thing to say "root access required" or some such, but I guess that was silly of me. Anyway, the whole thing worked like a charm and took roughly 20 minutes to set up and get running (almost all of that time was spent compiling).
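And for the record, since I said I prefer batch scripts: the non-interactive equivalent is a short PBS script, something like the following (the filename is made up, and I ran mine interactively, so treat this as a sketch):

#!/bin/bash
#PBS -l nodes=5,walltime=3600,mem=1000M
cd $PBS_O_WORKDIR
mpirun -np 5 --hostfile $PBS_NODEFILE /path/to/kernelize_mat2 train.csv test.csv RBF .7 kernel.csv

submitted with qsub kernelize.pbs.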
That's pretty much it, so now you too can write cool apps for the cluster, and if you're using the one I use, then by all means set your paths to mine (although names were changed to protect the innocent). By the way, I've kept the comments in the code sparse; it's mostly straight-line number crunching, and I'm not sure where more would go.