So, ok, I've put every trick I know into the mix and my code seems about as optimized as it's going to get, but I still have issues. The main problem is one matrix multiplication that I only have to do once (not per iteration), but it is very expensive. Before I get rid of it completely, I'd really like to know how well it would work if I parallelized it.
However, doing this wasn't as easy as it sounds. I have access to a 64-node cluster of quad-Xeons with 5 gigs each (I think, who knows). My matrix multiplication is $\mathbf{X}^T\mathbf{X}$, where $\mathbf{X}$ is an 810,000 by 275 double-precision matrix assumed to be of full rank.
I could conceivably do this on a single machine (quad CPU), but it seems pretty ripe to just parcel out rows of the matrix, compute the outer products on each node, and then add all of those matrices up. The main issue is memory and IO rather than computation, so this seems better to do distributed than multi-CPU on one box.
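(For the record, the reason parceling out rows works: chop $\mathbf{X}$ into row blocks $\mathbf{X}_1, \dots, \mathbf{X}_k$ and then $\mathbf{X}^T\mathbf{X} = \sum_{j=1}^{k} \mathbf{X}_j^T\mathbf{X}_j$, so each node only needs its own chunk of rows plus a 275 by 275 accumulator, and the final sum is exact. The full matrix is about $810{,}000 \times 275 \times 8$ bytes of doubles, roughly 1.8 GB, while each partial $\mathbf{X}_j^T\mathbf{X}_j$ is only about 0.6 MB.)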
Of course nothing is easy with limited resources. The cluster I'm working on doesn't seem to have MPI or MapReduce frameworks available. It uses Torque to manage jobs, which means I had to read a lot of man pages, write a bunch of shell scripts, and use the NFS mount for communication. So here's hoping if you're reading this, you can skip all that.
(I hate Perl, but in retrospect, I probably should've done this in Perl. I don't even hate Perl, I just hate Perl programmers. They tend to write code, well, cryptically, and seem proud of it. I hate the game, not the player. Jeff is nice.)
Anyways, I did shell. Since I didn't find good versions of this online, I'm putting this out there for all to make fun of.
First, I had two C programs, which do the obvious computations:
$ xtx_map file output
$ xtx_reduce dir output
xtx_map takes a matrix X, stored in file, and writes X'X to output; xtx_reduce takes a directory, sums up all the matrices stored as files there, and writes the total to output.
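The C isn't worth posting in full, and honestly there's nothing clever in it. A rough sketch of xtx_map (a sketch only, assuming the chunks are plain whitespace-separated doubles and hard-coding the 275 columns) looks something like this:

#include <stdio.h>
#include <stdlib.h>

#define P 275  /* number of columns in X, hard-coded for this problem */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s infile outfile\n", argv[0]);
        return 1;
    }
    FILE *in = fopen(argv[1], "r");
    FILE *out = fopen(argv[2], "w");
    if (!in || !out) {
        perror("fopen");
        return 1;
    }

    double row[P];
    double *xtx = calloc((size_t)P * P, sizeof(double));
    if (!xtx) {
        perror("calloc");
        return 1;
    }

    /* X'X is just the sum of the outer products of the rows, so read one
       row at a time and accumulate; memory use stays at O(P*P). */
    for (;;) {
        int got = 0;
        for (int j = 0; j < P; j++)
            got += (fscanf(in, "%lf", &row[j]) == 1);
        if (got < P)
            break;  /* out of rows */
        for (int j = 0; j < P; j++)
            for (int k = 0; k < P; k++)
                xtx[j * P + k] += row[j] * row[k];
    }

    /* Write the P-by-P partial result as plain text, one row per line. */
    for (int j = 0; j < P; j++) {
        for (int k = 0; k < P; k++)
            fprintf(out, "%.17g ", xtx[j * P + k]);
        fprintf(out, "\n");
    }

    fclose(in);
    fclose(out);
    free(xtx);
    return 0;
}

Writing the partial results out as plain text like this is wasteful, but it keeps the reduce step and any debugging by eyeball dead simple.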
Now I had to get the data ready and wire everything together, which took 3 shell scripts: a basic wrapper that drives all the calls, a map script that calls xtx_map, and a reduce script that calls xtx_reduce. Here's the wrapper.
#!/bin/sh
split=`which split`
$split -l 10000 $SPLIT /u/dave/input/X_   # $SPLIT holds the path to the big X file
scoreboard=/u/dave/scoreboard
echo 0 > $scoreboard

# Submit one map job per split file, throttled by the scoreboard count.
for f in `ls /u/dave/input/X_*`; do
    used=`cat $scoreboard`
    f=`basename $f`
    while [ "$used" -gt 30 ]; do
        used=`cat $scoreboard`
        echo "waiting for node to open up..."
        sleep 1
    done
    qsub -v FILENAME=$f /u/dave/scripts/map.sh
    used=`expr $used + 1`
    echo $used > $scoreboard
done

# Collect the numeric ids of all my queued jobs, joined with ':'.
jobs=""
for job in `qstat -u dave | awk '{print $1}' | sed 's/\.node.*//' | grep '[0-9]'`; do
    nan="$(echo $job | sed 's/[0-9]//g')"
    if [ -z "$nan" ]; then
        if [ -z "$jobs" ]; then
            jobs=$job
        else
            jobs=$jobs:$job
        fi
    fi
done

# Don't run the reduce until every map job has finished successfully.
qsub -W depend=afterok:$jobs /u/dave/scripts/reduce.sh
So lines 2 & 3 call the unix tool split on the file I want to break up, using 10,000 lines per submatrix and saving the pieces under /u/dave/input/ with the prefix X_, so they come out named X_aa, X_ab, etc. (If you don't use split, it's awesome, and also you probably never used a 9600 baud modem. GTS.) "which" is also a nifty utility to use when people put common tools in weird places. Yeah, you heard me, MacOSX.
Now one issue is that with 810,000 lines there are roughly 80 files output, and I'm really only allowed to use 32 of the nodes. So I had to come up with a way to schedule the jobs into the cluster. I just used the hacky tmp-file way to do it, and this actually did not work so well: there was some non-deterministic accounting, since the reads and writes of the counter from different jobs aren't atomic. Probably a recursive call would work better here, but shell recursion does not seem safe. In any case, no harm no foul: lines 4 & 5 set up the tmp file, called scoreboard, and then we walk over all the split files in an outer loop.
Inside the for loop, I read the scoreboard and wait until fewer than 31 jobs are out. I kept it at a round number because I'm really lazy. Ok, so then it submits the job with qsub and lets the PBS manager schedule it. Now qsub doesn't take script arguments, but it does pass environment variables, so this hacky
-v FILENAME=$f
is the only way I found to do this.
Now the reduce step has to wait for all the jobs to finish, so the idea here is to get a list of job ids and pass them to qsub along with reduce.sh so that it waits until all the map jobs are done. I could also have polled or something, which would be safer, but this seemed OK to me. I could even have had the reduce step running concurrently, but it isn't worth it, since that step is significantly faster than the map steps: it only has to add up 275 by 275 matrices.
Finally, after see-sawing between Google and shell scripting, I got the right switches and stuff for making the job wait, which is
qsub -W depend=afterok:$jobs /u/dave/scripts/reduce.sh
The map and reduce shell scripts are pretty mundane.
Map here:
#!/bin/sh
scoreboard=/u/dave/scoreboard
infile=/u/dave/input/$FILENAME
outfile=/u/dave/output/$FILENAME

# Compute X'X for this chunk.
/u/dave/bin/xtx_map $infile $outfile

# Give the slot back: decrement the scoreboard (not atomic, hence the flaky accounting).
used=`cat $scoreboard`
used=`expr $used - 1`
echo $used > $scoreboard
Reduce there:
#!/bin/sh
#PBS -m bea
indir=/u/dave/output
/u/dave/bin/xtx_reduce $indir /u/dave/XTX
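For completeness, xtx_reduce is just matrix addition over everything sitting in a directory. Same disclaimer as before: this is a sketch with the plain-text file format assumed, not the real program.

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

#define P 275

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s indir outfile\n", argv[0]);
        return 1;
    }
    double *sum = calloc((size_t)P * P, sizeof(double));
    if (!sum) {
        perror("calloc");
        return 1;
    }

    DIR *dir = opendir(argv[1]);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    /* Every file in the directory is assumed to hold one P-by-P matrix
       written by xtx_map; add them all into the running sum. */
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;  /* skip ".", "..", and dotfiles */
        char path[4096];
        snprintf(path, sizeof path, "%s/%s", argv[1], ent->d_name);
        FILE *in = fopen(path, "r");
        if (!in) {
            perror(path);
            continue;
        }
        for (int i = 0; i < P * P; i++) {
            double v;
            if (fscanf(in, "%lf", &v) == 1)
                sum[i] += v;
        }
        fclose(in);
    }
    closedir(dir);

    /* Write the summed P-by-P matrix, one row per line. */
    FILE *out = fopen(argv[2], "w");
    if (!out) {
        perror("fopen");
        return 1;
    }
    for (int j = 0; j < P; j++) {
        for (int k = 0; k < P; k++)
            fprintf(out, "%.17g ", sum[j * P + k]);
        fprintf(out, "\n");
    }
    fclose(out);
    free(sum);
    return 0;
}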
I use the names map and reduce because it makes the parallelization easy to think about, and I really hope I get some infrastructure I can use MapReduce with soon, because shell scripting sucks. Perhaps I'm just very bad at it, but I find it annoying and hacky. Well, the good thing is that this code should work as a basis for similar problems in general. Hope it finds someone it can help.