Hi, Dave, if you're reading this, then you totally forgot how to do this. Lemme help you out.
Switching from boost::ublas to the blas/cblas interface made things much, much faster. So here's what I did to convert boost::ublas code to cblas in less than a few hours. There may be easier or better ways, but this worked for me and was a nice, gentle way to learn how to use the blas functions. Keep in mind that I don't have root access to these machines, so this works even if you only have user access.
First, I installed the cblas interface. I'm assuming you already have liblapack (or an equivalent) installed somewhere, as well as gfortran.
Then unpack, and move into the CBLAS directory. Copy Makefile.<ARCH> to Makefile.in; here I used Makefile.LINUX. The only lines I changed were:
BLLIB = $(ACML_LINK) -L/opt/pgi/linux86-64/6.2/libso -lacml -lacml_mv -lpgftnrtl
CBDIR = $(HOME)/CBLAS
CBLIBDIR = $(CBDIR)/lib/$(PLAT)
CBLIB = $(CBLIBDIR)/cblas_$(PLAT).a
...
FC = gfortran
This adds AMD's math libraries, acml and acml_mv, plus pgftnrtl, a Fortran runtime library, all of which live in that directory. Also, I didn't have g77 installed, so I used its successor, gfortran.
After this, just run
$ make alllib

And you should get a nice static library out called cblas_LINUX.a. It bundles all the various blas level 1/2/3 interface objects into one archive; you can make your executables smaller by linking individual objects instead.
Now, I just had to add
extern "C" {
#include "cblas.h"
}

to my header files, and I can now use the cblas functions.
To make my life easier, I ended up writing a bunch of wrappers for the cblas functions; here is an example.
void cblas_prod(ublas::matrix<double, ublas::row_major, ublas::unbounded_array<double> >& C,
                const ublas::matrix<double, ublas::row_major, ublas::unbounded_array<double> >& A,
                const ublas::matrix<double, ublas::row_major, ublas::unbounded_array<double> >& B)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                C.size1(), C.size2(), A.size2(),
                1.0, &A.data()[0], A.size2(),
                &B.data()[0], B.size2(),
                1.0, &C.data()[0], C.size2());
}

As you can see, the way the inner data array is arranged makes all the difference. In general, a ublas matrix or vector's underlying array can be accessed through its .data() member function. Since these matrices are row major, each row contains size2() entries, so you pass the column count (A.size2()) as the leading-dimension argument (lda, ldb, ldc) in the cblas API. Conversely, if A were column major, you'd pass A.size1() as its leading dimension.
By adding one of these for every combination of orderings, (rowMajor, rowMajor, colMajor), (rowMajor, colMajor, rowMajor), and so on, you can avoid the headache of caring about layout when you just want to multiply two matrices. On the other hand, be aware that dgemm can do a bit more than straight multiplication: it computes C = alpha*A*B + beta*C. As written (with beta = 1.0), the wrapper above actually assigns C = A*B + C, so zero out the C matrix before calling it, unless that accumulation is the behavior you want; passing 0.0 as the beta argument gives plain C = A*B.
Now linking and getting the right includes are both pretty easy. Just set your makefile to do something like
$ g++ /path/to/cblas_LINUX.a myfile.cpp -o myfile.exe \
    -L. -L/opt/pgi/linux86-64/6.2/libso -lacml -lacml_mv -lpgftnrtl \
    -O3 -fexpensive-optimizations -funroll-loops -DNDEBUG \
    -I/where/you/unpacked/CBLAS/src -I.

And now you can replace any slow ublas things with fast cblas things, without too much change.
Sincerely, me.