Installing a Scientific Python Stack by Hand


Prelude

This is a living document as things usually change across software versions.

A lot of workplaces suck in the sense that they don't want to give their developers full control over their computers. Such workplaces also usually choose a stable Linux distribution, like Ubuntu 12.04. The rationales are as diverse as the C++ style guides under the sky, but the fact is you're stuck with missing or highly outdated tools.

This document will go through all the steps necessary to build a scientific computing environment in Python3, with an OpenBLAS (previously GotoBLAS2) or MKL (Intel's BLAS) backed NumPy and SciPy, matplotlib, OpenCV, and IPython with its notebook. Notice that I said build an environment, and by that I mean a virtualenv.

Another reason to follow this guide is if your distro doesn't back NumPy/SciPy with a good BLAS implementation but you need high performance. (If you need very high performance but stay dynamic, I recommend going with Julia, but that's for another article.)

First Baby Steps

In this guide, we'll install everything into the ~/inst prefix; it will mirror the structure of /usr but contain everything we installed locally. In addition, the source code of all packages will reside in ~/inst/src, and we will compile them in ~/inst/src/build-* as much as possible.

In the following commands, I will often have to write /home/USERNAME instead of ~. In most cases, this is necessary and you should do so too, but, of course, replace USERNAME with your username.

Optionally, you can add the ~/inst/bin and friends to your PATH and friends environment variables. This will have the effect that everything you install to ~/inst will always be preferred to the system-installed versions, implicitly. This might or might not be what you want. (It's not what I want.) If it is what you want, add the following to your shell's startup file; for bash, that's ~/.bashrc and:

PATH="/home/USERNAME/inst/bin:$PATH"
LD_LIBRARY_PATH="/home/USERNAME/inst/lib:$LD_LIBRARY_PATH"

Then, re-source it: . ~/.bashrc.

If you're using another shell, such as the excellent fish or the hipster's zsh, I expect you to know how to adapt the above to your shell.

While for the remainder of this article I'll assume you didn't do the above, you don't need to do anything differently if you did.

If you're a Mac user, replace each wget foo by the corresponding curl foo > filename, or download the files manually.

Static or Shared Python Libraries?

Compiling Python with static libraries makes the whole process easier, but if you plan to have anything embed the Python you're compiling (I'm thinking of both OpenCV and Julia's PyCall here), you have to go for the shared-library option. I will describe both ways in this article.
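If you're ever unsure which way a given interpreter was built, you can ask it directly; Py_ENABLE_SHARED is a standard sysconfig variable:

```python
# Ask an interpreter whether it was built with --enable-shared.
import sysconfig

shared = sysconfig.get_config_var("Py_ENABLE_SHARED")
print("shared build" if shared else "static build")
```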

Python3

We don't really care about other dependencies of Python, such as tkinter, readline and others since we'll go with IPython as a shell and IPython's excellent notebook for interactive plotting.

Should you want to get everything into python, do install libreadline, liblzma, libgdbm, and tcl/tk into your prefix. (Note: tcl/tk doesn't support out-of-source builds.)

Getting, configuring, compiling and installing Python3 works as follows (but read on first if you go for the shared library option!):

$ cd ~/inst/src
$ wget http://www.python.org/ftp/python/3.3.3/Python-3.3.3.tgz
$ tar -xzv < Python-3.3.3.tgz
$ mkdir build-Python-3.3.3
$ cd build-Python-3.3.3
$ ../Python-3.3.3/configure --srcdir=../Python-3.3.3 --prefix=/home/USERNAME/inst
$ make
$ make install
$ cd -

Keep the build directory, as it also contains a make uninstall target that might come in handy at some point. If you're short on disk space, you can run make clean in the build directory though.

With Shared Python

The only thing which changes in the above step is that you need to add the --enable-shared option to the configure step above. This makes the python executable link against libpython3.so and thus we'll have to adapt some paths later on.

Notes on Tcl/Tk

Installing those two guys from source wasn't possible without too much hassle, and since I don't need them, I didn't try harder. If you are unfortunate enough to really need them, here are a few pointers:

OpenBLAS

TODO: It seems this wants gfortran so that it also compiles lapack functions.

OpenBLAS is an open-source continuation of Kazushige Goto's BLAS implementation which completely crushed all available implementations. You might want to use that unless you have a license to use Intel's MKL, in which case you can skip this section.

Unfortunately, I couldn't find any easy way to build OpenBLAS out-of-source, so here we go:

$ cd ~/inst/src
$ git clone git://github.com/xianyi/OpenBLAS
$ cd OpenBLAS
$ make
$ make PREFIX=/home/USERNAME/inst install
$ cd -

If you have some kind of fancy special CPU which is not correctly autodetected, you might want to look into the TARGET=xxx flag described in the TargetList.txt file. Usually, you'll be fine without it.

If you want to compile ALL the kernels such that the right one for the current CPU is chosen at runtime (this only makes sense if you'll use the build on different machines, say because your home directory lives on a grid/cluster filesystem), add the DYNAMIC_ARCH=1 option to both the make and the make install commands above.

NOTE to self: I previously included the option NO_SHARED=1, but in recent versions, NumPy doesn't like the static libs anymore: it doesn't link to -lm, -pthread and -lgfortran, so we get undefined references. Interestingly, NumPy doesn't seem to need an updated LD_LIBRARY_PATH given a correct site.cfg, but SciPy does.

Creating a Virtual Environment

Now that we have Python3 compiled, we want to create a virtual environment based on that and install all of the following things into that environment.

Python >= 3.3

Starting with version 3.3, Python comes with built-in support for virtualenvs.

TODO: Experimental notes.

$ ~/inst/bin/pyvenv sci-env3

This already seems to take the lib from ~/inst/lib, so it might not need the LD_LIBRARY_PATH fixes?

Error calling pyvenv

Depending on your distribution (e.g. Ubuntu 14.04) and version of python, you might encounter the following error:

Error: Command '['/home/lucas/sci-env3.4/bin/python3.4', '-Im', 'ensurepip', '--upgrade', '--default-pip']' returned non-zero exit status 1

In this case, you should create the venv without pip and install pip manually into it:

$ pyvenv-3.4 --without-pip sci-env3
$ . sci-env3/bin/activate
$ curl https://bootstrap.pypa.io/get-pip.py | python

Python < 3.3

For older versions of Python, most prominently version 2.7, you'll need to rely on a 3rd-party virtualenv creation tool, such as virtualenv:

$ cd ~/inst/src
$ wget https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.11.2.tar.gz
$ tar -xzv < virtualenv-1.11.2.tar.gz

And now we can create the virtual environment. I'll call it sci-env3 and place it into my home folder. The virtualenv script needs to be called by the python executable you wish to have in the environment, so we'll use the one we just compiled a few minutes ago (See below first if you used --enable-shared):

$ cd virtualenv-1.11.2
$ ~/inst/bin/python3 virtualenv.py /home/USERNAME/sci-env3
$ cd -

All of the following needs to be done with the environment activated! So let's activate it:

$ . ~/sci-env3/bin/activate
(sci-env3)$

Or, for friends of fishes:

$ . ~/sci-env3/bin/activate.fish

With --enable-shared

If you have chosen to compile Python with the --enable-shared flag, you'll need to make sure the libpython3 can be found by the OS. So, instead of the above ~/inst/bin/python3 command, run the following:

$ LD_LIBRARY_PATH=~/inst/lib:$LD_LIBRARY_PATH ~/inst/bin/python3 virtualenv.py /home/USERNAME/sci-env3

In order not to have to type this huge prefix all the time, I recommend you add the following to the virtualenv's activate script in the deactivate function around line 12:

    if [ -n "$_OLD_VIRTUAL_LD_PATH" ] ; then
        LD_LIBRARY_PATH="$_OLD_VIRTUAL_LD_PATH"
        export LD_LIBRARY_PATH
        unset _OLD_VIRTUAL_LD_PATH
    fi

and, around line 53,

_OLD_VIRTUAL_LD_PATH="$LD_LIBRARY_PATH"
LD_LIBRARY_PATH="$VIRTUAL_ENV/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH

Now, activate the virtualenv using this updated activation script by calling . ~/sci-env3/bin/activate and then copy the shared libraries into the virtualenv, just like is being done with the binary:

(sci-env3)$ cp ~/inst/lib/libpython3{.3m.so.1.0,.so} $VIRTUAL_ENV/lib
(sci-env3)$ ln -s $VIRTUAL_ENV/lib/libpython3.3m.so.1.0 $VIRTUAL_ENV/lib/libpython3.3m.so

Having done this, whenever you activate the virtualenv, the LD_LIBRARY_PATH will be correctly set.

For the fish shell

I'm using the lovely fish shell. In that case, it's:

    if test -n "$_OLD_VIRTUAL_LD_PATH"
        set -gx LD_LIBRARY_PATH $_OLD_VIRTUAL_LD_PATH
        set -e _OLD_VIRTUAL_LD_PATH
    end

and

set -gx _OLD_VIRTUAL_LD_PATH $LD_LIBRARY_PATH
set -gx LD_LIBRARY_PATH "$VIRTUAL_ENV/lib" $LD_LIBRARY_PATH

(Notice how in fish, the PATH variables are arrays. Nice.) Don't forget to run the last part above, i.e. activating the virtualenv, copying the libs and creating the symlink!

Numpy

Now we're ready to install NumPy. Backing NumPy with OpenBLAS and/or Intel MKL has become considerably easier with NumPy 1.8 and thus many of the tutorials you'll find on the internet are outdated and overcomplicated. First, get NumPy and its dependency cython:

(sci-env3)$ pip install cython
(sci-env3)$ cd ~/inst/src
(sci-env3)$ git clone https://github.com/numpy/numpy
(sci-env3)$ cd numpy

Unless you have a compelling reason not to, it's a good idea to check out the latest tag, as a tagged release is unlikely to sit in the middle of unfinished changes.

(sci-env3)$ git tag
(sci-env3)$ git checkout v1.9.0

Now, we'll have to customize the site.cfg file in order for setup.py to find our installation of either OpenBLAS or MKL.

(sci-env3)$ cp site.cfg.example site.cfg

OpenBLAS

For OpenBLAS, it's as simple as uncommenting and adapting the prepared entries:

[openblas]
libraries = openblas
library_dirs = /home/USERNAME/inst/lib
include_dirs = /home/USERNAME/inst/include

But be warned: before Python 3.4, multithreaded OpenBLAS and Python's multiprocessing don't work well together!
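If you run into such hangs, a common workaround is to cap OpenBLAS's thread count before NumPy loads; OpenBLAS reads the OPENBLAS_NUM_THREADS environment variable once, at library load time:

```python
# Cap OpenBLAS's thread count *before* NumPy is imported -- the library
# reads this environment variable only once, when it is loaded.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

a = np.random.rand(100, 100)
c = np.dot(a, a)
print(c.shape)  # BLAS still works, just single-threaded
```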

MKL

In case you do have an MKL license, you might want to go for MKL instead of OpenBLAS. There are many different distributions of MKL, the following is what worked for mine, which came bundled with Intel's Composer XE 2013:

[mkl]
library_dirs = /opt/intel_csxe_2013/composer_xe_2013.5.192/mkl/lib/intel64/
lapack_libs = mkl_lapack95_lp64
mkl_libs = mkl_rt

TODO: I don't really want to bother compiling with icc as most online docs do; if you still want to, it doesn't really look difficult.

TODO: Looks like current NumPys (1.7 through 1.9) don't work with my MKL, since all of them segfault the unittest when it reaches dot.

And then

After having correctly configured the site.cfg, we can compile and install NumPy. The output of the first step should hopefully show that the correct BLAS implementation will be used:

(sci-env3)$ python setup.py config
(sci-env3)$ python setup.py build
(sci-env3)$ python setup.py install
(sci-env3)$ cd -

Testing

This installed NumPy. Let's just make sure everything worked out fine by running NumPy's testsuite. We'll need nose for that, so we'll install it using pip and then run NumPy's test suite:

(sci-env3)$ pip install nose
(sci-env3)$ python -c 'import numpy; numpy.test()'

A dot represents a successful unit test, an 'S' is a skipped one, and a 'K' is a known failure; none of these are a problem. Anything else is.

Finally, when NumPy doesn't find any BLAS implementation, it compiles its own, very slow fallback implementation. To check whether NumPy actually found the BLAS implementation you set up, run

(sci-env3)$ python -c 'import numpy; numpy.__config__.show()'
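As a rough cross-check (a sketch only; the exact numbers depend on your machine), a moderately sized matrix product should finish in a fraction of a second with a real BLAS, while the fallback is an order of magnitude slower:

```python
# Time a 2000x2000 double-precision matrix product as a crude BLAS sanity check.
import time

import numpy as np

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.time()
c = a.dot(b)
elapsed = time.time() - t0
print("dot(%dx%d) took %.2fs" % (n, n, elapsed))
```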

SciPy

While NumPy wraps BLAS and some more in a convenient interface, SciPy is a higher-level wrapper around LAPACK, UMFPACK and FFTW/DJBFFT. For the sake of this article, I only really cared about LAPACK since both UMFPACK and FFTW are not strictly necessary; they only improve performance of sparse matrix operations and FFTs, respectively. If I'll ever need them, I'll revisit this article.

(sci-env3)$ cd ~/inst/src
(sci-env3)$ git clone https://github.com/scipy/scipy
(sci-env3)$ cd scipy

Again, you might want to pick the most recent tag and check that out. Then,

(sci-env3)$ python setup.py config

It should've picked up the BLAS implementation NumPy is using. If it didn't, either you already failed doing that correctly for NumPy, or maybe you left the virtualenv.

(NOTE/TODO: Recent versions fail because they don't link -lm, -lpthread and -lgfortran which are needed by OpenBLAS which can't link to them itself since it's a static library.)

(sci-env3)$ python setup.py build
(sci-env3)$ python setup.py install
(sci-env3)$ cd -

OpenBLAS Fixme

For now, because static OpenBLAS doesn't work, you'll also need to fix the LD_LIBRARY_PATH in the virtualenv's activate and deactivate functions as described in the With --enable-shared section and then do

(sci-env3)$ ln -s ~/inst/lib/libopenblas.so.0 $VIRTUAL_ENV/lib/libopenblas.so.0

Testing

Again, we can test the validity of the installation:

(sci-env3)$ python -c 'import scipy; scipy.test()'

Here too, you can check whether it correctly detected your BLAS and LAPACK implementations by running

(sci-env3)$ python -c 'import scipy; scipy.__config__.show()'
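Beyond the test suite, a tiny functional check that SciPy's LAPACK bindings actually work is to solve a small linear system:

```python
# Solve Ax = b through SciPy's LAPACK bindings and verify the residual.
import numpy as np
from scipy.linalg import solve

a = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = solve(a, b)
print(np.allclose(a.dot(x), b))  # True if LAPACK behaves
```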

Note: For me (v0.14.0.dev), there were 89 errors and 1 failure, though they all happened in the sparse matrix functions, which I don't use yet.

Note 2: Dayum, it segfaults with MKL and I don't even know which test does that. No corefile generated (even though ulimit is set correctly, O.o) and ltrace doesn't help.

Matplotlib

TODO: Better install using pip install -e . so that dependencies are downloaded?

Matplotlib is the most mature and popular plotting library for Python. It's inspired by Matlab's plotting functionality (hence the name) but it already outgrew it, while keeping its simplicity. Other plotting packages I follow closely are ContinuumIO's bokeh which aims to bring the grammar of graphics (i.e. R's ggplot2) to Python, and Mike Bostock's d3js which, while it is not Python, is simply genius.

Since we installed NumPy from the github master, we'll need to do that for matplotlib too (Same story when checking out tags.):

(sci-env3)$ cd ~/inst/src
(sci-env3)$ git clone https://github.com/matplotlib/matplotlib.git
(sci-env3)$ cd matplotlib
(sci-env3)$ python setup.py config
(sci-env3)$ python setup.py build
(sci-env3)$ python setup.py install
(sci-env3)$ cd -

You might try to just pip install matplotlib, but it might very likely not work because of the very recent NumPy/SciPy we just compiled.

Testing

Same story as for the other packages above:

(sci-env3)$ python -c 'import matplotlib; matplotlib.test()'
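If the full test suite takes too long for your taste, a quick headless smoke test is to render a figure with the Agg backend, which needs no display server:

```python
# Render a sine curve to a PNG without any display server.
import matplotlib
matplotlib.use("Agg")  # must be set before pyplot is imported

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x))
plt.savefig("/tmp/sine.png")
print("wrote /tmp/sine.png")
```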

OpenCV

OpenCV is a computer-vision library whose origins lie in Intel's highly optimized IPL. The core of OpenCV is very stable, highly optimized and generally of very high quality. Unfortunately, it grew enormously lately and contains a lot of higher-level but less-well maintained parts.

As it's a C/C++ library at heart and the Python bindings are a wrapper tacked onto it, we have to install OpenCV into the virtualenv we just created. You'll have to repeat this step for every single virtualenv you create! (I might explore creating a virtualenv which inherits from another one in another essay, which might make this easier.) In addition, OpenCV being a super-modern C++ project (I'm kidding), you'll need the CMake build system. If your workplace doesn't even have this, I pity you.

Another thing to watch out for is that OpenCV is BIG: the git repo weighs in at 381 MiB, and the compilation will take ages if you compile with CUDA support. The reason we use the latest master is that none of the 2.x versions of OpenCV support Python3; only the current master (and future version 3.x) does. Anyway, let's get started:

(sci-env3)$ git clone https://github.com/Itseez/opencv.git
(sci-env3)$ mkdir build-opencv
(sci-env3)$ cd build-opencv
(sci-env3)$ ccmake -D WITH_CUDA=OFF -D PYTHON_INCLUDE_DIR=/usr/include/python3.4m ../opencv

You might see errors about java and matlab. If that is the case, keep in mind you had those errors and go on.

This got you into OpenCV's configure dialog. Press the c key once. Now's the point where you can choose which features to compile and which not to. Press the t key to toggle the "advanced" settings. The important parts are the following:

NOTE: Older versions of OpenCV called the python-related variables PYTHON instead of PYTHON3, so make sure to check which ones you set.

Another possibility is to try just running the following line:

cmake -D WITH_CUDA=OFF -D PYTHON_INCLUDE_DIR=/usr/include/python3.4m -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/lucas/sci-env3.4 ../opencv/

After having made those changes, press the c key again, then press it once more. (Yes, twice: after the first pass, new settings appear which you need to confirm by pressing c again.) Then, press g to generate the makefiles which you can use to build and install OpenCV:

(sci-env3)$ make
(sci-env3)$ make install
(sci-env3)$ cd -

Again, I recommend keeping the build folder, as you'll be able to run make uninstall in it, which might come in handy in the future.

Testing

While there is no python test-suite, you can do some poor-man's testing of the python bindings by running the following code snippet:

(sci-env3)$ python -c 'import cv2'

and, of course, you can run the C++ test-suite(s) if you're feeling patient:

(sci-env3)$ build-opencv/bin/opencv_test_core
(sci-env3)$ for f in build-opencv/bin/opencv_test_*; do "$f"; done

IPython

We'll now install IPython and all the dependencies it needs for its notebook, which is an awesome "IDE" in the browser. This includes, amongst others, pyzmq, which will print a scary-looking message along the lines of "blabla unless you interrupt me in the next 10 seconds...". This is OK; don't interrupt it.

(sci-env3)$ pip install ipython[all]

You're now able to start IPython by simply running ipython3 (notice the three in the name), or start the IPython notebook server by running ipython3 notebook, which will also open your favorite browser and point it to IPython's interactive notebook. Have fun with it!

Scikit-learn

Scikit-learn is a collection of machine-learning libraries, or wrappers of such libraries, for Python. The interface is quite well thought-out and the documentation is first-class. Additionally, installing it is easy:

(sci-env3)$ pip install scikit-learn

You can test the installation by running:

$ nosetests --exe sklearn
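For a quick functional check beyond the test suite, fit a simple classifier on the bundled iris dataset (just a smoke test, not a benchmark):

```python
# Smoke test: a nearest-neighbor classifier on the bundled iris data.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
clf = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)
accuracy = clf.score(iris.data, iris.target)
print("training accuracy: %.2f" % accuracy)
```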

Additional Virtualenvs

Once you've reached this point, you've got yourself a nice scientific Python base environment. It doesn't end here, though. You might want to work on various projects needing additional, more domain-specific libraries, like Theano, statsmodels, NLTK, MMTK or even BioPython. If it's about trying out something specific, I'd recommend creating a new virtualenv for that, and not installing it into your "main" scientific env. Assuming you didn't remove any of the files created in the previous steps, creating additional scientific base environments is much easier now:

  1. Create the env as described above or, if you are lucky enough to have a globally installed virtualenv, using

    $ virtualenv -p ~/inst/bin/python3.3 env
    $ . env/bin/activate
    
  2. Install NumPy into it

    (env)$ cd ~/inst/src/numpy
    (env)$ python setup.py install
    (env)$ cd -
    (env)$ pip install nose
    (env)$ python -c 'import numpy; numpy.test()'
    
  3. Install SciPy into it

    (env)$ pip install cython
    (env)$ cd ~/inst/src/scipy
    (env)$ python setup.py install
    (env)$ cd -
    (env)$ python -c 'import scipy; scipy.test()'
    
  4. For the rest, either do as in the above two steps, or just pip install.

Updating a Virtualenv

Do you really want to risk that? Just create a new one, but git pull the repos before.

More

Theano

TODO

cuDNN

TODO

PyDot

For viewing Theano functions and expressions as a graph (which is useful for debugging), the pydot bindings to the dot graphing language are required. Unfortunately, the default pydot available in the cheeseshop currently fails to install in Python3 with a very uninformative message:

(sci-env3)$ pip install pydot
Collecting pydot
  Using cached pydot-1.0.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
    ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-e15pmion/pydot

Luckily for us, github user @nlhepler is hosting a Py2/3 compatible version of it, which we can simply install straight from the repo by running:

(sci-env3)$ pip install git+https://github.com/nlhepler/pydot.git

Note that this only works when graphviz is installed, something I haven't had to do by hand so far.

Julia?

Just as a quickref until I write a full article on that.

Julia is my big hope for the future of scientific programming. Since it's still a child (though not a newborn anymore), it's missing a lot of libraries. Whenever I'm missing something, and it's too much to implement it myself in the time I have, I'm happy to be able to call Python libraries through the excellent PyCall.jl package.

(sci-env3)$ cd ~/inst/src
(sci-env3)$ git clone https://github.com/JuliaLang/julia.git
(sci-env3)$ cd julia
(sci-env3)$ make

Since Julia is such an insanely fast-moving ecosystem, I recommend not actually installing it, but running it from the src directory and repeating the git pull and the make step on a regular (daily?) basis.

If you still want to install it, run:

(sci-env3)$ DESTDIR=/home/USERNAME/inst make install

In order to use the IPython version we installed in the virtualenv, we'll need to have the virtualenv activated when running the following:

(sci-env3)$ ./julia
julia> Pkg.update()
julia> Pkg.add("IJulia")
julia> Pkg.add("PyCall")
julia> Pkg.add("PyPlot")

Leave julia and start the IJulia notebook server (which is actually an IPython notebook server masquerading as IJulia):

(sci-env3)$ ipython3 notebook --profile julia --no-browser

Now point your favorite browser to whatever URL IPython told you, most likely http://127.0.0.1:8998/ and use Julia inside your browser!

Caveat emptor: PyPlot doesn't work with matplotlib 1.4 yet, so there's that.

More about Julia and PyCall in some future article.