This article explains the basics of profiling Python code. The hardest part is installing all the great tools that make it trivial to find the bottleneck in your code.

Getting the required tools

This can be a pain in the neck on a Mac. The rule of thumb I adhere to is Use conda, pip and homebrew where possible.

Python and core packages (numpy, scipy, etc)

Installing Python modules can sometimes be challenging as these link to compiled C/Fortran code behind the scenes. I recommend using conda, which can install pre-compiled modules for your operating system. Download from conda can be installed to your home directory, which comes in handy on the cluster.

After that, you can easily set up a virtual environment and install many of the core packages:

conda create -n py3k anaconda python=3
conda install numpy

The first step above should prepend the newly set up Python interpreter to your path. If it doesn’t, prepend the path to the new interpreter by editing ~/.bashrc and ~/.bash_profile. Pip should also be installed— if it is not, type the following into a terminal

conda install pip

Once conda and pip are up and running, you should always attempt to install new Python packages through them instead of from source. This will save a lot of headache in the long run as you won’t have to compile C/Fortran code.


Some Python packages serve as a wrapper around other software, such as MySQLdb around mysql and bsddb around BerkeleyDB. The underlying software is best installed via homebrew. For instance, here is how to set up BerkeleyDB:

# install C/Fortran compilers, these will come in handy all the time
brew install gcc
# install berkeley to  /usr/local/Cellar/berkeley-db/<VERSION>
brew install berkeley-db
# install python wrapper, passing in the path to the berkeley-db installation
BERKELEYDB_DIR=/usr/local/Cellar/berkeley-db/5.3.15/ pip install bsddb3

Profiling result visualisation (Runsnakerun/ Snakeviz)

These tool are useful for finding which function calls take the most time. You cannot look inside of functions using these. The next section will explain how to profile a single function to find which lines take the most time.

I know of two great packages the can visualise the output of the Python profiler. snakeviz has some fancy features and written in pure Python. It is easier to set up and use, but somewhat slow. If you have profiled a large chunk of an application, snakeviz will give up. In practice, I found it’s only useful for small programs. runsnakerun is written in C, so it can display just about anything, but is somewhat trickier to install.

pip install snakeviz
pip install runsnakerun

For runsnakerun, you may also have to install SquareMap via brew/conda/pip.

To visualise the results of a profiling run with snakeviz type


where is the file output by the profiler (see below). If you get “Sorry, we were not able to load this profile! You can try profiling a smaller portion of your code.”, then you will have to use runsnakerun:

ipython qtconsole

In the window that opens, type

from runsnakerun import runsnake

then use the GUI to open

The reason we are using ipython instead of launching runsnakerun from the terminal is that OSX restricts some applications’ access to the screen. The version of runsnakerun installed through pip is not an Apple “framework” and therefore cannot access the screen. There are multiple guides explaining how to get around this on the Internet, but none of them has worked for me.

IPython extensions

There are many great tutorials on how to set up and use the required IPython extensions, such as .

These extensions are best used to profile single functions (see below). lprun and mprun are of particular interest. Also note prun prints as a table the same information that snakeviz/runsnakerun visualises. Personally I find the visualisation much more useful.

Profiling the entire application

To profile an entire application, run

python -m cProfile -o

This creates a file called, which can be visualised using snakeviz or runsnakerun. The visualisation might like this:


The top left corner of that image can be read as

  • function _count_vocab took a total of 1.6s
  • 0.6s of which were spent in __contains__ (not visible in the image, you need to hover over the box in runsnakerun)
  • 0.2s of which were spent in __my_feature_extractor__, etc

This technique can be used to find out which functions take a long time.

Profiling functions

Having set up the required IPython extensions and identified bottleneck functions, we can profile and incrementally optimise these. Let’s illustrate this with a real example I was recently working on.

I have a function called calculate_log_odds, which does something rather simple but seems to take a lot of time. To profile it line by line, I ran the following command in IPython

# get some data to run the function with
import pickle
b = pickle.load(open('../statistics/stats-exp0-0-cv0-ev.MultinomialNB.pkl'))
X = b.tr_matrix
y = b.y_tr
# do the actual profiling
%lprun -f calculate_log_odds calculate_log_odds(X, y)

The output looks like this:

Total time: 2.36018 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
     3                                           def calculate_log_odds(X, y):
     4                                               # todo this is quite slow
     5         1           14     14.0      0.0      log_odds = np.empty(X.shape[1])
     6         1           45     45.0      0.0      class0_indices = (y == sorted(set(y))[0])
     7      2466         1632      0.7      0.1      for idx in range(X.shape[1]):
     8      2465       864021    350.5     36.6          all_counts = X[:, idx].A.ravel()  # document counts of this feature
     9      2465      1378602    559.3     58.4          total_counts = sum(all_counts > 0)  # how many docs the feature occurs in
    10      2465        79524     32.3      3.4          count_in_class0 = sum(all_counts[class0_indices])  # how many of them are class 0
    11      2465         3871      1.6      0.2          p = float(count_in_class0) / total_counts;
    12      2465        29874     12.1      1.3          log_odds_this_feature = np.log(p) - np.log(1 - p)
    13      2465         2595      1.1      0.1          log_odds[idx] = log_odds_this_feature
    14         1            0      0.0      0.0      return log_odds

Most of the time is spent on line 9, where I am using the Python sum function to add up a numpy array. This can be made much faster by using the appropriate numpy function:

Function: calculate_log_odds2 at line 16
Total time: 1.54092 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
    16                                           def calculate_log_odds2(X, y):
    17         1           16     16.0      0.0      log_odds = np.empty(X.shape[1])
    18         1           55     55.0      0.0      class0_indices = (y == sorted(set(y))[0])
    19         1          526    526.0      0.0      X = X.tocsc()
    20      2466         1746      0.7      0.1      for idx in range(X.shape[1]):
    21      2465      1417910    575.2     92.0          all_counts = X[:, idx].A.ravel()  # document counts of this feature
    22      2465        49792     20.2      3.2          total_counts = np.sum(all_counts > 0)  # how many docs the feature occurs in
    23      2465        32936     13.4      2.1          count_in_class0 = np.sum(all_counts[class0_indices])  # how many of them are class 0
    24      2465         4377      1.8      0.3          p = float(count_in_class0) / total_counts;
    25      2465        30788     12.5      2.0          log_odds_this_feature = np.log(p) - np.log(1 - p)
    26      2465         2769      1.1      0.2          log_odds[idx] = log_odds_this_feature
    27         1            1      1.0      0.0      return log_odds

Note the running time has gone down from 2.3s to 1.5s. Now the bottleneck of the function seems to be at line 21, where I slice a sparse matrix and unnecessarily convert to it a dense one. After a few more iterations of 1) find bottleneck line, and 2) correct line, I arrived at the following:

Total time: 0.058181 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
    56                                           def calculate_log_odds5(X, y):
    57         1           13     13.0      0.0      log_odds = np.empty(X.shape[1])
    58         1           47     47.0      0.1      class0_indices = y == sorted(set(y))[0]
    59         1          619    619.0      1.1      X = X.A
    60      2466         1792      0.7      3.1      for idx in range(X.shape[1]):
    61      2465         8277      3.4     14.2          all_counts = X[:, idx]  # document counts of this feature
    62      2465         7354      3.0     12.6          total_counts = np.count_nonzero(all_counts)  # how many docs the feature occurs in
    63      2465         8044      3.3     13.8          count_in_class0 = np.count_nonzero(all_counts[class0_indices])  # how many of them are class 0
    64      2465         3521      1.4      6.1          p = float(count_in_class0) / total_counts
    65      2465        25890     10.5     44.5          log_odds_this_feature = np.log(p) - np.log(1 - p)
    66      2465         2623      1.1      4.5          log_odds[idx] = log_odds_this_feature
    67         1            1      1.0      0.0      return log_odds

The running time is now 0.05 seconds, or a speedup of ~50x. The bottleneck operation is now entirely done on numpy arrays, so there are no more low-hanging fruit.


  • Having unit tests for the functions that you are optimising is essential to making sure nothing gets broken.
  • Functions to be profiled line by line cannot be typed into the interactive shell and must be imported from a file on disk