Most of my time is spent doing data science work in Python; I also use Java if I have to. My preferred toolkit includes numpy, scikit-learn, gensim and spaCy. I’ve written Matlab and C in the past and occasionally write web-based front ends with Django, JavaScript (Angular), CSS and HTML. I keep meaning to learn R.

Open-source

In order of proportion of code written by me:

  • DiscoUtils, dc_evaluation, vector_builder (from 2013)- a collection of tools for working with distributional semantic models. I use this in my doctoral work.
  • Critical difference (2014)- a tool for visualising the results of multiple tests for statistical significance. See website for examples.
  • word2vec-inversion- investigating a recent state-of-the-art document classification method using distributional semantic information
  • Project Plovdiv (from 2009)- a tool for teaching graph theory and mathematical epidemiology. I have been working on this on and off, with help from Vladislav Donchev. The software is now reasonably stable and is used in Widening Participation events at the Department of Mathematics at Sussex.
  • US address parser (2014)- a statistical parser for US addresses
  • RCaller (2011)- calling R code from Java
  • DISSECT (2014)- a toolkit for distributional compositional semantics. I ported it to Python 3 and simplified the code base a bit.
  • qsutils- a script for processing the output of Sun Grid Engine’s qstat command. Useful when running lots of grid jobs. It also lets you throttle jobs or move them to a different queue.

Lesser-known useful software others have written

  • Sikuli script- automated UI actions
  • CRF Suite and Python bindings- a fast and well-documented implementation of conditional random fields
  • word2vec and GloVe- fast and accurate neural word embeddings
  • gensim- fast topic modelling in Python (contains a re-implementation of word2vec)