Most of my time is spent doing data science work in Python; I also use Java when I have to. My preferred toolkit includes, in order of the proportion of code written by me:
- DiscoUtils, dc_evaluation, vector_builder (from 2013) - a collection of tools for working with distributional semantic models, which I use in my doctoral work.
- Critical difference (2014) - a tool for visualising the results of multiple tests of statistical significance. See the website for examples.
- word2vec-inversion - an investigation of a recent state-of-the-art document classification method that uses distributional semantic information.
- Project Plovdiv (from 2009) - a tool for teaching graph theory and mathematical epidemiology, developed on and off since 2009 with help from Vladislav Donchev. The software is now reasonably stable and is used in Widening Participation events at the Department of Mathematics at Sussex.
- US address parser (2014) - a statistical parser for US addresses.
- RCaller (2011) - a library for calling R code from Java.
- DISSECT (2014) - a toolkit for distributional compositional semantics. I ported it to Python 3 and simplified the code base.
- qsutils - a script for processing the output of Sun Grid Engine's qstat command. Useful when running many grid jobs; it also lets you throttle jobs or move them to a different queue.
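The qsutils idea above can be sketched in a few lines. This is not the actual qsutils code, just a minimal illustration of parsing qstat's plain-text table into records; it assumes the default qstat column layout (job-ID, priority, name, user, state, submit/start date and time, queue, slots), and the sample output below is made up for illustration.

```python
def parse_qstat(output):
    """Parse the plain-text table printed by SGE's qstat into job records.

    Assumes the default column layout; pending (qw) jobs have no queue
    column, so the queue field is left as None for them.
    """
    jobs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the header row, and the dashed separator.
        if not line or line.startswith("job-ID") or set(line) <= {"-"}:
            continue
        fields = line.split()
        jobs.append({
            "id": int(fields[0]),
            "priority": float(fields[1]),
            "name": fields[2],
            "user": fields[3],
            "state": fields[4],
            # Running jobs have 9 columns (queue present); pending have 8.
            "queue": fields[7] if len(fields) >= 9 else None,
        })
    return jobs

# Illustrative sample of default qstat output (not real job data).
sample = """\
job-ID  prior   name       user   state submit/start at     queue              slots
------------------------------------------------------------------------------------
  12345 0.55500 train_lda  alice  r     01/01/2024 12:00:00 main.q@node01          1
  12346 0.50000 parse_docs alice  qw    01/01/2024 12:05:00                        1
"""

running = [j for j in parse_qstat(sample) if j["state"] == "r"]
```

Filtering on the parsed records like this is what makes bulk operations (throttling, moving queues) easy to script on top of qstat.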