At AWS I mostly work on real-time transcription and speaker diarization, using Java, Python, a bit of C++ and many public AWS services.

Prior to 2021, most of my time was spent doing data science work in Python; I also use Java and Node.js. My preferred toolkit includes numpy, scikit-learn, gensim, spaCy, pytorch and tensorflow. I’ve written Matlab and C in the past and occasionally write web-based front ends with Django/Flask, JavaScript (Angular, React), CSS and HTML. I keep meaning to learn R, Go and ML properly.

Open-source

An outdated list of my open-source contributions, in order of proportion of code written by me:

  • DiscoUtils, dc_evaluation, vector_builder (2013-2015)- a collection of tools for working with distributional semantic models. I used this in my doctoral work.
  • Critical difference (2014)- a tool for visualising the results of multiple tests for statistical significance. See website for examples.
  • word2vec-inversion- investigating a recent state-of-the-art document classification method using distributional semantic information
  • Project Plovdiv (from 2009)- a tool for teaching graph theory and mathematical epidemiology. I have been working on this on and off since 2009 (with help from Vladislav Donchev). The software is now reasonably stable and is being used in Widening Participation events at the Department of Mathematics at Sussex.
  • US address parser (2014)- a statistical parser for US addresses
  • RCaller (2011)- calling R code from Java
  • DISSECT (2014)- a toolkit for distributional compositional semantics. I ported it to Python 3 and simplified the code base a bit.
  • qsutils- a script for processing the output of Sun Grid Engine’s qstat command. Useful when running lots of grid jobs. Also lets you throttle jobs or move to a different queue.

Lesser-known useful software others have written

  • Sikuli script- automatic UI actions
  • CRF Suite and Python bindings- a fast and well-documented implementation of conditional random fields
  • word2vec and GloVe- fast and accurate neural word embeddings
  • gensim- fast topic modelling in Python (contains a re-implementation of word2vec)