  • A Benchmark Comparison Of Content Extraction From HTML Pages

    I just published a post about one of the projects I have been involved with at work. It is aimed at developers with some understanding of machine learning, so it is not as technical as I would have liked, but hey! Many thanks to everyone else who worked on this: Chris Charlton, Marcia Oliveira and Maria Lehl.

  • Gold standard data: lessons from the trenches

    This article is a draft of a talk I am giving at PyData Berlin in July 2017. It is intended for a non-technical audience, but I plan to expand it into a more technical piece soon™.

  • An exploration of scipy sparse matrices

    My colleague Matti Lyra recently faced an interesting computational problem. He wanted to see how quickly a stream of temporally ordered documents evolves, and he chose to do it by looking at how often new words appear in the stream. This post is about how to do this efficiently in Python.

  • Profiling Python

    This article explains the basics of profiling Python code. The hardest part is installing all the great tools that make it trivial to find the bottleneck in your code.
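
    To give a flavour of the problem from the sparse matrices post above, here is a minimal pure-Python sketch that counts how many previously unseen words each document in a time-ordered stream introduces. This is not the method from the post, which is built on scipy sparse matrices. The function name `new_words_per_document`, the lowercasing and the whitespace tokenisation are all assumptions made for illustration.

```python
def new_words_per_document(documents):
    """For each document, in order, count the words no earlier document contained.

    Hypothetical helper: tokenisation is a plain lowercase whitespace split.
    """
    seen = set()    # vocabulary observed so far in the stream
    counts = []
    for doc in documents:
        words = set(doc.lower().split())
        fresh = words - seen        # words appearing for the first time
        counts.append(len(fresh))
        seen |= fresh
    return counts


stream = ["the cat sat", "the cat ran", "a dog ran"]
print(new_words_per_document(stream))  # [3, 1, 2]
```

    A set-based loop like this is fine for small streams. For large collections, a vectorised representation of the vocabulary is the kind of thing the post explores.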
