Posts

Aug 8, 2020 Reporting results in machine learning
A look at some common pitfalls when reporting experimental results in machine learning. We start from common sins like evaluating on the training data and build our way up to a more sophisticated statistical analysis.
Aug 2, 2017 A Benchmark Comparison Of Content Extraction From HTML Pages
I just published a post about one of the projects I have been involved with at work. It is aimed at developers with some understanding of machine learning, so not as technical as I would have liked, but hey! Many thanks to everyone else who worked on this: Chris Charlton, Marcia Oliveira and Maria Lehl.
Jun 19, 2017 Gold standard data: lessons from the trenches
This article is a draft of a talk I am giving at PyData Berlin in July 2017. It is intended for a non-technical audience, but I plan to expand it into a more technical piece soon™.
Oct 10, 2014 An exploration of scipy sparse matrices
My colleague Matti Lyra recently faced an interesting computational problem. He wanted to see how quickly a stream of temporally ordered documents evolves, and he chose to do it by looking at how often new words appear in the stream. This post is about how to do this efficiently in Python.
Jul 14, 2014 Profiling Python
This article explains the basics of profiling Python code. The hardest part is installing all the great tools that make it trivial to find the bottleneck in your code.