Gold standard data: lessons from the trenches

This article is a draft of a talk I am giving at PyData Berlin in July 2017. It is intended for a non-technical audience, but I plan to expand it into a more technical piece soon™.

Introduction

It is often said that rather than spending a month figuring out how to apply unsupervised learning to a problem domain, a data scientist should spend a week labeling data. However, the difficulty of annotating data is often underestimated. Gathering a sufficiently large collection of good-quality labeled data requires careful problem definition, quality control and multiple iterations. As a result, gathering enough data to build a high-accuracy supervised model can take much longer than one might expect. This post describes my experiences in labeling gold-standard data for natural language processing and the lessons learned along the way.

Case study 1: word embeddings

Word vectors have become popular in recent years as they can be built without supervision and can capture complex semantic relations well. For example, adding the vectors of king and woman and subtracting the vector of man yields a vector that is very close to that of queen.
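The arithmetic described above can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration (real embeddings are learned from text and have hundreds of dimensions), and the `analogy` helper is a hypothetical name, not part of any library.

```python
from math import sqrt

# Toy vectors, invented for illustration only.
# The dimensions loosely stand for (royalty, masculinity, femininity).
vectors = {
    "king": (1.0, 1.0, 0.0),
    "queen": (1.0, 0.0, 1.0),
    "man": (0.0, 1.0, 0.0),
    "woman": (0.0, 0.0, 1.0),
    "apple": (0.1, 0.1, 0.1),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(positive, negative):
    """Add the `positive` vectors, subtract the `negative` ones and
    return the closest word that was not part of the query."""
    size = len(next(iter(vectors.values())))
    target = [0.0] * size
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = set(vectors) - set(positive) - set(negative)
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


nearest = analogy(positive=["king", "woman"], negative=["man"])
print(nearest)
```

The query words themselves are excluded from the candidates, which is a common convention when evaluating analogies.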

Algorithms for training word vectors have been evaluated by correlating the "similarity" of word pairs, as predicted by a model, to those provided by a human judge. A typical data set consists of word pairs and a similarity score, e.g. cat, dog, 80% and cat, purple, 21%. A model is considered good if it assigns high scores to word pairs that are scored highly by a human. Let us consider what it takes to label such a data set and what can go wrong in the process.
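As a rough sketch of this kind of evaluation, the snippet below computes Spearman's rank correlation between human and model scores. Rank correlation is one common choice for this comparison; the text above does not specify which correlation measure is used. Apart from the cat/dog and cat/purple scores quoted above, all word pairs and numbers are made up for illustration.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Human judgments in percent. The last two pairs are hypothetical.
pairs = [("cat", "dog"), ("cat", "purple"), ("rice", "cooking"), ("big", "banana")]
human = [80, 21, 65, 10]
# Hypothetical model similarities for the same pairs.
model = [0.75, 0.30, 0.20, 0.05]

rho = spearman(human, model)
print(round(rho, 3))
```

A value close to 1 means the model orders the word pairs in much the same way as the human judge, regardless of the scale each of them uses.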

Is the task clearly defined?

What exactly is word similarity? We all have an intuitive understanding, but a lot of corner cases are hard to pin down. What makes the words cat and dog similar? Is it because they are both animals? What about topically related words, such as rice and cooking? What about antonyms, e.g. big and small? Human annotators need a clear and unambiguous description of what is required of them, otherwise data quality will be poor. For instance, the similarity scores provided by 13 annotators for the pair tiger–cat range from 50% to 90% in WordSim353, a data set often used to rank word similarity. Click here for a more detailed analysis of this issue.

Is the task easy for humans to do?

Even with clear instructions, some tasks are inherently subjective. It is unlikely that every person you ask will provide the same similarity score for cat and dog, or will interpret a written sarcastic comment the same way out of context. If humans cannot agree on what the right answer is for a given input, how can a model ever do well? Of course, data scientists can take steps to address those issues. First, make sure that you have clear written annotator guidelines, even if you plan to do the annotation yourself. Ask others to read the guidelines and explain how they interpret your instructions.

Second, do not be afraid to change the task to make it easier. If your use case allows it, make the task as simple as possible for the annotators. Ask yourself if it is business-critical to get fine-grained labels or if a coarser-grained (and therefore easier) annotation schema would suffice.

Do you have quality control in place?

Not having a mechanism for identifying annotator errors or consistently under-performing annotators is perhaps the most common error in data labeling. There are many viable ways to do quality control; the key is to ensure at least one of them is in use. A common approach is to measure whether different annotators agree with one another or with a known gold standard (see the next case study). Conflicts may be resolved by an independent adjudicator.

Ideally, quality control should be an automated process that runs continuously. Remember that systematic errors may be a sign of unclear guidelines. Talk to your annotators to understand the source of the problem. Be prepared to discard your first data set.
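One way to automate the agreement check between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is a minimal sketch of one possible measure, not necessarily the one used in the case studies; the labels below are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same, non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same label independently.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for six word pairs.
annotator_1 = ["similar", "similar", "different", "similar", "different", "different"]
annotator_2 = ["similar", "different", "different", "similar", "different", "similar"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Running such a check after every annotation session makes it easier to spot when an annotator drifts away from the guidelines.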

How much data can you hope to get?

A supervised learning model may require a significant amount of labeled data to perform well. However, obtaining this much data may be problematic for a number of reasons:

Human annotators are expensive, even on crowdsourcing platforms that systematically underpay.
Good, trustworthy annotators are hard to find, especially if they need special training.
- An aside on ethics: pay fairly and treat annotators with respect at all times. The less you pay, the higher the odds you will have to repeat the job.
Errors in labeling may be hard to correct, so be prepared to discard data.

Keep an eye on the learning curve of your model. Do not rush into gathering more data unless there is evidence to suggest it would help. It is often better to focus on quality rather than quantity.
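A very simple way to act on a learning curve is to check how much the last batch of labeled data improved the model. The sketch below assumes the curve has already been measured; the numbers are placeholders rather than results from either case study, and both the `worth_collecting_more` helper and its `min_gain` threshold are hypothetical.

```python
# Hypothetical learning curve: (number of labeled examples, validation accuracy).
learning_curve = [(500, 0.71), (1000, 0.78), (2000, 0.82), (4000, 0.83)]


def gains(curve):
    """Improvement in score between consecutive points of the curve."""
    return [later - earlier for (_, earlier), (_, later) in zip(curve, curve[1:])]


def worth_collecting_more(curve, min_gain=0.02):
    """True if the most recent batch of data improved the score by at least `min_gain`."""
    return gains(curve)[-1] >= min_gain


print([round(g, 2) for g in gains(learning_curve)])
print(worth_collecting_more(learning_curve))
```

In this made-up example, doubling the data from 2000 to 4000 examples barely helps, which suggests that effort is better spent on label quality than on volume.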

Case study 2: symptom recognition in medical data

The second case study involves identifying mentions of symptoms or diseases in notes taken by a doctor during routine exams. An ideal note may read: Abdominal pain due to acute bacterial tonsillitis. No allergies. However, these notes are often taken in a hurry and typically look more like this:

Abd pian//acute b tons//n/a alergies.

This presents a different set of challenges for a data scientist.

Do you need an expert annotator?

Most doctors' notes are impossible to decode for a layperson. Annotation therefore has to be done by a trained doctor, but most doctors prefer to practice medicine rather than label data. The typical doctor is also not an expert in machine learning or linguistics (and vice versa), so they may not have the same vocabulary. This makes it harder to provide a clear task definition.

Can you measure inter-annotator agreement?

Can you quantify the degree to which annotators agree? In the case of word similarity, this is as simple as comparing numerical scores. This gets trickier when the annotation unit has a complex structure (e.g. it is a phrase). For example, in the phrase burning neck pain one annotator may pick up the first two words as a symptom, while another may pick up the last two words, and a third may not mark any symptoms.
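The burning neck pain example can be made concrete with a small sketch. It represents each annotation as the set of token positions marked as a symptom and compares annotators pairwise, both strictly (exact match) and leniently (token overlap, measured here with the Jaccard index). The choice of Jaccard overlap and all names below are assumptions made for illustration; other measures are possible.

```python
from itertools import combinations

tokens = ["burning", "neck", "pain"]

# Token positions each (hypothetical) annotator marked as a symptom.
annotations = {
    "annotator_1": {0, 1},   # "burning neck"
    "annotator_2": {1, 2},   # "neck pain"
    "annotator_3": set(),    # no symptom marked
}


def jaccard(a, b):
    """Share of marked tokens that both annotators have in common."""
    if not a and not b:
        # Both annotators agree that there is nothing to mark.
        return 1.0
    return len(a & b) / len(a | b)


annotator_pairs = list(combinations(sorted(annotations), 2))

# Lenient agreement: partial overlap counts for something.
overlap = {
    (first, second): jaccard(annotations[first], annotations[second])
    for first, second in annotator_pairs
}

# Strict agreement: the marked spans must be identical.
exact_matches = sum(
    annotations[first] == annotations[second] for first, second in annotator_pairs
)

for pair, score in overlap.items():
    print(pair, round(score, 2))
print("exact matches:", exact_matches)
```

Under the strict measure no two annotators agree, while the lenient measure gives the first two annotators partial credit for both marking neck. Which of the two is appropriate depends on how the labels will be used.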

Tooling

Do you need specialist software to capture and store data? Depending on the complexity, can you write something yourself or do you have to assemble a team? Will the tool be intuitive and easy to use, or will it confuse your annotators? In our medical example, the tooling is getting much better and allows annotators to be very productive. Several years ago tools such as BRAT were less mature and took a long time to set up, whereas now they run out of the box.

Further issues arise where hardware is involved, e.g. in internet-of-things or industrial applications. These include writing bespoke software for the device to capture data, transmitting it to a server reliably (and securely!) and storing the potentially massive amounts of data that machines generate.

People issues

Many data science talks focus on the technical aspects of data gathering, with people issues often taking a back seat. However, a single well-trained and reliable annotator producing large amounts of labeled data can often contribute more to the success of a project than a team of programmers. It is therefore important to build and maintain a good working relationship with the annotators. This is much easier if they are located near you. Be prepared to deal with small issues such as annotators not showing up for sessions, or taking two-week breaks between sessions. The more time passes between sessions, the more the annotators will need re-training. This is the main reason why quality control should be run continuously.

If you are using a crowdsourcing platform, beware of click farms. The academic literature has some fantastic resources on the subject; see the references.

General lessons

Get to know the problem domain
Do not be afraid to start from scratch if your assumptions are wrong
Monitor quality continuously
Beware of crowdsourcing

References and notes

The first case study is based on this paper, which is in turn based on my PhD work. The second case study is inspired by the PhD work of my lab mate Aleksandar Savkov.

Herbert Rubenstein and John Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM 8(10):627-633.
Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4):665-695.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. Proceedings of the 10th International Conference on World Wide Web, pages 406-414.
George Miller and Walter Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1):1-28.
Chris Biemann. Crowdsourcing Linguistic Datasets.
Chris Callison-Burch et al. Crowdsourcing and Human Computation Class.
Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks.
Ujwal Gadiraju, Ricardo Kawase, Stefan Dietze, and Gianluca Demartini. 2015. Understanding Malicious Behaviour in Crowdsourcing Platforms: The Case of Online Surveys.
Ron Artstein and Massimo Poesio. 2008. Inter-Coder Agreement for Computational Linguistics. Computational Linguistics 34(4):555-596.