7 steps to get started with large-scale labeling

Source: https://tech.instacart.com/7-steps-to-get-started-with-large-scale-labeling-1a1eb2bf8141

1. Assess the lay of the land

The first step to approaching human evaluation is to understand what your organization has already done. Make sure to ask the following questions:

  • Have we done any similar human evaluation tasks before?
  • Do we have any human-labeled data?

If your organization has already collected human evaluated data, make sure to understand existing processes. Do you have vendors with whom you already work? Is there an established way to store human-labeled data? Existing approaches can influence how you design your crowdsourcing task, so it’s important to take stock. Understand what went well in previous projects and what lessons were learned.

If you’re starting from scratch, focus on an area that the organization would like to know more about. For example, you may not know how good your top-k organic results are and want to quantify that metric.

At Instacart, we had previously completed a few ad-hoc projects, but now that we are beginning to run large-scale projects, we are revising the methodology.

2. Identify your use cases

Creating human evaluated data is often a costly and time-consuming process. Make sure to ask yourself:

  • What do we want the human evaluated data to accomplish? Is there a metric in mind?
  • Why is human evaluated data necessary here? Is this a critical project or nice-to-have?
  • Is this a one-off attempt or part of a larger continuous project?

Your data could be used as general training and evaluation data, as a way to quality test the output of your model, or as a reference collection to benchmark current and future models. Each of these use cases may require different approaches, which you should keep in mind.

Moreover, make sure that your use cases will genuinely benefit from human labeling. Crowdsourced tasks require proper setup and a budget, and should be reserved for tasks that truly require human input.

At Instacart, we wanted to measure the relevance of our search results. Labeled data helps us understand how relevant the products we show to users are when they enter a query into their search bar. This data can be used for training and evaluating models and for measuring the quality of our search results.


3. Understand your data

Familiarity with your product and the data generated is crucial. As you spend time looking at the data, you will begin to understand if you have all the data a rater needs to complete a task, how complex a task it will be, and potential gray areas. Ask yourself:

  • What data is presented to the user?
  • What data is sent by the user?
  • What logging do we capture?
  • Do we have all the information we need for a human to make an informed judgment?
[Image: How customers view the product attributes associated with an item, including price, brand, quantity, and size]

This understanding is imperative, as it sets the groundwork for your task design. Without investing the appropriate amount of time here, you’ll run into surprises and labels that don’t meet your expectations.

At Instacart, our goal was to measure how relevant our search results were to our queries — thoroughly considering all the associated data helped us avoid pitfalls later on. For example, we initially assumed that displaying product names and images would suffice in describing our products. However, as we internally tried to evaluate some data, we ran into trouble evaluating queries that specified product size, such as “six pack beer” or “bulk candy.” By revisiting Instacart search, we recalled that Instacart “Item Cards” display the product’s size and quantity under the product name. We made sure to present the same information in our human evaluation task. Had we not performed the internal exercise and found this discrepancy, raters definitely would have been confused on measurement-specific queries, and we would have been in for a surprise with our labels!

Ultimately, our search results return groceries — this is entirely different from airline flight times or restaurant reviews — and has its own set of complexities. You will want to do the same legwork to understand the complexities of your product and data.

4. Design your Human Intelligence Task (HIT)

After defining your use cases and data, you will want to design and implement your Human Intelligence Task (HIT) — the actual task you want your rater to complete. We’ll be using “task” and “HIT” interchangeably from now on. In designing your task, make sure to ask:

  • What exactly do you want to measure using human evaluation?
  • What is the most straightforward way for a human to evaluate this data?

Your task should try to answer a single or a small set of questions. Avoid conditional or layered tasks, where the rater needs to answer multiple questions, as this adds additional cognitive overhead. Often raters may have language barriers or optimize their work around the volume of tasks they complete, and so complex multi-layered tasks may put you at risk for low-quality results. If you plan on multilingual tasks, such as evaluating both English and French-language products, make sure you design for the product’s native language version first (in Instacart’s case, that’s English), and then expand to other languages.

At Instacart, there are many ways to try to measure the relevance of our search results. With the “query-slate” model, a rater could be presented with the query and an array of products, which they rate as a whole for relevance. Alternatively, we could try to capture the relationship between the query and product, for example, whether the product is an ingredient of the query or complementary to the query — which we could then map to a relevance score.

Ultimately, we decided that the most straightforward approach would be to ask: “How relevant is this product to this query?” — in which a rater evaluates a single query and product pair. This was the simplest task we could present to a rater while still addressing the most critical question we wanted answered.

A simple task was especially crucial for us, since search relevance for food is already such a complex area. Grocery searches need to consider brands, dietary restrictions, ingredients, complements, and more — all of which we needed to capture in our guidelines!
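
To make the shape of such a task concrete, here is a minimal sketch of what a single query-product HIT record might look like. The field names, relevance scale, and example values are illustrative assumptions, not Instacart’s actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical relevance scale; your labeling guidelines define the real options.
RELEVANCE_OPTIONS = ["highly relevant", "somewhat relevant", "not relevant", "i don't know"]

@dataclass
class QueryProductHIT:
    """One query-product pair shown to a rater, mirroring the product UI."""
    hit_id: str
    query: str               # the raw search query, e.g. "six pack beer"
    product_name: str        # as displayed on the Item Card
    product_image_url: str
    product_size: str        # the size/quantity shown under the product name
    options: list = field(default_factory=lambda: list(RELEVANCE_OPTIONS))

example_hit = QueryProductHIT(
    hit_id="hit-0001",
    query="gluten-free pasta",
    product_name="Gluten Free Penne",
    product_image_url="https://example.com/penne.jpg",
    product_size="12 oz",
)
```

Keeping every piece of information a rater needs inside the record itself is what lets the task UI mirror the product UI, as discussed in step 3.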

5. Determine your guidelines

Creating labeling guidelines is a bit more art than science. Your guidelines need to walk a fine line between being so broad that your labels are imprecise and so prescriptive that a rater is forced to abandon the intuitive judgment that makes human labeling valuable.

Developing your guidelines will be an iterative process. You will need to collect information from your team and users to understand and codify how raters evaluate your data. The following are methods and resources that you can use to develop your guidelines:

  • Internal Labeling Exercises: Create a sample of tasks and have your team evaluate them internally, with limited guidance. Tasks with high disagreement may mean that the task is not intuitive and that more clearly defined guidelines will help your raters (see the sketch after this list for one way to flag such tasks).
  • User Research: If you plan to evaluate data shown to a user, make sure to leverage teams that interact with users often, such as your User Research team or your Customer Experience team (if you have them!). Ask your User Research team about how users think about certain cases that your team disagreed on.
  • Product Information: Think about how you organize and classify your data right now. Are there certain classes of tasks that you are trying to combine into this HIT? You may need to modify your criteria to handle the complexity of those different classes.
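
As a rough illustration of how an internal exercise can surface guideline gaps, the sketch below flags the tasks your own team splits on. The example labels, the three-person panel, and the one-third disagreement cutoff are all hypothetical.

```python
from collections import Counter

def disagreement_rate(labels: list) -> float:
    """1 minus the share of raters who chose the most common label for a task."""
    top_count = Counter(labels).most_common(1)[0][1]
    return 1 - top_count / len(labels)

# Hypothetical internal labels: three teammates per query-product pair.
internal_labels = {
    ("gluten-free pasta", "wheat penne"): ["not relevant", "somewhat relevant", "not relevant"],
    ("hot dog buns", "hot dog wieners"): ["somewhat relevant", "not relevant", "relevant"],
    ("coke", "cola 12-pack"): ["relevant", "relevant", "relevant"],
}

# Tasks where a third or more of the team disagreed probably need clearer guidelines.
for task, labels in internal_labels.items():
    if disagreement_rate(labels) >= 1 / 3:
        print("Needs a guideline:", task, labels)
```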

At Instacart, we went through all of the above. Beginning with a barebones set of criteria, our team evaluated hundreds of query-product pairs. Our ratings had disagreements, and we had to ask ourselves interesting questions such as: when users search for “gluten-free pasta,” how relevant is wheat pasta? Or for searches like “hot dog buns,” is it okay for us to show complementary results like hot dog wieners? For a brand search like “Coke,” how relevant is a competitor’s product, like “Pepsi”? Once we had identified these types of cases, we incorporated input from our User Research team on how existing users think about these types of search results. We went through this process iteratively until we had a set of guidelines that gave us the precision we needed without being over-prescriptive.

[Image: Is it okay to return hot dog wieners when a user searches for “hot dog buns”?]

6. Communicate your task

It doesn’t matter how straightforward your task is if you can’t clearly convey how you want raters to label that data. As the designer of the task, you likely have an amorphous set of rules laid out, which aren’t easily codified. Packaging that information into a digestible set of instructions is a challenge. Make sure to ask yourself:

  • How can we clearly and concisely convey our guidelines to a rater?
  • What examples will be most effective in teaching our task to a rater?
  • How can we present the information consistently with how the information is presented in our product?

Creating instructions for raters will require creating a document that encapsulates the criteria you want them to understand and implement. These instructions can be in the form of a booklet, a slide deck, or any other medium. In these instructions, make sure to communicate the criteria step-by-step and present plenty of examples along the way. As you create these instructions, show them to people who aren’t working on your crowdsourcing project. At this point, you are likely intimately familiar with the task and guidelines and will benefit from the feedback of people who’ve never seen the project before.

When presenting the information to a rater, think about how a user would interact with that same information in your product. Your rater’s UI should resemble your product’s UI — including text size, fonts, and image quality — as closely as possible.

At Instacart, we created a set of slides that walked the rater through our criteria, including quizzes that confirm their understanding of the key concepts we presented. We made sure to display the information as close as possible to a typical Instacart Item Card UI with all the associated product information.

7. Maintain high quality

By creating a simple HIT, a clear and understandable set of guidelines, and a well-communicated set of instructions — you’ve built the foundation for high-quality results. These additional methods will help you measure and maintain rater quality during the evaluation period. Ask yourself:

  • How do we select the best possible judges for our task?
  • How do we avoid incorrect ratings and measure quality?
  • How do we handle bad input data or extremely tricky tasks?

In selecting your raters, it is important to know who your raters are. You may want control and visibility into your raters’ specific demographic information, including spoken language, nationality, gender, and age. This can help set up tasks where you want your rater pool to match your users’ demographic makeup. It can also help improve quality — for example, if you are rating English-only results, you likely want only English speakers evaluating that data. Gating by demographics can potentially increase the cost of your ratings, but it can be well worth it for the quality improvement.

You also need to make sure that your raters understand the task. After communicating your task guidelines to raters, test them on a series of hand-chosen HITs that reflect the guidelines’ complexity. This confirms that they understand the task and its intricacies before they rate actual data. Only allow raters to rate your data if they score above the threshold you’ve chosen on your test.
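
A rough sketch of that gating step is shown below; the 80% pass mark and the dictionary-based answer key are illustrative assumptions, not a prescribed setup.

```python
QUALIFICATION_THRESHOLD = 0.8  # illustrative pass mark; tune to your task's difficulty

def qualification_score(rater_answers: dict, answer_key: dict) -> float:
    """Fraction of hand-chosen qualification HITs the rater answered correctly."""
    correct = sum(1 for hit_id, truth in answer_key.items()
                  if rater_answers.get(hit_id) == truth)
    return correct / len(answer_key)

def is_qualified(rater_answers: dict, answer_key: dict) -> bool:
    """Admit a rater only if they clear the threshold on the qualification test."""
    return qualification_score(rater_answers, answer_key) >= QUALIFICATION_THRESHOLD

# Example: two of three correct (0.67) falls short of a 0.8 threshold.
answer_key = {"q1": "highly relevant", "q2": "not relevant", "q3": "somewhat relevant"}
rater_answers = {"q1": "highly relevant", "q2": "not relevant", "q3": "not relevant"}
print(is_qualified(rater_answers, answer_key))  # False
```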

You will want to trust the labels that you get back from your platform; that is, the data needs to be reliable. Data is reliable when independent workers agree on the answers, and workers who understand the instructions you’ve provided tend to produce similar results. If two or more workers agree on the same answer, there is a high probability that the final label is correct. At Instacart, we had five raters evaluate each task and took the consensus rating (three or more raters in agreement) as the final score.
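
A minimal sketch of that aggregation step, assuming five labels per task and the three-or-more majority rule described above:

```python
from collections import Counter

def consensus_label(ratings: list, min_agreement: int = 3):
    """Return the majority label if at least `min_agreement` raters agree,
    otherwise None so the task can be re-queued or escalated."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= min_agreement else None

print(consensus_label(["relevant", "relevant", "relevant", "not relevant", "relevant"]))
# -> relevant
print(consensus_label(["relevant", "not relevant", "relevant", "not relevant", "i don't know"]))
# -> None (no label reached three votes)
```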

Inter-rater reliability measures the extent to which independent raters assessing the same task produce the same answer. One of the most widely used statistics for computing agreement is Cohen’s kappa (κ), a chance-adjusted measure of agreement between two raters; a generalization to n raters is Fleiss’ kappa. Both statistics are available in standard statistical libraries, such as R packages, scikit-learn (Cohen’s kappa), and statsmodels (Fleiss’ kappa). We strongly recommend using inter-rater statistics to measure reliability on every data set.
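
For instance, a quick reliability check with scikit-learn (Cohen’s kappa) and statsmodels (Fleiss’ kappa) could look like the sketch below; the toy ratings are made up purely for illustration.

```python
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Two raters labeling the same six tasks: chance-adjusted pairwise agreement.
rater_a = ["relevant", "relevant", "not relevant", "relevant", "not relevant", "relevant"]
rater_b = ["relevant", "not relevant", "not relevant", "relevant", "not relevant", "relevant"]
print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))

# Five raters per task (rows = tasks, columns = raters): agreement across the pool.
ratings = [
    ["relevant", "relevant", "relevant", "not relevant", "relevant"],
    ["not relevant", "not relevant", "relevant", "not relevant", "not relevant"],
    ["relevant", "relevant", "relevant", "relevant", "relevant"],
]
counts, _categories = aggregate_raters(ratings)  # task-by-category count table
print("Fleiss' kappa:", fleiss_kappa(counts))
```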

Another common strategy to ensure high-quality work is to include predefined gold standard data in the data set at random, so you can test how workers perform. This technique is known as “honey pots”, “gold data”, or “verifiable answers”. If you know the correct labels for a set of HITs, you can use that precomputed information to test workers. By interleaving honey pots in the data set, you can identify workers who might be performing poorly. If all workers perform poorly on a particular honey pot, it may also indicate a mismatch between your intended label and how workers interpret your guidelines.

How do we build a set of honey pots? As part of your internal labeling exercise, identify cases where you and your team have reached consensus. Those cases can serve as your precomputed honey pots. You can then randomly add the gold data into the data set that needs to be labeled, so raters evaluate the honey pot tasks the same as your unevaluated data set.
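
A minimal sketch of that interleaving and the per-worker honey pot check follows; the task IDs are made up and the 0.7 accuracy floor is an illustrative assumption.

```python
import random

MIN_HONEYPOT_ACCURACY = 0.7  # illustrative floor; tune to your task and guidelines

def interleave_honeypots(unlabeled_tasks: list, honeypots: list, seed: int = 42) -> list:
    """Shuffle gold-standard tasks into the unlabeled batch so raters cannot tell them apart."""
    batch = list(unlabeled_tasks) + list(honeypots)
    random.Random(seed).shuffle(batch)
    return batch

def honeypot_accuracy(worker_labels: dict, gold_labels: dict) -> float:
    """Fraction of the honey pots a worker has seen that they labeled correctly."""
    seen = [hit_id for hit_id in gold_labels if hit_id in worker_labels]
    if not seen:
        return 1.0  # the worker hasn't hit a honey pot yet; nothing to judge
    correct = sum(worker_labels[h] == gold_labels[h] for h in seen)
    return correct / len(seen)

# Example: a worker who misses one of two honey pots falls below the 0.7 floor.
gold = {"hp-1": "relevant", "hp-2": "not relevant"}
worker = {"hp-1": "relevant", "hp-2": "relevant", "task-9": "relevant"}
if honeypot_accuracy(worker, gold) < MIN_HONEYPOT_ACCURACY:
    print("Flag this worker's labels for review")
```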

In some cases, the HIT may have poor data, such as an incorrect product image or a severely misspelled term. For cases like these, it helps to offer the rater an “I Don’t Know” option instead of having them guess. Going one step further, you may want to ask raters who select the option to explain why they cannot evaluate the task; you can provide a list of reasons to select from or add a free-text field. These options help you diagnose your information quality and have the added benefit of deterring excessive use of the option. Additional safeguards, such as rate-limiting usage of the “I Don’t Know” option, can also ensure that raters don’t abuse it.
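
One illustrative way to monitor that option is to tally per-rater usage and the stated reasons; the 10% cap and the reason values below are assumptions, not recommendations from the article.

```python
from collections import Counter

IDK_LABEL = "i don't know"
MAX_IDK_RATE = 0.10  # illustrative cap on per-rater "I Don't Know" usage

def idk_rate(labels: list) -> float:
    """Share of a rater's submitted labels that were 'I Don't Know'."""
    return labels.count(IDK_LABEL) / len(labels) if labels else 0.0

rater_labels = ["relevant", IDK_LABEL, "relevant", "not relevant", IDK_LABEL]
if idk_rate(rater_labels) > MAX_IDK_RATE:
    print("Review this rater's 'I Don't Know' usage and their stated reasons")

# Tallying the stated reasons also helps diagnose data-quality problems upstream.
reasons = ["broken image", "broken image", "misspelled query"]
print(Counter(reasons).most_common())
```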

Ready for Takeoff!

Now that you’ve completed the pre-flight checklist, you’re almost ready to label large data sets!

With your task and process clearly defined, it shouldn’t be too difficult to find a crowdsourcing platform that will satisfy your needs. The next step is to create processes for the data you want to evaluate, by sampling, partitioning, and preparing the data for continuous evaluation.

Crowdsourcing-based labeling is a good option both for collecting evaluation data and for constructing training data sets. That said, there is little information available on how to set up this type of project or on the amount of time and preparation needed. Many projects underestimate the preparation steps and focus only on the specific crowdsourcing platform. Shortcuts like these can, unfortunately, lead to subpar results, as crowdsourcing is more than just the choice of platform. We believe this checklist is useful for making sure that the project is successful and that the collected labels are of good quality.

In our next post for this series, we will focus on the details of running crowdsourcing tasks continuously and at scale.

Acknowledgments & Further Reading

We thank our team members Jonathan Bender, Nicholas Cooley, Jeremy Diaz, Valery Karpei, Aurora Lin, Jeff Moulton, Angadh Singh, Tyler Tate, Tejaswi Tenneti, Aditya Subramanian, and Rachel Zhang. Thanks to Haixun Wang for providing additional feedback.

If you are interested in learning more, there is a dedicated conference, HCOMP (Human Computation), that brings together many disciplines, such as artificial intelligence, human-computer interaction, economics, social computing, policy, and ethics. These bonus reads offer a good introduction to these topics:

  • O. Alonso. “The Practice of Crowdsourcing”, Morgan & Claypool, 2019.
  • A. Doan, R. Ramakrishnan, A. Halevy. “Crowdsourcing systems on the World-Wide Web”, Commun. ACM 54(4): 86–96, 2011.
  • A. Marcus and A. Parameswaran. “Crowdsourced Data Management: Industry and Academic Perspectives”, Found. Trends Databases 6(1–2), 2015.
  • J. Wortman Vaughan. “Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research”, J. Mach. Learn. Res. 18, 2017.
