Training data, a.k.a. ground truth data, includes both observations and the corresponding outcomes, and is the prerequisite for building supervised machine learning models. The quality and quantity of the training data often have a great impact on the resulting models, yet it is not always easy to obtain large-scale, high-quality training data, as it sometimes requires humans to manually annotate the outcome or label of each data record. Things become even harder when the labeling task is not as straightforward as distinguishing dogs from cats.
Enigma has been working on entity resolution using machine learning models. We used to collect the training data for the models by kicking off labeling tasks internally. We did it this way because the data used for general entity resolution problems may not fit our product needs, and people who label the data at Enigma usually have the best domain knowledge of the entities we’re resolving. However, the labeling process costs a lot of time and human effort, and is hard to scale. We recently decided to try out some labeling platforms such as Amazon SageMaker Ground Truth and Figure Eight to help us scale our collection efforts.
This post introduces how these platforms work, and describes the preparation and post-processing we did to collect training data using them, along with some tips and takeaways. By the end of this post, you will know:
- Which platform is right for you
- How to create a successful labeling job
Overview of labeling platforms
The two labeling platforms we work with are Amazon SageMaker Ground Truth and Figure Eight. Both allow users to upload unlabeled datasets to the platform along with instructions for how the data should be labeled; the platform then launches a labeling job and has human labelers on its distributed workforce complete it. Users can determine the number of times each record should be labeled (at different prices). For example, if the number is 3, each single record in the dataset will be labeled by 3 different labelers. Both platforms can complete human labeling tasks remarkably fast: a job of 10,000 records with 3 to 5 labelers per record can be done within one day.
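When each record receives multiple annotations, they must be consolidated into a single label. The platforms apply their own consolidation algorithms, but as a minimal sketch of the idea, a simple majority vote over three annotations looks like this:

```python
from collections import Counter

def majority_vote(annotations):
    # Pick the most common label among a record's annotations; on an exact
    # tie, Counter returns whichever label it encountered first.
    label, count = Counter(annotations).most_common(1)[0]
    return label, count / len(annotations)

# Three labelers answered for one record; two agree.
label, agreement = majority_vote(["same", "same", "different"])
```

The agreement ratio is also a useful signal: records with low agreement are often the ambiguous cases worth reviewing by hand.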
In addition, both platforms can apply built-in models to label the data automatically, reducing labeling time and the cost of human labelers.
Despite these common functionalities, the two platforms differ in user interface, input and output, workflows, human labeling, automatic labeling, and pricing.
Amazon SageMaker Ground Truth (ASGT)
User Interface: Users can manage the labeling jobs within AWS console as well as other Amazon SageMaker features like training jobs, which means the created training datasets can be easily imported into SageMaker for use in model development and training. Users can monitor the labeling progress in real-time from either the console or the output folder on S3.
Input & output: The input and output of the labeling job must be stored in a specified JSON format on S3. It looks like only text and image input data are supported for now. The output folder stores both the labelers' raw annotations and the aggregated annotations.
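As a sketch of what an input manifest might look like for text data (the company-pair payload here is hypothetical), each line is a standalone JSON object whose source field carries the record to be labeled:

```python
import json

# Hypothetical company-pair records; in a real job these would come from
# our entity resolution candidate generation.
records = [
    {"company_a": "Acme Corp", "company_b": "ACME Corporation"},
    {"company_a": "Foo Industries", "company_b": "Bar Holdings"},
]

# Each manifest line is a complete JSON object; the record itself is
# serialized into the "source" field as required for text data.
manifest_lines = [json.dumps({"source": json.dumps(r)}) for r in records]
manifest = "\n".join(manifest_lines)
```

The resulting file is uploaded to S3 and referenced when creating the labeling job.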
Built-in workflows: There are four built-in labeling workflows on ASGT: object detection, image classification, text classification, and semantic segmentation. Each workflow has its own labeling tool and annotation consolidation algorithm. Users only need to provide input data in the required format and set up the instructions using the AWS console.
Customized workflow: In addition to the built-in workflows, users can launch a customized labeling workflow by creating the labeling interface and the lambda function for annotation consolidation.
Human labeling & labeler types: There are three types of human labelers: private team, Amazon Mechanical Turk and third-party vendors. The private team only contains labelers within the user’s private organization, e.g., a group of employees at Enigma. Amazon Mechanical Turk refers to public human labelers on Amazon’s workforce network. Third-party vendors are those who specialize in data labeling.
Automatic labeling: ASGT first builds a model from a small set of labeled data provided by the user, then uses the model to label the input data automatically. Records the model finds ambiguous are sent to human labelers, and the human-labeled data is then fed back to the model for active learning.
Pricing: ASGT charges per labeling job, based on the number of objects labeled, the type of workflow, and the type of labelers.
Figure Eight (FE)
User Interface: FE has an easy-to-use interface for uploading input data and downloading output data and reports. The web portal shows real-time progress as well as advanced analytics and plots. Users can designate specific labelers for private tasks and even monitor each labeler's progress and performance. Labelers can also give the user feedback on the labeling tasks.
Input & output: FE accepts .csv, .tsv, .xls, .xlsx and .ods formats through the web portal or the RESTful API. Not only text and image, but also videos and audio are supported. The output data also contains both raw annotations and aggregated annotations.
Built-in workflows: FE has built-in templates of many popular tasks such as sentiment analysis, search relevance, data categorization, image annotation, speech recognition, data enrichment and data validation.
Human labeling & test questions: FE has a unique phase at the beginning of each human labeling job: every participating labeler must first complete a set of test questions drawn from a small amount of ground truth data provided by the user. Only those who pass are allowed to proceed to the labeling task, which ensures the labelers understand it precisely.
Automatic labeling: FE also has a similar ML-assisted labeling workflow combining model labeling and human labeling. Users can choose from multiple pre-trained models for different types of labeling tasks. More interestingly, FE allows users to create multi-job processes through the UI using logic-based routing rules between models and jobs to generate aggregated results.
Pricing: FE charges company customers a flat yearly rate based on the estimated number of rows to be labeled. They may offer other pricing options.
Before kicking off a labeling job on these platforms, we need to prepare the dataset to be labeled and the instructions and examples for the job, and to set up templates if using a customized workflow.
Data to be labeled
The data we use to generate the training data comes from the real-world public data sources we are trying to link. More specifically, the model we are building for our entity resolution framework identifies the relationship between two given company entities based on their common identifying attributes. Since this is the initial training dataset for the model, we want the training data to be evenly distributed across difficulty levels. We also want to cover as many cases as possible and subsample each case in a well-balanced manner.
Therefore, we first randomly generated pairs of entities (note: in practice we do not compare arbitrary pairs, only pairs that are likely related, but we also need negative samples to train the model), then sampled multiple subsets based on their pairwise similarities on each identifying attribute, making sure we covered different ranges of similarities and possible combinations.
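A minimal sketch of this bucketing-and-subsampling step, using a simple string similarity on company names as a stand-in for whatever pairwise similarity function your pipeline actually computes (the names, thresholds, and bucket names are all illustrative):

```python
import random
from difflib import SequenceMatcher

random.seed(0)

def similarity(a, b):
    # Stand-in similarity on one identifying attribute (company name);
    # replace with whatever pairwise similarity your pipeline computes.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["Acme Corp", "ACME Corporation", "Acme Inc",
         "Bar Holdings", "Quux Partners LLC"]

# Generate candidate pairs, then bucket them by similarity so the labeled
# sample covers clearly-different, ambiguous, and near-duplicate cases.
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
buckets = {"low": [], "mid": [], "high": []}
for a, b in pairs:
    s = similarity(a, b)
    bucket = "high" if s > 0.7 else "mid" if s > 0.4 else "low"
    buckets[bucket].append((a, b))

# Subsample each bucket down to the smallest bucket's size for balance.
k = min(len(v) for v in buckets.values())
sample = [p for v in buckets.values() for p in random.sample(v, k)]
```

Capping every bucket at the smallest bucket's size keeps the difficulty distribution even, at the cost of discarding some easy pairs.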
In the end, we generated a couple of datasets of different sizes under the same distribution. We needed some small datasets to test out the labeling platforms, and we went through some trial and error before we knew how to create a successful labeling job.
- We assume you can get enough data to label on these platforms; if you can't, then it's probably not necessary to use them.
- Make sure the data you are going to expose to the public labelers does not contain sensitive information.
- Start with small labeling tasks so you can easily refine your instructions and adjust your dataset based on the resulting labels.
- If you are collecting training data to improve an existing model, you may want to oversample the cases in which the model did poorly. For classification specifically, you can oversample the cases around the class boundaries.
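The boundary-oversampling idea in the last tip can be sketched as follows, assuming your existing model produces a predicted probability per candidate record (the scores below are random placeholders, and the margin and sampling rate are arbitrary choices):

```python
import random

random.seed(0)

# Hypothetical scores: each candidate record paired with the existing
# model's predicted probability of the positive class.
scored = [("pair_%04d" % i, random.random()) for i in range(1000)]

def near_boundary(prob, margin=0.15):
    # Probabilities close to 0.5 sit near the class boundary, where the
    # model is least certain and human labels help the most.
    return abs(prob - 0.5) < margin

boundary = [rec for rec, p in scored if near_boundary(p)]
confident = [rec for rec, p in scored if not near_boundary(p)]

# Send every boundary case for labeling, plus a 10% sample of confident
# cases to keep some easy examples in the mix.
to_label = boundary + random.sample(confident, len(confident) // 10)
```

Keeping a small fraction of confident cases guards against the training set drifting entirely toward hard, ambiguous examples.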
Because these platforms launch labeling jobs at scale, we need to make sure the labelers understand our expectations through the job instructions.
The process of preparing the instructions usually starts with exhausting all the possible cases the dataset might contain. By looking at concrete data during this process, we realized that our original framing of the problem was unclear. We had planned to build a binary classifier to determine whether two companies are the same, but there were more relationships between two companies that we wanted to capture. We eventually changed the labeling task to a multi-class classification problem, which made the number of classes hard to determine and the instructions more complicated: more classes make it harder to draw the boundaries, while fewer classes can result in more ambiguous cases. We then refined our problem and instructions several times to get the expected labeling results. (See more details in Experiments.)
Since labelers only read the instructions for a short period of time, concise instructions are essential. ASGT allows users to provide both a short instruction and a full instruction: the short instruction highlights the most important rules in simple words, and the full instruction supplements it with detailed rules and more complex cases.
We highly recommend including examples in the short instructions, since examples are better than words, even though more examples can be provided in ASGT's full instruction or FE's test question phase.
Tips on labeling examples in general:
- For categorization labeling tasks, each category should have one or two examples.
- The examples should speak for themselves: easy to understand and precisely representative of the cases we want the labelers to deal with.
- The examples should include all the edge cases: these are usually the cases we need help from human intelligence the most.
- Again, keep in mind that labelers only spend limited time on reading the instructions, so the examples should be as concise as possible.
Tips on providing examples on FE:
- With FE, we can provide additional examples outside the instructions and add a comment to each example explaining why we gave it a certain label.
- Note that the examples and explanations should be consistent with the instructions.
- For categorization labeling tasks, make sure each category is sufficiently represented by the test questions and the categories have a balanced distribution.
- Keep in mind that the labelers who pass the test questions you generate are supposed to succeed on the larger dataset.
Our labeling task can use FE's built-in data categorization template, but there is no suitable template on ASGT (the closest is the text classification template, but its built-in labeling tool cannot display our data well), so we had to customize our own workflow on ASGT. There are two parts to customize: the HTML interface that displays the instructions, together with a pre-processing Lambda function that imports the data to the frontend; and the post-processing Lambda function that tells ASGT how to consolidate the annotations.
The HTML interface is easy to customize to display the instructions however we want, and it can import the data values defined and parsed from the input JSON file by the pre-processing Lambda function.
Figure 1 shows our pre-processing Lambda function template. The input data we provide is in the required JSON format for text data, i.e., each line is a complete and valid JSON object, the text data object to be labeled must be the value of source, and each source record must be a text string or a serialized JSON object.
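As a rough sketch of such a pre-processing Lambda, assuming the standard custom-workflow event shape (one manifest line delivered under dataObject), the handler might parse the serialized source record and expose its fields to the HTML template (the company_a/company_b field names here are our own, not prescribed by ASGT):

```python
import json

def lambda_handler(event, context):
    # ASGT hands the pre-processing Lambda one manifest line at a time under
    # "dataObject"; our "source" value is a serialized JSON object describing
    # the pair of company entities to compare.
    record = json.loads(event["dataObject"]["source"])

    # Everything returned under "taskInput" becomes available to the custom
    # HTML template for rendering the two entities side by side.
    return {
        "taskInput": {
            "company_a": record["company_a"],
            "company_b": record["company_b"],
        }
    }
```

The matching HTML template then references these values when rendering each task for the labeler.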