For the past two years, I was part of the team responsible for building, expanding and improving datasets for supervised and semi-supervised learning at Element AI. Both techniques require labelled examples to train models: they first learn from labelled data, then make predictions on unlabelled data. Labelled data is data in which humans have annotated each data point so that a feature input is matched to a label output. We use supervised or semi-supervised techniques when we already know some or all of the target values that we want a model to predict and when we want to be able to assess the accuracy of its predictions (i.e., we have a desired goal). Examples include predicting the discrete values of a classification problem, such as identifying whether an email is spam or not, and predicting the continuous value of a regression problem, such as estimating a salary from certain indicators.
As an internal data collection and labelling team, we learned a lot about machine learning and model development. We saw the impacts of data and concept drift and how the quality of data influences model predictions, learned how bias is inherently part of machine learning systems (bias, after all, is everywhere in the world), and observed how data truly is at the heart of machine learning and its applications.
With this new perspective, we started to consider how to design better systems for data collection and labelling, and improve our methods to ultimately build better datasets. We did this by changing how we viewed our team. Instead of considering ourselves “just” as a data collection and labelling team, we became a “machine teaching” team.
Machine teaching is the process by which human expertise and insight are infused into training data in order to teach learning models more effectively to achieve a desired goal.
Machine teaching focuses on the human teacher and their interactions with data, as well as their efficiency in teaching a model.
The concept of machine teaching is not new. It was pioneered by Microsoft researchers to advance machine learning systems through the idea that “any information processing skill that is teachable to a human, should be easily teachable to a machine”. With this conceptual framework, machine learning has the opportunity to profit from human knowledge and intuition just as much as from training data.
Although often associated solely with labelling (a topic we dive into in the next blog post of this series), machine teaching is also tightly coupled to data collection (regardless of the machine learning technique that will be used). Humans, in how they collect data or build systems that collect data, play a key role in whether the data is:
- representative (spans a number of use-cases and conditions)
- sufficient in volume and diversity (enough data and balanced training examples exist)
- clean (contains named fields, includes matched inputs/outputs)
- accessible (does not include limiting restrictions such as intellectual property)
- accurate (records are substantially complete across relevant fields)
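Some of these properties can be checked mechanically before any training starts. The following is a minimal sketch, not a description of our internal tooling: the function name `audit_dataset`, its thresholds and the toy records are all hypothetical, and it only covers the properties that are easy to count (volume, label balance, completeness of required fields). Representativeness and accessibility still need human judgement.

```python
from collections import Counter


def audit_dataset(records, required_fields, label_field,
                  min_examples=100, max_imbalance=5.0):
    """Return simple data-quality indicators for a list of dict records.

    Thresholds are illustrative defaults, not recommended values.
    """
    n = len(records)

    # Completeness: records with a non-empty value for every required field.
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields)
        for r in records
    )

    # Balance: ratio of the most common label count to the least common one.
    label_counts = Counter(
        r[label_field] for r in records if r.get(label_field) not in (None, "")
    )
    if label_counts:
        imbalance = max(label_counts.values()) / min(label_counts.values())
    else:
        imbalance = float("inf")

    return {
        "n_records": n,
        "enough_volume": n >= min_examples,
        "complete_fraction": complete / n if n else 0.0,
        "label_counts": dict(label_counts),
        "balanced": imbalance <= max_imbalance,
    }


# Toy, made-up records for illustration only.
records = [
    {"text": "win money now", "label": "spam"},
    {"text": "meeting at 3", "label": "ham"},
    {"text": "", "label": "ham"},          # missing input text
    {"text": "lunch?", "label": "ham"},
]

report = audit_dataset(records, ["text", "label"], "label", min_examples=3)
print(report)
```

A report like this does not say whether a dataset is good, but it gives the human teacher a quick signal about where to look first.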
Ultimately, a good training dataset (even when little data is required or available, i.e., in low-data regimes) should contain data that closely resembles the real-world data that the machine learning model will encounter in production. This is a fine line to walk. Contrary to popular belief, we don’t always want complete alignment with real-world data. Why? Because real-world bias is sometimes too easy for a model to overfit to, and careful consideration of what data is collected and how it is labelled can mitigate such bias. For example, historical loan and mortgage data is known to contain real-world bias: minority and racialized groups have been disadvantaged by higher fees, higher interest rates and refusals, leading to unequal access to credit. Acknowledging this bias and then taking measures to mitigate it has the potential to make lending algorithms provide fair and equitable loans and avoid discrimination. Increasingly, steps are being taken to improve the fairness of lending algorithms, including recognizing that historical data is biased and should have minimal use in model development, which makes way for other techniques such as reinforcement or adversarial machine learning to improve equitable outcomes.
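One way to make this kind of bias visible before training is to compare outcome rates across groups in the historical records. The sketch below is illustrative only: the field names `group` and `approved`, the helper functions and the toy records are hypothetical, and the gap in approval rates (sometimes called a demographic parity difference) is just one of many possible fairness measures, not a complete assessment.

```python
from collections import defaultdict


def approval_rates(applications, group_field="group", outcome_field="approved"):
    """Fraction of approved applications per group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for app in applications:
        group = app[group_field]
        totals[group] += 1
        if app[outcome_field]:
            approved[group] += 1
    return {group: approved[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Difference between the highest and lowest group approval rate."""
    return max(rates.values()) - min(rates.values())


# Toy, made-up history for illustration only.
history = (
    [{"group": "A", "approved": True}] * 3
    + [{"group": "A", "approved": False}]
    + [{"group": "B", "approved": True}]
    + [{"group": "B", "approved": False}] * 3
)

rates = approval_rates(history)
gap = parity_gap(rates)
print(rates, gap)
```

A large gap does not by itself prove discrimination, but it flags data that a model could easily overfit to and that deserves a closer look before it is used for training.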
When biased or unintended model behaviour is observed, it is important to be able to discern which part of a dataset it stems from, and to track what measures (omission, cleaning, etc.) have been taken to address it. Data provenance and versioning are difficult. Even as an experienced machine teaching team, we struggle to adopt and implement systems that consistently catalogue and describe datasets for easy retrieval and reproducibility of training outcomes. To date, there is no industry standard. A number of methods have been introduced to tackle this problem, for example, Datasheets for Datasets, which was adopted most notably by Microsoft, or the Data Nutrition Project, proposed by researchers at Harvard and MIT. Although these solutions are very useful, implementing them can be challenging. They are hard to scale across numerous datasets (i.e., a cataloguing system is required as the number of datasets and versions increases) and across datasets of different domains. For example, each proposed method can be applied well to tabular data (domains such as time-series), where a dataset is initially a table in which each row is an observation and each column is a measure with a given type. However, new challenges arise when working with deep learning vision and natural language datasets, where the concept of labels exists (unlike classical forecasting tasks) and each set of labels can have multiple immutable versions with domain-specific or even task-specific metadata, making standardization difficult.
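To make the versioning problem concrete, here is a minimal sketch of a catalogue entry in which the raw data and each set of labels are versioned separately. It is an assumption-laden illustration, not an industry standard and not the system we used: the class names `DatasetRecord` and `LabelSetVersion`, the `content_hash` helper and the example dataset are all hypothetical. Version identifiers are derived from a hash of the content, and each label set is immutable and points to the version it was derived from.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


def content_hash(content):
    """Deterministic short hash so identical content gets the same version id."""
    payload = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class LabelSetVersion:
    """One immutable version of the labels for a given task."""
    task: str
    version_id: str
    parent_id: Optional[str]
    note: str


@dataclass
class DatasetRecord:
    """Catalogue entry: one data version plus any number of label versions."""
    name: str
    data_version: str
    source: str
    label_sets: list = field(default_factory=list)

    def add_labels(self, task, labels, note, parent_id=None):
        version = LabelSetVersion(task, content_hash(labels), parent_id, note)
        self.label_sets.append(version)
        return version


# Toy, made-up dataset for illustration only.
images = [{"id": "1", "path": "img/1.png"}, {"id": "2", "path": "img/2.png"}]
record = DatasetRecord("street-signs", content_hash(images), source="hypothetical")

v1 = record.add_labels("classification", {"1": "stop", "2": "yield"},
                       note="initial labelling pass")
v2 = record.add_labels("classification", {"1": "stop", "2": "stop"},
                       note="relabelled item 2 after review",
                       parent_id=v1.version_id)

print(json.dumps(asdict(record), indent=2))
```

Recording the parent of each label version and a note about what changed is what lets a team trace a model's behaviour back to a specific cleaning or relabelling step. Task-specific metadata would still have to be added per domain, which is where standardization gets hard.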
Furthermore, with increased discussion around the broader impact of AI, there is a heightened expectation for the machine learning community to show that we care deeply about potential abuses in dataset and model development. And rightfully so. It is crucial to demonstrate the understanding that production models, and data availability itself, share responsibility in addressing bias, privacy infringement and impact. If we ignore this, we accept the worst outcomes, where we continue to see systemic racism encoded into our technology and marginalization as the norm. As a community, the least we can do is ensure that dataset (and model) development is done conscientiously, with the intent to mitigate bias and reduce the potential for abuse.
In this blog series, we will talk about the evolution of our machine teaching team at Element AI, how our approach led to greater collaboration and explainable outcomes during model development, and finally, how it informed systems and interface design by highlighting the need and potential for improvement.