When developing machine learning models, we need to watch for potential biases in the models we build. (Note: in this document, “bias” refers to unfair imbalances, not to the offset term in a linear equation.)
Types of Bias
Selection Bias
Selection bias occurs when the data we are working with isn’t reflective of the real world. This may include problems such as:
- Non-response/participation bias: If a significant number of people cannot participate in, or decline to be included in, our data collection, we should ask why. There is a good chance the reason for their exclusion will bias our data – this is called non-response (or “participation”) bias.
- Coverage bias: Suppose we collect data on the average income of a nation by surveying people who are members of a country club. We are excluding a large proportion of the population, and that exclusion is likely to distort our results.
- Sampling bias: Suppose we’re looking to determine the average income of a nation by surveying people who live within a block of a country club. Technically we aren’t excluding anyone, but our data is likely to be heavily skewed towards the affluent, since country clubs tend to be located in wealthier regions (see the sketch after this list).
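To make the effect concrete, here is a minimal simulation sketch. The income distribution, its parameters, and the sample sizes are all synthetic assumptions, chosen only to illustrate how a wealth-weighted sample inflates an income estimate:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic population of incomes drawn from a log-normal distribution,
# a common rough model for income. Parameters are illustrative only.
population = rng.lognormal(mean=10.5, sigma=0.6, size=1_000_000)

# Unbiased sample: every person is equally likely to be surveyed.
random_sample = rng.choice(population, size=1_000)

# Sampling bias: selection probability grows with income, mimicking a
# survey run within a block of a country club in a wealthy neighborhood.
weights = population / population.sum()
biased_sample = rng.choice(population, size=1_000, p=weights)

print(f"Population mean:    {population.mean():,.0f}")
print(f"Random-sample mean: {random_sample.mean():,.0f}")
print(f"Biased-sample mean: {biased_sample.mean():,.0f}")  # noticeably higher
```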
Group Attribution Bias
Group attribution bias is the tendency to assume that what is true of an individual is also true of the entire group they belong to. For example:
- In-group bias: If the modeler, say, runs marathons, they may be more likely to believe that marathon runners are better employees than non-runners.
- Out-group homogeneity bias: The tendency to see members of groups we don’t belong to as more alike than they really are.
Reporting Bias
Reporting bias occurs when the frequency with which outcomes appear in a dataset doesn’t reflect how often they occur in the real world. For example: people are more likely to remember very exciting baseball games than unmemorable ones, so a dataset built from recollections may overestimate the percentage of games that end with walk-off home runs.
Automation Bias
Automation bias is the tendency to favor results produced by automated systems over those from non-automated sources such as humans – even when the automated system is less accurate.
Implicit Bias
Implicit bias occurs when modelers make assumptions based on their own mental models and lived experiences, which don’t necessarily apply more generally. Confirmation bias is a common type of implicit bias, in which modelers unconsciously process data in ways that affirm their existing beliefs.
Identifying Bias
There are many ways to identify bias in data, but here are a few of the most common:
Missing Feature Values
If a feature is missing values for a large share of examples, it’s important to understand why the data is missing – in many instances, missingness is a sign of selection (or other) bias.
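As a quick sketch of this kind of audit – the DataFrame, column names, and values below are all hypothetical – pandas can report the fraction of missing values per feature and show whether missingness lines up with another attribute:

```python
import pandas as pd

# Hypothetical survey data; 'income' is missing for some respondents.
df = pd.DataFrame({
    "age":    [34, 29, 61, 45, 52, 38],
    "region": ["urban", "rural", "rural", "urban", "rural", "urban"],
    "income": [72_000, None, None, 95_000, None, 64_000],
})

# Fraction of missing values per feature.
print(df.isna().mean())

# Does missingness track another feature? If income is missing mostly for
# rural respondents, that hints at non-response bias rather than chance.
print(df.groupby("region")["income"].apply(lambda s: s.isna().mean()))
```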
Unexpected Feature Values
It’s important to examine invalid or potentially erroneous values. If you see data that looks suspicious, be sure to investigate whether it is accurate and representative.
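A minimal sketch of such a sanity check, assuming hypothetical column names and valid ranges:

```python
import pandas as pd

# Hypothetical records; the column names and values are assumptions.
df = pd.DataFrame({
    "age":         [34, 29, 142, 45, -3, 38],  # 142 and -3 look wrong
    "hourly_wage": [21.5, 18.0, 0.0, 1200.0, 25.0, 19.5],
})

# Summary statistics often surface impossible values immediately.
print(df.describe())

# Flag rows that violate simple domain rules (bounds are illustrative).
suspicious = df[(df["age"] < 0) | (df["age"] > 120) | (df["hourly_wage"] > 500)]
print(suspicious)
```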
Data Skew
Ensure your data is representative of the full scope of what you intend to make predictions on. For example: if you are trying to create a music recommendation algorithm for all people, but your training data samples only the preferences of children, you’re likely to create a model that misrepresents everyone else’s tastes.
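One simple way to check for this kind of skew is to compare the distribution of a key attribute in the training data against the population you intend to serve. The sketch below uses illustrative, assumed numbers:

```python
import pandas as pd

# Hypothetical age-group shares in the training data versus the population
# the recommender is meant to serve; all numbers are assumptions.
train_dist = pd.Series({"child": 0.85, "adult": 0.10, "senior": 0.05})
target_dist = pd.Series({"child": 0.20, "adult": 0.60, "senior": 0.20})

comparison = pd.DataFrame({"train": train_dist, "target": target_dist})
comparison["ratio"] = comparison["train"] / comparison["target"]

# Ratios far from 1.0 flag groups that are over- or under-represented.
print(comparison)
```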