When developing machine learning models, we need to watch for potential biases in the models we build. (Note: in this document, “bias” refers to unfair imbalances, not to the offset term in a linear equation.)
Types of Bias
Selection Bias
Selection bias occurs when the data we are working with isn’t reflective of the real world. This may include problems such as:
- Non-response/participation bias: If a significant number of people cannot participate in, or decline to be included in, our data collection, we should ask why. There is a good chance the reason for their exclusion will bias our data – this is called non-response (or “participation”) bias.
- Coverage bias: Suppose we collect data on the average income of a nation by surveying people who are members of a country club. We are excluding a large proportion of the population, and that exclusion is likely to distort our results.
- Sampling bias: Suppose we’re looking to determine the average income of a nation by surveying people who live within a block of a country club. Technically we aren’t excluding anyone, but our data is likely to be heavily skewed towards the affluent, since country clubs tend to be located in wealthier regions (see the sketch after this list).
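To make the effect concrete, here is a minimal simulation sketch. The income distribution, its parameters, and the sample sizes are all synthetic assumptions, chosen only to illustrate how a wealth-weighted sample inflates an income estimate:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic population of incomes drawn from a log-normal distribution,
# a common rough model for income. Parameters are illustrative only.
population = rng.lognormal(mean=10.5, sigma=0.6, size=1_000_000)

# Unbiased sample: every person is equally likely to be surveyed.
random_sample = rng.choice(population, size=1_000)

# Sampling bias: selection probability grows with income, mimicking a
# survey run within a block of a country club in a wealthy neighborhood.
weights = population / population.sum()
biased_sample = rng.choice(population, size=1_000, p=weights)

print(f"Population mean:    {population.mean():,.0f}")
print(f"Random-sample mean: {random_sample.mean():,.0f}")
print(f"Biased-sample mean: {biased_sample.mean():,.0f}")  # noticeably higher
```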
Group Attribution Bias
Group attribution bias is the tendency to assume that what is true of an individual is also true of the entire group they belong to. For example:
- In-group bias: If the modeler, say, runs marathons, they may be more likely to believe that marathon runners are better employees than non-runners.
- Out-group homogeneity bias: The tendency to see members of groups we don’t belong to as more alike than they really are.
Reporting Bias
Reporting bias occurs when the frequency with which outcomes appear in a dataset doesn’t reflect how often they occur in the real world. For example: people are more likely to remember very exciting baseball games than unmemorable ones, so a dataset built from recollections may overestimate the percentage of games that end with walk-off home runs.
Automation Bias
Automation bias is the tendency to favor results produced by automated systems over those from non-automated sources such as humans – even when the automated system is less accurate.
Implicit Bias
Implicit bias occurs when modelers make assumptions based on their own mental models and lived experiences, which don’t necessarily apply more generally. Confirmation bias is a common type of implicit bias, in which modelers unconsciously process data in ways that affirm their existing beliefs.
Identifying Bias
There are many ways to identify bias in data, but here are a few of the most common:
Missing Feature Values
If a feature is missing values for a large share of examples, it’s important to understand why the data is missing – in many instances, missingness is a sign of selection (or other) bias.
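As a quick sketch of this kind of audit – the DataFrame, column names, and values below are all hypothetical – pandas can report the fraction of missing values per feature and show whether missingness lines up with another attribute:

```python
import pandas as pd

# Hypothetical survey data; 'income' is missing for some respondents.
df = pd.DataFrame({
    "age":    [34, 29, 61, 45, 52, 38],
    "region": ["urban", "rural", "rural", "urban", "rural", "urban"],
    "income": [72_000, None, None, 95_000, None, 64_000],
})

# Fraction of missing values per feature.
print(df.isna().mean())

# Does missingness track another feature? If income is missing mostly for
# rural respondents, that hints at non-response bias rather than chance.
print(df.groupby("region")["income"].apply(lambda s: s.isna().mean()))
```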
Unexpected Feature Values
It’s important to examine invalid or potentially erroneous values. If you see data that looks suspicious, be sure to investigate whether it is accurate and representative.
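A minimal sketch of such a sanity check, assuming hypothetical column names and valid ranges:

```python
import pandas as pd

# Hypothetical records; the column names and values are assumptions.
df = pd.DataFrame({
    "age":         [34, 29, 142, 45, -3, 38],  # 142 and -3 look wrong
    "hourly_wage": [21.5, 18.0, 0.0, 1200.0, 25.0, 19.5],
})

# Summary statistics often surface impossible values immediately.
print(df.describe())

# Flag rows that violate simple domain rules (bounds are illustrative).
suspicious = df[(df["age"] < 0) | (df["age"] > 120) | (df["hourly_wage"] > 500)]
print(suspicious)
```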
Data Skew
Ensure your data is representative of the full scope of what you intend to make predictions on. For example: if you are trying to create a music recommendation algorithm for all people, but your training data samples only the preferences of children, you’re likely to create a model that misrepresents everyone else’s tastes.
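One simple way to check for this kind of skew is to compare the distribution of a key attribute in the training data against the population you intend to serve. The sketch below uses illustrative, assumed numbers:

```python
import pandas as pd

# Hypothetical age-group shares in the training data versus the population
# the recommender is meant to serve; all numbers are assumptions.
train_dist = pd.Series({"child": 0.85, "adult": 0.10, "senior": 0.05})
target_dist = pd.Series({"child": 0.20, "adult": 0.60, "senior": 0.20})

comparison = pd.DataFrame({"train": train_dist, "target": target_dist})
comparison["ratio"] = comparison["train"] / comparison["target"]

# Ratios far from 1.0 flag groups that are over- or under-represented.
print(comparison)
```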