An Overview of Machine Learning Classification

Classification

Sometimes, instead of predicting a continuous quantity (regression), we want to classify data. That is: place it into specific categories.

Often, we’ll use logistic regression as the foundation for our classification tasks. For example, we could set a probability threshold on a logistic regression’s output, and classify examples based on whether they fall above or below that threshold.
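
Here’s a minimal sketch of that idea using scikit-learn; the toy dataset and the 0.5 threshold are purely illustrative.

```python
# A minimal sketch: thresholding logistic-regression probabilities
# to get class labels. The data and threshold are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, binary labels
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# predict_proba returns P(class 0) and P(class 1) per example;
# we classify as positive when P(class 1) clears the threshold.
threshold = 0.5
probs = model.predict_proba(X)[:, 1]
labels = (probs >= threshold).astype(int)
print(labels)
```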

Accuracy & Class Imbalance

So, how do we evaluate our classifications? One option is accuracy: the percentage of correct classifications.

While accuracy is widely used, it has limitations – most notably under class imbalance. Imagine we’re trying to predict something very rare: e.g. the chance of a plane crashing. By simply setting the predicted output to always be “no”, the algorithm will be right 99.9999% of the time. Nonetheless, that model doesn’t help us predict crashes at all. This is the problem of class imbalance.
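
A quick sketch of the pitfall, with an invented one-in-a-million crash rate: an always-“no” baseline scores near-perfect accuracy while catching zero crashes.

```python
# Illustrating the class-imbalance pitfall: a classifier that always
# predicts "no crash" scores near-perfect accuracy on skewed data.
# The 1e-6 positive rate here is made up for demonstration.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.random(1_000_000) < 1e-6   # True = crash, overwhelmingly rare
y_pred = np.zeros_like(y_true)          # always predict "no crash"

accuracy = (y_pred == y_true).mean()
print(f"Accuracy: {accuracy:.6%}")      # ~99.9999%, yet every crash is missed
```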

Precision & Recall

For class imbalance issues, it’s helpful to separate outcomes into 4 different categories: true positives, false positives, false negatives, and true negatives. With these counts, we can define two important metrics:

Precision: true positives / (true positives + false positives), i.e. out of all positive predictions. That is: what percentage of my “positive” guesses were correct?

Recall: true positives / (true positives + false negatives), i.e. out of all actual positives. That is: what percentage of the actual positives did I correctly identify?
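
As a concrete sketch, here are both metrics computed directly from those outcome counts; the labels are invented.

```python
# Precision and recall computed straight from the outcome counts.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)  # how many positive guesses were right
recall = tp / (tp + fn)     # how many actual positives were found
print(f"precision={precision:.2f}, recall={recall:.2f}")
```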

So: if you want to improve recall, chances are you’ll increase the number of false positives, but decrease false negatives. Prioritizing recall makes sense in situations where false negatives would be a major problem – say, a missed disease diagnosis.

Conversely, if we increase our precision, we’re likely to reduce false positives, but increase false negatives. That trade-off makes sense where false positives are the costly mistake, like a spam filter flagging legitimate email.
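
To see the trade-off in action, here’s a sketch that sweeps the decision threshold over some invented predicted probabilities; raising the threshold trades recall for precision.

```python
# The precision/recall trade-off: sweeping the decision threshold.
# All probabilities and labels below are invented for illustration.
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
probs  = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.55, 0.6])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (probs >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy run, precision climbs from roughly 0.57 to 0.67 while recall falls from 1.0 to 0.5 as the threshold rises.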
