An Overview of Machine Learning Classification

Sometimes, instead of predicting a continuous value (regression), we want to classify data. That is: place it into specific categories.

Often, we’ll use logistic regression as the foundation for our classification problems. For example: we could set a probability threshold for a logistic regression model, and classify examples based on whether their predicted probability falls above or below that threshold.
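Here's a minimal sketch of that idea in scikit-learn (the toy data and the 0.5 threshold are illustrative assumptions, not fixed rules):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, binary labels (illustrative values).
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] for each example;
# we keep the probability of class 1.
probabilities = model.predict_proba(X)[:, 1]

# Classify by comparing each probability to a chosen threshold.
threshold = 0.5
predictions = (probabilities >= threshold).astype(int)
print(predictions)
```

Raising or lowering the threshold trades precision against recall: a higher threshold classifies fewer examples as positive, but with more confidence.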

The Basics of Regularization in Machine Learning

Regularization is the set of steps we take to ensure a model does not overfit its training data. There are many ways to do this, but among the most important is penalizing model complexity. Generally: the simpler the model, the less likely we are to overfit.
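One of the most common ways to penalize complexity is L2 regularization, which adds a term proportional to the sum of the squared weights to the loss. A minimal sketch of that penalty (the lambda value and weights are illustrative assumptions):

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam=0.1):
    # Total loss = original data loss + lambda * sum of squared weights.
    # Larger weights (a more complex model) are penalized more heavily.
    complexity_penalty = lam * np.sum(weights ** 2)
    return data_loss + complexity_penalty

weights = np.array([0.2, -1.5, 3.0])
print(l2_regularized_loss(data_loss=0.8, weights=weights))  # 0.8 + 0.1 * 11.29 = 1.929
```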

If loss on the validation data initially goes down, but then starts to rise again after a certain number of iterations (while loss on the training data continues to go down), chances are we’re overfitting.
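In code, that early-stopping check might look something like this sketch (the patience window and loss history are illustrative assumptions):

```python
def should_stop_early(validation_losses, patience=3):
    # Stop if the best loss in the last `patience` iterations is worse
    # than the best loss seen before that window.
    if len(validation_losses) <= patience:
        return False
    best_recent = min(validation_losses[-patience:])
    best_before = min(validation_losses[:-patience])
    return best_recent > best_before

# Validation loss falls, then climbs back up: a classic overfitting signature.
losses = [0.90, 0.70, 0.55, 0.50, 0.53, 0.58, 0.64]
print(should_stop_early(losses))  # True
```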

An Overview of Feature Crosses in Machine Learning

In linear regression problems, some data doesn’t lend itself easily to direct linear solutions. There are many ways to handle such problems, but one particularly valuable one is feature crosses: that is, creating new columns that combine existing features into data that is easier to work with.

For example: let’s suppose you have a data set where examples with both negative x and y values OR both positive x and y values correlate strongly with one label, and examples with one negative and one positive value correlate with the other label. By creating a feature cross for the value of x * y, you ensure all examples whose x and y share a sign produce a positive cross, and all examples with mixed signs produce a negative one. Suddenly, a linear decision boundary becomes a feasible way of separating the labels.
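A sketch of that x * y cross in numpy (the sample points are illustrative assumptions):

```python
import numpy as np

# Four examples, one per quadrant of the (x, y) plane.
# Same-sign quadrants -> label 1; mixed-sign quadrants -> label 0.
X = np.array([[ 2.0,  3.0],   # both positive -> label 1
              [-2.0, -3.0],   # both negative -> label 1
              [ 2.0, -3.0],   # mixed signs   -> label 0
              [-2.0,  3.0]])  # mixed signs   -> label 0
y = np.array([1, 1, 0, 0])

# The feature cross: a new column holding x * y.
cross = X[:, 0] * X[:, 1]
print(cross)  # [ 6.  6. -6. -6.]
```

No straight line through the original (x, y) plane separates these four points, but on the crossed feature a simple threshold (cross > 0) does the job.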

Validation Data

If you separate data into only training data and test data, you risk overfitting your model to the test data if you do multiple rounds of testing and tuning. How can you avoid that risk?

This is where validation data comes in. Validation data is a third set that sits between training and test data: you train on the training set, tune and compare models against the validation set, and save the test set for one final check at the end.
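A common way to carve out all three sets with scikit-learn is to split twice (the 60/20/20 proportions and toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy features
y = np.arange(50) % 2              # toy labels

# First split off the test set, then carve validation data out of the rest.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```

You can iterate against the validation set as often as you like, and touch the test set only once, for the final evaluation.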