Generalization, Overfitting, and Training & Testing Sets

Generalization

In machine learning, generalization refers to a model's ability to make accurate predictions on new data, beyond the data it was trained on. This leads naturally to the topic of overfitting.

Overfitting

If you have a set of data that you are training a model upon, it may be tempting to create very complex solutions that cater to atypical results. For example: instead of drawing simple lines, you may create a model that has all sorts of eccentric squiggles to draw around a handful of atypical cases.

While this will give high accuracy on the data the model was trained on, the added complexity is likely to produce inaccurate predictions when the model is tested against new data. This is called overfitting: creating a model that caters too much to anomalies and therefore proves inaccurate on new data.
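To make this concrete, here is a minimal sketch in plain Python. The data and the helper names (fit_line, fit_squiggle, mean_squared_error) are invented for illustration and do not come from any library. The training points lie on the line y = x except for one atypical case; the "squiggle" is a polynomial forced through every training point, standing in for an overly complex model.

```python
# Training data: points on the line y = x, except one atypical case at x = 2.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 1.0, 5.0, 3.0, 4.0]

# New data the models have never seen, following the underlying trend y = x.
test_x = [0.5, 1.5, 2.5, 3.5]
test_y = [0.5, 1.5, 2.5, 3.5]


def fit_line(xs, ys):
    """Simple model: least-squares straight line, returned as a predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_squiggle(xs, ys):
    """Complex model: Lagrange polynomial passing through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mean_squared_error(model, xs, ys):
    """Average squared gap between the model's predictions and the known answers."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_line(train_x, train_y)
squiggle = fit_squiggle(train_x, train_y)

line_train_mse = mean_squared_error(line, train_x, train_y)
line_test_mse = mean_squared_error(line, test_x, test_y)
squiggle_train_mse = mean_squared_error(squiggle, train_x, train_y)
squiggle_test_mse = mean_squared_error(squiggle, test_x, test_y)

print("line:     train", line_train_mse, "test", line_test_mse)
print("squiggle: train", squiggle_train_mse, "test", squiggle_test_mse)
```

On this toy data the squiggle has zero error on the training points but an error of roughly 3.57 on the new points, while the straight line has an error of about 1.44 on the training points and only about 0.36 on the new ones. The model that looks worse on the training data is the one that generalizes better.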

Generally speaking, you should keep your models simple unless there is a good reason not to. How simple depends on factors such as how much training data you have and how noisy that data is.

Training vs Testing Sets

When you build a model in supervised machine learning, you fit it to data for which you already know the answers: the training set. Once you have developed a model you are confident in, you should then evaluate it against a test set. The difference between the two is that the training set is used to create the model, while the test set is not: it is used solely to determine whether the trained model is accurate. This helps to detect overfitting (and underfitting), and to check that the trained model accurately predicts future data, not just the examples it was trained on.
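The split described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a definitive implementation: the train_test_split helper below is written by hand for this example, the 25% test fraction is an arbitrary choice, and the labeled data is made up (each answer is simply twice its input).

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and hold out a fraction of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled examples become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled data: (input, known answer) pairs where answer = 2 * input.
examples = [(x, 2 * x) for x in range(20)]
train_set, test_set = train_test_split(examples)

# "Train": estimate the slope of y = slope * x from the training set only.
slope = sum(x * y for x, y in train_set) / sum(x * x for x, _ in train_set)

# "Test": measure the average error only on the held-out examples.
test_error = sum(abs(slope * x - y) for x, y in test_set) / len(test_set)

print(len(train_set), "training examples,", len(test_set), "test examples")
print("mean absolute error on the test set:", test_error)
```

The key design point is that the test examples never touch the fitting step, so the error measured on them reflects how the model handles data it has not seen.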
