Regularization – Sparsity

Sparse vectors (that is, vectors where most values are zero) often have a lot of dimensions. When we create feature crosses for those vectors, we create even more dimensions. This can create enormous models that require large amounts of memory to process.

To save memory and reduce noise, we can zero out unhelpful features (that is, drive their weights to exactly zero). This matters most when a feature cross produces many coordinates that carry no signal. For example, if we cross latitude and longitude for the whole world into small blocks and then ask which blocks are likely to have buffalo living in them, we can zero out every block that is entirely ocean. A weight of exactly zero never needs to be stored, so the model's memory footprint shrinks.
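Here is a minimal sketch of that memory saving, contrasting a dense weight vector that is mostly zeros with a sparse representation that stores only the nonzero entries. The vector size and weight values are illustrative, not from the original text.

```python
# Dense storage: one slot per dimension, almost all of them zero.
dense_weights = [0.0] * 1_000_000
dense_weights[42] = 0.7    # e.g. a land block that does contain buffalo
dense_weights[1337] = -0.2

# Sparse storage: keep only the weights that survived being zeroed out.
sparse_weights = {i: w for i, w in enumerate(dense_weights) if w != 0.0}

print(len(dense_weights))   # 1000000 slots stored densely
print(len(sparse_weights))  # 2 entries stored sparsely
```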

Regularization can do this, but L_2 regularization won’t do the job: it encourages weights to be small, yet it rarely drives them to exactly zero.

L_1 regularization encourages unhelpful coefficients to be exactly zero.
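To see the difference concretely, here is a small sketch using scikit-learn’s Lasso (L_1-regularized) and Ridge (L_2-regularized) linear regression. The synthetic data, where only two of ten features carry signal, and the alpha values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features actually matter; the other eight are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

l1 = Lasso(alpha=0.1).fit(X, y)  # L_1 penalty
l2 = Ridge(alpha=0.1).fit(X, y)  # L_2 penalty

# L_1 typically zeros out all eight noise coefficients;
# L_2 leaves them small but nonzero.
print("L_1 zero coefficients:", int(np.sum(l1.coef_ == 0.0)))
print("L_2 zero coefficients:", int(np.sum(l2.coef_ == 0.0)))
```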

How L_2 and L_1 regularization differ

  • L_2 penalizes weight^2, whereas L_1 penalizes |weight|
  • So while the derivative of L_2 is 2 * weight, the derivative of L_1 is a constant, k, whose magnitude is independent of the weight (only its sign follows the sign of the weight).

L_1 works by subtracting a constant amount from the magnitude of every weight on each update. If that subtraction would push a weight across zero, the weight is set to exactly zero.
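A sketch of that update follows: each step moves every weight toward zero by a constant amount (the L_1 penalty’s constant derivative times the learning rate), and any weight that would cross zero is clamped to exactly zero. This is the standard "soft thresholding" step; the weight values and rates below are illustrative.

```python
def l1_step(weights, learning_rate, l1_strength):
    """One gradient-descent step on the L_1 penalty alone."""
    shrink = learning_rate * l1_strength  # constant amount per step
    updated = []
    for w in weights:
        if w > shrink:
            updated.append(w - shrink)   # positive weight moves down
        elif w < -shrink:
            updated.append(w + shrink)   # negative weight moves up
        else:
            updated.append(0.0)          # would cross zero: clamp to 0
    return updated

print(l1_step([0.9, -0.4, 0.05], learning_rate=0.1, l1_strength=1.0))
# [0.8, -0.3, 0.0]
```

The clamp is what produces true sparsity: once a weight hits exactly zero, it stays there unless the data loss pushes it back out.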
