A Brief Overview of Feature Engineering

When we receive data, how do we decide which parts of it are relevant, and how do we groom them into inputs for useful machine learning models? This challenge, which is where much of a machine learning engineer's time is spent, is called feature engineering.

Creating Good Features

Good features have the following qualities:

  • Differing values: A feature whose value is the same for all but a handful of examples carries almost no information, so it is unlikely to lead to useful results (see the sketch after this list)
  • Stable: Values should remain stable over time. For example, a person's birth year never changes, but their address may. Unless we specifically want a snapshot of the person's address at one point in time, we should prefer static data over data that drifts
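
As a quick illustration of the first point, here is a minimal sketch that flags features whose values barely differ across examples. It assumes the data lives in a pandas DataFrame; the column names and threshold are hypothetical:

```python
import pandas as pd

# Hypothetical toy data: "country" takes the same value on every row.
df = pd.DataFrame({
    "birth_year": [1984, 1991, 1975, 2002, 1969],
    "country": ["US", "US", "US", "US", "US"],
})

for column in df.columns:
    # Fraction of rows occupied by the single most common value.
    top_share = df[column].value_counts(normalize=True).iloc[0]
    if top_share > 0.95:
        print(f"'{column}' is nearly constant ({top_share:.0%} one value)")
```

Here only 'country' trips the check; a feature like that can usually be dropped without hurting the model.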

Cleaning Data

Raw data is rarely model-ready; to make it useful, we usually need to transform it. Some common techniques include:

  • Scaling values: If a feature's data arrives in a range of, say, 145-388, it can be hard to reason about its relative scale. Instead, we can rescale it to a standard range such as 1-100, 0-1, or -1 to +1 (sketched below)
  • Managing outliers: Extreme outliers can pull our models in extreme directions. One mitigation is clipping values at a minimum or maximum (say, capping everything above 8, so "greater than 8" effectively becomes its own category). Alternatively, we can represent the data on a logarithmic scale to compress the outliers' impact (both approaches are sketched below)
  • Binning: Sometimes we want to clump numerical values together because the range a value falls into matters more than the value itself. For example, a house's exact latitude may not be particularly interesting, but the latitude band it falls in (e.g., within the bounds of an expensive city) may be. In that case, we can "bin" the data: create a one-hot vector such as [0, 0, 0, 0, 1, 0, 0] that places a continuous value into one of several discrete ranges (sketched after this list)
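
Here is a minimal sketch of the scaling idea, using the 145-388 range from the first bullet; the sample values themselves are made up:

```python
import numpy as np

values = np.array([145.0, 200.0, 310.0, 388.0])  # raw feature in the 145-388 range

# Min-max scaling: (x - min) / (max - min) maps the data onto [0, 1].
scaled_01 = (values - values.min()) / (values.max() - values.min())

# Shifting and stretching the same result gives a [-1, +1] range instead.
scaled_pm1 = scaled_01 * 2 - 1

print(scaled_01)   # [0.    0.226 0.679 1.   ] (approximately)
print(scaled_pm1)  # [-1.   -0.547  0.358  1.  ] (approximately)
```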
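Both outlier treatments fit in a few lines as well. The cap of 8 mirrors the "greater than 8" example above, and the sample values are hypothetical:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 7.0, 250.0])  # 250 is an extreme outlier

# Option 1: clip at a cap, so everything "greater than 8" collapses to 8.
clipped = np.clip(values, a_min=None, a_max=8.0)

# Option 2: a log transform compresses large values; log1p handles zeros safely.
logged = np.log1p(values)

print(clipped)  # [1. 2. 3. 7. 8.]
print(logged)   # [0.69 1.10 1.39 2.08 5.53] (approximately)
```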
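And finally, a minimal sketch of binning that reproduces the [0, 0, 0, 0, 1, 0, 0] vector from the bullet above; the latitude bin edges are made up for illustration:

```python
import numpy as np

# Six hypothetical bin edges define seven latitude bands.
edges = np.array([33.0, 34.0, 35.0, 36.0, 37.0, 38.0])

latitude = 36.2
bin_index = np.digitize(latitude, edges)  # -> 4: falls between 36.0 and 37.0

# One-hot encode the bin: a vector of zeros with a single 1 at the bin's index.
one_hot = np.zeros(len(edges) + 1, dtype=int)
one_hot[bin_index] = 1
print(one_hot)  # [0 0 0 0 1 0 0]
```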
