Data Dependencies

It’s important to consider the short- and long-term health of the data you’re working with. In machine learning environments, be sure to consider necessity, reliability, correlations, versioning, and feedback loops.

Static vs Dynamic Training and Inference

Static training involves training an algorithm in a controlled experiment, with data that doesn’t change. By comparison, dynamic training involves continually learning from data that is constantly coming in. For example: if we were trying to train an algorithm that determined which songs we like, we might take a dump of data, train our algorithm, then create a website that provided those recommendations. In this static system, the underlying recommendation model wouldn’t change; a dynamic version of the same system would keep updating the model as new listening data arrived.
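
A rough sketch of the difference, assuming scikit-learn and an entirely hypothetical stream of song-listening data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical dump of song features and like/dislike labels.
X_initial = np.random.rand(1000, 10)
y_initial = np.random.randint(0, 2, size=1000)

# Static training: fit once on the fixed dump, then never update.
static_model = SGDClassifier()
static_model.fit(X_initial, y_initial)

# Dynamic (online) training: keep updating as new listening data arrives.
dynamic_model = SGDClassifier()
dynamic_model.partial_fit(X_initial, y_initial, classes=[0, 1])

def on_new_listening_data(X_new, y_new):
    # Each new batch nudges the dynamic model; the static model stays frozen.
    dynamic_model.partial_fit(X_new, y_new)
```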

Embeddings & Collaborative Filtering

When dealing with recommendation problems, we’ll often have positive data for only a handful of many, many possible categories. Eg: the number of songs someone has listened to from a library of millions, or the number of words used in a sentence from all words in the dictionary.

In these situations, trying to learn across every possible option would be very expensive in both memory and computation. Instead, we create embeddings, which is to say we map our higher-dimensional data to a lower-dimensional space. So for example: instead of having a binary value for every possible word in a dictionary, we could clump words of similar meaning (“happy”, “joyful”, “upbeat”, etc) together and assign them similar values. We could then map those on a graph, so that “happy” and “joyful” have very similar values, whereas “happy” and “lawn” end up much further apart.
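
A minimal sketch of the idea with numpy, using a toy vocabulary and made-up two-dimensional embedding values (in a real system these would be learned, not hand-picked):

```python
import numpy as np

# Toy vocabulary with hand-picked 2-D embeddings; similar words get similar vectors.
vocab = ["happy", "joyful", "upbeat", "lawn"]
embeddings = np.array([
    [0.90, 0.80],   # happy
    [0.88, 0.79],   # joyful
    [0.85, 0.82],   # upbeat
    [-0.60, 0.10],  # lawn
])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

happy, joyful, lawn = embeddings[0], embeddings[1], embeddings[3]
print(cosine_similarity(happy, joyful))  # close to 1: similar meaning
print(cosine_similarity(happy, lawn))    # much lower: unrelated words
```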

Multi-Class Neural Networks & Softmax

Multi-class neural networks handle classification problems with more than two possible outcomes. So instead of (for example) “spam vs not-spam”, a multi-class problem may involve three, four, five, or thousands of possible categories.

In situations with multiple outcomes, often only one can be the case. Eg: if we’re trying to identify a type of car, it cannot be a Ford AND a Toyota; it can only be one. This is called a “single-label” multi-class classification problem. In these situations, we want to ensure that our probabilities add up to one. (That is: if there’s an 80% chance it’s a Ford, the rest of the possibilities should add up to 20%.) This idea, that each possible label gets a probability and those probabilities sum to 1, is implemented by the softmax function.
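
A minimal softmax implementation in numpy; the raw scores here are arbitrary example values:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max keeps the exponentials numerically stable.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

# Hypothetical raw scores (logits) for three car makes: Ford, Toyota, Honda.
logits = np.array([2.0, 0.5, 0.1])
probs = softmax(logits)

print(probs)        # roughly [0.73, 0.16, 0.11]
print(probs.sum())  # 1.0 -- the probabilities always sum to one
```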

Backpropagation

We train Neural Nets via backpropagation. While the details are complex (and usually handled by our machine learning software), the key point is that it makes gradient descent possible on multi-layered neural networks by carrying the error gradient backwards from the output through each layer’s weights.
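
A minimal from-scratch sketch of what the library is doing under the hood: a tiny two-layer network where the chain rule carries the error gradient back from the output to each layer’s weights. All shapes and numbers here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 examples, 2 features, XOR-like binary labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Two-layer network: 2 inputs -> 4 hidden units -> 1 output.
W1 = rng.normal(scale=0.5, size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
for step in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)       # hidden activations
    y_hat = sigmoid(h @ W2 + b2)   # predictions

    # Backward pass: apply the chain rule layer by layer.
    # Gradient at the output (squared-error loss with a sigmoid output).
    grad_out = (y_hat - y) * y_hat * (1 - y_hat)
    grad_W2 = h.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)

    # Propagate the error back through the hidden layer.
    grad_hidden = (grad_out @ W2.T) * h * (1 - h)
    grad_W1 = X.T @ grad_hidden
    grad_b1 = grad_hidden.sum(axis=0)

    # Gradient descent update.
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1

print(y_hat.round(2))  # should move toward [[0], [1], [1], [0]] as the loss falls
```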

A Basic Overview of Neural Networks

When working with basic nonlinear problems, we can use feature crosses to simplify the data. (Eg: if points where x and y are both negative and points where both are positive belong to the same class, the feature cross x * y turns the problem into a linear one.)
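
A small sketch of that feature cross, using made-up points where the label depends on whether x and y share a sign:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: class 1 when x and y have the same sign, class 0 otherwise.
x = rng.uniform(-1, 1, size=200)
y = rng.uniform(-1, 1, size=200)
labels = (x * y > 0).astype(int)

# In the original (x, y) space no single line separates the classes,
# but the crossed feature x * y does: its sign alone predicts the label.
cross = x * y
predictions = (cross > 0).astype(int)
print((predictions == labels).mean())  # 1.0 -- perfectly separable on x * y
```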

For more difficult nonlinear problems (eg: a spiral or even more random shapes), trying to manually arrange the data into linear patterns becomes increasingly challenging. At some point, we likely want to employ non-linear functions, such as the activation functions in a neural network, to capture these more challenging associations.
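
A sketch of that approach, assuming scikit-learn’s MLPClassifier with ReLU activations on a synthetic two-spiral dataset (the dataset construction below is just one common recipe):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)

def make_spirals(n_points=500, noise=0.05):
    # Two interleaved spirals, one per class.
    t = np.sqrt(rng.uniform(0, 1, n_points)) * 3 * np.pi
    spiral_a = np.column_stack([t * np.cos(t), t * np.sin(t)])
    spiral_b = np.column_stack([t * np.cos(t + np.pi), t * np.sin(t + np.pi)])
    X = np.vstack([spiral_a, spiral_b]) + rng.normal(scale=noise, size=(2 * n_points, 2))
    y = np.array([0] * n_points + [1] * n_points)
    return X, y

X, y = make_spirals()

# Hidden layers with non-linear (ReLU) activations let the network carve out
# the spiral shape that no single feature cross would capture.
model = MLPClassifier(hidden_layer_sizes=(32, 32), activation="relu", max_iter=2000)
model.fit(X, y)
print(model.score(X, y))  # training accuracy; should be close to 1.0
```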

Regularization – Sparsity

Sparse vectors (that is, vectors where most values are zero) often have a lot of dimensions. When we create feature crosses for those vectors, we create even more dimensions. This can create enormous models that require large amounts of memory to process.

To save memory and reduce noise, we can zero-out features (that is, reduce their weights to zero). This is especially useful where feature crosses produce a lot of data that is likely to be zero. Eg: if we cross latitude and longitude for the whole world into small blocks, then try to determine which of those blocks are likely to have buffalo living in them, we can zero out any block that is entirely ocean. By dropping those weights to zero, we avoid having to store them in memory.
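
One common way to get those zero weights is L1 regularization, which pushes uninformative weights to exactly zero. A rough sketch with scikit-learn, using random stand-in features rather than real crossed latitude/longitude blocks:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Stand-in for a wide, sparse crossed-feature matrix: 1,000 features,
# only the first 5 actually carry any signal about the label.
X = rng.normal(size=(500, 1000))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# The L1 penalty drives most weights to exactly zero, so they need not be stored.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

nonzero = np.count_nonzero(model.coef_)
print(f"{nonzero} of {model.coef_.size} weights are non-zero")
```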