Collaborative Filtering
Collaborative filtering is the prediction of a user's interests based on the interests of other users. For example: recommending Netflix shows.
To recommend shows, we'll need to determine which shows are similar to each other. We could, for example, place the shows along a single axis running from "depressing" to "upbeat". That would be a one-dimensional embedding. We could add a second dimension – say, "slow-moving" to "action-packed" – and see how close shows are to each other across those two dimensions. We could continue adding dimensions and, across however many dimensions we have, compare proximity. Note that while the dimensions listed above have explicit meanings ("depressing", "upbeat", etc.), we can also include latent dimensions: traits that aren't explicitly named, but are inferred from the data itself without a human-specified meaning.
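The two-dimensional version of this idea can be sketched in a few lines. The titles and coordinates below are made up for illustration (each axis scaled to the range -1 to 1), and Euclidean distance is just one reasonable choice of proximity measure:

```python
import math

# Hypothetical 2-D embeddings: (depressing -> upbeat, slow-moving -> action-packed).
# The titles and values are illustrative, not real data.
shows = {
    "Requiem for a Dream": (-0.9, -0.4),
    "Manchester by the Sea": (-0.8, -0.7),
    "Mad Max: Fury Road": (0.5, 0.9),
    "Paddington 2": (0.9, 0.1),
}

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(title):
    """Return the nearest other title in embedding space."""
    return min(
        (other for other in shows if other != title),
        key=lambda other: distance(shows[title], shows[other]),
    )

print(most_similar("Requiem for a Dream"))  # -> Manchester by the Sea
```

Adding more dimensions changes nothing structurally: the tuples just get longer, and the same distance function still applies.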
Embeddings
When dealing with recommendation problems, we'll often have positive data for only a handful of many, many possible categories. For example: the number of songs someone has listened to from a library of millions, or the number of words used in a sentence out of all the words in the dictionary.
In these situations, trying to learn across every possible option would be expensive in both memory and computation. Instead, we create embeddings, which is to say we map our higher-dimensional data to a lower-dimensional space. For example: instead of having a binary value for every possible word in a dictionary, we could clump words of similar meaning ("happy", "joyful", "upbeat", etc.) together and assign them similar values. We could then map those onto a graph, so that "happy" and "joyful" have very similar values, whereas "happy" and "lawn" are farther apart.
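As a sketch of this, here are made-up 3-dimensional vectors standing in for what would otherwise be thousands of one-hot dimensions, compared with cosine similarity (a standard closeness measure for embeddings; the specific numbers are assumptions for illustration):

```python
import math

# Hypothetical 3-D embeddings. In a one-hot encoding each word would instead
# occupy its own dimension out of the entire vocabulary.
embeddings = {
    "happy":  [0.90, 0.80, 0.10],
    "joyful": [0.85, 0.75, 0.15],
    "lawn":   [0.10, 0.20, 0.90],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_close = cosine_similarity(embeddings["happy"], embeddings["joyful"])
sim_far = cosine_similarity(embeddings["happy"], embeddings["lawn"])
print(sim_close > sim_far)  # -> True: "happy" sits nearer "joyful" than "lawn"
```

The memory savings are the point: a vocabulary of a million words needs million-length one-hot vectors, but might be represented usefully with embeddings of a few hundred dimensions.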
Determining Embeddings
While the above principle may sound promising, the question remains: how do we determine whether (for example) two words are conceptually close to each other? In the case of words, word2vec is a Google library that uses the distributional hypothesis: the idea that words which tend to appear surrounded by the same neighboring words tend to be similar in meaning. For example: "entree" and "main course" are often found near "menu", so they should be considered conceptually similar.