Andy Sloane

@a1k0n

http://a1k0n.net

Madison Big Data Meetup

Jan 27, 2015

- Recommendations
- Related Artists
- Radio

Okay, but how do we come up with recommendations?

Collaborative filtering!

Great, but how does that actually work?

- Each time a user plays something, add it to a matrix
- Compute similarity, somehow, between items based on who played what

- So compute some distance between every pair of rows and columns
- That's just O($\frac{{60M}^2}{2}$) = O($1.8\times 10^{15}$) operations... O_O
- We need a better way...

(BTW: Twitter has a decent
approximation that can actually make this work, called DIMSUM:

https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum)

I've tried it but don't have results to report here yet :(

https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum)

I've tried it but don't have results to report here yet :(

Instead, we use a "small" representation for each user & item: $f$-dimensional vectors

(here, $f=2$)

and approximate the big matrix with it.

- Very compact representation of musical style or user's taste
- Only like 40-200 elements (2 shown above for illustration)

Hu, Koren, Volinsky - *Collaborative Filtering for Implicit Feedback Datasets*

Tries to predict whether user $u$ listens to item $i$:

\[P = \left( \begin{array}{cccc} 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{array} \right) \approx \left( \begin{array}{ccc} & X & \end{array} \right) \left( \begin{array}{c} \\ Y^T \\ \\ \end{array} \right) \]

$Y$ is all item vectors, $X$ is all user vectors

"implicit" because users don't *tell us* what they like, we only observe what they do/don't listen to

- $x_u$ — user $u$'s vector
- $y_i$ — item $i$'s vector
- $p_{ui}$ — 1 if user $u$ played item $i$, 0 otherwise
- $c_{ui}$ — "confidence", ad-hoc weight based on number of times user $u$ played item $i$; e.g., $1 + \alpha \cdot \tt{plays}_{ui}$
- $\lambda$ — regularization penalty to avoid overfitting

Minimize:
\[ \sum_{u,i} c_{ui} \left( p_{ui} - x_{u}^{T} y_{i} \right)^2 + \lambda \left(\sum_u ||x_u||^2 + \sum_i ||y_i||^2 \right) \]

- $Y^T Y$ = $f$ x $f$ matrix, sum of outer products of all items
- $Y^T (C^u - I) Y$ same, except only items the user played
- $Y^T C^u p_u$ = weighted $f$-dimensional sum of items the user played

Key point: each iteration is linear in size of input, even though we are solving for all users x all items, and needs only $f^2$ memory to solve

No learning rates, just a few tunable parameters ($f$, $\lambda$, $\alpha$)

All you do is add stuff up, solve an $f$x$f$ matrix problem, and repeat!

We use $f = 40$ dimensional vectors for recommendations

Matrix/vector math using numpy in Python, breeze in scala

- Problem: any user (
**60M**) can play any item (**4M**)- thus we may need to add any user's vector to any item's vector

- If we put user vectors in memory, it takes a lot of RAM!
- Worst case: 60M users * 40 dimensions * sizeof(float) = 9.6GB of user vectors
- ...too big to fit in a mapper slot on our cluster

Most recent run made a 14 x 112 grid

e.g., if K = 4, mapper #1 gets users 1, 5, 9, 13, ...

```
def mapper(self, input): # Luigi-style python job
user, item, count = parse(input)
conf = AdHocConfidenceFunction(count) # e.g. 1 + alpha*count
# add up user vectors from previous iteration
term1 = conf * self.user_vectors[user]
term2 = np.outer(user_vectors[user], user_vectors[user])
* (conf - 1)
yield item, np.array([term1, term2])
def reducer(self, item, terms):
term1, term2 = sum(terms)
item_vector = np.solve(
self.YTY + term2 + self.l2penalty * np.identity(self.dim),
term1)
yield item, item_vector
```

Then flip users ↔ items and repeat!
- For each user, how do we find the best items given their vector?
- Brute force is O(60M x 4M x 40) = O(9 peta-operations)!
- Instead, use an approximation based on locality sensitive hashing (LSH)

- Pre-built read-only database of item vectors
- Internally, recursively splits random hyperplanes
- Nearby points likely on the same side of random split
- Builds several random trees (a forest) for better approximation

- Given an $f$-dimensional query vector, finds similar items in database
- Index loads via
`mmap`

, so all processes on the same machine share RAM - Queries are very, very fast, but approximate
- Python implementation available, Java forthcoming

- Annoy index for all items is only 1.2GB
- I have one on my laptop... Live demo!
- Could serve up nearest neighbors at load time, but we precompute Discover on Hadoop

- Send annoy index in distributed cache, load it via
`mmap`

in map-reduce process - Reducer loads vectors + user stats, looks up ANN, generates recommendations.

- Great for music discovery
- Essential for finding believable reasons for latent factor-based recommendations
- When generating recommendations, run through a list of related artists to find potential reasons

- Cosine is similar to dot product; just add a normalization step
- Helps "factor out" popularity from similarity

- Similar to user recommendations, but with more models, not necessarily collaborative filtering based
- Implicit Matrix Factorization (shown previously)
- "Vector-Exp", similar model but probabilistic in nature, trained with gradient descent
- Google word2vec on playlists
- Echo Nest "cultural similarity" — based on scraping web pages about music!

- Query ANNs to generate candidates
- Score candidates from all models, combine and rank
- Pre-build table of 20 nearest artists to each artist

- For each track, generate candidates with ANN from each model
- Score w/ all models, rank with ensemble
- Store top 250 nearest neighbors in a database (Cassandra)
- User plays radio → load 250 tracks and shuffle
- Thumbs up → load more tracks from the thumbed-up song
- Thumbs down → remove that song / re-weight tracks

- ~
**1500**Echo Nest Musical Fingerprints per track - Min-Hash based matching to accelerate all-pairs similarity
- Fast connected components using Hash-to-Min algorithm - $O(\log d)$ mapreduce steps

http://arxiv.org/pdf/1203.5387.pdf

I can be reached here:

Andy Sloane

Email: andy@a1k0n.net

Twitter: @a1k0n

http://a1k0n.net

Special thanks to Erik Bernhardsson, whose slides I plagiarized mercilessly