Item-Based Collaborative Filtering
Why Item-Based Over User-Based?
If you've ever wondered how Amazon or other online platforms suggest products or movies that you might actually be interested in, then you've experienced collaborative filtering. While user-based collaborative filtering is common, it has its drawbacks: users can be unpredictable, their preferences change, and let's not forget, there are just too many users to compare. That's where item-based collaborative filtering shines. Unlike human tastes, the attributes of an item like a book or a movie remain consistent over time. Plus, it's usually less computationally intensive, since there tend to be far fewer items than users and therefore fewer pairs to compare. Not to mention, it's less susceptible to manipulation by fake user profiles (so-called shilling attacks).
Additionally, a general technique for resisting shilling attacks is to base the signal on purchases, where people actually spend money, rather than on what they merely viewed or clicked on. This gives more reliable results.
How Does It Work?
It's very similar to user-based collaborative filtering, but instead of users, we're looking at items. Imagine we're talking about anime recommendations. The first step in item-based filtering is to look for pairs of anime watched by the same person. We then measure how similar those anime are based on the ratings users gave them. Simply put, if Alice and Bob both liked (or both disliked) "Demon Slayer" and "Attack on Titan", these two anime probably share some sort of similarity.
Now, let's say a new user, Carl, watches "Demon Slayer" and loves it. Given the established similarity between this anime and "Attack on Titan", the system can recommend "Attack on Titan" to Carl with some confidence.
In this way, the focus shifts from analyzing user-to-user relationships to exploring item-to-item similarities, making the recommendations more stable and efficient.
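The idea can be sketched with a tiny, made-up ratings table. Everything here is hypothetical: the users Dana and Erin, the third title, the rating values, and the `recommend` helper are illustrative only, and Pearson correlation is just one possible similarity measure.

```python
import pandas as pd

# Hypothetical ratings (1-5); rows are users, columns are anime titles.
ratings = pd.DataFrame(
    {
        "Demon Slayer": [5, 4, 1, 2],
        "Attack on Titan": [5, 5, 2, 1],
        "K-On!": [1, 2, 5, 4],
    },
    index=["Alice", "Bob", "Dana", "Erin"],
)

# Pearson correlation between item columns gives an item-to-item similarity matrix.
item_similarity = ratings.corr()


def recommend(liked_title, similarity, top_n=1):
    """Return the titles most similar to one the user liked (illustrative helper)."""
    scores = similarity[liked_title].drop(liked_title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


# Carl loved "Demon Slayer", so look up its nearest neighbour among the items.
print(recommend("Demon Slayer", item_similarity))
```

Note that Carl never appears in the ratings table: the recommendation comes entirely from how other users rated the two titles.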
Python Implementation
Building a movie recommender system in Python can be straightforward. This section uses Python and real-world data from the MovieLens project to demonstrate how. We focus on item-based collaborative filtering, a technique that finds similarities between movies based on user ratings. In essence, it's a "people who liked this also liked..." system.
Data Preparation
We start by using Pandas to import a data file from MovieLens, resulting in a DataFrame. This DataFrame holds rows representing users (user_id) and their ratings (rating) for movies (movie_id, title).
Here's how a typical DataFrame looks:
movie_id | title | user_id | rating |
---|---|---|---|
1 | Toy Story (1995) | 308 | 4 |
1 | Toy Story (1995) | 287 | 5 |
1 | Toy Story (1995) | 148 | 4 |
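A DataFrame of this shape could be built roughly as follows. This is a minimal sketch, not the exact loading code: the file layout (tab-separated ratings, pipe-separated titles) and the variable names are assumptions, and in-memory strings stand in for the real MovieLens files so the example runs on its own.

```python
import io

import pandas as pd

# Stand-ins for the MovieLens files; in practice these would be file paths.
# Assumed layout: tab-separated ratings and pipe-separated movie titles.
ratings_file = io.StringIO("308\t1\t4\n287\t1\t5\n148\t1\t4\n")
movies_file = io.StringIO("1|Toy Story (1995)\n")

raw_ratings = pd.read_csv(
    ratings_file, sep="\t", names=["user_id", "movie_id", "rating"]
)
movies = pd.read_csv(movies_file, sep="|", names=["movie_id", "title"])

# Attach each rating to its movie title via the shared movie_id column.
ratings = pd.merge(movies, raw_ratings, on="movie_id")
print(ratings)
```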
Next, we reorganize this data into a matrix form using the pivot_table function. Each row corresponds to a user, and each column to a movie title. If a user has rated a movie, the corresponding cell shows the rating; otherwise it holds NaN.
Here's how a part of the resulting DataFrame looks:
title▶ user_id▼ | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) |
---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN |
1 | 4.0 | NaN | 2.0 | 5.0 | NaN |
2 | NaN | NaN | 3.0 | NaN | NaN |
3 | NaN | 5.0 | NaN | NaN | 2.0 |
4 | NaN | NaN | NaN | NaN | NaN |
What we end up with here is a sparse matrix that contains every user and every movie, with a rating value at every intersection where a user rated a movie. We can easily extract a vector of every movie a given user rated (a row), and we can just as easily extract a vector of every user who rated a given movie (a column). So, this pivot table is useful for both user-based and item-based collaborative filtering. To find relationships between users, we could look at correlations between the user rows; to find relationships between movies, as item-based collaborative filtering requires, we look at correlations between the movie columns, based on user behavior.
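The pivot step might look like the sketch below. The `index`, `columns`, and `values` arguments are an assumption about how the call is set up; the handful of rows is taken from the sample table above rather than from the full dataset.

```python
import pandas as pd

# A few rows in the long format (one row per rating), matching the sample table.
ratings = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3],
        "title": [
            "101 Dalmatians (1996)",
            "12 Angry Men (1957)",
            "101 Dalmatians (1996)",
            "187 (1997)",
        ],
        "rating": [2, 5, 3, 2],
    }
)

# One row per user, one column per title; unrated cells become NaN.
movieRatings = ratings.pivot_table(index="user_id", columns="title", values="rating")
print(movieRatings)

# A column is a movie's vector of ratings; a row is a user's vector of ratings.
dalmatians_column = movieRatings["101 Dalmatians (1996)"]
user_1_row = movieRatings.loc[1]
```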
Finding Similarities
We'll then use Pandas' corrwith function to find how similar the ratings for "Star Wars (1977)" are to all other movies (columns).
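starWarsRatings = movieRatings['Star Wars (1977)']  # every user's rating of Star Wars
similarMovies = movieRatings.corrwith(starWarsRatings)  # correlate each movie column with it
similarMovies = similarMovies.dropna().sort_values(ascending=False)
df = pd.DataFrame(similarMovies)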
We drop all the NaN values so that we keep only the movies whose similarity to Star Wars can actually be computed, which requires more than one person to have rated both movies, and we construct a new DataFrame from the results. We also sort by similarity score, so the top of the list should hold the movies most similar to Star Wars. This yields a list of movies along with their correlation scores with "Star Wars (1977)".
title
No Escape (1994) 1.000000
Man of the Year (1995) 1.000000
Hollow Reed (1996) 1.000000
Commandments (1997) 1.000000
Cosi (1996) 1.000000
Stripes (1981) 1.000000
Golden Earrings (1947) 1.000000
Mondo (1996) 1.000000
Line King: Al Hirschfeld, The (1996) 1.000000
Outlaw, The (1943) 1.000000
Hurricane Streets (1998) 1.000000
Scarlet Letter, The (1926) 1.000000
Safe Passage (1994) 1.000000
Good Man in Africa, A (1994) 1.000000
Full Speed (1996) 1.000000
Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991) 1.000000
Star Wars (1977) 1.000000
Ed's Next Move (1996) 1.000000
Twisted (1996) 1.000000
Beans of Egypt, Maine, The (1994) 1.000000
Last Time I Saw Paris, The (1954) 1.000000
Maya Lin: A Strong Clear Vision (1994) 1.000000
Designated Mourner, The (1997) 0.970725
Albino Alligator (1996) 0.968496
Angel Baby (1995) 0.962250
Prisoner of the Mountains (Kavkazsky Plennik) (1996) 0.927173
The results we got weren't the best: the list is full of obscure recommendations, many of them with a perfect correlation of 1.0. Clearly, some improvements need to be made.
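One likely reason for the obscure titles is that a movie sharing only two raters with Star Wars will always get a correlation of exactly +1 or -1 whenever the correlation is defined. The sketch below illustrates this with made-up ratings; the titles "Popular Movie" and "Obscure Movie", the values, and the `ratingCounts` name are all hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are movies, NaN means "not rated".
movieRatings = pd.DataFrame(
    {
        "Star Wars (1977)": [5, 4, 5, 3, 4],
        "Popular Movie": [4, 4, 5, 2, 3],
        "Obscure Movie": [np.nan, 3, 4, np.nan, np.nan],
    }
)

starWarsRatings = movieRatings["Star Wars (1977)"]
similarMovies = movieRatings.corrwith(starWarsRatings).dropna()

# How many users rated each movie; the obscure one has only two ratings.
ratingCounts = movieRatings.count()

print(pd.DataFrame({"similarity": similarMovies, "num_ratings": ratingCounts}))
```

Here the movie rated by only two users outranks the one rated by everyone, even though two ratings say very little about real similarity.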