Ever wondered how TikTok's For You Page seems to know exactly which kinds of videos you'll be most interested in? I'll explain how recommendation systems, like TikTok's algorithm, work under the hood.
Quick disclaimer - I don't work at TikTok, so I'm open to corrections and disagreement. However, I know a lot about programming, and a bit about machine learning, so I think I have a pretty good idea of what's going on.
So, what is an algorithm?
So, what do I mean by "algorithm" in the first place? In math or computer science, an algorithm is kind of like a cooking recipe. You have a list of steps that, when performed together, complete some process. In the case of social media apps, like TikTok or Instagram, usually that process is one that decides which posts you'll see in your feed.
Since algorithms are just a list of steps, they're reproducible. So, no matter who's using the app, TikTok can accurately recommend them videos. To do this effectively, though, they need a lot of data.
How does TikTok use data?
To understand the role of data in TikTok's For You Page, you need to understand its goal in the first place. TikTok wants to maximize your time in the app, so they can show you more ads, and make more money. Every second you spend watching, looping, liking, commenting on, dueting, or favoriting a video is another second spent on TikTok. This is why the For You Page shows you videos you're most likely to engage with. For this to work, they analyze the correlation between watch time and a number of other factors. In other words - the algorithm is fueled by data.
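To make "analyze the correlation" a bit more concrete, here is a minimal sketch that computes a Pearson correlation between one made-up factor (video length) and watch time. The numbers and the choice of factor are assumptions for illustration only. I don't know what TikTok actually measures.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: video length in seconds vs. seconds actually watched.
video_length = [10, 15, 30, 45, 60]
watch_time = [9, 14, 22, 25, 20]

r = pearson(video_length, watch_time)
print(f"correlation: {r:.2f}")
```

A value near 1 or -1 would suggest the factor is worth feeding into a model, while a value near 0 suggests it tells you little on its own.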
Your watch history is the best indicator of which videos you'll like in the future. Have you noticed that when you like a video with a specific sound, you start seeing more videos with that sound in your feed? Or how when you like a video with a specific hashtag, that hashtag starts to appear more and more? Well, it isn't by accident. I believe TikTok uses at least the following data points to build a profile of your interests:
- Watch history - What topics or hashtags do you watch? Which creators? Which sounds? How long are the videos you like? Where were those videos shot? Do you like duets? Stitches? Etc.
- Engagement - Which videos have you engaged with beyond just watching?
- Following - Who do you follow, and who follows them? Who do they follow? What videos do they like?
- Demographic information - Your age, location, and so on.
- Usage times - What times do you usually use the app?
- Session length - How long do you usually use the app?
Just those features are enough to build a pretty clear picture of your interests.
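As a rough sketch of what "building a profile" from those data points could look like, the class below keeps running tallies of the hashtags, sounds, and creators a user watches. The class name, the field names, and the weighting scheme are all my own assumptions, not anything TikTok has published.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class InterestProfile:
    """Hypothetical per-user profile built from viewing events."""
    hashtags: Counter = field(default_factory=Counter)
    sounds: Counter = field(default_factory=Counter)
    creators: Counter = field(default_factory=Counter)

    def record_view(self, video, watched_fraction, liked=False):
        # Weight each signal by how much of the video was watched,
        # plus an extra point for an explicit like (arbitrary weights).
        weight = watched_fraction + (1.0 if liked else 0.0)
        for tag in video["hashtags"]:
            self.hashtags[tag] += weight
        self.sounds[video["sound"]] += weight
        self.creators[video["creator"]] += weight

profile = InterestProfile()
profile.record_view(
    {"hashtags": ["tech", "iphone"], "sound": "s1", "creator": "alice"}, 1.0, liked=True
)
profile.record_view({"hashtags": ["cooking"], "sound": "s2", "creator": "bob"}, 0.25)
profile.record_view({"hashtags": ["tech"], "sound": "s1", "creator": "carol"}, 0.5)

print(profile.hashtags.most_common(2))
```

The more a hashtag or sound shows up in videos you finish or like, the higher it climbs in the tally, which matches the "like one video, see more like it" behavior described above.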
If you're reading this article, you're probably pretty smart, so you've likely noticed there's a bit of a chicken-and-egg problem here. What if you're new to TikTok? What if they don't have data on your interests yet? How can they figure out what to recommend to you?
TikTok can group users based on similarities. Then, they suggest videos that others in your cohort liked, because you'll probably like them too. For example, if other fans of tech content have liked a video about the iPhone, expect to see it in your feed.
There is an edge case here, though. What if you're the only user on TikTok, or there's no one similar to you? At that point, you might as well be served random videos.
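Here is a minimal sketch of that idea: recommend whatever the new user's cohort liked most, and fall back to random picks when there is no cohort data. The function name, the video ids, and the way the cohort is represented are assumptions made up for this example.

```python
import random
from collections import Counter

def recommend_for_new_user(cohort_likes, all_videos, seen, k=3, rng=random):
    """Recommend the videos liked most often by the user's cohort.

    cohort_likes: list of sets, one set of liked video ids per cohort member.
    Falls back to a random sample when the cohort has no usable likes.
    """
    counts = Counter(v for likes in cohort_likes for v in likes if v not in seen)
    if counts:
        return [video for video, _ in counts.most_common(k)]
    unseen = [v for v in all_videos if v not in seen]
    return rng.sample(unseen, min(k, len(unseen)))

all_videos = ["iphone_review", "cat_video", "laptop_tips", "dance", "recipe"]
tech_cohort = [{"iphone_review", "laptop_tips"}, {"iphone_review"}, {"iphone_review", "dance"}]

# Everyone in the cohort liked the iPhone video, so it comes first.
print(recommend_for_new_user(tech_cohort, all_videos, seen=set(), k=1))
# No cohort at all: the user gets random unseen videos.
print(recommend_for_new_user([], all_videos, seen={"dance"}, k=2))
```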
How is it implemented?
We understand what data TikTok is using, but how is everything tied together? How can data be connected to actual videos? Machine learning.
Machine learning, in simplest terms, is using data to make a computer improve at a task. The most common techniques for this today are based on statistics, and try to identify patterns in data. You're probably most familiar with recommendation systems, like Amazon's "Buy it with" section, that predict which items you're most likely to interact with.
The group data technique I described earlier, where your data is combined with that of similar users, is called collaborative filtering. Clustering algorithms, like k-means clustering, are one way to actually group users. Once a user is in a cluster, their individual preferences can be weighted against the group's.
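To show what that looks like in practice, here is a small, dependency-free sketch of k-means plus a blending step. Representing each user as a two-number interest vector, and the 0.7 weight given to the individual, are assumptions I picked for the example. A real system would use far more dimensions and a tuned library implementation.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, cluster index for each point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

def blend(user_vector, group_vector, user_weight=0.7):
    """Weight an individual's preferences against their cluster's centroid."""
    return [user_weight * u + (1 - user_weight) * g
            for u, g in zip(user_vector, group_vector)]

# Hypothetical users as [tech interest, cooking interest], each from 0 to 1.
users = [[0.9, 0.1], [0.8, 0.2], [0.95, 0.05], [0.1, 0.9], [0.2, 0.8], [0.05, 0.95]]
centroids, assignments = kmeans(users, k=2)

print(assignments)
print(blend(users[0], centroids[assignments[0]]))
```

A brand new user would start with a low `user_weight`, leaning on the group, and the weight could rise as more of their own data comes in.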
It's even more complex than it seems
If you think you could churn out a recommendation system like TikTok's overnight, you're wrong. The process I've outlined is an extreme simplification of the real one. To show just how complex these systems can be, let's take a look at another popular recommendation system - Netflix.
Netflix's home page is also driven by a recommendation algorithm. However, the way theirs works is not as much of a secret. In fact, since 2012, Netflix has been documenting how they tune their algorithm on their tech blog. It's full of really interesting reads, so if you're interested in machine learning, I definitely recommend (ba dum tss) reading it. In one post, "Netflix Recommendations: Beyond the 5 Stars (Part 2)," Xavier Amatriain and Justin Basilico explain some of the data Netflix uses to train their models. Movie ratings, popularity, social data, demographics, temporal data, and data on how users interact with suggestions are all used to make their recommendation system more accurate. It might sound like a no-brainer, but when you consider that some of these features have billions of data points, it's not so straightforward how to analyze them.
Another issue is speed. A complex algorithm like this, running over a large dataset, is gonna be slow. When your entire platform is built around a short attention span, this is a hard problem to solve.
During training, it's okay to have the entirety of your database present. But in production, if a user has to wait for your backend to sort through millions of items before returning a response, people are just gonna leave.
Netflix's tech blog also has an article about this. I'm not gonna reiterate everything in full, so I recommend reading it yourself. The TLDR, though, is that in order to keep the service snappy for users, you have to strike a careful balance between which computations are done online and offline. No matter what, you won't end up with a perfect solution, so be mindful of the tradeoffs you make.
So, how can you speed this process up in your own projects? I suggest computing some of this data ahead of time. While the computations themselves are slow at scale, data retrieval is extremely fast.
For example, if you run a blog, and you know that people from 18-24 years old tend to read your articles about TikTok more than any other group, store this in your database, so that at runtime, instead of having to pick through all of the articles, you can run your model only on that subset relevant to the user.
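A sketch of that idea, with the precomputed table hard-coded as a dictionary. The table contents, the topic names, and the function name are all hypothetical. In a real project the table would be refreshed by an offline job and read from your database.

```python
# Hypothetical precomputed table: which topics each age group reads most.
TOP_TOPICS_BY_AGE_GROUP = {
    "18-24": ["tiktok", "machine-learning"],
    "25-34": ["career", "machine-learning"],
}

ARTICLES = [
    {"id": 1, "topic": "tiktok"},
    {"id": 2, "topic": "career"},
    {"id": 3, "topic": "machine-learning"},
    {"id": 4, "topic": "gardening"},
]

def candidate_articles(age_group, articles=ARTICLES):
    """Narrow the article list before running the (slow) model on it."""
    topics = TOP_TOPICS_BY_AGE_GROUP.get(age_group)
    if topics is None:
        return list(articles)  # Unknown group: fall back to everything.
    return [a for a in articles if a["topic"] in topics]

print([a["id"] for a in candidate_articles("18-24")])  # prints [1, 3]
```

The expensive model then only has to score two articles instead of four. The saving is trivial here, but it grows with the size of the catalog.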
Another approach is to simply compute each user's recommendations ahead of time. For something like Netflix or TikTok, where scale is huge, this isn't viable, but for the average blog, with only a few hundred articles and users combined, it might not be such a bad idea.
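For a small site, that batch job could be as simple as the sketch below. The `score` function is a stand-in for a trained model, and the user and article records are made up for the example.

```python
def score(user, article):
    """Stand-in for a trained model: counts topics shared with the user's interests."""
    return len(set(user["interests"]) & set(article["topics"]))

def precompute_recommendations(users, articles, top_n=2):
    """Offline batch job: score every (user, article) pair once and keep the best."""
    table = {}
    for user in users:
        ranked = sorted(articles, key=lambda a: score(user, a), reverse=True)
        table[user["id"]] = [a["id"] for a in ranked[:top_n]]
    return table

users = [
    {"id": "u1", "interests": ["tiktok", "ml"]},
    {"id": "u2", "interests": ["cooking"]},
]
articles = [
    {"id": "a1", "topics": ["tiktok", "ml"]},
    {"id": "a2", "topics": ["cooking"]},
    {"id": "a3", "topics": ["ml"]},
]

RECOMMENDATIONS = precompute_recommendations(users, articles)

# At request time, serving a recommendation is just a dictionary lookup.
print(RECOMMENDATIONS["u1"])
```

The cost is that the table goes stale between runs, so how often you rerun the job is one of the online/offline tradeoffs mentioned earlier.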
Could you build your own TikTok?
Obviously, if you want to create something of TikTok's scale, you need a big team, and a lot of money, but what if you just want to create a recommendation system for a smaller site or app? How can you build your own algorithm, like the For You Page?
Let's revisit my previous example about blog article recommendation systems.
First, you need to start with data. It's up to you how in-depth you go with the features you collect, but your best bet is to store everything in a database. You should probably stick with an SQL database, as its strongly typed schema forces you to keep some sort of structure in your data. And if you have very clean, structured data from the beginning, it'll be a lot easier to analyze it later on.
For example, you could track whether users were subscribed, what time they visited your site, how long they stayed before leaving, which articles they read, how much of each article they read, whether they found articles through email, search, or existing recommendations, how popular the articles were, and how recent the articles they read were. We can also group the users together based on similarities, and then account for whether people in the group clicked on a given article.
Next, you have to pick a metric you want to optimize. In this example, we would maximize the amount of time spent on the site. So, we could look at where readers came from, whether they were subscribed, and how recent the articles shown were, and then model those features against time on the site.
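One possible schema for tracking those features, sketched with Python's built-in sqlite3 module. The table and column names are my own choices, and the inserted rows are dummy data. Adapt both to whatever your site already stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # Use a file path for a real site.
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    is_subscribed INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE articles (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    published_at TEXT NOT NULL
);
CREATE TABLE article_reads (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    article_id INTEGER NOT NULL REFERENCES articles(id),
    read_at TEXT NOT NULL,
    seconds_on_page REAL NOT NULL,
    fraction_read REAL NOT NULL CHECK (fraction_read BETWEEN 0 AND 1),
    source TEXT NOT NULL CHECK (source IN ('email', 'search', 'recommendation'))
);
""")

# Dummy rows, purely for illustration.
conn.execute("INSERT INTO users (id, is_subscribed) VALUES (1, 1)")
conn.execute(
    "INSERT INTO articles (id, title, published_at) VALUES (1, 'Example article', '2000-01-01')"
)
conn.execute(
    "INSERT INTO article_reads "
    "(user_id, article_id, read_at, seconds_on_page, fraction_read, source) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    (1, 1, "2000-01-02T09:30:00", 240.0, 0.8, "email"),
)

# Article popularity falls out of the read log with a simple aggregate.
popularity = conn.execute(
    "SELECT article_id, COUNT(*) FROM article_reads GROUP BY article_id"
).fetchall()
print(popularity)
```

The CHECK constraints are where the "forced structure" pays off: a row with an unknown source or an impossible read fraction is rejected at insert time instead of polluting your training data later.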
Next, we need to train our model on both positive and negative examples. We could track which recommendations they clicked, and which ones they didn't. Once you have a dataset, you're ready to build and train your model.
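Here is a rough sketch of turning a log of shown recommendations into labeled examples. The field names and the three features are assumptions chosen for the example, not a required format.

```python
def build_training_examples(impressions, clicks):
    """Turn logged impressions into (features, label) pairs.

    impressions: list of dicts describing each recommendation that was shown.
    clicks: set of (user_id, article_id) pairs that were clicked.
    Label 1 is a positive example (clicked), 0 is a negative one (shown, ignored).
    """
    examples = []
    for imp in impressions:
        features = [
            1.0 if imp["is_subscribed"] else 0.0,
            1.0 if imp["source"] == "email" else 0.0,
            imp["article_age_days"] / 30.0,  # Age in months, to keep values small.
        ]
        label = 1 if (imp["user_id"], imp["article_id"]) in clicks else 0
        examples.append((features, label))
    return examples

impressions = [
    {"user_id": 1, "article_id": 10, "is_subscribed": True, "source": "email", "article_age_days": 3},
    {"user_id": 1, "article_id": 11, "is_subscribed": True, "source": "email", "article_age_days": 60},
    {"user_id": 2, "article_id": 10, "is_subscribed": False, "source": "search", "article_age_days": 3},
]
clicks = {(1, 10)}

examples = build_training_examples(impressions, clicks)
print(examples)
```

Note that a negative example only counts if the recommendation was actually shown. An article the user never saw tells you nothing about what they'd ignore.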
If you don't have a lot of machine learning experience, I recommend using Keras, or TensorFlow 2, which includes Keras, to build this out. Once you've trained your model, it can be integrated into the rest of your system. For example, if you have a server-side API that produces a list of the next articles to show the user, you can call your trained model there.
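To show the overall train-then-serve shape without pulling in a heavy dependency, the sketch below swaps Keras for a hand-written logistic regression, which is the same model you'd get from a single Keras dense unit with a sigmoid activation. The feature layout, the training data, and the `next_articles` function are all hypothetical.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def predict(weights, bias, features):
    """Predicted probability that the user clicks."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train(examples, epochs=500, learning_rate=0.5):
    """Fit logistic regression with stochastic gradient descent on log loss."""
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical features: [is_subscribed, article_age_in_months].
examples = [
    ([1.0, 0.1], 1), ([1.0, 0.2], 1), ([0.0, 0.1], 1),
    ([1.0, 2.0], 0), ([0.0, 1.5], 0), ([0.0, 2.5], 0),
]
weights, bias = train(examples)

def next_articles(candidates, top_n=2):
    """What a server-side endpoint might call: rank candidates by predicted click."""
    ranked = sorted(
        candidates, key=lambda c: predict(weights, bias, c["features"]), reverse=True
    )
    return [c["id"] for c in ranked[:top_n]]

candidates = [
    {"id": "old-post", "features": [1.0, 2.2]},
    {"id": "fresh-post", "features": [1.0, 0.1]},
    {"id": "recent-post", "features": [1.0, 0.5]},
]
print(next_articles(candidates))
```

With Keras, `train` and `predict` would be replaced by the library's own fit and predict calls, but the endpoint would look much the same: build features for each candidate, score them, and return the top few.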
Pros and Cons
We can only assume that recommendation algorithms will continue to be commonplace as more advancements are made in machine learning. But there is controversy over these systems and how they use our data. I believe there is more potential for good use of this technology than bad, but it never hurts to understand it better.
Let's look at some of the benefits:
- Discoverability - When content is more tailored to individual users, a lot of niche content, which is usually drowned out by celebrities and bigger accounts on other apps, can actually make its way to the users it will most likely resonate with.
- Increased user engagement - This can be either a good thing or a bad thing, depending on who you ask. In my opinion, if this technology is applied to things like article recommendation systems on blogs or newsletters, it would improve reader satisfaction. This could also be good for creators, as a more engaged audience will help generate more revenue.
And some of the cons:
- Increased user engagement can also be bad. It's already kind of a stereotype that our generation is "glued to our phones," and not without reason. Every major app nowadays is literally engineered to keep us staring at a screen for as long as possible. You have to wonder - where will that trend lead us? Nobody knows the answer right now, but I wouldn't be surprised if within the next few years, the public started to push back against this.
- We should also think carefully about the implications of data collection like this. Big tech companies use highly personal data, like age and location, just to show us 15-second videos of cats. Is that really necessary? Along with concerns about data collection come concerns about privacy, and also manipulation of algorithms. We've all heard about the Cambridge Analytica scandal, and I'm sure you've noticed how when you start talking about a brand, you suddenly start getting ads for their products. I think this problem will continue to get worse, but I'd love to hear your thoughts about this in the comments.