4 steps to decode YouTube’s secret algorithm!

by Mohinder Paul, Director Engineering, Guavus, Inc.

Most of us have expressed discontent with our ISPs (Internet Service Providers) when that video on YouTube starts buffering, spoiling our experience.

One of the perennial problems for ISPs has been figuring out whether subscribers are content with the quality of service. While there are multiple aspects to quality of service (from the ease of subscribing to using the service in its various forms), one of the things that has the most significant impact on customer experience is how smoothly a subscriber can watch videos on YouTube.

In other words, if ISPs could evaluate the subscriber experience on YouTube, this experience score could serve as a proxy for overall service quality. It could be correlated with other data sets, such as customer care data, usage data, location data, and RF conditions, to build relevant analytics models and improve the experience. Unfortunately, there is no direct KPI that ISPs can use to obtain the experience score of subscribers who use YouTube to consume video content.

So let us try to solve this problem!

For starters, let us look at a few facts about how the YouTube algorithm works:

  1. YouTube has installed caches across different locations and it tries to serve users from the nearest cache.
  2. YouTube breaks each video file into multiple chunks and creates multiple copies of each chunk at different resolutions. This is done so that it can serve lower-resolution chunks to users with lower network speeds.
  3. The YouTube algorithm tries to store these chunks intelligently at various cache locations, based on what it has learned about which videos are more likely to be watched at a location and which resolutions are more likely to be required there.

Imagine for a moment that the ISP has access to this YouTube algorithm that learns from data and decides which resolutions will be played by default (unless the user changes it explicitly) at different locations. This information could then be used to learn what throughput is required to play videos at those resolutions and in turn at those locations. Furthermore, the user experience can be marked based on what throughput they are getting versus what was required at that location.

Therefore, the solution to finding the experience score for YouTube users is to decode this algorithm used by YouTube. However, YouTube is not going to reveal this secret any time soon!

Here is the good news!

We can reverse engineer some part of it. We have found 4 simple steps that ISPs can use to derive a proxy for the subscriber experience on YouTube.

Step 1: Come up with a mapping from video resolution to the throughput it requires. This is a simple task. Play a lot of YouTube videos at different resolutions in a controlled lab environment and collect the throughput data for all these YouTube sessions. Average this data per resolution and you will have a mapping table listing the throughput required to play a video at a certain resolution.

Resolution    Throughput (kbps)
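
As a rough illustration of Step 1, the averaging could look something like the sketch below. The function name build_throughput_table and the sample measurements are made up for this example; real numbers have to come from your own lab sessions.

```python
from collections import defaultdict
from statistics import mean

def build_throughput_table(lab_sessions):
    """Average the measured throughput (kbps) for each resolution.

    lab_sessions is an iterable of (resolution, throughput_kbps) pairs
    collected in the controlled lab environment.
    """
    samples = defaultdict(list)
    for resolution, kbps in lab_sessions:
        samples[resolution].append(kbps)
    return {resolution: mean(values) for resolution, values in samples.items()}

# Hypothetical lab measurements, for illustration only (not real YouTube figures).
lab_sessions = [("360p", 580), ("360p", 620), ("720p", 2400), ("720p", 2600)]
table = build_throughput_table(lab_sessions)
```

A plain average is what the step describes. With noisy lab data, a median might be a more robust choice, but that is a design decision rather than part of the method.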

Step 2: For each location (each cell in the world of LTE), look at all the YouTube sessions and note down their actual throughput. Run any clustering algorithm (such as k-means clustering) on this data and figure out the cluster to which most of these sessions belong.

Step 3: Using the table created in step 1, find the resolution corresponding to the throughput of the biggest cluster found in step 2. Mark this as the default resolution for this location. In all likelihood, this is the resolution that YouTube’s secret algorithm comes up with for that location. To make it more accurate, don’t mark just one default resolution per location. Instead, run this learning algorithm every hour and mark the default resolution for a location for a given hour of the day. Default resolutions could differ between peak and off-peak hours.
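
Here is a minimal sketch of Step 2 using a hand-rolled one-dimensional k-means, so it runs without any extra libraries. The function names, the choice of k = 3, the deterministic initialisation, and the sample throughputs are all assumptions made for the example. Any clustering algorithm would do.

```python
from statistics import mean

def kmeans_1d(values, k, iterations=50):
    """Plain 1-D k-means (Lloyd's algorithm). Returns (centroids, clusters)."""
    ordered = sorted(values)
    # Deterministic start: pick k points spread evenly across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments are stable, so we have converged
        centroids = updated
    return centroids, clusters

def dominant_throughput(values, k=3):
    """Centroid (kbps) of the cluster that holds the most sessions."""
    centroids, clusters = kmeans_1d(values, k)
    biggest = max(range(k), key=lambda i: len(clusters[i]))
    return centroids[biggest]

# Hypothetical session throughputs (kbps) observed in one cell.
cell_sessions = [590, 610, 600, 605, 595, 2500, 2450, 300]
typical_kbps = dominant_throughput(cell_sessions)
```

In this made-up data, most sessions sit around 600 kbps, so that cluster's centroid is what gets carried forward to the next step.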

Cell ID    Hour of the Day    Default Resolution
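
Step 3 can be sketched as a nearest-match lookup against the step 1 table, repeated for every cell and hour. Matching on the closest throughput is an assumption of this sketch, and the table values, cell ID, and function names below are hypothetical.

```python
def nearest_resolution(throughput_kbps, table):
    """Resolution whose required throughput is closest to the given value."""
    return min(table, key=lambda res: abs(table[res] - throughput_kbps))

def learn_defaults(dominant_by_cell_hour, table):
    """Map each (cell_id, hour_of_day) key to its default resolution."""
    return {
        key: nearest_resolution(kbps, table)
        for key, kbps in dominant_by_cell_hour.items()
    }

# Hypothetical step 1 table (kbps) and step 2 results, for illustration only.
table = {"240p": 300, "360p": 600, "720p": 2500}
dominant_by_cell_hour = {("cell-17", 9): 2400, ("cell-17", 20): 640}
defaults = learn_defaults(dominant_by_cell_hour, table)
```

Keying the result on (cell, hour) is what lets the same cell carry a different default at peak and off-peak hours.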

Step 4: Now that the learning is complete, look at the throughput, location, and hour of day for each YouTube session. Compare the actual throughput against the learned throughput. If the actual throughput is lower than the learned throughput, it can be assumed that the subscriber is facing some buffering. The exact experience score can be marked based on the difference between the required and actual throughput. For example, if the required throughput is 600 kbps and the actual throughput is 300 kbps, then the score could be marked as 5 out of 10.

Once the ISP has the experience score for multiple such YouTube sessions at different locations, it can easily be combined with RF conditions in that area, such as RSRP, SINR, handovers, and so on. The real RCA (root cause analysis) can then be found and the problem can be fixed!
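
One simple way to turn Step 4 into a number is a linear ratio of actual to required throughput, capped at the maximum score. The linear formula is an assumption of this sketch. It is just one scoring rule that reproduces the 600 kbps versus 300 kbps example, and the function name is made up.

```python
def experience_score(actual_kbps, required_kbps, max_score=10):
    """Score a session from 0 to max_score by actual vs. required throughput.

    Assumes a linear scale: meeting or exceeding the requirement earns the
    full score, and half the required throughput earns half the score.
    """
    if required_kbps <= 0:
        raise ValueError("required throughput must be positive")
    ratio = min(max(actual_kbps / required_kbps, 0.0), 1.0)
    return round(max_score * ratio, 1)

score = experience_score(300, 600)
```

Sessions that get more throughput than required are capped at the top score, since extra bandwidth beyond the default resolution's needs should not reduce buffering any further.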

Image attribution: Shutterstock

Posted by guavus