Into the YouTube World

YouTube has become one of the biggest video platforms, with huge business potential. Imagine we work at an entertainment management company that has never been active on YouTube before. We think we can extend our service to YouTube by signing YouTubers and helping their channels grow.
In this report, we walk through the basic statistics of the main YouTube data and dive into the characteristics of the different video categories. Based on the regression and key indicator analysis, we then give our final recommendation and takeaways.
1. Data Manipulation
Data description
This dataset includes several months (and counting) of data on daily trending YouTube videos in the United States. The data is from Kaggle: Trending YouTube Video Statistics
(https://www.kaggle.com/datasnaek/youtube-new)
Columns: video_id; trending_date; title; channel_title; category_id; publish_time; tags; views; likes; dislikes; comment_count; comments_disabled; rating_disabled; video_error_or_removed


Observations:
After examining the data, we found that:
-There are 40,949 entries and 14 columns; one column (video_id) serves as an index
-There are no missing values or errors (at first glance)
-Some columns have data types that should be changed
In the next step, we study and preprocess outliers, rescale the data units, and add calculated fields.
1.1 Add calculated fields: We add some columns that we need for the analysis that follows. View counts are normally in the millions, which makes them hard to visualize. Therefore, we divide the views by 10,000 and name the result ‘view_10k’. Another key indicator we need is the dislike ratio, so we also add a column ‘dislikerate’, calculated as dislikes/likes.
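The two calculated fields can be sketched in pandas as follows. This is a minimal illustration, not the original notebook code: it assumes the raw view counts are in a column named `views`, the helper name `add_calculated_fields` is hypothetical, and the choice to return NaN for videos with zero likes is our own assumption.

```python
import numpy as np
import pandas as pd


def add_calculated_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with 'view_10k' and 'dislikerate' added."""
    out = df.copy()
    # Rescale raw view counts to units of 10,000 views.
    out["view_10k"] = out["views"] / 10_000
    # Dislikes per like. Zero likes would divide by zero, so those
    # rows get NaN instead (an assumption, not from the report).
    out["dislikerate"] = out["dislikes"] / out["likes"].replace(0, np.nan)
    return out


# Tiny made-up sample, only to show the shape of the result.
sample = pd.DataFrame(
    {
        "views": [1_250_000, 40_000, 9_000],
        "likes": [50_000, 800, 0],
        "dislikes": [1_000, 40, 3],
    }
)
result = add_calculated_fields(sample)
print(result)
```

Returning a copy keeps the raw frame untouched, which makes it easier to rerun the later filtering steps from a clean state.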
1.2 Meaningful data: The column ‘video_error_or_removed’ indicates that a video has been removed by the author or by YouTube. Such rows are no longer meaningful for further analysis, so we drop them.

1.3 Data Preparation: When we first look at the data using the key indicator ‘view_10k’, we see that the mean is about 3 times the median, which tells us that a few data points with very large values are pulling the mean up. Next, we take a look at the distribution of the dataset.
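To see why a mean far above the median signals a few very large values, here is a small sketch using made-up view_10k numbers (they are illustrative only and not taken from the dataset):

```python
from statistics import mean, median

# Hypothetical view_10k values: mostly small, plus one viral video.
view_10k = [5, 8, 12, 20, 35, 60, 900]

avg = mean(view_10k)
mid = median(view_10k)
ratio = avg / mid

# A ratio well above 1 means the distribution is right-skewed:
# a handful of huge values drag the mean away from the typical video.
print(f"mean={avg:.1f}, median={mid}, mean/median={ratio:.1f}")
```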

From the density graph, we can see that the data spans a wide range and that the density between 5,000 and 20,000 is very low, which means few data points fall in this range. We therefore drop all data within this range and check the distribution again.


We can see that the problem still exists, so we drop the data with a view_10k value higher than 300 and check the distribution again.


At this point, the data has been reduced from 40,949 to 34,471 observations, which retains 84.18% of the original dataset. Considering the wide range and large volume of the data, this is an acceptable amount to work with. Now the data is ready! We export the file and use it for the visualizations in Tableau.
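The threshold filter and the retained share can be sketched as below. This is a simplified illustration under stated assumptions: the helper `drop_high_views` is hypothetical, the sample values are made up, and only the final cut at view_10k > 300 is shown. The last line simply repeats the arithmetic behind the 84.18% figure using the row counts reported above.

```python
import pandas as pd


def drop_high_views(df: pd.DataFrame, max_view_10k: float = 300.0):
    """Keep rows with view_10k <= max_view_10k; also return the % retained."""
    kept = df[df["view_10k"] <= max_view_10k].copy()
    retained_pct = 100 * len(kept) / len(df)
    return kept, retained_pct


# Made-up values on both sides of the threshold.
sample = pd.DataFrame({"view_10k": [2.5, 40.0, 299.9, 300.0, 1200.0, 15000.0]})
kept, pct = drop_high_views(sample)
print(len(kept), round(pct, 2))

# Same arithmetic applied to the row counts reported in the text.
reported_pct = round(100 * 34471 / 40949, 2)
print(reported_pct)
```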
2. Data Visualization
2.1 Total Views by category

When assessing the success of a YouTube channel, views are the No. 1 indicator. This bubble chart gives us a picture of viewers’ interest in the different categories. Judging by the sizes of the bubbles, Entertainment (24) has the highest views among all categories. Other popular categories include Comedy (23), Music (10), and Science & Technology (28).
This graph helps us narrow down the categories of video we want to work on by volume of views. It is true that we want the categories most people are interested in, but we also have to consider how many channels are in a given category, that is, how competitive the category is.
2.2 Competition

Since we want to enter a category that has more views, we also want to see how many channels are in that category. In other words, how competitive is it? The following visualization shows how many trending videos there are per channel in each category. The larger the number, the more competitive the category. For example, in the dark red categories, it is very likely that most of the trending videos are created by a few very popular channels. By contrast, dark blue categories such as Autos & Vehicles have a low number of trending videos per channel, which means the trending videos are spread more evenly across different channels.
How does this affect us? If a category is dominated by a few head channels, it will be more difficult for us to build a new channel that competes with the existing ones. On the other hand, if the trending videos are more spread out, we have a better chance of becoming one of the popular channels within that category.
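One way to compute a "trending videos per channel" figure is sketched below. It is an assumption about how the metric was built, not the report's confirmed method: here every row counts as one trending appearance (so a video that trends on several days counts several times), and the helper name `trending_per_channel` and the sample rows are hypothetical.

```python
import pandas as pd


def trending_per_channel(df: pd.DataFrame) -> pd.DataFrame:
    """Trending rows divided by distinct channels, per category."""
    grouped = df.groupby("category_id").agg(
        trending_videos=("video_id", "count"),
        channels=("channel_title", "nunique"),
    )
    grouped["videos_per_channel"] = (
        grouped["trending_videos"] / grouped["channels"]
    )
    # Most concentrated (most competitive) categories first.
    return grouped.sort_values("videos_per_channel", ascending=False)


# Made-up rows: category 24 is dominated by channel A,
# category 2 is spread evenly over three channels.
sample = pd.DataFrame(
    {
        "video_id": ["v1", "v2", "v3", "v4", "v5", "v6", "v7"],
        "channel_title": ["A", "A", "A", "B", "C", "D", "E"],
        "category_id": [24, 24, 24, 24, 2, 2, 2],
    }
)
competition = trending_per_channel(sample)
print(competition)
```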
2.3 Dislike rate

While views indicate how popular a video is, the dislike ratio shows whether people enjoy its content. The dislike rate is dislikes/likes, that is, how many dislikes a category receives on average for every like. Compared against the average reference line, some categories have dislike rates higher than average. Content related to Sports (17), People & Blogs (22), and Autos & Vehicles (2) is more likely to get dislikes than other content. News & Politics (25) is an outlier, with a dislike rate four times the average. This is easy to understand: political videos are much more controversial and viewers may hold opposing opinions, which leads to a high dislike rate.
Why does the dislike rate matter? One of the main income sources for video creators is sponsorship, so it is important to build a positive image that attracts more sponsors. A high dislike rate on a channel, regardless of its views, has a negative impact on brand building and leads to unsustainable income. If we are in a high-dislike-rate category, we have to be more cautious about the content of our videos.
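A comparison of per-category dislike rates against an average reference line could look like the sketch below. The exact definition of the reference line is not stated in the report, so this assumes it is the mean of the per-category averages. The helper `dislike_rate_by_category` and the sample values are hypothetical.

```python
import pandas as pd


def dislike_rate_by_category(df: pd.DataFrame) -> pd.DataFrame:
    """Mean dislikerate per category, flagged against the overall average."""
    rates = df.groupby("category_id")["dislikerate"].mean()
    # Assumed reference line: the average of the category-level rates.
    reference = rates.mean()
    return pd.DataFrame(
        {"dislikerate": rates, "above_average": rates > reference}
    )


# Made-up per-video dislike rates for three categories.
sample = pd.DataFrame(
    {
        "category_id": [25, 25, 28, 28, 10, 10],
        "dislikerate": [0.20, 0.40, 0.02, 0.04, 0.05, 0.07],
    }
)
rate_table = dislike_rate_by_category(sample)
print(rate_table)
```

An alternative would be to divide total dislikes by total likes within each category, which weights high-traffic videos more heavily than a simple mean of per-video rates.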
2.4 Publish Time

Since our goal is to get more videos on trend, it is important to know whether any time pattern affects trending. We therefore look at the number of trending videos for each hour of the day, and we do find a pattern for some of the categories. During the afternoon (13:00-17:00), the number of videos on trend tends to increase. Categories such as Entertainment (24), Howto & Style (26), and Comedy (23) experience a spike from 12:00 to 16:00. Other categories have different patterns: trending videos in Gaming (20) increase around 18:00, after people get out of work; News & Politics (25) starts growing after lunch (12:00); and categories such as Travel & Events do not change much throughout the day.
Because each category has its own target viewers, their different watching behavior changes the number of videos on trend in each category. Overall, videos published during 13:00-17:00 are more likely to be on trend. The reason could be that people start to spend more time on YouTube in the afternoon. When we manage the publishing of videos, we can use the information in this graph to decide the best time to post, so that a video is more likely to trend or get more views.
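Counting trending videos by publish hour can be sketched with the standard library alone. The timestamp format below is an assumption about how `publish_time` is stored, the hour is used exactly as stamped (no time-zone conversion), and the helper name and sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumed layout of publish_time, e.g. "2017-11-13T17:13:01.000Z".
PUBLISH_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def trending_count_by_hour(publish_times):
    """Count how many videos were published in each hour of the day."""
    hours = (
        datetime.strptime(ts, PUBLISH_FORMAT).hour for ts in publish_times
    )
    return Counter(hours)


# Made-up timestamps for illustration.
sample_times = [
    "2017-11-13T17:13:01.000Z",
    "2017-11-13T14:05:00.000Z",
    "2017-11-12T14:59:59.000Z",
    "2017-11-10T09:30:00.000Z",
]
by_hour = trending_count_by_hour(sample_times)
best_hour, best_count = by_hour.most_common(1)[0]
print(best_hour, best_count)
```

Running the same count separately for each category_id would give the per-category curves described above.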
2.5 Dashboard

Now, we put all of the visualizations together and add a summary banner with numbers to make this dashboard. With this interactive dashboard, when we want to check and understand the performance of a certain category, we can simply click on the category_id and easily compare it with others.
3. Data Analysis
3.1 Like/View_10k ratio of category


By looking at the slope, we can get an idea of the like ratio of each category. The bigger the slope, the more likes a video gets per 10 thousand views. Among the categories with big slopes, there are two types of pattern. In some (15, 26, 27), the points cluster along a line; in others (10, 22, 23, 24), they are more widely spread out. What is the difference between these two? Let’s run the regression for categories 24 and 26.
3.2 Regression Analysis


The coefficient of view_10k for category 24 (Entertainment) is 308.61, which means that every increase of 10 thousand views comes with 308.61 more likes. The coefficient of view_10k for category 26 (Howto & Style) is 454.039. This shows that, for the same number of views, viewers give more likes to category 26 than to category 24.
The adjusted R-squared indicates how much of the variation in likes is explained by the regression line. The adjusted R-squared for category 24 is 0.40227, which means that views explain only about 40% of the variation in likes. This is much lower than for category 26, where it is about 70%. It tells us that for Entertainment videos, the like/view rate can fluctuate a lot depending on the content of the video. Howto & Style, by contrast, has a more consistent and better like/view rate, which makes it easier to predict in the future.
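The slope and adjusted R-squared of a one-predictor regression (likes on view_10k) can be sketched in plain Python as below. This is not the tool used for the report; it assumes a model with an intercept, and the helper `fit_line` and both toy datasets are made up to contrast a tight category with a scattered one. For one predictor and $n$ observations, adjusted R-squared is $1 - (1 - R^2)\frac{n-1}{n-2}$.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared with one predictor (n - 1 - 1 = n - 2).
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, intercept, r2, adj_r2


view_10k = [1, 2, 3, 4, 5]
likes_tight = [455, 905, 1370, 1810, 2275]   # points close to a line
likes_noisy = [900, 200, 1500, 600, 2000]    # points widely scattered

slope_t, _, _, adj_t = fit_line(view_10k, likes_tight)
slope_n, _, _, adj_n = fit_line(view_10k, likes_noisy)
print(f"tight: slope={slope_t:.1f}, adj R2={adj_t:.3f}")
print(f"noisy: slope={slope_n:.1f}, adj R2={adj_n:.3f}")
```

The scattered series can still have a positive slope, yet its low adjusted R-squared shows that views alone predict likes poorly, which is the contrast drawn between categories 24 and 26.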
3.3 Comment disabled
A column called “comments_disabled” in this dataset draws our attention.


In this table, 24 (Entertainment) and 25 (News & Politics) are the two categories with the most videos that have comments disabled. To figure out why, we take a look at the channel titles in these two categories.


In the list of channel titles, the majority are official accounts of TV shows (Jimmy Kimmel Live) or companies (H&M). When we, as a management company, manage the channel of a brand, we can consider turning off comments to keep the channel official and mute potential negative feedback.
4. Outlier Analysis
Although we filtered out the data with view_10k > 300 earlier, we still want to observe and understand these outliers, since their very high views are a good indicator.


When we group the data by category and count the number of times on trend, we can see that among those outliers, 34.39% are in category 10 (Music) and 23.10% are in category 24 (Entertainment).
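The share of outliers per category can be sketched as below. The helper `outlier_share_by_category` and the sample rows are hypothetical; the threshold of 300 matches the filter used earlier in the report.

```python
import pandas as pd


def outlier_share_by_category(
    df: pd.DataFrame, threshold: float = 300.0
) -> pd.Series:
    """Percentage of high-view rows (view_10k > threshold) per category."""
    outliers = df[df["view_10k"] > threshold]
    share = outliers["category_id"].value_counts(normalize=True) * 100
    return share.round(2)


# Made-up rows: four of the six exceed the threshold.
sample = pd.DataFrame(
    {
        "view_10k": [500, 800, 20, 16000, 350, 10],
        "category_id": [10, 10, 24, 10, 24, 28],
    }
)
share = outlier_share_by_category(sample)
print(share)
```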


As for view_10k, only Music reaches values higher than 15,000 (×10k views). Music and Entertainment dominate the extremely high-view field, and Music is the absolute winner in terms of views.
Should that be THE category we want to enter? Well… If we take a look at the names of the music channels, we can see that a large number of them include “VEVO”, which is a major music video service. The channels labeled “VEVO” belong to artists, for example “EminemVEVO”, “JenniferLopezVEVO”, and “SelenaGomezVEVO”. In this case, the views reflect more than the content of the video; they reflect the popularity of the artist and the music. Since the majority of management companies do not have mainstream artists, it is hard to copy the success of these channels.
5. Recommendation
Based on all the analysis above, there are two categories that best fit our expansion goal:
-Science & Technology (28): Its total views are around the middle among all categories. It has 13.7 trending videos per channel, which is relatively low. With a low dislike rate and a high like rate, it is also good for business development. It is a safe bet, with all indicators better than average. For a management company that does not have much experience on YouTube, it is a good field to start with. As the technology industry keeps changing rapidly, we can expect sustainable growth in this field.
-People & Blogs (22): This is a popular category with over 170 billion views in total. People “like” this type of video and enjoy watching it in the afternoon and at night. It is not that hard to become one of the most popular channels within this category. However, since the dislike rate is relatively high, we have to pay more attention when picking topics and editing content.
6. Implementation
Here we retrieve all the tag data and find the most frequently used tags for each category. This gives us an idea of the topics and key points of the trending videos. More importantly, we want to use these popular tags so that the algorithm recommends our videos alongside related videos, which eventually helps the channel grow.
First, we count the frequency of each tag and select the most frequent ones.
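Counting tag frequencies can be sketched with `collections.Counter`. The parsing rules here are assumptions about how the `tags` column is stored: tags separated by `|`, optionally wrapped in double quotes, with `[none]` marking a video without tags. The helper `top_tags` and the sample strings are hypothetical.

```python
from collections import Counter


def top_tags(tag_strings, n=3):
    """Return the n most frequent tags across the given tag strings."""
    counts = Counter()
    for raw in tag_strings:
        # Assumed placeholder for videos that have no tags.
        if raw == "[none]":
            continue
        for tag in raw.split("|"):
            # Drop surrounding quotes and normalize case before counting.
            cleaned = tag.strip().strip('"').lower()
            if cleaned:
                counts[cleaned] += 1
    return counts.most_common(n)


# Made-up tag strings for one category.
sample_tags = [
    'iphone|"tech review"|"apple"',
    'Apple|"unboxing"',
    '"tech review"|"apple"',
    "[none]",
]
most_frequent = top_tags(sample_tags, n=2)
print(most_frequent)
```

Calling the function once per category_id, on only that category's rows, gives the per-category lists that feed the word clouds.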

For better visualization, we put all the tags into a word cloud. The font size is based on the frequency of the tag, so the bigger the tag, the more frequently it is used.
Category 28

Category 22
