Scraping the web can be a good way to acquire on-page HTML data, which can be used to create insightful infographics or for custom applications. One of the most powerful tools for this is the Beautiful Soup library for Python. It is built to parse HTML and XML documents, which makes it well suited to extracting data from web pages.
As such, in this tutorial, we'll be looking at how to scrape YouTube titles and view counts as an example. However, this method can be applied to any webpage with meta-data. Why meta-data? Because it is the most consistent way of storing information about a page's title, views, likes, and so on. Because this meta-data follows a vocabulary defined outside of YouTube, it's largely unaffected by changes to YouTube's source code.
A common problem with targeting HTML elements such as classes, IDs, or even containers themselves is that they can change drastically when the source code is updated. This is what happened when YouTube updated its source code: the class names were completely overhauled, and older scraping methods no longer work as well.
Where do we start?
To get the title and view data from a handful of Christmas adverts on YouTube, we started by defining a list of URLs, with each URL written as a string and separated by commas.
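As a rough sketch, such a list might look like the following. The video IDs are placeholders we have made up for illustration, not the adverts used in the original example.

```python
# Hypothetical placeholder URLs; swap in the videos you want to scrape.
urls = [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "https://www.youtube.com/watch?v=VIDEO_ID_2",
    "https://www.youtube.com/watch?v=VIDEO_ID_3",
]

# Each entry is a plain string, so the list can be looped over directly.
for url in urls:
    print(url)
```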

Then, using the statement "for url in urls", we looped over the list so that the function could be run on each individual URL.
Using the "def" keyword, we named our function "scrape_title_views" and gave it the parameter "url", which allowed us to run the function for each URL later on.
The next step is to define a variable called "r", to which we assigned the page's HTML content as parsed by Python's built-in HTML parser, so that the whole page can be searched.
Note: the built-in HTML parser is not the fastest parser available, and for larger web scraping tasks there are better options, but it does the job well for smaller tasks such as this.
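The parsing step can be sketched roughly as follows. To keep the example runnable offline, it parses a short HTML string that we have made up; in a real run the HTML would first be downloaded from the URL, and how the original code did that is not stated here.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a downloaded page (assumed markup, not real YouTube HTML).
sample_html = """
<html><head><title>Example page</title></head>
<body><p>Hello, world</p></body></html>
"""

# "html.parser" selects the parser bundled with Python's standard library.
r = BeautifulSoup(sample_html, "html.parser")

# Dot notation walks the parsed tree: r.title is the <title> tag, r.p the first <p>.
print(r.title.string)
print(r.p.get_text())
```

For larger jobs, Beautiful Soup can also use the third-party lxml parser by passing "lxml" instead of "html.parser", provided lxml is installed.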

Following this, we needed to define a few other variables before getting our function to fetch the data we want: "title", "views", and finally "data".
Starting with "title": we use dot notation on our "r" variable, since it contains all the parsed HTML content from which we can extract the information we need. As we only need to find one title in the page, the "select_one()" method will suffice. Within the parentheses, we targeted the meta-data of the page, specifically the Schema "itemprop" attribute, which labels values as name-value pairs like "name: [title of YouTube video]". For the title variable, we're fetching the "name" property.
Our "views" variable follows the exact same format, except that we're fetching the "interactionCount" property instead.
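A minimal sketch of these two lookups is shown here. The sample markup and the exact CSS selectors are our assumptions, modelled on the "itemprop" meta tags described above, so check them against the page you are scraping.

```python
from bs4 import BeautifulSoup

# Assumed markup modelled on the itemprop meta tags described above.
sample_html = """
<div itemscope itemtype="http://schema.org/VideoObject">
  <meta itemprop="name" content="Example Christmas Advert">
  <meta itemprop="interactionCount" content="123456">
</div>
"""

r = BeautifulSoup(sample_html, "html.parser")

# select_one() returns the first element matching the CSS selector, or None.
title_tag = r.select_one('meta[itemprop="name"][content]')
views_tag = r.select_one('meta[itemprop="interactionCount"][content]')

# The value we want is stored in each tag's "content" attribute.
title = title_tag["content"]
views = views_tag["content"]

print(title, views)
```

Because "select_one()" returns None when nothing matches, it is worth checking the result before indexing into it on pages whose markup you have not inspected.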
Finally, our "data" variable enabled us to store our title and views data in a specified format within the curly braces, such as: "{'title': title, 'views': views}". The format can be whichever style you prefer, although note that "{title, views}" on its own creates a Python set, which does not keep its items in order; a tuple such as "(title, views)" is a safer shorthand.
To end our function, we added a return statement, which returns "data" each time the function is called for a URL in the "urls" list.
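To illustrate the difference between these formats, here is a short sketch using placeholder values in place of scraped ones.

```python
# Placeholder values standing in for the scraped title and view count.
title = "Example Christmas Advert"
views = "123456"

# A dictionary labels each value, so the output describes itself.
data = {"title": title, "views": views}
print(data["title"], data["views"])

# A tuple is a lighter alternative that still preserves order.
data_tuple = (title, views)
print(data_tuple[0], data_tuple[1])

# Braces without keys build a set, which is unordered and cannot be indexed.
data_set = {title, views}
print(type(data_set).__name__)
```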
Our function is now ready to be called. Because we want to be able to reuse this function, we kept it separate so that we can modify it as we please, without having to constantly rewrite it for different tasks. This is useful if you want to print all the titles together, and then all the views together, as in the screenshot below:

Instead of writing two separate statements for the titles and views, we simply need to switch the return statement for whichever we want to fetch (remember that a function exits as soon as it reaches a return statement, so only the first one executed takes effect).
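An alternative to switching the return statement, sketched here, is to keep returning the full dictionary and pick out the field you need afterwards. This is a different approach from the one described above, and the stand-in function and example URLs are hypothetical so that the sketch runs without a network connection.

```python
# Hypothetical offline results keyed by URL; the real function would
# fetch and parse each page instead of reading from this dictionary.
FAKE_RESULTS = {
    "https://example.com/video-1": {"title": "Advert A", "views": "1000"},
    "https://example.com/video-2": {"title": "Advert B", "views": "2500"},
}

def scrape_title_views(url):
    """Stand-in for the scraper: return the title and views for one URL."""
    return FAKE_RESULTS[url]

urls = list(FAKE_RESULTS)

# Call the function once per URL, then split the results by field.
results = [scrape_title_views(url) for url in urls]
titles = [item["title"] for item in results]
views = [item["views"] for item in results]

print(titles)
print(views)
```

This way each page only needs to be processed once, and the function itself stays unchanged.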
Going back to our initial example where we print the title and view count together, we needed to call the function.

In our "if" statement, we check whether the name of our module ("__name__") is "__main__", which is the case when the file is run directly rather than imported. If so, the Python interpreter runs the code inside the "if" block. It is considered best practice to wrap a script's entry point in this check.
Within this statement, we defined the "data" variable and told the interpreter to fill it with the result of the "scrape_title_views(url)" function. So in this case, we're fetching the title and views, and assigning them to the "data" variable in the format we specified (which is: "{'title': title, 'views': views}").
To finish, we instructed the interpreter to print the "data" variable, which will execute once you press Enter twice to exit the statement in the interactive interpreter. Below you can see that the interpreter has successfully retrieved and printed the data we were looking for:
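Putting the steps together, the whole script might look something like this sketch. The "fetch_html" helper, the example URLs, and the sample markup are all our assumptions: the helper reads from an in-memory dictionary so the example runs offline, whereas a real version would download each page.

```python
from bs4 import BeautifulSoup

# Offline stand-ins for downloaded pages (assumed markup and values).
SAMPLE_PAGES = {
    "https://example.com/video-1": (
        '<meta itemprop="name" content="Advert A">'
        '<meta itemprop="interactionCount" content="1000">'
    ),
    "https://example.com/video-2": (
        '<meta itemprop="name" content="Advert B">'
        '<meta itemprop="interactionCount" content="2500">'
    ),
}

def fetch_html(url):
    """Hypothetical helper: return the HTML for a URL (offline here)."""
    return SAMPLE_PAGES[url]

def scrape_title_views(url):
    """Parse one page and return its title and view count."""
    r = BeautifulSoup(fetch_html(url), "html.parser")
    title = r.select_one('meta[itemprop="name"][content]')["content"]
    views = r.select_one('meta[itemprop="interactionCount"][content]')["content"]
    data = {"title": title, "views": views}
    return data

urls = list(SAMPLE_PAGES)

# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    for url in urls:
        data = scrape_title_views(url)
        print(data)
```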

But a long list of letters and numbers will leave you feeling as blue as the text, so be sure to convert this data into something visually appealing with libraries like Chart.js. We've done exactly this by creating a Christmas-themed graph showcasing YouTube views.
If you'd like help collecting or analysing data to create interesting graphs, charts or infographics, get in touch with us.
Source: AgroMarketing.digital