Scraping Dishes and the Web
We humans taught ourselves to Google, and now we almost never pause to notice how we know where to look for answers. We unconsciously pick up our phones, click on the ‘G’ and get going.
In my last article I mentioned my increased appetite for coding and learning. It was about time I taught my code where to look and what data to gather, before it fetched me all the information to take in. I had recently picked up a course on Web Scraping and got myself well versed in its concepts.
As I was scrolling through the IMDb website, trying to pick a movie to watch, it suddenly struck me to do something fun. “Why don’t I try scraping data from this website and analyse movie trends and preferences?” I thought to myself.
Here is what I did a few hours later with a cup of perfectly brewed black tea by my side:
- I opened the IMDb website and saw a list of movies load onto my screen. I right-clicked on one of the movie names to Inspect the HTML code of the page. It took quite a few clicks to understand how the page was structured before I could start scraping data using BeautifulSoup.
- I scraped movie titles, genres, votes, ratings and gross revenue for six genres: Action, Animation, Comedy, Drama, Sci-Fi and Romance.
- I collected all this information in a Pandas DataFrame, then cleaned and re-organised the data to make it more comprehensible.
- Lastly I played around with the data, plotted comparisons between various parameters and drew conclusions out of them.
- My first plot summarized the Average Gross Revenue categorized based on genre.
- I then plotted the IMDb ratings of the Top 10 movies from each genre and categorized the ratings into three slabs. Even though “Drama” came second after “Comedy” in terms of Average Gross Revenue, it was the only category to have all of its Top 10 movies rated above 8.5. The genre Sci-Fi not only ranked last in terms of Average Gross Revenue, but was also the only genre to have a rating below 7 and none above 8.5 among its Top 10 movies.
- My next observation was around the average IMDb rating categorized based on genre. Surprisingly, Sci-Fi had the highest average IMDb rating, and was perhaps the only category to cross the 8.5 mark, in spite of having ratings below 7 for a few of its Top 10 movies. This could be because other genres had more movies with lower ratings, which pulled their average IMDb rating down.
- My last plot was regarding the most common rating given to movies by viewers in the data set I worked with. From the looks of it, 8–8.2 is the most popular ‘rating window’ for movies, while very few movies are rated above 9 or below 6.2. I color-coded the plot based on the count under each bin to make it more visually distinguishable.
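The scraping and cleaning steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code used for the analysis: the HTML snippet, the tag and class names (`movie`, `title`, `votes`, `gross` and so on), the movie names and the helper functions are all hypothetical, and the real IMDb page is structured differently. The sketch parses an inline string rather than fetching a live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real IMDb page uses
# different tags and class names, found via the browser's Inspect tool.
SAMPLE_HTML = """
<div class="movie">
  <h3 class="title">Example Movie A</h3>
  <span class="genre">Drama</span>
  <span class="rating">8.6</span>
  <span class="votes">1,234,567</span>
  <span class="gross">$28.34M</span>
</div>
<div class="movie">
  <h3 class="title">Example Movie B</h3>
  <span class="genre">Sci-Fi</span>
  <span class="rating">6.9</span>
  <span class="votes">45,210</span>
</div>
"""


def text_or_none(card, css_class):
    """Return the stripped text of the first matching child, or None."""
    node = card.find(class_=css_class)
    return node.get_text(strip=True) if node else None


def parse_votes(text):
    """Turn a string like '1,234,567' into the integer 1234567."""
    return int(text.replace(",", ""))


def parse_gross(text):
    """Turn a string like '$28.34M' into 28.34 (millions); None if missing."""
    if text is None:
        return None
    return float(text.lstrip("$").rstrip("M"))


def scrape_movies(html):
    """Collect one dict per movie card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="movie"):
        rows.append({
            "title": text_or_none(card, "title"),
            "genre": text_or_none(card, "genre"),
            "rating": float(text_or_none(card, "rating")),
            "votes": parse_votes(text_or_none(card, "votes")),
            "gross_musd": parse_gross(text_or_none(card, "gross")),
        })
    return rows


movies = scrape_movies(SAMPLE_HTML)
for movie in movies:
    print(movie)
```

A list of dicts like `movies` can be passed straight to `pandas.DataFrame(movies)`. Missing fields, such as the gross revenue of the second card, are kept as `None` instead of being guessed.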
Note:
- My analysis was based on a collection of the Top 219 movies (ranked by number of votes), categorized into six genres.
- I also made sure I mapped each movie to exactly one genre.
- The purpose of this experiment was to get comfortable with Web Scraping.
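For readers who want to try something similar, the one-genre-per-movie rule and the genre-wise comparisons could look something like this in pandas. The rows are made up for illustration and are not the data from the analysis. The column names, the "keep the first genre seen" rule and the exact slab edges are also assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; not the scraped data set.
df = pd.DataFrame({
    "title": ["A", "A", "B", "C", "D"],
    "genre": ["Action", "Sci-Fi", "Drama", "Drama", "Comedy"],
    "rating": [8.1, 8.1, 8.6, 7.9, 6.5],
    "gross_musd": [120.0, 120.0, 28.0, None, 60.0],
})

# Map each movie to exactly one genre by keeping the first row per title
# (an assumed rule; any consistent tie-break would do).
one_genre = df.drop_duplicates(subset="title", keep="first")

# Average rating and gross revenue per genre; missing gross values
# are skipped by mean() rather than counted as zero.
avg_by_genre = one_genre.groupby("genre")[["rating", "gross_musd"]].mean()

# Three rating slabs. With right=False each bin includes its left edge:
# [0, 7), [7, 8.5), [8.5, inf).
bins = [0, 7.0, 8.5, float("inf")]
labels = ["below 7", "7 to 8.5", "8.5 and up"]
one_genre = one_genre.assign(
    slab=pd.cut(one_genre["rating"], bins=bins, labels=labels, right=False)
)
slab_counts = one_genre["slab"].value_counts().sort_index()

print(avg_by_genre)
print(slab_counts)
```

From here, `avg_by_genre["gross_musd"].plot(kind="bar")` would give a bar chart per genre, provided matplotlib is installed.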