Introduction

I am a super movie fan and I believe you are as well. Aren’t you curious about what movies are on the Top 200 Highest-Grossing Movies of All Time list? I am curious! Moreover, high-grossing movies often means high rating and are mostly good movies. I am excited to put some more great movies in my pocket list.

The dataset I use is form IMDb website. They list the details of each movie in the list like ranking, title, year, runtime, rating, box office, and etc.

This is what the Highest-Grossing Movies of All Time list looks like on their website:

Test Image

Our goal is to get the six variable we want which are ranking, title, year, runtime, rating, and box office for the top 200 movies using BeautifulSoup. Then we put together all the data we got as a dataframe/table.

Check the website’s status code

The first thing we should do is check the status code to see if the website allow us to get data from them. To do that, you can use requests.get(your_url).status_code to get the code. The code we got is 200, we are good to go :)

Get the HTML code and we are in business.

Here we finally use a BeautifulSoup function. We use BeautifulSoup(your_url.text) to get the html code we want. And this where all the funs begin!

Test Image

This is the step we scrap all the interesting data

Looking at the html code, we found the desired data are all in a class called lister-item-content
We scraped all the columns we want using codes below:

gross_divs = w_soup.find_all('div', {"class":"lister-item-content"})
All the tables are in lister-item-content class
rank = [int(h.find('span', {"class": "lister-item-index unbold text-primary"}).text.strip('.')) for h in gross_divs]
All the rankings are in lister-item-index unbold text-primary class
title = [h.find('a').get_text() for h in gross_divs]
Titles are easily found in a href
year = [h.find('span', {"class": "lister-item-year text-muted unbold"}).get_text() for h in gross_divs]
All the years are in lister-item-year text-muted unbold class
rating = [(h.find('span', {"class": "ipl-rating-star__rating"}).get_text()) for h in gross_divs]
All the IMDb ratings are in ipl-rating-star__rating class
runtime = [int(h.find('span', {"class": "runtime"}).text.strip(' min')) for h in gross_divs]
We use text.strip() to get rid the “min” in runtime
gross = [(h.find_all('span', {"name": "nv"}))[-1].get_text() for h in gross_divs]
Use find_all() to find votes and gross, and get rid of votes by adding [-1]

</ul>

Create Dataframe Using Pandas

Make sure you have pandas imported, and then create a dataframe by saying pf = pd.DataFrame() Then combine two dataframes into one.

Test Image

Final data cleaning

Since the column df[Year] and df[Box_Office(M)] still contain (), $, and M. We need to do some cleaning.

df['Year'] = df['Year'].str.extract(r'(\d\d\d\d)').astype(int)
We use str.extract() to get rid of the ()
df['Box_Office(M)'] = df['Box_Office(M)'].str.partition('$')[2].str.partition('M')[0]
The first str.partition() eliminate $, the second eliminate M

</ul>

The resulting dataframe will look something like this:

Test Image

Is this ethical?

Yes, this is completely ethical since the IMDb website allows people to web-scrape their website as the status code shows. I also did not use it for commercial purpose, this is for my personal use to add more movies to my list.

Conclusion

In this post, I show you step-by-step from how to check the status code, scrap the variables you like, put together as a table, and finally make the dataset clean and looking-good using BeautifulSoup. This post is extremely useful to those who needs to collect, wrangling, and analyze online data for personal use. I will also an exploratory data analysis (EDA) of this dataframe on my next post, hope you enjoy this post and look forward to the next one.

Here’s a link to my github repo link