How to scrape data from the Web using Python and BeautifulSoup
In this post, we'll cover the basics of web scraping, explore popular Python libraries, and work on a project to extract data from a website and save it in a CSV format on our local device.
Back in the early days of the internet, people manually copied and pasted data from web sources into spreadsheets or documents to perform analysis; this process is called manual web scraping. However, it became difficult to keep up with the expansion of data available on the web and the need to obtain it from different sources, so software was introduced to automate the process.
What is Web Scraping?
Web scraping is the process of extracting information from the web. To scrape means to remove something from a surface. In the context of web scraping, the surface is the web, and we're removing the data (or HTML) from it.
Popular Web Scraping Libraries for Python
Libraries provide most of the code needed to perform a set of actions, wrapped neatly in a package that exposes a set of methods. Python is a popular programming language for web scraping thanks to solid and well-maintained libraries such as BeautifulSoup and Scrapy.
We're going to cover BeautifulSoup in this article, but the logic applies to any other library, even if scraping is done through HTTP requests "manually".
Web Scraping Use Cases
Web scraping has been used since the dawn of the internet, across all industries. Some examples of use cases for web scraping include:
- Research and Insights: Web scraping is used to collect data from the latest developments in a specific topic such as news, or medical reports. It is also used to extract data from publicly available sources, such as stock markets, exchange rates, and others.
- Machine Learning Training Data: Similarly, data collected could be used to train machine learning models to output predictions.
Web Scraping Controversy
Web scraping raises immediate questions of ethics and legality. You should ask yourself whether it is legal to scrape data from a website before attempting to do so. Many companies indicate in their policies whether they allow scraping; however, even if it is not clearly stated, you must always acquire consent from the data owner before attempting to scrape any data.
Just so you know, as soon as I typed "Is Web Scraping" into Google, it automatically appended the word "legal" and a question mark, turning my query into "Is Web Scraping Legal?". This goes to show it's a question many people ask. The short answer is: it depends.
Keep in mind that some things to consider include (but are not limited to):
- Some data is protected by different laws (Including personal information, and other sensitive data)
- Some data owners consider scraping their data a violation of the terms and conditions they set
- Even if scraping itself is authorized, misusing the data could be a breach of some laws and policies
Notable Legal Cases
- Facebook v. Power Ventures
- Craigslist v. 3Taps
- hiQ Labs v. LinkedIn
Overview of what we're building
For this project, we're going to scrape country data from: https://www.scrapethissite.com. As you can tell from the domain name, the site's purpose is to help people understand web scraping.
Here's what they have to say about this:
This site is made for the purpose of teaching people the art of web scraping. Please feel free to write or use software in order to automatically pull, harvest, scrape, download or otherwise access any information on this site. - https://www.scrapethissite.com/faq/
We're going to collect the countries from this URL: https://www.scrapethissite.com/pages/simple/
On this page, you'll see the country name, capital, population, and area (in km2) for every country, laid out in three columns (depending on your screen size). The goal of this tutorial is to extract the country data, perform some filtering, then store everything in a CSV.
Prerequisites
We'll need to install a few things to get started:
- Python
- Jupyter Notebook (You can use your preferred IDE, such as VS Code)
- BeautifulSoup
- Pandas
1. Install Python
Let's start by installing Python from the official repository: https://www.python.org/downloads/
Download the package that is compatible with your operating system, and proceed with the installation. To make sure Python is correctly installed on your machine, type the following command into your terminal:
python --version
You should get an output similar to this: Python 3.9.6
2. Install Jupyter Notebook
After making sure Python is installed on your machine (see above), we can proceed by using pip to install Jupyter Notebook.
In your terminal type the following:
pip install jupyter
3. Install BeautifulSoup
Then we can install the BeautifulSoup library:
pip install beautifulsoup4
4. Install Pandas
Finally, type in the following to install pandas using pip:
pip install pandas
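Optionally, you can verify that both libraries were installed correctly by importing them and printing their versions from the terminal (the exact version numbers will differ on your machine):
python -c "import bs4, pandas; print(bs4.__version__, pandas.__version__)"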
Great, we have everything we need to get started!
Building our Web Scraper
Before we start coding, let's dive deeper into the HTML structure to identify which tags we'll need to pull the data from. Then we can write the Python code that will perform the actual scraping.
Understanding the Tag Structure
Remember, we're going to extract content from HTML tags. How do we know which tags contain our data? Simple: we look at the HTML source code of the page. Specifically, we're looking for the tags holding these values:
- Country Name
- Country Capital
- Population
- Area (km2)
We can see the source code by opening the Web Inspector tool in our browser. To do so:
- Right-click on the first country "Andorra" (Or any other country name).
- As shown below, click on "Inspect Element" if you're using Safari. (For Chrome it's Inspect)
This will open the Web Inspector which shows the HTML code of the tag holding the selected value (Andorra in our case). Even though the inspector shows a lot of other technical information about the website, we'll only focus on identifying the tags that hold the values that we're looking to extract.
Looking at the Web Inspector window, we can notice a few things immediately:
- <h3 class="country-name"> contains the Country Name
- <span class="country-capital"> contains the Country's Capital
- <span class="country-population"> contains the Country's Population
- <span class="country-area"> contains the Country's Area
We can also see that the country details (capital, population, and area) are all nested in a <div> tag with a class value of country-info. Similarly, the tags with the classes country-name and country-info are both nested in a parent <div> with class values of col-md-4 and country.
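Based on these class names, a single country block looks roughly like the sketch below (simplified for illustration; the actual page wraps the values in additional labels and markup):
<div class="col-md-4 country">
    <h3 class="country-name">Andorra</h3>
    <div class="country-info">
        <span class="country-capital">Andorra la Vella</span>
        <span class="country-population">84000</span>
        <span class="country-area">468.0</span>
    </div>
</div>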
Putting Everything Together
Given the observations above, we can find everything nested in the <div class="col-md-4 country"> tag, and this element is repeated for every country. This means we can:
- Grab all occurrences of this element
- Loop through each occurrence
- Extract the text values of the target tags (name, capital, population, and area)
- Put all the values in a list
- Convert that list to a pandas DataFrame
- Perform our DataFrame analysis
- Save the data from the DataFrame as a CSV file
Launch Jupyter Notebook (Or your IDE of choice)
We're now ready to start coding!
Open your terminal window and cd to your project directory to launch a new instance of Jupyter Notebook. Type the following and hit return:
jupyter notebook
This command will automatically open your default browser, and redirect you to the Jupyter Notebook Home Page. You can select or create a notebook from this screen.
Create a new Notebook
Now let's create a new notebook by clicking on New -> Python3, as shown below:
Import required modules
First, we'll need to perform a GET request to fetch the HTML data from the site. We can do so with the help of the requests module. We'll also import BeautifulSoup and pandas so we can use them in a bit:
import requests
from bs4 import BeautifulSoup
import pandas as pd
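Note: requests is a third-party library, not part of the Python standard library. It comes bundled with some Python distributions, but if the import fails on your machine, you can install it the same way as the others:
pip install requests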
Retrieve content using GET method
We use the requests.get() method to request the data from the url below:
url = "https://www.scrapethissite.com/pages/simple/"
response = requests.get(url)
# Store HTML content in html_content variable
html_content = response.content
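As an optional sanity check, you can confirm the request succeeded before parsing; a status code of 200 means the page was fetched successfully:
# A 200 status code means the request was successful
print(response.status_code)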
Parse HTML content using BeautifulSoup
Next, we instantiate a BeautifulSoup object by passing in html_content, which contains all of the HTML code from the url.
# Create a soup variable
# "html.parser" specifies the parser that matches the data type (HTML)
soup = BeautifulSoup(html_content, "html.parser")
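To quickly verify the parse worked, you can print the page's title tag (treat this as a smoke test; the exact title text depends on the page):
# Print the text of the page's <title> tag
print(soup.title.text)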
Retrieve all countries using a for loop
Let's find all the HTML <div> tags that have a class value of country using BeautifulSoup's .find_all() method. We'll then loop through each one of them and extract the values into a Python list called data:
# Find all div tags with class country
countries = soup.find_all("div", class_="country")
# Instantiate list
data = []
# Loop through each country div
for country in countries:
    # Extract the text from the h3 tag with class country-name
    name = country.find("h3", class_="country-name").text.strip()
    # Extract the text from the span tag with class country-capital
    capital = country.find("span", class_="country-capital").text.strip()
    # Extract the text from the span tag with class country-population
    population = country.find("span", class_="country-population").text.strip()
    # Extract the text from the span tag with class country-area
    area = country.find("span", class_="country-area").text.strip()
    # Append all text values to our data list
    data.append([name, capital, population, area])
The strip() method removes the leading and trailing whitespace characters from the .text value.
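Before converting to a DataFrame, it's worth peeking at what we collected. Note that every value is still a string at this stage; based on the first country on the page, the output should look something like the comment below:
# Check how many countries we collected, and inspect the first one
print(len(data))
print(data[0])  # something like: ['Andorra', 'Andorra la Vella', '84000', '468.0']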
Convert to pandas DataFrame
Great, now we have all the countries stored neatly in our data list. We can quickly convert it to a pandas DataFrame:
# Create a pandas DataFrame from our data list
# Specify friendly column names using the columns parameter
df = pd.DataFrame(data, columns=["Country Name", "Capital", "Population", "Area"])
# Convert the Area and Population values to a numeric type
df["Area"] = pd.to_numeric(df["Area"])
df["Population"] = pd.to_numeric(df["Population"])
Return the first five rows of the DataFrame
df.head()
|   | Country Name | Capital | Population | Area |
|---|---|---|---|---|
| 0 | Andorra | Andorra la Vella | 84000 | 468.0 |
| 1 | United Arab Emirates | Abu Dhabi | 4975593 | 82880.0 |
| 2 | Afghanistan | Kabul | 29121286 | 647500.0 |
| 3 | Antigua and Barbuda | St. John's | 86754 | 443.0 |
| 4 | Anguilla | The Valley | 13254 | 102.0 |
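This is also where you could perform the filtering mentioned at the start. As one illustrative example (the 10 million threshold is an arbitrary assumption for demonstration), we could keep only countries above a certain population and sort them by area:
# Example filter: countries with more than 10 million people,
# sorted by area in descending order
large_countries = df[df["Population"] > 10_000_000].sort_values("Area", ascending=False)
large_countries.head()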
Save DataFrame as CSV
Finally, we'll save our DataFrame into a CSV (short for Comma Separated Values) file:
# Save DataFrame as a CSV file
df.to_csv("country_data.csv", index=False)
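To double-check the export, you can read the file back into a new DataFrame; the first rows should match the df.head() output above:
# Read the CSV back to confirm it was written correctly
pd.read_csv("country_data.csv").head()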
Conclusion
There you have it: a few lines of code, and you can pull data from a website. Keep in mind that things in the real world are more complicated. In our project, the HTML structure was clear and very well-defined. Usually, you'll need to work with more complex things like paging and AJAX. You can check out the BeautifulSoup documentation and try to scrape data from the other example pages on scrapethissite.com.