Day 5. Web Scraping. 🕷️ Beautiful Soup & Selenium
Welcome to Day 5! Today, you’ll learn how to extract data from websites using Python. We’ll start with Beautiful Soup for static pages, then level up to Selenium for dynamic, interactive sites. By the end, you’ll be able to collect data from the web for your own projects!
🌱 What is Web Scraping?
Web scraping is the process of automatically collecting information from websites. It’s useful for gathering data that isn’t available via an API.
- Beautiful Soup: Parses HTML and XML documents. Great for static pages.
- Selenium: Automates browsers. Useful for dynamic sites that require interaction (clicks, typing, etc).
🥣 Beautiful Soup Basics
Beautiful Soup is a library for parsing and navigating HTML/XML with Python. It requires the beautifulsoup4 package and a parser such as lxml or Python's built-in html.parser.
🧩 What can you do with Beautiful Soup?
Here’s a breakdown of its core capabilities:
- Parse HTML and XML: Beautiful Soup takes raw HTML or XML content and transforms it into a navigable "parse tree" made of Python objects. This tree makes it simple to access and manipulate specific parts of the document (a short sketch after this list shows these pieces in action).
- Navigate the Parse Tree: You can easily move through the HTML/XML structure:
  - By tag name: Find elements like <div>, <a>, or <p>.
  - By attributes: Locate elements based on their id, class, href, or any other attribute.
  - By text content: Search for specific words or phrases within elements.
  - Using relationships: Travel up (parent), down (children, descendants), or sideways (siblings) in the tree.
- Search for Specific Elements: Beautiful Soup offers strong methods like find() (to get the first match) and find_all() (to get all matches) to pinpoint the exact data you're looking for. You can combine these with various filters (tag names, attributes, CSS selectors, regular expressions, or even custom functions) for precise selection.
- Extract Data: Once you've found the elements you want, you can easily pull out:
  - Text content: Get the visible text inside a tag (e.g., soup.title.string).
  - Attribute values: Access the values of attributes like href from an <a> tag or src from an <img> tag.
- Handle Malformed HTML: One of Beautiful Soup's strengths is its ability to deal with "tag soup" (poorly structured or incomplete HTML). It tries to make sense of it and build a usable parse tree.
- Integrate with Other Libraries:
  - Requests: Often used with the requests library to fetch the HTML content from a URL before Beautiful Soup parses it.
  - Selenium: For dynamic websites that rely heavily on JavaScript for rendering, you might use Selenium (a browser automation tool) to load the page, and then pass the rendered HTML to Beautiful Soup for parsing.
  - Pandas: Extracted data can be easily structured and stored in Pandas DataFrames for further analysis or export to formats like CSV or Excel.
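Here is a minimal sketch of these capabilities on an inline HTML snippet (the HTML and variable names are invented purely for illustration):

from bs4 import BeautifulSoup

# Deliberately sloppy "tag soup": the <li> tags are never closed
html = """
<html><head><title>Demo</title></head>
<body>
<ul id="links">
  <li><a href="https://example.com/a" class="story">First story
  <li><a href="https://example.com/b" class="story">Second story
</ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # still builds a usable tree

print(soup.title.string)                   # text content: "Demo"

first = soup.find("a", class_="story")     # first matching element
print(first.text, first["href"])           # visible text and an attribute value

for tag in soup.find_all("a"):             # all matching elements
    print(tag.get("href"))

print(first.parent.name)                   # navigate by relationship: "li"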
🧰 Common Uses for Beautiful Soup
Beautiful Soup is primarily used for:
- Web Scraping: This is its main purpose. You can use it to:
- Collect product information (names, prices, descriptions) from e-commerce sites.
- Extract news articles, blog posts, or research papers.
- Gather job listings or real estate data.
- Perform sentiment analysis by scraping reviews or comments.
- Data Mining: Turning unstructured web data into organized datasets for analysis.
- Content Aggregation: Building tools that pull content from multiple online sources into one centralized location.
In short, Beautiful Soup empowers Python developers to programmatically interact with web content, making it an essential tool for anyone looking to extract and work with data from the internet.
📦 Install the required packages
pip install beautifulsoup4 requests lxml
📄 Example: Scraping a Local HTML File
Suppose you have a file called website.html:
from bs4 import BeautifulSoup
with open("website.html", encoding="utf-8") as file:
    contents = file.read()

soup = BeautifulSoup(contents, "html.parser")
print(soup.title)
🧼 Cleaning HTML
To strip away all tags and keep only the visible text of the page, use get_text():
clean_text = soup.get_text(strip=True)
🔍 Finding Elements
You can search for tags, classes, ids, and more:
# Find the first <a> tag
anchor = soup.find("a")
print(anchor)
# Find all <a> tags
all_anchors = soup.find_all("a")
for tag in all_anchors:
    # .getText() gets the visible text inside the tag
    print(tag.getText())
    # .get() retrieves the value of an attribute (e.g., href)
    print(tag.get("href"))
Search by attributes (id, class, etc.)
# Find by id
heading = soup.find(name="h1", id="name")
# Find by class (note: use class_ because 'class' is a reserved word)
section = soup.find(name="h3", class_="heading")
# Find all elements with a specific class
items = soup.find_all(class_="item")
Search using CSS selectors
# Use .select() for CSS selectors
links = soup.select("a.storylink") # All <a> tags with class 'storylink'
ids = soup.select("#main") # Element with id 'main'
classes = soup.select(".heading") # All elements with class 'heading'
🌳 Navigating the Tree
tag.name # Tag name
tag.attrs # Tag attributes as dict
tag['href'] # Specific attribute
tag.text # All text inside tag (recursive)
tag.string # Direct string only
tag.parent
tag.children # Generator of children
tag.contents # List of children
tag.next_sibling
tag.previous_sibling
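As a minimal illustration, reusing the soup object from the file example above (the exact output depends on your HTML):

anchor = soup.find("a")            # any Tag object works here
if anchor is not None:
    print(anchor.name)             # "a"
    print(anchor.attrs)            # attributes as a dict
    print(anchor.parent.name)      # the enclosing tag's name
    for child in anchor.children:  # direct children (often a single text node)
        print(child)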
🔗 Navigating and Following Links
You can extract and follow links by combining .get("href") with requests:
for tag in soup.find_all("a"):
    link = tag.get("href")
    if link and link.startswith("http"):
        print("Following:", link)
        # You can fetch the linked page with requests.get(link)
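Here is a self-contained sketch that actually follows those links one page deep (the starting URL, timeout, and delay are arbitrary example choices):

import time

import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder starting page
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

for tag in soup.find_all("a"):
    link = tag.get("href")
    if link and link.startswith("http"):
        print("Following:", link)
        # Fetch and parse the linked page
        linked = BeautifulSoup(requests.get(link, timeout=10).text, "html.parser")
        print("  Title:", linked.title.string if linked.title else "(none)")
        time.sleep(1)  # small pause between requests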
For further reference, see the official Beautiful Soup documentation.
🌐 Scraping Live Websites
To scrape a live website, use the requests library to fetch the page:
import requests
from bs4 import BeautifulSoup
url = "https://news.ycombinator.com/"
response = requests.get(url)
webpage = response.text
soup = BeautifulSoup(webpage, "html.parser")
# Get all article titles.
# Note: Hacker News has changed its markup over time. Older pages used
# <a class="storylink">; newer ones nest the link in <span class="titleline">.
# Inspect the page to confirm the right selector before relying on it.
titles = soup.select(".titleline > a") or soup.find_all("a", class_="storylink")
for title in titles:
    print(title.getText())
⚖️ Is Web Scraping Legal?
- Only scrape public data.
- Respect robots.txt and website terms.
- Don’t overload servers (add delays if scraping many pages); a minimal politeness sketch follows this list.
- Use scraped data responsibly.
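A minimal sketch of polite scraping using the standard library's urllib.robotparser plus requests (the URLs, user agent, and 2-second delay are arbitrary examples):

import time
from urllib.robotparser import RobotFileParser

import requests

# Fetch and parse the site's robots.txt (example.com is a placeholder site)
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

pages = ["https://example.com/", "https://example.com/about"]
for url in pages:
    if not robots.can_fetch("*", url):  # respect the site's rules
        print("Disallowed by robots.txt:", url)
        continue
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests so we don't overload the server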
🤖 Selenium for Dynamic Websites
Some sites load content with JavaScript or require interaction. Selenium lets you control a real browser to handle these cases.
🚗 Introduction to Selenium WebDriver
- What it is: Selenium WebDriver is a well-known automation and testing tool for web developers.
- Why use it (over Beautiful Soup): Unlike Beautiful Soup, which is limited to scraping data, Selenium allows for interaction with web pages, such as typing, clicking, and scrolling. It enables the automation of continuous actions and entire workflows of a particular job or task. It effectively drives a browser to perform actions like a human user.
- Capabilities: Selenium can automate almost anything a human can do on a website, like filling forms, transferring information, or playing web-based games.
🔧 Installation and Setup of Selenium
- Install Chrome Browser: While Selenium works with other browsers like Firefox or Safari, Chrome is recommended for consistency and use of Chrome Developer Tools. With Selenium 4.6+ the matching driver is downloaded automatically (Selenium Manager); on older versions, download ChromeDriver from chromedriver.chromium.org and place it in your PATH.
- Install Selenium Package: import selenium in your Python file (e.g., main.py) and install the package from the terminal, or via your IDE's quick-fix (light bulb) suggestion:
pip install selenium
- Import WebDriver Module: change the import statement to from selenium import webdriver.
- Create a Driver Instance: initialize a Chrome driver object: driver = webdriver.Chrome().
  - ChromeDriver: this acts as a bridge between the Selenium code and the Chrome browser, telling Selenium how to interact with the browser. Different drivers exist for different browsers (e.g., Safari, Firefox).
- Browser Control:
  - driver.close(): closes the active tab.
  - driver.quit(): quits the entire browser. It's preferred to use quit() after completing tasks to ensure a fresh browser instance for future runs.
🔎 Example: Open a Page and Find an Element
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start the browser
browser = webdriver.Chrome()
browser.get("https://www.python.org")

# Find the entries in the upcoming-events widget
event_times = browser.find_elements(By.CSS_SELECTOR, ".event-widget time")
event_names = browser.find_elements(By.CSS_SELECTOR, ".event-widget li a")

for event_time, event_name in zip(event_times, event_names):
    print(event_time.text, event_name.text)

# Wait for 3 seconds before closing
time.sleep(3)
browser.quit()
🔍 Finding and Selecting Elements on a Website
Locating Elements:
Selenium offers various strategies to find HTML elements on a webpage. Once you have identified an element with the browser's inspect tool, you can copy its XPath (or another identifier) and use it with:
- find_element() method: used to find a single element.
- By class: specifies the location strategy (e.g., By.CLASS_NAME, By.ID, By.NAME, By.LINK_TEXT, By.XPATH).
- Examples:
  - By Class Name: to get the price of an item on Amazon, you might find elements with classes like "a-price-whole" (for dollars) and "a-price-fraction" (for cents).
  - Accessing Text Content: after finding an element, use .text to retrieve the text content within that HTML element.
  - By Name: useful for form input fields.
  - By Link Text: specifically for clicking on links by their visible text.
- find_elements() method: for every find_element() call, there is a find_elements() counterpart that returns a list of all matching elements.
- Inspecting Elements: use Chrome Developer Tools (right-click -> Inspect) to examine the HTML structure and identify IDs, class names, or other attributes for elements. A short sketch of these strategies follows this list.
🖱️ Automating Interactions (Typing and Clicking)
- Clicking Elements:
  - After identifying an element, use the .click() method on the element object.
  - Selenium can click on links based on their visible text (By.LINK_TEXT).
- Typing into Input Fields:
  - First, find the input field element.
  - Then use the .send_keys() method on the element object, passing the string you want to type.
- Sending Special Keys: to send keys like Enter or Return, import the Keys class from selenium.webdriver.common.keys. A combined sketch follows this list.
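A sketch tying these together: typing into Wikipedia's search box and pressing Enter. The locator name "search" matches Wikipedia's markup at the time of writing, and the link text is only an example; verify both with the inspector:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://en.wikipedia.org/wiki/Main_Page")

# Typing: find the search box by its name attribute and type a query
search = driver.find_element(By.NAME, "search")
search.send_keys("Web scraping")  # type into the field
search.send_keys(Keys.ENTER)      # press Enter to submit

# Clicking: find a link by its visible text and click it
# ("Contents" is an example; use link text that exists on the resulting page)
link = driver.find_element(By.LINK_TEXT, "Contents")
link.click()

driver.quit()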
📝 Challenge: Scrape Upcoming Python Events
- Use Selenium to open python.org
- Extract the date and name of the next 5 events
- Store them in a dictionary like:
events = {
    0: {"time": "2025-06-11", "name": "PyCon"},
    1: {"time": "2025-06-18", "name": "DjangoCon"},
    # ...
}
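One possible solution sketch, reusing the selectors from the earlier python.org example (the widget's structure may change, so confirm the selectors in the inspector):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.python.org")

event_times = driver.find_elements(By.CSS_SELECTOR, ".event-widget time")
event_names = driver.find_elements(By.CSS_SELECTOR, ".event-widget li a")

# Build the {index: {"time": ..., "name": ...}} dictionary for the first 5 events
events = {
    i: {"time": t.text, "name": n.text}
    for i, (t, n) in enumerate(zip(event_times[:5], event_names[:5]))
}
print(events)

driver.quit()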
🚀 Summary
- Use Beautiful Soup for static HTML scraping
- Use Selenium for dynamic, interactive sites
- Always respect website rules and ethics
You now have the tools to collect data from almost any website. Happy scraping!