
Webscraping Competitors: For Analysts
Introduction to Webscraping
When you analyse data, especially competitor data, there comes a point when you want to know what a competitor is up to in real time.
This is where web scraping comes in. Web scraping is the process of extracting data from websites. It allows you to gather information from the web and use it for purposes such as price tracking, sentiment analysis, and application development.
On the face of it, web scraping is fairly straightforward: request the page, find the element, get the text. But you will very quickly run into request limits, CAPTCHAs, JavaScript-rendered content, and more. That is what we cover in the sections below.
1. Understanding the Structure of an HTML Page
HTML (HyperText Markup Language) is the standard language for creating web pages. It uses a system of tags to define elements that structure the content of a webpage. Understanding the basic structure of an HTML document is crucial for web scraping, as it helps you identify the elements from which you need to extract data.
Basic Structure of an HTML Document
An HTML document is made up of a series of nested elements. Each element is defined by a tag, enclosed in angle brackets (<>). HTML documents typically have a hierarchical structure consisting of the following parts:
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1>This is a Heading</h1>
    <p>This is a paragraph.</p>
  </body>
</html>
Let's break down this structure into its key components:
<!DOCTYPE html>
This declaration defines the document type and version of HTML. It helps the browser understand how to interpret the document. For modern web pages, this is typically set to HTML5 using <!DOCTYPE html>.
<html>...</html>
The <html> element is the root of an HTML document. It encapsulates all the content on the page.
<head>...</head>
The <head> section contains meta-information about the document, such as its title, character set, styles, scripts, and other metadata. Key elements within the <head> section include:
- <title>...</title>: Defines the title of the document, which is displayed in the browser's title bar or tab.
- <meta>: Provides metadata such as character set, author, description, and keywords.
- <link>: Used to link external resources like stylesheets.
- <style>...</style>: Contains internal CSS to style the document.
- <script>...</script>: Contains JavaScript for interactivity.
<body>...</body>
The <body> section contains the actual content of the webpage that is visible to the user. Common elements within the body include:
- <h1>...</h1> to <h6>...</h6>: Header tags, with <h1> being the highest level and <h6> the lowest. These are used to define headings of different levels.
- <p>...</p>: Defines a paragraph.
- <a>...</a>: Defines a hyperlink, which is used to link from one page to another.
- <div>...</div>: A division or section of the document, often used as a container for other elements.
- <span>...</span>: An inline container used for styling purposes.
- <ul>...</ul>, <ol>...</ol>, <li>...</li>: Used to create unordered (bulleted) and ordered (numbered) lists.
- <table>...</table>, <tr>...</tr>, <td>...</td>: Used to create tables.
- <img>: Embeds an image in the document.
- <form>...</form>, <input>, <button>...</button>: Used to create forms for user input. (<input> is a void element and takes no closing tag.)
By understanding these basic elements and their structure, you can more effectively navigate and extract data from HTML pages during the web scraping process.
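To make this concrete, here is how the sample document above could be parsed with Beautiful Soup (introduced in the next section) to pull out its heading and paragraph; a minimal sketch:
from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
<html>
  <head><title>Page Title</title></head>
  <body>
    <h1>This is a Heading</h1>
    <p>This is a paragraph.</p>
  </body>
</html>"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)  # Page Title
print(soup.h1.text)       # This is a Heading
print(soup.p.text)        # This is a paragraph.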
2. Tools and Libraries for Web Scraping
There are several tools and libraries available for web scraping. Some popular ones include:
- Beautiful Soup: A Python library for parsing HTML and XML documents.
- Scrapy: An open-source web crawling framework for Python.
- Selenium: A tool for automating web browsers, useful for scraping dynamic content.
- urllib: A Python module for fetching URLs (Uniform Resource Locators).
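Beautiful Soup, Scrapy, and Selenium are third-party packages (urllib ships with the Python standard library). If you want to follow along, they can be installed with pip; note that Beautiful Soup's package on PyPI is named beautifulsoup4:
pip install beautifulsoup4 scrapy selenium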
3. Finding Specific Elements in an HTML Page
To scrape data, you need to find the specific HTML elements that contain the data you want. This can be done using CSS selectors or XPath expressions.
Example: Using Beautiful Soup
from bs4 import BeautifulSoup
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
# Get the title
title = soup.title.string
print(title)
# Get all links
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
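Beautiful Soup also supports the CSS selectors mentioned above through its select() method, so the same links can be found by class name. (Beautiful Soup itself does not support XPath; for that you would typically use a library such as lxml.)
# Select all <a> elements with class "sister" via a CSS selector
for link in soup.select('a.sister'):
    print(link.get('href'))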
4. Practical Example: Web Scraping with Python and urllib
Step 1: Import Libraries
import urllib.request
from bs4 import BeautifulSoup
Step 2: Send a GET Request
url = 'http://example.com'
response = urllib.request.urlopen(url)
html_content = response.read().decode('utf-8')
Step 3: Parse HTML Content
soup = BeautifulSoup(html_content, 'html.parser')
Step 4: Extract Specific Data
title = soup.title.string
print(f'Title: {title}')
# Extract all paragraphs
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)
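One practical note: some servers reject requests that carry urllib's default user agent. A common workaround is to send a browser-like User-Agent header with urllib.request.Request; a sketch (the header string here is only an example):
import urllib.request

url = 'http://example.com'
req = urllib.request.Request(
    url,
    headers={'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'}
)
response = urllib.request.urlopen(req)
html_content = response.read().decode('utf-8')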
5. Dealing with Complex Pages
Many websites have complex structures with nested elements. Let's look at an example where we need to navigate through nested tags to extract data.
Example: Navigating Nested Tags
html = """
<html>
  <body>
    <div class="content">
      <h1>Main Heading</h1>
      <div class="article">
        <h2>Article Title</h2>
        <p>Article content here.</p>
      </div>
    </div>
  </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
# Get the main heading
main_heading = soup.find('h1').text
print(f'Main Heading: {main_heading}')
# Get the article title
article_title = soup.find('div', class_='article').find('h2').text
print(f'Article Title: {article_title}')
# Get the article content
article_content = soup.find('div', class_='article').find('p').text
print(f'Article Content: {article_content}')
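CSS selectors can express the same nested path in a single call via select_one(), which returns the first match:
# Descendant selectors walk the nesting in one expression
article_title = soup.select_one('div.content div.article h2').text
article_content = soup.select_one('div.article p').text
print(f'Article Title: {article_title}')
print(f'Article Content: {article_content}')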
6. Handling Authentication
Some websites require authentication to access certain pages. You can handle authentication in urllib by using the `HTTPBasicAuthHandler` or `HTTPCookieProcessor` classes.
Example: Basic Authentication
import urllib.request
# Create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password
top_level_url = "http://example.com/"
password_mgr.add_password(None, top_level_url, 'username', 'password')
# Create an authentication handler
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# Create an opener
opener = urllib.request.build_opener(handler)
# Use the opener to fetch a URL
url = "http://example.com/protected_page"
response = opener.open(url)
html_content = response.read().decode('utf-8')
print(html_content)
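Keep in mind that HTTP Basic Auth sends the credentials with every request, merely base64-encoded rather than encrypted, so it should only ever be used over HTTPS.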
7. Handling Cookies
Cookies are used by websites to store user session information. You can handle cookies in urllib using the `HTTPCookieProcessor` class.
Example: Handling Cookies
import http.cookiejar
import urllib.request
# Create a cookie jar
cj = http.cookiejar.CookieJar()
# Create an opener with HTTPCookieProcessor
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
# Use the opener to fetch a URL
url = "http://example.com"
response = opener.open(url)
html_content = response.read().decode('utf-8')
print(html_content)
# Print cookies
for cookie in cj:
    print(f'Cookie: {cookie.name}={cookie.value}')
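If you need cookies to survive between runs, the standard library also provides MozillaCookieJar, which saves to and loads from a Netscape-format cookies.txt file; a minimal sketch:
import http.cookiejar
import urllib.request

# MozillaCookieJar persists cookies in the Netscape cookies.txt format
cj = http.cookiejar.MozillaCookieJar('cookies.txt')
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.open('http://example.com')
cj.save()  # write cookies to disk; cj.load() restores them in a later run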
8. Handling Redirects
Some websites redirect users to another page. By default, urllib handles redirects automatically. However, you can customize this behavior if needed.
Example: Handling Redirects
class RedirectHandler(urllib.request.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print(f'Redirected to: {headers["Location"]}')
        return super().http_error_301(req, fp, code, msg, headers)

    def http_error_302(self, req, fp, code, msg, headers):
        print(f'Redirected to: {headers["Location"]}')
        return super().http_error_302(req, fp, code, msg, headers)
opener = urllib.request.build_opener(RedirectHandler())
url = "http://example.com"
response = opener.open(url)
html_content = response.read().decode('utf-8')
print(html_content)
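Whether or not you customise the handler, you can check where you finally ended up: the response object returned by urlopen() or opener.open() exposes geturl(), which reports the URL after any redirects.
print(response.geturl())  # final URL after redirects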
9. Respecting Robots.txt
The `robots.txt` file on a website specifies which parts of the site should not be accessed by web crawlers. It's important to respect this file when scraping websites.
Example: Checking Robots.txt
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()
url = "http://example.com/some_page"
user_agent = "MyScraper"
if rp.can_fetch(user_agent, url):
    print("You can scrape this page.")
else:
    print("You are not allowed to scrape this page.")
10. Dealing with JavaScript-Rendered Content
Some websites use JavaScript to load content dynamically. In such cases, libraries like Selenium can be used to scrape the rendered HTML.
Example: Using Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up the WebDriver (Selenium 4.6+ locates the driver binary automatically)
driver = webdriver.Chrome()
# Open the webpage
driver.get("http://example.com")
# Extract data
title = driver.title
print(f'Title: {title}')
# Extract dynamic content (find_element replaces the removed find_element_by_id)
dynamic_content = driver.find_element(By.ID, "dynamic_content").text
print(f'Dynamic Content: {dynamic_content}')
# Close the WebDriver
driver.quit()
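Dynamic content is often not present the instant the page loads. Selenium's explicit waits block until an element appears; a sketch reusing the same (hypothetical) dynamic_content element:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the element to appear in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamic_content"))
)
print(element.text)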
11. Throttling Requests
To avoid overloading a server with requests, it's important to add delays between requests. This can be achieved using the `time` module.
Example: Adding Delays
import time
import urllib.request

urls = ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"]
for url in urls:
    response = urllib.request.urlopen(url)
    html_content = response.read().decode('utf-8')
    print(html_content)
    # Add a delay between requests
    time.sleep(2)
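A fixed interval is easy for servers to spot. Randomising the delay makes the traffic pattern less mechanical:
import random
import time

# Sleep for a random interval between 1 and 3 seconds
time.sleep(random.uniform(1, 3))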
12. Error Handling
When scraping websites, you may encounter various errors such as connection errors or HTTP errors. Proper error handling is crucial to ensure your scraper runs smoothly.
Example: Handling Errors
import urllib.request
import urllib.error

url = 'http://example.com'
try:
    response = urllib.request.urlopen(url)
    html_content = response.read().decode('utf-8')
except urllib.error.HTTPError as e:
    print(f'HTTP Error: {e.code}')
except urllib.error.URLError as e:
    print(f'URL Error: {e.reason}')
else:
    print(html_content)
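Transient failures such as timeouts or 429 and 503 responses often succeed on a second attempt. A simple pattern, sketched here with exponential backoff, retries a few times before giving up:
import time
import urllib.request
import urllib.error

def fetch_with_retries(url, attempts=3):
    """Fetch a URL, retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            response = urllib.request.urlopen(url)
            return response.read().decode('utf-8')
        except (urllib.error.HTTPError, urllib.error.URLError) as e:
            if attempt == attempts - 1:
                raise  # out of retries, re-raise the last error
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f'Request failed ({e}); retrying in {wait}s')
            time.sleep(wait)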
13. Storing Scraped Data
After scraping data, you may want to store it for later use. You can store data in various formats such as CSV, JSON, or databases.
Example: Storing Data in CSV
import csv
data = [
    ["Title", "Link"],
    ["Example Title 1", "http://example.com/1"],
    ["Example Title 2", "http://example.com/2"]
]
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)
Example: Storing Data in JSON
import json
data = {
    "titles": ["Example Title 1", "Example Title 2"],
    "links": ["http://example.com/1", "http://example.com/2"]
}
with open('data.json', 'w') as file:
    json.dump(data, file)
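The section also mentions databases; for small scraping jobs Python's built-in sqlite3 module is enough. A minimal sketch using the same example rows:
import sqlite3

conn = sqlite3.connect('scraped.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (title TEXT, link TEXT)')
rows = [
    ("Example Title 1", "http://example.com/1"),
    ("Example Title 2", "http://example.com/2"),
]
# Parameterised inserts avoid SQL injection from untrusted scraped text
conn.executemany('INSERT INTO pages (title, link) VALUES (?, ?)', rows)
conn.commit()
conn.close()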
14. Conclusion
Web scraping is a powerful tool for extracting data from websites. Using Python and libraries like urllib and BeautifulSoup, you can automate the process of gathering information from the web. Always remember to respect the website's terms of service and robots.txt file, and handle requests responsibly to avoid overloading servers.
Simplified Webscraping
See what your competitors are doing with MyQuants, the simple web scraping tool that removes all the fuss and gives you the tools you need to keep up with, and overtake, the competition.
Get Now