
Webscraping Competitors: For Analysts
Introduction to Webscraping
When you analyse data, especially competitor data, there comes a point when you want to know what a competitor is up to in real time.
This is where web scraping comes in. Web scraping is the process of extracting data from websites. It allows you to gather information from the web and use it for purposes such as price tracking, sentiment analysis, and application development.
On the face of it, web scraping is fairly straightforward: request the page, find the element, get the text. But you will very quickly run into request limits, CAPTCHAs, JavaScript-rendered content, and more. That is what we cover in the sections below.
1. Understanding the Structure of an HTML Page
HTML (HyperText Markup Language) is the standard language for creating web pages. It uses a system of tags to define elements that structure the content of a webpage. Understanding the basic structure of an HTML document is crucial for web scraping, as it helps you identify the elements from which you need to extract data.
Basic Structure of an HTML Document
An HTML document is made up of a series of nested elements. Each element is defined by a tag, enclosed in angle brackets (<>). HTML documents typically have a hierarchical structure consisting of the following parts:
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1>This is a Heading</h1>
    <p>This is a paragraph.</p>
  </body>
</html>
Let's break down this structure into its key components:
<!DOCTYPE html>
This declaration defines the document type and version of HTML. It helps the browser understand how to interpret the document. For modern web pages, this is typically set to HTML5 using <!DOCTYPE html>.
<html>...</html>
The <html> element is the root of an HTML document. It encapsulates all the content on the page.
<head>...</head>
The <head> section contains meta-information about the document, such as its title, character set, styles, scripts, and other metadata. Key elements within the <head> section include:
- <title>...</title>: Defines the title of the document, which is displayed in the browser's title bar or tab.
- <meta>: Provides metadata such as character set, author, description, and keywords.
- <link>: Used to link external resources like stylesheets.
- <style>...</style>: Contains internal CSS to style the document.
- <script>...</script>: Contains JavaScript for interactivity.
<body>...</body>
The <body> section contains the actual content of the webpage that is visible to the user. Common elements within the body include:
- <h1>...</h1> to <h6>...</h6>: Header tags, with <h1> being the highest level and <h6> the lowest. These are used to define headings of different levels.
- <p>...</p>: Defines a paragraph.
- <a>...</a>: Defines a hyperlink, which is used to link from one page to another.
- <div>...</div>: A division or section of the document, often used as a container for other elements.
- <span>...</span>: An inline container used for styling purposes.
- <ul>...</ul>, <ol>...</ol>, <li>...</li>: Used to create unordered (bulleted) and ordered (numbered) lists.
- <table>...</table>, <tr>...</tr>, <td>...</td>: Used to create tables.
- <img>: Embeds an image in the document.
- <form>...</form>, <input>, <button>...</button>: Used to create forms for user input. (<input> is a void element and takes no closing tag.)
By understanding these basic elements and their structure, you can more effectively navigate and extract data from HTML pages during the web scraping process.
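To make this concrete, here is how the sample document above could be parsed with Beautiful Soup (introduced in the next section) to pull out its heading and paragraph; a minimal sketch:
from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
<html>
  <head><title>Page Title</title></head>
  <body>
    <h1>This is a Heading</h1>
    <p>This is a paragraph.</p>
  </body>
</html>"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)  # Page Title
print(soup.h1.text)       # This is a Heading
print(soup.p.text)        # This is a paragraph.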
2. Tools and Libraries for Web Scraping
There are several tools and libraries available for web scraping. Some popular ones include:
- Beautiful Soup: A Python library for parsing HTML and XML documents.
- Scrapy: An open-source web crawling framework for Python.
- Selenium: A tool for automating web browsers, useful for scraping dynamic content.
- urllib: A Python module for fetching URLs (Uniform Resource Locators).
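Beautiful Soup, Scrapy, and Selenium are third-party packages (urllib ships with the Python standard library). If you want to follow along, they can be installed with pip; note that Beautiful Soup's package on PyPI is named beautifulsoup4:
pip install beautifulsoup4 scrapy selenium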
3. Finding Specific Elements in an HTML Page
To scrape data, you need to find the specific HTML elements that contain the data you want. This can be done using CSS selectors or XPath expressions.
Example: Using Beautiful Soup
from bs4 import BeautifulSoup
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
# Get the title
title = soup.title.string
print(title)
# Get all links
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
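Beautiful Soup also supports the CSS selectors mentioned above through its select() method, so the same links can be found by class name. (Beautiful Soup itself does not support XPath; for that you would typically use a library such as lxml.)
# Select all <a> elements with class "sister" via a CSS selector
for link in soup.select('a.sister'):
    print(link.get('href'))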
4. Practical Example: Web Scraping with Python and urllib
Step 1: Import Libraries
import urllib.request
from bs4 import BeautifulSoup
Step 2: Send a GET Request
url = 'http://example.com'
response = urllib.request.urlopen(url)
html_content = response.read().decode('utf-8')
Step 3: Parse HTML Content
soup = BeautifulSoup(html_content, 'html.parser')
Step 4: Extract Specific Data
title = soup.title.string
print(f'Title: {title}')
# Extract all paragraphs
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)
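One practical note: some servers reject requests that carry urllib's default user agent. A common workaround is to send a browser-like User-Agent header with urllib.request.Request; a sketch (the header string here is only an example):
import urllib.request

url = 'http://example.com'
req = urllib.request.Request(
    url,
    headers={'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'}
)
response = urllib.request.urlopen(req)
html_content = response.read().decode('utf-8')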
5. Dealing with Complex Pages
Many websites have complex structures with nested elements. Let's look at an example where we need to navigate through nested tags to extract data.
Example: Navigating Nested Tags
html = """
<html>
  <body>
    <div class="content">
      <h1>Main Heading</h1>
      <div class="article">
        <h2>Article Title</h2>
        <p>Article content here.</p>
      </div>
    </div>
  </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
# Get the main heading
main_heading = soup.find('h1').text
print(f'Main Heading: {main_heading}')
# Get the article title
article_title = soup.find('div', class_='article').find('h2').text
print(f'Article Title: {article_title}')
# Get the article content
article_content = soup.find('div', class_='article').find('p').text
print(f'Article Content: {article_content}')
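CSS selectors can express the same nested path in a single call via select_one(), which returns the first match:
# Descendant selectors walk the nesting in one expression
article_title = soup.select_one('div.content div.article h2').text
article_content = soup.select_one('div.article p').text
print(f'Article Title: {article_title}')
print(f'Article Content: {article_content}')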
6. Handling Authentication
Some websites require authentication to access certain pages. You can handle authentication in urllib by using the `HTTPBasicAuthHandler` or `HTTPCookieProcessor` classes.
Example: Basic Authentication
import urllib.request
# Create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password
top_level_url = "http://example.com/"
password_mgr.add_password(None, top_level_url, 'username', 'password')
# Create an authentication handler
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# Create an opener
opener = urllib.request.build_opener(handler)
# Use the opener to fetch a URL
url = "http://example.com/protected_page"
response = opener.open(url)
html_content = response.read().decode('utf-8')
print(html_content)
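Keep in mind that HTTP Basic Auth sends the credentials with every request, merely base64-encoded rather than encrypted, so it should only ever be used over HTTPS.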
7. Handling Cookies
Cookies are used by websites to store user session information. You can handle cookies in urllib using the `HTTPCookieProcessor` class.
Example: Handling Cookies
import http.cookiejar
import urllib.request
# Create a cookie jar
cj = http.cookiejar.CookieJar()
# Create an opener with HTTPCookieProcessor
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
# Use the opener to fetch a URL
url = "http://example.com"
response = opener.open(url)
html_content = response.read().decode('utf-8')
print(html_content)
# Print cookies
for cookie in cj:
    print(f'Cookie: {cookie.name}={cookie.value}')
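If you need cookies to survive between runs, the standard library also provides MozillaCookieJar, which saves to and loads from a Netscape-format cookies.txt file; a minimal sketch:
import http.cookiejar
import urllib.request

# MozillaCookieJar persists cookies in the Netscape cookies.txt format
cj = http.cookiejar.MozillaCookieJar('cookies.txt')
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.open('http://example.com')
cj.save()  # write cookies to disk; cj.load() restores them in a later run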
8. Handling Redirects
Some websites redirect users to another page. By default, urllib handles redirects automatically. However, you can customize this behavior if needed.
Example: Handling Redirects
class RedirectHandler(urllib.request.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print(f'Redirected to: {headers["Location"]}')
        return super().http_error_301(req, fp, code, msg, headers)

    def http_error_302(self, req, fp, code, msg, headers):
        print(f'Redirected to: {headers["Location"]}')
        return super().http_error_302(req, fp, code, msg, headers)
opener = urllib.request.build_opener(RedirectHandler())
url = "http://example.com"
response = opener.open(url)
html_content = response.read().decode('utf-8')
print(html_content)
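Whether or not you customise the handler, you can check where you finally ended up: the response object returned by urlopen() or opener.open() exposes geturl(), which reports the URL after any redirects.
print(response.geturl())  # final URL after redirects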
9. Respecting Robots.txt
The `robots.txt` file on a website specifies which parts of the site should not be accessed by web crawlers. It's important to respect this file when scraping websites.
Example: Checking Robots.txt
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()
url = "http://example.com/some_page"
user_agent = "MyScraper"
if rp.can_fetch(user_agent, url):
    print("You can scrape this page.")
else:
    print("You are not allowed to scrape this page.")
10. Dealing with JavaScript-Rendered Content
Some websites use JavaScript to load content dynamically. In such cases, libraries like Selenium can be used to scrape the rendered HTML.
Example: Using Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up the WebDriver (Selenium 4.6+ locates the driver binary automatically)
driver = webdriver.Chrome()
# Open the webpage
driver.get("http://example.com")
# Extract data
title = driver.title
print(f'Title: {title}')
# Extract dynamic content (find_element replaces the removed find_element_by_id)
dynamic_content = driver.find_element(By.ID, "dynamic_content").text
print(f'Dynamic Content: {dynamic_content}')
# Close the WebDriver
driver.quit()
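Dynamic content is often not present the instant the page loads. Selenium's explicit waits block until an element appears; a sketch reusing the same (hypothetical) dynamic_content element:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the element to appear in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamic_content"))
)
print(element.text)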
11. Throttling Requests
To avoid overloading a server with requests, it's important to add delays between requests. This can be achieved using the `time` module.
Example: Adding Delays
import time
import urllib.request

urls = ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"]
for url in urls:
    response = urllib.request.urlopen(url)
    html_content = response.read().decode('utf-8')
    print(html_content)
    # Add a delay between requests
    time.sleep(2)
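A fixed interval is easy for servers to spot. Randomising the delay makes the traffic pattern less mechanical:
import random
import time

# Sleep for a random interval between 1 and 3 seconds
time.sleep(random.uniform(1, 3))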
12. Error Handling
When scraping websites, you may encounter various errors such as connection errors or HTTP errors. Proper error handling is crucial to ensure your scraper runs smoothly.
Example: Handling Errors
import urllib.request
import urllib.error

url = 'http://example.com'
try:
    response = urllib.request.urlopen(url)
    html_content = response.read().decode('utf-8')
except urllib.error.HTTPError as e:
    print(f'HTTP Error: {e.code}')
except urllib.error.URLError as e:
    print(f'URL Error: {e.reason}')
else:
    print(html_content)
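Transient failures such as timeouts or 429 and 503 responses often succeed on a second attempt. A simple pattern, sketched here with exponential backoff, retries a few times before giving up:
import time
import urllib.request
import urllib.error

def fetch_with_retries(url, attempts=3):
    """Fetch a URL, retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            response = urllib.request.urlopen(url)
            return response.read().decode('utf-8')
        except (urllib.error.HTTPError, urllib.error.URLError) as e:
            if attempt == attempts - 1:
                raise  # out of retries, re-raise the last error
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f'Request failed ({e}); retrying in {wait}s')
            time.sleep(wait)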
13. Storing Scraped Data
After scraping data, you may want to store it for later use. You can store data in various formats such as CSV, JSON, or databases.
Example: Storing Data in CSV
import csv
data = [
    ["Title", "Link"],
    ["Example Title 1", "http://example.com/1"],
    ["Example Title 2", "http://example.com/2"]
]
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)
Example: Storing Data in JSON
import json
data = {
    "titles": ["Example Title 1", "Example Title 2"],
    "links": ["http://example.com/1", "http://example.com/2"]
}
with open('data.json', 'w') as file:
    json.dump(data, file)
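The section also mentions databases; for small scraping jobs Python's built-in sqlite3 module is enough. A minimal sketch using the same example rows:
import sqlite3

conn = sqlite3.connect('scraped.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (title TEXT, link TEXT)')
rows = [
    ("Example Title 1", "http://example.com/1"),
    ("Example Title 2", "http://example.com/2"),
]
# Parameterised inserts avoid SQL injection from untrusted scraped text
conn.executemany('INSERT INTO pages (title, link) VALUES (?, ?)', rows)
conn.commit()
conn.close()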
14. Conclusion
Web scraping is a powerful tool for extracting data from websites. Using Python and libraries like urllib and BeautifulSoup, you can automate the process of gathering information from the web. Always remember to respect the website's terms of service and robots.txt file, and handle requests responsibly to avoid overloading servers.
Simplified Webscraping
See what your competitors are doing with MyQuants, the simple web scraping tool that removes all the fuss and gives you the tools you need to keep up with, and overtake, the competition.
Get Now