This lesson introduces pandas.read_html, a useful tool for extracting tables from HTML, and continues to explore BeautifulSoup, a Python library designed for parsing XML and HTML documents. We will start by gathering the artists listed on Wikipedia's List of Rock and Roll Hall of Fame inductees page. We will then assemble discographies for two or three of our favorite artists.
Data skills | concepts¶
Search parameters
HTML
Web scraping
Pandas
Learning objectives¶
Extract and store tables and other HTML elements in a structured format
Apply best practices for managing data
This tutorial is designed to support multi-session workshops offered by The Ohio State University Libraries Research Commons. It assumes you already have a basic understanding of Python, including how to iterate through lists and dictionaries to extract data using a for loop. To learn basic Python concepts visit the Python - Mastering the Basics tutorial.
LESSON 3¶
Lesson 1 and Lesson 2 introduced the basic steps for any web scraping or API project:
Review and understand copyright and terms of use.
Check to see if an API is available.
Examine the URL.
Inspect the elements.
Identify Python libraries for the project.
Write and test code.
Solution
Read Wikipedia's Creative Commons Attribution-ShareAlike 4.0 License.
Solution
Yes, Wikipedia offers an API, but it may take some time to learn how to use. It is not necessary for this lesson.
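For reference, Wikipedia's API is the MediaWiki Action API at https://en.wikipedia.org/w/api.php. The sketch below shows one minimal way to fetch a page's rendered HTML through it; the page name is just an example, and the lesson's requests + BeautifulSoup approach works without it.
import requests

# A minimal sketch of the MediaWiki Action API (not required for this lesson).
url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'parse',                 # parse a page and return its content
    'page': 'Parliament_discography',  # example page
    'prop': 'text',                    # request the rendered HTML
    'format': 'json',
}
headers = {'User-Agent': 'Mozilla/5.0'}
data = requests.get(url, params=params, headers=headers).json()
html = data['parse']['text']['*']      # the page body as an HTML string
print(html[:200])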
Solution
Example:
Parliament-Funkadelic
Parliament discography
https:// + en.wikipedia.org/ + wiki/ + Parliament_discography
Funkadelic discography
https:// + en.wikipedia.org/ + wiki/ + Funkadelic_discography
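In Python, assembling these URLs is simple string concatenation; this is the same base_url pattern the solutions below use.
base_url = 'https://en.wikipedia.org/wiki/'
pages = ['Parliament_discography', 'Funkadelic_discography']
urls = [base_url + page for page in pages]  # build one full URL per page
print(urls)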
Solution
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Parliament_discography'
# Adding a User-Agent is sometimes necessary. Ask Copilot why identifying a
# user-agent for the headers parameter in requests.get is often necessary
# when scraping web data.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'html.parser')

headings = soup.find_all('div', {'class': 'mw-heading2'})
for heading in headings:
    print(heading.find('h2').text)

Solution
import requests
import pandas as pd
from bs4 import BeautifulSoup

Pandas¶
.read_html()¶
Read HTML tables directly into DataFrames with .read_html(). This extremely useful tool extracts all tables present in a specified URL, file, or HTML string, returning a list of DataFrames so that each table can be accessed using standard list indexing and slicing syntax.
The following code instructs Python to go to the Wikipedia page List of states and territories of the United States and retrieve the second table.
import requests
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers).text
tables = pd.read_html(response)
tables[1]

Solution
import requests
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_Rock_and_Roll_Hall_of_Fame_inductees'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers).text
tables = pd.read_html(response)
performers = tables[0]
performers

BeautifulSoup¶
.find_previous( ) and .find_all_previous( )¶
Similar to .find_next() and .find_all_next(), .find_previous() gathers the nearest previous instance of a named tag, and .find_all_previous() gathers all previous instances.
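Here is a small illustration of the difference, using a made-up HTML fragment shaped like Wikipedia's heading and table markup:
from bs4 import BeautifulSoup

# A made-up fragment shaped like Wikipedia's heading/table markup.
html = """
<div class="mw-heading2"><h2>Albums</h2></div>
<div class="mw-heading3"><h3>Studio albums</h3></div>
<table><tr><td>Osmium</td></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# .find_previous() walks backward from the table and returns the first match.
print(table.find_previous('div', {'class': 'mw-heading3'}).text)  # Studio albums

# .find_all_previous() returns every earlier match, nearest first.
print([div.text for div in table.find_all_previous('div')])  # ['Studio albums', 'Albums']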
Solution
import requests
import pandas as pd
from bs4 import BeautifulSoup

artists = ['Cyndi_Lauper_discography', 'Joe_Cocker_discography', 'The_White_Stripes_discography']
base_url = 'https://en.wikipedia.org/wiki/'

for artist in artists[0:1]:
    url = base_url + artist
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    # requests.get() does not accept an encoding parameter, but you can set
    # encoding on the response object after the request.
    response.encoding = 'utf-8'
    text = response.text
    soup = BeautifulSoup(text, 'html.parser')

    # FIND TABLES
    html_tables = pd.read_html(text)

    # FIND TABLE HEADERS
    table_headers = []
    table_number = 0
    tables = soup.find_all('table')
    for table in tables:
        if table.find_previous('div', {'class': 'mw-heading2'}) is not None:
            h2 = table.find_previous('div', {'class': 'mw-heading2'}).text.split('[')[0]
            table_headers.append(h2.lower().replace(' ', '_'))
            # print(f"table_number_{table_number}: {h2}")
        else:
            h2 = 'no_header'
            table_headers.append(h2)
        if table.find_previous('div', {'class': 'mw-heading3'}) is not None:
            h3 = table.find_previous('div', {'class': 'mw-heading3'}).text.split('[')[0]
            table_headers.append(h3.lower().replace(' ', '_'))
        else:
            h3 = 'no_header'
            table_headers.append(h3)
        print(f"table_number_{table_number}: {h2}, {h3}")
        table_number += 1

Managing files¶
There are several best practices and considerations for effectively managing research data files. When extracting and locally storing data in individual files, using standardized file naming conventions not only helps you organize and utilize your files efficiently but also facilitates sharing and collaboration with others in future projects.
Use short, descriptive names.
Use _underscores or -dashes instead of spaces in your file names.
Use leading zeros for sequential numbers to ensure proper sorting, e.g. file_001_20250506.txt, file_002_20250506.txt.
Use all lowercase for directory and file names if possible.
Avoid special characters, including ~!@#$%^&*()[]{}?:;<>|\/.
Use standardized dates (YYYYMMDD) to track versions and updates.
Include version control numbers to keep track of projects.
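As a quick sketch, Python f-strings can generate names that follow these conventions; the file_ prefix here is just an example.
from datetime import date

# Hypothetical filenames: lowercase, underscores, zero-padded numbers, YYYYMMDD dates.
today = date.today().strftime('%Y%m%d')
for number in range(1, 4):
    print(f"file_{number:03d}_{today}.txt")  # e.g. file_001_20250506.txt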
os module¶
The os module provides functions for interacting with the operating system, including creating directories and building the file paths that tell Python where to find and save files.
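For example, here are a few os functions that come up often in scraping projects; the path names in this sketch are illustrative.
import os

print(os.getcwd())                             # the current working directory
path = os.path.join('cyndi_lauper', 'tables')  # build a path that works on any operating system
print(os.path.exists(path))                    # check whether a file or directory exists
print(os.listdir('.'))                         # list the contents of the current directory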
os.mkdir('path')¶
Creates a new directory in your project folder or another specified location. If a directory with the same name already exists at the specified path, os.mkdir raises a FileExistsError (a subclass of OSError). Use a try-except block to handle the error.
import os

artist = "Cyndi_Lauper"
try:
    os.mkdir(artist)
except FileExistsError:
    print(f"Directory '{artist}' already exists.")
except Exception as e:
    print(f"An error occurred: {e}")

Make a directory for each artist in your project folder.
Use .to_csv() to create a file for each table. Incorporate the table number and header in the filename.
Increment a counter variable for table numbers.
Solution
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup

artists = ['Cyndi_Lauper_discography', 'Joe_Cocker_discography', 'The_White_Stripes_discography']
base_url = 'https://en.wikipedia.org/wiki/'

for artist in artists[0:1]:
    url = base_url + artist
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    # requests.get() does not accept an encoding parameter, but you can set
    # encoding on the response object after the request.
    response.encoding = 'utf-8'
    text = response.text
    soup = BeautifulSoup(text, 'html.parser')

    # FIND TABLES
    html_tables = pd.read_html(text)

    # FIND TABLE HEADERS
    table_headers = []
    table_number = 0
    tables = soup.find_all('table')
    for table in tables:
        if table.find_previous('div', {'class': 'mw-heading3'}) is not None:
            h3 = table.find_previous('div', {'class': 'mw-heading3'}).text.split('[')[0]
            table_headers.append(h3.lower().replace(' ', '_'))
            print(f"table_number_{table_number}: {h3}")
        elif table.find_previous('div', {'class': 'mw-heading2'}) is not None:
            h2 = table.find_previous('div', {'class': 'mw-heading2'}).text.split('[')[0]
            table_headers.append(h2.lower().replace(' ', '_'))
            print(f"table_number_{table_number}: {h2}")
        else:
            h = 'no_header'
            table_headers.append(h)
            print(f"table_number_{table_number}: {h}")
        table_number += 1

    # CREATE A DIRECTORY FOR EACH ARTIST AND OUTPUT TABLES TO THE DIRECTORY
    position = 0
    for each_header in table_headers:
        if each_header != 'no_header':
            table_name = each_header
            number = position
            artist_directory = artist.replace('_discography', '').lower()
            try:
                os.mkdir(artist_directory)
            except FileExistsError:
                print(f"Directory '{artist_directory}' already exists.")
            except Exception as e:
                print(f"An error occurred: {e}")
            filename = artist_directory + '/table_number_' + str(number) + '_' + table_name + '.csv'
            print(filename)
            html_table = html_tables[position]
            html_table.to_csv(filename)
        position += 1
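One optional refinement: DataFrame.to_csv() writes the row index as an unnamed first column by default; passing index=False (e.g. html_table.to_csv(filename, index=False)) omits it from the saved files.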