
Lesson 2. Meet the Animals

The Ohio State University Libraries

This lesson continues to explore the diverse features of BeautifulSoup, a Python library designed for parsing XML and HTML documents. We will use BeautifulSoup to extract information about a select group of animals showcased on the Meet the Animals webpage of Smithsonian’s National Zoo and Conservation Biology Institute. Additionally, we will explore Pandas, a powerful Python library used for structuring, analyzing, and manipulating data.

Data skills | concepts

  • Search parameters
  • HTML
  • Web scraping
  • Pandas data structures

Learning objectives

  1. Identify search parameters and understand how they are inserted into a URL.
  2. Navigate document nodes, element nodes, attribute nodes, and text nodes in a Document Object Model (DOM).
  3. Extract and store HTML elements.
  4. Export data to a .csv file.

This tutorial is designed to support multi-session workshops hosted by The Ohio State University Libraries Research Commons. It assumes you already have a basic understanding of Python, including how to iterate through lists and dictionaries to extract data using a for loop. To learn basic Python concepts visit the Python - Mastering the Basics tutorial.

LESSON 2

Before starting any web scraping or API project, you must:

Step 1. Review and understand the terms of use.

  • Do the terms of service include any restrictions or guidelines?

  • Are permissions/licenses needed to scrape data? If yes, have you obtained these permissions/licenses?

  • Is the information publicly available?

  • If a database, is the database protected by copyright? Or in the public domain?

Fair Use

Limited use of copyrighted materials is allowed under certain conditions for journalism, scholarship, and teaching. Use the Resources for determining fair use to verify your project is within the scope of fair use. Contact University Libraries Copyright Services if you have any questions.

Check for robots.txt directives

robots.txt directives limit web scraping or web crawling. Look for this file in the root directory of the website by adding /robots.txt to the end of the URL. Respect these directives.
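You can also check a specific path programmatically with Python’s built-in urllib.robotparser. Below is a minimal sketch; it assumes the Smithsonian National Zoo site used in this lesson, so swap in whatever site you plan to scrape:

from urllib.robotparser import RobotFileParser

rp=RobotFileParser()
rp.set_url('https://nationalzoo.si.edu/robots.txt')  #the robots.txt file lives at the site root
rp.read()
print(rp.can_fetch('*', 'https://nationalzoo.si.edu/animals/meerkat'))  #True if scraping this path is allowed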

Step 2. Is an API available?

Technically, yes. The Smithsonian Institution provides an Open Access API that allows developers to access a wide range of data.

However, for learning purposes, we’ll focus on scraping a small sample from the Meet the Animals HTML page. This will help us practice how to:

  • Navigate a webpage’s structure
  • Extract specific HTML elements
  • Store the data for further use

This hands-on approach is a great way to build foundational web scraping skills before working with APIs.

Step 3. Examine the URL
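When you search or filter a site such as Meet the Animals, your selections are typically appended to the URL as query parameters: everything after a ?, written as key=value pairs joined by &. Spotting these parameters in the address bar lets you reproduce a search in code. Below is a minimal sketch using the requests library; the parameter names (title, page) are hypothetical placeholders for illustration, not the page’s actual parameters:

import requests

url='https://nationalzoo.si.edu/animals/list'  #assumed list-page URL
params={'title':'meerkat', 'page':0}  #hypothetical parameter names for illustration
response=requests.get(url, params=params)  #requests builds the query string for you
print(response.url)  #e.g. https://nationalzoo.si.edu/animals/list?title=meerkat&page=0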

Step 4. Inspect the elements

Both XML and HTML are structured as trees, where elements are nested within one another. When you request a URL, the server returns an HTML or XML document. Your browser then downloads and parses this document to display it visually.
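To see this tree in code, here is a minimal sketch that parses a small, invented HTML fragment with BeautifulSoup and touches each node type named in the learning objectives:

from bs4 import BeautifulSoup

html='<div class="animal"><h1>Meerkat</h1></div>'  #invented fragment for illustration
soup=BeautifulSoup(html, 'html.parser')  #the parsed document node
div=soup.find('div')  #an element node
print(div['class'])  #an attribute node -> ['animal']
print(div.h1.text)  #a text node -> Meerkat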

In Lesson 1 we worked with well-structured XML, which made it easy to navigate:

  • Each article was uniquely identified by the <LogicalSectionID> tag.
  • Titles appeared in the <LogicalSectionTitle> tag.
  • Category type was included in the <LogicalSectionType> tag.

In contrast, HTML documents can be more complex and less predictable. Fortunately, Google Chrome’s Developer Tools make it easier to explore and understand the structure of a webpage.

Example:

Find the common name for the meerkat.

  1. Open the meerkat Meet the Animals webpage in Chrome.
  2. Right-click on the element you want to inspect (e.g., the common name).
  3. Select Inspect.
[Image: meerkat_inspect.png]

This opens the Developer Tools panel, typically on the right of the screen.

  • The default Elements tab shows the HTML structure (DOM).

  • Scroll through the rendered HTML to explore more content.

  • Click the inspect icon in the top-left corner of the Developer Tools panel.

  • Hover over elements on the webpage to highlight them in the HTML.

As you hover, Chrome will:

  • Highlight the corresponding element on the page
  • Show a tooltip with tag details (e.g., class, ID)
  • Reveal the element’s location in the HTML tree
[Image: meerkat_inspect_element]

This process helps you identify the exact tags and attributes you’ll need to target when scraping data from the page.

Viewing an Element’s HTML Structure

To examine an element’s exact location within the DOM:

  1. In Chrome Developer Tools, right-click on the highlighted element.
  2. Select Copy > Copy element.
  3. Paste the copied HTML into Notepad or any text editor to view its full structure and attributes.

This is especially helpful for identifying tags, classes, and nesting when preparing to extract data through web scraping.

[Image: meerkat_copy_element.png]
[Image: meerkat_notepad.png]

Step 5. Identify Python libraries for the project

requests

The requests library retrieves HTML or XML documents from a server and processes the response.

BeautifulSoup

BeautifulSoup parses HTML and XML documents, helping you search for and extract elements from the DOM.
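A minimal sketch combining the two libraries (it assumes the meerkat page lives at the URL shown, and it guesses that the common name sits in the page’s <h1> tag, so verify both with the Inspect steps from Step 4):

import requests
from bs4 import BeautifulSoup

response=requests.get('https://nationalzoo.si.edu/animals/meerkat')  #retrieve the HTML document
soup=BeautifulSoup(response.text, 'html.parser')  #parse it into a searchable DOM
print(soup.find('h1').text.strip())  #assumes the common name is in the <h1> tag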

pandas

Pandas is a powerful Python library used for manipulating and analyzing tabular data. Helpful Pandas data structures and methods include:

pd.DataFrame

A Pandas DataFrame is one of the most powerful and commonly used data structures in Python for working with tabular data—data that is organized in rows and columns, similar to a spreadsheet or SQL table.

A DataFrame is a 2-dimensional labeled data structure with:

  • Rows (each representing an observation or record)
  • Columns (each representing a variable or feature)

Think of it like an Excel sheet or a table in a database.

import pandas as pd

df=pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
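Example (a minimal sketch built from invented sample values):

import pandas as pd

data={'animal':['meerkat', 'elds-deer'], 'size':[10, 150]}  #invented sample values
df=pd.DataFrame(data)  #each dictionary key becomes a column
df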

🔗 See Pandas DataFrame documentation.


pd.read_csv( )

The pd.read_csv() function is used to read data from a CSV (Comma-Separated Values) file and load it into a DataFrame.

pd.read_csv('INSERT FILEPATH HERE')

Example:

import pandas as pd
df=pd.read_csv('data/meet_the_animals.csv')  #df is a common abbreviation for DataFrame
df

🔗 See Pandas .read_csv( ) documentation.


.tolist( )

The .tolist() method is used to convert a Series (a single column of data) into a Python list.

Series.tolist()

Example:

import pandas as pd
df=pd.read_csv('data/meet_the_animals.csv')
animals=df.animal.tolist()
animals
['black-throated-blue-warbler', 'elds-deer', 'false-water-cobra', 'hooded-merganswer', 'patagonian-mara']

🔗 See .tolist( ) documentation.


.dropna( )

The dropna() method is used to remove missing values (NaN) from a DataFrame or Series. It’s a fast and effective way to clean your data—but it should be used with care.

DataFrame.dropna(*, axis=0, how=<no_default>, thresh=<no_default>, subset=None, inplace=False, ignore_index=False)

By default, if you don’t specify an axis or a subset of columns, Pandas will drop any row or column that contains at least one NaN value. This can lead to significant data loss if not used intentionally.

⚠️ Use with Caution
  • axis=0 drops rows with missing values (default)
  • axis=1 drops columns with missing values
  • You can also specify a subset of columns to check for NaNs
✅ Best Practice

Before using .dropna(), consider assigning the result to a new variable to avoid losing your original data:
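df_clean=df.dropna()  #df_clean is an arbitrary new name; the original df keeps all of its rows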

🔗 See .dropna( ) documentation.


.fillna( )

The .fillna() method is used to replace NaN (missing) values with a value you specify.

Series.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=<no_default>)

This is especially useful when you want to:

  • Fill in missing data with a default value
  • Use statistical values like the mean or median
  • Forward-fill or backward-fill based on surrounding data
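Example (a minimal sketch using an invented Series so the missing value is easy to see):

import pandas as pd

s=pd.Series([10, None, 150])  #invented data; None becomes NaN
print(s.fillna(0).tolist())  #fill with a default value -> [10.0, 0.0, 150.0]
print(s.fillna(s.mean()).tolist())  #fill with the mean -> [10.0, 80.0, 150.0]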

🔗 See .fillna( ) documentation.


.iterrows( )

The .iterrows() method allows you to iterate over each row in a DataFrame as a pair:

  • The index of the row
  • The row data as a pandas Series

DataFrame.iterrows()

Example:

import pandas as pd
df=pd.read_csv('data/meet_the_animals.csv')
for idx, row in df.iterrows():
    print(row.animal)
black-throated-blue-warbler
elds-deer
false-water-cobra
hooded-merganswer
patagonian-mara

This is useful when you need to process rows one at a time, especially for tasks like conditional logic or row-wise operations.

🔗 See .iterrows( ) documentation.


.iloc

The .iloc property is used to select rows (and columns) by their integer position (i.e., by index number, not label).

DataFrame.iloc[start:end]

Example:
import pandas as pd
df=pd.read_csv('data/meet_the_animals.csv')
for idx, row in df.iloc[0:1].iterrows():
    print(row.animal)
black-throated-blue-warbler
  • .iloc[row_index] accesses a specific row
  • .iloc[row_index, column_index] accesses a specific cell
  • You can also use slicing to select multiple rows or columns
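For instance, continuing with the same CSV (a minimal sketch; it assumes animal is the first column):

import pandas as pd

df=pd.read_csv('data/meet_the_animals.csv')
print(df.iloc[0])  #the first row, returned as a Series
print(df.iloc[0, 0])  #the value in the first row, first column -> 'black-throated-blue-warbler'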

Use .iloc when:

  • You want to access data by position, not by label
  • You’re working with numeric row/column indices
  • You’re iterating or slicing through rows or columns

🔗 See .iloc documentation.


pd.concat( )

The pandas.concat function is used to join two or more DataFrames along a specific axis:

  • axis=0 → stacks DataFrames vertically (adds rows)
  • axis=1 → stacks DataFrames horizontally (adds columns)

pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)

Example:

import pandas as pd

results=pd.DataFrame(columns=['common_name','size'])
df=pd.read_csv('data/meet_the_animals.csv')
for idx, row in df.iterrows():
    common_name=row.animal
    size=10
    data_row={
        'common_name':common_name,
        'size':size     
    }
    data=pd.DataFrame(data_row, index=[0])
    results=pd.concat([results, data], axis=0, ignore_index=True)  #append the new row to the end of results

results

🔗 See .concat documentation.


BONUS: try/except

Even with well-written code, things can go wrong—like missing HTML tags on a webpage or inconsistent data formats. That’s where Python’s try / except blocks come in.

They allow your program to handle errors gracefully instead of crashing.

🧪 How It Works

  • The code inside the try block is executed first.
  • If an error occurs, Python jumps to the except block.
  • Your program continues running without stopping unexpectedly.

Example:

import pandas as pd

results=pd.DataFrame(columns=['common_name','size'])
df=pd.read_csv('data/meet_the_animals.csv')  #define df so this snippet runs on its own
for idx, row in df.iterrows():
    try:
        common_name=row.animal
        size=10
        data_row={
            'common_name':common_name,
            'size':size     
        }
        data=pd.DataFrame(data_row, index=[0])
        results=pd.concat([results, data], axis=0, ignore_index=True)
    except Exception:
        common_name='no name found'
        size=0
        data_row={
                    'common_name':common_name,
                    'size':size     
                }
        data=pd.DataFrame(data_row, index=[0])
        results=pd.concat([results, data], axis=0, ignore_index=True)

results

Step 6. Write and test code