
Lesson 5. Author Affiliations

The Ohio State University Libraries

The Entrez E-utilities offer a suite of tools that enable researchers to automate the search and retrieval of scientific information from PubMed and other databases maintained by the National Center for Biotechnology Information (NCBI). In Lesson 4, we identified active NIH-funded research projects at The Ohio State University and generated a list of PMIDs (PubMed Identifiers) associated with each project. In Lesson 5, we will use this list of PMIDs with the Entrez E-utilities to gather the affiliations of each author listed on the corresponding articles. We will also begin to explore regular expressions, a tool used across programming languages for matching and manipulating string data.

Data skills | concepts

  • Working with APIs
  • Manipulating text
  • Regular expressions

Learning objectives

  1. Locate API documentation and identify key components required to formulate an API request.
  2. Parse an API response and store extracted data.
  3. Utilize regular expressions to search, match, and manipulate text.

This tutorial is designed to support multi-session workshops hosted by The Ohio State University Libraries Research Commons. It assumes you already have a basic understanding of Python, including how to iterate through lists and dictionaries to extract data using a for loop. To learn basic Python concepts, visit the Python - Mastering the Basics tutorial.

LESSON 5

Science is constantly evolving, with new disciplines emerging from interdisciplinary research, technological innovation, and global collaboration. Analyzing research networks can help researchers identify potential collaborators, track emerging fields, and discover new research directions.

One effective way to explore these networks is by examining author affiliations listed in journal publications.

What is EFetch?

EFetch is a utility provided by NCBI’s Entrez system that retrieves detailed records for a list of unique identifiers (like PMIDs) from databases such as PubMed.

Where are author affiliations?

In PubMed records, author affiliations are embedded in the XML under:

<Author>
  <AffiliationInfo>
    <Affiliation>...</Affiliation>
  </AffiliationInfo>
</Author>
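As a minimal sketch of how that nesting translates into code, the example below parses a small hand-written XML fragment shaped like the structure above and collects the text of each Affiliation element. The sample record, the author names, and the helper name extract_affiliations are illustrative assumptions, not real PubMed data. Python's built-in xml.etree.ElementTree is used here in place of BeautifulSoup only so the snippet runs without extra installs.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like a PubMed author list (not a real record).
SAMPLE_XML = """
<AuthorList>
  <Author>
    <LastName>Doe</LastName>
    <AffiliationInfo>
      <Affiliation>Department of Physics, Example University, Columbus, OH, USA.</Affiliation>
    </AffiliationInfo>
  </Author>
  <Author>
    <LastName>Roe</LastName>
    <AffiliationInfo>
      <Affiliation>Department of Chemistry, Example University, Columbus, OH, USA.</Affiliation>
    </AffiliationInfo>
  </Author>
</AuthorList>
"""


def extract_affiliations(xml_text):
    """Return a list of (last_name, affiliation) pairs found in xml_text."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for author in root.iter("Author"):
        last_name = author.findtext("LastName", default="")
        # An author may list more than one affiliation.
        for aff in author.findall("AffiliationInfo/Affiliation"):
            pairs.append((last_name, (aff.text or "").strip()))
    return pairs


pairs = extract_affiliations(SAMPLE_XML)
for name, affiliation in pairs:
    print(f"{name}: {affiliation}")
```

Looping over Author first, rather than searching for Affiliation directly, keeps each affiliation tied to the author it belongs to.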

Step 1. Construct an EFetch request

To retrieve XML data for a PubMed article, use the following components:

  • Base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
  • Parameters:
    • Database name: ?db=pubmed
    • Unique identifier: &id=39773557
    • API key: &api_key=INSERT YOUR API KEY HERE

Required parameters for an EFetch request depend on the specific Entrez database you are querying. For PubMed, the default EFetch response format is XML.
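The components listed above can be assembled into a single request URL. The sketch below only builds the URL and does not contact NCBI. The function name build_efetch_url and the placeholder key YOUR_API_KEY are assumptions made for illustration.

```python
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmid, api_key):
    """Join the base URL and the query parameters into one EFetch URL."""
    params = {"db": "pubmed", "id": str(pmid), "api_key": api_key}
    # urlencode() inserts the "&" separators and escapes unsafe characters.
    return f"{EFETCH_BASE}?{urlencode(params)}"


url = build_efetch_url(39773557, "YOUR_API_KEY")  # placeholder, not a real key
print(url)
```

The resulting string could then be passed to requests.get(), whose response text would hold the XML record.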

To manage request volume, the NCBI enforces rate limits:

  • Without an API key: 3 requests per second
  • With an API key: up to 10 requests per second
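One simple way to stay under these limits is to pause between requests. The sketch below is one possible approach, not an NCBI-provided mechanism. The helper names min_interval and paced are hypothetical, and the loop only prints instead of sending requests.

```python
import time


def min_interval(has_api_key):
    """Seconds to wait between requests to stay within the stated limits."""
    requests_per_second = 10 if has_api_key else 3
    return 1 / requests_per_second


def paced(items, has_api_key=True):
    """Yield items one at a time, sleeping between them to respect the limit."""
    delay = min_interval(has_api_key)
    for position, item in enumerate(items):
        if position > 0:
            time.sleep(delay)
        yield item


# The last two identifiers are made-up placeholders.
pmids = ["39773557", "00000001", "00000002"]
for pmid in paced(pmids):
    print(f"would request PMID {pmid}")
```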

While you can view a single XML record without an API key, completing the exercises in this tutorial requires one. You can obtain an API key by visiting the Settings page of your NCBI account.

Step 2. Identify Python libraries for project

The following Python libraries are needed for this project:

  • requests – to make HTTP requests
  • pandas – to manage and store data
  • BeautifulSoup – to parse XML and extract affiliations

Regular Expressions (regex)

Analyzing author and affiliation data can be messy due to:

  • Inconsistent naming conventions
  • Variations in institutional affiliation formats
  • Ambiguities in author identity

Regular expressions (regex) match patterns in text. Often described as wildcards on steroids, regular expressions help:

  • Validate patterns (e.g., ZIP codes: ^\d{5}$)
  • Extract variations (e.g., “ha?ematology” matches both “hematology” and “haematology”)
  • Replace text (e.g., re.sub(r'\bOH\b', 'Ohio', text))
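The three uses listed above can be tried out with Python's built-in re module. This is a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Validate: the whole string must be exactly five digits.
zip_pattern = re.compile(r"^\d{5}$")
valid_zip = bool(zip_pattern.search("43210"))
invalid_zip = bool(zip_pattern.search("4321"))
print(valid_zip, invalid_zip)

# Extract variations: the "a" after "h" is optional.
spellings = re.findall(r"ha?ematology", "hematology and haematology")
print(spellings)

# Replace: swap the standalone abbreviation "OH" for "Ohio".
text = "Columbus, OH, 43210, USA"
expanded = re.sub(r"\bOH\b", "Ohio", text)
print(expanded)
```

The \b markers in the last pattern are word boundaries, so "OH" inside a longer word such as "OHIO" would be left alone.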

Regular expressions are supported in many programming languages and software programs, including Python, JavaScript, and Tableau.

Common Regex Patterns

Pattern           Matches
[A-Z]             Any uppercase letter
[a-z]             Any lowercase letter
[0-9]{5}          Exactly 5 digits
^Ohio             Starts with “Ohio”
State$            Ends with “State”
ha?ematology      “hematology” or “haematology”
Ohio State|OSU    “Ohio State” or “OSU”
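To see how several of these patterns behave, the sketch below runs them against one affiliation-style string with re.search(). The sample string and the variable names are illustrative assumptions.

```python
import re

# Illustrative string written in the style of a PubMed affiliation.
affiliation = "Ohio State University, Columbus, OH 43210, USA"

checks = {
    r"[0-9]{5}": "contains five digits in a row",
    r"^Ohio": "starts with 'Ohio'",
    r"State$": "ends with 'State'",
    r"Ohio State|OSU": "mentions 'Ohio State' or 'OSU'",
}

results = {}
for pattern, description in checks.items():
    # re.search() returns a match object when the pattern is found, else None.
    results[pattern] = re.search(pattern, affiliation) is not None
    print(f"{pattern!r:20} {description}: {results[pattern]}")
```

Only State$ fails here, because the string ends with “USA” rather than “State”.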

Metacharacters are special symbols in regular expressions that represent patterns rather than literal characters. To match them as literal characters, you must escape them with a backslash ( \ ).
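A short sketch of why escaping matters: the period is a metacharacter, so an unescaped period matches more than intended. The second address in the sample text is made up to show the difference.

```python
import re

text = "fu.978@osu.edu and someone@osuXedu"

# Unescaped: "." is a metacharacter, so it matches any character.
unescaped = re.findall(r"osu.edu", text)

# Escaped: "\." matches only a literal period.
escaped = re.findall(r"osu\.edu", text)

# re.escape() adds the needed backslashes to a literal string.
auto_escaped = re.findall(re.escape("osu.edu"), text)

print(unescaped)
print(escaped)
print(auto_escaped)
```

The unescaped pattern also picks up “osuXedu”, while both escaped versions return only “osu.edu”.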

Common metacharacters and their functions

Symbol   Meaning                                        Example
[ ]      A set or range of characters                   [a-f] matches any lowercase letter from a to f
\        Starts a special sequence                      \w matches any word character (letter, digit, or underscore)
.        Any character except newline                   d.g matches “dog”, “dig”, “dug”, etc.
^        Start of a string                              ^Ohio matches any string that starts with “Ohio”
$        End of a string                                State$ matches any string that ends with “State”
*        Zero or more of the preceding character        ha*ematology matches “hematology”, “haematology”, etc.
+        One or more of the preceding character         spe+d matches “sped”, “speed”, etc.
?        Zero or one of the preceding character         travel?ling matches “traveling” and “travelling”
{ }      Exactly a specified number of repetitions      [0-9]{5} matches any 5-digit number
( )      Grouping or capturing                          The (Ohio) State University captures “Ohio”
|        Logical OR                                     Ohio State|OSU matches “Ohio State” or “OSU”
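A few of these metacharacters are sketched in Python below, using the same examples as the table. The sample strings and variable names are made up for illustration.

```python
import re

# ( ) captures part of a match; group(1) returns the first captured piece.
match = re.search(r"The (Ohio) State University", "The Ohio State University, Columbus")
captured = match.group(1) if match else None
print(captured)

# ? makes the preceding character optional.
spellings = re.findall(r"travel?ling", "traveling or travelling")
print(spellings)

# + requires one or more of the preceding character; * also allows zero.
one_or_more = re.findall(r"spe+d", "spd sped speed")
zero_or_more = re.findall(r"spe*d", "spd sped speed")
print(one_or_more)
print(zero_or_more)

# . matches any single character except a newline.
any_char = re.findall(r"d.g", "dog dig dug")
print(any_char)
```

Comparing the + and * results shows the difference between the two quantifiers: only * accepts “spd”, which has no “e” at all.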

LEARNING RESOURCES

regex101


regex101 (regular expressions 101) is an interactive tool for building, testing, and debugging regular expressions across multiple programming languages. It lets you test your regex against sample text, provides real-time explanations as you type, and includes a searchable reference for regex syntax.

Learning Regular Expressions


Learning Regular Expressions by Ben Forta is available through the Libraries’ O’Reilly Online Learning collection of technical books and videos. Each chapter is structured as a lesson, guiding you through how to match individual characters or sets of characters, use metacharacters, and more—making it a practical resource for mastering regex step by step.

re module

To work with regular expressions in Python, start by importing the built-in re module:

import re

🔗 See re module documentation.

Commonly used re methods

re.match( )

re.match(pattern, string, flags=0)
  • Checks for a match only at the beginning of the string.
  • Returns a match object if found, otherwise None.

Example:

import pandas as pd
import re

# Load the affiliations and drop rows with null affiliation values
addresses = pd.read_csv('data/pubmed_author_affiliations.csv')
addresses = addresses.dropna(subset=['affiliation'])

# Check the first 10 affiliations
for idx, row in addresses.iloc[0:10].iterrows():
    affiliation = str(row.affiliation)
    print(affiliation)
    # re.match() succeeds only if the pattern matches at the start of the string
    osu_match = re.match(r"Ohio State University", affiliation)
    if osu_match:
        print(f" MATCH {osu_match.group()}: {affiliation}")

Output:

The Ohio State Biochemistry Program, The Ohio State University, Columbus, OH, 43210, USA. fu.978@osu.edu.
The Ohio State Biochemistry Program, The Ohio State University, Columbus, OH, 43210, USA.
Center for Cancer Metabolism, The Ohio State University Comprehensive Cancer Center, Columbus, OH, 43210, USA.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, 43210, USA.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, 43210, USA.
The Ohio State Biochemistry Program, The Ohio State University, Columbus, OH, 43210, USA.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, USA. fu.978@osu.edu.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, USA.
Department of Physics, Northeastern University, Boston, MA 02115, USA.
Department of Chemistry and Biochemistry, Center for RNA Biology, Ohio State University, Columbus, OH 43210, USA.

No MATCH lines are printed because none of these affiliations begins with “Ohio State University”; re.match() tests only the start of the string.
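As a small self-contained sketch of the start-of-string behavior, the example below applies re.match() to two affiliations copied from the output above, using a pattern that one of them does begin with. The variable names are assumptions made for illustration.

```python
import re

affiliations = [
    "The Ohio State Biochemistry Program, The Ohio State University, Columbus, OH, 43210, USA.",
    "Department of Physics, Northeastern University, Boston, MA 02115, USA.",
]

starts_with_pattern = []
for affiliation in affiliations:
    # re.match() succeeds only when the pattern matches from the first character.
    found = re.match(r"The Ohio State", affiliation)
    starts_with_pattern.append(found is not None)
    if found:
        print(f"MATCH at {found.span()}: {found.group()}")
    else:
        print("No match at the start of the string")
```

Because re.match() returns None when nothing matches, the result should be checked before calling group() or span().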

🔗 See re.match() documentation.


re.search( )

re.search(pattern, string, flags=0)
  • Scans through the string and returns the first match of the pattern.
  • Returns a match object or None.

Example:

import pandas as pd
import re

# Load the affiliations and drop rows with null affiliation values
addresses = pd.read_csv('data/pubmed_author_affiliations.csv')
addresses = addresses.dropna(subset=['affiliation'])

# Check the first 10 affiliations
for idx, row in addresses.iloc[0:10].iterrows():
    affiliation = str(row.affiliation)
    # re.search() looks for the pattern anywhere in the string
    osu_search = re.search(r"The Ohio State University", affiliation)
    if osu_search:
        print(osu_search.group())
    else:
        print(f"No match: affiliation = {affiliation}")

Output:

The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
No match: affiliation = Department of Physics, Northeastern University, Boston, MA 02115, USA.
No match: affiliation = Department of Chemistry and Biochemistry, Center for RNA Biology, Ohio State University, Columbus, OH 43210, USA.
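The last line of the output shows an affiliation written as “Ohio State University” without the leading “The”. One possible way to catch both spellings, sketched below, is to make “The ” optional with a non-capturing group, written (?: ). The three strings are copied from the output above; the variable names are assumptions.

```python
import re

affiliations = [
    "Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, USA.",
    "Department of Chemistry and Biochemistry, Center for RNA Biology, Ohio State University, Columbus, OH 43210, USA.",
    "Department of Physics, Northeastern University, Boston, MA 02115, USA.",
]

# (?:The )? makes the leading "The " optional without capturing it.
osu_pattern = re.compile(r"(?:The )?Ohio State University")

found = []
for affiliation in affiliations:
    result = osu_pattern.search(affiliation)
    found.append(result.group() if result else None)

print(found)
```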

🔗 See re.search() documentation.


re.findall( )

re.findall(pattern, string, flags=0)
  • Returns all non-overlapping matches of the pattern in the string as a list.

Example:

# HOW MANY TORTOISES AND TURTLES ARE IN THIS LIST OF ANIMALS?

import pandas as pd
import re

df = pd.read_csv('data/animals_tortoises.csv')
animals = df.common_name.tolist()
animals = ','.join(animals)   # combine the names into one comma-separated string
print('LIST OF ANIMALS')
print(animals)

# re.findall() returns every match of "tortoise" or "turtle" as a list
tortoises_turtles = re.findall(r"tortoise|turtle", animals)
print('ANSWER')
print(f"There are {len(tortoises_turtles)} tortoises and turtles in this list of animals.")

Output:

LIST OF ANIMALS
abyssinian-ground-hornbill,addax,african-clawed-frog,african-pancake-tortoise,african-plated-lizard,aldabr-tortoise,allens-swamp-monkey,alligator-lizard,alligator-snapping-turtle,alpaca
ANSWER
There are 3 tortoises and turtles in this list of animals.
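re.findall() is also handy for pulling structured pieces out of affiliation text. The sketch below extracts an email address and a ZIP code from one affiliation string that appears earlier in this lesson. The email pattern is a deliberately simple assumption and is not a complete email validator.

```python
import re

affiliation = (
    "The Ohio State Biochemistry Program, The Ohio State University, "
    "Columbus, OH, 43210, USA. fu.978@osu.edu."
)

# Name characters, "@", then a domain containing at least one dot.
email_pattern = r"[\w.-]+@[\w-]+(?:\.[\w-]+)+"
emails = re.findall(email_pattern, affiliation)

# Runs of exactly five digits bounded by non-word characters.
zip_codes = re.findall(r"\b\d{5}\b", affiliation)

print(emails)
print(zip_codes)
```

The domain part requires at least one character after each dot, which keeps the sentence-ending period out of the extracted address.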

🔗 See re.findall() documentation.


re.sub( )

re.sub(pattern, repl, string, count=0, flags=0)
  • Replaces all occurrences of the pattern in the string with the replacement text (repl).
  • You can limit the number of replacements using the count parameter.

Example:

# FIND TORTOISES AND TURTLES AT THE NATIONAL ZOO AND REPLACE "tortoise" OR "turtle" WITH "SLOW TORTOISE"
import pandas as pd
import re

df = pd.read_csv('data/animals_tortoises.csv')
animals = df.common_name.tolist()
animals = ','.join(animals)   # combine the names into one comma-separated string
pattern = r"tortoise|turtle"
# Replace every match; the commas between names are left untouched
tortoises_slow = re.sub(pattern, "SLOW TORTOISE", animals)
tortoises_slow

Output:

'abyssinian-ground-hornbill,addax,african-clawed-frog,african-pancake-SLOW TORTOISE,african-plated-lizard,aldabr-SLOW TORTOISE,allens-swamp-monkey,alligator-lizard,alligator-snapping-SLOW TORTOISE,alpaca'
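The count parameter mentioned above can be sketched with a shortened version of the animal list. The variable names are made up for illustration, and the three animal names are taken from the list used in this lesson.

```python
import re

animals = "african-pancake-tortoise,aldabr-tortoise,alligator-snapping-turtle"

# count=0 (the default) replaces every match.
replace_all = re.sub(r"tortoise|turtle", "SLOW TORTOISE", animals)

# count=1 stops after the first replacement.
replace_first = re.sub(r"tortoise|turtle", "SLOW TORTOISE", animals, count=1)

print(replace_all)
print(replace_first)
```

Passing count as a keyword argument keeps it from being confused with flags, which follows it in the function signature.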

🔗 See re.sub() documentation.