The Entrez E-utilities offer a suite of tools that enable researchers to automate the search and retrieval of scientific information from PubMed and other databases maintained by the National Center for Biotechnology Information (NCBI). In Lesson 4 we identified an active NIH-funded research project at The Ohio State University and generated a list of PMIDs (PubMed Identifiers) associated with each project. In Lesson 5, we will use this list of PMIDs with the Entrez E-utilities to gather the affiliations of each author listed on the corresponding articles. We will also begin to explore regular expressions, a tool used across programming languages for matching and manipulating string data.
Data skills | concepts¶
- Working with APIs
- Manipulating text
- Regular expressions
Learning objectives¶
- Locate API documentation and identify key components required to formulate an API request
- Parse an API response and store extracted data.
- Utilize regular expressions to search, match, and manipulate text.
This tutorial is designed to support multi-session workshops hosted by The Ohio State University Libraries Research Commons. It assumes you already have a basic understanding of Python, including how to iterate through lists and dictionaries to extract data using a for loop. To learn basic Python concepts visit the Python - Mastering the Basics tutorial.
LESSON 5¶
Science is constantly evolving, with new disciplines emerging from interdisciplinary research, technological innovation, and global collaboration. Analyzing research networks can help researchers identify potential collaborators, track emerging fields, and discover new research directions.
One effective way to explore these networks is by examining author affiliations listed in journal publications.
What is EFetch?¶
EFetch is a utility provided by NCBI’s Entrez system that retrieves detailed records for a list of unique identifiers (like PMIDs) from databases such as PubMed.
Where are author affiliations?¶
In PubMed records, author affiliations are embedded in the XML under:
<Author>
<AffiliationInfo>
<Affiliation>...</Affiliation>
</AffiliationInfo>
</Author>
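The nesting above can be explored without any network calls. Here is a minimal sketch that pulls `<Affiliation>` text out of a PubMed-style XML fragment; the lesson itself uses BeautifulSoup, but this sketch uses only Python's built-in xml.etree.ElementTree so you can try it without extra installs, and the author name and affiliation shown are made-up sample values.

```python
# Extract <Affiliation> text from a PubMed-style XML fragment
# using only the standard library.
import xml.etree.ElementTree as ET

xml_fragment = """
<AuthorList>
  <Author>
    <LastName>Doe</LastName>
    <ForeName>Jane</ForeName>
    <AffiliationInfo>
      <Affiliation>The Ohio State University, Columbus, OH, USA.</Affiliation>
    </AffiliationInfo>
  </Author>
</AuthorList>
"""

root = ET.fromstring(xml_fragment)
# iter() walks the whole tree, so this finds every <Affiliation>,
# no matter how deeply it is nested under <Author>.
affiliations = [a.text for a in root.iter("Affiliation")]
print(affiliations)
```

The same traversal idea carries over to BeautifulSoup's `find_all("Affiliation")`, which the solution code below relies on.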
Step 1. Construct an EFetch request¶
To retrieve XML data for a PubMed article, use the following components:
- Base URL:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
- Parameters:
  - Database name: ?db=pubmed
  - Unique identifier: &id=39773557
  - API key: &api_key=INSERT YOUR API KEY HERE
Required parameters for an EFetch request depend on the specific Entrez database you are querying. For PubMed, the default EFetch response format is XML.
To manage request volume, the NCBI enforces rate limits:
- Without an API key: 3 requests per second
- With an API key: up to 10 requests per second
While you can view a single XML record without an API key, completing the exercises in this tutorial requires one. You can obtain an API key by visiting the Settings page of your NCBI account.
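Putting the base URL and parameters together can be sketched as follows; the api_key value here is a placeholder for the key from your own NCBI account.

```python
# Assemble an EFetch request URL from its components.
from urllib.parse import urlencode

base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
params = {
    "db": "pubmed",        # the Entrez database to query
    "id": "39773557",      # a PMID from our list
    "api_key": "YOUR_API_KEY",  # placeholder -- substitute your own key
}
request_url = base_url + "?" + urlencode(params)
print(request_url)
```

In practice you can skip the manual string assembly: `requests.get(base_url, params=params)` builds the same query string for you.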
Step 2. Identify Python libraries for the project¶
The following Python libraries are needed for this project:
- requests – to make HTTP requests
- pandas – to manage and store data
- BeautifulSoup – to parse XML and extract affiliations
Solution
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
from datetime import date

# 1. Create a last_updated variable with today's date.
today = date.today()
last_updated = today

# 2. Create list of PMIDs.
pmids = ['39773557', '39656677', '37398045', '39229161', '39713331',
         '39315813', '38338688', '36721057', '37322069']

# 3. Create a DataFrame to store the search results.
author_affiliations = pd.DataFrame(columns=["pmid", "name", "affiliation", "last_updated"])

# 4. Use requests, BeautifulSoup, and the EFetch utility to retrieve author
#    affiliations. Store results in the DataFrame.
count = 0
for each_record in pmids:
    count += 1
    print('starting record ' + str(count) + ': ' + str(each_record))
    search_url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
                  "?db=pubmed&id=" + str(each_record) + "&api_key=INSERT YOUR API KEY HERE")
    xml_data = requests.get(search_url).text
    soup = bs(xml_data, 'xml')
    records = soup.find('PubmedArticle')
    pmid = records.PMID.text
    authors = records.find_all("Author")
    for each_author in authors:
        if each_author.LastName is not None:
            lastname = each_author.LastName.text
        else:
            lastname = ''
        if each_author.ForeName is not None:
            forename = each_author.ForeName.text
        else:
            forename = ''
        if lastname != '' and forename != '':
            name = lastname + ', ' + forename
        else:
            name = ''
        if each_author.Affiliation is not None:
            affiliation = each_author.Affiliation.text
        else:
            affiliation = ''
        print(f"{name}, {affiliation}")
        row = {
            "pmid": pmid,
            "name": name,
            "affiliation": affiliation,
            "last_updated": last_updated
        }
        author_info = pd.DataFrame(row, index=[0])
        author_affiliations = pd.concat([author_info, author_affiliations], axis=0)
    time.sleep(0.15)

# 5. Export results to CSV.
author_affiliations.to_csv('data/pubmed_author_affiliations.csv')
Regular Expressions (regex)¶
Analyzing author and affiliation data can be messy due to:
- Inconsistent naming conventions
- Variations in institutional affiliation formats
- Ambiguities in author identity
Regular expressions (regex) match patterns in text. Often described as wildcards on steroids, regular expressions help you:
- Validate patterns (e.g., ZIP codes: ^\d{5}$)
- Extract variations (e.g., “ha?ematology” matches both “hematology” and “haematology”)
- Replace text (e.g., re.sub(r'\bOH\b', 'Ohio', text))
Regular expressions are supported in many programming languages and software programs, including Python, JavaScript, and Tableau.
Common Regex Patterns¶
| Pattern | Matches |
| --- | --- |
| [A-Z] | Any uppercase letter |
| [a-z] | Any lowercase letter |
| [0-9]{5} | Exactly 5 digits |
| ^Ohio | Starts with “Ohio” |
| State$ | Ends with “State” |
| ha?ematology | “hematology” or “haematology” |
| Ohio State\|OSU | “Ohio State” or “OSU” |
Metacharacters are special symbols in regular expressions that represent patterns rather than literal characters. To match them as literal characters, you must escape them with a backslash ( \ ).
Common metacharacters and their functions¶
| Symbol | Meaning | Example |
| --- | --- | --- |
| [ ] | A set or range of characters | [a-f] matches any lowercase letter from a to f |
| \ | Starts a special sequence | \w matches any word character (letter, digit, or underscore) |
| . | Any character except newline | d.g matches “dog”, “dig”, “dug”, etc. |
| ^ | Start of a string | ^Ohio matches any string that starts with “Ohio” |
| $ | End of a string | State$ matches any string that ends with “State” |
| * | Zero or more of the preceding character | ha*ematology matches “hematology”, “haematology”, etc. |
| + | One or more of the preceding character | spe+d matches “sped”, “speed”, etc. |
| ? | Zero or one of the preceding character | travel?ling matches “traveling”, “travelling”, etc. |
| { } | Exactly a specified number of repetitions | [0-9]{5} matches any 5-digit number |
| ( ) | Grouping or capturing | The (Ohio) State University extracts “Ohio” |
| \| | Logical OR | Ohio State\|OSU matches “Ohio State” or “OSU” |
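The metacharacters above can be exercised directly with Python's re module; this short sketch uses a few throwaway sample strings to confirm each pattern behaves as described.

```python
# A few of the metacharacters above, exercised with Python's re module.
import re

assert re.search(r"^Ohio", "Ohio State")             # anchor: starts with "Ohio"
assert re.search(r"State$", "Ohio State")            # anchor: ends with "State"
assert re.search(r"ha?ematology", "haematology")     # ? makes the "a" optional
assert re.search(r"[0-9]{5}", "Columbus, OH 43210")  # exactly five digits
assert re.search(r"\.", "v1.0")                      # escaped dot: a literal "."
assert not re.search(r"\.", "v10")                   # ...and nothing else

# Grouping with ( ) captures a substring you can pull back out.
m = re.search(r"The (Ohio) State University", "The Ohio State University")
print(m.group(1))
```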
LEARNING RESOURCES¶
regex101

regex101 is an interactive tool for building, testing, and debugging regular expressions across multiple programming languages. It lets you test your regex against sample text, provides real-time explanations as you type, and includes a searchable reference for regex syntax.
Learning Regular Expressions

Learning Regular Expressions by Ben Forta is available through the Libraries’ O’Reilly Online Learning collection of technical books and videos. Each chapter is structured as a lesson, guiding you through how to match individual characters or sets of characters, use metacharacters, and more—making it a practical resource for mastering regex step by step.
re module¶
To work with regular expressions in Python, start by importing the built-in re module:
import re
🔗 See re module documentation.
Commonly used re methods¶
re.match( )¶
re.match(pattern, string, flags=0)
- Checks for a match only at the beginning of the string.
- Returns a match object if found, otherwise None.
Example:
import pandas as pd
import re

addresses = pd.read_csv('data/pubmed_author_affiliations.csv')
addresses = addresses.dropna(subset='affiliation')  # drop rows with null affiliation values
for idx, row in addresses.iloc[0:10].iterrows():
    affiliation = str(row.affiliation)
    print(affiliation)
    osu_match = re.match(r"Ohio State University", affiliation)
    if osu_match:
        print(f"  MATCH {osu_match.group()}: {affiliation}")
The Ohio State Biochemistry Program, The Ohio State University, Columbus, OH, 43210, USA. fu.978@osu.edu.
The Ohio State Biochemistry Program, The Ohio State University, Columbus, OH, 43210, USA.
Center for Cancer Metabolism, The Ohio State University Comprehensive Cancer Center, Columbus, OH, 43210, USA.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, 43210, USA.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, 43210, USA.
The Ohio State Biochemistry Program, The Ohio State University, Columbus, OH, 43210, USA.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, USA. fu.978@osu.edu.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, USA.
Department of Physics, Northeastern University, Boston, MA 02115, USA.
Department of Chemistry and Biochemistry, Center for RNA Biology, Ohio State University, Columbus, OH 43210, USA.
Notice that no MATCH lines appear: re.match() only checks the beginning of each string, and every affiliation above starts with text other than “Ohio State University”.
🔗 See re.match() documentation.
re.search( )¶
re.search(pattern, string, flags=0)
- Scans through the string and returns the first match of the pattern.
- Returns a match object or None.
Example:
import pandas as pd
import re

addresses = pd.read_csv('data/pubmed_author_affiliations.csv')
addresses = addresses.dropna(subset='affiliation')
for idx, row in addresses.iloc[0:10].iterrows():
    affiliation = str(row.affiliation)
    osu_search = re.search(r"The Ohio State University", affiliation)
    if osu_search:
        print(osu_search.group())
    else:
        print(f"No match: affiliation = {affiliation}")
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
No match: affiliation = Department of Physics, Northeastern University, Boston, MA 02115, USA.
No match: affiliation = Department of Chemistry and Biochemistry, Center for RNA Biology, Ohio State University, Columbus, OH 43210, USA.
🔗 See re.search() documentation.
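The contrast between the two methods above comes down to anchoring, which a two-line sketch on a sample affiliation string makes concrete:

```python
# re.match() only checks position 0; re.search() scans the whole string.
import re

affiliation = "The Ohio State University, Columbus, OH, USA."
print(re.match(r"Ohio State", affiliation))   # None: the string starts with "The"
print(re.search(r"Ohio State", affiliation))  # a match object: found mid-string
```

This is exactly why the re.match() example found no matches while re.search() succeeded on the same data.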
re.findall( )¶
re.findall(pattern, string, flags=0)
Returns all non-overlapping matches of the pattern in the string as a list.
Example:
# HOW MANY TORTOISES AND TURTLES ARE IN THIS LIST OF ANIMALS?
import pandas as pd
import re

df = pd.read_csv('data/animals_tortoises.csv')
animals = df.common_name.tolist()
animals = ','.join(animals)
print('LIST OF ANIMALS')
print(animals)
tortoises_turtles = re.findall(r"tortoise|turtle", animals)
print('ANSWER')
print(f"There are {len(tortoises_turtles)} tortoises and turtles in this list of animals.")
LIST OF ANIMALS
abyssinian-ground-hornbill,addax,african-clawed-frog,african-pancake-tortoise,african-plated-lizard,aldabr-tortoise,allens-swamp-monkey,alligator-lizard,alligator-snapping-turtle,alpaca
ANSWER
There are 3 tortoises and turtles in this list of animals.
🔗 See re.findall() documentation.
re.sub( )¶
re.sub(pattern, repl, string, count=0, flags=0)
- Replaces all occurrences of the pattern in the string with the replacement text (repl).
- You can limit the number of replacements using the count parameter.
Examples:
# FIND TORTOISES AT THE NATIONAL ZOO AND REPLACE THE COMMON_NAME WITH "SLOW TORTOISE"
import pandas as pd
import re

df = pd.read_csv('data/animals_tortoises.csv')
animals = df.common_name.tolist()
animals = ','.join(animals)
pattern = "tortoise|turtle"
tortoises_slow = re.sub(pattern, "SLOW TORTOISE", animals)
tortoises_slow
'abyssinian-ground-hornbill,addax,african-clawed-frog,african-pancake-SLOW TORTOISE,african-plated-lizard,aldabr-SLOW TORTOISE,allens-swamp-monkey,alligator-lizard,alligator-snapping-SLOW TORTOISE,alpaca'
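The count parameter mentioned earlier caps how many replacements re.sub() performs; this sketch uses a small made-up animal list to show only the first match being replaced.

```python
# count=1 stops re.sub() after the first replacement.
import re

animals = "box-turtle,desert-tortoise,green-iguana"
first_only = re.sub(r"tortoise|turtle", "SLOW TORTOISE", animals, count=1)
print(first_only)
```

With the default count=0, every match would be replaced.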
🔗 See re.sub() documentation.
Solution
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
from datetime import date
import re

# 1. Create a last_updated variable with today's date.
today = date.today()
last_updated = today

# 2. Create list of PMIDs.
pmids = ['39773557', '39656677', '37398045', '39229161', '39713331',
         '39315813', '38338688', '36721057', '37322069']

# 3. Create a DataFrame to store the search results.
author_affiliations = pd.DataFrame(columns=["pmid", "name", "affiliation", "institution", "last_updated"])

# 4. Use requests, BeautifulSoup, and the EFetch utility to retrieve author
#    affiliations. Store results in the DataFrame.
count = 0
for each_record in pmids:
    count += 1
    print('starting record ' + str(count) + ': ' + str(each_record))
    search_url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
                  "?db=pubmed&id=" + str(each_record) + "&api_key=INSERT API KEY HERE")
    xml_data = requests.get(search_url).text
    soup = bs(xml_data, 'xml')
    records = soup.find('PubmedArticle')
    pmid = records.PMID.text
    authors = records.find_all("Author")
    for each_author in authors:
        if each_author.LastName is not None:
            lastname = each_author.LastName.text
        else:
            lastname = ''
        if each_author.ForeName is not None:
            forename = each_author.ForeName.text
        else:
            forename = ''
        if lastname != '' and forename != '':
            name = lastname + ', ' + forename
        else:
            name = ''
        if each_author.Affiliation is not None:
            affiliation = each_author.Affiliation.text
            # Use re.search to normalize each affiliation to an institution name.
            ohio_state = re.search(r"Ohio State", affiliation)
            harvard_medical_school = re.search(r"Harvard Medical School", affiliation)
            institut_genetique = re.search(r"Institut de Génétique et de Biologie Moléculaire et Cellulaire", affiliation)
            johns_hopkins = re.search(r"Johns Hopkins University", affiliation)
            mcgill = re.search(r"McGill University", affiliation)
            nci = re.search(r"National Cancer Institute", affiliation)
            nidcd = re.search(r"National Institute on Deafness and Other Communication Disorders", affiliation)
            northeastern = re.search(r"Northeastern University", affiliation)
            u_bristol = re.search(r"University of Bristol", affiliation)
            u_maryland = re.search(r"University of Maryland", affiliation)
            u_virginia = re.search(r"University of Virginia", affiliation)
            vicosa = re.search(r"Universidade Federal de Viçosa", affiliation)
            if ohio_state:
                institution = "The Ohio State University"
            elif harvard_medical_school:
                institution = harvard_medical_school.group()
            elif institut_genetique:
                institution = institut_genetique.group()
            elif johns_hopkins:
                institution = johns_hopkins.group()
            elif mcgill:
                institution = mcgill.group()
            elif nci:
                institution = nci.group()
            elif nidcd:
                institution = nidcd.group()
            elif northeastern:
                institution = northeastern.group()
            elif u_bristol:
                institution = u_bristol.group()
            elif u_maryland:
                institution = u_maryland.group()
            elif u_virginia:
                institution = u_virginia.group()
            elif vicosa:
                institution = vicosa.group()
            else:
                institution = ''
        else:
            affiliation = ''
            institution = ''
        print(f"{name}, {affiliation}")
        row = {
            "pmid": pmid,
            "name": name,
            "affiliation": affiliation,
            "institution": institution,
            "last_updated": last_updated
        }
        author_info = pd.DataFrame(row, index=[0])
        author_affiliations = pd.concat([author_info, author_affiliations], axis=0)
    time.sleep(0.15)

# 5. Export results to csv.
author_affiliations.to_csv('pubmed_author_affiliations.csv')