Crossref is a nonprofit organization that manages a registry of Digital Object Identifiers (DOIs). Publishers collaborate with Crossref to assign a unique DOI to each journal article, book, conference paper, or dataset they publish. This DOI acts like a permanent web address, enabling seamless linking between references, citations, research outputs, funding information, and more.
The Crossref REST API offers free access to the nonprofit’s metadata. This tutorial introduces two useful tools: JSON, a simple data format that resembles Python dictionaries and is easy to read and use, and Python’s built-in logging module.
Data skills | concepts¶
- APIs
- logging
- JSON data
Learning objectives¶
- Interpret documentation and apply concepts to write functional code.
- Extract and work with JSON data using Python’s built-in tools.
- Use Python’s logging module to capture and report errors that interrupt code execution.
This tutorial is designed to support multi-session workshops hosted by The Ohio State University Libraries Research Commons. It assumes you already have a basic understanding of Python, including how to use a for loop to iterate through lists and dictionaries and extract data. To learn basic Python concepts, visit the Python - Mastering the Basics tutorial.
LESSON 7¶
Crossref¶
Crossref provides detailed documentation and a wide range of robust learning resources to help users effectively work with its REST API.
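As a rough sketch of how a DOI maps to web addresses: both the DOI resolver at https://doi.org/ and the Crossref works endpoint used later in this lesson take the DOI as the final part of the URL. The helper names `doi_link` and `works_url` and the DOI shown are placeholders invented for illustration; they do not come from the Crossref documentation.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"
CROSSREF_WORKS = "https://api.crossref.org/works/"

def doi_link(doi):
    """Build the resolver link for a DOI (hypothetical helper)."""
    # quote() percent-encodes characters that are not safe in a URL path
    return DOI_RESOLVER + quote(doi)

def works_url(doi):
    """Build the Crossref REST API request URL for a DOI (hypothetical helper)."""
    return CROSSREF_WORKS + quote(doi)

example_doi = "10.1000/xyz123"  # placeholder DOI, not a real record
print(doi_link(example_doi))
print(works_url(example_doi))
```

Encoding the DOI is a cautious choice here: many DOIs work without it, but some contain characters that would otherwise be misread as part of the URL syntax.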
JSON¶
Crossref queries return data in JSON format, which is easy to read and looks similar to Python dictionaries. You can work with JSON data by looping through its key-value pairs to access the information you need.
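To make the dictionary comparison concrete, here is a small sketch that parses a JSON string with Python's built-in json module and then walks through its key-value pairs. The text below is made up for illustration: it only imitates the shape of a trimmed Crossref response, and none of the values describe a real publication.

```python
import json

# Made-up JSON text shaped like a trimmed Crossref response (illustrative values only)
raw = '''
{
  "status": "ok",
  "message": {
    "publisher": "Example Press",
    "title": ["An Example Article"],
    "container-title": ["Journal of Examples"],
    "published": {"date-parts": [[2020, 5, 17]]},
    "reference-count": 12
  }
}
'''

record = json.loads(raw)   # JSON text becomes a Python dictionary
entry = record["message"]  # The bibliographic fields sit under "message"

# Loop through the key-value pairs to see what is available
for key, value in entry.items():
    print(f"{key}: {value}")

# Lists and nested dictionaries are indexed step by step
title = entry["title"][0]
year = entry["published"]["date-parts"][0][0]
print(title, year)
```

When you use the requests library, `response.json()` performs the same parsing step as `json.loads` on the body of the response.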
Solution
import requests
import pandas as pd

def lookup(target_doi):
    """Request the Crossref record for one DOI and return the parsed JSON."""
    base_url = 'https://api.crossref.org/works/'
    url = base_url + target_doi
    response = requests.get(url)
    response.raise_for_status()  # Raise an HTTPError for bad responses
    json_data = response.json()  # Parse JSON response
    return json_data

file = pd.read_csv('C:/Users/murphy.465/Documents/GitHub/data_visualization/data/dois.csv')
dois = file.doi.tolist()

columns = ['doi', 'publisher', 'article_title', 'journal_title', 'year', 'reference_count']
rows = []  # Collect one dictionary per DOI, then build the DataFrame once

for doi in dois:
    response = lookup(doi)
    entry = response['message']  # The record itself is stored under 'message'
    data = {}
    data['doi'] = doi
    data['publisher'] = entry['publisher']
    data['article_title'] = entry['title'][0]
    data['journal_title'] = entry['container-title'][0]
    data['year'] = entry['published']['date-parts'][0][0]
    data['reference_count'] = entry['reference-count']
    rows.append(data)

results = pd.DataFrame(rows, columns=columns)
Logging¶
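APIs sometimes return error codes that interrupt a program’s execution. Wrapping the request in a try/except block lets the program keep running, and Python’s logging module records what went wrong so you can review it later. The log can also help you identify issues with your own code.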
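A minimal sketch of the pattern, using a division error instead of an API call so it runs without a network connection. The function `safe_divide` and the log file name are invented for this illustration, and the log is written to the system's temporary folder only to avoid cluttering your working directory.

```python
import logging
import tempfile
from pathlib import Path

# Hypothetical log location chosen for this example
log_path = Path(tempfile.gettempdir()) / "example_errors.log"

logging.basicConfig(
    filename=str(log_path),
    level=logging.ERROR,  # Record ERROR and CRITICAL messages only
    format="%(asctime)s - %(levelname)s - %(message)s",
    force=True,           # Replace any logging setup already active (Python 3.8+)
)

def safe_divide(a, b):
    """Divide a by b; log the problem and return None if b is zero."""
    try:
        return a / b
    except ZeroDivisionError as err:
        logging.error(f"Could not divide {a} by {b}: {err}")
        return None

print(safe_divide(10, 2))  # 5.0
print(safe_divide(1, 0))   # None, and one line is added to the log file
print(f"Errors were written to {log_path}")
```

The try block holds the code that might fail, the except block decides what happens when it does, and `logging.error` leaves a timestamped note behind instead of stopping the program.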
Solution
import requests
import pandas as pd
import logging
import time

# Configure logging: write ERROR-level messages to a file, each with a timestamp
formatstring = "%(asctime)s - %(levelname)s - %(message)s"
datestring = "%m/%d/%Y %I:%M:%S %p"
logging.basicConfig(filename="cr_errors_find_dois.log", level=logging.ERROR, format=formatstring, datefmt=datestring)

# Define function to request url and log HTTP errors
def lookup(target_doi):
    try:
        base_url = 'https://api.crossref.org/works/'
        url = base_url + target_doi
        response = requests.get(url)
        response.raise_for_status()  # Raise an HTTPError for bad responses
        json_data = response.json()  # Parse JSON response
        return json_data
    except requests.exceptions.HTTPError as http_err:
        logging.error(f"HTTP Error = {http_err}")  # Log the HTTP error
        time.sleep(10)  # Pause before the next request
    except Exception as err:
        logging.error(f"Other error = {err}")  # Log any other errors
        time.sleep(10)
    return None  # Tell the caller that the lookup failed

file = pd.read_csv('C:/Users/murphy.465/Documents/GitHub/data_visualization/data/dois.csv')
dois = file.doi.tolist()

columns = ['doi', 'publisher', 'article_title', 'journal_title', 'year', 'reference_count']
rows = []  # Collect one dictionary per DOI, then build the DataFrame once

for doi in dois[0:2]:  # Test with the first two DOIs
    response = lookup(doi)
    if response is None:  # The error was logged; skip this DOI and continue
        continue
    entry = response['message']
    data = {}
    data['doi'] = doi
    data['publisher'] = entry['publisher']
    data['article_title'] = entry['title'][0]
    data['journal_title'] = entry['container-title'][0]
    data['year'] = entry['published']['date-parts'][0][0]
    data['reference_count'] = entry['reference-count']
    rows.append(data)

results = pd.DataFrame(rows, columns=columns)
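A record may not contain every field the loop above reads. If a key such as container-title is missing or holds an empty list, indexing it with [0] raises an error. One possible way to guard against this is sketched below; the helpers `first_or_none` and `published_year` are hypothetical, and the sample entry is made up rather than taken from Crossref.

```python
def first_or_none(entry, key):
    """Return the first item of the list stored under key, or None if missing or empty."""
    values = entry.get(key) or []
    return values[0] if values else None

def published_year(entry):
    """Return the year from entry['published']['date-parts'], or None if unavailable."""
    date_parts = entry.get("published", {}).get("date-parts", [])
    if date_parts and date_parts[0]:
        return date_parts[0][0]
    return None

# Made-up entry with an empty journal title and no publication date
sample_entry = {
    "publisher": "Example Press",
    "title": ["An Example Article"],
    "container-title": [],
}

print(first_or_none(sample_entry, "title"))            # An Example Article
print(first_or_none(sample_entry, "container-title"))  # None
print(published_year(sample_entry))                    # None
```

Returning None keeps the row in the results with an empty cell, so you can still see which DOIs had incomplete metadata.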