# Automatically updating my publications page with ORCID and doi.org
For a while I’ve had a hand-crafted `.bibtex` file stored locally for my `publications/` page.
However, manually updating a local text file is a pain to remember, especially since there are many services out there that automatically track new publications.
> **Update!** A helpful suggestion on Twitter allowed me to include the full citation information, including lists of authors, using the doi.org API!
Here’s the workflow I’d prefer:

1. Treat one online service provider as a Single Source of Truth for my publications list.
2. Use an API to programmatically ask this provider for the latest data about my publications.
3. Reshape that data into a form that I can insert into my website.
Here’s how I accomplished this:
## Use ORCID to grab a list of DOIs for my publications
ORCID is a service for identifying scholars and their contributions. It links various kinds of publications and activities to a unique account for each person. It doesn’t cover all kinds of outputs (like talks, posters, etc.), but it seems to cover the most important ones.
👉 Here is my ORCID ID and page: https://orcid.org/0000-0002-2391-0678.

ORCID has a public-facing API that allows you to automatically query information about an ORCID ID. To learn how to use it, I followed an example notebook shared by the TAPIR project.
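To make the ORCID step concrete, here’s a minimal sketch of pulling DOIs out of the works data. The `sample` dictionary below is a hand-trimmed stand-in for the nested structure the v3.0 API returns (the real response contains many more fields):

```python
# A trimmed-down stand-in for the ORCID v3.0 works response shape.
# The real response from https://pub.orcid.org/v3.0/<orcid-id>/works
# nests the same way but with many additional fields.
sample = {
    "group": [
        {
            "work-summary": [
                {
                    "external-ids": {
                        "external-id": [
                            {
                                "external-id-type": "doi",
                                "external-id-value": "10.1371/journal.pcbi.1009651",
                            }
                        ]
                    }
                }
            ]
        }
    ]
}


def extract_dois(works):
    """Pull the first DOI out of each work group."""
    dois = []
    for group in works["group"]:
        summary = group["work-summary"][0]
        for ext_id in summary["external-ids"]["external-id"]:
            if ext_id["external-id-type"] == "doi":
                dois.append(ext_id["external-id-value"])
                break
    return dois


print(extract_dois(sample))  # → ['10.1371/journal.pcbi.1009651']
```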
## Use the doi.org API to grab citation information
While the ORCID API has a lot of useful information in it, there were some important pieces missing, like co-author information. Fortunately, I learned from a suggestion on Twitter that doi.org is accessible via an API call!
You can ask doi.org for the reference, a bibtex file, or a JSON structure of reference data by adding an `Accept` header to a doi.org URL, like so:

```bash
curl -L -H "Accept: text/x-bibliography; style=apa" -H "User-Agent: mailto:youremail@email.com" https://dx.doi.org/10.1371/journal.pcbi.1009651
```
This returns a fully-resolved reference like so:

> DuPre, E., Holdgraf, C., Karakuzu, A., Tetrel, L., Bellec, P., Stikov, N., & Poline, J.-B. (2022). Beyond advertising: New infrastructures for publishing integrated research objects. PLOS Computational Biology, 18(1), e1009651. https://doi.org/10.1371/journal.pcbi.1009651
> **Note:** The `-H "User-Agent: mailto:youremail@email.com"` header is a way to identify yourself to the doi.org API, which reduces the likelihood that you will have your access revoked.
In Python, the same call looks like this:

```python
import requests

doi = "10.1371/journal.pcbi.1009651"
url = f"https://dx.doi.org/{doi}"
header = {
    "Accept": "text/x-bibliography; style=apa",
    "User-Agent": "mailto:youremail@email.com",
}
r = requests.get(url, headers=header)
print(r.content)
```
You can also use other kinds of header configuration, like:

```python
# A citeproc-styled JSON structure
header = {"Accept": "application/citeproc+json"}
# A bibtex entry
header = {"Accept": "application/x-bibtex"}
```
The citeproc JSON structure has a ton of information in it, including information about all of the co-authors (and, as an extra bonus, a link to their ORCID pages!). This is all the information I needed for my website.
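As a small illustration of what you can do with that author data, here’s a sketch of turning a citeproc-style `author` list into markdown links. The names and the zeroed-out ORCID URL are placeholders, not real API output:

```python
# Illustrative citeproc-style author entries: "family", "given", and an
# optional "ORCID" field, mirroring doi.org's citeproc+json output.
authors = [
    {
        "family": "Lovelace",
        "given": "Ada",
        "ORCID": "https://orcid.org/0000-0000-0000-0000",  # placeholder URL
    },
    {"family": "Babbage", "given": "Charles"},
]


def format_authors(authors):
    """Render authors as 'Family, G.' markdown, linking ORCIDs when present."""
    parts = []
    for author in authors:
        name = f"{author['family']}, {author['given'][0]}."
        if "ORCID" in author:
            parts.append(f"[{name}]({author['ORCID']})")
        else:
            parts.append(name)
    return ", ".join(parts)


print(format_authors(authors))
# → [Lovelace, A.](https://orcid.org/0000-0000-0000-0000), Babbage, C.
```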
## A script to do this all at once
I wrote a little script that runs each time my Sphinx site is built.
It generates a markdown snippet that is then inserted into my `publications.md` page.

**Python snippet to download ORCID data**

Note that the below is a `jupytext` document, which is why there’s extra metadata at the top.
```python
# ---
# jupyter:
#   jupytext:
#     formats: py:light
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.5'
#       jupytext_version: 1.14.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# +
import pandas as pd
import requests
from IPython.display import Markdown, JSON
from pathlib import Path
from rich import progress

# My ORCID
orcid_id = "0000-0002-2391-0678"
ORCID_RECORD_API = "https://pub.orcid.org/v3.0/"

# Download all of my ORCID records
print("Retrieving ORCID entries from API...")
response = requests.get(
    url=requests.utils.requote_uri(ORCID_RECORD_API + orcid_id),
    headers={"Accept": "application/json"},
)
response.raise_for_status()
orcid_record = response.json()

# +
# Just to visualize in a notebook if need be
# JSON(orcid_record)

# +
###
# Resolve my DOIs from ORCID as references
# Shamelessly copied from:
# https://gist.github.com/brews/8d3b3ede15d120a86a6bd6fc43859c5e
import requests
import json


def fetchmeta(doi, fmt="reference", **kwargs):
    """Fetch metadata for a given DOI.

    Parameters
    ----------
    doi : str
    fmt : str, optional
        Desired metadata format. Can be 'dict', 'bibtex', or 'reference'.
        Default is 'reference'.
    **kwargs :
        Additional keyword arguments are passed to `json.loads()` if `fmt`
        is 'dict' and you're a big JSON nerd.

    Returns
    -------
    out : str or dict or None
        `None` is returned if the server gives an unhappy response. Usually
        this means the DOI is bad.

    Examples
    --------
    >>> fetchmeta('10.1016/j.dendro.2018.02.005')
    >>> fetchmeta('10.1016/j.dendro.2018.02.005', 'bibtex')

    References
    ----------
    https://www.doi.org/hb.html
    """
    # Parse args and setup the server response we want.
    accept_type = "application/"
    if fmt == "dict":
        accept_type += "citeproc+json"
    elif fmt == "bibtex":
        accept_type += "x-bibtex"
    elif fmt == "reference":
        accept_type = "text/x-bibliography; style=apa"
    else:
        raise ValueError(f"Unrecognized `fmt`: {fmt}")

    # Request data from server.
    url = "https://dx.doi.org/" + str(doi)
    header = {"accept": accept_type}
    r = requests.get(url, headers=header)

    # Format metadata if server response is good.
    out = None
    if r.status_code == 200:
        if fmt == "dict":
            out = json.loads(r.text, **kwargs)
        else:
            out = r.text
    return out
# -

# Extract metadata for each entry
df = []
for iwork in progress.track(
    orcid_record["activities-summary"]["works"]["group"], "Fetching reference data..."
):
    isummary = iwork["work-summary"][0]

    # Extract the DOI, skipping works that don't have one
    doi = None
    for ii in isummary["external-ids"]["external-id"]:
        if ii["external-id-type"] == "doi":
            doi = ii["external-id-value"]
            break
    if doi is None:
        continue
    meta = fetchmeta(doi, fmt="dict")
    doi_url = meta["URL"]
    title = meta["title"]
    references_count = meta["references-count"]
    year = meta["issued"]["date-parts"][0][0]
    url = meta["URL"]

    # Create authors list with links to their ORCIDs
    authors = meta["author"]
    autht = []
    for author in authors:
        name = f"{author['family']}, {author['given'][0]}."
        if "holdgraf" in author["family"].lower():
            name = f"**{name}**"
        if "ORCID" in author:
            autht.append(f"[{name}]({author['ORCID']})")
        else:
            autht.append(name)
    autht = ", ".join(autht)

    journal = meta["publisher"]
    url_doi = url.split("//", 1)[-1]
    reference = f"{autht} ({year}). **{title}**. {journal}. [{url_doi}]({url})"
    df.append({"year": year, "reference": reference})
df = pd.DataFrame(df)

# Convert into a markdown string
md = []
for year, items in df.groupby("year", sort=False):
    md.append(f"## {year}")
    for _, item in items.iterrows():
        md.append(item["reference"])
        md.append("")
    md.append("")
mds = "\n".join(md)

# +
# Uncomment to preview in a notebook
# Markdown(mds)
# -

# This will only work if this is run as a script
path_out = Path(__file__).parent.parent / "_static/publications.txt"
path_out.write_text(mds)
print(f"Finished updating ORCID entries at: {path_out}")
```
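As a sketch of the final step (assuming the site uses MyST Markdown, and using the `../_static/publications.txt` path from the script above), the generated snippet could be pulled into the `publications.md` page with an `include` directive:

````markdown
```{include} ../_static/publications.txt
```
````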