Automatically updating my publications page with ORCID and doi.org

For a while I’ve had a hand-crafted .bibtex file stored locally for my publications/ page. However, manually updating a local text file is a pain and easy to forget, especially since there are many services out there that automatically track new publications.

Update!

A helpful suggestion on Twitter allowed me to include the full citation information, including lists of authors, using the doi.org API!

Here’s the workflow I’d prefer:

  • Treat one online service provider as a Single Source of Truth for my publications list.

  • Use an API to programmatically ask this provider for the latest data about my publications.

  • Reshape that data into a form that I can insert into my website.

Here’s how I accomplished this:

Use ORCID to grab a list of DOIs for my publications

ORCID is a service for identifying scholars and their contributions. It links various kinds of publications and activities to a unique account for each person. It doesn’t cover all kinds of outputs (like talks, posters, etc.), but it seems to cover the most important ones.

👉 Here is my ORCID ID and page: https://orcid.org/0000-0002-2391-0678.

ORCID has a public-facing API that allows you to automatically query information about an ORCID ID. I queried it by following an example notebook shared by the TAPIR project.
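The part of the record that matters here is the list of works, each of which carries a list of external identifiers. Below is a minimal sketch of pulling the DOIs out of a record shaped like the API's JSON response. The `extract_dois` helper and the `sample_record` data are hypothetical and made up for illustration; the key names match the ones used in the script at the end of this post.

```python
def extract_dois(record):
    """Return at most one DOI per work group in an ORCID record-shaped dict."""
    dois = []
    for group in record["activities-summary"]["works"]["group"]:
        # Each group may hold several summaries of the same work; use the first
        summary = group["work-summary"][0]
        external_ids = (summary.get("external-ids") or {}).get("external-id") or []
        for ext_id in external_ids:
            if ext_id["external-id-type"] == "doi":
                dois.append(ext_id["external-id-value"])
                break  # one DOI per work is enough
    return dois


# Hypothetical, heavily trimmed record for illustration only
sample_record = {
    "activities-summary": {
        "works": {
            "group": [
                {"work-summary": [{"external-ids": {"external-id": [
                    {"external-id-type": "doi",
                     "external-id-value": "10.1371/journal.pcbi.1009651"},
                ]}}]},
                # A work with no DOI (for example, one identified only by a URL)
                {"work-summary": [{"external-ids": {"external-id": [
                    {"external-id-type": "uri",
                     "external-id-value": "https://example.org/talk"},
                ]}}]},
            ]
        }
    }
}

print(extract_dois(sample_record))
```

Works without a DOI are simply skipped, so the resulting list can be shorter than the list of works.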

Use the doi.org API to grab citation information

While the ORCID API has a lot of useful information in it, there were some important pieces missing, like co-author information. Fortunately, I learned from a suggestion on Twitter that doi.org is accessible via an API call!

You can ask doi.org for a formatted reference, a BibTeX entry, or a JSON structure of reference data by adding an Accept header to a request for a doi.org URL, like so:

curl -L -H "Accept:text/x-bibliography; style=apa" -H "User-Agent: mailto:youremail@email.com" https://dx.doi.org/10.1371/journal.pcbi.1009651

This returns a fully-resolved reference like so:

DuPre, E., Holdgraf, C., Karakuzu, A., Tetrel, L., Bellec, P., Stikov, N., & Poline, J.-B. (2022). Beyond advertising: New infrastructures for publishing integrated research objects. PLOS Computational Biology, 18(1), e1009651. https://doi.org/10.1371/journal.pcbi.1009651

Note

The -H "User-Agent: mailto:youremail@email.com" is a way to identify yourself to the doi.org API, which reduces the likelihood that you will have your access revoked.

In Python, the same call looks like this:

import requests

doi = "10.1371/journal.pcbi.1009651"
url = f"https://dx.doi.org/{doi}"
header = {"accept": "text/x-bibliography; style=apa", "User-Agent": "mailto:youremail@email.com"}
r = requests.get(url, headers=header)
print(r.content)

You can also request other formats by changing the Accept header, like:

# A citeproc-styled JSON structure
header = {'accept': "application/citeproc+json"}
# A bibtex entry
header = {'accept': "application/x-bibtex"}
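Since each request differs only in its Accept header, it can help to build the headers in one place so the identifying User-Agent is never forgotten. The sketch below is one way to do that; `doi_headers` and `ACCEPT_TYPES` are hypothetical names, and the email address is a placeholder.

```python
# Accept values taken from the examples in this post
ACCEPT_TYPES = {
    "reference": "text/x-bibliography; style=apa",
    "dict": "application/citeproc+json",
    "bibtex": "application/x-bibtex",
}


def doi_headers(fmt, email):
    """Build request headers for a doi.org content-negotiation call."""
    if fmt not in ACCEPT_TYPES:
        raise ValueError(f"Unrecognized `fmt`: {fmt}")
    return {
        "Accept": ACCEPT_TYPES[fmt],
        # Identify yourself to the API, as described in the note above
        "User-Agent": f"mailto:{email}",
    }


headers = doi_headers("bibtex", "youremail@email.com")
print(headers)
```

The resulting dictionary can be passed as `headers=` to `requests.get`.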

The citeproc JSON structure has a ton of information in it, including information about all of the co-authors (and an extra bonus - a link to their ORCID pages!). This is all the information I needed for my website.
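To give a sense of what that looks like, here is a minimal sketch of turning the `author` list of a citeproc record into Markdown links. The `format_authors` helper and the sample entries are hypothetical, and the ORCID URL is a placeholder. It assumes each entry usually has `family` and `given` keys plus an optional `ORCID` key holding a URL, and that entries for organizations carry a `name` key instead (an assumption about the data, not something guaranteed by the API).

```python
def format_authors(authors):
    """Format citeproc author entries as 'Family, G.' with ORCID links."""
    formatted = []
    for author in authors:
        # Assumption: entries without a family name (such as consortia)
        # carry a `name` key instead
        family = author.get("family") or author.get("name", "Unknown")
        given = author.get("given", "")
        name = f"{family}, {given[0]}." if given else family
        if "ORCID" in author:
            name = f"[{name}]({author['ORCID']})"
        formatted.append(name)
    return ", ".join(formatted)


# Hypothetical author entries; the ORCID URL is a placeholder
sample_authors = [
    {"given": "Jane", "family": "Doe", "ORCID": "https://orcid.org/0000-0000-0000-0000"},
    {"given": "Sam", "family": "Roe"},
    {"name": "Example Consortium"},
]

print(format_authors(sample_authors))
```

Using `.get` rather than direct indexing keeps one incomplete author entry from aborting the whole build.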

A script to do this all at once

I wrote a little script that runs each time my Sphinx site is built. It generates a markdown snippet that is then inserted into my publications.md page.

Python snippet to download ORCID data

Note that the script below is a Jupytext document, which is why there’s extra metadata at the top.

# ---
# jupyter:
#   jupytext:
#     formats: py:light
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.5'
#       jupytext_version: 1.14.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# +
import pandas as pd
import requests
from IPython.display import Markdown, JSON
from pathlib import Path
from rich import progress

# My ORCID
orcid_id = "0000-0002-2391-0678"
ORCID_RECORD_API = "https://pub.orcid.org/v3.0/"

# Download all of my ORCID records
print("Retrieving ORCID entries from API...")
response = requests.get(
    url=requests.utils.requote_uri(ORCID_RECORD_API + orcid_id),
    headers={"Accept": "application/json"},
)
response.raise_for_status()
orcid_record = response.json()

# +
# Just to visualize in a notebook if need be
# JSON(orcid_record)

# +

###
# Resolve my DOIs from ORCID as references
# Shamelessly copied from:
# https://gist.github.com/brews/8d3b3ede15d120a86a6bd6fc43859c5e
import requests
import json


def fetchmeta(doi, fmt="reference", **kwargs):
    """Fetch metadata for a given DOI.

    Parameters
    ----------
    doi : str
    fmt : str, optional
        Desired metadata format. Can be 'reference', 'dict', or 'bibtex'.
        Default is 'reference'.
    **kwargs :
        Additional keyword arguments are passed to `json.loads()` if `fmt`
        is 'dict' and you're a big JSON nerd.

    Returns
    -------
    out : str or dict or None
        `None` is returned if the server gives unhappy response. Usually
        this means the DOI is bad.

    Examples
    --------
    >>> fetchmeta('10.1016/j.dendro.2018.02.005')
    >>> fetchmeta('10.1016/j.dendro.2018.02.005', 'bibtex')

    References
    ----------
    https://www.doi.org/hb.html
    """
    # Parse args and setup the server response we want.
    accept_type = "application/"
    if fmt == "dict":
        accept_type += "citeproc+json"
    elif fmt == "bibtex":
        accept_type += "x-bibtex"
    elif fmt == "reference":
        accept_type = "text/x-bibliography; style=apa"
    else:
        raise ValueError(f"Unrecognized `fmt`: {fmt}")

    # Request data from server.
    url = "https://dx.doi.org/" + str(doi)
    header = {"accept": accept_type}
    r = requests.get(url, headers=header)

    # Format metadata if server response is good.
    out = None
    if r.status_code == 200:
        if fmt == "dict":
            out = json.loads(r.text, **kwargs)
        else:
            out = r.text
    return out


# -

# Extract metadata for each entry
df = []
for iwork in progress.track(
    orcid_record["activities-summary"]["works"]["group"], "Fetching reference data..."
):
    isummary = iwork["work-summary"][0]

    # Extract the DOI; skip works that have none so that a DOI from a
    # previous iteration is never reused by accident
    doi = None
    for ii in isummary["external-ids"]["external-id"]:
        if ii["external-id-type"] == "doi":
            doi = ii["external-id-value"]
            break
    if doi is None:
        continue

    # fetchmeta returns None when the server response is not OK
    meta = fetchmeta(doi, fmt="dict")
    if meta is None:
        continue
    title = meta["title"]
    year = meta["issued"]["date-parts"][0][0]
    url = meta["URL"]

    # Create authors list with links to their ORCIDs
    authors = meta["author"]
    autht = []
    for author in authors:
        name = f"{author['family']}, {author['given'][0]}."
        if "holdgraf" in author["family"].lower():
            name = f"**{name}**"
        if "ORCID" in author:
            autht.append(f"[{name}]({author['ORCID']})")
        else:
            autht.append(name)
    autht = ", ".join(autht)

    journal = meta["publisher"]

    url_doi = url.split("//", 1)[-1]
    reference = f"{autht} ({year}). **{title}**. {journal}. [{url_doi}]({url})"
    df.append({"year": year, "reference": reference})
df = pd.DataFrame(df)

# Convert into a markdown string
md = []
for year, items in df.groupby("year", sort=False):
    md.append(f"## {year}")
    for _, item in items.iterrows():
        md.append(item["reference"])
        md.append("")
    md.append("")
mds = "\n".join(md)

# +
# Uncomment to preview in a notebook
# Markdown(mds)
# -

# This will only work if this is run as a script
path_out = Path(__file__).parent.parent / "_static/publications.txt"
path_out.write_text(mds)
print(f"Finished updating ORCID entries at: {path_out}")