For a while I’ve had a hand-crafted .bibtex file stored locally for my publications/ page.
However, manually updating a local text file is a pain and easy to forget, especially since many services out there automatically track new publications.
Here’s the workflow I’d prefer:
- Treat one online service provider as a Single Source of Truth for my publications list.
- Use an API to programmatically ask this provider for the latest data about my publications.
- Reshape that data into a form that I can insert into my website.
Here’s how I accomplished this:
Use ORCID to grab a list of DOIs for my publications
ORCID is a service for identifying scholars and their contributions. It links various kinds of publications and activities to a unique account for each person. It doesn’t cover all kinds of outputs (like talks, posters, etc), but it seems to cover the most important ones.
👉 Here is my ORCID ID and page: https://orcid.org/0000-0002-2391-0678.
ORCID has a public-facing API that allows you to automatically query information about an ORCID ID. To use it, I followed an example notebook shared by the TAPIR project.
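As a rough sketch of what pulling DOIs out of an ORCID record could look like, the function below walks the nested JSON that the public API returns. The helper name `extract_dois` and the tiny `sample_record` are made up for illustration. The key names mirror the ones used in the full script later in this post, and a real record has many more fields.

```python
def extract_dois(record):
    """Return (year, doi) pairs for every work in an ORCID record that has a DOI."""
    pairs = []
    for group in record["activities-summary"]["works"]["group"]:
        summary = group["work-summary"][0]
        # The publication date can be missing, so guard against None values
        date = summary.get("publication-date") or {}
        year = (date.get("year") or {}).get("value")
        for ext_id in summary["external-ids"]["external-id"]:
            if ext_id["external-id-type"] == "doi":
                pairs.append((year, ext_id["external-id-value"]))
                break  # one DOI per work is enough
    return pairs


# Hypothetical, heavily trimmed record used only to exercise the function
sample_record = {
    "activities-summary": {
        "works": {
            "group": [
                {
                    "work-summary": [
                        {
                            "publication-date": {"year": {"value": "2022"}},
                            "external-ids": {
                                "external-id": [
                                    {
                                        "external-id-type": "doi",
                                        "external-id-value": "10.1371/journal.pcbi.1009651",
                                    }
                                ]
                            },
                        }
                    ]
                },
                {
                    # A work without a DOI (placeholder ISBN) is skipped
                    "work-summary": [
                        {
                            "publication-date": None,
                            "external-ids": {
                                "external-id": [
                                    {
                                        "external-id-type": "isbn",
                                        "external-id-value": "000-0-00-000000-0",
                                    }
                                ]
                            },
                        }
                    ]
                },
            ]
        }
    }
}

print(extract_dois(sample_record))
```

Keeping the parsing in a pure function like this makes it easy to test against a saved record without hitting the API on every run.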
Use the doi.org API to grab citation information
While the ORCID API has a lot of useful information in it, there were some important pieces missing, like co-author information. Fortunately, I learned from a suggestion on Twitter that doi.org is accessible via an API call!
You can ask doi.org for the reference, bibtex file, or a JSON structure of reference data by adding a header to a doi.org URL, like so:
curl -L -H "Accept:text/x-bibliography; style=apa" -H "User-Agent: mailto:youremail@email.com" https://dx.doi.org/10.1371/journal.pcbi.1009651
This returns a fully-resolved reference like so:
DuPre, E., Holdgraf, C., Karakuzu, A., Tetrel, L., Bellec, P., Stikov, N., & Poline, J.-B. (2022). Beyond advertising: New infrastructures for publishing integrated research objects. PLOS Computational Biology, 18(1), e1009651. https://doi.org/10.1371/journal.pcbi.1009651
In Python, the same call looks like this:
import requests

doi = "10.1371/journal.pcbi.1009651"
url = f"https://dx.doi.org/{doi}"
header = {"accept": "text/x-bibliography; style=apa", "User-Agent": "mailto:youremail@email.com"}
r = requests.get(url, headers=header)
# r.text is the decoded string; r.content would print raw bytes
print(r.text)
You can also use other kinds of header configuration, like:
# A citeproc-styled JSON structure
header = {'accept': "application/citeproc+json"}
# A bibtex entry
header = {'accept': "application/x-bibtex"}
The citeproc JSON structure has a ton of information in it, including information about all of the co-authors (and an extra bonus - a link to their ORCID pages!). This is all the information I needed for my website.
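To give a feel for how the co-author data might be used, here is a small sketch that turns the author list of a citeproc-style record into a markdown string with ORCID links. The function name `format_authors` and the `sample` record are invented for illustration, and the `author`, `given`, `family`, and `ORCID` keys are an assumption based on what these responses typically contain, so check them against a real response before relying on them.

```python
def format_authors(csl):
    """Build a markdown author list, linking names to ORCID pages when present."""
    names = []
    for author in csl.get("author", []):
        # Assumed keys: "given", "family", and an optional "ORCID" URL
        name = f'{author.get("given", "")} {author.get("family", "")}'.strip()
        orcid = author.get("ORCID")
        names.append(f"[{name}]({orcid})" if orcid else name)
    return ", ".join(names)


# Hypothetical record with placeholder people and a placeholder ORCID ID
sample = {
    "title": "An example title",
    "author": [
        {
            "given": "Ada",
            "family": "Example",
            "ORCID": "https://orcid.org/0000-0000-0000-0000",
        },
        {"given": "Bo", "family": "Sample"},
    ],
}

print(format_authors(sample))
# [Ada Example](https://orcid.org/0000-0000-0000-0000), Bo Sample
```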
A script to do this all at once
I wrote a little script that runs each time my Sphinx site is built.
It generates a markdown snippet that is then inserted into my publications.md page.
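To make the shape of that snippet concrete, here is a stripped-down sketch of the table-building step using only the standard library. The function name `publications_table` and the placeholder DOIs are invented for illustration. The `{cite}` role matches what the full script emits, while the `|---|---|` delimiter row and the newest-first sorting are choices made for this sketch.

```python
from collections import defaultdict


def publications_table(entries):
    """Render (year, doi) pairs as a two-column markdown table, newest year first."""
    by_year = defaultdict(list)
    for year, doi in entries:
        by_year[year].append(f"{{cite}}`{doi}`")

    lines = ["|Year|Publications|", "|---|---|"]
    # Years are four-digit strings, so a reverse string sort puts the newest first
    for year in sorted(by_year, reverse=True):
        lines.append(f"|{year}|{', '.join(by_year[year])}|")
    return "\n".join(lines)


# Placeholder DOIs, not real publications
entries = [
    ("2021", "10.0000/example.c"),
    ("2022", "10.0000/example.a"),
    ("2022", "10.0000/example.b"),
]
table = publications_table(entries)
print(table)
```

Each year becomes one row, with all of that year's citations joined by commas in the second column.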
Python snippet to download ORCID data
Note that the below is a jupytext document, which is why there’s extra metadata at the top.
# ---
# jupyter:
#   jupytext:
#     formats: py:light
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.5'
#       jupytext_version: 1.14.1
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---
# +
import pandas as pd
import requests
from IPython.display import Markdown, JSON
from pathlib import Path
from rich import progress
# My ORCID
orcid_id = "0000-0002-2391-0678"
ORCID_RECORD_API = "https://pub.orcid.org/v3.0/"
# Download all of my ORCID records
print("Retrieving ORCID entries from API...")
response = requests.get(
    url=requests.utils.requote_uri(ORCID_RECORD_API + orcid_id),
    headers={"Accept": "application/json"},
)
response.raise_for_status()
orcid_record = response.json()
# +
# Just to visualize in a notebook if need be
# JSON(orcid_record)
# +
###
# Resolve my DOIs from ORCID as references
# Shamelessly copied from:
# https://gist.github.com/brews/8d3b3ede15d120a86a6bd6fc43859c5e
import requests
import json
def fetchmeta(doi, fmt="reference", **kwargs):
    """Fetch metadata for a given DOI.

    Parameters
    ----------
    doi : str
    fmt : str, optional
        Desired metadata format. Can be 'reference', 'dict' or 'bibtex'.
        Default is 'reference'.
    **kwargs :
        Additional keyword arguments are passed to `json.loads()` if `fmt`
        is 'dict' and you're a big JSON nerd.

    Returns
    -------
    out : str or dict or None
        `None` is returned if the server gives unhappy response. Usually
        this means the DOI is bad.

    Examples
    --------
    >>> fetchmeta('10.1016/j.dendro.2018.02.005')
    >>> fetchmeta('10.1016/j.dendro.2018.02.005', 'bibtex')

    References
    ----------
    https://www.doi.org/hb.html
    """
    # Parse args and setup the server response we want.
    accept_type = "application/"
    if fmt == "dict":
        accept_type += "citeproc+json"
    elif fmt == "bibtex":
        accept_type += "x-bibtex"
    elif fmt == "reference":
        accept_type = "text/x-bibliography; style=apa"
    else:
        raise ValueError(f"Unrecognized `fmt`: {fmt}")

    # Request data from server.
    url = "https://dx.doi.org/" + str(doi)
    header = {"accept": accept_type}
    r = requests.get(url, headers=header)

    # Format metadata if server response is good.
    out = None
    if r.status_code == 200:
        if fmt == "dict":
            out = json.loads(r.text, **kwargs)
        else:
            out = r.text
    return out
# -
# Extract metadata for each entry
df = []
for iwork in progress.track(
    orcid_record["activities-summary"]["works"]["group"], "Fetching reference data..."
):
    isummary = iwork["work-summary"][0]

    # Extract the DOI, skipping works that do not have one
    for ii in isummary["external-ids"]["external-id"]:
        if ii["external-id-type"] == "doi":
            year = isummary["publication-date"]["year"]["value"]
            doi = ii["external-id-value"]
            break
    else:
        # No DOI found: do not re-append the previous work's year and DOI
        continue

    df.append({"year": year, "doi": doi})
df = pd.DataFrame(df)
# Convert into a markdown string
md = ["|Year|Publications|", "|---|---|"]
for year, items in df.groupby("year", sort=False):
    this_pubs = []
    for _, item in items.iterrows():
        this_pubs.append(f'{{cite}}`{item["doi"]}`')
    md.append(f"|{year}|{', '.join(this_pubs)}|")
mds = "\n".join(md)
# +
# Uncomment to preview in a notebook
# Markdown(mds)
# -
# This will only work if this is run as a script
path_out = Path(__file__).parent.parent / "_static/publications.txt"
path_out.write_text(mds)
print(f"Finished updating ORCID entries at: {path_out}")