{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Jupyter Notebooks to markdown and HTML with Pandoc\n", "\n", "For several months now, the universal [document converter pandoc](https://pandoc.org/) has\n", "had [support for Jupyter Notebooks](https://pandoc.org/MANUAL.html#creating-jupyter-notebooks-with-pandoc). This means that with a single call,\n", "you can convert `.ipynb` files to any of the output formats that Pandoc\n", "supports (and vice versa!). This post is a quick exploration of what this\n", "looks like." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note that for this post, we're using Pandoc version 2.7.3**. Also, some of what's below is hard\n", "to interpret without actually opening the files that Pandoc creates. For the sake\n", "of this blog post, I'm going to stick with the raw text output here, which you can expand\n", "if you wish. If you'd like to try this yourself, I recommend copy/pasting some of these commands and running them on your own." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "hide_input" ] }, "outputs": [], "source": [ "from subprocess import run as sbrun\n", "from subprocess import PIPE, CalledProcessError\n", "from pathlib import Path\n", "from IPython.display import HTML, Markdown\n", "\n", "# A helper function to capture errors and outputs\n", "def run(cmd, *args, **kwargs):\n", "    try:\n", "        # Split the command on whitespace and capture both output streams\n", "        out = sbrun(cmd.split(), *args, stderr=PIPE, stdout=PIPE, check=True, **kwargs)\n", "        out = out.stdout.decode()\n", "        if len(out) > 1:\n", "            print(out)\n", "    except CalledProcessError as e:\n", "        # The command exited with a non-zero status, so show its error message\n", "        print(e.stderr.decode())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Our base notebook\n", "\n", "First off, let's take a look at our base notebook. We'll convert this document\n", "to both Markdown and HTML using Pandoc.\n", "\n", "The notebook will be fairly minimal\n", "in order to make it easier to inspect its contents. 
It has a collection\n", "of markdown cells with mixed content, as well as code cells with various outputs.\n", "\n", "{download}`See this link ` for the notebook we'll use." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# `.ipynb` to markdown\n", "\n", "Let's try converting this notebook to markdown. This should preserve as much\n", "information as possible about the input Jupyter notebook, including\n", "all markdown cells, cell metadata, and code cell outputs.\n", "\n", "## A few pandoc options\n", "\n", "Here are a few pandoc options that are relevant to our use-case:\n", "\n", "* `--resource-path` defines the path where Pandoc will look for resources that are linked in the notebook.\n", "  This allows us to discover images, etc. that are in a different folder from where we are invoking `pandoc`.\n", "* `--extract-media` is a path where images and other media will be *extracted* at conversion time. Any links\n", "  to images, etc. should point to files at this path in the output format.\n", "* `-s` (or `--standalone`) tells Pandoc that the output should be a \"standalone\" format. This does different\n", "  things depending on the output, such as adding a header if converting to HTML.\n", "* `-o` sets the output file, and implicitly the output file type (e.g., markdown).\n", "* `-t` sets the *type* of output file if we want to override the default (e.g., GitHub-flavored markdown vs. Pandoc markdown)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting to GitHub-flavored markdown\n", "\n", "Let's start by converting to GitHub-flavored markdown. By not specifying an output file\n", "with `-o`, we'll cause Pandoc to print the result to the screen, which we'll display here." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [ "hide_output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
\n", "\n", "# Here's a demo notebook\n", "\n", "This is a demo notebook to play around with the pandoc ipynb support\n", "\n", "## Markdown\n", "\n", "As it is markdown, you can embed images, HTML, etc into your posts\\!\n", "\n", "![](outputs/images/ca17e56d65946db885db7f8f50a9605a6a94e6a7.jpg)\n", "\n", "Here's one \\(inline_{math}\\) and\n", "\n", "\\[\n", "math^{blocks}\n", "\\]\n", "\n", "``` python\n", "def my_functino():\n", " mystring = \"you can also include python cells\"\n", " return mystring\n", "```\n", "\n", "
\n", "\n", "
\n", "\n", "# Code cells\n", "\n", "## Matplotlib output with metadata\n", "\n", "The below code cell has some metadata attached to it. It also outputs a\n", "figure. Both should be included in the output format.\n", "\n", "
\n", "\n", "
\n", "\n", "``` python\n", "from matplotlib import rcParams, cycler\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "plt.ion()\n", "\n", "data = np.random.rand(2, 1000) * 100\n", "fig, ax = plt.subplots()\n", "ax.scatter(*data, s=data[1], c=data[0])\n", "```\n", "\n", "
\n", "\n", " \n", "\n", "
\n", "\n", "
\n", "\n", "![](outputs/images/e843a737607d119ec5b2750a2bb737c915f1b6e8.png)\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "\n", "## DataFrames\n", "\n", "
\n", "\n", "
\n", "\n", "``` python\n", "import pandas as pd\n", "pd.DataFrame([['hi', 'there'], ['this', 'is'], ['a', 'DataFrame']], columns=['Word A', 'Word B'])\n", "```\n", "\n", "
\n", "\n", "``` \n", " Word A Word B\n", "0 hi there\n", "1 this is\n", "2 a DataFrame\n", "```\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "\n", "# Bibliography\n", "\n", "Let's test the bibliography here\n", "\n", "Testing this \\[bibliography @holdgraf\\_rapid\\_2016\\]\n", "\n", "@holdgraf\\_evidence\\_2014\n", "\n", "
\n", "\n", "
\n", "\n", "### The actual bibliography\n", "\n", "The bibliography will be placed at the end of the file\n", "\n", "
\n", "\n" ] } ], "source": [ "# ipynb -> gfmd\n", "run(f'pandoc pandoc_ipynb/inputs/notebooks.ipynb --resource-path=inputs -s --extract-media=outputs/images -t gfm')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that cells are divided by hard-coded `<div>`s, and cell-level metadata (such as tags)\n", "is encoded within the HTML (e.g. `data-tags`). Also note that we haven't gotten the bibliography\n", "to render, probably because we didn't enable the `citeproc` processor on pandoc (we'll try that later).\n", "Finally, note that there's no notebook-level metadata in this output because GFM doesn't support\n", "a YAML header." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting to Pandoc-flavored markdown" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [ "hide_output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", " \n", " \n", " \n", " notebooks\n", " \n", " \n", " \n", "\n", "\n", "
\n", "

Here's a demo notebook

\n", "

This is a demo notebook to play around with the pandoc ipynb support

\n", "

Markdown

\n", "

As it is markdown, you can embed images, HTML, etc into your posts!

\n", "

\n", "

Here's one inlinemath and

\n", "


mathblocks

\n", "
def my_functino():\n",
      "    mystring = "you can also include python cells"\n",
      "    return mystring
\n", "
\n", "
\n", "

Code cells

\n", "

Matplotlib output with metadata

\n", "

The below code cell has some metadata attached to it. It also outputs a figure. Both should be included in the output format.

\n", "
\n", "
\n", "
from matplotlib import rcParams, cycler\n",
      "import matplotlib.pyplot as plt\n",
      "import numpy as np\n",
      "plt.ion()\n",
      "\n",
      "data = np.random.rand(2, 1000) * 100\n",
      "fig, ax = plt.subplots()\n",
      "ax.scatter(*data, s=data[1], c=data[0])
\n", "
\n", "
<matplotlib.collections.PathCollection at 0x7f6e8d6269e8>
\n", "
\n", "
\n", "

\n", "
\n", "
\n", "
\n", "

DataFrames

\n", "
\n", "
\n", "
import pandas as pd\n",
      "pd.DataFrame([['hi', 'there'], ['this', 'is'], ['a', 'DataFrame']], columns=['Word A', 'Word B'])
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Word AWord B
0hithere
1thisis
2aDataFrame
\n", "
\n", "
\n", "
\n", "
\n", "

Bibliography

\n", "

Let's test the bibliography here

\n", "

Testing this [bibliography @holdgraf_rapid_2016]

\n", "

@holdgraf_evidence_2014

\n", "
\n", "
\n", "

The actual bibliography

\n", "

The bibliography will be placed at the end of the file

\n", "
\n", "\n", "\n", "\n" ] } ], "source": [ "# ipynb -> pandoc md\n", "run(f'pandoc pandoc_ipynb/inputs/notebooks.ipynb --resource-path=inputs -s --extract-media=outputs/images -t markdown')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we've got something a little bit cleaner without all the hard-coded HTML. The `:::` fences\n", "are how Pandoc-flavored markdown denotes different divs, and cell-level metadata is encoded\n", "similarly to how it was in GFM.\n", "\n", "# `.ipynb` to HTML\n", "\n", "Next let's try converting `.ipynb` to HTML. This should let us view the notebook as a web page\n", "as well as include all of the extra metadata inside the HTML elements. We'll start with\n", "a vanilla HTML conversion. Note that we don't have to specify an output format here:\n", "when neither `-t` nor `-o` is given, Pandoc writes HTML to the screen by default:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [ "hide_output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", " \n", " \n", " \n", " notebooks\n", " \n", " \n", " \n", "\n", "\n", "
\n", "

Here's a demo notebook

\n", "

This is a demo notebook to play around with the pandoc ipynb support

\n", "

Markdown

\n", "

As it is markdown, you can embed images, HTML, etc into your posts!

\n", "

\n", "

Here's one inlinemath and

\n", "


mathblocks

\n", "
def my_functino():\n",
      "    mystring = "you can also include python cells"\n",
      "    return mystring
\n", "
\n", "
\n", "

Code cells

\n", "

Matplotlib output with metadata

\n", "

The below code cell has some metadata attached to it. It also outputs a figure. Both should be included in the output format.

\n", "
\n", "
\n", "
from matplotlib import rcParams, cycler\n",
      "import matplotlib.pyplot as plt\n",
      "import numpy as np\n",
      "plt.ion()\n",
      "\n",
      "data = np.random.rand(2, 1000) * 100\n",
      "fig, ax = plt.subplots()\n",
      "ax.scatter(*data, s=data[1], c=data[0])
\n", "
\n", "
<matplotlib.collections.PathCollection at 0x7f6e8d6269e8>
\n", "
\n", "
\n", "

\n", "
\n", "
\n", "
\n", "

DataFrames

\n", "
\n", "
\n", "
import pandas as pd\n",
      "pd.DataFrame([['hi', 'there'], ['this', 'is'], ['a', 'DataFrame']], columns=['Word A', 'Word B'])
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Word AWord B
0hithere
1thisis
2aDataFrame
\n", "
\n", "
\n", "
\n", "
\n", "

Bibliography

\n", "

Let's test the bibliography here

\n", "

Testing this [bibliography @holdgraf_rapid_2016]

\n", "

@holdgraf_evidence_2014

\n", "
\n", "
\n", "

The actual bibliography

\n", "

The bibliography will be placed at the end of the file

\n", "
\n", "\n", "\n", "\n" ] } ], "source": [ "# ipynb -> HTML\n", "run(f'pandoc pandoc_ipynb/inputs/notebooks.ipynb --resource-path=inputs -s --extract-media=outputs/images')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time our math rendered properly, along with everything else except for the\n", "bibliography. Let's get that working now.\n", "\n", "We've included a bibliography with our input file. With this (and using the\n", "[citeproc citation style](https://pandoc.org/demo/example19/Extension-citations.html)), we can use `pandoc-citeproc` to automatically render a\n", "bibliography within each page. To do so, we use the following extra options:\n", "\n", "* `--bibliography` specifies the path to a BibTeX file.\n", "* `-f ipynb+citations` tells Pandoc that our *input* format has citations in it. Before, the `ipynb` was\n", "  inferred from the input extension. Now we've made it explicit as well." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [ "hide_output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", " \n", " \n", " \n", " notebooks\n", " \n", " \n", " \n", "\n", "\n", "
\n", "

Here's a demo notebook

\n", "

This is a demo notebook to play around with the pandoc ipynb support

\n", "

Markdown

\n", "

As it is markdown, you can embed images, HTML, etc into your posts!

\n", "

\n", "

Here's one inlinemath and

\n", "


mathblocks

\n", "
def my_functino():\n",
      "    mystring = "you can also include python cells"\n",
      "    return mystring
\n", "
\n", "
\n", "

Code cells

\n", "

Matplotlib output with metadata

\n", "

The below code cell has some metadata attached to it. It also outputs a figure. Both should be included in the output format.

\n", "
\n", "
\n", "
from matplotlib import rcParams, cycler\n",
      "import matplotlib.pyplot as plt\n",
      "import numpy as np\n",
      "plt.ion()\n",
      "\n",
      "data = np.random.rand(2, 1000) * 100\n",
      "fig, ax = plt.subplots()\n",
      "ax.scatter(*data, s=data[1], c=data[0])
\n", "
\n", "
<matplotlib.collections.PathCollection at 0x7f6e8d6269e8>
\n", "
\n", "
\n", "

\n", "
\n", "
\n", "
\n", "

DataFrames

\n", "
\n", "
\n", "
import pandas as pd\n",
      "pd.DataFrame([['hi', 'there'], ['this', 'is'], ['a', 'DataFrame']], columns=['Word A', 'Word B'])
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Word AWord B
0hithere
1thisis
2aDataFrame
\n", "
\n", "
\n", "
\n", "
\n", "

Bibliography

\n", "

Let's test the bibliography here

\n", "

Testing this (bibliography Holdgraf et al. 2016)

\n", "

Holdgraf et al. (2014)

\n", "
\n", "
\n", "

The actual bibliography

\n", "

The bibliography will be placed at the end of the file

\n", "
\n", "
\n", "
\n", "

Holdgraf, Christopher Ramsay, Wendy de Heer, Brian N. Pasley, and Robert T. Knight. 2014. “Evidence for Predictive Coding in Human Auditory Cortex.” In International Conference on Cognitive Neuroscience. Brisbane, Australia, Australia: Frontiers in Neuroscience.

\n", "
\n", "
\n", "

Holdgraf, Christopher Ramsay, Wendy de Heer, Brian N. Pasley, Jochem W. Rieger, Nathan Crone, Jack J. Lin, Robert T. Knight, and Frédéric E. Theunissen. 2016. “Rapid Tuning Shifts in Human Auditory Cortex Enhance Speech Intelligibility.” Nature Communications 7 (May): 13654. https://doi.org/10.1038/ncomms13654.

\n", "
\n", "
\n", "\n", "\n", "\n" ] } ], "source": [ "# ipynb -> HTML with citations\n", "run(f'pandoc pandoc_ipynb/inputs/notebooks.ipynb -f ipynb+citations --bibliography pandoc_ipynb/inputsreferences.bib --resource-path=inputs -s --extract-media=outputs/images')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we've got a bibliography at the bottom of the page, and in-line citations interspersed\n", "in the text. Pretty cool!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Wrapping up\n", "\n", "It seems like we can get pretty far with converting `.ipynb` files into\n", "various flavors of markdown or HTML. My guess is that things will get a bit\n", "trickier if we try to do this with more complex cell outputs or metadata,\n", "but it's a good start. Using Pandoc also means that it would be relatively\n", "straightforward to convert notebooks into **LaTeX**, **PDF**, or even **Microsoft Word**\n", "format. I'll try to dig into this more in the future.\n", "\n" ] } ], "metadata": { "category": "til", "date": "2019-11-11", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "mystnb": { "execution_mode": "off" }, "redirect": "pandoc-jupyter-notebooks", "tags": "documentation, pandoc, nbconvert", "title": "Testing Pandoc and Jupyter Notebooks", "toc-autonumbering": false, "toc-showcode": false, "toc-showmarkdowntxt": false, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }