All Posts

A few random opportunities in AI for Social Good

Recently a few friends have reached out asking if I knew of any opportunities to work on AI-related things that also have some kind of pro-social tie-in. I think a lof people see AI as a technology with a lot of potential, but in an environment of companies that don’t seem to prioritize the benefit of human-kind over the never-ending hype machine and the promise of hyperscale growth.

So I asked around a few places to see if folks had recommendations for using skills in machine learning or artificial intelligence, but in a context that was explicitly for the benefit of humanity or otherwise pro-social. Here are a few that stood out (they are 100% not vetted, I’m just passing along information in case it’s useful so please tell me if there’s something wrong or problematic in here):

Read more ...


A Sphinx directive for social media embeds

I often want to link to social and other types of web-based media in my Sphinx documentation and blog. Rather than embedding it all in custom HTML code, I decided to write a little wrapper to turn it into a directive.

It’s called {socialpost}, and it works with Twitter, Mastodon, and YouTube links.

Read more ...


Report from FOSDEM23: beautiful chaos in a conference

I recently attended FOSDEM 2023, my first FOSDEM! I had heard of the conference before, but hadn’t really looked into it too much. Fortunately, after some urging from friends and social media, I took a deeper look and decided I should join to see what all the fuss was about.

Here are a few things that I noticed while I was there.

Read more ...


Bundle extensions with your Sphinx theme

Sphinx is great because it has a ton of useful extensions that let you grow its functionality. However, a downside of this is that users have to actually learn about those extensions and activate them manually. It’s not hard, but it’s a non-trivial amount of discovery work.

One way to solve this is for themes to bundle extensions on their own. This way they can include functionality via an extension rather than writing custom code on their own.

Read more ...


Install dependencies from GitHub with pyproject.toml or requirements.txt

This is a short post to demonstrate how to install packages directly from GitHub with pyprojects.toml or requirements.txt, including custom branches and commits. It will focus on pyprojects.toml because this is newer and there’s less information about it, but the general pattern holds for requirements.txt as well.

In pyproject.toml, you can specify dependencies for a project via the dependencies field. For example, to specify Sphinx as a dependency:

Read more ...


Report from the JupyterLite workshop: WebAssembly is pretty cool

I recently attended the JupyterLite community workshop in Paris, here are some quick thoughts from the three-day event[1].

For those without any background, JupyterLite is a distribution of Jupyter’s user interfaces and a Python kernel that runs entirely in the browser. Its goal is to provide a low-overhead and accessible way to use a Jupyter interface via the browser. See the jupyterlite documentation for more information.

Read more ...


Load and plot a remote font with Matplotlib

As part of my sphinx-social-previews prototype, I wanted to be able to use the Roboto Font from Google in image previews. However, Roboto is often not loaded on your local filesystem, so it took some digging to figure out how to make it possible to load via Matplotlib’s text plotting functionality.

Here’s the solution that finally worked for me, inspired from this Tweet with a similar implementation from the dmol book.

../_images/868fcd11b95a614186ca3a676d028d4548af86c0749233733e16ac5cf97f2fed.png

Read more ...


How to update Sphinx options during the build

As part of the pydata-sphinx-theme we have a few settings that auto-enable extensions and configure them on behalf of the user. It has always been mysterious to me how to do this properly during the Sphinx build. It’s easy to configure things with conf.py ahead of time, but what if you want to manually set a value during the build?

I finally figured it out, so documenting the process here.

Read more ...


Automatically update pre-commit hook versions

I figured out a way to automatically update all of the git pre-commit hook versions at once!

pre-commit is a useful command line tool for running simple commands before every git commit. I use it to enforce things like flake8 and black in many of my projects.

Read more ...


subprocess.run can execute shell commands directly

I often run shell commands in Python via the subprocess.run command. One thing that has always bugged me is that this required you to split commands into a list before it’d work properly. For example, you’d have to do:

Today I discovered that you don’t have to do this! There’s a shell= keyword that can be used to tell subprocess to simply run the command directly in the shell.

Read more ...


Fix phantom GitHub workflows in your ci-cd with protected branch rules

Have you ever had a GitHub pull request show “phantom” workflows that never pass? This looks like one or more workflows that are in a constant waiting state, with a yellow status indicator, and that never complete.

It looks something like this:

https://user-images.githubusercontent.com/1839645/204134864-da2541f0-ff4f-4d9f-8c80-aa8c4437d8a0.png

Read more ...


Custom roles and domains in Sphinx with one line

I was working on the roles and structure section of the 2i2c Team Compass and found a nifty feature in Sphinx that I hadn’t known before.

You can currently add labels to any section with the following MyST Markdown structure:

Read more ...


Automatically updating my publications page with ORCID and doi.org

For a while I’ve had a hand-crafted .bibtex file stored locally for my publications/ page. However, manually updating local text file is a pain to remember, especially since there are many services out there that automatically track new publications.

A helpful suggestion on Twitter allowed me to include the full citation information, including lists of authors, using the doi.org API!

Read more ...


Automatically redirect folders in Sphinx websites

  • 19 November 2022
  • til

I spent a bit of time today updating my website after some changes in the MyST-NB and Sphinx Design ecosystems. Along the way, I decided to redirect /posts/ to /blog/, since it seems /blog/ is a much more common folder to use for blog posts.

This posed a problem, because the sphinx-rediraffe extension does not allow you to redirect folders with wildcards. AKA, you cannot do:

Read more ...


Ask Twitter: Why don’t academic researchers use cloud services?

this is an experiment at making my Twitter conversations a bit more useful and archivable over time. It’s going to be a bit messy and unpolished, but hopefully that makes it more likely I’ll actually do it :-)

Over the past decade, cloud infrastructure has become increasingly popular in industry. An ecosystem of modular tools and cloud services (often called the Modern Data Stack) has filled many data needs for companies.

Read more ...


Serving in two roles at once via pre-recorded tutorials

At AGU 2021 this year I was asked to give a short tutorial introduction to Jupyter Book. The tutorial was 30 minutes long, and the session was fully remote.

This posed a few challenges:

https://user-images.githubusercontent.com/1839645/146656769-95e0e430-15ac-4b5e-a0ee-7e77e8f7b9ed.png

Read more ...


Contributing to open source: A short guide for organizations

Over the years I’ve had a recurring question from people who are in organizations both big and small: how can we participate in open source communities?

Whether it is because of altruism or strategic importance, many companies, research groups, non-profits, etc want to be involved in open source projects (particularly large and impactful ones like Jupyter), but getting involved can be an opaque and confusing process if you’re not already familiar with open source. Each community has its own nuances and social dynamics, and approaching from the outside can be a challenge.

Read more ...


A new blog with Sphinx

I recently re-wrote all of the infrastructure for my blog so that it now builds on top of the Sphinx ecosystem! This is a short post to describe the reasons for doing so, and a bit about the implementation.

This is a great question. The answer to “should you re-work your blog to use a new SSG” is almost always “no, it’s a waste of your time”, but I think I had a few good reasons ;-)

https://www.sphinx-doc.org/en/master/_static/sphinxheader.png

Read more ...


What do people think about rST?

Publishing computational narratives has always been a dream of the Jupyter Project, and there is still a lot of work to be done in improving these use-cases. We’ve made a lot of progress in providing open infrastructure for reproducible science with JupyterHub and the Binder Project, but what about the documents themselves? We’ve recently been working on tools like Jupyter Book, which aim to improve the writing and publishing process with the Jupyter ecosystem. This is hopefully the first post of a few that ask how we can best-improve the state of publishing with Jupyter.

Many of the ideas in this post have now made their way into a new flavor of markdown called Markedly Structured Text, or MyST. It brings all of the features of rST into Markdown. Check it out!

Read more ...


Build a simple timeline with sphinx-design

One of the nice things about MyST Markdown is that it is extensible. Any Sphinx extension will work with MyST Markdown (in the context of Jupyter Book, anyway).

One of my favorite Sphinx extensions is Sphinx Design, this brings you flexible UI components that use Bootstrap CSS under the hood (though without heavy javascript). They let you do things like this:

Read more ...


Testing Pandoc and Jupyter Notebooks

For several months now, the universal document converter pandoc has had support for Jupyter Notebooks. This means that with a single call, you can convert .ipynb files to any of the output formats that Pandoc supports (and vice-versa!). This post is a quick exploration of what this looks like.

Note that for this post, we’re using Pandoc version 2.7.3. Also, some of what’s below is hard to interpret without actually opening the files that are created by Pandoc. For the sake of this blog post, I’m going to stick with the raw text output here, though you can expand the outputs if you wish, I recommend copy/pasting some of these commands on your own if you’d like to try.

Read more ...


What would Python-style governance look like in Jupyter?

This is the second in a series of blog posts that explores what it’d look like to directly port the governance model of other communities into the Jupyter project. You can find the first post about Rust here.

Note: These posts are meant as a thought experiment rather than a proposal. Moreover, all the usual caveats come with it, such as the fact that I don’t know the Python governance structure that well, and I might totally botch my characterization of it.

Read more ...


What would Rust-style governance look like in Jupyter?

As I’ve written about before, I like Rust’s governance structure. I mean, who can’t get behind a community that lists governance as a top-level page on its website?

Jupyter is currently in the middle of figuring out the next phase of its governance structure, and so I have been thinking about what this might look like. This post is a quick thought-experiment to explore what it’d mean to port over Rust’s governance directly into the Jupyter community.

Read more ...


Automating Jupyter Book deployments with CI/CD

Lately I’ve spent a lot of time trying to reduce the friction involved in deploying Jupyter Book as well as contributing to the project. Features are a great carrot, but ultimately getting engagement is also about lowering barriers to entry and showing people a path forward. Jupyter Book is a relatively straightforward project, but it involves a few technical pieces that can be painful to use (thanks Jekyll).

Recently I experimented with whether we can automate deploying a Jupyter Book online. Using continuous integration / deployment services seems like a natural place to try this out. One can upload a barebones set of code to a GitHub repository, then configure a build system to create a book and deploy it online from there. This blog post is a place to keep track of the current state of affairs for this workflow.

auto build logos

Read more ...


A few recent talks

Lately I’ve given quite a number of talks about the Jupyter and Binder ecosystems for various purposes. Before each of the talks, I make the slides available at a public address in case others are interested in following up with the material. For those who missed the talks (or the subsequent tweets about them), here are a few of the more recent ones.

A word of warning: there’s a lot of overlap between these talks - I’m not crazy enough to re-invent the wheel each time I have to speak. However, maybe folks will find some value in the different angles taken in each case.

Read more ...


Thoughts from the Jupyter team meeting 2019

I just got back from a week-long Jupyter team meeting that was somehow both very tiring and energizing at the same time. In the spirit of openness, I’d like to share some of my experience. While it’s still fresh in my mind, here are a few takeaways that occurred to me throughout the week.

Note that these are my personal (rough) impressions, but they shouldn’t be taken as a statement from the project/community itself.

Read more ...


Three things I love about CircleCI

I recently had to beef up the continuous deployment of Jupyter Book, and used it as an opportunity to learn a bit more about CircleCI’s features. It turns out, they’re pretty cool! Here are a few of the things that I learned this time around.

For those who aren’t familiar with CircleCI, it is a service that runs Continuous Integration and Continuous Deployment (CI/CD) workflows for projects. This basically means that they manage many kinds of infrastructure that can launch jobs that run test suites, deploy applications, and test on many different environments.

Read more ...


Automatically mirror a github repository with CircleCI

tl;dr: you can automatically mirror the contents of one repository to another by using CI/CD services like CircleCI. This post shows you one way to do it using secrets that let you push to a GitHub repository from a CircleCI process.

We recently ran into an issue with the Data 8 course where we needed to mirror one GitHub site to another. In short, the textbook is built with a tool called jupyter-book, and we use github-pages to host the content at inferentialthinking.com. For weird URL-naming reasons, we had to create a second organization to host the actual site. This introduced the complexity that any time the textbook had to be updated, we did so in two different places. The raw textbook content is hosted at https://github.com/data-8/textbook, and the version hosted online is at https://github.com/inferentialthinking/inferentialthinking.github.io.

Read more ...


Open communities need to be partners, not sources of free labor

In the last couple of years, we’ve seen an increasing number of organizations start to spawn products that take a largely open stack (e.g., the SciPy ecosystem) and wrap it in a thin layer of proprietary/custom interface + infrastructure. On the face of it, this isn’t a problem - I really want people to be able to make money using the open source stack - however, there is a big caveat. When you look at the work that those organizations have done over time, you often see a pretty thin trail of contributions back to those open source projects.

I’d argue that using an open community’s software without contributing back is straight-up exploitative (legal, sure, but still exploitative), and we should think about ways to suppress this kind of behavior. This post is a collection of thoughts on that topic.

Read more ...


My weekly workflow

I’ve had a bunch of conversations with friends who were interested in how to keep track of the various projects they’re working on, and to prioritize their time over the course of a week. I thought it might be helpful to post my own approach to planning time throughout the week in case it’s useful for others to riff off of.

First off, a few general principles that I use to guide my thinking on planning out the week.

Read more ...


How do projects signal how “open” they are?

How do open projects signal their “openness” to the outside community? This is a really hard question, particularly because nowadays “open” has become a buzzword that doesn’t just signal a project’s position to the community, but is also used as a marketing term to increase support, users, or resources.

I was thinking about this the other day, so decided to take to twitter:

Read more ...


I like Rust’s governance structure

Recently I’ve been reading up on governance models for several large-ish open source projects. This is partially because I’m involved in a bunch of these projects myself, and partially because it’s fascinating to see distributed groups of people organizing themselves in effective (or not) ways on the internet.

Governance is tricky, because there is an inherent tension between:

Rust teams

Read more ...


Using CircleCI to preview documentation in Pull Requests

Writing documentation is important - it’s the first point of contact between many users and your project, and can be a pivotal moment in whether they decide to adopt your tech or become a contributor.

However, it can be a pain to iterate on documentation, as it is often involves a lot of rapid iteration locally, followed by a push to GitHub where you “just trust” that the author has done a good job of writing content, design, etc.

Read more ...


Summer conference report back

This is a short update on several of the conferences and workshops over the summer of this year. There’s all kinds of exciting things going on in open source and open communities, so this is a quick way for me to collect my thoughts on some things I’ve learned this summer.

Pangeo is a project that provides access to a gigantic geosciences dataset. They use lots of tools in the open-source community, including Dask for efficient numerical computation, the SciPy stack for a bunch of data analytics, and JupyterHub on Kubernetes for managing user instances and deploying on remote infrastructure. Pangeo has a neat demo of their hosted JupyterHub instance that people can use to access this otherwise-inaccessible dataset! See their video from SciPy below.

Read more ...


Adding copy buttons to code blocks in Sphinx

NOTE: This is now a sphinx extension! Thanks to some friendly suggestions, I’ve written this up as a super tiny sphinx extension. Check it out here: https://github.com/choldgraf/sphinx-copybutton

Sphinx is a fantastic way to build documentation for your Python package. On the Jupyter project, we use it for almost all of our repositories.

Read more ...


Introducing _makeitpop_, a tool to perceptually warp your data!

It should go without saying, but you should never do the stuff that you’re about to read about here. Data is meant to speak for itself, and our visualizations should accurately reflect the data above all else.*

When I was in graduate school, I tended to get on my soapbox and tell everybody why they should stop using Jet and adopt a “perceptually-flat” colormap like viridis, magma, or inferno.

../_images/740ee63ac731761d28669ee426c17caf515cab36f3fca6e8c675430eef5521f8.png

Read more ...


Blogging with Jupyter Notebooks and Jekyll using nbconvert templates

Here’s a quick (and hopefully helpful) post for those wishing to blog in Jekyll using Jupyter notebooks. As some of you may know, nbconvert can easily convert your .ipynb files to markdown, which Jekyll can easily turn into blog posts for you.

However, an annoying part of this is that Markdown doesn’t include classes for input and outputs, which means they each get treated the same in the output. Not ideal.

Read more ...


An academic scientist goes to DevOps Days

Last week I took a few days to attend DevOpsDays Silicon Valley. My goal was to learn a bit about how the DevOps culture works, what are the things people are excited about and discuss in this community. I’m also interested in learning a thing or two that could be brought back into the scientific / academic world. Here are a couple of thoughts from the experience.

tl;dr: DevOps is more about culture and team process than it is about technology, maybe science should be too…

Read more ...


Combining dates with analysis visualization in python

Sometimes you want to do two things:

Plot a timeseries that handles datetimes in a clever way (e.g., with Pandas or Matplotlib)

../_images/a5e3e8e77eaaf3179af34196eaac86273f44b3b5e7e58b872fbbb3ea52e3d634.png

Read more ...


Dates in python

As a part of setting up the website for the Docathon I’ve had to re-learn all of my date string formatting rules. It’s one of those little problems you don’t really think about - turning an arbitrary string into something structured like a date - until you’ve actually got to do it.

There are a bunch of tools in python for using date-like objects, but it’s not always easy to figure out how these work. This post is just a couple of pieces of information I’ve picked up along the process.

../_images/a289273e2ebdd52dbf8d304957f576e5f6b15464ba4df7b93ad2710fb0cc5013.png

Read more ...


Matplotlib Cyclers are Great

Every now and then I come across a nifty feature in Matplotlib that I wish I’d known about earlier. The MPL documentation can be a beast to get through, and as a result you miss some cool stuff sometimes.

This is a quick demo of one such feature: the cycler.

../_images/1a49d155574ef142621815f0f823c023838373d0fc8e64acf60e29da890dddcb.png

Read more ...


Brainy Jingle Bells

This is a quick demo of how I created this video. Check it out below, or read on to see the code that made it!

Here’s a quick viz to show off some brainy holiday spirit.

../_images/1cac946ecc78b232fad0a191ce135b7994609040c2f42d0df7e8b8c7f801ac37.jpg

Read more ...


The bleeding edge of publishing, Scraping publication amounts at biorxiv

Per a recent request somebody posted on Twitter, I thought it’d be fun to write a quick scraper for the biorxiv, an excellent new tool for posting pre-prints of articles before they’re locked down with a publisher embargo.

A big benefit of open science is the ability to use modern technologies (like web scraping) to make new use of data that would originally be unavailable to the public. One simple example of this is information and metadata about published articles. While we’re not going to dive too deeply here, maybe this will serve as inspiration for somebody else interested in scraping the web.

../_images/c7eb5eb4f433ce9fab991de4b5266f3cc5715e1bb8c81c5726792d23cc2d64a4.png

Read more ...


Visualizing publication bias

This article is now interactive! Check out a live Binder instance here

In the next few months, I’ll try to take some time to talk about the things I learn as I make my way through this literature. While it’s easy to make one-off complaints to one another about how “science is broken” without really diving into the details, it’s important learn about how it’s broken, or at least how we could assess something like this.

../_images/funnel_plot_no_dists.png

Read more ...


5 things I learned at SciPy

I’ve finally decompressed after my first go-around with Scipy. For those who haven’t heard of this conference before, Scipy is an annual meeting where members of scientific community get together to discuss their love of Python, scientific programming, and open science. It spans both academics and people from industry, making it a unique place in terms of how software interfaces with scientific research. (if you’re interested the full set of Scipy conferences, check out here.

It was an eye-opening experience that I learned a lot from, so here’s a quick recap of some things that I learned during my first rodeo.

Read more ...


Could Brexit have happened by chance?

As a scientist, watching the Brexit vote was a little bit painful. Though probably not for the reason you’re thinking. No, it wasn’t the politics that bothered me, but the method for making such an incredibly important decision. Let me explain…

Scientists are a bit obsessed with the concept of error. In the context of collecting data and anaylzing it, this takes the form of our “confidence” in the results. If all the data say the same thing, then we are usually pretty confident in the overall message. If the data is more complicated than this (and it always is), then we need to define how confident we are in our conclusions.

../_images/96ad4c6ce6c84e6e538dcda8722efc5916cc48c7b115a2f25ed8fa9271b9d924.png

Read more ...


The beauty of computational efficiency

When we discuss “computational efficiency”, you often hear people throw around phrases like $O(n^2)$ or $O(nlogn)$. We talk about them in the abstract, and it can be hard to appreciate what these distinctions mean and how important they are. So let’s take a quick look at what computational efficiency looks like in the context of a very famous algorithm: The Fourier Transform.

Briefly, A Fourier Transform is used for uncovering the spectral information that is present in a signal. AKA, it tells us about oscillatory components in the signal, and has a wide range of uses in communications, signal processing, and even neuroscience analysis.

../_images/ed0a1e43a49f42359fb675ae8d07a7d38ee4f06e23b873fb9b0fb2a2bfdc9d86.png

Read more ...


NIH grant analysis

As I’m entering the final years of graduate school, I’ve been applying for a few typical “pre-doc” fellowships. One of these is the NRSA, which is notorious for requiring you to wade through forests of beaurocratic documents (seriously, their “guidelines” for writing an NRSA are over 100 pages!). Doing so ends up taking a LOT of time.

This got me wondering what kind of success rates these grants have in the first place. For those who haven’t gone through the process before, it’s a bit opaque:

../_images/79fa811b8d2df68810245ec8db5e6e39714f6bc2c49d1091ce36afa1dea96e28.png

Read more ...


Craigslist data analysis

In the last post I showed how to use a simple python bot to scrape data from Criagslist. This is a quick follow-up to take a peek at the data.

Note - data that you scrape from Craigslist is pretty limited. They tend to clear out old posts, and you can only scrape from recent posts anyway to avoid them blocking you.

../_images/5e61fcc310c3b744b46ed6582f714bcd0903deaf021ed2f908639a653e48b043.png

Read more ...


Scraping craigslist

In this notebook, I’ll show you how to make a simple query on Craigslist using some nifty python modules. You can take advantage of all the structure data that exists on webpages to collect interesting datasets.

First we need to figure out how to submit a query to Craigslist. As with many websites, one way you can do this is simply by constructing the proper URL and sending it to Craigslist. Here’s a sample URL that is returned after manually typing in a search to Craigslist:

../_images/2c7f9b06d258220cbe83caa9f29188e2a6da84e3b74ba4e0e1673b0172fbad1b.png

Read more ...


Coherence correlation

Note - you can find the nbviewer of this post here

A big question that I’ve always wrestled with is the difference between correlation and coherence. Intuitively, I think of these two things as very similar to one another. Correlation is a way to determine the extent to which two variables covary (normalized to be between -1 and 1). Coherence is similar, but instead assesses “similarity” by looking at the similarity for two variables in frequency space, rather than time space.

../_images/eeg_coh.png

Read more ...