{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Per a recent request somebody posted on Twitter, I thought it'd be fun to write a quick scraper for the [biorxiv](http://biorxiv.org/), an excellent new tool for posting pre-prints of articles before they're locked down with a publisher embargo.\n", "\n", "A big benefit of open science is the ability to use modern technologies (like web scraping) to make new use of data that would originally be unavailable to the public. One simple example of this is information and metadata about published articles. While we're not going to dive too deeply here, maybe this will serve as inspiration for somebody else interested in scraping the web.\n", "\n", "First we'll do a few imports. We'll rely heavily on the `requests` and `BeautifulSoup` packages, which together make an excellent one-two punch for doing web scraping. We coud use something like `scrapy`, but that seems a little overkill for this small project." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import requests\n", "import pandas as pd\n", "import seaborn as sns\n", "import numpy as np\n", "from bs4 import BeautifulSoup as bs\n", "import matplotlib.pyplot as plt\n", "from tqdm import tqdm\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From a quick look at the biorxiv we can see that its search API works in a pretty simple manner. I tried typing in a simple search query and got something like this:\n", "\n", "`http://biorxiv.org/search/neuroscience%20numresults%3A100%20sort%3Arelevance-rank`\n", "\n", "Here we can see that the term you search for comes just after `/search/`, and parameters for the search, like `numresults`. 
Each keyword and its value are separated by `%3A`, which corresponds to `:` (see [this site](http://www.degraeve.com/reference/urlencoding.php) for a reference of URL encoding characters), and the key/value pairs themselves are separated by `%20`, which corresponds to a space.\n", "\n", "So, let's do a simple scrape and see what the results look like. We'll query the biorxiv API to see what kind of structure the result will have." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_results = 20\n", "url = \"http://biorxiv.org/search/neuroscience%20numresults%3A{}\".format(\n", "    n_results)\n", "resp = requests.post(url)\n", "\n", "# I'm not going to print this because it messes up the HTML rendering\n", "# But you get the idea...probably better to look in Chrome anyway ;)\n", "# text = bs(resp.text, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you search through the result, you'll notice that search results are organized into a list (denoted by `li` for each item). Inside each item is information about the article's title (in a `span` of class `highwire-cite-title`) and author information (in `span`s of class `highwire-citation-author`).\n", "\n", "Let's use this information to ask three questions:\n", "\n", "1. How has the rate of publications for a term changed over the years?\n", "1. Who's been publishing under that term?\n", "1. What kinds of things are people publishing?\n", "\n", "For each, we'll simply use the phrase \"neuroscience\", although you could use whatever you like.\n", "\n", "To set up this query, we'll need to use another part of the biorxiv API, the `limit_from` parameter. This lets us constrain the search to a specific month of the year. That way we can see the monthly submissions going back several years.\n", "\n", "We'll loop through years / months, and pull out the author and title information. 
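\n", "\n", "As a quick check of the encoding described above, we can build one of these monthly query URLs by hand with Python 3's `urllib.parse.quote` (this cell is just an illustration; the scraping below builds its URLs with `str.format` instead):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from urllib.parse import quote\n", "\n", "# quote() turns ':' into '%3A' and spaces into '%20'\n", "query = quote('neuroscience limit_from:2013-11-01 limit_to:2013-12-01 numresults:100')\n", "'http://biorxiv.org/search/' + query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "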
We'll do this with two dataframes, one for authors, one for articles." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Define the URL and start/stop years\n", "stt_year = 2012\n", "stp_year = 2016\n", "search_term = \"neuroscience\"\n", "url_base = \"http://biorxiv.org/search/{}\".format(search_term)\n", "url_params = \"%20limit_from%3A{0}-{1}-01%20limit_to%3A{2}-{3}-01%20numresults%3A100%20format_result%3Astandard\"\n", "url = url_base + url_params" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Now we'll do the scraping...\n", "all_articles = []\n", "all_authors = []\n", "for yr in tqdm(range(stt_year, stp_year + 1)):\n", "    for mn in range(1, 13):\n", "        # The window ends on the first day of the next month;\n", "        # December rolls over into January of the next year\n", "        to_yr, to_mn = (yr, mn + 1) if mn < 12 else (yr + 1, 1)\n", "\n", "        # Populate the fields with our current query and post it\n", "        this_url = url.format(yr, mn, to_yr, to_mn)\n", "        resp = requests.post(this_url)\n", "        html = bs(resp.text, 'html.parser')\n", "\n", "        # Collect the articles in the result in a list\n", "        articles = html.find_all('li', attrs={'class': 'search-result'})\n", "        for article in articles:\n", "            # Pull the title, if it's empty then skip it\n", "            title = article.find('span', attrs={'class': 'highwire-cite-title'})\n", "            if title is None:\n", "                continue\n", "            title = title.text.strip()\n", "\n", "            # Collect year / month / title information\n", "            all_articles.append([yr, mn, title])\n", "\n", "            # Now collect author information\n", "            authors = article.find_all('span', attrs={'class': 'highwire-citation-author'})\n", "            for author in authors:\n", "                all_authors.append((author.text, title))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We'll collect these into DataFrames for subsequent use\n", "authors = pd.DataFrame(all_authors, columns=['name', 'title'])\n", "articles = pd.DataFrame(all_articles, columns=['year', 'month', 'title'])" ] }, { "cell_type": "markdown",
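"metadata": {}, "source": [ "As a first look at question 1, we can count how many articles the scrape returned per month. The exact numbers will depend on whatever the live search returns; this is just a quick sanity check before we go further:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Number of scraped articles per year / month\n", "counts = articles.groupby(['year', 'month']).size()\n", "counts.tail()" ] }, { "cell_type": "markdown",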
"metadata": {}, "source": [ "To make things easier to cross-reference, we'll add an `id` column that's unique for each title. This way we can more simply join the dataframes to do cool things:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Define a dictionary of title: ID mappings\n", "unique_ids = {title: ii for ii, title in enumerate(articles['title'].unique())}\n", "articles['id'] = [unique_ids[title] for title in articles['title']]\n", "authors['id'] = [unique_ids[title] for title in authors['title']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can easily join these two dataframes together if we so wish:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " | year | \n", "month | \n", "title | \n", "id | \n", "name | \n", "
---|---|---|---|---|---|
0 | \n", "2013 | \n", "11 | \n", "Simultaneous optogenetic manipulation and calc... | \n", "0 | \n", "Frederick B. Shipley | \n", "
1 | \n", "2013 | \n", "11 | \n", "Simultaneous optogenetic manipulation and calc... | \n", "0 | \n", "Christopher M. Clark | \n", "
2 | \n", "2013 | \n", "11 | \n", "Simultaneous optogenetic manipulation and calc... | \n", "0 | \n", "Mark J. Alkema | \n", "
3 | \n", "2013 | \n", "11 | \n", "Simultaneous optogenetic manipulation and calc... | \n", "0 | \n", "Andrew M. Leifer | \n", "
4 | \n", "2013 | \n", "11 | \n", "Functional connectivity networks with and with... | \n", "1 | \n", "Satoru Hayasaka | \n", "