Data Wrangling in Observable With Arquero

Home Run Leaders

Introduction

As an exercise in data wrangling in Observable, we’ll use a few datasets (exported from the R package Lahman) to produce an interactive graph of home run leaders in Major League baseball. We’ll use the Arquero library in order to wrangle our data in a way that should be familiar to users of R’s tidyverse ecosystem.

Here is the graph we want to make:

Setup

We import Arquero and a recent version of Plots, and get the necessary datasets.

import { aq, op } from '@uwdata/arquero'
Plot = import("https://esm.run/@observablehq/plot@0.6.11")

We now load in the csv files, and get views of them:

batting = {
  let b = await aq.loadCSV("Batting.csv");
  return b.select("playerID", "yearID", "teamID", "HR");
}
batting.view()
players = {
  let p = await aq.loadCSV("People.csv");
  return p.select("playerID", "nameFirst", "nameLast", "debut");
}
players.view()
teams = {
  let t = await aq.loadCSV("Teams.csv");
  return t.select("teamID", "yearID", "name");
}
teams.view()

Wrangling

Here is where Arquero feels a lot like R’s dplyr:

hrLeaders = batting
  // only for 1900 or later season:
  .filter(d => d.yearID >= 1900)
  // join with Players to get name and debut year:
  .join(players, "playerID")
  // find HR leaders in each season:
  .groupby("yearID")
  // arrange within each season-group:
  .orderby(aq.desc("HR"))
  // take only the top five:
  .slice(0, 5)
  .ungroup()
  // for table display:
  .orderby("yearID", aq.desc("HR"))

Let’s have a look at the resulting table:

Inputs.table(hrLeaders)

We need to join with the Teams data, and we add a column containing material for the tool tips. We must also create “jittered” versions of the season and home-run variables, in order to avoid over-plotting.

df = hrLeaders.
  // Default for join is to use all columns that appear in both tables
  // https://uwdata.github.io/arquero/api/verbs#join .
  // (We are joining on teamID and yearID):
  join(teams)
  // rename for more readable graph:
  .rename({yearID: "season"})
  // create a better debut variable:
  .derive(
    // As is, aq.loadCSV brings in debut
    // as a string.
    // We should consider making it a Date
    // and then:
    // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/toDateString
    // But we must also escape, as per
    // https://uwdata.github.io/arquero/api/#escape .
    // Great background discussion here:
    // https://uwdata.github.io/arquero/api/expressions
    {debutDate: aq.escape(function (d) {
      const date = new Date(d.debut);
      return date.toDateString();
    })}
  )
  // now assemble the tooltip info:
  .derive(
    {info: d => `Playing for the ${d.name}, ${d.nameFirst} ${d.nameLast}, who debuted on ${d.debutDate}, hit ${d.HR} home runs in ${d.season}.`
    }
  )
  // jitter to avoid over-plotting:
  .derive({
    jy: d => d.HR + Math.random() - 0.5
  })
  // keep Plot from formatting years with commas:
  .derive({
    seasonAsDate: aq.escape(d => new Date(d.season, 0, 1))
  })

This is the “glyph-ready” table that is used to make the desired plot. Let’s have a look at it:

Inputs.table(df)

The code for the plot was:

Plot.plot({
    nice: true,
    x: {label: "Season"},
    y: {label: "Home Runs"},
    marks: [
      Plot.dot(df, {
        x: "seasonAsDate", 
        y: "jy", 
        fill: "black", fillOpacity: 0.3
      }),
      Plot.tip(df, nPlot.pointer({
        x: "seasonAsDate",
        y: "jy",
        title: d => d.info
      }))
    ]
  })

Note

A first pass at the work is in this Observable notebook:

https://observablehq.com/@homerhanumat/home-run-leaders

This seems to be another instance where the work is much easier to do in an Observable notebook than in Quarto.