Data Wrangling in Observable With Arquero

Home Run Leaders

Introduction

As an exercise in data wrangling in Observable, we’ll use a few datasets (exported from the R package Lahman) to produce an interactive graph of home run leaders in Major League baseball. We’ll use the Arquero library in order to wrangle our data in a way that should be familiar to users of R’s tidyverse ecosystem.

Here is the graph we want to make:

// Trying to set up reactives correctly:  browser needs the specified version of plot
// before it attempts to run
// the plotting functions!
goalPlot = {
  const nPlot = await import("https://esm.run/@observablehq/plot@0.6.11");
  let p = nPlot.plot({
    nice: true,
    x: {label: "Season"},
    y: {label: "Home Runs"},
    marks: [
      nPlot.dot(df, {
        x: "seasonAsDate", 
        y: "jy", 
        fill: "black", fillOpacity: 0.3
      }),
      nPlot.tip(df, nPlot.pointer({
        x: "seasonAsDate",
        y: "jy",
        title: d => d.info
      }))
    ]
  })
  return p;
}

Setup

We import Arquero and a recent version of Plots, and get the necessary datasets.

import { aq, op } from '@uwdata/arquero'

Plot = import("https://esm.run/@observablehq/plot@0.6.11")

We now load in the csv files, and get views of them:

batting = {
  let b = await aq.loadCSV("Batting.csv");
  return b.select("playerID", "yearID", "teamID", "HR");
}

batting.view()

players = {
  let p = await aq.loadCSV("People.csv");
  return p.select("playerID", "nameFirst", "nameLast", "debut");
}

players.view()

teams = {
  let t = await aq.loadCSV("Teams.csv");
  return t.select("teamID", "yearID", "name");
}

teams.view()

Wrangling

Here is where Arquero feels a lot like R’s dplyr:

hrLeaders = batting
  // only for 1900 or later season:
  .filter(d => d.yearID >= 1900)
  // join with Players to get name and debut year:
  .join(players, "playerID")
  // find HR leaders in each season:
  .groupby("yearID")
  // arrange within each season-group:
  .orderby(aq.desc("HR"))
  // take only the top five:
  .slice(0, 5)
  .ungroup()
  // for table display:
  .orderby("yearID", aq.desc("HR"))

Let’s have a look at the resulting table:

Inputs.table(hrLeaders)

We need to join with the Teams data, and we add a column containing material for the tool tips. We must also create “jittered” versions of the season and home-run variables, in order to avoid over-plotting.

df = hrLeaders.
  // Default for join is to use all columns that appear in both tables
  // https://uwdata.github.io/arquero/api/verbs#join .
  // (We are joining on teamID and yearID):
  join(teams)
  // rename for more readable graph:
  .rename({yearID: "season"})
  // create a better debut variable:
  .derive(
    // As is, aq.loadCSV brings in debut
    // as a string.
    // We should consider making it a Date
    // and then:
    // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/toDateString
    // But we must also escape, as per
    // https://uwdata.github.io/arquero/api/#escape .
    // Great background discussion here:
    // https://uwdata.github.io/arquero/api/expressions
    {debutDate: aq.escape(function (d) {
      const date = new Date(d.debut);
      return date.toDateString();
    })}
  )
  // now assemble the tooltip info:
  .derive(
    {info: d => `Playing for the ${d.name}, ${d.nameFirst} ${d.nameLast}, who debuted on ${d.debutDate}, hit ${d.HR} home runs in ${d.season}.`
    }
  )
  // jitter to avoid over-plotting:
  .derive({
    jy: d => d.HR + Math.random() - 0.5
  })
  // keep Plot from formatting years with commas:
  .derive({
    seasonAsDate: aq.escape(d => new Date(d.season, 0, 1))
  })

This is the “glyph-ready” table that is used to make the desired plot. Let’s have a look at it:

Inputs.table(df)

The code for the plot was:

Plot.plot({
    nice: true,
    x: {label: "Season"},
    y: {label: "Home Runs"},
    marks: [
      Plot.dot(df, {
        x: "seasonAsDate", 
        y: "jy", 
        fill: "black", fillOpacity: 0.3
      }),
      Plot.tip(df, nPlot.pointer({
        x: "seasonAsDate",
        y: "jy",
        title: d => d.info
      }))
    ]
  })

Note

A first pass at the work is in this Observable notebook:

https://observablehq.com/@homerhanumat/home-run-leaders

This seems to be another instance where the work is much easier to do in an Observable notebook than in Quarto.

// hack once used to make Quarto render the csv files to docs,
// but for some reason no longer
// needed:
stuff = {
   let b = FileAttachment("Batting.csv").csv({typed: true});
   let p = FileAttachment("People.csv").csv({typed: true});
   let t = FileAttachment("Teams.csv").csv({typed: true});
   return {b, p, t}
}

--- title: "Data Wrangling in Observable With Arquero" #image: banner.jpg format: html: echo: true code-tools: true --- # Home Run Leaders ## Introduction As an exercise in data wrangling in Observable, we'll use a few datasets (exported from the R package **Lahman**) to produce an interactive graph of home run leaders in Major League baseball. We'll use the [Arquero](https://observablehq.com/@uwdata/introducing-arquero) library in order to wrangle our data in a way that should be familiar to users of R's [**tidyverse**](https://www.tidyverse.org/) ecosystem. Here is the graph we want to make: ```{ojs} //| echo: false // Trying to set up reactives correctly: browser needs the specified version of plot // before it attempts to run // the plotting functions! goalPlot = { const nPlot = await import("https://esm.run/@observablehq/plot@0.6.11"); let p = nPlot.plot({ nice: true, x: {label: "Season"}, y: {label: "Home Runs"}, marks: [ nPlot.dot(df, { x: "seasonAsDate", y: "jy", fill: "black", fillOpacity: 0.3 }), nPlot.tip(df, nPlot.pointer({ x: "seasonAsDate", y: "jy", title: d => d.info })) ] }) return p; } ``` ## Setup We import Arquero and a recent version of Plots, and get the necessary datasets. ```{ojs} import { aq, op } from '@uwdata/arquero' ``` ```{ojs} //| eval: false Plot = import("https://esm.run/@observablehq/plot@0.6.11") ``` We now load in the csv files, and get views of them: ```{ojs} batting = { let b = await aq.loadCSV("Batting.csv"); return b.select("playerID", "yearID", "teamID", "HR"); } ``` ```{ojs} batting.view() ``` ```{ojs} players = { let p = await aq.loadCSV("People.csv"); return p.select("playerID", "nameFirst", "nameLast", "debut"); } ``` ```{ojs} players.view() ``` ```{ojs} teams = { let t = await aq.loadCSV("Teams.csv"); return t.select("teamID", "yearID", "name"); } ``` ```{ojs} teams.view() ``` ## Wrangling Here is where Arquero feels a lot like R's **dplyr**: ```{ojs} hrLeaders = batting // only for 1900 or later season: .filter(d => d.yearID >= 1900) // join with Players to get name and debut year: .join(players, "playerID") // find HR leaders in each season: .groupby("yearID") // arrange within each season-group: .orderby(aq.desc("HR")) // take only the top five: .slice(0, 5) .ungroup() // for table display: .orderby("yearID", aq.desc("HR")) ``` Let's have a look at the resulting table: ```{ojs} Inputs.table(hrLeaders) ``` We need to join with the Teams data, and we add a column containing material for the tool tips. We must also create "jittered" versions of the season and home-run variables, in order to avoid over-plotting. ```{ojs} df = hrLeaders. // Default for join is to use all columns that appear in both tables // https://uwdata.github.io/arquero/api/verbs#join . // (We are joining on teamID and yearID): join(teams) // rename for more readable graph: .rename({yearID: "season"}) // create a better debut variable: .derive( // As is, aq.loadCSV brings in debut // as a string. // We should consider making it a Date // and then: // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/toDateString // But we must also escape, as per // https://uwdata.github.io/arquero/api/#escape . // Great background discussion here: // https://uwdata.github.io/arquero/api/expressions {debutDate: aq.escape(function (d) { const date = new Date(d.debut); return date.toDateString(); })} ) // now assemble the tooltip info: .derive( {info: d => `Playing for the ${d.name}, ${d.nameFirst} ${d.nameLast}, who debuted on ${d.debutDate}, hit ${d.HR} home runs in ${d.season}.` } ) // jitter to avoid over-plotting: .derive({ jy: d => d.HR + Math.random() - 0.5 }) // keep Plot from formatting years with commas: .derive({ seasonAsDate: aq.escape(d => new Date(d.season, 0, 1)) }) ``` This is the "glyph-ready" table that is used to make the desired plot. Let's have a look at it: ```{ojs} Inputs.table(df) ``` The code for the plot was: ```{ojs} //| eval: false Plot.plot({ nice: true, x: {label: "Season"}, y: {label: "Home Runs"}, marks: [ Plot.dot(df, { x: "seasonAsDate", y: "jy", fill: "black", fillOpacity: 0.3 }), Plot.tip(df, nPlot.pointer({ x: "seasonAsDate", y: "jy", title: d => d.info })) ] }) ``` ## Note A first pass at the work is in this Observable notebook: >[https://observablehq.com/@homerhanumat/home-run-leaders](https://observablehq.com/@homerhanumat/home-run-leaders) This seems to be another instance where the work is much easier to do in an Observable notebook than in Quarto. ```{ojs} //| echo: false //| eval: false // hack once used to make Quarto render the csv files to docs, // but for some reason no longer // needed: stuff = { let b = FileAttachment("Batting.csv").csv({typed: true}); let p = FileAttachment("People.csv").csv({typed: true}); let t = FileAttachment("Teams.csv").csv({typed: true}); return {b, p, t} } ```