// Trying to set up reactives correctly: browser needs the specified version of plot
// before it attempts to run
// the plotting functions!
goalPlot = {
const nPlot = await import("https://esm.run/@observablehq/plot@0.6.11");
let p = nPlot.plot({
nice: true,
x: {label: "Season"},
y: {label: "Home Runs"},
marks: [
nPlot.dot(df, {
x: "seasonAsDate",
y: "jy",
fill: "black", fillOpacity: 0.3
}),
nPlot.tip(df, nPlot.pointer({
x: "seasonAsDate",
y: "jy",
title: d => d.info
}))
]
})
return p;
}Data Wrangling in Observable With Arquero
Home Run Leaders
Introduction
As an exercise in data wrangling in Observable, we’ll use a few datasets (exported from the R package Lahman) to produce an interactive graph of home run leaders in Major League baseball. We’ll use the Arquero library in order to wrangle our data in a way that should be familiar to users of R’s tidyverse ecosystem.
Here is the graph we want to make:
Setup
We import Arquero and a recent version of Plots, and get the necessary datasets.
import { aq, op } from '@uwdata/arquero'Plot = import("https://esm.run/@observablehq/plot@0.6.11")We now load in the csv files, and get views of them:
batting = {
let b = await aq.loadCSV("Batting.csv");
return b.select("playerID", "yearID", "teamID", "HR");
}batting.view()players = {
let p = await aq.loadCSV("People.csv");
return p.select("playerID", "nameFirst", "nameLast", "debut");
}players.view()teams = {
let t = await aq.loadCSV("Teams.csv");
return t.select("teamID", "yearID", "name");
}teams.view()Wrangling
Here is where Arquero feels a lot like R’s dplyr:
hrLeaders = batting
// only for 1900 or later season:
.filter(d => d.yearID >= 1900)
// join with Players to get name and debut year:
.join(players, "playerID")
// find HR leaders in each season:
.groupby("yearID")
// arrange within each season-group:
.orderby(aq.desc("HR"))
// take only the top five:
.slice(0, 5)
.ungroup()
// for table display:
.orderby("yearID", aq.desc("HR"))Let’s have a look at the resulting table:
Inputs.table(hrLeaders)We need to join with the Teams data, and we add a column containing material for the tool tips. We must also create “jittered” versions of the season and home-run variables, in order to avoid over-plotting.
df = hrLeaders.
// Default for join is to use all columns that appear in both tables
// https://uwdata.github.io/arquero/api/verbs#join .
// (We are joining on teamID and yearID):
join(teams)
// rename for more readable graph:
.rename({yearID: "season"})
// create a better debut variable:
.derive(
// As is, aq.loadCSV brings in debut
// as a string.
// We should consider making it a Date
// and then:
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/toDateString
// But we must also escape, as per
// https://uwdata.github.io/arquero/api/#escape .
// Great background discussion here:
// https://uwdata.github.io/arquero/api/expressions
{debutDate: aq.escape(function (d) {
const date = new Date(d.debut);
return date.toDateString();
})}
)
// now assemble the tooltip info:
.derive(
{info: d => `Playing for the ${d.name}, ${d.nameFirst} ${d.nameLast}, who debuted on ${d.debutDate}, hit ${d.HR} home runs in ${d.season}.`
}
)
// jitter to avoid over-plotting:
.derive({
jy: d => d.HR + Math.random() - 0.5
})
// keep Plot from formatting years with commas:
.derive({
seasonAsDate: aq.escape(d => new Date(d.season, 0, 1))
})This is the “glyph-ready” table that is used to make the desired plot. Let’s have a look at it:
Inputs.table(df)The code for the plot was:
Plot.plot({
nice: true,
x: {label: "Season"},
y: {label: "Home Runs"},
marks: [
Plot.dot(df, {
x: "seasonAsDate",
y: "jy",
fill: "black", fillOpacity: 0.3
}),
Plot.tip(df, nPlot.pointer({
x: "seasonAsDate",
y: "jy",
title: d => d.info
}))
]
})Note
A first pass at the work is in this Observable notebook:
This seems to be another instance where the work is much easier to do in an Observable notebook than in Quarto.