// Trying to set up reactives correctly: browser needs the specified version of plot
// before it attempts to run
// the plotting functions!
goalPlot = {
const nPlot = await import("https://esm.run/@observablehq/plot@0.6.11");
let p = nPlot.plot({
nice: true,
x: {label: "Season"},
y: {label: "Home Runs"},
marks: [
nPlot.dot(df, {
x: "seasonAsDate",
y: "jy",
fill: "black", fillOpacity: 0.3
}),
nPlot.tip(df, nPlot.pointer({
x: "seasonAsDate",
y: "jy",
title: d => d.info
}))
]
})
return p;
}
Data Wrangling in Observable With Arquero
Home Run Leaders
Introduction
As an exercise in data wrangling in Observable, we’ll use a few datasets (exported from the R package Lahman) to produce an interactive graph of home run leaders in Major League baseball. We’ll use the Arquero library in order to wrangle our data in a way that should be familiar to users of R’s tidyverse ecosystem.
Here is the graph we want to make:
Setup
We import Arquero and a recent version of Plots, and get the necessary datasets.
import { aq, op } from '@uwdata/arquero'
= import("https://esm.run/@observablehq/plot@0.6.11") Plot
We now load in the csv files, and get views of them:
= {
batting let b = await aq.loadCSV("Batting.csv");
return b.select("playerID", "yearID", "teamID", "HR");
}
.view() batting
= {
players let p = await aq.loadCSV("People.csv");
return p.select("playerID", "nameFirst", "nameLast", "debut");
}
.view() players
= {
teams let t = await aq.loadCSV("Teams.csv");
return t.select("teamID", "yearID", "name");
}
.view() teams
Wrangling
Here is where Arquero feels a lot like R’s dplyr:
= batting
hrLeaders // only for 1900 or later season:
.filter(d => d.yearID >= 1900)
// join with Players to get name and debut year:
.join(players, "playerID")
// find HR leaders in each season:
.groupby("yearID")
// arrange within each season-group:
.orderby(aq.desc("HR"))
// take only the top five:
.slice(0, 5)
.ungroup()
// for table display:
.orderby("yearID", aq.desc("HR"))
Let’s have a look at the resulting table:
.table(hrLeaders) Inputs
We need to join with the Teams data, and we add a column containing material for the tool tips. We must also create “jittered” versions of the season and home-run variables, in order to avoid over-plotting.
= hrLeaders.
df // Default for join is to use all columns that appear in both tables
// https://uwdata.github.io/arquero/api/verbs#join .
// (We are joining on teamID and yearID):
join(teams)
// rename for more readable graph:
.rename({yearID: "season"})
// create a better debut variable:
.derive(
// As is, aq.loadCSV brings in debut
// as a string.
// We should consider making it a Date
// and then:
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/toDateString
// But we must also escape, as per
// https://uwdata.github.io/arquero/api/#escape .
// Great background discussion here:
// https://uwdata.github.io/arquero/api/expressions
debutDate: aq.escape(function (d) {
{const date = new Date(d.debut);
return date.toDateString();
})}
)// now assemble the tooltip info:
.derive(
info: d => `Playing for the ${d.name}, ${d.nameFirst} ${d.nameLast}, who debuted on ${d.debutDate}, hit ${d.HR} home runs in ${d.season}.`
{
}
)// jitter to avoid over-plotting:
.derive({
jy: d => d.HR + Math.random() - 0.5
})// keep Plot from formatting years with commas:
.derive({
seasonAsDate: aq.escape(d => new Date(d.season, 0, 1))
})
This is the “glyph-ready” table that is used to make the desired plot. Let’s have a look at it:
.table(df) Inputs
The code for the plot was:
.plot({
Plotnice: true,
x: {label: "Season"},
y: {label: "Home Runs"},
marks: [
.dot(df, {
Plotx: "seasonAsDate",
y: "jy",
fill: "black", fillOpacity: 0.3
,
}).tip(df, nPlot.pointer({
Plotx: "seasonAsDate",
y: "jy",
title: d => d.info
}))
] })
Note
A first pass at the work is in this Observable notebook:
This seems to be another instance where the work is much easier to do in an Observable notebook than in Quarto.