July 7, 2020

lagged OpenTable reservations vs. change in daily case counts

Having broken my personal promise to not work with COVID data, I decided to revisit the subject after seeing a recent project from Nathan Yau. Yau was looking at OpenTable reservations, and made a clean & well-annotated plot with panels for each state. The OpenTable dataset tracks the difference in seated dining on a given day between 2019 and 2020. Plotting these differences over time gives a view into how quickly people are returning to more regular consumption patterns. Read more

July 1, 2020

a few thoughts on the gap between information and action

I’ve been thinking about this piece by Mimi Onuoha that was published today, particularly this passage: By nearly every statistical measurement possible, from housing to incarceration to wealth to land ownership, Black Americans are disproportionately disadvantaged. But the grand ritual of collecting and reporting this data has not improved the situation. American history is lined with innumerable instances of what scholar Saidiya Hartman bemoans as “the demand that this suffering be materialized and evidenced by the display of the tortured body or endless recitations of the ghastly and the terrible,” only for very little to change. Read more

June 26, 2020

please don't use a basic linear model to predict cumulative case counts in your state

So, this is my first post in a while. I changed jobs in January, and moved back across the country to my hometown of Boise, ID. I was hoping that my first post-move update would be more uplifting, but by mid-March, I didn’t want to write anything, for a variety of reasons. As a person whose job involves cleaning and analyzing data, the pandemic has been surreal: public health, statistical methods, and data visualizations are now daily topics for basically everyone I talk to. Read more

November 25, 2019

analyzing the october primary debate, using tidytext

(This is a write-up of a talk I gave to the Ann Arbor R User Group, earlier this month.) It seems like the longer one works with data, the closer the probability of being tasked with unstructured text gets to 1. No matter the setting (whether you’re working with survey responses, administrative data, whatever), one of the most common ways that humans record information is by writing things down. Something great about text-based data is that it’s often plentiful, and might have the advantage of being really descriptive of something you’re trying to study. Read more

October 23, 2019

hey, separation plots are kinda cool

Ever since I first started learning about regression analysis, I’ve found myself wishing I could do something for logistic regressions equivalent to inspecting residuals, like you can with OLS. Earlier this week, I was looking around for more ways to spot-check logistic regression models, and I came across a plotting technique that’s described in this paper (Greenhill, Ward, & Sacks, 2011). They’re called “separation plots”, and they’re used to help assess the adequacy of fit for a model that has a binary dependent variable. Read more

October 14, 2019

predicting my yearly top songs without listening/usage data (part 2)

This is a continuation from a previous post, which can be found here. Okay, picking up where we left off! In this post we’ll dive into building a set of models that can classify each of my playlist tracks as a “top-song” or not. While this is an exploration of some boutique data, it’s also a cursory look at many of the packages found in the tidymodels ecosystem. A few posts I found useful in terms of working with tidymodels can be found here, and here. Read more

September 17, 2019

predicting my yearly top songs without listening/usage data (part 1)

question: how do tracks end up on my yearly top-100 songs playlist? Last year I dug into the spotifyr package to see if the monthly playlists I curate varied by different track audio features available from the API. This time, I’m back with some more specific questions. Maybe they’ve always done this, but Spotify creates yearly playlists for each user, meant to reflect the user’s top-100 songs. I look forward to getting one each year, but I wish I knew more about how it worked. Read more

August 11, 2018

comparing audio features from my monthly playlists, using spotifyr

The NYT has a fun interactive up this week, looking at audio features to see if popular summer songs have the same sort of “signature”. After attending a presentation earlier this year, I discovered that these same sorts of features are accessible through Spotify’s API! How people curate their collections and approach listening to music usually tells you something about them, and since seeing the presentation I’ve been wanting to take a dive into my own listening habits. Read more

July 15, 2018

how should I get started with R?

Here’s some evergreen advice from David Robinson: “When you’ve written the same code 3 times, write a function. When you’ve given the same in-person advice 3 times, write a blog post.” (David Robinson, @drob, November 9, 2017) In a world overflowing with data science blogs, I’ve decided to write some notes about getting started in R. I recently crossed Robinson’s threshold, and want to write down my basic advice (so, hi! Read more

October 1, 2017

taking a second look at family net worth with the SCF

Matt Bruenig at the People’s Policy Project (PPP) published a post at the end of September, looking at 2016 data for family net worth as reported by the Survey of Consumer Finances (SCF). Using the 2007 and 2016 waves of the survey, Bruenig grouped family net worth into percentiles, and took the difference between the two waves at each percentile. Bruenig broke the results down by race/ethnicity, but generally speaking, aside from the wealthiest Americans, most families still haven’t recovered to their pre-recession level of household net worth. Read more
