Musical Analysis of Rush Part 0: Download the Data
Introduction
Rush is one of the most prolific progressive rock bands in history. They started out as a hard rock band in the vein of Led Zeppelin, then moved into progressive rock, then added synthesizers in the 80s, and returned to hard rock in the 90s and 00s.
This makes them a great candidate for a musical analysis! Besides their complex musical development and versatility, other reasons to analyze Rush in particular are:
- They are among my favorite bands and were my first ever concert back in 2007, so I have domain expertise.
- All of their studio albums are on Spotify, so there are no gaps. Do you hear me, Robert Fripp?
- With the exception of their debut and two tracks on Fly By Night, they have one lyricist.
- Unlike some other candidates (cough, Miles Davis, cough), there are no albums by other artists that we need to consider, and there is no dispute about which albums to include. Rush’s 19 studio albums are canonically and unambiguously their 19 studio albums according to every fan on the planet.
- Not a lot of data cleanup. No deduping is necessary, and the text needs only a bare minimum of cleaning.
To Do:
In this RMarkdown document, we do the following:
- Download Rush’s 19 canonical studio albums using the wonderful spotifyr package developed by Charlie Thompson
- Do some basic cleanup
- Export the resulting files to be analyzed
Data Prep
Spotify Developer Setup
Before you do anything, sign up to be a Spotify developer. You will be assigned a client ID and a client secret, both of which you need to query the Spotify API. Store those credentials in your .Renviron file as follows:
SPOTIFY_CLIENT_ID="{your_client_id}"
SPOTIFY_CLIENT_SECRET="{your_client_secret}"
Additionally, make sure you install the development version of spotifyr via the following command:
devtools::install_github('charlie86/spotifyr')
The version on CRAN is out of date and lacks many functions we will need.
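Before querying the API, it can help to confirm that R actually sees the credentials. The following is a minimal base R sketch and not part of the original notebook; the helper name check_spotify_credentials is hypothetical. Variables in .Renviron are read when R starts, so restart your session after editing the file.

```r
## Hypothetical helper (not from spotifyr): report which of the two
## Spotify credentials are missing from the environment.
check_spotify_credentials <- function(vars = c("SPOTIFY_CLIENT_ID",
                                               "SPOTIFY_CLIENT_SECRET")) {
  values <- Sys.getenv(vars, unset = "")
  missing <- vars[values == ""]
  if (length(missing) > 0) {
    message("Missing environment variables: ",
            paste(missing, collapse = ", "))
  }
  ## Invisibly return TRUE only when every variable is set
  invisible(length(missing) == 0)
}

check_spotify_credentials()
```

If this prints a message, the credentials were not picked up, and any call to the Spotify API is likely to fail with an authentication error.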
Download and Cleanup
First we download the relevant data using spotifyr::get_discography(). It pulls all audio features for the artist’s discography from the Spotify API, along with lyrics from the Genius API using the geniusR package. If you’re wondering what audio features are, I will explain more in the first analysis I perform.
Because spotifyr::get_discography() takes a while to run, I wrapped it in a function that reads the original data from a cached file if one exists, and otherwise downloads the data and caches it. As this is a one-off prep function that I have no intention of using anywhere else, I have made no effort to generalize it or prevent it from being used outside of its intended purpose.
Other packages I am using here:
- readr - A package that reads and writes files better than base R’s equivalents. If it runs into parsing issues, it tells you which lines had the issues. Most importantly, it does not treat strings as factors!
- dplyr - For tidy data manipulation. If you’ve never used dplyr before, what are you waiting for?!
- lubridate - Functions for datetime manipulation
- here - For working directory detection. A much better alternative to getwd.
library(readr)
library(dplyr)
library(spotifyr) ## Development version
library(lubridate)
library(here)
source(here("lib", "vars.R")) ## Contains paths we need for exporting
rush_studio_album_names <- c("rush", "fly by night",
"caress of steel", "2112", "a farewell to kings", "hemispheres",
"permanent waves", "moving pictures", "signals",
"grace under pressure", "power windows", "hold your fire", "presto",
"roll the bones", "counterparts", "test for echo", "vapor trails",
"snakes & arrows", "clockwork angels")
get_rush_dat <- function(input_file = ORIGINAL_DAT) {
  ## Look for the cached file because this takes a really long time to query
  if (file.exists(input_file)) {
    cat("Reading in file...\n")
    dat <- readRDS(input_file)
  } else {
    cat("Getting album data from Spotify...\n")
    dat <- get_discography("Rush") %>%
      ungroup()
    ## Cache the download so later runs can skip the API call
    saveRDS(dat, input_file)
  }
  return(dat)
}
original_rush_dat <- get_rush_dat()
## Reading in file...
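The read-or-download pattern in get_rush_dat is not specific to Spotify. As a rough sketch of how it could be generalized (the original deliberately does not do this), the hypothetical helper with_rds_cache below takes a file path and a function that produces the data. A stand-in function replaces the slow API call so the example runs on its own.

```r
## Hypothetical helper: read `path` if it exists; otherwise run
## `compute()`, save the result to `path`, and return it.
with_rds_cache <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

## Stand-in for the slow download; counts how often it is called
n_calls <- 0
slow_query <- function() {
  n_calls <<- n_calls + 1
  data.frame(track = c("a", "b"))
}

cache_file <- tempfile(fileext = ".rds")
first  <- with_rds_cache(cache_file, slow_query)  # computes and caches
second <- with_rds_cache(cache_file, slow_query)  # reads the cached file
identical(first, second)
```

Only the first call runs slow_query; the second is served from the RDS file, which is why the cached and freshly computed results compare as identical.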
original_rush_dat %>%
glimpse
## Observations: 638
## Variables: 33
## $ artist_name <chr> "Rush", "Rush", "Rush", "Rush", "Rush", "…
## $ artist_uri <chr> "2Hkut4rAAyrQxRdof7FVJq", "2Hkut4rAAyrQxR…
## $ album_uri <chr> "3U6vR85uJOAT08DLnJhZhH", "3U6vR85uJOAT08…
## $ album_name <chr> "2112 - 40 Years Closer: A Q&A With Alex …
## $ album_img <chr> "https://i.scdn.co/image/d633919ce5ff5a9e…
## $ album_type <chr> "album", "album", "album", "album", "albu…
## $ is_collaboration <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ album_release_date <chr> "2016-12-16", "2016-12-16", "2016-12-16",…
## $ album_release_year <date> 2016-12-16, 2016-12-16, 2016-12-16, 2016…
## $ album_popularity <int> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 1…
## $ track_name <chr> "Terry Brown Intro - Commentary", "2112: …
## $ track_uri <chr> "23DkB3Eb9vRo2PrfbdkUJR", "6rIc5dkUTYTtgg…
## $ track_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ disc_number <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ danceability <dbl> 0.888, 0.640, 0.356, 0.718, 0.239, 0.748,…
## $ energy <dbl> 0.404, 0.333, 0.736, 0.441, 0.753, 0.324,…
## $ key <chr> "F", "A", "D", "D", "D", "B", "A", "C", "…
## $ loudness <dbl> -13.999, -16.458, -9.257, -16.990, -9.258…
## $ mode <chr> "major", "major", "major", "major", "majo…
## $ speechiness <dbl> 0.7940, 0.9580, 0.1080, 0.9540, 0.0886, 0…
## $ acousticness <dbl> 8.04e-01, 7.35e-01, 8.13e-02, 7.18e-01, 3…
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 1.14e-03, 0.00e+00, 9…
## $ liveness <dbl> 0.2370, 0.4230, 0.3130, 0.3860, 0.3760, 0…
## $ valence <dbl> 0.8290, 0.5410, 0.2010, 0.5990, 0.5550, 0…
## $ tempo <dbl> 107.479, 87.368, 131.490, 122.509, 200.38…
## $ duration_ms <dbl> 16480, 600547, 1237773, 95360, 215080, 57…
## $ time_signature <dbl> 1, 4, 4, 3, 4, 3, 4, 4, 4, 4, 4, 4, 4, 5,…
## $ key_mode <chr> "F major", "A major", "D major", "D major…
## $ track_popularity <int> 0, 6, 11, 5, 12, 5, 10, 5, 8, 5, 8, 5, 9,…
## $ track_preview_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ track_open_spotify_url <chr> "https://open.spotify.com/track/23DkB3Eb9…
## $ track_n <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ lyrics <list> [NULL, NULL, NULL, NULL, NULL, NULL, NUL…
Phew! It worked. There’s even a lyrics column that contains the lyrics to every song in Rush’s discography. I have no idea why it’s NULL here and populated below.
Now that we have the data read in, let’s do some cleanup. I have gone back and changed this file a few times to reflect new cleaning that needed to be done. I will not be explaining these decisions in this document. Here’s what needs to be done:
- Throw out irrelevant columns. This includes “liveness” (we’re only looking at studio albums), URLs we don’t need, and some columns where every value is the same.
- Remove bonus tracks (which are often live)
- The title track of 2112 has a bunch of subtitles, which makes plotting the name very difficult. I’ll remove those.
- “Remastered” makes track names too long, so remove it.
- Convert character string dates to Date objects
- Move estimated release dates and actual release dates to the same column
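To make the track-name steps in this list concrete, here is a small base R sketch on made-up track names. The strings are for illustration only and are not taken from the downloaded data, and fixed = TRUE is my choice here so the patterns are matched literally.

```r
## Made-up track names for illustration only
track_name <- c("2112: Overture / The Temples Of Syrinx - Remastered",
                "Tom Sawyer - Remastered",
                "Working Man - Live",
                "Limelight")

## Drop live bonus tracks
track_name <- track_name[!grepl("- Live", track_name, fixed = TRUE)]
## Strip the remaster suffix
track_name <- gsub(" - Remastered", "", track_name, fixed = TRUE)
## Collapse any title starting with "2112" to just "2112"
track_name <- ifelse(grepl("^2112", track_name), "2112", track_name)
track_name
```

The live track is dropped first, so three names remain: the 2112 title collapsed to "2112", and the other two with the suffix removed.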
rush_dat <- original_rush_dat %>%
  filter(tolower(album_name) %in% rush_studio_album_names) %>%
  filter(!grepl("- Live", track_name)) ## Remove bonus tracks
## Album and track info
rush_albums <- rush_dat %>%
  select(artist_name, artist_uri, starts_with("album"), track_name, track_uri, track_n, danceability:key_mode, lyrics) %>%
  select(-album_img, -album_type, -liveness) %>% ## remove redundant columns
  mutate(album_release_date = if_else(!is.na(ymd(album_release_date)), ymd(album_release_date), ymd(album_release_year))) %>%
  select(-album_release_year) %>%
  mutate(track_name = gsub(" - Remastered", "", track_name),
         track_name = if_else(grepl("^2112", track_name), "2112", track_name))
## Warning: 34 failed to parse.
## Warning: 34 failed to parse.
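The two warnings most likely come from the two ymd(album_release_date) calls. ymd() returns NA for strings it cannot parse, which is exactly what the if_else() fallback relies on. The base R sketch below mimics that logic on made-up values. That the unparseable entries are partial dates such as a bare year is an assumption; the output above does not confirm it.

```r
## Made-up values: the second release date is only a year (assumption)
release_date <- c("2012-06-08", "1976", "1981-02-12")
estimated    <- as.Date(c("2012-06-08", "1976-01-01", "1981-02-12"))

## Strings that do not match the full format become NA
parsed <- as.Date(release_date, format = "%Y-%m-%d")

## Fall back to the estimated date wherever parsing failed
combined <- parsed
combined[is.na(parsed)] <- estimated[is.na(parsed)]
combined
```

The result is a single Date column with no missing values, which is what the cleanup step is after.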
rush_albums %>%
glimpse
## Observations: 164
## Variables: 23
## $ artist_name <chr> "Rush", "Rush", "Rush", "Rush", "Rush", "Rush…
## $ artist_uri <chr> "2Hkut4rAAyrQxRdof7FVJq", "2Hkut4rAAyrQxRdof7…
## $ album_uri <chr> "744i0LypfMwHHrKhzsqAx0", "744i0LypfMwHHrKhzs…
## $ album_name <chr> "Clockwork Angels", "Clockwork Angels", "Cloc…
## $ album_release_date <date> 2012-06-08, 2012-06-08, 2012-06-08, 2012-06-…
## $ album_popularity <int> 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 4…
## $ track_name <chr> "Caravan", "BU2B", "Clockwork Angels", "The A…
## $ track_uri <chr> "43l8BalXmo4y50runkgJEh", "6CiPGcWJ3YykntxFja…
## $ track_n <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, …
## $ danceability <dbl> 0.510, 0.336, 0.422, 0.455, 0.417, 0.427, 0.5…
## $ energy <dbl> 0.917, 0.944, 0.952, 0.905, 0.920, 0.776, 0.9…
## $ key <chr> "A", "A", "G", "G", "E", "G", "E", "A", "E", …
## $ loudness <dbl> -5.469, -5.414, -6.232, -6.831, -7.049, -8.03…
## $ mode <chr> "minor", "minor", "major", "major", "minor", …
## $ speechiness <dbl> 0.0439, 0.0877, 0.1150, 0.0531, 0.1160, 0.045…
## $ acousticness <dbl> 5.82e-04, 5.92e-04, 6.12e-05, 2.62e-05, 3.52e…
## $ instrumentalness <dbl> 3.80e-03, 3.14e-02, 7.33e-03, 1.54e-01, 4.24e…
## $ valence <dbl> 0.522, 0.384, 0.150, 0.632, 0.262, 0.379, 0.5…
## $ tempo <dbl> 126.789, 151.564, 119.919, 139.944, 140.029, …
## $ duration_ms <dbl> 339800, 310387, 451440, 411533, 291693, 19366…
## $ time_signature <dbl> 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ key_mode <chr> "A minor", "A minor", "G major", "G major", "…
## $ lyrics <list> [<tbl_df[32 x 2]>, <tbl_df[53 x 2]>, <tbl_df…
Looks like everything worked well. album_release_date looks like a date and we don’t see any redundant columns. Let’s save it and move on to grabbing audio analysis info. We save as an RDS because even though feather is made for temporary caching, it can’t handle list columns.
rush_albums %>%
saveRDS(OUTPUT_FEATURES)
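As a quick illustration of why RDS suits this data, the base R sketch below round-trips a toy data frame with a list column. The lyrics values are placeholders, not real data.

```r
toy <- data.frame(track_name = c("Caravan", "BU2B"),
                  stringsAsFactors = FALSE)
## A list column: each row holds a vector of a different length
toy$lyrics <- list(c("line one", "line two"), "a single line")

rds_file <- tempfile(fileext = ".rds")
saveRDS(toy, rds_file)
restored <- readRDS(rds_file)

## The list column survives the round trip unchanged
identical(toy, restored)
```

RDS serializes the R object itself, so nested structures like the lyrics column come back exactly as they were saved.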
Audio Analysis Download
Spotify also has an Audio Analysis API, which is different from its Audio Features API. The Audio Analysis API contains in-depth features for individual segments of tracks. Once we start digging into the audio analysis features, I’ll explain more about what’s contained in them.
As with get_rush_dat, I’m wrapping the code to get this data in a function that caches the data if it doesn’t exist, and reads it in otherwise, because this code takes an inordinately long time to run. Here’s what the function does when you don’t have the audio analysis file cached:
For each track, do the following via purrr::map_df:
- Get the audio analysis from the Spotify API.
- Convert it to a tibble with each part of the audio analysis as a list column. Since there are seven parts that are always returned in a particular order, I just assign them labels in said order. This gives us a tidy tibble where every row corresponds to an audio analysis feature of a given track.
- Add the audio analysis info back to the original data frame, and throw out any information that isn’t track and album metadata. This is to conserve space, since if we want to combine audio analysis with track-level audio features, we can easily just join to the original data using the *_uri columns.
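The reshaping described in these steps can be sketched in base R on a fake response. Everything here is a stand-in: fake_analysis imitates the seven-part structure, analysis_to_rows is a hypothetical helper, and the URI is made up. Unlike the function in this notebook, which labels the parts by position, this sketch takes the labels from the list’s names, which assumes the response arrives as a named list.

```r
## Stand-in for one API response: a named list with seven parts
fake_analysis <- list(meta = list(status = "OK"), track = list(tempo = 126.8),
                      bars = list(), beats = list(), tatums = list(),
                      sections = list(), segments = list())

## Hypothetical helper: one row per part, with the part itself
## stored in a list column
analysis_to_rows <- function(track_uri, audio_analysis) {
  rows <- data.frame(track_uri = track_uri,
                     content_type = names(audio_analysis),
                     stringsAsFactors = FALSE)
  rows$audio_analysis <- unname(audio_analysis)
  rows
}

rows <- analysis_to_rows("uri-1", fake_analysis)

## Join back to (made-up) track metadata via the shared track_uri column
track_info <- data.frame(track_uri = "uri-1", track_name = "Caravan",
                         stringsAsFactors = FALSE)
merged <- merge(track_info, rows[, c("track_uri", "content_type")],
                by = "track_uri")
nrow(merged)
```

Each track contributes seven rows, one per part of the analysis, and the track_uri key is all that is needed to reattach any other track-level columns later.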
If it’s your first time running this function, it will probably take at least ten minutes. To help the process along (and because it’s just really fun), I run beepr::beep once the data is done being prepared. It’s a wonderful function that can play any of a variety of sound effects when it’s finished. I personally have it set to the Super Mario Bros. End of Level sound.
library(jsonlite)
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:jsonlite':
##
## flatten
library(beepr)
get_audio_analysis_df <- function(dat, output_file = OUTPUT_AUDIO_ANALYSIS) {
  if (file.exists(output_file)) {
    cat("Audio analysis file exists. Reading...\n")
    audio_analysis_df <- readRDS(output_file)
  } else {
    cat("Audio analysis file does not exist. Generating now\n")
    audio_analysis_df <- map_df(dat$track_uri, function(x) {
      audio_analysis <- get_track_audio_analysis(x)
      return(tibble(track_uri = x,
                    audio_analysis = audio_analysis,
                    content_type = c("meta", "track", "bars", "beats",
                                     "tatums", "sections", "segments")))
    })
    album_track_info <- dat %>%
      select(album_uri, album_name, track_name, track_uri, track_n)
    audio_analysis_col <- audio_analysis_df %>% select(audio_analysis)
    audio_analysis_df <- audio_analysis_df %>%
      select(-audio_analysis) %>% ## to avoid issues with distinct
      inner_join(album_track_info, by = "track_uri") %>%
      distinct() %>%
      bind_cols(audio_analysis_col)
    audio_analysis_df %>%
      saveRDS(output_file)
    beep(8)
  }
  return(audio_analysis_df)
}
rush_audio_analysis <- get_audio_analysis_df(rush_albums)
## Audio analysis file exists. Reading...
rush_audio_analysis %>%
glimpse()
## Observations: 1,148
## Variables: 7
## $ track_uri <chr> "43l8BalXmo4y50runkgJEh", "43l8BalXmo4y50runkgJEh…
## $ content_type <chr> "meta", "track", "bars", "beats", "tatums", "sect…
## $ album_uri <chr> "744i0LypfMwHHrKhzsqAx0", "744i0LypfMwHHrKhzsqAx0…
## $ album_name <chr> "Clockwork Angels", "Clockwork Angels", "Clockwor…
## $ track_name <chr> "Caravan", "Caravan", "Caravan", "Caravan", "Cara…
## $ track_n <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3…
## $ audio_analysis <list> [["4.0.0", "Linux", "OK", 0, 1444635788, 15.1851…
And there we have it! Everything has been cleaned up and exported successfully.
Next Steps
In the next notebook, we will make some plots of musical features. We’ll show Rush’s musical development over time using a variety of useful ggplot2 extensions.
Question
I know many of you are questioning why I called this Part 0. I did this for two reasons:
- I originally wrote this as a script and didn’t want to rename the other files I had already written with number prefixes.
- Since the file starting with 1 is where the analysis begins, I started this file with 0 because this notebook involves no analysis whatsoever.