Final 320 Project
Anyesha Majumdar
Sriya Srikanth
Since 1959, the National Academy of Recording Arts and Sciences has been awarding the Grammy Award for Record of the Year. As the Academy describes it, "The Record Of The Year GRAMMY goes to the artist(s), producer(s), and engineer(s) involved in crafting the specific recording."
Since the late 1950s, the evolution of music has been marked by overarching shifts in popular culture and the rise of new technology. "Nel Blu Dipinto di Blu," an Italian ballad by Domenico Modugno and the Record of the Year winner in 1959, stands in stark contrast to the electropop stylings of Billie Eilish's 2020 award winner, "Bad Guy."
The quality of music and the validity of such awards are topics usually up for debate.
Let's take a look at all the nominees for the Grammy Award for Record of the Year from 1959-2020 and break down what makes some of them winners.
In order to analyze the characteristics of Record of the Year nominees, we used Spotipy, a lightweight Python library for the Spotify Web API.
We created multiple Spotify playlists: one with the winners since 1959 and three with the Record of the Year nominees that did not ultimately win. We split the other nominees into three playlists because a single playlist request through the Spotify API returned at most 100 tracks. We will use the music in these playlists to ultimately predict the 2021 Record of the Year.
Here, we import the necessary libraries and create a function for getting the track IDs for every song on our playlists. The track IDs identify the songs in each playlist and later let us look up each song's audio characteristics.
!pip install spotipy
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import time
client_id = 'c7a5a47b69634aa2ae2f6896130ba420'
client_secret = 'e3086e5eeeed47109e56fa94b10c57b8'
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
def getTrackIDs(user, playlist_id):
    ids = []
    playlist = sp.user_playlist(user, playlist_id)
    for item in playlist['tracks']['items']:
        track = item['track']
        ids.append(track['id'])
    return ids
#Obtain Data from Grammy Winners playlist
grammy_winners = getTrackIDs('205ezqssuga60f3pzaur0s03w', '76alfO7c0UJss4VusPW9ka')
#Obtain Data from other Grammy nominees playlist(had to split into 3 playlists because it can get a max of 100 ids)
grammy_losers1 = getTrackIDs('205ezqssuga60f3pzaur0s03w', '544qU0dEujOATgZoqqUPd2')
grammy_losers2 = getTrackIDs('205ezqssuga60f3pzaur0s03w', '3uPPtJjqNXN6CVs5gerGdC')
grammy_losers3 = getTrackIDs('205ezqssuga60f3pzaur0s03w', '6AyowbDnvoDOBtXd4NrrsK')
#Combining 3 playlist tracks for Final Grammy Losers Playlist
grammy_losers = grammy_losers1 + grammy_losers2 + grammy_losers3
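The limit of 100 IDs per call arises because the playlist's tracks come back in pages. An alternative to splitting the nominees by hand would be to keep requesting pages until none are left. The sketch below illustrates that idea only and is not the approach used in this project: `collect_track_ids`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fake fetcher merely mimics the `items`/`next` shape of a playlist response so the example runs offline.

```python
def collect_track_ids(fetch_page, page_size=100):
    """Gather track IDs from a paged playlist response.

    fetch_page(offset, limit) is a hypothetical callable returning a dict
    shaped like one page of playlist items: {'items': [...], 'next': ...}.
    """
    ids = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        for item in page['items']:
            track = item.get('track')
            # Skip entries whose track or ID is missing
            if track and track.get('id'):
                ids.append(track['id'])
        # A falsy 'next' means this was the last page
        if not page.get('next'):
            break
        offset += page_size
    return ids


# Stand-in for the API: 250 made-up track IDs served 100 at a time
_FAKE_TRACKS = [{'track': {'id': f'track{n}'}} for n in range(250)]

def fake_fetch(offset, limit):
    chunk = _FAKE_TRACKS[offset:offset + limit]
    has_more = offset + limit < len(_FAKE_TRACKS)
    return {'items': chunk, 'next': 'more' if has_more else None}

all_ids = collect_track_ids(fake_fetch)
print(len(all_ids))
```

With Spotipy, `fetch_page` could be something like `lambda offset, limit: sp.playlist_items(playlist_id, limit=limit, offset=offset)`, assuming the installed version provides `playlist_items`.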
Here, we create a function that takes a track ID and returns a list of its metadata and audio features. The Spotify API offers an Audio Features object that stores musical characteristics, such as energy and tempo, in a quantitative format. We then apply this function to our lists of track IDs to create pandas DataFrames of the songs from the winners and losers playlists, with a column for each feature.
def getAudioFeatures(trackID):
    meta = sp.track(trackID)
    audio_features = sp.audio_features(trackID)
    # Metadata for tracks
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']
    # Audio Features provided by Spotify API
    acousticness = audio_features[0]['acousticness']
    danceability = audio_features[0]['danceability']
    energy = audio_features[0]['energy']
    instrumentalness = audio_features[0]['instrumentalness']
    liveness = audio_features[0]['liveness']
    loudness = audio_features[0]['loudness']
    speechiness = audio_features[0]['speechiness']
    tempo = audio_features[0]['tempo']
    time_signature = audio_features[0]['time_signature']
    trackID = audio_features[0]['id']
    track = [name, album, artist, release_date, length, popularity,
             acousticness, danceability, energy, instrumentalness, liveness,
             loudness, speechiness, tempo, time_signature, trackID]
    return track
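One fragile spot in a function like this is the direct indexing into the audio-features response: if the API has no features for a track, the entry may come back empty and the lookup would raise an error. The sketch below shows one way to flatten the two responses into a row while tolerating a missing features entry. It is an illustration under assumptions, not the project's code: `build_track_row` and `FEATURE_KEYS` are hypothetical names, the example dicts are made up, and the track ID is read from the metadata's `id` field rather than from the features.

```python
FEATURE_KEYS = ['acousticness', 'danceability', 'energy', 'instrumentalness',
                'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature']

def build_track_row(meta, features):
    """Flatten track metadata and audio features into one 16-item row.

    meta and features are plain dicts shaped like the responses used above.
    features may be None, in which case the feature columns are set to None.
    """
    row = [meta['name'],
           meta['album']['name'],
           meta['album']['artists'][0]['name'],
           meta['album']['release_date'],
           meta['duration_ms'],
           meta['popularity']]
    features = features or {}
    row += [features.get(key) for key in FEATURE_KEYS]
    row.append(meta['id'])
    return row

# Made-up values, for illustration only
example_meta = {
    'name': 'Example Song',
    'album': {'name': 'Example Album',
              'artists': [{'name': 'Example Artist'}],
              'release_date': '2000-01-01'},
    'duration_ms': 200000,
    'popularity': 50,
    'id': 'abc123',
}
example_features = {key: 0.5 for key in FEATURE_KEYS}

row_full = build_track_row(example_meta, example_features)
row_missing = build_track_row(example_meta, None)
print(len(row_full), len(row_missing))
```

Rows with `None` features can then be dropped or inspected after the dataframe is built, instead of stopping the whole loop.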
Using our winners playlist and the Spotify API, we can create a dataframe to house each winning track's audio features.
#Winners!
#Consolidate list of Grammy Winners from playlist
winners = []
for i in range(len(grammy_winners)):
    track = getAudioFeatures(grammy_winners[i]) #Get Audio Features for each track in playlist
    winners.append(track)
# Create dataframe for Record of the Year winners
roy_winners = pd.DataFrame(
winners,
columns=[ 'name', 'album', 'artist', 'release_date', 'length', 'popularity', 'acousticness', 'danceability', 'energy',
'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature', 'trackID'])
roy_winners['winner'] = 1 #Binary Variable identifies if a track is a winner or loser
roy_winners.head()
Using our "Losers" playlists and the Spotify API, we can create a dataframe to house each losing track's audio features.
#Other Nominees!
#Consolidate list of Grammy Losers from playlist
losers = []
for i in range(len(grammy_losers)):
    track = getAudioFeatures(grammy_losers[i]) #Get Audio Features for each track in playlist
    losers.append(track)
# Create dataframe for Record of the Year losers
roy_losers = pd.DataFrame(
losers,
columns=[
'name', 'album', 'artist', 'release_date', 'length', 'popularity', 'acousticness', 'danceability', 'energy',
'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo','time_signature', 'trackID'])
roy_losers['winner'] = 0 #Binary Variable identifies if a track is a winner or loser
roy_losers.head()
Now, we can create one consolidated dataframe with all of the nominees for the Grammy Record of the Year, including both winners and losers. Our binary variable "winner" is 1 if the record won and 0 if it lost.
grammy_nominees = pd.concat([roy_winners, roy_losers]) #DataFrame.append was removed in pandas 2.0
At this point, we realized that we had a problem. Some of the songs in our original playlists were not the original tracks (some were remastered versions), so their release dates did not match the proper Grammy year.
To account for this, we took the data from Wikipedia's list of Grammy Record of the Year nominees and downloaded it as a CSV. This CSV has the correct Grammy nomination year for each track.
After adding trackIDs (a unique identifier from the Spotify API) for all of the songs, we were able to combine the dataframe from the Wikipedia CSV with the earlier dataframe to get accurate years for all the tracks.
from google.colab import files
uploaded = files.upload()
import io
df2 = pd.read_csv(io.BytesIO(uploaded['Grammy Award for Record of the Year - Wikipedia.csv']))
df2.columns = ['name', 'year', 'artist', 'winner', 'trackID']
df2.head()
#Merge the Wikipedia csv and the existing dataframe
all_grammys = pd.merge(grammy_nominees, df2, on="trackID")
all_grammys = all_grammys.sort_values('year')
#FINAL DATAFRAME
all_grammys.head(10)
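Because both dataframes contain `name`, `artist`, and `winner` columns, merging on `trackID` keeps both copies and renames them with `_x` (left dataframe) and `_y` (right dataframe) suffixes. That is why later cells refer to columns such as `winner_x` and `winner_y`. A minimal sketch with made-up rows (the toy dataframes and their values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the Spotify dataframe (left) and the Wikipedia CSV (right)
spotify_df = pd.DataFrame({
    'name': ['Song A (Remastered)', 'Song B'],
    'winner': [1, 0],
    'trackID': ['id_a', 'id_b'],
})
wiki_df = pd.DataFrame({
    'name': ['Song A', 'Song B'],
    'year': [1960, 1961],
    'winner': [1, 0],
    'trackID': ['id_a', 'id_b'],
})

# Overlapping non-key columns get the default '_x' / '_y' suffixes
merged = pd.merge(spotify_df, wiki_df, on='trackID')
print(merged.columns.tolist())
```

Passing something like `suffixes=('_spotify', '_wiki')` to `pd.merge` would give more descriptive column names, at the cost of renaming the columns used in the rest of this notebook.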
After collecting and cleaning our data, we are ready to do some analysis!
import matplotlib as plt
from matplotlib import pyplot
import seaborn as sns
import sklearn
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
The makePlot function takes in a dataframe, a track feature (such as liveness or danceability, from the columns in our all_grammys dataframe), and a title for the plot. It creates subsets of the dataframe for the losers and the winners of Record of the Year, then plots a line of best fit for the feature over time for the losers, the winners, and all nominees.
There are 3 lines: one for winners, one for losers, and one for all nominees. Each loser point in the scatter plot is the yearly mean of the feature passed to makePlot, since there are several losers every year, while each winner point is that year's winning track; the colors separate winners and losers. The ft_name argument is the intended title of the plot.
The purpose of the makePlot function is to get a sense of trends in music over time and to compare winners with losers.
def makePlot(in_df, track_feature, ft_name):
    #taking the mean for each year of the track feature across all nominees
    df_all_nominees = in_df.groupby(['year'])[track_feature].mean().reset_index()
    #taking the mean for each year of the track feature for losers, since there were several every year
    df_losers = (in_df.loc[in_df['winner_y'] == 0]).groupby(['year'])[track_feature].mean().reset_index()
    x = df_losers['year']
    y = df_losers[track_feature]
    m, b = np.polyfit(x, y, 1)
    df_winners = in_df.loc[in_df['winner_y'] == 1]
    x2 = df_winners['year']
    y2 = df_winners[track_feature]
    m2, b2 = np.polyfit(x2, y2, 1)
    x3 = df_all_nominees['year']
    y3 = df_all_nominees[track_feature]
    m3, b3 = np.polyfit(x3, y3, 1)
    fig = plt.pyplot.figure(figsize = (10,10))
    ax1 = fig.add_subplot(111)
    print("Losers: Slope =" , m," Intercept = ", b)
    print("Winners: Slope = " , m2," Intercept = ", b2)
    ax1.scatter(x = df_losers['year'], y = df_losers[track_feature], color = 'r', label = 'Losers scatter')
    ax1.scatter(x = df_winners['year'], y = df_winners[track_feature], color = 'c', label = 'Winners scatter')
    plt.pyplot.plot(x, m*x + b, label = "Grammy Losers", color = 'r')
    plt.pyplot.plot(x2, m2*x2 + b2, label = "Grammy Winners", color = 'c')
    plt.pyplot.plot(x3, m3*x3 + b3, label = "All Grammy nominees", color = 'k')
    plt.pyplot.title(ft_name)
    plt.pyplot.xlabel('Years')
    plt.pyplot.ylabel(track_feature)
    plt.pyplot.legend()
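The trend lines come from `np.polyfit(x, y, 1)`, which fits a least-squares straight line and returns its slope and intercept. The sketch below walks through the same two steps (yearly mean, then a degree-1 fit) on a tiny made-up dataset; the `toy` dataframe and its values are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

# Made-up nominees: two per year, with yearly means of 0.50, 0.51, and 0.52
toy = pd.DataFrame({
    'year':   [2000, 2000, 2001, 2001, 2002, 2002],
    'energy': [0.40, 0.60, 0.45, 0.57, 0.50, 0.54],
})

# One mean per year, as makePlot does for the non-winning nominees
yearly = toy.groupby('year')['energy'].mean().reset_index()

# Degree-1 fit: coefficients come back highest power first (slope, intercept)
slope, intercept = np.polyfit(yearly['year'], yearly['energy'], 1)
print(slope, intercept)
```

The slope is the change in the feature per year, which is the number to compare between winners and losers. The intercept is the fitted value extrapolated back to year 0, so it has little meaning on its own.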
To avoid redundancy and save space, we chose to plot the trends of 6 audio features to illustrate how they have evolved over the years, in both winners and losers.
We have plotted:
Energy
Danceability
Speechiness
Length
Acousticness
Loudness
The first feature we plotted was energy. Spotify calculates energy as a measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
From the plot below, we can see that over time nominees for Record of the Year show an increasing trend in energy. However, winners seem to have lower energy scores overall, since the line of best fit for winners is lower than that of the losers.
makePlot(all_grammys, 'energy', 'Energy Over Time')
Next, we plotted the feature 'danceability.' Spotify defines danceability as how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
From the plot below, we can again see an increasing trend for danceability over time. It also shows that winners of Record of the Year generally scored lower on danceability than the losers.
makePlot(all_grammys, 'danceability', 'Danceability Over Time')
Next, we plotted 'speechiness.' Spotify detects the presence of spoken words in a track to calculate the speechiness value. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
As the plot below shows, there is a general increasing trend of speechiness over time, though the scatterplot and lines of best fit make it clear that winners have lower speechiness values than losers.
makePlot(all_grammys, 'speechiness', 'Speechiness Over Time')
Next, we plotted 'length,' which is just the duration of the song in milliseconds.
In the plot below, we can see that there is a small upward trend in length overall across all nominees for Record of the Year. However, the lines of best fit show that winners tended to be longer than losers overall.
makePlot(all_grammys, 'length', 'Length Over Time')
Next, we plotted 'loudness.' Loudness is the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track.
From the plot below, we can see that there is a clear upward trend in the loudness of music nominated for Record of the Year. However, it also seems that, in general, winners are slightly less loud than losers.
makePlot(all_grammys, 'loudness', 'Loudness Over Time')
Finally, we measured 'acousticness.' Spotify calculates acousticness as a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
As we can see, music nominated for Record of the Year has a stark downward trend in acousticness, which makes sense as electronic music has gained ground. However, winners seem to be more acoustic than losers.
makePlot(all_grammys, 'acousticness', 'Acousticness Over Time')
import seaborn as sns
This function creates a violin plot to illustrate the distribution of a given audio feature amongst winners and losers.
def make_violin(df, audio_feature, y_label):
    fig = plt.pyplot.figure(figsize = (7,7))
    ax = fig.add_subplot(111)
    ax = sns.violinplot(x="winner_x", y=audio_feature, palette = "hls", inner = "box" , scale = "width", data=df)
    ax.set_xlabel("Other Nominees = 0, Winner = 1",size = 14,alpha=0.7)
    ax.set_ylabel(y_label,size = 14,alpha=0.7)
    plt.pyplot.title("Distribution of " + y_label + " amongst Winners and Losers" )
To avoid redundancy and save space, we chose to plot 6 audio features to illustrate their distributions amongst both winners and losers.
We have plotted:
Energy
Danceability
Tempo
Length
Acousticness
Loudness
#Length
make_violin(all_grammys, "length", "Length" )
The violin plot above shows the distribution of length in winners and losers. Though the distributions are similar, we can see that the peak of the distribution for winners sits higher than that for losers, indicating that winners tended to be longer than losers.
#Acousticness
make_violin(all_grammys, "acousticness", "Acousticness" )
The above plot shows the distribution of acousticness in losers and winners. The losers have a unimodal distribution while the winners have a bimodal distribution. We can see that winners had either high or low acousticness scores, with fewer in the middle. Overall, winners tended to have a higher acousticness score than losers.
#Danceability
make_violin(all_grammys, "danceability", "Danceability" )
The above plot shows the distribution of danceability among winners and losers of the Grammy Record of the Year. Though the distributions are very similar, we can see that winners tend to have a slightly lower danceability score than losers.
#Energy
make_violin(all_grammys, "energy", "Energy" )
The above plot shows the distributions of energy for winners and losers of the Grammy Record of the Year. We can see that winners of Record of the Year tended to be less energetic than the losers.
#Loudness
make_violin(all_grammys, "loudness", "Loudness" )
The above plot shows the distribution of loudness among winners and losers of the Grammy Record of the Year. We can see that the winners of Record of the Year tended to be less loud than the losers.
#Tempo
make_violin(all_grammys, "tempo", "Tempo" )
The above plot shows the distribution of tempo among winners and losers of the Grammy Record of the Year. We can see that the distributions are very similar, though winners may have been slightly slower than losers. However, because it is difficult to draw conclusions from this plot, tempo would probably not be a useful factor in our models.
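Violin shapes can be hard to compare by eye, so a per-group numeric summary can back up readings like the ones above. The sketch below uses a small made-up dataframe (the `toy` values are assumptions); in this notebook the same `groupby` could be applied to `all_grammys`, which carries the `winner_x` column.

```python
import pandas as pd

# Made-up tempos: four other nominees (0) and two winners (1)
toy = pd.DataFrame({
    'winner_x': [0, 0, 0, 0, 1, 1],
    'tempo':    [120.0, 100.0, 130.0, 110.0, 96.0, 104.0],
})

# Count, median, and mean of the feature within each group
summary = toy.groupby('winner_x')['tempo'].agg(['count', 'median', 'mean'])
print(summary)
```

The `count` column is worth keeping in view: with far fewer winners than other nominees, small differences in median or mean may not be meaningful.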
We now import data from our 2021 Record of the Year nominees playlist. Using this playlist and the Spotify API, we can create a dataframe to house each track's audio features.
#Get playlist of nominees for 2021
nominees_2021 = getTrackIDs('205ezqssuga60f3pzaur0s03w', '468byHblUQ5ps5rOVPypms')
#Get audio features for nominees for 2021 and create data frame
nominees_21 = []
for i in range(len(nominees_2021)):
    track = getAudioFeatures(nominees_2021[i])
    nominees_21.append(track)
# Create dataframe for 2021 Record of the Year Nominees
roy_2021_nominees = pd.DataFrame(nominees_21, columns= ['name', 'album', 'artist', 'release_date', 'length', 'popularity',
'acousticness', 'danceability', 'energy',
'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo',
'time_signature', 'trackID'])
#Clean data, extract necessary columns and set nomination year to 2021
roy_2021_nominees['year'] = 2021
roy_2021_nominees = roy_2021_nominees.drop(['release_date', 'trackID'], axis = 1)
roy_2021_nominees = roy_2021_nominees[['name', 'year', 'artist','album', 'length', 'popularity' , 'acousticness' , 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness','speechiness', 'tempo', 'time_signature']]
roy_2021_nominees.head()
To predict our 2021 Record of the Year winner, we will use a combination of a Decision Tree Classifier, a K-Nearest Neighbors Classifier, and Logistic Regression, with Principal Component Analysis (PCA) applied first to reduce the number of features and speed up the algorithms.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
#Clean up the Test Data
test_data = all_grammys.drop(['trackID','release_date','name_y','artist_y','winner_y'], axis = 1)
test_data = test_data[['name_x', 'year', 'artist_x','album', 'length', 'popularity', 'acousticness' , 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness','speechiness', 'tempo', 'time_signature', 'winner_x' ]]
test_data.columns = ['name', 'year', 'artist','album', 'length', 'popularity', 'acousticness' , 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness','speechiness', 'tempo', 'time_signature', 'winner' ]
#Get data and split into training and testing sets
X = test_data.copy().iloc[:, 4: 15]
y = test_data['winner']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #Split into training and testing sets
# Pre-Process the data - Scale the data so each feature has unit variance
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)
# Fit the PCA
pca = PCA(n_components=0.3)
pca.fit(X_train)
PCA_X_train = pca.transform(X_train)
PCA_X_test = pca.transform(X_test)
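When `n_components` is a float between 0 and 1, scikit-learn's `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction, so a value of `0.3` may retain only a few components. The sketch below shows how to check what was actually kept. It uses synthetic data rather than the Grammy features, and `X_demo` and `pca_demo` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, 11 correlated columns (same width as our feature matrix)
base = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=(200, 11))
X_demo = base @ rng.normal(size=(3, 11)) + noise

# Scale first, then keep enough components to explain more than 30% of the variance
X_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=0.3).fit(X_scaled)

print("components kept:", pca_demo.n_components_)
print("variance explained:", pca_demo.explained_variance_ratio_.sum())
```

Inspecting `pca.n_components_` and `pca.explained_variance_ratio_` on the fitted object in this notebook would show how much of the original feature information reaches the classifiers.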
#Decision Tree
dt = DecisionTreeClassifier(max_depth=17, random_state=1)
dt.fit(PCA_X_train, y_train.values.ravel())
dt_predict = dt.predict(np.c_[PCA_X_test]) #Predict winners
#K-Nearest Neighbor
k_clas = KNeighborsClassifier(n_neighbors=5)
k_clas.fit(PCA_X_train, y_train.values.ravel())
k_clas_predict = k_clas.predict(np.c_[PCA_X_test]) #Predict winners
#Logistic Regression
log = LogisticRegression()
log.fit(PCA_X_train, y_train.values.ravel())
log_predict = log.predict(np.c_[PCA_X_test]) #Predict winners
print("Decision Tree Prediction:", dt_predict)
print("K-Nearest Neighbors Prediction: ", k_clas_predict)
print("Logistic Regression Prediction: ", log_predict)
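Raw arrays of predictions make it hard to judge how well each model did on the held-out set. Because winners are a small minority of nominees, accuracy alone can also mislead: a model that never predicts a winner still scores well. The sketch below illustrates this with made-up labels (the `y_true` and `y_all_losers` values are assumptions) using scikit-learn's `accuracy_score` and `recall_score`.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up held-out labels: 2 winners among 10 tracks
y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

# A model that never predicts a winner
y_all_losers = [0] * len(y_true)

# Share of all tracks labelled correctly
acc = accuracy_score(y_true, y_all_losers)

# Share of actual winners that the model identified
winner_recall = recall_score(y_true, y_all_losers, pos_label=1)

print(acc, winner_recall)
```

In this notebook, the same two calls could be applied to `y_test` paired with each of `dt_predict`, `k_clas_predict`, and `log_predict` to compare the three models on equal terms.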
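Now, to predict the 2021 winner, we refit our Decision Tree, K-Nearest Neighbors, and Logistic Regression models on the complete historical dataset (rather than only the training split) and apply them to the 2021 nominees. Note that these refits use the raw feature columns, without the scaling and PCA steps applied above.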
noms_2021 = roy_2021_nominees.copy().iloc[:,4:15] #Get Features for 2021 Nominees
dt.fit(X, y.values.ravel()) #Fit historical Grammy data to our Decision Tree Model
dt_2021_pred = dt.predict(np.c_[noms_2021]) #Predict 2021 winners
k_clas.fit(X, y.values.ravel()) #Fit historical Grammy data to our K-Nearest Neighbors Model
knn_2021_pred = k_clas.predict(np.c_[noms_2021]) #Predict 2021 winners
log.fit(X, y.values.ravel()) #Fit historical Grammy data to our Logistic Regression Model
log_2021_pred = log.predict(np.c_[noms_2021]) #Predict 2021 winners
print("Decision Tree 2021 Prediction: ", dt_2021_pred )
print("K-Nearest Neighbors 2021 Prediction: " , knn_2021_pred)
print("Logistic Regression 2021 Prediction: " , log_2021_pred)
print('\n')
print(roy_2021_nominees['name']) #Our winner is Colors by Black Pumas!
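Two things are worth noting about the cell above. The models are refit on the raw feature columns, so the scaling and PCA from the training step are not applied to the 2021 nominees, and independent 0/1 predictions can flag several winners or none at all. One possible alternative, sketched below with synthetic data and not the method used in this project, wraps scaling, PCA, and the classifier in a single scikit-learn pipeline and picks the nominee with the highest predicted win probability. The names `X_hist`, `y_hist`, `X_new`, and `model` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic history: 300 tracks, 11 features, a minority labelled as winners
X_hist = rng.normal(size=(300, 11))
y_hist = (X_hist[:, 0] + rng.normal(scale=0.5, size=300) > 0.9).astype(int)

# Synthetic "nominees" to score: 8 tracks with the same 11 features
X_new = rng.normal(size=(8, 11))

# The pipeline applies the same scaling and PCA at fit time and at predict time
model = make_pipeline(StandardScaler(), PCA(n_components=0.3), LogisticRegression())
model.fit(X_hist, y_hist)

# Column 1 of predict_proba is the probability of class 1 (winner)
win_prob = model.predict_proba(X_new)[:, 1]

# Exactly one nominee is picked: the one with the highest win probability
pick = int(np.argmax(win_prob))
print("index of most likely winner:", pick)
```

Ranking by probability guarantees a single pick per year, which matches how the award works, though it says nothing about how confident that pick is.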
And the Grammy Award for Record of the Year goes to... "Colors" by Black Pumas! In our analysis, two of the three models predicted that this track will be the 2021 winner. Unfortunately, our Logistic Regression model predicted that none of the tracks would win; however, the Decision Tree and K-Nearest Neighbors models have provided some insight into next year's winner.
Judging music is quite a subjective task, and there are surely other factors that we didn't account for that contribute to who wins Record of the Year. Other than the quantitative factors we extracted from the Spotify API, factors such as artist popularity, record sales, or general human bias may also contribute to the winner. These hard-to-quantify factors make it difficult to create a perfect prediction model.
Over the years, music, though generally considered a universal language, has evolved due to cultural and technological shifts. Though all are Record of the Year winners, notable tracks such as "Moon River" by Henry Mancini, "Beat It" by Michael Jackson, and "24K Magic" by Bruno Mars illustrate the vast musical differences over the past 61 years.
While it's difficult to judge the quality of music, utilizing the Spotify API's audio features illustrated some evident musical trends over the years.
Energy - Over time, nominees have illustrated an increasing trend in energy. However, winners seemed to have lower energy in comparison to losers.
Danceability - Over time, nominees have illustrated an increasing trend in "danceability." This trend may be attributed to a rise in genres such as Hip Hop and EDM. Winners of Record of the Year have generally been less danceable than the losers.
Speechiness - Over time, nominees have illustrated an increasing trend in "speechiness." This trend may be attributed to the rise of Rap music. It is fairly evident that winners tend to be less "speechy" than losers.
Length/Duration - Over time, nominees have illustrated a small upward trend in length. However, the lines of best fit show that winners tended to be longer than losers overall.
Loudness - Over time, nominees have illustrated an upward trend in loudness of music. However, it also seems that in general, winners are slightly less loud than losers.
Acousticness - Over time, nominees have generally illustrated a stark downward trend in "acousticness." However, winners seem to be more acoustic than losers.
Evaluating these musical trends and utilizing a combination of Decision Tree, K-Nearest Neighbors, and Logistic Regression Models, the 2021 Grammy Award for Record of the Year goes to "Colors" by Black Pumas. "Colors" is a soulful, slow-paced R&B hit that fits well within the trends we analyzed. Given that we only had 317 observations and we don't account for features such as artist popularity, cultural shifts, or human bias, our prediction should be taken with a grain of salt. However, our analysis serves to break down the Record of the Year nominees purely from a quantitative, musical standpoint.