A change of direction

After attempting to use regression modelling to predict the popularity of a song using multiple attributes, it became clear that this was not going to happen. Perhaps the diversity of different song compisitions and the subjective nature of people's taste in music means that their are just too many independent (and interdependent) variables to model.

At this point my thought is to change to some classification machine learning models. After doing a bit more analysis of the data in Tableau it seems that binning the popularity scores is probabaly the best way to go. Roughly 30% of the songs I have data on have a popularity rating of 0.

Using Pandas I quickly added a new column to my database to group songs into having a popularity score of 0, or above 0.

Songs with popualarity >0: 44,846
Songs with popularity = 0: 29,232

I will use these to see if I can build a model which predicts whether a song will have 'some popularity' or 'no popularity'.

Random Forests

Github Python Notebook: random forests model

After running a few tests these are the final features for my model:

Full data set: n = 74,078
X =[['danceability','energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']]
y ='popularity' (as a binary measure)
30% test size

This model gives a much better score of .69.

When looking at the importance of features, there are 9 attributes which seem to have all fairly equaly weighting.