
Leveraging hidden co-occurrence relationships in classical music to develop an inexpensive recommender system

Nowadays it’s pretty clear that to be successful, you need to incorporate data as part of your core business strategy, but getting the best out of your data is not always as simple as it sounds. You read about AI, machine learning, deep learning, big data, neural networks, etc., but what does this mean for a young startup?


Source: https://bit.ly/2LfqED9

It’s important not to let the latest industry trends dictate what the new norm is, or even worse, let them shift your focus away from what really matters: building a product or service that users actually get value from and enjoy. When it comes to data, the end goal should never be purely tech. The question we should be asking ourselves instead is: What value can we draw from our data, and how can we effectively deliver it to our users?

This was precisely the situation we were in about a year ago. We knew we wanted to better leverage our data by recommending new content to our users, but the thought of setting up a new dedicated team with the necessary skill sets seemed daunting, to say the least. Every time we wanted to look into it, we would find excuses to postpone it: “we need to ship more features first”, “we probably don’t even have enough volume for content-based filtering”, etc.

Until one day a light bulb lit up in one of our developers’ heads: what if, instead of starting from scratch by implementing machine learning algorithms, training our models, attempting to identify patterns and relationships within our data, fine-tuning our recommendations, and so on, we piggybacked on the already existing relationships in classical music?

The underlying idea was the following: several times a week, we process hundreds of thousands of albums. For each album, someone has had to go through the effort of determining which classical music works should be included in it. This is true for any album, be it from a soloist, an orchestra, or even a compilation album. If items have been grouped together in an album, it means there has been human-powered curation behind it, and a musicological relationship inherently exists between those items.

So in a way, the co-occurrence matrix usually developed as part of a broader item-to-item recommendation system already exists, just in a slightly different format from the one commonly used in collaborative-filtering techniques. Instead of representing similarities between users’ preferences, we’re examining similarities between editorial tastes. Why not explore this further?
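To make the idea concrete, here is a minimal sketch of how such a co-occurrence count could be derived from album contents. The album IDs, work titles, and the `build_cooccurrence` helper are all hypothetical and only for illustration; this is not our production code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each album maps to the works it contains.
albums = {
    "album_1": ["Symphony No. 5", "Symphony No. 7"],
    "album_2": ["Symphony No. 5", "Symphony No. 7", "Egmont Overture"],
    "album_3": ["Egmont Overture", "Symphony No. 5"],
}


def build_cooccurrence(albums):
    """Count how many albums each unordered pair of items shares."""
    counts = Counter()
    for items in albums.values():
        # sorted(set(...)) drops duplicates and gives each pair a canonical order.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts


cooccurrence = build_cooccurrence(albums)
for pair, shared_albums in cooccurrence.most_common():
    print(pair, shared_albums)
```

Every pair that shares an album gets one "vote" from whoever curated that album, so the counts reflect editorial decisions rather than user behavior.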

We already had a data processing pipeline that parses and validates hundreds of thousands of albums and enriches them with additional classical music metadata. Because the process is highly modularized, we could easily add steps to it, so implementing a proof of concept for our idea was relatively simple and self-contained:


Simplified diagram of our data-processing pipeline.

We started out by analyzing artists. For each artist present in an album, we cross-referenced them with the artists from all of the other albums, paying close attention to their respective roles. This is an important distinction to make, as a classical music artist can appear in many different roles. Daniel Barenboim, for example, appears on some albums as a piano soloist and on others as a leading conductor. By further enhancing this model with the number of recordings of each artist in each particular role, we can set up our own weighting system to derive ranking and popularity.

The end result is an exhaustive list representing how often an artist appears with any other artist in our repertoire. And due to our role distinction, this concept of “appears with” can have different connotations, from artists performing together to conductors conducting similar works.
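A rough sketch of the role-aware version might look like the following. The artists, recording counts, and function names are hypothetical, and the ranking rule (shared albums first, recording count as a tie-breaker) is just one simple assumption about how a weighting could work, not the exact scheme we use.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy data: each album is a list of (artist, role) credits.
albums = [
    [("Artist A", "piano"), ("Artist B", "conductor")],
    [("Artist A", "conductor"), ("Artist C", "violin")],
    [("Artist A", "piano"), ("Artist B", "conductor"), ("Artist C", "violin")],
]

# Hypothetical number of recordings per (artist, role), used as a popularity signal.
recordings = {
    ("Artist A", "piano"): 120,
    ("Artist A", "conductor"): 300,
    ("Artist B", "conductor"): 80,
    ("Artist C", "violin"): 45,
}


def build_role_cooccurrence(albums):
    """Count shared albums for every ordered pair of (artist, role) credits."""
    cooc = defaultdict(lambda: defaultdict(int))
    for credits in albums:
        for a, b in permutations(set(credits), 2):
            if a[0] != b[0]:  # ignore one artist credited in two roles on the same album
                cooc[a][b] += 1
    return cooc


def appears_with(credit, cooc, recordings, top_n=5):
    """Rank co-credited artists by shared albums, then by recording count."""
    ranked = sorted(
        cooc.get(credit, {}).items(),
        key=lambda item: (item[1], recordings.get(item[0], 0)),
        reverse=True,
    )
    return [other for other, _ in ranked[:top_n]]


cooc = build_role_cooccurrence(albums)
pianist_matches = appears_with(("Artist A", "piano"), cooc, recordings)
conductor_matches = appears_with(("Artist A", "conductor"), cooc, recordings)
print(pianist_matches)
print(conductor_matches)
```

Keying everything on the (artist, role) pair rather than on the artist alone is what lets the same person yield different "appears with" lists as a pianist and as a conductor.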

Eventually, we were able to get from idea to a complete roll-out of this feature to our users in a very short time. The end results can be seen in our app in many different ways, all powered by this simple idea:


Fast-forward one year, we’re now in the final stages of fine-tuning our own hybrid recommendations engine which incorporates a mix of collaborative-filtering techniques, as well as a custom-made content-based approach leveraging the extensive labeling of classical music terms we’ve meticulously built up over the past years. And sure, eventually we did end up using libraries and tools such as TensorFlow, LightFM, AWS SageMaker, and many other cool technologies — but they were never the goal in the first place, and by not jumping straight into their implementations, we were able to deliver immediate value to our users, while giving ourselves time to actually learn what they wanted first.

So, do users actually enjoy these features? It turns out they do: about 24% of users who visit an artist’s page keep on exploring similar artists.

As mentioned in the beginning of the article, don’t let tech complexity keep you away from building cool stuff. There’s still value in simplicity :)