Overview
Context
One of the critical issues in recommendation systems is the "filter bubble", a situation in which the same items keep lingering at the top of a recommended list because the system keeps serving the songs users listen to the most. A well-known approach to this issue is to increase the diversity of recommended items.
Challenges
However, a previous finding highlighted the importance of familiarity in recommendations, which ran against the data scientists' common assumption. They were skeptical about turning that insight into action.
To convince them, I reframed the question: "Would users listen to recommended songs more IF the diversity of the list were increased?" I chose a quantitative approach because the question was essentially "how many".
A/B test: I conducted an A/B test to investigate the impact of diversity. First, I set up two hypotheses and defined the concept of "diversity" for the experimental design. Based on these, I ran the A/B test and then a statistical analysis.
Log data analysis: Then I analyzed log data to define user groups based on distinctions in behavioral tendencies, in particular whether a user was "more" or "less" explorative. Details of the research process are given below.
Results 
Increased diversity didn't yield significant differences between the test and control groups overall. In an additional analysis of "more" and "less" explorative user groups, however, the more explorative users tended to prefer highly diverse playlists.
Impact 
Convinced by the result that increasing diversity may not be the solution in all cases, the data scientists incorporated the insights I suggested and added a few familiar songs at the top of the recommended playlists. This adjustment increased streaming counts, a key performance indicator, by 6%.
Research process
1. Setting up hypotheses
The first hypothesis was testing the current belief:
1. If the diversity of music recommendations is increased, would users listen to recommended songs more often? 
I insisted on adding a second hypothesis: to find out how different user groups respond to increased diversity.
2. If the diversity of music recommendations is increased, would users respond differently depending on their behavioral tendencies?
2. Experimental design 
First of all, I operationally defined the concept of "diversity" in music recommendation. 
There were six recommended playlists on the page, ordered by a "similarity" score calculated by ML engineers. For example, the first playlist contained the songs most similar (or even identical) to the ones users had already listened to, and this similarity decreased in each successive playlist, with the sixth being the least similar.
Considering this technical setup, I defined "higher" diversity as "lower" similarity between songs, still within the range of users' preferences as defined by their streaming records.
I decided to utilize the existing technical setup because it would save engineering resources as well as lower the risk of mere randomization: every song in the playlists was in some way related to songs the user had listened to at least once.
When I examined the UI structure of the recommended playlists, the first playlist sat exactly where the left thumb naturally rests when holding a phone.
This structure made it harder to separate the effect of the content from the effect of the UI design, creating a potential confound: it was unclear whether users tapped the first playlist because it happened to sit in the "prime spot", or because they actually liked its contents.
To adjust only the level of diversity while keeping the UI structure the same, I suggested to the data engineers that we swap only the songs in the first playlist with those from the other playlists, keeping the same cover design and location. They were fully on board because it was a trick they hadn't tried before.

Hence, the design of A/B test was determined to be the following:
Test condition - Songs in the first playlist are randomly replaced with those from one of the other playlists (the second through the sixth).
Control condition - Songs in all playlists remain the same.
3. A/B test  
Data scientists applied the experimental design to the in-house A/B testing platform. The independent variable was the diversity of the playlists. The dependent variables were KPIs such as page views, CTR, and streaming counts of recommended songs. 
For the test condition, the source playlist was rotated on a daily basis, starting with the first and moving up each day until reaching the sixth. For the control, the same algorithm was used as before. The A/B test lasted 7 days.
4. Understanding users' behavior by log data analysis
While the A/B test was running, I became curious about users' behavioral tendencies. I started from the assumption that users who prefer diversity would be more "explorative" than those who don't.
As recommendations are based on songs that users have listened to, I operationally defined a user's explorativeness with a ratio based on their song choices:
the number of new songs / (the number of new songs + the number of already listened songs)
The closer the calculated ratio is to 1, the more a user's choices consist of new songs, and thus the more explorative the user was considered. After checking the descriptive statistics, the data scientists and I decided to use the average ratio to separate "more" explorative users from "less" explorative ones.
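The ratio and the mean-based split can be sketched in pandas as below. The column names and the per-user counts are hypothetical; only the formula and the split at the average come from the text.

```python
import pandas as pd


def explorative_ratio(new_songs: int, known_songs: int) -> float:
    """new / (new + already listened); closer to 1 = more explorative."""
    total = new_songs + known_songs
    return new_songs / total if total else 0.0


# Hypothetical per-user counts aggregated from streaming logs
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "new_songs": [8, 1, 5, 0],
    "known_songs": [2, 9, 5, 10],
})
users["ratio"] = users.apply(
    lambda r: explorative_ratio(r["new_songs"], r["known_songs"]), axis=1
)

# Split at the average ratio, as described above
users["group"] = (users["ratio"] > users["ratio"].mean()).map(
    {True: "more_explorative", False: "less_explorative"}
)
```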
Besides, it is critical to check whether the sample distribution can represent the population distribution. In this case, the sample consisted of the two user groups, and the population was the entire user base of the product.
With the data scientists' support, I extracted 7 days of accumulated log data using SQL, then applied systematic sampling and checked the two user groups' distributions and descriptive statistics using Python and pandas. I confirmed there was no significant difference.
5. Statistical analysis 
First, I compared the test condition to the control as a whole. Then I compared the "more" and "less" explorative user groups within each condition, along with the KPI data for each group. Since the visualized patterns did not suggest interaction effects, I used t-tests to compare user groups instead of ANOVA.
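A group comparison like this is commonly run as a two-sample t-test. The sketch below uses synthetic CTR values; the means, spreads, and sample sizes are invented for illustration, and Welch's variant is my choice rather than something stated in the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-user CTRs for one user group in each condition
ctr_test = rng.normal(loc=0.12, scale=0.03, size=200)
ctr_control = rng.normal(loc=0.10, scale=0.03, size=200)

# Welch's t-test: does not assume equal variances between groups
t_stat, p_value = stats.ttest_ind(ctr_test, ctr_control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

One such test per KPI and group pairing, with a visual check for interaction patterns beforehand, mirrors the analysis path described above.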
Results
1. Do users prefer more diverse playlists?
1. If the diversity in music recommendation is increased, would users listen to recommended songs more often?
Result: No significant difference was found across all KPIs.
2. If the diversity in music recommendation is increased, would users respond differently depending on their behavioral tendency?
Result: The more explorative groups showed significantly higher CTR and page views. Streaming counts increased in both conditions, but not significantly.
Also, the more explorative group in the test condition showed higher CTR and page views than the same group in the control condition.

2. Any other insights?
Simply recommending playlists with higher diversity was not effective for most users. More explorative users, however, checked the recommendation page and clicked songs more often when the first playlist became more diverse. Yet their streaming did not diversify accordingly; they retained the same listening preferences.
Hence, diversity in recommendations seems to attract explorative users. This result underscores the importance of understanding users' behavioral patterns through log data.
Combining the results of the current and previous projects, I gained insight into how familiarity with recommendations affects the perceived reliability of a product feature. In the previous project, users responded positively to familiar songs while building their first impressions of the recommendation page. That may imply:
At the beginning of use, noticing familiar songs may increase the perceived reliability of the recommendation system ("Apparently, they seem to know the songs I would love.")
After using the feature for a while, noticing familiar songs may decrease the perceived reliability of the recommendation system ("They don't change the songs. I am already fed up with these songs.")

3. What was the impact? 
I presented the results and insights, and the data scientists were convinced that increasing diversity may not be the solution in all cases. They decided to interleave a few familiar songs at the top of the recommended lists. This adjustment increased the number of recently joined users (as the insights predicted!) and also raised streaming counts by 6%.