LyRec: A Song Recommender That Reads Between the Lyrics 🎶

Dataset

Of course, the first thing I needed was a song lyrics dataset. Fortunately, I found one on Kaggle! This dataset is under a Creative Commons (CC0: Public Domain) license.

This dataset contains about 60K song lyrics along with the title and artist name. I know 60K might not cover all the songs you love, but I think it’s a good starting point for LyRec.

import pandas as pd

songs_df = pd.read_csv(f"{root_dir}/spotify_millsongdata.csv")
songs_df = songs_df.drop(columns=["link"])  # the link column isn't needed
songs_df["song_id"] = songs_df.index + 1    # 1-based ID for each song

I didn’t need to perform any pre-processing on this data. I just removed the link column and added an ID for each song.

Models

I needed to select two LLMs: One for computing the embeddings and another for generating the song summaries. Picking the correct LLM for your task may be a little tricky because of the sheer number of them! It’s a good idea to look at the leaderboard to find the current best ones. For the embedding model, I checked the MTEB leaderboard hosted by HuggingFace.

I was looking for a smaller model (obviously!) that didn't compromise too much on accuracy; hence, I decided on GTE-Qwen2-1.5B-Instruct.

from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    model_kwargs={"torch_dtype": torch.float16},  # half precision to save memory
)

For the summarizer, I just needed a small enough instruction-following LLM, so I went with Gemma-2-2b-It. In my experience, it's one of the best small models as of now.

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

Pre-computing the Embeddings

Computing the lyrics embeddings was pretty straightforward. I just used the .encode(…) method with a batch_size of 32 for faster processing.

import numpy as np

song_lyrics = songs_df["text"].values

lyrics_embeddings = model.encode(
    song_lyrics,
    batch_size=32,
    show_progress_bar=True,
)

np.save(f"{root_dir}/60k_song_lyrics_embeddings.npy", lyrics_embeddings)

At this point, I stored these embeddings in a .npy file. I could have used a more structured format, but it did the job for me.

Coming to the summary embeddings, I first needed to generate the summaries. I had to ensure that the summary captured the emotion and the song’s theme while not being too lengthy. So, I came up with the following prompt for Gemma-2.

You are an expert song summarizer. 
You will be given the full lyrics to a song.
Your task is to write a concise, cohesive summary that
captures the central emotion, overarching theme, and
narrative arc of the song in 150 words.

{song lyrics}

Here’s the code snippet for summary generation. For simplicity, the snippet below processes songs sequentially; I have included the batch-processing version in the GitHub repo.

from tqdm import tqdm

tqdm.pandas()  # enables .progress_apply(...) on DataFrames

def get_summary(song_lyrics):
    messages = [
        {"role": "user",
         "content": f'''You are an expert song summarizer.
You will be given the full lyrics to a song.
Your task is to write a concise, cohesive summary that
captures the central emotion, overarching theme, and
narrative arc of the song in 150 words.\n\n{song_lyrics}'''},
    ]

    outputs = pipe(messages, max_new_tokens=256)
    assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
    return assistant_response

songs_df["summary"] = songs_df["text"].progress_apply(get_summary)

Unsurprisingly, this step took the most time. Luckily, it needs to be done only once (and again whenever we add new songs to the database).
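The batched version mentioned above lives in the repo; as a rough sketch (the batch size is an illustrative value, and this assumes the same pipe and prompt as before), it looks something like this:

prompt = '''You are an expert song summarizer.
You will be given the full lyrics to a song.
Your task is to write a concise, cohesive summary that
captures the central emotion, overarching theme, and
narrative arc of the song in 150 words.'''

# Build one chat conversation per song and let the pipeline batch them
all_messages = [
    [{"role": "user", "content": f"{prompt}\n\n{lyrics}"}]
    for lyrics in songs_df["text"].values
]
outputs = pipe(all_messages, max_new_tokens=256, batch_size=8)  # batch size is illustrative
songs_df["summary"] = [o[0]["generated_text"][-1]["content"].strip() for o in outputs]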

Then, I computed and stored the embeddings just like last time.

song_summary = songs_df["summary"].values

summary_embeddings = model.encode(
    song_summary,
    batch_size=32,
    show_progress_bar=True,
)

np.save(f"{root_dir}/60k_song_summary_embeddings.npy", summary_embeddings)

Vector Search

With the embeddings in place, it was time to implement semantic search based on embedding similarity. There are a lot of awesome open-source vector databases available for this job. I decided to use a simple one called FAISS (Facebook AI Similarity Search). It takes just two lines to add the embeddings to the database. First, we create a FAISS index, specifying the similarity metric we want to use for searching and the dimension of the vectors. I used the dot product (inner product) as the similarity measure. Then, we add the embeddings to the index.

Note: Our database is small enough to do an exhaustive search using dot product. For larger databases, it’s recommended to perform an approximate nearest neighbor (ANN) search. FAISS has support for that.

import faiss

lyrics_embeddings = np.load(f"{root_dir}/60k_song_lyrics_embeddings.npy")
lyrics_index = faiss.IndexFlatIP(lyrics_embeddings.shape[1])
lyrics_index.add(lyrics_embeddings.astype(np.float32))

summary_embeddings = np.load(f"{root_dir}/60k_song_summary_embeddings.npy")
summary_index = faiss.IndexFlatIP(summary_embeddings.shape[1])
summary_index.add(summary_embeddings.astype(np.float32))
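If the collection ever outgrows exhaustive search, an IVF index is one way to get ANN search in FAISS. Here's a minimal sketch; nlist and nprobe are illustrative values, not part of my setup:

nlist = 256  # number of clusters to partition the vectors into (illustrative)
quantizer = faiss.IndexFlatIP(lyrics_embeddings.shape[1])
ann_index = faiss.IndexIVFFlat(
    quantizer, lyrics_embeddings.shape[1], nlist, faiss.METRIC_INNER_PRODUCT
)
ann_index.train(lyrics_embeddings.astype(np.float32))  # IVF indexes need a training pass
ann_index.add(lyrics_embeddings.astype(np.float32))
ann_index.nprobe = 16  # clusters probed per query; trades recall for speed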

To find the most similar songs given a query, we first need to generate the query embedding and then call the .search(…) method on the index. Under the hood, this method computes the similarity between the query and every entry in our database and returns the top k entries and the corresponding scores. Here’s the code performing a semantic search on lyrics embeddings.

query_lyrics = 'Imagine the last song you fell in love with'
query_embedding = model.encode(
    f"Instruct: Given the lyrics, retrieve relevant songs\nQuery: {query_lyrics}"
)
query_embedding = query_embedding.reshape(1, -1).astype(np.float32)
lyrics_scores, lyrics_ids = lyrics_index.search(query_embedding, 10)

Notice that I prepended a short instruction prompt to the query; this is the recommended usage for this embedding model. The same applies to searching the summary embeddings.

query_description = 'Describe the type of song you wanna listen to'
query_embedding = model.encode(
    f"Instruct: Given a description, retrieve relevant songs\nQuery: {query_description}"
)
query_embedding = query_embedding.reshape(1, -1).astype(np.float32)
summary_scores, summary_ids = summary_index.search(query_embedding, 10)

Pro tip: How do you do a sanity check?
Just put any entry from the database in the query and see if the search returns the same as the top-scoring entry!
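For instance, with the lyrics index built above (song position 42 is an arbitrary pick):

# The top hit for a database entry's own embedding should be the entry itself
test_vec = lyrics_embeddings[42].reshape(1, -1).astype(np.float32)
scores, ids = lyrics_index.search(test_vec, 1)
assert ids[0][0] == 42, "sanity check failed: expected the query song itself"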

Implementing the Features

At this stage, I had the building blocks of LyRec. Now it was time to put them together. Remember the three goals I set in the beginning? Here’s how I implemented those.

To keep things tidy, I created a class named LyRec that would have a method for each feature. The first two features are pretty straightforward to implement.

The method .get_songs_with_similar_lyrics(…) takes a song’s lyrics and an integer k as input and returns a list of the k most similar songs based on lyrics similarity. Each element in the list is a dictionary containing the artist’s name, song title, and lyrics.

Similarly, .get_songs_with_similar_description(…) takes a free-form text description and an integer k as input and returns a list of the k most similar songs based on that description.

Here’s the relevant code snippet.

class LyRec:
    def __init__(self, songs_df, lyrics_index, summary_index, embedding_model):
        self.songs_df = songs_df
        self.lyrics_index = lyrics_index
        self.summary_index = summary_index
        self.embedding_model = embedding_model

    def get_records_from_id(self, song_ids):
        songs = []
        for _id in song_ids:
            # FAISS returns 0-based positions; song_id is 1-based
            songs.extend(self.songs_df[self.songs_df["song_id"] == _id + 1].to_dict(orient="records"))
        return songs

    def get_songs_with_similar_lyrics(self, query_lyrics, k=10):
        query_embedding = self.embedding_model.encode(
            f"Instruct: Given the lyrics, retrieve relevant songs\nQuery: {query_lyrics}"
        ).reshape(1, -1).astype(np.float32)

        scores, song_ids = self.lyrics_index.search(query_embedding, k)
        return self.get_records_from_id(song_ids[0])

    def get_songs_with_similar_description(self, query_description, k=10):
        query_embedding = self.embedding_model.encode(
            f"Instruct: Given a description, retrieve relevant songs\nQuery: {query_description}"
        ).reshape(1, -1).astype(np.float32)

        scores, song_ids = self.summary_index.search(query_embedding, k)
        return self.get_records_from_id(song_ids[0])

The final feature was a little tricky to implement. Recall that we need to first retrieve the top songs based on lyrics and then re-rank them based on the textual description. The first retrieval was easy. For the second one, we only need to consider the top-scoring songs. I decided to create a temporary FAISS index with the top songs and then search for the songs with the highest summary similarity scores. Here’s my implementation.

    def get_songs_with_similar_lyrics_and_description(self, query_lyrics, query_description, k=10):
        query_lyrics_embedding = self.embedding_model.encode(
            f"Instruct: Given the lyrics, retrieve relevant songs\nQuery: {query_lyrics}"
        ).reshape(1, -1).astype(np.float32)

        # First stage: retrieve a generous candidate pool by lyrics similarity
        scores, song_ids = self.lyrics_index.search(query_lyrics_embedding, 500)
        top_k_indices = song_ids[0]

        # Pull the candidates' summary embeddings out of the main index
        summary_candidates = []
        for idx in top_k_indices:
            emb = self.summary_index.reconstruct(int(idx))
            summary_candidates.append(emb)
        summary_candidates = np.array(summary_candidates, dtype=np.float32)

        # Second stage: re-rank the candidates with a temporary index
        temp_index = faiss.IndexFlatIP(summary_candidates.shape[1])
        temp_index.add(summary_candidates)

        query_description_embedding = self.embedding_model.encode(
            f"Instruct: Given a description, retrieve relevant songs\nQuery: {query_description}"
        ).reshape(1, -1).astype(np.float32)

        scores, temp_ids = temp_index.search(query_description_embedding, k)
        final_song_ids = [top_k_indices[i] for i in temp_ids[0]]

        return self.get_records_from_id(final_song_ids)
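For completeness, here is a quick usage sketch with the objects built earlier. The queries are made up, and the "artist" and "song" keys follow the Kaggle dataset's column names:

recommender = LyRec(songs_df, lyrics_index, summary_index, model)

# Songs whose lyrics are closest to the query lyrics
similar = recommender.get_songs_with_similar_lyrics(
    "Imagine the last song you fell in love with", k=5
)

# Lyrics-based candidates re-ranked by a free-form description
matches = recommender.get_songs_with_similar_lyrics_and_description(
    "Imagine the last song you fell in love with",
    "a melancholic ballad about lost love",
    k=5,
)

for song in matches:
    print(song["artist"], "-", song["song"])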

Voilà! Finally, LyRec is ready. You can find the complete implementation in this repo. Please leave a star if you find this helpful! 😃
