Sound Similarity Search with Vector Database¶
Use CassIO and Astra DB / Apache Cassandra® for similarity searches between sound samples, powered by sound embeddings and Vector Search.
NOTE: this uses Cassandra's "Vector Similarity Search" capability. Make sure you are connecting to a vector-enabled database for this demo.
In this notebook you will:
- Download a library of sound samples from HuggingFace Datasets.
- Calculate sound embedding vectors for them with PANNs Inference.
- Store the embedding vectors on a table in your Cassandra / Astra DB instance, using the CassIO library for ease of operation.
- Run one or more searches for sounds similar to a provided sample.
- Start a simple web-app that exposes a sound search feature.
Import packages¶
The CassIO object needed for this demo is the VectorTable:
from cassio.vector import VectorTable
Other packages are needed for various tasks in this demo:
import os
from IPython.display import Audio
from tqdm.auto import tqdm
import torch
import numpy as np
# processing of sound samples:
from scipy.io import wavfile
import librosa
# HuggingFace dataset loading:
from datasets import load_dataset
# Sound embedding calculation:
from panns_inference import AudioTagging
# To spawn simple data-oriented UIs from the notebook
import gradio
try:
    from google.colab import files
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False
Connect to your DB¶
A database connection is needed to access Cassandra. The following assumes that a vector-search-capable Astra DB instance is available. Adjust as needed.
from cqlsession import getCQLSession, getCQLKeyspace
cqlMode = "astra_db" # "astra_db"/"local"
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)
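For reference, here is a minimal sketch of what a "local" connection helper could look like, built on the Cassandra Python driver. The actual cqlsession module shipped with the demo handles both modes (reading Astra DB credentials from environment variables); the contact point and keyspace name below are purely illustrative:
# Illustrative sketch only: a bare-bones "local" connection helper.
# The real cqlsession module also supports the "astra_db" mode
# (secure-connect bundle + token authentication).
from cassandra.cluster import Cluster

def getCQLSession(mode="local"):
    if mode == "local":
        cluster = Cluster(["127.0.0.1"])  # contact point for a local Cassandra
        return cluster.connect()
    raise NotImplementedError(f"Unsupported mode: {mode}")

def getCQLKeyspace(mode="local"):
    return "demo_keyspace"  # hypothetical keyspace name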
Load the Data¶
In this demo, you will use audio samples from the ESC-50 dataset, a labeled collection of 2000 environmental audio recordings, each with a duration of five seconds.
The dataset can be loaded from the HuggingFace Datasets hub as follows:
(Note that, unless already cached, the download operation may take a few minutes.)
audio_dataset = load_dataset("ashraq/esc50", split="train")
# take a look...
print(audio_dataset)
Dataset({
    features: ['filename', 'fold', 'target', 'category', 'esc10', 'src_file', 'take', 'audio'],
    num_rows: 2000
})
Each sample belongs to a "category". Take a look at the category for the first few items in the dataset:
print("Categories:")
print(audio_dataset["category"][:5])
print("\nFilenames:")
print(audio_dataset["filename"][:5])
Categories:
['dog', 'chirping_birds', 'vacuum_cleaner', 'vacuum_cleaner', 'thunderstorm']

Filenames:
['1-100032-A-0.wav', '1-100038-A-14.wav', '1-100210-A-36.wav', '1-100210-B-36.wav', '1-101296-A-19.wav']
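As a cross-check, the clips span the 50 categories that give ESC-50 its name:
# ESC-50 comprises 50 categories of environmental sounds:
print(len(set(audio_dataset["category"])))  # expected: 50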
The actual audio signal is sampled at 44100 Hz and available as a NumPy array. Take a look at the first few entries:
print(audio_dataset["audio"][:3])
[{'path': None, 'array': array([0., 0., 0., ..., 0., 0., 0.]), 'sampling_rate': 44100},
 {'path': None, 'array': array([-0.01184082, -0.10336304, -0.14141846, ..., 0.06985474, 0.04049683, 0.00274658]), 'sampling_rate': 44100},
 {'path': None, 'array': array([-0.00695801, -0.01251221, -0.01126099, ..., 0.215271 , -0.00875854, -0.28903198]), 'sampling_rate': 44100}]
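Since each signal is a plain NumPy array, you can listen to any sample directly in the notebook with the Audio widget imported earlier. For example, the second entry (a "chirping_birds" clip, as seen above):
# Play the second sample right in the notebook:
sample = audio_dataset["audio"][1]
display(Audio(sample["array"], rate=sample["sampling_rate"]))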
Prepare the Audio Embedding Model¶
Note: if you are on Colab, make sure your "Runtime type" has "Hardware Acceleration" set to GPU for best performance. The cell below will try to auto-detect your setup and adjust to it; adapt it to your specific hardware if necessary.
Note: please keep in mind that the cell below may take up to eight minutes to load the full PANNs model, unless already cached locally.
GPU_AVAILABLE = torch.cuda.device_count() > 0

if GPU_AVAILABLE:
    # load the default model on the GPU
    model = AudioTagging(checkpoint_path=None, device="cuda")
    print("\nLoaded the sound embedding model on the GPU.")
else:
    # fall back to the CPU
    model = AudioTagging(checkpoint_path=None, device="cpu")
    print(
        "\nLoaded the sound embedding model on the CPU. Reduced defaults "
        "will be used. Please consider upgrading to GPU-powered "
        "hardware for the best experience."
    )
Checkpoint path: /home/USER/panns_data/Cnn14_mAP=0.431.pth
Using CPU.

Loaded the sound embedding model on the CPU. Reduced defaults will be used. Please consider upgrading to GPU-powered hardware for the best experience.
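As an optional sanity check, you can embed a single clip and verify the embedding dimension, which must match the table definition in the next section. Note that the model expects a leading batch axis:
# Embed one clip; PANNs inference expects a (batch, samples) array:
_, sample_embedding = model.inference(audio_dataset["audio"][0]["array"][None, :])
print(sample_embedding.shape)  # expected: (1, 2048)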
Create a DB table through CassIO¶
When an instance of VectorTable is created, CassIO takes care of the underlying database operations. An important parameter to supply is the embedding vector dimension (fixed, in this case, by the choice of the PANNs model being used):
table_name = "audio_table"
embedding_dimension = 2048
v_table = VectorTable(
    session=session,
    keyspace=keyspace,
    table=table_name,
    embedding_dimension=embedding_dimension,
    primary_key_type="TEXT",
)
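If you wish, you can verify that the table was created by querying the standard system_schema tables (this query works on both Cassandra and Astra DB):
# Optional check: list the tables now present in the keyspace.
rows = session.execute(
    "SELECT table_name FROM system_schema.tables WHERE keyspace_name = %s",
    (keyspace,),
)
print(sorted(row.table_name for row in rows))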
Compute and store embedding vectors for audio¶
This cell processes the audio samples you just loaded. Working in batches, the embedding vectors are evaluated through the PANNs model, and the results are stored in the Cassandra / Astra DB table by invoking the put_async method of VectorTable (the asynchronous counterpart of put, allowing concurrent insertions).
Note: this operation will take a few minutes. Feel free to reduce the total number of sound clips to process for a quicker demo.
if GPU_AVAILABLE:
    BATCH_SIZE = 100
    SAMPLES_TO_PROCESS = 2000
else:
    BATCH_SIZE = 20
    SAMPLES_TO_PROCESS = 200

for i in tqdm(range(0, SAMPLES_TO_PROCESS, BATCH_SIZE)):
    # Find end of batch
    i_end = min(i + BATCH_SIZE, SAMPLES_TO_PROCESS)
    # Extract batch filenames and audio signals.
    # (the filename will also serve as row primary key on DB)
    batch_filenames = audio_dataset["filename"][i:i_end]
    batch_audio = np.array(
        [item["array"] for item in audio_dataset["audio"][i:i_end]]
    )
    # Generate embeddings for all the audios in the batch
    _, batch_embeddings_np = model.inference(batch_audio)
    batch_categories = audio_dataset["category"][i:i_end]
    # Insert all entries in the batch concurrently
    futures = []
    for filename, category, embedding_np in zip(
        batch_filenames, batch_categories, batch_embeddings_np
    ):
        metadata = {
            "category": category,
            "filename": filename,
        }
        # From a NumPy array to a plain list of floats:
        embedding = embedding_np.tolist()
        futures.append(v_table.put_async(
            document=filename,
            embedding_vector=embedding,
            document_id=filename,
            metadata=metadata,
            ttl_seconds=None,
        ))
    for future in futures:
        future.result()
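Once the loop completes, a simple count shows how many rows were written. (COUNT(*) is acceptable here given the small table; it should be avoided on large Cassandra tables.)
# Count the rows written so far (fine for a table this small):
count_row = session.execute(f"SELECT COUNT(*) AS n FROM {keyspace}.{table_name}").one()
print(f"Rows stored: {count_row.n}")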
Note that, as is customary in Cassandra with (potentially) large binary blobs, you did not store the raw audio signal in the table itself. Rather, in the document field of the VectorTable, you stored the metadata necessary to retrieve the audio file in some other way (which, in a realistic setup, could be an S3 bucket or similar). In this case, this amounts to the filename field.
To emulate a more realistic setup, create a dictionary for later lookup by filename:
audios_by_filename = {
    dataset_row["filename"]: dataset_row["audio"]["array"]
    for dataset_row in audio_dataset
}
Here is how this ("direct filename to audio array") lookup would work:
# As an example:
print(str(list(audios_by_filename["1-100038-A-14.wav"]))[:64] + "...")
[-0.0118408203125, -0.103363037109375, -0.14141845703125, -0.120...
Run a similarity search¶
You will now obtain a new audio file and search for samples similar to it.
Get the sound of a cat meowing with:
!wget https://storage.googleapis.com/audioset/miaow_16k.wav
--2023-07-26 12:38:40--  https://storage.googleapis.com/audioset/miaow_16k.wav
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.209.48, 216.58.209.48, 216.58.204.144, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.209.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 215546 (210K) [audio/x-wav]
Saving to: ‘miaow_16k.wav.1’

miaow_16k.wav.1     100%[===================>] 210.49K  --.-KB/s    in 0.06s

2023-07-26 12:38:40 (3.30 MB/s) - ‘miaow_16k.wav.1’ saved [215546/215546]
Load the audio using the librosa library:
meow_sound, meow_rate = librosa.load("miaow_16k.wav")
print("Meow!")
display(Audio(meow_sound, rate=meow_rate))
Meow!
In order to run the search, first get the embedding vector for the input file, then use it to run a similarity search on the CassIO VectorTable:
# Reshape query audio
reshaped_meow = meow_sound[None, :]
# Get the embeddings for the new audio
_, query_embedding_np = model.inference(reshaped_meow)
query_embedding = query_embedding_np.tolist()[0]
matches = v_table.search(
    embedding_vector=query_embedding,
    top_k=5,
    metric="cos",
    metric_threshold=None,
)

# Show a "play" widget for the top results
for match_i, match in enumerate(matches):
    print(f"Match {match_i}: {match['document']} ", end="")
    print(f"(category: {match['metadata']['category']}, ", end="")
    print(f"distance: {match['distance']:.4f})")
    # retrieve the audio clip content from "storage"
    match_audio = audios_by_filename[match["document"]]
    display(Audio(match_audio, rate=44100))
Match 0: 2-83934-A-5.wav (category: cat, distance: 0.8356)
Match 1: 2-82274-A-5.wav (category: cat, distance: 0.8284)
Match 2: 1-34094-A-5.wav (category: cat, distance: 0.7972)
Match 3: 3-95695-A-5.wav (category: cat, distance: 0.7861)
Match 4: 4-250864-A-8.wav (category: sheep, distance: 0.7837)
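The call above passed metric_threshold=None, i.e. no cutoff on the similarity value. As the results suggest, with the "cos" metric a higher distance means a closer match, so you can pass a threshold to keep only sufficiently similar clips (0.80 below is an arbitrary illustrative cutoff):
# Same search, but discarding matches below a chosen similarity:
close_matches = v_table.search(
    embedding_vector=query_embedding,
    top_k=5,
    metric="cos",
    metric_threshold=0.80,
)
print([match["document"] for match in close_matches])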
Experiment with your own WAV file¶
In this section, you can supply any WAV audio file of your own to have a bit of fun.
While you're at it, do a bit of refactoring of the audio processing steps:
def wav_filepath_to_audio(filepath):
    loaded_audio, bitrate = librosa.load(filepath)
    return loaded_audio, bitrate

def audio_similarity_search(query_audio, top_k=5):
    query_audio0 = query_audio[None, :]
    # If stereo sound comes from Gradio, this input will have a third dimension: average it away!
    if len(query_audio0.shape) == 3:
        query_audio1 = np.average(query_audio0, axis=2)
    else:
        query_audio1 = query_audio0
    # get the embeddings for the audio from the model
    _, query_embedding_np = model.inference(query_audio1)
    query_embedding = query_embedding_np.tolist()[0]
    matches = v_table.search(
        embedding_vector=query_embedding,
        top_k=top_k,
        metric="cos",
        metric_threshold=None,
    )
    return matches
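As a quick check, the refactored helpers should reproduce the earlier "meow" results:
# Re-run the meow query through the refactored helpers:
meow_again, _ = wav_filepath_to_audio("miaow_16k.wav")
for match in audio_similarity_search(meow_again, top_k=3):
    print(f"{match['document']} ({match['metadata']['category']})")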
Now try providing a sound file of yours (skip this part if you want):
if IS_COLAB:
    print("Please upload a WAV file from your computer:")
    uploaded = files.upload()
    wav_file_title = list(uploaded.keys())[0]
    wav_filepath = os.path.join(os.getcwd(), wav_file_title)
else:
    wav_filepath = input("Please provide the full path to a WAV file: ")

supplied_audio, bitrate = wav_filepath_to_audio(wav_filepath)
print("Your query sound:")
display(Audio(supplied_audio, rate=bitrate))
print("Similar clips:")
for match in audio_similarity_search(supplied_audio, top_k=3):
    print(f"{match['document_id']} ({match['metadata']['category']})")
    match_audio = audios_by_filename[match["document"]]
    display(Audio(match_audio, rate=44100))
Your query sound:
Similar clips:
2-110613-A-13.wav (crickets)
1-172649-A-40.wav (helicopter)
3-156391-A-35.wav (washing_machine)
Sound Similarity Web App¶
The following cells set up and launch a simple application, powered by Gradio, demonstrating the sound similarity search seen so far.
In essence, Gradio makes it easy to expose a graphical interface around a single function. The function below, built from the components seen earlier, accepts a user-provided sound as input and returns a number of results from the library, found by similarity.
The input can be either a sound recorded with the user's microphone or an uploaded WAV file (the former taking precedence if both are supplied).
NUM_RESULT_WIDGETS = 5
def gradio_upload_audio(microphone_sound, input_sound):
    if microphone_sound is not None:
        input_sound = microphone_sound
    if input_sound:
        # Warning: Gradio sound signals arrive as "int" between +/- 32767.
        # First these must be normalized to [-1:+1]
        # (see https://github.com/gradio-app/gradio/issues/2789)
        max_sound_signal = np.abs(input_sound[1]).max()
        input_sound_norm_signal = input_sound[1] / max_sound_signal
        input_audio = np.array(input_sound_norm_signal, dtype=np.float32)
        found_audios = []
        for match in audio_similarity_search(input_audio, top_k=NUM_RESULT_WIDGETS):
            match_audio = audios_by_filename[match["document"]]
            sample_rate = 44100
            # normalize back to the Gradio y-scale for sounds:
            gradio_rescaled_audio = np.int16(match_audio * 32767)
            this_result = (sample_rate, gradio_rescaled_audio)
            found_audios.append(this_result)
    else:
        found_audios = []
    # pad the result in any case to the number of displayed widgets
    return found_audios + [None] * (NUM_RESULT_WIDGETS - len(found_audios))
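Before launching the app, you can smoke-test this function by hand with a Gradio-style (sampling_rate, int16_signal) tuple, built here from the meow clip loaded earlier:
# Simulate a Gradio upload (no microphone input):
test_input = (meow_rate, np.int16(meow_sound * 32767))
results = gradio_upload_audio(None, test_input)
print(sum(result is not None for result in results), "results returned")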
The next cell starts the Gradio app: click on the URL that will be displayed to open it.
Please keep in mind that:
- The cell will keep running as long as the UI is running. Interrupt the notebook kernel to regain control (e.g. to modify and re-launch, or execute other cells, etc).
- The cell output will give both a local URL to access the application and a URL such as https://<....>.gradio.live to reach it from anywhere. Use the latter link from Colab and when sharing with others. (The link will expire after a certain time.)
- The UI will also be shown within the notebook below the cell.
sound_ui = gradio.Interface(
    fn=gradio_upload_audio,
    inputs=[
        gradio.components.Audio(source="microphone"),
        gradio.components.Audio(type="numpy", label="Your query audio"),
    ],
    outputs=[
        gradio.components.Audio(type="numpy", label=f"Search result #{output_i}")
        for output_i in range(NUM_RESULT_WIDGETS)
    ],
    title="Sound Similarity Search with CassIO & Vector Database",
)
sound_ui.launch(share=True, debug=True)
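When you are done experimenting (and after interrupting the kernel, since the cell above blocks while the UI runs), you can shut the app down explicitly to free the port and invalidate the share link:
# Stop the Gradio app and its public share link:
sound_ui.close()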