Colab initialization

  • Install the pipeline in the Colab runtime

  • Download the files necessary for this example

!pip3 install -U pip > /dev/null
!pip3 install -U "bio_embeddings[all]" > /dev/null
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/reduced_embeddings_file.h5 --output-document reduced_embeddings_file.h5
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/disprot/reduced_embeddings_file.h5 --output-document disprot_reduced_embeddings_file.h5

Pairwise distances between embeddings, and nearest neighbour annotation transfer

In the following, we will compute pairwise distances between sets of embeddings. This borrows ideas from the extract stage using the unsupervised protocol.

But, we can do more than just that!

For example, we can check similarity within one dataset:

import numpy as np
from bio_embeddings.extract import pairwise_distance_matrix_from_embeddings_and_annotations, get_k_nearest_neighbours
from pandas import DataFrame, concat
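# The first file is the query set, the second file is the reference set.
# Passing the same file twice compares the dataset against itself.
instrinsic_pairwise_distances = pairwise_distance_matrix_from_embeddings_and_annotations(
    'reduced_embeddings_file.h5',
    'reduced_embeddings_file.h5',
)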
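
To make concrete what such a distance matrix contains, here is a minimal NumPy sketch on toy vectors. It assumes a Euclidean metric and uses a hypothetical helper, pairwise_euclidean; it is not the implementation used by bio_embeddings, whose metric and internals may differ.

```python
import numpy as np

def pairwise_euclidean(queries: np.ndarray, references: np.ndarray) -> np.ndarray:
    """Hypothetical helper: Euclidean distance from every query row to every reference row."""
    # Broadcast to shape (n_queries, n_references, dim), then reduce over dim.
    differences = queries[:, None, :] - references[None, :, :]
    return np.sqrt((differences ** 2).sum(axis=-1))

# Toy 2-dimensional "embeddings" (made-up values, not taken from the files above).
toy_queries = np.array([[0.0, 0.0], [3.0, 4.0]])
toy_references = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Rows correspond to queries, columns to references.
toy_distances = pairwise_euclidean(toy_queries, toy_references)
print(toy_distances)
```

Entry (i, j) of the result is the distance between query i and reference j, which is the same layout as the pairwise_matrix used in this notebook.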

Let’s transform this into a nice dataframe with labelled rows and columns, so that we know whose distances we are comparing:

distances_dataframe = DataFrame(instrinsic_pairwise_distances.pairwise_matrix,
                                index=instrinsic_pairwise_distances.queries,
                                columns=instrinsic_pairwise_distances.references)

IMPORTANT

You will see that in the following we use the round(1) function, which rounds floating-point values to one decimal place. We do so because otherwise the distance between a sequence and itself (the diagonal) is sometimes NOT exactly zero (due to floating-point precision).

If you print the dataframe without rounding, you will most likely see something like 1.032383e-07 for position (1,1) in the matrix. This, however, is very close to 0.

distances_dataframe.round(1)
4019ffc973586d62bfa9adebf209bb04 437dcdf95882266f90085369b9a4258f 4c4dcc8dd8b86a7e8c4efb3ece2653ac 528a722b3be1f711e27efc09dca5b2d7 b17640be6d2ed8dacb61f48fd40996c2 c12ad1255bc3ed843ae21de938ab5f62 d91f2da05c84147f9fcfe5121b21777f
4019ffc973586d62bfa9adebf209bb04 0.0 4.2 3.5 4.3 3.6 4.3 3.2
437dcdf95882266f90085369b9a4258f 4.2 0.0 3.3 4.3 3.3 3.7 3.1
4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.5 3.3 0.0 3.6 2.4 3.2 2.1
528a722b3be1f711e27efc09dca5b2d7 4.3 4.3 3.6 0.0 3.6 4.5 3.7
b17640be6d2ed8dacb61f48fd40996c2 3.6 3.3 2.4 3.6 0.0 2.9 2.3
c12ad1255bc3ed843ae21de938ab5f62 4.3 3.7 3.2 4.5 2.9 0.0 3.0
d91f2da05c84147f9fcfe5121b21777f 3.2 3.1 2.1 3.7 2.3 3.0 0.0
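
Rounding is mainly a display convenience. If the near-zero diagonal matters in later steps, a tolerance check or explicitly zeroing the diagonal are possible alternatives. A small sketch on a made-up matrix whose diagonal carries a residue of the size quoted above (the matrix and the tolerance of 1e-6 are assumptions for illustration):

```python
import numpy as np

# Toy self-comparison matrix with floating-point residue on the diagonal.
residue = 1.032383e-07
toy_matrix = np.array([
    [residue, 3.5, 3.2],
    [3.5, residue, 2.1],
    [3.2, 2.1, residue],
])

# An exact comparison with zero fails for every diagonal entry ...
exactly_zero = np.diag(toy_matrix) == 0.0
# ... while a comparison with an absolute tolerance succeeds.
close_to_zero = np.isclose(np.diag(toy_matrix), 0.0, atol=1e-6)
print(exactly_zero.any(), close_to_zero.all())

# Optionally force the diagonal to exactly zero, working on a copy.
cleaned = toy_matrix.copy()
np.fill_diagonal(cleaned, 0.0)
```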

Next, we want to find the k nearest neighbours.

Since we are comparing sequences against themselves, the first nearest neighbour will always be the sequence itself. Therefore, we pass k = 2 to the following function, so that the second neighbour is the closest other sequence.

The following function returns two results: the indices of the nearest neighbours, as well as the distances to them.

k = 2
k_nn_indices, k_nn_distances = get_k_nearest_neighbours(instrinsic_pairwise_distances.pairwise_matrix, k)
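
For intuition about the two returned arrays, a nearest-neighbour lookup over a distance matrix can be sketched with NumPy sorting. The helper name k_nearest_sketch and the toy matrix are assumptions for illustration; the actual get_k_nearest_neighbours may be implemented differently.

```python
import numpy as np

def k_nearest_sketch(distance_matrix: np.ndarray, k: int):
    """Hypothetical helper: indices of, and distances to, the k closest references per query row."""
    # Sort each row by distance and keep the first k column indices.
    indices = np.argsort(distance_matrix, axis=1, kind="stable")[:, :k]
    # Gather the matching distances in the same order.
    distances = np.take_along_axis(distance_matrix, indices, axis=1)
    return indices, distances

# Toy symmetric self-comparison matrix (made-up values).
toy_matrix = np.array([
    [0.0, 2.0, 1.0, 3.0],
    [2.0, 0.0, 2.5, 1.5],
    [1.0, 2.5, 0.0, 4.0],
    [3.0, 1.5, 4.0, 0.0],
])

toy_indices, toy_distances = k_nearest_sketch(toy_matrix, k=2)
print(toy_indices)
print(toy_distances)
```

In this self-comparison, the first column of toy_indices is each row's own position, which illustrates why k = 2 is needed to reach the first real neighbour.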

As before, let’s put this into a dataframe with index and column names, so we understand what’s going on:

k_nn_idices_df = DataFrame(k_nn_indices, columns=[f"k_nn_{i+1}_index" for i in range(len(k_nn_indices[0]))])
k_nn_distances_df = DataFrame(k_nn_distances, columns=[f"k_nn_{i+1}_distance" for i in range(len(k_nn_distances[0]))])

neighbours_dataframe = concat([k_nn_idices_df, k_nn_distances_df], axis=1)
neighbours_dataframe.index = instrinsic_pairwise_distances.queries
neighbours_dataframe.round(1)
                                  k_nn_1_index  k_nn_2_index  k_nn_1_distance  k_nn_2_distance
4019ffc973586d62bfa9adebf209bb04             0             6              0.0              3.2
437dcdf95882266f90085369b9a4258f             1             6              0.0              3.1
4c4dcc8dd8b86a7e8c4efb3ece2653ac             2             6              0.0              2.1
528a722b3be1f711e27efc09dca5b2d7             3             4              0.0              3.6
b17640be6d2ed8dacb61f48fd40996c2             4             6              0.0              2.3
c12ad1255bc3ed843ae21de938ab5f62             5             4              0.0              2.9
d91f2da05c84147f9fcfe5121b21777f             6             2              0.0              2.1

Since it’s also nice to see which identifier sits at index N, we can simply map the indices onto the reference identifiers in our dataframe:

neighbours_dataframe['k_nn_1_identifier'] = neighbours_dataframe.k_nn_1_index.map(
    lambda x: instrinsic_pairwise_distances.references[x])

neighbours_dataframe['k_nn_2_identifier'] = neighbours_dataframe.k_nn_2_index.map(
    lambda x: instrinsic_pairwise_distances.references[x])
neighbours_dataframe.round(1)
                                  k_nn_1_index  k_nn_2_index  k_nn_1_distance  k_nn_2_distance                 k_nn_1_identifier                 k_nn_2_identifier
4019ffc973586d62bfa9adebf209bb04             0             6              0.0              3.2  4019ffc973586d62bfa9adebf209bb04  d91f2da05c84147f9fcfe5121b21777f
437dcdf95882266f90085369b9a4258f             1             6              0.0              3.1  437dcdf95882266f90085369b9a4258f  d91f2da05c84147f9fcfe5121b21777f
4c4dcc8dd8b86a7e8c4efb3ece2653ac             2             6              0.0              2.1  4c4dcc8dd8b86a7e8c4efb3ece2653ac  d91f2da05c84147f9fcfe5121b21777f
528a722b3be1f711e27efc09dca5b2d7             3             4              0.0              3.6  528a722b3be1f711e27efc09dca5b2d7  b17640be6d2ed8dacb61f48fd40996c2
b17640be6d2ed8dacb61f48fd40996c2             4             6              0.0              2.3  b17640be6d2ed8dacb61f48fd40996c2  d91f2da05c84147f9fcfe5121b21777f
c12ad1255bc3ed843ae21de938ab5f62             5             4              0.0              2.9  c12ad1255bc3ed843ae21de938ab5f62  b17640be6d2ed8dacb61f48fd40996c2
d91f2da05c84147f9fcfe5121b21777f             6             2              0.0              2.1  d91f2da05c84147f9fcfe5121b21777f  4c4dcc8dd8b86a7e8c4efb3ece2653ac
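
The two map calls handle one neighbour column each. For larger k, the same lookup could be written once with NumPy integer-array indexing. A sketch using the identifiers and indices printed in the table above (the variable names are illustrative, not part of the pipeline):

```python
import numpy as np

# Reference identifiers, in the column order of the distance matrix.
references = np.array([
    "4019ffc973586d62bfa9adebf209bb04",
    "437dcdf95882266f90085369b9a4258f",
    "4c4dcc8dd8b86a7e8c4efb3ece2653ac",
    "528a722b3be1f711e27efc09dca5b2d7",
    "b17640be6d2ed8dacb61f48fd40996c2",
    "c12ad1255bc3ed843ae21de938ab5f62",
    "d91f2da05c84147f9fcfe5121b21777f",
])

# Neighbour indices as printed in the table above (k = 2).
neighbour_indices = np.array([[0, 6], [1, 6], [2, 6], [3, 4], [4, 6], [5, 4], [6, 2]])

# Indexing with an integer array looks up every position at once;
# the result has the same (n_queries, k) shape as the index array.
neighbour_identifiers = references[neighbour_indices]
print(neighbour_identifiers[:, 1])  # second nearest neighbour of each query
```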

“Hey, Christian: this is all nice and stuff, but I wanna compare two different sets!”

Alrighty, just as above, but using two different embedding files. Notice that we call the first embedding file the “query” embedding file, while the second embedding file is the “reference”. Internally, we use these terms because most of the time you are not just interested in the closest embedding, but in some property of the sequence that produced it, aka the “reference” embedding.

instrinsic_pairwise_distances = pairwise_distance_matrix_from_embeddings_and_annotations(
    'reduced_embeddings_file.h5',
    'disprot_reduced_embeddings_file.h5',
)

distances_dataframe = DataFrame(instrinsic_pairwise_distances.pairwise_matrix,
                                index=instrinsic_pairwise_distances.queries,
                                columns=instrinsic_pairwise_distances.references)

display(distances_dataframe.round(1))

k = 2
k_nn_indices, k_nn_distances = get_k_nearest_neighbours(instrinsic_pairwise_distances.pairwise_matrix, k)
k_nn_idices_df = DataFrame(k_nn_indices, columns=[f"k_nn_{i+1}_index" for i in range(len(k_nn_indices[0]))])
k_nn_distances_df = DataFrame(k_nn_distances, columns=[f"k_nn_{i+1}_distance" for i in range(len(k_nn_distances[0]))])

neighbours_dataframe = concat([k_nn_idices_df, k_nn_distances_df], axis=1)
neighbours_dataframe.index = instrinsic_pairwise_distances.queries

neighbours_dataframe['k_nn_1_identifier'] = neighbours_dataframe.k_nn_1_index.map(
    lambda x: instrinsic_pairwise_distances.references[x])

neighbours_dataframe['k_nn_2_identifier'] = neighbours_dataframe.k_nn_2_index.map(
    lambda x: instrinsic_pairwise_distances.references[x])

display(neighbours_dataframe.round(3))
0011ab0c11c7fea51fefcd039b1b69f5 003b243d0117cbaf2b7434184c409b06 004cef7b0dae937e6d722817c17ed889 0115b4447d6911651804d1303bf5f272 0140b3ec6cba5734a909c1d734e48ea0 01de4461209b76f57819919dc38faa99 025559e7b85ed1448d34e61221a54788 0297e58178e3f37ecaa080f23a5efc20 02bfb6e7933ff7691596243b73dbe6c9 02c8fa126578dbabb41f3ac9dc9a5048 ... fe15f12a73627e411d081ac2c75a08d2 fe275ea1d14453f6855577d120934fe1 fe3e07d970ca830a34d91a3fa7e4f9e2 fe3fce3b83de1afdf00394a4e639dd5e fe462c3364a5d79fdfbef7b4befc69a5 fe6c516055388ef0d8d7ab8b7dd60c45 fe736d03c479e9681251e5d1e155ed85 fe81b26a3d8796c10404d6d10915c3f5 fec303c471974bdfa29b7b32697b39ef feeb0ed7e8d9d82d58e3578c1ae7fcee
4019ffc973586d62bfa9adebf209bb04 3.6 4.1 3.9 4.6 4.6 3.4 4.1 8.3 4.0 3.9 ... 3.8 4.6 3.8 7.1 4.6 5.2 8.3 4.0 4.5 5.1
437dcdf95882266f90085369b9a4258f 3.1 3.7 3.1 4.2 4.0 3.3 3.4 7.8 3.4 3.3 ... 3.3 3.6 3.4 6.6 3.8 4.1 8.0 3.0 3.7 4.3
4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 3.2 2.7 3.8 3.4 2.1 3.2 7.9 2.5 2.1 ... 2.5 3.9 2.9 6.5 3.9 4.2 8.0 2.7 3.4 4.3
528a722b3be1f711e27efc09dca5b2d7 3.8 4.7 4.0 4.9 4.8 3.9 4.4 8.3 4.3 3.7 ... 4.3 4.7 4.2 7.4 4.7 5.4 8.7 4.2 4.7 5.0
b17640be6d2ed8dacb61f48fd40996c2 2.1 3.6 2.1 3.9 3.8 2.5 3.2 8.1 2.8 2.7 ... 2.8 3.6 2.9 6.4 3.2 4.2 7.9 2.6 3.7 4.2
c12ad1255bc3ed843ae21de938ab5f62 2.9 3.9 3.2 4.2 3.9 3.2 3.4 8.1 3.3 3.4 ... 3.3 3.6 3.4 6.6 3.6 4.5 8.0 3.1 3.9 4.7
d91f2da05c84147f9fcfe5121b21777f 2.0 3.2 2.5 3.3 3.1 1.9 2.8 7.7 2.3 2.5 ... 2.2 3.5 2.7 6.3 3.3 4.2 7.9 2.5 2.9 4.1

7 rows × 1162 columns

                                  k_nn_1_index  k_nn_2_index  k_nn_1_distance  k_nn_2_distance                 k_nn_1_identifier                 k_nn_2_identifier
4019ffc973586d62bfa9adebf209bb04           859          1005            3.201            3.231  bc78ea9e725fe8b819c32f1438aa0be9  de3105e4d61489ce082a5e6ca9b802e7
437dcdf95882266f90085369b9a4258f           435           763            2.872            2.947  5d733f6264797054e7f2daa324e78e95  a7bedcc53366bb2f2e4cf968188d5886
4c4dcc8dd8b86a7e8c4efb3ece2653ac           880           190            1.765            1.872  c13503898cf51733b81cdf19e5d521e8  29bd1227262429b3c72b158d448c3ccc
528a722b3be1f711e27efc09dca5b2d7           356           343            3.238            3.436  4bd8fd185ad7e5d852f510a0dc1b94b2  49350ec6dd02aabc8eb8cabc10476a4f
b17640be6d2ed8dacb61f48fd40996c2           902           953            1.830            1.883  c67f43b4c271800bf1b86b75fd06aaf7  d20782131bcd520f911265e8bd79ee94
c12ad1255bc3ed843ae21de938ab5f62           793           832            2.725            2.750  af2a36ab30bc54843acdb759ea61df71  b78f477508a2fb1765530101e9e9e861
d91f2da05c84147f9fcfe5121b21777f           148             5            1.833            1.945  231a587783f7bca8795a23abc3269646  01de4461209b76f57819919dc38faa99
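
With the nearest reference identified, annotation transfer amounts to looking up a property of that reference and assigning it to the query. A minimal sketch in plain Python, in which every identifier, every label and the distance cut-off are hypothetical placeholders, not values from the files used above:

```python
# Hypothetical reference annotations, keyed by reference identifier.
reference_annotations = {
    "ref_a": "disordered",
    "ref_b": "ordered",
}

# Hypothetical nearest-neighbour results: query -> (nearest reference, distance).
nearest_reference = {
    "query_1": ("ref_a", 1.8),
    "query_2": ("ref_b", 3.2),
}

# Assumed cut-off: only transfer an annotation when the neighbour is close enough.
max_distance = 2.5

transferred = {}
for query, (reference, distance) in nearest_reference.items():
    if distance <= max_distance:
        transferred[query] = reference_annotations[reference]
    else:
        transferred[query] = None  # too far away to trust the transfer

print(transferred)
```

Whether a cut-off is appropriate, and which value to use, depends on the embeddings and the annotation in question; the value here only illustrates the idea.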