Colab initialization

Install the pipeline in the Colab runtime
Download the files necessary for this example

!pip3 install -U pip > /dev/null
!pip3 install -U bio_embeddings[all] > /dev/null

!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/reduced_embeddings_file.h5 --output-document reduced_embeddings_file.h5
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/disprot/reduced_embeddings_file.h5 --output-document disprot_reduced_embeddings_file.h5

Pairwise distances between embeddings, and nearest neighbour annotation transfer

In the following, we will compute pairwise distances between sets of embeddings. This borrows ideas from the extract stage using the unsupervised protocol.

But, we can do more than just that!

For example, we can check similarity within one dataset:

import numpy as np
from bio_embeddings.extract import pairwise_distance_matrix_from_embeddings_and_annotations, get_k_nearest_neighbours
from pandas import DataFrame, concat

instrinsic_pairwise_distances = pairwise_distance_matrix_from_embeddings_and_annotations(
    'reduced_embeddings_file.h5',
    'reduced_embeddings_file.h5',
)
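To make concrete what a pairwise distance matrix contains, here is a minimal sketch on toy data. It assumes Euclidean distance and uses invented two-dimensional vectors in place of real embeddings; the metric the library applies is not stated in this text, so treat this as an illustration of the idea rather than of the library internals.

```python
import numpy as np

# Toy stand-ins for two sets of embeddings (hypothetical values, one row per sequence).
queries = np.array([[0.0, 0.0], [3.0, 4.0]])
references = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])

# Broadcasting builds every query/reference difference at once:
# shape (n_queries, 1, dim) - shape (1, n_references, dim)
differences = queries[:, None, :] - references[None, :, :]

# Euclidean distance (an assumption made for this sketch).
pairwise_matrix = np.sqrt((differences ** 2).sum(axis=-1))

# One row per query, one column per reference.
print(pairwise_matrix.shape)
print(pairwise_matrix)
```

Entry (i, j) is the distance between query i and reference j, which is why the dataframe built next uses the queries as its index and the references as its columns.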

Let’s transform this into a nice dataframe with labelled rows and columns, so that we know whose distances we are comparing:

distances_dataframe = DataFrame(instrinsic_pairwise_distances.pairwise_matrix,
                                index = instrinsic_pairwise_distances.queries,
                                columns= instrinsic_pairwise_distances.references)

IMPORTANT

You will see that in the following we call round(1) on the dataframe, which rounds floating point values to one decimal place. We do so because otherwise the distance between a sequence and itself (the diagonal) is sometimes NOT exactly zero, due to floating point precision.

If you print the dataframe without rounding, you will most likely see something like 1.032383e-07 for position (1,1) in the matrix. This, however, is very close to 0.

distances_dataframe.round(1)

	4019ffc973586d62bfa9adebf209bb04	437dcdf95882266f90085369b9a4258f	4c4dcc8dd8b86a7e8c4efb3ece2653ac	528a722b3be1f711e27efc09dca5b2d7	b17640be6d2ed8dacb61f48fd40996c2	c12ad1255bc3ed843ae21de938ab5f62	d91f2da05c84147f9fcfe5121b21777f
4019ffc973586d62bfa9adebf209bb04	0.0	4.2	3.5	4.3	3.6	4.3	3.2
437dcdf95882266f90085369b9a4258f	4.2	0.0	3.3	4.3	3.3	3.7	3.1
4c4dcc8dd8b86a7e8c4efb3ece2653ac	3.5	3.3	0.0	3.6	2.4	3.2	2.1
528a722b3be1f711e27efc09dca5b2d7	4.3	4.3	3.6	0.0	3.6	4.5	3.7
b17640be6d2ed8dacb61f48fd40996c2	3.6	3.3	2.4	3.6	0.0	2.9	2.3
c12ad1255bc3ed843ae21de938ab5f62	4.3	3.7	3.2	4.5	2.9	0.0	3.0
d91f2da05c84147f9fcfe5121b21777f	3.2	3.1	2.1	3.7	2.3	3.0	0.0
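If you would rather not round the whole table, the diagonal can also be checked against a tolerance. The following is a small sketch on a hypothetical matrix with invented values; the tolerance of 1e-5 is an assumption, not a recommendation from the library.

```python
import numpy as np

# Hypothetical self-comparison matrix whose diagonal carries tiny
# floating point residue instead of exact zeros.
matrix = np.array([
    [1.0e-07, 4.2, 3.5],
    [4.2, 0.0, 3.3],
    [3.5, 3.3, 2.0e-07],
], dtype=np.float32)

# An exact comparison fails because of the residue...
diagonal_is_exact_zero = bool(np.all(np.diag(matrix) == 0))

# ...while a comparison with an absolute tolerance succeeds.
diagonal_is_near_zero = bool(np.allclose(np.diag(matrix), 0.0, atol=1e-5))

# Optional clean-up on a copy: force the diagonal to exactly zero.
cleaned = matrix.copy()
np.fill_diagonal(cleaned, 0.0)

print(diagonal_is_exact_zero, diagonal_is_near_zero)
```

Zeroing the diagonal only makes sense when queries and references are the same sequences in the same order, as in the self-comparison above.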

Next, we want to find the k nearest neighbours.

Since we are comparing sequences against themselves, the first nearest neighbour will always be the sequence itself. Therefore we pass 2 to the following function in order to get the second nearest neighbour.

The following function returns two results: the indices of the nearest neighbours, as well as the distances to them.

k = 2
k_nn_indices, k_nn_distances = get_k_nearest_neighbours(instrinsic_pairwise_distances.pairwise_matrix, k)
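For intuition about the two return values, here is a minimal NumPy sketch that picks the k smallest distances per row. The function name k_nearest_sketch and the toy matrix are made up for this example, and the approach (a full argsort per row) is an assumption; it is not necessarily how get_k_nearest_neighbours is implemented.

```python
import numpy as np

def k_nearest_sketch(pairwise_matrix, k):
    """Return (indices, distances) of the k smallest entries in each row."""
    pairwise_matrix = np.asarray(pairwise_matrix)
    # argsort orders the column indices of each row by increasing distance.
    indices = np.argsort(pairwise_matrix, axis=1)[:, :k]
    # Look up the distances that belong to the selected indices.
    distances = np.take_along_axis(pairwise_matrix, indices, axis=1)
    return indices, distances

# Toy self-comparison matrix (hypothetical values).
toy = np.array([
    [0.0, 4.2, 3.5],
    [4.2, 0.0, 3.3],
    [3.5, 3.3, 0.0],
])

indices, distances = k_nearest_sketch(toy, k=2)
print(indices)
print(distances)
```

In a self-comparison like this toy case, the first column of indices is simply each row's own position, and the second column holds the closest other sequence.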

As before, let’s put this into a dataframe with index and column names, so we understand what’s going on:

k_nn_idices_df = DataFrame(k_nn_indices, columns=[f"k_nn_{i+1}_index" for i in range(len(k_nn_indices[0]))])
k_nn_distances_df = DataFrame(k_nn_distances, columns=[f"k_nn_{i+1}_distance" for i in range(len(k_nn_distances[0]))])

neighbours_dataframe = concat([k_nn_idices_df, k_nn_distances_df], axis=1)
neighbours_dataframe.index = instrinsic_pairwise_distances.queries

neighbours_dataframe.round(1)

	k_nn_1_index	k_nn_2_index	k_nn_2_distance
4019ffc973586d62bfa9adebf209bb04	0	6	3.2
437dcdf95882266f90085369b9a4258f	1	6	3.1
4c4dcc8dd8b86a7e8c4efb3ece2653ac	2	6	2.1
528a722b3be1f711e27efc09dca5b2d7	3	4	3.6
b17640be6d2ed8dacb61f48fd40996c2	4	6	2.3
c12ad1255bc3ed843ae21de938ab5f62	5	4	2.9
d91f2da05c84147f9fcfe5121b21777f	6	2	2.1

Since it is also nice to see which identifier sits at index N, we can map the indices to identifiers and add them to our dataframe:

neighbours_dataframe['k_nn_1_identifier'] = neighbours_dataframe.k_nn_1_index.map(
    lambda x: instrinsic_pairwise_distances.references[x])

neighbours_dataframe['k_nn_2_identifier'] = neighbours_dataframe.k_nn_2_index.map(
    lambda x: instrinsic_pairwise_distances.references[x])

neighbours_dataframe.round(1)

	k_nn_1_index	k_nn_2_index	k_nn_2_distance	k_nn_1_identifier	k_nn_2_identifier
4019ffc973586d62bfa9adebf209bb04	0	6	3.2	4019ffc973586d62bfa9adebf209bb04	d91f2da05c84147f9fcfe5121b21777f
437dcdf95882266f90085369b9a4258f	1	6	3.1	437dcdf95882266f90085369b9a4258f	d91f2da05c84147f9fcfe5121b21777f
4c4dcc8dd8b86a7e8c4efb3ece2653ac	2	6	2.1	4c4dcc8dd8b86a7e8c4efb3ece2653ac	d91f2da05c84147f9fcfe5121b21777f
528a722b3be1f711e27efc09dca5b2d7	3	4	3.6	528a722b3be1f711e27efc09dca5b2d7	b17640be6d2ed8dacb61f48fd40996c2
b17640be6d2ed8dacb61f48fd40996c2	4	6	2.3	b17640be6d2ed8dacb61f48fd40996c2	d91f2da05c84147f9fcfe5121b21777f
c12ad1255bc3ed843ae21de938ab5f62	5	4	2.9	c12ad1255bc3ed843ae21de938ab5f62	b17640be6d2ed8dacb61f48fd40996c2
d91f2da05c84147f9fcfe5121b21777f	6	2	2.1	d91f2da05c84147f9fcfe5121b21777f	4c4dcc8dd8b86a7e8c4efb3ece2653ac
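The mapping above is written out once per neighbour. For larger k, one option is to translate the whole index array in a single step with NumPy indexing. This is a sketch on invented identifiers and indices, not a replacement for the cells above.

```python
import numpy as np

# Hypothetical reference identifiers and neighbour indices (3 queries, k = 2).
references = ["ref_a", "ref_b", "ref_c"]
k_nn_indices = np.array([[0, 2], [1, 2], [2, 1]])

# Indexing an array of identifiers with the index array yields an array
# of the same shape holding the identifier found at each index.
k_nn_identifiers = np.asarray(references)[k_nn_indices]

# Column i holds the identifier of the (i + 1)-th nearest neighbour.
for i in range(k_nn_identifiers.shape[1]):
    print(f"k_nn_{i + 1}_identifier", k_nn_identifiers[:, i].tolist())
```

Each column of the result can then be assigned to a dataframe column, in the same way as the two map calls above.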

“Hey, Christian: this is all nice and stuff, but I wanna compare two different sets!”

Alrighty: same as above, but using two different embedding files. Notice that we call the first embedding file the “query” embedding file, while the second embedding file is the “reference”. We use these terms internally because most of the time you are not just interested in the closest embedding, but in some property of the sequence that produced that closest embedding, i.e. the “reference” sequence.

instrinsic_pairwise_distances = pairwise_distance_matrix_from_embeddings_and_annotations(
    'reduced_embeddings_file.h5',
    'disprot_reduced_embeddings_file.h5',
)

distances_dataframe = DataFrame(instrinsic_pairwise_distances.pairwise_matrix,
                                index = instrinsic_pairwise_distances.queries,
                                columns= instrinsic_pairwise_distances.references)

display(distances_dataframe.round(1))

k = 2
k_nn_indices, k_nn_distances = get_k_nearest_neighbours(instrinsic_pairwise_distances.pairwise_matrix, k)
k_nn_idices_df = DataFrame(k_nn_indices, columns=[f"k_nn_{i+1}_index" for i in range(len(k_nn_indices[0]))])
k_nn_distances_df = DataFrame(k_nn_distances, columns=[f"k_nn_{i+1}_distance" for i in range(len(k_nn_distances[0]))])

neighbours_dataframe = concat([k_nn_idices_df, k_nn_distances_df], axis=1)
neighbours_dataframe.index = instrinsic_pairwise_distances.queries

neighbours_dataframe['k_nn_1_identifier'] = neighbours_dataframe.k_nn_1_index.map(
    lambda x: instrinsic_pairwise_distances.references[x])

neighbours_dataframe['k_nn_2_identifier'] = neighbours_dataframe.k_nn_2_index.map(
    lambda x: instrinsic_pairwise_distances.references[x])

display(neighbours_dataframe.round(3))

	0011ab0c11c7fea51fefcd039b1b69f5	003b243d0117cbaf2b7434184c409b06	004cef7b0dae937e6d722817c17ed889	0115b4447d6911651804d1303bf5f272	0140b3ec6cba5734a909c1d734e48ea0	01de4461209b76f57819919dc38faa99	025559e7b85ed1448d34e61221a54788	0297e58178e3f37ecaa080f23a5efc20	02bfb6e7933ff7691596243b73dbe6c9	02c8fa126578dbabb41f3ac9dc9a5048	...	fe15f12a73627e411d081ac2c75a08d2	fe275ea1d14453f6855577d120934fe1	fe3e07d970ca830a34d91a3fa7e4f9e2	fe3fce3b83de1afdf00394a4e639dd5e	fe462c3364a5d79fdfbef7b4befc69a5	fe6c516055388ef0d8d7ab8b7dd60c45	fe736d03c479e9681251e5d1e155ed85	fe81b26a3d8796c10404d6d10915c3f5	fec303c471974bdfa29b7b32697b39ef	feeb0ed7e8d9d82d58e3578c1ae7fcee
4019ffc973586d62bfa9adebf209bb04	3.6	4.1	3.9	4.6	4.6	3.4	4.1	8.3	4.0	3.9	...	3.8	4.6	3.8	7.1	4.6	5.2	8.3	4.0	4.5	5.1
437dcdf95882266f90085369b9a4258f	3.1	3.7	3.1	4.2	4.0	3.3	3.4	7.8	3.4	3.3	...	3.3	3.6	3.4	6.6	3.8	4.1	8.0	3.0	3.7	4.3
4c4dcc8dd8b86a7e8c4efb3ece2653ac	2.1	3.2	2.7	3.8	3.4	2.1	3.2	7.9	2.5	2.1	...	2.5	3.9	2.9	6.5	3.9	4.2	8.0	2.7	3.4	4.3
528a722b3be1f711e27efc09dca5b2d7	3.8	4.7	4.0	4.9	4.8	3.9	4.4	8.3	4.3	3.7	...	4.3	4.7	4.2	7.4	4.7	5.4	8.7	4.2	4.7	5.0
b17640be6d2ed8dacb61f48fd40996c2	2.1	3.6	2.1	3.9	3.8	2.5	3.2	8.1	2.8	2.7	...	2.8	3.6	2.9	6.4	3.2	4.2	7.9	2.6	3.7	4.2
c12ad1255bc3ed843ae21de938ab5f62	2.9	3.9	3.2	4.2	3.9	3.2	3.4	8.1	3.3	3.4	...	3.3	3.6	3.4	6.6	3.6	4.5	8.0	3.1	3.9	4.7
d91f2da05c84147f9fcfe5121b21777f	2.0	3.2	2.5	3.3	3.1	1.9	2.8	7.7	2.3	2.5	...	2.2	3.5	2.7	6.3	3.3	4.2	7.9	2.5	2.9	4.1

7 rows × 1162 columns

	k_nn_1_index	k_nn_2_index	k_nn_1_distance	k_nn_2_distance	k_nn_1_identifier	k_nn_2_identifier
4019ffc973586d62bfa9adebf209bb04	859	1005	3.201	3.231	bc78ea9e725fe8b819c32f1438aa0be9	de3105e4d61489ce082a5e6ca9b802e7
437dcdf95882266f90085369b9a4258f	435	763	2.872	2.947	5d733f6264797054e7f2daa324e78e95	a7bedcc53366bb2f2e4cf968188d5886
4c4dcc8dd8b86a7e8c4efb3ece2653ac	880	190	1.765	1.872	c13503898cf51733b81cdf19e5d521e8	29bd1227262429b3c72b158d448c3ccc
528a722b3be1f711e27efc09dca5b2d7	356	343	3.238	3.436	4bd8fd185ad7e5d852f510a0dc1b94b2	49350ec6dd02aabc8eb8cabc10476a4f
b17640be6d2ed8dacb61f48fd40996c2	902	953	1.830	1.883	c67f43b4c271800bf1b86b75fd06aaf7	d20782131bcd520f911265e8bd79ee94
c12ad1255bc3ed843ae21de938ab5f62	793	832	2.725	2.750	af2a36ab30bc54843acdb759ea61df71	b78f477508a2fb1765530101e9e9e861
d91f2da05c84147f9fcfe5121b21777f	148	5	1.833	1.945	231a587783f7bca8795a23abc3269646	01de4461209b76f57819919dc38faa99
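With the nearest reference known for each query, annotation transfer amounts to looking up the property of that reference and assigning it to the query. Here is a minimal sketch of that last step. The identifier pairs are copied from the first two rows of the table above, while the annotation labels are invented for illustration; no real annotation file is read.

```python
# Nearest reference (k_nn_1_identifier) for two of the queries in the table above.
nearest_reference = {
    "4019ffc973586d62bfa9adebf209bb04": "bc78ea9e725fe8b819c32f1438aa0be9",
    "437dcdf95882266f90085369b9a4258f": "5d733f6264797054e7f2daa324e78e95",
}

# HYPOTHETICAL annotations: the labels are made up, and the second
# reference is deliberately left without an annotation.
reference_annotations = {
    "bc78ea9e725fe8b819c32f1438aa0be9": "label_A",
}

# Transfer: each query inherits the annotation of its nearest reference,
# or None when that reference has no annotation.
transferred = {
    query: reference_annotations.get(reference)
    for query, reference in nearest_reference.items()
}

for query, annotation in transferred.items():
    print(query, annotation)
```

A common refinement, not shown here, is to transfer an annotation only when the neighbour distance is below a chosen cutoff, since a distant nearest neighbour may say little about the query.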