Colab initialization
Install the pipeline in the Colab runtime
Download the files necessary for this example
!pip3 install -U pip > /dev/null
!pip3 install -U bio_embeddings[all] > /dev/null
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/reduced_embeddings_file.h5 --output-document reduced_embeddings_file.h5
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/disprot/reduced_embeddings_file.h5 --output-document disprot_reduced_embeddings_file.h5
Pairwise distances between embeddings, and nearest neighbour annotation transfer
In the following, we will compute pairwise distances between sets of embeddings. This borrows ideas from the extract stage using the unsupervised protocol.
But, we can do more than just that!
For example, we can check similarity within one dataset:
import numpy as np
from bio_embeddings.extract import pairwise_distance_matrix_from_embeddings_and_annotations, get_k_nearest_neighbours
from pandas import DataFrame, concat
instrinsic_pairwise_distances = pairwise_distance_matrix_from_embeddings_and_annotations(
'reduced_embeddings_file.h5',
'reduced_embeddings_file.h5',
)
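To make the idea of a pairwise distance matrix concrete, here is a minimal NumPy sketch that does not use bio_embeddings at all. It assumes Euclidean distance; the metric actually applied by pairwise_distance_matrix_from_embeddings_and_annotations may differ depending on its arguments. The function name euclidean_distance_matrix and the toy vectors are made up for illustration.

```python
import numpy as np

def euclidean_distance_matrix(queries, references):
    """Return a (n_queries, n_references) matrix of Euclidean distances."""
    queries = np.asarray(queries, dtype=float)
    references = np.asarray(references, dtype=float)
    # Broadcast to shape (n_queries, n_references, n_dims), then reduce over dims.
    differences = queries[:, np.newaxis, :] - references[np.newaxis, :, :]
    return np.sqrt((differences ** 2).sum(axis=-1))

# Toy 2-dimensional "embeddings" (hypothetical values)
toy_embeddings = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
toy_matrix = euclidean_distance_matrix(toy_embeddings, toy_embeddings)
print(toy_matrix)
```

Row i, column j holds the distance between query i and reference j, so comparing a set against itself yields a symmetric matrix with zeros on the diagonal.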
Let’s transform this into a nice dataframe with labelled columns and rows, so that we know whose distances we are comparing:
distances_dataframe = DataFrame(instrinsic_pairwise_distances.pairwise_matrix,
index = instrinsic_pairwise_distances.queries,
columns= instrinsic_pairwise_distances.references)
IMPORTANT
You will see that in the following we use the round(1) function, which rounds floating-point numbers to one decimal place. We do so because otherwise the distance between a sequence and itself (the diagonal) is sometimes not exactly zero, due to floating-point precision.
If you print the dataframe without rounding, you will most likely see something like 1.032383e-07 for position (1,1) in the matrix. This, however, is very close to 0.
distances_dataframe.round(1)
query | 4019ffc973586d62bfa9adebf209bb04 | 437dcdf95882266f90085369b9a4258f | 4c4dcc8dd8b86a7e8c4efb3ece2653ac | 528a722b3be1f711e27efc09dca5b2d7 | b17640be6d2ed8dacb61f48fd40996c2 | c12ad1255bc3ed843ae21de938ab5f62 | d91f2da05c84147f9fcfe5121b21777f |
---|---|---|---|---|---|---|---|
4019ffc973586d62bfa9adebf209bb04 | 0.0 | 4.2 | 3.5 | 4.3 | 3.6 | 4.3 | 3.2 |
437dcdf95882266f90085369b9a4258f | 4.2 | 0.0 | 3.3 | 4.3 | 3.3 | 3.7 | 3.1 |
4c4dcc8dd8b86a7e8c4efb3ece2653ac | 3.5 | 3.3 | 0.0 | 3.6 | 2.4 | 3.2 | 2.1 |
528a722b3be1f711e27efc09dca5b2d7 | 4.3 | 4.3 | 3.6 | 0.0 | 3.6 | 4.5 | 3.7 |
b17640be6d2ed8dacb61f48fd40996c2 | 3.6 | 3.3 | 2.4 | 3.6 | 0.0 | 2.9 | 2.3 |
c12ad1255bc3ed843ae21de938ab5f62 | 4.3 | 3.7 | 3.2 | 4.5 | 2.9 | 0.0 | 3.0 |
d91f2da05c84147f9fcfe5121b21777f | 3.2 | 3.1 | 2.1 | 3.7 | 2.3 | 3.0 | 0.0 |
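The non-zero diagonal mentioned above can be reproduced with a small NumPy sketch. It assumes the distances are computed through the expansion ||a||² + ||b||² - 2·a·b, a common fast formulation; whether bio_embeddings computes them this way internally is an assumption. The random vectors are toy stand-ins for embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 128))  # toy stand-ins for embeddings

# Squared distances via ||a||^2 + ||b||^2 - 2 * (a . b)
squared_norms = (vectors ** 2).sum(axis=1)
squared = squared_norms[:, None] + squared_norms[None, :] - 2 * (vectors @ vectors.T)
# Rounding errors can push values slightly below zero, so clip before the root.
distances = np.sqrt(np.maximum(squared, 0.0))

diagonal = np.diag(distances)
print(diagonal)           # tiny values, not necessarily exactly 0
print(diagonal.round(1))  # 0.0 everywhere after rounding
print(np.allclose(diagonal, 0.0, atol=1e-5))
```

If you need to test for self-matches programmatically, comparing with a tolerance (as np.allclose does here) is more robust than checking for equality with zero.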
Next, we want to find the k nearest neighbours.
Since we are comparing sequences against themselves, the first nearest neighbour will always be the sequence itself. Therefore, we pass k = 2 to the following function so that it also returns the second nearest neighbour.
The following function returns two results: the indices of the nearest neighbours, as well as the distances to them.
k = 2
k_nn_indices, k_nn_distances = get_k_nearest_neighbours(instrinsic_pairwise_distances.pairwise_matrix, k)
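For intuition, a k-nearest-neighbour lookup on a distance matrix can be sketched with plain NumPy as follows. This is not the implementation of get_k_nearest_neighbours; the function name k_nearest_neighbours_sketch and the toy matrix are hypothetical.

```python
import numpy as np

def k_nearest_neighbours_sketch(distance_matrix, k):
    """Return (indices, distances) of the k smallest entries in each row."""
    distance_matrix = np.asarray(distance_matrix, dtype=float)
    # Sort each row by distance and keep the first k column indices.
    indices = np.argsort(distance_matrix, axis=1, kind="stable")[:, :k]
    # Pick the distances that belong to those indices, row by row.
    distances = np.take_along_axis(distance_matrix, indices, axis=1)
    return indices, distances

# Toy self-comparison matrix (hypothetical values)
toy_matrix = np.array([
    [0.0, 2.0, 5.0],
    [2.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
])
toy_indices, toy_distances = k_nearest_neighbours_sketch(toy_matrix, k=2)
print(toy_indices)
print(toy_distances)
```

In this self-comparison the first column of toy_indices is each row's own index, which is why the second column is the one of interest.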
As before, let’s put this into a dataframe with index and column names, so we understand what’s going on:
k_nn_indices_df = DataFrame(k_nn_indices, columns=[f"k_nn_{i+1}_index" for i in range(len(k_nn_indices[0]))])
k_nn_distances_df = DataFrame(k_nn_distances, columns=[f"k_nn_{i+1}_distance" for i in range(len(k_nn_distances[0]))])
neighbours_dataframe = concat([k_nn_indices_df, k_nn_distances_df], axis=1)
neighbours_dataframe.index = instrinsic_pairwise_distances.queries
neighbours_dataframe.round(1)
query | k_nn_1_index | k_nn_2_index | k_nn_1_distance | k_nn_2_distance |
---|---|---|---|---|
4019ffc973586d62bfa9adebf209bb04 | 0 | 6 | 0.0 | 3.2 |
437dcdf95882266f90085369b9a4258f | 1 | 6 | 0.0 | 3.1 |
4c4dcc8dd8b86a7e8c4efb3ece2653ac | 2 | 6 | 0.0 | 2.1 |
528a722b3be1f711e27efc09dca5b2d7 | 3 | 4 | 0.0 | 3.6 |
b17640be6d2ed8dacb61f48fd40996c2 | 4 | 6 | 0.0 | 2.3 |
c12ad1255bc3ed843ae21de938ab5f62 | 5 | 4 | 0.0 | 2.9 |
d91f2da05c84147f9fcfe5121b21777f | 6 | 2 | 0.0 | 2.1 |
Since it is also useful to see which identifier sits at index N, we can simply map the indices onto identifiers in our dataframe:
neighbours_dataframe['k_nn_1_identifier'] = neighbours_dataframe.k_nn_1_index.map(
lambda x: instrinsic_pairwise_distances.references[x])
neighbours_dataframe['k_nn_2_identifier'] = neighbours_dataframe.k_nn_2_index.map(
lambda x: instrinsic_pairwise_distances.references[x])
neighbours_dataframe.round(1)
query | k_nn_1_index | k_nn_2_index | k_nn_1_distance | k_nn_2_distance | k_nn_1_identifier | k_nn_2_identifier |
---|---|---|---|---|---|---|
4019ffc973586d62bfa9adebf209bb04 | 0 | 6 | 0.0 | 3.2 | 4019ffc973586d62bfa9adebf209bb04 | d91f2da05c84147f9fcfe5121b21777f |
437dcdf95882266f90085369b9a4258f | 1 | 6 | 0.0 | 3.1 | 437dcdf95882266f90085369b9a4258f | d91f2da05c84147f9fcfe5121b21777f |
4c4dcc8dd8b86a7e8c4efb3ece2653ac | 2 | 6 | 0.0 | 2.1 | 4c4dcc8dd8b86a7e8c4efb3ece2653ac | d91f2da05c84147f9fcfe5121b21777f |
528a722b3be1f711e27efc09dca5b2d7 | 3 | 4 | 0.0 | 3.6 | 528a722b3be1f711e27efc09dca5b2d7 | b17640be6d2ed8dacb61f48fd40996c2 |
b17640be6d2ed8dacb61f48fd40996c2 | 4 | 6 | 0.0 | 2.3 | b17640be6d2ed8dacb61f48fd40996c2 | d91f2da05c84147f9fcfe5121b21777f |
c12ad1255bc3ed843ae21de938ab5f62 | 5 | 4 | 0.0 | 2.9 | c12ad1255bc3ed843ae21de938ab5f62 | b17640be6d2ed8dacb61f48fd40996c2 |
d91f2da05c84147f9fcfe5121b21777f | 6 | 2 | 0.0 | 2.1 | d91f2da05c84147f9fcfe5121b21777f | 4c4dcc8dd8b86a7e8c4efb3ece2653ac |
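The index-to-identifier lookup shown above can also be done for all neighbour columns at once with NumPy integer-array indexing. The following is a minimal sketch; the identifiers and indices are hypothetical toy values.

```python
import numpy as np

# Hypothetical identifiers and neighbour indices, for illustration only
reference_ids = np.array(["seq_a", "seq_b", "seq_c"])
neighbour_indices = np.array([[0, 1], [1, 2], [2, 1]])

# Indexing with an integer array returns identifiers in the same shape.
neighbour_ids = reference_ids[neighbour_indices]
# In a self-comparison, dropping the first column discards the self-match.
non_self_ids = neighbour_ids[:, 1:]

print(neighbour_ids)
print(non_self_ids)
```

This avoids writing one map call per neighbour column, which becomes convenient for larger values of k.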
“Hey, Christian: this is all nice and stuff, but I wanna compare two different sets!”
Alrighty, the procedure is the same as above, but with two different embedding files. Notice that we call the first embedding file the “query” embedding file, while the second is the “reference”. We use these terms internally because most of the time you are not just interested in the closest embedding, but in some property of the sequence that produced it, i.e. the sequence behind the “reference” embedding.
# Distances between every query embedding and every reference embedding
pairwise_distances = pairwise_distance_matrix_from_embeddings_and_annotations(
    'reduced_embeddings_file.h5',
    'disprot_reduced_embeddings_file.h5',
)
distances_dataframe = DataFrame(pairwise_distances.pairwise_matrix,
                                index=pairwise_distances.queries,
                                columns=pairwise_distances.references)
display(distances_dataframe.round(1))
# The two nearest references for each query
k = 2
k_nn_indices, k_nn_distances = get_k_nearest_neighbours(pairwise_distances.pairwise_matrix, k)
k_nn_indices_df = DataFrame(k_nn_indices, columns=[f"k_nn_{i+1}_index" for i in range(len(k_nn_indices[0]))])
k_nn_distances_df = DataFrame(k_nn_distances, columns=[f"k_nn_{i+1}_distance" for i in range(len(k_nn_distances[0]))])
neighbours_dataframe = concat([k_nn_indices_df, k_nn_distances_df], axis=1)
neighbours_dataframe.index = pairwise_distances.queries
# Map the neighbour indices onto the reference identifiers
neighbours_dataframe['k_nn_1_identifier'] = neighbours_dataframe.k_nn_1_index.map(
    lambda x: pairwise_distances.references[x])
neighbours_dataframe['k_nn_2_identifier'] = neighbours_dataframe.k_nn_2_index.map(
    lambda x: pairwise_distances.references[x])
display(neighbours_dataframe.round(3))
query | 0011ab0c11c7fea51fefcd039b1b69f5 | 003b243d0117cbaf2b7434184c409b06 | 004cef7b0dae937e6d722817c17ed889 | 0115b4447d6911651804d1303bf5f272 | 0140b3ec6cba5734a909c1d734e48ea0 | 01de4461209b76f57819919dc38faa99 | 025559e7b85ed1448d34e61221a54788 | 0297e58178e3f37ecaa080f23a5efc20 | 02bfb6e7933ff7691596243b73dbe6c9 | 02c8fa126578dbabb41f3ac9dc9a5048 | ... | fe15f12a73627e411d081ac2c75a08d2 | fe275ea1d14453f6855577d120934fe1 | fe3e07d970ca830a34d91a3fa7e4f9e2 | fe3fce3b83de1afdf00394a4e639dd5e | fe462c3364a5d79fdfbef7b4befc69a5 | fe6c516055388ef0d8d7ab8b7dd60c45 | fe736d03c479e9681251e5d1e155ed85 | fe81b26a3d8796c10404d6d10915c3f5 | fec303c471974bdfa29b7b32697b39ef | feeb0ed7e8d9d82d58e3578c1ae7fcee |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4019ffc973586d62bfa9adebf209bb04 | 3.6 | 4.1 | 3.9 | 4.6 | 4.6 | 3.4 | 4.1 | 8.3 | 4.0 | 3.9 | ... | 3.8 | 4.6 | 3.8 | 7.1 | 4.6 | 5.2 | 8.3 | 4.0 | 4.5 | 5.1 |
437dcdf95882266f90085369b9a4258f | 3.1 | 3.7 | 3.1 | 4.2 | 4.0 | 3.3 | 3.4 | 7.8 | 3.4 | 3.3 | ... | 3.3 | 3.6 | 3.4 | 6.6 | 3.8 | 4.1 | 8.0 | 3.0 | 3.7 | 4.3 |
4c4dcc8dd8b86a7e8c4efb3ece2653ac | 2.1 | 3.2 | 2.7 | 3.8 | 3.4 | 2.1 | 3.2 | 7.9 | 2.5 | 2.1 | ... | 2.5 | 3.9 | 2.9 | 6.5 | 3.9 | 4.2 | 8.0 | 2.7 | 3.4 | 4.3 |
528a722b3be1f711e27efc09dca5b2d7 | 3.8 | 4.7 | 4.0 | 4.9 | 4.8 | 3.9 | 4.4 | 8.3 | 4.3 | 3.7 | ... | 4.3 | 4.7 | 4.2 | 7.4 | 4.7 | 5.4 | 8.7 | 4.2 | 4.7 | 5.0 |
b17640be6d2ed8dacb61f48fd40996c2 | 2.1 | 3.6 | 2.1 | 3.9 | 3.8 | 2.5 | 3.2 | 8.1 | 2.8 | 2.7 | ... | 2.8 | 3.6 | 2.9 | 6.4 | 3.2 | 4.2 | 7.9 | 2.6 | 3.7 | 4.2 |
c12ad1255bc3ed843ae21de938ab5f62 | 2.9 | 3.9 | 3.2 | 4.2 | 3.9 | 3.2 | 3.4 | 8.1 | 3.3 | 3.4 | ... | 3.3 | 3.6 | 3.4 | 6.6 | 3.6 | 4.5 | 8.0 | 3.1 | 3.9 | 4.7 |
d91f2da05c84147f9fcfe5121b21777f | 2.0 | 3.2 | 2.5 | 3.3 | 3.1 | 1.9 | 2.8 | 7.7 | 2.3 | 2.5 | ... | 2.2 | 3.5 | 2.7 | 6.3 | 3.3 | 4.2 | 7.9 | 2.5 | 2.9 | 4.1 |
7 rows × 1162 columns
query | k_nn_1_index | k_nn_2_index | k_nn_1_distance | k_nn_2_distance | k_nn_1_identifier | k_nn_2_identifier |
---|---|---|---|---|---|---|
4019ffc973586d62bfa9adebf209bb04 | 859 | 1005 | 3.201 | 3.231 | bc78ea9e725fe8b819c32f1438aa0be9 | de3105e4d61489ce082a5e6ca9b802e7 |
437dcdf95882266f90085369b9a4258f | 435 | 763 | 2.872 | 2.947 | 5d733f6264797054e7f2daa324e78e95 | a7bedcc53366bb2f2e4cf968188d5886 |
4c4dcc8dd8b86a7e8c4efb3ece2653ac | 880 | 190 | 1.765 | 1.872 | c13503898cf51733b81cdf19e5d521e8 | 29bd1227262429b3c72b158d448c3ccc |
528a722b3be1f711e27efc09dca5b2d7 | 356 | 343 | 3.238 | 3.436 | 4bd8fd185ad7e5d852f510a0dc1b94b2 | 49350ec6dd02aabc8eb8cabc10476a4f |
b17640be6d2ed8dacb61f48fd40996c2 | 902 | 953 | 1.830 | 1.883 | c67f43b4c271800bf1b86b75fd06aaf7 | d20782131bcd520f911265e8bd79ee94 |
c12ad1255bc3ed843ae21de938ab5f62 | 793 | 832 | 2.725 | 2.750 | af2a36ab30bc54843acdb759ea61df71 | b78f477508a2fb1765530101e9e9e861 |
d91f2da05c84147f9fcfe5121b21777f | 148 | 5 | 1.833 | 1.945 | 231a587783f7bca8795a23abc3269646 | 01de4461209b76f57819919dc38faa99 |
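With the nearest reference known for each query, annotation transfer amounts to copying the reference's annotation onto the query. The sketch below illustrates this with plain dictionaries. All identifiers, distances and labels are made up, and the function transfer_annotations with its optional max_distance cutoff is a hypothetical helper, not part of bio_embeddings.

```python
# Hypothetical data: nearest reference identifier and distance for each query
nearest_reference = {
    "query_1": ("ref_a", 1.8),
    "query_2": ("ref_b", 3.2),
}
# Hypothetical annotations known for the reference sequences
reference_annotations = {"ref_a": "disordered", "ref_b": "ordered"}

def transfer_annotations(nearest, annotations, max_distance=None):
    """Copy each query's annotation from its nearest reference.

    If max_distance is given, neighbours farther away are skipped.
    """
    transferred = {}
    for query_id, (reference_id, distance) in nearest.items():
        if max_distance is not None and distance > max_distance:
            continue  # neighbour too far away to trust the transfer
        transferred[query_id] = annotations[reference_id]
    return transferred

all_transfers = transfer_annotations(nearest_reference, reference_annotations)
close_transfers = transfer_annotations(nearest_reference, reference_annotations, max_distance=2.0)
print(all_transfers)
print(close_transfers)
```

A distance cutoff like this is one possible way to avoid transferring annotations from references that are not actually similar; a sensible threshold depends on the embeddings and the annotation in question.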