{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Colab initialization\n",
"- install the pipeline in the Colab runtime\n",
"- download the files necessary for this example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip3 install -U pip > /dev/null\n",
"!pip3 install -U \"bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git\" > /dev/null"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/reduced_embeddings_file.h5 --output-document reduced_embeddings_file.h5\n",
"!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/disprot/reduced_embeddings_file.h5 --output-document disprot_reduced_embeddings_file.h5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pairwise distances between embeddings, and nearest neighbour annotation transfer\n",
"\n",
"In the following, we will compute pairwise distances between sets of embeddings. This borrows ideas from the `unsupervised` protocol of the `extract` stage.\n",
"\n",
"But we can do more than just that!\n",
"\n",
"For example, we can check similarity within one dataset:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"from bio_embeddings.extract import pairwise_distance_matrix_from_embeddings_and_annotations, get_k_nearest_neighbours\n",
"from pandas import DataFrame, concat"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"instrinsic_pairwise_distances = pairwise_distance_matrix_from_embeddings_and_annotations(\n",
" 'reduced_embeddings_file.h5',\n",
" 'reduced_embeddings_file.h5',\n",
")"
]
},
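{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make explicit what a pairwise distance matrix is, the next cell builds one by hand on toy data. This is only a sketch: the arrays are made up, the Euclidean metric is an assumption chosen for illustration, and it is not meant to reproduce how `pairwise_distance_matrix_from_embeddings_and_annotations` works internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Toy stand-ins for per-protein embeddings (made-up values): 3 queries, 2 references, 4 dimensions\n",
"toy_queries = np.array([[0.0, 0.0, 0.0, 0.0],\n",
"                        [1.0, 0.0, 0.0, 0.0],\n",
"                        [0.0, 2.0, 0.0, 0.0]])\n",
"toy_references = np.array([[0.0, 0.0, 0.0, 0.0],\n",
"                           [3.0, 4.0, 0.0, 0.0]])\n",
"\n",
"# Broadcasting gives differences of shape (n_queries, n_references, n_dimensions)\n",
"differences = toy_queries[:, None, :] - toy_references[None, :, :]\n",
"\n",
"# Euclidean norm along the last axis: entry [i, j] is the distance from query i to reference j\n",
"toy_matrix = np.linalg.norm(differences, axis=-1)\n",
"toy_matrix.shape"
]
},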
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's transform this into a nice dataframe with labelled rows and columns, so that we know whose distances we are comparing:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"distances_dataframe = DataFrame(\n",
"    instrinsic_pairwise_distances.pairwise_matrix,\n",
"    index=instrinsic_pairwise_distances.queries,\n",
"    columns=instrinsic_pairwise_distances.references,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**IMPORTANT**\n",
"\n",
"You will see that in the following we use the `round(1)` method, which rounds floating point numbers to one decimal place. We do so because otherwise the distance between a sequence and itself (the diagonal) is sometimes NOT exactly zero (due to floating point precision).\n",
"\n",
"If you print the dataframe without rounding, you will most likely see something like `1.032383e-07` for position (1,1) in the matrix. This, however, is very close to 0."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 4019ffc973586d62bfa9adebf209bb04 \\\n",
"4019ffc973586d62bfa9adebf209bb04 0.0 \n",
"437dcdf95882266f90085369b9a4258f 4.2 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.5 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.3 \n",
"b17640be6d2ed8dacb61f48fd40996c2 3.6 \n",
"c12ad1255bc3ed843ae21de938ab5f62 4.3 \n",
"d91f2da05c84147f9fcfe5121b21777f 3.2 \n",
"\n",
" 437dcdf95882266f90085369b9a4258f \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.2 \n",
"437dcdf95882266f90085369b9a4258f 0.0 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.3 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.3 \n",
"b17640be6d2ed8dacb61f48fd40996c2 3.3 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.7 \n",
"d91f2da05c84147f9fcfe5121b21777f 3.1 \n",
"\n",
" 4c4dcc8dd8b86a7e8c4efb3ece2653ac \\\n",
"4019ffc973586d62bfa9adebf209bb04 3.5 \n",
"437dcdf95882266f90085369b9a4258f 3.3 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 0.0 \n",
"528a722b3be1f711e27efc09dca5b2d7 3.6 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.4 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.2 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.1 \n",
"\n",
" 528a722b3be1f711e27efc09dca5b2d7 \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.3 \n",
"437dcdf95882266f90085369b9a4258f 4.3 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.6 \n",
"528a722b3be1f711e27efc09dca5b2d7 0.0 \n",
"b17640be6d2ed8dacb61f48fd40996c2 3.6 \n",
"c12ad1255bc3ed843ae21de938ab5f62 4.5 \n",
"d91f2da05c84147f9fcfe5121b21777f 3.7 \n",
"\n",
" b17640be6d2ed8dacb61f48fd40996c2 \\\n",
"4019ffc973586d62bfa9adebf209bb04 3.6 \n",
"437dcdf95882266f90085369b9a4258f 3.3 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.4 \n",
"528a722b3be1f711e27efc09dca5b2d7 3.6 \n",
"b17640be6d2ed8dacb61f48fd40996c2 0.0 \n",
"c12ad1255bc3ed843ae21de938ab5f62 2.9 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.3 \n",
"\n",
" c12ad1255bc3ed843ae21de938ab5f62 \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.3 \n",
"437dcdf95882266f90085369b9a4258f 3.7 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.2 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.5 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.9 \n",
"c12ad1255bc3ed843ae21de938ab5f62 0.0 \n",
"d91f2da05c84147f9fcfe5121b21777f 3.0 \n",
"\n",
" d91f2da05c84147f9fcfe5121b21777f \n",
"4019ffc973586d62bfa9adebf209bb04 3.2 \n",
"437dcdf95882266f90085369b9a4258f 3.1 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 \n",
"528a722b3be1f711e27efc09dca5b2d7 3.7 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.3 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.0 \n",
"d91f2da05c84147f9fcfe5121b21777f 0.0 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"distances_dataframe.round(1)"
]
},
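{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rounding is handy for display, but for a programmatic check a tolerance-based comparison is usually more robust. A minimal sketch on a made-up matrix (the tolerance `1e-5` is an arbitrary choice for this example, not a value taken from the pipeline):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Made-up self-distance matrix whose diagonal carries floating point noise\n",
"toy_self_distances = np.array([[1.032383e-07, 4.2],\n",
"                               [4.2, 0.0]])\n",
"\n",
"# An exact comparison fails for the noisy diagonal entry ...\n",
"exactly_zero = np.diag(toy_self_distances) == 0\n",
"\n",
"# ... while a tolerance-based comparison treats it as zero\n",
"diagonal_is_zero = np.allclose(np.diag(toy_self_distances), 0, atol=1e-5)\n",
"exactly_zero, diagonal_is_zero"
]
},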
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we want to find the k nearest neighbours.\n",
"\n",
"Since we are comparing sequences against themselves, the first nearest neighbour will always be the sequence itself. Therefore we pass `k = 2` to the following function in order to also get the second nearest neighbour.\n",
"\n",
"The function returns two results: the indices of the nearest neighbours and the distances to them."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"k = 2\n",
"k_nn_indices, k_nn_distances = get_k_nearest_neighbours(instrinsic_pairwise_distances.pairwise_matrix, k)"
]
},
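{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate what \"indices and distances of the k nearest neighbours\" means, here is a NumPy sketch on a made-up distance matrix. It is not the implementation of `get_k_nearest_neighbours`, just one possible way to obtain the same kind of result; all `toy_` names are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Made-up distance matrix: 3 queries (rows) against the same 3 sequences as references (columns)\n",
"toy_distances = np.array([[0.0, 4.2, 3.5],\n",
"                          [4.2, 0.0, 3.3],\n",
"                          [3.5, 3.3, 0.0]])\n",
"toy_k = 2\n",
"\n",
"# Sort each row by distance and keep the column indices of the toy_k smallest entries\n",
"toy_nn_indices = np.argsort(toy_distances, axis=1)[:, :toy_k]\n",
"\n",
"# Look up the distances that belong to those indices\n",
"toy_nn_distances = np.take_along_axis(toy_distances, toy_nn_indices, axis=1)\n",
"\n",
"# Column 0 is each sequence itself (distance 0), column 1 its closest other sequence\n",
"toy_nn_indices, toy_nn_distances"
]
},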
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As before, let's put this into a dataframe with index and column names, so we understand what's going on:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"k_nn_indices_df = DataFrame(k_nn_indices, columns=[f\"k_nn_{i+1}_index\" for i in range(len(k_nn_indices[0]))])\n",
"k_nn_distances_df = DataFrame(k_nn_distances, columns=[f\"k_nn_{i+1}_distance\" for i in range(len(k_nn_distances[0]))])\n",
"\n",
"neighbours_dataframe = concat([k_nn_indices_df, k_nn_distances_df], axis=1)\n",
"neighbours_dataframe.index = instrinsic_pairwise_distances.queries"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" k_nn_1_index k_nn_2_index k_nn_1_distance \\\n",
"4019ffc973586d62bfa9adebf209bb04 0 6 0.0 \n",
"437dcdf95882266f90085369b9a4258f 1 6 0.0 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2 6 0.0 \n",
"528a722b3be1f711e27efc09dca5b2d7 3 4 0.0 \n",
"b17640be6d2ed8dacb61f48fd40996c2 4 6 0.0 \n",
"c12ad1255bc3ed843ae21de938ab5f62 5 4 0.0 \n",
"d91f2da05c84147f9fcfe5121b21777f 6 2 0.0 \n",
"\n",
" k_nn_2_distance \n",
"4019ffc973586d62bfa9adebf209bb04 3.2 \n",
"437dcdf95882266f90085369b9a4258f 3.1 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 \n",
"528a722b3be1f711e27efc09dca5b2d7 3.6 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.3 \n",
"c12ad1255bc3ed843ae21de938ab5f62 2.9 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.1 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"neighbours_dataframe.round(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since it's also nice to see which identifier sits at index N, we can simply map the indices onto identifiers in our dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"neighbours_dataframe['k_nn_1_identifier'] = neighbours_dataframe.k_nn_1_index.map(\n",
" lambda x: instrinsic_pairwise_distances.references[x])\n",
"\n",
"neighbours_dataframe['k_nn_2_identifier'] = neighbours_dataframe.k_nn_2_index.map(\n",
" lambda x: instrinsic_pairwise_distances.references[x])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" k_nn_1_index k_nn_2_index k_nn_1_distance \\\n",
"4019ffc973586d62bfa9adebf209bb04 0 6 0.0 \n",
"437dcdf95882266f90085369b9a4258f 1 6 0.0 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2 6 0.0 \n",
"528a722b3be1f711e27efc09dca5b2d7 3 4 0.0 \n",
"b17640be6d2ed8dacb61f48fd40996c2 4 6 0.0 \n",
"c12ad1255bc3ed843ae21de938ab5f62 5 4 0.0 \n",
"d91f2da05c84147f9fcfe5121b21777f 6 2 0.0 \n",
"\n",
" k_nn_2_distance \\\n",
"4019ffc973586d62bfa9adebf209bb04 3.2 \n",
"437dcdf95882266f90085369b9a4258f 3.1 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 \n",
"528a722b3be1f711e27efc09dca5b2d7 3.6 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.3 \n",
"c12ad1255bc3ed843ae21de938ab5f62 2.9 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.1 \n",
"\n",
" k_nn_1_identifier \\\n",
"4019ffc973586d62bfa9adebf209bb04 4019ffc973586d62bfa9adebf209bb04 \n",
"437dcdf95882266f90085369b9a4258f 437dcdf95882266f90085369b9a4258f \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 4c4dcc8dd8b86a7e8c4efb3ece2653ac \n",
"528a722b3be1f711e27efc09dca5b2d7 528a722b3be1f711e27efc09dca5b2d7 \n",
"b17640be6d2ed8dacb61f48fd40996c2 b17640be6d2ed8dacb61f48fd40996c2 \n",
"c12ad1255bc3ed843ae21de938ab5f62 c12ad1255bc3ed843ae21de938ab5f62 \n",
"d91f2da05c84147f9fcfe5121b21777f d91f2da05c84147f9fcfe5121b21777f \n",
"\n",
" k_nn_2_identifier \n",
"4019ffc973586d62bfa9adebf209bb04 d91f2da05c84147f9fcfe5121b21777f \n",
"437dcdf95882266f90085369b9a4258f d91f2da05c84147f9fcfe5121b21777f \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac d91f2da05c84147f9fcfe5121b21777f \n",
"528a722b3be1f711e27efc09dca5b2d7 b17640be6d2ed8dacb61f48fd40996c2 \n",
"b17640be6d2ed8dacb61f48fd40996c2 d91f2da05c84147f9fcfe5121b21777f \n",
"c12ad1255bc3ed843ae21de938ab5f62 b17640be6d2ed8dacb61f48fd40996c2 \n",
"d91f2da05c84147f9fcfe5121b21777f 4c4dcc8dd8b86a7e8c4efb3ece2653ac "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"neighbours_dataframe.round(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\"Hey, Christian: this is all nice and stuff, but I wanna compare two different sets!\"\n",
"\n",
"\n",
"Alrighty, just as above, but using two different embedding files. Notice that we call the first embedding file the \"query\" embedding file, while the second embedding file is the \"reference\". Internally, we use these terms because most of the time you are not just interested in the closest embedding, but in some property of the sequence that produced that closest (\"reference\") embedding."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0011ab0c11c7fea51fefcd039b1b69f5 \\\n",
"4019ffc973586d62bfa9adebf209bb04 3.6 \n",
"437dcdf95882266f90085369b9a4258f 3.1 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 \n",
"528a722b3be1f711e27efc09dca5b2d7 3.8 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.1 \n",
"c12ad1255bc3ed843ae21de938ab5f62 2.9 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.0 \n",
"\n",
" 003b243d0117cbaf2b7434184c409b06 \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.1 \n",
"437dcdf95882266f90085369b9a4258f 3.7 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.2 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.7 \n",
"b17640be6d2ed8dacb61f48fd40996c2 3.6 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.9 \n",
"d91f2da05c84147f9fcfe5121b21777f 3.2 \n",
"\n",
" 004cef7b0dae937e6d722817c17ed889 \\\n",
"4019ffc973586d62bfa9adebf209bb04 3.9 \n",
"437dcdf95882266f90085369b9a4258f 3.1 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.7 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.0 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.1 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.2 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.5 \n",
"\n",
" 0115b4447d6911651804d1303bf5f272 \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.6 \n",
"437dcdf95882266f90085369b9a4258f 4.2 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.8 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.9 \n",
"b17640be6d2ed8dacb61f48fd40996c2 3.9 \n",
"c12ad1255bc3ed843ae21de938ab5f62 4.2 \n",
"d91f2da05c84147f9fcfe5121b21777f 3.3 \n",
"\n",
" 0140b3ec6cba5734a909c1d734e48ea0 \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.6 \n",
"437dcdf95882266f90085369b9a4258f 4.0 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.4 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.8 \n",
"b17640be6d2ed8dacb61f48fd40996c2 3.8 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.9 \n",
"d91f2da05c84147f9fcfe5121b21777f 3.1 \n",
"\n",
" 01de4461209b76f57819919dc38faa99 \\\n",
"4019ffc973586d62bfa9adebf209bb04 3.4 \n",
"437dcdf95882266f90085369b9a4258f 3.3 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 \n",
"528a722b3be1f711e27efc09dca5b2d7 3.9 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.5 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.2 \n",
"d91f2da05c84147f9fcfe5121b21777f 1.9 \n",
"\n",
" 025559e7b85ed1448d34e61221a54788 \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.1 \n",
"437dcdf95882266f90085369b9a4258f 3.4 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.2 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.4 \n",
"b17640be6d2ed8dacb61f48fd40996c2 3.2 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.4 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.8 \n",
"\n",
" 0297e58178e3f37ecaa080f23a5efc20 \\\n",
"4019ffc973586d62bfa9adebf209bb04 8.3 \n",
"437dcdf95882266f90085369b9a4258f 7.8 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 7.9 \n",
"528a722b3be1f711e27efc09dca5b2d7 8.3 \n",
"b17640be6d2ed8dacb61f48fd40996c2 8.1 \n",
"c12ad1255bc3ed843ae21de938ab5f62 8.1 \n",
"d91f2da05c84147f9fcfe5121b21777f 7.7 \n",
"\n",
" 02bfb6e7933ff7691596243b73dbe6c9 \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.0 \n",
"437dcdf95882266f90085369b9a4258f 3.4 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.5 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.3 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.8 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.3 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.3 \n",
"\n",
" 02c8fa126578dbabb41f3ac9dc9a5048 ... \\\n",
"4019ffc973586d62bfa9adebf209bb04 3.9 ... \n",
"437dcdf95882266f90085369b9a4258f 3.3 ... \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 ... \n",
"528a722b3be1f711e27efc09dca5b2d7 3.7 ... \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.7 ... \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.4 ... \n",
"d91f2da05c84147f9fcfe5121b21777f 2.5 ... \n",
"\n",
" fe15f12a73627e411d081ac2c75a08d2 \\\n",
"4019ffc973586d62bfa9adebf209bb04 3.8 \n",
"437dcdf95882266f90085369b9a4258f 3.3 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.5 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.3 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.8 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.3 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.2 \n",
"\n",
" fe275ea1d14453f6855577d120934fe1 \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.6 \n",
"437dcdf95882266f90085369b9a4258f 3.6 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.9 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.7 \n",
"b17640be6d2ed8dacb61f48fd40996c2 3.6 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.6 \n",
"d91f2da05c84147f9fcfe5121b21777f 3.5 \n",
"\n",
" fe3e07d970ca830a34d91a3fa7e4f9e2 \\\n",
"4019ffc973586d62bfa9adebf209bb04 3.8 \n",
"437dcdf95882266f90085369b9a4258f 3.4 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.9 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.2 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.9 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.4 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.7 \n",
"\n",
" fe3fce3b83de1afdf00394a4e639dd5e \\\n",
"4019ffc973586d62bfa9adebf209bb04 7.1 \n",
"437dcdf95882266f90085369b9a4258f 6.6 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 6.5 \n",
"528a722b3be1f711e27efc09dca5b2d7 7.4 \n",
"b17640be6d2ed8dacb61f48fd40996c2 6.4 \n",
"c12ad1255bc3ed843ae21de938ab5f62 6.6 \n",
"d91f2da05c84147f9fcfe5121b21777f 6.3 \n",
"\n",
" fe462c3364a5d79fdfbef7b4befc69a5 \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.6 \n",
"437dcdf95882266f90085369b9a4258f 3.8 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.9 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.7 \n",
"b17640be6d2ed8dacb61f48fd40996c2 3.2 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.6 \n",
"d91f2da05c84147f9fcfe5121b21777f 3.3 \n",
"\n",
" fe6c516055388ef0d8d7ab8b7dd60c45 \\\n",
"4019ffc973586d62bfa9adebf209bb04 5.2 \n",
"437dcdf95882266f90085369b9a4258f 4.1 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 4.2 \n",
"528a722b3be1f711e27efc09dca5b2d7 5.4 \n",
"b17640be6d2ed8dacb61f48fd40996c2 4.2 \n",
"c12ad1255bc3ed843ae21de938ab5f62 4.5 \n",
"d91f2da05c84147f9fcfe5121b21777f 4.2 \n",
"\n",
" fe736d03c479e9681251e5d1e155ed85 \\\n",
"4019ffc973586d62bfa9adebf209bb04 8.3 \n",
"437dcdf95882266f90085369b9a4258f 8.0 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 8.0 \n",
"528a722b3be1f711e27efc09dca5b2d7 8.7 \n",
"b17640be6d2ed8dacb61f48fd40996c2 7.9 \n",
"c12ad1255bc3ed843ae21de938ab5f62 8.0 \n",
"d91f2da05c84147f9fcfe5121b21777f 7.9 \n",
"\n",
" fe81b26a3d8796c10404d6d10915c3f5 \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.0 \n",
"437dcdf95882266f90085369b9a4258f 3.0 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.7 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.2 \n",
"b17640be6d2ed8dacb61f48fd40996c2 2.6 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.1 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.5 \n",
"\n",
" fec303c471974bdfa29b7b32697b39ef \\\n",
"4019ffc973586d62bfa9adebf209bb04 4.5 \n",
"437dcdf95882266f90085369b9a4258f 3.7 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.4 \n",
"528a722b3be1f711e27efc09dca5b2d7 4.7 \n",
"b17640be6d2ed8dacb61f48fd40996c2 3.7 \n",
"c12ad1255bc3ed843ae21de938ab5f62 3.9 \n",
"d91f2da05c84147f9fcfe5121b21777f 2.9 \n",
"\n",
" feeb0ed7e8d9d82d58e3578c1ae7fcee \n",
"4019ffc973586d62bfa9adebf209bb04 5.1 \n",
"437dcdf95882266f90085369b9a4258f 4.3 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 4.3 \n",
"528a722b3be1f711e27efc09dca5b2d7 5.0 \n",
"b17640be6d2ed8dacb61f48fd40996c2 4.2 \n",
"c12ad1255bc3ed843ae21de938ab5f62 4.7 \n",
"d91f2da05c84147f9fcfe5121b21777f 4.1 \n",
"\n",
"[7 rows x 1162 columns]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" k_nn_1_index k_nn_2_index k_nn_1_distance \\\n",
"4019ffc973586d62bfa9adebf209bb04 859 1005 3.201 \n",
"437dcdf95882266f90085369b9a4258f 435 763 2.872 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 880 190 1.765 \n",
"528a722b3be1f711e27efc09dca5b2d7 356 343 3.238 \n",
"b17640be6d2ed8dacb61f48fd40996c2 902 953 1.830 \n",
"c12ad1255bc3ed843ae21de938ab5f62 793 832 2.725 \n",
"d91f2da05c84147f9fcfe5121b21777f 148 5 1.833 \n",
"\n",
" k_nn_2_distance \\\n",
"4019ffc973586d62bfa9adebf209bb04 3.231 \n",
"437dcdf95882266f90085369b9a4258f 2.947 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 1.872 \n",
"528a722b3be1f711e27efc09dca5b2d7 3.436 \n",
"b17640be6d2ed8dacb61f48fd40996c2 1.883 \n",
"c12ad1255bc3ed843ae21de938ab5f62 2.750 \n",
"d91f2da05c84147f9fcfe5121b21777f 1.945 \n",
"\n",
" k_nn_1_identifier \\\n",
"4019ffc973586d62bfa9adebf209bb04 bc78ea9e725fe8b819c32f1438aa0be9 \n",
"437dcdf95882266f90085369b9a4258f 5d733f6264797054e7f2daa324e78e95 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac c13503898cf51733b81cdf19e5d521e8 \n",
"528a722b3be1f711e27efc09dca5b2d7 4bd8fd185ad7e5d852f510a0dc1b94b2 \n",
"b17640be6d2ed8dacb61f48fd40996c2 c67f43b4c271800bf1b86b75fd06aaf7 \n",
"c12ad1255bc3ed843ae21de938ab5f62 af2a36ab30bc54843acdb759ea61df71 \n",
"d91f2da05c84147f9fcfe5121b21777f 231a587783f7bca8795a23abc3269646 \n",
"\n",
" k_nn_2_identifier \n",
"4019ffc973586d62bfa9adebf209bb04 de3105e4d61489ce082a5e6ca9b802e7 \n",
"437dcdf95882266f90085369b9a4258f a7bedcc53366bb2f2e4cf968188d5886 \n",
"4c4dcc8dd8b86a7e8c4efb3ece2653ac 29bd1227262429b3c72b158d448c3ccc \n",
"528a722b3be1f711e27efc09dca5b2d7 49350ec6dd02aabc8eb8cabc10476a4f \n",
"b17640be6d2ed8dacb61f48fd40996c2 d20782131bcd520f911265e8bd79ee94 \n",
"c12ad1255bc3ed843ae21de938ab5f62 b78f477508a2fb1765530101e9e9e861 \n",
"d91f2da05c84147f9fcfe5121b21777f 01de4461209b76f57819919dc38faa99 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Query embeddings first, reference embeddings second\n",
"cross_pairwise_distances = pairwise_distance_matrix_from_embeddings_and_annotations(\n",
"    'reduced_embeddings_file.h5',\n",
"    'disprot_reduced_embeddings_file.h5',\n",
")\n",
"\n",
"# Rows are queries, columns are references\n",
"distances_dataframe = DataFrame(\n",
"    cross_pairwise_distances.pairwise_matrix,\n",
"    index=cross_pairwise_distances.queries,\n",
"    columns=cross_pairwise_distances.references,\n",
")\n",
"\n",
"display(distances_dataframe.round(1))\n",
"\n",
"# Find the two closest references for every query\n",
"k = 2\n",
"k_nn_indices, k_nn_distances = get_k_nearest_neighbours(cross_pairwise_distances.pairwise_matrix, k)\n",
"k_nn_indices_df = DataFrame(k_nn_indices, columns=[f\"k_nn_{i+1}_index\" for i in range(len(k_nn_indices[0]))])\n",
"k_nn_distances_df = DataFrame(k_nn_distances, columns=[f\"k_nn_{i+1}_distance\" for i in range(len(k_nn_distances[0]))])\n",
"\n",
"neighbours_dataframe = concat([k_nn_indices_df, k_nn_distances_df], axis=1)\n",
"neighbours_dataframe.index = cross_pairwise_distances.queries\n",
"\n",
"# Map neighbour indices to reference identifiers\n",
"neighbours_dataframe['k_nn_1_identifier'] = neighbours_dataframe.k_nn_1_index.map(\n",
"    lambda x: cross_pairwise_distances.references[x])\n",
"\n",
"neighbours_dataframe['k_nn_2_identifier'] = neighbours_dataframe.k_nn_2_index.map(\n",
"    lambda x: cross_pairwise_distances.references[x])\n",
"\n",
"display(neighbours_dataframe.round(3))"
]
}
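,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The nearest reference becomes useful once it carries an annotation that can be transferred to the query. A minimal sketch with made-up identifiers and labels: the `toy_neighbours` table and the `toy_annotations` mapping are hypothetical and not part of the downloaded files."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pandas import DataFrame\n",
"\n",
"# Hypothetical nearest-neighbour table: one row per query, naming its closest reference\n",
"toy_neighbours = DataFrame(\n",
"    {'k_nn_1_identifier': ['ref_a', 'ref_b', 'ref_a'],\n",
"     'k_nn_1_distance': [1.8, 2.7, 3.2]},\n",
"    index=['query_1', 'query_2', 'query_3'])\n",
"\n",
"# Hypothetical annotations known for the reference sequences\n",
"toy_annotations = {'ref_a': 'disordered', 'ref_b': 'ordered'}\n",
"\n",
"# Transfer the annotation of the closest reference to each query\n",
"toy_neighbours['transferred_annotation'] = toy_neighbours['k_nn_1_identifier'].map(toy_annotations)\n",
"toy_neighbours"
]
}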
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}