{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Colab initialization\n", "- install the pipeline in the colab runtime\n", "- download files necessary for this example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip3 install -U pip > /dev/null\n", "!pip3 install -U bio_embeddings[all] > /dev/null" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/reduced_embeddings_file.h5 --output-document reduced_embeddings_file.h5\n", "!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/disprot/reduced_embeddings_file.h5 --output-document disprot_reduced_embeddings_file.h5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pairwise distances between embeddings, and nearest neighbour annotation transfer\n", "\n", "In the following, we will compute pairwise distances between sets of embeddings. 
This borrows ideas from the extract stage using the unsupervised protocol.\n", "\n", "But, we can do more than just that!\n", "\n", "For example, we can check similarity within one dataset:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "import numpy as np\n", "from bio_embeddings.extract import pairwise_distance_matrix_from_embeddings_and_annotations, get_k_nearest_neighbours\n", "from pandas import DataFrame, concat" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "instrinsic_pairwise_distances = pairwise_distance_matrix_from_embeddings_and_annotations(\n", " 'reduced_embeddings_file.h5',\n", " 'reduced_embeddings_file.h5',\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's transform this into a nice dataframe with columns and rows, so that we know whose distances we are comparing" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "distances_dataframe = DataFrame(instrinsic_pairwise_distances.pairwise_matrix,\n", " index = instrinsic_pairwise_distances.queries,\n", " columns= instrinsic_pairwise_distances.references)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**IMPORTANT**\n", "\n", "You will see that in the following we use the `round(1)` function, which rounds floating points to one decimal. We do so because otherwise the distance between the same sequence (diagonal) is sometimes NOT exactly zero (due to floating point precision).\n", "\n", "If you print the dataframe without rounding, you will most likely see something like `1.032383e-07` for position (1,1) in the matrix. This, however, is very close to 0." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 4019ffc973586d62bfa9adebf209bb04 \\\n", "4019ffc973586d62bfa9adebf209bb04 0.0 \n", "437dcdf95882266f90085369b9a4258f 4.2 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.5 \n", "528a722b3be1f711e27efc09dca5b2d7 4.3 \n", "b17640be6d2ed8dacb61f48fd40996c2 3.6 \n", "c12ad1255bc3ed843ae21de938ab5f62 4.3 \n", "d91f2da05c84147f9fcfe5121b21777f 3.2 \n", "\n", " 437dcdf95882266f90085369b9a4258f \\\n", "4019ffc973586d62bfa9adebf209bb04 4.2 \n", "437dcdf95882266f90085369b9a4258f 0.0 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.3 \n", "528a722b3be1f711e27efc09dca5b2d7 4.3 \n", "b17640be6d2ed8dacb61f48fd40996c2 3.3 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.7 \n", "d91f2da05c84147f9fcfe5121b21777f 3.1 \n", "\n", " 4c4dcc8dd8b86a7e8c4efb3ece2653ac \\\n", "4019ffc973586d62bfa9adebf209bb04 3.5 \n", "437dcdf95882266f90085369b9a4258f 3.3 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 0.0 \n", "528a722b3be1f711e27efc09dca5b2d7 3.6 \n", "b17640be6d2ed8dacb61f48fd40996c2 2.4 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.2 \n", "d91f2da05c84147f9fcfe5121b21777f 2.1 \n", "\n", " 528a722b3be1f711e27efc09dca5b2d7 \\\n", "4019ffc973586d62bfa9adebf209bb04 4.3 \n", "437dcdf95882266f90085369b9a4258f 4.3 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.6 \n", "528a722b3be1f711e27efc09dca5b2d7 0.0 \n", "b17640be6d2ed8dacb61f48fd40996c2 3.6 \n", "c12ad1255bc3ed843ae21de938ab5f62 4.5 \n", "d91f2da05c84147f9fcfe5121b21777f 3.7 \n", "\n", " b17640be6d2ed8dacb61f48fd40996c2 \\\n", "4019ffc973586d62bfa9adebf209bb04 3.6 \n", "437dcdf95882266f90085369b9a4258f 3.3 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.4 \n", "528a722b3be1f711e27efc09dca5b2d7 3.6 \n", "b17640be6d2ed8dacb61f48fd40996c2 0.0 \n", "c12ad1255bc3ed843ae21de938ab5f62 2.9 \n", "d91f2da05c84147f9fcfe5121b21777f 2.3 \n", "\n", " c12ad1255bc3ed843ae21de938ab5f62 \\\n", "4019ffc973586d62bfa9adebf209bb04 4.3 \n", "437dcdf95882266f90085369b9a4258f 3.7 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.2 \n", "528a722b3be1f711e27efc09dca5b2d7 4.5 \n", 
"b17640be6d2ed8dacb61f48fd40996c2 2.9 \n", "c12ad1255bc3ed843ae21de938ab5f62 0.0 \n", "d91f2da05c84147f9fcfe5121b21777f 3.0 \n", "\n", " d91f2da05c84147f9fcfe5121b21777f \n", "4019ffc973586d62bfa9adebf209bb04 3.2 \n", "437dcdf95882266f90085369b9a4258f 3.1 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 \n", "528a722b3be1f711e27efc09dca5b2d7 3.7 \n", "b17640be6d2ed8dacb61f48fd40996c2 2.3 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.0 \n", "d91f2da05c84147f9fcfe5121b21777f 0.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "distances_dataframe.round(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we want to find the k nearest neighbours.\n", "\n", "Since we are comparing sequences against themselves, the first nearest neighbour will always be the sequence itself. Therefore we pass `2` to the following function in order to get the second nearest neighbour.\n", "\n", "The following function returns two results: the indices of the nearest neighbours, as well as the distances to them." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "k = 2\n", "k_nn_indices, k_nn_distances = get_k_nearest_neighbours(instrinsic_pairwise_distances.pairwise_matrix, k)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, let's put this into a dataframe with index and column names, so we understand what's going on:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "k_nn_idices_df = DataFrame(k_nn_indices, columns=[f\"k_nn_{i+1}_index\" for i in range(len(k_nn_indices[0]))])\n", "k_nn_distances_df = DataFrame(k_nn_distances, columns=[f\"k_nn_{i+1}_distance\" for i in range(len(k_nn_distances[0]))])\n", "\n", "neighbours_dataframe = concat([k_nn_idices_df, k_nn_distances_df], axis=1)\n", "neighbours_dataframe.index = instrinsic_pairwise_distances.queries" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " k_nn_1_index k_nn_2_index k_nn_1_distance \\\n", "4019ffc973586d62bfa9adebf209bb04 0 6 0.0 \n", "437dcdf95882266f90085369b9a4258f 1 6 0.0 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2 6 0.0 \n", "528a722b3be1f711e27efc09dca5b2d7 3 4 0.0 \n", "b17640be6d2ed8dacb61f48fd40996c2 4 6 0.0 \n", "c12ad1255bc3ed843ae21de938ab5f62 5 4 0.0 \n", "d91f2da05c84147f9fcfe5121b21777f 6 2 0.0 \n", "\n", " k_nn_2_distance \n", "4019ffc973586d62bfa9adebf209bb04 3.2 \n", "437dcdf95882266f90085369b9a4258f 3.1 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 \n", "528a722b3be1f711e27efc09dca5b2d7 3.6 \n", "b17640be6d2ed8dacb61f48fd40996c2 2.3 \n", "c12ad1255bc3ed843ae21de938ab5f62 2.9 \n", "d91f2da05c84147f9fcfe5121b21777f 2.1 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neighbours_dataframe.round(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since it's also nice to see whose identifier it is at index N, we can simply map that onto our dataframe" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "neighbours_dataframe['k_nn_1_identifier'] = neighbours_dataframe.k_nn_1_index.map(\n", " lambda x: instrinsic_pairwise_distances.references[x])\n", "\n", "neighbours_dataframe['k_nn_2_identifier'] = neighbours_dataframe.k_nn_2_index.map(\n", " lambda x: instrinsic_pairwise_distances.references[x])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " k_nn_1_index k_nn_2_index k_nn_1_distance \\\n", "4019ffc973586d62bfa9adebf209bb04 0 6 0.0 \n", "437dcdf95882266f90085369b9a4258f 1 6 0.0 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2 6 0.0 \n", "528a722b3be1f711e27efc09dca5b2d7 3 4 0.0 \n", "b17640be6d2ed8dacb61f48fd40996c2 4 6 0.0 \n", "c12ad1255bc3ed843ae21de938ab5f62 5 4 0.0 \n", "d91f2da05c84147f9fcfe5121b21777f 6 2 0.0 \n", "\n", " k_nn_2_distance \\\n", "4019ffc973586d62bfa9adebf209bb04 3.2 \n", "437dcdf95882266f90085369b9a4258f 3.1 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 \n", "528a722b3be1f711e27efc09dca5b2d7 3.6 \n", "b17640be6d2ed8dacb61f48fd40996c2 2.3 \n", "c12ad1255bc3ed843ae21de938ab5f62 2.9 \n", "d91f2da05c84147f9fcfe5121b21777f 2.1 \n", "\n", " k_nn_1_identifier \\\n", "4019ffc973586d62bfa9adebf209bb04 4019ffc973586d62bfa9adebf209bb04 \n", "437dcdf95882266f90085369b9a4258f 437dcdf95882266f90085369b9a4258f \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 4c4dcc8dd8b86a7e8c4efb3ece2653ac \n", "528a722b3be1f711e27efc09dca5b2d7 528a722b3be1f711e27efc09dca5b2d7 \n", "b17640be6d2ed8dacb61f48fd40996c2 b17640be6d2ed8dacb61f48fd40996c2 \n", "c12ad1255bc3ed843ae21de938ab5f62 c12ad1255bc3ed843ae21de938ab5f62 \n", "d91f2da05c84147f9fcfe5121b21777f d91f2da05c84147f9fcfe5121b21777f \n", "\n", " k_nn_2_identifier \n", "4019ffc973586d62bfa9adebf209bb04 d91f2da05c84147f9fcfe5121b21777f \n", "437dcdf95882266f90085369b9a4258f d91f2da05c84147f9fcfe5121b21777f \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac d91f2da05c84147f9fcfe5121b21777f \n", "528a722b3be1f711e27efc09dca5b2d7 b17640be6d2ed8dacb61f48fd40996c2 \n", "b17640be6d2ed8dacb61f48fd40996c2 d91f2da05c84147f9fcfe5121b21777f \n", "c12ad1255bc3ed843ae21de938ab5f62 b17640be6d2ed8dacb61f48fd40996c2 \n", "d91f2da05c84147f9fcfe5121b21777f 4c4dcc8dd8b86a7e8c4efb3ece2653ac " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "neighbours_dataframe.round(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Hey, 
Christian: this is all nice and stuff, but I wanna compare two different sets!\"\n", "\n", "\n", "Alrighty, just as above, but using two different embedding files. Notice that we call the first embedding file the \"query\" embedding file, while the second embedding file is the \"reference\". Internally, we use these terms, because most of the time you are not just interested in the closest embedding, but in some property of the sequence that produced that embedding, aka the \"reference\" embedding." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0011ab0c11c7fea51fefcd039b1b69f5 \\\n", "4019ffc973586d62bfa9adebf209bb04 3.6 \n", "437dcdf95882266f90085369b9a4258f 3.1 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 \n", "528a722b3be1f711e27efc09dca5b2d7 3.8 \n", "b17640be6d2ed8dacb61f48fd40996c2 2.1 \n", "c12ad1255bc3ed843ae21de938ab5f62 2.9 \n", "d91f2da05c84147f9fcfe5121b21777f 2.0 \n", "\n", " 003b243d0117cbaf2b7434184c409b06 \\\n", "4019ffc973586d62bfa9adebf209bb04 4.1 \n", "437dcdf95882266f90085369b9a4258f 3.7 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.2 \n", "528a722b3be1f711e27efc09dca5b2d7 4.7 \n", "b17640be6d2ed8dacb61f48fd40996c2 3.6 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.9 \n", "d91f2da05c84147f9fcfe5121b21777f 3.2 \n", "\n", " 004cef7b0dae937e6d722817c17ed889 \\\n", "4019ffc973586d62bfa9adebf209bb04 3.9 \n", "437dcdf95882266f90085369b9a4258f 3.1 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.7 \n", "528a722b3be1f711e27efc09dca5b2d7 4.0 \n", "b17640be6d2ed8dacb61f48fd40996c2 2.1 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.2 \n", "d91f2da05c84147f9fcfe5121b21777f 2.5 \n", "\n", " 0115b4447d6911651804d1303bf5f272 \\\n", "4019ffc973586d62bfa9adebf209bb04 4.6 \n", "437dcdf95882266f90085369b9a4258f 4.2 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.8 \n", "528a722b3be1f711e27efc09dca5b2d7 4.9 \n", "b17640be6d2ed8dacb61f48fd40996c2 3.9 \n", "c12ad1255bc3ed843ae21de938ab5f62 4.2 \n", "d91f2da05c84147f9fcfe5121b21777f 3.3 \n", "\n", " 0140b3ec6cba5734a909c1d734e48ea0 \\\n", "4019ffc973586d62bfa9adebf209bb04 4.6 \n", "437dcdf95882266f90085369b9a4258f 4.0 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.4 \n", "528a722b3be1f711e27efc09dca5b2d7 4.8 \n", "b17640be6d2ed8dacb61f48fd40996c2 3.8 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.9 \n", "d91f2da05c84147f9fcfe5121b21777f 3.1 \n", "\n", " 01de4461209b76f57819919dc38faa99 \\\n", "4019ffc973586d62bfa9adebf209bb04 3.4 \n", "437dcdf95882266f90085369b9a4258f 3.3 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 \n", "528a722b3be1f711e27efc09dca5b2d7 3.9 \n", 
"b17640be6d2ed8dacb61f48fd40996c2 2.5 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.2 \n", "d91f2da05c84147f9fcfe5121b21777f 1.9 \n", "\n", " 025559e7b85ed1448d34e61221a54788 \\\n", "4019ffc973586d62bfa9adebf209bb04 4.1 \n", "437dcdf95882266f90085369b9a4258f 3.4 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.2 \n", "528a722b3be1f711e27efc09dca5b2d7 4.4 \n", "b17640be6d2ed8dacb61f48fd40996c2 3.2 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.4 \n", "d91f2da05c84147f9fcfe5121b21777f 2.8 \n", "\n", " 0297e58178e3f37ecaa080f23a5efc20 \\\n", "4019ffc973586d62bfa9adebf209bb04 8.3 \n", "437dcdf95882266f90085369b9a4258f 7.8 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 7.9 \n", "528a722b3be1f711e27efc09dca5b2d7 8.3 \n", "b17640be6d2ed8dacb61f48fd40996c2 8.1 \n", "c12ad1255bc3ed843ae21de938ab5f62 8.1 \n", "d91f2da05c84147f9fcfe5121b21777f 7.7 \n", "\n", " 02bfb6e7933ff7691596243b73dbe6c9 \\\n", "4019ffc973586d62bfa9adebf209bb04 4.0 \n", "437dcdf95882266f90085369b9a4258f 3.4 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.5 \n", "528a722b3be1f711e27efc09dca5b2d7 4.3 \n", "b17640be6d2ed8dacb61f48fd40996c2 2.8 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.3 \n", "d91f2da05c84147f9fcfe5121b21777f 2.3 \n", "\n", " 02c8fa126578dbabb41f3ac9dc9a5048 ... \\\n", "4019ffc973586d62bfa9adebf209bb04 3.9 ... \n", "437dcdf95882266f90085369b9a4258f 3.3 ... \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.1 ... \n", "528a722b3be1f711e27efc09dca5b2d7 3.7 ... \n", "b17640be6d2ed8dacb61f48fd40996c2 2.7 ... \n", "c12ad1255bc3ed843ae21de938ab5f62 3.4 ... \n", "d91f2da05c84147f9fcfe5121b21777f 2.5 ... 
\n", "\n", " fe15f12a73627e411d081ac2c75a08d2 \\\n", "4019ffc973586d62bfa9adebf209bb04 3.8 \n", "437dcdf95882266f90085369b9a4258f 3.3 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.5 \n", "528a722b3be1f711e27efc09dca5b2d7 4.3 \n", "b17640be6d2ed8dacb61f48fd40996c2 2.8 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.3 \n", "d91f2da05c84147f9fcfe5121b21777f 2.2 \n", "\n", " fe275ea1d14453f6855577d120934fe1 \\\n", "4019ffc973586d62bfa9adebf209bb04 4.6 \n", "437dcdf95882266f90085369b9a4258f 3.6 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.9 \n", "528a722b3be1f711e27efc09dca5b2d7 4.7 \n", "b17640be6d2ed8dacb61f48fd40996c2 3.6 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.6 \n", "d91f2da05c84147f9fcfe5121b21777f 3.5 \n", "\n", " fe3e07d970ca830a34d91a3fa7e4f9e2 \\\n", "4019ffc973586d62bfa9adebf209bb04 3.8 \n", "437dcdf95882266f90085369b9a4258f 3.4 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.9 \n", "528a722b3be1f711e27efc09dca5b2d7 4.2 \n", "b17640be6d2ed8dacb61f48fd40996c2 2.9 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.4 \n", "d91f2da05c84147f9fcfe5121b21777f 2.7 \n", "\n", " fe3fce3b83de1afdf00394a4e639dd5e \\\n", "4019ffc973586d62bfa9adebf209bb04 7.1 \n", "437dcdf95882266f90085369b9a4258f 6.6 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 6.5 \n", "528a722b3be1f711e27efc09dca5b2d7 7.4 \n", "b17640be6d2ed8dacb61f48fd40996c2 6.4 \n", "c12ad1255bc3ed843ae21de938ab5f62 6.6 \n", "d91f2da05c84147f9fcfe5121b21777f 6.3 \n", "\n", " fe462c3364a5d79fdfbef7b4befc69a5 \\\n", "4019ffc973586d62bfa9adebf209bb04 4.6 \n", "437dcdf95882266f90085369b9a4258f 3.8 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.9 \n", "528a722b3be1f711e27efc09dca5b2d7 4.7 \n", "b17640be6d2ed8dacb61f48fd40996c2 3.2 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.6 \n", "d91f2da05c84147f9fcfe5121b21777f 3.3 \n", "\n", " fe6c516055388ef0d8d7ab8b7dd60c45 \\\n", "4019ffc973586d62bfa9adebf209bb04 5.2 \n", "437dcdf95882266f90085369b9a4258f 4.1 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 4.2 \n", "528a722b3be1f711e27efc09dca5b2d7 5.4 \n", 
"b17640be6d2ed8dacb61f48fd40996c2 4.2 \n", "c12ad1255bc3ed843ae21de938ab5f62 4.5 \n", "d91f2da05c84147f9fcfe5121b21777f 4.2 \n", "\n", " fe736d03c479e9681251e5d1e155ed85 \\\n", "4019ffc973586d62bfa9adebf209bb04 8.3 \n", "437dcdf95882266f90085369b9a4258f 8.0 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 8.0 \n", "528a722b3be1f711e27efc09dca5b2d7 8.7 \n", "b17640be6d2ed8dacb61f48fd40996c2 7.9 \n", "c12ad1255bc3ed843ae21de938ab5f62 8.0 \n", "d91f2da05c84147f9fcfe5121b21777f 7.9 \n", "\n", " fe81b26a3d8796c10404d6d10915c3f5 \\\n", "4019ffc973586d62bfa9adebf209bb04 4.0 \n", "437dcdf95882266f90085369b9a4258f 3.0 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 2.7 \n", "528a722b3be1f711e27efc09dca5b2d7 4.2 \n", "b17640be6d2ed8dacb61f48fd40996c2 2.6 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.1 \n", "d91f2da05c84147f9fcfe5121b21777f 2.5 \n", "\n", " fec303c471974bdfa29b7b32697b39ef \\\n", "4019ffc973586d62bfa9adebf209bb04 4.5 \n", "437dcdf95882266f90085369b9a4258f 3.7 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 3.4 \n", "528a722b3be1f711e27efc09dca5b2d7 4.7 \n", "b17640be6d2ed8dacb61f48fd40996c2 3.7 \n", "c12ad1255bc3ed843ae21de938ab5f62 3.9 \n", "d91f2da05c84147f9fcfe5121b21777f 2.9 \n", "\n", " feeb0ed7e8d9d82d58e3578c1ae7fcee \n", "4019ffc973586d62bfa9adebf209bb04 5.1 \n", "437dcdf95882266f90085369b9a4258f 4.3 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 4.3 \n", "528a722b3be1f711e27efc09dca5b2d7 5.0 \n", "b17640be6d2ed8dacb61f48fd40996c2 4.2 \n", "c12ad1255bc3ed843ae21de938ab5f62 4.7 \n", "d91f2da05c84147f9fcfe5121b21777f 4.1 \n", "\n", "[7 rows x 1162 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " k_nn_1_index k_nn_2_index k_nn_1_distance \\\n", "4019ffc973586d62bfa9adebf209bb04 859 1005 3.201 \n", "437dcdf95882266f90085369b9a4258f 435 763 2.872 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 880 190 1.765 \n", "528a722b3be1f711e27efc09dca5b2d7 356 343 3.238 \n", "b17640be6d2ed8dacb61f48fd40996c2 902 953 1.830 \n", "c12ad1255bc3ed843ae21de938ab5f62 793 832 2.725 \n", "d91f2da05c84147f9fcfe5121b21777f 148 5 1.833 \n", "\n", " k_nn_2_distance \\\n", "4019ffc973586d62bfa9adebf209bb04 3.231 \n", "437dcdf95882266f90085369b9a4258f 2.947 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 1.872 \n", "528a722b3be1f711e27efc09dca5b2d7 3.436 \n", "b17640be6d2ed8dacb61f48fd40996c2 1.883 \n", "c12ad1255bc3ed843ae21de938ab5f62 2.750 \n", "d91f2da05c84147f9fcfe5121b21777f 1.945 \n", "\n", " k_nn_1_identifier \\\n", "4019ffc973586d62bfa9adebf209bb04 bc78ea9e725fe8b819c32f1438aa0be9 \n", "437dcdf95882266f90085369b9a4258f 5d733f6264797054e7f2daa324e78e95 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac c13503898cf51733b81cdf19e5d521e8 \n", "528a722b3be1f711e27efc09dca5b2d7 4bd8fd185ad7e5d852f510a0dc1b94b2 \n", "b17640be6d2ed8dacb61f48fd40996c2 c67f43b4c271800bf1b86b75fd06aaf7 \n", "c12ad1255bc3ed843ae21de938ab5f62 af2a36ab30bc54843acdb759ea61df71 \n", "d91f2da05c84147f9fcfe5121b21777f 231a587783f7bca8795a23abc3269646 \n", "\n", " k_nn_2_identifier \n", "4019ffc973586d62bfa9adebf209bb04 de3105e4d61489ce082a5e6ca9b802e7 \n", "437dcdf95882266f90085369b9a4258f a7bedcc53366bb2f2e4cf968188d5886 \n", "4c4dcc8dd8b86a7e8c4efb3ece2653ac 29bd1227262429b3c72b158d448c3ccc \n", "528a722b3be1f711e27efc09dca5b2d7 49350ec6dd02aabc8eb8cabc10476a4f \n", "b17640be6d2ed8dacb61f48fd40996c2 d20782131bcd520f911265e8bd79ee94 \n", "c12ad1255bc3ed843ae21de938ab5f62 b78f477508a2fb1765530101e9e9e861 \n", "d91f2da05c84147f9fcfe5121b21777f 01de4461209b76f57819919dc38faa99 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "instrinsic_pairwise_distances = 
pairwise_distance_matrix_from_embeddings_and_annotations(\n", " 'reduced_embeddings_file.h5',\n", " 'disprot_reduced_embeddings_file.h5',\n", ")\n", "\n", "distances_dataframe = DataFrame(instrinsic_pairwise_distances.pairwise_matrix,\n", " index = instrinsic_pairwise_distances.queries,\n", " columns= instrinsic_pairwise_distances.references)\n", "\n", "display(distances_dataframe.round(1))\n", "\n", "k = 2\n", "k_nn_indices, k_nn_distances = get_k_nearest_neighbours(instrinsic_pairwise_distances.pairwise_matrix, k)\n", "k_nn_idices_df = DataFrame(k_nn_indices, columns=[f\"k_nn_{i+1}_index\" for i in range(len(k_nn_indices[0]))])\n", "k_nn_distances_df = DataFrame(k_nn_distances, columns=[f\"k_nn_{i+1}_distance\" for i in range(len(k_nn_distances[0]))])\n", "\n", "neighbours_dataframe = concat([k_nn_idices_df, k_nn_distances_df], axis=1)\n", "neighbours_dataframe.index = instrinsic_pairwise_distances.queries\n", "\n", "neighbours_dataframe['k_nn_1_identifier'] = neighbours_dataframe.k_nn_1_index.map(\n", " lambda x: instrinsic_pairwise_distances.references[x])\n", "\n", "neighbours_dataframe['k_nn_2_identifier'] = neighbours_dataframe.k_nn_2_index.map(\n", " lambda x: instrinsic_pairwise_distances.references[x])\n", "\n", "display(neighbours_dataframe.round(3))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 1 }