Colab initialization

Install the pipeline in the Colab runtime
Download the files necessary for this example

# Basic Protocol 2 — Step 2 — Install bio_embeddings

!pip3 install -U pip > /dev/null
!pip3 install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git" > /dev/null

# Basic Protocol 2 — Step 3 — Download files

!wget http://data.bioembeddings.com/deeploc/reduced_embeddings_file.h5 --output-document reduced_embeddings_file.h5
!wget http://data.bioembeddings.com/deeploc/annotations.csv --output-document annotations.csv
!wget http://data.bioembeddings.com/deeploc/solubility_annotations.csv --output-document solubility_annotations.csv

Visualize sequence spaces drawn by DeepLoc embeddings

In this notebook, we use the output of the embed stage to draw custom UMAP sequence space plots. We will first use the annotations of subcellular localization from DeepLoc. These come in 10 subcellular localization classes.

Following that, we will use a different annotations file that describes the solubility of the proteins in the DeepLoc set. We will keep the same UMAP projection as before, so the two plots can be compared side by side: one shows how the embeddings separate the 10 subcellular localization classes, the other how they separate soluble from membrane-bound proteins.
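Both annotation files are read below as CSVs with an `identifier` and a `label` column. Before plotting, it can help to know how many proteins fall in each class. The following is a minimal sketch on made-up rows (not the real DeepLoc data); the helper name `count_labels` is hypothetical and not part of bio_embeddings.

```python
import csv
import io
from collections import Counter

# Toy stand-in for annotations.csv; identifiers and labels are made up.
toy_csv = """identifier,label
P1,Nucleus
P2,Cytoplasm
P3,Nucleus
P4,Mitochondrion
"""

def count_labels(csv_text):
    """Return a Counter mapping each label to its number of rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["label"] for row in reader)

label_counts = count_labels(toy_csv)

# Print classes from most to least frequent.
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

Strongly unbalanced classes are worth knowing about up front, since small classes can be hard to spot in a dense scatter plot.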

# Basic Protocol 2 — Step 4 — Import dependencies

import h5py
import numpy as np
from pandas import read_csv, DataFrame
from bio_embeddings.project import umap_reduce
from bio_embeddings.visualize import render_scatter_plotly
from bio_embeddings.utilities import QueryEmbeddingsFile

# Basic Protocol 2 — Step 5 — Read annotations file

annotations = read_csv('annotations.csv')

# Basic Protocol 2 — Step 6 — Read the embeddings file

identifiers = annotations.identifier.values
embeddings = list()

with h5py.File('reduced_embeddings_file.h5', 'r') as embeddings_file:
    
    # Create a lookup table that maps original ids to the file's internal ids
    embedding_querier = QueryEmbeddingsFile(embeddings_file)

    # For every identifier in the annotations file,
    # look up the corresponding internal identifier
    # and add the embedding for that protein to the embeddings list
    for identifier in identifiers:
        embeddings.append(embedding_querier.query_original_id(identifier))
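The loop above leaves `embeddings` as a plain Python list with one entry per identifier. A quick sanity check before projecting is to confirm that all entries have the same shape. The sketch below uses random toy vectors instead of the real file and assumes one fixed-size vector per protein, as the "reduced" file name suggests; the helper name `stack_embeddings` is hypothetical.

```python
import numpy as np

# Toy stand-ins for per-protein embeddings; the real ones come from the h5 file.
rng = np.random.default_rng(0)
toy_embeddings = [rng.normal(size=8) for _ in range(5)]

def stack_embeddings(embedding_list):
    """Stack equally shaped 1-D embeddings into an (n_proteins, n_features) array."""
    shapes = {np.asarray(e).shape for e in embedding_list}
    if len(shapes) != 1:
        raise ValueError(f"Embeddings have inconsistent shapes: {shapes}")
    return np.stack(embedding_list)

matrix = stack_embeddings(toy_embeddings)
print(matrix.shape)  # (5, 8)
```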

# Basic Protocol 2 — Step 7 — Project the embeddings in lower dimensions using UMAP

options = {
    'min_dist': .1,
    'spread': 8,
    'n_neighbors': 160,
    'metric': 'euclidean',
    'n_components': 2,
    'random_state': 10
}

projected_embeddings = umap_reduce(embeddings, **options)

# Basic Protocol 2 — Step 8 — Merge projected embeddings and annotations

projected_embeddings_dataframe = DataFrame(
    projected_embeddings,
    columns=["component_0", "component_1"],
    index=identifiers
)

merged_annotations_and_projected_embeddings = annotations.join(
    projected_embeddings_dataframe, on="identifier", how="left"
)
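Because the merge is a left join on `identifier`, any annotated protein without projected coordinates keeps its row but gets `NaN` in the component columns. A minimal sketch on toy data (identifiers, labels, and coordinates are made up) of how such rows could be detected:

```python
import numpy as np
from pandas import DataFrame

toy_annotations = DataFrame({
    "identifier": ["P1", "P2", "P3"],
    "label": ["Nucleus", "Cytoplasm", "Nucleus"],
})

# Coordinates for only two of the three proteins, indexed by identifier.
toy_projection = DataFrame(
    np.array([[0.1, 1.2], [2.3, -0.4]]),
    columns=["component_0", "component_1"],
    index=["P1", "P2"],
)

merged = toy_annotations.join(toy_projection, on="identifier", how="left")

# Rows without coordinates show up as NaN after a left join.
missing = merged[merged["component_0"].isna()]
print(missing["identifier"].tolist())  # ['P3']
```

In this notebook the projection is built from the same identifiers as the annotations, so no rows should be missing; the check becomes useful when the two files come from different sources.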

# Basic Protocol 2 — Step 9 — Plot the protein space spanned by the projected embeddings

figure = render_scatter_plotly(merged_annotations_and_projected_embeddings)
figure.show()

Visualize soluble vs. membrane vs. unknown sequences

We can reuse the UMAP projections from before and re-color the visualization with a new annotation: soluble vs. membrane-bound. The annotations are again taken from DeepLoc. This highlights the versatility of the visualizations: if you want to study several properties of a protein sequence set, you can plot each property in its own graph over the same projection.

# Alternate protocol 2 - Step 2
solubility_annotations = read_csv('solubility_annotations.csv')

merged_solubility_annotations_and_projected_embeddings = solubility_annotations.join(
    projected_embeddings_dataframe, on="identifier", how="left"
)

figure = render_scatter_plotly(merged_solubility_annotations_and_projected_embeddings)
figure.show()
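Re-coloring works because both annotation tables are joined against the same projection, so each protein keeps its position and only the label changes. A minimal sketch on toy data (all identifiers, labels, and coordinates are made up) that checks this property:

```python
from pandas import DataFrame

# One shared projection, indexed by identifier.
toy_projection = DataFrame(
    [[0.1, 1.2], [2.3, -0.4]],
    columns=["component_0", "component_1"],
    index=["P1", "P2"],
)

# Two different labelings of the same proteins.
toy_localization = DataFrame({"identifier": ["P1", "P2"], "label": ["Nucleus", "Cytoplasm"]})
toy_solubility = DataFrame({"identifier": ["P1", "P2"], "label": ["Soluble", "Membrane"]})

by_localization = toy_localization.join(toy_projection, on="identifier", how="left")
by_solubility = toy_solubility.join(toy_projection, on="identifier", how="left")

# The coordinates are identical; only the label column differs.
coordinate_columns = ["component_0", "component_1"]
same_coordinates = by_localization[coordinate_columns].equals(by_solubility[coordinate_columns])
print(same_coordinates)
```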

Visualize in 3D

The bio_embeddings pipeline also allows you to plot 3D protein spaces, which might be better suited for overarching annotations. Additionally, we will only look at definite solubility annotations (i.e., we drop the “unknown” annotations).
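Dropping the unknown annotations is a boolean filter on the `label` column. A minimal sketch on toy rows (identifiers and the "Soluble"/"Membrane" label values are made up; only the "Unknown" value is taken from the code in this notebook) that also reports how many rows the filter removes:

```python
from pandas import DataFrame

toy_solubility = DataFrame({
    "identifier": ["P1", "P2", "P3", "P4"],
    "label": ["Soluble", "Unknown", "Membrane", "Unknown"],
})

# Boolean mask: True for rows with a definite annotation.
is_known = toy_solubility["label"] != "Unknown"
only_known = toy_solubility[is_known]

n_dropped = int((~is_known).sum())
print(f"kept {len(only_known)} rows, dropped {n_dropped}")
```

Note that the comparison is case-sensitive, so the string must match the spelling used in the annotations file.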

from copy import deepcopy

# Alternate protocol 1 - Step 2
from bio_embeddings.visualize import render_3D_scatter_plotly

# Alternate protocol 1 - Step 1
three_dimensional_options = deepcopy(options)

# We use three components, as we now want three-dimensional data points.
three_dimensional_options['n_components'] = 3

three_dimensional_projected_embeddings = umap_reduce(embeddings, **three_dimensional_options)

three_dimensional_projected_embeddings_dataframe = DataFrame(
    three_dimensional_projected_embeddings,
    # Alternate protocol 1 - Step 2
    columns=["component_0", "component_1", "component_2"],
    index=identifiers
)

only_known = solubility_annotations[solubility_annotations['label'] != "Unknown"]

merged_solubility_annotations_and_3D_projected_embeddings = only_known.join(
    three_dimensional_projected_embeddings_dataframe, on="identifier", how="left"
)

# Alternate protocol 1 - Step 2
figure = render_3D_scatter_plotly(merged_solubility_annotations_and_3D_projected_embeddings)
figure.show()
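The 3D projection copies `options` before changing `n_components`, so the 2D settings remain available for later reuse. A minimal sketch of why the copy matters; for a flat dictionary like this one a shallow copy would also suffice, and `deepcopy` is simply the safer default if nested values are added later:

```python
from copy import deepcopy

options = {
    'min_dist': .1,
    'spread': 8,
    'n_neighbors': 160,
    'metric': 'euclidean',
    'n_components': 2,
    'random_state': 10,
}

# Copy first, then modify only the copy.
three_dimensional_options = deepcopy(options)
three_dimensional_options['n_components'] = 3

# Equivalent for a flat dict: build a new dict with one key overridden.
alternative = {**options, 'n_components': 3}

print(options['n_components'], three_dimensional_options['n_components'])
```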