Colab initialization

  • install the pipeline in the Colab runtime

  • download the files necessary for this example

!pip3 install -U pip > /dev/null
!pip3 install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git" > /dev/null
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/prottrans_bert_embeddings_file.h5 --output-document prottrans_bert_embeddings_file.h5
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/seqvec_embeddings_file.h5 --output-document seqvec_embeddings_file.h5
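
Before plotting, it can be worth checking that both downloads succeeded and seeing what the files contain. The sketch below is not part of the original notebook: `summarize_embeddings` is a hypothetical helper, and it assumes that every top-level key in the file is a dataset (one per protein), which is how the cells below read it.

```python
import os

import h5py


def summarize_embeddings(path):
    """Return {protein_id: shape} for each top-level dataset in an HDF5 file.

    Hypothetical helper; assumes every top-level key is a dataset, not a group.
    """
    with h5py.File(path, "r") as f:
        return {protein_id: f[protein_id].shape for protein_id in f.keys()}


# File names are the ones written by the wget commands above.
for name in ("prottrans_bert_embeddings_file.h5", "seqvec_embeddings_file.h5"):
    if os.path.exists(name):
        for protein_id, shape in summarize_embeddings(name).items():
            print(name, protein_id, shape)
    else:
        print(name, "not found; re-run the download step")
```

If a file is reported as missing, the corresponding plotting cell below will fail when it tries to open it.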

Visualize the RAW embeddings as images

Plotting each embedding as a heatmap can help you spot patterns visually.

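import os
import h5py
import plotly
import numpy as np
import plotly.express as px

# Create the output folder first; writing the HTML files fails if it is missing
os.makedirs('figures', exist_ok=True)

# Load every embedding in the file as (identifier, array) pairs
proteins = []
with h5py.File('prottrans_bert_embeddings_file.h5', 'r') as f:
    for protein_id in f.keys():
        proteins.append((protein_id, np.array(f[protein_id])))

# Write one interactive heatmap per protein; the color scale spans [-0.7, 0.7]
for (identifier, embedding) in proteins:
    fig = px.imshow(np.rot90(embedding), color_continuous_scale='RdBu', zmin=-.7, zmax=.7)
    plotly.offline.plot(fig, filename="figures/"+identifier+"_prottrans_bert.html")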
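
The `np.rot90` call decides how the heatmap is oriented. A minimal sketch of its effect, assuming each embedding is a 2-D array of shape (residues, features); the toy array below is made up for illustration and is much smaller than a real embedding.

```python
import numpy as np

# Hypothetical toy embedding: 5 residues, 3 features per residue.
embedding = np.arange(15).reshape(5, 3)

# rot90 rotates the array 90 degrees counterclockwise.
rotated = np.rot90(embedding)

print(embedding.shape)  # (5, 3)
print(rotated.shape)    # (3, 5)

# Column j of the rotated array still belongs to residue j, but its features
# are listed in reverse order, so row 0 holds the last feature.
print(rotated[:, 0])    # [2 1 0]
```

Under that assumption, the sequence runs left to right in the image and the feature dimension runs bottom to top.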

The next block reads seqvec_embeddings_file.h5 from the working directory, where the initialization step saved it. If you skipped that download, generate some SeqVec embeddings with the pipeline and store them under that file name.

Note that we pick the second embedding layer (index 1) from the SeqVec embeddings.

# Load the second layer (index 1) of every SeqVec embedding
proteins = []
with h5py.File('seqvec_embeddings_file.h5', 'r') as f:
    for protein_id in f.keys():
        proteins.append((protein_id, np.array(f[protein_id])[1]))

# Write one interactive heatmap per protein; the color scale spans [-1.2, 1.2]
for (identifier, embedding) in proteins:
    fig = px.imshow(np.rot90(embedding), color_continuous_scale='RdBu', zmin=-1.2, zmax=1.2)
    plotly.offline.plot(fig, filename="figures/"+identifier+"_seqvec.html")
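
To make the layer selection concrete, here is a small sketch of what indexing with `[1]` does. It assumes the stored SeqVec arrays have the layout (layers, residues, features); the sizes below are toy values chosen for illustration, not the real dimensions.

```python
import numpy as np

# Assumed layout: (layers, residues, features). Toy sizes, random values.
n_layers, n_residues, n_features = 3, 4, 6
rng = np.random.default_rng(0)
stacked = rng.normal(size=(n_layers, n_residues, n_features))

# Indexing the first axis with 1 keeps only the second layer and drops that
# axis, leaving a 2-D array that can be drawn as a single image.
second_layer = stacked[1]

print(stacked.shape)       # (3, 4, 6)
print(second_layer.shape)  # (4, 6)
```

Changing the index selects a different layer without touching the rest of the plotting code.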