Colab initialization
Install the pipeline in the Colab runtime.
Download the files necessary for this example.
import os
!pip3 install -U pip > /dev/null
!pip3 install -U bio_embeddings[all] > /dev/null
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/prottrans_bert_embeddings_file.h5 --output-document prottrans_bert_embeddings_file.h5
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/seqvec_embeddings_file.h5 --output-document seqvec_embeddings_file.h5
Visualize the RAW embeddings as images
Plotting the raw embeddings as heatmaps can help to spot patterns visually.
import h5py
import plotly
import numpy as np
import plotly.express as px
proteins = []
with h5py.File('prottrans_bert_embeddings_file.h5', 'r') as f:
    for protein_id in f.keys():
        proteins.append((protein_id, np.array(f[protein_id])))
# Create the output folder; exist_ok avoids an error when the cell is re-run
os.makedirs("figures", exist_ok=True)
for (identifier, embedding) in proteins:
    # Rotate the embedding by 90 degrees before plotting it as a heatmap
    fig = px.imshow(np.rot90(embedding), color_continuous_scale='RdBu', zmin=-.7, zmax=.7)
    plotly.offline.plot(fig, filename="figures/"+identifier+"_prottrans_bert.html")
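The effect of the np.rot90 call in the plotting loop above can be checked on a small synthetic array. This is a minimal sketch, assuming each embedding is stored as a 2D array with one row per residue and one column per feature; the sizes below are made up for illustration and are not taken from the example file.

```python
import numpy as np

# Hypothetical stand-in for one per-residue embedding: 5 residues x 8 features.
# Real embeddings are much wider; these sizes are only for illustration.
n_residues, n_features = 5, 8
embedding = np.arange(n_residues * n_features, dtype=float).reshape(n_residues, n_features)

# np.rot90 rotates the array by 90 degrees counterclockwise.
rotated = np.rot90(embedding)

print(embedding.shape)  # (5, 8)
print(rotated.shape)    # (8, 5)

# After the rotation, column j holds residue j, with the feature order reversed:
# the top row of the image is the last feature, the bottom row is the first.
first_column_matches = np.array_equal(rotated[:, 0], embedding[0, ::-1])
print(first_column_matches)  # True
```

Under that assumption, the residues run left to right along the x-axis of the heatmap and the features run bottom to top along the y-axis.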
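For the next block of code to work, you need SeqVec embeddings stored as seqvec_embeddings_file.h5. On Colab, the initialization cell above already downloads this file; otherwise, generate some SeqVec embeddings and store the file in the pipeline_output_example folder.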
Note that we are picking the second embedding layer from the seqvec embeddings.
proteins = []
with h5py.File('seqvec_embeddings_file.h5', 'r') as f:
    for protein_id in f.keys():
        # Index 1 selects the second embedding layer
        proteins.append((protein_id, np.array(f[protein_id])[1]))
for (identifier, embedding) in proteins:
    fig = px.imshow(np.rot90(embedding), color_continuous_scale='RdBu', zmin=-1.2, zmax=1.2)
    plotly.offline.plot(fig, filename="figures/"+identifier+"_seqvec.html")
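The layer selection used for the SeqVec embeddings can be illustrated on synthetic data. This is a minimal sketch, assuming each stored SeqVec entry is a 3D array laid out as (layers, residues, features); the sizes and the random values below are hypothetical and do not come from the example file.

```python
import numpy as np

# Assumed layout: (layers, residues, features). All sizes are made up.
n_layers, n_residues, n_features = 3, 6, 4
rng = np.random.default_rng(0)
layered = rng.normal(size=(n_layers, n_residues, n_features))

# Indexing with [1] keeps the second layer and drops the layer axis,
# which leaves a 2D array that can be plotted as a single heatmap.
second_layer = layered[1]

print(layered.shape)       # (3, 6, 4)
print(second_layer.shape)  # (6, 4)
```

Changing the index to 0 or 2 would select a different layer under the same assumption, which makes it easy to compare layers side by side.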