Colab initialization

  • Install the pipeline in the Colab runtime

  • Download the files necessary for this example

!pip3 install -U pip > /dev/null
!pip3 install -U "bio_embeddings[all]" > /dev/null
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/reduced_embeddings_file.h5 --output-document reduced_embeddings_file.h5
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/mapping_file.csv --output-document mapping_file.csv

Open the embeddings from a pipeline run and use them

Below we open some SeqVec embeddings produced by the pipeline and use them for visualizations. We start with the reduced embeddings: these are per-protein embeddings (one fixed-size vector per sequence), as opposed to per-amino-acid embeddings.
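To make the per-protein versus per-amino-acid distinction concrete, here is a minimal sketch of how a per-residue embedding can be collapsed into a single vector. It assumes the reduction is a mean over residues and that each residue vector has 1024 dimensions. Neither detail is stated in this notebook, and the array below is random data, not pipeline output.

```python
import numpy as np

# Hypothetical per-amino-acid embedding: one 1024-dimensional vector per residue.
# Both the sequence length and the dimensionality are assumptions for illustration.
sequence_length = 120
per_residue = np.random.default_rng(0).normal(size=(sequence_length, 1024))

# Averaging over the residue axis yields one fixed-size vector per protein,
# regardless of how long the sequence is.
per_protein = per_residue.mean(axis=0)

print(per_residue.shape, per_protein.shape)
```

The point of the reduction is that proteins of different lengths end up with vectors of the same shape, which is what makes them easy to compare and plot.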

import h5py
import numpy as np

# Collect (new identifier, original FASTA ID, embedding) for every protein in the file.
proteins = []
with h5py.File('reduced_embeddings_file.h5', 'r') as f:
    for new_identifier, dataset in f.items():
        # The ID taken from the FASTA header is stored in the "original_id" attribute.
        proteins.append((new_identifier, dataset.attrs["original_id"], np.array(dataset)))

print(f"The first protein in the set was assigned the identifier: {proteins[0][0]}.")
print(f"The ID extracted from the FASTA header is: {proteins[0][1]}.")
print(f"The shape of the embedding is: {proteins[0][2].shape}.")
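As a possible next step toward visualization, the list of tuples can be stacked into a single matrix and projected to two dimensions. The sketch below uses a plain NumPy PCA (via SVD) as a stand-in; it is not necessarily the projection the pipeline itself uses. The helper names `stack_embeddings` and `pca_2d` are hypothetical, and the demo runs on random vectors with made-up IDs so that it works without the downloaded file.

```python
import numpy as np


def stack_embeddings(proteins):
    """Split (identifier, original_id, embedding) tuples into IDs and an (n, d) matrix."""
    original_ids = [original_id for _, original_id, _ in proteins]
    matrix = np.stack([embedding for _, _, embedding in proteins])
    return original_ids, matrix


def pca_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Synthetic stand-in for the `proteins` list (hypothetical IDs, random vectors).
rng = np.random.default_rng(0)
demo_proteins = [(str(i), f"protein_{i}", rng.normal(size=1024)) for i in range(5)]

identifiers, matrix = stack_embeddings(demo_proteins)
coordinates = pca_2d(matrix)
print(matrix.shape, coordinates.shape)
```

Passing the real `proteins` list instead of `demo_proteins` should give one 2D point per protein, which can then be handed to any plotting library.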