Colab initialization

  • install the pipeline in the colab runtime

  • download files neccessary for this example

!pip3 install -U pip > /dev/null
!pip3 install -U bio_embeddings[all] > /dev/null
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/disprot/reduced_embeddings_file.h5 --output-document reduced_embeddings_file.h5
!wget http://data.bioembeddings.com/public/embeddings/notebooks/pipeline_output_example/disprot/mapping_file.csv --output-document mapping_file.csv

Reindex the embeddings generated from the pipeline

In order to avoid fauly ids from the FASTA headers, the pipeline automatically generates ids for the sequences passed. At the end of a pipeline run, you might want to attempt to re-index these. The pipeline provides a convenience function that does this in-place (changes the original file!)

## This is just to get some logging output in the Notebook

import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

When executing pipeline runs, your input sequences will get assigned a new internal identifier. This identifier corresponds to the md5 hash of the sequence. We do this, becuase for storing and processing purposes we need unique strings as identifiers, and unfortunately, some FASTA files contain invalid characters in the header.

Nevertheless, sometimes you may want to convert the keys contained in the h5 files produces from the pipeline back from the internal ids to their original id as in the FASTA header of the input sequence.

We produce a mapping_file.csv which shows this mapping (the first, unnamed column represents the sequence’ md5 hash, while the column original_id represents the extracted id from the input FASTA)

This operation can be dangerous, because if the original_id contains invalid characters or is empty, the h5 file will be corrupted.

Nevertheless, we make a helper function available which converts the internal ids back to the original ids in place, meaning that the h5 file will be directly modified (this is meant to avoid duplication of large h5 files, but with the risk of corrupting the original file. Please: only perform this operation if you are sure about what you are doing, and if it’s strictly neccessary!)

import h5py
from bio_embeddings.utilities import reindex_h5_file
# Let's check the keys of our h5 file:
with h5py.File("reduced_embeddings_file.h5", "r") as h5_file:
    for key in h5_file.keys():
        print(key,)
# In place re-indexing of h5 file

reindex_h5_file("reduced_embeddings_file.h5", "mapping_file.csv")
# Let's check the new keys of our h5 file:
with h5py.File("reduced_embeddings_file.h5", "r") as h5_file:
    for key in h5_file.keys():
        print(key,)