bio_embeddings.utilities

Various helpers

class bio_embeddings.utilities.FileManagerInterface
abstract __init__()

Initialize self. See help(type(self)) for accurate signature.

abstract create_directory(prefix, stage, directory_name) -> str
abstract create_file(prefix, stage, file_name, extension=None) -> str
abstract create_prefix(prefix) -> str
abstract create_stage(prefix, stage) -> str
abstract exists(prefix, stage=None, file_name=None, extension=None) -> bool
abstract get_file(prefix, stage, file_name, extension=None) -> str
class bio_embeddings.utilities.FileSystemFileManager
__init__()

Initialize self. See help(type(self)) for accurate signature.

create_directory(prefix, stage, directory_name) -> str
create_file(prefix, stage, file_name, extension=None) -> str
create_prefix(prefix) -> str
create_stage(prefix, stage) -> str
exists(prefix, stage=None, file_name=None, extension=None) -> bool
get_file(prefix, stage, file_name, extension=None) -> str
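
The signatures above suggest a three-level layout of prefix, stage, and file. The following is a minimal sketch of that idea using only the standard library. It is not the library's implementation: the class name `SketchFileManager`, the nested-directory layout, and the choice to create an empty file in `create_file` are all assumptions made for illustration.

```python
import os
import tempfile
from typing import Optional


class SketchFileManager:
    """Hypothetical stand-in; NOT bio_embeddings.utilities.FileSystemFileManager."""

    @staticmethod
    def _path(prefix: str, stage: Optional[str] = None,
              file_name: Optional[str] = None,
              extension: Optional[str] = None) -> str:
        # Assumed layout: <prefix>/<stage>/<file_name><extension>
        path = prefix
        if stage is not None:
            path = os.path.join(path, stage)
        if file_name is not None:
            path = os.path.join(path, file_name + (extension or ""))
        return path

    def create_prefix(self, prefix: str) -> str:
        os.makedirs(prefix, exist_ok=True)
        return prefix

    def create_stage(self, prefix: str, stage: str) -> str:
        path = self._path(prefix, stage)
        os.makedirs(path, exist_ok=True)
        return path

    def create_directory(self, prefix: str, stage: str, directory_name: str) -> str:
        path = os.path.join(self._path(prefix, stage), directory_name)
        os.makedirs(path, exist_ok=True)
        return path

    def create_file(self, prefix: str, stage: str, file_name: str,
                    extension: Optional[str] = None) -> str:
        path = self._path(prefix, stage, file_name, extension)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "a").close()  # create an empty file if it does not exist yet
        return path

    def exists(self, prefix: str, stage: Optional[str] = None,
               file_name: Optional[str] = None,
               extension: Optional[str] = None) -> bool:
        return os.path.exists(self._path(prefix, stage, file_name, extension))

    def get_file(self, prefix: str, stage: str, file_name: str,
                 extension: Optional[str] = None) -> str:
        return self._path(prefix, stage, file_name, extension)


# Usage inside a throwaway directory; "my_run" and "embed" are illustrative names.
with tempfile.TemporaryDirectory() as root:
    manager = SketchFileManager()
    prefix = manager.create_prefix(os.path.join(root, "my_run"))
    manager.create_stage(prefix, "embed")
    manager.create_file(prefix, "embed", "embeddings_file", extension=".h5")
    print(manager.exists(prefix, "embed", "embeddings_file", ".h5"))
```

Keeping path construction in one helper means `exists` and `get_file` cannot drift apart from the `create_*` methods.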
exception bio_embeddings.utilities.InvalidParameterError

Exception for invalid parameter settings

class bio_embeddings.utilities.QueryEmbeddingsFile(embeddings_file: h5py._hl.files.File)

A helper class for retrieving embeddings from an embeddings file by either the original_id (extracted from the FASTA header during the embed stage) or the new_id (assigned during the embed stage: either an MD5 hash of the input sequence, or an integer if remapping_simple: True is set).

Available for embeddings created with pipeline v0.1.5 or later.

import h5py
from bio_embeddings.utilities import QueryEmbeddingsFile

with h5py.File("path/to/file.h5", "r") as file:
    embedding_querier = QueryEmbeddingsFile(file)
    print(embedding_querier.query_original_id("Some_Database_ID_1234").mean())
__init__(embeddings_file: h5py._hl.files.File)
Parameters

embeddings_file – an h5py File, e.g. h5py.File("/path/to/file.h5", "r").

query_new_id(new_id: str) -> numpy.array

Query the embeddings file using the new id, i.e. either the MD5 hash of the sequence or a number.

Parameters

new_id – a string representing the new id.

Returns

the embedding as a numpy array

query_original_id(original_id: str) -> numpy.array

Query the embeddings file using the original id, i.e. the string extracted from the FASTA header of the sequence.

Parameters

original_id – a string representing the id extracted from the FASTA header

Returns

the embedding as a numpy array

bio_embeddings.utilities.check_required(params: dict, keys: List[str])

Verify that the required set of parameters is present in the configuration.

Parameters
  • params (dict) – Dictionary with parameters

  • keys (list-like) – Set of parameters that has to be present in params

Raises

MissingParameterError
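
A check of this kind can be sketched in a few lines. The names `check_required_sketch` and `MissingParameterErrorSketch` are hypothetical stand-ins, the configuration keys are illustrative, and treating a key as missing only when it is absent from the dictionary (rather than also when its value is None) is an assumption.

```python
from typing import Iterable


class MissingParameterErrorSketch(Exception):
    """Hypothetical stand-in for the library's MissingParameterError."""


def check_required_sketch(params: dict, keys: Iterable[str]) -> None:
    # Collect every missing key so the caller sees all problems at once.
    missing = [key for key in keys if key not in params]
    if missing:
        raise MissingParameterErrorSketch(
            "Missing required parameters: " + ", ".join(missing)
        )


# Illustrative configuration; the key names are assumptions.
config = {"sequences_file": "sequences.fasta", "prefix": "my_run"}

check_required_sketch(config, ["sequences_file", "prefix"])  # passes silently

try:
    check_required_sketch(config, ["sequences_file", "protocol"])
except MissingParameterErrorSketch as error:
    print(error)
```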

bio_embeddings.utilities.get_device(device: Union[None, str, torch.device] = None) -> torch.device

Returns the device the user specified. If none is given, defaults to the GPU and falls back to the CPU when no GPU is available.

bio_embeddings.utilities.get_file_manager(**kwargs)
bio_embeddings.utilities.get_model_directories_from_zip(model: Optional[str] = None, directory: Optional[str] = None, overwrite_cache: bool = False) -> str

If the specified asset directory for the model is already in the user cache, returns the directory path. Otherwise downloads the zipped directory, unpacks it into the cache, and returns the location.

bio_embeddings.utilities.get_model_file(model: Optional[str] = None, file: Optional[str] = None, overwrite_cache: bool = False) -> str

If the specified asset for the model is already in the user cache, returns its location. Otherwise downloads the file to the cache and returns the location.

bio_embeddings.utilities.read_fasta(path: str) -> List[Bio.SeqRecord.SeqRecord]

Helper function to read a FASTA file.

Parameters

path – path to a valid FASTA file

Returns

a list of SeqRecord objects.
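
For readers unfamiliar with the format, the sketch below parses FASTA text with the standard library only. It is not how read_fasta works internally: the function name `parse_fasta_text` is hypothetical, and it returns plain (id, sequence) tuples instead of SeqRecord objects. Taking the id as the first whitespace-delimited token of the header line mirrors the usual Biopython convention.

```python
from typing import List, Tuple


def parse_fasta_text(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted text into (id, sequence) tuples."""
    records: List[Tuple[str, str]] = []
    current_id = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                records.append((current_id, "".join(chunks)))
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            chunks = []
        elif current_id is not None:
            # Sequence lines may be wrapped; join them later.
            chunks.append(line)
    if current_id is not None:
        records.append((current_id, "".join(chunks)))
    return records


example = ">Some_Database_ID_1234 example protein\nMKT\nAYI\n>seq2\nGAVL\n"
for record_id, sequence in parse_fasta_text(example):
    print(record_id, len(sequence))
```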

bio_embeddings.utilities.read_mapping_file(mapping_file: str) -> pandas.core.frame.DataFrame

Reads mapping_file.csv and ensures consistent types

bio_embeddings.utilities.reindex_h5_file(h5_file_path: str, mapping_file_path: str)

Renames the dataset keys using the “original_id” from the mapping file. This operation is generally unsafe, because an “original_id” may contain invalid characters, be duplicated, or be an empty string.

Some sanity checks are performed before the renaming starts, but using this function is discouraged unless you know what you are doing.

Parameters
  • h5_file_path – path to the HDF5 file to re-index

  • mapping_file_path – path to the mapping file (its first column must contain the current keys, and it must have a column “original_id” holding the desired new ids)

Returns

Nothing – conversion happens in place!
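
The risks named above (invalid characters, duplicates, empty strings) can be checked up front. The sketch below is not the set of sanity checks the function itself runs, which is not documented here. The helper name `find_unsafe_ids` is hypothetical, and treating "/" as the invalid character is an assumption based on HDF5 using it as the group separator in object names.

```python
from collections import Counter
from typing import Dict, List


def find_unsafe_ids(original_ids: List[str]) -> Dict[str, List[str]]:
    """Report ids that would be risky to use as HDF5 dataset keys."""
    counts = Counter(original_ids)
    problems = {
        # Empty or whitespace-only ids cannot serve as meaningful keys.
        "empty": [i for i in original_ids if not i.strip()],
        # Duplicates would collide when used as dataset names.
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
        # "/" separates groups in HDF5 paths.
        "contains_slash": [i for i in original_ids if "/" in i],
    }
    # Only report categories that actually have offenders.
    return {name: ids for name, ids in problems.items() if ids}


ids = ["P1", "P1", "", "a/b", "P2"]
print(find_unsafe_ids(ids))
```

An empty result means none of these three problems was found. It does not guarantee that renaming is safe.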

bio_embeddings.utilities.reindex_sequences(sequence_records: List[Bio.SeqRecord.SeqRecord], simple=False) -> (Bio.SeqRecord.SeqRecord, pandas.core.frame.DataFrame)

Sorts and re-indexes sequence_records in place, i.e. the list passed in is modified. Returns a DataFrame with the mapping.

Parameters
  • sequence_records – List of sequence records

  • simple – Boolean; if True, use a numerical index (1, 2, 3, 4) instead of the MD5 hash

Returns

A DataFrame with the mapping, indexed by the new ids, with a column “original_id” containing the previous id and a column with the sequence length.
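
The re-indexing idea can be illustrated without Biopython or pandas. In the sketch below, records are plain dictionaries and the mapping is a list of dictionaries. The function name `reindex_records_sketch`, the sort order (longest sequence first), the UTF-8 encoding used for hashing, and the field name "sequence_length" are all assumptions, not documented behaviour.

```python
import hashlib
from typing import Dict, List


def reindex_records_sketch(records: List[Dict[str, str]],
                           simple: bool = False) -> List[Dict[str, object]]:
    """Sort and re-index `records` in place; return the id mapping."""
    # Assumption: longest sequences first (the sort key is not stated above).
    records.sort(key=lambda record: len(record["sequence"]), reverse=True)

    mapping = []
    for index, record in enumerate(records, start=1):
        if simple:
            new_id = str(index)  # numerical index: 1, 2, 3, ...
        else:
            # MD5 hex digest of the sequence string (encoding is an assumption).
            new_id = hashlib.md5(record["sequence"].encode("utf-8")).hexdigest()
        mapping.append({
            "new_id": new_id,
            "original_id": record["id"],
            "sequence_length": len(record["sequence"]),
        })
        record["id"] = new_id  # modifies the caller's record
    return mapping


sequences = [
    {"id": "short_one", "sequence": "MK"},
    {"id": "long_one", "sequence": "MKTAYI"},
]
for row in reindex_records_sketch(sequences, simple=True):
    print(row)
```

Because the hash depends only on the sequence, two records with identical sequences receive the same new id in this sketch.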

bio_embeddings.utilities.write_fasta_file(sequence_records: List[Bio.SeqRecord.SeqRecord], file_path: str) -> None