bio_embeddings.utilities
Various helpers
- exception bio_embeddings.utilities.InvalidParameterError
Exception for invalid parameter settings
- class bio_embeddings.utilities.QueryEmbeddingsFile(embeddings_file: h5py._hl.files.File)
A helper class that retrieves embeddings from an embeddings file based on either the original_id (extracted from the FASTA header during the embed stage) or the new_id (assigned during the embed stage: either an MD5 hash of the input sequence or, if remapping_simple: True, an integer).
Available for embeddings created with the pipeline starting with v0.1.5
import h5py
from bio_embeddings.utilities import QueryEmbeddingsFile

with h5py.File("path/to/file.h5", "r") as file:
    embedding_querier = QueryEmbeddingsFile(file)
    print(embedding_querier.query_original_id("Some_Database_ID_1234").mean())
- __init__(embeddings_file: h5py._hl.files.File)
- Parameters
embeddings_file – an h5py File object, e.g. h5py.File("/path/to/file.h5").
- bio_embeddings.utilities.check_required(params: dict, keys: List[str])
Verify that the required set of parameters is present in the configuration.
- Parameters
params (dict) – Dictionary with parameters
keys (list-like) – Parameters that must be present in params
- Raises
MissingParameterError
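A check of this kind can be sketched in a few lines. The block below is an illustrative stand-in, not the library's implementation: MissingParameterError is defined locally, the name check_required_sketch and the error message are assumptions, and only key presence is tested (whether the real function also rejects None values is not stated here).

```python
from typing import Iterable, Mapping


class MissingParameterError(Exception):
    """Local stand-in for the exception named above (assumption)."""


def check_required_sketch(params: Mapping, keys: Iterable[str]) -> None:
    """Raise if any entry of `keys` is absent from `params` (hypothetical helper)."""
    missing = [key for key in keys if key not in params]
    if missing:
        raise MissingParameterError(
            "Missing required parameters: " + ", ".join(missing)
        )


# Hypothetical configuration, used only for illustration
config = {"sequences_file": "input.fasta", "prefix": "run_1"}

check_required_sketch(config, ["sequences_file", "prefix"])  # passes silently

try:
    check_required_sketch(config, ["sequences_file", "protocol"])
except MissingParameterError as error:
    print(error)
```

Collecting all missing keys before raising lets a single error message report every gap in the configuration at once.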
- bio_embeddings.utilities.convert_list_of_enum_to_string(list_of_enums: List[enum.Enum]) -> str
- bio_embeddings.utilities.get_device(device: Union[None, str, torch.device] = None) -> torch.device
Returns what the user specified, or defaults to the GPU, with a fallback to CPU if no GPU is available.
- bio_embeddings.utilities.get_model_directories_from_zip(model: Optional[str] = None, directory: Optional[str] = None, overwrite_cache: bool = False) -> str
If the specified asset directory for the model is in the user cache, returns the directory path; otherwise downloads the zipped directory, unpacks it into the cache, and returns the location.
- bio_embeddings.utilities.get_model_file(model: Optional[str] = None, file: Optional[str] = None, overwrite_cache: bool = False) -> str
If the specified asset for the model is in the user cache, returns the location; otherwise downloads the file to the cache and returns the location.
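The cache-or-download behaviour described for these two functions can be illustrated with a minimal sketch. Everything here is hypothetical: cached_file_sketch, its parameters, and the injected fetch callable (which replaces the real download step) are assumptions made for illustration, and the library's actual cache layout is not shown.

```python
import tempfile
from pathlib import Path
from typing import Callable


def cached_file_sketch(
    cache_dir: Path,
    relative_name: str,
    fetch: Callable[[Path], None],
    overwrite_cache: bool = False,
) -> str:
    """Return the cached file's location, calling `fetch` only when needed."""
    target = cache_dir / relative_name
    if overwrite_cache or not target.is_file():
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(target)  # hypothetical download step that writes to `target`
    return str(target)


def fake_download(destination: Path) -> None:
    """Stand-in for a real download; writes placeholder content."""
    destination.write_text("placeholder")


with tempfile.TemporaryDirectory() as tmp:
    first = cached_file_sketch(Path(tmp), "some_model/weights.bin", fake_download)
    # The second call finds the file in the cache and skips fake_download
    second = cached_file_sketch(Path(tmp), "some_model/weights.bin", fake_download)
    print(first == second)
```

Passing overwrite_cache=True forces the fetch step even when the file already exists, which mirrors the flag in the signatures above.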
- bio_embeddings.utilities.read_fasta(path: str) -> List[Bio.SeqRecord.SeqRecord]
Helper function to read FASTA file.
- Parameters
path – path to a valid FASTA file
- Returns
a list of SeqRecord objects.
- bio_embeddings.utilities.read_mapping_file(mapping_file: str) -> pandas.core.frame.DataFrame
Reads mapping_file.csv and ensures consistent types
- bio_embeddings.utilities.reindex_h5_file(h5_file_path: str, mapping_file_path: str)
Renames the dataset keys using the “original_id” from the mapping file. This operation is generally considered unsafe, because an “original_id” may contain invalid characters, be duplicated, or be an empty string.
Some sanity checks are performed before the renaming starts, but using this function is discouraged unless you know what you are doing.
- Parameters
h5_file_path – path to the HDF5 file to re-index
mapping_file_path – path to the mapping file (its first column must hold the current keys, and a column “original_id” must hold the new desired ids)
- Returns
Nothing – conversion happens in place!
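To make the three risks named above concrete (invalid characters, duplicates, empty strings), here is a minimal sketch of a pre-flight check on a list of ids. It is not the set of sanity checks the library performs: find_unsafe_ids is a hypothetical helper, and treating "/" as the invalid character is an assumption based on "/" being the path separator in HDF5.

```python
from collections import Counter
from typing import Dict, List


def find_unsafe_ids(original_ids: List[str]) -> Dict[str, List[str]]:
    """Group problematic ids by the kind of problem (hypothetical helper)."""
    counts = Counter(original_ids)
    return {
        # Empty or whitespace-only ids cannot serve as dataset keys
        "empty": [i for i in original_ids if not i.strip()],
        # Ids that occur more than once would collide after renaming
        "duplicate": sorted(i for i, n in counts.items() if n > 1),
        # "/" separates groups in HDF5 paths, so such an id would be read
        # as a nested path rather than as a single key
        "contains_slash": [i for i in original_ids if "/" in i],
    }


# Made-up ids, used only for illustration
problems = find_unsafe_ids(["P12345", "sp/Q9XYZ1", "P12345", ""])
for kind, ids in problems.items():
    print(kind, ids)
```

Running a check like this on the “original_id” column before calling reindex_h5_file gives an early indication of whether the renaming is likely to be safe.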
- bio_embeddings.utilities.reindex_sequences(sequence_records: List[Bio.SeqRecord.SeqRecord], simple=False) -> (<class 'Bio.SeqRecord.SeqRecord'>, <class 'pandas.core.frame.DataFrame'>)
Sorts and re-indexes sequence_records in place (the original list is modified). Returns a DataFrame with the mapping.
- Parameters
sequence_records – List of sequence records
simple – Boolean; if True, use a numerical index (1, 2, 3, 4) instead of the MD5 hash
- Returns
A DataFrame holding the mapping, keyed by the new ids, with a column “original_id” containing the previous id, and the sequence length.
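The two id schemes (MD5 hash of the sequence, or a running integer when simple is set) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: build_id_mapping is a hypothetical helper, plain (original_id, sequence) tuples stand in for SeqRecord objects, a list of dicts stands in for the DataFrame, the field names new_id and sequence_length are invented here, and the sorting step is omitted.

```python
import hashlib
from typing import Dict, List, Tuple


def build_id_mapping(
    records: List[Tuple[str, str]], simple: bool = False
) -> List[Dict[str, object]]:
    """Map each (original_id, sequence) pair to a new id (hypothetical helper)."""
    mapping = []
    for position, (original_id, sequence) in enumerate(records, start=1):
        if simple:
            # Numerical index: 1, 2, 3, ...
            new_id = str(position)
        else:
            # MD5 hex digest of the sequence itself
            new_id = hashlib.md5(sequence.encode("utf-8")).hexdigest()
        mapping.append(
            {
                "new_id": new_id,
                "original_id": original_id,
                "sequence_length": len(sequence),
            }
        )
    return mapping


# Made-up records, used only for illustration
records = [("protein_a", "MKTAYIAK"), ("protein_b", "MKV")]
for row in build_id_mapping(records, simple=True):
    print(row)
```

Because the MD5 id is derived from the sequence alone, two records with identical sequences receive the same new id in this sketch, whereas the simple scheme always yields distinct ids.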