bio_embeddings.utilities
Various helpers
- 
exception bio_embeddings.utilities.InvalidParameterError
- Exception for invalid parameter settings 
- 
class bio_embeddings.utilities.QueryEmbeddingsFile(embeddings_file: h5py._hl.files.File)
- A helper class that retrieves embeddings from an embeddings file, either by the original_id (extracted from the FASTA header during the embed stage) or by the new_id (assigned during the embed stage: an MD5 hash of the input sequence, or an integer if remapping_simple: True).
- Available for embeddings created with the pipeline starting with v0.1.5.

```python
import h5py
from bio_embeddings.utilities import QueryEmbeddingsFile

with h5py.File("path/to/file.h5", "r") as file:
    embedding_querier = QueryEmbeddingsFile(file)
    print(embedding_querier.query_original_id("Some_Database_ID_1234").mean())
```
- 
__init__(embeddings_file: h5py._hl.files.File)
- Parameters
- embeddings_file – an h5py File, e.g. h5py.File("/path/to/file.h5"). 
 
 
- 
- 
bio_embeddings.utilities.check_required(params: dict, keys: List[str])
- Verify that the required set of parameters is present in the configuration.
- Parameters
- params – dict; dictionary with parameters 
- keys – list-like; set of parameters that have to be present in params 
 
- Raises
- MissingParameterError 
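
The check described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `check_required_sketch` and the local `MissingParameterError` class are stand-ins, the configuration keys are made up, and treating a key as missing only when it is absent from the dict is an assumption.

```python
from typing import List


class MissingParameterError(Exception):
    """Local stand-in for the library exception of the same name (assumption)."""


def check_required_sketch(params: dict, keys: List[str]) -> None:
    """Raise if any key in `keys` is absent from `params`."""
    missing = [key for key in keys if key not in params]
    if missing:
        raise MissingParameterError(
            "Missing required parameters: " + ", ".join(missing)
        )


# Illustrative configuration; the key names are hypothetical.
config = {"sequences_file": "sequences.fasta", "prefix": "my_run"}

# All required keys are present, so this returns silently.
check_required_sketch(config, ["sequences_file", "prefix"])

# A missing key raises the stand-in exception.
try:
    check_required_sketch(config, ["sequences_file", "protocol"])
except MissingParameterError as error:
    message = str(error)
    print(message)
```

Collecting all missing keys before raising reports every problem in one pass; whether the real function does this or stops at the first missing key is not stated here.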
- 
bio_embeddings.utilities.get_device(device: Union[None, str, torch.device] = None) → torch.device
- Returns what the user specified, or defaults to the GPU, with a fallback to CPU if no GPU is available. 
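
The selection order described above (explicit choice first, then GPU, then CPU) can be sketched without PyTorch. `resolve_device_name` and its `gpu_available` flag are hypothetical names used only for illustration; the real function returns a torch.device and presumably asks PyTorch whether a GPU is present, for example via `torch.cuda.is_available()`.

```python
from typing import Optional


def resolve_device_name(requested: Optional[str], gpu_available: bool) -> str:
    """Pick a device name: explicit request first, then GPU, then CPU."""
    if requested is not None:
        # The caller's choice always wins.
        return requested
    # No explicit choice: prefer the GPU, fall back to the CPU.
    return "cuda" if gpu_available else "cpu"


print(resolve_device_name(None, gpu_available=True))
print(resolve_device_name(None, gpu_available=False))
print(resolve_device_name("cpu", gpu_available=True))
```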
- 
bio_embeddings.utilities.get_model_directories_from_zip(model: Optional[str] = None, directory: Optional[str] = None, overwrite_cache: bool = False) → str
- If the specified asset directory for the model is in the user cache, returns the directory path; otherwise downloads the zipped directory, unpacks it into the cache, and returns the location. 
- 
bio_embeddings.utilities.get_model_file(model: Optional[str] = None, file: Optional[str] = None, overwrite_cache: bool = False) → str
- If the specified asset for the model is in the user cache, returns the location; otherwise downloads the file to the cache and returns the location. 
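
The cache-or-download behaviour described above can be sketched as follows. Everything here is an assumption made for illustration: `cached_file`, the injected `download` callable, and the temporary cache directory are hypothetical, and the sketch says nothing about where the library keeps its cache or where it downloads from.

```python
import tempfile
from pathlib import Path
from typing import Callable


def cached_file(cache_dir: Path, name: str,
                download: Callable[[Path], None],
                overwrite_cache: bool = False) -> str:
    """Return the cached path of `name`, fetching it only when needed."""
    target = cache_dir / name
    if overwrite_cache or not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        download(target)
    return str(target)


calls = []


def fake_download(target: Path) -> None:
    # Stand-in for a real download: records the call and writes a dummy file.
    calls.append(target.name)
    target.write_text("model weights placeholder")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    first = cached_file(cache, "model_file", fake_download)   # cache miss
    second = cached_file(cache, "model_file", fake_download)  # cache hit
    third = cached_file(cache, "model_file", fake_download,
                        overwrite_cache=True)                 # forced refresh

print(len(calls))
```

Passing the download step in as a callable keeps the sketch runnable offline; only the first call and the forced refresh invoke it.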
- 
bio_embeddings.utilities.read_fasta(path: str) → List[Bio.SeqRecord.SeqRecord]
- Helper function to read a FASTA file.
- Parameters
- path – path to a valid FASTA file 
- Returns
- a list of SeqRecord objects. 
 
- 
bio_embeddings.utilities.read_mapping_file(mapping_file: str) → pandas.core.frame.DataFrame
- Reads mapping_file.csv and ensures consistent types 
- 
bio_embeddings.utilities.reindex_h5_file(h5_file_path: str, mapping_file_path: str)
- Renames the dataset keys using the “original_id” from the mapping file. This operation is generally considered unsafe, because the “original_id” may contain invalid characters, duplicates, or empty strings.
- Some sanity checks are performed before the renaming starts, but applying this function is discouraged unless you know what you are doing.
- Parameters
- h5_file_path – path to the HDF5 file to re-index 
- mapping_file_path – path to the mapping file (its first column must hold the current keys, and it must have a column “original_id” with the desired new ids) 
 
- Returns
- Nothing – conversion happens in place! 
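
To illustrate why “original_id” values can be unsafe as dataset keys, the sketch below screens a list of ids for the three problems named above. It is not the library's sanity check: `find_unsafe_ids` is a hypothetical helper, and treating "/" as the invalid character is an assumption based on HDF5 using "/" as its group separator.

```python
from collections import Counter
from typing import Dict, List


def find_unsafe_ids(original_ids: List[str]) -> Dict[str, List[str]]:
    """Group ids that would be risky to use as HDF5 dataset keys."""
    counts = Counter(original_ids)
    return {
        # Empty or whitespace-only ids cannot name a dataset meaningfully.
        "empty": [i for i in original_ids if i.strip() == ""],
        # Duplicates would collide on the same key.
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
        # "/" separates groups in HDF5 paths, so it would nest the dataset.
        "contains_slash": [i for i in original_ids if "/" in i],
    }


problems = find_unsafe_ids(["P12345", "P12345", "", "sp/Q9XYZ1"])
print(problems)
```

Running a check like this on the mapping file before calling the function is one way to decide whether renaming in place is worth the risk.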
 
- 
bio_embeddings.utilities.reindex_sequences(sequence_records: List[Bio.SeqRecord.SeqRecord], simple=False) → (Bio.SeqRecord.SeqRecord, pandas.core.frame.DataFrame)
- Sorts and re-indexes the sequence_records in place (the original list is changed!). Returns a DataFrame with the mapping.
- Parameters
- sequence_records – List of sequence records 
- simple – Boolean; if True, use a numerical index (1,2,3,4) instead of an MD5 hash 
 
- Returns
- A DataFrame with the mapping, keyed by the new ids, with a column “original_id” containing the previous id, and the sequence length.
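
The two id schemes can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `reindex_sketch` is hypothetical, records are plain (id, sequence) tuples instead of SeqRecord objects, a dict stands in for the DataFrame, the column name "sequence_length" is made up, and the sorting and in-place modification of the real function are left out.

```python
import hashlib
from typing import Dict, List, Tuple


def reindex_sketch(records: List[Tuple[str, str]],
                   simple: bool = False) -> Dict[str, Dict[str, object]]:
    """Map new ids to the original id and the sequence length."""
    mapping: Dict[str, Dict[str, object]] = {}
    for position, (original_id, sequence) in enumerate(records, start=1):
        if simple:
            # Numerical index, following the (1,2,3,4) example above.
            new_id = str(position)
        else:
            # MD5 hex digest of the sequence itself.
            new_id = hashlib.md5(sequence.encode()).hexdigest()
        mapping[new_id] = {
            "original_id": original_id,
            "sequence_length": len(sequence),
        }
    return mapping


records = [("sp|P1|A", "MKTAYIAK"), ("sp|P2|B", "MKT")]
simple_mapping = reindex_sketch(records, simple=True)
hashed_mapping = reindex_sketch(records)
print(simple_mapping)
print(list(hashed_mapping))
```

Because the hash depends only on the sequence, two records with identical sequences get the same hashed id, which is one reason the mapping back to “original_id” is kept.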