bio_embeddings.extract

Methods for predicting properties of proteins at both the per-residue and the per-protein level, including supervised (pre-trained) and unsupervised (nearest neighbour search) methods.

class bio_embeddings.extract.BasicAnnotationExtractor(model_type: str, device: Union[None, str, torch.device] = None, **kwargs)[source]
get_annotations(raw_embedding: numpy.ndarray) → bio_embeddings.extract.basic.BasicAnnotationExtractor.BasicExtractedAnnotations[source]
get_secondary_structure(raw_embedding: numpy.ndarray) → bio_embeddings.extract.basic.BasicAnnotationExtractor.BasicSecondaryStructureResult[source]
get_subcellular_location(raw_embedding: numpy.ndarray) → bio_embeddings.extract.basic.BasicAnnotationExtractor.BasicSubcellularLocalizationResult[source]
necessary_files = ['secondary_structure_checkpoint_file', 'subcellular_location_checkpoint_file']
bio_embeddings.extract.get_k_nearest_neighbours(pairwise_matrix: numpy.array, k: int = 1) → (typing.List[int], numpy.array)[source]
Parameters
  • pairwise_matrix – an np.array with columns as queries and rows as targets

  • k – the number of nearest neighbours to return

Returns

A list of tuples with the indices of the nearest neighbours and the distances to them (sorted by ascending distance)
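
As an illustration of the nearest-neighbour lookup described above, here is a minimal NumPy sketch. It is not the library's implementation: `k_nearest` is a hypothetical helper, and it assumes one row per query and one column per target, so a matrix laid out with queries as columns would need to be transposed (`matrix.T`) first.

```python
import numpy as np


def k_nearest(distances: np.ndarray, k: int = 1):
    """Hypothetical helper: the k smallest entries of each row, sorted ascending."""
    # Order the column indices of every row by distance and keep the first k
    order = np.argsort(distances, axis=1)[:, :k]
    # Gather the distances that belong to those indices
    nearest = np.take_along_axis(distances, order, axis=1)
    return order, nearest


# Toy matrix (assumed layout): 2 queries (rows) x 3 targets (columns)
toy = np.array([[0.9, 0.1, 0.5],
                [0.2, 0.8, 0.3]])
indices, dists = k_nearest(toy, k=2)
print(indices)  # target indices per query, closest first
print(dists)    # the matching distances
```

A full sort is used here for clarity; for large matrices a partial sort such as `numpy.argpartition` followed by sorting only the k selected entries would likely be cheaper.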

bio_embeddings.extract.pairwise_distance_matrix_from_embeddings_and_annotations(query_embeddings_path: str, reference_embeddings_path: str, metric: str = 'euclidean', n_jobs: int = 1) → bio_embeddings.extract.unsupervised_utilities.PairwiseDistanceMatrixResult[source]
Parameters
  • n_jobs – the number of parallel jobs (an int); see the scikit-learn documentation

  • metric – the distance metric to use, given as a string; see the scikit-learn documentation

  • query_embeddings_path – A string defining a path to an h5 file

  • reference_embeddings_path – A string defining a path to an h5 file

Returns

A tuple containing:

  • pairwise_matrix – the pairwise distances between queries and references

  • queries – a list of strings defining the queries

  • references – a list of strings defining the references
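
Since the `metric` and `n_jobs` parameters defer to scikit-learn, the following sketch shows what such a distance computation might look like with `sklearn.metrics.pairwise_distances`. Whether the library calls this function internally is an assumption. The identifiers and the small in-memory arrays are hypothetical stand-ins for the embeddings that would be read from the two h5 files.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical stand-ins for the identifiers and embeddings stored in the h5 files
queries = ["query_a", "query_b"]
references = ["ref_x", "ref_y", "ref_z"]
query_embeddings = np.array([[0.0, 0.0],
                             [1.0, 1.0]])
reference_embeddings = np.array([[0.0, 3.0],
                                 [4.0, 0.0],
                                 [1.0, 1.0]])

# scikit-learn returns one row per query and one column per reference
pairwise_matrix = pairwise_distances(
    query_embeddings,
    reference_embeddings,
    metric="euclidean",
    n_jobs=1,
)
print(pairwise_matrix.shape)
```

Other metric names accepted by scikit-learn, such as `"cosine"` or `"manhattan"`, can be passed in the same way.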