Add a new language model/embedder
1. Pick a name, which should be the one you're using in the publication, and a lowercase version with underscores (snake_case). E.g. for one hot encoding, we use one_hot_encoding. The class name is the CamelCase version with an Embedder suffix, in this case OneHotEncodingEmbedder. Be consistent about where you place the underscores.
2. Add all new dependencies to pyproject.toml as a new extra.
3. Add an entry to bio_embeddings/utilities/defaults.yml with a link to the weights.
4. Create a new class in bio_embeddings/embed that at least implements EmbedderInterface, or even better (for GPU-based models) EmbedderWithFallback. The simplest example is OneHotEncodingEmbedder; a more realistic example is ProtTransT5Embedder and its subclasses. If you add any new options, add them to KNOWN_EMBED_OPTIONS in bio_embeddings/embed/pipeline.py.
5. Add the class to bio_embeddings/embed/__init__.py.
6. The following two are checked by SKIP_SLOW_TESTS=1 pytest:
   - Add the model size to the docs of bio_embeddings/embed/__init__.py.
   - Add the model to DEFAULT_MAX_AMINO_ACIDS.
7. Add the embedder to the tests, following the instructions in tests/test_embedder_embedding.py.
8. Write a pipeline that uses your embedder and check that it works.
9. Send a pull request 🚀
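
The naming convention from the first step can be sketched as a small helper. This is only an illustration and is not part of bio_embeddings: the function name embedder_names is hypothetical, and it assumes the class name is formed by capitalizing each underscore-separated part of the snake_case name and appending Embedder, as in the one_hot_encoding example.

```python
import re
from typing import Tuple


def embedder_names(snake_name: str) -> Tuple[str, str]:
    """Return (snake_case name, CamelCase class name) for a new embedder.

    Hypothetical helper for illustration only; bio_embeddings does not
    provide it.
    """
    # Accept only lowercase letters and digits separated by single underscores.
    if not re.fullmatch(r"[a-z0-9]+(?:_[a-z0-9]+)*", snake_name):
        raise ValueError(f"not a snake_case name: {snake_name!r}")
    # Capitalize each part and join them, then append the Embedder suffix.
    camel = "".join(part.capitalize() for part in snake_name.split("_"))
    return snake_name, camel + "Embedder"


if __name__ == "__main__":
    print(embedder_names("one_hot_encoding"))
```

Treat the result as a starting point rather than a rule: the helper cannot know about casing inside a part (it would not produce the inner capitals of ProtTransT5Embedder, for example), so adjust the class name by hand where the publication's spelling calls for it.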