Abstract
Genomic foundation models (FMs) have adopted different perspectives to represent nucleotide sequences. Protein FMs relate nucleotide sequence to protein structure, whereas DNA FMs relate it to RNA expression level, chromatin accessibility, and other epigenetic features. The former perspective captures structural/functional properties, whereas the latter captures molecular grammar and regulatory logic. For the hackathon, we obtained and evaluated joint clusters of nucleotide sequence representations produced by FMs that adopt these distinct perspectives.
Results
We started with 20,909 mouse and 19,364 human genes for which cDNA sequences were available in the Ensembl database. The amino acid sequences for these genes were obtained using the `gget` package (Luebbert and Pachter 2023).
Embeddings for the nucleotide and amino acid sequences of each gene were obtained with the Nucleotide Transformer (Dalla-Torre et al. 2024) and ESM3 (Hayes et al. 2025), respectively. We refer to these as NT- and ESM3-embeddings. A 2D UMAP projection of these embeddings is shown in Figure 1. We used the Leiden algorithm to cluster the NT- and ESM3-embeddings individually, without any dimensionality reduction. This defines the 20 `leiden-nt` labels and the 55 `leiden-esm3` labels. This exercise suggested that the NT-embeddings have much less structure than the ESM3-embeddings.
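As an illustration, this clustering step can be sketched with scanpy's neighbor-graph, Leiden, and UMAP implementations. The file name below is hypothetical, and parameters (e.g. the Leiden resolution) would need tuning to reproduce the exact label counts:

```python
import anndata as ad
import numpy as np
import scanpy as sc

# Hypothetical file holding the (n_genes, d) NT embedding matrix.
nt_embeddings = np.load("nt_embeddings.npy")

adata = ad.AnnData(nt_embeddings)
# kNN graph on the full-dimensional embeddings (no PCA or other
# dimensionality reduction), then Leiden clustering and a 2D UMAP.
sc.pp.neighbors(adata, use_rep="X")
sc.tl.leiden(adata, key_added="leiden-nt")
sc.tl.umap(adata)
```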
We also included in Figure 1 the 27 `mmidas-joint` labels obtained with MMIDAS (Marghi et al. 2024). The consensus score quantifies whether a given gene is assigned to the same joint cluster irrespective of whether we use the NT- or ESM3-embeddings. The overall consensus score (over all genes) as a function of the number of joint clusters during the iterative MMIDAS training process is shown in Figure 2 (left).
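MMIDAS defines its own consensus measure during training; the following simplified sketch conveys the idea for hard cluster assignments, where consensus is the fraction of genes that receive the same joint label from either modality's arm:

```python
import numpy as np

def consensus_score(labels_a: np.ndarray, labels_b: np.ndarray) -> float:
    """Fraction of genes assigned to the same joint cluster by the two
    modality arms (a simplified stand-in for the MMIDAS consensus measure)."""
    return float(np.mean(labels_a == labels_b))

# Example: joint-cluster labels inferred from the NT and ESM3 arms.
labels_nt = np.array([0, 1, 1, 2, 2])
labels_esm3 = np.array([0, 1, 2, 2, 2])
print(consensus_score(labels_nt, labels_esm3))  # 0.8
```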
We investigated genes included in MMIDAS joint clusters with `gget` (Luebbert and Pachter 2023). In particular, `gget` provides an interface to Enrichr (Xie et al. 2021); results are based on the `GO Biological Process` category with annotations from the 2021 annotation release.
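A minimal sketch of such a query via `gget.enrichr`, using hypothetical gene symbols and assuming the literal Enrichr library name for the 2021 GO Biological Process release:

```python
import gget

# Hypothetical gene symbols from one joint cluster.
cluster_genes = ["ACTB", "TUBB", "GAPDH"]

# Query Enrichr through gget against the 2021 GO Biological Process release.
results = gget.enrichr(cluster_genes, database="GO_Biological_Process_2021")
print(results.head())
```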
Among the clusters we investigated, we noticed that none of the 215 genes in `mmidas-joint-18` were found through `gget`. The gene names suggested that these genes are all part of the immunoglobulin family. These genes also appear clustered together on both the NT- and ESM3-embedding UMAPs. Moreover, we noticed that such immunoglobulin genes also cluster together in a single `mmidas-joint` grouping of human genes. To investigate this further, we used the PANTHER Overrepresentation Test (Mi et al. 2019) through the GO ontology resource, which uses a more recent version of the GO ontology. This analysis found that genes in this cluster are significantly overrepresented for immunoglobulin-mediated immune response and antigen binding, among several other terms.
Discussion
Our analysis suggests that genes grouped together in one view are not as coherent in another; see the different label sets in Figure 1 and particular examples in Figure 3. As with much of biology, some gene relationships are shared across views, while others are distinct. Nevertheless, the joint clusters we obtain with post-hoc alignment are meaningful. Our analysis captured a set of immunoglobulin genes across species that are annotated only in more recent versions of commonly used gene annotation databases. This approach may therefore offer a way to refine ontologies.
A subset of the inputs that distinct foundation models are trained on are biologically coherent entities, e.g. genes. Our preliminary analysis already captures some relationships across two such models: one trained only on DNA sequences to predict masked tokens, and another trained only on amino acid sequences to predict protein structure. Curating and leveraging such data through analyses of multiple existing foundation models could be used to align representations in new models with various coupling strategies (Gala et al. 2019; Marghi et al. 2024; Radford et al. 2021), working towards truly multimodal biological foundation models.
One hurdle towards this vision is that large transformer-like models assign different meaning and utility to representations extracted from different layers, and for particular input tokens, depending on the training objective. Choosing a single representation per input sequence post hoc, without a well-defined task, can be tricky. Here we followed the examples in the respective repositories of the models we used to obtain a single representation for each gene, which may not be the best approach.
The downstream overrepresentation analyses with various bioinformatics tools should be interpreted with caution. We used them here as exploratory tools to interpret our groupings.
Data
We obtained genome-wide cDNA sequences for human and mouse from Ensembl. Custom scripts and the `gget` package (Luebbert and Pachter 2023) were used to obtain the nucleotide and amino acid sequences corresponding to all available mouse and human genes in the Ensembl database. We ended up with 20,909 mouse and 19,364 human genes at this stage.
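A minimal sketch of the sequence retrieval with `gget.seq`, using two illustrative Ensembl gene IDs; the actual pipeline iterated over all gene IDs with custom scripts:

```python
import gget

# Two illustrative Ensembl gene IDs (mouse Actb and human ACTB).
ens_ids = ["ENSMUSG00000029580", "ENSG00000075624"]

# Nucleotide sequences from Ensembl, returned in FASTA format...
nt_fasta = gget.seq(ens_ids)
# ...and the corresponding amino acid sequences (translate=True).
aa_fasta = gget.seq(ens_ids, translate=True)
```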
Table 1: Gene counts per species at each stage of the embedding pipeline.

| Species | Initial set | Safety filter exceptions | Enabled by workaround | Embeddings available |
|---|---|---|---|---|
| Mouse | 20,909 | 32 | 546 | 20,877 |
| Human | 19,364 | 26 | 596 | 19,338 |
ESM3 has safety checks that prevent it from embedding certain amino acid sequences. We could bypass these filters by masking a variable fraction of amino acids in the sequence. Even with this approach, embeddings for a small fraction of amino acid sequences could not be obtained (Table 1).
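A sketch of the masking workaround, assuming `_` as the mask character accepted in ESM3 sequence prompts; `embed` is a hypothetical wrapper around the inference endpoint:

```python
import random

def mask_fraction(seq: str, frac: float, mask_char: str = "_") -> str:
    """Randomly replace a fraction of residues with the mask character
    (assumed to be "_" in ESM3 sequence prompts)."""
    chars = list(seq)
    for i in random.sample(range(len(chars)), int(len(chars) * frac)):
        chars[i] = mask_char
    return "".join(chars)

# Hypothetical retry loop: increase masking until the safety filter
# accepts the sequence; `embed` wraps the hosted ESM3 endpoint.
# for frac in (0.05, 0.10, 0.20):
#     try:
#         embedding = embed(mask_fraction(aa_seq, frac))
#         break
#     except RuntimeError:  # rejected by the safety filter
#         continue
```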
Models
We used the Nucleotide Transformer (Dalla-Torre et al. 2024) as our DNA FM and ESM3 (Hayes et al. 2025) as our protein FM. Finally, we used MMIDAS (Marghi et al. 2024) to obtain joint embeddings of genes based on their representations in the DNA and protein FMs.
Nucleotide Transformer: This model (Dalla-Torre et al. 2024) is pre-trained to predict masked tokens in DNA sequences. Tokens in the Nucleotide Transformer represent k-mers (k=6). Adding a few extra tokens such as `CLS`, `MASK`, and `PAD` takes the vocabulary size to 4,104. The maximum token length in the models we considered is 1,000, which corresponds to sequences of roughly 5,952 nucleotides. We truncated any input sequence to this maximum (Figure 4).
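As a sketch with the Hugging Face `transformers` tokenizer, truncation to the 1,000-token budget can be delegated to the tokenizer itself; the checkpoint name below is our assumption of the published v2 multi-species model:

```python
from transformers import AutoTokenizer

# Assumed Hugging Face checkpoint for the 500M v2 multi-species model.
tokenizer = AutoTokenizer.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species",
    trust_remote_code=True,
)

sequences = ["ATGGCC" * 2000]  # a toy sequence longer than the budget
# Truncation to the 1,000-token maximum happens at tokenization time.
batch = tokenizer(
    sequences, max_length=1000, truncation=True,
    padding="longest", return_tensors="pt",
)
```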
We obtained embeddings corresponding to the `CLS` token for all genes. There is no single prescription for which layer to extract such embeddings from. We followed examples in the repository and obtained representations from layer 20 of the `500M multi v2` and `500M human` models for the mouse and human nucleotide sequences, respectively. The embeddings have dimension 1,280 for mouse genes and 1,024 for human genes. We refer to these as NT embeddings.
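Continuing the tokenization sketch above, the layer-20 `CLS` embeddings can be read off the hidden states; the indexing convention (hidden state 0 being the embedding layer, `CLS` at position 0) is our assumption:

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species",
    trust_remote_code=True,
)

with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# hidden_states[0] is the token-embedding layer, so index 20 is the
# output of transformer layer 20; position 0 holds the CLS token.
cls_embeddings = out.hidden_states[20][:, 0, :]
```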
ESM3: ESM3 (Hayes et al. 2025) is trained on 2.78B natural protein sequences to predict sequence, structure, and functional aspects using a masked language modeling objective; its largest variant has 98B parameters. We used the publicly released ESM3-open model, which incorporates guardrails that can prevent inference on potentially hazardous sequences (see the model card).
The model exposes per-residue embeddings, which represent each amino acid within the sequence, and a mean embedding, which is the average of all residue embeddings across the entire protein sequence. For both mouse and human sequences, we only used the 1,536-dimensional mean embedding to represent the amino acid sequence corresponding to each gene. We refer to these as ESM3 embeddings.
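The reduction from per-residue to mean embeddings is a simple average over residues; a sketch with placeholder data:

```python
import numpy as np

# Placeholder per-residue embeddings for one protein: (seq_len, 1536).
per_residue = np.zeros((350, 1536), dtype=np.float32)

# The mean embedding averages over residues: one vector per gene.
mean_embedding = per_residue.mean(axis=0)
assert mean_embedding.shape == (1536,)
```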
MMIDAS: Marghi et al. (2024) recently proposed MMIDAS to obtain joint embeddings of multimodal single-cell resolution data. Treating genes as our samples, and the NT- and ESM3-embeddings as 'modalities', we obtained consensus clusters for genes.
MMIDAS sparsifies the discrete representation layer to identify an optimal number of consensus categories across modalities. At the start of training, the network initializes an overparameterized discrete latent space, establishing an upper bound on the number of clusters. The model refines the dimensionality of the discrete latent space (equivalent to the number of categories or clusters) by evaluating each category's contribution based on a consensus measure between the modalities. Categories that do not maintain similar probabilities across modalities are pruned. This iterative process continues until all remaining categories satisfy a predefined minimum consensus threshold.
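The following sketch conveys the pruning idea with a stand-in consensus measure (per-category correlation of membership probabilities across the two arms); it is not the exact MMIDAS criterion:

```python
import numpy as np

def prune_categories(p_nt: np.ndarray, p_esm3: np.ndarray,
                     min_consensus: float = 0.95) -> np.ndarray:
    """p_nt and p_esm3 are (n_genes, n_categories) cluster-membership
    probabilities from the two modality arms. Categories whose usage
    disagrees across the arms fall below threshold and are pruned."""
    consensus = np.array([
        np.corrcoef(p_nt[:, k], p_esm3[:, k])[0, 1]
        for k in range(p_nt.shape[1])
    ])
    return consensus >= min_consensus  # boolean mask of kept categories
```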
Carbon footprint
In relation to the sustainability focus of the hackathon, we calculated a rough estimate of the carbon footprint of the computations we performed.
Nucleotide Transformer: A single `A100` GPU on the local high-performance computing cluster was used to run inference with the Nucleotide Transformer. The largest batch size we could use without running out of memory on this hardware was 20, and inference over the full dataset took around 20 minutes. The total energy consumption for this exercise is estimated at 0.2 kWh.
ESM3: We ran computations on `ml.g5.2xlarge` instances via Amazon SageMaker, using an endpoint for ESM3 hosted on the AWS Marketplace. Generating embeddings for all genes of a single species took about 10 hours. The total energy consumption on `ml.g5.2xlarge` was about 6.0 kWh.
MMIDAS: For training MMIDAS, we used the local high-performance computing (HPC) cluster, with one `Tesla V100 SXM2` GPU with a 0.3 kW power rating. Training runs took 23 hours each (one for mouse, one for human), resulting in a total energy consumption of about 14.0 kWh.
Evo2: We aspired to use a more recent DNA foundation model, Evo2 (Nguyen et al. 2024), for our analysis. The minimum requirements for this model list 2 x `H100` or `H200` GPUs. We used 2 hours of an `ml.p5.48xlarge` instance to attempt an install of Evo2, eventually abandoned because of the high cost and multiple independently reported issues in the code base. Using the 700 W maximum power rating per `H100` on this instance leads to an estimate of 11.2 kWh.
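These figures follow the same back-of-the-envelope arithmetic, kWh = number of GPUs × power per GPU (kW) × hours. The 0.3 kW figure for `ml.g5.2xlarge` (one A10G GPU) is our assumption, chosen to be consistent with the reported 6.0 kWh; the MMIDAS total matches the reported 14.0 kWh up to rounding:

```python
# kWh = number of GPUs x power per GPU (kW) x hours
runs = {
    "ESM3 on ml.g5.2xlarge (2 species x 10 h)": (1, 0.3, 20),  # 0.3 kW assumed
    "MMIDAS on V100 (2 runs x 23 h)":           (1, 0.3, 46),
    "Evo2 attempt on ml.p5.48xlarge (8 H100s)": (8, 0.7, 2),
}
for name, (n_gpus, kw, hours) in runs.items():
    print(f"{name}: {n_gpus * kw * hours:.1f} kWh")
# ESM3 on ml.g5.2xlarge (2 species x 10 h): 6.0 kWh
# MMIDAS on V100 (2 runs x 23 h): 13.8 kWh
# Evo2 attempt on ml.p5.48xlarge (8 H100s): 11.2 kWh
```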
All embeddings were saved on shared storage to prevent duplicate computations by team members. Future experiments could incorporate tools like CodeCarbon to more reliably track the carbon footprint of experiments run across computing environments and devices.
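A minimal sketch of how CodeCarbon could wrap such a run; `run_inference` is a placeholder for the actual embedding computation:

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="genomic-fm-embeddings")
tracker.start()
run_inference()  # placeholder for the actual embedding computation
emissions_kg = tracker.stop()  # returns kg CO2-equivalent
print(f"Estimated emissions: {emissions_kg:.3f} kg CO2eq")
```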
Code
See the biomolvec and nautilex-esm repositories for related notebooks and scripts.