Abstract
Genomic foundation models (FMs) have adopted different perspectives to represent nucleotide sequences. Protein FMs relate nucleotide sequence to protein structure, whereas DNA FMs relate it to RNA expression level, chromatin accessibility, and other epigenetic features. The former perspective captures structural/functional properties, whereas the latter captures molecular grammar and regulatory logic. For the hackathon, we obtained and evaluated joint clusters of nucleotide sequence representations produced by FMs that adopt these distinct perspectives.
Results
We started with 20,909 mouse and 19,364 human genes for which cDNA sequences were available in the Ensembl database. The amino acid sequences for these genes were obtained using the `gget` package (Luebbert and Pachter 2023).
Embeddings for the nucleotide and amino acid sequences of each gene were obtained with the Nucleotide Transformer (Dalla-Torre et al. 2024) and ESM3 (Hayes et al. 2025), respectively. We refer to these as NT- and ESM3-embeddings. A 2D UMAP projection of these embeddings is shown in Figure 1. We used the Leiden algorithm to cluster the NT- and ESM3-embeddings individually, without any dimensionality reduction. This defines the 20 `leiden-nt` labels and the 55 `leiden-esm3` labels. This exercise suggested that the NT-embeddings have much less structure than the ESM3-embeddings.
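As an illustration, this clustering step can be sketched with scanpy's neighbor-graph, Leiden, and UMAP implementations. The file name below is hypothetical, and parameters (e.g. the Leiden resolution) would need tuning to reproduce the exact label counts:

```python
import anndata as ad
import numpy as np
import scanpy as sc

# Hypothetical file holding the (n_genes, d) NT embedding matrix.
nt_embeddings = np.load("nt_embeddings.npy")

adata = ad.AnnData(nt_embeddings)
# kNN graph on the full-dimensional embeddings (no PCA or other
# dimensionality reduction), then Leiden clustering and a 2D UMAP.
sc.pp.neighbors(adata, use_rep="X")
sc.tl.leiden(adata, key_added="leiden-nt")
sc.tl.umap(adata)
```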
We also included in Figure 1 the 27 `mmidas-joint` labels obtained with MMIDAS (Marghi et al. 2024). The consensus score quantifies whether a given gene is assigned to the same joint cluster irrespective of whether we use the NT- or ESM3-embeddings. The overall consensus score (over all genes) as a function of the number of joint clusters during the iterative MMIDAS training process is shown in Figure 2 (left).
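MMIDAS defines its own consensus measure during training; the following simplified sketch conveys the idea for hard cluster assignments, where consensus is the fraction of genes that receive the same joint label from either modality's arm:

```python
import numpy as np

def consensus_score(labels_a: np.ndarray, labels_b: np.ndarray) -> float:
    """Fraction of genes assigned to the same joint cluster by the two
    modality arms (a simplified stand-in for the MMIDAS consensus measure)."""
    return float(np.mean(labels_a == labels_b))

# Example: joint-cluster labels inferred from the NT and ESM3 arms.
labels_nt = np.array([0, 1, 1, 2, 2])
labels_esm3 = np.array([0, 1, 2, 2, 2])
print(consensus_score(labels_nt, labels_esm3))  # 0.8
```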
We investigated genes included in MMIDAS joint clusters with `gget` (Luebbert and Pachter 2023). In particular, `gget` provides an interface to Enrichr (Xie et al. 2021); results are based on the `GO Biological Process` category with annotations from the 2021 annotation release.
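A minimal sketch of such a query via `gget.enrichr`, using hypothetical gene symbols and assuming the literal Enrichr library name for the 2021 GO Biological Process release:

```python
import gget

# Hypothetical gene symbols from one joint cluster.
cluster_genes = ["ACTB", "TUBB", "GAPDH"]

# Query Enrichr through gget against the 2021 GO Biological Process release.
results = gget.enrichr(cluster_genes, database="GO_Biological_Process_2021")
print(results.head())
```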
Among the clusters we investigated, we noticed that none of the 215 genes in `mmidas-joint-18` were found through `gget`. The gene names suggested that these genes are all part of the immunoglobulin family. These genes also appear clustered together on both the NT- and ESM3-embedding UMAPs. Moreover, we noticed that such immunoglobulin genes also cluster together in a single `mmidas-joint` grouping of human genes. To investigate this further, we used the PANTHER Overrepresentation Test (Mi et al. 2019) through the GO ontology resource, which uses a more recent version of the GO ontology. This analysis found that genes in this cluster are significantly overrepresented for immunoglobulin-mediated immune response and antigen binding, among several other terms.
Discussion
Our analysis suggests that genes grouped together in one view are not as coherent in another; see the different label sets in Figure 1 and particular examples in Figure 3. As with much of biology, some gene relationships are shared across views, while others are distinct. Nevertheless, the joint clusters we obtain with post-hoc alignment are meaningful. Our analysis captured a set of immunoglobulin genes across species that are annotated only in more recent versions of commonly used gene annotation databases. This approach may therefore offer a way to refine ontologies.
A subset of the inputs that distinct foundation models are trained on are biologically coherent entities, e.g. genes. Our preliminary analysis already captures some relationships across two such models: one trained only on DNA sequences to predict masked tokens, and another trained only on amino acid sequences to predict protein structure. Curating and leveraging such data through analyses of multiple existing foundation models could be used to align representations in new models with various coupling strategies (Gala et al. 2019; Marghi et al. 2024; Radford et al. 2021), working towards truly multimodal biological foundation models.
One hurdle towards this vision is that large transformer-like models assign different meaning and utility to representations extracted from different layers, and for particular input tokens, depending on the training objective. Choosing a single representation per input sequence post hoc, without a well-defined task, can be tricky. Here we followed the examples in the respective repositories of the models we used to obtain a single representation for each gene, which may not be the best approach.
The downstream overrepresentation analyses with various bioinformatics tools should be interpreted with caution. We used them here as exploratory tools to interpret our groupings.
Data
We obtained genome-wide cDNA sequences for human and mouse from Ensembl. Custom scripts and the `gget` package (Luebbert and Pachter 2023) were used to obtain the nucleotide and amino acid sequences corresponding to all available mouse and human genes in the Ensembl database. We ended up with 20,909 mouse and 19,364 human genes at this stage.
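A minimal sketch of the sequence retrieval with `gget.seq`, using two illustrative Ensembl gene IDs; the actual pipeline iterated over all gene IDs with custom scripts:

```python
import gget

# Two illustrative Ensembl gene IDs (mouse Actb and human ACTB).
ens_ids = ["ENSMUSG00000029580", "ENSG00000075624"]

# Nucleotide sequences from Ensembl, returned in FASTA format...
nt_fasta = gget.seq(ens_ids)
# ...and the corresponding amino acid sequences (translate=True).
aa_fasta = gget.seq(ens_ids, translate=True)
```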
Table 1: Gene counts per species at each stage of the embedding pipeline.

| Species | Initial set | Safety filter exceptions | Enabled by workaround | Embeddings available |
|---|---|---|---|---|
| Mouse | 20,909 | 32 | 546 | 20,877 |
| Human | 19,364 | 26 | 596 | 19,338 |
ESM3 has safety checks that prevent it from embedding certain amino acid sequences. We could bypass these filters by masking a variable fraction of amino acids in the sequence. Even with this approach, embeddings for a small fraction of amino acid sequences could not be obtained (Table 1).
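A sketch of the masking workaround, assuming `_` as the mask character accepted in ESM3 sequence prompts; `embed` is a hypothetical wrapper around the inference endpoint:

```python
import random

def mask_fraction(seq: str, frac: float, mask_char: str = "_") -> str:
    """Randomly replace a fraction of residues with the mask character
    (assumed to be "_" in ESM3 sequence prompts)."""
    chars = list(seq)
    for i in random.sample(range(len(chars)), int(len(chars) * frac)):
        chars[i] = mask_char
    return "".join(chars)

# Hypothetical retry loop: increase masking until the safety filter
# accepts the sequence; `embed` wraps the hosted ESM3 endpoint.
# for frac in (0.05, 0.10, 0.20):
#     try:
#         embedding = embed(mask_fraction(aa_seq, frac))
#         break
#     except RuntimeError:  # rejected by the safety filter
#         continue
```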
Models
We used the Nucleotide Transformer (Dalla-Torre et al. 2024) as our DNA FM and ESM3 (Hayes et al. 2025) as our protein FM. Finally, we used MMIDAS (Marghi et al. 2024) to obtain joint embeddings of genes based on their representations in the DNA and protein FMs.
Nucleotide Transformer: This model (Dalla-Torre et al. 2024) is pre-trained to predict masked tokens in DNA sequences. Tokens in the Nucleotide Transformer represent k-mers (k=6). Adding a few extra tokens such as `CLS`, `MASK`, and `PAD` takes the vocabulary size to 4,104. The maximum token length in the models we considered is 1,000, which corresponds to sequences of roughly 5,952 nucleotides. We truncated any input sequence to this maximum (Figure 4).
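As a sketch with the Hugging Face `transformers` tokenizer, truncation to the 1,000-token budget can be delegated to the tokenizer itself; the checkpoint name below is our assumption of the published v2 multi-species model:

```python
from transformers import AutoTokenizer

# Assumed Hugging Face checkpoint for the 500M v2 multi-species model.
tokenizer = AutoTokenizer.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species",
    trust_remote_code=True,
)

sequences = ["ATGGCC" * 2000]  # a toy sequence longer than the budget
# Truncation to the 1,000-token maximum happens at tokenization time.
batch = tokenizer(
    sequences, max_length=1000, truncation=True,
    padding="longest", return_tensors="pt",
)
```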
We obtained embeddings corresponding to the `CLS` token for all genes. There is no single prescription for which layer to extract such embeddings from. We followed examples in the repository and obtained representations from layer 20 of the `500M multi v2` and `500M human` models for the mouse and human nucleotide sequences, respectively. The embeddings have dimension 1,280 for mouse genes and 1,024 for human genes. We refer to these as NT embeddings.
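Continuing the tokenization sketch above, the layer-20 `CLS` embeddings can be read off the hidden states; the indexing convention (hidden state 0 being the embedding layer, `CLS` at position 0) is our assumption:

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species",
    trust_remote_code=True,
)

with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# hidden_states[0] is the token-embedding layer, so index 20 is the
# output of transformer layer 20; position 0 holds the CLS token.
cls_embeddings = out.hidden_states[20][:, 0, :]
```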
ESM3: ESM3 (Hayes et al. 2025) is trained on 2.78B natural protein sequences to predict sequence, structure, and functional aspects using a masked language modeling objective; its largest variant has 98B parameters. We used the publicly released ESM3-open model, which incorporates guardrails that can prevent inference on potentially hazardous sequences (see the model card).
The model exposes per-residue embeddings, which represent each amino acid within the sequence, and a mean embedding, which is the average of all residue embeddings across the entire protein sequence. For both mouse and human sequences, we only used the 1,536-dimensional mean embedding to represent the amino acid sequence corresponding to each gene. We refer to these as ESM3 embeddings.
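The reduction from per-residue to mean embeddings is a simple average over residues; a sketch with placeholder data:

```python
import numpy as np

# Placeholder per-residue embeddings for one protein: (seq_len, 1536).
per_residue = np.zeros((350, 1536), dtype=np.float32)

# The mean embedding averages over residues: one vector per gene.
mean_embedding = per_residue.mean(axis=0)
assert mean_embedding.shape == (1536,)
```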
MMIDAS: Marghi et al. (2024) recently proposed MMIDAS to obtain joint embeddings of multimodal single-cell resolution data. Treating genes as our samples, and the NT- and ESM3-embeddings as 'modalities', we obtained consensus clusters for genes.
MMIDAS sparsifies the discrete representation layer to identify an optimal number of consensus categories across modalities. At the start of training, the network initializes an overparameterized discrete latent space, establishing an upper bound on the number of clusters. The model refines the dimensionality of the discrete latent space (equivalent to the number of categories or clusters) by evaluating each category's contribution based on a consensus measure between the modalities. Categories that do not maintain similar probabilities across modalities are pruned. This iterative process continues until all remaining categories satisfy a predefined minimum consensus threshold.
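The following sketch conveys the pruning idea with a stand-in consensus measure (per-category correlation of membership probabilities across the two arms); it is not the exact MMIDAS criterion:

```python
import numpy as np

def prune_categories(p_nt: np.ndarray, p_esm3: np.ndarray,
                     min_consensus: float = 0.95) -> np.ndarray:
    """p_nt and p_esm3 are (n_genes, n_categories) cluster-membership
    probabilities from the two modality arms. Categories whose usage
    disagrees across the arms fall below threshold and are pruned."""
    consensus = np.array([
        np.corrcoef(p_nt[:, k], p_esm3[:, k])[0, 1]
        for k in range(p_nt.shape[1])
    ])
    return consensus >= min_consensus  # boolean mask of kept categories
```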
Carbon footprint
In relation to the sustainability focus of the hackathon, we calculated a rough estimate of the carbon footprint of the computations we performed.
Nucleotide Transformer: A single `A100` GPU on the local high-performance computing cluster was used to run inference with the Nucleotide Transformer. The largest batch size we could use without running out of memory on this hardware was 20, and inference over the full dataset took around 20 minutes. The total energy consumption for this exercise is estimated at 0.2 kWh.
ESM3: We ran computations on `ml.g5.2xlarge` instances via Amazon SageMaker, using an endpoint for ESM3 hosted on the AWS Marketplace. Generating embeddings for all genes of a single species took about 10 hours. The total energy consumption on `ml.g5.2xlarge` was about 6.0 kWh.
MMIDAS: For training MMIDAS, we used the local high-performance computing (HPC) cluster, with one `Tesla V100 SXM2` GPU with a 0.3 kW power rating. Training runs took 23 hours each (one for mouse, one for human), resulting in a total energy consumption of about 14.0 kWh.
Evo2: We aspired to use a more recent DNA foundation model, Evo2 (Nguyen et al. 2024), for our analysis. The minimum requirements for this model list 2 x `H100` or `H200` GPUs. We used 2 hours of an `ml.p5.48xlarge` instance to attempt an install of Evo2, eventually abandoned because of the high cost and multiple independently reported issues in the code base. Using the 700 W maximum power rating per `H100` on this instance leads to an estimate of 11.2 kWh.
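These figures follow the same back-of-the-envelope arithmetic, kWh = number of GPUs × power per GPU (kW) × hours. The 0.3 kW figure for `ml.g5.2xlarge` (one A10G GPU) is our assumption, chosen to be consistent with the reported 6.0 kWh; the MMIDAS total matches the reported 14.0 kWh up to rounding:

```python
# kWh = number of GPUs x power per GPU (kW) x hours
runs = {
    "ESM3 on ml.g5.2xlarge (2 species x 10 h)": (1, 0.3, 20),  # 0.3 kW assumed
    "MMIDAS on V100 (2 runs x 23 h)":           (1, 0.3, 46),
    "Evo2 attempt on ml.p5.48xlarge (8 H100s)": (8, 0.7, 2),
}
for name, (n_gpus, kw, hours) in runs.items():
    print(f"{name}: {n_gpus * kw * hours:.1f} kWh")
# ESM3 on ml.g5.2xlarge (2 species x 10 h): 6.0 kWh
# MMIDAS on V100 (2 runs x 23 h): 13.8 kWh
# Evo2 attempt on ml.p5.48xlarge (8 H100s): 11.2 kWh
```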
All embeddings were saved on shared storage to prevent duplicate computations by team members. Future experiments could incorporate tools like CodeCarbon to more reliably track the carbon footprint of experiments run across computing environments and devices.
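A minimal sketch of how CodeCarbon could wrap such a run; `run_inference` is a placeholder for the actual embedding computation:

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="genomic-fm-embeddings")
tracker.start()
run_inference()  # placeholder for the actual embedding computation
emissions_kg = tracker.stop()  # returns kg CO2-equivalent
print(f"Estimated emissions: {emissions_kg:.3f} kg CO2eq")
```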
Code
See the biomolvec and nautilex-esm repositories for related notebooks and scripts.