Open-Source AI Model Trained on Trillions of DNA Bases

Summary
– The original Evo AI system was trained on bacterial genomes, which allowed it to predict or suggest novel proteins due to bacteria’s clustered gene organization.
– A key limitation was that this approach might not work for more complex genomes found in eukaryotes, where genes are not simply clustered.
– The team developed Evo 2, an open-source AI trained on genomes from all three domains of life, including complex eukaryotes.
– Evo 2 learned to identify intricate features in complex genomes, such as regulatory DNA and splice sites, after training on trillions of DNA base pairs.
– Eukaryotic genomes are complex, with interrupted genes, scattered regulatory sequences, and large amounts of non-coding “junk” DNA, unlike the straightforward organization of bacterial genomes.

Evo 2, an open-source AI model trained on trillions of DNA bases, extends its predecessor to genomes from all three domains of life, from simple bacteria to complex organisms like humans. The original Evo model, trained solely on bacterial genomes, could predict or suggest novel proteins, helped by the way bacteria cluster functionally related genes together. That reliance on bacterial organization was also its limit: the same approach might not carry over to genomes where genes are not simply clustered. Evo 2 takes on the far more intricate genomes of eukaryotes, which include plants, animals, and fungi.
Bacterial genomes are organized simply. Their genes are continuous, uninterrupted stretches of code, and functionally related genes are often grouped together in clusters. This architecture allows for efficient regulation, and it made bacterial genomes an ideal, structured dataset for the first-generation model. Eukaryotic genomes present a much harder puzzle. Here, the protein-coding sections of genes are interrupted by non-coding introns. The regulatory elements that control when a gene is activated can be scattered across vast distances of DNA. Even the sequences that define critical features, such as where an intron begins or where a regulatory protein binds, are often weakly defined patterns rather than strict codes. All of this is surrounded by large amounts of so-called “junk DNA” left by ancient viral insertions and broken genes.
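
To make the "weakly defined patterns" point concrete, here is a toy sketch in Python. It is not how Evo 2 works and it uses no real genomic data: the sequence and the helper names (`find_all`, `candidate_introns`, `splice_out`) are invented for illustration. It relies only on the textbook observation that most introns begin with GT and end with AG, and shows that searching for those two-letter signals alone yields more candidate introns than actually exist.

```python
# Toy illustration only: an invented 24-base "gene" with one intron.
# Assumption: the canonical GT...AG intron boundary rule; the sequence is made up.
EXON_1 = "ATGGCC"
INTRON = "GTAAGTTTTCAG"   # starts with GT, ends with AG
EXON_2 = "TTTTAA"
PRE_MRNA = EXON_1 + INTRON + EXON_2


def find_all(seq: str, motif: str) -> list[int]:
    """Return every start index at which motif occurs in seq."""
    return [i for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)]


def candidate_introns(seq: str) -> list[tuple[int, int]]:
    """Return half-open (start, end) spans that begin with GT and end with AG."""
    donors = find_all(seq, "GT")
    acceptors = find_all(seq, "AG")
    # The AG must start at or after the end of the GT it is paired with.
    return [(d, a + 2) for d in donors for a in acceptors if a >= d + 2]


def splice_out(seq: str, start: int, end: int) -> str:
    """Remove seq[start:end] and join the flanking pieces."""
    return seq[:start] + seq[end:]


if __name__ == "__main__":
    true_span = (len(EXON_1), len(EXON_1) + len(INTRON))
    print("GT sites:", find_all(PRE_MRNA, "GT"))
    print("AG sites:", find_all(PRE_MRNA, "AG"))
    print("candidate introns:", candidate_introns(PRE_MRNA))
    print("true intron:", true_span)
    print("spliced:", splice_out(PRE_MRNA, *true_span))
```

Even in this tiny example the two-letter signals match in several places, and only one pairing is the real intron. In a genome millions or billions of bases long, that ambiguity is why boundary signals alone are not enough and surrounding context matters.
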
The team behind Evo viewed this complexity not as a barrier but as the next frontier. Trained on this colossal and messy dataset spanning bacteria, archaea, and eukaryotes, Evo 2 has learned to recognize subtle signatures within even complicated genomes. The model has developed internal representations of key biological features that are notoriously hard for researchers to identify, including regulatory DNA and splice sites. The result suggests that machine learning may be able to uncover organizing principles in genomes where traditional analysis struggles, which could open new pathways for understanding genetics, evolution, and disease.
(Source: Ars Technica)




