HeadlinesBriefing.com

Protein Sequence Diversity Doesn't Equal Fold Innovation in AI Drug Design

Hacker News

Deep learning has revolutionized biomolecular modeling, with tools like DeepMind's AlphaFold3 achieving remarkable success in predicting protein interactions and enabling drug design. Models such as Chai-2, Latent-X2, and Nabla now produce developable antibody designs, suggesting AI-designed therapeutics may soon dominate clinical pipelines. The standard recipe for improvement involves scaling model size, compute, and data.

However, researchers at Ligo discovered a significant bottleneck when scaling structural training data. While genomics and metagenomics provide billions of protein sequences through resources like MGnify, most natural proteins occupy only a tiny fraction of possible sequence space. Evolution repeatedly reuses stable, adaptable folds rather than exploring novel shapes, creating redundancy that undermines data scaling efforts.

The team found that proteins can share the same fold despite having only 23.9-28.3% sequence identity. When they clustered the AlphaFold Database using Foldseek, they identified 2.3 million non-singleton structural clusters, but they believe the true number of reusable structural neighborhoods is closer to 25,000. If that estimate holds, predicted-structure clustering overcounts meaningful diversity by roughly two orders of magnitude.
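To make the two numbers in this paragraph concrete, here is a minimal Python sketch. It is not the Ligo team's method. The helper names `percent_identity` and `overcount_factor` and the toy alignment are hypothetical, and "identity" here means matches over aligned columns with no gap in either sequence, which is only one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: both inputs are rows of the same pairwise alignment,
    with '-' marking gaps. Other identity definitions use a different
    denominator (e.g. the shorter sequence length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def overcount_factor(reported_clusters: int, estimated_neighborhoods: int) -> float:
    """Ratio of reported clusters to the estimated number of distinct neighborhoods."""
    return reported_clusters / estimated_neighborhoods


# Hypothetical toy alignment, not real protein data.
toy_a = "MKT-AYIAKQR"
toy_b = "MRTLAHVS-QK"
print(f"identity: {percent_identity(toy_a, toy_b):.1f}%")

# Figures quoted in the article: 2.3 million clusters vs. about 25,000 neighborhoods.
print(f"overcount: {overcount_factor(2_300_000, 25_000):.0f}x")  # prints "overcount: 92x"
```

On the article's own figures, the clustering reports about 92 clusters for every structural neighborhood the team believes is actually distinct.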

For enzyme design and biomolecular modeling, this redundancy means that folding more natural sequences may not yield a proportional amount of new structural information. The findings imply that simply scaling sequence data won't solve the protein structure prediction problem without addressing the underlying fold repetition.