
- Introduce CB-pLM (Concept Bottleneck Protein Language Models), a family of models ranging from 24M to 3B parameters, trained on UniRef50 and SwissProt with 718 concepts (e.g., cluster name, biological process, and Biopython-derived features)
- Add a Concept Bottleneck Module (operating on the `<cls>` token) and an Orthogonality Network to a standard BERT-like architecture
- Train the model with an MLM Loss, a Concept Loss (mean squared error on the concept embedding), and an Orthogonality Loss (cosine similarity between known-concept and unknown-concept embeddings)
- Use a gradient-based approximation to identify sequence edits that increase or decrease specific concept values (e.g., which amino-acid substitutions increase aromaticity)
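As a concrete example of a Biopython-derived concept: aromaticity is simply the relative frequency of the aromatic residues Phe (F), Trp (W), and Tyr (Y). A minimal pure-Python sketch that mirrors what Biopython's `ProteinAnalysis.aromaticity()` computes:

```python
def aromaticity(seq: str) -> float:
    """Fraction of aromatic residues (F, W, Y) in a protein sequence.

    Mirrors Biopython's ProteinAnalysis.aromaticity(); shown here as one
    example of the scalar concepts used to supervise the bottleneck.
    """
    seq = seq.upper()
    return sum(seq.count(aa) for aa in "FWY") / len(seq)

print(aromaticity("MFWYAAAAAA"))  # 3 aromatic residues out of 10 -> 0.3
```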
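The Concept Loss and Orthogonality Loss above can be sketched in plain Python; the embedding shapes and the way the three losses are weighted into a total objective are assumptions here, not details from the paper:

```python
import math

def mse_concept_loss(pred_concepts, true_concepts):
    # Concept Loss: mean squared error between predicted and
    # annotated concept values for one protein.
    return sum((p - t) ** 2 for p, t in zip(pred_concepts, true_concepts)) / len(pred_concepts)

def orthogonality_loss(known_emb, unknown_emb, eps=1e-8):
    # Orthogonality Loss: absolute cosine similarity between the
    # known-concept embedding and the unknown (residual) embedding,
    # driven toward zero so the two subspaces stay disentangled.
    dot = sum(k * u for k, u in zip(known_emb, unknown_emb))
    norm = math.sqrt(sum(k * k for k in known_emb)) * math.sqrt(sum(u * u for u in unknown_emb))
    return abs(dot) / (norm + eps)

# total = mlm_loss + a * concept_loss + b * orthogonality_loss
# (the weights a, b are an assumption for illustration)
```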
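The gradient-based intervention can be illustrated with a toy stand-in: for a linear concept head over one-hot inputs, the gradient at each position is just the weight vector, so candidate substitutions can be ranked by their first-order predicted effect on the concept. The linear head below (weights 1 for aromatic residues, approximating an aromaticity concept) is a hypothetical example, not the paper's trained model:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy linear concept head: per-residue weight, crudely approximating
# an aromaticity concept (assumption for illustration only).
W = {aa: (1.0 if aa in "FWY" else 0.0) for aa in AMINO_ACIDS}

def best_substitution(seq):
    """Rank single-residue edits by gradient-predicted concept increase.

    For a linear head c(x) = sum_i W[x_i], the gradient w.r.t. the one-hot
    input at position i is the weight vector itself, so the predicted gain
    of replacing residue `old` with `new` is W[new] - W[old].
    """
    best = None
    for i, old in enumerate(seq):
        for new in AMINO_ACIDS:
            gain = W[new] - W[old]  # first-order predicted concept change
            if best is None or gain > best[0]:
                best = (gain, i, new)
    return best  # (predicted gain, position, replacement amino acid)

print(best_substitution("MKAV"))  # -> (1.0, 0, 'F')
```

With a real model the same idea applies, except the per-position gradients come from backpropagating the concept head's output to the input embeddings.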