
- Train DPLM with LoRA on sequence and structure tokens (PDB + AFDB-SwissProt, 200K structures); a minimal LoRA sketch follows this list
- Structure tokenizer uses a lookup-free quantizer (LFQ, as in MAGVIT-v2): a GVP-Transformer encoder and an IPA/EvoFormer decoder trained with FAPE, violation, and distogram losses (LFQ sketch below)
- Use self-mixup during training: an additional forward pass produces predictions, and the model then denoises from its own predictions rather than the original corrupted input (training-step sketch below). Sequence and structure use different noise schedules
- Evaluate on structure folding, inverse folding, sequence design, motif scaffolding, sequence-structure co-design, and representation learning (thermostability, GO, DeepLoc, etc.)
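
A minimal sketch of LoRA fine-tuning for adapting the pretrained model to the joint sequence/structure token stream; the rank, scaling factor, and which modules get wrapped are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T
```

Only the low-rank A/B matrices (plus any newly added structure-token embeddings) would be trained, keeping adaptation cheap relative to full fine-tuning.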
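A sketch of the lookup-free quantization idea from MAGVIT-v2 that the structure tokenizer builds on: each latent dimension is binarized to ±1, so the token index is simply the resulting bit pattern and no codebook lookup is needed. The latent width (and hence codebook size) is an assumption.

```python
import torch
import torch.nn as nn

class LookupFreeQuantizer(nn.Module):
    """Quantize each latent dim to {-1, +1}; a d-dim latent yields a 2^d implicit codebook."""
    def __init__(self, dim: int = 13):  # 2^13 = 8192 structure tokens (assumed codebook size)
        super().__init__()
        self.dim = dim
        self.register_buffer("bit_weights", 2 ** torch.arange(dim))

    def forward(self, z):
        # z: (..., dim) continuous encoder output
        q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))  # per-dimension binarization
        q = z + (q - z).detach()                    # straight-through estimator for gradients
        bits = (q > 0).long()                       # map {-1, +1} -> {0, 1}
        indices = (bits * self.bit_weights).sum(-1) # token id = integer value of the bit string
        return q, indices
```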
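A sketch of a self-mixup training step as described above, assuming a masked discrete-diffusion setup: the model first predicts the corrupted tokens, its own predictions are substituted back in, and the loss comes from denoising that self-generated input. The noising helpers, mixing rule, and loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def self_mixup_step(model, seq_tokens, struct_tokens, noise_seq, noise_struct, mask_id):
    """One training step: corrupt both modalities (separate schedules), predict once,
    then denoise from the model's own predictions. noise_seq / noise_struct are
    hypothetical helpers returning (noisy_tokens, corruption_mask)."""
    noisy_seq, seq_mask = noise_seq(seq_tokens, mask_id)
    noisy_struct, struct_mask = noise_struct(struct_tokens, mask_id)

    # First pass: predict tokens at the corrupted positions (no gradient).
    with torch.no_grad():
        logits_seq, logits_struct = model(noisy_seq, noisy_struct)
        pred_seq = logits_seq.argmax(-1)
        pred_struct = logits_struct.argmax(-1)

    # Self-mixup: replace the corrupted positions with the model's own predictions.
    mixed_seq = torch.where(seq_mask, pred_seq, seq_tokens)
    mixed_struct = torch.where(struct_mask, pred_struct, struct_tokens)

    # Second pass: denoise from the mixed inputs and score against ground truth.
    logits_seq, logits_struct = model(mixed_seq, mixed_struct)
    loss = (
        F.cross_entropy(logits_seq[seq_mask], seq_tokens[seq_mask])
        + F.cross_entropy(logits_struct[struct_mask], struct_tokens[struct_mask])
    )
    return loss
```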