
- Raygun (~700M parameters), an ESM2+VAE-based model, generates diverse variants through deletions, insertions, and substitutions
- Encoder maps protein sequences into a fixed-size multivariate normal distribution for efficient sampling
- Use a Reduction Layer to produce the mean and standard deviation matrices and a Repetition Layer to restore the sequence length. In T-Block Layers, use an ESM Transformer and 1D convolution to capture global and local properties, respectively
- Apply Reconstruction Loss, Cross-Entropy Loss, and Replication Loss (L2 loss between the original protein embeddings and the re-encoded embeddings of newly generated proteins)
- Train on a subset of UniRef50, with sequence lengths from 100 to 1000 amino acids divided into 19 bins (80k sequences for training and 14k for validation)
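
The fixed-size encoding and length restoration described in the bullets can be illustrated with a small sketch. This is not Raygun's actual implementation. It assumes that reduction means splitting a length-L embedding into K contiguous chunks and taking per-chunk means and standard deviations, that sampling uses the usual VAE reparameterization, and that repetition copies each latent row back out to a target length. The function names (`reduce_embeddings`, `sample_latent`, `repeat_to_length`) and the toy sizes are hypothetical, and the T-Block layers and losses are omitted.

```python
import random
import statistics


def chunk_bounds(length, k):
    """Split range(length) into k contiguous, near-equal chunks."""
    assert length >= k > 0, "need at least one position per chunk"
    return [(i * length // k, (i + 1) * length // k) for i in range(k)]


def reduce_embeddings(emb, k):
    """Reduce an L x D embedding to K x D mean and std matrices (assumed scheme)."""
    means, stds = [], []
    for start, end in chunk_bounds(len(emb), k):
        cols = list(zip(*emb[start:end]))  # one tuple per embedding dimension
        means.append([statistics.fmean(c) for c in cols])
        stds.append([statistics.pstdev(c) for c in cols])
    return means, stds


def sample_latent(means, stds, rng):
    """Reparameterized draw: z = mean + std * eps, with eps ~ N(0, 1)."""
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean_row, std_row)]
        for mean_row, std_row in zip(means, stds)
    ]


def repeat_to_length(latent, target_len):
    """Expand a K x D latent to target_len x D by repeating each row."""
    out = []
    for (start, end), row in zip(chunk_bounds(target_len, len(latent)), latent):
        out.extend(list(row) for _ in range(end - start))
    return out


# Toy example: a random 12 x 3 "embedding" reduced to 4 latent rows.
rng = random.Random(0)
emb = [[rng.random() for _ in range(3)] for _ in range(12)]
means, stds = reduce_embeddings(emb, 4)
z = sample_latent(means, stds, rng)
restored = repeat_to_length(z, 15)  # target length may differ from the input
print(len(means), len(means[0]), len(restored))  # 4 3 15
```

Because the latent has the same K x D shape regardless of input length, the target length passed to `repeat_to_length` is a free choice. Under the assumptions above, picking a value other than the original length is one way such a design could produce insertions or deletions.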