Forge is a multi-agent swarm system designed for automated GPU kernel optimization. It transforms PyTorch models, KernelBench tasks, or HuggingFace model IDs into highly optimized CUDA/Triton kernels through an intelligent evolutionary search process.
Key Features
- Swarm Architecture: 32 parallel Coder+Judge agent pairs that compete to generate optimal kernel implementations
- Evolutionary Optimization: Uses MAP-Elites archive with island model for quality-diversity search across 36 behavior cells
- Pattern RAG System: Retrieves optimization patterns from CUTLASS (1,711 templates), Triton (113 templates), and GPU specifications
- High Performance: Achieves up to 5× speedup over torch.compile(mode='max-autotune') with a 97.6% correctness rate
- Multiple Input Support: Works with PyTorch nn.Module, KernelBench tasks (250+ benchmarks), and any HuggingFace model ID
- Inference-Time Scaling: Powered by fine-tuned NVIDIA Nemotron 3 Nano 30B generating 250k tokens/second
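The MAP-Elites archive behind the quality-diversity search can be sketched minimally. Everything below is an illustrative assumption, not Forge's implementation: the 6×6 grid (matching the 36 cells above), the 2-D behavior descriptor, and the insertion rule are stand-ins, and the island-migration logic is omitted.

```python
import random

# Illustrative MAP-Elites sketch (assumed structure, not Forge's actual code).
# Candidates are binned into behavior cells; each cell keeps only the
# highest-fitness kernel variant seen so far.

GRID = (6, 6)  # 6 x 6 = 36 behavior cells, as in the feature list

def cell_of(behavior):
    """Map a 2-D behavior descriptor in [0, 1)^2 to a grid cell index."""
    return tuple(min(int(b * g), g - 1) for b, g in zip(behavior, GRID))

def try_insert(archive, candidate):
    """Keep the candidate only if it beats the cell's incumbent."""
    cell = cell_of(candidate["behavior"])
    incumbent = archive.get(cell)
    if incumbent is None or candidate["fitness"] > incumbent["fitness"]:
        archive[cell] = candidate
        return True
    return False

if __name__ == "__main__":
    random.seed(0)
    archive = {}
    for _ in range(500):
        try_insert(archive, {
            "behavior": (random.random(), random.random()),  # e.g. arithmetic intensity, occupancy
            "fitness": random.random(),                      # e.g. speedup over a baseline
        })
    print(f"occupied cells: {len(archive)} / 36")
```

The key property this preserves is that improving one cell never evicts a differently-behaving elite from another cell, which is what lets the swarm maintain diverse optimization strategies in parallel.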
How It Works
- Input Processing: Accepts PyTorch models, KernelBench tasks, or HuggingFace model IDs
- Pattern Retrieval: Uses TurboPuffer RAG with 1536-dim semantic embeddings to fetch relevant optimization patterns
- Evolutionary Search: Agents explore optimization strategies including tensor core utilization, memory coalescing, and kernel fusion
- Validation Pipeline: Deduplication → Compilation (nvcc/triton) → Testing → Benchmarking
- Output Generation: Produces optimized CUDA/Triton kernels as drop-in PyTorch replacements
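The validation pipeline above (Deduplication → Compilation → Testing → Benchmarking) can be sketched as a chain of filters. The stage predicates and the toy timing metric below are stand-ins, since the real pipeline shells out to nvcc/Triton and benchmarks on GPU hardware; only the control flow is meant to match the description.

```python
import hashlib

# Sketch of the four validation stages (assumed control flow; the real
# stages invoke nvcc/Triton, run numerical tests, and time kernels on GPU).

def dedup(candidates):
    """Drop candidates whose source hashes to one already seen."""
    seen, unique = set(), []
    for src in candidates:
        h = hashlib.sha256(src.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(src)
    return unique

def compiles(src):
    """Stand-in for the nvcc/Triton compile check."""
    return "syntax_error" not in src

def passes_tests(src):
    """Stand-in for correctness tests against the PyTorch reference."""
    return "wrong_result" not in src

def benchmark(src):
    """Stand-in for latency measurement (toy metric: source length)."""
    return len(src)

def validate(candidates):
    """Dedup, keep only compiling + correct kernels, rank by benchmark."""
    survivors = [s for s in dedup(candidates) if compiles(s) and passes_tests(s)]
    return sorted(survivors, key=benchmark)
```

Ordering the stages this way is the natural design: deduplication is cheapest and runs first, while benchmarking is the most expensive step and only sees candidates that already compile and pass correctness tests.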
Target Users
- ML Engineers optimizing inference performance for production models
- Researchers needing faster experimentation with large language models
- GPU Developers working on custom kernel implementations
- Companies deploying transformer models at scale requiring maximum GPU utilization
Unique Selling Points
- Guaranteed Performance: Credit refund if Forge doesn't beat torch.compile's performance
- Automated Layer Optimization: Automatically optimizes all layers of any HuggingFace model
- Enterprise-Grade Infrastructure: Access to datacenter GPUs (B200, H100, H200) and NVIDIA Inception Program partnership
- CLI-First Design: Command-line interface with interactive demo capabilities for seamless integration into development workflows

