Forge

CLI swarm agent that auto-generates optimized CUDA kernels from PyTorch or HuggingFace models, achieving up to 5× faster inference than torch.compile.

Introduction

Forge is a multi-agent swarm system for automated GPU kernel optimization. It takes a PyTorch model, a KernelBench task, or a HuggingFace model ID as input and produces optimized CUDA/Triton kernels through an evolutionary search process.

Key Features
  • Swarm Architecture: 32 parallel Coder+Judge agent pairs that compete to generate optimal kernel implementations
  • Evolutionary Optimization: Uses a MAP-Elites archive with an island model for quality-diversity search across 36 behavior cells
  • Pattern RAG System: Retrieves optimization patterns from CUTLASS (1,711 templates), Triton (113 templates), and GPU specifications
  • High Performance: Achieves up to 5× speedup over torch.compile(mode='max-autotune') with a 97.6% correctness rate
  • Multiple Input Support: Works with PyTorch nn.Module, KernelBench tasks (250+ benchmarks), and any HuggingFace model ID
  • Inference-Time Scaling: Powered by a fine-tuned NVIDIA Nemotron 3 Nano 30B model that generates 250k tokens/second
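
To make the MAP-Elites and island-model idea from the feature list concrete, here is a minimal, self-contained sketch in Python. It is not Forge's implementation. The 6 x 6 grid layout, the two-float candidate encoding, the toy fitness function, the number of islands, and the ring migration scheme are all assumptions chosen for illustration; the source only states that the archive has 36 behavior cells and uses an island model.

```python
import random

GRID = 6  # assumption: 36 behavior cells arranged as a 6 x 6 grid


def descriptor(candidate):
    """Map a candidate (two floats in [0, 1]) to its behavior cell."""
    x, y = candidate
    return (min(int(x * GRID), GRID - 1), min(int(y * GRID), GRID - 1))


def fitness(candidate):
    """Toy stand-in for a measured speedup; higher is better."""
    x, y = candidate
    return 1.0 - (x - 0.5) ** 2 - (y - 0.5) ** 2


def mutate(candidate, rng, sigma=0.1):
    """Perturb each coordinate with Gaussian noise, clamped to [0, 1]."""
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in candidate)


def insert(archive, candidate):
    """Keep the candidate only if its cell is empty or it beats the current elite."""
    cell = descriptor(candidate)
    score = fitness(candidate)
    if cell not in archive or score > archive[cell][0]:
        archive[cell] = (score, candidate)
        return True
    return False


def map_elites(iterations=2000, islands=4, migrate_every=200, seed=0):
    """Run one MAP-Elites archive per island with periodic ring migration."""
    rng = random.Random(seed)
    archives = [{} for _ in range(islands)]
    for archive in archives:
        insert(archive, (rng.random(), rng.random()))
    for step in range(1, iterations + 1):
        for archive in archives:
            # Select a random elite as parent and try to place its mutant.
            _, parent = rng.choice(list(archive.values()))
            insert(archive, mutate(parent, rng))
        if step % migrate_every == 0:
            # Each island offers its best elite to the next island in the ring.
            bests = [max(a.values())[1] for a in archives]
            for i, best in enumerate(bests):
                insert(archives[(i + 1) % islands], best)
    return archives
```

Each cell keeps only its best candidate, so the archive preserves diverse solutions rather than converging on a single one. In a kernel-optimization setting the descriptor would presumably capture properties of the generated kernel and the fitness would be a measured speedup, but those mappings are not described in the source.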
How It Works
  1. Input Processing: Accepts PyTorch models, KernelBench tasks, or HuggingFace model IDs
  2. Pattern Retrieval: Queries a TurboPuffer-backed RAG index with 1536-dimensional semantic embeddings to fetch relevant optimization patterns
  3. Evolutionary Search: Agents explore optimization strategies including tensor core utilization, memory coalescing, and kernel fusion
  4. Validation Pipeline: Deduplication → Compilation (nvcc/triton) → Testing → Benchmarking
  5. Output Generation: Produces optimized CUDA/Triton kernels as drop-in PyTorch replacements
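
The validation stages in step 4 can be sketched as a staged filter in which a candidate stops at the first stage it fails. This is an illustrative outline only, not Forge's code: the Result type, the normalize helper, and the compile_fn, test_fn, and run_fn callables are hypothetical placeholders for the real nvcc/triton compilation, correctness tests, and kernel launch.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    source: str
    stage: str  # last stage reached: "duplicate", "compile", "test", or "benchmark"
    passed: bool
    seconds: Optional[float] = None


def normalize(source):
    """Collapse whitespace so trivially reformatted copies hash the same."""
    return " ".join(source.split())


def validate(candidates, compile_fn, test_fn, run_fn, repeats=5):
    """Deduplicate, compile, test, and benchmark each candidate in order."""
    seen = set()
    results = []
    for source in candidates:
        # Stage 1: deduplication by content hash.
        digest = hashlib.sha256(normalize(source).encode()).hexdigest()
        if digest in seen:
            results.append(Result(source, "duplicate", False))
            continue
        seen.add(digest)
        # Stage 2: compilation (hypothetical compile_fn raises on failure).
        try:
            kernel = compile_fn(source)
        except Exception:
            results.append(Result(source, "compile", False))
            continue
        # Stage 3: correctness testing.
        if not test_fn(kernel):
            results.append(Result(source, "test", False))
            continue
        # Stage 4: benchmarking; keep the best of several wall-clock runs.
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_fn(kernel)
            timings.append(time.perf_counter() - start)
        results.append(Result(source, "benchmark", True, min(timings)))
    return results
```

Running the cheap stages first means expensive benchmarking is spent only on kernels that are unique, compile, and produce correct results. Note that CUDA kernel launches are asynchronous, so a real run_fn would need to synchronize with the device (or use CUDA events) before the timer is read.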
Target Users
  • ML Engineers optimizing inference performance for production models
  • Researchers needing faster experimentation with large language models
  • GPU Developers working on custom kernel implementations
  • Companies deploying transformer models at scale requiring maximum GPU utilization
Unique Selling Points
  • Guaranteed Performance: Credits are refunded if Forge does not beat torch.compile's performance
  • Automated Layer Optimization: Optimizes every layer of any HuggingFace model
  • Enterprise-Grade Infrastructure: Access to datacenter GPUs (B200, H100, H200) and NVIDIA Inception Program partnership
  • CLI-First Design: Command-line interface with interactive demo capabilities for seamless integration into development workflows
