Forge is a multi-agent swarm system designed for automated GPU kernel optimization. It transforms PyTorch models, KernelBench tasks, or HuggingFace model IDs into highly optimized CUDA/Triton kernels through an intelligent evolutionary search process.
Key Features
- Swarm Architecture: 32 parallel Coder+Judge agent pairs that compete to generate optimal kernel implementations
- Evolutionary Optimization: Uses MAP-Elites archive with island model for quality-diversity search across 36 behavior cells
- Pattern RAG System: Retrieves optimization patterns from CUTLASS (1,711 templates), Triton (113 templates), and GPU specifications
- High Performance: Achieves up to 5× speedup over torch.compile(mode='max-autotune') with a 97.6% correctness rate
- Multiple Input Support: Works with PyTorch nn.Module, KernelBench tasks (250+ benchmarks), and any HuggingFace model ID
- Inference-Time Scaling: Powered by fine-tuned NVIDIA Nemotron 3 Nano 30B generating 250k tokens/second
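The MAP-Elites archive behind the quality-diversity search can be sketched minimally. Everything below is an illustrative assumption, not Forge's implementation: the 6×6 grid (matching the 36 cells above), the 2-D behavior descriptor, and the insertion rule are stand-ins, and the island-migration logic is omitted.

```python
import random

# Illustrative MAP-Elites sketch (assumed structure, not Forge's actual code).
# Candidates are binned into behavior cells; each cell keeps only the
# highest-fitness kernel variant seen so far.

GRID = (6, 6)  # 6 x 6 = 36 behavior cells, as in the feature list

def cell_of(behavior):
    """Map a 2-D behavior descriptor in [0, 1)^2 to a grid cell index."""
    return tuple(min(int(b * g), g - 1) for b, g in zip(behavior, GRID))

def try_insert(archive, candidate):
    """Keep the candidate only if it beats the cell's incumbent."""
    cell = cell_of(candidate["behavior"])
    incumbent = archive.get(cell)
    if incumbent is None or candidate["fitness"] > incumbent["fitness"]:
        archive[cell] = candidate
        return True
    return False

if __name__ == "__main__":
    random.seed(0)
    archive = {}
    for _ in range(500):
        try_insert(archive, {
            "behavior": (random.random(), random.random()),  # e.g. arithmetic intensity, occupancy
            "fitness": random.random(),                      # e.g. speedup over a baseline
        })
    print(f"occupied cells: {len(archive)} / 36")
```

The key property this preserves is that improving one cell never evicts a differently-behaving elite from another cell, which is what lets the swarm maintain diverse optimization strategies in parallel.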
How It Works
- Input Processing: Accepts PyTorch models, KernelBench tasks, or HuggingFace model IDs
- Pattern Retrieval: Uses TurboPuffer RAG with 1536-dim semantic embeddings to fetch relevant optimization patterns
- Evolutionary Search: Agents explore optimization strategies including tensor core utilization, memory coalescing, and kernel fusion
- Validation Pipeline: Deduplication → Compilation (nvcc/triton) → Testing → Benchmarking
- Output Generation: Produces optimized CUDA/Triton kernels as drop-in PyTorch replacements
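The validation pipeline above (Deduplication → Compilation → Testing → Benchmarking) can be sketched as a chain of filters. The stage predicates and the toy timing metric below are stand-ins, since the real pipeline shells out to nvcc/Triton and benchmarks on GPU hardware; only the control flow is meant to match the description.

```python
import hashlib

# Sketch of the four validation stages (assumed control flow; the real
# stages invoke nvcc/Triton, run numerical tests, and time kernels on GPU).

def dedup(candidates):
    """Drop candidates whose source hashes to one already seen."""
    seen, unique = set(), []
    for src in candidates:
        h = hashlib.sha256(src.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(src)
    return unique

def compiles(src):
    """Stand-in for the nvcc/Triton compile check."""
    return "syntax_error" not in src

def passes_tests(src):
    """Stand-in for correctness tests against the PyTorch reference."""
    return "wrong_result" not in src

def benchmark(src):
    """Stand-in for latency measurement (toy metric: source length)."""
    return len(src)

def validate(candidates):
    """Dedup, keep only compiling + correct kernels, rank by benchmark."""
    survivors = [s for s in dedup(candidates) if compiles(s) and passes_tests(s)]
    return sorted(survivors, key=benchmark)
```

Ordering the stages this way is the natural design: deduplication is cheapest and runs first, while benchmarking is the most expensive step and only sees candidates that already compile and pass correctness tests.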
Target Users
- ML Engineers optimizing inference performance for production models
- Researchers needing faster experimentation with large language models
- GPU Developers working on custom kernel implementations
- Companies deploying transformer models at scale requiring maximum GPU utilization
Unique Selling Points
- Guaranteed Performance: Credit refund if Forge doesn't beat torch.compile's performance
- Automated Layer Optimization: Automatically optimizes all layers of any HuggingFace model
- Enterprise-Grade Infrastructure: Access to datacenter GPUs (B200, H100, H200) and NVIDIA Inception Program partnership
- CLI-First Design: Command-line interface with interactive demo capabilities for seamless integration into development workflows

