Gemini Embedding 2 is a multimodal embedding model built on the Gemini architecture, now available in public preview via the Gemini API and Vertex AI. It expands beyond text-only embeddings by natively processing text, images, videos, audio, and documents and mapping them into a single, unified embedding space. The model captures semantic intent in more than 100 languages and accepts interleaved inputs of multiple modalities in a single request, enabling complex multimodal understanding.
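To make "interleaved input of multiple modalities in a single request" concrete, here is a minimal sketch of what such a request could look like. The model identifier, field names, and file URIs below are illustrative assumptions, not the actual API schema; consult the Gemini API or Vertex AI reference for the real interface.

```python
# Hypothetical shape of an interleaved multimodal embedding request.
# All names below are assumptions for illustration only.
request = {
    "model": "gemini-embedding-2",  # assumed model identifier
    "contents": [
        {"text": "Find the scene where the product is unboxed."},
        {"video": {"uri": "gs://my-bucket/demo.mp4"}},   # hypothetical URI
        {"image": {"uri": "gs://my-bucket/cover.png"}},  # hypothetical URI
    ],
    # MRL lets you request a smaller vector than the 3072-dim default.
    "output_dimensionality": 768,
}

# A single request mixes text, video, and image parts.
print(len(request["contents"]))        # 3
print(request["output_dimensionality"])  # 768
```

The key point is that heterogeneous parts travel together in one request and produce embeddings in the same space, so a text query can be compared directly against video or image content.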
Key Features:
- Multimodal Processing: Handles text (up to 8192 tokens), images (up to 6 per request), videos (up to 120 seconds), audio (native ingestion without transcription), and documents (PDFs up to 6 pages)
- Unified Embedding Space: Maps diverse media types into a single space for seamless multimodal retrieval and classification
- Flexible Output Dimensions: Uses Matryoshka Representation Learning (MRL) to scale dimensions from default 3072 down to 768 for cost-performance balance
- State-of-the-Art Performance: Outperforms leading models on text, image, and video tasks, and delivers strong speech capabilities
- Developer-Friendly: Available through the Gemini API and Vertex AI, with integrations for popular frameworks such as LangChain and LlamaIndex, plus Vertex AI Vector Search
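The Matryoshka Representation Learning (MRL) feature above means the leading dimensions of an embedding carry the most information, so you can keep only a prefix of the vector and re-normalize it. A minimal sketch, using a synthetic vector in place of a real model output (real embeddings come from the API, not from this formula):

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of an MRL-trained embedding
    and re-normalize to unit length (standard Matryoshka usage)."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 3072-dim vector standing in for a model output.
full = [math.sin(i * 0.1) for i in range(3072)]

# Shrink from the 3072-dim default to 768 dims for cheaper
# storage and faster similarity search.
small = truncate_embedding(full, 768)

print(len(small))                           # 768
print(round(sum(x * x for x in small), 6))  # 1.0 (unit norm)
```

Smaller vectors cut index size and query latency roughly in proportion to the dimension, at a modest cost in retrieval quality, which is the cost-performance trade-off the feature list refers to.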
Use Cases:
- Retrieval-Augmented Generation (RAG) systems
- Multimodal semantic search and data clustering
- Sentiment analysis across different media types
- Legal discovery processes with image and video search
- Creator content indexing and brand collaboration platforms
- Personal wellness applications with conversational memory embedding
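All of the retrieval-style use cases above reduce to the same core operation: embed the query and the corpus items into the shared space, then rank by cosine similarity. A self-contained sketch with toy 4-dim vectors standing in for real API embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings for mixed-media items; in practice these would
# come from the embedding model, all in one unified space.
corpus = {
    "unboxing video": [0.9, 0.1, 0.0, 0.1],
    "contract scan":  [0.1, 0.8, 0.3, 0.0],
    "podcast clip":   [0.0, 0.2, 0.9, 0.1],
}

# A text query embedded into the same space.
query = [0.85, 0.15, 0.05, 0.1]

# Rank corpus items by similarity to the query.
best = max(corpus, key=lambda k: cosine(query, corpus[k]))
print(best)  # unboxing video
```

In a production RAG pipeline the brute-force `max` would be replaced by an approximate nearest-neighbor index (e.g. a vector database), but the ranking principle is identical.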

