Gemini File Search Adds Multimodal RAG
the gist
Gemini API's File Search now embeds images alongside text in one store, supports cross-modal queries, metadata filtering, and page-level citations for grounded responses.
The Breakthrough
Google expanded the Gemini API File Search tool to handle multimodal documents. The update embeds images and text in a shared vector space using Gemini Embedding 2. Users query both modalities in one call and receive grounded responses with page-level citations.
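To make the idea of a shared vector space concrete, here is a minimal sketch of cross-modal retrieval by cosine similarity. The three-dimensional vectors are invented toy values, not output from Gemini Embedding 2, and the chunk ids are hypothetical. File Search performs this step internally, so this only illustrates the principle.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy index: text and image chunks share one space (all values are invented).
index = [
    {"id": "report.pdf#p2-text", "kind": "text", "vector": [0.9, 0.1, 0.0]},
    {"id": "report.pdf#p3-chart", "kind": "image", "vector": [0.2, 0.9, 0.1]},
    {"id": "paper.pdf#p1-text", "kind": "text", "vector": [0.0, 0.2, 0.9]},
]

def search(query_vector, top_k=1):
    """Rank every chunk, regardless of modality, against one query vector."""
    ranked = sorted(
        index,
        key=lambda chunk: cosine(query_vector, chunk["vector"]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical embedding of a text query that asks about a chart.
query = [0.1, 0.95, 0.05]
best = search(query)[0]
print(best["id"], best["kind"])
```

Because both modalities are ranked in one pass, a text query can surface an image chunk directly instead of a paragraph that merely mentions it.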
What Actually Worked
- Users upload files via the files API or a file path and attach custom metadata as key-value pairs, such as {"department": "legal"} or {"modality": "chart", "topic": "vision"}.
- The pipeline chunks text into token-bound units and images into discrete tiles or page regions, then embeds both into the File Search store index.
- Queries pass the file_search tool to generate_content, optionally with a metadata_filter like {"modality": {"eq": "paper"}}; responses include grounding_metadata with page_number fields.
- The cross-modal query "show me the chart where q3 revenue dipped" retrieves a revenue plot PDF and interprets the dip from 168 million to 89 million in Q3.
- The metadata-filtered query "what architectural innovation does the vision transformer introduce versus cnn" on modality=paper and topic=vision limits retrieval to the Vision Transformer paper.
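The effect of the metadata filter in the last bullet can be pictured with a small local model. This sketch only mimics the {"field": {"eq": value}} shape quoted above. The `matches` helper, the file names, and the sample metadata are hypothetical. It does not call the Gemini API, whose actual filter syntax should be checked against the official documentation.

```python
def matches(metadata, metadata_filter):
    """Return True when every filtered field equals the requested value.

    Only the "eq" operator is modeled; any other operator raises ValueError.
    """
    for field, condition in metadata_filter.items():
        for op, expected in condition.items():
            if op != "eq":
                raise ValueError(f"unsupported operator: {op}")
            if metadata.get(field) != expected:
                return False
    return True

# Hypothetical store contents with custom metadata as key-value pairs.
documents = [
    {"name": "vision_transformer.pdf",
     "metadata": {"modality": "paper", "topic": "vision"}},
    {"name": "attention.pdf",
     "metadata": {"modality": "paper", "topic": "nlp"}},
    {"name": "revenue_plot.pdf",
     "metadata": {"modality": "chart", "topic": "finance"}},
]

# Both conditions must hold, mirroring modality=paper and topic=vision.
metadata_filter = {"modality": {"eq": "paper"}, "topic": {"eq": "vision"}}
candidates = [d["name"] for d in documents if matches(d["metadata"], metadata_filter)]
print(candidates)
```

Filtering before similarity ranking narrows the candidate set, which is why the query in the demo only ever sees the Vision Transformer paper.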
Context
Enterprise documents often mix text with images, such as photographs in insurance claims, schematics in engineering specs, or charts in reports. Previously, File Search handled only text, which forced teams to build custom pipelines for visuals. This update simplifies RAG by automating multimodal embedding and retrieval inside the Gemini API, reducing the need for separate vision services or self-hosted vector stores. The demos use Gemini 1.5 Flash, the papers "Attention Is All You Need" and "Vision Transformer", and fabricated charts converted to PDFs.
Notable Quotes
- "The embed step now treats a screenshot the same way it treats a paragraph."
- "Query like 'show me the chart where revenue dipped' actually retrieves the chart not just a paragraph that mentions a chart."
- "Files cap at 100 megabytes each and the free tier gives you 1 GB total storage. Vector storage is free. Query time embeddings are free."