Picture this: Your enterprise knowledge base contains thousands of technical diagrams, architectural blueprints, flowcharts, and documents with embedded images. A team member asks, “Show me all instances where we use microservices architecture with Redis caching in our system documentation.” Traditional RAG systems would struggle to interpret the visual elements, potentially missing critical information hidden in diagrams and charts.
This scenario highlights a fundamental limitation in most RAG implementations today: they’re text-only. While these systems excel at processing written content, they fall short when dealing with the rich visual information that comprises a significant portion of enterprise knowledge. Engineering teams lose access to architectural diagrams, product managers can’t query flowcharts, and technical writers struggle to reference visual documentation.
Google’s Gemini 2.0 Flash changes this paradigm entirely. This breakthrough model doesn’t just process text—it understands images, analyzes diagrams, interprets charts, and can reason across both visual and textual content simultaneously. For enterprise RAG systems, this means unlocking an entirely new dimension of knowledge retrieval and generation.
In this comprehensive guide, we’ll build a production-ready multi-modal RAG system that can process both textual documents and visual content, creating a unified knowledge base that truly represents how modern enterprises store and consume information. You’ll learn to implement document ingestion pipelines that handle images, build retrieval mechanisms that understand visual context, and create response systems that can generate insights from both text and visual elements.
Understanding Multi-Modal RAG Architecture
The Multi-Modal Challenge
Traditional RAG systems follow a straightforward pattern: ingest text, create embeddings, store in vector database, retrieve similar content, and generate responses. Multi-modal RAG introduces complexity at every stage. Visual content requires different processing techniques, embeddings must capture both semantic and visual meaning, and retrieval needs to consider cross-modal relationships.
Gemini 2.0 Flash addresses these challenges through its unified architecture. Unlike systems that process text and images separately, Gemini 2.0 Flash understands the relationship between visual and textual elements. When analyzing a technical diagram with accompanying text, it comprehends how the visual flowchart relates to the written specifications.
Architecture Components
A production multi-modal RAG system requires several specialized components (a minimal sketch of how they fit together follows this list):
Document Processing Pipeline: Handles various file formats (PDFs with images, PowerPoint presentations, technical drawings) and extracts both textual content and visual elements while preserving their relationships.
Multi-Modal Embedding Generation: Creates vector representations that capture both semantic meaning from text and visual features from images, enabling cross-modal similarity search.
Intelligent Retrieval System: Determines whether queries require text-only, image-only, or multi-modal responses, then retrieves the most relevant content across all modalities.
Context-Aware Generation: Synthesizes responses that can reference both textual information and visual content, providing comprehensive answers that leverage all available knowledge.
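Before building each component, it helps to fix the shape of the data that flows between them. The sketch below is purely illustrative: the ContentBlock dataclass and the protocol names are not from any library, and the concrete classes in the rest of this guide pass plain dictionaries with the same fields.
from dataclasses import dataclass, field
from typing import Dict, List, Protocol

@dataclass
class ContentBlock:
    """One unit of extracted knowledge: a text passage or an image."""
    type: str                 # "text" or "image"
    content: str              # raw text, or base64-encoded image bytes
    page: int                 # page or slide index in the source document
    bbox: tuple               # position on the page, used to relate text and images
    context: Dict = field(default_factory=dict)  # surrounding text, proximity score, etc.

class DocumentProcessor(Protocol):
    def extract(self, file_path: str) -> List[ContentBlock]: ...

class EmbeddingGenerator(Protocol):
    def embed(self, block: ContentBlock) -> Dict: ...

class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int) -> List[Dict]: ...

class ResponseGenerator(Protocol):
    def generate(self, query: str, retrieved: List[Dict]) -> Dict: ...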
Building the Document Processing Pipeline
Setting Up the Environment
Begin by establishing the foundation for multi-modal document processing:
import google.generativeai as genai
from PIL import Image
import PyPDF2
import io
import base64
import hashlib
import json
import time
import concurrent.futures
from datetime import datetime
from typing import List, Dict, Tuple
import chromadb
from chromadb.utils import embedding_functions

# Configure Gemini API
genai.configure(api_key="your-api-key")
model = genai.GenerativeModel('gemini-2.0-flash-exp')

class MultiModalProcessor:
    def __init__(self):
        self.chroma_client = chromadb.Client()
        # get_or_create avoids an error if the collection already exists from a previous run
        self.collection = self.chroma_client.get_or_create_collection(
            name="multimodal_knowledge_base",
            embedding_function=embedding_functions.GoogleGenerativeAiEmbeddingFunction(
                api_key="your-api-key",
                model_name="models/text-embedding-004"
            )
        )
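Two configuration notes before moving on. The hardcoded API key above is for illustration only; in practice you would likely read it from the environment, along these lines (the GOOGLE_API_KEY variable name is a convention, not a requirement of the SDK):
import os
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumes the key was exported beforehand
Also note that text-embedding-004 is a text embedding model: images will enter the vector space indirectly, through the textual descriptions Gemini generates for them in the sections below. That keeps the index uniform at the cost of depending on description quality.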
Advanced Document Extraction
The document processing pipeline must handle complex scenarios where text and images are interrelated:
    def extract_multimodal_content(self, file_path: str) -> List[Dict]:
        """Extract text and images with contextual relationships"""
        content_blocks = []
        if file_path.endswith('.pdf'):
            content_blocks = self._process_pdf_with_images(file_path)
        elif file_path.endswith(('.ppt', '.pptx')):
            content_blocks = self._process_presentation(file_path)
        elif file_path.endswith(('.png', '.jpg', '.jpeg')):
            content_blocks = self._process_standalone_image(file_path)
        return content_blocks

    def _process_pdf_with_images(self, file_path: str) -> List[Dict]:
        """Process PDF preserving text-image relationships"""
        import fitz  # PyMuPDF
        doc = fitz.open(file_path)
        content_blocks = []
        for page_num, page in enumerate(doc):
            # Extract text blocks with position information
            text_blocks = page.get_text("dict")
            # Extract images with position information (full=True so get_image_bbox() works below)
            image_list = page.get_images(full=True)
            # Process each text block
            for block in text_blocks["blocks"]:
                if "lines" in block:
                    text_content = ""
                    for line in block["lines"]:
                        for span in line["spans"]:
                            text_content += span["text"] + " "
                    if text_content.strip():
                        content_blocks.append({
                            "type": "text",
                            "content": text_content.strip(),
                            "page": page_num,
                            "bbox": block["bbox"],
                            "context": self._analyze_surrounding_content(page, block["bbox"])
                        })
            # Process images with context
            for img_index, img in enumerate(image_list):
                xref = img[0]
                pix = fitz.Pixmap(doc, xref)
                if pix.n - pix.alpha < 4:  # Skip CMYK and other >3-channel images that need conversion before PNG export
                    img_data = pix.tobytes("png")
                    img_bbox = page.get_image_bbox(img)
                    content_blocks.append({
                        "type": "image",
                        "content": base64.b64encode(img_data).decode(),
                        "page": page_num,
                        "bbox": tuple(img_bbox),  # convert fitz.Rect to a plain tuple for storage
                        "context": self._analyze_surrounding_content(page, img_bbox)
                    })
                pix = None
        doc.close()
        return content_blocks
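The presentation branch (_process_presentation) is omitted here but follows the same pattern with a library such as python-pptx. The standalone-image branch can stay minimal, since a lone image file has no page layout to analyze; a sketch consistent with the block schema used above:
    def _process_standalone_image(self, file_path: str) -> List[Dict]:
        """Wrap a standalone image file in a single content block (minimal sketch)."""
        with open(file_path, "rb") as f:
            img_data = f.read()
        return [{
            "type": "image",
            "content": base64.b64encode(img_data).decode(),
            "page": 0,                     # no page structure for a standalone file
            "bbox": (0, 0, 0, 0),          # placeholder: no layout information available
            "context": {"surrounding_text": "", "position": None, "proximity_score": 1.0}
        }]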
Contextual Content Analysis
Understanding the relationship between text and images is crucial for effective retrieval:
    def _analyze_surrounding_content(self, page, bbox) -> Dict:
        """Analyze content surrounding a text block or image"""
        # Get text within proximity of the bounding box
        expanded_bbox = (
            bbox[0] - 50, bbox[1] - 50,  # Expand 50 points in each direction
            bbox[2] + 50, bbox[3] + 50
        )
        surrounding_text = page.get_textbox(expanded_bbox)
        return {
            "surrounding_text": surrounding_text,
            "position": bbox,
            "proximity_score": self._calculate_proximity_score(bbox, page)
        }

    def _calculate_proximity_score(self, bbox, page) -> float:
        """Calculate how central or important this content is on the page"""
        page_rect = page.rect
        center_x = (bbox[0] + bbox[2]) / 2
        center_y = (bbox[1] + bbox[3]) / 2
        page_center_x = page_rect.width / 2
        page_center_y = page_rect.height / 2
        distance_from_center = ((center_x - page_center_x) ** 2 +
                                (center_y - page_center_y) ** 2) ** 0.5
        max_distance = (page_rect.width ** 2 + page_rect.height ** 2) ** 0.5
        return 1 - (distance_from_center / max_distance)
Implementing Multi-Modal Embeddings and Storage
Advanced Embedding Strategy
Multi-modal embeddings must capture both semantic and visual information. In this implementation, text blocks are enriched with their surrounding context before embedding, while images are embedded through Gemini-generated descriptions combined with nearby text:
    def generate_multimodal_embeddings(self, content_block: Dict) -> Dict:
        """Generate embeddings that capture multi-modal context"""
        if content_block["type"] == "text":
            # Generate text embedding with visual context
            enhanced_text = self._enhance_text_with_context(content_block)
            text_embedding = self._generate_text_embedding(enhanced_text)
            return {
                "embedding": text_embedding,
                "metadata": {
                    "type": "text",
                    "page": content_block["page"],
                    "context_score": content_block["context"]["proximity_score"],
                    "has_nearby_images": self._has_nearby_visual_content(content_block)
                }
            }
        elif content_block["type"] == "image":
            # Generate image embedding with textual context
            image_analysis = self._analyze_image_content(content_block["content"])
            combined_text = f"{image_analysis} {content_block['context']['surrounding_text']}"
            text_embedding = self._generate_text_embedding(combined_text)
            return {
                "embedding": text_embedding,
                "metadata": {
                    "type": "image",
                    "page": content_block["page"],
                    "image_analysis": image_analysis,
                    "context_score": content_block["context"]["proximity_score"]
                }
            }

    def _analyze_image_content(self, base64_image: str) -> str:
        """Use Gemini to analyze and describe image content"""
        try:
            image_data = base64.b64decode(base64_image)
            image = Image.open(io.BytesIO(image_data))
            prompt = """
            Analyze this image and provide a detailed description focusing on:
            1. Technical content (diagrams, charts, architectural elements)
            2. Text content visible in the image
            3. Relationships and connections shown
            4. Key concepts and terminology
            Provide a comprehensive description that would help in text-based search and retrieval.
            """
            response = model.generate_content([prompt, image])
            return response.text
        except Exception as e:
            return f"Image analysis failed: {str(e)}"

    def _enhance_text_with_context(self, content_block: Dict) -> str:
        """Enhance text content with contextual information"""
        base_text = content_block["content"]
        context = content_block["context"]
        enhanced_text = base_text
        if context["surrounding_text"]:
            enhanced_text += f" Context: {context['surrounding_text']}"
        # Add positional context
        if context["proximity_score"] > 0.7:
            enhanced_text += " [Central content]"
        elif context["proximity_score"] < 0.3:
            enhanced_text += " [Peripheral content]"
        return enhanced_text
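Two helpers referenced above are not shown elsewhere in this guide. A minimal sketch of both, assuming the same text-embedding-004 model configured earlier and a simple keyword heuristic for image adjacency:
    def _generate_text_embedding(self, text: str) -> List[float]:
        """Embed text with the same model backing the ChromaDB collection (sketch)."""
        result = genai.embed_content(
            model="models/text-embedding-004",
            content=text,
            task_type="retrieval_document"
        )
        return result["embedding"]

    def _has_nearby_visual_content(self, content_block: Dict) -> bool:
        """Heuristic sketch: flag text whose surrounding context mentions figure-like terms.
        A stronger version would compare the block's bbox against image bboxes on the same page."""
        surrounding = content_block["context"].get("surrounding_text", "").lower()
        return any(term in surrounding for term in ("figure", "diagram", "chart", "image"))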
Intelligent Storage Strategy
Storing multi-modal content requires careful consideration of metadata and relationships:
    def store_content_block(self, content_block: Dict, embedding_data: Dict) -> str:
        """Store content block with rich metadata for multi-modal retrieval"""
        # Use a stable hash so re-ingesting the same document yields the same IDs across runs
        content_hash = hashlib.sha256(content_block["content"][:100].encode()).hexdigest()[:16]
        document_id = f"{content_block['page']}_{content_block['type']}_{content_hash}"
        metadata = {
            "type": content_block["type"],
            "page": content_block["page"],
            "bbox": json.dumps(list(content_block["bbox"])),
            "context_score": embedding_data["metadata"]["context_score"],
            "file_source": content_block.get("file_source", ""),
            "extraction_timestamp": datetime.now().isoformat()
        }
        # Add type-specific metadata
        if content_block["type"] == "image":
            metadata.update({
                "image_analysis": embedding_data["metadata"]["image_analysis"],
                "surrounding_text": content_block["context"]["surrounding_text"][:500]  # Truncate for storage
            })
        elif content_block["type"] == "text":
            metadata.update({
                "has_nearby_images": embedding_data["metadata"]["has_nearby_images"],
                "word_count": len(content_block["content"].split())
            })
        # Store in ChromaDB
        self.collection.add(
            embeddings=[embedding_data["embedding"]],
            documents=[content_block["content"]],
            metadatas=[metadata],
            ids=[document_id]
        )
        return document_id
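With extraction, embedding, and storage in place, a small driver method ties them together. The ingest_document name is an illustrative addition, not something defined elsewhere in this guide:
    def ingest_document(self, file_path: str) -> List[str]:
        """Extract, embed, and store every content block from one document (sketch)."""
        stored_ids = []
        for block in self.extract_multimodal_content(file_path):
            block["file_source"] = file_path  # surfaced later as source metadata
            embedding_data = self.generate_multimodal_embeddings(block)
            stored_ids.append(self.store_content_block(block, embedding_data))
        return stored_ids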
Building the Intelligent Retrieval System
Query Classification and Routing
Determining the best retrieval strategy based on query characteristics:
class MultiModalRetriever:
    def __init__(self, collection):
        self.collection = collection
        self.query_classifier = self._setup_query_classifier()

    def retrieve_relevant_content(self, query: str, top_k: int = 10) -> List[Dict]:
        """Intelligent multi-modal content retrieval"""
        # Classify query intent and requirements
        query_analysis = self._analyze_query_intent(query)
        # Determine retrieval strategy
        if query_analysis["requires_visual"]:
            return self._visual_augmented_retrieval(query, query_analysis, top_k)
        elif query_analysis["cross_modal"]:
            return self._cross_modal_retrieval(query, query_analysis, top_k)
        else:
            return self._text_focused_retrieval(query, query_analysis, top_k)

    def _analyze_query_intent(self, query: str) -> Dict:
        """Analyze query to determine retrieval strategy"""
        visual_keywords = [
            "diagram", "chart", "image", "figure", "graph", "flowchart",
            "architecture", "blueprint", "visual", "screenshot", "illustration"
        ]
        cross_modal_keywords = [
            "show me", "explain the", "how does", "what is shown",
            "describe the", "relationship between", "connected to"
        ]
        requires_visual = any(keyword in query.lower() for keyword in visual_keywords)
        cross_modal = any(keyword in query.lower() for keyword in cross_modal_keywords)
        # Use Gemini to analyze complex query intent
        intent_prompt = f"""
        Analyze this query and determine:
        1. Does it require visual content (images, diagrams)?
        2. Does it need cross-modal reasoning (text + visual)?
        3. What type of content would best answer this query?
        4. Key concepts to search for
        Query: {query}
        Respond in JSON format with: requires_visual, cross_modal, content_types, key_concepts
        """
        try:
            response = model.generate_content(intent_prompt)
            raw = response.text.strip()
            # The model often wraps JSON in Markdown fences; strip them before parsing
            if raw.startswith("```"):
                raw = raw.strip("`").removeprefix("json").strip()
            intent_analysis = json.loads(raw)
        except Exception:
            # Fall back to the keyword heuristics if the model's reply is not valid JSON
            intent_analysis = {
                "requires_visual": requires_visual,
                "cross_modal": cross_modal,
                "content_types": ["text"],
                "key_concepts": []
            }
        return intent_analysis
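Three pieces referenced above are not defined in this guide: the query classifier created in __init__, the text-only retrieval path, and a utility for turning ChromaDB's columnar query results into the per-item dictionaries (document, metadata, distance) used throughout the article. Minimal sketches, assuming ChromaDB's default result format:
    def _setup_query_classifier(self):
        """Placeholder: intent analysis is handled by _analyze_query_intent above."""
        return None

    def _flatten_query_results(self, results: Dict) -> List[Dict]:
        """Convert ChromaDB's columnar results into per-item dicts (sketch)."""
        items = []
        for item_id, doc, meta, dist in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        ):
            items.append({"id": item_id, "document": doc, "metadata": meta, "distance": dist})
        return items

    def _text_focused_retrieval(self, query: str, analysis: Dict, top_k: int) -> List[Dict]:
        """Plain semantic search over text blocks only (sketch)."""
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k,
            where={"type": "text"}
        )
        return self._flatten_query_results(results)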
Advanced Retrieval Strategies
Implementing specialized retrieval methods for different query types:
    def _visual_augmented_retrieval(self, query: str, analysis: Dict, top_k: int) -> List[Dict]:
        """Retrieve content with emphasis on visual elements"""
        # Search for image content first
        image_results = self.collection.query(
            query_texts=[query],
            n_results=top_k // 2,
            where={"type": "image"}
        )
        # Search for text content that references visual elements
        enhanced_query = f"{query} diagram chart figure visual representation"
        text_results = self.collection.query(
            query_texts=[enhanced_query],
            n_results=top_k // 2,
            where={"type": "text"}
        )
        # Combine and rank results
        combined_results = self._combine_and_rank_results(
            image_results, text_results, analysis
        )
        return combined_results[:top_k]

    def _cross_modal_retrieval(self, query: str, analysis: Dict, top_k: int) -> List[Dict]:
        """Retrieve content requiring reasoning across text and images"""
        # Get initial results from both modalities
        all_results = self.collection.query(
            query_texts=[query],
            n_results=top_k * 2  # Get more to filter and rank
        )
        # Group results by page to find related content
        page_groups = self._group_results_by_page(all_results)
        # Score page groups based on multi-modal relevance
        scored_groups = []
        for page, content_items in page_groups.items():
            score = self._calculate_cross_modal_score(content_items, query, analysis)
            scored_groups.append((score, content_items))
        # Sort by score and flatten results
        scored_groups.sort(key=lambda x: x[0], reverse=True)
        final_results = []
        for score, items in scored_groups:
            final_results.extend(items)
            if len(final_results) >= top_k:
                break
        return final_results[:top_k]

    def _calculate_cross_modal_score(self, content_items: List[Dict], query: str, analysis: Dict) -> float:
        """Calculate relevance score for cross-modal content groups"""
        has_text = any(item["metadata"]["type"] == "text" for item in content_items)
        has_image = any(item["metadata"]["type"] == "image" for item in content_items)
        base_score = sum(1 - item["distance"] for item in content_items) / len(content_items)
        # Bonus for having both text and images
        if has_text and has_image:
            base_score *= 1.3
        # Bonus for high context scores
        avg_context_score = sum(
            item["metadata"].get("context_score", 0) for item in content_items
        ) / len(content_items)
        base_score *= (1 + avg_context_score * 0.2)
        return base_score
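The grouping and ranking helpers used above can stay simple as well; a sketch that reuses the _flatten_query_results helper from earlier (the analysis argument is kept for more sophisticated weighting):
    def _group_results_by_page(self, results: Dict) -> Dict[int, List[Dict]]:
        """Group flattened results by source page so text and images from the same page score together (sketch)."""
        groups = {}
        for item in self._flatten_query_results(results):
            groups.setdefault(item["metadata"]["page"], []).append(item)
        return groups

    def _combine_and_rank_results(self, image_results: Dict, text_results: Dict,
                                  analysis: Dict) -> List[Dict]:
        """Merge image and text hits and sort by similarity; lower distance means a closer match (sketch)."""
        combined = self._flatten_query_results(image_results) + self._flatten_query_results(text_results)
        return sorted(combined, key=lambda item: item["distance"])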
Creating Context-Aware Response Generation
Multi-Modal Response Synthesis
Generating comprehensive responses that leverage both textual and visual content:
class MultiModalResponseGenerator:
    def __init__(self, model):
        self.model = model

    def generate_comprehensive_response(self, query: str, retrieved_content: List[Dict]) -> Dict:
        """Generate response incorporating multi-modal content"""
        # Organize content by type and relevance
        text_content = [item for item in retrieved_content if item["metadata"]["type"] == "text"]
        image_content = [item for item in retrieved_content if item["metadata"]["type"] == "image"]
        # Create structured context for generation
        context = self._build_generation_context(text_content, image_content, query)
        # Generate response with multi-modal awareness
        response = self._generate_with_multimodal_context(query, context)
        return {
            "response": response,
            "sources": self._extract_source_references(retrieved_content),
            "visual_references": self._extract_visual_references(image_content),
            "confidence_score": self._calculate_response_confidence(retrieved_content)
        }

    def _build_generation_context(self, text_content: List[Dict],
                                  image_content: List[Dict], query: str) -> Dict:
        """Build structured context for response generation"""
        context = {
            "query": query,
            "text_sources": [],
            "visual_sources": [],
            "cross_references": []
        }
        # Process text content
        for item in text_content[:5]:  # Limit to top 5 for context window
            context["text_sources"].append({
                "content": item["document"],
                "page": item["metadata"]["page"],
                "relevance": 1 - item["distance"],
                "context_score": item["metadata"].get("context_score", 0)
            })
        # Process visual content
        for item in image_content[:3]:  # Limit to top 3 images
            context["visual_sources"].append({
                "description": item["metadata"].get("image_analysis", ""),
                "page": item["metadata"]["page"],
                "surrounding_text": item["metadata"].get("surrounding_text", ""),
                "relevance": 1 - item["distance"]
            })
        # Find cross-references (content from same pages)
        context["cross_references"] = self._find_cross_references(
            text_content, image_content
        )
        return context

    def _generate_with_multimodal_context(self, query: str, context: Dict) -> str:
        """Generate response using multi-modal context"""
        prompt = f"""
        You are an expert knowledge assistant with access to both textual and visual information from enterprise documentation.
        User Query: {query}
        Available Context:
        TEXT SOURCES:
        {self._format_text_sources(context['text_sources'])}
        VISUAL SOURCES:
        {self._format_visual_sources(context['visual_sources'])}
        CROSS-REFERENCES:
        {self._format_cross_references(context['cross_references'])}
        Instructions:
        1. Provide a comprehensive answer that draws from both textual and visual sources
        2. When referencing visual content, describe what the diagrams/images show
        3. Highlight relationships between textual explanations and visual representations
        4. If visual content supports or contradicts textual information, note this
        5. Structure your response clearly with relevant sections
        6. Reference page numbers when citing specific sources
        Generate a detailed, accurate response that fully leverages the multi-modal knowledge base:
        """
        try:
            response = self.model.generate_content(prompt)
            return response.text
        except Exception as e:
            return f"Error generating response: {str(e)}"
Advanced Source Attribution
Providing clear attribution for multi-modal sources:
    def _extract_source_references(self, retrieved_content: List[Dict]) -> List[Dict]:
        """Extract and format source references"""
        sources = []
        for item in retrieved_content:
            source = {
                "type": item["metadata"]["type"],
                "page": item["metadata"]["page"],
                "relevance_score": round(1 - item["distance"], 3),
                "context_score": round(item["metadata"].get("context_score", 0), 3)
            }
            if item["metadata"]["type"] == "text":
                source["preview"] = item["document"][:200] + "..."
            else:
                source["description"] = item["metadata"].get("image_analysis", "")
                source["surrounding_context"] = item["metadata"].get("surrounding_text", "")[:100]
            sources.append(source)
        return sources

    def _calculate_response_confidence(self, retrieved_content: List[Dict]) -> float:
        """Calculate confidence score for the generated response"""
        if not retrieved_content:
            return 0.0
        # Base confidence on retrieval quality
        avg_relevance = sum(1 - item["distance"] for item in retrieved_content) / len(retrieved_content)
        # Bonus for multi-modal content
        has_text = any(item["metadata"]["type"] == "text" for item in retrieved_content)
        has_images = any(item["metadata"]["type"] == "image" for item in retrieved_content)
        multimodal_bonus = 0.1 if (has_text and has_images) else 0
        # Context quality bonus
        avg_context = sum(
            item["metadata"].get("context_score", 0) for item in retrieved_content
        ) / len(retrieved_content)
        context_bonus = avg_context * 0.1
        final_confidence = min(1.0, avg_relevance + multimodal_bonus + context_bonus)
        return round(final_confidence, 3)
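One helper from generate_comprehensive_response remains: _extract_visual_references, which surfaces the retrieved images so a client application can display or link them alongside the answer. A sketch:
    def _extract_visual_references(self, image_content: List[Dict]) -> List[Dict]:
        """Summarize retrieved images for display next to the generated answer (sketch)."""
        return [
            {
                "page": item["metadata"]["page"],
                "description": item["metadata"].get("image_analysis", ""),
                "relevance_score": round(1 - item["distance"], 3),
            }
            for item in image_content
        ]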
Production Deployment and Optimization
Performance Optimization Strategies
Multi-modal RAG systems require careful optimization for production environments:
class ProductionMultiModalRAG:
    def __init__(self):
        self.processor = MultiModalProcessor()
        self.retriever = MultiModalRetriever(self.processor.collection)
        self.generator = MultiModalResponseGenerator(model)
        # Performance optimization components
        self.cache = self._setup_caching_layer()
        self.batch_processor = self._setup_batch_processing()

    def _setup_caching_layer(self):
        """Implement caching for expensive operations"""
        import redis
        return redis.Redis(
            host='localhost',
            port=6379,
            decode_responses=True,
            socket_keepalive=True,
            socket_keepalive_options={}
        )

    def process_document_batch(self, file_paths: List[str],
                               batch_size: int = 5) -> Dict:
        """Process multiple documents efficiently"""
        results = {
            "processed": 0,
            "failed": 0,
            "total_content_blocks": 0,
            "processing_time": 0
        }
        start_time = time.time()
        for i in range(0, len(file_paths), batch_size):
            batch = file_paths[i:i + batch_size]
            # Process batch in parallel
            with concurrent.futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
                futures = {
                    executor.submit(self._process_single_document, fp): fp
                    for fp in batch
                }
                for future in concurrent.futures.as_completed(futures):
                    file_path = futures[future]
                    try:
                        content_blocks = future.result()
                        results["processed"] += 1
                        results["total_content_blocks"] += len(content_blocks)
                    except Exception as e:
                        print(f"Failed to process {file_path}: {str(e)}")
                        results["failed"] += 1
        results["processing_time"] = time.time() - start_time
        return results

    def query_with_caching(self, query: str, cache_ttl: int = 3600) -> Dict:
        """Query with intelligent caching"""
        # Create a stable cache key (Python's built-in hash() varies between processes)
        cache_key = f"multimodal_query:{hashlib.sha256(query.encode()).hexdigest()}"
        # Check cache first
        cached_result = self.cache.get(cache_key)
        if cached_result:
            return json.loads(cached_result)
        # Generate new response
        retrieved_content = self.retriever.retrieve_relevant_content(query)
        response_data = self.generator.generate_comprehensive_response(
            query, retrieved_content
        )
        # Cache result
        self.cache.setex(
            cache_key,
            cache_ttl,
            json.dumps(response_data, default=str)
        )
        return response_data
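Two helpers referenced in this class are not shown in the article. _process_single_document can simply run the ingestion pipeline built earlier for one file, and _setup_batch_processing can hold whatever batch configuration you need; minimal sketches:
    def _setup_batch_processing(self):
        """Placeholder for batch configuration (rate limits, retries, queue sizes)."""
        return {"max_workers": 5}

    def _process_single_document(self, file_path: str) -> List[Dict]:
        """Extract, embed, and store one document; returns its content blocks (sketch)."""
        content_blocks = self.processor.extract_multimodal_content(file_path)
        for block in content_blocks:
            block["file_source"] = file_path
            embedding_data = self.processor.generate_multimodal_embeddings(block)
            self.processor.store_content_block(block, embedding_data)
        return content_blocks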
Monitoring and Quality Assurance
Implementing comprehensive monitoring for multi-modal RAG systems:
class MultiModalRAGMonitor:
    def __init__(self):
        self.metrics = {
            "queries_processed": 0,
            "avg_response_time": 0,
            "multimodal_queries": 0,
            "text_only_queries": 0,
            "image_retrieval_rate": 0,
            "error_rate": 0
        }

    def log_query_metrics(self, query: str, response_data: Dict,
                          processing_time: float):
        """Log detailed metrics for each query"""
        self.metrics["queries_processed"] += 1
        # Update average response time
        current_avg = self.metrics["avg_response_time"]
        count = self.metrics["queries_processed"]
        self.metrics["avg_response_time"] = (
            (current_avg * (count - 1) + processing_time) / count
        )
        # Track query types
        has_visual_sources = len(response_data.get("visual_references", [])) > 0
        if has_visual_sources:
            self.metrics["multimodal_queries"] += 1
        else:
            self.metrics["text_only_queries"] += 1
        # Track image retrieval effectiveness
        if "show" in query.lower() or "diagram" in query.lower():
            if has_visual_sources:
                self.metrics["image_retrieval_rate"] += 1

    def generate_quality_report(self) -> Dict:
        """Generate comprehensive quality metrics"""
        total_queries = self.metrics["queries_processed"]
        if total_queries == 0:
            return {"error": "No queries processed yet"}
        return {
            "total_queries": total_queries,
            "avg_response_time_seconds": round(self.metrics["avg_response_time"], 3),
            "multimodal_usage_rate": round(
                self.metrics["multimodal_queries"] / total_queries, 3
            ),
            "image_retrieval_success_rate": round(
                self.metrics["image_retrieval_rate"] /
                max(1, self.metrics["multimodal_queries"]), 3
            ),
            "error_rate": round(self.metrics["error_rate"] / total_queries, 3),
            "system_health": "Good" if self.metrics["error_rate"] / total_queries < 0.05 else "Needs Attention"
        }
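To see the pieces working together, a minimal end-to-end run might look like the following. The file paths are placeholders, and query_with_caching assumes a Redis instance is reachable at localhost:6379:
# Ingest a couple of documents, answer a query, and record metrics
rag = ProductionMultiModalRAG()
monitor = MultiModalRAGMonitor()

rag.process_document_batch([
    "docs/system_architecture.pdf",   # placeholder paths
    "docs/caching_layer_design.pdf",
])

question = "Show me where we use microservices architecture with Redis caching"
start = time.time()
result = rag.query_with_caching(question)
monitor.log_query_metrics(question, result, time.time() - start)

print(result["response"])
print(monitor.generate_quality_report())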
The power of Gemini 2.0 Flash in multi-modal RAG systems extends far beyond simple text processing. By implementing these comprehensive pipelines, your enterprise can unlock the full potential of its visual and textual knowledge assets. The system we’ve built handles complex document structures, understands relationships between text and images, and generates responses that truly leverage the rich, multi-dimensional nature of enterprise information.
This multi-modal approach transforms how teams interact with documentation, enabling natural queries about architectural diagrams, technical specifications with embedded visuals, and complex relationships that span both textual explanations and visual representations. As organizations continue to create increasingly visual and interactive documentation, these multi-modal RAG capabilities become not just advantageous, but essential for maintaining competitive knowledge management systems.
Ready to implement multi-modal RAG in your organization? Start by identifying your most visual-heavy documentation and begin with a focused pilot project. The investment in building these sophisticated retrieval and generation capabilities will pay dividends as your team discovers new ways to access and utilize your enterprise knowledge base. Your technical documentation, architectural diagrams, and process flowcharts are waiting to become active participants in your organization’s knowledge ecosystem.