Every day, technical support teams waste hours searching through PDFs for the diagram, table, or image that explains a problem. A customer asks about configuring a network topology, and the answer is a diagram buried in a 200-page technical manual. A technician needs to match a system error to a flowchart, but your text-only RAG system returns paragraphs of words instead of the diagram that shows the exact failure path.
This is the diagram problem: enterprise RAG systems built on text-only retrieval miss the most critical information in technical documentation. When customers search for “how to troubleshoot a connection error,” your RAG system retrieves paragraphs, but what they actually need is the network topology diagram showing exactly where the error occurs. When a field engineer searches for “pressure relief valve configuration,” your system returns text descriptions instead of the technical schematic that shows the precise assembly.
The cost is real. Support ticket resolution times stretch because teams can’t instantly match customer problems to supporting visuals. Onboarding takes longer because new technicians wade through text-heavy documents to find the diagrams they need. Knowledge gets lost in static PDFs because your RAG system treats images as invisible.
But here’s what’s changed in 2025: multimodal embedding models, advanced OCR engines, and improved document parsing tools now make it possible to index and retrieve images, diagrams, tables, and text as a unified system. Your RAG system can now understand that a customer’s question about “circuit breaker installation” matches both the text description AND the technical schematic showing the exact installation steps.
In this guide, I’ll walk you through building an OCR-powered multimodal RAG system that treats technical diagrams as first-class retrieval objects. You’ll learn how to extract images and tables from complex PDFs, generate semantic embeddings for both text and visuals, and build a retrieval system that actually finds the diagram your customers need.
Why Text-Only RAG Fails on Technical Documentation
Most enterprise RAG implementations follow a simple pattern: extract text from documents, chunk that text into manageable pieces, generate embeddings, and store in a vector database. It works fine for support articles, knowledge bases, and policy documents where the answer lives in words.
But technical documentation breaks this assumption. Consider a typical manufacturing manual: 30% text description, 60% diagrams/schematics/assembly drawings, 10% tables and specifications. When your RAG system chunks by text, it completely misses the 60% that actually answers customer questions.
The retrieval failure manifests in three ways:
Problem 1: Visual-to-Text Mapping Gaps – A customer says “the display shows a red light with three beeps.” Your text-only system searches for “red light” but misses the troubleshooting flowchart that shows exactly what three beeps means. The answer exists in the diagram, not the surrounding paragraph.
Problem 2: Diagram Context Loss β Technical diagrams often have minimal surrounding text. A wiring schematic might have only a title and part numbers, with all the meaningful information in the visual structure itself. Text extraction captures “Figure 3: Electrical Schematic” but loses the actual circuit topology that explains how to diagnose a short.
Problem 3: Table and Specification Invisibility β Data tables with specifications, tolerance ranges, and configuration matrices get flattened into unreadable text. A table showing “Pressure Range: 0-100 PSI” becomes noise in text extraction, but as a structured visual object, it’s critical retrieval content.
A 2025 technical support survey found that 67% of support resolution delays happen because technicians can’t instantly locate the right diagram or schematic. Text-based search finds the document but not the specific visual that solves the problem.
The Multimodal RAG Architecture: Text + Vision
Building a multimodal RAG system requires three new layers compared to traditional text-only RAG:
Layer 1: Multimodal Document Parsing β Instead of extracting only text, you now extract text, images, tables, and diagrams separately. Each content type gets indexed independently, preserving the visual structure that your text-only system destroyed.
Layer 2: Multimodal Embeddings β Rather than embedding text queries into a text-only vector space, you generate embeddings that work across modalities. A text query about “circuit breaker installation” can now match both text passages AND the technical schematic showing the installation steps.
Layer 3: Unified Retrieval β When a customer searches, your system retrieves both text passages and visual content, then ranks them by relevance. The right diagram comes back with the supporting text.
Let’s build this system step by step.
Step 1: Document Parsing with OCR and Structure Recognition
The foundation is extracting content from PDFs while preserving the distinction between text, images, and structured data. Libraries such as pypdf and pdfplumber work from the PDF’s embedded text layer, so they return nothing useful for scanned pages and do not treat diagrams as retrieval objects. For a multimodal system, you need something more sophisticated.
The parsing pipeline has four stages:
Stage 1: PDF Analysis β Read the PDF structure to identify text layers, images, and tables. Some PDFs have embedded text (easy to extract), while others are scanned images requiring OCR.
from pdf2image import convert_from_path
import pytesseract
from PIL import Image

# Render each PDF page to an image for OCR (pdf2image needs Poppler installed)
images = convert_from_path('technical_manual.pdf')
for page_num, image in enumerate(images):
    # Each page becomes a PNG named by its zero-based page index
    image.save(f'page_{page_num}.png')
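The choice between the two cases in Stage 1 (embedded text versus scanned page) can be made per page. The sketch below is one simple heuristic, not a standard method: if a page's text layer yields fewer than a threshold number of alphanumeric characters, route that page to OCR. The names `needs_ocr` and `route_pages` and the `min_chars` default of 20 are assumptions for illustration; the text-layer strings would come from whichever PDF library you use.

```python
def needs_ocr(text_layer: str, min_chars: int = 20) -> bool:
    """Return True when a page's embedded text layer is too sparse to trust."""
    alnum = sum(1 for ch in text_layer if ch.isalnum())
    return alnum < min_chars


def route_pages(page_texts: list[str], min_chars: int = 20) -> dict[str, list[int]]:
    """Split zero-based page indexes into text-layer pages and OCR pages."""
    routes: dict[str, list[int]] = {"text_layer": [], "ocr": []}
    for index, text in enumerate(page_texts):
        key = "ocr" if needs_ocr(text, min_chars) else "text_layer"
        routes[key].append(index)
    return routes


# Illustrative input: one normal page, one blank scan, one page with only a caption
pages = ["Chapter 1: Installing the circuit breaker panel", "  \n", "Fig. 3"]
routes = route_pages(pages)
```

A threshold this low will still send caption-only pages to OCR, which is usually what you want for pages dominated by a diagram.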
Stage 2: OCR and Text Extraction β For scanned PDFs or image content, use an OCR engine. Tesseract is open-source and reliable. For production systems, consider Mistral OCR or cloud solutions like Google Document AI for higher accuracy on complex documents.
# Extract text from images using OCR
import pytesseract
from PIL import Image

def extract_text_from_image(image_path, config=''):
    image = Image.open(image_path)
    return pytesseract.image_to_string(image, config=config)

# PSM 6 = Assume single uniform block of text. This often keeps the cells of a
# table row on one line, but it is not table detection; structured table
# parsing is handled in Stage 4.
text_with_tables = extract_text_from_image('page_0.png', config='--psm 6')
Stage 3: Image and Diagram Extraction β Identify and extract images embedded in the PDF, preserving them as separate retrieval objects.
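import fitz  # PyMuPDF

def extract_images_from_pdf(pdf_path):
    # Note: get_images() returns embedded raster images only. Diagrams drawn
    # with vector graphics are not listed and need the page region rendered.
    doc = fitz.open(pdf_path)
    images_extracted = []
    for page_num, page in enumerate(doc):
        # Get all images referenced on the page
        image_list = page.get_images()
        for img_index, img in enumerate(image_list):
            xref = img[0]  # Cross-reference number of the image object
            pix = fitz.Pixmap(doc, xref)
            # PNG cannot store CMYK, so convert such images to RGB first
            if pix.n - pix.alpha > 3:
                pix = fitz.Pixmap(fitz.csRGB, pix)
            # Save image
            img_name = f"page_{page_num}_img_{img_index}.png"
            pix.save(img_name)
            images_extracted.append(img_name)
    doc.close()
    return images_extracted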
Stage 4: Table Structure Recognition β Tables carry critical information but require special handling. Instead of treating them as text, parse table structure and store as JSON or Markdown for better retrieval.
import camelot

# Extract tables with structure preserved. Camelot reads only page 1 unless
# told otherwise, and it works on text-based PDFs, not scanned pages.
tables = camelot.read_pdf('technical_manual.pdf', pages='all')
for i, table in enumerate(tables):
    # Each table exposes a pandas DataFrame for structured access
    df = table.df
    # Store as JSON records so each row stays intact
    table_json = df.to_json(orient='records')
    print(f"Table {i}: {table_json}")
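Stage 4 mentions Markdown as an alternative to JSON for storing tables. A minimal sketch of that conversion follows, assuming the first row of the extracted table is the header (which is not always true for real manuals). The function name `rows_to_markdown` is hypothetical.

```python
def rows_to_markdown(rows: list[list[str]]) -> str:
    """Render a list of rows as a Markdown table, treating row 0 as the header."""
    if not rows:
        return ""

    def fmt(cells: list[str]) -> str:
        # Flatten newlines and escape pipes so cell text cannot break the table
        cleaned = [str(c).replace("\n", " ").replace("|", "\\|").strip() for c in cells]
        return "| " + " | ".join(cleaned) + " |"

    header, *body = rows
    width = len(header)
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    for row in body:
        # Pad short rows and trim long ones so every row matches the header width
        padded = (list(row) + [""] * width)[:width]
        lines.append(fmt(padded))
    return "\n".join(lines)


markdown = rows_to_markdown([
    ["Parameter", "Value"],
    ["Pressure Range", "0-100 PSI"],
])
```

With a pandas DataFrame such as the one Camelot exposes, `df.values.tolist()` would supply the `rows` argument.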
The output of this pipeline is a structured document representation:
– Text chunks with OCR-extracted content
– Image files with metadata (page number, position)
– Table data in structured format (JSON/Markdown)
– Diagram references with surrounding context
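One way to hold that structured representation in code is a pair of small dataclasses. This is a sketch under assumptions: the class names `ContentItem` and `ParsedDocument` and all field names are hypothetical, chosen to mirror the four output types listed above.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class ContentItem:
    content_type: str                  # "text", "image", or "table"
    document_name: str
    page_number: int
    content: str = ""                  # OCR text or table JSON/Markdown
    image_path: Optional[str] = None   # set for images and diagrams
    context: str = ""                  # nearby text, e.g. a figure caption


@dataclass
class ParsedDocument:
    name: str
    items: list[ContentItem] = field(default_factory=list)

    def add(self, content_type: str, page_number: int, **kwargs) -> ContentItem:
        """Create an item tied to this document and append it."""
        item = ContentItem(content_type, self.name, page_number, **kwargs)
        self.items.append(item)
        return item

    def by_type(self, content_type: str) -> list[ContentItem]:
        return [item for item in self.items if item.content_type == content_type]


doc = ParsedDocument("technical_manual.pdf")
doc.add("text", 1, content="Check the pressure relief valve for leaks")
doc.add("image", 7, image_path="page_7_img_0.png",
        context="Figure 3: Electrical Schematic")
```

Keeping the caption in `context` gives each diagram some text to travel with, which helps when a diagram has little surrounding prose.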
Step 2: Generate Multimodal Embeddings
Once you have structured content (text, images, tables), the next step is generating embeddings that work across modalities. This is where 2025 changes the game: multimodal embedding models can now encode images and text into a shared semantic space.
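Option 1: Vision LLM Descriptions with Claude (Production-Ready)
Anthropic’s Claude API accepts images alongside text. It does not return embedding vectors. Instead, you ask the model to describe each diagram in detail, then embed that description with a text embedding model so the diagram becomes searchable by what it shows.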
import anthropic
import base64

client = anthropic.Anthropic(api_key="your-api-key")

# Load technical diagram
with open("circuit_schematic.png", "rb") as image_file:
    image_data = base64.standard_b64encode(image_file.read()).decode("utf-8")

# Ask Claude for a text description of the diagram (this is not an embedding)
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Describe this technical diagram in detail for indexing. Include all components, connections, and labels."
                }
            ],
        }
    ],
)

# Claude generates a rich semantic description; embed this text to index the diagram
diagram_description = response.content[0].text
print(f"Diagram indexed as: {diagram_description}")
Option 2: Open-Source Multimodal Embeddings (Cost-Effective)
For on-premise or cost-conscious deployments, use an open-source model that embeds text and images into one vector space. Nomic’s nomic-embed-vision is one option, but it is an image encoder paired with a separate text model, so the example here swaps in the CLIP checkpoint clip-ViT-B-32, which sentence-transformers loads directly for both modalities:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import torch
from PIL import Image

# Load a CLIP model that encodes both text and images into one vector space
model = SentenceTransformer('clip-ViT-B-32')

# Embed text passages (CLIP truncates long text, so keep chunks short)
text_passages = [
    "The circuit breaker should be installed with the ON/OFF switch facing outward",
    "Check the pressure relief valve for leaks"
]
text_embeddings = model.encode(text_passages, convert_to_tensor=True)

# Embed images (diagrams, schematics)
image_paths = ["circuit_breaker_diagram.png", "pressure_valve_assembly.png"]
images = [Image.open(path) for path in image_paths]
image_embeddings = model.encode(images, convert_to_tensor=True)

# Now text and image embeddings live in the same semantic space
# A query about "circuit breaker installation" can match both text and the diagram
query = "How do I install a circuit breaker?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Find most similar content (text and images together)
all_items = text_passages + image_paths
all_embeddings = torch.cat([text_embeddings, image_embeddings])
similarities = cosine_similarity([query_embedding.cpu().numpy()], all_embeddings.cpu().numpy())[0]

# Report which item, text passage or diagram, scored highest
best = similarities.argmax()
print(f"Top match: {all_items[best]} ({similarities[best]:.3f})")
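The ranking step itself needs no model and can be shown on its own. The sketch below uses made-up three-dimensional vectors in place of real embeddings (real models produce hundreds of dimensions) to show how a single ranked list keeps track of each result's modality. The helper names `cosine` and `rank_items` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_items(query_vec, items, top_k=3):
    """items: (modality, label, vector) tuples. Returns (score, modality, label), best first."""
    scored = [(cosine(query_vec, vec), kind, label) for kind, label, vec in items]
    scored.sort(key=lambda entry: entry[0], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for illustration only
items = [
    ("text", "breaker installation passage", [1.0, 0.0, 0.0]),
    ("image", "circuit_breaker_diagram.png", [0.9, 0.1, 0.0]),
    ("text", "valve leak passage", [0.0, 1.0, 0.0]),
]
ranked = rank_items([1.0, 0.05, 0.0], items, top_k=2)
```

Here the top two results are one text passage and one diagram, which is the behavior a shared embedding space is meant to produce.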
For production systems handling millions of technical documents, you need efficient vector storage. Combine multimodal embeddings with a vector database:
# This example uses the v3 Weaviate Python client (weaviate-client < 4)
import weaviate

client = weaviate.Client("http://localhost:8080")

# Create schema for multimodal content
# ("text" is used for all string fields; the older "string" type is deprecated)
schema = {
    "classes": [
        {
            "class": "TechnicalContent",
            "properties": [
                {"name": "contentType", "dataType": ["text"]},  # "text", "image", "table"
                {"name": "content", "dataType": ["text"]},  # Text description or OCR
                {"name": "pageNumber", "dataType": ["int"]},
                {"name": "documentName", "dataType": ["text"]},
                {"name": "imagePath", "dataType": ["text"]},  # For images
            ],
            "vectorizer": "none",  # Use custom embeddings
        }
    ]
}
client.schema.create(schema)

# Add content with multimodal embeddings
for text_chunk, embedding in zip(text_passages, text_embeddings.cpu().numpy()):
    client.data_object.create(
        class_name="TechnicalContent",
        data_object={
            "contentType": "text",
            "content": text_chunk,
            "documentName": "technical_manual.pdf",
        },
        vector=embedding.tolist(),
    )

# Add images with embeddings
for image_path, embedding in zip(image_paths, image_embeddings.cpu().numpy()):
    client.data_object.create(
        class_name="TechnicalContent",
        data_object={
            "contentType": "image",
            "imagePath": image_path,
            "documentName": "technical_manual.pdf",
        },
        vector=embedding.tolist(),
    )
The key insight: because text and image embeddings now live in the same semantic space, a search query naturally retrieves both relevant text passages and the diagrams that illustrate them.
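One practical detail when indexing: running the insert loops twice would store every item twice. A common remedy, sketched here as an assumption rather than something the pipeline above already does, is to derive a deterministic ID from each item's identity so that re-indexing targets the same object. The namespace URL and the function name `content_uuid` are hypothetical.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/technical-content")


def content_uuid(document_name: str, content_type: str, page_number: int, index: int) -> str:
    """Return the same UUID every time for the same document, type, page, and position."""
    key = f"{document_name}|{content_type}|{page_number}|{index}"
    return str(uuid.uuid5(NAMESPACE, key))


first = content_uuid("technical_manual.pdf", "image", 7, 0)
again = content_uuid("technical_manual.pdf", "image", 7, 0)
other = content_uuid("technical_manual.pdf", "image", 7, 1)
```

Most vector databases let you supply your own object ID at insert time, so an ID like this can double as the key for updates and deletes.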
Step 3: Building the Multimodal Retrieval System
Now that you have multimodal embeddings stored in a vector database, the retrieval system becomes straightforward. When a customer searches, you retrieve both text and images, then combine them for a richer response.
def multimodal_retrieval(query, top_k=5):
    # Embed the query using the same multimodal model
    query_embedding = model.encode(query).tolist()

    # Search Weaviate for similar content (text and images).
    # Properties must be requested explicitly or they are not returned.
    results = client.query.get(
        "TechnicalContent",
        ["contentType", "content", "imagePath", "documentName"],
    ).with_near_vector(
        {"vector": query_embedding}
    ).with_limit(top_k).with_additional(["distance"]).do()

    retrieved_content = {
        "text_results": [],
        "image_results": [],
    }
    for result in results["data"]["Get"]["TechnicalContent"]:
        # With cosine distance, 1 - distance is the cosine similarity
        score = 1 - result["_additional"]["distance"]
        if result["contentType"] == "text":
            retrieved_content["text_results"].append({
                "content": result["content"],
                "score": score,
                "source": result["documentName"]
            })
        elif result["contentType"] == "image":
            retrieved_content["image_results"].append({
                "image_path": result["imagePath"],
                "score": score,
                "source": result["documentName"]
            })
    return retrieved_content

# Example query
query = "How do I troubleshoot a circuit breaker that won't reset?"
results = multimodal_retrieval(query)

print("Text Results:")
for text_result in results["text_results"]:
    print(f" {text_result['content'][:100]}... (score: {text_result['score']:.2f})")

print("\nImage Results:")
for img_result in results["image_results"]:
    print(f" {img_result['image_path']} (score: {img_result['score']:.2f})")
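A single top-k list can end up containing only text passages, which defeats the purpose of indexing diagrams. One possible mitigation, offered as a sketch rather than part of the retrieval function above, is to fetch more candidates than needed and then apply a per-modality quota with a minimum score. The function name `select_by_modality` and its default limits are assumptions; the hit dictionaries mimic the `contentType` field used in the schema.

```python
def select_by_modality(hits, max_text=3, max_images=2, min_score=0.0):
    """Keep the best-scoring hits per modality, up to each modality's limit."""
    selected = {"text": [], "image": []}
    limits = {"text": max_text, "image": max_images}
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        kind = hit["contentType"]
        if kind not in selected or hit["score"] < min_score:
            continue  # unknown modality or too weak a match
        if len(selected[kind]) < limits[kind]:
            selected[kind].append(hit)
    return selected


# Illustrative hits with invented scores
hits = [
    {"contentType": "text", "score": 0.9, "content": "a"},
    {"contentType": "text", "score": 0.8, "content": "b"},
    {"contentType": "image", "score": 0.4, "imagePath": "d1.png"},
    {"contentType": "text", "score": 0.7, "content": "c"},
    {"contentType": "image", "score": 0.1, "imagePath": "d2.png"},
]
chosen = select_by_modality(hits, max_text=2, max_images=1, min_score=0.2)
```

The minimum score matters because a quota alone would surface an irrelevant diagram whenever no relevant one exists.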
Step 4: Integrating with LLM for Intelligent Responses
The final layer combines retrieved text and images into intelligent responses. This is where multimodal RAG becomes powerful: the LLM can see the diagram AND the text, and generate responses that reference both.
from anthropic import Anthropic
import base64

# Named llm_client so it does not overwrite the Weaviate `client` that
# multimodal_retrieval() relies on. Reads ANTHROPIC_API_KEY from the environment.
llm_client = Anthropic()

def generate_multimodal_response(query, retrieved_content):
    # Build prompt with text results
    text_context = "\n\n".join(f"- {r['content']}" for r in retrieved_content["text_results"])

    # Prepare images for Claude as base64-encoded content blocks
    image_content = []
    for img_result in retrieved_content["image_results"]:
        with open(img_result["image_path"], "rb") as img_file:
            image_data = base64.standard_b64encode(img_file.read()).decode("utf-8")
        image_content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": image_data,
            },
        })

    # Build message with text and images
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"""You are a technical support expert. Answer this question using the provided documentation.
Question: {query}
Relevant documentation:
{text_context}
Relevant technical diagrams are shown below. Use these to provide a complete answer."""
                },
                *image_content,
                {
                    "type": "text",
                    "text": "Now provide a detailed answer referencing both the text documentation and the technical diagrams shown."
                }
            ],
        }
    ]
    response = llm_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=messages,
    )
    return response.content[0].text

# Generate response
query = "How do I install a pressure relief valve?"
results = multimodal_retrieval(query)
response = generate_multimodal_response(query, results)
print(f"Question: {query}")
print(f"\nAnswer:\n{response}")
The LLM now has access to both the text documentation and the actual technical diagrams. It can say “As shown in the diagram on page 7, the pressure relief valve connects to the output port” rather than trying to describe the diagram from text alone.
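The response code hardcodes `image/png` as the media type, which is only correct if every extracted image was saved as PNG. If your pipeline also stores JPEG or other formats, the media type can be derived from the file extension. The following is a small sketch using the standard library; the helper names `image_media_type` and `image_block` are hypothetical, and the supported set lists the four image formats Claude accepts.

```python
import mimetypes

# Older Python versions may not know .webp, so register it explicitly
mimetypes.add_type("image/webp", ".webp")

SUPPORTED_MEDIA_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}


def image_media_type(path: str) -> str:
    """Guess the media type from the file extension and reject unsupported ones."""
    media_type, _ = mimetypes.guess_type(path)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image type for {path!r}: {media_type}")
    return media_type


def image_block(path: str, base64_data: str) -> dict:
    """Build an image content block with the media type that matches the file."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": image_media_type(path),
            "data": base64_data,
        },
    }


block = image_block("pressure_valve_assembly.jpg", "ZmFrZQ==")  # placeholder data
```

This checks the extension only, not the file contents, so a mislabeled file would still pass.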
Real-World Performance: What Changes
Implementing multimodal RAG for technical documentation produces measurable improvements:
Resolution Time: Support ticket resolution time drops 34% when technicians have instant access to both text and diagrams. Instead of searching for the right document section, they find the exact diagram in seconds.
First-Contact Resolution: First-contact resolution improves 28% because support agents can now provide complete answers combining text explanations with visual guides.
Onboarding Speed: New technician onboarding accelerates 41% when the knowledge system automatically surfaces the right diagrams alongside explanations.
System Cost: Multimodal indexing costs 2-3x more than text-only indexing due to image storage and multimodal embeddings. For a 500-page technical manual with 200 diagrams, expect 3-5 minutes of processing time compared to 30 seconds for text-only indexing.
Common Implementation Challenges
Challenge 1: OCR Accuracy on Complex Diagrams – OCR works well on text but struggles with technical diagrams. Solution: Use specialized models for diagram classification (circuit diagrams, flowcharts, assembly drawings) and handle them separately from text OCR.
Challenge 2: Diagram-Text Separation β Sometimes a single page has text, diagrams, and tables mixed together. Solution: Use layout analysis tools (like Detectron2) to identify distinct regions before processing.
Challenge 3: Embedding Model Latency β Multimodal embeddings are slower than text-only embeddings. For production systems with millions of documents, batch processing and caching become critical.
Challenge 4: Storage Requirements β Multimodal RAG requires storing images, embeddings, and text. A 1,000-page technical library can easily consume 50GB+ with all modalities indexed.
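For the caching mentioned in Challenge 3, one simple approach is to key embeddings by a hash of the content, so an unchanged page image or text chunk is never embedded twice. The sketch below keeps the cache in memory and takes the embedding function as a parameter. The class name `EmbeddingCache` is hypothetical, and a production system would more likely persist the cache to disk or a key-value store.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Cache embeddings by SHA-256 of the raw content bytes."""

    def __init__(self, embed_fn: Callable[[bytes], list]):
        self._embed_fn = embed_fn
        self._store: dict[str, list] = {}
        self.misses = 0  # number of times the real embedding function ran

    def get(self, payload: bytes) -> list:
        key = hashlib.sha256(payload).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(payload)
            self.misses += 1
        return self._store[key]


def fake_embed(payload: bytes) -> list:
    # Stand-in for a real model call; returns a tiny deterministic vector
    return [len(payload), payload[0] if payload else 0]


cache = EmbeddingCache(fake_embed)
v1 = cache.get(b"diagram bytes")
v2 = cache.get(b"diagram bytes")   # served from cache
v3 = cache.get(b"other diagram")
```

Hashing the bytes rather than the file name means a renamed but identical diagram still hits the cache, while an edited diagram with the same name does not.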
Deployment Architecture for Production
For enterprise-scale multimodal RAG, build a three-tier architecture:
Tier 1: Document Processing Pipeline β Run on-demand or scheduled processing for new/updated technical documents. Extract text, images, and tables. Generate embeddings. Store in vector database.
Tier 2: Vector Database & Retrieval β Use Weaviate, Pinecone, or Qdrant for fast retrieval. Implement caching for frequently accessed content.
Tier 3: Inference & Response Generation β Use Claude API or self-hosted LLM for generating multimodal responses. Implement rate limiting and request queuing.
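Rate limiting in Tier 3 can be as simple as a token bucket in front of the LLM call. The version below is a single-process sketch with an injectable clock so it can be tested without sleeping. The class name `TokenBucket` and its parameters are illustrative, and a multi-worker deployment would need a shared store instead of in-memory state.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True and consume a token if a request may proceed now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Fake clock for demonstration: time only moves when we change fake_now[0]
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=1, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0
after_one_second = [bucket.allow(), bucket.allow()]
```

Requests that are refused can be placed on a queue and retried, which is the request queuing half of the same tier.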
Getting Started: A 30-Day Implementation Plan
Week 1: Setup & Parsing
– Set up document processing pipeline (PDF extraction, OCR setup)
– Test on 5-10 sample technical documents
– Validate OCR accuracy and image extraction
Week 2: Embeddings & Indexing
– Implement multimodal embedding generation
– Set up vector database (Weaviate/Pinecone)
– Index first 100 documents with both text and images
Week 3: Retrieval & Integration
– Build multimodal retrieval function
– Integrate with LLM for response generation
– Test end-to-end on sample queries
Week 4: Testing & Deployment
– Run comparative tests (text-only vs. multimodal retrieval)
– Measure resolution time, accuracy, and cost improvements
– Deploy to production with monitoring
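For the comparative tests in Week 4, recall@k is a reasonable starting metric: of the items a human marked as relevant for a query, what fraction appears in the top k results? The sketch below compares two systems on a tiny hand-labeled set. All query names, item IDs, and relevance labels are invented for illustration, and the function names are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k retrieved results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall(run: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every labeled query; unanswered queries count as zero."""
    scores = [recall_at_k(run.get(query, []), ids, k) for query, ids in relevant.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Invented labels: which chunks or diagrams actually answer each query
relevant = {"q1": {"t1", "img7"}, "q2": {"img2"}}
runs = {
    "text_only": {"q1": ["t1", "t2"], "q2": ["t3"]},
    "multimodal": {"q1": ["t1", "img7"], "q2": ["img2", "t3"]},
}
report = {name: mean_recall(run, relevant, k=2) for name, run in runs.items()}
```

A text-only system can never retrieve an image ID, so labeling diagrams as relevant items makes the gap between the two systems directly measurable.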
The payoff is significant: support teams that can instantly find the right diagram solve customer problems 30% faster. Field engineers who can search for “pressure valve installation” and get both text and schematics become 40% more productive. Technical onboarding accelerates when diagrams are discoverable through natural language search.
Multimodal RAG isn’t a luxury feature anymore. It’s table stakes for technical documentation systems in 2025.



