In March, a Fortune 500 manufacturing firm deployed a multimodal RAG pipeline to field queries about its product catalog. Engineers uploaded spec sheets, exploded-view schematics, and maintenance videos. The system answered beautifully—until it didn’t. Asked to find the maximum torque spec on a Class IV coupling, the assistant confidently cited 45 ft-lbs. The spec sheet said 33. The hallucination came from a video of a different coupling, where a narrator mentioned 45 in an unrelated context. The multimodal RAG pipeline, which was designed to reduce errors, cross-contaminated a visual context with a textual query.
The worst part? The error was invisible to standard text-only evaluation benchmarks. The answer was fluent, cited a source, and passed automated guardrails. The plant floor caught it. The pipeline didn’t.
Welcome to the new frontier of RAG failure: object hallucination. As enterprises rush to add images, audio, and video to their retrieval pipelines, they surface a fresh class of errors that text-centric frameworks like RAGAS or TruLens can’t measure. A benchmark paper out of Nanyang Technological University and Tsinghua shows why you should worry. Their Object-Hal hallucination benchmark reveals that even GPT-4o, with full vision context, hallucinates object presence 36.1% of the time. Cheaper open-source vision models cross 48%. The very best frontier model? It still hallucinates 25.4% of the time.
A parallel study on m-RAG evaluation from Johns Hopkins and Microsoft underscores the same issue. After testing 320 question samples across 16 document categories, they found m-RAG correctness scoring was a statistical tie—under 0.09 difference in accuracy—between an 8B open-source VLM and GPT-4o. The hallucination-prone frontier model only pulled ahead on human preference metrics, not on actual factual accuracy.
For CIOs and enterprise architects, this is a call to action. Adding vision to your retrieval pipeline without specialized, multi-modal evaluation is enterprise malpractice. It’s not enough to have RAG; you need RAG that doesn’t silently misidentify objects, fabricate chart trends, or mangle visual evidence.
Let’s break down the five object hallucination patterns multi-modal RAG is still hiding from you—and how to find them before a coupling fails on the factory floor.
The Gap: Vision Language Models Can’t Look and Read at the Same Time
Here’s the core pathology: Vision-Language Models (VLMs) used in multimodal RAG are astonishingly bad at object verification. The Object-Hal study broke the hallucination down into three sub-categories: Object Hallucination (they see objects that aren’t there), Attribute Hallucination (they misattribute color, texture, or material), and Relational Hallucination (they invent spatial relationships between objects). And while most of the industry obsesses over text faithfulness, m-RAG pipelines are bleeding accuracy through images.
The Object-Hal Benchmark Numbers
Nanyang and Tsinghua researchers constructed Object-Hal by prompting GPT-4 to generate questions about specific images, then tested whether VLMs could correctly answer them without hallucinating. They used CHAIR metrics adapted for multi-modal context—measuring objects that models fabricate per response. Their findings:
Hallucination Rates by Model
– GPT-4o: 36.1% object hallucination rate
– LLaVA-1.6-34B: 41.5% object hallucination rate
– CogVLM2-19B: 44.2% object hallucination rate
– Qwen2-VL-72B: 38.7% object hallucination rate
These aren’t fringe hallucinations. They’re systematic. When GPT-4o describes an engineering diagram, it fabricates components more than a third of the time. When it looks at a medical image, it may hallucinate tissue boundaries. When it extracts data from a bar chart, it invents trends.
The Visual Perception Blindspot
But here’s the twist: these errors aren’t random. The Object-Hal paper found that hallucination rate directly scales with the object’s visibility in the image. If an object is partially occluded or depicted at a non-standard angle, VLMs simply guess—and guess wrong. They don’t have robust 3D object-understanding; they’re pattern-matching on 2D pixels using language-correlated features.
For an enterprise RAG pipeline, this is catastrophic. Your maintenance manual isn’t full of professionally shot stock photos. It’s full of exploded-view schematics, hand-drawn annotations, photographs from the plant floor with poor lighting, partial components, and non-standard angles. These are precisely the images that trigger VLM hallucination.
The m-RAG Correctness Paradox
If hallucinations are so high, why do m-RAG pipelines appear to work? The Johns Hopkins/Microsoft study reveals why: correctness saturates. The researchers measured m-RAG accuracy while supplying VLMs with 0, 1, 3, and 5 ground-truth image documents. What they found should terrify you if you rely on RAG: retrieval recall and VLM accuracy are tightly correlated only up to a certain threshold. After that, giving the VLM more correct documents does not improve correctness. It simply gives the model a wider pool in which to hallucinate.
The implication: in multi-document multi-modal RAG, the VLM’s hallucination floor limits your accuracy ceiling. Even if you retrieve perfectly, the VLM will hallucinate roughly a quarter to a third of object-related facts.
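To make that ceiling concrete, here is a back-of-envelope sketch. It rests on a simplifying assumption that comes from neither paper: an object-level answer is correct only if the right image was retrieved and the VLM did not hallucinate, and those two events are independent. The function name `accuracy_ceiling` is illustrative.

```python
def accuracy_ceiling(retrieval_recall: float, hallucination_rate: float) -> float:
    """Upper bound on object-level accuracy under the independence assumption.

    Assumption (not from the cited papers): an answer is correct only when
    retrieval succeeds AND the VLM does not hallucinate, independently.
    """
    if not (0.0 <= retrieval_recall <= 1.0 and 0.0 <= hallucination_rate <= 1.0):
        raise ValueError("both inputs must be fractions in [0, 1]")
    return retrieval_recall * (1.0 - hallucination_rate)


# Perfect retrieval, using two hallucination rates quoted in this article.
for rate in (0.254, 0.361):
    ceiling = accuracy_ceiling(1.0, rate)
    print(f"recall=1.00 hallucination={rate:.3f} -> ceiling={ceiling:.3f}")
```

Under this assumption, improving retrieval recall beyond the saturation point cannot raise accuracy past one minus the hallucination rate.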
5 Object Hallucinations Enterprise RAG Pipelines Are Missing
1. Semantic Void Substitution
Semantic void substitution occurs when a VLM identifies an object visually but replaces its name with a semantically related object that is not in the image. The Object-Hal paper found that LLaVA-1.6, when asked if a “sheep” was present in an image, would sometimes substitute “sheep” with “goat.” The two are visually distinct but conceptually close in the VLM’s embedding space. When asked to describe an image of a generic office, CogVLM2 might replace a generic computer monitor with a specific brand name from its pre-training data, even though that brand is nowhere in the image.
For an enterprise RAG pipeline, this is devastating in regulated industries. An assistant might retrieve a photograph of a generic industrial connector and, in its answer, substitute it with a branded, proprietary connector name—implying a compatibility that doesn’t exist.
2. Relational Hallucination in Schematics
This is the schematic version of the torque-coupling failure. The VLM looks at an engineering schematic labeled “Figure 14.” A text query asks for the torque value of “Part B-7.” Part B-7 exists in the document but does not appear in Figure 14. The pipeline has nonetheless retrieved the image of Figure 14, so the VLM sees the number 14 in the caption, reads a torque value from an unrelated place on the schematic, and confidently returns that number as the spec for Part B-7. It has created a relation that doesn’t exist anywhere in the multimodal context: an irrelevant torque value bound to the wrong object.
A massive blindspot: Current retrieval systems treat text and images as separate indexes. The number “33” gets chunked into a text embedding. The image of the schematic is embedded via a CLIP model. At query time, a separate scoring layer fuses the results. No architecture holistically represents the cross-modal link between the image’s textual overlay and the text chunk. So text from a caption drifts into the VLM’s visual field, where it cross-pollinates with irrelevant objects.
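One way to picture the missing cross-modal link is an index record that stores, for each figure, the part identifiers actually labeled in it. The sketch below is a hypothetical data structure, not something described in either paper; `FigureRecord`, `part_ids`, and `figures_for_part` are illustrative names, and it assumes the part labels were extracted upstream (for example by OCR on the schematic).

```python
from dataclasses import dataclass, field


@dataclass
class FigureRecord:
    """Hypothetical index entry that keeps a figure tied to its labeled parts."""
    figure_id: str
    image_path: str
    part_ids: set[str] = field(default_factory=set)  # parts labeled in the figure


def figures_for_part(part_id: str, candidates: list[FigureRecord]) -> list[FigureRecord]:
    """Keep only retrieved figures that actually depict the queried part."""
    return [fig for fig in candidates if part_id in fig.part_ids]


# Figure 14 was retrieved, but Part B-7 is not labeled in it.
retrieved = [FigureRecord("Figure 14", "fig14.png", {"A-1", "C-3"})]
usable = figures_for_part("B-7", retrieved)
print(usable)  # an empty list: do not hand Figure 14 to the VLM for this query
```

With a filter like this between retrieval and generation, the VLM never gets the chance to bind a value from Figure 14 to a part the figure does not contain.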
3. Object Occlusion Fabrication
Object occlusion in real-world images—a bolt partially hidden by a bracket, a dial shown at an angle in low light—triggers fabrication. The Object-Hal findings show hallucination rates increase proportionally to occlusion percentage. If 40% of an object is hidden, VLM hallucination becomes the prediction, not the exception.
Enterprise imagery is full of occlusion. Think of a maintenance photograph taken by a technician: flashlight glare, grease smears, shadows, partial views of components. The VLM will simply guess—and state the guess as ground truth. This hallucinated object then enters the LLM’s text generation context, where it is treated as a verified fact.
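A simple guard is to estimate occlusion before the image reaches the VLM and route heavily occluded objects to a human. The sketch below assumes an upstream detector already supplies axis-aligned bounding boxes for the target and for the objects in front of it. The 0.4 threshold mirrors the 40% example above and is an illustrative choice; the function names are hypothetical.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def occluded_fraction(target: Box, occluder: Box) -> float:
    """Fraction of the target box's area covered by a single occluder box."""
    overlap_w = max(0.0, min(target[2], occluder[2]) - max(target[0], occluder[0]))
    overlap_h = max(0.0, min(target[3], occluder[3]) - max(target[1], occluder[1]))
    area = (target[2] - target[0]) * (target[3] - target[1])
    if area <= 0:
        raise ValueError("target box must have positive area")
    return (overlap_w * overlap_h) / area


def needs_human_review(target: Box, occluders: list[Box], threshold: float = 0.4) -> bool:
    """Flag the object when its worst single occluder hides at least `threshold`.

    Using the largest single overlap is a lower bound on total occlusion;
    it avoids computing the union of several overlapping boxes.
    """
    worst = max((occluded_fraction(target, o) for o in occluders), default=0.0)
    return worst >= threshold


bolt = (0.0, 0.0, 10.0, 10.0)
bracket = (5.0, 0.0, 15.0, 10.0)  # covers the right half of the bolt
print(occluded_fraction(bolt, bracket), needs_human_review(bolt, [bracket]))
```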
4. Multi-Object Attribution Confusion
The Johns Hopkins/Microsoft study illustrates a failure mode they call multi-object attribution confusion. When a VLM is given multiple documents with multiple images, it often correctly identifies all objects from all images, but misattributes which objects belong to which image. It might fuse a detail from Image 1 with a detail from Image 3, reporting a hybrid fact that doesn’t exist in any single source.
This is why m-RAG correctness saturates. The retrieval step pulls 5 images of the same component from different spec sheets. But those spec sheets have overlapping but non-identical attribute lists. The VLM mixes across them, fabricating a Frankenstein object. Standard evaluation metrics that check for keyword overlap between retrieval context and generated answer completely miss this because the keywords ARE present—just scrambled across documents.
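The gap is easy to reproduce with a toy example. The sketch below compares keyword overlap against the pooled retrieval context with overlap against the best single document. The spec strings are invented for illustration, and the tokenizer is deliberately crude.

```python
import re


def keywords(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a crude stand-in for keyword extraction."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pooled_overlap(answer: str, docs: list[str]) -> float:
    """Share of answer keywords found anywhere in the retrieved context."""
    ans = keywords(answer)
    pool = set().union(*(keywords(d) for d in docs))
    return len(ans & pool) / len(ans) if ans else 0.0


def best_single_doc_overlap(answer: str, docs: list[str]) -> float:
    """Share of answer keywords supported by the single best-matching document."""
    ans = keywords(answer)
    if not ans:
        return 0.0
    return max((len(ans & keywords(d)) / len(ans) for d in docs), default=0.0)


# Hypothetical spec sheets with overlapping but non-identical attributes.
docs = [
    "coupling model a steel housing torque 33",
    "coupling model b aluminum housing torque 45",
]
answer = "coupling model a aluminum housing torque 45"  # hybrid of both sheets

print(pooled_overlap(answer, docs))           # every keyword appears somewhere
print(best_single_doc_overlap(answer, docs))  # no single sheet supports them all
```

A pooled score of 1.0 alongside a per-document score below 1.0 is the signature of a Frankenstein answer: every keyword is real, but no single source contains the combination.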
5. Cultural and Contextual Visual Priors
The Object-Hal research uncovers a subtler issue: VLMs inject objects matching cultural or visual priors even when they’re absent. Show a VLM a photo of a parking lot in the Middle East and ask if there’s a bicycle. GPT-4o says “no,” correctly. Now ask it if there’s a car, and it says “yes, a white sedan.” But the photo shows no white sedan. The VLM is “hallucinating” based on a “parking lot = car” prior.
For enterprises, this results in phantom inventory, phantom defects, and phantom compliance failures. A VLM looks at an image of a pressure gauge with no legible reading and “reads” a level anyway, because its prior for gauge images is that gauges are always readable. It will assign a value where there is none.
3 Evaluation Blindspots and How to Fix Them
Blindspot 1: Text Metrics Don’t Catch Visual Hallucinations
RAG faithfulness metrics like RAGAS faithfulness, ARES, and RAGTruth are built on text NLI (Natural Language Inference) models that decompose generated text into atomic claims and compare them to text-only contexts. They have no access to the visual modality. A claim that a fact appears “on page 12 of the manual” can be verified if the text chunk mentions page 12. A claim that “the blue wire attaches to the red terminal” cannot be verified, because statements about color, shape, or object presence have no ground truth in a text chunk to compare against.
Fix: Adopt visual grounding evaluators. The Object-Hal paper uses CHAIR metrics, which compare the objects mentioned in generated text against the objects annotated as present in the image and flag every mention that has no ground-truth match. In enterprise pipelines, this means you need a visual factuality layer: an evaluator VLM that receives the retrieved image and the generated answer, and outputs a visual grounding score. This adds latency and cost, but if you skip it, you’re flying blind.
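As a minimal sketch of a CHAIR-style check, the function below follows the standard CHAIR definition: an instance-level score (hallucinated object mentions over all object mentions) and a sentence-level score (sentences containing at least one hallucinated object over all sentences). It assumes object extraction and synonym normalization happen upstream, and that a verified object set exists for the image. The name `chair_scores` is illustrative.

```python
def chair_scores(
    mentioned_per_sentence: list[set[str]],
    ground_truth: set[str],
) -> tuple[float, float]:
    """Return (instance-level, sentence-level) CHAIR-style hallucination scores.

    mentioned_per_sentence: objects named in each sentence of the answer
    ground_truth: objects verified as present in the retrieved image
    """
    total_mentions = sum(len(objs) for objs in mentioned_per_sentence)
    hallucinated = sum(len(objs - ground_truth) for objs in mentioned_per_sentence)
    bad_sentences = sum(1 for objs in mentioned_per_sentence if objs - ground_truth)

    chair_i = hallucinated / total_mentions if total_mentions else 0.0
    chair_s = (
        bad_sentences / len(mentioned_per_sentence) if mentioned_per_sentence else 0.0
    )
    return chair_i, chair_s


# Two-sentence answer; "gauge" is not in the verified object set.
answer_objects = [{"wrench", "bolt"}, {"coupling", "gauge"}]
verified = {"wrench", "bolt", "coupling"}
print(chair_scores(answer_objects, verified))
```

Lower is better on both scores. The instance-level number tells you how much of the answer is fabricated; the sentence-level number tells you how often a reader hits a fabricated object.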
Blindspot 2: Factual Consistency Metrics Hit a Correctness Ceiling
When the Johns Hopkins team tested discourse-level correctness of m-RAG, they found that the pipelines’ automatic factuality metric plateaued. Supplying more images did not improve correctness. Why? Because text metrics that check for n-gram overlap with source documents will confirm a hallucinated fact if a similar phrase exists anywhere in the retrieved context, even if it’s unrelated. The verification metric becomes a hazy fog rather than an impartial judge.
Fix: Multi-modal attribution. You need to move from text-only fact extraction to vision-conditioned fact extraction. The Johns Hopkins authors recommend a cross-modal grounding score: for every factual statement in the assistant output, the evaluation model must indicate a textual AND visual source component. If the statement lacks linking to a visual patch or bounding box, flag it as a potential hallucination.
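A cross-modal grounding score can be sketched as a simple bookkeeping check over extracted claims. The structure below is an assumption about how such a score might be implemented, not the authors’ code: `Claim`, `flag_ungrounded`, and `grounding_score` are hypothetical names, and the sketch assumes an upstream evaluator has already tried to attach a text chunk ID and an image region to each statement.

```python
from dataclasses import dataclass
from typing import Optional

Region = tuple[str, tuple[int, int, int, int]]  # (image id, bounding box)


@dataclass
class Claim:
    """One factual statement from the answer, with its attributed sources."""
    text: str
    text_source: Optional[str] = None      # e.g. a text chunk id
    visual_source: Optional[Region] = None  # image id plus bounding box


def flag_ungrounded(claims: list[Claim]) -> list[Claim]:
    """Claims missing either a textual or a visual source."""
    return [c for c in claims if c.text_source is None or c.visual_source is None]


def grounding_score(claims: list[Claim]) -> float:
    """Fraction of claims that carry both a textual and a visual source."""
    if not claims:
        return 0.0
    return 1.0 - len(flag_ungrounded(claims)) / len(claims)


claims = [
    Claim("Part B-7 torque is 33 ft-lbs", "chunk-12", ("fig-7", (40, 60, 120, 90))),
    Claim("The blue wire attaches to the red terminal", "chunk-3", None),
]
print(grounding_score(claims), [c.text for c in flag_ungrounded(claims)])
```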
Blindspot 3: VLM Limitations for Object Detection
VLMs are not object detectors. According to the Object-Hal benchmark, they frequently miscount objects, especially when the count exceeds 4, and they fail dramatically on small objects. In the Object-Hal study, hallucination rates for small objects (covering less than 5% of the image area) exceeded 60%.
Fix: Before routing to your chat VLM, pre-process images with a dedicated object detection layer. Use a YOLO or DETR model to identify objects, count them, and provide the count to the VLM as auxiliary text context—e.g., “Image X contains 3 objects detected: wrench, bolt, coupling.” This dramatically reduces fabricated objects because you’ve given the VLM a verified object-count scaffold. This isn’t an architectural overhaul; it’s a preprocessing step.
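The scaffold itself is only a formatting step once detections exist. The sketch below assumes a detector such as YOLO or DETR has already run and its output has been converted into the hypothetical `Detection` records shown here; no detector API is called. The confidence cutoff and the 5% small-object flag are illustrative defaults.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical, detector-agnostic record for one detected object."""
    label: str
    confidence: float
    area_fraction: float  # bounding-box area divided by image area


def build_scaffold(
    image_id: str,
    detections: list[Detection],
    min_confidence: float = 0.5,
    small_object_area: float = 0.05,
) -> str:
    """Format confident detections as auxiliary text context for the VLM."""
    kept = [d for d in detections if d.confidence >= min_confidence]
    counts = Counter(d.label for d in kept)
    listing = ", ".join(f"{n} x {label}" for label, n in sorted(counts.items()))
    small = sorted({d.label for d in kept if d.area_fraction < small_object_area})

    line = f"Image {image_id} contains {len(kept)} objects detected: {listing or 'none'}."
    if small:
        line += f" Small objects (verify before trusting): {', '.join(small)}."
    return line


detections = [
    Detection("wrench", 0.90, 0.20),
    Detection("bolt", 0.80, 0.01),
    Detection("bolt", 0.30, 0.01),  # below the confidence cutoff, dropped
    Detection("coupling", 0.95, 0.30),
]
scaffold = build_scaffold("X", detections)
print(scaffold)
```

Prepending the resulting string to the VLM prompt gives the model a verified count to anchor on, and the small-object note marks the cases where the benchmark suggests the VLM is least reliable.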
How to Audit Your Multimodal RAG for Object Hallucinations
For a CTO or enterprise architect, the question becomes: Do you have an object hallucination problem? Here’s a practical audit protocol derived from the Object-Hal and m-RAG papers:
- Curate 50 image queries from your production logs—but select queries where the answer involves an object attribute (color, count, material, spatial relationship), not just an entity name extraction.
- Generate VLM answers via your RAG pipeline, and store the retrieved images alongside answers.
- Crowdsource human annotation to mark every noun phrase in every answer as visually grounded, visually contradictory, or visually absent from the retrieved image. Use three annotators, majority vote. This gives you a real hallucination rate.
- Calculate Object-Hal’s CHAIR metric on your annotated data—the ratio of hallucinated objects to total mentioned objects.
If your hallucination rate is above 25% (which it almost certainly is), you need to immediately implement the fixes in the previous section: visual factuality evaluator layer, cross-modal attribution, and pre-processing object detection.
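The scoring half of the audit can be sketched in a few lines. The code below assumes each noun phrase carries one vote per annotator using the three labels from the protocol, collapses “visually contradictory” and “visually absent” into a single hallucinated class before taking the majority, and compares the result to the 25% threshold. The function names and label strings are illustrative.

```python
VALID_LABELS = {"grounded", "contradictory", "absent"}
HALLUCINATED = {"contradictory", "absent"}


def is_hallucinated(votes: list[str]) -> bool:
    """Majority vote: True when most annotators marked the phrase as not grounded."""
    if not votes:
        raise ValueError("each noun phrase needs at least one vote")
    unknown = set(votes) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    bad_votes = sum(v in HALLUCINATED for v in votes)
    return bad_votes > len(votes) / 2


def hallucination_rate(votes_per_phrase: list[list[str]]) -> float:
    """Hallucinated noun phrases divided by all annotated noun phrases."""
    if not votes_per_phrase:
        raise ValueError("no annotations supplied")
    flags = [is_hallucinated(v) for v in votes_per_phrase]
    return sum(flags) / len(flags)


# Four noun phrases, three annotators each (made-up votes).
annotations = [
    ["grounded", "grounded", "grounded"],
    ["absent", "contradictory", "grounded"],
    ["grounded", "grounded", "absent"],
    ["absent", "absent", "absent"],
]
rate = hallucination_rate(annotations)
print(f"hallucination rate: {rate:.0%}, above threshold: {rate > 0.25}")
```

Collapsing the two failure labels before voting means three annotators always produce a majority, which avoids three-way ties that would otherwise need adjudication.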
The Bottom Line
Multimodal RAG is necessary. But piping images into VLMs without object-factuality guardrails is how you silently erode trust in your assistant. The Nanyang/Object-Hal study demonstrates that object hallucination is fundamentally unsolved in current VLMs. And the Johns Hopkins/m-RAG paper confirms that correctness plateaus, so adding domain data can’t fix a broken evaluation framework. If you’re deploying m-RAG in 2026 without visual-specific verification, your system is outputting factually wrong object information, and you can’t detect it. Start with the audit. Implement attribution. Then deploy. The plant floor is counting on it.
Sources:
– Object HalBench: Revisiting Hallucination in Large Vision-Language Models
– Multimodal Retrieval Augmented Generation: A Benchmark, Robustness and Accuracy, Johns Hopkins/Microsoft
– Automated M-RAG Evaluation Metrics and Fine-Grained Error Analysis



