Building a multi-modal RAG system with vision and text

Tom Andersson
Feb 15, 2026

I've been building a RAG system that handles both text and images for a manufacturing client. Their documentation includes diagrams, flowcharts, and photos alongside text.

Architecture

1. Text chunks → text-embedding-3-large → vector store
2. Images → GPT-4o vision (generate descriptions) → embed descriptions → same vector store
3. At query time: retrieve relevant chunks (text + image descriptions), pass both to GPT-4o with vision
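The key structural idea in the steps above is that text chunks and image descriptions share one vector store, so they compete in the same retrieval. Here is a minimal, self-contained sketch of that layout; the hash-based `embed` function is a stand-in for the real text-embedding-3-large call, and all names and sample strings are my own illustrations, not from the client's system:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in embedding: deterministic bag-of-words hashing.
    In the real pipeline this would call text-embedding-3-large."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """One store for both text chunks and image descriptions."""
    def __init__(self):
        self.items = []  # (embedding, payload) pairs

    def add(self, text: str, kind: str, source: str):
        self.items.append((embed(text), {"text": text, "kind": kind, "source": source}))

    def search(self, query: str, k: int = 3):
        q = embed(query)
        scored = sorted(
            self.items,
            key=lambda item: -sum(a * b for a, b in zip(q, item[0])),  # cosine (unit vectors)
        )
        return [payload for _, payload in scored[:k]]

store = VectorStore()
# Step 1: text chunks go straight into the store.
store.add("The hydraulic press cycle takes 12 seconds.", "text", "manual.pdf#p4")
# Step 2: images enter as generated descriptions (from GPT-4o vision).
store.add("Flowchart showing the press safety interlock sequence.", "image", "fig3.png")

# Step 3: at query time, both kinds are retrieved from the same index.
hits = store.search("press safety interlock")
```

Because the image entry is stored as text, nothing downstream needs to special-case modality until the final generation call, where the retrieved `source` can be used to re-attach the original image for GPT-4o.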

The image description step is key. Rather than embedding a single bare caption, I generate three components for each image:

  • A literal description of what's in the image
  • The context/purpose of the image
  • Any text visible in the image (OCR-like)

Retrieval accuracy improved by 23% when we added image understanding, compared with text-only RAG.
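One way to handle the three components is to merge them into a single string before embedding, so a query can match on the literal content, the context, or the OCR'd text. The prompt wording and the `merge_description` helper below are my own sketch, not the post's actual implementation:

```python
# Hypothetical prompt for the vision model; section names are assumptions.
DESCRIPTION_PROMPT = """Describe this image for a retrieval system. Return three sections:
LITERAL: what is physically shown in the image
CONTEXT: the purpose of the image within the surrounding documentation
TEXT: any text visible in the image, transcribed verbatim
"""

def merge_description(literal: str, context: str, ocr_text: str) -> str:
    """Combine the three parts into one chunk for embedding.
    Keeping them together lets a query match on any component."""
    parts = [literal.strip(), context.strip()]
    if ocr_text.strip():
        parts.append("Visible text: " + ocr_text.strip())
    return "\n".join(parts)

desc = merge_description(
    "Flowchart with five boxes connected by arrows.",
    "Shows the safety interlock sequence for the hydraulic press.",
    "START -> GUARD CLOSED -> PRESSURE OK -> CYCLE",
)
```

An alternative design is to embed each component separately and link them to the same image ID, which trades index size for finer-grained matching.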

Still early days for multi-modal RAG, but the results are promising.
