Building a multi-modal RAG system with vision and text
Tom Andersson · Feb 15, 2026
I've been building a RAG system that handles both text and images for a manufacturing client. Their documentation includes diagrams, flowcharts, and photos alongside text.
Architecture
1. Text chunks → text-embedding-3-large → vector store
2. Images → GPT-4o vision (generate descriptions) → embed descriptions → same vector store
3. At query time: retrieve relevant chunks (text + image descriptions), pass both to GPT-4o with vision
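The three steps above can be sketched end to end. This is a minimal, self-contained illustration of the flow, not the client system: `embed()` stands in for a text-embedding-3-large call and `describe_image()` for a GPT-4o vision captioning call, both stubbed so the example runs offline, and the toy in-memory `VectorStore`, the sample chunk text, and the file paths are all made up for demonstration.

```python
import math

def embed(text: str) -> dict[str, float]:
    # Stand-in for text-embedding-3-large: a sparse bag-of-words vector.
    # In the real pipeline this would be an API call returning a dense vector.
    vec: dict[str, float] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def describe_image(image_path: str) -> str:
    # Stand-in for a GPT-4o vision call that captions/describes the image.
    return f"diagram from {image_path}: assembly line flowchart"

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """Toy in-memory store; one collection holds text and image entries."""
    def __init__(self) -> None:
        self.items: list[tuple[dict[str, float], dict]] = []

    def add(self, text: str, payload: dict) -> None:
        self.items.append((embed(text), payload))

    def search(self, query: str, k: int = 3) -> list[dict]:
        qv = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[0]), reverse=True)
        return [payload for _, payload in ranked[:k]]

store = VectorStore()

# Step 1: text chunks go straight into the store.
store.add("Torque the spindle bolts to 45 Nm before calibration.",
          {"kind": "text", "source": "manual.pdf"})

# Step 2: images are described first; the *description* is what gets embedded.
desc = describe_image("figures/line-layout.png")
store.add(desc, {"kind": "image", "source": "figures/line-layout.png",
                 "description": desc})

# Step 3: at query time, retrieve text and image-description chunks together;
# the real system then passes both (plus original images) to GPT-4o with vision.
hits = store.search("assembly line flowchart", k=2)
print([h["kind"] for h in hits])
```

Because image descriptions live in the same collection as text chunks, one retrieval call surfaces both kinds of evidence; the query above matches the image's generated description ahead of the unrelated text chunk.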
The image description step is key. Instead of embedding just a bare caption, I generate a richer description for each image before embedding it.
Retrieval accuracy improved by 23% over text-only RAG once we added image understanding.
Still early days for multi-modal RAG but the results are promising.