Building a multi-modal RAG system with vision and text

Tom Andersson
Feb 15, 2026

I've been building a RAG system that handles both text and images for a manufacturing client. Their documentation includes diagrams, flowcharts, and photos alongside text.

Architecture

1. Text chunks → text-embedding-3-large → vector store
2. Images → GPT-4o vision (generate descriptions) → embed descriptions → same vector store
3. At query time: retrieve relevant chunks (text + image descriptions), pass both to GPT-4o with vision
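The key structural idea in the steps above is that text chunks and image descriptions share one vector store, so they compete in the same retrieval. Here is a minimal, self-contained sketch of that layout; the hash-based `embed` function is a stand-in for the real text-embedding-3-large call, and all names and sample strings are my own illustrations, not from the client's system:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in embedding: deterministic bag-of-words hashing.
    In the real pipeline this would call text-embedding-3-large."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """One store for both text chunks and image descriptions."""
    def __init__(self):
        self.items = []  # (embedding, payload) pairs

    def add(self, text: str, kind: str, source: str):
        self.items.append((embed(text), {"text": text, "kind": kind, "source": source}))

    def search(self, query: str, k: int = 3):
        q = embed(query)
        scored = sorted(
            self.items,
            key=lambda item: -sum(a * b for a, b in zip(q, item[0])),  # cosine (unit vectors)
        )
        return [payload for _, payload in scored[:k]]

store = VectorStore()
# Step 1: text chunks go straight into the store.
store.add("The hydraulic press cycle takes 12 seconds.", "text", "manual.pdf#p4")
# Step 2: images enter as generated descriptions (from GPT-4o vision).
store.add("Flowchart showing the press safety interlock sequence.", "image", "fig3.png")

# Step 3: at query time, both kinds are retrieved from the same index.
hits = store.search("press safety interlock")
```

Because the image entry is stored as text, nothing downstream needs to special-case modality until the final generation call, where the retrieved `source` can be used to re-attach the original image for GPT-4o.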

The image description step is key. Rather than embedding a single bare caption, I generate three components for each image:

  • A literal description of what's in the image
  • The context/purpose of the image
  • Any text visible in the image (OCR-like)

Retrieval accuracy improved by 23% when we added image understanding, compared with text-only RAG.
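One way to handle the three components is to merge them into a single string before embedding, so a query can match on the literal content, the context, or the OCR'd text. The prompt wording and the `merge_description` helper below are my own sketch, not the post's actual implementation:

```python
# Hypothetical prompt for the vision model; section names are assumptions.
DESCRIPTION_PROMPT = """Describe this image for a retrieval system. Return three sections:
LITERAL: what is physically shown in the image
CONTEXT: the purpose of the image within the surrounding documentation
TEXT: any text visible in the image, transcribed verbatim
"""

def merge_description(literal: str, context: str, ocr_text: str) -> str:
    """Combine the three parts into one chunk for embedding.
    Keeping them together lets a query match on any component."""
    parts = [literal.strip(), context.strip()]
    if ocr_text.strip():
        parts.append("Visible text: " + ocr_text.strip())
    return "\n".join(parts)

desc = merge_description(
    "Flowchart with five boxes connected by arrows.",
    "Shows the safety interlock sequence for the hydraulic press.",
    "START -> GUARD CLOSED -> PRESSURE OK -> CYCLE",
)
```

An alternative design is to embed each component separately and link them to the same image ID, which trades index size for finer-grained matching.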

Still early days for multi-modal RAG, but the results are promising.
