Cost analysis: Assistants API vs. building your own RAG pipeline
I've run both the Assistants API with file search and a custom RAG pipeline (LangChain + Pinecone + GPT-4o) for the same use case: customer support over 200 product docs.
Cost comparison (per 1000 queries)
| Component | Assistants API | Custom RAG |
|-----------|---------------|------------|
| LLM calls | $12.50 | $8.20 |
| File search/embeddings | $2.10 | $0.40 (Pinecone) |
| Storage | $0.10/GB/day | $0.08/GB/month |
| Dev time | 2 days | 3 weeks |
Going by the table, the Assistants API is roughly 70% more expensive per query ($14.60 vs. $8.60 per 1,000 queries), but it saved us weeks of development. For our volume (~5K queries/day), the custom pipeline breaks even in about 3 months.
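To make the break-even math concrete, here's a back-of-the-envelope sketch using the per-1,000-query figures from the table. The extra build cost passed to `break_even_days` is my own illustrative assumption; the thread doesn't state one:

```python
# Per-query cost figures taken from the comparison table above.
assistants_per_1k = 12.50 + 2.10   # LLM calls + file search, per 1,000 queries
custom_per_1k = 8.20 + 0.40        # LLM calls + Pinecone, per 1,000 queries
queries_per_day = 5_000            # ~5K queries/day, as stated in the post

# Savings from running the custom pipeline instead of the Assistants API.
daily_savings = (assistants_per_1k - custom_per_1k) / 1_000 * queries_per_day

def break_even_days(extra_build_cost: float) -> float:
    """Days until daily savings cover the extra up-front engineering cost.

    extra_build_cost is an assumed figure, not something given in the thread.
    """
    return extra_build_cost / daily_savings

# At $30/day in savings, an assumed ~$2,700 of extra build cost
# pays back in about 90 days, consistent with the ~3-month claim.
```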
My recommendation: start with Assistants API, migrate to custom RAG once you hit scale.
Great benchmarking! For semantic chunking, what library are you using? I've been experimenting with LlamaIndex's SemanticSplitter and the results are promising.
I used a custom implementation based on sentence-transformers embeddings. The idea is to compute cosine similarity between adjacent sentences and split where similarity drops below a threshold.
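A minimal sketch of that split-on-similarity-drop idea, assuming sentence embeddings are precomputed (e.g. with sentence-transformers); the `semantic_split` function and the 0.5 threshold are illustrative, not the poster's actual code:

```python
import numpy as np

def semantic_split(sentences, embeddings, threshold=0.5):
    """Group sentences into chunks, starting a new chunk wherever the
    cosine similarity between adjacent sentence embeddings drops below
    the threshold."""
    # Normalize rows so the dot product of adjacent rows is cosine similarity.
    emb = np.asarray(embeddings, dtype=np.float64)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(emb[i - 1] @ emb[i])
        if sim < threshold:
            # Similarity dropped: close the current chunk, start a new one.
            chunks.append(" ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

In practice you'd feed it `model.encode(sentences)` from a `SentenceTransformer`, and the threshold needs tuning per corpus (a percentile of observed adjacent similarities is a common trick).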
LlamaIndex's implementation is similar but more polished. Definitely recommend it for production use.
Have you tested with different embedding dimensions? I found that using 256 dims for chunking decisions (cheaper and faster) and 3072 for the final embeddings works well.
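One way to get the cheap low-dimensional vectors is to truncate and re-normalize the full embedding, a sketch that assumes the model was trained Matryoshka-style so the leading dimensions carry most of the signal (true of e.g. OpenAI's text-embedding-3 models, which expose this via a `dimensions` parameter); `truncate_embedding` is an illustrative helper, not a library API:

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length.

    Only meaningful for embeddings trained with a Matryoshka-style
    objective, where leading dimensions are the most informative --
    truncating an arbitrary model's vectors this way degrades quality.
    """
    v = np.asarray(vec[:dims], dtype=np.float64)
    return v / np.linalg.norm(v)

# e.g. use 256-dim truncations for fast chunk-boundary decisions,
# and the full 3072-dim vectors for the final index.
```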