Best practices for chunking strategy with text-embedding-3-large?
I've been extensively testing different chunking strategies with text-embedding-3-large for RAG and wanted to share my findings.
Strategies tested
1. Fixed size (512 tokens): simple but breaks mid-sentence
2. Sentence-based: better boundaries but uneven chunk sizes
3. Recursive character splitting (LangChain default): good balance
4. Semantic chunking: best quality but slowest
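For anyone unfamiliar with the semantic variant: it embeds adjacent sentences and starts a new chunk wherever cosine similarity between neighbours drops below a threshold. Here's a minimal sketch of the idea — the `embed` function below is a toy bag-of-words stand-in (not what I actually ran; swap in real embeddings like text-embedding-3-large), and the 0.3 threshold is an arbitrary illustration value:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in embedding: bag-of-words token counts.
    # Replace with a real embedding model in practice.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(text, threshold=0.3):
    # Split into sentences, then break into a new chunk wherever
    # similarity between neighbouring sentences drops below threshold.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

The expensive part in a real pipeline is the embedding call per sentence, which is where the preprocessing overhead in my results comes from.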
Results on retrieval accuracy (MRR@10)
| Strategy  | MRR@10 | Avg chunk size |
|-----------|--------|----------------|
| Fixed 512 | 0.72   | 512 tokens     |
| Sentence  | 0.78   | 180 tokens     |
| Recursive | 0.81   | 400 tokens     |
| Semantic  | 0.86   | 350 tokens     |
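For reference, MRR@10 is just the mean over queries of 1/rank of the first relevant chunk, scoring 0 when nothing relevant appears in the top 10. A quick sketch (the example ranks are made up, not from my eval set):

```python
def mrr_at_k(first_relevant_ranks, k=10):
    # Each entry is the 1-based rank of the first relevant hit
    # for one query; None means no relevant chunk was retrieved.
    scores = [
        1.0 / r if r is not None and r <= k else 0.0
        for r in first_relevant_ranks
    ]
    return sum(scores) / len(scores)

# Hypothetical ranks for four queries:
print(mrr_at_k([1, 2, None, 12]))  # (1 + 0.5 + 0 + 0) / 4 = 0.375
```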
Semantic chunking wins but adds significant preprocessing time. For most use cases, recursive splitting with 400-token chunks and 50-token overlap is the sweet spot.
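The recursive approach is easy to reimplement if you don't want the LangChain dependency. A character-based sketch (a token-based length function would be used in practice, and the overlap step is omitted for brevity):

```python
def recursive_split(text, max_len=400, separators=("\n\n", "\n", ". ", " ")):
    # Split on the coarsest separator present, recurse into any
    # oversized piece with finer separators, then greedily merge
    # neighbouring pieces back up to max_len.
    if len(text) <= max_len:
        return [text]
    sep = next((s for s in separators if s in text), None)
    if sep is None:
        return [text]  # nothing finer to split on; keep oversized chunk
    finer = separators[separators.index(sep) + 1:]
    pieces = []
    for piece in text.split(sep):
        if len(piece) > max_len:
            pieces.extend(recursive_split(piece, max_len, finer))
        else:
            pieces.append(piece)
    merged, current = [], pieces[0]
    for piece in pieces[1:]:
        if len(current) + len(sep) + len(piece) <= max_len:
            current = current + sep + piece
        else:
            merged.append(current)
            current = piece
    merged.append(current)
    return merged
```

This keeps paragraph and sentence boundaries intact wherever possible, which is why it retrieves noticeably better than blind fixed-size windows.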
Anyone tested with different embedding dimensions? I'm using the default 3072.
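On dimensions: the embeddings endpoint accepts a `dimensions` parameter for the text-embedding-3 models, and OpenAI's docs note that you can also shorten the full 3072-dim vector yourself by truncating and re-normalizing. A sketch of the manual route (the input vector below is made up for illustration):

```python
import math

def shorten_embedding(vec, dims):
    # Truncate to the first `dims` components, then L2-normalize
    # so cosine similarity still behaves as expected.
    cut = vec[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]
```

Would be curious whether retrieval accuracy holds up at 1024 or 256 dims on your data.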