Best practices for chunking strategy with text-embedding-3-large?
I've been extensively testing different chunking strategies with text-embedding-3-large for RAG and wanted to share my findings.
Strategies tested
1. Fixed size (512 tokens): simple but breaks mid-sentence
2. Sentence-based: better boundaries but uneven chunk sizes
3. Recursive character splitting (LangChain default): good balance
4. Semantic chunking: best quality but slowest
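For anyone unfamiliar with the semantic variant: it embeds adjacent sentences and starts a new chunk wherever cosine similarity between neighbours drops below a threshold. Here's a minimal sketch of the idea — the `embed` function below is a toy bag-of-words stand-in (not what I actually ran; swap in real embeddings like text-embedding-3-large), and the 0.3 threshold is an arbitrary illustration value:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in embedding: bag-of-words token counts.
    # Replace with a real embedding model in practice.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(text, threshold=0.3):
    # Split into sentences, then break into a new chunk wherever
    # similarity between neighbouring sentences drops below threshold.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

The expensive part in a real pipeline is the embedding call per sentence, which is where the preprocessing overhead in my results comes from.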
Results on retrieval accuracy (MRR@10)
| Strategy  | MRR@10 | Avg chunk size |
|-----------|--------|----------------|
| Fixed 512 | 0.72   | 512 tokens     |
| Sentence  | 0.78   | 180 tokens     |
| Recursive | 0.81   | 400 tokens     |
| Semantic  | 0.86   | 350 tokens     |
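For reference, MRR@10 is just the mean over queries of 1/rank of the first relevant chunk, scoring 0 when nothing relevant appears in the top 10. A quick sketch (the example ranks are made up, not from my eval set):

```python
def mrr_at_k(first_relevant_ranks, k=10):
    # Each entry is the 1-based rank of the first relevant hit
    # for one query; None means no relevant chunk was retrieved.
    scores = [
        1.0 / r if r is not None and r <= k else 0.0
        for r in first_relevant_ranks
    ]
    return sum(scores) / len(scores)

# Hypothetical ranks for four queries:
print(mrr_at_k([1, 2, None, 12]))  # (1 + 0.5 + 0 + 0) / 4 = 0.375
```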
Semantic chunking wins but adds significant preprocessing time. For most use cases, recursive splitting with 400-token chunks and 50-token overlap is the sweet spot.
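The recursive approach is easy to reimplement if you don't want the LangChain dependency. A character-based sketch (a token-based length function would be used in practice, and the overlap step is omitted for brevity):

```python
def recursive_split(text, max_len=400, separators=("\n\n", "\n", ". ", " ")):
    # Split on the coarsest separator present, recurse into any
    # oversized piece with finer separators, then greedily merge
    # neighbouring pieces back up to max_len.
    if len(text) <= max_len:
        return [text]
    sep = next((s for s in separators if s in text), None)
    if sep is None:
        return [text]  # nothing finer to split on; keep oversized chunk
    finer = separators[separators.index(sep) + 1:]
    pieces = []
    for piece in text.split(sep):
        if len(piece) > max_len:
            pieces.extend(recursive_split(piece, max_len, finer))
        else:
            pieces.append(piece)
    merged, current = [], pieces[0]
    for piece in pieces[1:]:
        if len(current) + len(sep) + len(piece) <= max_len:
            current = current + sep + piece
        else:
            merged.append(current)
            current = piece
    merged.append(current)
    return merged
```

This keeps paragraph and sentence boundaries intact wherever possible, which is why it retrieves noticeably better than blind fixed-size windows.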
Anyone tested with different embedding dimensions? I'm using the default 3072.
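On dimensions: the embeddings endpoint accepts a `dimensions` parameter for the text-embedding-3 models, and OpenAI's docs note that you can also shorten the full 3072-dim vector yourself by truncating and re-normalizing. A sketch of the manual route (the input vector below is made up for illustration):

```python
import math

def shorten_embedding(vec, dims):
    # Truncate to the first `dims` components, then L2-normalize
    # so cosine similarity still behaves as expected.
    cut = vec[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]
```

Would be curious whether retrieval accuracy holds up at 1024 or 256 dims on your data.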