Best practices for chunking strategy with text-embedding-3-large?

Raj Krishnan
May 10, 2023

I've been extensively testing different chunking strategies with text-embedding-3-large for RAG and wanted to share my findings.

Strategies tested

1. Fixed size (512 tokens): Simple but breaks mid-sentence
2. Sentence-based: Better boundaries but uneven chunk sizes
3. Recursive character splitting (LangChain default): Good balance
4. Semantic chunking: Best quality but slowest
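To make the fixed-size baseline concrete, here's a minimal sketch (function name is mine, not from any library) of a sliding-window chunker over a pre-tokenized sequence. It shows exactly why this strategy cuts sentences in half: boundaries fall at token counts, not at punctuation.

```python
def fixed_size_chunks(tokens, size=512, overlap=0):
    """Naive fixed-size chunking over a pre-tokenized sequence.
    Boundaries ignore sentence structure entirely, which is why
    this baseline tends to break mid-sentence."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

With `overlap=0` and 1000 tokens you get one full 512-token chunk and a 488-token remainder; adding overlap trades storage for fewer boundary cuts.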

Results on retrieval accuracy (MRR@10)

| Strategy | MRR@10 | Avg chunk size |
|-----------|--------|----------------|
| Fixed 512 | 0.72 | 512 tokens |
| Sentence | 0.78 | 180 tokens |
| Recursive | 0.81 | 400 tokens |
| Semantic | 0.86 | 350 tokens |

Semantic chunking wins but adds significant preprocessing time. For most use cases, recursive splitting with 400-token chunks and 50-token overlap is the sweet spot.
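If you want to reproduce the recursive approach without pulling in LangChain, here's a rough pure-Python sketch of the idea: split on the coarsest separator available, accumulate pieces up to a size limit with a carried-over overlap tail, and recurse with finer separators on anything still too large. Sizes here are in characters as a proxy for tokens (~4 chars/token, so 1600 chars ≈ 400 tokens); the real `RecursiveCharacterTextSplitter` differs in details.

```python
def recursive_split(text, separators=("\n\n", "\n", ". ", " "),
                    max_len=1600, overlap=200):
    """Illustrative recursive character splitting (not LangChain's code).
    max_len/overlap are in characters; ~4 chars/token means the defaults
    approximate 400-token chunks with 50-token overlap."""
    if len(text) <= max_len:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            parts = text.split(sep)
            chunks, current = [], ""
            for part in parts:
                piece = part + sep
                if current and len(current) + len(piece) > max_len:
                    chunks.append(current.strip())
                    # carry a tail of the previous chunk forward as overlap
                    current = current[-overlap:] + piece
                else:
                    current += piece
            if current.strip():
                chunks.append(current.strip())
            # recurse with finer separators on any chunk still too large
            return [c for chunk in chunks
                    for c in recursive_split(chunk, separators[i + 1:],
                                             max_len, overlap)]
    # no separator matched: hard cut as a last resort
    return [text[j:j + max_len] for j in range(0, len(text), max_len - overlap)]
```

The paragraph-first separator order is the point: chunks break at paragraph boundaries when they can, sentences when they must, and only fall back to hard cuts for pathological input.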

Anyone tested with different embedding dimensions? I'm using the default 3072.
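On dimensions: as I understand OpenAI's docs, the text-embedding-3 models were trained so that embeddings can be shortened either by passing the `dimensions` parameter at request time or by truncating the full vector and L2-renormalizing it yourself. A quick sketch of the truncate-and-renormalize step (pure Python, function name is mine):

```python
import math

def shorten_embedding(vec, dims):
    """Truncate an embedding to its first `dims` components and
    L2-renormalize, the documented way to shrink text-embedding-3
    vectors when you can't re-request with the `dimensions` parameter."""
    cut = vec[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]
```

Cosine similarity only needs unit-norm vectors, so renormalizing after the cut keeps downstream retrieval code unchanged.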

Mia Johnson

For brand consistency, I've found that maintaining a detailed style guide document and including it verbatim in every prompt helps a lot. Also, generating 4 variations and picking the best one is more reliable than trying to get one perfect output.
