Comparing GPT-4o vs GPT-4 Turbo for code generation tasks

Derek Frost · May 20, 2023

I've been running benchmarks comparing GPT-4o and GPT-4 Turbo on code generation tasks. Here are my findings from 500 test cases across Python, TypeScript, and Rust.

Results

| Metric | GPT-4 Turbo | GPT-4o |
|--------|-------------|--------|
| Pass@1 | 78.2% | 81.5% |
| Avg tokens | 342 | 287 |
| Avg latency | 8.2s | 3.1s |
| Cost per task | $0.034 | $0.018 |

GPT-4o is faster, cheaper, and slightly more accurate on my benchmark suite. The latency improvement alone makes it worth switching.
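For anyone reproducing these numbers, a minimal sketch of how a single-sample Pass@1 score like the ones in the table is computed: one generation per task, counted as a pass if it clears that task's tests. The function name and shape here are my own illustration, not part of the benchmark suite.

```python
def pass_at_1(results: list[bool]) -> float:
    """results[i] is True if the single generated solution for task i passed its tests."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Example: 3 passing tasks out of 4 gives a Pass@1 of 0.75
score = pass_at_1([True, True, False, True])
print(score)  # 0.75
```

With 500 test cases, a score of 81.5% corresponds to roughly 407 or 408 passing tasks.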

However, GPT-4 Turbo still edges out GPT-4o on complex multi-file refactoring tasks, where its deeper reasoning seems to help.

Happy to share my benchmark suite if anyone's interested.

Logan K. (Staff) · Accepted Answer · May 22

max_tokens is the older parameter name; max_completion_tokens is the newer one. They behave identically for Chat Completions. For the o1 model family, only max_completion_tokens is accepted, and its budget also counts reasoning tokens. Recommendation: use max_completion_tokens going forward.
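A quick sketch of the two request shapes side by side, written as plain Chat Completions payload dicts so it works with any client. The model names and token limit are placeholders, not recommendations.

```python
# Older parameter name, still accepted by most chat models:
legacy_request = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Write a haiku."}],
    "max_tokens": 256,
}

# Newer parameter name; required for the o1 family, where the
# budget also covers reasoning tokens, not just visible output:
current_request = {
    "model": "o1",
    "messages": [{"role": "user", "content": "Write a haiku."}],
    "max_completion_tokens": 256,
}
```

Sending `max_tokens` to an o1 model is rejected, so switching to `max_completion_tokens` everywhere avoids maintaining two code paths.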
