Comparing GPT-4o vs GPT-4 Turbo for code generation tasks
I've been running benchmarks comparing GPT-4o and GPT-4 Turbo on code generation tasks. Here are my findings from 500 test cases across Python, TypeScript, and Rust.
Results
| Metric | GPT-4 Turbo | GPT-4o |
|--------|-------------|--------|
| Pass@1 | 78.2% | 81.5% |
| Avg tokens | 342 | 287 |
| Avg latency | 8.2s | 3.1s |
| Cost per task | $0.034 | $0.018 |
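For context on the Pass@1 column: with one sample per task, it's just the fraction of tasks whose single generated solution passes that task's unit tests. A minimal Python sketch of how a harness like mine scores tasks (helper names are mine, not from the suite):

```python
import pathlib
import subprocess
import sys
import tempfile

def run_candidate(code: str, tests: str) -> bool:
    """Run a generated solution against its unit tests in a subprocess;
    passing means the script exits with code 0. (Sketch only: a real
    harness would sandbox and resource-limit the child process.)"""
    with tempfile.TemporaryDirectory() as tmp:
        script = pathlib.Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + tests)
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            timeout=30,
        )
        return proc.returncode == 0

def pass_at_1(outcomes: list[bool]) -> float:
    """With one sample per task, Pass@1 is simply the pass rate."""
    return sum(outcomes) / len(outcomes)
```

So a suite where 4 of 5 candidates pass their tests scores `pass_at_1 = 0.8`.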
GPT-4o is faster, cheaper, and slightly more accurate on my benchmark suite. The latency improvement alone makes it worth switching.
However, GPT-4 Turbo still has the edge on complex multi-file refactoring tasks; its deeper reasoning seems to help there.
Happy to share my benchmark suite if anyone's interested.
`max_tokens` is the older parameter name and `max_completion_tokens` is the newer one; for Chat Completions they do the same thing. The exception is the o1 model family, which accepts only `max_completion_tokens` (and counts reasoning tokens against it). Recommendation: use `max_completion_tokens` going forward.
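For anyone migrating, the change is just the keyword name. A small sketch that assembles request kwargs (the `build_request` helper is hypothetical, not part of the SDK; the `max_completion_tokens` parameter itself is real):

```python
from typing import Any

def build_request(model: str, messages: list[dict[str, str]], cap: int) -> dict[str, Any]:
    """Assemble kwargs for client.chat.completions.create, using the newer
    max_completion_tokens name, which the o1 family requires and other
    chat models also accept. (Hypothetical helper for illustration.)"""
    return {
        "model": model,
        "messages": messages,
        "max_completion_tokens": cap,  # replaces the older max_tokens
    }

req = build_request("o1-mini", [{"role": "user", "content": "Say hi"}], 512)
# Then splat into the real call: client.chat.completions.create(**req)
```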