OpenAI Developer Community

DPO fine-tuning now available - first impressions

OpenAI recently added Direct Preference Optimization (DPO) to the fine-tuning API. I've been testing it for preference alignment, and here are my first impressions.

Data format

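Instead of (input, output) pairs, you provide (input, chosen, rejected) triples. Each training example is one line of a JSONL file; it is pretty-printed here for readability. Note that the prompt messages are nested under input.messages:

{
  "input": {
    "messages": [{"role": "user", "content": "Explain quantum computing"}]
  },
  "preferred_output": [{"role": "assistant", "content": "Clear, concise explanation..."}],
  "non_preferred_output": [{"role": "assistant", "content": "Overly verbose, inaccurate..."}]
}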
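
To make the format concrete, here is a minimal sketch of how a training file could be assembled and sanity-checked before upload. The helper names (make_example, validate_example, write_jsonl) are my own and not part of any SDK, and the checks are only the ones I found useful, not an official validator. The sketch assumes the prompt goes under input.messages, which is how I read the API reference.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"input", "preferred_output", "non_preferred_output"}


def make_example(prompt: str, chosen: str, rejected: str) -> dict:
    """Build one preference record from a user prompt and two candidate replies."""
    return {
        # Assumption: prompt messages are nested under input.messages.
        "input": {"messages": [{"role": "user", "content": prompt}]},
        "preferred_output": [{"role": "assistant", "content": chosen}],
        "non_preferred_output": [{"role": "assistant", "content": rejected}],
    }


def validate_example(example: dict) -> None:
    """Raise ValueError if a record is missing a field or is obviously malformed."""
    missing = REQUIRED_KEYS - example.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    prompt = example["input"]
    if not isinstance(prompt, dict) or not prompt.get("messages"):
        raise ValueError("input.messages must be a non-empty list")
    for key in ("preferred_output", "non_preferred_output"):
        messages = example[key]
        if not messages or messages[-1].get("role") != "assistant":
            raise ValueError(f"{key} must end with an assistant message")
    # A pair with identical outputs carries no preference signal.
    if example["preferred_output"] == example["non_preferred_output"]:
        raise ValueError("preferred and non-preferred outputs are identical")


def write_jsonl(examples: list[dict], path: str) -> int:
    """Validate records and write one JSON object per line; return the count."""
    with Path(path).open("w", encoding="utf-8") as f:
        for example in examples:
            validate_example(example)
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
    return len(examples)
```

Calling write_jsonl([make_example(prompt, good, bad), ...], "dpo_train.jsonl") produces a file with one example per line, which can then be uploaded like any other fine-tuning file.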

In my early tests, DPO has been particularly effective for:

Tone and style alignment

Reducing verbosity

Following specific formatting preferences

It has been less effective for:

Teaching new knowledge

Improving factual accuracy
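
For anyone who wants to try this on a tone or formatting use case, here is a rough sketch of how a DPO job could be started with the official openai Python package. The build_dpo_job_request and submit helpers are hypothetical names of mine; the file ID and model name in the usage note are placeholders, and beta=0.1 is only an illustrative starting value, not a recommendation. As I understand it, a higher beta keeps the tuned model closer to its original behavior.

```python
def build_dpo_job_request(training_file_id: str, model: str, beta: float = 0.1) -> dict:
    """Assemble keyword arguments for a fine-tuning job that uses the DPO method."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return {
        "training_file": training_file_id,
        "model": model,
        # The method block selects DPO in place of the default supervised training.
        "method": {
            "type": "dpo",
            "dpo": {"hyperparameters": {"beta": beta}},
        },
    }


def submit(request: dict):
    """Send the request; needs the openai package and OPENAI_API_KEY set."""
    from openai import OpenAI  # imported lazily so the builder works without it

    client = OpenAI()
    return client.fine_tuning.jobs.create(**request)
```

Usage would look like submit(build_dpo_job_request("file-abc123", "gpt-4o-2024-08-06")), with the file ID replaced by the one returned when the JSONL training file is uploaded and the model replaced by one that supports DPO.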
