GPT-4o vision: Extracting tables from images accurately
I'm trying to use GPT-4o's vision capabilities to extract structured data from photos of printed tables (invoices, receipts, etc.). Accuracy is decent, but not reliable enough for production.
My approach:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Encode the table photo as base64 for a data URL
with open("table.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all rows from this table as JSON"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
    response_format={"type": "json_object"},
)
```
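The reply comes back as a JSON string in the message content; I pull the rows out with a minimal helper like this (the "rows" key is a hypothetical schema I prompt for, not something the API guarantees):

```python
import json

def parse_rows(response_text: str) -> list:
    """Parse the model's JSON reply and extract the row list (hypothetical "rows" key)."""
    data = json.loads(response_text)
    return data.get("rows", [])

# e.g. parse_rows('{"rows": [{"item": "Widget", "qty": 2}]}')
```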
Main issues:
Any tips for improving accuracy? Should I preprocess the images first?
Great observation on the overfitting curve. Here are my recommendations:
1. Early stopping: not directly supported, but you can set n_epochs to 2-3 and run several jobs with different values to compare.
2. Dataset size: 5,000 examples are generally sufficient for most tasks; quality matters more than quantity.
3. Learning rate: try 0.5x the default (set learning_rate_multiplier: 0.5).
Also consider using a held-out validation set to monitor overfitting.
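The knobs above map onto the fine-tuning API roughly like this, a sketch rather than a tested run; the file IDs and model snapshot are placeholders:

```python
# Sketch: run several jobs with different epoch counts. The helper just
# assembles the hyperparameters payload; the API calls are commented out
# because they need real uploaded file IDs.

def build_hyperparameters(n_epochs: int, lr_multiplier: float = 0.5) -> dict:
    """Payload for the hyperparameters argument of fine_tuning.jobs.create."""
    return {"n_epochs": n_epochs, "learning_rate_multiplier": lr_multiplier}

# from openai import OpenAI
# client = OpenAI()
# for epochs in (2, 3):
#     client.fine_tuning.jobs.create(
#         training_file="file-...",    # placeholder: uploaded training set
#         validation_file="file-...",  # held-out set to monitor overfitting
#         model="gpt-4o-mini-2024-07-18",
#         hyperparameters=build_hyperparameters(epochs),
#     )
```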
Setting learning_rate_multiplier to 0.5 and stopping at 2 epochs gave much better results. Final eval accuracy: 85% without the degradation. Thanks!