GPT-4o vision: Extracting tables from images accurately

Dave Sharp · Oct 8, 2024

I'm trying to use GPT-4o's vision capabilities to extract structured data from photos of printed tables (invoices, receipts, etc). The accuracy is decent but not reliable enough for production.

My approach:

import base64
from openai import OpenAI

client = OpenAI()

# Base64-encode the table image for the data URL
with open("invoice.png", "rb") as f:  # example filename
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all rows from this table as JSON"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
        ]
    }],
    response_format={"type": "json_object"}
)

Main issues:

  • Columns sometimes get merged
  • Numbers with commas (1,234) get misread
  • Multi-page tables lose context

Any tips for improving accuracy? Should I preprocess images first?

    Maria Santos · Staff · Accepted Answer · Oct 10

    Great observation on the overfitting curve. Here are my recommendations:

    1. Early stopping: Not directly supported, but you can set n_epochs to 2-3 and train multiple jobs with different values
    2. 5000 examples is generally sufficient for most tasks. Quality matters more than quantity.
    3. For learning rate, try 0.5x the default (set learning_rate_multiplier: 0.5)

    Also consider using a held-out validation set to monitor overfitting.
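The epoch and learning-rate settings above, plus a validation file, can all be passed when the job is created. A sketch assuming the OpenAI Python SDK v1.x; the base model name and file IDs are placeholders:

```python
# Settings from the recommendations above
HYPERPARAMETERS = {
    "n_epochs": 2,                    # stop after 2-3 epochs instead of the default
    "learning_rate_multiplier": 0.5,  # 0.5x the default learning rate
}

def launch_job(train_file_id: str, val_file_id: str):
    """Create a fine-tuning job with the settings above.

    The file IDs are placeholders for JSONL files already uploaded
    via the Files API with purpose="fine-tune".
    """
    from openai import OpenAI  # requires the openai v1.x SDK

    client = OpenAI()
    return client.fine_tuning.jobs.create(
        model="gpt-4o-mini-2024-07-18",  # placeholder base model
        training_file=train_file_id,
        validation_file=val_file_id,     # held-out set to monitor overfitting
        hyperparameters=HYPERPARAMETERS,
    )
```

With a validation_file attached, the job reports validation loss alongside training loss, which is what you'd watch to spot the overfitting curve.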

    Chris Nakamura

    Setting learning_rate_multiplier to 0.5 and stopping at 2 epochs gave much better results. Final eval accuracy: 85% without the degradation. Thanks!
