GPT-4o vision: Extracting tables from images accurately

Dave Sharp · Oct 8, 2024

I'm trying to use GPT-4o's vision capabilities to extract structured data from photos of printed tables (invoices, receipts, etc). The accuracy is decent but not reliable enough for production.

My approach:

import base64
from openai import OpenAI

client = OpenAI()

# Base64-encode the table image for the data URL
with open("invoice.png", "rb") as f:  # example filename
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all rows from this table as JSON"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
        ]
    }],
    response_format={"type": "json_object"}
)

Main issues:

  • Columns sometimes get merged
  • Numbers with commas (1,234) get misread
  • Multi-page tables lose context

Any tips for improving accuracy? Should I preprocess images first?

    Maria Santos · Staff · Accepted Answer · Oct 10

    Great observation on the overfitting curve. Here are my recommendations:

    1. Early stopping: Not directly supported, but you can set n_epochs to 2-3 and train multiple jobs with different values
    2. 5000 examples is generally sufficient for most tasks. Quality matters more than quantity.
    3. For learning rate, try 0.5x the default (set learning_rate_multiplier: 0.5)

    Also consider using a held-out validation set to monitor overfitting.
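The epoch and learning-rate settings above, plus a validation file, can all be passed when the job is created. A sketch assuming the OpenAI Python SDK v1.x; the base model name and file IDs are placeholders:

```python
# Settings from the recommendations above
HYPERPARAMETERS = {
    "n_epochs": 2,                    # stop after 2-3 epochs instead of the default
    "learning_rate_multiplier": 0.5,  # 0.5x the default learning rate
}

def launch_job(train_file_id: str, val_file_id: str):
    """Create a fine-tuning job with the settings above.

    The file IDs are placeholders for JSONL files already uploaded
    via the Files API with purpose="fine-tune".
    """
    from openai import OpenAI  # requires the openai v1.x SDK

    client = OpenAI()
    return client.fine_tuning.jobs.create(
        model="gpt-4o-mini-2024-07-18",  # placeholder base model
        training_file=train_file_id,
        validation_file=val_file_id,     # held-out set to monitor overfitting
        hyperparameters=HYPERPARAMETERS,
    )
```

With a validation_file attached, the job reports validation loss alongside training loss, which is what you'd watch to spot the overfitting curve.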

    Chris Nakamura

    Setting learning_rate_multiplier to 0.5 and stopping at 2 epochs gave much better results. Final eval accuracy: 85% without the degradation. Thanks!
