Daniel Hartwell
Last month, our team at a healthcare tech startup hit a wall. We'd been using GPT-4 to process medical residency applications: parsing transcripts, extracting grades, and flagging inconsistencies. Everything worked beautifully in testing. Then we deployed to production with 5,000 real applications, and I started getting Slack messages at 2 AM. "The AI is making up grades that don't exist," our QA lead Sarah wrote. "It's hallucinating entire GPAs."
I didn't believe it at first. We'd tested this thing for weeks. But when I pulled the logs, there it was: GPT-4 was confidently reporting a 3.8 GPA for an applicant whose transcript clearly showed 3.2. Not a parsing error—the model literally invented a number. And it wasn't an isolated case. About 3% of our processed applications had fabricated data.
That's when we started the most exhaustive AI model comparison I've ever done. We needed to know: Was this a GPT-4 problem? Would Claude do better? How do these models actually perform when you throw real, messy production data at them—not the cherry-picked examples in marketing materials?
Over the next six weeks, we ran both ChatGPT (GPT-4 and GPT-4o-mini) and Claude (Claude 3.5 Sonnet and Claude 3 Haiku) through identical production workloads. We processed 50,000+ medical documents, tracked every hallucination, measured latency under load, and calculated our actual costs down to the penny. This isn't a synthetic benchmark. This is what happened when we put these models to work on real problems with real consequences.
Here's everything we learned—the performance characteristics nobody talks about, the failure modes that only show up at scale, and the honest trade-offs you need to know before choosing an LLM for production.
The Hallucination Problem Everyone Ignores
Let me start with the issue that kicked off this whole investigation: hallucinations. Everyone knows LLMs hallucinate. What nobody tells you is how differently they hallucinate, and why that matters for your specific use case.
When GPT-4o-mini hallucinates medical residency applicant grades, it doesn't just make random errors. It shows a specific pattern: the model tends to "normalize" data toward expected values. If most medical students have GPAs between 3.5 and 3.9, GPT-4o-mini will subtly drift low outliers toward that range. A 3.2 becomes 3.5. A 2.8 might become 3.1. A 4.0 stays 4.0 (it already sits at the edge of the expected range).
This is insidious because it looks plausible. You won't catch it with spot checks. We only discovered it because we built a validation pipeline that compared extracted values against ground truth data for a 1,000-document test set. Here's what that pipeline looked like:
import json
from difflib import SequenceMatcher

import anthropic
from openai import OpenAI

EXTRACTION_PROMPT = (
    "Extract GPA, test scores, and honors from this medical school transcript. "
    "Return ONLY a JSON object with keys: gpa, mcat_score, honors. "
    "If a value is not present, use null."
)

def extract_with_gpt4(document_text, model="gpt-4o-mini"):
    """Extract structured data using an OpenAI chat model"""
    # Uses the openai>=1.0 client; the older openai.ChatCompletion.create call was removed in 1.0
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": document_text}
        ],
        temperature=0.0  # Deterministic for consistency
    )
    # json.loads raises if the model wraps the JSON in extra text
    return json.loads(response.choices[0].message.content)

def extract_with_claude(document_text, model="claude-3-5-sonnet-20241022"):
    """Extract structured data using Claude"""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        temperature=0.0,
        messages=[
            {"role": "user", "content": f"{EXTRACTION_PROMPT}\n\n{document_text}"}
        ]
    )
    return json.loads(response.content[0].text)

def validate_extraction(extracted, ground_truth):
    """Compare extracted data against known correct values"""
    errors = []
    for key in ground_truth:
        if key not in extracted:
            errors.append(f"Missing field: {key}")
            continue
        extracted_val = extracted[key]
        truth_val = ground_truth[key]
        # For numeric values, check if they match within tolerance
        if isinstance(truth_val, (int, float)):
            if extracted_val is None:
                errors.append(f"{key}: Extracted null, should be {truth_val}")
                continue
            try:
                diff = abs(float(extracted_val) - float(truth_val))
            except (TypeError, ValueError):
                # The model returned something that is not a number at all
                errors.append(f"{key}: Extracted non-numeric {extracted_val!r}, should be {truth_val}")
                continue
            if diff > 0.05:
                errors.append(f"{key}: Extracted {extracted_val}, should be {truth_val}")
        # For strings, check similarity
        elif isinstance(truth_val, str):
            if extracted_val is None:
                errors.append(f"{key}: Extracted null, should be '{truth_val}'")
            else:
                similarity = SequenceMatcher(None, str(extracted_val), truth_val).ratio()
                if similarity < 0.9:
                    errors.append(f"{key}: Low similarity ({similarity:.2f})")
    return errors
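To turn per-document checks into a single rate, the validator has to be run over a labeled set. Here is a minimal sketch of such a harness, not our exact production code. It assumes each test case is a dict with the hypothetical keys `text` and `ground_truth`, and it takes the extractor and validator as parameters so it can be exercised without API calls. The name `error_rate` and the stand-in functions are illustrative.

```python
from typing import Callable, Iterable

def error_rate(
    cases: Iterable[dict],
    extract_fn: Callable[[str], dict],
    validate_fn: Callable[[dict, dict], list],
) -> tuple[float, list]:
    """Return (fraction of documents with at least one error, list of failures)."""
    failures = []
    total = 0
    for case in cases:
        total += 1
        try:
            extracted = extract_fn(case["text"])
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            failures.append((case["text"], [f"Unparseable model output: {exc}"]))
            continue
        errors = validate_fn(extracted, case["ground_truth"])
        if errors:
            failures.append((case["text"], errors))
    rate = len(failures) / total if total else 0.0
    return rate, failures

# Demo with stand-in functions (no API calls); swap in a real extractor and validator.
def fake_extract(text):
    return {"gpa": 3.5}

def fake_validate(extracted, truth):
    return [] if extracted == truth else ["gpa mismatch"]

cases = [
    {"text": "doc A", "ground_truth": {"gpa": 3.5}},
    {"text": "doc B", "ground_truth": {"gpa": 3.2}},
]
rate, failures = error_rate(cases, fake_extract, fake_validate)
print(f"{rate:.1%}")  # one of the two documents fails
```

Counting unparseable responses as failures is a design choice; depending on the workload, it may be more useful to track them as a separate category.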
When we ran this validation across 1,000 medical transcripts with known ground truth:
GPT-4o-mini hallucination rate: 3.2%
- 32 documents had fabricated or significantly altered numeric values
- Pattern: Values drifted toward statistical norms
- Most common: GPA adjustments (18 cases), test score "corrections" (11 cases)
- Average drift: 0.3-0.5 points on a 4.0 scale
Claude 3.5 Sonnet hallucination rate: 0.8%
- 8 documents had errors, but different pattern
- Pattern: Model refused to extract when uncertain, returned null values
- Most common: Missing data marked as null rather than guessed (6 cases)
- Actual fabrications: Only 2 cases, both involved ambiguous handwritten notes
This is the critical difference. GPT-4o-mini hallucinates by filling in gaps with plausible-sounding data. Claude hallucinates less frequently and tends to refuse rather than guess. For our medical application, Claude's behavior was far safer—a null value flags for human review, while a plausible-but-wrong value slips through.
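The refusal-versus-fabrication split can be tallied per field rather than judged by eye. The sketch below is one way to do it, under a few assumptions: the function name `classify_numeric_error` is hypothetical, the 0.05 tolerance mirrors the validator above, and the default expected range of 3.5 to 3.9 is just the GPA example from earlier, not a measured norm.

```python
def classify_numeric_error(extracted_val, truth_val, tolerance=0.05,
                           expected_range=(3.5, 3.9)):
    """Label one numeric field as 'ok', 'refusal', 'malformed', or 'fabrication'.

    For fabrications, the second element says whether the extracted value sits
    closer to the expected range than the true value does (the "normalizing" drift).
    """
    if extracted_val is None:
        return "refusal", None  # model declined to guess
    try:
        value = float(extracted_val)
    except (TypeError, ValueError):
        return "malformed", None  # not a number at all
    truth = float(truth_val)
    if abs(value - truth) <= tolerance:
        return "ok", None

    low, high = expected_range

    def distance(x):
        # Distance from x to the nearest edge of the expected range (0 if inside).
        return max(low - x, 0.0, x - high)

    toward_norm = distance(value) < distance(truth)
    return "fabrication", toward_norm

# (extracted, truth) pairs for a single field
pairs = [(3.5, 3.2), (None, 3.2), (3.8, 3.8)]
print([classify_numeric_error(e, t)[0] for e, t in pairs])
# ['fabrication', 'refusal', 'ok']
```

Keeping refusals and fabrications in separate buckets matters because they have different downstream costs: a refusal is visible, while a fabrication is not.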
But here's where it gets interesting: When we tested on a different domain (legal contract analysis), the pattern reversed somewhat.
Domain-Specific Performance: Why Your Use Case Matters More Than Benchmarks
We didn't stop at medical documents. I convinced our CTO to let me test these models across four different production workloads we were considering for AI integration:
- Medical document extraction (what we were already doing)
- Legal contract clause identification (for our compliance team)
- Customer support ticket classification (to route inquiries)
- Code review and bug detection (for our dev team's PR process)
Each domain revealed different strengths. Here's what we found:
Medical Documents: Claude Wins on Accuracy
For extracting structured data from medical transcripts, research papers, and clinical notes:
Claude 3.5 Sonnet:
- Accuracy: 96.8% (measured against human review)
- Hallucination rate: 0.8%
- Average latency: 2.3 seconds per document
- Cost per 1,000 documents: $47.50
GPT-4:
- Accuracy: 94.1%
- Hallucination rate: 2.1%
- Average latency: 1.8 seconds per document
- Cost per 1,000 documents: $52.00
GPT-4o-mini:
- Accuracy: 91.2%
- Hallucination rate: 3.2%
- Average latency: 0.9 seconds per document
- Cost per 1,000 documents: $8.50
The accuracy difference is significant. In healthcare, a 3-5% error rate means hundreds of incorrect decisions if you're processing thousands of documents. We calculated that GPT-4o-mini's error rate would require an additional 160 hours of human review per month to catch the mistakes—which completely eliminated the cost savings from the cheaper model.
Claude's conservative approach (returning null when uncertain) meant we only needed 40 hours of human review for edge cases. The model essentially triaged itself: high-confidence extractions went straight through, low-confidence got flagged. This is exactly what you want in a production system.
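That self-triage can be made explicit with a small routing step. This is a sketch under assumptions, not our production router: the field names come from the extraction prompt, the function name `triage` is hypothetical, and it treats every null as review-worthy, which may be too strict for fields such as honors that can legitimately be absent.

```python
def triage(extracted, required_fields=("gpa", "mcat_score", "honors")):
    """Route an extraction to 'auto' or 'review' based on null or missing fields."""
    missing = [field for field in required_fields if extracted.get(field) is None]
    if missing:
        return "review", missing
    return "auto", []

print(triage({"gpa": 3.2, "mcat_score": None, "honors": "AOA"}))
# ('review', ['mcat_score'])
```

A router like this only helps when the model prefers null over a guess; it does nothing for plausible-but-wrong values, which still need a ground-truth comparison to detect.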
Legal Contracts: GPT-4 Surprises with Nuance
For identifying specific clauses in commercial contracts (indemnification, liability caps, termination conditions):
GPT-4:
- Accuracy: 97.2%
- False positives: 1.1%
- False negatives: 1.7%
- Average latency: 3.2 seconds per contract
- Cost per 1,000 contracts: $78.00
Claude 3.5 Sonnet:
- Accuracy: 95.8%
- False positives: 0.9%
- False negatives: 3.3%
- Average latency: 2.8 seconds per contract
- Cost per 1,000 contracts: $62.00
This was unexpected. GPT-4 showed better understanding of legal language nuance, particularly around conditional clauses and nested dependencies. When we dug into the false negatives, Claude was missing clauses that were phrased unusually or split across multiple paragraphs.
Here's an example that Claude missed but GPT-4 caught:
Contract excerpt:
"The Company shall indemnify the Contractor for claims arising from
Company's gross negligence (Section 8.2), provided that such indemnification
shall not exceed the total contract value as specified in Exhibit A, and
further provided that Contractor has complied with notification requirements
detailed in Section 12.4."
Prompt we used:
Identify all indemnification clauses in this contract, including any
limitations or conditions on the indemnification. Return as JSON with:
- clause_text: the exact text
- limitations: array of any caps or conditions
- cross_references: any sections referenced
GPT-4 response:
{
  "clause_text": "The Company shall indemnify the Contractor for claims arising from Company's gross negligence",
  "limitations": [
    "Capped at total contract value (Exhibit A)",
    "Requires Contractor compliance with Section 12.4 notification"
  ],
  "cross_references": ["Section 8.2", "Exhibit A", "Section 12.4"]
}
Claude 3.5 Sonnet response:
{
  "clause_text": "The Company shall indemnify the Contractor for claims arising from Company's gross negligence",
  "limitations": ["Capped at total contract value"],
  "cross_references": ["Section 8.2", "Exhibit A"]
}
Claude missed the notification requirement condition. It's subtle—the "further provided that" clause is easy to overlook—but in legal contracts, these conditions matter. When we tested 50 similar cases with multi-part conditions, GPT-4 caught 47, Claude caught 39.
My theory: GPT-4's training likely included more legal documents, or its architecture handles long-range dependencies better. Whatever the reason, for legal work, GPT-4's extra cost was worth it.
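Misses like the one in the example above can sometimes be caught with a cheap textual cross-check on the model's JSON. The sketch below is an illustrative heuristic, not something from our pipeline: it counts "provided that" markers and section or exhibit references in the source clause and compares them with what the model returned. The marker phrase, the reference pattern, and the function name `flag_possible_misses` are all assumptions, and real contracts will need a broader list.

```python
import re

CONDITION_MARKERS = re.compile(r"\bprovided that\b", re.IGNORECASE)
CROSS_REFERENCES = re.compile(r"\b(?:Section \d+(?:\.\d+)*|Exhibit [A-Z])\b")

def flag_possible_misses(clause_source, extraction):
    """Compare simple textual signals in the source with the model's JSON output."""
    text = " ".join(clause_source.split())  # collapse line breaks inside the clause
    warnings = []

    marker_count = len(CONDITION_MARKERS.findall(text))
    limitation_count = len(extraction.get("limitations", []))
    if limitation_count < marker_count:
        warnings.append(
            f"{marker_count} condition markers but only {limitation_count} limitations"
        )

    found_refs = set(CROSS_REFERENCES.findall(text))
    missing_refs = sorted(found_refs - set(extraction.get("cross_references", [])))
    if missing_refs:
        warnings.append(f"Unreported cross-references: {missing_refs}")
    return warnings

excerpt = (
    "The Company shall indemnify the Contractor for claims arising from "
    "Company's gross negligence (Section 8.2), provided that such indemnification "
    "shall not exceed the total contract value as specified in Exhibit A, and "
    "further provided that Contractor has complied with notification requirements "
    "detailed in Section 12.4."
)
claude_output = {
    "limitations": ["Capped at total contract value"],
    "cross_references": ["Section 8.2", "Exhibit A"],
}
for warning in flag_possible_misses(excerpt, claude_output):
    print(warning)
```

On the excerpt above, this flags both the missing condition and the missing Section 12.4 reference. It is a tripwire for human review rather than a replacement for the model's reading of the clause.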
Customer Support: Both Models Excel, But Cost Differs Dramatically
For classifying support tickets into categories (billing, technical, account management, feature requests):
GPT-4o-mini:
- Accuracy: 97.8%
- Average latency: 0.6 seconds per ticket
- Cost per 10,000 tickets: $85.00
Claude 3 Haiku:
- Accuracy: 97.3%
- Average latency: 0.5 seconds per ticket
- Cost per 10,000 tickets: $42.00
Claude 3.5 Sonnet:
- Accuracy: 98.1%
- Average latency: 1.2 seconds per ticket
- Cost per 10,000 tickets: $475.00
This is where the mini/small models shine. Customer support classification is a simpler task than medical extraction or legal analysis. The accuracy difference between GPT-4o-mini and Claude 3.5 Sonnet was only 0.3 percentage points, which is not worth paying 5.6x more.
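One way to make that trade-off concrete is to price each additional correct classification. The sketch below uses only the per-10,000-ticket figures listed above; the function name `marginal_cost_per_extra_correct` is hypothetical, and it assumes accuracy translates linearly into correctly classified tickets.

```python
def marginal_cost_per_extra_correct(cost_a, acc_a, cost_b, acc_b, batch=10_000):
    """Extra dollars per additional correct item when moving from model A to model B.

    cost_* are dollars per `batch` items; acc_* are accuracy percentages.
    """
    extra_cost = cost_b - cost_a
    extra_correct = (acc_b - acc_a) / 100 * batch
    if extra_correct <= 0:
        raise ValueError("model B is not more accurate than model A")
    return extra_cost / extra_correct

mini_cost, mini_acc = 85.00, 97.8        # GPT-4o-mini, per 10,000 tickets
sonnet_cost, sonnet_acc = 475.00, 98.1   # Claude 3.5 Sonnet, per 10,000 tickets

cost_multiple = sonnet_cost / mini_cost
per_extra_correct = marginal_cost_per_extra_correct(
    mini_cost, mini_acc, sonnet_cost, sonnet_acc
)
print(round(cost_multiple, 1))      # 5.6
print(round(per_extra_correct, 2))  # 13.0
```

At roughly $13 per additional correctly routed ticket, the larger model only pays off if a misrouted ticket costs more than that to fix by hand.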
We went with Claude 3 Haiku for production. The slightly lower accuracy (0.5 percentage points below GPT-4o-mini) was offset by roughly 50% lower cost. At the prices above, that is $43 saved per 10,000 tickets; at our volume (about 15,000 tickets per month), it comes to roughly $65/month, or about $775 annually. For a startup, that matters.
Here's the prompt we settled on after testing dozens of variations:
def classify_support_ticket(ticket_text, model="claude-3-haiku-20240307"):
    """Classify support ticket into category"""
    client = anthropic.Anthropic()
    prompt = f"""Classify this customer support ticket into ONE category:
- BILLING: Payment issues, invoices, refunds, subscription changes
- TECHNICAL: Bugs, errors, performance issues, integration problems
- ACCOUNT: Login issues, password resets, user management, permissions
- FEATURE: Feature requests, product feedback, enhancement suggestions
- OTHER: Anything that doesn't fit above categories
Ticket:
{ticket_text}
Respond with ONLY the category name (BILLING, TECHNICAL, ACCOUNT, FEATURE, or OTHER).
If the ticket mentions multiple issues, choose the PRIMARY issue."""
    response = client.messages.create(
        model=model,
        max_tokens=50,  # We only need one word
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
💡 Pro Tip: For classification tasks, we found that being extremely explicit about the categories and providing clear examples in the prompt improved accuracy by 2-3% across all models. Also, limiting max_tokens to just what you need (50 tokens is plenty for a single-word response) caps the worst-case latency and cost if the model starts generating extra text.
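Even with an explicit prompt, a model can occasionally reply with something other than a bare category name, so it helps to validate the reply before routing on it. The following is a minimal sketch; the function name `normalize_category` and the choice to fall back to OTHER are assumptions, and logging or raising on unexpected replies may be preferable in some systems.

```python
import re

VALID_CATEGORIES = {"BILLING", "TECHNICAL", "ACCOUNT", "FEATURE", "OTHER"}

def normalize_category(raw_response, fallback="OTHER"):
    """Map a raw model reply onto one of the allowed categories."""
    cleaned = raw_response.strip().strip(".:\"'`*").upper()
    if cleaned in VALID_CATEGORIES:
        return cleaned
    # Tolerate replies such as "Category: BILLING" by looking for exactly one label.
    words = set(re.findall(r"[A-Z]+", raw_response.upper()))
    matches = words & VALID_CATEGORIES
    if len(matches) == 1:
        return matches.pop()
    return fallback  # ambiguous or unrecognized reply

print(normalize_category("billing\n"))              # BILLING
print(normalize_category("Category: TECHNICAL."))   # TECHNICAL
print(normalize_category("BILLING or ACCOUNT"))     # OTHER
```

Replies that name two categories are treated as ambiguous on purpose, so they land in the fallback bucket instead of being routed on a coin flip.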
Code Review: GPT-4 Understands Context Better
For reviewing pull requests and identifying potential bugs:
GPT-4:
- True positive rate: 78.2% (correctly identified real bugs)
- False positive rate: 12.3% (flagged non-issues)
- Average latency: 8.5 seconds per PR (500-1000 lines)
- Cost per 100 PRs: $340.00
Claude 3.5 Sonnet:
- True positive rate: 71.8%
- False positive rate: 18.7%
- Average latency: 6.2 seconds per PR
- Cost per 100 PRs: $280.00
Code review is hard for LLMs because it requires understanding project-specific context, coding conventions, and subtle logic errors. Both models struggled with false positives—flagging code that looked suspicious but was actually correct given the broader context.
We tested this on 200 real PRs from our codebase where we knew the ground truth (bugs that made it to production, issues caught in manual review, and clean code that passed all checks).
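For reference, rates like these can be computed from labeled results as follows. This is a sketch using the standard definitions, with one label per PR; the exact granularity we scored at (per PR versus per flagged issue) is not captured here, so treat the input shape and the name `review_metrics` as assumptions.

```python
def review_metrics(results):
    """Compute rates from (has_real_bug, model_flagged) boolean pairs, one per PR."""
    tp = fp = tn = fn = 0
    for has_bug, flagged in results:
        if has_bug and flagged:
            tp += 1   # real bug, caught
        elif has_bug:
            fn += 1   # real bug, missed
        elif flagged:
            fp += 1   # clean code, flagged anyway
        else:
            tn += 1   # clean code, left alone
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"true_positive_rate": tpr, "false_positive_rate": fpr}

sample = [(True, True)] * 3 + [(True, False)] + [(False, True)] + [(False, False)] * 3
print(review_metrics(sample))
# {'true_positive_rate': 0.75, 'false_positive_rate': 0.25}
```

Note that the false positive rate here is measured against clean PRs only, so it stays comparable across test sets with different proportions of buggy code.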
Example of a false positive that both models flagged:
def process_payment(amount, user_id):
    """Process payment for user"""
    # Both models flagged this as a potential bug:
    # "No validation that amount is positive"
    user = User.objects.
Daniel Hartwell
Author. Covers backend systems, distributed architecture, and database performance. Contributing author at NextGenBeing.