
Benchmarks didn't ship. At least not for Sarvam.
Last week I ran a proper AI evaluation across four models — Mistral AI OCR3, Gemini Flash, Gemini Flash Lite, and Sarvam Vision — on a document vision and extraction workflow I've been building. Nothing groundbreaking, just the kind of unglamorous, systematic eval work that actually tells you something useful before you commit to a model in production.
And somewhere in the middle of it, I had a moment that crystallised something I've been thinking about for a while.
Sarvam Vision, the model that LinkedIn spent the better part of the last three months celebrating as India's answer to Gemini, the homegrown giant finally beating Western AI on OCR benchmarks, does not return JSON output.
Not as a config option. Not behind a flag. Just, structurally, as a product decision: you send it a document, it extracts the data, and it hands you back HTML or Markdown. That's the output. That's what you get.
Now if you're a PM, you already understand why this matters even without touching a line of code. Every downstream system that consumes extracted document data — whether it's a database, a validation layer, an automation workflow — expects structured output. JSON is not a developer preference. It's the lingua franca of data exchange. Skipping it isn't minimalism, it's an incomplete product. It means anyone who wants to actually use Sarvam Vision in a real workflow has to build a secondary parser just to reach the baseline that every other model in the eval hit out of the box.
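For a sense of what that secondary parser involves, here is a minimal sketch in Python. It assumes the model hands back a simple pipe-delimited Markdown table. The function name `markdown_table_to_records` and the sample invoice fields are hypothetical, made up for illustration, and not taken from Sarvam's actual output. Real documents with merged cells, nested tables, or HTML output would need a good deal more than this.

```python
import json


def markdown_table_to_records(markdown: str) -> list[dict[str, str]]:
    """Turn the first pipe-delimited Markdown table in `markdown` into a list of dicts.

    Hypothetical helper: assumes one header row, one separator row, and
    no escaped pipes inside cells.
    """
    rows: list[list[str]] = []
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            if rows:
                break  # the first table has ended
            continue  # still looking for the table
        # Drop the outer pipes, then split the remaining text into cells.
        rows.append([cell.strip() for cell in line[1:-1].split("|")])

    if len(rows) < 2:
        return []

    header, body = rows[0], rows[1:]
    # Discard the separator row, e.g. |---|:---:|
    body = [
        row for row in body
        if not all(cell and set(cell) <= set("-:") for cell in row)
    ]
    return [dict(zip(header, row)) for row in body]


# Illustrative input only, not real model output.
sample = """
Invoice summary

| field | value |
|-------|-------|
| invoice_number | INV-001 |
| total | 1250.00 |
"""

records = markdown_table_to_records(sample)
print(json.dumps(records, indent=2))
```

Even this toy version has to make decisions about separators, stray text, and malformed rows. That is integration cost the other models in the eval did not impose.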
That's not a benchmark problem. That's a product execution problem. And no OCR accuracy score fixes it.
Sarvam's models are fine-tuned on Mistral. This is publicly documented. You're not evaluating a ground-up Indian foundation model when you use Sarvam. You're evaluating Mistral with domain-specific fine-tuning and an Indian language focus layered on top. Which is genuinely valuable work — fine-tuning at that scale for Indic languages is hard and worth doing. But it's a different claim than what the LinkedIn hype cycle was making. "India beats Gemini on OCR" is a very different statement when the underlying architecture is a Western open-source model.
And this is the real thing I want to talk about, because it goes well beyond Sarvam.
LinkedIn has a particular failure mode when it comes to technology that carries national or cultural pride with it. The incentive to celebrate is so strong, and the cost of actually stress-testing the claim is so high, that people share benchmark screenshots without ever running the model in a workflow. They amplify the narrative — "Indian AI finally competing at the frontier" — without asking the product questions that matter: What does the output actually look like? What does integration cost? Where does it break under real conditions? What did the team optimise for, and what did they deprioritise?
Benchmarks, by design, measure one thing cleanly. Production usage is messy, contextual, and full of the edge cases that benchmarks never cover. The distance between a good benchmark score and a model you can actually ship with is enormous — and that distance is where product thinking lives.
As PMs, this should be our default posture toward any AI capability, whether it's a new model, a new tool, or a new platform: benchmarks tell you what's possible under controlled conditions; evals tell you what's true in your specific context. One is marketing. The other is product work.
Run your evals. Name your criteria before you test, not after. And be especially skeptical of the tools that LinkedIn collectively decided were good before most people had actually used them.
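One way to name criteria before testing is to write them down as code before any model output exists. The sketch below is an assumption about how that could look, not the harness I actually ran: the criteria names, the required fields, and the `model_a` / `model_b` outputs are all hypothetical placeholders.

```python
import json
from typing import Callable

# Hypothetical required fields for an invoice-extraction workflow.
REQUIRED_FIELDS = {"invoice_number", "total"}


def is_valid_json(output: str) -> bool:
    """Criterion: the raw model output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def has_required_fields(output: str) -> bool:
    """Criterion: the output is a JSON object containing every required field."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS.issubset(data)


# Criteria are fixed here, before any model is run.
CRITERIA: dict[str, Callable[[str], bool]] = {
    "returns_valid_json": is_valid_json,
    "has_required_fields": has_required_fields,
}


def score(outputs: dict[str, str]) -> dict[str, dict[str, bool]]:
    """Apply every pre-declared criterion to each model's raw output."""
    return {
        model: {name: check(output) for name, check in CRITERIA.items()}
        for model, output in outputs.items()
    }


# Placeholder outputs for illustration only, not real eval results.
outputs = {
    "model_a": '{"invoice_number": "INV-001", "total": "1250.00"}',
    "model_b": "| field | value |\n|---|---|\n| total | 1250.00 |",
}

results = score(outputs)
for model, checks in results.items():
    print(model, checks)
```

Because the checks exist before the outputs do, a model that returns Markdown fails on the criteria you set in advance, rather than on ones you reached for after seeing the results.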
The hype cycle is fast. Production feedback is slow. That asymmetry is expensive if you're the one who shipped on the wrong assumption.


