<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://pranavj17.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://pranavj17.github.io/" rel="alternate" type="text/html" /><updated>2026-04-28T05:17:28+00:00</updated><id>https://pranavj17.github.io/feed.xml</id><title type="html">Pranav J — Engineering Notes</title><subtitle>Production AI infrastructure, LLM benchmarks, observability engineering.</subtitle><author><name>Pranav J</name></author><entry><title type="html">I benchmarked 21 NVIDIA NIM free-tier models on real production AI-SRE workloads. Here’s the architecture that emerged.</title><link href="https://pranavj17.github.io/2026/04/28/nvidia-nim-benchmark/" rel="alternate" type="text/html" title="I benchmarked 21 NVIDIA NIM free-tier models on real production AI-SRE workloads. Here’s the architecture that emerged." /><published>2026-04-28T00:00:00+00:00</published><updated>2026-04-28T00:00:00+00:00</updated><id>https://pranavj17.github.io/2026/04/28/nvidia-nim-benchmark</id><content type="html" xml:base="https://pranavj17.github.io/2026/04/28/nvidia-nim-benchmark/"><![CDATA[<p><em>5 wins, 0 losses for an unexpected model. 38% of the catalog returns 404. And the highest-impact change wasn’t a model swap.</em></p>

<hr />

<p>For six months I’ve been running an AI-SRE pipeline in production at a fintech — automating Sentry triage, support-ticket classification, RCA generation, and Graylog query suggestion. The default analysis model was Mistral Nemotron on NVIDIA NIM’s free tier, chosen after a 12-model benchmark in April. It works fine on synthetic prompts.</p>

<p>This week I tested it on real tickets from our ~6,000-entry resolved-issue knowledge base. The results changed the architecture.</p>

<blockquote>
  <p><strong>TL;DR:</strong> Devstral-2-123b, Mistral’s “dev-focused” model, beat Mistral Nemotron 5–0 on tickets from diverse services, scoring 71% vs 54% of rubric points. Nemotron-3-Super-120b won on the hardest cross-service ticket. Qwen3-Coder-480b produced the cleanest Elixir code review of any free-tier model I tested. 8 of the 21 models I probed (38%) are listed in NIM’s catalog but unusable on the free tier: 404s, timeouts, or no endpoint. And the single highest-impact change wasn’t a model swap; it was prompt engineering.</p>
</blockquote>

<p>This article walks through 7 passes of testing across 21 model probes, the failure modes that matter for production, and the 4-tier architecture that emerged.</p>

<p><strong>Full per-model raw outputs, per-ticket grading rubrics, and methodology data:</strong> <a href="https://pranavj17.github.io/2026/04/28/nvidia-nim-benchmark/">pranavj17.github.io/2026/04/28/nvidia-nim-benchmark</a></p>

<hr />

<h2 id="the-system-under-test">The system under test</h2>

<p>The pipeline I’m benchmarking against runs every 3 minutes during business hours:</p>

<pre><code>[Slack/Sentry alert] OR [support ticket created]
        ↓
  bash dispatch (system crontab)
        ↓
  Routing model: gemma4:latest on local Ollama (Mac Mini M4 Pro)
        ↓
  Bash pre-fetch: Metabase + Sentry + Graylog + GitLab APIs
        ↓
  Analysis model: Claude Sonnet 4.6 (subscription, 1-turn)
        ↓
  Output: ticket comment + Slack post + (rare) auto-fix MR
</code></pre>

<p>This processes about 50 tickets/day across 7 internal services. The architecture pre-fetches all evidence in bash, then makes a single 1-turn call to Claude. We chose this design because Claude was the strongest analyst and the subscription budget is tight (15 calls per 5 hours), so every call has to count.</p>
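
<p>The pre-fetch-then-one-call shape can be sketched in a few lines. This is a minimal Python illustration, not the production bash dispatch; the function name, the section headings, and the evidence snippets are hypothetical placeholders.</p>

<pre><code class="language-python">def build_single_turn_prompt(ticket_text, evidence):
    """Pack the ticket and all pre-fetched evidence into one prompt string."""
    sections = ["## Ticket", ticket_text.strip()]
    # Sort by source name so the prompt layout is stable between runs.
    for source in sorted(evidence):
        body = evidence[source].strip() or "(no data returned)"
        sections.append("## Evidence: " + source)
        sections.append(body)
    sections.append("## Task")
    sections.append("Give the likely root cause, a recommended fix, and one investigation step.")
    return "\n\n".join(sections)


# Hypothetical evidence standing in for the Metabase / Sentry / Graylog pre-fetch.
prompt = build_single_turn_prompt(
    "Account stuck in initiated state after document upload.",
    {"sentry": "no matching events", "graylog": "", "metabase": "account_state=initiated"},
)
print(prompt)
</code></pre>

<p>Because every source is fetched before the model is called, an empty result still shows up in the prompt as an explicit “no data returned” section instead of silently disappearing.</p>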

<p><strong>The question I was answering: could free NIM models replace any of these tiers without quality loss?</strong></p>

<hr />

<h2 id="methodology">Methodology</h2>

<p>I designed seven passes of increasing rigor, each addressing a specific architectural question:</p>

<ol>
  <li><strong>Synthetic latency</strong> — which models are fast enough?</li>
  <li><strong>Real KB simple</strong> — do they get production tickets right?</li>
  <li><strong>Real KB complex</strong> — what happens on hard cross-service problems?</li>
  <li><strong>Tool calling</strong> — which support agentic workflows?</li>
  <li><strong>Frontier + specialized</strong> — what about the largest and the niche models?</li>
  <li><strong>Coding-specific</strong> — which produces production-ready code?</li>
  <li><strong>Routing comparison</strong> — does swapping the local routing model help?</li>
</ol>

<p>Each pass used representative real workloads, not synthetic benchmarks. Where ground truth existed (closed tickets with human-resolved RCA), I scored model output against it on a points rubric. All tests ran from the same Mac Mini against the same NIM endpoint with the same auth key.</p>

<hr />

<h2 id="pass-1--synthetic-latency-baseline">Pass 1 — Synthetic latency baseline</h2>

<p>A single SRE-style prompt with ~140 input tokens, max_tokens=1024, run against 8 candidate models.</p>

<p>The top four returned in under 4 seconds:</p>

<ul>
  <li><strong>mistralai/devstral-2-123b-instruct-2512</strong> — 2.7s, 134 tokens</li>
  <li><strong>meta/llama-4-maverick-17b-128e-instruct</strong> — 2.9s, 186 tokens</li>
  <li><strong>mistralai/mistral-nemotron</strong> — 3.3s, 150 tokens (production baseline)</li>
  <li><strong>qwen/qwen3-coder-480b-a35b-instruct</strong> — 3.5s, 142 tokens</li>
</ul>

<p>The slow tier:</p>

<ul>
  <li><strong>qwen/qwen3-next-80b-a3b-instruct</strong> — 11.0s</li>
  <li><strong>nvidia/nemotron-3-super-120b-a12b</strong> — 12.5s, 415 tokens (reasoning-heavy)</li>
</ul>

<p><strong>The first red flag:</strong> two models in the catalog timed out completely at 180 seconds — <code>mistralai/mistral-medium-3-instruct</code> and <code>deepseek-ai/deepseek-v4-flash</code>. This was the early signal that catalog presence ≠ usable inference.</p>

<p>Synthetic prompts couldn’t distinguish the top four. I needed real tickets to see real differences.</p>

<hr />

<h2 id="pass-2--real-kb-tickets-head-to-head-devstral-vs-nemotron">Pass 2 — Real KB tickets, head-to-head (Devstral vs Nemotron)</h2>

<p>I pulled 5 closed tickets from our knowledge base, picked for service diversity:</p>

<ul>
  <li><strong>Ticket A</strong> (portfolio service): Mutual fund misclassified as external after broker-registration change</li>
  <li><strong>Ticket B</strong> (auth service): Mobile-number login redirects to signup (orphan account)</li>
  <li><strong>Ticket C</strong> (member + onboarding): Account stuck “initiated” — workflow not triggered after document upload</li>
  <li><strong>Ticket D</strong> (CRM + notifications): Survey segment targeting bug</li>
  <li><strong>Ticket E</strong> (CRM + WMS sync): Mapping silently dropped for disabled accounts</li>
</ul>

<p>Each model received only the ticket description — no schema, no root cause, no fix. The ask: (1) likely root cause in 2–3 sentences, (2) recommended fix, (3) one SQL or investigation step. Then I graded each output against the actual ground truth on a 9–12 point rubric.</p>

<h3 id="the-result-devstral-50">The result: Devstral 5–0</h3>

<blockquote>
  <p><strong>Across 48 graded points: Devstral 71% vs Mistral Nemotron 54% — a 17 percentage-point gap.</strong></p>
</blockquote>

<p>Devstral won every single ticket. Latency cost: about 18% slower per call (4.5s avg for Devstral vs 3.8s avg for Nemotron) in exchange for 17 percentage points more accuracy.</p>
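
<p>For readers who want to check the trade-off, the two headline numbers follow directly from the averages quoted above. The variable names are mine; the inputs are the Pass 2 figures.</p>

<pre><code class="language-python">devstral_s, nemotron_s = 4.5, 3.8      # average latency per call, in seconds
devstral_pct, nemotron_pct = 71, 54    # share of the 48 rubric points

slowdown_pct = round((devstral_s / nemotron_s - 1) * 100)
accuracy_gap_pts = devstral_pct - nemotron_pct

print(slowdown_pct, accuracy_gap_pts)  # 18 17
</code></pre>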

<p>A 5–0 sweep is unusual. What was Nemotron actually doing wrong?</p>

<h3 id="failure-mode-1--confidently-hallucinating-database-columns">Failure mode 1 — Confidently hallucinating database columns</h3>

<p>On the portfolio-service ticket, Nemotron’s verification SQL referenced a column <code>is_external</code> that doesn’t exist in our schema, and filtered through a subquery on a <code>fund_master</code> table (with a <code>fund_family</code> column) that doesn’t exist either:</p>

<pre><code class="language-sql">-- Nemotron's verification SQL:
SELECT COUNT(*) FROM portfolio_db.investments
WHERE folio_number IN (...) AND is_external = true
AND fund_code IN (
  SELECT fund_code FROM portfolio_db.fund_master
  WHERE fund_family IN (...)
);
</code></pre>

<p>Devstral’s version was closer to real schema — it confused a column <em>value</em> for a column name, but stayed in the realm of plausible:</p>

<pre><code class="language-sql">-- Devstral's verification SQL:
SELECT folio_number, scheme_name, classification_flag, broker_code
FROM portfolio_db.investments
WHERE folio_number = '...'
  AND classification_flag = 'external';
</code></pre>

<p>A human running Devstral’s query finds the discrepancy in seconds and adjusts. Nemotron’s query fails outright because it targets an entirely fictional table.</p>

<blockquote>
  <p><strong>For autonomous workflows where output gets piped to a SQL executor, this is the difference between “useful triage” and “broken pipeline.”</strong></p>
</blockquote>
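
<p>One way to keep hallucinated columns away from a SQL executor is to compile the generated query against the known schema before running it. The sketch below uses an in-memory SQLite database as a stand-in. That is an assumption on my part: a real guard would prepare the statement against the production engine, and the column list here is illustrative, not the real schema.</p>

<pre><code class="language-python">import sqlite3


def check_sql_against_schema(schema_ddl, candidate_sql):
    """Return None if the query compiles against the schema, else the error text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        # EXPLAIN compiles the statement without running it, so unknown
        # tables and columns surface as sqlite3.OperationalError.
        conn.execute("EXPLAIN " + candidate_sql)
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        conn.close()


# Hypothetical schema: the column list is illustrative only.
SCHEMA = "CREATE TABLE investments (folio_number TEXT, scheme_name TEXT, broker_code TEXT);"

print(check_sql_against_schema(SCHEMA, "SELECT COUNT(*) FROM investments WHERE is_external = 1"))
print(check_sql_against_schema(SCHEMA, "SELECT folio_number FROM investments WHERE broker_code = 'X'"))
</code></pre>

<p>The first call returns an error string naming the missing column; the second returns <code>None</code>. A pipeline could feed the error text back to the model for one retry instead of executing a query that cannot work.</p>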

<h3 id="failure-mode-2--sql-filters-that-hide-the-bug">Failure mode 2 — SQL filters that hide the bug</h3>

<p>On the auth login-redirect ticket, the ground truth was that a customer’s mobile number was associated with an <em>orphan duplicate auth user</em> — by definition not active.</p>

<p>Nemotron’s investigation SQL:</p>

<pre><code class="language-sql">WHERE mobile = '...' AND status = 'active';
</code></pre>

<p>The <code>status = 'active'</code> filter is the bug. This query returns zero results and leads the human investigator down the wrong path.</p>

<p>Devstral’s version:</p>

<pre><code class="language-sql">WHERE mobile = '...' OR mobile LIKE '%...';
</code></pre>

<p>No status filter. Would surface the orphan. Would also catch country-code formatting variants.</p>
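
<p>The effect is easy to reproduce on a toy table. This is a sketch with a hypothetical <code>auth_users</code> table and made-up rows, again on in-memory SQLite; the status value <code>orphan</code> is my own label, not the real one.</p>

<pre><code class="language-python">import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical auth table with one active user and one orphan duplicate.
conn.execute("CREATE TABLE auth_users (id INTEGER, mobile TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO auth_users VALUES (?, ?, ?)",
    [(1, "9000000001", "active"), (2, "9000000002", "orphan")],
)

target = "9000000002"
with_filter = conn.execute(
    "SELECT id FROM auth_users WHERE mobile = ? AND status = 'active'", (target,)
).fetchall()
without_filter = conn.execute(
    "SELECT id, status FROM auth_users WHERE mobile = ?", (target,)
).fetchall()

print(with_filter)     # empty: the filter hides the orphan row
print(without_filter)  # the orphan row is visible
conn.close()
</code></pre>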

<h3 id="failure-mode-3--vagueness-vs-spontaneous-architectural-insight">Failure mode 3 — Vagueness vs spontaneous architectural insight</h3>

<p>On the CRM/WMS sync ticket, both models were given only the ticket text: “RM mapping doesn’t reflect in CRM despite re-mapping,” the disabled-account symptom, and “no errors in Graylog.”</p>

<p>Nemotron suggested generic remediation:</p>

<blockquote>
  <p>“Manually trigger a sync for the affected client and monitor CRM logs for any hidden validation errors or pipeline failures.”</p>
</blockquote>

<p>Devstral named specific architectural elements without being told the architecture:</p>

<blockquote>
  <p>“Run a manual sync for the affected client via the CRM integration tool with verbose logging enabled, then inspect the response and any intermediate service logs (e.g., <strong>Kafka, ETL jobs</strong>) for hidden errors or mismatched data formats.”</p>
</blockquote>

<p>Our actual architecture <em>does</em> use Kafka for cross-service events. The model inferred it from “WMS → CRM sync with silent drops” — that’s the kind of intuition senior SREs develop over years.</p>

<hr />

<h2 id="pass-3--complex-multi-service-scenario">Pass 3 — Complex multi-service scenario</h2>

<p>A harder ticket: an account-split operation propagated email update to auth, CRM, WMS, and folio main records — but <strong>not</strong> to the externally-registered folio email used by the withdrawal pipeline. Result: withdrawal validation fails on stale email.</p>

<p>Five models tested on a 15-point rubric across trigger identification, gap location, immediate fix, long-term fix, and investigation quality.</p>

<p>The winner this time wasn’t Devstral:</p>

<ul>
  <li><strong>nvidia/nemotron-3-super-120b-a12b</strong> — <strong>14/15 (93%)</strong> at 24.6s. Cleanest analysis. Best long-term fix.</li>
  <li><strong>mistralai/devstral-2-123b</strong> — 13/15 (87%) at 7.6s. Direct hit on ground truth. Best speed/quality balance.</li>
  <li><strong>minimaxai/minimax-m2.7</strong> — 12.5/15 at 88.5s. Strong analysis, but latency unusable for cron.</li>
  <li><strong>mistralai/mistral-nemotron</strong> — 11/15 (73%) at 10.2s. Captures sync gap but doesn’t pinpoint specific data flow.</li>
  <li><strong>qwen/qwen3-next-80b-a3b-thinking</strong> — <strong>2.5/15 (failed)</strong> at 16s. Thought aloud for 1500 tokens, ran out of budget before producing the structured answer.</li>
</ul>
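
<p>The percentages above are points earned over 15. As a rough sketch of how such a rubric can be tallied: the five criteria are the ones named earlier in this pass, but the equal 3-point split and the per-criterion numbers in the example are assumptions for illustration only.</p>

<pre><code class="language-python"># Criteria come from the rubric above; the equal 3-point split is an assumption.
RUBRIC_MAX = {
    "trigger_identification": 3,
    "gap_location": 3,
    "immediate_fix": 3,
    "long_term_fix": 3,
    "investigation_quality": 3,
}


def score_pct(points_by_criterion):
    """Sum per-criterion points (half points allowed) and convert to a percentage."""
    for name in points_by_criterion:
        if name not in RUBRIC_MAX:
            raise KeyError("unknown criterion: " + name)
    total = sum(points_by_criterion.values())
    return total, round(100 * total / sum(RUBRIC_MAX.values()))


# One hypothetical way to arrive at 14 points out of 15.
example = {
    "trigger_identification": 3,
    "gap_location": 3,
    "immediate_fix": 3,
    "long_term_fix": 3,
    "investigation_quality": 2,
}
print(score_pct(example))  # (14, 93)
</code></pre>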

<h3 id="three-findings-on-complex-scenarios">Three findings on complex scenarios</h3>

<blockquote>
  <p><strong>Reasoning models DO win on hard problems.</strong> The slow tier earns its latency.</p>
</blockquote>

<p>But the latency cost is steep. 24.6s vs 7.6s. For 30-min cron triage, Devstral still wins on practical grounds. Nemotron-3-Super is viable only as tier-2 escalation for the 1–2 hardest tickets per day.</p>

<p><strong>Thinking models need bigger token budgets.</strong> Qwen3-thinking emitted 1500 tokens of visible reasoning and ran out before producing the structured answer. If you use thinking models, set <code>max_tokens</code> to at least 4096.</p>

<p><strong>MiniMax M2.7 is unusable on cron at 88s.</strong> Interestingly, when the same model was asked to make tool calls instead of writing a full analysis, it returned in 2.9s.</p>
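
<p>The token-budget failure is detectable at runtime. OpenAI-compatible responses report <code>finish_reason: "length"</code> when generation stops at the token limit, so a wrapper can retry with a larger budget rather than accept a truncated answer. A minimal sketch follows; the function name, the doubling policy, and the 8192 ceiling are my assumptions.</p>

<pre><code class="language-python">def answer_or_retry_budget(response, current_max_tokens, ceiling=8192):
    """Return (answer, None) on a complete reply, or (None, bigger_budget) if truncated."""
    choice = response["choices"][0]
    if choice.get("finish_reason") == "length":
        # The model hit max_tokens mid-thought; double the budget up to a ceiling.
        return None, min(current_max_tokens * 2, ceiling)
    return choice["message"]["content"], None


truncated = {"choices": [{"finish_reason": "length", "message": {"content": "Let me think"}}]}
complete = {"choices": [{"finish_reason": "stop", "message": {"content": "Root cause: stale email"}}]}

print(answer_or_retry_budget(truncated, 2048))  # (None, 4096)
print(answer_or_retry_budget(complete, 2048))   # ('Root cause: stale email', None)
</code></pre>

<p>A caller should stop retrying once the returned budget equals the one it just used, otherwise a model that never finishes would loop at the ceiling.</p>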

<hr />

<h2 id="pass-4--tool-calling-support">Pass 4 — Tool calling support</h2>

<p>OpenAI-compatible tool calling on free NIM. Three tool definitions: <code>query_graylog</code>, <code>query_metabase</code>, <code>search_kb</code>. The prompt: “investigate this Sentry alert.”</p>

<p>Seven of eight models worked correctly. The best two chained two tool calls in a single response:</p>

<ul>
  <li><strong>mistralai/devstral-2-123b</strong> — 2 chained calls in 1.5s (<code>search_kb</code> + <code>query_graylog</code>)</li>
  <li><strong>minimaxai/minimax-m2.7</strong> — 2 chained calls in 2.9s</li>
</ul>

<p>Single-call but solid:</p>

<ul>
  <li><strong>qwen/qwen3-coder-480b</strong> — 1.5s, sophisticated query syntax</li>
  <li><strong>mistralai/mistral-nemotron</strong> — 2.6s</li>
  <li><strong>meta/llama-3.3-70b-instruct</strong> — 2.1s</li>
  <li><strong>nvidia/nemotron-3-super-120b</strong> — 2.4s</li>
</ul>

<p>And one slow:</p>

<ul>
  <li><strong>nvidia/llama-3.3-nemotron-super-49b-v1.5</strong> — 13.4s, 381 reasoning tokens before the call</li>
</ul>

<h3 id="the-llama-4-maverick-gotcha">The Llama-4-Maverick gotcha</h3>

<blockquote>
  <p><strong>Llama-4-Maverick on NIM is silently broken for tool calling.</strong></p>
</blockquote>

<p>The model <em>appears</em> to attempt a tool call by emitting JSON in the <code>content</code> field:</p>

<pre><code class="language-json">{
  "type": "function",
  "name": "query_graylog",
  "parameters": {"stream_id": "...", "query": "..."}
}
</code></pre>

<p>But this is in the <code>content</code> field, not the proper <code>tool_calls</code> structure. Code that does:</p>

<pre><code class="language-python">response.choices[0].message.tool_calls  # → []
</code></pre>

<p>silently gets an empty list. The model “knows” it should call a tool but doesn’t follow the OpenAI tool-calling protocol correctly.</p>

<p>This is the kind of bug that takes weeks to diagnose in production — looks like the model’s lazy or confused, when actually the protocol is broken. Don’t use Llama-4-Maverick on NIM for tool-calling pipelines without a JSON parser fallback.</p>
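
<p>A JSON parser fallback could look roughly like this. It is a sketch that works on plain message dicts in the OpenAI chat format; the function name is mine, and the fallback only recognises the exact JSON shape shown above, which may not cover everything the model emits.</p>

<pre><code class="language-python">import json


def extract_tool_calls(message):
    """Return a list of (name, arguments) pairs from an OpenAI-style message dict."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI format, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    if calls:
        return calls

    # Fallback: the model may have written the call as JSON text in content.
    try:
        parsed = json.loads(message.get("content") or "")
    except json.JSONDecodeError:
        return []
    if isinstance(parsed, dict) and parsed.get("type") == "function" and "name" in parsed:
        return [(parsed["name"], parsed.get("parameters", {}))]
    return []


maverick_style = {
    "tool_calls": [],
    "content": '{"type": "function", "name": "query_graylog", "parameters": {"query": "error"}}',
}
print(extract_tool_calls(maverick_style))
</code></pre>

<p>Checking the proper <code>tool_calls</code> field first keeps well-behaved models on the normal path; the content parse only runs when that field is empty.</p>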

<h3 id="why-devstrals-chained-calls-matter">Why Devstral’s chained calls matter</h3>

<p>Without being asked, Devstral called both <code>search_kb</code> AND <code>query_graylog</code> in the same response. That’s how a senior SRE actually thinks: “let me check past tickets, AND look at logs.” Most models call one tool, wait for the result, then call the next. Requesting both up front saves a full round trip in the agent loop.</p>
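
<p>To benefit from chained calls, the agent loop has to execute every call in the turn, not just the first. A minimal dispatch sketch follows. The three tool names match the definitions used in this pass, but the stub bodies and their argument names are hypothetical.</p>

<pre><code class="language-python">import json


# Stub tools standing in for the real integrations; bodies and
# argument names are hypothetical.
def search_kb(query):
    return "kb results for " + query


def query_graylog(stream_id, query):
    return "graylog hits in " + stream_id + " for " + query


def query_metabase(question):
    return "metabase answer to " + question


TOOLS = {"search_kb": search_kb, "query_graylog": query_graylog, "query_metabase": query_metabase}


def run_tool_calls(tool_calls):
    """Execute every call from one assistant turn and build the tool-role replies."""
    replies = []
    for call in tool_calls:
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            result = "error: unknown tool " + fn["name"]
        else:
            result = handler(**json.loads(fn["arguments"]))
        # Each reply carries the tool_call_id so the model can match results to calls.
        replies.append({"role": "tool", "tool_call_id": call["id"], "content": result})
    return replies


turn = [
    {"id": "call_1", "function": {"name": "search_kb", "arguments": '{"query": "login redirect"}'}},
    {"id": "call_2", "function": {"name": "query_graylog", "arguments": '{"stream_id": "auth", "query": "signup"}'}},
]
for reply in run_tool_calls(turn):
    print(reply)
</code></pre>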

<hr />

<h2 id="pass-5--frontier-and-specialized-models">Pass 5 — Frontier and specialized models</h2>

<p>Six untested candidates. The results:</p>

<p><strong>Worked but not worth it:</strong></p>
<ul>
  <li><strong>mistralai/mistral-large-3-675b</strong> at 9.7s — 2x slower than Devstral, similar quality</li>
  <li><strong>deepseek-ai/deepseek-v4-pro</strong> at 55.7s — strong analysis but unusable cron latency</li>
</ul>

<p><strong>Worked and useful:</strong></p>
<ul>
  <li><strong>sarvamai/sarvam-m</strong> at 29.4s — translated mixed Hindi/Tamil/English correctly. Strong domain fit.</li>
  <li><strong>nvidia/nv-embedcode-7b-v1</strong> at 1.3s — code-specific embeddings, 4096-dim, semantically correct clustering. Production-ready.</li>
</ul>

<p><strong>Listed but not actually deployed on free tier:</strong></p>
<ul>
  <li><strong>writer/palmyra-fin-70b-32k</strong> — returns 404 “Function not found for account”</li>
  <li><strong>nvidia/gliner-pii</strong> — no working endpoint discovered (NIM container only, not free-tier hosted)</li>
</ul>

<h3 id="the-indian-language-finding">The Indian-language finding</h3>

<p>Our customers write tickets that mix English with Hindi, Tamil, Bengali, and Marathi. Generalist models often misclassify these. Sarvam-m correctly translated and classified a representative ticket:</p>

<blockquote>
  <p>“Sir mera SIP ka amount change nahi ho raha hai… வருகிற மாதம் தான் debit ஆகணும்…”
→ “Customer cannot update SIP amount; debit must reflect for upcoming month.”</p>
</blockquote>

<p>29-second latency is acceptable for the small subset of tickets that actually need it. Production wrapper must strip <code>&lt;/think&gt;</code> tags — Sarvam-m emits visible reasoning.</p>

<hr />

<h2 id="pass-6--coding-deep-dive">Pass 6 — Coding deep-dive</h2>

<p>A representative production-shaped Elixir module with four bugs (nil safety, idempotency, query efficiency, error handling). Eight code-relevant models tested.</p>

<p>The module under review:</p>

<pre><code class="language-elixir">defmodule MyApp.KycService do
  alias MyApp.{Repo, User, KycSubmission}
  import Ecto.Query

  def submit_kyc(user_id, kyc_data) do
    user = Repo.get(User, user_id)
    address = user.address.line1

    submission = %KycSubmission{
      user_id: user_id,
      pan: kyc_data["pan"],
      address: address,
      status: "pending"
    }

    Repo.insert(submission)
  end

  def check_kyc_status(user_id) do
    submissions = Repo.all(from k in KycSubmission, where: k.user_id == ^user_id)
    submissions |&gt; Enum.map(fn s -&gt; {s.id, s.status} end) |&gt; List.last()
  end
end
</code></pre>

<p>The four bugs to find: nil crash on <code>user.address.line1</code>, no idempotency, inefficient query, no error handling on <code>Repo.insert</code>.</p>

<h3 id="results">Results</h3>

<p><strong>Winner — qwen/qwen3-coder-480b-a35b-instruct</strong> scored 14/15 in 22.2 seconds. Cleanest idiomatic Elixir of any model tested.</p>

<p>The kind of code only senior Elixir engineers write:</p>

<pre><code class="language-elixir">def submit_kyc(user_id, kyc_data) when is_integer(user_id) and is_map(kyc_data) do
  case Repo.get(User, user_id) do
    nil -&gt; {:error, :user_not_found}
    user -&gt;
      with {:ok, address} &lt;- extract_address(user),
           {:ok, pan} &lt;- validate_pan(kyc_data),
           {:ok, submission} &lt;- find_or_create_submission(user_id, pan, address) do
        {:ok, submission}
      else
        error -&gt; error
      end
  end
end

defp extract_address(%{address: %{line1: line1}}) when is_binary(line1), do: {:ok, line1}
defp extract_address(_), do: {:error, :invalid_address}
</code></pre>

<p>Pattern matching with guards. A <code>with</code> chain for railway-oriented programming. <code>defp</code> helpers for separation of concerns. Consistent <code>{:ok, _}</code> / <code>{:error, _}</code> tuples.</p>

<p>Other results:</p>
<ul>
  <li><strong>mistralai/devstral-2-123b</strong> — 11/15 at 9.8s. Solid, but a subtle bug in its own fix.</li>
  <li><strong>qwen/qwen2.5-coder-32b</strong> — 9/15 at 105s. Pattern-matched on a string key for an Ecto struct (atom keys) — would fail at runtime.</li>
  <li><strong>mistralai/mistral-nemotron</strong> — 7/15 at 11.3s. Found 8 bugs but introduced 3 new ones in the fix.</li>
</ul>

<p><strong>Four other code-specialist models — codestral-22b, granite-34b-code, codellama-70b, codegemma-7b — all returned 404. Not actually deployed on free tier.</strong></p>

<h3 id="mistral-nemotrons-dangerous-code-mistakes">Mistral Nemotron’s dangerous code mistakes</h3>

<p>Nemotron found <em>more</em> bugs than any other model — but its fix introduced three new ones:</p>

<pre><code class="language-elixir">unless Map.has_key?(kyc_data, "pan") and user.address.line1 do
  raise "Missing required KYC data"
end

existing = Repo.one?(from k in KycSubmission, ...)  # Repo.one? doesn't exist

%KycSubmission{
  status: :pending  # atom...
}

# But the existing-check filtered:
where: k.status == "pending"  # ...string. Always misses.
</code></pre>

<p><code>Repo.one?</code> doesn’t exist in Ecto. <code>user.address.line1</code> still crashes on nil even inside the <code>unless</code>. The status field is an atom but the existing-check filters by string.</p>

<blockquote>
  <p><strong>For autonomous code-generation workflows, this is the worst possible failure mode: confident, plausible, broken.</strong></p>
</blockquote>

<p>A reviewer skimming would approve. CI catches it eventually but burns time and budget.</p>

<hr />

<h2 id="pass-7--routing-model">Pass 7 — Routing model</h2>

<p>Routing is the first step: it classifies each ticket into service / type / context, and it runs hundreds of times a day. Latency matters more here than for any other tier. I currently use <code>gemma4:latest</code> on local Ollama, and tested it against <code>upstage/solar-10.7b-instruct</code> (NIM):</p>

<ul>
  <li><strong>ollama/gemma4:latest</strong> (current production) — 5.5s, 8/15 (53% accuracy), local + unlimited</li>
  <li><strong>upstage/solar-10.7b-instruct</strong> — 8.0s, 6/15 (40% accuracy), NIM with 40 RPM cap</li>
</ul>

<p>Gemma4 wins on every axis. Solar offers nothing meaningful.</p>

<h3 id="the-taxonomy-insight">The taxonomy insight</h3>

<p>Both models scored under 60%. The failure pattern was the same: <strong>neither knows the service taxonomy.</strong> Pure pattern-matching gets you to roughly 50% accuracy at best. Adding ~7 lines of taxonomy to the prompt would likely lift the same model to 80%+:</p>

<pre><code>Service taxonomy (example):
- Portfolio service: investments / folios / external classification
- Onboarding service: KYC / new account workflows
- Member service: profiles / email / account state
- Auth service: login / OTP / mobile / session
- Notifications service: email / SMS / push dispatch
- CRM: relationship-manager mapping / CRM-WMS sync
- Order service: orders / withdrawals / redemptions
</code></pre>

<blockquote>
  <p><strong>Same model. Same latency. A projected ~30 percentage-point accuracy lift, still to be measured. Free.</strong></p>
</blockquote>
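
<p>Wiring the taxonomy into the routing prompt is a small change. The sketch below copies the taxonomy lines from the example above; the prompt wording and both function names are hypothetical, and the answer parser is a simple substring match that a production router would probably want to tighten.</p>

<pre><code class="language-python"># Taxonomy entries copied from the example above.
TAXONOMY = {
    "Portfolio service": "investments / folios / external classification",
    "Onboarding service": "KYC / new account workflows",
    "Member service": "profiles / email / account state",
    "Auth service": "login / OTP / mobile / session",
    "Notifications service": "email / SMS / push dispatch",
    "CRM": "relationship-manager mapping / CRM-WMS sync",
    "Order service": "orders / withdrawals / redemptions",
}


def build_routing_prompt(ticket_text):
    """Prepend the service taxonomy to the ticket so the router can ground its choice."""
    lines = ["Classify the ticket into exactly one service.", "", "Service taxonomy:"]
    for service, scope in TAXONOMY.items():
        lines.append("- " + service + ": " + scope)
    lines += ["", "Ticket:", ticket_text.strip(), "", "Answer with the service name only."]
    return "\n".join(lines)


def parse_routing_answer(answer):
    """Map a free-text model answer back onto a known service, or None."""
    cleaned = answer.strip().lower()
    for service in TAXONOMY:
        if service.lower() in cleaned:
            return service
    return None


print(build_routing_prompt("Mobile-number login redirects to signup."))
print(parse_routing_answer(" auth service\n"))  # Auth service
</code></pre>

<p>Returning <code>None</code> for anything outside the taxonomy gives the dispatcher an explicit “unrouted” case to handle instead of a guessed service.</p>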

<hr />

<h2 id="the-phantom-catalog-finding">The phantom-catalog finding</h2>

<p>After 7 passes covering 21 distinct model probes, eight models listed by NVIDIA’s <code>/v1/models</code> endpoint turned out not to serve inference on the free tier:</p>

<ul>
  <li><strong>mistralai/mistral-medium-3-instruct</strong> — TIMEOUT 180s</li>
  <li><strong>deepseek-ai/deepseek-v4-flash</strong> — TIMEOUT 180s</li>
  <li><strong>writer/palmyra-fin-70b-32k</strong> — 404 “Function not found for account”</li>
  <li><strong>nvidia/gliner-pii</strong> — no working endpoint discovered</li>
  <li><strong>mistralai/codestral-22b-instruct-v0.1</strong> — 404</li>
  <li><strong>ibm/granite-34b-code-instruct</strong> — 404</li>
  <li><strong>meta/codellama-70b</strong> — 404</li>
  <li><strong>google/codegemma-7b</strong> — 404</li>
</ul>

<p><strong>8 of 21 = 38% catalog attrition.</strong></p>

<p>This is a much bigger gap than “occasional phantom listings.” It’s structural. The <code>/v1/models</code> endpoint behaves like a “previously available or potentially available” list, not “currently deployed.”</p>

<blockquote>
  <p><strong>Production fallback chains MUST probe each model before relying on it.</strong> Don’t trust catalog presence.</p>
</blockquote>
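
<p>In practice, “probe before relying” can be a startup step that filters the fallback chain. The sketch below takes the probe as a callable so it stays independent of any HTTP client. The stand-in probe and its failure table are illustrative; a real probe would send a tiny completion request with a short timeout.</p>

<pre><code class="language-python">def build_fallback_chain(candidates, probe):
    """Keep only the models whose probe call succeeds, preserving priority order."""
    usable, skipped = [], {}
    for model in candidates:
        try:
            probe(model)          # expected to raise on 404, timeout, and so on
            usable.append(model)
        except Exception as exc:  # broad on purpose: any failure disqualifies
            skipped[model] = str(exc)
    return usable, skipped


# Stand-in probe for illustration, mirroring two of the failures listed above.
UNAVAILABLE = {
    "writer/palmyra-fin-70b-32k": "404",
    "deepseek-ai/deepseek-v4-flash": "timeout",
}


def fake_probe(model):
    if model in UNAVAILABLE:
        raise RuntimeError(UNAVAILABLE[model])


chain, skipped = build_fallback_chain(
    [
        "mistralai/devstral-2-123b-instruct-2512",
        "writer/palmyra-fin-70b-32k",
        "deepseek-ai/deepseek-v4-flash",
        "nvidia/nemotron-3-super-120b-a12b",
    ],
    fake_probe,
)
print(chain)
print(skipped)
</code></pre>

<p>Logging the <code>skipped</code> map on every run also gives a cheap record of when a listed model starts or stops serving.</p>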

<hr />

<h2 id="the-synthesis-a-4-tier-production-architecture">The synthesis: a 4-tier production architecture</h2>

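<p>After all 7 passes, this is the architecture that emerged from the data: four core tiers (router, triage analyst, hard-ticket analyst, code generator) plus three supporting roles.</p>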

<ul>
  <li><strong>Router</strong> (ticket → service+type) → <code>ollama/gemma4:latest</code> (local, current) plus taxonomy in prompt. Local, free, unlimited.</li>
  <li><strong>Triage analyst</strong> (default, ~80% of tickets) → <code>mistralai/devstral-2-123b-instruct-2512</code> (NIM, free). 5-0 vs Nemotron, 4.5s, supports tool calling.</li>
  <li><strong>Hard-ticket analyst</strong> (cross-service, complex) → <code>nvidia/nemotron-3-super-120b-a12b</code> (NIM, free). 14/15 on the complex scenario. 24.6s acceptable for tier-2 escalation.</li>
  <li><strong>Code generator</strong> (auto-fix MR) → <code>qwen/qwen3-coder-480b-a35b-instruct</code> (NIM, free). Best Elixir code review at 14/15.</li>
  <li><strong>Embeddings</strong> (KB / RAG) → <code>nvidia/nv-embedcode-7b-v1</code> (NIM, free). Code-trained, 4096-dim.</li>
  <li><strong>Indian-language pre-router</strong> → <code>sarvamai/sarvam-m</code> (NIM, free, conditional branch).</li>
  <li><strong>Tool-calling agent (v3 architecture)</strong> → <code>mistralai/devstral-2-123b</code> (NIM, free). Chains 2 tool calls in 1.5s.</li>
</ul>
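
<p>Expressed as configuration, the list above is a role-to-model map plus an escalation rule. The model IDs below come from the list; the role keys and the rule itself (escalate when a ticket touches three or more services) are assumptions for illustration, since the cut-off for “hard” is not something this benchmark measured.</p>

<pre><code class="language-python"># Role-to-model map taken from the list above; the role keys are my own labels.
MODEL_BY_ROLE = {
    "router": "ollama/gemma4:latest",
    "triage": "mistralai/devstral-2-123b-instruct-2512",
    "hard_ticket": "nvidia/nemotron-3-super-120b-a12b",
    "code_fix": "qwen/qwen3-coder-480b-a35b-instruct",
    "embeddings": "nvidia/nv-embedcode-7b-v1",
    "indic_prerouter": "sarvamai/sarvam-m",
}


def pick_analyst(services_involved, wants_code_fix=False, escalate_at=3):
    """Choose the analysis model for one ticket.

    Tickets touching fewer than escalate_at distinct services stay on the
    default triage model. The threshold is an assumption.
    """
    if wants_code_fix:
        return MODEL_BY_ROLE["code_fix"]
    distinct = set(services_involved)
    if len(distinct) in range(escalate_at):
        return MODEL_BY_ROLE["triage"]
    return MODEL_BY_ROLE["hard_ticket"]


print(pick_analyst(["crm", "wms"]))
print(pick_analyst(["auth", "crm", "wms", "order"]))
</code></pre>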

<blockquote>
  <p><strong>All free except local M4 Pro electricity. Total monthly inference cost: $0.</strong></p>
</blockquote>

<p>Compare that with running an equivalent stack on paid Anthropic and OpenAI APIs: roughly $200–1000/month for similar throughput.</p>

<h3 id="what-gets-deprecated">What gets deprecated</h3>

<ul>
  <li><strong>Mistral Nemotron</strong> — hallucinates DB columns and Ecto functions. Replace with Devstral immediately.</li>
  <li><strong>Llama-4-Maverick on NIM</strong> — broken tool-calling protocol. Avoid in agent loops.</li>
  <li><strong>Phantom listings</strong> (codestral, granite-code, codellama, codegemma, palmyra-fin, gliner-pii, plus the two that timed out: mistral-medium-3 and deepseek-v4-flash) — design fallback chains around the 13 confirmed-working models, not the 21 catalog entries.</li>
</ul>

<hr />

<h2 id="the-meta-finding-prompt-engineering--model-selection">The meta-finding: prompt engineering &gt; model selection</h2>

<p>Three of the seven passes pointed at the same conclusion:</p>

<ol>
  <li><strong>Routing accuracy</strong> — 53% with no taxonomy → a projected ~80% with a 7-line taxonomy added (still to be measured)</li>
  <li><strong>SQL accuracy</strong> — models hallucinate columns when they never see the schema; service-context files should fix this regardless of model</li>
  <li><strong>Indian-language tickets</strong> — even sarvam-m benefits from one-shot examples in prompt</li>
</ol>

<blockquote>
  <p><strong>Adding domain context to existing prompts almost always beats swapping models.</strong> It’s faster to ship, costs nothing in latency or money, and works the same way regardless of which model is behind the API.</p>
</blockquote>

<p>The Q1 2026 LLM landscape is shifting underneath production systems weekly: DeepSeek V4, Llama 4, MiniMax M2.7, Mistral Large 3, and the Nemotron 3 family all dropped in the last 90 days. <strong>Engineering teams that re-benchmark only when they have a problem are 60+ days out of date.</strong> A quarterly re-bench takes a Sunday.</p>

<hr />

<h2 id="caveats--what-would-strengthen-this">Caveats — what would strengthen this</h2>

<p><strong>5-ticket benchmarks are small.</strong> Adding 5 more tickets per pass would harden the conclusions, especially the 5-0 sweep.</p>

<p><strong>Self-graded scoring has bias.</strong> A blind grade by another engineer (or any grading setup that hides which model produced which output) would be more rigorous.</p>

<p><strong>No schema context in prompts.</strong> Adding actual service schema files would lift everyone’s accuracy. The relative gaps might compress, widen, or stay — that’s worth measuring.</p>

<p><strong>Latency variance under load.</strong> Devstral hit a 27-second outlier on one test. Free-tier latency can spike. Production stability needs more validation before swapping.</p>
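
<p>Averages hide exactly this kind of outlier, so any follow-up validation should report tail latency as well. A small sketch using the standard library; the sample values are hypothetical, shaped like a run that is mostly around 4.5s with one 27s spike.</p>

<pre><code class="language-python">import statistics


def latency_summary(samples_s):
    """Median, 95th percentile, and max for a list of latencies in seconds."""
    cuts = statistics.quantiles(samples_s, n=20, method="inclusive")
    return {
        "p50": statistics.median(samples_s),
        "p95": cuts[18],  # 19 cut points; index 18 is the 95th percentile
        "max": max(samples_s),
    }


# Hypothetical samples, not measured data.
samples = [4.1, 4.3, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 5.0, 27.0]
print(latency_summary(samples))
</code></pre>

<p>On data like this the median stays near the typical call while p95 and max expose the spike, which is the number that matters for a cron job with a fixed time budget.</p>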

<hr />

<h2 id="what-im-doing-next">What I’m doing next</h2>

<ol>
  <li>Adding service taxonomy to the routing prompt — measuring the 53% → 80%+ lift</li>
  <li>Side-by-side test of <code>nv-embedcode-7b-v1</code> vs <code>nomic-embed-text</code> on the actual KB</li>
  <li>Documenting the 4-tier architecture in handover notes</li>
  <li>Prototyping a tool-calling agent loop — replacing bash pre-fetch with <code>Devstral with tools → multi-turn loop</code></li>
</ol>

<p>If you’ve benchmarked LLMs against real production tickets, or built free-tier-only AI infrastructure, I’d love to compare notes — especially on latency variance under sustained load and on prompt-engineering vs model-selection trade-offs.</p>

<hr />

<p><em>Full per-ticket grading rubric, methodology data, and per-model raw outputs at</em> <a href="https://pranavj17.github.io/2026/04/28/nvidia-nim-benchmark/"><em>pranavj17.github.io/2026/04/28/nvidia-nim-benchmark</em></a> <em>— rendered tables, browsable structure.</em></p>]]></content><author><name>Pranav J</name></author><category term="ai-infrastructure" /><category term="llm-benchmark" /><category term="sre" /><category term="nvidia-nim" /><category term="llm" /><category term="sre" /><category term="openclaw" /><category term="devstral" /><category term="mistral-nemotron" /><category term="qwen3-coder" /><summary type="html"><![CDATA[5 wins, 0 losses for an unexpected model. 38% of the catalog returns 404. And the highest-impact change wasn't a model swap.]]></summary></entry></feed>