Nutcrackers, Not Sledgehammers: Why Small Models Belong in Your AI Harness

Hand a frontier model a ten-step task and it will return a polished result. Ask it how it got there, and you have a problem.

Not a technical problem - an audit problem. Somewhere between your prompt and its answer, the model planned, reasoned, skipped, reconsidered, and decided. None of that is visible. None of it is reproducible. In regulated pharma, that has a name: a black box. And black boxes don’t survive audits.

In Part 2 of this series, I described the AI Harness - the software layer that enforces process compliance around AI. Today I want to talk about what goes inside that harness. Because the model you choose determines whether your harness contains a black box or a glass one.

The Black Box Problem

When a large general-purpose model executes a complex task end-to-end, the work happens inside a single, opaque pass. The model might draw on patterns from its training data, make intermediate judgments, and silently resolve ambiguities - all without leaving a trace you can show anyone.

An auditor’s first question is always some version of: “Show me how this result was produced.” With a monolithic frontier-model workflow, your honest answer is: “We can show you the input and the output. The middle is statistics.”

That answer has a predictable consequence. Probabilistic systems that can’t demonstrate their intermediate steps don’t get integrated into regulated processes - they get bolted on beside them. Every AI output becomes a draft that a qualified human must review, verify, and formally release.

Review Purgatory: Why This Doesn’t Scale

There’s nothing wrong with human oversight - Part 2 made the case for designed human-in-the-loop touchpoints. The problem is when every single output needs full human review because nothing in the process is verifiable on its own.

That’s not human-in-the-loop. That’s human-as-bottleneck.

Run the math: if your AI drafts 50 medical information responses a day and each one requires complete expert review because you can’t trust any individual step, you haven’t automated a process. You’ve created a faster way to generate review backlog. The AI scales; your reviewers don’t. This is why so many pharma AI initiatives plateau at the pilot stage - the black box sets the ceiling.
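
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 50 drafts per day come from the example above; the 30 minutes per full review and the 20% escalation rate are assumptions picked purely for illustration, not measured figures.

```python
def daily_review_hours(drafts_per_day: int,
                       minutes_per_review: float,
                       reviewed_fraction: float) -> float:
    """Expert hours per day spent reviewing AI drafts."""
    return drafts_per_day * reviewed_fraction * minutes_per_review / 60


# Assumption: a full expert review takes 30 minutes.
MINUTES_PER_REVIEW = 30

# Black box: every one of the 50 drafts needs a complete review.
black_box_hours = daily_review_hours(50, MINUTES_PER_REVIEW, 1.0)

# Assumption: with verifiable steps, only 20% of drafts are escalated.
glass_box_hours = daily_review_hours(50, MINUTES_PER_REVIEW, 0.2)

print(f"Full review of everything: {black_box_hours:.1f} expert hours/day")
print(f"Review of escalations only: {glass_box_hours:.1f} expert hours/day")
```

Under these assumptions, reviewing everything costs 25 expert hours a day, which is more than three full-time reviewers before the AI has saved anyone any time. Swap in your own figures; the shape of the result is the point, not the exact values.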

From Black Boxes to Glass Boxes

Now invert the architecture. Instead of one large model doing ten steps invisibly, the harness orchestrates ten explicit steps - and small, specialized, fine-tuned models execute the ones that genuinely need AI.

A small model fine-tuned for exactly one task - classify this inquiry, extract these data points, check this text against approved safety language - behaves fundamentally differently from a generalist:

Narrow input, narrow output. The task surface is small enough to define, constrain, and test thoroughly.
Verifiable like an analytical method. Defined input, defined output format, measurable performance against a reference dataset, acceptance criteria, re-validation when anything changes. This is vocabulary your QA department already speaks.
Loggable at every seam. Each model call is one discrete, timestamped event: this input, this model version, this output, this gate result.

Couple that with the harness’s audit trail, and the auditor no longer faces a black box. They see a sequence of glass boxes: step three ran at 14:02, model version 2.1, input hash X, output Y, validation gate passed. Every step. Every run.
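
A minimal sketch of what one such glass-box log entry could look like in Python. The field names, the `StepRecord` class, and the `record_step` helper are hypothetical and chosen for illustration; a real harness would also need tamper-evident storage, which is out of scope here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StepRecord:
    """One discrete, timestamped model call (hypothetical schema)."""
    run_id: str
    step: int
    model_version: str
    input_hash: str
    output: str
    gate_passed: bool
    timestamp: str


def record_step(run_id: str, step: int, model_version: str,
                step_input: str, output: str, gate_passed: bool) -> StepRecord:
    # Store a hash of the input rather than the text itself, so the
    # log can prove what was processed without duplicating the data.
    input_hash = hashlib.sha256(step_input.encode("utf-8")).hexdigest()
    return StepRecord(
        run_id=run_id,
        step=step,
        model_version=model_version,
        input_hash=input_hash,
        output=output,
        gate_passed=gate_passed,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


record = record_step(
    run_id="run-001",
    step=3,
    model_version="2.1",
    step_input="Patient asks about dosing in renal impairment.",
    output="category=dosing",
    gate_passed=True,
)

# One JSON line per step is easy to append, search, and show an auditor.
log_line = json.dumps(asdict(record), sort_keys=True)
print(log_line)
```

Hashing the input is one design choice among several: it keeps sensitive text out of the log while still letting you demonstrate that a given input produced a given output.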

You can’t validate what one giant model does in a single opaque pass. You can absolutely validate what ten small models do in ten transparent steps.

To be precise: small models are still probabilistic - fine-tuning doesn’t make them deterministic. But a narrow task makes the output space small enough to test, bound, and validate - and in a GxP world, validated beats brilliant every time.
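
The "validate it like an analytical method" idea can be sketched in a few lines of Python: run the model over a reference dataset, measure performance, and compare it with a predefined acceptance criterion. Everything here is illustrative: `toy_classifier` is a stand-in for a fine-tuned model, the four reference examples are invented, and the 95% threshold is an assumed acceptance criterion, not a regulatory figure.

```python
from typing import Callable, Sequence, Tuple


def validate_against_reference(model: Callable[[str], str],
                               reference: Sequence[Tuple[str, str]],
                               min_accuracy: float) -> dict:
    """Score a model on (input, expected_output) pairs and apply the criterion."""
    if not reference:
        raise ValueError("reference dataset must not be empty")
    correct = sum(1 for text, expected in reference if model(text) == expected)
    accuracy = correct / len(reference)
    return {
        "n": len(reference),
        "correct": correct,
        "accuracy": accuracy,
        "passed": accuracy >= min_accuracy,
    }


def toy_classifier(text: str) -> str:
    # Stand-in for a small fine-tuned model (keyword rule, for demo only).
    return "adverse_event" if "side effect" in text.lower() else "general"


reference_set = [
    ("I had a side effect after the second dose.", "adverse_event"),
    ("Where can I find the package leaflet?", "general"),
    ("Is this a known side effect?", "adverse_event"),
    ("My skin turned red after taking it.", "adverse_event"),  # toy model misses this
]

report = validate_against_reference(toy_classifier, reference_set, min_accuracy=0.95)
print(report)
```

The toy model gets three of four examples right, so it fails the assumed 95% criterion. That is the desired behavior: the check produces a documented pass or fail that can be re-run whenever the model, the prompt, or the reference set changes.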

Pharma’s Sleeping Training Data

Here’s the part I find genuinely exciting - and almost nobody talks about it.

Fine-tuning a small model requires high-quality, task-specific training data. Most industries have to create that data from scratch. Pharma has been accumulating it for decades without realizing it.

Think about what a regulated document approval process actually produces: a first draft, reviewer comments, a revised version, more comments, a final approved version. Every approved medical information letter, every reviewed publication, every released regulatory document carries its full revision history - draft, correction, rationale, final.

That structure - output, expert feedback, improved output - closely matches the examples you need to fine-tune a model. Your document management system is full of examples of what “wrong” looked like, what the expert said about it, and what “right” looked like after correction. Multiply that by twenty years and every therapeutic area.
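
As a rough sketch of how a revision history might be turned into training examples, the Python below maps each draft, reviewer comment, and approved version onto one record and writes the records as JSON Lines. The `Revision` class and the `prompt`/`completion` field names are assumptions for illustration; the exact format depends on the fine-tuning toolkit you use, and the sample text is invented.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Revision:
    """One review cycle from a document history (hypothetical structure)."""
    draft: str
    reviewer_comment: str
    approved: str


def to_training_records(revisions: Iterable[Revision]) -> List[dict]:
    records = []
    for rev in revisions:
        # Skip incomplete cycles: all three parts are needed for a usable example.
        if not (rev.draft.strip() and rev.reviewer_comment.strip() and rev.approved.strip()):
            continue
        records.append({
            "prompt": f"Draft:\n{rev.draft}\n\nReviewer comment:\n{rev.reviewer_comment}",
            "completion": rev.approved,
        })
    return records


history = [
    Revision(
        draft="The product is safe in pregnancy.",
        reviewer_comment="Overstated. Use the approved label wording.",
        approved="There are limited data on use in pregnancy; see the approved label.",
    ),
    Revision(draft="Orphan draft with no review.", reviewer_comment="", approved=""),
]

records = to_training_records(history)
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

Real archives would need more than this, for example access control and checks that the approved version is still current, but the mapping from review cycle to training example is this direct.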

In Part 1, I argued that pharma’s SOP culture is a hidden advantage for writing AI Skills. In Part 2, that its quality systems map directly onto harness engineering. This is the third layer of the same pattern: decades of GxP documentation discipline have quietly produced some of the best fine-tuning datasets in any industry. They’re just sitting in archives, waiting to be used.

The Efficiency Dividend

There’s a second, more pragmatic argument - and it’s the one your CFO will like.

Using a frontier model with a long agentic loop for a narrow, repetitive task is like opening a walnut with a sledgehammer. It works. It’s also absurd. Thousands of tokens burned on planning, reasoning, and self-correction - for a classification a specialized model handles in a fraction of a second.

The right tool is a nutcracker. Small models deliver three dividends at once:

Cost. Narrow tasks run on small models at a fraction of frontier-model token costs - the difference compounds fast at thousands of runs per day.
Deployment. A small fine-tuned model can often run on a laptop or a modest on-premises server. No hyperscaler dependency, and sensitive data never leaves your infrastructure - which in pharma is often the difference between “approved” and “blocked by IT security.”
Sustainability. Frontier inference consumes serious energy. Routing routine work to small models cuts compute, electricity, and CO2 - increasingly relevant as AI usage shows up in corporate ESG reporting.

Where Each Model Belongs

This isn’t an argument against large models. It’s an argument for putting each model where it belongs:

Large frontier models: exploration, ambiguous reasoning, nuanced scientific writing, one-off analyses - the Tier 3 “Skill-only” work from Part 2, plus the genuinely hard generation steps inside a harness.
Small fine-tuned models: every narrow, repeatable, high-volume step in a regulated pipeline - classification, extraction, terminology checks, formatting validation.
Plain code: everything that doesn’t need a model at all - schema checks, completeness verification, document assembly.
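
One way to picture this division of labor is a pipeline in which every step declares which tier handles it. The Python sketch below is illustrative only: the step names, the tier labels, and the two model functions are hypothetical stubs standing in for real model calls.

```python
from typing import Callable, Dict, List, Tuple


def check_completeness(payload: Dict[str, str]) -> dict:
    # Plain code: no model needed to see whether required fields exist.
    required = {"inquiry_text", "product"}
    missing = sorted(required - set(payload))
    return {"complete": not missing, "missing": missing}


def small_model_classify(payload: Dict[str, str]) -> dict:
    # Stub for a small fine-tuned classifier (keyword rule, for demo only).
    text = payload["inquiry_text"].lower()
    return {"category": "dosing" if "dose" in text else "general"}


def frontier_model_draft(payload: Dict[str, str]) -> dict:
    # Stub for the hard generation step a large model would handle.
    return {"draft": f"[draft response about {payload['product']}]"}


# Each step: (name, tier, handler). Tiers mirror the three groups above.
PIPELINE: List[Tuple[str, str, Callable[[Dict[str, str]], dict]]] = [
    ("completeness_check", "code", check_completeness),
    ("classify_inquiry", "small_model", small_model_classify),
    ("draft_response", "frontier_model", frontier_model_draft),
]


def run_pipeline(payload: Dict[str, str]) -> List[dict]:
    trace = []
    for name, tier, handler in PIPELINE:
        result = handler(payload)
        trace.append({"step": name, "tier": tier, "result": result})
        # Stop early if the cheap code check fails; no model is ever called.
        if name == "completeness_check" and not result["complete"]:
            break
    return trace


trace = run_pipeline({"inquiry_text": "What dose for adults?", "product": "ExampleDrug"})
for entry in trace:
    print(entry)
```

Putting the plain-code check first means malformed requests are rejected before any model is called, and the trace records which tier handled which step.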

A practical starting point: pick one harnessed workflow, identify its most repetitive AI step, and ask your document management team what revision histories exist for that task. That conversation is the beginning of your first fine-tuned model.

The Bigger Picture

The companies stuck in pilot purgatory share a pattern: they tried to drop the biggest available model into the middle of a regulated process and discovered that brilliance without traceability is unusable.

The way out isn’t a bigger model. It’s a smaller one - many smaller ones - each doing one job, each validated like the analytical methods pharma has trusted for decades, each leaving a glass-box trail an auditor can walk through step by step.

The future of AI in regulated industries doesn’t belong to the largest models. It belongs to the best-orchestrated small ones.

This is Part 3 of a series on AI-driven SOPs in pharma. Part 1: SOPs for AI - How Pharma’s Most Underrated Skill Became the Key to Agentic Workflows · Part 2: The AI Harness - How to Make Agentic SOPs Audit-Proof in Pharma

This article was co-authored with Anthropic’s Claude Fable 5 model. The ideas, domain expertise, and editorial direction are mine - the AI helped structure, draft, and refine the text.