
The AI Harness: How to Make Agentic SOPs Audit-Proof in Pharma

Skills tell AI what to do. Harnesses ensure it actually does it. Learn how the 12 principles of harness engineering - from validation gates to parallel sub-agents - turn AI-driven workflows into audit-ready processes for regulated industries.

In my previous article, I showed how pharma’s SOP expertise translates directly into AI Skills - reusable instruction sets that make AI agents more reliable. But I also left you with a caveat: Skills alone won’t get you to audit-ready.

Here’s why - and what will.

The March of Nines

Let’s start with a number that should make every quality professional uncomfortable.

A well-written Skill gets you to roughly 90% reliability per step. That sounds impressive until you do the math. A 10-step workflow at 90% per step means your overall success rate is about 35%. Run that workflow 20 times a day, and you’re looking at 13 failures daily.
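The compounding arithmetic is worth making explicit. A minimal sketch:

```python
# Per-step reliability compounds multiplicatively across a workflow.
def workflow_success_rate(per_step: float, steps: int) -> float:
    """Probability that every step in the workflow succeeds."""
    return per_step ** steps

rate = workflow_success_rate(0.90, 10)
print(f"10-step success rate: {rate:.0%}")   # ~35%
failures = round(20 * (1 - rate))
print(f"Expected failures in 20 daily runs: {failures}")   # ~13
```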

In pharma, a single failure in a regulated process can trigger a CAPA, a deviation report, or worse - an audit finding. “90% reliable” isn’t a feature. It’s a risk.

This is what AI engineers call the March of Nines: getting from 90% to 99% to 99.9% reliability requires exponentially more effort. And each nine matters enormously when the consequence of failure is regulatory.

Why Prompts Hit a Ceiling

Skills are, at their core, prompts. Sophisticated, structured, reusable prompts - but prompts nonetheless. And prompts have inherent limitations:

  • The AI can skip steps. Even with a clear procedure, a model might compress two steps into one or omit a validation check it deems “obvious.”
  • The AI can hallucinate. It might cite a reference that doesn’t exist or generate safety language that sounds right but doesn’t match the approved text.
  • The AI can quit early. Complex multi-step workflows sometimes cause the model to summarize remaining steps rather than execute them.
  • The AI can drift. In long conversations, earlier instructions get diluted by later context - a phenomenon called “context rot.”

None of these failures are malicious. They’re statistical. And in a regulated environment, statistical failures need systematic solutions - not better prompts.

Enter the Harness

An AI Harness is a software layer that wraps around an AI model to enforce process compliance through code, not hope.

Instead of giving the AI a procedure and trusting it to follow every step, a harness defines the workflow in deterministic software and uses the AI only where its capabilities are actually needed - language understanding, data extraction, content generation.

Think of the difference this way: a Skill is the SOP document. The harness is the entire quality system - validation protocols, batch records, deviation management, and audit trails. You need both. But the harness is what makes the system trustworthy.

The 12 Principles of Harness Engineering

Building a harness isn’t ad hoc. It’s a discipline. The following twelve principles define what separates a robust, audit-ready AI system from a fragile prompt chain. For pharma professionals, many of these will feel immediately familiar - they map directly to concepts you already use in GxP environments.

1. Architecture First

Before writing a single line of code, decide on the design pattern. Just like you wouldn’t build a manufacturing line without a process flow diagram, you don’t build a harness without an architecture.

The main patterns:

  • Specialized: One agent, one task. Best for narrow, high-reliability processes like adverse event classification.
  • Hierarchical (Multi-Agent): A supervisor agent delegates to specialized sub-agents. Ideal for complex workflows like regulatory submission preparation where multiple domains intersect.
  • DAG/Graph-Based: Steps are organized as a directed graph with parallel and sequential branches. Perfect for processes like batch release where some checks can run simultaneously.

In pharma terms: this is your process flow diagram. You define it before you validate - not after.

2. Fixed Plans for Regulated Processes

Harnesses support two planning modes: fixed and dynamic. For any GxP-relevant workflow, use fixed plans - every step predefined, every branch mapped, no improvisation.

Dynamic plans - where the AI decides its next step based on what it finds - are useful for exploratory tasks like literature screening. But for anything that needs to survive an audit, fixed plans are non-negotiable. An auditor will ask “what steps does the system follow?” You need a deterministic answer, not “it depends on what the AI decides.”
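A fixed plan can be declared as data, not prose: an ordered list of steps, each paired with a deterministic gate. The step names and gate checks below are illustrative, not from a real system:

```python
# A fixed plan: every step and its gate declared up front, in code.
# Step names and gate conditions are illustrative placeholders.
FIXED_PLAN = [
    ("extract_data",    lambda out: out is not None),
    ("classify_signal", lambda out: out in {"new", "changed", "unchanged"}),
    ("draft_response",  lambda out: isinstance(out, str) and len(out) > 0),
]

def run_fixed_plan(execute_step):
    """Run every step in order; stop and report at the first failed gate."""
    for name, gate in FIXED_PLAN:
        output = execute_step(name)   # the AI (or code) does the work
        if not gate(output):          # the harness decides pass/fail
            return {"status": "failed", "step": name}
    return {"status": "completed"}
```

When the auditor asks what steps the system follows, the answer is this list - readable, versionable, and identical on every run.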

3. Virtual File System (The Scratchpad)

Give the agent a workspace - a virtual file system where it can read and write intermediate results. This acts as persistent memory within a session.

Why this matters for pharma: intermediate outputs are traceable. If a sub-agent extracts clinical data in Phase 1 and writes it to the workspace, Phase 3 can read exactly what was extracted - no paraphrasing, no lossy summarization through the AI’s context window. The scratchpad is your in-process batch record.
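A scratchpad can be as simple as JSON files in a session directory. A minimal sketch (the class and file layout are assumptions, not a specific product's API):

```python
import json
from pathlib import Path

class Scratchpad:
    """A session workspace: intermediate outputs written once, read back verbatim."""
    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, name: str, data: dict) -> None:
        (self.root / f"{name}.json").write_text(json.dumps(data, indent=2))

    def read(self, name: str) -> dict:
        # Later phases read exactly what was written - no lossy re-summarization
        # through the model's context window.
        return json.loads((self.root / f"{name}.json").read_text())
```

Phase 1 calls `pad.write("extracted_data", …)`; Phase 3 calls `pad.read("extracted_data")` and gets the identical bytes back.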

4. Task Delegation with Context Isolation

Instead of one massive prompt trying to do everything, delegate sub-tasks to specialized sub-agents. Each sub-agent receives only the context it needs - nothing more.

This solves the context rot problem. A monolithic agent processing a 50-page regulatory document will start forgetting instructions by page 30. But a sub-agent that only sees one section at a time? It executes with full fidelity every time.

Bonus: you can use smaller, faster, cheaper models for narrow tasks (classification, data extraction) and reserve more powerful models for tasks that genuinely need them (nuanced scientific writing). This isn’t just efficient - it’s how you justify AI costs to your finance team.

5. Parallel Processing

When sub-tasks are independent, run them simultaneously. A medical inquiry harness doesn’t need to check SmPC Section 4.1, Section 4.2, and Section 4.8 sequentially - it can check all three at once.

Applied at scale: imagine screening 30 congress abstracts for relevance. A serial agent processes them one by one over 30 minutes. A harnessed system spins up 30 parallel sub-agents and finishes in under 2 minutes. Same quality. Fraction of the time.
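With independent sub-tasks, the fan-out is a few lines of standard Python. The `screen_abstract` body below is a placeholder; a real harness would call a model API there:

```python
from concurrent.futures import ThreadPoolExecutor

def screen_abstract(abstract: str) -> dict:
    """Placeholder for a sub-agent call; a real system would call a model API."""
    return {"abstract": abstract, "relevant": "oncology" in abstract.lower()}

def screen_in_parallel(abstracts: list[str], max_workers: int = 30) -> list[dict]:
    # Independent sub-tasks run concurrently; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(screen_abstract, abstracts))
```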

6. Tool Guardrails

Explicitly control which tools and actions each agent can access. A sub-agent classifying an inquiry type doesn’t need access to the email system. A drafting agent doesn’t need access to the regulatory submission portal.

This is the principle of least privilege - and pharma knows it well from IT system validation. In harness terms: you define a whitelist of permitted tools per agent. Any action outside that whitelist is blocked by software, not by hoping the AI won’t try.

For sensitive actions - submitting a regulatory response, updating a safety database, sending a medical information letter - the harness requires explicit triggers or approvals before proceeding.
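In code, least privilege is a per-agent whitelist plus an approval check for sensitive actions. Agent and tool names here are illustrative:

```python
# Per-agent tool whitelists enforced in software, not in the prompt.
# Agent and tool names are illustrative placeholders.
TOOL_WHITELIST = {
    "classifier_agent": {"read_inquiry", "write_scratchpad"},
    "drafting_agent":   {"read_scratchpad", "search_knowledge_base", "write_scratchpad"},
    "dispatch_agent":   {"send_medical_letter"},
}

# Sensitive actions additionally require an explicit human approval.
REQUIRES_APPROVAL = {"send_medical_letter", "update_safety_database"}

def authorize(agent: str, tool: str, approved: bool = False) -> bool:
    """Block anything outside the whitelist; gate sensitive actions on approval."""
    if tool not in TOOL_WHITELIST.get(agent, set()):
        return False
    if tool in REQUIRES_APPROVAL and not approved:
        return False
    return True
```

The classifier physically cannot send an email, and even the dispatch agent cannot send a letter without an approval flag set by a human step.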

7. Memory Management

Harnesses implement two layers of memory:

  • Short-term memory: Markdown files in the workspace tracking session progress - which steps completed, which gates passed, what the current status is. Think of this as the in-process checklist.
  • Long-term memory: Knowledge graphs or retrieval-augmented generation (RAG) systems that persist knowledge across sessions. Your approved standard response library, your product knowledge base, your KOL database - all accessible but external to the agent’s context window.

This separation is critical. Short-term memory keeps the current run on track. Long-term memory ensures institutional knowledge is available without bloating the prompt.

8. State Machines

Codify the workflow as a formal state machine. Each step is a state. Transitions between states are governed by gate results. The current state is persisted in a database.

Why this matters: recovery. If the system fails mid-process - the API times out, the server restarts, anything - the state machine knows exactly where it stopped. It resumes from the last successful state, not from the beginning.

In pharma terms: this is equivalent to a batch record that tracks exactly where the process was interrupted and how it was resumed. Auditors love batch records that show graceful recovery.
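A minimal sketch of a resumable state machine - state names are illustrative, and a production system would persist to a database rather than a file:

```python
import json
from pathlib import Path

# Illustrative workflow states, in execution order.
STATES = ["extract", "classify", "analyze", "assemble", "done"]

def load_state(path: Path) -> str:
    """Resume from the last persisted state, or start fresh."""
    return json.loads(path.read_text())["state"] if path.exists() else STATES[0]

def save_state(path: Path, state: str) -> None:
    path.write_text(json.dumps({"state": state}))

def run(path: Path, execute) -> str:
    state = load_state(path)
    # Resume mid-workflow: states completed in a previous run are skipped.
    for current in STATES[STATES.index(state):-1]:
        execute(current)   # do the work for this step
        save_state(path, STATES[STATES.index(current) + 1])
    return load_state(path)
```

If the process dies after "classify", the persisted state is "analyze" - the next run picks up exactly there.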

9. Sandboxed Code Execution

Allow the agent to write and execute code - but only within an isolated sandbox. This lets the harness verify the AI’s work programmatically.

Example: the AI generates a statistical summary of clinical trial data. Instead of trusting the numbers, the harness executes a Python script in a sandbox that independently recalculates the statistics from the raw data. Numbers match? Gate passes. Numbers diverge? Gate fails, escalation triggered.

This is AI-powered double-checking - the kind of verification that would take a human reviewer hours but takes a sandbox milliseconds.
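The verification itself can be tiny; the engineering effort goes into the isolation. A sketch of the comparison step, assuming the raw values are available to the harness (a real system would run this inside a sandboxed process):

```python
import statistics

def verification_gate(raw_values: list[float], ai_reported_mean: float,
                      tolerance: float = 1e-9) -> bool:
    """Independently recompute the statistic and compare to the AI's claim.
    A real harness would execute this inside an isolated sandbox."""
    recomputed = statistics.mean(raw_values)
    return abs(recomputed - ai_reported_mean) <= tolerance
```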

10. Context Management

Keep the main supervisor’s context lean. Summarize tool outputs instead of passing raw data. Store large datasets in the file system. Only load what’s needed for the current decision.

This isn’t just a performance optimization - it directly impacts reliability. A model with a 200,000-token context window that’s 90% full of irrelevant data will perform worse than one with a focused 20,000-token context. Lean context means better decisions.

11. Human-in-the-Loop Touchpoints

Build explicit pause points where the system waits for human input or approval. These aren’t afterthoughts - they’re designed into the harness architecture.

In a medical inquiry workflow, the harness might pause after classification:

“This inquiry has been classified as off-label with 78% confidence. The system requires human confirmation before proceeding with off-label response generation. Confirm or reclassify?”

The human doesn’t review every output. They review the decisions that matter most. The harness decides which decisions those are - based on confidence scores, risk thresholds, or regulatory classification.
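The routing logic can be a small, auditable function. The categories and threshold below are assumptions for illustration:

```python
# Risk-based pause: the harness, not the human, decides what needs review.
# Categories and the 0.90 threshold are illustrative.
def needs_human_review(classification: str, confidence: float) -> bool:
    high_risk = {"off-label", "adverse_event"}
    if classification in high_risk:
        return True              # always pause on high-risk categories
    return confidence < 0.90     # otherwise pause only when the model is unsure
```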

This is risk-based quality management applied to AI - exactly what ICH Q9/Q10 envisions.

12. Validation Loops

When a gate fails, the harness doesn’t just stop. It loops: sends the failed output back to the AI with the specific failure reason and asks for a corrected version. If the second attempt fails, it escalates.

Agent generates response → Gate checks safety language
  → PASS → Continue to next step
  → FAIL → "Safety language missing. Required text: [exact text]. Regenerate."
    → Agent regenerates → Gate re-checks
      → PASS → Continue
      → FAIL → Escalate to human reviewer

This is the AI equivalent of an in-process check with rework loop - standard in pharmaceutical manufacturing. The difference is that it runs in seconds, not hours.
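The loop above translates almost directly into code. A minimal sketch - the required safety text and the two-attempt limit are illustrative:

```python
def validation_loop(generate, gate, max_attempts: int = 2):
    """Generate -> check -> retry once with the failure reason -> escalate."""
    feedback = None
    output, reason = None, None
    for _ in range(max_attempts):
        output = generate(feedback)    # the AI drafts (or redrafts)
        passed, reason = gate(output)  # deterministic check in code
        if passed:
            return {"status": "pass", "output": output}
        feedback = f"Gate failed: {reason}. Regenerate."
    return {"status": "escalate", "last_output": output, "reason": reason}

# Example gate: required safety language must appear verbatim.
REQUIRED = "Refer to the approved SmPC for full safety information."
def safety_gate(text):
    return (REQUIRED in text, f"missing required text: {REQUIRED!r}")
```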

From Principles to Practice: The 8-Phase Harness Workflow

Principles are useful. But what does a fully harnessed workflow actually look like in execution? Let me walk through an 8-phase implementation pattern and translate each phase into a pharma SOP context.

I’ll use a Periodic Safety Update Report (PSUR) preparation workflow as the example - one of the most complex, high-stakes processes in pharmacovigilance.

Phase 1: Data Extraction and Verification

What happens: The harness extracts raw data from source documents - ICSR databases, clinical trial reports, post-marketing safety data.

How it’s harnessed: After extraction, a programmatic verification step checks completeness. Did the system capture all reporting periods? Are patient counts consistent across tables? Is the data format valid?

Gate: Extraction completeness >= 99.5%. Missing fields flagged and logged. No AI involved in this verification - it’s pure code.

Phase 2: Structured Classification

What happens: An AI agent classifies each safety signal by seriousness, expectedness, and causality using a validated JSON schema.

How it’s harnessed: The output must conform to a predefined schema. No free-text classifications allowed. If the AI returns “likely related” instead of one of the schema-defined values - “possibly related,” “probably related,” or “definitely related” - the gate rejects it.

Gate: Schema validation passes. All required fields populated. Classification values match the controlled vocabulary exactly.
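A controlled-vocabulary gate like this is a few lines of code. The field names and vocabularies below are illustrative placeholders for a validated schema:

```python
# Controlled vocabularies for classification (values are illustrative).
SCHEMA = {
    "seriousness":  {"serious", "non-serious"},
    "expectedness": {"expected", "unexpected"},
    "causality":    {"possibly related", "probably related", "definitely related"},
}

def schema_gate(classification: dict) -> tuple[bool, list[str]]:
    """Reject any output with missing fields or out-of-vocabulary values."""
    errors = []
    for field, allowed in SCHEMA.items():
        value = classification.get(field)
        if value is None:
            errors.append(f"missing field: {field}")
        elif value not in allowed:
            errors.append(f"invalid value for {field}: {value!r}")
    return (not errors, errors)
```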

Phase 3: Human-in-the-Loop Clarification

What happens: The harness pauses and presents the classified data to the safety scientist. “These 12 signals were classified as ‘new.’ These 3 were classified as ‘changed risk profile.’ Please confirm or adjust before analysis proceeds.”

Why this exists: Some decisions require medical judgment that no AI should make autonomously. The harness knows which decisions those are and enforces the pause.

Phase 4: Knowledge Retrieval (RAG)

What happens: The harness loads relevant context from the knowledge base - the previous PSUR, the current Reference Safety Information (RSI), company SOPs for signal evaluation, regulatory guidelines (GVP Module VII).

How it’s harnessed: The retrieval is deterministic. The harness queries specific document IDs, not a general semantic search. The agent gets exactly the context it needs - no more, no less.

Phase 5: Chunking and Targeted Extraction

What happens: For large documents (a PSUR can reference hundreds of ICSRs), the harness programmatically chunks the data into manageable units - one chunk per signal, one chunk per SOC (System Organ Class).

How it’s harnessed: The chunking is done by code, not by the AI. Each chunk has metadata (signal ID, SOC code, reporting period) attached programmatically. Nothing is lost in summarization.

Phase 6: Parallel Risk Analysis

What happens: The harness spins up a dedicated sub-agent for each safety signal. Each sub-agent receives only its signal’s data plus the relevant RSI section, analyzes the signal against the previous PSUR’s assessment, evaluates whether the benefit-risk profile has changed, and produces a structured assessment.

How it’s harnessed: 30 signals analyzed in parallel, each in an isolated context. Sub-agent A analyzing Signal 1 cannot see or be influenced by Sub-agent B’s analysis of Signal 2. Context isolation ensures independent assessment - the same principle behind blinded review in clinical trials.

Gate per sub-agent: Assessment follows the required structure. All evidence citations reference documents that exist in the knowledge base. Conclusion is one of the predefined categories.

Phase 7: Narrative Generation

What happens: Sub-agents generate narrative text for each PSUR section - the signal summaries, the benefit-risk evaluation, the overall safety assessment.

How it’s harnessed: Each narrative is validated against approved terminology (no hallucinated product names or indications), consistency with the structured data from Phase 6 (numbers in the narrative match numbers in the tables), and required regulatory language (specific phrases mandated by GVP Module VII).

Validation loop: If a narrative fails any check, it’s returned to the sub-agent with the specific failure. The sub-agent corrects and resubmits. Two failures lead to escalation to the safety scientist.

Phase 8: Programmatic Document Assembly

What happens: The final PSUR document is assembled - but not by the AI. The harness programmatically populates a pre-validated Word template with the AI-generated content.

Why this matters: Letting an AI generate a complete Word document is unreliable. Formatting breaks, sections get reordered, headers disappear. Instead, the harness owns the document structure. It places the narrative text from Phase 7 into the correct template sections, inserts the validated tables from Phase 6, and generates the table of contents programmatically.

The AI wrote the content. The software assembled the document. Both did what they’re best at.
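The division of labor can be sketched with a plain-text template; a production harness would populate a validated Word template instead, but the principle - the software owns the structure, the AI only supplies section content - is the same. Section names here are illustrative:

```python
from string import Template

# A pre-validated skeleton owned by the harness; the AI never touches structure.
# Section names are illustrative placeholders.
PSUR_TEMPLATE = Template(
    "1. Signal Summaries\n$signal_summaries\n\n"
    "2. Benefit-Risk Evaluation\n$benefit_risk\n"
)

def assemble(sections: dict) -> str:
    # substitute() (unlike safe_substitute) raises if any required
    # section is missing - the assembly step is itself a gate.
    return PSUR_TEMPLATE.substitute(sections)
```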

The Audit Trail: What the Harness Produces

At the end of this workflow, you don’t just have a PSUR. You have:

  1. Extraction log - what data was pulled, from which sources, with completeness scores
  2. Classification log - every signal classification with the AI’s output, schema validation result, and human confirmation
  3. Retrieval log - which knowledge base documents were loaded, with document IDs and versions
  4. Analysis logs - per-signal structured assessments with gate pass/fail results
  5. Validation logs - every narrative check, every failure, every correction loop, every escalation
  6. Assembly log - which content was placed into which template section

This is a digital batch record for AI-assisted document preparation. When the auditor arrives, you don’t explain your process verbally. You hand them the logs.
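Each log entry can be one timestamped, machine-readable line. A sketch with illustrative field names:

```python
import json
from datetime import datetime, timezone

def log_gate_result(run_id: str, phase: str, gate: str,
                    passed: bool, detail: str) -> str:
    """One timestamped JSON line per gate result - a digital batch record
    entry an auditor can replay. Field names are illustrative."""
    entry = {
        "run_id": run_id,
        "phase": phase,
        "gate": gate,
        "result": "PASS" if passed else "FAIL",
        "detail": detail,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(entry)
```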

The Audit Conversation Changes Completely

Without a harness:

Auditor: “How do you ensure the AI follows the SOP?”
You: “We gave it detailed instructions.”
Auditor: “How do you verify it followed them?”
You: “…we review the output.”
Auditor: “How do you document the AI’s decision-making process?”
You: silence

With a harness:

Auditor: “How do you ensure the AI follows the SOP?”
You: “Each step is enforced by software. The AI only executes within defined boundaries. Every step has a validation gate that must pass before the workflow continues.”
Auditor: “How do you handle failures?”
You: “Validation loops allow one automated retry with specific error feedback. After two failures, the system escalates to a human reviewer. All failures and resolutions are logged.”
Auditor: “Show me the evidence for Run #247.”
You: pulls up the execution log with timestamped pass/fail results for every gate, every sub-agent, every correction loop

That’s the difference between “we trust the AI” and “we verify the AI.” Regulators don’t accept trust. They accept evidence.

Getting Started: Where to Harness First

You don’t need to harness every workflow on day one. Start where the regulatory risk is highest and the process is most structured:

Tier 1 - Harness Now: Medical inquiry responses (high volume, compliance-sensitive, well-documented SOPs), adverse event intake processing (patient safety critical, zero tolerance for errors), and regulatory document preparation (PSUR, DSUR, RMP updates).

Tier 2 - Harness Next: Literature monitoring and signal detection, training material generation, KOL communication drafting.

Tier 3 - Skill-Only (For Now): Internal meeting summaries, congress coverage reports, ad-hoc research queries.

Skills + Harness = The Full Stack

Skills and harnesses aren’t competing concepts. They’re complementary layers:

| Layer | What It Does | Pharma Analogy |
| --- | --- | --- |
| Skill | Tells the AI how to perform a task | The SOP document |
| Harness | Ensures the AI actually follows the procedure and validates every output | The QA system - validation, batch records, deviation management |
| Architecture | Defines which pattern (specialized, hierarchical, graph) fits the workflow | The process flow diagram |
| State Machine | Tracks execution status and enables recovery | The batch record tracking system |
| Audit Logs | Records every decision, gate result, and escalation | The GxP audit trail |

You need all of them. A harness without good Skills is an empty quality system with no procedures. Skills without a harness are procedures with no quality system. Pharma learned decades ago that you need both to operate reliably. The same is true for AI.

The Pharma Advantage, Part Two

In my previous article, I argued that pharma’s SOP culture is a competitive advantage for building AI Skills. The same is true for harnesses - perhaps even more so.

Look at the 12 principles again: Validation gates? That’s IQ/OQ/PQ. State machines with recovery? That’s batch record management. Tool guardrails and least privilege? That’s IT system access control. Human-in-the-loop at risk-based touchpoints? That’s ICH Q9/Q10 quality risk management. Audit logs for every decision? That’s every GxP system ever built. Fixed plans for regulated processes? That’s how pharma has operated for 50 years.

The mental model for harnessing AI isn’t new to this industry. The implementation technology is new, but the philosophy is decades old. Other industries are scrambling to invent these concepts. Pharma already lives them.

The organizations that move first - translating their existing quality frameworks into AI harness architectures - won’t just have faster AI workflows. They’ll have compliant AI workflows. And in pharma, compliance isn’t optional. It’s the price of admission.

Your quality system isn’t overhead. It’s the blueprint for your AI infrastructure.


This is Part 2 of a series on AI-driven SOPs in pharma. Part 1: SOPs for AI - How Pharma’s Most Underrated Skill Became the Key to Agentic Workflows


This article was co-authored with Anthropic’s Claude Opus 4.6 model. The ideas, domain expertise, and editorial direction are mine - the AI helped structure, draft, and refine the text.

Dr. Artur Kokornaczyk

Medical Affairs Lead in Oncology with 10+ years of experience. Passionate about AI, digital strategy, and building systems that amplify the impact of medical science.