Stop tweaking your prompts
Every time you change your prompt to fix an edge case, you break something else. Here’s a better way to build reliable AI systems using bem.
Teams building with LLMs are stuck in a loop:
You write a prompt. It mostly works.
Then a user triggers an edge case. You add a clause to the prompt.
Then someone else hits a slightly different edge case. Another clause.
Another exception. Another fallback.
Eventually, the prompt looks like a legal document.
It’s brittle, unpredictable, and impossible to test end-to-end.
If you're building productized AI — not just prototypes — this is a problem.
Prompts aren’t infrastructure. They’re speculation. And when your business depends on precision — financial operations, logistics, compliance, workflow automation — speculation isn’t good enough.
Why Prompt Patching Breaks Down
Let’s say you’ve deployed an LLM pipeline to extract structured data from freight tenders or webhook payloads.
At first, you’ve got a clean, high-performing prompt. It handles most cases.
Then someone forwards a new variation — a different format, a weird header, a slightly ambiguous field label.
The model guesses wrong. You tweak the prompt. Now the original cases are degraded.
You rewrite it again. Now you’re testing behavior across a dozen scenarios and wondering whether it’s time to fine-tune a model just to parse a PDF.
You’re doing prompt surgery, not product development.
Worse: you’re teaching your team to fight the model, not work with it.
What You Actually Want: A Feedback Loop
Let’s look at a realistic case.
You’re building a freight operations dashboard. Your customers forward you PDF tenders from shippers — sometimes structured, often not. Your product uses bem to extract structured fields so you can feed downstream pricing engines, load boards, or ops workflows.
Here’s the schema you’ve configured in bem for your pipeline:
{
  "type": "object",
  "required": [
    "loadReference",
    "origin",
    "destination",
    "weightTons",
    "pickupDate",
    "deliveryDate",
    "equipmentType"
  ],
  "properties": {
    "loadReference": {
      "type": "string",
      "description": "Customer-assigned ID for this freight load"
    },
    "origin": {
      "type": "string",
      "description": "City/state where the load originates"
    },
    "destination": {
      "type": "string",
      "description": "City/state where the load is delivered"
    },
    "weightTons": {
      "type": "number",
      "description": "Weight of the load in tons"
    },
    "pickupDate": {
      "type": "string",
      "format": "date",
      "description": "Earliest pickup date"
    },
    "deliveryDate": {
      "type": "string",
      "format": "date",
      "description": "Required delivery date"
    },
    "equipmentType": {
      "type": "string",
      "description": "Requested equipment (e.g., 53’ dry van)"
    }
  }
}
Let’s say a customer emails in a tender like this:
Please quote the following load: 4/26 pickup in Hutchins, TX to delivery in Denver, CO by 4/29. 42,000 lbs of dry food-grade pallets. Use reference #RDP-3391. Need a 53’ van. No team driver required.
bem extracts this as:
{
  "loadReference": "RDP-3391",
  "origin": "Hutchins, TX",
  "destination": "Denver, CO",
  "weightTons": 21,
  "pickupDate": "2024-04-26",
  "deliveryDate": "2024-04-29",
  "equipmentType": "van"
}
Looks fine — but a human ops analyst catches an issue:
"Van" is too vague. The customer specifically requested 53’ dry van. That's relevant for carrier selection logic.
In traditional prompt engineering, you'd go back into the prompt and add a new rule like:
"If 'van' is mentioned and a length is also specified, use the full equipment description."
But then you risk introducing side effects — and you’re still not guaranteed the model will generalize that rule to other variations ("53 ft", "53-ft", "53 foot", etc.).
Instead, with bem, you just send the correction:
PATCH /v1-beta/transformations
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json
{
  "transformations": [
    {
      "transformationID": "tr_xyz456abc",
      "correctedJSON": {
        "equipmentType": "53’ dry van"
      }
    }
  ]
}
What happens next:
The correction is recorded, associated with this exact transformation ID, pipeline, schema, and input content.
The underlying model is updated.
Future transformations of similar data — especially references to vans + specific lengths — will yield the more precise field value without needing another correction or prompt change.
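In code, that correction round-trip is a single call. Here's a minimal Python sketch using the requests library; the api.bem.ai base URL is an assumption (use whatever base URL your bem account is configured with), and the endpoint and payload mirror the PATCH request above:

import os
import requests

BEM_API = "https://api.bem.ai"  # assumed base URL; check your bem configuration

def send_correction(transformation_id: str, corrected_fields: dict) -> None:
    """Report a human-verified fix for one transformation back to bem."""
    resp = requests.patch(
        f"{BEM_API}/v1-beta/transformations",
        headers={"Authorization": f"Bearer {os.environ['BEM_API_KEY']}"},
        json={
            "transformations": [
                {
                    "transformationID": transformation_id,
                    "correctedJSON": corrected_fields,
                }
            ]
        },
    )
    resp.raise_for_status()

# The analyst's fix from the freight example above:
send_correction("tr_xyz456abc", {"equipmentType": "53’ dry van"})

As in the example above, correctedJSON carries just the fields being fixed; bem already knows the pipeline, schema, and input tied to that transformation ID.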
This is learning at the API level — grounded in real product usage.
How Teams Use This in Real Workflows
Here’s what this looks like in practice.
Payments Infrastructure
A product team is classifying inbound webhook payloads from multiple providers. A new bank integration sends test transactions with an undocumented field variation. The model misclassifies them as production events.
A support engineer uses the bem dashboard to patch one case. Future transactions from that bank are classified correctly.
The model evolves with usage — not via new prompts or retraining.
Supply Chain Platforms
A logistics platform uses bem to parse multi-vendor tenders. One customer’s template has a shifted date field — the model reads the deadline instead of the delivery window.
Rather than trying to prompt around every possible vendor variation, the ops team patches the output once via the API. The next file from that customer parses correctly.
The patch creates a form of memory that lives inside the pipeline — not buried in engineering tickets.
B2B SaaS Ticketing
A helpdesk product is using bem for routing intents from inbound support messages. A new client sends product feedback to "support@" with vague language.
The initial output tags it as a bug. A CSM corrects it to “feature request” through the API.
Over time, bem learns what this customer’s team means by “broken” or “not working” — without needing to hard-code their phrasing into the routing logic.
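In each of these cases, the fix is the same one-line call from the sketch above; only the corrected field changes. For illustration (the transformation IDs, field names, and values here are made up):

# Hypothetical IDs, fields, and values; reuses send_correction from the sketch above.
send_correction("tr_tender_0421", {"deliveryDate": "2024-05-02"})   # shifted date field
send_correction("tr_ticket_7731", {"intent": "feature request"})    # misrouted feedback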
What This Enables
Using PATCH in your feedback loop gives you a few advantages:
You move from prompt engineering to model feedback.
You empower non-engineers (support, ops, QA) to improve accuracy.
You reduce the number of breaking changes introduced by each prompt tweak.
You let the system learn — which is the whole point of using LLMs in the first place.
This is what scaling looks like. You don’t just ship a static prompt and hope.
You ship a dynamic system that improves every day.
How to Start Using It
If you’re using bem today, the PATCH API is already live:
PATCH /v1-beta/transformations
You pass in:
the transformationID of the output you’re correcting
the corrected output (correctedJSON)
You can patch directly from internal tools, from feedback UIs, or as part of human-in-the-loop review steps.
If you’re using our dashboard, corrections are automatically sent through the same flow.
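For the internal-tools case, here's a sketch of a small Flask endpoint that a feedback UI or review queue could call. The route, request shape, and base URL are illustrative assumptions, not part of bem:

import os
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
BEM_API = "https://api.bem.ai"  # assumed base URL

@app.post("/corrections")
def forward_correction():
    # Expects {"transformationID": "...", "correctedJSON": {...}} from the review UI.
    body = request.get_json()
    resp = requests.patch(
        f"{BEM_API}/v1-beta/transformations",
        headers={"Authorization": f"Bearer {os.environ['BEM_API_KEY']}"},
        json={"transformations": [{
            "transformationID": body["transformationID"],
            "correctedJSON": body["correctedJSON"],
        }]},
    )
    return jsonify(forwarded=resp.ok), resp.status_code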
The Bottom Line
If you’re building a product that needs to be right — stop stuffing your prompts with edge-case logic.
That’s not a product strategy. It’s an anxiety loop.
Let your product learn.
Patch bad outputs.
Move forward.
And ship systems that actually get smarter over time.