← Back to insights

Why the same prompt and the same model gives different output

You ran the brief-to-draft prompt yesterday and the output cleared the editorial gate. You ran the same prompt this morning and the draft would not survive self-review. The model has not changed; the team blames “the AI being weird today”. The diagnosis is mechanical, and the fix is a discipline.

What is actually changing

Four mechanisms produce run-to-run drift on a prompt that “did not change”:

  1. Sampling. The model is not deterministic at default settings. Each run draws from a probability distribution; two draws on the same prompt land in different places. Temperature controls the spread — at temperature 0 the spread narrows, at 0.7 the spread is wide enough to surface the variance you are seeing.
  2. System-message drift. The chat surface — chatgpt.com, claude.ai, anything you talk to in a browser — carries a system message you did not write. Vendors update it. Yesterday’s system message biased one way; today’s biases another. A prompt run inside a chat surface is not the same artefact as a prompt run through the API with a system message you control.
  3. Hidden state in the chat session. A long-running chat carries previous turns into every subsequent turn. The model’s interpretation of your latest prompt is conditioned on the conversation history. A fresh chat tab and your hundred-message work chat do not run the same prompt, even when you paste the same text.
  4. Silent model updates. The model name on the menu — gpt-4, claude-sonnet, gemini-pro — is a label, not a snapshot. Vendors push point updates that change behaviour without changing the name. The team that pinned the prompt three months ago is running it against a different model today.

The prompt is not just the prompt body

The artefact your team’s craft compounds in is a tuple, not a string:

prompt body + model snapshot + temperature + system message + tool config

Storing the body alone is half the artefact. The rerun that gives you different output tomorrow is rerunning a different five of those, and you have no way to tell which.

What you do

Three moves close the gap:

  • Pin the tuple. Use API-pinned model snapshots — the dated identifiers Anthropic and OpenAI publish in their model documentation, not the menu labels in chatgpt.com or claude.ai. Set temperature explicitly. Write the system message yourself.
  • Store the artefact as a file. One template file per prompt, with declared variables, the body, the settings, and the worked example. The Playbook’s prompt template schema names the fields.
  • Treat rerun as structural retry. A second run on the same tuple is your team’s quality-control sample. A second run that pastes the same body into a different chat tab is folklore.

Output stops drifting when you stop rerunning a prompt body and start rerunning a pinned tuple. The team’s craft compounds in an artefact; it does not compound in a chat history.

If you want the long-form treatment, the sample chapter is free in your browser, and the full Playbook is the rest of the work.


Liked this? Most of the work it draws on lives in the Playbook. See what's on offer or read two sample chapters.