Open Source · Prompt Harness
self-tuning-loop
A diff-based prompt harness, open-sourced from my own working setup. Refines guidelines from the edits you'd make anyway.
v0.1 · public · MIT · 2026–
What it is & why
I run a handful of AI-drafting pipelines daily — LinkedIn posts, blog drafts, news curation. Every time I edit one of those drafts, I produce a learning signal: the diff between what the model wrote and what I actually shipped. Most teams throw that signal away because the accepted answer for "make the model better" is fine-tuning, GPUs, or labelled data — so the feedback already sitting on every hard drive goes unused.
self-tuning-loop is the small piece I built to close that gap inside my own setup. It worked, so I extracted it as OSS. Each (draft, final) pair is captured; an LLM finds patterns repeating three or more times; each pattern is classified Safe or Risky; only Safe ones are appended to a new version of the prompt guidelines. The output is plain markdown: auditable with git diff, and rollback is one line.
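To make the capture step concrete, here is a minimal sketch assuming Supabase as the store. The `edit_pairs` table, the column names, and the `EditPair` shape are illustrative placeholders, not the repo's actual schema.

```ts
// Hypothetical capture helper: table and column names are illustrative,
// not the repo's actual schema.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_KEY!,
);

interface EditPair {
  pipeline: string;    // e.g. "linkedin", "news", "naver-blog"
  draft: string;       // what the model wrote
  final: string;       // what actually shipped
  created_at: string;  // ISO timestamp
}

// Store one (draft, final) pair; the diff itself is derived at analysis time.
export async function captureEdit(pair: EditPair): Promise<void> {
  const { error } = await supabase.from("edit_pairs").insert(pair);
  if (error) throw error;
}
```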
The loop
┌──────────┐      ┌────────────┐      ┌────────────────┐
│ AI draft │  ─►  │ Human edit │  ─►  │ Diff captured  │
└──────────┘      └────────────┘      └───────┬────────┘
     ▲                                        │
     │                                        ▼
┌────┴──────────────┐           ┌────────────────────┐
│ Updated prompt    │ ◄──[Safe]─│ LLM extracts       │
│ (next draft uses) │           │ repeating pattern  │
└───────────────────┘           └────────────────────┘
         scored per version → improvement visible
Generate · Capture · Analyse · Evolve. The first three run inline with the app; the fourth runs on a cron (weekly is the default). TypeScript on Node.js 22+, Supabase as storage, Anthropic SDK for LLM calls.
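As a sketch of the Analyse step with the Anthropic SDK: the prompt wording, the `Pattern` shape, and the model id below are my illustration under stated assumptions, not the repo's exact implementation.

```ts
// Hedged sketch of the Analyse step. Prompt wording, Pattern shape, and
// model id are illustrative, not the repo's exact implementation.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the env

interface Pattern {
  rule: string;            // e.g. "Replace exclamation marks with full stops"
  occurrences: number;     // how many diffs show this edit
  risk: "safe" | "risky";  // Risky patterns are reported but never auto-applied
}

export async function extractPatterns(diffs: string[]): Promise<Pattern[]> {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514", // placeholder model id
    max_tokens: 2048,
    messages: [{
      role: "user",
      content:
        `Here are ${diffs.length} diffs between AI drafts and the shipped ` +
        `versions. List edit patterns that repeat three or more times as ` +
        `JSON [{rule, occurrences, risk}]. Mark a pattern "risky" if ` +
        `applying it blindly could change meaning rather than style.\n\n` +
        diffs.join("\n---\n"),
    }],
  });

  // Naive parse for the sketch; real code would validate the model output.
  const block = response.content[0];
  const patterns: Pattern[] = JSON.parse(block.type === "text" ? block.text : "[]");

  // Only Safe patterns that cleared the repetition threshold are appended.
  return patterns.filter((p) => p.risk === "safe" && p.occurrences >= 3);
}
```

Returning only Safe patterns mirrors the harness's rule that Risky ones never reach the guidelines automatically.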
How I use it
It started inside MJ Ops. The Saturday retrospective jobs (weekly-linkedin, weekly-review) read each pipeline's past-week diffs and ship an updated prompt for the next week; LinkedIn drafting, news curation, and Naver blog drafting all run through the same loop with their own guideline files. Once the pattern was clearly working on my own pipelines, the harness itself was extracted into this repo.
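A sketch of what one of those weekly jobs might reduce to: `supabase` and `extractPatterns` are the helpers from the sketches above, and the `diff` package (jsdiff) stands in for whatever diffing the repo actually uses.

```ts
// Hypothetical weekly retrospective job. File paths and table names are
// placeholders; `supabase` and `extractPatterns` come from the sketches above.
import { appendFile } from "node:fs/promises";
import { createPatch } from "diff"; // jsdiff, standing in for the real diffing

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

export async function weeklyRetro(pipeline: string, guidelineFile: string) {
  const since = new Date(Date.now() - WEEK_MS).toISOString();

  // Pull the past week's (draft, final) pairs for this pipeline.
  const { data, error } = await supabase
    .from("edit_pairs")
    .select("draft, final")
    .eq("pipeline", pipeline)
    .gte("created_at", since);
  if (error) throw error;

  const diffs = (data ?? []).map((p) => createPatch("draft.md", p.draft, p.final));
  const safe = await extractPatterns(diffs);

  // Appending plain markdown keeps each change reviewable with `git diff`;
  // rolling back a bad rule is deleting one line.
  const lines = safe.map((p) => `- ${p.rule}`).join("\n");
  if (lines) await appendFile(guidelineFile, `\n${lines}\n`);
}
```

Run once per pipeline from the weekly cron, e.g. `weeklyRetro("linkedin", "guidelines/linkedin.md")` (paths hypothetical); the next draft then picks up the new rules.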
Why not fine-tuning, DSPy, TextGrad, OPRO
The academic landscape has serious work on automatic prompt optimisation. self-tuning-loop occupies a different niche: it is the only one of these that uses human edit diffs as its training signal — every other method asks for examples, score functions, or labelled pairs the team has to produce on top of normal work.
| | Fine-tune | DSPy | TextGrad | OPRO | STL |
|---|---|---|---|---|---|
| Cost | $$$ GPU | $ LLM | $$ LLM | $ LLM | $ LLM |
| Data needed | 100s labelled | Examples + metric fn | Differentiable signal | Score function | 3+ diffs |
| Edits as signal | — | — | — | — | ✓ |
| Output format | Black-box weights | Compiled program | Gradient text | Search trace | Markdown |
| Rollback | Restore checkpoint | Recompile | Re-run | Re-run | Delete one line |
| Auditable | No | Partial | Partial | Partial | git diff |
Trade-off: this harness will not beat fine-tuning on hard reasoning. It is built for cases where output style is the thing that needs to converge — tone, formatting, structural conventions.
Status
Public MIT repo, v0.1, README in English and Korean. TypeScript + Supabase + Anthropic SDK. Audience: solo founders and small teams who already produce edits to AI drafts and want the model to actually notice.
Related reading on Minbook
- The Wasted Signal — why every team already has the training signal it needs.
- System Anatomy — the four-step loop and the Safe/Risky classifier.