Open Source · Prompt Harness
self-tuning-loop
A diff-based prompt harness, open-sourced from my own working setup. Refines guidelines from the edits you'd make anyway.
v0.1 · public · MIT · 2026–
What it is & why
I run a handful of AI-drafting pipelines daily — LinkedIn posts, blog drafts, news curation. Every time I edit one of those drafts, I produce a learning signal: the diff between what the model wrote and what I actually shipped. Most teams throw that signal away because the accepted answer for "make the model better" is fine-tuning, GPUs, or labelled data — so the feedback already sitting on every hard drive goes unused.
self-tuning-loop is the small piece I built to close that gap inside my own setup. It worked, so I extracted it as OSS. Each (draft, final) pair is captured; an LLM finds patterns repeating three or more times; each pattern is classified Safe or Risky; only Safe ones are appended to a new version of the prompt guidelines. The output is plain markdown: auditable with git diff, and rollback is one line.
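To make the capture step concrete, here is a minimal sketch assuming Supabase as the store. The `edit_pairs` table, the column names, and the `EditPair` shape are illustrative placeholders, not the repo's actual schema.

```ts
// Hypothetical capture helper: table and column names are illustrative,
// not the repo's actual schema.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_KEY!,
);

interface EditPair {
  pipeline: string;    // e.g. "linkedin", "news", "naver-blog"
  draft: string;       // what the model wrote
  final: string;       // what actually shipped
  created_at: string;  // ISO timestamp
}

// Store one (draft, final) pair; the diff itself is derived at analysis time.
export async function captureEdit(pair: EditPair): Promise<void> {
  const { error } = await supabase.from("edit_pairs").insert(pair);
  if (error) throw error;
}
```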
The loop
┌──────────┐      ┌────────────┐      ┌────────────────┐
│ AI draft │  ─►  │ Human edit │  ─►  │ Diff captured  │
└──────────┘      └────────────┘      └───────┬────────┘
     ▲                                        │
     │                                        ▼
┌────┴──────────────┐           ┌────────────────────┐
│ Updated prompt    │ ◄──[Safe]─│ LLM extracts       │
│ (next draft uses) │           │ repeating pattern  │
└───────────────────┘           └────────────────────┘
         scored per version → improvement visible
Generate · Capture · Analyse · Evolve. The first three run inline with the app; the fourth runs on a cron (weekly is the default). TypeScript on Node.js 22+, Supabase as storage, Anthropic SDK for LLM calls.
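As a sketch of the Analyse step with the Anthropic SDK: the prompt wording, the `Pattern` shape, and the model id below are my illustration under stated assumptions, not the repo's exact implementation.

```ts
// Hedged sketch of the Analyse step. Prompt wording, Pattern shape, and
// model id are illustrative, not the repo's exact implementation.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the env

interface Pattern {
  rule: string;            // e.g. "Replace exclamation marks with full stops"
  occurrences: number;     // how many diffs show this edit
  risk: "safe" | "risky";  // Risky patterns are reported but never auto-applied
}

export async function extractPatterns(diffs: string[]): Promise<Pattern[]> {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514", // placeholder model id
    max_tokens: 2048,
    messages: [{
      role: "user",
      content:
        `Here are ${diffs.length} diffs between AI drafts and the shipped ` +
        `versions. List edit patterns that repeat three or more times as ` +
        `JSON [{rule, occurrences, risk}]. Mark a pattern "risky" if ` +
        `applying it blindly could change meaning rather than style.\n\n` +
        diffs.join("\n---\n"),
    }],
  });

  // Naive parse for the sketch; real code would validate the model output.
  const block = response.content[0];
  const patterns: Pattern[] = JSON.parse(block.type === "text" ? block.text : "[]");

  // Only Safe patterns that cleared the repetition threshold are appended.
  return patterns.filter((p) => p.risk === "safe" && p.occurrences >= 3);
}
```

Returning only Safe patterns mirrors the harness's rule that Risky ones never reach the guidelines automatically.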
How I use it
It started inside MJ Ops. The Saturday retrospective jobs (weekly-linkedin, weekly-review) read each pipeline's past-week diffs and ship an updated prompt for the next week; LinkedIn drafting, news curation, and Naver blog drafting all run through the same loop with their own guideline files. Once the pattern was clearly working on my own pipelines, the harness itself was extracted into this repo.
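A sketch of what one of those weekly jobs might reduce to: `supabase` and `extractPatterns` are the helpers from the sketches above, and the `diff` package (jsdiff) stands in for whatever diffing the repo actually uses.

```ts
// Hypothetical weekly retrospective job. File paths and table names are
// placeholders; `supabase` and `extractPatterns` come from the sketches above.
import { appendFile } from "node:fs/promises";
import { createPatch } from "diff"; // jsdiff, standing in for the real diffing

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

export async function weeklyRetro(pipeline: string, guidelineFile: string) {
  const since = new Date(Date.now() - WEEK_MS).toISOString();

  // Pull the past week's (draft, final) pairs for this pipeline.
  const { data, error } = await supabase
    .from("edit_pairs")
    .select("draft, final")
    .eq("pipeline", pipeline)
    .gte("created_at", since);
  if (error) throw error;

  const diffs = (data ?? []).map((p) => createPatch("draft.md", p.draft, p.final));
  const safe = await extractPatterns(diffs);

  // Appending plain markdown keeps each change reviewable with `git diff`;
  // rolling back a bad rule is deleting one line.
  const lines = safe.map((p) => `- ${p.rule}`).join("\n");
  if (lines) await appendFile(guidelineFile, `\n${lines}\n`);
}
```

Run once per pipeline from the weekly cron, e.g. `weeklyRetro("linkedin", "guidelines/linkedin.md")` (paths hypothetical); the next draft then picks up the new rules.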
Why not fine-tuning, DSPy, TextGrad, OPRO
The academic landscape has serious work on automatic prompt optimisation. self-tuning-loop occupies a different niche: it is the only one of these that uses human edit diffs as its training signal — every other method asks for examples, score functions, or labelled pairs the team has to produce on top of normal work.
| | Fine-tune | DSPy | TextGrad | OPRO | STL |
|---|---|---|---|---|---|
| Cost | $$$ GPU | $ LLM | $$ LLM | $ LLM | $ LLM |
| Data needed | 100s labelled | Examples + metric fn | Differentiable signal | Score function | 3+ diffs |
| Edits as signal | — | — | — | — | ✓ |
| Output format | Black-box weights | Compiled program | Gradient text | Search trace | Markdown |
| Rollback | Restore checkpoint | Recompile | Re-run | Re-run | Delete one line |
| Auditable | No | Partial | Partial | Partial | git diff |
Trade-off: this harness will not beat fine-tuning on hard reasoning. It is built for cases where output style is the thing that needs to converge — tone, formatting, structural conventions.
Status
Public MIT repo, v0.1, README in English and Korean. TypeScript + Supabase + Anthropic SDK. Audience: solo founders and small teams who already produce edits to AI drafts and want the model to actually notice.
Related reading on Minbook
- The Wasted Signal — why every team already has the training signal it needs.
- System Anatomy — the four-step loop and the Safe/Risky classifier.