Pre-release version
Hillclimber docs
Overview
Hillclimber is a framework for long-running agentic sessions aimed at measurable, eval-driven codebase improvement.
Its distinguishing feature is that it pushes you to explicitly define an eval function and a spec (success criteria, budget, models) for each experiment. This is particularly useful when you want to run long sessions but don't have unlimited tokens to burn and want fine control over long-running jobs.
Because it is open-source and harness-agnostic, Hillclimber lets you swap harnesses (Claude Code, Codex, Cursor, etc.) and models, and choose the combination that best suits your needs and budget.
Getting started
To run Hillclimber:

1. cd to your project and run the init command.

       cd my_projects/project_x
       hillclimber init -i

   After running the init command and following the wizard instructions, two files will be produced: hillclimber.toml and eval.py.

   hillclimber.toml: defines the specs for the experiment. Hillclimber was designed to explicitly push users towards defining goal, budget, models, etc.
   eval.py: defines an eval/fitness function.

2. Implement the evaluate function inside eval.py.

   Hillclimber uses the eval.py file to calculate the baseline score and delta for each cycle of the experiment. You must implement evaluate before running a Hillclimber experiment.

   Pro tip: ask the coding agent of your choice to implement it for you if you are lazy 😉

3. Commit the hillclimber.toml and eval.py files.

   Annoyance Warning
   This is an annoying part of the current version and I'm looking forward to a better solution in upcoming versions. Thank you for being with me!

   Hillclimber runs each experiment in its own dedicated workspace, forked from your latest commit, which is what lets it run multiple cycles in parallel. The tradeoff: because those workspaces are checked out from committed state, any uncommitted work (including the freshly created hillclimber.toml and eval.py) won't make it into them.

   Before a run, you'll need to commit everything; otherwise Hillclimber will stop and ask you to, rather than risk scoring two different versions of your code.

4. Start climbing.

   Execute the run command and Hillclimber starts improving your codebase.

       hillclimber run
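To make the evaluate step more concrete, here is a rough sketch of what an eval.py could look like. It is an illustration only, under assumptions: it assumes evaluate takes no arguments and returns a float where higher is better, which may not match the signature the init wizard generates. The names normalize_whitespace, CASES, and score_cases are hypothetical and not part of Hillclimber.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical stand-in for code that lives in the artefact being improved.
def normalize_whitespace(text: str) -> str:
    return " ".join(text.split())

# Hypothetical (input, expected output) pairs that define "good" behaviour.
CASES = [("a  b", "a b"), ("  lead", "lead"), ("tab\tsep", "tab sep"), ("", "")]

def score_cases(fn: Callable[[str], str],
                cases: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of cases that fn handles correctly (0.0 to 1.0)."""
    cases = list(cases)
    if not cases:
        return 0.0
    passed = 0
    for given, expected in cases:
        try:
            if fn(given) == expected:
                passed += 1
        except Exception:
            # A crash counts as a failed case instead of aborting the eval.
            pass
    return passed / len(cases)

def evaluate() -> float:
    """Assumed entry point: a single score, higher is better."""
    return score_cases(normalize_whitespace, CASES)
```

A score like this is cheap to compute and deterministic, which matters if the same function is used for both the baseline and every cycle's delta: a noisy eval makes small improvements hard to tell apart from chance.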
Key concepts
Experiment
One full run of the hillclimber run command.
Cycle
One attempt to improve the codebase. An experiment consists of 1..n cycles. Cycles can run in parallel, so you explore multiple improvements at once rather than one at a time.
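The idea of running cycles in parallel and comparing their deltas can be sketched as follows. This is not Hillclimber's implementation: run_cycle is a hypothetical stand-in that returns a fixed delta per cycle, where a real cycle would have an agent edit the artefact in its own workspace and then re-run the eval function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

def run_cycle(cycle_id: int) -> Tuple[int, float]:
    """Hypothetical cycle: return (cycle_id, delta versus the baseline score)."""
    # Placeholder deltas so the sketch is deterministic: -0.1, 0.0, 0.1, ...
    delta = (cycle_id % 3 - 1) * 0.1
    return cycle_id, delta

def run_cycles_in_parallel(n_cycles: int,
                           max_workers: int = 4) -> List[Tuple[int, float]]:
    """Run n_cycles concurrently; results come back in cycle order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_cycle, range(n_cycles)))

def best_cycle(results: List[Tuple[int, float]]) -> Tuple[int, float]:
    """Pick the cycle with the largest improvement over the baseline."""
    return max(results, key=lambda result: result[1])

if __name__ == "__main__":
    results = run_cycles_in_parallel(3)
    print(best_cycle(results))
```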
Strategy
A predefined workflow that Hillclimber uses to improve your code. Currently you must choose between a simple strategy (cheaper and faster, but potentially lower improvement rates) and a more sophisticated one.
Artefact
A file or folder that Hillclimber should improve.
Goal
A specific eval score that Hillclimber should achieve. If the goal is achieved, Hillclimber stops the experiment.
Budget
The maximum number of cycles, tokens, or money that Hillclimber will use. If the budget is exhausted, Hillclimber stops the experiment.
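The two stopping rules (goal reached, budget exhausted) could be expressed roughly like this. It is a sketch under assumptions, not the tool's actual logic: the Spec fields and the stop_reason helper are hypothetical names, a higher score is assumed to be better, and only cycle and token budgets are modelled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Spec:
    """Hypothetical subset of an experiment spec."""
    goal: float       # target eval score (assumed: higher is better)
    max_cycles: int   # budget expressed in cycles
    max_tokens: int   # budget expressed in tokens

def stop_reason(spec: Spec, best_score: float,
                cycles_used: int, tokens_used: int) -> Optional[str]:
    """Return why the experiment should stop, or None to keep going."""
    if best_score >= spec.goal:
        return "goal reached"
    if cycles_used >= spec.max_cycles:
        return "cycle budget exhausted"
    if tokens_used >= spec.max_tokens:
        return "token budget exhausted"
    return None
```

Checking the goal before the budget means a run that hits its target on the final allowed cycle is reported as a success and not as an exhausted budget.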
Agent
The entity that does all the work. An agent consists of a harness (Claude Code, Codex, Cursor) and a model.
Feedback
I'd appreciate any feedback you have. Feel free to DM me or run:
hillclimber feedback "My take is..."