HILLCLIMBER

Pre-release version

Hillclimber docs

Overview

Hillclimber is a framework for long-running agentic sessions aimed at measurable, eval-driven codebase improvement.

Its distinguishing feature is that it pushes you to explicitly define an eval function and a spec (success criteria, budget, models) for each experiment. This is particularly useful when you want to run long sessions but don't have unlimited tokens to burn and want fine control over long-running jobs.

Because it is open-source and harness-agnostic, Hillclimber lets you swap harnesses (Claude Code, Codex, Cursor, etc.) and models, and choose the combination that best suits your needs and budget.

Getting started

  1. To run Hillclimber, cd to your project and run the init command.

    bash
    cd my_projects/project_x
    hillclimber init -i

    After running the init command and following the wizard instructions, two files will be produced: hillclimber.toml and eval.py.

    • hillclimber.toml: defines the spec for the experiment. Hillclimber is designed to push you towards explicitly defining a goal, a budget, models, etc.
    • eval.py: defines the eval/fitness function.
  2. Implement the evaluate function inside eval.py.

    Hillclimber uses the eval.py file to calculate the baseline score and delta for each cycle of the experiment. You must implement evaluate before running a Hillclimber experiment.

    Pro tip: ask the coding agent of your choice to implement it for you if you are lazy 😉

  3. Commit the hillclimber.toml and eval.py files.

    Annoyance Warning

    This is an annoying part of the current version and I'm looking forward to a better solution in upcoming versions. Thank you for being with me!

    Hillclimber runs each experiment in its own dedicated workspace, forked from your latest commit — which is what lets it run multiple cycles in parallel. The tradeoff: because those workspaces are checked out from committed state, any uncommitted work (including the freshly created hillclimber.toml and eval.py) won't make it into them.

    Before a run, you'll need to commit everything — otherwise Hillclimber will stop and ask you to, rather than risk scoring two different versions of your code.

  4. Start climbing.

    Execute the run command and Hillclimber starts improving your codebase.

    bash
    hillclimber run
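
For step 2, a minimal sketch of what eval.py could contain is shown below. The signature the init wizard generates for evaluate is not documented here. The zero-argument form returning a float, with higher meaning better, is therefore an assumption. slugify and CASES are hypothetical stand-ins for your own project code and test data.

```python
# eval.py (illustrative sketch; the generated template may differ)


def slugify(text: str) -> str:
    """Hypothetical project code that an experiment would try to improve."""
    return "-".join(text.lower().split())


# Hypothetical labelled cases: (input, expected output).
CASES = [
    ("Hello World", "hello-world"),
    ("  Trim  me ", "trim-me"),
    ("Already-ok", "already-ok"),
    ("Hello, World!", "hello-world"),  # fails for now: punctuation is kept
]


def evaluate() -> float:
    """Return the fraction of cases that pass, between 0.0 and 1.0.

    Assumption: a higher score is treated as better.
    """
    passed = sum(1 for given, expected in CASES if slugify(given) == expected)
    return passed / len(CASES)


if __name__ == "__main__":
    print(f"score: {evaluate():.2f}")
```

In a real project, slugify would be imported from your codebase rather than defined in eval.py. A deterministic score like this one makes the baseline and the per-cycle delta comparable across runs.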

Key concepts

Experiment

One full run of the hillclimber run command.

Cycle

One attempt to improve the codebase. An experiment consists of 1..n cycles. Cycles can run in parallel, so you explore multiple improvements at once rather than one at a time.
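
As a rough mental model of parallel cycles, the sketch below runs several independent attempts at once and collects their scores. It is conceptual only and does not reflect how Hillclimber is implemented. attempt_improvement, FAKE_SCORES, and run_cycles_in_parallel are hypothetical names, and the sketch does not model the per-experiment workspaces.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical scores that each cycle "reaches"; a real cycle would let an
# agent edit the code and then score the result with the eval function.
FAKE_SCORES = {0: 0.61, 1: 0.58, 2: 0.70}


def attempt_improvement(cycle_id: int) -> float:
    """Stand-in for one cycle: returns the eval score of its attempt."""
    return FAKE_SCORES[cycle_id]


def run_cycles_in_parallel(cycle_ids: List[int]) -> Dict[int, float]:
    """Run all cycles concurrently and map each cycle id to its score."""
    with ThreadPoolExecutor(max_workers=len(cycle_ids)) as pool:
        # pool.map preserves input order, so ids and scores line up.
        scores = list(pool.map(attempt_improvement, cycle_ids))
    return dict(zip(cycle_ids, scores))


if __name__ == "__main__":
    results = run_cycles_in_parallel([0, 1, 2])
    best_cycle = max(results, key=results.get)
    print(f"best cycle: {best_cycle} with score {results[best_cycle]}")
```

Because each attempt is independent, a weak attempt costs budget but does not block a stronger one that is running alongside it.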

Strategy

A predefined workflow that Hillclimber uses to improve your code. Currently you must choose between a simple strategy (cheaper and faster, but with potentially lower improvement rates) and a more sophisticated one.

Artefact

The file or folder that Hillclimber should improve.

Goal

A specific eval score that Hillclimber should achieve. If the goal is achieved, Hillclimber stops the experiment.

Budget

The maximum number of cycles, tokens, or amount of money that Hillclimber may spend. Once the budget is exhausted, Hillclimber stops the experiment.
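
The goal and budget stopping rules can be pictured as a simple loop. The sketch below is conceptual, not Hillclimber's actual implementation. Spec, Result, and run_experiment are hypothetical names, and the budget is expressed in cycles only. It also assumes that higher scores are better and that only improvements are kept.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Spec:
    goal: float      # target eval score
    max_cycles: int  # budget, simplified to a cycle count


@dataclass
class Result:
    best_score: float
    cycles_used: int
    stop_reason: str


def run_experiment(spec: Spec, baseline: float,
                   run_cycle: Callable[[float], float]) -> Result:
    """Run cycles until the goal is reached or the budget is exhausted."""
    best = baseline
    cycles_used = 0
    while True:
        if best >= spec.goal:
            return Result(best, cycles_used, "goal_reached")
        if cycles_used >= spec.max_cycles:
            return Result(best, cycles_used, "budget_exhausted")
        candidate = run_cycle(best)  # one attempt to improve the codebase
        cycles_used += 1
        if candidate > best:         # assumption: regressions are discarded
            best = candidate


if __name__ == "__main__":
    scores = iter([0.55, 0.52, 0.81])  # hypothetical per-cycle scores
    outcome = run_experiment(Spec(goal=0.8, max_cycles=5), 0.5,
                             lambda best: next(scores))
    print(outcome)
```

With a goal of 0.8 and the scores above, the loop stops after the third cycle. Lowering max_cycles to 2 would stop it earlier with the best score seen so far.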

Agent

The entity that does all the work. An agent consists of a harness (Claude Code, Codex, Cursor) and a model.

Feedback

I'd appreciate any feedback you have. Feel free to DM me or run:

bash
hillclimber feedback "My take is..."