Skip to content

Improve Codebase

Continuous code-improvement cycle. Confirms intent + green baseline, scores quality metrics in parallel, snapshots the baseline, hardens findings via a Senior-Principal critique, looks up best practices, triages with a human gate, then iterates through each refactor with per-item bounded retries until the queue is drained. A final reassessment compares against baseline thresholds and emits a pass/partial verdict.

Two retry mechanisms compose:

  • retries: 2 on Refactor_Item — 3 attempts per refactor item.
  • retries: 50 on Iterative_Refactor — drives the per-item outer loop by failing Continue_Or_Done while the queue still has work; bounds total items at 50.

Threshold contract: $GLOBAL.metric_thresholds defines the minimum acceptable score per metric. Compile_Report attaches the right threshold to each item. Reassess_Metric checks against $LOCAL.current_item.threshold.

Halt: set $LOCAL.stage_halt = true to break out of the outer loop cleanly. Halt_Check fails its evaluate while halted, so the outer retries exhaust and the stage ends in failure with done_log and failed_log preserved.

Version: 1.1.0

Files

  • improve-codebase/TREE.yaml — main

Install

sh
mkdir -p .abtree/trees/improve-codebase \
  && curl -fsSL https://raw.githubusercontent.com/flying-dice/abtree/main/.abtree/trees/improve-codebase/TREE.yaml \
         -o .abtree/trees/improve-codebase/TREE.yaml

Run with Claude

sh
claude "Run the abtree improve-codebase tree. Use 'abtree --help' to learn the execution protocol, then create an execution with 'abtree execution create improve-codebase \"<summary>\"' and drive it to completion."

Tree definition

yaml
name: improve-codebase
version: 1.1.0
description: |
  Continuous code-improvement cycle. Confirms intent + green baseline,
  scores quality metrics in parallel, snapshots the baseline, hardens
  findings via a Senior-Principal critique, looks up best practices,
  triages with a human gate, then iterates through each refactor with
  per-item bounded retries until the queue is drained. A final
  reassessment compares against baseline thresholds and emits a
  pass/partial verdict.

  Two retry mechanisms compose:
    - retries: 2 on Refactor_Item     — 3 attempts per refactor item.
    - retries: 50 on Iterative_Refactor — drives the per-item outer loop
      by failing Continue_Or_Done while the queue still has work;
      bounds total items at 50.

  Threshold contract: $GLOBAL.metric_thresholds defines the minimum
  acceptable score per metric. Compile_Report attaches the right
  threshold to each item. Reassess_Metric checks against
  $LOCAL.current_item.threshold.

  Halt: set $LOCAL.stage_halt = true to break out of the outer loop
  cleanly. Halt_Check fails its evaluate while halted, so the outer
  retries exhaust and the stage ends in failure with done_log and
  failed_log preserved.

state:
  local:
    # Intent + baseline
    change_request: null         # what "improve" means for this run
    scope_confirmed: null        # human-set boolean
    baseline_tests_pass: null

    # Parallel scoring outputs — one per metric. Each entry:
    #   { score: 0..1, observations: [...], risk: low|med|high, cost_benefit: 0..1 }
    # Separate slots (not a shared list) because parallel writes to a
    # shared key are not serialised by abtree — each child writes its own.
    score_dry: null
    score_srp: null
    score_coupling: null
    score_cohesion: null

    # Pre-refactor synthesis
    baseline_scores: null        # { dry, srp, coupling, cohesion } at start
    report: null                 # synthesised report text
    refactor_queue: null         # ordered list, refined in-place by each stage
    online_references: null      # best-practice lookups, keyed by metric

    # Iteration state
    current_item: null
    current_score: null
    refactor_plan: null          # filled by High_Risk_Critique only
    done_log: []
    failed_log: []
    stage_halt: false

    # Human gates
    triage_approved: null

    # Final scoring + verdict
    final_scores: null

  global:
    test_command: "the command that runs the project's full regression test suite (e.g. 'bun test', 'pnpm test')"
    metric_thresholds:
      dry: 0.7
      srp: 0.7
      coupling: 0.7
      cohesion: 0.7

tree:
  type: sequence
  name: Improve_Codebase
  children:

    # 1. Explicit human intent — refuses to run without a stated scope.
    - type: action
      name: Check_Intent
      steps:
        - evaluate: $LOCAL.change_request is set
        - instruct: >
            Read $LOCAL.change_request. State what "improve" means for
            this run — full repo, one module, one metric, etc. Surface
            the interpretation to the human and pause until they
            confirm by calling
            `abtree local write <flow-id> scope_confirmed true`.
            Submit `running` while waiting.
        - evaluate: $LOCAL.scope_confirmed is true

    # 2. Pre-flight: the test suite must already be green. Without this,
    #    later regression-test failures are ambiguous.
    - type: action
      name: Verify_Baseline
      steps:
        - evaluate: $GLOBAL.test_command is set
        - instruct: >
            Run $GLOBAL.test_command end-to-end on the unchanged
            codebase. If anything fails, abort the workflow with
            submit failure — improvement requires a green baseline.
            Otherwise set $LOCAL.baseline_tests_pass to true.
        - evaluate: $LOCAL.baseline_tests_pass is true

    # 3. Score quality metrics — four independent passes, any order.
    - type: parallel
      name: Score_Quality_Metrics
      children:
        - type: action
          name: Score_DRY
          steps:
            - evaluate: $LOCAL.scope_confirmed is true
            - instruct: >
                Score the codebase on Don't-Repeat-Yourself. Identify
                duplicated logic, near-duplicate functions, parallel
                inheritance hierarchies, and repeated patterns that
                should be abstracted. Score in [0, 1] (1 = no
                duplication). Per finding: file, severity (low/med/high),
                risk of refactoring, cost/benefit (0..1, higher = cheap
                and impactful). Store at $LOCAL.score_dry.

        - type: action
          name: Score_SRP
          steps:
            - evaluate: $LOCAL.scope_confirmed is true
            - instruct: >
                Score the codebase on Single Responsibility Principle.
                Identify modules / classes / functions with more than
                one reason to change. Score in [0, 1] (1 = one
                responsibility per unit). Per finding: file, severity,
                risk, cost_benefit. Store at $LOCAL.score_srp.

        - type: action
          name: Score_Coupling
          steps:
            - evaluate: $LOCAL.scope_confirmed is true
            - instruct: >
                Score the codebase on coupling. Identify excessive
                cross-module dependencies, circular imports, leaky
                abstractions. Score in [0, 1] (1 = clean boundaries).
                Per finding: file, severity, risk, cost_benefit. Store
                at $LOCAL.score_coupling.

        - type: action
          name: Score_Cohesion
          steps:
            - evaluate: $LOCAL.scope_confirmed is true
            - instruct: >
                Score the codebase on cohesion. Identify modules whose
                contents don't naturally belong together, utility
                classes that have grown into miscellany dumps. Score
                in [0, 1] (1 = strong cohesion). Per finding: file,
                severity, risk, cost_benefit. Store at
                $LOCAL.score_cohesion.

    # 4. Snapshot the starting scores so the final verdict can show delta.
    - type: action
      name: Snapshot_Baseline
      steps:
        - evaluate: $LOCAL.score_dry is set and $LOCAL.score_srp is set and $LOCAL.score_coupling is set and $LOCAL.score_cohesion is set
        - instruct: >
            Capture only the score values into
            $LOCAL.baseline_scores =
              { dry: <n>, srp: <n>, coupling: <n>, cohesion: <n> }.
            This frozen snapshot is the before-state used by
            Cycle_Verdict to show the codebase-level delta.

    # 5. Synthesise the parallel results into a single working list.
    - type: action
      name: Compile_Report
      steps:
        - evaluate: $LOCAL.baseline_scores is set
        - instruct: >
            Synthesise $LOCAL.score_dry / score_srp / score_coupling /
            score_cohesion into one working list. Each candidate gets:
              { id, metric, threshold (from $GLOBAL.metric_thresholds),
                summary, file, risk, cost_benefit }.
            Store the report text at $LOCAL.report and the candidate
            list at $LOCAL.refactor_queue. The threshold field is the
            target the per-item Reassess_Metric will gate against.

    # 6. Critique-and-harden pass. Matches the rhythm of refine-plan /
    #    implement — a Senior-Principal lens stress-tests the proposed
    #    work before it's locked in.
    - type: action
      name: Critique_Findings
      steps:
        - evaluate: $LOCAL.refactor_queue is set
        - instruct: >
            Act as a Senior Principal Engineer reviewing the proposed
            refactor list. For each item:
              - Is the metric actually wrong here, or is the score
                gaming an irrelevant heuristic?
              - Does fixing this move the needle, or is it cosmetic?
              - Will the change destabilise something downstream?
            Drop items that don't survive scrutiny. Tighten the rest's
            risk and cost_benefit estimates. Overwrite
            $LOCAL.refactor_queue with the hardened list.

    # 7. One-shot online lookup of best-practice patterns for the
    #    metrics still represented in the hardened queue.
    - type: action
      name: Lookup_Online
      steps:
        - evaluate: $LOCAL.refactor_queue is set
        - instruct: >
            For each unique metric represented in
            $LOCAL.refactor_queue, look up canonical refactoring
            patterns and best-practice approaches in the project's
            language and framework. Aim for high-signal references
            agents can apply at refactor time, not exhaustive
            literature reviews. Store at $LOCAL.online_references
            keyed by metric.

    # 8. Triage. Order, filter, surface.
    - type: action
      name: Triage_Refactor_Queue
      steps:
        - evaluate: $LOCAL.refactor_queue is set
        - instruct: >
            Triage $LOCAL.refactor_queue. Drop items where
            cost_benefit < 0.3 (not worth doing). Sort the rest by
            cost_benefit descending — high-impact, low-risk first.
            Surface the triaged list to the human for approval and
            overwrite $LOCAL.refactor_queue with the ordered, filtered
            version.

    # 9. Explicit human-approval gate. Same shape as
    #    technical-writer.Human_Approval_Gate.
    - type: action
      name: Triage_Approval_Gate
      steps:
        - evaluate: $LOCAL.refactor_queue is set
        - instruct: >
            Wait for the human to approve the triaged queue. They'll
            call `abtree local write <flow-id> triage_approved true`
            once they're ready. Submit `running` periodically while
            waiting. If they want changes to the triage, call submit
            failure so the bootstrap re-runs.
        - evaluate: $LOCAL.triage_approved is true

    # 10. Iterative refactor — one item per outer-loop pass.
    #     retries: 50 caps total items processed.
    - type: sequence
      name: Iterative_Refactor
      retries: 50
      children:

        # 10a. Fail fast if the agent has flagged a stage halt.
        - type: action
          name: Halt_Check
          steps:
            - evaluate: $LOCAL.stage_halt is not true
            - instruct: Stage active — proceed.

        # 10b. Pick the next item.
        - type: action
          name: Pick_Next_Item
          steps:
            - evaluate: $LOCAL.refactor_queue is not empty
            - instruct: >
                Pop the head of $LOCAL.refactor_queue and store the
                full item at $LOCAL.current_item. Reset
                $LOCAL.current_score and $LOCAL.refactor_plan to null.

        # 10c. Refactor + test + reassess. Per-item bounded retries.
        - type: sequence
          name: Refactor_Item
          retries: 2
          children:

            # Risk-gated extra prep, per N4.
            - type: selector
              name: Pre_Refactor_Critique
              children:
                - type: action
                  name: High_Risk_Critique
                  steps:
                    - evaluate: $LOCAL.current_item.risk is "high"
                    - instruct: >
                        High-risk item. Map blast radius before
                        touching code: list affected files, downstream
                        consumers, and tests that exercise the area.
                        Identify the safest order of edits. Consult
                        $LOCAL.online_references[current_item.metric]
                        for established patterns. Store a brief plan
                        at $LOCAL.refactor_plan.
                - type: action
                  name: Skip_Critique
                  steps:
                    - instruct: >
                        Risk is low or medium — proceed directly to
                        implementation. No blast-radius mapping needed.

            - type: action
              name: Implement_Refactor
              steps:
                - evaluate: $LOCAL.current_item is set
                - instruct: >
                    Implement the refactor described by
                    $LOCAL.current_item. If $LOCAL.refactor_plan is
                    set, follow it. Consult
                    $LOCAL.online_references[current_item.metric] for
                    canonical patterns. Edit code, run any local
                    sanity checks. If you cannot make progress on this
                    item, set $LOCAL.stage_halt to true and submit
                    failure to end the stage cleanly.

            - type: action
              name: Regression_Test
              steps:
                - evaluate: full regression test suite passes after the refactor (use $GLOBAL.test_command)
                - instruct: >
                    Run $GLOBAL.test_command end-to-end. Do NOT submit
                    success unless every test passes. If you cannot
                    get them green, set $LOCAL.stage_halt to true and
                    submit failure.

            # Reassess — instruct first, evaluate second. Matches the
            # convention used by the rest of the tree's actions.
            - type: action
              name: Reassess_Metric
              steps:
                - instruct: >
                    Run a smaller, focused assessment scoped to
                    $LOCAL.current_item.metric — re-score only that
                    one metric, not the full suite. Store the new
                    score at $LOCAL.current_score.
                - evaluate: $LOCAL.current_score is set and $LOCAL.current_score is greater than or equal to $LOCAL.current_item.threshold

        # 10d. Item passed all three steps within its retry budget.
        - type: action
          name: Record_Item_Done
          steps:
            - evaluate: $LOCAL.current_item is set
            - instruct: >
                Append $LOCAL.current_item (with its final
                $LOCAL.current_score) to $LOCAL.done_log. Clear
                $LOCAL.current_item, $LOCAL.current_score, and
                $LOCAL.refactor_plan.

        # 10e. Loop control. Eval true (queue empty) → outer sequence
        #      succeeds, stage moves on. Eval false → outer fails →
        #      retries: 50 kicks in for the next item.
        - type: action
          name: Continue_Or_Done
          steps:
            - evaluate: $LOCAL.refactor_queue is empty

    # 11. Final reassessment — re-score all four metrics against the
    #     refactored codebase, lightweight (summary scores only).
    - type: action
      name: Final_Reassessment
      steps:
        - evaluate: $LOCAL.refactor_queue is empty
        - instruct: >
            Re-run the four metric scorers (DRY, SRP, coupling,
            cohesion) against the now-refactored codebase. Summary
            scores only — no per-finding detail required. Store at
            $LOCAL.final_scores =
              { dry: <n>, srp: <n>, coupling: <n>, cohesion: <n> }.

    # 12. Verdict selector — pass / partial. The first child claims
    #     success when its evaluate gate holds; otherwise the second
    #     child records the partial outcome.
    - type: selector
      name: Cycle_Verdict
      children:
        - type: action
          name: Cycle_Passed
          steps:
            - evaluate: every metric in $LOCAL.final_scores is at or above its $GLOBAL.metric_thresholds value
            - instruct: >
                Every metric cleared its threshold. Surface a final
                report to the human covering: $LOCAL.baseline_scores
                vs $LOCAL.final_scores (delta per metric),
                $LOCAL.done_log (what got fixed),
                $LOCAL.failed_log (anything that exhausted retries).
                The cycle is complete.

        - type: action
          name: Cycle_Partial
          steps:
            - instruct: >
                Some metrics are still below threshold. Surface a
                report to the human covering: which metrics improved
                vs which didn't, $LOCAL.done_log,
                $LOCAL.failed_log, and a recommendation on whether to
                start another improvement cycle or escalate to a
                human-led architectural review.

MIT licensed