reagent

Autonomous binary analysis agent built in 7 hours at the Anthropic x Replit x Lightspeed Build India hackathon. Give it a binary and a goal; it triages, decompiles, debugs, and delivers verified findings.

Most AI coding tools are passive: you paste code, they explain it. reagent is different. Give it a binary and a goal and it runs tools, forms hypotheses, tests them with a debugger, and delivers verified findings. The emphasis on verification is the core idea: static analysis can guess at what a binary does, but dynamic analysis can prove it.

I built this in 7 hours for the Anthropic x Replit x Lightspeed Build India hackathon. Here's how it works.

the architecture

reagent is a hierarchical multi-agent system. There's one orchestrator agent and four specialist subagents:

  • triage: file format, architecture, security features, strings. Runs first, always.
  • static: decompilation, cross-references, control flow. Uses rizin and rz-ghidra.
  • dynamic: GDB/LLDB debugging, breakpoints, registers, memory reads, backtrace.
  • coding: writes and runs Python for things the other agents can't do directly, like XOR decoding, keygen math, hash bruteforcing.

The orchestrator dispatches subagents via a dispatch_subagent tool, synthesizes their results, and decides when the mission is done. Subagents can't call each other; everything routes through the orchestrator.
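
The dispatch flow can be sketched roughly as follows. This is a hypothetical reconstruction, not the real API: the function names, the result shape, and the hard-coded sequencing are all illustrative.

```python
# Illustrative sketch of orchestrator dispatch; names and result shapes are
# assumptions, not reagent's actual interfaces.
SUBAGENTS = ["triage", "static", "dynamic", "coding"]

def dispatch_subagent(name: str, task: str) -> dict:
    """Run one specialist and return its structured result."""
    if name not in SUBAGENTS:
        raise ValueError(f"unknown subagent: {name}")
    # In the real system this runs the subagent's own LLM loop;
    # here we just return a stub result.
    return {"agent": name, "task": task, "observations": [], "hypotheses": []}

def orchestrate(goal: str) -> list[dict]:
    results = [dispatch_subagent("triage", goal)]  # triage always runs first
    results.append(dispatch_subagent("static", goal))
    # ... the orchestrator keeps dispatching until the mission is done ...
    return results
```

The key invariant is the hub-and-spoke topology: subagents return results to the orchestrator and never talk to each other.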

Each agent is defined as a markdown file with YAML frontmatter:

```markdown
---
name: static
description: Static analysis specialist
mode: subagent
tools: [disassemble, decompile, functions, xrefs, strings, sections, search, think, activate_skill, update_model]
max_steps: 30
---

You are a static analysis specialist...
```

The agent system loads these from the agents/ directory. Fields like model and temperature can be overridden per-agent; the coding agent uses a different model than the orchestrator.
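
A minimal loader for this format might look like the sketch below. It is stdlib-only and hand-parses the frontmatter; the real loader presumably uses a proper YAML parser, and the function name is an assumption.

```python
# Minimal sketch of loading an agent definition from markdown + YAML
# frontmatter; hand-rolled parsing for illustration only.
def load_agent(text: str) -> tuple[dict, str]:
    """Split '---'-delimited frontmatter from the system prompt body."""
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

meta, prompt = load_agent("""---
name: static
mode: subagent
max_steps: 30
---
You are a static analysis specialist...""")
```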

the knowledge model

All agents share a structured BinaryModel, a three-tier knowledge base:

Observations are raw facts: disassembly output, hex dumps, register values, strings. They're timestamped and tagged with source and address.

Hypotheses have a lifecycle: proposed → testing → confirmed / rejected. They track confidence, evidence, and which agent proposed vs. verified them. A static analysis agent might propose "this function is a license key validator based on the XOR pattern at 0x4011a0"; the dynamic agent tests it by setting a breakpoint and observing runtime behavior.

Findings are the final output. A finding can only exist if it's been verified by a different agent than the one that proposed it. This is the verification guarantee: reagent won't report something unless a second agent has checked it.

the agent loop

Each agent runs the same loop:

  1. Estimate token count. If approaching the context window, trigger compaction.
  2. Checkpoint current context (for D-Mail, described below).
  3. Call the LLM via streaming.
  4. Dispatch tool calls.
  5. Grow the context with results.
  6. Check stop conditions: no tool calls means COMPLETE; hitting max_steps means MAX_STEPS.
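
The stop conditions in steps 3-6 reduce to a small loop. This is a condensed sketch with illustrative names, omitting checkpointing and compaction.

```python
# Condensed sketch of the agent loop's stop conditions; `llm_step` stands in
# for one streaming LLM call and returns that turn's tool calls.
def run_agent(llm_step, max_steps: int = 30) -> str:
    for _ in range(max_steps):
        tool_calls = llm_step()
        if not tool_calls:
            return "COMPLETE"        # no tool calls -> the agent is done
        # ... dispatch tool calls, grow the context with results ...
    return "MAX_STEPS"               # step budget exhausted

turns = iter([["disassemble"], ["decompile"], []])
status = run_agent(lambda: next(turns))
```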

Context management is three-tier:

  • Truncation: Tool outputs are capped at 2000 lines / 50KB before they even reach the agent.
  • Pruning: Old tool results >500 chars get replaced with [pruned: N chars] stubs. The last 10 messages are protected.
  • Compaction: When pruning isn't enough, a fast model (Haiku) summarizes old messages into a compact system message. The last 6 messages are kept verbatim.
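
The pruning tier is the easiest to show concretely. A sketch, with the thresholds from above and an assumed message shape:

```python
# Sketch of the pruning tier: old oversized tool results become stubs,
# the most recent messages are protected. Message shape is an assumption.
def prune(messages: list[dict], max_chars: int = 500,
          protect_last: int = 10) -> list[dict]:
    out = []
    for i, msg in enumerate(messages):
        old = i < len(messages) - protect_last
        if old and msg["role"] == "tool" and len(msg["content"]) > max_chars:
            msg = {**msg, "content": f"[pruned: {len(msg['content'])} chars]"}
        out.append(msg)
    return out

msgs = [{"role": "tool", "content": "x" * 2000}] + \
       [{"role": "assistant", "content": "ok"}] * 10
pruned = prune(msgs)
```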

D-Mail

This is the most unusual part. Named after Steins;Gate.

Partway through a run, an agent can realize it went down the wrong path: maybe it spent 10 steps analyzing a red herring. The send_dmail tool lets it rewind its own context to an earlier checkpoint, injecting a message from its "future self" explaining what not to do.

Mechanically: send_dmail raises BackToTheFuture(checkpoint_id, message), a subclass of BaseException rather than Exception, so it slips past the generic except Exception handlers in the agent loop. A dedicated handler catches it, calls context.revert_to(checkpoint_id), and injects the message as a system prompt: "[D-Mail from your future self]: {message}".

Contexts are JSONL-backed. Checkpoints are written before each LLM call. Reverting means truncating the JSONL to the checkpoint position and rewriting.
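
Putting the two mechanisms together, a sketch of the exception and the revert, with an in-memory stand-in for the JSONL file and assumed class internals:

```python
# Sketch of the D-Mail mechanism. A BaseException subclass bypasses generic
# `except Exception` handlers; reverting truncates the context at the
# checkpoint and appends the message. Internals here are assumptions.
import json

class BackToTheFuture(BaseException):          # deliberately NOT Exception
    def __init__(self, checkpoint_id: int, message: str):
        self.checkpoint_id, self.message = checkpoint_id, message

class Context:
    def __init__(self):
        self.lines: list[str] = []             # one JSON message per line
        self.checkpoints: dict[int, int] = {}  # checkpoint_id -> line count

    def checkpoint(self, cid: int):
        self.checkpoints[cid] = len(self.lines)

    def revert_to(self, cid: int, message: str):
        self.lines = self.lines[: self.checkpoints[cid]]
        self.lines.append(json.dumps(
            {"role": "system",
             "content": f"[D-Mail from your future self]: {message}"}))

ctx = Context()
ctx.checkpoint(0)
ctx.lines += [json.dumps({"role": "assistant", "content": "red herring"})] * 10
try:
    raise BackToTheFuture(0, "skip the red herring at the strings table")
except Exception:                              # would swallow a normal error
    raise AssertionError("unreachable: BaseException bypasses this clause")
except BackToTheFuture as dmail:
    ctx.revert_to(dmail.checkpoint_id, dmail.message)
```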

the debugger abstraction

The dynamic agent has 9 debugger tools: debug_launch, debug_breakpoint, debug_continue, debug_registers, debug_memory, debug_backtrace, debug_eval, debug_kill, debug_sessions.

Under the hood, these run GDB on Linux or LLDB on macOS. The abstraction layer translates operations to debugger-specific syntax:

```python
_CMD_MAP = {
    "breakpoint": {
        "gdb":  "break {location}",
        "lldb": "breakpoint set --name {location}",
    },
    "breakpoint_addr": {
        "gdb":  "break *{address}",
        "lldb": "breakpoint set --address {address}",
    },
    ...
}
```

The debugger session runs in a PTY (subprocess.Popen with start_new_session=True). The PTY system uses a rolling 50K-line buffer and send_and_match(data, pattern, timeout): send a command, wait for a regex pattern in the output. All output is ANSI-stripped and binary-sanitized before it reaches the LLM.

One specific design note: the code uses subprocess.Popen rather than os.fork to avoid asyncio deadlocks on macOS. Killing a debug session uses os.killpg(pgid, SIGKILL) to take down the entire process group, not just the debugger.
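
The send_and_match idea can be sketched without a real PTY: write a command, then poll a rolling buffer until a regex matches or a timeout expires. FakeSession below is a stand-in that reads canned chunks; the real implementation drives the debugger process.

```python
# Sketch of send_and_match's core loop; FakeSession and its chunk source are
# hypothetical stand-ins for the real PTY-backed session.
import re
from collections import deque

class FakeSession:
    def __init__(self, chunks):
        self.chunks = iter(chunks)
        self.buffer = deque(maxlen=50_000)     # rolling 50K-line buffer

    def send_and_match(self, data: str, pattern: str, max_polls: int = 100):
        # (the real code writes `data` to the debugger's PTY here)
        rx = re.compile(pattern)
        for _ in range(max_polls):
            self.buffer.append(next(self.chunks, ""))
            m = rx.search("\n".join(self.buffer))
            if m:
                return m.group(0)
        raise TimeoutError(pattern)

s = FakeSession(["Reading symbols...", "Breakpoint 1 at 0x4011a0"])
hit = s.send_and_match("break *0x4011a0", r"Breakpoint \d+ at 0x[0-9a-f]+")
```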

tool output wrapping

Debugger output is wrapped in XML tags:

```xml
<debug_output session_id="dbg_0" action="registers">
rax  0x0000000000000001
rbx  0x00007ffff7e12a00
...
</debug_output>
```

This isn't for humans; it's for the LLM. Structured tags help the model keep track of which output came from which tool call, especially when sessions interleave.
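
The wrapper itself is trivial. The function name below is an assumption; the tag and attributes follow the example above.

```python
# Sketch of the output-wrapping helper; the function name is illustrative.
def wrap_debug_output(session_id: str, action: str, text: str) -> str:
    return (f'<debug_output session_id="{session_id}" action="{action}">\n'
            f"{text}\n"
            f"</debug_output>")

wrapped = wrap_debug_output("dbg_0", "registers", "rax  0x0000000000000001")
```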

skills

Agents load documentation on demand via activate_skill. Skills live in a skills/ directory:

```plaintext
skills/
  rizin/commands.md     # rizin command reference
  rizin/patterns.md     # common patterns (find loops, find crypto, etc.)
  gdb/commands.md
  gdb/workflows.md
  lldb/...
  frida/                # placeholder, not yet implemented
```

The SkillRegistry auto-discovers from the filesystem. An agent calls activate_skill("gdb/workflows") when it needs a refresher on how to set up a debugging session; the content is injected into its context for that step.
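
Auto-discovery amounts to globbing the skills tree and keying each file by its relative path minus the extension. A sketch, with assumed class and method names:

```python
# Sketch of filesystem-based skill discovery; class/method names are
# assumptions based on the description.
import tempfile
from pathlib import Path

class SkillRegistry:
    def __init__(self, root: Path):
        # "gdb/workflows.md" is registered under the name "gdb/workflows"
        self.skills = {
            p.relative_to(root).with_suffix("").as_posix(): p
            for p in root.rglob("*.md")
        }

    def activate(self, name: str) -> str:
        """Return the skill's content for injection into the agent context."""
        return self.skills[name].read_text()

root = Path(tempfile.mkdtemp())
(root / "gdb").mkdir()
(root / "gdb" / "workflows.md").write_text("# GDB workflows\n...")
reg = SkillRegistry(root)
doc = reg.activate("gdb/workflows")
```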

the decompiler fallback chain

Decompilation goes through three tiers:

  1. rz-ghidra: Best output, requires the Ghidra plugin for rizin. Not always available.
  2. rz-dec: rizin's built-in decompiler. Lower quality but always present.
  3. pdsf + pdf: Pseudo-code + disassembly fallback. Used when both decompilers fail.

The decompile tool tries each in order and returns the first one that works.
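
The chain is a first-success loop over backends. The sketch below stubs the backends; the failure mode (rz-ghidra missing) mirrors the description, but the callable-based interface is an assumption.

```python
# Sketch of the three-tier fallback: try each decompiler in order, return the
# first non-empty result. Backend callables here are hypothetical stubs.
def decompile(addr: str, backends) -> tuple[str, str]:
    for name, fn in backends:
        try:
            out = fn(addr)
            if out.strip():
                return name, out
        except RuntimeError:
            continue                     # backend unavailable, try the next
    return "none", ""

def ghidra(addr):      raise RuntimeError("rz-ghidra plugin not installed")
def rzdec(addr):       return "int main(void) { ... }"
def pd_fallback(addr): return "; pseudo-code + disassembly ..."

tier, output = decompile("0x4011a0",
                         [("rz-ghidra", ghidra),
                          ("rz-dec", rzdec),
                          ("pdsf+pdf", pd_fallback)])
```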

the wire protocol

All events (text tokens, tool calls, tool results, observations, hypotheses, findings, compaction notices, D-Mails) flow through a typed async event bus. Both the CLI and TUI subscribe to the same bus. This means the display layer is completely decoupled from agent logic.
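
A fan-out bus like this fits in a few lines with asyncio. The sketch below is a minimal stand-in with illustrative event kinds, not the real bus:

```python
# Minimal sketch of a typed async event bus with multiple subscribers;
# the Event shape and kinds are illustrative.
import asyncio
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # e.g. "text", "tool_call", "finding", "dmail"
    payload: str

class EventBus:
    def __init__(self):
        self.queues: list[asyncio.Queue] = []

    def subscribe(self) -> asyncio.Queue:
        q = asyncio.Queue()
        self.queues.append(q)
        return q

    async def publish(self, event: Event):
        for q in self.queues:          # fan out to every subscriber
            await q.put(event)

async def demo():
    bus = EventBus()
    cli, tui = bus.subscribe(), bus.subscribe()  # two displays, one bus
    await bus.publish(Event("finding", "XOR key recovered"))
    return (await cli.get()).payload, (await tui.get()).payload

cli_seen, tui_seen = asyncio.run(demo())
```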

The TUI uses Textual. The layout is a 3:1 split, chat panel on the left, findings/hypotheses/observations tabs on the right. The sidebar updates in real time as agents add to the knowledge model.

crackme test suite

The repo ships with 5 test binaries:

  binary                difficulty  type
  crackme01_password    easy        hardcoded password
  crackme02_xor         easy        XOR-encoded flag
  crackme03_keygen      medium      license key generation
  crackme04_bof         medium      buffer overflow, call hidden win()
  crackme05_multistage  hard        multi-stage validation

Running reagent solve crackme02_xor --goal "find the flag" starts the full agent loop. The triage agent identifies it as an x86-64 ELF; the orchestrator passes it to static, which finds the XOR loop, then to coding, which writes a Python decoder; finally, the orchestrator records a verified finding.
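
The decoder the coding agent writes for a challenge like this is usually a one-liner. The key and ciphertext below are made up for illustration, not crackme02_xor's actual data:

```python
# The kind of throwaway decoder the coding agent writes: single-byte XOR.
# Key 0x42 and the encoded bytes are invented for this example.
def xor_decode(data: bytes, key: int) -> bytes:
    return bytes(b ^ key for b in data)

encoded = bytes([0x24, 0x2e, 0x23, 0x25, 0x39])
decoded = xor_decode(encoded, 0x42)   # -> b"flag{"
```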

what's missing

Frida integration is planned but only the skills/frida/ directory exists. The dynamic agent currently uses GDB/LLDB for in-process debugging; Frida would enable dynamic instrumentation without a debugger, which is useful for packed binaries and anti-debug techniques.

The LLM stack uses litellm for multi-provider support. Default is claude-sonnet-4-5 for main agents and claude-haiku-4-5 for compaction. The --model flag overrides the main model; compaction always uses the fast model.