System Online — v2.077

An Operating System
for AI Agents

I built an AI multi-agent system from scratch. This is everything I learned — architecture decisions, real failures, and the patterns that actually work.

What is OpenClaw?

Not another chatbot wrapper. OpenClaw is an orchestration layer that lets multiple AI agents collaborate, share context, and execute complex workflows autonomously.

Agents

Autonomous AI workers with specific roles — planner, coder, reviewer, debugger. Each has its own context, tools, and decision-making loop.

Gateway

The central router. Receives requests, decides which agents to invoke, and handles authentication and rate limiting.

Skills

Reusable capabilities agents can invoke — code execution, web search, file manipulation, API calls. Modular and composable by design.

Memory

Persistent context that survives across sessions. Agents remember past decisions, user preferences, and project state. Not just chat history.

Automation

Scheduled tasks, event triggers, and pipeline orchestration. Your agents work while you sleep — monitoring, deploying, reporting.


My System Architecture

This isn't a theoretical diagram. This is my actual production setup — the machines, the routing, the failure points I discovered the hard way.

Client Layer

Local Machine: macOS / CLI
Web Dashboard: monitoring UI
API Clients: curl / SDK

Infrastructure

Reverse Proxy: nginx + mTLS
API Gateway: rate limit + auth
VPS Node: Ubuntu 22.04

Agent Layer

Planner Agent: task decomposition
Executor Agents: code + deploy
Memory Store: SQLite + vector

From Zero to Working System

Building this system wasn't linear. Here's the real timeline — including the failures that taught me the most.

Phase 01 — Foundation

Installation & First Agent

Setting up the gateway, configuring authentication, and getting the first agent to respond. Sounds simple. Took 3 days.

# Initial setup
git clone openclaw/core
cd core && ./install.sh
openclaw init --gateway --port 8080
openclaw agent create planner --model claude-sonnet

Phase 02 — Failure

Everything Broke

Gateway timeouts, agents losing context mid-task, memory leaks after 50+ requests. The default config was not production-ready.

Phase 03 — Debugging

Understanding the System

Learned to read agent logs, trace request flows, and identify bottlenecks. The gateway's default timeout was 30s — most agent workflows need 120s+.

# The fix that saved everything
openclaw config set gateway.timeout 120000
openclaw config set agent.maxRetries 3
openclaw config set memory.gcInterval 3600

Phase 04 — Automation

Cron Jobs & Pipelines

Set up automated health checks, log rotation, and scheduled agent tasks. The system started running itself.

Phase 05 — Stable

Production Ready

After 6 weeks: 99.2% uptime, 4 active agents, automated monitoring, zero manual restarts. The system I actually wanted to build.


Real Problems & Fixes

These aren't hypothetical. Every one of these crashed my system at least once. Here's how I diagnosed and fixed them.

Critical

Gateway Timeout Under Load

Cause: Default 30s timeout too short for multi-agent chains. Agents were mid-execution when the gateway killed the connection.
Diagnose: Gateway logs showed ETIMEDOUT errors. Agent logs showed tasks completing after gateway had already returned 504.
Fix: Increased gateway timeout to 120s. Added streaming responses for long-running tasks. Implemented agent-level keepalive pings.
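
The keepalive idea in the fix above can be sketched roughly as follows. This is an illustrative Python example, not OpenClaw's API: stream_with_keepalive, collect, and slow_agent are hypothetical names, and the "ping" chunk stands in for whatever heartbeat the gateway actually understands.

```python
import asyncio
from typing import AsyncIterator, Awaitable, List


async def stream_with_keepalive(
    task: Awaitable[str], ping_every: float = 10.0
) -> AsyncIterator[str]:
    """Yield "ping" while the task is still running, then yield its result."""
    running = asyncio.ensure_future(task)
    while True:
        # Wait for the task, but wake up every ping_every seconds.
        done, _ = await asyncio.wait({running}, timeout=ping_every)
        if done:
            yield running.result()
            return
        yield "ping"  # keeps the connection from looking idle


async def collect(task: Awaitable[str], ping_every: float) -> List[str]:
    """Gather every streamed chunk into a list (useful for demos and tests)."""
    return [chunk async for chunk in stream_with_keepalive(task, ping_every)]


async def slow_agent() -> str:
    """Hypothetical stand-in for a long multi-agent chain."""
    await asyncio.sleep(0.25)
    return "done"
```

Running asyncio.run(collect(slow_agent(), 0.1)) yields one or more "ping" chunks followed by "done", so the client keeps receiving data instead of hitting an idle timeout.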
Warning

Agent State Corruption

Cause: Special characters in agent responses (including emoji sequences) caused JSON parse failures in the state manager.
Diagnose: Intermittent. Only triggered when agents used certain Unicode ranges. Took 2 weeks to isolate the pattern.
Fix: Added UTF-8 sanitization layer before state serialization. Switched to a streaming JSON parser that handles partial writes.
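
One way a sanitization layer like this could look is sketched below. It is an assumption about what "sanitization" means here (replacing unpaired surrogates and dropping control characters), not the code from the actual state manager; sanitize_text and serialize_state are hypothetical names, and only top-level string values are cleaned.

```python
import json
import unicodedata


def sanitize_text(text: str) -> str:
    """Make a string safe to encode as UTF-8 and store as JSON."""
    # Lone surrogates cannot be encoded as UTF-8; errors="replace" turns them into "?".
    cleaned = text.encode("utf-8", errors="replace").decode("utf-8")
    cleaned = unicodedata.normalize("NFC", cleaned)
    # Drop control characters except newline and tab.
    return "".join(
        ch for ch in cleaned if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


def serialize_state(state: dict) -> str:
    """Sanitize top-level string values, then serialize the state to JSON."""
    safe = {
        key: sanitize_text(value) if isinstance(value, str) else value
        for key, value in state.items()
    }
    return json.dumps(safe, ensure_ascii=False)
```

Well-formed emoji pass through untouched. Only broken sequences and control bytes are altered, which keeps the stored state readable.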
Warning

Cron Jobs Silently Failing

Cause: Cron environment lacked PATH entries for OpenClaw binaries. Jobs started, found no executable, exited 0 (!) with no output.
Diagnose: Added explicit logging to cron wrapper. Discovered that 12 of 15 scheduled tasks had never actually run.
Fix: Created a shell wrapper that sources the environment before running. Added health check that alerts when expected outputs are missing.
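
The "alert when expected outputs are missing" half of the fix might look something like this. It is a sketch under assumptions: each scheduled job is assumed to write or touch a known file, and stale_outputs plus the example path are hypothetical.

```python
import time
from pathlib import Path
from typing import Dict, List, Optional


def stale_outputs(
    expected: Dict[str, float], now: Optional[float] = None
) -> List[str]:
    """Return paths that are missing or older than their allowed age in seconds."""
    now = time.time() if now is None else now
    stale = []
    for path, max_age_s in expected.items():
        p = Path(path)
        # A job that never ran leaves no file; a job that stopped leaves an old one.
        if not p.exists() or now - p.stat().st_mtime > max_age_s:
            stale.append(path)
    return stale
```

For example, stale_outputs({"/var/log/openclaw/daily-report.json": 86400}) (a made-up path) returns that path if the daily job has not produced output in the last 24 hours. Checking for outputs rather than exit codes is what catches jobs that exit 0 without doing anything.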
Critical

Exposed Admin Endpoint

Cause: Default install bound admin API to 0.0.0.0. Combined with no auth on the management port, the entire system was publicly writable.
Diagnose: Discovered during a routine port scan. Access logs showed 3 unauthorized config reads from unknown IPs.
Fix: Bound admin to 127.0.0.1. Added mTLS for all management endpoints. Implemented IP allowlisting. Rotated all API keys.
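
A startup check along these lines could catch a wildcard bind before the service goes live. This is a hedged sketch: the listener names and the idea of passing bind addresses in as a dict are assumptions, not part of OpenClaw's configuration format.

```python
import ipaddress
from typing import Dict, List


def is_loopback_bind(host: str) -> bool:
    """True only if the bind address is a loopback address."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # unrecognized hostnames are treated as unsafe


def exposed_listeners(listeners: Dict[str, str]) -> List[str]:
    """Return the names of listeners not bound to loopback, sorted."""
    return sorted(
        name for name, host in listeners.items() if not is_loopback_bind(host)
    )
```

With {"admin": "0.0.0.0", "metrics": "127.0.0.1"} as input, only "admin" is reported. Both 0.0.0.0 and the IPv6 wildcard :: count as exposed.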

Advanced Techniques

Once the system is stable, these patterns take it from "working" to "powerful." Each one was discovered through production usage.

Coordination

Agent Handoff Protocol

When planner delegates to coder, pass structured context — not raw text. Define a handoff schema that both agents validate.

handoff: { task, context, constraints, deadline }
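
As a rough illustration of a handoff schema that both sides validate, the four fields above could be modeled like this. The field types, the ISO 8601 deadline format, and the names Handoff and parse_handoff are assumptions made for the sketch.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Handoff:
    task: str
    context: dict
    constraints: list
    deadline: datetime

    def __post_init__(self):
        if not self.task.strip():
            raise ValueError("task must be a non-empty string")
        if self.deadline.tzinfo is None:
            raise ValueError("deadline must be timezone-aware")


def parse_handoff(payload: dict) -> Handoff:
    """Validate a raw payload and build a Handoff, or raise ValueError."""
    missing = {"task", "context", "constraints", "deadline"} - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Handoff(
        task=payload["task"],
        context=dict(payload["context"]),
        constraints=list(payload["constraints"]),
        deadline=datetime.fromisoformat(payload["deadline"]),
    )
```

The sender builds the payload and the receiver calls parse_handoff before starting work, so a malformed handoff fails loudly at the boundary instead of silently degrading the task.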
Memory

Layered Memory Architecture

Three tiers: session (ephemeral), project (persistent), global (cross-project). Agents query the right tier based on task scope.

memory.query(scope: "project", key: "auth-pattern")
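
A toy in-memory version of the three tiers might look like the following. The fallback from narrower to broader tiers is an assumption added for illustration (the text only says agents query the right tier), and LayeredMemory is a hypothetical class, not the real memory store.

```python
from typing import Any, Optional


class LayeredMemory:
    """Three tiers ordered from narrowest to broadest scope."""

    TIERS = ("session", "project", "global")

    def __init__(self) -> None:
        self._tiers = {tier: {} for tier in self.TIERS}

    def put(self, scope: str, key: str, value: Any) -> None:
        if scope not in self._tiers:
            raise ValueError(f"unknown scope: {scope}")
        self._tiers[scope][key] = value

    def query(self, scope: str, key: str) -> Optional[Any]:
        """Look in the requested tier first, then fall back to broader tiers."""
        if scope not in self._tiers:
            raise ValueError(f"unknown scope: {scope}")
        start = self.TIERS.index(scope)
        for tier in self.TIERS[start:]:
            if key in self._tiers[tier]:
                return self._tiers[tier][key]
        return None
```

A project-scoped query never sees session data, but it can still pick up a global default when the project has no entry of its own.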
Automation

Event-Driven Pipelines

Don't poll. Use webhook triggers and filesystem watchers. When a PR merges, the review agent fires automatically.

on: { event: "pr.merged", agent: "reviewer" }
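
The trigger above can be pictured as a small publish/subscribe registry. This is a minimal sketch with hypothetical names (EventBus, on, emit); a real setup would receive the event from a webhook or filesystem watcher rather than a direct call.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


class EventBus:
    """Maps event names to the handlers that should fire for them."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], Any]]] = defaultdict(list)

    def on(self, event: str, handler: Callable[[dict], Any]) -> None:
        """Register a handler for an event name."""
        self._handlers[event].append(handler)

    def emit(self, event: str, payload: dict) -> List[Any]:
        """Call every handler registered for the event, in registration order."""
        return [handler(payload) for handler in self._handlers.get(event, [])]
```

Registering a reviewer with bus.on("pr.merged", review) means nothing runs until a merge event actually arrives, which is the difference from a polling loop.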
Monitoring

Agent Performance Tracking

Log every agent invocation: duration, token usage, success rate. Identify which agents are slow and which tasks cost too much.

metrics: { p95: "4.2s", tokens: "12k", success: "96%" }
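
Numbers like these can be derived from a plain invocation log. The sketch below assumes one record per agent call and uses the nearest-rank method for p95; Invocation and summarize are hypothetical names, and other percentile definitions would give slightly different values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Invocation:
    agent: str
    duration_s: float
    tokens: int
    ok: bool


def summarize(invocations: List[Invocation]) -> Dict[str, float]:
    """Compute p95 duration (nearest rank), total tokens, and success rate."""
    if not invocations:
        raise ValueError("no invocations to summarize")
    durations = sorted(i.duration_s for i in invocations)
    n = len(durations)
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n), always at least 1
    return {
        "p95_s": durations[rank - 1],
        "tokens": sum(i.tokens for i in invocations),
        "success_rate": sum(1 for i in invocations if i.ok) / n,
    }
```

Grouping the log by agent before calling summarize gives the per-agent view needed to spot which agent is slow or expensive.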
Security

mTLS Between Services

Every connection between gateway, agents, and storage uses mutual TLS. No exceptions. Self-signed CA for internal traffic.

tls: { ca: "internal.pem", verify: true }

Coordination

Parallel Agent Execution

Independent tasks run concurrently. Planner identifies dependencies and dispatches non-blocking agents in parallel batches.

dispatch: [agent1, agent2] | await: [agent3]
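
One way to read the dispatch line above: group tasks into batches where each batch depends only on earlier ones, then run each batch concurrently. The following is a sketch using asyncio; plan_batches, run_plan, and the dependency-dict format are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Set


def plan_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks so that each batch depends only on earlier batches."""
    remaining = {task: set(d) for task, d in deps.items()}
    done: Set[str] = set()
    batches: List[List[str]] = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        batches.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return batches


async def run_plan(
    deps: Dict[str, Set[str]], run_task: Callable[[str], Awaitable[str]]
) -> List[str]:
    """Run each batch concurrently, waiting for it before starting the next."""
    results: List[str] = []
    for batch in plan_batches(deps):
        results.extend(await asyncio.gather(*(run_task(t) for t in batch)))
    return results
```

With agent1 and agent2 independent and agent3 depending on both, the plan is two batches: the first pair runs in parallel, and agent3 starts only after both finish.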

Daily Operation Philosophy

OpenClaw isn't a tool you use once. It's a collaborator that grows with you. These principles shaped how I work with it every day.

01

Trust but Verify

Let agents make decisions. Review the outputs, not the process. Intervene only when the result is wrong, not when the approach is unfamiliar.

02

Fail Forward

Every agent failure is training data. Log why it failed, adjust the prompt or constraints, and let it try again. Systems improve through iteration.

03

Compose, Don't Monolith

Small agents with clear boundaries beat one agent that tries to do everything. Composition creates resilience. Monoliths create single points of failure.


Tips & Hidden Tricks

Short, actionable insights. Each one saved me at least an hour of debugging or made my workflow measurably better.

01

Set agent timeouts 3x longer than you think

Complex reasoning chains take time. A timeout that's too aggressive kills more tasks than it protects.

02

Always source your shell profile in cron

Cron doesn't load .bashrc or .zshrc. Wrap every cron command in a script that sources the environment first.

03

Version your agent prompts like code

Store prompts in git. Tag versions. When an agent regresses, you can diff the prompt changes that caused it.

04

Monitor token usage per agent, not per request

One runaway agent can drain your budget. Per-agent limits with alerts prevent cost surprises.
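
A per-agent budget with an alert threshold could be as simple as the sketch below. The 80% alert ratio and the three status strings are arbitrary choices for the example, and TokenBudget is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict


class TokenBudget:
    """Tracks token usage per agent against a fixed limit."""

    def __init__(self, limits: Dict[str, int], alert_ratio: float = 0.8) -> None:
        self.limits = limits
        self.alert_ratio = alert_ratio
        self.used: Dict[str, int] = defaultdict(int)

    def record(self, agent: str, tokens: int) -> str:
        """Add usage and return "ok", "alert", or "blocked" for this agent."""
        limit = self.limits[agent]  # KeyError for agents without a budget
        self.used[agent] += tokens
        if self.used[agent] > limit:
            return "blocked"
        if self.used[agent] >= self.alert_ratio * limit:
            return "alert"
        return "ok"
```

Because usage is keyed by agent, one agent reaching "blocked" does not affect the budget of any other agent.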

05

Bind admin ports to localhost only

The number one security mistake. Default installs often bind to 0.0.0.0. Check with ss -tlnp (or netstat -tlnp where net-tools is installed) after every install.

06

Use structured handoffs, not free-text delegation

When agents pass work to each other, use JSON schemas. Free text loses context at every handoff boundary.

07

Run a health check agent on a 5-minute loop

A simple agent that pings every endpoint and alerts on failure catches issues before users do.

08

Keep memory lean — prune weekly

Agent memory grows fast. Schedule weekly cleanup to remove stale context. Fresh memory = better decisions.
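
If the memory store is backed by SQLite, a weekly prune could be sketched like this. The memory table and its last_used column (a Unix timestamp) are hypothetical, since the real schema is not shown here.

```python
import sqlite3
import time
from typing import Optional

WEEK_S = 7 * 24 * 3600


def prune_stale(
    conn: sqlite3.Connection, max_age_s: float = WEEK_S, now: Optional[float] = None
) -> int:
    """Delete rows not used within max_age_s seconds; return how many were removed."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "DELETE FROM memory WHERE last_used < ?", (now - max_age_s,)
    )
    conn.commit()
    return cur.rowcount
```

Scheduling this from the same cron wrapper as the other jobs keeps the cleanup visible to the health check.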