When a model API call hangs indefinitely (e.g. Anthropic quota exceeded
mid-call), the gateway acquires a session .jsonl.lock but the promise
never resolves, so the try/finally block never reaches release(). Since
the owning PID is the gateway itself, stale detection cannot help —
isPidAlive() always returns true.
This commit adds four layers of defense:
1. **In-process lock watchdog** (session-write-lock.ts)
- Track acquiredAt timestamp on each held lock
- 60-second interval timer checks all held locks
- Auto-releases any lock held longer than maxHoldMs (default 5 min)
- Catches the hung-API-call case that try/finally cannot
2. **Gateway startup cleanup** (server-startup.ts)
- On boot, scan all agent session directories for *.jsonl.lock files
- Remove locks with dead PIDs or older than staleMs (30 min)
- Log each cleaned lock for diagnostics
3. **openclaw doctor stale lock detection** (doctor-session-locks.ts)
- New health check scans for .jsonl.lock files
- Reports PID status and age of each lock found
- In --fix mode, removes stale locks automatically
4. **Transcript error entry on API failure** (attempt.ts)
- When promptError is set, write an error marker to the session
transcript before releasing the lock
- Preserves conversation history even on model API failures
Closes#18060
Add support for Z.AI's native tool_stream parameter to enable real-time
visibility into model reasoning and tool call execution.
- Automatically inject tool_stream=true for zai/z-ai providers
- Allow disabling via params.tool_stream: false in model config
- Follows existing pattern of OpenRouter and OpenAI wrappers
This enables Z.AI API features described in:
https://docs.z.ai/api-reference#streaming
AI-assisted: Claude (OpenClaw agent) helped write this implementation.
Testing: lightly tested (code review + pattern matching existing wrappers)
Closes#18135
Synchronous hook that lets plugins inspect and optionally block messages
before they are written to the session JSONL file. Primary use case is
private mode... when enabled, the plugin returns { block: true } and the
message never gets persisted.
The hook runs on the hot path (synchronous, like tool_result_persist).
Handlers execute sequentially in priority order. If any handler returns
{ block: true }, the write is skipped immediately. Handlers can also
return a modified message to write instead of the original.
Changes:
- src/plugins/types.ts: add hook name, event/result types, handler map entry
- src/plugins/hooks.ts: add runBeforeMessageWrite() following tool_result_persist pattern
- src/agents/session-tool-result-guard.ts: invoke hook before every originalAppend() call
- src/agents/session-tool-result-guard-wrapper.ts: wire hook runner to the guard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On Windows, fs.promises.writeFile truncates the target file to 0 bytes
before writing. Since loadSessionStore reads the file synchronously
without holding the write lock, a concurrent read can observe the empty
file, fail to parse it, and fall through to an empty store — causing the
agent to lose its session context.
Changes:
- saveSessionStoreUnlocked (Windows path): write to a temp file first,
then rename it onto the target. If rename fails due to file locking,
retry 3 times with backoff, then fall back to copyFile (which
overwrites in-place without truncating to 0 bytes).
- loadSessionStore: on Windows, retry up to 3 times with 50ms
synchronous backoff (via Atomics.wait) when the file is empty or
unparseable, giving the writer time to finish. SharedArrayBuffer is
allocated once and reused across retry attempts.
Treat normal process exits (even with non-zero codes) as completed tool results.
This prevents standard exit codes (like grep exit 1) from being surfaced
as 'Tool Failure' warnings in the UI. The exit code is still appended
to the tool output for assistant awareness.
The gateway's system-presence.ts was not detecting the version when
OpenClaw is run as a launchd service, because the daemon-runtime.ts
sets OPENCLAW_SERVICE_VERSION but system-presence.ts only checked
OPENCLAW_VERSION and npm_package_version.
This caused 'openclaw status' to show 'unknown' for the version.
Issue: #18456🤖 AI-assisted (lightly tested)
Qwen 3 (and potentially other reasoning-capable models served via Ollama)
returns its final answer in a `reasoning` field with an empty `content`
field. This causes blank/empty responses since OpenClaw only reads `content`.
Changes:
- Add `reasoning?` to OllamaChatResponse message type
- Fall back to `reasoning` when `content` is empty in buildAssistantMessage
- Accumulate `reasoning` chunks during streaming when `content` is empty
This allows Qwen 3 to work correctly both with and without /no_think mode.
downgradeOpenAIReasoningBlocks was only called on model change, but
orphaned reasoning items (e.g. from an aborted stream) can exist without
a model switch and cause a 400 from the OpenAI Responses API.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`models set` accepts any syntactically valid model ID without checking
the catalog, allowing typos to silently persist in config and fail at
runtime. It also unconditionally adds an empty `{}` entry to
`agents.defaults.models`, bypassing any provider routing constraints.
This commit:
- Validates the model ID against the catalog (skipped when catalog is
empty during initial setup)
- Warns when a new entry is added with empty config (no provider routing)
Closesopenclaw/openclaw#17183✍️ Author: Claude Code with @carrotRakko (AI-written, human-approved)
The gateway unconditionally scheduled a SIGUSR1 restart after every
update.run call, even when the update itself failed (broken deps,
build errors, etc.). This left the process restarting into a broken
state — corrupted node_modules, partial builds — causing a crash loop
that required manual intervention.
Three fixes:
1. Only restart on success: scheduleGatewaySigusr1Restart is now
gated on result.status === "ok". Failed or skipped updates still
write the restart sentinel (so the status can be reported back to
the user) but the running gateway stays alive.
2. Early bail on step failure: deps install, build, and ui:build now
check exit codes immediately (matching the preflight section) so a
failed deps install no longer cascades into a broken build and
ui:build.
3. Auto-repair config during update: the doctor step now runs with
--fix alongside --non-interactive, so unknown config keys left over
from schema changes between versions are stripped automatically
instead of causing a startup validation crash.
The global `agents.defaults.thinkingDefault` forces a single thinking
level for all models. Users running multiple models with different
reasoning capabilities (e.g. Claude with extended thinking, GPT-4o
without, Gemini Flash with lightweight reasoning) cannot optimise the
thinking level per model.
Add an optional `thinkingDefault` field to `AgentModelEntryConfig` so
each entry under `agents.defaults.models` can declare its own default.
Resolution priority: per-model → global → catalog auto-detect.
Example config:
"models": {
"anthropic/claude-sonnet-4-20250514": { "thinkingDefault": "high" },
"openai/gpt-4o": { "thinkingDefault": "off" }
}
Co-authored-by: Cursor <cursoragent@cursor.com>
Add automatic llms.txt awareness so agents check for /llms.txt or
/.well-known/llms.txt when exploring new domains.
Changes:
- System prompt: new 'llms.txt Discovery' section (full mode only,
when web_fetch is available) instructing agents to check for llms.txt
files when visiting new domains
- web_fetch tool: updated description to mention llms.txt discovery
llms.txt is an emerging standard (like robots.txt for AI) that helps
site owners describe how AI agents should interact with their content.
Making this a default behavior helps the ecosystem adopt agent-native
web experiences.
Ref: https://llmstxt.org
When a user sets `agents.defaults.model.primary: "ollama/gemma3:4b"`
but forgets to set OLLAMA_API_KEY, the error is a confusing
"unknown model: ollama/gemma3:4b". The Ollama provider requires any
dummy API key to register (the local server doesn't actually check it),
but this isn't obvious from the error.
Add `buildUnknownModelError()` that detects known local providers
(ollama, vllm) and appends an actionable hint with the env var name
and a link to the relevant docs page.
Before: Unknown model: ollama/gemma3:4b
After: Unknown model: ollama/gemma3:4b. Ollama requires authentication
to be registered as a provider. Set OLLAMA_API_KEY="ollama-local"
(any value works) or run "openclaw configure".
See: https://docs.openclaw.ai/providers/ollamaCloses#17328
When the gateway connection fails due to device token mismatch (e.g., after
re-pairing the device), clear the stored device-auth token so that
subsequent connection attempts can obtain a fresh token.
This fixes the cron tool failing with 'device token mismatch' error
after running 'openclaw configure' to re-pair the device.
Fixes#18175
When an agent config specifies `model: { primary: "..." }` without
an explicit `fallbacks` array, the existing code replaced the entire
model object from `agents.defaults`—discarding the default fallbacks.
This caused cron jobs (and agent sessions) to have only one model
candidate (the pinned model) plus the global primary as a final
fallback, skipping all intermediate fallback models.
The fix merges the agent model override into the existing defaults
model object using spread, so that keys like `fallbacks` survive
when the agent only overrides `primary`. Agents can still explicitly
override or clear fallbacks by providing their own `fallbacks` array.
Reproduction scenario:
- `agents.defaults.model = { primary: "codex", fallbacks: ["opus", "flash", "deepseek"] }`
- Agent config: `model: { primary: "codex" }`
- Cron job pins: `model: "flash"`
- Before fix: fallback candidates = [flash, codex] (3 models lost)
- After fix: fallback candidates = [flash, opus, deepseek, ..., codex]
Add a `spawn` action to the /subagents command handler that invokes
spawnSubagentDirect() to deterministically launch a named subagent.
Usage: /subagents spawn <agentId> <task> [--model <model>] [--thinking <level>]
Also includes the shared subagent-spawn module extraction (same as the
refactor/extract-shared-subagent-spawn branch) since it hasn't merged yet.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use stricter regex: /^[A-Za-z0-9+/]*={0,2}$/ ensures = only at end
- Normalize URL-safe base64 to standard (- → +, _ → /)
- Added tests for padding in wrong position and URL-safe normalization
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds explicit base64 format validation in sanitizeContentBlocksImages()
to prevent invalid image data from being sent to the Anthropic API.
The Problem:
- Node's Buffer.from(str, "base64") silently ignores invalid characters
- Invalid base64 passes local validation but fails at Anthropic's stricter API
- Once corrupted data persists in session history, every API call fails
The Fix:
- Add validateAndNormalizeBase64() function that:
- Strips data URL prefixes (e.g., "data:image/png;base64,...")
- Validates base64 character set with regex
- Checks for valid padding (0-2 '=' chars)
- Validates length is proper for base64 encoding
- Invalid images are replaced with descriptive text blocks
- Prevents permanent session corruption
Tests:
- Rejects invalid base64 characters
- Strips data URL prefixes correctly
- Rejects invalid padding
- Rejects invalid length
- Handles empty data gracefully
Closes#18212
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update test expectation: 'defaults to enable on Node 22'
- Update comment in fetch.ts to explain IPv4 fallback rationale
- Addresses greptile review feedback
Fixes issue where Telegram fails to send messages when IPv6 is configured
but not functional on the network.
Problem:
- Many networks (especially in Latin America) have IPv6 configured but
not properly routed by ISP/router
- Node.js tries IPv6 first, gets 'Network is unreachable' error
- With autoSelectFamily=false, Node doesn't fallback to IPv4
- Result: All Telegram API calls fail
Solution:
- Change default from false to true for Node.js 22+
- This enables automatic IPv4 fallback when IPv6 fails
- Config option channels.telegram.network.autoSelectFamily still available
for users who need to override
Symptoms fixed:
- Health check: Telegram | WARN | failed (unknown) - fetch failed
- Logs: Network request for 'sendMessage' failed
- Bot receives messages but cannot send replies
Tested on:
- macOS 26.2 (Sequoia)
- Node.js v22.15.0
- OpenClaw 2026.2.12
- Network with IPv6 configured but not routed
On macOS, `openclaw gateway install` hardcodes the system node
(/opt/homebrew/bin/node) in the launchd plist, ignoring the node from
version managers (fnm/nvm/volta). This causes the Gateway to run a
different node version than the user's shell environment.
Two fixes:
1. `resolvePreferredNodePath` now checks `process.execPath` first.
If the currently running node is a supported version, use it directly.
This respects the user's active version manager selection.
2. `buildMinimalServicePath` now includes version manager bin directories
on macOS (fnm, nvm, volta, pnpm, bun), matching the existing Linux
behavior.
Fixes#18090
Related: #6061, #6064