fix: transcribe audio before mention check in groups with requireMention (openclaw#9973) thanks @mcinteerj

Verified:
- pnpm install --frozen-lockfile
- pnpm build
- pnpm check
- pnpm test

Co-authored-by: mcinteerj <3613653+mcinteerj@users.noreply.github.com>
2026-02-13 04:58:01 +13:00
parent a5ab9fac0c
commit a2ddcdadeb
7 changed files with 245 additions and 38 deletions
--- a/docs/nodes/audio.md
+++ b/docs/nodes/audio.md
@@ -107,8 +107,27 @@ Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI
 - Transcript is available to templates as `{{Transcript}}`.
 - CLI stdout is capped (5MB); keep CLI output concise.

+## Mention Detection in Groups
+
+When `requireMention: true` is set for a group chat, OpenClaw now transcribes audio **before** checking for mentions. This lets a voice note pass the mention gate when the mention is only spoken, not typed.
+
+**How it works:**
+
+1. If a voice message has no text body and the group requires mentions, OpenClaw performs a "preflight" transcription.
+2. The transcript is checked for mention patterns (e.g., `@BotName`, emoji triggers).
+3. If a mention is found, the message proceeds through the full reply pipeline.
+4. Because the transcript feeds mention detection, a voice note with no text body can still pass the mention gate.
+
+**Fallback behavior:**
+
+- If transcription fails during preflight (timeout, API error, etc.), mention detection falls back to the message text alone.
+- A transcription failure therefore never causes a mixed message (text + audio) to be dropped incorrectly.
+
+**Example:** A user sends a voice note saying "Hey @Claude, what's the weather?" in a Telegram group with `requireMention: true`. The voice note is transcribed, the mention is detected, and the agent replies.
+
 ## Gotchas

 - Scope rules use first-match wins. `chatType` is normalized to `direct`, `group`, or `room`.
 - Ensure your CLI exits 0 and prints plain text; JSON needs to be massaged via `jq -r .text`.
 - Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.
+- Preflight transcription only processes the **first** audio attachment for mention detection. Additional audio is processed during the main media understanding phase.
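
The preflight flow documented in this diff can be sketched roughly as follows. This is an illustration of the described behavior, not the code from the commit: `GroupMessage`, `Attachment`, `Transcriber`, and `passesMentionGate` are hypothetical names, and the transcriber is a synchronous callback only to keep the sketch short (a real transcription call would be asynchronous).

```typescript
// Hypothetical types for illustration; not OpenClaw's real interfaces.
interface Attachment {
  kind: "audio" | "image" | "file";
  id: string;
}

interface GroupMessage {
  text: string; // empty for a pure voice note
  attachments: Attachment[];
}

// Returns the transcript, or throws on timeout / API error.
type Transcriber = (audio: Attachment) => string;

// Patterns must not use the `g` flag, since RegExp.test is stateful with it.
const containsMention = (text: string, patterns: RegExp[]): boolean =>
  patterns.some((p) => p.test(text));

function passesMentionGate(
  msg: GroupMessage,
  patterns: RegExp[],
  transcribe: Transcriber,
): boolean {
  // Text-only detection: a typed mention always passes.
  if (containsMention(msg.text, patterns)) return true;

  // Preflight applies only when there is no text body.
  if (msg.text.trim() !== "") return false;

  // Only the first audio attachment is considered for mention detection.
  const firstAudio = msg.attachments.find((a) => a.kind === "audio");
  if (firstAudio === undefined) return false;

  try {
    return containsMention(transcribe(firstAudio), patterns);
  } catch {
    // Fallback: transcription failed, so only the (empty) text counts.
    return false;
  }
}

// Usage: a voice note whose transcript contains the mention.
const mentionPatterns = [/@Claude\b/i];
const voiceNote: GroupMessage = {
  text: "",
  attachments: [{ kind: "audio", id: "a1" }],
};
const allowed = passesMentionGate(
  voiceNote,
  mentionPatterns,
  () => "Hey @Claude, what's the weather?",
);
```

Checking the text first means a failed transcription can only affect messages that had no typed mention to begin with, which matches the fallback behavior described in the docs above.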