Model Context Protocol

Your agent's best answer
is just one perspective.

mumo adds defensibility through diversity. Pick your models and watch them engage each other on your approach. Move on when you're ready, or react to start another round.

Claude Code·Cursor·Visual Studio·Codex·Windsurf
Evidence

Real sessions. Real shipped code.

Four deliberations the mumo team ran through Claude Code against mumo's own codebase. Each caught something a single reviewer would have softened or missed. Total spend across all four: $0.40.

PostgREST can't hold cross-request transactions; the plpgsql function provides atomicity.
GPT-5.4 — Round 2

The cost-tracking plan that survived a three-round panel

A ~150-LOC plan to unify cost tracking — one ledger table, FK-enforced bucket enum, paired-write RPC. The panel shredded v1 as architecturally off: cost_bucket-as-column was the wrong shape; the right shape is a sibling ledger. By v3 the plan had absorbed atomic RPCs, a canonical union view, and an FK-to-lookup-table design.

GPT · Qwen
$0.23·3 rounds·100k tokens
Rotation is a distraction from uncontrolled access, weak auditability, and stale decrypted-key caching.
GPT-5.4 — independently seconded by Qwen

BYOK hardening — the rotation story was covering for a bigger problem

The plan positioned key rotation as THE headline pre-launch risk. Both panelists independently rejected the framing on round 1. Additional catches: algorithm_version alone loses cryptographic attribution on rewrap; backfill needs an atomic migration; KEKProvider abstraction is ceremony — pick one KMS.

GPT · Qwen
$0.08·1 round·30k tokens
The real fork is payload shape, not transport. Poll a tiny thing or stream a tiny thing.
GPT-5.4 — reframing the spec

Streaming round-barrier race — caught before shipping

Late-stage review of AI-moderated streaming. Panel converged on the polling option — but refined it past the original spec. Qwen added three mitigations and a decoupling move: poll returns only a new round_id signal, so SSE opens on the battle-tested per-round endpoint instead of waiting on a heavyweight getSession.

GPT · Qwen
$0.06·1 round·26k tokens
Not just lossy — actively misleading. Moderator cost is attributed to participant models.
Qwen-3.6+ — round 1

Admin dashboard — flagged actively misleading attribution

A 12-bucket cost breakdown proposal replacing legacy 3-bucket compression. Both models immediately flagged the existing design as actively misleading. Key decisions: fold extraction into participant row with tooltip surfacing; rename 'Session chrome' to 'Session overhead'; cache efficiency ratio on participant rows.

GPT · Qwen
$0.03·1 round·7k tokens
Install

Connect in two minutes.

mumo speaks HTTP MCP. Any client that supports HTTP transport with a custom Authorization header works — paste a key, restart, call tools.

Step 1 — Get a key

Create a platform key at mumo.chat/settings/api-keys. Keys start with mmo_live_.

Step 2 — Export it
export MUMO_API_KEY=mmo_live_YOUR_KEY_HERE
Step 3 — Install the plugin
/plugin marketplace add anthropics/claude-plugins-official
/plugin install mumo

Bundles the MCP server and an auto-triggering skill. Restart Claude Code to pick up the new tools.

Manual install (skip the skill) →

If you'd rather wire the server by hand without the auto-triggering skill, use the CLI directly:

claude mcp add --transport http mumo https://mumo.chat/api/mcp \
  --header "Authorization: Bearer mmo_live_…"
The Loop

Call. Read the room. Decide.

Default mode is remote — you drive each round. Every round returns each model's prose plus a cross-model claim map showing where they agree and where they split. Append when you want more. Stop when you have what you need.

01

Start a deliberation

Call create_deliberation with a prompt. Pick 2–3 models or let mumo choose. Optionally pass a reference doc. Default mode is remote — you drive each round.

02

Read the room

rounds[0].responses[] has each model's full prose. rounds[0].claim_map.claims[] cross-references their positions on key claims — agreements collapse, splits fan out. Where they diverge is where the thinking happens.

03

Steer with typed snippets — or stop

If you want more, call append_round with KEEP / EXPLORE / CHALLENGE / CORE / SHIFT snippets quoting the claims you want amplified, pressured, or resolved. Models see the type — a CHALLENGE makes them defend or revise; a neutral forward doesn't. Stop when you have what you need.

// Agent calls create_deliberation
{
  "prompt": "Postgres or MongoDB for the event store?",
  "application": "Claude Code"
}

// ↓ returns (abridged)
{
  "id": "abc-123",
  "status": "ready",
  "rounds": [{
    "responses": [
      { "model": "claude", "prose": "..." },
      { "model": "gpt", "prose": "..." }
    ],
    "claim_map": {
      "claims": [ /* each w/ positions per model */ ]
    }
  }]
}
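The claims comment in the abridged response above elides the claim shape. As a sketch — the field names here are illustrative assumptions, not the published schema — a claim entry cross-referencing per-model positions might look like:

```json
// Hypothetical claim entry — field names are illustrative, not the schema
{
  "text": "Postgres is sufficient at this write volume",
  "positions": [
    { "model": "claude", "stance": "agree" },
    { "model": "gpt", "stance": "disagree" }
  ]
}
```

Entries where the stances split are the ones worth quoting into an append_round snippet.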
Tools

Five tools. Each with one job.

One schema, two surfaces. REST and MCP share shapes, so consumers never track two versions — and every improvement lands in both. Full schemas in the MCP reference.

create_deliberation — primary
Start a new deliberation. Two modes distinguished by moderator_model: remote (you drive rounds via append_round) or autonomous (AI moderator runs the full arc, poll for ready). Inputs: prompt, reference, models, rounds, moderator_model, application, distill. MCP defaults distill off — responses + claim map are the high-signal artifacts for agents; pass distill: "summary", "brief", or "both" when surfacing output to a human (prose, structured, or both).
append_round
Add a follow-up round with steering snippets. The highest-signal way to direct attention — models see snippet types as framings, not neutral forwards. Each snippet: type, verbatim quote, quoted_model, optional comment.
get_session
Fetch full session state — all rounds, responses, claim maps, distills, editorial summary. Poll this for autonomous-mode sessions. The highest-signal field for downstream agent decisions: rounds[].claim_map.claims[].positions[].
list_sessions
List the caller's sessions, optionally filtered by status or mode. Useful for agents managing concurrent sessions — filter on status: "ready" to find sessions awaiting the next round.
list_models
Enumerate models available to your tier with provider, display name, context window, max output tokens, and pricing. Call before create_deliberation when the user specifies models by name.
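For autonomous mode, the tool notes above say to poll get_session until the session is ready. A minimal polling sketch, with the caveat that `call_tool` is a hypothetical stand-in for your MCP client's tool-call method — only the get_session tool name and the "ready" status come from the reference above:

```python
import time

def wait_until_ready(call_tool, session_id, interval=2.0, timeout=120.0):
    """Poll get_session until an autonomous deliberation is ready.

    call_tool is a stand-in for your MCP client's tool-call method
    (assumed signature: call_tool(name, arguments) -> dict).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        session = call_tool("get_session", {"id": session_id})
        if session["status"] == "ready":
            return session
        time.sleep(interval)  # back off between polls
    raise TimeoutError(f"session {session_id} not ready after {timeout}s")
```

In remote mode none of this is needed — create_deliberation returns the round synchronously.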
When to reach

Best for the hard questions.

mumo's value scales with how contested the decision is. A second and third perspective change the answer on some questions and not others — here's the rough shape of where the panel earns its keep.

Reach for mumo

  • Architecture decisions with non-obvious tradeoffs — storage, transport, consistency, caching
  • Plan reviews that tend to get rubber-stamped
  • Contested product or pricing choices with multiple defensible answers
  • Pre-launch pressure tests — "what would you regret?"

Skip mumo

  • × Factual lookups with one correct answer
  • × Code generation from a clear spec
  • × Routine refactoring or formatting
How to ask

Prompts that work.

Straight from Opus 4.7, based on multiple mumo MCP sessions. Your questions may have generic answers online. mumo's value is multiple models debating your specifics across rounds — not the best-practice writeup. The prompt's job is to give the panel specifics worth debating — not more, not less.

Architecture tradeoffs

Pattern. State the decision. Name the candidate approaches. List the constraints that change what "correct" means. Ask what you’d regret, not what’s best.

We're building an event store for ~50k events/day, ~10M total,
90% writes / 10% reads (by event_id within a day window).

Weighing:
  - Postgres events table, index on (day, event_id)
  - EventStoreDB
  - Mongo time-series

Constraints: 2 engineers, Postgres-experienced, no Mongo/ES in
production before. 3-month runway.

What would we regret 6 months in for each option? Where do
your answers converge, where do they split?

Why it works. Regret framing forces models to surface load-bearing assumptions instead of rehearsing defaults. The explicit convergence-vs-split ask engages the panel format — you get three takes that actually disagree, not three summaries of the same consensus.

Code review — describe the shape, don’t paste the code

Pattern. Describe what the code does in 3–5 lines. Include a type signature or a 5–10 line illustrative snippet if it clarifies. Ask the specific question you have about the shape — not “review this.”

We have a worker pairing DB writes across two tables via an RPC:

  insert_model_response_with_spend(response_jsonb, ledger_jsonb)
    -- plpgsql, two INSERTs inside an implicit BEGIN/END

Question: is a plpgsql RPC the right atomicity primitive, or am
I avoiding the real question (should the caller hold a shared
transaction across multiple operations)?

Want tradeoffs on observability, testability, and what breaks
when we add a third paired write.

Why it works. The panel debates shape, not syntax. A 500-line file paste produces three compressed line-by-line critiques that converge on surface issues. A 5-line shape + a specific question produces three architectural takes that actually disagree — which is the product.

Strategy / framing

Pattern. Name the decision. Name the frames you’re already using. Ask specifically for the frame you’re missing.

Deciding: build our own feature-flag system or adopt
LaunchDarkly / ConfigCat / Flagsmith.

Context: ~80 flags across 3 services, 4 engineers, need
LDAP-backed audit for an upcoming SOC 2 window. Currently
using a JSON blob in S3 we edit by hand.

I'm framing this as a "build vs. buy on cost + feature parity"
question. Am I missing a framing that changes the answer?

Why it works. Asking for missing frames gets at what’s invisible from inside the decision. Build-vs-buy cost-and-parity framing routinely misses time-to-market, org-leverage-cost, or compliance-boundary framings — and those are where the answer usually actually sits.

What doesn't work
  • × Pasting whole files and asking “thoughts?” — the panel dissolves into compressed line-by-line feedback and stops disagreeing.
  • × Asking “which is best?” — invites consensus, suppresses the productive split.
  • × Questions a Stack Overflow search already answers — that’s single-model territory.
Reliability

Failure is handled. You can retry.

mumo's MCP server is built for agent loops, where transient failures are the norm. Three guarantees that make it safe to wire into your main path.

Deterministic retry
Every call echoes an idempotency_key. If your client times out mid-request, retry with the same key — the server tells you exactly what happened: succeeded and cached, still running, or failed and refunded. No double-charge risk.
60-second refund SLO
When a round fails catastrophically — all models failed, provider outage, dependency timeout — the credit is automatically restored within 60 seconds. Visible as refund_status: "credited" in the progress response.
Classified failures
Failures come with a failure_code — rate_limit, provider_outage, all_providers_failed, dependency_timeout, or stuck_reconciled. Your agent picks retry behavior per class instead of "try again and hope."
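The three guarantees above combine into a simple retry policy: reuse the same idempotency_key and branch on the failure class. A minimal sketch — which codes count as retryable here is a judgment call for your agent, not something the docs prescribe:

```python
# Documented failure codes, partitioned by an assumed retry policy:
# transient classes are worth a retry with the same idempotency_key;
# all_providers_failed and stuck_reconciled are treated as terminal here.
RETRYABLE = {"rate_limit", "provider_outage", "dependency_timeout"}

def should_retry(failure_code: str, attempt: int, max_attempts: int = 3) -> bool:
    """Decide retry behavior per failure class instead of blind retry."""
    if attempt >= max_attempts:
        return False
    return failure_code in RETRYABLE
```

On retry, the echoed idempotency_key guarantees you learn the prior attempt's fate — succeeded and cached, still running, or failed and refunded — so there is no double-charge risk.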
Steering

Five buckets. Five framings.

Snippet types are the primitive. The type carries a specific framing into the next round's prompt — the model receiving a CHALLENGE sees a different instruction than the model receiving a KEEP.

KEEP
This resonates with me.
EXPLORE
Let's go deeper on this.
CHALLENGE
I'm not sold on this.
CORE
This is what it comes down to.
SHIFT
This shifted my perspective.

Worked example — resolving a pricing model

R1 · GPT: per-seat pricing for enterprise tier.
R1 · Claude: usage-based pricing with seat fallback.
Agent reads round 1, decides per-seat is the weaker option for early pilots.
R2 · append_round:
CHALLENGE GPT's "per-seat assumes teams of >10" — "most pilots start at 3–5"
KEEP Claude's "usage-based aligns incentives"
Models read the framings and converge on usage-based with per-seat fallback for >25.
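As tool arguments, that round-2 call might look like the following — per the tool reference, each snippet carries type, a verbatim quote, quoted_model, and an optional comment, but the exact key names beyond those are assumptions, not the schema:

```json
// Illustrative append_round arguments for the worked example —
// key names beyond type / quoted_model are assumed, not the schema
{
  "session_id": "abc-123",
  "snippets": [
    {
      "type": "CHALLENGE",
      "quote": "per-seat assumes teams of >10",
      "quoted_model": "gpt",
      "comment": "most pilots start at 3–5"
    },
    {
      "type": "KEEP",
      "quote": "usage-based aligns incentives",
      "quoted_model": "claude"
    }
  ]
}
```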
Common Questions

What agents ask.

What's the latency on a single round?
Typical single-round deliberations take 15–40s depending on models chosen and prompt complexity. The first model response starts streaming almost immediately; the full claim map and distill land after all participants complete. For agents that need to respond quickly, use rounds: 1 and return the distill narrative to the user with a link to the session.
Does my agent need to manage session IDs?
Only if you want to append rounds. For single-shot use, the first create_deliberation response already contains everything — claim map, distill, per-model responses, editorial summary. Save the session ID if you want to come back to it; otherwise the session is fully self-contained in the initial response.
Can I pick specific models?
Yes — pass a models array with 2–3 model IDs. Call list_models first to see what's available on your tier. If you don't specify models, mumo uses a platform-selected default roster tuned for panel dynamics.
Does the response include human-readable output?
By default on MCP, no. Raw per-model responses and a cross-model claim map are the highest-signal artifacts for programmatic consumers, and the human-readable fields cost an extra LLM call per round — not worth it if you're just driving the next round from the claim map. If your agent surfaces output to a human, pass one of three values on create_deliberation, depending on what your human reads better:
  • distill: "summary" — flowing narrative prose.
  • distill: "brief" — structured: narrative, agreements, disagreements, plus continuation.recommendation (stop / continue / explore). Some readers scan structured points faster than prose.
  • distill: "both" — both.
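Wiring one of those options into the initial call is a one-line change — a sketch of create_deliberation arguments using the inputs listed in the tool reference, with a placeholder prompt:

```json
// Surfacing output to a human: request the structured distill
{
  "prompt": "Build vs. buy for feature flags?",
  "application": "Claude Code",
  "distill": "brief"
}
```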
What happens if a model fails mid-session?
Participants fail independently — the session continues with whichever models succeed. You'll see the failed model's slot marked accordingly in the response, with the rest of the panel's work intact. For critical deliberations, 3-model panels give you more resilience than 2.
Are confidence scores reliable?
They're self-reported and not calibrated across models — treat them as rough signals within a model's own output rather than cross-model comparable. A confidence_disclaimer string is included in responses when scores are present; surface it if you display scores to users.

Ready to give your agent a panel?

Two minutes to install. Five tools. Your agent, plus a panel.

Install for Claude Code