MCU Speech-to-Text API
Real-time transcription for an MCU / conferencing server. Stream PCM16 audio over a WebSocket; get live transcripts, an embeddable caption iframe, a subscriber stream, and an automatic meeting summary.
Base URL: https://mcustt.manager.click
WebSocket: wss://mcustt.manager.click
PCM16 · 16 kHz · mono · auto language detect · live transcripts · iframe embed · meeting summary
1 · Overview
The flow: your MCU opens a WebSocket and streams raw audio → the service runs streaming speech-to-text (Google STT v2 / Chirp 2) → transcript events are returned on the same socket and fanned out to (a) any read-only subscribers, and (b) a branded SSE caption page you can iframe. Every meeting gets an auto-generated sessionId. When the meeting ends, a summary is generated from the transcript. You can also batch-summarize subtitles you already have, with no audio.
| Capability | How |
|---|---|
| Live transcription | Stream PCM16 to /ws/stt; read transcript frames back |
| Auto language detection | Open a session with autoDetect:true (default); a language event reports the detected BCP-47 code |
| Live captions in a webpage | Drop the /embed/:id iframe into any page |
| Programmatic transcript stream | Connect a read-only WebSocket to /ws/subscribe |
| Post-meeting summary | Auto on session end; fetch via GET /api/sessions/:id/summary |
| Batch summarize existing subtitles | POST /api/transcribe-summary |
2 · Authentication
Four credential types. Only the tenant API key is long-lived and secret — keep it server-side, never in a browser.
| Credential | Used for | Where | Lifetime |
|---|---|---|---|
| Tenant API key (sk_mcustt_…) | Create sessions, batch, poll jobs | Header x-tenant-key (server-to-server) | Until rotated |
| Ingest token (JWT) | Authorize the audio producer | /ws/stt?token=… | Single-use, ~12 h |
| Embed / Subscribe token (JWT) | Read-only caption access | ?t= (SSE), ?token= (WS), or URL #fragment (iframe) | ~8 h, read-only |
| Owner secret | Rotate ingest token, end session | JSON body ownerSecret | Session lifetime |
3 · Quickstart
Option A — explicit (recommended for server integrations)
# 1) Create a session
curl -X POST https://mcustt.manager.click/api/sessions \
-H "x-tenant-key: $KEY" -H "content-type: application/json" \
-d '{"autoDetect":true,"title":"Board call"}'
# Response:
# {
# "sessionId":"B4FB9SPaYH",
# "ingestToken":"eyJ…", "ingestUrl":"wss://mcustt.manager.click/ws/stt?token=eyJ…",
# "embedToken":"eyJ…", "embedUrl":"https://mcustt.manager.click/embed/B4FB9SPaYH#t=eyJ…",
# "subscribeToken":"eyJ…","subscribeUrl":"wss://mcustt.manager.click/ws/subscribe?token=eyJ…",
# "ownerSecret":"…", "brand":{…}
# }
# 2) Open ingestUrl as a WebSocket and send raw PCM16 binary frames.
# 3) Read JSON {type:"transcript",…} frames back on the same socket.
Option B — zero-ceremony
Skip step 1: open wss://mcustt.manager.click/ws/stt directly with an x-tenant-key header. The first message back is {type:"ready", sessionId, embedUrl, subscribeUrl, ownerSecret}. Optional query: ?sourceLang=en-US&autoDetect=false.
4 · Audio frame contract
| Property | Value |
|---|---|
| Encoding | Raw PCM signed 16-bit (LINEAR16), no WAV header |
| Sample rate | 16 000 Hz |
| Channels | 1 (mono) |
| Byte order | Little-endian |
| Frame size | 20–100 ms per binary message (640–3200 bytes). 20 ms (640 B) is ideal. |
| Per-frame cap | 1 MB (oversized → close 1009) |
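The frame sizes in the table follow from 16 000 samples per second at 2 bytes per sample. As a minimal sketch (the helper names `frame_bytes` and `split_frames` are hypothetical, not part of the service), a producer could slice a raw PCM16 buffer into frames of the recommended size and sanity-check it before sending:

```python
SAMPLE_RATE_HZ = 16_000   # from the contract table
BYTES_PER_SAMPLE = 2      # signed 16-bit, mono


def frame_bytes(ms: int = 20) -> int:
    """Size in bytes of one frame lasting `ms` milliseconds."""
    return SAMPLE_RATE_HZ * ms // 1000 * BYTES_PER_SAMPLE


def split_frames(pcm: bytes, ms: int = 20) -> list:
    """Cut raw PCM16 into frames of `ms` ms; the last frame may be shorter."""
    if pcm[:4] == b"RIFF":
        raise ValueError("WAV header detected; send raw PCM only")
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("odd byte count; PCM16 samples are 2 bytes each")
    size = frame_bytes(ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

`frame_bytes(20)` gives 640 and `frame_bytes(100)` gives 3200, matching the table. Whether the service accepts a trailing frame shorter than 20 ms is not stated here, so padding the tail with silence may be the safer choice.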
If a frame begins with a WAV header (RIFF…) or has an odd length, the socket closes with 1003 INVALID_AUDIO_FORMAT. If your source is 48 kHz / stereo / Opus, downmix and resample to 16 kHz mono PCM16 before sending.
5 · WebSocket — ingest /ws/stt
wss://mcustt.manager.click/ws/stt?token=INGEST_TOKEN
Auth: ingest token in the query, or an x-tenant-key header (zero-ceremony). Bidirectional: you send audio + control; the server sends transcripts + control.
Client → Server
| Message | Meaning |
|---|---|
| binary frame | One PCM16 audio frame (see §4) |
| {"type":"ping","ts":123} | Keepalive; server replies pong |
| {"type":"stop"} | End of meeting → finalizes, ends the session, triggers the summary |
Server → Client
| Message | Meaning |
|---|---|
| {"type":"ready","sessionId":"…"} | Stream is open. (Zero-ceremony also returns embedUrl, subscribeUrl, ownerSecret.) |
| {"type":"transcript","text":"…","isFinal":false,"seq":41,"lang":"en-US","ts":171…} | Interim (being-written) line; seq is provisional and updates in place |
| {"type":"transcript","text":"…","isFinal":true,"seq":41,"lang":"en-US","ts":171…} | Final line for seq (persisted) |
| {"type":"language","code":"hi-IN"} | Auto-detected language changed |
| {"type":"session-end"} | Session ended |
| {"type":"error","code":"…","message":"…","fatal":true} | Error; fatal means the socket will close |
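Because interim lines share a seq with the final that later replaces them, a consumer needs to overwrite by seq instead of appending. The sketch below is one possible way to apply the server frames to an in-memory buffer; the class name `TranscriptBuffer` and the policy of ignoring interims that arrive after a final are assumptions, not documented behavior.

```python
import json
from typing import Optional


class TranscriptBuffer:
    """Keeps one line per seq; interim text is overwritten, finals are locked."""

    def __init__(self) -> None:
        self.lines = {}            # seq -> latest text
        self.final = set()         # seqs that have received isFinal:true
        self.language: Optional[str] = None
        self.ended = False

    def apply(self, raw: str) -> None:
        """Apply one JSON text frame received from the server."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "transcript":
            seq = msg["seq"]
            if seq in self.final:
                return  # assumed policy: a finalized line is never overwritten
            self.lines[seq] = msg["text"]
            if msg.get("isFinal"):
                self.final.add(seq)
        elif kind == "language":
            self.language = msg["code"]
        elif kind == "session-end":
            self.ended = True
        elif kind == "error" and msg.get("fatal"):
            raise RuntimeError(f'{msg.get("code")}: {msg.get("message")}')

    def text(self) -> str:
        """All lines in seq order, one per row."""
        return "\n".join(self.lines[s] for s in sorted(self.lines))
```

Feeding it an interim and then a final frame for seq 41 leaves a single line holding the final text.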
Reconnecting
The ingest token is single-use (burned on connect). To reconnect a dropped producer, mint a fresh one with the owner secret via POST /api/sessions/:id/ingest-token, then reopen the socket.
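A minimal sketch of the first half of that reconnect step, using only the Python standard library. The function name `ingest_token_request` is hypothetical, and the shape of the response is not listed for this endpoint, so the example only builds the request and leaves response parsing to the caller.

```python
import json
import urllib.request

BASE_URL = "https://mcustt.manager.click"


def ingest_token_request(session_id: str, owner_secret: str) -> urllib.request.Request:
    """Build (but do not send) the POST that mints a fresh ingest token."""
    body = json.dumps({"ownerSecret": owner_secret}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/sessions/{session_id}/ingest-token",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Send it with `urllib.request.urlopen(req)`, read the new token or URL from the JSON response (inspect the field names before relying on them), then reopen /ws/stt with it.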
5b · WebRTC ingest (/api/webrtc, WHIP)
POST /api/webrtc?token=INGEST_TOKEN   Content-Type: application/sdp
An alternative to raw-WS PCM when the producer is a browser or mobile app: send a WebRTC audio track (Opus) and the server decodes it to 16 kHz mono internally, so your client doesn't resample. Uses WHIP (WebRTC-HTTP Ingestion Protocol): POST your SDP offer, get the SDP answer back. Auth: ingest token (?token=) or x-tenant-key header.
Because WHIP is ingest-only, transcripts come back on a separate channel: the response carries X-Session-Id and X-Subscribe-Token headers — open /ws/subscribe (§6) or the SSE feed (§7) with that token to receive transcript events. DELETE /api/webrtc/:resourceId (from the Location header) hangs up.
The client needs a STUN server (e.g. stun:stun.l.google.com:19302) to gather its own candidates.
// Browser: send mic audio over WebRTC, read transcripts over WS
const pc = new RTCPeerConnection({ iceServers:[{urls:"stun:stun.l.google.com:19302"}] });
const mic = await navigator.mediaDevices.getUserMedia({ audio:true });
mic.getTracks().forEach(t => pc.addTrack(t, mic));
await pc.setLocalDescription(await pc.createOffer());
await new Promise(r => pc.onicegatheringstatechange =
() => pc.iceGatheringState === "complete" && r());
const res = await fetch("https://mcustt.manager.click/api/webrtc?token=" + INGEST_TOKEN, {
method:"POST", headers:{ "content-type":"application/sdp" }, body: pc.localDescription.sdp });
await pc.setRemoteDescription({ type:"answer", sdp: await res.text() });
const sub = new WebSocket("wss://mcustt.manager.click/ws/subscribe?token=" + res.headers.get("x-subscribe-token"));
sub.onmessage = e => { const m = JSON.parse(e.data); if (m.type==="transcript") console.log(m.text); };
6 · WebSocket — subscribe /ws/subscribe
wss://mcustt.manager.click/ws/subscribe?token=SUBSCRIBE_TOKEN&lastSeq=N
Read-only transcript stream for any consumer (a second screen, your own UI, logging). Receives the same transcript / language / session-end frames as the ingest socket. With lastSeq=N, the server first replays persisted finals after N (to resume after a disconnect). Sending any binary frame closes the socket (1003).
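To use lastSeq on reconnect, a subscriber has to remember the highest final seq it has seen. The following is a sketch under that reading of the parameter; `subscribe_url` and `ResumeCursor` are hypothetical helper names.

```python
from typing import Optional
from urllib.parse import urlencode

SUBSCRIBE_ENDPOINT = "wss://mcustt.manager.click/ws/subscribe"


def subscribe_url(token: str, last_seq: Optional[int] = None) -> str:
    """Build the subscribe URL, adding lastSeq only when resuming."""
    params = {"token": token}
    if last_seq is not None:
        params["lastSeq"] = str(last_seq)
    return SUBSCRIBE_ENDPOINT + "?" + urlencode(params)


class ResumeCursor:
    """Tracks the highest seq that arrived with isFinal:true."""

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None

    def observe(self, msg: dict) -> None:
        """Call with every decoded frame; interims are ignored."""
        if msg.get("type") == "transcript" and msg.get("isFinal"):
            seq = msg["seq"]
            if self.last_seq is None or seq > self.last_seq:
                self.last_seq = seq
```

After a drop, reconnect with `subscribe_url(token, cursor.last_seq)` so only the finals missed during the gap are replayed.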
7 · SSE — caption feed
https://mcustt.manager.click/api/sessions/:id/stream?t=EMBED_TOKEN
Server-Sent Events stream powering the iframe. Events: ready, transcript, language, session-end. Supports Last-Event-ID for replay and sends a keepalive comment every 15 s.
event: transcript
id: 41
data: {"seq":41,"text":"So the proposal is approved.","isFinal":true,"lang":"en-US","ts":171…}
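Outside a browser there is no built-in EventSource, so a consumer has to split the stream on blank lines and decode each event itself. A minimal sketch of parsing one such event block follows; `parse_sse_event` is a hypothetical helper, and it assumes the data field always carries JSON as in the example above.

```python
import json


def parse_sse_event(block: str) -> dict:
    """Parse one SSE event (the lines between two blank lines)."""
    event = {"event": "message", "id": None, "data": None}
    data_lines = []
    for line in block.splitlines():
        if not line or line.startswith(":"):
            continue  # blank line or keepalive comment
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # SSE strips a single leading space
        if field == "data":
            data_lines.append(value)
        elif field in ("event", "id"):
            event[field] = value
    if data_lines:
        event["data"] = json.loads("\n".join(data_lines))
    return event
```

The id of the last event handled is the value to send back as Last-Event-ID when reconnecting.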
8 · Iframe embed
Drop live captions into any page. The embed token rides in the URL #fragment so it never reaches server logs.
<iframe src="https://mcustt.manager.click/embed/SESSION_ID#t=EMBED_TOKEN"
style="width:100%;height:320px;border:0;border-radius:12px">
</iframe>
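If the page is rendered server-side, the iframe tag can be generated from the session id and embed token returned at session creation. This is a sketch with hypothetical helper names (`embed_url`, `iframe_html`); it also shows that the token sits in the fragment, which browsers do not send to the server.

```python
from html import escape
from urllib.parse import quote, urlsplit

EMBED_BASE = "https://mcustt.manager.click/embed/"


def embed_url(session_id: str, embed_token: str) -> str:
    """Embed URL with the token carried in the #fragment."""
    return f"{EMBED_BASE}{quote(session_id, safe='')}#t={embed_token}"


def iframe_html(session_id: str, embed_token: str, height_px: int = 320) -> str:
    """Render the iframe tag; the URL is HTML-escaped for attribute use."""
    src = escape(embed_url(session_id, embed_token), quote=True)
    style = f"width:100%;height:{height_px}px;border:0;border-radius:12px"
    return f'<iframe src="{src}" style="{style}"></iframe>'


# The token is in the fragment, not in the path or query string.
parts = urlsplit(embed_url("B4FB9SPaYH", "EMBED_TOKEN"))
assert parts.query == "" and parts.fragment == "t=EMBED_TOKEN"
```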
9 · REST — sessions
POST /api/sessions
Auth: x-tenant-key. Body (all optional): {"title":string,"sourceLang":"en-US","autoDetect":true}. Returns 201 with ids, tokens, and URLs (see Quickstart).
/api/sessions/:id
Auth: tenant key or embed/subscribe token. Returns state:
{ "sessionId":"…","status":"live","sourceLang":"en-US","autoDetect":true,
"detectedLang":"en-US","detectedLangName":"English (US)","lineCount":412,
"ingestConnected":true,"subscribers":2,"summaryStatus":"pending" }
/api/sessions/:id/embed-token
Auth: tenant key or {"ownerSecret":"…"}. Mints fresh read-only embedToken + subscribeToken (+ URLs).
POST /api/sessions/:id/ingest-token
Auth: {"ownerSecret":"…"}. Rotates the single-use ingest token to reconnect a producer.
/api/sessions/:id
Auth: tenant key or {"ownerSecret":"…"}. Ends the session (triggers summary). Transcript + summary are retained.
10 · REST — transcript & subtitles
/api/sessions/:id/transcript?format=json|srt|vtt|txt
Auth: tenant key or embed/subscribe token. Returns the full transcript in the chosen format. json returns {sessionId,lineCount,lines:[{seq,lang,text,ts}]}; srt/vtt are standard subtitle files with cue timings derived from line timestamps; txt is plain text.
11 · REST — summaries
GET /api/sessions/:id/summary
Auth: tenant key or token. 200 when ready, 202 {status:"pending|failed"} while generating.
{ "status":"ready","model":"gpt-4o-mini","lang":"en-US",
"content":{
"tldr":"…2-4 sentence summary…",
"decisions":["Ship billing page Friday"],
"actionItems":[{"owner":"Priya","task":"Write the migration"}],
"qa":["Q: Need a feature flag? A: Yes"]
} }
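Since the summary endpoint answers 202 until the summary is ready, a client typically polls it. The sketch below keeps the HTTP call abstract: `fetch` is a caller-supplied function returning `(status_code, json_body)`, and the attempt count and delay are arbitrary assumptions, not documented limits.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_summary(fetch: Callable[[], Tuple[int, dict]],
                     attempts: int = 30,
                     delay_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Poll until the summary is ready; return its content, or None on timeout."""
    for _ in range(attempts):
        status, body = fetch()
        if status == 200:
            return body["content"]
        if status != 202:
            raise RuntimeError(f"unexpected HTTP {status}")
        if body.get("status") == "failed":
            raise RuntimeError("summary generation failed")
        sleep(delay_s)  # still pending; wait before the next poll
    return None
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.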
/api/sessions/:id/summarize
Auth: tenant key or owner secret. Force (re)generation; optional window {"fromSeq":N,"toSeq":M}.
12 · REST — batch & async jobs
POST /api/transcribe-summary
Auth: x-tenant-key. Summarize subtitles you already have (no audio). Body, in either form:
{ "subtitles":[{"text":"…","ts":171…,"speaker":"A"}, …],
"sourceLang":"en-US", "want":["summary","transcript"] }
// or simply: { "text":"line 1\nline 2\n…", "want":["summary"] }
Small inputs return the result inline:
{ "language":"en-US", "transcript":"…", "summary":{ "tldr":…, "decisions":[…], … } }
Large inputs (over 12000 chars) run asynchronously:
// 202 Accepted
{ "jobId":"job_8Kd…", "status":"queued", "poll":"/api/jobs/job_8Kd…" }
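A client cannot know in advance whether a batch request will be answered inline or queued, so it should handle both shapes. The following sketch builds the request with an Idempotency-Key header (which this API uses to deduplicate retried requests) and classifies the response; the helper names and the choice of a UUID as the key are assumptions.

```python
import json
import urllib.request
import uuid
from typing import Optional, Tuple

BATCH_URL = "https://mcustt.manager.click/api/transcribe-summary"


def batch_request(text: str, tenant_key: str,
                  idempotency_key: Optional[str] = None
                  ) -> Tuple[urllib.request.Request, str]:
    """Build (but do not send) a batch request; reuse the returned key on retries."""
    key = idempotency_key or str(uuid.uuid4())
    body = json.dumps({"text": text, "want": ["summary"]}).encode("utf-8")
    req = urllib.request.Request(
        BATCH_URL, data=body, method="POST",
        headers={"x-tenant-key": tenant_key,
                 "Content-Type": "application/json",
                 "Idempotency-Key": key})
    return req, key


def classify_response(status: int, body: dict) -> Tuple[str, object]:
    """Return ("job", poll_path) for a queued job, else ("inline", body)."""
    if status == 202 and "jobId" in body:
        return "job", body["poll"]
    return "inline", body
```

On a network error, call `batch_request` again with the same key so the retry maps to the same job.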
Send an Idempotency-Key header so a retried request returns the same job instead of starting a duplicate (and double-billing).
/api/jobs/:id
Auth: x-tenant-key. Poll an async job.
{ "jobId":"job_8Kd…", "status":"queued|running|ready|failed",
"result":{ "language":"en-US", "summary":{…} } | null, "error":null }
13 · Admin — provision a tenant
POST /api/admin/tenants
Operator-only (admin bearer token). Creates a tenant and returns its API key once.
curl -X POST https://mcustt.manager.click/api/admin/tenants \
-H "Authorization: Bearer $ADMIN_TOKEN" -H "content-type: application/json" \
-d '{"name":"Acme","maxSessions":50,"dailyMinutesCap":6000,
"brand":{"brandName":"Acme Live","accentColor":"#0b5","logoUrl":"https://…/logo.png"}}'
# -> { "tenantId":"…", "name":"Acme", "apiKey":"sk_mcustt_…", "note":"shown once" }
14 · Health & metrics
/healthz liveness
/readyz readiness: db + STT + summarizer reachable
/metrics Prometheus (admin bearer token)
15 · Errors & close codes
All REST errors use one envelope: {"error":{"code":"…","message":"…"}}.
| HTTP | When |
|---|---|
| 400 BAD_REQUEST | Malformed body / bad format |
| 401 | Missing/invalid tenant key, token, or owner secret |
| 404 | Unknown session / job |
| 429 | Rate-limited, or tenant ingest/session cap reached |
| 503 | Capacity / daily cap reached (Retry-After) or summarizer unconfigured |
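Of the statuses above, 429 and 503 are the ones a client would normally retry. A minimal sketch of that decision follows: it honors Retry-After when the header is present and otherwise falls back to exponential backoff. The fallback policy, its constants, and the helper names are assumptions. A 503 caused by an unconfigured summarizer will not clear on retry, so cap the number of attempts.

```python
from typing import Mapping, Optional

RETRYABLE = {429, 503}


def retry_delay(status: int, headers: Mapping[str, str], attempt: int,
                base_s: float = 1.0, cap_s: float = 60.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the error is not retryable."""
    if status not in RETRYABLE:
        return None
    # Header names are case-insensitive, so compare in lowercase.
    retry_after = next(
        (v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)
    return min(cap_s, base_s * 2 ** attempt)  # assumed fallback: exponential backoff


def error_code(body: dict) -> Optional[str]:
    """Extract error.code from the REST error envelope, if present."""
    return (body.get("error") or {}).get("code")
```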
WebSocket close codes
| Code | Reason |
|---|---|
| 1003 | BAD_REQUEST / INVALID_AUDIO_FORMAT / binary on a subscribe socket |
| 1009 | FRAME_TOO_LARGE (>1 MB) / control >8 KB |
| 1000 | Normal — STOP or idle timeout |
| upgrade 401/403/404/409/429 | token rejected / origin blocked / no session / already-connected / rate-limited |
16 · Rate limits & caps
| Limit | Value |
|---|---|
| Create session | 60 / min per IP |
| Summarize / batch | 20–30 / min per IP |
| WS upgrades | 120 / min per IP |
| Concurrent ingest | 20 per tenant |
| Ingest idle timeout | 120 s with no audio → disconnect |
| Daily audio | per-tenant dailyMinutesCap (sessions self-end over cap) |
| Retention | ended sessions pruned after 30 days |
17 · Code examples
Node.js — MCU producer (ingest)
import WebSocket from "ws";
const ws = new WebSocket("wss://mcustt.manager.click/ws/stt", { headers: { "x-tenant-key": process.env.KEY } });
ws.on("message", (buf) => {
  const m = JSON.parse(buf.toString());
  if (m.type === "ready") console.log("session", m.sessionId, "embed:", m.embedUrl);
  if (m.type === "transcript" && m.isFinal) console.log("[" + m.lang + "]", m.text);
});
// feed your mixed-call PCM16 16k mono frames:
function onPcmFrame(buf) { if (ws.readyState === 1) ws.send(buf); } // 20–40 ms each
// at end of meeting:
function endMeeting() { ws.send(JSON.stringify({ type: "stop" })); }
Python — producer (ingest)
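import asyncio, json, os
import websockets  # pip install websockets; older releases name the header argument extra_headers

KEY = os.environ["KEY"]  # tenant API key; keep it server-side

async def main(pcm_chunks):  # iterable of bytes, PCM16 16k mono
    async with websockets.connect("wss://mcustt.manager.click/ws/stt",
                                  additional_headers={"x-tenant-key": KEY}) as ws:
        async def reader():
            # print final transcript lines until the server closes the socket
            async for msg in ws:
                m = json.loads(msg)
                if m.get("type") == "transcript" and m.get("isFinal"):
                    print(m["lang"], m["text"])
        reader_task = asyncio.create_task(reader())
        for chunk in pcm_chunks:
            await ws.send(chunk)
        await ws.send(json.dumps({"type": "stop"}))
        await reader_task  # wait for the remaining finals before leaving the block

asyncio.run(main(my_chunks))  # my_chunks: your own source of PCM16 frames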
Browser — subscribe to live transcripts
const ws = new WebSocket("wss://mcustt.manager.click/ws/subscribe?token=" + SUBSCRIBE_TOKEN);
ws.onmessage = (e) => {
  const m = JSON.parse(e.data);
  if (m.type === "transcript") render(m.text, m.isFinal, m.seq);
};
Try everything interactively in the live test harness.