MCU Speech-to-Text API

Real-time transcription for an MCU / conferencing server. Stream PCM16 audio over a WebSocket; get live transcripts, an embeddable caption iframe, a subscriber stream, and an automatic meeting summary.

Base URL https://mcustt.manager.click
WebSocket wss://mcustt.manager.click
PCM16 · 16 kHz · mono · auto language detect · live transcripts · iframe embed · meeting summary

1 · Overview

The flow: your MCU opens a WebSocket and streams raw audio → the service runs streaming speech-to-text (Google STT v2 / Chirp 2) → transcript events are returned on the same socket and fanned out to (a) any read-only subscribers, and (b) a branded SSE caption page you can iframe. Every meeting gets an auto-generated sessionId. When the meeting ends, a summary is generated from the transcript. You can also batch-summarize subtitles you already have, with no audio.

Capability	How
Live transcription	Stream PCM16 to `/ws/stt`; read `transcript` frames back
Auto language detection	Open a session with `autoDetect:true` (default); a `language` event reports the detected BCP-47 code
Live captions in a webpage	Drop the `/embed/:id` iframe into any page
Programmatic transcript stream	Connect a read-only WebSocket to `/ws/subscribe`
Post-meeting summary	Auto on session end; fetch via `GET /api/sessions/:id/summary`
Batch summarize existing subtitles	`POST /api/transcribe-summary`

2 · Authentication

Four credential types. Only the tenant API key is long-lived and secret — keep it server-side, never in a browser.

Credential	Used for	Where	Lifetime
Tenant API key `sk_mcustt_…`	Create sessions, batch, poll jobs	Header `x-tenant-key` (server-to-server)	Until rotated
Ingest token (JWT)	Authorize the audio producer	`/ws/stt?token=…`	Single-use, ~12 h
Embed / Subscribe token (JWT)	Read-only caption access	`?t=` (SSE), `?token=` (WS), or URL `#fragment` (iframe)	~8 h, read-only
Owner secret	Rotate ingest token, end session	JSON body `ownerSecret`	Session lifetime

Get a tenant API key from your operator (it's created via the admin endpoint and shown only once). The ingest / embed / subscribe tokens and owner secret are all returned to you when you create a session.

3 · Quickstart

Option A — explicit (recommended for server integrations)

# 1) Create a session
curl -X POST https://mcustt.manager.click/api/sessions \
  -H "x-tenant-key: $KEY" -H "content-type: application/json" \
  -d '{"autoDetect":true,"title":"Board call"}'

# Response:
# {
#   "sessionId":"B4FB9SPaYH",
#   "ingestToken":"eyJ…",   "ingestUrl":"wss://mcustt.manager.click/ws/stt?token=eyJ…",
#   "embedToken":"eyJ…",    "embedUrl":"https://mcustt.manager.click/embed/B4FB9SPaYH#t=eyJ…",
#   "subscribeToken":"eyJ…","subscribeUrl":"wss://mcustt.manager.click/ws/subscribe?token=eyJ…",
#   "ownerSecret":"…", "brand":{…}
# }

# 2) Open ingestUrl as a WebSocket and send raw PCM16 binary frames.
# 3) Read JSON {type:"transcript",…} frames back on the same socket.
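
The same create-session call could be made from Python roughly as follows. This is a minimal sketch using only the standard library: `build_create_session_request` is a hypothetical helper name and `sk_mcustt_example` is a placeholder key, not a real credential. The request is only built here; the commented lines show how it might be sent.

```python
import json
import urllib.request

BASE_URL = "https://mcustt.manager.click"

def build_create_session_request(tenant_key, title=None, auto_detect=True):
    """Build (but do not send) the POST /api/sessions request (hypothetical helper)."""
    body = {"autoDetect": auto_detect}
    if title is not None:
        body["title"] = title
    return urllib.request.Request(
        BASE_URL + "/api/sessions",
        data=json.dumps(body).encode("utf-8"),
        headers={"x-tenant-key": tenant_key, "content-type": "application/json"},
        method="POST",
    )

# Placeholder key for illustration only; keep the real one server-side.
req = build_create_session_request("sk_mcustt_example", title="Board call")

# To actually create the session:
#   with urllib.request.urlopen(req) as resp:
#       session = json.load(resp)  # sessionId, ingestUrl, embedUrl, subscribeUrl, ownerSecret, ...
```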

Option B — zero-ceremony

Skip step 1: open wss://mcustt.manager.click/ws/stt directly with an x-tenant-key header. The first message back is {type:"ready", sessionId, embedUrl, subscribeUrl, ownerSecret}. Optional query: ?sourceLang=en-US&autoDetect=false.
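
As an illustration of the optional query parameters, the zero-ceremony URL might be assembled like this. `zero_ceremony_url` is a hypothetical helper; only the `sourceLang` and `autoDetect` parameter names come from the text above.

```python
from urllib.parse import urlencode

WS_BASE = "wss://mcustt.manager.click"

def zero_ceremony_url(source_lang=None, auto_detect=None):
    """Build the /ws/stt URL, adding only the query parameters that were given."""
    params = {}
    if source_lang is not None:
        params["sourceLang"] = source_lang
    if auto_detect is not None:
        # The service example uses lowercase true/false in the query string.
        params["autoDetect"] = "true" if auto_detect else "false"
    query = urlencode(params)
    return WS_BASE + "/ws/stt" + ("?" + query if query else "")

url = zero_ceremony_url(source_lang="en-US", auto_detect=False)
```

The tenant key still travels in the `x-tenant-key` header of the WebSocket handshake, never in the URL.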

4 · Audio frame contract

Property	Value
Encoding	Raw PCM signed 16-bit (LINEAR16), no WAV header
Sample rate	16 000 Hz
Channels	1 (mono)
Byte order	Little-endian
Frame size	20–100 ms per binary message (640–3200 bytes). 20 ms (640 B) is ideal.
Per-frame cap	1 MB (oversized → close `1009`)
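
The byte counts in the table follow from sample rate × duration × 2 bytes per sample. A small sketch of that arithmetic, plus one way to cut a PCM buffer into fixed-duration frames (the function names are illustrative, not part of the API):

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # signed 16-bit
CHANNELS = 1          # mono

def frame_bytes(duration_ms):
    """Size in bytes of one frame of the given duration."""
    return SAMPLE_RATE_HZ * duration_ms // 1000 * BYTES_PER_SAMPLE * CHANNELS

def split_into_frames(pcm, duration_ms=20):
    """Yield consecutive frames; the last one may be shorter than the rest."""
    size = frame_bytes(duration_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]

one_second = bytes(frame_bytes(1000))          # one second of silence
frames = list(split_into_frames(one_second))   # 20 ms frames by default
```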

Send raw PCM only. If you accidentally send a WAV header (RIFF…) or an odd-length frame, the socket closes with 1003 INVALID_AUDIO_FORMAT. If your source is 48 kHz / stereo / Opus, downmix + resample to 16 kHz mono PCM16 before sending.
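
A producer can catch both rejection cases before sending, and convert 48 kHz stereo input itself. The sketch below is deliberately crude: it averages left and right, then averages every three mono samples as a rough box-filter decimation. A real integration would more likely use a proper resampler. Both function names are hypothetical, the conversion assumes interleaved little-endian 16-bit input, and `array("h")` is assumed to be 2 bytes wide, as on mainstream platforms.

```python
import sys
from array import array

def validate_frame(frame):
    """Return None if the frame looks sendable, else a short reason string."""
    if len(frame) % 2 != 0:
        return "odd-length frame"
    if frame[:4] == b"RIFF":
        return "WAV header present"
    return None

def downmix_48k_stereo_to_16k_mono(pcm):
    """Crude 48 kHz stereo -> 16 kHz mono conversion (assumption: interleaved L/R input)."""
    samples = array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian
    # Average each left/right pair into one mono sample.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples) - 1, 2)]
    # 48 kHz / 3 = 16 kHz: average each group of three mono samples.
    out = array("h", (sum(mono[i:i + 3]) // 3 for i in range(0, len(mono) - 2, 3)))
    if sys.byteorder == "big":
        out.byteswap()      # output must be little-endian
    return out.tobytes()

# Six stereo sample pairs (L=300, R=100) become two mono samples of 200.
stereo = ((300).to_bytes(2, "little", signed=True)
          + (100).to_bytes(2, "little", signed=True)) * 6
mono_16k = downmix_48k_stereo_to_16k_mono(stereo)
```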

5 · WebSocket — ingest `/ws/stt`

WS wss://mcustt.manager.click/ws/stt?token=INGEST_TOKEN

Auth: ingest token in the query, or an x-tenant-key header (zero-ceremony). Bidirectional: you send audio + control; the server sends transcripts + control.

Client → Server

Message	Meaning
binary frame	One PCM16 audio frame (see §4)
`{"type":"ping","ts":123}`	Keepalive; server replies `pong`
`{"type":"stop"}`	End of meeting → finalizes, ends the session, triggers the summary

Server → Client

Message	Meaning
`{"type":"ready","sessionId":"…"}`	Stream is open. (Zero-ceremony also returns `embedUrl`, `subscribeUrl`, `ownerSecret`.)
`{"type":"transcript","text":"…","isFinal":false,"seq":41,"lang":"en-US","ts":171…}`	Interim (being-written) line — `seq` is provisional and updates in place
`{"type":"transcript","text":"…","isFinal":true,"seq":41,"lang":"en-US","ts":171…}`	Final line for `seq` (persisted)
`{"type":"language","code":"hi-IN"}`	Auto-detected language changed
`{"type":"session-end"}`	Session ended
`{"type":"error","code":"…","message":"…","fatal":true}`	Error; `fatal` means the socket will close
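
One way a consumer might apply these frames: keep finals keyed by `seq`, and hold a single interim line that is overwritten until its final arrives. `TranscriptView` is a hypothetical class; treating a fatal error as the end of the stream is an assumption of this sketch.

```python
import json

class TranscriptView:
    """Accumulates server frames into a displayable transcript (illustrative only)."""

    def __init__(self):
        self.finals = {}      # seq -> final text
        self.interim = None   # (seq, text) of the line still being written
        self.language = None
        self.ended = False

    def handle(self, raw):
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "transcript":
            if msg["isFinal"]:
                self.finals[msg["seq"]] = msg["text"]
                if self.interim and self.interim[0] == msg["seq"]:
                    self.interim = None  # the final replaces the interim line
            else:
                self.interim = (msg["seq"], msg["text"])  # update in place
        elif kind == "language":
            self.language = msg["code"]
        elif kind == "session-end" or (kind == "error" and msg.get("fatal")):
            self.ended = True
        return kind

    def text(self):
        lines = [self.finals[seq] for seq in sorted(self.finals)]
        if self.interim:
            lines.append(self.interim[1])
        return "\n".join(lines)

view = TranscriptView()
view.handle('{"type":"transcript","text":"So the","isFinal":false,"seq":41,"lang":"en-US"}')
view.handle('{"type":"transcript","text":"So the proposal is approved.","isFinal":true,"seq":41,"lang":"en-US"}')
view.handle('{"type":"transcript","text":"Next","isFinal":false,"seq":42,"lang":"en-US"}')
view.handle('{"type":"language","code":"hi-IN"}')
```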

Reconnecting

The ingest token is single-use (burned on connect). To reconnect a dropped producer, mint a fresh one with the owner secret via POST /api/sessions/:id/ingest-token, then reopen the socket.
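
The rotation request might be built as below. This is a sketch only: `build_ingest_token_request` is a hypothetical helper, and because the response shape is not documented here, the code stops at constructing the request.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://mcustt.manager.click"

def build_ingest_token_request(session_id, owner_secret):
    """Build (but do not send) POST /api/sessions/:id/ingest-token."""
    return urllib.request.Request(
        f"{BASE_URL}/api/sessions/{quote(session_id, safe='')}/ingest-token",
        data=json.dumps({"ownerSecret": owner_secret}).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )

# Placeholder values for illustration.
rotate = build_ingest_token_request("B4FB9SPaYH", "owner-secret-placeholder")
```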

5b · WebRTC ingest (`/api/webrtc`, WHIP)

POST /api/webrtc?token=INGEST_TOKEN · Content-Type: application/sdp

An alternative to raw-WS PCM when the producer is a browser or mobile app: send a WebRTC audio track (Opus) and the server decodes it to 16 kHz mono internally — your client doesn't resample. Uses WHIP (WebRTC-HTTP Ingestion): POST your SDP offer, get the SDP answer back. Auth: ingest token (?token=) or x-tenant-key header.

Because WHIP is ingest-only, transcripts come back on a separate channel: the response carries X-Session-Id and X-Subscribe-Token headers — open /ws/subscribe (§6) or the SSE feed (§7) with that token to receive transcript events. DELETE /api/webrtc/:resourceId (from the Location header) hangs up.

The server has a public IP, so no TURN server is needed. Your browser client should still use a STUN server (e.g. stun:stun.l.google.com:19302) to gather its own candidates.

// Browser: send mic audio over WebRTC, read transcripts over WS
const pc = new RTCPeerConnection({ iceServers:[{urls:"stun:stun.l.google.com:19302"}] });
const mic = await navigator.mediaDevices.getUserMedia({ audio:true });
mic.getTracks().forEach(t => pc.addTrack(t, mic));
await pc.setLocalDescription(await pc.createOffer());
// Wait until ICE gathering finishes so the offer carries every candidate.
await new Promise(r => {
  if (pc.iceGatheringState === "complete") return r();   // already done: don't wait forever
  pc.addEventListener("icegatheringstatechange",
    () => pc.iceGatheringState === "complete" && r());
});
const res = await fetch("https://mcustt.manager.click/api/webrtc?token=" + INGEST_TOKEN, {
  method:"POST", headers:{ "content-type":"application/sdp" }, body: pc.localDescription.sdp });
if (!res.ok) throw new Error("WHIP offer rejected: HTTP " + res.status);
await pc.setRemoteDescription({ type:"answer", sdp: await res.text() });
// Transcripts come back on the subscribe socket, using the token from the response header.
const sub = new WebSocket("wss://mcustt.manager.click/ws/subscribe?token=" + res.headers.get("x-subscribe-token"));
sub.onmessage = e => { const m = JSON.parse(e.data); if (m.type==="transcript") console.log(m.text); };

6 · WebSocket — subscribe `/ws/subscribe`

WS wss://mcustt.manager.click/ws/subscribe?token=SUBSCRIBE_TOKEN&lastSeq=N

Read-only transcript stream for any consumer (a second screen, your own UI, logging). Receives the same transcript / language / session-end frames as the ingest socket. lastSeq=N replays persisted finals after N first (resume after a disconnect). Sending any binary frame closes the socket (1003).
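
To resume after a disconnect, a subscriber needs to remember the highest final `seq` it has seen and pass it as `lastSeq` when reconnecting. A minimal sketch, with `ResumeCursor` as a hypothetical name:

```python
from urllib.parse import urlencode

class ResumeCursor:
    """Tracks the last final seq and builds the matching reconnect URL."""

    def __init__(self, subscribe_token):
        self.token = subscribe_token
        self.last_final_seq = None

    def observe(self, msg):
        # Only finals are persisted, so only finals advance the cursor.
        if msg.get("type") == "transcript" and msg.get("isFinal"):
            if self.last_final_seq is None or msg["seq"] > self.last_final_seq:
                self.last_final_seq = msg["seq"]

    def url(self):
        params = {"token": self.token}
        if self.last_final_seq is not None:
            params["lastSeq"] = self.last_final_seq
        return "wss://mcustt.manager.click/ws/subscribe?" + urlencode(params)

cursor = ResumeCursor("SUBSCRIBE_TOKEN")
first_url = cursor.url()
cursor.observe({"type": "transcript", "isFinal": True, "seq": 41, "text": "..."})
cursor.observe({"type": "transcript", "isFinal": False, "seq": 42, "text": "..."})
resume_url = cursor.url()
```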

7 · SSE — caption feed

SSE https://mcustt.manager.click/api/sessions/:id/stream?t=EMBED_TOKEN

Server-Sent Events stream powering the iframe. Events: ready, transcript, language, session-end. Supports Last-Event-ID for replay and sends a keepalive comment every 15 s.

event: transcript
id: 41
data: {"seq":41,"text":"So the proposal is approved.","isFinal":true,"lang":"en-US","ts":171…}
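
Outside a browser there is no built-in `EventSource`, so a consumer has to parse the stream itself. The sketch below handles the fields shown above and skips keepalive comments. It assumes every event's `data` payload is JSON, as in the example; `parse_sse` is a hypothetical helper, not a full implementation of the SSE specification.

```python
import json

def parse_sse(lines):
    """Yield (event, last_id, data) for each complete event in an iterable of text lines."""
    event, last_id, data = "message", None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":                      # blank line ends the current event
            if data:
                yield event, last_id, json.loads("\n".join(data))
            event, data = "message", []
            continue
        if line.startswith(":"):            # comment line, e.g. the 15 s keepalive
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "id":
            last_id = value                 # persists across events, for Last-Event-ID
        elif field == "data":
            data.append(value)

sample = [
    ": keepalive\n",
    "event: transcript\n",
    "id: 41\n",
    'data: {"seq":41,"text":"So the proposal is approved.","isFinal":true}\n',
    "\n",
]
events = list(parse_sse(sample))
```

On reconnect, sending the most recent `id` back in a `Last-Event-ID` request header asks the server to replay from that point.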

8 · Iframe embed

Drop live captions into any page. The embed token rides in the URL #fragment so it never reaches server logs.

<iframe src="https://mcustt.manager.click/embed/SESSION_ID#t=EMBED_TOKEN"
        style="width:100%;height:320px;border:0;border-radius:12px">
</iframe>

9 · REST — sessions

POST /api/sessions

Auth: x-tenant-key. Body (all fields optional): {"title":string,"sourceLang":"en-US","autoDetect":true}. Responds 201 with the session ID, tokens, and URLs (see Quickstart).

GET /api/sessions/:id

Auth: tenant key or embed/subscribe token. Returns state:

{ "sessionId":"…","status":"live","sourceLang":"en-US","autoDetect":true,
  "detectedLang":"en-US","detectedLangName":"English (US)","lineCount":412,
  "ingestConnected":true,"subscribers":2,"summaryStatus":"pending" }

POST /api/sessions/:id/embed-token

Auth: tenant key or {"ownerSecret":"…"}. Mints fresh read-only embedToken + subscribeToken (+ URLs).

POST /api/sessions/:id/ingest-token

Auth: {"ownerSecret":"…"}. Rotates the single-use ingest token to reconnect a producer.

DELETE /api/sessions/:id

Auth: tenant key or {"ownerSecret":"…"}. Ends the session (triggers summary). Transcript + summary are retained.

10 · REST — transcript & subtitles

GET /api/sessions/:id/transcript?format=json|srt|vtt|txt

Auth: tenant key or embed/subscribe token. Full transcript in the chosen format. json returns {sessionId,lineCount,lines:[{seq,lang,text,ts}]}; srt/vtt are standard subtitle files with cue timings derived from line timestamps; txt is plain text.

11 · REST — summaries

GET /api/sessions/:id/summary

Auth: tenant key or token. Returns 200 when the summary is ready; otherwise 202 with {status:"pending"} while it is generating, or {status:"failed"} if generation failed.

{ "status":"ready","model":"gpt-4o-mini","lang":"en-US",
  "content":{
    "tldr":"…2-4 sentence summary…",
    "decisions":["Ship billing page Friday"],
    "actionItems":[{"owner":"Priya","task":"Write the migration"}],
    "qa":["Q: Need a feature flag? A: Yes"]
  } }
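
A client might turn this response into plain text along these lines. The field names come from the example above; `render_summary` is a hypothetical helper and the output layout is just one possible choice.

```python
def render_summary(status_code, body):
    """Return plain text for a ready summary, or None if it is not ready."""
    if status_code != 200 or body.get("status") != "ready":
        return None  # 202 pending/failed: nothing to render yet
    content = body["content"]
    lines = ["TL;DR: " + content["tldr"]]
    for decision in content.get("decisions", []):
        lines.append("Decision: " + decision)
    for item in content.get("actionItems", []):
        lines.append(f"Action ({item['owner']}): {item['task']}")
    for qa in content.get("qa", []):
        lines.append(qa)
    return "\n".join(lines)

example = {
    "status": "ready", "model": "gpt-4o-mini", "lang": "en-US",
    "content": {
        "tldr": "Billing page ships Friday.",
        "decisions": ["Ship billing page Friday"],
        "actionItems": [{"owner": "Priya", "task": "Write the migration"}],
        "qa": ["Q: Need a feature flag? A: Yes"],
    },
}
rendered = render_summary(200, example)
```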

POST /api/sessions/:id/summarize

Auth: tenant key or owner secret. Force (re)generation; optional window {"fromSeq":N,"toSeq":M}.

12 · REST — batch & async jobs

POST /api/transcribe-summary

Auth: x-tenant-key. Summarize subtitles you already have (no audio). Body — either form:

{ "subtitles":[{"text":"…","ts":171…,"speaker":"A"}, …],
  "sourceLang":"en-US", "want":["summary","transcript"] }
// or simply:  { "text":"line 1\nline 2\n…", "want":["summary"] }

Small inputs return the result inline:

{ "language":"en-US", "transcript":"…", "summary":{ "tldr":…, "decisions":[…], … } }

Large inputs (over 12000 chars) run asynchronously:

// 202 Accepted
{ "jobId":"job_8Kd…", "status":"queued", "poll":"/api/jobs/job_8Kd…" }

Send an Idempotency-Key header so a retried request returns the same job instead of starting a duplicate (and double-billing).
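
A sketch of a batch request that carries an idempotency key, and of telling the inline and asynchronous responses apart. The helper names are hypothetical, and using a UUID as the key is an assumption: the text above does not specify a key format. The caller is expected to store the key and reuse it on every retry of the same request.

```python
import json
import uuid
import urllib.request

BASE_URL = "https://mcustt.manager.click"

def build_batch_request(tenant_key, text, idempotency_key, want=("summary",)):
    """Build (but do not send) POST /api/transcribe-summary for plain-text input."""
    body = {"text": text, "want": list(want)}
    return urllib.request.Request(
        BASE_URL + "/api/transcribe-summary",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-tenant-key": tenant_key,
            "content-type": "application/json",
            "Idempotency-Key": idempotency_key,
        },
        method="POST",
    )

def classify_batch_response(status_code, body):
    """Return ("async", poll_path) for a 202, else ("inline", summary)."""
    if status_code == 202:
        return "async", body["poll"]
    return "inline", body.get("summary")

key = str(uuid.uuid4())  # generate once, reuse for retries of this request
batch = build_batch_request("sk_mcustt_example", "line 1\nline 2", key)
```

Error responses should be checked before calling `classify_batch_response`, since the sketch treats anything other than 202 as an inline result.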

GET /api/jobs/:id

Auth: x-tenant-key. Poll an async job.

{ "jobId":"job_8Kd…", "status":"queued|running|ready|failed",
  "result":{ "language":"en-US", "summary":{…} } | null, "error":null }

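A polling loop over the job statuses above might look like this. The HTTP call is injected as `fetch_job` so the loop itself stays testable; the 2-second interval and 5-minute timeout are arbitrary assumptions, not documented values, and `wait_for_job` is a hypothetical name.

```python
import time

def wait_for_job(fetch_job, interval_s=2.0, timeout_s=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job is ready (return its result) or failed (raise)."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_job()  # should return the parsed GET /api/jobs/:id body
        if job["status"] == "ready":
            return job["result"]
        if job["status"] == "failed":
            raise RuntimeError(f"job {job['jobId']} failed: {job.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"job {job['jobId']} still {job['status']} after {timeout_s}s")
        sleep(interval_s)  # status is queued or running: wait and retry

# Demo with canned responses instead of real HTTP calls.
_canned = iter([
    {"jobId": "job_demo", "status": "queued", "result": None, "error": None},
    {"jobId": "job_demo", "status": "running", "result": None, "error": None},
    {"jobId": "job_demo", "status": "ready",
     "result": {"language": "en-US", "summary": {}}, "error": None},
])
result = wait_for_job(lambda: next(_canned), sleep=lambda s: None)
```
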
13 · Admin — provision a tenant

POST /api/admin/tenants

Operator-only (admin bearer token). Creates a tenant and returns its API key once.

curl -X POST https://mcustt.manager.click/api/admin/tenants \
  -H "Authorization: Bearer $ADMIN_TOKEN" -H "content-type: application/json" \
  -d '{"name":"Acme","maxSessions":50,"dailyMinutesCap":6000,
       "brand":{"brandName":"Acme Live","accentColor":"#0b5","logoUrl":"https://…/logo.png"}}'
# -> { "tenantId":"…", "name":"Acme", "apiKey":"sk_mcustt_…", "note":"shown once" }

14 · Health & metrics

GET /healthz liveness

GET /readyz readiness: db + STT + summarizer reachable

GET /metrics Prometheus (admin bearer token)

15 · Errors & close codes

All REST errors use one envelope: {"error":{"code":"…","message":"…"}}.

HTTP	When
400 BAD_REQUEST	Malformed body / bad `format`
401	Missing/invalid tenant key, token, or owner secret
404	Unknown session / job
429	Rate-limited, or tenant ingest/session cap reached
503	Capacity / daily cap reached (`Retry-After`) or summarizer unconfigured
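
A client might unpack the envelope and decide whether to retry as sketched below. The retry policy is an assumption of this example, not documented behaviour: it retries only 429 and 503, honours a numeric `Retry-After`, and otherwise backs off exponentially. A 503 caused by an unconfigured summarizer will not clear on its own, so a real client would also cap the number of attempts.

```python
def parse_error(body):
    """Extract (code, message) from the REST error envelope."""
    err = body.get("error", {})
    return err.get("code"), err.get("message")

def retry_delay(status_code, headers, attempt, base_s=1.0, cap_s=60.0):
    """Seconds to wait before retrying, or None if the request should not be retried."""
    if status_code not in (429, 503):
        return None
    retry_after = headers.get("Retry-After")  # assumes a dict with this exact key casing
    if retry_after is not None and retry_after.isdigit():
        return float(retry_after)
    # No usable Retry-After: exponential backoff (1 s, 2 s, 4 s, ...) up to the cap.
    return min(cap_s, base_s * 2 ** attempt)

code, message = parse_error({"error": {"code": "BAD_REQUEST", "message": "bad format"}})
```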

WebSocket close codes

Code	Reason
1003	BAD_REQUEST / INVALID_AUDIO_FORMAT / binary on a subscribe socket
1009	FRAME_TOO_LARGE (>1 MB) / control >8 KB
1000	Normal — STOP or idle timeout
upgrade 401/403/404/409/429	token rejected / origin blocked / no session / already-connected / rate-limited
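
The close codes can be mapped to a client decision, for example as follows. The meanings are taken from the table; the reconnect policy is an assumption of this sketch (listed codes are treated as final, anything else as a transient drop), and `describe_close` is a hypothetical name.

```python
# code -> (meaning from the table above, whether this sketch would reconnect)
CLOSE_CODES = {
    1000: ("normal close: STOP or idle timeout", False),
    1003: ("bad request, invalid audio format, or binary sent on a subscribe socket", False),
    1009: ("frame over 1 MB or control message over 8 KB", False),
}

def describe_close(code):
    """Return (meaning, should_reconnect) for a WebSocket close code."""
    # Assumption: an unlisted code (e.g. 1006, abnormal closure) is a network
    # drop worth retrying. An ingest producer needs a fresh ingest token first.
    return CLOSE_CODES.get(code, ("unlisted close code", True))

meaning, reconnect = describe_close(1003)
```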

16 · Rate limits & caps

Limit	Value
Create session	60 / min per IP
Summarize / batch	20–30 / min per IP
WS upgrades	120 / min per IP
Concurrent ingest	20 per tenant
Ingest idle timeout	120 s with no audio → disconnect
Daily audio	per-tenant `dailyMinutesCap` (sessions self-end over cap)
Retention	ended sessions pruned after 30 days
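
To stay under a per-minute limit, a client can throttle itself before the server has to return 429. Below is a minimal sliding-window sketch; the server's own limiting algorithm is not documented here, so this is only a client-side guard, and `SlidingWindowLimiter` is a hypothetical name.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allows at most max_calls within any window_s-second window."""

    def __init__(self, max_calls, window_s, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock
        self.calls = deque()  # timestamps of recent allowed calls

    def try_acquire(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window_s:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

# Example: guard session creation at 60 calls per minute.
create_session_limiter = SlidingWindowLimiter(max_calls=60, window_s=60.0)
allowed = create_session_limiter.try_acquire()
```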

17 · Code examples

Node.js — MCU producer (ingest)

import WebSocket from "ws";
const ws = new WebSocket("wss://mcustt.manager.click/ws/stt", { headers: { "x-tenant-key": process.env.KEY } });
ws.on("message", (buf) => {
  const m = JSON.parse(buf.toString());
  if (m.type === "ready")      console.log("session", m.sessionId, "embed:", m.embedUrl);
  if (m.type === "transcript" && m.isFinal) console.log("[" + m.lang + "]", m.text);
});
// feed your mixed-call PCM16 16k mono frames:
function onPcmFrame(buf) { if (ws.readyState === 1) ws.send(buf); }   // 20–40 ms each
// at end of meeting:
function endMeeting() { ws.send(JSON.stringify({ type: "stop" })); }

Python — producer (ingest)

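import asyncio, json, os
import websockets  # additional_headers is the keyword in recent releases; the legacy client used extra_headers

KEY = os.environ["KEY"]  # tenant API key, kept server-side

async def main(pcm_chunks):  # iterable of bytes, PCM16 16k mono
    async with websockets.connect("wss://mcustt.manager.click/ws/stt",
            additional_headers={"x-tenant-key": KEY}) as ws:
        async def reader():
            async for msg in ws:  # also ends when the server closes the socket
                m = json.loads(msg)
                if m.get("type") == "transcript" and m.get("isFinal"):
                    print(m["lang"], m["text"])
                elif m.get("type") == "session-end":
                    break
        reader_task = asyncio.create_task(reader())
        for chunk in pcm_chunks:
            await ws.send(chunk)
        await ws.send(json.dumps({"type": "stop"}))
        await reader_task  # stay connected until the last finals have arrived

asyncio.run(main(my_chunks))  # my_chunks: your own iterable of PCM16 frames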

Browser — subscribe to live transcripts

const ws = new WebSocket("wss://mcustt.manager.click/ws/subscribe?token=" + SUBSCRIBE_TOKEN);
ws.onmessage = (e) => {
  const m = JSON.parse(e.data);
  if (m.type === "transcript") render(m.text, m.isFinal, m.seq);
};

Try everything interactively in the live test harness.