Transcription

Transcription is Cortex’s core capability. A headless bot joins an ODIN Voice room, receives a separate audio stream for each speaker, detects when someone is talking, transcribes each utterance, and posts it back as a message — attributed to the right person, in real time.

How it works

The key detail: the bot receives individual per-peer audio, not a mixed stream. That’s what makes attribution accurate — every message is tied to the participant who spoke it.
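To illustrate why per-peer audio makes attribution straightforward, here is a minimal sketch. It is not Cortex's implementation: it assumes audio arrives as frames of samples in the range [-1, 1], uses a naive energy threshold in place of real voice activity detection, and every name in it (`PeerSegmenter`, `pushFrame`, the two constants) is hypothetical. Because each peer has its own buffer, a finished utterance already carries the ID of the person who spoke it.

```javascript
// Hypothetical sketch: segment utterances per peer with a naive energy threshold.
// Real voice activity detection is more robust; all names here are illustrative.

const SPEECH_THRESHOLD = 0.02; // assumed RMS level that counts as speech
const SILENCE_FRAMES_TO_END = 3; // assumed count of quiet frames that closes an utterance

// Root-mean-square energy of one frame of samples in the range [-1, 1].
function rms(frame) {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.sqrt(sum / frame.length);
}

class PeerSegmenter {
  constructor(onUtterance) {
    this.onUtterance = onUtterance; // called with { peerId, frames }
    this.state = new Map(); // peerId -> { frames, quiet }
  }

  // Feed one audio frame that belongs to exactly one peer.
  pushFrame(peerId, frame) {
    let st = this.state.get(peerId);
    if (!st) {
      st = { frames: [], quiet: 0 };
      this.state.set(peerId, st);
    }
    if (rms(frame) >= SPEECH_THRESHOLD) {
      st.frames.push(frame);
      st.quiet = 0;
    } else if (st.frames.length > 0) {
      st.quiet += 1;
      if (st.quiet >= SILENCE_FRAMES_TO_END) {
        // Utterance finished: it is already tied to this peer.
        this.onUtterance({ peerId, frames: st.frames });
        st.frames = [];
        st.quiet = 0;
      }
    }
  }
}
```

With a mixed stream, the same logic would need a separate speaker-identification step before any utterance could be attributed.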

Want to build your own bot instead?

Cortex runs and scales this bot for you. If you'd rather build and host your own, the ODIN Voice Node.js SDK gives you the same per-speaker audio access Cortex relies on — see the Transcribing Audio sample. Cortex is the managed version of that approach.

Sessions and messages

  • A session is the active transcription of one room. Create it with the externalRoomId of the ODIN room you want to transcribe, then call start() to attach the bot. See Sessions & Gatherings.
  • Each transcribed utterance becomes a message on that session, attributed to a participant.

const session = await project.sessions.create({
  title: "Support call #1021",
  externalRoomId: "support-1021",
  idleTimeout: 90, // optional: auto-end after 90s of silence
});
await session.start();

// ...conversation happens...

const messages = await session.getMessages();
await session.stop();
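
Once you have the messages, a common next step is turning them into a readable transcript. The sketch below assumes each message exposes `participantName`, `text`, and an ISO 8601 `createdAt`. These field names are hypothetical and not confirmed by this page, so adjust them to the actual message shape. `formatTranscript` is likewise an illustrative helper, not part of the SDK.

```javascript
// Hypothetical message shape: { participantName, text, createdAt (ISO 8601 string) }.
// The real fields returned by getMessages() may differ; adjust the accessors.

function formatTranscript(messages) {
  return [...messages] // copy so the caller's array is not reordered
    .sort((a, b) => Date.parse(a.createdAt) - Date.parse(b.createdAt))
    .map((m) => {
      // HH:MM:SS in UTC, taken from the ISO timestamp.
      const time = new Date(m.createdAt).toISOString().slice(11, 19);
      return `[${time}] ${m.participantName}: ${m.text}`;
    })
    .join("\n");
}
```

If the assumed shape holds, `formatTranscript(messages)` yields one line per utterance in chronological order.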

To process messages as they happen rather than after the fact, use watchMessages() or subscribe to the message.created event.
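
A minimal sketch of a live consumer follows. It assumes the message source is an async iterable. This page does not confirm whether watchMessages() returns one, so treat that as an assumption and check the SDK reference. `forwardMessages` and its `maxMessages` option are hypothetical helpers for illustration.

```javascript
// Assumption: `source` is an async iterable of message objects.
// Whether watchMessages() actually returns one is not confirmed by this page.

async function forwardMessages(source, handle, { maxMessages = Infinity } = {}) {
  let count = 0;
  for await (const message of source) {
    try {
      await handle(message);
    } catch (err) {
      // Keep consuming; one failing handler call should not end the stream.
      console.error("handler failed:", err);
    }
    count += 1;
    if (count >= maxMessages) break;
  }
  return count; // number of messages consumed
}
```

If the assumption holds, usage would look like `await forwardMessages(session.watchMessages(), (m) => console.log(m.text))`. Catching handler errors inside the loop keeps a single bad message from stopping live processing.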

Summaries

A session can produce an AI summary of the conversation (recap, action items, a shareable transcript). You can regenerate it on demand:

await session.regenerateSummary();

Summaries are also available through session annotations and the session.annotation.created event — handy for emailing a recap when a call ends (see the serverless function guide).

Transcription modes

Cortex supports two speech-to-text backends, configured per deployment:

Mode            | Engine         | Notes
----------------|----------------|------
Cloud (default) | OpenAI Whisper | High accuracy across many languages.
Local           | On-box STT     | No external network hop; useful for data-residency requirements.

For most projects the default cloud mode is exactly what you want, and there’s nothing to configure.

Debug audio (optional)

For troubleshooting, a project can enable debug audio, which retains the raw audio for each transcribed message for a short window (24 hours) so you can play back exactly what was heard. It’s off by default because it stores voice data, and is enabled per project in settings.
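
Because the audio is only kept for 24 hours, a troubleshooting tool may want to check whether playback is still worth attempting. The sketch below assumes the window starts at the message's creation time, which this page does not state explicitly. `isDebugAudioLikelyAvailable` is a hypothetical helper, not an SDK call.

```javascript
const DEBUG_AUDIO_RETENTION_MS = 24 * 60 * 60 * 1000; // 24 hours, per the text above

// Returns true if audio for a message created at `createdAt` would still fall
// inside the retention window at `now`.
// Assumption: the window is measured from message creation.
function isDebugAudioLikelyAvailable(createdAt, now = new Date()) {
  const ageMs = now.getTime() - new Date(createdAt).getTime();
  return ageMs >= 0 && ageMs < DEBUG_AUDIO_RETENTION_MS;
}
```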

Handle voice data responsibly

Transcripts — and especially debug audio — can contain personal data. Enable debug audio only when you need it, and make sure your privacy notices cover transcription.

Identifying speakers

The bot maps each audio stream to a participant using the externalUserId your client passed to ODIN when joining the room. Make sure your clients pass a stable, meaningful userId so transcripts attribute correctly. See Users & Participants.
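
A stable ID pays off when you join transcripts against your own user records. The sketch below assumes you keep a directory keyed by the same ID your clients pass when joining. The directory, its `displayName` field, and `resolveSpeaker` are all hypothetical.

```javascript
// Hypothetical directory: your own user records, keyed by the stable ID
// your clients pass to ODIN when joining the room.
const userDirectory = new Map([
  ["user-42", { displayName: "Dana" }],
  ["user-77", { displayName: "Lee" }],
]);

// Resolve a display name for a transcript line, with a visible fallback
// so unmapped IDs are easy to spot.
function resolveSpeaker(externalUserId, directory) {
  const user = directory.get(externalUserId);
  return user ? user.displayName : `Unknown speaker (${externalUserId})`;
}
```

If clients joined with random or per-connection IDs, every lookup would hit the fallback, which is why the ID needs to be stable across sessions.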