
Transcribing Audio Streams

ODIN allows you to transcribe audio streams in real-time. This is useful for building voice assistants, chatbots, content moderation tools, and AI-powered applications. In this guide, we'll show you how to capture audio from room participants and transcribe it using the Node.js SDK.

Use Cases

There are many use cases for audio transcription:

  • Content Moderation: Detect inappropriate language in real-time
  • Voice Assistants: Build AI assistants that respond to voice commands
  • Meeting Transcription: Create automated meeting notes and summaries
  • Accessibility: Provide real-time captions for hearing-impaired users
  • Voice Commands: Trigger game actions based on spoken commands

ODIN makes it easy to implement these use cases in multi-user environments. You can focus on building your application while ODIN handles the audio networking.

Example: Recording and Transcribing

This example demonstrates a Node.js server that:

  1. Connects to an ODIN room
  2. Records incoming audio streams to WAV files
  3. Transcribes the recordings using OpenAI's Whisper API when users stop talking

Dependencies

npm install @4players/odin-nodejs openai wav

Complete Example

import odin from '@4players/odin-nodejs';
import wav from 'wav';
import OpenAI from 'openai';
import fs from 'fs';

const { OdinClient } = odin;

// Configuration
const accessKey = "__YOUR_ACCESS_KEY__";
const roomId = "Lobby";
const userId = "TranscriptionBot";

// Initialize OpenAI client
const openai = new OpenAI({
  apiKey: '__YOUR_OPENAI_API_KEY__'
});

// Store active recordings by media ID
const fileRecorder = {};

async function main() {
  const client = new OdinClient();
  const token = client.generateToken(accessKey, roomId, userId);
  const room = client.createRoom(token);

  // Configure bot user data
  const userData = {
    name: "Transcription Bot",
    userId: "TranscriberBot001",
    outputMuted: 1, // Bot doesn't need to receive audio playback
    platform: "ODIN Node.js SDK",
    version: "0.11.0"
  };
  const data = new TextEncoder().encode(JSON.stringify(userData));

  // Handle peer join/leave for logging
  room.onPeerJoined((event) => {
    console.log(`Peer joined: ${event.peerId}`);
    if (event.userData) {
      try {
        console.log("User data:", JSON.parse(new TextDecoder().decode(event.userData)));
      } catch (e) {
        console.log("User data (raw):", event.userData);
      }
    }
  });

  room.onPeerLeft((event) => {
    console.log(`Peer left: ${event.peerId}`);
  });

  // Handle media activity (Voice Activity Detection)
  // This event fires when a user starts or stops talking
  room.onMediaActivity((event) => {
    const { mediaId, peerId, state } = event;

    if (state) {
      // User started talking - create a new recording file
      if (!fileRecorder[mediaId]) {
        const timestamp = Date.now();
        const fileName = `./recording_${peerId}_${mediaId}_${timestamp}.wav`;
        console.log(`Started recording: ${fileName}`);

        fileRecorder[mediaId] = {
          wavEncoder: new wav.FileWriter(fileName, {
            channels: 2,
            sampleRate: 48000,
            bitDepth: 16
          }),
          fileName: fileName,
          peerId: peerId
        };
      } else {
        // User resumed talking - cancel the timeout
        if (fileRecorder[mediaId].timer) {
          clearTimeout(fileRecorder[mediaId].timer);
          delete fileRecorder[mediaId].timer;
        }
      }
    } else {
      // User stopped talking - start a 2-second timer before transcribing
      if (fileRecorder[mediaId] && !fileRecorder[mediaId].timer) {
        fileRecorder[mediaId].timer = setTimeout(async () => {
          // Close the WAV file
          fileRecorder[mediaId].wavEncoder.end();
          const fileName = fileRecorder[mediaId].fileName;
          const peerId = fileRecorder[mediaId].peerId;

          console.log(`Recording finished: ${fileName}`);

          // Transcribe using OpenAI Whisper
          await transcribeAudio(fileName, peerId);

          // Clean up
          delete fileRecorder[mediaId];
        }, 2000); // Wait 2 seconds after speech ends
      }
    }
  });

  // Handle incoming audio data
  room.onAudioDataReceived((data) => {
    const { mediaId, samples16 } = data;

    // Write audio data to the active recording; write through the encoder
    // (not the underlying file) so the WAV header is maintained correctly
    if (fileRecorder[mediaId]) {
      const buffer = Buffer.from(samples16.buffer, samples16.byteOffset, samples16.byteLength);
      fileRecorder[mediaId].wavEncoder.write(buffer);
    }
  });

  // Handle room events
  room.onJoined((event) => {
    console.log(`Joined room: ${event.roomId}`);
    console.log(`My peer ID: ${event.ownPeerId}`);
  });

  room.onLeft((event) => {
    console.log(`Left room: ${event.reason || 'disconnected'}`);
  });

  // Join the room
  room.join("https://gateway.odin.4players.io", data);

  // Handle graceful shutdown
  process.on('SIGINT', () => {
    console.log("Shutting down...");

    // Close any active recordings
    for (const mediaId in fileRecorder) {
      if (fileRecorder[mediaId].wavEncoder) {
        fileRecorder[mediaId].wavEncoder.end();
      }
    }

    room.close();
    process.exit(0);
  });

  console.log("Transcription bot running. Press Ctrl+C to stop.");
}

async function transcribeAudio(fileName, peerId) {
  try {
    console.log(`Transcribing ${fileName}...`);

    const transcription = await openai.audio.transcriptions.create({
      file: fs.createReadStream(fileName),
      model: "whisper-1",
      language: "en" // Optional: specify language for better accuracy
    });

    console.log(`\n[Peer ${peerId}]: "${transcription.text}"\n`);

    // Optionally delete the file after transcription
    // fs.unlinkSync(fileName);

    return transcription.text;
  } catch (error) {
    console.error("Transcription failed:", error.message);
    return null;
  }
}

main();

Understanding the Code

Voice Activity Detection (VAD)

The MediaActivity event is triggered by ODIN's built-in Voice Activity Detection:

room.onMediaActivity((event) => {
  const { mediaId, peerId, state } = event;

  if (state) {
    // User started talking
  } else {
    // User stopped talking
  }
});

Recording Strategy

The example uses a timer-based approach to handle natural speech pauses:

  1. When a user starts talking, we create a new WAV file
  2. When they stop talking, we start a 2-second timer
  3. If they resume talking before the timer expires, we cancel it
  4. If the timer expires, we close the file and transcribe it

This prevents brief pauses in speech from splitting a single utterance into multiple short recordings.
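
The same logic, distilled into a standalone helper for reuse (a sketch; createSegmentTracker and onSegmentEnd are illustrative names, not part of the SDK):

// Debounce "stopped talking" events so brief pauses don't split a segment
function createSegmentTracker(silenceMs, onSegmentEnd) {
  const timers = {};
  return {
    onActivity(mediaId, talking) {
      if (talking) {
        // Speech (re)started: cancel any pending close
        clearTimeout(timers[mediaId]);
        delete timers[mediaId];
      } else if (!timers[mediaId]) {
        // Speech stopped: close the segment after the silence window
        timers[mediaId] = setTimeout(() => {
          delete timers[mediaId];
          onSegmentEnd(mediaId);
        }, silenceMs);
      }
    }
  };
}

You would wire it up with room.onMediaActivity((e) => tracker.onActivity(e.mediaId, e.state)).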

Audio Format

ODIN provides audio data in both 16-bit integer and 32-bit float formats:

room.onAudioDataReceived((data) => {
  const { peerId, mediaId, samples16, samples32 } = data;

  // samples16: Int16Array - perfect for WAV recording
  // samples32: Float32Array - better for processing (range -1 to 1)
});
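
If you work with samples32 but still need 16-bit data (for example, for the WAV writer above), the conversion is straightforward. A minimal sketch; the clamp guards against out-of-range samples overflowing the integer range:

// Convert Float32 samples (-1..1) to Int16 (-32768..32767)
function floatToInt16(samples32) {
  const out = new Int16Array(samples32.length);
  for (let i = 0; i < samples32.length; i++) {
    const s = Math.max(-1, Math.min(1, samples32[i])); // clamp to [-1, 1]
    out[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return out;
}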

Advanced: Custom Speech-to-Text Services

You can easily adapt this example to use other transcription services:

Google Cloud Speech-to-Text

import speech from '@google-cloud/speech';

const speechClient = new speech.SpeechClient();

async function transcribeWithGoogle(fileName) {
  const audio = {
    content: fs.readFileSync(fileName).toString('base64'),
  };
  const config = {
    encoding: 'LINEAR16',
    sampleRateHertz: 48000,
    languageCode: 'en-US',
  };

  const [response] = await speechClient.recognize({ audio, config });
  return response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
}

AWS Transcribe

import { TranscribeClient, StartTranscriptionJobCommand } from "@aws-sdk/client-transcribe";

const transcribeClient = new TranscribeClient({ region: "us-east-1" });

async function transcribeWithAWS(s3Uri, jobName) {
  const params = {
    TranscriptionJobName: jobName,
    LanguageCode: "en-US",
    MediaFormat: "wav",
    Media: { MediaFileUri: s3Uri }
  };

  await transcribeClient.send(new StartTranscriptionJobCommand(params));
  // Poll for completion (see the sketch below)...
}
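
The polling step might look like the following, using GetTranscriptionJobCommand from the same package (a sketch; error handling and backoff are omitted):

import { GetTranscriptionJobCommand } from "@aws-sdk/client-transcribe";

async function waitForTranscriptionJob(jobName) {
  for (;;) {
    const { TranscriptionJob } = await transcribeClient.send(
      new GetTranscriptionJobCommand({ TranscriptionJobName: jobName })
    );
    if (TranscriptionJob.TranscriptionJobStatus === "COMPLETED") {
      // The transcript is a JSON document hosted at this URI
      return TranscriptionJob.Transcript.TranscriptFileUri;
    }
    if (TranscriptionJob.TranscriptionJobStatus === "FAILED") {
      throw new Error(TranscriptionJob.FailureReason);
    }
    await new Promise((resolve) => setTimeout(resolve, 5000)); // poll every 5 seconds
  }
}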

Encoding to FLAC

Some speech-to-text services prefer FLAC encoding. You can use the flac-encoder npm package:

npm install flac-encoder

See our blog post on encoding ODIN audio data to FLAC for a detailed guide.
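
Alternatively, if you'd rather avoid a native encoder dependency, you can shell out to the reference flac command-line tool (a sketch; it assumes the flac binary is installed and on your PATH):

import { execFile } from 'child_process';

// Convert a finished WAV recording to FLAC using the `flac` CLI
function wavToFlac(wavPath, flacPath) {
  return new Promise((resolve, reject) => {
    // -f overwrites an existing output file, --silent suppresses progress output
    execFile('flac', ['-f', '--silent', '-o', flacPath, wavPath], (err) => {
      if (err) reject(err);
      else resolve(flacPath);
    });
  });
}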

Next Steps

Now that you can transcribe audio, consider these enhancements:

  • Stream the audio directly to the transcription service instead of saving to files
  • Implement real-time transcription with streaming APIs
  • Add response generation using GPT to create voice assistants
  • Send transcriptions back to the room as text messages (the sketch below combines this with GPT response generation)
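
For example, the last two ideas combined: feed each transcription to a chat model and broadcast the reply as a data message. This is a sketch; it assumes the room object exposes a sendMessage(bytes) method as other ODIN SDKs do, so check this against your SDK version:

// Generate a reply with GPT and broadcast it to the room
async function respondToTranscription(room, peerId, text) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "You are a helpful in-room assistant." },
      { role: "user", content: text }
    ]
  });
  const reply = completion.choices[0].message.content;

  // Broadcast as a data message; clients decide how to render it
  const payload = new TextEncoder().encode(
    JSON.stringify({ kind: "transcription-reply", peerId, text: reply })
  );
  room.sendMessage(payload); // assumed API - verify for your SDK version
}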

Check out our Streaming Audio Files guide to learn how to send AI-generated audio responses back to the room.

ODIN Bot SDK

For a more complete solution, check out our ODIN Bot SDK which provides higher-level abstractions for building voice bots: @4players/odin-bot-sdk.