Transcribing Audio Streams
ODIN allows you to transcribe audio streams in real time. This is useful for building voice assistants, chatbots, content moderation tools, and other AI-powered applications. In this guide, we'll show you how to capture audio from room participants and transcribe it using the Node.js SDK.
Use Cases
There are many use cases for audio transcription:
- Content Moderation: Detect inappropriate language in real time
- Voice Assistants: Build AI assistants that respond to voice commands
- Meeting Transcription: Create automated meeting notes and summaries
- Accessibility: Provide real-time captions for hearing-impaired users
- Voice Commands: Trigger game actions based on spoken commands
ODIN makes it easy to implement these use cases in multi-user environments. You can focus on building your application while ODIN handles the audio networking.
Example: Recording and Transcribing
This example demonstrates a Node.js server that:
- Connects to an ODIN room
- Records incoming audio streams to WAV files
- Transcribes the recordings using OpenAI's Whisper API when users stop talking
Dependencies
npm install @4players/odin-nodejs openai wav
Complete Example
import odin from '@4players/odin-nodejs';
import wav from 'wav';
import OpenAI from 'openai';
import fs from 'fs';
const { OdinClient } = odin;
const accessKey = "__YOUR_ACCESS_KEY__";
const roomId = "Lobby";
const userId = "TranscriptionBot";
const openai = new OpenAI({
  apiKey: '__YOUR_OPENAI_API_KEY__'
});
// Active recordings, keyed by media ID
const fileRecorder = {};
async function main() {
  // Create a local client, generate a room token and create the room handle
  const client = new OdinClient();
  const token = client.generateToken(accessKey, roomId, userId);
  const room = client.createRoom(token);

  // User data shown to other peers in the room
  const userData = {
    name: "Transcription Bot",
    userId: "TranscriberBot001",
    outputMuted: 1,
    platform: "ODIN Node.js SDK",
    version: "0.11.0"
  };
  // User data is transmitted as raw bytes, so encode the JSON string first
  const data = new TextEncoder().encode(JSON.stringify(userData));

  room.onPeerJoined((event) => {
    console.log(`Peer joined: ${event.peerId}`);
    if (event.userData) {
      try {
        console.log("User data:", JSON.parse(new TextDecoder().decode(event.userData)));
      } catch (e) {
        console.log("User data (raw):", event.userData);
      }
    }
  });

  room.onPeerLeft((event) => {
    console.log(`Peer left: ${event.peerId}`);
  });

  room.onMediaActivity((event) => {
    const { mediaId, peerId, state } = event;
    if (state) {
      // The peer started (or resumed) talking
      if (!fileRecorder[mediaId]) {
        // No active recording for this media stream yet, so start a new WAV file
        const timestamp = Date.now();
        const fileName = `./recording_${peerId}_${mediaId}_${timestamp}.wav`;
        console.log(`Started recording: ${fileName}`);
        fileRecorder[mediaId] = {
          wavEncoder: new wav.FileWriter(fileName, {
            channels: 2,
            sampleRate: 48000,
            bitDepth: 16
          }),
          fileName: fileName,
          peerId: peerId
        };
      } else {
        // The peer resumed talking before the grace period expired, so keep recording
        if (fileRecorder[mediaId].timer) {
          clearTimeout(fileRecorder[mediaId].timer);
          delete fileRecorder[mediaId].timer;
        }
      }
    } else {
      // The peer stopped talking; wait 2 seconds before finalizing the recording
      if (fileRecorder[mediaId] && !fileRecorder[mediaId].timer) {
        fileRecorder[mediaId].timer = setTimeout(async () => {
          fileRecorder[mediaId].wavEncoder.end();
          const fileName = fileRecorder[mediaId].fileName;
          const peerId = fileRecorder[mediaId].peerId;
          console.log(`Recording finished: ${fileName}`);
          await transcribeAudio(fileName, peerId);
          delete fileRecorder[mediaId];
        }, 2000);
      }
    }
  });

  room.onAudioDataReceived((data) => {
    const { mediaId, samples16 } = data;
    if (fileRecorder[mediaId]) {
      // Append the 16-bit PCM samples to the WAV file of this media stream
      const buffer = Buffer.from(samples16.buffer, samples16.byteOffset, samples16.byteLength);
      fileRecorder[mediaId].wavEncoder.write(buffer);
    }
  });

  room.onJoined((event) => {
    console.log(`Joined room: ${event.roomId}`);
    console.log(`My peer ID: ${event.ownPeerId}`);
  });

  room.onLeft((event) => {
    console.log(`Left room: ${event.reason || 'disconnected'}`);
  });

  room.join("https://gateway.odin.4players.io", data);

  // Finalize any open recordings and leave the room on Ctrl+C
  process.on('SIGINT', () => {
    console.log("Shutting down...");
    for (const mediaId in fileRecorder) {
      if (fileRecorder[mediaId].wavEncoder) {
        fileRecorder[mediaId].wavEncoder.end();
      }
    }
    room.close();
    process.exit(0);
  });

  console.log("Transcription bot running. Press Ctrl+C to stop.");
}
async function transcribeAudio(fileName, peerId) {
  try {
    console.log(`Transcribing ${fileName}...`);
    // Send the WAV file to OpenAI's Whisper API for transcription
    const transcription = await openai.audio.transcriptions.create({
      file: fs.createReadStream(fileName),
      model: "whisper-1",
      language: "en"
    });
    console.log(`\n[Peer ${peerId}]: "${transcription.text}"\n`);
    return transcription.text;
  } catch (error) {
    console.error("Transcription failed:", error.message);
    return null;
  }
}
main();
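The script uses ES module syntax, so run it with a recent Node.js version and either a .mjs file extension or "type": "module" in your package.json. Assuming you saved it as transcription-bot.js (the file name is up to you):
node transcription-bot.js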
Understanding the Code
Voice Activity Detection (VAD)
The MediaActivity event is triggered by ODIN's built-in voice activity detection whenever a peer starts or stops talking:
room.onMediaActivity((event) => {
  const { mediaId, peerId, state } = event;
  if (state) {
    // The peer started talking
  } else {
    // The peer stopped talking
  }
});
Recording Strategy
The example uses a timer-based approach to handle natural speech pauses:
- When a user starts talking, we create a new WAV file
- When they stop talking, we start a 2-second timer
- If they resume talking before the timer expires, we cancel it
- If the timer expires, we close the file and transcribe it
This debounce prevents creating a separate short recording for every natural pause in speech. The sketch below isolates the pattern.
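In isolation, the debounce looks roughly like this (a minimal sketch; startRecording and finishRecording are hypothetical helpers standing in for the WAV-file handling in the complete example):
const GRACE_PERIOD_MS = 2000;
let stopTimer = null;

function onTalkingStateChanged(isTalking) {
  if (isTalking) {
    // Resumed within the grace period: cancel the pending finalization
    if (stopTimer) {
      clearTimeout(stopTimer);
      stopTimer = null;
    }
    startRecording(); // hypothetical helper: opens a WAV file if none is open
  } else {
    // Stopped talking: finalize only if the pause outlasts the grace period
    stopTimer = setTimeout(() => {
      stopTimer = null;
      finishRecording(); // hypothetical helper: closes the file and triggers transcription
    }, GRACE_PERIOD_MS);
  }
}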
Audio Data Formats
ODIN provides the incoming audio data in both 16-bit integer and 32-bit float formats:
room.onAudioDataReceived((data) => {
  // samples16 is an Int16Array, samples32 a Float32Array
  const { peerId, mediaId, samples16, samples32 } = data;
});
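The complete example writes samples16 directly, since it matches the 16-bit WAV format. If you prefer to work with the float samples (for example, for custom processing) but still need 16-bit PCM afterwards, a conversion along these lines works (a minimal sketch that clamps values to the valid Int16 range):
function float32ToInt16(samples32) {
  const out = new Int16Array(samples32.length);
  for (let i = 0; i < samples32.length; i++) {
    // Clamp to [-1, 1] and scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, samples32[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}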
Advanced: Custom Speech-to-Text Services
You can easily adapt this example to use other transcription services:
Google Cloud Speech-to-Text
import speech from '@google-cloud/speech';
const speechClient = new speech.SpeechClient();
async function transcribeWithGoogle(fileName) {
  // Send the WAV file inline as base64-encoded content
  const audio = {
    content: fs.readFileSync(fileName).toString('base64'),
  };
  const config = {
    encoding: 'LINEAR16',
    sampleRateHertz: 48000,
    languageCode: 'en-US',
  };
  const [response] = await speechClient.recognize({ audio, config });
  return response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
}
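To plug this into the example above, call transcribeWithGoogle(fileName) from the timer callback instead of transcribeAudio(fileName, peerId).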
AWS Transcribe
import { TranscribeClient, StartTranscriptionJobCommand } from "@aws-sdk/client-transcribe";
const transcribeClient = new TranscribeClient({ region: "us-east-1" });
async function transcribeWithAWS(s3Uri, jobName) {
  // AWS Transcribe reads its input from S3, so upload the WAV file first and pass its S3 URI
  const params = {
    TranscriptionJobName: jobName,
    LanguageCode: "en-US",
    MediaFormat: "wav",
    Media: { MediaFileUri: s3Uri }
  };
  await transcribeClient.send(new StartTranscriptionJobCommand(params));
}
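Transcribe jobs run asynchronously, so the text is not returned directly. Here is a minimal polling sketch using GetTranscriptionJobCommand from the same package (in production you would more likely react to a job-state-change notification instead of polling):
import { GetTranscriptionJobCommand } from "@aws-sdk/client-transcribe";

async function waitForTranscriptionJob(jobName) {
  // Poll the job status every few seconds until it completes or fails
  for (;;) {
    const { TranscriptionJob } = await transcribeClient.send(
      new GetTranscriptionJobCommand({ TranscriptionJobName: jobName })
    );
    const status = TranscriptionJob.TranscriptionJobStatus;
    if (status === "COMPLETED") {
      // The result is a JSON document you can download from this URI
      return TranscriptionJob.Transcript.TranscriptFileUri;
    }
    if (status === "FAILED") {
      throw new Error(TranscriptionJob.FailureReason);
    }
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}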
Encoding to FLAC
Some speech-to-text services prefer FLAC-encoded audio. You can use the flac npm package for this; see our blog post on encoding ODIN audio data to FLAC for a detailed guide.
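If you'd rather not encode in JavaScript, another option is to convert the finished WAV recording with the ffmpeg command-line tool before uploading it. A sketch, assuming the ffmpeg binary is installed and on your PATH:
import { execFile } from 'child_process';
import { promisify } from 'util';

const execFileAsync = promisify(execFile);

async function convertToFlac(wavPath) {
  // Re-encode the WAV recording as FLAC; -y overwrites an existing output file
  const flacPath = wavPath.replace(/\.wav$/, '.flac');
  await execFileAsync('ffmpeg', ['-y', '-i', wavPath, flacPath]);
  return flacPath;
}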
Next Steps
Now that you can transcribe audio, consider these enhancements:
- Stream the audio directly to the transcription service instead of saving to files
- Implement real-time transcription with streaming APIs
- Add response generation using GPT to create voice assistants (see the sketch after this list)
- Send transcriptions back to the room as text messages
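For the response-generation idea, here is a minimal sketch using the OpenAI Chat Completions API (the model name and system prompt are just placeholders):
async function generateReply(transcribedText) {
  // Feed the transcription to a chat model and return its reply as plain text
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "You are a helpful in-game voice assistant. Keep answers short." },
      { role: "user", content: transcribedText }
    ]
  });
  return completion.choices[0].message.content;
}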
Check out our Streaming Audio Files guide to learn how to send AI-generated audio responses back to the room.
ODIN Bot SDK
For a more complete solution, check out our ODIN Bot SDK which provides higher-level abstractions for building voice bots: @4players/odin-bot-sdk.
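It can be installed alongside the other dependencies:
npm install @4players/odin-bot-sdk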