
Transcribing Audio Streams

ODIN allows you to transcribe audio streams in real-time. This is useful for building voice assistants, chatbots, content moderation tools, and AI-powered applications. In this guide, we'll show you how to capture audio from room participants and transcribe it using the Node.js SDK.

Use Cases

There are many use cases for audio transcription:

  • Content Moderation: Detect inappropriate language in real-time
  • Voice Assistants: Build AI assistants that respond to voice commands
  • Meeting Transcription: Create automated meeting notes and summaries
  • Accessibility: Provide real-time captions for hearing-impaired users
  • Voice Commands: Trigger game actions based on spoken commands

ODIN makes it easy to implement these use cases in multi-user environments. You can focus on building your application while ODIN handles the audio networking.

Example: Recording and Transcribing

This example demonstrates a Node.js server that:

  1. Connects to an ODIN room
  2. Records incoming audio streams to WAV files
  3. Transcribes the recordings using OpenAI's Whisper API when users stop talking

Dependencies

npm install @4players/odin-nodejs openai wav

Complete Example

import odin from '@4players/odin-nodejs';
import wav from 'wav';
import OpenAI from 'openai';
import fs from 'fs';

const { OdinClient } = odin;

// Configuration
const accessKey = "__YOUR_ACCESS_KEY__";
const roomId = "Lobby";
const userId = "TranscriptionBot";

// Initialize OpenAI client
const openai = new OpenAI({
  apiKey: '__YOUR_OPENAI_API_KEY__'
});

// Store active recordings by media ID
const fileRecorder = {};

async function main() {
  const client = new OdinClient();
  const token = client.generateToken(accessKey, roomId, userId);
  const room = client.createRoom(token);

  // Configure bot user data
  const userData = {
    name: "Transcription Bot",
    userId: "TranscriberBot001",
    outputMuted: 1, // Bot doesn't need to receive audio playback
    platform: "ODIN Node.js SDK",
    version: "0.11.0"
  };
  const data = new TextEncoder().encode(JSON.stringify(userData));

  // Handle peer join/leave for logging
  room.onPeerJoined((event) => {
    console.log(`Peer joined: ${event.peerId}`);
    if (event.userData) {
      try {
        console.log("User data:", JSON.parse(new TextDecoder().decode(event.userData)));
      } catch (e) {
        console.log("User data (raw):", event.userData);
      }
    }
  });

  room.onPeerLeft((event) => {
    console.log(`Peer left: ${event.peerId}`);
  });

  // Handle media activity (Voice Activity Detection)
  // This event fires when a user starts or stops talking
  room.onMediaActivity((event) => {
    const { mediaId, peerId, state } = event;

    if (state) {
      // User started talking - create a new recording file
      if (!fileRecorder[mediaId]) {
        const timestamp = Date.now();
        const fileName = `./recording_${peerId}_${mediaId}_${timestamp}.wav`;
        console.log(`Started recording: ${fileName}`);

        fileRecorder[mediaId] = {
          wavEncoder: new wav.FileWriter(fileName, {
            channels: 2,
            sampleRate: 48000,
            bitDepth: 16
          }),
          fileName: fileName,
          peerId: peerId
        };
      } else {
        // User resumed talking - cancel the timeout
        if (fileRecorder[mediaId].timer) {
          clearTimeout(fileRecorder[mediaId].timer);
          delete fileRecorder[mediaId].timer;
        }
      }
    } else {
      // User stopped talking - start a 2-second timer before transcribing
      if (fileRecorder[mediaId] && !fileRecorder[mediaId].timer) {
        fileRecorder[mediaId].timer = setTimeout(async () => {
          // Close the WAV file
          fileRecorder[mediaId].wavEncoder.end();
          const fileName = fileRecorder[mediaId].fileName;
          const peerId = fileRecorder[mediaId].peerId;

          console.log(`Recording finished: ${fileName}`);

          // Transcribe using OpenAI Whisper
          await transcribeAudio(fileName, peerId);

          // Clean up
          delete fileRecorder[mediaId];
        }, 2000); // Wait 2 seconds after speech ends
      }
    }
  });

  // Handle incoming audio data
  room.onAudioDataReceived((data) => {
    const { mediaId, samples16 } = data;

    // Write audio data to the active recording; write through the encoder
    // (not the underlying file) so the WAV header is maintained correctly
    if (fileRecorder[mediaId]) {
      const buffer = Buffer.from(samples16.buffer, samples16.byteOffset, samples16.byteLength);
      fileRecorder[mediaId].wavEncoder.write(buffer);
    }
  });

  // Handle room events
  room.onJoined((event) => {
    console.log(`Joined room: ${event.roomId}`);
    console.log(`My peer ID: ${event.ownPeerId}`);
  });

  room.onLeft((event) => {
    console.log(`Left room: ${event.reason || 'disconnected'}`);
  });

  // Join the room
  room.join("https://gateway.odin.4players.io", data);

  // Handle graceful shutdown
  process.on('SIGINT', () => {
    console.log("Shutting down...");

    // Close any active recordings
    for (const mediaId in fileRecorder) {
      if (fileRecorder[mediaId].wavEncoder) {
        fileRecorder[mediaId].wavEncoder.end();
      }
    }

    room.close();
    process.exit(0);
  });

  console.log("Transcription bot running. Press Ctrl+C to stop.");
}

async function transcribeAudio(fileName, peerId) {
  try {
    console.log(`Transcribing ${fileName}...`);

    const transcription = await openai.audio.transcriptions.create({
      file: fs.createReadStream(fileName),
      model: "whisper-1",
      language: "en" // Optional: specify language for better accuracy
    });

    console.log(`\n[Peer ${peerId}]: "${transcription.text}"\n`);

    // Optionally delete the file after transcription
    // fs.unlinkSync(fileName);

    return transcription.text;
  } catch (error) {
    console.error("Transcription failed:", error.message);
    return null;
  }
}

main();

Understanding the Code

Voice Activity Detection (VAD)

The MediaActivity event is triggered by ODIN's built-in Voice Activity Detection:

room.onMediaActivity((event) => {
  const { mediaId, peerId, state } = event;

  if (state) {
    // User started talking
  } else {
    // User stopped talking
  }
});

Recording Strategy

The example uses a timer-based approach to handle natural speech pauses:

  1. When a user starts talking, we create a new WAV file
  2. When they stop talking, we start a 2-second timer
  3. If they resume talking before the timer expires, we cancel it
  4. If the timer expires, we close the file and transcribe it

This prevents brief pauses in speech from splitting a single utterance into multiple short recordings.
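
The same logic, distilled into a standalone helper for reuse (a sketch; createSegmentTracker and onSegmentEnd are illustrative names, not part of the SDK):

// Debounce "stopped talking" events so brief pauses don't split a segment
function createSegmentTracker(silenceMs, onSegmentEnd) {
  const timers = {};
  return {
    onActivity(mediaId, talking) {
      if (talking) {
        // Speech (re)started: cancel any pending close
        clearTimeout(timers[mediaId]);
        delete timers[mediaId];
      } else if (!timers[mediaId]) {
        // Speech stopped: close the segment after the silence window
        timers[mediaId] = setTimeout(() => {
          delete timers[mediaId];
          onSegmentEnd(mediaId);
        }, silenceMs);
      }
    }
  };
}

You would wire it up with room.onMediaActivity((e) => tracker.onActivity(e.mediaId, e.state)).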

Audio Format

ODIN provides audio data in both 16-bit integer and 32-bit float formats:

room.onAudioDataReceived((data) => {
  const { peerId, mediaId, samples16, samples32 } = data;

  // samples16: Int16Array - perfect for WAV recording
  // samples32: Float32Array - better for processing (range -1 to 1)
});
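
If you work with samples32 but still need 16-bit data (for example, for the WAV writer above), the conversion is straightforward. A minimal sketch; the clamp guards against out-of-range samples overflowing the integer range:

// Convert Float32 samples (-1..1) to Int16 (-32768..32767)
function floatToInt16(samples32) {
  const out = new Int16Array(samples32.length);
  for (let i = 0; i < samples32.length; i++) {
    const s = Math.max(-1, Math.min(1, samples32[i])); // clamp to [-1, 1]
    out[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return out;
}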

Advanced: Custom Speech-to-Text Services

You can easily adapt this example to use other transcription services:

Google Cloud Speech-to-Text

import speech from '@google-cloud/speech';

const speechClient = new speech.SpeechClient();

async function transcribeWithGoogle(fileName) {
  const audio = {
    content: fs.readFileSync(fileName).toString('base64'),
  };
  const config = {
    encoding: 'LINEAR16',
    sampleRateHertz: 48000,
    languageCode: 'en-US',
  };

  const [response] = await speechClient.recognize({ audio, config });
  return response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
}

AWS Transcribe

import { TranscribeClient, StartTranscriptionJobCommand } from "@aws-sdk/client-transcribe";

const transcribeClient = new TranscribeClient({ region: "us-east-1" });

async function transcribeWithAWS(s3Uri, jobName) {
  const params = {
    TranscriptionJobName: jobName,
    LanguageCode: "en-US",
    MediaFormat: "wav",
    Media: { MediaFileUri: s3Uri }
  };

  await transcribeClient.send(new StartTranscriptionJobCommand(params));
  // Poll for completion (see the sketch below)...
}
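
The polling step might look like the following, using GetTranscriptionJobCommand from the same package (a sketch; error handling and backoff are omitted):

import { GetTranscriptionJobCommand } from "@aws-sdk/client-transcribe";

async function waitForTranscriptionJob(jobName) {
  for (;;) {
    const { TranscriptionJob } = await transcribeClient.send(
      new GetTranscriptionJobCommand({ TranscriptionJobName: jobName })
    );
    if (TranscriptionJob.TranscriptionJobStatus === "COMPLETED") {
      // The transcript is a JSON document hosted at this URI
      return TranscriptionJob.Transcript.TranscriptFileUri;
    }
    if (TranscriptionJob.TranscriptionJobStatus === "FAILED") {
      throw new Error(TranscriptionJob.FailureReason);
    }
    await new Promise((resolve) => setTimeout(resolve, 5000)); // poll every 5 seconds
  }
}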

Encoding to FLAC

Some speech-to-text services prefer FLAC encoding. You can use the flac-encoder npm package:

npm install flac-encoder

See our blog post on encoding ODIN audio data to FLAC for a detailed guide.
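
Alternatively, if you'd rather avoid a native encoder dependency, you can shell out to the reference flac command-line tool (a sketch; it assumes the flac binary is installed and on your PATH):

import { execFile } from 'child_process';

// Convert a finished WAV recording to FLAC using the `flac` CLI
function wavToFlac(wavPath, flacPath) {
  return new Promise((resolve, reject) => {
    // -f overwrites an existing output file, --silent suppresses progress output
    execFile('flac', ['-f', '--silent', '-o', flacPath, wavPath], (err) => {
      if (err) reject(err);
      else resolve(flacPath);
    });
  });
}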

Next Steps

Now that you can transcribe audio, consider these enhancements:

  • Stream the audio directly to the transcription service instead of saving to files
  • Implement real-time transcription with streaming APIs
  • Add response generation using GPT to create voice assistants
  • Send transcriptions back to the room as text messages (the sketch below combines this with GPT response generation)
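
For example, the last two ideas combined: feed each transcription to a chat model and broadcast the reply as a data message. This is a sketch; it assumes the room object exposes a sendMessage(bytes) method as other ODIN SDKs do, so check this against your SDK version:

// Generate a reply with GPT and broadcast it to the room
async function respondToTranscription(room, peerId, text) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "You are a helpful in-room assistant." },
      { role: "user", content: text }
    ]
  });
  const reply = completion.choices[0].message.content;

  // Broadcast as a data message; clients decide how to render it
  const payload = new TextEncoder().encode(
    JSON.stringify({ kind: "transcription-reply", peerId, text: reply })
  );
  room.sendMessage(payload); // assumed API - verify for your SDK version
}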

Check out our Streaming Audio Files guide to learn how to send AI-generated audio responses back to the room.

ODIN Bot SDK

For a more complete solution, check out our ODIN Bot SDK which provides higher-level abstractions for building voice bots: @4players/odin-bot-sdk.