How to Create a Conversational AI Voice Agent with the OpenAI API: A Step-by-Step Guide for Next.js 15 (2025)

Lonare
3 min read · Nov 29, 2024

Building a conversational AI voice agent has become remarkably accessible. In this article, we’ll create a fully functional voice agent with Next.js 15, using the browser’s speech APIs to listen and speak and OpenAI’s chat API to generate responses. By the end, you’ll have a basic voice-enabled AI agent that listens to users, generates responses in real time, and speaks back to them.

Let’s dive in step by step.

Prerequisites

  1. Basic Knowledge of JavaScript/React: You should be comfortable with basic coding concepts.
  2. Node.js Installed: Ensure you have Node.js v18.18 or later, the minimum required by Next.js 15.
  3. OpenAI API Key: Create an account and obtain an API key from OpenAI.
  4. Microphone and Speaker: Required for testing voice input and output.

Step 1: Setting Up a New Next.js 15 Project

Start by creating a new Next.js project. When the installer asks whether to use the App Router, choose No, since this guide uses the pages/ directory.

npx create-next-app@latest conversational-ai-agent
cd conversational-ai-agent

Install necessary dependencies:

npm install openai react-speech-recognition react-speech-kit
  • openai: The official OpenAI Node.js SDK.
  • react-speech-recognition: Wraps the browser’s Web Speech API for voice input (best supported in Chrome).
  • react-speech-kit: React hooks for browser text-to-speech.

Step 2: Configure the OpenAI API in Next.js

Create a file called .env.local in the root directory and add your OpenAI API key. Because the variable has no NEXT_PUBLIC_ prefix, Next.js keeps it server-side only, which is exactly what we want for a secret:

OPENAI_API_KEY=your-openai-api-key

Now, create a utility function for interacting with OpenAI’s API. Note that the current openai package (v4 and later) uses a single OpenAI client rather than the older Configuration/OpenAIApi classes:

utils/openai.js

import OpenAI from "openai";

// The client reads the secret key from the server-side environment
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export const getChatResponse = async (prompt) => {
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0].message.content;
};

This function sends a user’s query to OpenAI and retrieves the AI’s response. Since it reads the secret key from process.env, it must only ever run on the server, so the front end shouldn’t import it directly.
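Instead, expose it through an API route the browser can POST to. Here’s a minimal sketch (the file name pages/api/chat.js and the { prompt }/{ reply } request shape are conventions chosen for this guide, not anything OpenAI prescribes):

pages/api/chat.js

import { getChatResponse } from "../../utils/openai";

export default async function handler(req, res) {
  if (req.method !== "POST") {
    return res.status(405).json({ error: "Method not allowed" });
  }
  try {
    const { prompt } = req.body;
    // Runs server-side, so the API key never reaches the browser
    const reply = await getChatResponse(prompt);
    res.status(200).json({ reply });
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: "Failed to get a response from OpenAI" });
  }
}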

Step 3: Add Speech Recognition and Text-to-Speech

We’ll now set up the microphone to capture voice input and a text-to-speech system to read AI responses aloud.

pages/index.js

import { useState } from "react";
import SpeechRecognition, { useSpeechRecognition } from "react-speech-recognition";
import { useSpeechSynthesis } from "react-speech-kit";

export default function Home() {
  const [conversation, setConversation] = useState([]);
  const [isProcessing, setIsProcessing] = useState(false);
  const { speak } = useSpeechSynthesis();
  const { transcript, resetTranscript } = useSpeechRecognition();

  if (!SpeechRecognition.browserSupportsSpeechRecognition()) {
    return <p>Your browser does not support Speech Recognition.</p>;
  }

  const handleStart = () => {
    resetTranscript();
    SpeechRecognition.startListening({ continuous: true });
  };

  const handleStop = async () => {
    SpeechRecognition.stopListening();
    setIsProcessing(true);
    const userMessage = transcript;
    const updatedConversation = [...conversation, { role: "user", content: userMessage }];
    setConversation(updatedConversation);
    try {
      // Send the transcript to the server-side API route from Step 2
      const res = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt: userMessage }),
      });
      const { reply } = await res.json();
      setConversation([...updatedConversation, { role: "assistant", content: reply }]);
      // Speak the AI response aloud
      speak({ text: reply });
    } catch (err) {
      console.error("Failed to get an AI response:", err);
    } finally {
      setIsProcessing(false);
    }
  };

  return (
    <div style={{ padding: "2rem", fontFamily: "Arial, sans-serif" }}>
      <h1>Conversational AI Voice Agent</h1>
      <div>
        {conversation.map((msg, idx) => (
          <p key={idx}>
            <strong>{msg.role === "assistant" ? "AI" : "You"}:</strong> {msg.content}
          </p>
        ))}
        <p><strong>You (live):</strong> {transcript}</p>
      </div>
      <button onClick={handleStart} disabled={isProcessing}>
        Start Listening
      </button>
      <button onClick={handleStop} disabled={isProcessing || !transcript}>
        Stop and Process
      </button>
    </div>
  );
}

Key Features:

  1. SpeechRecognition: Captures the user’s voice and listens continuously until stopped.
  2. SpeechSynthesis: Converts the AI’s text responses into speech (voice selection is shown in the snippet below).
  3. Conversation State: Maintains a history of messages between the user and AI.
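
By default the browser picks a voice for you. react-speech-kit also exposes the voices the browser provides, so you can pass one explicitly to speak. A minimal standalone sketch (the file name and the voice index 0 are arbitrary choices; log voices to find one you like):

pages/voice-demo.js

import { useSpeechSynthesis } from "react-speech-kit";

export default function VoiceDemo() {
  // voices lists the SpeechSynthesisVoice objects the browser offers
  const { speak, voices } = useSpeechSynthesis();
  return (
    <button onClick={() => speak({ text: "Hello!", voice: voices[0] })}>
      Speak with the first available voice
    </button>
  );
}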

Step 4: Add CSS for Better UX

Open the generated styles/globals.css file and replace its contents with the following:

body {
  margin: 0;
  padding: 0;
  font-family: Arial, sans-serif;
  background-color: #f4f4f9;
  color: #333;
}

h1 {
  text-align: center;
  color: #4a90e2;
}

button {
  padding: 10px 20px;
  margin: 5px;
  background-color: #4a90e2;
  color: white;
  border: none;
  border-radius: 5px;
  cursor: pointer;
}

button:disabled {
  background-color: #ccc;
}

div {
  max-width: 600px;
  margin: 0 auto;
}
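
For these styles to apply globally, the stylesheet must be imported once in pages/_app.js. create-next-app generates this file for you; it should look roughly like this:

pages/_app.js

import "../styles/globals.css";

export default function App({ Component, pageProps }) {
  return <Component {...pageProps} />;
}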

Step 5: Run Your Application

Start your development server:

npm run dev

Open your browser and navigate to http://localhost:3000.

  1. Click Start Listening to begin capturing your voice, and grant microphone access when the browser prompts you.
  2. Speak a question or command.
  3. Click Stop and Process to send your input to OpenAI and hear the AI’s response.
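
You can also exercise the API route from Step 2 directly, without the microphone, which is handy for debugging (this assumes the /api/chat route sketched earlier):

curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello there!"}'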

Step 6: Deploy the App (Optional)

Deploy your app to a platform like Vercel for wider accessibility:

npx vercel

Follow the prompts to deploy your app and share the generated URL with others.
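
Note that .env.local is never deployed, so the hosted app needs OPENAI_API_KEY configured as an environment variable on Vercel. You can add it from the project dashboard, or via the CLI:

npx vercel env add OPENAI_API_KEY

Redeploy after adding the variable so the serverless functions pick it up.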

Final Thoughts

Congratulations! 🎉 You’ve successfully created a conversational AI voice agent using Next.js 15 and OpenAI’s API. This simple implementation can be expanded with features like custom commands, improved UI, and multi-language support. The possibilities are endless!
