agents
The LiveKit Agents Framework provides a toolkit for building real-time multimodal AI applications that integrate speech recognition, text-to-speech synthesis, and natural language processing. Engineers can use this framework to create voice assistants, chatbots, and other AI-powered applications that interact with users through audio and text.
The core of the framework is implemented in the livekit-agents directory, which contains the main components for speech-to-text (STT), text-to-speech (TTS), and language model integration. The VoiceAssistant class in …/voice_assistant serves as the central component, orchestrating the interaction between user input, language model processing, and speech output.
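The orchestration pattern described above can be sketched with stub components. This is an illustrative analog, not the actual LiveKit API: the class and method names (MiniAssistant, handle_turn, recognize, chat, synthesize) are hypothetical stand-ins for the framework's richer streaming interfaces.

```python
import asyncio
from dataclasses import dataclass

# Stub components standing in for the framework's STT, LLM, and TTS
# plugins; the real classes expose richer, streaming interfaces.
class EchoSTT:
    async def recognize(self, audio: bytes) -> str:
        return audio.decode()

class EchoLLM:
    async def chat(self, text: str) -> str:
        return f"You said: {text}"

class EchoTTS:
    async def synthesize(self, text: str) -> bytes:
        return text.encode()

@dataclass
class MiniAssistant:
    stt: EchoSTT
    llm: EchoLLM
    tts: EchoTTS

    async def handle_turn(self, audio: bytes) -> bytes:
        # One conversational turn: audio in -> transcript -> reply -> audio out.
        transcript = await self.stt.recognize(audio)
        reply = await self.llm.chat(transcript)
        return await self.tts.synthesize(reply)

async def main() -> None:
    assistant = MiniAssistant(EchoSTT(), EchoLLM(), EchoTTS())
    out = await assistant.handle_turn(b"hello")
    print(out.decode())  # You said: hello

asyncio.run(main())
```

The value of this shape is that any component can be swapped for another implementation of the same interface, which is exactly what the plugin system below enables.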
Key functionality of the framework includes:
• Speech-to-Text: The framework supports multiple STT providers through a plugin system. The STT class in …/stt defines the interface for speech recognition, while specific implementations such as Google Cloud Speech-to-Text and OpenAI Whisper are available as plugins in livekit-plugins.
• Text-to-Speech: Similarly, the framework supports various TTS providers. The TTS class in …/tts defines the interface, with implementations for services like Google Cloud TTS and OpenAI TTS available as plugins.
• Language Model Integration: The LLM class in …/llm provides an interface for interacting with large language models. The OpenAI plugin in …/livekit-plugins-openai offers integration with models like GPT-3.5 and GPT-4.
• Inter-Process Communication: The framework uses a custom IPC system, implemented in …/ipc, to manage communication between components, including process pools and supervised processes.
The framework is designed with extensibility in mind, using a plugin architecture that allows for easy integration of new STT, TTS, and NLP services. This is evident in the livekit-plugins directory, which contains the various plugin implementations.
For developers looking to get started, the examples directory provides sample implementations of voice assistants, speech-to-text, and text-to-speech applications built with the framework.
Key design choices in the framework include:
• Asynchronous programming: The framework makes extensive use of Python's asyncio for handling concurrent operations, as seen in the utility functions in …/aio.
• Streaming interfaces: Both STT and TTS components support streaming, allowing for real-time processing of audio data.
• Modular architecture: The use of abstract base classes and plugins allows for easy swapping of components and addition of new functionality.
• Command-line interface: The framework provides a CLI for managing agent processes, implemented in …/cli.
For more detailed information on specific components, refer to the relevant sections in this wiki, such as Voice Assistant, Speech-to-Text, and Text-to-Speech.
Voice Assistant
References: livekit-agents/livekit/agents/voice_assistant, examples/voice-assistant
The VoiceAssistant class serves as the central component for implementing voice-based interactions. It integrates modules for speech recognition, natural language processing, and speech synthesis, described in the subsections below.
Core Voice Assistant Functionality
References: livekit-agents/livekit/agents/voice_assistant
The VoiceAssistant class in …/voice_assistant.py manages voice-based interactions between users and AI assistants, integrating the modules that handle speech recognition, natural language processing, and speech synthesis.
Human Input Processing
References: livekit-agents/livekit/agents/voice_assistant
The HumanInput class in …/human_input.py manages audio input processing from a participant in a LiveKit room.
Agent Output and Playback
References: livekit-agents/livekit/agents/voice_assistant
The AgentOutput class manages speech synthesis and playback for the assistant's responses.
Example Implementations
References: examples/voice-assistant
The …/voice-assistant directory contains implementations of voice assistants with varying levels of complexity:
Minimal Assistant Setup
References: examples/voice-assistant/minimal_assistant.py
The minimal_assistant.py script initializes and manages a voice assistant using the LiveKit framework.
Function Calling Weather Assistant
References: examples/voice-assistant/function_calling_weather.py
The AssistantFnc class encapsulates the weather-related functionality for the voice assistant. Its key method, get_weather(), retrieves weather information for a given location by making an asynchronous HTTP GET request to the wttr.in API. The method returns the weather data as a string on a successful response (status code 200) and raises an exception for failed requests.
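The flow just described can be sketched as follows. This is a hedged approximation, not the example's exact code: the actual example uses an async HTTP client, while this sketch pushes a blocking urllib call onto a worker thread so the event loop stays responsive; the helper name build_wttr_url and the query format are illustrative assumptions.

```python
import asyncio
import urllib.parse
import urllib.request

# Illustrative helper (not from the example): build the wttr.in query URL.
# format=3 asks wttr.in for a one-line plain-text weather report.
def build_wttr_url(location: str) -> str:
    return f"https://wttr.in/{urllib.parse.quote(location)}?format=3"

async def get_weather(location: str) -> str:
    def fetch() -> str:
        with urllib.request.urlopen(build_wttr_url(location)) as resp:
            if resp.status != 200:
                # Mirror the example's behavior: raise on failed requests.
                raise RuntimeError(f"weather request failed: {resp.status}")
            return resp.read().decode()
    # Run the blocking call off the event loop thread.
    return await asyncio.to_thread(fetch)
```

In the real example this method is exposed to the LLM as a callable function, so the model can decide when to invoke it during a conversation.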
Simple RAG Assistant
References: examples/voice-assistant/simple-rag
The Simple RAG Assistant is implemented in …/assistant.py. It leverages the LiveKit framework to create a voice assistant that uses Retrieval-Augmented Generation (RAG) to enhance its responses.
Speech-to-Text
References: livekit-agents/livekit/agents/stt, livekit-plugins/livekit-plugins-google, livekit-plugins/livekit-plugins-openai, examples/speech-to-text
The STT class in …/stt.py defines the core interface for speech-to-text functionality. It includes methods for recognizing speech from an AudioBuffer and for streaming audio data for real-time transcription.
Google Speech-to-Text Integration
References: livekit-plugins/livekit-plugins-google/livekit/plugins/google/stt.py
The STT class in …/stt.py provides integration with Google's Speech-to-Text API.
OpenAI Speech-to-Text Integration
References: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
The STT class in …/stt.py implements speech recognition using OpenAI's Whisper model.
Speech Recognition Configuration
References: livekit-plugins/livekit-plugins-google/livekit/plugins/google/stt.py
The STTOptions dataclass encapsulates the configuration options for Google's Speech-to-Text service.
Speech Event Processing
References: livekit-plugins/livekit-plugins-google/livekit/plugins/google/stt.py
The SpeechStream class handles the processing of speech events from the Google Cloud Speech-to-Text API.
Text-to-Speech
References: livekit-agents/livekit/agents/tts, livekit-plugins/livekit-plugins-google, livekit-plugins/livekit-plugins-openai, examples/text-to-speech
The TTS class serves as the primary interface for text-to-speech functionality. It provides methods for synthesizing audio from text input, including synthesize() for generating complete audio segments and stream() for incremental audio generation.
Google Text-to-Speech Integration
References: livekit-plugins/livekit-plugins-google/livekit/plugins/google/tts.py
The TTS class in …/tts.py provides an interface to Google's Text-to-Speech service.
OpenAI Text-to-Speech Integration
References: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/tts.py
The TTS class in …/tts.py implements text-to-speech functionality using OpenAI's API.
Audio Encoding and Streaming
References: livekit-plugins/livekit-plugins-google/livekit/plugins/google/tts.py
The ChunkedStream class handles audio encoding and streaming of synthesized speech. It supports several audio encodings, including MP3, and streams the generated audio content efficiently.
Cartesia Text-to-Speech Integration
References: examples/text-to-speech/cartesia_tts.py
The Cartesia TTS integration is implemented in …/cartesia_tts.py. This example demonstrates how to use the Cartesia text-to-speech library within a LiveKit application.
Natural Language Processing
References: livekit-agents/livekit/agents/llm, livekit-plugins/livekit-plugins-openai
The LLM class in …/llm.py serves as the primary interface for interacting with OpenAI-based language models. It provides static methods for creating instances configured for specific models and services, including Azure, Fireworks, Groq, Octo, Ollama, Perplexity, and Together, as well as a with_deepseek method for creating instances backed by a DeepSeek model.
LLM Integration
References: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/llm.py
The LLM class in …/llm.py serves as the primary interface for integrating various large language model providers, offering a unified API for interacting with different LLM services.
Speech-to-Text Processing
References: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
The STT class in …/stt.py implements speech-to-text functionality using OpenAI's Whisper model.
Text-to-Speech Synthesis
References: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/tts.py
The TTS class in …/tts.py implements text-to-speech functionality using OpenAI's API.
File Upload Handling
References: livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/beta/assistant_llm.py
The AssistantLLM class in …/assistant_llm.py manages file uploads for vision-enabled AI assistants.
Inter-Process Communication
References: livekit-agents/livekit/agents/ipc
The Inter-Process Communication (IPC) system in the LiveKit Agents Framework is centered around the ProcPool class in …/proc_pool.py, which manages a pool of job executors that stand ready to execute tasks.
Job Executor Interface
References: livekit-agents/livekit/agents/ipc/job_executor.py
The JobExecutor protocol defines the properties and asynchronous methods needed to manage a job's lifecycle, ensuring that different types of job executors can be implemented behind a consistent interface.
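A structural protocol of this kind can be sketched with typing.Protocol. The member names below (started, start, aclose) are illustrative assumptions, not the actual JobExecutor definition; the point is that any class with matching members satisfies the protocol without inheriting from it.

```python
import asyncio
from typing import Protocol, runtime_checkable

# Hedged sketch of a lifecycle protocol; member names are illustrative.
@runtime_checkable
class Executor(Protocol):
    @property
    def started(self) -> bool: ...
    async def start(self) -> None: ...
    async def aclose(self) -> None: ...

# Satisfies the protocol structurally, with no inheritance required.
class DummyExecutor:
    def __init__(self) -> None:
        self._started = False

    @property
    def started(self) -> bool:
        return self._started

    async def start(self) -> None:
        self._started = True

    async def aclose(self) -> None:
        self._started = False

ex = DummyExecutor()
asyncio.run(ex.start())
print(ex.started, isinstance(ex, Executor))  # True True
```

This is what lets thread-based and process-based executors coexist behind one interface: the pool cares only that the protocol's members exist.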
Thread-based Job Execution
References: livekit-agents/livekit/agents/ipc/thread_job_executor.py
The ThreadJobExecutor class runs each job in its own thread, so jobs execute concurrently without blocking the main flow. It handles lifecycle management, inter-thread communication, and health monitoring of jobs.
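A minimal analog of thread-based execution, far simpler than the real ThreadJobExecutor: the job runs on its own daemon thread and reports its result (or failure) over a queue, keeping the caller unblocked. All names here are illustrative.

```python
import queue
import threading

class ThreadExecutor:
    """Run one callable on a dedicated thread; collect the result via a queue."""

    def __init__(self, fn, *args):
        self._result: queue.Queue = queue.Queue()
        self._thread = threading.Thread(
            target=self._run, args=(fn, *args), daemon=True
        )

    def _run(self, fn, *args):
        try:
            self._result.put(("ok", fn(*args)))
        except Exception as exc:  # surface failures to the caller
            self._result.put(("err", exc))

    def start(self) -> None:
        self._thread.start()

    @property
    def alive(self) -> bool:
        # Simple health check: is the worker thread still running?
        return self._thread.is_alive()

    def join(self, timeout=None):
        self._thread.join(timeout)
        return self._result.get_nowait()
```

The queue doubles as the inter-thread channel: the worker only ever writes, the owner only ever reads, so no extra locking is needed.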
Process-based Job Execution
References: livekit-agents/livekit/agents/ipc/proc_job_executor.py
The ProcJobExecutor class manages the execution of jobs in separate processes. Running each job in isolation gives a clean separation of concerns and improves the stability and scalability of the system.
Job Process Main Functionality
References: livekit-agents/livekit/agents/ipc/job_main.py, livekit-agents/livekit/agents/ipc/proc_lazy_main.py
In the LiveKit Agents Framework, the job process is responsible for managing the execution of tasks, facilitating communication between components, and handling logging. The …/job_main.py file provides the core functionality for these processes.
Communication Protocol
References: livekit-agents/livekit/agents/ipc/proto.py
Inter-process communication enables the main process to coordinate with its subprocesses. The …/proto.py file defines the message protocol for this coordination: a suite of dataclasses, each representing a specific type of message that can be exchanged between processes.
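The dataclass-per-message pattern can be sketched as follows. The message types and the pickle wire format here are illustrative assumptions, not the actual contents of proto.py; the point is that each message is a typed value that round-trips cleanly across a process boundary.

```python
import pickle
from dataclasses import dataclass

# Illustrative message types in the spirit of the protocol described above.
@dataclass
class StartJobRequest:
    job_id: str

@dataclass
class JobStatus:
    job_id: str
    running: bool

def encode(msg) -> bytes:
    """Serialize a message for transport over a pipe or socket."""
    return pickle.dumps(msg)

def decode(data: bytes):
    """Reconstruct the typed message on the other side."""
    return pickle.loads(data)
```

Typed messages make the coordination logic easy to audit: a receiver can dispatch on the message class rather than parsing an ad-hoc byte format.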
Command-Line Interface
References: livekit-agents/livekit/agents/cli
The CLI for managing and interacting with agent processes is implemented in the …/cli directory. The main entry point is the run_app() function, which is exposed in the __init__.py file.
CLI Structure and Commands
References: livekit-agents/livekit/agents/cli/cli.py
The command-line interface is defined using the click library, with run_app() serving as the main entry point. It offers several commands for managing agent processes.
Logging Configuration
References: livekit-agents/livekit/agents/cli/log.py
The setup_logging() function in …/log.py configures logging for both development and production environments. It creates a StreamHandler and attaches either a JsonFormatter or a ColoredFormatter, depending on the devmode flag.
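The devmode switch can be sketched with the standard logging module. This JsonFormatter is a hand-rolled stand-in, not the framework's formatter, and the dev-mode format string is an assumption; only the shape of the decision mirrors the description above.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Stand-in for a production JSON formatter."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({"level": record.levelname, "msg": record.getMessage()})

def setup_logging(log_level: str, devmode: bool) -> logging.Logger:
    handler = logging.StreamHandler()
    if devmode:
        # Human-readable output for local development.
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s - %(message)s"))
    else:
        # Structured JSON lines for log aggregation in production.
        handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("agents-demo")
    logger.setLevel(log_level)
    logger.addHandler(handler)
    return logger
```

JSON lines are trivial for log collectors to ingest, while the colored/plain formatter keeps local output scannable; switching on a single flag keeps both paths in one place.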
Protocol and Data Structures
References: livekit-agents/livekit/agents/cli/proto.py
The CliArgs dataclass defines the configuration options for the LiveKit agent CLI, including worker options, log level, development mode, asyncio debug mode, watch mode, and drain timeout. It also contains an mp_cch attribute used for inter-process communication.
Utility Functions
References: livekit-agents/livekit/agents/utils
The …/utils directory contains various utility modules and classes used throughout the agent framework:
Asynchronous Utilities
References: livekit-agents/livekit/agents/utils/aio
The gracefully_cancel() function in …/__init__.py cancels multiple asynchronous futures while ensuring that associated callbacks are properly released. This is particularly useful for gracefully shutting down complex asynchronous systems.
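The essence of such a helper can be sketched in a few lines: cancel every future, then await them all so cancellation actually completes (and cleanup handlers run) before returning. This is an illustrative reimplementation, not the framework's code.

```python
import asyncio

async def gracefully_cancel(*futures: asyncio.Future) -> None:
    for fut in futures:
        fut.cancel()
    # Await the cancelled futures so their cleanup runs before we return.
    # return_exceptions=True swallows the CancelledError raised by each.
    await asyncio.gather(*futures, return_exceptions=True)

async def main() -> list[bool]:
    async def sleeper():
        await asyncio.sleep(60)

    tasks = [asyncio.create_task(sleeper()) for _ in range(3)]
    await asyncio.sleep(0)  # let the tasks start running
    await gracefully_cancel(*tasks)
    return [t.cancelled() for t in tasks]

print(asyncio.run(main()))  # [True, True, True]
```

Merely calling cancel() only requests cancellation; awaiting afterwards is what guarantees the tasks have actually unwound, which is the difference between a graceful and a racy shutdown.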
Miscellaneous Utilities
References: livekit-agents/livekit/agents/utils/misc.py
The …/misc.py file contains utility functions for audio processing, time operations, and unique identifier generation.
Plugin System
References: livekit-plugins
The plugin system in LiveKit provides an extensible architecture for adding new capabilities to agents. The core of this system is the Plugin class from the livekit.agents module, which serves as the base for all plugins.
Plugin System Architecture
References: livekit-plugins
The LiveKit plugin system is built around the Plugin base class, which provides a foundation for extending functionality. Plugins are registered through a decorator-based mechanism, allowing new capabilities to be integrated easily.
Google Plugin
References: livekit-plugins/livekit-plugins-google
The Google plugin integrates Google Cloud services for speech-to-text (STT) and text-to-speech (TTS) functionality within the LiveKit Agents Framework. It is implemented in the …/livekit-plugins-google directory.
OpenAI Plugin
References: livekit-plugins/livekit-plugins-openai
The OpenAI plugin integrates OpenAI's language models and AI services into the LiveKit ecosystem. It provides implementations for speech-to-text (STT), text-to-speech (TTS), and language model (LLM) functionalities.
RAG Plugin
References: livekit-plugins/livekit-plugins-rag
The RAG plugin implements Retrieval-Augmented Generation capabilities for enhanced natural language processing tasks within the LiveKit Agents framework. Key components include:
Anthropic Plugin
References: livekit-plugins/livekit-plugins-anthropic/livekit/plugins/anthropic, livekit-plugins/livekit-plugins-anthropic/livekit/plugins/anthropic/__init__.py, livekit-plugins/livekit-plugins-anthropic/livekit/plugins/anthropic/llm.py, livekit-plugins/livekit-plugins-anthropic/livekit/plugins/anthropic/models.py, livekit-plugins/livekit-plugins-anthropic/livekit/plugins/anthropic/version.py, livekit-plugins/livekit-plugins-anthropic/setup.py
The Anthropic plugin, located at …/anthropic, integrates Anthropic's language models into the LiveKit ecosystem, letting developers use them for natural language understanding and generation within the LiveKit framework.
Clova Plugin
References: livekit-plugins/livekit-plugins-clova/livekit/plugins/clova, livekit-plugins/livekit-plugins-clova/setup.py
The framework's speech recognition capabilities are extended by the Clova speech-to-text integration in the …/clova directory. The integration is encapsulated in the ClovaSTTPlugin class, which follows the LiveKit plugin architecture so that Clova's STT functionality slots in alongside the other providers.
Deepgram Plugin
References: livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/stt.py, livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/utils.py, livekit-plugins/livekit-plugins-deepgram/setup.py
The STT class in …/stt.py is the central component of the plugin, enabling applications to use Deepgram's speech recognition. It provides methods for both batch and real-time speech processing, allowing flexible integration into a variety of use cases.
ElevenLabs Plugin
References: livekit-plugins/livekit-plugins-elevenlabs/livekit/plugins/elevenlabs/tts.py
The ElevenLabs plugin is integrated into the LiveKit ecosystem to provide text-to-speech (TTS) synthesis capabilities. It supports Speech Synthesis Markup Language (SSML) parsing and phoneme handling, enabling developers to create more natural and varied speech outputs. The plugin leverages the ElevenLabs API to offer both chunked and streaming audio synthesis from text, accommodating different use cases and performance requirements.
Transcription Management
References: livekit-agents/livekit/agents/transcription
The STTSegmentsForwarder and TTSSegmentsForwarder classes in …/stt_forwarder.py and …/tts_forwarder.py handle the forwarding of speech-to-text and text-to-speech transcription data, respectively.
STT Forwarding
References: livekit-agents/livekit/agents/transcription/stt_forwarder.py
The STTSegmentsForwarder class manages the forwarding of speech-to-text transcription data to a user's room.
TTS Forwarding
References: livekit-agents/livekit/agents/transcription/tts_forwarder.py
The TTSSegmentsForwarder class in …/tts_forwarder.py keeps text-to-speech transcription synchronized with audio playback.
Text Processing
References: livekit-agents/livekit/agents/tokenize, livekit-plugins/livekit-plugins-rag
The SentenceChunker class in …/chunking.py handles text chunking for NLP tasks.
Paragraph Tokenization
References: livekit-agents/livekit/agents/tokenize/_basic_paragraph.py
The split_paragraphs() function in …/_basic_paragraph.py segments a given text into distinct paragraphs.
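A paragraph splitter of this shape can be sketched with a regular expression: paragraphs are runs of non-blank lines separated by blank lines, and each result carries its span in the original text. This is an illustrative reimplementation under that assumption, not the actual function, which may normalize whitespace differently.

```python
import re

def split_paragraphs(text: str) -> list[tuple[str, int, int]]:
    """Return (paragraph, start, end) tuples over the original text."""
    # A paragraph: a non-empty line, optionally followed by more lines,
    # stopping at the first blank line (a newline followed by a newline).
    pattern = r"[^\n]+(?:\n(?!\n)[^\n]*)*"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]
```

Returning positions alongside the text lets downstream code (e.g. transcription forwarding) map each paragraph back to its location in the source.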
Sentence Tokenization
References: livekit-agents/livekit/agents/tokenize/_basic_sent.py
The split_sentences() function in …/_basic_sent.py segments text into individual sentences. It returns a list of tuples, each containing a sentence along with its start and end positions in the original text, which is useful for applications that require sentence-level analysis or processing.
Word Tokenization
References: livekit-agents/livekit/agents/tokenize/_basic_word.py
The split_words() function in …/_basic_word.py tokenizes text into individual words. It returns a list of tuples, each containing a word along with its starting and ending index positions, allowing precise tracking of where each word is located within the original string for word-level analysis or manipulation.
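The word-plus-indices return shape can be sketched in one line. This whitespace-delimited approximation is illustrative; the real tokenizer may treat punctuation differently.

```python
import re

def split_words(text: str) -> list[tuple[str, int, int]]:
    """Return (word, start, end) for each whitespace-delimited token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
```

Note that `end` is exclusive, matching Python slicing: `text[start:end]` recovers the word exactly.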
Tokenization Utilities
References: livekit-agents/livekit/agents/tokenize/token_stream.py, livekit-agents/livekit/agents/tokenize/tokenizer.py
The …/token_stream.py file introduces the BufferedTokenStream class, a foundation for handling streams of tokens. The class buffers incoming text and applies a tokenization function, tokenize_fnc, to produce either a list of tokens or tuples of tokens with their start and end indices.
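The buffering idea can be sketched as follows: text arrives in arbitrary chunks, complete tokens are emitted as soon as the tokenize function can produce them, and a possibly-incomplete trailing token stays buffered. This is a simplified analog of BufferedTokenStream, not its actual implementation; the class and method names here are illustrative.

```python
class BufferedStream:
    """Emit complete tokens from text that arrives in arbitrary chunks."""

    def __init__(self, tokenize_fnc):
        self._tokenize = tokenize_fnc
        self._buf = ""

    def push(self, chunk: str) -> list[str]:
        self._buf += chunk
        tokens = self._tokenize(self._buf)
        if not tokens:
            return []
        if not self._buf[-1].isspace():
            # The last token may still be growing; keep it buffered.
            self._buf = tokens.pop()
        else:
            # Buffer ends on a separator, so every token is complete.
            self._buf = ""
        return tokens

    def flush(self) -> list[str]:
        """End of input: whatever remains buffered is a complete token."""
        tokens = self._tokenize(self._buf)
        self._buf = ""
        return tokens
```

This is exactly the shape needed when streaming LLM output into a sentence-level TTS pipeline: tokens can be forwarded the moment they are complete, without waiting for the full response.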
Tokenization Integration
References: livekit-agents/livekit/agents/tokenize/basic.py
The …/basic.py file provides a streamlined interface for converting large blocks of text into structured forms such as sentences, words, and paragraphs, as a first step for further natural language processing.