pipecat
Pipecat is an open-source framework for building real-time voice and multimodal conversational AI applications. It provides a modular architecture for processing audio, video, and text data, integrating various AI services, and creating interactive conversational experiences.
The core of Pipecat is built around a pipeline-based architecture for processing different types of data frames. The Frame Processing section details how the system handles various frame types, including text, audio, and transcription frames. The pipeline architecture, implemented in …/pipeline, allows for flexible composition of processing components.
Key components of the framework include:
- Frame Processors: located in …/processors, these components handle tasks such as aggregating frames, filtering, and integrating with external frameworks like Langchain.
- AI Services: the …/services directory contains implementations for various AI services, including language models (e.g., OpenAI, Anthropic, Azure), text-to-speech, speech-to-text, and image generation. The AI Services Integration section provides more details on how these services are integrated into the framework.
- Transports: the …/transports directory implements input/output mechanisms for audio, video, and network communication. This includes local transports using PyAudio and Tkinter, as well as network transports using WebSockets.
- Voice Activity Detection: implemented in …/vad, this system detects when a user starts and stops speaking, which is crucial for interactive conversational applications.
The framework's design emphasizes modularity and extensibility. Key design choices include:
- Use of protocol buffers for defining frame structures, allowing for efficient serialization and deserialization of data.
- Asynchronous processing using Python's asyncio library, enabling non-blocking I/O operations.
- Abstract base classes for services and transports, facilitating easy addition of new implementations.
Pipecat includes a variety of example applications in the examples directory, demonstrating how to build different types of conversational AI applications using the framework. These range from simple chatbots to more complex applications like storytelling chatbots and patient intake systems.
For developers looking to build conversational AI applications, Pipecat provides a flexible foundation that can be customized and extended to meet specific requirements. The modular architecture allows for easy integration of new AI services and processing components, making it adaptable to a wide range of use cases in voice and multimodal AI.
Frame Processing
References: src/pipecat/frames, src/pipecat/pipeline, src/pipecat/processors
The Pipecat framework employs a modular approach to process various types of frames through a pipeline architecture. At its core, the FrameProcessor class serves as the foundation for all data processing components. This class manages frame processing, metrics, and error handling, providing a common interface for subclasses to implement specific functionality.
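A minimal subclass sketch follows; UppercaseProcessor is a hypothetical example, but the process_frame/push_frame pattern mirrors the custom processors used throughout the examples directory:

```python
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class UppercaseProcessor(FrameProcessor):
    """Hypothetical processor: upper-cases text frames, passes everything else through."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)  # base-class bookkeeping
        if isinstance(frame, TextFrame):
            frame = TextFrame(text=frame.text.upper())
        await self.push_frame(frame, direction)  # forward to the next processor
```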
Frame Types and Definitions
References: src/pipecat/frames/protobufs
The Pipecat system defines four primary frame types in …/frames_pb2.py: TextFrame, AudioRawFrame, TranscriptionFrame, and a wrapper Frame message that holds exactly one of the other three, allowing heterogeneous frames to share one wire format.
Pipeline Architecture
References: src/pipecat/pipeline
The pipeline architecture in Pipecat is built around several key components: the Pipeline class, which links an ordered list of frame processors; the PipelineTask class, which wraps a pipeline and queues frames into it; and the PipelineRunner class, which executes tasks until completion.
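A minimal sketch of how these pieces fit together, reusing the hypothetical UppercaseProcessor from the Frame Processing section above:

```python
import asyncio

from pipecat.frames.frames import EndFrame, TextFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask


async def main():
    # Pipeline links processors in order; a single processor suffices here.
    pipeline = Pipeline([UppercaseProcessor()])

    # PipelineTask wraps the pipeline so frames can be queued into it.
    task = PipelineTask(pipeline)
    await task.queue_frames([TextFrame(text="hello"), EndFrame()])

    # PipelineRunner drives the task; the EndFrame shuts the pipeline down.
    await PipelineRunner().run(task)


asyncio.run(main())
```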
Frame Processors
References: src/pipecat/processors
The FrameProcessor class serves as the foundation for various data processing components in the Pipecat system. It provides a common set of functionality for managing frames, metrics, and error handling; the subsections below cover the main processor families built on top of it.
Aggregators
References: src/pipecat/processors/aggregators
The GatedAggregator class accumulates frames based on custom functions that determine when to start and stop aggregation. It uses gate_open_fn and gate_close_fn to control the "gate" state, pushing frames to output when open and accumulating them when closed.
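For instance, the foundational speech-and-image example referenced later on this page gates frames until an image is ready:

```python
from pipecat.frames.frames import ImageRawFrame, LLMFullResponseStartFrame
from pipecat.processors.aggregators.gated import GatedAggregator

# Hold frames until an image arrives, then release them together; close the
# gate again when a new LLM response begins, so speech and image stay in sync.
gated_aggregator = GatedAggregator(
    gate_open_fn=lambda frame: isinstance(frame, ImageRawFrame),
    gate_close_fn=lambda frame: isinstance(frame, LLMFullResponseStartFrame),
    start_open=False,
)
```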
Filters
References: src/pipecat/processors/filters
The Pipecat framework implements several filtering mechanisms to process and control the flow of frames in the pipeline, including type-based and predicate-based filters.
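Assuming the FrameFilter (type-based) and FunctionFilter (predicate-based) classes in this directory, usage might look like the following sketch:

```python
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.filters.frame_filter import FrameFilter
from pipecat.processors.filters.function_filter import FunctionFilter

# Type-based filter: only TextFrame instances (plus control frames) pass.
text_only = FrameFilter([TextFrame])


# Predicate-based filter: drop text frames that are empty or whitespace.
async def non_empty(frame: Frame) -> bool:
    return not (isinstance(frame, TextFrame) and not frame.text.strip())


non_empty_filter = FunctionFilter(non_empty)
```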
Framework Integration
References: src/pipecat/processors/frameworks
The LangchainProcessor and RTVIProcessor classes integrate the Langchain and RTVI frameworks into the Pipecat processing pipeline.
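A sketch of the Langchain side, following the pattern of the interruptible-langchain foundational example; the prompt and model choice are illustrative:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

from pipecat.processors.frameworks.langchain import LangchainProcessor

# Any Langchain runnable works; the processor feeds aggregated user
# transcriptions into the chain and streams the response back as text frames.
prompt = ChatPromptTemplate.from_messages(
    [("system", "You are a helpful assistant. Be brief."), ("human", "{input}")]
)
chain = prompt | ChatOpenAI(model="gpt-4o-mini")
langchain_processor = LangchainProcessor(chain)
```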
GStreamer Integration
References: src/pipecat/processors/gstreamer
The GStreamerPipelineSource class in …/pipeline_source.py integrates GStreamer for audio and video processing within the Pipecat pipeline. It sets up and manages a GStreamer pipeline based on a provided pipeline description string and optional output parameters.
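A minimal, hedged sketch; the description string is standard gst-launch syntax and the file name is a placeholder:

```python
from pipecat.processors.gstreamer.pipeline_source import GStreamerPipelineSource

# The class parses this launch-syntax description and pushes the decoded
# audio/video from the GStreamer pipeline into Pipecat as frames.
gst_source = GStreamerPipelineSource(pipeline="filesrc location=sample.mp4")
```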
Network Transports
References: src/pipecat/transports/network
The network transports in Pipecat provide WebSocket-based communication for real-time data exchange. Two main implementations are available: a standalone WebSocket server transport and a FastAPI-based WebSocket transport, both described below.
AI Services Integration
References: src/pipecat/services
The AIService class serves as the foundation for various AI services in the Pipecat framework. It handles common functionality like processing start, stop, and cancel frames. The AsyncAIService provides an asynchronous version of this base class.
AI Service Base Classes
References: src/pipecat/services/ai_services.py
AIService and AsyncAIService serve as foundational classes for AI services in the Pipecat framework. These classes, defined in …/ai_services.py, provide essential functionality for managing the lifecycle and frame processing of AI services.
Text-to-Speech Services
References: src/pipecat/services/ai_services.py
The TTSService class, inheriting from AsyncAIService, provides core functionality for text-to-speech services in the Pipecat framework, most notably converting incoming text frames to audio through an abstract run_tts method that concrete services implement.
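A sketch of a custom service, assuming run_tts is the abstract hook as in the referenced module; the audio produced here is placeholder silence rather than real synthesis:

```python
from typing import AsyncGenerator

from pipecat.frames.frames import AudioRawFrame, Frame, TTSStartedFrame, TTSStoppedFrame
from pipecat.services.ai_services import TTSService


class SilentTTSService(TTSService):
    """Hypothetical TTS service that emits silence instead of speech."""

    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
        yield TTSStartedFrame()
        # A real service would stream synthesized audio chunks here.
        silence = b"\x00" * 16000  # 0.5 s of 16 kHz, 16-bit mono silence
        yield AudioRawFrame(audio=silence, sample_rate=16000, num_channels=1)
        yield TTSStoppedFrame()
```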
Language Model Services
References: src/pipecat/services/ai_services.py
The LLMService class provides a foundation for integrating large language models into the Pipecat framework, including support for registering Python callbacks that handle function calls initiated by the model.
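A hedged sketch of function-call registration; the callback signature shown here is an assumption and has changed across Pipecat versions:

```python
from pipecat.services.openai import OpenAILLMService

llm = OpenAILLMService(api_key="...", model="gpt-4o")


# Invoked when the model emits a matching tool call; the parameter list is an
# assumption for illustration, not a guaranteed stable API.
async def fetch_weather(function_name, tool_call_id, arguments, llm, context, result_callback):
    await result_callback({"conditions": "sunny", "temperature_c": 21})


llm.register_function("get_current_weather", fetch_weather)
```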
Speech-to-Text Services
References: src/pipecat/services/ai_services.py
The STTService class, inheriting from AsyncAIService, provides a foundation for integrating speech-to-text functionality into the Pipecat pipeline: concrete services implement an abstract run_stt method that turns incoming audio into transcription frames.
Image Generation Services
References: src/pipecat/services/ai_services.py
The ImageGenService class, inheriting from AsyncAIService, provides a foundation for integrating image generation capabilities into the Pipecat framework; concrete services implement an abstract method that turns a text prompt into image frames.
Vision Services
References: src/pipecat/services/ai_services.py
The VisionService class, inheriting from AsyncAIService, provides a foundation for integrating computer vision capabilities into the Pipecat pipeline; concrete services implement an abstract method that analyzes incoming image frames and produces descriptive text.
Deepgram Integration
References: src/pipecat/services/deepgram.py
The DeepgramSTTService class integrates Deepgram's speech-to-text functionality into the Pipecat framework, streaming incoming audio to Deepgram's live-transcription API and pushing the resulting transcriptions back into the pipeline.
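Instantiation is a one-liner sketch; the service is then placed between a transport's input and the LLM stage of a pipeline:

```python
import os

from pipecat.services.deepgram import DeepgramSTTService

# Streams AudioRawFrames to Deepgram's live API and pushes
# TranscriptionFrames back into the pipeline as results arrive.
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
```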
Whisper Integration
References: src/pipecat/services/whisper.py
The WhisperSTTService class integrates the Whisper speech-to-text model into the Pipecat framework, running transcription locally rather than calling an external API.
Transport Layer
References: src/pipecat/transports
The transport layer in Pipecat is implemented through a hierarchy of classes that handle input/output for audio, video, and network communication. The base classes BaseTransport, BaseInputTransport, and BaseOutputTransport provide the foundation for specific transport implementations.
WebSocket Transport
References: src/pipecat/transports/network
In …/network, the WebSocket-based transport mechanism is a pivotal component for real-time data exchange within the Pipecat framework. The directory houses the implementation for establishing and managing WebSocket connections, which are essential for transmitting Pipecat frames between clients and servers.
FastAPI WebSocket Integration
References: src/pipecat/transports/network/fastapi_websocket.py
The FastAPIWebsocketOutputTransport class in …/fastapi_websocket.py is responsible for sending Pipecat frames over a WebSocket connection in real-time applications, serializing outgoing frames and writing them to the underlying FastAPI WebSocket.
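The Twilio chatbot example wires the transport up roughly as follows; the surrounding FastAPI route and stream_sid extraction are omitted here:

```python
from pipecat.serializers.twilio import TwilioFrameSerializer
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport,
)
from pipecat.vad.silero import SileroVADAnalyzer


def make_transport(websocket, stream_sid: str) -> FastAPIWebsocketTransport:
    # Wrap an already-accepted FastAPI WebSocket in a Pipecat transport whose
    # serializer converts frames to/from Twilio media-stream messages.
    return FastAPIWebsocketTransport(
        websocket=websocket,
        params=FastAPIWebsocketParams(
            audio_out_enabled=True,
            add_wav_header=False,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            vad_audio_passthrough=True,
            serializer=TwilioFrameSerializer(stream_sid),
        ),
    )
```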
Voice Activity Detection
References: src/pipecat/vad
The Voice Activity Detection (VAD) system in Pipecat is implemented using the Silero VAD model. The system is responsible for detecting when a user starts and stops speaking, which is crucial for processing audio input in real-time applications.
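VAD is typically enabled through transport parameters, as in the Daily-based examples; the room URL and token below are placeholders:

```python
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer

ROOM_URL = "https://example.daily.co/room"  # placeholder room

transport = DailyTransport(
    ROOM_URL,
    None,  # token; optional for public rooms
    "Voice bot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_enabled=True,
        # Silero classifies incoming audio so the pipeline can emit
        # user-started/stopped-speaking events for turn taking.
        vad_analyzer=SileroVADAnalyzer(),
    ),
)
```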
Serialization
References: src/pipecat/serializers
The FrameSerializer abstract base class in …/base_serializer.py defines the contract for serializing and deserializing Frame objects. Concrete implementations include the Twilio and Livekit serializers described below, as well as a protobuf-based serializer used by the WebSocket transports.
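A round-trip sketch with the protobuf serializer, assuming the synchronous serialize/deserialize methods of the version this page describes (newer releases may differ):

```python
from pipecat.frames.frames import TextFrame
from pipecat.serializers.protobuf import ProtobufFrameSerializer

serializer = ProtobufFrameSerializer()

# serialize() produces the wire form (protobuf bytes); deserialize()
# rebuilds an equivalent Frame object on the receiving side.
payload = serializer.serialize(TextFrame(text="hello"))
frame = serializer.deserialize(payload)
assert isinstance(frame, TextFrame) and frame.text == "hello"
```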
Frame Serialization
References: src/pipecat/serializers
For Livekit, frame serialization is handled by the LivekitFrameSerializer class in …/livekit.py. This serializer works specifically with AudioRawFrame objects, which are the only type defined in its SERIALIZABLE_TYPES attribute.
Twilio Integration
References: src/pipecat/serializers/twilio.py
The TwilioFrameSerializer class in …/twilio.py is tailored for the serialization and deserialization of frames in the context of Twilio's communication APIs. It specifically handles AudioRawFrame objects, converting them to and from the µ-law format required by Twilio, and also manages StartInterruptionFrame objects to facilitate clear signaling within the communication stream.
Livekit Integration
References: src/pipecat/serializers/livekit.py
The LivekitFrameSerializer class handles serialization and deserialization of AudioRawFrame objects for Livekit integration. This class is defined in …/livekit.py.
Utility Functions
References: src/pipecat/utils
The …/utils directory contains utility functions and classes for various tasks, including the time helpers described below.
Time Utilities
References: src/pipecat/utils/time.py
In …/time.py, a collection of utility functions facilitates the conversion and representation of time values across the Pipecat framework. These functions handle time-related data such as timestamps and duration conversions, a common requirement in real-time voice and multimodal conversational AI applications.
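For example, time_now_iso8601(), used when timestamping transcriptions, returns the current UTC time as an ISO 8601 string; treat the exact helper set as version-dependent:

```python
from pipecat.utils.time import time_now_iso8601

# ISO 8601 UTC timestamp, e.g. for TranscriptionFrame timestamps.
print(time_now_iso8601())
```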
Example Applications
References: examples
The examples directory showcases various applications built using the Pipecat framework:
Dial-in Chatbots
References: examples/dialin-chatbot/bot_twilio.py, examples/dialin-chatbot/bot_daily.py
Implementing dial-in chatbots with Pipecat involves the integration of transport layers such as Twilio and Daily, alongside AI services for language understanding and text-to-speech conversion. The chatbots are designed to provide voice-based interaction, allowing users to engage in conversations through phone calls.
Simple Chatbot
References: examples/simple-chatbot/bot.py
In the example provided by …/bot.py, the TalkingAnimation class enhances user interaction by visually representing the chatbot's speaking state. It activates a sequence of images to simulate speech when an AudioRawFrame is received and reverts to a static image upon receiving a TTSStoppedFrame.
Storytelling Chatbot
References: examples/storytelling-chatbot/src/bot.py
In the storytelling chatbot application found at …/bot.py, a combination of text-to-speech, image generation, and event handling is employed to craft interactive storytelling experiences. The application orchestrates these elements through a series of pipelines and processors, each dedicated to a specific aspect of the storytelling process.
Foundational Examples
References: examples/foundational/05-sync-speech-and-image.py, examples/foundational/05a-local-sync-speech-and-image.py, examples/foundational/06a-image-sync.py, examples/foundational/07b-interruptible-langchain.py, examples/foundational/11-sound-effects.py
In the foundational examples of the Pipecat framework, the …/05-sync-speech-and-image.py script showcases the synchronization of speech with images. It employs OpenAILLMService for generating text descriptions, ElevenLabsTTSService for text-to-speech, and FalImageGenService for image generation. The MonthFrame and MonthPrepender classes are pivotal in prepending month information to text frames, while the GatedAggregator ensures frames are queued until an image is available, synchronizing the output.
StudyPal Application
References: examples/studypal/studypal.py
In the StudyPal application, the DailyTransport class is leveraged to manage audio streams and transcriptions, while the SileroVADAnalyzer detects voice activity to discern when the user speaks. The application employs the CartesiaTTSService for converting text responses into speech, enhancing the interactive experience.
Interruptible ElevenLabs Example
References: examples/foundational/07d-interruptible-elevenlabs.py
In the …/07d-interruptible-elevenlabs.py example, the main() function orchestrates a WebRTC call with a suite of conversational AI features, wiring a VAD-enabled transport, ElevenLabs text-to-speech, and a language model service into a single pipeline that the user can interrupt mid-response.