RAVANA AGI

Multi-Modal Memory

Table of Contents

  1. Introduction
  2. Project Structure
  3. Core Components
  4. Architecture Overview
  5. Detailed Component Analysis
  6. Dependency Analysis
  7. Performance Considerations
  8. Troubleshooting Guide
  9. Conclusion

Introduction

The Multi-Modal Memory system is an advanced episodic memory module within the RAVANA project, designed to store, retrieve, and process information across multiple modalities including text, audio, image, and video. This system enhances the AGI's long-term memory capabilities by enabling semantic understanding and cross-modal retrieval. It integrates Whisper for audio transcription, PostgreSQL with pgvector for high-performance vector storage, and supports hybrid search modes combining vector similarity and full-text search. The system maintains backward compatibility with legacy ChromaDB-based storage while offering a scalable, robust foundation for multi-modal data persistence and retrieval.

Project Structure

The multi-modal memory system is organized into a modular structure within the modules/episodic_memory directory. The core components include API endpoints, data models, database operations, embedding generation, and specialized processors for different media types. The system is designed for extensibility and integration with the broader RAVANA framework.
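
An illustrative layout, assembled from the files referenced later in this document (module names beyond these are not enumerated here and are indicative only):

```
modules/episodic_memory/
├── memory.py            # FastAPI endpoints (legacy and multi-modal)
├── models.py            # Pydantic data models, including MemoryRecord
├── schema.sql           # PostgreSQL/pgvector schema definition
├── setup_database.py    # schema creation and ChromaDB migration
└── ...                  # store, embedding, audio, and search components
```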

Core Components

The Multi-Modal Memory system comprises several core components that work together to enable robust, scalable memory operations. The system is built on FastAPI for its RESTful interface, leveraging PostgreSQL with the pgvector extension for efficient vector similarity search. At its heart is the MultiModalMemoryService class, which orchestrates interactions between the database, embedding generation, audio processing, and search functionalities. The data model is defined using Pydantic, ensuring type safety and validation for all memory records and API requests. Audio processing is powered by Whisper, enabling transcription and feature extraction from spoken content. The system supports both legacy ChromaDB operations and the new PostgreSQL-based storage, ensuring backward compatibility during migration.

Key components include the PostgreSQLStore for database operations, the EmbeddingService for generating text, image, and audio embeddings, and the AdvancedSearchEngine for executing hybrid and cross-modal searches.
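
A minimal structural sketch of this orchestration, assuming constructor signatures and attribute names; the collaborator classes are stubbed so the snippet runs, and only the class names come from this document:

```python
class PostgreSQLStore:
    """Stub: pgvector-backed persistence (real implementation uses asyncpg)."""
    def __init__(self, dsn: str):
        self.dsn = dsn

class EmbeddingService:
    """Stub: generates text, image, and audio embeddings."""
    def __init__(self, model_name: str):
        self.model_name = model_name

class WhisperAudioProcessor:
    """Stub: Whisper-based transcription and feature extraction."""
    def __init__(self, model_size: str):
        self.model_size = model_size

class AdvancedSearchEngine:
    """Stub: hybrid and cross-modal search over the store."""
    def __init__(self, store, embeddings):
        self.store, self.embeddings = store, embeddings

class MultiModalMemoryService:
    """Central orchestrator wiring storage, embeddings, audio, and search."""
    def __init__(self, database_url: str,
                 text_model: str = "all-MiniLM-L6-v2",  # assumed default model
                 whisper_model: str = "base"):          # assumed Whisper size
        self.store = PostgreSQLStore(database_url)
        self.embeddings = EmbeddingService(text_model)
        self.audio = WhisperAudioProcessor(whisper_model)
        self.search = AdvancedSearchEngine(self.store, self.embeddings)
```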

Architecture Overview

The Multi-Modal Memory system follows a layered architecture with clear separation of concerns. The API layer, implemented with FastAPI, exposes endpoints for memory operations and search. The service layer, centered around MultiModalMemoryService, coordinates all business logic and integrates various components. The data layer uses PostgreSQL with pgvector for persistent storage of memory records and their embeddings. The processing layer handles the generation of embeddings for different modalities and the extraction of features from audio and image content.
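
The layering can be summarized as follows (component names as described in this document):

```
FastAPI endpoints            (API layer)
        │
MultiModalMemoryService      (service layer)
        │
EmbeddingService / Whisper   (processing layer)
        │
PostgreSQL + pgvector        (data layer)
```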

Detailed Component Analysis

MultiModalMemoryService Analysis

The MultiModalMemoryService is the central orchestrator of the multi-modal memory system. It integrates the PostgreSQL store, embedding service, Whisper audio processor, and search engine to provide a unified interface for memory operations. The service is initialized with a database URL and model configurations, and it manages the lifecycle of its components through initialize() and close() methods.
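
A hedged lifecycle example; initialize() and close() are documented above, while the import path and constructor argument name are assumptions:

```python
import asyncio
import os

# Hypothetical import path; adjust to the actual module location.
from modules.episodic_memory.memory_service import MultiModalMemoryService

async def main() -> None:
    service = MultiModalMemoryService(database_url=os.environ["POSTGRES_URL"])
    await service.initialize()   # open the connection pool, load models
    try:
        ...                      # save/search memory operations go here
    finally:
        await service.close()    # release connections and model resources

asyncio.run(main())
```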

Memory Record and Data Models

The data model for the multi-modal memory system is defined in models.py using Pydantic. The MemoryRecord class is the core data structure, capable of storing information from various modalities. It includes fields for text, audio, image, and video content, along with their respective metadata and embeddings. The model supports validation through Pydantic validators, ensuring data integrity.
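
A minimal sketch of such a model, assuming field and validator names; the actual MemoryRecord in models.py may differ:

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field, field_validator

class ContentType(str, Enum):
    TEXT = "text"
    AUDIO = "audio"
    IMAGE = "image"
    VIDEO = "video"

class MemoryRecord(BaseModel):
    id: Optional[str] = None
    content_type: ContentType = ContentType.TEXT
    text: Optional[str] = None
    file_path: Optional[str] = None            # source audio/image/video file
    metadata: dict = Field(default_factory=dict)
    embedding: Optional[list[float]] = None    # vector persisted via pgvector
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

    @field_validator("text")
    @classmethod
    def text_not_blank(cls, v: Optional[str]) -> Optional[str]:
        # Example validator: allow None, but reject empty/whitespace strings.
        if v is not None and not v.strip():
            raise ValueError("text must be non-empty when provided")
        return v
```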

API Endpoints and Request Flow

The API endpoints are implemented in memory.py using FastAPI. The system supports both legacy endpoints for backward compatibility and new endpoints for multi-modal operations. The request flow for processing an audio file involves uploading the file, transcribing it with Whisper, generating embeddings, and storing the memory record in PostgreSQL.
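
A hypothetical client-side call illustrating that flow; the route path and response shape are assumptions, not the documented API surface:

```python
import httpx

def upload_audio(path: str, base_url: str = "http://localhost:8000") -> dict:
    """Upload an audio file; the server transcribes, embeds, and stores it."""
    with open(path, "rb") as f:
        resp = httpx.post(
            f"{base_url}/memories/audio",              # assumed endpoint path
            files={"file": (path, f, "audio/wav")},
        )
    resp.raise_for_status()
    return resp.json()  # e.g. the stored record, including the transcript
```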

Database Schema and Migration

The database schema is defined in schema.sql and managed through setup_database.py. The system uses PostgreSQL with the pgvector extension to store memory records and their embeddings. The migration process allows for seamless transition from the legacy ChromaDB storage to the new PostgreSQL-based system, including data migration and schema creation.
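
A condensed sketch of what schema.sql and setup_database.py establish; the table and column names and the embedding dimension are assumptions:

```python
import asyncio
import asyncpg

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS memory_records (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content_type TEXT NOT NULL,
    text         TEXT,
    metadata     JSONB NOT NULL DEFAULT '{}',
    embedding    vector(384),      -- dimension depends on the text model
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- IVFFlat index for approximate nearest-neighbor search (see Performance).
CREATE INDEX IF NOT EXISTS memory_embedding_idx
    ON memory_records USING ivfflat (embedding vector_cosine_ops);
"""

async def setup(postgres_url: str) -> None:
    conn = await asyncpg.connect(postgres_url)
    try:
        await conn.execute(DDL)   # asyncpg runs multi-statement DDL as-is
    finally:
        await conn.close()

# asyncio.run(setup("postgresql://user:pass@localhost:5432/ravana"))
```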

Dependency Analysis

The Multi-Modal Memory system has a well-defined dependency structure. The core dependencies include FastAPI for the web framework, asyncpg for PostgreSQL connectivity, sentence-transformers for text embeddings, and faster-whisper for audio processing. The system also depends on pgvector for vector similarity search in PostgreSQL. The component dependencies are managed through Python's import system, with clear interfaces between modules.
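
In pip terms, the core dependencies named above correspond to a requirements list along these lines (version pins omitted):

```
fastapi
asyncpg
pgvector
pydantic
sentence-transformers
faster-whisper
```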

Performance Considerations

The Multi-Modal Memory system incorporates several performance optimizations. The EmbeddingService includes an in-memory cache to avoid recomputation of text embeddings, significantly improving response times for repeated queries. The PostgreSQL database uses IVFFlat indexes for efficient vector similarity search. Connection pooling is implemented to handle concurrent requests efficiently. Audio processing is optimized by resampling to 16kHz and limiting maximum file duration. The system also supports batch processing of files with configurable parallelism to maximize throughput.
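
A minimal sketch of the text-embedding cache described above, assuming a hash-keyed dictionary; the real EmbeddingService may key or evict entries differently:

```python
import hashlib

class CachedTextEmbedder:
    """Memoizes embeddings so repeated queries skip model inference."""

    def __init__(self, model):
        self._model = model                         # e.g. a SentenceTransformer
        self._cache: dict[str, list[float]] = {}

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:                  # compute only on cache miss
            self._cache[key] = self._model.encode(text).tolist()
        return self._cache[key]
```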

Troubleshooting Guide

Common issues with the Multi-Modal Memory system fall into a few categories:

  - Database connectivity: ensure PostgreSQL is running, the pgvector extension is installed, and the POSTGRES_URL environment variable is correctly set.
  - Audio processing failures: check that the Whisper model has been downloaded and that the audio file format is supported.
  - Migration from ChromaDB: if migration fails, ensure the legacy ChromaDB directory exists and is accessible.
  - Diagnostics: set LOG_LEVEL=DEBUG for more detailed error messages, and monitor system health via the /health endpoint, which reports database connectivity and service status (see the snippet below).
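
A quick connectivity check against the documented /health endpoint (the response shape is an assumption):

```python
import httpx

resp = httpx.get("http://localhost:8000/health", timeout=5.0)
print(resp.status_code, resp.json())  # expect database and service status fields
```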

Conclusion

The Multi-Modal Memory system represents a significant advancement in the RAVANA project's memory capabilities. By supporting multiple data modalities and leveraging state-of-the-art technologies like Whisper and pgvector, it enables rich, context-aware memory storage and retrieval. The system's modular architecture ensures maintainability and extensibility, while its backward compatibility facilitates smooth integration with existing components. The comprehensive API and client library make it easy to incorporate multi-modal memory operations into various applications. With its robust performance optimizations and detailed troubleshooting guidance, the system is well-positioned to serve as a foundational component for advanced AGI applications.
