Multi-modal Service API
Table of Contents
- Introduction
- Core Components
- Architecture Overview
- Detailed Component Analysis
- Error Handling
- Performance Considerations
- Extensibility and Action System
- Usage Examples
Introduction
The Multi-modal Service API is a core component of the RAVANA AGI system, designed to process and analyze content across multiple modalities including images, audio, and cross-modal data. This service enables the system to extract meaningful insights from diverse input types, perform detailed analysis, and generate comprehensive summaries. The API integrates with external LLM providers, particularly Google's Gemini, to leverage advanced multimodal reasoning capabilities. This documentation provides a comprehensive overview of the API's functionality, architecture, error handling, performance characteristics, and extensibility patterns.
Core Components
The Multi-modal Service API consists of several key components that work together to provide multimodal processing capabilities. The primary component is the MultiModalService class, which handles the core processing logic for images and audio files. This service integrates with the LLM system to leverage external AI models for content analysis. The action system provides a framework for executing multimodal operations as discrete actions within the broader AGI system. The service supports various file formats and provides robust error handling for common issues such as unsupported formats and file access problems.
Architecture Overview
The Multi-modal Service API follows a layered architecture that separates concerns between service logic, action execution, and LLM integration. The service acts as the primary interface for multimodal processing, delegating specific operations to specialized functions. It integrates with the enhanced action manager, which provides additional capabilities such as caching, parallel execution, and knowledge base integration. The architecture leverages asynchronous programming patterns to handle I/O operations efficiently, particularly when communicating with external LLM APIs. The system uses a registry pattern to manage available actions and their execution.
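A minimal sketch of this registry pattern is shown below. The register_action() method matches the call used later in register_enhanced_actions(); the dictionary-backed storage and the get_action() lookup are illustrative assumptions, not the project's actual implementation.

class ActionRegistry:
    """Illustrative registry: maps action names to Action instances."""

    def __init__(self):
        self._actions = {}

    def register_action(self, action) -> None:
        # Store each action under its declared name so the manager can look it up at execution time
        self._actions[action.name] = action

    def get_action(self, name):
        return self._actions[name]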
Detailed Component Analysis
MultiModalService Class Analysis
The MultiModalService class is the central component of the multimodal processing system, providing methods for handling various types of media content. It manages supported file formats, temporary file storage, and coordinates with external LLM services for content analysis.
Method Analysis
process_image() Method
The process_image() method handles the processing of image files, supporting formats including JPG, JPEG, PNG, GIF, BMP, and WebP. It validates the input file, checks format support, and uses the Gemini API to generate a detailed description of the image content.
async def process_image(self, image_path: str, prompt: str = "Analyze this image in detail") -> Dict[str, Any]:
    try:
        if not os.path.exists(image_path):
            raise FileNotFoundError(f"Image file not found: {image_path}")

        file_ext = Path(image_path).suffix.lower()
        if file_ext not in self.supported_image_formats:
            raise ValueError(f"Unsupported image format: {file_ext}")

        # Use Gemini for image captioning
        loop = asyncio.get_event_loop()
        description = await loop.run_in_executor(
            None,
            call_gemini_image_caption,
            image_path,
            prompt
        )

        # Extract metadata
        file_size = os.path.getsize(image_path)

        result = {
            "type": "image",
            "path": image_path,
            "format": file_ext,
            "size_bytes": file_size,
            "description": description,
            "analysis_prompt": prompt,
            "success": True,
            "error": None
        }

        logger.info(f"Successfully processed image: {image_path}")
        return result

    except Exception as e:
        # Completion of this excerpt: return a structured error response mirroring the
        # success-path fields, following the pattern described in the Error Handling section
        logger.error(f"Error processing image {image_path}: {e}")
        return {
            "type": "image",
            "path": image_path,
            "success": False,
            "error": str(e)
        }
process_audio() Method
The process_audio() method (the service's audio transcription and analysis entry point) handles audio file processing, supporting formats including MP3, WAV, M4A, OGG, and FLAC. It performs the same validation steps as the image processor and uses the Gemini API to analyze and describe audio content.
async def process_audio(self, audio_path: str, prompt: str = "Describe and analyze this audio") -> Dict[str, Any]:
    try:
        if not os.path.exists(audio_path):
            raise FileNotFoundError(f"Audio file not found: {audio_path}")

        file_ext = Path(audio_path).suffix.lower()
        if file_ext not in self.supported_audio_formats:
            raise ValueError(f"Unsupported audio format: {file_ext}")

        # Use Gemini for audio description
        loop = asyncio.get_event_loop()
        description = await loop.run_in_executor(
            None,
            call_gemini_audio_description,
            audio_path,
            prompt
        )

        # Extract metadata
        file_size = os.path.getsize(audio_path)

        result = {
            "type": "audio",
            "path": audio_path,
            "format": file_ext,
            "size_bytes": file_size,
            "description": description,
            "analysis_prompt": prompt,
            "success": True,
            "error": None
        }

        logger.info(f"Successfully processed audio: {audio_path}")
        return result

    except Exception as e:
        # Completion of this excerpt: structured error response, as described in Error Handling
        logger.error(f"Error processing audio {audio_path}: {e}")
        return {
            "type": "audio",
            "path": audio_path,
            "success": False,
            "error": str(e)
        }
cross_modal_analysis() Method
The cross_modal_analysis() method enables cross-modal reasoning by analyzing multiple content items of different types. It extracts descriptions from successfully processed content and creates a comprehensive analysis prompt for the LLM, which identifies themes, relationships, and insights across modalities.
async def cross_modal_analysis(self, content_list: List[Dict[str, Any]], analysis_prompt: str = None) -> Dict[str, Any]:
    try:
        if not content_list:
            raise ValueError("No content provided for cross-modal analysis")

        # Prepare content descriptions
        descriptions = []
        content_types = []
        for content in content_list:
            if content.get('success', False):
                descriptions.append(content.get('description', ''))
                content_types.append(content.get('type', 'unknown'))

        if not descriptions:
            raise ValueError("No successfully processed content for analysis")

        # Create analysis prompt
        if not analysis_prompt:
            analysis_prompt = f"""
            Perform a comprehensive cross-modal analysis of the following content:

            Content types: {', '.join(set(content_types))}

            Content descriptions:
            {chr(10).join([f"{i+1}. {desc}" for i, desc in enumerate(descriptions)])}

            Please provide:
            1. Common themes and patterns across all content
            2. Relationships and connections between different modalities
            3. Insights that emerge from combining these different types of information
            4. Potential applications or implications
            5. Any contradictions or interesting contrasts
            """

        # Use LLM for cross-modal analysis
        loop = asyncio.get_event_loop()
        from core.llm import safe_call_llm
        analysis = await loop.run_in_executor(
            None,
            safe_call_llm,
            analysis_prompt
        )

        result = {
            "type": "cross_modal_analysis",
            "content_types": content_types,
            "num_items": len(content_list),
            "analysis": analysis,
            "success": True,
            "error": None
        }

        logger.info(f"Successfully performed cross-modal analysis on {len(content_list)} items")
        return result

    except Exception as e:
        # Completion of this excerpt: structured error response, as described in Error Handling
        logger.error(f"Error in cross-modal analysis: {e}")
        return {
            "type": "cross_modal_analysis",
            "success": False,
            "error": str(e)
        }
LLM Integration Analysis
The multimodal service integrates with LLM providers through the core LLM module, primarily leveraging Google's Gemini API for image and audio analysis. The integration uses the genai client library to upload files and generate content based on prompts.
LLM Function Integration
The system uses specific functions in the LLM module to handle different modalities:
- call_gemini_image_caption(): Processes image files by uploading them to Gemini and requesting a caption based on the provided prompt
- call_gemini_audio_description(): Processes audio files similarly, requesting a description of the audio content
- safe_call_llm(): Provides a fallback mechanism that tries multiple LLM providers before defaulting to Gemini
The integration with Gemini uses the files.upload() method to send media files to the API, which are then processed by the multimodal model (gemini-2.0-flash). This approach allows the system to leverage state-of-the-art multimodal understanding capabilities without requiring local model hosting.
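The fallback behaviour of safe_call_llm() can be illustrated with a minimal sketch. The function name is suffixed with _sketch and the providers argument is a hypothetical stand-in; the real safe_call_llm wires in concrete provider clients and ends with Gemini as the default.

def safe_call_llm_sketch(prompt, providers):
    """Illustrative fallback loop: try each provider in order, return the first success.

    `providers` is a list of callables that take a prompt string and return text.
    """
    last_error = None
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as e:
            last_error = e  # remember the failure and fall through to the next provider
    return f"[All LLM providers failed: {last_error}]"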
Embedding Alignment Across Modalities
The system aligns embeddings across modalities through the cross-modal analysis process. When multiple content items are analyzed together, their textual descriptions (generated by the LLM) serve as aligned representations that can be compared and analyzed. The cross_modal_analysis() method creates a unified context by combining descriptions from different modalities into a single prompt, allowing the LLM to identify relationships and patterns across the different types of content.
The alignment process works as follows:
- Each modality (image, audio) is processed independently to generate a textual description
- These textual descriptions act as aligned representations in a shared semantic space
- The LLM performs cross-modal reasoning by analyzing these textual representations together
- The resulting analysis identifies connections and insights that emerge from the combination of different modalities
This approach effectively creates a common representation space where different modalities can be compared and analyzed together, enabling true multimodal reasoning.
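A short end-to-end sketch of this flow, using the service methods documented above (the file paths are illustrative):

service = MultiModalService()

# Each modality is processed independently into a textual description
image_result = await service.process_image("assets/chart.png")
audio_result = await service.process_audio("assets/briefing.mp3")

# The descriptions are then combined into a single prompt for cross-modal reasoning
analysis = await service.cross_modal_analysis([image_result, audio_result])
print(analysis["analysis"])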
Error Handling
The Multi-modal Service API implements comprehensive error handling to manage various failure scenarios gracefully. The system uses try-except blocks to catch exceptions and return structured error responses that maintain API consistency.
Unsupported Formats
When encountering unsupported file formats, the service raises a ValueError with a descriptive message:
file_ext = Path(image_path).suffix.lower()
if file_ext not in self.supported_image_formats:
    raise ValueError(f"Unsupported image format: {file_ext}")
The supported image formats are: .jpg, .jpeg, .png, .gif, .bmp, .webp
The supported audio formats are: .mp3, .wav, .m4a, .ogg, .flac
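These checks rely on two extension sets held by the service. A plausible initialization is sketched below; it is inferred from the attribute names used in process_image() and process_audio(), not copied from the source.

class MultiModalService:
    def __init__(self):
        # Extension sets consulted by process_image() and process_audio()
        self.supported_image_formats = {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp'}
        self.supported_audio_formats = {'.mp3', '.wav', '.m4a', '.ogg', '.flac'}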
File Access Errors
The service checks for file existence before processing and raises a FileNotFoundError if the file cannot be found:
if not os.path.exists(image_path):
    raise FileNotFoundError(f"Image file not found: {image_path}")
Model Loading and API Failures
The LLM integration functions include error handling for API failures:
def call_gemini_image_caption(image_path, prompt="Caption this image."):
    try:
        client = genai.Client(api_key=GEMINI_API_KEY)
        my_file = client.files.upload(file=image_path)
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[my_file, prompt],
        )
        return response.text
    except Exception as e:
        return f"[Gemini image captioning failed: {e}]"
Synchronization Issues
The service uses asyncio and thread pooling to handle synchronization between the asynchronous service layer and the synchronous LLM API calls:
loop = asyncio.get_event_loop()
description = await loop.run_in_executor(
    None,
    call_gemini_image_caption,
    image_path,
    prompt
)
This pattern prevents blocking the event loop while waiting for external API responses, maintaining system responsiveness.
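Because each blocking Gemini call is offloaded to the default thread-pool executor, several items can be awaited concurrently. A minimal sketch of this idea follows (the helper name and file paths are illustrative); in practice the EnhancedActionManager bounds such concurrency with a semaphore, as described under Performance Considerations.

import asyncio

async def process_batch(service, image_paths):
    # Each process_image() call runs its Gemini request in the executor,
    # so gather() lets the requests overlap instead of running back to back
    tasks = [service.process_image(path) for path in image_paths]
    return await asyncio.gather(*tasks, return_exceptions=True)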
Performance Considerations
The Multi-modal Service API includes several performance optimizations to handle processing efficiently, particularly for batch operations and resource utilization.
Batch Processing
The process_directory() method enables batch processing of multiple files in a directory:
async def process_directory(self, directory_path: str, recursive: bool = False) -> List[Dict[str, Any]]:
    directory = Path(directory_path)
    results = []

    # Get all files
    if recursive:
        files = list(directory.rglob("*"))
    else:
        files = list(directory.iterdir())

    # Filter for supported files
    supported_files = []
    for file_path in files:
        if file_path.is_file():
            ext = file_path.suffix.lower()
            if ext in self.supported_image_formats or ext in self.supported_audio_formats:
                supported_files.append(file_path)

    # Process each file
    for file_path in supported_files:
        try:
            ext = file_path.suffix.lower()
            if ext in self.supported_image_formats:
                result = await self.process_image(str(file_path))
            elif ext in self.supported_audio_formats:
                result = await self.process_audio(str(file_path))
            else:
                continue
            results.append(result)
        except Exception as e:
            logger.warning(f"Failed to process file {file_path}: {e}")
            results.append({
                "type": "unknown",
                "path": str(file_path),
                "success": False,
                "error": str(e)
            })

    return results
Parallel Execution
The EnhancedActionManager provides parallel execution capabilities with a concurrency limit: the system uses a semaphore with a default parallel limit of 3 to prevent overwhelming external APIs or system resources.
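A minimal sketch of this semaphore pattern, assuming the default limit of 3; the class and method names below are illustrative rather than the manager's exact API.

import asyncio

class ParallelLimiter:
    """Illustrative concurrency guard mirroring the manager's default limit of 3."""

    def __init__(self, parallel_limit: int = 3):
        self.semaphore = asyncio.Semaphore(parallel_limit)

    async def run_limited(self, coro):
        # At most `parallel_limit` coroutines hold the semaphore at once,
        # so external APIs never see more than that many concurrent requests
        async with self.semaphore:
            return await coro

async def execute_parallel(limiter, coros):
    return await asyncio.gather(*(limiter.run_limited(c) for c in coros))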
GPU Utilization
While the current implementation relies on external LLM APIs (Gemini) rather than local model inference, the architecture is designed to accommodate local GPU utilization if needed. The use of loop.run_in_executor() allows computationally intensive operations to be offloaded to separate threads, which could be extended to manage GPU-accelerated processing in the future.
Caching
The EnhancedActionManager implements result caching to avoid redundant processing:
async def execute_action_enhanced(self, decision: dict) -> Any:
    # Excerpt: extraction of action_name and params from the decision dict is omitted here

    # Check cache for repeated actions
    non_cacheable = {'log_message', 'get_current_time', 'generate_random'}
    cache_key = f"{action_name}_{hash(str(params))}"
    if action_name not in non_cacheable and cache_key in self.action_cache:
        logger.info(f"Using cached result for action: {action_name}")
        return self.action_cache[cache_key]

    # Execute action with timeout
    result = await asyncio.wait_for(
        self.execute_action(decision),
        timeout=300  # 5 minute timeout
    )

    # Cache successful results
    if (action_name not in non_cacheable and
            result and not isinstance(result, Exception) and
            not str(result).startswith("Error")):
        self.action_cache[cache_key] = result

    return result
The cache is automatically managed and cleared when it exceeds a configurable size limit.
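One simple way such size-based clearing can work is sketched below; the max_cache_size attribute and the evict-oldest-half strategy are assumptions for illustration, not the manager's actual logic.

def _trim_cache(self):
    # Drop the oldest half of the cache once it grows past the configured limit
    # (dicts preserve insertion order in Python 3.7+, so the first keys are the oldest)
    if len(self.action_cache) > self.max_cache_size:
        for key in list(self.action_cache.keys())[: len(self.action_cache) // 2]:
            del self.action_cache[key]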
Extensibility and Action System
The Multi-modal Service API is designed to be extensible through the action system, which follows a consistent pattern for adding new capabilities.
MultiModalAction Base Class Pattern
The system uses the Action base class (from core.actions.action) as the foundation for all action types:
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class Action(ABC):
    def __init__(self, system: 'AGISystem', data_service: 'DataService'):
        self.system = system
        self.data_service = data_service

    @property
    @abstractmethod
    def name(self) -> str:
        pass

    @property
    @abstractmethod
    def description(self) -> str:
        pass

    @property
    @abstractmethod
    def parameters(self) -> List[Dict[str, Any]]:
        pass

    @abstractmethod
    async def execute(self, **kwargs: Any) -> Any:
        pass
Example Action Implementation
The ProcessImageAction class demonstrates the pattern for extending capabilities:
class ProcessImageAction(Action):
    @property
    def name(self) -> str:
        return "process_image"

    @property
    def description(self) -> str:
        return "Process and analyze an image file."

    @property
    def parameters(self) -> List[Dict[str, Any]]:
        return [
            {"name": "image_path", "type": "str", "description": "Path to the image file.", "required": True},
            {"name": "analysis_prompt", "type": "str", "description": "Analysis prompt (optional).", "required": False}
        ]

    async def execute(self, image_path: str, analysis_prompt: str = None) -> Any:
        return await self.system.action_manager.process_image_action(image_path, analysis_prompt)
To extend the system with new capabilities:
- Create a new class that inherits from Action
- Implement the required properties (name, description, parameters)
- Implement the execute() method to perform the desired operation
- Register the action with the action registry (a hypothetical example follows the registration code below)
The EnhancedActionManager automatically registers multimodal actions during initialization:
def register_enhanced_actions(self):
    """Register new multi-modal actions as Action instances."""
    self.action_registry.register_action(ProcessImageAction(self.system, self.data_service))
    self.action_registry.register_action(ProcessAudioAction(self.system, self.data_service))
    self.action_registry.register_action(AnalyzeDirectoryAction(self.system, self.data_service))
    self.action_registry.register_action(CrossModalAnalysisAction(self.system, self.data_service))
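As a concrete illustration of this extension pattern, the following sketch defines a hypothetical ProcessVideoAction and shows where it would be registered. The action name, its parameters, and the process_video_action helper it delegates to are invented for this example and do not exist in the codebase.

class ProcessVideoAction(Action):
    """Hypothetical action illustrating the extension pattern; not part of the codebase."""

    @property
    def name(self) -> str:
        return "process_video"

    @property
    def description(self) -> str:
        return "Process and analyze a video file."

    @property
    def parameters(self) -> List[Dict[str, Any]]:
        return [
            {"name": "video_path", "type": "str", "description": "Path to the video file.", "required": True}
        ]

    async def execute(self, video_path: str) -> Any:
        # Delegates to an assumed manager helper, mirroring ProcessImageAction above
        return await self.system.action_manager.process_video_action(video_path)

# Registration follows the same pattern used in register_enhanced_actions():
# self.action_registry.register_action(ProcessVideoAction(self.system, self.data_service))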
Usage Examples
Analyzing Screenshots
To analyze a screenshot and generate a detailed description:
# Initialize the service
service = MultiModalService()

# Process a screenshot
result = await service.process_image(
    image_path="screenshots/dashboard.png",
    prompt="Analyze this dashboard interface, describing all visible elements, data visualizations, and potential usability issues"
)
print(result["description"])
This would return a detailed analysis of the dashboard, including descriptions of charts, tables, navigation elements, and any potential issues identified.
Processing Voice Notes
To process a voice note from a meeting:
# Process an audio recording
result = await service.process_audio(
    audio_path="recordings/meeting_2023-12-01.mp3",
    prompt="Transcribe and summarize this meeting, identifying key decisions, action items, and important discussions"
)
print(result["description"])
The system would return a transcription and summary of the meeting content, highlighting important information.
Generating Image Descriptions
To generate descriptions for multiple images in a directory:
# Process all images in a directory
results = await service.process_directory(
    directory_path="project_images",
    recursive=True
)

# Generate a comprehensive summary
summary = await service.generate_content_summary(results)
print(summary)
This would process all supported image and audio files in the directory and its subdirectories, then generate a summary that includes individual descriptions and cross-modal insights if multiple content types are present.
Cross-Modal Analysis
To perform analysis across different types of content:
# Process multiple files
content_paths = [
    "reports/quarterly_financials.pdf",
    "presentations/product_launch.pptx",
    "recordings/executive_interview.mp3"
]

# Perform cross-modal analysis
analysis_result = await service.cross_modal_analysis_action(
    content_paths=content_paths,
    analysis_prompt="Analyze these documents together to identify strategic implications for the company's product direction"
)
print(analysis_result["cross_modal_analysis"]["analysis"])
This would process each file and perform a comprehensive analysis that identifies connections and insights across the different modalities.