Fallback and Resilience Handling
Table of Contents
- Introduction
- LLM Fallback Mechanism
- Retry Logic and Backoff Strategies
- Exception Handling and Error Classification
- Provider Health Checking and Monitoring
- Configuration Options for Customization
- Code Examples of Failure Handling
- Designing Fault-Tolerant Workflows
- Conclusion
Introduction
This document provides a comprehensive overview of the fallback and resilience mechanisms implemented in the RAVANA system for handling LLM (Large Language Model) service disruptions. The system is designed to maintain operational continuity during timeouts, rate limits, and service outages by intelligently switching between multiple LLM providers. It employs robust retry logic, exponential backoff strategies, and health-checking routines to ensure high availability and reliability. This documentation details the architecture, implementation, and configuration of these resilience features, enabling developers to understand, customize, and extend the system's fault tolerance capabilities.
LLM Fallback Mechanism
The core of the system's resilience lies in its multi-provider fallback strategy, implemented in the `call_llm` function within `core/llm.py`. When an LLM request is made, the system attempts to fulfill it through a prioritized sequence of providers before resorting to a final fallback.
The primary providers are:
- Zuki
- ElectronHub
- Zanity
- A4F
If all primary providers fail, the system automatically falls back to Gemini, which uses a dedicated API key (`GEMINI_API_KEY`) for reliability. This fallback is unconditional and ensures that a response is always generated, even if degraded.

A `preferred_provider` parameter can be passed to `call_llm`, which reorders the provider list to prioritize the specified provider. This allows for dynamic routing based on performance, cost, or feature requirements.
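The pattern can be illustrated with a minimal sketch. The helper names below (`call_zuki`, `call_electronhub`, `call_zanity`, `call_a4f`, `call_gemini`) mirror the providers listed above but are assumptions for illustration; the actual signatures and provider-list structure in `core/llm.py` may differ.

```python
# Minimal sketch of the multi-provider fallback order described above.
def call_llm_sketch(prompt, preferred_provider=None):
    providers = [
        ("zuki", call_zuki),
        ("electronhub", call_electronhub),
        ("zanity", call_zanity),
        ("a4f", call_a4f),
    ]
    # A preferred_provider reorders the list so that provider is tried first.
    if preferred_provider:
        providers.sort(key=lambda p: p[0] != preferred_provider)

    for name, provider_fn in providers:
        result = provider_fn(prompt)
        if result:  # provider helpers return None on failure (see Example 3 below)
            return result

    # Unconditional final fallback: Gemini with its dedicated GEMINI_API_KEY.
    return call_gemini(prompt)
```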
Retry Logic and Backoff Strategies
The system implements a sophisticated retry mechanism through the `safe_call_llm` function, which wraps all LLM calls with configurable retry logic and exponential backoff.
Key parameters:
- retries: Number of retry attempts (default: 3)
- timeout: Timeout for each LLM call (default: 30 seconds)
- backoff_factor: Base delay multiplier (default: 0.5 seconds)
The backoff strategy uses exponential delay: `wait = backoff_factor * (2 ^ (attempt - 1))`. For example:
- Attempt 1: 0.5 * 2^0 = 0.5 seconds
- Attempt 2: 0.5 * 2^1 = 1.0 seconds
- Attempt 3: 0.5 * 2^2 = 2.0 seconds
The function also validates responses, raising an error for empty responses. After exhausting all retries, it returns a formatted error message containing the last exception.
An asynchronous version, `async_safe_call_llm`, is provided for non-blocking operations; it uses a thread pool executor to run the synchronous `safe_call_llm`.
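As a rough illustration of how these pieces fit together, the following sketch wraps a call with the documented defaults. It assumes the `call_llm` fallback caller described above and an illustrative error-message format, so the exact behavior in `core/llm.py` may differ.

```python
import time
import logging

logger = logging.getLogger(__name__)

def safe_call_llm_sketch(prompt, retries=3, timeout=30, backoff_factor=0.5, **kwargs):
    """Sketch of retry-with-exponential-backoff around the multi-provider caller."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = call_llm(prompt, timeout=timeout, **kwargs)  # assumed signature
            if not response or not response.strip():
                raise ValueError("Empty response from LLM")  # empty responses are errors
            return response
        except Exception as e:
            last_error = e
            wait = backoff_factor * (2 ** (attempt - 1))  # 0.5s, 1.0s, 2.0s, ...
            logger.warning("LLM call failed (attempt %d/%d): %s", attempt, retries, e)
            if attempt < retries:
                time.sleep(wait)
    # Retries exhausted: return a formatted error message containing the last exception.
    return f"[LLM Error: {last_error}]"
```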
Exception Handling and Error Classification
The system employs a layered approach to exception handling, with specific mechanisms for different types of LLM failures.
Base Exception Hierarchy
The core exception classes are defined in `core/actions/exceptions.py`:
```python
class ActionError(Exception):
    """Base exception for action-related errors."""
    pass

class InvalidActionError(ActionError):
    """Raised when an action is not found in the registry."""
    pass

class InvalidActionParams(ActionError):
    """Raised when the parameters for an action are invalid."""
    pass

class ActionException(Exception):
    """Custom exception for action execution errors."""
    pass
```
While no dedicated LLM-specific exception classes exist, the system uses the base `Exception` class and string-based error messages to classify failures.
JSON Parsing and Response Validation
The `extract_decision` function handles JSON parsing errors that occur when an LLM returns malformed JSON:
- Catches `json.JSONDecodeError`
- Logs the parsing error
- Returns a structured fallback response with default values
- Sets the action to `log_message` with an error description
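A minimal sketch of this behavior, assuming the fallback structure shown in Example 1 below; the real `extract_decision` likely parses and validates more fields:

```python
import json
import logging

logger = logging.getLogger(__name__)

def extract_decision_sketch(raw_response):
    """Sketch of malformed-JSON handling with a structured fallback."""
    try:
        return json.loads(raw_response)
    except json.JSONDecodeError as e:
        logger.error("Failed to parse LLM decision as JSON: %s", e)
        # Structured fallback with default values, routed to a log_message action.
        return {
            "raw_response": raw_response,
            "analysis": f"Could not parse LLM response as JSON: {e}",
            "action": "log_message",
            "params": {"message": f"Decision parsing failed: {e}"},
            "confidence": 0.0,
        }
```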
Lazy Response Detection
The `is_lazy_llm_response` function detects low-quality or non-actionable responses by checking for common phrases like "I cannot", "I apologize", or "[insert code here]". This allows the system to identify when an LLM is avoiding a task and potentially trigger a retry with a different provider or prompt.
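A simplified sketch of the check; the actual phrase list in `core/llm.py` is longer than shown here:

```python
# Phrases that typically indicate the model is declining or deferring the task.
LAZY_PHRASES = ("i cannot", "i apologize", "[insert code here]")

def is_lazy_llm_response_sketch(response):
    if not response:
        return True
    lowered = response.lower()
    return any(phrase in lowered for phrase in LAZY_PHRASES)
```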
Provider Health Checking and Monitoring
The system includes both automated and manual mechanisms for monitoring LLM provider health.
Automated Health Checks
The `send_chat_message` function implements implicit health checking:
- Uses `random.choice(PROVIDERS)` to distribute load
- Attempts to send a message via the selected provider
- Logs success or failure with detailed error messages
- Specifically checks for 404 errors from Zanity, logging a targeted warning
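A rough sketch of this flow follows; the `call_provider` dispatch helper and the provider `"name"` field are assumptions for illustration only:

```python
import random
import logging

logger = logging.getLogger(__name__)

def send_chat_message_sketch(message):
    """Sketch of implicit health checking via random provider selection."""
    provider = random.choice(PROVIDERS)  # spreads load across providers
    try:
        reply = call_provider(provider, message)  # hypothetical dispatch helper
        logger.info("Provider %s responded successfully", provider["name"])
        return reply
    except Exception as e:
        if provider["name"] == "zanity" and "404" in str(e):
            logger.warning("Zanity returned 404; its endpoint may be down or moved")
        else:
            logger.error("Provider %s failed: %s", provider["name"], e)
        return None
```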
Manual Testing Function
The `test_all_providers` function serves as a comprehensive health check utility:
- Tests each provider with a simple prompt ("What is the capital of France?")
- Tests the Gemini fallback
- Tests advanced Gemini features (search)
- Outputs results directly to the console
This function can be run manually to verify the operational status of all LLM integrations.
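For example, assuming the function lives in `core/llm.py` and takes no arguments, it could be invoked from a shell or a small script:

```python
from core.llm import test_all_providers

# Prints the result for each provider, the Gemini fallback, and Gemini search.
test_all_providers()
```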
Configuration Options for Customization
The system provides several configuration options to customize the fallback and resilience behavior.
Provider Configuration
The `PROVIDERS` list in `core/llm.py` defines the available LLM providers with their API keys, base URLs, and supported models. This can be modified to add, remove, or reconfigure providers.
Environment Variables
API keys are loaded using `os.getenv()` with fallbacks to hardcoded values:

```python
"api_key": os.getenv("A4F_API_KEY", "ddc-a4f-..."),
"api_key": os.getenv("ZUKIJOURNEY_API_KEY", "zu-ab9fba..."),
"api_key": os.getenv("ELECTRONHUB_API_KEY", "ek-nzrvzze..."),
"api_key": os.getenv("ZANITY_API_KEY", "vc-b1EbB_B...")
```
Decision Maker Configuration
The `config.json` file contains settings for the main decision maker:

```json
"main_llm_decision_maker": {
    "enabled": true,
    "preferred_model": "gpt-4o",
    "chain_of_thought": true,
    "rag_enabled": true
}
```

- `preferred_model`: Specifies the default model to use
- `chain_of_thought`: Enables step-by-step reasoning
- `rag_enabled`: Controls retrieval-augmented generation
Retry and Timeout Customization
The `safe_call_llm` function accepts parameters that can be overridden:
- `timeout`: Adjust the per-call timeout
- `retries`: Change the number of retry attempts
- `backoff_factor`: Modify the base delay for exponential backoff

These can be passed directly when calling `safe_call_llm` or through functions that use it (like `decision_maker_loop`).
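An illustrative call with stricter resilience settings for a critical task; the parameter names match those documented above, and the resulting waits follow the backoff formula:

```python
prompt = "Summarize the latest system status report."

response = safe_call_llm(
    prompt,
    retries=5,          # more attempts than the default of 3
    timeout=60,         # allow slower providers more time per call
    backoff_factor=1.0, # waits of 1s, 2s, 4s, 8s between attempts
)
```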
Code Examples of Failure Handling
This section provides concrete examples of how failed LLM requests are caught and redirected.
Example 1: Decision Making with Fallback
```python
def decision_maker_loop(situation, memory=None, mood=None, model=None):
    try:
        # safe_call_llm handles retries and provider fallback
        raw_response = safe_call_llm(prompt, model=model, retries=3)
        decision_data = extract_decision(raw_response)
        return decision_data
    except Exception as e:
        # Final fallback response structure
        return {
            "raw_response": f"[Error: {e}]",
            "analysis": f"Failed to make decision due to error: {e}",
            "plan": [{"action": "log_message", "params": {"message": f"Decision making failed: {e}"}}],
            "action": "log_message",
            "params": {"message": f"Decision making failed: {e}"},
            "confidence": 0.0,
            "error": str(e)
        }
```
Example 2: Experimentation Engine Resilience
```python
def agi_experimentation_engine(experiment_idea, llm_model=None):
    # Step 1: Refine idea with LLM.
    # If this fails, call_llm will try all providers and fall back to Gemini.
    refined_idea = call_llm(refine_prompt, model=llm_model)

    # Step 2: Determine simulation type (another resilient call)
    simulation_type = call_llm(sim_type_prompt, model=llm_model)

    # Step 7: Online validation with explicit try-except and a fallback message
    try:
        online_validation_result = call_gemini_with_search_sync(web_prompt)
    except Exception as e:
        online_validation_result = f"[Online validation failed: {e}]"
```
Example 3: Direct Provider Call with Error Handling
```python
def call_zuki(prompt, model=None):
    try:
        # Direct API call
        r = requests.post(url, headers=headers, json=data, timeout=20)
        r.raise_for_status()
        return r.json()['choices'][0]['message']['content']
    except Exception as e:
        # Return None on any error, which triggers fallback to the next provider
        return None
```
Designing Fault-Tolerant Workflows
To design workflows that maintain operation during partial outages, follow these best practices:
1. Use the Safe Call Wrapper
Always use `safe_call_llm` instead of direct provider calls to benefit from built-in retries and fallbacks.
2. Implement Response Quality Checks
Use `is_lazy_llm_response` to detect low-quality responses and trigger retries or alternative actions:

```python
response = safe_call_llm(prompt)
if is_lazy_llm_response(response):
    # Try again with a different prompt or provider
    response = safe_call_llm(enriched_prompt)
```
3. Design for Degraded Operation
Structure workflows to continue with reduced functionality when primary providers are down:
- Use Gemini as a reliable fallback for critical operations
- Cache responses when possible to reduce dependency on live LLM calls (see the caching sketch after this list)
- Implement graceful degradation (e.g., simpler responses when advanced models are unavailable)
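As a sketch of the caching point above, a small in-memory wrapper could sit in front of `safe_call_llm`; this is not part of `core/llm.py`, and the error/lazy-response checks are illustrative heuristics:

```python
import hashlib

_response_cache = {}

def cached_safe_call_llm(prompt, **kwargs):
    """Serve repeated prompts from memory to reduce dependence on live LLM calls."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _response_cache:
        return _response_cache[key]
    response = safe_call_llm(prompt, **kwargs)
    # Only cache answers that are neither bracketed error strings nor lazy responses.
    if not response.startswith("[") and not is_lazy_llm_response(response):
        _response_cache[key] = response
    return response
```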
4. Monitor and Log
Leverage the system's logging to monitor LLM availability:
- Watch for warning messages about failed attempts
- Track how often the Gemini fallback is used
- Use `test_all_providers()` periodically to verify provider health
5. Configure Based on Use Case
- For high-reliability tasks: Increase `retries` and use a `preferred_provider`
- For cost-sensitive tasks: Order providers by cost in the `PROVIDERS` list
- For feature-specific tasks: Use `preferred_provider` to target providers with specific capabilities (e.g., vision, function calling)
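For instance, a feature-specific call might pin a provider explicitly; the provider name used here is hypothetical and should match an entry in the `PROVIDERS` list:

```python
response = call_llm(
    "Describe the attached image.",
    preferred_provider="electronhub",  # hypothetical vision-capable provider choice
)
```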
Conclusion
The RAVANA system implements a comprehensive and robust set of fallback and resilience mechanisms for LLM operations. By combining multiple providers, exponential backoff retries, automatic fallback to Gemini, and sophisticated error handling, the system maintains high availability even during service disruptions. The configuration options allow for customization of retry behavior, provider preferences, and operational parameters. Developers can build fault-tolerant workflows by leveraging the `safe_call_llm` wrapper, implementing response quality checks, and designing for degraded operation. The monitoring capabilities, including the `test_all_providers` function, enable proactive maintenance of LLM integrations. This resilience architecture ensures that the AI system remains responsive and functional, providing a reliable foundation for autonomous operation.