Global Rate Limiting for AI Applications: The Last Line of Defense
In our previous articles, we explored IP-based and user-based rate limiting strategies. Now, let's examine the final layer of protection: global rate limiting.
While the other approaches focus on individual users or connections, global rate limiting protects your entire system from excessive load. This is particularly crucial for AI applications where each request may consume significant computational resources and incur substantial costs.
Understanding Global Rate Limiting
Global rate limiting caps the total number of requests your system will handle within a given timeframe, regardless of which users or IPs they come from. This serves as:
- System Protection: Prevents overwhelming your infrastructure
- Cost Control: Caps total expenditure on external AI APIs
- Emergency Brake: Acts as a failsafe against unforeseen spikes
- Predictable Scaling: Ensures your system operates within planned capacity
Let's see how to implement this essential safeguard.
Implementing Global Rate Limiting
Here's a straightforward implementation using Redis for distributed state management:
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';
import { PreflightCheck } from '../types';
// Initialize Redis
const redis = Redis.fromEnv();
// Global rate limiter - caps total system usage
const globalRateLimiter = new Ratelimit({
redis,
limiter: Ratelimit.slidingWindow(1000, '1 h'), // 1000 requests per hour total
analytics: true,
prefix: 'ratelimit:global:',
});
// A single key for the global limit
const GLOBAL_LIMIT_KEY = 'global';
export const rateLimitGlobalCheck: PreflightCheck = {
name: 'rate_limit_global',
description: 'Checks if the entire system has exceeded its global rate limit',
run: async ({ lastMessage }) => {
try {
// Skip if no content to process
if (!lastMessage || lastMessage.trim().length === 0) {
return {
passed: true,
code: 'global_rate_limit_skipped',
message: 'No content to check',
severity: 'info',
};
}
// Check the global rate limit
console.log('Global rate limit: Checking system-wide limit');
const rateLimit = await globalRateLimiter.limit(GLOBAL_LIMIT_KEY);
// Wait for any pending tasks
await rateLimit.pending;
// If the global rate limit is exceeded
if (!rateLimit.success) {
console.warn('Global rate limit exceeded for the entire system');
// Calculate reset time
const resetDate = new Date(rateLimit.reset);
const timeRemaining = Math.ceil((resetDate.getTime() - Date.now()) / 1000 / 60);
return {
passed: false,
code: 'global_rate_limited',
message: 'Our system is experiencing high demand right now. Please try again later.',
details: {
resetInMinutes: timeRemaining,
resetAt: resetDate.toISOString(),
remaining: 0,
limit: rateLimit.limit,
},
severity: 'error',
};
}
// If within rate limit
return {
passed: true,
code: 'global_rate_limit_passed',
message: 'Global rate limit check passed',
details: {
remaining: rateLimit.remaining,
limit: rateLimit.limit,
percentage: Math.round((1 - rateLimit.remaining / rateLimit.limit) * 100),
},
severity: 'info',
};
} catch (error) {
console.error('Global rate limit error:', error);
// In case of error, we should still allow the request
return {
passed: true,
code: 'global_rate_limit_error',
message: 'Error in global rate limit check',
details: { error: error instanceof Error ? error.message : 'Unknown error' },
severity: 'warning',
};
}
},
};
This implementation:
- Creates a Redis-backed rate limiter with a sliding window of 1000 requests per hour
- Uses a single global key for tracking all requests
- Provides clear feedback when the system limit is reached
- Includes proper error handling to avoid breaking the application
The Importance of the Global Key
Unlike our other rate limiting implementations, the global limiter uses a single key for all requests:
const GLOBAL_LIMIT_KEY = 'global';
This means that all requests, regardless of user or IP, increment the same counter. When this counter reaches the limit, all new requests are rejected until the window advances.
Monitoring System Load
A significant advantage of global rate limiting is the ability to monitor overall system load:
// Result includes percentage of limit used
return {
passed: true,
code: 'global_rate_limit_passed',
message: 'Global rate limit check passed',
details: {
remaining: rateLimit.remaining,
limit: rateLimit.limit,
percentage: Math.round((1 - rateLimit.remaining / rateLimit.limit) * 100),
},
severity: 'info',
};
This percentage can be logged or displayed on an admin dashboard to provide insights into system utilization.
Testing Global Rate Limiting
To test this implementation, consider these scenarios:
Scenario 1: Normal System Load
The system processes 500 requests in an hour from various users:
Request 1: โ
Limit: 1000, Remaining: 999, Percentage: 0%
Request 2: โ
Limit: 1000, Remaining: 998, Percentage: 0%
...
Request 500: โ
Limit: 1000, Remaining: 500, Percentage: 50%
Scenario 2: System Overload
The system receives 1100 requests in an hour:
Request 1: โ
Limit: 1000, Remaining: 999, Percentage: 0%
...
Request 1000: โ
Limit: 1000, Remaining: 0, Percentage: 100%
Request 1001: โ Error: "Our system is experiencing high demand right now."
Scenario 3: Recovery After Reset
After the rate limit window advances:
Request 1001 (after window): โ
Limit: 1000, Remaining: 999, Percentage: 0%
Advanced Implementation: Dynamic Global Limits
For more sophisticated applications, you might want to adjust the global limit dynamically based on system conditions:
// Pseudocode for dynamic global limits
async function getDynamicGlobalLimit() {
// Check system metrics
const cpuLoad = await getSystemCPULoad();
const memoryUsage = await getSystemMemoryUsage();
const apiCostToday = await getCurrentDayAPICost();
// Base limit
let limit = 1000;
// Reduce limit if system is under stress
if (cpuLoad > 80) {
limit = Math.floor(limit * 0.7); // 30% reduction
}
// Reduce limit if approaching daily budget
const dailyBudget = 100; // $100 per day
if (apiCostToday > dailyBudget * 0.8) {
limit = Math.floor(limit * 0.5); // 50% reduction
}
return limit;
}
Multi-Tenant Considerations
For applications serving multiple distinct customers or organizations, you might implement tenant-specific global limits:
// Pseudocode for tenant-specific global limits
async function checkTenantGlobalLimit(tenantId) {
const tenantKey = `global:${tenantId}`;
const tenantLimit = await getTenantLimit(tenantId);
const tenantRateLimiter = new Ratelimit({
redis,
limiter: Ratelimit.slidingWindow(tenantLimit, '1 h'),
analytics: true,
prefix: 'ratelimit:tenant:',
});
return await tenantRateLimiter.limit(tenantKey);
}
Pros and Cons of Global Rate Limiting
Pros
- System Protection: Guards against overwhelming your infrastructure
- Cost Control: Prevents unexpected billing spikes for external API calls
- Simplicity: One limit to monitor and adjust
- Load Insights: Provides visibility into overall system utilization
- Last Line of Defense: Catches excessive usage that other limits miss
Cons
- Blunt Instrument: Affects all users equally, regardless of priority
- Potential User Frustration: Users may be rejected even if they haven't personally overused the system
- Redis Dependency: Relies on external service for state management
- Tuning Challenges: Finding the right global limit requires careful consideration
Potential Improvements
1. Priority Queuing
Instead of rejecting requests when the limit is reached, queue them with priority:
// Pseudocode for priority queuing
async function handleRequestWithPriority(userId, content) {
const globalLimit = await globalRateLimiter.limit(GLOBAL_LIMIT_KEY);
if (globalLimit.success) {
// Process immediately
return processRequest(content);
} else {
const userPriority = await getUserPriority(userId);
// Add to priority queue
await addToQueue({
userId,
content,
priority: userPriority,
timestamp: Date.now(),
});
return {
status: 'queued',
estimatedWaitTime: calculateWaitTime(userPriority),
};
}
}
2. Circuit Breaker Pattern
Implement a circuit breaker that temporarily disables non-essential features:
// Pseudocode for circuit breaker
async function checkCircuitBreaker() {
const rateLimit = await globalRateLimiter.limit(GLOBAL_LIMIT_KEY);
// If approaching limit, activate circuit breaker
if (rateLimit.remaining < rateLimit.limit * 0.1) {
await activateCircuitBreaker();
return true;
}
return false;
}
async function activateCircuitBreaker() {
// Disable non-essential features
await redis.set('circuit_breaker:active', 'true', { ex: 60 * 10 }); // 10 minutes
}
3. Time-of-Day Adjustments
Implement different limits based on typical usage patterns:
// Pseudocode for time-based limits
function getTimeBasedLimit() {
const hour = new Date().getHours();
// Higher limits during business hours
if (hour >= 9 && hour <= 17) {
return 1500; // 1500 requests per hour during business hours
}
// Lower limits during off-hours
return 500; // 500 requests per hour during off-hours
}
Alternative Approaches
1. API Gateway Rate Limiting
Many API gateways provide built-in global rate limiting:
// Example using AWS API Gateway
{
"usagePlan": {
"quota": {
"limit": 1000,
"period": "HOUR"
},
"throttle": {
"burstLimit": 100,
"rateLimit": 50
}
}
}
2. Autoscaling with Rate Limiting
Combine rate limiting with infrastructure scaling:
// Pseudocode for scaling-aware rate limiting
async function getDynamicLimitBasedOnCapacity() {
// Get current number of application instances
const instances = await getRunningInstanceCount();
// Calculate limit based on instances
return instances * 200; // 200 requests per instance per hour
}
3. Cost-Based Limiting
For AI applications where external API costs are significant:
// Pseudocode for cost-based limiting
async function checkCostBasedLimit(content) {
// Estimate tokens based on content
const estimatedTokens = estimateTokenCount(content);
// Get current consumption and budget
const { consumed, budget } = await getDailyTokenBudget();
// If adding this request would exceed budget
if (consumed + estimatedTokens > budget) {
return false;
}
// Track consumption
await incrementTokenConsumption(estimatedTokens);
return true;
}
Best Practices for Global Rate Limiting
- Start Conservative: Begin with a lower limit and gradually increase it
- Monitor Closely: Track how often the global limit is hit
- User Communication: Explain system limitations clearly to users
- Alert System: Set up alerts for when utilization approaches the limit
- Regular Adjustment: Review and adjust limits based on actual usage patterns
Implementing a Safety Factor
It's often wise to implement a safety factor in your global rate limiting:
// Pseudocode for safety factor implementation
function getSafeLimitWithBuffer() {
// Theoretical maximum capacity
const maxCapacity = 1200;
// Apply 20% safety factor
return Math.floor(maxCapacity * 0.8); // 960
}
This provides headroom for unexpected spikes and ensures your system operates comfortably within its capacity.
Conclusion
Global rate limiting serves as the last line of defense for your AI application, protecting your infrastructure, controlling costs, and ensuring system stability. By implementing this check as part of your preflight system, you gain:
- Protection against runaway usage
- Predictable operational costs
- Insights into overall system load
- An emergency brake for unexpected situations
While it may seem like a blunt instrument compared to more targeted rate limiting approaches, the global limit is an essential safeguard for any production AI application.
In our next article, we'll explore the OpenAI content moderation API and how to use it as part of your preflight checks to filter inappropriate content.