Production Readiness Checklist
Use this checklist to verify that every component is configured correctly before a production deployment.
Security Fixes Applied
The following security hardening has been implemented in the codebase:
| Fix | Description | Status |
|---|---|---|
| CORS Restriction | Origins restricted by environment (no wildcards) | ✅ Done |
| Request Size Limits | 1MB limit on all FastAPI services | ✅ Done |
| SQL Injection Prevention | Validation for LLM-generated SQL queries | ✅ Done |
| Graceful Shutdown | Request draining and trace flushing | ✅ Done |
| Secure Headers | Restricted allowed methods and headers | ✅ Done |
Remaining Security Tasks
| Task | Description | Priority |
|---|---|---|
| Sealed Secrets | Move DB passwords from ConfigMaps | High |
| OTEL CORS | Restrict metrics endpoint origins | Medium |
| MCP Secrets | Configure MCP server credentials | Medium |
| Vault Token Rotation | Implement auto-renewal | Medium |
Pre-Deployment Checklist
1. Secrets Management
| Item | Status | Notes |
|---|---|---|
| [ ] Replace $LITELLM_KEY with secure random key | Required | Use openssl rand -hex 32 |
| [ ] Configure Vault with production token (not root-token-for-dev) | Required | Enable AppRole auth |
| [ ] Add real API keys to Vault | Required | OpenAI, Anthropic, XAI keys |
| [ ] Set BRAVE_API_KEY if using Brave Search MCP | Optional | Get from brave.com |
| [ ] Set GITHUB_TOKEN if using GitHub MCP | Optional | Create fine-grained PAT |
| [ ] Enable Vault audit logging | Required | Compliance requirement |
| [ ] Configure secret rotation policy | Recommended | 90-day rotation |
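
The key-generation note in the table above can be sketched as a pair of small shell helpers. This is only an illustration, assuming `openssl` is installed; the function names `generate_litellm_key` and `is_valid_hex_key` are hypothetical and not part of the project.

```shell
# Sketch only: helper names are hypothetical, not part of the codebase.

# Generate a key as suggested in the Notes column (openssl rand -hex 32).
# 32 random bytes are hex-encoded into 64 characters.
generate_litellm_key() {
  openssl rand -hex 32
}

# Return 0 when the argument is exactly 64 lowercase hex characters.
is_valid_hex_key() {
  [ "${#1}" -eq 64 ] || return 1
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}
```

A check such as `is_valid_hex_key "$LITELLM_KEY"` could run before deployment to catch a placeholder or development value that was never replaced.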
2. Database Configuration
| Item | Status | Notes |
|---|---|---|
| [ ] Use production PostgreSQL (not container) | Required | RDS, Cloud SQL, or managed |
| [ ] Configure SSL/TLS for database connections | Required | sslmode=require |
| [ ] Set up database backups | Required | Daily automated backups |
| [ ] Configure connection pooling | Recommended | PgBouncer or built-in |
| [ ] Create separate database users per service | Recommended | Principle of least privilege |
| [ ] Set appropriate resource limits | Required | Based on load testing |
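
To make the sslmode=require note concrete, the sketch below builds a PostgreSQL connection URI that enforces TLS. The helper name `build_pg_dsn` and the example user, host, and database names are assumptions for illustration; the password is deliberately left out of the URI and would come from `PGPASSWORD` or a `.pgpass` file.

```shell
# Sketch only: build_pg_dsn is a hypothetical helper, not part of the codebase.
# Usage: build_pg_dsn USER HOST PORT DBNAME
# Prints a libpq connection URI with sslmode=require appended.
build_pg_dsn() {
  printf 'postgresql://%s@%s:%s/%s?sslmode=require\n' "$1" "$2" "$3" "$4"
}

# Example (all values are placeholders):
#   psql "$(build_pg_dsn litellm_app db.example.internal 5432 litellm)" -c 'SELECT 1'
```

Using one database user per service, as the checklist recommends, means each service gets its own URI with its own user name.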
3. TLS/SSL Configuration
| Item | Status | Notes |
|---|---|---|
| [ ] Install cert-manager | Required | For automatic certificate management |
| [ ] Configure ClusterIssuer for Let's Encrypt | Required | Use letsencrypt-prod |
| [ ] Update ingress hosts with real domain names | Required | Replace example.com |
| [ ] Configure TLS 1.2+ only | Required | Disable TLS 1.0/1.1 |
| [ ] Enable HSTS | Recommended | Strict-Transport-Security header |
4. Authentication & Authorization
| Item | Status | Notes |
|---|---|---|
| [ ] Configure OIDC/OAuth provider | Required | For admin UI access |
| [ ] Set up API key management | Required | Virtual keys in LiteLLM |
| [ ] Review Cedar RBAC policies | Required | config/agentgateway/policies/ |
| [ ] Configure JWT validation | Required | For A2A authentication |
| [ ] Set rate limits per user/team | Recommended | Prevent abuse |
5. Network Security
| Item | Status | Notes |
|---|---|---|
| [ ] Review NetworkPolicy rules | Required | Least privilege |
| [ ] Configure WAF rules | Recommended | AWS WAF, Cloudflare |
| [ ] Set up DDoS protection | Recommended | Cloud provider DDoS |
| [ ] Whitelist admin endpoints | Required | IP-based or VPN only |
| [ ] Configure egress rules | Recommended | Restrict outbound traffic |
6. Observability
| Item | Status | Notes |
|---|---|---|
| [ ] Configure Prometheus retention | Required | Based on storage budget |
| [ ] Set up alerting rules | Required | 35 rules in prometheus/alerts/ |
| [ ] Configure PagerDuty/Opsgenie integration | Required | For on-call |
| [ ] Review Grafana dashboards | Recommended | Customize for your needs |
| [ ] Configure log aggregation | Required | Loki, CloudWatch, or Datadog |
| [ ] Set up trace sampling | Recommended | 10-20% in production |
7. Resource Sizing
| Item | Status | Notes |
|---|---|---|
| [ ] Configure HPA min/max replicas | Required | Based on traffic patterns |
| [ ] Set appropriate resource requests/limits | Required | See sizing guide below |
| [ ] Configure PDB (PodDisruptionBudget) | Required | minAvailable: 1 |
| [ ] Set up node affinity rules | Recommended | Spread across AZs |
| [ ] Configure pod anti-affinity | Recommended | Prevent single-node failure |
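
As one possible way to produce the PodDisruptionBudget from the table, the sketch below wraps `kubectl create poddisruptionbudget` with `--min-available=1` and a client-side dry run so the manifest can be reviewed before it is applied. The helper name `create_pdb`, the PDB name, and the label selector are assumptions; the actual labels depend on how the deployments are defined.

```shell
# Sketch only: create_pdb is a hypothetical helper, not part of the codebase.
# Usage: create_pdb PDB_NAME NAMESPACE LABEL_SELECTOR
# Prints a PodDisruptionBudget manifest with minAvailable: 1 without creating it.
# KUBECTL can be overridden (for example in tests); it defaults to kubectl.
create_pdb() {
  "${KUBECTL:-kubectl}" create poddisruptionbudget "$1" \
    --namespace="$2" --selector="$3" --min-available=1 \
    --dry-run=client -o yaml
}

# Example (selector is an assumed label, adjust to your manifests):
#   create_pdb litellm-pdb litellm app.kubernetes.io/name=litellm > litellm-pdb.yaml
```

Note that minAvailable: 1 only protects availability when the deployment runs at least two replicas, which matches the smallest tier in the sizing guide.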
8. High Availability
| Item | Status | Notes |
|---|---|---|
| [ ] Deploy across multiple AZs | Required | Minimum 2 AZs |
| [ ] Configure Redis HA (Sentinel/Cluster) | Recommended | For rate limiting |
| [ ] Set up PostgreSQL replicas | Required | Read replicas |
| [ ] Configure load balancer health checks | Required | /health endpoint |
| [ ] Test failover procedures | Required | Document in runbook |
Resource Sizing Guide
Small (< 100 req/sec)
# LiteLLM
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi
replicas: 2

# Agent Gateway
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 1Gi
replicas: 2
Medium (100-500 req/sec)
# LiteLLM
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 4000m
    memory: 4Gi
replicas: 4

# Agent Gateway
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi
replicas: 4
Large (500-2000 req/sec)
# LiteLLM
resources:
  requests:
    cpu: 2000m
    memory: 2Gi
  limits:
    cpu: 8000m
    memory: 8Gi
replicas: 8

# Agent Gateway
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 4000m
    memory: 4Gi
replicas: 8
Enterprise (> 2000 req/sec)
# LiteLLM
resources:
  requests:
    cpu: 4000m
    memory: 4Gi
  limits:
    cpu: 16000m
    memory: 16Gi
replicas: 16

# Agent Gateway
resources:
  requests:
    cpu: 2000m
    memory: 2Gi
  limits:
    cpu: 8000m
    memory: 8Gi
replicas: 16
Deployment Steps
1. Pre-flight Checks
# Verify Kubernetes cluster
kubectl cluster-info
kubectl get nodes
# Check required namespaces
kubectl get ns | grep -E "(agentgateway|litellm|observability|database)"
# Verify secrets are configured
kubectl get secrets -n litellm
kubectl get secrets -n agentgateway
2. Deploy Infrastructure
# Apply base manifests
kubectl apply -k kubernetes/base/
# Wait for database to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=postgresql -n database --timeout=300s
# Apply overlays for environment
kubectl apply -k kubernetes/overlays/production/
3. Verify Deployment
# Check all pods are running
kubectl get pods -A | grep -E "(agentgateway|litellm|prometheus|grafana)"
# Verify services
kubectl get svc -A | grep -E "(agentgateway|litellm)"
# Check ingress
kubectl get ingress -A
# Test health endpoints
# (-k skips certificate verification; drop it once a valid certificate is issued)
curl -k https://api.example.com/health
curl -k https://api.example.com/v1/models
4. Run Smoke Tests
# Test LLM endpoint
curl -X POST https://api.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'
# Test MCP endpoint
curl https://api.example.com/mcp/tools
# Check metrics
curl https://api.example.com/metrics | head -20
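
The smoke tests above print raw responses, so a failure has to be spotted by eye. A minimal sketch of a pass/fail wrapper follows; the helper names `http_status` and `expect_status` are hypothetical, and the expected status codes are assumptions to adjust per endpoint.

```shell
# Sketch only: helper names are hypothetical, not part of the codebase.

# Print only the HTTP status code of a request; all arguments go to curl.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$@"
}

# Usage: expect_status EXPECTED ACTUAL LABEL
# Prints PASS or FAIL and returns non-zero on a mismatch.
expect_status() {
  if [ "$1" = "$2" ]; then
    printf 'PASS %s\n' "$3"
  else
    printf 'FAIL %s (expected %s, got %s)\n' "$3" "$1" "$2"
    return 1
  fi
}

# Example:
#   expect_status 200 "$(http_status https://api.example.com/health)" health
```

Because `expect_status` returns non-zero on failure, a script that chains these checks can stop at the first failing endpoint.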
Monitoring & Alerting
Critical Alerts (Page Immediately)
- Service unavailable (5xx > 5% for 5 min)
- Database connection failures
- Certificate expiration (< 7 days)
- Memory usage > 90%
- Vault sealed
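
The certificate expiration threshold in the list above can also be checked by hand. The sketch below uses `openssl x509 -checkend`, which takes a window in seconds; the helper name `cert_expires_within_days` is hypothetical, and this is a manual check, not the alerting rule itself.

```shell
# Sketch only: cert_expires_within_days is a hypothetical helper.
# Usage: cert_expires_within_days CERT_PEM_PATH DAYS
# Returns 0 when the certificate expires (or has already expired) within DAYS,
# and 1 when it remains valid beyond that window.
# An unreadable or invalid certificate file is also reported as expiring.
cert_expires_within_days() {
  # -checkend N exits 0 when the certificate is still valid N seconds from now
  if openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Example: warn when a local certificate is inside the 7-day paging window
#   cert_expires_within_days /path/to/tls.crt 7 && echo "certificate expires soon"
```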
Warning Alerts (Slack/Email)
- Latency P99 > 5s
- Error rate > 1%
- Budget utilization > 80%
- Disk usage > 80%
- Pod restarts > 3 in 1 hour
Info Alerts (Dashboard Only)
- New model deployments
- Configuration changes
- Scaling events
Backup & Recovery
What to Backup
- PostgreSQL database (daily)
- Vault secrets (encrypted export)
- ConfigMaps and Secrets
- Prometheus metrics (optional)
Recovery Procedures
- Database Recovery: Restore from RDS/Cloud SQL snapshot
- Secrets Recovery: Restore from Vault backup or re-create
- Full Cluster Recovery: Apply manifests from Git + restore database
RTO/RPO Targets
- RTO (Recovery Time Objective): 1 hour
- RPO (Recovery Point Objective): 1 hour (daily backups; WAL archiving covers the interval between them)
Security Hardening
Application Security (Implemented)
- [x] CORS restricted to environment-specific origins (no wildcards in production)
- [x] Request size limits (1MB) to prevent DoS attacks
- [x] SQL injection prevention for LLM-generated queries
- [x] Graceful shutdown handlers with request draining
- [x] JWT authentication with configurable expiry
Container Security
- [x] Non-root user (runAsNonRoot: true)
- [x] Read-only root filesystem where possible
- [x] No privilege escalation (allowPrivilegeEscalation: false)
- [ ] Enable seccomp profiles
- [ ] Enable AppArmor/SELinux
Network Security
- [x] NetworkPolicy for namespace isolation
- [x] TLS for all external traffic
- [ ] mTLS for service-to-service (optional)
- [ ] Egress firewall rules
Production Environment Variables (Required)
# Set CORS origins for production
export CORS_ORIGINS="https://admin.yourdomain.com,https://api.yourdomain.com"
# Set environment mode
export ENVIRONMENT=production
# Request size limit (optional, default 1MB)
export MAX_REQUEST_SIZE_BYTES=1048576
# Shutdown timeout (optional, default 30s)
export SHUTDOWN_TIMEOUT_SECONDS=30
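
Since the CORS hardening depends on CORS_ORIGINS containing no wildcards, a pre-deployment check can catch a bad value early. The sketch below is one way to do that; the helper name `validate_cors_origins` is hypothetical, and the requirement that every origin use https:// is an added assumption, not something the services are known to enforce.

```shell
# Sketch only: validate_cors_origins is a hypothetical helper.
# Usage: validate_cors_origins "$CORS_ORIGINS"
# Returns 0 when the value is non-empty, contains no wildcard, and every
# comma-separated origin starts with https:// (the https rule is an assumption).
# The body runs in a subshell so the IFS and globbing changes do not leak out.
validate_cors_origins() (
  set -f                      # disable globbing while splitting the list
  [ -n "$1" ] || exit 1       # reject an empty value
  case "$1" in
    *'*'*) exit 1 ;;          # reject any wildcard
  esac
  IFS=,
  for origin in $1; do
    case "$origin" in
      https://*) ;;
      *) exit 1 ;;
    esac
  done
  exit 0
)

# Example:
#   validate_cors_origins "$CORS_ORIGINS" || { echo "invalid CORS_ORIGINS" >&2; exit 1; }
```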
Audit & Compliance
- [ ] Enable Kubernetes audit logging
- [ ] Configure log retention (90 days recommended)
- [ ] Set up SIEM integration
- [ ] Document data flow for compliance
Rollback Procedures
Quick Rollback
# Rollback deployment to previous revision
kubectl rollout undo deployment/litellm -n litellm
kubectl rollout undo deployment/agentgateway -n agentgateway
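
The undo commands above return as soon as the rollback is requested, not when it completes. A minimal sketch that also waits for the rollback to finish follows; the helper name `rollback_and_wait` and the 120-second timeout are assumptions.

```shell
# Sketch only: rollback_and_wait is a hypothetical helper.
# Usage: rollback_and_wait DEPLOYMENT NAMESPACE
# Rolls the deployment back one revision, then blocks until the rollout
# finishes or the timeout (an assumed 120s) is reached.
# KUBECTL can be overridden (for example in tests); it defaults to kubectl.
rollback_and_wait() {
  "${KUBECTL:-kubectl}" rollout undo "deployment/$1" -n "$2" &&
    "${KUBECTL:-kubectl}" rollout status "deployment/$1" -n "$2" --timeout=120s
}

# Example:
#   rollback_and_wait litellm litellm
#   rollback_and_wait agentgateway agentgateway
```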
Full Rollback
# Revert the most recent commit (creates a new commit that undoes HEAD)
git revert HEAD
git push
# Re-apply manifests
kubectl apply -k kubernetes/overlays/production/
Contacts
| Role | Name | Contact |
|---|---|---|
| Platform Team Lead | TBD | TBD |
| On-Call Engineer | TBD | TBD |
| Security Contact | TBD | TBD |
| Database Admin | TBD | TBD |
Sign-Off
| Reviewer | Date | Status |
|---|---|---|
| Platform Engineering | | [ ] Approved |
| Security Team | | [ ] Approved |
| SRE Team | | [ ] Approved |
| Compliance | | [ ] Approved |