# Semantic Caching Guide
Reduce latency and cost by caching LLM responses using embedding similarity rather than exact string matching.
## Overview
Traditional caching only works when requests are identical. Semantic caching uses vector embeddings to match requests that are similar in meaning -- so "What is the capital of France?" and "Tell me France's capital city" return the same cached response.
## How It Works
```
User Request ──▶ LiteLLM Proxy (:4000)
                        │
                        ├── Generate Embedding (text-embedding-3-small)
                        ├── Search Redis Cache (cosine similarity)
                        │
            ┌───────────┴───────────┐
            │                       │
    Similarity ≥ 0.92       Similarity < 0.92
       (Cache Hit)             (Cache Miss)
            │                       │
            ▼                       ▼
     Return Cached          Call LLM Provider
       Response            Store in Redis Cache
```
Semantic caching is handled transparently by LiteLLM's built-in `redis-semantic` cache. Every request through `/v1/chat/completions` is automatically checked against the cache -- no client-side changes needed:

- The incoming prompt is converted to a vector embedding via `text-embedding-3-small`.
- The embedding is compared (cosine similarity) against cached embeddings in Redis.
- If any cached entry exceeds the similarity threshold (default 0.92), the cached response is returned immediately.
- On a miss, the LLM call proceeds normally and the response is stored in Redis for future hits (the sketch below illustrates this flow).
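To make the flow concrete, here is a minimal Python sketch of the lookup logic. It is illustrative only, not LiteLLM's actual implementation (which uses Redis vector search rather than a dict and a linear scan); `embed` and `call_llm` are hypothetical stand-ins for the embedding and LLM provider calls.

```python
# Illustrative sketch of the semantic cache lookup -- not LiteLLM's internals.
import numpy as np

SIMILARITY_THRESHOLD = 0.92

# Maps prompt text -> (embedding vector, stored LLM response).
cache: dict[str, tuple[np.ndarray, str]] = {}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_completion(prompt: str, embed, call_llm) -> str:
    """embed and call_llm are stand-ins for the embedding and LLM calls."""
    query = np.asarray(embed(prompt))
    for vector, response in cache.values():
        if cosine_similarity(query, vector) >= SIMILARITY_THRESHOLD:
            return response                  # cache hit: skip the LLM
    response = call_llm(prompt)              # cache miss: call the provider
    cache[prompt] = (query, response)        # store for future hits
    return response
```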
## Configuration

Semantic caching is enabled by default in `config/litellm/config.yaml`:
```yaml
litellm_settings:
  cache: true
  cache_params:
    type: "redis-semantic"
    host: "redis"
    port: 6379
    ttl: 3600
    namespace: "litellm"
    similarity_threshold: 0.92
    redis_semantic_cache_embedding_model: "text-embedding-3-small"
```
### Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `type` | `redis-semantic` | Cache type. Use `redis` for exact-match only. |
| `similarity_threshold` | `0.92` | Minimum cosine similarity for a cache hit (0.0 to 1.0). |
| `redis_semantic_cache_embedding_model` | `text-embedding-3-small` | Model used to generate embeddings. |
| `ttl` | `3600` | Time-to-live for cached entries, in seconds. |
| `host` | `redis` | Redis hostname. |
| `port` | `6379` | Redis port. |
## Tuning the Similarity Threshold
The threshold controls the trade-off between cache hit rate and response accuracy:
| Threshold | Behavior |
|---|---|
| `0.95+` | Very strict -- only nearly identical prompts match. Low hit rate, high accuracy. |
| `0.90-0.95` | Balanced -- catches paraphrased questions while avoiding false matches. Recommended. |
| `0.85-0.90` | Aggressive -- higher hit rate but may return responses for semantically different questions. |
| `< 0.85` | Not recommended -- too many false positives. |
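Rather than guessing, you can measure where your own prompts score. A short probe script, assuming the `openai` Python package and an `OPENAI_API_KEY` in the environment; the prompt pairs are made-up examples:

```python
# Embed prompt pairs with the cache's embedding model and print their
# cosine similarity, to see where paraphrases land relative to the threshold.
import numpy as np
from openai import OpenAI  # assumes OPENAI_API_KEY is set

client = OpenAI()

pairs = [
    # Should hit: same question, different wording.
    ("What is the capital of France?", "Tell me France's capital city"),
    # Should miss: related topic, different question.
    ("What is the capital of France?", "What is the population of France?"),
]

for a, b in pairs:
    resp = client.embeddings.create(model="text-embedding-3-small", input=[a, b])
    va, vb = (np.array(d.embedding) for d in resp.data)
    sim = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
    print(f"{sim:.3f}  {a!r} vs {b!r}")
```

Pick a threshold that separates the pairs you want to hit from those you don't.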
## Toggling via Admin UI

The Admin UI Settings page exposes `enable_caching` and `cache_ttl_seconds`. Changes are synced to LiteLLM at runtime without a restart.
## Admin API Management Endpoints
The Admin API (port 8086) provides endpoints for cache visibility and management, used by the Admin UI:
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/v1/cache/stats` | Cache statistics (entries, hit rate, size) |
| `GET` | `/api/v1/cache/settings` | Current cache settings |
| `PUT` | `/api/v1/cache/settings` | Update settings (syncs to LiteLLM) |
| `GET` | `/api/v1/cache/entries` | List cached entries (paginated) |
| `DELETE` | `/api/v1/cache/entries/{id}` | Delete a specific entry |
| `POST` | `/api/v1/cache/clear` | Clear all cache entries |
### Example: View Cache Stats

```bash
TOKEN="your-jwt-token"

curl http://localhost:8086/api/v1/cache/stats \
  -H "Authorization: Bearer $TOKEN"
```

Sample response:

```json
{
  "total_entries": 1523,
  "total_hits": 8934,
  "hit_rate": 68.0,
  "cache_size_mb": 4.31,
  "avg_token_savings": 142.5
}
```
### Example: Update Settings

```bash
curl -X PUT http://localhost:8086/api/v1/cache/settings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "enabled": true,
    "similarity_threshold": 0.90,
    "ttl_seconds": 7200,
    "max_entries": 20000
  }'
```
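The entry-deletion and clear endpoints take no request body. A minimal Python sketch against the same Admin API; the entry ID is a placeholder, and the JSON response shapes are assumptions:

```python
# Remove a single cached entry, then clear the whole cache,
# using the endpoints listed in the table above.
import os
import requests

BASE = "http://localhost:8086/api/v1/cache"
HEADERS = {"Authorization": f"Bearer {os.environ['TOKEN']}"}

# "some-entry-id" is a placeholder -- use GET /entries to find real IDs.
resp = requests.delete(f"{BASE}/entries/some-entry-id", headers=HEADERS)
print(resp.status_code, resp.text)

resp = requests.post(f"{BASE}/clear", headers=HEADERS)
print(resp.status_code, resp.text)
```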
## Testing Semantic Caching
Send the same question phrased differently and observe the cached response:
```bash
# First request -- cache miss, calls the LLM
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -d '{"model": "gpt-5-mini", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

# Second request -- paraphrased, should be a cache hit
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -d '{"model": "gpt-5-mini", "messages": [{"role": "user", "content": "Tell me the capital city of France"}]}'
```
The second request should return faster with the cached response. Check the response headers or Grafana metrics to confirm cache hits.
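The same check can be scripted. A sketch assuming the proxy at `localhost:4000` and a `LITELLM_KEY` environment variable:

```python
# Time a prompt and a paraphrase through the proxy; a sharp latency drop
# on the second call suggests a semantic cache hit.
import os
import time
import requests

URL = "http://localhost:4000/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['LITELLM_KEY']}"}

def timed(prompt: str) -> float:
    start = time.perf_counter()
    r = requests.post(URL, headers=HEADERS, json={
        "model": "gpt-5-mini",
        "messages": [{"role": "user", "content": prompt}],
    })
    r.raise_for_status()
    return time.perf_counter() - start

first = timed("What is the capital of France?")
second = timed("Tell me the capital city of France")
print(f"first: {first:.2f}s  paraphrase: {second:.2f}s")
```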
## Disabling Semantic Caching
To switch back to exact-match caching, change the cache type in `config/litellm/config.yaml`:
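```yaml
litellm_settings:
  cache: true
  cache_params:
    type: "redis"   # exact-match instead of "redis-semantic"
```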
Or disable caching entirely via the Admin UI Settings page (`enable_caching: false`).
## Production Considerations

- **Embedding costs:** Each cache miss requires one embedding API call (`text-embedding-3-small` is ~$0.02 per 1M tokens). This overhead is negligible compared to the LLM call cost saved on cache hits.
- **Redis memory:** Cached embeddings consume Redis memory. Monitor Redis usage and set `max_entries` to cap growth.
- **TTL strategy:** Set TTL based on how frequently your data changes. Factual queries benefit from longer TTLs (hours); time-sensitive data should use shorter TTLs (minutes).
- **Model isolation:** Cache keys are namespaced by model. A cached GPT-5 response will not match a Claude query, even if the prompts are identical.
## Related Guides
- Observability Guide -- monitor cache hit rates in Grafana
- Cost Management Guide -- how caching reduces spend
- LiteLLM Deep Dive -- full LiteLLM configuration reference