



Caching strategies for LLM and vector search systems, including semantic caching, embedding caching, and response caching, to reduce costs and improve latency in RAG applications.
1. Embedding Cache: store computed embeddings keyed by a hash of the input text, so repeated texts are never re-embedded (see the sketch after this list).
2. Vector Search Cache: store the top-k results for recent queries, so identical searches skip the index entirely.
3. LLM Response Cache: store full model responses keyed by the exact prompt and generation parameters.
4. Semantic Cache: match new queries against cached ones by embedding similarity, so paraphrases also hit the cache.
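
A minimal sketch of the first pattern, assuming a hypothetical embed() model call and an in-process dict as the store (a real deployment would use Redis or similar):

import hashlib

_embedding_cache = {}  # text hash -> embedding vector

def cached_embed(text: str):
    # Key by a stable hash of the normalized text so identical
    # inputs always map to the same cache entry.
    key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed(text)  # embed() is the hypothetical model call
    return _embedding_cache[key]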
Concept: instead of requiring an exact string match, embed the incoming query and reuse a cached response whenever a previously answered query is semantically close enough (e.g. cosine similarity above ~0.95).
Implementation:
def semantic_cache_lookup(query):
    # Embed the query and search the cache's vector index for a
    # previously seen query above the similarity threshold.
    query_emb = embed(query)
    similar = cache_index.search(query_emb, threshold=0.95)
    if similar:
        return cached_responses[similar[0]]  # cache hit: reuse the stored response
    return None  # cache miss: caller computes and stores a fresh response
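
On a miss, the caller computes and stores the response. A sketch of the surrounding handler, assuming the same hypothetical embed(), cache_index, and cached_responses objects plus a hypothetical call_llm():

def answer(query):
    cached = semantic_cache_lookup(query)
    if cached is not None:
        return cached
    response = call_llm(query)  # hypothetical LLM call
    # Index the query embedding and remember its response for future hits.
    entry_id = cache_index.add(embed(query))
    cached_responses[entry_id] = response
    return response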
Benefits: much higher hit rates than exact-match caching, since paraphrases and rewordings still hit, which translates directly into fewer LLM calls, lower cost, and lower latency.
Exact Match:
    key = hash(query_text)
Simple but misses similar queries.
Semantic Match:
    key = vector_similarity_search(query_embedding)
Flexible, higher hit rate.
Hybrid (check exact first, fall back to semantic, as sketched below):
    if exact_match: return cached
    elif semantic_match > threshold: return cached
    else: compute_new
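
A minimal sketch of the hybrid lookup, assuming the hypothetical embed(), cache_index, and cached_responses from above plus a plain dict for exact matches:

import hashlib

exact_cache = {}  # prompt hash -> response

def hybrid_lookup(query, threshold=0.95):
    # 1. Cheap exact-match check first.
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in exact_cache:
        return exact_cache[key]
    # 2. Fall back to semantic similarity search.
    similar = cache_index.search(embed(query), threshold=threshold)
    if similar:
        return cached_responses[similar[0]]
    # 3. Miss: the caller computes a new response and stores it in both tiers.
    return None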
Suggested TTLs:
Embedding Cache: 30+ days
Search Results: 1-24 hours
LLM Responses: 1-6 hours
Semantic Cache: 6-24 hours
Redis: in-memory key-value store with native TTL support; the most common choice for exact-match and response caches.
Memcached: simple distributed memory cache; fast, but offers no persistence and no vector search.
DynamoDB: managed key-value store with a per-item TTL attribute; a fit for serverless deployments.
Vector Databases: store query embeddings alongside responses to serve as the index for semantic caching.
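
A sketch of a response cache on Redis with TTLs in the ranges above, using the standard redis-py client (the key prefix and the 1-hour TTL are illustrative choices, not prescriptions):

import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)

def _key(prompt: str) -> str:
    return "llm:" + hashlib.sha256(prompt.encode()).hexdigest()

def get_cached_response(prompt: str):
    return r.get(_key(prompt))  # None on a miss

def set_cached_response(prompt: str, response: str):
    # ex= sets the TTL in seconds; 3600 matches the "LLM Responses: 1-6 hours" guideline.
    r.set(_key(prompt), response, ex=3600)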
Time-Based: simple TTL; entries expire after a fixed lifetime whether or not the underlying data changed.
Event-Based: invalidate entries when the source changes (e.g. a document is re-indexed), typically via pub/sub or hooks in the ingestion pipeline.
Version-Based: embed a model or index version in the cache key, so bumping the version makes all stale entries unreachable without explicit deletion (see the sketch below).
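
A minimal sketch of version-based keys; EMBEDDING_MODEL_VERSION and INDEX_VERSION are hypothetical deployment-config values:

import hashlib

EMBEDDING_MODEL_VERSION = "v3"   # bump when the embedding model changes
INDEX_VERSION = "2024-06"        # bump when documents are re-indexed

def cache_key(prompt: str) -> str:
    # Entries written under older versions are simply never read again;
    # they age out via TTL instead of requiring explicit deletion.
    digest = hashlib.sha256(prompt.encode()).hexdigest()
    return f"resp:{EMBEDDING_MODEL_VERSION}:{INDEX_VERSION}:{digest}"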
Example Savings: savings scale with the hit rate; effective cost per request is roughly (1 - hit_rate) x cost_per_LLM_call, since cache reads cost orders of magnitude less than model calls.
At scale: even a modest hit rate removes a proportional share of total LLM spend, and the absolute savings grow linearly with request volume.
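
A back-of-the-envelope calculation; the hit rate, per-call cost, and volume below are purely illustrative assumptions, not measurements:

# Illustrative assumptions (not measured values):
requests_per_day = 1_000_000
cost_per_llm_call = 0.002   # dollars per call, hypothetical
hit_rate = 0.30             # hypothetical semantic-cache hit rate

baseline = requests_per_day * cost_per_llm_call                      # $2,000/day
with_cache = requests_per_day * (1 - hit_rate) * cost_per_llm_call   # $1,400/day
print(f"daily savings: ${baseline - with_cache:,.0f}")               # $600/day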
Read-Through:
    # On a miss, load the value and store it before returning.
    result = cache.get(key)
    if not result:
        result = compute_expensive()
        cache.set(key, result)
    return result
Write-Through:
    # Populate the cache at write time, so later reads always hit.
    result = compute()
    cache.set(key, result)
    return result
Cache-Aside:
    # The application manages the cache explicitly around the compute.
    if key in cache:
        return cache[key]
    result = compute()
    cache[key] = result
    return result
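
The three patterns share the same core; packaging cache-aside as a decorator is one way to reuse it (the in-process dict and TTL handling here are a sketch, not production code):

import time
from functools import wraps

def cached(ttl_seconds: float):
    """Cache-aside as a decorator: check, compute on miss, store with a TTL."""
    store = {}  # key -> (expiry timestamp, value)

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args):
            hit = store.get(args)
            if hit and hit[0] > time.time():
                return hit[1]            # fresh entry: serve from cache
            value = fn(*args)            # miss or expired: recompute
            store[args] = (time.time() + ttl_seconds, value)
            return value
        return wrapper
    return decorator

@cached(ttl_seconds=3600)
def expensive_search(query: str):
    ...  # placeholder for the real vector search / LLM call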
Purpose-built semantic caching library: GPTCache is a widely used open-source example, bundling embedding generation, similarity evaluation, and storage backends behind a single interface.