The support assistant had been running smoothly for two weeks. Then traffic doubled on a Saturday sale. OpenAI started returning 429s. The application had no retry logic. Every request during the spike returned a stack trace to the user.
LLM APIs fail differently from databases and REST services. Understanding the failure modes and designing for them from the start is far cheaper than fixing it under production load.
Table of contents
Open Table of contents
- The LLM API failure taxonomy
- Configuring timeouts
- Retry with exponential backoff for transient errors
- Spring AI’s built-in retry configuration
- Circuit breaker — stop hammering a failing provider
- Handling context length exceeded errors
- Handling content policy violations
- Streaming error handling
- Alerting on error patterns
- References
The LLM API failure taxonomy
| Error | HTTP status | Cause | Retry? |
|---|---|---|---|
| Rate limit | 429 | Too many requests per minute/hour | Yes, with backoff |
| Overloaded | 503 | Provider capacity issue | Yes, with backoff |
| Timeout | — | Response took too long | Yes, with shorter timeout |
| Context length exceeded | 400 | Prompt too long | No — fix the prompt |
| Content policy violation | 400 | Input or output violated safety policy | No — handle the content |
| Invalid API key | 401 | Wrong or expired key | No — fix configuration |
| Model not found | 404 | Wrong model name or model deprecated | No — fix configuration |
| Insufficient quota | 429/402 | Monthly spend limit reached | No — billing issue |
Spring AI throws typed exceptions for most of these. The root types are NonTransientAiException (don’t retry) and TransientAiException (safe to retry).
Configuring timeouts
LLM calls are slow. Set explicit timeouts so slow calls don’t hold threads indefinitely:
@Bean
RestClient restClient() {
// Configure the HTTP client with timeouts
return RestClient.builder()
.requestFactory(new HttpComponentsClientHttpRequestFactory(
HttpClients.custom()
.setConnectionRequestTimeout(Duration.ofSeconds(5)) // get connection from pool
.build()
))
.build();
}
For Spring AI’s OpenAI client specifically, configure via the auto-configured RestClient builder:
spring:
ai:
openai:
api-key: ${OPENAI_API_KEY}
# HTTP client timeouts (via Spring Boot's rest client properties)
# Or set JVM-level socket timeout
# spring.ai.openai.base-url: https://api.openai.com
For more control, use a custom OpenAiApi bean:
@Bean
OpenAiApi openAiApi(
@Value("${spring.ai.openai.api-key}") String apiKey
) {
return OpenAiApi.builder()
.apiKey(apiKey)
.restClient(RestClient.builder()
.requestInterceptor((request, body, execution) -> {
// Add custom timeout header or handle at HTTP level
return execution.execute(request, body);
})
.build())
.build();
}
Important: Set a hard timeout for every LLM call. Without one, a slow provider response holds a thread (and potentially a database connection) for the duration of the call. In a high-concurrency application, a few slow calls can exhaust the thread pool. 30 seconds is a reasonable hard ceiling for non-streaming calls.
Retry with exponential backoff for transient errors
Rate limit (429) and overloaded (503) errors are transient — retrying after a delay usually succeeds. Resilience4j integrates cleanly with Spring Boot:
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
</dependency>
resilience4j:
retry:
instances:
ai-client:
max-attempts: 3
wait-duration: 1s
exponential-backoff-multiplier: 2 # 1s → 2s → 4s
retry-exceptions:
- org.springframework.ai.retry.TransientAiException
ignore-exceptions:
- org.springframework.ai.retry.NonTransientAiException
@Service
class SupportChatService {
private final ChatClient chatClient;
SupportChatService(ChatClient chatClient) {
this.chatClient = chatClient;
}
@Retry(name = "ai-client", fallbackMethod = "fallbackResponse")
public String chat(String question, String conversationId) {
return chatClient.prompt()
.user(question)
.advisors(a -> a.param(CONVERSATION_ID_KEY, conversationId))
.call()
.content();
}
public String fallbackResponse(String question, String conversationId, Exception e) {
log.error("AI call failed after retries for conversationId={}", conversationId, e);
return "I'm temporarily unable to process your request. Please try again in a moment, "
+ "or contact support@techgadgets.com if the issue persists.";
}
}
The fallback method signature must match the original plus a Throwable parameter. Resilience4j calls it when all retries are exhausted.
Tip: Spring AI has its own retry mechanism built in — configure it via
spring.ai.retry.*properties. It handles basic exponential backoff for transient errors automatically. Use Resilience4j when you need custom fallback logic, circuit breakers, or bulkheads beyond what Spring AI's built-in retry provides.
Spring AI’s built-in retry configuration
Before reaching for Resilience4j, check if the built-in retry is sufficient:
spring:
ai:
retry:
max-attempts: 3 # default: 3
backoff:
initial-interval: 2s # default: 2s
multiplier: 5 # default: 5
max-interval: 3m # default: 3 minutes
on-http-codes-default: true # retry on 429, 503, etc.
The built-in retry handles most transient cases. Add Resilience4j for circuit breaking and custom fallback logic.
Circuit breaker — stop hammering a failing provider
When the LLM provider is down for more than a few seconds, retrying every request wastes resources and worsens the problem. A circuit breaker detects repeated failures and opens the circuit — failing fast until the provider recovers:
resilience4j:
circuitbreaker:
instances:
ai-client:
failure-rate-threshold: 50 # open circuit at 50% failure rate
slow-call-rate-threshold: 80 # also open if 80% of calls are slow
slow-call-duration-threshold: 10s # "slow" = > 10 seconds
wait-duration-in-open-state: 30s # wait 30s before trying again
sliding-window-size: 10 # evaluate over last 10 calls
@CircuitBreaker(name = "ai-client", fallbackMethod = "fallbackResponse")
@Retry(name = "ai-client", fallbackMethod = "fallbackResponse")
public String chat(String question, String conversationId) {
// ...
}
When the circuit is open, the fallback fires immediately without making an API call — protecting the provider from additional load and giving your application a fast failure path.
Handling context length exceeded errors
Context too long (HTTP 400, error code context_length_exceeded) is a non-transient error — retrying the same prompt won’t help. You need to reduce the prompt:
public String chat(String question, String conversationId) {
try {
return chatClient.prompt()
.user(question)
.advisors(a -> a.param(CONVERSATION_ID_KEY, conversationId))
.call()
.content();
} catch (NonTransientAiException e) {
if (e.getMessage() != null && e.getMessage().contains("context_length_exceeded")) {
// Trim the conversation history and retry with a shorter window
chatMemory.clear(conversationId);
log.warn("Context length exceeded for conversationId={} — cleared history and retrying",
conversationId);
return chatClient.prompt()
.user(question)
.advisors(a -> a.param(CONVERSATION_ID_KEY, conversationId)
.param(CHAT_MEMORY_WINDOW_SIZE, 5)) // reduced window
.call()
.content();
}
throw e;
}
}
Caution: Clearing conversation history is a disruptive recovery action — the user loses context. Before clearing, try reducing the window size. If you have summarisation in place, trigger summarisation of the oldest messages. Only clear as a last resort.
Handling content policy violations
When a request violates content policy (HTTP 400 with a content policy error code), the message is the problem:
try {
return callWithRetry(question, conversationId);
} catch (NonTransientAiException e) {
if (isContentPolicyViolation(e)) {
log.warn("Content policy violation for conversationId={}", conversationId);
return "I'm not able to process that request. "
+ "If you have a genuine support question, please rephrase it.";
}
throw e;
}
private boolean isContentPolicyViolation(NonTransientAiException e) {
String msg = e.getMessage();
return msg != null && (msg.contains("content_policy") || msg.contains("content_filter"));
}
Do not log the user’s message that triggered the violation to standard application logs — it may itself contain offensive content that should not be in log files. Log only the event (violation occurred for session X) and store the message separately in a moderation log if needed.
Streaming error handling
Errors in streaming endpoints (Flux<String>) need special handling because the error occurs mid-stream:
@GetMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
Flux<String> chatStream(@RequestParam String question, @RequestParam String conversationId) {
return chatClient.prompt()
.user(question)
.advisors(a -> a.param(CONVERSATION_ID_KEY, conversationId))
.stream()
.content()
.onErrorResume(TransientAiException.class, e -> {
// Transient: send an error event and suggest retry
log.warn("Transient AI error during stream", e);
return Flux.just("[ERROR: Temporary issue. Please try again.]");
})
.onErrorResume(Exception.class, e -> {
log.error("Unexpected stream error", e);
return Flux.just("[ERROR: Something went wrong. Contact support@techgadgets.com]");
});
}
On the JavaScript side, close the EventSource on error and optionally show a retry button:
source.onerror = () => {
source.close();
if (responseText.startsWith("[ERROR:")) {
displayElement.textContent = responseText; // show error message
} else if (!responseText) {
displayElement.textContent = "Connection lost. Please try again.";
}
// Show retry button
document.getElementById('retryBtn').style.display = 'block';
};
Alerting on error patterns
Register an error counter in the audit advisor and alert when it spikes:
@Override
public ChatResponse adviseResponse(ChatResponse response, Map<String, Object> context) {
String finishReason = response.getResult().getMetadata().getFinishReason();
if ("content_filter".equals(finishReason)) {
contentFilterCounter.increment();
}
return response;
}
// Alert: if content filter rate > 5%, something unusual is happening
Note: The next post is the final production post — deployment and configuration best practices. It covers environment-specific model routing, API key management, feature flags for AI features, and the checklist for safely shipping an AI-powered application to production.