Observability for AI applications — tracing and logging LLM calls in Spring Boot

The support assistant went to production on a Friday. On Monday morning, the team found three problems: some responses were taking 8 seconds, a handful of users were getting wrong answers, and nobody knew what the LLM was actually sending and receiving. There were no logs for any of it.

Observability for AI applications is not optional. It is harder to build than standard REST API observability because the inputs and outputs are large and unstructured, and the model’s internal reasoning is invisible. But the fundamentals are the same: metrics, traces, logs.

What you need to observe in an AI application

A standard REST endpoint exposes: latency, error rate, request count. An AI endpoint needs those plus:

| Signal | Why it matters |
|---|---|
| Token usage per request | Primary cost driver |
| Model latency | LLM calls are slow (1–10+ seconds); track separately from app latency |
| First-token latency | For streaming endpoints: how fast the UI shows something |
| Retrieved document quality | Did RAG retrieve useful context? |
| Tool call count and latency | Which tools are slow, which fail |
| Prompt content | Debug wrong answers by seeing exactly what the model received |
| Response content | Detect hallucinations, policy violations, unexpected formats |
| Finish reason | STOP is normal; LENGTH means the response was truncated |
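
One way to keep these signals together is a small value type recorded once per call. The sketch below is illustrative only: `AiCallSignals` and its field names are hypothetical and are not part of Spring AI.

```java
import java.time.Duration;

/** Hypothetical per-call signal record; not a Spring AI type. */
record AiCallSignals(
        int promptTokens,
        int completionTokens,
        Duration modelLatency,
        Duration firstTokenLatency,   // may be null for non-streaming calls
        int toolCallCount,
        String finishReason) {

    /** Total tokens billed for this call (prompt + completion). */
    int totalTokens() {
        return promptTokens + completionTokens;
    }

    /** LENGTH means the model hit its output limit and the answer was cut off. */
    boolean truncated() {
        return "LENGTH".equalsIgnoreCase(finishReason);
    }
}
```

Prompt and response content are deliberately left out of the record, since they are large and often sensitive.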

Spring AI’s built-in Micrometer integration

Spring AI automatically creates Micrometer metrics for every ChatClient call when Micrometer is on the classpath. Add the Actuator starter and the Prometheus registry:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <scope>runtime</scope>
</dependency>

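Spring AI registers metrics like these automatically. Exact metric names vary across Spring AI versions, so check /actuator/metrics for the names your version publishes: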

| Metric | Type | Description |
|---|---|---|
| spring.ai.chat.client.operation.seconds | Timer | End-to-end ChatClient call duration |
| spring.ai.chat.model.operation.seconds | Timer | Raw model API call duration |
| spring.ai.chat.model.prompt.tokens | Counter | Input tokens consumed |
| spring.ai.chat.model.completion.tokens | Counter | Output tokens generated |
| spring.ai.chat.model.total.tokens | Counter | Total tokens (prompt + completion) |
| spring.ai.vectorstore.operation.seconds | Timer | Vector store query duration |
| spring.ai.embedding.model.operation.seconds | Timer | Embedding model call duration |
Expose them via the Prometheus endpoint:

management:
  endpoints:
    web:
      exposure:
        include: health, info, metrics, prometheus
  metrics:
    tags:
      application: techgadgets-support   # adds app tag to all metrics

Scrape http://localhost:8080/actuator/prometheus to see all Spring AI metrics.

Tip: Set up a Grafana dashboard with panels for token usage (cost), model latency (performance), and error rate (reliability) from day one. These three panels answer most production questions about your AI feature.

Distributed tracing with Spring AI

Spring AI integrates with Micrometer Tracing (which supports both Zipkin/Brave and OpenTelemetry) to create spans for every LLM call. Add the tracing dependency:

<!-- OpenTelemetry tracing -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

Then set the sampling rate and the OTLP endpoint:

management:
  tracing:
    sampling:
      probability: 1.0   # trace 100% in dev; use 0.1 in production
  # OTLP exporter (Jaeger, Tempo, etc.)
  otlp:
    tracing:
      endpoint: http://localhost:4318/v1/traces

spring:
  application:
    name: techgadgets-support

With tracing enabled, a single user request that triggers RAG + two tool calls + one LLM generation produces a trace like:

HTTP POST /api/support/chat                          [1.8s]
  └── ChatClient.call                                [1.7s]
        ├── QuestionAnswerAdvisor                    [120ms]
        │     └── VectorStore.similaritySearch       [85ms]
        ├── OpenAI chat API call                     [1.1s]
        │     ├── Tool: getOrderStatus               [210ms]
        │     └── Tool: checkRefundEligibility       [180ms]
        └── OpenAI chat API call (final)             [380ms]

This trace immediately shows: the vector store search took 85ms, the first LLM call took 1.1s (which included two tool calls), and the final generation took 380ms.

Important: LLM API calls are external network calls to a provider such as OpenAI or Anthropic. They are the dominant latency source in most AI applications, typically 500ms to 5+ seconds. Tracing separates this from application latency (which is typically < 50ms) so you know whether a slow response is your code or the LLM provider.
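
To make the split concrete, here is a minimal sketch of the arithmetic behind "is it my code or the provider?". `LatencyBreakdown` is a hypothetical helper, and it assumes the model spans run one after another rather than in parallel.

```java
import java.util.List;

/** Hypothetical helper: splits a request's wall time into model time and everything else. */
final class LatencyBreakdown {
    final long totalMs;
    final long modelMs;

    LatencyBreakdown(long totalMs, List<Long> modelSpanMs) {
        this.totalMs = totalMs;
        // Sum of all model spans; valid only if the spans do not overlap.
        this.modelMs = modelSpanMs.stream().mapToLong(Long::longValue).sum();
    }

    /** Time spent outside model spans: advisors, vector search, serialization, and so on. */
    long nonModelMs() {
        return totalMs - modelMs;
    }

    /** Fraction of the request spent inside model spans, between 0.0 and 1.0. */
    double modelShare() {
        return totalMs == 0 ? 0.0 : (double) modelMs / totalMs;
    }
}
```

For the example trace, `new LatencyBreakdown(1800, List.of(1100L, 380L))` attributes 1,480 ms to the two model spans (about 82% of the request) and 320 ms to everything else. The first model span in that trace also contains the tool calls, so the provider's real share is somewhat lower.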

Structured logging with SimpleLoggerAdvisor

Spring AI’s SimpleLoggerAdvisor logs the full request and response at DEBUG level. Enable it selectively:

// In development — log everything
.defaultAdvisors(
    new QuestionAnswerAdvisor(vectorStore),
    MessageChatMemoryAdvisor.builder(chatMemory).build(),
    new SimpleLoggerAdvisor()   // full prompt + response logging
)

Then raise the log level for the advisor:

logging:
  level:
    org.springframework.ai.chat.client.advisor.SimpleLoggerAdvisor: DEBUG

The log output includes:

request: ChatClientRequest{messages=[SystemMessage{...}, UserMessage{text='Where is my order?'}], advisors=[...]}
response: ChatClientResponse{result=AssistantMessage{text='Your order TG-9821 is currently...'}}

Caution: SimpleLoggerAdvisor logs the full prompt including all RAG context (potentially hundreds of lines) and the complete model response. In production, this generates enormous log volume. Use it only in development or with a sampling rate. In production, log only what you need: token count, finish reason, conversation ID — not the full content.
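
If you do want full content for a fraction of traffic, one option is to sample per conversation rather than per request, so a sampled conversation is captured in full. This is a sketch under assumptions: `ContentLogSampler` is a hypothetical class, and wiring it into an advisor is left out.

```java
/** Hypothetical sampler: decides whether to log full prompt/response content for a conversation. */
final class ContentLogSampler {
    private final int percent; // share of conversations to log in full, 0..100

    ContentLogSampler(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        this.percent = percent;
    }

    /** The same conversation ID always gets the same decision. */
    boolean shouldLogContent(String conversationId) {
        // floorMod keeps the bucket in 0..99 even when hashCode() is negative.
        int bucket = Math.floorMod(conversationId.hashCode(), 100);
        return bucket < percent;
    }
}
```

Hashing the conversation ID instead of calling a random generator keeps the decision stable across requests and across application instances.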

Custom structured logging for production

Build a custom advisor that logs the structured fields you actually need:

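import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.chat.client.ChatClientRequest;
import org.springframework.ai.chat.client.ChatClientResponse;
import org.springframework.ai.chat.client.advisor.api.CallAdvisor;
import org.springframework.ai.chat.client.advisor.api.CallAdvisorChain;
import org.springframework.ai.chat.memory.ChatMemory;
import org.springframework.ai.chat.metadata.Usage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.stereotype.Component;

// Written against the Spring AI 1.0 advisor API (CallAdvisor).
// Earlier milestones used RequestResponseAdvisor, which has since been removed.
@Component
class AiAuditAdvisor implements CallAdvisor {

    private static final Logger log = LoggerFactory.getLogger(AiAuditAdvisor.class);

    @Override
    public String getName() { return "AiAuditAdvisor"; }

    @Override
    public int getOrder() { return 0; }   // lower values run earlier in the chain

    @Override
    public ChatClientResponse adviseCall(ChatClientRequest request, CallAdvisorChain chain) {
        long startNanos = System.nanoTime();
        // Advisor params (including the conversation ID) travel in the request context.
        Object conversationId = request.context().getOrDefault(ChatMemory.CONVERSATION_ID, "unknown");

        ChatClientResponse clientResponse = chain.nextCall(request);

        long durationMs = (System.nanoTime() - startNanos) / 1_000_000;
        ChatResponse response = clientResponse.chatResponse();
        Usage usage = response != null ? response.getMetadata().getUsage() : null;
        String finishReason = response != null && response.getResult() != null
                ? response.getResult().getMetadata().getFinishReason() : "unknown";

        log.info("ai_call conversationId={} durationMs={} promptTokens={} completionTokens={} finishReason={}",
            conversationId,
            durationMs,
            usage != null ? usage.getPromptTokens() : -1,
            usage != null ? usage.getCompletionTokens() : -1,
            finishReason
        );

        return clientResponse;
    }
}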

Register it in the ChatClient builder:

.defaultAdvisors(
    new QuestionAnswerAdvisor(vectorStore),
    MessageChatMemoryAdvisor.builder(chatMemory).build(),
    aiAuditAdvisor   // structured logging, not full content
)

Now every LLM call produces a single structured log line with the fields needed for monitoring, without logging the full prompt content.
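
Because the line is plain key=value pairs, it is easy to pick apart in a log pipeline or a quick script. A minimal sketch, assuming the field layout shown above; `AiCallLogParser` is a hypothetical helper, and it does not handle values that contain spaces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for the "ai_call key=value ..." audit line. */
final class AiCallLogParser {

    /** Returns the key=value fields in order; tokens without '=' (such as "ai_call") are skipped. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }
}
```

If your log backend ingests JSON natively, emitting JSON from the logging framework is usually a better fit than parsing text after the fact.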

Tracking token usage over time

Register a Micrometer counter in the audit advisor to track cumulative token usage:

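import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.ai.chat.client.ChatClientRequest;
import org.springframework.ai.chat.client.ChatClientResponse;
import org.springframework.ai.chat.client.advisor.api.CallAdvisor;
import org.springframework.ai.chat.client.advisor.api.CallAdvisorChain;
import org.springframework.ai.chat.metadata.Usage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.stereotype.Component;

// Written against the Spring AI 1.0 advisor API (CallAdvisor).
@Component
class AiAuditAdvisor implements CallAdvisor {

    private final Counter promptTokenCounter;
    private final Counter completionTokenCounter;

    AiAuditAdvisor(MeterRegistry registry) {
        this.promptTokenCounter = Counter.builder("ai.tokens.prompt")
                .description("Total prompt tokens consumed")
                .register(registry);
        this.completionTokenCounter = Counter.builder("ai.tokens.completion")
                .description("Total completion tokens generated")
                .register(registry);
    }

    @Override
    public String getName() { return "AiAuditAdvisor"; }

    @Override
    public int getOrder() { return 0; }   // lower values run earlier in the chain

    @Override
    public ChatClientResponse adviseCall(ChatClientRequest request, CallAdvisorChain chain) {
        ChatClientResponse clientResponse = chain.nextCall(request);
        ChatResponse response = clientResponse.chatResponse();
        Usage usage = response != null ? response.getMetadata().getUsage() : null;
        if (usage != null) {
            // Counters only ever go up; compute daily spend from their rate of increase.
            promptTokenCounter.increment(usage.getPromptTokens());
            completionTokenCounter.increment(usage.getCompletionTokens());
        }
        return clientResponse;
    }
}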

Add a Grafana alert when daily token spend exceeds a threshold — before the bill arrives.
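
The alert condition itself is simple arithmetic. The sketch below shows it in plain Java, assuming per-million-token prices and a daily budget that you supply; `TokenBudget` and the prices used in the example are hypothetical, not real provider pricing.

```java
/** Hypothetical budget check: converts token counts to spend and compares against a daily budget. */
final class TokenBudget {
    private final double promptPricePerMillion;      // assumed price, USD per 1M prompt tokens
    private final double completionPricePerMillion;  // assumed price, USD per 1M completion tokens
    private final double dailyBudgetUsd;

    TokenBudget(double promptPricePerMillion, double completionPricePerMillion, double dailyBudgetUsd) {
        this.promptPricePerMillion = promptPricePerMillion;
        this.completionPricePerMillion = completionPricePerMillion;
        this.dailyBudgetUsd = dailyBudgetUsd;
    }

    /** Spend in USD for the given token counts. */
    double spendUsd(long promptTokens, long completionTokens) {
        return promptTokens / 1_000_000.0 * promptPricePerMillion
             + completionTokens / 1_000_000.0 * completionPricePerMillion;
    }

    /** True once spend passes 80% of the daily budget. */
    boolean shouldAlert(long promptTokens, long completionTokens) {
        return spendUsd(promptTokens, completionTokens) > dailyBudgetUsd * 0.8;
    }
}
```

With made-up prices of 2.00 and 8.00 USD per million tokens and a 50 USD budget, 10M prompt tokens plus 2M completion tokens cost 36 USD and stay under the 40 USD alert line; 12M plus 3M cost 48 USD and trip it.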

What to alert on

| Alert | Threshold | What it indicates |
|---|---|---|
| Model p95 latency | > 5 seconds | LLM provider slowdown or prompt too long |
| Error rate | > 2% | Rate limits, API key issues, or prompt errors |
| Daily token spend | > budget × 0.8 | Approaching budget limit |
| FINISH_REASON = LENGTH rate | > 5% | maxTokens too low, responses being truncated |
| Vector store query time | > 500ms | pgvector index needs tuning |
| Tool call error rate | > 5% | Downstream service issues |
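
In practice these rules live in Prometheus or Grafana, but it helps to see what two of them compute. The following is a local sketch over raw samples, not how a metrics backend derives percentiles from histograms; `AlertChecks` is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that computes two of the alert inputs from raw samples. */
final class AlertChecks {

    /** Nearest-rank p95 over latency samples in milliseconds. */
    static long p95(long[] latenciesMs) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // 1-based rank = ceil(0.95 * n), done in integer arithmetic to avoid rounding surprises.
        int rank = (95 * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    /** Fraction of calls whose finish reason was LENGTH (truncated output). */
    static double lengthRate(List<String> finishReasons) {
        if (finishReasons.isEmpty()) {
            return 0.0;
        }
        long truncated = finishReasons.stream().filter("LENGTH"::equalsIgnoreCase).count();
        return (double) truncated / finishReasons.size();
    }
}
```

Against the thresholds in the table, the alerts would fire when `p95(...) > 5000` or `lengthRate(...) > 0.05`.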

Note: The next post covers cost management in detail — token budgets, prompt caching, and model selection strategies. With the metrics from this post in place, you have the data needed to identify where costs are coming from and which optimisations will have the most impact.
