Running local AI models with Ollama and Spring AI — private, free, offline

Dev’s team had a new requirement: a customer wanted their support assistant deployed on-premise. No data was allowed to leave their network. OpenAI was out. So was every cloud provider.

The answer was Ollama.

Open Table of contents

What Ollama is
Installing Ollama
Adding the Spring AI Ollama starter
Profile-based model switching
Using a dedicated Ollama ChatClient bean
Model recommendations for production use cases
Configuring Ollama for production on-premise deployments
Where local models fall short
The Spring AI abstraction pays off here
References

What Ollama is

Ollama is an open-source tool that runs large language models locally. It manages model downloads, GPU/CPU inference, and exposes an OpenAI-compatible HTTP API on localhost:11434.

From Spring AI’s perspective, Ollama looks almost identical to OpenAI. The same ChatClient API works unchanged — only the underlying model and base URL differ. Switch the profile, switch the provider.

Installing Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start the server
ollama serve

Pull the models you need:

# Chat models
ollama pull llama3.2:3b        # fast, good for dev
ollama pull llama3.1:8b        # better quality, needs ~8GB RAM
ollama pull qwen2.5:7b         # strong reasoning
ollama pull mistral:7b         # solid general-purpose

# Embedding models
ollama pull nomic-embed-text   # 768 dimensions, fast
ollama pull mxbai-embed-large  # 1024 dimensions, higher quality

Check what’s running:

ollama list      # show downloaded models
ollama ps        # show models currently loaded in memory

Adding the Spring AI Ollama starter

<!-- Add alongside (or instead of) the OpenAI starter -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>

Profile-based model switching

The cleanest approach: keep OpenAI as the default and override with Ollama in the dev profile. Application code never changes — only the Spring profile changes.

application.yml (base — OpenAI):

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o-mini
      embedding:
        options:
          model: text-embedding-3-small

    vectorstore:
      pgvector:
        dimensions: 1536
        initialize-schema: true
        index-type: HNSW
        distance-type: COSINE_DISTANCE

application-local.yml (local dev — Ollama, no API key needed):

spring:
  ai:
    openai:
      api-key: "not-used"   # disable OpenAI auto-configuration
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama3.2:3b
          temperature: 0.2
          num-ctx: 8192       # context window size
      embedding:
        options:
          model: nomic-embed-text

    vectorstore:
      pgvector:
        dimensions: 768   # nomic-embed-text produces 768-dimensional vectors

Activate it:

export SPRING_PROFILES_ACTIVE=local
./mvnw spring-boot:run

Caution: The embedding dimensions must match what the embedding model produces, and must be consistent between indexing and querying. text-embedding-3-small produces 1536 dimensions; nomic-embed-text produces 768. If you switch embedding models, you must delete and re-create the vector store table — the old embeddings are incompatible.

Using a dedicated Ollama ChatClient bean

If both OpenAI and Ollama need to be active simultaneously (e.g., Ollama for cheap classification, OpenAI for complex reasoning), wire separate beans:

@Configuration
class AiConfig {

    // OpenAI client — for complex reasoning tasks
    @Bean
    @Qualifier("openai")
    ChatClient openAiClient(ChatClient.Builder builder) {
        return builder
                .defaultOptions(OpenAiChatOptions.builder()
                        .model("gpt-4o-mini")
                        .build())
                .build();
    }

    // Ollama client — for cheap local tasks (classification, summarization)
    @Bean
    @Qualifier("local")
    ChatClient ollamaClient(OllamaChatModel ollamaChatModel) {
        return ChatClient.builder(ollamaChatModel)
                .defaultOptions(OllamaOptions.builder()
                        .model("llama3.2:3b")
                        .temperature(0.0)
                        .build())
                .build();
    }
}

Use the local model for tasks where quality is sufficient and cost matters, and the cloud model for the final answer generation:

// Classify intent locally (cheap)
String intent = ollamaClient.prompt()
        .user("Classify as ORDER_STATUS | RETURN | PRODUCT_QUESTION | OTHER: " + message)
        .call().content().strip();

// Generate answer with cloud model if it's a complex case
if ("PRODUCT_QUESTION".equals(intent)) {
    return openAiClient.prompt().user(message).call().content();
}

Model recommendations for production use cases

Use case	Model	RAM	Quality
Classification, extraction	`llama3.2:3b`	4 GB	Good
RAG Q&A assistant	`llama3.1:8b`	8 GB	Good
Complex reasoning	`qwen2.5:14b`	16 GB	Very good
Coding assistant	`codellama:13b`	16 GB	Very good
Embeddings (fast)	`nomic-embed-text`	< 1 GB	Good
Embeddings (quality)	`mxbai-embed-large`	2 GB	Better

GPU acceleration is strongly recommended for chat models. On an Apple M-series Mac, Metal acceleration is automatic. On Linux, CUDA support is available for NVIDIA GPUs.

Tip: For local development, llama3.2:3b generates fast enough that the dev loop feels responsive. For staging and CI tests that use a local model, llama3.1:8b produces results much closer to production quality. Use the 3B for speed in local dev and the 8B for CI eval tests.

Configuring Ollama for production on-premise deployments

When deploying Ollama on-premise (not just local dev), run it as a service and point Spring AI at the remote host:

spring:
  ai:
    ollama:
      base-url: http://ollama-server.internal:11434
      chat:
        options:
          model: llama3.1:8b

Ollama does not have authentication by default. In production, place it behind a reverse proxy (nginx, Caddy) that enforces API key authentication or mTLS, and restrict network access to the application servers only.

Where local models fall short

Be honest about limitations before committing to local models:

Capability	OpenAI GPT-4o-mini	Ollama llama3.1:8b
Instruction following	Excellent	Good
Structured JSON output	Very reliable	Sometimes unreliable
Tool selection accuracy	Very reliable	Inconsistent
Long context (>8K tokens)	Good	Degrades
Non-English languages	Excellent	Model-dependent
Latency (8B on CPU)	~1-3s	~10-30s
Latency (8B on GPU)	~1-3s	~1-4s

For the TechGadgets support assistant:

Classification and extraction: local models work well
RAG Q&A with short context: local models work acceptably
Tool calling with complex reasoning: local models are noticeably less reliable
Structured output (.entity()): test carefully — smaller models produce malformed JSON more often

Important: Run your evaluation suite (from Module 7) against Ollama before committing to a local model in production. The eval pass rate tells you whether answer quality is acceptable. Expect a 10–20% drop in pass rate from frontier models to 7B local models — decide whether that tradeoff is acceptable for your use case.

The Spring AI abstraction pays off here

Because Spring AI abstracts the model behind ChatModel and EmbeddingModel interfaces, switching from OpenAI to Ollama is genuinely just a configuration change — no application code changes required. This is one of the clearest demonstrations of why the abstraction layer matters.

// This code works identically regardless of whether the underlying model
// is OpenAI, Anthropic, Ollama, or any other supported provider
String answer = chatClient.prompt()
        .user(question)
        .call()
        .content();

The controller, service layer, advisors, and tools are all provider-agnostic.

Note: The next post covers multimodal AI — adding image understanding to the support assistant. A customer attaches a photo of a damaged product and asks "is this covered under warranty?" The LLM reads both the image and the question and answers accordingly.

Running local AI models with Ollama and Spring AI — private, free, offline

Table of contents

What Ollama is

Installing Ollama

Adding the Spring AI Ollama starter

Profile-based model switching

Using a dedicated Ollama ChatClient bean

Model recommendations for production use cases

Configuring Ollama for production on-premise deployments

Where local models fall short

The Spring AI abstraction pays off here

References