
Running local AI models with Ollama and Spring AI — private, free, offline

Dev’s team had a new requirement: a customer wanted their support assistant deployed on-premise. No data was allowed to leave their network. OpenAI was out. So was every cloud provider.

The answer was Ollama.


What Ollama is

Ollama is an open-source tool that runs large language models locally. It manages model downloads, GPU/CPU inference, and exposes an OpenAI-compatible HTTP API on localhost:11434.

From Spring AI’s perspective, Ollama looks almost identical to OpenAI. The same ChatClient API works unchanged — only the underlying model and base URL differ. Switch the profile, switch the provider.
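You can verify that compatibility directly from the command line before touching any Java code. A quick sketch, assuming `ollama serve` is running and `llama3.2:3b` has already been pulled — note the request body is plain OpenAI chat-completions format, just pointed at localhost:

```shell
# Hit Ollama's OpenAI-compatible endpoint (no API key required)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
```

The response comes back in the same JSON shape an OpenAI client expects, which is exactly why Spring AI can swap providers underneath unchanged application code.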

Installing Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start the server
ollama serve

Pull the models you need:

# Chat models
ollama pull llama3.2:3b        # fast, good for dev
ollama pull llama3.1:8b        # better quality, needs ~8GB RAM
ollama pull qwen2.5:7b         # strong reasoning
ollama pull mistral:7b         # solid general-purpose

# Embedding models
ollama pull nomic-embed-text   # 768 dimensions, fast
ollama pull mxbai-embed-large  # 1024 dimensions, higher quality

Check what’s running:

ollama list      # show downloaded models
ollama ps        # show models currently loaded in memory

Adding the Spring AI Ollama starter

<!-- Add alongside (or instead of) the OpenAI starter -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>

Profile-based model switching

The cleanest approach: keep OpenAI as the default and override with Ollama in the dev profile. Application code never changes — only the Spring profile changes.

application.yml (base — OpenAI):

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o-mini
      embedding:
        options:
          model: text-embedding-3-small

    vectorstore:
      pgvector:
        dimensions: 1536
        initialize-schema: true
        index-type: HNSW
        distance-type: COSINE_DISTANCE

application-local.yml (local dev — Ollama, no API key needed):

spring:
  ai:
    openai:
      api-key: "not-used"   # dummy value; the OpenAI auto-configuration still loads but is never called
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama3.2:3b
          temperature: 0.2
          num-ctx: 8192       # context window size
      embedding:
        options:
          model: nomic-embed-text

    vectorstore:
      pgvector:
        dimensions: 768   # nomic-embed-text produces 768-dimensional vectors

Activate it:

export SPRING_PROFILES_ACTIVE=local
./mvnw spring-boot:run

Caution: The embedding dimensions must match what the embedding model produces, and must be consistent between indexing and querying. text-embedding-3-small produces 1536 dimensions; nomic-embed-text produces 768. If you switch embedding models, you must delete and re-create the vector store table — the old embeddings are incompatible.
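In practice that re-creation is a one-line operation. A sketch assuming Spring AI's default pgvector table name (`vector_store`) — adjust if you customized it:

```sql
-- Drop the old embeddings; they are incompatible with the new model
DROP TABLE IF EXISTS vector_store;
-- Restart the app with initialize-schema: true and the new dimensions
-- value; the table is recreated, then re-run your document ingestion.
```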

Using a dedicated Ollama ChatClient bean

If both OpenAI and Ollama need to be active simultaneously (e.g., Ollama for cheap classification, OpenAI for complex reasoning), wire separate beans:

@Configuration
class AiConfig {

    // OpenAI client — for complex reasoning tasks
    @Bean
    @Qualifier("openai")
    ChatClient openAiClient(ChatClient.Builder builder) {
        return builder
                .defaultOptions(OpenAiChatOptions.builder()
                        .model("gpt-4o-mini")
                        .build())
                .build();
    }

    // Ollama client — for cheap local tasks (classification, summarization)
    @Bean
    @Qualifier("local")
    ChatClient ollamaClient(OllamaChatModel ollamaChatModel) {
        return ChatClient.builder(ollamaChatModel)
                .defaultOptions(OllamaOptions.builder()
                        .model("llama3.2:3b")
                        .temperature(0.0)
                        .build())
                .build();
    }
}

Use the local model for tasks where quality is sufficient and cost matters, and the cloud model for the final answer generation:

// Classify intent locally (cheap). The two ChatClient fields are
// injected with @Qualifier("local") / @Qualifier("openai") from AiConfig.
String intent = ollamaClient.prompt()
        .user("Classify as ORDER_STATUS | RETURN | PRODUCT_QUESTION | OTHER: " + message)
        .call().content().strip();

// Generate answer with cloud model if it's a complex case
if ("PRODUCT_QUESTION".equals(intent)) {
    return openAiClient.prompt().user(message).call().content();
}

Model recommendations for production use cases

| Use case | Model | RAM | Quality |
| --- | --- | --- | --- |
| Classification, extraction | llama3.2:3b | 4 GB | Good |
| RAG Q&A assistant | llama3.1:8b | 8 GB | Good |
| Complex reasoning | qwen2.5:14b | 16 GB | Very good |
| Coding assistant | codellama:13b | 16 GB | Very good |
| Embeddings (fast) | nomic-embed-text | < 1 GB | Good |
| Embeddings (quality) | mxbai-embed-large | 2 GB | Better |

GPU acceleration is strongly recommended for chat models. On an Apple M-series Mac, Metal acceleration is automatic. On Linux, CUDA support is available for NVIDIA GPUs.

Tip: For local development, llama3.2:3b generates fast enough that the dev loop feels responsive. For staging and CI tests that use a local model, llama3.1:8b produces results much closer to production quality. Use the 3B for speed in local dev and the 8B for CI eval tests.
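That split maps naturally onto another profile override. A hypothetical application-ci.yml that keeps the local Ollama setup but swaps in the 8B model for eval tests:

```yaml
# application-ci.yml — same Ollama setup as local, larger model for evals
spring:
  ai:
    ollama:
      chat:
        options:
          model: llama3.1:8b
```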

Configuring Ollama for production on-premise deployments

When deploying Ollama on-premise (not just local dev), run it as a service and point Spring AI at the remote host:

spring:
  ai:
    ollama:
      base-url: http://ollama-server.internal:11434
      chat:
        options:
          model: llama3.1:8b

Ollama does not have authentication by default. In production, place it behind a reverse proxy (nginx, Caddy) that enforces API key authentication or mTLS, and restrict network access to the application servers only.
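As a sketch of the reverse-proxy approach — an nginx server block that rejects requests missing a static API key header before forwarding to Ollama (the key value, header name, and certificate paths are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name ollama-server.internal;

    # TLS certificate paths are placeholders
    ssl_certificate     /etc/nginx/tls/ollama.crt;
    ssl_certificate_key /etc/nginx/tls/ollama.key;

    location / {
        # Reject requests without the expected (placeholder) API key
        if ($http_x_api_key != "change-me") {
            return 401;
        }
        proxy_pass http://127.0.0.1:11434;
    }
}
```

Combine this with firewall rules so only the application servers can reach the proxy at all; the key check is defense in depth, not the only barrier.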

Where local models fall short

Be honest about limitations before committing to local models:

| Capability | OpenAI GPT-4o-mini | Ollama llama3.1:8b |
| --- | --- | --- |
| Instruction following | Excellent | Good |
| Structured JSON output | Very reliable | Sometimes unreliable |
| Tool selection accuracy | Very reliable | Inconsistent |
| Long context (>8K tokens) | Good | Degrades |
| Non-English languages | Excellent | Model-dependent |
| Latency (8B on CPU) | ~1-3s | ~10-30s |
| Latency (8B on GPU) | ~1-3s | ~1-4s |

Important: For the TechGadgets support assistant, run your evaluation suite (from Module 7) against Ollama before committing to a local model in production. The eval pass rate tells you whether answer quality is acceptable. Expect a 10–20% drop in pass rate from frontier models to 7B local models — decide whether that tradeoff is acceptable for your use case.

The Spring AI abstraction pays off here

Because Spring AI abstracts the model behind ChatModel and EmbeddingModel interfaces, switching from OpenAI to Ollama is genuinely just a configuration change — no application code changes required. This is one of the clearest demonstrations of why the abstraction layer matters.

// This code works identically regardless of whether the underlying model
// is OpenAI, Anthropic, Ollama, or any other supported provider
String answer = chatClient.prompt()
        .user(question)
        .call()
        .content();

The controller, service layer, advisors, and tools are all provider-agnostic.

Note: The next post covers multimodal AI — adding image understanding to the support assistant. A customer attaches a photo of a damaged product and asks "is this covered under warranty?" The LLM reads both the image and the question and answers accordingly.
