How to Keep Ollama Models Loaded to Eliminate Latency (Keep-Alive Config)

The Blinking Cursor is Mocking You

You know the feeling. You type a query into your local LLM, hit enter, and then… nothing. You stare at that blinking cursor. You hear your GPU fans spin up from a dead silence. You wait six, maybe ten seconds. Finally, the first word appears.

It kills the vibe. Completely.

⚠️ DISCLAIMER & LIABILITY WAIVER

THE CONFIGURATIONS AND SCRIPTS PROVIDED IN THIS ARTICLE ARE FOR EDUCATIONAL PURPOSES. THEY ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND. MODIFICATIONS TO SYSTEM SERVICES AND GPU CONFIGURATIONS CARRY INHERENT RISKS.

USE AT YOUR OWN RISK. ALWAYS BACK UP YOUR CONFIGURATIONS BEFORE APPLYING CHANGES.

If you are building a voice assistant, a coding agent, or just a snappy chat interface, that “Cold Start” latency is the enemy. You probably found the OLLAMA_KEEP_ALIVE setting, set it to -1 (forever), and thought you were safe. But then you walked away for coffee, came back, and found the model had unloaded anyway.

Here is the hard truth: setting the environment variable is rarely enough. Between client-side API overrides, OS-level paging, and a nasty scheduler bug involving VRAM estimation, your model is looking for any excuse to quit. We are going to force it to stay.

The Physics of Latency: Why “Warm” Matters

Before we lock this down, you need to understand what is physically happening inside your rig. When a model is “cold,” Ollama has to pull gigabytes of weights from your SSD, push them through the PCIe bus, and load them into VRAM. Even with a fast NVMe drive, this is a bottleneck.

Time To First Token (TTFT) is the metric that matters here. According to performance engineering data, latency isn’t just about compute; it’s about memory bandwidth. If the model isn’t resident in the high-bandwidth memory (HBM) of your GPU, you are stuck at the speed of your PCIe lanes.
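To get a feel for why a cold start costs whole seconds, here is a back-of-the-envelope sketch. The bandwidth and model-size numbers below are illustrative assumptions, not benchmarks:

```python
# Rough cold-start estimate: how long just *moving* the weights takes,
# before any compute happens. All figures are illustrative assumptions.
weights_gb = 40.0   # ~70B model at roughly 4-bit quantization
nvme_gbps = 7.0     # fast PCIe 4.0 NVMe sequential read, GB/s
pcie_gbps = 14.0    # effective PCIe 4.0 x16 host-to-GPU bandwidth, GB/s

disk_seconds = weights_gb / nvme_gbps
pcie_seconds = weights_gb / pcie_gbps
print(f"SSD read:      ~{disk_seconds:.1f}s")
print(f"PCIe transfer: ~{pcie_seconds:.1f}s")
```

Even with generous hardware assumptions, you are several seconds in before the GPU computes a single token, which lines up with the cold-start latencies in the table.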

State          | Typical Latency (70B Model) | User Experience
Cold Start     | 6.0s – 15.0s                | Frustrating delay. Breaks flow.
Warm State     | 0.2s – 0.5s                 | Instant. Feels like magic.
Paged Out (OS) | 2.0s – 5.0s                 | The “silent killer” of performance.

The “Client Reset” Trap: Why Your Config Fails

Here is the most common scenario I see in the field. You set OLLAMA_KEEP_ALIVE=-1 on your server. You verify it. It looks good.

Then you connect a frontend like Open WebUI or a VS Code extension. Suddenly, your model starts unloading after 5 minutes again.

What happened?

Ollama operates on a hierarchy of precedence. API requests override environment variables. If your frontend sends a request with a default "keep_alive": "5m" parameter (which many do silently), the server obeys that specific request and resets the countdown timer. Your global “forever” setting gets wiped out by a single API call.
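If you control the client code, the defensive fix is to pin keep_alive in every request body yourself, so no default can override it. A minimal sketch, assuming the standard /api/generate endpoint and a hypothetical model tag:

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def build_payload(prompt, model="llama3:latest"):
    """Request body with keep_alive pinned, so it wins over any client default."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": -1,  # per-request value overrides OLLAMA_KEEP_ALIVE
    }

def generate(prompt):
    """POST the pinned payload; every call re-asserts 'stay loaded forever'."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]

print(build_payload("ping")["keep_alive"])  # -1
```

For frontends you don't control (Open WebUI, editor extensions), you can't always inject this, which is why the heartbeat fail-safe below exists.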

The “Ghost Footprint” Bug (The Unique Killer)

This is the part almost everyone misses. You might have the keep-alive set correctly, and your client might be well behaved, but the model still unloads.

Enter the VRAM Estimation Bug (tracked in GitHub Issue #10359). Ollama’s scheduler has to guess how much VRAM a model needs. If you are running a model with a large context window, the scheduler often overestimates the memory footprint.

  1. You load a model. It fits in VRAM.
  2. You send a request with a large context.
  3. The scheduler calculates the “Ghost Footprint” and panics, thinking you are out of VRAM.
  4. It triggers an immediate eviction to prevent a crash, ignoring your keep_alive setting.
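One hedged workaround is to pin the context window explicitly, so the scheduler's VRAM estimate stays bounded instead of ballooning per request. A minimal sketch using a Modelfile (the llama3-pinned tag and 4096 value are examples; size the context to your actual workload):

```
# Modelfile: pin a modest context so VRAM estimation stays predictable
FROM llama3:latest
PARAMETER num_ctx 4096
```

Build it with ollama create llama3-pinned -f Modelfile and point your clients at the pinned tag instead of the base model.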

Step 1: The “Iron Grip” Server Configuration

Forget .bashrc. If you are running Ollama as a service (which you should be), it doesn’t care about your shell variables. You need to inject the config at the systemd or Docker level.

For Linux (Systemd) Users

Do not edit the file in /lib/systemd/system. Updates will wipe your changes. Use the override system:

sudo systemctl edit ollama.service

Paste this into the editor:

[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
# Optional: Prevents loading too many models and crashing VRAM
Environment="OLLAMA_MAX_LOADED_MODELS=1"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

For Docker Users

If you are containerized, hardcode it in your docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: always
    environment:
      - OLLAMA_KEEP_ALIVE=-1
      - OLLAMA_MAX_LOADED_MODELS=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Step 2: The “Heartbeat” Sidecar (The Fail-Safe)

Since we can’t trust clients not to send their own keep_alive values, we need to play offense. I use a “Heartbeat” script: a simple Python loop that punches the API every 4 minutes with a specific instruction to stay alive forever.

Updated for Production: This script uses urllib (no pip install needed) and passes num_ctx: 1 in the request options so the heartbeat itself adds virtually no memory overhead.

import urllib.request
import urllib.error
import json
import time
from datetime import datetime

# --- CONFIGURATION ---
MODEL_ID = "llama3:latest"  # Update this to your specific model tag
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
INTERVAL = 240 # 4 Minutes (Beats the 5m default timeout)

def send_heartbeat():
    payload = {
        "model": MODEL_ID,
        "keep_alive": -1,           # Force Infinity
        "prompt": "",               # Empty prompt = No generation cost
        "options": {"num_ctx": 1}   # Minimal context = No VRAM spike
    }

    data = json.dumps(payload).encode('utf-8')
    req = urllib.request.Request(
        OLLAMA_URL, 
        data=data, 
        headers={'Content-Type': 'application/json'}
    )

    try:
        # 5-second timeout prevents script hang if server is down
        with urllib.request.urlopen(req, timeout=5) as response:
            response.read()  # Drain the body and cleanly close the connection
            print(f"[{datetime.now().strftime('%H:%M:%S')}] ❤️  Heartbeat sent to {MODEL_ID}")
    except urllib.error.HTTPError as e:
        # urllib raises for non-2xx statuses, so handle them here
        print(f"⚠️ Warning: Server returned {e.code}")
    except urllib.error.URLError as e:
        print(f"❌ Connection Error: {e.reason} (Check if Ollama is running)")
    except Exception as e:
        print(f"❌ Error: {e}")

if __name__ == "__main__":
    print(f"Starting Production Heartbeat for {MODEL_ID}...")
    while True:
        send_heartbeat()
        time.sleep(INTERVAL)

Run this in the background using nohup: nohup python3 heartbeat.py &.
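If you want the heartbeat to survive reboots, nohup is fragile; a systemd unit is more robust. A sketch, assuming you saved the script at /opt/ollama/heartbeat.py (the path and unit name are made up; adjust them):

```
# /etc/systemd/system/ollama-heartbeat.service
[Unit]
Description=Ollama keep-alive heartbeat
After=ollama.service
Wants=ollama.service

[Service]
ExecStart=/usr/bin/python3 /opt/ollama/heartbeat.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable --now ollama-heartbeat.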

Step 3: Verification (Stop Using nvidia-smi)

Most people check nvidia-smi to see if a model is loaded. That is a mistake. nvidia-smi only tells you if memory is allocated. It doesn’t tell you if the Ollama scheduler considers the model “active” or “expired.”

The scheduler might have already marked the model for death, even if VRAM hasn’t cleared yet. To see the truth, query the API:

curl http://localhost:11434/api/ps

Look for the expires_at field in the JSON response.

  • Bad: "expires_at": "2024-10-27T10:05:00Z" (It has a death date).
  • Good: "expires_at": "0001-01-01T00:00:00Z" (This is Go’s zero-value timestamp. It means the model never expires).
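This check is easy to script. A small sketch: the helper treats the zero timestamp as “pinned forever,” and check_loaded_models queries /api/ps the same way the curl command does (endpoint and field names as in the response shown above):

```python
import json
import urllib.request

def is_pinned(expires_at: str) -> bool:
    """True if expires_at is the zero timestamp, i.e. keep_alive=-1 took effect."""
    return expires_at.startswith("0001-01-01")

def check_loaded_models(base_url="http://127.0.0.1:11434"):
    """Print each loaded model and whether the scheduler will ever evict it."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        for m in json.loads(resp.read()).get("models", []):
            state = "pinned forever" if is_pinned(m["expires_at"]) else f"expires {m['expires_at']}"
            print(f"{m['name']}: {state}")

print(is_pinned("0001-01-01T00:00:00Z"))   # True
print(is_pinned("2024-10-27T10:05:00Z"))   # False
```

Drop check_loaded_models() into a cron job or the heartbeat script to get alerted the moment a model picks up a death date.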

Final Thoughts

Latency isn’t just a number; it’s a feeling. By locking the configuration at the systemd level and running a heartbeat script to fight back against rogue clients, you eliminate the variables. You stop hoping the model stays loaded, and you start knowing it will.
