Ollama Configuration

Use Ollama to run LLM and embedding inference locally, so prompts and indexed content stay on your machine.

Why Ollama?

Privacy: All processing happens locally
No API costs: Free to use after setup
Offline: Works without an internet connection once the models are downloaded
Fast: No network round trip to a remote API; actual speed depends on your hardware and the model size

Installation

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the installer from ollama.com/download/windows

Start Ollama Server

ollama serve

The server runs on http://localhost:11434 by default.
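A quick way to confirm from Python that the server is reachable is to request the /api/tags endpoint, which lists installed models. The sketch below uses only the standard library. The function name `ollama_is_running` and the 2-second timeout are illustrative choices, not part of Ollama or pygrad.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url (hypothetical helper)."""
    try:
        # /api/tags lists installed models; a 200 response means the server is up.
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as "not running".
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```

If this prints False, start the server with `ollama serve` and run the check again.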

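Pull Required Models

LLM Model

We recommend qwen3-coder for code-related documentation. Available size tags differ from model to model and change over time, so confirm that a tag exists in the Ollama model library (ollama.com/library) before pulling it: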

# Large model (30B) - Best quality
ollama pull qwen3-coder:30b

# Medium model (8B) - Good balance
ollama pull qwen3-coder:8b

# Small model (1.5B) - Fastest
ollama pull qwen3-coder:1.5b

Alternative models:

ollama pull llama3.2:latest
ollama pull mistral:latest
ollama pull codellama:latest

Embedding Model

# Recommended
ollama pull embeddinggemma:latest

# Alternatives
ollama pull nomic-embed-text:latest
ollama pull mxbai-embed-large:latest
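After pulling, it can help to confirm programmatically that both the LLM and the embedding model are installed. The following is a minimal sketch that compares a list of required names against the server's /api/tags response, assuming the response contains a "models" list whose entries have a "name" field. The helper names `missing_models` and `fetch_tags` are hypothetical.

```python
import json
import urllib.request


def missing_models(tags_response: dict, required: list[str]) -> list[str]:
    """Return the required model names absent from an /api/tags response."""
    installed = {m["name"] for m in tags_response.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the installed-model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Names must include the tag, exactly as `ollama list` prints them.
    required = ["qwen3-coder:30b", "embeddinggemma:latest"]
    try:
        print("Missing models:", missing_models(fetch_tags(), required))
    except OSError as exc:
        print("Could not reach Ollama:", exc)
```

The comparison is an exact string match, so `embeddinggemma` without a tag would be reported as missing even when `embeddinggemma:latest` is installed.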

Configuration

Create a .env file:

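# LLM Configuration
LLM_PROVIDER="ollama"
LLM_MODEL="qwen3-coder:30b"
# Ollama ignores the key, but OpenAI-compatible clients need a non-empty value
LLM_API_KEY="ollama"
LLM_ENDPOINT="http://localhost:11434/v1"

# Embedding Configuration
EMBEDDING_PROVIDER="ollama"
EMBEDDING_MODEL="embeddinggemma:latest"
EMBEDDING_ENDPOINT="http://localhost:11434/api/embed"
# Must match the model's output size (768 for embeddinggemma and nomic-embed-text, 1024 for mxbai-embed-large)
EMBEDDING_DIMENSIONS="768"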

# Tokenizer for text chunking
HUGGINGFACE_TOKENIZER="Qwen/Qwen3-Coder-30B-A3B-Instruct"

# Disable telemetry (optional)
TELEMETRY_DISABLED=true
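EMBEDDING_DIMENSIONS should equal the length of the vectors the embedding model returns. Below is a minimal sketch for checking this against the configured endpoint, assuming the response follows Ollama's /api/embed shape (an "embeddings" list of vectors). The helper names `embed` and `embedding_dimensions` are hypothetical and not part of pygrad.

```python
import json
import urllib.request


def embedding_dimensions(embed_response: dict) -> int:
    """Return the vector length from an /api/embed response body."""
    return len(embed_response["embeddings"][0])


def embed(text: str, model: str, endpoint: str) -> dict:
    """POST one input string to Ollama's /api/embed endpoint."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)


if __name__ == "__main__":
    expected = 768  # the value of EMBEDDING_DIMENSIONS in .env
    try:
        response = embed("hello", "embeddinggemma:latest", "http://localhost:11434/api/embed")
        actual = embedding_dimensions(response)
        if actual == expected:
            print("Embedding dimensions match")
        else:
            print(f"Mismatch: model returns {actual}, .env says {expected}")
    except OSError as exc:
        print("Could not reach Ollama:", exc)
```

Run this once after switching embedding models, since the alternatives do not all share the same vector size.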

Verify Setup

Test that everything is working:

import asyncio
import pygrad as pg


async def test_ollama():
    # This should work if Ollama is configured correctly
    print("Testing Ollama configuration...")

    # Index a small repository
    await pg.add("https://github.com/encode/httpx")
    print("Repository indexed successfully!")

    # Test search
    result = await pg.search(
        "https://github.com/encode/httpx",
        "How to make a GET request?"
    )
    print(f"Search result: {result[:200]}...")


asyncio.run(test_ollama())

Troubleshooting

Connection Refused

If you see "Connection refused" errors:

# Check if Ollama is running
curl http://localhost:11434/api/tags

# If not, start it
ollama serve

Model Not Found

If a model is not found:

# List installed models
ollama list

# Pull the missing model
ollama pull qwen3-coder:30b

Out of Memory

If a model does not fit in the available RAM or VRAM, switch to a smaller one:

# Use a smaller model
ollama pull qwen3-coder:8b
# Or
ollama pull qwen3-coder:1.5b

Update your .env:

LLM_MODEL="qwen3-coder:8b"

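Slow Performance

For faster inference:

Use a smaller model
Ensure you have GPU acceleration
Reduce memory pressure by limiting loaded models and parallel requests

# Keep at most one model loaded and handle one request at a time
# (set these in the environment of the `ollama serve` process)
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=1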

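Advanced Configuration

Custom Ollama Host

If Ollama runs on a different machine, point both endpoints at that host. The remote server must listen on a reachable network interface, for example by setting OLLAMA_HOST=0.0.0.0 before starting ollama serve on that machine: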

LLM_ENDPOINT="http://192.168.1.100:11434/v1"
EMBEDDING_ENDPOINT="http://192.168.1.100:11434/api/embed"
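Both endpoints share the same host and port and differ only in their path, so they can be derived from a single base URL to keep them in sync. A small sketch of that idea follows; `ollama_endpoints` is a hypothetical helper, not a pygrad or Ollama function.

```python
from urllib.parse import urlsplit


def ollama_endpoints(base_url: str) -> dict[str, str]:
    """Build both .env endpoint values from one Ollama base URL (hypothetical helper)."""
    parts = urlsplit(base_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected a URL like http://host:11434, got {base_url!r}")
    root = f"{parts.scheme}://{parts.netloc}"
    return {
        "LLM_ENDPOINT": f"{root}/v1",  # OpenAI-compatible API
        "EMBEDDING_ENDPOINT": f"{root}/api/embed",  # native Ollama embedding API
    }


if __name__ == "__main__":
    # Print lines that can be pasted into .env.
    for key, value in ollama_endpoints("http://192.168.1.100:11434").items():
        print(f'{key}="{value}"')
```

Deriving both values from one base URL avoids the common mistake of updating one endpoint and leaving the other pointed at localhost.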

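GPU Configuration

Ollama automatically uses a GPU if one is available. To verify, load a model and check where it is running:

ollama run qwen3-coder:30b
# In a second terminal: the PROCESSOR column shows the CPU/GPU split
ollama ps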

Memory Management

For systems with limited RAM, set these in the environment of the ollama serve process:

# Limit loaded models
export OLLAMA_MAX_LOADED_MODELS=1

# Set the default context window (in tokens)
export OLLAMA_CONTEXT_LENGTH=4096

Recommended Models

Use Case	Model	Size	Quality
Development	`qwen3-coder:8b`	~5GB	Good
Production	`qwen3-coder:30b`	~18GB	Best
Low memory	`qwen3-coder:1.5b`	~1GB	Basic
General	`llama3.2:latest`	~4GB	Good