How to Run a Local LLM with Ollama (2026)

Ollama makes running an LLM on your own machine as simple as ollama run llama3.1. The model downloads on first use, Ollama exposes an HTTP API on localhost:11434, and everything runs offline once the download finishes. In 2026 the practical local-LLM stack is Ollama as the runtime, a chosen open-weights model (Llama 3.1, Mistral Small 3.1, Qwen 2.5, or a coding-specialized one like Qwen2.5-Coder), and a hardware floor of 16GB unified memory on a Mac or 12GB VRAM on a discrete GPU. I'll walk through installation, model selection, the hardware floor, and the cases where local is actually better than cloud (and the cases where it isn't).

The reason local AI is a real choice in 2026: open-weights models have caught up to mid-tier cloud models on many practical tasks, the hardware to run them is cheap, and the privacy story is honest — no token sent to anyone's server. For specific workflows (offline use, sensitive data, high-volume batch where token costs matter), local is the better answer. For everything else, cloud still wins on quality and capability.

Jump to:

Install Ollama
Pull and run a model
The hardware floor
Picking a model: Llama, Mistral, Qwen
Calling Ollama from JavaScript and Python
When local beats cloud (and when it doesn't)
Quantization: trading quality for memory
What to do next
FAQ

Install Ollama

macOS:

bash

brew install ollama
ollama serve  # starts the local server on :11434

Linux:

bash

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama  # the install script registers a system-wide systemd service

Windows: download the installer from ollama.com. The Windows version uses CUDA if available and falls back to CPU otherwise.

Verify with ollama --version (Ollama is on the 0.x series; 0.24.x is current as of 2026).
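To confirm that the server itself is up (and not just the CLI), one option is to query it over HTTP. The sketch below assumes the default address http://localhost:11434 and the native /api/version endpoint; the helper names parse_version and server_version are illustrative and not part of any Ollama SDK.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def parse_version(body: bytes) -> str:
    """Extract the version string from an /api/version response body."""
    return json.loads(body)["version"]


def server_version(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the running server's version, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: treat all as "not running"
        return None
```

Calling print(server_version() or "Ollama is not running") gives a quick health check that can go at the top of any script that depends on the local server.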

Pull and run a model

The simplest happy path:

bash

ollama run llama3.1

This downloads the model (~4.7GB for the default 8B parameter version) and drops you into an interactive chat. Type a message and the model streams its response. Exit with /bye.

Note that ollama run llama3.3 is a different model: Llama 3.3 ships in a single 70B size only, so that command pulls a roughly 43GB download and needs serious hardware. For laptop-tier use you want Llama 3.1 8B, not Llama 3.3.

For specific model variants:

bash

ollama pull llama3.3                  # 70B parameters, ~43GB, needs serious hardware
ollama pull mistral-small3.1:24b      # 24B Mistral
ollama pull qwen2.5-coder:32b         # coding-specialized
ollama pull deepseek-r1:7b            # reasoning model

List installed models:

bash

ollama list

Remove a model:

bash

ollama rm llama3.1
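The same inventory is available programmatically, which helps when a script needs to check what is on disk before running. This is a minimal sketch assuming the native /api/tags endpoint on the default port and a response shaped as a "models" list whose entries carry "name" and "size" (in bytes); the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def summarize_models(tags_response: dict) -> list[str]:
    """Turn an /api/tags response into 'name  size' lines, largest first."""
    models = sorted(tags_response.get("models", []), key=lambda m: m["size"], reverse=True)
    return [f"{m['name']}  {m['size'] / 1e9:.1f} GB" for m in models]


def installed_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Fetch the installed-model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return summarize_models(json.load(resp))
```

Sorting by size puts the models that are most worth removing at the top when disk space runs short.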

The hardware floor

What you actually need to run useful models comfortably:

Setup	Memory floor	What it can run
MacBook M2/M3/M4	16GB unified	7B-8B models (Llama 3.1 8B, Mistral 7B)
MacBook M3/M4 Pro/Max	36GB+ unified	27B-30B models with headroom
Mac Studio M3 Ultra	64GB+ unified	70B models in 4-bit quantization
PC with discrete GPU	12GB VRAM	7B models (RTX 3060 12GB minimum)
PC with high-end GPU	24GB VRAM	24B-30B models (RTX 4090, 3090)
Multi-GPU rig	48GB+ total VRAM	70B models

The Apple Silicon advantage in 2026 is real: with unified memory the GPU shares system RAM (macOS lets it use most of it, though not all), so a $2000 MacBook Pro with 36GB RAM can run models that need a $4000 GPU on the PC side.

CPU-only inference works but is slow — 1-3 tokens per second on a typical 8B model versus 30-100 tokens/sec on GPU. CPU is fine for "I'm trying it out" but not for real workflows.
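To make those rates concrete, here is a small worked calculation. It assumes a 400-token answer (an illustrative length, roughly a few paragraphs) and ignores prompt-processing time, so treat the results as rough lower bounds rather than measurements.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` output tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# Rates taken from the ranges quoted above; 400 tokens is an assumed answer length
for label, rate in [("CPU, slow end", 1), ("CPU, fast end", 3),
                    ("GPU, slow end", 30), ("GPU, fast end", 100)]:
    print(f"{label:14s} {generation_seconds(400, rate):6.1f} s")
```

Under these assumptions the CPU case lands between about two and seven minutes per answer, while the GPU case stays under 15 seconds, which is the practical difference between "trying it out" and an interactive workflow.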

Picking a model: Llama, Mistral, Qwen

The major open-weights families in 2026:

Llama (Meta): the workhorse. Llama 3.1 8B for laptop-tier; Llama 3.3, which ships in a single 70B size, for serious hardware. Strong general capability, good at instruction following. The default pick if you have no specific reason to choose otherwise.

Mistral Small 3.1 (Mistral): competitive with Llama on quality, slightly faster. At 24B parameters it is a sweet spot for 32GB-40GB hardware. Particularly strong on European-language tasks.

Qwen 2.5 (Alibaba): strong general capability. The Coder variants (Qwen2.5-Coder 7B, 32B) outperform every other open-weights model on code generation as of mid-2026. If your workflow is code-heavy, use Qwen2.5-Coder.

DeepSeek-R1: reasoning-specialized. Slower per response (it generates reasoning chains) but solves harder problems. Use for the local equivalent of "Claude Opus" workloads: math, multi-step planning, complex code.

Gemma 3 (Google): competitive lightweight option, particularly the 1B and 4B variants for resource-constrained environments.

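After pulling, run ollama show <model> to see the exact parameter count, quantization, and context length. To check the download size before pulling, look at the model's page in the Ollama library.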
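For models already on disk, the same details can be read over HTTP. The sketch below assumes the native /api/show endpoint accepts a JSON body with a "model" field and returns a "details" object containing "family", "parameter_size", and "quantization_level"; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def summarize_show(show_response: dict) -> str:
    """Condense an /api/show response into one line of key facts."""
    details = show_response.get("details", {})
    return (
        f"family={details.get('family', '?')} "
        f"params={details.get('parameter_size', '?')} "
        f"quant={details.get('quantization_level', '?')}"
    )


def show_model(name: str, base_url: str = OLLAMA_URL) -> str:
    """Ask a running Ollama server about a model that is already pulled."""
    req = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": name}).encode(),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return summarize_show(json.load(resp))
```

Missing fields fall back to "?" so the summary still prints when a model omits part of its metadata.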

Calling Ollama from JavaScript and Python

Ollama exposes an OpenAI-compatible HTTP API under the /v1 path. Point any OpenAI SDK at http://localhost:11434/v1.

JavaScript:

javascript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",  // ignored, but the field is required
});

const response = await client.chat.completions.create({
  model: "llama3.1",
  messages: [{ role: "user", content: "Hello, world." }],
});

console.log(response.choices[0].message.content);

Python:

python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello, world."}],
)
print(response.choices[0].message.content)

Streaming, tool use, and structured outputs work the same way as with the cloud OpenAI API, provided the model you run supports them. The native Ollama API at /api/generate and /api/chat is also available, with slightly different request shapes.
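To illustrate the native request shape, here is a standard-library sketch of streaming from /api/chat. It assumes the stream arrives as newline-delimited JSON objects, each with a "message" holding a "content" fragment and a "done" flag on the final object; the function names are illustrative.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON /api/chat stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # the final object marks the end of the reply


def stream_chat(prompt: str, model: str = "llama3.1", base_url: str = OLLAMA_URL) -> Iterator[str]:
    """Send one user message to /api/chat and yield the reply as it arrives."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}], "stream": True}
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_content(resp)  # the response object iterates line by line
```

A typical call is for piece in stream_chat("Hello, world."): print(piece, end="", flush=True). Keeping the parsing in iter_content, separate from the network call, makes the stream handling testable without a running server.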

When local beats cloud (and when it doesn't)

Local is the right choice when:

Privacy is non-negotiable. Medical, legal, financial data that cannot leave the network. No cloud-provider trust required.
You're offline. Travel, airplane, ship, anything without reliable internet.
High-volume batch. Processing 100K documents overnight: the marginal per-token cost drops to electricity, which is typically 1-2 orders of magnitude cheaper than cloud API pricing.
You want predictable latency. Requests never leave your machine or local network, so there are no API rate limits and no provider outages.
You're learning or experimenting. No API budget needed.
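The electricity claim in the batch case can be checked with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the 300 W power draw, 50 tokens/s throughput, and 0.15 per kWh price are hypothetical inputs, and the calculation ignores hardware purchase cost and idle power.

```python
def electricity_cost_per_million_tokens(
    power_watts: float, tokens_per_second: float, price_per_kwh: float
) -> float:
    """Electricity cost of generating one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second
    kwh = power_watts * seconds / 3_600_000  # watt-seconds -> kilowatt-hours
    return kwh * price_per_kwh


# Hypothetical inputs: 300 W draw, 50 tokens/s, 0.15 per kWh
print(round(electricity_cost_per_million_tokens(300, 50, 0.15), 2))  # prints 0.25
```

Substitute your own measured power draw, throughput, and tariff, then compare the result with your cloud provider's current per-million-token price to see where the break-even sits for your workload.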

Cloud is the right choice when:

Best-in-class quality matters. Claude Opus and GPT-5 still outperform every open-weights model on hard reasoning tasks.
You need long context (>32K tokens reliably). Most local models degrade past 8-16K tokens; cloud models handle 200K-1M well.
You want zero ops. Cloud APIs require no hardware management, no model updates, no driver issues.
You need agentic tool use at scale. Cloud models are more reliable at correct tool selection across complex toolsets.

The honest 2026 answer: most production AI workloads run on cloud LLMs. Local LLMs are increasingly viable for specific niches, but they're not yet a wholesale replacement.

Quantization: trading quality for memory

A "70B parameter" model at 16-bit precision needs about 140GB of memory for its weights (2 bytes per parameter). At 4-bit quantization it needs ~40GB, and at 2-bit ~20GB. Less memory per parameter means smaller hardware can run bigger models.
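The arithmetic behind those figures is parameters times bits per weight, divided by eight. The sketch below computes the raw weight size only; the 4-bit and 2-bit results come out a little below the rounded figures quoted above, which is consistent with real quantized files spending slightly more than the nominal bits per weight and with the runtime needing extra memory for the context.

```python
def weight_memory_gb(parameters: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9


# Raw weight sizes for a 70B-parameter model at several precisions
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):6.1f} GB")
```

For a 70B model this gives 140 GB at 16-bit, 70 GB at 8-bit, 35 GB at 4-bit, and 17.5 GB at 2-bit, so it is worth leaving several gigabytes of headroom beyond the raw number when sizing hardware.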

Ollama defaults to Q4 (4-bit quantization) for most models, which gives the best size-quality trade-off in practice. Going lower (Q2, Q3) saves memory but visibly degrades output quality. Going higher (Q5, Q6, Q8) improves quality but takes more memory; the gains above Q4 are small.

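To run a specific quantization, pull its tag, for example ollama pull llama3.3:70b-instruct-q4_K_M or llama3.3:70b-instruct-q8_0. The model's Tags page in the Ollama library lists the quantizations that are actually published.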

For most setups, stick with the default. Worry about quantization only if you hit a memory ceiling and need to fit a bigger model.

What to do next

For combining local LLMs with the broader AI stack:

How to Build an LLM Agent with Tool Use — Ollama supports tool use; the agentic loop is identical to cloud.
How to Build RAG with Embeddings and Vector Search — local embedding models (BGE, E5) pair well with local LLMs for full-stack on-device RAG.
How to Get Reliable JSON from an LLM — Ollama supports structured outputs via the format parameter.

For the choice between local and cloud for specific workloads:

How to Choose Between Claude Haiku, Sonnet, and Opus — covers the cloud-tier decision; pair with this article for the local-vs-cloud decision.

External references: Ollama documentation; Open LLM Leaderboard on Hugging Face for current model rankings.

FAQ

What's the minimum hardware to run a local LLM?

16GB unified memory on a Mac (M2 or later), or 12GB VRAM on a discrete GPU. That floor runs 7-8B parameter models comfortably: Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B.

CPU-only inference works but is slow (1-3 tokens per second). Acceptable for "trying it out", not for real workflows.

Is Llama 3.3 better than Mistral or Qwen in 2026?

Different strengths. Llama 3.3 is the most general-purpose. Mistral Small 3.1 is slightly faster and stronger on European languages. Qwen 2.5-Coder dominates open-weights code generation.

Try 2-3 on your actual workload before committing. The performance gap between the top three on most tasks is smaller than the gap to cloud models.

Can a local LLM replace Claude or GPT for production work?

For specific workloads, yes: sensitive-data processing, high-volume batch, offline use cases. For "Claude Opus quality on hard reasoning", no: open-weights models haven't matched the frontier cloud models yet.

The honest middle ground in 2026: use local for the workloads where local wins on cost, privacy, or latency; use cloud for the workloads where quality matters most.

How fast is Ollama compared to a cloud API?

On a MacBook M3 Pro with a 7B model, expect 50-80 tokens per second, comparable to or faster than cloud APIs for short responses. On a 70B model, expect 5-15 tokens per second, which is slower than cloud.

Latency to first token is much better locally (no network round-trip), which feels snappier in interactive UX.

Does Ollama support tool use and structured outputs?

Yes. Ollama implements an OpenAI-compatible API, so the standard tools parameter for tool use and response_format for structured outputs both work. The underlying model has to support these capabilities; Llama 3.3, Qwen 2.5, and Mistral Small 3.1 all do well.

Smaller models (under 7B) are less reliable at tool selection across many tools. Stick to 2-5 tools per agent when running locally.

Ishan Karunaratne