TechEarl

How to Run a Local LLM with Ollama

Run a local LLM with Ollama: install, pull a model, the hardware floor, picking between Llama, Mistral, and Qwen, and when local is faster than cloud (and when it isn't).

Ishan Karunaratne · 9 min read

Ollama makes running an LLM on your own machine as simple as ollama run llama3.1. The model downloads on first use, is served over an HTTP API on localhost:11434, and runs entirely offline once downloaded. In 2026 the practical local-LLM stack is Ollama as the runtime, an open-weights model of your choice (Llama 3.1, Mistral Small 3.1, Qwen 2.5, or a coding-specialized one like Qwen2.5-Coder), and a hardware floor of 16GB unified memory on a Mac or 12GB VRAM on a discrete GPU. I'll walk through installation, model selection, the hardware floor, and the cases where local is actually better than cloud (and the cases where it isn't).

The reason local AI is a real choice in 2026: open-weights models have caught up to mid-tier cloud models on many practical tasks, the hardware to run them is cheap, and the privacy story is honest — no token sent to anyone's server. For specific workflows (offline use, sensitive data, high-volume batch where token costs matter), local is the better answer. For everything else, cloud still wins on quality and capability.


Install Ollama

macOS:

bash
brew install ollama
ollama serve  # starts the local server on :11434

Linux:

bash
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama   # the install script registers a system-wide service and usually starts it already

Windows: download the installer from ollama.com. The Windows build uses a supported GPU if one is available (NVIDIA through CUDA, or a supported AMD card) and falls back to CPU otherwise.

Verify with ollama --version (Ollama is on the 0.x series; 0.24.x is current as of 2026).
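If you want to check from a script that the server is actually up (not just that the binary is installed), a minimal sketch is to ask the server for its version over HTTP. This assumes the default address http://localhost:11434 and the /api/version endpoint of Ollama's native API; the function name ollama_version is mine, not part of any Ollama library.

```python
import json
import urllib.request


def ollama_version(base_url="http://localhost:11434", timeout=2.0):
    """Return the server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None


if __name__ == "__main__":
    version = ollama_version()
    print(f"Ollama {version} is running" if version else "Ollama server not reachable")
```

A None result usually means ollama serve (or the system service) is not running yet.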

Pull and run a model

The simplest happy path:

bash
ollama run llama3.1

This downloads the model (~4.9GB for the default 8B parameter version) and drops you into an interactive chat. Type a message and the model streams its response. Exit with /bye.

Note that ollama run llama3.3 is a different model: Llama 3.3 ships in a single 70B size only, so that command pulls a roughly 43GB download and needs serious hardware. For laptop-tier use you want Llama 3.1 8B, not Llama 3.3.

For specific model variants:

bash
ollama pull llama3.3                  # 70B parameters, ~43GB, needs serious hardware
ollama pull mistral-small3.1:24b      # 24B Mistral
ollama pull qwen2.5-coder:32b         # coding-specialized
ollama pull deepseek-r1:7b            # reasoning model

List installed models:

bash
ollama list

Remove a model:

bash
ollama rm llama3.1
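The same inventory is available over HTTP, which is handy when a script needs to know what is installed. The sketch below assumes the native /api/tags endpoint, which returns a models array with a name and a size in bytes for each entry; summarize_models and fetch_tags are illustrative names of my own.

```python
import json
import urllib.request


def summarize_models(tags_response):
    """Turn an /api/tags response into (name, size in GB) pairs, largest first."""
    models = tags_response.get("models", [])
    pairs = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def fetch_tags(base_url="http://localhost:11434"):
    """GET /api/tags from a running Ollama server (assumed default address)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        for name, size_gb in summarize_models(fetch_tags()):
            print(f"{name:40s} {size_gb:6.1f} GB")
    except OSError:
        print("Ollama server not reachable")
```

Sorting by size makes it easy to spot which model to remove first when disk space runs low.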

The hardware floor

What you actually need to run useful models comfortably:

| Setup | Memory floor | What it can run |
|---|---|---|
| MacBook M2/M3/M4 | 16GB unified | 7B-8B models (Llama 3.1 8B, Mistral 7B) |
| MacBook M3/M4 Pro/Max | 36GB+ unified | 27B-30B models with headroom |
| Mac Studio M3 Ultra | 64GB+ unified | 70B models in 4-bit quantization |
| PC with discrete GPU | 12GB VRAM | 7B models (RTX 3060 12GB minimum) |
| PC with high-end GPU | 24GB VRAM | 24B-30B models (RTX 4090, 3090) |
| Multi-GPU rig | 48GB+ total VRAM | 70B models |

The Apple Silicon advantage in 2026 is real: with unified memory, the GPU can address most of the system RAM (macOS keeps a share back for the system), so a roughly $2000 MacBook Pro with 36GB RAM can run models that need roughly $4000 of GPU hardware on the PC side.

CPU-only inference works but is slow — 1-3 tokens per second on a typical 8B model versus 30-100 tokens/sec on GPU. CPU is fine for "I'm trying it out" but not for real workflows.
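To see where your own machine lands in that range, you can compute tokens per second from the timing fields Ollama includes in a finished response from its native API. The sketch assumes the eval_count (generated tokens) and eval_duration (nanoseconds) fields of a completed /api/generate or /api/chat response; tokens_per_second is a helper name of my own.

```python
def tokens_per_second(response):
    """Generation speed from a finished /api/generate or /api/chat response."""
    eval_count = response["eval_count"]            # number of generated tokens
    eval_duration_ns = response["eval_duration"]   # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


if __name__ == "__main__":
    # Illustrative numbers, not a measurement: 120 tokens in 4 seconds.
    print(tokens_per_second({"eval_count": 120, "eval_duration": 4_000_000_000}))
```

From the terminal, ollama run llama3.1 --verbose prints similar timing statistics after each reply.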

Picking a model: Llama, Mistral, Qwen

The major open-weights families in 2026:

Llama (Meta): the workhorse. Llama 3.1 8B for laptop-tier; Llama 3.3, which ships in a single 70B size, for serious hardware. Strong general capability, good at instruction following. The default pick if you have no specific reason to choose otherwise.

Mistral Small 3.1 (Mistral AI): competitive with Llama on quality, slightly faster. The 24B parameter version is a sweet spot for 32GB-40GB hardware. Particularly strong on European-language tasks.

Qwen 2.5 (Alibaba): strong general capability. The Coder variants (Qwen2.5-Coder 7B, 32B) are among the strongest open-weights models for code generation at their sizes. If your workflow is code-heavy, use Qwen2.5-Coder.

DeepSeek-R1: reasoning-specialized. Slower per response (it generates reasoning chains) but solves harder problems. Use for the local equivalent of "Claude Opus" workloads: math, multi-step planning, complex code. The smaller tags such as deepseek-r1:7b are distilled versions built on Qwen and Llama base models, not the full R1 model.

Gemma 3 (Google): competitive lightweight option, particularly the 1B and 4B variants for resource-constrained environments.

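Once a model is pulled, run ollama show <model> to see its exact parameter count, quantization, and context length. To check the download size before pulling, look at the model's page in the Ollama library on ollama.com.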
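The same details can be read programmatically for a model that is already installed. This is a sketch assuming the native /api/show endpoint, which takes a JSON body with a model field and returns a details object containing family, parameter_size, and quantization_level; describe and fetch_show are hypothetical helper names.

```python
import json
import urllib.request


def describe(show_response):
    """Pick the fields that matter for model selection out of an /api/show response."""
    details = show_response.get("details", {})
    return {
        "family": details.get("family"),
        "parameters": details.get("parameter_size"),
        "quantization": details.get("quantization_level"),
    }


def fetch_show(model, base_url="http://localhost:11434"):
    """POST /api/show for a locally installed model (assumed default address)."""
    body = json.dumps({"model": model}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        print(describe(fetch_show("llama3.1")))
    except OSError:
        print("Ollama server not reachable, or the model is not installed")
```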

Calling Ollama from JavaScript and Python

Ollama exposes an OpenAI-compatible HTTP API under /v1. Point any OpenAI SDK at http://localhost:11434/v1.

JavaScript:

javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",  // ignored, but the field is required
});

const response = await client.chat.completions.create({
  model: "llama3.1",  // the 8B model; "llama3.3" is 70B and needs far more memory
  messages: [{ role: "user", content: "Hello, world." }],
});

console.log(response.choices[0].message.content);

Python:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.1",  # the 8B model; "llama3.3" is 70B and needs far more memory
    messages=[{"role": "user", "content": "Hello, world."}],
)
print(response.choices[0].message.content)

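Streaming, tool calling, and structured outputs go through the same SDK calls you would use against OpenAI's cloud API, though Ollama's compatibility layer covers only part of the OpenAI API surface and tool-calling reliability depends on the model. The native Ollama API at /api/generate and /api/chat is available too, with slightly different request shapes.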
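One of those differences is streaming: the native /api/chat endpoint streams newline-delimited JSON objects rather than the server-sent events the OpenAI-style endpoint uses. A minimal sketch of reassembling the reply, assuming each streamed line carries a message.content fragment and the final line sets done to true (collect_stream is an illustrative name):

```python
import json


def collect_stream(lines):
    """Join the text fragments of a streamed /api/chat response (one JSON object per line)."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final object carries timing stats, not more text
    return "".join(parts)


if __name__ == "__main__":
    # Hand-written sample lines in the assumed shape, not captured server output.
    sample = [
        b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}\n',
        b'{"message": {"role": "assistant", "content": "lo"}, "done": true}\n',
    ]
    print(collect_stream(sample))
```

In a real call you would pass the open HTTP response object as lines, since Python's urllib response objects can be iterated line by line.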

When local beats cloud (and when it doesn't)

Local is the right choice when:

  • Privacy is non-negotiable. Medical, legal, financial data that cannot leave the network. No cloud-provider trust required.
  • You're offline. Travel, airplane, ship, anything without reliable internet.
  • High-volume batch. Processing 100K documents overnight — the per-token cost drops to electricity, which is typically 1-2 orders of magnitude cheaper than cloud API.
  • You want predictable latency. Requests never leave your machine, so there is no network round trip, no API rate limit, and no provider outage to wait out.
  • You're learning or experimenting. No API budget needed.
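The batch-cost point is easy to sanity-check with arithmetic. The sketch below estimates the electricity cost of generating one million tokens locally. Every input (power draw, throughput, electricity price) is an assumption you should replace with your own numbers, and the estimate deliberately ignores the cost of the hardware itself.

```python
def electricity_cost_per_million_tokens(watts, tokens_per_second, usd_per_kwh):
    """Electricity cost in USD to generate one million tokens locally."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts * seconds / 3600 / 1000  # watt-seconds -> watt-hours -> kWh
    return kwh * usd_per_kwh


if __name__ == "__main__":
    # Assumed figures for illustration only: 300W under load, 40 tokens/sec, $0.15 per kWh.
    cost = electricity_cost_per_million_tokens(300, 40, 0.15)
    print(f"${cost:.2f} per million tokens")
```

Compare the result against your cloud provider's current per-million-token price to see whether the gap holds for your workload.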

Cloud is the right choice when:

  • Best-in-class quality matters. Claude Opus and GPT-5 still outperform every open-weights model on hard reasoning tasks.
  • You need long context (>32K tokens reliably). Most local models degrade past 8-16K tokens; cloud models handle 200K-1M well.
  • You want zero ops. Cloud APIs require no hardware management, no model updates, no driver issues.
  • You need agentic tool use at scale. Cloud models are more reliable at correct tool selection across complex toolsets.

The honest 2026 answer: most production AI workloads run on cloud LLMs. Local LLMs are increasingly viable for specific niches, but they're not yet a wholesale replacement.

Quantization: trading quality for memory

A "70B parameter" model at 16-bit precision (FP16 or BF16) needs about 140GB of memory for the weights alone (2 bytes per parameter). At 4-bit quantization, it needs ~40GB. At 2-bit, ~20GB. Less memory per parameter means smaller hardware can run bigger models.
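The arithmetic behind those figures is just parameters times bits per weight. A small sketch, with the caveat that it gives the nominal weight size only: real quantization formats store extra scaling data, and the runtime needs additional memory for the context cache, which is why practical figures come out somewhat higher than the nominal ones.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Nominal memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for bits in (16, 8, 4, 2):
        print(f"70B at {bits}-bit: {weight_memory_gb(70, bits):.1f} GB nominal")
```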

Ollama defaults to Q4 (4-bit quantization) for most models, which gives the best size-quality trade-off in practice. Going lower (Q2, Q3) saves memory but visibly degrades output quality. Going higher (Q5, Q6, Q8) improves quality but takes more memory; the gains above Q4 are small.

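To run a specific quantization, pull its tag explicitly, for example ollama pull llama3.3:70b-instruct-q4_K_M or llama3.3:70b-instruct-q8_0. Tag names vary by model, so check the model's Tags page on ollama.com for the exact spelling.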

For most setups, stick with the default. Worry about quantization only if you hit a memory ceiling and need to fit a bigger model.

What to do next

External references: the Ollama documentation, and the Open LLM Leaderboard on Hugging Face for model rankings.


Ishan Karunaratne

Software Systems Architect · Senior Software Engineer · Engineering Leadership

Software systems architect and senior software engineer with more than two decades designing, building, and running production software, Linux systems, and DevOps infrastructure, and lately working AI into the stack. Now a CTO, though what I write here is drawn from the full arc of that work, across architecture, engineering, and operations, not any single job.
