TechEarl

How to Run a Local LLM with Ollama

Run a local LLM with Ollama: install, pull a model, the hardware floor, picking between Llama, Mistral, and Qwen, and when local is faster than cloud (and when it isn't).

By Ishan Karunaratne · 9 min read
Ollama makes running an LLM on your own machine as simple as ollama run llama3.3. The model downloads on first use, exposes an HTTP API on localhost:11434, and runs entirely offline. In 2026 the practical local-LLM stack is Ollama as the runtime, a chosen open-weights model (Llama 3.3, Mistral Small 3, Qwen 2.5, or a coding-specialized one like Qwen2.5-Coder), and a hardware floor of 16GB unified memory on a Mac or 12GB VRAM on a discrete GPU. I'll walk through installation, model selection, the hardware floor, and the cases where local is actually better than cloud (and the cases where it isn't).

The reason local AI is a real choice in 2026: open-weights models have caught up to mid-tier cloud models on many practical tasks, the hardware to run them is cheap, and the privacy story is honest — no token sent to anyone's server. For specific workflows (offline use, sensitive data, high-volume batch where token costs matter), local is the better answer. For everything else, cloud still wins on quality and capability.

Install Ollama

macOS:

bash
brew install ollama
ollama serve  # starts the local server on :11434

Linux:

bash
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama  # the install script registers a system-wide systemd service

Windows: download the installer from ollama.com. The Windows version uses CUDA if an NVIDIA GPU is available and falls back to CPU otherwise.

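Verify the install with ollama --version. Version numbers change often, so compare the output against the latest release listed on the Ollama GitHub releases page rather than a fixed number.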
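A quick way to confirm the server itself is up (not just the CLI) is to ask it for its version over HTTP. The sketch below assumes the default address localhost:11434 and uses the server's /api/version endpoint; the function names are illustrative, not part of any Ollama client library.

```python
import json
import urllib.request
from typing import Optional

# Assumption: the server listens on Ollama's default address.
DEFAULT_BASE_URL = "http://localhost:11434"


def parse_version(body: bytes) -> str:
    """Extract the version string from a /api/version response body."""
    return json.loads(body)["version"]


def server_version(base_url: str = DEFAULT_BASE_URL, timeout: float = 2.0) -> Optional[str]:
    """Return the running server's version, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (OSError, ValueError, KeyError):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError and KeyError cover a body that is not the expected JSON.
        return None
```

Calling server_version() should return a version string while ollama serve is running and None when nothing is listening on the port.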

Pull and run a model

The simplest happy path:

bash
ollama run llama3.3

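This downloads the model on first use and drops you into an interactive chat. Note that the llama3.3 tag is a 70B model (roughly 43GB at the default 4-bit quantization), so on a 16GB laptop substitute an 8B model such as llama3.1:8b (about 5GB). Type a message and the model streams its response. Exit with /bye.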

For specific model variants:

bash
ollama pull llama3.3:70b              # 70B parameters, needs ~40GB RAM
ollama pull mistral-small3.1:24b      # 24B Mistral
ollama pull qwen2.5-coder:32b         # coding-specialized
ollama pull deepseek-r1:7b            # reasoning model
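Pulls can also be driven over HTTP, which is handy for provisioning scripts. The sketch below assumes the native POST /api/pull endpoint, which streams one JSON status object per line; the "model" request key and the "total" and "completed" byte counters follow the Ollama API documentation as I understand it, so verify them against your installed version. The helper names are illustrative.

```python
import json
import urllib.request
from typing import Iterator, Optional


def pull_progress(line: str) -> Optional[float]:
    """Return percent complete for one status line, or None if it has no byte counts."""
    event = json.loads(line)
    total = event.get("total")
    if not total:
        return None  # e.g. {"status": "pulling manifest"} carries no sizes
    return 100.0 * event.get("completed", 0) / total


def pull(model: str, base_url: str = "http://localhost:11434") -> Iterator[str]:
    """Stream raw status lines from POST /api/pull (requires a running server)."""
    req = urllib.request.Request(
        f"{base_url}/api/pull",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # the response body is newline-delimited JSON
            if raw.strip():
                yield raw.decode("utf-8")
```

A caller would loop over pull("deepseek-r1:7b") and pass each line to pull_progress to drive a progress bar.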

List installed models:

bash
ollama list
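The same listing is available programmatically from the native GET /api/tags endpoint, which returns a "models" array with a name and a size in bytes per entry. This is a minimal sketch assuming the default port; summarize_models and installed_models are hypothetical helper names.

```python
import json
import urllib.request


def summarize_models(payload: dict) -> list[tuple[str, float]]:
    """Return (name, size in GB) pairs from a /api/tags payload, largest first."""
    rows = [(m["name"], round(m["size"] / 1e9, 1)) for m in payload.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def installed_models(base_url: str = "http://localhost:11434") -> list[tuple[str, float]]:
    """Fetch and summarize the locally installed models (requires a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return summarize_models(json.load(resp))
```

Sorting by size makes it easy to spot which models to remove first when disk space runs low.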

Remove a model:

bash
ollama rm llama3.3

The hardware floor

What you actually need to run useful models comfortably:

Setup | Memory floor | What it can run
MacBook M2/M3/M4 | 16GB unified | 7B-8B models (Llama 3.1 8B, Mistral 7B)
MacBook M3/M4 Pro/Max | 36GB+ unified | 27B-30B models with headroom
Mac Studio M3 Ultra | 64GB+ unified | 70B models in 4-bit quantization
PC with discrete GPU | 12GB VRAM | 7B models (RTX 3060 12GB minimum)
PC with high-end GPU | 24GB VRAM | 24B-30B models (RTX 4090, 3090)
Multi-GPU rig | 48GB+ total VRAM | 70B models

The Apple Silicon advantage in 2026 is real: with the unified memory architecture the GPU can address most of the system RAM (macOS keeps a share back for the OS and other apps), so a $2000 MacBook Pro with 36GB RAM can run models that need a $4000 GPU on the PC side.

CPU-only inference works but is slow — 1-3 tokens per second on a typical 8B model versus 30-100 tokens/sec on GPU. CPU is fine for "I'm trying it out" but not for real workflows.
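To measure the speed of your own setup rather than rely on ranges, you can read the timing fields Ollama attaches to the final response object of its native /api/generate and /api/chat endpoints: eval_count (generated tokens) and eval_duration (nanoseconds). The sketch below only does the arithmetic; the function names are illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed computed from a final native-API response object."""
    # eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def seconds_for(tokens: int, rate: float) -> float:
    """How long a reply of `tokens` tokens takes at `rate` tokens per second."""
    return tokens / rate
```

For a sense of scale, seconds_for(500, 2.0) is 250 seconds for a 500-token answer at CPU-like speeds, while seconds_for(500, 50.0) is 10 seconds at GPU-like speeds.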

Picking a model: Llama, Mistral, Qwen

The major open-weights families in 2026:

Llama 3.x (Meta): the workhorse. Llama 3.1 8B for laptop-tier machines, Llama 3.3 70B for serious hardware. Strong general capability, good at instruction following. The default pick if you have no specific reason to choose otherwise.

Mistral Small 3.1 (Mistral): competitive with Llama on quality, slightly faster. The 24B parameter version is a sweet spot for 32GB-40GB hardware. Particularly strong on European-language tasks.

Qwen 2.5 (Alibaba): strong general capability. The Coder variants (Qwen2.5-Coder 7B, 32B) outperform every other open-weights model on code generation as of mid-2026. If your workflow is code-heavy, use Qwen2.5-Coder.

DeepSeek-R1: reasoning-specialized. Slower per response (it generates reasoning chains before answering) but solves harder problems. Use it for the local equivalent of "Claude Opus" workloads: math, multi-step planning, complex code.

Gemma 3 (Google): competitive lightweight option, particularly the 1B and 4B variants for resource-constrained environments.

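Run ollama show <model> on a pulled model to see its exact parameter count, context length, and quantization. To check download size before pulling, look at the model's page in the Ollama library at ollama.com/library.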
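The details that ollama show prints are also exposed by the native POST /api/show endpoint for a model that is installed locally. This sketch assumes the response carries a "details" object with family, parameter_size, and quantization_level keys, as described in the Ollama API documentation; describe and show are hypothetical helper names.

```python
import json
import urllib.request


def describe(show_payload: dict) -> str:
    """One-line summary built from the 'details' object of a /api/show response."""
    details = show_payload.get("details", {})
    return " ".join(
        str(details.get(key, "?"))
        for key in ("family", "parameter_size", "quantization_level")
    )


def show(model: str, base_url: str = "http://localhost:11434") -> dict:
    """Fetch metadata for a locally installed model (requires a running server)."""
    req = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

Missing fields show up as "?" in the summary instead of raising, which keeps the helper usable across server versions.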

Calling Ollama from JavaScript and Python

Ollama exposes an OpenAI-compatible HTTP API under the /v1 path. Use any OpenAI SDK with its base URL pointed at http://localhost:11434/v1.

JavaScript:

javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",  // ignored, but the field is required
});

const response = await client.chat.completions.create({
  model: "llama3.3",
  messages: [{ role: "user", content: "Hello, world." }],
});

console.log(response.choices[0].message.content);

Python:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Hello, world."}],
)
print(response.choices[0].message.content)

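Streaming works the same way as with the cloud OpenAI API, and tool use and structured outputs are available for models that support them, though the compatibility layer does not cover every OpenAI feature. The native Ollama API at /api/generate and /api/chat is available too, with slightly different request shapes.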
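For comparison, here is a sketch of the native /api/chat request shape using only the Python standard library. It assumes the default port and sets "stream" to false so the server returns a single JSON object; the reply text then sits under message.content rather than choices[0].message.content. The helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST request for the native /api/chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a line-by-line stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a native /api/chat response object."""
    return response["message"]["content"]


def chat(model: str, prompt: str) -> str:
    """Send one prompt and return the reply (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return extract_reply(json.load(resp))
```

The native endpoint avoids the openai dependency entirely, which can matter for small scripts and locked-down environments.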

When local beats cloud (and when it doesn't)

Local is the right choice when:

  • Privacy is non-negotiable. Medical, legal, financial data that cannot leave the network. No cloud-provider trust required.
  • You're offline. Travel, airplane, ship, anything without reliable internet.
  • High-volume batch. Processing 100K documents overnight — the per-token cost drops to electricity, which is typically 1-2 orders of magnitude cheaper than cloud API.
  • You want predictable latency. Local model is local-network latency; no API rate limits, no provider outages.
  • You're learning or experimenting. No API budget needed.
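The batch-cost point can be checked with back-of-the-envelope arithmetic. The sketch below computes electricity cost per million generated tokens; the wattage, throughput, and electricity price in the example are hypothetical placeholders rather than measurements, and the calculation ignores the purchase price of the hardware.

```python
def electricity_cost_per_million_tokens(watts: float, tokens_per_second: float,
                                        usd_per_kwh: float) -> float:
    """Electricity cost in USD to generate one million tokens locally."""
    seconds = 1_000_000 / tokens_per_second   # time to generate 1M tokens
    kwh = (watts / 1000) * (seconds / 3600)   # energy drawn over that time
    return kwh * usd_per_kwh


# Hypothetical inputs: a 300 W machine producing 50 tokens/s at $0.15 per kWh.
example_cost = electricity_cost_per_million_tokens(300, 50, 0.15)
```

With those placeholder inputs the result is about $0.25 per million tokens. Plug in your own measured throughput and local electricity rate, then compare against your provider's per-million-token price.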

Cloud is the right choice when:

  • Best-in-class quality matters. Claude Opus and GPT-5 still outperform every open-weights model on hard reasoning tasks.
  • You need long context (>32K tokens reliably). Most local models degrade past 8-16K tokens; cloud models handle 200K-1M well.
  • You want zero ops. Cloud APIs require no hardware management, no model updates, no driver issues.
  • You need agentic tool use at scale. Cloud models are more reliable at correct tool selection across complex toolsets.

The honest 2026 answer: most production AI workloads run on cloud LLMs. Local LLMs are increasingly viable for specific niches, but they're not yet a wholesale replacement.

Quantization: trading quality for memory

A "70B parameter" model at full precision needs 140GB of memory (2 bytes per parameter). At 4-bit quantization, it needs ~40GB. At 2-bit, ~20GB. Less memory means smaller hardware can run bigger models.

Ollama defaults to Q4 (4-bit quantization) for most models, which gives the best size-quality trade-off in practice. Going lower (Q2, Q3) saves memory but visibly degrades output quality. Going higher (Q5, Q6, Q8) improves quality but takes more memory; the gains above Q4 are small.

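To run a specific quantization, pull its full tag, for example ollama pull llama3.3:70b-instruct-q4_K_M or llama3.3:70b-instruct-q8_0. Tag names vary by model, so check the Tags list on the model's library page for the exact spelling.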

For most setups, stick with the default. Worry about quantization only if you hit a memory ceiling and need to fit a bigger model.
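The memory figures in this section follow from one multiplication: parameters times bits per weight, divided by 8 bits per byte. The sketch below covers the weights only. The rounded figures quoted above (~40GB and ~20GB) are somewhat higher than the raw products, presumably to leave room for quantization metadata and runtime overhead such as the KV cache, so treat the output as a lower bound.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# A 70B model at 16-bit, 4-bit, and 2-bit precision.
full_precision = weight_memory_gb(70, 16)  # 140.0
four_bit = weight_memory_gb(70, 4)         # 35.0
two_bit = weight_memory_gb(70, 2)          # 17.5
```

The same function gives a first guess at whether a model fits your hardware: weight_memory_gb(24, 4) is 12.0, which lines up with 24B models being listed against 24GB cards in the hardware table.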

What to do next

External references: Ollama documentation, open-LLM-leaderboard on Hugging Face for current model rankings.

Ishan Karunaratne

Tech Architect · Software Engineer · AI/DevOps

Tech architect and software engineer with 20+ years across software, Linux systems, DevOps, and infrastructure — and a more recent focus on AI. Currently Chief Technology Officer at a tech startup in the healthcare space.
