Ollama makes running an LLM on your own machine as simple as ollama run llama3.3. The model downloads on first use, Ollama serves an HTTP API on localhost:11434, and after that first download everything runs offline. In 2026 the practical local-LLM stack is Ollama as the runtime, a chosen open-weights model (Llama 3.3, Mistral Small 3, Qwen 2.5, or a coding-specialized one like Qwen2.5-Coder), and a hardware floor of 16GB unified memory on a Mac or 12GB VRAM on a discrete GPU. I'll walk through installation, model selection, the hardware floor, and the cases where local is actually better than cloud (and the cases where it isn't).
The reason local AI is a real choice in 2026: open-weights models have caught up to mid-tier cloud models on many practical tasks, the hardware to run them is cheap, and the privacy story is honest — no token sent to anyone's server. For specific workflows (offline use, sensitive data, high-volume batch where token costs matter), local is the better answer. For everything else, cloud still wins on quality and capability.
Jump to:
- Install Ollama
- Pull and run a model
- The hardware floor
- Picking a model: Llama, Mistral, Qwen
- Calling Ollama from JavaScript and Python
- When local beats cloud (and when it doesn't)
- Quantization: trading quality for memory
- FAQ
Install Ollama
macOS:
brew install ollama
ollama serve # starts the local server on :11434
Linux:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama # the install script registers a system-wide service
Windows: download the installer from ollama.com. The Windows version uses CUDA if available and falls back to CPU otherwise.
Verify with ollama --version (3.0+ is current as of 2026).
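To confirm the server itself is up (not just the CLI), you can ask it for its version over HTTP. This is a minimal sketch, assuming the native GET /api/version endpoint on the default port; the function name server_version is mine, not part of Ollama.

```python
import json
import urllib.request


def server_version(base_url="http://localhost:11434", timeout=2.0):
    """Return the Ollama server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a response body that is not valid JSON.
        return None
```

Calling print(server_version() or "Ollama is not running") gives a quick yes/no before you script anything against the API.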
Pull and run a model
The simplest happy path:
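ollama run llama3.3
This downloads the model on first use and drops you into an interactive chat. Note that the llama3.3 tag is a 70B model (a download of roughly 40GB at the default 4-bit quantization); on a 16GB laptop, start with an 8B model such as llama3.1:8b (about 5GB) instead. Type a message and the model streams its response. Exit with /bye.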
For specific model variants:
ollama pull llama3.3:70b # 70B parameters, needs ~40GB RAM
ollama pull mistral-small3.1:24b # 24B Mistral
ollama pull qwen2.5-coder:32b # coding-specialized
ollama pull deepseek-r1:7b # reasoning model
List installed models:
ollama list
Remove a model:
ollama rm llama3.3
The hardware floor
What you actually need to run useful models comfortably:
| Setup | Memory floor | What it can run |
|---|---|---|
| MacBook M2/M3/M4 | 16GB unified | 7B-8B models (Llama 3.1 8B, Mistral 7B) |
| MacBook M3/M4 Pro/Max | 36GB+ unified | 27B-30B models with headroom |
| Mac Studio M3 Ultra | 64GB+ unified | 70B models in 4-bit quantization |
| PC with discrete GPU | 12GB VRAM | 7B models (RTX 3060 12GB minimum) |
| PC with high-end GPU | 24GB VRAM | 24B-30B models (RTX 4090, 3090) |
| Multi-GPU rig | 48GB+ total VRAM | 70B models |
The Apple Silicon advantage in 2026 is real: the unified memory architecture means the GPU can address the full system RAM, so a $2000 MacBook Pro with 36GB RAM can run models that need a $4000 GPU on the PC side.
CPU-only inference works but is slow — 1-3 tokens per second on a typical 8B model versus 30-100 tokens/sec on GPU. CPU is fine for "I'm trying it out" but not for real workflows.
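To see where your own machine lands in that range, you can compute tokens per second from the timing counters in a native API response. This is a minimal sketch, assuming the eval_count and eval_duration fields (the latter in nanoseconds) as I understand them from the Ollama API docs; the sample response values are made up for illustration.

```python
def tokens_per_second(response):
    """Generation speed from a native Ollama response (durations are in nanoseconds)."""
    eval_count = response.get("eval_count", 0)
    eval_duration_ns = response.get("eval_duration", 0)
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical counters: 120 tokens generated in 4 seconds.
sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/sec")
```

Run it against a few real responses rather than one, since the first request after loading a model tends to be slower.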
Picking a model: Llama, Mistral, Qwen
The major open-weights families in 2026:
Llama 3.3 (Meta): the workhorse. It ships as a 70B model for serious hardware; for laptop-tier machines, the 8B option in the same family is Llama 3.1. Strong general capability, good at instruction following. The default pick if you have no specific reason to choose otherwise.
Mistral Small 3.1 (Mistral) — competitive with Llama on quality, slightly faster. The 24B parameter version is a sweet spot for 32GB-40GB hardware. Particularly strong on European-language tasks.
Qwen 2.5 (Alibaba) — strong general capability. The Coder variants (Qwen2.5-Coder 7B, 32B) outperform every other open-weights model on code generation as of mid-2026. If your workflow is code-heavy, use Qwen2.5-Coder.
DeepSeek-R1 — reasoning-specialised. Slower per response (it generates reasoning chains) but solves harder problems. Use for the local equivalent of "Claude Opus" workloads — math, multi-step planning, complex code.
Gemma 3 (Google): a competitive lightweight option, particularly the 1B and 4B variants for resource-constrained environments.
Once a model is pulled, run ollama show <model> to see its exact parameter count, quantization, and context length. For the download size before pulling, check the model's page on ollama.com.
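The same details are available over HTTP, which is handy if you want to check a model from a script. This is a minimal sketch, assuming the native POST /api/show endpoint and a details object with family, parameter_size, and quantization_level fields; the helper names are hypothetical, and the field names are my reading of the API docs rather than a guarantee.

```python
import json
import urllib.request


def show_request(model, base_url="http://localhost:11434"):
    """Build the POST request for the native /api/show endpoint."""
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/show",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_details(payload):
    """Pull the size-related fields out of an /api/show response."""
    details = payload.get("details", {})
    return {
        "family": details.get("family"),
        "parameter_size": details.get("parameter_size"),
        "quantization": details.get("quantization_level"),
    }


def show_model(model, base_url="http://localhost:11434"):
    """Fetch and summarize details for a model that is installed locally."""
    with urllib.request.urlopen(show_request(model, base_url), timeout=10) as resp:
        return summarize_details(json.load(resp))
```

With the server running, show_model("llama3.3") returns a small dict you can log or compare against your memory budget.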
Calling Ollama from JavaScript and Python
Ollama exposes an OpenAI-compatible HTTP API under the /v1 path. Use any OpenAI SDK with its base URL pointed at http://localhost:11434/v1.
JavaScript:
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // ignored, but the field is required
});
const response = await client.chat.completions.create({
  model: "llama3.3",
  messages: [{ role: "user", content: "Hello, world." }],
});
console.log(response.choices[0].message.content);
Python:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Hello, world."}],
)
print(response.choices[0].message.content)
Streaming, tool use, and structured outputs go through the same SDK calls as with the cloud OpenAI API, although tool use only works with models that support it. The native Ollama API at /api/generate and /api/chat is available too, with slightly different request shapes.
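For comparison, here is what the native request shape looks like without any SDK. This is a minimal sketch, assuming the /api/chat endpoint accepts model, messages, and stream fields and, when streaming, replies with one JSON object per line ending in an object with done set to true. The helper names are hypothetical and only the standard library is used.

```python
import json
import urllib.request


def chat_request(model, prompt, stream=True, base_url="http://localhost:11434"):
    """Build a POST request for the native /api/chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_chunks(lines):
    """Yield text fragments from a streamed reply (one JSON object per line)."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        yield event.get("message", {}).get("content", "")
        if event.get("done"):
            break


def stream_chat(model, prompt):
    """Send the prompt and yield the reply as it is generated."""
    with urllib.request.urlopen(chat_request(model, prompt)) as resp:
        yield from iter_chunks(resp)
```

With the server running, something like for piece in stream_chat("llama3.3", "Hello, world."): print(piece, end="", flush=True) prints the reply as it arrives. Keeping the parsing in iter_chunks separate from the network call makes it easy to test against canned lines.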
When local beats cloud (and when it doesn't)
Local is the right choice when:
- Privacy is non-negotiable. Medical, legal, financial data that cannot leave the network. No cloud-provider trust required.
- You're offline. Travel, airplane, ship, anything without reliable internet.
- High-volume batch. Processing 100K documents overnight — the per-token cost drops to electricity, which is typically 1-2 orders of magnitude cheaper than cloud API.
- You want predictable latency. Requests never leave your machine, so there are no API rate limits and no provider outages; response time depends only on your own hardware.
- You're learning or experimenting. No API budget needed.
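The batch-cost point in the list above is easy to sanity-check with arithmetic. This is a rough sketch under stated assumptions: the wattage, throughput, and electricity price below are placeholder inputs I picked for illustration, not measurements, and the calculation ignores what the hardware itself cost.

```python
def local_cost_per_million_tokens(watts, tokens_per_second, price_per_kwh):
    """Electricity cost (in the currency of price_per_kwh) to generate 1M tokens."""
    hours = 1_000_000 / tokens_per_second / 3600
    kwh = watts / 1000 * hours
    return kwh * price_per_kwh


# Hypothetical machine: 300 W under load, 50 tokens/sec, 0.15 per kWh.
cost = local_cost_per_million_tokens(watts=300, tokens_per_second=50, price_per_kwh=0.15)
print(f"about {cost:.2f} per million generated tokens")
```

Plug in your own numbers and compare the result against the per-million-token price on your cloud provider's pricing page.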
Cloud is the right choice when:
- Best-in-class quality matters. Claude Opus and GPT-5 still outperform every open-weights model on hard reasoning tasks.
- You need long context (>32K tokens reliably). Most local models degrade past 8-16K tokens; cloud models handle 200K-1M well.
- You want zero ops. Cloud APIs require no hardware management, no model updates, no driver issues.
- You need agentic tool use at scale. Cloud models are more reliable at correct tool selection across complex toolsets.
The honest 2026 answer: most production AI workloads run on cloud LLMs. Local LLMs are increasingly viable for specific niches, but they're not yet a wholesale replacement.
Quantization: trading quality for memory
A "70B parameter" model at 16-bit precision needs 140GB of memory for its weights alone (2 bytes per parameter). At 4-bit quantization, it needs ~40GB. At 2-bit, ~20GB. Less memory means smaller hardware can run bigger models.
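The arithmetic behind those numbers is parameters times bits per weight, divided by 8. Here is a small sketch of it; the 15% overhead factor for KV cache and runtime buffers is my own assumption, chosen only to show why the 4-bit and 2-bit figures come out a little above the raw weight size.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Memory for the weights alone: parameters x bits / 8, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def estimated_total_gb(params_billions, bits_per_weight, overhead=1.15):
    """Weights plus an assumed allowance for KV cache and runtime buffers."""
    return weight_memory_gb(params_billions, bits_per_weight) * overhead


for bits in (16, 4, 2):
    print(f"70B at {bits}-bit: {weight_memory_gb(70, bits):.1f} GB of weights, "
          f"roughly {estimated_total_gb(70, bits):.0f} GB with overhead")
```

The real overhead grows with context length, so treat the second number as a floor to plan around rather than a precise requirement.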
Ollama defaults to Q4 (4-bit quantization) for most models, which gives the best size-quality trade-off in practice. Going lower (Q2, Q3) saves memory but visibly degrades output quality. Going higher (Q5, Q6, Q8) improves quality but takes more memory; the gains above Q4 are small.
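To run a specific quantization, pull its tag explicitly, for example ollama pull llama3.3:70b-instruct-q4_K_M or llama3.3:70b-instruct-q8_0. Tag names vary by model, so check the tag list on the model's ollama.com page first.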
For most setups, stick with the default. Worry about quantization only if you hit a memory ceiling and need to fit a bigger model.
What to do next
For combining local LLMs with the broader AI stack:
- How to Build an LLM Agent with Tool Use — Ollama supports tool use; the agentic loop is identical to cloud.
- How to Build RAG with Embeddings and Vector Search — local embedding models (BGE, E5) pair well with local LLMs for full-stack on-device RAG.
- How to Get Reliable JSON from an LLM: Ollama supports structured outputs via the format parameter.
For the choice between local and cloud for specific workloads:
- How to Choose Between Claude Haiku, Sonnet, and Opus — covers the cloud-tier decision; pair with this article for the local-vs-cloud decision.
External references: Ollama documentation, open-LLM-leaderboard on Hugging Face for current model rankings.





