A practical guide to running LLMs on your own hardware - covering the tools (Ollama, LM Studio, Jan), hardware requirements by VRAM tier, model selection, quantization formats, and how to integrate local inference into your dev workflow.
Running an LLM locally means executing a large language model entirely on your own hardware - no API calls, no cloud dependency, no data leaving your machine. For developers tired of mounting API bills, or teams that need private inference in air-gapped environments, local LLMs have become a real option. The tooling has matured fast. As of early 2026, you can go from zero to a working local ChatGPT-like setup in under five minutes.
This guide covers the hardware you need, the tools worth using, how to pick a model, and where local inference still falls short. No "what is an LLM" preamble - if you're here, you already know.
Why Bother Running LLMs Locally?
Five reasons keep coming up in practice:
- Privacy and data sovereignty. Your prompts and data never leave your machine. For regulated industries, legal teams, or anyone handling sensitive code, this alone justifies the effort.
- Cost. A single developer burning through GPT-4-class API calls can easily spend $50-200/month. A one-time $300 GPU runs a capable 8B model indefinitely.
- Latency. No network round-trip. Token generation starts immediately. For interactive coding assistants or local RAG pipelines, this matters.
- Offline access. Airplanes, classified networks, remote sites - local models work everywhere.
- Experimentation freedom. Run uncensored models, test fine-tuned variants, swap architectures without waiting for an API provider to support them.
The trade-off is straightforward: you give up frontier-model quality (GPT-4o, Claude Opus) in exchange for control, privacy, and zero marginal cost per token.
What Hardware Do You Actually Need?
VRAM is the bottleneck. The rough formula: parameters (in billions) x bits per weight, divided by 8, gives the gigabytes of VRAM needed for the weights alone - the KV cache and runtime overhead add a bit on top. A 7B parameter model at 4-bit quantization needs about 3.5 GB of VRAM; a 70B model at 4-bit needs roughly 35 GB.
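The formula is easy to sanity-check with a throwaway helper. The 10% overhead factor here is an assumption for illustration - actual overhead depends on the runtime and context length:

```python
def vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough VRAM estimate in GB: weight memory plus an assumed ~10% runtime overhead."""
    return params_billion * bits_per_weight / 8 * overhead

# Weights alone (overhead=1.0) reproduce the figures above:
print(vram_gb(7, 4, overhead=1.0))   # 3.5
print(vram_gb(70, 4, overhead=1.0))  # 35.0
```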
Here is what fits at each tier:
| VRAM | What You Can Run | Expected Speed |
|---|---|---|
| 8 GB (RTX 4060, M1/M2) | 7-8B models at Q4_K_M | 30-50 tok/s |
| 16 GB (RTX 4070 Ti, M2 Pro) | 14B models at Q4, or 8B at Q8 | 20-40 tok/s |
| 24 GB (RTX 4090, RTX 3090) | 27-32B models at Q4_K_M | 15-30 tok/s |
| 48 GB+ (dual GPU, M4 Max, A6000) | 70B models at Q4 | 8-15 tok/s |
A model that fits entirely in VRAM runs roughly 10x faster than one that spills over into system RAM. Capacity beats raw GPU speed here - an RTX 3090 with 24 GB often outperforms an RTX 4080 with 16 GB on larger models simply because it avoids offloading.
CPU-only inference works for quantized 7B models, but expect 3-8 tokens per second depending on your CPU. Usable for batch processing. Painful for interactive chat.
Apple Silicon deserves its own mention. The M1 through M4 chips share a unified memory pool between CPU and GPU, which means a MacBook Pro with 36 GB of unified memory can load models that would need a 36 GB discrete GPU on a PC. The MLX framework from Apple is 20-30% faster than llama.cpp on Apple Silicon for most model sizes. An M4 Pro with 48 GB of RAM comfortably runs Qwen 3 32B (Q4) at 15-22 tokens per second - fast enough for interactive use.
The Tools: Ollama, LM Studio, Jan, and Beyond
The local LLM ecosystem has consolidated around a few tools. Here is how they compare:
| | Ollama | LM Studio | Jan | llama.cpp | vLLM |
|---|---|---|---|---|---|
| Interface | CLI + REST API | GUI + API server | GUI desktop app | CLI | CLI + API server |
| Best for | Developers, automation | Beginners, exploration | Privacy-focused desktop use | Power users, custom builds | Multi-user serving |
| Model format | GGUF (auto-converted) | GGUF | GGUF | GGUF | GPTQ, AWQ, FP16 |
| OpenAI-compatible API | Yes | Yes | Yes | Via server mode | Yes |
| Multi-GPU | Basic layer splitting | Limited | No | Layer offloading | Tensor parallelism |
| GitHub stars | 100K+ | Closed source | 30K+ | 80K+ | 45K+ |
Ollama: The Docker of Local LLMs
Ollama mirrors Docker's UX: you pull models by name and run them with a single command. It wraps llama.cpp under the hood and exposes an OpenAI-compatible API on port 11434.
```shell
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama3.2

# Run a specific quantized variant
ollama run qwen3:14b-q4_K_M

# List downloaded models
ollama list
```
Ollama handles quantization format conversion automatically, manages model storage, and supports GPU acceleration out of the box on NVIDIA, AMD, and Apple Silicon. For most developers, this is the right starting point.
One thing to know: Ollama processes requests sequentially by default. Under concurrent load - say, five developers hitting the same Ollama server - latency spikes from 2 seconds to 45+ seconds. For multi-user scenarios, vLLM with its continuous batching architecture is a better fit, achieving up to 16x higher throughput under concurrency.
LM Studio: Best GUI Experience
LM Studio is a desktop application with a polished interface for browsing, downloading, and chatting with models. It searches Hugging Face directly, lets you adjust inference parameters with sliders, and spins up a local API server with one click. If you want to experiment with different models without touching a terminal, start here.
Jan: Privacy-First Desktop App
Jan is an open-source desktop app that runs completely offline. It has added agentic workflow features with project workspaces and browser-based MCP tool integration. Good for non-technical users who want a private ChatGPT replacement.
Choosing a Model and Understanding Quantization
Where to Find Models
Two main sources: the Ollama model library (curated, ready to pull) and Hugging Face (the full ecosystem, including GGUF-quantized variants from community contributors like TheBloke and bartowski).
Quantization: What Q4_K_M Actually Means
Quantization compresses model weights from 16-bit floating point down to 4, 5, 6, or 8 bits. The GGUF format (used by llama.cpp, Ollama, LM Studio, and Jan) is the standard for local inference. The naming convention breaks down like this:
- Q = quantized, the number is bits per weight (Q4 = 4-bit, Q8 = 8-bit)
- K = k-quant method (grouped quantization with per-group scaling) - better quality than older methods
- S/M/L = Small, Medium, Large variant - controls how many layers get higher precision
Quality ranking from best to most compressed: Q8_0 > Q6_K > Q5_K_M > Q4_K_M > Q4_K_S > Q3_K_S > Q2_K
Q4_K_M is the sweet spot for most users - it cuts VRAM usage by roughly 75% compared to FP16 while keeping perplexity degradation under 1% for most models. Q5_K_M is worth the extra memory if you have room. Q8_0 is near-lossless but needs twice the VRAM of Q4.
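The size trade-offs follow directly from the bits per weight. A small sketch, using nominal bit widths (real GGUF files run slightly larger because group scales and some tensors are kept at higher precision):

```python
# Nominal bits per weight by quant level; actual GGUF files are somewhat
# larger because k-quants store per-group scales at higher precision.
NOMINAL_BPW = {"Q2_K": 2, "Q3_K_S": 3, "Q4_K_M": 4, "Q5_K_M": 5,
               "Q6_K": 6, "Q8_0": 8, "FP16": 16}

def approx_size_gb(params_billion: float, quant: str) -> float:
    """Approximate on-disk/in-VRAM size of the quantized weights in GB."""
    return params_billion * NOMINAL_BPW[quant] / 8

for q in ("FP16", "Q8_0", "Q4_K_M"):
    print(f"14B at {q}: ~{approx_size_gb(14, q):.1f} GB")
```

For a 14B model this works out to roughly 28 GB at FP16, 14 GB at Q8_0, and 7 GB at Q4_K_M - the 75% saving mentioned above.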
Recommended Models (Early 2026)
| Use Case | Model | Parameters | Min VRAM (Q4) |
|---|---|---|---|
| Fast general chat | Llama 3.1 8B | 8B | 5 GB |
| Coding assistant | Qwen 3 14B | 14B | 8 GB |
| Strong reasoning | Gemma 3 27B | 27B | 16 GB |
| Near-frontier quality | Qwen 3 32B | 32B | 20 GB |
| Maximum local capability | Llama 3.3 70B | 70B | 40 GB |
Qwen 3 models have been particularly strong in 2026, with the 32B variant matching or beating GPT-4o on several public benchmarks while running on a single RTX 4090. The Gemma 3 27B from Google is another standout - its 128K context window and multimodal support make it versatile for RAG applications.
Adding a UI and Integrating Into Your Workflow
Open WebUI: A ChatGPT-Like Interface for Local Models
Open WebUI is a self-hosted web interface that connects to Ollama or any OpenAI-compatible API. It gives you multi-model chat, document upload, built-in RAG, conversation history, and MCP tool integration - all running on your own hardware.
Getting it running takes three commands:
```shell
# 1. Make sure Ollama is running
ollama serve

# 2. Pull a model
ollama pull qwen3:14b

# 3. Start Open WebUI via Docker
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
Open your browser to http://localhost:3000 and you have a private, multi-model chat interface with RAG capabilities.
Local Models as an OpenAI API Drop-In
Both Ollama and LM Studio expose OpenAI-compatible endpoints. Any code that uses the OpenAI SDK works with a two-line change:
```python
from openai import OpenAI

# Point the standard OpenAI client at the local server
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama; LM Studio defaults to port 1234
    api_key="not-needed"  # required by the SDK, ignored by local servers
)

response = client.chat.completions.create(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "Explain MVCC in PostgreSQL"}],
)
print(response.choices[0].message.content)
```
This makes local models a drop-in replacement for development and testing. Tools like Continue use this same approach to provide local Copilot-style autocomplete in VS Code and JetBrains IDEs.
Limitations and When to Use Cloud Instead
Local LLMs have real constraints:
- No frontier models. GPT-4o, Claude Opus, Gemini Ultra - these are not available for local deployment. The gap has narrowed, but it still exists for complex reasoning and very long context tasks.
- Context window limits. Most local setups top out at 32K-128K tokens in practice. Cloud models offer 200K+ with better performance at those lengths.
- Multi-user throughput. Serving more than a handful of concurrent users locally requires vLLM and serious GPU hardware. API providers handle scaling for you.
- No automatic updates. You manage model versions, security patches, and infrastructure yourself.
The pragmatic approach: use local models for development, testing, privacy-sensitive workloads, and experimentation. Use cloud APIs for production serving at scale and tasks requiring frontier-level quality. Many teams run both - local Ollama for dev, cloud API for production - with the same code thanks to OpenAI-compatible endpoints.
Key Takeaways
- Start with Ollama for CLI-driven workflows, or LM Studio if you prefer a GUI. Both get you running in under 5 minutes.
- VRAM determines what you can run. 8 GB handles 7-8B models. 24 GB handles 27-32B models. Capacity matters more than GPU generation.
- Q4_K_M quantization is the default choice - 75% smaller than FP16 with minimal quality loss.
- Qwen 3 and Gemma 3 are the strongest open model families for local use right now.
- Open WebUI + Ollama gives you a private ChatGPT-like setup with RAG in three commands.
- Local for dev, cloud for prod is a practical hybrid that gives you the best of both worlds.