tutorial•May 16, 2026•4 min read

Best Local LLM for 8GB VRAM: Optimal Settings for AI Roleplay & ERP

A blunt 2026 guide to making 8GB cards work for local roleplay: what fits, what slows down, and which settings actually earn their place.

Every 8GB guide eventually turns into the same smug advice: buy more VRAM.

Useless when the card is already inside your machine.

The good news is that 8GB still works. The bad news is that 8GB only works when you stop lying to yourself about what the number means.

1. The VRAM Tax: OS and Browser Overhead

On a normal desktop, 8GB does not mean 8GB of free memory for inference:

Windows (DWM): Can hold 1.5GB – 2.5GB of VRAM to render the desktop, more with several or high-resolution monitors.
macOS (WindowServer): Can consume 1.5GB – 2.0GB of VRAM.
Linux (GNOME/KDE): Can consume 1.0GB – 2.0GB depending on the compositor.
Browsers (Chrome/Firefox with hardware acceleration): Can consume 1.0GB – 2.0GB for a few active tabs.
Discord / Spotify / Electron apps: Can consume another 0.5GB – 1.0GB combined.

If you run a standard desktop setup, you start with roughly 4.5GB – 5.5GB of usable VRAM. If your model weights plus Key-Value (KV) cache exceed that, the overflow ends up in system RAM (DDR4/DDR5): either the runtime keeps some layers on the CPU, or the driver pages GPU memory across the PCIe bus. This "spilling" can drop generation speed from a fluid 30+ tokens per second to 1–2 tokens per second.
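Before trusting any of these ranges, measure your own card. The sketch below assumes an NVIDIA GPU with `nvidia-smi` on the PATH; the `vram_headroom_mib` helper and the example numbers are made up for illustration.

```shell
#!/bin/sh
# Headroom in MiB: free VRAM minus what the model is expected to need.
# A negative result means the load will not fit without spilling.
# Args: free_mib needed_mib
vram_headroom_mib() {
    echo $(( $1 - $2 ))
}

# Print total, used, and free VRAM in MiB when an NVIDIA driver is present.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=memory.total,memory.used,memory.free \
               --format=csv,noheader,nounits
fi

# Hypothetical example: 8192 MiB card, 2600 MiB already taken by the
# desktop, and a model plus cache that needs 6200 MiB.
vram_headroom_mib $(( 8192 - 2600 )) 6200
```

With those example numbers the helper prints a negative value, which is the signal to shrink the context, pick a smaller model, or close applications before loading.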

2. VRAM Allocation Dynamics Table

To prevent memory spilling, you must balance model parameters against context size. The table below outlines memory footprints for GGUF Q4_K_M (4-bit) models:

Model Size	Quantization	Context Length	Estimated VRAM	8GB Hardware Viability
3B	Q4_K_M	8,000 tokens	~3.2 GB	Excellent. High headroom for background apps.
7B – 8B	Q4_K_M	4,000 tokens	~5.8 – 6.0 GB	Optimal. Fits comfortably with basic OS tasks.
7B – 8B	Q4_K_M	8,000 tokens	~6.2 GB	Viable. Requires closing heavy browser tabs.
7B – 8B	Q4_K_M	16,000 tokens	~7.2 GB	Borderline. Requires strict OS resource pruning.
12B – 14B	Q4_K_M	4,000 tokens	~8.6 GB	Exceeds limits. Spills to CPU/RAM immediately.
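
Most of the growth across the context column is KV cache. As a rough sketch, assuming an unquantized fp16 cache and the Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128), the cache size can be estimated as follows. The function name is made up for this example, and real usage adds the GGUF file size plus runtime buffers on top.

```shell
#!/bin/sh
# KV cache size in MiB for an unquantized (fp16) cache.
# Args: layers kv_heads head_dim context_tokens
kv_cache_fp16_mib() {
    # 2 tensors (K and V) x layers x kv_heads x head_dim
    # x 2 bytes per fp16 element x tokens, converted to MiB
    echo $(( 2 * $1 * $2 * $3 * 2 * $4 / 1048576 ))
}

# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
for ctx in 4096 8192 16384; do
    echo "$ctx tokens: $(kv_cache_fp16_mib 32 8 128 "$ctx") MiB"
done
```

Doubling the context doubles the cache, which is why the same 8B model moves from comfortable to borderline as the context length climbs.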

3. Quantization Choices: GGUF vs. AWQ

On 8GB cards, your quantization format dictates stability:

GGUF (with imatrix quants): The gold standard for consumer desktop deployment. Its k-quants quantize weights in small blocks, and Importance Matrix (imatrix) calibration helps a 4-bit file keep most of the FP16 model's quality. GGUF runtimes such as llama.cpp allow hybrid offloading (keeping some layers on the CPU), so an oversized model slows down instead of crashing.
AWQ / GPTQ: Usually served by GPU-only runtimes that expect the whole model and its cache to sit in VRAM. Most of them offer no CPU offloading, so when your context window grows past your physical VRAM limit, you typically get a fatal Out of Memory (OOM) error instead of slower generation.

The 1-Bit Exception: PrismML Bonsai

If you need an ultra-low footprint, PrismML Bonsai 8B was released in 2026. It runs on a custom 1-bit GGUF fork and packs an 8B-parameter model into a 1.15GB VRAM footprint. It requires specialized inference kernels, but it lets you run a full 16K context window in under 2GB of VRAM.

4. Practical CLI Execution Snippets

Here are copy-pasteable configurations to execute local models on 8GB hardware via llama-server (llama.cpp) or Ollama:

Config A: Fast Coding / Autocomplete Assistant

Model: Qwen 2.5 Coder 3B GGUF (Q4_K_M)

Command (Ollama):

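# Enable Flash Attention and restrict parallel requests to save VRAM.
# The Ollama server reads these variables, so set them in the environment where the server starts.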
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_NUM_PARALLEL=1
ollama run qwen2.5-coder:3b
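
Ollama applies its own default context length unless told otherwise, so capping it explicitly keeps the KV cache predictable. A minimal sketch, assuming you want an 8192-token window: write a Modelfile that sets `num_ctx` and build a variant from it. The variant name `qwen-coder-8k` is hypothetical.

```shell
#!/bin/sh
# Write a Modelfile that pins the context window for the base model.
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:3b
PARAMETER num_ctx 8192
EOF

# Build the variant only when Ollama is installed.
# "qwen-coder-8k" is a made-up name for this example.
if command -v ollama >/dev/null 2>&1; then
    ollama create qwen-coder-8k -f Modelfile \
        || echo "ollama create failed (is the Ollama server running?)"
fi
```

After `ollama run qwen-coder-8k`, running `ollama ps` in another terminal shows whether the loaded model sits entirely on the GPU or is split with the CPU.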

Config B: Core Desktop Assistant (Daily Generalist)

Model: Llama 3.1 8B Instruct GGUF (Q4_K_M)

Command (llama-server):

# -c 8192 limits context, -fa on enables Flash Attention (recent builds expect on/off/auto), -ngl 99 puts all layers on the GPU
./llama-server -m llama-3.1-8b-instruct.Q4_K_M.gguf -c 8192 -fa on -ngl 99

Config C: 12B Model Roleplay (Advanced Context Shifting)

Model: Mistral NeMo 12B Instruct GGUF (iMat Q4_K_M)

Command (llama-server):

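# --cache-type-k / --cache-type-v q8_0 store the Key-Value cache in 8-bit; --n-gpu-layers sets how many layers go on the GPU.
# Mistral NeMo has 40 transformer blocks, so treat 41 as a starting point and lower it until the load fits in VRAM.
./llama-server --model Mistral-Nemo-Instruct-12B-iMat-GGUF-Q4_K_M.gguf --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --ctx-size 16384 --n-gpu-layers 41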
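
To see what the quantized cache buys at 16K, here is a rough estimate, assuming the Mistral NeMo shape (40 layers, 8 KV heads, head dimension 128) and the q8_0 layout of 34 bytes per block of 32 values. The helper names are made up for this example, and the result ignores weights and runtime buffers.

```shell
#!/bin/sh
# Number of cached elements: K and V tensors for every layer and token.
# Args: layers kv_heads head_dim context_tokens
kv_elements() { echo $(( 2 * $1 * $2 * $3 * $4 )); }

# fp16 stores 2 bytes per element.
kv_fp16_mib() { echo $(( $(kv_elements "$@") * 2 / 1048576 )); }

# q8_0 stores each block of 32 elements in 34 bytes
# (32 int8 values plus one fp16 scale).
kv_q8_0_mib() { echo $(( $(kv_elements "$@") / 32 * 34 / 1048576 )); }

echo "fp16 cache: $(kv_fp16_mib 40 8 128 16384) MiB"
echo "q8_0 cache: $(kv_q8_0_mib 40 8 128 16384) MiB"
```

Under these assumptions the q8_0 cache takes a little over half the space of the fp16 cache, and the VRAM it frees can hold additional model layers on the GPU.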

Config D: 35B MoE Model (Extreme Hybrid Execution)

Model: Qwen 3.6 35B MoE GGUF (Q4_K_M — Active parameters: 3B)

Command (llama-server):

# -ngl 99 puts every layer on the GPU, then --n-cpu-moe 38 keeps the MoE expert weights of the first 38 layers in system RAM to preserve token throughput
./llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 99 --n-cpu-moe 38 -c 8192 -fa on --cache-type-k q8_0 --reasoning-budget-message "\n\nStop thinking and answer now."

Stop treating your machine like an unlimited cloud server. Keep your context lengths capped, enforce Flash Attention, utilize quantized KV caches, and close background tabs. That is how you turn an 8GB card into a fast, disciplined local stack.
