AI's Future Isn't in the Cloud - It's Already on Your PC
Master Ollama: Deploy, Customize, and Run local LLMs with full control, privacy, and speed.
What is Ollama?
Ollama is an open-source framework designed to enable the execution of large language models (LLMs) directly on your personal computer. It serves as a versatile bridge between downloadable LLM model files and your local machine, giving you powerful tools to interact with, test, and fine-tune these models in a private environment. By abstracting away much of the complexity traditionally associated with deploying LLMs, Ollama makes it simple to start using models like LLaMA, Phi, DeepSeek, and others right away.
Key Features of Ollama
- Privacy-first & offline operation – Run models on your device without ever transmitting your data to external servers. This ensures sensitive input stays on your machine.
- Comprehensive model management – Easily pull, copy, rename, or delete models. Manage local versions of several models concurrently.
- Built-in API layer – Ollama offers a local HTTP API server, hosted at http://localhost:11434, to interact programmatically with models.
- Model creation and extension – Use Modelfiles to customize inference logic, system prompts, context length, and precision levels.
- Hardware-optimized – Supports GPU acceleration with NVIDIA CUDA (Windows).
- Flexible deployment – Deploy locally or on remote machines, enabling shared access to models across teams or projects.
- Multimodal support – Run models that accept text and images as input for richer capabilities.
- Open source integrations – Compatible with community-developed interfaces and custom workflows.
Installing Ollama on Windows
Windows 10/11:
- Visit the official Ollama website.
- Download the Windows .exe installer.
- Run the installer and follow the prompts.
- To install to a custom directory, run the installer from a terminal as: `ollamasetup.exe /DIR="C:\Program Files\Ollama"`
If Ollama is installed in C:\Program Files, set environment variables under System variables. If it is installed under the user profile (the default), set them under User variables.
Model files and configuration are stored under %USERPROFILE%\.ollama by default. To change the default model download location, set the OLLAMA_MODELS environment variable to a custom path such as `D:\ollama\ollama_models`.
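For example, to persist this for your user account you can set the variable from a Command Prompt with setx (the D: path below is just an illustration; pick any folder you like), then restart Ollama so it picks up the change:

```
REM Persist the model directory for the current user (takes effect in new terminal sessions)
setx OLLAMA_MODELS "D:\ollama\ollama_models"
```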
Running Models: Local & Remote
On Windows, Ollama installs as a service.
Example usage:
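The commands below assume the llama3 model, but any model name from the Ollama library works the same way:

```
ollama pull llama3
ollama run llama3 "Explain quantization in one sentence."
```

The first command downloads the model; the second runs a one-shot prompt (omit the quoted prompt to start an interactive chat).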
Remote Deployment
To expose Ollama to other machines, set the OLLAMA_HOST environment variable to 0.0.0.0 so that it listens on all interfaces. On Windows, you can do this by running `set OLLAMA_HOST=0.0.0.0` in the command line or by setting it in the system environment variables. Then interact normally using the CLI or API.
After setting OLLAMA_HOST in your terminal, run `ollama serve` so that Ollama picks up the change. To keep it simple, create the environment variable in the scope that matches your install type (User variables for the default install, System variables for a custom install under C:\Program Files) and restart your machine for it to take effect.
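A minimal sketch of the flow, assuming the host machine's LAN address is 192.168.1.50 (replace it with your own) and that the llama3 model is already pulled:

```
REM On the host machine
set OLLAMA_HOST=0.0.0.0
ollama serve

REM From another machine on the network
curl http://192.168.1.50:11434/api/generate -d "{\"model\": \"llama3\", \"prompt\": \"Hello\", \"stream\": false}"
```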
CLI Commands and Parameters
| Command | Description |
|---|---|
| `ollama pull <model>` | Download the named model to your system. |
| `ollama run <model>` | Execute the model interactively or provide a one-time prompt. |
| `ollama create <name> -f <file>` | Build a custom model using a Modelfile. |
| `ollama show <model>` | View metadata and details of the model. |
| `ollama list` | List all downloaded models. |
| `ollama ps` | Show active model sessions. |
| `ollama stop <model>` | Halt an active model instance. |
| `ollama cp <src> <dest>` | Duplicate or rename a model. |
| `ollama rm <model>` | Delete a downloaded model from disk. |
| `ollama push <model>` | Upload your model to a remote registry. |
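As an example of `ollama create`, here is a minimal Modelfile that wraps an existing model with a custom system prompt (the base model `llama3` and the name `devops-helper` are just placeholders):

```
# Modelfile
FROM llama3
PARAMETER temperature 0.2
SYSTEM You are a concise assistant for DevOps questions.
```

```
ollama create devops-helper -f Modelfile
ollama run devops-helper
```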
Runtime Configuration via Environment Variables (Windows)
Here’s a (currently) comprehensive list of environment variables that Ollama will recognize on Windows (and other platforms). You can set these in PowerShell (or your system-wide “Environment Variables” dialog - System > Advanced > Environment Variables) to tweak Ollama’s behavior without passing flags on every command.
- `OLLAMA_HOST` – default host/URL of the Ollama daemon, e.g. `set OLLAMA_HOST=0.0.0.0` followed by `ollama serve`
- `OLLAMA_PORT` – default port for the Ollama daemon (11434)
- `OLLAMA_NO_TLS` – if set (`1` or `true`), disables TLS when talking to the daemon; same as passing `--no-tls` to `ollama serve` or `ollama run`
- `OLLAMA_PROFILE` – name of the active profile (overrides `--profile`); profiles group endpoint URLs, credentials, etc.
- `OLLAMA_PROFILES_PATH` – custom filesystem location for your profiles directory; defaults to `%USERPROFILE%\.ollama\profiles`
- `OLLAMA_DATA_DIR` – root directory where Ollama stores all data (models, logs, metadata); defaults to `%USERPROFILE%\.ollama\data`
- `OLLAMA_MODELS` – override for where downloaded models are kept; if unset, defaults to `%USERPROFILE%\.ollama\models`
- `OLLAMA_CACHE_DIR` – directory for temporary downloads, layer caches, and shards; defaults to `%USERPROFILE%\.ollama\cache`
- `OLLAMA_DEBUG` – if set (`1` or `true`), enables verbose CLI debug logging; same as passing `--debug` on any `ollama` command
- `OLLAMA_JSON` – if set (`1` or `true`), forces JSON-formatted output where supported; same as passing `--json`
- `OLLAMA_CONTEXT_SIZE` – override the default token context window size for models that permit it; same as `--context <n>`
- `OLLAMA_STREAM` – if set (`1` or `true`), enables token-by-token streaming by default; same as passing `--stream`
- `OLLAMA_CONCURRENCY` – default maximum parallel inference requests when running `ollama serve`; same as `--concurrency <n>`
- `OLLAMA_ALLOW_CACHED` – if set (`1` or `true`), allows reuse of cached model archives on `ollama install`; same as `--allow-cached`
- `PROXY` / `HTTP_PROXY` / `HTTPS_PROXY` / `NO_PROXY` – standard env vars for outbound HTTP(S) proxying; Ollama respects these when downloading models or contacting registries (do not set them for inference clients)
- `OLLAMA_ORIGINS` – comma-separated list of CORS origins permitted to call your `ollama serve` endpoint; default: `127.0.0.1,0.0.0.0`
- `OLLAMA_KEEP_ALIVE` – how long a model stays loaded in memory after each request; formats: a duration (`10m`, `24h`), seconds (`3600`), `-1` (indefinite), or `0` (unload immediately); note: the `keep_alive` parameter of the `/api/generate` and `/api/chat` endpoints takes precedence
- `OLLAMA_MAX_QUEUE` – maximum number of queued requests before returning HTTP 503 ("server overloaded"); default: `512`
- `OLLAMA_MAX_LOADED_MODELS` – maximum number of models that can be loaded concurrently in memory; default: 3 × number of GPUs, or 3 if running on CPU
- `OLLAMA_NUM_PARALLEL` – maximum simultaneous inference requests each model can handle; default: auto-tuned between 1 and 4 based on available memory
- `OLLAMA_FLASH_ATTENTION` – if set (`1` or `true`), enables Flash Attention for lower memory usage on large contexts
- `OLLAMA_KV_CACHE_TYPE` – quantization type for the model's key/value cache (`f16`, `q8_0`, or `q4_0`); using a quantized cache reduces memory at a minor precision cost

How to set these in PowerShell for your session:

```
# Example: point Ollama data into D:\ollama, enable JSON output
$Env:OLLAMA_DATA_DIR = "D:\ollama\data"
$Env:OLLAMA_JSON = "1"
```
Or set them permanently via System → Advanced → Environment Variables. After setting, open a new terminal and run `ollama <command>`; your variables will take effect automatically.
Finally, you can always confirm what your current env vars are before running Ollama:
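For example, in PowerShell you can list every variable whose name starts with OLLAMA (in Command Prompt, `set OLLAMA` does the same):

```
# Show all OLLAMA_* environment variables visible to the current session
Get-ChildItem Env:OLLAMA*
```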
Understanding Quantization in Ollama
Quantization is a key technique used to reduce the memory footprint and computational load of large language models, enabling them to run efficiently on machines with limited resources. In essence, it reduces the precision of a model’s numerical weights and activations—typically from 16-bit or 32-bit floats to smaller formats like 8-bit or even 4-bit integers—without significantly sacrificing accuracy.
When you use a large language model (LLM), the model’s “brain” (its parameters) is stored as numbers. The way these numbers are stored affects how much memory the model uses, how fast it runs, and how good its answers are; storing them at lower precision is what quantization does.
Common Quantization Types
- F16 (Float16) - Half-precision floating point:
- Think of it like a very detailed drawing. Each number has a lot of information, giving it high precision.
- Pros: Very accurate, leading to high-quality model outputs.
- Cons: Takes up a lot of memory and is slower to process compared to quantized versions. Requires powerful, modern hardware (especially graphics cards, GPUs) to run efficiently.
- Analogy: A high-resolution photo with full color depth.
- Q8_0 - 8-bit integer quantization:
- This is like reducing the detail in your drawing, but still keeping a lot of the original information. It converts the numbers from floating-point to 8-bit integers.
- Pros: Good balance of quality and reduced memory usage. It’s often the “gold standard” for quantized models. It runs well on most computer processors (CPUs).
- Cons: Still relatively large compared to other quantized formats (e.g., 5-7 GB for large models).
- Analogy: A good quality JPEG image with visible, but not excessive, compression.
- Q6_K - Optimized 6-bit quantization with grouped scaling:
- This is a smarter way to reduce detail. Instead of just cutting off bits, it groups numbers and scales them together to minimize the loss of important information.
- Pros: Significantly reduces memory usage (e.g., 3-5 GB for similar models) and makes the model load and run faster. It’s a good compromise between size/speed and quality.
- Cons: Slight reduction in quality compared to Q8_0, but often imperceptible for many tasks.
- Analogy: A well-optimized JPEG image, where the compression is noticeable only upon close inspection.
- Q5_K, Q4_K - 5-bit and 4-bit variants:
- These are even more aggressive reductions in detail. The “K” often indicates specific optimization techniques (like those used in Q6_K) to preserve as much quality as possible despite the high compression.
- Pros: Smallest model sizes, meaning they load and run the fastest. Excellent for devices with limited memory or for maximum speed.
- Cons: The “loss in generation quality” becomes more noticeable here. The model’s answers might be slightly less coherent, accurate, or creative compared to less quantized versions.
- Analogy: A very compressed JPEG or GIF image. You can still tell what it is, but some details and smoothness are lost.
- Q3_K_S, Q2_K - Ultra-compressed formats:
- These are the most extreme forms of detail reduction. They prioritize making the model as tiny as possible, often at the cost of significant quality degradation.
- Pros: Absolutely minimal memory usage. Can potentially run on extremely resource-constrained devices.
- Cons: The quality trade-off is very apparent here. These formats are rarely used for general-purpose LLMs where output quality matters, because the answers can become noticeably poor or nonsensical. They are better suited to very specific, niche applications where size is the absolute priority.
- Analogy: A heavily pixelated or extremely low-resolution image, where the original content is barely recognizable.
Example Model Sizes (LLaMA3 8B)
- F16: ~13 GB
- Q8_0: ~7 GB
- Q6_K: ~5.2 GB
- Q4_K_M: ~3.8 GB
- Q2_K: ~2.2 GB
Which Quantization to Use?
For general use on a typical Windows machine with moderate CPU and 8–16 GB RAM, Q4_K_M or Q5_K_M is often the best tradeoff. These formats load faster, consume less RAM, and still produce high-quality outputs suitable for coding, summarization, and chat.
If you have more memory available (e.g., 32 GB) or a capable GPU, you can opt for Q6_K or Q8_0 for slightly improved accuracy. If system constraints are tight, Q3_K_S may allow you to experiment with models that otherwise wouldn't fit at all. In short, on a moderately sized machine, download the builds whose names end in Q4_K_M (e.g., ModelName-Q4_K_M); they work quite decently on such hardware.
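On the Ollama registry the quantization is encoded in the model tag, so you pull a specific variant by name. Exact tag names vary per model (check the model's Tags page on ollama.com); a typical example looks like this:

```
ollama pull llama3.1:8b-instruct-q4_K_M
```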
API Interface
Ollama exposes a local API at http://localhost:11434.
POST /api/generate
Generate a completion:
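A minimal example with curl, assuming the llama3 model is already pulled (bash-style quoting; on Windows cmd, escape the inner double quotes instead):

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

Setting `"stream": false` returns a single JSON response instead of a stream of tokens.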
POST /api/chat
Chat interaction:
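For example, again assuming llama3 is available locally:

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    { "role": "user", "content": "Give me one tip for writing Dockerfiles." }
  ],
  "stream": false
}'
```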
POST /api/embeddings
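Generate an embedding vector for a piece of text. A sketch assuming an embedding-capable model such as nomic-embed-text has been pulled:

```
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Ollama runs large language models locally."
}'
```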
GET /api/tags
List local models:
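For example:

```
curl http://localhost:11434/api/tags
```

The response is a JSON object whose `models` array describes each locally available model.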
POST /api/pull
Download a model:
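For example (recent Ollama releases accept `model`; older ones use `name`):

```
curl http://localhost:11434/api/pull -d '{ "model": "llama3" }'
```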
DELETE /api/delete
Remove a model:
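For example:

```
curl -X DELETE http://localhost:11434/api/delete -d '{ "model": "llama3" }'
```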
POST /api/create
Create a custom model, for example from a Modelfile (the exact request body depends on your Ollama version).
POST /api/show
Show model details:
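For example:

```
curl http://localhost:11434/api/show -d '{ "model": "llama3" }'
```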
GUI Interfaces for Ollama
- Open WebUI – Chat interface with support for RAG, embeddings, and multiple model tabs.
- LM Studio – Cross-platform GUI for running models; you can connect it to Ollama via the Llamalink plugin.
- LibreChat – Self-hosted chat interface supporting plugins and memory.
- Lollms WebUI – Modular UI with model marketplace and support for Ollama.
- Enchanted – Lightweight chat GUI for fast local access.
- Parallel LLM Runner – A lightweight Streamlit-based app for running and comparing multiple LLMs.
Parallel LLM Runner
The Parallel-LLM-Runner project allows side-by-side model comparisons using a Streamlit interface.
Features:
- Compare response quality from several models.
- Choose between horizontal and vertical layout.
- Prompt history and configuration memory.
- Great for research, benchmarking, and development.
Setup Steps:
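The exact repository URL and entry-point script depend on the project, so treat the following as a generic Streamlit setup sketch with placeholder names:

```
REM Clone the project (placeholder URL) and install its Python dependencies
git clone <parallel-llm-runner-repo-url>
cd Parallel-LLM-Runner
pip install -r requirements.txt

REM Launch the Streamlit UI (entry-point file name may differ)
streamlit run app.py
```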
VSCode Extensions
Here are the two most commonly used VS Code extensions that let you use local LLMs through Ollama.
- Continue Extension – VSCode plugin enabling chat, code generation, and context-aware assistance using local Ollama models. The backend can be set to `localhost:11434` (see the configuration sketch after this list).
- AI Toolkit Extension – Visual Studio Code plugin supporting multi-model chat, file-based context, embeddings, and offline-first Ollama support. In the models section, look for Ollama models and select your locally installed model.
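As an illustration of pointing Continue at a local Ollama model: older Continue releases read a JSON config (newer ones use YAML), and a minimal sketch of the JSON form looks roughly like this, with the model name being whatever you have pulled locally:

```
{
  "models": [
    {
      "title": "Llama 3 (Ollama)",
      "provider": "ollama",
      "model": "llama3",
      "apiBase": "http://localhost:11434"
    }
  ]
}
```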
Conclusion
Ollama empowers developers, researchers, and enthusiasts to run powerful large language models directly on Windows with simplicity and control. Its CLI tools, robust API interface, environment variable configuration, and growing ecosystem of GUIs and integrations make it a compelling choice for anyone wanting local AI capabilities. Whether you’re building applications, analyzing language data, comparing models, or integrating with IDEs like VSCode, Ollama brings scalable, offline-first LLM inference to your desktop. With fine-grained customization through environment variables and native service integration on Windows, it’s never been easier to take full control of your AI workflows—no cloud dependency required.