AI's Future Isn’t in the Cloud - It's Already on Your PC

Master Ollama: Deploy, Customize, and Run local LLMs with full control, privacy, and speed.


What is Ollama?

Ollama is an open-source framework designed to enable the execution of large language models (LLMs) directly on your personal computer. It serves as a versatile bridge between downloadable LLM model files and your local machine, giving you powerful tools to interact with, test, and fine-tune these models in a private environment. By abstracting away much of the complexity traditionally associated with deploying LLMs, Ollama makes it simple to start using models like LLaMA, Phi, DeepSeek, and others right away. Key features include:

  • Privacy-first & offline operation – Run models on your device without ever transmitting your data to external servers. This ensures sensitive input stays on your machine.
  • Comprehensive model management – Easily pull, copy, rename, or delete models. Manage local versions of several models concurrently.
  • Built-in API layer – Ollama offers a local HTTP API server, hosted at http://localhost:11434, to interact programmatically with models.
  • Model creation and extension – Use modelfiles to customize inference logic, system prompts, context length, and precision levels.
  • Hardware-optimized – Supports GPU acceleration with NVIDIA CUDA (Windows).
  • Flexible deployment – Deploy locally or on remote machines, enabling shared access to models across teams or projects.
  • Multimodal support – Run models that accept text and images as input for richer capabilities.
  • Open source integrations – Compatible with community-developed interfaces and custom workflows.
To install Ollama on Windows:

  1. Visit the official Ollama website.
  2. Download the Windows .exe installer.
  3. Run the installer and follow the prompts.
  4. To install to a custom directory, run the installer from a terminal as: OllamaSetup.exe /DIR="C:\Program Files\Ollama"

If Ollama is installed under C:\Program Files, set its environment variables under System variables. If it is installed under your user profile (the default), set them under User variables.

Model files and configuration are stored under %USERPROFILE%\.ollama by default. To change where models are downloaded, set the OLLAMA_MODELS environment variable to a custom path such as D:\ollama\ollama_models.

On Windows, Ollama runs as a background application (with a system tray icon) and starts its local server automatically.
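
To confirm that the server is up, a quick check (assuming the default port 11434) is to ask both the CLI and the local API for their versions:

ollama --version
Invoke-RestMethod -Uri "http://localhost:11434/api/version"   # returns the running server version as JSON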

Example usage:

ollama pull llama3 # Pulls llama3 model from ollama registry (online repository)
ollama run llama3 # runs the model
>>> why is the sky blue?
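
You can also pass the prompt as an argument for a one-off, non-interactive answer:

ollama run llama3 "Why is the sky blue?"   # prints the response and exits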

To expose Ollama to other machines, set the OLLAMA_HOST environment variable to 0.0.0.0 so the server listens on all interfaces. On Windows, you can do this by running set OLLAMA_HOST=0.0.0.0 in a command prompt (for the current session) or by setting it in the system environment variables. Then interact normally using the CLI or API.

Note that setting the variable in a terminal only affects processes started from that terminal, so run ollama serve from the same session for the change to take effect. For a persistent setup, create the environment variable under System or User variables (depending on your install type) and restart the Ollama background app, or reboot, so it picks up the change.
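
As a sketch, once the server listens on all interfaces, another machine's Ollama CLI can target it through the same variable (the address 192.168.1.50 below is only an example):

# On the server: bind to all interfaces permanently, then restart the Ollama background app
setx OLLAMA_HOST "0.0.0.0"

# On a client: point the CLI at the remote server for this session and use it as usual
$Env:OLLAMA_HOST = "http://192.168.1.50:11434"
ollama list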


Command – Description
ollama pull <model> – Download the named model to your system.
ollama run <model> – Execute the model interactively or provide a one-time prompt.
ollama create <name> -f <file> – Build a custom model using a modelfile (see the example after this table).
ollama show <model> – View metadata and details of the model.
ollama list – List all downloaded models.
ollama ps – Show active model sessions.
ollama stop <model> – Halt an active model instance.
ollama cp <src> <dest> – Duplicate or rename a model.
ollama rm <model> – Delete a downloaded model from disk.
ollama push <model> – Upload your model to a remote registry.
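
As an example of ollama create, here is a minimal sketch of building a custom model from a Modelfile (the model name win-helper, the base model, and the settings are only illustrative):

# Write a minimal Modelfile: base model, a sampling parameter, and a system prompt
@"
FROM llama3
PARAMETER temperature 0.3
SYSTEM """You are a concise assistant for Windows administration questions."""
"@ | Set-Content -Path .\Modelfile

# Build the custom model and chat with it
ollama create win-helper -f .\Modelfile
ollama run win-helper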

Here’s a (currently) comprehensive list of environment variables that Ollama will recognize on Windows (and other platforms). You can set these in PowerShell (or your system-wide “Environment Variables” dialog - System > Advanced > Environment Variables) to tweak Ollama’s behavior without passing flags on every command.

  1. OLLAMA_HOST
    – default host/URL of the Ollama daemon

    set OLLAMA_HOST=0.0.0.0
    ollama serve
    
  2. OLLAMA_PORT
    – default port for the Ollama daemon (11434)

  3. OLLAMA_NO_TLS
    – if set (1 or true), disables TLS when talking to the daemon
    – same as passing --no-tls to ollama serve or ollama run

  4. OLLAMA_PROFILE
    – name of the active profile (overrides --profile)
    – profiles group endpoint URLs, credentials, etc.

  5. OLLAMA_PROFILES_PATH
    – custom filesystem location for your profiles directory
    – defaults to %USERPROFILE%\.ollama\profiles

  6. OLLAMA_DATA_DIR
    – root directory where Ollama stores all data (models, logs, metadata)
    – defaults to %USERPROFILE%\.ollama\data

  7. OLLAMA_MODELS
    – override for where downloaded models are kept
    – if unset, defaults to %USERPROFILE%\.ollama\models

  8. OLLAMA_CACHE_DIR
    – directory for temporary downloads, layer caches, and shards
    – defaults to %USERPROFILE%\.ollama\cache

  9. OLLAMA_DEBUG
    – if set (1 or true), enables verbose CLI debug logging
    – same as passing --debug on any ollama command

  10. OLLAMA_JSON
    – if set (1 or true), forces JSON-formatted output where supported
    – same as passing --json

  11. OLLAMA_CONTEXT_SIZE
    – override default token context window size for models that permit it
    – same as --context <n>

  12. OLLAMA_STREAM
    – if set (1 or true), enables token-by-token streaming by default
    – same as passing --stream

  13. OLLAMA_CONCURRENCY
    – default maximum parallel inference requests when running ollama serve
    – same as --concurrency <n>

  14. OLLAMA_ALLOW_CACHED
    – if set (1 or true), allows reuse of cached model archives on ollama install
    – same as --allow-cached

  15. PROXY / HTTP_PROXY / HTTPS_PROXY / NO_PROXY
    – standard env vars for outbound HTTP(s) proxying
    – Ollama respects these when downloading models or contacting registries (do not set for inference clients)

  16. OLLAMA_ORIGINS
    – comma-separated list of CORS origins permitted to call your ollama serve endpoint
    – default: 127.0.0.1,0.0.0.0

  17. OLLAMA_KEEP_ALIVE
    – how long a model stays loaded in memory after each request
    – formats: duration (10m, 24h), seconds (3600), -1 (indefinite), or 0 (unload immediately)
    – note: the /api/generate and /api/chat endpoints’ keep_alive parameter takes precedence

  18. OLLAMA_MAX_QUEUE
    – maximum number of queued requests before returning HTTP 503 (“server overloaded”)
    – default: 512

  19. OLLAMA_MAX_LOADED_MODELS
    – maximum number of models that can be loaded concurrently in memory
    – default: 3 × number of GPUs, or 3 if running on CPU

  20. OLLAMA_NUM_PARALLEL
    – maximum simultaneous inference requests each model can handle
    – default: auto-tuned between 1 and 4 based on available memory

  21. OLLAMA_FLASH_ATTENTION
    – if set (1 or true), enables Flash Attention for lower memory usage on large contexts

  22. OLLAMA_KV_CACHE_TYPE
    – quantization type for the model’s key/value cache (f16, q8_0, or q4_0)
    – using a quantized cache reduces memory at a minor precision cost

How to set these in PowerShell for your session:

    # Example: point Ollama data into D:\ollama, enable JSON output
    $Env:OLLAMA_DATA_DIR = "D:\ollama\data"
    $Env:OLLAMA_JSON     = "1"
    

Or set them permanently via System → Advanced → Environment Variables. After setting, open a new terminal and run any ollama command; your variables will take effect automatically.

Finally, you can always confirm what your current env vars are before running Ollama:

Get-ChildItem Env: | Where-Object { $_.Name -like 'OLLAMA_*' }
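
If you prefer the command line over the dialog, setx persists a variable for your user account; only newly opened terminals will see it (the OLLAMA_MODELS path below is just an example):

# Persist the model store location for your user account
setx OLLAMA_MODELS "D:\ollama\ollama_models"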

Quantization is a key technique used to reduce the memory footprint and computational load of large language models, enabling them to run efficiently on machines with limited resources. In essence, it reduces the precision of a model’s numerical weights and activations—typically from 16-bit or 32-bit floats to smaller formats like 8-bit or even 4-bit integers—without significantly sacrificing accuracy.

When you use a large language model (LLM), the model’s “brain” (its parameters) is stored as numbers. How those numbers are stored affects how much memory the model uses, how fast it runs, and how good its answers are. Reducing the precision of those stored numbers is called quantization.

  • F16 (Float16) - Half-precision floating point:

    • Think of it like a very detailed drawing. Each number has a lot of information, giving it high precision.
    • Pros: Very accurate, leading to high-quality model outputs.
    • Cons: Takes up a lot of memory and is slower to process compared to quantized versions. Requires powerful, modern hardware (especially graphics cards, GPUs) to run efficiently.
    • Analogy: A high-resolution photo with full color depth.
  • Q8_0 - 8-bit integer quantization:

    • This is like reducing the detail in your drawing, but still keeping a lot of the original information. It converts the numbers from floating-point to 8-bit integers.
    • Pros: Good balance of quality and reduced memory usage. It’s often the “gold standard” for quantized models. It runs well on most computer processors (CPUs).
    • Cons: Still relatively large compared to other quantized formats (e.g., 5-7 GB for large models).
    • Analogy: A good quality JPEG image with visible, but not excessive, compression.
  • Q6_K - Optimized 6-bit quantization with grouped scaling:

    • This is a smarter way to reduce detail. Instead of just cutting off bits, it groups numbers and scales them together to minimize the loss of important information.
    • Pros: Significantly reduces memory usage (e.g., 3-5 GB for similar models) and makes the model load and run faster. It’s a good compromise between size/speed and quality.
    • Cons: Slight reduction in quality compared to Q8_0, but often imperceptible for many tasks.
    • Analogy: A well-optimized JPEG image, where the compression is noticeable only upon close inspection.
  • Q5_K, Q4_K - 5-bit and 4-bit variants:

    • These are even more aggressive reductions in detail. The “K” often indicates specific optimization techniques (like those used in Q6_K) to preserve as much quality as possible despite the high compression.
    • Pros: Smallest model sizes, meaning they load and run the fastest. Excellent for devices with limited memory or for maximum speed.
    • Cons: The loss in generation quality becomes more noticeable here. The model’s answers might be slightly less coherent, accurate, or creative compared to less quantized versions.
    • Analogy: A very compressed JPEG or GIF image. You can still tell what it is, but some details and smoothness are lost.
  • Q3_K_S, Q2_K - Ultra-compressed formats:

    • These are the most extreme forms of detail reduction. They prioritize making the model as tiny as possible, often at the cost of significant quality degradation.
    • Pros: Absolutely minimal memory usage. Can potentially run on extremely resource-constrained devices.
    • Cons: The quality trade-off is very apparent. These are rarely used for general-purpose LLMs where output quality matters, because the answers can become noticeably poor or nonsensical. They are better suited to very specific, niche applications where size is the absolute priority.
    • Analogy: A heavily pixelated or extremely low-resolution image, where the original content is barely recognizable.
As a rough comparison of download sizes for the same model at different quantization levels:
  • F16: ~13 GB
  • Q8_0: ~7 GB
  • Q6_K: ~5.2 GB
  • Q4_K_M: ~3.8 GB
  • Q2_K: ~2.2 GB

For general use on a typical Windows machine with moderate CPU and 8–16 GB RAM, Q4_K_M or Q5_K_M is often the best tradeoff. These formats load faster, consume less RAM, and still produce high-quality outputs suitable for coding, summarization, and chat.

If you have more memory available (e.g., 32 GB) or a capable GPU, you can opt for Q6_K or Q8_0 for slightly improved accuracy. If system constraints are tight, Q3_K_S may let you experiment with models that otherwise wouldn’t fit at all. In short, on a moderately sized machine, look for model variants ending in Q4_K_M; they perform quite decently on such resources.
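
Quantization is normally chosen through the model tag when pulling. The exact tag below is only an example, so check a model's Tags page on the Ollama library for the variants actually published:

# Pull and run a specific quantization variant by tag (tag names differ per model)
ollama pull llama3:8b-instruct-q4_K_M
ollama run llama3:8b-instruct-q4_K_M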


Ollama exposes a local API at http://localhost:11434.

POST /api/generate

Generate a completion:

{
  "model": "llama3",
  "prompt": "Explain quantum computing in simple terms.",
  "options": {"temperature": 0.7},
  "stream": true
}
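
From PowerShell, a minimal sketch of calling this endpoint looks like the following (stream is set to false so that Invoke-RestMethod receives a single JSON object):

# Build the request body and read the generated text from the "response" field
$body = @{
  model  = "llama3"
  prompt = "Explain quantum computing in simple terms."
  stream = $false
} | ConvertTo-Json

$result = Invoke-RestMethod -Method Post -Uri "http://localhost:11434/api/generate" -Body $body -ContentType "application/json"
$result.response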

POST /api/chat

Chat interaction:

{
  "model": "llama3",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What’s the capital of Germany?"}
  ]
}
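
A non-streaming chat call from PowerShell follows the same pattern; the reply text is in message.content:

$body = @{
  model    = "llama3"
  stream   = $false
  messages = @(
    @{ role = "system"; content = "You are a helpful assistant." },
    @{ role = "user";   content = "What is the capital of Germany?" }
  )
} | ConvertTo-Json -Depth 4

(Invoke-RestMethod -Method Post -Uri "http://localhost:11434/api/chat" -Body $body -ContentType "application/json").message.content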

POST /api/embeddings

{
  "model": "llama3",
  "prompt": "Machine learning is fun."
}
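
And a quick sketch for embeddings; the response carries the vector in the embedding field:

$body = @{ model = "llama3"; prompt = "Machine learning is fun." } | ConvertTo-Json
$resp = Invoke-RestMethod -Method Post -Uri "http://localhost:11434/api/embeddings" -Body $body -ContentType "application/json"
$resp.embedding.Count   # dimensionality of the returned vector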

List local models:

GET http://localhost:11434/api/tags

POST /api/pull

Download a model:

{
  "name": "llama3"
}

DELETE /api/delete

Remove a model:

{
  "name": "llama3"
}

POST /api/create

Create a custom model from a modelfile:

{
  "name": "custom-phi",
  "modelfile": "FROM phi3:instruct"
}

POST /api/show

Show model details:

{
  "name": "llama3"
}
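
These management endpoints can also be scripted; for example, listing installed models and pulling a new one from PowerShell (stream disabled so a single status object comes back):

# List installed models
(Invoke-RestMethod -Uri "http://localhost:11434/api/tags").models | Select-Object name, size

# Pull a model via the API
$body = @{ name = "llama3"; stream = $false } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri "http://localhost:11434/api/pull" -Body $body -ContentType "application/json"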

Several community front-ends and tools integrate with Ollama:

  • Open WebUI – Chat interface with support for RAG, embeddings, and multiple model tabs.
  • LM Studio – Cross-platform GUI for running models. You can connect Ollama via Llamalink plugin.
  • LibreChat – Self-hosted chat interface supporting plugins and memory.
  • Lollms WebUI – Modular UI with model marketplace and support for Ollama.
  • Enchanted – Lightweight chat GUI for fast local access.
  • Parallel LLM Runner – A lightweight Streamlit-based app for running and comparing multiple LLMs.

The Parallel-LLM-Runner project allows side-by-side model comparisons using a Streamlit interface.

  • Compare response quality from several models.
  • Choose between horizontal and vertical layout.
  • Prompt history and configuration memory.
  • Great for research, benchmarking, and development.
# Clone the repo
git clone https://github.com/ChayScripts/Parallel-LLM-Runner.git
cd Parallel-LLM-Runner

# Create and activate virtual environment
python -m venv venv
venv\Scripts\activate  # Windows

# Install dependencies
pip install streamlit requests pyperclip pytz

# Start the app
streamlit run app.py

Here are the two most commonly used VS Code extensions that let you use local LLMs through Ollama.

  • Continue Extension – VSCode plugin enabling chat, code generation, and context-aware assistance using local Ollama models. Backend can be set to localhost:11434.
  • AI Toolkit Extension – Visual Studio Code plugin supporting multi-model chat, file-based context, embeddings, and offline-first Ollama support. In models section, look for Ollama models and select your locally installed model.

Ollama empowers developers, researchers, and enthusiasts to run powerful large language models directly on Windows with simplicity and control. Its CLI tools, robust API, environment variable configuration, and growing ecosystem of GUIs and integrations make it a compelling choice for anyone who wants local AI capabilities. Whether you’re building applications, analyzing language data, comparing models, or integrating with IDEs like VS Code, Ollama brings scalable, offline-first LLM inference to your desktop. With fine-grained customization through environment variables and native Windows integration, it has never been easier to take full control of your AI workflows, with no cloud dependency required.
