Question 1

What is Ollama?

Accepted Answer

Ollama is a free, open-source tool that downloads and runs open large language models on your own machine or server. The project is MIT-licensed and wraps the llama.cpp inference engine, adding a clean command line, automatic model fetching, GGUF quantization handling, and an OpenAI-compatible REST API on port 11434.

The workflow is short. Install Ollama with a single command, run ollama pull to download a model, and ollama run to start chatting. The Ollama library covers more than 4,500 models in May 2026, with families like Llama, Mistral, Gemma, DeepSeek, Phi, and Qwen, along with embedding models for retrieval and vision models that read images. The engine itself is free, and the models are free.

Question 2

Is Ollama free?

Accepted Answer

The Ollama engine is free under the MIT license, and the open-source models it runs are free as well. There are no per-token charges and no rate limits, so the only ongoing cost is the hardware and electricity behind your server. Ollama Inc. now also sells an optional paid hosted tier called Ollama Cloud, with monthly plans for larger models that need datacenter GPUs. The hosted plan is not required for local or self-hosted use of any kind in 2026.

Question 3

Does Ollama need a GPU?

Accepted Answer

No, Ollama runs on CPU alone, with throughput as the trade-off. A modern 8-core CPU on a quantized 7B model produces roughly 5 to 15 tokens per second, while a high-end GPU on the same model gets 40 to 80 tokens per second or more. For small models like Llama 3.2 3B or Phi-3.5 3.8B, CPU-only performance is comfortable on a VPS, and is a sensible default for privacy-focused work that does not need frontier speed at scale.

Question 4

How much RAM do I need to run Ollama?

Accepted Answer

A rule of thumb is roughly 0.6 GB of memory per billion parameters at the default Q4_K_M quantization, plus headroom for the context window the model holds. In practice, 8 GB of RAM comfortably runs a 7B model, 16 GB runs a 13B, and 32 GB is the recommended headroom for 13B-and-up workloads on CPU. Larger models, longer contexts, and concurrent requests push the requirement up. A VPS in that 8 to 32 GB range fits most personal and small-team work.

Question 5

How much disk space do Ollama models take?

Accepted Answer

Default Q4_K_M model files run a few gigabytes each. A 7B model is about 4.1 GB on disk, an 8B is about 4.6 GB, a 13B about 7.9 GB, and a 70B around 40 GB. Pulling several models for different tasks adds up fast, since each new tag downloads a separate file. Embedding models are smaller, in the hundreds of megabytes. For a personal Ollama server, 64 to 100 GB of SSD storage gives room for a few mid-sized models and growth.

Question 6

Can Ollama run without internet?

Accepted Answer

Once a model has been pulled to disk, Ollama needs no internet to use it. Chat, completion, embeddings, and retrieval-augmented generation all run locally against the files in your Ollama directory. The engine only talks to the internet again when you pull a new model or check for an update. That is the main reason teams in finance, healthcare, and government environments choose Ollama — prompts and responses stay on your own infrastructure, even on an air-gapped server.

Question 7

What models can Ollama run?

Accepted Answer

The Ollama library has more than 4,500 models as of May 2026, with sizes from 1B parameters up past 70B. Major families include Llama from Meta, Mistral, Gemma from Google, DeepSeek, Phi from Microsoft, Qwen from Alibaba, and gpt-oss, plus newer entries like Kimi, GLM, and MiniMax. Vision models such as LLaVA accept images alongside text. Embedding models like nomic-embed-text output 768-dimensional vectors for retrieval-augmented generation work over your own files.

Question 8

Does Ollama have an API?

Accepted Answer

Ollama has two APIs that share the same engine. The native REST API at port 11434 covers /api/chat, /api/generate, /api/embeddings, /api/pull, /api/delete, and a /api/ps endpoint for inspecting running models. The OpenAI-compatible API sits at /v1 on the same port and mirrors the OpenAI routes for chat completions, completions, embeddings, and model listing. Apps built against the OpenAI SDK can switch to a local Ollama backend with only a base URL change in the client config.

Question 9

Can Ollama do RAG?

Accepted Answer

Ollama is a common engine for local retrieval-augmented generation work on your own files. The /api/embeddings endpoint and an embedding model like nomic-embed-text produce 768-dimensional vectors that pair with vector stores such as Chroma, Weaviate, or pgvector. Frameworks like LangChain and LlamaIndex have built-in adapters that point at an Ollama base URL, so a private RAG chatbot over your own documents can run end to end on the same VPS, without sending any data outside your own infrastructure.

Question 10

Can you run Ollama on a VPS?

Accepted Answer

A VPS is one of the more common setups for the tool. A VPS with 4 to 16 GB of RAM, a few CPU cores, and 50 to 100 GB of SSD storage handles small-to-mid models like Llama 3.2 3B, Phi-3.5 3.8B, and quantized 7B variants comfortably for most uses. Larger models and high-concurrency production loads do better on GPU-accelerated hardware, but most personal, team, and RAG use cases fit a CPU-only VPS without much trouble.

Ollama VPS Docker Hosting

Why Run Ollama on GreenGeeks

Memory for Small to Mid-Sized Models

Fast CPU Throughput for Token Output

SSD Storage for Multi-GB Model Weights

Always-On Private API for Apps

Self-Managed VPS Plans

VPS 4GB

VPS 8GB

VPS 16GB

VPS 32GB

What is Ollama?

What You Can Build with Ollama

The Key Features of Ollama

Frequently Asked Questions

Launch Ollama on a VPS