Complete deployment tutorial of local large model, 2026 Use Ollama to run Llama and Qwen on your own computer
Running large language models locally is an order of magnitude more fun in 2026 than it was two years ago. Open-source models such as the Llama family, the Qwen family, and DeepSeek distilled versions range from a few billion to a few hundred billion parameters, and ordinary desktops and high-spec laptops can run one or two cost-effective models. The core benefits of local deployment are data privacy and zero quota anxiety, at the cost of VRAM, RAM, and the initial setup barrier. This article uses Ollama, an open-source tool, as the main thread and walks through the whole thing—download, install, running a model, and hooking up a front end—the standard way to run a large model on a personal computer in 2026.
What Is Ollama, and What Problem Does It Solve

Ollama is an open-source framework for running large models locally, which spread quickly in the developer community starting in 2024. It packages the entire pipeline—download, quantization, inference, and API exposure—into a single command, so beginners can use it without understanding model architecture.
The pain point it solves is direct. In the past, to run a local large model you had to install PyTorch or llama.cpp, download the raw weights, write your own conversion script, and tune inference parameters, a whole round of hassle that took half a day to start. Ollama hides all of this behind the scenes; running a model takes just one line, ollama run.
Ollama is cross-platform, supporting macOS, Linux, and Windows, each with a native installer. It's especially friendly to Apple Silicon users—the unified memory architecture of M-series chips makes Ollama noticeably smoother at running large models than the Intel platform.
Step One: Check Whether Your Hardware Is Up to Par

To run a local large model, VRAM or unified memory is the first hard metric. A rough correspondence: running a 7B-class model needs about 8GB of memory, 13B needs 16GB, 30B needs 32GB, and 70B needs 64GB to start. These are the minimums for the 4-bit quantized versions; higher-precision versions roughly double that.
Here are a few common-scenario configuration references. An M1/M2/M3 MacBook Air with 8GB can barely run the 3B class, so a small model under 4B is recommended. An M2/M3 MacBook Pro with 16GB is a sweet spot, running 7-13B models smoothly. An M3 Max with 36GB or an M4 Pro starting at 24GB can run 30B models at a usable generation speed. A gaming PC with an RTX 4070/4080/4090 and 12-24GB of VRAM runs 13-30B models very smoothly.
An ordinary office laptop with 8GB of RAM can basically only play with 1-3B small models, with barely usable inference speed. If you want to seriously use local large models for work, investing in a device with 16GB or more of RAM is the baseline barrier.
Step Two: Download and Install Ollama

Go to ollama.com and download the installer for your system: Mac gets a .dmg, Windows gets an .exe, and Linux uses a curl install script. The install process is simple—on Mac just drag it into Applications, and on Windows double-click to install and you're done.
After installation, Ollama runs as a background service. Open a terminal and type ollama --version; if it shows a version number, the install succeeded.
One extra note for Mac users: Ollama listens on 127.0.0.1:11434 by default, so if you want to access it from other devices on your LAN, set the system environment variable OLLAMA_HOST=0.0.0.0 and then restart the Ollama service.
Linux users can check the service status with systemctl status ollama. If the GPU isn't recognized, you may need to install the NVIDIA Container Toolkit or ROCm drivers, depending on your graphics card model.
Step Three: Pull Your First Model
Ollama's model library covers the mainstream open-source models. Common entry-level choices:
The Llama family is from Meta, strong in general capability, with English performance better than Chinese. The command ollama pull llama3.1:8b pulls an 8-billion-parameter version, with the default 4-bit quantization at about 4-5GB.
The Qwen family is from Alibaba, strong in Chinese, with decent coding ability too. ollama pull qwen2.5:7b pulls the 7-billion-parameter version. The new-generation Qwen3 is also live in the Ollama library, worth a try.
The DeepSeek family is well optimized for coding tasks. ollama pull deepseek-r1:7b pulls a reasoning-optimized version that looks small in size but has decent logical reasoning ability.
The Phi family is a small model from Microsoft, 3-4B parameters, and runs on small devices with 4GB of RAM. ollama pull phi3:mini.
Gemma is Google's open-source model, in various 2-9B specs. ollama pull gemma2:9b is a fairly handy medium model.
The first pull downloads several GB, so be patient. After it's done, use ollama list to see the models you already have locally.
Step Four: Run the Model and Chat
The most direct experience: type ollama run qwen2.5:7b in the terminal, and once the model loads you can chat directly.
The first run takes a few tens of seconds to load the model, and subsequent starts are much faster. An M2 MacBook Pro with 16GB running a 7B model generates at roughly tens of tokens per second, which feels close to the ChatGPT web version—smooth.
Exit the conversation with /bye or Ctrl+D. Ollama keeps the model in memory for a few minutes, during which a restart is instant. To free the memory immediately, run ollama stop qwen2.5:7b.
An advanced usage is controlling generation quality with parameters. After entering the conversation with ollama run qwen2.5:7b, type /set parameter temperature 0.3 to lower the temperature so the model answers more stably; above 0.8 it's more creative.
Step Five: Hook Up Open WebUI to Make the Interface Feel Like ChatGPT
The terminal experience isn't friendly enough for ordinary users. Open WebUI is the most popular Ollama front end in the open-source community, with an interface close to ChatGPT and support for multiple sessions, Markdown, code highlighting, RAG, and more.
The fastest way to install is Docker, one line:
docker run -d -p 3000:8080 -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
Once it's running, visit localhost:3000 in a browser and register your first account (a local account that doesn't go to the cloud). In settings, the Ollama Endpoint points to host.docker.internal:11434 by default and will automatically detect the models you've pulled.
The experience after that is almost identical to ChatGPT. You can create multiple conversations, switch between different models to compare results, and upload files for RAG Q&A, with all data staying on your machine.
If you don't want to install Docker, Open WebUI also supports pip installation and can run in a Python environment, but Docker's clean isolation is more recommended.
Model Selection: Hands-On Recommendations for Chinese Scenarios
The most common question when running local large models is "which model to pick." Here are a few hands-on recommendations by scenario.
For Chinese writing and everyday conversation, Qwen 2.5 7B or the Qwen 3 family is the top pick, with natural, fluent Chinese expression and a relatively recent knowledge cutoff.
For coding tasks, the DeepSeek Coder family and the Qwen 2.5 Coder family are both top-tier; the 7B version can handle most everyday coding tasks, and the 30B version is close to first-line closed-source models.
For English writing and creativity, the Llama 3.1/3.2 family and the Mistral family outperform Chinese models, but their Chinese support is slightly weaker.
If your hardware is strained and you can only run under 3B, Phi3 mini is one of the best all-rounders in the 3-4B range, and Gemma 2B works in a pinch too.
70B-class models (such as Llama 3.3 70B and Qwen 2.5 72B) have overall capability close to the early GPT-4 level, but they need 64GB or more of memory to run, so don't attempt them on an ordinary setup.
A Few Common Performance Optimization Tips
Stuttering is the problem beginners run into most. A few common optimization directions.
First, pick the right size. If your hardware isn't enough, pick a small model; don't force a big model and expect a miracle. The experience of a smoothly running 13B is far better than a stuttering, word-by-word 30B.
Second, prefer the GGUF quantized version. What Ollama provides by default is already a quantized version, usually Q4_K_M or Q5_K_M. If you have high quality requirements and enough VRAM, you can pull a Q8 version (add the :8b-q8_0 suffix to the command), which noticeably improves answer quality at the cost of doubling VRAM.
Third, close unnecessary background programs. Local large model inference uses a lot of VRAM and RAM, and running dozens of browser tabs, an IDE, and Docker containers at the same time will noticeably slow inference.
Fourth, control the context length. Ollama's default context is 2048 tokens, and a long context consumes more VRAM. If you only do short Q&A, this default is just right; for summarizing long documents, set a larger context, at the cost of being slower.
The Real Use Cases for Local Large Models
Many people leave a local large model idle after a few days of use because they didn't find the right scenario. Three directions that are genuinely usable.
First, privacy-sensitive conversation and document processing. For business contracts, internal documents, and personal private data, running locally completely avoids the compliance risk of going to the cloud.
Second, stable assistive workflows. For example, batch translation, batch summarization, and batch generation of structured data; a local model has no rate limits, no quota limits, and can run offline, suiting unattended tasks.
Third, exploratory learning. To learn concepts like RAG, Function Call, and Agent, use a local model for free experimentation with zero cost of failure—you'll understand much faster than just reading docs.
If you only chat day to day and occasionally write a doc, cloud ChatGPT or a domestic large model is enough. The real value of local large models is in the three dimensions of batch processing, privacy, and control.
Frequently Asked Questions
Which model is most suitable for running a local large model on a Mac
For M1/M2 16GB, we recommend Qwen 2.5 7B or Llama 3.1 8B, which run smoothly and work in both Chinese and English. For M2 Pro/Max with 36GB or more, you can try Qwen 2.5 32B or DeepSeek 32B for a noticeably higher-tier experience. If you prioritize Chinese, pick Qwen; for coding tasks, pick Qwen Coder or DeepSeek Coder; for English creativity, pick Llama. A Mac Pro or Mac Studio with 64GB or more can take on 70B models.
Can a local large model search the web
Ollama doesn't go online by default and only runs local inference. To let the model go online, you add search capability at the front-end layer. Open WebUI has an official Web Search feature that connects to search backends such as SearXNG or the Tavily API, so the model can search first and then generate an answer. You can also use frameworks like LangChain and LlamaIndex to assemble your own search + RAG flow. This combination gets close to the experience of ChatGPT with browsing, but the configuration barrier is higher than plain conversation.
How do you use the Ollama API
Ollama exposes an OpenAI-compatible REST API by default on port 11434, and most tools that support the OpenAI protocol can connect directly. For example, change the API endpoint of Continue.dev, Cline, or Cursor to http://localhost:11434/v1, fill the model name with a model you've pulled locally, and you can run AI coding in the editor with a local model. Note that local models are weaker at coding than cloud Claude/GPT, so this suits simple tasks or scenarios where you don't want to spend money.
Do local large models use a lot of power
During inference, the GPU or CPU runs at high load for a long time, so power consumption is indeed significantly higher than at idle. An M2 MacBook Pro running a 7B model draws roughly tens of watts for the whole machine during continuous generation. An NVIDIA 4090 desktop card draws several hundred watts per card during inference. For long batch tasks, pay attention to cooling and electricity costs. A Mac laptop's bottom case will get noticeably warm after running a while, but both the software and hardware layers have protections that won't damage the device.
Why does my local model give irrelevant answers
A few common reasons. First, the model is too small; models under 3B have inherently limited capability and getting things wrong is normal—switch to 7B or larger for instant improvement. Second, the prompt is unclear; a local model can't "guess" your intent like ChatGPT, so write the background and requirements out more explicitly and completely. Third, the context isn't enough; Ollama's default context is 2048, so a long conversation gets truncated and forgets earlier content—set OLLAMA_NUM_CTX in config or raise max tokens in Open WebUI. Fourth, the quantization is too aggressive; 2-bit quantization like Q2 noticeably reduces quality, so use Q4 or above if you can.
📝 本文来自抖文 www.douwen.me ,转载请保留出处。
原文链接:https://www.douwen.me/archives/1116/
💬 评论 (7)
Step-by-step is gold.
Practical tips not fluff.
Easy to follow.
Best summary I've read on this.
Sharing this with my team.
Loved the FAQ section.
Solid breakdown, very useful.