Recently, for a small task, I had to check approximately 350 webpages for an analysis. While thinking about how to filter the items, I came up with the idea of using AI. But if I used a Python script with a tech giant's LLM, it would get pricey fast due to high token spending. So why not run an AI locally? I have a good GPU, I have lots of RAM… So, I decided to take the plunge and figure out how to run an AI model right here on my own computer.

The best part? It was surprisingly easy, completely private, and didn’t cost a dime in API fees. Fresh off my last encounters with various AI tools, I want to share exactly how I did it using Ollama and Hugging Face.

Why Run AI Locally?

First things first: why would you even bother?

  • Privacy: Nothing you type leaves your computer. Period.
  • Offline Access: You can chat with your AI even if your internet connection drops.
  • Freedom: You aren’t tied to subscription limits or corporate filters.

The Great RAM Debate: GPU VRAM vs. System RAM

Before you download a model, you need to know what kind of hardware you are working with. When we talk about “RAM” for AI, there are two different types that dictate how fast your AI will actually process information.

  • System RAM (Normal RAM): This is the everyday memory your computer’s CPU uses. If you don’t have a dedicated graphics card, Ollama will run the AI using your CPU and this normal RAM. It works perfectly fine, but it will be noticeably slower.
  • GPU VRAM (Video RAM): This is the memory built into dedicated graphics cards. GPUs are essentially massive math-calculating engines. If your AI model is small enough to fit entirely inside your GPU’s VRAM, your AI will generate text blazingly fast.
  • The Apple Silicon Factor: If you have an M-series Mac (which I know many of you do, based on my previous macOS posts), you have “Unified Memory.” This means your GPU and CPU share the same pool of RAM, making modern Macs absolute beasts for running local AI. Hence the Mac Mini shortages around the world.
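As a quick sanity check before downloading anything, you can estimate whether a model will fit fully in your GPU's VRAM. A minimal sketch; the 1.5 GB overhead figure for context cache and runtime is my own rough assumption, not an exact number:

```python
def fits_in_vram(model_size_gb, vram_gb, overhead_gb=1.5):
    """True if the model file plus runtime overhead fits fully in VRAM.

    overhead_gb is a rough allowance for the context (KV) cache and runtime.
    """
    return model_size_gb + overhead_gb <= vram_gb

# Approximate Q4_K_M file sizes: an 8B model is ~4.9 GB, a 14B model ~9 GB.
print(fits_in_vram(4.9, 8))   # 8B model on an 8 GB card -> True
print(fits_in_vram(9.0, 8))   # 14B model on an 8 GB card -> False
```

If the answer is False, the model still runs, but the part that spills out of VRAM falls back to the slower CPU/system-RAM path.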

What You’ll Need

  • A computer (Mac, Windows, or Linux), doh.
  • 8GB of RAM minimum. (16GB+ of RAM or VRAM puts you in a much better spot).
  • Basic competency with using your computer’s terminal. If you are not familiar with terminal commands, just follow the steps below closely.

Step-by-Step Guide

Step 1: Install Ollama

Ollama is the tool that handles all the heavy lifting behind the scenes.

  1. Head over to Ollama’s official website.
  2. Download the installer for your specific operating system.
  3. Run the installer and follow the standard setup instructions.

To verify it worked, open your terminal and type:

Bash terminal:

ollama --version

If it spits out a version number, we are good to go.

Step 2: How to Choose the Right Model

If you go to Hugging Face right now, you will be overwhelmed by the options. Here is a quick cheat sheet on how to pick the right one for your machine:

1. Match the Model to Your Memory

Look at the parameter count (usually denoted by a number followed by a “B” for billions).

  • 8GB of RAM/VRAM: Stick to 7B to 12B models. Look for models like Llama 3.1 (8B) or Gemma 3 (12B).
  • 16GB to 32GB of RAM/VRAM: You can comfortably run 14B to 32B models. These have much better logic and reasoning. Look at Phi-4 (14B) or Qwen 3 (30B).
  • 64GB+ or Multiple GPUs: You can run massive 70B+ enterprise-grade models.
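The cheat sheet above can be condensed into a tiny lookup function. The thresholds are just this post's rough guidelines (assuming Q4-level quantization), not hard limits:

```python
def recommended_models(mem_gb):
    """Rough guideline: usable parameter range (at Q4) for a given RAM/VRAM."""
    if mem_gb >= 64:
        return "70B+"
    if mem_gb >= 16:
        return "14B to 32B"
    if mem_gb >= 8:
        return "7B to 12B"
    return "under 7B"

print(recommended_models(8))   # 7B to 12B
print(recommended_models(24))  # 14B to 32B
```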

2. Decode the Quantization (Q4 vs Q8)

When searching for models to run locally, you want the GGUF format. But you’ll notice files ending in q4_k_m, q6_k, or q8_0. This is called quantization: essentially, how compressed the model’s data is.

  • Q4_K_M: The optimal zone. It compresses the model size by about 60% so it fits easily into smaller RAM setups, with almost zero noticeable drop in intelligence. Start here.
  • Q8_0: Barely compressed. Use this only if you have plenty of RAM/VRAM and want the absolute maximum quality.
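To see why the quantization level matters for fitting a model into memory, a back-of-the-envelope size estimate helps. The bits-per-weight figures below (roughly 4.5 for Q4_K_M and 8.5 for Q8_0) are approximations I'm assuming, not exact spec values:

```python
def approx_size_gb(params_billions, bits_per_weight):
    """Very rough model file size: parameter count times bits per weight."""
    # billions of params * bits each / 8 bits per byte = billions of bytes (GB)
    return params_billions * bits_per_weight / 8

# The same 8B model at two quantization levels:
print(round(approx_size_gb(8, 8.5), 1))  # Q8_0   -> ~8.5 GB
print(round(approx_size_gb(8, 4.5), 1))  # Q4_K_M -> ~4.5 GB
```

That roughly halved footprint is what lets a Q4 model squeeze into an 8 GB card while the Q8 version would spill over.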

Step 3: Find and Download on Hugging Face

Now that you know what to look for:

  1. Go to Hugging Face.
  2. In the search bar, look for your chosen model followed by “GGUF” (e.g., Llama 3.1 8B GGUF).
  3. Click on the model’s repository and go to the Files and versions tab.
  4. Find your chosen quantization (like the q4_k_m.gguf file) and download it.
  5. Create a dedicated folder for it on your computer, like ~/AI_Models.
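If you prefer the command line to clicking through the browser, files in a Hugging Face repo live at predictable URLs of the form https://huggingface.co/<repo>/resolve/main/<file>. A small sketch that builds such a URL; the repo and file names below are placeholders, so substitute the model you actually picked:

```python
import urllib.request

def hf_gguf_url(repo_id, filename):
    """Direct download URL for a file hosted in a Hugging Face repo."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"

# Placeholder names; replace with the repo and .gguf file you chose.
url = hf_gguf_url("someuser/some-model-GGUF", "some-model.q4_k_m.gguf")
print(url)
# To actually download it (several GB!), uncomment:
# urllib.request.urlretrieve(url, "some-model.q4_k_m.gguf")
```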

Step 4: Create a Modelfile

Ollama needs a tiny set of instructions to know what to do with the file you just downloaded.

  1. Open a plain text editor.
  2. Create a new file and type the following line, pointing it to exactly where you saved your downloaded model:

Plaintext:

FROM ./your-downloaded-model-name.gguf

(Make sure to replace your-downloaded-model-name.gguf with the actual file name!)

  3. Save this text file in the same folder as your model, and simply name it Modelfile (no extension like .txt, just Modelfile).
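For the record, FROM is the only line Ollama strictly needs, but the Modelfile format also accepts extra directives if you want to tweak behavior. A sketch with illustrative values (tune or drop them as you like):

```
FROM ./your-downloaded-model-name.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are a concise, helpful assistant."""
```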

Step 5: Import the Model into Ollama

Now it’s time to bring your Hugging Face model into Ollama’s environment.

  1. Open your terminal.
  2. Navigate to the folder where you saved your files. (Use the cd command, e.g., cd ~/AI_Models).
  3. Run this command to create the model in Ollama (let’s name it my-custom-ai):

Bash terminal:

ollama create my-custom-ai -f Modelfile

Ollama will take a few seconds to read the file and set everything up.

Step 6: Run Your AI

This is the final step. In your terminal, type:

Bash terminal:

ollama run my-custom-ai

After a brief loading pause, you should see a prompt appear. Type “Hello!” and hit enter. Congratulations—you are now chatting with an AI running entirely on your own hardware.
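Circling back to my original 350-webpage task: ollama run gives you an interactive chat, but Ollama also serves a local HTTP API (on port 11434 by default), so you can script prompts from Python. A minimal sketch, assuming the server is running and your model is named my-custom-ai as above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt):
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model, prompt):
    """Send one prompt to the locally running Ollama server, return its reply."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires the Ollama server to be running locally:
# print(ask("my-custom-ai", "Is this page about pricing? Answer yes or no."))
```

Loop that over a list of pages and you have the filtering pipeline that started this whole adventure, with zero API fees.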

Wrapping Up

Getting this to work felt like a major step forward. The fact that the open-source community provides tools like Ollama and hubs like Hugging Face makes running local AI incredibly accessible to everyone.

and maybe this text was also generated locally, based on experience.

who knows? 🙂

Levent