Machine Learning

Use Cases

Machine learning (ML) is a subset of artificial intelligence (AI). It is great for making predictions based on historical data. [1]

Use-cases:

  • Image recognition

    • Identifying objects in an image.

  • Natural language processing (NLP)

    • Understanding written or vocal speech.

  • Recommendation engines

    • Predicting what similar products a user might like.

Mathematics

Machine learning does not always require a deep knowledge of math. When it is needed, these are the most relevant mathematical subjects a machine learning expert should be familiar with [2][3]:

  • Linear algebra

  • Statistics

  • Differential calculus

  • Integral calculus

Programming Languages

More than half of all machine learning programs are built using Python. [4] Here are the top 3 programming languages used [4][5]:

  1. Python

  2. R

  3. Java

Graphics Card Vendors

Introduction

NVIDIA provides the best support for machine learning with its proprietary CUDA library. AMD and Intel graphics cards can be used via the open source OpenCL library, [7] but NVIDIA provides the best performance and compatibility. [6]

AMD ROCm

AMD’s Radeon Open Compute (ROCm) is a software stack for running workloads on a GPU instead of a CPU. The Heterogeneous Compute Compiler (HCC) is a fork of Clang that optimizes compiled code for GPUs. The Heterogeneous-compute Interface for Portability (HIP) is a C++ framework that is mostly a drop-in replacement for CUDA. [64]

If the GPU is not officially supported by a ROCm program, it may be possible to set the environment variable HSA_OVERRIDE_GFX_VERSION to enable support. [65] This does not guarantee that it will work. Below is an example of finding the GPU's gfx version and setting the override.

$ rocminfo | grep gfx
  Name:                    gfx1151
      Name:                    amdgcn-amd-amdhsa--gfx1151
      Name:                    amdgcn-amd-amdhsa--gfx11-generic
$ export HSA_OVERRIDE_GFX_VERSION=11.5.1


Large Language Models (LLMs)

Ollama

Usage

Ollama is a service for interacting with locally downloaded large language models (LLMs). It is the best free and open source CLI alternative to ChatGPT. [8]

Installation [9]:
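
For Linux, a sketch using the upstream install script (verify against the official instructions in [9] before piping it to a shell; assumes curl is installed):

```shell
# Official Ollama install script for Linux.
# Review the script's contents before running it on production systems.
curl -fsSL https://ollama.com/install.sh | sh
```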

Uninstall:

  • Linux [49]

    $ sudo systemctl disable --now ollama
    $ sudo rm --force /etc/systemd/system/ollama.service
    $ sudo rm --force /usr/local/bin/ollama
    $ sudo rm --recursive --force /usr/local/lib/ollama/
    
  • macOS [50][51]

    $ killall Ollama ollama
    $ sudo rm -f /usr/local/bin/ollama
    $ rm -r -f ~/Library/Application\ Support/Ollama
    

Upgrade:

  • Uninstall and then install Ollama again.

Ollama provides many different models. These are categorized by how many billions (B) of parameters they use. The higher the number, the more accurate the model is, at the cost of more memory usage. [10] Refer to the models section for the top models. Refer to the quantization section for more information about the size and accuracy of models.

Starter models to try:

  • For desktops, use Llama 3.1 8B [11]:

    $ ollama run llama3.1
    
  • For phones and low-end hardware, use Llama 3.2 3B [12]:

    $ ollama run llama3.2
    
  • For image recognition on desktops, use Llama 3.2 Vision 11B. Provide the full path to the image file when chatting with the model. [13]

    $ ollama run llama3.2-vision
    

Save a conversation to revisit later by using /save <SAVE_NAME>. It will be stored as a new model which can be viewed with /list or the CLI command ollama list. Load the conversation by using /load <SAVE_NAME>.

Exit the LLM instance by typing /bye.

List installed models.

$ ollama list

Delete a model.

  • Linux or macOS

    $ ollama rm <OLLAMA_MODEL>
    

Delete all models.

  • Linux

    $ sudo rm -r -f /usr/share/ollama/.ollama/models/blobs/
    $ sudo rm -r -f /usr/share/ollama/.ollama/models/manifests/
    
  • macOS

    $ rm -r -f ~/.ollama/models/*
    

Models

As of mid-2025, the best overall model is Gemma 3. It supports over 140 languages, a context of 128,000 tokens, and image recognition (except for the 1B and 270M variants). Gemma 3 27B outperforms Llama 405B and DeepSeek v3 671B. [53] It is available in 270M, 1B, 4B, 12B, and 27B sizes. [54] One downside is that it does not support tools.

  • ollama run gemma3:27b

Top local LLMs for literature as of early 2025 [28]:

  • 32B or less:

    1. QwQ 32B (Q4_K_M) = Although quantized models normally perform worse the more they are shrunk, this performs better at INT4 than it does with INT5, INT8, or even FP16. [36]

    • ollama run qwq:32b

    2. Gemma 3 12B

    • ollama run gemma3:12b-it-qat

    3. Gemma 3 4B

    • ollama run gemma3:4b-it-qat

    4. Mistral Nemo 2407 Instruct 12B

    • ollama run mistral-nemo:12b-instruct-2407-fp16

    5. Gemma 2 9B

    • ollama run gemma2:9b

    6. Llama 3.1 8B

    • ollama run llama3.1

Top local LLMs for programming that are 32B or smaller as of early 2025:

  • 32B or less:

    1. Qwen Coder 32B (Q8_0) [18][29][30]

    • ollama run qwen2.5-coder:32b-instruct-q8_0

    2. DeepSeek Coder v2 Lite 16B [17]

    • ollama run deepseek-coder-v2:16b

    3. Codestral 22B [31][32]

    • ollama run codestral:22b

  • 10B or less:

    1. Ministral Instruct 8B

    • ollama run cas/ministral-8b-instruct-2410_q4km

    2. Qwen2.5 Coder Instruct 7B [32][33]

    • ollama run qwen2.5-coder:7b-instruct

    3. DeepSeek Coder Base 7B [34][35]

    • ollama run deepseek-coder:6.7b

Top local multimodal LLMs for examining images as of 2024. [16] Ollama added support for multimodal LLMs in version 0.7.0 in 2025. [15]

  1. Qwen-VL-Max

  2. InternLM-XComposer2-VL (based on InternLM2-7B)

  3. MiniCPM-V 2.6 (based on Qwen2-8B)

  4. Qwen-VL-Plus

  5. InfMLLM (based on Vicuna-13B)

  6. ChatTruth-7B (based on Qwen-7B)

  7. InternVL-Chat-V1.5 (based on InternLM2-20B)

  8. WeMM (based on InternLM-7B)

  9. PureMM (based on Vicuna-13B)

  10. InternVL-Chat-V1.1 (based on LLaMA2-13B)

  11. LLaVA-1.6 (based on Vicuna-34B)

  12. MiniCPM-Llama3-V 2.5 (based on LLaMA3-8B)

Distrobox

Introduction

Distrobox can be used to run Ollama on immutable operating systems such as Fedora Atomic Desktop and openSUSE MicroOS. This guide focuses on systems using an AMD graphics device. For NVIDIA support, either (1) use the --nvidia argument with distrobox create or (2) set nvidia=true with distrobox-assemble create.

Fedora Atomic Desktop

Create and enter a distrobox container for Fedora.

$ distrobox create --volume /dev/dri:/dev/dri --volume /dev/kfd:/dev/kfd --additional-packages "pciutils" --init --image quay.io/fedora/fedora:latest --name ollama-fedora
$ distrobox enter ollama-fedora

openSUSE MicroOS

Allow ROCm to be used by non-root users.

$ sudo -E ${EDITOR} /etc/udev/rules.d/90-rocm.rules
KERNEL=="kfd", GROUP="video", MODE="0660"
SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"
$ sudo udevadm control --reload-rules
$ sudo udevadm trigger

Find the existing UID and GID mappings. If none exist, create one using the same name for both the user and group.

$ cat /etc/subuid
$ cat /etc/subgid
$ sudo -E ${EDITOR} /etc/subuid
<NAME>:100000:65536
$ sudo -E ${EDITOR} /etc/subgid
<NAME>:100000:65536

Find the GID for the render and video group.

$ grep render /etc/group
$ grep video /etc/group

Create a Distrobox build configuration file. Replace the subuid, subgid, and nogroup values with the related starting value. Also replace the GIDs for the render and video group.

$ ${EDITOR} distrobox-ollama-ubuntu.ini
[ollama-ubuntu]
image=docker.io/rocm/dev-ubuntu-24.04:latest
init=true
additional_packages = "pciutils"
additional_flags="--device=/dev/kfd --device=/dev/dri"
subuid=100000
subgid=100000
init_hooks="export ROCM_PATH=/opt/rocm;"
init_hooks="addgroup --gid 486 render"
init_hooks="addgroup --gid 483 video"
init_hooks="addgroup --gid 100000 nogroup"
init_hooks="usermod -aG render,video,nogroup $LOGNAME;"
nvidia=false
pull=false
root=false
replace=true
start_now=false

Create and enter the Distrobox container. [19]

$ distrobox-assemble create --file distrobox-ollama-ubuntu.ini
$ distrobox enter ollama-ubuntu

LLM Tuning

Context Length

The context length determines how much information an LLM can process at once. A larger context length can process larger files and remember older parts of the conversation for longer. For local models, a minimum context length of 64K is recommended for advanced use-cases. [55] For comparison, as of early 2026, remote models such as ChatGPT offer 16K, 32K, or 128K depending on the subscription. [56]

Ollama defaults based on VRAM size [57]:

  • Below 24 GiB = 4096 (4K)

  • Between 24 and 48 GiB = 32768 (32K)

  • At or above 48 GiB = 262144 (256K)

Below is an example of how context length affects memory usage with the Qwen3 1.7B model. The exact model and size will affect the real memory usage. [39]

Context Length    Memory Increase Multiplier from 4K    Example Memory Usage (GiB)
4096 (4K)         1x                                    2
16384 (16K)       2x                                    4
32768 (32K)       3x                                    6
49152 (48K)       4x                                    8
65536 (64K)       5x                                    10
81920 (80K)       6x                                    12
98304 (96K)       7x                                    14
114688 (112K)     8x                                    16
131072 (128K)     9x                                    18
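
The growth is roughly linear, so the example usage can be approximated. A minimal sketch (my own approximation of the pattern above, not a formula from [39]):

```shell
# Approximate the example memory usage (GiB) for the Qwen3 1.7B example.
# Assumption: multiplier ~= (context / 16384) + 1 for contexts of 16K or more,
# with a baseline of about 2 GiB at 4K.
estimate_gib() {
    awk -v ctx="$1" 'BEGIN { printf "%d\n", 2 * (ctx / 16384 + 1) }'
}

estimate_gib 32768   # 32K context -> 6
estimate_gib 131072  # 128K context -> 18
```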

Configure a context length value. [55]

  • Linux:

    $ sudo systemctl edit ollama.service
    [Service]
    Environment="OLLAMA_CONTEXT_LENGTH=<VALUE>"
    $ sudo systemctl daemon-reload
    $ sudo systemctl restart ollama
    
  • macOS:

    $ launchctl setenv OLLAMA_CONTEXT_LENGTH <VALUE>
    
Quantization

Most LLMs available to download use, at most, 16-bit floating-point (FP16) weights. Quantization can lower the memory usage, which allows running larger models and/or increasing the context size. Some models are published already quantized, which also lowers the download size. Other models require configuring the LLM service to quantize them.

Quantization          GB Size Per Billion Parameters [37][38]    Notes
FP32                  4                                          Lossless.
FP16                  2                                          Nearly identical quality to FP32. Most models are published at this size.
INT8 (Q8_0)           1                                          ‘Extremely low quality loss.’
INT5 (Q5_K/Q5_K_M)    0.6                                        ‘Very low quality loss.’
INT4 (Q4_K/Q4_K_M)    0.5                                        ‘Balanced quality.’ [20][27]

Anything below INT4 results in a huge loss in quality and is not usable. [20] If a model cannot fit into VRAM, the overflow is placed into system RAM, which can be anywhere from 30x to 50x slower. [39][52]
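
The sizes above can be used to estimate whether a quantized model fits in VRAM. A minimal sketch (billions of parameters multiplied by the GB-per-billion value from the table):

```shell
# Estimate a model's weight size in GB.
# Usage: model_gb <billions_of_parameters> <gb_per_billion_parameters>
model_gb() {
    awk -v b="$1" -v g="$2" 'BEGIN { printf "%.1f\n", b * g }'
}

model_gb 8 2     # 8B model at FP16 -> 16.0 GB
model_gb 32 0.5  # 32B model at INT4 -> 16.0 GB
```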

Configure a quantization value.

  • Linux:

    $ sudo systemctl edit ollama.service
    [Service]
    Environment="OLLAMA_KV_CACHE_TYPE=<QUANTIZATION_VALUE>"
    Environment="OLLAMA_FLASH_ATTENTION=1"
    $ sudo systemctl daemon-reload
    $ sudo systemctl restart ollama
    
  • macOS [9][21]:

    $ launchctl setenv OLLAMA_KV_CACHE_TYPE <QUANTIZATION_VALUE>
    $ launchctl setenv OLLAMA_FLASH_ATTENTION 1
    

Open WebUI

Installation

Open WebUI provides a simple web interface to interact with LLMs similar to ChatGPT. It supports using offline Ollama models, doing web searches, user accounts, and more.

Run it with default settings (it will be accessible at http://127.0.0.1:3000 after the container finishes starting):

$ podman run --detach --publish 3000:8080 --volume open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

Run it with Ollama as an integrated service:

$ podman run --detach --publish 3000:8080 --volume open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:ollama

Run it with Ollama as an integrated service and with access to NVIDIA GPUs (only AMD and Intel GPUs are accessible by default):

$ podman run --detach --publish 3000:8080 --gpus all --volume open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:cuda

Run it with access to a local Ollama service:

$ podman run --detach --network=host --env PORT=3000 --volume open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

Run it with access to a remote Ollama service [22]:

$ podman run --detach --publish 3000:8080 --env OLLAMA_BASE_URL=<OLLAMA_BASE_URL> --volume open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

Run it with authentication disabled (autologin enabled):

$ podman run --detach --publish 3000:8080 --env WEBUI_AUTH=False --volume open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

Run it with search engine support. [23][24]

  • Brave has a free service that allows for 1 query a second and 2000 queries a month. It requires an account with a credit card on file.

    $ podman run --detach --publish 3000:8080 --env ENABLE_WEB_SEARCH=true --env WEB_SEARCH_CONCURRENT_REQUESTS=1 --env ENABLE_SEARCH_QUERY_GENERATION=False --env WEB_SEARCH_ENGINE=brave --env BRAVE_SEARCH_API_KEY=<BRAVE_SEARCH_API_KEY> --volume open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
    
  • DuckDuckGo is the easiest to configure since it does not require an API key. However, search results are normally rate limited unless Open WebUI is configured to do fewer searches at a time. [25][26]

    $ podman run --detach --publish 3000:8080 --env ENABLE_WEB_SEARCH=true --env WEB_SEARCH_CONCURRENT_REQUESTS=1 --env ENABLE_SEARCH_QUERY_GENERATION=False --env WEB_SEARCH_ENGINE=duckduckgo --volume open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
    
  • Google Programmable Search Engine (PSE) has a free service that allows for 100 queries every day. It requires an account with a credit card on file.

    $ podman run --detach --publish 3000:8080 --env ENABLE_WEB_SEARCH=true --env WEB_SEARCH_ENGINE=google_pse --env GOOGLE_PSE_API_KEY=<GOOGLE_PSE_API_KEY> --env GOOGLE_PSE_ENGINE_ID=<GOOGLE_PSE_ENGINE_ID> --volume open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
    
  • Tavily offers a free service that allows for 1,000 queries every month. No credit card is required.

    $ podman run --detach --publish 3000:8080 --env ENABLE_WEB_SEARCH=true --env WEB_SEARCH_ENGINE=tavily --env TAVILY_API_KEY=<TAVILY_API_KEY> --volume open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
    

Verify if a search engine rate limit is being reached:

$ podman logs open-webui | grep -i ratelimit

Configuration

Change the Ollama URL:

  • User > Admin Panel > Settings > Connections > Manage Ollama API Connections

Change the search engine settings:

  • User > Admin Panel > Settings > Web Search

Disable query generation to prevent rate limiting of most search engines with free tiers of access. Search engine results may become less useful. [26]

  • User > Admin Panel > Settings > Interface > Web Search Query Generation: Off > Save

Modelfile

A Modelfile allows customizing an existing model for use with Ollama. The syntax for instructions is similar to a Containerfile.

View a human-friendly overview of a model.

$ ollama show ${model}:${tag}

Save a model as a Modelfile to use as a starting point.

$ ollama list
$ ollama show ${model}:${tag} --modelfile > ${model}.modelfile
$ less ${model}.modelfile

Modelfile instructions [46][47]:

  • FROM <MODEL>:<TAG> = Required. The model to use.

  • ADAPTER = The LoRA adapters to use.

  • LICENSE = The license to use.

  • MESSAGE <ROLE> = One or more existing messages. These will appear as chat history when a user runs the model. This can be used for simple training. Role can be system (or use the SYSTEM instruction instead), user (the end-user), or assistant (the AI). For long multi-line messages, use triple quotes """ to start and end the message.

  • PARAMETER = Configure model runtime settings.

    • min_p (float) = Use instead of top_p. Minimum probability of taking into account different but similar tokens. Default is 0.0.

    • num_ctx (int) = Context size. The higher the number, the more the model will remember. Default is 2048.

    • num_predict (int) = The maximum number of tokens to generate when responding to the prompt. Default is -1 for unlimited.

    • repeat_last_n (int) = How far back, in tokens, the model looks to prevent repetition. Default is 64.

    • repeat_penalty (float) = Lower will be more repetitive. Higher will be less repetitive. Default is 1.1.

    • seed (int) = Configure a seed to get consistent output. Otherwise, Ollama will generate a random seed every time the model is loaded. Default is 0 for random.

    • temperature (float) = Higher will be more creative but less accurate. Default is 0.8.

    • stop (string) = One or more stop sequences that define when the AI should stop generating text.

    • top_k (int) = Higher will provide more varied output. Lower will be more focused. Default is 40.

    • top_p (float) = Use instead of min_p. Optionally use with top_k. Higher will provide more varied output. Lower will be more focused. Default is 0.9.

  • SYSTEM = The persona the AI should have.

  • TEMPLATE = The prompt template.

Create a new model from the Modelfile.

$ ollama create <NEW_MODEL> --file <NEW_MODEL>.modelfile

Example Modelfile [48]:

FROM llama3.1:latest
PARAMETER num_ctx 4096
PARAMETER repeat_last_n 96
PARAMETER temperature 0.5
SYSTEM You are the world-class paleontologist Dr. Alan Grant from Jurassic Park.
MESSAGE user Tell me about yourself in two sentences.
MESSAGE assistant """My name is Dr. Alan Grant.
I'm a world-class paleontologist who specializes in the study of velociraptors."""

Training

Introduction

There are two types of quantization training strategies to lower the memory usage of an LLM [40]:

  • Post-training quantization (PTQ) = Easier but less accurate. Any existing LLM can be quantized and cached. Refer to the quantization section.

  • Quantization-aware training (QAT) = Harder but more accurate. The LLM must be specifically trained knowing that the data is quantized. For example, Gemma 3 models have QAT variants. [41]

Although it is not training, the easiest way to make an LLM aware of your data is to give it context first. Run the LLM with Ollama, provide the information and instructions on what to do, and then save the model. Alternatively, use a Modelfile to define MESSAGE instructions. When a user loads the model, they will see the message history.

$ ollama run <MODEL>
/save <NEW_MODEL>
/bye
$ ollama list
$ ollama run <NEW_MODEL>

Llama Factory

Unlike most tools for training LLMs, which require writing custom Python programs, Llama Factory provides a CLI and a standardized configuration format. Although it lists support for training LLaVA models (such as Gemma 3) that can also parse images, that support does not work; only text-based LLMs work. [58]

Memory requirements [59]:

Training Method    Quantization          Memory Usage of Model
LoRA               FP16                  2x
QLoRA              INT8 (Q8_0)           1x
QLoRA              INT4 (Q4_K/Q4_K_M)    0.5x
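
Assuming the multipliers above are on the same GB-per-billion-parameters scale as the quantization table (an assumption on my part; they also exclude optimizer state and activations), the weight memory can be sketched:

```shell
# Rough training-time weight memory (GB) = billions of parameters * multiplier.
# Excludes optimizer state, gradients, and activations, which add more on top.
train_weight_gb() {
    awk -v b="$1" -v m="$2" 'BEGIN { printf "%.1f\n", b * m }'
}

train_weight_gb 4 2    # 4B model with LoRA at FP16 -> 8.0 GB
train_weight_gb 4 0.5  # 4B model with QLoRA at INT4 -> 2.0 GB
```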

Installation:

  • Find the latest release version of Llama Factory.

  • Create a container with all of the dependencies pre-installed.

    • AMD [60]

      $ git clone --branch v0.9.4 https://github.com/hiyouga/LlamaFactory.git
      $ cd LlamaFactory
      $ cd docker/docker-rocm/
      $ sudo docker compose up -d
      $ sudo docker exec -it llamafactory /bin/bash
      
    • NVIDIA [59]

      $ git clone --branch v0.9.4 https://github.com/hiyouga/LlamaFactory.git
      $ cd LlamaFactory
      $ cd docker/docker-cuda/
      $ sudo docker compose up -d
      $ sudo docker exec -it llamafactory /bin/bash
      

Usage:

  • Create a read-only access token in HuggingFace.

    • Settings > Access Tokens > +Create new token > Token type: Read, Token name: llama-factory > Create token

  • Set that token as an environment variable.

    $ export HF_TOKEN="<HUGGING_FACE_ACCESS_TOKEN>"
    
  • Visit the LLM page on HuggingFace.co. Most require the end-user to accept a usage agreement. Most examples use Qwen/Qwen3-4B-Instruct-2507.

  • Configure the training to use less memory. [61]

    $ ${EDITOR} examples/train_qlora/qwen3_lora_sft_otfq.yaml
    # Lower memory usage.
    ## Use optimized training backends.
    enable_liger_kernel: true
    use_unsloth_gc: true
    ## Optimize RAM usage by using quantization during training only.
    optim: paged_adamw_8bit
    ## Optionally save the resulting model with 4-bit quantization.
    #quantization_bit: 4
    ## Train on one dataset at a time.
    ## This is set by default in the example.
    #per_device_train_batch_size: 1
    
  • Optionally define your own dataset for training instead of the examples.

    • Create a minimum example of a data/dataset_info.json. The default formatting is alpaca unless configured to be sharegpt. [62]

      {
        "<DATASET_NAME_1>": {
          "file_name": "<DATASET_FILE_1>.json",
          "formatting": "alpaca",
          "columns": {
            "<INSTRUCTION_NAME_USED_IN_DATASET>": "instruction",
            "<INPUT_FIELD_NAME_USED_IN_DATASET>": "input",
            "<OUTPUT_FIELD_NAME_USED_IN_DATASET>": "output"
          }
        },
        "<DATASET_NAME_2>": {
          "file_name": "<DATASET_FILE_2>.json",
          "formatting": "alpaca",
          "columns": {
            "<INSTRUCTION_NAME_USED_IN_DATASET>": "instruction",
            "<INPUT_FIELD_NAME_USED_IN_DATASET>": "input",
            "<OUTPUT_FIELD_NAME_USED_IN_DATASET>": "output"
          }
        }
      }
      
      $ ${EDITOR} examples/train_qlora/qwen3_lora_sft_otfq.yaml
      dataset: <DATASET_NAME_1>,<DATASET_NAME_2>
      
  • Optionally configure a different “instruction-tuned” (or “it” for short) model before training. This example uses google/gemma-3-27b-it, which is a LLaVA model that does not fully work.

    • First, find supported models.

      $ grep -P "^    name=" ./src/llamafactory/data/template.py
      
    • Modify the examples.

      $ sed -i 's/Qwen\/Qwen3-4B-Instruct-2507/google\/gemma-3-27b-it/g' examples/*/*.yaml
      $ sed -i 's/qwen3-4b/gemma3-27b/g' examples/*/*.yaml
      $ sed -i 's/template: qwen3_nothink/template: gemma3/g' examples/*/*.yaml
      
  • Optionally disable the iGPU to force the use of the dGPU.

    • AMD [63]

      $ export GPU_DEVICE_ORDINAL="0"
      $ export HIP_VISIBLE_DEVICES="0"
      $ export OMP_DEFAULT_DEVICE="0"
      $ export ROCR_VISIBLE_DEVICES="0"
      
  • Train the model with the efficient QLoRA method. This quantizes the model before training. The resulting trained difference is then applied to the non-quantized model at the end.

    $ llamafactory-cli train examples/train_qlora/qwen3_lora_sft_otfq.yaml
    
  • Try the trained model.

    $ llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml
    
  • Save the model for use with Ollama. [59]

    $ llamafactory-cli export examples/merge_lora/qwen3_lora_sft.yaml
    

Prompt Engineering

Prompt engineering is a focus on getting the best answers from LLMs. [42]

A good prompt will usually have the following [43]:

  • Instruction = Explain in detail exactly what task you want to happen.

  • Context = Provide examples.

  • Input data = Information unique to instruction.

  • Output indicator

    • Provide the education level that the answer should target. For example, pre-school, middle school, college undergraduate, or PhD.

    • Provide the tone. For example, academic, lighthearted, serious, etc.

    • Provide the format of the output. For example, how many sentences, JSON or YAML, C or Rust code, etc.

    • Provide a persona. For example, customer support, game master, teacher, etc.

The more instruction, context, input data, and output indicators provided, the higher the chance of the answer being what is expected. Avoid being vague.

Shot-based prompts usually follow a simple question and answer format. Leave the answer field empty and then the LLM will try to fill it in.

Types of shot-based prompts:

  • Zero-shot = Provide an instruction with no examples.

  • One-shot = Provide an instruction with exactly 1 example.

  • Few-shot = Provide an instruction with 2 or more examples.

Few-shot prompting provides the best results compared to zero-shot and one-shot. [44]

Question: Who is the captain?
Answer: Jean-Luc Picard
Question: Who is the doctor?
Answer: Beverly Crusher
Question: Who is the engineer?
Answer:
Answer: Geordi La Forge

The LLM can be told to roleplay to both think and provide answers in a different way. It is important to specify (1) the role it should play and (2) the tone it should use. [45]

You are the new overly confident captain of the original U.S.S. Enterprise. You are on a peaceful mission to explore space. A Klingon Bird-of-Prey just de-cloaked near the port-bow which starts to divert power to their weapons. This is the first time your crew has experienced a real threat. What is the first order you give to the crew? Use only 1 sentence.
"Raise shields, Sulu, and let's give these Klingons a cordial reminder that the Federation doesn't take kindly to unannounced visits!"

Agents

OpenCode

OpenCode is a programming agent. It provides both a Plan (read-only) and Build (writable) mode to assist with programming and CLI tasks.

OpenCode requires an LLM that has tool support. For example, Gemma 3 is not supported. Even if an LLM has tool support, it may not work through the service provider. For example, Qwen3-Coder does not have tool support through Ollama, but it works with LM Studio. When using Ollama, make sure the LLM page has tags for both thinking (not required but recommended) and tools. [70]

Installation:

  • For the best results, install and use a GPU-accelerated terminal emulator such as Alacritty, Ghostty, Kitty, or WezTerm.

  • Install OpenCode on Linux or macOS. [66]

    $ curl -fsSL https://opencode.ai/install | bash
    
  • If using a private Ollama server, configure that first. Optionally set the default LLM by configuring the top-level model field. [67][68]

    $ ${EDITOR} ~/.config/opencode/config.json
    {
      "$schema": "https://opencode.ai/config.json",
      "model": "ollama/qwen3-coder:30b",
      "provider": {
        "ollama": {
          "npm": "@ai-sdk/openai-compatible",
          "options": {
            "baseURL": "http://127.0.0.1:11434/v1"
          },
          "models": {
            "qwen3-coder:30b": {
              "reasoning": true,
              "tools": true
            },
            "<LLM2>: {
              "reasoning": true,
              "tools": true
            }
          }
        }
      }
    }
    
  • Launch OpenCode in the directory of the programming project.

    $ opencode
    
  • If using a remote model, configure that now.

    /connect
    
  • First change to Plan mode by pressing the “TAB” key. OpenCode defaults to opening in Build mode. Once a good solution has been determined, switch back to Build mode to implement it.

  • Use the “at” symbol to mention a file name: @<PATH_TO_FILE>. Tab completion can also be used to fill out the entire path.

  • Switch to a different model.

    /models
    
  • Exit OpenCode. It will automatically update to the latest version after this.

    /exit
    
  • Start OpenCode again with the last session or a specific session. This loads all of the context. [71]

    $ opencode --continue
    
    $ opencode session list
    $ opencode --session <OPENCODE_SESSION_ID>
    

History

Bibliography

  1. “Classification, regression, and prediction - what’s the difference?” Towards Data Science. December 11, 2020. Accessed November 7, 2022. https://towardsdatascience.com/classification-regression-and-prediction-whats-the-difference-5423d9efe4ec

  2. “A beginner’s guide to the math that powers machine learning.” TNW The heart of tech. October 2, 2022. Accessed November 7, 2022. https://thenextweb.com/news/a-beginners-guide-to-the-math-that-powers-machine-learning-syndication

  3. “Math for Machine Learning: 14 Must-Read Books.” Machine Learning Techniques. June 13, 2022. Accessed November 7, 2022. https://mltechniques.com/2022/06/13/math-for-machine-learning-12-must-read-books/

  4. “What is the best programming language for Machine Learning?” Towards Data Science. May 5, 2017. Accessed November 7, 2022. https://towardsdatascience.com/what-is-the-best-programming-language-for-machine-learning-a745c156d6b7

  5. “7 Top Machine Learning Programming Languages.” Codeacademy. October 20, 2021. Accessed November 7, 2022. https://www.codecademy.com/resources/blog/machine-learning-programming-languages/

  6. “How to Pick the Best Graphics Card for Machine Learning.” Towards Data Science. September 19, 2022. Accessed November 7, 2022. https://towardsdatascience.com/how-to-pick-the-best-graphics-card-for-machine-learning-32ce9679e23b

  7. “Does TensorFlow Support OpenCL?” IndianTechWarrior. Accessed November 7, 2022. https://indiantechwarrior.com/does-tensorflow-support-opencl/

  8. “Chatbot Arena LLM Leaderboard: Community-driven Evaluation for Best LLM and AI chatbots.” Chatbot Arena. Accessed December 4, 2024. https://lmarena.ai/

  9. “FAQ.” GitHub ollama/ollama. April 28, 2025. Accessed May 27, 2025. https://github.com/ollama/ollama/blob/main/docs/faq.md

  10. “What does 7b, 8b and all the b’s mean on the models and how are each models different from one another?” Reddit r/LocalLLaMA. May 23, 2024. Accessed December 4, 2024. https://www.reddit.com/r/LocalLLaMA/comments/1cylwmd/what_does_7b_8b_and_all_the_bs_mean_on_the_models/

  11. “Running Llama 3.1 Locally with Ollama: A Step-by-Step Guide.” Medium - Paulo Batista. July 25, 2024. Accessed December 4, 2024. https://medium.com/@paulocsb/running-llama-3-1-locally-with-ollama-a-step-by-step-guide-44c2bb6c1294

  12. “LLaMA 3.2 vs. LLaMA 3.1 vs. Gemma 2: Finding the Best Open-Source LLM for Content Creation.” Medium - RayRay. October 2, 2024. Accessed December 4, 2024. https://byrayray.medium.com/llama-3-2-vs-llama-3-1-vs-gemma-2-finding-the-best-open-source-llm-for-content-creation-1f6085c9f87a

  13. “Llama 3.2 Vision.” Ollama. November 6, 2024. Accessed December 4, 2024. https://ollama.com/blog/llama3.2-vision

  14. “I can now run a GPT-4 class model on my laptop.” Simon Willison’s Weblog. December 9, 2024. Accessed December 12, 2024. https://simonwillison.net/2024/Dec/9/llama-33-70b/

  15. “v0.7.0.” GitHub ollama/ollama. May 12, 2025. Accessed June 26, 2025. https://github.com/ollama/ollama/releases/tag/v0.7.0

  16. “MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models.” GitHub BradyFU/Awesome-Multimodal-Large-Language-Models. November 26, 2024. Accessed June 26, 2025. https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation?tab=readme-ov-file

  17. “deepseek-coder-v2.” Ollama. September, 2024. Accessed December 13, 2024. https://ollama.com/library/deepseek-coder-v2

  18. “Best LLM Model for coding.” Reddit r/LocalLLaMA. November 6, 2024. Accessed February 4, 2025. https://www.reddit.com/r/LocalLLaMA/comments/1gkewyp/best_llm_model_for_coding/

  19. “OpenSUSE MicroOS Howto with AMDGPU / ROCm - To run CUDA AI Apps like Ollama.” GitHub Gist torsten-online. February 10, 2025. Accessed March 7, 2025. https://gist.github.com/torsten-online/22dd2746ddad13ebbc156498d7bc3a80

  20. “Difference in different quantization methods #2094.” GitHub ggml-org/llama.cpp. October 27, 2024. Accessed May 27, 2025. https://github.com/ggml-org/llama.cpp/discussions/2094

  21. “Configuring Your Ollama Server.” ShinChven’s Blog. January 15, 2025. Accessed May 27, 2025. https://atlassc.net/2025/01/15/configuring-your-ollama-server

  22. “Open WebUI.” GitHub open-webui/open-webui. June 10, 2025. Accessed June 23, 2025. https://github.com/open-webui/open-webui

  23. “Web Search.” Open WebUI. Accessed June 23, 2025. https://docs.openwebui.com/category/-web-search/

  24. “Environment Variable Configuration.” Open WebUI. June 22, 2025. Accessed June 23, 2025. https://docs.openwebui.com/getting-started/env-configuration

  25. “duckduckgo_search.exceptions.RatelimitException: 202 Ratelimit #6624.” GitHub open-webui/open-webui. June 6, 2025. Accessed June 23, 2025. https://github.com/open-webui/open-webui/discussions/6624

  26. “issue: Too Many Requests #14244.” GitHub open-webui/open-webui. June 14, 2025. Accessed June 23, 2025. https://github.com/open-webui/open-webui/discussions/14244

  27. “A Visual Guide to Quantization.” Exploring Language Models. July 22, 2024. Accessed June 26, 2025. https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization

  28. “Creative Writing v3.” EQ-Bench Creative Writing v3 Leaderboard. Accessed June 24, 2025. https://eqbench.com/creative_writing.html

  29. “Qwen-2.5-Coder 32B – The AI That’s Revolutionizing Coding! - Real God in a Box?” Reddit r/LocalLLaMA. March 14, 2025. Accessed June 24, 2025. https://www.reddit.com/r/LocalLLaMA/comments/1gp84in/qwen25coder_32b_the_ai_thats_revolutionizing/

  30. “So what is now the best local AI for coding?” Reddit r/LocalLLaMA. February 25, 2025. Accessed June 24, 2025. https://www.reddit.com/r/LocalLLaMA/comments/1ia0j9o/so_what_is_now_the_best_local_ai_for_coding/

  31. “Codestral 22B, Qwen 2.5 Coder B, and DeepSeek V2 Coder: Which AI Coder Should You Choose?” Deepgram. October 10, 2024. Accessed June 24, 2025. https://deepgram.com/learn/best-local-coding-llm

  32. “In Feb 2025, what’s your LLM stack for productivity?” Reddit r/LocalLLaMA. February 8, 2025. Accessed June 24, 2025. https://www.reddit.com/r/LocalLLaMA/comments/1ik6fy3/in_feb_2025_whats_your_llm_stack_for_productivity/

  33. “Anthropic’s Claude 3.7 Sonnet is the new king of code generation (but only with help), and DeepSeek R1 disappoints (Deep dives from the DevQualityEval v1.0).” Symflower. 2025. Accessed February 5, 2026. https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/

  34. “Stable Code 3B: Coding on the Edge.” Hacker News. January 20, 2025. Accessed June 24, 2025. https://news.ycombinator.com/item?id=39019532

  35. “DeepSeek Coder.” GitHub deepseek-ai/DeepSeek-Coder. March 6, 2024. Accessed June 24, 2025. https://github.com/deepseek-ai/deepseek-coder

  36. “Comparing quants of QwQ Preview in Ollama.” leikareipa.github.io. December 17, 2024. Accessed June 24, 2025. https://leikareipa.github.io/blog/comparing-quants-of-qwq-preview-in-ollama/

  37. “Question on model sizes vs. GPU.” Reddit r/ollama. September 4, 2024. Accessed June 26, 2025. https://www.reddit.com/r/ollama/comments/1d4ofem/question_on_model_sizes_vs_gpu/

  38. “How much VRAM do I need for LLM model fine-tuning?” Modal Blog. September 1, 2024. Accessed June 26, 2025. https://modal.com/blog/how-much-vram-need-fine-tuning

  39. “Context Kills VRAM: How to Run LLMs on consumer GPUs.” Medium Lyx. May 9, 2025. Accessed February 5, 2026. https://medium.com/@lyx_62906/context-kills-vram-how-to-run-llms-on-consumer-gpus-a785e8035632

  40. “A Guide to Quantization in LLMs.” Symbl.ai. February 21, 2025. Accessed June 27, 2025. https://symbl.ai/developers/blog/a-guide-to-quantization-in-llms/

  41. “gemma3:27b.” Ollama. April 18, 2025. Accessed June 27, 2025. https://ollama.com/library/gemma3:27b

  42. “What is Prompt Engineering?” AWS Cloud Computing Concepts Hub. Accessed June 30, 2025. https://aws.amazon.com/what-is/prompt-engineering/

  43. “Elements of a Prompt.” Prompt Engineering Guide. April 24, 2025. Accessed June 30, 2025. https://www.promptingguide.ai/introduction/elements

  44. “Technique #3: Examples in Prompts: From Zero-Shot to Few-Shot.” Learn Prompting. March 6, 2025. Accessed June 30, 2025. https://learnprompting.org/docs/basics/few_shot

  45. “Mastering Persona Prompts: A Guide to Leveraging Role-Playing in LLM-Based Applications like ChatGPT or Google Gemini.” Medium Ankit Kumar. February 16, 2025. Accessed June 30, 2025. https://architectak.medium.com/mastering-persona-prompts-a-guide-to-leveraging-role-playing-in-llm-based-applications-1059c8b4de08

  46. “Ollama Model File.” GitHub ollama/ollama. July 11, 2025. Accessed July 22, 2025. https://github.com/ollama/ollama/blob/main/docs/modelfile.md

  47. “How to Customize LLM Models with Ollama’s Modelfile?” Collabnix. March 20, 2025. https://collabnix.com/how-to-customize-llm-models-with-ollamas-modelfile/

  48. “Ollama - Building a Custom Model.” Unmesh Gundecha. October 22, 2023. Accessed July 22, 2025. https://unmesh.dev/post/ollama_custom_model/

  49. “How to uninstall Ollama.” Collabnix. April 15, 2024. Accessed July 22, 2025. https://collabnix.com/how-to-uninstall-ollama/

  50. “Stop Ollama #690.” GitHub ollama/ollama. July 20, 2025. Accessed July 22, 2025. https://github.com/ollama/ollama/issues/690

  51. “how to remove ollama from macos? #2028.” GitHub ollama/ollama. June 26, 2025. Accessed July 22, 2025. https://github.com/ollama/ollama/issues/2028

  52. “Sizing VRAM to Generative AI & LLM Workloads.” Puget Systems. July 18, 2025. Accessed October 3, 2025. https://www.pugetsystems.com/labs/articles/sizing-vram-to-generative-ai-and-llm-workloads/

  53. “Gemma 3: A 27B Multimodal LLM Better Than Really Big Models.” Medium. March 12, 2025. Accessed October 6, 2025. https://medium.com/@elmo92/gemma-3-a-27b-multimodal-llm-better-than-really-big-models-b4fe0f4949b4

  54. “Gemma 3 model overview.” Google AI for Developers. August 14, 2024. Accessed October 6, 2025. https://ai.google.dev/gemma/docs/core

  55. “Context length.” Ollama Documentation. Accessed February 5, 2026. https://docs.ollama.com/context-length

  56. “Pricing.” ChatGPT Plans. 2026. Accessed February 5, 2026. https://chatgpt.com/pricing/

  57. “server: use tiered VRAM-based default context length.” GitHub ollama/ollama. February 2, 2026. Accessed February 5, 2026. https://github.com/ollama/ollama/commit/0334ffa6250752c0e5e3d7f4467b0f50cc906fde

  58. “Serious misalignment in LLaVA implementation #6008.” GitHub hiyouga/LlamaFactory. February 18, 2026. Accessed March 3, 2026. https://github.com/hiyouga/LlamaFactory/issues/6008

  59. “LLaMA-Factory Easy and Efficient LLM Fine-Tuning.” GitHub hiyouga/LlamaFactory. March 3, 2026. Accessed March 3, 2026. https://github.com/hiyouga/LlamaFactory

  60. “Fine-tune Llama-3.1 8B with Llama-Factory.” 2026. Accessed March 3, 2026. https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/fine_tune/llama_factory_llama3.html

  61. “FAQ.” GitHub hiyouga/LlamaFactory. March 3, 2026. Accessed March 3, 2026. https://github.com/hiyouga/LlamaFactory/issues/4614

  62. “LlamaFactory/data/README.md.” GitHub hiyouga/LlamaFactory. June 25, 2025. Accessed March 3, 2026. https://github.com/hiyouga/LlamaFactory/blob/main/data/README.md

  63. “How to install ROCm on Linux.” wasdtech. July 6, 2025. Accessed March 3, 2026. https://wasdtech.altervista.org/installation-of-rocm/

  64. “Multi your Threads #4: ROCm Roll!” The Great Refactoring. Accessed March 3, 2026. https://vilelasagna.ddns.net/multi-your-threads/multi-your-threads-4-rocm-roll

  65. “Docker Compose.” Open WebUI. September 21, 2024. Accessed March 3, 2026. https://open-webui.com/docker-compose/

  66. “Intro.” opencode.ai. March 4, 2026. Accessed March 4, 2026. https://opencode.ai/docs/

  67. “From Zero to Local AI: Running OpenCode with Ollama on Your Machine.” Medium Hanns Juarez. November 16, 2025. Accessed March 6, 2026. https://medium.com/@hannsflip/from-zero-to-local-ai-running-opencode-with-ollama-on-your-machine-8a12cc4f551e

  68. “Providers.” OpenCode. March 6, 2026. Accessed March 6, 2026. https://opencode.ai/docs/providers/

  69. “Best Local models to run OpenCode?” Reddit r/LocalLLaMA. January 18, 2026. Accessed March 6, 2026. https://www.reddit.com/r/LocalLLaMA/comments/1mncd7i/best_local_models_to_run_opencode/

  70. “qwen3-coder:latest does not support tools #1619.” GitHub anomalyco/opencode. October 25, 2025. Accessed March 6, 2026. https://github.com/anomalyco/opencode/issues/1619

  71. “OpenCode CLI Commands.” OpenCode Guide. Accessed March 12, 2026. https://opencodeguide.com/en/cli-commands/