Building a Private AI Terminal with Docker, llama.cpp, and OpenCode

Complete Tech Blog & Setup Guide
Author: Phaneesh | Date: May 9, 2026 | Repository: Local_Ai_ON_OPENCODE


Introduction

As developers increasingly rely on Large Language Models (LLMs) to enhance their workflows, privacy and offline availability have become critical concerns. Sending proprietary code or system architectures to cloud APIs isn’t always viable. This post walks through the solution implemented in this repository: a completely self-hosted AI terminal built with Docker Compose, llama.cpp, and OpenCode.

By containerizing these tools, we can spin up a dedicated, privacy-first AI assistant directly on our local machines without complex dependency management.

Project Purpose & Architecture

The stack, defined in a single docker-compose.yaml file, consists of two services:

  1. gemma-llama: An inference server that runs the highly efficient Gemma-4-E4B model via llama.cpp.
  2. opencode-terminal: A custom terminal environment that integrates seamlessly with the AI server, providing a workspace with all necessary system mounts to act as an intelligent developer assistant.

This architecture keeps the inference engine in its own container, isolated from the host OS, while the terminal container gets controlled access to your workspace via persistent volume mounts and to the Docker daemon via the Docker socket.

Code Deep Dive: The Orchestration

Let’s look at how the orchestration is configured. The docker-compose.yaml file defines the gemma-llama service with settings tuned for CPU-only local inference:

services:
  gemma-llama:
    image: ghcr.io/ggml-org/llama.cpp:full
    container_name: gemma-llama
    command: >
      --server
      -m /models/gemma-4-E4B-it-Q4_K_M.gguf
      --port 8080
      --host 0.0.0.0
      --reasoning off
      -ngl 0
      --jinja
      -c 16384
      --parallel 1
    ports:
      - "8080:8080"
    volumes:
      - ~/gemma_models:/models

Key configurations:

  • --server: Exposes the model over an OpenAI-compatible HTTP API (a quick curl check follows this list).
  • -m /models/...: Uses a quantized (Q4_K_M) version of the Gemma-4-E4B model. Quantization is crucial here as it drastically reduces the RAM required with only a modest loss in output quality.
  • -c 16384: Allocates a large context window (16K tokens), enabling the AI to read extensive code files and logs.
  • -ngl 0: Offloads zero layers to the GPU, so inference runs entirely on the CPU and no GPU driver setup is required.
  • volumes: Mounts a local directory ~/gemma_models so you don’t have to re-download gigabytes of model data every time the container spins up.
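
Once the stack is up, you can sanity-check the OpenAI-compatible API from the host. This is a minimal sketch assuming the port mapping shown above; for a single-model llama-server the model field is mostly informational:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-4-E4B-it-Q4_K_M.gguf",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
      }'

A JSON response containing a choices array confirms the server is loaded and reachable before you wire up OpenCode.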

The OpenCode Terminal Integration

The magic happens in the opencode-terminal container, which serves as your daily driver interface. It links to the gemma-llama service and mounts your local workspace:

  opencode-terminal:
    image: opencode:usethis
    container_name: opencode-terminal
    working_dir: /work    
    volumes:
      - $HOME/.opencode:/root/.config/opencode    
      - $PWD:/work
      - /var/run/docker.sock:/var/run/docker.sock
    depends_on:
      gemma-llama:
        condition: service_healthy

By mounting /var/run/docker.sock, the AI inside the terminal can execute Docker commands, introspect other containers, and perform DevOps tasks natively (keep in mind that the socket grants broad control over the host’s Docker daemon, so this mount is a deliberate trade-off). The depends_on block with a service_healthy condition ensures that the terminal only becomes available after the llama.cpp server has successfully loaded the model into memory, provided the gemma-llama service defines a healthcheck, as sketched below.
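
Compose will not honor service_healthy for a service that has no healthcheck, so gemma-llama needs one. A minimal sketch, assuming curl is available inside the llama.cpp image and relying on llama-server’s /health endpoint (which returns an error status until the model has finished loading):

    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 30
      start_period: 60s

The generous retries and start_period give the server time to map a multi-gigabyte GGUF file before Compose starts the terminal container.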

Configuring OpenCode

With the server running, you can interact with your locally hosted model through OpenCode, which talks to the llama-server endpoint over its OpenAI-compatible API.

To set this up, configure your ~/.opencode/opencode.json as follows:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://localhost:8080/v1"
      },
      "models": {
        "gemma-4-E4B-it-Q4_K_M.gguf": {
          "name": "gemma-4-E4B-it-Q4_K_M (local)",
          "limit": {
            "context": 65536,
            "output": 16384
          }
        }
      }
    }
  }
}
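
One detail to verify against your own setup: the baseURL above assumes OpenCode is calling the server from the host, where port 8080 is published. If you instead run OpenCode inside the opencode-terminal container, localhost refers to that container, so the endpoint would use the Compose service name (a sketch, assuming both services share the default Compose network):

        "baseURL": "http://gemma-llama:8080/v1"

Also note that the 65536-token context limit declared here is larger than the -c 16384 the server was started with; whatever the client assumes, the server-side value is the effective ceiling.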

Key Takeaways

  1. True Privacy: By keeping inference localized via llama.cpp and GGUF models, no data leaves your workstation.
  2. Reproducibility: Docker Compose makes this setup “plug-and-play” across any Linux environment.
  3. Seamless Workflow: Tying the AI directly into a terminal environment bridges the gap between text generation and actionable execution.

This repository is a masterclass in combining modern containerization with lightweight LLM inference. To get started, all you need is docker compose up followed by your terminal access command. Happy local coding!
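
In practice that looks something like the following; the exec command is a sketch under the assumption that the opencode:usethis image ships a shell, so check the repository README for the exact invocation:

# Start the inference server and terminal containers in the background
docker compose up -d

# Attach to the terminal container once gemma-llama reports healthy
docker exec -it opencode-terminal bash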

This post is licensed under CC BY 4.0 by the author.