Day 1: Taking Declarative Control of a Running Talos Cluster with OpenTofu

The Day 0 bootstrapper exits the moment Talos starts listening on port 50000. The server has an OS. It's waiting for configuration. That's where OpenTofu takes over.

This post covers 02-kubernetes/, the layer that codifies the full cluster: 13 nodes across four roles, machine config generation and application, and Cilium managed as a Helm release. This is where the siderolabs/talos provider earns its place — proper declarative API, proper state, proper drift detection.

The Boundary

The siderolabs/talos provider talks directly to the Talos API on port 50000. Unlike the null_resource approach from 01-hardware/, it can actually observe state: read running configuration, apply patches, track whether a node needs updating. The moment a server is in maintenance mode waiting for its machine config, this layer owns it.

resource "talos_machine_configuration_apply" "node" {
  for_each = local.nodes

  client_configuration        = talos_machine_secrets.cluster.client_configuration
  machine_configuration_input = data.talos_machine_configuration.node[each.key].machine_configuration
  node                        = each.value.ip
  endpoint                    = each.value.role != "controlplane" ? local.first_controlplane_ip : null

  apply_mode = "no_reboot"
}

apply_mode = "no_reboot" is important for a running production cluster: config changes that can take effect live are applied immediately, and anything that requires a reboot is staged for the next maintenance window rather than rebooting the node on the spot. That makes tofu apply safe to run without taking nodes down.

Node Roles and Patch Composition

The cluster has 13 nodes across four roles: control planes, workers, storage, and monitoring. Rather than one config file per node, machine configurations are built by composing patches:

config_patches = concat(
  [local.patch_common],                          # every node
  each.value.role == "controlplane" ? [local.patch_controlplane] : [],
  each.value.role == "monitoring"   ? [local.patch_monitoring_node, local.patch_monitoring_extras] : [],
  each.value.role == "storage"      ? [local.patch_storage_extras, local.patch_storage_label] : [],
  [local.patch_per_node[each.key]],              # hostname, IP, gateway, disk
)

The common patch sets cluster-wide settings: cluster name, pod and service subnets, CNI disabled (Cilium is installed separately), kube-proxy disabled (Cilium handles it). Role patches add what's specific to that role. The per-node patch sets hostname, static IP, gateway, and install disk.

This keeps the config readable and avoids repeating the same values across 13 node definitions.
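As a sketch of what the common patch might contain (the cluster name and subnets here are placeholders, not the actual values):

```hcl
# Illustrative sketch of patch_common -- cluster name and subnets are
# placeholders. CNI and kube-proxy are disabled because Cilium provides both.
locals {
  patch_common = yamlencode({
    cluster = {
      clusterName = "example-cluster"
      network = {
        podSubnets     = ["10.244.0.0/16"]
        serviceSubnets = ["10.96.0.0/12"]
        cni            = { name = "none" } # Cilium is installed separately
      }
      proxy = { disabled = true }          # Cilium replaces kube-proxy
    }
  })
}
```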

Non-Control-Plane Node Routing

Worker, storage, and monitoring nodes are not directly reachable on port 50000 from outside the cluster network. The provider needs an entry point. The solution is the same as talosctl -e <cp-ip> -n <node-ip>: route through the first control plane.

endpoint = each.value.role != "controlplane" ? local.first_controlplane_ip : null

Control planes are reached directly; everyone else goes through the first control plane as a relay. One line handles the whole fleet.
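The relay IP itself can be derived from the node map rather than hardcoded. A sketch, assuming local.nodes has the shape used in the for_each above:

```hcl
# Derive the relay endpoint from the node map instead of hardcoding it.
# sort() makes the pick deterministic across plans.
locals {
  first_controlplane_ip = sort([
    for name, node in local.nodes : node.ip if node.role == "controlplane"
  ])[0]
}
```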

NVMe Enumeration Is Not Deterministic

This caught me on one of the worker nodes. Talos detects the install disk by PCI bus discovery order, which is not guaranteed to be consistent across reboots or between physically identical servers. Two nodes with the same Samsung 512 GB NVMe drives had their disks enumerate in different orders.

The only safe way to confirm which device holds the STATE partition on a running node is:

talosctl get disks -n <node-ip>

For one worker, the STATE partition had landed on /dev/nvme1n1, not /dev/nvme0n1. One node also has a SATA disk rather than NVMe: a different hardware generation that ended up in the same cluster. Both are accounted for explicitly in the node metadata:

frodo = {
  role    = "worker"
  ip      = local.workers["frodo"]
  gateway = "..."
  disk    = "/dev/sda"  # SATA -- the only non-NVMe node
}

Assuming all nodes use the same disk path is a silent failure waiting to happen on the next upgrade.
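One way to sidestep enumeration order entirely is Talos's install disk selector, which matches on stable attributes such as serial or model rather than a device path. A sketch (the serial is a placeholder):

```yaml
# Sketch: pin the install disk by serial number instead of /dev path,
# so PCI discovery order no longer matters. Serial is a placeholder.
machine:
  install:
    diskSelector:
      serial: "S4EVNF0M000000"
```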

Storage and Monitoring Node Specialization

Storage and monitoring nodes use Talos volume management to carve disk space beyond the system partition.

Storage nodes cap the EPHEMERAL volume at 50 GiB and claim the remaining space as a dedicated partition for GarageHQ data (an S3-compatible object store that handles its own HA across a 3-node cluster, independent of Rook-Ceph):

---
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  diskSelector:
    match: system_disk
  maxSize: 50GiB
---
apiVersion: v1alpha1
kind: UserVolumeConfig
name: garage-data
provisioning:
  diskSelector:
    match: system_disk
  minSize: 100GB
  grow: true

Monitoring nodes get a dedicated local storage volume carved from a secondary disk, plus a taint and label applied via machine config to keep general workloads off them:

patch_monitoring_node = yamlencode({
  machine = {
    nodeLabels = { pool = "monitoring" }
    nodeTaints = { dedicated = "monitoring:NoSchedule" }
  }
})
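The dedicated monitoring volume itself is a UserVolumeConfig like the storage one. A sketch, assuming the secondary disk is simply whichever disk is not the system disk (name and selector are illustrative):

```yaml
# Sketch: carve a local-storage volume for monitoring data from the
# non-system disk. Name and selector expression are illustrative.
---
apiVersion: v1alpha1
kind: UserVolumeConfig
name: monitoring-data
provisioning:
  diskSelector:
    match: '!system_disk'
  grow: true
```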

Both are machine config concerns, not Kubernetes concerns — the right layer for anything that affects how the node boots and presents itself.

Cilium: From Inline Manifest to Helm

Cilium was originally deployed as an inline manifest in the Talos machine config, which is how Talos bootstraps CNI before the Kubernetes API is available. It worked, but it meant Cilium was outside any management layer: no Helm release, no upgrade path, no values file.

Moving it to Helm under OpenTofu ownership required pre-annotating the existing resources with Helm ownership metadata before installing; otherwise Helm refuses to take over resources it didn't create:

helm template cilium cilium/cilium --version 1.16.3 -n kube-system \
  | kubectl annotate -f - meta.helm.sh/release-name=cilium \
      meta.helm.sh/release-namespace=kube-system --overwrite
helm template cilium cilium/cilium --version 1.16.3 -n kube-system \
  | kubectl label -f - app.kubernetes.io/managed-by=Helm --overwrite

Then there are two Talos-specific settings that the default Cilium Helm values get wrong.

First, Talos mounts cgroupv2 at /sys/fs/cgroup, not at the path Cilium expects. Disable Cilium's auto-mount and point it to the right location:

cgroup = {
  autoMount = { enabled = false }
  hostRoot  = "/sys/fs/cgroup"
}

Second, and less obvious: Talos's containerd-shim does not include SYS_MODULE in its capability bounding set. When the default Cilium values request it, runc returns EPERM and the init container crashes. The fix is to explicitly override the capability lists and drop SYS_MODULE:

securityContext = {
  capabilities = {
    ciliumAgent      = ["CHOWN", "KILL", "NET_ADMIN", "NET_RAW", "IPC_LOCK",
                        "SYS_ADMIN", "SYS_RESOURCE", "DAC_OVERRIDE", "FOWNER",
                        "SETGID", "SETUID"]
    cleanCiliumState = ["NET_ADMIN", "SYS_ADMIN", "SYS_RESOURCE"]
  }
}

Neither of these is documented anywhere obvious. Both are required. The pods crashloop with no useful signal until you read the container logs and work backwards from "unable to apply caps: operation not permitted".
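Put together, the Cilium release under OpenTofu might look like this. The resource shape is a sketch: the values keys for cgroup and capabilities come from the snippets above, while kubeProxyReplacement is an assumption based on kube-proxy being disabled (a real setup likely needs additional values such as the API server host and port):

```hcl
resource "helm_release" "cilium" {
  name       = "cilium"
  repository = "https://helm.cilium.io"
  chart      = "cilium"
  version    = "1.16.3"
  namespace  = "kube-system"

  values = [yamlencode({
    # Assumption: Cilium takes over kube-proxy duties, per the common patch.
    kubeProxyReplacement = true
    cgroup = {
      autoMount = { enabled = false }
      hostRoot  = "/sys/fs/cgroup"
    }
    securityContext = {
      capabilities = {
        ciliumAgent = ["CHOWN", "KILL", "NET_ADMIN", "NET_RAW", "IPC_LOCK",
                       "SYS_ADMIN", "SYS_RESOURCE", "DAC_OVERRIDE", "FOWNER",
                       "SETGID", "SETUID"]
        cleanCiliumState = ["NET_ADMIN", "SYS_ADMIN", "SYS_RESOURCE"]
      }
    }
  })]
}
```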

ArgoCD: The Last Piece of Day 1

ArgoCD lives in this layer for the same reason Cilium does: it's infrastructure, not a workload. Upgrading it is a version bump and a tofu apply, not an ArgoCD Application pointing at itself. Keeping both here means one tool owns all cluster-level components and one tofu apply gets you from Talos API to GitOps-ready.

resource "helm_release" "argocd" {
  name             = "argocd"
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  version          = "9.4.15"
  namespace        = "argocd"
  create_namespace = true

  depends_on = [helm_release.cilium]

  values = [yamlencode({
    global = {
      domain = "argocd.example.com"
    }
    configs = {
      params = {
        "server.insecure" = true
      }
    }
  })]
}

server.insecure = true is intentional: TLS terminates at the ingress layer, not inside ArgoCD. The ingress itself comes as a Day 2 Application once ArgoCD is running.

depends_on = [helm_release.cilium] enforces ordering: Cilium must be running before any pod can schedule. Without it, ArgoCD pods could land on nodes before the CNI is ready and get stuck in a networking failure that requires manual intervention.

Secrets in Git: SOPS and Age

Before handing off to ArgoCD, there's a practical problem: ArgoCD will need credentials. Database passwords, API tokens, private keys. Those have to live somewhere, and "somewhere" should be the same git repo as everything else.

The carlpett/sops provider was declared in versions.tf from the initial commit, alongside Talos and Helm. No secrets were needed at that point (the Talos provider generates its own machine secrets), but the encryption strategy was designed for the full Day 0 through Day 2 lifecycle.

The choice is SOPS with Age keys. Not Vault, not sealed-secrets. The reasoning is simple: SOPS is a local tool. There's no server to run, no controller that must be deployed before you can create your first secret. The decryption key is a single file. Encrypted values live in the same repo as the code that uses them, so git diff shows which secrets changed, git blame shows who changed them and when.

Two Keys, Two Purposes

The .sops.yaml at the repo root defines two creation rules:

creation_rules:
  - path_regex: environments/.*/secrets\.enc\.yaml$
    age: <personal-key>
  - path_regex: .*\.enc\.yaml$
    age: <personal-key>,<cluster-key>

The first key is mine, used locally during tofu apply to decrypt infrastructure secrets. The second is a cluster key that ArgoCD uses for runtime decryption in-cluster.

Infrastructure secrets (matched by the first rule) only need the personal key. Everything else, including gitops manifests that ArgoCD will consume, gets encrypted to both keys. One sops --encrypt call, two recipients, and the same file is decryptable both at my terminal and inside the cluster. If the cluster key is compromised, rotate it without touching the bootstrap key.

State Encryption (Separate Concern)

The OpenTofu state file is a separate encryption problem. It lives in Hetzner Object Storage (S3 backend) and contains every resource attribute, including computed values like kubeconfig contents. SOPS doesn't cover this; OpenTofu's native encryption block does:

encryption {
  key_provider "pbkdf2" "state" {
    passphrase = var.state_encryption_passphrase
  }
  method "aes_gcm" "state" {
    keys = key_provider.pbkdf2.state
  }
  state {
    method = method.aes_gcm.state
  }
}

PBKDF2 derives a key from a passphrase, AES-GCM encrypts the state at rest. The passphrase comes from an environment variable: password manager for local runs, Kubernetes Secret on the self-hosted runner.
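Locally, the passphrase just has to exist in the environment before tofu runs, since OpenTofu maps TF_VAR_state_encryption_passphrase to var.state_encryption_passphrase. A small guard catches the case where it doesn't (the guard itself is a sketch, not part of the repo):

```shell
# Guard run before `tofu apply`: fail fast if the state encryption
# passphrase is missing from the environment.
check_passphrase() {
  if [ -z "${TF_VAR_state_encryption_passphrase:-}" ]; then
    echo "error: TF_VAR_state_encryption_passphrase not set" >&2
    return 1
  fi
  echo "passphrase present"
}

# Missing: the guard trips. Set: it passes.
unset TF_VAR_state_encryption_passphrase
check_passphrase || echo "guard tripped"
TF_VAR_state_encryption_passphrase="from-password-manager" check_passphrase
```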

Two encryption layers, two different scopes:

  • SOPS: encrypts secret values in git (credentials, tokens, private keys)
  • State encryption: encrypts the entire state file at rest (which contains all resource attributes, including things the Talos provider computes like kubeconfig contents)

Day 1 sets up both. Day 2 activates SOPS at runtime when ArgoCD starts decrypting secrets through KSOPS.

What This Layer Doesn't Touch

02-kubernetes/ stops once ArgoCD is running and has the credentials and encryption keys it needs to pull its first repo. The bootstrap — repository secrets, the KSOPS decryption setup, and the root Application that kicks off the GitOps loop — is the bridge between Day 1 and Day 2, and the subject of the next post.

The handoff point is clean: OpenTofu owns everything needed to run the cluster, ArgoCD owns everything that runs on top of it. The same boundary principle from Day 0, applied one layer up.

Looking for my current work? Visit Nubosas – where we build Sovereign Private Clouds for scaling SaaS companies.
