<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Carlos Perelló Marín's blog</title><link href="https://carlosperello.blog/" rel="alternate"/><link href="/feeds/all.atom.xml" rel="self"/><id>https://carlosperello.blog/</id><updated>2026-04-15T10:00:00+02:00</updated><entry><title>Day 2: GitOps Handoff — Bootstrapping ArgoCD on Talos with OpenTofu</title><link href="https://carlosperello.blog/day-2-gitops-handoff-bootstrapping-argocd-on-talos-with-opentofu/" rel="alternate"/><published>2026-04-15T10:00:00+02:00</published><updated>2026-04-15T10:00:00+02:00</updated><author><name/></author><id>tag:carlosperello.blog,2026-04-15:/day-2-gitops-handoff-bootstrapping-argocd-on-talos-with-opentofu/</id><summary type="html">&lt;p&gt;The &lt;a href="https://carlosperello.blog/day-1-taking-declarative-control-of-a-running-talos-cluster-with-opentofu/"&gt;Day 1 post&lt;/a&gt; ended with ArgoCD deployed as a Helm release inside &lt;code&gt;02-kubernetes/&lt;/code&gt;. It's running, but it can't do anything yet. It has no repository credentials, no decryption keys, and no root Application to tell it what to manage. "Running" is not "working."&lt;/p&gt;
&lt;p&gt;This post covers the bootstrap resources …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The &lt;a href="https://carlosperello.blog/day-1-taking-declarative-control-of-a-running-talos-cluster-with-opentofu/"&gt;Day 1 post&lt;/a&gt; ended with ArgoCD deployed as a Helm release inside &lt;code&gt;02-kubernetes/&lt;/code&gt;. It's running, but it can't do anything yet. It has no repository credentials, no decryption keys, and no root Application to tell it what to manage. "Running" is not "working."&lt;/p&gt;
&lt;p&gt;This post covers the bootstrap resources that bridge Day 1 and Day 2: the credentials, the KSOPS decryption setup, and the root Application that kicks off the GitOps loop. Once that loop starts, OpenTofu steps back.&lt;/p&gt;
&lt;h2&gt;The Chicken-and-Egg Problem&lt;/h2&gt;
&lt;p&gt;ArgoCD manages applications from git repositories. To pull from a private repo, it needs credentials. To decrypt SOPS-encrypted manifests, it needs the cluster Age key. But those credentials and keys are themselves secrets that need to live somewhere.&lt;/p&gt;
&lt;p&gt;The solution is straightforward: OpenTofu bootstraps the credentials as Kubernetes Secrets before ArgoCD tries to use them. ArgoCD doesn't manage its own credentials, at least not at bootstrap time. OpenTofu creates them, ArgoCD consumes them, and the two never overlap.&lt;/p&gt;
&lt;p&gt;Everything below lives in &lt;code&gt;argocd-bootstrap.tf&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Repository Credentials&lt;/h2&gt;
&lt;p&gt;Two private repos need to be registered: the platform gitops repo (controlled by Nubosas) and the tenant workload repo (controlled by the tenant). Both use a dedicated service account with repo-scoped GitHub tokens.&lt;/p&gt;
&lt;p&gt;The tokens are stored SOPS-encrypted in &lt;code&gt;secrets.enc.yaml&lt;/code&gt; and decrypted at apply time via the &lt;code&gt;carlpett/sops&lt;/code&gt; provider that was declared back in Day 1:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kr"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;sops_file&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;secrets&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;source_file&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${path.module}/environments/${var.environment}/secrets.enc.yaml&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each repo is registered as a Kubernetes Secret with the label ArgoCD watches for:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;kubernetes_secret&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;argocd_repo_platform&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;metadata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;repo-platform-gitops&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;argocd&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nb"&gt;labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;argocd.argoproj.io/secret-type&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;repository&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;git&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;https://github.com/example-org/platform-gitops.git&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;example-bot&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;data.sops_file.secrets.data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;github_token_platform&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;helm_release.argocd&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The tenant repo follows the same pattern with a different token. The &lt;code&gt;depends_on&lt;/code&gt; ensures the ArgoCD Helm release has been applied before OpenTofu creates Secrets in its namespace.&lt;/p&gt;
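&lt;p&gt;For illustration, the tenant registration might look roughly like this. It is a sketch mirroring the platform block: the resource name, the Secret name, the repo URL, and the &lt;code&gt;github_token_tenant&lt;/code&gt; key are all assumed names, not the real ones.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Sketch only: every name below is a placeholder mirroring the platform repo.
resource &amp;quot;kubernetes_secret&amp;quot; &amp;quot;argocd_repo_tenant&amp;quot; {
  metadata {
    name      = &amp;quot;repo-tenant-workloads&amp;quot; # assumed Secret name
    namespace = &amp;quot;argocd&amp;quot;
    labels = {
      # Same label as the platform repo, so ArgoCD picks it up as a repository.
      &amp;quot;argocd.argoproj.io/secret-type&amp;quot; = &amp;quot;repository&amp;quot;
    }
  }

  data = {
    type     = &amp;quot;git&amp;quot;
    url      = &amp;quot;https://github.com/example-org/tenant-workloads.git&amp;quot; # hypothetical URL
    username = &amp;quot;example-bot&amp;quot;
    password = data.sops_file.secrets.data[&amp;quot;github_token_tenant&amp;quot;] # assumed key name
  }

  depends_on = [helm_release.argocd]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;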
&lt;p&gt;This is the Day 1 encryption strategy paying off. The &lt;code&gt;sops&lt;/code&gt; provider was declared but unused in the initial commit because no secrets were needed for Talos bootstrap. Now it's active, decrypting at apply time with the personal Age key.&lt;/p&gt;
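&lt;p&gt;The provider declaration itself is not reproduced in this post. A minimal sketch of what such a declaration typically looks like follows; the version constraint is an assumption. As far as I know, the provider locates the Age key through the usual SOPS conventions, such as the &lt;code&gt;SOPS_AGE_KEY_FILE&lt;/code&gt; environment variable on the machine running &lt;code&gt;tofu apply&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Sketch only: the version constraint is assumed, not taken from the real repo.
terraform {
  required_providers {
    sops = {
      source  = &amp;quot;carlpett/sops&amp;quot;
      version = &amp;quot;~&amp;gt; 1.0&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;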
&lt;h2&gt;KSOPS: Decrypting Secrets at Runtime&lt;/h2&gt;
&lt;p&gt;OpenTofu decrypts secrets at apply time. But ArgoCD needs to decrypt SOPS-encrypted manifests at sync time, inside the cluster, without the personal key.&lt;/p&gt;
&lt;p&gt;That's where &lt;a href="https://github.com/viaduct-ai/kustomize-sops"&gt;KSOPS&lt;/a&gt; comes in. It's a Kustomize plugin that intercepts SOPS-encrypted resources during ArgoCD's render phase and decrypts them using the cluster Age key.&lt;/p&gt;
&lt;p&gt;ArgoCD's container image doesn't ship with KSOPS. The standard approach is an init container that downloads the binary into a shared volume before the repo-server starts.&lt;/p&gt;
&lt;h3&gt;The Init Container&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;repoServer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;initContainers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;install-ksops&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;alpine:3.21&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sh&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;-c&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;set -e&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;wget -qO- https://github.com/viaduct-ai/kustomize-sops/releases/download/v4.4.0/ksops_4.4.0_Linux_x86_64.tar.gz \&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;| tar xz -C /custom-tools&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;chmod +x /custom-tools/ksops&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;mkdir -p /custom-tools/plugin/viaduct.ai/v1/ksops&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;cp /custom-tools/ksops /custom-tools/plugin/viaduct.ai/v1/ksops/ksops&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;volumeMounts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;custom-tools&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;mountPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/custom-tools&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;securityContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;runAsNonRoot&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;runAsUser&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;65534&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;allowPrivilegeEscalation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;false&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;ALL&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;seccompProfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;RuntimeDefault&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Two things are worth calling out. First, the binary gets laid down twice: once as a flat &lt;code&gt;ksops&lt;/code&gt; that ends up on the repo-server's &lt;code&gt;PATH&lt;/code&gt;, where Kustomize can invoke it directly as an exec function, and once inside &lt;code&gt;plugin/viaduct.ai/v1/ksops/ksops&lt;/code&gt; so Kustomize can also discover it as a legacy exec plugin via &lt;code&gt;KUSTOMIZE_PLUGIN_HOME&lt;/code&gt;. Kustomize's two plugin mechanisms look for the binary in different places.&lt;/p&gt;
&lt;p&gt;Second, the security context. This init container downloads and unpacks a single binary, so it has no reason to run as root, hold any capabilities, or escalate privileges. &lt;code&gt;runAsUser: 65534&lt;/code&gt; is &lt;code&gt;nobody&lt;/code&gt;. The &lt;code&gt;seccompProfile: RuntimeDefault&lt;/code&gt; restricts syscalls to the container runtime's default set. Defense in depth for a container that runs for two seconds, but it costs nothing.&lt;/p&gt;
&lt;h3&gt;Mounting the Binary and the Key&lt;/h3&gt;
&lt;p&gt;The KSOPS binary goes into &lt;code&gt;/usr/local/bin/ksops&lt;/code&gt; on the repo-server via the shared volume, plus the plugin tree gets mounted under &lt;code&gt;/home/argocd/ksops-plugin&lt;/code&gt;. The cluster Age key gets mounted as a directory so SOPS finds it at the path it expects:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;repoServer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;custom-tools&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;emptyDir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;sops-age-key&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;secretName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ksops-age-key&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;defaultMode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;292&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# 0444 — readable by argocd user (uid 999)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;volumeMounts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;custom-tools&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;mountPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/usr/local/bin/ksops&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;subPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ksops&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;custom-tools&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;mountPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/home/argocd/ksops-plugin&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;subPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;plugin&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;sops-age-key&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;mountPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/home/argocd/.config/sops/age&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;readOnly&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;SOPS_AGE_KEY_FILE&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/home/argocd/.config/sops/age/age.agekey&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;XDG_CONFIG_HOME&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/home/argocd/.config&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;KUSTOMIZE_PLUGIN_HOME&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/home/argocd/ksops-plugin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Three env vars matter here. &lt;code&gt;SOPS_AGE_KEY_FILE&lt;/code&gt; tells SOPS exactly where the private key lives. &lt;code&gt;XDG_CONFIG_HOME&lt;/code&gt; keeps SOPS's config search rooted under the argocd user's home. &lt;code&gt;KUSTOMIZE_PLUGIN_HOME&lt;/code&gt; is what makes Kustomize discover KSOPS as an exec plugin. Miss any of these and the symptoms look much the same: opaque "decryption failed" or "plugin not found" errors at sync time.&lt;/p&gt;
&lt;p&gt;The Age key Secret is created by OpenTofu alongside the repo credentials:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;kubernetes_secret&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;ksops_age_key&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;metadata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ksops-age-key&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;argocd&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;age.agekey&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;data.sops_file.secrets.data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;age_cluster_private_key&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;helm_release.argocd&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Notice: the cluster private key is itself SOPS-encrypted in &lt;code&gt;secrets.enc.yaml&lt;/code&gt;, decrypted by OpenTofu using the personal key, then stored as a Kubernetes Secret for KSOPS. SOPS decrypting a key that enables more SOPS decryption.&lt;/p&gt;
&lt;p&gt;Finally, Kustomize needs the &lt;code&gt;--enable-alpha-plugins&lt;/code&gt; and &lt;code&gt;--enable-exec&lt;/code&gt; flags to recognize and run KSOPS. These go under &lt;code&gt;configs.cm&lt;/code&gt; (the argocd-cm ConfigMap), not &lt;code&gt;configs.params&lt;/code&gt; (the argocd-cmd-params-cm ConfigMap that drives CLI flags):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;configs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;server.insecure&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;kustomize.buildOptions&amp;quot;&lt;/span&gt;&lt;span class="p p-Indicator"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;--enable-alpha-plugins&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--enable-exec&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is easy to put in the wrong block: &lt;code&gt;configs.params&lt;/code&gt; looks like the natural home for a "build option", but ArgoCD reads &lt;code&gt;kustomize.buildOptions&lt;/code&gt; from the argocd-cm ConfigMap. In the wrong block, the setting is a silent no-op.&lt;/p&gt;
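&lt;p&gt;One way these chart values could reach ArgoCD from OpenTofu is through the &lt;code&gt;values&lt;/code&gt; argument of the Helm release. The sketch below is an assumption about the wiring, not the actual Day 1 resource, which may load a values file instead; the chart name and repository URL are the upstream argo-helm ones, and everything else is illustrative.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Sketch only: shows where configs.cm and configs.params sit in the values tree.
resource &amp;quot;helm_release&amp;quot; &amp;quot;argocd&amp;quot; {
  name       = &amp;quot;argocd&amp;quot;
  namespace  = &amp;quot;argocd&amp;quot;
  repository = &amp;quot;https://argoproj.github.io/argo-helm&amp;quot;
  chart      = &amp;quot;argo-cd&amp;quot;

  values = [
    yamlencode({
      configs = {
        # Rendered into the argocd-cmd-params-cm ConfigMap.
        params = {
          &amp;quot;server.insecure&amp;quot; = true
        }
        # Rendered into the argocd-cm ConfigMap, where ArgoCD reads build options.
        cm = {
          &amp;quot;kustomize.buildOptions&amp;quot; = &amp;quot;--enable-alpha-plugins --enable-exec&amp;quot;
        }
      }
    })
  ]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;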
&lt;h2&gt;The Three-Layer Decryption Flow&lt;/h2&gt;
&lt;p&gt;Day 1 set up the two-key design. Here's how it completes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;At rest in git&lt;/strong&gt;: secrets are encrypted with SOPS using Age public keys. Both the personal key and cluster key are recipients, so either can decrypt.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;At bootstrap (&lt;code&gt;tofu apply&lt;/code&gt;)&lt;/strong&gt;: the &lt;code&gt;carlpett/sops&lt;/code&gt; provider decrypts &lt;code&gt;secrets.enc.yaml&lt;/code&gt; using the personal Age key (available locally). The decrypted values become Kubernetes Secrets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;At runtime (ArgoCD sync)&lt;/strong&gt;: when ArgoCD reconciles a repo containing SOPS-encrypted manifests, KSOPS intercepts the render, reads the cluster Age key from the mounted Secret, and decrypts inline. The plaintext manifest is applied to the cluster. Plaintext never touches git.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Two keys, three layers, one encryption strategy designed from Day 1.&lt;/p&gt;
&lt;h2&gt;The Root Application&lt;/h2&gt;
&lt;p&gt;With credentials and decryption in place, the last step is telling ArgoCD what to manage. This is the &lt;strong&gt;App of Apps&lt;/strong&gt; pattern: a single root Application that watches a directory for other Application definitions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# root-app.yaml&lt;/span&gt;
&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Application&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;root&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;argocd&lt;/span&gt;
&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;default&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;repoURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://github.com/example-org/platform-gitops.git&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;targetRevision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;HEAD&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;apps&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;recurse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://kubernetes.default.svc&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;argocd&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;syncPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;automated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;prune&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;selfHeal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;directory.recurse: true&lt;/code&gt; is important. The &lt;code&gt;apps/&lt;/code&gt; directory uses subdirectories (&lt;code&gt;apps/platform/&lt;/code&gt;, &lt;code&gt;apps/tenants/&lt;/code&gt;) to organize Application definitions. Without recursion, ArgoCD only scans the top level and misses everything in subdirectories. I initially had this flat and had to add recursion when the directory structure grew.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;prune: true&lt;/code&gt; means ArgoCD deletes resources that no longer exist in git. &lt;code&gt;selfHeal: true&lt;/code&gt; means it reverts any manual changes. Together, they enforce that git is the single source of truth.&lt;/p&gt;
&lt;h3&gt;Why &lt;code&gt;null_resource&lt;/code&gt; Is Ugly but Right&lt;/h3&gt;
&lt;p&gt;There's no clean way to apply a custom resource through the OpenTofu Kubernetes provider and have ArgoCD manage it afterwards. The &lt;code&gt;kubernetes_manifest&lt;/code&gt; resource could create the Application, but then both OpenTofu and ArgoCD would be managing the same resource, fighting over the source of truth. The root Application needs to be created once and then owned entirely by ArgoCD.&lt;/p&gt;
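&lt;p&gt;For contrast, the rejected alternative would look roughly like the sketch below. The file location is an assumption. With this in place, OpenTofu tracks the Application in state and diffs it on every plan, which is exactly the shared ownership described above. As far as I know, &lt;code&gt;kubernetes_manifest&lt;/code&gt; also needs the Application CRD to exist at plan time, which is awkward when the same configuration installs ArgoCD.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Rejected alternative, shown only for contrast. The path is assumed.
resource &amp;quot;kubernetes_manifest&amp;quot; &amp;quot;argocd_root_app&amp;quot; {
  # Parse the same root-app.yaml and hand it to the provider as an object.
  manifest = yamldecode(file(&amp;quot;${path.module}/root-app.yaml&amp;quot;))

  depends_on = [helm_release.argocd]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;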
&lt;p&gt;The pragmatic answer is &lt;code&gt;null_resource&lt;/code&gt; with &lt;code&gt;local-exec&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;local_sensitive_file&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;kubeconfig&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;talos_cluster_kubeconfig.cluster.kubeconfig_raw&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;filename&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;${path.module}/.kubeconfig&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;null_resource&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;argocd_root_app&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;triggers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;once&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bootstrap&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kr"&gt;provisioner&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;local-exec&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;kubectl apply -f ${path.module}/root-app.yaml&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nb"&gt;environment&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="na"&gt;KUBECONFIG&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;local_sensitive_file.kubeconfig.filename&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;helm_release.argocd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;kubernetes_secret.argocd_repo_platform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;kubernetes_secret.argocd_repo_tenant&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;kubernetes_secret.ksops_age_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;triggers = { once = "bootstrap" }&lt;/code&gt; is a static value. It triggers on the first &lt;code&gt;tofu apply&lt;/code&gt; and never again, because the trigger value never changes. Subsequent applies see the &lt;code&gt;null_resource&lt;/code&gt; in state with the same trigger and skip it. This is a fire-and-forget bootstrap step.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;depends_on&lt;/code&gt; list is the full dependency chain: ArgoCD must be running, both repo credentials must exist, and the KSOPS key must be in place. Only then does &lt;code&gt;kubectl apply&lt;/code&gt; create the root Application. After that, ArgoCD takes over.&lt;/p&gt;
&lt;p&gt;Destroying the &lt;code&gt;null_resource&lt;/code&gt;, or removing it from OpenTofu state, does NOT delete the ArgoCD Application from the cluster, because no destroy-time provisioner is defined. It's a one-way handoff.&lt;/p&gt;
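&lt;p&gt;If the root Application manifest is expected to change over time, one possible variation is to key the trigger on the file's hash instead of a static string. This is only a sketch, not the configuration used here: &lt;code&gt;manifest_sha&lt;/code&gt; is an illustrative key name, and the trade-off is that OpenTofu re-asserts the manifest whenever the file changes, which softens the one-way handoff.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;resource &amp;quot;null_resource&amp;quot; &amp;quot;argocd_root_app&amp;quot; {
  triggers = {
    # Hypothetical key: a new hash replaces the resource and re-runs the provisioner.
    manifest_sha = filesha256(&amp;quot;${path.module}/root-app.yaml&amp;quot;)
  }

  provisioner &amp;quot;local-exec&amp;quot; {
    command = &amp;quot;kubectl apply -f ${path.module}/root-app.yaml&amp;quot;
    environment = {
      KUBECONFIG = local_sensitive_file.kubeconfig.filename
    }
  }

  depends_on = [
    helm_release.argocd,
    kubernetes_secret.argocd_repo_platform,
    kubernetes_secret.argocd_repo_tenant,
    kubernetes_secret.ksops_age_key,
  ]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With the static trigger, a one-off re-run can still be forced with &lt;code&gt;tofu apply -replace=null_resource.argocd_root_app&lt;/code&gt;, which keeps the fire-and-forget behaviour for every other apply.&lt;/p&gt;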
&lt;h2&gt;Multi-Tenancy: AppProject as the Authorization Boundary&lt;/h2&gt;
&lt;p&gt;Once the root Application syncs, it picks up everything in &lt;code&gt;apps/&lt;/code&gt;, including the tenant onboarding resources. The natural question at this point: what stops a tenant from deploying into &lt;code&gt;kube-system&lt;/code&gt; or another tenant's namespace?&lt;/p&gt;
&lt;p&gt;The answer is the &lt;strong&gt;AppProject&lt;/strong&gt; CRD. Every Application references a project, and the project defines what that Application is allowed to do.&lt;/p&gt;
&lt;p&gt;The tenant AppProject:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;AppProject&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;tenant-a&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;argocd&lt;/span&gt;
&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Tenant workloads&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;sourceRepos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://github.com/example-tenant/tenant-gitops.git&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;destinations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://kubernetes.default.svc&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;tenant-a&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;clusterResourceWhitelist&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;namespaceResourceBlacklist&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ResourceQuota&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;LimitRange&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;orphanedResources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;sourceRepos&lt;/code&gt; locks the tenant to their own repo. &lt;code&gt;destinations&lt;/code&gt; restricts deployment to the &lt;code&gt;tenant-a&lt;/code&gt; namespace only. &lt;code&gt;clusterResourceWhitelist: []&lt;/code&gt; blocks all cluster-scoped resources (Namespaces, ClusterRoles, CRDs). &lt;code&gt;namespaceResourceBlacklist&lt;/code&gt; prevents the tenant from overriding the ResourceQuota and LimitRange that the platform manages.&lt;/p&gt;
&lt;p&gt;If the tenant pushes a Deployment targeting &lt;code&gt;kube-system&lt;/code&gt; to their gitops repo, the ArgoCD application controller rejects it at sync time. No workaround.&lt;/p&gt;
&lt;h3&gt;The Default Project Gap&lt;/h3&gt;
&lt;p&gt;The tenant AppProject was configured correctly. But I noticed the &lt;code&gt;default&lt;/code&gt; project (which ships with ArgoCD) was wide open: &lt;code&gt;sourceRepos: ['*']&lt;/code&gt;, destinations unrestricted, full cluster-scoped access.&lt;/p&gt;
&lt;p&gt;Both the root Application and the tenant onboarding Application use &lt;code&gt;project: default&lt;/code&gt;, which is fine since they source from the platform-controlled repo. The problem is lateral: if anyone creates an Application under &lt;code&gt;project: default&lt;/code&gt; (misconfiguration, a future templating mistake, direct &lt;code&gt;kubectl&lt;/code&gt; access), they get unrestricted access to the entire cluster from any repo.&lt;/p&gt;
&lt;p&gt;The fix: lock the default project's &lt;code&gt;sourceRepos&lt;/code&gt; to the platform repo only.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;AppProject&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;default&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;argocd&lt;/span&gt;
&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Platform-managed resources only&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;sourceRepos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://github.com/example-org/platform-gitops.git&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;destinations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://kubernetes.default.svc&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;*&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;clusterResourceWhitelist&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;*&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;*&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This still allows all destinations and cluster-scoped resources (the platform needs full control), but any Application in &lt;code&gt;project: default&lt;/code&gt; can only pull from a repository that Nubosas controls. Defense in depth for a gap that might never be exploited.&lt;/p&gt;
&lt;p&gt;This lives in &lt;code&gt;platform/default-project.yaml&lt;/code&gt; and gets deployed by a &lt;code&gt;platform-config&lt;/code&gt; Application that the root app picks up automatically. The pattern scales: any future platform-level resource (NetworkPolicies, additional AppProject restrictions) goes into &lt;code&gt;platform/&lt;/code&gt; and is synced through the same path.&lt;/p&gt;
&lt;p&gt;There's more to multi-tenancy security (Cilium NetworkPolicies for runtime network isolation, ArgoCD RBAC if tenants ever get UI access), but those are separate concerns for a future post.&lt;/p&gt;
&lt;h2&gt;The Full Dependency Chain&lt;/h2&gt;
&lt;p&gt;From Day 1 through Day 2, the dependency chain is explicit:&lt;/p&gt;
&lt;pre class="mermaid"&gt;flowchart TD
    A["Talos machine secrets&lt;br/&gt;+ node configs"] --&gt; B["talos_machine_bootstrap&lt;br/&gt;(etcd init)"]
    B --&gt; C["talos_cluster_kubeconfig&lt;br/&gt;(client certs)"]
    C --&gt; D["helm_release.cilium&lt;br/&gt;(CNI, pods can schedule)"]
    D -. "Day 2 starts" .-&gt; E["helm_release.argocd"]
    E --&gt; F1["kubernetes_secret&lt;br/&gt;argocd_repo_* (repo creds)"]
    E --&gt; F2["kubernetes_secret&lt;br/&gt;ksops_age_key"]
    E --&gt; F3["local_sensitive_file&lt;br/&gt;kubeconfig"]
    F1 --&gt; G["null_resource.argocd_root_app&lt;br/&gt;(kubectl apply root-app.yaml)"]
    F2 --&gt; G
    F3 --&gt; G
    G -. "OpenTofu stops" .-&gt; H["root Application&lt;br/&gt;(auto-syncs apps/)"]
    H --&gt; I1["platform-config&lt;br/&gt;(locked default project)"]
    H --&gt; I2["tenant-onboarding&lt;br/&gt;(namespace + quota + AppProject)"]
    H --&gt; I3["tenant Application&lt;br/&gt;(tenant-gitops workloads)"]

    classDef day1 fill:#e8f0ff,stroke:#0063FF,color:#0A354F
    classDef day2 fill:#fff7e6,stroke:#C64350,color:#0A354F
    class A,B,C,D day1
    class E,F1,F2,F3,G,H,I1,I2,I3 day2&lt;/pre&gt;
&lt;p&gt;One &lt;code&gt;tofu apply&lt;/code&gt; gets you from Talos API to a fully operational GitOps loop. After that, OpenTofu manages the cluster infrastructure (Talos configs, Cilium, ArgoCD itself) and ArgoCD manages everything that runs on top of it.&lt;/p&gt;
&lt;p&gt;The boundary is the same principle from Day 0, applied one layer up: each tool owns what it can observe and control through an API. OpenTofu talks to the Talos and Kubernetes APIs. ArgoCD talks to git and the Kubernetes API. They don't overlap.&lt;/p&gt;
&lt;h2&gt;What's Next&lt;/h2&gt;
&lt;p&gt;The GitOps loop is running. Workloads can deploy. But the cluster still needs storage for stateful applications, which is where Rook-Ceph comes in. That's the subject of the next post.&lt;/p&gt;</content><category term="Tech"/><category term="bare-metal"/><category term="opentofu"/><category term="argocd"/><category term="gitops"/><category term="kubernetes"/><category term="sops"/><category term="ksops"/><category term="age"/><category term="multi-tenancy"/><category term="talos"/><category term="hetzner"/></entry><entry><title>Day 1: Taking Declarative Control of a Running Talos Cluster with OpenTofu</title><link href="https://carlosperello.blog/day-1-taking-declarative-control-of-a-running-talos-cluster-with-opentofu/" rel="alternate"/><published>2026-04-01T10:00:00+02:00</published><updated>2026-04-01T10:00:00+02:00</updated><author><name/></author><id>tag:carlosperello.blog,2026-04-01:/day-1-taking-declarative-control-of-a-running-talos-cluster-with-opentofu/</id><summary type="html">&lt;p&gt;The &lt;a href="https://carlosperello.blog/the-day-0-bootstrapper-replacing-the-opentofu-antipattern-with-a-proper-tool/"&gt;Day 0 bootstrapper&lt;/a&gt; exits the moment Talos starts listening on port 50000. The server has an OS. It's waiting for configuration. That's where OpenTofu takes over.&lt;/p&gt;
&lt;p&gt;This post covers &lt;code&gt;02-kubernetes/&lt;/code&gt;, the layer that codifies the full cluster: 13 nodes across four roles, machine config generation and application, and Cilium …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The &lt;a href="https://carlosperello.blog/the-day-0-bootstrapper-replacing-the-opentofu-antipattern-with-a-proper-tool/"&gt;Day 0 bootstrapper&lt;/a&gt; exits the moment Talos starts listening on port 50000. The server has an OS. It's waiting for configuration. That's where OpenTofu takes over.&lt;/p&gt;
&lt;p&gt;This post covers &lt;code&gt;02-kubernetes/&lt;/code&gt;, the layer that codifies the full cluster: 13 nodes across four roles, machine config generation and application, and Cilium managed as a Helm release. This is where the &lt;code&gt;siderolabs/talos&lt;/code&gt; provider earns its place — proper declarative API, proper state, proper drift detection.&lt;/p&gt;
&lt;h2&gt;The Boundary&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;siderolabs/talos&lt;/code&gt; provider talks directly to the Talos API on port 50000. Unlike the &lt;code&gt;null_resource&lt;/code&gt; approach from &lt;code&gt;01-hardware/&lt;/code&gt;, it can actually observe state: read running configuration, apply patches, track whether a node needs updating. The moment a server is in maintenance mode waiting for its machine config, this layer owns it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;talos_machine_configuration_apply&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;node&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;for_each&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;local.nodes&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;client_configuration&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;talos_machine_secrets.cluster.client_configuration&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;machine_configuration_input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;data.talos_machine_configuration.node[each.key].machine_configuration&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;each.value.ip&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;each.value.role&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;controlplane&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;local.first_controlplane_ip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;null&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;apply_mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;no_reboot&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;apply_mode = "no_reboot"&lt;/code&gt; is important for a running production cluster: config changes are applied live, and a change that would require a reboot makes the apply fail instead of rebooting the node. Reboot-requiring changes have to be rolled out deliberately (for example with &lt;code&gt;staged&lt;/code&gt; mode) in a maintenance window. Safe to run &lt;code&gt;tofu apply&lt;/code&gt; without taking nodes down.&lt;/p&gt;
&lt;h2&gt;Node Roles and Patch Composition&lt;/h2&gt;
&lt;p&gt;The cluster has 13 nodes across four roles: control planes, workers, storage, and monitoring. Rather than one config file per node, machine configurations are built by composing patches:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="na"&gt;config_patches&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;local.patch_common&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="c1"&gt;                          # every node&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;each.value.role&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;controlplane&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;local.patch_controlplane&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;each.value.role&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;monitoring&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;local.patch_monitoring_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;local.patch_monitoring_extras&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;each.value.role&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;storage&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;local.patch_storage_extras&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;local.patch_storage_label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;local.patch_per_node[each.key&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;&lt;span class="c1"&gt;              # hostname, IP, gateway, disk&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The common patch sets cluster-wide settings: cluster name, pod and service subnets, CNI disabled (Cilium is installed separately), kube-proxy disabled (Cilium handles it). Role patches add what's specific to that role. The per-node patch sets hostname, static IP, gateway, and install disk.&lt;/p&gt;
&lt;p&gt;This keeps the config readable and avoids repeating the same values across 13 node definitions.&lt;/p&gt;
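&lt;p&gt;For illustration, the common patch described above might look roughly like the following. The cluster name and CIDRs are placeholders, not the real values; the field names follow the Talos machine configuration schema.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;locals {
  # Placeholder values: substitute the cluster's real name and subnets.
  patch_common = yamlencode({
    cluster = {
      clusterName = &amp;quot;example-cluster&amp;quot;
      network = {
        podSubnets     = [&amp;quot;10.244.0.0/16&amp;quot;]
        serviceSubnets = [&amp;quot;10.96.0.0/12&amp;quot;]
        cni            = { name = &amp;quot;none&amp;quot; } # Cilium is installed as a Helm release
      }
      proxy = { disabled = true } # Cilium replaces kube-proxy
    }
  })
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because every patch is a plain YAML string, the role and per-node patches can override or extend these keys without touching the common one.&lt;/p&gt;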
&lt;h2&gt;Non-Control-Plane Node Routing&lt;/h2&gt;
&lt;p&gt;Worker, storage, and monitoring nodes are not directly reachable on port 50000 from outside the cluster network. The provider needs an entry point. The solution is the same as &lt;code&gt;talosctl -e &amp;lt;cp-ip&amp;gt; -n &amp;lt;node-ip&amp;gt;&lt;/code&gt;: route through the first control plane.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;each.value.role&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;controlplane&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;local.first_controlplane_ip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;null&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Control planes are reached directly. Everyone else goes through the first control plane as a relay. One line, handles the whole fleet.&lt;/p&gt;
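&lt;p&gt;The definition of &lt;code&gt;local.first_controlplane_ip&lt;/code&gt; isn't shown here. One way it could be derived from the node map is sketched below, assuming &lt;code&gt;local.nodes&lt;/code&gt; is keyed by hostname and carries &lt;code&gt;role&lt;/code&gt; and &lt;code&gt;ip&lt;/code&gt;; &lt;code&gt;controlplane_names&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;locals {
  # Hypothetical helper: control plane hostnames, sorted so the pick is stable.
  controlplane_names = sort([
    for name, node in local.nodes : name if node.role == &amp;quot;controlplane&amp;quot;
  ])

  # Relay endpoint used for every non-control-plane node.
  first_controlplane_ip = local.nodes[local.controlplane_names[0]].ip
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Sorting the names keeps the relay choice deterministic between runs, so the endpoint doesn't flip when the map is edited.&lt;/p&gt;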
&lt;h2&gt;NVMe Enumeration Is Not Deterministic&lt;/h2&gt;
&lt;p&gt;This caught me on one of the worker nodes. Talos detects the install disk by PCI bus discovery order, which is not guaranteed to be consistent across reboots or between physically identical servers. Two nodes with the same Samsung 512 GB NVMe drives had their disks enumerate in different orders.&lt;/p&gt;
&lt;p&gt;The only safe way to confirm which device holds the STATE partition on a running node is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;talosctl get disks -n &amp;lt;node-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For one worker, the STATE partition had landed on &lt;code&gt;/dev/nvme1n1&lt;/code&gt;, not &lt;code&gt;/dev/nvme0n1&lt;/code&gt;. One node also has a SATA disk rather than NVMe: a different hardware generation that ended up in the same cluster. Both are accounted for explicitly in the node metadata:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;frodo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;worker&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;ip&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;local.workers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;frodo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;gateway&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;...&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;disk&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/dev/sda&amp;quot;&lt;/span&gt;&lt;span class="c1"&gt;  # SATA -- the only non-NVMe node&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Assuming all nodes use the same disk path is a silent failure waiting to happen on the next upgrade.&lt;/p&gt;
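&lt;p&gt;A minimal sketch of how the per-node patch can consume that metadata, so the install disk always comes from the explicit &lt;code&gt;disk&lt;/code&gt; attribute rather than a shared default. It assumes the map keys double as hostnames and leaves out the static IP and gateway settings, which depend on the interface layout.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;locals {
  patch_per_node = {
    for name, node in local.nodes : name =&amp;gt; yamlencode({
      machine = {
        install = { disk = node.disk } # explicit per node, never assumed
        network = { hostname = name }  # assumes map keys are hostnames
      }
    })
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A node without a &lt;code&gt;disk&lt;/code&gt; attribute then fails at plan time instead of silently installing to the wrong device.&lt;/p&gt;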
&lt;h2&gt;Storage and Monitoring Node Specialization&lt;/h2&gt;
&lt;p&gt;Storage and monitoring nodes use Talos volume management to carve disk space beyond the system partition.&lt;/p&gt;
&lt;p&gt;Storage nodes cap the EPHEMERAL volume at 50 GiB and claim the remaining space as a dedicated partition for &lt;a href="https://garagehq.deuxfleurs.fr/"&gt;GarageHQ&lt;/a&gt; data (an S3-compatible object store that handles its own HA across a 3-node cluster, independent of Rook-Ceph):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;v1alpha1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;VolumeConfig&lt;/span&gt;
&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;EPHEMERAL&lt;/span&gt;
&lt;span class="nt"&gt;provisioning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;diskSelector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;system_disk&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;maxSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;50GiB&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;v1alpha1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;UserVolumeConfig&lt;/span&gt;
&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;garage-data&lt;/span&gt;
&lt;span class="nt"&gt;provisioning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;diskSelector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;system_disk&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;minSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;100GB&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;grow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Monitoring nodes get a dedicated local storage volume carved from a secondary disk, plus a taint and label applied via machine config to keep general workloads off them:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="na"&gt;patch_monitoring_node&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;yamlencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;machine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nb"&gt;nodeLabels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;pool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;monitoring&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nb"&gt;nodeTaints&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;dedicated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;monitoring:NoSchedule&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Both are machine config concerns, not Kubernetes concerns — the right layer for anything that affects how the node boots and presents itself.&lt;/p&gt;
&lt;h2&gt;Cilium: From Inline Manifest to Helm&lt;/h2&gt;
&lt;p&gt;Cilium was originally deployed as an inline manifest in the Talos machine config, which is how Talos bootstraps CNI before the Kubernetes API is available. It worked, but it meant Cilium was outside any management layer: no Helm release, no upgrade path, no values file.&lt;/p&gt;
&lt;p&gt;Moving it to Helm under OpenTofu ownership required pre-annotating the existing resources with Helm ownership metadata before installing; otherwise Helm refuses to take over resources it didn't create:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;helm&lt;span class="w"&gt; &lt;/span&gt;template&lt;span class="w"&gt; &lt;/span&gt;cilium&lt;span class="w"&gt; &lt;/span&gt;cilium/cilium&lt;span class="w"&gt; &lt;/span&gt;--version&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.16.3&lt;span class="w"&gt; &lt;/span&gt;-n&lt;span class="w"&gt; &lt;/span&gt;kube-system&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;annotate&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;meta.helm.sh/release-name&lt;span class="o"&gt;=&lt;/span&gt;cilium&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;meta.helm.sh/release-namespace&lt;span class="o"&gt;=&lt;/span&gt;kube-system&lt;span class="w"&gt; &lt;/span&gt;--overwrite
helm&lt;span class="w"&gt; &lt;/span&gt;template&lt;span class="w"&gt; &lt;/span&gt;cilium&lt;span class="w"&gt; &lt;/span&gt;cilium/cilium&lt;span class="w"&gt; &lt;/span&gt;--version&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.16.3&lt;span class="w"&gt; &lt;/span&gt;-n&lt;span class="w"&gt; &lt;/span&gt;kube-system&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;label&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;app.kubernetes.io/managed-by&lt;span class="o"&gt;=&lt;/span&gt;Helm&lt;span class="w"&gt; &lt;/span&gt;--overwrite
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
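&lt;p&gt;The check those two commands satisfy can be sketched in Python. This is an illustration of the idea, not Helm's actual code: &lt;code&gt;helm_can_adopt&lt;/code&gt; and the dict layout are hypothetical, and only the annotation and label keys come from the commands above.&lt;/p&gt;

```python
# Illustrative sketch only: mirrors the ownership metadata Helm looks for
# before adopting a resource it did not create. helm_can_adopt is a
# hypothetical helper, not part of Helm or kubectl.

def helm_can_adopt(resource: dict, release: str, namespace: str) -> bool:
    """Return True if the resource carries matching Helm ownership metadata."""
    meta = resource.get("metadata", {})
    labels = meta.get("labels") or {}
    annotations = meta.get("annotations") or {}
    return (
        labels.get("app.kubernetes.io/managed-by") == "Helm"
        and annotations.get("meta.helm.sh/release-name") == release
        and annotations.get("meta.helm.sh/release-namespace") == namespace
    )


# A DaemonSet as the inline manifest left it: no ownership metadata yet.
inline = {"kind": "DaemonSet", "metadata": {"name": "cilium"}}

# The same object after the annotate and label commands.
adopted = {
    "kind": "DaemonSet",
    "metadata": {
        "name": "cilium",
        "labels": {"app.kubernetes.io/managed-by": "Helm"},
        "annotations": {
            "meta.helm.sh/release-name": "cilium",
            "meta.helm.sh/release-namespace": "kube-system",
        },
    },
}

print(helm_can_adopt(inline, "cilium", "kube-system"))
print(helm_can_adopt(adopted, "cilium", "kube-system"))
```

&lt;p&gt;All three pieces of metadata have to be present and match the release; a resource with only the label, or with a different release name, is still treated as foreign.&lt;/p&gt;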
&lt;p&gt;Then there are two Talos-specific settings that the default Cilium Helm values get wrong.&lt;/p&gt;
&lt;p&gt;First, Talos already mounts cgroupv2 at &lt;code&gt;/sys/fs/cgroup&lt;/code&gt;, not at &lt;code&gt;/run/cilium/cgroupv2&lt;/code&gt;, where the Cilium chart mounts its own copy by default. Disable Cilium's auto-mount and point it to the right location:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;cgroup&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;autoMount&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;hostRoot&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/sys/fs/cgroup&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Second, and less obvious: Talos's containerd-shim does not include &lt;code&gt;SYS_MODULE&lt;/code&gt; in its capability bounding set. When the default Cilium values request it, &lt;code&gt;runc&lt;/code&gt; returns &lt;code&gt;EPERM&lt;/code&gt; and the init container crashes. The fix is to explicitly override the capability lists and drop &lt;code&gt;SYS_MODULE&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;securityContext&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;capabilities&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;ciliumAgent&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CHOWN&amp;quot;, &amp;quot;KILL&amp;quot;, &amp;quot;NET_ADMIN&amp;quot;, &amp;quot;NET_RAW&amp;quot;, &amp;quot;IPC_LOCK&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;SYS_ADMIN&amp;quot;, &amp;quot;SYS_RESOURCE&amp;quot;, &amp;quot;DAC_OVERRIDE&amp;quot;, &amp;quot;FOWNER&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;SETGID&amp;quot;, &amp;quot;SETUID&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;cleanCiliumState&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;NET_ADMIN&amp;quot;, &amp;quot;SYS_ADMIN&amp;quot;, &amp;quot;SYS_RESOURCE&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Neither of these is documented anywhere obvious. Both are required. The pods crash silently until you read the container logs and work backwards from &lt;code&gt;unable to apply caps: operation not permitted&lt;/code&gt;.&lt;/p&gt;
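&lt;p&gt;To avoid hand-editing those lists on every chart upgrade, the override can be derived by filtering. The sketch below is a minimal illustration in Python: the &lt;code&gt;assumed_defaults&lt;/code&gt; lists are an assumption standing in for the chart defaults and should be checked against the &lt;code&gt;values.yaml&lt;/code&gt; of the chart version in use, and &lt;code&gt;drop_blocked&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# Sketch: derive Talos-safe capability lists by filtering an assumed set of
# chart defaults. The lists in assumed_defaults are an assumption made for
# illustration; check the values.yaml of your Cilium chart version.

BLOCKED_ON_TALOS = {"SYS_MODULE"}

assumed_defaults = {
    "ciliumAgent": [
        "CHOWN", "KILL", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "SYS_MODULE",
        "SYS_ADMIN", "SYS_RESOURCE", "DAC_OVERRIDE", "FOWNER", "SETGID", "SETUID",
    ],
    "cleanCiliumState": ["NET_ADMIN", "SYS_MODULE", "SYS_ADMIN", "SYS_RESOURCE"],
}


def drop_blocked(capabilities: dict, blocked: set) -> dict:
    """Return a copy of the capability lists without the blocked entries."""
    return {
        name: [cap for cap in caps if cap not in blocked]
        for name, caps in capabilities.items()
    }


talos_caps = drop_blocked(assumed_defaults, BLOCKED_ON_TALOS)
for name, caps in talos_caps.items():
    print(name, caps)
```

&lt;p&gt;Under that assumption the result matches the two lists in the HCL snippet above, with every other capability left in its original order.&lt;/p&gt;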
&lt;h2&gt;ArgoCD: The Last Piece of Day 1&lt;/h2&gt;
&lt;p&gt;ArgoCD lives in this layer for the same reason Cilium does: it's infrastructure, not a workload. Upgrading it is a version bump and a &lt;code&gt;tofu apply&lt;/code&gt;, not an ArgoCD Application pointing at itself. Keeping both here means one tool owns all cluster-level components and one &lt;code&gt;tofu apply&lt;/code&gt; gets you from Talos API to GitOps-ready.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;helm_release&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;argocd&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;argocd&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;https://argoproj.github.io/argo-helm&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;argo-cd&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;9.4.15&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;argocd&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;create_namespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;true&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;helm_release.cilium&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;yamlencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nb"&gt;global&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="na"&gt;domain&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;argocd.example.com&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nb"&gt;configs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nb"&gt;params&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;server.insecure&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;})]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;server.insecure = true&lt;/code&gt; is intentional: TLS terminates at the ingress layer, not inside ArgoCD. The ingress itself comes as a Day 2 Application once ArgoCD is running.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;depends_on = [helm_release.cilium]&lt;/code&gt; enforces ordering: Cilium must be running before any pod can schedule. Without it, ArgoCD pods could land on nodes before the CNI is ready and get stuck in a networking failure that requires manual intervention.&lt;/p&gt;
&lt;h2&gt;Secrets in Git: SOPS and Age&lt;/h2&gt;
&lt;p&gt;Before handing off to ArgoCD, there's a practical problem: ArgoCD will need credentials. Database passwords, API tokens, private keys. Those have to live somewhere, and "somewhere" should be the same git repo as everything else.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;carlpett/sops&lt;/code&gt; provider was declared in &lt;code&gt;versions.tf&lt;/code&gt; from the initial commit, alongside Talos and Helm. No secrets were needed at that point (the Talos provider generates its own machine secrets), but the encryption strategy was designed for the full Day 0 through Day 2 lifecycle.&lt;/p&gt;
&lt;p&gt;The choice is &lt;a href="https://github.com/getsops/sops"&gt;SOPS&lt;/a&gt; with &lt;a href="https://github.com/FiloSottile/age"&gt;Age&lt;/a&gt; keys. Not Vault, not sealed-secrets. The reasoning is simple: SOPS is a local tool. There's no server to run, no controller that must be deployed before you can create your first secret. The decryption key is a single file. Encrypted values live in the same repo as the code that uses them, so &lt;code&gt;git diff&lt;/code&gt; shows which secrets changed and &lt;code&gt;git blame&lt;/code&gt; shows who changed them and when.&lt;/p&gt;
&lt;h3&gt;Two Keys, Two Purposes&lt;/h3&gt;
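&lt;p&gt;The &lt;code&gt;.sops.yaml&lt;/code&gt; at the repo root defines two creation rules. SOPS applies the first rule whose &lt;code&gt;path_regex&lt;/code&gt; matches, so the narrower rule comes first:&lt;/p&gt;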
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;creation_rules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;path_regex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;environments/.*/secrets\.enc\.yaml$&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;&amp;lt;personal-key&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;path_regex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;.*\.enc\.yaml$&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;&amp;lt;personal-key&amp;gt;,&amp;lt;cluster-key&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The first key is mine, used locally during &lt;code&gt;tofu apply&lt;/code&gt; to decrypt infrastructure secrets. The second is a cluster key that ArgoCD uses for runtime decryption in-cluster.&lt;/p&gt;
&lt;p&gt;Infrastructure secrets (matched by the first rule) only need the personal key. Everything else, including GitOps manifests that ArgoCD will consume, gets encrypted to both keys. One &lt;code&gt;sops --encrypt&lt;/code&gt; call, two recipients, and the same file is decryptable both at my terminal and inside the cluster. If the cluster key is compromised, rotate it without touching the personal key.&lt;/p&gt;
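&lt;p&gt;How a path ends up with one recipient or two can be sketched as a first-match lookup. This is a simplified Python model of the rule selection, not SOPS itself: the recipient strings are placeholders like the ones in the excerpt, and the file paths and the &lt;code&gt;recipients_for&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
import re

# Simplified model of creation-rule selection: rules are tried in order and
# the first matching path_regex wins. Recipients are placeholders.
CREATION_RULES = [
    (r"environments/.*/secrets\.enc\.yaml$", ["personal-key"]),
    (r".*\.enc\.yaml$", ["personal-key", "cluster-key"]),
]


def recipients_for(path: str) -> list:
    """Return the recipients of the first rule whose regex matches the path."""
    for pattern, recipients in CREATION_RULES:
        if re.search(pattern, path):
            return recipients
    return []  # no rule matched: nothing to encrypt to


# Hypothetical paths, chosen to hit each branch.
print(recipients_for("environments/prod/secrets.enc.yaml"))
print(recipients_for("gitops/apps/grafana/secret.enc.yaml"))
print(recipients_for("README.md"))
```

&lt;p&gt;Order matters here: with the two rules swapped, the catch-all pattern would match the infrastructure secrets first and they would be encrypted to the cluster key as well.&lt;/p&gt;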
&lt;h3&gt;State Encryption (Separate Concern)&lt;/h3&gt;
&lt;p&gt;The OpenTofu state file is a separate encryption problem. It lives in Hetzner Object Storage (S3 backend) and contains every resource attribute, including computed values like kubeconfig contents. SOPS doesn't cover this; OpenTofu's native &lt;code&gt;encryption&lt;/code&gt; block does:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;encryption&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;key_&lt;/span&gt;&lt;span class="kr"&gt;provider&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;pbkdf2&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;state&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;passphrase&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;var.state_encryption_passphrase&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;aes_gcm&amp;quot; &amp;quot;state&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;key_provider.pbkdf2.state&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;state&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;method.aes_gcm.state&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;PBKDF2 derives a key from a passphrase, and AES-GCM uses that key to encrypt the state at rest. The passphrase comes from an environment variable: a password manager supplies it for local runs, a Kubernetes Secret supplies it on the self-hosted runner.&lt;/p&gt;
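&lt;p&gt;The derivation half of that pipeline is easy to reproduce with the Python standard library. The sketch below only illustrates what "derive a key from a passphrase" means: the iteration count, hash, salt size, and key length are assumptions picked for the example rather than a statement of what OpenTofu does internally, and the AES-GCM step is left out because it needs a third-party package.&lt;/p&gt;

```python
import hashlib
import os

# Sketch of passphrase-based key derivation with PBKDF2, standard library only.
# All parameter values are assumptions chosen for illustration.

def derive_key(passphrase: str, salt: bytes, iterations: int = 600_000,
               key_length: int = 32) -> bytes:
    """Derive a fixed-length key from a passphrase and a salt."""
    return hashlib.pbkdf2_hmac(
        "sha512", passphrase.encode("utf-8"), salt, iterations, dklen=key_length
    )


salt = os.urandom(32)  # random per encryption; stored with the ciphertext, not secret
key = derive_key("correct horse battery staple", salt)
print(len(key))  # 32 bytes, the size of an AES-256 key
```

&lt;p&gt;The same passphrase and salt always produce the same key, which is what lets a later run decrypt the state; a different salt produces an unrelated key even with the same passphrase.&lt;/p&gt;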
&lt;p&gt;Two encryption layers, two different scopes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SOPS&lt;/strong&gt;: encrypts secret &lt;em&gt;values&lt;/em&gt; in git (credentials, tokens, private keys)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;State encryption&lt;/strong&gt;: encrypts the &lt;em&gt;entire state file&lt;/em&gt; at rest (which contains all resource attributes, including things the Talos provider computes like kubeconfig contents)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Day 1 sets up both. Day 2 activates SOPS at runtime when ArgoCD starts decrypting secrets through KSOPS.&lt;/p&gt;
&lt;h2&gt;What This Layer Doesn't Touch&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;02-kubernetes/&lt;/code&gt; stops once ArgoCD is running and has the credentials and encryption keys it needs to pull its first repo. The bootstrap — repository secrets, the KSOPS decryption setup, and the root Application that kicks off the GitOps loop — is the bridge between Day 1 and Day 2, and the subject of &lt;a href="https://carlosperello.blog/day-2-gitops-handoff-bootstrapping-argocd-on-talos-with-opentofu/"&gt;the next post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The handoff point is clean: OpenTofu owns everything needed to run the cluster, ArgoCD owns everything that runs on top of it. The same boundary principle from Day 0, applied one layer up.&lt;/p&gt;</content><category term="Tech"/><category term="bare-metal"/><category term="opentofu"/><category term="talos"/><category term="kubernetes"/><category term="hetzner"/><category term="cilium"/><category term="sops"/><category term="age"/><category term="encryption"/></entry><entry><title>The Day 0 Bootstrapper: Replacing the OpenTofu Antipattern with a Proper Tool</title><link href="https://carlosperello.blog/the-day-0-bootstrapper-replacing-the-opentofu-antipattern-with-a-proper-tool/" rel="alternate"/><published>2026-03-24T10:00:00+01:00</published><updated>2026-03-24T10:00:00+01:00</updated><author><name/></author><id>tag:carlosperello.blog,2026-03-24:/the-day-0-bootstrapper-replacing-the-opentofu-antipattern-with-a-proper-tool/</id><summary type="html">&lt;p&gt;In the &lt;a href="https://carlosperello.blog/the-bare-metal-antipattern-why-forcing-opentofu-to-bootstrap-os-is-a-trap/"&gt;previous post&lt;/a&gt; I showed why using OpenTofu to bootstrap a bare-metal OS is the wrong tool for the job. The short version: the physical bootstrap sequence is imperative and one-directional: rescue mode, disk flash, reboot. OpenTofu is designed for declarative state backed by observable APIs. Forcing the two …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In the &lt;a href="https://carlosperello.blog/the-bare-metal-antipattern-why-forcing-opentofu-to-bootstrap-os-is-a-trap/"&gt;previous post&lt;/a&gt; I showed why using OpenTofu to bootstrap a bare-metal OS is the wrong tool for the job. The short version: the physical bootstrap sequence is imperative and one-directional: rescue mode, disk flash, reboot. OpenTofu is designed for declarative state backed by observable APIs. 
Forcing the two together with &lt;code&gt;null_resource&lt;/code&gt; provisioners gives you something that works once and silently breaks everything after.&lt;/p&gt;
&lt;p&gt;The clean boundary is the moment Talos starts listening on port 50000. From there, the &lt;code&gt;siderolabs/talos&lt;/code&gt; provider gives you a real declarative API for machine configuration and cluster bootstrapping. Everything before that port opens belongs in a different kind of tool entirely.&lt;/p&gt;
&lt;p&gt;This post is about that tool.&lt;/p&gt;
&lt;h2&gt;What It Needs to Do&lt;/h2&gt;
&lt;p&gt;The physical bootstrap sequence for a Hetzner dedicated server is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Activate rescue mode via the Robot API&lt;/li&gt;
&lt;li&gt;Reboot the server into rescue&lt;/li&gt;
&lt;li&gt;Inspect disks and confirm before wiping&lt;/li&gt;
&lt;li&gt;Flash Talos to the target disk&lt;/li&gt;
&lt;li&gt;Reboot into Talos&lt;/li&gt;
&lt;li&gt;Wait for the Talos API to answer on port 50000&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It also needs to handle the case where a node is being reprovisioned: draining it from the cluster, wiping Talos state, and leaving it in maintenance mode for OpenTofu to re-apply the machine config.&lt;/p&gt;
&lt;p&gt;None of this is declarative. It's a sequential script. So that's what it is: &lt;code&gt;tools/talos-bootstrap.py&lt;/code&gt;, a Python script using the &lt;a href="https://pypi.org/project/hetzner/"&gt;hetzner&lt;/a&gt; Robot API client, with two modes:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Provision a fresh server
.venv/bin/python tools/talos-bootstrap.py --server-id 12345

# Drain, wipe, and re-provision a running node
.venv/bin/python tools/talos-bootstrap.py --server-id 12345 --reset
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A full provision run looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ .venv/bin/python tools/talos-bootstrap.py --server-id 12345

-- talos-bootstrap  server=12345  (203.0.113.189) --------------------

  [ok]   (203.0.113.189)

* Rescue mode
  - Rescue was already active -- re-activated to refresh password.
  - Password starts with &amp;#39;Hn...&amp;#39;
  - Triggering hard reset...
  [ok] Reset triggered.

* Waiting for rescue SSH
  - Waiting for server to go down...
  - Waiting for SSH (rescue) .... [ok]

* Disk inspection

  Available disks:
    [0] /dev/nvme0n1  476.9G
    [1] /dev/nvme1n1  476.9G
  Select disk [0]: 1
  - Target: /dev/nvme1n1  (476.9G)
  [ok] /dev/nvme1n1 appears empty -- proceeding without confirmation.

* Flashing Talos v1.11.3 to /dev/nvme1n1
  - Source: https://github.com/siderolabs/talos/releases/download/v1.11.3/metal-amd64.raw.zst
  [ok] Written to /dev/nvme1n1.
  - Rebooting into Talos...

* Waiting for Talos API
  - Waiting for Talos API ... [ok]

-- done --------------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If any disk on the server has existing data, the confirmation step kicks in before anything is touched:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;* Disk inspection

  Available disks:
    [0] /dev/nvme0n1  476.9G
    [1] /dev/nvme1n1  476.9G
  Select disk [0]: 1
  - Target: /dev/nvme1n1  (476.9G)

  [!] Existing data found on this server:
    nvme0n1  476.9G
      nvme0n1p2  1M
      nvme0n1p3  2G  [xfs]  label=BOOT
      nvme0n1p4  1M
    nvme1n1  476.9G
      nvme1n1p2  1M
      nvme1n1p3  2G  [xfs]  label=BOOT
      nvme1n1p4  1M

  Flashing Talos to /dev/nvme1n1 on a server with existing data.
  All data on /dev/nvme1n1 will be PERMANENTLY DESTROYED.

  Type the server ID (12345) to confirm:

[!!] Confirmation failed -- aborting.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
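&lt;p&gt;The gate behind that prompt can be very small. A minimal sketch, assuming the check is an exact match on the server ID; &lt;code&gt;confirm_wipe&lt;/code&gt; is a hypothetical name and the real script may be structured differently:&lt;/p&gt;

```python
# Sketch of the confirmation gate from the transcript above.
# confirm_wipe is a hypothetical name; the real script may differ.

def confirm_wipe(server_id: int, typed: str) -> bool:
    """Accept only an exact match of the server ID; anything else aborts."""
    return typed.strip() == str(server_id)


print(confirm_wipe(12345, "12345"))
print(confirm_wipe(12345, ""))
print(confirm_wipe(12345, "yes"))
```

&lt;p&gt;Asking for the server ID instead of a generic "yes" forces a second look at which machine is about to be wiped, and an empty answer (just pressing Enter) fails closed.&lt;/p&gt;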
&lt;h2&gt;Design Decisions Worth Explaining&lt;/h2&gt;
&lt;h3&gt;Idempotency at the entry point&lt;/h3&gt;
&lt;p&gt;The first thing the provision flow does is check whether Talos is already answering on port 50000. If it is, there's nothing to do:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;port_open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Talos API already answering on port 50000, nothing to do.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;  (Use --reset to drain and reprovision this node.)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's not full idempotency: if the server is mid-flash with no OS responding anywhere, the script will start from rescue again. But for the common case (re-running after a successful provision, checking a node that's already up), it exits cleanly rather than blowing up a running server.&lt;/p&gt;
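&lt;p&gt;The &lt;code&gt;port_open&lt;/code&gt; helper is not shown in this post. A minimal sketch of how such a probe could look, using only the standard library (the signature and the default timeout are assumptions):&lt;/p&gt;

```python
import socket

# Sketch of a TCP reachability probe. The post calls a helper named
# port_open; this body is an assumption about how it could look.

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

&lt;p&gt;A successful connect only proves that something is listening on the port, not that it is a healthy Talos API. For an early-exit check that is enough, since the worst case is skipping a provision that can be forced with &lt;code&gt;--reset&lt;/code&gt;.&lt;/p&gt;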
&lt;h3&gt;Always re-activate rescue to get a fresh password&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;hetzner&lt;/code&gt; package returns a cached rescue object. If rescue was activated in a previous session, &lt;code&gt;server.rescue.password&lt;/code&gt; may return the password from that earlier activation, not necessarily the one the server actually booted with. Using a stale password means SSH authentication fails immediately (rc=5, permission denied).&lt;/p&gt;
&lt;p&gt;The fix is to always call &lt;code&gt;activate()&lt;/code&gt; regardless of whether rescue is already marked as active:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;already_active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rescue&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt;
&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rescue&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;rescue_password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rescue&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;already_active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Rescue was already active, re-activated to refresh password.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Calling &lt;code&gt;activate()&lt;/code&gt; on an already-active rescue resets it and returns a valid password. It's the only reliable way to get credentials you can actually use.&lt;/p&gt;
&lt;h3&gt;Hard reset, not soft&lt;/h3&gt;
&lt;p&gt;After rescue activation, the server needs to reboot into rescue mode. A soft reset (ACPI shutdown + boot) can hang if the current OS doesn't respond cleanly. For a bootstrap tool, we don't care about graceful shutdown. Hard reset every time.&lt;/p&gt;
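&lt;p&gt;As a rough sketch of what "hard reset every time" looks like at the API level, the snippet below builds (but does not send) a reset request against the Hetzner Robot webservice using only the standard library. It is not the bootstrapper's code, which goes through the &lt;code&gt;hetzner&lt;/code&gt; package. The helper name &lt;code&gt;build_reset_request&lt;/code&gt;, the credentials, and the server ID are placeholders, and &lt;code&gt;hw&lt;/code&gt; as the hardware-reset type is an assumption based on the Robot webservice documentation, so check it before relying on it.&lt;/p&gt;

```python
import base64
import urllib.parse
import urllib.request

ROBOT_API = "https://robot-ws.your-server.de"  # Robot webservice base URL


def build_reset_request(server_id, user, password, reset_type="hw"):
    """Build (but do not send) a Robot reset request.

    reset_type="hw" is assumed to request a hardware reset.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{ROBOT_API}/reset/{server_id}",
        data=urllib.parse.urlencode({"type": reset_type}).encode(),
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


# Placeholder values, for illustration only
req = build_reset_request(123456, "robot-user", "robot-pass")
print(req.get_method(), req.full_url, req.data)
# Sending it would be: urllib.request.urlopen(req, timeout=30)
```

&lt;p&gt;Keeping request construction separate from sending makes the reset type easy to inspect in a dry run before anything touches a real server.&lt;/p&gt;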
&lt;h3&gt;Wait for SSH to go down before waiting for it to come back up&lt;/h3&gt;
&lt;p&gt;After triggering the reset, there's a window where the old OS is still running and port 22 is still open. If you immediately poll for SSH availability, you'll connect to the wrong system:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wait for the old SSH to disappear first&lt;/span&gt;
&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;port_open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server_ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Now wait for rescue SSH&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;wait_for_port&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server_ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;SSH (rescue)&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Timed out waiting for SSH.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Without this, the script connects to the pre-reboot SSH daemon, tries the rescue password against whatever was running before, and fails. The 15-second initial sleep gives the server time to begin shutting down before we start checking.&lt;/p&gt;
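&lt;p&gt;The snippet above calls two helpers, &lt;code&gt;port_open&lt;/code&gt; and &lt;code&gt;wait_for_port&lt;/code&gt;, without showing them. A minimal sketch of what they could look like follows. The signatures are inferred from the calls above, while the bodies, the default timeouts, and the &lt;code&gt;interval&lt;/code&gt; parameter are assumptions rather than the actual implementation.&lt;/p&gt;

```python
import socket
import time


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts
        return False


def wait_for_port(host: str, port: int, timeout: float = 300,
                  label: str = "port", interval: float = 5.0) -> bool:
    """Poll until the port accepts connections; return False once timeout elapses."""
    deadline = time.monotonic() + timeout
    while deadline > time.monotonic():
        if port_open(host, port):
            print(f"{label} is reachable on {host}:{port}")
            return True
        time.sleep(interval)
    return False
```

&lt;p&gt;An open port only proves that something is listening, not which system is answering, which is why the down-then-up sequence matters.&lt;/p&gt;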
&lt;h3&gt;Disk safeguard checks all disks, not just the target&lt;/h3&gt;
&lt;p&gt;The wipe confirmation triggers if any disk on the server has existing data, not just the one selected as the flash target:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;disks_with_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;disks&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;disk_has_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;disks_with_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Existing data found on this server:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;disks_with_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt_disk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  Flashing Talos to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; on a server with existing data.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  Type the server ID (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;) to confirm: &amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Installing on an empty second disk while there's data on the first is still reprovisioning a server with existing data. You should know about it before confirming. The confirmation requires typing the server ID, not just &lt;code&gt;y&lt;/code&gt;: hard to misfire on a prod node.&lt;/p&gt;
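&lt;p&gt;The two rules in this safeguard (any disk with data triggers the prompt, and only the exact server ID passes it) can be sketched as two small functions. The names &lt;code&gt;needs_confirmation&lt;/code&gt; and &lt;code&gt;wipe_confirmed&lt;/code&gt;, and the dictionary shape used for disks, are hypothetical. The real script uses its own &lt;code&gt;disk_has_data&lt;/code&gt; check, which is not shown here.&lt;/p&gt;

```python
def needs_confirmation(disks, has_data) -> bool:
    """True if any disk on the server holds data, whether or not it is the target."""
    return any(has_data(d) for d in disks)


def wipe_confirmed(answer: str, server_number) -> bool:
    """Accept only the exact server ID; "y", "yes" or an empty line are rejected."""
    return answer.strip() == str(server_number)


# Hypothetical disk records: the flash target is empty, the other disk is not
disks = [{"name": "nvme0n1", "partitions": 3}, {"name": "nvme1n1", "partitions": 0}]
has_data = lambda d: d["partitions"] > 0

print(needs_confirmation(disks, has_data))  # True
print(wipe_confirmed("y", 123456))          # False
print(wipe_confirmed(" 123456 ", 123456))   # True
```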
&lt;h2&gt;The Reset Flow&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;--reset&lt;/code&gt; flag handles reprovisioning a node that's still in the cluster. It:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Looks up the Kubernetes node name from the server's IP (or takes it via &lt;code&gt;--node-name&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Drains the node (&lt;code&gt;kubectl drain --ignore-daemonsets --delete-emptydir-data&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Removes it from the cluster (&lt;code&gt;kubectl delete node&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Sends &lt;code&gt;talosctl reset --graceful=false --reboot&lt;/code&gt; to wipe STATE and EPHEMERAL partitions&lt;/li&gt;
&lt;li&gt;Waits for port 50000 to come back up in maintenance mode (Talos running, no machine config applied)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After that, &lt;code&gt;tofu apply&lt;/code&gt; in &lt;code&gt;02-kubernetes/&lt;/code&gt; re-applies the machine configuration and the node rejoins the cluster. The bootstrapper handles the physical lifecycle; OpenTofu handles the declarative configuration. The boundary stays clean.&lt;/p&gt;
&lt;p&gt;If Talos isn't reachable at all (broken node, failed upgrade), the reset flow falls back to the full rescue bootstrap automatically.&lt;/p&gt;
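&lt;p&gt;The middle of that sequence (steps 2 to 4) could be sketched as below. The &lt;code&gt;kubectl&lt;/code&gt; and &lt;code&gt;talosctl reset&lt;/code&gt; flags are the ones listed above. Selecting the target with &lt;code&gt;--nodes&lt;/code&gt;, the function names, and the injectable &lt;code&gt;runner&lt;/code&gt; are assumptions made for illustration. The node lookup, the maintenance-mode wait, and the rescue fallback are left out.&lt;/p&gt;

```python
import shlex
import subprocess


def reset_commands(node_name: str, node_ip: str) -> list[list[str]]:
    """Commands for steps 2-4: drain, delete the node object, wipe and reboot."""
    return [
        ["kubectl", "drain", node_name, "--ignore-daemonsets", "--delete-emptydir-data"],
        ["kubectl", "delete", "node", node_name],
        # --nodes is an assumption about how the target node is selected
        ["talosctl", "reset", "--nodes", node_ip, "--graceful=false", "--reboot"],
    ]


def run_reset(node_name: str, node_ip: str, runner=subprocess.run) -> None:
    """Run the commands in order; check=True stops at the first failure."""
    for cmd in reset_commands(node_name, node_ip):
        print("+", shlex.join(cmd))
        runner(cmd, check=True)
```

&lt;p&gt;Passing a fake &lt;code&gt;runner&lt;/code&gt; gives a dry run that prints the commands without touching the cluster.&lt;/p&gt;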
&lt;h2&gt;What the Script Doesn't Do&lt;/h2&gt;
&lt;p&gt;It doesn't manage Talos upgrades. For a running node, &lt;code&gt;talosctl upgrade&lt;/code&gt; handles version bumps without rescue mode: it downloads the new image and reboots in place. The bootstrapper is only for initial provisioning and full reprovisioning. If you're just upgrading a Talos version, don't reach for this.&lt;/p&gt;
&lt;p&gt;It also doesn't apply machine configuration. That's OpenTofu's job. The script stops the moment Talos is listening on port 50000 (maintenance mode). From there, the &lt;code&gt;siderolabs/talos&lt;/code&gt; provider takes over.&lt;/p&gt;
&lt;h2&gt;The Handoff&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;01-hardware/&lt;/code&gt; directory in the reference repo still exists as an antipattern reference for the previous post, but nothing in it runs in production. The bootstrapper handles Day 0. When it exits, the cluster is exactly one &lt;code&gt;tofu apply&lt;/code&gt; away from a configured, running node.&lt;/p&gt;
&lt;p&gt;That &lt;code&gt;tofu apply&lt;/code&gt;, the &lt;code&gt;02-kubernetes/&lt;/code&gt; layer, is the subject of the &lt;a href="https://carlosperello.blog/day-1-taking-declarative-control-of-a-running-talos-cluster-with-opentofu/"&gt;next post&lt;/a&gt;.&lt;/p&gt;</content><category term="Tech"/><category term="bare-metal"/><category term="talos"/><category term="kubernetes"/><category term="hetzner"/><category term="python"/></entry><entry><title>The Bare-Metal Antipattern: Why Forcing OpenTofu to Bootstrap OS is a Trap</title><link href="https://carlosperello.blog/the-bare-metal-antipattern-why-forcing-opentofu-to-bootstrap-os-is-a-trap/" rel="alternate"/><published>2026-03-11T10:00:00+01:00</published><updated>2026-03-11T10:00:00+01:00</updated><author><name/></author><id>tag:carlosperello.blog,2026-03-11:/the-bare-metal-antipattern-why-forcing-opentofu-to-bootstrap-os-is-a-trap/</id><summary type="html">&lt;p&gt;Coming from a full cloud world where everything is an API, it felt natural to extend OpenTofu to cover our move to bare-metal. When we were automating the &lt;a href="https://www.nubosas.com/success-stories/turisapps-bare-metal-kubernetes-migration/"&gt;Turisapps migration&lt;/a&gt; — moving 850+ custom domains from DigitalOcean/Nomad to a bare-metal Kubernetes cluster on Hetzner — I wanted one tool, one state …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Coming from a full cloud world where everything is an API, it felt natural to extend OpenTofu to cover our move to bare-metal. When we were automating the &lt;a href="https://www.nubosas.com/success-stories/turisapps-bare-metal-kubernetes-migration/"&gt;Turisapps migration&lt;/a&gt; — moving 850+ custom domains from DigitalOcean/Nomad to a bare-metal Kubernetes cluster on Hetzner — I wanted one tool, one state file, one &lt;code&gt;tofu apply&lt;/code&gt; to go from empty server to running cluster.&lt;/p&gt;
&lt;p&gt;The first red flag should have been that Hetzner doesn't even provide an official bare-metal provider — only a Cloud one. We had to rely on a community wrapper (&lt;a href="https://github.com/silenium-dev/terraform-provider-hetzner-robot"&gt;silenium-dev/hetzner-robot&lt;/a&gt;) just to get OpenTofu to talk to the physical servers.&lt;/p&gt;
&lt;p&gt;I ignored that flag. Here's what happened.&lt;/p&gt;
&lt;h2&gt;The Single-Tool Dream&lt;/h2&gt;
&lt;p&gt;The appeal was real. In cloud-land, &lt;code&gt;tofu apply&lt;/code&gt; really does do everything: provision an EC2 instance, attach its EBS volume, configure its security groups, and register it with a load balancer. The entire lifecycle is API calls. If something is wrong, destroy and recreate. State matches reality because the cloud provider enforces it.&lt;/p&gt;
&lt;p&gt;Bare metal is different. The physical server already exists — you've already paid Hetzner for it. There's no "create" API. There's just a server sitting in a rack, waiting to be told what to do.&lt;/p&gt;
&lt;p&gt;But the tooling itch was strong. OpenTofu could talk to Hetzner Robot. OpenTofu could SSH. OpenTofu had &lt;code&gt;null_resource&lt;/code&gt;. Surely we could wire it all together.&lt;/p&gt;
&lt;h2&gt;What "OpenTofu for Everything" Looks Like&lt;/h2&gt;
&lt;p&gt;Here's the actual code from the first attempt:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;hetzner-robot_boot&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;rescue&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;server_id&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;var.server_id&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;operating_system&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;linux&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;architecture&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;64&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;active_profile&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;rescue&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;external&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;rescue_password&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;hetzner-robot_boot.rescue&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;program&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bash&amp;quot;, &amp;quot;-c&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;curl -sf -u &amp;#39;${var.robot_user}:${var.robot_password}&amp;#39; https://robot-ws.your-server.de/boot/${var.server_id}/rescue | python3 -c \&amp;quot;import json,sys; d=json.load(sys.stdin); print(json.dumps({&amp;#39;password&amp;#39;: d[&amp;#39;rescue&amp;#39;][&amp;#39;password&amp;#39;]}))\&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;null_resource&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;reboot_into_rescue&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;hetzner-robot_boot.rescue&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kr"&gt;provisioner&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;local-exec&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt;&lt;span class="dl"&gt;EOT&lt;/span&gt;
&lt;span class="sh"&gt;      curl -s -u &amp;#39;${var.robot_user}:${var.robot_password}&amp;#39; \&lt;/span&gt;
&lt;span class="sh"&gt;        -d &amp;#39;type=sw&amp;#39; \&lt;/span&gt;
&lt;span class="sh"&gt;        https://robot-ws.your-server.de/reset/${var.server_id}&lt;/span&gt;
&lt;span class="sh"&gt;      sleep 60&lt;/span&gt;
&lt;span class="sh"&gt;      until nc -z ${var.server_ip} 22; do echo &amp;#39;Waiting for SSH...&amp;#39;; sleep 10; done&lt;/span&gt;
&lt;span class="dl"&gt;    EOT&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;null_resource&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;flash_talos&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;null_resource.reboot_into_rescue&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nb"&gt;connection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ssh&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;root&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;data.external.rescue_password.result.password&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;var.server_ip&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;10m&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kr"&gt;provisioner&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;remote-exec&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;inline&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;wget -O /tmp/talos.raw.zst https://github.com/siderolabs/talos/releases/download/v1.12.1/metal-amd64.raw.zst&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;zstd -d -c /tmp/talos.raw.zst | dd of=/dev/nvme0n1 bs=4M status=progress&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sync&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;null_resource&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;reboot_into_talos&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;null_resource.flash_talos&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kr"&gt;provisioner&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;local-exec&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt;&lt;span class="dl"&gt;EOT&lt;/span&gt;
&lt;span class="sh"&gt;      curl -s -u &amp;#39;${var.robot_user}:${var.robot_password}&amp;#39; \&lt;/span&gt;
&lt;span class="sh"&gt;        -d &amp;#39;type=sw&amp;#39; \&lt;/span&gt;
&lt;span class="sh"&gt;        https://robot-ws.your-server.de/reset/${var.server_id}&lt;/span&gt;
&lt;span class="dl"&gt;    EOT&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;resource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;&amp;quot;null_resource&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;wait_for_talos&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;null_resource.reboot_into_talos&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kr"&gt;provisioner&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;&amp;quot;local-exec&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;until nc -z ${var.server_ip} 50000; do echo &amp;#39;Waiting for Talos API...&amp;#39;; sleep 10; done&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It works. The first time.&lt;/p&gt;
&lt;p&gt;In practice, getting even that first run to succeed required working around several provider bugs — the community &lt;code&gt;silenium-dev/hetzner-robot&lt;/code&gt; provider returned empty values for both &lt;code&gt;ipv4_address&lt;/code&gt; and &lt;code&gt;password&lt;/code&gt; after rescue activation (a POST vs GET response parsing bug), had no reboot resource, and needed an &lt;code&gt;external&lt;/code&gt; data source with an inline curl+python script just to read back the rescue password. Each workaround added another layer of imperative glue on top of the declarative facade. But eventually:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;hetzner-robot_boot.rescue: Creating...
hetzner-robot_boot.rescue: Creation complete after 1s [id=2944755]
data.external.rescue_password: Reading...
null_resource.reboot_into_rescue: Creating...
null_resource.reboot_into_rescue: Provisioning with &amp;#39;local-exec&amp;#39;...
null_resource.reboot_into_rescue (local-exec): Executing: [&amp;quot;/bin/sh&amp;quot; &amp;quot;-c&amp;quot; &amp;quot;curl -s -u &amp;#39;&amp;lt;robot_user&amp;gt;:&amp;lt;robot_password&amp;gt;&amp;#39; \
  -d &amp;#39;type=sw&amp;#39; \
  https://robot-ws.your-server.de/reset/2944755
  ...&amp;quot;]
data.external.rescue_password: Read complete after 1s [id=-]
null_resource.reboot_into_rescue (local-exec): {&amp;quot;reset&amp;quot;:{&amp;quot;server_ip&amp;quot;:&amp;quot;203.0.113.189&amp;quot;,...,&amp;quot;type&amp;quot;:&amp;quot;sw&amp;quot;}}
null_resource.reboot_into_rescue (local-exec): Waiting for server to come back in rescue mode...
null_resource.reboot_into_rescue (local-exec): Connection to 203.0.113.189 port 22 [tcp/ssh] succeeded!
null_resource.reboot_into_rescue: Creation complete after 1m32s [id=...]
null_resource.flash_talos: Creating...
null_resource.flash_talos: Provisioning with &amp;#39;remote-exec&amp;#39;...
null_resource.flash_talos (remote-exec): Connected!
null_resource.flash_talos (remote-exec): Downloading Talos...
null_resource.flash_talos (remote-exec): /tmp/talos.r 100% 188.71M  111MB/s in 1.7s
null_resource.flash_talos (remote-exec): Flashing NVMe...
null_resource.flash_talos (remote-exec): 4453302272 bytes (4.5 GB, 4.1 GiB) copied, 4.82233 s, 923 MB/s
null_resource.flash_talos: Creation complete after 8s [id=...]
null_resource.reboot_into_talos: Creation complete after 1s [id=...]
null_resource.wait_for_talos (local-exec): Connection to 203.0.113.189 port 50000 [tcp/*] succeeded!
null_resource.wait_for_talos: Creation complete after 55s [id=...]

Apply complete! Resources: 5 added, 0 changed, 0 destroyed.

Outputs:

node_ip = &amp;quot;203.0.113.189&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Server flashed. Talos running. &lt;code&gt;tofu apply&lt;/code&gt; succeeded.&lt;/p&gt;
&lt;h2&gt;Why It's a Trap&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;It's not idempotent.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Run &lt;code&gt;tofu apply&lt;/code&gt; again on a healthy, running production server. As long as &lt;code&gt;null_resource.flash_talos&lt;/code&gt; sits in state with unchanged inputs, OpenTofu skips it. But nothing in the provisioner checks what is on the disk: if the resource is ever tainted, replaced, or missing from state, the next apply re-runs it, downloads Talos again, &lt;code&gt;dd&lt;/code&gt;s over the NVMe, and reboots. Your production server is gone.&lt;/p&gt;
&lt;p&gt;The only escape hatch is adding &lt;code&gt;lifecycle { ignore_changes = all }&lt;/code&gt;. But at that point you've told OpenTofu to permanently ignore this resource. You've neutered the tool. You're no longer using infrastructure-as-code; you're using infrastructure-as-a-script-you-ran-once that happens to live in a &lt;code&gt;.tf&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;State and reality are already decoupled.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenTofu state tracks API resource attributes. But there's no API to ask a physical server "what OS are you running?" before it's booted. The &lt;code&gt;null_resource&lt;/code&gt; in state just records that the provisioner ran successfully. It has no idea if &lt;code&gt;/dev/nvme0n1&lt;/code&gt; actually contains Talos, or if the server rebooted cleanly, or if Talos came up healthy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The polling hack has no exit.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;until&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nc&lt;span class="w"&gt; &lt;/span&gt;-z&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$IP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sleep&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This runs on your local machine, blocking your terminal, with no timeout. If Talos never comes up (hardware failure, bad flash, wrong NVMe device), this loop runs forever, with no error and no recovery path. Kill it with Ctrl-C and OpenTofu doesn't even register a failure in state.&lt;/p&gt;
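&lt;p&gt;For contrast, here is a minimal sketch of a wait that can actually fail. The helper name &lt;code&gt;wait_until&lt;/code&gt; and its parameters are hypothetical. The point is only that a deadline turns "hangs forever" into an exception, and therefore into a non-zero exit code that the calling tool can see.&lt;/p&gt;

```python
import time


def wait_until(check, timeout: float, interval: float = 10.0,
               clock=time.monotonic, sleep=time.sleep) -> None:
    """Call check() until it returns True; raise TimeoutError after timeout seconds."""
    deadline = clock() + timeout
    while not check():
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout:.0f}s")
        sleep(interval)


# Usage idea (not executed here): check could be a TCP probe of port 50000,
# e.g. wait_until(lambda: talos_api_reachable(ip), timeout=600)
# where talos_api_reachable is a probe you would supply yourself.
```

&lt;p&gt;The injectable &lt;code&gt;clock&lt;/code&gt; and &lt;code&gt;sleep&lt;/code&gt; exist so the timeout path can be tested without waiting in real time.&lt;/p&gt;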
&lt;p&gt;&lt;strong&gt;SSH mid-flight failures corrupt state.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The flash step is not atomic: the download and the &lt;code&gt;dd&lt;/code&gt; onto the NVMe run over a single SSH session (about 8 seconds in the run above, longer on slower disks or links). If the SSH connection drops during that window (network blip, rescue mode timeout, anything), OpenTofu marks &lt;code&gt;null_resource.flash_talos&lt;/code&gt; as failed but the server might be 70% flashed. Now OpenTofu thinks the resource needs to be re-created, the server is in an unknown condition, and there's no clean recovery path short of manually intervening and wiping state.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The reboot timing hack.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;nohup&lt;span class="w"&gt; &lt;/span&gt;bash&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sleep 5 &amp;amp;&amp;amp; reboot&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;/dev/null&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&amp;gt;&lt;span class="p"&gt;&amp;amp;&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;&amp;amp;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Sleeping 5 seconds in the background so the SSH session can close cleanly before the reboot kills it. It works in practice. It also breaks silently when rescue mode is slow or &lt;code&gt;sync&lt;/code&gt; takes longer than expected. This kind of timing hack has no business being in your infrastructure state.&lt;/p&gt;
&lt;h2&gt;The Root Cause&lt;/h2&gt;
&lt;p&gt;The fundamental mismatch is between &lt;strong&gt;declarative&lt;/strong&gt; and &lt;strong&gt;imperative&lt;/strong&gt; operations.&lt;/p&gt;
&lt;p&gt;OpenTofu is designed for declarative infrastructure: describe the desired end state, and OpenTofu makes the API calls to get there. This works brilliantly for cloud resources because cloud providers model everything as observable, addressable state.&lt;/p&gt;
&lt;p&gt;Flashing an OS onto a physical disk is &lt;strong&gt;imperative&lt;/strong&gt;: it's a one-time action with side effects. There's no API to query afterwards. There's no way to declare "this disk should contain Talos v1.12.1" and have OpenTofu verify it.&lt;/p&gt;
&lt;p&gt;When you use &lt;code&gt;null_resource&lt;/code&gt; + provisioners to paper over this mismatch, you get something that looks like infrastructure-as-code but has none of its guarantees.&lt;/p&gt;
&lt;h2&gt;It Gets Worse on the Second Run&lt;/h2&gt;
&lt;p&gt;Running &lt;code&gt;tofu apply&lt;/code&gt; a second time on the live server produced this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;hetzner-robot_boot.rescue: Modifying... [id=2944755]
hetzner-robot_boot.rescue: Modifications complete after 1s [id=2944755]
data.external.rescue_password: Read complete after 1s [id=-]

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;No wipe, no reboot — so it looks safe. But look at what actually happened: the provider silently re-activated rescue mode on the running Talos server because it couldn't read its own state back correctly after the first apply (a bug in how the provider parses POST vs GET responses). The &lt;code&gt;null_resource&lt;/code&gt; steps were skipped only because their IDs happened to be stable in state, not because OpenTofu verified Talos was still installed.&lt;/p&gt;
&lt;p&gt;The server is now scheduled to boot into rescue mode on its next reboot. The apply summary says &lt;code&gt;1 changed, 0 destroyed&lt;/code&gt; and everything looks fine.&lt;/p&gt;
&lt;p&gt;This is the real danger: not the dramatic wipe, but the silent inconsistency. The system looks healthy in state while the actual server is in a configuration that will fail the next time it restarts.&lt;/p&gt;
&lt;h2&gt;"Just Fix the Provider" Is the Wrong Answer&lt;/h2&gt;
&lt;p&gt;At this point you might think: patch the community provider, fix the POST response parsing, get &lt;code&gt;password&lt;/code&gt; and &lt;code&gt;ipv4_address&lt;/code&gt; back properly, and everything works. But that path leads somewhere worse.&lt;/p&gt;
&lt;p&gt;The provider bug is a symptom, not the disease. Even a perfectly working provider can't solve the fundamental problem: bare metal bootstrapping is an imperative sequence of side effects. Rescue boot → reboot → flash → sync → reboot → wait. There is no desired end state to declare. There is no API that can tell OpenTofu "the NVMe currently contains Talos v1.12.1." You can make the provider return the right values, but you still end up with &lt;code&gt;null_resource&lt;/code&gt; provisioners and &lt;code&gt;local-exec&lt;/code&gt; curl scripts holding the whole thing together — a pipeline of shell commands wearing an OpenTofu costume.&lt;/p&gt;
&lt;p&gt;You'd have all the fragility of a bash script with none of the debuggability, and the false reassurance of a state file that doesn't reflect reality.&lt;/p&gt;
&lt;h2&gt;The Fix: Know Where OpenTofu's Job Ends&lt;/h2&gt;
&lt;p&gt;The second run proved the point better than any contrived example could. Re-running &lt;code&gt;tofu apply&lt;/code&gt; on a live Talos server silently re-activated rescue mode — because that's what's in state, and OpenTofu dutifully tries to maintain it. The rescue activation is a transient step in an imperative sequence, not a desired end state. Putting it in OpenTofu state means OpenTofu will try to enforce it permanently.&lt;/p&gt;
&lt;p&gt;The entire physical bootstrap process — rescue activation, reboot, OS flash, reboot into Talos, wait for the API — is a one-way imperative sequence. None of it belongs in OpenTofu state. There is no meaningful "desired state" for any of these steps once the server is running.&lt;/p&gt;
&lt;p&gt;The rule: &lt;strong&gt;if the desired state can't be observed through an API, don't put it in OpenTofu.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The clean boundary is the moment Talos starts listening on port 50000. From there, the &lt;code&gt;siderolabs/talos&lt;/code&gt; provider gives you a proper declarative API for machine configuration and cluster bootstrapping. Everything before that port opens belongs in a dedicated bootstrapper: a purpose-built script that owns the physical lifecycle explicitly, runs once, and hands off to OpenTofu when there's an actual API to talk to.&lt;/p&gt;
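&lt;p&gt;As a minimal sketch of that handoff point, assuming Python purely for illustration, the check below polls until a TCP connection to the Talos API port succeeds. The function name &lt;code&gt;wait_for_port&lt;/code&gt; and its default timeouts are hypothetical, not taken from the actual bootstrapper, and an open port only shows that something is listening, not that Talos is healthy.&lt;/p&gt;

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, interval_s=5.0):
    """Poll host:port over TCP. Return True on the first successful
    connection, or False once timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A completed TCP handshake means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused, unreachable, or timed out: the node is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

&lt;p&gt;A bootstrapper could call &lt;code&gt;wait_for_port(node_ip, 50000)&lt;/code&gt; as its last step and only hand over to &lt;code&gt;tofu apply&lt;/code&gt; once it returns &lt;code&gt;True&lt;/code&gt;. Unlike a fixed sleep, the loop reacts to the observed condition and fails explicitly on timeout.&lt;/p&gt;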
&lt;p&gt;That bootstrapper, and how it replaces everything in &lt;code&gt;01-hardware/&lt;/code&gt;, is the subject of the &lt;a href="https://carlosperello.blog/the-day-0-bootstrapper-replacing-the-opentofu-antipattern-with-a-proper-tool/"&gt;next post in this series&lt;/a&gt;.&lt;/p&gt;</content><category term="Tech"/><category term="bare-metal"/><category term="opentofu"/><category term="talos"/><category term="kubernetes"/><category term="hetzner"/></entry><entry><title>What I’ve Been Building These Past Five Years</title><link href="https://carlosperello.blog/five-year-update/" rel="alternate"/><published>2025-09-13T00:00:00+02:00</published><updated>2025-09-13T00:00:00+02:00</updated><author><name/></author><id>tag:carlosperello.blog,2025-09-13:/five-year-update/</id><summary type="html">&lt;p&gt;A look at the projects I’ve delivered through Nubosas—AWS &amp;amp; Terraform migrations, CI/CD automation, and large-scale DataOps.&lt;/p&gt;</summary><content type="html">&lt;p&gt;It’s been a long time since my last post, and quite a lot has happened since launching &lt;strong&gt;&lt;a href="https://www.nubosas.com"&gt;Nubosas&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;After a brief period without a project, I began a series of engagements as a &lt;strong&gt;Nubosas contractor&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The first was with &lt;strong&gt;Revieve&lt;/strong&gt;, where I led the migration of two AWS accounts from console-managed infrastructure to &lt;strong&gt;Terraform-managed&lt;/strong&gt; environments.
I imported all existing resources, extended the setup with automation, and built CI/CD pipelines using &lt;strong&gt;GitHub Actions&lt;/strong&gt;.
The project also included deploying a &lt;strong&gt;MongoDB Atlas&lt;/strong&gt; cluster to support new applications.&lt;/p&gt;
&lt;p&gt;Next, I joined a still-active engagement with a &lt;strong&gt;Fortune 500 company&lt;/strong&gt;, working on their &lt;strong&gt;Global Data Foundation&lt;/strong&gt; team as a &lt;strong&gt;DataOps engineer&lt;/strong&gt;.
My focus has been supporting their AWS infrastructure and guiding the migration to &lt;strong&gt;Databricks on Azure&lt;/strong&gt;, enabling a modern, scalable data platform across global regions.&lt;/p&gt;
&lt;p&gt;In between these major projects I also handled a smaller migration from a single on-premises server to the cloud—proof that even modest workloads benefit from strong automation and careful planning.&lt;/p&gt;
&lt;p&gt;These experiences have reinforced what I value most: building reliable, automated infrastructure that frees teams to innovate.&lt;/p&gt;
&lt;p&gt;Today, through Nubosas, I focus on helping scaling SaaS businesses graduate from expensive public clouds (AWS/Azure/GCP) to their own highly automated, Private Sovereign Clouds using bare-metal Kubernetes. It brings Fortune 500 reliability without the cloud tax.&lt;/p&gt;
&lt;p&gt;I’ll use this blog to share occasional technical insights and lessons learned.
If you’d like to discuss a project or connect, you can reach me through &lt;a href="https://www.nubosas.com"&gt;Nubosas&lt;/a&gt;, on
&lt;a href="https://www.linkedin.com/in/carlosperello"&gt;LinkedIn&lt;/a&gt;, or on &lt;a href="https://twitter.com/carlosperello"&gt;X (Twitter)&lt;/a&gt;.&lt;/p&gt;</content><category term="Work"/></entry><entry><title>Good practices: Minimum Docker Image Size</title><link href="https://carlosperello.blog/good-practices-minimum-docker-image-size/" rel="alternate"/><published>2020-04-29T09:00:00+02:00</published><updated>2020-04-29T09:00:00+02:00</updated><author><name/></author><id>tag:carlosperello.blog,2020-04-29:/good-practices-minimum-docker-image-size/</id><summary type="html">&lt;p&gt;Docker containers are becoming more and more popular over time. They are a really simple way to
encapsulate your application code with its dependencies and leave it ready to be deployed into
different environments.&lt;/p&gt;
&lt;p&gt;One of their key features is that they allow you to test your application together with the …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Docker containers are becoming more and more popular over time. They are a really simple way to
encapsulate your application code with its dependencies and leave it ready to be deployed into
different environments.&lt;/p&gt;
&lt;p&gt;One of their key features is that they allow you to test your application together with the operating
system that will be used to deploy it. This means your tests not only validate that app updates
don't break anything, but also make it simple to upgrade the operating system dependencies and
validate the result with your regular app tests.&lt;/p&gt;
&lt;p&gt;Let's say we have a Python project which uses PostgreSQL with this set of dependencies in a
&lt;code&gt;requirements.txt&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;2.8.5&lt;/span&gt;
&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;2.23.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A simple &lt;code&gt;Dockerfile&lt;/code&gt; for that app would be:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.8-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;/app&lt;/span&gt;

&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;update&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;--allow-unauthenticated&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;build-essential&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;libpq-dev

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;run_app.sh&lt;span class="w"&gt; &lt;/span&gt;run_app.sh
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;my_app.py&lt;span class="w"&gt; &lt;/span&gt;my_app.py
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;requirements.txt&lt;span class="w"&gt; &lt;/span&gt;requirements.txt

&lt;span class="c"&gt;# Install app dependencies.&lt;/span&gt;
&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;requirements.txt

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/app/run_app.sh&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This produces an image of 357MB, starting from a base Python 3.8 image of 113MB. That
means that each deployment needs to download 357MB from your Docker registry to your production
environment. Imagine how big it would become as your dependencies grow, especially when you start
using external libraries or modules that need to be compiled, requiring dev libraries to create
the container.&lt;/p&gt;
&lt;p&gt;Is there anything we could do to reduce that size?&lt;/p&gt;
&lt;p&gt;Yes, and since Docker added &lt;a href="https://docs.docker.com/develop/develop-images/multistage-build/"&gt;multi-stage builds&lt;/a&gt;,
it is quite simple!&lt;/p&gt;
&lt;p&gt;The way to approach this optimisation is to use one container only for the build and dependency
installation and then, in a fresh container, copy just the files you are interested in.&lt;/p&gt;
&lt;p&gt;The way you implement this split depends on the technologies used to run the service. For this
article we use a Python-based application, so we need a way to easily copy the installed
dependencies. There are two different ways to achieve it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use pip to install all dependencies into a specific directory.&lt;/li&gt;
&lt;li&gt;Use a virtualenv so the whole installation is isolated, including bin scripts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The approach used in this article is the virtualenv one, but it's just a personal preference for
simplicity.&lt;/p&gt;
&lt;p&gt;A possible implementation starts with a &lt;code&gt;base&lt;/code&gt; stage holding the common instructions reused by the other
stages:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.8-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;base&lt;/span&gt;

&lt;span class="c"&gt;# PYTHONUNBUFFERED: https://docs.python.org/3.8/using/cmdline.html#envvar-PYTHONUNBUFFERED&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;PYTHONUNBUFFERED&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;VIRTUAL_ENV&lt;span class="w"&gt; &lt;/span&gt;/srv/venv
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;PYTHONDONTWRITEBYTECODE&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;/app&lt;/span&gt;

&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;virtualenv

&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;update&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;--allow-unauthenticated&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;libpq5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It initialises the environment, sets the final workdir, installs the virtualenv
package that will be used in the following stages, and finally installs the library dependencies required
by our app, in this case libpq5 for PostgreSQL.&lt;/p&gt;
&lt;p&gt;The next stage is used only for the build and includes all development dependencies. It
starts from the previous &lt;code&gt;base&lt;/code&gt; stage:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;base&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;compile-image&lt;/span&gt;

&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;update&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;--allow-unauthenticated&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;build-essential&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;libpq-dev&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;python&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;virtualenv&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VIRTUAL_ENV&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;PATH&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VIRTUAL_ENV&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/bin:&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;

&lt;span class="c"&gt;# Install app dependencies.&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;requirements.txt&lt;span class="w"&gt; &lt;/span&gt;requirements.txt
&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;requirements.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To install the runtime dependencies for our app, it creates a virtualenv, which is used later
to install the requirements.txt dependencies with pip. To simplify the python command execution,
the &lt;code&gt;PATH&lt;/code&gt; environment variable is updated to use the virtual env's commands by default.&lt;/p&gt;
&lt;p&gt;Finally, it is time to create the final image with just our application code and the required
libraries:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;base&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build-image&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;PATH&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VIRTUAL_ENV&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/bin:&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--from&lt;span class="o"&gt;=&lt;/span&gt;compile-image&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VIRTUAL_ENV&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VIRTUAL_ENV&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;run_app.sh&lt;span class="w"&gt; &lt;/span&gt;run_app.sh
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;my_app.py&lt;span class="w"&gt; &lt;/span&gt;my_app.py

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/app/run_app.sh&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As you can see, this stage copies the virtual environment created in the previous stage,
which contains the installed modules required to run our app, and then adds the app's own code.
The PostgreSQL dependency requires the libpq library to work, but we already have it installed from
the base image, which lets us share that installation between the dependency installation
stage and this final container stage.&lt;/p&gt;
&lt;p&gt;Once this last stage is built, we end up with an image of 157MB, which is roughly 2.3 times smaller
than the original 357MB image.&lt;/p&gt;</content><category term="Tech"/><category term="Docker"/><category term="Good practices"/></entry></feed>