Coming from a full cloud world where everything is an API, it felt natural to extend OpenTofu to cover our move to bare-metal. When we were automating the Turisapps migration — moving 850+ custom domains from DigitalOcean/Nomad to a bare-metal Kubernetes cluster on Hetzner — I wanted one tool, one state file, one tofu apply to go from empty server to running cluster.
The first red flag should have been that Hetzner doesn't even provide an official bare-metal provider — only a Cloud one. We had to rely on a community wrapper (silenium-dev/hetzner-robot) just to get OpenTofu to talk to the physical servers.
I ignored that flag. Here's what happened.
The Single-Tool Dream
The appeal was real. In cloud-land, tofu apply really does do everything: provision an EC2 instance, attach its EBS volume, configure its security groups, and register it with a load balancer. The entire lifecycle is API calls. If something is wrong, destroy and recreate. State matches reality because the cloud provider enforces it.
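That lifecycle, in HCL — a minimal sketch of the EC2 flavor described above (AMI ID, sizes, and names are illustrative, not from our config):

```hcl
# The cloud model: every step of the lifecycle is an API resource that can be
# created, read, updated, and destroyed. (AMI ID and names are illustrative.)
resource "aws_instance" "app" {
  ami           = "ami-0abc1234def567890"
  instance_type = "t3.medium"
}

resource "aws_ebs_volume" "data" {
  availability_zone = aws_instance.app.availability_zone
  size              = 100
}

resource "aws_volume_attachment" "data" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.data.id
  instance_id = aws_instance.app.id
}
```

Destroy and recreate works here because every attribute can be read back through the API, so state and reality converge on every plan.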
Bare metal is different. The physical server already exists — you've already paid Hetzner for it. There's no "create" API. There's just a server sitting in a rack, waiting to be told what to do.
But the tooling itch was strong. OpenTofu could talk to Hetzner Robot. OpenTofu could SSH. OpenTofu had null_resource. Surely we could wire it all together.
What "OpenTofu for Everything" Looks Like
Here's the actual code from the first attempt:
resource "hetzner-robot_boot" "rescue" {
  server_id        = var.server_id
  operating_system = "linux"
  architecture     = "64"
  active_profile   = "rescue"
}

data "external" "rescue_password" {
  depends_on = [hetzner-robot_boot.rescue]
  program = [
    "bash", "-c",
    "curl -sf -u '${var.robot_user}:${var.robot_password}' https://robot-ws.your-server.de/boot/${var.server_id}/rescue | python3 -c \"import json,sys; d=json.load(sys.stdin); print(json.dumps({'password': d['rescue']['password']}))\"",
  ]
}

resource "null_resource" "reboot_into_rescue" {
  depends_on = [hetzner-robot_boot.rescue]

  provisioner "local-exec" {
    command = <<-EOT
      curl -s -u '${var.robot_user}:${var.robot_password}' \
        -d 'type=sw' \
        https://robot-ws.your-server.de/reset/${var.server_id}
      sleep 60
      until nc -z ${var.server_ip} 22; do echo 'Waiting for SSH...'; sleep 10; done
    EOT
  }
}

resource "null_resource" "flash_talos" {
  depends_on = [null_resource.reboot_into_rescue]

  connection {
    type     = "ssh"
    user     = "root"
    password = data.external.rescue_password.result.password
    host     = var.server_ip
    timeout  = "10m"
  }

  provisioner "remote-exec" {
    inline = [
      "wget -O /tmp/talos.raw.zst https://github.com/siderolabs/talos/releases/download/v1.12.1/metal-amd64.raw.zst",
      "zstd -d -c /tmp/talos.raw.zst | dd of=/dev/nvme0n1 bs=4M status=progress",
      "sync",
    ]
  }
}

resource "null_resource" "reboot_into_talos" {
  depends_on = [null_resource.flash_talos]

  provisioner "local-exec" {
    command = <<-EOT
      curl -s -u '${var.robot_user}:${var.robot_password}' \
        -d 'type=sw' \
        https://robot-ws.your-server.de/reset/${var.server_id}
    EOT
  }
}

resource "null_resource" "wait_for_talos" {
  depends_on = [null_resource.reboot_into_talos]

  provisioner "local-exec" {
    command = "until nc -z ${var.server_ip} 50000; do echo 'Waiting for Talos API...'; sleep 10; done"
  }
}
It works. The first time.
In practice, getting even that first run to succeed required working around several provider bugs. The community silenium-dev/hetzner-robot provider returned empty values for both ipv4_address and password after rescue activation (a POST vs GET response parsing bug), and it shipped no reboot resource — which is why the config above leans on an external data source with an inline curl+python script just to read the rescue password back. Each workaround added another layer of imperative glue on top of the declarative facade. But eventually:
hetzner-robot_boot.rescue: Creating...
hetzner-robot_boot.rescue: Creation complete after 1s [id=2944755]
data.external.rescue_password: Reading...
null_resource.reboot_into_rescue: Creating...
null_resource.reboot_into_rescue: Provisioning with 'local-exec'...
null_resource.reboot_into_rescue (local-exec): Executing: ["/bin/sh" "-c" "curl -s -u '<robot_user>:<robot_password>' \
-d 'type=sw' \
https://robot-ws.your-server.de/reset/2944755
..."]
data.external.rescue_password: Read complete after 1s [id=-]
null_resource.reboot_into_rescue (local-exec): {"reset":{"server_ip":"203.0.113.189",...,"type":"sw"}}
null_resource.reboot_into_rescue (local-exec): Waiting for server to come back in rescue mode...
null_resource.reboot_into_rescue (local-exec): Connection to 203.0.113.189 port 22 [tcp/ssh] succeeded!
null_resource.reboot_into_rescue: Creation complete after 1m32s [id=...]
null_resource.flash_talos: Creating...
null_resource.flash_talos: Provisioning with 'remote-exec'...
null_resource.flash_talos (remote-exec): Connected!
null_resource.flash_talos (remote-exec): Downloading Talos...
null_resource.flash_talos (remote-exec): /tmp/talos.r 100% 188.71M 111MB/s in 1.7s
null_resource.flash_talos (remote-exec): Flashing NVMe...
null_resource.flash_talos (remote-exec): 4453302272 bytes (4.5 GB, 4.1 GiB) copied, 4.82233 s, 923 MB/s
null_resource.flash_talos: Creation complete after 8s [id=...]
null_resource.reboot_into_talos: Creation complete after 1s [id=...]
null_resource.wait_for_talos (local-exec): Connection to 203.0.113.189 port 50000 [tcp/*] succeeded!
null_resource.wait_for_talos: Creation complete after 55s [id=...]
Apply complete! Resources: 5 added, 0 changed, 0 destroyed.
Outputs:
node_ip = "203.0.113.189"
Server flashed. Talos running. tofu apply succeeded.
Why It's a Trap
It's not idempotent.
Run tofu apply again on a healthy, running production server and the danger is one replacement away. OpenTofu sees null_resource.flash_talos in state, and because provisioners only run at creation, a clean second run skips it. But the moment that resource is replaced — a tofu taint, an apply -replace=null_resource.flash_talos, a change to any trigger you later add — OpenTofu re-runs the provisioner: downloads Talos again, dds over the NVMe, reboots. Your production server is gone, and nothing in the plan warned you that "replace null_resource" means "wipe a disk".
The only escape hatch is adding lifecycle { ignore_changes = all }. But at that point you've told OpenTofu to permanently ignore this resource. You've neutered the tool. You're no longer using infrastructure-as-code; you're using infrastructure-as-a-script-you-ran-once that happens to live in a .tf file.
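Concretely, the escape hatch looks like this, sketched against the flash_talos resource above:

```hcl
resource "null_resource" "flash_talos" {
  # ...connection and provisioner as before...

  lifecycle {
    # Never diff this resource again. This "protects" the server by making
    # OpenTofu permanently blind to it.
    ignore_changes = all
  }
}
```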
State and reality are already decoupled.
OpenTofu state tracks API resource attributes. But there's no API to ask a physical server "what OS are you running?" before it's booted. The null_resource in state just records that the provisioner ran successfully. It has no idea if /dev/nvme0n1 actually contains Talos, or if the server rebooted cleanly, or if Talos came up healthy.
The polling hack has no exit.
until nc -z $IP 50000; do sleep 10; done
This runs on your local machine, blocking your terminal, with no timeout. If Talos never comes up — hardware failure, bad flash, wrong NVMe device — this loop runs forever. There's no error, no timeout, no recovery path. Kill it with Ctrl-C and OpenTofu doesn't even register a failure in state.
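The bare-minimum fix, if such a loop must exist at all, is to bound it and fail loudly. A sketch using coreutils timeout — wait_for_port is a hypothetical helper, not something from the original config:

```shell
#!/usr/bin/env bash
# Bounded port wait: exits nonzero instead of hanging forever.
# wait_for_port HOST PORT [DEADLINE_SECONDS] — hypothetical helper.
wait_for_port() {
  local host="$1" port="$2" deadline="${3:-900}"  # default: give up after 15 min
  if ! timeout "$deadline" bash -c \
      "until nc -z '$host' '$port' 2>/dev/null; do sleep 5; done"; then
    echo "ERROR: $host:$port did not open within ${deadline}s" >&2
    return 1
  fi
}

# Usage, replacing the unbounded loop:
#   wait_for_port "$NODE_IP" 50000 || exit 1
```

Even this is lipstick: the failure now surfaces, but OpenTofu still can't tell you what state the server is actually in when it does.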
SSH mid-flight failures corrupt state.
The dd to flash the NVMe takes minutes. If the SSH connection drops during that window — network blip, rescue mode timeout, anything — OpenTofu marks null_resource.flash_talos as failed but the server might be 70% flashed. Now OpenTofu thinks the resource needs to be re-created, the server is in an unknown condition, and there's no clean recovery path short of manually intervening and wiping state.
The reboot timing hack.
nohup bash -c 'sleep 5 && reboot' > /dev/null 2>&1 &
Sleeping 5 seconds in the background so the SSH session can close cleanly before the reboot kills it. It works in practice. It also breaks silently when rescue mode is slow or sync takes longer than expected. This kind of timing hack has no business being in your infrastructure state.
The Root Cause
The fundamental mismatch is between declarative and imperative operations.
OpenTofu is designed for declarative infrastructure: describe the desired end state, and OpenTofu makes the API calls to get there. This works brilliantly for cloud resources because cloud providers model everything as observable, addressable state.
Flashing an OS onto a physical disk is imperative: it's a one-time action with side effects. There's no API to query afterwards. There's no way to declare "this disk should contain Talos v1.12.1" and have OpenTofu verify it.
When you use null_resource + provisioners to paper over this mismatch, you get something that looks like infrastructure-as-code but has none of its guarantees.
It Gets Worse on the Second Run
Running tofu apply a second time on the live server produced this:
hetzner-robot_boot.rescue: Modifying... [id=2944755]
hetzner-robot_boot.rescue: Modifications complete after 1s [id=2944755]
data.external.rescue_password: Read complete after 1s [id=-]
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
No wipe, no reboot — so it looks safe. But look at what actually happened: the provider silently re-activated rescue mode on the running Talos server because it couldn't read its own state back correctly after the first apply (a bug in how the provider parses POST vs GET responses). The null_resource steps were skipped only because their IDs happened to be stable in state, not because OpenTofu verified Talos was still installed.
The server is now scheduled to boot into rescue mode on its next reboot. State says 1 changed, 0 destroyed and everything looks fine.
This is the real danger: not the dramatic wipe, but the silent inconsistency. The system looks healthy in state while the actual server is in a configuration that will fail the next time it restarts.
"Just Fix the Provider" Is the Wrong Answer
At this point you might think: patch the community provider, fix the POST response parsing, get password and ipv4_address back properly, and everything works. But that path leads somewhere worse.
The provider bug is a symptom, not the disease. Even a perfectly working provider can't solve the fundamental problem: bare metal bootstrapping is an imperative sequence of side effects. Rescue boot → reboot → flash → sync → reboot → wait. There is no desired end state to declare. There is no API that can tell OpenTofu "the NVMe currently contains Talos v1.12.1." You can make the provider return the right values, but you still end up with null_resource provisioners and local-exec curl scripts holding the whole thing together — a pipeline of shell commands wearing an OpenTofu costume.
You'd have all the fragility of a bash script with none of the debuggability, and the false reassurance of a state file that doesn't reflect reality.
The Fix: Know Where OpenTofu's Job Ends
The second run proved the point better than any contrived example could. Re-running tofu apply on a live Talos server silently re-activated rescue mode — because that's what's in state, and OpenTofu dutifully tries to maintain it. The rescue activation is a transient step in an imperative sequence, not a desired end state. Putting it in OpenTofu state means OpenTofu will try to enforce it permanently.
The entire physical bootstrap process — rescue activation, reboot, OS flash, reboot into Talos, wait for the API — is a one-way imperative sequence. None of it belongs in OpenTofu state. There is no meaningful "desired state" for any of these steps once the server is running.
The rule: if the desired state can't be observed through an API, don't put it in OpenTofu.
The clean boundary is the moment Talos starts listening on port 50000. From there, the siderolabs/talos provider gives you a proper declarative API for machine configuration and cluster bootstrapping. Everything before that port opens belongs in a dedicated bootstrapper: a purpose-built script that owns the physical lifecycle explicitly, runs once, and hands off to OpenTofu when there's an actual API to talk to.
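From that boundary on, the declarative model actually holds. A sketch of the handoff using the siderolabs/talos provider — resource names follow its documentation; the cluster name and endpoint are illustrative:

```hcl
# Everything below talks to the running Talos API, so it can be read back,
# diffed, and drift-detected — real desired state, not a one-shot side effect.
resource "talos_machine_secrets" "this" {}

data "talos_machine_configuration" "cp" {
  cluster_name     = "turisapps"
  machine_type     = "controlplane"
  cluster_endpoint = "https://${var.server_ip}:6443"
  machine_secrets  = talos_machine_secrets.this.machine_secrets
}

resource "talos_machine_configuration_apply" "cp" {
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.cp.machine_configuration
  node                        = var.server_ip
}

resource "talos_machine_bootstrap" "this" {
  depends_on           = [talos_machine_configuration_apply.cp]
  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = var.server_ip
}
```

Each of these maps to a Talos API call whose result is observable, which is exactly the property the physical bootstrap steps lack.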
That bootstrapper, and how it replaces everything in 01-hardware/, is the subject of the next post in this series.