In the previous post I showed why using OpenTofu to bootstrap a bare-metal OS is the wrong tool for the job. The short version: the physical bootstrap sequence is imperative and one-directional (rescue mode, disk flash, reboot), while OpenTofu is designed for declarative state backed by observable APIs. Forcing the two together with null_resource provisioners gives you something that works once and then breaks silently on every run after that.
The clean boundary is the moment Talos starts listening on port 50000. From there, the siderolabs/talos provider gives you a real declarative API for machine configuration and cluster bootstrapping. Everything before that port opens belongs in a different kind of tool entirely.
This post is about that tool.
What It Needs to Do
The physical bootstrap sequence for a Hetzner dedicated server is:
- Activate rescue mode via the Robot API
- Reboot the server into rescue
- Inspect disks and confirm before wiping
- Flash Talos to the target disk
- Reboot into Talos
- Wait for the Talos API to answer on port 50000
It also needs to handle the case where a node is being reprovisioned: draining it from the cluster, wiping Talos state, and leaving it in maintenance mode for OpenTofu to re-apply the machine config.
None of this is declarative. It's a sequential script. So that's what it is: tools/talos-bootstrap.py, a Python script using the hetzner Robot API client, with two modes:
# Provision a fresh server
.venv/bin/python tools/talos-bootstrap.py --server-id 12345
# Drain, wipe, and re-provision a running node
.venv/bin/python tools/talos-bootstrap.py --server-id 12345 --reset
A full provision run looks like this:
$ .venv/bin/python tools/talos-bootstrap.py --server-id 12345
-- talos-bootstrap server=12345 (203.0.113.189) --------------------
[ok] (203.0.113.189)
* Rescue mode
- Rescue was already active -- re-activated to refresh password.
- Password starts with 'Hn...'
- Triggering hard reset...
[ok] Reset triggered.
* Waiting for rescue SSH
- Waiting for server to go down...
- Waiting for SSH (rescue) .... [ok]
* Disk inspection
Available disks:
[0] /dev/nvme0n1 476.9G
[1] /dev/nvme1n1 476.9G
Select disk [0]: 1
- Target: /dev/nvme1n1 (476.9G)
[ok] /dev/nvme1n1 appears empty -- proceeding without confirmation.
* Flashing Talos v1.11.3 to /dev/nvme1n1
- Source: https://github.com/siderolabs/talos/releases/download/v1.11.3/metal-amd64.raw.zst
[ok] Written to /dev/nvme1n1.
- Rebooting into Talos...
* Waiting for Talos API
- Waiting for Talos API ... [ok]
-- done --------------------------------------------------------------
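The flash step itself boils down to streaming the compressed image straight onto the disk over rescue SSH. A hedged sketch of how that remote command might be assembled (the script's exact invocation may differ):

```python
# Illustrative only: builds the shell pipeline the script could run over
# rescue SSH. The image URL matches the one in the transcript above.
TALOS_IMAGE = ("https://github.com/siderolabs/talos/releases/download/"
               "v1.11.3/metal-amd64.raw.zst")

def flash_command(image_url: str, device: str) -> str:
    # Download, decompress, and write in one pipeline, then flush
    # caches so the image is fully on disk before the reboot.
    return (f"wget -qO- {image_url} | zstd -d | "
            f"dd of={device} bs=4M conv=fsync && sync")

print(flash_command(TALOS_IMAGE, "/dev/nvme1n1"))
```

No temp files, no second copy on the rescue system's RAM disk: the image goes from GitHub to the raw device in one pass.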
If any disk on the server has existing data, the confirmation step kicks in before anything is touched:
* Disk inspection
Available disks:
[0] /dev/nvme0n1 476.9G
[1] /dev/nvme1n1 476.9G
Select disk [0]: 1
- Target: /dev/nvme1n1 (476.9G)
[!] Existing data found on this server:
nvme0n1 476.9G
nvme0n1p2 1M
nvme0n1p3 2G [xfs] label=BOOT
nvme0n1p4 1M
nvme1n1 476.9G
nvme1n1p2 1M
nvme1n1p3 2G [xfs] label=BOOT
nvme1n1p4 1M
Flashing Talos to /dev/nvme1n1 on a server with existing data.
All data on /dev/nvme1n1 will be PERMANENTLY DESTROYED.
Type the server ID (12345) to confirm:
[!!] Confirmation failed -- aborting.
Design Decisions Worth Explaining
Idempotency at the entry point
The first thing the provision flow does is check whether Talos is already answering on port 50000. If it is, there's nothing to do:
if port_open(server.ip, 50000):
    ok("Talos API already answering on port 50000, nothing to do.")
    print("    (Use --reset to drain and reprovision this node.)\n")
    sys.exit(0)
It's not full idempotency: if the server is mid-flash with no OS responding anywhere, the script will start from rescue again. But for the common case (re-running after a successful provision, checking a node that's already up), it exits cleanly rather than blowing up a running server.
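The port_open and wait_for_port helpers that show up throughout the script are a few lines of socket code each; a minimal sketch (the script's actual implementation may differ):

```python
import socket
import time

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    # A plain TCP connect is enough: we only care whether something
    # is listening, not what it says.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_for_port(host: str, port: int, timeout: int = 300,
                  label: str = "port", interval: int = 5) -> bool:
    # Poll until the port answers or the deadline passes. 'label' is
    # only used for progress output in the real script.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if port_open(host, port, timeout=interval):
            return True
        time.sleep(interval)
    return False
```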
Always re-activate rescue to get a fresh password
The hetzner package returns a cached rescue object. If rescue was activated in a previous session, server.rescue.password may return the password from that earlier activation, not necessarily the one the server actually booted with. Using a stale password means SSH authentication fails immediately (rc=5, permission denied).
The fix is to always call activate() regardless of whether rescue is already marked as active:
already_active = server.rescue.active
server.rescue.activate()
rescue_password = server.rescue.password
if already_active:
    info("Rescue was already active, re-activated to refresh password.")
Calling activate() on an already-active rescue resets it and returns a valid password. It's the only reliable way to get credentials you can actually use.
Hard reset, not soft
After rescue activation, the server needs to reboot into rescue mode. A soft reset (ACPI shutdown + boot) can hang if the current OS doesn't respond cleanly. For a bootstrap tool, we don't care about graceful shutdown. Hard reset every time.
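Under the hood this is a single Robot API call. The script goes through the hetzner package, but the raw request is simple enough to sketch with the standard library (the endpoint and type=hw come from Hetzner's Robot API documentation; the helper itself is illustrative):

```python
import base64
import urllib.parse
import urllib.request

ROBOT_API = "https://robot-ws.your-server.de"

def hard_reset_request(server_id: int, user: str,
                       password: str) -> urllib.request.Request:
    # POST /reset/{server-number} with type=hw performs a hardware
    # reset, equivalent to pressing the reset button: no ACPI signal,
    # no graceful shutdown.
    data = urllib.parse.urlencode({"type": "hw"}).encode()
    req = urllib.request.Request(f"{ROBOT_API}/reset/{server_id}", data=data)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

# urllib.request.urlopen(hard_reset_request(12345, user, password))
# would trigger the reset; not executed here.
```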
Wait for SSH to go down before waiting for it to come back up
After triggering the reset, there's a window where the old OS is still running and port 22 is still open. If you immediately poll for SSH availability, you'll connect to the wrong system:
# Wait for the old SSH to disappear first
time.sleep(15)
deadline = time.time() + 120
while time.time() < deadline and port_open(server_ip, 22):
    time.sleep(5)

# Now wait for rescue SSH
if not wait_for_port(server_ip, 22, timeout=300, label="SSH (rescue)"):
    die("Timed out waiting for SSH.")
Without this, the script picks up the pre-reboot SSH session, authenticates with the rescue password against whatever was running before, and fails. The 15-second initial sleep gives the server time to begin shutting down before we start checking.
Disk safeguard checks all disks, not just the target
The wipe confirmation triggers if any disk on the server has existing data, not just the one selected as the flash target:
disks_with_data = [d for d in disks if disk_has_data(d)]
if disks_with_data:
    warn("Existing data found on this server:")
    for d in disks_with_data:
        print(fmt_disk(d, indent=2))
    print(f"\n  Flashing Talos to {dev} on a server with existing data.")
    answer = input(f"\n  Type the server ID ({server.number}) to confirm: ")
Installing on an empty second disk while there's data on the first is still reprovisioning a server with existing data, and you should know that before confirming. The confirmation requires typing the server ID rather than just y, which makes it hard to misfire on a production node.
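A sketch of what the disk_has_data check could look like, assuming the script parses lsblk's JSON output from the rescue system (field names follow lsblk's `-J` format; both helpers are illustrative, not the script's actual API):

```python
import json
import subprocess

def list_disks() -> list:
    # Run over rescue SSH in practice: lsblk -J emits a JSON tree
    # of block devices with their partitions as 'children'.
    out = subprocess.run(["lsblk", "-J", "-o", "NAME,SIZE,FSTYPE"],
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out)["blockdevices"]

def disk_has_data(disk: dict) -> bool:
    # A disk counts as non-empty if it carries a filesystem directly
    # or has any partitions at all -- even unformatted ones, which is
    # why the 1M partitions show up in the warning above.
    return bool(disk.get("fstype") or disk.get("children"))
```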
The Reset Flow
The --reset flag handles reprovisioning a node that's still in the cluster. It:
- Looks up the Kubernetes node name from the server's IP (or takes it via --node-name)
- Drains the node (kubectl drain --ignore-daemonsets --delete-emptydir-data)
- Removes it from the cluster (kubectl delete node)
- Sends talosctl reset --graceful=false --reboot to wipe the STATE and EPHEMERAL partitions
- Waits for port 50000 to come back up in maintenance mode (Talos running, no machine config applied)
After that, tofu apply in 02-kubernetes/ re-applies the machine configuration and the node rejoins the cluster. The bootstrapper handles the physical lifecycle; OpenTofu handles the declarative configuration. The boundary stays clean.
If Talos isn't reachable at all (broken node, failed upgrade), the reset flow falls back to the full rescue bootstrap automatically.
What the Script Doesn't Do
It doesn't manage Talos upgrades. For a running node, talosctl upgrade handles version bumps without rescue mode: it downloads the new image and reboots in place. The bootstrapper is only for initial provisioning and full reprovisioning. If you're just upgrading a Talos version, don't reach for this.
It also doesn't apply machine configuration. That's OpenTofu's job. The script stops the moment Talos is listening on port 50000 (maintenance mode). From there, the siderolabs/talos provider takes over.
The Handoff
The 01-hardware/ directory in the reference repo still exists as an antipattern reference for the previous post, but nothing in it runs in production. The bootstrapper handles Day 0. When it exits, the cluster is exactly one tofu apply away from a configured, running node.
That tofu apply, the 02-kubernetes/ layer, is the subject of the next post.