Graceful single node k3s Shutdown
I managed to solve the ingress-nginx issues mentioned in the previous post. The solution was found in this blog post and it was actually a recent security-related addition for ingress-nginx. And being the decisive person I am, the homeserver is once again powered by k3s.
Engulfed in the feeling of success, I decided to try to tackle another (most likely solved) problem: graceful shutdown of pods on a single-node k3s cluster.
From what I have observed, k3s doesn’t really do much when it is being executed under systemd during shutdown. Due to how Kubernetes works, it’s quite likely that pods are actually left running by containerd. This is obviously a bit of a problem, as on shutdown they are probably just killed with SIGKILL. This may cause database corruption, or even file system problems in the long run.
So far I’ve remediated this issue by executing kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data before reboots and shutdowns. This is just a bit too easy to forget, especially while solving some other OS-related issue that requires reboots. Which is not that uncommon, at least for a tinkerer like me.
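One way to make the manual step harder to forget is to wrap it in a small script. The following is only a sketch: the names run, drain_then and DRY_RUN are made up for this illustration, and the kubectl flags are the same ones used above.

```shell
#!/bin/sh
# Sketch only: drain the node, then run the given command (e.g. a reboot).
# run, drain_then and DRY_RUN are hypothetical names for this example.

run() {
    # Print the command instead of executing it when DRY_RUN=1.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

drain_then() {
    node="$1"
    shift
    # Same drain command as above; only continue if it succeeded.
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    run "$@"
}

# Example: DRY_RUN=1 drain_then my-node systemctl reboot
```

Note that a drained node stays cordoned across the reboot, so with this approach a manual kubectl uncordon is still needed afterwards.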
A bit about systemd units
Systemd service units have the After= and Before= directives in the [Unit] section of the service definition, which control the order of starting the units. Additionally, services can be bound to each other with the Requires=, PartOf= and BindsTo= parameters.
The ordering directives also work in reverse when units are stopped, and Requires=, PartOf= and BindsTo= make a stop of one unit propagate to the units bound to it. If unit B has After=A.service, it will be started after A has started, and when unit A is stopped, B is stopped first.
Following this logic, to create a service that does something before another service stops, we can use a unit like the following:
## /etc/systemd/system/example-shutdown.service
[Unit]
Description=example shutdown service
Requires=k3s.service
After=k3s.service
[Service]
Type=simple
ExecStart=/bin/sleep inf
ExecStop=/bin/echo SHUTTING DOWN
[Install]
WantedBy=k3s.service
The service requires k3s.service, and is started after k3s. What’s a little less obvious is the need to use sleep as the “main” process of the service. This is because Type=simple services must have the ExecStart= directive. Type=oneshot doesn’t need it, and might also work for this use case, but this way it’s pretty clear what the service is doing: sleeping and waiting to perform the ExecStop= directive. In this case “SHUTTING DOWN” will just be printed to the system journal. Once the unit is enabled, WantedBy=k3s.service adds this service to the Wants= list of k3s.service, which means that when k3s starts, it would prefer to have this service started too. k3s wants to start this service, and this service requires k3s and is also defined to start after k3s, resulting in k3s starting first.
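To see the ordering in action, the example unit can be installed and k3s stopped once by hand. This is a rough sketch, assuming the unit file is already saved under /etc/systemd/system; run, try_example and DRY_RUN are names invented for this illustration, while the systemctl and journalctl invocations are standard ones. Be aware that it really does stop k3s for a moment.

```shell
#!/bin/sh
# Sketch: enable the example unit and watch ExecStop= fire when k3s stops.
# run, try_example and DRY_RUN are hypothetical names for this illustration.

run() {
    # Print the command instead of executing it when DRY_RUN=1.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

try_example() {
    # Pick up the new unit file.
    run systemctl daemon-reload
    # Link the unit into k3s's Wants= and start it right away.
    run systemctl enable --now example-shutdown.service
    # Stopping k3s should stop example-shutdown first.
    run systemctl stop k3s.service
    # The journal should now contain the "SHUTTING DOWN" line.
    run journalctl -u example-shutdown.service -n 5 --no-pager
    # Starting k3s pulls the example unit back in.
    run systemctl start k3s.service
}

# Example: DRY_RUN=1 try_example   (prints the commands without running them)
```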
Solution
In the footsteps of the previous example, the following unit file emerges:
## /etc/systemd/system/k3s-shutdown@.service
[Unit]
Description=k3s graceful shutdown
BindsTo=k3s.service
After=k3s.service
[Service]
Type=simple
ExecStartPre=/usr/local/bin/kubectl uncordon %i
ExecStart=/bin/sleep inf
ExecStop=/usr/local/bin/kubectl drain %i --ignore-daemonsets --delete-emptydir-data
[Install]
WantedBy=k3s.service
First, a notable addition is that the unit file name ends with the @ symbol, which makes it a template unit. This allows using the unit like systemctl enable k3s-shutdown@my-node.service, and the value after the @ symbol is passed to the unit template and is accessible with the specifier %i. In this case k3s-shutdown@my-node.service will result in a service that executes /usr/local/bin/kubectl drain my-node --ignore-daemonsets --delete-emptydir-data when it stops.
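The mapping between the instance name and %i can be illustrated with a couple of shell helpers. This is just a sketch: unit_for_node and instance_of are hypothetical helpers and not part of systemd, and the idea of using the hostname as the node name is an assumption that only holds when no explicit node name was configured for k3s.

```shell
#!/bin/sh
# Sketch: how a template instance name relates to the %i specifier.
# unit_for_node and instance_of are hypothetical helpers for this example.

unit_for_node() {
    # Build the instance name for a given node name.
    printf 'k3s-shutdown@%s.service\n' "$1"
}

instance_of() {
    # Recover what %i expands to: the text between "@" and ".service".
    name="${1#*@}"                       # drop everything up to the first "@"
    printf '%s\n' "${name%.service}"     # drop the unit type suffix
}

# Example, assuming the node name equals the hostname:
#   systemctl enable "$(unit_for_node "$(hostname)")"
```

%i inserts the instance name exactly as written, hyphens included. The related %I specifier would unescape hyphens into slashes, so %i is the right choice for node names like my-node.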
BindsTo= is also a new addition. This parameter is similar to Requires=, but declares a stronger relation. From man systemd.unit (with systemd version 252):
BindsTo=
Configures requirement dependencies, very similar in style to Requires=.
However, this dependency type is stronger: in addition to the effect of
Requires= it declares that if the unit bound to is stopped, this unit
will be stopped too
...
When used in conjunction with After= on the same unit the behaviour of
BindsTo= is even stronger. In this case, the unit bound to strictly has
to be in active state for this unit to also be in active state.
...
Following this and the earlier notes about stop order, we can be pretty confident that when k3s starts, the shutdown service is started after k3s, and that it is stopped before k3s is stopped.
With the above notes about BindsTo= in mind, the ExecStartPre= parameter becomes very useful: since we know systemd will do its darnedest to make sure the shutdown service is started after k3s, we can automatically uncordon the node in conjunction with this service.
All that’s left is to enable the shutdown service using the correct node name: systemctl enable --now k3s-shutdown@my-node.service. The service is safe to start under normal conditions, as on startup it only runs the uncordon command, which has no effect when the node isn’t cordoned.
When k3s.service is stopped, e.g. on host shutdown:
- k3s-shutdown@node-name.service stops, draining the node
- After the stop is complete, k3s is stopped
- host shutdown can continue safely, with no leftover running pods
When starting k3s, e.g. powering on the host or after a reboot:
- k3s.service is started, and wants k3s-shutdown@node-name.service too
- Because of After=k3s.service, systemd holds the shutdown service back until k3s itself has started
- After the kubelet is running (k3s in “Active” state), k3s-shutdown@node-name.service is started
- If the node has been drained, it will be uncordoned
- k3s-shutdown@node-name sleeps until it is stopped again
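To check that the uncordon step from the list above really happened after a boot, the node's scheduling state can be queried. The following is a small sketch: node_is_cordoned and the KUBECTL override are names I made up for this example, and it relies on the node's .spec.unschedulable field, which is set to true on a cordoned node.

```shell
#!/bin/sh
# Sketch: report whether a node is still cordoned after a reboot.
# KUBECTL and node_is_cordoned are hypothetical names for this example.
KUBECTL="${KUBECTL:-kubectl}"

node_is_cordoned() {
    # .spec.unschedulable prints "true" on a cordoned node and nothing otherwise.
    state="$($KUBECTL get node "$1" -o jsonpath='{.spec.unschedulable}')" || return 2
    [ "$state" = "true" ]
}

# Example usage:
#   node_is_cordoned my-node
#   case $? in
#       0) echo "still cordoned" ;;
#       1) echo "schedulable" ;;
#       *) echo "could not query the node" ;;
#   esac
```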
So far, this looks to be working really well. For further improvements or likely mistakes in the systemd interpretation, be sure to drop me a comment on Mastodon!