
Recover your cluster

Recovery of a Constellation cluster means getting it back into a healthy state after too many concurrent node failures in the control plane. Causes of an unhealthy cluster range from a power outage or planned reboot to the migration of nodes and regions. Recovery events are rare because Constellation is built for high availability and automatically and securely replaces failed nodes. When a node is replaced, Constellation's control plane first verifies the new node before sending it the cryptographic keys required to decrypt its state disk.

Constellation provides a recovery mechanism for cases where the control plane has failed and is unable to replace nodes. The constellation recover command securely connects to all nodes in need of recovery using attested TLS and provides them with the keys to decrypt their state disks and continue booting.

Identify unhealthy clusters

The first step to recovery is identifying that a cluster has become unhealthy. Usually, the first sign is that the Kubernetes API server becomes unresponsive.

You can check the health status of the nodes via the cloud service provider (CSP). Constellation provides logging information on the boot process and status via serial console output. In the following, you'll find detailed descriptions for identifying clusters stuck in recovery for each CSP.

First, open the AWS console to view all Auto Scaling Groups (ASGs) in the region of your cluster. Select the ASG of the control plane <cluster-name>-<UID>-control-plane and check that enough members are in a Running state.

Second, check the boot logs of these instances. In the ASG's Instance management view, select each desired instance. In the upper right corner, select Actions > Monitor and troubleshoot > Get system log.

In the serial console output, search for Waiting for decryption key. Output similar to the following means your node was restarted and needs to decrypt the state disk:

{"level":"INFO","ts":"2022-09-08T10:21:53Z","caller":"cmd/main.go:55","msg":"Starting disk-mapper","version":"2.0.0","cloudProvider":"gcp"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"setupManager","caller":"setup/setup.go:72","msg":"Preparing existing state disk"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:65","msg":"Starting RejoinClient"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"recoveryServer","caller":"recoveryserver/server.go:59","msg":"Starting RecoveryServer"}

The node will then try to connect to the JoinService and obtain the decryption key. If this fails due to an unhealthy control plane, you will see log messages similar to the following:

{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:77","msg":"Received list with JoinService endpoints","endpoints":["",""]}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":""}
{"level":"WARN","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp connect: connection refused\"","endpoint":""}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":""}
{"level":"WARN","ts":"2022-09-08T10:22:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp i/o timeout\"","endpoint":""}
{"level":"ERROR","ts":"2022-09-08T10:22:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:110","msg":"Failed to rejoin on all endpoints"}
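If you have saved the serial console output to a file, the check for the final error message can be automated. The following is a minimal sketch, not part of Constellation: the function names parse_log_lines and needs_manual_recovery are hypothetical, and it assumes each log entry is a single-line JSON object, possibly preceded by a console prefix, with the logger and msg fields shown above.

```python
import json

# Message taken from the example log output above.
REJOIN_FAILED_MSG = "Failed to rejoin on all endpoints"


def parse_log_lines(text):
    """Yield JSON log entries found in serial console text.

    Assumption: each entry is one JSON object on its own line. Anything
    before the first "{" (for example a console prefix) is ignored, and
    lines that do not parse as JSON are skipped.
    """
    for line in text.splitlines():
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict):
            yield entry


def needs_manual_recovery(text):
    """Return True if the rejoinClient logged a failure on all endpoints."""
    return any(
        entry.get("logger") == "rejoinClient"
        and entry.get("msg") == REJOIN_FAILED_MSG
        for entry in parse_log_lines(text)
    )
```

Matching on the parsed logger and msg fields rather than on the raw line keeps the check independent of the order of the JSON fields.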

This means that you have to recover the node manually.

Recover a cluster

Recovering a cluster requires the following parameters:

  • The constellation-state.yaml file in your working directory or the cluster's endpoint
  • The master secret of the cluster

A cluster can be recovered like this:

$ constellation recover --master-secret constellation-mastersecret.json
Pushed recovery key.
Pushed recovery key.
Pushed recovery key.
Recovered 3 control-plane nodes.
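When recovery is scripted, it can help to check that the number of pushed keys matches the reported number of recovered nodes. The sketch below only parses text in the form shown above. The function name summarize_recover_output is hypothetical, and the exact wording of the CLI output is an assumption based on this example and may differ between versions.

```python
import re


def summarize_recover_output(output):
    """Return (pushed, recovered) parsed from `constellation recover` output.

    pushed:    number of "Pushed recovery key." lines
    recovered: node count from the "Recovered N control-plane nodes." line,
               or None if no such line is present
    Assumption: the output wording matches the example above.
    """
    pushed = sum(
        1 for line in output.splitlines()
        if line.strip() == "Pushed recovery key."
    )
    match = re.search(r"Recovered (\d+) control-plane nodes?\.", output)
    recovered = int(match.group(1)) if match else None
    return pushed, recovered
```

The input text could, for example, be captured with subprocess.run(["constellation", "recover", "--master-secret", "constellation-mastersecret.json"], capture_output=True, text=True).stdout.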

In the serial console output of the node, you'll see output similar to the following:

{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer","caller":"recoveryserver/server.go:93","msg":"Received recover call"}
{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer","caller":"recoveryserver/server.go:125","msg":"Received state disk key and measurement secret, shutting down server"}
{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer.gRPC","caller":"zap/server_interceptors.go:61","msg":"finished streaming call with code OK","grpc.start_time":"2022-09-08T10:26:59Z","system":"grpc","span.kind":"server","grpc.service":"recoverproto.API","grpc.method":"Recover","peer.address":"","grpc.code":"OK","grpc.time_ms":15.701}
{"level":"INFO","ts":"2022-09-08T10:27:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:87","msg":"RejoinClient stopped"}
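To confirm on a given node that the key arrived, you can search its saved serial console output for the recoveryServer message shown above. This is a minimal sketch under the same assumptions as the log format in the example (one JSON object per line with logger and msg fields). The function name node_received_key is hypothetical and not part of Constellation.

```python
import json

# Message taken from the example log output above.
KEY_RECEIVED_MSG = (
    "Received state disk key and measurement secret, shutting down server"
)


def node_received_key(text):
    """Return True if the recoveryServer logged receipt of the state disk key.

    Assumption: each log entry is one JSON object on its own line. Text
    before the first "{" is ignored and unparseable lines are skipped.
    """
    for line in text.splitlines():
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if (
            isinstance(entry, dict)
            and entry.get("logger") == "recoveryServer"
            and entry.get("msg") == KEY_RECEIVED_MSG
        ):
            return True
    return False
```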