Controller outage

July 17, 2020 at 11:02 PM

Controller Statistics controller

Resolved after 10h 58m of downtime. July 18, 2020 at 10:00 AM

Full report

Around 2020-07-17 23:02:00 UTC, the SeismoCloud controller pod crashed. Kubernetes normally handles this kind of failure by restarting the pod.

However, due to a misconfiguration, the container image pull policy was set to Always, meaning that the node tries to pull the latest container image from the registry every time the container starts. Normally this is not an issue. On 2020-07-16, however, the previous provider (where some non-critical services, including the container registry, are still hosted) experienced a power issue and has been unreachable since (the issue is ongoing at the time of this report). As a result, Kubernetes could not pull the image and was unable to restart the container.

Around 2020-07-18 09:45:00 UTC we identified the issue and changed the policy to IfNotPresent, which means that Kubernetes pulls the container image only if it is not already present locally. The pod was rescheduled shortly afterwards, and the system was up and running again within a few minutes.
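The failure mode can be illustrated with a simplified model of how the pull policy interacts with registry availability. This is only a sketch under stated assumptions, not the actual kubelet logic: the function name `can_start_container` and its parameters are hypothetical, and it assumes the controller image was already cached on the node.

```python
def can_start_container(policy: str, image_cached: bool, registry_reachable: bool) -> bool:
    """Simplified model: can a node start a container under a given pull policy?"""
    if policy == "Always":
        # The registry is contacted on every start, so it must be reachable
        # even when the image is already cached on the node.
        return registry_reachable
    if policy == "IfNotPresent":
        # The registry is only needed when the image is missing locally.
        return image_cached or registry_reachable
    if policy == "Never":
        # Only the local cache is used; the registry is never contacted.
        return image_cached
    raise ValueError(f"unknown imagePullPolicy: {policy}")


# Situation during the outage: image cached on the node, registry unreachable.
print(can_start_container("Always", image_cached=True, registry_reachable=False))        # False
print(can_start_container("IfNotPresent", image_cached=True, registry_reachable=False))  # True
```

With Always, the availability of the pod is tied to the availability of the registry; with IfNotPresent, a cached image is enough for a restart.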

Future improvements

This incident was caused by a misconfiguration of the imagePullPolicy field in the container spec, which is normally set to IfNotPresent. The misconfiguration was probably left over from the development environment.

We have scheduled a review of all container configurations to look for similar issues. The registry service was not meant to be online with the same SLA as the critical services, so it will be migrated according to the existing roadmap.
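A configuration review of this kind could be partly automated. The following is a minimal sketch, not the tooling we actually use: it assumes manifests have already been parsed into Python dictionaries, and the function name `find_always_pull`, the example manifest, and the image names are all hypothetical. It does not handle CronJob manifests (which nest the pod template one level deeper), and it does not flag containers that omit imagePullPolicy, even though Kubernetes defaults an omitted policy to Always when the image tag is `latest` or missing.

```python
from typing import Iterator


def find_always_pull(manifest: dict) -> Iterator[str]:
    """Yield names of containers whose imagePullPolicy is "Always".

    Handles bare Pod manifests and workloads that embed a pod template
    (Deployment, StatefulSet, DaemonSet, Job).
    """
    spec = manifest.get("spec", {})
    # Workload kinds nest the pod spec under spec.template.spec;
    # for a bare Pod, spec itself is the pod spec.
    pod_spec = spec.get("template", {}).get("spec", spec)
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key, []):
            if container.get("imagePullPolicy") == "Always":
                yield container.get("name", "<unnamed>")


# Hypothetical manifest used only for illustration.
example = {
    "kind": "Deployment",
    "metadata": {"name": "controller"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "controller",
         "image": "registry.example.com/controller:1.0",
         "imagePullPolicy": "Always"},
        {"name": "sidecar",
         "image": "registry.example.com/sidecar:1.0",
         "imagePullPolicy": "IfNotPresent"},
    ]}}},
}

print(list(find_always_pull(example)))  # ['controller']
```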