Controller Statistics controller
2020-07-18 09:30:00 UTC Investigating: we are investigating the issue.
2020-07-18 09:45:00 UTC Confirmed: the issue was traced to a misconfiguration. ETA: a few hours.
2020-07-18 10:00:00 UTC Resolved: all systems are back to normal.
At 2020-07-17 23:02:00 UTC the SeismoCloud controller pod crashed. Kubernetes normally handles this by restarting the pod.
However, due to a misconfiguration, the container image pull policy was set to Always, meaning that the node contacts the registry every time it starts the container, even when the image is already present locally. Normally this is not an issue, but on 2020-07-16 the previous provider (where some non-critical services, including the image registry, are still hosted) suffered a power issue and has been unreachable since (the issue is ongoing at the time of this report). As a result, Kubernetes was not able to restart the container.
At 2020-07-18 09:45:00 UTC we identified the issue and changed the policy to IfNotPresent, which means that Kubernetes pulls the container image only if it is not already present on the node. The pod was rescheduled shortly afterwards and the system was up and running again within a few minutes.
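The difference between the pull policies can be sketched as a small decision function. This is a simplified model written for this report, not Kubernetes code: the function name and its two boolean inputs are assumptions made for illustration, and only the three policy values are taken from Kubernetes itself.

```python
from enum import Enum


class PullPolicy(Enum):
    ALWAYS = "Always"
    IF_NOT_PRESENT = "IfNotPresent"
    NEVER = "Never"


def can_start_container(policy: PullPolicy, image_cached: bool,
                        registry_reachable: bool) -> bool:
    """Simplified model: can the node start the container?"""
    if policy is PullPolicy.ALWAYS:
        # The node contacts the registry on every start,
        # even when a local copy of the image exists.
        return registry_reachable
    if policy is PullPolicy.IF_NOT_PRESENT:
        # The registry is needed only when there is no local copy.
        return image_cached or registry_reachable
    # Never: only a local copy can be used.
    return image_cached
```

In this model the incident corresponds to `can_start_container(PullPolicy.ALWAYS, image_cached=True, registry_reachable=False)`, which is false, while the same call with `PullPolicy.IF_NOT_PRESENT` is true.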
The incident was caused by a misconfiguration of the imagePullPolicy field in the container spec, which is normally set to IfNotPresent. This setting was probably left over from the development environment.
We have scheduled a review of all container configurations to look for similar issues. The registry service is not operated with the same SLA, so it will be migrated according to the existing roadmap.
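A configuration review of this kind could be partly automated. The following is a minimal sketch, assuming the manifests have already been parsed into Python dictionaries (for example from JSON exports); the function name `find_always_pull` and the sample manifest are hypothetical and do not describe the tooling actually used.

```python
from typing import Any, Iterator, Tuple

CONTAINER_KEYS = ("containers", "initContainers")


def find_always_pull(node: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, container name) for containers that set imagePullPolicy: Always."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in CONTAINER_KEYS and isinstance(value, list):
                # Inspect each container entry in this list.
                for container in value:
                    if (isinstance(container, dict)
                            and container.get("imagePullPolicy") == "Always"):
                        yield child, container.get("name", "<unnamed>")
            else:
                # Keep walking nested objects such as spec.template.spec.
                yield from find_always_pull(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_always_pull(item, f"{path}[{index}]")


# Hypothetical manifest, for illustration only.
manifest = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "controller", "imagePullPolicy": "Always"},
        {"name": "sidecar", "imagePullPolicy": "IfNotPresent"},
    ]}}},
}

for location, name in find_always_pull(manifest):
    print(f"{name}: imagePullPolicy is Always at {location}")
```

The sketch only detects an explicit Always value. Kubernetes also defaults to Always when the field is omitted and the image uses the latest tag or no tag at all, so a complete review would need to check image tags as well.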