Major outage on all systems

July 1, 2020 at 4:41 PM

API, MQTT broker, Controller, Statistics controller, My dashboard, Database, Database replica 2, Legacy HTTP backend

Resolved after 22h 18m of downtime. July 2, 2020 at 3:00 PM

Full report

Around 2020-07-01 14:00:00 UTC the power source failed again. The UPS kept all systems running for a while, but main power was not properly restored, and within a few hours the whole system was down again. Unfortunately, no one was available to assist us until the following day.

Around 2020-07-02 08:00:00 UTC we made contact with the facility's security staff, but they were unable to restore power despite multiple attempts. At 2020-07-02 13:30:00 UTC one of our staff members was able to enter the building after obtaining the necessary permissions (required because of COVID-19). At 2020-07-02 15:00:00 UTC all services were back to normal.

Future improvements

Clearly we can’t rely on this facility anymore. All of our services except the database were prepared to be moved quickly to another server or deployed in clusters. In fact, the only thing that prevented us from spinning up new services in another facility was the lack of data.

We have decided to set up a geographically distributed read-only cluster, with a documented and scripted procedure for manual takeover (e.g. for activating hot-standby databases), and then to plan a long-term solution.
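As an illustration of what a scripted manual takeover could look like, here is a minimal Python sketch. It is not our actual procedure: the names `Step`, `primary_reachable` and `run_takeover` are hypothetical, and the database-specific promotion command is deliberately left to the caller because it depends on the engine in use. The idea is only that each step is described, confirmed by an operator, and logged, and that the run stops at the first declined or failed step.

```python
import socket
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step of a hypothetical takeover runbook."""
    description: str
    action: Callable[[], bool]  # returns True on success


def primary_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the primary can be opened.

    This only checks network reachability, not database health.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_takeover(steps: List[Step], confirm: Callable[[str], bool]) -> List[str]:
    """Run the steps in order and return a log of what happened.

    `confirm` is asked before every step (for example, a prompt shown to
    the operator). The run stops at the first declined or failed step.
    """
    log: List[str] = []
    for step in steps:
        if not confirm(step.description):
            log.append(f"SKIPPED: {step.description}")
            break
        ok = step.action()
        log.append(f"{'OK' if ok else 'FAILED'}: {step.description}")
        if not ok:
            break
    return log
```

In practice the first step would typically verify that the primary is really unreachable (for instance with `primary_reachable`, negated), and a later step would wrap whatever command activates the hot-standby database. Keeping a human confirmation in front of every step is a deliberate choice here, since the takeover is meant to be manual.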