API, MQTT broker, Controller, Statistics controller, My dashboard, Database, Database replica 2, Legacy HTTP backend
2020-07-01 16:41:07 UTC Investigating: we’re investigating the issue that is making all services unreachable.
2020-07-01 17:43:20 UTC Confirmed: the issue is at the server colocation facility. The ETA for resolution is 2020-07-02 07:00:00 UTC.
2020-07-02 08:00:00 UTC Update: unfortunately the damage is deeper than we previously thought. We’re going on-site to manually export all data and move it to another provider. The new ETA is 2020-07-02 13:00:00 UTC.
2020-07-02 15:00:00 UTC Resolved: all systems are back to normal. Further maintenance is required and will be scheduled. Planned improvements will be published in a few hours.
At 2020-07-01 14:00:00 UTC the power source failed again, and the UPS kept all systems running for a while. However, main power was not restored properly, and within a few hours the whole system was down again. Unfortunately, no one was available to assist us until the following day.
At 2020-07-02 08:00:00 UTC we made contact with the security staff; however, they were not able to restore power despite multiple attempts.
At 2020-07-02 13:30:00 UTC one of our staff was able to enter the building after obtaining the necessary permissions (required because of COVID-19).
At 2020-07-02 15:00:00 UTC all services were back to normal again.
Clearly we can’t rely on this facility anymore. All of our services except the database were prepared to be moved quickly to another server or deployed in clusters. In fact, the only thing that prevented us from spinning up new services in another facility was the lack of data: the database contents were not available outside the failed facility.
We decided to set up a geographically separate read-only cluster, with a documented and scripted procedure for manual takeover (e.g. for activating hot-standby databases), and then to make a plan for a long-term solution.
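As an illustration only, a scripted manual takeover could be structured as an ordered list of steps, each of which an operator must confirm before it runs. The sketch below is not our actual procedure: the names `Step`, `run_takeover` and `example_steps`, as well as the step descriptions, are hypothetical, and the actions are placeholders that would need to be replaced with real commands for the database in use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step of a manual takeover procedure (hypothetical structure)."""
    description: str
    action: Callable[[], bool]  # returns True on success, False on failure


def run_takeover(
    steps: List[Step],
    confirm: Callable[[str], bool],
    log: Callable[[str], None] = print,
) -> List[str]:
    """Run the steps in order, asking for confirmation before each one.

    Stops at the first step the operator declines or that fails.
    Returns the descriptions of the steps that completed successfully.
    """
    completed: List[str] = []
    for step in steps:
        if not confirm(step.description):
            log(f"aborted by operator before: {step.description}")
            break
        if not step.action():
            log(f"step failed: {step.description}")
            break
        log(f"done: {step.description}")
        completed.append(step.description)
    return completed


def example_steps() -> List[Step]:
    """Placeholder steps; each action is an assumption, not a real command."""
    return [
        Step("verify that the primary database is unreachable", lambda: True),
        Step("activate the hot-standby database", lambda: True),
        Step("point services at the activated database", lambda: True),
    ]
```

An operator could run it interactively with `run_takeover(example_steps(), confirm=lambda d: input(f"{d}? [y/N] ").strip().lower() == "y")`. Keeping a confirmation before every step keeps the takeover manual, while the script ensures the documented order is followed.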