What happens when a host goes down without Control Tower access?


In this blog post we will look at what happens when a site loses its connection to the Control Tower and a host goes down. With a large number of sites, it will be the norm to have a few of them offline at any given time.

We consider autonomous, self-healing sites a fundamental requirement for the edge use case. A local control loop at the site should, to the highest degree possible, maintain the desired state without needing to reach out to, or receive requests from, the central component. We elaborate on this below.
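To make the idea of a site-local control loop concrete, here is a minimal Python sketch of one reconciliation pass. It assumes a simplified model in which the site knows which of its hosts are healthy and how many replicas each application should have. The `Site` class, the `reconcile` function, and the host and application names are hypothetical and do not reflect the Edge Enforcer's actual data model or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical, simplified model of a site; not the Edge Enforcer's data model.
@dataclass
class Site:
    healthy_hosts: set[str]
    desired_replicas: dict[str, int]  # application name -> wanted replica count
    placements: dict[str, list[str]] = field(default_factory=dict)  # application -> hosts

def reconcile(site: Site) -> None:
    """One pass of a site-local control loop; no central component is consulted."""
    for app, wanted in site.desired_replicas.items():
        # Keep only the replicas that are still on healthy hosts.
        running = [h for h in site.placements.get(app, []) if h in site.healthy_hosts]
        # Fill the gap with healthy hosts that do not already run this application.
        candidates = sorted(site.healthy_hosts - set(running))
        while len(running) < wanted and candidates:
            running.append(candidates.pop(0))
        # Trim any excess replicas and record the new placement.
        site.placements[app] = running[:wanted]

# Example: host h2 fails while the Control Tower is unreachable.
site = Site(healthy_hosts={"h1", "h2", "h3"},
            desired_replicas={"mqtt-bridge": 2},
            placements={"mqtt-bridge": ["h1", "h2"]})
site.healthy_hosts.discard("h2")
reconcile(site)
print(site.placements)  # {'mqtt-bridge': ['h1', 'h3']}
```

The design point the sketch illustrates is that every input to `reconcile` is available on the site itself. A real scheduler would also have to weigh resources, placement constraints and image availability.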

Connections are down… all the time

If we assume the risk of a site losing its internet connection at any given moment is 1 in 10,000 (i.e. 0.01%), and that sites fail independently of each other, the chance of all 1,000 sites being up at the same time is ~90%. If the number of sites increases to 10,000, the chance of all of them being up at the same time drops to 37%.
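The figures above follow from treating each site as independently offline with probability p = 0.0001, so the probability that all n sites are up is (1 − p)^n ≈ e^(−np). With 10,000 sites, np = 1, meaning that on average one site is offline at any moment. A short Python check, under that independence assumption:

```python
# Assumes each site is offline independently with probability p at any given moment.
def all_sites_up(n: int, p: float = 1e-4) -> float:
    """Probability that all n sites are online simultaneously: (1 - p) ** n."""
    return (1 - p) ** n

for n in (1_000, 10_000):
    print(f"{n:>6} sites: {all_sites_up(n):.1%} chance that all are up")
# ->   1000 sites: 90.5% chance that all are up
# ->  10000 sites: 36.8% chance that all are up
```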

So as the number of sites scales up, you will have sites offline… all the time. You had better start thinking about what that means and how your edge system can handle it.

The rest of this post describes how Avassa addresses this and what support the platform gives edge applications.

Self-healing sites and host failover

The system must be able to handle hardware failures, even in the face of connectivity loss! Containers need their images available on the host where they are scheduled to run. Therefore the Edge Enforcer comes with a site-local, distributed and multi-tenant image registry. Thanks to this distributed registry, containers will be restarted on new hosts even with no connectivity to the Control Tower and after local host failures.
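The following Python sketch illustrates why a site-local registry matters for failover: before a container is restarted on a surviving host, its image is made available there from a healthy peer on the same site rather than fetched over the WAN. `SiteRegistry`, its methods and the image name are hypothetical; the Edge Enforcer's actual registry and replication protocol are not described in this post.

```python
from __future__ import annotations

# Hypothetical sketch of a site-local image registry; not the Edge Enforcer's API.
class SiteRegistry:
    def __init__(self) -> None:
        self.replicas: dict[str, set[str]] = {}  # image reference -> hosts holding a copy

    def push(self, image: str, host: str) -> None:
        self.replicas.setdefault(image, set()).add(host)

    def ensure_local(self, image: str, target: str, healthy: set[str]) -> bool:
        """Make `image` available on `target` using only copies held on this site."""
        holders = self.replicas.get(image, set())
        if target in holders:
            return True
        peers = (holders & healthy) - {target}
        if not peers:
            return False  # no surviving copy on site; would need an upstream registry
        self.push(image, target)  # stands in for copying the image from one of `peers`
        return True

reg = SiteRegistry()
for host in ("h1", "h2"):
    reg.push("example/sensor-app:1.4", host)  # image replicated to two hosts
healthy = {"h2", "h3"}  # h1 has failed and the WAN link is down
print(reg.ensure_local("example/sensor-app:1.4", "h3", healthy))  # True
```

As long as at least one healthy host on the site holds the image, the replacement container can start without any connection to the Control Tower or an external registry.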

In many cases, edge applications need to authenticate themselves or present a certificate-protected endpoint to site-local resources, so secrets such as credentials and certificates must always be available. Relying on a Control Tower connection to, for example, renew a certificate would bring the application to a grinding halt whenever the connection is lost. A distributed secrets manager is key.
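A minimal sketch of why replicating secrets across a site's hosts keeps them readable during an outage. It assumes each host keeps a versioned copy of each secret and that a reader takes the newest copy among the hosts that are still up. `Secret`, `read_secret` and the secret name are hypothetical and are not the Edge Enforcer's secrets API.

```python
from __future__ import annotations

from dataclasses import dataclass

@dataclass(frozen=True)
class Secret:
    version: int
    value: str

def read_secret(replicas: dict[str, dict[str, Secret]], healthy: set[str], name: str) -> Secret | None:
    """Return the newest copy of secret `name` held by any healthy host on the site."""
    copies = [replicas[h][name] for h in healthy if name in replicas.get(h, {})]
    return max(copies, key=lambda s: s.version, default=None)

replicas = {
    "h1": {"tls/api": Secret(2, "cert-v2")},
    "h2": {"tls/api": Secret(1, "cert-v1")},  # replica that has not caught up yet
    "h3": {"tls/api": Secret(2, "cert-v2")},
}
# h1 is down and the Control Tower is unreachable; the secret is still readable on site.
print(read_secret(replicas, healthy={"h2", "h3"}, name="tls/api"))  # Secret(version=2, value='cert-v2')
```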

Many edge applications produce data destined for the cloud; MQTT to the cloud is a very common setup. In certain cases, simply dropping the data during a connection outage is perfectly fine… until it is not. In those cases you would have to put the burden of caching the data on local applications or IoT devices. The Edge Enforcer instead comes bundled with its own pub/sub system that can readily act as a cache during such outages.
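The store-and-forward behaviour described above can be sketched in Python as a bounded, in-memory buffer that holds messages while the uplink is down and forwards them in order when it returns. `StoreAndForward` and its methods are hypothetical and are not the Edge Enforcer's pub/sub API. The `send` callback stands in for whatever publishes to the cloud (for example an MQTT client), and a production cache would typically also persist messages to disk so they survive a restart.

```python
from __future__ import annotations

from collections import deque
from typing import Callable

class StoreAndForward:
    """Buffers messages while the uplink is down; forwards them in order once it returns."""

    def __init__(self, send: Callable[[bytes], None], max_buffered: int = 10_000) -> None:
        self.send = send
        self.buffer: deque[bytes] = deque(maxlen=max_buffered)  # oldest message dropped when full
        self.online = False

    def publish(self, msg: bytes) -> None:
        if self.online:
            self.send(msg)
        else:
            self.buffer.append(msg)

    def set_online(self, online: bool) -> None:
        self.online = online
        # On reconnect, drain the backlog before any new message is sent directly.
        while self.online and self.buffer:
            self.send(self.buffer.popleft())

sent: list[bytes] = []
link = StoreAndForward(send=sent.append)
link.publish(b"t=21.5")  # uplink down: buffered
link.publish(b"t=21.7")
link.set_online(True)    # uplink restored: backlog forwarded in order
link.publish(b"t=21.9")
print(sent)  # [b't=21.5', b't=21.7', b't=21.9']
```

The bounded buffer is a deliberate trade-off in this sketch: during a very long outage it keeps the newest data and discards the oldest rather than exhausting local memory.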

What about service discovery? Again, the Edge Enforcer brings its own DNS server for local service discovery. This DNS server requires no Control Tower or Internet connection at all, so even though services may move around because of host failures, service discovery is always available.
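Conceptually, site-local service discovery means that the name a client uses stays the same while the addresses behind it follow the service from host to host. The `LocalServiceDNS` class below is a hypothetical, in-memory stand-in for the DNS server's record table, not the Edge Enforcer's implementation; the service name and addresses are illustrative.

```python
from __future__ import annotations

# Hypothetical in-memory stand-in for a site-local DNS server's records.
class LocalServiceDNS:
    def __init__(self) -> None:
        self.records: dict[str, list[str]] = {}  # service name -> instance addresses

    def update(self, service: str, addresses: list[str]) -> None:
        # Called by the local scheduler whenever instances start, stop or move.
        self.records[service] = list(addresses)

    def resolve(self, service: str) -> list[str]:
        # Answered from site-local state only; no Control Tower or Internet lookup.
        return list(self.records.get(service, []))

dns = LocalServiceDNS()
dns.update("broker", ["10.0.0.11"])  # broker instance runs on h1
dns.update("broker", ["10.0.0.13"])  # h1 fails; broker restarted on h3
print(dns.resolve("broker"))  # ['10.0.0.13']
```

In this model, clients that connect by name and re-resolve on reconnect follow the service automatically, whereas a client that caches an address indefinitely would keep pointing at the failed host.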

Conclusion

At Avassa, we are confident that we bring a batteries-included solution for the edge use case, with hosts and connectivity coming and going. This ensures the highest possible application availability.