There are many reasons for running applications at the edge. The ones we hear most often include latency, bandwidth, and security requirements. But another very common reason, and perhaps the one with the greatest impact on operational robustness, is offline capabilities. Offline capabilities mean that edge applications can withstand inconsistent connectivity to a central or public cloud solution, including periods when it is down altogether. Without offline capabilities, a disruption in connectivity would inevitably disrupt the operation of the cloud application at the edge. That could be devastating for, e.g., a point-of-sale solution, personal safety functions, or any other business-critical application.
When deploying applications at the edge, one of the many advantages you should be able to enjoy is the assurance that your local operations will remain unaffected by network issues connecting to the cloud.
However, there is a catch-22 built into this. Central and public cloud providers are well equipped with tools for application availability and, to name one example, practically unlimited resources for failover in case of hardware issues. An easy mistake is to assume the same holds for edge implementations. Consequently, one might forget that the edge needs to be fully autonomous in providing application availability, and must be able to act on its own thanks to self-healing features.
When you read this, it might appear obvious. But beware: many edge solutions out there fall short. In the rest of this article, I’m going to walk through typical design mistakes around autonomous self-healing features, including common issues we have seen in some established solutions.
🐞 Design mistake 1: Only one single control loop
Some solutions assume that the central component is involved in making edge-local decisions. In these solutions, the edge site needs the central component to act before any healing action can be taken. Test your solution in the following way to make sure this won’t be an issue:
- Create an edge site with a number of nodes.
- Start your application on one of the nodes.
- Purposely disconnect the site from the central cloud component.
- Crash and/or fail the node where the application is running.
Does it automatically restart/start on another host on the edge site? It should.
🧩 Solution: There need to be two control loops.
An autonomous inner loop runs on each site and takes local scheduling actions whenever needed. The outer control loop, running in the central cloud, deals with configuration changes, application deployments, etc. But it should not have to be involved in keeping the application up and running at the edge site.
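The inner loop can be illustrated with a minimal sketch. Everything here is a toy in-memory simulation; the `EdgeSite` and `Node` classes are hypothetical stand-ins for a real scheduler, not any particular product’s API. The point is that `reconcile()` never touches the central cloud, so the failover test above passes even while disconnected.

```python
# Toy simulation of an edge-local (inner) control loop.
# Assumption: EdgeSite/Node are hypothetical stand-ins for a real scheduler.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    healthy: bool = True
    apps: set = field(default_factory=set)

class EdgeSite:
    def __init__(self, nodes, desired_apps):
        self.nodes = nodes
        self.desired_apps = desired_apps   # apps that must always run on the site
        self.central_reachable = True      # the inner loop never consults this

    def reconcile(self):
        """Inner control loop: runs locally, never calls the central cloud."""
        running = {a for n in self.nodes if n.healthy for a in n.apps}
        for app in self.desired_apps - running:
            # Reschedule the missing app onto any healthy node on the site.
            target = next((n for n in self.nodes if n.healthy), None)
            if target:
                target.apps.add(app)

site = EdgeSite(nodes=[Node("n1", apps={"pos"}), Node("n2")],
                desired_apps={"pos"})
site.central_reachable = False   # simulate the WAN outage from the test above
site.nodes[0].healthy = False    # crash the node running the app
site.reconcile()
assert "pos" in site.nodes[1].apps   # the app was rescheduled locally, offline
```

Note that `central_reachable` is deliberately never read inside `reconcile()`: that is the design property the test in mistake 1 verifies.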
🐞 Design mistake 2: Assuming central APIs for local functions
Imagine that your application needs to restart or move to another host; that might require passing secrets to the application and having the container image available. It’s a common mistake to assume that the scheduler on the edge can reach out to central APIs to fetch secrets or the image once again. Remember, the network might be down, or the edge site’s local network might not allow applications to reach external ports in this manner.
🧩 Solution: All artefacts needed for an application restart or migration need to be available and cached on the edge sites where they are needed.
A related critical feature is that all these assets need to be replicated between the hosts on the edge site. If one host goes down and another host picks up the application, but the image is only available on the first host, you are stuck.
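The caching and replication requirement can be sketched in a few lines. This is a deliberately naive simulation (full replication of every artefact to every host); host names, image tags, and secret names are invented for illustration:

```python
# Toy model: every host keeps a replicated cache of images and secrets,
# so a restart on any surviving host can proceed fully offline.
# Assumption: host/image/secret names below are hypothetical examples.
hosts = {
    "n1": {"images": {"shop:1.4"}, "secrets": {"db-password"}},
    "n2": {"images": set(), "secrets": set()},
}

def replicate(hosts):
    """Naive full replication: union all artefacts onto every host."""
    all_images = set().union(*(h["images"] for h in hosts.values()))
    all_secrets = set().union(*(h["secrets"] for h in hosts.values()))
    for h in hosts.values():
        h["images"] |= all_images
        h["secrets"] |= all_secrets

def can_restart(host, image, secret):
    """A restart is only possible if both artefacts are cached locally."""
    return image in host["images"] and secret in host["secrets"]

replicate(hosts)
hosts.pop("n1")   # the host that originally held the artefacts goes down
assert can_restart(hosts["n2"], "shop:1.4", "db-password")
```

Without the `replicate()` step, the final check would fail: n2 would have to fetch the image and secret from a central API that may be unreachable.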
🐞 Design mistake 3: Only a central console
Many cloud-out solutions provide user interfaces, command-line tools, and APIs only in the cloud environment. However, the edge site must be treated as a first-class citizen that is centrally managed through two control loops. If you lose the connection to the central cloud, you cannot risk having your hands tied.
🧩 Solution: As an edge-local site administrator (the on-premises IT team), you should be able to use dedicated edge tools to work with your site and its deployed applications.
🐞 Design mistake 4: Not allowing edge local modifications
The issue above pointed out the risks of not having a local console for your edge environments; that console also needs to provide monitoring capabilities. Let’s say your edge site suffers a long network outage towards your central orchestrator. Normally, you would deploy your applications from the central component out to the edges. But while disconnected, you might need to upgrade your application locally to a new image version with a new configuration. If you had to wait for the network to come back, you would be blocked from this operation, which might be a security patch or a critical feature needed to run business on your site.
🧩 Solution: It must be possible to perform edge-local modifications, and the solution must include well-defined procedures for how conflicts are resolved when the connection is back.
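One possible conflict-resolution policy can be sketched as follows. This is only an illustration of the idea, not any product’s actual behavior: specs changed offline win on reconnect but are flagged for operator review, and the `modified_offline`/`needs_review` fields are invented names.

```python
# Sketch of one conflict-resolution policy for reconnection.
# Assumption: the spec fields below are hypothetical, for illustration only.
def resolve(central, local):
    """Return the spec to keep once connectivity is restored."""
    if local["modified_offline"]:
        # Edge-local changes (e.g. an urgent security patch) take
        # precedence; flag the divergence for the central operator.
        return {**local, "needs_review": True}
    # No offline edits: the central desired state wins as usual.
    return central

central = {"image": "shop:1.4", "modified_offline": False}
local   = {"image": "shop:1.5", "modified_offline": True}  # patched on site

merged = resolve(central, local)
assert merged["image"] == "shop:1.5" and merged["needs_review"]
```

Other policies are possible (central always wins, or manual merge), but whichever you choose, the key is that the behavior is well defined before the outage happens.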
Summary: In this article, I’ve summarized the critical aspects for any edge deployment that is serious about offline capabilities. The edges themselves must be able to run as small autonomous micro-clouds, while still being centrally managed. Cloud-out solutions often just stretch a node to the edge, so they are managed through one, and only one, control loop without any local tools. And going the other way, treating each site as a separate cluster without strong central orchestration, would lead to operational pain.
✅ Two control loops, offline edge-local healing, and local APIs
You can watch a video that illustrates how the Avassa system handles some of these offline principles.