Organizations that run edge applications within their premises on local compute instead of central clouds make this decision for several reasons: latency requirements, regulations, and business continuity. The business must run even if the connection to the cloud is down. Another aspect of business continuity is the performance and availability of the edge applications themselves. Therefore we must focus on how to monitor the applications at the edge proactively. Edge monitoring is different from monitoring applications in the central cloud in several ways:
- Individual applications run on hundreds or thousands of sites, in contrast to a few central clouds.
- The infrastructure is more heterogeneous and distributed, which makes it harder to understand the dependencies between infrastructure and applications.
- Edge sites have contextual meaning for the application user experience; this is not the case for applications in the central cloud.
- The number of sites combined with application telemetry at each edge site makes it challenging to feed all data into a central monitoring solution.
The rainy day statement for the current state of affairs for application monitoring could be:
Users detect most issues, not the monitoring tools or operations staff. And the operations staff mostly dig into low-level data, such as application logs, to find the underlying cause.
Another interesting observation is that application issues are, in many cases, performance-related: applications respond slowly rather than going completely dark. If we add this current state of affairs to the challenges of monitoring applications at the edge, we might get into trouble. How can we address application continuity issues for the edge? We cannot accept a situation where users at edge sites have to convince us that there are application issues, followed by tedious digging through application logs at each site.
To elevate the discussion, let’s look at a generalized flow for resolving application issues:
In the first step, TTI (time to identify), we identify and acknowledge that we have an issue. As stated above, this is mostly done by end users, and it usually takes days, if not weeks, until operations acknowledges the issue. Once the issue has been identified and acknowledged, we analyze it to discover the underlying cause (TTA, time to analyze). When that is completed, we can fix the issue (TTF, time to fix) and wrap things up by validating the fix (TTV, time to validate).
All these steps need to be adopted, and adapted, for an edge use case.
- TTI: Active application health monitoring must run at each edge site to proactively detect application performance issues.
- TTA: Proper tools will help operations staff analyze issues across sites. Manually logging into hundreds of sites to inspect docker logs will not scale.
- TTF: Automated tools are needed that can update applications and/or application configuration across a large set of sites.
- TTV: Proper validation of resolutions is sometimes overlooked and left to the users. This cannot be the case for the edge. We must have automatic validation across all edge sites to ensure the business is operational.
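To make the TTI and TTV steps above a little more concrete, here is a minimal Python sketch of a local synthetic probe. It is an illustration under stated assumptions, not a reference implementation: the names `classify`, `probe`, and `sites_failing_validation`, the three state labels, and the 0.5 second slowness threshold are all hypothetical. The point it tries to show is that a probe should distinguish a slow application from a dead one, since many issues are performance-related rather than outages, and that the same probe results can be reused to validate a fix across sites.

```python
import time
from typing import Callable, Dict, List


def classify(succeeded: bool, latency_s: float, slow_threshold_s: float) -> str:
    """Map one synthetic request result to a health state (labels are assumed)."""
    if not succeeded:
        return "down"
    if latency_s > slow_threshold_s:
        return "degraded"  # the application answers, but too slowly
    return "ok"


def probe(
    request_fn: Callable[[], bool],
    slow_threshold_s: float = 0.5,  # hypothetical threshold, tune per application
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Time one synthetic application request issued locally at the edge site.

    request_fn stands in for an application-level request (for example a test
    check-out transaction) and returns True when the response was correct.
    """
    start = clock()
    try:
        succeeded = request_fn()
    except Exception:
        succeeded = False  # any error counts as a failed request
    return classify(succeeded, clock() - start, slow_threshold_s)


def sites_failing_validation(results: Dict[str, str]) -> List[str]:
    """After a fix is rolled out, list the sites whose probe is still not 'ok'."""
    return sorted(site for site, state in results.items() if state != "ok")
```

A site agent could run `probe` on a schedule and report only state changes, and an operations tool could call `sites_failing_validation` on the collected states after a rollout instead of waiting for users to confirm the fix.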
💡 An overall observation is that we need a shift-left movement for the edge use case. We need proper proactive detection of application issues. It is just too costly and complex to leave that to end users.
How does this correlate with modern observability principles and tools? The observability trend focuses on the central collection and analysis of telemetry data such as metrics, logs, and traces. While this is part of a complete monitoring solution, it leans towards the right in the illustration above. Additionally, the edge scale combined with a possibly unstable network makes it challenging to assume that all telemetry data should go into a central solution. And more local intelligence and context need to be managed at the edge as outlined below. This is also well formulated in an article by Matt Rickard.
What is missing?
- Focus on applications and TTI: We need to detect application performance issues at each site. Application-oriented techniques using synthetic application requests and running locally at each site can help. Unfortunately, too many monitoring solutions are focused on host health.
- Distributed state aggregation: It is hard for operations teams to look at logs and deduce if and where there are problems. The edge solution should calculate the operational state for edge sites, applications, pods/services, containers, and hosts. This will help by providing traffic lights before digging into the logs.
- Edge site awareness: Context is essential for any monitoring solution. Users will report issues corresponding to applications and sites. For example, “I have an issue with the check-out application at store A.” Container logs do not necessarily know which application and site they belong to.
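The last two points can be sketched together in Python. This is a hedged illustration, assuming a simple worst-state-wins rule and hypothetical names (`ContainerState`, `worst`, `rollup`, and the green/yellow/red labels); a real solution may aggregate differently. Each record carries its site and application as explicit context, and the roll-up produces the traffic lights per application and per site.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Assumed ordering of the traffic-light states, from best to worst.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


@dataclass(frozen=True)
class ContainerState:
    site: str         # edge site context, e.g. "store-a"
    application: str  # application context, e.g. "check-out"
    container: str
    state: str        # one of the SEVERITY keys


def worst(states: Iterable[str]) -> str:
    """Return the most severe state; an empty input is treated as green (assumption)."""
    return max(states, key=SEVERITY.__getitem__, default="green")


def rollup(
    records: Iterable[ContainerState],
) -> Tuple[Dict[Tuple[str, str], str], Dict[str, str]]:
    """Aggregate container states into per-application and per-site states."""
    per_app: Dict[Tuple[str, str], str] = {}
    for r in records:
        key = (r.site, r.application)
        per_app[key] = worst([per_app.get(key, "green"), r.state])

    per_site: Dict[str, str] = {}
    for (site, _application), state in per_app.items():
        per_site[site] = worst([per_site.get(site, "green"), state])
    return per_app, per_site
```

With this shape, a report such as “I have an issue with the check-out application at store A” maps directly to one key in the per-application result, before anyone opens a log file.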
Read more in our white paper on Observability in the distributed edge: The full story.