Operating edge applications in thousands of locations: 10 things you wish you knew
When you first start deploying edge workloads, everything looks simple enough: a few sites, some containers, and a stable network. But eventually it is time to scale. The number of sites grows into the hundreds or thousands, and the target environments at the edge start to show surprising behaviors.

What works well in the lab starts to show its limits in the field. The challenges are less about the software itself and more about operations, distribution, and reliability.
Here are the top 10 lessons we have learned from our experience at Avassa.
1: The control plane isn’t just a bigger Kubernetes
A centralized control plane works well when managing a few hundred nodes with stable connections. At the edge, the situation is different. Thousands of sites behave independently, and network stability cannot be assumed.
Design for local autonomy. Each site must be able to operate safely on its own, continue to function offline, and synchronize only when the network allows it. This changes how you think about orchestration, availability, and data consistency.
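As a rough illustration, the agent on each site can run a local reconcile loop that always applies the locally cached desired state and treats synchronization with the central control plane as a best-effort step. This is only a minimal sketch in Python; the state file path, the control-plane endpoint, and the reconcile_locally logic are hypothetical placeholders.

```python
import json, time, urllib.request

STATE_FILE = "/var/lib/edge/site-state.json"              # local source of truth (placeholder path)
CONTROL_PLANE = "https://control.example.com/api/state"   # hypothetical central endpoint

def load_local_state() -> dict:
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def reconcile_locally(state: dict) -> None:
    # Apply the locally cached desired state; this must never block on the network.
    pass  # site-specific logic goes here

def try_sync(state: dict) -> dict:
    # Best-effort: report local status upstream and fetch new desired state.
    req = urllib.request.Request(
        CONTROL_PLANE,
        data=json.dumps(state).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

while True:
    state = load_local_state()
    reconcile_locally(state)              # always runs, even fully offline
    try:
        new_state = try_sync(state)       # only succeeds when the network allows it
        with open(STATE_FILE, "w") as f:
            json.dump(new_state, f)
    except OSError:
        pass                              # offline or filtered: keep running on cached state
    time.sleep(60)
```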
2: A scalable fleet manager and zero-touch maintenance
Once you accept that every site is essentially its own small cluster, a new challenge appears: managing all of them over time. Manual intervention doesn’t scale.
Fleet management needs to be fully automated. Software updates, configuration changes, and certificate renewals must occur without requiring anyone to log in or run commands. The fleet management capabilities must cover both the infrastructure layer and the application layer.
Edge environments require systems that can maintain and upgrade themselves safely.
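One way to picture zero-touch maintenance is a small self-maintenance loop on each site that compares the installed version with the version the fleet manager wants, upgrades when they differ, and rolls back automatically if a local health check fails. The sketch below assumes a hypothetical edge-agent CLI and version files; it is not any particular product's interface.

```python
import subprocess, time

def installed_version() -> str:
    # Hypothetical: the version currently running on this site.
    return open("/var/lib/edge/agent-version").read().strip()

def desired_version() -> str:
    # Hypothetical: the version the fleet manager has scheduled for this site.
    return open("/var/lib/edge/desired-version").read().strip()

def healthy() -> bool:
    # Hypothetical local health probe run after an upgrade.
    return subprocess.run(["edge-agent", "health"]).returncode == 0

def upgrade(version: str) -> None:
    # Hypothetical upgrade command; in practice this must cover both the
    # infrastructure layer and the application layer.
    subprocess.run(["edge-agent", "upgrade", "--to", version], check=True)

while True:
    current, target = installed_version(), desired_version()
    if current != target:
        upgrade(target)
        if not healthy():
            upgrade(current)   # automatic rollback: nobody logs in to fix it
    time.sleep(300)
```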
3: Connectivity through segmented networks
Edge systems often operate behind strict network segmentation and firewalls, following ISA-95 or Purdue model boundaries. Direct inbound connections are usually blocked, and outbound access may be limited or filtered.
Communication must therefore be designed as outbound-initiated and proxy-aware, often using relays across DMZs. Control and monitoring traffic must be able to tolerate delays, retries, and asymmetric connectivity.
Cloud tools that assume public endpoints and full reachability do not work under these conditions. Reliable edge orchestration depends on protocols and proxies that respect existing security zones.
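A minimal Python sketch of outbound-initiated, proxy-aware communication: the site always dials out, honors the standard HTTP_PROXY/HTTPS_PROXY/NO_PROXY environment variables (which urllib picks up automatically), and retries with capped exponential backoff instead of assuming the path is always available. The upstream URL is a placeholder.

```python
import time, urllib.request

UPSTREAM = "https://control.example.com/api/heartbeat"   # placeholder endpoint

def send_heartbeat(payload: bytes, max_attempts: int = 6) -> bool:
    # urllib reads HTTP_PROXY / HTTPS_PROXY / NO_PROXY from the environment,
    # so the same code can route through a DMZ relay without changes.
    delay = 2.0
    for _ in range(max_attempts):
        try:
            req = urllib.request.Request(UPSTREAM, data=payload, method="POST")
            with urllib.request.urlopen(req, timeout=15):
                return True
        except OSError:
            time.sleep(delay)             # tolerate filtered or flaky outbound paths
            delay = min(delay * 2, 300)   # capped exponential backoff
    return False                          # give up for now; the next cycle retries
```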
4: Certificates and trust require real engineering
At scale, managing certificates is not a small detail. Each site, host, and service requires its own set of credentials that expire, need to be renewed, and occasionally fail.
Without automation and clear ownership of the trust model, you will eventually face expired or mismatched certificates in production. It is worth investing early in a structured and automated approach to PKI.
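As an illustration of the kind of automation that helps, the sketch below scans a directory of site certificates, reads each expiry date with the standard openssl x509 -enddate command, and triggers renewal well before the deadline. The site-pki renew command is a stand-in for whatever issuing mechanism your PKI actually uses.

```python
import datetime, glob, subprocess

RENEW_BEFORE = datetime.timedelta(days=30)    # renew well ahead of expiry

def cert_expiry(path: str) -> datetime.datetime:
    # `openssl x509 -enddate -noout` prints e.g. "notAfter=Mar 14 12:00:00 2026 GMT"
    out = subprocess.run(["openssl", "x509", "-enddate", "-noout", "-in", path],
                         capture_output=True, text=True, check=True).stdout
    return datetime.datetime.strptime(out.strip().split("=", 1)[1],
                                      "%b %d %H:%M:%S %Y %Z")

def renew(path: str) -> None:
    # Hypothetical hook into whatever mechanism issues this site's certificates.
    subprocess.run(["site-pki", "renew", "--cert", path], check=True)

for cert in glob.glob("/etc/edge/pki/*.pem"):
    if cert_expiry(cert) - datetime.datetime.utcnow() < RENEW_BEFORE:
        renew(cert)
```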
5: Container images must be available locally
When running tests, it is easy to pull container images directly from a public or central registry. At a large scale, this does not work reliably.
A local or site-local registry ensures that applications can start after reboots, upgrades, or network outages. It removes a common source of operational failures and provides a predictable deployment path.
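One common pattern is to mirror images into the site-local registry while the uplink is available and deploy only from the local copy afterwards. The sketch below drives the real skopeo copy command from Python; the registry address and image name are hypothetical.

```python
import subprocess

LOCAL_REGISTRY = "registry.site.local:5000"   # hypothetical site-local registry

def mirror(image: str) -> str:
    """Copy an image into the site-local registry and return the local reference."""
    local_ref = f"{LOCAL_REGISTRY}/{image.split('/', 1)[-1]}"
    # skopeo replicates the image, so later restarts never depend on the WAN.
    # --dest-tls-verify=false is only appropriate for a plain-HTTP local registry.
    subprocess.run(["skopeo", "copy", "--dest-tls-verify=false",
                    f"docker://{image}", f"docker://{local_ref}"], check=True)
    return local_ref

# Mirror once while the uplink is available, then deploy from the local copy.
local_image = mirror("ghcr.io/example/sensor-collector:1.4.2")
```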
6: Observability must be distributed
Sending every log and metric from every container to a central dashboard does not make the system observable; it only makes it noisy. At scale, raw data without context becomes unusable.
Observability should start at the edge. Logs and metrics need to be aggregated, filtered, and contextualized locally, so that what is forwarded upstream already represents useful information. Each site should be able to summarize its own state and report only what matters, not every event line.
This makes troubleshooting faster and the overall monitoring system more meaningful. Central systems should consume insights, not raw output.
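A small sketch of edge-side aggregation: parse structured log lines locally, count events per severity, and forward upstream only a per-site summary plus the few high-severity events that deserve full context. The site identifier and forwarding policy are placeholders.

```python
import json, sys
from collections import Counter

FORWARD_LEVELS = {"ERROR", "CRITICAL"}   # only these keep their full event lines

def summarize(lines) -> dict:
    """Aggregate locally; forward raw events only when they carry real signal."""
    counts, forwarded = Counter(), []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            counts["unparsed"] += 1
            continue
        level = str(event.get("level", "INFO")).upper()
        counts[level] += 1
        if level in FORWARD_LEVELS:
            forwarded.append(event)       # full context only where it matters
    return {"site": "site-042",           # hypothetical site identifier
            "counts": dict(counts),
            "events": forwarded}

if __name__ == "__main__":
    print(json.dumps(summarize(sys.stdin)))
```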
7: People are part of the system
Even with a high degree of automation, people still need to understand and troubleshoot the platform and, even more importantly, the edge applications running on it.
Clear interfaces, remote interactive troubleshooting tools, understandable aggregated states, and safe tools for local operators make a large difference when diagnosing issues in the field. Operational clarity improves reliability more than most technical optimizations.
8: Configuration drift is unavoidable
With thousands of sites, variations in hardware, capabilities, and network characteristics are expected.
Rather than enforcing identical configurations everywhere, use structured parameters and templates to handle these differences in a controlled way. This makes deployments more robust and reduces the risk of subtle mismatches.
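For example, a single shared template can be rendered with a small, controlled set of per-site parameters instead of maintaining one hand-edited configuration per site. The sketch below uses Python's string.Template; the parameter names and site identifiers are invented for illustration.

```python
from string import Template

# One shared template, a small set of controlled per-site parameters.
SPEC_TEMPLATE = Template("""\
application: sensor-collector
replicas: $replicas
storage_path: $storage_path
gpu_enabled: $gpu_enabled
""")

SITE_PARAMETERS = {   # hypothetical per-site overrides
    "store-014": {"replicas": 1, "storage_path": "/mnt/ssd", "gpu_enabled": False},
    "plant-003": {"replicas": 2, "storage_path": "/mnt/nvme", "gpu_enabled": True},
}

def render(site: str) -> str:
    return SPEC_TEMPLATE.substitute(SITE_PARAMETERS[site])

print(render("plant-003"))
```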
9: Scaling tests must be continuous
Testing scalability once is not enough. Performance changes with new versions, additional features, and larger datasets.
Run scale tests regularly that simulate site creation, application deployments, and update cycles. This provides an early warning when behavior starts to drift, giving confidence before each release.
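A continuous scale test can be as simple as a script in the release pipeline that simulates registering many sites and deploying applications, then tracks the latency distribution from release to release. In the sketch below, the sleep stands in for the real API calls, and the site count and percentiles are arbitrary choices.

```python
import concurrent.futures, statistics, time

def simulate_site(i: int) -> float:
    """Stand-in for registering a site and deploying an application via the API."""
    start = time.monotonic()
    # e.g. api.create_site(f"sim-site-{i}"); api.deploy(f"sim-site-{i}", "demo-app")
    time.sleep(0.01)                      # placeholder for the real API calls
    return time.monotonic() - start

def run_scale_test(n_sites: int = 1000, workers: int = 50) -> None:
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(simulate_site, range(n_sites)))
    # Compare the distribution release over release to catch drift early.
    print(f"p50={statistics.median(latencies):.3f}s "
          f"p95={statistics.quantiles(latencies, n=20)[18]:.3f}s")

run_scale_test()
```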
10: Keep the system as simple as possible
When a system grows, it is tempting to add more layers and abstractions. In most cases, that adds complexity faster than it adds capability.
Focus on reducing dependencies and keeping components loosely coupled. A simple, well-understood system is easier to scale, maintain, and debug.
Closing thoughts
Many organizations underestimate the effort required to build and maintain an edge platform. The early stages often go well, internal prototypes run smoothly, and a few connected sites validate the concept. The real challenges appear later, when the rollout extends to hundreds or thousands of sites and the system must operate continuously.
Cloud-based tooling cannot be simply reused for this purpose. The assumptions that make cloud platforms efficient, such as stable networks, shared storage, and centralized control, do not apply to the edge.
Building reliable, performant systems that can manage large fleets of edge sites is a demanding engineering task. It requires careful design, operational experience, and automation from the start.
The teams that succeed are usually those that recognize early that the edge is its own environment: distributed, at times unreliable, and different by nature. They design their systems accordingly.
Scalability report
We regularly perform large-scale tests to validate Control Tower performance in realistic environments. Read our Scalability Test Report on managing applications in 10 000+ locations.

