Packaging and deploying an ML serving system to the edge
There once was a lifecycle of a machine learning servable on the edge…
The rapid uptake of applied machine learning across many tasks and industries is largely driven by how accessible and affordable the underlying technologies have become. There is a vibrant and growing set of both commercial and open source tooling available to support the whole lifecycle of designing, building and deploying machine learning for tasks such as object and activity detection and classification.
The full MLOps lifecycle consists of several steps, including training, tuning, retraining, testing and deployment. This writeup will cover how to package and deploy a basic machine learning model for serving at the edge.
Packaging a basic servable with TensorFlow Serving
Let’s look at a basic setup, modeled after the TensorFlow Serving with Docker tutorial but adjusted for running at the edge with Avassa. Our application consists of the tensorflow/serving container image, plus an init container that pulls a copy of the TensorFlow Serving test data set and makes it available to the server container through a shared volume.
Packaging the server
Let’s build the application specification from scratch. We’ll start with a very simple structure for a single-container, single-service application called ml-serving that pulls in the tensorflow/serving image from Docker Hub. We set a replica count of 1 and allow the service to reach outside networks.
name: ml-serving
version: "1.0"
services:
  - name: ml-serving-service
    mode: replicated
    replicas: 1
    network:
      outbound-access:
        allow-all: true
    containers:
      - name: tensorflow-serving
        image: registry-1.docker.io/tensorflow/serving
We can now deploy this application, which will bring up the TensorFlow Serving server, but it won’t be very useful yet since we are not feeding it any servable model. To fix that, we need to fetch a model, make it available to the server on a volume, and configure the server to use that specific model.
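If you want to register the application right away, a minimal sketch could look like the command below. It assumes supctl is configured against your Control Tower and that the specification above is saved locally as ml-serving.yaml; the exact resource path may differ between platform versions, so check the supctl documentation.
# Hypothetical example: register the application specification
# (saved as ml-serving.yaml) using supctl
$ supctl create applications < ml-serving.yaml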
Adding a setup script and model volume
We start by adding a volume model-data to our service to hold our model data, and mount it into the serving container under the /models mount path.
We then create a config-map item that contains a setup.sh script to clone the example models from the TensorFlow Serving GitHub repository and copy the saved_model_half_plus_two_cpu servable model to the /models/half_plus_two directory on the mounted volume. Please note that we use the CPU version of the model because we can’t (yet) assume the existence of a GPU on the target hosts. More about this later.
We also add an environment variable MODEL_NAME with the value half_plus_two to configure TensorFlow Serving to look for a servable model under the /models/half_plus_two path.
name: ml-serving
version: "1.0"
services:
  - name: ml-serving-service
    mode: replicated
    replicas: 1
    network:
      outbound-access:
        allow-all: true
    volumes:
      - name: model-data
        ephemeral-volume:
          size: 100MB
          file-mode: "755"
          file-ownership: 0:0
      - name: setup
        config-map:
          items:
            - name: setup.sh
              data: |
                #!/bin/sh
                apk update
                apk add git
                git clone https://github.com/tensorflow/serving
                cp -R ./serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_cpu /models/half_plus_two
    containers:
      - name: tensorflow-serving
        image: registry-1.docker.io/tensorflow/serving
        mounts:
          - volume-name: model-data
            mount-path: /models
        env:
          MODEL_NAME: half_plus_two
Executing the setup through an init-container
The final step is to add an init-container called model-setup that will be executed before the tensorflow-serving container is brought up. The init container mounts two volumes: one is the config-map volume containing the setup.sh script, and the other is the model-data volume that will hold the serving model. With the volumes in place, it runs the setup.sh script to clone the model repository, copy the model to the appropriate directory on the model-data volume and then exit.
name: ml-serving
version: "1.0"
services:
  - name: ml-serving-service
    mode: replicated
    replicas: 1
    network:
      outbound-access:
        allow-all: true
    volumes:
      - name: model-data
        ephemeral-volume:
          size: 100MB
          file-mode: "755"
          file-ownership: 0:0
      - name: setup
        config-map:
          items:
            - name: setup.sh
              data: |
                #!/bin/sh
                apk update
                apk add git
                git clone https://github.com/tensorflow/serving
                cp -R ./serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_cpu /models/half_plus_two
    init-containers:
      - name: model-setup
        image: registry-1.docker.io/alpine
        cmd: ["sh", "/setup/setup.sh"]
        mounts:
          - volume-name: model-data
            mount-path: /models
          - volume-name: setup
            files:
              - name: setup.sh
                mount-path: /setup/setup.sh
    containers:
      - name: tensorflow-serving
        image: registry-1.docker.io/tensorflow/serving
        mounts:
          - volume-name: model-data
            mount-path: /models
        env:
          MODEL_NAME: half_plus_two
We now have a fully functioning application specification that allows us to deploy a servable model at scale across edge locations, where local clients can consume it through the APIs provided by the TensorFlow Serving server.
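As a quick smoke test once an instance is running, a client could call TensorFlow Serving’s REST API, which listens on port 8501 by default. The sketch below assumes the REST port is reachable from the client (which may require additional network configuration in the application specification) and uses a placeholder address; the half_plus_two model simply returns x/2 + 2 for each input.
# Query the half_plus_two model over the TensorFlow Serving REST API
# (replace <service-address> with the address of the running service)
$ curl -d '{"instances": [1.0, 2.0, 5.0]}' \
    -X POST http://<service-address>:8501/v1/models/half_plus_two:predict
{ "predictions": [2.5, 3.0, 4.5] }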
Placement matching on GPUs
Serving systems like TensorFlow Serving can make use of GPUs to accelerate inference and thereby provide lower-latency responses than when running on CPUs only. Certain servable models may rely so heavily on the presence of GPUs that they are not useful on systems without them.
Application specifications can be annotated to formally require the presence of a GPU, so that any attempt to deploy to hosts that do not have one will fail.
We start by defining a new label that matches only the Tesla family of GPUs using pattern matching. The example below will automatically add a label named any-tesla to sites where such a GPU is available.
$ supctl create system settings <<EOF
gpu-labels:
  - label: any-tesla
    max-number-gpus: 1
    nvidia-patterns:
      - name == "*Tesla*"
EOF
And here’s the resulting label at a site called stockholm-sergel with GPUs of the intended type.
$ supctl show -s stockholm-sergel system cluster hosts --fields hostname,gpu-labels
- hostname: stockholm-sergel
  gpu-labels:
    - name: any-tesla
      max-number-gpus: 1
      matching-gpus:
        - uuid: GPU-b75c47d9-5fb4-63e0-a07b-ff2633af741c
        - uuid: GPU-ee1b2a5c-3cd0-0c4a-a240-d87c22748a35
The final step is to add a GPU label matching statement to our application specification. We use our any-tesla label as a requirement for a specific container, in our case the container running TensorFlow Serving. Because we now require a GPU, we can also switch to the GPU-optimized model saved_model_half_plus_two_gpu.
name: ml-serving
version: "1.0"
services:
  - name: ml-serving-service
    mode: replicated
    replicas: 1
    network:
      outbound-access:
        allow-all: true
    volumes:
      - name: model-data
        ephemeral-volume:
          size: 100MB
          file-mode: "755"
          file-ownership: 0:0
      - name: setup
        config-map:
          items:
            - name: setup.sh
              data: |
                #!/bin/sh
                apk update
                apk add git
                git clone https://github.com/tensorflow/serving
                cp -R ./serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu /models/half_plus_two
    init-containers:
      - name: model-setup
        image: registry-1.docker.io/alpine
        cmd: ["sh", "/setup/setup.sh"]
        mounts:
          - volume-name: model-data
            mount-path: /models
          - volume-name: setup
            files:
              - name: setup.sh
                mount-path: /setup/setup.sh
    containers:
      - name: tensorflow-serving
        image: registry-1.docker.io/tensorflow/serving
        mounts:
          - volume-name: model-data
            mount-path: /models
        env:
          MODEL_NAME: half_plus_two
        gpu:
          labels:
            - any-tesla
Next steps
The next step in the lifecycle is to deploy this packaged application to edge locations. Since the application has a couple of requirements (e.g. disk size, a reasonable amount of CPU and, optionally, a GPU), we should only deploy it to hosts in locations that have sufficient resources available. This is all done using Deployment Specifications. Read more about how to define and apply a deployment specification to an application in our documentation.
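To give a rough idea of the shape of such a specification, here is an illustrative sketch only: verify the exact schema against the deployment specification documentation, and treat the name and the label expression (system/type = edge) as placeholders to be replaced with your own values and site labels.
# Illustrative sketch only; check the deployment specification
# documentation for the exact schema and your own site labels.
name: ml-serving-deployment
application: ml-serving
application-version: "*"
placement:
  match-site-labels: >
    system/type = edge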
What about offline capabilities?
While the above setup is very straightforward and requires no custom container builds, it has an obvious shortcoming. By pulling a fresh copy of the entire TensorFlow Serving repository every time we start an instance of the service, we both assume access to GitHub from all edge sites (which is not a given in real deployments) and repeat ourselves instead of using versioned containers for all content.
This could easily be addressed in at least two ways:
- By packaging the model into the same container image as the server and deploying it as a versioned unit (see the sketch after this list). A residual shortcoming is that the server and the model would then share the same lifecycle, even though the model most likely needs updating more often than the server. In practice, we don’t think this is an issue: if only the model is updated between versions, only the model’s container layer needs to be distributed to the sites, and distributing only new container layers is built into the Avassa platform.
- We could also keep the server as one container and manage the model files in a separate container, sharing the content with the server through a volume. This is slightly more complex, but would allow us to lifecycle only a versioned servable-model container and leave the server container running untouched.
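For the first option, a minimal sketch of such a build could look like the container build file below, following the common pattern from the TensorFlow Serving with Docker tutorial. The local model path and the choice of the CPU model are illustrative assumptions; in practice you would copy in whichever exported SavedModel you want to serve.
# Illustrative Dockerfile: bake the servable model into the serving image
# so that the server and the model ship as one versioned unit.
FROM tensorflow/serving

# Copy a locally exported SavedModel (path is an assumption) into the
# default model base path and point TensorFlow Serving at it.
COPY ./saved_model_half_plus_two_cpu /models/half_plus_two
ENV MODEL_NAME=half_plus_two
After building and pushing this image to a registry, the application specification could reference it directly, dropping the init container and the GitHub dependency entirely.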
Conclusion
There are many ways to go about packaging an ML serving framework along with its servable models for deployment at the edge. This article demonstrated a very simple way of packaging the TensorFlow Serving server container for deployment, using a few features of the Avassa platform to clone a demo servable model and make it available to the server through an init-container, volumes and a config-map script. We also added an (optional) application-level requirement to only deploy on hosts with a specific type of GPU.
And feel free to sign up for a free trial to get access to a running system, including a small set of edge locations, to try this out for yourself.