Packaging and deploying an ML serving system to the edge
There once was a lifecycle of a machine learning servable on the edge…
The rapid uptake of applied machine learning across many tasks and industries is largely driven by how accessible and affordable the underlying technologies have become. There is a vibrant and growing set of both commercial and open source tooling available to support the whole lifecycle of designing, building and deploying machine learning for tasks such as object and activity detection and classification.
The full MLOps lifecycle consists of several steps, including training, tuning, retraining, testing and deployment. This writeup will cover how to package and deploy a basic machine learning model for serving at the edge.
Packaging a basic servable with TensorFlow Serving
Let’s look at a basic setup, modeled after the TensorFlow Serving with Docker tutorial but adjusted for running at the edge with Avassa. Our application consists of the tensorflow/serving container image, plus an init container that pulls a copy of the TensorFlow Serving test data set and makes it available to the server container through a shared volume.
Packaging the server
Let’s build the application specification from scratch. We’ll start with a very simple structure for a single-container, single-service application called ml-serving that pulls in the tensorflow/serving image from Docker Hub. We set a replica count of 1 and allow the service to reach outside networks.
name: ml-serving
version: "1.0"
services:
  - name: ml-serving-service
    mode: replicated
    replicas: 1
    network:
      outbound-access:
        allow-all: true
    containers:
      - name: tensorflow-serving
        image: registry-1.docker.io/tensorflow/serving
We can now deploy this application, which will bring up the TensorFlow Serving server, but it won’t be very useful yet since we are not feeding it any servable model. To fix that, we need to fetch a model, make it available to the server on a volume, and configure the server to use that specific model.
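If you want to register the application right away, a minimal sketch could look like the command below. It assumes supctl is configured against your Control Tower and that the specification above is saved locally as ml-serving.yaml; the exact resource path may differ between platform versions, so check the supctl documentation.
# Hypothetical example: register the application specification
# (saved as ml-serving.yaml) using supctl
$ supctl create applications < ml-serving.yaml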
Adding a setup script and model volume
We start by adding a volume model-data to our service to hold our model data, and mount it into the serving container under the /models mount path.
We then create a config-map item that contains a setup.sh script to clone the example models from the TensorFlow Serving GitHub repository and copy the saved_model_half_plus_two_cpu servable model to the /models/half_plus_two directory on the mounted volume. Please note that we use the CPU version of the model because we can’t (yet) assume the existence of a GPU on the target hosts. More about this later.
We also add an environment variable MODEL_NAME with the value half_plus_two to configure TensorFlow Serving to look for a servable model under the /models/half_plus_two path.
name: ml-serving
version: "1.0"
services:
  - name: ml-serving-service
    mode: replicated
    replicas: 1
    network:
      outbound-access:
        allow-all: true
    volumes:
      - name: model-data
        ephemeral-volume:
          size: 100MB
          file-mode: "755"
          file-ownership: 0:0
      - name: setup
        config-map:
          items:
            - name: setup.sh
              data: |
                #!/bin/sh
                apk update
                apk add git
                git clone https://github.com/tensorflow/serving
                cp -R ./serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_cpu /models/half_plus_two
    containers:
      - name: tensorflow-serving
        image: registry-1.docker.io/tensorflow/serving
        mounts:
          - volume-name: model-data
            mount-path: /models
        env:
          MODEL_NAME: half_plus_two
Executing the setup through an init-container
The final step is to add an init-container called model-setup that will be executed before the tensorflow-serving container is brought up. The init container mounts two volumes: one is the config-map volume containing the setup.sh script, and the other is the model-data volume that will hold the serving model. With the volumes in place, it runs the setup.sh script to clone the model repository, copy the model to the appropriate directory on the model-data volume and then exit.
name: ml-serving
version: "1.0"
services:
  - name: ml-serving-service
    mode: replicated
    replicas: 1
    network:
      outbound-access:
        allow-all: true
    volumes:
      - name: model-data
        ephemeral-volume:
          size: 100MB
          file-mode: "755"
          file-ownership: 0:0
      - name: setup
        config-map:
          items:
            - name: setup.sh
              data: |
                #!/bin/sh
                apk update
                apk add git
                git clone https://github.com/tensorflow/serving
                cp -R ./serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_cpu /models/half_plus_two
    init-containers:
      - name: model-setup
        image: registry-1.docker.io/alpine
        cmd: ["sh", "/setup/setup.sh"]
        mounts:
          - volume-name: model-data
            mount-path: /models
          - volume-name: setup
            files:
              - name: setup.sh
                mount-path: /setup/setup.sh
    containers:
      - name: tensorflow-serving
        image: registry-1.docker.io/tensorflow/serving
        mounts:
          - volume-name: model-data
            mount-path: /models
        env:
          MODEL_NAME: half_plus_two
We now have a fully functioning application specification that allows us to deploy a servable model at scale across edge locations, where local clients can consume it through the APIs provided by the TensorFlow Serving server.
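As a quick smoke test once an instance is running, a client could call TensorFlow Serving’s REST API, which listens on port 8501 by default. The sketch below assumes the REST port is reachable from the client (which may require additional network configuration in the application specification) and uses a placeholder address; the half_plus_two model simply returns x/2 + 2 for each input.
# Query the half_plus_two model over the TensorFlow Serving REST API
# (replace <service-address> with the address of the running service)
$ curl -d '{"instances": [1.0, 2.0, 5.0]}' \
    -X POST http://<service-address>:8501/v1/models/half_plus_two:predict
{ "predictions": [2.5, 3.0, 4.5] }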
Placement matching on GPUs
Serving systems like TensorFlow Serving can make use of GPUs to accelerate inference and thereby provide lower-latency responses than when running on CPUs only. Certain servable models may rely so heavily on the presence of GPUs that they are not useful on systems without them.
Application specifications can be annotated to formally require the presence of a GPU, so that any attempt to deploy to hosts that do not have one will fail.
We start by defining a new label that matches only the Tesla family of GPUs using pattern matching. The example below will automatically add a label named any-tesla to sites where such a GPU is available.
$ supctl create system settings <<EOF
gpu-labels:
  - label: any-tesla
    max-number-gpus: 1
    nvidia-patterns:
      - name == "*Tesla*"
EOF
And here’s the resulting label at a site called stockholm-sergel with GPUs of the intended type.
$ supctl show -s stockholm-sergel system cluster hosts --fields hostname,gpu-labels
- hostname: stockholm-sergel
  gpu-labels:
    - name: any-tesla
      max-number-gpus: 1
      matching-gpus:
        - uuid: GPU-b75c47d9-5fb4-63e0-a07b-ff2633af741c
        - uuid: GPU-ee1b2a5c-3cd0-0c4a-a240-d87c22748a35
The final step is to add a GPU label matching statement to our application specification. We use our any-tesla label as a requirement for a specific container, in our case the container running TensorFlow Serving. Because we now require a GPU, we can also switch to the GPU-optimized model saved_model_half_plus_two_gpu.
name: ml-serving
version: "1.0"
services:
  - name: ml-serving-service
    mode: replicated
    replicas: 1
    network:
      outbound-access:
        allow-all: true
    volumes:
      - name: model-data
        ephemeral-volume:
          size: 100MB
          file-mode: "755"
          file-ownership: 0:0
      - name: setup
        config-map:
          items:
            - name: setup.sh
              data: |
                #!/bin/sh
                apk update
                apk add git
                git clone https://github.com/tensorflow/serving
                cp -R ./serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu /models/half_plus_two
    init-containers:
      - name: model-setup
        image: registry-1.docker.io/alpine
        cmd: ["sh", "/setup/setup.sh"]
        mounts:
          - volume-name: model-data
            mount-path: /models
          - volume-name: setup
            files:
              - name: setup.sh
                mount-path: /setup/setup.sh
    containers:
      - name: tensorflow-serving
        image: registry-1.docker.io/tensorflow/serving
        mounts:
          - volume-name: model-data
            mount-path: /models
        env:
          MODEL_NAME: half_plus_two
        gpu:
          labels:
            - any-tesla
Next steps
The next step in the lifecycle is to deploy this packaged application to edge locations. Since the application has a couple of requirements (e.g. disk size, a reasonable amount of CPU and, optionally, a GPU), we should only deploy it to hosts in locations that have sufficient resources available. This is all done using Deployment Specifications. Read more about how to define and apply a deployment specification to an application in our documentation.
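To give a rough idea of the shape of such a specification, here is an illustrative sketch only: verify the exact schema against the deployment specification documentation, and treat the name and the label expression (system/type = edge) as placeholders to be replaced with your own values and site labels.
# Illustrative sketch only; check the deployment specification
# documentation for the exact schema and your own site labels.
name: ml-serving-deployment
application: ml-serving
application-version: "*"
placement:
  match-site-labels: >
    system/type = edge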
What about offline capabilities?
While the above setup is very straightforward and requires no custom container builds, it has an obvious shortcoming. By pulling a fresh copy of the entire TensorFlow Serving repository every time we start an instance of the service, we both assume access to GitHub from all edge sites (which is not a given in real deployments) and repeat ourselves instead of using versioned containers for all content.
This could easily be addressed in at least two ways:
- By packaging the model into the same container image as the server and deploying it as a versioned unit (see the sketch after this list). A residual shortcoming is that the server and the model would then share the same lifecycle, even though the model most likely needs updating more often than the server. In practice, we don’t think this is an issue: if only the model is updated between versions, only the model’s container layer needs to be distributed to the sites, and distributing only new container layers is built into the Avassa platform.
- We could also keep the server as one container and manage the model files in a separate container, sharing the content with the server through a volume. This is slightly more complex, but would allow us to lifecycle only a versioned servable-model container and leave the server container running untouched.
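For the first option, a minimal sketch of such a build could look like the container build file below, following the common pattern from the TensorFlow Serving with Docker tutorial. The local model path and the choice of the CPU model are illustrative assumptions; in practice you would copy in whichever exported SavedModel you want to serve.
# Illustrative Dockerfile: bake the servable model into the serving image
# so that the server and the model ship as one versioned unit.
FROM tensorflow/serving

# Copy a locally exported SavedModel (path is an assumption) into the
# default model base path and point TensorFlow Serving at it.
COPY ./saved_model_half_plus_two_cpu /models/half_plus_two
ENV MODEL_NAME=half_plus_two
After building and pushing this image to a registry, the application specification could reference it directly, dropping the init container and the GitHub dependency entirely.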
Conclusion
There are many ways to go about packaging an ML serving framework along with its servable models for deployment at the edge. This article demonstrated a very simple way of packaging the TensorFlow Serving server container for deployment, using a few features of the Avassa platform to clone a demo servable model and make it available to the server through an init-container, volumes and a config-map script. We also added an (optional) application-level requirement to only deploy on hosts with a specific type of GPU.
And feel free to sign up for a free trial to get access to a running system, including a small set of edge locations, to try this out for yourself.