TensorFlow Serving

Demos

Title              Description
-----------------  -------------------------------------------
Basics             A minimal end-to-end example of TF Serving
GPUs & Batching    Dynamic batching and using a GPU

Why use TFServing?

  • Offers a gRPC endpoint, which can provide better payload compression than REST. This helps with larger payloads such as images, where the tensors are big.
  • GPU batching support: requests can be batched together and sent to the GPU in a single call, which can improve latency and throughput. This isn’t free, though, and requires tuning to get right.
  • Model versioning: lets you deploy multiple versions of a model and route traffic between them, which is useful for A/B testing and canary deployments. In addition to version numbers, you can assign versions to labels like “canary” and “stable”, which are then accessible at an endpoint like /v1/models/<model name>/labels/<version label>. (A config sketch is shown after this list.)
  • You can change what TF Serving returns beyond just the output of the model’s last layer; for example, you can return the output of several intermediate layers for debugging. See these docs on constructing custom Signatures. This is fairly advanced, and I’m not sure how to do this yet, or even whether I would want to. (A rough, untested sketch follows this list.)
  • TFServing vs. FastAPI/Flask: According to the docs, “the average latency of performing inference with TensorFlow Serving is usually not lower than using TensorFlow directly, where TensorFlow Serving shines is keeping the tail latency down for many clients querying many different models, all while efficiently utilizing the underlying hardware to maximize throughput.” This blog post by Sayak Paul confirms these findings.
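To make the version-label routing concrete, here is a sketch of a model config file passed to the server with --model_config_file. The model name, base path, and version numbers are made up; the key pieces are model_version_policy (keep specific versions loaded) and version_labels.

```
# models.config -- illustrative values
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
    # Keep two specific versions loaded instead of only the latest.
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
    # Labels can then be targeted, e.g. /v1/models/my_model/labels/canary
    version_labels { key: "stable" value: 1 }
    version_labels { key: "canary" value: 2 }
  }
}
```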
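And for the intermediate-layer idea, based on the SavedModel guide on specifying signatures, an export might look roughly like the sketch below. I haven’t tested this; the model, layer, and signature names are all invented.

```python
import tensorflow as tf

# Toy model; layer and input names are illustrative.
inputs = tf.keras.Input(shape=(28, 28), dtype=tf.float32, name="image")
hidden = tf.keras.layers.Dense(128, activation="relu", name="hidden")(
    tf.keras.layers.Flatten(name="flatten")(inputs)
)
logits = tf.keras.layers.Dense(10, name="logits")(hidden)
model = tf.keras.Model(inputs, logits)

# A second view over the same layers that also exposes an intermediate output.
debug_model = tf.keras.Model(
    inputs=inputs,
    outputs={"logits": logits, "hidden_activations": hidden},
)

spec = tf.TensorSpec([None, 28, 28], tf.float32, name="image")

@tf.function(input_signature=[spec])
def serve_default(image):
    return {"logits": model(image)}

@tf.function(input_signature=[spec])
def serve_debug(image):
    return debug_model(image)

# Each entry in `signatures` becomes a signature TF Serving can route to,
# e.g. signature_name="debug" in a request.
tf.saved_model.save(
    model,
    "export/my_model/1",
    signatures={"serving_default": serve_default, "debug": serve_debug},
)
```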

Impressions

  • I did not have luck getting dynamic batching to work well. Apparently it can take a lot of patience and be hard to tune; here is a guide. More observability would be nice to help users tune dynamic batching. It appears that you can set up TensorBoard to get some insight, but when I tried it was confusing and unintuitive; you would likely have to spend a long time with this to optimize things. (A sample batching config is shown after this list.)
  • It is much better to use gRPC requests than REST requests when possible. In my experiments, gRPC was much faster. Furthermore, gRPC supports features that REST does not, such as the ability to profile inference requests, additional validation, and so on. The downside is that gRPC is harder to debug and a bit clunkier to work with. (A minimal client sketch is included after this list.)
  • TF Serving documentation can be confusing in places. This article on how to serve models uses a 200-line Python script to make a request, which is pretty distracting. Other documents are really good, like using TF Serving with Kubernetes.
  • You must specify the dtype in your model layers, because dtypes are used in API validation for gRPC requests. Layer names are also used when making requests, so name those thoughtfully as well. (You can inspect the exported names and dtypes with saved_model_cli, shown after this list.)
  • The Keras docs are a much more gentle introduction to TF Serving.
  • SavedModel is the serialization format required by TF Serving. It’s used for many things in the TF ecosystem, like TFLite, TF.js, and TF Hub, and I really like that it’s shared across so many products.
  • I could not get exactly the same logits from predictions made through the API as from running the model in a notebook; they are very close but not identical. REST and gRPC, however, do return exactly the same values as each other.
  • It can be hard to debug the server when an error is thrown, because the stack trace comes from another language.
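For reference, dynamic batching is configured with a small text-proto file passed to the server alongside --enable_batching. The values below are illustrative starting points, not tuned recommendations:

```
# batching_parameters.txt -- illustrative, not tuned
max_batch_size { value: 32 }          # largest batch sent to the model in one call
batch_timeout_micros { value: 5000 }  # how long to wait for a batch to fill
max_enqueued_batches { value: 100 }   # bounds queueing (and memory / latency)
num_batch_threads { value: 8 }        # parallelism for processing batches
```

The server would then be started with something like:

```
tensorflow_model_server \
  --port=8500 --rest_api_port=8501 \
  --model_name=my_model --model_base_path=/models/my_model \
  --enable_batching=true \
  --batching_parameters_file=/config/batching_parameters.txt
```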
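For the gRPC point, a minimal client sketch looks something like this. It assumes the tensorflow-serving-api package is installed, and the model name, input/output keys, and shapes are made up; they must match your exported signature.

```python
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# gRPC port is 8500 by default (REST is 8501).
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"                 # hypothetical model name
request.model_spec.signature_name = "serving_default"
# request.model_spec.version_label = "canary"        # optional: route by version label

# Input key and dtype must match the exported signature.
batch = np.random.rand(1, 28, 28).astype(np.float32)
request.inputs["image"].CopyFrom(tf.make_tensor_proto(batch))

response = stub.Predict(request, timeout=10.0)
logits = tf.make_ndarray(response.outputs["logits"])  # output key from the signature
print(logits.shape)
```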
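Related to the dtype and layer-name point above, saved_model_cli will show exactly which input/output names, dtypes, and shapes the server validates requests against (the path and signature name here are illustrative):

```
saved_model_cli show --dir export/my_model/1 --tag_set serve --signature_def serving_default
```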