FastAPI

Serving models with FastAPI

FastAPI is a web framework for Python. It is a popular choice for serving prototypes of ML models.

Impressions

  • Model serving frameworks (TF Serving, TorchServe, etc.) are probably the way to go for production / enterprise deployments, especially for larger models. They offer more features, and latency will be more predictable (even if slower). I think that for smaller models (< 200MB) FastAPI is fine.
  • It is super easy to get started with FastAPI.
  • I was able to confirm Sayak’s benchmark, where FastAPI is faster than TF Serving but less consistent overall. FastAPI is also reported to be more likely to fail under load, although I haven’t been able to cause that myself. In my experiments FastAPI was much faster for this small model, but this could change with larger models.
  • Memory consumption grows linearly as you increase the number of Uvicorn workers, because each worker loads its own copy of the model. Model serving frameworks like TF Serving seem to use memory more efficiently. You should be careful to set the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true if you are running inference on GPUs (see the GPUs section below). In many cases you will be doing inference on CPUs, so this might not be relevant most of the time.
  • FastAPI seems like it could be really nice on smaller models and scoped hardware where there is only one worker per node and you load balance across nodes (because you aren’t replicating the model with each worker).
  • Debugging FastAPI is amazing, as it’s pure Python, and you get a nice docs page at http://<IP>/docs that lets you test out your endpoints right on the page! The documentation for FastAPI is also amazing.
  • If you want the request parameters to be sent in the request body (as you often do with ML, because you want to send data to be scored), you have to use Pydantic (the Sentence model below is an example). This is very opinionated, but easy enough to use.

Load Model & Make Predictions

We are going to use the model trained in the TF Serving tutorial and load it from the SavedModel format.

# this cell is exported to a script

from fastapi import FastAPI, status
from pydantic import BaseModel
from typing import List
import tensorflow as tf
import numpy as np

def load_model(model_path='/home/hamel/hamel/notes/serving/tfserving/model/1'):
    "Load the SavedModel Object."
    sm = tf.saved_model.load(model_path)
    return sm.signatures["serving_default"] # this is the default signature when you save a model
# this cell is exported to a script

def pred(model: tf.saved_model, data:np.ndarray, pred_layer_nm='dense_3'):
    """
    Make a prediction from a SavedModel Object.  `pred_layer_nm` is the last layer that emits logits.
    
    https://www.tensorflow.org/guide/saved_model
    """
    data = tf.convert_to_tensor(data, dtype='int32')
    preds = model(data)
    return preds[pred_layer_nm].numpy().tolist()
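
If you aren’t sure which output key to pass as pred_layer_nm, you can inspect the serving signature directly. This mirrors the SavedModel guide linked above; it is a quick sketch using the load_model helper defined above, and is not part of the exported script:

sig = load_model()
print(sig.structured_input_signature)  # expected input shape and dtype
print(sig.structured_outputs)          # output keys, e.g. 'dense_3'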

Test Data

_, (x_val, _) = tf.keras.datasets.imdb.load_data(num_words=20000)
x_val = tf.keras.preprocessing.sequence.pad_sequences(x_val, maxlen=200)[:2, :]
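
The slice should leave two padded reviews of length 200, which is the batch we will score (a quick sanity check):

x_val.shape  # expected: (2, 200)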

Make a prediction

model = load_model()
pred(model, x_val[:2, :])
[[0.8761785626411438, 0.12382148206233978],
 [0.0009457750129513443, 0.9990542531013489]]

Build The FastAPI App

# this cell is exported to a script

app = FastAPI()

items = {}


@app.on_event("startup")
async def startup_event():
    "Load the model on startup https://fastapi.tiangolo.com/advanced/events/"
    items['model'] = load_model()


@app.get("/", status_code=status.HTTP_200_OK)
def health():
    "A health-check endpoint"
    return 'Ok'
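
Aside: newer FastAPI releases deprecate on_event in favor of a lifespan handler. This tutorial (and the exported main.py below) keeps on_event, but the equivalent would look roughly like this sketch (not exported to the script):

from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    items['model'] = load_model()  # load the model once at startup
    yield
    items.clear()                  # free resources on shutdown

app = FastAPI(lifespan=lifespan)   # replaces FastAPI() + the on_event handler above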

We want to send the data for prediction in the Request Body (not with path parameters). According to the docs:

FastAPI will recognize that the function parameters that match path parameters should be taken from the path, and that function parameters that are declared to be Pydantic models should be taken from the request body.

# this cell is exported to a script

class Sentence(BaseModel):
    tokens: List[List[int]]

@app.post("/predict", status_code=status.HTTP_200_OK)
def predict(data: Sentence):
    preds = pred(items['model'], data.tokens)
    return preds

Recap: the FastAPI App

Let’s look at main.py with all the pieces combined:

Code to display source code in Quarto
# This is a hack to display generated scripts in Quarto
from IPython.display import display, Markdown
code = !cat main.py

display(Markdown('```{.python filename="main.py"}\n' + '\n'.join(code) + '\n```'))
main.py
# AUTOGENERATED! DO NOT EDIT! File to edit: index.ipynb.

# %% auto 0
__all__ = ['app', 'items', 'load_model', 'pred', 'startup_event', 'health', 'Sentence', 'predict']

# %% index.ipynb 3
from fastapi import FastAPI, status
from pydantic import BaseModel
from typing import List
import tensorflow as tf
import numpy as np

def load_model(model_path='/home/hamel/hamel/notes/serving/tfserving/model/1'):
    "Load the SavedModel Object."
    sm = tf.saved_model.load(model_path)
    return sm.signatures["serving_default"] # this is the default signature when you save a model

# %% index.ipynb 4
def pred(model: tf.saved_model, data:np.ndarray, pred_layer_nm='dense_3'):
    """
    Make a prediction from a SavedModel Object.  `pred_layer_nm` is the last layer that emits logits.
    
    https://www.tensorflow.org/guide/saved_model
    """
    data = tf.convert_to_tensor(data, dtype='int32')
    preds = model(data)
    return preds[pred_layer_nm].numpy().tolist()

# %% index.ipynb 10
app = FastAPI()

items = {}


@app.on_event("startup")
async def startup_event():
    "Load the model on startup https://fastapi.tiangolo.com/advanced/events/"
    items['model'] = load_model()


@app.get("/", status_code=status.HTTP_200_OK)
def health():
    "A health-check endpoint"
    return 'Ok'

# %% index.ipynb 12
class Sentence(BaseModel):
    tokens: List[List[int]]

@app.post("/predict", status_code=status.HTTP_200_OK)
def predict(data: Sentence):
    preds = pred(items['model'], data.tokens)
    return preds

Run The App

We can run the app with the command:

uvicorn main:app --host 0.0.0.0 --port 5701

  • main corresponds to the file main.py
  • app corresponds to the app object inside main.py - app = FastAPI()
  • you can also pass --reload to make the server restart when the code changes (for development only)

We can test the /predict endpoint from Python with the requests library:

import requests, json

def predict_rest(json_data, url='http://localhost:5701/predict'):
    json_response = requests.post(url, json={'tokens': json_data})
    return json.loads(json_response.text)
predict_rest(x_val.tolist())
[[0.8761785626411438, 0.12382148206233978],
 [0.0009457750129513443, 0.9990542531013489]]

Load Test FastAPI

It’s really fast. Below we send 1,000 concurrent requests (using a thread pool of 500 workers via fastcore’s parallel):

from fastcore.parallel import parallel
from functools import partial
parallel_pred = partial(parallel, threadpool=True, n_workers=500)
sample_data = [x_val.tolist()] * 1000
%%time
results = parallel_pred(predict_rest, sample_data)
CPU times: user 2.29 s, sys: 252 ms, total: 2.54 s
Wall time: 2.38 s
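
That is 1,000 requests (each scoring a batch of two examples) in about 2.4 seconds of wall time, or on the order of 400 requests per second against a single Uvicorn worker.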

Adding Uvicorn Workers

Uvicorn also has an option to start and run several worker processes. Nevertheless, as of now, Uvicorn’s capabilities for handling worker processes are more limited than Gunicorn’s. So, if you want to have a process manager at this level (at the Python level), then it might be better to try with Gunicorn as the process manager.

You can add Uvicorn workers with the --workers flag:

uvicorn main:app --host 0.0.0.0 --port 5701 --workers 8
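
If you do want a process manager at the Python level, the equivalent with Gunicorn managing Uvicorn workers would look roughly like this (a sketch, assuming gunicorn is installed):

gunicorn main:app --workers 8 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:5701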

GPUs

When I scaled up to 8 workers on a GPU, I got OOM errors. To avoid this, you want to limit GPU memory growth by setting the TF_FORCE_GPU_ALLOW_GROWTH environment variable to true:

TF_FORCE_GPU_ALLOW_GROWTH=true uvicorn main:app --host 0.0.0.0 --port 5701 --workers 8

From the docs:

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process.

This means that if you are running on GPUs with > 1 worker, you will get OOM errors unless you set this environment variable!
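
If you prefer to configure this from Python instead of an environment variable, TensorFlow’s tf.config API exposes the same behavior (a sketch; it has to run before the GPUs are first used, e.g. near the top of main.py):

# equivalent to TF_FORCE_GPU_ALLOW_GROWTH=true, set programmatically
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)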

%%time
results = parallel_pred(predict_rest, sample_data)
CPU times: user 2.26 s, sys: 294 ms, total: 2.55 s
Wall time: 2.34 s

Scaling up workers didn’t have any effect in this particular instance, likely because the model’s latency is so low that a single worker already provides enough throughput for this load test.