vLLM & large models

Using tensor parallelism w/ vLLM & Modal to run Llama 70b

August 8, 2023


A previous version of this note suggested that you could run Llama 70b on a single A100. This was incorrect. The Modal container was caching the download of the much smaller 7b model. I have updated the post to reflect this. h/t to Cade Daniel for finding the mistake.


Large models like Llama-2-70b may not fit in a single GPU. I previously profiled the smaller 7b model against various inference tools. When a model is too big to fit on a single GPU, we can use various techniques to split the model across multiple GPUs.

Compute & Reproducibility

I used Modal Labs for serverless compute. Modal is very economical and built for machine learning use cases. Unlike other clouds, there are plenty of A100s available. They even give you $30 of free credits, which is more than enough to run the experiments in this note. Thanks to Modal, the scripts I reference in this note are reproducible.

In this note, I’m using modal client version: 0.50.2889

Distributed Inference w/ vLLM

vLLM supports tensor parallelism, which you can enable by passing the tensor_parallel_size argument to the LLM constructor.

I modified this example Modal code for Llama v2 13b to run Llama v2 70b on 4 GPUs with tensor parallelism. Below is a simplified diff with the most important changes:

def download_model_to_folder():
    from huggingface_hub import snapshot_download

-        "meta-llama/Llama-2-13b-chat-hf",
+        "meta-llama/Llama-2-70b-chat-hf",

image = (
    .pip_install("torch==2.0.1", index_url="https://download.pytorch.org/whl/cu118")
+    # Pin vLLM to 8/2/2023
+    .pip_install("vllm @ git+https://github.com/vllm-project/vllm.git@79af7e96a0e2fc9f340d1939192122c3ae38ff17")
-    # Pin vLLM to 07/19/2023
-    .pip_install("vllm @ git+https://github.com/vllm-project/vllm.git@bda41c70ddb124134935a90a0d51304d2ac035e8")
    # Use the barebones hf-transfer package for maximum download speeds. No progress bar, but expect 700MB/s.
-    .pip_install("hf-transfer~=0.1")
+     #Force a rebuild to invalidate the cache (you can remove `force_build=True` after the first time)
+    .pip_install("hf-transfer~=0.1", force_build=True)
        timeout=60 * 20)

-@stub.cls(gpu="A100", secret=Secret.from_name("huggingface"))
+# You need a minimum of 4 A100s that are the 40GB version
+@stub.cls(gpu=gpu.A100(count=4, memory=40), secret=Secret.from_name("huggingface"))
class Model:
    def __enter__(self):
        from vllm import LLM

        # Load the model. Tip: MPT models may require `trust_remote_code=true`.
-       self.llm = LLM(MODEL_DIR)
+       self.llm = LLM(MODEL_DIR, tensor_parallel_size=4)

See big-inference-vllm.py for the actual script I used.

Be Careful To Mind The Cache When Downloading Files

I found that when I ran the above code and changed the model name, I had to force a rebuild of the image to invalidate the cache. Otherwise, the old version of the model would be used. You can force a rebuild by adding force_build=True to the .pip_install call.

When I initially wrote this note, I was fooled into believing I could load meta-llama/Llama-2-70b-chat-hf on a single A100. It was this tricky issue of the container that cached the download of the much smaller 7b model. 🤦

After setting the appropriate secrets for HuggingFace and Weights & Biases, You can run this code on Modal with the following command:

modal run big-inference-vllm.py

You need at least 4 A100 GPUs to serve Llama v2 70b.

What Happens With Smaller Models?

Even though distributed inference is interesting for big models that do not fit on a single GPU, interesting things happen when you serve smaller models this way. Below, I test throughput for Llama v2 7b on 1, 2, and 4 GPUs. The throughput is measured by passsing these 59 prompts to llm.generate. llm.generate is described in the vLLM documentation:

Call llm.generate to generate the outputs. It adds the input prompts to vLLM engine’s waiting queue and executes the vLLM engine to generate the outputs with high throughput.

Here are the results, averaged over 5 runs for each row:

avg tok/sec
model GPU num_gpus
Llama-2-70b-chat-hf NVIDIA A100-SXM4-40GB 4 380.9
Llama-2-7b-chat-hf NVIDIA A10 1 458.8
2 497.3
4 543.6
NVIDIA A100-SXM4-40GB 1 842.9
2 699.1
4 650.7

You can see all the individual runs here. In my experiments, the 70b model needed a minimum of 4 A100s to run, so that’s why there is only one row for that model (Modal only has instances with 1, 2, or 4 GPUs).

Do Not Compare To Latency Benchmark

The tok/sec number you see here is VERY different than the latency benchmark shown on this note. This particular benchmark maximizes throughput by running multiple requests in parallel. The previous latency benchmark measures the time it takes to process a single request.


  • A100s are much faster than A10s, but A10s are significantly cheaper.1
  • On A10s, scaling up to more GPUs increases throughput at first, but then seems to diminish. It appears like there is a Goldilocks zone in terms of the right number of GPUs to maximize throughput. I did not explore this in detail, as Modal only has instances with specific numbers of GPUs.2
  • The much larger Llama v2 70b model is only ~2x slower than its 7b counterpart.

Aside: Pipeline Parallelism

In theory, Pipeline Parallelism (“PP”) is slower than Tensor Parallelism, but tools for PP are compatible with a wider range of models from the HuggingFace Hub. By default, HuggingFace accelerate will automatically split the model across multiple GPUs when you pass device_map="auto". (Accelerate offers other kinds of parallelism as well, like integrations with DeepSpeed).

This blog post and these docs are an excellent place to start. I will explore this and other kinds of parallelism in future notes.


  1. As of 8/6/2023 2 A10s costs .000612 / sec on Modal, whereas 1 A100 40GB will cost 0.001036 / sec. See this pricing chart↩︎

  2. For A10 and A100s you can only get up to 4 GPUs. Furthermore, I ran into an issue with vLLM and llama 70b, where it doesn’t like an odd number of GPUs.↩︎