vLLM & large models
A previous version of this note suggested that you could run Llama 70b on a single A100. This was incorrect. The Modal container was caching the download of the much smaller 7b model. I have updated the post to reflect this. h/t to Cade Daniel for finding the mistake.
Introduction
Large models like Llama-2-70b may not fit on a single GPU. I previously profiled the smaller 7b model against various inference tools. When a model is too big for one GPU, we can split it across multiple GPUs using techniques like tensor parallelism.
Compute & Reproducibility
I used Modal Labs for serverless compute. Modal is very economical and built for machine learning use cases. Unlike on other clouds, A100s are readily available. They even give you $30 of free credits, which is more than enough to run the experiments in this note. Thanks to Modal, the scripts I reference in this note are reproducible.
In this note, I’m using modal client version 0.50.2889.
Distributed Inference w/ vLLM
vLLM supports tensor parallelism, which you can enable by passing the tensor_parallel_size argument to the LLM constructor.
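Conceptually, the API is tiny. Here is a minimal sketch (mine, not the code from this note's script) of what tensor-parallel loading and generation look like in vLLM; the model name and prompt are only placeholders:

```python
# Minimal sketch of tensor parallelism in vLLM (assumes a machine with 4 GPUs
# and access to the gated Llama 2 weights).
from vllm import LLM, SamplingParams

# `tensor_parallel_size` shards the model weights across the available GPUs.
llm = LLM("meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=4)

outputs = llm.generate(
    ["What is tensor parallelism?"],
    SamplingParams(temperature=0.8, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```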
I modified this example Modal code for Llama v2 13b to run Llama v2 70b on 4 GPUs with tensor parallelism. Below is a simplified diff with the most important changes:
```diff
 def download_model_to_folder():
     from huggingface_hub import snapshot_download
     snapshot_download(
-        "meta-llama/Llama-2-13b-chat-hf",
+        "meta-llama/Llama-2-70b-chat-hf",
         local_dir="/model",
         token=os.environ["HUGGINGFACE_TOKEN"],
     )

 image = (
     Image.from_dockerhub("nvcr.io/nvidia/pytorch:22.12-py3")
     .pip_install("torch==2.0.1", index_url="https://download.pytorch.org/whl/cu118")
+    # Pin vLLM to 8/2/2023
+    .pip_install("vllm @ git+https://github.com/vllm-project/vllm.git@79af7e96a0e2fc9f340d1939192122c3ae38ff17")
-    # Pin vLLM to 07/19/2023
-    .pip_install("vllm @ git+https://github.com/vllm-project/vllm.git@bda41c70ddb124134935a90a0d51304d2ac035e8")
     # Use the barebones hf-transfer package for maximum download speeds. No progress bar, but expect 700MB/s.
-    .pip_install("hf-transfer~=0.1")
+    # Force a rebuild to invalidate the cache (you can remove `force_build=True` after the first time)
+    .pip_install("hf-transfer~=0.1", force_build=True)
     .run_function(
         download_model_to_folder,
         secret=Secret.from_name("huggingface"),
         timeout=60 * 20,
     )
 )

 ...

-@stub.cls(gpu="A100", secret=Secret.from_name("huggingface"))
+# You need a minimum of 4 A100s that are the 40GB version
+@stub.cls(gpu=gpu.A100(count=4, memory=40), secret=Secret.from_name("huggingface"))
 class Model:
     def __enter__(self):
         from vllm import LLM
         # Load the model. Tip: MPT models may require `trust_remote_code=True`.
-        self.llm = LLM(MODEL_DIR)
+        self.llm = LLM(MODEL_DIR, tensor_parallel_size=4)

 ...
```
See big-inference-vllm.py for the actual script I used.
I found that when I ran the above code and changed the model name, I had to force a rebuild of the image to invalidate the cache. Otherwise, the old version of the model would be used. You can force a rebuild by adding force_build=True to the .pip_install call.
When I initially wrote this note, I was fooled into believing I could load meta-llama/Llama-2-70b-chat-hf on a single A100. The culprit was exactly this caching issue: the container had cached the download of the much smaller 7b model. 🤦
After setting the appropriate secrets for HuggingFace and Weights & Biases, you can run this code on Modal with the following command:
modal run big-inference-vllm.py
You need at least 4 A100 GPUs to serve Llama v2 70b.
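As a rough sanity check on that requirement (my own back-of-the-envelope arithmetic, not from the original note): 70 billion parameters in fp16 take about 70e9 × 2 bytes ≈ 140 GB for the weights alone, which already exceeds three 40 GB A100s (120 GB) before accounting for the KV cache and activations, so four cards is the smallest A100-40GB configuration that fits.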
What Happens With Smaller Models?
Distributed inference is most obviously useful for big models that do not fit on a single GPU, but interesting things happen when you serve smaller models this way, too. Below, I test throughput for Llama v2 7b on 1, 2, and 4 GPUs. Throughput is measured by passing these 59 prompts to llm.generate, which is described in the vLLM documentation:
Call llm.generate to generate the outputs. It adds the input prompts to vLLM engine’s waiting queue and executes the vLLM engine to generate the outputs with high throughput.
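The measurement itself is simple. Here is a rough sketch (my own simplification, not the exact benchmark script) of how tokens per second can be derived from one batched llm.generate call:

```python
import time

from vllm import SamplingParams

def measure_throughput(llm, prompts, max_tokens=512):
    """Return generated tokens per second for a single batched llm.generate call."""
    sampling_params = SamplingParams(temperature=0.8, max_tokens=max_tokens)
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)  # all prompts are processed concurrently
    elapsed = time.perf_counter() - start
    generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated_tokens / elapsed
```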
Here are the results, averaged over 5 runs for each row:

| model | GPU | num_gpus | avg tok/sec |
|---|---|---|---|
| Llama-2-70b-chat-hf | NVIDIA A100-SXM4-40GB | 4 | 380.9 |
| Llama-2-7b-chat-hf | NVIDIA A10 | 1 | 458.8 |
| Llama-2-7b-chat-hf | NVIDIA A10 | 2 | 497.3 |
| Llama-2-7b-chat-hf | NVIDIA A10 | 4 | 543.6 |
| Llama-2-7b-chat-hf | NVIDIA A100-SXM4-40GB | 1 | 842.9 |
| Llama-2-7b-chat-hf | NVIDIA A100-SXM4-40GB | 2 | 699.1 |
| Llama-2-7b-chat-hf | NVIDIA A100-SXM4-40GB | 4 | 650.7 |
You can see all the individual runs here. In my experiments, the 70b model needed a minimum of 4 A100s to run, so that’s why there is only one row for that model (Modal only has instances with 1, 2, or 4 GPUs).
The tok/sec numbers you see here are VERY different from the latency benchmark shown in this note. This particular benchmark maximizes throughput by running multiple requests in parallel, whereas the previous latency benchmark measured the time it takes to process a single request.
Observations
- A100s are much faster than A10s, but A10s are significantly cheaper1 (see the rough cost-per-token comparison after this list).
- On A10s, scaling up to more GPUs increases throughput at first, but the gains then seem to diminish. It appears that there is a Goldilocks zone for the number of GPUs that maximizes throughput. I did not explore this in detail, as Modal only has instances with specific numbers of GPUs.2
- The much larger Llama v2 70b model is only ~2x slower than its 7b counterpart.
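As a back-of-the-envelope way to combine the first observation with footnote 1, here is a rough cost-per-token calculation (my own arithmetic, not part of the original benchmark) using the throughput table above and Modal's per-second prices:

```python
# Rough price-performance comparison for Llama-2-7b, using the avg tok/sec
# from the table above and Modal's prices cited in footnote 1 (8/6/2023).
configs = {
    "2x A10":       {"usd_per_sec": 0.000612, "tok_per_sec": 497.3},
    "1x A100 40GB": {"usd_per_sec": 0.001036, "tok_per_sec": 842.9},
}
for name, c in configs.items():
    usd_per_million_tokens = c["usd_per_sec"] / c["tok_per_sec"] * 1_000_000
    print(f"{name}: ~${usd_per_million_tokens:.2f} per million generated tokens")
```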
Aside: Pipeline Parallelism
In theory, Pipeline Parallelism (“PP”) is slower than Tensor Parallelism, but tools for PP are compatible with a wider range of models from the HuggingFace Hub. By default, HuggingFace accelerate will automatically split the model across multiple GPUs when you pass device_map="auto". (Accelerate offers other kinds of parallelism as well, like integrations with DeepSpeed.)
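For reference, here is a minimal sketch (mine, not from this note's experiments) of that Accelerate-based splitting with transformers; the model ID is just an example and smaller checkpoints work the same way:

```python
# Minimal sketch: device_map="auto" lets Accelerate spread the model's layers
# across all visible GPUs (requires `accelerate` to be installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"  # example; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # let Accelerate decide which layers go on which GPU
    torch_dtype="auto",  # load weights in the checkpoint's native precision
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```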
This blog post and these docs are an excellent place to start. I will explore this and other kinds of parallelism in future notes.
Footnotes
1. As of 8/6/2023, 2 A10s cost $0.000612 / sec on Modal, whereas 1 A100 40GB costs $0.001036 / sec. See this pricing chart.↩︎
2. For A10s and A100s you can only get up to 4 GPUs. Furthermore, I ran into an issue with vLLM and Llama 70b, where it doesn’t like an odd number of GPUs.↩︎