Deploy LLaMa 2 with AIME API Server for Operation of Conversational AI Solutions

In our previous article, we delved into the details of running LLaMa Chat as a console application. Now it's time to take a leap forward with LLaMa 2, the current evolution in the world of open large language models. In this article, we provide a guide on how to set up LLaMa 2 and demonstrate how to connect it to our AIME-API-Server, opening up new possibilities for deploying conversational AI experiences.

🦙

Read the next article in the AIME Blog LLaMa series:
• How to run Llama 3 with AIME API to Deploy Conversational AI Solutions

At AIME, we understand the importance of providing AI tools as services to harness the full potential of technologies like LLaMa 2. That's why we're excited to announce the integration of LLaMa 2 with our new AIME-API, which gives developers a streamlined way to offer its capabilities as a scalable HTTP/HTTPS service that is easy to integrate with client applications. Whether you're building chatbots, virtual assistants, or interactive customer support systems, this integration offers the flexibility to modify and adjust the model and the scalability to deploy such solutions.

Our LLaMa2 implementation is a fork of the original LLaMa 2 repository and supports all LLaMa 2 model sizes: 7B, 13B and 70B.

Our fork makes it possible to convert the weights so that the model can run on a different GPU configuration than the original LLaMa 2 (see table 2).

With the integration into our AIME-API and its batch aggregation feature, it is possible to leverage the full inference performance of different GPU configurations by aggregating requests so that the GPUs process them in parallel as batch jobs. This dramatically increases the achievable throughput of chat requests.
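The idea behind batch aggregation can be sketched in a few lines of Python. This is only an illustration of the general technique, not the AIME API Server's actual implementation: the function name collect_batch, the parameter max_wait_s and the example prompts are assumptions made up for this sketch.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s=0.05):
    """Gather up to max_batch_size queued requests, waiting at most max_wait_s."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waiting window is over, run with what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no further request arrived in time
    return batch


# Hypothetical chat requests arriving from several clients
pending = queue.Queue()
for prompt in ["Hi", "Tell me a joke", "What is a GPU?", "Once upon a time", "Bye"]:
    pending.put(prompt)

first = collect_batch(pending, max_batch_size=4)   # a full batch
second = collect_batch(pending, max_batch_size=4)  # the remaining request
print(len(first), len(second))
```

Each collected batch would then be handed to the model as one job, so several chat sessions share a single forward pass instead of being processed one after another.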

We also benchmarked the different GPU configurations with this technology to help decide which hardware gives the best throughput for processing LLaMa requests.

Hardware Requirements

Although the LLaMa2 models were trained on a cluster of A100 80GB GPUs, it is possible to run the models for inference on different and smaller multi-GPU hardware.

In table 1 we list a summary of the minimum GPU requirements and recommended AIME systems to run a specific LLaMa 2 model with realtime reading performance:

Model	Size	Minimum GPU Configuration	Recommended AIME Server	Recommended AIME Cloud Instance
7B	13GB	1x Nvidia RTX A5000 24GB or 1x Nvidia RTX 4090 24GB	AIME G400 Workstation	V10-1XA5000-M6
13B	25GB	2x Nvidia RTX A5000 24GB or 1x Nvidia RTX A6000 48GB	AIME A4000 Server	V10-2XA5000-M6, v14-1XA6000-M6
70B	129GB	2x Nvidia A100 80GB, 4x Nvidia RTX A6000 48GB or 8x Nvidia RTX A5000 24GB	AIME A8000 Server	V28-2XA180-M6, C24-4X6000ADA-Y1, C32-8XA5000-Y1

Table 1: Summary of the minimum GPU requirements and recommended AIME systems to run a specific LLaMa 2 model with at least realtime reading performance
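As a plausibility check for the Size column, the memory needed for the weights alone can be estimated from the parameter count. The sketch below assumes 16-bit weights (2 bytes per parameter) and the nominal parameter counts of 7, 13 and 70 billion; both are simplifying assumptions, and the function name weights_size_gib is made up for this example. Memory for activations and the key/value cache during inference comes on top of this.

```python
def weights_size_gib(num_params, bytes_per_param=2):
    """Rough size of the model weights in GiB, assuming 16-bit weights by default."""
    return num_params * bytes_per_param / 2**30


# Nominal parameter counts (assumption: the exact counts differ slightly)
for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{weights_size_gib(params):.0f} GiB")
```

The estimates land close to the sizes listed in table 1, which is why the 70B model cannot fit on a single GPU and has to be split across several.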

A detailed performance analysis of different hardware configurations can be found in the section "LLaMa 2 Inference GPU Benchmarks" of this article.

Getting Started: How to Deploy LLaMa2-Chat

In the following, we show how to set up the environment, get the source, download the LLaMa 2 models, and run LLaMa2 both as a console application and as a worker for serving HTTPS/HTTP requests.

Create an AIME ML Container

Here are instructions for setting up the environment using the AIME ML container management. Similar results can be achieved with Conda or similar tools.

To create a PyTorch 2.1.2 environment for installation, we use the AIME ML container, as described in aime-ml-containers, with the following command:

> mlc-create llama2 Pytorch 2.1.2-aime -w=/path/to/your/workspace -d=/destination/of/the/checkpoints

The -d parameter is only necessary if you don't want to store the checkpoints in your workspace but in a specific data directory. It mounts the folder /destination/of/the/checkpoints to /data in the container. This folder requires at least 250 GB of free storage to store all LLaMa models.
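Before starting the download, it can be worth checking that the target folder really has enough free space. The following is a minimal sketch using only the Python standard library; the helper name has_free_space is made up for this example, and the path "." is a placeholder for your checkpoint directory (for example /data inside the container).

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding path has at least required_gb GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


# "." is a placeholder, replace it with the checkpoint folder
print(has_free_space(".", 250))
```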

Once the container is created, open it with:

> mlc-open llama2

You are now working inside the ML container, where you can install all requirements and the necessary pip and apt packages without interfering with the host system.

Clone the LLaMa2-Chat Repository

LLaMa2-Chat is a version of the original LLaMa 2 reference implementation forked by AIME, with the following added features:

• Tool for converting the original model checkpoints to different GPU configurations
• Improved text sampling
• Token-wise text output
• Interactive console chat
• AIME API Server integration

Clone our LLaMa2-Chat repository with:

[llama2] user@client:/workspace$
> git clone https://github.com/aime-labs/llama2_chat

Now install the required pip packages:

[llama2] user@client:/workspace$
> pip install -r /workspace/llama2_chat/requirements.txt

Download LLaMa 2 Model Checkpoints

From Meta
In order to download the model weights and tokenizer, you have to apply for access from Meta and accept their license. Once your request is approved, you will receive a signed URL via e-mail. Make sure you have wget and md5sum installed in the container (md5sum is part of the coreutils package):

[llama2] user@client:/workspace$
> sudo apt-get install wget coreutils

Then run the download.sh script with:

[llama2] user@client:/workspace$
> /workspace/llama2_chat/download.sh

When prompted, pass the URL you received to start the download. Keep in mind that the links expire after 24 hours or after a certain number of downloads. If you start seeing errors such as `403: Forbidden`, you can always re-request a link.

From Hugging Face
To download the checkpoints and tokenizer via Hugging Face, you also have to apply for access from Meta and accept their license, and you have to register at Hugging Face with the same email address that Meta granted access to.

Install git lfs in the container to be able to clone repositories with large files by:

[llama2] user@client:/workspace$
> sudo apt-get install git-lfs
> git lfs install

Then download the checkpoints of the 70B model with:

[llama2] user@client:/workspace$
> cd /destination/to/store/the/checkpoints
[llama2] user@client:/destination/to/store/the/checkpoints$
> git lfs clone https://huggingface.co/meta-llama/Llama-2-70b-chat

To download the other model sizes, replace Llama-2-70b-chat with Llama-2-7b-chat or Llama-2-13b-chat. The download might take a while depending on your internet connection speed; for the 70B model, 129GB have to be downloaded. You will be asked twice for a username and password. As the password, use an access token generated in your Hugging Face account settings.

Convert the Checkpoints to your GPU Configuration

The downloaded checkpoints are designed to work only with certain GPU configurations: 7B with 1 GPU, 13B with 2 GPUs and 70B with 8 GPUs. To run a model on a different GPU configuration, we provide a tool that converts the weights accordingly. Table 2 shows all supported GPU configurations.

Model size	Num GPUs 24GB	Num GPUs 40GB	Num GPUs 48GB	Num GPUs 80GB
7B	1	1	1	1
13B	2	1	1	1
70B	8	4	4	2

Table 2: Number of GPUs required to run the desired model size, depending on the available GPU memory.
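To pick the value for --num_gpus in scripts, table 2 can be written down as a small lookup. The dictionary below copies the table's values one to one; the names REQUIRED_GPUS and required_gpus are only chosen for this illustration and are not part of the llama2_chat repository.

```python
# Table 2 as a lookup: model size -> GPU memory in GB -> number of GPUs
REQUIRED_GPUS = {
    "7B": {24: 1, 40: 1, 48: 1, 80: 1},
    "13B": {24: 2, 40: 1, 48: 1, 80: 1},
    "70B": {24: 8, 40: 4, 48: 4, 80: 2},
}


def required_gpus(model_size, gpu_memory_gb):
    """Look up the GPU count from table 2, raising ValueError for unlisted combinations."""
    try:
        return REQUIRED_GPUS[model_size][gpu_memory_gb]
    except KeyError:
        raise ValueError(
            f"No entry for model {model_size!r} with {gpu_memory_gb} GB GPUs"
        ) from None


print(required_gpus("70B", 48))
```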

Start the conversion with:

[llama2] user@client:/workspace$
> python3 /workspace/llama2_chat/convert_weights.py --input_dir /destination/to/store/the/checkpoints --output_dir /destination/to/store/the/checkpoints --model_size 70B --num_gpus <num_gpus>

The conversion takes a few minutes, depending on CPU and storage speed and on the model size.
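In the original LLaMa 2 reference implementation, a checkpoint folder contains one *.pth shard per GPU, and the number of shards has to match the number of processes started. Assuming the fork keeps this layout after conversion (an assumption, not something stated above), a quick sanity check could look like the following sketch. The helper name count_checkpoint_shards and the temporary demo folder are made up for illustration.

```python
import tempfile
from pathlib import Path


def count_checkpoint_shards(ckpt_dir):
    """Count the *.pth shard files in a checkpoint directory."""
    return len(list(Path(ckpt_dir).glob("*.pth")))


# Demo with a temporary folder that mimics a checkpoint converted for 4 GPUs
with tempfile.TemporaryDirectory() as demo_dir:
    for index in range(4):
        (Path(demo_dir) / f"consolidated.{index:02d}.pth").touch()
    shards = count_checkpoint_shards(demo_dir)

print(shards)
```

If the count differs from the value you intend to pass as <num_gpus>, the conversion was most likely run for a different GPU configuration.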

Run LLaMa 2 as Interactive Chat in the Terminal

To try the LLaMa2 models as a chat bot in the terminal, use the following PyTorch script and specify the desired model size to use:

[llama2] user@client:/workspace$
> torchrun --nproc_per_node <num_gpus> /workspace/llama2_chat/chat.py --ckpt_dir /data/llama2-model/llama-2-70b-chat

The chat mode is simply initiated by giving the following context as the starting prompt. It sets the environment so that the language model tries to complete the text as a chat dialog:

A dialog, where User interacts with an helpful, kind, obedient, honest and very reasonable assistant called Dave.
User: Hello, Dave
Dave: How can I assist you today?

Now the model acts as a simple chat bot. The starting prompt influences the mood of the chat answers. In this case, it credibly fills the role of a helpful assistant and does not readily drop it again. Depending on the input texts, the answers range from interesting and funny to useful.
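How such a starting prompt and the running dialog can be combined into one text for the model to complete is sketched below. This is an illustration of the principle only, not the code used in chat.py; the function name build_prompt and the example question are assumptions.

```python
def build_prompt(context, turns, assistant_name="Dave"):
    """Join the starting context and all dialog turns into one completion prompt."""
    lines = [context]
    for speaker, text in turns:
        lines.append(f"{speaker}: {text}")
    # End with the assistant's name so the model continues with its answer
    lines.append(f"{assistant_name}:")
    return "\n".join(lines)


context = (
    "A dialog, where User interacts with an helpful, kind, obedient, "
    "honest and very reasonable assistant called Dave."
)
turns = [
    ("User", "Hello, Dave"),
    ("Dave", "How can I assist you today?"),
    ("User", "What is a large language model?"),  # hypothetical user input
]
prompt = build_prompt(context, turns)
print(prompt)
```

Because the model only continues the text, changing the context sentence is enough to give the assistant a different name or character.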

Run LLaMa 2 Chat as Service with the AIME API Server

Setup AIME API Server
To read about how to set up and start the AIME API Server, please see the setup section in the documentation.

Start LLaMa2 as API Worker
To connect the LLaMa2-Chat model with the AIME-API-Server so that it is ready to receive chat requests through the JSON HTTPS/HTTP interface, simply add the flag --api_server followed by the address of the AIME API Server to connect to:

[llama2] user@client:/workspace$
> torchrun --nproc_per_node <num_gpus> /workspace/llama2_chat/chat.py --ckpt_dir /data/llama2-model/llama-2-70b-chat --api_server https://api.aime.info/

Now the LLaMa 2 model acts as a worker instance for the AIME API Server, ready to process jobs submitted by clients of the AIME API Server.

For documentation on how to send client requests to the AIME API Server, please see the AIME API Server documentation.

Try out LLaMa2 Chat with AIME API Demonstrator

The AIME-API-Server offers client interfaces for various programming languages. To demonstrate the power of LLaMa 2 with the AIME API Server, we provide a fully working LLaMa2 demonstrator using our JavaScript client interface.

AIME API LLaMa 2 Demonstrator

LLaMa 2 Inference GPU Benchmarks

To measure the performance of your LLaMa 2 worker connected to the AIME API Server, we developed a benchmark tool as part of our AIME API Server that simulates and stresses the server with the desired number of chat requests. First install the requirements with:

user@client:/workspace/aime-api-server/$
> pip install -r requirements_api_benchmark.txt

Then the benchmark tool can be started with:

user@client:/workspace/aime-api-server/$
> python3 run_api_benchmark.py --api_server https://api.aime.info/ --total_requests <number_of_requests>

run_api_benchmark.py sends up to <number_of_requests> chat requests to the API server, as many as possible in parallel. Each request starts with the initial context "Once upon a time". Depending on the model size used, LLaMa2 generates a story of about 400 to 1000 tokens in length. The processed tokens per second are measured and averaged over all processed requests.
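The two figures reported in the results, the speed of an individual chat session and the total throughput, can be derived from per-request measurements as sketched below. This is not the code of run_api_benchmark.py; the function name summarize is made up, and the numbers in results are placeholder values for illustration, not measurements.

```python
from statistics import mean


def summarize(results, wall_time_s):
    """Return (mean tokens/s per request, total tokens/s over the whole run).

    results: list of (generated_tokens, request_duration_in_seconds)
    wall_time_s: duration of the whole benchmark run in seconds
    """
    per_request = [tokens / seconds for tokens, seconds in results]
    total_tokens = sum(tokens for tokens, _ in results)
    return mean(per_request), total_tokens / wall_time_s


# Placeholder values: three requests processed in parallel within 40 seconds
results = [(400, 20.0), (800, 40.0), (600, 30.0)]
per_session, total = summarize(results, wall_time_s=40.0)
print(per_session, total)
```

With parallel processing, the total throughput is higher than the speed of a single session, because several sessions produce tokens at the same time.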

Results

We show the results for the different LLaMa 2 model sizes and GPU configurations. The model is loaded in such a way that it can use the GPUs in batch mode. In the bar charts, the maximum possible batch size, which equals the number of chat sessions that can be processed in parallel, is stated below the GPU model and is directly related to the available GPU memory.

The results are shown as the total possible throughput in tokens per second; the smaller bar shows the tokens per second for an individual chat session.

The human (silent) reading speed is about 5 to 8 words per second. With a ratio of 0.75 words per token, this corresponds to a required text generation speed of about 6 to 11 tokens per second for the output not to be experienced as too slow.
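The conversion from reading speed to the required generation speed is a simple division, shown here as a short sketch. The function name words_to_tokens_per_second is made up for this example; the ratio of 0.75 words per token is the one used above.

```python
def words_to_tokens_per_second(words_per_second, words_per_token=0.75):
    """Convert a reading speed in words/s into the matching token rate."""
    return words_per_second / words_per_token


slow = words_to_tokens_per_second(5)  # lower end of the reading speed range
fast = words_to_tokens_per_second(8)  # upper end of the reading speed range
print(f"{slow:.1f} to {fast:.1f} tokens per second")
```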

LLaMa 2 7B GPU Performance

LLaMa 2 13B GPU Performance

LLaMa 2 70B GPU Performance

Conclusion

The integration of LLaMa 2 within AIME API and its simple and fast setup opens up a scalable solution for developers looking to deploy their conversational AI solutions. By deploying LLaMa 2 with the AIME API Server you can build advanced applications such as chatbots, virtual assistants and interactive customer support systems.

The hardware requirements for running LLaMa 2 models are flexible, allowing for deployment on various distributed GPU configurations and on extendable setups or infrastructure that serve thousands of requests. Even the minimum GPU configuration or the smallest recommended AIME system for each model size responds faster than real-time reading speed, giving a seamless multi-user chat experience.
