Deploy Meta Llama 3 8B with vLLM and OVHcloud AI Deploy
Open-source LLM deployed with just one command – you are not dreaming!
The world of Large Language Models (LLMs) is evolving at a breakneck pace, and Meta's Llama 3 has emerged as a frontrunner, offering incredible performance and versatility. Deploying LLMs can be a challenge! That's why we're thrilled to show you just how easy it is to deploy Meta-Llama-3-8B-Instruct using our AI Deploy service and leveraging the incredible inference power of NVIDIA L4 GPUs. It’s the easiest way to harness advanced AI without dealing with infrastructure – making powerful AI more accessible for developers, startups, and businesses.
Context
Let’s take a closer look at the key technologies involved before diving into deployment.
Meta Llama 3
The meta-llama/Meta-Llama-3-8B-Instruct is an 8-billion-parameter instruction-tuned model, designed for high performance and efficiency. Developed by Meta, this version is fine-tuned for instruction-following tasks, making it suitable for a wide range of applications.
To serve meta-llama/Meta-Llama-3-8B-Instruct effectively, we use vLLM, a fast and scalable open-source inference engine optimized for LLMs.
vLLM
vLLM (Virtual LLM) is a highly optimized serving engine designed to run large language models efficiently. It takes advantage of several key optimizations, such as:
• PagedAttention: an attention mechanism that reduces memory fragmentation and enables more efficient use of GPU memory
• Continuous Batching: vLLM dynamically adjusts batch sizes in real time, to use the GPU efficiently, even with multiple simultaneous requests
• Tensor parallelism: enables model inference across multiple GPUs to boost performance
• Optimized kernel implementations: vLLM uses custom CUDA kernels for faster execution, reducing latency compared to traditional inference frameworks
Thanks to these features, vLLM is an ideal runtime for large models like Llama 3 or Mistral Small 24B, enabling low-latency, high-throughput inference on modern GPUs.
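To see why continuous batching helps, here is a toy Python sketch. This is not vLLM's actual scheduler, only the scheduling idea: a finished request frees its batch slot immediately, so waiting requests join mid-flight instead of waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching_steps(requests, max_batch=4):
    """Simulate continuous batching: each step decodes one token for every
    active request; finished requests free their slot immediately, so
    waiting requests join mid-flight."""
    waiting = deque(requests)  # tokens still to generate, per queued request
    active = []                # remaining tokens of in-flight requests
    steps = 0
    while waiting or active:
        # Fill free batch slots as soon as they open up
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step: every active request produces one token
        active = [n - 1 for n in active]
        # Completed requests leave the batch immediately
        active = [n for n in active if n > 0]
        steps += 1
    return steps

# One long request and four short ones: the long request does not block
# the short ones, and its batch-mates' freed slots are reused right away.
print(continuous_batching_steps([10, 2, 2, 2, 2], max_batch=4))
# → 10 steps (static batching of [10, 2, 2, 2] then [2] would take 12)
```

With static batching, the whole batch waits for its slowest member; with continuous batching, GPU slots never sit idle while work is queued, which is where the throughput gain comes from.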
With OVHcloud’s AI Deploy platform, you can serve this model with a single command line.
AI Deploy
OVHcloud AI Deploy is a Container as a Service (CaaS) platform designed to help you deploy, manage and scale AI models. It lets you deploy applications and APIs based on Machine Learning (ML), Deep Learning (DL) or LLMs with minimal setup.
The key benefits are:
• Easy to use: bring your own custom Docker image and deploy it in a command line or a few clicks
• High-performance computing: CPU, L4 GPU
• Scalability and flexibility: supports automatic scaling, allowing your model to effectively handle fluctuating workloads
• Cost-efficient: billing per minute, no surcharges
And yes, you can deploy the Llama 3 model in just one command.
Prerequisites
Before you begin, make sure you have:
• An OVHcloud account: access to the OVHcloud Control Panel
• The ovhai CLI: install the ovhai CLI
• AI Deploy access: a user with access to AI Deploy
• Hugging Face access: create a Hugging Face account and generate an access token
• Gated model authorization: confirm you have been granted access to the Meta-Llama-3-8B-Instruct model
Licensing Note:
• Llama 3 models are released under the Meta Llama 3 Community License. This license is designed to encourage broad use, but it requires you to read and accept the terms of use on the Hugging Face model page (e.g., for meta-llama/Meta-Llama-3-8B-Instruct) before you can download or use the model. Make sure you have accepted these terms so the deployment proceeds smoothly.
🚀 It’s time to deploy!
Deployment of the Meta Llama 3 8B Model
Let’s deploy the meta-llama/Meta-Llama-3-8B-Instruct model.
Manage access tokens
Export your Hugging Face token.
export MY_HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
Create a token to access your AI Deploy app once it is deployed.
ovhai token create --role operator ai_deploy_token=my_operator_token
The command returns the following output:
Id: 47292486-fb98-4a5b-8451-600895597a2b
Created At: 20-02-25 11:53:05
Updated At: 20-02-25 11:53:05
Spec:
Name: ai_deploy_token=my_operator_token
Role: AiTrainingOperator
Label Selector:
Status:
Value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Version: 1
You can now store and export your access token:
export MY_OVHAI_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Launch Meta Llama 3 8B with AI Deploy
You are ready to start Meta-Llama-3-8B-Instruct using vLLM and AI Deploy:
ovhai app run --name vllm-llama-3 \
--default-http-port 8000 \
--label ai_deploy_token=my_operator_token \
--gpu 1 \
--flavor l4-1-gpu \
-e OUTLINES_CACHE_DIR=/tmp/.outlines \
-e HF_TOKEN=$MY_HF_TOKEN \
-e HF_HOME=/hub \
-e HF_DATASETS_TRUST_REMOTE_CODE=1 \
-e HF_HUB_ENABLE_HF_TRANSFER=0 \
-v standalone:/hub:rw \
-v standalone:/workspace:rw \
vllm/vllm-openai:v0.8.2 \
-- bash -c "python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--tensor-parallel-size 1 \
--tokenizer-mode auto \
--load-format auto \
--config-format auto \
--dtype bfloat16 \
--max-model-len 4096"
Let’s break down the parameters of this command.
1. Start your AI Deploy app
Launch a new app using ovhai CLI and name it.
ovhai app run --name vllm-llama-3
2. Define access
Define the HTTP API port and restrict access to your token.
--default-http-port 8000
--label ai_deploy_token=my_operator_token
3. Configure GPU resources
These flags specify the hardware flavor (l4-1-gpu, which refers to an NVIDIA L4 GPU) and the number of GPUs (1).
--gpu 1
--flavor l4-1-gpu
⚠️NOTE: One L4 is sufficient for Llama 3 8B, but larger models may require other GPUs or multiple GPUs.
4. Set up environment variables
Configure caching for the Outlines library (used for efficient text generation):
-e OUTLINES_CACHE_DIR=/tmp/.outlines
Pass the Hugging Face token ($MY_HF_TOKEN) for model authentication and download:
-e HF_TOKEN=$MY_HF_TOKEN
Set the Hugging Face cache directory to /hub (where models will be stored):
-e HF_HOME=/hub
Allow execution of custom remote code from Hugging Face datasets (required for some model behaviors):
-e HF_DATASETS_TRUST_REMOTE_CODE=1
Disable Hugging Face Hub transfer acceleration (to use standard model downloading):
-e HF_HUB_ENABLE_HF_TRANSFER=0
5. Mount persistent volumes
Mounts two persistent storage volumes:
• /hub → Stores Hugging Face model files
• /workspace → Main working directory
The rw flag means read-write access.
-v standalone:/hub:rw
-v standalone:/workspace:rw
6. Choose the target Docker image
Uses the vllm/vllm-openai:v0.8.2 Docker image (a pre-configured vLLM OpenAI API server).
vllm/vllm-openai:v0.8.2
7. Running the model inside the container
Runs a bash shell inside the container and executes a Python command to launch the vLLM API server:
• python3 -m vllm.entrypoints.openai.api_server → Starts the OpenAI-compatible vLLM API server
• --model meta-llama/Meta-Llama-3-8B-Instruct → Loads the Meta Llama 3 8B Instruct model from Hugging Face
• --tensor-parallel-size 1 → Distributes the model across 1 GPU
• --tokenizer-mode auto → Automatically selects the tokenizer
• --load-format auto → Automatically detects the model weight format (e.g., safetensors, PyTorch)
• --config-format auto → Auto-detects the model config format
• --dtype bfloat16 → Loads the model in bfloat16 for efficient memory usage with good performance
• --max-model-len 4096 → Sets the maximum sequence length supported for inference
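Why does --max-model-len matter on a single L4 (24 GB)? The 8B weights alone take roughly 16 GB in bfloat16, so the KV cache competes for what remains. Here is a back-of-the-envelope estimate using Llama 3 8B’s published configuration (32 layers, 8 KV heads with grouped-query attention, head dimension 128):

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV-cache memory per token: keys + values (factor 2) for every layer,
    each of shape (num_kv_heads, head_dim), stored in dtype_bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama 3 8B: 32 layers, 8 KV heads (GQA), head_dim 128, bfloat16 (2 bytes)
per_token = kv_cache_bytes_per_token(32, 8, 128)
per_seq = per_token * 4096          # one full 4096-token sequence
print(per_token)                    # 131072 bytes = 128 KiB per token
print(per_seq / 2**20)              # 512.0 MiB for a full-length sequence
```

So each full 4096-token sequence costs about half a gigabyte of cache; raising --max-model-len reduces how many sequences fit concurrently, which is why the limit is set explicitly here rather than using the model’s full context window.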
You can now check if your AI Deploy app is alive:
ovhai app get <your_vllm_app_id>
Is your app in RUNNING status? Perfect! You can now check in the logs that the server has started:
ovhai app logs <your_vllm_app_id>
WARNING! This step may take a little time, as the model must be downloaded and loaded…
After a few minutes, you should get the following information in the logs:
2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Started server process [13]
2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Waiting for application startup.
2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Application startup complete.
2025-02-20T13:48:07Z [app] [tcmzt] INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
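Beyond the logs, you can confirm the server answers by querying its OpenAI-compatible /v1/models endpoint. A minimal standard-library sketch (the base URL and token are your own values from the steps above):

```python
import json
import urllib.request

def list_model_ids(models_json):
    """Extract model ids from a /v1/models response body (OpenAI list format)."""
    return [m["id"] for m in json.loads(models_json)["data"]]

def fetch_models(base_url, token):
    """Query the deployed app; base_url is your AI Deploy app URL."""
    req = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return list_model_ids(resp.read().decode())

# Example response body, in the shape returned by the vLLM OpenAI-compatible server:
sample = '{"object": "list", "data": [{"id": "meta-llama/Meta-Llama-3-8B-Instruct", "object": "model"}]}'
print(list_model_ids(sample))  # ['meta-llama/Meta-Llama-3-8B-Instruct']
```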
🚦Are all the indicators green? Then it’s off to inference!
Request and send prompt to the LLM
Send the following request, with the question of your choice:
curl -s https://<your_vllm_app_id>.app.us-east-va.ai.cloud.ovh.us/v1/chat/completions \
-H "Authorization: Bearer $MY_OVHAI_ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Give me the name of OVHcloud’s founder."}
],
"stream": false
}' | jq -r '.choices[0].message.content'
Which returns the following result:
OVHcloud's founder is Octave Klaba
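If you prefer Python to curl, the same request can be built with the standard library alone; the URL placeholder is the same one as in the curl example and must be replaced with your app id. Because the server speaks the OpenAI API, the official openai client also works by pointing base_url at your app and passing your AI Deploy token as api_key.

```python
import json
import urllib.request

def build_chat_request(base_url, token, model, messages):
    """Build the same POST request as the curl example above."""
    body = json.dumps({"model": model, "messages": messages, "stream": False}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "https://<your_vllm_app_id>.app.us-east-va.ai.cloud.ovh.us",  # replace with your app URL
    "MY_OVHAI_ACCESS_TOKEN_VALUE",                                # your AI Deploy token value
    "meta-llama/Meta-Llama-3-8B-Instruct",
    [{"role": "system", "content": "You are a helpful assistant."},
     {"role": "user", "content": "Give me the name of OVHcloud's founder."}],
)
# To send it against your live app:
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```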
Conclusion
By following these steps, you have successfully deployed the meta-llama/Meta-Llama-3-8B-Instruct model using vLLM on OVHcloud’s AI Deploy platform. This setup provides a scalable and efficient solution for serving advanced language models in production environments.
For further customization and optimization, refer to the vLLM documentation and OVHcloud AI Deploy resources.
💪 Challenge taken! You can now enjoy the power of your LLM deployed in a single command line!
Want to deploy your own Llama model?
Check out our AI product page and GPU servers to get started with scalable, high-performance infrastructure.