01 Introduction to LMDeploy
LMDeploy is an efficient and user-friendly deployment toolkit for Large Language Models (LLMs) and Vision-Language Models (VLMs), developed by the Model Compression and Deployment Team of Shanghai Artificial Intelligence Laboratory. It covers model quantization, offline inference, and online serving.
1.1 Hardware and Software Platforms
Supported hardware and software platforms include:
Linux or Windows systems with NVIDIA GPUs. The minimum required CUDA runtime version is 11.3. Supported NVIDIA GPU models include:
Volta(sm70): V100
Turing(sm75): 20 series, T4
Ampere(sm80,sm86): 30 series, A10, A16, A30, A100, etc.
Ada Lovelace(sm89): 40 Series
Hopper(sm90): H100 (not yet deeply optimized)
Huawei Ascend 910B
1.2 Project Structure
1.2.1 Interface Layer
Python: offline inference
RESTful: access to online services
gRPC: access to the Triton Inference Server interface. VLM models are not supported.
1.2.2 Quantization layer
Weight quantization: supports AWQ and SmoothQuant algorithms.
K/V Cache: online KV cache quantization
1.2.3 Engine Layer
TurboMind engine: originated from FasterTransformer, developed in C++ and CUDA, dedicated to optimizing inference performance.
PyTorch engine: developed in pure Python, with kernels written in OpenAI Triton, aiming to lower the barrier for developers.
The two engines complement each other, and together they form the cornerstone of LMDeploy.
1.2.4 Service Layer
OpenAI-like Server: inference service, compatible with the openai interface.
Gradio: web demo service.
Triton Inference Server: does not support VLMs, and is not recommended for LLMs either.
1.3 Supported Models
02 LMDeploy User Guide for VLM Deployment
2.1 Environment Installation
Option 1: Create a clean conda environment and pip install lmdeploy. python versions 3.8 - 3.12 are supported.
conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy
Up to this point, you can deploy LLM models using lmdeploy.
However, if you want to deploy VLM models, such as the InternVL series, InternLM-XComposer series, LLaVA, etc., you need to install the dependencies required by the upstream model libraries. The reason is that LMDeploy reuses the upstream libraries for model inference and image preprocessing for the visual part of VLMs.
Taking the InternVL2 model as an example, you need to install:
pip install timm
# For flash-attn, it is recommended to find a prebuilt whl package matching your environment at https://github.com/Dao-AILab/flash-attention/releases
pip install flash-attn
Because LMDeploy reuses the VLMs' upstream libraries for image preprocessing and vision model inference, and because different VLMs have different dependencies, LMDeploy does not add VLM dependencies such as timm and flash-attn to its own dependency list, for the sake of maintainability.
However, torchvision is in the dependency list. The reason is that LMDeploy depends on torch, and we are concerned that users might install a torchvision version that does not match their torch version. (We may consider removing it and trusting users.)
Option 2: docker image
LMDeploy only provides Docker images for deploying LLM models, not VLM models, for the reasons mentioned above.
It is recommended that users build VLM deployment images on top of the LMDeploy images, for example:
ARG CUDA_VERSION=cu12

FROM openmmlab/lmdeploy:latest-cu12 AS cu12
FROM openmmlab/lmdeploy:latest-cu11 AS cu11

# Select the base image according to CUDA_VERSION
FROM ${CUDA_VERSION} AS final
RUN python3 -m pip install timm
# For flash-attn, it is recommended to find a prebuilt whl package matching your environment
# at https://github.com/Dao-AILab/flash-attention/releases
RUN python3 -m pip install flash-attn
The LMDeploy image is named as follows:
openmmlab/lmdeploy:latest-cu12
openmmlab/lmdeploy:latest-cu11
openmmlab/lmdeploy:latest # Same as openmmlab/lmdeploy:latest-cu12
openmmlab/lmdeploy:{tag}-cu12 # like openmmlab/lmdeploy:v0.5.3-cu12
openmmlab/lmdeploy:{tag}-cu11
2.2 Offline Inference
Taking the InternVL2-8B model as an example, the simplest “Hello, world” style inference is as follows:
from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
When constructing the pipeline, if you don't specify whether to use the TurboMind engine or the PyTorch engine for inference, LMDeploy automatically assigns one based on their respective capabilities, with the TurboMind engine taking precedence by default.
Of course, you can choose an engine manually, and we will introduce the configuration method of both engines in detail in the chapter of inference engine configuration.
2.2.1 Creating a pipeline
2.2.1.1 API
def pipeline(model_path: str,
             model_name: Optional[str] = None,
             backend_config: Optional[Union[TurbomindEngineConfig,
                                            PytorchEngineConfig]] = None,
             chat_template_config: Optional[ChatTemplateConfig] = None,
             log_level='ERROR',
             **kwargs):
model_path: model path
- can be model_repo_id on the huggingface hub
- can be the model_repo_id on the modelscope hub, in which case you need to install modelscope and set the environment variable:
pip install modelscope
export LMDEPLOY_USE_MODELSCOPE=True
- For LLM models, this can also be the path to a model produced offline by `lmdeploy convert`. Offline conversion is not supported for VLM models.
model_name: name of the built-in dialog template
- v0.6.0 (not yet released) will remove this parameter and replace it with the model_name in ChatTemplateConfig.
backend_config
Inference engine configuration parameters
chat_template_config
Conversation template parameters
log_level
Log level. Defaults to ERROR
vision_config
Configuration parameters for vision model inference
vision_config is passed through kwargs
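For orientation, here is a minimal sketch (model id and values are illustrative) that combines the parameters above in a single call; the chat template name follows the appendix mapping for InternVL2:
from lmdeploy import ChatTemplateConfig, TurbomindEngineConfig, pipeline

# A minimal sketch: the values are illustrative, not recommendations
pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(session_len=8192),
    chat_template_config=ChatTemplateConfig(model_name='internvl2-internlm2'),
    log_level='WARNING')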
2.2.1.2 Inference Engine Configuration
Definition of engine parameters
Turbomind
@dataclass
class TurbomindEngineConfig:
    model_format: Optional[str] = None
    tp: int = 1
    session_len: Optional[int] = None
    max_batch_size: int = 128
    cache_max_entry_count: float = 0.8
    cache_block_seq_len: int = 64
    enable_prefix_caching: bool = False
    quant_policy: int = 0
    rope_scaling_factor: float = 0.0
    use_logn_attn: bool = False
    download_dir: Optional[str] = None
    revision: Optional[str] = None
    max_prefill_token_num: int = 8192
    num_tokens_per_iter: int = 0
    max_prefill_iters: int = 1
Pytorch
@dataclass
class PytorchEngineConfig:
    tp: int = 1
    session_len: int = None
    max_batch_size: int = 128
    cache_max_entry_count: float = 0.8
    eviction_type: str = 'recompute'
    prefill_interval: int = 16
    block_size: int = 64
    num_cpu_blocks: int = 0
    num_gpu_blocks: int = 0
    adapters: Dict[str, str] = None
    max_prefill_token_num: int = 4096
    thread_safe: bool = False
    enable_prefix_caching: bool = False
    device_type: str = 'cuda'
    download_dir: str = None
    revision: str = None
Parameters common to both engines:
- Single-node multi-GPU inference (tp)
tp is the number of GPUs used for tensor parallelism. The default value is 1, and it is currently constrained to powers of two (2^n).
LMDeploy only supports single-node multi-GPU inference, not multi-node.
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(
tp=2))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=PytorchEngineConfig(
tp=2))
- Set memory usage (cache_max_entry_count)
cache_max_entry_count indicates the percentage of free GPU memory occupied by the K/V cache after loading the model weights. The default value is 0.8.
The K/V cache is allocated as a one-time request and reused repeatedly, which is why the pipeline and the api_server in the following section consume a lot of GPU memory upon startup.
If you run into Out of Memory (OOM) errors, consider lowering cache_max_entry_count.
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(
        cache_max_entry_count=0.5))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=PytorchEngineConfig(
        cache_max_entry_count=0.5))
- Set the maximum inference length (session_len)
session_len indicates the maximum length of the context window, including the number of input prompt tokens and the number of output tokens
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(
session_len=8192))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=PytorchEngineConfig(
session_len=8192))
- Set the inference maximum batch (max_batch_size)
max_batch_size indicates the maximum number of batches for Continuous batching inference
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(
max_batch_size=256))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=PytorchEngineConfig(
max_batch_size=256))
- Set the prefix caching switch (enable_prefix_caching)
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(
enable_prefix_caching=True))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=PytorchEngineConfig(
enable_prefix_caching=True))
- Set the maximum number of tokens for the prefill chunk (max_prefill_token_num)
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(
max_prefill_token_num=8192))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=PytorchEngineConfig(
max_prefill_token_num=4096))
- Set model download parameters (download_dir, revision)
When model_path is not a local path, LMDeploy downloads the model from the huggingface hub or modelscope hub. By default it downloads the latest version and stores it under ~/.cache. You can specify the model version with revision and the download path with download_dir.
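A sketch with illustrative values; the revision and directory below are hypothetical:
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(
        revision='main',                 # hypothetical branch/tag/commit id
        download_dir='/data/hf_cache'))  # hypothetical download directory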
TurbomindEngine specific parameters:
- Set the Dynamic NTK extrapolation parameter (rope_scaling_factor)
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(
rope_scaling_factor=2.5,
session_len=1000000,
max_batch_size=1,
cache_max_entry_count=0.9,
tp=2)
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
prompt = 'Use a long prompt to replace this sentence'
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
response = pipe(prompt, gen_config=gen_config)
print(response)
- Setting the online KV Cache quantization precision (quant_policy)
quant_policy is the KV cache quantization policy for the LLM part: 4 means 4-bit quantization, 8 means 8-bit quantization. 8-bit KV cache causes almost no drop in accuracy and can be considered nearly lossless.
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(
        quant_policy=8))
- Set the model format (model_format)
The value range is {None, hf, llama, awq, gptq}. None means the format is determined automatically from the model files; hf denotes a llama-like model structure from the huggingface hub; llama denotes meta_llama (the PyTorch weight format); awq denotes an AWQ-quantized model; gptq denotes a GPTQ-quantized model (supported since 0.6.0).
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B-AWQ',
backend_config=TurbomindEngineConfig(
model_format='awq'))
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B-AWQ',
backend_config=TurbomindEngineConfig(
model_format=None))
- Size of the KV block (cache_block_seq_len)
The number of tokens a KV cache block can hold. The default is 64. It should be a multiple of 32 if the GPU compute capability is >= 8.0, otherwise a multiple of 64.
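For example, a sketch that doubles the block size on a GPU with compute capability >= 8.0 (the value 128 is illustrative):
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(
        cache_block_seq_len=128))  # multiple of 32 on compute capability >= 8.0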
num_tokens_per_iter
Controls the number of tokens processed in a single forward pass, covering both prefill and decoding. For example, with 8 decoding sequences and 2 prefill sequences, resources are first allocated for the 8 decoding tokens, and then for (num_tokens_per_iter - 8) prefill tokens.
It defaults to max_prefill_token_num.
To minimize the impact of long prompts, the optimal value of num_tokens_per_iter depends on model size, GPU model, and workload. For Llama3-8B on an A100 80G with 128 concurrent requests, a value between 128 and 256 works well. On a 4090, start from 64-128. In general, the smaller this value, the less prefill interferes with decoding, but it should be larger than max_batch_size.
max_prefill_iters
Controls the maximum number of iterations for prefilling a single sequence.
After max_prefill_iters is set, num_tokens_per_iter may be recalculated. For example, when a request has 2000 tokens, if max_prefill_iters=1, it means that the prefill completes in 1 iteration, regardless of what num_tokens_per_iter is set to. So, for num_tokens_per_iter to have its full effect, max_prefill_iters should be set to a larger value.
It defaults to (session_len + max_prefill_token_num - 1) // max_prefill_token_num
When a suitable num_tokens_per_iter is found, max_prefill_iters is used to balance decoding smoothness and first token latency. It is first set to a large value (you will observe an increase in first token latency) and then gradually reduced until an acceptable first token latency is reached.
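A sketch tying the two knobs together; the values are illustrative starting points to tune per model, GPU, and workload, not recommendations:
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(
        num_tokens_per_iter=256,  # tokens processed per forward pass
        max_prefill_iters=8))     # upper bound on prefill chunks per sequence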
2.2.1.3 Dialog template configuration
@dataclass
class ChatTemplateConfig:
    model_name: str
    system: Optional[str] = None
    meta_instruction: Optional[str] = None
    eosys: Optional[str] = None
    user: Optional[str] = None
    eoh: Optional[str] = None
    assistant: Optional[str] = None
    eoa: Optional[str] = None
    separator: Optional[str] = None
    capability: Optional[Literal['completion', 'infilling', 'chat',
                                 'python']] = None
    stop_words: Optional[List[str]] = None
model_name: Dialog template name
system, meta_instruction, and eosys represent the name of the system role, the system prompt, and the end marker of the system prompt, respectively.
Taking the InternLM2 model as an example, the three attributes of its dialog template are as follows:
# InternLM2
system = '<|im_start|>system'
meta_instruction = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM can understand and communicate fluently in the language chosen by the user such as English and Chinese.
"""
eosys = '<|im_end|>\n'
- user and eoh are the name of the user role and the end marker of the user's prompt.
- assistant and eoa are the name of the AI assistant role and the end marker of the assistant's reply.
- separator: the separator between two rounds of conversation.
- capability: the capability of the model. 'completion' for text completion, 'chat' for dialog, 'infilling' for code infilling (codellama specific), 'python' for python coding capability (codellama specific).
- stop_words: stop words used to terminate the AI assistant's response. Currently, LMDeploy only supports stop words that tokenize to a single token_id.
Dialog templates are used to stitch together a dialog sequence.
Let's assume a dialog sequence is U1A1U2A2...Un, where Ui denotes the prompt entered by the user in the i-th round, and Ai denotes the answer generated by the model (the AI assistant) in the i-th round.
In LMDeploy, there are two ways of stitching a dialog sequence together, corresponding to two inference modes: interactive inference (stateful) and non-interactive inference (stateless).
The difference between the two modes is what the user inputs in the i-th round of dialog. In interactive mode the input is Ui; in non-interactive mode the input is U1A1U2A2...Ui. In other words, in interactive mode the user does not need to pass the dialog history, because it has already been cached by the inference engine (tokens, KV, cursor, etc.), whereas in non-interactive mode the user must pass the full dialog history.
The two stitching methods are implemented in BaseChatTemplate: get_prompt is used in interactive inference mode, and messages2prompt in non-interactive inference mode.
class BaseChatTemplate:

    def get_prompt(self, prompt, sequence_start=True):
        """Return decorated prompt in interactive inference mode."""
        if sequence_start:  # decorate U_0
            return f'{self.system}{self.meta_instruction}{self.eosys}' \
                   f'{self.user}{prompt}{self.eoh}' \
                   f'{self.assistant}'
        else:  # decorate U_i
            return f'{self.user}{prompt}{self.eoh}' \
                   f'{self.assistant}'

    def messages2prompt(self, messages, sequence_start=True, **kwargs):
        """Return decorated prompt in non-interactive inference mode.

        Args:
            messages (str|List): user's input prompt, which is supposed
                to be in OpenAI format
        """
        if isinstance(messages, str):
            # fallback to `get_prompt` when `messages` isn't a list
            return self.get_prompt(messages, sequence_start)
        # "box" indicates "begin of x (role)"
        box_map = dict(user=self.user,
                       assistant=self.assistant,
                       system=self.system)
        # "eox" indicates "end of x (role)"
        eox_map = dict(user=self.eoh,
                       assistant=self.eoa + self.separator,
                       system=self.eosys)
        ret = ''
        for message in messages:
            role = message['role']
            content = message['content']
            ret += f'{box_map[role]}{content}{eox_map[role]}'
        # append the assistant role to prompt the model to answer
        ret += f'{self.assistant}'
        return ret
In the above code, some of the logic has been simplified for the sake of focus. For the completed code, please refer to: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/model.py
Conversation templates can be used in several ways:
- Without specifying a dialog template configuration, LMDeploy matches the built-in dialog template name based on the pathname of the model
- To specify the built-in dialog template
from lmdeploy import pipeline, ChatTemplateConfig
pipe = pipeline('/the/path/of/your/finetuned/internvl2/8b/model',
                chat_template_config=ChatTemplateConfig(
                    model_name='internvl2-internlm2'))
The `lmdeploy list` command displays the built-in chat templates. For the mapping between built-in templates and supported models, see Appendix - Built-in Dialog Templates.
- To change the properties of the built-in dialog templates
from lmdeploy import pipeline, ChatTemplateConfig
pipe = pipeline('/the/path/of/your/finetuned/internvl2/8b/model',
                chat_template_config=ChatTemplateConfig(
                    model_name='internvl2-internlm2',
                    meta_instruction='You are a helpful assistant'))
LMDeploy will merge the non-None attributes into the specified built-in chat template.
- Customizing Dialog Templates
Option 1: The chat template's attributes and stitching logic are fully consistent with the BaseChatTemplate definition.
All you need to do is set the ChatTemplateConfig fields, and LMDeploy will create a BaseChatTemplate instance from them.
Option 2: The chat template's attributes or stitching logic do not conform to the BaseChatTemplate definition. In this case, register and implement a custom chat template class, for example:
@register_module(name='awesome')
class MyChatTemplate:

    def __init__(self, *args, **kwargs):
        pass

    def get_prompt(self, prompt, sequence_start=True):
        if sequence_start:
            # TODO: return the decorated prompt when it is the first request of a sequence
            pass
        else:
            # TODO: return the decorated prompt when it is NOT the first request of a sequence
            pass

    def messages2prompt(self, messages, sequence_start=True, **kwargs):
        if isinstance(messages, str):
            return self.get_prompt(messages, sequence_start)
        # TODO: return the prompt after applying the chat template
        pass

pipe = pipeline('/the/path/of/your/awesome/model',
                chat_template_config=ChatTemplateConfig(
                    model_name='awesome'))
2.2.1.4 Visual model inference configuration
@dataclass
class VisionConfig:
    max_batch_size: int = 1
    thread_safe: bool = False
max_batch_size indicates the image batch size for the vision model. The larger the value, the higher the risk of OOM, because the LLM part of the VLM pre-allocates a large amount of memory in advance.
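For example, a sketch that raises the image batch size to 2, assuming VisionConfig can be imported from the top-level lmdeploy package as in recent releases; vision_config is forwarded to the pipeline through **kwargs:
from lmdeploy import VisionConfig, pipeline

pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    vision_config=VisionConfig(max_batch_size=2))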
2.2.2 Using pipeline
2.2.2.1 API
def __call__(self,
             prompts: Union[VLPromptType, List[Dict], List[VLPromptType],
                            List[List[Dict]]],
             gen_config: Optional[GenerationConfig] = None,
             **kwargs):
prompts: the user input, consisting of prompt text and image(s). It can take the following forms:
- str: plain text
- list[str]: plain text sequence
- tuple(str, PIL.Image): text + image
- tuple(str, list[PIL.Image]): text + image sequence
- list[tuple(str, PIL.Image)]: (text + image) sequence
- GPT4V format:
[{
    'role': 'user',
    'content': [{
        'type': 'text',
        'text': 'the input text prompt',
    },
    {
        'type': 'image_url',
        'image_url': {
            'url': 'data:image/jpeg;base64,{image_base64_data}'
        }
    },
    {
        'type': 'image_data',
        'image_data': {
            'data': PIL.Image
        }
    },
    ...
    {...}]
}]
LMDeploy will process the first 5 formats into the GPT4V format
gen_config: sampling parameters for token generation
@dataclass
class GenerationConfig:
    n: int = 1
    max_new_tokens: int = 512
    top_p: float = 1.0
    top_k: int = 1
    temperature: float = 0.8
    repetition_penalty: float = 1.0
    ignore_eos: bool = False
    random_seed: int = None
    stop_words: List[str] = None
    bad_words: List[str] = None
    min_new_tokens: int = None
    skip_special_tokens: bool = True
    logprobs: int = None
- n: the number of sequences generated for the input request. Currently only 1 is supported.
- max_new_tokens: the maximum number of tokens generated for the input request.
- top_p: sampling considers the smallest set of tokens whose cumulative probability exceeds top_p.
- top_k: sampling considers the top_k tokens with the highest probability. top_k=1 means greedy search.
- temperature: sampling temperature. temperature=0.0 means greedy search.
- repetition_penalty: a penalty that discourages the model from generating repeated words or phrases. A value greater than 1 suppresses repetition.
- ignore_eos: whether to ignore eos_token_id.
- random_seed: the seed used when sampling tokens.
- stop_words: stop words for token generation. Currently each stop_word must tokenize to a single token_id.
- bad_words: words that will never be generated. Currently each bad_word must also tokenize to a single token_id.
- min_new_tokens: the minimum number of tokens generated for the input request.
- skip_special_tokens: whether to skip special tokens during decoding. Default is True.
- logprobs: the number of log probabilities returned for each output token.
- For an introduction to the sampling method at generation time, we recommend reading https://huggingface.co/blog/how-to-generate.
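Putting the sampling parameters to use, a minimal sketch with illustrative values:
from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
gen_config = GenerationConfig(
    max_new_tokens=256,       # cap the response length
    top_p=0.8,
    top_k=40,
    temperature=0.6,
    repetition_penalty=1.02)  # mild penalty against repeated phrases
response = pipe(('describe this image', image), gen_config=gen_config)
print(response)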
2.2.2.2 More examples
- Multi-image input
For multi-image scenarios, simply put the images in a list at inference time. Note that multiple images mean more input tokens, so the inference context length (session_len) usually needs to be increased.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(session_len=10000))
image_urls=[
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]
images = [load_image(img_url) for img_url in image_urls]
response = pipe(('describe these images', images))
print(response)
- Batched image-text input
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(session_len=8192))
image_urls=[
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)
- Multi-Round Dialogs
There are two ways for pipeline to conduct multi-round conversations. One is to construct messages in the GPT4V format, and the other is to use the pipeline.chat interface.
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(session_len=8192))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.6)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
- Customizing the Location of Image Tokens
By default, LMDeploy inserts the special token representing the image into the user prompt according to the chat template provided by the upstream algorithm repo. However, some models, such as deepseek-vl, place no restriction on where the image token appears, or the user may want to control where it is inserted. In such cases, insert the image token into the prompt manually. LMDeploy uses <IMAGE_TOKEN> as the special image token.
from lmdeploy import pipeline
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN
pipe = pipeline('deepseek-ai/deepseek-vl-1.3b-chat')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe((f'describe this image{IMAGE_TOKEN}', image))
print(response)
2.3 Online Services
2.3.1 Starting the service
2.3.1.1 Method 1: Use the lmdeploy cli utility
lmdeploy serve api_server OpenGVLab/InternVL2-8B
This command starts an OpenAI-compatible model inference service on port 23333 of the local host. You can specify a different port with the --server-port option. For a more detailed description of the parameters, refer to the api_server Parameters section.
2.3.1.2 Method 2: Using docker
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 23333:23333 \
--ipc=host \
openmmlab/lmdeploy:latest \
lmdeploy serve api_server OpenGVLab/InternVL2-8B
2.3.1.3 api_server Parameters
root@lmdeploy-on-121:~/lmdeploy# lmdeploy serve api_server -h
usage: lmdeploy serve api_server [-h] [--server-name SERVER_NAME] [--server-port SERVER_PORT]
                                 [--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...]] [--allow-credentials]
                                 [--allow-methods ALLOW_METHODS [ALLOW_METHODS ...]]
                                 [--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...]] [--qos-config-path QOS_CONFIG_PATH]
                                 [--backend {pytorch,turbomind}]
                                 [--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}]
                                 [--api-keys [API_KEYS ...]] [--ssl] [--model-name MODEL_NAME]
                                 [--chat-template CHAT_TEMPLATE] [--revision REVISION] [--download-dir DOWNLOAD_DIR]
                                 [--adapters [ADAPTERS ...]] [--tp TP] [--session-len SESSION_LEN]
                                 [--max-batch-size MAX_BATCH_SIZE] [--cache-max-entry-count CACHE_MAX_ENTRY_COUNT]
                                 [--cache-block-seq-len CACHE_BLOCK_SEQ_LEN] [--enable-prefix-caching]
                                 [--model-format {hf,llama,awq,gptq}] [--quant-policy {0,4,8}]
                                 [--rope-scaling-factor ROPE_SCALING_FACTOR] [--num-tokens-per-iter NUM_TOKENS_PER_ITER]
                                 [--max-prefill-iters MAX_PREFILL_ITERS] [--vision-max-batch-size VISION_MAX_BATCH_SIZE]
                                 model_path
- model_path: the path of the deployed model, or its repo id on the huggingface/modelscope hub.
--server-name SERVER_NAME: Host IP address of the service. Default: 0.0.0.0.
--server-port SERVER_PORT: Service port. Default: 23333.
--allow-origins ALLOW_ORIGINS: List of allowed CORS sources. Default: ['*'].
--allow-credentials: Whether to allow CORS credentials. Default: False.
--allow-methods ALLOW_METHODS: List of allowed HTTP methods. Default: ['*'].
--allow-headers ALLOW_HEADERS: List of allowed HTTP headers. Default: ['*'].
--backend {pytorch,turbomind}: Set the inference backend. Default: turbomind.
--log-level {LEVELS}: Set the log level. Default: ERROR.
--api-keys [API_KEYS]: Optional list of API keys.
--ssl: Enable SSL. Requires the OS environment variables 'SSL_KEYFILE' and 'SSL_CERTFILE'.
--model-name MODEL_NAME: Service name of the model. Can be accessed via the RESTful API /v1/models. If not specified, model_path will be used.
--chat-template CHAT_TEMPLATE: When it is a string, it indicates the name of the built-in conversation template. When it is a JSON file path, it indicates a custom chat template.
--revision REVISION: The specific model version to use. Can be a branch name, tag name or commit ID.
--download-dir DOWNLOAD_DIR: Directory to download and load weights, defaults to the default cache directory for huggingface.
Parameters related to the TurboMind engine
--tp TP: Number of GPUs to be used in tensor parallelism.
--session-len SESSION_LEN: Maximum session length of the sequence
--max-batch-size MAX_BATCH_SIZE: Maximum batch size. Default: 128
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT: Percentage of free GPU memory occupied by the KV cache, excluding weights. Default: 0.8
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN: The number of tokens the KV cache block holds. For Turbomind engine, should be a multiple of 32 if GPU compute_capability >= 8.0, otherwise should be a multiple of 64. Default: 64.
--enable-prefix-caching: Whether to enable prefix matching KV caching. Default: False.
--model-format {hf,llama,awq,gptq}: input model format. hf means hf_llama, llama means meta_llama, awq means awq quantized model, gptq means gptq quantized model.
--quant-policy {0,4,8}: whether to quantize kv. 0: no quantization; 4: 4-bit kv; 8: 8-bit kv. default: 0
--rope-scaling-factor ROPE_SCALING_FACTOR: Rope scaling factor. Default: 0.0
--num-tokens-per-iter NUM_TOKENS_PER_ITER: Number of tokens processed in forward pass. Default: 0
--max-prefill-iters MAX_PREFILL_ITERS: Maximum number of forward passes in the prefill phase. Default: 1
Parameters related to the PyTorch engine
--adapters [ADAPTERS ...]: Used to set the path(s) of the lora model. Multiple lora key-value pairs can be provided in the format xxx=yyyy. If there is only one adapter, you can pass just the path to the adapter. Default: None. Type: string.
--tp TP: Number of GPUs to use in tensor parallelism
--session-len SESSION_LEN: Maximum session length of the sequence
--max-batch-size MAX_BATCH_SIZE: Maximum batch size. Default: 128
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT: Percentage of free GPU memory occupied by the KV cache, excluding weights. Default: 0.8
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN: The number of tokens the KV cache block holds. This parameter is ignored if the Lora adapter is specified. Default: 64
--enable-prefix-caching: Whether to enable prefix-matching KV cache. Default: False.
Vision model parameters:
--vision-max-batch-size VISION_MAX_BATCH_SIZE: Visual model batch size. Default: 1.
2.3.2 Accessing Services
It is recommended to use the openai client package interface to access the service
- image url
from openai import OpenAI
client = OpenAI(
api_key='YOUR_API_KEY', # dummy key to pass openai checking key
base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{
    'role': 'user',
    'content': [{
        'type': 'text',
        'text': 'Describe the image',
    }, {
        'type': 'image_url',
        'image_url': {
            'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
        },
    }],
}],
temperature=0.8,
top_p=0.8)
print(response)
If the model supports multiple images, you can append an image to the user's content list in messages
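For example, continuing with the client and model_name from the snippet above, a sketch of a payload carrying two image URLs (reusing the demo images from the offline examples); this only works if the model accepts multiple images:
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': 'Describe the two images',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
            },
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg',
            },
        }],
    }])
print(response)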
- Image base64 encoding
import base64
from openai import OpenAI
client = OpenAI(
api_key='YOUR_API_KEY', # dummy key to pass openai checking key
base_url='http://0.0.0.0:23333/v1')
def encode_image(image_path):
    with open(image_path, 'rb') as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = 'path_to_your_image.jpg'
# Getting the base64 string
base64_image = encode_image(image_path)
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{
    'role': 'user',
    'content': [{
        'type': 'text',
        'text': "What's in this image?",
    }, {
        'type': 'image_url',
        'image_url': {
            'url': f'data:image/jpeg;base64,{base64_image}',
        },
    }],
}],
max_tokens=300)
- Add additional information
from openai import OpenAI
client = OpenAI(
api_key='YOUR_API_KEY', # dummy key to pass openai checking key
base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{
    'role': 'user',
    'content': [{
        'type': 'text',
        'text': 'Describe the image',
    }, {
        'type': 'image_url',
        'image_url': {
            'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
        },
    }],
}],
temperature=0.8,
top_p=0.8,
extra_body={"repetition_penalty": 1.02}
)
print(response)
2.4 Model Quantization
2.4.1 Weight Quantization
- 4-bit weight quantization. The AWQ quantization algorithm quantizes only the language model part of the VLM, not the visual part.
- Supported graphics card models:
- V100 (0.6.0 support, not yet released)
- Turing(sm75): 20 series, T4
- Ampere(sm80): A100
- Ampere(sm86): 30 series, A10, A16, A30 etc.
- Ada Lovelace(sm89): 40 Series
- Hopper(sm90): H100, H800 (not yet deeply optimized)
lmdeploy lite auto_awq OpenGVLab/InternVL2-8B
2.4.2 KV Cache Quantization
Set the quant_policy parameter in TurbomindEngineConfig. For details, refer to the description of quant_policy in the Inference Engine Configuration section above.
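For example, a sketch that combines the AWQ-quantized weights from the previous subsection with 8-bit online KV cache quantization:
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'OpenGVLab/InternVL2-8B-AWQ',
    backend_config=TurbomindEngineConfig(
        model_format='awq',
        quant_policy=8))  # 8-bit online KV cache quantization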
03 Inference Performance
There is no standardized method for evaluating VLM inference performance.
The LLM pipeline is evaluated as follows; the test metric is RPS (requests per second):
python benchmark/profile_pipeline_api.py \
ShareGPT_V3_unfiltered_cleaned_split.json \
meta-llama/Meta-Llama-3-8B-Instruct \
--num-prompts 5000
On an A100-SXM4-80G GPU, the test results are as follows:
--------------------------------------------------
concurrency: 256
elapsed_time: 208.390s
first token latency(s)(min, max, ave): 0.068, 3.880, 0.378
per-token latency(s) percentile(50, 75, 95, 99): [0, 0.09, 0.153, 0.207]
number of prompt tokens: 1136185
number of completion tokens: 1008966
token throughput (completion token): 4841.723 token/s
token throughput (prompt + completion token): 10293.932 token/s
RPS (request per second): 23.993 req/s
RPM (request per minute): 1439.609 req/min
--------------------------------------------------
To test the performance of LLM serving, it is recommended to use vLLM's benchmark script https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py. Its test metrics are TTFT (time to first token) and TPOT (time per output token).
# Start the service first
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --max-batch-size 256
# Open a new terminal and test with vLLM's test script benchmark_serving.py
Versions used for comparison: vllm 0.4.2, lmdeploy 0.4.1, tensorrt-llm v0.9.0
04 VLM Inference Implementation
Initialize pipeline
Prompt preprocessing
In example 1 of the offline inference chapter, the model is OpenGVLab/InternVL2-8B and the request for pipeline inference is ('describe this image', image).
After “converting to GPT4V request format”, the request becomes:
{
    'role': 'user',
    'content': [
        {
            'type': 'text',
            'text': 'describe this image'
        },
        {
            'type': 'image_data',
            'image_data': {
                'data': image
            }
        }
    ]
}
After "adding the image token", the request becomes:
{'role': 'user', 'content': '<img><IMAGE_TOKEN></img>\ndescribe this image'}
InternVL2-8B puts the image before the text.
After decorating with the chat template (messages2prompt), the request becomes:
<|im_start|>system\nYou are InternVL (书生), a multimodal large model developed jointly by Shanghai AI Laboratory and SenseTime, a helpful and harmless AI assistant.<|im_end|>\n<|im_start|>user\n<img><IMAGE_TOKEN></img>\ndescribe this image<|im_end|>\n<|im_start|>assistant\n
Image Encoding
self.vl_encoder reuses image preprocessing and vision model inference from the upstream library.
Inference
TurboMind inference
PyTorch inference
TODO
05 Future Planning
06 Appendices
6.1 Built-in Dialog Templates
| Models | Model Type | Model structure | Built-in dialog template name | Description |
|---|---|---|---|---|
| InternLM-XComposer2 | MLLM | InternLMXComposer2ForCausalLM | internlm-xcomposer2 | |
| InternLM-XComposer2.5 | MLLM | InternLMXComposer2ForCausalLM | internlm-xcomposer2d5 | |
| Qwen-VL | MLLM | QWenLMHeadModel | qwen | |
| DeepSeek-VL | MLLM | MultiModalityCausalLM | deepseek-vl | |
| Phi-3-vision | MLLM | Phi3VForCausalLM | phi-3 | |
| CogVLM-Chat | MLLM | CogVLMForCausalLM | cogvlm | |
| CogVLM2-Chat | MLLM | CogVLMForCausalLM | cogvlm2 | |
| Yi-VL | MLLM | LlavaLlamaForCausalLM | yi-vl | |
| LLaVA-v1.5 | MLLM | LlavaLlamaForCausalLM | | |
| LLaVA-v1.6-vicuna | MLLM | LlavaLlamaForCausalLM | | |
| llava-v1.6-34b | MLLM | LlavaLlamaForCausalLM | llava-chatml | |
| llava-v1.6-mistral-7b | MLLM | LlavaMistralForCausalLM | mistral | |
| InternVL-Chat-V1-5 | MLLM | InternLM2ForCausalLM | internvl-internlm2 | For the InternVL series, check the architecture in llm_config |
| Mini-InternVL-Chat-2B-V1-5 | MLLM | InternLM2ForCausalLM | internvl-internlm2 | |
| Mini-InternVL-Chat-4B-V1-5 | MLLM | Phi3ForCausalLM | internvl-phi3 | |
| InternVL2 (2B, 8B, 26B) | MLLM | InternLM2ForCausalLM | internvl2-internlm2 | |
| InternVL2 (4B) | MLLM | Phi3ForCausalLM | internvl2-phi3 | |
| InternVL2 (40B) | MLLM | LlamaForCausalLM | internvl2-internlm2 | |
| InternVL2-Llama3-76B | MLLM | LlamaForCausalLM | internvl2-internlm2 | |
| MiniCPM-Llama3-V-2_5 | MLLM | MiniCPMV | llama3 | The LLM type can only be seen in the code |
| MiniGeminiLlama | MLLM | MiniGeminiLlamaForCausalLM | mini-gemini-vicuna | |
| GLM-4V | MLLM | ChatGLMModel | glm4 | |