01 Introduction to LMDeploy
LMDeploy is an efficient and user-friendly deployment toolkit for Large Language Models (LLMs) and Vision-Language Models (VLMs), developed by the Model Compression and Deployment Team of Shanghai Artificial Intelligence Laboratory. It covers model quantization, offline inference, and online serving.
1.1 Hardware and Software Platforms
Supported hardware and software platforms include:
Linux or Windows systems with NVIDIA GPUs. The minimum required CUDA runtime version is 11.3. Supported NVIDIA GPU models include:
Volta(sm70): V100
Turing(sm75): 20 series, T4
Ampere(sm80,sm86): 30 series, A10, A16, A30, A100, etc.
Ada Lovelace(sm89): 40 Series
Hopper(sm90): H100 (not yet deeply optimized)
Huawei Ascend 910B
1.2 Project Structure
1.2.1 Interface Layer
Python: offline inference
RESTful: access to online services
gRPC: access to the Triton Inference Server interface. VLM models are not supported.
1.2.2 Quantization layer
Weight quantization: supports AWQ and SmoothQuant algorithms.
K/V Cache: online KV cache quantization
1.2.3 Engine Layer
TurboMind engine: originated from FasterTransformer, developed in C++ and CUDA, dedicated to optimizing inference performance.
PyTorch engine: developed in pure Python, with kernels written in OpenAI Triton, aiming to lower the barrier for developers.
The two engines complement each other, and together they form the cornerstone of LMDeploy.
1.2.4 Service Layer
OpenAI-like Server: inference service, compatible with the openai interface.
Gradio: web demo service.
Triton Inference Server: does not support VLMs, and is not recommended for LLMs either.
1.3 Supported Models
02 LMDeploy User Guide for VLM Deployment
2.1 Environment Installation
Option 1: Create a clean conda environment and pip install lmdeploy. python versions 3.8 - 3.12 are supported.
conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy
Up to this point, you can deploy LLM models using lmdeploy.
However, if you want to deploy VLM models, such as the InternVL series, InternLM-XComposer series, LLaVA, etc., you need to install the dependencies required by the upstream model libraries. The reason is that LMDeploy reuses the upstream libraries for model inference and image preprocessing for the visual part of VLMs.
Taking the InternVL2 model as an example, you need to install:
pip install timm
# For flash-attn, it is recommended to find a prebuilt whl package matching your environment at https://github.com/Dao-AILab/flash-attention/releases
pip install flash-attn
Because LMDeploy reuses the VLMs' upstream libraries for image preprocessing and vision model inference, and because different VLMs have different dependencies, LMDeploy does not add VLM dependencies such as timm and flash-attn to its own dependency list, for the sake of maintainability.
However, torchvision is in the dependency list. The reason is that LMDeploy depends on torch, and we are concerned that users might install a torchvision version that does not match their torch version. (We may consider removing it and trusting users.)
Option 2: docker image
LMDeploy only provides Docker images for deploying LLM models, not VLM models, for the reasons mentioned above.
It is recommended that users build VLM deployment images on top of the LMDeploy images, for example:
ARG CUDA_VERSION=cu12

FROM openmmlab/lmdeploy:latest-cu12 AS cu12
FROM openmmlab/lmdeploy:latest-cu11 AS cu11

# Select the base image according to CUDA_VERSION
FROM ${CUDA_VERSION} AS final
RUN python3 -m pip install timm
# For flash-attn, it is recommended to find a prebuilt whl package matching your environment
# at https://github.com/Dao-AILab/flash-attention/releases
RUN python3 -m pip install flash-attn
The LMDeploy image is named as follows:
openmmlab/lmdeploy:latest-cu12
openmmlab/lmdeploy:latest-cu11
openmmlab/lmdeploy:latest # Same as openmmlab/lmdeploy:latest-cu12
openmmlab/lmdeploy:{tag}-cu12 # like openmmlab/lmdeploy:v0.5.3-cu12
openmmlab/lmdeploy:{tag}-cu11
2.2 Offline Inference
Taking the InternVL2-8B model as an example, the simplest “Hello, world” style inference is as follows:
from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
When constructing the pipeline, if you don't specify whether to use the TurboMind engine or the PyTorch engine for inference, LMDeploy automatically assigns one based on their respective capabilities, with the TurboMind engine taking precedence by default.
Of course, you can choose an engine manually, and we will introduce the configuration method of both engines in detail in the chapter of inference engine configuration.
2.2.1 Creating a pipeline
2.2.1.1 API
def pipeline(model_path: str,
             model_name: Optional[str] = None,
             backend_config: Optional[Union[TurbomindEngineConfig,
                                            PytorchEngineConfig]] = None,
             chat_template_config: Optional[ChatTemplateConfig] = None,
             log_level='ERROR',
             **kwargs):
model_path: model path
- can be model_repo_id on the huggingface hub
- can be the model_repo_id on the modelscope hub, in which case you need to install modelscope and set the environment variable:
pip install modelscope
export LMDEPLOY_USE_MODELSCOPE=True
- For LLM models, this can also be the path to a model produced offline by `lmdeploy convert`. Offline conversion is not supported for VLM models.
model_name: name of the built-in dialog template
- v0.6.0 (not yet released) will remove this parameter and replace it with the model_name in ChatTemplateConfig.
backend_config
Inference engine configuration parameters
chat_template_config
Conversation template parameters
log_level
Log level. Defaults to ERROR
vision_config
Configuration parameters for vision model inference
vision_config is passed through kwargs
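For orientation, here is a minimal sketch (model id and values are illustrative) that combines the parameters above in a single call; the chat template name follows the appendix mapping for InternVL2:
from lmdeploy import ChatTemplateConfig, TurbomindEngineConfig, pipeline

# A minimal sketch: the values are illustrative, not recommendations
pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(session_len=8192),
    chat_template_config=ChatTemplateConfig(model_name='internvl2-internlm2'),
    log_level='WARNING')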
2.2.1.2 Inference Engine Configuration
Definition of engine parameters
Turbomind
@dataclass
class TurbomindEngineConfig:
    model_format: Optional[str] = None
    tp: int = 1
    session_len: Optional[int] = None
    max_batch_size: int = 128
    cache_max_entry_count: float = 0.8
    cache_block_seq_len: int = 64
    enable_prefix_caching: bool = False
    quant_policy: int = 0
    rope_scaling_factor: float = 0.0
    use_logn_attn: bool = False
    download_dir: Optional[str] = None
    revision: Optional[str] = None
    max_prefill_token_num: int = 8192
    num_tokens_per_iter: int = 0
    max_prefill_iters: int = 1
Pytorch
@dataclass
class PytorchEngineConfig:
    tp: int = 1
    session_len: int = None
    max_batch_size: int = 128
    cache_max_entry_count: float = 0.8
    eviction_type: str = 'recompute'
    prefill_interval: int = 16
    block_size: int = 64
    num_cpu_blocks: int = 0
    num_gpu_blocks: int = 0
    adapters: Dict[str, str] = None
    max_prefill_token_num: int = 4096
    thread_safe: bool = False
    enable_prefix_caching: bool = False
    device_type: str = 'cuda'
    download_dir: str = None
    revision: str = None
Parameters common to both engines:
- Single-node multi-GPU inference (tp)
tp is the number of GPUs used for tensor parallelism. The default value is 1, and it is currently constrained to powers of two (2^n).
LMDeploy only supports single-node multi-GPU inference, not multi-node.
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(
tp=2))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=PytorchEngineConfig(
tp=2))
- Set memory usage (cache_max_entry_count)
cache_max_entry_count indicates the percentage of free GPU memory occupied by the K/V cache after loading the model weights. The default value is 0.8.
The K/V cache is allocated as a one-time request and reused repeatedly, which is why the pipeline and the api_server in the following section consume a lot of GPU memory upon startup.
If you run into Out of Memory (OOM) errors, consider lowering cache_max_entry_count.
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(
        cache_max_entry_count=0.5))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=PytorchEngineConfig(
        cache_max_entry_count=0.5))
- Set the maximum inference length (session_len)
session_len indicates the maximum length of the context window, including the number of input prompt tokens and the number of output tokens
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(
session_len=8192))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=PytorchEngineConfig(
session_len=8192))
- Set the inference maximum batch (max_batch_size)
max_batch_size indicates the maximum number of batches for Continuous batching inference
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(
max_batch_size=256))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=PytorchEngineConfig(
max_batch_size=256))
- Set the prefix caching switch (enable_prefix_caching)
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(
enable_prefix_caching=True))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=PytorchEngineConfig(
enable_prefix_caching=True))
- Set the maximum number of tokens for the prefill chunk (max_prefill_token_num)
Turbomind
from lmdeploy import pipeline
from lmdeploy import TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(
max_prefill_token_num=8192))
Pytorch
from lmdeploy import pipeline
from lmdeploy import PytorchEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B',
backend_config=PytorchEngineConfig(
max_prefill_token_num=4096))
- Set model download parameters (download_dir, revision)
When model_path is not a local path, LMDeploy downloads the model from the huggingface hub or modelscope hub. By default it downloads the latest version and stores it under ~/.cache. You can specify the model version with revision and the download path with download_dir.
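A sketch with illustrative values; the revision and directory below are hypothetical:
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(
        revision='main',                 # hypothetical branch/tag/commit id
        download_dir='/data/hf_cache'))  # hypothetical download directory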
TurbomindEngine specific parameters:
- Set the Dynamic NTK extrapolation parameter (rope_scaling_factor)
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
backend_config = TurbomindEngineConfig(
rope_scaling_factor=2.5,
session_len=1000000,
max_batch_size=1,
cache_max_entry_count=0.9,
tp=2)
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
prompt = 'Use a long prompt to replace this sentence'
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
response = pipe(prompt, gen_config=gen_config)
print(response)
- Setting the online KV Cache quantization precision (quant_policy)
quant_policy is the KV cache quantization policy for the LLM part: 4 means 4-bit quantization, 8 means 8-bit quantization. 8-bit KV cache causes almost no drop in accuracy and can be considered nearly lossless.
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(
        quant_policy=8))
- Set the model format (model_format)
The value range is {None, hf, llama, awq, gptq}. None means the format is determined automatically from the model files; hf denotes a llama-like model structure from the huggingface hub; llama denotes meta_llama (the PyTorch weight format); awq denotes an AWQ-quantized model; gptq denotes a GPTQ-quantized model (supported since 0.6.0).
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B-AWQ',
backend_config=TurbomindEngineConfig(
model_format='awq'))
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline(
'OpenGVLab/InternVL2-8B-AWQ',
backend_config=TurbomindEngineConfig(
model_format=None))
- Size of the KV block (cache_block_seq_len)
The number of tokens a KV cache block can hold. The default is 64. It should be a multiple of 32 if the GPU compute capability is >= 8.0, otherwise a multiple of 64.
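For example, a sketch that doubles the block size on a GPU with compute capability >= 8.0 (the value 128 is illustrative):
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(
        cache_block_seq_len=128))  # multiple of 32 on compute capability >= 8.0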
num_tokens_per_iter
Controls the number of tokens processed in a single forward pass, covering both prefill and decoding. For example, with 8 decoding sequences and 2 prefill sequences, resources are first allocated for the 8 decoding tokens, and then for (num_tokens_per_iter - 8) prefill tokens.
It defaults to max_prefill_token_num.
To minimize the impact of long prompts, the optimal value of num_tokens_per_iter depends on model size, GPU model, and workload. For Llama3-8B on an A100 80G with 128 concurrent requests, a value between 128 and 256 works well. On a 4090, start from 64-128. In general, the smaller this value, the less prefill interferes with decoding, but it should be larger than max_batch_size.
max_prefill_iters
Controls the maximum number of iterations for prefilling a single sequence.
After max_prefill_iters is set, num_tokens_per_iter may be recalculated. For example, when a request has 2000 tokens, if max_prefill_iters=1, it means that the prefill completes in 1 iteration, regardless of what num_tokens_per_iter is set to. So, for num_tokens_per_iter to have its full effect, max_prefill_iters should be set to a larger value.
It defaults to (session_len + max_prefill_token_num - 1) // max_prefill_token_num
When a suitable num_tokens_per_iter is found, max_prefill_iters is used to balance decoding smoothness and first token latency. It is first set to a large value (you will observe an increase in first token latency) and then gradually reduced until an acceptable first token latency is reached.
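A sketch tying the two knobs together; the values are illustrative starting points to tune per model, GPU, and workload, not recommendations:
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    backend_config=TurbomindEngineConfig(
        num_tokens_per_iter=256,  # tokens processed per forward pass
        max_prefill_iters=8))     # upper bound on prefill chunks per sequence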
2.2.1.3 Dialog template configuration
@dataclass
class ChatTemplateConfig:
    model_name: str
    system: Optional[str] = None
    meta_instruction: Optional[str] = None
    eosys: Optional[str] = None
    user: Optional[str] = None
    eoh: Optional[str] = None
    assistant: Optional[str] = None
    eoa: Optional[str] = None
    separator: Optional[str] = None
    capability: Optional[Literal['completion', 'infilling', 'chat',
                                 'python']] = None
    stop_words: Optional[List[str]] = None
model_name: Dialog template name
system, meta_instruction, and eosys represent the name of the system role, the system prompt, and the end marker of the system prompt, respectively.
Taking the InternLM2 model as an example, the three attributes of its dialog template are as follows:
# InternLM2
system = '<|im_start|>system'
meta_instruction = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM can understand and communicate fluently in the language chosen by the user such as English and Chinese.
"""
eosys = '<|im_end|>\n'
- user and eoh are the name of the user role and the end marker of the user's prompt.
- assistant and eoa are the name of the AI assistant role and the end marker of the assistant's reply.
- separator: the separator between two rounds of conversation.
- capability: the capability of the model. 'completion' for text completion, 'chat' for dialog, 'infilling' for code infilling (codellama specific), 'python' for python coding capability (codellama specific).
- stop_words: stop words used to terminate the AI assistant's response. Currently, LMDeploy only supports stop words that tokenize to a single token_id.
Dialog templates are used to stitch together a dialog sequence.
Let's assume a dialog sequence is U1A1U2A2...Un, where Ui denotes the prompt entered by the user in the i-th round, and Ai denotes the answer generated by the model (the AI assistant) in the i-th round.
In LMDeploy, there are two ways of stitching a dialog sequence together, corresponding to two inference modes: interactive inference (stateful) and non-interactive inference (stateless).
The difference between the two modes is what the user inputs in the i-th round of dialog. In interactive mode the input is Ui; in non-interactive mode the input is U1A1U2A2...Ui. In other words, in interactive mode the user does not need to pass the dialog history, because it has already been cached by the inference engine (tokens, KV, cursor, etc.), whereas in non-interactive mode the user must pass the full dialog history.
The two stitching methods are implemented in BaseChatTemplate: get_prompt is used in interactive inference mode, and messages2prompt in non-interactive inference mode.
class BaseChatTemplate:

    def get_prompt(self, prompt, sequence_start=True):
        """Return decorated prompt in interactive inference mode."""
        if sequence_start:  # decorate U_0
            return f'{self.system}{self.meta_instruction}{self.eosys}' \
                   f'{self.user}{prompt}{self.eoh}' \
                   f'{self.assistant}'
        else:  # decorate U_i
            return f'{self.user}{prompt}{self.eoh}' \
                   f'{self.assistant}'

    def messages2prompt(self, messages, sequence_start=True, **kwargs):
        """Return decorated prompt in non-interactive inference mode.

        Args:
            messages (str|List): user's input prompt, which is supposed
                to be in OpenAI format
        """
        if isinstance(messages, str):
            # fallback to `get_prompt` when `messages` isn't a list
            return self.get_prompt(messages, sequence_start)
        # "box" indicates "begin of x (role)"
        box_map = dict(user=self.user,
                       assistant=self.assistant,
                       system=self.system)
        # "eox" indicates "end of x (role)"
        eox_map = dict(user=self.eoh,
                       assistant=self.eoa + self.separator,
                       system=self.eosys)
        ret = ''
        for message in messages:
            role = message['role']
            content = message['content']
            ret += f'{box_map[role]}{content}{eox_map[role]}'
        # append the assistant role to prompt the model to answer
        ret += f'{self.assistant}'
        return ret
In the above code, some of the logic has been simplified for the sake of focus. For the completed code, please refer to: https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/model.py
Conversation templates can be used in several ways:
- Without specifying a dialog template configuration, LMDeploy matches the built-in dialog template name based on the pathname of the model
- To specify the built-in dialog template
from lmdeploy import pipeline, ChatTemplateConfig
pipe = pipeline('/the/path/of/your/finetuned/internvl2/8b/model',
                chat_template_config=ChatTemplateConfig(
                    model_name='internvl2-internlm2'))
The `lmdeploy list` command displays the built-in chat templates. For the mapping between built-in templates and supported models, see Appendix - Built-in Dialog Templates.
- To change the properties of the built-in dialog templates
from lmdeploy import pipeline, ChatTemplateConfig
pipe = pipeline('/the/path/of/your/finetuned/internvl2/8b/model',
                chat_template_config=ChatTemplateConfig(
                    model_name='internvl2-internlm2',
                    meta_instruction='You are a helpful assistant'))
LMDeploy will merge the non-None attributes into the specified built-in chat template.
- Customizing Dialog Templates
Option 1: The chat template's attributes and stitching logic are fully consistent with the BaseChatTemplate definition.
All you need to do is set the ChatTemplateConfig fields, and LMDeploy will create a BaseChatTemplate instance from them.
Option 2: The chat template's attributes or stitching logic do not conform to the BaseChatTemplate definition. In this case, register and implement a custom chat template class, for example:
@register_module(name='awesome')
class MyChatTemplate:

    def __init__(self, *args, **kwargs):
        pass

    def get_prompt(self, prompt, sequence_start=True):
        if sequence_start:
            # TODO: return the decorated prompt when it is the first request of a sequence
            pass
        else:
            # TODO: return the decorated prompt when it is NOT the first request of a sequence
            pass

    def messages2prompt(self, messages, sequence_start=True, **kwargs):
        if isinstance(messages, str):
            return self.get_prompt(messages, sequence_start)
        # TODO: return the prompt after applying the chat template
        pass

pipe = pipeline('/the/path/of/your/awesome/model',
                chat_template_config=ChatTemplateConfig(
                    model_name='awesome'))
2.2.1.4 Visual model inference configuration
@dataclass
class VisionConfig:
    max_batch_size: int = 1
    thread_safe: bool = False
max_batch_size indicates the image batch size for the vision model. The larger the value, the higher the risk of OOM, because the LLM part of the VLM pre-allocates a large amount of memory in advance.
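For example, a sketch that raises the image batch size to 2, assuming VisionConfig can be imported from the top-level lmdeploy package as in recent releases; vision_config is forwarded to the pipeline through **kwargs:
from lmdeploy import VisionConfig, pipeline

pipe = pipeline(
    'OpenGVLab/InternVL2-8B',
    vision_config=VisionConfig(max_batch_size=2))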
2.2.2 Using pipeline
2.2.2.1 API
def __call__(self,
             prompts: Union[VLPromptType, List[Dict], List[VLPromptType],
                            List[List[Dict]]],
             gen_config: Optional[GenerationConfig] = None,
             **kwargs):
prompts: the user input, consisting of prompt text and image(s). It can take the following forms:
- str: plain text
- list[str]: plain text sequence
- tuple(str, PIL.Image): text + image
- tuple(str, list[PIL.Image]): text + image sequence
- list[tuple(str, PIL.Image)]: (text + image) sequence
- GPT4V format:
[{
    'role': 'user',
    'content': [{
        'type': 'text',
        'text': 'the input text prompt',
    },
    {
        'type': 'image_url',
        'image_url': {
            'url': 'data:image/jpeg;base64,{image_base64_data}'
        }
    },
    {
        'type': 'image_data',
        'image_data': {
            'data': PIL.Image
        }
    },
    ...
    {...}]
}]
LMDeploy will process the first 5 formats into the GPT4V format
gen_config: sampling parameters for token generation
@dataclass
class GenerationConfig:
    n: int = 1
    max_new_tokens: int = 512
    top_p: float = 1.0
    top_k: int = 1
    temperature: float = 0.8
    repetition_penalty: float = 1.0
    ignore_eos: bool = False
    random_seed: int = None
    stop_words: List[str] = None
    bad_words: List[str] = None
    min_new_tokens: int = None
    skip_special_tokens: bool = True
    logprobs: int = None
- n: the number of sequences generated for the input request. Currently only 1 is supported.
- max_new_tokens: the maximum number of tokens generated for the input request.
- top_p: sampling considers the smallest set of tokens whose cumulative probability exceeds top_p.
- top_k: sampling considers the top_k tokens with the highest probability. top_k=1 means greedy search.
- temperature: sampling temperature. temperature=0.0 means greedy search.
- repetition_penalty: a penalty that discourages the model from generating repeated words or phrases. A value greater than 1 suppresses repetition.
- ignore_eos: whether to ignore eos_token_id.
- random_seed: the seed used when sampling tokens.
- stop_words: stop words for token generation. Currently each stop_word must tokenize to a single token_id.
- bad_words: words that will never be generated. Currently each bad_word must also tokenize to a single token_id.
- min_new_tokens: the minimum number of tokens generated for the input request.
- skip_special_tokens: whether to skip special tokens during decoding. Default is True.
- logprobs: the number of log probabilities returned for each output token.
- For an introduction to the sampling method at generation time, we recommend reading https://huggingface.co/blog/how-to-generate.
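Putting the sampling parameters to use, a minimal sketch with illustrative values:
from lmdeploy import pipeline, GenerationConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
gen_config = GenerationConfig(
    max_new_tokens=256,       # cap the response length
    top_p=0.8,
    top_k=40,
    temperature=0.6,
    repetition_penalty=1.02)  # mild penalty against repeated phrases
response = pipe(('describe this image', image), gen_config=gen_config)
print(response)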
2.2.2.2 More examples
- Multi-image input
For multi-image scenarios, simply put the images in a list at inference time. Note that multiple images mean more input tokens, so the inference context length (session_len) usually needs to be increased.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(session_len=10000))
image_urls=[
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]
images = [load_image(img_url) for img_url in image_urls]
response = pipe(('describe these images', images))
print(response)
- Batched image-text input
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(session_len=8192))
image_urls=[
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)
- Multi-Round Dialogs
There are two ways for pipeline to conduct multi-round conversations. One is to construct messages in the GPT4V format, and the other is to use the pipeline.chat interface.
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image
pipe = pipeline('OpenGVLab/InternVL2-8B',
backend_config=TurbomindEngineConfig(session_len=8192))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.6)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
- Customizing the Location of Image Tokens
By default, LMDeploy inserts the special token representing the image into the user prompt according to the chat template provided by the upstream algorithm repo. However, some models, such as deepseek-vl, place no restriction on where the image token appears, or the user may want to control where it is inserted. In such cases, insert the image token into the prompt manually. LMDeploy uses <IMAGE_TOKEN> as the special image token.
from lmdeploy import pipeline
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN
pipe = pipeline('deepseek-ai/deepseek-vl-1.3b-chat')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe((f'describe this image{IMAGE_TOKEN}', image))
print(response)
2.3 Online Services
2.3.1 Starting the service
2.3.1.1 Method 1: Use the lmdeploy cli utility
lmdeploy serve api_server OpenGVLab/InternVL2-8B
This command starts an OpenAI-compatible model inference service on port 23333 of the local host. You can specify a different port with the --server-port option. For a more detailed description of the parameters, refer to the api_server Parameters section.
2.3.1.2 Method 2: Using docker
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 23333:23333 \
--ipc=host \
openmmlab/lmdeploy:latest \
lmdeploy serve api_server OpenGVLab/InternVL2-8B
2.3.1.3 api_server Parameters
root@lmdeploy-on-121:~/lmdeploy# lmdeploy serve api_server -h
usage: lmdeploy serve api_server [-h] [--server-name SERVER_NAME] [--server-port SERVER_PORT]
                                 [--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...]] [--allow-credentials]
                                 [--allow-methods ALLOW_METHODS [ALLOW_METHODS ...]]
                                 [--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...]] [--qos-config-path QOS_CONFIG_PATH]
                                 [--backend {pytorch,turbomind}]
                                 [--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}]
                                 [--api-keys [API_KEYS ...]] [--ssl] [--model-name MODEL_NAME]
                                 [--chat-template CHAT_TEMPLATE] [--revision REVISION] [--download-dir DOWNLOAD_DIR]
                                 [--adapters [ADAPTERS ...]] [--tp TP] [--session-len SESSION_LEN]
                                 [--max-batch-size MAX_BATCH_SIZE] [--cache-max-entry-count CACHE_MAX_ENTRY_COUNT]
                                 [--cache-block-seq-len CACHE_BLOCK_SEQ_LEN] [--enable-prefix-caching]
                                 [--model-format {hf,llama,awq,gptq}] [--quant-policy {0,4,8}]
                                 [--rope-scaling-factor ROPE_SCALING_FACTOR] [--num-tokens-per-iter NUM_TOKENS_PER_ITER]
                                 [--max-prefill-iters MAX_PREFILL_ITERS] [--vision-max-batch-size VISION_MAX_BATCH_SIZE]
                                 model_path
- model_path: the path of the deployed model, or its repo id on the huggingface/modelscope hub.
--server-name SERVER_NAME: Host IP address of the service. Default: 0.0.0.0.
--server-port SERVER_PORT: Service port. Default: 23333.
--allow-origins ALLOW_ORIGINS: List of allowed CORS sources. Default: ['*'].
--allow-credentials: Whether to allow CORS credentials. Default: False.
--allow-methods ALLOW_METHODS: List of allowed HTTP methods. Default: ['*'].
--allow-headers ALLOW_HEADERS: List of allowed HTTP headers. Default: ['*'].
--backend {pytorch,turbomind}: Set the inference backend. Default: turbomind.
--log-level {LEVELS}: Set the log level. Default: ERROR.
--api-keys [API_KEYS]: Optional list of API keys.
--ssl: Enable SSL. Requires the OS environment variables 'SSL_KEYFILE' and 'SSL_CERTFILE'.
--model-name MODEL_NAME: Service name of the model. Can be accessed via the RESTful API /v1/models. If not specified, model_path will be used.
--chat-template CHAT_TEMPLATE: When it is a string, it indicates the name of the built-in conversation template. When it is a JSON file path, it indicates a custom chat template.
--revision REVISION: The specific model version to use. Can be a branch name, tag name or commit ID.
--download-dir DOWNLOAD_DIR: Directory to download and load weights, defaults to the default cache directory for huggingface.
Parameters related to the TurboMind engine
--tp TP: Number of GPUs to be used in tensor parallelism.
--session-len SESSION_LEN: Maximum session length of the sequence
--max-batch-size MAX_BATCH_SIZE: Maximum batch size. Default: 128
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT: Percentage of free GPU memory occupied by the KV cache, excluding weights. Default: 0.8
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN: The number of tokens the KV cache block holds. For Turbomind engine, should be a multiple of 32 if GPU compute_capability >= 8.0, otherwise should be a multiple of 64. Default: 64.
--enable-prefix-caching: Whether to enable prefix matching KV caching. Default: False.
--model-format {hf,llama,awq,gptq}: input model format. hf means hf_llama, llama means meta_llama, awq means awq quantized model, gptq means gptq quantized model.
--quant-policy {0,4,8}: whether to quantize kv. 0: no quantization; 4: 4-bit kv; 8: 8-bit kv. default: 0
--rope-scaling-factor ROPE_SCALING_FACTOR: Rope scaling factor. Default: 0.0
--num-tokens-per-iter NUM_TOKENS_PER_ITER: Number of tokens processed in forward pass. Default: 0
--max-prefill-iters MAX_PREFILL_ITERS: Maximum number of forward passes in the prefill phase. Default: 1
Parameters related to the PyTorch engine
--adapters [ADAPTERS ...]: Used to set the path(s) of the lora model. Multiple lora key-value pairs can be provided in the format xxx=yyyy. If there is only one adapter, you can pass just the path to the adapter. Default: None. Type: string.
--tp TP: Number of GPUs to use in tensor parallelism
--session-len SESSION_LEN: Maximum session length of the sequence
--max-batch-size MAX_BATCH_SIZE: Maximum batch size. Default: 128
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT: Percentage of free GPU memory occupied by the KV cache, excluding weights. Default: 0.8
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN: The number of tokens the KV cache block holds. This parameter is ignored if the Lora adapter is specified. Default: 64
--enable-prefix-caching: Whether to enable prefix-matching KV cache. Default: False.
Vision model parameters:
--vision-max-batch-size VISION_MAX_BATCH_SIZE: Visual model batch size. Default: 1.
2.3.2 Accessing Services
It is recommended to use the openai client package interface to access the service
- image url
from openai import OpenAI
client = OpenAI(
api_key='YOUR_API_KEY', # dummy key to pass openai checking key
base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{
    'role': 'user',
    'content': [{
        'type': 'text',
        'text': 'Describe the image',
    }, {
        'type': 'image_url',
        'image_url': {
            'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
        },
    }],
}],
temperature=0.8,
top_p=0.8)
print(response)
If the model supports multiple images, you can append an image to the user's content list in messages
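For example, continuing with the client and model_name from the snippet above, a sketch of a payload carrying two image URLs (reusing the demo images from the offline examples); this only works if the model accepts multiple images:
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': 'Describe the two images',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
            },
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg',
            },
        }],
    }])
print(response)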
- Image base64 encoding
import base64
from openai import OpenAI
client = OpenAI(
api_key='YOUR_API_KEY', # dummy key to pass openai checking key
base_url='http://0.0.0.0:23333/v1')
def encode_image(image_path):
    with open(image_path, 'rb') as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = 'path_to_your_image.jpg'
# Getting the base64 string
base64_image = encode_image(image_path)
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{
    'role': 'user',
    'content': [{
        'type': 'text',
        'text': "What's in this image?",
    }, {
        'type': 'image_url',
        'image_url': {
            'url': f'data:image/jpeg;base64,{base64_image}',
        },
    }],
}],
max_tokens=300)
- Add additional information
from openai import OpenAI
client = OpenAI(
api_key='YOUR_API_KEY', # dummy key to pass openai checking key
base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{
    'role': 'user',
    'content': [{
        'type': 'text',
        'text': 'Describe the image',
    }, {
        'type': 'image_url',
        'image_url': {
            'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
        },
    }],
}],
temperature=0.8,
top_p=0.8,
extra_body={"repetition_penalty": 1.02}
)
print(response)
2.4 Model Quantization
2.4.1 Weight Quantization
- 4-bit weight quantization. The AWQ quantization algorithm quantizes only the language model part of the VLM, not the visual part.
- Supported graphics card models:
- V100 (0.6.0 support, not yet released)
- Turing(sm75): 20 series, T4
- Ampere(sm80): A100
- Ampere(sm86): 30 series, A10, A16, A30 etc.
- Ada Lovelace(sm89): 40 Series
- Hopper(sm90): H100, H800 (not yet deeply optimized)
lmdeploy lite auto_awq OpenGVLab/InternVL2-8B
2.4.2 KV Cache Quantization
Set the quant_policy parameter in TurbomindEngineConfig. For details, refer to the description of quant_policy in the Inference Engine Configuration section above.
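For example, a sketch that combines the AWQ-quantized weights from the previous subsection with 8-bit online KV cache quantization:
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'OpenGVLab/InternVL2-8B-AWQ',
    backend_config=TurbomindEngineConfig(
        model_format='awq',
        quant_policy=8))  # 8-bit online KV cache quantization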
03 Inference Performance
There is no standardized method for evaluating VLM inference performance.
The LLM pipeline is evaluated as follows; the test metric is RPS (requests per second):
python benchmark/profile_pipeline_api.py \
ShareGPT_V3_unfiltered_cleaned_split.json \
meta-llama/Meta-Llama-3-8B-Instruct \
--num-prompts 5000
On an A100-SXM4-80G GPU, the test results are as follows:
--------------------------------------------------
concurrency: 256
elapsed_time: 208.390s
first token latency(s)(min, max, ave): 0.068, 3.880, 0.378
per-token latency(s) percentile(50, 75, 95, 99): [0, 0.09, 0.153, 0.207]
number of prompt tokens: 1136185
number of completion tokens: 1008966
token throughput (completion token): 4841.723 token/s
token throughput (prompt + completion token): 10293.932 token/s
RPS (request per second): 23.993 req/s
RPM (request per minute): 1439.609 req/min
--------------------------------------------------
To test the performance of LLM serving, it is recommended to use vLLM's benchmark script https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py. Its test metrics are TTFT (time to first token) and TPOT (time per output token).
# Start the service first
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --max-batch-size 256
# Open a new terminal and test with vLLM's test script benchmark_serving.py
Versions used for comparison: vllm 0.4.2, lmdeploy 0.4.1, tensorrt-llm v0.9.0
04 VLM Inference Implementation
Initialize pipeline
Prompt preprocessing
In example 1 of the offline inference chapter, the model is OpenGVLab/InternVL2-8B and the request for pipeline inference is ('describe this image', image).
After “converting to GPT4V request format”, the request becomes:
{
    'role': 'user',
    'content': [
        {
            'type': 'text',
            'text': 'describe this image'
        },
        {
            'type': 'image_data',
            'image_data': {
                'data': image
            }
        }
    ]
}
After "adding the image token", the request becomes:
{'role': 'user', 'content': '<img><IMAGE_TOKEN></img>\ndescribe this image'}
InternVL2-8B puts the image before the text.
After decorating with the chat template (messages2prompt), the request becomes:
<|im_start|>system\nYou are InternVL (书生), a multimodal large model developed jointly by Shanghai AI Laboratory and SenseTime, a helpful and harmless AI assistant.<|im_end|>\n<|im_start|>user\n<img><IMAGE_TOKEN></img>\ndescribe this image<|im_end|>\n<|im_start|>assistant\n
Image Encoding
self.vl_encoder reuses image preprocessing and vision model inference from the upstream library.
Inference
TurboMind inference
PyTorch inference
TODO
05 Future Planning
06 Appendices
6.1 Built-in Dialog Templates
| Models | Model Type | Model structure | Built-in dialog template name | Description |
|---|---|---|---|---|
| InternLM-XComposer2 | MLLM | InternLMXComposer2ForCausalLM | internlm-xcomposer2 | |
| InternLM-XComposer2.5 | MLLM | InternLMXComposer2ForCausalLM | internlm-xcomposer2d5 | |
| Qwen-VL | MLLM | QWenLMHeadModel | qwen | |
| DeepSeek-VL | MLLM | MultiModalityCausalLM | deepseek-vl | |
| Phi-3-vision | MLLM | Phi3VForCausalLM | phi-3 | |
| CogVLM-Chat | MLLM | CogVLMForCausalLM | cogvlm | |
| CogVLM2-Chat | MLLM | CogVLMForCausalLM | cogvlm2 | |
| Yi-VL | MLLM | LlavaLlamaForCausalLM | yi-vl | |
| LLaVA-v1.5 | MLLM | LlavaLlamaForCausalLM | | |
| LLaVA-v1.6-vicuna | MLLM | LlavaLlamaForCausalLM | | |
| llava-v1.6-34b | MLLM | LlavaLlamaForCausalLM | llava-chatml | |
| llava-v1.6-mistral-7b | MLLM | LlavaMistralForCausalLM | mistral | |
| InternVL-Chat-V1-5 | MLLM | InternLM2ForCausalLM | internvl-internlm2 | For the InternVL series, check the architecture in llm_config |
| Mini-InternVL-Chat-2B-V1-5 | MLLM | InternLM2ForCausalLM | internvl-internlm2 | |
| Mini-InternVL-Chat-4B-V1-5 | MLLM | Phi3ForCausalLM | internvl-phi3 | |
| InternVL2 (2B, 8B, 26B) | MLLM | InternLM2ForCausalLM | internvl2-internlm2 | |
| InternVL2 (4B) | MLLM | Phi3ForCausalLM | internvl2-phi3 | |
| InternVL2 (40B) | MLLM | LlamaForCausalLM | internvl2-internlm2 | |
| InternVL2-Llama3-76B | MLLM | LlamaForCausalLM | internvl2-internlm2 | |
| MiniCPM-Llama3-V-2_5 | MLLM | MiniCPMV | llama3 | The LLM type can only be seen in the code |
| MiniGeminiLlama | MLLM | MiniGeminiLlamaForCausalLM | mini-gemini-vicuna | |
| GLM-4V | MLLM | ChatGLMModel | glm4 | |