Sharing: I deployed DeepSeek following the official tutorial "DeepSeek all-in-one deployment - full-version multi-node deployment" from the Moore Threads Documentation Center (摩尔线程文档中心), but loading the model fails with an error. Help appreciated.
You can reach me at: 920105409@qq.com
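Before the log, a quick sanity check on the parallelism settings passed to the deploy command below (a minimal sketch; the layer count of 61 is my assumption based on DeepSeek-R1's config.json `num_hidden_layers`): TP_SIZE * PP_SIZE must equal the total GPU count, and PP_LAYER_PARTITION must have one entry per pipeline stage and sum to the model's layer count.

```python
# Sanity-check of the parallelism settings from the docker stack deploy command.
# Assumption: DeepSeek-R1 has num_hidden_layers = 61 (from its config.json).
TP_SIZE = 8
PP_SIZE = 4
PP_LAYER_PARTITION = [16, 15, 15, 15]
NUM_HIDDEN_LAYERS = 61
HOSTS = 4            # 10.10.50.1 .. 10.10.50.4
GPUS_PER_HOST = 8    # MTHREADS_VISIBLE_DEVICES=0..7

# One partition entry per pipeline stage, covering every layer exactly once.
assert len(PP_LAYER_PARTITION) == PP_SIZE
assert sum(PP_LAYER_PARTITION) == NUM_HIDDEN_LAYERS
# Tensor parallel x pipeline parallel must cover every GPU in the cluster.
assert TP_SIZE * PP_SIZE == HOSTS * GPUS_PER_HOST
print("settings consistent")
```

These checks pass for the values in the command, so the failure below is unlikely to be a partition-arithmetic problem.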
mccxadmin@mccx:/data/MUSA/musa-deploy$ MODEL_NAME=deepseek-r1-671b MAX_MODEL_LEN=1024 TP_SIZE=8 PP_SIZE=4 PP_LAYER_PARTITION=16,15,15,15 WORKER_NUM=3 MODEL_PATH=/home/model/DeepSeek-R1-BF16 HOST_IP=10.10.50.1,10.10.50.2,10.10.50.3,10.10.50.4 docker stack deploy -c /tmp/stack_template.yaml deepseek-r1-671b
Creating service deepseek-r1-671b_task3
Creating service deepseek-r1-671b_task1
Creating service deepseek-r1-671b_task2
mccxadmin@mccx:/data/MUSA/musa-deploy$ docker logs -f deepseek-r1-671b_task2.1.85swmxja48sb5vnu3hpdl5kwk
* Starting OpenBSD Secure Shell server sshd
...done.
export MCCL_PROTOS=2
export MUSA_PRINT_ENV=1
export MUSA_HOME=/usr/local/musa
export MTHREADS_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export TRITON_CACHE_DIR=/tmp/triton
export LIBRARY_PATH=/opt/intel/oneapi/mkl/lib/intel64:
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/musa/lib
export VLLM_NCCL_SO_PATH=/usr/local/musa/lib/libmccl.so.2
export GLOO_SOCKET_IFNAME=ens1f0np0
export TP_SOCKET_IFNAME=ens1f0np0
export VLLM_PP_LAYER_PARTITION=16,15,15,15
2025-11-06 18:05:56,836 INFO scripts.py:1287 -- Did not find any active Ray processes.
2025-11-06_18:06:00
ray start 10.10.50.1
Warning: Permanently added '[10.10.50.1]:62262' (ED25519) to the list of known hosts.
2025-11-06 18:06:01,034 INFO usage_lib.py:467 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-11-06 18:06:01,034 INFO scripts.py:865 -- Local node IP: 10.10.50.1
2025-11-06 18:06:05,169 SUCC scripts.py:902 -- --------------------
2025-11-06 18:06:05,169 SUCC scripts.py:903 -- Ray runtime started.
2025-11-06 18:06:05,169 SUCC scripts.py:904 -- --------------------
2025-11-06 18:06:05,169 INFO scripts.py:906 -- Next steps
2025-11-06 18:06:05,170 INFO scripts.py:909 -- To add another node to this Ray cluster, run
2025-11-06 18:06:05,170 INFO scripts.py:912 -- ray start --address='10.10.50.1:63794'
2025-11-06 18:06:05,170 INFO scripts.py:921 -- To connect to this Ray cluster:
2025-11-06 18:06:05,170 INFO scripts.py:923 -- import ray
2025-11-06 18:06:05,170 INFO scripts.py:924 -- ray.init()
2025-11-06 18:06:05,170 INFO scripts.py:955 -- To terminate the Ray runtime, run
2025-11-06 18:06:05,170 INFO scripts.py:956 -- ray stop
2025-11-06 18:06:05,170 INFO scripts.py:959 -- To view the status of the cluster, use
2025-11-06 18:06:05,170 INFO scripts.py:960 -- ray status
MUSA_PRINT_ENV = true : Print MUSA environment variables.
MUSA_LOG = 0 : Print API trace and debug logging. Bitmask (MUSA_LOG=0xffff)
MUSA_LAUNCH_BLOCKING = false : Enable blocking kernel launches.
MUSA_MANAGED_FORCE_DEVICE_ALLOC = false : Forces the driver to place all managed allocations in device memory.
MUSA_DEVICE_ORDER = FASTEST_FIRST : Enumerate all devices by compute capability (default:FASTEST_FIRST) or PCI Bus ID (PCI_BUS_ID) order.
MUSA_MODULE_LOADING = DEFAULT : Specify the module loading for the application.
MUSA_ERROR_DUMP_VERBOSE = false : Dump musa error details to file.
MUSA_VISIBLE_DEVICES = : Specify visible devices.
MUSA_DUMP_DEVICE_BINARY = false : Dump device code object to file.
MUSA_DUMP_KERNEL_ASSEMBLY = false : Dump assembly to file named by kernel.
MUSA_FORCE_SINGLE_CORE = false : Force kernel execution on single core.
MUSA_EXECUTION_TIMEOUT = 200000 : Specify kernel and memory operations execution timeout(ms), 200000ms by default.
MUSA_MEMCPY_PATH = 0 : Select mu/musaMemcpy copyManager, 0: Default 1: DMA. 2: TDM. 3: CE. 4: CPU. 5:CDM. 6:CDMShaderCopy.
MUSA_MEMSET_PATH = 0 : Select mu/musaMemset copyManager, 0: Default 1: DMA. 2: TDM. 3: CE. 4: CPU. 5:CDM. 6:CDMShaderCopy.
MUSA_PRETEND_SUPPORT = false : Return success rather than not supported for performance affected only APIs.
MUSA_EXECUTE_COUNT = 0 : Set the number of blocks dispatched to each core.
MUSA_BLOCK_SCHEDULE_MODE = -1 : Set the schedule mode of kernel blocks, -1: DEFAULT. 0: DEMAND. 1: ROUND_ROBIN. 2: TASK_DEMAND. 3: DYNAMIC_BALANCED.
MUSA_BLOCK_ARBITRATION_MODE = -1 : Set the arbitration mode, -1: DEFAULT. 0: RoundRobin. 1: QueuePriority 2: KernelRoundRobin. 3: KernelQueuePriority
MUSA_BLOCK_STARTING = false : Config block dispatching starts from 0: last ending by default or 1: always core0.
MUSA_INFLIGHT_SUBMISSION_LIMIT = 0 : Config the limit of inflight submissions, The default value 0 gives control to driver.
MUSA_TRACK_COMMAND_TIMESTAMP = false : Track commands start and end timestamp.
MUSA_USERQ = 0 : Enable user queue. 1: user queue with doorbell 2: user queue without doorbell
MUSA_CDM_PREFETCH = false : Enable cdm prefetch.
MUSA_VIRTUAL_ALIGNMENT = 0x40000 : Specify memory mapping alignment, which will be clamped to the closest available page size.
MUSA_FORCE_SINGLE_QUEUE = false : Force use single compute hardware queue; 0: use multiple compute queue if avaiable or 1: use single compute queue.
MUSA_BLOCK_DISTRIBUTION_MODE = 1 : The block distribution unit; 0: thread based or 1: block based(by default).
MUSA_BLOCK_DISTRIBUTION_GRANULARITY = 0 : The block distribution granularity; 0: per mp(by default) or 1: per mpc.
MUSA_STREAM_ASYNC_CAPACITY = 1024 : Config async command capacity of one stream, futher async api on the stream will be blocked. Default value is 1024.
MUSA_SEMAPHORE_OPEN_MODE = 1 : The semaphore open mode; 0: mtlink first(fall back pcie if disenabled mtlink) or 1: only pcie(by default).
MUSA_USERQ_TERMINATE_TIMEOUT = 4294967295 : The timeout in ms of terminating running user queue
ray start 10.10.50.2
Warning: Permanently added '[10.10.50.2]:62262' (ED25519) to the list of known hosts.
[2025-11-06 18:06:12,595 W 71 71] global_state_accessor.cc:429: Retrying to get node with node ID 94be73a9be39075eb3c6854ae329d1ba6760cba9179cea11dbcbcf31
[2025-11-06 18:06:13,596 W 71 71] global_state_accessor.cc:429: Retrying to get node with node ID 94be73a9be39075eb3c6854ae329d1ba6760cba9179cea11dbcbcf31
2025-11-06 18:06:09,535 INFO scripts.py:1047 -- Local node IP: 10.10.50.2
2025-11-06 18:06:14,617 SUCC scripts.py:1063 -- --------------------
2025-11-06 18:06:14,617 SUCC scripts.py:1064 -- Ray runtime started.
2025-11-06 18:06:14,617 SUCC scripts.py:1065 -- --------------------
2025-11-06 18:06:14,617 INFO scripts.py:1067 -- To terminate the Ray runtime, run
2025-11-06 18:06:14,617 INFO scripts.py:1068 -- ray stop
(MUSA environment variable dump identical to the one printed for the first node above, omitted)
ray start 10.10.50.3
Warning: Permanently added '[10.10.50.3]:62262' (ED25519) to the list of known hosts.
[2025-11-06 18:06:19,033 W 72 72] global_state_accessor.cc:429: Retrying to get node with node ID c7cf68faf65d307eced8029d323ceb8e54a978f3ad7ffbc94a9b5cf8
2025-11-06 18:06:16,007 INFO scripts.py:1047 -- Local node IP: 10.10.50.3
2025-11-06 18:06:20,055 SUCC scripts.py:1063 -- --------------------
2025-11-06 18:06:20,056 SUCC scripts.py:1064 -- Ray runtime started.
2025-11-06 18:06:20,056 SUCC scripts.py:1065 -- --------------------
2025-11-06 18:06:20,056 INFO scripts.py:1067 -- To terminate the Ray runtime, run
2025-11-06 18:06:20,056 INFO scripts.py:1068 -- ray stop
(MUSA environment variable dump identical to the one printed for the first node above, omitted)
ray start 10.10.50.4
Warning: Permanently added '[10.10.50.4]:62262' (ED25519) to the list of known hosts.
[2025-11-06 18:06:24,292 W 74 74] global_state_accessor.cc:429: Retrying to get node with node ID c9b8753723bfd92f2083810341b04a5159e6ea8b5a13d51e628114a0
2025-11-06 18:06:21,379 INFO scripts.py:1047 -- Local node IP: 10.10.50.4
2025-11-06 18:06:25,313 SUCC scripts.py:1063 -- --------------------
2025-11-06 18:06:25,313 SUCC scripts.py:1064 -- Ray runtime started.
2025-11-06 18:06:25,313 SUCC scripts.py:1065 -- --------------------
2025-11-06 18:06:25,313 INFO scripts.py:1067 -- To terminate the Ray runtime, run
2025-11-06 18:06:25,313 INFO scripts.py:1068 -- ray stop
(MUSA environment variable dump identical to the one printed for the first node above, omitted)
======== Autoscaler status: 2025-11-06 18:06:21.946874 ========
Node status
---------------------------------------------------------------
Active:
1 node_9d122a9262a44c8f76f9bc2156e9de8cd317f0e0a8c5b27527dc1483
1 node_94be73a9be39075eb3c6854ae329d1ba6760cba9179cea11dbcbcf31
1 node_c7cf68faf65d307eced8029d323ceb8e54a978f3ad7ffbc94a9b5cf8
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/384.0 CPU
0.0/24.0 GPU
0.0/24.0 MTGPU
0B/2.72TiB memory
0B/228.00GiB object_store_memory
Demands:
(no resource demands)
INFO 11-06 18:06:34 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 11-06 18:06:34 __init__.py:32] name=musa, value=vllm_musa:register
INFO 11-06 18:06:34 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 11-06 18:06:34 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 11-06 18:06:34 __init__.py:44] plugin musa loaded.
INFO 11-06 18:06:34 __init__.py:198] Platform plugin musa is activated
INFO 11-06 18:06:34 api_server.py:912] vLLM API server version 0.7.4.dev0+ged6e9075d.d20250418
INFO 11-06 18:06:34 api_server.py:913] args: Namespace(subparser='serve', model_tag='/home/model/DeepSeek-R1-BF16', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/home/model/DeepSeek-R1-BF16', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=1024, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend='ray', pipeline_parallel_size=4, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=1024, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=64, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, 
limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-r1-671b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f872da78040>)
INFO 11-06 18:06:34 config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 11-06 18:06:40 config.py:549] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
WARNING 11-06 18:06:40 config.py:676] Async output processing can not be enabled with pipeline parallel
WARNING 11-06 18:06:40 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-06 18:06:40 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
INFO 11-06 18:06:41 config.py:3329] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 11-06 18:06:41 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.4.dev0+ged6e9075d.d20250418) with config: model='/home/model/DeepSeek-R1-BF16', speculative_config=None, tokenizer='/home/model/DeepSeek-R1-BF16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=musa, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=deepseek-r1-671b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":64}, use_cached_outputs=False,
2025-11-06 18:06:41,461 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.10.50.1:63794...
2025-11-06 18:06:41,472 INFO worker.py:1841 -- Connected to Ray cluster.
WARNING 11-06 18:06:43 utils.py:447] The environment variable HOST_IP is deprecated and ignored, as it is often used by Docker and other software to interact with the container's network stack. Please use VLLM_HOST_IP instead to set the IP address for vLLM processes to communicate with each other.
INFO 11-06 18:06:43 ray_distributed_executor.py:153] use_ray_spmd_worker: False
WARNING 11-06 18:06:43 utils.py:447] The environment variable HOST_IP is deprecated and ignored, as it is often used by Docker and other software to interact with the container's network stack. Please use VLLM_HOST_IP instead to set the IP address for vLLM processes to communicate with each other.
(RayWorkerWrapper pid=1027) INFO 11-06 18:06:48 __init__.py:30] Available plugins for group vllm.platform_plugins:
(RayWorkerWrapper pid=1027) INFO 11-06 18:06:48 __init__.py:32] name=musa, value=vllm_musa:register
(RayWorkerWrapper pid=1027) INFO 11-06 18:06:48 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
(RayWorkerWrapper pid=1027) INFO 11-06 18:06:48 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
(RayWorkerWrapper pid=1027) INFO 11-06 18:06:48 __init__.py:44] plugin musa loaded.
(RayWorkerWrapper pid=1027) INFO 11-06 18:06:48 __init__.py:198] Platform plugin musa is activated
(RayWorkerWrapper pid=360, ip=10.10.50.4) WARNING 11-06 18:06:49 utils.py:547] Overwriting environment variable MTHREADS_VISIBLE_DEVICES from '' to '0,1,2,3,4,5,6,7'
(RayWorkerWrapper pid=1030) WARNING 11-06 18:06:49 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
(RayWorkerWrapper pid=1030) WARNING 11-06 18:06:50 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
INFO 11-06 18:06:51 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_8a73909e'), local_subscribe_port=60815, remote_subscribe_port=None)
(RayWorkerWrapper pid=360, ip=10.10.50.4) INFO 11-06 18:06:51 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_ae73852c'), local_subscribe_port=54563, remote_subscribe_port=None)
INFO 11-06 18:06:51 model_runner.py:1110] Starting to load model /home/model/DeepSeek-R1-BF16...
(RayWorkerWrapper pid=358, ip=10.10.50.3) INFO 11-06 18:06:51 model_runner.py:1110] Starting to load model /home/model/DeepSeek-R1-BF16...
Loading safetensors checkpoint shards: 0% Completed | 0/163 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 1% Completed | 1/163 [00:01<05:23, 2.00s/it]
Loading safetensors checkpoint shards: 1% Completed | 2/163 [00:06<08:46, 3.27s/it]
Loading safetensors checkpoint shards: 2% Completed | 3/163 [00:07<06:17, 2.36s/it]
Loading safetensors checkpoint shards: 4% Completed | 6/163 [00:07<02:09, 1.21it/s]
Loading safetensors checkpoint shards: 5% Completed | 8/163 [00:09<02:24, 1.07it/s]
Loading safetensors checkpoint shards: 6% Completed | 9/163 [00:10<02:24, 1.07it/s]
Loading safetensors checkpoint shards: 7% Completed | 11/163 [00:15<03:39, 1.44s/it]
Loading safetensors checkpoint shards: 7% Completed | 12/163 [00:16<03:19, 1.32s/it]
Loading safetensors checkpoint shards: 9% Completed | 14/163 [00:16<02:08, 1.16it/s]
Loading safetensors checkpoint shards: 10% Completed | 16/163 [00:16<01:27, 1.68it/s]
Loading safetensors checkpoint shards: 11% Completed | 18/163 [00:17<01:17, 1.88it/s]
Loading safetensors checkpoint shards: 12% Completed | 20/163 [00:19<01:30, 1.58it/s]
Loading safetensors checkpoint shards: 13% Completed | 22/163 [00:20<01:44, 1.36it/s]
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] Error executing method 'load_model'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] Traceback (most recent call last):
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 573, in execute_method
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] return run_method(target, method, args, kwargs)
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2196, in run_method
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] return func(*args, **kwargs)
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 183, in load_model
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] self.model_runner.load_model()
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1112, in load_model
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] self.model = get_model(vllm_config=self.vllm_config)
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] return loader.load_model(vllm_config=vllm_config)
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 416, in load_model
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] raise ValueError(
(RayWorkerWrapper pid=366, ip=10.10.50.4) ERROR 11-06 18:07:21 worker_base.py:581] ValueError: Following weights were not initialized from checkpoint: {'model.layers.52.mlp.gate.e_score_correction_bias', 'model.layers.47.self_attn.q_a_proj.weight', 'model.layers.52.post_attention_layernorm.weight', 'model.layers.48.self_attn.q_b_proj.weight', 'model.layers.50.mlp.gate.weight', 'model.layers.52.input_layernorm.weight', 'model.layers.52.mlp.gate.weight', 'model.layers.56.self_attn.q_a_layernorm.weight', 'model.layers.54.self_attn.q_b_proj.weight', 'model.layers.53.mlp.gate.e_score_correction_bias', 'model.layers.47.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.53.self_attn.q_b_proj.weight', 'model.layers.54.self_attn.q_a_layernorm.weight', 'model.layers.48.self_attn.q_a_layernorm.weight', 'model.layers.56.self_attn.q_a_proj.weight', 'model.layers.48.mlp.shared_experts.gate_up_proj.weight', 'model.layers.47.self_attn.q_b_proj.weight', 'model.layers.47.self_attn.kv_a_layernorm.weight', 'model.layers.56.self_attn.q_b_proj.weight', 'model.layers.50.mlp.shared_experts.down_proj.weight', 'model.layers.50.self_attn.kv_a_layernorm.weight', 'model.layers.47.mlp.experts.w2_weight', 'model.layers.53.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.47.mlp.gate.weight', 'model.layers.52.self_attn.q_a_proj.weight', 'model.layers.54.mlp.gate.weight', 'model.layers.55.input_layernorm.weight', 'model.layers.54.self_attn.kv_a_layernorm.weight', 'model.layers.51.post_attention_layernorm.weight', 'model.layers.52.self_attn.o_proj.weight', 'model.layers.56.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.56.mlp.gate.weight', 'model.layers.53.self_attn.q_a_layernorm.weight', 'model.layers.50.self_attn.q_b_proj.weight', 'model.layers.56.self_attn.o_proj.weight', 'model.layers.52.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.54.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.48.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.50.self_attn.kv_a_proj_with_mqa.weight', 
'model.layers.47.mlp.gate.e_score_correction_bias', 'model.layers.53.mlp.shared_experts.down_proj.weight', 'model.layers.48.self_attn.kv_b_proj.weight', 'model.layers.53.mlp.shared_experts.gate_up_proj.weight', 'model.layers.48.mlp.gate.e_score_correction_bias', 'model.layers.47.mlp.shared_experts.gate_up_proj.weight', 'model.layers.49.input_layernorm.weight', 'model.layers.53.self_attn.q_a_proj.weight', 'model.layers.52.mlp.shared_experts.gate_up_proj.weight', 'model.layers.50.self_attn.o_proj.weight', 'model.layers.53.mlp.gate.weight', 'model.layers.50.self_attn.q_a_proj.weight', 'model.layers.47.self_attn.kv_b_proj.weight', 'model.layers.48.self_attn.o_proj.weight', 'model.layers.50.self_attn.q_a_layernorm.weight', 'model.layers.56.self_attn.kv_a_layernorm.weight', 'model.layers.54.mlp.shared_experts.gate_up_proj.weight', 'model.layers.53.self_attn.kv_a_layernorm.weight', 'model.layers.54.self_attn.kv_b_proj.weight', 'model.layers.56.mlp.shared_experts.gate_up_proj.weight', 'model.layers.52.self_attn.kv_a_layernorm.weight', 'model.layers.54.mlp.gate.e_score_correction_bias', 'model.layers.52.self_attn.q_b_proj.weight', 'model.layers.52.mlp.shared_experts.down_proj.weight', 'model.layers.55.post_attention_layernorm.weight', 'model.layers.50.mlp.gate.e_score_correction_bias', 'model.layers.48.mlp.gate.weight', 'model.layers.52.self_attn.kv_b_proj.weight', 'model.layers.56.self_attn.kv_b_proj.weight', 'model.layers.48.self_attn.q_a_proj.weight', 'model.layers.48.self_attn.kv_a_layernorm.weight', 'model.layers.51.input_layernorm.weight', 'model.layers.48.mlp.shared_experts.down_proj.weight', 'model.layers.50.mlp.shared_experts.gate_up_proj.weight', 'model.layers.47.self_attn.o_proj.weight', 'model.layers.56.mlp.gate.e_score_correction_bias', 'model.layers.54.self_attn.q_a_proj.weight', 'model.layers.53.self_attn.o_proj.weight', 'model.layers.47.mlp.shared_experts.down_proj.weight', 'model.layers.56.mlp.shared_experts.down_proj.weight', 
'model.layers.52.self_attn.q_a_layernorm.weight', 'model.layers.47.post_attention_layernorm.weight', 'model.layers.47.input_layernorm.weight', 'model.layers.50.self_attn.kv_b_proj.weight', 'model.layers.47.self_attn.q_a_layernorm.weight', 'model.layers.54.self_attn.o_proj.weight', 'model.layers.49.post_attention_layernorm.weight', 'model.layers.53.self_attn.kv_b_proj.weight', 'model.layers.54.mlp.shared_experts.down_proj.weight', 'model.layers.47.mlp.experts.w13_weight'}
(RayWorkerWrapper pid=367, ip=10.10.50.4) INFO 11-06 18:06:49 __init__.py:30] Available plugins for group vllm.platform_plugins: [repeated 31x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=367, ip=10.10.50.4) INFO 11-06 18:06:49 __init__.py:32] name=musa, value=vllm_musa:register [repeated 31x across cluster]
(RayWorkerWrapper pid=367, ip=10.10.50.4) INFO 11-06 18:06:49 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded. [repeated 31x across cluster]
(RayWorkerWrapper pid=367, ip=10.10.50.4) INFO 11-06 18:06:49 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load. [repeated 31x across cluster]
(RayWorkerWrapper pid=367, ip=10.10.50.4) INFO 11-06 18:06:49 __init__.py:44] plugin musa loaded. [repeated 31x across cluster]
(RayWorkerWrapper pid=367, ip=10.10.50.4) INFO 11-06 18:06:49 __init__.py:198] Platform plugin musa is activated [repeated 31x across cluster]
Loading safetensors checkpoint shards: 14% Completed | 23/163 [00:21<01:46, 1.31it/s]
Loading safetensors checkpoint shards: 15% Completed | 24/163 [00:23<02:02, 1.14it/s]
Loading safetensors checkpoint shards: 17% Completed | 27/163 [00:23<01:05, 2.07it/s]
Loading safetensors checkpoint shards: 17% Completed | 28/163 [00:27<02:37, 1.17s/it]
Loading safetensors checkpoint shards: 18% Completed | 29/163 [00:28<02:28, 1.10s/it]
Loading safetensors checkpoint shards: 19% Completed | 31/163 [00:28<01:39, 1.33it/s]
Loading safetensors checkpoint shards: 20% Completed | 32/163 [00:30<01:57, 1.12it/s]
Loading safetensors checkpoint shards: 20% Completed | 33/163 [00:32<02:34, 1.19s/it]
Loading safetensors checkpoint shards: 21% Completed | 34/163 [00:33<02:36, 1.21s/it]
(RayWorkerWrapper pid=361, ip=10.10.50.2) WARNING 11-06 18:06:49 utils.py:547] Overwriting environment variable MTHREADS_VISIBLE_DEVICES from '' to '0,1,2,3,4,5,6,7' [repeated 30x across cluster]
(RayWorkerWrapper pid=367, ip=10.10.50.4) WARNING 11-06 18:06:50 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") [repeated 30x across cluster]
(RayWorkerWrapper pid=367, ip=10.10.50.4) WARNING 11-06 18:06:50 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored. [repeated 30x across cluster]
(RayWorkerWrapper pid=358, ip=10.10.50.3) INFO 11-06 18:06:51 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_71351eaa'), local_subscribe_port=34733, remote_subscribe_port=None) [repeated 2x across cluster]
(RayWorkerWrapper pid=1035) INFO 11-06 18:06:51 model_runner.py:1110] Starting to load model /home/model/DeepSeek-R1-BF16... [repeated 30x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] Error executing method 'load_model'. This might cause deadlock in distributed execution. [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] Traceback (most recent call last): [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 573, in execute_method [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] return run_method(target, method, args, kwargs) [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2196, in run_method [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] return func(*args, **kwargs) [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 416, in load_model [repeated 27x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] self.model_runner.load_model() [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] self.model = get_model(vllm_config=self.vllm_config) [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] return loader.load_model(vllm_config=vllm_config) [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] raise ValueError( [repeated 9x across cluster]
(RayWorkerWrapper pid=361, ip=10.10.50.2) ERROR 11-06 18:07:26 worker_base.py:581] ValueError: Following weights were not initialized from checkpoint: {'model.layers.20.mlp.gate.e_score_correction_bias', 'model.layers.19.input_layernorm.weight', 'model.layers.17.mlp.shared_experts.gate_up_proj.weight', 'model.layers.23.mlp.shared_experts.gate_up_proj.weight', 'model.layers.27.mlp.gate.e_score_correction_bias', 'model.layers.23.self_attn.q_a_layernorm.weight', 'model.layers.27.self_attn.q_b_proj.weight', 'model.layers.23.self_attn.q_a_proj.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.17.mlp.gate.weight', 'model.layers.20.mlp.gate.weight', 'model.layers.17.self_attn.q_a_proj.weight', 'model.layers.17.mlp.gate.e_score_correction_bias', 'model.layers.17.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.23.self_attn.q_b_proj.weight', 'model.layers.27.mlp.shared_experts.down_proj.weight', 'model.layers.27.self_attn.q_a_layernorm.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.20.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.20.self_attn.kv_b_proj.weight', 'model.layers.23.self_attn.kv_a_layernorm.weight', 'model.layers.27.self_attn.kv_a_layernorm.weight', 'model.layers.19.post_attention_layernorm.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.23.self_attn.kv_b_proj.weight', 'model.layers.27.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.20.mlp.shared_experts.gate_up_proj.weight', 'model.layers.23.mlp.shared_experts.down_proj.weight', 'model.layers.27.mlp.shared_experts.gate_up_proj.weight', 'model.layers.23.mlp.gate.e_score_correction_bias', 'model.layers.27.mlp.gate.weight', 'model.layers.20.self_attn.kv_a_layernorm.weight', 'model.layers.23.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.17.mlp.shared_experts.down_proj.weight', 
'model.layers.20.self_attn.q_a_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.17.self_attn.kv_b_proj.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.27.self_attn.q_a_proj.weight', 'model.layers.17.self_attn.kv_a_layernorm.weight', 'model.layers.17.self_attn.q_b_proj.weight', 'model.layers.27.self_attn.kv_b_proj.weight', 'model.layers.20.self_attn.q_a_layernorm.weight', 'model.layers.17.self_attn.q_a_layernorm.weight', 'model.layers.20.mlp.shared_experts.down_proj.weight', 'model.layers.20.self_attn.q_b_proj.weight', 'model.layers.23.mlp.gate.weight'} [repeated 9x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] Error executing method 'load_model'. This might cause deadlock in distributed execution. [repeated 14x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] Traceback (most recent call last): [repeated 14x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 573, in execute_method [repeated 14x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] return run_method(target, method, args, kwargs) [repeated 14x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2196, in run_method [repeated 14x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] return func(*args, **kwargs) [repeated 14x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 416, in load_model [repeated 42x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] self.model_runner.load_model() [repeated 14x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] self.model = get_model(vllm_config=self.vllm_config) [repeated 14x across cluster]
Loading safetensors checkpoint shards: 22% Completed | 36/163 [00:37<03:05, 1.46s/it]
... (intermediate progress lines omitted) ...
Loading safetensors checkpoint shards: 100% Completed | 163/163 [03:52<00:00, 1.43s/it]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model [repeated 14x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] return loader.load_model(vllm_config=vllm_config) [repeated 14x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] raise ValueError( [repeated 14x across cluster]
(RayWorkerWrapper pid=365, ip=10.10.50.2) ERROR 11-06 18:07:34 worker_base.py:581] ValueError: Following weights were not initialized from checkpoint: {'model.layers.23.self_attn.q_a_proj.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.27.self_attn.q_a_proj.weight', 'model.layers.17.self_attn.q_b_proj.weight', 'model.layers.20.self_attn.q_a_proj.weight', 'model.layers.23.mlp.gate.e_score_correction_bias', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.23.mlp.shared_experts.gate_up_proj.weight', 'model.layers.27.mlp.gate.weight', 'model.layers.17.mlp.shared_experts.gate_up_proj.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.23.mlp.shared_experts.down_proj.weight', 'model.layers.23.self_attn.q_a_layernorm.weight', 'model.layers.20.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.23.mlp.gate.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.20.self_attn.q_a_layernorm.weight', 'model.layers.20.mlp.gate.e_score_correction_bias', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.27.mlp.gate.e_score_correction_bias', 'model.layers.27.self_attn.q_a_layernorm.weight', 'model.layers.20.self_attn.kv_a_layernorm.weight', 'model.layers.17.self_attn.kv_a_layernorm.weight', 'model.layers.17.self_attn.q_a_proj.weight', 'model.layers.17.mlp.shared_experts.down_proj.weight', 'model.layers.23.self_attn.q_b_proj.weight', 'model.layers.27.mlp.shared_experts.gate_up_proj.weight', 'model.layers.23.self_attn.kv_b_proj.weight', 'model.layers.27.self_attn.kv_b_proj.weight', 'model.layers.23.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.20.self_attn.kv_b_proj.weight', 'model.layers.20.mlp.shared_experts.down_proj.weight', 'model.layers.20.mlp.gate.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.17.mlp.gate.e_score_correction_bias', 'model.layers.27.mlp.shared_experts.down_proj.weight', 
'model.layers.23.self_attn.kv_a_layernorm.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.17.mlp.gate.weight', 'model.layers.17.self_attn.kv_b_proj.weight', 'model.layers.27.self_attn.q_b_proj.weight', 'model.layers.19.post_attention_layernorm.weight', 'model.layers.20.self_attn.q_b_proj.weight', 'model.layers.27.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.17.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.20.mlp.shared_experts.gate_up_proj.weight', 'model.layers.27.self_attn.kv_a_layernorm.weight', 'model.layers.17.self_attn.q_a_layernorm.weight'} [repeated 14x across cluster]
(RayWorkerWrapper pid=1025) INFO 11-06 18:10:54 model_runner.py:1115] Loading model weights took 40.1075 GB
INFO 11-06 18:10:54 model_runner.py:1115] Loading model weights took 40.1075 GB
MUSA_PRINT_ENV = true : Print MUSA environment variables.
MUSA_LOG = 0 : Print API trace and debug logging. Bitmask (MUSA_LOG=0xffff)
MUSA_LAUNCH_BLOCKING = false : Enable blocking kernel launches.
MUSA_MANAGED_FORCE_DEVICE_ALLOC = false : Forces the driver to place all managed allocations in device memory.
MUSA_DEVICE_ORDER = FASTEST_FIRST : Enumerate all devices by compute capability (default:FASTEST_FIRST) or PCI Bus ID (PCI_BUS_ID) order.
MUSA_MODULE_LOADING = DEFAULT : Specify the module loading for the application.
MUSA_ERROR_DUMP_VERBOSE = false : Dump musa error details to file.
MUSA_VISIBLE_DEVICES = : Specify visible devices.
MUSA_DUMP_DEVICE_BINARY = false : Dump device code object to file.
MUSA_DUMP_KERNEL_ASSEMBLY = false : Dump assembly to file named by kernel.
MUSA_FORCE_SINGLE_CORE = false : Force kernel execution on single core.
MUSA_EXECUTION_TIMEOUT = 200000 : Specify kernel and memory operations execution timeout(ms), 200000ms by default.
MUSA_MEMCPY_PATH = 0 : Select mu/musaMemcpy copyManager, 0: Default 1: DMA. 2: TDM. 3: CE. 4: CPU. 5:CDM. 6:CDMShaderCopy.
MUSA_MEMSET_PATH = 0 : Select mu/musaMemset copyManager, 0: Default 1: DMA. 2: TDM. 3: CE. 4: CPU. 5:CDM. 6:CDMShaderCopy.
MUSA_PRETEND_SUPPORT = false : Return success rather than not supported for performance affected only APIs.
MUSA_EXECUTE_COUNT = 0 : Set the number of blocks dispatched to each core.
MUSA_BLOCK_SCHEDULE_MODE = -1 : Set the schedule mode of kernel blocks, -1: DEFAULT. 0: DEMAND. 1: ROUND_ROBIN. 2: TASK_DEMAND. 3: DYNAMIC_BALANCED.
MUSA_BLOCK_ARBITRATION_MODE = -1 : Set the arbitration mode, -1: DEFAULT. 0: RoundRobin. 1: QueuePriority 2: KernelRoundRobin. 3: KernelQueuePriority
MUSA_BLOCK_STARTING = false : Config block dispatching starts from 0: last ending by default or 1: always core0.
MUSA_INFLIGHT_SUBMISSION_LIMIT = 0 : Config the limit of inflight submissions, The default value 0 gives control to driver.
MUSA_TRACK_COMMAND_TIMESTAMP = false : Track commands start and end timestamp.
MUSA_USERQ = 0 : Enable user queue. 1: user queue with doorbell 2: user queue without doorbell
MUSA_CDM_PREFETCH = false : Enable cdm prefetch.
MUSA_VIRTUAL_ALIGNMENT = 0x40000 : Specify memory mapping alignment, which will be clamped to the closest available page size.
MUSA_FORCE_SINGLE_QUEUE = false : Force use single compute hardware queue; 0: use multiple compute queue if avaiable or 1: use single compute queue.
MUSA_BLOCK_DISTRIBUTION_MODE = 1 : The block distribution unit; 0: thread based or 1: block based(by default).
MUSA_BLOCK_DISTRIBUTION_GRANULARITY = 0 : The block distribution granularity; 0: per mp(by default) or 1: per mpc.
MUSA_STREAM_ASYNC_CAPACITY = 1024 : Config async command capacity of one stream, futher async api on the stream will be blocked. Default value is 1024.
MUSA_SEMAPHORE_OPEN_MODE = 1 : The semaphore open mode; 0: mtlink first(fall back pcie if disenabled mtlink) or 1: only pcie(by default).
MUSA_USERQ_TERMINATE_TIMEOUT = 4294967295 : The timeout in ms of terminating running user queue
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
args.dispatch_function(args)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/cli/serve.py", line 34, in cmd
uvloop.run(run_server(args))
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 947, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 163, in build_async_engine_client_from_engine_args
engine_client = AsyncLLMEngine.from_engine_args(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 644, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 594, in __init__
self.engine = self._engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 273, in __init__
self.model_executor = executor_class(vllm_config=vllm_config, )
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 271, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_distributed_executor.py", line 90, in _init_executor
self._init_workers_ray(placement_group)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_distributed_executor.py", line 360, in _init_workers_ray
self._run_workers("load_model",
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_distributed_executor.py", line 485, in _run_workers
ray_worker_outputs = ray.get(ray_worker_outputs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2772, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.execute_method() (pid=358, ip=10.10.50.2, actor_id=6e2230b2d76764fc9d9cc52c01000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f592fc6c310>)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 582, in execute_method
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 573, in execute_method
return run_method(target, method, args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2196, in run_method
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1112, in load_model
self.model = get_model(vllm_config=self.vllm_config)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
return loader.load_model(vllm_config=vllm_config)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 416, in load_model
raise ValueError(
ValueError: Following weights were not initialized from checkpoint: {'model.layers.20.self_attn.kv_b_proj.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.23.self_attn.kv_b_proj.weight', 'model.layers.27.mlp.shared_experts.down_proj.weight', 'model.layers.23.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.27.self_attn.kv_a_layernorm.weight', 'model.layers.23.self_attn.q_a_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.27.mlp.gate.weight', 'model.layers.20.mlp.shared_experts.gate_up_proj.weight', 'model.layers.27.self_attn.q_a_proj.weight', 'model.layers.20.self_attn.kv_a_layernorm.weight', 'model.layers.20.self_attn.q_a_proj.weight', 'model.layers.23.self_attn.q_a_layernorm.weight', 'model.layers.20.self_attn.q_b_proj.weight', 'model.layers.27.mlp.shared_experts.gate_up_proj.weight', 'model.layers.17.self_attn.q_b_proj.weight', 'model.layers.23.mlp.shared_experts.down_proj.weight', 'model.layers.27.mlp.gate.e_score_correction_bias', 'model.layers.20.mlp.gate.e_score_correction_bias', 'model.layers.23.self_attn.q_b_proj.weight', 'model.layers.20.self_attn.q_a_layernorm.weight', 'model.layers.23.mlp.gate.weight', 'model.layers.17.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.27.self_attn.q_a_layernorm.weight', 'model.layers.23.mlp.shared_experts.gate_up_proj.weight', 'model.layers.17.self_attn.q_a_layernorm.weight', 'model.layers.17.mlp.shared_experts.down_proj.weight', 'model.layers.17.self_attn.kv_a_layernorm.weight', 'model.layers.17.mlp.gate.e_score_correction_bias', 'model.layers.17.self_attn.q_a_proj.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.23.self_attn.kv_a_layernorm.weight', 'model.layers.20.mlp.shared_experts.down_proj.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.23.mlp.gate.e_score_correction_bias', 'model.layers.17.mlp.shared_experts.gate_up_proj.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.20.mlp.gate.weight', 
'model.layers.19.post_attention_layernorm.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.27.self_attn.kv_b_proj.weight', 'model.layers.27.self_attn.q_b_proj.weight', 'model.layers.20.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.17.self_attn.kv_b_proj.weight', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.17.mlp.gate.weight', 'model.layers.27.self_attn.kv_a_proj_with_mqa.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.19.input_layernorm.weight'}
(RayWorkerWrapper pid=1034) INFO 11-06 18:10:54 model_runner.py:1115] Loading model weights took 40.1075 GB [repeated 6x across cluster]
INFO 11-06 18:10:55 ray_distributed_executor.py:104] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
mccxadmin@mccx:/data/MUSA/musa-deploy$
mccxadmin@mccx:/data/MUSA/musa-deploy$ ls /data/model/DeepSeek-R1-BF16/*.safetensors | wc -l
163
mccxadmin@mccx:/data/MUSA/musa-deploy$ grep -E "num_experts|moe" /data/model/DeepSeek-R1-BF16/config.json
"moe_intermediate_size": 2048,
"moe_layer_freq": 1,
"num_experts_per_tok": 8,
mccxadmin@mccx:/data/MUSA/musa-deploy$ ls /data/model/DeepSeek-R1-BF16/*.safetensors | head -n5
/data/model/DeepSeek-R1-BF16/model-00001-of-000163.safetensors
/data/model/DeepSeek-R1-BF16/model-00002-of-000163.safetensors
/data/model/DeepSeek-R1-BF16/model-00003-of-000163.safetensors
/data/model/DeepSeek-R1-BF16/model-00004-of-000163.safetensors
/data/model/DeepSeek-R1-BF16/model-00005-of-000163.safetensors
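To narrow down whether the checkpoint itself lacks these tensors (versus a sharding/loader issue), one could diff the expected tensor names against the `weight_map` in the checkpoint's `model.safetensors.index.json`. This is a minimal sketch, assuming the standard index layout; the toy `weight_map` and `expected` list below are illustrative placeholders, not the real index contents:

```python
import json

def missing_layer_keys(weight_map, expected_keys):
    """Return the expected tensor names that are absent from the index's weight_map."""
    return sorted(k for k in expected_keys if k not in weight_map)

# In practice, load the real index instead of this toy dict:
# weight_map = json.load(open(
#     "/data/model/DeepSeek-R1-BF16/model.safetensors.index.json"))["weight_map"]
weight_map = {
    "model.layers.47.self_attn.o_proj.weight": "model-00100-of-000163.safetensors",
}
expected = [
    "model.layers.47.self_attn.o_proj.weight",
    "model.layers.47.self_attn.q_a_proj.weight",
]
print(missing_layer_keys(weight_map, expected))
# -> ['model.layers.47.self_attn.q_a_proj.weight']
```

If the names from the ValueError do appear in the real `weight_map`, the checkpoint is complete and the problem is more likely in how the loader maps weights onto this parallel layout.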
mccxadmin@mccx:/data/MUSA/musa-deploy$
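For reference, here is a minimal sketch (my own, not from the official tooling) of how `VLLM_PP_LAYER_PARTITION=16,15,15,15` splits the model's 61 layers across the four pipeline stages. The missing-weight layer indices in the errors above (layers 16–27 on 10.10.50.2, layers 47–56 on 10.10.50.4) fall inside stages 1 and 3 of this partition, so each rank is failing on exactly its own slice of layers:

```python
def pp_layer_ranges(partition):
    """Map a comma-separated PP partition string to per-stage [start, end) layer ranges."""
    sizes = [int(s) for s in partition.split(",")]
    ranges, start = [], 0
    for n in sizes:
        ranges.append((start, start + n))
        start += n
    return ranges

print(pp_layer_ranges("16,15,15,15"))
# -> [(0, 16), (16, 31), (31, 46), (46, 61)]
```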
If possible, please share a download link.