CVE-2026-9540
Received - Intake
Denial of Service in vLLM OpenAI-Compatible Serving

Publication date: 2026-05-26

Last updated on: 2026-05-26

Assigner: VulDB

Description
A vulnerability was identified in vllm-project vllm 0.19.0. It affects unspecified processing in the OpenAI-compatible Serving Path component. Manipulation of requests leads to denial of service. The attack can be launched remotely. An exploit is publicly available and may be used. A pull request that fixes the issue is awaiting acceptance.
Meta Information
Published: 2026-05-26
Last Modified: 2026-05-26
Generated: 2026-05-26
AI Q&A: 2026-05-26
EPSS Evaluated: N/A
Affected Vendors & Products
Vendor         Product   Version / Range
vllm-project   vllm      0.19.0
CWE
CWE ID    Description
CWE-404   The product does not release or incorrectly releases a resource before it is made available for re-use.
AI Powered Q&A
How does this vulnerability affect compliance with common standards and regulations (like GDPR, HIPAA)?

The provided context and resources do not contain any information regarding the impact of CVE-2026-9540 on compliance with common standards and regulations such as GDPR or HIPAA.


Can you explain this vulnerability to me?

CVE-2026-9540 is a vulnerability in vllm-project vllm 0.19.0 that affects the OpenAI-compatible Serving Path component. It causes a denial of service (DoS) condition that can be triggered remotely. The root cause is a performance regression: a single request with high `n_completions` and `logprobs` values forces synchronous batch execution to wait for an expensive full-vocabulary softmax computation at every decode step, which stalls every other request in the same batch and causes significant latency spikes for them.

The result is a severe Time To First Token (TTFT) regression: other requests become 76 to 423 times slower. The expensive request itself completes normally, while the co-scheduled requests absorb the extra compute overhead and delay.
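
The stall can be illustrated with a toy cost model. This is a sketch, not vLLM code: the `Request` class, `step_latency_ms`, and both millisecond constants are hypothetical and chosen only for illustration; they are not measurements. It assumes that every request in a synchronous batch waits for the whole decode step, and that each sequence requesting logprobs adds a fixed softmax cost to that step.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request descriptor using the advisory's parameter names."""
    n_completions: int = 1
    logprobs: int = 0  # 0 means no logprobs were requested


def step_latency_ms(batch, base_ms=10.0, logprob_seq_ms=5.0):
    """Toy model of one synchronous decode step.

    Every request in the batch observes the same step latency, so the
    softmax work of one heavy request is paid by all of its neighbours.
    The constants are arbitrary illustration values.
    """
    # Each completion of a logprobs-requesting request is one extra
    # sequence that needs a full-vocabulary softmax in this step.
    heavy_sequences = sum(r.n_completions for r in batch if r.logprobs > 0)
    return base_ms + logprob_seq_ms * heavy_sequences


light = [Request() for _ in range(4)]
heavy = Request(n_completions=8, logprobs=20)

print(step_latency_ms(light))            # 10.0
print(step_latency_ms(light + [heavy]))  # 50.0
```

In this model the four light requests see their per-step latency grow fivefold purely because of their neighbour, which is the shape of the regression the advisory describes, though not its magnitude.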


How can this vulnerability impact me?

This vulnerability can cause a denial of service by severely degrading the performance of the vLLM serving engine. While a request with high `n_completions` and `logprobs` values is being processed, other requests in the same batch can experience extreme latency spikes, with responses taking hundreds of times longer than normal.

As a result, users relying on the vLLM service may face slow or unresponsive AI inference, impacting the availability and reliability of applications that depend on it.


How can this vulnerability be detected on my network or system? Can you suggest some commands?

The vulnerability manifests as a significant latency spike (Time To First Token regression) for requests that are co-scheduled in vLLM with a request carrying high `n_completions` and `logprobs` values. Detection involves monitoring for unusual delays or denial of service symptoms during inference requests.

One practical approach to detect this issue is to observe the latency of requests, especially when parameters like `n_completions=8` and `logprobs=20` are used. Automated fuzzing and performance tracing tools can help identify the problem.

For system-level detection, tools such as eBPF kernel tracing combined with analysis of CPU context switches and GPU usage can reveal CPU contention and GPU time starvation caused by the vulnerability.

  • Monitor request latency for spikes, especially with high `n_completions` and `logprobs` parameters.
  • Use eBPF kernel tracing to analyze CPU context switches and GPU activity.
  • Trace CUDA calls and host events to identify blocking caused by full-vocabulary softmax operations.

The resources do not provide specific commands. Detection instead relies on setting up eBPF tracing and analyzing context switches and GPU usage, as described in the debugging approach.
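
In the absence of published commands, the latency-monitoring bullet above could be approximated with a small script run over TTFT samples collected by whatever means is available (client logs, a metrics endpoint, a load-test tool). The following is a minimal sketch under that assumption; `find_ttft_spikes`, the factor of 10, and the sample values are all hypothetical and should be tuned to the deployment.

```python
from statistics import median


def find_ttft_spikes(samples_ms, factor=10.0, warmup=5):
    """Return indices of TTFT samples that exceed `factor` times the
    median of all samples seen before them.

    The median is used as the baseline because it stays stable even
    after a few extreme outliers have been recorded.
    """
    spikes = []
    for i in range(warmup, len(samples_ms)):
        baseline = median(samples_ms[:i])
        if baseline > 0 and samples_ms[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Made-up TTFT samples in milliseconds; index 6 imitates a stalled request.
samples = [50, 52, 48, 51, 49, 50, 4000, 51]
print(find_ttft_spikes(samples))  # [6]
```

A flagged index only says that a request was unusually slow. Correlating it with a concurrent request that used high `n_completions` and `logprobs` values is still needed before attributing the spike to this issue.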


What immediate steps should I take to mitigate this vulnerability?

The immediate mitigation involves applying the fix introduced in the vLLM Scheduler, which limits the cumulative compute cost of requests with high `n_completions` and `logprobs` values within a batch.

This fix introduces a `max_num_batched_logprobs` budget in the SchedulerConfig and the V1 Scheduler, which defers heavy sampling tasks exceeding this budget to the next batch, isolating them from simpler requests and preventing latency spikes.
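
The budget idea can be sketched in a few lines. This is not the code from the pull request: only the name `max_num_batched_logprobs` comes from the advisory, while `Req`, `schedule_step`, and the cost formula (`n_completions` multiplied by `logprobs`) are assumptions made for illustration. The sketch also assumes that an over-budget request is allowed to run when the batch is otherwise empty, so that it is isolated rather than starved.

```python
from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical request record; not a vLLM class."""
    req_id: str
    n_completions: int = 1
    logprobs: int = 0

    @property
    def logprob_cost(self):
        # Assumed cost metric: completions times requested logprobs.
        return self.n_completions * self.logprobs


def schedule_step(waiting, max_num_batched_logprobs):
    """Split the waiting queue into (scheduled, deferred) for one batch."""
    scheduled, deferred = [], []
    used = 0
    for req in waiting:
        fits = used + req.logprob_cost <= max_num_batched_logprobs
        # An over-budget request may still run if the batch is empty,
        # which isolates it instead of deferring it forever.
        runs_alone = not scheduled
        if fits or runs_alone:
            scheduled.append(req)
            used += req.logprob_cost
        else:
            deferred.append(req)
    return scheduled, deferred


queue = [Req("a"), Req("b"), Req("heavy", n_completions=8, logprobs=20), Req("c")]
batch, later = schedule_step(queue, max_num_batched_logprobs=64)
print([r.req_id for r in batch])  # ['a', 'b', 'c']
print([r.req_id for r in later])  # ['heavy']
```

Calling `schedule_step(later, 64)` on the next iteration admits the heavy request into a batch of its own, so the lightweight requests never share a decode step with it.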

Until the pull request fixing this issue is accepted and merged, you can mitigate the problem by avoiding or limiting the use of requests with high `n_completions` and `logprobs` values concurrently.

  • Apply the Scheduler fix with `max_num_batched_logprobs` budget once available.
  • Avoid sending requests with high `n_completions` and `logprobs` parameters simultaneously.
  • Monitor and prioritize requests to prevent heavy sampling tasks from blocking others.
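
Until a fixed release is available, the second bullet could be enforced in a gateway or proxy placed in front of the server. The check below is a minimal sketch: `check_sampling_cost` and the `max_cost` threshold are hypothetical, and the payload keys follow the advisory's wording, so they should be replaced with the field names the deployed API actually accepts.

```python
def check_sampling_cost(payload, max_cost=32):
    """Return (allowed, cost) for a parsed request body.

    The cost is the assumed product of completions and logprobs; a
    missing or null field falls back to the cheapest value.
    """
    n = int(payload.get("n_completions") or 1)
    logprobs = int(payload.get("logprobs") or 0)
    cost = n * logprobs
    return cost <= max_cost, cost


print(check_sampling_cost({"n_completions": 8, "logprobs": 20}))  # (False, 160)
print(check_sampling_cost({"n_completions": 2, "logprobs": 5}))   # (True, 10)
print(check_sampling_cost({}))                                    # (True, 0)
```

A gateway could reject requests that fail the check, clamp their parameters, or route them to a separate replica so that they cannot be batched with latency-sensitive traffic.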

Additional recommendations include tuning system-level settings such as pinning engine threads and deprioritizing background jobs to reduce CPU contention, as demonstrated in the debugging approach.
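
On Linux, the pinning and deprioritizing described above can be approximated from Python with the standard library. This is a sketch under assumptions: the CPU counts are arbitrary, `split_cpus` and `pin_current_process` are hypothetical helpers, and the resources do not say how the original debugging setup applied these settings.

```python
import os


def split_cpus(cpus, n_engine):
    """Reserve the lowest n_engine CPU ids for the engine process and
    leave the remainder for background jobs."""
    ordered = sorted(cpus)
    if not 0 < n_engine < len(ordered):
        raise ValueError("n_engine must leave at least one CPU on each side")
    return set(ordered[:n_engine]), set(ordered[n_engine:])


def pin_current_process(cpu_ids):
    """Pin the calling process to cpu_ids.

    Returns False on platforms without os.sched_setaffinity (it is
    Linux-specific) instead of raising.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, cpu_ids)  # pid 0 means the calling process
    return True


engine_cpus, background_cpus = split_cpus(range(8), n_engine=6)
print(sorted(engine_cpus))      # [0, 1, 2, 3, 4, 5]
print(sorted(background_cpus))  # [6, 7]
```

The engine process would call `pin_current_process(engine_cpus)` at startup, while a background job would pin itself to `background_cpus` and could additionally lower its priority with `os.nice(10)`.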

