CVE-2026-9540

Deferred - Pending Action

Denial of Service in vLLM OpenAI-Compatible Serving

Publication date: 2026-05-26

Last updated on: 2026-07-23

Assigner: VulDB

Description

A vulnerability was identified in vllm-project vllm 0.19.0. It affects unspecified processing in the OpenAI-compatible Serving Path component. Manipulating this processing leads to denial of service. The attack can be launched remotely. An exploit is publicly available and may be used. A pull request that fixes the issue is awaiting acceptance.

Meta Information

Field	Value
Published	2026-05-26
Last Modified	2026-07-23
Generated	2026-07-26
AI Q&A	2026-05-26
EPSS Evaluated	2026-07-24
NVD	CVE-2026-9540
EUVD	EUVD-2026-31810

Affected Vendors & Products

Vendor	Product	Version / Range
vllm-project	vllm	0.19.0

Helpful Resources

CWE

CWE ID	Description
CWE-404	The product does not release or incorrectly releases a resource before it is made available for re-use.

Executive Summary

CVE-2026-9540 is a vulnerability in vllm-project vllm 0.19.0 that affects the OpenAI-compatible Serving Path component. It causes a denial of service (DoS) condition that can be triggered remotely. The root cause is a performance regression: a single request with high `n_completions` and `logprobs` values forces synchronous batch execution to wait for an expensive full-vocabulary softmax computation at every decode step, which stalls every other request co-scheduled in the same batch.

The result is a severe Time To First Token (TTFT) regression: other requests in the batch see latency 76 to 423 times higher than normal. The expensive request itself completes normally, while the requests batched with it bear the extra compute overhead and delay.
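The stall described above can be illustrated with a toy cost model. This is a sketch under stated assumptions, not vLLM's code: the vocabulary size, the cost units, and the names `Request`, `sampling_cost`, and `step_cost` are all hypothetical, and the per-step cost of a request that asks for logprobs is simply modeled as one full-vocabulary pass per sequence. The field `n` stands in for what this advisory calls `n_completions`.

```python
from dataclasses import dataclass

VOCAB_SIZE = 128_000   # hypothetical vocabulary size
BASE_COST = 1_000      # hypothetical per-sequence cost without logprobs


@dataclass
class Request:
    name: str
    n: int = 1         # number of completions (n_completions in this advisory)
    logprobs: int = 0  # number of logprobs requested; 0 means none


def sampling_cost(req: Request) -> int:
    """Per-decode-step sampling cost in arbitrary units (toy model)."""
    if req.logprobs > 0:
        # Assumption: logprobs force a full-vocabulary softmax per sequence.
        return req.n * VOCAB_SIZE
    return req.n * BASE_COST


def step_cost(batch: list[Request]) -> int:
    """A synchronous step ends only when every request's work is done."""
    return sum(sampling_cost(r) for r in batch)


light = [Request(f"chat-{i}") for i in range(4)]
heavy = Request("heavy", n=8, logprobs=20)

baseline = step_cost(light)
contended = step_cost(light + [heavy])
slowdown = contended / baseline
print(f"baseline={baseline} contended={contended} slowdown={slowdown:.0f}x")
```

The numbers printed by this model are illustrative only and are not the measurements reported for the vulnerability. The point is structural: because the step is synchronous, every light request pays for the heavy request's sampling work at each decode step.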

Detection Guidance

The vulnerability manifests as a significant latency spike (a TTFT regression) for requests that are co-scheduled in vLLM with a request carrying high `n_completions` and `logprobs` values. Detection therefore relies on monitoring inference requests for unusual delays or denial-of-service symptoms.

One practical approach is to observe request latency while requests with parameters such as `n_completions=8` and `logprobs=20` are in flight. Automated fuzzing and performance tracing tools can also help identify the problem.

For system-level detection, eBPF kernel tracing combined with analysis of CPU context switches and GPU usage can reveal the CPU contention and GPU time starvation caused by the vulnerability.

- Monitor request latency for spikes, especially when high `n_completions` and `logprobs` parameters are in use.
- Use eBPF kernel tracing to analyze CPU context switches and GPU activity.
- Trace CUDA calls and host events to identify blocking caused by full-vocabulary softmax operations.

The referenced resources do not provide specific commands. Detection can follow the debugging approach they describe: set up eBPF tracing, then analyze the resulting logs for context switches and GPU usage.
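The latency-monitoring step can be sketched as a small spike detector over recorded TTFT samples. This is a minimal sketch, not a tool from the referenced resources: the function name `find_ttft_spikes`, the window size, the warm-up length, and the 50x threshold are all assumptions chosen for illustration (the threshold sits below the 76x lower bound quoted in this advisory). The samples are assumed to come from a client that measures the time to the first streamed token, in milliseconds.

```python
from statistics import median


def find_ttft_spikes(samples_ms, baseline_window=20, warmup=5, ratio_threshold=50.0):
    """Return (index, ttft_ms, ratio) for samples far above the rolling median."""
    spikes = []
    history = []  # recent non-spike samples used as the baseline
    for i, ttft in enumerate(samples_ms):
        if len(history) >= warmup:
            base = median(history[-baseline_window:])
            ratio = ttft / base if base > 0 else float("inf")
            if ratio >= ratio_threshold:
                spikes.append((i, ttft, ratio))
                continue  # keep spikes out of the baseline
        history.append(ttft)
    return spikes


# Ten normal requests, one stalled request, then five normal requests.
samples = [50] * 10 + [4000] + [50] * 5
for index, ttft, ratio in find_ttft_spikes(samples):
    print(f"request {index}: TTFT {ttft} ms is {ratio:.0f}x the recent median")
```

Correlating the flagged indices with the arrival of requests that carry high `n_completions` and `logprobs` values is what ties a spike to this issue rather than to ordinary load.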

Impact Analysis

This vulnerability can cause a significant denial of service by severely degrading the performance of the vLLM serving engine. When a request with high `n_completions` and `logprobs` values is processed, other requests in the same batch can experience extreme latency spikes, with responses taking up to hundreds of times longer than normal.

As a result, users relying on the vLLM service may face slow or unresponsive AI inference, impacting the availability and reliability of applications that depend on it.

Compliance Impact

The available resources contain no information on how CVE-2026-9540 affects compliance with common standards and regulations such as GDPR or HIPAA.

Mitigation Strategies

The primary mitigation is to apply the proposed fix to the vLLM Scheduler, which limits the cumulative compute cost of requests with high `n_completions` and `logprobs` values within a batch.

The fix adds a `max_num_batched_logprobs` budget to the SchedulerConfig and the V1 Scheduler. Heavy sampling tasks that exceed the budget are deferred to the next batch, which isolates them from simpler requests and prevents the latency spikes.
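The budget idea can be sketched with a toy scheduler. This is not the code from the pending pull request: only the name `max_num_batched_logprobs` comes from the description above, while `Req`, `logprob_cost`, `schedule_step`, and the cost metric (completions multiplied by requested logprobs) are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    req_id: str
    n: int = 1         # number of completions (n_completions in this advisory)
    logprobs: int = 0  # number of logprobs requested

    @property
    def logprob_cost(self) -> int:
        # Assumed cost metric: sequences x requested logprobs.
        return self.n * self.logprobs


def schedule_step(waiting: deque, max_num_batched_logprobs: int) -> list:
    """Pick one batch; defer requests that would exceed the logprobs budget."""
    batch, deferred, used = [], [], 0
    while waiting:
        req = waiting.popleft()
        cost = req.logprob_cost
        if used + cost <= max_num_batched_logprobs:
            batch.append(req)
            used += cost
        elif not batch:
            # Over budget even on its own: run it in isolation so it cannot starve.
            batch.append(req)
            break
        else:
            deferred.append(req)
    # Deferred requests return to the front of the queue in their original order.
    waiting.extendleft(reversed(deferred))
    return batch


queue = deque([Req("a"), Req("heavy", n=8, logprobs=20), Req("b")])
step1 = [r.req_id for r in schedule_step(queue, max_num_batched_logprobs=64)]
step2 = [r.req_id for r in schedule_step(queue, max_num_batched_logprobs=64)]
print(step1, step2)
```

In this sketch the light requests run together in the first step and the heavy request runs alone in the second, so its sampling cost no longer lands on the other requests' first token.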

Until the pull request fixing this issue is accepted and merged, you can mitigate the problem by avoiding or limiting requests that combine high `n_completions` and `logprobs` values.

- Apply the Scheduler fix with the `max_num_batched_logprobs` budget once it is available.
- Avoid sending requests with high `n_completions` and `logprobs` parameters simultaneously.
- Monitor and prioritize requests to prevent heavy sampling tasks from blocking others.
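As an interim measure, the second recommendation above can be enforced in front of the server. The following is a minimal sketch of a request check for a proxy or gateway, under several assumptions: the function names and the limit of 32 are hypothetical, the completion count is read from the `n` field that OpenAI-style request bodies usually carry (with `n_completions`, the name used in this advisory, accepted as a fallback), and `top_logprobs` is treated as the chat-style way of requesting logprobs.

```python
def requested_logprobs(payload: dict) -> int:
    """Number of logprobs a request asks for, across completion and chat styles."""
    top = payload.get("top_logprobs")
    if isinstance(top, int) and not isinstance(top, bool):
        return top
    value = payload.get("logprobs")
    if isinstance(value, bool):
        return 1 if value else 0
    return value if isinstance(value, int) else 0


def check_request(payload: dict, max_logprob_budget: int = 32):
    """Return (allowed, reason); reject requests whose n x logprobs is too high."""
    n = payload.get("n", payload.get("n_completions", 1)) or 1
    cost = n * requested_logprobs(payload)
    if cost > max_logprob_budget:
        return False, f"n x logprobs = {cost} exceeds limit {max_logprob_budget}"
    return True, "ok"


print(check_request({"prompt": "hi", "n": 8, "logprobs": 20}))
print(check_request({"prompt": "hi"}))
```

Rejecting is the simplest policy; a deployment could instead clamp the values or route such requests to a separate replica, depending on which clients legitimately need them.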

Additional recommendations include system-level tuning, such as pinning engine threads and deprioritizing background jobs, to reduce CPU contention, as demonstrated in the debugging approach.
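The system-level tuning can be sketched with Python's standard library on Linux. This is an illustration only, since the referenced debugging approach does not give exact commands: the helper names `pin_process` and `deprioritize_self` are hypothetical, and which CPUs to reserve for the engine is a deployment decision. `os.sched_setaffinity` and `os.sched_getaffinity` are available only on some platforms (notably Linux), so the example guards for them.

```python
import os


def pin_process(pid: int, cpus: set) -> set:
    """Restrict `pid` (0 = the caller) to the given CPUs; return the new affinity."""
    allowed = os.sched_getaffinity(pid)
    # Fall back to the current affinity if none of the requested CPUs is usable.
    target = (set(cpus) & allowed) or allowed
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)


def deprioritize_self(increment: int = 10) -> int:
    """Raise this process's niceness; intended to be called inside a background job."""
    return os.nice(increment)


if hasattr(os, "sched_setaffinity"):
    current = os.sched_getaffinity(0)
    # Re-applying the current set is a no-op; pass a smaller set to actually pin.
    print("affinity:", sorted(pin_process(0, current)))
```

A typical arrangement would pin the serving engine to a dedicated set of cores and call something like `deprioritize_self()` at the start of batch or maintenance jobs that share the host.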
