CVE-2026-53923

Analyzed - Analysis Complete

Integer Truncation in vLLM GGUF Dequantize Kernels Leads to Information Disclosure

Publication date: 2026-06-22

Last updated on: 2026-06-24

Assigner: GitHub, Inc.

Description

vLLM is an inference and serving engine for large language models (LLMs). From 0.5.5 until 0.23.1rc0, integer truncation of tensor dimensions in vLLM's GGUF dequantize kernels (csrc/quantization/gguf/gguf_kernel.cu) causes partial tensor processing. The output tensor is allocated at full size via torch::empty (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users' inference requests, constituting information disclosure. This vulnerability is fixed in 0.23.1rc0.
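
To see how a tensor can end up only partially processed, here is a small Python sketch of the narrowing arithmetic. It rests on two assumptions, since the advisory does not state the exact parameter types. First, the vulnerable kernels received the element count as a signed 32-bit int. Second, a negative count leads to no work being done. The helper names as_int32 and unwritten_elements are hypothetical.

```python
import math

INT32_MAX = (1 << 31) - 1


def as_int32(n: int) -> int:
    """Value of a non-negative count after narrowing to a 32-bit two's-complement int."""
    low = n & 0xFFFFFFFF  # keep only the low 32 bits, as the narrowing conversion does on common compilers
    return low - (1 << 32) if low > INT32_MAX else low


def unwritten_elements(*dims: int) -> int:
    """Output elements a kernel would never write if it only processed as_int32(numel).

    Assumption (hypothetical model): a negative truncated count means no elements are processed.
    """
    numel = math.prod(dims)
    processed = min(max(as_int32(numel), 0), numel)
    return numel - processed


total = 65536 * 65537
print(total, as_int32(total), unwritten_elements(65536, 65537))
# 4295032832 65536 4294967296
```

For a 65536 x 65537 tensor, the true count of 4,295,032,832 elements narrows to 65,536. All remaining elements of the output would keep whatever torch::empty left in them.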

Meta Information

Generated: 2026-08-02
EPSS evaluated: 2026-08-01
NVD: CVE-2026-53923

Affected Vendors & Products

Vendor	Product	Version / Range
vllm	vllm	From 0.5.5 (inc) to 0.23.1 (exc)

CWE

CWE ID	Description
CWE-681	When converting from one data type to another, such as long to integer, data can be omitted or translated in a way that produces unexpected values. If the resulting values are used in a sensitive context, then dangerous behaviors may occur.
CWE-200	The product exposes sensitive information to an actor that is not explicitly authorized to have access to that information.

Executive Summary

This vulnerability exists in vLLM, an inference and serving engine for large language models. Due to integer truncation in the GGUF dequantize kernels, only part of the output tensor is processed, while the rest remains uninitialized and retains residual data from GPU memory.

In multi-tenant inference deployments, this means that leftover data from other users' inference requests can be exposed, leading to information disclosure.
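
The following toy model, written in Python purely for illustration, shows why the unwritten portion matters. It simulates a buffer that the allocator hands back without clearing (as torch::empty does) and a kernel that writes only a truncated number of elements. The function dequantize_into and the 9.0 "residue" values are hypothetical. Real GPU caching-allocator behavior is more complex.

```python
from array import array
from typing import List


def dequantize_into(out: array, source: List[float], count: int, zero_fill: bool) -> array:
    """Hypothetical model of a dequantize kernel writing `count` elements into `out`.

    `out` stands in for reused memory that was never cleared (like torch::empty).
    With zero_fill=True it is cleared first, mirroring the zero-initialization
    described for the patched version.
    """
    if zero_fill:
        for i in range(len(out)):
            out[i] = 0.0
    # Only the (possibly truncated) count of elements is ever written.
    for i in range(min(count, len(out), len(source))):
        out[i] = source[i]
    return out


# Memory previously used by "tenant A" and released back to the pool.
pool = array("f", [9.0] * 8)
# "Tenant B" needs 8 values, but a truncated count lets the kernel write only 3.
print(list(dequantize_into(pool, [1.0] * 8, count=3, zero_fill=False)))
# [1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0, 9.0]

pool = array("f", [9.0] * 8)
print(list(dequantize_into(pool, [1.0] * 8, count=3, zero_fill=True)))
# [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

With zero_fill=True, a short write leaves zeros rather than another request's data. That is why zero-initialization complements widening the count to a 64-bit type.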

The issue affects versions 0.5.5 up to, but not including, 0.23.1rc0, where it is fixed.

Detection Guidance

This vulnerability arises from integer truncation in tensor dimension handling within vLLM's GGUF dequantize CUDA kernels, leading to partial tensor processing and potential exposure of residual GPU memory from other users in multi-tenant environments.

To detect this vulnerability on your system, first check which version of the vLLM library is installed. Versions 0.5.5 up to, but not including, 0.23.1rc0 are affected; upgrading to 0.23.1rc0 or later resolves the issue.
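
The version check can be scripted. The following Python sketch reads the installed version with importlib.metadata and compares it against the affected range from this advisory. It uses a deliberately simplified, hypothetical version parser that handles X.Y.Z with an optional rcN suffix and ignores .dev, .post and other suffixes. Treat borderline results with care, or use a full PEP 440 implementation instead.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Simplified parser (assumption): X.Y.Z with optional rcN; other suffixes are ignored.
_RELEASE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?")

FIRST_AFFECTED = "0.5.5"
FIRST_FIXED = "0.23.1rc0"


def _key(v: str) -> tuple:
    m = _RELEASE.match(v)
    if m is None:
        raise ValueError(f"unrecognized version string: {v!r}")
    major, minor, patch, rc = m.groups()
    # A final release sorts after all of its release candidates.
    pre = int(rc) if rc is not None else float("inf")
    return (int(major), int(minor), int(patch), pre)


def is_affected(v: str) -> bool:
    """True if v falls in [FIRST_AFFECTED, FIRST_FIXED) under the simplified ordering."""
    return _key(FIRST_AFFECTED) <= _key(v) < _key(FIRST_FIXED)


def installed_vllm_version() -> Optional[str]:
    try:
        return version("vllm")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    v = installed_vllm_version()
    if v is None:
        print("vllm is not installed")
    else:
        print(f"vllm {v}: {'AFFECTED' if is_affected(v) else 'not in affected range'}")
```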

Since the vulnerability is related to GPU memory exposure during inference of GGUF models with large tensor dimensions, detection involves checking for:

Use of vulnerable vLLM versions (0.5.5 up to, but not including, 0.23.1rc0).
Inference workloads involving GGUF models with tensor dimensions whose product exceeds the range of a 32-bit integer (e.g., very large weight matrices).
Multi-tenant deployment scenarios where GPU memory is shared among users.
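
To spot models that could trigger the truncation, you can compare each tensor's element count against the largest signed 32-bit value. This Python sketch assumes that the relevant quantity is the total element count of a dequantized tensor and that the limit is 2^31 - 1. The advisory does not say exactly which product overflows, so treat this as a screening heuristic. Tensor shapes are supplied by the caller (for example, read from the GGUF file with a tool of your choice). The function name oversized_tensors and the tensor names in the example are illustrative only.

```python
import math
from typing import Dict, List, Sequence, Tuple

INT32_MAX = (1 << 31) - 1  # assumed limit: largest signed 32-bit value


def oversized_tensors(shapes: Dict[str, Sequence[int]]) -> List[Tuple[str, int]]:
    """Return (name, element_count) for tensors whose element count exceeds INT32_MAX."""
    flagged = []
    for name, shape in shapes.items():
        numel = math.prod(shape)
        if numel > INT32_MAX:
            flagged.append((name, numel))
    return flagged


example = {
    "token_embd.weight": (256000, 16384),  # illustrative shape only
    "blk.0.attn_q.weight": (4096, 4096),
}
print(oversized_tensors(example))
# [('token_embd.weight', 4194304000)]
```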

The available resources provide no specific network or system commands that directly detect exploitation or the presence of this vulnerability. However, you can take the following steps:

Check the installed vLLM version:
```bash
pip show vllm
```
Inspect your inference workloads for GGUF models with very large tensor dimensions that might trigger truncation.
In multi-tenant environments, watch for anomalies such as unexpected or garbled outputs from GGUF models, which may indicate that uninitialized GPU memory was consumed during dequantization.

For a more technical check, review csrc/quantization/gguf/gguf_kernel.cu in the build you deploy and confirm that it contains the patched code. The patched code passes tensor dimensions as int64_t and zero-initializes the output tensor.

Impact Analysis

The vulnerability can lead to information disclosure in multi-tenant environments where multiple users share the same GPU resources.

Specifically, residual GPU memory from one user's inference request may be exposed to another user, potentially leaking sensitive or confidential data.

Compliance Impact

The vulnerability in vLLM causes partial tensor processing that may lead to residual GPU memory containing tensor data from other users' inference requests. In multi-tenant inference deployments, this residual data exposure constitutes information disclosure.

Such information disclosure can impact compliance with data protection regulations like GDPR and HIPAA, which require strict controls to prevent unauthorized access to personal or sensitive data.

Therefore, this vulnerability could lead to violations of these regulations by exposing one user's data to another without authorization.

Mitigation Strategies

To mitigate this vulnerability, upgrade vLLM to version 0.23.1rc0 or later, where the integer truncation issue in the GGUF dequantize kernels has been fixed.
