Summary:
In the Linux kernel, the following vulnerability has been resolved:
mm: abort vma_modify() on merge out of memory failure
The remainder of vma_modify() relies upon the vmg state remaining pristine
after a merge attempt.
This is usually the case. However, in the one edge case where a merge
attempt fails not because the specified range is unmergeable, but because
of an out-of-memory error while committing the merge, this assumption no
longer holds.
This results in vmg->start, end being modified, and thus the subsequent
attempts to split the VMA will be done with invalid start/end values.
Thankfully, it is likely practically impossible for us to hit this in
reality, as it would require a maple tree node pre-allocation failure that
would likely never happen due to it being 'too small to fail', i.e. the
kernel would simply keep retrying reclaim until it succeeded.
However, this scenario remains theoretically possible, and what we are
doing here is wrong so we must correct it.
The safest option is, when this scenario occurs, to simply give up the
operation. If we cannot allocate memory to merge, then we cannot allocate
memory to split either (perhaps even more so!).
Any scenario where this would be happening would be under very extreme
(likely fatal) memory pressure, so it's best we give up early.
So there is no doubt it is appropriate to simply bail out in this
scenario.
However, in general we should never assume that VMG state is stable after
a merge attempt, since merge operations update VMG fields.
To make this explicit, the fix also stores start, end in local variables.
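The two parts of the fix described above (abort on out of memory, and keep start, end in locals) can be sketched in a small userspace model. This is not the kernel code: struct mock_vmg, mock_merge(), mock_modify() and the nomem flag are hypothetical names invented for illustration, and the merge here never succeeds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the merge descriptor ("vmg") discussed above. */
struct mock_vmg {
    unsigned long start;
    unsigned long end;
    bool nomem;   /* set when a merge failed for lack of memory */
};

enum mock_result { MOCK_MERGED, MOCK_SPLIT, MOCK_ENOMEM };

/*
 * Hypothetical merge attempt that never succeeds. With oom set, it models
 * the edge case: the range is rewritten first and the commit then fails.
 * Otherwise the range is simply unmergeable and vmg is left untouched.
 */
static bool mock_merge(struct mock_vmg *vmg, unsigned long vma_start, bool oom)
{
    if (oom) {
        unsigned long old_start = vmg->start;

        vmg->start = vma_start;
        vmg->end = old_start;
        vmg->nomem = true;
    }
    return false;
}

static enum mock_result mock_modify(struct mock_vmg *vmg,
                                    unsigned long vma_start, bool oom,
                                    unsigned long *split_start,
                                    unsigned long *split_end)
{
    /* Snapshot the caller's range before the merge can touch it. */
    const unsigned long start = vmg->start;
    const unsigned long end = vmg->end;

    assert(start < end);

    if (mock_merge(vmg, vma_start, oom))
        return MOCK_MERGED;
    if (vmg->nomem)
        return MOCK_ENOMEM;   /* give up; do not go on to split */

    /* Split using the snapshot, never vmg->start / vmg->end. */
    *split_start = start;
    *split_end = end;
    return MOCK_SPLIT;
}
```

In this model an out-of-memory merge returns MOCK_ENOMEM without writing the split outputs, while an ordinary unmergeable range falls through and splits on the values the caller originally passed in.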
The issue was originally reported by syzkaller and by Brad Spengler (via
an off-list discussion). In both instances it manifested as a triggering
of the following assert in vma_merge_existing_range():
    VM_WARN_ON_VMG(start >= end, vmg);
At least one scenario in which this occurs appears to be a merge attempted
as part of an madvise() across multiple VMAs, which looks like this:
         start    end
         |<------>|
    |----------|------|
    |   vma    | next |
    |----------|------|
When madvise_walk_vmas() is invoked, we first find vma in the above
(determining prev to be equal to vma as we are offset into vma), and then
enter the loop.
We determine the end of vma that forms part of the range we are
madvise()'ing by setting 'tmp' to this value:
    /* Here vma->vm_start <= start < (end|vma->vm_end) */
    tmp = vma->vm_end;
We then invoke the madvise() operation via visit(), letting prev get
updated to point to vma as part of the operation:
    /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
    error = visit(vma, &prev, start, tmp, arg);
Where the visit() function pointer in this instance is
madvise_vma_behavior().
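The per-VMA chunking described above can be modelled in userspace roughly as follows. This is a simplified sketch, not the kernel loop: struct mock_vma, walk_range(), record_visit() and struct chunk_log are hypothetical names, the prev argument is dropped, and clamping tmp to end is an assumption read off the "tmp <= (end|vma->vm_end)" comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified VMA: a half-open range [vm_start, vm_end). */
struct mock_vma {
    unsigned long vm_start;
    unsigned long vm_end;
};

typedef int (*visit_fn)(const struct mock_vma *vma, unsigned long start,
                        unsigned long end, void *arg);

#define MAX_CHUNKS 8

/* Records every (start, end) chunk handed to the visitor. */
struct chunk_log {
    size_t count;
    unsigned long start[MAX_CHUNKS];
    unsigned long end[MAX_CHUNKS];
};

static int record_visit(const struct mock_vma *vma, unsigned long start,
                        unsigned long end, void *arg)
{
    struct chunk_log *log = arg;

    (void)vma;
    if (log->count == MAX_CHUNKS)
        return -1;
    log->start[log->count] = start;
    log->end[log->count] = end;
    log->count++;
    return 0;
}

/* Walk sorted, non-overlapping VMAs and visit [start, end) one VMA at a time. */
static int walk_range(const struct mock_vma *vmas, size_t n,
                      unsigned long start, unsigned long end,
                      visit_fn visit, void *arg)
{
    for (size_t i = 0; i < n && start < end; i++) {
        const struct mock_vma *vma = &vmas[i];
        unsigned long tmp;
        int error;

        if (vma->vm_end <= start)
            continue;               /* entirely before the range */
        if (vma->vm_start >= end)
            break;                  /* entirely after the range */
        if (start < vma->vm_start)
            start = vma->vm_start;  /* skip an unmapped hole */

        /* Here vma->vm_start <= start < (end|vma->vm_end) */
        tmp = vma->vm_end;
        if (end < tmp)
            tmp = end;

        error = visit(vma, start, tmp, arg);
        if (error)
            return error;
        start = tmp;
    }
    return 0;
}
```

With a vma covering 0x1000 to 0x3000 and a next covering 0x3000 to 0x4000 (addresses chosen purely for illustration), a request for 0x2000 to 0x3800 is visited as two chunks, the first of which ends at vma->vm_end.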
As observed in syzkaller reports, it is ultimately madvise_update_vma()
that is invoked, calling vma_modify_flags_name() and vma_modify() in turn.
Then, in vma_modify(), we attempt the merge:
    merged = vma_merge_existing_range(vmg);
    if (merged)
        return merged;
We invoke this with vmg->start, end set to start, tmp as such:
         start tmp
         |<--->|
    |----------|------|
    |   vma    | next |
    |----------|------|
We find ourselves in the merge right scenario, but the one in which we
cannot remove the middle (we are offset into vma).
Here we have a special case where vmg->start, end get set to perhaps
unintuitive values - we intended to shrink the middle VMA and expand the
next.
This means vmg->start, end are set to... vma->vm_start, start.
Now the commit_merge() fails, and vmg->start, end are left like this.
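As a worked example of the range rewrite just described, the sketch below models only the arithmetic. struct range and rewrite_for_partial_merge_right() are hypothetical names, not kernel identifiers.

```c
#include <assert.h>

/* Half-open address range [start, end). */
struct range {
    unsigned long start;
    unsigned long end;
};

/*
 * Model of the "merge right, but keep the middle" case: the requested
 * range [start, tmp) is replaced by [vma_start, start), i.e. the part of
 * vma that would remain once it has been shrunk.
 */
static struct range rewrite_for_partial_merge_right(unsigned long vma_start,
                                                    struct range req)
{
    struct range out;

    assert(vma_start < req.start);   /* we are offset into vma */
    assert(req.start < req.end);

    out.start = vma_start;
    out.end = req.start;
    return out;
}
```

For a vma starting at 0x1000 and a request of 0x2000 to 0x3000 (illustrative addresses), the rewritten range is 0x1000 to 0x2000, which no longer overlaps the range the caller asked to modify.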
This means we return to the rest of vma_modify() with vmg->start, end
(here denoted as start', end') set as:
    start' end'
    |<-->|
    |----------|------|
    |   vma    | next |
    |----------|------|
So we now erroneously try to split accordingly. This is where the
unfortunate
---truncated---
Severity: Low
Advisory ID: KylinSec-SA-2025-2358
Release date: April 20, 2025
Related CVE: CVE-2025-21932
CVE | Product | Component | Affected |
---|---|---|---|
CVE-2025-21932 | KY3.4-5 | kernel | Unaffected |
CVE-2025-21932 | V6 | kernel | Unaffected |