The trouble with get_user_pages()

When kernel code needs to work directly with user-space pages, it often calls get_user_pages() (or one of several variants) to fault those pages into RAM and pin them there. This function is not entirely easy to use, though, and recent changes have made it harder to use safely. Jan Kara and Dan Williams led a plenary session at the 2018 Linux Storage, Filesystem, and Memory-Management Summit to discuss potential solutions, but it is not entirely clear that any were found.

Kara started by saying that he just spent half a year chasing down reports of kernel crashes; now that he has found the reason, he's not sure what to do about it. It comes down to how get_user_pages() is used. When it is called, it will translate user-space virtual addresses to physical addresses and ensure that the pages are in memory. Typically the caller will then perform some sort of I/O on those pages. There are a number of mechanisms by which this is done, but it all comes down to passing the addresses of the pages to the devices. When the I/O is complete, the kernel calls set_page_dirty() to mark the pages as dirty and releases its references to the pages.
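The pin, I/O, dirty, release sequence described above can be sketched as a small user-space toy model. This is not kernel code: struct toy_page and the toy_* functions are hypothetical names invented for illustration, and the "device" is just a memset().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16   /* tiny "page" so the example stays readable */

/* Hypothetical stand-in for the kernel's per-page bookkeeping. */
struct toy_page {
    int refcount;                       /* references held on the page */
    bool resident;                      /* has the page been faulted in? */
    bool dirty;                         /* does it need writeback? */
    unsigned char data[TOY_PAGE_SIZE];
};

/* Step 1: fault each page in and take a reference (the "pin"). */
void toy_pin_pages(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pages[i].resident = true;
        pages[i].refcount++;
    }
}

/* Step 2: the "device" writes into the pinned pages. */
void toy_dma_write(struct toy_page *pages, size_t n, unsigned char byte)
{
    for (size_t i = 0; i < n; i++) {
        assert(pages[i].refcount > 0);  /* pages must stay pinned during I/O */
        memset(pages[i].data, byte, TOY_PAGE_SIZE);
    }
}

/* Step 3: mark the pages dirty, then drop the references. */
void toy_unpin_dirty(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pages[i].dirty = true;
        pages[i].refcount--;
    }
}
```

A caller would invoke the three functions in order on the same array; the window between step 1 and step 3 is where the rest of the system can act on the pages without knowing about the I/O.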

Problems can arise when the kernel decides to perform writeback on some of the pages brought in with get_user_pages(). The writeback process will write-protect the pages so that user space cannot modify them until writeback is complete, but it knows nothing about DMA operations started by the driver that called get_user_pages(); that I/O may still be ongoing. One failure mode comes about as the result of the filesystem not knowing that pages are changing underneath it; that can lead to crashes in the filesystem code.
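Why write protection does not help here can be shown with another toy model, again entirely hypothetical and much simpler than real writeback. It assumes that writeback records a checksum of the page when it starts (here, just the single data byte) and expects the contents to be unchanged when it finishes. A store from user space is blocked by the write protection, but the simulated DMA store never looks at it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical one-byte "page". */
struct toy_page {
    bool write_protected;
    unsigned char data;
};

/* Hypothetical writeback state: a checksum taken when writeback starts. */
struct toy_writeback {
    unsigned char checksum;
};

/* A CPU store from user space goes through the page tables. */
bool toy_user_store(struct toy_page *p, unsigned char v)
{
    if (p->write_protected)
        return false;       /* in a real kernel the writer would fault and wait */
    p->data = v;
    return true;
}

/* A device store bypasses the page tables and their write protection. */
void toy_dma_store(struct toy_page *p, unsigned char v)
{
    p->data = v;
}

struct toy_writeback toy_writeback_begin(struct toy_page *p)
{
    struct toy_writeback wb = { .checksum = p->data };

    assert(!p->write_protected);    /* no writeback already in progress */
    p->write_protected = true;
    return wb;
}

/* Returns true only if the page stayed stable for the whole writeback. */
bool toy_writeback_end(struct toy_page *p, const struct toy_writeback *wb)
{
    bool stable = (wb->checksum == p->data);

    p->write_protected = false;
    return stable;
}
```

If toy_dma_store() runs between toy_writeback_begin() and toy_writeback_end(), the final check fails even though every toy_user_store() in that window was refused.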

Other crashes can come about if page reclaim removes buffers from the pages before the driver marks them dirty. Problems can result from modification of the data contained in pages while they are under writeback; it is essentially the stable pages problem all over again. And there are various data loss or corruption problems associated with use of fallocate() on pages that are under I/O — fallocate() may want to shuffle pages around, but an ongoing DMA operation will do the wrong thing if that happens.

Things get even worse if DAX is in use, since the pages in question exist on the storage media itself. If, for example, pages are truncated from a file before DMA completes, the result can be data and metadata corruption. Running DMA directly against blocks that the filesystem is manipulating is hazardous; the filesystem cannot see the elevated reference counts that would indicate that something else is going on with those pages.

Boaz Harrosh suggested simply preventing writeback on pages with elevated reference counts, but that would be likely to create all kinds of strange side effects. The fact that subsystems like RDMA can hold references on pages for hours at a time exacerbates this kind of problem. (The group circled for a while on the topic of whether this kind of long-term reference makes sense, without any sort of useful outcome.)

Williams said that the core of the problem is finding a way to allow the kernel to work with pages that have been pinned with get_user_pages(). He proposed a set of changes, starting with storing information about pinned pages in the inode (Al Viro was quick to ask: "which inode?") and requiring get_user_pages() users to provide a revoke() callback. Jérôme Glisse insisted, though, that any call site that could implement revoke() could also just use MMU notifiers to detect changes. Williams said that revoke() would really just wait for the I/O to complete so that the pages could be released, but Glisse pointed out that, with various types of I/O (such as a camera device streaming video images) the I/O is never really done. There would be no avoiding taking action to stop I/O in such cases.
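The article gives no details of how the revoke() proposal would look in code, so the following is only a guess at its shape, written as a user-space toy. Every name here (struct toy_pin, struct toy_inode, toy_pin_register(), toy_driver_revoke(), toy_truncate()) is made up for illustration. The sketch assumes that pins are recorded per inode, that registration fails without a callback, and that an operation such as truncation calls each callback before touching the pages. It also encodes Glisse's objection: for a streaming device, waiting is not an option, so the callback has to stop the I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PINS 4

struct toy_pin;
typedef void (*toy_revoke_fn)(struct toy_pin *pin);

/* Hypothetical record of one long-lived pin. */
struct toy_pin {
    bool active;            /* are the pages still pinned? */
    bool streaming;         /* I/O that never completes on its own */
    bool stopped_by_force;  /* did revoke have to stop the device? */
    toy_revoke_fn revoke;   /* called when the pages must be given back */
};

/* Hypothetical per-inode list of pins. */
struct toy_inode {
    struct toy_pin *pins[TOY_MAX_PINS];
    size_t npins;
};

/* Pinning is refused unless the caller supplies a revoke callback. */
bool toy_pin_register(struct toy_inode *inode, struct toy_pin *pin,
                      toy_revoke_fn revoke)
{
    if (revoke == NULL || inode->npins == TOY_MAX_PINS)
        return false;
    pin->active = true;
    pin->revoke = revoke;
    inode->pins[inode->npins++] = pin;
    return true;
}

/* A driver's callback: wait for finite I/O, stop streaming I/O. */
void toy_driver_revoke(struct toy_pin *pin)
{
    if (pin->streaming)
        pin->stopped_by_force = true;   /* waiting would never return */
    /* For finite I/O, this point stands in for waiting until completion. */
    pin->active = false;
}

/* Before changing the file's pages, ask every pin holder to let go. */
size_t toy_truncate(struct toy_inode *inode)
{
    size_t revoked = 0;

    for (size_t i = 0; i < inode->npins; i++) {
        struct toy_pin *pin = inode->pins[i];

        pin->revoke(pin);
        assert(!pin->active);   /* the callback must have released the pages */
        revoked++;
    }
    inode->npins = 0;
    return revoked;
}
```

The per-inode list is the part that prompted Viro's "which inode?" question; the toy sidesteps it by handing the inode to toy_pin_register() directly.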

Going further, Glisse stated that MMU notifiers are the interface that the kernel has now for dealing with memory-management events. They are called for all page-table entry changes, including write protection; they should be used, he said, rather than reinventing the interface somewhere else. Kara acknowledged that the idea sounds interesting for short-term users of get_user_pages(), at least. As the session ran out of time, Glisse said that long-term users could make it work too; the Mellanox RDMA driver "did it right", for example. Of course, he acknowledged, the fact that this interface has its own memory-management unit helps. The kernel should, he said, "be mean" to hardware that lacks such capabilities.

About the only hard conclusion from this discussion was that more discussions are needed before the developers will get a real handle on this problem.

The trouble with get_user_pages()

Posted Apr 30, 2018 15:25 UTC (Mon) by viro (subscriber, #7872) [Link]

FWIW, a big problem with that approach is that get_user_pages_fast() is pretty much lost after that. As it is, a good way to think of get_user_pages_fast() is to consider it a simulated TLB miss. If we manage to resolve that out of page tables, everything's nice and fast (and lockless, at that). If we run into something trickier, that simulated TLB miss escalates into a full-blown simulated page fault, which is where we start grabbing locks, hitting page cache, doing allocations, hitting disk, etc. That's what "_fast" in get_user_pages_fast() is about and that's what we lose with that approach. Even finding out which (if any) file is backing the area we are hitting requires pretty much the full-blown page fault locking.

Doing that for the sake of infinibad playing silly buggers with long-term page references looks like a bad idea - the interface is used for a lot more than that and punishing the regular users that way is not an appealing prospect...
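The "simulated TLB miss" picture in the comment above can be illustrated with a toy model. Nothing here is the kernel's implementation: struct toy_mm and the toy_gup_* functions are hypothetical, the "page table" is an array of flags, and a counter stands in for all of the locking, allocation, and disk access that the slow path involves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 8

/* Hypothetical address space with a flat toy "page table". */
struct toy_mm {
    bool present[TOY_NPAGES];         /* is a mapping already installed? */
    int refcount[TOY_NPAGES];         /* pins taken on each page */
    unsigned long lock_acquisitions;  /* how often the slow path ran */
};

/* Fast path: succeeds only if the page table already has the entry. */
bool toy_gup_fast_only(struct toy_mm *mm, size_t idx)
{
    if (idx >= TOY_NPAGES || !mm->present[idx])
        return false;
    mm->refcount[idx]++;    /* no lock taken on this path */
    return true;
}

/* Slow path: the simulated page fault. */
bool toy_gup_slow(struct toy_mm *mm, size_t idx)
{
    if (idx >= TOY_NPAGES)
        return false;
    mm->lock_acquisitions++;    /* stands in for locks, page cache, disk */
    mm->present[idx] = true;
    mm->refcount[idx]++;
    return true;
}

/* Try the cheap lookup first; escalate only when it fails. */
bool toy_gup_fast(struct toy_mm *mm, size_t idx)
{
    if (toy_gup_fast_only(mm, idx))
        return true;
    return toy_gup_slow(mm, idx);
}
```

In this model, any scheme that has to find the backing file on every call would amount to always running toy_gup_slow(), which is the cost the comment objects to.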
