Linux memory hotplug (hotremoval, strictly speaking)
Warning: This page is outdated and no longer maintained.
This document is about adding memory hotplug functionality to Linux.
Hotplug, in general, means addition and removal of devices while target
systems are running.
In the case of memory, removal is challenging because all memory
on the module being removed must be freed first.
This document explains only how memory can be freed.
Other items, such as managing memory configuration data
(pgdat, zone_table, etc.),
are also integral but are outside of the scope of this document.
The purpose of this patch is to demonstrate and test data purging from
a memory zone without special hardware requirements.
It splits highmem into chunks (whose sizes are set by a kernel config option)
at boot time.
The following operations are possible for each zone.
- disabling page allocation
- reenabling page allocation
- activating kswapd
- remapping pages
Note that older revisions have a different interface.
$ cat /proc/memhotplug
Shows memory usage statistics per zone.
Node 0 enabled nonhotremovable
DMA: free 250, active 940, present 4096
Normal: free 307, active 101623, present 126976
Node 1 enabled hotremovable
Normal: free 336, active 9287, present 83968
HighMem: free 88, active 14406, present 45056
# echo 'plug <node num>' > /proc/memhotplug
Plugs the specified node.
# echo 'unplug <node num>' > /proc/memhotplug
Unplugs the specified node. (All pages must be freed in advance)
# echo 'disable <node num>' > /proc/memhotplug
Disables page allocation from the specified node.
# echo 'enable <node num>' > /proc/memhotplug
Enables page allocation from the specified node.
# echo 'purge <zone num>' > /proc/memhotplug
Activates kswapd for the specified zone.
# echo 'remap <zone num>' > /proc/memhotplug
Moves pages in the specified zone to other zones.
There are a few more commands for debugging.
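The commands above are plain text lines written to /proc/memhotplug, so they can also be issued from a program. The following is a minimal user-space sketch, not part of the patch: the helper names memhotplug_format and memhotplug_send are hypothetical, and the trailing newline is an assumption that simply mirrors what echo writes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<cmd> <num>\n", the same bytes echo would write.
 * Returns 0 on success, -1 if the buffer is too small. */
static int memhotplug_format(char *buf, size_t len, const char *cmd, int num)
{
    int n = snprintf(buf, len, "%s %d\n", cmd, num);

    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Hypothetical helper: write one command line to the control file at path.
 * Returns 0 on success, -1 on any failure. */
static int memhotplug_send(const char *path, const char *cmd, int num)
{
    char line[64];
    FILE *f;
    int ok;

    if (memhotplug_format(line, sizeof(line), cmd, num) != 0)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = (fputs(line, f) != EOF);
    if (fclose(f) != 0)     /* the write may only fail at flush time */
        ok = 0;
    return ok ? 0 : -1;
}
```

On a kernel with this patch applied, memhotplug_send("/proc/memhotplug", "disable", 1) would be the equivalent of the 'disable' echo command shown above for node 1.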
The remap operation does the following to every page of a zone.
The key point is to block accesses to the page under operation by
modifying the radix tree.
After the radix tree has been modified, no new access goes to oldpage.
And accesses to newpage are blocked until the data is ready because
it is locked and !uptodate.
- allocate newpage
- lock newpage
- replace the oldpage entry in the corresponding radix tree with newpage
- clear all PTEs that refer to oldpage
- memcpy(newpage, oldpage, PAGE_SIZE)
- set uptodate flag of newpage
- unlock newpage and wakeup waiters
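The ordering of the steps above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the patch's code: struct page and struct mapping here are toy stand-ins (a fixed array replaces the radix tree, a counter replaces the PTEs), and remap_page is a hypothetical name. Locking against concurrent lookups and waking waiters are reduced to flag updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NSLOTS    8                 /* toy stand-in for a radix tree */

struct page {
    bool locked;
    bool uptodate;
    int  mapcount;                  /* toy count of PTEs referring to the page */
    unsigned char data[PAGE_SIZE];
};

struct mapping {
    struct page *slot[NSLOTS];
};

/* Move the page at index idx to a freshly allocated page.
 * Returns the new page, or NULL if there is nothing to remap. */
static struct page *remap_page(struct mapping *m, unsigned idx)
{
    struct page *oldpage, *newpage;

    if (idx >= NSLOTS || m->slot[idx] == NULL)
        return NULL;
    oldpage = m->slot[idx];

    newpage = calloc(1, sizeof(*newpage));          /* allocate newpage */
    if (newpage == NULL)
        return NULL;
    newpage->locked = true;                         /* lock newpage */
    m->slot[idx] = newpage;                         /* replace the tree entry */
    oldpage->mapcount = 0;                          /* clear all PTEs */
    memcpy(newpage->data, oldpage->data, PAGE_SIZE);
    newpage->uptodate = true;                       /* data is now valid */
    newpage->locked = false;                        /* unlock, wake up waiters */
    return newpage;
}
```

Between the tree update and the final unlock, a lookup in this model finds newpage in the locked and !uptodate state, which is what keeps readers away from the data until the copy is finished.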
In some cases, a remap operation needs to be rolled back and retried.
This is a bit tricky because it is likely that some processes have already
looked up the radix tree and are waiting for the lock on newpage.
Such processes need to discard newpage and look up the radix tree again,
as newpage is now invalid.
To achieve this, I defined a new page flag (PG_again).
- Roll back the radix tree change.
- Set the PG_again bit of newpage and unlock it.
- Woken up processes see the PG_again bit and look up the radix tree again.
- Wait until the page count of newpage falls to 1 (for the remapd process).
- Roll back is complete. newpage can be freed.
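The rollback sequence can be modelled in the same toy style. Again this is a sketch under assumptions rather than the patch's implementation: the again field stands in for the PG_again page flag, count stands in for the page reference count, and the three function names are hypothetical. Sleeping and wakeups are left out, so each waiter is represented by one call to waiter_recheck.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 8                    /* toy stand-in for a radix tree */

struct page {
    bool locked;
    bool uptodate;
    bool again;                     /* models the PG_again flag */
    int  count;                     /* models the page reference count */
};

struct mapping {
    struct page *slot[NSLOTS];
};

/* Remapper side: restore the tree entry, mark newpage invalid, unlock it. */
static void remap_rollback_begin(struct mapping *m, unsigned idx,
                                 struct page *oldpage, struct page *newpage)
{
    m->slot[idx] = oldpage;         /* roll back the radix tree change */
    newpage->again = true;          /* set PG_again */
    newpage->locked = false;        /* unlock; waiters wake up here */
}

/* Waiter side: called once the lock on p has been obtained. */
static struct page *waiter_recheck(struct mapping *m, unsigned idx,
                                   struct page *p)
{
    struct page *found;

    if (!p->again)
        return p;                   /* page is valid, keep using it */
    p->count--;                     /* discard the invalid page */
    found = m->slot[idx];           /* look up the tree again */
    if (found != NULL)
        found->count++;
    return found;
}

/* Remapper side: only the remapper's own reference is left. */
static bool remap_rollback_done(const struct page *newpage)
{
    return newpage->count == 1;
}
```

Once remap_rollback_done returns true in this model, no waiter can still hold a pointer to newpage, so it is safe to free it and retry the remap later.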
Patches are against linux-2.6.7.
The main patches.
Continuously remap pages between zones.
Unlinks files while extracting from a tarball.
Good VM subsystem exercise.
Link to Takahashi's
HugeTLBfs page handling patch.
Link to swapout based hotremoval investigation report.
In rare cases, possibly through a combination of kswapd, remapd, and dirty page
writeback, pages end up in a bad state and the kernel crashes.
This is under investigation.
PAE is not supported.
IWAMOTO Toshihiro <iwamoto at valinux...>
$Id: mh.html,v 1.14 2004/07/13 02:14:13 toshii Exp $