The purpose of this page is to present our investigation report on the Linux swap-out code for memory hot removal. This page is meant to be a short summary; if you need more detail, please ask.

Contents:

	* Memory hotplug prototype specification
	* Memory hotplug prototype test report
	* Modification to the swap-out code

Memory hotplug prototype specification

** Objective

	This prototype was implemented to verify whether pages can be
	released by activating kswapd.

** Specification
*  Modification of memory management

	Recognized memory is divided into 1GB chunks at boot time.
	Each chunk is registered as if it belonged to a separate node.
	Except for the first node, every memory chunk is registered as
	highmem.

	The page allocator is modified so that it does not allocate
	memory from zones specified via the /proc interface, as
	illustrated below.
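
	The following fragment is an illustration only (not code taken
	from the patch) of how the zonelist scan in __alloc_pages()
	could skip disabled zones, using the zone_activep() predicate
	that appears in the patch at the bottom of this page.

/* Inside the zonelist scan of __alloc_pages(): */
for (i = 0; zones[i] != NULL; i++) {
        struct zone *z = zones[i];

#ifdef CONFIG_MEMHOTPLUGTEST
        if (!zone_activep(z))
                continue;       /* zone disabled via /proc/memhotplug */
#endif
        /* ... usual watermark checks and buddy allocation ... */
}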

*  Forced page release functionality

	Forced page release is implemented by invoking kswapd against
	a specified zone.  To make page release effective, the
	following modifications are made to kswapd when it runs against
	page-allocation-disabled zones:

	* Increase pages_high.

	* Disable the kswapd code which makes it skip recently
	  accessed pages.

	In addition, per-CPU free pages are put back to the free
	lists.  A sketch of a possible purge handler follows.
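
	The following is a sketch of what handling a "purge" command
	could look like, based only on the description above; the
	helper names mhtest_purge_zone() and drain_zone_pcp_pages()
	are hypothetical and not taken from the prototype.

#ifdef CONFIG_MEMHOTPLUGTEST
static void mhtest_purge_zone(struct zone *zone)
{
        /* Raise the reclaim target so kswapd keeps going
         * (see "Increase pages_high" above). */
        zone->pages_high = zone->present_pages;

        /* Hypothetical helper: return per-CPU free pages
         * to the buddy free lists. */
        drain_zone_pcp_pages(zone);

        /* Wake kswapd for the node this zone belongs to. */
        wake_up_interruptible(&zone->zone_pgdat->kswapd_wait);
}
#endif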

*  Functions for verification

	(omitted)

Memory hotplug prototype test report

** Test method

	Done as follows:

$ ./some_memory_hog &
$ sudo sh -c 'echo "disable 5" > /proc/memhotplug'
$ sudo sh -c 'echo "purge 5" > /proc/memhotplug'
$ sudo sh -c 'echo "purge 5" > /proc/memhotplug'
				# repeat as necessary
...
$ cat /proc/memhotplug 
				# display per zone memory usage

** Test results
*  File cache

	* The file cache was populated by running "cksum 1gb_size_file",
	  and then purge commands were issued.

	* Purge commands were issued while running a test program
	  which mmaps a file and reads and writes the mmapped region.

	All pages were released in both tests.

*  Anonymous memory

	Purge commands were issued while running a test program which
	mallocs 1GB and repeatedly touches the allocated pages.

	All pages were released.
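
	A minimal example of the kind of test program described above
	(an illustration, not the program that was actually used):

#include <stdlib.h>
#include <unistd.h>

#define SIZE    (1024UL * 1024 * 1024)          /* 1GB */

int main(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        char *p = malloc(SIZE);
        size_t off;

        if (p == NULL)
                return 1;
        for (;;)        /* keep every allocated page recently touched */
                for (off = 0; off < SIZE; off += pagesize)
                        p[off]++;
}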

*  Dirty page cache

	If a process continuously writes over a small file region
	(about 1MB), the associated file cache pages are hard to
	release.
	(The test program used for this case is available for
	download.)

	kswapd works in the following fashion:

	1. shrink_cache() builds a list of pages to free from the
	   inactive list and calls shrink_list().
	2. If a page is dirty, shrink_list() clears the dirty bit and
	   issues a writepage.  The page is not released.
	   If a page isn't dirty and its buffers can be freed,
	   shrink_list() removes the page from the page cache, which
	   completes the page release.  (Details omitted.)

	If a page is dirtied frequently, it can be redirtied before
	its writeback I/O finishes.  In such a case, kswapd never gets
	a chance to free the page, as sketched below.
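
	The relevant part of shrink_list() can be sketched as follows
	(simplified; not the exact kernel code, and locking and error
	handling are omitted):

        if (PageDirty(page)) {
                if (test_clear_page_dirty(page))
                        mapping->a_ops->writepage(page, &wbc); /* start writeback */
                goto keep;      /* the page is not freed on this pass */
        }
        /*
         * A clean page whose buffers can be dropped is removed from the
         * page cache here and freed.  If the process redirties the page
         * before the writeback started above completes, PageDirty() is
         * true again on the next pass and the page never reaches this
         * point.
         */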

*  tmpfs
*  pipe
*  sendfile
*  semaphore
*  virtual block devices

	(omitted)

Modification to the swap-out code

The full version can be found here.  zone_activep() is used to make kswapd more aggressive on memory zones that are being hot-removed.
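
	zone_activep() itself does not appear in the excerpt below.  A
	plausible minimal implementation (an assumption, not the actual
	patch) would test a per-node bit cleared by the "disable"
	command of /proc/memhotplug:

#ifdef CONFIG_MEMHOTPLUGTEST
/* One bit per node; cleared by "echo disable N > /proc/memhotplug". */
static unsigned long mhtest_enabled_nodes = ~0UL;

static inline int zone_activep(const struct zone *zone)
{
        return test_bit(zone->zone_pgdat->node_id, &mhtest_enabled_nodes);
}
#endif
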
--- linux-2.6.0-test11/mm/vmscan.c      Thu Nov 27 05:43:06 2003
+++ linux-2.6.0-test11-mh/mm/vmscan.c   Fri Nov 28 17:55:35 2003
@@ -285,6 +288,8 @@ shrink_list(struct list_head *page_list,
                        goto keep_locked;

                pte_chain_lock(page);
+               if ((! zone_activep(page_zone(page))) && page_mapped(page))
+                       page_referenced(page);
                referenced = page_referenced(page);
                if (referenced && page_mapping_inuse(page)) {
                        /* In active use or really unfreeable.  Activate it. */
@@ -589,7 +594,7 @@ done:
  * But we had to alter page->flags anyway.
  */
 static void
-refill_inactive_zone(struct zone *zone, const int nr_pages_in,
+refill_inactive_zone(struct zone *zone, int nr_pages_in,
                        struct page_state *ps, int priority)
 {
        int pgmoved;
@@ -607,6 +612,12 @@ refill_inactive_zone(struct zone *zone,

        lru_add_drain();
        pgmoved = 0;
+#ifdef CONFIG_MEMHOTPLUGTEST
+       if (! zone_activep(zone)) {
+               nr_pages = nr_pages_in = zone->present_pages - zone->free_pages;
+               printk("Purging active list of disabled zone\n");
+       }
+#endif
        spin_lock_irq(&zone->lru_lock);
        while (nr_pages && !list_empty(&zone->active_list)) {
                page = list_entry(zone->active_list.prev, struct page, lru);
@@ -658,12 +669,20 @@ refill_inactive_zone(struct zone *zone,
         */
        if (swap_tendency >= 100)
                reclaim_mapped = 1;
+#ifdef CONFIG_MEMHOTPLUGTEST
+       if (! zone_activep(zone))
+               reclaim_mapped = 1;
+#endif

        while (!list_empty(&l_hold)) {
                page = list_entry(l_hold.prev, struct page, lru);
                list_del(&page->lru);
                if (page_mapped(page)) {
                        pte_chain_lock(page);
+#ifdef CONFIG_MEMHOTPLUGTEST
+                       if (! zone_activep(zone))
+                               page_referenced(page);  /* XXX */
+#endif
                        if (page_mapped(page) && page_referenced(page)) {
                                pte_chain_unlock(page);
                                list_add(&page->lru, &l_active);
@@ -767,6 +786,11 @@ shrink_zone(struct zone *zone, int max_s
        ratio = (unsigned long)nr_pages * zone->nr_active /
                                ((zone->nr_inactive | 1) * 2);
        atomic_add(ratio+1, &zone->refill_counter);
+#ifdef CONFIG_MEMHOTPLUGTEST
+       if (! zone_activep(zone))
+               /* XXX */
+               atomic_add(SWAP_CLUSTER_MAX, &zone->refill_counter);
+#endif
        if (atomic_read(&zone->refill_counter) > SWAP_CLUSTER_MAX) {
                int count;