Friday, May 27, 2011

Improving TMPFS performance in Solaris

Improving tmpfs File System Performance


Applies to:
Solaris SPARC Operating System - Version: 8.0 and later [Release: 8.0 and later]
All Platforms
Goal
Performance of the tmpfs file system can be improved by setting tmpfs tunable "tmp_nopage = 1" in /etc/system. This issue is raised in bug

Solution

Tmpfs is a memory-resident file system. It uses the page cache for caching file data. Files created in a tmpfs file system avoid physical disk reads and writes.

The primary goal of designing tmpfs was to improve read/write performance of short lived files without invoking network and disk I/O.

Tmpfs does not use dedicated memory such as a "RAM DISK". Instead it uses virtual memory (VM) maintained by the kernel. This allows it to use VM and kernel resource allocation policies. Tmpfs files are written to and read from kernel memory directly. Pages allocated to tmpfs files are treated the same way as any other physical memory pages.

Tmpfs file data is stored in anonymous memory. The kernel does not differentiate tmpfs file data from the page cache. Under memory pressure, tmpfs pages can be written out to the physical swap device and freed if the page daemon selects them as page-out candidates.

It is the user's responsibility to keep a backup of tmpfs files by copying them to a disk-based file system such as ufs. Otherwise, tmpfs files will be lost in case of a crash or reboot.

In Solaris, fsflush (the file system flush daemon) is responsible for flushing dirty pages to disk. A page is considered dirty when its content has been modified in memory and has not yet been synced to disk. For every dirty page in memory, fsflush calls the putpage() routine of the owning file system, which is responsible for writing the page to the backing store. For the ufs file system fsflush calls ufs_putpage(), and similarly for a dirty tmpfs page it calls tmpfs_putpage(). Pages in memory are identified by vnode and offset.

When a tmpfs file is created or modified, pages are marked dirty. Tmpfs pages stay dirty until the file is deleted. The only time that the tmpfs_putpage() routine pushes the dirty tmpfs pages to the swap device is when the system experiences memory pressure. Systems with no physical swap device or configured with plenty of physical memory can avoid this overhead by setting the tmpfs tunable

set tmpfs:tmp_nopage = 1

in /etc/system. Setting this tunable causes tmpfs_putpage() to return immediately, avoiding its overhead.

tmpfs_putpage() Overhead

There is a great deal of work done in the tmp_putpage() routine. For every vnode and offset, tmpfs searches for the dirty page in the global page hash list and locks it. So that it can write multiple dirty pages in a single chunk, it performs a similar search for pages adjacent to the locked page. tmpfs_putpage() then looks up the backing store for the page. If the physical swap device is full or not configured, it unlocks the pages and returns without writing them. The page-out operation to the swap device happens only when free memory (freemem) is low. For every successful page-out, tmpfs_putpage() increments the tmp_putpagecnt and tmp_pagespushed counters. A system with no physical swap device, or one with physical swap but plenty of memory, should show zero for both tmp_putpagecnt and tmp_pagespushed.

If the system has no swap device configured, paging out is not available as a way to free up memory.
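
The two counters described above can be inspected on a live system with mdb. The following is only a minimal sketch under stated assumptions: it assumes a 64-bit kernel in which both counters are 8-byte values (hence the /E format; /D would print a 4-byte value instead), and the helper names mdb_expr and show_tmpfs_counters, as well as the MDB override variable, are hypothetical and introduced here purely for illustration.

```shell
# Hypothetical helpers for reading the tmpfs page-out counters.
# MDB can be overridden to point at a different mdb binary (assumption).
MDB="${MDB:-mdb}"

# Build the mdb expression that prints a kernel variable
# as an 8-byte unsigned decimal value.
mdb_expr() {
    printf '%s/E\n' "$1"
}

# Feed one expression per counter to "mdb -k" (read-only kernel target).
show_tmpfs_counters() {
    for sym in tmp_putpagecnt tmp_pagespushed; do
        mdb_expr "$sym" | "$MDB" -k
    done
}
```

Running show_tmpfs_counters as root prints one line per counter. On a system that never pages tmpfs data out to swap, both values would be expected to remain zero.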

Testing and Verification

Lab tests have shown that copying a large file (1 GB in size) from a tmpfs to a ufs file system is roughly eight times faster when the tmp_nopage tunable is set to 1: about 19 seconds instead of about 2 minutes 27 seconds. Test results are shown below:

tmp_nopage=0 (default)

$ mkfile 1024m /tmp/one

$ ptime cp /tmp/one /fast/one

real 2:27.301
user 0.044
sys 2:27.207

$ mkfile 1024m /tmp/two

$ ptime cp /tmp/two /fast/two

real 2:27.452
user 0.044
sys 2:27.352

tmp_nopage=1

Setting tmp_nopage=1 on a live system using mdb:

# echo 'tmp_nopage/W 1' | mdb -kw

$ rm /tmp/* /fast/*

$ mkfile 1024m /tmp/one

$ ptime cp /tmp/one /fast/one

real 18.767 << 18 seconds instead of over 2 minutes.
user 0.044
sys 18.695

$ mkfile 1024m /tmp/two

$ ptime cp /tmp/two /fast/two

real 19.160
user 0.040
sys 19.095
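
The elapsed times above can be compared directly. As a small illustration, the following converts the m:ss.fff values printed by ptime to seconds and computes the ratio between two runs. The helper names to_seconds and speedup are hypothetical, and a POSIX awk is assumed (on Solaris that would typically be nawk or /usr/xpg4/bin/awk rather than /usr/bin/awk).

```shell
# Convert a ptime value such as "2:27.301" or "18.767" to seconds.
# Each colon-separated field is folded in as base-60.
to_seconds() {
    echo "$1" | awk -F: '{
        s = 0
        for (i = 1; i <= NF; i++) s = s * 60 + $i
        printf "%.3f\n", s
    }'
}

# Print how many times faster the second timing is than the first,
# rounded to one decimal place.
speedup() {
    awk -v before="$(to_seconds "$1")" -v after="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", before / after }'
}
```

For the first pair of runs, speedup 2:27.301 18.767 prints 7.8; for the second pair, speedup 2:27.452 19.160 prints 7.7.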

Setting tmp_nopage permanently

To make the setting permanent, add the following line to /etc/system and reboot the system:

set tmpfs:tmp_nopage=1
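
Before rebooting, it may be worth confirming that the line is present and not commented out. The check below is a sketch only: has_tmp_nopage is a hypothetical helper name, and it assumes a grep that supports -E and POSIX character classes (on Solaris, /usr/xpg4/bin/grep rather than /usr/bin/grep).

```shell
# Succeed if the given file (default: /etc/system) contains an active
# "set tmpfs:tmp_nopage=1" line. Comment lines in /etc/system start
# with "*", so the anchored pattern does not match them.
has_tmp_nopage() {
    grep -E '^[[:space:]]*set[[:space:]]+tmpfs:tmp_nopage[[:space:]]*=[[:space:]]*1[[:space:]]*$' \
        "${1:-/etc/system}" >/dev/null 2>&1
}
```

A typical use would be: has_tmp_nopage && echo "tmp_nopage is set in /etc/system". Note that this only inspects the file; the value actually in effect in the running kernel is a separate question.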
