Sunday, 24 January 2010

Alarm, Alarm!!! - Solaris Swap is exhausted - Part 1

The Problem
I recently witnessed a Solaris server die a painful death by memory starvation, which was attributed to an untuned Oracle 11g instance having its way.

The obvious questions were asked after the event, including 'How do we stop this happening again?' One suggestion was to monitor all the Solaris hosts so that, if swap is running low, we at least know about it before the system falls in a heap (no pun intended).

So, how do we do this?

The approach
Nagios is the tool of choice in this case, which implies SNMP monitoring by default (for this particular organisation). Some testing led me to the following Net-SNMP UCD-SNMP-MIB (UC Davis) values:
  • memSwapError (.1.3.6.1.4.1.2021.4.100.0)
  • memSwapErrorMsg (.1.3.6.1.4.1.2021.4.101.0)
  • memAvailSwap (.1.3.6.1.4.1.2021.4.4.0)
  • memMinimumSwap (.1.3.6.1.4.1.2021.4.12.0)

According to the Net-SNMP notes, memSwapError and memSwapErrorMsg are triggered when 'memAvailSwap is less than the desired minimum (specified by memMinimumSwap)'. In Part 2 of this topic, I'll show how to check that this is actually what happens on your configuration.
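As a rough illustration of the rule quoted above, here is a minimal Python sketch. It only restates the documented comparison; it is not taken from the Net-SNMP source. The function names and the OID-to-integer dictionary are hypothetical, and fetching the values over SNMP is left to whatever tooling you already use.

```python
# OIDs quoted in this post (UCD-SNMP-MIB memory group); swap values are in kB.
MEM_AVAIL_SWAP = ".1.3.6.1.4.1.2021.4.4.0"
MEM_MINIMUM_SWAP = ".1.3.6.1.4.1.2021.4.12.0"
MEM_SWAP_ERROR = ".1.3.6.1.4.1.2021.4.100.0"


def expected_swap_error(avail_swap_kb: int, minimum_swap_kb: int) -> int:
    """Return 1 if available swap is below the minimum, else 0 (the documented rule)."""
    return 1 if avail_swap_kb < minimum_swap_kb else 0


def agrees_with_agent(values: dict) -> bool:
    """Check whether the flag reported by the agent matches the documented rule.

    `values` is a hypothetical mapping of OID -> integer, e.g. built from an SNMP GET.
    """
    expected = expected_swap_error(values[MEM_AVAIL_SWAP], values[MEM_MINIMUM_SWAP])
    return expected == values[MEM_SWAP_ERROR]
```

If agrees_with_agent returns False for values read from a real host, the agent is not behaving the way the notes describe on that configuration.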

Unexpected Fruit
The main point of this post is to highlight where and how to set memMinimumSwap, which eluded me for a while. At first the value appeared to be hard-coded: sampling a number of different hosts always returned 16000, which turns out to be the default.

Reading through the Net-SNMP source code, you find that a more meaningful value can be set in snmpd.conf using the swap directive, which takes a number of kilobytes. For example, to change from the 16MB default to 32MB, use:

swap 32000
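To audit which threshold a host is configured with, a small Python sketch like the one below can read the directive back out of the text of snmpd.conf. The function name is hypothetical, and two things are assumptions on my part rather than documented agent behaviour: that '#' starts a comment, and that the last swap line wins if there are several.

```python
DEFAULT_MIN_SWAP_KB = 16000  # the default returned by the hosts sampled in this post


def configured_min_swap_kb(conf_text: str) -> int:
    """Return the value of the `swap` directive in kB, or the default if none is set."""
    value = DEFAULT_MIN_SWAP_KB
    for raw_line in conf_text.splitlines():
        # Assumption: '#' starts a comment; strip it and any surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) == 2 and fields[0] == "swap" and fields[1].isdigit():
            value = int(fields[1])  # assumption: the last directive wins
    return value
```

Comparing this number with what memMinimumSwap returns over SNMP is a quick way to confirm that the agent has picked up the change.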

Now I can use Nagios and the check_snmp plug-in to monitor the value returned by memSwapError - if it's 0, there are no problems; if it returns 1, the system has fallen below the defined threshold and is in some trouble.
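For anyone who prefers a custom check over check_snmp, the decision logic is small enough to sketch in Python. This is only an illustration: swap_status is a hypothetical helper, the SNMP query itself is not shown, and treating a value of 1 as CRITICAL rather than WARNING is my own choice. The exit codes follow the usual Nagios plug-in convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plug-in exit codes


def swap_status(mem_swap_error, error_msg=""):
    """Map a memSwapError value to a (Nagios exit code, status line) pair.

    `error_msg` is optional text, for example the value of memSwapErrorMsg.
    """
    if mem_swap_error == 0:
        return OK, "SWAP OK - memSwapError is 0"
    if mem_swap_error == 1:
        detail = error_msg or "available swap is below memMinimumSwap"
        return CRITICAL, f"SWAP CRITICAL - {detail}"
    # Anything else (no response, unparsable value) should not look healthy.
    return UNKNOWN, f"SWAP UNKNOWN - unexpected memSwapError value: {mem_swap_error!r}"
```

A wrapper script would print the status line and exit with the returned code, which is all Nagios needs from a plug-in.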

1 comment:

John said...

Notification is good, but there is also a means to prevent this.

Check out the swapfs_reserve kernel parameter defined in the Kernel Tunable Parameters guide on docs.sun.com. This parameter defines the amount of system memory that is reserved for use by uid 0 processes.

- John