Tuesday, May 03, 2005

Memory Fragmentation on NetWare

Memory allocation and fragmentation are the top issue at Novell right now.

This information was extracted from the following useful documents:
TID 10096649 NetWare 6.x Memory Fragmentation & Tuning
TID 10091980 Memory Fragmentation Issue with NetWare 6.0/6.5
TID 10091598 Understanding Logical Memory
TID 10097396 SEG.NLM - How to read SEGSTATS.TXT
TID 10069653 How does NetWare 6.0/6.5 use memory above 4Gb?
TID 10082323 Understanding the Database Cache page in NDS iMonitor
TID 10058100 NWMKDE.NLM is consuming large amounts of RAM (Btrieve)
TID 10090008 CPU HOG ABEND with NLSMETER.NLM and NWMKDE.NLM

SEG.NLM: NetWare Memory Analyzer (Cool Tools on Novell Cool Solutions)
I have added many of my own comments as necessary for clarification.

Background:
NetWare was designed around the 32-bit Intel i386 architecture, which gives us 4 GB of logical address space to operate in. Although NetWare can handle up to 64 GB of physical RAM, Intel's 32-bit architecture limits any OS to a 4 GB area for mapping logical memory. (Memory above 4 GB must be accessed by mapping pages in and out of the 4 GB space.) Because most applications run in "Kernel Space" or "Ring 0" in NetWare (as opposed to "User Space" or "Ring 3" in other operating systems), all NLMs running in the kernel share a finite amount of logical memory.

Prior to NetWare 5, the Traditional File System (TFS) was also NetWare's memory manager. The TFS handled all memory allocated to file cache as well as all NLMs requesting memory allocation on the server. All memory requests were allocated from one physical memory pool, the File System Cache Pool (also known as the file cache buffers). When you were low on file cache buffers, you simply added more memory to alleviate the problem.

But memory changed a lot in NetWare 5. For instance, Virtual Memory was introduced, as was a new memory manager called the NetWare Memory Manager. Another big change that occurred in NetWare 5.0 (and has been there ever since) was the introduction of three physical memory pools.

Because of these changes, memory management in its present architectural model is not well understood. With NSS becoming the default file system, the growing number of servers with 4 GB of physical memory, the ever-growing memory needs of NetWare services in each new Support Pack, and a number of other factors, many administrators are having memory issues. Because of the lack of memory information, system administrators are trying to apply NetWare 4 memory solutions to NetWare 6 and later, with limited success.

Once Novell started using eDirectory 8.6.2 and moving to NSS, it became necessary to enhance SERVER.EXE to handle the extravagant memory needs of these NLMs. We saw a significant change in memory management starting with NetWare 6.0 SP5. In my experience, there are four main NLMs that like to hoard memory at startup: DS.NLM, NSS, TSAFS, and BTRIEVE.
The following information is relevant in dealing with NetWare 6.0 SP5 to NetWare 6.5 SP2 issues.

In April 2005 Novell released NetWare 6.5 SP3 (and OES). NetWare 6.5 Support Pack 3 has many features to deal with these issues, including an auto-tuning feature. It’s a hidden SET parameter called Auto Tune Server Memory and is turned ON by default. There are also many other new features, which are discussed in Ed Liebing's Document. As of May 2005, Novell is also preparing a new SERVER.EXE to fix some recently found defects.

Note: Even though NetWare 6.0 SP5 is on the end of life list, Novell will be coming out with a new SERVER.EXE for NetWare 6.0 SP5 to take advantage of some of the auto tuning features found in NetWare 6.5 SP3. There will be no more service packs for NetWare 6.0.


Memory Primer for NetWare 6.0 and 6.5:
As mentioned, a NetWare server has three logical address spaces.

- The File Cache Pool. This is the logical = physical range. This is where NSS runs, and it is defined by the SET parameter "File Cache Maximum Size".

- The user address space is used for Protected Mode Applications, including Java and ZENworks apps, and can be controlled by the -u setting on the server.exe command line.

- The Virtual Memory (VM) Pool is the logical space where NLMs generally run, and its size is 4GB minus the user space and File Cache Maximum Size.


Let's talk about the VM Pool:
The VM pool is now, and has been since 5.0, the main cache for all memory allocations except for NSS and a few NLMs which require logical = physical memory. The VM cache pool will fulfill all memory requests, even logical = physical requests, when it has the memory available. The VM cache pool also provides all stacks, NLM code and data, along with backing the virtual memory system.

If, however, the VM system cannot provide the requesting NLM with the requested memory, then memory will be scavenged from the File Cache pool for use by the NLM. When the NLM is unloaded, or it reduces its memory requirements, the memory manager will assign only some of the memory back to the File Cache Pool, resulting in fragmentation of the File Cache Pool and a reduction in available memory in the pool.

The NetWare server's memory management system is designed to transfer memory to where it is needed. At boot, 1GB of memory is allocated to the File Cache Pool, and the rest to the VM Pool. As the server is utilized (clients making requests to NSS, NLMs being loaded and unloaded, or requesting then freeing memory), the Memory Manager will reallocate memory to where it is best utilized.

If NSS requires more than the 1 GB of memory, the VM system will move memory down to satisfy NSS, as long as the NSS size is not statically configured. If NSS doesn't need any more, the VM system will hold onto the available memory, as requests for NLM memory vastly exceed the number of requests for NSS memory. The server will adapt to the load placed on it. If the VM system is holding all the memory, it is usually because NSS has not asked for it.

As stated, at boot time the File Cache Pool is set at 1GB, then once the registry is opened the File Cache Maximum Size is set. This SET parameter sets a logical (not physical) space limit on how big NSS can grow. NOTE: Setting the File Cache Maximum Size (FCMS) parameter does not immediately allocate that memory to the pool; it only sets a limit on how large the pool can grow. This has led to some confusion, as administrators expected the memory to be in the pool but found that it wasn't available.

One other common point of confusion comes from the fact that the size of the memory pools, individually or collectively, may often exceed the physical memory installed in the server. Remember that these pool sizes are logical sizes, not physical. A server with 0.5GB of physical memory can still have a logical pool size of 1GB or more. The File Cache pool size is a limit on how large the pool could grow (presuming there is that much memory and the file cache requires it), not the current physical size of the pool.

The VM Pool size is a derived figure: 4GB - File Cache Pool size - User Space, and will almost always be greater than installed memory on any server with less than around 2GB of physical memory.
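The arithmetic behind that derived figure can be sketched in a few lines of Python. This is only an illustration: the function name and the 512MB user address space are assumptions picked for the example, not values taken from the TIDs.

```python
GIB = 1024 ** 3            # bytes in one gigabyte (binary)
LOGICAL_SPACE = 4 * GIB    # 32-bit logical address space


def vm_pool_size(file_cache_max_size, user_space):
    """Logical size of the VM Pool: 4GB - File Cache Pool limit - User Space."""
    size = LOGICAL_SPACE - file_cache_max_size - user_space
    if size < 0:
        raise ValueError("pools exceed the 4GB logical address space")
    return size


# Hypothetical inputs: an FCMS of 1GB and an ASSUMED 512MB user address space.
example = vm_pool_size(1 * GIB, 512 * 1024 ** 2)
print(f"VM Pool logical size: {example / GIB:.2f} GB")
```

With these example inputs the logical VM Pool works out to 2.5GB, which is more than the installed RAM on a 2GB server. That is the point made above: the pool sizes are logical limits, not physical allocations.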


Memory Fragmentation:
Memory fragmentation occurs as the initially contiguous blocks of memory are broken into smaller blocks and assigned to the different pools. As NLMs are loaded or request memory, the memory manager provides them with a contiguous block of memory equal to the request. When the NLM is unloaded or the memory released, the memory manager takes the returned memory and reallocates it as required, and as the server is configured.

As more and more of the memory is broken up, or fragmented, the available contiguous memory will drop, and in the worst case, will drop to a point where there is very little contiguous memory in the VM Pool. There could potentially be hundreds of megabytes there, but all of it broken into very small blocks. So, although there should be "plenty" of memory in the VM Pool to satisfy memory requests, the fragmented nature of the memory means that there is insufficient contiguous memory, and the memory allocation will fail.

Note that memory fragmentation is normal, and will occur to some extent on all servers. It only becomes a problem when the memory is so fragmented that the memory manager cannot allocate sufficient contiguous memory to fulfill a request. This will generally not happen on most servers, as it requires many large memory requests/releases to occur.
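To make the "plenty of memory, but none of it contiguous" situation concrete, here is a toy model in Python. It is not how the NetWare Memory Manager works internally; the page list, the function names and the numbers are all invented for illustration.

```python
def free_runs(pages):
    """Return the length of each run of contiguous free pages (False = free)."""
    runs, current = [], 0
    for used in pages:
        if used:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


def can_allocate(pages, request):
    """A request succeeds only if a single contiguous free run is big enough."""
    return any(run >= request for run in free_runs(pages))


# Toy pool of 1024 pages in which every fourth page is still held by an NLM.
pages = [i % 4 == 0 for i in range(1024)]

total_free = sum(free_runs(pages))
largest = max(free_runs(pages))
print(total_free, largest, can_allocate(pages, 16))
```

In this toy pool three quarters of the pages are free, yet the largest contiguous run is only 3 pages, so a request for 16 contiguous pages fails. That is the same pattern as a server with hundreds of megabytes free that still reports allocation errors.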


How to Troubleshoot the problem:

1. First we have to know what the symptoms are. The symptoms I usually see are that after ten to fourteen days the server will suddenly start displaying memory allocation errors on the console screen. Then performance will slow to a crawl, and eventually the server will abend or even just halt without an abend. Typically the abends are CPU HOG abends.

2. We have to collect relevant data. The second step in troubleshooting this abend is to obtain SEG.NLM, the NetWare Memory Analyzer.
Load SEG.NLM, which starts logging memory changes and trends. You'll see SEG.CSV, SEG1.CSV,... SEGx.CSV in the SYSTEM directory. These can be read with any spreadsheet program.

3. Take a snapshot of memory with SEG.NLM after 4 hours of use. To create SEGSTATS.TXT, press the / (forward slash) key on any SEG.NLM screen, arrow over to "INFO" on the menu, then down to "Write SEGSTATS.TXT". The file can be found in SYS:\SYSTEM. Rename this file to SEGSTAT1.TXT so we can use it as a reference point.

4. Become familiar with the Memory Analyzer screens. F1 will show Allocation Errors and the Largest Contiguous Memory Cache Segment. You'll want to watch these. F3 will show a list of NLMs sorted by the amount of memory they have allocated. It will also show any suspect NLMs highlighted in RED or YELLOW. Watch whether any NLMs start climbing the allocation ladder to become top dog in memory allocation. F7 gives you the full list.

5. Use the NetWare Remote Manager (NRM pronounced "NoRM") via a Web Browser. The Server must have NILE.NLM, HTTPSTK.NLM, and PORTAL.NLM loaded.

Point your browser to the server:
For example http://1.2.3.45:8008 or https://1.2.3.45:8009

The first screen I would monitor is the Health Monitor, which will bring any memory-related issues to your attention, particularly Available Memory, Virtual Memory Performance, and Cache Performance. Second, I would monitor "List Modules" under the Manage Applications heading. This shows a list of NLMs similar to the one in SEG.NLM. Click on "Alloc Memory" to sort the NLMs by memory usage. Another screen in NetWare 6.0 and 6.5 SP2 is "View Memory Config" under the Manage Server heading. This shows a pie chart of the different memory pools.

If you go into NRM and view the memory config shortly after boot, you'll see that the VM Pool is still very large. This is NORMAL. NSS has taken its initial allocation of memory, and as the server settles into its normal load, memory will be migrated from the VM Pool into the other pools as required.

This screen has changed in NetWare 6.5 SP3. The Pie Chart is gone and instead there are Bar Graphs. Read Ed Liebing's Document for a detailed explanation.

6. When the server has been up and running for ten days to two weeks, start keeping an eye on the above-mentioned statistics. Any time you think you may have a performance issue with memory, or you start seeing Memory Allocation Errors, it's time to take another snapshot of memory. Load SEG.NLM, press the / (forward slash) key, and write another SEGSTATS.TXT. This is very valuable for Support Engineers diagnosing the problem. Follow TID 10097396 for a detailed explanation of this file.
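The idea of comparing the reference snapshot with a later one can be sketched as follows. This does not parse SEGSTATS.TXT (see TID 10097396 for its real layout); the per-NLM figures are typed in by hand and are invented purely for illustration, as is the function name.

```python
def allocation_growth(baseline, current):
    """Rank modules by growth in allocated bytes between two snapshots."""
    growth = {
        name: current[name] - baseline.get(name, 0)
        for name in current
    }
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)


# Invented figures in bytes, NOT real SEGSTATS.TXT output.
baseline = {"DS.NLM": 200_000_000, "NSS.NLM": 400_000_000, "TSAFS.NLM": 50_000_000}
current = {"DS.NLM": 650_000_000, "NSS.NLM": 410_000_000, "TSAFS.NLM": 55_000_000}

for name, delta in allocation_growth(baseline, current):
    print(f"{name:10} {delta / 1024 ** 2:+8.1f} MB")
```

The module at the top of the list is the one that grew the most between the two snapshots, which is usually a better lead than whichever module simply holds the most memory.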


How to deal with the problem:
The best memory tuning document I have found for dealing with memory fragmentation is TID 10091980. This TID gives detailed information on the two main tuning options, and also covers other tuning options for TSAFS and DS.NLM, as well as other relevant information.

This document takes you through Steps 1 to 6. I have been successful in resolving all memory related issues following this TID. Although keep in mind that it could take two or three months of tuning to really get a server tuned. Also keep in mind that each server behaves differently and needs to be tuned differently.

I'll go over a brief outline of the steps.

STEP 1: Update your server to the latest NetWare Support Pack. Update your server to the latest NSS files. The following steps require NetWare 6.0 SP5 or NetWare 6.5 SP2 and the latest NSS NLMs.

STEP 2 (If the module TSAFS.NLM is running on the server): TSAFS can be limited in the amount of cache it requests. To do this, unload the module TSAFS.NLM then re-load it with the following command-line switch:

Load TSAFS /CacheMemoryThreshold=1

Or try using TSA600.NLM. Note that Novell is no longer writing any code fixes to TSA600.NLM.
Step 2 is important if you're seeing memory issues during a backup.

STEP 3: Set a hard limit on the amount of RAM that DS.NLM uses. This is done by going into NRM and clicking on the "NDS iMonitor" link under the Manage eDirectory heading. Once in iMonitor, click on "Agent Activity", then "Agent Configuration", then "Database Cache" under settings. You'll want to go to "Database Cache Configuration" and set a "Hard Limit" instead of "Dynamic Adjust".
Use TID 10082323 to assist in setting a limit on DS Cache. This TID has good explanations of this screen as well as screen shots to assist.

STEP 4: Set the File Cache Maximum Size parameter:
SET File Cache Maximum Size = 1073741824

This hidden parameter lowers the ceiling on the file cache system, which leaves that much more logical address space for NLMs. The default setting is just under 3 GB (3087007744 bytes), so the 1 GB value shown here gives NLMs a little under 2 GB of additional logical space.
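The effect of the SET command can be checked with simple arithmetic. The sketch below uses only the two byte values quoted in this step; the function name is made up for the example.

```python
GIB = 1024 ** 3

DEFAULT_FCMS = 3087007744   # default File Cache Maximum Size quoted above
NEW_FCMS = 1073741824       # value from the SET command above (exactly 1GB)


def logical_space_freed(old_limit, new_limit):
    """Bytes of logical address space moved from the file cache limit to NLMs."""
    return old_limit - new_limit


freed = logical_space_freed(DEFAULT_FCMS, NEW_FCMS)
print(f"default = {DEFAULT_FCMS / GIB:.3f} GB, new = {NEW_FCMS / GIB:.3f} GB")
print(f"freed for NLMs = {freed / GIB:.3f} GB")
```

Working in binary gigabytes, the default comes to 2.875 GB and the new limit to exactly 1 GB, so this particular value hands 1.875 GB of logical space over to the VM Pool. Other values can be tried the same way before touching the server.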

In my experience, Steps 3 and 4 have solved the majority of memory issues. Not every memory issue, but a majority of them.

Stop here and reboot the server. Wait for another two weeks and watch and monitor the server. If we're still having issues, then after a monitoring period we go on to steps 5 and 6.

STEP 5: Set a hard limit on the amount of RAM that NSS can have

In the file c:\nwserver\nssstart.cfg put the following lines:

/nocachebalance

/minbuffercachesize=

These settings tell NSS to turn off cache balancing between the OS cache pool and the NSS cache pool, and to allow NSS to allocate only a specific number of cache buffers for file system caching. Each cache buffer is 4096 bytes, so specifying a value of 102400, for example, results in 400 MB of RAM for NSS.
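The buffer arithmetic is easy to get wrong by a factor of 1024, so a small converter may help. This is a sketch based only on the 4096-byte buffer size stated above; the function names are hypothetical, and the right amount of cache for a given server still has to come from monitoring.

```python
BUFFER_BYTES = 4096  # size of one NSS cache buffer, as stated above


def buffers_to_mb(buffers):
    """Convert a count of NSS cache buffers to megabytes of RAM."""
    return buffers * BUFFER_BYTES / (1024 ** 2)


def mb_to_buffers(megabytes):
    """Convert a target amount of RAM in megabytes to a buffer count."""
    return megabytes * 1024 ** 2 // BUFFER_BYTES


print(buffers_to_mb(102400))   # the example value from the text
print(mb_to_buffers(400))      # and the reverse conversion
```

The first call reproduces the 400 MB figure from the example above, and the second goes the other way, from a target cache size back to a buffer count.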

STEP 6: Adjust the size of the User Address Space with "server -u"

If Steps 1-5 have been applied, and the server has run for several days or weeks and is still exhibiting signs of logical memory fragmentation, you can alter the default size used for the User Address Space with a server startup command line switch (issued from the DOS prompt or added to the Autoexec.bat line that loads the server). This step, which makes use of a new feature in NetWare Remote Manager (NRM) to get a recommended value for this setting, should be taken ONLY at the time the server is having problems, not when it has been recently re-booted or when it is running smoothly. Please be careful with this setting. We have seen customers set this too low and have problems like high CPU utilization, and programs not loading or running in protected memory correctly.

Use "server -u" to give the memory configuration just what the server needs for the User Address Space, and not more.

Included in NetWare 6.5 Support Pack 2 (and later) is a new feature in the NetWare Remote Manager (NRM) that calculates a recommended value for the "server -u" switch, customized specifically to the conditions and activity on the current server. It is important that this value not be calculated when the server is freshly re-booted; the most accurate calculation can be done only after the server has been running for a while, including if possible a period of peak activity and a back-up cycle or any other intensive operation.

To access this configuration help, open up NetWare Remote Manager (logging in as Admin), and click on "View Memory Config" in the left pane of the main window. From there, click on "Tune Logical Address Space." This opens a screen displaying configuration recommendations from the kernel developers at Novell. The recommended settings are calculated specific to the current server's running condition, and include information on how big to set the User Address Space size and the File System Cache Pool. (The NetWare kernel now stores the maximum amount of memory used by these pools over time, and can recommend optimal settings for them.) This will improve how the server uses memory because unused memory in one pool can automatically be given to the correct pool at boot time.


Remember the three main pools? 1. File Cache Pool, 2. User Address Space or UAS, 3. Virtual Memory or VM Pool.

The FILE CACHE MAXIMUM SIZE will move the line between the FS Cache Pool and the VM Pool up or down, depending on what you set the number to. The server -u parameter will move the line between the UAS and the VM Pool up or down, depending on what you set the number to. This is illustrated very effectively by going into the Memory Analyzer / SEG.NLM and examining the Advanced Summary Screen F10, then the Memory Mapping Screen F7. Or just look at the memory map in the SEGSTATS.TXT file.

One hint: If you have many servers and forget which memory tuning parameters have been done on which server, SEGSTATS.TXT is your answer. It will show a summary of all tuning that this document has discussed.

Novell Support has noticed that some customers are using the settings described in the steps above improperly for some configurations. We have included all 6 Steps above in an attempt to describe all possible factors contributing to or aggravating the problem of memory fragmentation. However, the inclusion of all these steps does not mean that every step is recommended for every customer. The steps outlined above should be followed in the order presented.

Controlling the memory that NLMs can use in the cache pool has proven to be successful with virtually all customers. Steps 2, 3 and 5 above detail three of the most prevalent examples of controlling memory used by specific NLMs. Other modules loaded on the NetWare server, from Novell or from 3rd party vendors, may require scrutiny and adjustments to regulate their role in consuming memory on the server. The NetWare Remote Manager (Module Listing) and other tools can be used to monitor memory consumption over time on a per-module basis on the server.

What is Novell doing to address this problem? Where do we go from here?

As stated before, Novell Support and Novell Development have recently released NetWare 6.5 SP3. By default Auto Tuning is turned on.

Auto Tune Server Memory = ON

With the advent of NetWare 6.5 Support Pack 3, many important memory issues are addressed. These include the following:

  • Memory fragmentation has been significantly decreased through algorithm changes that affect how logical memory address space is allocated.

  • A new NSS API flushes the NSS Cache Pool and allows the NetWare Memory Manager to change the File Cache Maximum Size line on the fly without rebooting the server. This only works if NSS runs with its cache balance set to On (the default), so unless told otherwise, leave NSS cache balance at its default of On.

  • Support Pack 3 comes with memory auto-tuning enabled, so systems do not have the memory problems they experienced in Support Pack 2.

Support Pack 3 also looks at how much memory the server presently has. If, after the server has been running for a while, the NetWare OS needs to adjust the File Cache Maximum Size line to give more logical space for running NLMs, you will receive a message on the server console screen similar to the following:

"Server Logical Address Space is running low. The File Cache Maximum Size has been set to ."

If your auto-tune is set to OFF, you might be asked to make the changes yourself, in which case you will see a message on the server console screen similar to the following:

"Server Logical Address Space is running low. Increase the available logical space by increasing the File Cache Maximum Size to . For maximum benefit, reboot your server at the next convenient time."

With the auto-tune ON or OFF, you might be asked, in a message similar to the following, to increase the L!=P Address Space by shrinking the User Address Space:

"Server Logical Address Space is running low. Increase the available logical space by restarting the server with the -u switch."

Conclusion
I have one customer that has already implemented NetWare 6.5 SP3 or OES NetWare. We use the default settings in a new server installation. If one upgrades to NetWare 6.5 SP3 the upgrade should reset the tuning parameters to default, but I would go back and check all the parameters above just in case. We are still seeing some memory tuning issues on one server. Novell will be releasing a new SERVER.EXE very soon to address this. I am working with this customer and we are examining the latest files to see if this is the same issue that other Novell customers are seeing. I'm waiting for the new patches and then I can give my recommendation for NetWare 6.5 SP3.

Novell is very committed to our NetWare customer base. I'm looking forward to OES NetWare and Linux as well as NetWare 6.5 SP3. I'm relieved and confident that Novell has addressed these memory issues and we can all get back to Self Tuning NetWare as in the past. Novell is well known for the security and stability of NetWare. I'm confident that once we have customers upgrading to OES / NetWare 6.5 SP3 that we'll be able to see that the server again just works.

1 comment:

Anonymous said...

Hi Bucky,

Thanks for the info. I've been searching for information on "cache performance on netware 6.5 is in suspect state" and "work to do on netware 6.5 is in suspect state". This happens once a day and it heals itself again! Do I need to worry about this? I already installed SP3 after a new install and it still gives me these messages. Will the solution you gave on your blog fix the problem?

Thanks
nelly