swol-08-perf

How does Solaris 2.6 improve performance stats and Web performance?

We fill you in on all the new performance and measurement enhancements

Abstract

Solaris 2.6 is out this month. It has a lot of new features, and apart from performance improvements for Web servers and databases, there are some extremely useful new performance measurements provided. (3,400 words)

So, what's new in Solaris 2.6?

A lot, a whole lot, way too much for me to cover in this column. I'll concentrate on performance improvements and new performance measurements.

Solaris 2.6 is a different kind of release from Solaris 2.5 and Solaris 2.5.1. Those releases were tied to very important hardware launches -- UltraSPARC support in Solaris 2.5 and Ultra Enterprise Server support in Solaris 2.5.1. With a hard deadline you have to keep functionality improvements under control, so there were relatively few new features. Solaris 2.6 is not tied to any hardware launch. New systems released this summer all run an updated version of 2.5.1 as well as Solaris 2.6. The current exception is the Enterprise 10000 (Starfire) which was not a Sun product early enough (Sun brought in the development team from Cray during 1996) to have Solaris 2.6 support at first release. Later this year an update release of Solaris 2.6 will include support for the E10000. Because Solaris 2.6 had a more flexible release schedule and fewer hardware dependencies it was possible to take longer over the development and add far more new functionality.

Some of the projects weren't quite ready for Solaris 2.5 (like large file support), so they ended up in Solaris 2.6. Other projects like the integration of Java 1.1 were important enough to delay the release of Solaris 2.6 by a few months. There are other documents on www.sun.com that describe most of the new features (see Resources below), so I'll concentrate on explaining some of the performance tuning that was done for this release and tell you about some small but useful changes to the performance measurements that sneaked into Solaris 2.6. Some of them were filed as requests for enhancement (RFEs) by Brian Wong and me over the last few years. (Brian has a feature story, "The TPC-C database benchmark -- What does it really mean?" in SunWorld this month.)

Web server performance
This is the most dramatic performance change in Solaris 2.6. Multiprocessor scalability is now excellent, and that multiplies up the good performance on one- and two-CPU systems that we already had with Solaris 2.5.1. At the low end, the Ultra 2/2300 has two 300-MHz CPUs with 2 MB caches, giving it a hardware boost of about 50 percent over the Ultra 2/2200 with 200-MHz CPUs and 1 MB caches. The first processor has a useful increase in performance over Solaris 2.5.1, but with Solaris 2.6 the second processor is now contributing almost as much performance, with no internal contention to slow it down. The results on larger systems are harder to interpret because they use different server software, processor modules, and network interface types. The essential message is very clear, however: if you use Solaris 2.6 you can throw a lot of CPUs at this problem and get good additional performance from every one of them.

Solaris 2.5.1/ISS, Netscape on Ultra 2/2200 -- 626 SPECweb96 httpops/sec
Solaris 2.6, SWS on Ultra 2/2300, 2x300MHz -- 1488 SPECweb96 httpops/sec
Solaris 2.6, SWS on E3000, 6x250MHz -- 2535 SPECweb96 httpops/sec
Solaris 2.6, Netscape on E4000, 8x250MHz -- 2796 SPECweb96 httpops/sec
Solaris 2.6, SWS on E4000, 10x250MHz -- 3746 SPECweb96 httpops/sec
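
One rough way to read the Solaris 2.6 results above is to divide each figure by its CPU count. The sketch below does only that arithmetic. It is not a like-for-like comparison, since the rows mix 300-MHz and 250-MHz processors and two different Web servers, and the Solaris 2.5.1/ISS row is left out because its CPU count is not stated in the list.

```python
# SPECweb96 results quoted above: (configuration, CPU count, httpops/sec).
RESULTS = [
    ("Solaris 2.6, SWS on Ultra 2/2300", 2, 1488),
    ("Solaris 2.6, SWS on E3000", 6, 2535),
    ("Solaris 2.6, Netscape on E4000", 8, 2796),
    ("Solaris 2.6, SWS on E4000", 10, 3746),
]


def per_cpu(results):
    """Return (configuration, httpops/sec per CPU) pairs."""
    return [(name, ops / cpus) for name, cpus, ops in results]


for name, rate in per_cpu(RESULTS):
    print(f"{name:35s} {rate:7.1f} httpops/sec per CPU")
```

The per-CPU figure for the 10-CPU E4000 running SWS is still a large fraction of the per-CPU figure for the 6-CPU E3000, which is what continued scaling looks like in these numbers.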

The essential difference is that Solaris 2.6 has extremely good multiprocessor scalability for TCP connection intensive workloads like Web service. No other vendor has demonstrated anything more than poor scalability to four CPUs; Solaris 2.6 has good scalability to 10 CPUs and beyond. This is the result of a large sustained effort by a team of engineers at SunSoft. They rewrote the locking strategies in TCP, IP, and streams, building on the in-kernel socket code that was introduced as part of Solaris 2.5.1/ISS. The benefit applies to any Web server code, although the large increase in kernel efficiency exposes the relative performance of different Web servers. With previous releases the kernel's TCP/IP stack barely scaled to two CPUs for connection-intensive workloads, so differences between server code were masked, and there was no performance improvement on large multiprocessor systems. These results show that the new Solaris Web Server (SWS1.0) that comes with server editions of Solaris 2.6 is the most efficient, with Netscape's Enterprise Server also performing well. Internal tests have shown that SWS is faster than the Zeus server used for many published SPECweb96 benchmarks, followed by Netscape then Apache.

The message should be obvious. Upgrade busy Web servers to Solaris 2.6 as soon as you can. Check out the features of SWS1.0 to see if you can use it (see Resources). In this release it has no server API, but does have a flexible security management system, so it could be a good upgrade for a basic Apache setup.

Database server performance
Database server performance was already very good and scaled well with Solaris 2.5.1. There is always room for improvement though, and several changes have been made to increase efficiency and scalability even further in Solaris 2.6. If you look at the recently published TPC benchmarks you will see that they have all used Solaris 2.6. TPC rules say that the products used must ship within six months. There are a couple of features worth mentioning. The first is a transparent increase in efficiency on UltraSPARC systems. The intimate shared memory segment used by most databases is now mapped using 4 MB pages, rather than lots of 8 KB pages. This greatly reduces the load on the memory management unit (MMU). Intimate shared memory is an existing optimization that causes the memory to be locked into RAM at a fixed address and its MMU translations to be shared by all processes. See the SHM_SHARE_MMU option to shmat(2).
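
To get a feel for why larger pages reduce the load on the MMU, the sketch below counts how many page translations are needed to map a shared memory segment at two page sizes. The 1 GB segment size is a hypothetical example, and the large-page size is simply a parameter; 4 MB is used here because it is the largest page size the UltraSPARC MMU supports.

```python
KB = 1024
MB = 1024 * KB


def mappings_needed(segment_bytes, page_bytes):
    """Number of page translations needed to map a segment (rounded up)."""
    return -(-segment_bytes // page_bytes)


segment = 1024 * MB  # hypothetical 1 GB intimate shared memory segment

small_pages = mappings_needed(segment, 8 * KB)
large_pages = mappings_needed(segment, 4 * MB)

print(f"8 KB pages: {small_pages} translations")
print(f"4 MB pages: {large_pages} translations")
```

Every translation that is not needed is one that cannot miss in the MMU's translation cache, which is where the efficiency gain comes from.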

The second new feature is direct I/O. This enables a database table that is resident in a filesystem to bypass the filesystem buffering and behave more like a piece of raw disk. This benefit does not show up in TPC benchmarks, as they are always run using raw disk for maximum efficiency, but for many real-world installations that use filesystems for administrative convenience the performance improvement can be dramatic. It makes the most difference on write-intensive workloads. Direct I/O requires block-aligned transfers; any I/O that is not block aligned goes through UFS buffering, which holds the unaligned data. Some new mount_ufs(1M) options enable and control the direct I/O features. For even higher performance the optional Veritas VxFS filesystem is now a supported Sun product. It also has a direct I/O capability, and its extent-based on-disk layout gives further performance advantages over the UFS indirect block scheme.

New and improved performance measurements
A collection of RFEs had built up over several years, asking for better measurements in the operating system and improvements for the tools that display the metrics. Brian Wong and I filed some of them; others came from database engineering and from customers. These RFEs have now been implemented -- so I'm having to think of some new ones! You should be aware that Sun's bug tracking tool has three kinds of bug in it: problem bugs, RFEs, and Ease Of Use (EOU) issues. If you have an idea for an improvement, or think that something should be easier to use, you can help everyone by taking the trouble to call up Sun Service and asking them to register it. It may take a long time to appear in a release, but it will take even longer if you don't tell anyone!

The improvements we got this time include new disk metrics, new iostat options, tape metrics, client side NFS mount point metrics, network byte counters, and detailed process memory usage measurements.

Disk metrics
Disk configurations have become extremely large and complex on big server systems. A maximally configured E10000 supports several thousand disk drives, but even dealing with a few hundred is a problem. When large numbers of disks are configured the overall failure rate also increases. It can be hard to keep an inventory of all the disks, and tools like Solstice Symon depend upon parsing messages from syslog to see if any faults are reported. The size of each disk is also growing. When more than one type of data is stored on a disk, it becomes hard to work out which disk partition is active. A series of new features has been introduced to help solve these problems.

New per-partition data identical to existing per-disk data. It is now possible to separate out root, swap, and home directory activity even if they are all on the same disk.
New "error and identity" data per disk, so there is no longer a need to scan syslog for errors. Full data is saved from the first SCSI probe to a disk. This includes Vendor, Product, Revision, Serial no, RPM, heads, and size. Soft, hard, and transport error counter categories sum up any problems. If you want more details there are counters for Media Error, Device not ready, No device, Recoverable, Illegal request, and Predictive failure analysis. Dead or missing disks can still be identified as there is no need to send them another SCSI probe.
New iostat options are provided to present these metrics. One option (iostat -M) shows throughput in MB/s rather than KB/s, which is useful for hardware RAID units on high-end systems. Another option (-n) translates disk names into a much more useful form, so that instead of the "sd43b" format you get "c1t2d5s1." This makes it much easier to keep track of per-controller load levels in large configurations.

Tape metrics
Fast tapes now match the performance impact of disks. We recently ran a tape backup benchmark to see if there were any scalability or throughput limits in Solaris, and we were very pleased to find that the only real limit is the speed of your disks and tape drives. The final result was a backup rate of an Oracle database at 1 terabyte per hour. This works out at about 350 megabytes per second, which was as fast as the disk subsystem we had configured could go. To sustain this rate we used every tape drive we could lay our hands on, including 24 StorageTEK Redwood tape transports, which run at around 15 MB/s each. We ran this test using Solaris 2.5.1, which has no measurements of tape drive throughput. Tape metrics have now been added in Solaris 2.6, closing one of my RFEs that was originally filed a few years ago. You can finally see which tape drive is active, along with the throughput, average transfer size, and service time for each tape drive. Thanks to Henry Newman of Instrumental (http://www.instrumental.com) for raising this issue with me in the first place.
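
The conversions between the figures quoted above can be sketched as follows. Decimal units (1 TB = 1,000,000 MB) are an assumption here, since the text does not say which convention was used, and the function names are made up for this illustration.

```python
def mb_per_sec_to_tb_per_hour(mb_per_sec):
    """Convert MB/s to TB/hour, assuming decimal units (1 TB = 1,000,000 MB)."""
    return mb_per_sec * 3600 / 1_000_000


def tb_per_hour_to_mb_per_sec(tb_per_hour):
    """Convert TB/hour to MB/s, assuming decimal units."""
    return tb_per_hour * 1_000_000 / 3600


# Aggregate rate of the 24 Redwood transports at around 15 MB/s each.
redwood_mb_per_sec = 24 * 15

print(f"24 Redwood drives: {redwood_mb_per_sec} MB/s")
print(f"350 MB/s sustained: {mb_per_sec_to_tb_per_hour(350):.2f} TB/hour")
print(f"1 TB/hour exactly: {tb_per_hour_to_mb_per_sec(1):.0f} MB/s")
```

On these assumptions the Redwood drives alone account for roughly the whole quoted disk rate, and a sustained 350 MB/s is comfortably above the 1 terabyte per hour mark.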

Tapes are instrumented the same way as disks; they appear in sar and iostat automatically. Tape read/write operations are instrumented with all the same measures that are used for disks. Rewind and scan/seek are omitted from the service time.

Some new iostat options
The output format and options of sar(1) are fixed by the generic Unix standard SVID3, but the format and options for iostat can be changed. In Solaris 2.6, existing iostat options are unchanged, and apart from extra entries that appear for tape drives and NFS mount points (described later), anyone storing iostat data from a mixture of Solaris 2 systems will get a consistent format. There are new options that extend iostat as follows:

-E full error stats
-e error summary stats
-n disk name and NFS mount point translation, extended service time
-M MB/s instead of KB/s
-P partitions only
-p disks and partitions

Here are examples of some of the new iostat formats:

% iostat -xp
                               extended device statistics
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd106     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd106,a   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd106,b   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd106,c   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
st47      0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0 

% iostat -xe
                               extended device statistics ---- errors ----
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot
sd106     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0 
st47      0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0 

% iostat -E

sd106   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: SEAGATE  Product: ST15230W SUN4.2G Revision: 0626 Serial No: 00193749 
RPM: 7200 Heads: 16 Size: 4.29GB <4292075520 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 

st47    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: EXABYTE  Product: EXB-8505SMBANSH2 Revision: 0793 Serial No:
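
Because the error counters are now available directly, a monitoring script no longer needs to scan syslog. The sketch below is one hedged way to do that: it parses only the summary line of each "iostat -E" record, in the layout shown in the example above, and lists devices with any non-zero counter. The function names are made up for this illustration, and the layout is assumed to match the sample exactly.

```python
import re

# Summary line of each "iostat -E" record, as laid out in the sample above.
HEADER = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors: (?P<soft>\d+) "
    r"Hard Errors: (?P<hard>\d+) Transport Errors: (?P<trn>\d+)"
)

SAMPLE = """\
sd106   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST15230W SUN4.2G Revision: 0626 Serial No: 00193749
RPM: 7200 Heads: 16 Size: 4.29GB <4292075520 bytes>
st47    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
"""


def parse_iostat_E(text):
    """Map each device name to its soft/hard/transport error counts."""
    devices = {}
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            devices[match.group("dev")] = {
                "soft": int(match.group("soft")),
                "hard": int(match.group("hard")),
                "transport": int(match.group("trn")),
            }
    return devices


def devices_with_errors(devices):
    """Return the sorted names of devices with any non-zero error counter."""
    return sorted(dev for dev, counts in devices.items() if any(counts.values()))


print(devices_with_errors(parse_iostat_E(SAMPLE)))
```

In practice the text would come from running iostat -E and capturing its output; the identity lines (vendor, product, serial number) could be collected the same way to build a disk inventory.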

New NFS metrics
Local disk and NFS usage is functionally interchangeable so Solaris 2.6 was changed to instrument NFS Client mount points as if they are disks! NFS mounts are always shown by iostat and sar. With automounted directories coming and going more often than disks coming online this may cause problems for performance tools that don't expect the number of iostat or sar records to change often. We will have to do some work on the SE toolkit to handle this properly.

The full instrumentation includes the wait queue for commands in the client (biod wait) that have not yet been sent to the server, the active queue for commands currently in the server, and utilization (%busy) for the server mount point activity level. Note that unlike for disks, 100 percent busy does NOT indicate that the server itself is saturated; it just indicates that the client always has outstanding requests to that server. An NFS server is much more complex than a disk drive and can handle a lot more simultaneous requests than a single disk drive can.

The example shows off the new "-xnP" option, although NFS mounts appear in all formats. Note that the "P" option suppresses disks and shows only disk partitions. The "xn" option breaks down the response time "svc_t" into wait and active times, and puts the full device name at the end of the line so that long names don't mess up the columns. The "vold" entry is used to mount floppy and CD-ROM devices.

crun% iostat -xnP
                              extended device statistics
  r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 crun:vold(pid363)
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 serv-dist:/usr/dist
  0.0  0.5    0.0    7.9  0.0  0.0    0.0   20.7   0   1 serv-home:/export/home2/adrianc
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 serv-home:/var/mail
  0.0  1.3    0.0   10.4  0.0  0.2    0.0  128.0   0   2 c0t2d0s0
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0s2
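
Putting the device name last also makes this format easy to parse. The sketch below reads the "-xn" layout shown above and adds the wait and active service times back together to get the overall response time per device. The function names are made up for this illustration, and the column names are taken from the sample header.

```python
SAMPLE = """\
                              extended device statistics
  r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 crun:vold(pid363)
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 serv-dist:/usr/dist
  0.0  0.5    0.0    7.9  0.0  0.0    0.0   20.7   0   1 serv-home:/export/home2/adrianc
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 serv-home:/var/mail
  0.0  1.3    0.0   10.4  0.0  0.2    0.0  128.0   0   2 c0t2d0s0
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0s2
"""


def parse_iostat_xn(text):
    """Parse "iostat -xn" style output into a list of per-device dicts."""
    rows = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[-1] == "device":
            columns = fields[:-1]  # header line names the numeric columns
            continue
        if columns is None or len(fields) != len(columns) + 1:
            continue  # banner line or anything else that is not a data row
        try:
            values = [float(f) for f in fields[:-1]]
        except ValueError:
            continue
        row = dict(zip(columns, values))
        row["device"] = fields[-1]
        rows.append(row)
    return rows


def response_time_ms(row):
    """Overall response time: time waiting plus time active."""
    return row["wsvc_t"] + row["asvc_t"]


rows = parse_iostat_xn(SAMPLE)
slowest = max(rows, key=response_time_ms)
print(slowest["device"], response_time_ms(slowest))
```

Because NFS mount points come and go with the automounter, a tool built this way should key its data on the device name rather than on the position of a line in the output.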

New network metrics
The standard SNMP MIB for a network interface is supposed to contain ifInOctets and ifOutOctets counters that report the number of bytes input and output on the interface. These were not measured by network devices for Solaris 2, so the MIB always reported zero. Brian Wong and I filed RFEs against all the different interfaces a few years ago, and bugs were filed more recently against the SNMP implementation. The result is that these counters have been added to the "le" and "hme" interfaces in Solaris 2.6, and the fix has been backported to Solaris 2.5.1 in patches 103903-03 (le) and 104212-04 (hme).

The new counters added were:

rbytes, obytes -- read and output byte counts
multircv, multixmt -- multicast receive and transmit byte counts
brdcstrcv, brdcstxmt -- broadcast byte counts
norcvbuf, noxmtbuf -- buffer allocation failure counts

% netstat -k | more
...
le0:

ipackets 0 ierrors 0 opackets 0 oerrors 5 collisions 0 
defer 0 framing 0 crc 0 oflo 0 uflo 0 missed 0 late_collisions 0 
retry_error 0 nocarrier 2 inits 11 notmds 0 notbufs 0 norbufs 0 
nocanput 0 allocbfail 0 rbytes 0 obytes 0 multircv 0 multixmt 0 
brdcstrcv 0 brdcstxmt 5 norcvbuf 0 noxmtbuf 0
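
With byte counters available, interface throughput can be estimated by sampling the counters twice and dividing the difference by the interval. The sketch below assumes the name/value layout of the "le0" block shown above. The function names and the two sample strings are hypothetical, and counter wraparound is not handled.

```python
def parse_kstat_block(text):
    """Parse alternating "name value" pairs into a dict of integers."""
    tokens = text.split()
    return {name: int(value) for name, value in zip(tokens[0::2], tokens[1::2])}


def byte_rates(before, after, interval_sec):
    """Bytes per second in each direction between two counter samples."""
    return {
        "in_bytes_per_sec": (after["rbytes"] - before["rbytes"]) / interval_sec,
        "out_bytes_per_sec": (after["obytes"] - before["obytes"]) / interval_sec,
    }


# Hypothetical samples taken 10 seconds apart (not real measurements).
first = parse_kstat_block("ipackets 100 opackets 80 rbytes 64000 obytes 32000")
second = parse_kstat_block("ipackets 900 opackets 500 rbytes 1064000 obytes 532000")

print(byte_rates(first, second, 10))
```

The same difference-over-interval approach applies to the packet and error counters in the block.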

An unfortunate by-product of this change is that a spelling mistake was corrected in the metrics for "le." The metric "framming" was replaced by "framing." Not many tools look at all the metrics, but the SE toolkit does, and if patch 103903-03 is loaded any SE script that looks at the network and finds an "le" interface fails immediately.

New and changed ndd parameters
tcp_conn_req_max replaced. This value is well-known as it normally needs to be increased for Web servers in older releases of Solaris 2. It no longer exists in Solaris 2.6, and patch 103582-12 makes the same change in Solaris 2.5.1. The change is part of a fix that prevents denial of service from SYN flood attacks. There are now two separate connection queues instead of one: one for connections whose handshake is still incomplete, and one for completed connections waiting to be accepted.

tcp_conn_req_max_q (default value 128) is the maximum number of completed connections waiting to return from an accept call as soon as the right process gets some CPU time.

tcp_conn_req_max_q0 (default value 1024) is the maximum number of connections with handshake incomplete. A SYN flood attack could only affect this queue, and a special algorithm makes sure that valid connections can still get through.

The new values are high enough to not need tuning in normal use as a Web server.

ip_addrs_per_if is new in Solaris 2.5.1/ISS and 2.6. This allows higher virtual IP hosting numbers. The default is 256 as before; it has been tested up to 8192. Some work was also done to speed up ifconfig of large numbers of interfaces. You configure a virtual IP address using ifconfig, appending the number to the interface name after a colon.

ifconfig hme0:283 ...

tcp_conn_hash_size has moved! There is a hash table structure that TCP uses to locate a TCP connection control block. By default the table contains 256 entries, but when running at sustained high connection rates, tens of thousands of control blocks can be present. The hashed lookup degrades to a linear search and wastes CPU cycles. For SPECweb96 tests the table size was set to 262144. You shouldn't normally set it this high as it is a waste of RAM. The current size is shown at the start of the read-only tcp_conn_hash display using ndd.

This variable was introduced as an ndd variable in 2.5.1/ISS, and could be changed online. In Solaris 2.6 the need for multiprocessor scalability removed the lock that previously allowed it to be changed online. The variable is now set in /etc/system as tcp:tcp_conn_hash_size. It needs a reboot to change it and rounds up to a power of two.
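
The effect of the table size can be estimated with simple arithmetic. The sketch below assumes that connections are spread evenly across the hash buckets, which is an idealization, and shows the rounding to a power of two described above. The function names and the connection count are hypothetical.

```python
def round_up_pow2(n):
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    return 1 << (n - 1).bit_length()


def average_chain_length(connections, table_size):
    """Average control blocks per bucket, assuming an even spread."""
    return connections / table_size


connections = 50_000  # hypothetical number of TCP control blocks

for requested in (256, 10_000, 262_144):
    size = round_up_pow2(requested)
    chain = average_chain_length(connections, size)
    print(f"requested {requested:7d} -> table {size:7d}, about {chain:.2f} per bucket")
```

With the default 256 entries each lookup walks a chain of nearly 200 control blocks in this example, while a requested size of 10,000 rounds up to 16,384 and brings the average down to about three.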

netstat -s now non-invasive. The -s option dumps out all the network protocol statistics. In previous releases netstat grabbed a global lock that caused additional contention in TCP. With the new scalable TCP some work was done so that you can run netstat without locking TCP, so it no longer slows Web servers down.

Process memory usage
There is a new /proc/pid/metric structure that allows you to use open/read/close rather than open/ioctl/close to read data from /proc. It can be seen using ls.

% ls /proc/5436
./         cred       lpsinfo    map        rmap       usage
../        ctl        lstatus    object/    root@      watch
as         cwd@       lusage     pagedata   sigact     xmap
auxv       fd/        lwp/       psinfo     status

The new xmap data provides extended mappings which show how much memory a process is really using and how much is resident -- shared and private for each segment. This is an excellent way to figure out memory sizing. If you want to run 100 copies of a process, you can look at one and figure out how much private memory you need to multiply by 100. This facility is based on work done by Richard McDougall, who joined our group earlier this year.

% /usr/proc/bin/pmap -x 5436
5436:   /bin/csh
Address   Kbytes Resident Shared Private Permissions       Mapped File
00010000     140     140     132       8 read/exec         csh
00042000      20      20       4      16 read/write/exec   csh
00047000     164      68       -      68 read/write/exec    [ heap ]
EF6C0000     588     524     488      36 read/exec         libc.so.1
EF762000      24      24       4      20 read/write/exec   libc.so.1
EF768000       8       4       -       4 read/write/exec    [ anon ]
EF790000       4       4       -       4 read/exec         libmapmalloc.so.1
EF7A0000       8       8       -       8 read/write/exec   libmapmalloc.so.1
EF7B0000       4       4       4       - read/exec/shared  libdl.so.1
EF7C0000       4       -       -       - read/write/exec    [ anon ]
EF7D0000     112     112     112       - read/exec         ld.so.1
EF7FB000       8       8       4       4 read/write/exec   ld.so.1
EFFF5000      44      24       -      24 read/write/exec    [ stack ]
--------  ------  ------  ------  ------
total Kb    1128     940     748     192
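
The sizing estimate described above can be sketched from the totals line of the pmap -x output. The sketch assumes that the shared pages are counted once however many copies run and that each copy needs its own private pages. The function names are made up for this illustration.

```python
TOTAL_LINE = "total Kb    1128     940     748     192"


def parse_pmap_total(text):
    """Extract the Kbytes/Resident/Shared/Private totals from pmap -x output."""
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["total", "Kb"]:
            kbytes, resident, shared, private = (int(f) for f in fields[2:6])
            return {
                "kbytes": kbytes,
                "resident": resident,
                "shared": shared,
                "private": private,
            }
    raise ValueError("no 'total Kb' line found")


def estimate_kb(totals, copies):
    """Shared memory counted once, private memory once per copy."""
    return totals["shared"] + copies * totals["private"]


totals = parse_pmap_total(TOTAL_LINE)
print(f"100 copies need roughly {estimate_kb(totals, 100)} KB of RAM")
```

For the csh example above this gives 748 KB shared plus 100 times 192 KB private, a little under 20 MB, compared with over 90 MB if the resident size were simply multiplied by 100.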

Wrap up
I work in the enterprise server division at Sun, so it's really good for us that Web server performance now scales, and an E4000 is now the "sweet spot" for high-end Web server performance rather than a cluster of Ultra 2s. Overall, despite the exponential growth of the Internet, we seem to be matching or exceeding the performance needs of Web servers. The situation now is that network bandwidth is the bottleneck again -- the E4000-based SPECweb96 result used two 622 Mbit ATM and a bunch of 100baseT interfaces!

The new metrics need to be supported by performance tools vendors. Unfortunately, some of the new metrics were not included in the beta release versions of Solaris 2.6, so vendors will have to test on the final release of Solaris 2.6 before they know what the situation is. It is quite normal for performance tools to be amongst the least portable applications from one release to the next. Since Rich Pettit recently left Sun to work for a performance tool vendor (Capital Technologies -- http://www.captech.com) we have not been able to keep the SE Toolkit tracking Solaris 2.6. It will take us a while after the final release of Solaris 2.6 is available before we have a version of SE that works and supports the new metrics. I'll update you on our progress next month.

Resources

Solaris 2.6 information on Sun's site http://www.sun.com/solaris/
Solaris 2.6 core OS FAQ http://www.sun.com/solaris/faqs/faq-os.html
General Solaris 2.6 FAQ http://www.sun.com/solaris/faqs/faq-products.html
Sun WebServer (SWS) 1.0 http://www.sun.com/webserver/index.html

Solaris-related stories in SunWorld

"Solaris 2.6 ushers in the millennium," July 1997 feature story in SunWorld http://www.sun.com/sunworldonline/swol-07-1997/swol-07-solaris.html
"Sun targets Novell installed base with Solaris Server for Intranets," July 1997 news story in SunWorld http://www.sun.com/sunworldonline/swol-07-1997/swol-07-solarisintranet.html
"It's official -- Sun goes public with Solaris 2.6," June 1997 news story in SunWorld http://www.sun.com/sunworldonline/swol-06-1997/swol-06-solaris2.6.html
"Web Start: Phase one of 'native Java environment for Solaris,'" April 1997 SunWorld news story http://www.sun.com/sunworldonline/swol-04-1997/swol-04-webstart.html
"SUG East: Solaris 2.6 due August 18," June 1997 SunWorld news story http://www.sun.com/sunworldonline/swol-06-1997/swol-06-sug.html
"Solaris 2.6 AnswerBook interface now demoed on Sun's Web site," June 1997 SunWorld news story http://www.sun.com/sunworldonline/swol-06-1997/swol-06-sunspots.html#2
"Solaris 2.6: We've got the goods on the new features," March 1997 SunWorld news story http://www.sun.com/sunworldonline/swol-03-1997/swol-03-solaris2.6.html
"Sun injects Solaris X86 with new life as it makes its way to 64 bits," February 1997 SunWorld feature story http://sw.wpi.com/sunworldonline/swol-02-1997/swol-02-solarisX86.html
Solaris and Solaris shareware stories in SunWorld's Site Index http://www.sun.com/sunworldonline/common/swol-siteindex.html#solaris
Solaris shareware resources in SunWorld's sunWHERE http://www.sun.com/sunworldonline/sunwhere.html#shareware

More resources

Solaris patch information page, SunSolve Online http://sunsolve.Sun.COM/pub-cgi/pubpatchpage.pl
See Adrian Cockcroft's frequently asked questions for answers to three dozen performance-related questions. Subjects covered include performance monitoring commands, tuning variables, logins and processes, how to interpret the output of performance measurements, and how to optimize Web servers and news servers. http://www.sun.com/sunworldonline/common/cockcroft.letters.html
virtual_adrian.se rule http://www.sun.com/951001/columns/adrian/column2.html
Interested in Web server performance? Go to SunWorld's Site Index http://www.sun.com/sunworldonline/common/swol-siteindex.html#webperf
If you want to build performance tools and utilities, get a copy of the SE Performance Toolkit Version 2.5.0.2 http://www.sun.com/960601/columns/adrian/se2.5.html
Adrian Cockcroft's profile (complete with low- and high-bandwidth bios) http://www.sun.com/950901/columns/adrian/adrian.html
A full listing of Adrian Cockcroft's other Performance Q&A columns in SunWorld http://www.sun.com/sunworldonline/common/swol-backissues-columns.html#perf

Other Cockcroft columns at www.sun.com

"New Release of the SE Performance Toolkit" http://www.sun.com/960301/columns/adrian/column7.html
"Solaris 2.5 Performance Update" http://www.sun.com/960201/columns/adrian/
"Confessions of an Ultra 1 User" http://www.sun.com/951107/columns/adrian/column3.html
"Advanced Monitoring and Tuning" http://www.sun.com/951001/columns/adrian/column2.html
"System Performance Monitoring" http://www.sun.com/950901/columns/adrian/column1.html