How does Solaris 2.6 improve performance stats and Web performance?
We fill you in on all the new performance and measurement enhancements
Solaris 2.6 is out this month. It has a lot of new features, and apart from performance improvements for Web servers and databases, there are some extremely useful new performance measurements provided. (3,400 words)
A lot, a whole lot, way too much for me to cover in this column. I'll concentrate on performance improvements and new performance measurements.
Solaris 2.6 is a different kind of release from Solaris 2.5 and Solaris 2.5.1. Those releases were tied to very important hardware launches -- UltraSPARC support in Solaris 2.5 and Ultra Enterprise Server support in Solaris 2.5.1. With a hard deadline you have to keep functionality improvements under control, so there were relatively few new features. Solaris 2.6 is not tied to any hardware launch; new systems released this summer all run an updated version of 2.5.1 as well as Solaris 2.6. The current exception is the Enterprise 10000 (Starfire), which did not become a Sun product early enough (Sun brought in the development team from Cray during 1996) to have Solaris 2.6 support at first release. Later this year an update release of Solaris 2.6 will add support for the E10000. Because Solaris 2.6 had a more flexible release schedule and fewer hardware dependencies, it was possible to take longer over development and add far more new functionality.
Some projects weren't quite ready for Solaris 2.5 (like large file support), so they ended up in Solaris 2.6. Other projects, like the integration of Java 1.1, were important enough to delay the release of Solaris 2.6 by a few months. There are other documents on www.sun.com that describe most of the new features (see Resources below), so I'll concentrate on explaining some of the performance tuning that was done for this release and tell you about some small but useful changes to the performance measurements that sneaked into Solaris 2.6. Some of them were filed as requests for enhancement (RFEs) by myself and Brian Wong over the last few years. (Brian has a feature story, "The TPC-C database benchmark -- What does it really mean?" in SunWorld this month.)
Web server performance
This is the most dramatic performance change in Solaris 2.6. Multiprocessor scalability is now excellent, and that multiplies up the already good single- and dual-CPU performance of Solaris 2.5.1. At the low end, the Ultra 2/2300 has two 300-MHz CPUs with 2-MB caches, giving it a hardware boost of about 50 percent over the Ultra 2/2200 with 200-MHz CPUs and 1-MB caches. The first processor shows a useful increase in performance over Solaris 2.5.1, but with Solaris 2.6 the second processor now contributes almost as much performance again, with no internal contention to slow it down. The results on larger systems are harder to interpret because they use different server software, processor modules, and network interface types. The essential message is very clear, however: if you use Solaris 2.6 you can throw a lot of CPUs at this problem and get good additional performance from every one of them.
The message should be obvious: upgrade busy Web servers to Solaris 2.6 as soon as you can. Check out the features of SWS 1.0 to see if you can use it (see Resources). In this release it has no server API, but it does have a flexible security management system, so it could be a good upgrade from a basic Apache setup.
Database server performance
Database server performance was already very good and scaled well with Solaris 2.5.1. There is always room for improvement though, and several changes have been made to increase efficiency and scalability even further in Solaris 2.6. If you look at the recently published TPC benchmarks you will see that they have all used Solaris 2.6; TPC rules say that the products used must ship within six months. There are a couple of features worth mentioning. The first is a transparent increase in efficiency on UltraSPARC systems: the intimate shared memory segment used by most databases is now mapped using 4-MB pages, rather than lots of 8-KB pages. This greatly reduces the load on the memory management unit (MMU). Intimate shared memory is an existing optimization that locks the memory into RAM at a fixed address and shares its MMU translations among all attached processes. See the SHM_SHARE_MMU option to shmat(2).
The second new feature is direct I/O. This enables a database table that resides in a filesystem to bypass the filesystem buffering and behave more like a piece of raw disk. This benefit does not show up in TPC benchmarks, as they are always run using raw disk for maximum efficiency, but for many real-world installations that use filesystems for administrative convenience the performance improvement can be dramatic. It makes the most difference on write-intensive workloads. All I/O must be block aligned; if it is not, then UFS buffering is used to hold the unaligned data. Some new mount_ufs(1M) options enable and control the direct I/O features. For even higher performance the optional Veritas VxFS filesystem is now a supported Sun product. It also has a direct I/O capability, and its extent-based on-disk layout gives further performance advantages over the UFS indirect block scheme.
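Here is a minimal sketch of how the new mount_ufs options are used; the device and mount point names are hypothetical, and you should check mount_ufs(1M) on your own system:

# Mount a database filesystem with UFS buffering bypassed.
mount -F ufs -o forcedirectio /dev/dsk/c1t0d0s6 /db01

# The default behavior, with the filesystem buffered as usual.
mount -F ufs -o noforcedirectio /dev/dsk/c1t0d0s6 /db01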
New and improved performance measurements
A collection of RFEs had built up over several years, asking for better measurements in the operating system and improvements to the tools that display the metrics. Brian Wong and I filed some of them; others came from database engineering and from customers. These RFEs have now been implemented -- so I'm having to think of some new ones! You should be aware that Sun's bug tracking tool has three kinds of bug in it: problem bugs, RFEs, and ease-of-use (EOU) issues. If you have an idea for an improvement, or think that something should be easier to use, you can help everyone by taking the trouble to call up Sun Service and ask to have it registered. It may take a long time to appear in a release, but it will take even longer if you don't tell anyone!
The improvements we got this time include new disk metrics, new iostat options, tape metrics, client side NFS mount point metrics, network byte counters, and detailed process memory usage measurements.
Disk configurations have become extremely large and complex on big server systems. A maximally configured E10000 supports several thousand disk drives, but even dealing with a few hundred is a problem. When large numbers of disks are configured, the overall failure rate also increases. It can be hard to keep an inventory of all the disks, and tools like Solstice SyMON depend upon parsing messages from syslog to see if any faults are reported. The size of each disk is also growing; when more than one type of data is stored on a disk, it becomes hard to work out which disk partition is active. A series of new features has been introduced to help solve these problems.
Tapes are instrumented the same way as disks; they appear in sar and iostat automatically. Tape read/write operations are instrumented with all the same measures that are used for disks. Rewind and scan/seek are omitted from the service time.
Some new iostat options
The output format and options of sar(1) are fixed by the generic Unix standard SVID3, but the format and options for iostat can be changed. In Solaris 2.6, existing iostat options are unchanged, and apart from extra entries that appear for tape drives and NFS mount points (described later), anyone storing iostat data from a mixture of Solaris 2 systems will get a consistent format. There are new options that extend iostat as follows:
% iostat -xp
                         extended device statistics
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b
sd106     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd106,a   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd106,b   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd106,c   0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0
st47      0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0

% iostat -xe
                         extended device statistics        ---- errors ----
device    r/s  w/s   kr/s   kw/s wait actv  svc_t  %w  %b  s/w h/w trn tot
sd106     0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0    0   0   0   0
st47      0.0  0.0    0.0    0.0  0.0  0.0    0.0   0   0    0   0   0   0

% iostat -E
sd106    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST15230W SUN4.2G  Revision: 0626  Serial No: 00193749
RPM: 7200 Heads: 16 Size: 4.29GB <4292075520 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
st47     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: EXABYTE  Product: EXB-8505SMBANSH2  Revision: 0793  Serial No:

New NFS metrics
The full instrumentation includes the wait queue for commands in the client (biod wait) that have not yet been sent to the server, the active queue for commands currently in the server, and utilization (%busy) for the server mount point activity level. Note that unlike for disks, 100 percent busy does NOT indicate that the server itself is saturated; it just indicates that the client always has outstanding requests to that server. An NFS server is much more complex than a disk drive and can handle a lot more simultaneous requests than a single disk drive can.
The example shows off the new "-xnP" option, although NFS mounts appear in all formats. Note that the "P" option suppresses disks and shows only disk partitions. The "xn" option breaks down the response time "svc_t" into wait and active times, and puts the full device name at the end of the line so that long names don't mess up the columns. The "vold" entry is used to mount floppy and CD-ROM devices.
crun% iostat -xnP
                           extended device statistics
  r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 crun:vold(pid363)
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 serv-dist:/usr/dist
  0.0  0.5    0.0    7.9  0.0  0.0    0.0   20.7   0   1 serv-home:/export/home2/adrianc
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 serv-home:/var/mail
  0.0  1.3    0.0   10.4  0.0  0.2    0.0  128.0   0   2 c0t2d0s0
  0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0s2

New network metrics
The new counters added were:
% netstat -k | more
...
le0:
ipackets 0 ierrors 0 opackets 0 oerrors 5 collisions 0
defer 0 framing 0 crc 0 oflo 0 uflo 0 missed 0 late_collisions 0
retry_error 0 nocarrier 2 inits 11 notmds 0 notbufs 0 norbufs 0
nocanput 0 allocbfail 0 rbytes 0 obytes 0 multircv 0 multixmt 0
brdcstrcv 0 brdcstxmt 5 norcvbuf 0 noxmtbuf 0

An unfortunate by-product of this change is that a spelling mistake was corrected in the metrics for "le." The metric "framming" was replaced by "framing." Not many tools look at all the metrics, but the SE toolkit does, and if patch 103903-03 is loaded, any SE script that looks at the network and finds an "le" interface fails immediately.
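The new rbytes and obytes counters are cumulative totals, so you have to sample them twice to turn them into a throughput figure. Here is a minimal sketch for an le0 interface; the interface name, the ten-second interval, and the nawk parsing are my own assumptions, and netstat -k is an undocumented option whose output format may change:

#!/bin/sh
# Sum the cumulative rbytes and obytes counters for le0.
sample() {
        netstat -k | nawk '/^le0:/ { found = 1 }
                found && /rbytes/ {
                        for (i = 1; i < NF; i++)
                                if ($i == "rbytes" || $i == "obytes")
                                        total += $(i + 1)
                        print total
                        exit
                }'
}
before=`sample`
sleep 10
after=`sample`
total=`expr $after - $before`
echo "le0 moved `expr $total / 10` bytes/sec (in + out)"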
New and changed ndd parameters
tcp_conn_req_max replaced. This value is well-known as it normally needs to be increased for Web servers in older releases of Solaris 2. It no longer exists in Solaris 2.6, and patch 103582-12 adds this feature to Solaris 2.5.1. The change is part of a fix that prevents denial of service from SYN flood attacks. There are now two separate queues of partially complete connections instead of one.
tcp_conn_req_max_q (default value 128) is the maximum number of completed connections waiting to return from an accept call as soon as the right process gets some CPU time.
tcp_conn_req_max_q0 (default value 1024) is the maximum number of connections with handshake incomplete. A SYN flood attack could only affect this queue, and a special algorithm makes sure that valid connections can still get through.
The new values are high enough to not need tuning in normal use as a Web server.
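If you do need to look at or raise these limits, ndd(1M) does it online. A minimal sketch, with illustrative values rather than recommendations:

# Read the current listen queue limits.
ndd /dev/tcp tcp_conn_req_max_q
ndd /dev/tcp tcp_conn_req_max_q0

# Raise them for an exceptionally busy server.
ndd -set /dev/tcp tcp_conn_req_max_q 256
ndd -set /dev/tcp tcp_conn_req_max_q0 2048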
ip_addrs_per_if is new in Solaris 2.5.1/ISS and 2.6. It allows larger numbers of virtual IP addresses to be hosted on one interface. The default is 256, as before; it has been tested up to 8192. Some work was also done to speed up ifconfig of large numbers of interfaces.
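If you need more than the default, the limit is raised with ndd before the virtual addresses are configured; a minimal sketch with an illustrative value:

# Allow up to 1024 virtual IP addresses per physical interface.
ndd -set /dev/ip ip_addrs_per_if 1024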
You configure a virtual IP address using ifconfig on the interface, with the number separated by a colon:

ifconfig hme0:283 ...

tcp_conn_hash_size has moved! There is a hash table structure that TCP uses to locate a TCP connection control block. By default the table contains 256 entries, but when running at sustained high connection rates, tens of thousands of control blocks can be present, and the hashed lookup degrades to a linear search that wastes CPU cycles. For SPECweb96 tests the table size was set to 262144. You shouldn't normally set it this high, as it is a waste of RAM. The current size is shown at the start of the read-only tcp_conn_hash display using ndd.
This variable was introduced as an ndd variable in 2.5.1/ISS, where it could be changed online. In Solaris 2.6 the push for multiprocessor scalability removed the lock that previously allowed it to be changed online, so the variable is now set in /etc/system as tcp:tcp_conn_hash_size. It is rounded up to a power of two and needs a reboot to take effect.
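A minimal sketch of the /etc/system entry; 8192 is an illustrative value for a busy server, not a recommendation:

* Size the TCP connection hash table (rounded up to a power of two,
* applied at the next reboot).
set tcp:tcp_conn_hash_size=8192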
netstat now non-invasive. Dumping out all the network protocol statistics with netstat used to grab a global lock that caused additional contention in TCP. With the new scalable TCP, some work was done so that you can run netstat without locking TCP, so it no longer slows Web servers down.
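For example, it is now cheap enough to run the per-protocol statistics dump regularly on a busy server. As a hedged sketch, assuming the tcpListenDrop counters that accompanied the SYN flood fix appear in the output, this checks whether either listen queue is overflowing:

# Non-zero counts here suggest the listen queues are overflowing.
netstat -s | grep -i listendrop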
Process memory usage
There is a new /proc/pid structure that allows you to use open/read/close rather than open/ioctl/close to read data from /proc. It can be seen using ls:
% ls /proc/5436
./         cred       lpsinfo    map        rmap       usage
../        ctl        lstatus    object/    root@      watch
as         cwd@       lusage     pagedata   sigact     xmap
auxv       fd/        lwp/       psinfo     status

The new xmap data provides extended mappings which show how much memory a process is really using and how much is resident -- shared and private for each segment. This is an excellent way to figure out memory sizing. If you want to run 100 copies of a process, you can look at one and figure out how much private memory you need to multiply by 100. This facility is based on work done by Richard McDougall, who joined our group earlier this year.
% /usr/proc/bin/pmap -x 5436
5436:   /bin/csh
Address   Kbytes Resident Shared Private Permissions       Mapped File
00010000     140      140    132       8 read/exec         csh
00042000      20       20      4      16 read/write/exec   csh
00047000     164       68      -      68 read/write/exec   [ heap ]
EF6C0000     588      524    488      36 read/exec         libc.so.1
EF762000      24       24      4      20 read/write/exec   libc.so.1
EF768000       8        4      -       4 read/write/exec   [ anon ]
EF790000       4        4      -       4 read/exec         libmapmalloc.so.1
EF7A0000       8        8      -       8 read/write/exec   libmapmalloc.so.1
EF7B0000       4        4      4       - read/exec/shared  libdl.so.1
EF7C0000       4        -      -       - read/write/exec   [ anon ]
EF7D0000     112      112    112      - read/exec          ld.so.1
EF7FB000       8        8      4       4 read/write/exec   ld.so.1
EFFF5000      44       24      -      24 read/write/exec   [ stack ]
--------  ------   ------ ------  ------
total Kb    1128      940    748     192
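As a rough worked example from this output (an illustration only): 100 copies of csh would need about 100 x 192 KB, or a little under 19 MB, of private memory, while the 748 KB of shared mappings is only needed once. That is far less than the 110 MB you would estimate by naively multiplying the 1128-KB total size by 100.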
Wrap up

The new metrics need to be supported by performance tools vendors. Unfortunately, some of the new metrics were not included in the beta releases of Solaris 2.6, so vendors will have to test on the final release before they know what the situation is. It is quite normal for performance tools to be amongst the least portable applications from one release to the next. Since Rich Pettit recently left Sun to work for a performance tool vendor (Capital Technologies -- http://www.captech.com), we have not been able to keep the SE toolkit tracking Solaris 2.6. It will take us a while after the final release of Solaris 2.6 before we have a version of SE that works and supports the new metrics. I'll update you on our progress next month.