UTORmail Problem Update 1

See Update One | Update Two | Update Three | Update Four | Final Update

We continue to experience problems with UTORmail. At indeterminate times during weekdays we are experiencing periods when accessing the INBOX and other folders is slow. There are also complete outages of a few minutes while we reboot servers in order to try to clear the problem. We have a team of CNS staff working urgently to resolve the problem. We have had a severity one critical incident open with IBM for over a week, and IBM personnel in Toronto and Raleigh, North Carolina, are involved. Red Hat, the vendor of our operating system, has also been attempting to diagnose the problem. We apologize for the inconvenience, and appreciate your patience.

For those interested, the following is a brief description of the problem and a sample of the ongoing activities to resolve it. (I'm assuming an audience with general computer knowledge, but not technical knowledge of the hardware and software involved.)

The University recently purchased a brand new IBM disk array for storing all INBOXes and other folders at the UTORmail post office. (We put out the required performance specifications, and IBM was one of several vendors to respond.) For the first few weeks we had no problems.

The problem is that at indeterminate times during weekdays, access to INBOXes and other folders becomes slow. (At the server itself everything is fast except trying to get a large directory listing or reading a large file from the disk array.) As more and more requests arrive from client email software (some at the regular constant rate of new requests, others additional requests generated by customers who repeatedly click "Get Message" in Netscape or "Send and Receive" in Outlook Express or Outlook because they experience slow access), while old requests are not being completed quickly, the overall load rapidly increases. Beyond a certain point, UofT CNS staff reboot the server to clear the problem. That usually, but not always, buys a few hours with no problems. There are four message servers used to access INBOXes and other folders, with each customer assigned to one of them, so not everyone experiences the problem at the same time.
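For readers who like to see that feedback loop in numbers, here is a toy model (the rates below are made up for illustration; they are not measured UTORmail figures) showing how a server that keeps up under normal conditions falls steadily behind once requests complete slowly and impatient re-clicks add extra load:

# Toy model only: illustrative, made-up rates, not measured UTORmail numbers.

def backlog_after(minutes, arrivals_per_min, completions_per_min, retry_fraction):
    """Return the number of outstanding requests at the end of each minute.

    retry_fraction is the extra requests generated per outstanding request per
    minute by customers re-clicking "Get Message" or "Send and Receive".
    """
    outstanding = 0.0
    history = []
    for _ in range(minutes):
        outstanding += arrivals_per_min + retry_fraction * outstanding
        outstanding = max(0.0, outstanding - completions_per_min)
        history.append(round(outstanding))
    return history

# Normal conditions: the server keeps up and the backlog stays at zero.
print(backlog_after(10, arrivals_per_min=100, completions_per_min=120, retry_fraction=0.0))

# Slow disk access plus re-clicks: the backlog grows every minute until a reboot.
print(backlog_after(10, arrivals_per_min=100, completions_per_min=80, retry_fraction=0.2))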

IBM has looked at metrics gathered inside the disk array, and it is operating well within its performance specifications. The metrics from within the operating system also indicate we are within the disk array's spec. In fact, we went from 8 slower 10,000 rotation per minute (rpm) disks (no mirror, no striping) per message server before the disk array, to 20 faster 15,000 rpm disks per message server (in a 10x10 striped mirror) on the disk array--with only small customer growth since the disk array was installed, performance should not be a problem.
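As a rough back-of-envelope check of that claim (the average seek times below are generic figures for drives of that era, not vendor specifications, and I am assuming the 10x10 striped mirror behaves like RAID 1+0, so reads can use all twenty spindles while each write lands on two):

# Rough arithmetic only: assumed seek times, not vendor or measured figures.

def disk_iops(rpm, avg_seek_ms):
    """Approximate random I/Os per second for one disk:
    1 second divided by (average seek time + half a rotation)."""
    half_rotation_ms = (60000.0 / rpm) / 2.0
    return 1000.0 / (avg_seek_ms + half_rotation_ms)

old_layout = 8 * disk_iops(rpm=10000, avg_seek_ms=5.0)    # 8 disks, no mirror, no striping
new_reads = 20 * disk_iops(rpm=15000, avg_seek_ms=3.8)    # reads spread over all 20 spindles
new_writes = 10 * disk_iops(rpm=15000, avg_seek_ms=3.8)   # each write goes to both halves of a mirror

print("old layout:        ~%.0f random reads or writes per second" % old_layout)
print("disk array reads:  ~%.0f per second" % new_reads)
print("disk array writes: ~%.0f per second" % new_writes)

Even with these conservative assumptions, the new layout has well over twice the random I/O capacity of the old one, which is why the raw hardware is not the obvious suspect.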

This suggests the problem lies elsewhere: the Linux software which handles reads and writes, lvm (Logical Volume Manager, a layer which permits remapping disk space on various real volumes onto pretend volumes), the device driver, the fibre channel card (the hardware used to connect the disk array to the message server), etc.

If we could reproduce the problem on an out-of-production server, then we could try to eliminate each of these layers as the culprit. Chris Siebenmann, of UofT's CNS Unix Systems Group, wrote custom programs to generate synthetic loads on the disk array matching and exceeding the performance metrics seen when the problem occurs, without reproducing the problem. Derek Young, of CNS Network Services, and Chris Siebenmann also traced the I/O activity of 1,000 (imap and pop) processes, and then Chris wrote a custom program to model the characteristic reads and writes observed against the disk array, but this did not reproduce the problem either. One thing hampering this work is that we were initially misled by metrics reported by Linux (e.g. by "iostat") which turned out to be completely bogus (for example, the reported time for a disk to perform a read or write, called "svctm", appeared low but proved meaningless)--Chris has combed the Linux sources to determine which numbers have something to do with reality. Chris also continues to try to reproduce the problem.
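To give non-specialists a flavour of what a synthetic load generator does, here is a minimal sketch (this is not Chris's actual program; the file path, read size, and duration are placeholders): it issues random reads against a large file on the disk array and reports the achieved rate and latencies.

# Minimal sketch of a synthetic load generator (placeholders, not the real program).
# A serious test would also bypass the operating system's cache so every read
# really hits the disk array.

import os
import random
import time

PATH = "/path/to/large/test/file"   # placeholder: a large file on the disk array
READ_SIZE = 8192                    # bytes per read
DURATION = 60                       # seconds to run

def random_read_load(path, read_size, duration):
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    latencies = []
    deadline = time.time() + duration
    try:
        while time.time() < deadline:
            offset = random.randrange(0, max(1, size - read_size))
            start = time.time()
            os.pread(fd, read_size, offset)          # read read_size bytes at a random offset
            latencies.append(time.time() - start)
    finally:
        os.close(fd)
    latencies.sort()
    print("%.0f reads/second, median %.1f ms, worst %.1f ms" % (
        len(latencies) / duration,
        latencies[len(latencies) // 2] * 1000,
        latencies[-1] * 1000))

if __name__ == "__main__":
    random_read_load(PATH, READ_SIZE, DURATION)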

We attempted to ask the system where the bottleneck was by doing what is called a "magic SysRq key process dump" during the problem. However, the system gave little information and then hung. Vladimir Patoka, of the CNS Unix Systems Group, worked on this problem, and late this week we were able to get a longer, though still partial, dump, which is currently being analyzed.
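(For the technically curious: on Linux the same dump can also be requested from software. Below is a hedged sketch, assuming root access and a kernel with the sysrq facility enabled; writing "t" to /proc/sysrq-trigger asks the kernel to list every task and what it is waiting on, which then appears in the kernel log.)

# Hedged sketch: requires root and a kernel with the sysrq facility enabled.
# Note the kernel log buffer is finite, so a very long dump may be truncated,
# which is one reason these dumps can come back incomplete.

import subprocess

def dump_task_states(outfile="sysrq-tasks.txt"):
    with open("/proc/sysrq-trigger", "w") as trigger:
        trigger.write("t")          # "t" = dump the state of every task to the kernel log
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    with open(outfile, "w") as out:
        out.write(log)

if __name__ == "__main__":
    dump_task_states()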

Darren Jacobs, of CNS Network Services, collected other information (using "sysreport") for Red Hat. (Red Hat is the vendor for our operating system.) Red Hat analyzed this information but found nothing wrong in the components they support.

To rule out the IBM-supplied device driver and make Red Hat's job easier, IBM gave us special permission to use a Red Hat-supplied device driver. Michael Simms, of CNS Network Services, and Derek are currently trying to make it work on an out-of-production server. If all goes well (it isn't right now), it should go on one of the message servers on Monday. We are also going to upgrade the software inside the fibre channel card on the same server (what technical people call "flashing the ROM").

In parallel, Peter Ip, of CNS Network Services, has been analyzing and graphing various metrics to see if the culprit can be teased out of the numbers. We are also verifying the effect of some of the changes we make to try to improve the situation.
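As an illustration of the kind of collection involved (this is not Peter's actual tooling; the device name is a placeholder), the sketch below runs "iostat -x", keeps the second report for one device (the report covering the measured interval, rather than the average since boot), and prints the numbers with a timestamp so they can be graphed later. Given the svctm experience described above, the raw counters are the columns worth trusting.

# Sketch only: the real analysis tooling is not reproduced here.

import subprocess
import sys
import time

def sample_iostat(device, samples=3, interval=60):
    """Run "iostat -x <interval> 2" and keep the second report (the fresh interval,
    not the since-boot average) for the named device."""
    for _ in range(samples):
        out = subprocess.run(["iostat", "-x", str(interval), "2"],
                             capture_output=True, text=True).stdout
        lines = [line.split() for line in out.splitlines() if line.strip()]
        header = next(l for l in lines if l[0].rstrip(":") == "Device")
        row = [l for l in lines if l[0] == device][-1]   # last occurrence = fresh interval
        print(time.strftime("%H:%M:%S"), dict(zip(header, row)))

if __name__ == "__main__":
    sample_iostat(sys.argv[1] if len(sys.argv) > 1 else "sda")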

Michael Simms added memory, borrowed from OSS, to two of the message servers (from 4GB to 5GB)--we think it has improved matters, but we can't prove it using the metrics. We also purchased some memory, which Michael added during test time on Friday evening (taking a third message server from 4GB to 7GB)--we will find out on Monday what effect it has.

The Linux Logical Volume Manager (lvm), installed on IBM's advice, maps disks presented by the disk array into pretend disks (e.g. when we add disks to the disk array, lvm can grow the pretend disks, thereby minimizing service disruption). To try to eliminate lvm as the culprit, Darren Jacobs came in Saturday to copy one message server's worth of INBOXes and folders (approximately 400 gigabytes) off the disk array, remove lvm, and then copy the folders back to the disk array. It took from Saturday at 10am until Sunday morning at 4am, both because of the long time needed to copy this much data and because of problems that hampered progress. Unfortunately, customers assigned to this message server had no access to their mail during this period--we apologize for the inconvenience. If the problem recurs on this message server on Monday, then we will conclude lvm is not the culprit. If the problem does not occur, then either lvm is the culprit or the re-arranging of files has lowered input/output demands sufficiently to mask the problem for a few months.
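(For reference, the flexibility lvm provides amounts to something like the following when new disks are presented by the array. This is a hedged sketch; the device, volume group, and logical volume names are hypothetical, not the actual UTORmail layout.)

# Hedged sketch of growing an lvm "pretend disk"; all names below are hypothetical.

import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def grow_mail_volume(new_disk="/dev/sdc", vg="mailvg", lv="maillv", extra="100G"):
    run(["pvcreate", new_disk])                   # label the new disk for lvm use
    run(["vgextend", vg, new_disk])               # add it to the volume group
    run(["lvextend", "-L", "+" + extra, "/dev/%s/%s" % (vg, lv)])  # grow the pretend disk
    # The filesystem on top still has to be grown with its own tool afterwards,
    # which is why this reduces, but does not eliminate, service disruption.

if __name__ == "__main__":
    grow_mail_volume()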

We are also taking measures to reduce the load on the servers. Cheryl Ziegler, of CNS Network Services, and Darren Jacobs are analyzing audit trails to detect customers who put a high load on the post office. We have already contacted some of them and will continue to do so.
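The analysis itself is conceptually simple, as in the sketch below; the log format shown (one line per login containing "user=<name>") is hypothetical, since the real audit trail format is not reproduced here.

# Sketch only: hypothetical log format, not the actual UTORmail audit trail.

import re
import sys
from collections import Counter

def top_users(logfile, how_many=20):
    """Count log lines per user and print the heaviest users first."""
    user_pattern = re.compile(r"user=(\S+)")
    counts = Counter()
    with open(logfile) as f:
        for line in f:
            match = user_pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    for user, hits in counts.most_common(how_many):
        print("%8d  %s" % (hits, user))

if __name__ == "__main__":
    top_users(sys.argv[1])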

Unfortunately, we suspect the highest impact is caused by the "Check All Folders" option in Outlook Express and Outlook. This forces checking all folders for newly delivered messages--even though UTORmail never delivers new messages anywhere but the INBOX. (Actually, UTORmail also delivers new messages to the junk-mail folder, but my guess is most people are not interested in being notified every time new spam arrives.)

In one case I spoke to someone who was repeatedly hitting "Send and Receive All" many times a day because he suspected a new message had arrived but had not yet appeared in the user interface. Because "Check All Folders" was on, this was forcing each of his many large folders to be read from the disk array over and over. This also breaks his connection to the INBOX, forcing a time-consuming reconnect. The irony is that Outlook Express and Outlook are informed of new messages in the INBOX by the post office very quickly, even when Outlook Express and Outlook do not check for new messages at all. Hence "Check All Folders" has zero benefit to the customer, yet a high negative impact on the UTORmail post office. Unfortunately, our audit trail does not show which connections are using "Check All Folders". Tracing processes will uncover this option, but slows the system down. Hence, we have only traced a small number of processes for this option.
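To make the cost concrete, the sketch below shows, in simplified form using Python's standard imaplib, the difference between what "Check All Folders" asks the post office to do and what is actually needed to learn about new mail. The host, account, and folder names are placeholders.

# Illustration only: placeholder host, account, and folder names.

import imaplib

HOST = "imap.example.utoronto.ca"      # placeholder
USER, PASSWORD = "someuser", "secret"  # placeholders
FOLDERS = ["INBOX", "sent-mail", "saved-messages", "junk-mail"]  # hypothetical folder list

with imaplib.IMAP4_SSL(HOST) as conn:
    conn.login(USER, PASSWORD)

    # What "Check All Folders" amounts to: one status request per folder, each of
    # which can force the post office to open that folder on the disk array.
    for name in FOLDERS:
        print(conn.status(name, "(MESSAGES UNSEEN)"))

    # What new-mail notification actually needs: select the INBOX once; the post
    # office reports new INBOX messages over this already-open connection.
    typ, data = conn.select("INBOX")
    print("INBOX messages:", data[0].decode())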

Also to try to reduce the load on UTORmail, Peter Ip is currently installing software which will attempt to minimize the load that webmail puts on the UTORmail post office. This "imap proxy" software combines multiple requests from a single customer to lessen the load. With the assistance of the Information Commons Help Desk we are currently testing the software. If all goes well, we intend to deploy it in a few days.
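The idea behind the proxy can be sketched as follows (a conceptual sketch only, not the software being installed): webmail normally makes a fresh connection and login to the post office for every page view, while a proxy can keep one authenticated connection per customer and hand it back out on each request.

# Conceptual sketch of connection reuse, not the actual imap proxy being deployed.

import imaplib

class ConnectionCache:
    """Keep one logged-in IMAP connection per user and reuse it across requests."""

    def __init__(self, host):
        self.host = host
        self._conns = {}                  # user name -> live connection

    def get(self, user, password):
        conn = self._conns.get(user)
        if conn is not None:
            try:
                conn.noop()               # still alive? reuse it
                return conn
            except (imaplib.IMAP4.error, OSError):
                del self._conns[user]     # dead connection, fall through and reconnect
        conn = imaplib.IMAP4_SSL(self.host)   # only reached on a cache miss
        conn.login(user, password)
        conn.select("INBOX")
        self._conns[user] = conn
        return conn

# Every webmail page view calls cache.get(...); only the first call per customer
# actually opens a new connection to the post office.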

The above are only some of the ongoing activities to resolve the problem, and to mitigate the symptoms until it is resolved. CNS has seven technical staff, seconded from other work, working as a team to solve the problem. IBM and Red Hat staff are also working on it.

Once again, apologies for the inconvenience. We hope to have good news as soon as possible.

Alex Nishri as of Sunday morning, January 23, 2005

