Performance analysis and fault detection

Use case description
A company with 50 employees encountered severe problems in its network, servers and clients. The symptoms were as follows: a very slow network, an almost unavailable Internet connection and unresponsive client machines. The company suspected that it had been infected by a malicious Trojan, and the damage to the company increased every day.
Our challenge was to identify the source of those problems, detect the faulty components and bring the system back to normal.

Our analysis
We used our data mining and anomaly detection tools to gather information from the different resources of each machine in the network: for example, per-process statistics of CPU usage, disk utilization, memory consumption and network activity. We built a rich view of the network and its machines and used the gathered data to detect the anomalies, characterize the problems and identify their source.
The evidence was clear: the CPU utilization was very high, the memory consumption increased continuously, the hard disk constantly performed many read/write operations and gigabytes of data were transferred over the network. We found that the source of this devastating behavior was the Thunderbird e-mail client. Many clients in the company had been automatically upgraded from version 2 to version 3 of Thunderbird, and our tools identified a dramatic change in the behavior of the upgraded e-mail client.
The following graphs compare the behavior of Thunderbird 2 and Thunderbird 3 during the first 48 hours after a fresh installation. In each graph, the blue line represents the behavior of Thunderbird 2 and the red line represents the behavior of Thunderbird 3.

CPU Utilization:



While the CPU utilization of Thunderbird 2 is usually between 0% and 10% with an average of 0.3%, the CPU utilization of Thunderbird 3 is between 5% and 80% with an average of 30%, which is 100 times that of Thunderbird 2. In addition, for long periods of time, Thunderbird 3 used more than 50% of the overall CPU resources. This behavior dramatically slows down the whole machine.

Memory Consumption:



Thunderbird 3 consumes twice as much memory as Thunderbird 2.

Read Operations from Hard Disk:



While Thunderbird 2 performs few read operations from the hard disk, Thunderbird 3 performs thousands of read operations over long periods of time.





Thunderbird 3 reads gigabytes of data from the hard disk over long periods of time, while Thunderbird 2 reads almost none. This behavior affects the response time of the hard disk and the behavior of other applications that read from it.

Write Operations to Hard Disk:



While Thunderbird 2 performs few write operations to the hard disk, Thunderbird 3 performs thousands of write operations over long periods of time.





Thunderbird 3 writes gigabytes of data to the hard disk over long periods of time, while Thunderbird 2 writes almost none. As with the read behavior, this affects the response time of the hard disk and the behavior of other applications that write to it. In addition, it takes up gigabytes of free space on the hard disk.

Network Activity:



While Thunderbird 2 performs few download operations from the Internet, Thunderbird 3 performs thousands of download operations over long periods of time.





Thunderbird 3 downloads gigabytes of data from the Internet over long periods of time, while Thunderbird 2 downloads almost none. This behavior affects the response time of the Internet connection: it hogs the connection and dramatically slows down all Internet activity. If your Internet connection is billed according to the bandwidth you use, your bill will increase significantly.

We could see that Thunderbird 3 behaves completely differently from Thunderbird 2. While Thunderbird 2 behaves like a normal, balanced application, Thunderbird 3 behaves anomalously: its deviations from normal behavior are clear. Thunderbird 3 took over the CPU, the memory, the hard disk and the network.

Conclusions and our solution
We developed a system that protects the machines by keeping overall performance optimal. It makes sure that all processes behave normally and do not cause problems that affect the resources of the machine, such as hogging the network bandwidth or using excessive disk, memory or CPU. Once installed on the client machine, the system starts a short training phase (about 20 minutes). During the training phase it collects and analyzes several statistics from each active process and builds a normal profile for each process. At the end of the training phase, the system switches automatically to the testing phase. During this phase, it monitors and analyzes the statistics of each process in real time and looks for deviations from the previously built normal profile. These deviations are constantly scored according to their abnormality levels. The user can see the score of each process in real time, and the system automatically displays alerts about the most problematic processes.
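The two phases can be illustrated with a minimal sketch. It is not our production system: it assumes, purely for illustration, that the normal profile is a mean and standard deviation per metric and that the abnormality score is the largest z-score of a new reading. The process name, the metric names, the sample values and the alert threshold are all hypothetical.

```python
from statistics import mean, pstdev

def build_profile(training_samples):
    """Training phase: store (mean, std) per metric for every process.

    training_samples maps process name -> metric name -> list of readings.
    """
    profile = {}
    for process, metrics in training_samples.items():
        profile[process] = {
            metric: (mean(values), pstdev(values))
            for metric, values in metrics.items()
        }
    return profile

def score(profile, process, reading, min_std=1e-6):
    """Testing phase: largest absolute z-score across the metrics of one reading."""
    worst = 0.0
    for metric, value in reading.items():
        mu, sigma = profile[process][metric]
        # min_std guards against division by zero for perfectly flat metrics
        z = abs(value - mu) / max(sigma, min_std)
        worst = max(worst, z)
    return worst

# Hypothetical training data resembling a lightly loaded mail client
training = {
    "mail-client": {
        "cpu_percent": [0.2, 0.3, 0.4, 0.3, 0.3],
        "memory_mb": [60, 62, 61, 63, 64],
    }
}
profile = build_profile(training)

ALERT_THRESHOLD = 3.0  # assumed cut-off, chosen only for this example
normal_score = score(profile, "mail-client", {"cpu_percent": 0.35, "memory_mb": 62})
anomalous_score = score(profile, "mail-client", {"cpu_percent": 30.0, "memory_mb": 150})

for label, value in (("normal", normal_score), ("anomalous", anomalous_score)):
    status = "ALERT" if value > ALERT_THRESHOLD else "ok"
    print(f"{label}: score={value:.1f} {status}")
```

A reading close to the trained profile scores low, while a reading like the 30% CPU average seen in this case scores far above the threshold and would raise an alert.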




We found out that the problems were due to a combination of two features:
The first feature is the Global Search and Indexer. This feature is new in Thunderbird 3. It enables fast searching of the e-mails in the mailbox. However, Thunderbird first has to index all the e-mails, and this process consumes a lot of time and resources. It took Thunderbird 3 days to index a typical mailbox. During this time, its CPU utilization was between 5% and 80%, its memory consumption was between 100MB and 150MB and it performed thousands of read operations from the hard disk.
In addition, even after the indexing was completed, we noticed that Thunderbird 3 continued to index and re-index from time to time, consuming still more of the machine's resources. Moreover, according to Thunderbird's official site, "if you enable Global Search/Indexing it normally uses about 3.5 KB per message in the SQLite database". So a typical database of 10K e-mails should theoretically consume about 35MB. In reality, however, it consumed 150MB and kept growing. The worrisome thing about this new feature is that it is turned on by default, both when you migrate from Thunderbird 2 to 3 and when you install Thunderbird 3 from scratch. This feature has a huge impact on the behavior of the client: for at least a couple of days, the machine is hogged from all directions and the user is helpless. We found that users around the world had spent hours and days looking for the source of this problem.
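The gap between the documented and the observed index size can be checked with a short calculation. The figures are the ones quoted above; the only assumption is the use of decimal units (1 MB = 1000 KB).

```python
KB_PER_MESSAGE = 3.5   # figure quoted from Thunderbird's site
MESSAGES = 10_000      # the typical 10K e-mail mailbox
OBSERVED_MB = 150      # index size observed in this case

# Assumption: decimal units, 1 MB = 1000 KB
expected_mb = KB_PER_MESSAGE * MESSAGES / 1000
ratio = OBSERVED_MB / expected_mb

print(f"expected ~{expected_mb:.0f} MB, observed {OBSERVED_MB} MB, {ratio:.1f}x larger")
```

Under that assumption the documented figure works out to roughly 35 MB, so the observed index was more than four times the expected size.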

The second feature is the Message Synchronization of IMAP accounts. Thunderbird synchronizes the IMAP folders and saves the messages locally on the hard disk of the machine. This means that all your IMAP accounts reside on the machine's hard disk. This feature already existed in Thunderbird 2, but it was off by default. In Thunderbird 3, the default was changed to on. In addition, if Thunderbird 2 was migrated to Thunderbird 3, this feature was turned on automatically, even if it wasn't turned on in your Thunderbird 2 settings! This feature has a huge impact on the computer's resources, since it uses your Internet connection to download all your e-mail messages, gigabytes of data, and uses your hard disk to store them. It took Thunderbird 3 days to synchronize our mailbox, and during this time it hogged our Internet connection by downloading our entire mailbox while performing thousands of write operations to the hard disk. Since Gmail maps messages to labels, multiple copies of the same message are held when synchronization is enabled, which increases the occupied disk space. As a consequence, a typical 5GB online mailbox grew to 40GB of data that was downloaded from the Internet and stored on our hard disk without any warning from Thunderbird!
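The effect of labels on disk usage can be sketched as follows. The mailbox below is entirely hypothetical, and the sketch assumes that every label is synchronized as its own folder holding a full copy of each message that carries it.

```python
# Hypothetical mailbox: (message size in MB, Gmail labels on the message)
messages = [
    (2.0, ["Inbox", "Projects"]),
    (5.0, ["Inbox", "Projects", "Clients", "Archive"]),
    (1.0, ["Inbox"]),
]

# Online, each message is stored once regardless of its labels
online_mb = sum(size for size, _ in messages)

# Assumption: locally, one full copy is kept per label (folder)
local_mb = sum(size * len(labels) for size, labels in messages)

print(f"online: {online_mb:.0f} MB, synchronized: {local_mb:.0f} MB")
```

The more labels a large message carries, the faster the local copy outgrows the online mailbox, which is consistent with the growth from 5GB to 40GB that we observed.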

In this case, once we had detected the abnormal behavior and isolated the application responsible for the drastic drop in network performance, the solution was very simple. After we disabled the two features (global indexing and message synchronization), all the problems disappeared immediately and the network, with its clients and servers, returned to normal activity.

© Brainstorm Private Consulting. All Rights Reserved