Service Outage for Hosted Microsoft Exchange Services Due to Thunderstorm at Data Center - July 6, 2008 (One Hour and 28 Minutes)

This morning, we experienced a service outage starting at 8:42 AM EDT. Service to most customers was restored by 9:55 AM EDT, and all customers had full service by 10:10 AM EDT.

The outage began with a loss of primary power at our Saratoga Springs data center, the result of a severe thunderstorm.

Some of our UPS ("(Usually!) Uninterruptible Power Supply") units were then unable to keep providing power until the data center's backup power became available. The affected servers shut down, which caused the service outage.

You may be thinking:

Why didn't your redundant data center in San Francisco kick in? Why didn't your 4,000-kilowatt diesel generators kick in? The answer is that we don't have them.

What we do have is our "Guaranteed Availability with 24/7 monitoring of server systems and a 99.9% Service Level Agreement", which we have met year after year. We continually strive to offer you the very best prices with the highest level of support and reliability. We are constantly upgrading our infrastructure, striving for 100% uptime for all of our systems.
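
For a rough sense of scale, here is a minimal sketch in Python of the arithmetic behind a 99.9% availability figure. It assumes availability is measured over a 365-day year, which this notice does not state. The function name and the period_days parameter are illustrative only and are not part of our agreement. Under that assumption, 99.9% allows roughly 525.6 minutes of downtime per year.

```python
from datetime import datetime


def downtime_budget_minutes(sla_percent, period_days=365):
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


# Times taken from this notice (EDT, July 6, 2008).
outage_start = datetime(2008, 7, 6, 8, 42)
full_restore = datetime(2008, 7, 6, 10, 10)
outage_minutes = (full_restore - outage_start).total_seconds() / 60

# Assumption: the 99.9% target is measured over a 365-day year.
budget = downtime_budget_minutes(99.9)

print(f"Outage length: {outage_minutes:.0f} minutes")
print(f"Yearly budget at 99.9%: {budget:.1f} minutes")
print(f"Share of budget used: {outage_minutes / budget:.0%}")
```

Changing period_days shows how strongly the result depends on the measurement window, so the numbers above should be read only as an illustration.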

We took action in the following two areas immediately after this morning's power failure:

  1. Greatly reduce the likelihood of server shutdowns caused by a power outage

     a. Implementing additional monitoring of our UPS battery packs, with 'deep-discharge' activities to be scheduled as necessary

     b. Adding UPS units to our racks to increase the amount of time we can run on battery power before server shutdowns occur

  2. Greatly speed the restoration of service following a server shutdown

     a. Staggering startups of clustered servers to prevent issues with Windows clustering services failing to start automatically

     b. Increasing physical location diversity for key servers to reduce the probability of all servers for a particular function being affected by an outage
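
The idea of staggering startups (item 2a above) can be sketched in a few lines of Python. This is only an illustration of the general technique, not a description of our actual tooling: the staggered_start function, the start_node callback, the node names, and the delay value are all hypothetical.

```python
import time


def staggered_start(nodes, start_node, delay_seconds=60.0, sleep=time.sleep):
    """Start nodes one at a time, pausing between consecutive starts.

    start_node is a caller-supplied function (hypothetical here) that
    powers on or boots a single node. The pause gives each node time to
    come up before the next one begins to start.
    """
    started = []
    for index, node in enumerate(nodes):
        if index > 0:
            sleep(delay_seconds)  # wait before every node except the first
        start_node(node)
        started.append(node)
    return started


if __name__ == "__main__":
    # Hypothetical node names; a short delay keeps the demo quick.
    staggered_start(
        ["cluster-node-a", "cluster-node-b"],
        start_node=lambda name: print(f"starting {name}"),
        delay_seconds=0.1,
    )
```

A fixed delay is the simplest choice. A more careful version might wait until each node reports that its cluster service is running before starting the next, but that would depend on monitoring details not covered in this notice.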

We have already implemented key actions in the two areas above, and we will continue implementing them over the next several days.

Thank you,


Topaz Group Ventures