The Grasshopper Outage: Co-Founders' Response
June 9, 2011
On Tuesday and Wednesday, Grasshopper experienced a major outage unlike anything we’ve seen since launching 8 years ago (for details, please read the full explanation below). In short, we had a major hardware failure, and the disaster recovery systems we’ve spent millions of dollars on did not work as designed.
Although outages are inevitable with any tech provider, an outage of this magnitude is simply unacceptable and we are very sorry it happened – we know how important your phone is to your business. We want you to know that we take this outage very seriously and are very disappointed not only that it happened in the first place, but that it took significantly longer than expected to resolve.
Now that all services have been restored, our team is focused on making sure this does not happen again.
Here’s what we are working on first:
- Increasing investment in our disaster recovery systems to prevent this type of failure from happening again
- Enhancements to our network operations procedures
- A notification system to immediately communicate any service-affecting issues so you can prepare your business and customers
- A fail-over feature so your customers can still reach someone in case of any downtime
To those of you who reached out to us upset, you had every right to and we understand your frustration. To those of you who were supportive during the outage, please accept our sincerest thanks. We’re amazed and honored to have such great and loyal customers.
We’re happy to answer any questions you have. You can reply to this email, call us, or reach out to us on Twitter. You can also reach our support team 24/7 at 800-279-1455 or at [email protected].
Siamak & David
About the Outage: Details
On Tuesday morning our primary production NetApp Storage Area Network (SAN) suffered a simultaneous two-disk failure in a storage array that serves voice greetings and other files vital to the proper operation of our systems. This is a very unusual event and had never happened to us before. However, understanding it was possible, we have always run RAID-DP so that no data would be lost. RAID-DP (double parity) stripes data across multiple drives with two parity disks, so that two drives in a group can fail without any data loss. We had the failed drives replaced very quickly, and the NetApp began the process of ensuring no data was lost; this caused an outage, because the array prioritized data protection over serving the files our systems requested.
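The parity idea behind RAID-DP can be sketched in miniature. This is a toy illustration only, not NetApp's actual implementation: it shows ordinary single XOR parity, where any one lost block in a stripe can be rebuilt from the surviving blocks. RAID-DP adds a second, diagonal parity block per stripe, which is what allows two disks to fail at once without data loss.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-sized byte blocks column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A stripe spread across three hypothetical data disks.
data = [b"voic", b"e gr", b"eeti"]
parity = xor_blocks(data)  # the parity disk's block for this stripe

# Simulate one failed disk: XOR of the survivors plus parity
# reconstructs the missing block exactly.
lost_index = 1
survivors = [d for i, d in enumerate(data) if i != lost_index]
recovered = xor_blocks(survivors + [parity])
assert recovered == data[lost_index]
```

Because XOR is its own inverse, the missing block falls out of the combination of everything that survived; double parity extends the same principle to tolerate a second simultaneous failure.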
After trying a number of local recovery methods, and after careful evaluation by our team and senior engineers at the SAN vendor, the decision was made to bring systems back online by moving storage traffic for the failed array to our disaster recovery site, which is a full duplicate of our primary data center. This was a deviation from our standard disaster recovery plan: we wanted to move only the critical failed component rather than fail over everything, since a full failover would take longer. After many hours of work, all systems were back online and stable, although not as fast as they should have been. We continued to test through the night.
Early Wednesday morning (EST) we started to get reports of problems from our NOC and support staff and quickly brought all teams back in. The cause was not at all clear; we spent a lot of time troubleshooting different issues and ultimately found what we believe to be a major core networking issue at the disaster recovery site. Rather than continue troubleshooting this unknown issue, the team decided to begin bringing the primary site back online, as the data recovery process there had finished. To do so, the NetApp had to run a procedure that gives no status on how long it will take or when it will finish; senior engineers could only offer rough estimates ranging from 2 to 15 hours.
As this internal NetApp process continued, the team started work on four different fronts to reduce this unknown time to recovery. The most promising option was to bring online the new storage array from Pillar Data Systems that was planned to replace the NetApp in Q3. On short notice, the team got Pillar's most senior engineers to help with this process and began preparing the system as quickly as possible. As we were finishing preparations for the final data copy, the NetApp array became available, and we quickly brought all systems back online on that array.
There is much more work to be done in the coming days, weeks and months, but our first action items are:
- replacing the necessary systems as quickly as possible
- fully researching and fixing the core networking issue at the disaster recovery site
- starting a full disaster recovery evaluation of all systems to determine what needs to be purchased and put in place to prevent this and other issues in the future