It’s been a little over two months since the outage, and we’ve been working diligently to make sure something like that doesn’t happen again. We wanted to give you a quick update on the progress we’ve made and the things we are working on. This list is by no means all-inclusive; many other items are on the roadmap but are still in the exploration stage.
What have we been working on?
1. Customer Communication System
During the outage, we lost access to our customer database, which prevented us from alerting our customers to the outage. We’ve since put a system in place to ensure that we’re never without access to our customer database. In the future, should an outage occur, we’ll be able to proactively communicate this information to you.
On top of the ability to alert you to planned and unplanned downtime, we’ve also implemented a process to ensure our communications are more specific and more timely. We’ll continue to expand on and improve these communication systems in the future.
2. Infrastructure Audit and Improvements
We conducted a detailed network audit and were able to identify and resolve the network- and storage-related issues that prevented us from recovering faster. Over the next few months, we’ll be making even more storage infrastructure and network improvements to ensure we’re in a better position to handle any problems that could arise.
As you know, we ran into issues with our disaster recovery site that contributed to our inability to recover faster from the outage. As a result, we’re performing periodic controlled failovers to ensure our DR site is fully operational in the case of another outage or emergency.
3. Updating our Monitoring System
We’re currently replacing our network monitoring system entirely with a top-of-the-line system. This new monitoring system will give us greater visibility into the health of our network and the ability to identify and address issues more quickly.
4. Application Hardening
There are several initiatives in the pipeline right now that will make our applications more fault-tolerant. So, what does this mean? In addition to the network, storage and hardware improvements we’re working on, we’ll also be making improvements to our applications to ensure they keep operating even in the event of a failure.
We’re currently making our applications more aware of other backup points. For example, we are coding our applications so that when a call hits one of our telephony servers and it can’t find a customer’s voicemail greetings, it will be smart enough to turn to other backup points instead of disconnecting the caller.
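As a rough illustration of the fallback idea described above, here is a minimal Python sketch. It is not the actual Grasshopper code: the function names, the customer IDs, and the idea of playing a generic default greeting when every backup point fails are all assumptions made for the example.

```python
from typing import Callable, Iterable, Optional

# Hypothetical: each "source" is a function that returns the greeting audio
# for a customer, returns None if it has none, or raises if it is unreachable.
GreetingSource = Callable[[str], Optional[bytes]]

# Assumed last resort: a generic greeting, so the caller is never disconnected.
DEFAULT_GREETING = b"default-greeting"


def find_greeting(customer_id: str, sources: Iterable[GreetingSource]) -> bytes:
    """Try each source in order and return the first greeting found."""
    for source in sources:
        try:
            greeting = source(customer_id)
        except Exception:
            # This source is unreachable or failing: try the next backup point.
            continue
        if greeting is not None:
            return greeting
    # Nothing was found anywhere: fall back to the generic greeting.
    return DEFAULT_GREETING


def primary_store(customer_id: str) -> Optional[bytes]:
    # Simulated outage of the primary storage.
    raise ConnectionError("primary storage unavailable")


def backup_store(customer_id: str) -> Optional[bytes]:
    # Simulated backup copy that only knows about one customer.
    return {"cust-1": b"hello from cust-1"}.get(customer_id)
```

Calling `find_greeting("cust-1", [primary_store, backup_store])` skips the failing primary store and returns the copy held by the backup. The point of the design is that a failure in one source is treated as a reason to keep looking, not as a reason to end the call.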
The goal of these initiatives is for our applications to handle errors more gracefully, to improve the caller’s experience during an outage and, ideally, to prevent errors from even being noticed.
All of these improvements are designed to make Grasshopper even more stable and prevent another outage. As promised, we’ll continue to update you on the progress and improvements we make to our systems. Stay tuned for our next update!