Availability and Uptime Management is one of five components in the ITIL Service Delivery area. It is responsible for ensuring application systems are up and available for use according to the service criticality and the defined Service Level Agreements (SLAs).
Your uptime management team should analyze the availability requirements of your online business and make sure that optimized, cost-effective contingency plans are put in place and tested regularly, so that the online service is robust and meets the business needs. For example, Internet eCommerce systems may require a near-zero Recovery Time Objective (RTO), whereas less critical, non-customer-facing applications that can tolerate even a few days of recovery can be provisioned on less expensive cloud infrastructure with limited redundancy.
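To make an availability target concrete, it helps to translate the SLA percentage into a downtime budget. The following is a minimal sketch in Python; the function name and the 30-day period are my own assumptions for illustration, not part of ITIL or of any particular SLA.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, for an availability target over a period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


# Compare a few common targets over an assumed 30-day period.
for target in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows {budget:.1f} minutes of downtime")
```

A budget of a few minutes per month leaves no room for manual recovery, which is why the most critical systems need automated failover rather than a runbook alone.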
Following years of experience supporting online services at every level of criticality, I have compiled the following five crucial guidelines that can assist you in presenting a suitable uptime management plan for your cloud service:
Step 1 – Build with Disaster Recovery in Mind
You must define your system criticality and come up with a Disaster Recovery (DR) and High Availability (HA) plan from the start. Identify your points of failure (POFs) and plan the right cloud infrastructure architecture to support the business SLA, bearing in mind fast (seamless) recovery and backup of your service.
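One way to reason about points of failure is to estimate how component availabilities combine. The sketch below assumes that components fail independently, which is a simplification (real failures are often correlated), and the function names and the 99% figures are hypothetical.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)


def redundant_availability(replicas):
    """Availability of a group that is up while at least one replica is up."""
    return 1.0 - prod(1.0 - a for a in replicas)


# A single web server in front of a single database, each assumed 99% available.
single = serial_availability([0.99, 0.99])

# The same chain with two replicas per layer.
web = redundant_availability([0.99, 0.99])
db = redundant_availability([0.99, 0.99])
redundant = serial_availability([web, db])

print(f"single instances: {single:.4f}")
print(f"two replicas per layer: {redundant:.4f}")
```

Every layer with a single instance pulls the whole chain down, so the numbers give you a rough way to decide where redundancy pays for itself.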
Step 2 – Workload Forecasting & Right-Sizing of Underlying Infrastructure
Size your front-end and back-end servers according to past or expected workloads, measured in requests per second (RPS). The sizing of the cloud infrastructure is based on defined KPIs that are measured continuously. Make sure to integrate a tool that forecasts your cloud workloads from business metrics (such as the number of users over a set period of time) or from the internal traffic load (such as the DB server throughput). Keep in mind that throughput limits differ between database engines: for example, MongoDB may sustain more simple queries per second than relational databases such as Microsoft SQL Server and MySQL, but the actual figures depend heavily on the schema, the query mix, and the hardware, so benchmark your own workload before you size the database tier.
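As an illustration of forecasting and right-sizing, here is a minimal sketch that fits a straight line to past peak RPS and converts the forecast into a server count. The linear trend, the 30% headroom, and the per-server capacity are assumptions chosen for the example; a real forecasting tool would also account for seasonality and bursts.

```python
import math


def linear_forecast(history, steps_ahead):
    """Extrapolate a least-squares straight line fitted to equally spaced samples."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


def servers_needed(peak_rps, rps_per_server, headroom=0.3):
    """Number of servers so that each one keeps `headroom` of its capacity free."""
    usable_rps = rps_per_server * (1.0 - headroom)
    return math.ceil(peak_rps / usable_rps)


weekly_peak_rps = [100, 120, 140, 160]  # hypothetical measurements
forecast = linear_forecast(weekly_peak_rps, steps_ahead=4)
print(f"forecast peak in 4 weeks: {forecast:.0f} RPS")
print(f"servers needed: {servers_needed(forecast, rps_per_server=50)}")
```

The per-server capacity should come from your own load tests, since it is the number that the whole calculation depends on.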
Step 3 – Uptime Management Protocol
Based on the KPIs defined in Step 2, you need to create an uptime management protocol so that you have a relevant call to action for any type of anomaly, enabling support for real-time crises. This protocol is dynamic and must be enhanced by lessons learned from real events. The protocol is not only for the moment a user calls to tell you the system is down; it should also have a proactive aspect that helps you avoid expected issues and adjust the severity of new events as they occur.
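A protocol of this kind can be kept as data, so that it is easy to extend after each incident. The sketch below is one possible shape, with a warning threshold for the proactive part and a critical threshold for real-time crises; the KPI names, thresholds, and actions are hypothetical examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    kpi: str
    warn_at: float
    critical_at: float
    warn_action: str
    critical_action: str


# Hypothetical protocol entries; extend this list with lessons learned.
PROTOCOL = [
    Rule("cpu_utilization_pct", 70, 90,
         "Plan scale-out before the next peak",
         "Add capacity now and escalate to tier 2"),
    Rule("error_rate_pct", 1, 5,
         "Review recent deployments",
         "Roll back the last release and escalate to tier 2"),
]


def evaluate(metrics):
    """Return (kpi, severity, action) for every rule that the metrics trigger."""
    findings = []
    for rule in PROTOCOL:
        value = metrics.get(rule.kpi)
        if value is None:
            continue  # KPI not reported in this sample
        if value >= rule.critical_at:
            findings.append((rule.kpi, "critical", rule.critical_action))
        elif value >= rule.warn_at:
            findings.append((rule.kpi, "warning", rule.warn_action))
    return findings


for kpi, severity, action in evaluate({"cpu_utilization_pct": 75, "error_rate_pct": 6}):
    print(f"[{severity}] {kpi}: {action}")
```

Keeping the warning level separate from the critical level is what gives the protocol its proactive side: the NOC gets a call to action before users notice anything.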
Step 4 – Support in Tiers
The common best practice is the three-tier model.
Tier 1
Your Network Operations Center (NOC) team. This team takes the calls and is able to respond to and remediate 60-80% of all support requests. The NOC team should handle events according to your uptime management protocol. Any exceptions should be escalated to tier 2. The NOC team is responsible for recording all events and actions taken, and for enhancing your uptime protocol accordingly.
Tier 2
Expert Support – System engineers who have the tools and permissions to dig in, check the logs, and take action to solve “heavy” configuration issues. Their reports are sent to their “internal client,” the NOC team.
Tier 3
R&D or “Code Level Support” team. This tier handles escalations from tier 2 and takes care of events and issues caused by code bugs or even by the product definition.
The three-tier support model can be extended by breaking each of the tiers down into sub-tiers to support multiple online operations and systems. Such support organization structures can be found in large enterprises such as telecom companies.
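The escalation flow between the tiers can be sketched as a simple routing function. The issue categories below are hypothetical labels that I chose for illustration; in practice they would come from your own uptime management protocol.

```python
# Each tier is listed with the (hypothetical) issue types it can resolve.
TIERS = [
    ("Tier 1 (NOC)", {"known_incident", "service_restart"}),
    ("Tier 2 (Expert Support)", {"configuration", "log_analysis"}),
    ("Tier 3 (R&D)", {"code_bug", "product_definition"}),
]


def escalation_path(issue_type):
    """Return the tiers an issue passes through until one can resolve it."""
    path = []
    for name, handles in TIERS:
        path.append(name)
        if issue_type in handles:
            return path
    raise ValueError(f"no tier is defined for issue type {issue_type!r}")


print(escalation_path("known_incident"))
print(escalation_path("code_bug"))
```

Every issue enters through tier 1, so the NOC keeps a full record of events even when the fix comes from another team.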
Step 5 – Tracking & Reports in Two Levels
Level 1 – Operations’ KPIs Reports
The KPIs mentioned above are crucial to enabling improvement and scaling. The NOC’s system engineer should track and report the KPIs daily in order to forecast and avoid future issues caused by changes in demand and the resulting changes in utilization.
Level 2 – Business Summary Reports
The business support reports are broad summary reports that can be generated on a weekly and monthly basis. These reports include a high-level view of the production status and the actual usage of the system. For example, a report that shows the business growth trend based on the number of new users provides insight into new workloads that need to be planned for and handled. Annual reports of this kind show the yearly usage of the system, compare it month to month, and present the system’s operational margins.
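The month-to-month comparison in such a report comes down to simple arithmetic. Here is a minimal sketch; the user counts are made-up sample data and the function name is my own.

```python
def month_over_month_growth(monthly_values):
    """Return the percentage change between consecutive months.

    A month that follows a zero value has no defined growth and yields None.
    """
    growth = []
    for previous, current in zip(monthly_values, monthly_values[1:]):
        if previous == 0:
            growth.append(None)
        else:
            growth.append(round((current - previous) / previous * 100, 1))
    return growth


new_users = [1000, 1100, 1320]  # hypothetical new users per month
print(month_over_month_growth(new_users))
```

Feeding the same growth figures back into your workload forecast connects the business report to the sizing decisions in Step 2.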
Don’t forget your online service support. Start by recognizing and clearly defining your business goals and system criticality. I hope that these guidelines help you achieve healthy online operations and scale while reducing risk.