At the suggestion of some friends, I've split my previous blog post on SAP HANA TDI KPIs into two parts, to properly highlight the architecture considerations I included in the second half of the original version. This is part 2.
Part 2: One customer example to illustrate
A concrete example: I talked with a customer who was evaluating the migration of his entire SAP landscape to HANA (not decided yet, just calculating the implications). He had about 45 productive Netweaver-based systems, in a total (if I remember well) of about 180 SIDs. Among these, there are 3 productive systems with massive data loads, where write performance is critical. For all the others, there aren't that many data loads, and the most critical aspect is read performance.
My suggestion was to define, as a standard, a virtualized architecture on top of a stretched cluster replicated across 2 datacenters at a "metro distance". For those of you more knowledgeable in infrastructure: the replication between the sites was already being done using EMC VPLEX Metro.
This architecture would most likely serve for 177 of his 180 systems.
This architecture would drive major cost savings. Depending on how much saving the customer wanted to draw out of this scenario, as well as on the distance between the 2 replicated datacenters, some of the SAP HANA VMs might not fulfill all the TDI KPIs. And that is no big deal: this kind of compromise already happens today in the Netweaver world and is good enough for organizations, even more so when put in perspective with the major savings achieved.
Then, as the other 3 were an exception, we came to a discussion about what would be an acceptable balance between performance and data resiliency.
If you want a zero Recovery Point Objective (RPO), meaning no data loss, there is no other way than to implement some sort of synchronous replication. Whether the replication is done at the storage level or through HANA System Replication, you'll always suffer the impact of the network latency.
If you cannot compromise on performance because the load on this system will be massive, there is no way you'll be able to replicate synchronously to a datacenter at a "metro distance", as the network latency will always add up, degrading SAP HANA write performance.
Then you need to evaluate your specific business scenario and define the best compromise.
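To put a rough number on that latency penalty, here is a minimal back-of-the-envelope sketch in Python. It assumes the usual rule of thumb of about 5 microseconds per kilometer, one way, for light in optical fiber, and at least one network round trip per acknowledged write. The function names and the 400 microsecond local write latency are hypothetical, chosen only for illustration. They are not TDI KPI values, and real links add switching and protocol overhead on top.

```python
# Back-of-the-envelope estimate of what synchronous replication adds to a write.
# All numbers are illustrative assumptions, not SAP TDI KPI values.

FIBER_US_PER_KM = 5.0  # assumed one-way propagation delay in optical fiber


def sync_write_penalty_us(distance_km: float, round_trips: int = 1,
                          overhead_us: float = 0.0) -> float:
    """Extra latency per acknowledged write, in microseconds.

    A synchronous write is only confirmed after the remote side has
    acknowledged it, so each write pays at least one network round trip.
    """
    return round_trips * 2 * distance_km * FIBER_US_PER_KM + overhead_us


def effective_write_latency_us(local_latency_us: float, distance_km: float) -> float:
    """Local write latency plus the replication round trip."""
    return local_latency_us + sync_write_penalty_us(distance_km)


if __name__ == "__main__":
    local_us = 400.0  # hypothetical local log-write latency
    for km in (0.1, 10, 50, 100):
        penalty = sync_write_penalty_us(km)
        total = effective_write_latency_us(local_us, km)
        print(f"{km:>6} km: +{penalty:7.1f} us -> {total:7.1f} us per write")
```

Under these assumptions, two servers in the same datacenter pay almost nothing, while 50 km of fiber adds about 500 microseconds to every synchronous write, which is more than the hypothetical local write itself.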
2-level SLA as a compromise solution for the "performance and availability" paradigm
During the discussion I then suggested that he define a "2-level SLA":
- One SLA for single component failure within a datacenter;
- And another SLA for a full site loss.
When you build resiliency into your architecture, what you're doing is putting in place a contingency for a business risk.
Every risk is a function of probability and impact.
So, the probability of a single component in the datacenter infrastructure failing is very high, but if you design the systems properly, the impact can be minimal.
Conversely, the probability of losing a full datacenter is very low, although the impact would be massive.
So the RTO and RPO you would be willing to accept in these two scenarios may well be different. If you lose a full site, this is probably a major disaster, and it might even imply that your customers would be out of business as well. You therefore need to evaluate how much you are willing to spend to put a contingency in place, or whether it is acceptable to reduce your SLAs and complement them with an insurance policy for unexpected losses.
In the end, all this discussion goes back to ensuring that you fulfill business needs, while at the same time you ensure that the TCO of your solution doesn't become bigger than the business benefits of it.
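The "probability and impact" reasoning can be sketched as a simple expected-loss calculation. This is only an illustration of the way of thinking, not a risk model. The class, the function and every probability and cost figure below are hypothetical and would have to come from your own business.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_probability: float  # chance of the event happening in a given year
    impact_cost: float         # business cost if it happens with no contingency

    def expected_annual_loss(self) -> float:
        """Risk exposure as probability multiplied by impact."""
        return self.annual_probability * self.impact_cost


def contingency_is_worth_it(risk: Risk, residual_impact_cost: float,
                            annual_contingency_cost: float) -> bool:
    """True if the expected loss avoided per year exceeds the yearly cost
    of the contingency (extra hardware, licenses, operations)."""
    avoided = risk.annual_probability * (risk.impact_cost - residual_impact_cost)
    return avoided > annual_contingency_cost


if __name__ == "__main__":
    # Hypothetical figures: frequent but contained vs. rare but massive.
    component = Risk("single component failure", annual_probability=0.5,
                     impact_cost=20_000)
    site = Risk("full site loss", annual_probability=0.01,
                impact_cost=10_000_000)
    for risk in (component, site):
        print(f"{risk.name}: expected annual loss {risk.expected_annual_loss():,.0f}")
    print(contingency_is_worth_it(site, residual_impact_cost=1_000_000,
                                  annual_contingency_cost=50_000))
```

The same comparison also works the other way around: if the contingency costs more per year than the loss it avoids, relaxing the SLA and covering the remainder with insurance may be the better business decision.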
This then opened the following possibility: a more costly one, but one that would balance the extreme availability and performance requirements of these systems. Doing this only for the few systems that "really required such extreme characteristics" enabled a major cost saving on all the other systems included in the standard defined above (the virtual datacenter).
To protect against a single component failure, SAP HANA synchronous replication could be implemented with the "in-memory pre-load" option, ensuring zero RPO and near-zero RTO (less than 1 minute for sure). This would be done across 2 servers in the same datacenter, to ensure minimal performance impact.
For a full datacenter loss, one of the systems would replicate asynchronously at the storage level to the secondary datacenter. The calculated RPO would be less than 1 minute and the calculated RTO less than 15 minutes, ensuring an acceptable SLA with minimal performance impact. It would also reduce the TCO by allowing the system in the secondary datacenter to be used for "pre-production testing" in normal operations.
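One way to make such a 2-level SLA explicit is to write both levels down as data and check the design against them. The sketch below is a hypothetical illustration: the design values are the upper bounds quoted above (zero RPO and under 1 minute RTO locally, under 1 minute RPO and under 15 minutes RTO for site loss), while the SLA targets are invented for the example and are not the customer's actual targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjective:
    rpo_seconds: float  # maximum tolerated data loss
    rto_seconds: float  # maximum tolerated time to restore service

    def satisfied_by(self, design: "RecoveryObjective") -> bool:
        """True if the design is at least as good as this target."""
        return (design.rpo_seconds <= self.rpo_seconds
                and design.rto_seconds <= self.rto_seconds)


# Hypothetical SLA targets, one per failure scenario.
SLA_TARGETS = {
    "component_failure": RecoveryObjective(rpo_seconds=0, rto_seconds=5 * 60),
    "site_loss": RecoveryObjective(rpo_seconds=15 * 60, rto_seconds=60 * 60),
}

# Upper bounds of the design described in the text.
DESIGN = {
    "component_failure": RecoveryObjective(rpo_seconds=0, rto_seconds=60),
    "site_loss": RecoveryObjective(rpo_seconds=60, rto_seconds=15 * 60),
}


def check_design(targets: dict, design: dict) -> dict:
    """Map each failure scenario to whether the design meets its target."""
    return {scenario: target.satisfied_by(design[scenario])
            for scenario, target in targets.items()}


if __name__ == "__main__":
    for scenario, ok in check_design(SLA_TARGETS, DESIGN).items():
        print(f"{scenario}: {'meets SLA' if ok else 'misses SLA'}")
```

Keeping the two scenarios separate is the whole point: the local level can demand zero data loss, while the site-loss level can accept a small RPO in exchange for write performance.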
Conclusion
Not all business systems are alike, and not all have the same business requirements, whether they run on SAP HANA or not!
This implies that the technical requirements will not be the same for all these systems either.
By defining a standard for the majority of the SAP HANA and Netweaver systems based on Virtualization and Mirrored Datacenters with VPLEX, this organization would achieve a very flexible architecture, easily manageable, with a very high utilization of assets, and so with a very high service level at a very affordable cost.
Even if some of the systems do not fulfill all the TDI KPIs, it is not a big deal, as this whole KPI discussion is one of "performance" and not of functionality, or even of support, as stated in SAP's documents referenced in this blog.
Then exceptions were evaluated for what they are: EXCEPTIONS!
And specific solutions were designed according to the business requirements.
So, again, SAP HANA is becoming a normal application in the datacenter! Make sure to keep up to date on the current rules and conditions to ensure the most suitable architecture for your organization, not only from a performance perspective, but also from a data resiliency, availability, CAPEX and, most of all, OPEX perspective, as it's OPEX you'll be living with for the rest of your solution's life.
As this is just one example, I hope it helps others going through the same discussion.