2015-09-23

SAP HANA TDI KPI: mandatory or indicative? And what about the network implications on Synchronous replication?

At the suggestion of some friends, I've split this blog post in two, to properly highlight the architecture considerations that made up the second part of the original version. That second part is now published at: http://sapinfrastructureintegration.blogspot.com/2015/10/exceptions-are-exceptions-one-example.html


          Part 1: The current facts

I've been getting this question a lot: when I run SAP HANA with either storage replication or HANA System Replication in synchronous mode, then depending on the network latency, the system will not meet the TDI KPIs. Will SAP accept this? What would be the performance impact at the application level?

The question came up because one customer wanted all the performance he could get, but also the maximum data resiliency that technology can provide today.

In a nutshell, there are situations in which the "Laws of Physics" do not allow you to have everything you want, and you need to choose: either maximum performance or maximum data resiliency.


So, following this reasoning, I first collected the information currently available from SAP on the matter, and reached the conclusion that "YES, it is acceptable to fail KPIs", as SAP says "it's the customer's decision to define whether the performance penalty is acceptable to his specific business scenario". Then I started a discussion with him about finding the right balance between technical and business requirements for his architecture blueprint.



Where did I get the information that it is acceptable to fail KPIs?

Let me share here what I wrote some weeks ago to a customer who asked what latency is acceptable when doing storage replication. A competitor was telling him that if he used storage replication and did not fulfill all the TDI KPIs, he could not run any production workload on that infrastructure, which is FALSE!

Then, I'll also share the reasoning that followed after getting this question cleared up, by providing a concrete customer example I hope helps you build your own reasoning in case you're going through a similar discussion in your organization as well.


          Network latency impact on SAP HANA performance when replicating synchronously

Network latency is indeed a critical factor for SAP HANA performance. But this is just as true for storage replication as it is when the customer decides to implement SAP HANA System Replication in synchronous mode.

My advice here is to carefully evaluate the latency between the two sites being synchronously replicated. Ideally, latency should be below 1 ms. More than that can really start to impact write performance, so if the customer application performs massive data loads, the data load time can become longer because of it. There will be, though, absolutely no impact on read performance, as reads are served from the memory of the server. So, depending on your application's specific workload profile, you may feel this increased latency a lot or not at all.
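To give a feel for where the "Laws of Physics" come in, here is a minimal sketch that estimates the propagation-only round trip time over a fiber link between two sites. The refractive index, the function names, and the example distances are my own assumptions for illustration, not figures from SAP. Real links add switch, storage array, and protocol overhead on top, so treat the result as a lower bound and measure the actual latency.

```python
# Propagation-only estimate; every constant here is an illustrative assumption.
SPEED_OF_LIGHT_KM_PER_MS = 299.792458  # speed of light in vacuum, km per millisecond
FIBER_REFRACTIVE_INDEX = 1.47          # assumed typical value for optical fiber


def fiber_speed_km_per_ms() -> float:
    """Approximate signal speed inside the fiber."""
    return SPEED_OF_LIGHT_KM_PER_MS / FIBER_REFRACTIVE_INDEX


def round_trip_ms(fiber_km: float) -> float:
    """Round trip time caused by distance alone (no equipment overhead)."""
    return 2 * fiber_km / fiber_speed_km_per_ms()


def max_fiber_km(rtt_budget_ms: float) -> float:
    """Longest fiber path whose propagation delay fits in the given budget."""
    return rtt_budget_ms * fiber_speed_km_per_ms() / 2


if __name__ == "__main__":
    for km in (10, 50, 100, 300):
        print(f"{km:>4} km of fiber -> at least {round_trip_ms(km):.2f} ms round trip")
    print(f"1 ms budget -> at most {max_fiber_km(1.0):.0f} km of fiber")
```

Under these assumptions a 1 ms round trip budget is used up by roughly 100 km of fiber path before any equipment delay is counted, which is why long distances between sites and sub-millisecond synchronous replication do not go together.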

In the end, all the details are in SAP’s whitepapers.
Two are particularly relevant to this matter. The first is the SAP HANA Network Requirements whitepaper: http://scn.sap.com/docs/DOC-63221
On page 28 it says:
  • There is no straightforward recommendation regarding the bandwidth and the latency a system replication network must provide. A rough estimation of the required performance is given in the How-to Guide Network Required for SAP HANA System Replication, which is referred to from the SAP Note 1999880 - FAQ: SAP HANA system replication
  • Latency: The redo log shipping time for 4 KB log buffers must be less than a millisecond or in a low single-digit millisecond range – depending on the application requirements (relevant for synchronous replication only).

This applies to storage replication as well. So, if you cannot stay below 1 ms, 2 or 3 ms may still be acceptable. But it will depend on your specific business scenario.
Transactional scenarios (like SAP ERP on HANA) are more sensitive to latency. Analytical scenarios are more sensitive to throughput.

The note mentioned above sends you through a sequence of documents, the final one being: Network Recommendations for SAP HANA System Replication at http://scn.sap.com/docs/DOC-56044
Here, the same as above is mentioned:
  • All changes to data are captured in the redo log. The SAP HANA database asynchronously persists the redo log with I/O orders of 4 KB to 1 MB size into log segment files in the log volume (i. e. on disk). A transaction writing a commit into the redo log waits until the buffer containing the commit has been written to the log volume. This wait time for 4 KB log buffers should be less than a millisecond or in a low single-digit millisecond range.
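Since a commit waits until its log buffer has been written, the effect of extra latency on a write-heavy load can be sketched with simple arithmetic. The model below is my own simplification, not an SAP formula: it assumes a single session issuing commits one after another, that each commit waits for exactly one log write, and that the replication round trip simply adds to the local write time (a pessimistic assumption). The function names and the example numbers are hypothetical.

```python
def commit_wait_ms(log_write_ms: float, replication_rtt_ms: float = 0.0) -> float:
    """Assumed wait per commit: local log write plus replication round trip."""
    wait = log_write_ms + replication_rtt_ms
    if wait <= 0:
        raise ValueError("total commit wait must be positive")
    return wait


def commits_per_second(log_write_ms: float, replication_rtt_ms: float = 0.0) -> float:
    """Upper bound on strictly sequential commits per second for one session."""
    return 1000.0 / commit_wait_ms(log_write_ms, replication_rtt_ms)


def load_duration_s(n_commits: int, log_write_ms: float,
                    replication_rtt_ms: float = 0.0) -> float:
    """Time spent waiting on the log for n sequential commits."""
    return n_commits * commit_wait_ms(log_write_ms, replication_rtt_ms) / 1000.0


if __name__ == "__main__":
    # Hypothetical example: 0.5 ms local log write, then 2 ms added by replication.
    for rtt in (0.0, 2.0):
        rate = commits_per_second(0.5, rtt)
        total = load_duration_s(1_000_000, 0.5, rtt)
        print(f"replication RTT {rtt} ms: {rate:.0f} commits/s, "
              f"{total:.0f} s for 1,000,000 sequential commits")
```

Real systems run many sessions in parallel and can write several commits in one log buffer, so the actual impact is usually smaller than this model suggests. The point is only that write-bound, commit-heavy workloads scale with log latency, while reads served from memory do not.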


          Even without replication, it is your choice whether it is acceptable to fail KPIs

Let me add as a final note the transcript of the response to question 7 on page 7 of the SAP HANA TDI FAQ published at SCN: SAP HANA TDI - FAQ | SCN

It reads:

Q: Which cases where one or more KPIs are not fulfilled does SAP consider as uncritical or acceptable?
  • This is always up to the customer to decide if falling below the KPIs is acceptable for his/her daily operation of the SAP HANA system. The customer must decide whether the performance of his/her SAP HANA system is sufficient for his/her needs.
  • The following questions and examples might help for making that decision:
    • Is the given SAP HANA system a non-productive system?
      • The KPIs apply for production systems only. In general, for non-productive systems weaker performance is acceptable
    • Which SAP HANA scenarios are run on the given system?
      • OLAP scenarios only (e.g. SAP BW-on-HANA):
        • The performance of queries is usually not affected if all required tables have been loaded into memory
        • Latency times of the log volume and throughput rates for writing the data volume are mainly relevant when loading data from source systems
        • Throughput rates for reading from the data or the log volume affect the overall system restart time (e.g. after applying an SAP HANA revision update)
      • OLTP scenarios only (e.g. SAP Business-Suite-on-HANA):
        • Latency times of the log volume affect the duration of every transaction that changes one or more tables in the database
        • Throughput rates for reading from the data or the log volume affect the overall system restart time (e.g. after applying an SAP HANA revision update)
      • See the Storage Requirements whitepaper for details: SAP HANA TDI - Storage Requirements | SCN

So, as you can see, SAP HANA is really becoming a normal application in the datacenter, and more choice is given to customers every single day. Today, even without replication, it is each customer's choice whether failing certain KPIs is acceptable.

This is relevant, for example, when you think of very small SAP HANA systems that are not very critical. You see more and more of these as customers make SAP HANA mainstream and migrate every single system to HANA. There may be a couple of systems that are very big and very critical, but there are a ton of others where you could compromise a bit on performance to get a better TCO.
