2014-01-06

Choosing the right SAP HANA Architecture – a discussion example on "Block over FC" vs "File over IP"

A lot of discussion has been going around on whether the future of HANA will be based on "Block over Fibre Channel" or "File over IP" access to persistency.

It is my personal opinion that both technologies have their space, and that there is a reasonable rationale for choosing one, the other, or a combination of both. Based on my current knowledge and experience, I consider “Block over Fibre Channel” access to persistency the more robust and efficient solution, suited to the largest mission critical applications – enterprise-type workloads – and “File over IP” a more flexible, good-enough solution for less critical and smaller-scale HANA systems.

Let me explain my rationale for this conclusion over the next paragraphs. I will also take the opportunity to illustrate a “reasoning process” for making architecture choices regarding other aspects of SAP HANA.

Context

For those a bit more disconnected from this discussion, let me briefly introduce the topic.

Although SAP HANA is an in-memory database, it still needs to persist data to prevent data loss in the case of a system failure. So, like any other database system, SAP HANA keeps log files and data files on persistent media, such as disk subsystems, which are needed to restart and recover after a server failure.

When SAP HANA systems access data on shared disk systems, there are different transmission protocols that can be used.

Why is this important for a SAP HANA discussion? Because each of these protocols implies choices in terms of the hardware to purchase, as well as configurations on both the server side and the HANA side. Each of these protocols, by design, also has implications for throughput, latency, manageability, availability, security, and many other aspects that will influence the life of the application.

The more critical and the larger the application is, the more importance these aspects gain.

Still, some IT architects these days may not even be aware of these two options, or of the implications of choosing one over the other.

In this blog post, I want to share my current personal perspective on this, and my reasoning for
choosing one or the other.

Variables to be evaluated when making technology choices


SAP HANA, being a disruptive technology, has a number of implications at the datacenter level that should not be underestimated. Some of those implications are:
  1. HANA requires a new skill set at the systems administration and operations level;
  2. HANA currently implies the need to purchase new hardware, as SAP mandates that productive systems run on dedicated resources;
  3. HANA may imply new operational processes and setups for High Availability, Disaster Recovery as well as basic backup and recovery procedures;
  4. HANA may imply the usage of new management tools;
  5. HANA may also imply new procedures for promoting developments to production, setting up test systems, or refreshing those test systems with productive data;
  6. HANA usage scenarios will more likely imply higher levels of automation and real-time business processes, multiplying the impact of any disruption.

At the same time, wise IT and business decision makers try to evaluate a number of different
variables in the effort to make the best decision for their businesses.

The most considered ones are usually technical in nature, and some examples are:
  • Performance requirements
    • Volumes of data expected;
    • Processing power needed;
  • Availability requirements
    • Maximum downtime admissible for planned maintenance tasks
    • Maximum downtime admissible for unplanned downtimes (RTO)
  • Security Requirements
    • Maximum admissible data loss in case of a disaster (RPO)
    • Reliability of the solution

But there are also business variables to be considered.

Experience shows that all of these have an impact over a solution's lifetime; nevertheless, only a small part of them is usually considered. Here is a list of aspects that, through my personal experience, I’ve come to evaluate (even if only at a qualitative level):

  • Cost to implement
    • HW investments
    • Implementation Services
    • Team training
    • Fit with existing architecture designs
    • Impact of “in-house existing depth of knowledge” on alignment of architectural & operational choices and time-to-value against business timelines
  • Cost to operate
    • Preventive maintenance
    • Reactive maintenance
    • Incident Management
    • Problem Management
    • Minor Change Management
  • Cost of downtime
  • Impact of “in-house existing depth of knowledge” on the points above
  • Cost to change (new projects)
    • Impact of adding capacity due to organic growth or acquisitions
    • Cost of separating due to divestitures
  • Cost of risk (as per risk management practices, the value of other known and unknown risks)
  • Impact of “in-house existing depth of knowledge” on the points above
There are lots of variables, and most customers only consider a small part of them.

The “knowledge factor” in particular is quite often underrated in terms of its impact on the overall cost of a solution across its entire lifecycle. And that is understandable, as it is also one of the most difficult to quantify.

Choosing an architecture based only on the technical variables will not ensure the best fit for a specific use case in a specific organization.

The impact of knowledge may be so significant that it can be enough to change the recommendation away from what would be, in purely technical terms, the "best solution".

Setting the scene

As stated above, these variables aren't always part of customer evaluations; also, to enable a fair comparison, I must treat some of them as constants.

So, in my exercise of thinking through which is the best solution for persistency access in SAP HANA (File over IP vs Block over FC), let's consider that all aspects related to the organization (people and processes) have exactly the same valuation in both scenarios.

For this discussion I'll use the following customer scenario:
  • The company doesn’t have leading-edge knowledge of HANA, as most of its resources were hired based on their knowledge of legacy technologies;
  • Although key IT employees get regular training, they still lack practical experience implementing and operating SAP HANA, and so they are prone to making some of the mistakes that come with first-time experiences;
  • As in most organizations, cost is a key decision factor for this company, and a significant part of the resources allocated to daily operations has been subcontracted through yearly auction processes. This has led to increased turnover of resources, decreased experience on the most critical platforms, loss of knowledge of the initial implementation process, and less readiness to deal with unusual failures;
  • The level of automation of operational tasks is low, and although significant documentation exists, it isn't always up to date, nor is it known by all operations staff, leading to an increased risk of human error (it’s widely known that most IT failures are due to human errors);
  • Although the customer has some of its best and most experienced resources in service management positions, several years without hands-on work on the systems have led top management to a misperception that operational knowledge is higher than reality shows (high dependency on external contractors for key change and transformation projects, and more human errors leading to system failures);
  • With the external sourcing of operational resources in place for some years now, the newest projects have been fully subcontracted; as a consequence, internal resources are losing more and more of their grasp of operational best practices for the most recent technologies, and their technical evaluations are getting outdated.
Again, remember that I’m not a storage expert, but I do have 15 years of SAP Technology background between consulting and operations, and there are some facts I’ve observed with my own eyes in my professional life while consulting or managing operations of large scale SAP System Landscapes, on which I’ll base my parallel to the new SAP HANA reality.

Having learned that multiple aspects need to be considered when choosing architecture options for an enterprise, I know that the right solution in one scenario may not be the best in another.

Let me explore my “architecture design variables” and discuss them through my own experience, focusing on the scenario of “enterprise grade, large scale, transactional mission critical SAP HANA implementations”, which is the scenario of SAP Business Suite on HANA for many of the current SAP ERP 6 customers considering a change of their underlying database to SAP HANA.

Although performance, availability and cost of implementation are almost always considered, and security and cost of operations still show up many times, cost of change is definitely the factor most rarely taken into account.

Why is that? I would say because, when you start a new project, you rarely have visibility into upcoming changes, which makes it difficult to “visualize” this factor.

One thing we all know for sure is that change will happen, and it will more and more likely come at short notice and in unexpected ways. So, when discussing business agility and the impact of IT on a company’s ability to change fast, this variable should be a key one.

Cost of change and the knowledge factor may become the most complex wildcards when going through the decision tree, so we’ll consider them constant in both the block and the file scenarios.

The Performance Considerations


Let’s start by looking at the performance aspects.

One of the things I’ve observed is that latency adds up into the response time observed at the SAP application level, and that minor increases in infrastructure latency may be multiplied by 10 when measured as application response time.

How did I come to understand this? I’m an SAP Basis expert, and many organizations work in silos where SAP Basis teams have no access to the storage, network, server or virtualization consoles. Still, in some critical projects I’ve been involved in where performance was a critical aspect (for example, ensuring that a data migration between two systems fit within the downtime window the business could afford), I found myself in endless discussions with the infrastructure stakeholders.

Many of the SAP Basis guys who read this will identify with it, and most of you guys from infrastructure who read this will think I’m exaggerating:
  • “SAP Basis guy: Hey infrastructure guys, I have a performance problem! Can you check it on your side?

  • Network Guy: your systems are hardly speaking, and I see low network utilization. No problem on my side
  • Server Guy: your system is not paging, it’s low on CPU utilization, so no problem on my side
  • Storage guy: the controllers are low on CPU utilization, front-end ports and SAN switches also, so no problem on my side either.”
  • Note: I could add the “VMware Guy” comments, as productive support for SAP HANA on VMware is just around the corner, but let me leave that to a later post.

So the Basis guy finds himself in a corner. He tries to work with the application guys to optimize SQL statements, indexes, buffers and database parameters, but he faces a brutal argument from them: “when I ran this same code on the old system, it ran faster, so don’t come and say it’s a code problem. Work it out with your infrastructure colleagues, because it’s certainly not a problem on my side”.

I was lucky, because when I was in this situation, I got some training on infrastructure topics (servers, storage, network, virtualization), I had built a high level of trust in my IT organizations, and I had the Directors saying to all the infrastructure teams: “guys, the SAP Basis Team Lead will sit beside each of you; he doesn’t have access to the infrastructure consoles, but you will show him the metrics and configurations of each component of the system, so that we can trace down the reason for this performance issue”.

Why was I lucky? Because this enabled me to learn also the infrastructure part, not only through training, but also from the experience sharing of these great team mates that welcomed me to sit with them and went into the details of the functioning of their parts of the infrastructure, providing me an outstanding learning experience.

With this I learned that increased response times can come from many places: application and database specific configurations, the client-to-application-server network, application server low memory / paging, the network between application servers and database servers, database instance low memory / paging, database server IO queues, SAN QoS, storage-specific configuration aspects, and a lot of other variables; these are just a small sample.

So my finding is that when you have a severe performance problem, it usually doesn’t come from a single cause, but from multiple contributing factors which, all summed up, end in an unacceptable response time. Nevertheless, comparing two systems running the same code, some aspects popped up multiple times across different customer IT landscapes.

In my experience (which doesn’t mean it is the reality at all customers or SAP installations), considering two systems with exactly the same parameter configurations on the application instance and the DB instance (as happens, for example, in a server tech refresh), I found two key factors contributing, in many circumstances, to degraded response times.

They were:
  • App Server to DB Server access latency;
  • DB server to storage access latency.
Along the way, I attended an introductory storage training where the differences between the IP and FC protocols were explained to me, and it became clear why IP-based systems (NFS, iSCSI, NAS, etc.) consistently presented worse DB response times than Block over FC based systems with similar backend configurations. The protocol stack itself introduces an overhead that makes IP less efficient than FC.

So, in my wish list, I would like to always have FC!

This in itself would be a good argument to say that all highly transactional systems should work on FC-based disk access, and the fact is that some of the largest enterprise transactional systems these days do use FC as the preferred disk access protocol.

When talking about HANA, this is also a valid argument: on very large HANA systems, although all reads are done in memory, all writes still need to be persisted to disk (the persistency) before they are reported as committed. So, on systems where writes are intensive, transactional and critical for specific business processes, it would be absurd to have things happening between memory and CPU on the nanosecond scale, and then have the whole system slow down because it is waiting for IO.

NOTE: for SAP HANA writes to persistency, the key aspect is really latency, not throughput!
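
To make this point concrete, here is a minimal sketch (my own illustration, not an official SAP test) of how one could probe commit-style write latency: small blocks written and fsync'd one at a time, which is roughly what HANA log writes look like to the storage, instead of measuring bulk MB/s. The target path and sample count are assumptions to adapt to the filesystem under test; for a real assessment you would use a proper benchmarking tool and SAP's own hardware checks.

    # Minimal latency probe: small, fsync'd writes resembling commit/log writes.
    # TARGET is a hypothetical path; point it at the filesystem you want to test.
    import os
    import statistics
    import time

    TARGET = "/hana/log/latency_probe.tmp"   # assumption: adjust to your mount point
    BLOCK = b"\0" * 4096                     # one small 4 KiB block per "commit"
    SAMPLES = 1000

    def probe(path=TARGET, samples=SAMPLES):
        latencies_ms = []
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            for _ in range(samples):
                start = time.perf_counter()
                os.write(fd, BLOCK)
                os.fsync(fd)                 # wait until the block is on stable storage
                latencies_ms.append((time.perf_counter() - start) * 1000.0)
        finally:
            os.close(fd)
            os.unlink(path)
        latencies_ms.sort()
        return {
            "avg_ms": round(statistics.mean(latencies_ms), 3),
            "p99_ms": round(latencies_ms[int(0.99 * samples) - 1], 3),
            "max_ms": round(latencies_ms[-1], 3),
        }

    if __name__ == "__main__":
        print(probe())

On a backend with comparable disks, it is differences in this per-commit latency, rather than in raw throughput, that surface as the application response time issues described above.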

Some argue that this is exactly why they use internal storage in the servers, taking this variable out of the equation.

To this argument I have to counter that there were very significant reasons why, back in the 90s, IT systems moved from servers with only internal disk drives to storage area networks. Some of the key arguments related to High Availability and Disaster Recovery requirements: doing HA and DR with internal disks was quite a challenge, with huge implications for the reliability of the solutions, more chances of error due to significant manual intervention (always a failure factor at critical moments such as a disaster, when you may not have your most skilled resources at hand), and also significant operating costs.

The Availability Considerations


So, if performance were the only argument to consider when designing a system architecture, we would never have left the “internal disk era”.

External storage also enabled things like today’s cluster configurations, using SCSI persistent group reservations to lock access to disk resources to the servers that need them at a given time, all of these processes widely tested and with recognized reliability levels in the industry.

Coming back to the topic of why, on the largest systems, block is the right access protocol for HANA persistence: availability and recoverability were the key drivers for the move to external storage systems, and that is why “external disk” is the right option for “enterprise mission critical transactional workloads”; and, among the external disk options, Block over FC is the best option from a performance as well as from an availability perspective.

I’ve had many very open discussions over the last 6 months with customers in multiple industries and roles on this topic, and all have come to agree with these arguments.
The following SAP Notes further “indirectly” support the argument for block based access to HANA persistency:
  • 1931536 - HANA Possible DB corruption with NFSv3 based scale out
  • 1880382 - HANA persistence startup error - Cannot lock file
  • 1740136 - SAP HANA wrong mount option may lead to corrupt persistency
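
A practical takeaway from notes like 1740136 is that persistence safety on NFS depends on the exact mount options and locking behavior actually in effect on each host. As a small illustration (my own sketch, not something prescribed by the notes), the following lists the options the HANA data and log filesystems really ended up mounted with, so they can be compared against whatever your storage vendor and the relevant SAP Notes require; the /hana/data and /hana/log mount points are assumptions to adapt to your layout.

    # Minimal sketch: show which mount options are actually in effect for the
    # HANA persistence filesystems, for comparison against vendor / SAP Note
    # guidance. The mount points below are assumptions; adjust to your layout.
    HANA_MOUNTPOINTS = ("/hana/data", "/hana/log")   # assumed standard layout

    def hana_mounts(proc_mounts="/proc/mounts"):
        entries = []
        with open(proc_mounts) as f:
            for line in f:
                device, mountpoint, fstype, options = line.split()[:4]
                if mountpoint.startswith(HANA_MOUNTPOINTS):
                    entries.append((device, mountpoint, fstype, sorted(options.split(","))))
        return entries

    if __name__ == "__main__":
        for device, mountpoint, fstype, options in hana_mounts():
            print(f"{mountpoint} on {device} ({fstype}): {', '.join(options)}")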

Although all of these notes present solutions, the fact that the problems existed in the first place is, from my perspective, a sign of the solution’s fragility, and I would want widely tested and proven technologies for my enterprise and mission critical workloads.

A CIO once told me, when I was his supplier presenting the latest and greatest to him, that he only wanted the “almost newest” technology for his mission critical processes, and invited me to come back in a year to talk about the same topic…

Meaning, he didn't want to take the chance of compromising his most critical operations by testing new technologies, and would rather implement the 2nd generation of a technology, leaving the adoption of new technologies to the less critical systems (the ones the company does not make its money on top of).

IP-based protocols like NFS weren't designed for this type of usage, and add-ons must be implemented to overcome their limitations. This adds risk and potential for human errors and failures, both unacceptable in an enterprise-grade, high-workload, real-time, mission-critical application.

Block over FC has built-in availability features that have passed the test of time and of intensive critical operations, as it was designed specifically for that purpose.

Today, the fact that SAP HANA, since 1.0 SP5 and through the “SAP HANA Fiber Channel Storage Connector”, has built in all the availability features used by enterprise-class clusters since the 90s makes this option equipped with availability features “by design”. The mechanism is very simple and, in that sense, very reliable.

It uses SCSI-3 persistent group reservations on the block devices to lock them for exclusive access from a single node. And all of this comes out of the box, both on the Linux OS side and on the HANA system side, without the need for any additional clustering or locking software.
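
For those who want to see this mechanism with their own eyes, Linux ships the sg_persist utility (from the sg3_utils package), which can read the registrations and the current reservation on a LUN. The sketch below is purely illustrative and read-only; the device name is a hypothetical multipath alias, and the actual fencing during a HANA host auto-failover is driven by the storage connector itself, not by scripts like this.

    # Read-only sketch: inspect SCSI-3 persistent reservations on a block device
    # with sg_persist (sg3_utils). It only *reads* registrations and the current
    # reservation holder; it does not register, reserve or preempt anything.
    # The device name below is a hypothetical multipath alias.
    import subprocess

    DEVICE = "/dev/mapper/hana_data_1"   # assumption: adjust to your LUN

    def read_reservation_state(device=DEVICE):
        # "--in --read-keys" lists the registered keys (one per attached node),
        # "--in --read-reservation" shows which key currently holds the reservation.
        for args in (["--in", "--read-keys"], ["--in", "--read-reservation"]):
            result = subprocess.run(
                ["sg_persist", *args, device],
                capture_output=True, text=True, check=False,
            )
            print(result.stdout or result.stderr)

    if __name__ == "__main__":
        read_reservation_state()

In broad strokes, during a host auto-failover the connector registers the standby host’s key, preempts the failed host’s registration and takes over the reservation, so the failed node is physically prevented from writing to the persistence.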

Curiosity: did you know that the "SAP HANA Fiber Channel Storage Connector" was co-developed by SAP and EMC? Two EMC engineers based in Walldorf were key contributors to building this solution, which today is becoming the de-facto standard for most hardware vendors offering external storage systems for SAP HANA installations.

The security considerations

Security is a very vast topic; in this context I will take it from the perspective of how vulnerable an architecture is to unintended operator errors, and how well it lends itself to automation.

Unattended operation and model-based automation are two key ways to reduce human errors in the operation, administration and change of mission critical systems. The fewer the manual interventions needed, and the more automated they are, the better for mission critical systems.

Here, having technologies that have built-in reliability controls, and that have been designed with mission-critical aspects in mind, is paramount. There is no question that “Block over SAN” is way better equipped in this field than “File over NAS”.

SCSI-3 persistent group reservations are a mechanism that is well documented and understood in the market, adopted by all manufacturers, and designed to provide the automation and fencing needed by the failover mechanism of a shared-nothing cluster such as SAP HANA.

Note: apart from EMC, other companies claim the same benefits from the FC storage connector for SAP HANA. HDS is just one example. Is it a coincidence?

The case for file


Having said all the above, I would also say that perhaps 50% of current SAP customers use SAP systems in business scenarios that may not require such features.

For many mid-sized customers, certain levels of automation, reliability and availability need to be balanced against the investment cost of those technologies, as these companies' CAPEX capacity is limited and needs to be focused on core business equipment. They often rely on existing in-house resources that have evolved from end-user support, or on cheap contractors, to keep these new systems running, so it’s better for them to use the most widely known technologies in the market, even if those aren’t the most reliable, resilient, automated or best-performing ones.

It’s the good-enough market. It is real, and we face it many times.

This market, at least in Europe, is for example leading the adoption of Hyper-V for virtualizing productive SAP systems, or the use of SQL Server as the preferred database for the whole SAP landscape.

And this market is also the one most eager to strike a deal with a service provider that can operate these systems more cheaply than they can in-house, as they have neither the scale to implement certain costly automation tools nor the negotiating capacity to attract or contract skills at competitive prices.

This also leads to the definition of two grades of service provider offerings (sorry for the oversimplification):
  1. On one side, the “best you can have”: hosted, managed private cloud;
  2. On the other side, the good-enough, cheap hybrid/public cloud.

Both for medium-sized companies (or large companies using SAP systems for non-core / non-critical business processes) and for “medium grade SLA” cloud service providers, “File over IP” may provide a cheaper, easier to manage option.

Note that today most public cloud offerings target simplified access to development and sandbox systems, enabling companies to start projects with low startup costs, or letting potential customers and partners get acquainted with SAP HANA technology, and the server farms supporting these “HANA public cloud offerings” host multiple small, non-mission-critical systems. Evidence of that can be seen in the standard SLAs provided by those providers. The situation is evolving, but this is still the reality for the majority of SAP's partners in this field.

Key dangers of a “File over NAS” implementation


Good old practices of architecture design for mission-critical transactional SAP systems taught us to privilege stability over all other aspects. Isolation came as a natural consequence: isolate each workload type to ensure its behavior is predictable and reliable.

But today’s “cloud-like” architecture designs leverage aspects like resource pooling and sharing to achieve elasticity and cost reduction.

The school of virtualization has led architects to think about their designs from the perspective of minimizing hardware investment and maximizing asset utilization.

This has happened for servers, storage and connectivity.

In traditional architecture designs, you had a dedicated network for client-to-server access, another for server-to-server, and a different one for server-to-storage. In more complex environments you might have other networks for specific application needs, or separate networks for performance or security reasons.

Today, all these systems have capacity like never before, at increasingly affordable prices:
  • Ethernet evolved to 10 Gbps with prices continuously dropping;
  • The same for storage with flash drives;
  • And also for servers with multi-core technologies and larger size memory boards.
So, the temptation to share everything is huge!

The risk? With very different workload patterns sharing the same infrastructure, you need to define the SLAs for each application very well, you need to excel at workload monitoring and capacity management, and you need to ensure adequate free capacity for the peak processing times of your critical applications.

Knowing that:
  1. monitoring is not yet brilliant at most organizations, and performance monitoring tied to end-to-end root cause analysis is still an “un-mastered” science;
  2. infrastructure teams’ understanding of each application’s specific workload is still minimal;
  3. application workload patterns are increasingly unpredictable, and even the recurrent processing peaks aren’t always documented.
As SAP opens up the HANA hardware configuration options beyond the existing appliance model, for example through “SAP HANA Tailored Datacenter Integration”, I would not be surprised to see, in the coming months, customer implementations sharing the same 10 GbE backbone and the same server 10 GbE ports for many things other than exclusive access to HANA persistence.

It happened with iSCSI on traditional NetWeaver-based systems! There, to achieve higher-performing and lower-latency configurations, disk access needed to be segregated, increasing complexity and cost and many times recreating the very situation that choosing IP over FC was meant to avoid.

So, be careful when making this choice and thoroughly evaluate the real needs and implementation scenario of your application!

While many customers ignored this, some have experienced long periods of degraded performance, even to levels that endangered the desired “near-real-time” behavior of newly implemented business scenarios. Why? In my experience, in some of the cases I was asked to consult on, the providers that had stepped forward with those "cheaper solutions" didn't want to admit their error, and prolonged the agony by searching for a non-existent solution.

Here I have to recall the well-known phrase attributed to Benjamin Franklin: "the bitterness of poor quality remains long after the sweetness of low price is forgotten".

Understanding human nature and the diversity of people in organizations has taught us that saying “it is possible, but… recommendations… best practices…” ends up with people only hearing the “it is possible” and ignoring all the “recommendations… best practices…”.

Conclusion


"File over IP" has limitations on itself at a design level. Bringing it to large scale, highly transactional, real-time enterprise wide implementations, will imply risks that I would not consider acceptable for organizations that depend on SAP HANA systems for their “day-to-day money making”.

"Block over FC", both from a design, but also from an operations experience is a more reliable, robust, performing solutions and is definitely my recommendation for the larger scale SAP HANA implementations.

In fact, the discussion of "File over IP" vs "Block over FC" has the same arguments for both sides in a SAP HANA scenario, as it had for other mission critical applications in the datacenter. And this is why most enterprise customers use "Block over FC" as the preferred architecture for their most "A Grade SLA" applications.

Nevertheless, both have their place in the market, and aspects like the existing knowledge and operations discipline within an organization may completely change the overall TCO study in favor of either solution.

Know more about EMC Solutions for SAP at: https://community.emc.com/community/connect/everything_sap
