2015-12-06

SAP HANA Scale-up vs Scale-out learnings for SoH and S/4



It’s been quite some time since I’ve blogged, but as you all know, changing jobs always requires some adjustment time, and the traction the Virtustream value proposition is seeing in the European market has simply made my agenda overflow! The good thing about it is that my learning has only grown, due to the massive number of customer engagements I’ve had in just a two-month period.

In these interactions, the question of whether the best option when deploying applications based on SAP HANA is to scale up or to scale out keeps popping up much more frequently than I would ever expect. Despite SAP’s recommendations in this regard, and because reality sometimes outpaces the current rules, this becomes a key discussion point when deciding whether or not to move forward with SAP HANA adoption plans.

In this blog post I’m sharing my most recent learnings and experiences on this topic, as well as my reasoning on the pros and cons of each scenario, while supporting my arguments with the currently available information from SAP.

My goal here is not to provide any definitive guidance, but rather to show you my reasoning process and arguments, so that you can have your own discussion in a more informed way, leveraging knowledge of what others are going through.

In the end, each customer scenario is different, and so there is no “one rule that fits all”.


               Setting the Scene

Time goes by, and this topic keeps popping up in many of the customer engagements I participate in. I wrote about this long ago, but the fact that I keep being faced with absurd situations pushed me to write about it more extensively.

And the question is: “For my specific scenario, should I implement SAP HANA in a scale-up or scale-out model”?

In my experience, there are multiple angles to this question, and depending on the angle, the answer may be different.

One of the aspects that disturbs me the most is how many of the "technical architects" I come across ignore the variables in the picture above and, even when we factor in the increased openness SAP HANA is going through and the reasonable expectation that the available options will expand over a 3-year period, still come up with designs that to my mind simply do not make any business sense. After all, IT architecture is not just a "techies" discipline! Being technically grounded, IT architects should be key players in driving business and IT alignment, so making choices and imposing scenarios that completely ignore the business side of IT simply leads to bad solutions.

To get background on SAP HANA scalability, you can read SAP’s scalability whitepaper at: http://scn.sap.com/docs/DOC-60340

I will not repeat it here, so if you are new to the HANA scalability topic, I recommend you read that whitepaper before reading this blog post.

While I’ll focus mostly on the technical aspects of this discussion, two things are always in the back of my head: on one side, the fact that IT’s primary goal is to serve and support business needs; on the other, facts like the dramatic increase in server RAM capacity over just the last two years, with the evolution of Intel CPUs from Westmere to Ivy Bridge and now Haswell, which enables organizations to dramatically simplify their architectures thanks to the increased capacity these latest generations of technology provide.

To get started, let’s remember the current SAP recommendations regarding SAP HANA scale-up vs scale-out:
  • Scale-up first as much as you can before considering scale-out;
  • Scaling out is only generally available for analytic workloads (BW on HANA or DataMart scenarios);
  • For transactional workloads, like SAP Business Suite on HANA (ERP, CRM, SRM, SCM, etc), today only scale-up is generally available.

And we also all have heard SAP’s vision for the future of the enterprise, with a single version of truth, a single HANA database hosting all business applications, enabling real-time reporting on operational data.

So, why am I coming back with this question, isn’t it clear already?


               Cost and Reality Check disrupt current rules on SAP HANA scalability

Well, there are 2 variables that disrupt all the above stated:
  • Cost
  • Reality
Cost, because customers won’t buy something just because someone tells them they should. I’ve written a lot about that as well, and don’t want to repeat myself here, but IT exists to serve business purposes, so the cost of IT cannot be higher than the value gained by the services IT provides, and this is the simple reason many organizations live with “good enough” solutions.

How many times, whether you are a buyer or a seller in IT, have you seen customers complaining that systems don’t perform, sellers trying to sell more gear to solve that, and those projects never getting budget? In many of those situations, as IT stakeholders went up the chain to the CIO level, or even the CFO, LoB or CEO levels, to get funding for those projects, what you hear from them is: "we can live with what we have, and cannot afford more costs there".

And this brings sellers the challenge of building a TCO business case to show how the benefits sold will pay off the investment and provide savings in the long run, which in many cases is not easy.

So, balancing costs with business benefits is a fundamental aspect of any IT adoption decision, which in turn drives a search for the best balance between architecture choices and "good enough" business results that enables the lowest possible cost for each specific set of business requirements.

And Reality, because the reality I’m seeing at customers is not compatible with the simple rules communicated by SAP that I stated above.

On the "reality check" part, let me just give you two examples from real customer cases I’ve been involved in over just the last 2 months.

On one case, we were talking about an analytics scenario, with native HANA applications, so no ABAP there. The initial sizing was 12TB, with an expected growth over 3 years to 40 TB of in-memory data, and 800 TB of disk based data.

We need to remember that today the biggest box available for SAP HANA is the SGI UV300 server, which can hold up to 24 TB of RAM (with 32 GB DIMMs). It is also important to note that the "disk-based data" in this scenario would be SAP HANA Dynamic Tiering, which today only allows single-server implementations of the Extended Store, limiting its capacity to about 100 TB.
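
To make the gap concrete, here is a quick back-of-the-envelope sketch (my own, numbers-only illustration reusing the figures above; the ~100 TB Dynamic Tiering figure is the rough single-host limit I mentioned, not an official specification):

    import math

    # Figures quoted above for this customer scenario
    required_in_memory_tb = 40     # 3-year in-memory growth estimate
    required_disk_tb = 800         # disk-based ("warm") data estimate

    biggest_single_box_tb = 24     # SGI UV300 with 32 GB DIMMs
    dt_single_host_limit_tb = 100  # approximate single-host Dynamic Tiering limit

    nodes_needed = math.ceil(required_in_memory_tb / biggest_single_box_tb)
    print(f"In-memory: {required_in_memory_tb} TB needed vs {biggest_single_box_tb} TB "
          f"per box -> at least {nodes_needed} nodes of the biggest box available")
    print(f"Extended store: {required_disk_tb} TB needed vs ~{dt_single_host_limit_tb} TB "
          f"single-host limit -> {required_disk_tb / dt_single_host_limit_tb:.0f}x over it")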

So, we are talking about a system that breaks the barriers of "supported". And I say "supported" because in many cases the limitations are not a matter of technology, but rather of existing experience, knowledge or confidence in such extreme scenarios.

We all know that SAP is a conservative organization by nature, and when they state something as "generally available" it means it has been extensively tested, is well documented, and basically any customer should be able to do it with no major surprises. The other side of this is when we face such an accelerated pace of change, where customer scenarios keep pushing the boundaries of "generally available"; that’s where "controlled availability" or "restricted availability" come in. For these extreme cases, if you limit yourself to the options that are "generally available", you may just be killing your organization’s adoption plans. To which I would say: it’s better to push the boundaries a bit than to just give up.

So, on many of these extreme scenarios, scaling-up is not an option, and you must consider scale-out “by design” and from the start.

The other case was a customer evaluating the migration of an SAP ERP with IS-U to HANA. The sizing pointed to a need of 25 to 35 TB of in-memory capacity at go-live (I’m still working on qualifying this further to understand whether the sizing assumptions were correct, as it would not be the first time I’ve seen "gross mistakes" in this area).

So, here as well, we are outside the limits of what is currently possible.

We might discuss whether SGI would be able to load their 32-socket servers with 64 GB DIMMs and then scale to 48 TB of in-memory capacity.
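
Where does the 48 TB figure come from? Just making the DIMM arithmetic explicit, reusing the numbers already quoted in this post:

    # 24 TB with 32 GB DIMMs implies the number of DIMM slots;
    # doubling the DIMM size doubles the capacity.
    dimm_slots = (24 * 1024) // 32            # 768 slots across the 32 sockets
    capacity_with_64gb_tb = dimm_slots * 64 / 1024
    print(f"{dimm_slots} DIMM slots -> {capacity_with_64gb_tb:.0f} TB with 64 GB DIMMs")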

Even being a fan of SGI technology, and putting myself in the shoes of a customer (which I once was): once you go beyond 16 sockets you are limited to a single vendor, as no one else can provide boxes that big, and if you are starting at these sizes, you can only expect to continuously outgrow the limits of the technology. So I would defend a "multi-vendor limit": at 16 sockets, for example, you have multiple vendors (like SGI and Bull), which gives you bargaining power since you are not locked in to a single vendor, and as a consequence prices become more reasonable.

Also, defining this limit enables "in-box" growth paths: if SAP confirms that you have a memory bottleneck but your CPU utilization is rather low, you may be allowed to grow memory only, breaking the current memory-to-CPU ratio rules (note that this only applies to growth scenarios, not to initial sizing – for more information read, for example, SAP Note 1903576 and its attachment).

So, why not start with scale-out by design?

Oh, yes. For the first use case you could, as it’s an analytics use case; but for this one you can’t, as SAP HANA scale-out is not generally available for transactional use cases.


               Scaling out transactional workloads (Suite on HANA) is possible!!!

But did you know that SAP HANA scale-out is released under "controlled availability" for transactional use cases? And that there is an installation guide for "SAP Business Suite powered by SAP HANA Scale-out" attached to SAP Note 1781986? (Check out my list of relevant SAP notes on this topic at: http://sapinfrastructureintegration.blogspot.com/p/important-sap-notes.html)

So, scaling out transactional workloads on HANA is not a discussion of technical possibilities.

It is possible, there are already a few customers doing it and there is SAP documentation on it.

Being in controlled availability means it is not yet a solution so widespread that anyone can do it without supervision or support; it means that SAP must approve your entry into the program in order for you to have support, which allows them to validate that your scenario and arguments make sense and, at the same time, to commit to supporting you and enabling you to operate the solution.

And Suite on HANA has been in controlled availability for a long time, as you can see from the "very old" SAP slide above (anything older than 1 year in the HANA world is already too old! :) ).


               Some of the proposed solutions I’ve seen make no sense…

Before going further on my perspective on scale-out, let me tell you the “indication” that was given to both these customers, with which I struggle to agree.

For the analytics use case, the customer was told to break their "analysis groups" into independent "universes", load each group into a different HANA system, and so have multiple smaller scale-up single nodes instead of a single large scale-out cluster.

For the transactional use case, a similar solution was suggested: break your business units into groups and put each one in a separate ERP, each running on a smaller single scale-up HANA system.

Heck, what the hell?!?! So, what about all that talk of a single system, a single source of truth, that SAP has been giving us over the last couple of years?!?! Am I missing something here?

In both these scenarios, the implication is a proliferation of smaller systems and the building of an "integration system". For example, in the transactional use case we would go back to the scenario of one ERP for each company and then a consolidation ERP for the holding. Wasn’t this one of the things SAP was promising to end with the adoption of SAP HANA?

Which one is better or worse? To integrate the data at the database level through a scale-out architecture, or to integrate it at a functional level by creating interfaces across the multiple instances?

I see it from the following perspective: once data exceeds the capacity of a single server you’ll need to distribute it across multiple servers anyway. If you divide the universe of data across disparate independent systems, you’ll then need to take care of the integration at the functional level. If you are a company managing a portfolio of businesses with dynamic acquisition and divestiture activity, you’ll have a consolidation system for sure, and you will already have built up the functional knowledge to manage integration at that level, so I would understand breaking your massive single instance into more, smaller independent systems.

But if you really need the integrated view across all those universes of data, I would say it is easier to manage a scale-out cluster (be it for analytic or transactional workloads) than to break the data across smaller single systems and then keep living with the "data latency" problem of ETL and delayed reporting – the very problem Hasso Plattner talked so much about as one that having all data in HANA would solve.

In the end, when dealing with very large systems, complexity will exist. So as a CIO or senior leader in an IT organization you need to evaluate where you'd rather have that complexity:
  • At the functional level, and have your development and functional teams deal with the integration of data from multiple systems, with interface development, aggregation systems, etc... well, all the things you've done over the last 2 decades with SAP;
  • At the technical level, and have your infrastructure architects and HANA DBAs deal with it, through expensive SQL statement analysis and infrastructure optimization.
If your business scenario indicates that you should have a single very large instance, then whether you integrate at the functional level or the technical level will depend on:
  • If you integrate at the functional level, whether you can live with the "reporting latency" introduced by the need to move data around and aggregate it, and whether you trust your functional and development teams' capacity to manage interfaces and data integration more;
  • If you integrate at the technical level, whether you trust your DBA and infrastructure teams' ability to properly design, architect, build and operate such an environment.
Considering that I'm now working at Virtustream, and that we as a service provider are able to take care of all the complexity involved at the infrastructure and HANA DBA levels, I would say that if the reason for you to consider functional integration is a lack of capacity or skills in your platform teams, you should consider a cloud approach, where Virtustream would provide you with an SLA at the SAP Basis (platform) level.

But let me continue this blog post ignoring that I'm working for a cloud provider specialized in SAP HANA, and continue the reasoning as if I were involved in deciding on on-premise implementations.


               Will the HANA reality be of many small systems??? In some cases, yes.

Being a bit of a devil’s advocate here, right? ;-)

But we need to make up our minds: either we believe in the benefits of a "single pond" where we build a single source of truth, or we accept that there are still reasons for customers to have multiple systems, as happened in the past. To my mind, the message in the PowerPoints is not matching the reality I’m seeing in the field.

As a sort of disclosure: I believe there are plenty of reasons for organizations to maintain separate databases for each of their business systems, and I’ve written about how VMware-based virtualization is a great match here, as it brings massive operational simplification and efficiencies. The example I gave in a previous blog post was a corporation that manages a portfolio of businesses and keeps buying and selling companies, where keeping separate, independent ERP systems for each business is crucial to simplifying a divestiture when the time comes. We all know that splitting the data within a single SAP ERP is the key pain in separating out a business group (I've been involved in such projects more than once, and it was not fast, cheap or simple at all).

But I also understand that there are special cases where having a single database for the whole business brings massive benefits, and we need to tackle both scenarios with an open mind and no bias toward one or the other, in order to truly make the best choice for each customer organization. For example, companies with vertical or horizontal integration, where the different businesses are interrelated and can potentially cross-sell to joint customers, will certainly benefit from an integrated, real-time global view across businesses, which points toward a single global instance for all business units.

So, at one customer, depending on their particular scenario, you’ll see me advising them to keep all their smaller systems and run them all on VMware, while at another, with a different business scenario, you may see me advising a single global instance with all data in a single database.

In the end, the "best solution" really depends on each company’s business scenario.
 


               Data Temperatures with Dynamic Tiering and SAP HANA Vora

I have to say that I’m ignoring all the topics around data temperatures here. They would take us in a completely different direction, so I’ll choose to leave their implications out, although I must say it is a very relevant angle on the problem I’m describing.

In fact, having an approach to "data lifecycle management" from the start, and incorporating a discussion with your businesses on the "business value of data" right from the early stages of your project evaluation, may lead to a massive reduction of your required in-memory footprint, and a consequent reduction in the overall cost of the solution. I've approached this topic, for example, in my blog post "The right Architecture for S/4 HANA and IoT".

I have to note, though, that the "split between current data and historical data" in SAP ERP on HANA is still not a reality, and the "data temperature" discussion for transactional systems is today still much more limited in options than for analytic use cases like SAP BW on HANA.


               Key barrier to scaling out transactional workloads on HANA is operations experience

But enough with problem analysis and considerations! Let me then share a bit of my perspective regarding scaling out "transactional workloads" on SAP HANA.

I believe the key reason to avoid scaling out for transactional scenarios is the lack of knowledge in the market on how to implement and manage these systems.

The technology is there, it is available, and it works.

But if the people implementing and operating these systems don’t understand the specifics of scaling out transactional workloads, you may easily end up with an application that, because of the performance impact of a bad scale-out architecture, implementation and operation, ends up lagging a long way behind expectations performance-wise.

Note that I haven’t said "won’t work". I said "lagging behind expectations".

And I’m making this note because I believe it’s all about expectation management.


               One customer example of proper expectation management with SAP HANA adoption

Let me take the example in the picture below to make the case.

In this example a customer was moving BPC from Oracle to HANA. BTW, this customer was EMC Corporation (read the full story at: http://www.emc.com/collateral/analyst-reports/emc-it-breaks-grounds-sap-hana.pdf ). I was an EMC employee when this happened, and was privileged to be in close contact with the EMC IT SAP architecture and Basis guys driving it when the project went live.

Once they moved BPC "as is" to HANA, one of their critical financial processes immediately went from 53 minutes to 7 minutes. Then, after they applied EHP1 for BPC, which brings the HANA-optimized code for that process, it went down to 17 seconds. EMC IT also tested the impact of virtualization, running the process both on physical and on virtual servers, and verified that running it on physical would take about 1.5 seconds less.

So, when talking with the business users, their expectation was simply to run that process faster. When they saw 7 minutes they were very happy. When they saw 17 seconds, they were exhilarated! And when asked whether 1.5 seconds would make any difference to them, and confronted with the difference in cost between the two scenarios, virtualization being so much cheaper became a no-brainer.

But when talking with the technical teams, their concern was all about the difference between running the process on physical vs virtual, which for the business is simply an irrelevant discussion.

I’m telling this story because it happened in November 2013, way before SAP supported SAP HANA in production on VMware. And I was privileged to closely observe this story as it developed.

I wrote a lot about the benefits of virtualization then, and about how fundamental proper expectation management was. I believe we are facing a similar "expectation management" discussion regarding scaling out transactional workloads on SAP HANA.


               Managing SoH scale-out is not more complex than managing Oracle DBs

So, from my perspective, managing a SAP HANA scale-out cluster is in no way more complex than managing an Oracle database.

It’s just a matter of understanding the conditions and tools.

Starting with the Oracle example: I remember a colleague who came from a Microsoft background managing SQL Server and started on a project where all systems were Oracle.

Making him understand that he needed to do expensive SQL statement analysis, work with development teams to build new indexes, determine whether a table or index would benefit from a dedicated tablespace, define whether a certain tablespace would benefit from a dedicated LUN, know when it was time to do a table reorganization, or how to create the rollback segments so that processes wouldn’t break… you should have seen his face… He was so lost and frustrated!!!

Now imagine that he was the SAP Basis admin at an organization that had just implemented SAP ERP, the system was growing very fast, and he only had his SQL Server background. The result would have been a disaster, because he didn’t understand the need for all these optimization processes (for example, with SAP ERP on SQL Server you don’t determine table placement at the disk level; SQL Server automatically stripes all data across all available data files), nor did he know where to look for problems – and even if he had known where to look, he wouldn’t have known how to interpret the numbers or what actions to take to solve them. Having taken the SAP ADM315, ADM505 and ADM506 training courses would have helped him a lot.

From my perspective, the discussion on scaling out transactional workloads on SAP HANA is of a similar nature to the story I just told of the SQL Server DBA moving to an Oracle landscape.

Similar in nature, but much simpler! I believe managing SAP HANA in a scale-out cluster is not nearly as complex as managing an Oracle 8 database was!!! Remember the "PCTFREE" and "PCTUSED" parameters in Oracle??? Those were tough days compared with SAP HANA system administration. When I remember the project I worked on in 1998, an R/3 system with a 500 GB database and 1,200 concurrent users… getting the system running smoothly was no easy job! (Those doing Basis back then know how huge such a system was at the time...)


               Understanding latency implications on SAP HANA performance

So, diving into SAP HANA scale-out challenges for transactional workloads.

First of all, it is largely an "expectation management" problem. And this is due to the effect of network latency on the processing of data in the cluster.

When you look at the picture above, and consider that SAP HANA is optimized to process data in RAM, you see that the latency for data transfer between RAM and the CPU on the same NUMA node is about 60 nanoseconds, while if you transmit data across the network between two servers you’ll always be looking at latencies in the microsecond to millisecond range.
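
Just to put those orders of magnitude side by side, here is a tiny sketch (the network figures are illustrative assumptions on my part, not measurements):

    # Rough orders of magnitude: local RAM access vs. a network hop between nodes.
    ram_access_ns = 60               # local NUMA-node RAM access, as quoted above
    fast_network_hop_ns = 10_000     # ~10 microseconds on a very good interconnect (assumption)
    slow_network_hop_ns = 1_000_000  # ~1 millisecond on a less ideal network (assumption)

    print(f"Good interconnect: ~{fast_network_hop_ns / ram_access_ns:.0f}x slower than local RAM")
    print(f"1 ms network hop:  ~{slow_network_hop_ns / ram_access_ns:.0f}x slower than local RAM")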

This means that moving data between nodes across the network is a very "expensive" operation for SAP HANA performance-wise when compared with processing data in the RAM of the same server. SAP HANA has a shared-nothing cluster architecture, which means each piece of data exists in only one cluster node (I’ve also written about this a long time ago, so please read that post for some history and context), and in a scale-out cluster the data is distributed across multiple nodes.

The challenge in a transactional system is that, as you have a lot of joins between tables across many different business areas, and "business objects" like invoices, materials and others are stored across many different tables, there is a very high probability that a large number of processes will require cross-node communication.

The consequence is that when reporting against data that sits in tables on different server nodes, you’ll need to move data over the network to gather it all together and calculate the results. And that "costs" a lot more time than if all the tables were in the RAM of a single server. Does this mean those operations will be slower than on "anyDB"? Well, the reality may be similar to the scenario I described when EMC decided to virtualize BPC on HANA.

The technical guys may be very concerned with how many more milliseconds it will take on scale-out vs scale-up, while the business guys will be thrilled with the improvements in business visibility and the reduced lead time to consolidated information, and might consider reports taking a bit longer to generate a small price to pay, especially in the context of their starting point on legacy ERP on anyDB.

The same problem happens when you write data: if you write an object that is stored in multiple tables, and those tables are distributed across different server nodes, the write has to happen on each of those nodes, and the commit is only given once all nodes confirm that the data has been written, so writing may become slower.
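
A minimal way of modeling this, as a sketch rather than a real cost model (all figures below are made-up placeholders): the commit pays at least one network round trip whenever more than one node is touched, no matter how small the write is.

    def commit_time_ms(local_write_ms: float, nodes_touched: int, round_trip_ms: float) -> float:
        # Local writes happen in parallel on each node, but the commit has to wait
        # for every node to acknowledge, so a distributed commit pays the round trip.
        return local_write_ms if nodes_touched <= 1 else local_write_ms + round_trip_ms

    print(commit_time_ms(0.5, 1, 0.2))  # single-node write: 0.5 ms
    print(commit_time_ms(0.5, 4, 0.2))  # same write spread over 4 nodes: 0.7 ms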

Hence my point that this is no different from managing an Oracle database.

As in Oracle (in fact as in any database), once you start dealing with very large systems, you’ll need to look at parameters and variables that for smaller scenarios you just ignore, as the standard/default configurations are good enough for you.

When faced with bigger systems, once you learn where to look and how to manage this additional level of detail, then it becomes routine and a no-brainer.

The same must happen with HANA. But the knowledge in the market is just not there yet.

SAP should develop and make available a course similar to the old ADM315 (or any course for SAP Basis people focused on performance analysis and optimization) but aimed at HANA, covering the monitoring tools available today to evaluate performance in a scale-out cluster, as well as the tools and mechanisms to address potential performance problems.

There is already some training in this area, but most of it is fully focused on developers, so it would be useful to have something really aimed at SAP Basis people.

For example, today there is the possibility of replicating (mirroring) tables across HANA nodes. Why is this relevant? Imagine that you have a very large table partitioned across multiple nodes, on which you run a report that involves a join with a configuration table that is just 100 rows long. By mirroring that small table across the nodes you avoid cross-node joins: all joins are done locally on each node, so you truly take advantage of the massively parallel scale-out architecture of SAP HANA.
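
As a rough illustration of what that could look like in practice, here is a sketch using SAP's hdbcli Python driver. The table names, host names and credentials are hypothetical, and the replica DDL in the comments is indicative only, as the exact syntax depends on your HANA revision (validate it against the SAP HANA Administration Guide before using it):

    # Hedged sketch: inspect table placement, then (commented out) add a replica of a
    # small configuration table so joins against a large partitioned table stay node-local.
    from hdbcli import dbapi  # SAP HANA Python client

    conn = dbapi.connect(address="hana-master.example.local", port=30015,
                         user="SYSTEM", password="***")  # placeholder connection details
    cur = conn.cursor()

    # Where do the big partitioned table and the small config table live today?
    cur.execute("""
        SELECT TABLE_NAME, HOST, PART_ID, ROUND(MEMORY_SIZE_IN_TOTAL / 1024 / 1024) AS SIZE_MB
        FROM SYS.M_CS_TABLES
        WHERE TABLE_NAME IN ('BIG_FACT_TABLE', 'SMALL_CONFIG_TABLE')
    """)
    for row in cur.fetchall():
        print(row)

    # Indicative only - the replica DDL varies by HANA revision:
    # cur.execute("ALTER TABLE SMALL_CONFIG_TABLE ADD ASYNCHRONOUS REPLICA AT 'hana-node2:30015'")
    # cur.execute("ALTER TABLE SMALL_CONFIG_TABLE ENABLE ASYNCHRONOUS REPLICA")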

Also, the tools in HANA (like the Plan Visualizer) for analyzing where time is spent when a specific statement is executed are mind-blowing compared with what I knew in NetWeaver, and provide a level of detail that enables you to clearly identify whether a delay is caused by network, disk or any other reason.
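
For a much more basic but scriptable complement to the graphical tools, one can at least check from SQL whether a statement's estimated plan spans several hosts. Again a sketch, with hypothetical table names and the same hdbcli assumptions as above:

    # Hedged sketch: use EXPLAIN PLAN to see which hosts a statement's operators are placed on.
    from hdbcli import dbapi

    conn = dbapi.connect(address="hana-master.example.local", port=30015,
                         user="SYSTEM", password="***")  # placeholder connection details
    cur = conn.cursor()

    # Store the estimated plan under a statement name for later inspection.
    cur.execute("EXPLAIN PLAN SET STATEMENT_NAME = 'xnode_check' FOR "
                "SELECT c.SETTING, SUM(f.AMOUNT) "
                "FROM BIG_FACT_TABLE f JOIN SMALL_CONFIG_TABLE c ON c.ID = f.CONFIG_ID "
                "GROUP BY c.SETTING")

    cur.execute("SELECT OPERATOR_NAME, HOST, PORT "
                "FROM SYS.EXPLAIN_PLAN_TABLE WHERE STATEMENT_NAME = 'xnode_check'")
    hosts = set()
    for operator_name, host, port in cur.fetchall():
        if host:
            hosts.add(host)
        print(operator_name, host, port)
    print("Plan spans multiple hosts" if len(hosts) > 1 else "Plan stays on a single host")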

It is my belief, then, that the key barrier to further scale-out adoption is the lack of knowledge, with all the risk implications that has for business-critical applications.

And from my perspective, although I accept that scaling up is the short-term solution to avoid the risk, considering the growing number of customer cases I’m seeing with very large datasets needing massive amounts of memory, and considering that most data-temperature options are not yet ready for transactional workloads (Dynamic Tiering, NLS and Vora do not solve this problem today), documenting and spreading knowledge on SAP HANA scale-out performance analysis and optimization would be the way to go for these customers (at least for the customer examples I mentioned here).


               Ok, so when to scale up and when to scale out?

I couldn’t finish this blog post without adding some comments on the variables I identified at the beginning.

That is: when is it a good option to scale up, and when to scale out?

So, let's look at what many say about scale-out:
  • In some cases (with certain server vendors), from a CAPEX perspective two 4-socket boxes are a lot less expensive than one 8-socket box (as an example);
  • If you have to do HA, looking for example at an 18 TB system, with scale-out you would have 6+6+6+(6 for HA), instead of 2 x 24 TB with scale-up (see the quick calculation after this list);
  • If you have a very dynamic change environment, where you buy and sell companies, launch and divest business lines, having a standard building block in the datacenter enables you to easily reallocate these smaller boxes, while in the big box scenario, it will always be there, as you can’t even slice it with VMware.
  • Looking also from a service provider perspective (and the same goes for very large customers with 50+ productive SAP systems), scale-out brings a lower cost of vacancy, as it’s easier to play around with multiple smaller boxes.
  • You could argue that scale-out drives higher OPEX:
    • More OS images to manage and patch:
      • If you implement automation tools, managing one or many should be the same effort, and these 50+ system customers will always be managing a ton of OS images anyway, so 5 OS images more or less make little difference…
    • Need to do continuous table distribution analysis and redistribution:
      • In the Oracle world everyone considers it a normal activity to look at table partitioning optimization in BW and do "hot-spot analysis", reorganizing tables across tablespaces or putting tables in dedicated tablespaces. Table redistribution in HANA looks exactly like that. If you make it part of your core knowledge, it becomes a no-brainer.
  • Future S/4 will be based on multi-tenant database containers (MDC). So we are consolidating multiple applications on the same SAP HANA system only to then break it up into multiple tenants? Why would we then put them all in a single box? Rather split them across scale-out nodes. The only caveat here is the network connection between the nodes: if the tenants have a lot of cross-tenant communication, it would definitely be faster on a single scale-up node.
  • Another topic that is gaining increased traction is "data aging" or "data tiering", meaning that most likely your in-memory data footprint will shrink in the future. So, you would be buying a ton of iron to solve a problem whose impact will be reduced over the next years. I would rather scale out and then repurpose. Add to this the fact that many of the large scale-out sizing exercises already account for 3 years of growth… so who knows what will happen in 3 years.
  • Imagine as well that you have a fully virtualized datacenter and you need a 9 TB system. Why not have a 3 x 3 TB scale-out system on VMware? I know this is not available today, but I believe we’ll get there at some point. Or that you have a 4 TB system that has outgrown the vSphere 6 maximums? Do you move from virtual to physical and buy a huge box, or just scale out?
  • Imagine customers that have massive growth rates on their SAP systems (I know one with a 90 TB ERP…). Continuing to scale up will at some point lead you to a wall. Scale-out keeps your scaling options open, while enabling you to invest as you grow.
  • Also from a bargaining perspective, if you buy many smaller servers, your bargaining power as a customer is higher, and you avoid vendor lock-in…
All of these are examples of arguments for scale-out that I've heard, and I have to say, some of them are very strong.
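
To put some numbers behind the HA bullet above (the quick calculation I promised), using the 18 TB example and 6 TB building blocks:

    # HA overhead comparison for 18 TB of usable capacity, per the bullet above.
    usable_tb = 18
    scale_out_total = 18 + 6     # three 6 TB worker nodes plus one 6 TB standby node
    scale_up_total = 2 * 24      # two 24 TB boxes (primary plus HA), as in the bullet

    print(f"Scale-out: buy {scale_out_total} TB for {usable_tb} TB usable "
          f"({100 * (scale_out_total - usable_tb) / usable_tb:.0f}% HA overhead)")
    print(f"Scale-up:  buy {scale_up_total} TB for {usable_tb} TB usable "
          f"({100 * (scale_up_total - usable_tb) / usable_tb:.0f}% HA overhead)")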

But let’s look at scaling-up:
  • A strong argument is definitely consolidating multiple AnyDB systems on a single SAP HANA system. But again, this will be based on MDC, so does it really make sense to put all of these in a single box?
  • Another strong argument is that there are always developers on the teams who produce bad code that isn't properly checked for quality, especially when in "RUN mode", doing maintenance on existing systems under business pressure to do it faster and cheaper. In a single scale-up system this problem would not be as evident as in a scale-out one.
  • You may also say that it's easier to buy more iron than to upgrade an SAP system. All the things regarding MDC and data tiering will still take some years to become real and mature, and you need to make a decision today. In that case a scale-up would be a good fit.
  • Also, if your system is expected to be rather stable and you do not foresee a very dynamic change environment affecting the SAP systems, you may argue that in the long run a scale-up system may bring OPEX savings that compensate for the larger CAPEX.
You would say that the arguments above are highly biased towards scale-out, right?

There are 2 recent experiences I've had, which I believe add another perspective to this discussion:
  • There are server vendors with modular architectures (for example Bull and SGI) that enable you to scale-up as you go, and rearrange the building blocks as you need.
    • For example Bull servers are composed of 2 socket modules that are aggregated together. So a 16 socket server is the aggregation of 8 x 2 socket modules.
    • In the example of SGI, the servers grow in 4 socket modules, and then a 16 socket server is the aggregation of 4 x 4 socket modules.
    • With both you can add modules to grow, or reuse the modules by breaking down the server if you no longer need such a large box.
  • I’ve also learned that “if you negotiate properly”, the costs of these servers can be equivalent (meaning not massively more expensive, and in some cases maybe even cheaper!!!) to similar capacity from other vendors in smaller servers.
  • Then you have service providers like Virtustream, which upon growth can reallocate compute capacity between tenants and so provide a risk-free evolution path for these very large scenarios. Meaning you don’t need to figure all this out by yourself: Virtustream will do this analysis with you and provide the right solution in a purely OPEX model while taking care of SAP HANA system administration for you, eliminating all this complexity and risk. This would enable you to simply choose between scale-up and scale-out based on your business requirements, and not on the architecture and systems administration constraints they would entail.


               Conclusions: final questions and… what about Cloud?

So, today I would say that my personal preference is to scale up for all sizes up to 16 sockets (as there are multiple server vendor alternatives in the market), and to scale out beyond that.

Why do I say this? It’s not as simple as stated above, as I would need to factor in many considerations.

Questions I would ask in order to provide better advice to customers would be:
  • What is your growth forecast for the next 3 years? After 3 years the technology is obsolete, and since SAP HANA runs on x86 servers, you just need to do a homogeneous system copy, which in a TDI scenario – with external storage – may mean just attaching the storage to a new server, with minimal downtime and risk.
  • Do you have any extreme performance requirements? The number of transactions per minute may indicate how much of a performance problem a scale-out setup could become. There are a lot of customers that have a lot of data but not such an extreme volume of transactions, which means SAP may allow them to go beyond the current CPU/memory ratios and increase memory only – always provided it is confirmed that CPU utilization is really low.
  • Do you need high availability, and what are your SLAs? In a scale-out scenario you need less standby capacity than in a scale-up scenario. So, if the failover time of a node with SAP HANA Host Auto-Failover in a scale-out scenario with external storage is acceptable for the customer, it may enable them to save some money. This may change once SAP enables SAP HANA System Replication with the standby node active for reporting, but we can't always postpone our decisions based on futurology, and that option is not available today.
  • How resilient does your data need to be? This goes to RPO in a disaster scenario. Many organizations put a DR setup in place because they simply cannot afford to operate without their SAP systems. But facts are showing that with today’s datacenter standards it is almost impossible for a Tier 4 datacenter to go out of service. This leads many customers to take a disaster-avoidance approach and build high availability across different buildings of a Tier 4 datacenter provider, instead of having asynchronous replication to a remote location. I have to say this varies a lot by region and by industry. For example, in regions more subject to extreme natural disasters, remote data replication is more likely to be required, while in Western Europe it is increasingly common for organizations to assume a disaster-avoidance scenario and just do synchronous replication across metro distances. This will have implications for the mechanisms to put in place and the associated costs.
  • And as a final aspect, I’d look at what operations automation mechanisms have to be put in place: for example, some customers are making more frequent data refreshes of QA systems a standard practice. Doing it on, say, a weekly basis implies a high degree of automation and has implications for the infrastructure architecture. It is a lot simpler and faster to automate this for a single scale-up node than for a scale-out cluster.
And the conclusion from this journey through my thoughts and learnings on this topic is obvious: the right choice depends!

As always, I hope this helps you make up your mind about what is right for your organization, and makes you aware of the possibilities.

Anyway, remember that today you can avoid all this complexity and risk by putting your SAP systems in an "Enterprise Class Cloud", leveraging the unique offering of companies like Virtustream, which have pushed security and compliance to a level that truly makes it safe for organizations to run their mission-critical systems on. That also opens up the possibility of working with architects like me to evaluate your business scenario and assist you in making the right decisions for your business.

As a final word, I have to say I feel truly privileged and overwhelmed by the amount of talent and innovation at Virtustream, and by how it is leading the emergence of a new "cloud generation", which I would call "Enterprise Class Cloud" or "Cloud 2.0"!

Stay tuned, as I’ll be writing very soon about what I’ve learned so far about Virtustream, what sets its cloud offering apart, and the ways it breaks down barriers to SAP HANA adoption and overcomes long-standing concerns of organizations faced with moving mission-critical enterprise applications like SAP Business Suite to the cloud.