Thursday, February 13, 2014

Looking at Symmetry & Chirality in Enterprise Software

Symmetry (balance) and chirality ("handedness") are seen throughout the physical constructs around us - in biology, chemistry, physics, math, art, music, architecture ... but what are their implications and effects within the sphere of software and performance engineering? Symmetry seems to be favored by nature, and it's a simplicity of biological engineering that belies its complexity (a nautilus shell is a great example). Yet true symmetry eludes us in our engineering efforts and in design. We say that simple, elegant designs are the most difficult to create (and I agree), and so too, true symmetry is a harsh mistress to tame.


Think about it this way: systems are built with some intended symmetries, in the form of multipliers for capacity, fault tolerance, or both. Many times these lead to somewhat interesting geometric patterns like this:


Compare that to patterns like these, found in nature:

[Images: www.scientificamerican.com, blsciblogs.baruch.cuny.edu, jwilson.coe.uga.edu, www.fantasticforwards.com - attributed, but used without permission]


But the dirty secret is this: in nature, what appears to be symmetrical almost never truly is. While it may be well balanced, it is imperfect due to all the slight variations and complexities of the chemical and biological variables that natural systems are subject to ... but it is beautiful, and generally relatively simple at its core. So for the sake of argument, let's say that these natural systems are truly symmetrical, and that this is the thing we humans are trying to mimic in our physical designs.

[Image: ssrsbstaff.ednet.ns.ca]

In man-made software systems, balance and symmetry are pursued through configuration and heuristic means - such as the load balancing schemes selectable in load balancing appliances and in the web, app, messaging, and database tiers. We sprinkle "load balancing" logic and function all over the application hierarchy in an attempt to balance the system. In reality, the distribution of work is typically achieved with a mesh of pre-configured patterns; whether "round robin," "least busy," "sticky IP/most recently used," or whatever the logical pattern might be, it's an attempt to bring balance to a system that is inherently prone to being out of balance. Hmmm. Why is that? Accidentally engineered chirality, perhaps? Why are we designing systems that are out of symmetry and therefore inherently require "load balancers" as really expensive and complex band-aids? It seems kind of ridiculous if you really think about it. Bad design gave way to a cottage industry of appliances that attempt to fix what we don't understand by acting as heavy-handed traffic cops. What would an alternative look like?
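Before getting to that, here is what those pre-configured patterns boil down to in a rough Python sketch. The server names and connection counts are invented for illustration; real appliances implement these schemes in far more elaborate and configurable forms:

```python
# Minimal sketches of the balancing schemes named above (illustrative only).
import hashlib
import itertools

servers = ["app1", "app2", "app3"]

# Round robin: rotate through the pool regardless of what each node is doing.
_rr = itertools.cycle(servers)
def round_robin(request):
    return next(_rr)

# Least busy: pick the node reporting the fewest active connections.
active_connections = {"app1": 12, "app2": 7, "app3": 31}  # hypothetical counts
def least_busy(request):
    return min(servers, key=lambda s: active_connections[s])

# Sticky IP: hash the client address so the same client lands on the same node.
def sticky_ip(request):
    digest = hashlib.md5(request["client_ip"].encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

Each of these is a static rule imposed from the outside; none of them knows anything about what the work actually costs once it lands on a node.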

Vendors are trying to wrap solutions around these problems - Microsoft with .NET, Oracle with WebLogic, IBM with WebSphere - each trying to create a holistic environment that plays well together and helps contain the variables of symmetry, to try to tame it. But oftentimes the consumers of these products do not buy into the holistic solution, and they deploy "solutions" that are grown, not designed: a .NET web server bolted in front of some proprietary code that communicates with an Apache Tomcat server, interconnected with backend systems via message buses, databases, and mainframes. This is CRAP, but the typical "enterprise" accepts these hand-me-down solutions as reasonable, because most of the time they can get what they perceive to be a decent, or at least predictable, level of performance out of them. I'd argue that it's really an accident - it's a fall in slow motion. This type of mess is typically the result of cost management and constructing systems from the pieces already in the organization, not of designing a solution from the top down.

When we see imbalance in systems, the result is often some aspect of the system as a whole performing poorly. Consider a load balancing scheme with unintended consequences: one application server becomes overloaded because of the way session state is tracked at the load balancer, a system or software fault hits another application server, traffic becomes imbalanced across the cluster, and eventually, with enough inertia, the whole system tips over.
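A toy illustration of that failure mode, with invented numbers: sticky sessions start out evenly pinned, one node faults, and a naive failover dumps that node's entire pinned population onto a single neighbor instead of spreading it out:

```python
# Toy simulation of session-affinity skew after a node fault (names invented).
from collections import Counter

nodes = ["app1", "app2", "app3"]
sessions = {f"user{i}": nodes[i % 3] for i in range(300)}  # evenly pinned

print(Counter(sessions.values()))   # balanced: 100 / 100 / 100

# app2 faults; a naive "next node in the list" failover sends all of its
# pinned sessions to app3, while stickiness keeps them there afterwards.
for user, node in sessions.items():
    if node == "app2":
        sessions[user] = "app3"

print(Counter(sessions.values()))   # skewed: app3 now carries 200 sessions
```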

These types of failures of our imposed attempts at symmetry are very, very common. They are also easily overlooked, because many times these patterns are assumed to work "as intended" or, worse, be "good enough." So we employ more layers of internal instrumentation and external monitoring systems from various vendors to watch traffic through the load balancers and application tiers - to monitor the "symmetry" that we want to enforce. Ugh. Yet another cottage industry of expensive software and hardware to help us contain this thing that we don't understand. Are you seeing a pattern here?

We are constantly addressing the side-effects and not the cause.


How do we design and measure balance, though? I think we are looking at the concept backwards. We look at the effects of balance, and we do it in terms that are easy to quantify - like utilized resources. For instance, if there are two nodes in a cluster and one is running at 80% CPU utilization while the other is running at 20%, we would be concerned that the systems are off balance - we are not spreading the load out evenly, creating a condition where the work sent to one server may well be adversely impacted by the high CPU utilization, while the work sent to the other server would not be. We can easily wrap our heads around that concept and then come up with a construct, a scheme, to keep the imbalance from happening. Seems simple ... easy ... intuitive. It couldn't be more stupid.
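Reduced to code, that "easy to quantify" view of balance is about this deep. The threshold and the utilization figures below are arbitrary, which is exactly the problem:

```python
# The naive resource-utilization view of balance, as described above.
cpu_utilization = {"node_a": 0.80, "node_b": 0.20}  # 80% vs 20%, as in the text

def is_imbalanced(util, max_spread=0.25):
    # Flag the cluster when the utilization spread exceeds an arbitrary threshold.
    return (max(util.values()) - min(util.values())) > max_spread

if is_imbalanced(cpu_utilization):
    print("cluster flagged as off balance -> shift new work to the idle node")
```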

Could developers and administrators be introducing chirality (handedness) into these systems without knowing it? When we enforce balancing schemes and they fail, is it because of a predisposition in the way we think about balance, rather than a real consideration of what symmetry actually is? Or is it due to default settings that seemed innocuous on the developer's workbench but tip the scales under load?

Over the course of my work in performance engineering I have seen the different logical and physical tiers work against each other in this way, many times. F5's BigIP load balancing schemes at the outside layer have to play nicely with the application infrastructure's own loading mechanisms - be they WebSphere, WebLogic, Apache, Oracle RAC, or whatever - and the paths that data travels through these complex systems are determined on the fly by logic and configuration operating outside the control of, and beyond anything considered by, the software designers. This stuff is interconnected with database queries and web-based technologies like session IDs in an attempt at transparency. Hmmm. Odd that we'd do it this way, isn't it?

Systems architects are the lords of this domain - they work with the so-called DevOps folks, application specialists, lead developers, DBAs, vendors, and consultants to create the fabric of load modeling and performance engineering for their systems - on paper. Where this fails is in reality, when failures or misconfigurations or unforeseen circumstances deal a hand that was not expected. What then? The entire enterprise is thrown into reaction mode - triage. What happened? Why? How? What's the "fix!?" Deploy, run, hit a boundary case, isolate and fix, redeploy, run, hit a boundary case, isolate, fix ... insanity. These architects are many times handed the systems equivalent of the criminal rabble of the army and commanded to turn them into an elite force. It's certainly something far less than ideal ... and application performance engineering isn't "Kelly's Heroes."

[more to come on this thread]


Ancillary Thoughts Triggered by This Exposition

What about a Symmetric Performance Pipeline Architecture?
What if an open hardware architecture provided the necessary physical interconnections and the expansion and scalability the software solution requires - a backbone that could eliminate the network-layer overhead and the complexity of load balancers and load balancing throughout the various layers?

The pipeline would balance itself automatically by weighing the KPIs that matter to the system. Various heuristic methods would be employed to understand the load and its impact on the system as a whole, and traffic/load would be moved around the logical organism automatically. Think about how your brain compensates for pain in one foot - you don't have to think very hard about how to compensate for that pain; you avoid it automatically.
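As a sketch of what that might look like, imagine each node reporting a handful of KPIs and the pipeline routing the next unit of work toward the node under the least "pressure." The KPI names, weights, and normalization below are purely hypothetical:

```python
# Hypothetical KPI-weighted routing heuristic for the self-balancing pipeline.
KPI_WEIGHTS = {"cpu": 0.4, "latency_ms": 0.4, "error_rate": 0.2}

node_kpis = {
    "node_a": {"cpu": 0.82, "latency_ms": 240, "error_rate": 0.03},
    "node_b": {"cpu": 0.35, "latency_ms": 90,  "error_rate": 0.00},
}

def pressure(kpis):
    # Normalize each KPI to a rough 0..1 "pressure" and combine by weight.
    normalized = {
        "cpu": kpis["cpu"],
        "latency_ms": min(kpis["latency_ms"] / 500.0, 1.0),
        "error_rate": min(kpis["error_rate"] / 0.05, 1.0),
    }
    return sum(KPI_WEIGHTS[k] * normalized[k] for k in KPI_WEIGHTS)

def route_next_request():
    # Send the next unit of work to the node with the least combined pressure.
    return min(node_kpis, key=lambda n: pressure(node_kpis[n]))

print(route_next_request())   # node_b - the pipeline avoids the "sore foot"
```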

This architecture would provide a unified symmetry throughout the main tiers of the application infrastructure. You will still have external systems, legacy systems, and other considerations, but from the outside world through the firewall and into the pipeline, it's a direct shot into the system.


Under load, the dynamic nature of the load distribution approach automatically favors balance in the system.
What does this look like in practical terms? How do you create such a system and make it adaptable, affordable, and able to play with existing components?

The design goals should address these constraints:

  1. Must have seamless interoperability with off-the-shelf server components: web servers, app servers, middleware servers, databases, etc.
  2. Must provide the interconnectivity and symmetry automatically.
  3. Worst-performing components should be automatically called out for isolation/remediation.
  4. Should be aware of Layers 3-7, understanding the complexities of low-level networking.

What Ifs:

  1. What if the Performance Pipeline could use generic hardware such as 10Gb NICs for its communications MESH, but forego the IP protocol in favor of low-level, low-latency inter-node communications? What would this do to available bandwidth and latency?
  2. What if the MESH translated calls from tier to tier and remapped traffic - regardless of how a node initiates communications? (Web Server 1 explicitly calls App Server 2, but the MESH redirects the traffic according to balance - see the sketch after this list.)
  3. What if the MESH could see and map business transactions, their flow, and provide analytics?
  4. What if the MESH could provide 'capture/replay' functionality for testing and reproducibility?
Content switches, routers, and dynamic caching engines each hold pieces of the puzzle at the core of the Performance Pipeline. The key here is to step into the middle of the operation and direct traffic based on a performance goal.
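Here is a rough sketch of what-if #2: a MESH layer that intercepts an explicit tier-to-tier call and remaps the destination according to current balance. The node names, the pressure scores, and the send() transport are all invented for illustration:

```python
# Hypothetical MESH intercept: the caller names a target, the MESH picks one.
node_pressure = {"app1": 0.85, "app2": 0.90, "app3": 0.30}  # 0 = idle, 1 = saturated

def send(node, payload):
    # Stand-in for whatever low-latency transport the MESH actually uses.
    return f"{node} handled {payload!r}"

def mesh_route(caller, requested_target, payload):
    """Web Server 1 asks for App Server 2, but the MESH may send the call
    to a less-loaded peer in the same tier, transparently to the caller."""
    tier_peers = [n for n in node_pressure if n.startswith("app")]
    actual_target = min(tier_peers, key=node_pressure.get)
    if actual_target != requested_target:
        print(f"{caller}: remapped {requested_target} -> {actual_target}")
    return send(actual_target, payload)

print(mesh_route("web1", "app2", {"order_id": 42}))
```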


Is Open Source any Better?
Now there are many open source, user-contributed software solutions that attempt to address business needs with a very different design and support model than that of the "big corporation." Hat tip to "The Cathedral and the Bazaar," Eric S. Raymond's quintessential essay on the implications of Open Source (http://www.catb.org/~esr/writings/cathedral-bazaar/). While many people are attracted to the business model of Open Source, does Open Source help solve the problem we're discussing, or does it in fact make it worse? For instance, are all these tiers of Open Source software designed in harmony with each other? Do they share a common fabric of performance engineering - do they holistically work together, or are they bolted together, still requiring glue and band-aids to achieve some acceptable level of scalability and performance? (Hint: it's the latter.)

So I don't see salvation in Open Source per se. Could we see something grow organically? Sure we could, and maybe it's already happening.

Open Source is probably the best bet of getting something like the Symmetric Performance Pipeline off the ground.

What about Fluid Dynamics?
What if we could take all of the systems and configurations and perform the equivalent of fluid dynamics modeling on them? We start to do this with performance testing, but performance testing always involves a series of compromises that dictate that you cannot test everything - you cannot create load scenarios that cover 100%, or even 90%, or even 80% of your production loading model; there is not enough time, resources, or money to accomplish that. The old adage holds: you have three options and you get to choose two - time, quality, cost. Which two you choose will dictate the majority of the experience of your end-users.

What if we could find a way to harmonize input into the outside tier of the application architecture that would flow end-to-end, in a dynamic manner? What if we didn't have to "write test scripts?" What if business logic could be reverse engineered on-the-fly, and requests generated and driven dynamically?

What if we "saw" and reacted to systems performance with a perspective like this?


The system should automatically favor balance over configuration.


What about Outliers / Boundary Cases?
If the majority of your end-users are experiencing a reasonable level of system performance and they are generally happy - say 60-80% of your users - what about the others? The remaining 20-40% are not happy because they are not having the same experience as the happy ones, right? Their dissatisfaction is wide and varied, because they are hitting boundary cases caused by that third variable we had to drop from the list - time, quality, or cost. Now you have a bunch of users who are likely each experiencing their own unique boundary case. It is often considered the cost of doing business that these users will never be fully satisfied - because their problem is just an "outlier."

Are the outliers worthy of all the energy it would take to resolve them? Are they statistically significant? Are they a symptom of much larger issues lurking under the surface - failures of symmetry or lack of foresight into growth requirements, or are you being stabbed in the back by a pernicious bug that a vendor let slip through their QA processes?
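One crude way to start answering the "statistically significant" question is to flag response times that sit more than a couple of standard deviations from the mean, and then see whether the same users keep showing up in that set. The sample data below is invented:

```python
# Simple outlier flagging over response times (milliseconds); data is made up.
import statistics

response_ms = [120, 135, 128, 142, 110, 3900, 131, 125, 4100, 138, 127, 133]

mean = statistics.mean(response_ms)
stdev = statistics.stdev(response_ms)

outliers = [r for r in response_ms if abs(r - mean) > 2 * stdev]
print(f"{len(outliers)} of {len(response_ms)} requests are outliers: {outliers}")
# If the same handful of users keeps landing in this set, it is less likely to
# be random noise and more likely a symmetry failure worth chasing to root cause.
```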

What if we could spot outliers, isolate their issues, find the root cause, and remediate it quickly - dynamically? Self-healing systems are not a new idea by any means, but I have yet to see a commercial enterprise come close to this concept with a production/enterprise system - they are just too complex, and their interconnections too brittle.