Thursday, February 13, 2014

Looking at Symmetry & Chirality in Enterprise Software

Symmetry (balance) and chirality ("handedness") are seen throughout the physical constructs around us - in biology, chemistry, physics, math, art, music, architecture ... but what are their implications and effects within the sphere of software and performance engineering? Symmetry seems to be favored by nature, and it's actually a simplicity of biological engineering that belies its complexity (a nautilus shell is a great example). Yet true symmetry eludes man in our engineering efforts and in design - we say that simple, elegant designs are the most difficult to create (and I agree), and so too, true symmetry is a harsh mistress to tame.


Think about it this way ... systems are built with some intended symmetries, in the form of multipliers for capacity and/or fault tolerance. Many times these lead to somewhat interesting geometric patterns like this:


Compared to things like these, found in nature:

[Image credits: www.scientificamerican.com, blsciblogs.baruch.cuny.edu, jwilson.coe.uga.edu, www.fantasticforwards.com - attributed, but used without permission]


But the dirty secret is this: in nature, what appears to be symmetrical almost never truly is. While it may be well balanced, it is imperfect due to all the slight variations and complexities of the chemical and biological variables that natural systems are subject to - but it is beautiful and generally relatively simple at its core. So for the sake of argument, let's say that these natural systems are truly symmetrical, and that is the thing that we humans are trying to mimic in our physical designs.

[Image credit: ssrsbstaff.ednet.ns.ca]

In man-made software systems, balance and symmetry are pursued through configuration and heuristic means - such as the load balancing schemes selectable in load balancing appliances and in the web tier, app tier, messaging tier, and database tier. We sprinkle "load balancing" logic and function all over the application hierarchy in an attempt to balance the systems. In reality, work is typically distributed by a mesh of pre-configured distribution patterns; whether "round robin" or "least busy" or "sticky IP/most recently used" or whatever the logical pattern might be, it's an attempt to bring balance to a system that is inherently prone to being out of balance. Hmmm. Why is that? Accidentally engineered chirality, perhaps? Why are we designing systems that are out of symmetry and therefore inherently require "load balancers" as really expensive and complex band-aids? Seems kind of ridiculous if you really think about it. Bad design gave way to a cottage industry of appliances that attempt to fix what we don't understand by being a heavy-handed traffic cop. What would an alternative look like?
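
To make those pre-configured patterns concrete, here's a minimal Python sketch of two of the schemes mentioned above - round robin and least busy. The server names and the connection counts are made up for illustration:

```python
# Minimal sketch of two common pre-configured balancing schemes.
# Server names and the active_connections table are hypothetical.
from itertools import cycle

servers = ["app-01", "app-02", "app-03"]
active_connections = {"app-01": 12, "app-02": 3, "app-03": 7}

_ring = cycle(servers)

def round_robin():
    # Ignore current state entirely; just rotate through the pool.
    return next(_ring)

def least_busy():
    # Pick whichever server currently reports the fewest active connections.
    return min(servers, key=lambda s: active_connections[s])

print([round_robin() for _ in range(4)])  # ['app-01', 'app-02', 'app-03', 'app-01']
print(least_busy())                       # app-02
```

Note how neither scheme knows anything about why a server is busy - it just shuffles work according to a fixed rule.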

Vendors are trying to wrap solutions around these problems - Microsoft with .NET, Oracle with WebLogic, IBM with WebSphere - they try to create a holistic environment that will play well together and help to contain the variables of symmetry - to try to tame it. But oftentimes the consumers of these products do not buy into the holistic solution, and they deploy "solutions" that are grown, not designed: a .NET web server bolted in front of some proprietary code that communicates with an Apache Tomcat server, interconnected with backend systems via message buses, databases, and mainframes. This is CRAP, but the typical "enterprise" accepts these hand-me-down solutions as reasonable, because most of the time they can get what they perceive to be a decent, or at least predictable, level of performance out of them. I'd argue that it's really an accident - it's a fall in slow motion. This type of mess is typically the result of cost management and constructing systems from the existing pieces in the organization, not of designing a solution from the top down.

When we see imbalance in systems, oftentimes the result is some aspect of the system as a whole performing poorly. A load balancing scheme causes unintended consequences - say, a system/software fault in one application server combines with the way session state is tracked at the load balancer to overload another application server - traffic becomes imbalanced across the cluster, and eventually, with enough inertia, the system tips over.

These types of failures of our imposed attempts at symmetry are very, very common. They are also easily overlooked, because many times these patterns are assumed to work "as intended" or, worse, "good enough." So then we employ more layers of internal instrumentation and external monitoring systems from various vendors to monitor traffic through the load balancers and application tiers - to monitor the "symmetry" that we want to enforce. Ugh. Yet another cottage industry of expensive software and hardware to help us contain this thing that we don't understand. Are you seeing a pattern here?

We are constantly addressing the side-effects and not the cause.


How do we design and measure balance, though? I think that we are looking at the concept backwards. We look at the effects of balance, and we do it in terms that are easy to quantify - like resource utilization. For instance, if there are two nodes in a cluster and one is running at 80% CPU utilization while the other is running at 20%, we would be concerned that the systems are off balance - we are not spreading the load out evenly, thereby creating a condition where the work sent to one server may very well be adversely impacted by the high CPU utilization, while the work sent to the other server would not be impacted. We can easily wrap our heads around that concept and then come up with a construct, a scheme, to keep the imbalance from happening. Seems simple ... easy ... intuitive. It couldn't be more stupid.
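
For the record, that "intuitive" approach boils down to something like the following sketch - a weighting derived purely from CPU headroom. The node names and utilization numbers are invented; the point is that this is exactly the single-metric view being criticized here:

```python
# A naive "balance by CPU headroom" weighting - the easy-to-quantify view.
# Node names and utilization numbers are hypothetical.
cpu_utilization = {"node-a": 0.80, "node-b": 0.20}

def headroom_weights(util):
    headroom = {node: 1.0 - used for node, used in util.items()}
    total = sum(headroom.values())
    return {node: h / total for node, h in headroom.items()}

print(headroom_weights(cpu_utilization))
# roughly {'node-a': 0.2, 'node-b': 0.8} - node-b gets ~4x the traffic,
# even if CPU is not actually what limits its throughput.
```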

Could developers and administrators be introducing chirality (handedness) into these systems without knowing it? When we enforce balancing schemes and they fail, is it because of a predisposition in the way we think about balance, rather than actual consideration of what symmetry really is? Or is it due to default states that seemed innocuous on the developer's workbench, but tip the scales under load?

Over the course of my work in performance engineering I have seen the different logical and physical tiers work against each other in this way, many times. F5's BigIP load balancing schemes on the outside layer have to play nicely with the application infrastructure's own load-distribution mechanisms - be they WebSphere, WebLogic, Apache, Oracle RAC, or whatever - and the paths that data travels through these complex systems are determined on the fly by logic and configuration operating outside the control of, and beyond anything considered by, the software designers. This is all interconnected with database queries and web-based technologies like session IDs in an attempt at transparency. Hmmm. Odd that we'd do it this way, isn't it?

Systems Architects are the lords of this domain - and they work with the so-called DevOps folks, application specialists, lead developers, DBAs, vendors and consultants to create the fabric of load modeling and performance engineering for their systems - on paper. Where this fails is in reality - when failures or misconfigurations or unforeseen circumstances deal a hand that was not expected. What then? The entire enterprise is thrown into reaction mode - triage. What happened? Why? How? What's the "fix!?" Deploy, run, hit a boundary case, isolate and fix, redeploy, run, hit a boundary case, isolate, fix ... insanity. These architects are many times handed the systems equivalent of the criminal rabble of the army, and at the same time commanded to make them into an elite force. It's certainly something far less than ideal ... and application performance engineering isn't "Kelly's Heroes."

[more to come on this thread]


Ancillary Thoughts Triggered by This Exposition

What about a Symmetric Performance Pipeline Architecture?
What if an open hardware architecture provided the necessary physical interconnections, expansion, and scalability required by the software solution - a backbone that could eliminate the network-layer overhead and the complexity of load balancers and load balancing throughout the various layers?

The pipeline would balance itself automatically by weighing the KPIs that matter to the system. Various heuristic methods would be employed to understand the load and its impact on the system as a whole, and traffic/load would be moved around the logical organism automatically. Think about how your brain compensates for pain in one foot - you don't have to think a whole lot about how to compensate for that pain - you avoid it automatically.
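
As a rough, purely illustrative sketch of what "weighing the KPIs that matter" could look like: the KPI names, weights, and node readings below are all invented assumptions, not a prescription. Lower score means a healthier node, which should receive more of the traffic.

```python
# Sketch of a KPI-weighted routing score for the hypothetical pipeline.
# KPI names, weights, and node readings are invented for illustration.
KPI_WEIGHTS = {"p95_latency_ms": 0.5, "error_rate": 0.3, "queue_depth": 0.2}

node_kpis = {
    "node-a": {"p95_latency_ms": 240, "error_rate": 0.01, "queue_depth": 4},
    "node-b": {"p95_latency_ms": 480, "error_rate": 0.05, "queue_depth": 19},
}

def normalize(metric, readings):
    # Scale each metric against the worst node so metrics are comparable.
    worst = max(r[metric] for r in readings.values()) or 1
    return {n: r[metric] / worst for n, r in readings.items()}

def score(readings):
    totals = {n: 0.0 for n in readings}
    for metric, weight in KPI_WEIGHTS.items():
        for node, value in normalize(metric, readings).items():
            totals[node] += weight * value
    return totals

def pick_node(readings):
    totals = score(readings)
    return min(totals, key=totals.get)

print(score(node_kpis))
print(pick_node(node_kpis))  # node-a (lower score = healthier)
```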

This architecture would provide a unified symmetry throughout the main tiers of the application infrastructure. You will still have external systems, legacy systems, and other considerations, but from the outside world through the firewall and into the pipeline, it's a direct shot into the system.


Under load, the dynamic nature of the load distribution approach automatically favors balance in the system.
What does this look like in practical terms? How do you create such a system and make it adaptable, affordable, and able to play with existing components?

The design goals should address these constraints:

  1. Must interoperate seamlessly with off-the-shelf server components: web servers, app servers, middleware servers, databases, etc.
  2. Must provide the interconnectivity and symmetry automatically.
  3. Must automatically call out the worst performing components for isolation/remediation.
  4. Must be aware of Layers 3-7, understanding the complexities of low-level networking.

What Ifs:

  1. What if the Performance Pipeline could use generic hardware such as 10Gb NICs for its communications MESH, but forgo the IP protocol in favor of low-level, low-latency inter-node communications? What would this do to available bandwidth and latency?
  2. What if the MESH translated calls from tier to tier and remapped traffic - regardless of how a node is initiating communications? (Web Server 1 explicitly calls App Server 2 - but the MESH redirects the traffic according to balance)
  3. What if the MESH could see and map business transactions, their flow, and provide analytics?
  4. What if the MESH could provide 'capture/replay' functionality for testing and reproducibility?

Content switches, routers, and dynamic caching engines all hold pieces of the puzzle at the core of the Performance Pipeline. The key here is to step into the middle of the operation and direct traffic based on a performance goal.
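
Here's a hedged sketch of what-if #2 above - a MESH layer that intercepts an explicitly addressed call and remaps it to whichever node is currently best placed to take it. The node names, the load table, and the 10% redirect threshold are all assumptions made up for illustration:

```python
# Sketch of what-if #2: the MESH remaps an explicitly addressed call to a
# better-balanced node. Names, load figures, and threshold are hypothetical.
current_load = {"app-1": 0.72, "app-2": 0.91, "app-3": 0.35}

def mesh_route(requested_node, pool):
    """Override the caller's explicit target when a clearly better node exists."""
    best = min(pool, key=pool.get)
    # Only redirect when the gap is meaningful (10% here, an arbitrary choice).
    return best if pool[requested_node] - pool[best] > 0.10 else requested_node

# Web Server 1 explicitly calls app-2, but the MESH redirects the traffic.
print(mesh_route("app-2", current_load))  # app-3
```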


Is Open Source any Better?
Now you have many open source, user-contributed software solutions that attempt to address business needs from a very different design and support model than that of the "big corporation." Hat tip to "The Cathedral and the Bazaar" - Eric S. Raymond's quintessential essays on the implications of Open Source (http://www.catb.org/~esr/writings/cathedral-bazaar/). While many people are attracted to the business model of Open Source, does Open Source help to solve this whole problem that we're discussing, or does it in fact make it worse? For instance, are all these tiers of Open Source software being designed in harmony with each other? Do they share a common fabric of performance engineering - do they holistically work together, or are they bolted together, still requiring glue and band-aids to achieve some level of acceptable scalability and performance? (hint: it's the latter)

So I don't see salvation in Open Source per se. Could we see something grow organically? Sure we could - and maybe it's already happening.

Open Source is probably the best bet for getting something like the Symmetric Performance Pipeline off the ground.

What about Fluid Dynamics?
What if we could take all of the systems and configurations and perform the equivalent of fluid dynamics modeling with them? We start to do this with performance testing, but performance testing always involves a series of compromises that dictate that you cannot test everything - you cannot create load scenarios that cover 100%, or even 90%, or even 80% of your production loading model cases; there is not enough time, resources, or money to accomplish that. The old adage applies: you have three options - time, quality, cost - and you get to choose two. Which two you choose will dictate the majority of the experience of your end-users.

What if we could find a way to harmonize input into the outside tier of the application architecture that would flow end-to-end, in a dynamic manner? What if we didn't have to "write test scripts?" What if business logic could be reverse engineered on-the-fly, and requests generated and driven dynamically?
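
One very rough take on "no test scripts": derive the request mix from transaction frequencies actually observed in production and sample from it, rather than hand-writing scenarios. The transaction names and counts below are invented:

```python
# Rough sketch of a scriptless load mix: weight generated requests by the
# transaction frequencies observed in production. Names/counts are invented.
import random

observed_counts = {"browse": 7200, "search": 2100, "add_to_cart": 540, "checkout": 160}

def next_transaction(counts):
    names = list(counts)
    return random.choices(names, weights=[counts[n] for n in names], k=1)[0]

# A generated "script" is just a stream sampled from live behavior.
print([next_transaction(observed_counts) for _ in range(5)])
```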

What if we "saw" and reacted to systems performance with a perspective like this?


The system should automatically favor balance over configuration.


What about Outliers / Boundary Cases?
If the majority of your end-users are experiencing a reasonable level of system performance and they are generally happy - say 60-80% of your users - what about the others? The remaining 20-40% of those users are not happy because they are not having the same experiences as the happy ones, right? Their dissatisfaction is wide and varied, because they are hitting boundary cases that are caused by that third variable that we had to drop from the list - time, quality, or cost. Now you have a bunch of users who are likely each experiencing their own unique boundary case. It is often considered the cost of doing business that these users will never be fully satisfied - because their problem is just an "outlier."

Are the outliers worthy of all the energy it would take to resolve them? Are they statistically significant? Are they a symptom of much larger issues lurking under the surface - failures of symmetry or lack of foresight into growth requirements, or are you being stabbed in the back by a pernicious bug that a vendor let slip through their QA processes?

What if we could spot outliers, isolate their issues, find the root cause, and remediate it quickly - dynamically? Self healing systems are not a new idea by any means, but I have yet to see a commercial enterprise come close to this concept with a production/enterprise system - they are just too complex and their interconnections too brittle.
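
To make "spot the outliers" a little more concrete, here's a minimal sketch using the median and median absolute deviation (MAD) on response times. The sample data and the 3.5 cutoff are illustrative conventions, not a recommendation:

```python
# Minimal outlier spotting on response times using the median and the
# median absolute deviation (MAD). Sample data and cutoff are illustrative.
from statistics import median

def outliers(samples_ms, cutoff=3.5):
    med = median(samples_ms)
    mad = median(abs(x - med) for x in samples_ms) or 1e-9
    # 0.6745 scales the MAD so the score is roughly comparable to a z-score.
    return [x for x in samples_ms if abs(0.6745 * (x - med) / mad) > cutoff]

response_times = [210, 195, 220, 205, 198, 2400, 215, 190, 3100]
print(outliers(response_times))  # [2400, 3100]
```

Spotting them is the easy part, of course - isolating and remediating them dynamically is where no enterprise system I have seen comes close.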

Friday, September 27, 2013

Target Fixation


You are driving down the road on a dark, rainy night. Your headlights are barely lighting the road ahead in the downpour. As you bob and weave your head and eyes to see through the rain and the windshield wipers dancing on your windshield, you try to pick out the defining details of the road ahead. Trees, branches, deer, skunk, trash ... all manner of hazards litter the roadway, and from time to time you lose all orientation to the sides of your lane. Good fortune is with you - you continue to acquire the roadway ahead, in spite of your gut telling you to pull off the road and wait this downpour out. Your pulse continues to rise and has hit a shuddering 170 beats per minute, your palms are sweating, and your knuckles have the telltale pallid hue as you grip the steering wheel tighter and tighter. You are in full-on combat mode with the road, and the inclement weather is your enemy.

Suddenly, as you come around a turn, oncoming headlights appear to be in your lane and time slams into slow motion.
At this point the mind can take a number of paths.
1. Freeze. Total musculo-skeletal lock. Inability to process the information being fed in.
2. Panic. Swerve radically and risk collision with objects off of the road.
3. Tactical avoidance. Mind and body swing into motion, plotting an escape from the situation in a rapid, coordinated set of motions that take place instantaneously - reflexive response.
4. Target fixation. Like freezing, but worse - you collide with the target because you actively navigate into it.

The description of target fixation above seems ridiculous, but it is a very real situation in which you focus on the objective so intently that your mind and body conspire to do the exact opposite of your intent. Instead of avoiding the obstacle, you end up colliding with it.

This phenomenon happens more often than you might think, and in the area of application performance monitoring, tuning and testing, it is extremely common.

This video clip is a classic example of target fixation ... in the moment of truth, the rider is so focused on the thing he so desperately does not want to collide with that he drives right into it. Don't worry, it's not gory (but it sure hurt, you can bank on that!):



Do you think that the rider in the video wanted to hit the wall? Of course not, that is absurd. It was the very last thing that the rider wanted to do *ever* in his pursuit of doing what he enjoyed, riding the open road on a motorcycle. The motorcycle is not the issue, the wall was not the issue; the rider's brain - and how it processed the threat - is the issue. He could have just panicked and laid the motorcycle down (and it would have also slid into the wall, thanks to angular momentum). He could also have taken evasive action, and the rest of his day would have been a lot better.

This phenomenon was seen in the early days of aerial combat, where pilots would fly into the targets they were intently trying to destroy - never intending to be a "kamikaze." The focus on the target is so intense that all other sensory input is shut out or diminished in importance relative to the one, key, overarching goal - the target itself.

So what? What in the world do motorcyclists and fighter pilots have to do with Performance Engineering? It's all about evasive action, how we train mind and body to work in unison to avoid obstacles - keeping your eye on the target, but also letting peripheral vision and sensory acuity work. In terms of Performance Engineering, we're talking about application performance and adapting to the constantly changing state of our environment.

The heart of this issue in Performance Engineering is the assumption that modern applications and systems are static entities, and that we can just create a turn-key, templatized, formulaic way of finding and resolving all of our potential performance-robbing and architectural defects. That assumption is sorely misguided. The process and practices of effective PE are just like the opening scenario - every test, every production outage is like driving on a rainy night. You might have been down this same road 10,000 times before, but every rainy night is different.

Performance testing large applications is an exercise in constant balance - balancing business needs and risks. I would contend that if you are doing Performance Engineering correctly, you will be constantly bombarded with unexpected issues and information that often require a lot of thought - sometimes deep thought - and analysis to understand causation/root cause and filter out the background noise.

[October 8, 2013 Edit - Example of Target Fixation]
Here is a greatly simplified example of this phenomenon of target fixation, the one that inspired this article:

Using past performance metrics/KPIs/criteria as the sole measure against which current system testing is compared, and thereby judged to be acceptable or not. Don't get me wrong - you do need a baseline to compare change against, but a form of data myopia comes with this type of approach.

For instance, say you are using some key business metrics/KPIs that were derived from a previous season's issues or a specific outage - such as crashes during the Black Friday sales season of 2012 - and those, and only those, metrics/KPIs are used as the ultimate measure of whether current performance is acceptable. That is target fixation.

The takeaway here is that what is bad is bad, and the current state of the application and environment constantly changes - so while you use previous seasons' or tests' results to gauge incremental progress or change, you must not ignore what is staring you in the face. More specifically, if in the 2012 event your "add to cart" functionality was measured to be within acceptable performance criteria, and therefore is not on the list of metrics/KPIs, but in the last few rounds of performance testing the "add to cart" functionality has been consistently slowing down, it needs to be called out and addressed. This falls into the category below of "worst performers." Constantly call out the worst performers and bring attention to them; don't ignore them because they are not on some summary list of "issues that bit us last year."
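
As a small counterweight to that kind of fixation, here's a sketch that flags any transaction trending slower across recent test rounds, whether or not it's on last year's KPI list. The transaction names, timings, and the 10% growth threshold are invented for illustration:

```python
# Sketch: flag any transaction that has slowed consistently across the last
# few test rounds, regardless of whether it is on the fixed KPI list.
# Transaction names and timings (ms) are invented.
test_rounds = {
    "add_to_cart": [310, 340, 395, 450],   # steadily degrading, not on the KPI list
    "checkout":    [820, 815, 830, 825],   # flat
    "login":       [140, 150, 135, 145],   # noisy but flat
}

def degrading(series, min_growth=0.10):
    """True if timings rise monotonically and by more than min_growth overall."""
    rising = all(later >= earlier for earlier, later in zip(series, series[1:]))
    return rising and (series[-1] - series[0]) / series[0] > min_growth

print([name for name, series in test_rounds.items() if degrading(series)])  # ['add_to_cart']
```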



5 Things that you can do to vastly improve your Enterprise Performance Engineering efforts
Each of these deserves its own dedicated post, but these things can get you past Target Fixation and moving toward true Performance Engineering:

1. Test in Production
  • This sounds so scary that companies dismiss it out of hand, and that is the single largest mistake that they can make.
  • Done right, testing in Production absolutely answers fundamental questions that you NEED to know about your enterprise operations, performance and capacity.
  • Modern systems are digital symphonies that almost never scale linearly across the board. If you only test in an environment that is 1/3 or 1/4 the capacity of your Production servers, you are never guaranteed that your testing is actually uncovering issues or validating performance requirements. Extrapolation is exCrap-olation.

2. A Team Post-Mortem After Every Production Test or Major Pre-Production Test
  • You have got to come out of every testing effort with action items ... nothing ever goes 100% to plan, and there are always unexpected glitches, surprises, and observations that deserve attention.
  • Assign action items, assign dates, regroup with an action plan and a follow up test plan
  • Drive performance - it doesn't happen by accident
  • Come up with a 4- or 5-slide dashboard to present each test's findings, action items, and statistics.

3. Don't Extrapolate for Production
  • Extrapolation is a guessing game that by its very definition cannot end in certainty
  • You can effectively test in scaled-down environments, but you cannot accurately project production performance from them - it does not work.

4. Find Time to Chase the White Whale
  • During performance testing, vast amounts of information are collected, and most of it goes unobserved, unused, and unanalyzed. Many outlying observations are tossed out, but sometimes disturbing trends are also actively ignored because they cannot be explained, or because they fall outside the stated scope and goals of the test.
  • Oftentimes when business politics drives performance testing, the Performance Engineering aspects of our jobs are compromised. Highlight the "things that make you go hmmmm?" and build a team to undertake the challenge of identifying and explaining all the little things - because they add up.
  • Develop targeted test plans for those things that fall in the grey area ... "these are not great, but they're within tolerance, but they could be better" - today's blips are tomorrow's bottlenecks.
  • Performance Testing goals need to change as problems are solved. When you fix a bottleneck, validate it against the baseline, and re-baseline. Now you're on to the next issue, which may have been uncovered by that last fix. You just removed a massive bottleneck at your load balancers, well guess what, your next tier in line is going to get hammered. Adapt to the shift, test for it, move on.
  • Identify the worst performing parts of your apps and target them for tuning - beat the living snot out of them until they perform well. (A minimal ranking sketch follows at the end of this post.)

5. Listen to the Crazy People on Your Team
  • There is that one guy on your team who says crazy stuff like, "That performance curve reminds me of the torque curve of a failing engine...." and people shake their heads and go on. Stop. What? Explain that? What do you mean? That guy has a picture in his head that he hasn't explained, and it might be a key insight to what the rest of your team is missing.
  • In the movie, The January Man, a detective ended up solving a case by listening to his artistic friend who spotted a pattern in the clues that everyone else missed. He sounded insane at first - until he was proven correct. The abstract thinkers have ways of looking through details and spotting things that others cannot see.
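
Picking up the "worst performers" point from item 4 above, here's a minimal ranking sketch - endpoints sorted by 95th-percentile response time. The endpoint names and timings are hypothetical:

```python
# Rank the worst performers: sort endpoints by 95th-percentile response time.
# Endpoint names and timings (ms) are hypothetical.
def p95(samples):
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))]

results_ms = {
    "/search":      [120, 140, 133, 150, 2100, 160],
    "/add-to-cart": [310, 340, 395, 450, 470, 520],
    "/login":       [90, 95, 88, 102, 110, 97],
}

# Worst first - these are the candidates to target for tuning.
for endpoint in sorted(results_ms, key=lambda ep: p95(results_ms[ep]), reverse=True):
    print(endpoint, p95(results_ms[endpoint]))
```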