More on workload skewing

Based on a true story; the names have been omitted to protect the innocent, the guilty, and the vaguely interested.

Anyone who has talked to me is aware of my interest in the workload skewing that can be a natural part of a shared queue environment.  I have been involved in a number of these situations, recommending mitigation techniques to customers.  But as with anything else, there is always more to learn about the causes, the effects, and the interaction between the coupling facility (CF) and MQ.

As a bit of background, some customers have observed what the development lab has termed “over notification” when using shared queues.  Simply put, there are multiple application instances with an MQGET with wait active on a particular queue.  When the coupling facility code detects a transition from empty to non-empty (a message has been put to a previously empty queue) it notifies MQ of the transition, and MQ in turn notifies all of the application instances that are waiting.  All of those applications attempt to get messages.  If there is only one message, one instance gets it, but there is CPU cost associated with all the other instances trying to get a message and finding nothing.
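To make that concrete, here is a minimal sketch in C, using the MQI, of the kind of getting application described above.  Each instance sits in an MQGET with a wait interval; every instance that happens to be waiting when the empty-to-non-empty notification arrives wakes up and races for the message.  The queue manager and queue names are placeholders, and error handling is pared down to the bare minimum.

```c
#include <stdio.h>
#include <string.h>
#include <cmqc.h>                       /* MQI definitions                      */

int main(void)
{
    MQHCONN  hConn;                     /* connection handle                    */
    MQHOBJ   hObj;                      /* object handle for the queue          */
    MQLONG   compCode, reason;
    MQOD     od  = {MQOD_DEFAULT};      /* object descriptor                    */
    MQMD     md  = {MQMD_DEFAULT};      /* message descriptor                   */
    MQGMO    gmo = {MQGMO_DEFAULT};     /* get-message options                  */
    MQBYTE   buffer[4096];
    MQLONG   dataLength;
    MQCHAR   qmName[MQ_Q_MGR_NAME_LENGTH + 1] = "QM1";        /* placeholder    */

    MQCONN(qmName, &hConn, &compCode, &reason);
    if (compCode == MQCC_FAILED) return reason;

    strncpy(od.ObjectName, "APP.SHARED.QUEUE", MQ_Q_NAME_LENGTH);  /* placeholder */
    MQOPEN(hConn, &od, MQOO_INPUT_SHARED | MQOO_FAIL_IF_QUIESCING,
           &hObj, &compCode, &reason);
    if (compCode == MQCC_FAILED) return reason;

    /* Each of the (possibly many) instances of this program waits here.        */
    /* When the queue goes from empty to non-empty, MQ wakes the waiters; with  */
    /* only one message available, one MQGET succeeds and the rest come back    */
    /* with MQRC_NO_MSG_AVAILABLE and wait again - that retry is the CPU cost   */
    /* of over notification.                                                    */
    for (;;)
    {
        gmo.Options      = MQGMO_WAIT | MQGMO_FAIL_IF_QUIESCING;
        gmo.WaitInterval = 30000;                         /* wait up to 30 seconds */
        memcpy(md.MsgId,    MQMI_NONE, sizeof(md.MsgId)); /* get the next message  */
        memcpy(md.CorrelId, MQCI_NONE, sizeof(md.CorrelId));

        MQGET(hConn, hObj, &md, &gmo, sizeof(buffer), buffer,
              &dataLength, &compCode, &reason);

        if (reason == MQRC_NO_MSG_AVAILABLE)
            continue;                   /* woken but lost the race, or timed out */
        if (compCode == MQCC_FAILED)
            break;                      /* give up on any other error            */

        /* ... process the message ... */
    }

    MQCLOSE(hConn, &hObj, MQCO_NONE, &compCode, &reason);
    MQDISC(&hConn, &compCode, &reason);
    return 0;
}
```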

Over notification works to the advantage of some workloads, where the arrival of one message is typically the start of a deluge of traffic.  This is particularly true when the queue is not likely to return to a zero depth for a long period of time: if the CF transition is never re-driven, those ‘extra’ instances would never be notified that there are messages to be processed.

Reducing over notification can be an advantage when there are CPU constraints: less CPU is used when fewer notifications go out, and there is less contention between instances competing for a limited number of messages.  The MQ change team has developed a tuning option that helps control over notification.

It gets interesting when conditions change.  In this particular instance, the customer had used the tuning option to control over notification in their production environment because one application was consuming a significant amount of CPU while the primary shared queue was, most of the time, only being trickle-fed.

However, when they started to test a different application, they did not implement the over notification tuning option.  They initially could not drive much workload and were running into CPU waits.  When examining the SMF data, we found a significant level of latching on type 24 latches, which is the primary symptom of churn on the ‘get wait’ chain often seen with over notification.  The tuning option was turned on, greater throughput was achieved, and we saw a significant reduction in the type 24 latching.

Additional capacity was then added to the test environment to mimic their production environment, and the situation changed.  As more load was added, the LPAR with the faster service times from the coupling facility was consuming 90% of the work, and we saw little activity on the second LPAR no matter how hard the workload was driven.  In addition, the inbound queue depths were building up, and messages were remaining on the queue for longer periods.  At the same time, a lot of application instances did not appear to be active.

The development labs in Poughkeepsie and Hursley had a conversation about this and found that, because MQ was suppressing notifications and the coupling facility transition was never re-driven (the queue never went back to zero under the constant load), notification, or rather the lack of it, was causing the bottleneck.  Turning off the tuning option reduced the skewing to a more expected 60/40 split based on the difference in service times, at the cost of some CPU, because the reintroduced over notification brought back some of the original get-wait churn and contention for the messages.

We anticipate that the over notification ‘churn’ can be controlled in the test environment once a disciplined study of the number of active application instances has been done.  Perhaps that will help them reach equilibrium between CPU consumption and message throughput.

Workload matters.
