Data collection for an MQ for z/OS Health Check

I do a fair number of MQ for z/OS health checks, usually as part of a parallel sysplex health check, and often get a lot of pushback due to the perceived cost of turning on the MQ SMF data. What we typically ask for is:

  1. One week or more of the MQ Statistics records (SMF 115, all classes).
  2. The MQ task accounting and channel accounting (SMF 116, all classes) for 2 peak SMF intervals (on-line and batch).
  3. The JES logs for the MSTR and CHIN address spaces.
  4. Some display of data, if the JES logs do not include the queue manager start up.

The complaint is often about the perceived cost of gathering this data. As I have had to address these objections a couple of times in the past week, I am gathering my thoughts here on some of the issues raised. I hope this helps explain why this data is important, why the costs may not be as high as you think, and what mitigation may be possible, especially when this is a stand-alone MQ for z/OS health check.

  1. Objection # 1 – We do not want to change the zPARMS for MQ as that requires an outage.
    Collecting the MQ SMF data does not require a change to the zPARMS. All of the collection can be turned on dynamically: the +cpf START TRACE command works for both statistics and accounting data.
  2. Objection # 2 – “The cost of collecting the MQ Statistics is too high.”
    The MQ Statistics are very lightweight, cutting between 2-4 records per SMF interval per queue manager. If they are not on all the time, they should be. If they are not being reviewed at least monthly, then learning how to gather and review the data should be a high priority at your site.

    MQ does not report into the z/OS health checker, which means that there are two primary sources of information about the overall health of the queue manager: the JES logs and the statistical data (SMF 115). Often, by the time there is a message in the JES log, the problem is already critical, even though it could have been prevented.

    The combination of the statistics and the JES log information gives a picture of the general health of the queue manager – neither by itself is complete. The cost of producing the statistical data is negligible, as the information is always collected – the only cost is to write out the 2-4 records per interval.

  3. Objection # 3 – Why do you ask for a week of statistics, when the other subsystems are just asking for two one-hour time slots?
    Because of the asynchronous nature of MQ itself, having data from a longer period of time can help avoid problems. Depending on the processing model used by the applications, the MQ peak processing time may differ from the peak periods of batch jobs or other subsystems. We have seen situations where the MQ data was limited to an hour or so, and the resulting recommendations actually hurt the queue managers because the MQ peak fell at a different time.
  4. Objection # 4 – The cost of collecting the MQ Task Accounting data is too high, we’ve heard it is 10% overhead.
    Yes, this can be costly, because the records are big and may be cut frequently. However, the task accounting data is invaluable for seeing how MQ is actually used, which queues are in use, and for finding performance problems in the application code itself. While I never recommend that they be captured all the time, I do recommend that they be gathered regularly and examined. This helps to spot issues, do proper capacity planning, and evaluate queue placement. Finally, when you do have a problem, being familiar with these records will make it easier to spot: it is hard to figure out what is abnormal when you do not know what normal looks like.

The 10% figure is a myth. It was a ‘gut feel’ early on in MQ V6 and was never verified with testing. Reports from numerous customers show that collecting the MQ task accounting data (aka the class 3 data) has a 3-7% overhead, depending on the method used to collect the data, how many long running transactions there are, and how busy the queue managers are.

In addition, if this is an independent MQ for z/OS health check you can control the timing of the record production a bit to limit the impact. What I have often suggested is:

A) Use the queue manager STATIME attribute to reduce the MQ SMF interval to 2 minutes, well prior to the typical peak time. If your SMF interval is 30 minutes, issue the +cpf SET SYSTEM STATIME(2) command no later than 35 minutes prior to the collection period.
B) Just prior to the peak time, use the START TRACE command, +cpf START TRACE(A) CLASS(*)
C) Allow that to run for 6 minutes.
D) Issue the STOP TRACE command, +cpf STOP TRACE(A) CLASS(*)
E) Issue the SET SYSTEM STATIME(??) command to restore the MQ interval to what it was prior to the data collection.
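
To make the timing in steps A through E concrete, here is a minimal Python sketch that works out when each command would be issued. It is only an illustration: the function name, its parameters, and the sample date are hypothetical, and the lead time for step A (current interval plus 5 minutes) is an assumption generalised from the 30-minute / 35-minute example above. The script only prints the commands; it does not issue them.

```python
from datetime import datetime, timedelta


def collection_schedule(peak_start, current_statime_min, cpf="+cpf",
                        short_statime_min=2, run_min=6, margin_min=5):
    """Return (time, command) pairs for steps A, B, D and E.

    Step C (let the trace run) is the gap between the START and STOP times.
    """
    # Assumption: issue step A one full current interval plus a small margin
    # before the collection period (30 + 5 = 35 minutes in the text's example).
    lead = timedelta(minutes=current_statime_min + margin_min)
    stop = peak_start + timedelta(minutes=run_min)
    return [
        (peak_start - lead, f"{cpf} SET SYSTEM STATIME({short_statime_min})"),
        (peak_start, f"{cpf} START TRACE(A) CLASS(*)"),
        (stop, f"{cpf} STOP TRACE(A) CLASS(*)"),
        # Restore whatever interval was in effect before the collection.
        (stop, f"{cpf} SET SYSTEM STATIME({current_statime_min})"),
    ]


if __name__ == "__main__":
    # Hypothetical example: peak starts at 10:00, normal interval is 30 minutes.
    for when, command in collection_schedule(datetime(2024, 1, 15, 10, 0), 30):
        print(when.strftime("%H:%M"), command)
```

With a 10:00 peak and a 30-minute interval, this puts the STATIME(2) change at 09:25, the START TRACE at 10:00, and the STOP TRACE plus the STATIME restore at 10:06, so the expensive accounting trace is active for only three 2-minute intervals.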