For years I have been talking to people about log reads that show up in the MQ Statistics. They come in three flavors: buffer reads (bad), active log reads (worse), and archive log reads (worst). Reads from the MQ log files are done as part of the recovery process and, in many cases, are only seen at queue manager start-up time. When they show up in the regular flow of MQ work, they usually indicate a transaction is backing out.
Note that when a channel connection ends a backout is done, but it is issued just to make sure everything is in a good state before ending the connection, so we don’t typically see any log read activity associated with it.
When we see a high number of log reads in the MQ Statistics data, especially in the “worse” or “worst” categories described above, we like to look at the detailed accounting records so we can figure out which application may be issuing the rollback (backout) requests and get its owners to address the issue. I was looking for just that situation in some customer data recently and found some unexpected things.
First, the statistics data showed an unusual number of log reads: many from the buffers, but a surprising number from the active logs. I then checked the detailed accounting data. Since I had loaded the data into DB2 (using the MQSMFCSV tool for processing, of course), I ran a query that associates the task (WTAS) and task ID (WTID) records with the queue (WQ) records, to find the tasks that were doing the backouts and what queue activity was associated with them. The query looks as follows:
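The original query was not preserved here, so the following is only a sketch of its shape: a join of the WTID, WTAS, and WQ tables on the task correlator, filtered to tasks that recorded backouts. The table names match the record types mentioned above, but the column names (CORREL, CONN_NAME, BACKOUT_COUNT, QUEUE_NAME) are illustrative assumptions; use the actual column names from the DDL that MQSMFCSV generates for your load.

```sql
-- Sketch only: column names below are hypothetical stand-ins.
-- Check the DDL produced by MQSMFCSV for the real names.
SELECT t.CONN_NAME,                        -- channel/connection that did the work
       a.BACKOUT_COUNT,                    -- backouts recorded for the task
       q.QUEUE_NAME                        -- queues the task touched
  FROM WTID t                              -- task identification records
  JOIN WTAS a                              -- task accounting records
    ON a.CORREL = t.CORREL                 -- tie accounting to its task ID
  JOIN WQ   q                              -- queue-level accounting records
    ON q.CORREL = t.CORREL                 -- tie queue activity to the same task
 WHERE a.BACKOUT_COUNT > 0                 -- only tasks that backed out
 ORDER BY a.BACKOUT_COUNT DESC;
```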
The results were surprising: all the backouts recorded during the intervals were from the channel initiator, and all were on SVRCONN channels. I was fortunate that this customer had named all the SVRCONN channels with that term in the name, so it was easy to see.
This seemed really odd until I looked at the CHIN JES log, which was littered with CSQX259E messages indicating that channels had timed out. All were for the SVRCONN channels whose backout counts greatly exceeded the expected numbers. So these client connections are likely finishing a transaction, then starting another one that gets abandoned until the connection times out. Many of the SVRCONN channels are reporting appropriate numbers; it is just a few of the channels that seem to be contributing. Because the customer has used a specific SVRCONN for each client application, a very good practice, it should be easy to at least identify the culprit. Of course, fixing the problem may require more effort, and it may be related to an application server that is trying to pool and share connections, but perhaps somewhat ineffectively.
For years I have been warning customers about the CPU costs associated with badly behaving clients due to the cost of the connect/disconnect verbs and the uplift on the CHIN CPU issuing the MQ API requests on behalf of the client. This impact on the logs was not one I anticipated, but maybe other people delving into the MQ SMF data have seen it.
I am lucky to learn something new every day, but this one made my eyes cross in trying to figure out what was causing all this I/O!