Friday, April 29, 2016

TBSM 6.1.1 Fix pack 4 released!

Fix Pack 4 was released tonight. See the full list of APARs and improvements, and get the downloads, here:

http://www-01.ibm.com/support/docview.wss?uid=swg24041505

Please note that, apart from installing the fix, there are a few manual actions to take in order to fully benefit from some of the new features:

  1. RADEVENTSTORE index creation, to prevent the TBSM event reader from hanging
  2. Installation of the new right-click context menu item "Delete Service Instance"
  3. SLAPrunePolicyActivator, an Impact policy activator service that prunes RADEVENTSTORE and 6 other tables in the TBSM DB
  4. Installation of TIP 2.2.0.17 version 3, which is now certified for use with TBSM
mp

Tuesday, April 26, 2016

Total Event Count in TBSM

Introduction

Tivoli Business Service Manager can calculate amazing things for you, if you need them. This is thanks to the powerful rules engine at the core of TBSM, as well as the Netcool/Impact policy engine that runs under the hood of every TBSM edition. You can later present your calculation results on a dashboard or in reports, depending on whether you want a real-time scorecard or historical KPI reports.
In this article, I’ll show how to calculate a total event count throughout a multi-level service tree. TBSM doesn’t do this right after a fresh install because it doesn’t provide the right rules out of the box. TBSM also doesn’t ship with any predefined service tree, so to see this working you need to do both: add the rules to your service templates, and import or create by hand a service tree structure to test with.
In this material, I’ll create a simple, multi-level service tree consisting of 3 levels of instances, and I’ll use my own template, T_Regions. To repeat this exercise you can also simply reuse the template SCR_ServiceComponentRawStatusTemplate, which comes with every TBSM installation and is widely used in integrations with Tivoli Application Dependency Discovery Manager (TADDM). The key thing is that your template must:
-        Have at least 1 incoming status rule
-        Be in use across the whole service tree, so that all service instances on all levels of your service tree implement that template.


Figure 2. Incoming Status Rule body used in this exercise

Figure 3. Simple service tree used in this material

Note. This document re-implements functionality that already exists, namely calculating the total number of events on every service tree level, the result of which is stored in the numRawEventsInt parameter. This parameter is visible as the last value in the RAD_prototype widget, which is typically used on Custom Canvases on TBSM dashboards created in Tivoli Integrated Portal. However, that parameter value isn’t accessible to numerical rules or policies for further processing.
Figure 4. numRawEventsInt value used on RAD_prototype widget

The newest add-on to TBSM, the debug Spy tools, also offers a parameter for every service tree level, called Matching Events. Although that value is correct too, it also isn’t accessible from numerical rules or policies.
Note. There is a BSM Accelerator template called BSMAccelerator_EventCount, which was designed to present the correct number of events for every service instance. However, it was tailored to the needs and service tree structure of BSM Accelerator and doesn’t scale to arbitrarily deep service trees. Some of the concepts introduced to support the BSM Accelerator package will nevertheless be covered in this document. If you want to read more, see this document:
Note. TBSM 6.1.1 FP3 is a prerequisite for all rules described in this material to work correctly. However, installing Fix Pack 4 or higher is highly recommended to get the latest improvements.


What is the multi-level events count?

TBSM runs an Impact service called TBSMOMNIbusEventReader, which comes with the product out of the box. It reads events from Netcool/OMNIbus on a regular basis (every 3000 milliseconds by default) and uses a special predefined filter to find the events to be processed by TBSM.
Here’s the default filter:
(Class <> 12000) AND (Type <> 2) AND ((Severity <> RAD_RawInputLastValue) or (RAD_FunctionType = '')) AND (RAD_SeenByTBSM = 0) AND (BSM_Identity <> '')
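
To make the filter easier to read, here is a minimal Python sketch of the same condition applied to a single event. It is only a model for illustration, not how the event reader is implemented: the event is a plain dictionary, and the sample field values are hypothetical.

```python
def passes_reader_filter(event):
    """Python rendering of the default TBSMOMNIbusEventReader filter."""
    return (
        event["Class"] != 12000
        and event["Type"] != 2
        and (
            event["Severity"] != event["RAD_RawInputLastValue"]
            or event["RAD_FunctionType"] == ""
        )
        and event["RAD_SeenByTBSM"] == 0
        and event["BSM_Identity"] != ""
    )


# Hypothetical test event that TBSM has not seen yet.
new_event = {
    "Class": 0,
    "Type": 1,
    "Severity": 3,
    "RAD_RawInputLastValue": 0,
    "RAD_FunctionType": "",
    "RAD_SeenByTBSM": 0,
    "BSM_Identity": "Malopolska",
}
# The same event after TBSM has marked it as seen.
seen_event = dict(new_event, RAD_SeenByTBSM=1)

print(passes_reader_filter(new_event))   # True
print(passes_reader_filter(seen_event))  # False
```

Note that an event with an empty BSM_Identity never gets past this filter, whatever its severity.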

All events that pass that filter get processed further by TBSM service template rules, specifically by a special kind called Incoming Status Rules. The most typical incoming status rule, ComponentRawEventStatusRule, is predefined inside the SCR_ServiceComponentRawStatusTemplate template. It has a precondition, called a discriminator, which filters out any event let through by the event reader that doesn’t have one of the following classes:
·        TPC Rules(89200),
·        IBM Tivoli Monitoring Agent(87723),
·        Predictive Events(89300),
·        IBM Tivoli Monitoring(87722),
·        Default Class(0),
·        TME10tecad(6601),
·        Tivoli Application Dependency Discovery Manager(87721),
·        Precision [Start](8000),
·        MTTrapd(300),
·        Precision [End](8049)

Note. In my example, my Incoming Status Rule will simply expect Default Class (0) in all my test events.
This is not the end; there’s one more filter. It is called the event identification field, and by default TBSM looks for its value in the event field called BSM_Identity. The value expected in that field comes from the service instance’s event identifier, which by default is the same as the service instance name. So the event identifiers for my simple service tree are the following:
Service instance name    Event identifier
Europe                   Europe
Poland                   Poland
Malopolska               Malopolska

In this material I will not discuss how to maintain event identifiers, how many event identifiers you can have, or how to set up event identifiers in XMLtoolkit configuration files (if you’re interested in those topics, please see my private blog entry: http://www.marcinpaluch.pl/wordpress/?p=231). I will also not discuss how event severity may affect service instance status; I simply use the defaults in my example.
To sum up, there are 3 filters your event has to pass before it affects your service instance:
a)      The TBSMOMNIbusEventReader’s filter
b)      The Incoming Status Rule discriminator / event class filter
c)      The event identifier
If your event makes it through all the filters, you can call it a service-instance-affecting event.
This doesn’t mean your event has to change your service instance status; it only means that your event was processed by the Incoming Status Rule implemented in your service instance’s template. If you use TBSM 6.1.1 FP4, you can use the Service Model Spy tool to see that your Incoming Status Rule updated various attributes, such as Matching Events (a number), Max Event Status (the event’s severity) and the timestamp of when the rule processed the event.
The Matching Events parameter is what I’ll call the EventCount in this material.
Now, why a multi-level event count?
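
The three filters summarized above can be sketched as one small Python function that reports where an event gets stopped. This is an illustration under assumptions: the reader filter is reduced to two of its conditions, the class list is the one quoted earlier for ComponentRawEventStatusRule, and the function name and sample events are hypothetical.

```python
# Classes accepted by ComponentRawEventStatusRule, as listed in this article.
ACCEPTED_CLASSES = {89200, 87723, 89300, 87722, 0, 6601, 87721, 8000, 300, 8049}


def first_failed_filter(event, instance_identifiers):
    """Return the name of the first filter the event fails, or None if it passes all three."""
    # a) Event reader filter (simplified stand-in for the full default filter).
    if event["RAD_SeenByTBSM"] != 0 or event["BSM_Identity"] == "":
        return "event reader filter"
    # b) Incoming Status Rule discriminator: the event class must be accepted.
    if event["Class"] not in ACCEPTED_CLASSES:
        return "discriminator"
    # c) Event identifier: BSM_Identity must match a service instance identifier.
    if event["BSM_Identity"] not in instance_identifiers:
        return "event identifier"
    return None


identifiers = {"Europe", "Poland", "Malopolska"}
good = {"Class": 0, "RAD_SeenByTBSM": 0, "BSM_Identity": "Malopolska"}
wrong_class = dict(good, Class=12345)
unknown_service = dict(good, BSM_Identity="Berlin")

print(first_failed_filter(good, identifiers))             # None
print(first_failed_filter(wrong_class, identifiers))      # discriminator
print(first_failed_filter(unknown_service, identifiers))  # event identifier
```

Only an event for which the function returns None would count as affecting a service instance.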
Every service instance can have its own individual EventCount. Every level of the service tree can contain more than one service instance, and the best way to sum their counts is to calculate the sum on their parent level. The parent service instance may itself implement a template with an Incoming Status Rule, and therefore have its own individual EventCount. That parent can in turn be one of many parent service instances, so the best way to sum those up is to calculate a TotalEventCount on the grandparent level. And so on. The multi-level event count is thus a feature that calculates the total number of events being processed by TBSM in the whole service tree.
Why would you need it? There are several possible use cases:
-        Service tree consistency check and verification: in a development phase, to see whether all levels of your service tree get processed correctly
-        Statistics: to see the current, true load on TBSM by source, class, alert type or any other event field, for further analysis of event storms and their causes
-        Monitoring operations: for example, to compare the total event count with the total count of acknowledged events, the total count of events escalated by opening an incident, etc.
-        Monitoring service component quality: especially important when service components are managed or provided by a 3rd party provider; you can assess how much trouble each of them gives your company or your operations team
Once the use case is agreed, you may want to use this material to start collecting your total event counts in order to present them on a dashboard or in a report. Let me now explain how to set it up.
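
The definition above is recursive: the total for an instance is its own EventCount plus the totals of all its children. A minimal Python sketch of that idea follows. The class name, the field names and the event counts are hypothetical and only serve the illustration; this is not a TBSM API.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceInstance:
    name: str
    own_events: int = 0
    children: list = field(default_factory=list)


def total_event_count(instance):
    """Own EventCount plus the total event count of every child, recursively."""
    return instance.own_events + sum(
        total_event_count(child) for child in instance.children
    )


# Hypothetical counts on a three-level tree.
malopolska = ServiceInstance("Malopolska", own_events=3)
poland = ServiceInstance("Poland", own_events=1, children=[malopolska])
europe = ServiceInstance("Europe", own_events=0, children=[poland])

print(total_event_count(malopolska))  # 3
print(total_event_count(poland))      # 4
print(total_event_count(europe))      # 4
```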


Implementation

As the first step, let’s make sure I’m collecting the event count for each of my service tree elements. Let me create my new rule: OwnEvents count.
Note. This step has a prerequisite: my Incoming Status Rule must already exist.
This is perhaps not well documented, but every Incoming Status Rule can be used in a Numerical Formula rule to get the number of events processed. It is documented in this technote:
So let me do exactly what the technote does. This is my numerical formula, a rule called OwnEvents, which returns the count of non-clear events only, via the Incoming Status Rule’s parameter NumEventsSevGE2 (available by default since TBSM 6.1.1 FP1). Whenever my Incoming Status Rule processes another event with severity 1 or higher, the output of my numerical formula refreshes and increases by 1.
Figure 5. OwnEvents rule settings

And on my scorecard:
Figure 6. OwnEvents in a scorecard

Let’s send a test event to the last level now:
Figure 7. Sending test event
Figure 8. Test event settings

Figure 9. OwnEvents after sending test event

As you can see, the event’s severity was passed up through the whole service tree; that is why the icon in the Events column changed color to purple from the bottom level right up to the top one.
After I sent a critical event to the 2nd level, the icons from the 2nd level up to the top one changed color to red.
Figure 10. OwnEvents after sending 2nd test event

Note. To perform this exercise, I haven’t created a status propagation rule. And I will not!
Take a look at the OwnEvents column. Even though status was propagated through the service tree from bottom to top, the OwnEvents rule worked for every level individually. Europe shows that bad events were noticed, but its OwnEvents column shows that 0 events affected that level.
Now, let’s make every level aware of the events happening on the level below it.
Prepare a policy like this:
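
The difference between the two columns can be modeled in a few lines of Python. This is only a sketch of the observed behavior, not of TBSM internals: I assume that the worst severity seen travels up to every ancestor while the own count stays on the instance that matched the event, and that purple and red correspond to OMNIbus severities 1 and 5.

```python
# Parent of each service instance in my three-level tree.
PARENT = {"Malopolska": "Poland", "Poland": "Europe", "Europe": None}

own_events = {name: 0 for name in PARENT}
worst_severity = {name: 0 for name in PARENT}


def send_event(target, severity):
    """Count the event on the target only, but pass its severity up the tree."""
    own_events[target] += 1
    node = target
    while node is not None:
        worst_severity[node] = max(worst_severity[node], severity)
        node = PARENT[node]


send_event("Malopolska", 1)  # assumed severity of the first (purple) test event
send_event("Poland", 5)      # critical (red) test event on the 2nd level

print(own_events)      # {'Malopolska': 1, 'Poland': 1, 'Europe': 0}
print(worst_severity)  # {'Malopolska': 1, 'Poland': 5, 'Europe': 5}
```

Europe ends up with the worst severity of 5 but an own count of 0, which is the situation in the scorecard.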
/* trigger_totalevents */
log("Triggered: "+ServiceInstance.STATEMODELNODE.trigger_totalevents.Value);

/* Status accumulates the total event count for this service instance */
Status = 0;

si = ServiceInstance.SERVICEINSTANCENAME+" ("+ServiceInstance.DISPLAYNAME+")";

/* Start with this instance's own event count, if it has a value already */
if(ServiceInstance.STATEMODELNODE.count_ownevents.Value <> NULL) {
   Status =  Int(ServiceInstance.STATEMODELNODE.count_ownevents.Value);
}

log("Service instance: "+si+" own events count: "+Status);

/* Walk through the direct children of this instance */
i = 0;
while (ServiceInstance.CHILDINSTANCEBEANS[i] <> NULL) {
   ci = ServiceInstance.CHILDINSTANCEBEANS[i].SERVICEINSTANCENAME+" ("+ServiceInstance.CHILDINSTANCEBEANS[i].DISPLAYNAME+")";

   if(ServiceInstance.CHILDINSTANCEBEANS[i].NUMCHILDREN > 0) {
      /* The child has children of its own: add the child's count_totalevents */
      grandChildEvents = 0;

      if(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_totalevents.Value <> NULL) {
         grandChildEvents = Int(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_totalevents.Value);
      }
      log("Service instance: "+si+", child: "+ci+" children events: "+grandChildEvents);

      Status = Status + grandChildEvents;
   } else {
      /* The child is a leaf: add the child's count_ownevents */
      childOwnEvents = 0;
      if(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_ownevents.Value <> NULL) {
         childOwnEvents = Int(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_ownevents.Value);
      }
      log("Service instance: "+si+", child: "+ci+" own events: "+childOwnEvents);

      Status = Status + childOwnEvents;
   }

   i = i + 1;
}

log("Service instance: "+si+" total events count: "+Status);
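
To check the logic of the policy outside of TBSM, here is a minimal Python model of it. It mirrors the same decision for one service instance: start with the instance’s own count, then add count_totalevents for every child that has children and count_ownevents for every leaf child. The dictionaries and the event counts are hypothetical stand-ins for the ServiceInstance object and its rule values.

```python
def total_for(instance, rule_values, children):
    """Model of the policy for one instance, reading only one level below it."""
    own = rule_values[instance].get("count_ownevents")
    status = int(own) if own is not None else 0
    for child in children.get(instance, []):
        if children.get(child):
            # The child has children of its own: use its stored total.
            value = rule_values[child].get("count_totalevents")
        else:
            # The child is a leaf: use its own event count.
            value = rule_values[child].get("count_ownevents")
        status += int(value) if value is not None else 0
    return status


children = {"Europe": ["Poland"], "Poland": ["Malopolska"]}
rule_values = {
    "Malopolska": {"count_ownevents": 2},
    "Poland": {"count_ownevents": 2},
    "Europe": {"count_ownevents": 1},
}

# Evaluate bottom-up, so each parent finds its child's total already stored.
for name in ["Malopolska", "Poland", "Europe"]:
    rule_values[name]["count_totalevents"] = total_for(name, rule_values, children)

print(rule_values["Europe"]["count_totalevents"])  # 5
```

In this model a parent’s result is only right if the child’s total was stored first, which is why the loop runs from the leaf to the root.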

I called this policy count_totalevents_policy_1 and saved it within a numerical formula rule called count_totalevents.
Figure 11. TotalEvents rule settings

At the same time, create another rule, a numerical aggregation rule, in which you point to the rule just created within the same template. Make sure you name this rule exactly as indicated in the header of the policy in the numerical formula created a moment ago (trigger_totalevents).
Figure 12. TriggerTotalEvents rule settings

By the end, you should have the following list of rules in your template:
Figure 13. T_Regions template complete rules set

Note. After you create a template rule that points to the same template as a child template, the template will disappear from the templates list in the service navigator portlet. To fix this, associate that template with any other template via any type of status propagation rule:
Figure 14. T_Regions template associated to templateFinder

And this is the result you should see in your scorecard at the end:
Figure 15. TotalEvents column in a scorecard

It looks like the concept works fine. Let’s test it further by sending another event to every level, from Malopolska to Poland to Europe.
Figure 16. TotalEvents column after sending more test events

It looks correct: every level’s OwnEvents count increased by 1, and I have 5 events in total in the entire tree: 2 on the leaf, another 2 in the middle and 1 on the root level.
Let’s add a new level below Malopolska and call it Krakow. This simulates expanding the service tree, e.g. after a fresh import from TADDM or a CMDB.
Figure 17. OwnEvents and TotalEvents after adding a new child service

Let’s now send a new event with Severity 3 to Krakow:
Figure 18. OwnEvents and TotalEvents after sending a test event to the new child service

The new event affected Krakow and was correctly included in the TotalEvents count on all levels. Let’s now create one level above all the others, called Earth:
Figure 19. OwnEvents and TotalEvents after adding a new root service

Adding Earth didn’t change the TotalEvents count, of course, but the current maximum was reflected on the new top/root level. Let’s send another event to Poland:
Figure 20. OwnEvents and TotalEvents after sending test events to the new root service

The total event count increased by 1 again. Only Europe’s OwnEvents column value increased by 1.
Let’s now remove Krakow from the leaf level to see whether the TotalEvents count decreases by 1:
Figure 21. OwnEvents and TotalEvents after removing the child service from the tree

So it is correct again: after removing Krakow with its 1 event, the overall TotalEvents count dropped by 1 too and now equals 6.
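
The whole walk-through can be double-checked with a simple running total. The short Python sketch below replays the steps from this article as changes to the tree-wide TotalEvents count; the step labels are my own shorthand, and the numbers are the ones reported above.

```python
# Each step: (what happened, change to the tree-wide TotalEvents count).
steps = [
    ("five test events in the three-level tree", +5),
    ("Krakow added, no events yet", 0),
    ("event sent to Krakow", +1),
    ("Earth added as the new root", 0),
    ("one more test event sent", +1),
    ("Krakow removed with its 1 event", -1),
]

total = 0
history = []
for label, delta in steps:
    total += delta
    history.append(total)
    print(f"{label}: {total}")

print(history)  # [5, 5, 6, 6, 7, 6]
```

Adding or removing an empty level leaves the total unchanged; only events, or the removal of an instance that holds events, move it.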