Tuesday, May 10, 2016

Unique Grand Children Count in TBSM



Introduction

Tivoli Business Service Manager (TBSM) can calculate quite amazing things for you when you need them. This is thanks to the powerful rules engine at the core of TBSM and to the Netcool/Impact policy engine running under the hood of every TBSM edition. You can then present the calculation results on a dashboard or in reports, depending on whether you need a real-time scorecard or historical KPI reports.
In this article, I’ll show how you can use the TBSM rules engine to calculate a unique grandchildren count for a grandparent-level service instance. It isn’t really documented anywhere and the use case isn’t very common, but in case you need it, you can find it here.
In this material I will use the following hierarchy of three templates:
  • T_NetworkSite – acting as grandparent template level
  • T_Interface – acting as parent template level
  • T_Router – acting as child template level
An Interface as the parent of a Router, you may ask? It is not really what various documents promote, and definitely not something documented here:
Well, this depends very much on what you want to present in TBSM dashboards and how, so it depends on what your business service is about. The example in the article I mention above concentrates more on VPN services:
Figure 1. Source: https://www.ibm.com/support/knowledgecenter/api/content/nl/en-us/SSSPFK_6.1.1.3/com.ibm.tivoli.itbsm.doc/bsma/10/images/bsma_cust_sm_network_topology.jpg



In my example, I’m concentrating on Layer 2 connectivity. In other words, I cannot connect to my network site, or it is unavailable, if all router interfaces are down. All router interfaces can be down while the routers themselves are up; it doesn’t matter, it means the same thing for the service: an outage. Likewise, if whole routers get switched off, their interfaces go down too, so my network site becomes unavailable as well.
Figure 2. Templates hierarchy used in this material
The desired effect is the following:
  • There is one grandparent KrakowSite
  • There are 2 routers in total
  • There are 4 interfaces in total, 2 per router
Figure 3. Access to Krakow network site - business service sample diagram
In other words, KrakowSite should report 4 installed interfaces but only 2 router devices. The scorecard below is what we will be building during this exercise.

Figure 4. Target scorecard to build
Before I continue, I need to introduce the Heartbeat and PassToTBSM concepts.

PassToTBSM and Heartbeat

PassToTBSM is an Impact function that sends any data from a Netcool/Impact policy straight to TBSM. It doesn’t have to be the same Impact that runs jointly with TBSM on the same server; it can be a standalone Impact server too (but I haven’t tried that). It can be either Impact 6.1 or Impact 7.1 (the latter was announced not to have PassToTBSM, but I hear it’s still there; I haven’t tested that myself either).
A policy that sends data to TBSM with the PassToTBSM function can look as follows:
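// Take the current time; the timestamp is the value that changes on every run
Seconds = GetDate();
Time = LocalTime(Seconds, "HH:mm:ss");

// Build the event and address it to the instances that carry the "AnyChild" identifier
ev=NewEvent("TBSMTreeRuleHeartbeatService");
ev.timestamp = String(Time);
ev.bsm_identity = "AnyChild";
PassToTBSM(ev);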

So we construct an IPL policy in which we take the current time (it is important to have at least one changing value; I’ll explain why in another article on this blog) and specify the service instance identifier that the affected service instance is expected to have defined for its incoming status rules or numerical rules. Because I’m going to affect two routers, RouterA and RouterB, I specify something generic like “AnyChild”. I could also send two events to TBSM, one with ev.bsm_identity=”RouterA” and the other with ev.bsm_identity=”RouterB”. In large implementations it is easier to specify something generic like AnyChild and add such an identifier to every service instance automatically during the import process via SCR API/XMLtoolkit.
I named the policy TBSMTreeRulesHeartbeat.
Such a policy now needs to be called by an Impact service:

Figure 5. Impact service to run the heartbeat policy
Make note. Alternatively, a data fetcher could be used, which can also be scheduled to run every 30 seconds or even once a day at 12:00 AM or at another time. However, I wanted to show the PassToTBSM function in action, and in large solutions you may not want to run an SQL SELECT statement against a database just to drive such a heartbeat function. You could also create a policy fetcher, but that takes more skill since there’s no UI for it in TBSM.
Make note. Such a service doesn’t really need to be added to any of the Impact projects.
Now, in order to use such a service and policy in a numerical rule in TBSM, you do two things: you set that service as the data source and you set the field mapping. I have created my HeartbeatRule in TBSM with the following settings:
Figure 6. Numerical Rule with heartbeat service as data feed

Then in Customize Fields form you should have:
Figure 7. Custom fields mapping

Save this rule to your LEAF template:
Figure 8. Heartbeat rule in the LEAF template definition

And the last thing: don’t forget to make sure your service instances have “AnyChild” instance identifier specified:
Figure 9. Adding new instance ID - AnyChild
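To picture why a generic identifier is convenient, here is a small Python model (not TBSM code) of how one heartbeat event could reach every instance that lists its identifier. The ServiceInstance class, its identifiers field and the route function are hypothetical names made up for this sketch; TBSM does the real matching internally.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceInstance:
    """Hypothetical stand-in for a TBSM service instance."""
    name: str
    identifiers: set = field(default_factory=set)


def route(event, instances):
    """Return the names of instances whose identifiers contain the event's bsm_identity."""
    return [si.name for si in instances if event["bsm_identity"] in si.identifiers]


instances = [
    ServiceInstance("RouterA", {"RouterA", "AnyChild"}),
    ServiceInstance("RouterB", {"RouterB", "AnyChild"}),
    ServiceInstance("KrakowSite", {"KrakowSite"}),
]

heartbeat = {"timestamp": "12:00:30", "bsm_identity": "AnyChild"}
targeted = {"timestamp": "12:00:30", "bsm_identity": "RouterA"}

print(route(heartbeat, instances))  # ['RouterA', 'RouterB']
print(route(targeted, instances))   # ['RouterA']
```

With the generic identifier a single event is enough for both routers, while the targeted variant would need one event per instance.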

What is all this for, you may ask?
The answer is: we will be calculating the unique number of grandchildren in one of the TBSM functions. All functions in TBSM need a trigger, which is an input value that changes, in order to return a fresh value. If the input value doesn’t change, you won’t see a new value on the output. The output can stay the same, but your rule won’t run at all if you don’t trigger it from outside somehow. An example? Sure:
On the next level in the template hierarchy there will be a NumberOfRouters rule defined (and the heartbeat rule too):
Figure 10. T_Interface template's rules list

Let’s see inside the NumberOfRouters rule:
Figure 11. NumberOfRouters rule definition

This rule returns the output value of the function NumberOfAllChildren, defined in the policy NumericAttributeFunctions.ipl, every time the HeartbeatRule triggers it.
In other words, the number of routers below the interfaces won’t change in the output of this function, even if it really changes (grows or shrinks), unless the rule is kicked again.
So you need that extra rule on the children level, like HeartbeatRule, running periodically every 30 seconds and returning a fresh timestamp each time, to ensure a different output value every time it runs.
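The trigger idea can be illustrated with a toy Python model. This is only a sketch of the behaviour described above, not how TBSM is implemented; the TriggeredRule class and its feed method are invented for the illustration.

```python
class TriggeredRule:
    """Toy model: a rule that recomputes only when its trigger input changes."""

    def __init__(self, compute):
        self.compute = compute      # function producing the rule output
        self.last_trigger = None    # last trigger value seen
        self.output = None          # cached rule output

    def feed(self, trigger_value):
        # An unchanged trigger leaves the cached output untouched
        if trigger_value != self.last_trigger:
            self.last_trigger = trigger_value
            self.output = self.compute()
        return self.output


children = ["RouterA"]
rule = TriggeredRule(lambda: len(children))

print(rule.feed("12:00:00"))  # 1
children.append("RouterB")    # the tree changes...
print(rule.feed("12:00:00"))  # 1  (same trigger value, stale output)
print(rule.feed("12:00:30"))  # 2  (new timestamp, rule recalculated)
```

This is why the heartbeat carries a timestamp: it is a cheap value that is different on every run.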
Why so much hassle, you may say?
Why not use ServiceInstance.NUMCHILDREN inside a policy-based numerical formula?
Well, first of all, a numerical formula is also a rule, and it also needs a trigger to run. Every rule in TBSM needs a trigger to run. I may dedicate a separate post to that topic.
Second of all, I do use ServiceInstance.NUMCHILDREN; check out my policy function:
function NumberOfAllChildren(ChildrenStatusArray, AllChildrenArray, ServiceInstance, Status) {
   Status = ServiceInstance.NUMCHILDREN;
}

So this policy, or rather this function, will return the NUMCHILDREN value any time you trigger the rule.
The main reason for the hassle is that, unfortunately, you cannot use NUMCHILDREN directly on a scorecard; you can only return it from rules. And rules need a trigger. NUMCHILDREN also isn’t an additional attribute, which could be shown directly in a JazzSM dashboard.
Is it clear? I know, it’s a bit weird, but only at first sight.
You may also wonder: why am I using ServiceInstance.NUMCHILDREN? Is there any other attribute that returns the same value? Why am I using TIP rather than JazzSM in my examples? The answer is that there’s no additional attribute holding the number of children that you could show in JazzSM directly, without wrapping it in a rule (and in TIP you cannot show an additional attribute without wrapping it in a rule at all). So you have two choices:
  1.    Use ServiceInstance object’s field NUMCHILDREN – see above
  2.    Use a policy that will iterate through an array of children objects of your service instance and return the array’s length.
As you can see, it is still a policy, so a numerical aggregation rule or a numerical formula rule must still be used. There’s really no other way: rules are the way to go, and you need to trigger them.

Recalculate correct number of objects after server restart

There’s an alternative to the Heartbeat rule. Starting with TBSM 6.1.1 FP2, you can run a recalculation policy and associate it with the server start, run it manually from time to time, or schedule it with an Impact service. There are actually two policies: one for all nodes and the other just for leafs.
Leafs
USE_SHARED_SCOPE;
Type="StateModel";
Filter = "RECALCSTATENODESLEAF";
log("Recalc Leaf Node Only. Policy Start." );
GetByFilter(Type, Filter, false);
log("Recalc Leaf Node Only. Policy Finish." );

All nodes
USE_SHARED_SCOPE;
Type="StateModel";
Filter = "RECALCSTATENODESALL";
log("Recalc All Nodes. Policy Start." );
GetByFilter(Type, Filter, false);
log("Recalc All Nodes. Policy Finish." );

This alternative is documented here:
The difference between my heartbeat solution and the policy documented above is that my heartbeat function is selective: I decide which elements of the service tree will be recalculated (not just leafs, but also not the entire service tree) and when (not just during a restart, but every now and then). This is important, because the number of children on some intermediate levels may change independently of changes in the number of children on the leaf level, and I still need to trigger that change. At the same time, recalculating the whole tree is an effort for TBSM, especially if I have 100k instances in my service tree. That’s why I prefer to keep it selective and use the Heartbeat concept.


Unique grandchildren count rule

Now that we have the children count rule created and triggered, it’s time to build the unique grandchildren count rule.
What’s the difference?
It’s simple: you don’t want to sum up your children’s children counts, because every Interface will report that it has 1 child Router, which gives you 4 routers while the true number is just 2.

So you need a smart Impact policy that will calculate that for you.
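To make the double counting concrete, here is the arithmetic for the Krakow example in a few lines of Python. The interface names are illustrative and not taken from the figures.

```python
# Interface -> routers below it, as in the Krakow example (names are illustrative)
routers_below = {
    "IfcA1": ["RouterA"],
    "IfcA2": ["RouterA"],
    "IfcB1": ["RouterB"],
    "IfcB2": ["RouterB"],
}

# Summing each interface's child count counts every router twice
naive_sum = sum(len(routers) for routers in routers_below.values())

# Collecting the routers in a set counts each of them once
unique_count = len({r for routers in routers_below.values() for r in routers})

print(naive_sum)     # 4
print(unique_count)  # 2
```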

Now that we’re clear on which rules need to be created on the Router level and the Interface level, it’s time to present the rules on the NetworkSite template level:
Figure 12. Rules defined inside T_NetworkSite template

The NumberOfInterfaces rule just calculates the number of interfaces below the network site, and inside that rule the same function NumberOfAllChildren is called from NumericAttributeFunctions.ipl. The trigger should be the heartbeat rule again, since the number of interfaces inside the site may change independently. As you could see above, I defined a heartbeat rule inside the T_Interface template and I called it HeartbeatRuleIfc.
The more interesting rule is UniqueGrandChildren, which runs another function from the NumericAttributeFunctions policy, called NumberOfUniqueGrandChildren:

function NumberOfUniqueGrandChildren(ChildrenStatusArray, AllChildrenArray, ServiceInstance, Status) {
   i = 0;

   uniquegrandchildrenarray = {};
   log("MP: "+ServiceInstance);
   while(i<length(ServiceInstance.CHILDINSTANCEBEANS)) {
      child = ServiceInstance.CHILDINSTANCEBEANS[i];
      log("Child "+child.DISPLAYNAME+" of grand parent "+ServiceInstance.DISPLAYNAME+" was found.");

      j = 0;
      while(j<length(child.CHILDINSTANCEBEANS)) {
         grandchild = child.CHILDINSTANCEBEANS[j];
         log("Child "+grandchild.DISPLAYNAME+" of child "+child.DISPLAYNAME+" was found.");

         // Testing if currently analyzed child has already occurred
         k = 0;
         occurence = 0;
         while(k<length(uniquegrandchildrenarray)) {
            if(uniquegrandchildrenarray[k].SERVICEINSTANCEID == grandchild.SERVICEINSTANCEID) {
               // if yes, flag it as a duplicate (occurence = 1)
               occurence = 1;
               log("Duplicate found: "+uniquegrandchildrenarray[k].SERVICEINSTANCEID+" and "+grandchild.SERVICEINSTANCEID+". Skipping.");
               // k = length(uniquegrandchildrenarray); // uncomment to leave the loop early on large child arrays; keep it after the log line, which still reads index k
            }
            k=k+1;
         }
       
         if(occurence == 0) {
            uniquegrandchildrenarray = uniquegrandchildrenarray + grandchild;
            log("Unique grand child found: "+grandchild.DISPLAYNAME+". Added to the list.");
         }
         j = j + 1;
      }
      i = i + 1;
   }
   Status = length(uniquegrandchildrenarray);
   log("Grand parent "+ServiceInstance.DISPLAYNAME+" has # grand unique children "+Status);
}


So basically the function traverses the service tree two levels down to the grandchildren level and collects the grandchildren in an array, identifying them by their SERVICEINSTANCEID. A grandchild that is already in the array is flagged as a duplicate and skipped; a new one is appended to the array. The size of the array is the returned value.
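For readers who think more easily in a general-purpose language, the same traversal can be sketched in Python. This is a model of the logic, not a replacement for the IPL function; the Node class and its fields are hypothetical stand-ins for the service instance beans. Here a set of instance IDs replaces the inner duplicate-search loop.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a service instance bean."""
    instance_id: int
    name: str
    children: list = field(default_factory=list)


def count_unique_grandchildren(grandparent):
    """Count distinct grandchildren by instance_id, two levels below grandparent."""
    seen_ids = set()
    for child in grandparent.children:
        for grandchild in child.children:
            seen_ids.add(grandchild.instance_id)  # adding a known id is a no-op
    return len(seen_ids)


router_a = Node(1, "RouterA")
router_b = Node(2, "RouterB")
site = Node(100, "KrakowSite", [
    Node(11, "IfcA1", [router_a]),
    Node(12, "IfcA2", [router_a]),
    Node(21, "IfcB1", [router_b]),
    Node(22, "IfcB2", [router_b]),
])

print(count_unique_grandchildren(site))  # 2
```

The set lookup avoids rescanning the list of known grandchildren for every new one, which is the same cost the commented-out early exit in the IPL function tries to reduce.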

Is it simple? Not so much, but it’s probably one of those functions you implement once and use all the time, so it’s worth learning. Let’s see the rule at the end:
Figure 13. NumberOfUniqueGrandChildren rule

So this is your desired effect:
Figure 14. Unique GrandChildrenCount on the scorecard

I hope that you like this type of small hints on how to achieve something useful in TBSM. If so, please comment and I'll try to post as many posts of this type as I can. Thanks!


Friday, April 29, 2016

TBSM 6.1.1 Fix pack 4 released!

Fix pack 4 was released tonight. See the full list of APARs and improvements, and get the downloads, here:

http://www-01.ibm.com/support/docview.wss?uid=swg24041505

Please note that there are a few manual actions to take, apart from installing the fix, to fully benefit from a few of the new features included, which are:

  1. RADEVENTSTORE index creation - to prevent TBSM Event reader from hanging
  2. Installation of the new right-click context menu item "Delete Service Instance" 
  3. SLAPrunePolicyActivator - Impact policy activator service to prune RADEVENTSTORE and 6 other tables in TBSM DB.
  4. TIP 2.2.0.17 version 3 installation - is now certified to use with TBSM
mp

Tuesday, April 26, 2016

Total Event Count in TBSM

Introduction

Tivoli Business Service Manager (TBSM) can calculate quite amazing things for you when you need them. This is thanks to the powerful rules engine at the core of TBSM and to the Netcool/Impact policy engine running under the hood of every TBSM edition. You can then present the calculation results on a dashboard or in reports, depending on whether you need a real-time scorecard or historical KPI reports.
In this article, I’ll show how to calculate a total event count throughout a multi-level service tree. TBSM doesn’t do this right after a fresh install because it doesn’t provide the right rules out of the box. TBSM also doesn’t come with any predefined service tree, so in order to see this working you need to do both: add the rules to your service templates and import or create by hand a service tree structure to test with.
In this material, I’ll create a simple, multi-level service tree consisting of 3 levels of instances and I’ll use my own template, T_Regions. To repeat this exercise you can also simply reuse the template SCR_ServiceComponentRawStatusTemplate, which comes with every TBSM installation and is widely used in integrations with Tivoli Application Dependency Discovery Manager (TADDM). The key thing is that your template must:
-        Have at least 1 incoming status rule
-        Be in use across the whole service tree, so all service instances on all levels in your service tree implement that template.


Figure 2. Incoming Status Rule body used in this exercise

Figure 3. Simple service tree used in this material

Make note. This document re-implements functionality that already exists, namely calculating the total number of events on every service tree level, the result of which is stored in the numRawEventsInt parameter. This parameter is visible as the last value in the RAD_prototype widget typically used on Custom Canvases in TBSM dashboards created in Tivoli Integrated Portal. But the parameter value isn’t accessible to numerical rules or policies for further processing.
Figure 4. numRawEventsInt value used on RAD_prototype widget

The newest add-on to TBSM, the debug Spy tools, also offers a parameter for every service tree level, called Matching Events. Although that value is correct too, it isn’t accessible from numerical rules or policies either.
Make note. There is a BSM Accelerator template, called BSMAccelerator_EventCount, which was designed to present the correct number of events for every service instance. However, it was tailored to the needs and service tree structure of BSM Accelerator and doesn’t scale to service trees of arbitrary depth. Some of the concepts introduced to support the BSM Accelerator package will still be covered in this document. If you want to read more, see this document:
Make note. TBSM 6.1.1 FP3 is a prerequisite for all rules described in this material to work correctly. However, it is highly recommended to install Fix Pack 4 or higher to get the latest improvements.


What is the multi-level events count?

TBSM runs an Impact service called TBSMOMNIbusEventReader, which comes with the product out of the box and is responsible for reading events from Netcool/OMNIbus on a regular basis (every 3000 milliseconds by default) and for selecting the events to be processed by TBSM with its special predefined filter.
Here’s the default filter:
(Class <> 12000) AND (Type <> 2) AND ((Severity <> RAD_RawInputLastValue) or (RAD_FunctionType = '')) AND (RAD_SeenByTBSM = 0) AND (BSM_Identity <> '')

All events that pass that filter get processed further by TBSM service template rules, specifically by the kind called Incoming Status Rules. The most typical incoming status rule, ComponentRawEventStatusRule, predefined inside the SCR_ServiceComponentRawStatusTemplate template, has a precondition called a discriminator. Out of the events let in by the event reader, it filters out all those that don’t have one of the following classes:
·        TPC Rules(89200),
·        IBM Tivoli Monitoring Agent(87723),
·        Predictive Events(89300),
·        IBM Tivoli Monitoring(87722),
·        Default Class(0),
·        TME10tecad(6601),
·        Tivoli Application Dependency Discovery Manager(87721),
·        Precision [Start](8000),
·        MTTrapd(300),
·        Precision [End](8049)

Make note. In my example, my Incoming Status rule simply expects Default Class (0) in all my test events.
This is not the end. There’s one more filter. It is called the event identification field, and by default TBSM looks for its value in the event field called BSM_Identity. The value expected in that field comes from the service instance’s event identifier, which by default is the same as the service instance name. So the event identifiers for my simple service tree will be the following:
Service instance name    Event identifier
Europe                   Europe
Poland                   Poland
Malopolska               Malopolska

In this material I will not discuss how to maintain event identifiers, how many event identifiers you can have, or how to set up event identifiers in XMLtoolkit configuration files (if you’re interested in those topics, please see my private blog entry: http://www.marcinpaluch.pl/wordpress/?p=231). I will also not discuss how the event severity may affect service instance status; I go with the defaults in my example and will not focus on that area this time.
To sum it up: there are 3 filters your event has to pass before it affects your service instance:
a)      The TBSMOMNIbusEventReader’s filter
b)      The Incoming Status Rule discriminator / event class filter
c)      The event identifier
If your event made it through all the filters, you can call it a service instance affecting event.
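The three checks listed above can be sketched as a chain of Python predicates. The reader filter is a direct rendering of the default filter quoted earlier; everything else (the function names, the dictionary-based event, the single allowed class taken from my example rule) is an assumption made for this illustration and not TBSM code.

```python
ALLOWED_CLASSES = {0}  # the example rule accepts only Default Class (0)


def passes_reader_filter(ev):
    """Python rendering of the default TBSMOMNIbusEventReader filter."""
    return (
        ev["Class"] != 12000
        and ev["Type"] != 2
        and (ev["Severity"] != ev["RAD_RawInputLastValue"] or ev["RAD_FunctionType"] == "")
        and ev["RAD_SeenByTBSM"] == 0
        and ev["BSM_Identity"] != ""
    )


def passes_discriminator(ev):
    """Event class check of the Incoming Status Rule."""
    return ev["Class"] in ALLOWED_CLASSES


def matches_identifier(ev, identifiers):
    """Event identifier check against one service instance."""
    return ev["BSM_Identity"] in identifiers


def affects_instance(ev, identifiers):
    return passes_reader_filter(ev) and passes_discriminator(ev) and matches_identifier(ev, identifiers)


event = {
    "Class": 0, "Type": 1, "Severity": 5, "RAD_RawInputLastValue": 0,
    "RAD_FunctionType": "", "RAD_SeenByTBSM": 0, "BSM_Identity": "Poland",
}

print(affects_instance(event, {"Poland"}))                    # True
print(affects_instance(event, {"Europe"}))                    # False
print(affects_instance({**event, "Class": 300}, {"Poland"}))  # False
```

The last call shows an event that the reader lets in but the discriminator rejects, because class 300 is not accepted by the example rule.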
It doesn’t have to mean that your event changes your service instance status; it only means that your event was processed by the Incoming Status Rule implemented in your service instance’s template. If you use TBSM 6.1.1 FP4, you can use the Service Model Spy tool to see that your Incoming Status rule updated various attributes such as Matching Events (a number), Max Event Status (the event’s severity) and a timestamp of when the rule processed the event.
The Matching Events parameter is what I’ll be calling the EventCount in this material.
Now, why a multi-level event count?
Every service instance can have its own individual EventCount. Every level of the service tree can contain more than one service instance, and the best way to sum them up is to calculate their sum on their parent level. The parent service instance may itself implement a template with an Incoming Status rule and therefore have its own individual EventCount. That parent can in turn be one of many parent service instances, so the best way of summing them up is to calculate a TotalEventCount on the grandparent service instance level. And so on. So the multi-level event count is a feature that calculates the total number of events processed by TBSM in the whole service tree.
Why would you need it? There are several use cases possible:
-        Service tree consistency check and verification – in a development phase, to see if all levels of your service tree get processed correctly
-        Statistics – to see the current and true load on TBSM by source, class, alert type or any other event field, in order to analyze event storms and their causes
-        Monitoring the operations – for example, to compare the total event count to the total acknowledged event count, to the total count of events escalated by opening an incident, etc.
-        Monitoring service component quality – especially important when service components are managed or provided by a 3rd party provider, since you can assess how much trouble each of them gives your company or your operations team
Once the use case is agreed, you may want to use this material to start collecting your total event counts in order to present them on a dashboard or in a report. Let me now explain how to set it up.


Implementation

As the first step, let’s make sure I’m collecting the event count for each of my service tree elements. Let me create my new rule: the OwnEvents count.
Make note. This step has a prerequisite: I need to have my Incoming Status rule already created.
This is perhaps not well documented, but every Incoming Status Rule can be used in a Numerical Formula rule to get the number of events processed. It is documented in this technote:
So let me do exactly what the technote does. This is my numerical formula, a rule called OwnEvents, which returns the count of non-clear events only, via the Incoming Status Rule’s default (since TBSM 6.1.1 FP1) parameter NumEventsSevGE2. Whenever my Incoming Status Rule processes another event with severity 1 or higher, the output of my numerical formula refreshes and increases by 1.
Figure 5. OwnEvents rule settings

And on my scorecard:
Figure 6. OwnEvents in a scorecard

Let’s send a test event to the last level now:
Figure 7. Sending test event
Figure 8. Test event settings

Figure 9. OwnEvents after sending test event

As you can see, the event’s severity was passed all the way up the service tree, which is why the icon in the Events column changed color to purple from the bottom level right up to the top one.
After sending a critical event to the 2nd level, the icons from the 2nd level to the top one changed their color to red.
Figure 10. OwnEvents after sending 2nd test event

Make note. I haven’t created a status propagation rule to perform this exercise. And I will not!
Take a look at the OwnEvents column. Even though the status was propagated through the service tree from the bottom to the top, the OwnEvents rule worked for every level individually. Europe shows bad events noticed in the Events column, but its OwnEvents column shows that 0 events affected that level.
Now, let’s try to make every level aware of events happening on the level below it.
Prepare the following policy:
/* trigger_totalevents */
log("Triggered: "+ServiceInstance.STATEMODELNODE.trigger_totalevents.Value);

Status = 0;

si = ServiceInstance.SERVICEINSTANCENAME+" ("+ServiceInstance.DISPLAYNAME+")";

if(ServiceInstance.STATEMODELNODE.count_ownevents.Value <> NULL) {
   Status =  Int(ServiceInstance.STATEMODELNODE.count_ownevents.Value);
}

log("Service instance: "+si+" own events count: "+Status);

i = 0;
while (ServiceInstance.CHILDINSTANCEBEANS[i] <> NULL) {
   ci = ServiceInstance.CHILDINSTANCEBEANS[i].SERVICEINSTANCENAME+" ("+ServiceInstance.CHILDINSTANCEBEANS[i].DISPLAYNAME+")";

   if(ServiceInstance.CHILDINSTANCEBEANS[i].NUMCHILDREN > 0) {
      grandChildEvents = 0;

      if(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_totalevents.Value <> NULL) {
         grandChildEvents = Int(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_totalevents.Value);
      }
      log("Service instance: "+si+", child: "+ci+" children events: "+grandChildEvents);

      Status = Status + grandChildEvents;
   } else {

      childOwnEvents = 0;
      if(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_ownevents.Value <> NULL) {
         childOwnEvents = Int(ServiceInstance.CHILDINSTANCEBEANS[i].STATEMODELNODE.count_ownevents.Value);
      }
      log("Service instance: "+si+", child: "+ci+" own events: "+childOwnEvents);

      Status = Status + childOwnEvents;
   }

   i = i + 1;
}

log("Service instance: "+si+" total events count: "+Status);

I called this policy count_totalevents_policy_1 and saved it within a numerical formula rule called count_totalevents.
Figure 11. TotalEvents rule settings
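The arithmetic behind the policy can be checked with a short Python model. It is a sketch under assumptions: the Instance class is hypothetical, the event counts are illustrative, and the model recurses through the tree, whereas the policy reads each child's already calculated count_totalevents value. Once every level has been calculated, both approaches should give the same totals.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for a service instance with its own event count."""
    name: str
    own_events: int = 0
    children: list = field(default_factory=list)


def total_events(instance):
    """Own events plus the totals of all children (a leaf's total is its own count)."""
    return instance.own_events + sum(total_events(c) for c in instance.children)


# Illustrative counts: 2 events on the leaf, 2 in the middle, 1 on the root
malopolska = Instance("Malopolska", own_events=2)
poland = Instance("Poland", own_events=2, children=[malopolska])
europe = Instance("Europe", own_events=1, children=[poland])

for si in (malopolska, poland, europe):
    print(si.name, total_events(si))
# Malopolska 2
# Poland 4
# Europe 5
```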

At the same time, create another rule, a numerical aggregation rule, in which you point to the rule just created within the same template. Make sure you name this rule exactly as indicated in the header comment of the policy in the numerical formula created a moment ago (trigger_totalevents).
Figure 12. TriggerTotalEvents rule settings

In the end, you should have the following list of rules in your template:
Figure 13. T_Regions template complete rules set

Make note. After you create a template rule that points to the same template as its child template, the template will disappear from the templates list in the service navigator portlet. To fix that, add the template to any other template by associating it via any type of status propagation rule:
Figure 14. T_Regions template associated to templateFinder

And this is the result that you should see in your scorecard:
Figure 15. TotalEvents column in a scorecard

It looks like the concept works fine. Let’s try it further. Let’s send another event from every level, starting from Malopolska to Poland and to Europe.
Figure 16. TotalEvents column after sending more test events

It looks correct: every level’s OwnEvents count increased by 1 and I have 5 events in total in the entire tree: 2 on the leaf, another 2 in the middle and 1 on the root level.
Let’s add a new level below Malopolska and call it Krakow. This simulates expanding the service tree, e.g. after a fresh import from TADDM or a CMDB.
Figure 17. OwnEvents and TotalEvents after adding a new child service

Let’s now send a new event with Severity 3 to Krakow:
Figure 18. OwnEvents and TotalEvents after sending a test event to the new child service

The new event affected Krakow and was correctly included in the TotalEvents calculations on all levels. Let’s now create one level above all the others, called Earth:
Figure 19. OwnEvents and TotalEvents after adding a new root service

Adding Earth didn’t change the TotalEvents count of course, but the current max was reflected on the new top/root level. Let’s send another event to Poland:
Figure 20. OwnEvents and TotalEvents after sending test events to the new root service

The total event count increased by 1 again. Only Europe’s OwnEvents column value increased by 1.
Let’s now remove Krakow from the leaf level to see if the TotalEvents count decreases by 1:
Figure 21. OwnEvents and TotalEvents after removing the child service from the tree

So it is correct again: after removing Krakow with its 1 event, the overall TotalEvents count dropped by 1 too and now equals 6.