algorithm - chain of events analysis and reasoning -


My boss said the logs in their current state are not acceptable for the customer. If there is a fault, a dozen different modules of the device report their own errors, and they all land in the logs. The original reason for the fault may be buried somewhere in the middle of the list, may not appear on the list at all (the given module being too damaged to report), or may appear late, after everything else has finished reporting problems that result from the original fault. In any case, few people outside the system's developers can interpret the logs and work out what happened.

My current task is writing a module for customer-friendly fault reporting. That is: gather the events reported in the last ~3 seconds (which is about the maximum interval between the origin of the fault and the last resulting after-effects), do some magic processing of this data, and come up with one clear, friendly line saying what is broken and needs to be fixed.

The problem is the magic part: how, given a number of fault reports, to arrive at the original source of the fault. There is no simple list of cause-effect pairs. There are only commonly occurring chains of events that display some regularities.

Examples:

  • A short circuit is detected, resulting in limited operation mode. Limited operation does not remove the fault, so the emergency state is escalated and total output power is disconnected.
  • The safety line got engaged. No module reported engaging it within 3 s of it being engaged, so "unknown source or interference" is attributed as the reason for the system halt.
  • Most output modules report no output voltage. About 1 s later the power supply monitoring module reports power out, which is the original reason.
  • An output module reports no output voltage on all of its output lines. There is no report from the power supply module. The reason is a power line disconnected from the module.
  • An output module reports no output voltage on 1 of its output lines. No other faults are reported. The reason is a burnt fuse.
  • An output module did not report applying a received state. Shortly after, the control module reports an illegal state on the output lines (resulting from the output module not updating its state in a timely manner). The cause is the output module (which introduced the fault), not the control module (which halted the system due to the fault it detected).
  • A fault of an input module switches the device to backup-failsafe mode. An output module not used so far, which is faulty, gets engaged in this mode, and the fault mode gets escalated to critical. The original reason is not the input module, which is allowed to report false positives concerning faults, but the broken backup output that aborted the operation.
  • There has been no activity of any kind from an output module for the last 2 seconds. This means it is broken and fault mode must be entered.

There is no comprehensive list of rules for what causes what. Rules will be added as new kinds of faults occur "in the wild" and are diagnosed and fixed. Some of them are heuristics - if this error is accompanied by these errors, then the fault is this. Some faults will not be solved - a bland list of module reports will have to suffice. Some answers will be ambiguous: one set of symptoms may suggest 2 different faults. It is more of a "best effort" problem than a "guaranteed solution" one.

Now for the (overly general and vague) question: how do I solve this? Are there specific algorithms, methods or generalized solutions for this kind of problem? How do I write generalized rulesets and match against them? How do I do soft-matching? (Say an input module broke right in the middle of an emergency halt; it's an unrelated event that should be ignored.) Any pointers, please?

In all honesty, you could just write a series of simple rules and be done with it. It would be a pain maintenance-wise, but getting anything more sophisticated right may be time consuming and brittle.
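As a rough illustration of the "series of simple rules" option, here is a minimal sketch. The token names, rule names and messages are all hypothetical, made up from the examples in the question; a rule fires when all its required tokens were seen and none of its forbidden ones were. Rules are ordered from most to least specific and the first match wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    required: frozenset   # tokens that must all be present
    forbidden: frozenset  # tokens that must all be absent
    message: str          # customer-friendly line to show


# Hypothetical rules, ordered from most to least specific.
RULES = [
    Rule("psu_failure",
         frozenset({"OUT_NO_VOLTAGE", "PSU_POWER_OUT"}),
         frozenset(),
         "Power supply failed."),
    Rule("power_line",
         frozenset({"OUT_NO_VOLTAGE_ALL_LINES"}),
         frozenset({"PSU_POWER_OUT"}),
         "Power line disconnected from an output module."),
    Rule("burnt_fuse",
         frozenset({"OUT_NO_VOLTAGE_ONE_LINE"}),
         frozenset({"PSU_POWER_OUT", "OUT_NO_VOLTAGE_ALL_LINES"}),
         "A fuse on an output line is burnt."),
]


def diagnose(tokens):
    """Return the message of the first matching rule, or None if no rule matches."""
    seen = set(tokens)
    for rule in RULES:
        if rule.required <= seen and not (rule.forbidden & seen):
            return rule.message
    return None  # caller falls back to the plain list of module reports


print(diagnose(["OUT_NO_VOLTAGE", "PSU_POWER_OUT", "INPUT_FAULT"]))
```

Tokens that no rule mentions (like `INPUT_FAULT` here) are simply ignored, which gives a crude form of soft-matching for unrelated events. A `None` result corresponds to the "bland list of module reports" fallback.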

If you insist, I'd approach this by having each error drop some sort of symbol/token for each error code - you'll make this harder if you try to do bag-of-words/keyword matching on the log text. Then feed all the emitted tokens into some sort of classifier.
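One possible shape for the token step, as a sketch only: the event tuple layout `(timestamp, module, code)`, the module names, the error codes and the `TOKEN_TABLE` mapping are all assumptions for illustration. It keeps only the events from the ~3 second window described in the question and maps each known error code to a token.

```python
WINDOW_SECONDS = 3.0  # the ~3 s window from the question

# Hypothetical mapping from (module, error code) to a symbolic token.
TOKEN_TABLE = {
    ("psu_monitor", 17): "PSU_POWER_OUT",
    ("output", 4): "OUT_NO_VOLTAGE",
}


def tokens_in_window(events, now):
    """Collect tokens for events no older than WINDOW_SECONDS.

    events: iterable of (timestamp, module, code) tuples, timestamps in seconds.
    Events with no entry in TOKEN_TABLE are skipped.
    """
    tokens = set()
    for timestamp, module, code in events:
        if 0 <= now - timestamp <= WINDOW_SECONDS:
            token = TOKEN_TABLE.get((module, code))
            if token is not None:
                tokens.add(token)
    return tokens


events = [
    (5.0, "psu_monitor", 17),   # too old, outside the window
    (10.0, "output", 4),
    (11.5, "input", 99),        # unknown code, no token
]
print(tokens_in_window(events, now=12.0))
```

The resulting set of tokens is what a rules engine or classifier would take as input.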

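At heart, you need some sort of rules engine - either fuzzy or exact. The first thing that comes to mind is a hand-built Bayesian network. This would allow you to do fuzzy matching and calculate the most probable 'report' as a function of the tokens you receive. It also allows you to set a threshold, so that token groups that aren't indicative of anything specific produce no diagnosis, by specifying a minimum probability required to return an answer.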
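To make the idea concrete, here is a minimal sketch using the simplest possible Bayesian network: one fault node with each token depending only on it (a naive Bayes model). This is a simplification swapped in for illustration, not a full network, and every fault name, token name and probability below is invented; in practice the numbers would be hand-tuned or estimated from labeled incidents.

```python
import math

# Hypothetical prior P(fault) and likelihoods P(token seen | fault).
PRIORS = {"psu_failure": 0.3, "burnt_fuse": 0.5, "power_line": 0.2}
LIKELIHOODS = {
    "psu_failure": {"OUT_NO_VOLTAGE": 0.90, "PSU_POWER_OUT": 0.95},
    "burnt_fuse":  {"OUT_NO_VOLTAGE": 0.90, "PSU_POWER_OUT": 0.01},
    "power_line":  {"OUT_NO_VOLTAGE": 0.95, "PSU_POWER_OUT": 0.02},
}
TOKENS = ["OUT_NO_VOLTAGE", "PSU_POWER_OUT"]  # tokens the model knows about


def posterior(seen):
    """Return P(fault | tokens) for every fault, assuming independent tokens."""
    seen = set(seen)
    log_scores = {}
    for fault, prior in PRIORS.items():
        log_p = math.log(prior)
        for token in TOKENS:
            p = LIKELIHOODS[fault][token]
            # Absence of a token is evidence too.
            log_p += math.log(p if token in seen else 1.0 - p)
        log_scores[fault] = log_p
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    weights = {f: math.exp(s - top) for f, s in log_scores.items()}
    total = sum(weights.values())
    return {f: w / total for f, w in weights.items()}


def most_probable(seen, min_probability=0.8):
    """Return the most probable fault, or None if it is below the threshold."""
    post = posterior(seen)
    fault = max(post, key=post.get)
    return fault if post[fault] >= min_probability else None


print(most_probable({"OUT_NO_VOLTAGE", "PSU_POWER_OUT"}))
print(most_probable({"OUT_NO_VOLTAGE"}))
```

With these made-up numbers, seeing both tokens points clearly at the power supply, while `OUT_NO_VOLTAGE` alone is split between a burnt fuse and a disconnected power line, so the threshold suppresses the answer. Tokens the model does not know about are ignored.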

You could train the Bayes net or some other type of classifier, but you'll need quite a bit of data that you've manually labeled (Token1, Token2, Token3 -> FaultXYZ), and it might not be any more accurate than what you would build yourself.

