Faults - Ignore or Resolve

Faults in the Delphix virtualisation engine are automatically generated when a problem occurs and generally mean something is broken. They will be categorised as WARNING or CRITICAL and require some human intervention to fix.

When a fault occurs the GUI will show a red ‘Fault’ link in the top right part of the screen.

Clicking the link opens a small window where you can browse each active fault and Ignore or Resolve it, or even Mark All Resolved. But when should you ignore a fault, when should you resolve it, and when should you resolve ALL the faults? And what are you actually doing when you ignore or resolve?

It’s easy to misunderstand the concepts behind fault resolution and mistakenly ignore faults that should be resolved, or vice versa. You can even resolve all when you only wanted to resolve one: the carpet bomb approach, you might say! This post is here to make sure you don’t make those mistakes.

To be fair, the words do describe the action pretty accurately according to the Free Dictionary:

Resolve – to find a solution to.
Ignore – to take no notice of; to pay no attention to.

We have a fault. Let’s say the source database user account has been locked, so the engine can no longer communicate with the source database. The correct course of action here is to unlock the database account, check that the Delphix engine can communicate once again, and then resolve the fault. Easy. In this case I don’t think you would mistakenly mark the fault as Ignored.

We have another fault: LogSync on an Oracle dSource has failed to fetch an archivelog, so now we can’t provision from this time period.

Suppose we know we won’t need to provision from this time period, so we don’t bother fetching the missing archivelog (repairing the TimeFlow) and we mark the fault as Ignored. Wrong!

What we’ve actually done here is tell the Delphix engine never to notify us of this fault again. Ever. Period. This is not good. No doubt the next time it occurs we DO want to know about it, so we can actually repair the TimeFlow and ensure we can provision/refresh from this period of time.

The official docs do state this quite well:

You can mark the fault as Ignored if it meets the following criteria:
The fault is caused by a well-understood issue that cannot be changed
Its impact to the Delphix Engine is well understood and acceptable
In this case, the fault will not be re-diagnosed even if the fault condition persists. You will receive no further notifications.
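
Those two criteria can be written down as a tiny decision helper. This is only a sketch of the reasoning in this post, not anything the Delphix engine provides: the function name, its parameters and the returned strings are all made up for illustration.

```python
def suggested_action(condition_fixed,
                     cause_understood_and_unchangeable,
                     impact_understood_and_acceptable):
    """Suggest what to do with a fault (hypothetical helper, not a Delphix API)."""
    if condition_fixed:
        # The underlying problem is gone, so the fault can be closed.
        return "resolve"
    if cause_understood_and_unchangeable and impact_understood_and_acceptable:
        # Both documented criteria hold; we accept never being notified again.
        return "ignore"
    # Anything else should stay visible until the cause is dealt with.
    return "fix the cause, then resolve"


# Locked source account, after it has been unlocked and connectivity checked.
print(suggested_action(True, False, False))
# Missing archivelog that we could fetch but have not.
print(suggested_action(False, False, True))
```

The second call is the trap described above: the impact may be acceptable today, but the cause is not something that "cannot be changed", so Ignore is the wrong answer.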

We all make mistakes and, like me, you may have ignored a fault when you should have resolved it. Don’t panic, there’s a way to fix it. We can’t do it via the GUI, so we have to drop into the CLI for this.

Go to the Fault context and list all faults

KDVDXE002 fault> ls
Objects
REFERENCE  STATUS    DATEDIAGNOSED             TARGETNAME                               TITLE
FAULT-12   IGNORED   2017-12-20T15:37:57.540Z  Ora_Sources/xexe                         Cannot provision a database from a portion of TimeFlow
...
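
On an engine with a long fault list it can be easier to paste the listing into a script than to scan it by eye. The sketch below assumes the output looks like the listing above: columns padded with two or more spaces, and references that start with FAULT-. The function names are my own, and the sample text is just the row shown above.

```python
import re

SAMPLE_LISTING = """\
Objects
REFERENCE  STATUS    DATEDIAGNOSED             TARGETNAME                               TITLE
FAULT-12   IGNORED   2017-12-20T15:37:57.540Z  Ora_Sources/xexe                         Cannot provision a database from a portion of TimeFlow
"""

KEYS = ("reference", "status", "date_diagnosed", "target_name", "title")


def parse_fault_listing(text):
    """Return one dict per fault row found in pasted `fault> ls` output."""
    faults = []
    for line in text.splitlines():
        # Columns are separated by 2+ spaces; the title only uses single spaces.
        fields = re.split(r"\s{2,}", line.strip(), maxsplit=4)
        if len(fields) == 5 and fields[0].startswith("FAULT-"):
            faults.append(dict(zip(KEYS, fields)))
    return faults


def ignored_references(text):
    """Return the references of every fault whose status is IGNORED."""
    return [f["reference"] for f in parse_fault_listing(text)
            if f["status"] == "IGNORED"]


print(ignored_references(SAMPLE_LISTING))  # ['FAULT-12']
```

The header row and the "Objects" line are skipped because neither starts with FAULT-.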

Select the Ignored fault and change it to Resolved

KDVDXE002 fault> select FAULT-12
KDVDXE002 fault 'FAULT-12'> ls
Properties
    type: Fault
    action: Use the CLI TimeFlow Log Fetch command to retrieve the missing log. The log can only be retrieved from a non-ASM location.
    bundleID: fault.oracle.linkedsource.notprovisionable.timeflow.failedlogs
    dateDiagnosed: 2017-12-20T15:37:57.540Z
    dateResolved: 2017-12-20T15:38:43.083Z
    description: Database "xexe" TimeFlow "default" is missing log sequence 1.129 because LogSync failed to fetch it more than 3 times.
    reference: FAULT-12
    resolutionComments:
    severity: WARNING
    status: IGNORED
    target: default
    targetName: Ora_Sources/xexe
    targetObjectType: OracleTimeflow
    title: Cannot provision a database from a portion of TimeFlow

Operations
resolve
KDVDXE002 fault 'FAULT-12'> resolve
KDVDXE002 fault 'FAULT-12' resolve *> commit
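
If several faults were ignored by mistake, the same three commands are needed for each one. The sketch below only builds the text of those commands so it can be reviewed and pasted in. It does not talk to the engine, the function name is my own, and it assumes you are back at the fault> prompt before each group of three.

```python
import re


def resolve_commands(references):
    """Build the select/resolve/commit lines used above for each reference."""
    lines = []
    for ref in references:
        # Refuse anything that does not look like the references in the listing.
        if not re.fullmatch(r"FAULT-\d+", ref):
            raise ValueError(f"unexpected fault reference: {ref!r}")
        lines.extend([f"select {ref}", "resolve", "commit"])
    return lines


for line in resolve_commands(["FAULT-12"]):
    print(line)
```

Printing rather than executing is deliberate: resolving the wrong fault is exactly the kind of mistake this post is about, so a human should read the commands first.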

See the fault is now Resolved (not Ignored)

KDVDXE002 fault> ls
Objects
REFERENCE  STATUS    DATEDIAGNOSED             TARGETNAME                               TITLE
FAULT-12   RESOLVED  2017-12-20T15:37:57.540Z  Ora_Sources/xexe                         Cannot provision a database from a portion of TimeFlow
...

So what have we done here? There is no way to set an Ignored fault back to active, so instead we set it to Resolved. Now the fault will pop up again when the issue recurs. Not ideal, but better than being stuck in the Ignored state.

One last thing, the Mark All Resolved option

I sort of understand why this option is available. Maybe while we’re setting up the environment we generate lots of faults and once we’ve completed the setup we want to quickly resolve them all and start with a clean slate. Great, hit the “Mark All Resolved” button.

But my reason for mentioning this option is as a warning. Be careful with it. Apart from the scenario above I really can’t see a reason to ever use it. If this is a shared engine, and by that I mean one shared amongst many teams who each host their own environments on it, then resolving all faults means the other teams may never know there has been a fault with their environment. Well, at least until it occurs again.
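
Before reaching for that button on a shared engine, it helps to see whose faults you would be clearing. Here is a rough sketch that counts active faults per group. It assumes the active status is spelled ACTIVE, that the text before the "/" in a target name (such as Ora_Sources/xexe) identifies the owning group, and that groups map to teams. The Team_B entries are invented for the example.

```python
from collections import Counter


def active_faults_by_group(faults):
    """Count ACTIVE faults per group from (status, target_name) pairs."""
    counts = Counter()
    for status, target_name in faults:
        if status == "ACTIVE":
            # "Ora_Sources/xexe" -> "Ora_Sources"
            group = target_name.split("/", 1)[0]
            counts[group] += 1
    return dict(counts)


sample = [
    ("ACTIVE", "Ora_Sources/xexe"),
    ("RESOLVED", "Ora_Sources/xexe"),
    ("ACTIVE", "Team_B/devdb"),    # hypothetical target
    ("ACTIVE", "Team_B/testdb"),   # hypothetical target
]
print(active_faults_by_group(sample))  # {'Ora_Sources': 1, 'Team_B': 2}
```

If more than one group shows up in the result, Mark All Resolved will hide faults that belong to somebody else.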

I hope this post at least makes you think twice before dealing with faults so they are dealt with in the right way. Just be careful of hitting Ignore when you are most likely intending to Resolve.

Matt Griffith

Matt is a technology consultant with over 20 years’ experience helping organisations around the world achieve data success using proven and emerging technologies. He is the Principal Consultant and Head Trainer at Kuzo Data.


The Data Lobby

Dataops and Devops Technology Blog
