Thursday, January 3, 2019

COMMON CAUSES OF OCSSD EVICTIONS



Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected.  A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process.  The purpose of this node eviction is to maintain the overall health of the cluster by removing bad members.

The common causes are listed below.

  1. Network failure or latency between nodes. By default it takes 30 consecutive missed check-ins (the number is determined by the CSS misscount) to cause a node eviction.
  2. Problems writing to or reading from the CSS voting disk.  If the node cannot perform a disk heartbeat to the majority of its voting files, then the node will be evicted.
  3. A member kill escalation.  For example, the database LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism.  If this request times out, it can escalate to a node kill.
  4. An unexpected failure or hang of the OCSSD process.  This can be caused by any of the above issues or by something else.
  5. An Oracle bug.
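
The first two rules above can be sketched in a few lines of shell. This is only an illustration of the decision logic as described in this post, not how ocssd.bin is actually implemented; the function names network_heartbeat_status and voting_disk_status are made up for the example, and "majority" is assumed to mean strictly more than half of the voting files.

```shell
#!/bin/sh
# Illustrative model of the two heartbeat rules described above.
# Hypothetical helper names; not part of Oracle Clusterware.

# Network heartbeat: evict once the number of consecutive missed
# check-ins reaches misscount.
# $1 = consecutive missed check-ins, $2 = misscount (defaults to 30 here)
network_heartbeat_status() {
    missed=$1
    misscount=${2:-30}
    if [ "$missed" -ge "$misscount" ]; then
        echo "evict"
    else
        echo "ok"
    fi
}

# Disk heartbeat: the node survives only while it can reach a strict
# majority (more than half) of its voting files.
# $1 = reachable voting files, $2 = total voting files
voting_disk_status() {
    reachable=$1
    total=$2
    if [ $(( reachable * 2 )) -gt "$total" ]; then
        echo "ok"
    else
        echo "evict"
    fi
}
```

For example, with three voting files a node that can still reach two of them stays in the cluster, while a node that can reach only one is evicted.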

To some extent, eviction behavior can be tuned with the CSS settings below (n is the new value):

$CRS_HOME/bin/crsctl set css misscount n    # default 60 sec in 11g and 30 sec in 12c
$CRS_HOME/bin/crsctl set css reboottime n   # default 3 sec
$CRS_HOME/bin/crsctl set css disktimeout n  # default 200 sec
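
Before changing any of these, it is worth recording the current values. The sketch below wraps the matching "crsctl get css" commands in a small function; the function name show_css_timeouts and the CRSCTL override variable are assumptions made for this example, and CRS_HOME is expected to be set as in the commands above.

```shell
#!/bin/sh
# Print the current value of each CSS timeout discussed above.
# show_css_timeouts and CRSCTL are hypothetical names for this sketch;
# CRSCTL defaults to the crsctl binary under $CRS_HOME.
show_css_timeouts() {
    for param in misscount reboottime disktimeout; do
        "${CRSCTL:-$CRS_HOME/bin/crsctl}" get css "$param"
    done
}
```

Run show_css_timeouts as a user that is allowed to query Clusterware, and keep the output so the original settings can be restored later.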

