Process ended with exit code=137

{question}

A TE/SM became unavailable. Sifting through the agent logs, I noticed "exit code=137". What does it mean?

{question}

{answer}

Let's consider a scenario where we observe a log message like:

INFO ProcessService.nodeLeft (serv-socket<SOCKET NUMBER>-thread-<THREAD NUMBER>) cleaning up pid=<PID> with exit code=137

exit code=137 means that either (1) something killed the container that hosted the TE, or (2) something killed the process with SIGKILL (kill -9). We can figure that out by subtracting 128 from the exit code to get the actual signal number, i.e. 137 - 128 = 9, which is SIGKILL.
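
As a quick sanity check, the arithmetic and the signal name can be verified from any bash shell (standard bash/kill behaviour, nothing NuoDB-specific):

$ echo $((137 - 128))
9
$ kill -l 9
KILL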

So if we see exit code=137, we first need to check whether or not the container was shut down gracefully.

Remember that node state changes are replicated to all brokers, so if we grep for "pid=<PID>" across all agent.log files, "(local)" always tells us which agent/broker was the process's local agent, e.g.:

Node <TE/SM> db=<[DBNAME]> pid=<PID> id=<ID> req=null (local)
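
A minimal way to do that sweep (the agent.log path here is an assumption; adjust it to wherever your brokers write their agent logs):

$ grep "pid=<PID>" /var/log/nuodb/agent.log

Run this on every broker host (or over the collected log bundle); the broker whose matching line ends in "(local)" is the one that hosted the process.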

If the local agent broadcast the "nodeLeft" event to ALL brokers, we can conclude that the container wasn't hard-killed.
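
One way to confirm the broadcast (same assumed log path as above, and assuming each broker logs the event in the same ProcessService.nodeLeft format) is to count the nodeLeft lines for this pid on each broker:

$ grep -c "nodeLeft.*pid=<PID>" /var/log/nuodb/agent.log

If every broker's agent.log shows the event, the admin layer observed the process leaving.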

If the container wasn't hard-killed, the next thing we need to check is whether the container was gracefully shut down and whether the NuoDB "nuoagent" init.d script ran.
On a graceful shutdown, the "nuoengine" init.d script runs first and shuts down all NuoDB processes using "killall -9 nuodb"; the "nuoagent" init.d script runs after that.

The purpose of this ordering is that the engines are shut down first, so the admin layer gets a chance to observe the "nodeLeft" event and clean up the durable store.
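
To double-check that both init scripts are registered for shutdown, and in which order, one option on a SysV init host (runlevel directories and script names assumed; verify on your distro) is to list the kill links:

$ ls /etc/rc0.d/ /etc/rc6.d/ | grep -i nuo

The numbers on the K-prefixed nuoengine and nuoagent links show which script the OS runs first on shutdown.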

If the local agent wasn't shut down eventually, then we can conclude that the engine was simply killed with SIGKILL. That can be done either by something (a script or application) or someone, or by the Linux OOM killer. When Linux runs out of memory (OOM), the OOM killer chooses a process to kill based on some heuristics.
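
As an aside on those heuristics, the kernel exposes its per-process "badness" score, so you can see how attractive a running engine looks to the OOM killer (these are standard Linux /proc entries):

$ cat /proc/<PID>/oom_score
$ cat /proc/<PID>/oom_score_adj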

To know if that's the case, we can grep the kernel messages (/var/log/messages) on each of the hosts to see if the OOM killer killed the process, for example by running dmesg | egrep -i 'killed process':

[root@ip-##-##-#-### ~]# dmesg | egrep -i 'killed process'

Output such as the following tells us that the process was killed by the OS (Linux, in this case):

java invoked oom-killer: gfp_mask=#######, order=#, oom_adj=#, oom_score_adj=# 
[<############>] ? oom_kill_process+####/##### 
Out of memory: Kill process ##### (nuodb) score ### or sacrifice child 
Killed process #####, UID ###, (nuodb) total-vm:###########kB, anon-rss:##########kB, file-rss:###kB
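
If dmesg's ring buffer has already rotated, the same evidence usually survives in the syslog files, so a reasonable fallback (path differs by distro, e.g. /var/log/syslog on Debian/Ubuntu) is:

$ egrep -i 'killed process' /var/log/messages*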


ASIDE 1: The agent itself never issues a SIGKILL.

ASIDE 2: kill -15 <PID> (or just "kill <PID>") sends SIGTERM, which gives the process a chance to handle the signal but doesn't guarantee that the process will go away. SIGKILL just blows the process away.
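
A tiny illustrative demo of that difference (the script name is made up; the behaviour is standard bash/Linux):

$ cat > term_demo.sh <<'EOF'
#!/bin/bash
# Handle SIGTERM: print a message and exit cleanly.
trap 'echo "caught SIGTERM, cleaning up"; exit 0' TERM
echo "pid $$ waiting..."
while true; do sleep 1; done
EOF
$ bash term_demo.sh &
$ kill -15 %1     # the trap fires: the script prints its message and exits 0

Re-run the script and use "kill -9 %1" instead: the process dies immediately, the trap never runs, and its exit status comes back as 137 (128 + 9).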

{answer}
