Hello,
After some successful work with my Hadoop cluster, I'm now having trouble managing services.
I tried to stop all the Hadoop services completely in order to change some log4j log rotation parameters, and since then some services have been stuck: they are in state STOP_FAILED even though they were shut down successfully.
The symptoms are the following:
I can launch actions both from the Ambari server UI and directly through the API with curl, but none of them is ever picked up. They stay in state QUEUED or PENDING.
The puppet site files corresponding to those actions are no longer generated in /var/lib/ambari-agent/data, as they were before.
(I can stop and start all those services manually using custom puppet manifests, so the problem does not seem to be at the service level.)
It looks like the ambari-agents are doing nothing and never take the actions from the Ambari server, which then end in the TIMEDOUT state.
I see nothing in the ambari-agent or ambari-server logs that could explain this behavior. I have already tried restarting all of them, and even rebooting the servers that make up my HDP cluster, but the issue is still there.
Below is an example of what I mean, for the Nagios service:
{
  "href" : "http://obench20s:8080/api/v1/clusters/hadoop_poc/requests/90/tasks/361",
  "Tasks" : {
    "exit_code" : 999,
    "stdout" : "",
    "status" : "QUEUED",
    "stderr" : "",
    "host_name" : "obench20s****",
    "id" : 361,
    "cluster_name" : "hadoop_poc",
    "attempt_cnt" : 1,
    "request_id" : 90,
    "command" : "STOP",
    "role" : "NAGIOS_SERVER",
    "start_time" : 1361895724078,
    "stage_id" : 1
  }
}
A little later:
{
  "href" : "http://obench20s:8080/api/v1/clusters/hadoop_poc/requests/90/tasks/361",
  "Tasks" : {
    "exit_code" : 999,
    "stdout" : "",
    "status" : "TIMEDOUT",
    "stderr" : "",
    "host_name" : "obench20s****",
    "id" : 361,
    "cluster_name" : "hadoop_poc",
    "attempt_cnt" : 2,
    "request_id" : 90,
    "command" : "STOP",
    "role" : "NAGIOS_SERVER",
    "start_time" : 1361895724078,
    "stage_id" : 1
  }
}
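For anyone who wants to watch the same transition from a script rather than by hand with curl, here is a minimal sketch in Python. It only relies on the fields visible in the two task snapshots above. The function names, the set of status names beyond QUEUED, PENDING and TIMEDOUT, and the "never picked up" heuristic are my own assumptions, not something taken from the Ambari documentation.

```python
import base64
import json
import urllib.request

# QUEUED, PENDING and TIMEDOUT appear in this thread; the other names are
# assumptions about which states a task can end in.
TERMINAL = {"COMPLETED", "FAILED", "TIMEDOUT", "ABORTED"}
WAITING = {"PENDING", "QUEUED"}


def summarize_task(payload):
    """Reduce a task resource (same shape as the JSON above) to a few fields."""
    task = payload["Tasks"]
    status = task["status"]
    return {
        "id": task["id"],
        "role": task["role"],
        "command": task["command"],
        "status": status,
        "attempt_cnt": task["attempt_cnt"],
        "finished": status in TERMINAL,
        # Heuristic (assumption): a task that is still waiting, or that timed
        # out with empty stdout and stderr, was probably never run by an agent.
        "never_picked_up": status in WAITING or (
            status == "TIMEDOUT" and not task["stdout"] and not task["stderr"]
        ),
    }


def fetch_task(url, user, password):
    """GET one task resource using HTTP basic auth and return the parsed JSON."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling summarize_task(fetch_task(url, user, password)) with the task href shown above and your own Ambari credentials, every few seconds, should show attempt_cnt increasing while never_picked_up stays true, which matches the behavior described here.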
In the ambari-server log, I get:
17:28:37,403 DEBUG ResourceProviderImpl:271 - Setting property for resource, resourceType=HostComponent, propertyId=HostRoles/host_name, value=obench20s*****
17:28:37,403 DEBUG ResourceProviderImpl:271 - Setting property for resource, resourceType=HostComponent, propertyId=HostRoles/state, value=STOPPING
17:28:37,404 DEBUG ResourceProviderImpl:271 - Setting property for resource, resourceType=HostComponent, propertyId=HostRoles/desired_state, value=INSTALLED
Ambari-agent logs (from the server where Nagios normally runs):
INFO 2013-02-26 17:36:30,487 Heartbeat.py:68 - Heartbeat dump: {'componentStatus': [],
'hostname': 'obench20s****',
'nodeStatus': {'cause': 'NONE', 'status': 'HEALTHY'},
'reports': [],
'responseId': 260,
'timestamp': 1361896590486}
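To check many heartbeats at once instead of reading them one by one, a small sketch like the following can pull the dump out of the agent log and test whether it carries any command reports. It assumes the dump is a plain Python literal with straight quotes, as it appears in the agent's own log file, and that braces never occur inside string values. The helper names are hypothetical, and reading empty 'reports' and 'componentStatus' lists as "the agent is idle" is my interpretation, not documented behavior.

```python
import ast

MARKER = "Heartbeat dump: "


def parse_heartbeat(log_text):
    """Return the dict that follows the first 'Heartbeat dump: ' marker."""
    start = log_text.index(MARKER) + len(MARKER)
    depth = 0
    # Walk forward counting braces to find the end of the (possibly nested,
    # multi-line) dict, then evaluate it safely as a Python literal.
    for pos in range(start, len(log_text)):
        ch = log_text[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return ast.literal_eval(log_text[start:pos + 1])
    raise ValueError("unterminated heartbeat dump")


def is_idle(heartbeat):
    """True when the heartbeat has no command reports and no component status."""
    return not heartbeat["reports"] and not heartbeat["componentStatus"]
```

Applied to the heartbeat above, is_idle returns True: the agent says the node is HEALTHY but reports on no commands, which is consistent with the tasks never leaving QUEUED on the server side.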
Many thanks for your help.