Below is a series of common Hadoop errors and their solutions that we have collected. These are all mistakes we made ourselves and have since corrected, and the fixes are proven to work.
1. Error 1: java.io.IOException: Incompatible clusterIDs, which often appears after the namenode has been reformatted
2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
    at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
Cause: each time the namenode is formatted, a new clusterID is generated, while the datanode's data directory still holds the ID from the previous format. Formatting clears the namenode's data but does not clear the datanodes' data, so the datanodes fail to start. Before each format, clear out all directories under the data path.

Fix: stop the cluster, then delete everything under the problem node's data directory, i.e. the dfs.data.dir directory configured in hdfs-site.xml, and reformat the namenode.

An easier alternative: stop the cluster, then edit the clusterID in the datanode's /dfs/data/current/VERSION file so it matches the namenode's.
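The easier alternative can be sketched as a small shell helper. The function name and example paths below are illustrative, not from the original post; use the directories from your own dfs.name.dir and dfs.data.dir settings, and stop the cluster before running it:

```shell
# sync_cluster_id: copy the namenode's clusterID into a datanode's VERSION file.
# Arguments: path to the namenode VERSION file, path to the datanode VERSION file.
# Run only with the cluster stopped.
sync_cluster_id() {
  nn_version=$1   # e.g. /data/dfs/name/current/VERSION (assumed layout)
  dn_version=$2   # e.g. /data/dfs/data/current/VERSION

  # Extract the namenode's clusterID line, e.g. "clusterID=CID-...".
  cid=$(grep '^clusterID=' "$nn_version" | cut -d= -f2)

  # Overwrite the datanode's clusterID in place so the two match.
  sed -i "s/^clusterID=.*/clusterID=${cid}/" "$dn_version"
}

# Usage (paths are examples):
#   sync_cluster_id /data/dfs/name/current/VERSION /data/dfs/data/current/VERSION
```

After syncing, restart the cluster; the datanode should register against the namenode without the Incompatible clusterIDs error.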
2. Error: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container
14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
    at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0
Cause: clock skew between the namenode and the datanodes. The container token carries a timestamp that is checked against the local clock, so a large skew makes a fresh token look expired.

Fix: synchronize the datanodes' clocks with the namenode. Run on every server:

ntpdate time.nist.gov

and confirm the synchronization succeeded. Better still, add a line to /etc/crontab on every server:

0 2 * * * root ntpdate time.nist.gov && hwclock -w
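To see why the token looked expired, compare the two timestamps in the log: they are epoch milliseconds, and the gap between 1398762692768 and 1398711306590 is over 14 hours. A tiny illustrative helper (the function name and tolerance are my own choices, not Hadoop's actual check) that flags such skew:

```shell
# clock_skew_ok: succeed if two epoch-millisecond timestamps are within
# a given tolerance (in milliseconds); fail otherwise.
clock_skew_ok() {
  t1=$1; t2=$2; tol_ms=$3
  diff=$((t1 - t2))
  # Take the absolute value of the difference.
  [ "$diff" -lt 0 ] && diff=$((-diff))
  [ "$diff" -le "$tol_ms" ]
}

# The pair from the log fails even with a generous 10-minute tolerance:
#   clock_skew_ok 1398762692768 1398711306590 600000  -> non-zero exit
```

Keeping all nodes within a few seconds of each other (via ntpdate or an ntpd/chrony daemon) keeps checks like this from firing.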