V2EX = way to explore
V2EX 是一个关于分享和探索的地方
现在注册
已注册用户请  登录
推荐学习书目
Learn Python the Hard Way
Python Sites
PyPI - Python Package Index
http://diveintopython.org/toc/index.html
Pocoo
值得关注的项目
PyPy
Celery
Jinja2
Read the Docs
gevent
pyenv
virtualenv
Stackless Python
Beautiful Soup
结巴中文分词
Green Unicorn
Sentry
Shovel
Pyflakes
pytest
Python 编程
pep8 Checker
Styles
PEP 8
Google Python Style Guide
Code Style from The Hitchhiker's Guide
wdg8106
V2EX  ›  Python

spark 问题求教

  •  
  •   wdg8106 · 2024-02-18 09:57:54 +08:00 · 1549 次点击
    这是一个创建于 376 天前的主题,其中的信息可能已经有所发展或是发生改变。
    spark 启动后 提交任务失败,问下各位大佬是什么问题呢。

    脚本代码:
    from pyspark import SparkContext,SparkConf
    host = 'discomaster'
    sc = SparkContext.getOrCreate(conf=SparkConf().setMaster('spark://%s:7077' %host).setAppName('lc_test'))

    rdd = sc.parallelize(range(1,11),4)
    res = rdd.take(100)
    print res

    脚本在 webserver2 上执行,discomaster 是 master 节点。

    集群信息:
    URL: spark://discomaster:7077
    Alive Workers: 27
    Cores in use: 27 Total, 0 Used
    Memory in use: 54.0 GiB Total, 0.0 B Used
    Resources in use:
    Applications: 0 Running, 42 Completed
    Drivers: 0 Running, 0 Completed
    Status: ALIVE

    脚本执行信息:
    24/02/18 09:49:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    24/02/18 09:49:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    24/02/18 09:49:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    24/02/18 09:50:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    24/02/18 09:50:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

    excutor 的 stderr 信息:
    Spark Executor Command: "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp" "/share/software/spark/conf/:/share/software/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=19038" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@webserver2:19038" "--executor-id" "372" "--hostname" "172.16.22.108" "--cores" "1" "--app-id" "app-20240218094855-0041" "--worker-url" "spark://Worker@172.16.22.108:36943"
    ========================================

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    24/02/18 09:49:30 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 123172@cluster208
    24/02/18 09:49:30 INFO SignalUtils: Registered signal handler for TERM
    24/02/18 09:49:30 INFO SignalUtils: Registered signal handler for HUP
    24/02/18 09:49:30 INFO SignalUtils: Registered signal handler for INT
    24/02/18 09:49:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    24/02/18 09:49:30 INFO SecurityManager: Changing view acls to: root,lichong
    24/02/18 09:49:30 INFO SecurityManager: Changing modify acls to: root,lichong
    24/02/18 09:49:30 INFO SecurityManager: Changing view acls groups to:
    24/02/18 09:49:30 INFO SecurityManager: Changing modify acls groups to:
    24/02/18 09:49:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, lichong); groups with view permissions: Set(); users with modify permissions: Set(root, lichong); groups with modify permissions: Set()
    Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:283)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:272)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
    Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$3(CoarseGrainedExecutorBackend.scala:303)
    at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
    at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
    at scala.collection.immutable.Range.foreach(Range.scala:158)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:301)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
    ... 4 more
    Caused by: java.io.IOException: Failed to connect to webserver2/172.16.16.175:19038
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: webserver2/172.16.16.175:19038
    Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
    6 条回复    2024-02-19 16:52:25 +08:00
    llllllllllllllii
        1
    llllllllllllllii  
       2024-02-18 10:10:08 +08:00
    网络问题,没分配到资源,确认下 url 和 webserver2 状态之类的
    wdg8106
        2
    wdg8106  
    OP
       2024-02-18 12:48:08 +08:00
    @llllllllllllllii 确实直接在 master 节点上执行脚本 能正常运行,但 webserver2 跟 master 之间能正常通信,甚至免密登陆都配了,会是啥网络问题呢
    AsAsSaSa
        3
    AsAsSaSa  
       2024-02-18 15:40:55 +08:00
    没了解过 spark ,但看报错内容似乎很清楚地指出,executor 无法访问到 webserver2 (的 19038 端口),Connection refused 是指显式拒绝,像是防火墙问题?
    wdg8106
        4
    wdg8106  
    OP
       2024-02-18 17:08:31 +08:00
    Chain INPUT (policy ACCEPT)
    target prot opt source destination

    Chain FORWARD (policy ACCEPT)
    target prot opt source destination

    Chain OUTPUT (policy ACCEPT)
    target prot opt source destination
    wdg8106
        5
    wdg8106  
    OP
       2024-02-18 17:09:30 +08:00
    防火墙规则看着挺宽松的
    wdg8106
        6
    wdg8106  
    OP
       2024-02-19 16:52:25 +08:00
    找到原因了,感谢各位大佬。
    我是用 Standalone 模式启动的 spark 集群,deploy-mode 有 client 和 cluster 两种模式。
    cluster 模式是客户端提交之后就完事了, 而 client 模式 driver 程序会运行在客户端程序中,所以才会有集群中的节点来连客户端的情况。
    我就是用 client 模式启动的,但是由于 没有指定 bindAddress 导致默认绑定在回环地址上,导致其他机器没法访问,所以报错了。
    手动设置下 bindAddress 就可以了
    关于   ·   帮助文档   ·   博客   ·   API   ·   FAQ   ·   实用小工具   ·   2684 人在线   最高记录 6679   ·     Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 · 129ms · UTC 13:24 · PVG 21:24 · LAX 05:24 · JFK 08:24
    Developed with CodeLauncher
    ♥ Do have faith in what you're doing.