
Views: 4134  |  Replies: 1

sic029

Iron Bug (newcomer)

[Help] qsub submission of parallel siesta fails, asking for help

Hello everyone. I'd like to ask about a problem I ran into using a program on our cluster. Thanks.
[node21:10714] *** An error occurred in MPI_Comm_rank
[node21:10714] *** on communicator MPI_COMM_WORLD
[node21:10714] *** MPI_ERR_COMM: invalid communicator
[node21:10714] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 3 with PID 10711 on
node node21 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[node21:10707] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node21:10707] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


This is a freshly compiled siesta. When I submit it with qsub, the job finishes almost immediately with the messages above; could someone help me diagnose what is going on? On another cluster the same build runs fine when launched directly with mpirun -np 4 siesta, so I don't understand why qsub fails on the new cluster. The new cluster does not allow logging into the compute nodes, so I have no choice but to make the scheduler route work. Many thanks.

I don't know where the problem lies. lammps and vasp, compiled in parallel in this same environment, both run without trouble; only siesta fails to compute whenever the job is submitted through qsub. Meanwhile, the parallel siesta build runs fine with mpirun -np 4 siesta on a compute node of the other cluster. I'm stuck.
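Before digging into the scheduler, it may be worth ruling out a library mismatch: an invalid-communicator error on the very first MPI_Comm_rank call is a common symptom of a binary built against one MPI implementation but launched with a different mpirun. A quick check could look like this (a sketch; the siesta path is the one from the job script below, adjust to yours):

```shell
#!/bin/bash
# Compare the MPI libraries the binary is linked against with the
# mpirun actually on the PATH. If they come from different MPI
# installations, MPI_COMM_WORLD constants may not match, which can
# produce MPI_ERR_COMM right at MPI_Comm_rank.
SIESTA=/home/sw/siesta/siesta-3.1/Obj/siesta   # path from the job script
ldd "$SIESTA" 2>/dev/null | grep -i mpi || echo "could not inspect $SIESTA"
command -v mpirun
mpirun --version 2>&1 | head -n 1
```

If the libraries listed by ldd do not live under the same prefix as the mpirun shown by `command -v`, sourcing the matching environment script before both compiling and running would be the first thing to fix.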

I have also tried mpirun directly on the login node. Please take a look; it seems the administrators have restricted it so it cannot be used there either:
mpirun -np 4 siesta
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    manage1
  OMPI source:   btl_openib_component.c:1115
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx4_0
  Memlock limit: 32768

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   manage1
  Local device: mlx4_0
--------------------------------------------------------------------------
[manage1:16214] *** An error occurred in MPI_Comm_rank
[manage1:16214] *** on communicator MPI_COMM_WORLD
[manage1:16214] *** MPI_ERR_COMM: invalid communicator
[manage1:16214] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 16212 on
node manage1 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[manage1:16211] 3 more processes have sent help message help-mpi-btl-openib.txt / init-fail-no-mem
[manage1:16211] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[manage1:16211] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[manage1:16211] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal

The compute nodes are locked down; I cannot log into them at all.

This cluster uses PBS with the GridView job manager. The submission script I use is:
====================================
#PBS -N test
#PBS -l nodes=1:ppn=8
#PBS -j oe
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
NP=`cat $PBS_NODEFILE | wc -l`
source /public/software/mpi/openmpi1.5.4-intel.sh
mpirun -machinefile $PBS_NODEFILE -np $NP \
/home/sw/siesta/siesta-3.1/Obj/siesta < fe.fdf | tee output
====================================
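Since the compute nodes cannot be reached interactively, one way to see the limits that jobs actually run under is to submit a tiny diagnostic job. This is a minimal sketch (job name and walltime are my own choices; the directives otherwise follow the script above):

```shell
#!/bin/bash
# Hypothetical one-off diagnostic job: submit with qsub and read the
# output file to see the locked-memory limit that processes started
# by pbs_mom actually inherit on a compute node.
#PBS -N checklimits
#PBS -l nodes=1:ppn=1
#PBS -j oe
#PBS -l walltime=00:05:00

echo "host: $(hostname)"
echo "memlock limit: $(ulimit -l)"
```

If this prints a small number such as 32768 or 65536 rather than "unlimited", the scheduler daemon is handing jobs a limit lower than the one configured in /etc/security/limits.conf, which matches the openib warnings above.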

Thanks in advance for your help, fellow forum members.


redsnowolf

Silver Bug (somewhat known)

[Answer] Helpful reply


liliangfang: gold coin +1, thanks for sharing. 2012-09-15 15:12:35
I ran into a similar problem with vasp a couple of days ago and just solved it.

The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    node21
  OMPI source:   btl_openib_component.c:1055
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx4_0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Items 15, 16, and 17 at the URL above explain this pretty clearly. In my case, `ulimit -a` on every node showed a normal locked-memory limit, yet the job still failed with the memory-registration error. The FAQ says this can happen when a login shell does not actually apply the system's locked-memory setting, or when the job scheduler does not propagate a large enough limit to the application. In the end, I restarted the PBS scheduler daemon on each node, and the problem went away.
Alternatively, you could add `ulimit -l unlimited` in your script before the mpirun line and try submitting with qsub again.
Hope this information helps the OP~
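To make the suggestion concrete, this is the OP's submission script with the ulimit line added (a sketch, not a guaranteed fix: if the hard limit on the node is lower, the ulimit call will fail with an error, and the limit that ultimately matters is the one pbs_mom passes to its children, so an echo line is included to verify what the job actually received):

```shell
#!/bin/bash
#PBS -N test
#PBS -l nodes=1:ppn=8
#PBS -j oe
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
NP=`cat $PBS_NODEFILE | wc -l`
source /public/software/mpi/openmpi1.5.4-intel.sh

# Try to raise the locked-memory limit before launching MPI, then
# print the effective value into the job log so it can be checked.
ulimit -l unlimited
echo "memlock limit in job: $(ulimit -l)"

mpirun -machinefile $PBS_NODEFILE -np $NP \
    /home/sw/siesta/siesta-3.1/Obj/siesta < fe.fdf | tee output
```

Note that the ulimit call only affects the shell on the first node; processes launched on other nodes inherit their limits from the pbs_mom daemon there, which is why restarting the daemons fixed it in my case.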
Post #2, 2012-09-15 14:24:17