
sic029 (original poster):

[Help] qsub submission of parallel SIESTA fails

Hi everyone. I'd like to discuss a problem I've run into running a program on our cluster. Thanks in advance.
[node21:10714] *** An error occurred in MPI_Comm_rank
[node21:10714] *** on communicator MPI_COMM_WORLD
[node21:10714] *** MPI_ERR_COMM: invalid communicator
[node21:10714] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 3 with PID 10711 on
node node21 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[node21:10707] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node21:10707] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


The above is the output from a freshly compiled SIESTA: the job submitted with qsub exits almost immediately with these messages. Could someone help diagnose what is going on? On another cluster the same code, compiled there, runs fine when launched directly with mpirun -np 4 siesta, so I don't understand why qsub on the new cluster fails like this. The new cluster does not allow logging into the compute nodes, so I have to get the qsub route working. Many thanks.

I can't tell where the problem is. LAMMPS and VASP compiled in parallel in this same environment run without any trouble; only SIESTA fails whenever it is submitted through qsub. Yet the same parallel SIESTA build runs fine with mpirun -np 4 siesta on a compute node of the other cluster. I'm stuck.

Oh, and I did try mpirun directly on the login node; please take a look at the output below. It seems the administrator has restricted it there, so it cannot be used either:
mpirun -np 4 siesta
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    manage1
  OMPI source:   btl_openib_component.c:1115
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx4_0
  Memlock limit: 32768

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   manage1
  Local device: mlx4_0
--------------------------------------------------------------------------
[manage1:16214] *** An error occurred in MPI_Comm_rank
[manage1:16214] *** on communicator MPI_COMM_WORLD
[manage1:16214] *** MPI_ERR_COMM: invalid communicator
[manage1:16214] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 16212 on
node manage1 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[manage1:16211] 3 more processes have sent help message help-mpi-btl-openib.txt / init-fail-no-mem
[manage1:16211] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[manage1:16211] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[manage1:16211] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal

The compute nodes cannot be logged into; access to them is completely locked down.

The cluster uses PBS job management through GridView; my submission script is:
====================================
#!/bin/bash
#PBS -N test
#PBS -l nodes=1:ppn=8
#PBS -j oe
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR
# number of MPI ranks = number of lines in the PBS node file
NP=`cat $PBS_NODEFILE | wc -l`
# load the Open MPI 1.5.4 (Intel) environment
source /public/software/mpi/openmpi1.5.4-intel.sh
mpirun -machinefile $PBS_NODEFILE -np $NP \
  /home/sw/siesta/siesta-3.1/Obj/siesta < fe.fdf | tee output
====================================
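For reference: since the compute nodes cannot be reached interactively, the environment a job actually sees can only be inspected from inside a job. A minimal diagnostic job along the following lines (the job name is only illustrative; the MPI environment script is the one from the submission script above) would at least show the locked-memory limit and the mpirun that the job picks up on the node:
====================================
#!/bin/bash
#PBS -N checkenv
#PBS -l nodes=1:ppn=1
#PBS -j oe
#PBS -l walltime=00:05:00

cd $PBS_O_WORKDIR
# locked-memory (memlock) limit seen by this job; Open MPI's openib BTL
# generally wants this to be "unlimited"
ulimit -l
# load the same MPI environment as the real job, then report which mpirun
# is found and on which node the job ran
source /public/software/mpi/openmpi1.5.4-intel.sh
which mpirun
hostname
====================================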

Thanks to any forum friends who can help; much appreciated.

redsnowolf (accepted answer):


I ran into a similar problem with VASP a couple of days ago and only just solved it.

The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    node21
  OMPI source:   btl_openib_component.c:1055
  Function:      ompi_free_list_init_ex_new()
  Device:        mlx4_0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Items 15, 16, and 17 at the URL above explain this fairly clearly. In my case ulimit -a showed a normal locked-memory limit on every node, yet the job still failed with the memory-allocation error. According to that FAQ, this can happen when the locked-memory limit configured on the system is not actually applied to non-interactive logins, or when the job scheduler does not hand the application a large enough limit. In the end I restarted the PBS daemon on every node and the problem was solved.
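For completeness, a rough sketch of what such an admin-side fix typically involves, run as root on each compute node. This assumes a Torque-style PBS where the per-node daemon is called pbs_mom and that limits are configured in /etc/security/limits.conf; the exact daemon name and paths may differ under GridView:
====================================
# /etc/security/limits.conf -- allow unlimited locked memory for all users
*   soft   memlock   unlimited
*   hard   memlock   unlimited

# then restart the per-node batch daemon so that newly started jobs
# inherit the raised limit (daemon name assumed; adjust as needed)
service pbs_mom restart
====================================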
Alternatively, you could add ulimit -l unlimited in the job script before the mpirun line and resubmit with qsub to see whether that helps.
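Applied to the submission script above, that would look roughly like the fragment below. Note that whether a job is allowed to raise its own locked-memory limit depends on the hard limit configured on the nodes, so this may or may not take effect:
====================================
cd $PBS_O_WORKDIR
NP=`cat $PBS_NODEFILE | wc -l`
source /public/software/mpi/openmpi1.5.4-intel.sh
# try to lift the locked-memory limit before launching MPI
ulimit -l unlimited
mpirun -machinefile $PBS_NODEFILE -np $NP \
  /home/sw/siesta/siesta-3.1/Obj/siesta < fe.fdf | tee output
====================================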
Hope this information helps.