wnryc


[Help] Running VASP in parallel: mpich2 sometimes breaks down

I posted the same question earlier, but the problem was not solved. It is fairly urgent; I hope someone can help. Thanks!

I run parallel VASP under RHEL 5.4, mpich2-1.2.1p1, and pgi-9.0.1, on dual Xeon E5504 (Intel) CPUs. Some jobs run fine in parallel (launched with mpiexec -n 8 vasp.pgi >out &, with mpiexec -n 8 vasp.pgi > out &, or with plain mpiexec -n 8 vasp.pgi), but other jobs will not run, even though mpich2-1.2.1p1 and pgi-9.0.1 themselves installed without problems. In the failing cases, the program reads the INCAR, POTCAR, POSCAR and KPOINTS files and then the screen shows the following errors:
----------------------------------------------------------------------
running on    8 nodes
distr:  one band on    1 nodes,    8 groups
vasp.4.6.21  23Feb03 complex
POSCAR found :  3 types and   30 ions
LDA part: xc-table for Ceperly-Alder, Vosko type interpolation para-ferro
POSCAR, INCAR and KPOINTS ok, starting setup
WARNING: wrap around errors must be expected
mpiexec_qltang1 (handle_stdin_input 1089): stdin problem; if pgm is run in background, redirect from /dev/null
mpiexec_qltang1 (handle_stdin_input 1090):     e.g.: mpiexec -n 4 a.out < /dev/null &
[the two lines above repeat several more times]
FFT: planning ...            2
reading WAVECAR
mpiexec_qltang1 (handle_stdin_input 1089): stdin problem; if pgm is run in background, redirect from /dev/null
mpiexec_qltang1 (handle_stdin_input 1090):     e.g.: mpiexec -n 4 a.out < /dev/null &
[the two lines above repeat many more times]
WARNING: random wavefunctions but no delay for mixing, default for NELMDL
entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
rank 6 in job 36  qltang1_54199   caused collective abort of all ranks
  exit status of rank 6: killed by signal 9
rank 3 in job 36  qltang1_54199   caused collective abort of all ranks
  exit status of rank 3: killed by signal 9

Hoping for some help, thanks!

Dr.Qian-LinTang

gump_813276

Did you try what I suggested in my last reply?
The message above says it is a stdin problem,
so I think your first command is the right one: mpiexec -n 8 vasp.pgi >out&
Give it a try; if the problem persists, post again.
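For what it's worth, the invocation the error message itself recommends can be sketched like this (here `true` is a stand-in for the real `mpiexec -n 8 vasp.pgi`, so the line can be tried on any machine):

```shell
# Background MPI jobs should not read the terminal: give them /dev/null
# as stdin, exactly as the mpich2 message suggests.
# `true` is a stand-in for the real `mpiexec -n 8 vasp.pgi` command.
true < /dev/null > out.log 2>&1 &
wait $!    # only so the sketch finishes before looking at out.log
```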
Post #2 · 2010-06-17 12:19:32

wnryc


It's not a problem with the mpiexec command. I tried all of the above and got the same error message.
Post #3 · 2010-06-17 15:23:13

valenhou001


#!/bin/sh
mpdtrace -l        # list the mpd ring
mpdringtest 100    # check the connectivity
mpiexec -n 2 /path/to/vasp > out 2>&1
mpdallexit


Try the above.
Post #4 · 2010-06-17 15:33:53

wnryc


Following the suggestion, here is what I did:
[qltang@qltang1 pgi]$ /bin/sh
sh-3.2$ mpdtrace -l
qltang1_54199 (127.0.0.1)
sh-3.2$ mpdringtest 100
time for 100 loops = 0.0500471591949 seconds
sh-3.2$ mpiexec -n  2 /usr/local/bin/ >out 2>&1
sh-3.2$  mpiexec -n  2 /usr/local/bin/vasp.pgi >out 2>&1
sh-3.2$
The result is the same: VASP exits right after reading the input files, i.e.:
running on    2 nodes
distr:  one band on    1 nodes,    2 groups
vasp.4.6.21  23Feb03 complex
POSCAR found :  3 types and   48 ions
LDA part: xc-table for Ceperly-Alder, Vosko type interpolation para-ferro
POSCAR, INCAR and KPOINTS ok, starting setup
WARNING: wrap around errors must be expected
FFT: planning ...           10
reading WAVECAR
entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
rank 1 in job 63  qltang1_54199   caused collective abort of all ranks
  exit status of rank 1: killed by signal 11
  
I then asked mpich2 support; they said it is probably not an mpich2 problem but a core dump question, i.e.: ""ulimit -c unlimited" is the usual means to enable core dumps.  If that's not working for you, then either your VASP program isn't dumping core or core dumps must be enabled some other way.  You'll have to google for the appropriate way to enable them on your platform."

So I also compiled a serial VASP. I first ran:
ulimit -c unlimited
ulimit -s unlimited
and then ran the serial vasp.pgi.serial. Again VASP exits right after reading the input files, with this output:
[qltang@qltang1 pgi2]$ vasp.pgi.serial
vasp.4.6.21  23Feb03 complex
POSCAR found :  3 types and   48 ions
LDA part: xc-table for Ceperly-Alder, Vosko type interpolation para-ferro
POSCAR, INCAR and KPOINTS ok, starting setup
WARNING: wrap around errors must be expected
FFT: planning ...           16
reading WAVECAR
entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
*** glibc detected *** vasp.pgi.serial: free(): invalid next size (fast): 0x0000000005310760 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3fd74722ef]
/lib64/libc.so.6(cfree+0x4b)[0x3fd747273b]
vasp.pgi.serial[0x4f6176]
======= Memory map: ========
00400000-006bc000 r-xp 00000000 08:03 1770734                            /usr/local/bin/vasp.pgi.serial
008bb000-008e8000 rwxp 002bb000 08:03 1770734                            /usr/local/bin/vasp.pgi.serial
008e8000-00c32000 rwxp 008e8000 00:00 0
05243000-05a4a000 rwxp 05243000 00:00 0                                  [heap]
3fd6c00000-3fd6c1c000 r-xp 00000000 08:03 3204371                        /lib64/ld-2.5.so
3fd6e1b000-3fd6e1c000 r-xp 0001b000 08:03 3204371                        /lib64/ld-2.5.so
3fd6e1c000-3fd6e1d000 rwxp 0001c000 08:03 3204371                        /lib64/ld-2.5.so
3fd7400000-3fd754d000 r-xp 00000000 08:03 3204372                        /lib64/libc-2.5.so
3fd754d000-3fd774d000 ---p 0014d000 08:03 3204372                        /lib64/libc-2.5.so
3fd774d000-3fd7751000 r-xp 0014d000 08:03 3204372                        /lib64/libc-2.5.so
3fd7751000-3fd7752000 rwxp 00151000 08:03 3204372                        /lib64/libc-2.5.so
3fd7752000-3fd7757000 rwxp 3fd7752000 00:00 0
3fd7c00000-3fd7c82000 r-xp 00000000 08:03 3204376                        /lib64/libm-2.5.so
3fd7c82000-3fd7e81000 ---p 00082000 08:03 3204376                        /lib64/libm-2.5.so
3fd7e81000-3fd7e82000 r-xp 00081000 08:03 3204376                        /lib64/libm-2.5.so
3fd7e82000-3fd7e83000 rwxp 00082000 08:03 3204376                        /lib64/libm-2.5.so
3fd8000000-3fd8016000 r-xp 00000000 08:03 3204374                        /lib64/libpthread-2.5.so
3fd8016000-3fd8215000 ---p 00016000 08:03 3204374                        /lib64/libpthread-2.5.so
3fd8215000-3fd8216000 r-xp 00015000 08:03 3204374                        /lib64/libpthread-2.5.so
3fd8216000-3fd8217000 rwxp 00016000 08:03 3204374                        /lib64/libpthread-2.5.so
3fd8217000-3fd821b000 rwxp 3fd8217000 00:00 0
3fd8800000-3fd8807000 r-xp 00000000 08:03 3204377                        /lib64/librt-2.5.so
3fd8807000-3fd8a07000 ---p 00007000 08:03 3204377                        /lib64/librt-2.5.so
3fd8a07000-3fd8a08000 r-xp 00007000 08:03 3204377                        /lib64/librt-2.5.so
3fd8a08000-3fd8a09000 rwxp 00008000 08:03 3204377                        /lib64/librt-2.5.so
3fdd000000-3fdd00d000 r-xp 00000000 08:03 3202051                        /lib64/libgcc_s-4.1.2-20080825.so.1
3fdd00d000-3fdd20d000 ---p 0000d000 08:03 3202051                        /lib64/libgcc_s-4.1.2-20080825.so.1
3fdd20d000-3fdd20e000 rwxp 0000d000 08:03 3202051                        /lib64/libgcc_s-4.1.2-20080825.so.1
2abf81586000-2abf8158c000 rwxp 2abf81586000 00:00 0
2abf815ad000-2abf94b2b000 rwxp 2abf815ad000 00:00 0
7fff8fa7d000-7fff8fa92000 rwxp 7ffffffea000 00:00 0                      [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]
Aborted (core dumped)
[qltang@qltang1 pgi2]$

I don't know where the problem is. Some jobs run fine in serial and give results, while for others the serial vasp stops right after reading the input files (with the error message above). I hope this can be solved soon; thanks!
Post #5 · 2010-06-17 17:16:38

valenhou001


You don't need the ulimit commands.

There are two possibilities:
i) the compile options in your makefile are not set appropriately;
ii) your mpich2 installation has a problem.

I suggest first getting the serial VASP build working properly, then making the simple changes for a parallel build.

To test whether mpich2 is installed correctly, run the test programs or commands that ship with mpich2.
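A sketch of such a check (the cpi pi-estimation program ships in the MPICH2 source tree's examples/ directory; the guard keeps the snippet harmless on machines without mpich2 on the PATH):

```shell
# Sanity-check the mpd ring and run MPICH2's own test program
# before blaming VASP.
if command -v mpdtrace >/dev/null 2>&1; then
    mpdtrace -l                  # members of the mpd ring
    mpdringtest 100              # round-trip time around the ring
    mpiexec -n 2 ./examples/cpi  # bundled pi-estimation example
    status=tested
else
    status=skipped               # mpich2 tools not on PATH
fi
echo "mpich2 check: $status"
```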
Post #6 · 2010-06-17 18:55:03

gump_813276


I think it is a stack size limit problem.
I have tried the method below and it works well.
Create a file called limit.c with the following content:
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
void stacksize_()
{
int res;
struct rlimit rlim;

getrlimit(RLIMIT_STACK, &rlim);
printf("Before: cur=%d,hard=%d\n",(int)rlim.rlim_cur,(int)rlim.rlim_max);

rlim.rlim_cur=RLIM_INFINITY;
rlim.rlim_max=RLIM_INFINITY;
res=setrlimit(RLIMIT_STACK, &rlim);

getrlimit(RLIMIT_STACK, &rlim);
printf("After: res=%d,cur=%d,hard=%d\n",res,(int)rlim.rlim_cur,(int)rlim.rlim_max);
}


Put this file alongside the other VASP source code.
Compile it: cc -c -Wall -O2 limit.c
At the beginning of main.F, add: CALL stacksize() (it should go after all the declarations)
Then, in the makefile, append limit.o to the end of the long list of files in SOURCE.

Give it a try~
Post #7 · 2010-06-17 22:35:59

valenhou001


Is the system you are testing large? If it is small, it uses little memory and the system's default stack size should be enough.

As I said in my previous post, check things step by step and get the serial build working first.
Post #8 · 2010-06-18 10:32:35

wnryc


The test system is not large (16 atoms). I will sort out the serial problem first. Even in serial, VASP runs for some systems but fails for others with the error reported above. Please help me check whether my serial makefile is appropriate.
Machine configuration:
1) Xeon E5504 CPU, 2.0 GHz (two quad-core CPUs, 64-bit), 2*4 GB memory, cache size = 4096 KB
2) RHEL 5.4 (64-bit): Linux qltang1 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
3) mpich2: 1.2.1p1
4) pgi: 9.0.1

The makefile contents (unmodified parts omitted):
.SUFFIXES: .inc .f .f90 .F
SUFFIX=.f

FC=pgf90
FCL=$(FC)

CPP_ =  ./preprocess <$*.F | /usr/bin/cpp -P -C -traditional >$*$(SUFFIX)
CPP    = $(CPP_) -DHOST=\"LinuxPgi\" \
          -Dkind8 -DNGXhalf -DCACHE_SIZE=4096 -DPGF90 -Davoidalloc \
          -DRPROMU_DGEMV
FFLAGS =  -Mfree -Mx,119,0x200000
OFLAG  = -O2  -tp p7-64

OFLAG_HIGH = $(OFLAG)
OBJ_HIGH =
OBJ_NOOPT =
DEBUG  = -g -O0
INLINE = $(OFLAG)

BLAS= -L/usr/local/pgi-9.0.1/linux86-64/9.0-1/lib -lblas
LAPACK= -L/usr/local/pgi-9.0.1/linux86-64/9.0-1/lib -llapack

LIB  = -L../vasp.4.lib -ldmy \
     ../vasp.4.lib/linpack_double.o $(LAPACK) \
     $(BLAS)

LINK    =

FFT3D   = fft3dfurth.o fft3dlib.o

With the serial vasp built from this makefile, some jobs run and some hit the problem above. Please help me check whether my makefile is correct.

[ Last edited by wnryc on 2010-6-18 at 18:59 ]
Post #9 · 2010-06-18 18:55:30
