
Views: 1745 | Replies: 5

04nylxb

Mole (Regular Writer)

[Help] MS Linux cluster: parallel runs fail after installation?

I set MS up to launch via rsh, and passwordless rsh login is already configured on the cluster. Every run prints the following and then fails almost immediately:
bash: /opt/hpmpi/bin/mpid: No such file or directory
bash: /opt/hpmpi/bin/mpid: No such file or directory

Does hpmpi need any additional configuration after installation?

(rsh is confirmed passwordless: rsh nodexx switches to the node without prompting for a password.)
Any pointers would be greatly appreciated.
I changed cpucorestotal to the total core count (64) in both gw-info.sbd and gwparams.cfg, and adjusted the MPI run parameters to enable IB.
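A quick way to double-check that 64 is really what the nodes report (a sketch only; the node names are assumed from this thread's setup, and cpucorestotal is the key named above):

for n in master node1 node2 node3; do    # ...extend the list to all nodes
    echo -n "$n: "; rsh "$n" grep -c ^processor /proc/cpuinfo
done
# the summed counts should match the 64 written into gw-info.sbd / gwparams.cfg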

When running with 4 processes (single node), I get the following error; it looks like an MPI problem. Any advice?
Current trace stack:
model_write_occ_eigenvalues
model_write_all
model_write
geom_BFGS
geometry_optimise
castep
Trapped SIGINT or SIGTERM. Exiting...
Trapped SIGINT or SIGTERM. Exiting...
Trapped SIGINT or SIGTERM. Exiting...
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 2 pid 23177 on host master to cpu 2
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 3 pid 23178 on host master to cpu 3
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 1 pid 23176 on host master to cpu 1
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 0 pid 23175 on host master to cpu 0
MPI Application rank 0 exited before MPI_Finalize() with status 1


————————————————————————————————————————
(Below the divider is the older question.)
As in the title: during the MS cluster installation, the installer prompts
should hpmpi use ssh? [Y/n]
I have sixteen nodes, configured for passwordless rsh login; ssh is not set up. If I answer no here, will the parallel computation later run over rsh?
Any guidance would be appreciated.
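For reference, HP-MPI generally honors the MPI_REMSH environment variable, so the remote shell can be switched to rsh even after installation. A minimal sketch (the run line is illustrative only; mpirun options differ between HP-MPI versions):

export MPI_REMSH=/usr/bin/rsh       # launch remote daemons via rsh instead of ssh
/opt/hpmpi/bin/mpirun -np 16 -f appfile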
Also, how do I uninstall MS under Linux? When I previously installed it under the msi account I chose ssh, forgetting that the cluster was configured for rsh, and setting up ssh looked like a hassle.
I exported the master node's entire home directory over NFS to all compute nodes, so when a key is generated on the master ($ ssh-keygen -t rsa, which by default writes ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub), it is shared to the other nodes too. Repeating the same operation on another node also puts the key under home, and because home is shared from the master, it complains that a key already exists and asks whether to overwrite it. At that point I had no idea how to proceed.
Please advise.
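With a shared home the usual way out is to stop generating per-node keys altogether: ~/.ssh is the same NFS directory everywhere, so one keypair made on the master and authorized once covers the whole cluster. A minimal sketch, assuming the standard OpenSSH layout:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa        # run ONCE, on the master only
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
ssh node1 hostname       # should print the node name without asking for a password

Do not rerun ssh-keygen on the other nodes; they already see this key, which is exactly why the overwrite prompt appeared.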

[ Last edited by 04nylxb on 2011-6-22 at 08:31 ]


lbambool

Mole (Noted Writer)

[Answer] Accepted helpful reply

贺仪 (coins +2): Thanks for the pointers! 2011-06-22 22:22:48
04nylxb (coins +3): Thank you very much! 2011-06-22 22:30:33
Don't share home; share only the MS installation directory.
Install HPMPI on every node (see the check sketched below this list).
Don't install MS as root.
The tmp path on every node must be readable and writable.
Use SSH for computation if at all possible; the configuration isn't troublesome, just follow the manual once. Personally I find it simpler than RSH: current Linux distributions don't ship the RSH service by default, so RSH has to be installed separately, whereas SSH works as soon as it's configured.
Install MS with the --type cluster flag.
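A minimal sketch of checking the first two items node by node (host names taken from this thread; rsh assumed as the transport):

for n in master node1 node2 node3 node4 node5 node6 node7; do
    echo "== $n =="
    rsh "$n" "ls /opt/hpmpi/bin/mpid"                    # missing file => install HPMPI there
    rsh "$n" "touch /tmp/ms_wtest && rm /tmp/ms_wtest && echo tmp OK"
done

A node that fails the first line is exactly what produces the "mpid: No such file or directory" error from the opening post.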
#2 · 2011-06-22 20:04:20

04nylxb

Mole (Regular Writer)

Quote:
Originally posted by lbambool at 2011-06-22 20:04:20:
Don't share home; share only the MS installation directory.
Install HPMPI on every node.
Don't install MS as root.
The tmp path on every node must be readable and writable.
Use SSH for computation if at all possible; the configuration isn't troublesome, just follow the manual once. Personally I find it simpler than RSH: current ...

Thank you very much. The system administrator shared the whole home directory for ease of management, and hpmpi is installed on every node.
The earlier problem is now solved: hpmpi is working.
I tried dmol3 and the calculation runs perfectly normally. CASTEP, however, fails; the new problem is below. Any pointers welcome:

Job started on host master
at Wed Jun 22 21:55:51 2011

MPI_CPU_AFFINITY set to RANK, setting affinity of rank 1 pid 1156 on host master to cpu 1
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 2 pid 1157 on host master to cpu 2
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 0 pid 1155 on host master to cpu 0
MPI_CPU_AFFINITY set to RANK, setting affinity of rank 3 pid 1158 on host master to cpu 3
[... the same affinity message repeated for ranks 4-31, four ranks each on node1 through node7 ...]
warning:regcache incompatible with malloc
[... the same warning repeated 27 times in total ...]
MX:node6:mx__connect_common(00:60:dd:48:d9:57):error 36(errno=3):Destination NIC not found in network table
MPI Application rank 26 exited before MPI_Finalize() with status 1
MX:node3:mx__connect_common(00:60:dd:48:d9:28):error 36(errno=3):Destination NIC not found in network table
MX:node3:mx__connect_common(00:60:dd:48:d9:28):error 36(errno=3):Destination NIC not found in network table
MX:node2:Remote endpoint is closed, peer=00:60:dd:48:d8:f0 (node3:0)
[... the same "Remote endpoint is closed" message repeated from node1, node4, node5 and node7 ...]
MPI Application rank 15 exited before MPI_Finalize() with status 1
MPI Application rank 29 exited before MPI_Finalize() with status 1
MPI Application rank 21 exited before MPI_Finalize() with status 1
forrtl: error (78): process killed (SIGTERM)
Image              PC        Routine            Line        Source            
libpthread.so.0    0096D21A  Unknown               Unknown  Unknown
libmyriexpress.so  B6F7535D  Unknown               Unknown  Unknown
libmpi.so.1        B7A3401F  Unknown               Unknown  Unknown
libmpi.so.1        B7A10622  Unknown               Unknown  Unknown
libmpi.so.1        B7A0FFCB  Unknown               Unknown  Unknown
libmpi.so.1        B7A60BDF  Unknown               Unknown  Unknown
libmpi.so.1        B7A6AF17  Unknown               Unknown  Unknown
castepexe_mpi.exe  080A68E9  Unknown               Unknown  Unknown
castepexe_mpi.exe  08F5D992  Unknown               Unknown  Unknown
……………………
#3 · 2011-06-22 22:33:35

04nylxb

Mole (Regular Writer)

Quote:
Originally posted by 04nylxb at 2011-06-22 22:33:35:
Thank you very much. The system administrator shared the whole home directory for ease of management, and hpmpi is installed on every node.
The earlier problem is now solved: hpmpi is working.
I tried dmol3 and the calculation runs perfectly normally. CASTEP, however, fails; the new problem is ...

Oops, the forum turned part of that log into a smiley; the actual message is: Destination NIC not found in network table.

The first dmol3 calculation ran fine on 24 processors and converged.
But when I tried to run it again, it failed immediately... baffling.
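One hedged way to narrow this down, assuming the Myrinet MX tools are installed on the nodes: mx_info prints each node's NIC and its view of the peer table, so a host whose peers are missing there is the likely source of the "Destination NIC not found in network table" error:

for n in node1 node2 node3 node4 node5 node6 node7; do
    echo "== $n =="; rsh "$n" mx_info
done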
#4 · 2011-06-22 22:43:17

lbambool

Mole (Noted Writer)

[Answer] Accepted helpful reply

04nylxb (coins +3): Got it, many thanks. I verified the nodes one by one and found several that don't work. 2011-06-23 13:00:32
zzy870720z (coins +2): Thanks for the advice. 2011-06-23 14:17:53
Are you using InfiniBand? I haven't worked with that technology. If dmol3 works but CASTEP doesn't, first check whether it's a license issue; if not, a hardware configuration problem is most likely. Below is some information I found online that may help; you could show this passage to your administrator and see whether they can sort it out.
What is warning:regcache incompatible with malloc ?

Myrinet MX uses a registration cache (see the "Acronyms in high performance interconnect world" table above) to achieve higher performance. When the registration cache feature is enabled, Myrinet MX manages all memory allocations by itself, i.e. it has its own implementation of malloc, free, realloc, mremap, munmap, sbrk, etc. (see mx__regcache.c in the libmyriexpress package).
The warning message in question pops up when mx__regcache_works returns 0. On Linux, this means that when a malloc/free pair is called, the variable mx__hook_triggered is not triggered.

Registration cache checks can be disabled by setting the environment variable MX_RCACHE to 2.

Registration cache can sometimes cause weird errors. It can be disabled entirely by setting the environment variable MX_RCACHE to 0.
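Applied to a run on this cluster, that amounts to something like the following sketch (HP-MPI's -e option propagates an environment variable to all ranks; treat the exact run line as illustrative):

export MX_RCACHE=0                                     # or 2, per the notes above
/opt/hpmpi/bin/mpirun -e MX_RCACHE=0 -np 32 -f appfile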
#5 · 2011-06-23 09:04:09

04nylxb

Mole (Regular Writer)

Quote:
Originally posted by lbambool at 2011-06-23 09:04:09:
Are you using InfiniBand? I haven't worked with that technology. If dmol3 works but CASTEP doesn't, first check whether it's a license issue; if not, a hardware configuration problem is most likely. Below is some information I found online that may help; you could show this passage to your administrator ...

Dmol3 can now use up to 30 processors, but CASTEP errors out the moment it goes above 2. Frustrating.
#6 · 2011-06-23 17:15:33