Re: nodes on two separate lans hangs

From: Anatoly Pidruchny <apidruchny(at)newxt.com>
Date: Wed Jul 25 2007 - 19:34:09 EDT


Hi Vincent.

> My cluster, with two data nodes, works fine when all nodes are on the
> same network. But when configuring the cluster exactly the same way but
> replacing a data node with a node on a remote network, it hangs in the
> starting phase (I did the --initial start after modifying the conf).
> Maybe you can help or give me some hints

If I were you, I would use lsof and tcpdump/ngrep to find out exactly which interfaces/IP addresses and ports the ndbd processes are listening on, and which addresses and ports they try to connect to. I would also check whether I can connect to those IP addresses and ports using telnet.
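
If telnet is not installed on a node, the same reachability check can be sketched in a few lines of Python. This is only a rough equivalent of "telnet host port": the helper name can_connect is hypothetical, and the host and port in the usage section are placeholders to be replaced with whatever lsof reports for the ndbd processes (1186 is simply the management port shown in the ndb_mgm output).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder targets (assumptions): replace with the real addresses
    # and the ports that lsof shows the ndbd processes listening on.
    targets = [("127.0.0.1", 1186)]
    for host, port in targets:
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} is {state}")
```

Run it from each data node against the other data node's address and port. A successful connection from the manager alone proves nothing about the path between the two data nodes.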

> manager is on host NET1.5
>
> -- NDB Cluster -- Management Client --
> ndb_mgm> show ;
> Connected to Management Server at: localhost:1186
> Cluster Configuration
> ---------------------
> [ndbd(NDB)] 2 node(s)
> id=3 @NET1.6 (Version: 5.1.20, starting, Nodegroup: 0)
> id=7 @NET2.178 (Version: 5.1.20, starting, Nodegroup: 0)
>
> [ndb_mgmd(MGM)] 1 node(s)
> id=1 (Version: 5.1.20)
>
> [mysqld(API)] 3 node(s)
> id=4 (not connected, accepting connect from NET1.6)
> id=6 (not connected, accepting connect from NET1.7)
> id=8 (not connected, accepting connect from NET2.178)
>
>
> Remark: I am surprised here I have no master ? When all machines are on
> the same lan NET1.0/24, node 3 becomes master and API nodes connect.

I think that is because the data nodes cannot contact each other. Once you fix the communication problem, they will elect a master.

> I get messages in the ndb_mgmd log I don't understand:
>
> 2007-07-25 14:09:30 [MgmSrvr] INFO -- Node 3: Initial start, waiting
> for 0000000000000080 to connect, nodes [ all: 0000000000000088
> connected: 0000000000000008 no-wait: 0000000000000000 ]
> 2007-07-25 14:09:32 [MgmSrvr] INFO -- Node 7: Initial start, waiting
> for 0000000000000008 to connect, nodes [ all: 0000000000000088
> connected: 0000000000000080 no-wait: 0000000000000000 ]
>
> How do I interpret "0000000000000080" and "0000000000000008" as machine
> identifiers ? Can you give details on these identifiers ?

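It is something like this. They are not machine identifiers; they are hexadecimal bitmasks that encode nodes and node sets. Node N is represented by bit N, that is, the value 1 shifted left N times: Node 1 is encoded as 0000000000000002, Node 2 as 0000000000000004, Node 3 as 0000000000000008, Node 4 as 0000000000000010, Node 7 as 0000000000000080 and so on. A set of nodes is the sum of their bits, so "all: 0000000000000088" means nodes 3 and 7. In your log, node 3 is waiting for node 7 to connect, and node 7 is waiting for node 3.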
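
The masks in the quoted log can also be decoded mechanically. The sketch below assumes each value is a hexadecimal bitmask in which bit N stands for node ID N, which matches "all: 0000000000000088" for a cluster whose data nodes are 3 and 7. The function name decode_node_mask is made up for this illustration and is not part of any MySQL tool.

```python
def decode_node_mask(mask_hex):
    """Return the sorted node IDs whose bits are set in a hex bitmask string."""
    mask = int(mask_hex, 16)
    # Bit N set means node ID N is a member of the set.
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


print(decode_node_mask("0000000000000088"))  # [3, 7]  (all)
print(decode_node_mask("0000000000000080"))  # [7]     (what node 3 waits for)
print(decode_node_mask("0000000000000008"))  # [3]     (what node 7 waits for)
```

Read this way, each data node reports only itself as connected and is waiting for the other one.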

> NB: firewall is wide open between all these hosts (nothing is blocked,
> tcpdump reports a lot of traffic between manager and nodes, but no
> traffic at all between the 2 data nodes).
> NB: If I revert to an "all nodes on the same lan" config, everything
> restarts normally.
> NB: I checked the configuration for possible stupid errors, but this
> looks nice right now.
> NB: hosts on separate lans have 100Mb/s connections between them.
>
> My final question is: "Even if not recommended, is it possible to have
> such a configuration with data nodes on separate networks ?"

If the connection between the networks is really 100 Mb/s then it should be possible.

Regards,

Anatoly.

-- 
MySQL Cluster Mailing List
For list archives: 
http://lists.mysql.com/cluster
To unsubscribe:    
http://lists.mysql.com/cluster?unsub=lists@pantek.com
Received on Wed Jul 25 19:34:31 2007
