From our Gluster volume log file:

gdata.log:[2011-10-02 15:22:51.597683] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-1: background entry self-heal triggered. path: /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:51.608915] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-1: background entry self-heal completed on /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:51.609221] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-0: background entry self-heal triggered. path: /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:51.609302] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-3: background entry self-heal triggered. path: /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:51.619626] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-0: background entry self-heal completed on /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:51.751435] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-3: background entry self-heal completed on /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:52.214606] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-2: background entry self-heal triggered. path: /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:52.231328] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background entry self-heal completed on /a/bkp/db099/hot
gdata.log:[2011-10-03 06:27:05.365663] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/old2/old succeeded
gdata.log:[2011-10-03 06:29:08.676629] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/old2/old succeeded
gdata.log:[2011-10-03 06:32:21.775641] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/old2/old succeeded
gdata.log:[2011-10-03 06:33:58.232913] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/old2/old succeeded
gdata.log:[2011-10-03 06:36:06.758396] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-2: background data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_data73.dat.gz
gdata.log:[2011-10-03 06:36:06.819556] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-2: background entry self-heal triggered. path: /a/bkp/db099/old2/old
gdata.log:[2011-10-03 06:37:12.879077] I [afr-self-heal-algorithm.c:532:sh_diff_loop_driver_done] 0-data-replicate-2: diff self-heal on /a/bkp/db099/hot/nlcorp_nlcompany_data73.dat.gz: 0 blocks of 5921 were different
gdata.log:[2011-10-03 06:37:12.887628] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_data73.dat.gz
gdata.log:[2011-10-03 06:39:18.558246] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background entry self-heal completed on /a/bkp/db099/old2/old
gdata.log:[2011-10-03 06:52:56.915286] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/old2/old succeeded
gdata.log:[2011-10-03 06:58:45.720367] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-2: background entry self-heal triggered. path: /a/bkp/db099/old2/old
gdata.log:[2011-10-03 06:58:45.721553] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background entry self-heal completed on /a/bkp/db099/old2/old

From our Gluster NFS log file:

nfs.log:[2011-10-02 08:51:03.226051] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.227120] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.227274] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.227492] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.227677] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.227915] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.228101] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.228342] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.228520] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.228803] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.228976] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:03.229223] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
[snip]
nfs.log:[2011-10-02 08:51:12.51332] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:12.51357] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded
nfs.log:[2011-10-02 08:51:12.51833] I [client-handshake.c:399:client3_1_reopen_cbk] 0-data-client-3: reopen on /a/bkp/db099/hot/nlcorp_nlcompany_data101.dat.gz succeeded
nfs.log:[2011-10-02 08:51:12.51961] I [client-handshake.c:399:client3_1_reopen_cbk] 0-data-client-3: reopen on /a/bkp/db099/hot/nlcorp_nlcompany_data124.dat.gz succeeded

These informational messages have been logged a few times in our nfs.log file.
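To see how many times each client reopened a given path, the reopen/reopendir lines quoted above can be tallied with a short script. This is only a sketch: the regular expression is written against the message format visible in these excerpts, and `count_reopens` is a hypothetical helper name, not part of Gluster.

```python
import re
from collections import Counter

# Pattern based on the "reopen"/"reopendir ... succeeded" lines quoted above.
# Any trailing "(fd = N)" or "(remote-fd = N)" suffix is ignored.
REOPEN_RE = re.compile(
    r"client3_1_(?P<kind>reopendir|reopen)_cbk\] "
    r"\d+-(?P<client>[\w-]+): (?P=kind) on (?P<path>\S+) succeeded"
)


def count_reopens(lines):
    """Count successful reopen/reopendir messages per (client, kind, path)."""
    counts = Counter()
    for line in lines:
        match = REOPEN_RE.search(line)
        if match:
            key = (match.group("client"), match.group("kind"), match.group("path"))
            counts[key] += 1
    return counts
```

Passing an open log file to `count_reopens` and printing `most_common(5)` on the result should surface the directories with the most reopened descriptors.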
Here's another section where ~82k file descriptors are set aside for the same directory:

nfs.log:[2011-10-02 12:30:46.994682] I [afr-common.c:633:afr_lookup_self_heal_check] 0-data-replicate-2: size differs for /a/bkp/db099/hot/nlcorp_nlcompany_data73.dat.gz
nfs.log:[2011-10-02 12:38:39.263796] I [client-handshake.c:399:client3_1_reopen_cbk] 0-data-client-3: reopen on /a/bkp/db099/hot/nlcorp_nlcompany_data59.dat.gz succeeded (remote-fd = 2)
nfs.log:[2011-10-02 12:38:39.263820] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 3)
nfs.log:[2011-10-02 12:38:39.263844] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 4)
nfs.log:[2011-10-02 12:38:39.263866] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 5)
nfs.log:[2011-10-02 12:38:39.263889] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 6)
[snip]
nfs.log:[2011-10-02 12:38:41.192093] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 82499)
nfs.log:[2011-10-02 12:38:41.192109] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 82500)
nfs.log:[2011-10-02 12:38:41.192126] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 82501)
nfs.log:[2011-10-02 12:38:41.192142] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 82502)

More messages from our nfs.log, this time with some errors:

nfs.log:[2011-10-02 15:21:32.477291] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/hot succeeded (fd = 4434)
nfs.log:[2011-10-02 15:21:32.477307] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/hot succeeded (fd = 4435)
nfs.log:[2011-10-02 15:21:32.477323] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/hot succeeded (fd = 4436)
nfs.log:[2011-10-02 15:21:32.477342] I [client-handshake.c:399:client3_1_reopen_cbk] 0-data-client-5: reopen on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz succeeded (remote-fd = 4437)
nfs.log:[2011-10-02 15:21:44.423323] I [afr-open.c:438:afr_openfd_sh] 0-data-replicate-2: data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz, reason: Replicate up down flush, data lock is held
nfs.log:[2011-10-02 15:21:44.423639] E [afr-self-heal-common.c:1217:sh_missing_entries_create] 0-data-replicate-2: no missing files - /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz. proceeding to metadata check
nfs.log:[2011-10-02 15:21:54.985427] I [afr-self-heal-algorithm.c:532:sh_diff_loop_driver_done] 0-data-replicate-2: diff self-heal on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz: 4 blocks of 49126 were different (0.01%)
nfs.log:[2011-10-02 15:21:54.986031] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz
nfs.log:[2011-10-02 15:21:54.986240] I [afr-open.c:438:afr_openfd_sh] 0-data-replicate-2: data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz, reason: Replicate up down flush, data lock is held
nfs.log:[2011-10-02 15:21:54.986622] E [afr-self-heal-common.c:1217:sh_missing_entries_create] 0-data-replicate-2: no missing files - /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz. proceeding to metadata check
nfs.log:[2011-10-02 15:21:54.987397] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz
nfs.log:[2011-10-02 15:21:54.987681] I [afr-open.c:438:afr_openfd_sh] 0-data-replicate-2: data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz, reason: Replicate up down flush, data lock is held
nfs.log:[2011-10-02 15:21:54.988123] E [afr-self-heal-common.c:1217:sh_missing_entries_create] 0-data-replicate-2: no missing files - /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz. proceeding to metadata check
nfs.log:[2011-10-02 15:21:54.988902] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz
nfs.log:[2011-10-02 15:21:54.989154] I [afr-open.c:438:afr_openfd_sh] 0-data-replicate-2: data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz, reason: Replicate up down flush, data lock is held
nfs.log:[2011-10-02 15:21:54.989609] E [afr-self-heal-common.c:1217:sh_missing_entries_create] 0-data-replicate-2: no missing files - /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz. proceeding to metadata check
nfs.log:[2011-10-02 15:21:54.990483] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz
nfs.log:[2011-10-02 15:21:54.990779] I [afr-open.c:438:afr_openfd_sh] 0-data-replicate-2: data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz, reason: Replicate up down flush, data lock is held
nfs.log:[2011-10-02 15:21:54.991335] E [afr-self-heal-common.c:1217:sh_missing_entries_create] 0-data-replicate-2: no missing files - /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz. proceeding to metadata check
nfs.log:[2011-10-02 15:21:54.992414] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz

As of right now, around 4.5k FDs are open for just one directory. This behavior is consistent through a Gluster service stop/start.

[root@bkp1002a glusterfs]# lsof | grep gluster | wc -l; lsof | grep gluster | grep db099 | less
5610
glusterfs 5633 root 169r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
glusterfs 5633 root 170r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
glusterfs 5633 root 171r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
glusterfs 5633 root 172r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
glusterfs 5633 root 173r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
glusterfs 5633 root 174r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
glusterfs 5633 root 175r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
[snip]
glusterfs 5633 root 4606r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
glusterfs 5633 root 4607r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
glusterfs 5633 root 4608r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
glusterfs 5633 root 4609r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old
glusterfs 5633 root 4610r DIR 104,16 4096 751751 /data/gluster/a/bkp/db099/old

FDs 175-4606 are the same as the other FDs on display.
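Rather than paging through the lsof output by eye, the duplicate directory descriptors can be grouped per process and path. The sketch below assumes the nine-column lsof layout shown above (COMMAND, PID, USER, FD, TYPE, DEVICE, SIZE/OFF, NODE, NAME) with no empty columns; `count_dir_fds` is a hypothetical helper name.

```python
from collections import Counter


def count_dir_fds(lsof_lines):
    """Count numeric directory descriptors per (pid, path) in lsof text output."""
    counts = Counter()
    for line in lsof_lines:
        # Split into at most 9 fields so a NAME containing spaces stays whole.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[4] != "DIR":
            continue
        fd = fields[3]
        # Skip non-numeric entries such as "cwd" or "rtd".
        if not fd[0].isdigit():
            continue
        counts[(fields[1], fields[8].strip())] += 1
    return counts
```

Feeding it the lines printed by `lsof | grep gluster` and sorting the result by count should show at a glance which directory accounts for most of the open descriptors.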
Our Gluster volume info:

[root@bkp1002a glusterfs]# gluster volume info data

Volume Name: data
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: bkp1002ax:/data/gluster
Brick2: bkp1002bx:/data/gluster
Brick3: bkp1002cx:/data/gluster
Brick4: bkp1002dx:/data/gluster
Brick5: bkp1002ex:/data/gluster
Brick6: bkp1002fx:/data/gluster
Brick7: bkp1002gx:/data/gluster
Brick8: bkp1002hx:/data/gluster
Options Reconfigured:
performance.quick-read: off
performance.stat-prefetch: on
network.ping-timeout: 10
cluster.min-free-disk: 6%

Relevant entries from our nfs-server.vol:

volume data-client-0
    type protocol/client
    option remote-host bkp1002ax
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-1
    type protocol/client
    option remote-host bkp1002bx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-2
    type protocol/client
    option remote-host bkp1002cx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-3
    type protocol/client
    option remote-host bkp1002dx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-4
    type protocol/client
    option remote-host bkp1002ex
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-5
    type protocol/client
    option remote-host bkp1002fx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-6
    type protocol/client
    option remote-host bkp1002gx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-7
    type protocol/client
    option remote-host bkp1002hx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-replicate-0
    type cluster/replicate
    subvolumes data-client-0 data-client-1
end-volume

volume data-replicate-1
    type cluster/replicate
    subvolumes data-client-2 data-client-3
end-volume

volume data-replicate-2
    type cluster/replicate
    subvolumes data-client-4 data-client-5
end-volume

volume data-replicate-3
    type cluster/replicate
    subvolumes data-client-6 data-client-7
end-volume

volume data-dht
    type cluster/distribute
    option min-free-disk 6%
    subvolumes data-replicate-0 data-replicate-1 data-replicate-2 data-replicate-3
end-volume

volume data-write-behind
    type performance/write-behind
    subvolumes data-dht
end-volume

volume data-read-ahead
    type performance/read-ahead
    subvolumes data-write-behind
end-volume

volume data-io-cache
    type performance/io-cache
    subvolumes data-read-ahead
end-volume

volume data
    type debug/io-stats
    subvolumes data-io-cache
end-volume

volume nfs-server
    type nfs/server
    option nfs.dynamic-volumes on
    option rpc-auth.addr.data1.allow *
    option nfs3.data1.volume-id 91a96dbe-35d9-4324-a521-3b503f3f2f09
    option rpc-auth.addr.data.allow *
    option nfs3.data.volume-id ac503ee5-ad29-47ac-99c0-38cc696a1d4d
    subvolumes data1 data
end-volume

OS:

[root@bkp1002a nfs]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
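The log prefixes (for example 0-data-client-5 and 0-data-replicate-2) refer to translator names from the nfs-server.vol entries, so it can help to resolve each replicate set back to its brick hosts. The following is a rough sketch that assumes one volfile directive per line; `parse_volfile`, `replica_hosts` and `SAMPLE_VOLFILE` are hypothetical names, and the sample is a three-volume excerpt of the entries above.

```python
SAMPLE_VOLFILE = """
volume data-client-4
    type protocol/client
    option remote-host bkp1002ex
end-volume
volume data-client-5
    type protocol/client
    option remote-host bkp1002fx
end-volume
volume data-replicate-2
    type cluster/replicate
    subvolumes data-client-4 data-client-5
end-volume
"""


def parse_volfile(text):
    """Parse volume/type/option/subvolumes/end-volume blocks into a dict."""
    volumes = {}
    current = None
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        key = parts[0]
        if key == "volume" and len(parts) > 1:
            current = {"type": None, "options": {}, "subvolumes": []}
            volumes[parts[1]] = current
        elif key == "end-volume":
            current = None
        elif current is None:
            continue  # ignore anything outside a volume block
        elif key == "type" and len(parts) > 1:
            current["type"] = parts[1]
        elif key == "option" and len(parts) >= 3:
            current["options"][parts[1]] = " ".join(parts[2:])
        elif key == "subvolumes":
            current["subvolumes"] = parts[1:]
    return volumes


def replica_hosts(volumes):
    """Map each cluster/replicate volume to the remote hosts of its clients."""
    result = {}
    for name, vol in volumes.items():
        if vol["type"] == "cluster/replicate":
            result[name] = [
                volumes[sub]["options"].get("remote-host")
                for sub in vol["subvolumes"]
                if sub in volumes
            ]
    return result


print(replica_hosts(parse_volfile(SAMPLE_VOLFILE)))
# prints {'data-replicate-2': ['bkp1002ex', 'bkp1002fx']}
```

With the full volfile as input, the same lookup shows that data-client-5, the client named in the reopendir messages for /a/bkp/db099/old2/old, points at bkp1002fx, one half of data-replicate-2.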