From our Gluster volume log file:

gdata.log:[2011-10-02 15:22:51.597683] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-1: background  entry self-heal triggered. path: /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:51.608915] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-1: background  entry self-heal completed on /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:51.609221] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-0: background  entry self-heal triggered. path: /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:51.609302] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-3: background  entry self-heal triggered. path: /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:51.619626] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-0: background  entry self-heal completed on /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:51.751435] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-3: background  entry self-heal completed on /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:52.214606] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-2: background  entry self-heal triggered. path: /a/bkp/db099/hot
gdata.log:[2011-10-02 15:22:52.231328] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background  entry self-heal completed on /a/bkp/db099/hot
gdata.log:[2011-10-03 06:27:05.365663] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/old2/old succeeded (fd = 0)
gdata.log:[2011-10-03 06:29:08.676629] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/old2/old succeeded (fd = 0)
gdata.log:[2011-10-03 06:32:21.775641] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/old2/old succeeded (fd = 0)
gdata.log:[2011-10-03 06:33:58.232913] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/old2/old succeeded (fd = 0)
gdata.log:[2011-10-03 06:36:06.758396] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-2: background  data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_data73.dat.gz
gdata.log:[2011-10-03 06:36:06.819556] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-2: background  entry self-heal triggered. path: /a/bkp/db099/old2/old
gdata.log:[2011-10-03 06:37:12.879077] I [afr-self-heal-algorithm.c:532:sh_diff_loop_driver_done] 0-data-replicate-2: diff self-heal on /a/bkp/db099/hot/nlcorp_nlcompany_data73.dat.gz: 0 blocks of 5921 were different (0.00%)
gdata.log:[2011-10-03 06:37:12.887628] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background  data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_data73.dat.gz
gdata.log:[2011-10-03 06:39:18.558246] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background  entry self-heal completed on /a/bkp/db099/old2/old
gdata.log:[2011-10-03 06:52:56.915286] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/old2/old succeeded (fd = 0)
gdata.log:[2011-10-03 06:58:45.720367] I [afr-common.c:790:afr_lookup_done] 0-data-replicate-2: background  entry self-heal triggered. path: /a/bkp/db099/old2/old
gdata.log:[2011-10-03 06:58:45.721553] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background  entry self-heal completed on /a/bkp/db099/old2/old


From our Gluster NFS log file:
nfs.log:[2011-10-02 08:51:03.226051] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 0)
nfs.log:[2011-10-02 08:51:03.227120] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 1)
nfs.log:[2011-10-02 08:51:03.227274] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 2)
nfs.log:[2011-10-02 08:51:03.227492] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 3)
nfs.log:[2011-10-02 08:51:03.227677] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 4)
nfs.log:[2011-10-02 08:51:03.227915] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 5)
nfs.log:[2011-10-02 08:51:03.228101] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 6)
nfs.log:[2011-10-02 08:51:03.228342] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 7)
nfs.log:[2011-10-02 08:51:03.228520] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 8)
nfs.log:[2011-10-02 08:51:03.228803] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 9)
nfs.log:[2011-10-02 08:51:03.228976] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 10)
nfs.log:[2011-10-02 08:51:03.229223] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 11)
[snip]
nfs.log:[2011-10-02 08:51:12.51332] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 50706)
nfs.log:[2011-10-02 08:51:12.51357] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 50707)
nfs.log:[2011-10-02 08:51:12.51833] I [client-handshake.c:399:client3_1_reopen_cbk] 0-data-client-3: reopen on /a/bkp/db099/hot/nlcorp_nlcompany_data101.dat.gz succeeded (remote-fd = 50709)
nfs.log:[2011-10-02 08:51:12.51961] I [client-handshake.c:399:client3_1_reopen_cbk] 0-data-client-3: reopen on /a/bkp/db099/hot/nlcorp_nlcompany_data124.dat.gz succeeded (remote-fd = 50711)


These informational messages have been logged a few times in our nfs.log file.

Here's another section, where roughly 82k file descriptors are reopened against the same directory:

nfs.log:[2011-10-02 12:30:46.994682] I [afr-common.c:633:afr_lookup_self_heal_check] 0-data-replicate-2: size differs for /a/bkp/db099/hot/nlcorp_nlcompany_data73.dat.gz 
nfs.log:[2011-10-02 12:38:39.263796] I [client-handshake.c:399:client3_1_reopen_cbk] 0-data-client-3: reopen on /a/bkp/db099/hot/nlcorp_nlcompany_data59.dat.gz succeeded (remote-fd = 2)
nfs.log:[2011-10-02 12:38:39.263820] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 3)
nfs.log:[2011-10-02 12:38:39.263844] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 4)
nfs.log:[2011-10-02 12:38:39.263866] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 5)
nfs.log:[2011-10-02 12:38:39.263889] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 6)

[snip]
nfs.log:[2011-10-02 12:38:41.192093] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 82499)
nfs.log:[2011-10-02 12:38:41.192109] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 82500)
nfs.log:[2011-10-02 12:38:41.192126] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 82501)
nfs.log:[2011-10-02 12:38:41.192142] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 82502)
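
To put a number on how many times each directory is reopened, the reopendir lines can be tallied per client and path. This is a minimal Python sketch, not anything shipped with Gluster: the function name `summarize_reopendir` and the regular expression are assumptions based only on the message format shown above, and paths containing spaces would not be matched.

```python
import re
from collections import defaultdict

# Assumed pattern, derived from lines such as:
# "... 0-data-client-3: reopendir on /a/bkp/db099/hot succeeded (fd = 82502)"
REOPENDIR_RE = re.compile(
    r"\d+-(?P<client>[\w-]+): reopendir on (?P<path>\S+) succeeded "
    r"\(fd = (?P<fd>\d+)\)"
)


def summarize_reopendir(lines):
    """Return {(client, path): (message_count, highest_fd)} for reopendir lines."""
    counts = defaultdict(int)
    highest = defaultdict(int)
    for line in lines:
        match = REOPENDIR_RE.search(line)
        if match is None:
            continue  # not a reopendir message
        key = (match.group("client"), match.group("path"))
        counts[key] += 1
        highest[key] = max(highest[key], int(match.group("fd")))
    return {key: (counts[key], highest[key]) for key in counts}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python reopendir_summary.py < nfs.log
    for (client, path), (count, max_fd) in sorted(summarize_reopendir(sys.stdin).items()):
        print(f"{client} {path}: {count} reopendir messages, highest fd {max_fd}")
```

Grouping by client as well as path keeps the bursts against data-client-3 and data-client-5 separate.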


More messages from our nfs.log, this time with some errors:
nfs.log:[2011-10-02 15:21:32.477291] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/hot succeeded (fd = 4434)
nfs.log:[2011-10-02 15:21:32.477307] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/hot succeeded (fd = 4435)
nfs.log:[2011-10-02 15:21:32.477323] I [client-handshake.c:498:client3_1_reopendir_cbk] 0-data-client-5: reopendir on /a/bkp/db099/hot succeeded (fd = 4436)
nfs.log:[2011-10-02 15:21:32.477342] I [client-handshake.c:399:client3_1_reopen_cbk] 0-data-client-5: reopen on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz succeeded (remote-fd = 4437)
nfs.log:[2011-10-02 15:21:44.423323] I [afr-open.c:438:afr_openfd_sh] 0-data-replicate-2:  data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz, reason: Replicate up down flush, data lock is held
nfs.log:[2011-10-02 15:21:44.423639] E [afr-self-heal-common.c:1217:sh_missing_entries_create] 0-data-replicate-2: no missing files - /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz. proceeding to metadata check
nfs.log:[2011-10-02 15:21:54.985427] I [afr-self-heal-algorithm.c:532:sh_diff_loop_driver_done] 0-data-replicate-2: diff self-heal on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz: 4 blocks of 49126 were different (0.01%)
nfs.log:[2011-10-02 15:21:54.986031] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background  data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz
nfs.log:[2011-10-02 15:21:54.986240] I [afr-open.c:438:afr_openfd_sh] 0-data-replicate-2:  data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz, reason: Replicate up down flush, data lock is held
nfs.log:[2011-10-02 15:21:54.986622] E [afr-self-heal-common.c:1217:sh_missing_entries_create] 0-data-replicate-2: no missing files - /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz. proceeding to metadata check
nfs.log:[2011-10-02 15:21:54.987397] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background  data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz
nfs.log:[2011-10-02 15:21:54.987681] I [afr-open.c:438:afr_openfd_sh] 0-data-replicate-2:  data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz, reason: Replicate up down flush, data lock is held
nfs.log:[2011-10-02 15:21:54.988123] E [afr-self-heal-common.c:1217:sh_missing_entries_create] 0-data-replicate-2: no missing files - /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz. proceeding to metadata check
nfs.log:[2011-10-02 15:21:54.988902] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background  data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz
nfs.log:[2011-10-02 15:21:54.989154] I [afr-open.c:438:afr_openfd_sh] 0-data-replicate-2:  data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz, reason: Replicate up down flush, data lock is held
nfs.log:[2011-10-02 15:21:54.989609] E [afr-self-heal-common.c:1217:sh_missing_entries_create] 0-data-replicate-2: no missing files - /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz. proceeding to metadata check
nfs.log:[2011-10-02 15:21:54.990483] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background  data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz
nfs.log:[2011-10-02 15:21:54.990779] I [afr-open.c:438:afr_openfd_sh] 0-data-replicate-2:  data self-heal triggered. path: /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz, reason: Replicate up down flush, data lock is held
nfs.log:[2011-10-02 15:21:54.991335] E [afr-self-heal-common.c:1217:sh_missing_entries_create] 0-data-replicate-2: no missing files - /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz. proceeding to metadata check
nfs.log:[2011-10-02 15:21:54.992414] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 0-data-replicate-2: background  data self-heal completed on /a/bkp/db099/hot/nlcorp_nlcompany_lob81.dat.gz
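
The excerpt above shows data self-heal being triggered and completed on the same file several times within a few milliseconds. A rough Python sketch for counting triggers and completions per file follows. The helper name `tally_self_heal` and both regular expressions are assumptions inferred from the log lines in this post, not from the Gluster source.

```python
import re
from collections import defaultdict

# Assumed patterns, derived from the messages quoted in this post.
TRIGGERED_RE = re.compile(
    r"(?P<kind>data|entry|metadata) self-heal triggered\. path: (?P<path>[^,\s]+)"
)
COMPLETED_RE = re.compile(
    r"(?P<kind>data|entry|metadata) self-heal completed on (?P<path>\S+)"
)


def tally_self_heal(lines):
    """Return {(kind, path): {"triggered": n, "completed": m}}."""
    tally = defaultdict(lambda: {"triggered": 0, "completed": 0})
    for line in lines:
        match = TRIGGERED_RE.search(line)
        if match:
            tally[(match.group("kind"), match.group("path"))]["triggered"] += 1
            continue
        match = COMPLETED_RE.search(line)
        if match:
            tally[(match.group("kind"), match.group("path"))]["completed"] += 1
    return dict(tally)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python self_heal_tally.py < nfs.log
    for (kind, path), n in sorted(tally_self_heal(sys.stdin).items()):
        print(f"{kind} {path}: triggered={n['triggered']} completed={n['completed']}")
```

A file whose trigger count keeps climbing while the volume is otherwise idle is a candidate for closer inspection.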


Right now, about 4.5k FDs are open on a single directory. The behavior persists across a Gluster service stop/start.

[root@bkp1002a glusterfs]# lsof | grep gluster | wc -l; lsof | grep gluster | grep db099 | less
5610
glusterfs  5633       root  169r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
glusterfs  5633       root  170r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
glusterfs  5633       root  171r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
glusterfs  5633       root  172r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
glusterfs  5633       root  173r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
glusterfs  5633       root  174r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
glusterfs  5633       root  175r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
[snip]
glusterfs  5633       root 4606r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
glusterfs  5633       root 4607r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
glusterfs  5633       root 4608r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
glusterfs  5633       root 4609r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old
glusterfs  5633       root 4610r      DIR             104,16       4096     751751 /data/gluster/a/bkp/db099/old

The snipped entries (FDs 176-4605) are identical to the ones shown.
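
Rather than paging through the lsof output by eye, the open directory handles can be counted per process and path. Below is a small Python sketch that reads the default lsof text output. The function name `count_dir_fds` is hypothetical, and splitting on whitespace assumes every column is populated, as in the listing above. Since lsof can leave columns blank for some file types, its `-F` field output would be a sturdier input for anything beyond a quick check.

```python
from collections import Counter


def count_dir_fds(lsof_lines, command="glusterfs"):
    """Count open directory FDs per (pid, path) in default lsof output.

    Expects the nine columns shown above:
    COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
    """
    counts = Counter()
    for line in lsof_lines:
        # maxsplit=8 keeps a NAME containing spaces in one piece
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue  # header, blank, or short line
        cmd, pid, _user, _fd, ftype, _dev, _size, _node, name = fields
        if cmd == command and ftype == "DIR":
            counts[(pid, name.strip())] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): lsof | python count_dir_fds.py
    for (pid, path), n in count_dir_fds(sys.stdin).most_common(10):
        print(f"pid {pid}: {n} open FDs on {path}")
```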

Our Gluster volume info:
[root@bkp1002a glusterfs]# gluster volume info data

Volume Name: data
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: bkp1002ax:/data/gluster
Brick2: bkp1002bx:/data/gluster
Brick3: bkp1002cx:/data/gluster
Brick4: bkp1002dx:/data/gluster
Brick5: bkp1002ex:/data/gluster
Brick6: bkp1002fx:/data/gluster
Brick7: bkp1002gx:/data/gluster
Brick8: bkp1002hx:/data/gluster
Options Reconfigured:
performance.quick-read: off
performance.stat-prefetch: on
network.ping-timeout: 10
cluster.min-free-disk: 6%


Relevant entries from our nfs-server.vol:
volume data-client-0
    type protocol/client
    option remote-host bkp1002ax
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-1
    type protocol/client
    option remote-host bkp1002bx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-2
    type protocol/client
    option remote-host bkp1002cx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-3
    type protocol/client
    option remote-host bkp1002dx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-4
    type protocol/client
    option remote-host bkp1002ex
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-5
    type protocol/client
    option remote-host bkp1002fx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-6
    type protocol/client
    option remote-host bkp1002gx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-client-7
    type protocol/client
    option remote-host bkp1002hx
    option remote-subvolume /data/gluster
    option transport-type tcp
    option ping-timeout 10
end-volume

volume data-replicate-0
    type cluster/replicate
    subvolumes data-client-0 data-client-1
end-volume

volume data-replicate-1
    type cluster/replicate
    subvolumes data-client-2 data-client-3
end-volume

volume data-replicate-2
    type cluster/replicate
    subvolumes data-client-4 data-client-5
end-volume

volume data-replicate-3
    type cluster/replicate
    subvolumes data-client-6 data-client-7
end-volume

volume data-dht
    type cluster/distribute
    option min-free-disk 6%
    subvolumes data-replicate-0 data-replicate-1 data-replicate-2 data-replicate-3
end-volume

volume data-write-behind
    type performance/write-behind
    subvolumes data-dht
end-volume

volume data-read-ahead
    type performance/read-ahead
    subvolumes data-write-behind
end-volume

volume data-io-cache
    type performance/io-cache
    subvolumes data-read-ahead
end-volume

volume data
    type debug/io-stats
    subvolumes data-io-cache
end-volume

volume nfs-server
    type nfs/server
    option nfs.dynamic-volumes on
    option rpc-auth.addr.data1.allow *
    option nfs3.data1.volume-id 91a96dbe-35d9-4324-a521-3b503f3f2f09
    option rpc-auth.addr.data.allow *
    option nfs3.data.volume-id ac503ee5-ad29-47ac-99c0-38cc696a1d4d
    subvolumes data1 data
end-volume
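
The log lines name translators such as `0-data-client-5` and `0-data-replicate-2`, so it helps to map those names back to brick hosts. The following Python sketch parses the `volume ... end-volume` stanzas above into a dictionary and lists the remote hosts behind each replicate set. It is an illustration written for this post, not a Gluster tool; `parse_volfile` and `replica_hosts` are made-up names, and the parser only handles the `type`, `option`, and `subvolumes` keywords that appear in this file.

```python
def parse_volfile(text):
    """Parse volfile text into {name: {"type", "options", "subvolumes"}}."""
    volumes = {}
    current = None
    for raw in text.splitlines():
        words = raw.split()
        if not words:
            continue
        if words[0] == "volume" and len(words) == 2:
            current = {"type": None, "options": {}, "subvolumes": []}
            volumes[words[1]] = current
        elif words[0] == "end-volume":
            current = None
        elif current is None:
            continue  # ignore anything outside a stanza
        elif words[0] == "type" and len(words) == 2:
            current["type"] = words[1]
        elif words[0] == "option" and len(words) >= 3:
            current["options"][words[1]] = " ".join(words[2:])
        elif words[0] == "subvolumes":
            current["subvolumes"] = words[1:]
    return volumes


def replica_hosts(volumes):
    """Map each cluster/replicate volume to the remote hosts of its clients."""
    result = {}
    for name, vol in volumes.items():
        if vol["type"] != "cluster/replicate":
            continue
        result[name] = [
            volumes[sub]["options"].get("remote-host")
            for sub in vol["subvolumes"]
            if sub in volumes
        ]
    return result


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python volfile_hosts.py < nfs-server.vol
    for name, hosts in sorted(replica_hosts(parse_volfile(sys.stdin.read())).items()):
        print(name, "->", ", ".join(str(h) for h in hosts))
```

Applied to the stanzas above, this would tie data-replicate-2 (the set named in the self-heal messages) to its data-client-4 and data-client-5 bricks.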


OS:
[root@bkp1002a nfs]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 5.6 (Tikanga)