Troubleshooting
If only one or two client nodes have crashed and a lock they hold is needed, there is a pause of 6
to 20 seconds while the crashed client nodes are evicted. When such an event occurs, Lustre
evicts the clients one at a time. A typical log message in this situation is as follows:
2005/09/30 21:02:53 kern i s5 : LustreError:
4952:0:(ldlm_lockd.c:365:ldlm_failed_ast()) ### blocking AST failed (-110): evicting
client b9929_workspace_9803d79af3@NET_0xac160393_UUID NID 0xac160393 (172.22.3.147)
ns: filter-sfsalias-ost203_UUID lock: 40eabb80/0x37e426c2e3b1ac01 lrc: 2/0,0 mode:
PR/PR res: 79613/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 10020 remote: 0xc40949dc40637e1f expref: 2 pid: 4940
After an interval of 6 to 20 seconds, the message is repeated for the next crashed client node.
When enough time has elapsed, Lustre proactively evicts nodes, and a message similar to the following is
displayed:
2005/10/27 11:04:00 kern i s14: Lustre: sfsalias-ost200 hasn't heard from
172.22.1.211 in 232 seconds. I think it's dead, and I am evicting it.
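When several clients are being evicted, it can help to tally evictions per client. The following sketch greps the server syslog for eviction messages like the ones above and extracts the evicted client's IP address from the parenthesized NID field; the log path is illustrative, and the exact message layout may vary between Lustre versions, so treat this as a starting point rather than a fixed recipe.

```shell
#!/bin/sh
# Count evictions per client IP from eviction log messages.
# /var/log/messages is a typical syslog location; adjust for your system.
LOG=${1:-/var/log/messages}

grep "evicting" "$LOG" |
    # The client IP appears in parentheses after the NID, e.g. (172.22.3.147);
    # capture the last parenthesized dotted-decimal field on the line.
    sed -n 's/.*(\([0-9.][0-9.]*\)).*/\1/p' |
    sort | uniq -c
```

Each output line shows an eviction count followed by the client IP, which can then be matched against the list of crashed nodes.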
7.3.5 Access to a file hangs (ldlm_namespace_cleanup() messages)
A Lustre problem can cause access to a specific file to hang. This results in one of the following two
scenarios:
• I/O operations on the file hang, but the file can still be accessed using the ls -ls command.
In this scenario, I/O operations hang and can only be killed by a signal (or fail) after a minimum of
approximately 100 seconds.
To determine if this scenario has arisen, enter the cat command for the file. If the command hangs,
press Ctrl/c. The cat command will terminate approximately 100 seconds later (but will not show the
file contents).
• An unmount operation on a Lustre file system causes an LBUG ldlm_lock_cancel() error on the
client node.
In this scenario, unmounting a Lustre file system on a client node leads to an LBUG message being
logged to the /var/log/messages file. This is due to LDLM lock references that cannot be cleaned
up.
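The cat probe described in the first scenario ties up a shell for the duration of the hang. Where the GNU coreutils timeout utility is available, the same check can be scripted so that a hung read is detected and killed automatically; the file path below is illustrative, and the 10-second deadline is only a suggestion (a healthy read should return almost immediately, well under the ~100-second hang threshold).

```shell
#!/bin/sh
# Probe a suspect file for hanging I/O without blocking an interactive shell.
# timeout(1) kills cat if it has not finished within the deadline and
# reports exit status 124 in that case.
FILE=${1:-/mnt/lustre/suspect_file}   # hypothetical path on the affected fs

timeout 10 cat "$FILE" > /dev/null 2>&1
status=$?
if [ "$status" -eq 124 ]; then
    echo "access to $FILE is hanging"
else
    echo "read of $FILE completed (exit $status)"
fi
```

An exit status of 124 from timeout indicates the read was still blocked at the deadline, which matches the hang scenario above.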
Detecting the cause of the problem
If a client node is evicted by an OST or MDS service and also reports a message similar to the following, it
is likely that one of the two problem scenarios described above (especially the scenario concerning the
unmount operation) will occur at some point in the future:
LustreError: 21207:0:(ldlm_resource.c:365:ldlm_namespace_cleanup()) Namespace
OSC_n208_sfsalias-ost188_MNT_client_vib resource refcount 4 after lock cleanup;
forcing cleanup.
In these circumstances, you may also see a message similar to the following in the client logs:
LustreError: 61:0:(llite_lib.c:931:null_if_equal()) ### clearing inode with
ungranted lock ns: OSC_n208_sfsalias-ost188_MNT_client_vib lock: 00000100763764c0/
0x86d202a4044bda23 lrc: 2/1,0 mode: --/PR res: 115087/0 rrc: 3 type: EXT [0->24575]
(req 0->24575) flags: c10 remote: 0x98aa53884a4d12d6 expref: -99 pid: 738
This type of message is useful, as it helps you to identify which particular object on the OST device has the
locking issue. For example, this text:
res: 115087/0
indicates that the problem is with resource (object) 115087.
While it is not always simple to map back from an OST or MDS object to a specific file on the file system,
the information in these messages can be used for detailed analysis of the problem and correlation of client
node and server log messages.
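One way to attempt the object-to-file mapping is to walk the file system and compare each file's stripe objects against the object number from the log. The sketch below uses lfs find and lfs getstripe, the standard Lustre client tools for this; the mount point and object number are illustrative, the assumption that the object ID appears in the second column of getstripe output may not hold on all Lustre versions, and a full-tree scan of this kind can be very slow on a large file system.

```shell
#!/bin/sh
# Search a Lustre mount for the file whose OST object matches the
# "res:" number from the log message (115087 in the example above).
OBJID=${1:-115087}
MNT=${2:-/mnt/lustre}   # hypothetical client mount point

lfs find "$MNT" -type f 2>/dev/null | while read -r f; do
    # getstripe prints one "obdidx objid objid group" row per stripe;
    # match the decimal objid column (assumed to be field 2).
    if lfs getstripe "$f" 2>/dev/null |
        awk -v id="$OBJID" '$2 == id {found=1} END {exit !found}'; then
        echo "object $OBJID appears to belong to: $f"
    fi
done
```

If several files match, the OST index (obdidx) from the getstripe output can be compared against the OST named in the log message to narrow the candidates.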