Troubleshooting
If only one or two client nodes have crashed and a lock they hold is needed, there is a pause of 6
to 20 seconds while the crashed client nodes are evicted. When such an event occurs, Lustre
evicts the clients one at a time. A typical log message in this situation is as follows:
2005/09/30 21:02:53 kern i s5 : LustreError:
4952:0:(ldlm_lockd.c:365:ldlm_failed_ast()) ### blocking AST failed (-110): evicting
client b9929_workspace_9803d79af3@NET_0xac160393_UUID NID 0xac160393 (172.22.3.147)
ns: filter-sfsalias-ost203_UUID lock: 40eabb80/0x37e426c2e3b1ac01 lrc: 2/0,0 mode:
PR/PR res: 79613/0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 10020 remote: 0xc40949dc40637e1f expref: 2 pid: 4940
After an interval of 6 to 20 seconds, the message is repeated for the next crashed client node.
When enough time has elapsed, Lustre proactively evicts nodes, and a message similar to the following is
displayed:
2005/10/27 11:04:00 kern i s14: Lustre: sfsalias-ost200 hasn't heard from
172.22.1.211 in 232 seconds. I think it's dead, and I am evicting it.
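When several clients are being evicted, it can help to tally evictions per client. The following sketch greps the server syslog for eviction messages like the ones above and extracts the evicted client's IP address from the parenthesized NID field; the log path is illustrative, and the exact message layout may vary between Lustre versions, so treat this as a starting point rather than a fixed recipe.

```shell
#!/bin/sh
# Count evictions per client IP from eviction log messages.
# /var/log/messages is a typical syslog location; adjust for your system.
LOG=${1:-/var/log/messages}

grep "evicting" "$LOG" |
    # The client IP appears in parentheses after the NID, e.g. (172.22.3.147);
    # capture the last parenthesized dotted-decimal field on the line.
    sed -n 's/.*(\([0-9.][0-9.]*\)).*/\1/p' |
    sort | uniq -c
```

Each output line shows an eviction count followed by the client IP, which can then be matched against the list of crashed nodes.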
7.3.5 Access to a file hangs (ldlm_namespace_cleanup() messages)
A Lustre problem can cause access to a specific file to hang. This results in one of the following two
scenarios:
• I/O operations on the file hang, but the file can still be accessed using the ls -ls command.
In this scenario, I/O operations hang and can only be killed by a signal (or fail) after a minimum of
approximately 100 seconds.
To determine if this scenario has arisen, enter the cat command for the file. If the command hangs,
press Ctrl/c. The cat command will terminate approximately 100 seconds later (but will not show the
file contents).
• An unmount operation on a Lustre file system causes an LBUG ldlm_lock_cancel() error on the
client node.
In this scenario, unmounting a Lustre file system on a client node leads to an LBUG message being
logged to the /var/log/messages file. This is due to LDLM lock references that cannot be cleaned
up.
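The cat probe described in the first scenario ties up a shell for the duration of the hang. Where the GNU coreutils timeout utility is available, the same check can be scripted so that a hung read is detected and killed automatically; the file path below is illustrative, and the 10-second deadline is only a suggestion (a healthy read should return almost immediately, well under the ~100-second hang threshold).

```shell
#!/bin/sh
# Probe a suspect file for hanging I/O without blocking an interactive shell.
# timeout(1) kills cat if it has not finished within the deadline and
# reports exit status 124 in that case.
FILE=${1:-/mnt/lustre/suspect_file}   # hypothetical path on the affected fs

timeout 10 cat "$FILE" > /dev/null 2>&1
status=$?
if [ "$status" -eq 124 ]; then
    echo "access to $FILE is hanging"
else
    echo "read of $FILE completed (exit $status)"
fi
```

An exit status of 124 from timeout indicates the read was still blocked at the deadline, which matches the hang scenario above.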
Detecting the cause of the problem
If a client node is evicted by an OST or MDS service and also reports a message similar to the following, it
is likely that one of the two problem scenarios described above (especially the scenario concerning the
unmount operation) will occur at some point in the future:
LustreError: 21207:0:(ldlm_resource.c:365:ldlm_namespace_cleanup()) Namespace
OSC_n208_sfsalias-ost188_MNT_client_vib resource refcount 4 after lock cleanup;
forcing cleanup.
In these circumstances, you may also see a message similar to the following in the client logs:
LustreError: 61:0:(llite_lib.c:931:null_if_equal()) ### clearing inode with
ungranted lock ns: OSC_n208_sfsalias-ost188_MNT_client_vib lock: 00000100763764c0/
0x86d202a4044bda23 lrc: 2/1,0 mode: --/PR res: 115087/0 rrc: 3 type: EXT [0->24575]
(req 0->24575) flags: c10 remote: 0x98aa53884a4d12d6 expref: -99 pid: 738
This type of message is useful, as it helps you to identify which particular object on the OST device has the
locking issue. For example, this text:
res: 115087/0
indicates that the problem is with resource (object) 115087.
While it is not always simple to map back from an OST or MDS object to a specific file on the file system,
the information in these messages can be used for detailed analysis of the problem and correlation of client
node and server log messages.
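One way to attempt the object-to-file mapping is to walk the file system and compare each file's stripe objects against the object number from the log. The sketch below uses lfs find and lfs getstripe, the standard Lustre client tools for this; the mount point and object number are illustrative, the assumption that the object ID appears in the second column of getstripe output may not hold on all Lustre versions, and a full-tree scan of this kind can be very slow on a large file system.

```shell
#!/bin/sh
# Search a Lustre mount for the file whose OST object matches the
# "res:" number from the log message (115087 in the example above).
OBJID=${1:-115087}
MNT=${2:-/mnt/lustre}   # hypothetical client mount point

lfs find "$MNT" -type f 2>/dev/null | while read -r f; do
    # getstripe prints one "obdidx objid objid group" row per stripe;
    # match the decimal objid column (assumed to be field 2).
    if lfs getstripe "$f" 2>/dev/null |
        awk -v id="$OBJID" '$2 == id {found=1} END {exit !found}'; then
        echo "object $OBJID appears to belong to: $f"
    fi
done
```

If several files match, the OST index (obdidx) from the getstripe output can be compared against the OST named in the log message to narrow the candidates.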