Discussion:
[RFC PATCH] cifs: Fix possible deadlock with cifs and work queues
Steven Rostedt
2014-03-19 19:12:52 UTC
Permalink
We just had a customer report a deadlock in their 3.8-rt kernel.
Looking into this, it is very possible to have the same deadlock in
mainline in Linus's tree as it stands today.

It is much easier to deadlock in the -rt kernel because reader locks
are serialized, where a down_read() can block another down_read(). But
because rwsems are fair locks, if a writer is waiting, a new reader
will then block. This means that if it is possible for a reader to
deadlock another reader, this can happen if a write comes along and
blocks on a current reader. That will prevent another reader from
running, and if that new reader requires to wake up a reader that owns
the lock, you have your deadlock.

Here's the situation with CIFS and workqueues:

The cifs system has several workqueues used in file.c and other places.
One of them is used for completion of a read and to release the
page_lock which wakes up the reader. There are several other workqueues
that do various other tasks.

A holder of the reader lock can sleep on a page_lock() and expect the
reader workqueue to wake it up (page_unlock()). The reader workqueue
takes no locks so this does not seem to be a problem (but it is).

The other workqueues can take the rwsem for read or for write. But our
issue that we tripped over was that it grabs it for read (remember in
-rt readers are serialized). But this can also happen if a separate
writer is waiting on the lock as that would cause a reader to block on
another reader too.

All the workqueue callbacks are executed on the same workqueue:

queue_work(cifsiod_wq, &rdata->work);
[...]
queue_work(cifsiod_wq, &cfile->oplock_break);

Now if the reader workqueue callback is queued after one of these
workqueues that can take the rwsem, we can hit a deadlock. The
workqueue code looks to be able to prevent deadlocks of these kinds,
but I do not totally understand the workqueue scheduled work structure
and perhaps if the kworker thread structure blocks hard it wont move
works around.

Here's what we see:

rdata->work is scheduled after cfile->oplock_break

CPU0 CPU1
---- ----

do_sync_read()
cifs_strict_readv()
down_read(cinode->lock_sem);
generic_file_aio_read()
__lock_page_killable()
__wait_on_bit_lock()

* BLOCKED *

process_one_work()
cifs_oplock_break()
cifs_has_mand_locks()
down_read(cinode->lock_sem);

* BLOCKED *

[ note, cifs_oplock_break() can
also call cifs_push_locks which takes
the lock with down_write() ]


But we noticed that the rdata->work was queued to run under the same
workqueue task and this work is to wake up the owner of the semaphore.
But because the workqueue task is blocked waiting on that lock, it will
never wake it up.

My question to Tejun is, if we create another workqueue, to add the
rdata->work to, would that prevent the above problem? Or what other
fixes can we do?

This is only compiled tested, we have not given it to our customer yet.

Signed-off-by: Steven Rostedt <***@goodmis.org>

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f613..6656058 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -86,6 +86,7 @@ extern mempool_t *cifs_req_poolp;
extern mempool_t *cifs_mid_poolp;

struct workqueue_struct *cifsiod_wq;
+struct workqueue_struct *cifsiord_wq;

#ifdef CONFIG_CIFS_SMB2
__u8 cifs_client_guid[SMB2_CLIENT_GUID_SIZE];
@@ -1199,9 +1200,15 @@ init_cifs(void)
goto out_clean_proc;
}

+ cifsiord_wq = alloc_workqueue("cifsiord", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
+ if (!cifsiord_wq) {
+ rc = -ENOMEM;
+ goto out_destroy_wq;
+ }
+
rc = cifs_fscache_register();
if (rc)
- goto out_destroy_wq;
+ goto out_destroy_rwq;

rc = cifs_init_inodecache();
if (rc)
@@ -1249,6 +1256,8 @@ out_destroy_inodecache:
cifs_destroy_inodecache();
out_unreg_fscache:
cifs_fscache_unregister();
+out_destroy_rwq:
+ destroy_workqueue(cifsiord_wq);
out_destroy_wq:
destroy_workqueue(cifsiod_wq);
out_clean_proc:
@@ -1273,6 +1282,7 @@ exit_cifs(void)
cifs_destroy_inodecache();
cifs_fscache_unregister();
destroy_workqueue(cifsiod_wq);
+ destroy_workqueue(cifsiord_wq);
cifs_proc_clean();
}

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index c0f3718..75d1941 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1561,6 +1561,7 @@ void cifs_oplock_break(struct work_struct *work);

extern const struct slow_work_ops cifs_oplock_break_ops;
extern struct workqueue_struct *cifsiod_wq;
+extern struct workqueue_struct *cifsiord_wq;

extern mempool_t *cifs_mid_poolp;

diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index f3264bd..ca04a2e 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1571,7 +1571,7 @@ cifs_readv_callback(struct mid_q_entry *mid)
rdata->result = -EIO;
}

- queue_work(cifsiod_wq, &rdata->work);
+ queue_work(cifsiord_wq, &rdata->work);
DeleteMidQEntry(mid);
add_credits(server, 1, 0);
}
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 8603447..b74bf61 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -1742,7 +1742,7 @@ smb2_readv_callback(struct mid_q_entry *mid)
if (rdata->result)
cifs_stats_fail_inc(tcon, SMB2_READ_HE);

- queue_work(cifsiod_wq, &rdata->work);
+ queue_work(cifsiord_wq, &rdata->work);
DeleteMidQEntry(mid);
add_credits(server, credits_received, 0);
}
Peter Zijlstra
2014-03-19 19:34:07 UTC
Permalink
Post by Steven Rostedt
My question to Tejun is, if we create another workqueue, to add the
rdata->work to, would that prevent the above problem? Or what other
fixes can we do?
The way I understand workqueues is that we cannot guarantee concurrency
like this. It tries, but there's no guarantee.

WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
given 511 other blocked works, the described problem will always happen.

Creating another workqueue doesn't actually create more threads.

There is the kthread_work stuff for if you want a guaranteed worker
thread.
Steven Rostedt
2014-03-19 19:43:39 UTC
Permalink
On Wed, 19 Mar 2014 20:34:07 +0100
Post by Peter Zijlstra
Post by Steven Rostedt
My question to Tejun is, if we create another workqueue, to add the
rdata->work to, would that prevent the above problem? Or what other
fixes can we do?
The way I understand workqueues is that we cannot guarantee concurrency
like this. It tries, but there's no guarantee.
WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
given 511 other blocked works, the described problem will always happen.
Creating another workqueue doesn't actually create more threads.
But I noticed this:

Before patch:

# ps aux |grep cifs
root 3119 0.0 0.0 0 0 ? S< 14:17 0:00 [cifsiod]

After patch:

# ps aux |grep cifs
root 1109 0.0 0.0 0 0 ? S< 15:11 0:00 [cifsiod]
root 1111 0.0 0.0 0 0 ? S< 15:11 0:00 [cifsiord]

It looks to me that it does create new threads.

-- Steve
Post by Peter Zijlstra
There is the kthread_work stuff for if you want a guaranteed worker
thread.
Steven Rostedt
2014-03-19 19:46:56 UTC
Permalink
On Wed, 19 Mar 2014 15:43:39 -0400
Post by Steven Rostedt
On Wed, 19 Mar 2014 20:34:07 +0100
Post by Peter Zijlstra
Post by Steven Rostedt
My question to Tejun is, if we create another workqueue, to add the
rdata->work to, would that prevent the above problem? Or what other
fixes can we do?
The way I understand workqueues is that we cannot guarantee concurrency
like this. It tries, but there's no guarantee.
WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
given 511 other blocked works, the described problem will always happen.
Creating another workqueue doesn't actually create more threads.
# ps aux |grep cifs
root 3119 0.0 0.0 0 0 ? S< 14:17 0:00 [cifsiod]
# ps aux |grep cifs
root 1109 0.0 0.0 0 0 ? S< 15:11 0:00 [cifsiod]
root 1111 0.0 0.0 0 0 ? S< 15:11 0:00 [cifsiord]
It looks to me that it does create new threads.
Or is that just the rescuer thread? I can rewrite the patch to use
kthread_work instead too.

-- Steve
Peter Zijlstra
2014-03-19 19:47:29 UTC
Permalink
Post by Steven Rostedt
On Wed, 19 Mar 2014 20:34:07 +0100
Post by Peter Zijlstra
Post by Steven Rostedt
My question to Tejun is, if we create another workqueue, to add the
rdata->work to, would that prevent the above problem? Or what other
fixes can we do?
The way I understand workqueues is that we cannot guarantee concurrency
like this. It tries, but there's no guarantee.
WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
given 511 other blocked works, the described problem will always happen.
Creating another workqueue doesn't actually create more threads.
# ps aux |grep cifs
root 3119 0.0 0.0 0 0 ? S< 14:17 0:00 [cifsiod]
# ps aux |grep cifs
root 1109 0.0 0.0 0 0 ? S< 15:11 0:00 [cifsiod]
root 1111 0.0 0.0 0 0 ? S< 15:11 0:00 [cifsiord]
It looks to me that it does create new threads.
Ah, I think that's because of the MEM_RECLAIM, not sure if that will
normally participate in running works. Its been a long time since I
looked at any of that.

Lets wait for TJ to wake up.
Tejun Heo
2014-03-19 20:28:39 UTC
Permalink
Hello, Steven, Peter.
Post by Peter Zijlstra
The way I understand workqueues is that we cannot guarantee concurrency
like this. It tries, but there's no guarantee.
So, the guarantee is that if a workqueue has WQ_MEM_RECLAIM, it'll
always have at least one worker thread working on it, so workqueues
which may be depended upon during memory reclaim should have the flag
set and must not require more than single level of concurrency to make
forward progress. Workqueues w/o memory reclaim set depend on the
fact that eventually memory will be reclaimed and enough number of
workers necessary to make forward progress will be made available.
Post by Peter Zijlstra
WQ_MAX_ACTIVE seems to be a hard upper limit of concurrent workers. So
given 511 other blocked works, the described problem will always happen.
That actually is per-workqueue limit and workqueue core will try to
create as many workers as possible to satisfy the demanded
concurrency. ie. having two workqueues with the same max_active means
that the total number of workers may reach 2 * max_active; however,
this is no guarantee. If the system is under memory pressure and the
workqueues don't have MEM_RECLAIM set, they may not get any
concurrency until more memory is made available.
Post by Peter Zijlstra
Creating another workqueue doesn't actually create more threads.
It looks like the issue Steven is describing is caused by having a
dependency chain longer than 1 through rwsem in a MEM_RECLAIM
workqueue. Moving the write work items to a separate workqueue breaks
the r-w-r chain and ensures that forward progress can be made with
single level of concurrency on both workqueues, so, yeah, it looks
like the correct fix to me. It it scarily subtle tho and quite likely
to present in other code paths too. :(

Thanks.
--
tejun
Jeffrey Layton
2014-03-20 19:28:33 UTC
Permalink
On Wed, 19 Mar 2014 15:12:52 -0400
Post by Steven Rostedt
We just had a customer report a deadlock in their 3.8-rt kernel.
Looking into this, it is very possible to have the same deadlock in
mainline in Linus's tree as it stands today.
It is much easier to deadlock in the -rt kernel because reader locks
are serialized, where a down_read() can block another down_read(). But
because rwsems are fair locks, if a writer is waiting, a new reader
will then block. This means that if it is possible for a reader to
deadlock another reader, this can happen if a write comes along and
blocks on a current reader. That will prevent another reader from
running, and if that new reader requires to wake up a reader that owns
the lock, you have your deadlock.
The cifs system has several workqueues used in file.c and other
places. One of them is used for completion of a read and to release
the page_lock which wakes up the reader. There are several other
workqueues that do various other tasks.
A holder of the reader lock can sleep on a page_lock() and expect the
reader workqueue to wake it up (page_unlock()). The reader workqueue
takes no locks so this does not seem to be a problem (but it is).
The other workqueues can take the rwsem for read or for write. But our
issue that we tripped over was that it grabs it for read (remember in
-rt readers are serialized). But this can also happen if a separate
writer is waiting on the lock as that would cause a reader to block on
another reader too.
queue_work(cifsiod_wq, &rdata->work);
[...]
queue_work(cifsiod_wq, &cfile->oplock_break);
Now if the reader workqueue callback is queued after one of these
workqueues that can take the rwsem, we can hit a deadlock. The
workqueue code looks to be able to prevent deadlocks of these kinds,
but I do not totally understand the workqueue scheduled work structure
and perhaps if the kworker thread structure blocks hard it wont move
works around.
rdata->work is scheduled after cfile->oplock_break
CPU0 CPU1
---- ----
do_sync_read()
cifs_strict_readv()
down_read(cinode->lock_sem);
generic_file_aio_read()
__lock_page_killable()
__wait_on_bit_lock()
* BLOCKED *
process_one_work()
cifs_oplock_break()
cifs_has_mand_locks()
down_read(cinode->lock_sem);
* BLOCKED *
[ note,
cifs_oplock_break() can also call cifs_push_locks which takes
the lock with
down_write() ]
But we noticed that the rdata->work was queued to run under the same
workqueue task and this work is to wake up the owner of the semaphore.
But because the workqueue task is blocked waiting on that lock, it
will never wake it up.
My question to Tejun is, if we create another workqueue, to add the
rdata->work to, would that prevent the above problem? Or what other
fixes can we do?
This is only compiled tested, we have not given it to our customer yet.
Cc'ing Pavel since he wrote most of the lock pushing code...

Nice analysis! I think eventually we'll need to overhaul this code not
to use rw semaphores, but that's going to take some redesign. (Wonder
if we could change it to use seqlocks or something?)

Out of curiousity, does this eventually time out and unwedge itself?
Usually when the server doesn't get a response to an oplock break in
around a minute or so it gives up and allows the thing that caused the
oplock break to proceed anyway. Not great for performance but it out to
eventually make progress due to that.

In any case, this looks like a reasonable fix for now, but I suspect you
can hit similar problems in the write codepath too. What may be best is
turn this around and queue the oplock break to the new workqueue
instead of the read completion job.
Post by Steven Rostedt
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f613..6656058 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -86,6 +86,7 @@ extern mempool_t *cifs_req_poolp;
extern mempool_t *cifs_mid_poolp;
struct workqueue_struct *cifsiod_wq;
+struct workqueue_struct *cifsiord_wq;
#ifdef CONFIG_CIFS_SMB2
__u8 cifs_client_guid[SMB2_CLIENT_GUID_SIZE];
@@ -1199,9 +1200,15 @@ init_cifs(void)
goto out_clean_proc;
}
+ cifsiord_wq = alloc_workqueue("cifsiord",
WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
+ if (!cifsiord_wq) {
+ rc = -ENOMEM;
+ goto out_destroy_wq;
+ }
+
rc = cifs_fscache_register();
if (rc)
- goto out_destroy_wq;
+ goto out_destroy_rwq;
rc = cifs_init_inodecache();
if (rc)
cifs_destroy_inodecache();
cifs_fscache_unregister();
+ destroy_workqueue(cifsiord_wq);
destroy_workqueue(cifsiod_wq);
@@ -1273,6 +1282,7 @@ exit_cifs(void)
cifs_destroy_inodecache();
cifs_fscache_unregister();
destroy_workqueue(cifsiod_wq);
+ destroy_workqueue(cifsiord_wq);
cifs_proc_clean();
}
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index c0f3718..75d1941 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1561,6 +1561,7 @@ void cifs_oplock_break(struct work_struct *work);
extern const struct slow_work_ops cifs_oplock_break_ops;
extern struct workqueue_struct *cifsiod_wq;
+extern struct workqueue_struct *cifsiord_wq;
extern mempool_t *cifs_mid_poolp;
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index f3264bd..ca04a2e 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1571,7 +1571,7 @@ cifs_readv_callback(struct mid_q_entry *mid)
rdata->result = -EIO;
}
- queue_work(cifsiod_wq, &rdata->work);
+ queue_work(cifsiord_wq, &rdata->work);
DeleteMidQEntry(mid);
add_credits(server, 1, 0);
}
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 8603447..b74bf61 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -1742,7 +1742,7 @@ smb2_readv_callback(struct mid_q_entry *mid)
if (rdata->result)
cifs_stats_fail_inc(tcon, SMB2_READ_HE);
- queue_work(cifsiod_wq, &rdata->work);
+ queue_work(cifsiord_wq, &rdata->work);
DeleteMidQEntry(mid);
add_credits(server, credits_received, 0);
}
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Steven Rostedt
2014-03-20 20:57:03 UTC
Permalink
On Thu, 20 Mar 2014 15:28:33 -0400
Post by Jeffrey Layton
Nice analysis! I think eventually we'll need to overhaul this code not
Note, Ulrich Obergfell helped a bit in the initial analysis. He found
from a customer core dump that the kworker thread was blocked on the
cinode->lock_sem, and the reader was blocked as well. That was enough
for me to find where the problem laid.
Post by Jeffrey Layton
to use rw semaphores, but that's going to take some redesign. (Wonder
if we could change it to use seqlocks or something?)
Out of curiousity, does this eventually time out and unwedge itself?
Usually when the server doesn't get a response to an oplock break in
around a minute or so it gives up and allows the thing that caused the
oplock break to proceed anyway. Not great for performance but it out to
eventually make progress due to that.
No, I believe it's hard locked. Nothing is going to wake up the oplock
break if it is blocked on a down_read(). Only the release of the rwsem
will do that. It's the subtle way the kworker threads are done.
Post by Jeffrey Layton
In any case, this looks like a reasonable fix for now, but I suspect you
can hit similar problems in the write codepath too. What may be best is
turn this around and queue the oplock break to the new workqueue
instead of the read completion job.
Or perhaps give both the read and write their own workqueues? We have
to look at all the work queue handlers, and be careful about any users
that take the lock_sem, and separate them out.

-- Steve
Jeff Layton
2014-03-20 21:02:39 UTC
Permalink
On Thu, 20 Mar 2014 16:57:03 -0400
Post by Steven Rostedt
On Thu, 20 Mar 2014 15:28:33 -0400
Post by Jeffrey Layton
Nice analysis! I think eventually we'll need to overhaul this code not
Note, Ulrich Obergfell helped a bit in the initial analysis. He found
from a customer core dump that the kworker thread was blocked on the
cinode->lock_sem, and the reader was blocked as well. That was enough
for me to find where the problem laid.
Kudos to Uli, then ;)
Post by Steven Rostedt
Post by Jeffrey Layton
to use rw semaphores, but that's going to take some redesign. (Wonder
if we could change it to use seqlocks or something?)
Out of curiousity, does this eventually time out and unwedge itself?
Usually when the server doesn't get a response to an oplock break in
around a minute or so it gives up and allows the thing that caused the
oplock break to proceed anyway. Not great for performance but it out to
eventually make progress due to that.
No, I believe it's hard locked. Nothing is going to wake up the oplock
break if it is blocked on a down_read(). Only the release of the rwsem
will do that. It's the subtle way the kworker threads are done.
Eventually the server should just allow the read to complete even if
the client doesn't respond to the oplock break. It has to since clients
can suddenly drop off the net while holding an oplock. That should
allow everything to unwedge eventually (though it may take a while).

If that's not happening then I'd be curious as to why...
Post by Steven Rostedt
Post by Jeffrey Layton
In any case, this looks like a reasonable fix for now, but I suspect you
can hit similar problems in the write codepath too. What may be best is
turn this around and queue the oplock break to the new workqueue
instead of the read completion job.
Or perhaps give both the read and write their own workqueues? We have
to look at all the work queue handlers, and be careful about any users
that take the lock_sem, and separate them out.
Yeah, I haven't looked closely yet but I'm fairly sure that you could
hit the same situation in the write codepath as well. Whether adding
more workqueues will really help, I'm not sure of yet...
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Steven Rostedt
2014-03-21 02:23:43 UTC
Permalink
On Thu, 20 Mar 2014 17:02:39 -0400
Post by Jeff Layton
Eventually the server should just allow the read to complete even if
the client doesn't respond to the oplock break. It has to since clients
can suddenly drop off the net while holding an oplock. That should
allow everything to unwedge eventually (though it may take a while).
If that's not happening then I'd be curious as to why...
The problem is that the data is being filled in the page and the reader
is waiting for the page lock to be released. The kworker for the reader
will issue the complete() and unlock the page to wake up the reader.

But because the other workqueue callback calls down_read(), and there
can be a down_write() waiting for the reader to finish, this
down_read() will block on the lock as well (rwsems are fair locks).
This blocks the other workqueue callback from issuing the complete and
page_unlock() that will wake up the reader that is holding the rwsem
with down_read().

DEADLOCK.


-- Steve
Pavel Shilovsky
2014-03-21 08:32:12 UTC
Permalink
Post by Steven Rostedt
On Thu, 20 Mar 2014 17:02:39 -0400
Post by Jeff Layton
Eventually the server should just allow the read to complete even if
the client doesn't respond to the oplock break. It has to since clients
can suddenly drop off the net while holding an oplock. That should
allow everything to unwedge eventually (though it may take a while).
If that's not happening then I'd be curious as to why...
The problem is that the data is being filled in the page and the reader
is waiting for the page lock to be released. The kworker for the reader
will issue the complete() and unlock the page to wake up the reader.
But because the other workqueue callback calls down_read(), and there
can be a down_write() waiting for the reader to finish, this
down_read() will block on the lock as well (rwsems are fair locks).
This blocks the other workqueue callback from issuing the complete and
page_unlock() that will wake up the reader that is holding the rwsem
with down_read().
DEADLOCK.
Thank you for reporting and clarifying the issue!

Read and write codepaths both obtain lock_sem for read and then wait
for cifsiod_wq to complete and release lock_sem. They don't do any
lock_sem operations inside their work task queued to cifsiod_wq. But
oplock code can obtain/release lock_sem in its work task. So, that's
why I agree with Jeff and suggest to move the oplock code to a
different work queue (cifsioopd_wq?) but leave read and write
codepaths use cifsiod_wq.
--
Best regards,
Pavel Shilovsky.
Steven Rostedt
2014-03-21 12:17:06 UTC
Permalink
On Fri, 21 Mar 2014 12:32:12 +0400
Post by Pavel Shilovsky
Read and write codepaths both obtain lock_sem for read and then wait
for cifsiod_wq to complete and release lock_sem. They don't do any
lock_sem operations inside their work task queued to cifsiod_wq. But
oplock code can obtain/release lock_sem in its work task. So, that's
why I agree with Jeff and suggest to move the oplock code to a
different work queue (cifsioopd_wq?) but leave read and write
codepaths use cifsiod_wq.
OK, how about I submit a second patch that moves the reader and writer
to its own "safe" workqueue?

-- Steve
Jeff Layton
2014-03-21 12:41:28 UTC
Permalink
On Fri, 21 Mar 2014 08:17:06 -0400
Post by Steven Rostedt
On Fri, 21 Mar 2014 12:32:12 +0400
Post by Pavel Shilovsky
Read and write codepaths both obtain lock_sem for read and then wait
for cifsiod_wq to complete and release lock_sem. They don't do any
lock_sem operations inside their work task queued to cifsiod_wq. But
oplock code can obtain/release lock_sem in its work task. So, that's
why I agree with Jeff and suggest to move the oplock code to a
different work queue (cifsioopd_wq?) but leave read and write
codepaths use cifsiod_wq.
OK, how about I submit a second patch that moves the reader and writer
to its own "safe" workqueue?
-- Steve
That'd probably work fine too. The main point is to make sure oplock
breaks run on a different workqueue from where read or write completion
jobs run since they are operating on the lock_sem. The other jobs that
get queued to cifsiod_wq don't touch the lock_sem and shouldn't be a
problem.
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Steven Rostedt
2014-03-21 12:54:28 UTC
Permalink
On Fri, 21 Mar 2014 08:41:28 -0400
Post by Jeff Layton
That'd probably work fine too. The main point is to make sure oplock
breaks run on a different workqueue from where read or write completion
jobs run since they are operating on the lock_sem. The other jobs that
get queued to cifsiod_wq don't touch the lock_sem and shouldn't be a
problem.
OK, I'll take a look at them, and maybe I'll just move the oplock
workqueue. I think you are correct and it may be best to move the one
that takes the lock. Keep that one separate and that will keep the
others from being blocked by it.

Thanks, I'll write something up in a bit.

-- Steve
Steven Rostedt
2014-03-21 15:07:19 UTC
Permalink
We just had a customer report a deadlock in their 3.8-rt kernel.
Looking into this, it is very possible to have the same deadlock in
mainline in Linus's tree as it stands today.

It is much easier to deadlock in the -rt kernel because reader locks
are serialized, where a down_read() can block another down_read(). But
because rwsems are fair locks, if a writer is waiting, a new reader
will then block. This means that if it is possible for a reader to
deadlock another reader, this can happen if a write comes along and
blocks on a current reader. That will prevent another reader from
running, and if that new reader requires to wake up a reader that owns
the lock, you have your deadlock.

Here's the situation with CIFS and workqueues:

The cifs system has several workqueues used in file.c and other places.
One of them is used for completion of a read and to release the
page_lock which wakes up the reader. There are several other workqueues
that do various other tasks.

A holder of the reader lock can sleep on a page_lock() and expect the
reader workqueue to wake it up (page_unlock()). The reader workqueue
takes no locks so this does not seem to be a problem (but it is).

The other workqueues can take the rwsem for read or for write. But our
issue that we tripped over was that it grabs it for read (remember in
-rt readers are serialized). But this can also happen if a separate
writer is waiting on the lock as that would cause a reader to block on
another reader too.

All the workqueue callbacks are executed on the same workqueue:

queue_work(cifsiod_wq, &rdata->work);
[...]
queue_work(cifsiod_wq, &cfile->oplock_break);

Now if the reader workqueue callback is queued after one of these
workqueues that can take the rwsem, we can hit a deadlock. The
workqueue code looks to be able to prevent deadlocks of these kinds,
but I do not totally understand the workqueue scheduled work structure
and perhaps if the kworker thread structure blocks hard it wont move
works around.

Here's what we see:

rdata->work is scheduled after cfile->oplock_break

CPU0 CPU1
---- ----

do_sync_read()
cifs_strict_readv()
down_read(cinode->lock_sem);
generic_file_aio_read()
__lock_page_killable()
__wait_on_bit_lock()

* BLOCKED *

process_one_work()
cifs_oplock_break()
cifs_has_mand_locks()
down_read(cinode->lock_sem);

* BLOCKED *

[ note, cifs_oplock_break() can
also call cifs_push_locks which takes
the lock with down_write() ]

The key to remember is that the work to wake up the lock owner is queued
behind the oplock work which is blocked on the same lock.

We noticed that the rdata->work was queued to run under the same
workqueue task and this work is to wake up the owner of the semaphore.
But because the workqueue task is blocked waiting on that lock, it will
never wake it up.

By adding another workqueue that runs all the work that might take a mutex
we should be able to avoid this deadlock.

Link: http://lkml.kernel.org/r/***@gandalf.local.home

Signed-off-by: Steven Rostedt <***@goodmis.org>
---
fs/cifs/cifsfs.c | 17 ++++++++++++++++-
fs/cifs/cifsglob.h | 1 +
fs/cifs/misc.c | 2 +-
fs/cifs/smb2misc.c | 6 +++---
4 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f613..b0761c8 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -86,6 +86,12 @@ extern mempool_t *cifs_req_poolp;
extern mempool_t *cifs_mid_poolp;

struct workqueue_struct *cifsiod_wq;
+/*
+ * The oplock workqueue must be separate to prevent it from blocking
+ * other work that is queued. Work that requires to grab mutex locks
+ * must use the 'l' version.
+ */
+struct workqueue_struct *cifsiold_wq;

#ifdef CONFIG_CIFS_SMB2
__u8 cifs_client_guid[SMB2_CLIENT_GUID_SIZE];
@@ -1199,9 +1205,15 @@ init_cifs(void)
goto out_clean_proc;
}

+ cifsiold_wq = alloc_workqueue("cifsiold", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
+ if (!cifsiold_wq) {
+ rc = -ENOMEM;
+ goto out_destroy_wq;
+ }
+
rc = cifs_fscache_register();
if (rc)
- goto out_destroy_wq;
+ goto out_destroy_rwq;

rc = cifs_init_inodecache();
if (rc)
@@ -1249,6 +1261,8 @@ out_destroy_inodecache:
cifs_destroy_inodecache();
out_unreg_fscache:
cifs_fscache_unregister();
+out_destroy_rwq:
+ destroy_workqueue(cifsiold_wq);
out_destroy_wq:
destroy_workqueue(cifsiod_wq);
out_clean_proc:
@@ -1273,6 +1287,7 @@ exit_cifs(void)
cifs_destroy_inodecache();
cifs_fscache_unregister();
destroy_workqueue(cifsiod_wq);
+ destroy_workqueue(cifsiold_wq);
cifs_proc_clean();
}

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index c0f3718..6c2b5c8 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1561,6 +1561,7 @@ void cifs_oplock_break(struct work_struct *work);

extern const struct slow_work_ops cifs_oplock_break_ops;
extern struct workqueue_struct *cifsiod_wq;
+extern struct workqueue_struct *cifsiold_wq;

extern mempool_t *cifs_mid_poolp;

diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
index 2f9f379..1bc94e9 100644
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -468,7 +468,7 @@ is_valid_oplock_break(char *buffer, struct TCP_Server_Info *srv)

cifs_set_oplock_level(pCifsInode,
pSMB->OplockLevel ? OPLOCK_READ : 0);
- queue_work(cifsiod_wq,
+ queue_work(cifsiold_wq,
&netfile->oplock_break);
netfile->oplock_break_cancelled = false;

diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c
index fb39662..ffea93f 100644
--- a/fs/cifs/smb2misc.c
+++ b/fs/cifs/smb2misc.c
@@ -447,7 +447,7 @@ smb2_tcon_has_lease(struct cifs_tcon *tcon, struct smb2_lease_break *rsp,
else
cfile->oplock_break_cancelled = true;

- queue_work(cifsiod_wq, &cfile->oplock_break);
+ queue_work(cifsiold_wq, &cfile->oplock_break);
kfree(lw);
return true;
}
@@ -463,7 +463,7 @@ smb2_tcon_has_lease(struct cifs_tcon *tcon, struct smb2_lease_break *rsp,
memcpy(lw->lease_key, open->lease_key,
SMB2_LEASE_KEY_SIZE);
lw->tlink = cifs_get_tlink(open->tlink);
- queue_work(cifsiod_wq, &lw->lease_break);
+ queue_work(cifsiold_wq, &lw->lease_break);
}

cifs_dbg(FYI, "found in the pending open list\n");
@@ -579,7 +579,7 @@ smb2_is_valid_oplock_break(char *buffer, struct TCP_Server_Info *server)
rsp->OplockLevel ? SMB2_OPLOCK_LEVEL_II : 0,
0, NULL);

- queue_work(cifsiod_wq, &cfile->oplock_break);
+ queue_work(cifsiold_wq, &cfile->oplock_break);

spin_unlock(&cifs_file_list_lock);
spin_unlock(&cifs_tcp_ses_lock);
--
1.8.1.4
Shirish Pargaonkar
2014-03-23 05:57:07 UTC
Permalink
Just a thought, although leasing code does not cause such deadlocks with rwsems,
perhaps leasing and oplock code can reside on one workqueue!
Post by Steven Rostedt
We just had a customer report a deadlock in their 3.8-rt kernel.
Looking into this, it is very possible to have the same deadlock in
mainline in Linus's tree as it stands today.
It is much easier to deadlock in the -rt kernel because reader locks
are serialized, where a down_read() can block another down_read(). But
because rwsems are fair locks, if a writer is waiting, a new reader
will then block. This means that if it is possible for a reader to
deadlock another reader, this can happen if a write comes along and
blocks on a current reader. That will prevent another reader from
running, and if that new reader requires to wake up a reader that owns
the lock, you have your deadlock.
The cifs system has several workqueues used in file.c and other places.
One of them is used for completion of a read and to release the
page_lock which wakes up the reader. There are several other workqueues
that do various other tasks.
A holder of the reader lock can sleep on a page_lock() and expect the
reader workqueue to wake it up (page_unlock()). The reader workqueue
takes no locks so this does not seem to be a problem (but it is).
The other workqueues can take the rwsem for read or for write. But our
issue that we tripped over was that it grabs it for read (remember in
-rt readers are serialized). But this can also happen if a separate
writer is waiting on the lock as that would cause a reader to block on
another reader too.
queue_work(cifsiod_wq, &rdata->work);
[...]
queue_work(cifsiod_wq, &cfile->oplock_break);
Now if the reader workqueue callback is queued after one of these
workqueues that can take the rwsem, we can hit a deadlock. The
workqueue code looks to be able to prevent deadlocks of these kinds,
but I do not totally understand the workqueue scheduled work structure
and perhaps if the kworker thread structure blocks hard it wont move
works around.
rdata->work is scheduled after cfile->oplock_break
CPU0 CPU1
---- ----
do_sync_read()
cifs_strict_readv()
down_read(cinode->lock_sem);
generic_file_aio_read()
__lock_page_killable()
__wait_on_bit_lock()
* BLOCKED *
process_one_work()
cifs_oplock_break()
cifs_has_mand_locks()
down_read(cinode->lock_sem);
* BLOCKED *
[ note, cifs_oplock_break() can
also call cifs_push_locks which takes
the lock with down_write() ]
The key to remember is that the work to wake up the lock owner is queued
behind the oplock work which is blocked on the same lock.
We noticed that the rdata->work was queued to run under the same
workqueue task and this work is to wake up the owner of the semaphore.
But because the workqueue task is blocked waiting on that lock, it will
never wake it up.
By adding another workqueue that runs all the work that might take a mutex
we should be able to avoid this deadlock.
---
fs/cifs/cifsfs.c | 17 ++++++++++++++++-
fs/cifs/cifsglob.h | 1 +
fs/cifs/misc.c | 2 +-
fs/cifs/smb2misc.c | 6 +++---
4 files changed, 21 insertions(+), 5 deletions(-)
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f613..b0761c8 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -86,6 +86,12 @@ extern mempool_t *cifs_req_poolp;
extern mempool_t *cifs_mid_poolp;
struct workqueue_struct *cifsiod_wq;
+/*
+ * The oplock workqueue must be separate to prevent it from blocking
+ * other work that is queued. Work that requires to grab mutex locks
+ * must use the 'l' version.
+ */
+struct workqueue_struct *cifsiold_wq;
#ifdef CONFIG_CIFS_SMB2
__u8 cifs_client_guid[SMB2_CLIENT_GUID_SIZE];
@@ -1199,9 +1205,15 @@ init_cifs(void)
goto out_clean_proc;
}
+ cifsiold_wq = alloc_workqueue("cifsiold", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
+ if (!cifsiold_wq) {
+ rc = -ENOMEM;
+ goto out_destroy_wq;
+ }
+
rc = cifs_fscache_register();
if (rc)
- goto out_destroy_wq;
+ goto out_destroy_rwq;
rc = cifs_init_inodecache();
if (rc)
cifs_destroy_inodecache();
cifs_fscache_unregister();
+ destroy_workqueue(cifsiold_wq);
destroy_workqueue(cifsiod_wq);
@@ -1273,6 +1287,7 @@ exit_cifs(void)
cifs_destroy_inodecache();
cifs_fscache_unregister();
destroy_workqueue(cifsiod_wq);
+ destroy_workqueue(cifsiold_wq);
cifs_proc_clean();
}
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index c0f3718..6c2b5c8 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1561,6 +1561,7 @@ void cifs_oplock_break(struct work_struct *work);
extern const struct slow_work_ops cifs_oplock_break_ops;
extern struct workqueue_struct *cifsiod_wq;
+extern struct workqueue_struct *cifsiold_wq;
extern mempool_t *cifs_mid_poolp;
diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
index 2f9f379..1bc94e9 100644
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -468,7 +468,7 @@ is_valid_oplock_break(char *buffer, struct TCP_Server_Info *srv)
cifs_set_oplock_level(pCifsInode,
pSMB->OplockLevel ? OPLOCK_READ : 0);
- queue_work(cifsiod_wq,
+ queue_work(cifsiold_wq,
&netfile->oplock_break);
netfile->oplock_break_cancelled = false;
diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c
index fb39662..ffea93f 100644
--- a/fs/cifs/smb2misc.c
+++ b/fs/cifs/smb2misc.c
@@ -447,7 +447,7 @@ smb2_tcon_has_lease(struct cifs_tcon *tcon, struct smb2_lease_break *rsp,
else
cfile->oplock_break_cancelled = true;
- queue_work(cifsiod_wq, &cfile->oplock_break);
+ queue_work(cifsiold_wq, &cfile->oplock_break);
kfree(lw);
return true;
}
@@ -463,7 +463,7 @@ smb2_tcon_has_lease(struct cifs_tcon *tcon, struct smb2_lease_break *rsp,
memcpy(lw->lease_key, open->lease_key,
SMB2_LEASE_KEY_SIZE);
lw->tlink = cifs_get_tlink(open->tlink);
- queue_work(cifsiod_wq, &lw->lease_break);
+ queue_work(cifsiold_wq, &lw->lease_break);
}
cifs_dbg(FYI, "found in the pending open list\n");
@@ -579,7 +579,7 @@ smb2_is_valid_oplock_break(char *buffer, struct TCP_Server_Info *server)
rsp->OplockLevel ? SMB2_OPLOCK_LEVEL_II : 0,
0, NULL);
- queue_work(cifsiod_wq, &cfile->oplock_break);
+ queue_work(cifsiold_wq, &cfile->oplock_break);
spin_unlock(&cifs_file_list_lock);
spin_unlock(&cifs_tcp_ses_lock);
--
1.8.1.4
--
To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jeff Layton
2014-03-21 11:59:56 UTC
Permalink
On Fri, 21 Mar 2014 12:32:12 +0400
Post by Pavel Shilovsky
Post by Steven Rostedt
On Thu, 20 Mar 2014 17:02:39 -0400
Post by Jeff Layton
Eventually the server should just allow the read to complete even if
the client doesn't respond to the oplock break. It has to since clients
can suddenly drop off the net while holding an oplock. That should
allow everything to unwedge eventually (though it may take a while).
If that's not happening then I'd be curious as to why...
The problem is that the data is being filled in the page and the reader
is waiting for the page lock to be released. The kworker for the reader
will issue the complete() and unlock the page to wake up the reader.
But because the other workqueue callback calls down_read(), and there
can be a down_write() waiting for the reader to finish, this
down_read() will block on the lock as well (rwsems are fair locks).
This blocks the other workqueue callback from issuing the complete and
page_unlock() that will wake up the reader that is holding the rwsem
with down_read().
DEADLOCK.
Thank you for reporting and clarifying the issue!
Read and write codepaths both obtain lock_sem for read and then wait
for cifsiod_wq to complete and release lock_sem. They don't do any
lock_sem operations inside their work task queued to cifsiod_wq. But
oplock code can obtain/release lock_sem in its work task. So, that's
why I agree with Jeff and suggest to move the oplock code to a
different work queue (cifsioopd_wq?) but leave read and write
codepaths use cifsiod_wq.
Yes, thanks for the clarification. I missed the fact that a 3rd task
was blocked on a down_write. I think that fix will help prevent the
full-blown deadlock, but it'll still be susceptible to long stalls in
some situations.

In Steven's example the work on CPU0 is blocked on a SMB_READ. That
read may not be completing because the server has issued an oplock
break to the client and is waiting on the response. That oplock break
response is blocked because it's blocked on the down_write.

In that situation, what breaks the deadlock is that the server
eventually gives up waiting on the oplock break response, but that can
take on the order of a minute.

So yeah, I think we should add a dedicated workqueue for oplock breaks
as an interim fix, but I also think we need to consider revamping the
lock pushing code such that oplock breaks are not subject to being
blocked like that.
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Jeff Layton
2014-03-20 23:53:46 UTC
Permalink
On Wed, 19 Mar 2014 15:12:52 -0400
Post by Steven Rostedt
We just had a customer report a deadlock in their 3.8-rt kernel.
Looking into this, it is very possible to have the same deadlock in
mainline in Linus's tree as it stands today.
It is much easier to deadlock in the -rt kernel because reader locks
are serialized, where a down_read() can block another down_read(). But
because rwsems are fair locks, if a writer is waiting, a new reader
will then block. This means that if it is possible for a reader to
deadlock another reader, this can happen if a write comes along and
blocks on a current reader. That will prevent another reader from
running, and if that new reader requires to wake up a reader that owns
the lock, you have your deadlock.
The cifs system has several workqueues used in file.c and other places.
One of them is used for completion of a read and to release the
page_lock which wakes up the reader. There are several other workqueues
that do various other tasks.
A holder of the reader lock can sleep on a page_lock() and expect the
reader workqueue to wake it up (page_unlock()). The reader workqueue
takes no locks so this does not seem to be a problem (but it is).
The other workqueues can take the rwsem for read or for write. But our
issue that we tripped over was that it grabs it for read (remember in
-rt readers are serialized). But this can also happen if a separate
writer is waiting on the lock as that would cause a reader to block on
another reader too.
queue_work(cifsiod_wq, &rdata->work);
[...]
queue_work(cifsiod_wq, &cfile->oplock_break);
Now if the reader workqueue callback is queued after one of these
workqueues that can take the rwsem, we can hit a deadlock. The
workqueue code looks to be able to prevent deadlocks of these kinds,
but I do not totally understand the workqueue scheduled work structure
and perhaps if the kworker thread structure blocks hard it wont move
works around.
rdata->work is scheduled after cfile->oplock_break
CPU0 CPU1
---- ----
do_sync_read()
cifs_strict_readv()
down_read(cinode->lock_sem);
generic_file_aio_read()
__lock_page_killable()
__wait_on_bit_lock()
* BLOCKED *
process_one_work()
cifs_oplock_break()
cifs_has_mand_locks()
down_read(cinode->lock_sem);
* BLOCKED *
[ note, cifs_oplock_break() can
also call cifs_push_locks which takes
the lock with down_write() ]
Wait...why does the work running on CPU1 end up blocked on down_read()?
Is it really getting stuck on the down_write you mention?
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Steven Rostedt
2014-03-21 02:19:13 UTC
Permalink
On Thu, 20 Mar 2014 19:53:46 -0400
Post by Jeff Layton
Wait...why does the work running on CPU1 end up blocked on down_read()?
Is it really getting stuck on the down_write you mention?
rwsems are fair locks. Readers will not block on a reader lock unless
there's a writer waiting. That's the key. As soon as a writer blocks on
a lock that is held by a reader (or multiple readers), new readers
coming in will also block to let the writer get a chance. Otherwise, it
is a unfair lock and the readers can starve the writer.

But people tend to forget that a waiting writer causes readers to block
on each other, and if the reader locks can deadlock each other, they
will deadlock with a writer waiting.

-- Steve
Loading...