Discussion:
Question regarding CIFS cache=loose behavior.
Tetsuo Handa
2014-03-20 12:03:03 UTC
Permalink
Is the behavior described below ( "read() returns 0 unless stat() detects the
growth of file size" and "read() returns non-0 when stat() detects the growth
of file size but the data actually read is '\0'" ) a feature of cache=loose
option?

Steps to reproduce:

---------- Start of writer1.c ----------
#include <unistd.h>
int main(int argc, char *argv[])
{
while (write(1, ".", 1) == 1)
sleep(1);
return 0;
}
---------- End of writer1.c ----------

---------- Start of reader1.c ----------
#include <unistd.h>
int main(int argc, char *argv[])
{
while (1) {
char c;
if (read(0, &c, 1) == 1) {
if (c)
write(1, &c, 1);
else
write(1, "!", 1);
} else {
sleep(1);
}
}
return 0;
}
---------- End of reader1.c ----------

---------- Start of reader2.c ----------
#include <sys/stat.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
while (1) {
struct stat buf;
char c;
fstat(0, &buf);
if (read(0, &c, 1) == 1) {
if (c)
write(1, &c, 1);
else
write(1, "!", 1);
} else {
sleep(1);
}
}
return 0;
}
---------- End of reader2.c ----------

Run the following commands on the Samba server side.

$ cc -Wall -O3 -o writer1 writer1.c
$ ./writer1 > shared_file

Run the following commands on the Samba client side.

$ cc -Wall -O3 -o reader1 reader1.c
$ strace -tt -o /tmp/log1 ./reader1 < shared_file
.......... (<= Makes no progress by default)

The shared_file shown above is a regular file shared via CIFS mounted with
cache=loose option. (e.g. "mount -t cifs -o cache=loose //server/path /mnt" )

I expect that the reader1 process reads one byte whenever the writer1
process writes one byte. But we can observe via /tmp/log1 that the read()
continues returning 0 until the growth of file size of shared_file is
recognized by stat() using programs like /bin/ls and /usr/bin/stat .

Then, terminate the reader1 process using Ctrl-C and run the following
commands on the Samba client side.

$ cc -Wall -O3 -o reader2 reader2.c
$ strace -tt -o /tmp/log2 ./reader2 < shared_file
......................!!!!!! (<= Makes progress with wrong data)

We can observe via /tmp/log2 that the read() returns 1 because fstat()
detects the growth of file size of shared_file . However, the data
returned by read() is not '.' but '\0'.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jeff Layton
2014-03-20 20:19:43 UTC
Permalink
On Thu, 20 Mar 2014 21:03:03 +0900
Post by Tetsuo Handa
Is the behavior described below ( "read() returns 0 unless stat() detects the
growth of file size" and "read() returns non-0 when stat() detects the growth
of file size but the data actually read is '\0'" ) a feature of cache=loose
option?
---------- Start of writer1.c ----------
#include <unistd.h>
int main(int argc, char *argv[])
{
while (write(1, ".", 1) == 1)
sleep(1);
return 0;
}
---------- End of writer1.c ----------
---------- Start of reader1.c ----------
#include <unistd.h>
int main(int argc, char *argv[])
{
while (1) {
char c;
if (read(0, &c, 1) == 1) {
if (c)
write(1, &c, 1);
else
write(1, "!", 1);
} else {
sleep(1);
}
}
return 0;
}
---------- End of reader1.c ----------
---------- Start of reader2.c ----------
#include <sys/stat.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
while (1) {
struct stat buf;
char c;
fstat(0, &buf);
if (read(0, &c, 1) == 1) {
if (c)
write(1, &c, 1);
else
write(1, "!", 1);
} else {
sleep(1);
}
}
return 0;
}
---------- End of reader2.c ----------
Run the following commands on the Samba server side.
$ cc -Wall -O3 -o writer1 writer1.c
$ ./writer1 > shared_file
Run the following commands on the Samba client side.
$ cc -Wall -O3 -o reader1 reader1.c
$ strace -tt -o /tmp/log1 ./reader1 < shared_file
.......... (<= Makes no progress by default)
The shared_file shown above is a regular file shared via CIFS mounted with
cache=loose option. (e.g. "mount -t cifs -o cache=loose //server/path /mnt" )
I expect that the reader1 process reads one byte whenever the writer1
process writes one byte. But we can observe via /tmp/log1 that the read()
continues returning 0 until the growth of file size of shared_file is
recognized by stat() using programs like /bin/ls and /usr/bin/stat .
Then, terminate the reader1 process using Ctrl-C and run the following
commands on the Samba client side.
$ cc -Wall -O3 -o reader2 reader2.c
$ strace -tt -o /tmp/log2 ./reader2 < shared_file
......................!!!!!! (<= Makes progress with wrong data)
We can observe via /tmp/log2 that the read() returns 1 because fstat()
detects the growth of file size of shared_file . However, the data
returned by read() is not '.' but '\0'.
That's expected behavior. The kernel believes that the file is frozen in
length so it returns short read() calls until the size is updated.

cache=loose is very much not recommended for use when you have multiple
hosts accessing files on the server (or access by processes on the
server itself). It only gives you "loose" cache coherency. The whole
point of it is to allow the client to cache data even when the protocol
says that it shouldn't.

cache=strict is what is recommended and that's the default these days.
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Tetsuo Handa
2014-03-20 20:56:04 UTC
Permalink
Post by Jeff Layton
That's expected behavior. The kernel believes that the file is frozen in
length so it returns short read() calls until the size is updated.
The "size is updated" means "stat() detects the growth of file size",
doesn't it? Then, the former is expected behavior.
Post by Jeff Layton
cache=loose is very much not recommended for use when you have multiple
hosts accessing files on the server (or access by processes on the
server itself). It only gives you "loose" cache coherency. The whole
point of it is to allow the client to cache data even when the protocol
says that it shouldn't.
But why is the latter ( "read() returns non-0 when stat() detects the growth
of file size but the data actually read is '\0'" ) is expected behavior?
It sounds like a bug that the client caches '\0' (data nobody has ever wrote)
instead of '.' (data somebody wrote when the file size grew).
Post by Jeff Layton
cache=strict is what is recommended and that's the default these days.
Yes, I suggested a customer using CIFS on RHEL6 to use cache=none or
cache=strict , for that customer's system does not work as expected due to
the latter behavior when using "tail -f" for checking for appended log.
Steve French
2014-03-20 21:04:13 UTC
Permalink
On Thu, Mar 20, 2014 at 3:56 PM, Tetsuo Handa
Post by Tetsuo Handa
Post by Jeff Layton
That's expected behavior. The kernel believes that the file is frozen in
length so it returns short read() calls until the size is updated.
The "size is updated" means "stat() detects the growth of file size",
doesn't it? Then, the former is expected behavior.
Note that setting cifs mount option "actimeo=0" will cause size not to
be cached and may help.
Post by Tetsuo Handa
Post by Jeff Layton
cache=loose is very much not recommended for use when you have multiple
hosts accessing files on the server (or access by processes on the
server itself). It only gives you "loose" cache coherency. The whole
point of it is to allow the client to cache data even when the protocol
says that it shouldn't.
But why is the latter ( "read() returns non-0 when stat() detects the growth
of file size but the data actually read is '\0'" ) is expected behavior?
It sounds like a bug that the client caches '\0' (data nobody has ever wrote)
instead of '.' (data somebody wrote when the file size grew).
Post by Jeff Layton
cache=strict is what is recommended and that's the default these days.
Yes, I suggested a customer using CIFS on RHEL6 to use cache=none or
cache=strict , for that customer's system does not work as expected due to
the latter behavior when using "tail -f" for checking for appended log.
--
Thanks,

Steve
Jeff Layton
2014-03-20 23:10:13 UTC
Permalink
On Fri, 21 Mar 2014 05:56:04 +0900
Post by Tetsuo Handa
Post by Jeff Layton
That's expected behavior. The kernel believes that the file is frozen in
length so it returns short read() calls until the size is updated.
The "size is updated" means "stat() detects the growth of file size",
doesn't it? Then, the former is expected behavior.
Post by Jeff Layton
cache=loose is very much not recommended for use when you have multiple
hosts accessing files on the server (or access by processes on the
server itself). It only gives you "loose" cache coherency. The whole
point of it is to allow the client to cache data even when the protocol
says that it shouldn't.
But why is the latter ( "read() returns non-0 when stat() detects the growth
of file size but the data actually read is '\0'" ) is expected behavior?
It sounds like a bug that the client caches '\0' (data nobody has ever wrote)
instead of '.' (data somebody wrote when the file size grew).
Yeah, that sounds wrong. What should happen is that the cache is
invalidated when the size changes. It's possible there is a race in
that code however. The locking around it is pretty sloppy...
Post by Tetsuo Handa
Post by Jeff Layton
cache=strict is what is recommended and that's the default these days.
Yes, I suggested a customer using CIFS on RHEL6 to use cache=none or
cache=strict , for that customer's system does not work as expected due to
the latter behavior when using "tail -f" for checking for appended log.
Yeah, cache=strict should work much better for that sort of use-case.
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Pavel Shilovsky
2014-03-21 08:56:34 UTC
Permalink
Post by Jeff Layton
On Fri, 21 Mar 2014 05:56:04 +0900
Post by Tetsuo Handa
Post by Jeff Layton
That's expected behavior. The kernel believes that the file is frozen in
length so it returns short read() calls until the size is updated.
The "size is updated" means "stat() detects the growth of file size",
doesn't it? Then, the former is expected behavior.
Post by Jeff Layton
cache=loose is very much not recommended for use when you have multiple
hosts accessing files on the server (or access by processes on the
server itself). It only gives you "loose" cache coherency. The whole
point of it is to allow the client to cache data even when the protocol
says that it shouldn't.
But why is the latter ( "read() returns non-0 when stat() detects the growth
of file size but the data actually read is '\0'" ) is expected behavior?
It sounds like a bug that the client caches '\0' (data nobody has ever wrote)
instead of '.' (data somebody wrote when the file size grew).
Yeah, that sounds wrong. What should happen is that the cache is
invalidated when the size changes. It's possible there is a race in
that code however. The locking around it is pretty sloppy...
When fstat() get a new file size it sets
CIFS_I(inode)->invalid_mapping to true but do not revalidate the
cache. Then generic_file_aio_read() reads the wrong data. I think we
need to check if CIFS_I(inode)->invalid_mapping is true and revalidate
the cache before calling generic_file_aio_read() in
file->f_ops->aio_read(). Now cache revalidation happens in lookup/open
and mmap codepaths only for cache=loose.

Of course, cache=loose is not recommended for this sort of work flow
and cache=strict should be used to provide a data coherency between
several machines.
--
Best regards,
Pavel Shilovsky.
Jeff Layton
2014-03-21 11:40:20 UTC
Permalink
On Fri, 21 Mar 2014 12:56:34 +0400
Post by Pavel Shilovsky
Post by Jeff Layton
On Fri, 21 Mar 2014 05:56:04 +0900
Post by Tetsuo Handa
Post by Jeff Layton
That's expected behavior. The kernel believes that the file is frozen in
length so it returns short read() calls until the size is updated.
The "size is updated" means "stat() detects the growth of file size",
doesn't it? Then, the former is expected behavior.
Post by Jeff Layton
cache=loose is very much not recommended for use when you have multiple
hosts accessing files on the server (or access by processes on the
server itself). It only gives you "loose" cache coherency. The whole
point of it is to allow the client to cache data even when the protocol
says that it shouldn't.
But why is the latter ( "read() returns non-0 when stat() detects the growth
of file size but the data actually read is '\0'" ) is expected behavior?
It sounds like a bug that the client caches '\0' (data nobody has ever wrote)
instead of '.' (data somebody wrote when the file size grew).
Yeah, that sounds wrong. What should happen is that the cache is
invalidated when the size changes. It's possible there is a race in
that code however. The locking around it is pretty sloppy...
When fstat() get a new file size it sets
CIFS_I(inode)->invalid_mapping to true but do not revalidate the
cache. Then generic_file_aio_read() reads the wrong data. I think we
need to check if CIFS_I(inode)->invalid_mapping is true and revalidate
the cache before calling generic_file_aio_read() in
file->f_ops->aio_read(). Now cache revalidation happens in lookup/open
and mmap codepaths only for cache=loose.
Of course, cache=loose is not recommended for this sort of work flow
and cache=strict should be used to provide a data coherency between
several machines.
Ahh right, you're quite correct that we need to revalidate the cache
before doing an aio_read. We should have cifs do something like
nfs_file_read() does...

I also suspect that we have a problem with the invalid_mapping flag
similar to the one NFS had until recently. That was fixed by commit
d529ef83c355. We probably ought to do something similar for cifs.
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Tetsuo Handa
2014-03-26 12:04:15 UTC
Permalink
Post by Jeff Layton
Post by Tetsuo Handa
Yes, I suggested a customer using CIFS on RHEL6 to use cache=none or
cache=strict , for that customer's system does not work as expected due to
the latter behavior when using "tail -f" for checking for appended log.
Yeah, cache=strict should work much better for that sort of use-case.
I got another problem where none of cache=none cache=strict actimeo=0 helps.

Since "tail -f" does "repeat calling sleep() and fstat() until the file size
increases" -> "do read() until EOF" -> "call fstat() again", "tail -f" prints
"file truncated" message and prints some portion of already printed lines
even though the file size on the Samba server never decreased.
I expect that the latest fstat() returns the up-to-date file size.
This behavior is observed on 2.6.32-431.11.2.el6.x86_64 and 3.14-rc8.

Steps to reproduce:

---------- Start of testprog.c ----------
/**
* stdin and stdout must refer a regular file which is shared via CIFS.
* An example is shown below.
*
* # mount -t cifs -o cache=strict,actimeo=0 //127.0.0.1/path/to/export /path/to/mount
* $ cc -Wall -O3 -o a.out this_file.c
* $ ./a.out > /path/to/export/shared_file < /path/to/mount/shared_file
*/
#include <stdio.h>
#include <stdlib.h>
#include <poll.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
unsigned int delay = argc > 1 ? atoi(argv[1]) : 0;
size_t write_pos = 0;
size_t read_pos = 0;
if (ftruncate(1, 0))
return 1;
while (1) {
struct stat buf;
char c;
if (write(1, ".", 1) != 1)
return 2;
write_pos++;
if (fstat(1, &buf))
return 3;
if (buf.st_size != write_pos)
return 4;
if (read(0, &c, 1) != 1)
return 5;
if (c != '.')
return 6;
read_pos++;
if (delay)
poll(NULL, 0, delay);
if (fstat(0, &buf))
return 7;
if (buf.st_size < read_pos)
fprintf(stderr, "read(%lu) - stat(%lu) = %lu\n",
read_pos, buf.st_size, read_pos - buf.st_size);
}
}
---------- End of testprog.c ----------

Make a CIFS mount on localhost where the Samba server process is running.
Compile testprog.c and run testprog, with stdin redirected to a regular file
which is accessed via CIFS and stdout redirected to the same regular file
which is accessed as a local filesystem. An example is shown below.

# mount -t cifs -o cache=none,actimeo=0 //127.0.0.1/home /mnt
$ cc -Wall -O3 -o testprog testprog.c
$ ./testprog > ~/shared_file < /mnt/shared_file

According to what testprog.c does, it is expected that nothing is printed by
testprog process because the bytes read via read() equals the bytes obtained
via fstat(). However, testprog process prints that the bytes read via read()
is larger than the bytes obtained via fstat().

The strace on the smbd process reports that CIFS is omitting lstat() request
after pread() request if lstat() request was done very recently. Specifying
delay as the command line argument for testprog process reduces the possibility
of omitting lstat() request.
Jeffrey Layton
2014-03-26 12:21:19 UTC
Permalink
On Wed, 26 Mar 2014 21:04:15 +0900
Post by Tetsuo Handa
Post by Jeff Layton
Post by Tetsuo Handa
Yes, I suggested a customer using CIFS on RHEL6 to use cache=none
or cache=strict , for that customer's system does not work as
expected due to the latter behavior when using "tail -f" for
checking for appended log.
Yeah, cache=strict should work much better for that sort of
use-case.
I got another problem where none of cache=none cache=strict actimeo=0 helps.
Since "tail -f" does "repeat calling sleep() and fstat() until the
file size increases" -> "do read() until EOF" -> "call fstat()
again", "tail -f" prints "file truncated" message and prints some
portion of already printed lines even though the file size on the
Samba server never decreased. I expect that the latest fstat()
returns the up-to-date file size. This behavior is observed on
2.6.32-431.11.2.el6.x86_64 and 3.14-rc8.
---------- Start of testprog.c ----------
/**
* stdin and stdout must refer a regular file which is shared via CIFS.
* An example is shown below.
*
* # mount -t cifs -o
cache=strict,actimeo=0 //127.0.0.1/path/to/export /path/to/mount
* $ cc -Wall -O3 -o a.out this_file.c
* $ ./a.out > /path/to/export/shared_file
< /path/to/mount/shared_file */
#include <stdio.h>
#include <stdlib.h>
#include <poll.h>
#include <sys/stat.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
unsigned int delay = argc > 1 ? atoi(argv[1]) : 0;
size_t write_pos = 0;
size_t read_pos = 0;
if (ftruncate(1, 0))
return 1;
while (1) {
struct stat buf;
char c;
if (write(1, ".", 1) != 1)
return 2;
write_pos++;
if (fstat(1, &buf))
return 3;
if (buf.st_size != write_pos)
return 4;
if (read(0, &c, 1) != 1)
return 5;
if (c != '.')
return 6;
read_pos++;
if (delay)
poll(NULL, 0, delay);
if (fstat(0, &buf))
return 7;
if (buf.st_size < read_pos)
fprintf(stderr, "read(%lu) - stat(%lu) =
%lu\n", read_pos, buf.st_size, read_pos - buf.st_size);
}
}
---------- End of testprog.c ----------
Make a CIFS mount on localhost where the Samba server process is
running. Compile testprog.c and run testprog, with stdin redirected
to a regular file which is accessed via CIFS and stdout redirected to
the same regular file which is accessed as a local filesystem. An
example is shown below.
# mount -t cifs -o cache=none,actimeo=0 //127.0.0.1/home /mnt
$ cc -Wall -O3 -o testprog testprog.c
$ ./testprog > ~/shared_file < /mnt/shared_file
According to what testprog.c does, it is expected that nothing is
printed by testprog process because the bytes read via read() equals
the bytes obtained via fstat(). However, testprog process prints that
the bytes read via read() is larger than the bytes obtained via
fstat().
The strace on the smbd process reports that CIFS is omitting lstat()
request after pread() request if lstat() request was done very
recently. Specifying delay as the command line argument for testprog
process reduces the possibility of omitting lstat() request.
(cc'ing samba-technical)

...or that samba is omitting it and is sending cached info instead of
doing the lstat() call? I'm not sure if it does that, but I don't
think you should draw too many conclusions about the behavior of cifs.ko
from stracing smbd.

What may make more sense is to get network captures and analyze the
behavior from that perspective.
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Jeffrey Layton
2014-03-26 13:03:05 UTC
Permalink
On Wed, 26 Mar 2014 05:21:19 -0700
Post by Jeffrey Layton
On Wed, 26 Mar 2014 21:04:15 +0900
Post by Tetsuo Handa
Post by Jeff Layton
Post by Tetsuo Handa
Yes, I suggested a customer using CIFS on RHEL6 to use
cache=none or cache=strict , for that customer's system does
not work as expected due to the latter behavior when using
"tail -f" for checking for appended log.
Yeah, cache=strict should work much better for that sort of use-case.
I got another problem where none of cache=none cache=strict
actimeo=0 helps.
Since "tail -f" does "repeat calling sleep() and fstat() until the
file size increases" -> "do read() until EOF" -> "call fstat()
again", "tail -f" prints "file truncated" message and prints some
portion of already printed lines even though the file size on the
Samba server never decreased. I expect that the latest fstat()
returns the up-to-date file size. This behavior is observed on
2.6.32-431.11.2.el6.x86_64 and 3.14-rc8.
---------- Start of testprog.c ----------
/**
* stdin and stdout must refer a regular file which is shared via CIFS.
* An example is shown below.
*
* # mount -t cifs -o
cache=strict,actimeo=0 //127.0.0.1/path/to/export /path/to/mount
* $ cc -Wall -O3 -o a.out this_file.c
* $ ./a.out > /path/to/export/shared_file
< /path/to/mount/shared_file */
#include <stdio.h>
#include <stdlib.h>
#include <poll.h>
#include <sys/stat.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
unsigned int delay = argc > 1 ? atoi(argv[1]) : 0;
size_t write_pos = 0;
size_t read_pos = 0;
if (ftruncate(1, 0))
return 1;
while (1) {
struct stat buf;
char c;
if (write(1, ".", 1) != 1)
return 2;
write_pos++;
if (fstat(1, &buf))
return 3;
if (buf.st_size != write_pos)
return 4;
if (read(0, &c, 1) != 1)
return 5;
if (c != '.')
return 6;
read_pos++;
if (delay)
poll(NULL, 0, delay);
if (fstat(0, &buf))
return 7;
if (buf.st_size < read_pos)
fprintf(stderr, "read(%lu) - stat(%lu) =
%lu\n", read_pos, buf.st_size, read_pos - buf.st_size);
}
}
---------- End of testprog.c ----------
Make a CIFS mount on localhost where the Samba server process is
running. Compile testprog.c and run testprog, with stdin redirected
to a regular file which is accessed via CIFS and stdout redirected
to the same regular file which is accessed as a local filesystem. An
example is shown below.
# mount -t cifs -o cache=none,actimeo=0 //127.0.0.1/home /mnt
$ cc -Wall -O3 -o testprog testprog.c
$ ./testprog > ~/shared_file < /mnt/shared_file
According to what testprog.c does, it is expected that nothing is
printed by testprog process because the bytes read via read() equals
the bytes obtained via fstat(). However, testprog process prints
that the bytes read via read() is larger than the bytes obtained via
fstat().
The strace on the smbd process reports that CIFS is omitting lstat()
request after pread() request if lstat() request was done very
recently. Specifying delay as the command line argument for testprog
process reduces the possibility of omitting lstat() request.
(cc'ing samba-technical)
...or that samba is omitting it and is sending cached info instead of
doing the lstat() call? I'm not sure if it does that, but I don't
think you should draw too many conclusions about the behavior of
cifs.ko from stracing smbd.
What may make more sense is to get network captures and analyze the
behavior from that perspective.
...or maybe it *is* cifs.ko. From cifs_inode_needs_reval:


if (!time_in_range(jiffies, cifs_i->time,
cifs_i->time + cifs_sb->actimeo))
return true;


...I think though that if cifs_i->time == jiffies and actimeo=0, then
that condition will be false. As a quick check, it might be good to add
something like this before that if statement and then rerun your test:

if (!cifs_sb->actimeo)
return true;

That should get rid of that particular corner case.
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Tetsuo Handa
2014-03-26 13:06:53 UTC
Permalink
Post by Jeffrey Layton
...or maybe it *is* cifs.ko.
The strace log is showing that smbd process receives "\377SMB." for read()
request and "\377SMB2" for fstat() request.
We can see that there is only one "\377SMB." between two "\377SMB2" for
delay = 1000 ms case while there are multiple "\377SMB." between two
"\377SMB2" for delay = 0 ms case.

Therefore, I think it *is* cifs.ko.

---------- strace of normal case ----------
21:38:47.725757 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 53516) = 1 ([{fd=25, revents=POLLIN}])
21:38:47.725847 read(25, "\0\0\0;", 4) = 4
21:38:47.725901 read(25, "\377SMB.\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5d\0c\32"..., 59) = 59
21:38:47.725970 fstat(29, {st_mode=S_IFREG|0644, st_size=1, ...}) = 0
21:38:47.726039 pread(29, ".", 1, 0) = 1
21:38:47.726095 writev(25, [{"\0\0\0<\377SMB.\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5"..., 64}], 1) = 64
21:38:47.726187 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 53514) = 1 ([{fd=25, revents=POLLIN}])
21:38:48.739627 read(25, "\0\0\0b", 4) = 4
21:38:48.739725 read(25, "\377SMB2\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5d\0d\32"..., 98) = 98
21:38:48.739811 lstat("shared_file", {st_mode=S_IFREG|0644, st_size=1, ...}) = 0
21:38:48.739902 getcwd("/root", 4096) = 6
21:38:48.739956 lstat("/root/shared_file", {st_mode=S_IFREG|0644, st_size=1, ...}) = 0
21:38:48.740039 lstat("shared_file", {st_mode=S_IFREG|0644, st_size=1, ...}) = 0
21:38:48.740111 fcntl(13, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=13024, len=1}) = 0
21:38:48.740171 fcntl(13, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=13024, len=1}) = 0
21:38:48.740241 writev(25, [{"\0\0\0\240\377SMB2\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5"..., 164}], 1) = 164
21:38:48.740341 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 52500) = 1 ([{fd=25, revents=POLLIN}])
21:38:48.740459 read(25, "\0\0\0;", 4) = 4
21:38:48.740515 read(25, "\377SMB.\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5d\0e\32"..., 59) = 59
21:38:48.740578 fstat(29, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
21:38:48.740643 pread(29, ".", 1, 1) = 1
21:38:48.740698 writev(25, [{"\0\0\0<\377SMB.\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5"..., 64}], 1) = 64
21:38:48.740789 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 52499) = 1 ([{fd=25, revents=POLLIN}])
21:38:49.755256 read(25, "\0\0\0b", 4) = 4
21:38:49.755330 read(25, "\377SMB2\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5d\0f\32"..., 98) = 98
21:38:49.755412 lstat("shared_file", {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
21:38:49.755484 getcwd("/root", 4096) = 6
21:38:49.755538 lstat("/root/shared_file", {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
21:38:49.755603 lstat("shared_file", {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
21:38:49.755674 fcntl(13, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=13024, len=1}) = 0
21:38:49.755733 fcntl(13, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=13024, len=1}) = 0
21:38:49.755802 writev(25, [{"\0\0\0\240\377SMB2\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5"..., 164}], 1) = 164
21:38:49.755945 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 51484) = 1 ([{fd=25, revents=POLLIN}])
21:38:49.756007 read(25, "\0\0\0;", 4) = 4
21:38:49.756058 read(25, "\377SMB.\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5d\0g\32"..., 59) = 59
21:38:49.756120 fstat(29, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
21:38:49.756185 pread(29, ".", 1, 2) = 1
21:38:49.756240 writev(25, [{"\0\0\0<\377SMB.\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5"..., 64}], 1) = 64
21:38:49.756331 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 51484) = 1 ([{fd=25, revents=POLLIN}])
21:38:50.770753 read(25, "\0\0\0b", 4) = 4
21:38:50.770838 read(25, "\377SMB2\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5d\0h\32"..., 98) = 98
21:38:50.770922 lstat("shared_file", {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
21:38:50.770994 getcwd("/root", 4096) = 6
21:38:50.771050 lstat("/root/shared_file", {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
21:38:50.771115 lstat("shared_file", {st_mode=S_IFREG|0644, st_size=3, ...}) = 0
21:38:50.771187 fcntl(13, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=13024, len=1}) = 0
21:38:50.771247 fcntl(13, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=13024, len=1}) = 0
21:38:50.771316 writev(25, [{"\0\0\0\240\377SMB2\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0\32\5"..., 164}], 1) = 164
21:38:50.771461 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 50469) = 1 ([{fd=25, revents=POLLIN}])
---------- strace of normal case ----------

---------- strace of problem case ----------
21:40:36.663289 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 4638) = 1 ([{fd=25, revents=POLLIN}])
21:40:36.663349 read(25, "\0\0\0b", 4) = 4
21:40:36.663399 read(25, "\377SMB2\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5d\0\227\34"..., 98) = 98
21:40:36.663463 lstat("shared_file", {st_mode=S_IFREG|0644, st_size=447, ...}) = 0
21:40:36.663528 getcwd("/root", 4096) = 6
21:40:36.663578 lstat("/root/shared_file", {st_mode=S_IFREG|0644, st_size=447, ...}) = 0
21:40:36.663642 lstat("shared_file", {st_mode=S_IFREG|0644, st_size=447, ...}) = 0
21:40:36.663708 fcntl(13, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=13024, len=1}) = 0
21:40:36.663761 fcntl(13, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=13024, len=1}) = 0
21:40:36.663867 writev(25, [{"\0\0\0\240\377SMB2\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5"..., 164}], 1) = 164
21:40:36.664038 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 4637) = 1 ([{fd=25, revents=POLLIN}])
21:40:36.664102 read(25, "\0\0\0;", 4) = 4
21:40:36.664154 read(25, "\377SMB.\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5d\0\230\34"..., 59) = 59
21:40:36.664215 fstat(29, {st_mode=S_IFREG|0644, st_size=448, ...}) = 0
21:40:36.664277 pread(29, ".", 1, 447) = 1
21:40:36.664330 writev(25, [{"\0\0\0<\377SMB.\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5"..., 64}], 1) = 64
21:40:36.664438 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 4637) = 1 ([{fd=25, revents=POLLIN}])
21:40:36.664498 read(25, "\0\0\0;", 4) = 4
21:40:36.664548 read(25, "\377SMB.\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5d\0\231\34"..., 59) = 59
21:40:36.664608 fstat(29, {st_mode=S_IFREG|0644, st_size=449, ...}) = 0
21:40:36.664669 pread(29, ".", 1, 448) = 1
21:40:36.664719 writev(25, [{"\0\0\0<\377SMB.\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5"..., 64}], 1) = 64
21:40:36.664824 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 4636) = 1 ([{fd=25, revents=POLLIN}])
21:40:36.664883 read(25, "\0\0\0;", 4) = 4
21:40:36.664933 read(25, "\377SMB.\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5d\0\232\34"..., 59) = 59
21:40:36.664992 fstat(29, {st_mode=S_IFREG|0644, st_size=450, ...}) = 0
21:40:36.665053 pread(29, ".", 1, 449) = 1
21:40:36.665102 writev(25, [{"\0\0\0<\377SMB.\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5"..., 64}], 1) = 64
21:40:36.665206 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 4636) = 1 ([{fd=25, revents=POLLIN}])
21:40:36.665265 read(25, "\0\0\0;", 4) = 4
21:40:36.665315 read(25, "\377SMB.\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5d\0\233\34"..., 59) = 59
21:40:36.665374 fstat(29, {st_mode=S_IFREG|0644, st_size=451, ...}) = 0
21:40:36.665435 pread(29, ".", 1, 450) = 1
21:40:36.665484 writev(25, [{"\0\0\0<\377SMB.\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5"..., 64}], 1) = 64
21:40:36.665587 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 4636) = 1 ([{fd=25, revents=POLLIN}])
21:40:36.665646 read(25, "\0\0\0;", 4) = 4
21:40:36.665696 read(25, "\377SMB.\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5d\0\234\34"..., 59) = 59
21:40:36.665755 fstat(29, {st_mode=S_IFREG|0644, st_size=452, ...}) = 0
21:40:36.666556 pread(29, ".", 1, 451) = 1
21:40:36.666613 writev(25, [{"\0\0\0<\377SMB.\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5"..., 64}], 1) = 64
21:40:36.666712 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 4634) = 1 ([{fd=25, revents=POLLIN}])
21:40:36.666772 read(25, "\0\0\0b", 4) = 4
21:40:36.666823 read(25, "\377SMB2\0\0\0\0\0\1\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5d\0\235\34"..., 98) = 98
21:40:36.666887 lstat("shared_file", {st_mode=S_IFREG|0644, st_size=452, ...}) = 0
21:40:36.666951 getcwd("/root", 4096) = 6
21:40:36.667000 lstat("/root/shared_file", {st_mode=S_IFREG|0644, st_size=452, ...}) = 0
21:40:36.667064 lstat("shared_file", {st_mode=S_IFREG|0644, st_size=452, ...}) = 0
21:40:36.667129 fcntl(13, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=13024, len=1}) = 0
21:40:36.667182 fcntl(13, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=13024, len=1}) = 0
21:40:36.667242 writev(25, [{"\0\0\0\240\377SMB2\0\0\0\0\200\3\300\0\0\0\0\0\0\0\0\0\0\0\0\1\0,\5"..., 164}], 1) = 164
21:40:36.667344 poll([{fd=6, events=POLLIN|POLLHUP}, {fd=25, events=POLLIN|POLLHUP}, {fd=8, events=POLLIN|POLLHUP}], 3, 4634) = 1 ([{fd=25, revents=POLLIN}])
---------- strace of problem case ----------
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tetsuo Handa
2014-03-26 14:05:20 UTC
Permalink
Post by Jeffrey Layton
Post by Jeffrey Layton
...or that samba is omitting it and is sending cached info instead of
doing the lstat() call? I'm not sure if it does that, but I don't
think you should draw too many conclusions about the behavior of
cifs.ko from stracing smbd.
What may make more sense is to get network captures and analyze the
behavior from that perspective.
if (!time_in_range(jiffies, cifs_i->time,
cifs_i->time + cifs_sb->actimeo))
return true;
...I think though that if cifs_i->time == jiffies and actimeo=0, then
that condition will be false. As a quick check, it might be good to add
if (!cifs_sb->actimeo)
return true;
That should get rid of that particular corner case.
Yes, I confirmed that below change fixes this problem. Thank you.

--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1737,6 +1737,8 @@ cifs_inode_needs_reval(struct inode *inode)
if (cifs_i->time == 0)
return true;

+ if (!cifs_sb->actimeo)
+ return true;
if (!time_in_range(jiffies, cifs_i->time,
cifs_i->time + cifs_sb->actimeo))
return true;
Jeffrey Layton
2014-03-26 14:12:19 UTC
Permalink
On Wed, 26 Mar 2014 23:05:20 +0900
Post by Tetsuo Handa
Post by Jeffrey Layton
Post by Jeffrey Layton
...or that samba is omitting it and is sending cached info
instead of doing the lstat() call? I'm not sure if it does that,
but I don't think you should draw too many conclusions about the
behavior of cifs.ko from stracing smbd.
What may make more sense is to get network captures and analyze
the behavior from that perspective.
if (!time_in_range(jiffies, cifs_i->time,
cifs_i->time + cifs_sb->actimeo))
return true;
...I think though that if cifs_i->time == jiffies and actimeo=0,
then that condition will be false. As a quick check, it might be
good to add something like this before that if statement and then
if (!cifs_sb->actimeo)
return true;
That should get rid of that particular corner case.
Yes, I confirmed that below change fixes this problem. Thank you.
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1737,6 +1737,8 @@ cifs_inode_needs_reval(struct inode *inode)
if (cifs_i->time == 0)
return true;
+ if (!cifs_sb->actimeo)
+ return true;
if (!time_in_range(jiffies, cifs_i->time,
cifs_i->time + cifs_sb->actimeo))
return true;
Great, thanks for testing it. I'll send Steve a patch for 3.15 and cc
you.
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Jeff Layton
2014-03-31 17:48:44 UTC
Permalink
On Thu, 20 Mar 2014 21:03:03 +0900
Post by Tetsuo Handa
Is the behavior described below ( "read() returns 0 unless stat() detects the
growth of file size" and "read() returns non-0 when stat() detects the growth
of file size but the data actually read is '\0'" ) a feature of cache=loose
option?
---------- Start of writer1.c ----------
#include <unistd.h>
int main(int argc, char *argv[])
{
while (write(1, ".", 1) == 1)
sleep(1);
return 0;
}
---------- End of writer1.c ----------
---------- Start of reader1.c ----------
#include <unistd.h>
int main(int argc, char *argv[])
{
while (1) {
char c;
if (read(0, &c, 1) == 1) {
if (c)
write(1, &c, 1);
else
write(1, "!", 1);
} else {
sleep(1);
}
}
return 0;
}
---------- End of reader1.c ----------
---------- Start of reader2.c ----------
#include <sys/stat.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
while (1) {
struct stat buf;
char c;
fstat(0, &buf);
if (read(0, &c, 1) == 1) {
if (c)
write(1, &c, 1);
else
write(1, "!", 1);
} else {
sleep(1);
}
}
return 0;
}
---------- End of reader2.c ----------
Run the following commands on the Samba server side.
$ cc -Wall -O3 -o writer1 writer1.c
$ ./writer1 > shared_file
Run the following commands on the Samba client side.
$ cc -Wall -O3 -o reader1 reader1.c
$ strace -tt -o /tmp/log1 ./reader1 < shared_file
.......... (<= Makes no progress by default)
The shared_file shown above is a regular file shared via CIFS mounted with
cache=loose option. (e.g. "mount -t cifs -o cache=loose //server/path /mnt" )
I expect that the reader1 process reads one byte whenever the writer1
process writes one byte. But we can observe via /tmp/log1 that the read()
continues returning 0 until the growth of file size of shared_file is
recognized by stat() using programs like /bin/ls and /usr/bin/stat .
Then, terminate the reader1 process using Ctrl-C and run the following
commands on the Samba client side.
$ cc -Wall -O3 -o reader2 reader2.c
$ strace -tt -o /tmp/log2 ./reader2 < shared_file
......................!!!!!! (<= Makes progress with wrong data)
We can observe via /tmp/log2 that the read() returns 1 because fstat()
detects the growth of file size of shared_file . However, the data
returned by read() is not '.' but '\0'.
I had a chance to poke at this today. I think the main issue is likely
due to a server-side problem.

smbd.conf defaults to "kernel oplocks = no" now. That means that samba
is handing out oplocks even when other processes on the box have the
file open for read or write. Setting that to "yes" should help things
work better.

It seems like that used to be different, but has apparently changed in
recent versions of samba. Without that, we get some "interesting"
effects when the size changes, but we still hold an oplock.

I also have a patchset forthcoming to fix some races in this area too.
After setting kernel oplocks = yes and applying the patchset, this
works more like you'd expect.
--
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+***@public.gmane.org>
Loading...