Opened 11 months ago

Closed 10 months ago

Last modified 10 months ago

#324 closed defect (fixed)

segfaults with xcache.readonly_protection = Off under heavy load

Reported by: boxdev
Owned by: moo
Priority: critical
Milestone: 3.0.4
Component: cacher
Version: 3.0.3
Keywords:
Cc:
Application:
PHP Version: 5.4.19
Other Exts:
SAPI: Irrelevant
Probability: Sometimes
Blocked By:
Blocking:

Description

We downloaded xcache-3.0-r1369.tar.bz2 and installed it on our test servers. When we run around 150 concurrent processes, we observe segfaults.

I have attached some stack traces. We would appreciate your feedback on what other data we can gather.
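
For reference, this is roughly how we pull the backtraces out of the core files (a sketch; the httpd binary path and core file name below are placeholders, and the cores land in the xcache.coredump_directory listed further down):

$ # hedged sketch: extract a full backtrace from an httpd core dump
$ gdb /usr/sbin/httpd /box/var/cores/core.12345
(gdb) bt full
(gdb) thread apply all bt
(gdb) quit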

Version of XCache: php-xcache-3.0.4-5.4_box1.el6.x86_64
Version of PHP: php-5.4.19-1.el6.x86_64
Version of Apache: httpd-2.2.15-28.sl6.x86_64
OS: Scientific Linux 6
Output of 'uname -a': Linux pod4201-upload01.pod.box.net 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011 x86_64 x86_64 x86_64 GNU/Linux
Frequency of occurrence: Normally very low. But much more frequent when our test-driver uses around 150 concurrent processes or more.
xcache.readonly_protection = Off

This is very similar to the behavior we observed with XCache 3.0.3.

Attachments (2)

gdb-output-20130910-1.txt (4.8 KB) - added by boxdev 11 months ago.
Stack trace involving shutdown.
phpinfo().html (82.0 KB) - added by boxdev 11 months ago.
This is the output of phpinfo.


Change History (27)

Changed 11 months ago by boxdev

Stack trace involving shutdown.

comment:1 Changed 11 months ago by moo

You can always reopen the bug. Anyway, I want to confirm that the core dump happens while the updated XCache is actually in use. Do any frames in the stack come from xcache-3.0-r1369/*.c? So far I can only see php*/*.c in your dump.

I've tested with php-fpm: I can reproduce the crash with 3.0.3 but not with 3.0.4-dev. I'll make some time to install apache2 as well.

comment:2 Changed 11 months ago by moo

Please change xcache.h:

#define XCACHE_VERSION "3.0.4-dev my testing" /* any marker string will do */

then repack the SRPM, reinstall, restart apache2, and check whether this string appears on the phpinfo() page.

comment:3 Changed 11 months ago by moo

By the way, if you have a test environment, I would like to get root SSH access to it. If I can reproduce the problem myself, it will take a lot less time to trace it down.


comment:4 Changed 11 months ago by boxdev

Some of the segfaults are more prevalent than others. We see some segfaults where the Apache process was in the middle of running PHP; however, those are very hard to reproduce and only happen on some of our server types. On the other hand, the segfault that occurs while the Apache process is shutting down is easier to reproduce.

Regarding the change to xcache.h, I made the edit and this is the resulting output when I run PHP:

$ php -i | grep -i xcache
/etc/php.d/xcache.ini,

with XCache v3.0.4-dev my testing, Copyright (c) 2005-2013, by mOo
with XCache Cacher v3.0.4-dev my testing, Copyright (c) 2005-2013, by mOo

XCache
XCache Version => 3.0.4-dev my testing
xcache.coredump_directory => /box/var/cores => /box/var/cores
xcache.disable_on_crash => Off => Off
xcache.experimental => Off => Off
xcache.test => Off => Off
XCache Cacher
XCache Cacher Module => enabled
xcache.admin.enable_auth => On => On
xcache.allocator => bestfit => bestfit
xcache.cacher => On => On
xcache.count => 5 => 5
xcache.gc_interval => 300 => 300
xcache.mmap_path => /box/var/tmp/xcache.mmap => /box/var/tmp/xcache.mmap
xcache.readonly_protection => Off => Off
xcache.shm_scheme => mmap => mmap
xcache.size => 512M => 512M
xcache.slots => 4K => 4K
xcache.stat => On => On
xcache.ttl => 3600 => 3600
xcache.var_allocator => bestfit => bestfit
xcache.var_count => 4 => 4
xcache.var_gc_interval => 300 => 300
xcache.var_maxttl => 3600 => 3600
xcache.var_namespace => no value => no value
xcache.var_namespace_mode => 0 => 0
xcache.var_size => 128M => 128M
xcache.var_slots => 2K => 2K
xcache.var_ttl => 3600 => 3600

Regarding the test environment, I don't believe anyone can access it unless they are on our campus or using our VPN. I am trying to figure out how to create a VM that we can send to you, but this problem is more complicated than the other one we previously reported. I don't have permission to ship our source files to you, so I am trying to create an environment where we can all look at the segfaults without using our source code.

In the meantime, I would be happy to run more tests and capture more output.

comment:5 Changed 11 months ago by moo

Please use a phpinfo() page instead of "php -i": request http://...../phpinfo.php through the web server, from anywhere that can reach it, as I suggested.
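
Something like this should do (a sketch; the host name and document root below are placeholders for your setup):

$ # hedged sketch: drop a phpinfo page into the docroot and fetch it through Apache
$ echo '<?php phpinfo();' > /var/www/html/phpinfo.php
$ curl -s http://your-test-host/phpinfo.php | grep -o 'XCache[^<]*my testing'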

Changed 11 months ago by boxdev

This is the output of phpinfo.

comment:6 Changed 11 months ago by moo

#323 is related to this ticket.

comment:7 Changed 11 months ago by boxdev

As we work on the segfault problems, we are trying different XCache configurations. On some servers we have set "xcache.readonly_protection = On"; with that setting we observe fewer segfaults, but they still occur.

I was thinking of creating another ticket that focuses on the segfaults where "xcache.readonly_protection = On". That setup also sometimes has a different PHP version, a different Apache version, etc. With a separate ticket, the data for that set of packages and configurations won't complicate this one.

What do you think? Did you only want one ticket overall? Or should I file a separate ticket which can focus on that other scenario?

By the way, for the other scenario, the segfault occurs while PHP is still running.

comment:8 Changed 11 months ago by moo

It's better to build a reproduction environment so I can reproduce the crash locally, or to give me access to your test box and trigger the crash there using apachebench.

Debugging a random crash is never easy: it needs not just plenty of information but also interaction with the reproduction steps, so I can check/mark/tweak everything I can think of.
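
For the load itself, something along these lines is usually enough (a sketch; the URL and request count are placeholders, and the concurrency matches the ~150 processes you mentioned):

$ # hedged sketch: drive ~150 concurrent requests at one of your PHP pages
$ ab -c 150 -n 50000 http://your-test-host/some-app-page.php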

comment:9 Changed 10 months ago by boxdev

We agree that debugging crashes isn't easy; it is much more straightforward when there is a simple test case that reproduces the problem. We are still trying to reproduce this problem with a simpler test case, since we don't have permission to send you our source code.

Are you in the SF Bay Area? I will ask if it would be possible for you to visit our site in order to access our test environment.

Also, we have observed that the xcache.readonly_protection setting affects how many segfaults are generated. When xcache.readonly_protection is set to Off, we observe over 100,000 segfaults per day. However, when xcache.readonly_protection is set to On, there are only several hundred per day. Does that give any clues about the problem?

comment:10 Changed 10 months ago by moo

Nope, I'm in China. I understand your difficulty and hope you can nail it down to a simpler test case.

Different copy behavior is applied depending on readonly_protection. I'll check the difference and hopefully kill some of the problems, if not all of them.

comment:11 Changed 10 months ago by boxdev

Thank you for checking.

Just to be clear: when xcache.readonly_protection is set to Off, we see several types of stack traces (including the ones I have sent you), and there are over 100,000 segfaults in total. But when we set xcache.readonly_protection to On, we no longer see the same stack traces as before. Thus, toggling the setting either addresses the earlier problems or masks them enough that we no longer see them.

We still see around 500 segfaults total per day. But the stack traces from these segfaults are very different from the ones that we observed when xcache.readonly_protection was set to Off.

comment:12 Changed 10 months ago by moo

  • Milestone changed from undecided to 3.0.4
  • Status changed from new to accepted
  • Version set to 3.0.3

Let's use this ticket for the "Off" case only; please file another ticket for "On" and attach multiple backtrace files there for core dumps taken with xcache.readonly_protection = On.

It seems the two cases are caused by different bugs.

comment:13 Changed 10 months ago by moo

[1381] may be a fix for this xcache.readonly_protection = Off case. Can you please install from svn?

$ svn co svn://svn.lighttpd.net/xcache/trunk xcache-trunk
$ svn info xcache-trunk
...
Revision: 1381
...

Do it as if it were 3.0.4.
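
Building the checkout is the usual PHP extension procedure, roughly like this (a sketch; adjust the configure flags and install step to whatever your SRPM spec normally does):

$ # hedged sketch: build and install the trunk checkout like a normal PHP extension
$ cd xcache-trunk
$ phpize
$ ./configure --enable-xcache
$ make
$ sudo make install   # then restart apache2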

comment:14 Changed 10 months ago by moo

It should be after [1382].

comment:15 Changed 10 months ago by moo

  • Status changed from accepted to started
  • Summary changed from segfaults when using XCache 3.0.4 under heavy load to segfaults with xcache.readonly_protection = Off under heavy load

comment:16 Changed 10 months ago by boxdev

This is an interesting change. :)

I downloaded the source from SVN. I have run several tests but haven't observed segfaults yet in our test environment. I will run more tests tomorrow.

If the tests turn out well, we would be interested in downloading the official version of XCache 3.0.4 so that we can run tests using the official package.

Thank you.

comment:17 Changed 10 months ago by moo

Yes, it is. The old code was borrowed from APC; APC still uses refcount[0] = 1000;

It looks like they do it the same way in ext/opcache/zend_accelerator_util_funcs.c (zend_prepare_function_for_execution) in PHP_5_5.

Please be aware that [1388] is also needed, to keep the refcount from reaching 0 (zero). It's an even neater trick, better than ZEND_PROTECTED_REFCOUNT (1<<30).

Anyway, let's hope it fixes the problem.
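
If you want to double-check that your working copy actually contains these changesets, something like this should work (a sketch):

$ # hedged sketch: update the checkout and inspect the changesets mentioned above
$ svn update xcache-trunk
$ svn log -l 10 xcache-trunk
$ svn diff -c 1388 svn://svn.lighttpd.net/xcache/trunk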

comment:18 Changed 10 months ago by boxdev

Yes, we're hoping this fixes many of the segfaults we have observed.

I'm sure that there are preparations that need to be completed before making the next version of the package available. Could you please let us know how long it would be until the next release? Is the schedule for releasing XCache 3.0.4 published?

Thanks.

comment:19 Changed 10 months ago by moo

A 3.0.4 RC will be out as soon as you confirm your bug is fixed.

comment:20 Changed 10 months ago by boxdev

I ran more tests today. They duplicated the tests from over the weekend, in which we observed segfaults when using XCache 2.0.1 but none when using XCache 3.0.4.

I need to run some more tests tomorrow but wanted to let you know the status so far.

comment:21 Changed 10 months ago by moo

Since 3.0.4 is not released yet, I'm not sure which version you are talking about when you say "no segfaults when using XCache 3.0.4".

there were multiple versions here:

  • 2.0.1
  • 3.0.3
  • "3.0.4-dev my testing" (3.0.x-r1369)
  • 3.1.x-dev (trunk 3.1.x r1381)

Each of the above versions can be combined with readonly_protection = On or Off. Can you tell me which combinations segfault, which don't, and which were not tested? (Only the latest r1381 trunk really needs to be tested.)

comment:22 Changed 10 months ago by boxdev

I'm sorry for the confusion. When I said XCache 3.0.4, I meant the code I downloaded when I ran this command, as you suggested above:

svn co svn://svn.lighttpd.net/xcache/trunk xcache-trunk

We observed segfaults with all the other versions of XCache.

Once the new code is available as a downloadable package, we can run our tests on it, too.

comment:23 Changed 10 months ago by boxdev

We have completed the tests for the code which we downloaded with the SVN command:

svn co svn://svn.lighttpd.net/xcache/trunk xcache-trunk

Using this code, there were no segfaults observed during several rounds of testing. Is this really XCache 3.1.x instead of XCache 3.0.4?

What is the best way for us to get XCache 3.0.4?

Thank you.

comment:24 Changed 10 months ago by moo

  • Resolution set to fixed
  • Status changed from started to closed

In 1395:

merge r1381,r1388,r1394 from trunk: fixed #324: xcache.readonly_protection = Off cause SEGV under mass concurrent

comment:25 Changed 10 months ago by boxdev

I downloaded http://xcache.lighttpd.net/pub/snapshots/3.0-r1397/xcache-3.0-r1397.tar.bz2 and ran some more tests. We still don't see the segfaults. Thank you very much for looking into this problem.

When the final XCache 3.0.4 package becomes available on http://xcache.lighttpd.net/, we can begin deploying it onto our servers.

If we see other segfaults, we will collect the data and let you know if XCache may be involved. Thanks again.
