Opened 14 months ago

Closed 14 months ago

Last modified 13 months ago

#323 closed defect (fixed)

segfaults when using XCache 3.0.3 under heavy load

Reported by: boxdev
Owned by: moo
Priority: critical
Milestone: 3.0.4
Component: cacher
Version: 3.0.3
Keywords:
Cc:
Application:
PHP Version: 5.4.13
Other Exts:
SAPI: Irrelevant
Probability:
Blocked By:
Blocking:

Description

Hi. I am working with David Schnepper from Box to troubleshoot segfaults on our site. He previously worked with you on XCache 2.0.1. You took his fixes for XCache 2.0.1 and created XCache 3.0.3. Box upgraded to XCache 3.0.3 but we are still seeing the segfaults.

I have attached some stack traces. We would appreciate your feedback on what other data we can gather.

Version of XCache: php-xcache-3.0.3-5.4_box1.el6.x86_64
Version of PHP: php-5.4.13-1.el6.x86_64
Version of Apache: httpd-2.2.15-28.sl6.x86_64
OS: Scientific Linux 6
Output of 'uname -a': Linux pod4201-upload01.pod.box.net 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011 x86_64 x86_64 x86_64 GNU/Linux
Frequency of occurrence: Normally very low, but much more frequent when our test driver uses around 150 concurrent processes or more.
xcache.readonly_protection = Off

Attachments (6)

coredumps-20130901.txt (13.1 KB) - added by boxdev 14 months ago.
Stack traces from a few core dump files.
gdb.commands (212 bytes) - added by boxdev 14 months ago.
Command file for when we run GDB.
gdbinit (13.2 KB) - added by boxdev 14 months ago.
The ".gdbinit" file for PHP 5.4.13.
xcache.ini (1.2 KB) - added by boxdev 14 months ago.
The xcache.ini used during testing.
gdb-output-20130908-1.txt (71.0 KB) - added by boxdev 14 months ago.
GDB output using "bt full".
gdb-output-20130908-2.txt (4.6 KB) - added by boxdev 14 months ago.
Updated version of the GDB output.

Change History (23)

Changed 14 months ago by boxdev

Stack traces from a few core dump files.

Changed 14 months ago by boxdev

Command file for when we run GDB.

Changed 14 months ago by boxdev

The ".gdbinit" file for PHP 5.4.13.

Changed 14 months ago by boxdev

The xcache.ini used during testing.

comment:1 Changed 14 months ago by boxdev

When we dump the stack traces, we use a shell script that is similar to the following:

$ cat dump.sh
#!/bin/bash -x

CORES="core-httpd-11-48-48-29826-1378065706 core-httpd-11-48-48-32184-1378065706 core-httpd-11-48-48-9729-1378065706 core-httpd-11-48-48-18832-1378065708 core-httpd-11-48-48-9255-1378065706"
ECHO="/bin/echo"
GDB="/usr/bin/gdb"
GREP="/bin/grep"

for x in $CORES; do
    $ECHO " start: $x "
    # Run the commands in gdb.commands against each core, dropping symbol-loading noise.
    $GDB -x ./gdb.commands -e /usr/sbin/httpd -c "./$x" | $GREP -v "Reading symbols from" | $GREP -v "Loaded symbols for"
    $ECHO " end: $x "
    $ECHO " "
done

comment:2 Changed 14 months ago by boxdev

Inside xcache.ini, when we set "xcache.readonly_protection = Off" we see thousands of segfaults on our site each day. However, when we set "xcache.readonly_protection = On" we see very few segfaults.

In our test environment (when "xcache.readonly_protection = Off"), we see almost no segfaults when the number of concurrent processes is around 100 or less. But when we run around 150 concurrent processes or more, we see many more segfaults.

Furthermore, in our test environment (when "xcache.readonly_protection = On"), we see no segfaults even when we run more than 150 concurrent processes.

Please let us know if you need any more information.

Thanks.

comment:3 Changed 14 months ago by moo

Is it #296 that you're referring to?
I really want to fix this ticket ASAP, but the information provided doesn't lead me straight to the problem.

dump_bt won't work on this stack because the crash happens during request shutdown, not inside the executor, and executor_globals has already been destructed.
I'm not sure whether the _SERVER variable is still there; its REQUEST_URI could be useful information.

destroy_op_array (op_array=0x7f03d8f3d048) <- this op_array needs to be examined, so please try:

bt full
frame 1
print *op_array
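
The three commands above could also be run non-interactively against a core file. The following is only a sketch: the helper name write_gdb_commands and the output file name are hypothetical, and the httpd path is borrowed from the dump.sh script in comment:1. The frame number is a parameter because the right frame depends on where op_array is visible.

```shell
#!/bin/bash
# Hypothetical helper: write the suggested GDB commands to a command file.
# $1 = output file, $2 = frame number that holds the op_array variable (default 1).
write_gdb_commands() {
    local out="$1" frame="${2:-1}"
    cat > "$out" <<EOF
bt full
frame $frame
print *op_array
EOF
}

# Example use (file names are assumptions, not taken from the ticket):
#   write_gdb_commands ./gdb.op_array 1
#   /usr/bin/gdb -batch -x ./gdb.op_array -e /usr/sbin/httpd -c ./core-file
```

With -batch, GDB exits after running the command file, so the output can be redirected straight into an attachment.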

comment:4 Changed 14 months ago by boxdev

Yes, #296 was the problem reported by David Schnepper. For the current segfaults, we don't see the same log messages as #296.

For your previous comment, did you mean "frame 0"?

Also, our Ops team has upgraded our servers from PHP 5.4.13 to PHP 5.4.19 since the last time we ran our tests. I will upload an attachment with the new stack traces.

Changed 14 months ago by boxdev

GDB output using "bt full".

comment:5 Changed 14 months ago by moo

This stack is different from the stacks you attached previously. It's frame 0 in this case. Whether to use frame 0 or 1 depends on which frame the "op_array" variable lives in.

The previous stacks are strange in that they only crash on request shutdown, while this stack traces back through unknown frames and then function_add_ref and autoload. Maybe the frame stack is corrupted.

The stack does look like a double free to me, but we still aren't sure which file (of the op_array) gets double freed. Maybe it is the one being autoloaded (class "box_service_action"), but I still have no idea how it gets to request_shutdown while autoloading box_service_action.

Can you investigate more of the core files you already have and see whether there's a pattern (similar points), and how many patterns there are (distinguished by sufficiently different points)?
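
One rough way to look for such patterns is to reduce each backtrace to the function names of its innermost frames and count how many core files share each signature. The sketch below assumes one saved GDB backtrace per text file and the usual "#N  0xADDR in func (...)" or "#N  func (...)" frame layout; the function names bt_signature and group_signatures are hypothetical.

```shell
#!/bin/bash
# Sketch: group GDB backtrace files by their innermost frames.

# Print the function names of the first DEPTH frames of one backtrace file.
bt_signature() {
    local file="$1" depth="${2:-5}"
    awk -v depth="$depth" '
        /^#[0-9]+ / {
            # Frame lines look like "#N  0xADDR in func (...)" or "#N  func (...)".
            name = ($2 ~ /^0x/ && $3 == "in") ? $4 : $2
            sig = sig (sig == "" ? "" : " < ") name
            if (++n >= depth) exit
        }
        END { print sig }
    ' "$file"
}

# Count how many backtrace files share each signature, most common first.
group_signatures() {
    local f
    for f in "$@"; do
        bt_signature "$f"
    done | sort | uniq -c | sort -rn
}

# Run only when backtrace files are passed on the command line.
if [ "$#" -gt 0 ]; then
    group_signatures "$@"
fi
```

Signatures that differ only in the outer frames collapse into one group when the depth is small, so it may be worth trying a few depths before deciding how many distinct crash patterns there are.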

comment:6 Changed 14 months ago by boxdev

Yes, you're right. We are seeing several different kinds of segfaults. The segfaults which involve shutdown are the easiest for us to reproduce right now. I can run tests and reliably generate core dumps for you.

I will upload another stack trace.

Changed 14 months ago by boxdev

Updated version of the GDB output.

comment:7 Changed 14 months ago by moo

Is it possible to narrow down the test case script so I can reproduce it?

comment:8 Changed 14 months ago by moo

Can you please change the xc_lock_destroy function to:

void xc_lock_destroy(xc_lock_t *lck) /* {{{ */
{
}
/* }}} */

(remove everything inside),
then recompile, install, and restart Apache to see whether it still crashes?

comment:9 Changed 14 months ago by moo

Which SAPI are you using?

comment:10 Changed 14 months ago by moo

  • Resolution set to fixed
  • Status changed from new to closed

In 1366:

fixes #323: refix locking impl for threaded env

comment:11 Changed 14 months ago by moo

In 1367:

cacher: merge [1366] from trunk; fixes #323: refix locking impl for threaded env

comment:12 Changed 14 months ago by boxdev

Hi. Did you still need us to run a test with the body of xc_lock_destroy() removed?

And this ticket has been marked fixed. Could you please let us know if the fix will be in xcache 3.0.4?

Thanks for the clarification.

comment:13 Changed 14 months ago by boxdev

Hi. Can you please let us know how we can get the fix? Will you build xcache 3.0.4?

Thanks for the information.

comment:14 Changed 14 months ago by moo

It will be in 3.0.4, but I'm not sure whether it fixes your problem. I need you to verify it in your environment,
so please download http://xcache.lighttpd.net/pub/snapshots/3.0-r1369/xcache-3.0-r1369.tar.bz2 and build it:

$ phpize && make all
$ su
# make install

Then restart Apache/PHP.

comment:15 Changed 14 months ago by boxdev

On our servers, we install PHP via rpms. Is there a way I can get an SRPM from you? Or is there a way to convert xcache-3.0-r1369.tar.bz2 into an SRPM?

Thanks.

comment:16 Changed 14 months ago by boxdev

Hi. I was able to make an SRPM out of xcache-3.0-r1369.tar.bz2 and install it onto our test servers. When I re-ran our tests, we still observed segfaults. I will provide more data in another ticket.

comment:17 Changed 13 months ago by moo

  • Milestone changed from undecided to 3.0.4