How I fixed a Mozmill freeze by bisecting Python

Last week we had a very frustrating situation with Mozmill. It caused us some headaches because it came up at the time when we tried to trigger our new test-run for add-ons ‘Default to Compatible’ for the first time. While I was working on the necessary Python script and testing it excessively on OS X 10.7, I haven’t noticed any issue. Everything was working as expected. Then I asked William Lachance to assist me in testing the script on Windows and Linux. So as you may guess now, he came back to me with a strange issue which was constantly happening for him on both platforms. In detail the script was able to start the Mozmill tests but when Firefox had to be restarted the Mozmill process kept hanging and caused multiple instances of Firefox to be opened after a while. Because I have never seen a situation like that I asked him to file a bug. A bit later we got bug 722707 which we had to investigate and fix quickly now.

Because it was already too late for me to start working on the investigation, I asked Clint Talbert to have a look at it if possible. He did and came up with an instrumented patch including a lot of debug statements. Also his comment on the bug showed me that something should be really wrong in the network.py module of JSbridge:

So, this is the closest to one of those “how the hell did this ever work” moments I’ve ever seen.

While Clint delivered helpful information it was still hard for me to know where to begin with the investigation of this regression. So my first task was to setup the system as used for DTC testing for Windows and Linux. Running my tests under the same conditions should hopefully reveal the same issue for me. And voila, already the first test-run on Windows and Linux showed me the same results as William has seen. I was really surprised. So what has been changed? I would probably see when diving deeper into the issue.

Because the script for DTC testing is only a wrapper script which calls some of our existing Mozmill test-run scripts from the Mozmill Automation repository, I executed one of those to check if my above mentioned script has caused this regression. This was not the case because the Endurance test-run was also affected. What!? How can this be? Why don’t we see this issue on our own boxes which do the daily and release tests? It was getting more mystically. So as next step I did a test with the pure mozmill-restart command and executed the endurance tests from our Mozmill Tests repository:

$ mozmill-restart -b /home/builds/nightly/firefox -t tests/endurance/

As I have already expected this test also failed. I thought: ‘What a bummer!’ but was happy at the same time that it has not been caused by my code. So we clearly had a problem with Mozmill on those platforms. But why? Ok, lets figure that out now. The only variations I was able to think of were the following changes which happened in the last couple of days:

A possible system update for Ubuntu and Windows could have caused this in our VMs.
For the Mozmill Environment we moved to a newer version of virtualenv.
Platforms have different versions of Python installed. While OS X Lion is still using Python 2.6.7, Linux and Windows are with Python 2.7.2.

The first thing I tried was to rollback the Windows VM to a snapshot which was clearly working. After doing so and updating the Mozmill Environment to the latest version, I haven’t seen any difference. Mozmill still hangs when Firefox gets restarted. So what’s next? I replaced the Python 2.7.2 installation on the Windows VM with Python 2.6.7, because it hasn’t shown the problem on OS X. After the environment has been rebuilt, the next test-run was successful and the hang didn’t occur anymore. That means we have a regression based on a change in Python between 2.6.7 and 2.7.2. Yay, I’m getting closer! Next, figure out the exact Python version which caused Mozmill to fail. That was a quick task because only Python 2.7 and Python 2.7.1 had to be tested. As the results have been shown the regression has been caused by Python 2.7.2. That also makes it clear why the failure didn’t occur on OS X up to Lion.

So which changes have been made in Python 2.7.2? Checking the download page of Python revealed the necessary information in the change log. But those are way too many changes to get any idea of possible candidates. It looks like that I have to test nightly builds of Pythons. Asking in the Python channel on Freenode where those builds can be found, I got the answer that for Python no nightly builds are getting produced and I should use ‘hg bisect’ to get the wanted changeset. Hm, I clearly have not expected to build Python on my own. But well, it looks like that this is the only way to get closer to a possible fix.

Because I haven’t setup Windows as a build machine, I planned to do the regression test on OS X. But that also meant that I had to check if Python 2.7.2 on OS X is affected too. Thankfully the Python Developer’s Guide gave me all the information to get started with the build process. The first step is to clone the Mercurial repository of Python:

$ hg clone http://hg.python.org/cpython

Afterward you have to run the ./configure script in the newly created cpython subfolder to setup the build process for your platform:

$ ./configure --with-pydebug

With that done I have to find the correct tag of the Python 2.7.2 release:

$ hg tags
...
v2.7.2                         70777:8527427914a2
v2.7.2rc1                      70508:f48756685406
v3.1.4rc1                      70504:32fcb9e94985
...
v2.7.1                         66511:5395f96588d4
...

Now it’s easy to update the code and get the build process started:

$ hg up v2.7.2
3570 files updated, 0 files merged, 714 files removed, 0 files unresolved
$ make -j8

The good thing on Python is that most code is located in Python modules and the interpreter itself can be build really quick. With my 8-core CPU the initial build took about 1 minute only. Wow!

To be able to run Mozmill tests the local site-packages folder of the custom Python build has to be prepared. I wanted to avoid that by creating a new virtual environment with the Python build as interpreter but that didn’t work:

$ virtualenv -p ./python.exe test
Running virtualenv with interpreter ./python.exe
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 2098, in 
    main()
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 928, in main
    never_download=options.never_download)
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 1029, in create_environment
    site_packages=site_packages, clear=clear))
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 1166, in install_python
    copy_required_modules(home_dir)
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 1118, in copy_required_modules
    dst_filename = change_prefix(filename, dst_prefix)
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 1103, in change_prefix
    (filename, prefixes)
AssertionError: Filename /Volumes/data/code/cpython/Lib/os.py does not start with any of these prefixes: ['/usr/local', '/Library/Python/2.7/site-packages', '/usr/local/Extras/lib/python', '/Volumes/data/code/cpython/~/Library/Python/2.7/site-packages']
[39360 refs]

So I ended up with installing setuptools and Mozmill manually into ./Lib/site-packages:

$ export PYTHONPATH=./Lib/site-packages/
$ curl -O http://peak.telecommunity.com/dist/ez_setup.py
$ python ez_setup.py --install-dir Lib/site-packages/ setuptools
$ Lib/site-packages/easy_install --install-dir Lib/site-packages/ mozmill==1.5.8

Because we want to start Mozmill via the mozmill-restart command, we have to change the interpreter of the Lib/site-packages/mozmill-restart script to point to our self-build Python interpreter. Simply replace the shebang with your local path, e.g. #!/data/code/cpython/python.exe.

Now the tests can be executed with:

$ Lib/site-packages/mozmill-restart -b /Applications/Firefox.app/ -t mozmill-tests/nightly/tests/endurance/

As I was hoping the issue is reproducible with Python 2.7.2 on OS X. That gives me the chance to start bisecting with hg to find the causing changeset. To initialize the bisect module of hg a good and bad build have to be specified. Because the bad revision is already set, it can directly marked as bad:

$ hg bisect --bad
$ hg up -r v2.7.1
1388 files updated, 0 files merged, 34 files removed, 0 files unresolved
$ hg bisect --good
Testing changeset 68537:2baaabf6bb05 (580 changesets remaining, ~9 tests)
1302 files updated, 0 files merged, 2 files removed, 0 files unresolved

All in all that gives us 580 changesets to test. Mercurial is kinda good in estimating the amount of tests, so we can expect to build Python and to run the Mozmill test about 9 times. The next changeset to test ‘2baaabf6bb05‘ gets automatically selected by bisect. We only have to build and execute the test again. For me the above changeset also failed and has to be marked as bad:

$ hg bisect --bad
Testing changeset 67486:e4d07a6f7abf (290 changesets remaining, ~8 tests)
1248 files updated, 0 files merged, 2 files removed, 0 files unresolved

Repeat those steps by specifying –bad or –good until the first bad changeset has been found:

The first bad revision is:
changeset:   68064:965ab9911fcd
branch:      2.7
parent:      68057:1797300f87b9
user:        (removed due to privacy)
date:        Thu Mar 03 14:17:51 2011 +0000
summary:     Merged revisions 88722 via svnmerge from

So what has actually be changed? With the changeset ‘965ab9911fcd‘ in the hand we can check the remote repository for more information about this specific changeset. As you can see it’s a SVN merge changeset which itself contains changes for 7 issues. Given Clint’s former information about network.py and asyncore involved lets see if we can find something specific in that area. And yes, there is only revision 88722 which exactly has code changes in asyncore.py. Lets focus on it.

With this revision a new array named ‘_DISCONNECTED‘ has been introduced. It contains all those error codes the module has to take care of. Therefore some lines of code have been changed to use this new array. For us it means one of those lines should have caused the freeze in JSbridge. For me the easiest solution was to simply revert those lines step by step to the old code, and building and testing again. In the forth cycle with reverting line 1.47 the issue was fixed and no hang occurred:

-  if why.args[0] in [ECONNRESET, ENOTCONN, ESHUTDOWN, ECONNABORTED]:
+  if why.args[0] in _DISCONNECTED:

Because those two rows don’t give as much as information as I need to fix the bug in JSbridge, lets check the the current version of asyncore.py in which method the change has taken place. So it’s:

def recv(self, buffer_size):
        try:
            data = self.socket.recv(buffer_size)
            if not data:
                # a closed connection is indicated by signaling
                # a read condition, and having recv() return 0.
                self.handle_close()
                return ''
            else:
                return data
        except socket.error, why:
            # winsock sometimes throws ENOTCONN
            if why.args[0] in _DISCONNECTED:
                self.handle_close()
                return ''
            else:
                raise

So which entry in _DISCONNECTED is causing this problem? Comparing the change ‘EPIPE‘ and ‘EBADF‘ have both been added. Lets remove ‘EPIPE‘ and test again… That didn’t help, so what about ‘EBADF‘? Yeah, with it removed the hang is gone. So we have the cause of the problem. But what is that ‘EBADF’ code? Checking the Python documentation for error codes shows:

errno.EBADF
Bad file number

We do not have a file descriptor here, but it should also be thrown for a bad socket connection. And with the code changes from above the recv method now handles such an exception on its own. It’s not getting passed to the caller method anymore. Could this be related? Lets check the code of JSbridge and where the recv method gets called:

def read_all(self):
        import socket
        data = ''
        while 1:
            try:
                data += self.recv(4096)
            except socket.error:
                return data

There is only one instance in the network.py file and it’s part of the Telnet class. Here we are entering a while loop which will always run until a socket error happens. What? Really? This looks really suspicious. As I have mentioned above the code in asyncore.py takes care about socket errors itself and doesn’t pass it to the caller anymore. But here in JSbridge we rely on such an exception to break out from the infinite loop. So this is definitely the problem which needs to be fixed. We are only allowed to run the loop as long as a valid socket connection exists.

At the end the patch to fix this hang in JSbridge was kinda simple:

-        while 1:
+        while self.connected:
             try:
                 data += self.recv(4096)
             except socket.error:
-                return data
+                break
+        return data

Now the loop is getting aborted when the socket connection gets closed. And I kept the except clause to ensure the backward compatibility with Python <2.7.2.

This patch has been checked-in on all development branches for Mozmill and the fix will be available in the upcoming release of Mozmill 1.5.9.

For me it was a wonderful and exciting challenge to figure out the real cause and finally being able to fix the hang in Mozmill or better in JSbridge.

Profile cancel