MemChaser – An extension to track Firefox’s memory activities

Today I want to announce a new extension for Firefox. It has been developed by the Mozilla QA Automation Services team, which supports the Mozilla QA community by developing tools and frameworks that make testing easier. This time, with this extension, we even target (web) developers and our huge Firefox user base. So what is it about?

In mid-December 2011, bug 711900 was filed. It complained about increased garbage and cycle collector activity in Firefox 10 that was not present in Firefox 9. Mozilla QA was asked to analyze that behavior and find possible use cases for reproduction. While working on it, we noticed that observing the relevant information about memory usage and garbage/cycle collector activity wasn't very comfortable. For the memory usage you had to constantly watch and refresh 'about:memory' in a separate tab, and for the collector activity you even had to keep the Error Console open and a specific hidden preference set.

To make this easier I had the idea of creating an extension that would present all the necessary information in a prominent place and keep it visible all the time. There should be no need to switch tabs or scrape the Error Console for output. Scraping was hard anyway, because entries constantly get wiped out once a maximum number of messages has been reached, and copying out this data is a hassle. So let's create an extension with logging capabilities…

Some days later, after an internal discussion across teams, we started development and called it the MemChaser project. With fantastic help from members of the SpiderMonkey team – namely Nicholas Nethercote and Andrew McCreight – we got a very early prototype set up rather quickly. But as it turned out, an extension that needs a restart to get installed is pretty useless if the user is already facing a strange memory-related issue. So one of the bigger tasks was to get familiar with the Add-on SDK project, aka Jetpack, and to transform the extension into a restartless one. The work wasn't that hard, and it gave us much better structured code and a nice toolchain to test and build the extension. Finally, with all the features implemented, we were able to release the MemChaser extension on addons.mozilla.org.

The current set of features is not that big, but it covers all the requirements we got from the QA team. We will make sure to enhance the extension drastically over the next couple of months, so that it doesn't just display current values in the Add-on bar. But what does it do right now?

As you can see in the image, a couple of values are displayed. But what do they stand for? Here is the answer:

  • Resident: Memory mapped by the process that is present in physical memory. It's usually the memory usage you will see in the system's process manager.
  • GC: Duration of the last Garbage Collector activity and, in brackets, the time when the second-to-last activity happened.
  • CC: Duration of the last Cycle Collector activity and, in brackets, the time when the second-to-last activity happened.
  • Logging: Starts and stops the internal logger, which writes the current memory usage and GC/CC activities to a log file in the memchaser subdirectory of your Firefox profile (see the example path below).
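If you want to look at the raw log files, they live inside your profile folder. On OS X, for example, the path would look something like the following; the exact profile folder name differs per installation:

$ ls ~/Library/Application\ Support/Firefox/Profiles/*.default/memchaser/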

All this information already helps a lot to keep track of memory-related activities and lets you see whether Firefox is using too much memory or is running long GC/CC cycles, which can freeze the Firefox UI for short moments.

As mentioned above, we will continuously enhance the extension's features. We already have some great ideas, and I will certainly blog more about them in the next couple of days, once we have updated our documentation and set milestones for the next releases.

For now, please install the extension in your daily profile and watch the memory activity. If you notice strange behavior, turn on logging and let us know by contacting us directly.

Mozmill 1.5.9 released

As noted in yesterday's blog post about the freeze of Mozmill in the JSBridge module when Python 2.7.2 is used, there was an urgent need for a new version of Mozmill.

Late yesterday we were finally able to release Mozmill 1.5.9 on PyPI. It only includes the fix for bug 722707 and allows everyone running on Python 2.7.2 to use Mozmill again. If you are affected, please upgrade immediately.
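If you installed Mozmill from PyPI, the upgrade should be as simple as the following (assuming a pip-based setup; easy_install users can upgrade accordingly):

$ pip install --upgrade mozmill==1.5.9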

How I fixed a Mozmill freeze by bisecting Python

Last week we had a very frustrating situation with Mozmill. It caused us some headaches because it came up right when we tried to trigger our new test-run for the add-ons 'Default to Compatible' (DTC) effort for the first time. While I was working on the necessary Python script and testing it extensively on OS X 10.7, I hadn't noticed any issues. Everything was working as expected. Then I asked William Lachance to assist me in testing the script on Windows and Linux. As you may guess by now, he came back to me with a strange issue that was constantly happening for him on both platforms. In detail: the script was able to start the Mozmill tests, but when Firefox had to be restarted, the Mozmill process kept hanging and after a while caused multiple instances of Firefox to be opened. Because I had never seen a situation like that, I asked him to file a bug. A bit later we had bug 722707, which now had to be investigated and fixed quickly.

Because it was already too late in the day for me to start the investigation, I asked Clint Talbert to have a look at it if possible. He did, and came up with an instrumented patch including a lot of debug statements. His comment on the bug also showed me that something must be really wrong in the network.py module of JSBridge:

So, this is the closest to one of those “how the hell did this ever work” moments I’ve ever seen.

While Clint delivered helpful information, it was still hard for me to know where to begin investigating this regression. So my first task was to set up the same systems as used for DTC testing on Windows and Linux. Running my tests under the same conditions would hopefully reveal the same issue for me. And voilà, already the first test-run on Windows and Linux showed me the same results William had seen. I was really surprised. So what had changed? I would probably find out by diving deeper into the issue.

Because the script for DTC testing is only a wrapper that calls some of our existing Mozmill test-run scripts from the Mozmill Automation repository, I executed one of those to check whether my above-mentioned script had caused this regression. It had not, because the Endurance test-run was also affected. What!? How can this be? Why don't we see this issue on our own boxes, which run the daily and release tests? It was getting more mysterious. As a next step I did a test with the pure mozmill-restart command and executed the endurance tests from our Mozmill Tests repository:

$ mozmill-restart -b /home/builds/nightly/firefox -t tests/endurance/

As I had already expected, this test also failed. I thought: 'What a bummer!' but was happy at the same time that it had not been caused by my code. So we clearly had a problem with Mozmill on those platforms. But why? OK, let's figure that out. The only variations I could think of were the following changes, which had happened in the last couple of days:

  • A possible system update for Ubuntu and Windows could have caused this in our VMs.
  • For the Mozmill Environment we moved to a newer version of virtualenv.
  • The platforms have different versions of Python installed. While OS X Lion still ships Python 2.6.7, Linux and Windows are on Python 2.7.2 (a quick way to confirm the version on each box is sketched below).
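To be sure which interpreter a given box actually uses, a quick check with the stock tools is enough (nothing Mozmill-specific here):

$ python --version
Python 2.7.2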

The first thing I tried was to roll back the Windows VM to a snapshot that was clearly working. After doing so and updating the Mozmill Environment to the latest version, I didn't see any difference. Mozmill still hung when Firefox got restarted. So what's next? I replaced the Python 2.7.2 installation on the Windows VM with Python 2.6.7, because that version hadn't shown the problem on OS X. After the environment had been rebuilt, the next test-run was successful and the hang didn't occur anymore. That meant we had a regression caused by a change in Python between 2.6.7 and 2.7.2. Yay, I was getting closer! Next: figure out the exact Python version that caused Mozmill to fail. That was a quick task, because only Python 2.7 and Python 2.7.1 had to be tested. As the results showed, the regression was introduced in Python 2.7.2. That also makes it clear why the failure didn't occur on OS X up to Lion.

So which changes were made in Python 2.7.2? Checking the Python download page revealed the necessary information in the change log. But those are way too many changes to get any idea of possible candidates. It looked like I had to test nightly builds of Python. Asking in the Python channel on Freenode where those builds could be found, I got the answer that no nightly builds are produced for Python and that I should use 'hg bisect' to find the offending changeset. Hm, I clearly had not expected to build Python on my own. But well, it looked like this was the only way to get closer to a possible fix.

Because I hadn't set up Windows as a build machine, I planned to do the regression test on OS X. But that also meant I had to check whether Python 2.7.2 on OS X is affected too. Thankfully the Python Developer's Guide gave me all the information needed to get started with the build process. The first step is to clone the Mercurial repository of Python:

$ hg clone http://hg.python.org/cpython

Afterward you have to run the ./configure script in the newly created cpython subfolder to set up the build process for your platform:

$ ./configure --with-pydebug

With that done I have to find the correct tag of the Python 2.7.2 release:

$ hg tags
...
v2.7.2                         70777:8527427914a2
v2.7.2rc1                      70508:f48756685406
v3.1.4rc1                      70504:32fcb9e94985
...
v2.7.1                         66511:5395f96588d4
...

Now it’s easy to update the code and get the build process started:

$ hg up v2.7.2
3570 files updated, 0 files merged, 714 files removed, 0 files unresolved
$ make -j8

The good thing about Python is that most code is located in Python modules and the interpreter itself can be built really quickly. With my 8-core CPU the initial build took only about a minute. Wow!

To be able to run Mozmill tests, the local site-packages folder of the custom Python build has to be prepared. I wanted to avoid that by creating a new virtual environment with the custom build as its interpreter, but that didn't work:

$ virtualenv -p ./python.exe test
Running virtualenv with interpreter ./python.exe
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 2098, in <module>
    main()
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 928, in main
    never_download=options.never_download)
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 1029, in create_environment
    site_packages=site_packages, clear=clear))
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 1166, in install_python
    copy_required_modules(home_dir)
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 1118, in copy_required_modules
    dst_filename = change_prefix(filename, dst_prefix)
  File "/Library/Python/2.7/site-packages/virtualenv.py", line 1103, in change_prefix
    (filename, prefixes)
AssertionError: Filename /Volumes/data/code/cpython/Lib/os.py does not start with any of these prefixes: ['/usr/local', '/Library/Python/2.7/site-packages', '/usr/local/Extras/lib/python', '/Volumes/data/code/cpython/~/Library/Python/2.7/site-packages']
[39360 refs]

So I ended up installing setuptools and Mozmill manually into ./Lib/site-packages:

$ export PYTHONPATH=./Lib/site-packages/
$ curl -O http://peak.telecommunity.com/dist/ez_setup.py
$ python ez_setup.py --install-dir Lib/site-packages/ setuptools
$ Lib/site-packages/easy_install --install-dir Lib/site-packages/ mozmill==1.5.8
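As a quick sanity check – my own verification step, not part of any official instructions – you can confirm that the custom interpreter picks up the freshly installed packages:

$ ./python.exe -c "import mozmill; print mozmill.__file__"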

Because we want to start Mozmill via the mozmill-restart command, we have to change the interpreter of the Lib/site-packages/mozmill-restart script to point to our self-built Python interpreter. Simply replace the shebang with your local path, e.g. #!/data/code/cpython/python.exe.
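If you prefer not to edit the file by hand, a sed one-liner does the job as well (shown with the BSD sed shipped on OS X; adjust the path to match your checkout):

$ sed -i '' '1s|^#!.*|#!/data/code/cpython/python.exe|' Lib/site-packages/mozmill-restart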

Now the tests can be executed with:

$ Lib/site-packages/mozmill-restart -b /Applications/Firefox.app/ -t mozmill-tests/nightly/tests/endurance/

As I was hoping, the issue is reproducible with Python 2.7.2 on OS X. That gave me the chance to start bisecting with hg to find the offending changeset. To initialize the bisect module of hg, a good and a bad revision have to be specified. Because the bad revision (v2.7.2) is already checked out, it can directly be marked as bad:

$ hg bisect --bad
$ hg up -r v2.7.1
1388 files updated, 0 files merged, 34 files removed, 0 files unresolved
$ hg bisect --good
Testing changeset 68537:2baaabf6bb05 (580 changesets remaining, ~9 tests)
1302 files updated, 0 files merged, 2 files removed, 0 files unresolved

All in all that gives us 580 changesets to test. Mercurial is kinda good at estimating the number of tests required, so we can expect to build Python and run the Mozmill test about 9 times. The next changeset to test, '2baaabf6bb05', gets automatically selected by bisect. We only have to build and execute the test again. For me the above changeset also failed and had to be marked as bad:

$ hg bisect --bad
Testing changeset 67486:e4d07a6f7abf (290 changesets remaining, ~8 tests)
1248 files updated, 0 files merged, 2 files removed, 0 files unresolved
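Each round of the bisect is mechanical: rebuild the interpreter, re-run the endurance test, and mark the changeset according to the result. Roughly, with the paths used above:

$ make -j8
$ Lib/site-packages/mozmill-restart -b /Applications/Firefox.app/ -t mozmill-tests/nightly/tests/endurance/
$ hg bisect --bad    # or --good if the hang did not occur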

Repeat those steps by specifying --bad or --good until the first bad changeset has been found:

The first bad revision is:
changeset:   68064:965ab9911fcd
branch:      2.7
parent:      68057:1797300f87b9
user:        (removed due to privacy)
date:        Thu Mar 03 14:17:51 2011 +0000
summary:     Merged revisions 88722 via svnmerge from

So what has actually been changed? With the changeset '965ab9911fcd' in hand, we can check the remote repository for more information about this specific changeset. As you can see, it's an SVN merge changeset which itself contains changes for 7 issues. Given Clint's earlier information about network.py and asyncore being involved, let's see if we can find something specific in that area. And yes, revision 88722 is the only one with code changes in asyncore.py. Let's focus on it.

With this revision a new constant named '_DISCONNECTED' has been introduced. It contains all the error codes the module has to take care of, and some lines of code have been changed to use it. For us this means one of those lines must have caused the freeze in JSBridge. The easiest approach for me was to simply revert those lines step by step to the old code, rebuilding and testing each time. In the fourth cycle, after reverting line 1.47, the issue was fixed and no hang occurred:

-  if why.args[0] in [ECONNRESET, ENOTCONN, ESHUTDOWN, ECONNABORTED]:
+  if why.args[0] in _DISCONNECTED:

Because those two lines don't give as much information as I need to fix the bug in JSBridge, let's check the current version of asyncore.py to see in which method the change has taken place:

    def recv(self, buffer_size):
        try:
            data = self.socket.recv(buffer_size)
            if not data:
                # a closed connection is indicated by signaling
                # a read condition, and having recv() return 0.
                self.handle_close()
                return ''
            else:
                return data
        except socket.error, why:
            # winsock sometimes throws ENOTCONN
            if why.args[0] in _DISCONNECTED:
                self.handle_close()
                return ''
            else:
                raise

So which entry in _DISCONNECTED is causing this problem? Comparing the change, 'EPIPE' and 'EBADF' have both been added. Let's remove 'EPIPE' and test again… That didn't help, so what about 'EBADF'? Yeah, with it removed the hang is gone. So we have found the cause of the problem. But what is that 'EBADF' code? Checking the Python documentation for error codes shows:

errno.EBADF
Bad file number

We do not have a file descriptor here, but the error should also be raised for a bad socket connection. And with the code changes from above, the recv method now handles such an exception on its own; it's no longer passed on to the caller. Could this be related? Let's check the JSBridge code and see where the recv method gets called:

    def read_all(self):
        import socket
        data = ''
        while 1:
            try:
                data += self.recv(4096)
            except socket.error:
                return data

There is only one instance in the network.py file, and it's part of the Telnet class. Here we enter a while loop that will run until a socket error happens. What? Really? This looks really suspicious. As mentioned above, the code in asyncore.py now takes care of such socket errors itself and doesn't pass them on to the caller anymore. But here in JSBridge we rely on such an exception to break out of the infinite loop. So this is definitely the problem that needs to be fixed. We should only run the loop as long as a valid socket connection exists.

In the end, the patch to fix this hang in JSBridge was kinda simple:

-        while 1:
+        while self.connected:
             try:
                 data += self.recv(4096)
             except socket.error:
-                return data
+                break
+        return data

Now the loop is aborted when the socket connection gets closed, and I kept the except clause to ensure backward compatibility with Python <2.7.2.
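Applied to the method shown earlier, the patched read_all should end up looking roughly like this (my own rendering of the diff, not a verbatim copy of the final source):

def read_all(self):
    import socket
    data = ''
    # only keep reading while the socket connection is still alive
    while self.connected:
        try:
            data += self.recv(4096)
        except socket.error:
            # older Python versions may still raise here; keep what we got
            break
    return data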

This patch has been checked in on all development branches of Mozmill, and the fix will be available in the upcoming Mozmill 1.5.9 release.

For me it was a wonderful and exciting challenge to figure out the real cause and finally be able to fix the hang in Mozmill, or rather in JSBridge.

FOSDEM 2012, Add-on SDK, and slides for Mozmill CI

Over the last weekend I had the pleasure of participating in the Free and Open source Software Developers' European Meeting – aka FOSDEM – in Brussels, an annual conference for open source projects and their enthusiastic communities. It's fascinating not only to listen to the scheduled talks, but also to meet fellows again whom I haven't seen for a year or even longer.

As every year, I spent a bit of time before the event preparing a schedule of interesting talks and other organizational stuff. But as it turned out once again, setting up a schedule of talks is useless for me, because I would sit in the Mozilla DevRoom most of the time anyway. There are so many interesting presentations and discussions that it is hard for me to leave.

This year some bits were different. Given my cold I was not able to participate in the Beer Event at the Delirium Cafe. It's usually the perfect initial get-together of this conference to greet people and have the first conversations. I kinda missed it, but the remaining work for the week did a good job of not letting me think about it for too long. The other news we noticed on arrival at the Solbosch university campus: surprisingly, the locations had been modified a bit to give everyone more space. The biggest change was definitely the move of the booths to another building, which wasn't that close to the rest of the areas of interest. So I headed over only once to check on our Mozilla booth. IMO we didn't have such a great location this time, but according to our staff a lot of people stopped by anyway. While checking the other booths I also found the one of the Jenkins project, where Kohsuke Kawaguchi – the creator of Jenkins (formerly Hudson) – was able to help me with all of my questions. I call this a success – thanks Kohsuke!

As my personal highlight, I had my own talk about Continuous Integration with Mozmill scheduled on Sunday at noon. It went pretty well and I got some positive feedback. If you weren't able to make it, you can check my slides here:

The most impressive conversation and follow-up work definitely happened with the folks from the Mozilla Add-on SDK project. Starting from my quick question about foreseeable support for Thunderbird, it ended up in a lot of discussions and a final hacking session on Sunday afternoon, which I sadly wasn't able to attend. So I'm absolutely interested in the results. Oh, and while traveling back home I spent about an hour contributing my first patch to the project, which will let the builder add contributor names to the extension's install.rdf file.

All in all it was a stunning weekend, and I hope to be able to attend again next year. A big thanks to Benoit and the Belgian community for taking care of the organization!

Mozmill Crowd 0.1.5 released

Yesterday we had to release Mozmill Crowd 0.1.5 due to changes in our reporting infrastructure. Because of the outage of one of our servers, our Mozmill Dashboard is now located at http://mozmill-crowd.blargon7.com/. This move required an update of the default report URL. Now you can send reports of your test results again.

We have to admit that not much work has happened on the extension in the last couple of months. The reasons are mostly other important projects, mainly the development of the new pre-configured environment for Mozmill. There are still some outstanding improvements to implement before we can release a new version of the environment.

If you are interested in helping out, please get in contact with us via our team page or any other public channel.