April 26, 2012
Fixing the Memory Leak

The MemShrink effort that has been underway at Mozilla for the last several months has substantially decreased the memory usage of Firefox for most users.  There are still some remaining issues that lead to pathological memory use.  One of those issues is leaky addons, which Nick has identified as the single most important MemShrink issue.

In Firefox, the JavaScript heap is split into compartments.  Firefox’s UI code, which is written in JS, lives in the privileged “chrome” compartment.  Addon code also usually lives in the chrome compartment.  Websites live in different, unprivileged compartments.  Exactly how compartments are allocated to websites is beyond the scope of this article, but at the time of writing there is roughly one compartment per domain.  Code running in the chrome compartment can hold references to objects in the content compartments (much like how a page can hold references to objects in an iframe).

As an example of how this might look in practice, let’s imagine we have Firefox open to three tabs: GMail, Twitter, and Facebook, and we have some sort of social media addon installed.  Our compartments might look something like this:

In this diagram, the blue lines are the references the Firefox UI is holding and the red lines are the references the addon is holding.

The problems start to arise if these references aren’t cleaned up properly when the tab is navigated or closed.  If the Facebook tab is closed, but not all of those references are cleaned up, some or all of the memory the Facebook tab was using is not released.  The result is popularly known as a zombie compartment, and is a big source of leaks in Firefox.

Chrome (privileged UI or other JS) code that leaks is particularly problematic because the leak usually persists for the lifetime of the browser.  When chrome code leaks, say, facebook.com, it leads to dozens of megabytes of memory being lost.  It turns out that writing chrome code that doesn’t leak can actually be quite difficult.  Even the Firefox front end code, which is worked on by a number of full time engineers and has extensive code review, has a number of leaks.  We can find and fix those, but addons are a much harder problem, and we can’t expect addon authors to be as diligent as we try to be in finding and fixing leaks.  The only defense we have had is the AMO review team and our list of best practices.

That changed last night when I landed Bug 695480.  Firefox now attempts to clean up after leaky chrome code.  My approach takes advantage of the fact that chrome code lives in a separate compartment from web page code.  This means that every reference from chrome code to content code goes through a cross-compartment wrapper, which we maintain in a list.  When the page is navigated, or a tab is closed, we reach into the chrome compartment and grab this list.  We go through this list and “cut” all of the wrappers that point to objects in the page we’re getting rid of.  The garbage collector can then reclaim the memory used by the page that is now gone.
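To make the mechanism concrete, here is a toy model in Python.  It is only a sketch of the idea, not the code from Bug 695480: the class names (Page, Wrapper, ChromeCompartment) and the idea of tagging each wrapper with the page it points into are assumptions made for illustration.

```python
class DeadWrapperError(Exception):
    """Raised when code touches a wrapper that has already been cut."""


class Page:
    """Hypothetical stand-in for an object living in a content compartment."""
    def __init__(self, title):
        self.title = title


class Wrapper:
    """Toy cross-compartment wrapper: chrome code only ever holds one of these."""
    def __init__(self, target, origin):
        self._target = target
        self.origin = origin  # which page the wrapped object belongs to

    def get(self, name):
        if self._target is None:
            raise DeadWrapperError("wrapper into %s was cut" % self.origin)
        return getattr(self._target, name)

    def cut(self):
        # Drop the only strong reference so the collector can free the target.
        self._target = None


class ChromeCompartment:
    """Keeps the list of every wrapper handed out to chrome code."""
    def __init__(self):
        self.wrappers = []

    def wrap(self, target, origin):
        wrapper = Wrapper(target, origin)
        self.wrappers.append(wrapper)
        return wrapper

    def cut_wrappers_for(self, origin):
        """Called when a page is navigated away from or its tab is closed."""
        for wrapper in self.wrappers:
            if wrapper.origin == origin:
                wrapper.cut()
        self.wrappers = [w for w in self.wrappers if w.origin != origin]


chrome = ChromeCompartment()
leaked = chrome.wrap(Page("facebook.com"), "facebook.com")
chrome.cut_wrappers_for("facebook.com")  # the tab was closed
try:
    leaked.get("title")
except DeadWrapperError as err:
    print("chrome code sees:", err)
```

The leaky code still holds its wrapper, but the wrapper no longer keeps the page alive, which is what lets the collector reclaim the memory.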

The result looks something like:

Code that accidentally (or intentionally!) holds references to objects in pages that are gone will no longer leak.  If the code tries to touch the object after the wrapper has been “cut”, it will get an exception.  This may break certain code patterns.  A few examples:

  • Creating a DOM node from a content document and storing it in a global variable for indefinite use.  Once the page you created the node from is closed your node will vanish.  Here’s an example of code in Firefox that used to do that.
  • Creating a closure over DOM objects can break if those objects can go away before the closure is invoked.  Here’s some code in Firefox that did that.  In one of our tests in our test suite the tab closed itself before the timeout ran, resulting in an exception being thrown.

Addon authors probably don’t need to bother changing anything unless they see breakage.  Breakage should be pretty rare, and the huge upside of avoided leaks will be worth it.  It’s a little early to be sure what effects this will have, but the number of leaks we see on our test suite dropped by 80%.  I expect that this change will also fix a majority of the addon leaks we see, without any effort on the part of the addon authors.

February 23, 2012
Address Space Layout Randomization now mandatory for binary components

This evening I landed Bug 728429 on mozilla-central.  Firefox will now refuse to load XPCOM component DLLs that do not implement ASLR.  ASLR is an important defense-in-depth mechanism that makes it more difficult to successfully exploit a security vulnerability.  Firefox has used ASLR on its core components for some time now, but many extensions that ship with binary components do not.

ASLR is on by default on modern versions of Visual Studio, so extension authors will only need to ensure that they haven’t flipped the switch to turn it off.  MSDN documentation on ASLR options is available here.  Further reading about the benefits of ASLR is available here.
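For extension authors who want to check a DLL themselves, here is a rough sketch in Python.  It does not reproduce the check Firefox performs (this post does not describe it); it only reads the flag that the Visual Studio linker's /DYNAMICBASE option sets in the PE header, IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE (0x0040).  The function name has_aslr is made up for this example.

```python
import struct

# Bit in the optional header's DllCharacteristics field that opts in to ASLR.
IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040


def has_aslr(image):
    """Return True if the PE image bytes have the DYNAMIC_BASE flag set."""
    if image[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # Skip the 4-byte signature and the 20-byte COFF header.
    optional_header = pe_offset + 4 + 20
    # DllCharacteristics sits 70 bytes into the optional header
    # for both PE32 and PE32+ images.
    (dll_characteristics,) = struct.unpack_from("<H", image, optional_header + 70)
    return bool(dll_characteristics & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE)
```

Usage would be along the lines of has_aslr(open("mycomponent.dll", "rb").read()), where the file name is a placeholder for your own binary component.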

If no unexpected problems arise, this change will ship in Firefox 13.

December 19, 2011
Pushing Compilers to the Limit (and Beyond)

At the end of the first week of December Firefox exceeded the memory limits of the Microsoft linker we use to produce our highly optimized Windows builds.  After the problem was identified we took some emergency steps, disabling some new and experimental features, to ensure that people could continue to land changes to parts of Firefox not affected by this issue.  Once that was complete we were able to make some other changes that reduced the memory used by the linker back below the limits.  We were then able to undo those emergency steps and turn those features back on.

This will have no lasting impact on what is or is not shipped in Firefox 11.  The issues described here only affected Firefox developers, and have nothing to do with the memory usage or other performance characteristics of the Firefox binaries shipped to users.

Technical Details

Recently we began seeing sporadic compilation failures in our optimized builds on Windows.  After some debugging we determined that the problem was that the linker was running out of virtual address space.  In essence, the linker couldn’t fit everything it needed into memory and crashed.

The build configuration that was failing is not our normal build configuration.  It uses Profile-Guided Optimization, fancy words meaning that it runs some benchmarks that we give it and then uses that information to determine what to optimize for speed and what optimizations to use.  It also uses Link-Time Code Generation, which means that instead of the traditional compilation model, where the compiler generates code and the linker glues it all together, the linker does all of the code generation.  These two optimization techniques are quite powerful (they generally win 10-20% on various benchmarks that we have) but they require loading source code and profiling data for most of Firefox into RAM at the same time.

Once we identified the problem we took emergency steps by disabling SPDY support and the Graphite font subsystem, both new features that had been landed recently and were turned off by default (in other words, users had to use an about:config preference to turn them on).  This allowed us to reopen the tree for checkins that did not touch code that ends up in xul.dll, so work could proceed on the Firefox UI, the JavaScript engine, and a few other things.

We then disabled Skia (which is being used as an experimental <canvas> backend) and separated video codecs and parts of WebGL support into a separate shared library.  This work decreased the linker’s memory usage enough to resume normal development and turn SPDY back on.  The medium term solution is to start doing our 32 bit builds on 64 bit operating systems so that the linker can use 4 GB of memory instead of 3 GB of memory, and to separate pieces of code that aren’t on the critical startup path into other shared libraries.

Frequently Asked Questions:

  • Why don’t you just get machines with more RAM? - The problem is not that the linker was running out of physical memory, but that it was running out of virtual memory.  A 32 bit program can only address 2^32 bytes (4GB) of memory, regardless of how much memory is in the machine.  Additionally, on 32 bit Windows, the last 1 GB is reserved for the kernel, so a program is really limited to 3 GB of memory.
  • Ok, so why don’t you just use a 64 bit linker? - Unfortunately there is no 64->32 bit cross compiler provided with the Microsoft toolchain so you can’t generate binaries that run on 32 bit systems with a 64 bit compiler.
  • Sure you can, just use -MACHINE:X86 on the linker! - You can have the 64 bit linker link 32 bit binaries, but this is incompatible with Link-Time Code Generation.
  • Is Firefox bloated? - Firefox’s size and linker memory usage compare favorably with other browsers. These problems are not a reflection on which browsers are or are not bloated, but rather on how resource intensive it is to do whole program optimization across a large C++ codebase.
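The address space arithmetic in the first answer can be checked directly.  This is a small Python sketch; the helper name and the assumption that a 64 bit operating system reserves none of the 32 bit range for the kernel are mine, chosen to match the 3 GB and 4 GB figures above.

```python
GB = 2 ** 30  # "GB" as used in this post: 2^30 bytes


def usable_address_space(pointer_bits, kernel_reserved_gb):
    """Bytes of virtual address space left to a program after the kernel's share."""
    return 2 ** pointer_bits - kernel_reserved_gb * GB


# 32 bit linker on 32 bit Windows, with the last 1 GB reserved for the kernel.
print(usable_address_space(32, 1) // GB)  # 3

# The same 32 bit linker on a 64 bit OS (assumed: nothing reserved in the 32 bit range).
print(usable_address_space(32, 0) // GB)  # 4
```

No amount of physical RAM changes either number, which is why buying bigger machines does not help.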

September 29, 2011
Using XHR.onload/etc in addons

I just landed https://bugzilla.mozilla.org/show_bug.cgi?id=687332 on mozilla-central which makes some changes to how .onfoo event listeners are handled on some DOM objects (including XHR).  These changes mean it is no longer possible to use .onfoo event listeners from JS scopes where the global object is not a Window, or from C++.  The correct way to listen for events from these scopes is to use .addEventListener.

This will likely affect a number of addons (particularly for XHR).  Addons that use XHR in XPCOM components should check to see if they are affected.  We may consider implementing some sort of a compatibility hack for XHR if that number is large.

August 10, 2011
xpidlc is dead. Long live pyxpidl.

Today I landed Bug 458936 which moves from using xpidlc to generate xpcom typelibs to new python code.  With that, and other work by people including Ted Mielczarek, Mike Hommey, and Benjamin Smedberg, Firefox is now built without ever invoking the binary xpidl.

The remaining pieces of work here are:

  • Migrate comm-central to the new python tools (interfaces in comm-central are still compiled with xpidlc)
  • Package the python xpidl into the Gecko SDK.
  • Stop building the binary xpidl entirely and remove it from the tree.
  • Remove our build time dependencies on libIDL, etc.

July 2, 2011
Mork is finally gone

It’s not even going to be worthy of a footnote compared to the other awesome things making Firefox 7, but I landed Bug 578268 this morning, which removes the last vestiges of Mork (specifically morkreader) from Firefox.

June 28, 2011
Migrating to MSVC 2010

I’m reposting this here because Google Groups is not mirroring the last several days of mozilla.dev.platform.  Please follow up to anything here on the newsgroup.


I just landed the fix for the last known bug in product code that stops us from switching to MSVC 2010 (Bug 515492).

If RelEng can deploy MSVC 2010 SP1 to the builders in the next couple of weeks I would like to try to change the official compiler for the next release cycle (starting July 5th).  This depends of course on how much trouble we run into along the way :-)  If deployment/test failures/perf regressions slow us down this can always catch the next train.

Things we expect to pick up from MSVC 2010:
- Not being on a six year old compiler
- A modest perf improvement (maybe a couple percent).
- Fixes to known crashes with PGO (which should allow us to turn PGO back on for the JS engine, which will pick up another 10-15% on Sunspider/etc)
- Fixes to various issues relating to crashreporter symbols that should allow us to get better stacks (Bug 520651).

Other things worth noting:
- Downloading Windows debug builds from Tinderbox might be useful again (since the debug CRT is non-redistributable, you must have the same version of the compiler installed locally to run a debug build.  This will be much more bearable when we’re on a version that people can get easily.)
- The CRT included with MSVC 2010 requires XP SP 2 or higher, so switching compilers will end support for Windows 2000 and pre-SP2 versions of Windows XP in official builds of Firefox.  I expect that we will likely remove support for Windows 2000 from Gecko itself shortly afterwards.

Please follow up on mozilla.dev.platform if you have any questions/comments/concerns.

March 22, 2011
Build System Changes

See the tl;dr if you are so inclined.

A few hours ago I merged the build-system branch to mozilla-central.  This pulls in:

  • Bug 584474 - This reworks our linking system on all platforms.  Instead of building intermediate static libraries that are later combined into libxul, we make lists of object files that are assembled into libxul by a python wrapper around the linker.  This buys us significant link time improvements on Mac (more than a 10x speedup from some tests) and gets us full incremental linking with MSVC (currently on 2.0 only object files that pass through one level of static libs to get into libxul are linked incrementally, e.g. anything in gklayout is not linked incrementally).  This also allows object file reordering in the future.
  • Bug 638429 - --disable-libxul is gone.  Thunderbird and Seamonkey have been using fat-libxul (where the app components are placed inside libxul) for some time.  This will allow a lot of code cleanup and simplification, including deCOMtamination.
  • Bug 638755 - --disable-ipc is gone.  As e10s progresses the overhead of maintaining --disable-ipc will increase.  This will break some platforms (OS/2 for example) until they adapt.
  • Bug 636695 - --enable-debug-symbols is now the default, and --enable-debugger-info-modules is gone completely.
  • Bug 637317 - elfhack is back on again.
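To illustrate the object-list idea from Bug 584474, here is a simplified Python sketch.  It is not the wrapper that landed: the descriptor layout (a dict with "libs" and "objs" keys), the function names, and the file names are all invented for the example.

```python
def collect_objects(lib, descriptors, seen=None):
    """List the object files that make up `lib`, visiting each sub-library once."""
    if seen is None:
        seen = set()
    objects = []
    entry = descriptors[lib]
    for dep in entry.get("libs", []):
        if dep not in seen:
            seen.add(dep)
            objects.extend(collect_objects(dep, descriptors, seen))
    objects.extend(entry.get("objs", []))
    return objects


def link_command(linker, output, lib, descriptors):
    """Build one linker invocation from the flattened object list.

    Uses cc-style flags for illustration; MSVC's link.exe spells the
    output option /OUT: instead of -o.
    """
    return [linker, "-o", output] + collect_objects(lib, descriptors)


# Hypothetical descriptors: what used to be static libraries are now just lists.
descriptors = {
    "xul": {"libs": ["layout", "base"], "objs": ["stub.o"]},
    "layout": {"libs": ["base"], "objs": ["frame.o", "block.o"]},
    "base": {"objs": ["string.o"]},
}
print(link_command("cc", "libxul.so", "xul", descriptors))
```

Because the linker sees individual object files rather than nested static libraries, it can link them incrementally, and the wrapper is free to reorder them later.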

You’ll need to clobber after pulling or your build will choke.

March 21, 2011
The Sorry State of Tech “Journalism”

This is the second major Firefox release that I’ve been around for.  By now I’m used to the rampant speculation and the “exclusive prerelease” builds that intrepid tech “journalists” find on ftp.mozilla.org a couple of days before the release.

This piece by Tom’s Hardware really takes the cake though.  It’s a competition between the most recently released stable versions of the five major browsers.  It was published on March 21st comparing IE 9 among others to Firefox 3.6.  Firefox 4 is going to be officially released on … March 22nd.

Now, I subscribe to Hanlon’s razor, so unlike most of the commenters on the article I didn’t expect that the timing was malicious.  Sure enough, looking through the comments the author of the article didn’t seem to know when Firefox 4 was going to be released and complained that he hadn’t received a press release from Mozilla about the release date.  (Sadly there appears to be no way to link to individual comments.)  It’s a shame that “journalists” can’t even be bothered to locate the most basic facts about the software they’re reviewing.

The supreme irony in all of this is that a number of commenters are complaining that Firefox 4 has already been released, thanks to the usual shoddy reporting about “exclusive prereleases.”

January 7, 2011
Hunting Regressions

Nobody likes doing repetitive boring tasks.  There’s a certain threshold at which I’ll bite the bullet and write a script/tool/what have you to automate that task.  There’s another lower threshold where I’ll complain about the task on the internet.  Hunting regressions with hg bisect falls somewhere between those thresholds.

What I want is a script (ideally in Python!) that will drive an hg bisect against a test from our test suite.  The requirements for said script are:

  • Accept tests that would run in our test suites (mochitest and crashtest are probably the most useful to me, but would be good to have them all eventually)
  • Accept a starting good revision and an ending bad revision
  • Start hg bisect and build the tree from source (clobber build)
  • Run the test against the build and proceed with the bisection appropriately
  • Handle revisions that fail to compile intelligently

This obviously won’t handle everything; it’s easy to imagine a revision that builds but crashes at startup, or a revision that changes a leak to a crash.  Still, this would be very useful to me and many others.
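As a rough starting point, the per-revision check could look something like the Python sketch below.  It leans on hg bisect --command, which reads the exit status of a check command: 0 marks the revision good, 125 skips it, and any other non-zero status marks it bad (127 aborts the bisection).  The function names and the build and test commands are placeholders, not existing tools in our tree.

```python
import subprocess

# Exit codes understood by `hg bisect --command`.
GOOD, BAD, SKIP = 0, 1, 125


def verdict(build_ok, test_ok):
    """Translate one build-and-test outcome into an exit code for hg bisect."""
    if not build_ok:
        # A revision that fails to compile tells us nothing about the bug.
        return SKIP
    return GOOD if test_ok else BAD


def check_revision(build_cmd, test_cmd):
    """Build the checked-out revision, then run the test against that build.

    Both arguments are argument lists for subprocess; the test is only
    run if the build succeeded.
    """
    build_ok = subprocess.call(build_cmd) == 0
    test_ok = build_ok and subprocess.call(test_cmd) == 0
    return verdict(build_ok, test_ok)
```

A driver would run hg bisect --reset, hg bisect --good and hg bisect --bad with the two starting revisions, and then hg bisect --command pointing at a small script that calls sys.exit(check_revision(...)) with a clobber build command and the mochitest or crashtest invocation.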

If somebody is willing to write this script (prior art here, here, and here) I will provide a fast machine to run it on (clobbers from scratch in 8 minutes, so < 2 hrs for a full regression search) and set up a web or email gateway to it for the community to use.

Any takers? :-)
