September 22, 2012
Refcounting thread-safety assertions are now fatal on mozilla-central

Gecko has long had assertions to verify that XPCOM objects are AddRefed/Released on the right thread.  Today I landed Bug 753659 which makes those assertions fatal (using MOZ_ASSERT).  This makes these assertions noticeable on test suites that do not check assertion counts (namely mochitest).  It also ensures that developers will notice these assertions when testing locally.  Remember that any time you see one of these assertions you are seeing a potential sg:crit (via a use-after-free on an object whose reference count is too low due to AddRef racing with another operation) and should file and fix it immediately.

July 19, 2012
Cycle Collection

We don’t really have a comprehensive and current overview of the cycle collector and how to use it anywhere, so I wrote this.  This is probably part 1 of a multipart series, as I’ve only covered the simple cases here.

What?

The cycle collector is sort of like a garbage collector for C++.  It solves the fundamental problem of reference counting: cycles.  In a naive reference counting system, if A owns B and B owns A, neither A nor B will ever be freed.  Some structures in Gecko are inherently cyclic (e.g. a node tree) or can very easily be made cyclic by code beyond our control (e.g. most DOM objects can form cycles with expando properties added by content script).
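To make the problem concrete, here is a minimal sketch of the A-owns-B, B-owns-A case.  It uses std::shared_ptr from the C++ standard library as a stand-in for XPCOM reference counting, and the Node type and its live-object counter are made up for illustration.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical toy type: counts live instances so the leak is observable.
struct Node {
    static int& live() { static int n = 0; return n; }  // objects currently alive
    std::shared_ptr<Node> other;                         // strong (owning) reference
    Node() { ++live(); }
    ~Node() { --live(); }
};

// Builds A -> B -> A with strong references, drops the local handles, and
// returns how many nodes are still alive at that point.
int leakedAfterCycle() {
    const int before = Node::live();
    Node* rawA = nullptr;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->other = b;    // A owns B
        b->other = a;    // B owns A: the cycle
        rawA = a.get();  // remembered only so the demo can clean up below
    }                    // a and b are gone, but each count is still 1
    const int leaked = Node::live() - before;

    // Break the cycle by hand so the demo itself does not leak: taking B out
    // of A leaves B owned only by this local, and B's destructor frees A.
    std::shared_ptr<Node> detached = std::move(rawA->other);
    return leaked;
}
```

Calling leakedAfterCycle() reports both nodes as still alive after every outside handle is gone.  The manual cleanup at the end is exactly the kind of bookkeeping a cycle collector is meant to make unnecessary.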

The cycle collector operates on C++ objects that “opt-in” to cycle collection and all JS objects.  It runs a heavily modified version of Bacon and Rajan’s synchronous cycle collection algorithm. C++ objects opt-in by notifying the cycle collector when they may be garbage.  When the cycle collector wakes up it inspects the C++ objects (with help from the objects themselves) and builds a graph of the heap that participates in cycle collection.  It then finds the garbage cycles in this graph and breaks them, allowing the memory to be reclaimed.
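The core idea of finding garbage cycles in such a graph can be sketched in a few lines.  This is a deliberately simplified illustration and not Gecko's implementation: the Obj type, the index-based edges, and findGarbage are all hypothetical.  An object is treated as live if its reference count is higher than the number of references the graph accounts for, and anything reachable from a live object is live too.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy heap object: a reference count plus the indices of the
// objects it holds strong references to.
struct Obj {
    int refCount;
    std::vector<std::size_t> edges;
};

// Returns garbage[i] == true for every object that is kept alive only by
// references from inside the graph.
std::vector<bool> findGarbage(const std::vector<Obj>& heap) {
    const std::size_t n = heap.size();

    // Step 1: count the references that come from inside the graph.
    std::vector<int> internal(n, 0);
    for (const Obj& o : heap)
        for (std::size_t target : o.edges)
            ++internal[target];

    // Step 2: an object with more references than the graph accounts for
    // is held from outside (stack, globals, ...), so it is live.
    std::vector<bool> live(n, false);
    std::vector<std::size_t> worklist;
    for (std::size_t i = 0; i < n; ++i) {
        if (heap[i].refCount > internal[i]) {
            live[i] = true;
            worklist.push_back(i);
        }
    }

    // Step 3: everything reachable from a live object is live as well.
    while (!worklist.empty()) {
        const std::size_t i = worklist.back();
        worklist.pop_back();
        for (std::size_t target : heap[i].edges) {
            if (!live[target]) {
                live[target] = true;
                worklist.push_back(target);
            }
        }
    }

    std::vector<bool> garbage(n);
    for (std::size_t i = 0; i < n; ++i) garbage[i] = !live[i];
    return garbage;
}

// Example heap: 0 <-> 1 is an unreachable cycle; 2 <-> 3 is a cycle that
// something outside the graph still references through object 2.
std::vector<Obj> exampleHeap() {
    return {
        {1, {1}},  // 0: referenced only by 1
        {1, {0}},  // 1: referenced only by 0
        {2, {3}},  // 2: referenced by 3 and by one external owner
        {1, {2}},  // 3: referenced only by 2
    };
}
```

On the example heap, objects 0 and 1 come back as garbage while 2 and 3 stay live, even though both pairs form cycles.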

Why?

The cycle collector makes developing Gecko much simpler at the cost of some runtime overhead to collect cycles.  Without a cycle collector, we would have to either a) manually break cycles when appropriate or b) use weak pointers to avoid ownership cycles.  These add significant complexity to modifying code and make avoiding memory leaks and use-after-free errors much harder. 
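As a rough illustration of option b), here is what a weak back reference looks like with the standard library's std::weak_ptr.  The Parent and Child types are hypothetical and this is not how Gecko's own weak references are spelled; the point is only that every use of the weak edge has to cope with the target already being gone.

```cpp
#include <cassert>
#include <memory>

// Hypothetical toy types: the child points back at its parent weakly.
struct Parent;

struct Child {
    std::weak_ptr<Parent> parent;  // non-owning back reference
};

struct Parent {
    std::shared_ptr<Child> child;  // owning reference
};

// Returns true if the child's back reference is still usable.
bool parentStillAlive(const Child& c) {
    // Every use of the weak reference must handle the "already gone" case.
    return c.parent.lock() != nullptr;
}

// Keeps the child alive past its parent and checks what the child sees
// before and after the parent is dropped.
bool demoWeakBackReference() {
    auto parent = std::make_shared<Parent>();
    auto child = std::make_shared<Child>();
    parent->child = child;
    child->parent = parent;  // no cycle: this edge does not own

    const bool before = parentStillAlive(*child);
    parent.reset();          // drops the last strong reference to the parent
    const bool after = parentStillAlive(*child);
    return before && !after;
}
```

Nothing leaks here, but deciding which edge may safely be weak, and checking it at every use, is the complexity referred to above.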

When?

C++ objects need to participate in cycle collection whenever they can be part of a reference cycle that is not guaranteed to be broken through other means.  C++ objects also need to participate in cycle collection if they hold direct references to objects that are managed by the JavaScript garbage collector (a jsval, JS::Value, JSObject*, etc.).

In practice, this means most DOM objects need to be cycle collected.

  • Does the object inherit from nsWrapperCache (directly or indirectly)?  If so, it must be cycle collected.
  • Does the object have direct references to JavaScript values (jsval, JS::Value, JSObject*, etc)?  If so, it must be cycle collected.  Note that interface pointers to interfaces implemented by JavaScript (e.g. nsIDOMEventListener) do *not* count here.
  • Does the object hold no strong references (e.g. it has no member variables of type nsCOMPtr or nsRefPtr, it has no arrays of those (nsTArray<nsCOMPtr>, nsTArray<nsRefPtr>, or nsCOMArray), no hashtables of them (nsInterfaceHashtable, nsRefPtrHashtable), and does not directly own any object that has these (via new/delete or nsAutoPtr))?  If so, it does not need to be cycle collected.
  • Is the object threadsafe (e.g. an nsRunnable, or something that uses the threadsafe AddRef/Release macros)?  Threadsafe objects cannot participate in cycle collection and must break ownership cycles manually.
  • Is the object a service or other long lived object?  Long lived objects should break ownership cycles manually.  Adding cycle collection may prevent shutdown leaks, but it will just replace that with a leak until shutdown, which is just as bad but doesn’t show up on our tools.
  • Does the object hold strong references to other things that are cycle collected?  If so, and the object does not have a well-defined lifetime (e.g. it can be accessed from JavaScript) it must be cycle collected.
  • Does the object have strong references only to other things that are not cycle collected (e.g. interfaces from XPCOM, Necko, etc)?  If so, it probably does not need to be cycle collected.
  • Can the object be accessed from JavaScript?  Then it probably needs to be cycle collected.

The last two are kind of vague on purpose.  Determining exactly when a class needs to participate in cycle collection is a bit tricky and involves some engineering judgement.  If you’re not sure, ask your reviewer or relevant peers/module owners.

How?

C++ objects participate in cycle collection by:

  1. Modifying their reference counting to use the cycle collector.
  2. Implementing a “cycle collection participant”, a set of functions that tell the cycle collector how to inspect the object.
  3. Modifying their QueryInterface implementation to return the participant when asked.

Like many things in Gecko, this involves lots of macros.

The reference counting is modified by replacing existing macros:

  • NS_DECL_ISUPPORTS becomes NS_DECL_CYCLE_COLLECTING_ISUPPORTS.
  • NS_IMPL_ADDREF becomes NS_IMPL_CYCLE_COLLECTING_ADDREF.
  • NS_IMPL_RELEASE becomes NS_IMPL_CYCLE_COLLECTING_RELEASE.

The cycle collection participant is a helper class that provides up to three functions:

  • A ‘Trace’ function is provided by participants that represent objects that use direct JavaScript object references.  It reports those JavaScript references to the cycle collector.
  • A ‘Traverse’ function is provided by all participants.  It reports strong C++ references to the cycle collector.
  • An ‘Unlink’ function is provided by (virtually) all participants.  It clears out both JavaScript and C++ references, breaking the cycle.
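The division of labour between Traverse and Unlink can be sketched without any Gecko headers.  Everything below is a hypothetical stand-in (Element, ElementParticipant, the addRef/release helpers, and the callback type are not Gecko's real classes or signatures); it only shows what each function is responsible for.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for a reference-counted object.
struct Element {
    int refCount = 0;
    Element* parent = nullptr;       // strong reference in this toy model
    std::vector<Element*> children;  // strong references
};

using EdgeCallback = std::function<void(Element*)>;

// Helpers that keep the toy reference counts consistent.
inline void addRef(Element* e) { if (e) ++e->refCount; }
inline void release(Element* e) { if (e) --e->refCount; }

struct ElementParticipant {
    // Traverse: report every strong C++ reference the object holds.
    static void Traverse(Element& e, const EdgeCallback& report) {
        if (e.parent) report(e.parent);
        for (Element* child : e.children) report(child);
    }

    // Unlink: drop those references so the cycle falls apart.
    static void Unlink(Element& e) {
        release(e.parent);
        e.parent = nullptr;
        for (Element* child : e.children) release(child);
        e.children.clear();
    }
};

// Builds parent <-> child, counts the edges reported for both objects, then
// unlinks them.  Returns the edge count, or -1 if a reference survived.
int demoTraverseAndUnlink() {
    Element parent, child;
    parent.children.push_back(&child);
    addRef(&child);
    child.parent = &parent;
    addRef(&parent);

    int reported = 0;
    ElementParticipant::Traverse(parent, [&](Element*) { ++reported; });
    ElementParticipant::Traverse(child, [&](Element*) { ++reported; });

    ElementParticipant::Unlink(parent);
    ElementParticipant::Unlink(child);
    return (parent.refCount == 0 && child.refCount == 0) ? reported : -1;
}
```

A Traverse that forgets an edge hides a reference from the collector, and an Unlink that forgets one leaves part of the cycle intact, which is why the two usually list the same members.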

The cycle collection participant is implemented by placing one of the following macros in the header:

  • NS_DECL_CYCLE_COLLECTION_CLASS is the normal choice.  It is used for classes that only have C++ references to report.  This participant has Traverse and Unlink functions.
  • NS_DECL_CYCLE_COLLECTION_CLASS_AMBIGUOUS is a version of the previous macro for classes that multiply inherit from nsISupports.
  • NS_DECL_CYCLE_COLLECTION_SCRIPT_HOLDER_CLASS is used for classes that have JS references or a mix of JS and C++ references to report.  This participant has Trace, Traverse, and Unlink methods.
  • NS_DECL_CYCLE_COLLECTION_SCRIPT_HOLDER_CLASS_AMBIGUOUS is the ambiguous version of the previous macro.

And by doing one of the following in the cpp file:

  • For very simple classes, that don’t have JS references and only have nsCOMPtrs, you can use the NS_IMPL_CYCLE_COLLECTION_N macros, where N is the number of nsCOMPtrs the class has.
  • For classes that almost meet the above requirements, but inherit from nsWrapperCache, you can use the NS_IMPL_CYCLE_COLLECTION_WRAPPERCACHE_N macros, where N is the number of nsCOMPtrs the class has.
  • Otherwise, use the NS_IMPL_CYCLE_COLLECTION_CLASS macro, and build the Traverse, Unlink, and (if appropriate) Trace methods separately with the NS_IMPL_CYCLE_COLLECTION_[TRAVERSE|UNLINK|TRACE]_* macros.

April 26, 2012
Fixing the Memory Leak

The MemShrink effort that has been underway at Mozilla for the last several months has substantially decreased the memory usage of Firefox for most users.  There are still some remaining issues that lead to pathological memory use.  One of those issues is leaky addons, which Nick has identified as the single most important MemShrink issue.

In Firefox, the JavaScript heap is split into compartments.  Firefox’s UI code, which is written in JS, lives in the privileged “chrome” compartment.  Addon code also usually lives in the chrome compartment.  Websites live in different, unprivileged compartments.  Exactly how compartments are allocated to websites is beyond the scope of this article, but at the time of writing there is roughly one compartment per domain.  Code running in the chrome compartment can hold references to objects in the content compartments (much like how a page can hold references to objects in an iframe).

As an example of how this might look in practice, let’s imagine we have Firefox open to three tabs: GMail, Twitter, and Facebook, and we have some sort of social media addon installed.  Our compartments might look something like this:

Where the blue lines are the references the Firefox UI is holding and the red lines are the references the addon is holding.

The problems start to arise if these references aren’t cleaned up properly when the tab is navigated or closed.  If the Facebook tab is closed, but not all of those references are cleaned up, some or all of the memory the Facebook tab was using is not released.  The result is popularly known as a zombie compartment, and is a big source of leaks in Firefox.

Chrome (privileged UI or other JS) code that leaks is particularly problematic because the leak usually persists for the lifetime of the browser.  When chrome code leaks, say, facebook.com, it leads to dozens of megabytes of memory being lost.  It turns out that writing chrome code that doesn’t leak can actually be quite difficult.  Even the Firefox front end code, which is worked on by a number of full time engineers and has extensive code review, has a number of leaks.  We can find and fix those, but addons are a much harder problem, and we can’t expect addon authors to be as diligent as we try to be in finding and fixing leaks.  The only defense we have had is the AMO review team and our list of best practices.

That changed last night when I landed Bug 695480.  Firefox now attempts to clean up after leaky chrome code.  My approach takes advantage of the fact that chrome code lives in a separate compartment from web page code.  This means that every reference from chrome code to content code goes through a cross-compartment wrapper, which we maintain in a list.  When the page is navigated, or a tab is closed, we reach into chrome compartment and grab this list.  We go through this list and “cut” all of the wrappers that point to objects in the page we’re getting rid of.  The garbage collector can then reclaim the memory used by the page that is now gone.
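The wrapper-cutting idea can be modelled in a few lines.  This is a toy model, not SpiderMonkey's API: ContentObject, Wrapper, and cutWrappersInto are hypothetical names, and compartments are reduced to positive integer ids.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical content object tagged with the compartment it lives in.
struct ContentObject {
    int compartment;  // positive id in this toy model
    std::string value;
};

// A cross-compartment wrapper: chrome code only ever touches content
// objects through one of these.
class Wrapper {
public:
    explicit Wrapper(ContentObject* target) : target_(target) {}

    // Reads through the wrapper; throws once the wrapper has been cut.
    const std::string& get() const {
        if (!target_) throw std::runtime_error("wrapper has been cut");
        return target_->value;
    }

    // Compartment of the target, or -1 if the wrapper has been cut.
    int targetCompartment() const { return target_ ? target_->compartment : -1; }

    void cut() { target_ = nullptr; }

private:
    ContentObject* target_;
};

// Cuts every wrapper pointing into the compartment that is going away and
// returns how many were cut.
int cutWrappersInto(std::vector<Wrapper>& wrappers, int closing) {
    int cut = 0;
    for (Wrapper& w : wrappers) {
        if (w.targetCompartment() == closing) {
            w.cut();
            ++cut;
        }
    }
    return cut;
}

// Returns true if reading through the wrapper throws.
bool throwsOnAccess(const Wrapper& w) {
    try {
        w.get();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

Once a wrapper is cut it no longer keeps its target alive, and wrappers into other compartments are left untouched.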

The result looks something like:

Code that accidentally (or intentionally!) holds references to objects in pages that are gone will no longer leak.  If the code tries to touch the object after the wrapper has been “cut”, it will get an exception.  This may break certain code patterns.  A few examples:

  • Creating a DOM node from a content document and storing it in a global variable for indefinite use.  Once the page you created the node from is closed your node will vanish.  Here’s an example of code in Firefox that used to do that.
  • Creating a closure over DOM objects can break if those objects can go away before the closure is invoked.  Here’s some code in Firefox that did that.  In one of our tests in our test suite the tab closed itself before the timeout ran, resulting in an exception being thrown.

Addon authors probably don’t need to bother changing anything unless they see breakage.  Breakage should be pretty rare, and the huge upside of avoided leaks will be worth it.  It’s a little early to be sure what effects this will have, but the number of leaks we see on our test suite dropped by 80%.  I expect that this change will also fix a majority of the addon leaks we see, without any effort on the part of the addon authors.

February 23, 2012
Address Space Layout Randomization now mandatory for binary components

This evening I landed Bug 728429 on mozilla-central.  Firefox will now refuse to load XPCOM component DLLs that do not implement ASLR.  ASLR is an important defense-in-depth mechanism that makes it more difficult to successfully exploit a security vulnerability.  Firefox has used ASLR on its core components for some time now, but many extensions that ship with binary components do not.

ASLR is on by default on modern versions of Visual Studio, so extension authors will only need to ensure that they haven’t flipped the switch to turn it off.  MSDN documentation on ASLR options is available here.  Further reading about the benefits of ASLR is available here.

If no unexpected problems arise, this change will ship in Firefox 13.

December 19, 2011
Pushing Compilers to the Limit (and Beyond)

At the end of the first week of December Firefox exceeded the memory limits of the Microsoft linker we use to produce our highly optimized Windows builds.  After the problem was identified we took some emergency steps to ensure that people could continue to land changes to parts of Firefox not affected by this issue by disabling some new and experimental features.  Once that was complete we were able to make some other changes that reduced the memory used by the linker back below the limits.  We were then able to undo those emergency steps and turn those features back on.

This will have no lasting impact on what is or is not shipped in Firefox 11.  The issues described here only affected Firefox developers, and have nothing to do with the memory usage or other performance characteristics of the Firefox binaries shipped to users.

Technical Details

Recently we began seeing sporadic compilation failures in our optimized builds on Windows.  After some debugging we determined that the problem was that the linker was running out of virtual address space.  In essence, the linker couldn’t fit everything it needed into memory and crashed.

The build configuration that was failing is not our normal build configuration.  It uses Profile-Guided Optimization, fancy words meaning that it runs some benchmarks that we give it and then uses that information to determine what to optimize for speed and what optimizations to use.  It also uses Link-Time Code Generation, which means that instead of the traditional compilation model where the compiler generates code and the linker glues it all together, the linker does all of the code generation.  These two optimization techniques are quite powerful (they generally win 10-20% on various benchmarks that we have) but they require loading source code and profiling data for most of Firefox into RAM at the same time.

Once we identified the problem we took emergency steps by disabling SPDY support and the Graphite font subsystem, both new features that had been landed recently and were turned off by default (in other words, users had to use an about:config preference to turn them on).  This allowed us to reopen the tree for checkins that did not touch code that ends up in xul.dll (this allowed work to proceed on the Firefox UI, the Javascript engine, and a few other things).

We then disabled Skia (which is being used as an experimental <canvas> backend) and separated video codecs and parts of WebGL support into a separate shared library.  This work decreased the linker’s memory usage enough to resume normal development and turn SPDY back on.  The medium term solution is to start doing our 32 bit builds on 64 bit operating systems so that the linker can use 4 GB of memory instead of 3 GB of memory, and to separate pieces of code that aren’t on the critical startup path into other shared libraries.

Frequently Asked Questions:

  • Why don’t you just get machines with more RAM? - The problem is not that the linker was running out of physical memory, but that it was running out of virtual memory.  A 32 bit program can only address 2^32 bytes (4GB) of memory, regardless of how much memory is in the machine.  Additionally, 32 bit Windows reserves part of that address space for the kernel (2 GB by default, or 1 GB when booted with the /3GB option), so a program is really limited to at most 3 GB of memory.
  • Ok, so why don’t you just use a 64 bit linker? - Unfortunately there is no 64->32 bit cross compiler provided with the Microsoft toolchain so you can’t generate binaries that run on 32 bit systems with a 64 bit compiler.
  • Sure you can, just use -MACHINE:X86 on the linker! - You can have the 64 bit linker link 32 bit binaries, but this is incompatible with Link-Time Code Generation.
  • Is Firefox bloated? - Firefox’s size and linker memory usage compare favorably with other browsers. These problems are not a reflection on which browsers are or are not bloated, but rather on how resource intensive it is to do whole program optimization across a large C++ codebase.
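The address space arithmetic behind the first answer is easy to check.  The helper names below are made up for illustration, and the 1 GB kernel reservation is the figure used in this post rather than something the code measures.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kGiB = 1ull << 30;

// Total bytes addressable with pointers of the given width.
constexpr std::uint64_t addressableBytes(unsigned pointerBits) {
    return 1ull << pointerBits;  // valid for pointerBits < 64
}

// User-mode address space left once the kernel's share is set aside.
constexpr std::uint64_t userSpace(unsigned pointerBits, std::uint64_t kernelReserved) {
    return addressableBytes(pointerBits) - kernelReserved;
}

// 32 bit pointers cover 4 GB no matter how much RAM is installed.
static_assert(addressableBytes(32) == 4 * kGiB, "2^32 bytes is 4 GB");
// With 1 GB set aside for the kernel, a 32 bit linker tops out at 3 GB.
static_assert(userSpace(32, 1 * kGiB) == 3 * kGiB, "1 GB reserved leaves 3 GB");
// A 32 bit linker on a 64 bit OS can have the whole 4 GB range to itself.
static_assert(userSpace(32, 0) == 4 * kGiB, "nothing reserved leaves 4 GB");
```

The last line is the reasoning behind moving the 32 bit builds to 64 bit operating systems: the linker stays a 32 bit program, but gets an extra gigabyte of address space.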

September 29, 2011
Using XHR.onload/etc in addons

I just landed https://bugzilla.mozilla.org/show_bug.cgi?id=687332 on mozilla-central which makes some changes to how .onfoo event listeners are handled on some DOM objects (including XHR).  These changes mean it is no longer possible to use .onfoo event listeners from JS scopes where the global object is not a Window, or from C++.  The correct way to listen for events from these scopes is to use .addEventListener.

This will likely affect a number of addons (particularly for XHR).  Addons that use XHR in XPCOM components should check to see if they are affected.  We may consider implementing some sort of a compatibility hack for XHR if that number is large.

August 10, 2011
xpidlc is dead. Long live pyxpidl.

Today I landed Bug 458936 which moves from using xpidlc to generate xpcom typelibs to new python code.  With that, and other work by people including Ted Mielczarek, Mike Hommey, and Benjamin Smedberg, Firefox is now built without ever invoking the binary xpidl.

The remaining pieces of work here are:

  • Migrate comm-central to the new python tools (interfaces in comm-central are still compiled with xpidlc)
  • Package the python xpidl into the Gecko SDK.
  • Stop building the binary xpidl entirely and remove it from the tree.
  • Remove our build time dependencies on libIDL, etc.

July 2, 2011
Mork is finally gone

It’s not even going to be worthy of a footnote compared to the other awesome things making Firefox 7, but I landed Bug 578268 this morning, which removes the last vestiges of Mork (specifically morkreader) from Firefox.

June 28, 2011
Migrating to MSVC 2010

I’m reposting this here because Google Groups is not mirroring the last several days of mozilla.dev.platform.  Please follow up to anything here on the newsgroup.


I just landed the fix for the last known bug in product code that stops us from switching to MSVC 2010 (Bug 515492).

If RelEng can deploy MSVC 2010 SP1 to the builders in the next couple of weeks I would like to try to change the official compiler for the next release cycle (starting July 5th).  This depends of course on how much trouble we run into along the way :-)  If deployment/test failures/perf regressions slow us down this can always catch the next train.

Things we expect to pick up from MSVC 2010:
- Not being on a six year old compiler
- A modest perf improvement (maybe a couple percent).
- Fixes to known crashes with PGO (which should allow us to turn PGO back on for the JS engine, which will pick up another 10-15% on Sunspider/etc)
- Fixes to various issues relating to crashreporter symbols that should allow us to get better stacks (Bug 520651).

Other things worth noting:
- Downloading Windows debug builds from Tinderbox might be useful again (since the debug CRT is non-redistributable, you must have the same version of the compiler installed locally to run a debug build.  This will be much more bearable when we’re on a version that people can get easily.)
- The CRT included with MSVC 2010 requires XP SP 2 or higher, so switching compilers will end support for Windows 2000 and pre-SP2 versions of Windows XP in official builds of Firefox.  I expect that we will likely remove support for Windows 2000 from Gecko itself shortly afterwards.

Please follow up on mozilla.dev.platform if you have any questions/comments/concerns.

March 22, 2011
Build System Changes

See the tl;dr if you are so inclined.

A few hours ago I merged the build-system branch to mozilla-central.  This pulls in:

  • Bug 584474 - This reworks our linking system on all platforms.  Instead of building intermediate static libraries that are later combined into libxul, we make lists of object files that are assembled into libxul by a python wrapper around the linker.  This buys us significant link time improvements on Mac (more than a 10x speedup from some tests) and gets us full incremental linking with MSVC (currently on 2.0 only object files that pass through one level of static libs to get into libxul are linked incrementally, e.g. anything in gklayout is not linked incrementally).  This also allows object file reordering in the future.
  • Bug 638429 - --disable-libxul is gone.  Thunderbird and SeaMonkey have been using fat-libxul (where the app components are placed inside libxul) for some time.  This will allow a lot of code cleanup and simplification, including deCOMtamination.
  • Bug 638755 - --disable-ipc is gone.  As e10s progresses the overhead of maintaining --disable-ipc will increase.  This will break some platforms (OS/2 for example) until they adapt.
  • Bug 636695 - --enable-debug-symbols is now the default, and --enable-debugger-info-modules is gone completely.
  • Bug 637317 - elfhack is back on again.

You’ll need to clobber after pulling or your build will choke.
