flypig.co.uk

List items

Items from the current list are shown below.

Gecko

15 Jul 2024 : Day 289 #
We finished the diary entry day yesterday with a mystery. The C++ code in AudioChannelServices.cpp that sends audio-playback messages is being called on both ESR 78 and ESR 91. On ESR 91 the JavaScript code in AudioPlaybackChild.jsm that receives these messages, converts them into AudioPlayback messages and forwards them on to AudioPlaybackParent.jsm is also working correctly. But on ESR 78 the AudioPlaybackChild never receives the messages.

To add to the mystery, Mark Washeim (poetaster) flagged up the fact that he's experienced the same error when developing apps that use a WebView. That suggests there are at least some occasions when the AudioPlayback class is successfully receiving the messages.

Why are the messages being received in ESR 91 but only on ESR 78 sporadically?

Today I plan to delve into this, but be warned, it's a bit of a lengthy one today. I'm hoping the payoff will be worth it.

Considering the reasons for the peculiar differences I'm seeing between ESR 78 and ESR 91, one possibility is that the actor isn't being properly instantiated on ESR 78. To help determine if this is the case I've added the following code to the AudioPlaybackChild class on both ESR 78 and ESR 91. I've added similar code (but with the output changed accordingly) to the AudioPlaybackParent class, in addition to the debug output that I already added yesterday.
  constructor() {
    super();
    dump("AudioPlaybackChild: constructor()\n");
  }
Here's what I get when running this on ESR 91. As you can see I've not removed the other debug output lines, so we can watch both classes be instantiated before the messages are passed to their respective observers.
library "libandroidicu.so" needed or dlopened by "/system/lib64/
    libmedia.so" is not accessible for the namespace "(default)"

AudioPlaybackChild: constructor()
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: active
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:19:22

AudioPlaybackParent: constructor()
AudioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Start
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:20:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 27: 
    TypeError: browser is null
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: inactive-pause
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:19:22
On ESR 78 we get none of this output, despite the same debug print lines being present in the code.

I'm beginning to feel really frustrated by this now. There must be some reason why the actor is being initialised on ESR 91 but not on ESR 78. There are two possibilities that I need to pursue to try to get to the bottom of this. The first is to try to find out if it's just this specific actor that's being skipped, or second is whether the ActorManager — the class that performs the initialisation for all of the actors — isn't starting at all.

To help with this I've placed some debug output in the loop that starts all of the actors. I notice that there's some conditionality here: actors can be disabled using a preference if it's listed as such in the actor configuration. This configuration is listed in the ActorManagerParent.jsm source file itself. Here's the initialisation loop taken from this file:
    for (let [actorName, actor] of Object.entries(actors)) {
      dump("ActorManagerParent: actor: " + actorName + ", 
    enabledPreference: " + actor.emablePreference + "\n");
      // If enablePreference is set, only register the actor while the
      // preference is set to true.
      if (actor.enablePreference) {
        let actorNameProp = actorName + "_Preference";
        XPCOMUtils.defineLazyPreferenceGetter(
          this,
          actorNameProp,
          actor.enablePreference,
          false,
          (prefName, prevValue, isEnabled) => {
            if (isEnabled) {
              dump("ActorManagerParent: pref registering " + 
    actorName + "\n");
              register(actorName, actor);
            } else {
              dump("ActorManagerParent: pref unregistering " + 
    actorName + "\n");
              unregister(actorName, actor);
            }
            if (actor.onPreferenceChanged) {
              actor.onPreferenceChanged(prefName, prevValue, isEnabled);
            }
          }
        );
        if (!this[actorNameProp]) {
          continue;
        }
      }
   
      dump("ActorManagerParent: registering " + actorName + 
    "\n"); 
      register(actorName, actor);
    }
As we can see, the loop goes through each of the Actors and checks whether they have an enablePreference entry; if they do it searches for the appropriate preference which, in our case, would be AudioPlayback_Preference. If this is set to true or if there's no preference enabled then the Actor is registered; otherwise if it's set to false the actor is unregistered. The calls to dump() in this code were added by me just now so that we can trace progress through the method (sadly gdb won't work on JavaScript code).

The entry for the AudioPlayback actor, as listed higher up in the same file, looks like this:
  AudioPlayback: {
    parent: {
      moduleURI: "resource://gre/actors/AudioPlaybackParent.jsm",
    },

    child: {
      moduleURI: "resource://gre/actors/AudioPlaybackChild.jsm",
      observers: ["audio-playback"],
    },

    allFrames: true,
  },
As we can see from this, the actorName is AudioPlayback and the actor is everything else. But we can also see there's noemabledPreference entry here. As such what I'd expect to see is the preference code being entirely skipped and for the actor to be directly registered through the call at the end of the loop.

When I execute this on ESR 91, here's the output I get. It turns out there are very many actors (36 in total) so there's quite a lot of output. I've cut out the majority of them for brevity, but they all basically look the same anyway so you're not missing much.
Created LOG for EmbedLite
ActorManagerParent: actor: AsyncPrefs, enabledPreference: undefined
ActorManagerParent: registering AsyncPrefs
[...]
ActorManagerParent: actor: AudioPlayback, enabledPreference: undefined
ActorManagerParent: registering AudioPlayback
ActorManagerParent: actor: AutoComplete, enabledPreference: undefined
ActorManagerParent: registering AutoComplete
ActorManagerParent: actor: Autoplay, enabledPreference: undefined
ActorManagerParent: registering Autoplay
[...]
ActorManagerParent: actor: UnselectedTabHover, enabledPreference: undefined
ActorManagerParent: registering UnselectedTabHover
Created LOG for EmbedPrefs
What we see here is what we expect to see: the enablePreference is undefined and so the Actor gets registered. That's on ESR 91. When I add exactly the same debug annotations to the ESR 78 code I get... nothing. I've also added some code to the initialisation steps of the ActorManager and they don't get executed either. So it's looking clearly like the ActorManager isn't being executed at all.

So I need to find where the ActorManager initialisation is triggered. Searching through the code I can see a few potential places, for example in the tab-content.js file and in the browser_startup.js file. Both of these reference the ActorManagerParent. But neither are included in our omni.ja file so presumably they can't be the culprits.

At this point I've placed debug output all over the place and I don't seem to be getting any closer to finding out the reason. This is deeply frustrating.

But suddenly I realise that the relevant code might not be in the gecko repository at all: it could be in the embedlite-components repository. And when I check there, it looks plausible that the ActorManagerChild is getting initialised in the embedhelper.js script which is part of embedlite-components. Could this be what I've been searching for?

I've added some debug output to this file as well, as you can see in the dump() calls in this extract from the file:
function canInitializeActorManagerChild() {
  if (actorManagerChildAttached)
    return false;
  
  const { sharedData } = Services.cpmm;
  dump("canInitializeActorManagerChild: sharedData.get(
    CHILD_SINGLETON_ACTORS) = " + sharedData.get(CHILD_SINGLETON_ACTORS) + 
    "\n");
  return sharedData.get(CHILD_SINGLETON_ACTORS);
}

function initializeActorManagerChild() {
  dump("initializeActorManagerChild\n");
  try {
    if (canInitializeActorManagerChild()) {
      dump("canInitializeActorManagerChild: true\n");
      const { ActorManagerChild } = ChromeUtils.import("resource://gre/
    modules/ActorManagerChild.jsm");
      ActorManagerChild.attach(this);
      actorManagerChildAttached = true;
    }
  } catch (e) {}
} 
  
initializeActorManagerChild();
The code here calls initializeActorManagerChild(), which then checks whether initialisation is possible and if it is, initialises it. The two situations in which the initialisation may not be possible are if either it's already been initialised, or the value returned by sharedData.get(CHILD_SINGLETON_ACTORS) resolves to false. Here's the output I get when I run this on ESR 78:
Created LOG for EmbedLiteLayerManager
initializeActorManagerChild
canInitializeActorManagerChild: sharedData.get(CHILD_SINGLETON_ACTORS) = null
This initially looks encouraging: the value returned from the CHILD_SINGLETON_ACTORS check is null, which is interpreted as false in the condition and so the initialisation is skipped. Could this be the reason the ActorManager isn't being initialised?

Sadly my hopes are dashed when I discover that ESR 91 generates exactly the same output. There as well I get a return value of null with no initialisation.

I thought this would be it... I was wrong and I'm no further forwards than before. This all feels incredibly frustrating.

But having wandered into the embedlite-components code, I now notice there's one more place which could be triggering the initialisation and that's in the EmbedLiteGlobalHelper.js file. When I open this up I discover something unexpected. Compare the code in the file from ESR 78 with that from ESR 91. Here first is code taken from my device running ESR 78:
//ChromeUtils.defineModuleGetter(
//  this,
//  "ActorManagerParent",
//  "resource://gre/modules/ActorManagerParent.jsm"
//);  

Services.scriptloader.loadSubScript("chrome://embedlite/content/
    Logger.js");

// Common helper service
  
function EmbedLiteGlobalHelper()
{ 
  //ActorManagerParent.flush();
[...]
And now here's the same code taken from my device running ESR 91:
ChromeUtils.defineModuleGetter(
  this,
  "ActorManagerParent",
  "resource://gre/modules/ActorManagerParent.jsm"
);

Services.scriptloader.loadSubScript("chrome://embedlite/content/
    Logger.js");

// Common helper service

function EmbedLiteGlobalHelper()
{
  // Touch ActorManagerParent so that it gets initialised
  var actor = ActorManagerParent;
[...]
In the first block of code from ESR 78 the lines for getting the ActorManagerParent are completely commented out. That doesn't look right; they're not commented out on my ESR 91 device.

There's another difference too, which is less relevant for the issue at hand, but which explains how this came about. Slightly further down we can see that the call to ActorMangerParent.flush() (which is also commented out) has been switched for a line that just creates an instance of the ActorManagerParent.

I think I've finally found the culprit I've been looking for. The nice thing is that this explains fully what's going on. To understand we need to look at the history of the commits related to these changes:
$ git log -1
commit 377ed10e9cffece90458b1d8152a579b438b5dc0 (HEAD -> sailfishos-esr91, 
    origin/sailfishos-esr91)
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Mon Feb 12 21:33:19 2024 +0000

    [embedlite-components] Ensure ActorManagerParent is initialised
    
    In diff 0bf2601425ec1d8d63925 a call to ActorManagerParent.flush() was
    removed because the function was removed from upstream. However this
    caused ActorManagerParent not to be initialised, preventing the Window
    Actors from being registered.
    
    This change adds code to touch ActorManagerParent to ensure it gets
    reinitialised again.
$ git log -1 0bf2601425ec1d8d63925
commit 0bf2601425ec1d8d639255d6a7c32231e7e38eae
Author: David Llewellyn-Jones <david.llewellyn-jones@jolla.com>
Date:   Thu Nov 23 21:55:13 2023 +0000

    Remove EmbedLiteGlobalHelper.js errors
    
    Makes three changes to address errors that were coming from
    EmbedLiteGlobalHelper.js:
    
    1. Use ComponentUtils.generateNSGetFactory() instead of
       XPCOMUtils.generateNSGetFactory().
    
    2. Remove call to ActorManagerParent.flush(). See Bug 1649843.
    
    3. Use L10nRegistry.registerSources() instead of
       L10nRegistry.registerSource(). See Bug 1648631.
    
    See the following related upstream changes:
    
    https://phabricator.services.mozilla.com/D95206
    
    https://phabricator.services.mozilla.com/D81243
What does this all tell us? Well, back on Day 88 I removed the call to ActorManager.flush() following a change made upstream. In the process of doing this I broke the initialisation of the ActorManager and so went about restoring it on Day 256 by adding in the new line that creates an instance of the class.

As part of testing the original change I must have commented out the initialisation code on my ESR 78 device. Not in the project code on my laptop, but just on my phone. And so we ended up where we are now. This explains not only why it's not working on my ESR 78 device and why a re-installation of the gecko package didn't help (it's a change made in the embedlite-components package!) but also why the error is appearing on Mark's device and not mine: Mark never hacked around with the code on his device to break it.

It's a little frustrating that this is all down to changes I've made, things I've recorded in this diary but which I'd completely forgotten about. I've been led on a bit of a wild goose chase by myself.

Nevertheless I've reversed the changes on my device running ESR 78. Now when I grant permissions to the microphone I get the same error as I get on ESR 91.
JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 16: 
    TypeError: browser is null
Just as important to know, even with the error now showing, the audio continues to work as it did before.

This is all just as Mark described it in his message on Mastodon.

Even though it's presumably pretty harmless, I still feel compelled to fix it so there's no error message on ESR 91. Not only is it ugly, but it's also just lost me three days of development time. It would be good to avoid that happening again in the future.

To try to understand the issue with the error I want to understand what the embedding element actually is. To that end I've been reading through various pieces of documentation, including the Firefox Source docs on the topic and a Gecko Embedding Basics document. The latter turns out to be a bit of a red herring, since the embedder element isn't specifically related to embedding Gecko in other applications. But it was a useful read anyway,

The conclusion I'm left with is that in practice an attempt is being made to pass the message on to the embedding element, but this is never actually used. It could potentially be used, for example by an extension, but on Sailfish OS it's not bringing any benefit.

So the appropriate thing to do is to prevent the AutioPlaybackParent attempting to call methods on a null browser variable. There are multiple ways we might go about achieving this. One possibility would be to try to override the AudioPlaybackParent.jsm file so as to replace it with our own version. Sadly, after testing a few things out, I've not been able to get this to work. So I've given up and have decided to go for the simplest possible option, which is just to patch the file with a check for the null value:
$ git diff -U1
diff --git a/toolkit/actors/AudioPlaybackParent.jsm b/toolkit/actors/
    AudioPlaybackParent.jsm
index 5c54058528c5..7cfc5134e886 100644
--- a/toolkit/actors/AudioPlaybackParent.jsm
+++ b/toolkit/actors/AudioPlaybackParent.jsm
@@ -16,2 +16,5 @@ class AudioPlaybackParent extends JSWindowActorParent {
     const browser = this.browsingContext.top.embedderElement;
+    if (!browser) {
+      return;
+    }
     switch (aMessage.name) {
The good news is that this fixes the error, which now no longer appears in the console output. What's more there was a completely separate reason, unrelated to gecko, for the sound not working on my device. Having addressed that I now have a build that has working microphone and no error showing.

Nearly no error. There's still the following appearing:
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
Once again, despite the fact that the audio is now working, I still feel the need to track down the reason behind this error. It's possible this is nothing to do with gecko... but I'm not convinced and I'd like to be certain.

From the error it looks like the trigger is from the loading of libmedia.so rather than an attempt to load libandroidicu.so directly. In amongst the Gecko code there are two methods that attempt to dynamically load libmedia.so. Here's one of them:
media_lib *
cubeb_load_media_library()
{
  media_lib ml = {0};
  ml.libmedia = dlopen(&quot;libmedia.so&quot;, RTLD_LAZY);
  if (!ml.libmedia) {
    return NULL;
  }
[...]
The other one is in a method called audiotrack_init. Both of these relate to cubeb which appears to be a cross platform media library. However, when I fire up the debugger and try to put a breakpoint on either of these methods, nothing sticks. It looks like they're not part of xulrunner. So, with nothing else to try, I've just stuck a breakpoint on dlopen() on the off-chance something useful comes from it.

It gets hit many, many times. But I'm interested in a particular time, which is just after the website has requested to use the microphone, and just before I've granted access as a user. And there is a hit which happens right then. And the backtrace looks promising.
Thread 70 &quot;VideoCapture&quot; hit Breakpoint 3, 0x0000007ff0df02ac in 
    dlopen () from /lib64/libdl.so.2
(gdb) bt
#0  0x0000007ff0df02ac in dlopen () from /lib64/libdl.so.2
#1  0x0000007f10057c30 in ?? () from /usr/lib64/gecko-camera/plugins/
    libgeckocamera-droid.so
#2  0x0000007f1005aa4c in droid_media_init () from /usr/lib64/gecko-camera/
    plugins/libgeckocamera-droid.so
#3  0x0000007f1004c2d4 in DroidCameraManager::init() () from /usr/lib64/
    gecko-camera/plugins/libgeckocamera-droid.so
#4  0x0000007fee922844 in gecko::camera::RootCameraManager::init() () from /usr/
    lib64/libgeckocamera.so.0
#5  0x0000007fee922a2c in gecko_camera_manager () from /usr/lib64/
    libgeckocamera.so.0
#6  0x0000007ff48d65a4 in webrtc::videocapturemodule::DeviceInfoSFOS::Init (
    this=this@entry=0x7e60002fc0)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:40
#7  0x0000007ff48dad68 in webrtc::videocapturemodule::DeviceInfoSFOS::
    DeviceInfoSFOS (this=0x7e60002fc0)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:35
#8  0x0000007ff48dad90 in webrtc::videocapturemodule::VideoCaptureImpl::
    CreateDeviceInfo ()
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#9  0x0000007ff4867a68 in webrtc::VideoCaptureFactory::CreateDeviceInfo ()
    at third_party/libwebrtc/webrtc/modules/video_capture/
    video_capture_factory.cc:28
#10 0x0000007ff37de984 in mozilla::camera::VideoEngine::
    GetOrCreateVideoCaptureDeviceInfo (this=0x7e60002f00)
    at dom/media/systemservices/VideoEngine.cpp:180
[...]
#24 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
As you can see, the error is generated straight after I continue running the code. It's hard to judge whether this is really happening straight after or not (my phone runs pretty fast), but it looks promising. Let's dig deeper.

So here's the code that the backtrace is pointing to. This is the last piece of code before the backtrace leaves the gecko library.
int32_t DeviceInfoSFOS::Init()
{
    cameraManager = gecko_camera_manager();
    // Initialize parent class member
    _lastUsedDeviceName = (char *)malloc(1);
    _lastUsedDeviceName[0] = 0;
    return 0;
}
The key line is the one that calls gecko_camera_manager() that we can also see in the backtrace. This method isn't part of gecko but is instead part of the libgeckocamera.so library, which is part of the Sailfish stack but which lives in a separate repository.

I want to know whether it's precisely this line that's generating the error message. So I've placed a breakpoint on the DeviceInfoSFOS::Init() method so that I can step through it a line at a time. If this is indeed the right place we should see the error message generated as we step through.

And so we do. Here's the output, with the error generated exactly as we step over the call to gecko_camera_manager():
Thread 69 &quot;VideoCapture&quot; hit Breakpoint 4, webrtc::videocapturemodule:
    :DeviceInfoSFOS::Init (this=this@entry=0x7e68002fa0)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:39
39      {
(gdb) n
40          cameraManager = gecko_camera_manager();
(gdb) n
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
42          _lastUsedDeviceName = (char *)malloc(1);
(gdb) n
43          _lastUsedDeviceName[0] = 0;
(gdb) 
If, rather than stepping over the call, we instead jump over it, so that it doesn't get executed at all, the error never shows:
Thread 71 &quot;VideoCapture&quot; hit Breakpoint 4, webrtc::videocapturemodule:
    :DeviceInfoSFOS::Init (this=this@entry=0x7e9c004210)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:39
39      {
(gdb) n
40          cameraManager = gecko_camera_manager();
(gdb) tbreak +1
Temporary breakpoint 5 at 0x7ff48d65a8: file third_party/libwebrtc/webrtc/
    modules/video_capture/sfos/device_info_sfos.cc, line 42.
(gdb) jump +1
Continuing at 0x7ff48d65a8.

Thread 71 &quot;VideoCapture&quot; hit Temporary breakpoint 5, webrtc::
    videocapturemodule::DeviceInfoSFOS::Init (this=this@entry=0x7e9c004210)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:42
42          _lastUsedDeviceName = (char *)malloc(1);
(gdb) n
43          _lastUsedDeviceName[0] = 0;
(gdb) c
Continuing.
Interestingly, having jumped over it, this doesn't seem to cause any issues for the microphone or audio. That rather makes sense: the name of the method suggests it's camera-related rather than audio-related. And having jumped over it, attempting to then access the camera does cause a crash:
Thread 71 &quot;VideoCapture&quot; received signal SIGSEGV, Segmentation fault.
0x0000007ff48dc5c4 in webrtc::videocapturemodule::DeviceInfoSFOS::
    NumberOfDevices (this=0x7e9c004210)
    at include/c++/8.3.0/ext/new_allocator.h:86
86      include/c++/8.3.0/ext/new_allocator.h: No such file or directory.
This isn't a crash that could ever happen in real use: it's only happened here because I forcefully skipped a line of code using the debugger that should have been executed.

My conclusion from all this is that the libandroidicu.so error falls outside the scope of the gecko upgrade. It doesn't seem to be doing any harm in practice and there isn't an obvious way to prevent it without wading into the gecko-camera library. I've reached the end of where I can reasonably take this to.

On testing with the LiveKit WebRTC Browser Test I now get the same results using ESR 91 as I do with ESR 78.
 
Screenshots showing the LiveKit WebRTC Browser Test results. On the left the ESR 78 results, on the right the ESR 91 results. There are both green (success) and red (failure) lines on both, but they both tally with one another

One interesting thing to notice is that the ESR 78 version is showing three rear cameras, whereas the ESR 91 output shows only one. But that's to be expected as the former is running on an Xperia 10 II on which all three cameras are supported, whereas the latter is on an Xperia 10 III which supports only one. So the results tally nicely with what we'd expect.

So today was rather a deep dive into WebRTC errors. Practically speaking I'm not sure I accomplished a great deal, but finding the answers to all of these open questions was cathartic to say the least. I'm glad it was possible to get to the bottom of them all. Where does this leave things? The WebRTC changes seem to be looking pretty good now, so I've pushed all of my commits related to them to the remote repository.

And that means things are looking pretty decent more generally as well. Tomorrow I'm going to do some more general testing of the browser. But I genuinely think we may be reaching a conclusion here.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comments

Uncover Disqus comments