List items

Items from the current list are shown below.

Gecko

All items from July 2024

31 Jul 2024 : Day 305 #

Before moving on to today's development I want to say a few things about the ambience switching implementation that I've been working on over the last few days. You may recall that at one point I explained that simply setting the preference to "light" or "dark" in the front end and letting the gecko engine pick up the change that way wasn't going to work, because the gecko engine recognises only two values ("light" and "dark") whereas on Sailfish OS we need three values ("light", "dark" and "follow ambience").

An obvious question that arises is "why not introduce a second setting?". Then the existing ui.systemUsesDarkTheme setting could be used to switch between light and dark modes, with a subsequent setting, ui.systemDarkThemeFollowsAmbience say, used to control whether or not the user interface should update the ui.systemUsesDarkTheme value when the ambience changes.

This preference-based approach has several advantages. First it would avoid the need to patch the gecko engine code, since only the user interface would need to know about the new setting and the existing setting could be left to do what it already does. Plus it would have avoided all of the problems related to race conditions at initialisation, because the preferences are already set up to handle all that correctly.

It sounds too good to be true. So why not do that? The answer is two-fold. First, it feels like the wrong thing to do architecturally. Using settings as a signalling mechanism is ugly, especially when there's literally a proper signalling mechanism already available. We have the same issue on Sailfish OS more generally, where it's very tempting to use dconf values as a signalling mechanism between the front-end and other parts of the code, or between different processes. Adding a dconf value is maybe three lines of code, whereas exposing and consuming a DBus interface — which is the correct way to signal between processes on Sailfish OS — requires more boilerplate than Stevenson's Rocket.

It can make it very tempting then to use preference values, rather than inter-process communication (IPC). But using preferences this way can be wasteful, leaves extra preferences lying around that can be hard to get rid of later, and is also usually less efficient than using the intended approach.

Those are the abstract reasons. The more concrete and practical reason is that using an extra preference would have meant diverging from the ESR 78 implementation. So either the user preference would have to be lost during the upgrade to ESR 91, or we'd have had to add some hard-to-maintain one-shot code to convert it.

So I decided to patch the gecko code instead. Neither option is ideal, but since I'm writing about this whole development process already, it makes sense to explain the possible different approaches and my reasons for choosing one over another.

Okay, on with the coding. Today I'm going to start with the task of addressing issues from the "Everything on the browser test page" task. The test page this is referring to is Jolla's gecko testing suite, which has been built up over many years to capture bugs and issues that the browser has suffered at various times during its life.

The tests are all manual, but their existence makes it far easier to test a whole swathe of different functionalities. When I worked through the tests previously I encountered the following problems.

Single select widget.
External links.
Full screen.
Double-tap.

The first of these — single select widget — refers to the very first item on the selection test page. Tapping on either of the first two widgets ("second" or "Volvo") should open up a Silica page listing several selectable options. Instead the browser just crashes. In contrast the following widgets (the list of fruit and the celestial bodies) work fine and open up their respective pages as intended.

As this is a crash the obvious place to start is by capturing a backtrace using the debugger.

Thread 10 &quot;GeckoWorkerThre&quot; received signal SIGSEGV, Segmentation 
    fault.
[Switching to LWP 15152]
mozilla::PresShell::Paint (this=this@entry=0x7fb90feb60, 
    aViewToPaint=aViewToPaint@entry=0x7eb4007270, aDirtyRegion=..., 
    aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers)
    at ${PROJECT}/gecko-dev/layout/base/PresShell.cpp:6276
6276    ${PROJECT}/gecko-dev/layout/base/PresShell.cpp: No such file or 
    directory.
(gdb) bt
#0  mozilla::PresShell::Paint (this=this@entry=0x7fb90feb60, 
    aViewToPaint=aViewToPaint@entry=0x7eb4007270, aDirtyRegion=..., 
    aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers)
    at ${PROJECT}/gecko-dev/layout/base/PresShell.cpp:6276
#1  0x0000007ff3e410c8 in nsViewManager::ProcessPendingUpdatesPaint (
    this=this@entry=0x7fb8d5d6f0, aWidget=aWidget@entry=0x7fb8cdfa80)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/RectAbsolute.h:43
#2  0x0000007ff3e4147c in nsViewManager::ProcessPendingUpdatesForView (
    this=this@entry=0x7fb8d5d6f0, aView=<optimized out>, 
    aFlushDirtyRegion=aFlushDirtyRegion@entry=true)
    at ${PROJECT}/gecko-dev/view/nsViewManager.cpp:394
#3  0x0000007ff3e41a6c in nsViewManager::ProcessPendingUpdates (
    this=this@entry=0x7fb8d5d6f0)
    at ${PROJECT}/gecko-dev/view/nsViewManager.cpp:972
#4  0x0000007ff3fe94e4 in nsRefreshDriver::Tick (this=0x7fb90dc8b0, aId=..., 
    aNowTime=..., 
    aIsExtraTick=aIsExtraTick@entry=nsRefreshDriver::IsExtraTick::No)
    at ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.cpp:2477
#5  0x0000007ff3fe9e78 in nsRefreshDriver::<lambda()>::operator() (
    __closure=0x7fb8d8f2f0)
    at ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.cpp:234
#6  mozilla::detail::RunnableFunction<nsRefreshDriver::EnsureTimerStarted(
    nsRefreshDriver::EnsureTimerStartedFlags)::<lambda()> >::Run(void) (
    this=0x7fb8d8f2e0) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/
    nsThreadUtils.h:532
[...]
#29 0x0000007fef54889c in ?? () from /lib64/libc.so.6

Here's the code that causing the crash:

  bool shouldInvalidate = layerManager->NeedsWidgetInvalidation();

It might be helpful to have a bit more context for this line, so here it is again in its natural habitat:

  LayerManager* layerManager = aViewToPaint->GetWidget()->GetLayerManager();
  NS_ASSERTION(layerManager, &quot;Must be in paint event&quot;);
  bool shouldInvalidate = layerManager->NeedsWidgetInvalidation();

The reason for the crash is that layerManager is null:

(gdb) p layerManager
$1 = (mozilla::PresShell::LayerManager *) 0x0

For comparison, the code in ESR 78 looks pretty much identical, but unlike ESR 91 the call to GetLayerManager() is returning a value. There are many different classes that extend nsIWidget and override this method, but in our case it's an instance of EmbedLitePuppetWidget. The code is quite complex but identical on both ESR 78 and ESR 91:

LayerManager *
EmbedLitePuppetWidget::GetLayerManager(PLayerTransactionChild *aShadowManager, 
    LayersBackend aBackendHint, LayerManagerPersistence aPersistence)
{
  if (!mLayerManager) {
    if (!mShutdownObserver || Destroyed()) {
      // We are shutting down, do not try to re-create a LayerManager
      return nullptr;
    }
  }

  LayerManager *lm = PuppetWidgetBase::GetLayerManager(aShadowManager, 
    aBackendHint, aPersistence);
  if (lm) {
    mLayerManager = lm;
    return mLayerManager;
  }

  if (EmbedLiteApp::GetInstance()->GetType() == EmbedLiteApp::EMBED_INVALID) {
    LOGT(&quot;Create Layer Manager for Process View&quot;);

    mLayerManager = new ClientLayerManager(this);
    ShadowLayerForwarder* lf = mLayerManager->AsShadowForwarder();
    if (!lf->HasShadowManager() && aShadowManager) {
      lf->SetShadowManager(aShadowManager);
    }
    return mLayerManager;
  }

  nsIWidget* topWidget = GetTopLevelWidget();
  if (topWidget && topWidget != this) {
      mLayerManager = topWidget->GetLayerManager(aShadowManager, aBackendHint, 
    aPersistence);
      return mLayerManager;
  }
  else {
      return nullptr;
  }
}

I've been stepping carefully through this code on both ESR 78 and ESR 91 all evening. I've tested it when it's first called and also on the call that's happening inside nsPresShell::Paint(). In the latter case there's a divergence between ESR 78 and ESR 91. On ESR 78 PuppetWidgetBase::GetLayerManager() is returning a value. On ESR 91 this returns null and the code fails through to the call to GetTopLevelWidget() towards the end after skipping all the preceding conditions. On ESR 91 the value returned as topWidget matches the value of this and so the condition is skipped and nullptr is returned. The crash follows.

Here's the code for PuppetWidgetBase::GetLayerManager(), identical on both ESR 78 and ESR 91:

LayerManager *
PuppetWidgetBase::GetLayerManager(PLayerTransactionChild *aShadowManager,
                                                LayersBackend aBackendHint,
                                                LayerManagerPersistence 
    aPersistence)
{
  if (Destroyed()) {
    return nullptr;
  }

  if (mLayerManager) {
    // This layer manager might be used for painting outside of DoDraw(), so we 
    need
    // to set the correct rotation on it.
    if (mLayerManager->GetBackendType() == LayersBackend::LAYERS_CLIENT) {
      ClientLayerManager* manager =
          static_cast<ClientLayerManager*>(mLayerManager.get());
      manager->SetDefaultTargetConfiguration(mozilla::layers::BufferMode::
    BUFFER_NONE,
                                             mRotation);
    }
    return mLayerManager;
  }

  // Layer manager can be null here. Sub-class shall handle this.
  return mLayerManager;
}

At this point in time I have no idea why PuppetWidgetBase::GetLayerManager() is returning a value on ESR 78 but not on ESR 91. Unfortunately the answer to this question is going to have to wait until tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

30 Jul 2024 : Day 304 #

It's just so wonderful to receive responses to these diary entries in my Mastodon feed. People are so generous and encouraging and every message means a lot. Some of these messages really blow me away with the ideas that have gone into them. I'm no artist myself, so when I see something like this poem from Thigg I'm super impressed.

Coding in the park.
peacefully.
Sun. Warmth.
A Tree. I see it.
Is it a frightening file tree?
Or a collapsed minimum spanning tree?
Never trust in the branches of a B- tree.
I need to do some Beam-Search.
Oh, I am relieved. It's a birch.

The poem is a reference to my time spent yesterday in a park in Buxton, but with a coding twist that's just great. Amazing work Thigg and thank you so much!

Most of that time in the park was spent debugging how the initial ambience value is captured in the front-end and passed on to the gecko engine. My conclusion was that a timing issue was preventing the message from arriving because there are multiple asynchronous steps required before the engine is properly initialised. A message is sent to trigger the PEmbedLiteViewConstructor. Once this has been received the EmbedLiteViewChild is instantiated through a runner on a separate thread. Both of these processes are asynchronous.

So although the ambience notification message is sent after the PEmbedLiteViewConstructor message, it arrives before the nsLookAndFeel observer has been registered.

While this was equally true on ESR 78, it seems that the timing works better there than on ESR 91. Most of the time on ESR 78 everything works fine and just occasionally the ambience notification is dropped. But it is dropped sporadically. So a new solution is needed.

The good news is that the ambience notification trigger can happen any time as long as it happens in the front end. The front end is where we have the details about the current ambience setting. What we therefore need is an event to hook a front-end trigger on to. One that we can use to send the ambience notification back but slightly later in the start-up sequence than it is now.

Looking through the code yesterday, I came across the embedliteviewcreated notification. This is sent at the end of the EmbedLiteViewChild::InitGeckoWindow() method:

void
EmbedLiteViewChild::InitGeckoWindow(const uint32_t parentId,
                                    mozilla::dom::BrowsingContext 
    *parentBrowsingContext,
                                    const bool isPrivateWindow,
                                    const bool isDesktopMode,
                                    const bool isHidden)
{
[...]
  // nsWebBrowser::Create creates nsDocShell, calls InitWindow for 
    nsIBaseWindow,
  // and finally creates nsIBaseWindow. When browsingContext is passed to
  // nsWebBrowser::Create, typeContentWrapper type is passed to the nsWebBrowser
  // upon instantiation.
  mWebBrowser = nsWebBrowser::Create(mChrome, mWidget, browsingContext,
                                     nullptr);
[...]
  mInitialized = true;

  Unused << SendInitialized();

  nsCOMPtr<nsIObserverService> observerService =
          do_GetService(NS_OBSERVERSERVICE_CONTRACTID);
  if (observerService) {
      observerService->NotifyObservers(mDOMWindow, 
    &quot;embedliteviewcreated&quot;, nullptr);
  }

Recall that InitGeckoWindow() is the method that actually calls nsWebBrowser::Create() which then goes on to instantiate nsLookAndFeel. At the end of the method this embedliteviewcreated is sent out as a signal that the method has completed.

The signal is already used in a few places, although from what I can tell all of the uses are currently in embedlite-components for triggering onWindowOpen|() calls.

The important point about InitGeckoWindow() is that these last few steps happen synchronously, guaranteed in-order, because they're after all of the message sending and thread-spawning has taken place. That means that the embedliteviewcreated notification is guaranteed to be sent out after the nsLookAndFeel observer has been created and registered. So it should be safe for us to use this as a trigger for sending back the ambience state. Let's try it. There's going to be a bit of to-and-fro here, but my basic plan is, rather than sending out a colour scheme change notification in WebEngineSettings::initialize() I'll instead set up an observer there instead. This observer will wait until it receives an embedliteviewcreated notification and at that point send out the colour scheme change notification.

All of this is asynchronous so we also have to countenance the possibility that the embedliteviewcreated notification will be sent out before WebEngineSettings::initialize() gets executed. Consequently we'll also send a colour scheme change notification immediately as well. A promise would be better, but we don't have that capability here.

Thankfully the notification change is idempotent (if it's received more than once the later instances will be ignored) so sending out two copies shouldn't be a problem.

Here's the code I've added to WebEngineSettings::initialize() for performing all of this:

    // Notify gecko when the ambience switches between light and dark
    if (engineSettings->isInitialized()) {
        engineSettings->notifyColorSchemeChanged();
    } else {
        connect(engineSettings, &QMozEngineSettings::initialized,
                engineSettings, &SailfishOS::WebEngineSettings::
    notifyColorSchemeChanged);
    }
    connect(silicaTheme, &Silica::Theme::colorSchemeChanged,
            engineSettings, &SailfishOS::WebEngineSettings::
    notifyColorSchemeChanged);


    SailfishOS::WebEngine *webEngine = SailfishOS::WebEngine::instance();

    connect(webEngine, &SailfishOS::WebEngine::recvObserve, [engineSettings, 
    webEngine](const QString &message, const QVariant &data) {
        const QVariantMap dataMap = data.toMap();
        if (message == QLatin1String(&quot;embedliteviewcreated&quot;)) {
            qDebug() << &quot;AMBIENCE: received embedliteviewcreated&quot;;
            engineSettings->notifyColorSchemeChanged();
            webEngine->removeObserver(QStringLiteral(
    &quot;embedliteviewcreated&quot;));
            // Should also disconnect signal
        }
    });

    // subscribe to gecko messages
    webEngine->addObserver(QStringLiteral(&quot;embedliteviewcreated&quot;));

Let's break this down a bit. First a notification is sent out with details of the current ambience setting. It may get delayed slightly if QMozEngineSettings hasn't yet initialised. Immediately afterwards an observer is set up for receiving the embedliteviewcreated notification. I've used a lambda function for simplicity here, but for reasons I'll come on to, I'll probably turn it into a named method at some point.

When the embedliteviewcreated message is received another notification is sent out containing the ambience state.

In practice, this actually works really nicely. Now the colour change notification is always received and the ambience status is correct all the way from the start.

That's the proof-of-concept sorted, now I have to turn it into something a little cleaner. There are two considerations here. The first is that I want to disconnect the signal once the notification has been sent out, to avoid the (very small) resource overhead of observing the notifications. Although it is possible to disconnect lambda functions, I think I'll make my life easier and the code clearer if I use a named function instead so that the signal can be disconnected cleanly. The second consideration is that I don't want to pollute the public WebEngineSettings interface with the internal slots needed to do all this.

Therefore I've created a private class to hold all of the behind-the-scenes gubbins. There's a bit of boiler-plate code needed for this, but once it's done it can be used and so is likely to make the interface cleaner in future too, because there will already be easy access to the private class to work with.

Now I have my updated code. The signal connection with the lambda function has now been reduced to this:

    // Subscribe to gecko messages
    // When the embedliteviewcreated notification is received the ambience 
    notification will be sent again
    SailfishOS::WebEngine *webEngine = SailfishOS::WebEngine::instance();
    connect(webEngine, &SailfishOS::WebEngine::recvObserve,
            engineSettings->d, &SailfishOS::WebEngineSettingsPrivate::
    oneShotNotifyColorSchemeChanged);
    webEngine->addObserver(QStringLiteral(&quot;embedliteviewcreated&quot;));

The contents of the lambda function have been moved to a separate named function like this:

/*!
    \internal
    \brief Notifies gecko about ambience color scheme change and removes 
    observer.
*/
void SailfishOS::WebEngineSettingsPrivate::oneShotNotifyColorSchemeChanged(
    const QString &message, const QVariant &data)
{
    Q_UNUSED(data);
    if (message == QLatin1String(&quot;embedliteviewcreated&quot;)) {
        SailfishOS::WebEngine *webEngine = SailfishOS::WebEngine::instance();
        // Remove the observer and disconnect the signal
        webEngine->removeObserver(QStringLiteral(
    &quot;embedliteviewcreated&quot;));
        disconnect(webEngine, &SailfishOS::WebEngine::recvObserve,
                   this, &SailfishOS::WebEngineSettingsPrivate::
    oneShotNotifyColorSchemeChanged);

        // Notify about the ambience state
        notifyColorSchemeChanged();
    }
}

Now I know some people just prefer lambda functions, but personally I find this new structure cleaner and easier to follow. More importantly the signal can be disconnected both easily and in a precise way.

After rebuilding and testing this new code it seems to work nicely, which means the ambience dark/light switching is now working correctly in all situations.

I've ticked the task off in my testing list. Tomorrow I'll start working my way through the "Everything on the browser test page" test failures.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

29 Jul 2024 : Day 303 #

Today I'm coding from Buxton in Derbyshire in a park called "The Slopes". It's bright and the weather is calm, but also cool enough in the shade to be extremely comfortable, so I'm sat on a park bench making the most of it. Truly, this is the way to live! I've spent most of my life, at least since I was at University in the nineties, trying to code in parks; first by connecting to a Nokia feature phone with a Psion; more recently using laptops with screens I could barely read in the sun and Wifi connections that were as flaky as they were slow. Now a quarter of a century later it finally feels like we've reached the point of park-coding viability. This really is progress.

Me, sitting on a park bench in The Slopes, Buxton, with a laptop, a phone and a book. The sun is shining and in the background a grassy hill stretches above me.

Yesterday I was able to get the ambience colour-mode switching to a stable state. This meant that when in "follow ambience" mode the colour mode changed depending on whether the phone was set to a light or a dark ambience. When it comes to the browser, only the front end is really in a position to listen for ambience changes, since it has the appropriate hooks into Lipstick and can listen on DBus signals. The underlying gecko engine has none of these advantages. The front-end sends notifications to the gecko engine, which then keeps its own variable representing the state of the ambience updated.

Those signals are now working correctly in general, but there's still a glitch which occurs during initialisation. The initial state of the internal value is currently not set correctly. The front end sends a signal to gecko on initialisation so that in theory the engine knows the state of the ambience. This is being sent from the WegEngineSettings class like this:

void SailfishOS::WebEngineSettings::notifyColorSchemeChanged()
{
    Silica::Theme *silicaTheme = Silica::Theme::instance();
    QString scheme = silicaTheme->colorScheme() == Silica::Theme::LightOnDark
            ? QStringLiteral(&quot;dark&quot;)
            : QStringLiteral(&quot;light&quot;);
    SailfishOS::WebEngine::instance()->notifyObservers(QStringLiteral(
    &quot;ambience-theme-changed&quot;), scheme);
}

But despite the intent, this signal isn't being picked up by the engine. When we looked at this a couple of days back we determined that the reason was one of timing. The internal gecko theming classes aren't initialised in time and so never receive the notification. Here's an abridged summary of the sequencing, captured using the debugger (see the full output on Day 301 for the full version):

Thread 1 hit Breakpoint 1, WebEngineSettings::notifyColorSchemeChanged ()
    at webenginesettings.cpp:220
Thread 10 hit Breakpoint 3, nsLookAndFeel::nsLookAndFeel ()
    at nsLookAndFeel.cpp:62
Thread 10 hit Breakpoint 2, nsLookAndFeel::Observer::Observe ()
    at nsLookAndFeel.cpp:81

What we see is that the notification is sent from the WebEngineSettings class before the nsLookAndFeel or its nested Observer class have been initialised. The nsLookAndFeel class isn't going to pick up a notification until it's been instantiated so we need to somehow get this ordering reversed.

Perhaps the surprising thing is that this is apparently a new phenomenon for ESR 91. As far as I'm aware there was no such problem with ESR 78. As I write this I have no idea how the ordering, or speed of initialisation, is handled by gecko. So I guess that's what I'll need to look in to now.

For context, it'll be useful to know where the nsLookAndFeel class is being instantiated. Here's the backtrace from the constructor on ESR 91:

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, nsLookAndFeel::
    nsLookAndFeel (this=0x7fb8ce1030)
    at ${PROJECT}/gecko-dev/widget/qt/nsLookAndFeel.cpp:62
62      ${PROJECT}/gecko-dev/widget/qt/nsLookAndFeel.cpp: No such file or 
    directory.
(gdb) bt
#0  nsLookAndFeel::nsLookAndFeel (this=0x7fb8ce1030)
    at ${PROJECT}/gecko-dev/widget/qt/nsLookAndFeel.cpp:62
#1  0x0000007ff3e8af14 in nsXPLookAndFeel::GetInstance ()
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#2  0x0000007ff3e8b090 in mozilla::LookAndFeel::GetColor (aId=aId@entry=mozilla:
    :StyleSystemColor::WindowBackground, 
    aScheme=aScheme@entry=mozilla::LookAndFeel::ColorScheme::Light, 
    aUseStandins=aUseStandins@entry=mozilla::LookAndFeel::UseStandins::No)
    at ${PROJECT}/gecko-dev/widget/nsXPLookAndFeel.cpp:996
#3  0x0000007ff4a88ba4 in mozilla::LookAndFeel::Color (aDefault=4278190080, 
    aUseStandins=mozilla::LookAndFeel::UseStandins::No, 
    aScheme=mozilla::LookAndFeel::ColorScheme::Light, aId=mozilla::
    StyleSystemColor::WindowBackground)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/LookAndFeel.h:472
#4  nsWebBrowser::Create (aContainerWindow=0x7fb8ce7130, 
    aParentWidget=0x7fb8706e90, 
    aBrowsingContext=aBrowsingContext@entry=0x7fb8ce0200, 
    aInitialWindowChild=aInitialWindowChild@entry=0x0)
    at ${PROJECT}/gecko-dev/toolkit/components/browser/nsWebBrowser.cpp:133
#5  0x0000007ff4c54634 in mozilla::embedlite::EmbedLiteViewChild::
    InitGeckoWindow (this=0x7fb8a803b0, parentId=<optimized out>, 
    parentBrowsingContext=0x0, isPrivateWindow=false, isDesktopMode=false, 
    isHidden=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#6  0x0000007ff4c45bbc in mozilla::detail::RunnableMethodArguments<unsigned int 
    const, mozilla::dom::BrowsingContext*, bool const, bool const, bool const>::
    applyImpl<mozilla::embedlite::EmbedLiteViewChild, void (mozilla::embedlite::
    EmbedLiteViewChild::*)(unsigned int, mozilla::dom::BrowsingContext*, bool, 
    bool, bool), StoreCopyPassByConstLRef<unsigned int const>, 
    StoreRefPtrPassByPtr<mozilla::dom::BrowsingContext>, 
    StoreCopyPassByConstLRef<bool const>, StoreCopyPassByConstLRef<bool const>, 
    StoreCopyPassByConstLRef<bool const>, 0ul, 1ul, 2ul, 3ul, 4ul> (args=..., 
    m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:280
#7  mozilla::detail::RunnableMethodArguments<unsigned int const, mozilla::dom::
    BrowsingContext*, bool const, bool const, bool const>::apply<mozilla::
    embedlite::EmbedLiteViewChild, void (mozilla::embedlite::EmbedLiteViewChild:
    :*)(unsigned int, mozilla::dom::BrowsingContext*, bool, bool, bool)> (
    m=<optimized out>, o=<optimized out>, this=<optimized out>)
    at ${PROJECT}/gecko-dev/xpcom/threads/nsThreadUtils.h:1154
#8  mozilla::detail::RunnableMethodImpl<mozilla::embedlite::
    EmbedLiteViewChild*, void (mozilla::embedlite::EmbedLiteViewChild::*)(
    unsigned int, mozilla::dom::BrowsingContext*, bool, bool, bool), true, (
    mozilla::RunnableKind)0, unsigned int const, mozilla::dom::
    BrowsingContext*, bool const, bool const, bool const>::Run (this=<optimized 
    out>) at ${PROJECT}/gecko-dev/xpcom/threads/nsThreadUtils.h:1201
#9  0x0000007ff19a4d80 in mozilla::RunnableTask::Run (this=0x7fb8c6c7e0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
[...]
#31 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb)

Perhaps the interesting thing is that this is being triggered from EmbedLiteViewChild::InitGeckoWindow(), which is part of the EmbedLite code. That's our code, not upstream's code, so it's unlikely this will have changed substantially. Let's compare this with the same backtrace using ESR 78:

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, nsLookAndFeel::
    nsLookAndFeel (this=0x7f8045f4f0)
    at widget/qt/nsLookAndFeel.cpp:62
62      nsLookAndFeel::nsLookAndFeel()
(gdb) bt
#0  nsLookAndFeel::nsLookAndFeel (this=0x7f8045f4f0) at widget/qt/
    nsLookAndFeel.cpp:62
#1  0x0000007fbbde7098 in nsXPLookAndFeel::GetInstance ()
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#2  0x0000007fbbdeca88 in nsXPLookAndFeel::GetInstance ()
    at widget/nsXPLookAndFeel.cpp:239
#3  mozilla::LookAndFeel::GetColor (aID=aID@entry=mozilla::StyleSystemColor::
    WindowBackground, aResult=aResult@entry=0x7f8045eed8)
    at widget/nsXPLookAndFeel.cpp:1022
#4  0x0000007fbc8ffe3c in nsWebBrowser::Create (aContainerWindow=<optimized 
    out>, aParentWidget=0x7f80bfb860, aBrowsingContext=aBrowsingContext@entry=
    0x7f809bd7a0, aInitialWindowChild=aInitialWindowChild@entry=0x0)
    at toolkit/components/browser/nsWebBrowser.cpp:128
#5  0x0000007fbca930e8 in mozilla::embedlite::EmbedLiteViewChild::
    InitGeckoWindow (this=0x7f80be6790, parentId=0, parentBrowsingContext=0x0, 
    isPrivateWindow=<optimized out>, isDesktopMode=false)
    at obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:847
#6  0x0000007fbca84b84 in mozilla::detail::RunnableMethodArguments<unsigned int 
    const, mozilla::dom::BrowsingContext*, bool const, bool const>::
    applyImpl<mozilla::embedlite::EmbedLiteViewChild, void (mozilla::embedlite::
    EmbedLiteViewChild::*)(unsigned int, mozilla::dom::BrowsingContext*, bool, 
    bool), StoreCopyPassByConstLRef<unsigned int const>, 
    StoreRefPtrPassByPtr<mozilla::dom::BrowsingContext>, 
    StoreCopyPassByConstLRef<bool const>, StoreCopyPassByConstLRef<bool const>, 
    0ul, 1ul, 2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at xpcom/threads/nsThreadUtils.h:990
#7  mozilla::detail::RunnableMethodArguments<unsigned int const, mozilla::dom::
    BrowsingContext*, bool const, bool const>::apply<mozilla::embedlite::
    EmbedLiteViewChild, void (mozilla::embedlite::EmbedLiteViewChild::*)(
    unsigned int, mozilla::dom::BrowsingContext*, bool, bool)> (m=<optimized 
    out>, 
    o=<optimized out>, this=<optimized out>) at xpcom/threads/nsThreadUtils.h:
    1191
#8  mozilla::detail::RunnableMethodImpl<mozilla::embedlite::
    EmbedLiteViewChild*, void (mozilla::embedlite::EmbedLiteViewChild::*)(
    unsigned int, mozilla::dom::BrowsingContext*, bool, bool), true, (mozilla::
    RunnableKind)0, unsigned int const, mozilla::dom::BrowsingContext*, bool 
    const, bool const>::Run (
    this=<optimized out>) at xpcom/threads/nsThreadUtils.h:1237
#9  0x0000007fb9c014c4 in nsThread::ProcessNextEvent (aResult=0x7fa69e6d67, 
    aMayWait=<optimized out>, this=0x7f8002fd60)
    at xpcom/threads/nsThread.cpp:1211
[...]
#26 0x0000007fb735c89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb)

These backtraces are broadly the same. Certainly the closer we get to the PC end of the stack the more similar they get. I'm left feeling a little surprised: why is this no longer working on ESR 91 when it worked fine on ESR 78? It would be useful to know exactly how the sequence works its way through the system and to do that we really need to know what triggers the call to EmbedLiteViewChild::InitGeckoWindow(). A search of the code shows this happens in the EmbedLiteViewChild constructor:

EmbedLiteViewChild::EmbedLiteViewChild(const uint32_t &aWindowId,
                                       const uint32_t &aId,
                                       const uint32_t &aParentId,
                                       mozilla::dom::BrowsingContext 
    *parentBrowsingContext,
                                       const bool &isPrivateWindow,
                                       const bool &isDesktopMode,
                                       const bool &isHidden)
[...]
{
[...]
  MessageLoop::current()->PostTask(NewRunnableMethod<const uint32_t, mozilla::
    dom::BrowsingContext*, const bool, const bool, const bool>
                                   (&quot;mozilla::embedlite::
    EmbedLiteViewChild::InitGeckoWindow&quot;,
                                    this,
                                    &EmbedLiteViewChild::InitGeckoWindow,
                                    aParentId,
                                    parentBrowsingContext,
                                    isPrivateWindow,
                                    isDesktopMode,
                                    isHidden));
}

So that means we need to find out where this is constructed. Using the debugger we can establish that AllocPEmbedLiteViewChild() is furthest down the usable part of the call stack:

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 14, mozilla::embedlite::
    EmbedLiteViewChild::EmbedLiteViewChild (this=0x7fb8b93e30, 
    aWindowId=@0x7fde7c06dc: 1, 
    aId=@0x7fde7c06e0: 1, aParentId=@0x7fde7c06e4: 0, 
    parentBrowsingContext=0x0, isPrivateWindow=@0x7fde7c06ce: false, 
    isDesktopMode=@0x7fde7c06cf: false, isHidden=@0x7fde7c06d0: false)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/
    EmbedLiteViewChild.cpp:71
71      ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/
    EmbedLiteViewChild.cpp: No such file or directory.
(gdb) bt
#0  mozilla::embedlite::EmbedLiteViewChild::EmbedLiteViewChild (
    this=0x7fb8b93e30, aWindowId=@0x7fde7c06dc: 1, aId=@0x7fde7c06e0: 1, 
    aParentId=@0x7fde7c06e4: 0, parentBrowsingContext=0x0, 
    isPrivateWindow=@0x7fde7c06ce: false, isDesktopMode=@0x7fde7c06cf: false, 
    isHidden=@0x7fde7c06d0: false)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/
    EmbedLiteViewChild.cpp:71
#1  0x0000007ff4c5e850 in mozilla::embedlite::EmbedLiteViewThreadChild::
    EmbedLiteViewThreadChild (this=0x7fb8b93e30, windowId=<optimized out>, 
    id=<optimized out>, parentId=<optimized out>, 
    parentBrowsingContext=<optimized out>, isPrivateWindow=<optimized out>, 
    isDesktopMode=<optimized out>, 
    isHidden=<optimized out>)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteViewThreadChild.cpp:15
#2  0x0000007ff4c65448 in mozilla::embedlite::EmbedLiteAppThreadChild::
    AllocPEmbedLiteViewChild (this=0x7fb8aba0e0, windowId=@0x7fde7c06dc: 1, 
    id=@0x7fde7c06e0: 1, parentId=@0x7fde7c06e4: 0, 
    parentBrowsingContext=<optimized out>, isPrivateWindow=@0x7fde7c06ce: 
    false, 
    isDesktopMode=@0x7fde7c06cf: false, isHidden=@0x7fde7c06d0: false)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#3  0x0000007ff1ef4cd4 in mozilla::embedlite::PEmbedLiteAppChild::
    OnMessageReceived (this=0x7fb8aba0e0, msg__=...) at PEmbedLiteAppChild.cpp:
    529
[...]
#30 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb)

Everything prior to this relates to the message loop, which isn't much use to us. So we'll need to start looking on the other side of the message loop, where the message is sent. The method we're interested in for this is AllocPEmbedLiteViewChild(), but there's actually no code in any of the gecko source that references this method. It turns out that's because the calling code is autogenerated. Searching the autogenerated code gives us this hit:

$ grep -rIn &quot;AllocPEmbedLiteViewChild&quot; *
[...]
gecko-dev/obj-build-mer-qt-xr/ipc/ipdl/PEmbedLiteAppChild.cpp:529: 
    PEmbedLiteViewChild* actor = (static_cast<EmbedLiteAppChild*>(
    this))->AllocPEmbedLiteViewChild(windowId, id, parentId, 
    parentBrowsingContext, isPrivateWindow, isDesktopMode, isHidden);

Let's look at this bit of code in more detail.

auto PEmbedLiteAppChild::OnMessageReceived(const Message& msg__) -> 
    PEmbedLiteAppChild::Result
{
    switch (msg__.type()) {
    case PEmbedLiteApp::Msg_PEmbedLiteViewConstructor__ID:
[...]
            PEmbedLiteViewChild* actor = (static_cast<EmbedLiteAppChild*>(
    this))->AllocPEmbedLiteViewChild(windowId, id, parentId, 
    parentBrowsingContext, isPrivateWindow, isDesktopMode, isHidden);
[...]

As we can see, this waits for the Msg_PEmbedLiteViewConstructor__ID message and, when received, calls the AllocPEmbedLiteViewChild() method. We don't want to put a breakpoint on this because this is the code that receives the message. The debugger can't follow the messages, so it'll just give us a stack trace down to the message processing loop. Instead we want a breakpoint on where the message is sent.

On the sender side, there's a method Msg_PEmbedLiteViewConstructor() that's not used to send the message, but it is used to construct an instance of the message. Chances are this will be called just before the message gets sent. So let's stick a breakpoint on that and see what it gives us.

Thread 1 &quot;sailfish-browse&quot; hit Breakpoint 1, mozilla::embedlite::
    PEmbedLiteApp::Msg_PEmbedLiteViewConstructor (
    routingId=routingId@entry=2147483647)
    at PEmbedLiteApp.cpp:68
68      PEmbedLiteApp.cpp: No such file or directory.
(gdb) bt
#0  mozilla::embedlite::PEmbedLiteApp::Msg_PEmbedLiteViewConstructor (
    routingId=routingId@entry=2147483647) at PEmbedLiteApp.cpp:68
#1  0x0000007ff1f062e0 in mozilla::embedlite::PEmbedLiteAppParent::
    SendPEmbedLiteViewConstructor (this=this@entry=0x7fb81fa7a0, 
    actor=<optimized out>, 
    windowId=@0x7fffffe3b4: 1, id=@0x7ff7e258f0: 1, parentId=@0x7fffffe3ac: 0, 
    parentBrowsingContext=@0x7fffffe3a0: 0, 
    isPrivateWindow=@0x7fffffe3ab: false, isDesktopMode=@0x7fffffe3aa: false, 
    isHidden=@0x7fffffe3a9: false) at PEmbedLiteAppParent.cpp:193
#2  0x0000007ff1f064d0 in mozilla::embedlite::PEmbedLiteAppParent::
    SendPEmbedLiteViewConstructor (this=this@entry=0x7fb81fa7a0, 
    windowId=@0x7fffffe3b4: 1, id=@0x7ff7e258f0: 1, parentId=@0x7fffffe3ac: 0, 
    parentBrowsingContext=@0x7fffffe3a0: 0, 
    isPrivateWindow=@0x7fffffe3ab: false, isDesktopMode=@0x7fffffe3aa: false, 
    isHidden=@0x7fffffe3a9: false) at PEmbedLiteAppParent.cpp:170
#3  0x0000007ff4c535f4 in mozilla::embedlite::EmbedLiteApp::CreateView (
    this=0x5555884370, aWindow=0x5555d8a9b0, aParent=<optimized out>, 
    aParentBrowsingContext=<optimized out>, aIsPrivateWindow=<optimized out>, 
    isDesktopMode=<optimized out>, isHidden=<optimized out>)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/EmbedLiteApp.cpp:478
#4  0x0000007ff7e8595c in QMozViewPrivate::createView (this=0x5555af0e50) at 
    qmozview_p.cpp:864
#5  QMozViewPrivate::createView (this=0x5555af0e50) at qmozview_p.cpp:850
#6  0x00000055555d286c in WebPageFactory::createWebPage (this=0x55559e9090, 
    webContainer=0x5555704610, initialTab=...)
    at ../factories/webpagefactory.cpp:44
#7  0x00000055555ade60 in WebPages::page (this=0x55559e9220, tab=...)
    at /srv/mer/toolings/SailfishOS-4.5.0.18/opt/cross/aarch64-meego-linux-gnu/
    include/c++/8.3.0/bits/atomic_base.h:390
#8  0x0000005555591dcc in DeclarativeWebContainer::activatePage (
    this=this@entry=0x5555704610, tab=..., force=force@entry=true, 
    fromExternal=fromExternal@entry=184)
    at /srv/mer/toolings/SailfishOS-4.5.0.18/opt/cross/aarch64-meego-linux-gnu/
    include/c++/8.3.0/bits/atomic_base.h:390
#9  0x0000005555592f64 in DeclarativeWebContainer::loadTab (
    this=this@entry=0x5555704610, tab=..., force=force@entry=true, 
    fromExternal=184)
    at ../core/declarativewebcontainer.cpp:1211
#10 0x0000005555593984 in DeclarativeWebContainer::initialize (
    this=0x5555704610) at ../core/declarativewebcontainer.cpp:1053
#11 0x0000007fefb5b204 in QMetaObject::activate(QObject*, int, int, void**) () 
    from /usr/lib64/libQt5Core.so.5
#12 0x0000007fefb5b204 in QMetaObject::activate(QObject*, int, int, void**) () 
    from /usr/lib64/libQt5Core.so.5
#13 0x0000007ff7e73690 in QMozContextPrivate::Initialized (
    this=0x7ff7f497e0 <(anonymous namespace)::Q_QGS_mozContextPrivateInstance::
    innerFunction()::holder>) at qmozcontext.cpp:133
#14 0x0000007ff4c4c0a4 in mozilla::embedlite::EmbedLiteApp::Initialized (
    this=0x5555884370)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/EmbedLiteApp.cpp:656
#15 0x0000007ff4c5969c in mozilla::embedlite::EmbedLiteAppThreadParent::
    RecvInitialized (this=0x7fb81fa7a0)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteAppThreadParent.cpp:46
#16 0x0000007ff1f037d4 in mozilla::embedlite::PEmbedLiteAppParent::
    OnMessageReceived (this=0x7fb81fa7a0, msg__=...) at PEmbedLiteAppParent.cpp:
    721
[...]
#37 0x000000555557bf88 in main (argc=<optimized out>, argv=<optimized out>) at 
    main.cpp:203
(gdb)

Once again we eventually reach the message loop in the backtrace. But this already tells us a lot of what we need to know. We can see that the call to DeclarativeWebContainer::initialize() includes a subsequent call to EmbedLiteApp::CreateView() and this method looks like this:

EmbedLiteView*
EmbedLiteApp::CreateView(EmbedLiteWindow* aWindow, uint32_t aParent, uintptr_t 
    aParentBrowsingContext, bool aIsPrivateWindow, bool isDesktopMode, bool 
    isHidden)
{
  LOGT();
  NS_ASSERTION(mState == INITIALIZED, &quot;The app must be up and runnning by 
    now&quot;);
  static uint32_t sViewCreateID = 0;
  sViewCreateID++;

  PEmbedLiteViewParent* viewParent = static_cast<PEmbedLiteViewParent*>(
      mAppParent->SendPEmbedLiteViewConstructor(aWindow->GetUniqueID(), 
    sViewCreateID,
                                                aParent, 
    aParentBrowsingContext, aIsPrivateWindow,
                                                isDesktopMode, isHidden));
  EmbedLiteView* view = new EmbedLiteView(this, aWindow, viewParent, 
    sViewCreateID);
  mViews[sViewCreateID] = view;
  return view;
}

There's the call that triggers the sending of the message that ends up constructing the class that will fire a runnable that will eventually result in the creation of our nsLookAndFeel class. Let's summarise the process.

EmbedLiteApp::Initialized() is called.
This triggers a call to DeclarativeWebContainer::initialize().
This then sends an AllocPEmbedLiteViewChild message.
Once the message is received the EmbedLiteViewChild is constructed.
The constructor posts a task to trigger a call to EmbedLiteViewChild::InitGeckoWindow().
This is set to run on a different thread, and once running there are a series of direct calls which end up calling nsXPLookAndFeel::GetInstance().
This causes the creation of the nsLookAndFeel class which creates the observer that we need.

Now we know this is the flow we need align with where the colour scheme notification goes out. I've placed breakpoints on a whole host of methods to find out which order they're called in. Here's the result. I've stripped out a lot of the details here and have just left the calls, the order they come in, and the location in their respective files.

Thread 1 hit Breakpoint 2, DeclarativeWebContainer::initialize ()
    at declarativewebcontainer.cpp:998
Thread 1 hit Breakpoint 2, DeclarativeWebContainer::initialize ()
    at declarativewebcontainer.cpp:998
Thread 1 hit Breakpoint 2, DeclarativeWebContainer::initialize ()
    at declarativewebcontainer.cpp:998
Thread 1 hit Breakpoint 9, EmbedLiteApp::CreateView ()
    at EmbedLiteApp.cpp:472
Thread 1 hit Breakpoint 11, PEmbedLiteAppParent::SendPEmbedLiteViewConstructor ()
    at PEmbedLiteAppParent.cpp:168
Thread 1 hit Breakpoint 11, PEmbedLiteAppParent::SendPEmbedLiteViewConstructor ()
    at PEmbedLiteAppParent.cpp:182
Thread 1 hit Breakpoint 6, PEmbedLiteApp::Msg_PEmbedLiteViewConstructor ()
    at PEmbedLiteApp.cpp:68
Thread 1 hit Breakpoint 3, WebEngineSettings::notifyColorSchemeChanged ()
    at webenginesettings.cpp:220
Thread 10 hit Breakpoint 7, EmbedLiteAppThreadChild::AllocPEmbedLiteViewChild ()
    at EmbedLiteAppThreadChild.cpp:45
Thread 10 hit Breakpoint 10, EmbedLiteViewChild::EmbedLiteViewChild ()
    at EmbedLiteViewChild.cpp:71
Thread 1 hit Breakpoint 2, DeclarativeWebContainer::initialize ()
    at declarativewebcontainer.cpp:998
Thread 10 hit Breakpoint 5, nsLookAndFeel::nsLookAndFeel ()
    at nsLookAndFeel.cpp:62
Thread 1 hit Breakpoint 2, DeclarativeWebContainer::initialize ()
    at declarativewebcontainer.cpp:998

This shows us a few things. The call to DeclarativeWebContainer::initialize() happens early. Consequently the sending of Msg_PEmbedLiteViewConstructor happens early as well. But it doesn't trigger the call to AllocPEmbedLiteViewChild() immediately and so the construction of nsLookAndFeel is delayed. In between the two the call to notifyColorSchemeChanged() occurs and the message sent from this is seemingly lost as a result.

There's one other thing I'd really like to check. The code that calls notifyColorSchemeChanged() in WebEngineSettings::initialize() looks like this:

    // Notify gecko when the ambience switches between light and dark
    if (engineSettings->isInitialized()) {
        engineSettings->notifyColorSchemeChanged();
    } else {
        connect(engineSettings, &QMozEngineSettings::initialized,
                engineSettings, &SailfishOS::WebEngineSettings::
    notifyColorSchemeChanged);
    }

I'm interested to know whether it gets called immediately because QMozEngineSettings is already initialised or whether it has to wait for the initialisation signal to be received. To find this out, I'm going to step through the code of WebEngineSettings::initialize().

Thread 1 &quot;sailfish-browse&quot; hit Breakpoint 12, SailfishOS::
    WebEngineSettings::initialize () at webenginesettings.cpp:96
96          if (isInitialized) {
(gdb) n
100         SailfishOS::WebEngineSettings *engineSettings = instance();
(gdb) n
103         QString langs;
(gdb) 
105         if (locale.size() > 1) {
(gdb) n
106             langs = QString(&quot;%1-%2,%3&quot;).arg(locale.at(0)).arg(
    locale.at(1)).arg(locale.at(0));
(gdb) n
110         engineSettings->setPreference(QStringLiteral(
    &quot;intl.accept_languages&quot;),
(gdb) n
111                                       QVariant::fromValue<QString>(langs));
(gdb) n
125             connect(engineSettings, &QMozEngineSettings::initialized,
(gdb) p engineSettings->isInitialized()
$1 = false

So it's not called immediately but rather called only on receiving the initialisation signal from QMozEngineSettings.

All of this is useful context and interesting background, but it's not giving us any answers about how to fix the problem. So I figure it might be useful to find out what's happening on ESR 78 for comparison. I've registered the same set of breakpoints that we saw earlier for ESR 91 and have set the app to run in the debugger. Here's what we get:

Thread 1 hit Breakpoint 1, DeclarativeWebContainer::initialize ()
    at declarativewebcontainer.cpp:963
Thread 1 hit Breakpoint 1, DeclarativeWebContainer::initialize ()
    at declarativewebcontainer.cpp:963
Thread 1 hit Breakpoint 1, DeclarativeWebContainer::initialize ()
    at declarativewebcontainer.cpp:963
Thread 1 hit Breakpoint 2, EmbedLiteApp::CreateView ()
    at EmbedLiteApp.cpp:472
Thread 1 hit Breakpoint 3, SendPEmbedLiteViewConstructor ()
    at PEmbedLiteAppParent.cpp:162
Thread 1 hit Breakpoint 3, SendPEmbedLiteViewConstructor ()
    at PEmbedLiteAppParent.cpp:174
Thread 1 hit Breakpoint 4, Msg_PEmbedLiteViewConstructor ()
    at ipc_message.h:118
Thread 1 hit Breakpoint 5, notifyColorSchemeChanged ()
    from /usr/lib64/libsailfishwebengine.so.1
Thread 1 hit Breakpoint 5, notifyColorSchemeChanged ()
    from /usr/lib64/libsailfishwebengine.so.1
Thread 10 hit Breakpoint 6, AllocPEmbedLiteViewChild ()
    at EmbedLiteAppThreadChild.cpp:45
Thread 10 hit Breakpoint 7, EmbedLiteViewChild::EmbedLiteViewChild ()
    at EmbedLiteViewChild.cpp:102
Thread 10 hit Breakpoint 8, nsLookAndFeel::nsLookAndFeel ()
    at nsLookAndFeel.cpp:62

The odd thing is that when we compare this against the ESR 91 flow, they look pretty similar. Crucially, the call to notifyColorSchemeChanged() happens before the construction of nsLookAndFeel in both cases. Which begs the obvious question of how this ever worked in ESR 78?

When I try the debugging again, but this time with a breakpoint on nsLookAndFeel::Observer::Observe() I find that this doesn't fire on initialisation. So maybe it was actually broken on ESR 78 as well? The fact is, when I try it several times, the initial message setting the colour mode is generally going through just fine:

[(null) 26214: Unnamed thread 0x78c8002670]: I/LookAndFeel Ambience set to dark

However there were a couple of occasions when the message got lost and as a result the browser came up in light mode when it should have been in dark mode. The vast majority of times it worked fine, but not always. The conclusion I've come to is that, whether on ESR 78 or ESR 91, it's all about timing. On ESR 78 the timing for all but a minority of runs works fine and the correct colour mode is set. But occasionally not. That would be good enough reason to try to fix it, but on ESR 91 the message seems to get missed the majority of the time. Possibly just down to different timings during start up. These are asynchronous processes, so seemingly this can happen.

It's worth adding that during debugging the timings are sent completely off course compared to a usual start-up run. The breakpoints stop the flow of execution, add delays that wouldn't have been there otherwise and generally mess things around. These kinds of race conditions are notoriously susceptible to acting differently during debugging.

Anyway, I'll need to find a new way to fix this. The investigation up until now has revealed a problem and shown that there's no immediately obvious solution. I have an idea about what might work, but that will now have to wait until tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

28 Jul 2024 : Day 302 #

I'm disappointed I didn't manage to sort out the ambience switching colour mode yesterday. I got so far and then just crashed. It was an intense week for me at work and this evening my mind decided it'd had enough.

Now it's the morning, the sun is bright and I feel fresh again. Today I'm travelling, which means train-based coding, which I love. There's nothing better than some enforced focus time spent coding combined with the knowledge that the train is making progress on your behalf, even if you're not.

But it also means I have to be a little judicious with my work. I don't want to spend the entire journey building the packages. Minimising time spent waiting for the compiler is going to be key. That, and figuring out what the correct solution is to this ambience problem.

The reason the problem is a little tricky is because, first, the "Look And Feel" code — which is platform-specific — has changed upstream between ESR 78 and ESR 91. Second, I find the way the code is structured, with the Qt variant of the nsLookAndFeel class being a subclass of nsXPLokAndFeel, to be confusing.

The "XP" in "nsXPLookAndFeel" isn't a reference to Windows XP. Ironically it's almost the opposite. In Mozilla parlance, XP is an abbreviation for "Cross Platform". Not confusing at all.

As an aside: to be fair to Mozilla, XPFE (cross-platform front-end), XPCOM (cross-platform component object model) and use of XP to mean "Cross Platform" more generally, goes back at least as far as 1998, predating the release of Windows XP in August 2001 by several years.

Back to the investigation. After a bit of thought and reviewing the code, the reason for the ambience changes failing to affect the colour mode, and in particular the reason for the breakage between ESR 78 and ESR 91, becomes clear.

The first thing to point out is that the state of the colour mode is extracted through a call to LookAndFeel::GetInt() which returns a value that relates to the current state. In ESR 78 this call is immediately forwarded to the current instance that implements nsLookAndFeel. In our case that's an instance of the Qt version of nsLookAndFeel. Here's the code from ESR 78:

nsresult LookAndFeel::GetInt(IntID aID, int32_t* aResult) {
  return nsLookAndFeel::GetInstance()->GetIntImpl(aID, *aResult);
}

The Qt version then has some code that looks like this:

nsresult
nsLookAndFeel::GetIntImpl(IntID aID, int32_t &aResult)
{
    nsresult rv = nsXPLookAndFeel::GetIntImpl(aID, aResult);
    // Make an exception for eIntID_SystemUsesDarkTheme as this is handled below
    if (NS_SUCCEEDED(rv))
        return rv;

    rv = NS_OK;
    
    switch (aID) {
        case eIntID_CaretBlinkTime:
            aResult = 500;
            break;
[...]
        case eIntID_SystemUsesDarkTheme:
            // Choose theme based on ambience
            aResult = mObserver->GetDarkAmbience() ? 1 : 0;
            break;
[...]

The particular value we're interested in is when aID is set to eIntID_SystemUsesDarkTheme, since that's the identifier used for the colour mode.

As you can see the first thing the method does is defer the request to the parent class implementation nsXPLookAndFeel::GetIntImpl(). The value returned by this is used if the call succeeds. However, if the call to the parent class fails (which happens if it doesn't support a particular value) then the execution continues, allowing the Qt implementation to provide its own version of this value instead.

So far so good. What you don't see from this — because it's in a different part of the code — is that nsXPLookAndFeel::GetIntImpl() does support the eIntID_SystemUsesDarkTheme identifier and so will return with a successful result. That prevents us providing our own value via the Qt implementation.

To work around this, on ESR 78 I updated the condition at the head of the method so that it reads like this instead:

    if (NS_SUCCEEDED(rv) && ((aID != eIntID_SystemUsesDarkTheme) || (aResult != 
    2)))
        return rv;

This is a bit nasty. It says "defer to the parent class in all cases except when the identifier is eIntID_SystemUsesDarkTheme." Although it's ugly to have code that reroutes based on a hard-coded exception like this, the alternative would have been to make changes to the parent nsXPLookAndFeel class. When I wrote this I judged the latter to be worse because nsXPLookAndField is an upstream file, whereas the Qt version of nsLookAndFeel is already part of our own Sailfish-specific implementation.

Better to make changes to our own file than to have to patch an upstream file.

How does this all relate to ESR 91? Well, in ESR 91 the execution flow has changed. Rather than routing the request to the Qt-specific implementation first, only to have it call back to the underlying cross-platform implementation from upstream, in ESR 91 things start with the cross-platform implementation. Here's the code:

nsresult nsXPLookAndFeel::GetIntValue(IntID aID, int32_t& aResult) {
  if (!sInitialized) {
    Init();
  }

  if (const auto* cached = sIntCache.Get(aID)) {
    if (cached->isNothing()) {
      return NS_ERROR_FAILURE;
    }
    aResult = cached->value();
    return NS_OK;
  }

  if (NS_SUCCEEDED(Preferences::GetInt(sIntPrefs[size_t(aID)], &aResult))) {
    sIntCache.Insert(aID, Some(aResult));
    return NS_OK;
  }

  if (NS_FAILED(NativeGetInt(aID, aResult))) {
    sIntCache.Insert(aID, Nothing());
    return NS_ERROR_FAILURE;
  }

  sIntCache.Insert(aID, Some(aResult));
  return NS_OK;
}

As we can see in this code, first the cache is checked. Next the cross-platform implementation is checked. Only if both of these fail does the code move on to calling the child class's NativeGetInt() implementation.

As a consequence of all this, all requests for the value of eIntID_SystemUsesDarkTheme get handled by the upstream code and are never routed to our Qt implementation.

Rather than try to code around this, I'm thinking a better solution would be to directly set the preference when the ambience changes.

To this end I've added in some code to do exactly this. When the observer receives a notification about an ambience change, it now updates the preference directly:

    if (mDarkAmbience != darkAmbience) {
        mDarkAmbience = darkAmbience;
        MOZ_LOG(sLookAndFeel, LogLevel::Info, (&quot;Ambience set to %s&quot;, 
    mDarkAmbience ? &quot;dark&quot; : &quot;light&quot;));

        if (NS_SUCCEEDED(Preferences::SetInt(
    &quot;ui.systemUsesDarkTheme&quot;, darkAmbience))) {
            Refresh();
            if (nsCOMPtr<nsIObserverService> obs = services::GetObserverService(
    )) {
                NotifyChangedAllWindows(widget::ThemeChangeKind::
    StyleAndLayout);
            }
        }
        else {
          MOZ_LOG(sLookAndFeel, LogLevel::Warning, (&quot;Setting ambience 
    preference ui.systemUsesDarkTheme failed.&quot;));
        }
    }

There are a couple of things to note about the above snippet of code. First is that we have to call Refresh() directly after setting the preference. This call will trigger all of the settings caches to be cleared:

void nsXPLookAndFeel::RefreshImpl() {
  // Wipe out our caches.
  sLightColorCache.Clear();
  sDarkColorCache.Clear();
  sFontCache.Clear();
  sFloatCache.Clear();
  sIntCache.Clear();

  // Clear any cached FullLookAndFeel data, which is now invalid.
  if (XRE_IsParentProcess()) {
    widget::RemoteLookAndFeel::ClearCachedData();
  }
}

If we don't clear these caches, the value stored at the previous access will be used instead of the new value.

The second point is that we call NotifyChangedAllWindows() directly after making the change. This tells gecko to redraw all of the live pages, so that they switch from light to dark, or dark to light, depending on the change.

This is a change to the C++ code so it'll need to be rebuilt before I can test it out. Thankfully a partial build should be enough.

After rebuilding, installing and testing the change, it's not working quite as I had hoped. It's doing what I expected, in that it's now updating the setting as the ambience changes. Unfortunately this setting is persistent across reboots and is now set to one of only two values: "dark" or "light". That means that not only does the settings change affect the rendering, it also updates the setting that's shown in the user interface. This overwrites the value if it's set to "follow ambience" to one of "dark" or "light".

That's not what we want. If the value is set to "follow ambience" we do want the render to change when the ambience changes, but we don't want the setting to change, we want it to stay as "follow ambience". That is, we want the setting to stay the same but the value used internally to be overridden transparently at runtime. This will therefore need a little rethinking. It also makes me realise why I chose not to implement it this way previously.

So, I've gone back to the approach from before, but updated it for the new structure. This means adding the condition into the nsXPLookAndFeel.cpp file (exactly in the way I'd hoped to avoid!) like this:

  if (NS_SUCCEEDED(Preferences::GetInt(sIntPrefs[size_t(aID)], &aResult))) {
    if ((aID != IntID::SystemUsesDarkTheme) || (aResult != 2)) {
      sIntCache.Insert(aID, Some(aResult));
      return NS_OK;
    }
  }

This will extract the value requested from the preferences. If the value requested is for the system dark them, or if it comes back with the value 2 ("follow ambience") then rather than returning the value directly it will instead go on and request the value via a call to nsLookAndFeel::NativeGetInt().

When it does this it'll pick up the value dynamically selected based on the ambience, rather than the value set as the preference.

I'm still letting it cache the value, which means I'm also keeping the code that calls Refresh() when the ambience changes.

Testing these changes I find that now things are working a lot better: both the fixed light and dark values work. If it's set to follow the ambience then that's exactly what it does. The only caveat is that when the browser is started from cold it's not being initialised with the correct value. That's because the message being sent from the front end at start up is getting lost. I think that's due to the fact the engine ready to receive it yet, but I'm not totally certain. That's something for me to figure out tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

27 Jul 2024 : Day 301 #

Yesterday was all about the login manager, the service that stores and retrieves passwords for you when using the browser. The feature was completely broken, with the code failing as it attempted to use the upstream version that's build for Firefox. On Sailfish OS we have our own implementation which had to be patched in to get it to work.

Eventually it did work and to celebrate the fact Thigg (thigg) has provided another of his wonderfully evocative illustrations. Here's a pig with wings handing a big golden key to a gecko. Which is actually how the login manager works under the hood!

A pig with wings handing a big golden key to a gecko. Medieval woodcut

There's only one thing to rival the excitement of seeing one of Thigg's images in my Mastodon feed and that's spotting one of Leif-Jöran Olsson's (ljo) brilliant poems. So you can only imagine how thrilled I was to see both when I looked at my feed this morning.

Here's Leif-Jöran's brilliant summary of where things are at. A little ominous in places — some might say rightfully so — but with enough of a dose of optimism to last through the day!

The smell of decaying matter gives hope for threading remaining uncharted territory. Lingering patches opens up the ordered passage at the wormholes' gates. The sudden possibility to transition to other worlds challenges your imaginative capabilities. Flickering sight, sensory shivering. Is it epiphany? No, it its the trusted gecko! With endurance and everyday work you can change worlds of many. Cause for celebration in the month of harvest.

Sailing on this wave of artistic energy, today it's all about yin and yang. Day and night. Light and dark. The ESR 78 version of the browser was the first to support colour schemes for websites, being able to switch between Light mode, Dark mode or automatically matching the ambience of your device. Ambiences were a standout feature of Sailfish OS at launch and, while even iOS is now trying to catch up, it's remained one of the most-loved features of Sailfish OS for many users. Light ambiences were introduced in Sailfish OS 3 and since then, supporting light and dark mode in the browser always felt like a natural extension of this.

Currently switching between fixed light and dark mode through the Settings page works just fine. But when set to "Match ambience" it gets stuck in one or the other and never changes. Identifying that the ambience has changed is the job of the browser chrome. It then sends a message to the browser engine to update its internal setting.

This means there's another consideration too, which is what happens at start up. The initialisation sequence needs to be timed just right, so that the chrome sends the message early enough that pages haven't yet been rendered, but late enough that the engine is far enough along its initialisation path to accept the message.

We already have patch 0093 "Add support for prefers-color-scheme" which is supposed to do the job of sorting all this out, including adding support for automatic ambience changes. It looks like I applied most of these changes already back in August as part of the "Bring back Qt layer" megapatch. I must have (rightly, I think) combined the changes to avoid repatching some already-patched code.

But it's not working and today I plan to find out why.

First off I notice as I look through the code that the Qt nsLookAndFeel.cpp source includes some debug output in the code for handling ambience changes, as you can see in the snippet below. We should be able to configure things so that the call to MOZ_LOG() generates debug output to the console.

LazyLogModule sLookAndFeel(&quot;LookAndFeel&quot;);
[...]

NS_IMETHODIMP nsLookAndFeel::Observer::Observe(nsISupports*, const char* 
    aTopic, const char16_t* aData) {
    MOZ_ASSERT(!strcmp(aTopic, &quot;ambience-theme-changed&quot;));

    bool darkAmbience = false;
    nsDependentString data(aData);
    if (data.EqualsLiteral(&quot;dark&quot;)) {
        darkAmbience = true;
    }

    if (mDarkAmbience != darkAmbience) {
        mDarkAmbience = darkAmbience;
        MOZ_LOG(sLookAndFeel, LogLevel::Info, (&quot;Ambience set to %s&quot;, 
    mDarkAmbience ? &quot;dark&quot; : &quot;light&quot;));
        if (nsCOMPtr<nsIObserverService> obs = services::GetObserverService()) {
            NotifyChangedAllWindows(widget::ThemeChangeKind::StyleAndLayout);
        }
    }

    return NS_OK;
}

In order to have it displayed we need to set the MOZ_LOG environment variable appropriately, as described in the Working with the Browser documentation. Setting this up to include LookAndFeel:5 unlocks these logging lines, so they then output to the console:

$ MOZ_LOG=&quot;LookAndFeel:5&quot; sailfish-browser
[...]
[Parent 10403: Unnamed thread 7a78002670]: I/LookAndFeel Ambience set to dark
[Parent 10403: Unnamed thread 7a78002670]: I/LookAndFeel Ambience set to light
[Parent 10403: Unnamed thread 7a78002670]: I/LookAndFeel Ambience set to dark
[...]

This gives us an interesting insight. When the ambience is switched from light to dark, this message is printed. When the browser starts up the dark ambience is set to false by default. But the initial message sent out to update the colour scheme is either lost or ignored. As a consequence, when the next switch is made to light mode the browser thinks it's already in light mode and ignores it.

Subsequent switches do trigger the debug output as intended.

Nevertheless, these messages are getting so far and then stopping, because the actual switch to light or dark mode never happens. To try to get a better idea about what's going on, I've added a breakpoint to a few methods: notifyColorSchemeChanged(), nsLookAndFeel::Observer::Observe() and nsLookAndFeel::nsLookAndFeel(). The last of these is where the observer is configured, so this needs to be called before the very first notification is sent from the front end.

Here are the breakpoints I've added to try to find out the ordering at start-up:

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   <MULTIPLE>         
        breakpoint already hit 1 time
1.1                         y     0x0000007ff7f50c14 <SailfishOS::
    WebEngineSettings::notifyColorSchemeChanged()@plt+4>
1.2                         y     0x0000007ff7f52fb0 in SailfishOS::
    WebEngineSettings::notifyColorSchemeChanged() at webenginesettings.cpp:220
2       breakpoint     keep y   0x0000007ff3e99514 in nsLookAndFeel::Observer::
    Observe(nsISupports*, char const*, char16_t const*) 
                                                   at widget/qt/
    nsLookAndFeel.cpp:81
        breakpoint already hit 1 time
3       breakpoint     keep y   0x0000007ff3e9c18c in nsLookAndFeel::
    nsLookAndFeel() 
                                                   at widget/qt/
    nsLookAndFeel.cpp:62
        breakpoint already hit 1 time

In terms of actual execution, the running order exposed using these breakpoints turns out to be instructive. Clearly the creation of nsLookAndFeel is happening too late — or maybe the message is being sent too early — but either way the message is being lost before anything can be received.

Thread 1 &quot;sailfish-browse&quot; hit Breakpoint 1, SailfishOS::
    WebEngineSettings::notifyColorSchemeChanged (
    this=0x7ff7f67470 <(anonymous namespace)::Q_QGS_webEngineSettingsInstance::
    innerFunction()::holder>) at webenginesettings.cpp:220
220         Silica::Theme *silicaTheme = Silica::Theme::instance();
(gdb) c
Continuing.
[LWP 11896 exited]
[Switching to LWP 11862]

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 3, nsLookAndFeel::
    nsLookAndFeel (this=0x7fb87dbc50)
    at widget/qt/nsLookAndFeel.cpp:62
62      nsLookAndFeel::nsLookAndFeel()
(gdb) c
Continuing.
[New LWP 11897]

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, nsLookAndFeel::Observer:
    :Observe (this=0x7fb8b25d90, aTopic=0x555601d728 
    &quot;ambience-theme-changed&quot;, 
    aData=0x7fb8cddf88 u&quot;dark&quot;)
    at widget/qt/nsLookAndFeel.cpp:81
81      NS_IMETHODIMP nsLookAndFeel::Observer::Observe(nsISupports*, const 
    char* aTopic, const char16_t* aData) {
(gdb) n
85          nsDependentString data(aData);
(gdb) 
86          if (data.EqualsLiteral(&quot;dark&quot;)) {
(gdb) 
90          if (mDarkAmbience != darkAmbience) {
(gdb) p mDarkAmbience
$5 = false
(gdb) p darkAmbience
$6 = true
(gdb) c
Continuing.
[Parent 11843: Unnamed thread 7fb8002670]: I/LookAndFeel Ambience set to dark

The other problem seems to relate to how nsLookAndFeel.cpp and nsXPLookAndFeel.cpp work with one another. When the notification is received the integer preference related to the colour mode doesn't seem to be getting changed. This is all code that's changed quite significantly since ESR 78 and I'm going to need a bit more time to decipher it.

I've managed to make a bit of progress with this today, but not as much as I'd hoped. I still don't have all of the answers and so I'm going to have to pick this up again tomorrow to try to get to the bottom of it.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

26 Jul 2024 : Day 300 #

Over the last few days I've been working through the testing tasks in Issue #1053 on the sailfish-browser issue tracker on GitHub. Yesterday I got to the end of the list, with the result that 17 out of 22 tests passed (feels pretty good), but with five tests failing. In addition quite a few of the tests generated error output.

Here are the five failing cases that need looking in to:

Video rendering and controls: total fail.
Audio output and controls: partial fail.
Password storage: total fail.
Automatic dark/light theme switching: partial fail.
Everything on the browser test page: fails: single select widget; external links; full screen; double-tap.

In addition to the above there were several cases where error output was shown on the screen:

// Print to PDF
JavaScript error: resource://gre/actors/BrowserElementParent.jsm, line 24: 
    TypeError: browser is null
// Exiting the browser
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteChromeManager.js, line 170: TypeError: chromeListener is undefined
// Saving a downloaded file
JavaScript error: resource://gre/modules/pdfjs.js, line 29: 
    NS_ERROR_NOT_AVAILABLE: 
// Login manager
JavaScript error: resource://gre/modules/LoginHelper.jsm, line 1734: TypeError: 
    browser is null

So that's a quick summary of the things that need to be fixed. Today I'm going to start looking at the password storage and login manager, which seems to be the biggest failure right now (some may argue video is more important, but that also falls further outside my area of expertise).

As with much of the other functionality, Jolla has a handy login test page. There's no backend functionality to the page, but it allows you to enter fake credentials which, if things are working, the browser should pick up and offer to store for future reuse.

On ESR 78 this works correctly, but on ESR 91 it fails with the following error message:

JavaScript error: resource://gre/modules/LoginHelper.jsm, line 1734: TypeError: 
    browser is null

An error message is an error message, but what's odd about this one is that probably the LoginHelper.jsm functionality shouldn't be being queried at all. To corroborate this I put some debug output inside the getBrowserForPrompt() method on ESR 78 and checked whether it got printed when saving a password. It didn't.

The reason for this quickly becomes apparent when I check the ESR 78 patches. Patch 0082 has the title "Allow LoginManagerPrompter to find its window" and comes with the following description:

This patch blocks loading of gecko's LoginManagerAuthPrompter.jsm so that the embedlite-components version can be used instead.
It also patches the nsILoginManagerPrompter interface to allow a reference to the window to be passed through, to allow the embedlite component to understand its context.

Finally it patches ParentChannelListener to pass the correct window object through to the nsILoginManagerAuthPrompter component.

That very first line "so that the embedlite-components version can be used instead" is crucial. Without this patch the upstream login manager will be used. We want our Sailfish-specific login manager to be used instead, which means tweaking the guts of gecko to allow it.

Checking the ESR 91 source clearly shows that this patch hasn't yet been applied, so now would be the time to do this.

Attempting to apply the patch directly fails, but brilliantly, attempting a 3-way merge succeeds:

$ git am ../rpm/0082-sailfishos-gecko-Allow-LoginManagerPrompter-to-find-.patch
Applying: Allow LoginManagerPrompter to find its window. JB#55760, OMP#JOLLA-418
error: patch failed: toolkit/components/passwordmgr/nsILoginManagerPrompter.idl:
    29
error: toolkit/components/passwordmgr/nsILoginManagerPrompter.idl: patch does 
    not apply
Patch failed at 0001 Allow LoginManagerPrompter to find its window. JB#55760, 
    OMP#JOLLA-418
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run &quot;git am --continue&quot;.
If you prefer to skip this patch, run &quot;git am --skip&quot; instead.
To restore the original branch and stop patching, run &quot;git am 
    --abort&quot;.
$ git am --abort

$ git am --3way ../rpm/
    0082-sailfishos-gecko-Allow-LoginManagerPrompter-to-find-.patch
Applying: Allow LoginManagerPrompter to find its window. JB#55760, OMP#JOLLA-418
Using index info to reconstruct a base tree...
M       netwerk/protocol/http/ParentChannelListener.cpp
M       toolkit/components/passwordmgr/LoginManagerParent.jsm
M       toolkit/components/passwordmgr/components.conf
M       toolkit/components/passwordmgr/nsILoginManagerPrompter.idl
Falling back to patching base and 3-way merge...
Auto-merging toolkit/components/passwordmgr/nsILoginManagerPrompter.idl
Auto-merging toolkit/components/passwordmgr/components.conf
Auto-merging toolkit/components/passwordmgr/LoginManagerParent.jsm
Auto-merging netwerk/protocol/http/ParentChannelListener.cpp

Most of the changes here are to the JavaScript, so could potentially be applied dynamically for testing. However there's also a change to an interface IDL file, plus it's the start of my working day so I won't be able to return to this for the next eight hours anyway, so I may as well kick off a build. When I return, if things have gone well, I'll be able to test out this change.

[...]

The build went through successfully, but when I try to use the login manager I now get the following error:

JavaScript error: file:///usr/lib64/mozembedlite/components/
    LoginManagerPrompter.js, line 1530: ReferenceError: ComponentUtils is not 
    defined
JavaScript error: resource://gre/modules/XPCOMUtils.jsm, line 161: 
    NS_ERROR_XPC_GS_RETURNED_FAILURE: ServiceManager::GetService returned 
    failure code:

For the first of these errors it looks like I'll just need to add the following line to the top of the LoginManagerPrompter.js file (see for example the changes made by Raine in embedlite-components Issue #99)

const { ComponentUtils } = ChromeUtils.import(&quot;resource://gre/modules/
    ComponentUtils.jsm&quot;);

The good news is that in addition to making this change in the package source, I can also make it directly on my device for immediate testing.

devel-su vim /usr/lib64/mozembedlite/components/LoginManagerPrompter.js
[...]

With this line added, both errors are now fixed and the login manager prompter is working correctly! That means I've also now finally been able to test clearing of the password data as well. All is working correctly and without error messages.

I'm going to call it a day. Tomorrow I'll look at the failing "Automatic dark/light theme switching".

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

25 Jul 2024 : Day 299 #

I'm at 15 out of 22 tests this morning with 14 passes and one fail. Next on the list is permissions: "location, pop-ups, cookies, camera, microphone."

First up is location. The permission part of this seems to be working correctly. From Jolla's positioning test page, when I select "Get position" the permissions dialog appears. If I deny access the page announces correctly "User denied geolocation prompt". I can then reset the permission via the little browser lock icon in the address bar.

Now if I allow the geolocation permission rather than denying it, a little "toast" notification appears on the screen to say "Positioning is disabled". On the console I get the following output:

ContentPermissionPrompt.js on message received: top:embedui:permissions, msg:
    {&quot;allow&quot;:true,&quot;checkedDontAsk&quot;:false,&quot;id&quot;:
    &quot;browser.sailfishos.org geolocation&quot;}
[D] unknown:0 - Geoclue client path: &quot;/org/freedesktop/Geoclue/Master/
    client1&quot;

My conclusion is that the positioning permissions dialog is working just fine, even if positioning itself isn't. And in fact, when I try to do the same thing on ESR 78, not only do I get the same "Positioning is disabled" toast, I also get an identical error at the console:

ContentPermissionPrompt.js on message received: top:embedui:permissions, msg:
    {&quot;allow&quot;:true,&quot;checkedDontAsk&quot;:false,&quot;id&quot;:
    &quot;browser.sailfishos.org geolocation&quot;}
[D] unknown:0 - Geoclue client path: &quot;/org/freedesktop/Geoclue/Master/
    client1&quot;

This might need a little more investigation, but if the functionality is identical across the two versions, then I'll take that as a good sign.

For the popup permission the behaviour is a little odd. If I visit Jolla's pop-up test page and wait a couple of seconds the pop-up permission dialog appears. I can select Deny or Allow but the result appears to be the same either way: the pop-up never opens. If I select to remember my choice, no change is made to the underlying permissions, which I can check using the little padlock in the address bar.

However, if I change the permission from "Deny" to "Allow" via the padlock and select the link to open another pop-up? Well, then the pop-up opens correctly.

I've always been confused by this functionality: if I select to permanently set the state to "Allow", shouldn't all future pop-ups be allowed, at least until I remove the setting? It doesn't feel quite right to me, but it turns out it does at least match the approach from ESR 78.

So while I'm not entirely comfortable with how it's working, given the behaviour matches ESR 78, I'm considering this a pass.

Next up cookies. I'm using another of Jolla's test pages for this, which allows you to set a cookie, then check its value after restarting the browser.

I've checked that the cookie gets successfully set when Allowed and blocked when Denied. I've also checked that if cookies are blocked in general but with an exception for a particular site, then the cookie is nevertheless stored correctly when visiting a page from the site. Everything on ESR 91 works the same as on ESR 78 and it all feels intuitive and correct.

For the camera and microphone I'm using the same page as I did for the WebRTC tests, Mozilla's getUserMedia Test Page. Although the camera still has the same colouring issue from earlier, everything is otherwise good. And specifically the permission dialogs do their job as expected.

So location, pop-ups, cookies, camera and microphone permission dialogs are all working correctly. I've updated the issue on GitHub to reflect this.

Next I'm going to find out what Happens when I clear the browser data. There's an option in the settings for this, with various different categories available for clearing: history, cookies, passwords, cache, bookmarks and site permissions.

History, cookies, cache, bookmarks and site permissions all appear to work as expected. Unfortunately I'm not able to test password clearing because the functionality to add passwords is currently broken. But I'll come back to that when it's fixed.

For the dark/light theming, switching between one of the fixed values (light or dark) from the settings page works as expected: the site updates its ambience to match (I'm testing using DuckDuckGo). But switching between light and dark phone ambiences doesn't update the site, even though it successfully updates the user interface elements. So that's going to have to be a fail for now.

For audio testing the results are also unfortunately a fail. I'm testing using BBC Sounds which works fine on ESR 78. But on my ESR 91 build we don't get any audio, just an error message that states "This content doesn't seem to be working". Disappointing.

I get the same with the BBC iPlayer for video: it works on ESR 78 but not ESR 91. When using Jolla's video test page I get the same experience. On YouTube as well.

This isn't, to be honest, much of a surprise. I've not applied the changes needed to get audio and video working yet. It's not all bad news though. For example the audio and audio controls on Jolla's audio test page are working correctly. So it looks like the problems are down to the available codecs, rather than something more fundamentally broken with the way audio or video works (or doesn't, in this case).

The final test is for "Everything on the browser test page". Which is a bit nebulous if I'm honest, but I'll still give it my best shot.

All of the prompt dialogs work fine. The multi-select groups work, but the single select widget actually managed to crash the browser. So that's something to look in to.

Text input, radio buttons and checkboxes are all working fine. History (back and forward) works as expected.

Mouse click positioning is looking good.

Interestingly, external links (for example to email or the phone app) are not working. There's no error in the output console either, which won't make fixing the issue any easier. But for now, I just need to identify the fact that this is broken.

The user agent string is good. Window opening and file pickers all work as expected. Localisation, anchors, CSS, Storage, Service Workers are all working.

Full screen doesn't appear to be working. There's also a difference in behaviour when double-tapping. On ESR 78 the double tap goes through, but on ESR 91 it zooms to the enclosing box item instead, as it does with non-selectable items. This will need fixing.

Everything else on the test pages works fine. So while it's not an unambiguous pass, it's not far off.

So that's everything in the list of tests. Seventeen out of twenty two tests passed. Three were partial failures and two were total failures. Here's the full list:

Video rendering and controls: total fail.
Audio output and controls: partial fail.
Private browsing: pass.
Search on page: pass.
Share link: pass.
Save web page as PDF: pass.
Desktop/mobile view switching: pass.
Bookmarks: pass.
History: pass.
Downloads (including setting save destination): pass.
Configuration using about:config: pass.
Home page functionality: pass.
Search providers: pass.
Close tabs on exit: pass.
Do not track: pass.
JavaScript enable/disable toggle: pass.
Password storage: total fail.
Permissions: location, pop-ups, cookies, camera, microphone: pass.
Clearing the browser data: pass.
Automatic dark/light theme switching: partial fail.
Everything on the browser test page: fails: single select widget; external links; full screen; double-tap.
WebRTC audio and video: pass.

Honestly, I don't think that's looking too bad. Tomorrow I'll start working on the failing cases, the first of which will be the password storage.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

24 Jul 2024 : Day 298 #

I'm back to testing again today. Yesterday I worked through eight tests, all of which passed, but with three errors generated from the privileged JavaScript. The errors don't appear to be affecting the functionality adversely, but I'll have to go back and fix them errors at some point.

Today I'm forging on with the tests; fourteen remaining. But before I do that, check out this phenomenal image from artist-in-residence thigg, shared yesterday on Mastodon.

A pig with wings, dressed like a janitor, little smartphones are running around on their legs on the floor like little bugs. The pig sweeps a few of them into a basket with a broom.

Thigg has been responsible for all of the amazing artistic creations on these pages (check out Day 174 and Day 248 for just a couple of the many others) and I'm always bowled over by the creativity.

This one has a slightly different style compared to previous images. Here's what Thigg has to say about it:

Software development is an emotionally challenging task. You need to deal with lots of frustration, your self-confidence and you always need to fight the urge of being sucked in by something that shouldn't be on your priority list... Today I tried to show that you pick up some bugs, but others run around being ignored.

This captures how I've been feeling over the last few days perfectly! Thigg has it spot on. Since I left the cover image bug yesterday, I've been doing my best to move on and find more bugs. Today it's less about sweeping them up and more about search with a magnifying glass to find them!

Let's continue with this task by working through the remaining tests. For the about:config configuration test I found a webgl.disabled flag. Activating this flag had the desired effect of disabling WebGL rendering, also restored when I deactivated the flag again. So I've satisfied myself that the configuration is working correctly.

I then tested the homepage functionality, search providers and automatic tab closing.

For the "Do Not Track" test I used the very convenient requestheaders.dev site. This mirrors the request headers sent to the site back at you. When the Do Not Track option is set in the browser I see the following line appearing:

  &quot;dnt&quot;: &quot;1&quot;,

This indicates Do Not Track is enabled. After disabling it in the browser settings this header line disappears, just as it should. I also found another convenient site — jsstatus.com — the main purpose of which is to tell you whether JavaScript is enabled or not on your browser. Flipping the switch in the settings gives the appropriate response on this page as well.

While searching for this page I also hit the Basemark browser test page and couldn't resist giving it a go. The result showed that ESR 91 on Sailfish OS isn't particularly fast (compared to desktop Firefox), but its functionality is pretty much up there with the latest.

Here are the results.

Functionality	Desktop Firefox	Sailfish ESR 91
Performance	657.00	168.72
CSS Capabilities	59.66%	58.23%
HTML5 Capabilities	91.71%	92.43%
Page Load and Responsiveness Capabilities	96.94%	98.50%
Resize Capabilities	76.12%	75.86%

To be honest, I wouldn't read too much into these numbers, but they make for an interesting curio.

Four screenshots: Do Not Track headers, JavaScript enabled, JavaScript disabled and the Basemark browser test results

Finally I tested the login manager. Unfortunately this didn't do so well with the following error appearing:

JavaScript error: resource://gre/modules/LoginHelper.jsm, line 1734: TypeError: 
    browser is null

While we've seen similar errors before, this time it seems to to be preventing the login functionality from working at all, so seems to be more serious.

This null browser error is appearing all over the place, so I'd like to get it fixed. But I also suspect there may be something else going on with the login manager as well. My plan is to look in to this further over the coming days.

For today, I've ended up with the following tests passing:

Configuration using about:config.
Home page functionality.
Search providers.
Close tabs on exit.
Do not track.
JavaScript enable/disable toggle.

But also with the first failure as well:

Password storage.

Hopefully the password storage won't remain broken for long.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

23 Jul 2024 : Day 297 #

Yesterday I said I was going to persevere with the hang caused by switching to a non-null cover. But I had a change of heart overnight. I have a tendency to get obsessed with small bugs like this because I'm desperate to know the reason for them. I know there is an answer to this, so the only thing stopping me from finding out is my own inadequacy. When you frame things like that it's easy to overstate the importance of something and end up prioritising the wrong thing.

But while I'd love to know the reason, I do appreciate there are more important things to be working on. I could lose days of work chasing an answer only to discover that someone smarter and better informed knows how to fix it already. During that time I could have fixed several other easier but more impactful bugs.

So I'm going to pause work on the hanging bug and move on to something else. If no solution appears naturally I'll return to it later.

Consequently I started off this morning by fixing all of the remaining cases where the fromExternal flag was needed in the front-end code. This was complicated somewhat by the fact that while it's needed for some calls to newTab(), there are also others where it's not. Some care was therefore needed.

But I think I got all of the relevant cases and none of the extraneous ones. I've committed and pushed my changes and since then I've had some time to spare today to look at other things.

I've gone on to working on Issue 1053 ("Test browser functionality with ESR 91"). This one issue is comprised of 22 subtasks, each of which involves testing some simple functionality of the browser.

So far I've tested the following functionalities, all of which are working as expected, at least to the extent I've been able to test:

Private browsing.
Search on page.
Share link.
Save web page as PDF.
Desktop/mobile view switching.
Bookmarks.
History.
Downloads (including setting save destination).

Since the functionality works I've ticked all of these off on the issue, which feels like a good start. However in some cases alongside the working functionality there were also some errors showing in the debug output.

Given that the errors aren't blocking the functionality, they can't be too serious, but I'm still keen to both document them here and also fix them if they're as straightforward as I hope they are.

The following error appeared while performing a print to PDF:

JavaScript error: resource://gre/actors/BrowserElementParent.jsm, line 24: 
    TypeError: browser is null

When exiting the browser the following error appears:

JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteChromeManager.js, line 170: TypeError: chromeListener is undefined

Finally When downloading a file to save it out, the following error appears:

JavaScript error: resource://gre/modules/pdfjs.js, line 29: 
    NS_ERROR_NOT_AVAILABLE:

I'm not going to have time to look into these today, but my plan is to continue testing tomorrow, followed by trying to find simple solutions for each of the errors I encounter as I go through. But that's it for today; there'll be more testing tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

22 Jul 2024 : Day 296 #

Today I've been trying — really quite hard actually — to fix the hang that happens when switching to private browsing mode. Having looked at this before I refreshed myself yesterday to the point of understanding that it's due to the call to setBrowserCover() that happens when mode switch is made on the tab screen.

I've got a bit further today, the results of which only makes things even more confusing. To explain all this, it'll help to take a look at the setBrowserCover() method, which looks like this:

s
    function setBrowserCover(model) {
        if (!model || model.count === 0 || !WebUtils.firstUseDone) {
            cover = Qt.resolvedUrl(&quot;cover/NoTabsCover.qml&quot;)
        } else {
            if (cover != null && window.webView) {
                window.webView.clearSurface()
            }
            cover = null
        }
    }

Let's break this down a bit. The browser manages two tab models, one for normal browsing and the other for private browsing. When switching between the two modes one model is switched out for the other. This setBrowserCover() method is called just proceeding the change in model. So by the time we find ourselves in this method we've already switched the model to the one for private browsing.

Whenever the browser is closed any private browsing state — including the associated tab model — is either destroyed or forgotten. This includes the private browsing tab model. That means that having just opened the browser we know the private browsing tab model will have no tabs and so model.count in the above code will be zero.

That means we're going through the first half of the if statement above. There's only one line of functionality that therefore gets called as a result and that's the following:

            cover = Qt.resolvedUrl(&quot;cover/NoTabsCover.qml&quot;)

Typically the cover model for the browser will be set to null so that it shows the contents of the current page. If there are no pages open the cover is replaced, as we can see with this line of code, with the cover layout defined in the NoTabsCover.qml file.

So far so good. This is exactly what happens when we move to private browsing mode for the very first time. If I comment out the above line there are two consequences:

When there are no active web pages the cover just shows a blank background.
There's no hang.

We can conclude that it seems to be the act of setting the cover that's triggering the hang. This feels very strange because there's nothing special or magical about the cover or the way it gets switched in and out. I've tried a whole host of things in an attempt to get a clearer picture.

For example I wondered whether this was related to private browser mode or not, so I added a timer that switched out the cover after a delay of five seconds, irrespective of what's happening at the time. What I found was that this also hangs the browser, even you just have a static web page open and there's nothing exceptional happening. This suggests that it's not private browsing per se that's causing the problem, but rather the switching of the cover.

Intriguingly, if you do the switch while performing a pan and zoom, there's a crash instead of a hang. This has allowed me to collect the following backtrace:

[D] onTriggered:45 - Set cover: file:///usr/share/sailfish-browser/cover/
    NoTabsCover.qml
[New LWP 2607]
sailfish-browser: ../../../platforms/wayland/wayland_window_common.cpp:256: 
    void WaylandNativeWindow::releaseBuffer(wl_buffer*): Assertion `it != 
    fronted.end()' failed.

Thread 38 &quot;Compositor&quot; received signal SIGABRT, Aborted.
[Switching to LWP 2574]
0x0000007fef49a344 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000007fef49a344 in raise () from /lib64/libc.so.6
#1  0x0000007fef47fce8 in abort () from /lib64/libc.so.6
#2  0x0000007fef48ebd8 in ?? () from /lib64/libc.so.6
#3  0x0000007fef48ec40 in __assert_fail () from /lib64/libc.so.6
#4  0x0000007fe74e3044 in WaylandNativeWindow::releaseBuffer(wl_buffer*) () 
    from /usr/lib64/libhybris//eglplatform_wayland.so
#5  0x0000007fee8fa050 in ?? () from /usr/lib64/libffi.so.8
#6  0x0000007fee8f65f8 in ?? () from /usr/lib64/libffi.so.8
#7  0x0000007fe7795f98 in ?? () from /usr/lib64/libwayland-client.so.0
#8  0x0000007fe7792d80 in ?? () from /usr/lib64/libwayland-client.so.0
#9  0x0000007fe7794038 in wl_display_dispatch_queue_pending () from /usr/lib64/
    libwayland-client.so.0
#10 0x0000007fe74e3204 in WaylandNativeWindow::readQueue(bool) () from /usr/
    lib64/libhybris//eglplatform_wayland.so
#11 0x0000007fe74e23ec in WaylandNativeWindow::finishSwap() () from /usr/lib64/
    libhybris//eglplatform_wayland.so
#12 0x0000007fef090210 in _my_eglSwapBuffersWithDamageEXT () from /usr/lib64/
    libEGL.so.1
#13 0x0000007ff2397110 in mozilla::gl::GLLibraryEGL::fSwapBuffers (
    surface=0x5555991a60, dpy=<optimized out>, this=<optimized out>)
    at gfx/gl/GLLibraryEGL.h:303
#14 mozilla::gl::EglDisplay::fSwapBuffers (surface=0x5555991a60, 
    this=<optimized out>)
    at gfx/gl/GLLibraryEGL.h:694
#15 mozilla::gl::GLContextEGL::SwapBuffers (this=0x7ed41a6e30)
    at gfx/gl/GLContextProviderEGL.cpp:558
#16 0x0000007ff2440e00 in mozilla::layers::CompositorOGL::EndFrame (
    this=0x7ed41a1d70)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#17 0x0000007ff25174dc in mozilla::layers::LayerManagerComposite::Render (
    this=this@entry=0x7ed41a8a70, aInvalidRegion=..., aOpaqueRegion=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#18 0x0000007ff2517728 in mozilla::layers::LayerManagerComposite::
    UpdateAndRender (this=this@entry=0x7ed41a8a70)
    at gfx/layers/composite/LayerManagerComposite.cpp:657
#19 0x0000007ff2517ad8 in mozilla::layers::LayerManagerComposite::
    EndTransaction (this=this@entry=0x7ed41a8a70, aTimeStamp=..., 
    aFlags=aFlags@entry=mozilla::layers::LayerManager::END_DEFAULT)
    at gfx/layers/composite/LayerManagerComposite.cpp:572
#20 0x0000007ff2559274 in mozilla::layers::CompositorBridgeParent::
    CompositeToTarget (this=0x7fb89aba80, aId=..., aTarget=0x0, 
    aRect=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#21 0x0000007ff253e9bc in mozilla::layers::CompositorVsyncScheduler::Composite (
    this=0x7fb8b500e0, aVsyncEvent=...)
    at gfx/layers/ipc/CompositorVsyncScheduler.cpp:256
#22 0x0000007ff2536e34 in mozilla::detail::RunnableMethodArguments<mozilla::
    VsyncEvent>::applyImpl<mozilla::layers::CompositorVsyncScheduler, void (
    mozilla::layers::CompositorVsyncScheduler::*)(mozilla::VsyncEvent const&), 
    StoreCopyPassByConstLRef<mozilla::VsyncEvent>, 0ul> (args=..., m=<optimized 
    out>, 
    o=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/
    nsThreadUtils.h:887
#23 mozilla::detail::RunnableMethodArguments<mozilla::VsyncEvent>::
    apply<mozilla::layers::CompositorVsyncScheduler, void (mozilla::layers::
    CompositorVsyncScheduler::*)(mozilla::VsyncEvent const&)> (m=<optimized 
    out>, o=<optimized out>, this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1154
#24 mozilla::detail::RunnableMethodImpl<mozilla::layers::
    CompositorVsyncScheduler*, void (mozilla::layers::CompositorVsyncScheduler::
    *)(mozilla::VsyncEvent const&), true, (mozilla::RunnableKind)1, mozilla::
    VsyncEvent>::Run (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1201
[...]
#34 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb)

This hints at the possibility that the render buffers may be being swapped in the wrong thread. But my attempts to dig deeper into this haven't as yet thrown up anything that could give more of a hint about what's going on.

I've also added some debug output to the setBrowserCover() method so it now looks like this:

    function setBrowserCover(model) {
        console.log(&quot;model: &quot; + model);
        console.log(&quot;model.count: &quot; + model.count);
        if (!model || model.count === 0 || !WebUtils.firstUseDone) {
            console.log(&quot;Setting cover&quot;);
            cover = Qt.resolvedUrl(&quot;cover/NoTabsCover.qml&quot;)
            console.log(&quot;Set cover: &quot; + cover);
        } else {
            console.log(&quot;Not setting cover&quot;);
            if (cover != null && window.webView) {
                window.webView.clearSurface()
            }
            cover = null
        }
        console.log(&quot;Exiting&quot;);
    }

When switching to private browsing mode, whether it's done via the menu or the tab list, the following output is triggered:

[D] setBrowserCover:20 - model: PrivateTabModel(0x62bf23d0e0)
[D] setBrowserCover:21 - model.count: 0
[D] setBrowserCover:23 - Setting cover
[D] setBrowserCover:25 - Set cover: file:///usr/share/sailfish-browser/cover/
    NoTabsCover.qml
[D] setBrowserCover:33 - Exiting

Immediately after the last debug print here the browser hangs. I've been trying hard to find some method inside the browser that's executed between the last line of this debug output and the actual hang, but without success. I've been doing this by adding breakpoints to various methods, switching to private browsing and watching to see if any of the breakpoints are hit.

So far without luck. Here are just a few of the methods I've attached breakpoints to and tested this way:

GLContextEGL::SwapBuffers()
GLContextEGL::SetDamage()
GLContextEGL::RenewSurface()
GLScreenBuffer::Swap()
ReadBuffer::Attach()
BeginTransaction()
EndEmptyTransaction()
NeedsPaint()
QOpenGLWebPage::onDrawOverlay()

Many of the breakpoints on these methods are triggered at other points in the browsing process, but if this happens I've just been continuing execution until the point at which I manually switch to private browsing. I get the same output and the same hang as when there's no breakpoint, like this:

Thread 39 &quot;Compositor&quot; hit Breakpoint 1, mozilla::layers::
    LayerManagerComposite::BeginTransaction (this=0x7ed41a8c20, aURL=...)
    at gfx/layers/composite/LayerManagerComposite.cpp:232
232     bool LayerManagerComposite::BeginTransaction(const nsCString& aURL) {
(gdb) c
Continuing.
[D] setBrowserCover:20 - model: PrivateTabModel(0x7fd800da50)
[D] setBrowserCover:21 - model.count: 0
[D] setBrowserCover:23 - Setting cover
[D] setBrowserCover:25 - Set cover: file:///usr/share/sailfish-browser/cover/
    NoTabsCover.qml
[D] setBrowserCover:33 - Exiting

There are a few other things I think it's worth mentioning. The hang happens when the cover is set, but not when it's cleared. If the cover is set right at the start and left as it is (so it's never set to null), everything runs fine. So it very much seems to be the act of switching from null to non-null that causes the problem.

Having not managed to find any methods that are fired between the cover being set and the hang occurring, I got frustrated and went for a walk outside. We have a lake nearby that's beautifully calm at this time of year. The air is warm and calm without being oppressive, which makes going for a walk a great way for me to clear my thoughts and come back feeling calmer.

I didn't have any revelations while walking, but I did think about whether I can approach this from a different angle. Rather than trying to find the gecko methods that are causing the problem by seeing if they're being used, what if I were to try to disable gecko functionality in the hope that the hang might suddenly vanish.

If the hang goes away with a particular piece of functionality disabled, then it may indicate some kind of clash between the cover change and the disabled functionality.

So I've tried a whole bunch of things, for example, setting it so that the page is always inactive by forcing the state to be always set to false:

 void QOpenGLWebPage::setActive(bool active)
 {
+    active = false;
     // WebPage is in inactive state until the view is initialized.
     // ::processViewInitialization always forces active state so we
     // can just ignore early activation calls.
     if (!d || !d->mViewInitialized)
         return;
 
     if (d->mActive != active) {
         d->mActive = active;
         d->mView->SetIsActive(d->mActive);
         Q_EMIT activeChanged();
     }
 }

I also tried disabling the initialisation code:

 void QOpenGLWebPage::initialize()
 {
-    d->createView();
 }

Plus a whole bunch of other similar things, from disconnecting various signals to preventing the EGL Display from being initialised. Many of these changes prevented rendering, but none of them prevented the hang.

I don't have an answer for why this is happening, but I'll persevere with it. As with everything computer-related, there is definitely an answer, it's just a case of finding it.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

21 Jul 2024 : Day 295 #

Yesterday, you may recall, I didn't do any coding but rather started looking through some changes from Raine that I needed to incorporate into my local codebase. I'm continuing that today, in the hope of wrapping that up and then moving on to some of the other issues in the issue tracker on GitHub.

So I've reviewed and approved both of Raine's PRs. The first, which I already looked at yesterday, updates the sailfish-components-webview code to support the fromExternal flag. This change is needed to get the code to compile against the changes already made to gecko.

The second switches instances of XPCOMUtils for ComponentUtils since upstream has moved the generateNSGetFactory() method from the former to the latter. I'd already made this change in a few places, but Raine has propagated it across the entire codebase, which is great. After looking through and checking it, I've now approved it.

Those are the two PRs I can see, so I've moved on to the issues. I've marked Issue 1024 ("Restore WebRTC code to ESR 91") as completed. I addressed this issue in some of the most recent changes I made (between days 278 and 286). I've also closed Issue 1020 ("Fix ESR 91 rendering pipeline"). If you've been following along you'll know that this was one of the largest pieces of work I had to do, which I spent a total of 152 days on. First for the main browser rendering between days 51 and 83. Then for the WebView rendering between days 159 and 245. Then following on from this up to Day 277 for the WebGL. So I'm frankly pretty pleased to be able to finally close it.

To wrap up I've decided to start looking at Issue 1053, which is an epic test issue comprised of 22 separate things to check. It looks long, but at this stage I'm hoping I can go through and tick them all off pretty swiftly (we'll have to see whether this actually happens or not!).

The very first thing I try is private browsing mode as it looks like one of the easier things to check.

Immediately I hit a problem. You'll recall that yesterday I closed Issue 1051 ("Fix hang when calling window.setBrowserCover()") thinking it was fixed. Well, I was wrong, it is still there after all. So now I'll need to look in to this further.

The problem manifests itself when you switch between normal and private browsing. As I discovered first time around, the bug causes a hang rather than a crash. Which means I have to interrupt the execution in order to get a backtrace:

Thread 1 &quot;sailfish-browse&quot; received signal SIGINT, Interrupt.
0x0000007fef866718 in pthread_cond_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000007fef866718 in pthread_cond_wait () from /lib64/libpthread.so.0
#1  0x0000007fef979924 in QWaitCondition::wait(QMutex*, unsigned long) () from /
    usr/lib64/libQt5Core.so.5
#2  0x0000007ff0afde08 in ?? () from /usr/lib64/libQt5Quick.so.5
#3  0x0000007ff0b00aa8 in ?? () from /usr/lib64/libQt5Quick.so.5
#4  0x0000007ff0b01270 in ?? () from /usr/lib64/libQt5Quick.so.5
#5  0x0000007ff05503dc in QWindow::event(QEvent*) () from /usr/lib64/
    libQt5Gui.so.5
#6  0x0000007ff0b307b8 in QQuickWindow::event(QEvent*) () from /usr/lib64/
    libQt5Quick.so.5
#7  0x0000007fefb31144 in QCoreApplication::notify(QObject*, QEvent*) () from /
    usr/lib64/libQt5Core.so.5
#8  0x0000007fefb312e8 in QCoreApplication::notifyInternal2(QObject*, QEvent*) (
    ) from /usr/lib64/libQt5Core.so.5
#9  0x0000007ff0546488 in QGuiApplicationPrivate::processExposeEvent(
    QWindowSystemInterfacePrivate::ExposeEvent*) () from /usr/lib64/
    libQt5Gui.so.5
#10 0x0000007ff05470b4 in QGuiApplicationPrivate::processWindowSystemEvent(
    QWindowSystemInterfacePrivate::WindowSystemEvent*) ()
   from /usr/lib64/libQt5Gui.so.5
#11 0x0000007ff05256e4 in QWindowSystemInterface::sendWindowSystemEvents(
    QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQt5Gui.so.5
#12 0x0000007fe7882c4c in ?? () from /usr/lib64/libQt5WaylandClient.so.5
#13 0x0000007fef2cfd34 in g_main_context_dispatch () from /usr/lib64/
    libglib-2.0.so.0
#14 0x0000007fef2cffa0 in ?? () from /usr/lib64/libglib-2.0.so.0
#15 0x0000007fef2d0034 in g_main_context_iteration () from /usr/lib64/
    libglib-2.0.so.0
#16 0x0000007fefb83a90 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop:
    :ProcessEventsFlag>) () from /usr/lib64/libQt5Core.so.5
#17 0x0000007fefb2f608 in QEventLoop::exec(QFlags<QEventLoop::
    ProcessEventsFlag>) () from /usr/lib64/libQt5Core.so.5
#18 0x0000007fefb371d4 in QCoreApplication::exec() () from /usr/lib64/
    libQt5Core.so.5
#19 0x000000555557bf88 in main (argc=<optimized out>, argv=<optimized out>) at 
    main.cpp:203
(gdb)

Getting a backtrace this was is always unsatisfactory because you don't know whether you stopped on the correct thread, or somewhere totally unrelated in one of the other threads that's whirring away at the same time. Gecko has tens of threads running simultaneously (last time this happened it was 70 in total), so that adds a pretty big dose of uncertainty.

I looked at this originally back on Day 122 and kept a record of all of the backtraces for all of the threads back then. They weren't, I have to say, super enlightening and I don't plan to repeat the process again.

The way I fixed it at the time was by amending BrowserPage.qml to remove the following line:

            window.setBrowserCover(webView.tabModel)

Here's that line in context:

    // Use Connections so that target updates when model changes.
    Connections {
        target: AccessPolicy.browserEnabled && webView && webView.tabModel || 
    null
        ignoreUnknownSignals: true
        // Animate overlay to top if needed.
        onCountChanged: {
            if (webView.tabModel.count === 0) {
                webView.handleModelChanges(false)
            }
            window.setBrowserCover(webView.tabModel)
        }
    }

Today I'm going to comment that line out again, just to see whether this is actually an error with private browsing or not. When I do this there are immediately a couple of further errors coming from the QML:

[W] unknown:130 - file:///usr/share/sailfish-browser/pages/components/
    Overlay.qml:130: Error: Insufficient arguments

Thankfully these QML errors are easier to fix. You may recall that some time back I was having trouble with DuckDuckGo rendering. The issue turned out to be related to the Sec-Fetch-* headers being set incorrectly. You can refer back to what I wrote on Day 140 for some of the background if you're interested. But the crucial point is that I had to add a fromExternal flag to various methods, so that it could be passed on for use with the Sec-Fetch-* headers.

It turns out I'd not added a parameter for this flag to all the places where it's called. To fix this I've added the fromExternal to all of the methods that need it.

Following this change the private browsing now seems to work as expected. I still need to fix the hang, but that'll be a task for tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

20 Jul 2024 : Day 294 #

No coding today. Almost every day the first thing I do is dive into the code: reading it, writing it, debugging it. After 294 days (with a few gaps, admittedly) I've grown accustomed to this way of working. Diving in to the code each morning has been comforting: a natural context switch between the harsh vagaries of the world and the reassuring consistency of the source.

(I'm overdoing it of course. Many times the code has been frustrating, incomprehensible and capricious. But I'm going to forget that today for the sake of a more convenient narrative).

The point I'm trying to get to is that, like a furry friend in a nineties Grolsch advert, the browser is not ready yet. But despite that, it feels like it's time for a step change. From focusing entirely on development to figuring out where things go from here and how the browser actually gets into users' hands.

So no coding today. Instead I'm going to be looking through the issues on GitHub, replying to comments and trying to figure out the best way to make packages available.

Things could fail disastrously at this point. If the browser goes out too early and doesn't have the functionality people are expecting, then all this effort will be wasted. On the other hand, I know the Sailfish community is driven by optimism and a sense of adventure: willing to try experimental software and also understanding that there are some sacrifices needed for the sake of user freedoms.

But the entire motivation for this work comes from an understanding that a phone needs a browser. It's critical infrastructure, not nice-to-have. So it really does have to work. In fact, ESR 91 has to work better than ESR 78 if the effort is going to have been worth it.

So let me talk briefly about what I've actually done today. Raine (rainemak) has gone to the trouble of testing out the code using the latest internal Jolla development build of Sailfish OS and the Sailfish SDK. When the next version of Sailfish OS is released, it'll be a version of this build and built using this SDK, so the browser has to work with it.

Raine had to make some changes to get it to work and I had a look through the changes. The changes Raine made to embedlite-components aligned with changes that I'd also had to make, so that's an easy case. But for the gecko-dev repository things are a bit more unexpected. To get things to run Raine has also had to make various changes to the privileged JavaScript.

After looking through Raine's PRs I need to incorporate his changes into mine, update my SDK to the latest Early Access release (4.6.0.1) and rebuild everything using that from scratch. Unfortunately I didn't get a chance to do that today, but I'll pick it up tomorrow and spend a bit more time on it.

Doing this non-coding work is important, but I find I'm not very efficient at it and it's harder to gauge progress compared to making commits. But I'm hoping it's also a sign of where things are at with the browser. It feels like this is the start of the descent towards our destination.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

19 Jul 2024 : Day 293 #

Yesterday I was able to complete a couple of commits to fix some issues. I really want to get on to doing some non-coding work, but before I do, there's one last issue that I really need to address. Back on Day 126 I was having trouble with the cover of the browser app. I created a bug to remind myself to come back to it. Despite that I still managed to forget about it until I suddenly realised this morning.

I do need to fix it though. The bug prevented the cover app from showing a preview of the current page and it'd be rubbish if we had to release the browser without this working properly. On the other hand I'm really hoping that the issue is related to the offscreen rendering and that all of the changes made to fix that will magically allow the cover rendering to work again.

As the issue on GitHub explains:

The call to window.setBrowserCover(webView.tabModel) in BrowserPage.qml causes the user interface to hang. To reproduce:

Run the browser.
Select the "Tab" icon on the far left hand side of the toolbar.
Select the "Private" browser tab icon (Batman).
Notice that the app hangs (but doesn't crash) as the page switches from persistent to private mode.

The change made back then to work around the issue was the following:

git diff 51a72ef86825dcf0deca5ab3adc493247768eaee 
    8a8c474abed7d95ac9b32cbfa6fe90275cf97631
diff --git a/apps/browser/qml/pages/BrowserPage.qml b/apps/browser/qml/pages/
    BrowserPage.qml
index 4373eeef..e0fb48c5 100644
--- a/apps/browser/qml/pages/BrowserPage.qml
+++ b/apps/browser/qml/pages/BrowserPage.qml
@@ -225,7 +225,7 @@ Page {
             if (webView.tabModel.count === 0) {
                 webView.handleModelChanges(false)
             }
-            window.setBrowserCover(webView.tabModel)
+            //window.setBrowserCover(webView.tabModel)
         }
     }

This is a great place to start from because the change is in QML code. This means reversing this change is not only just a matter of uncommenting a single line, but it can also be done entirely without needing to do any recompilation.

devel-su vim /usr/share/sailfish-browser/pages/BrowserPage.qml
[...]
sailfish-browser
[...]

Having made the change, the good news is that it does now work, immediately, without any further changes. This isn't entirely a shock. As I mentioned earlier, there was always a good chance that the existing changes would fix this. The cover rendering makes use of the offscreen render pipeline, which was broken before but is now fixed.

I also checked that WebGL still works when rendered on the cover as well. So with this done I just have to now commit my minimal changes and that should be enough to justify closing the issue on GitHub. For reference, here's the commit I'll be reversing:

$ git log -1 8a8c474abed7d95ac9b32cbfa6fe90275cf97631
commit 8a8c474abed7d95ac9b32cbfa6fe90275cf97631
Author: David Llewellyn-Jones <david.llewellyn-jones@jolla.com>
Date:   Fri Dec 29 11:10:03 2023 +0000

    [browser] Disable window.setBrowserCover
    
    The window.setBrowserCover() call causes the browser to hang when
    rendering with ESR 91. This needs to be fixed, but in the meantime
    disabling the call prevents the hang from occurring.
    
    This change prevents the browser page from being rendered on the app
    cover.

With this change committed my plan is to take a break from coding for a bit now. I'm going to continue writing these diary entries and working on the browser, but I've built up a backlog of administration that I really need to deal with. This means reading through (and responding to) comments on the issue tracker, getting some test packages built for others to use, performing some tests, that kind of thing.

I enjoy working on the code and use it as a way to recover from the real world. Maybe even to hide from it sometimes. It allows me to focus on something controllable while ignoring all of the messy uncontrollable stuff going on elsewhere. I don't think I'm unusual in this. I'm sure other people find their own ways to escape from reality periodically.

But with a project like this it's also important to occasionally step back, take stock, make course corrections and deal with the points of intersection with reality. And that's what I plan to do tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

18 Jul 2024 : Day 292 #

The build I started yesterday had completed by the time I got up this morning. I'm still working on avoiding a crash in the hyphenation code with the latest build by adding an early return to the find_hyphen_values() method of the Hyphenator class in mapped_hyph/src/lib.rs.

This time our test page loads and renders without crashing. This tells us two things. First that the partial build wasn't covering the Rust code, as I suspected. Second that the problem is happening in the Rust code itself. This doesn't preclude the possibility that the error is originating outside the Rust code, but it does at least give us something to focus on.

The header of the find_hyphen_values() method mentions a couple of situations in which a panic can be triggered. It's worth taking a look at these since a panic could be responsible for bringing down the browser. One obvious situation in which a panic will be triggered from the Rust code is as a result of the following line, the first line of the method:

        assert!(values.len() >= word.len());

That's not a debug assert, so it'll apply in the release build as well. I can't just remove it, since although that might stop the crash happening at that point in the code, it'll leave the browser in an unsafe state. The assert is there for a reason after all. So instead I've turned it into a condition with an early return to avoid the possibility of a panic:

$ git diff -U1
diff --git a/third_party/rust/mapped_hyph/src/lib.rs b/third_party/rust/
    mapped_hyph/src/lib.rs
index 848c93d25790..c205bb09359c 100644
--- a/third_party/rust/mapped_hyph/src/lib.rs
+++ b/third_party/rust/mapped_hyph/src/lib.rs
@@ -477,3 +477,6 @@ impl Hyphenator<'_> {
     pub fn find_hyphen_values(&self, word: &str, values: &mut [u8]) -> isize {
-        assert!(values.len() >= word.len());
+        if (values.len() < word.len()) {
+            return 0;
+        }
+        //assert!(values.len() >= word.len());
         values.iter_mut().for_each(|x| *x = 0);

However, as I read through the code and look again at the backtrace, this looks highly unlikely to be the problem. Although we don't know the exact line, it's the level() method that's at the top of the stack when the crash occurs. So actually we know the crash is happening there. Here's what the code there looks like:

    fn level(&self, i: usize) -> Level {
        let offset = u32::from_le_bytes(*array_ref!(self.0, FILE_HEADER_SIZE + 
    4 * i, 4)) as usize;
        let limit = if i == self.num_levels() - 1 {
            self.0.len()
        } else {
            u32::from_le_bytes(*array_ref!(self.0, FILE_HEADER_SIZE + 4 * i + 
    4, 4)) as usize
        };
        debug_assert!(offset + LEVEL_HEADER_SIZE <= limit && limit <= 
    self.0.len());
        debug_assert_eq!(offset & 3, 0);
        debug_assert_eq!(limit & 3, 0);
        Level::new(&self.0[offset .. limit])
    }

It's challenging to understand what might be going wrong here. It looks to me like this is where the requirement for self.0 to be valid (mentioned as the other reason for a potential panic) comes into play. I'm now wondering whether the problem might be an invalid hyphenator file being loaded in, or it being invalid for some other reason.

So I've made an additional change to check that the hyphenator is invalid when the call to hyphenate a string is made, like this:

$ git diff -U1
diff --git a/intl/hyphenation/glue/nsHyphenator.cpp b/intl/hyphenation/glue/
    nsHyphenator.cpp
index c3b377767e3f..974596b17115 100644
--- a/intl/hyphenation/glue/nsHyphenator.cpp
+++ b/intl/hyphenation/glue/nsHyphenator.cpp
@@ -348,2 +348,6 @@ nsresult nsHyphenator::Hyphenate(const nsAString& aString,
 
+  if (!IsValid()) {
+    return NS_ERROR_FAILURE;
+  }
+
   bool inWord = false;

I've set it rebuilding with these changes, which will take at least a few hours to complete. In the meantime, I noticed the following error when loading the test page:

JavaScript error: resource://gre/modules/amInstallTrigger.jsm, line 43: 
    TypeError: this.mm is null

That error seems to be coming from the gecko code itself, rather than some JavaScript loaded by the page, so this is also something it would be worth fixing. I'm sure it's unrelated to the crash, but I may as well look into it while the build completes.

Here's the method inside amInstallTrigger.jsm where this error is being generated:

function RemoteMediator(window) {
  this._windowID = window.windowGlobalChild.innerWindowId;

  this.mm = window.docShell.messageManager;
  this.mm.addWeakMessageListener(MSG_INSTALL_CALLBACK, this);

  this._lastCallbackID = 0;
  this._callbacks = new Map();
}

It looks likely that the call to get window.docShell.messageManager is returning null. Here's the code from nsDocShell which is supposed to return it:

NS_IMETHODIMP
nsDocShell::GetMessageManager(ContentFrameMessageManager** aMessageManager) {
  RefPtr<ContentFrameMessageManager> mm;
  if (RefPtr<BrowserChild> browserChild = BrowserChild::GetFrom(this)) {
    mm = browserChild->GetMessageManager();
  } else if (nsPIDOMWindowOuter* win = GetWindow()) {
    mm = win->GetMessageManager();
  }
  mm.forget(aMessageManager);
  return NS_OK;
}

So the error goes up, potentially, to EmbedLite code. It looks like this might be related to patch 0043, which patches this method and which hasn't get been applied. The patch has the title "Get ContentFrameMessageManager via nsIDocShellTreeOwner" with the following description:

nsDocShellTreeOwner has a reference to WebBrowserChrome which in turn can return ContentFrameMessageManager from BrowserChildHelper.

So I'm going to try applying the patch; maybe that will fix this error?

$ git am --3way ../rpm/
    0043-sailfishos-docshell-Get-ContentFrameMessageManager-v.patch
Applying: Get ContentFrameMessageManager via nsIDocShellTreeOwner. JB#55336 
    OMP#55336

Nice: the patch applies cleanly! Now the GetMessageManager() method looks like this:

NS_IMETHODIMP
nsDocShell::GetMessageManager(ContentFrameMessageManager** aMessageManager) {
  RefPtr<ContentFrameMessageManager> mm;

  nsCOMPtr<nsIBrowserChild> bc;
  if (mTreeOwner) {
    bc = do_GetInterface(mTreeOwner);
  }

  if (bc) {
    bc->GetMessageManager(getter_AddRefs(mm));
  } else if (RefPtr<BrowserChild> browserChild = BrowserChild::GetFrom(this)) {
    mm = browserChild->GetMessageManager();
  } else if (nsPIDOMWindowOuter* win = GetWindow()) {
    mm = win->GetMessageManager();
  }
  mm.forget(aMessageManager);
  return NS_OK;
}

As we can see, an extra clause has been added to the condition, to catch the case where the mTreeOwner variable has an nsIBrowserChild XPCOM interface. If it does then the message manager will be extracted from there.

The hyphenation build has gone through and I've copied the packages over to my phone. That means I can now set this change building instead. In the meantime let's return to the hyphenation crash. I've tried a few different arrangements of things now, rebuilding the code and transferring the result over to my phone. What I've discovered is that adding a condition to check the validity of the hyphenator made no difference to the crash; so I've removed that.

However, rather unexpectedly, after converting the assert that tests whether the values return buffer is at least as large as the word string into a condition with an early return, the site now no longer crashes the browser.

Half of me is surprised, the other half isn't. I'm not surprised because this was clearly signposted code designed to cause a panic. On the other hand, I'm surprised because this shouldn't be happening; and also increasing the output buffer size didn't appear to help. So there may be something deeper going on here, but for now I'm going to take the win.

I feel satisfied with these changes today. I've been able to fix a JavaScript error and a crash bug. While I can't claim any credit for finding a solution (Raine, who authored the patch, deserves all of that), it's nonetheless good to be making progress.

Tomorrow I'll need to pick another task from the issue tracker to take a look at. Although, in fact, I think I'll do some housekeeping and try to catch up on some of the work others have been doing. Now that I look at the issue tracker I see some exciting developments there that I really need to examine.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

17 Jul 2024 : Day 291 #

Yesterday we were looking at a crash in the hyphenator, triggered by visiting a page on amazon.de. It makes for an interesting bug because it's the first time — that I can recall at least — that we've been hit with an issue related to the Rust code, other than as part of the build pipeline. I'm not really certain what the problem is, but given the comments we saw yesterday, it could be either an overflow in the output buffer, or the block of memory represented by self.0 being invalid. According to the header, either of these could cause a panic.

It looks like debugging the Rust code may turn out to be tricky, so in the first instance I've gone about disabling one of the calls further down the call stack so that the crashing code is never called.

This is a precursor to trying to establish the real solution. At this stage I just want to find out whether this fixes the issue or not. My fake fix is to early return from the nsHyphenator::HyphenateWord() method like this:

   AutoTArray<uint8_t, 200> hyphenValues;
   hyphenValues.SetLength(utf8.Length());
 
+  return;
   int32_t result = mDict.match(
       [&](const void*& ptr) {
         return mapped_hyph_find_hyphen_values_raw(
             static_cast<const uint8_t*>(ptr), mDictSize, utf8.BeginReading(),
             utf8.Length(), hyphenValues.Elements(), hyphenValues.Length());
       },

You may notice that I've made the change to the C++ code rather than the Rust code. That's mostly for convenience because I don't actually know whether the partial build process will notice and rebuild changes in the Rust code. I'm sure I'll find out, but right now I'm just trying to find an answer as quickly as possible.

Other than the panic theory, there are other reasons why the code may be crashing. It could simply be that the stack is being exhausted. I must admit that I'm not convinced by this theory: the backtrace doesn't look corrupted and although the stack is very large (177 frames) it's not uncommon for the gecko stack to get quite large as the DOM is traversed. The change I've made now should at least help shed some light on this: if the crash persists, then the exhausted-stack theory becomes more persuasive.

The partial build completed with this small change and I've copied the library over to my phone. Now will it crash?

No; there's no crash. Presumably hyphenation is now no longer working correctly, but it's hard to tell that just from the way the page is laid out. Nevertheless, it looks like we've pinned down the location of the problem, even if it's not clear why it's happening. I thought I'd also have a go at increasing the size of the buffer for the return values. I've extended it by an additional 1024 bytes:

   AutoTArray<uint8_t, 200> hyphenValues;
-  hyphenValues.SetLength(utf8.Length());
+  hyphenValues.SetLength(utf8.Length() + 1024);
   int32_t result = mDict.match(
       [&](const void*& ptr) {
         return mapped_hyph_find_hyphen_values_raw(
             static_cast<const uint8_t*>(ptr), mDictSize, utf8.BeginReading(),
             utf8.Length(), hyphenValues.Elements(), hyphenValues.Length());
       },

But with this change the crash still occurs. It's possible I didn't choose a large enough amount to add to the buffer, but this seems unlikely to me.

As the next test I've amended the Rust code so that the find_hyphen_values() method in mapped_hyph/src/lib.rs returns early.

     pub fn find_hyphen_values(&self, word: &str, values: &mut [u8]) -> isize {
         assert!(values.len() >= word.len());
         values.iter_mut().for_each(|x| *x = 0);
         let top_level = self.level(0);
         let (lh_min, rh_min, clh_min, crh_min) = top_level.word_boundary_mins(
    );
         if word.len() < lh_min + rh_min {
             return 0;
         }
+        if word.len() >= lh_min + rh_min {
+            return 0;
+        }

This may look a little contrived: the two conditions together are equivalent to a guaranteed return. But I didn't want to just add a return here because the Rust compiler is quite thorough. If it had noticed the guaranteed early return it might have also complained about unused functionality later in the method.

I'm interested to know whether this will be enough to fix the crash. If it is, I can play around with the code a bit more to try to isolate the error. However I'm not even certain whether the partial build is actually rebuilding the Rust code or not. I think not, but this test should help determine this with more certainty.

Having run the partial build and copied over the binary, I find that the crash still happens on the page. That makes me think that the partial build wasn't enough to capture the changes to the Rust code. So my next step will be to do a full rebuild and have a go with the output from that once it's ready in the morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

16 Jul 2024 : Day 290 #

It was a huge relief to finally wrap up the WebRTC changes yesterday. When I started this process I had my suspicions that it would be the hardest part, hence I left it to the end. It turned out not to be as hard as fixing the rendering pipeline. Actually it turned out to be surprisingly straightforward, notwithstanding a few bumps along the way, and either way it's now done. So I'm happy.

During the process, back on Day 285, we did see that the colours for the WebRTC video feed looked wrong. After playing around for a bit this morning I've concluded that the issue is with the red and blue channels being swapped over. I'm not planning to spend the time fixing this just yet; it needs fixing for sure, but there are other more important tasks to complete. In the meantime, I've lodged an issue on the browser bug tracker so that it doesn't get forgotten about.

Three images of a face (my face, as it happens). The first image from ESR 91 has an obvious blue tint; next to this is a channel mixer showing red and blue channels swapped; then there's the ESR 91 image with channels swapped and the colours look good; finally there's the image taken using ESR 78 and the colours look good

I was happy to rapidly receive replies back from Frajo (krnlyng) and Björn (Thaodan). They suggest that the issue is happening in gecko-camera. If that's the case, then I'm satisfied that it doesn't have to be fixed within the gecko code.

That brings me on nicely to my next task, which is to work my way through the remaining bugs on this tracker to see what really needs addressing. I have three other important tasks as well, which in totality gives me the following on my to-do list:

Work through the outstanding issues tagged with ESR 91 in the bug tracker.
Respond to Simon Schmeisser and Raine's comments on various PRs.
Package up versions of all the projects so that brave members of the community can test them out.
Convert all of the commits on the FIREFOX_ESR_91_9_X_RELBRANCH_patches branch into patches.

The last of these is more than just a mechanical process. Not only do I need to try to align the patches with those in the existing ESR 78 build, I also want to try to rationalise the remaining patches to minimise overlaps between them.

So that's my four-point plan. And as has been the case throughout this process, there's no point delaying, so I'm going to get straight on to working through the issues on the bug tracker.

First up is bug #1055, an apparent crash when visiting amazon.de. When I try a few pages on the site I don't experience any issues, but there's an example page given in my earlier bug report that, according to what I've written, consistently causes the ESR 91 browser to crash. And when I try this I do indeed still get a crash. There's a backtrace in the bug report, but here's the latest one that I got today.

This is the most enormous backtrace with depth of 177 frames, so I'm not going to include all of them here. But I want to flag up it's surprising depth as it may turn out to be relevant.

Thread 10 &quot;GeckoWorkerThre&quot; received signal SIGSEGV, Segmentation 
    fault.
[Switching to LWP 31398]
0x0000007ff57c9eb4 in mapped_hyph::Hyphenator::level () from /usr/lib64/
    xulrunner-qt5-91.9.1/libxul.so
(gdb)
(gdb) bt
#0  0x0000007ff57c9eb4 in mapped_hyph::Hyphenator::level () from /usr/lib64/
    xulrunner-qt5-91.9.1/libxul.so
#1  0x0000007ff57ca0a0 in mapped_hyph::Hyphenator::find_hyphen_values () from /
    usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#2  0x0000007ff57caed8 in mapped_hyph_find_hyphen_values_raw () from /usr/lib64/
    xulrunner-qt5-91.9.1/libxul.so
#3  0x0000007ff1a2856c in nsHyphenator::<lambda(mozilla::UniquePtr<base::
    SharedMemory, mozilla::DefaultDelete<base::SharedMemory> >&)>::operator() (
    shm=..., __closure=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsTArray.h:1183
#4  mozilla::detail::VariantImplementation<unsigned char, 1, mozilla::
    UniquePtr<base::SharedMemory, mozilla::DefaultDelete<base::SharedMemory> >, 
    mozilla::UniquePtr<HyphDic const, mozilla::DefaultDelete<HyphDic const> > >:
    :matchN<mozilla::Variant<void const*, mozilla::UniquePtr<base::
    SharedMemory, mozilla::DefaultDelete<base::SharedMemory> >, mozilla::
    UniquePtr<const HyphDic, mozilla::DefaultDelete<const HyphDic> > >&, 
    nsHyphenator::HyphenateWord(const nsAString&, uint32_t, uint32_t, 
    nsTArray<bool>&)::<lambda(mozilla::UniquePtr<base::SharedMemory>&)>, 
    nsHyphenator::HyphenateWord(const nsAString&, uint32_t, uint32_t, 
    nsTArray<bool>&)::<lambda(mozilla::UniquePtr<const HyphDic, mozilla::
    DefaultDelete<const HyphDic> >&)> > (aMi=..., aV=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Variant.h:309
#5  mozilla::detail::VariantImplementation<unsigned char, 0, void const*, 
    mozilla::UniquePtr<base::SharedMemory, mozilla::DefaultDelete<base::
    SharedMemory> >, mozilla::UniquePtr<HyphDic const, mozilla::
    DefaultDelete<HyphDic const> > >::matchN<mozilla::Variant<void const*, 
    mozilla::UniquePtr<base::SharedMemory, mozilla::DefaultDelete<base::
    SharedMemory> >, mozilla::UniquePtr<const HyphDic, mozilla::
    DefaultDelete<const HyphDic> > >&, nsHyphenator::HyphenateWord(const 
    nsAString&, uint32_t, uint32_t, nsTArray<bool>&)::<lambda(void const*&)>, 
    nsHyphenator::HyphenateWord(const nsAString&, uint32_t, uint32_t, 
    nsTArray<bool>&)::<lambda(mozilla::UniquePtr<base::SharedMemory, mozilla::
    DefaultDelete<base::SharedMemory> >&)>, nsHyphenator::HyphenateWord(const 
    nsAString&, uint32_t, uint32_t, nsTArray<bool>&)::<lambda(mozilla::
    UniquePtr<const HyphDic, mozilla::DefaultDelete<const HyphDic> >&)> > (
    aMi=..., aV=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Variant.h:318
#6  mozilla::Variant<void const*, mozilla::UniquePtr<base::SharedMemory, 
    mozilla::DefaultDelete<base::SharedMemory> >, mozilla::UniquePtr<HyphDic 
    const, mozilla::DefaultDelete<HyphDic const> > >::matchN<mozilla::
    Variant<void const*, mozilla::UniquePtr<base::SharedMemory, mozilla::
    DefaultDelete<base::SharedMemory> >, mozilla::UniquePtr<const HyphDic, 
    mozilla::DefaultDelete<const HyphDic> > >&, nsHyphenator::HyphenateWord(
    const nsAString&, uint32_t, uint32_t, nsTArray<bool>&)::<lambda(void 
    const*&)>, nsHyphenator::HyphenateWord(const nsAString&, uint32_t, 
    uint32_t, nsTArray<bool>&)::<lambda(mozilla::UniquePtr<base::
    SharedMemory>&)>, nsHyphenator::HyphenateWord(const nsAString&, uint32_t, 
    uint32_t, nsTArray<bool>&)::<lambda(mozilla::UniquePtr<const HyphDic, 
    mozilla::DefaultDelete<const HyphDic> >&)> > (aM1=..., aM0=..., 
    aVariant=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Variant.h:902
#7  mozilla::Variant<void const*, mozilla::UniquePtr<base::SharedMemory, 
    mozilla::DefaultDelete<base::SharedMemory> >, mozilla::UniquePtr<HyphDic 
    const, mozilla::DefaultDelete<HyphDic const> > >::match<nsHyphenator::
    HyphenateWord(const nsAString&, uint32_t, uint32_t, nsTArray<bool>&)::
    <lambda(void const*&)>, nsHyphenator::HyphenateWord(const nsAString&, 
    uint32_t, uint32_t, nsTArray<bool>&)::<lambda(mozilla::UniquePtr<base::
    SharedMemory>&)>, nsHyphenator::HyphenateWord(const nsAString&, uint32_t, 
    uint32_t, nsTArray<bool>&)::<lambda(mozilla::UniquePtr<const HyphDic, 
    mozilla::DefaultDelete<const HyphDic> >&)> > (
    aM1=..., aM0=..., this=0x7e512e3188)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Variant.h:857
#8  nsHyphenator::HyphenateWord (this=this@entry=0x7e512e3180, aString=..., 
    aStart=aStart@entry=0, aLimit=aLimit@entry=7, aHyphens=...)
    at intl/hyphenation/glue/nsHyphenator.cpp:444
#9  0x0000007ff1a2884c in nsHyphenator::Hyphenate (
    this=this@entry=0x7e512e3180, aString=..., aHyphens=...)
    at intl/hyphenation/glue/nsHyphenator.cpp:378
#10 0x0000007ff28e3890 in nsLineBreaker::FindHyphenationPoints (
    this=this@entry=0x7fde7c6e90, aHyphenator=aHyphenator@entry=0x7e512e3180,
    aTextStart=<optimized out>, aTextLimit=<optimized out>, 
    aBreakState=0x7fde7c4a08 &quot;&quot;)
    at dom/base/nsLineBreaker.cpp:322
[...]
#177 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb)

As I mentioned, this has 177 frames, which makes me think it could be a stack overflow that's the problem. It's not clear to me whether this is therefore a bug (in the sense of an error in the code) or just a mismatch between the memory available and the complexity of this particular page. The error is happening during an attempt to hyphenate a word, which doesn't seem like an especially atypical or extraordinary activity.

I'm not planning to fix all of the remaining issues in the bug tracker, but having a crashing browser is no good so I do feel I need to get to the bottom of this.

The crash is happening inside some Rust code, which is interesting since, as far as I'm aware, this is the first time there's been a problem with the Rust code outside of the build process. We're not getting full debugging information as a result, but it looks like the error is happening with one of the calls to level() in mapped_hyph/src/lib.rs here:

    /// Identify acceptable hyphenation positions in the given `word`.
    ///
    /// The caller-supplied `values` must be at least as long as the `word`.
    ///
    /// On return, any elements with an odd value indicate positions in the word
    /// after which a hyphen could be inserted.
    ///
    /// Returns the number of possible hyphenation positions that were found.
    ///
    /// # Panics
    /// If the given `values` slice is too small to hold the results.
    ///
    /// If the block of memory represented by `self.0` is not in fact a valid
    /// hyphenation dictionary, this function may panic with an overflow or
    /// array bounds violation.
    pub fn find_hyphen_values(&self, word: &str, values: &mut [u8]) -> isize {
        assert!(values.len() >= word.len());
        values.iter_mut().for_each(|x| *x = 0);
        let top_level = self.level(0);
        let (lh_min, rh_min, clh_min, crh_min) = top_level.word_boundary_mins();
        if word.len() < lh_min + rh_min {
            return 0;
        }
        let mut hyph_count = top_level.find_hyphen_values(word, values, lh_min, 
    rh_min);
        let compound = hyph_count > 0;
        // Subsequent levels are applied to fragments between potential breaks
        // already found:
        for l in 1 .. self.num_levels() {
            let level = self.level(l);
            if hyph_count > 0 {
[...]

I've included the comment here in case the part about it potentially panicking in two different situations turns out to be important. Could one of these panics be the cause of the crash?

The first panic about the values slide being too small looks like the easiest one to follow back through the code. The length of this buffer seems to be determined by the length of the hyphenValues array created in the nsHyphenator::HyphenateWord() method in the nsHyphenator.cpp file:

  AutoTArray<uint8_t, 200> hyphenValues;
  hyphenValues.SetLength(utf8.Length());
  int32_t result = mDict.match(
      [&](const void*& ptr) {
        return mapped_hyph_find_hyphen_values_raw(
            static_cast<const uint8_t*>(ptr), mDictSize, utf8.BeginReading(),
            utf8.Length(), hyphenValues.Elements(), hyphenValues.Length());
      },
      [&](UniquePtr<base::SharedMemory>& shm) {
        return mapped_hyph_find_hyphen_values_raw(
            static_cast<const uint8_t*>(shm->memory()), mDictSize,
            utf8.BeginReading(), utf8.Length(), hyphenValues.Elements(),
            hyphenValues.Length());
      },
      [&](UniquePtr<const HyphDic>& hyph) {
        return mapped_hyph_find_hyphen_values_dic(
            hyph.get(), utf8.BeginReading(), utf8.Length(),
            hyphenValues.Elements(), hyphenValues.Length());
      });

This is then getting passed on to this method in mapped_hyph/src/ffi.rs:

pub unsafe extern &quot;C&quot; fn mapped_hyph_find_hyphen_values_raw(dic_buf: 
    *const u8, dic_len: u32,
                                                            word: *const 
    c_char, word_len: u32,
                                                            hyphens: *mut u8, 
    hyphens_len: u32) -> i32 {
    if word_len > hyphens_len {
        return -1;
    }
    let (word_str, hyphen_buf) = params_from_c(word, word_len, hyphens, 
    hyphens_len);
    if word_str.is_err() {
        return -1;
    }
    Hyphenator::new(slice::from_raw_parts(dic_buf, dic_len as usize))
        .find_hyphen_values(word_str.unwrap(), hyphen_buf) as i32
}

I'm going to need a bit more time and mental energy to figure out what's happening here, so I'm going to have to leave the rest of the analysis until the morning. I'm thinking that I could do a couple of things: increase the size of the return buffer; add a check to the Rust code; remove the call to the Rust code entirely. The aim wouldn't be to fix the problem, but rather to explore whether we're in the right place or not.

I'll sleep on it and see how these ideas feel in the morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

15 Jul 2024 : Day 289 #

We finished the diary entry day yesterday with a mystery. The C++ code in AudioChannelServices.cpp that sends audio-playback messages is being called on both ESR 78 and ESR 91. On ESR 91 the JavaScript code in AudioPlaybackChild.jsm that receives these messages, converts them into AudioPlayback messages and forwards them on to AudioPlaybackParent.jsm is also working correctly. But on ESR 78 the AudioPlaybackChild never receives the messages.

To add to the mystery, Mark Washeim (poetaster) flagged up the fact that he's experienced the same error when developing apps that use a WebView. That suggests there are at least some occasions when the AudioPlayback class is successfully receiving the messages.

Why are the messages being received in ESR 91 but only on ESR 78 sporadically?

Today I plan to delve into this, but be warned, it's a bit of a lengthy one today. I'm hoping the payoff will be worth it.

Considering the reasons for the peculiar differences I'm seeing between ESR 78 and ESR 91, one possibility is that the actor isn't being properly instantiated on ESR 78. To help determine if this is the case I've added the following code to the AudioPlaybackChild class on both ESR 78 and ESR 91. I've added similar code (but with the output changed accordingly) to the AudioPlaybackParent class, in addition to the debug output that I already added yesterday.

  constructor() {
    super();
    dump(&quot;AudioPlaybackChild: constructor()\n&quot;);
  }

Here's what I get when running this on ESR 91. As you can see I've not removed the other debug output lines, so we can watch both classes be instantiated before the messages are passed to their respective observers.

library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;

AudioPlaybackChild: constructor()
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: active
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:19:22

AudioPlaybackParent: constructor()
AudioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Start
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:20:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 27: 
    TypeError: browser is null
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: inactive-pause
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:19:22

On ESR 78 we get none of this output, despite the same debug print lines being present in the code.

I'm beginning to feel really frustrated by this now. There must be some reason why the actor is being initialised on ESR 91 but not on ESR 78. There are two possibilities that I need to pursue to try to get to the bottom of this. The first is to try to find out if it's just this specific actor that's being skipped, or second is whether the ActorManager — the class that performs the initialisation for all of the actors — isn't starting at all.

To help with this I've placed some debug output in the loop that starts all of the actors. I notice that there's some conditionality here: actors can be disabled using a preference if it's listed as such in the actor configuration. This configuration is listed in the ActorManagerParent.jsm source file itself. Here's the initialisation loop taken from this file:

    for (let [actorName, actor] of Object.entries(actors)) {
      dump(&quot;ActorManagerParent: actor: &quot; + actorName + &quot;, 
    enabledPreference: &quot; + actor.emablePreference + &quot;\n&quot;);
      // If enablePreference is set, only register the actor while the
      // preference is set to true.
      if (actor.enablePreference) {
        let actorNameProp = actorName + &quot;_Preference&quot;;
        XPCOMUtils.defineLazyPreferenceGetter(
          this,
          actorNameProp,
          actor.enablePreference,
          false,
          (prefName, prevValue, isEnabled) => {
            if (isEnabled) {
              dump(&quot;ActorManagerParent: pref registering &quot; + 
    actorName + &quot;\n&quot;);
              register(actorName, actor);
            } else {
              dump(&quot;ActorManagerParent: pref unregistering &quot; + 
    actorName + &quot;\n&quot;);
              unregister(actorName, actor);
            }
            if (actor.onPreferenceChanged) {
              actor.onPreferenceChanged(prefName, prevValue, isEnabled);
            }
          }
        );
        if (!this[actorNameProp]) {
          continue;
        }
      }
   
      dump(&quot;ActorManagerParent: registering &quot; + actorName + 
    &quot;\n&quot;); 
      register(actorName, actor);
    }

As we can see, the loop goes through each of the Actors and checks whether they have an enablePreference entry; if they do it searches for the appropriate preference which, in our case, would be AudioPlayback_Preference. If this is set to true or if there's no preference enabled then the Actor is registered; otherwise if it's set to false the actor is unregistered. The calls to dump() in this code were added by me just now so that we can trace progress through the method (sadly gdb won't work on JavaScript code).

The entry for the AudioPlayback actor, as listed higher up in the same file, looks like this:

  AudioPlayback: {
    parent: {
      moduleURI: &quot;resource://gre/actors/AudioPlaybackParent.jsm&quot;,
    },

    child: {
      moduleURI: &quot;resource://gre/actors/AudioPlaybackChild.jsm&quot;,
      observers: [&quot;audio-playback&quot;],
    },

    allFrames: true,
  },

As we can see from this, the actorName is AudioPlayback and the actor is everything else. But we can also see there's noemabledPreference entry here. As such what I'd expect to see is the preference code being entirely skipped and for the actor to be directly registered through the call at the end of the loop.

When I execute this on ESR 91, here's the output I get. It turns out there are very many actors (36 in total) so there's quite a lot of output. I've cut out the majority of them for brevity, but they all basically look the same anyway so you're not missing much.

Created LOG for EmbedLite
ActorManagerParent: actor: AsyncPrefs, enabledPreference: undefined
ActorManagerParent: registering AsyncPrefs
[...]
ActorManagerParent: actor: AudioPlayback, enabledPreference: undefined
ActorManagerParent: registering AudioPlayback
ActorManagerParent: actor: AutoComplete, enabledPreference: undefined
ActorManagerParent: registering AutoComplete
ActorManagerParent: actor: Autoplay, enabledPreference: undefined
ActorManagerParent: registering Autoplay
[...]
ActorManagerParent: actor: UnselectedTabHover, enabledPreference: undefined
ActorManagerParent: registering UnselectedTabHover
Created LOG for EmbedPrefs

What we see here is what we expect to see: the enablePreference is undefined and so the Actor gets registered. That's on ESR 91. When I add exactly the same debug annotations to the ESR 78 code I get... nothing. I've also added some code to the initialisation steps of the ActorManager and they don't get executed either. So it's looking clearly like the ActorManager isn't being executed at all.

So I need to find where the ActorManager initialisation is triggered. Searching through the code I can see a few potential places, for example in the tab-content.js file and in the browser_startup.js file. Both of these reference the ActorManagerParent. But neither are included in our omni.ja file so presumably they can't be the culprits.

At this point I've placed debug output all over the place and I don't seem to be getting any closer to finding out the reason. This is deeply frustrating.

But suddenly I realise that the relevant code might not be in the gecko repository at all: it could be in the embedlite-components repository. And when I check there, it looks plausible that the ActorManagerChild is getting initialised in the embedhelper.js script which is part of embedlite-components. Could this be what I've been searching for?

I've added some debug output to this file as well, as you can see in the dump() calls in this extract from the file:

function canInitializeActorManagerChild() {
  if (actorManagerChildAttached)
    return false;
  
  const { sharedData } = Services.cpmm;
  dump(&quot;canInitializeActorManagerChild: sharedData.get(
    CHILD_SINGLETON_ACTORS) = &quot; + sharedData.get(CHILD_SINGLETON_ACTORS) + 
    &quot;\n&quot;);
  return sharedData.get(CHILD_SINGLETON_ACTORS);
}

function initializeActorManagerChild() {
  dump(&quot;initializeActorManagerChild\n&quot;);
  try {
    if (canInitializeActorManagerChild()) {
      dump(&quot;canInitializeActorManagerChild: true\n&quot;);
      const { ActorManagerChild } = ChromeUtils.import(&quot;resource://gre/
    modules/ActorManagerChild.jsm&quot;);
      ActorManagerChild.attach(this);
      actorManagerChildAttached = true;
    }
  } catch (e) {}
} 
  
initializeActorManagerChild();

The code here calls initializeActorManagerChild(), which then checks whether initialisation is possible and if it is, initialises it. The two situations in which the initialisation may not be possible are if either it's already been initialised, or the value returned by sharedData.get(CHILD_SINGLETON_ACTORS) resolves to false. Here's the output I get when I run this on ESR 78:

Created LOG for EmbedLiteLayerManager
initializeActorManagerChild
canInitializeActorManagerChild: sharedData.get(CHILD_SINGLETON_ACTORS) = null

This initially looks encouraging: the value returned from the CHILD_SINGLETON_ACTORS check is null, which is interpreted as false in the condition and so the initialisation is skipped. Could this be the reason the ActorManager isn't being initialised?

Sadly my hopes are dashed when I discover that ESR 91 generates exactly the same output. There as well I get a return value of null with no initialisation.

I thought this would be it... I was wrong and I'm no further forwards than before. This all feels incredibly frustrating.

But having wandered into the embedlite-components code, I now notice there's one more place which could be triggering the initialisation and that's in the EmbedLiteGlobalHelper.js file. When I open this up I discover something unexpected. Compare the code in the file from ESR 78 with that from ESR 91. Here first is code taken from my device running ESR 78:

//ChromeUtils.defineModuleGetter(
//  this,
//  &quot;ActorManagerParent&quot;,
//  &quot;resource://gre/modules/ActorManagerParent.jsm&quot;
//);  

Services.scriptloader.loadSubScript(&quot;chrome://embedlite/content/
    Logger.js&quot;);

// Common helper service
  
function EmbedLiteGlobalHelper()
{ 
  //ActorManagerParent.flush();
[...]

And now here's the same code taken from my device running ESR 91:

ChromeUtils.defineModuleGetter(
  this,
  &quot;ActorManagerParent&quot;,
  &quot;resource://gre/modules/ActorManagerParent.jsm&quot;
);

Services.scriptloader.loadSubScript(&quot;chrome://embedlite/content/
    Logger.js&quot;);

// Common helper service

function EmbedLiteGlobalHelper()
{
  // Touch ActorManagerParent so that it gets initialised
  var actor = ActorManagerParent;
[...]

In the first block of code from ESR 78 the lines for getting the ActorManagerParent are completely commented out. That doesn't look right; they're not commented out on my ESR 91 device.

There's another difference too, which is less relevant for the issue at hand, but which explains how this came about. Slightly further down we can see that the call to ActorMangerParent.flush() (which is also commented out) has been switched for a line that just creates an instance of the ActorManagerParent.

I think I've finally found the culprit I've been looking for. The nice thing is that this explains fully what's going on. To understand we need to look at the history of the commits related to these changes:

$ git log -1
commit 377ed10e9cffece90458b1d8152a579b438b5dc0 (HEAD -> sailfishos-esr91, 
    origin/sailfishos-esr91)
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Mon Feb 12 21:33:19 2024 +0000

    [embedlite-components] Ensure ActorManagerParent is initialised
    
    In diff 0bf2601425ec1d8d63925 a call to ActorManagerParent.flush() was
    removed because the function was removed from upstream. However this
    caused ActorManagerParent not to be initialised, preventing the Window
    Actors from being registered.
    
    This change adds code to touch ActorManagerParent to ensure it gets
    reinitialised again.
$ git log -1 0bf2601425ec1d8d63925
commit 0bf2601425ec1d8d639255d6a7c32231e7e38eae
Author: David Llewellyn-Jones <david.llewellyn-jones@jolla.com>
Date:   Thu Nov 23 21:55:13 2023 +0000

    Remove EmbedLiteGlobalHelper.js errors
    
    Makes three changes to address errors that were coming from
    EmbedLiteGlobalHelper.js:
    
    1. Use ComponentUtils.generateNSGetFactory() instead of
       XPCOMUtils.generateNSGetFactory().
    
    2. Remove call to ActorManagerParent.flush(). See Bug 1649843.
    
    3. Use L10nRegistry.registerSources() instead of
       L10nRegistry.registerSource(). See Bug 1648631.
    
    See the following related upstream changes:
    
    https://phabricator.services.mozilla.com/D95206
    
    https://phabricator.services.mozilla.com/D81243

What does this all tell us? Well, back on Day 88 I removed the call to ActorManager.flush() following a change made upstream. In the process of doing this I broke the initialisation of the ActorManager and so went about restoring it on Day 256 by adding in the new line that creates an instance of the class.

As part of testing the original change I must have commented out the initialisation code on my ESR 78 device. Not in the project code on my laptop, but just on my phone. And so we ended up where we are now. This explains not only why it's not working on my ESR 78 device and why a re-installation of the gecko package didn't help (it's a change made in the embedlite-components package!) but also why the error is appearing on Mark's device and not mine: Mark never hacked around with the code on his device to break it.

It's a little frustrating that this is all down to changes I've made, things I've recorded in this diary but which I'd completely forgotten about. I've been led on a bit of a wild goose chase by myself.

Nevertheless I've reversed the changes on my device running ESR 78. Now when I grant permissions to the microphone I get the same error as I get on ESR 91.

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 16: 
    TypeError: browser is null

Just as important to know, even with the error now showing, the audio continues to work as it did before.

This is all just as Mark described it in his message on Mastodon.

Even though it's presumably pretty harmless, I still feel compelled to fix it so there's no error message on ESR 91. Not only is it ugly, but it's also just lost me three days of development time. It would be good to avoid that happening again in the future.

To try to understand the issue with the error I want to understand what the embedding element actually is. To that end I've been reading through various pieces of documentation, including the Firefox Source docs on the topic and a Gecko Embedding Basics document. The latter turns out to be a bit of a red herring, since the embedder element isn't specifically related to embedding Gecko in other applications. But it was a useful read anyway,

The conclusion I'm left with is that in practice an attempt is being made to pass the message on to the embedding element, but this is never actually used. It could potentially be used, for example by an extension, but on Sailfish OS it's not bringing any benefit.

So the appropriate thing to do is to prevent the AutioPlaybackParent attempting to call methods on a null browser variable. There are multiple ways we might go about achieving this. One possibility would be to try to override the AudioPlaybackParent.jsm file so as to replace it with our own version. Sadly, after testing a few things out, I've not been able to get this to work. So I've given up and have decided to go for the simplest possible option, which is just to patch the file with a check for the null value:

$ git diff -U1
diff --git a/toolkit/actors/AudioPlaybackParent.jsm b/toolkit/actors/
    AudioPlaybackParent.jsm
index 5c54058528c5..7cfc5134e886 100644
--- a/toolkit/actors/AudioPlaybackParent.jsm
+++ b/toolkit/actors/AudioPlaybackParent.jsm
@@ -16,2 +16,5 @@ class AudioPlaybackParent extends JSWindowActorParent {
     const browser = this.browsingContext.top.embedderElement;
+    if (!browser) {
+      return;
+    }
     switch (aMessage.name) {

The good news is that this fixes the error, which now no longer appears in the console output. What's more there was a completely separate reason, unrelated to gecko, for the sound not working on my device. Having addressed that I now have a build that has working microphone and no error showing.

Nearly no error. There's still the following appearing:

library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;

Once again, despite the fact that the audio is now working, I still feel the need to track down the reason behind this error. It's possible this is nothing to do with gecko... but I'm not convinced and I'd like to be certain.

From the error it looks like the trigger is from the loading of libmedia.so rather than an attempt to load libandroidicu.so directly. In amongst the Gecko code there are two methods that attempt to dynamically load libmedia.so. Here's one of them:

media_lib *
cubeb_load_media_library()
{
  media_lib ml = {0};
  ml.libmedia = dlopen(&quot;libmedia.so&quot;, RTLD_LAZY);
  if (!ml.libmedia) {
    return NULL;
  }
[...]

The other one is in a method called audiotrack_init. Both of these relate to cubeb which appears to be a cross platform media library. However, when I fire up the debugger and try to put a breakpoint on either of these methods, nothing sticks. It looks like they're not part of xulrunner. So, with nothing else to try, I've just stuck a breakpoint on dlopen() on the off-chance something useful comes from it.

It gets hit many, many times. But I'm interested in a particular time, which is just after the website has requested to use the microphone, and just before I've granted access as a user. And there is a hit which happens right then. And the backtrace looks promising.

Thread 70 &quot;VideoCapture&quot; hit Breakpoint 3, 0x0000007ff0df02ac in 
    dlopen () from /lib64/libdl.so.2
(gdb) bt
#0  0x0000007ff0df02ac in dlopen () from /lib64/libdl.so.2
#1  0x0000007f10057c30 in ?? () from /usr/lib64/gecko-camera/plugins/
    libgeckocamera-droid.so
#2  0x0000007f1005aa4c in droid_media_init () from /usr/lib64/gecko-camera/
    plugins/libgeckocamera-droid.so
#3  0x0000007f1004c2d4 in DroidCameraManager::init() () from /usr/lib64/
    gecko-camera/plugins/libgeckocamera-droid.so
#4  0x0000007fee922844 in gecko::camera::RootCameraManager::init() () from /usr/
    lib64/libgeckocamera.so.0
#5  0x0000007fee922a2c in gecko_camera_manager () from /usr/lib64/
    libgeckocamera.so.0
#6  0x0000007ff48d65a4 in webrtc::videocapturemodule::DeviceInfoSFOS::Init (
    this=this@entry=0x7e60002fc0)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:40
#7  0x0000007ff48dad68 in webrtc::videocapturemodule::DeviceInfoSFOS::
    DeviceInfoSFOS (this=0x7e60002fc0)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:35
#8  0x0000007ff48dad90 in webrtc::videocapturemodule::VideoCaptureImpl::
    CreateDeviceInfo ()
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#9  0x0000007ff4867a68 in webrtc::VideoCaptureFactory::CreateDeviceInfo ()
    at third_party/libwebrtc/webrtc/modules/video_capture/
    video_capture_factory.cc:28
#10 0x0000007ff37de984 in mozilla::camera::VideoEngine::
    GetOrCreateVideoCaptureDeviceInfo (this=0x7e60002f00)
    at dom/media/systemservices/VideoEngine.cpp:180
[...]
#24 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;

As you can see, the error is generated straight after I continue running the code. It's hard to judge whether this is really happening straight after or not (my phone runs pretty fast), but it looks promising. Let's dig deeper.

So here's the code that the backtrace is pointing to. This is the last piece of code before the backtrace leaves the gecko library.

int32_t DeviceInfoSFOS::Init()
{
    cameraManager = gecko_camera_manager();
    // Initialize parent class member
    _lastUsedDeviceName = (char *)malloc(1);
    _lastUsedDeviceName[0] = 0;
    return 0;
}

The key line is the one that calls gecko_camera_manager() that we can also see in the backtrace. This method isn't part of gecko but is instead part of the libgeckocamera.so library, which is part of the Sailfish stack but which lives in a separate repository.

I want to know whether it's precisely this line that's generating the error message. So I've placed a breakpoint on the DeviceInfoSFOS::Init() method so that I can step through it a line at a time. If this is indeed the right place we should see the error message generated as we step through.

And so we do. Here's the output, with the error generated exactly as we step over the call to gecko_camera_manager():

Thread 69 &quot;VideoCapture&quot; hit Breakpoint 4, webrtc::videocapturemodule:
    :DeviceInfoSFOS::Init (this=this@entry=0x7e68002fa0)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:39
39      {
(gdb) n
40          cameraManager = gecko_camera_manager();
(gdb) n
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
42          _lastUsedDeviceName = (char *)malloc(1);
(gdb) n
43          _lastUsedDeviceName[0] = 0;
(gdb)

If, rather than stepping over the call, we instead jump over it, so that it doesn't get executed at all, the error never shows:

Thread 71 &quot;VideoCapture&quot; hit Breakpoint 4, webrtc::videocapturemodule:
    :DeviceInfoSFOS::Init (this=this@entry=0x7e9c004210)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:39
39      {
(gdb) n
40          cameraManager = gecko_camera_manager();
(gdb) tbreak +1
Temporary breakpoint 5 at 0x7ff48d65a8: file third_party/libwebrtc/webrtc/
    modules/video_capture/sfos/device_info_sfos.cc, line 42.
(gdb) jump +1
Continuing at 0x7ff48d65a8.

Thread 71 &quot;VideoCapture&quot; hit Temporary breakpoint 5, webrtc::
    videocapturemodule::DeviceInfoSFOS::Init (this=this@entry=0x7e9c004210)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:42
42          _lastUsedDeviceName = (char *)malloc(1);
(gdb) n
43          _lastUsedDeviceName[0] = 0;
(gdb) c
Continuing.

Interestingly, having jumped over it, this doesn't seem to cause any issues for the microphone or audio. That rather makes sense: the name of the method suggests it's camera-related rather than audio-related. And having jumped over it, attempting to then access the camera does cause a crash:

Thread 71 &quot;VideoCapture&quot; received signal SIGSEGV, Segmentation fault.
0x0000007ff48dc5c4 in webrtc::videocapturemodule::DeviceInfoSFOS::
    NumberOfDevices (this=0x7e9c004210)
    at include/c++/8.3.0/ext/new_allocator.h:86
86      include/c++/8.3.0/ext/new_allocator.h: No such file or directory.

This isn't a crash that could ever happen in real use: it's only happened here because I forcefully skipped a line of code using the debugger that should have been executed.

My conclusion from all this is that the libandroidicu.so error falls outside the scope of the gecko upgrade. It doesn't seem to be doing any harm in practice and there isn't an obvious way to prevent it without wading into the gecko-camera library. I've reached the end of where I can reasonably take this to.

On testing with the LiveKit WebRTC Browser Test I now get the same results using ESR 91 as I do with ESR 78.

Screenshots showing the LiveKit WebRTC Browser Test results. On the left the ESR 78 results, on the right the ESR 91 results. There are both green (success) and red (failure) lines on both, but they both tally with one another

One interesting thing to notice is that the ESR 78 version is showing three rear cameras, whereas the ESR 91 output shows only one. But that's to be expected as the former is running on an Xperia 10 II on which all three cameras are supported, whereas the latter is on an Xperia 10 III which supports only one. So the results tally nicely with what we'd expect.

So today was rather a deep dive into WebRTC errors. Practically speaking I'm not sure I accomplished a great deal, but finding the answers to all of these open questions was cathartic to say the least. I'm glad it was possible to get to the bottom of them all. Where does this leave things? The WebRTC changes seem to be looking pretty good now, so I've pushed all of my commits related to them to the remote repository.

And that means things are looking pretty decent more generally as well. Tomorrow I'm going to do some more general testing of the browser. But I genuinely think we may be reaching a conclusion here.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

14 Jul 2024 : Day 288 #

I'm still continuing today to try to find out what's causing an error coming from the AudioPlayback window actor that's triggered by granting a Web page permission to use the microphone.

To get to the bottom of this, I think it'll be helpful to understand a bit better about how it's all structured. As a window actor, there isn't any code that explicitly constructs an instance of the AudioPlaybackParent JavaScript class where the error is coming from. Instead the Parent and Child pair that makes up the AudioPlayback actor is listed in the ActorManagerParent.jsm file, like this:

  AudioPlayback: {
    parent: {
      moduleURI: &quot;resource://gre/actors/AudioPlaybackParent.jsm&quot;,
    },

    child: {
      moduleURI: &quot;resource://gre/actors/AudioPlaybackChild.jsm&quot;,
      observers: [&quot;audio-playback&quot;],
    },

    allFrames: true,
  },

This is then used to instantiate the two classes and hook them together and register the child class as a listener for the messages given in the observers list. In this case, the actor is only interested in audio-playback messages. The messages get picked up by the child, which then sends further messages to the parent to act on them.

The child class does some conversion of the messages. Here's the code from which audio-playback messages are received and AudioPlayback messages are sent out.

class AudioPlaybackChild extends JSWindowActorChild {
  observe(subject, topic, data) {
    dump(&quot;AudioPlayback: observe()\n&quot;);
    dump(&quot;AudioPlayback: topic: &quot; + topic + &quot;\n&quot;);
    dump(&quot;AudioPlayback: data: &quot; + data + &quot;\n&quot;);

    var stackTrace = Error().stack;
    dump(&quot;Stacktrace: &quot; + stackTrace + &quot;\n&quot;);

    if (topic === &quot;audio-playback&quot;) {
      let name = &quot;AudioPlayback:&quot;;
      if (data === &quot;activeMediaBlockStart&quot;) {
        name += &quot;ActiveMediaBlockStart&quot;;
      } else if (data === &quot;activeMediaBlockStop&quot;) {
        name += &quot;ActiveMediaBlockStop&quot;;
      } else {
        name += data === &quot;active&quot; ? &quot;Start&quot; : 
    &quot;Stop&quot;;
      }   
      this.sendAsyncMessage(name);
    }   
  }
}

In order to try to understand what's going on I've inserted some debug printouts to this method (the lines that call dump()). This should tell us exactly when the observe() method gets called. And when it doesn't.

Having received such a message the child method then sends a message to the parent, which is where the error is triggered. I've added some additional debug output to the parent class as well. This is the version from ESR 91, but I've added similar annotations to the ESR 78 version too.

class AudioPlaybackParent extends JSWindowActorParent {
  constructor() {
    super();
    this._hasAudioPlayback = false;
    this._hasBlockMedia = false;
  }
  receiveMessage(aMessage) {
    dump(&quot;AudioPlayback: receiveMessage()\n&quot;);
    dump(&quot;AudioPlayback: aMessage: &quot; + aMessage.name + 
    &quot;\n&quot;);

    var stackTrace = Error().stack;
    dump(&quot;Stacktrace: &quot; + stackTrace + &quot;\n&quot;);

    const browser = this.browsingContext.top.embedderElement;
    switch (aMessage.name) {
      case &quot;AudioPlayback:Start&quot;:
        this._hasAudioPlayback = true;
        browser.audioPlaybackStarted();
        break;
[...]
    }
  }
}

When I execute this on ESR 91 all of the console output is generated at exactly the time and in exactly the way that's expected:

library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: active
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:15:22

AudioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Start
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:19:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 26: 
    TypeError: browser is null
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: inactive-pause
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:15:22

AudioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Stop
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:19:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 30: 
    TypeError: browser is null

All of this output can be matched up with the child and parent AudioPlayback classes and, as we can see, the end result is the same null browser element error that we saw before.

Since the way in to all this is via the sending of an audio-playback message, the next thing I want to know is where this gets sent from. It turns out there's only one place where this happens and that's in the AudioPlaybackRunnable class that can be found in AudioChannelService.cpp.

I've added breakpoints to the AudioPlaybackRunnable constructor and to the AudioPlaybackRunnable::Run() method to check whether they get used. On ESR 91, when pressing the "Microphone" button, after agreeing to the permissions request, the breakpoint then hits:

(gdb) break AudioPlaybackRunnable::Run
Breakpoint 2 at 0x7ff3c3ab1c: file dom/audiochannel/AudioChannelService.cpp, 
    line 45.
(gdb) break AudioPlaybackRunnable::AudioPlaybackRunnable
Breakpoint 3 at 0x7ff3c3b714: file include/c++/8.3.0/bits/atomic_base.h, line 
    256.
(gdb) c
Continuing.
[...]
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 3, (anonymous namespace)::
    AudioPlaybackRunnable::AudioPlaybackRunnable (
    aReason=mozilla::dom::AudioChannelService::eDataAudibleChanged, 
    aActive=true, aWindow=0x7fb85d1640, this=0x7fb861eaf0)
    at dom/audiochannel/AudioChannelService.cpp:43
43              mReason(aReason) {}
(gdb) bt
#0  (anonymous namespace)::AudioPlaybackRunnable::AudioPlaybackRunnable (
    aReason=mozilla::dom::AudioChannelService::eDataAudibleChanged, 
    aActive=true, 
    aWindow=0x7fb85d1640, this=0x7fb861eaf0)
    at dom/audiochannel/AudioChannelService.cpp:43
#1  mozilla::dom::AudioChannelService::AudioChannelWindow::
    NotifyAudioAudibleChanged (this=this@entry=0x7fb8e450a0, 
    aWindow=0x7fb85d1640, 
    aAudible=aAudible@entry=mozilla::dom::AudioChannelService::eAudible, 
    aReason=aReason@entry=mozilla::dom::AudioChannelService::
    eDataAudibleChanged)
    at dom/audiochannel/AudioChannelService.cpp:580
[...]
#35 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, (anonymous namespace)::
    AudioPlaybackRunnable::Run (this=0x7fb861eaf0)
    at dom/audiochannel/AudioChannelService.cpp:45
45        NS_IMETHOD Run() override {
(gdb) bt
#0  (anonymous namespace)::AudioPlaybackRunnable::Run (this=0x7fb861eaf0)
    at dom/audiochannel/AudioChannelService.cpp:45
#1  0x0000007ff19a4e50 in mozilla::RunnableTask::Run (this=0x7fb956a580)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
[...]
#34 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: active
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:15:22

udioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Start
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:19:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 26: 
    TypeError: browser is null

Directly after this the page gives the option to stop recording. If I then press on this the breakpoints trigger again in a similar way:

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 3, (anonymous namespace)::
    AudioPlaybackRunnable::AudioPlaybackRunnable (
    aReason=mozilla::dom::AudioChannelService::ePauseStateChanged, 
    aActive=false, aWindow=0x7fb85d1640, this=0x7fb8d105c0)
    at dom/audiochannel/AudioChannelService.cpp:43
43              mReason(aReason) {}
(gdb) c
Continuing.

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, (anonymous namespace)::
    AudioPlaybackRunnable::Run (this=0x7fb8d105c0)
    at dom/audiochannel/AudioChannelService.cpp:45
45        NS_IMETHOD Run() override {
(gdb) c
Continuing.
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: inactive-pause
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:15:22

AudioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Stop
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:19:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 30: 
    TypeError: browser is null

This all makes sense and is as expected. But the flow seems to diverge from what we see with ESR 78. There the C++ methods get called just as before. But as you can see in the output below, this never translates into the problematic JavaScript being executed. The following shows the breakpoint being hit when the "Microphone" button is pressed and the user then agrees to grant the site access to the microphone. Crucially, although the breakpoints are hit, the debug output added to the AudioPlayback classes are never triggered.

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, 0x0000007fbbbc47e0 in (
    anonymous namespace)::AudioPlaybackRunnable::AudioPlaybackRunnable (
    aReason=<optimized out>, aActive=<optimized out>, aWindow=<optimized out>, 
    this=<optimized out>)
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
33      obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h: No such file or 
    directory.
(gdb) bt
#0  0x0000007fbbbc47e0 in (anonymous namespace)::AudioPlaybackRunnable::
    AudioPlaybackRunnable (aReason=<optimized out>, aActive=<optimized out>,
    aWindow=<optimized out>, this=<optimized out>)
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#1  mozilla::dom::AudioChannelService::AudioChannelWindow::
    NotifyAudioAudibleChanged (this=this@entry=0x7f80fdd0b0, 
    aWindow=0x7f80cb6060,
    aAudible=aAudible@entry=mozilla::dom::AudioChannelService::eAudible, 
    aReason=aReason@entry=mozilla::dom::AudioChannelService::
    eDataAudibleChanged)
    at dom/audiochannel/AudioChannelService.cpp:592
[...]
#37 0x0000007fb735c89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c
Continuing.

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, (anonymous namespace)::
    AudioPlaybackRunnable::Run (this=0x7f8137d4f0)
    at dom/audiochannel/AudioChannelService.cpp:44
44        NS_IMETHOD Run() override {
(gdb) bt
#0  (anonymous namespace)::AudioPlaybackRunnable::Run (this=0x7f8137d4f0)
    at dom/audiochannel/AudioChannelService.cpp:44
#1  0x0000007fb9c014c4 in nsThread::ProcessNextEvent (aResult=0x7fa69f3ba7, 
    aMayWait=<optimized out>, this=0x7f8002fd60)
    at xpcom/threads/nsThread.cpp:1211
[...]
#24 0x0000007fb735c89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c            
Continuing.
[...]

Selecting the "Stop" button on ESR 78 shows a similar pattern: the breakpoints are hit but the JavaScript debug output never appears:

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, 0x0000007fbbbc47e0 in (
    anonymous namespace)::AudioPlaybackRunnable::AudioPlaybackRunnable (
    aReason=<optimized out>, aActive=<optimized out>, aWindow=<optimized out>, 
    this=<optimized out>)
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
33      obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h: No such file or 
    directory.
(gdb) bt
#0  0x0000007fbbbc47e0 in (anonymous namespace)::AudioPlaybackRunnable::
    AudioPlaybackRunnable (aReason=<optimized out>, aActive=<optimized out>,
    aWindow=<optimized out>, this=<optimized out>)
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#1  mozilla::dom::AudioChannelService::AudioChannelWindow::
    NotifyAudioAudibleChanged (this=this@entry=0x7f80f97430, 
    aWindow=0x7f80b985a0,
    aAudible=aAudible@entry=mozilla::dom::AudioChannelService::eAudible, 
    aReason=aReason@entry=mozilla::dom::AudioChannelService::
    eDataAudibleChanged)
    at dom/audiochannel/AudioChannelService.cpp:592
[...]
#37 0x0000007fb735c89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c                            
Continuing.

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, (anonymous namespace)::
    AudioPlaybackRunnable::Run (this=0x7f80f52980)
    at dom/audiochannel/AudioChannelService.cpp:44
44        NS_IMETHOD Run() override {
(gdb) bt
#0  (anonymous namespace)::AudioPlaybackRunnable::Run (this=0x7f80f52980)
    at dom/audiochannel/AudioChannelService.cpp:44
#1  0x0000007fb9c014c4 in nsThread::ProcessNextEvent (aResult=0x7fa69f3ba7, 
    aMayWait=<optimized out>, this=0x7f8002fd60)
    at xpcom/threads/nsThread.cpp:1211
[...]
#24 0x0000007fb735c89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c
Continuing.
[...]

I'm currently rather baffled by this. Everything works as expected on ESR 91, but while on ESR 78 the messages are very clearly sent from the C++ code, they're never received by the JavaScript actor.

Unfortunately I've run out of time to investigate this further today. But I'll continue looking at it tomorrow, when I'll also have to come to a decision about how to actually tackle the error. Do I add in a check for a null browser variable, remove the AudioPlayback actor entirely, or dig deeper to try to establish the underlying for why the embedder element is never defined. With any luck our investigations tomorrow will lead us to a suitable answer.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

13 Jul 2024 : Day 287 #

Today I'm continuing my investigation of an error that happens just after the user has granted permissions for a site to use the microphone. The problem generates an error to the console that looks like this:

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 20: 
    TypeError: browser is null

As a quick recap, the line of code that's triggering the error, as it appears in the ESR 91 codebase, is the following:

    const browser = this.browsingContext.top.embedderElement;

This is a slight simplification of the code as it appears in ESR 78, like this:

    let topBrowsingContext = this.browsingContext.top;
    let browser = topBrowsingContext.embedderElement;

By looking at the history of this source file I was able to ascertain that the code has been merged into a single line but is otherwise the same, save for the fact that browser is now annotated with a const keyword.

The two variants of this code really appear to do very similar things, so what's interesting is that despite this change being barely different in terms of functionality, the error doesn't trigger under the same circumstances in ESR 78.

It seems unlikely that this const-ness is going to be the problem, but I suppose it's worth checking to make sure. The AudioPlaybackParent.jsm file is part of the gecko package and I don't want to have to rebuild everything, so I'm going to edit it in situ by accessing the omni.ja file.

cd omni/
./omni.sh unpack
find . -iname &quot;AudioPlaybackParent.jsm&quot; -exec vim {} \;
./omni.sh pack

I've made the following edits so that the version used is equivalent to the ESR 78 formulation:

  receiveMessage(aMessage) {
    //const browser = this.browsingContext.top.embedderElement;
    let topBrowsingContext = this.browsingContext.top;
    let browser = topBrowsingContext.embedderElement;
[...]

As expected, this has no practical effect, other than changing the line number the error is triggered on:

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 22: 
    TypeError: browser is null

This means the error must be elsewhere. I've reversed the change I just made to avoid anything unexpected happening. But I now need to figure out why the browsing context is returning a null value for the embedderElement when, presumably, it didn't before.

Except, it turns out that's not quite the full story. While I thought the error didn't trigger at all in ESR 78, it seems it's not quite as simple as that. Earlier this morning I received the following messages from fellow Sailfish OS user and developer Mark Washeim (poetaster) on Mastodon suggesting otherwise:

@flypig I don't think it helps in any manner but I had noticed that AudioPlaybackParent message here: forum.sailfishos.org/t/javascr... That was with 4.4 and the error went away with 4.5. I was testing audio stuff for the mahjong webwrapper I maintain.

@flypig Ah, and I'm wrong, the error can still be triggered in 4.5 I just launched 'harbour-scaler' the web wrapper around scaler and it produces that error as soon as you hit the play button in the app. The audio does play though.

This is fascinating. I can't seem to get it to trigger at all myself, but clearly it does under the right conditions. This was a really helpful intervention from Mark.

It nevertheless feels like something is different, so it's worth taking a look at the code involved. Here's the code in BrowsingContext for extracting the top window, taken from BrowsingContext.cpp:

BrowsingContext* BrowsingContext::Top() {
  BrowsingContext* bc = this;
  while (bc->mParentWindow) {
    bc = bc->GetParent();
  }
  return bc;
}

Since this can't reasonably be null here, the method is guaranteed to return a non-null value. But that's not our problem, our problem is the embedderElement hanging off this. Here's the code for this, taken this time from BrowsingContext.h

  // Get the embedder element for this BrowsingContext if the embedder is
  // in-process, or null if it's not.
  Element* GetEmbedderElement() const { return mEmbedderElement; }

There aren't any other immediately obvious differences in this between the two versions. When running the two versions of the browser the results are interesting. Not only is there no indication of this error on the console for ESR 78, but having granted permission to the site to use the microphone, I actually got to hear some resulting feedback. That doesn't happen on my ESR 91 build. We don't know whether this difference is related to the error though.

But while I don't know whether the error output to the console is the reason why the microphone is left inert on ESR 91, fixing nevertheless would seem like the right thing to do at this point.

So I'm going to have to follow that embedderElement variable down the rabbit hole to see where it leads.

None of the code related to embedderElement in the BrowsingContext class differs between ESR 78 and ESR 91. The external interface to it appears to be the SetEmbedderElement() method. This is quite a complex method and there's also the question of how it gets called. So I think some stepping through of the code will be in order to compare how ESR 78 is handling it with the ESR 91 flow.

To start off I've placed a breakpoint on BrowsingContext::SetEmbedderElemenet() to find out where the value actually gets set. But whether I do this on ESR 78 or ESR 91 I don't get a hit. Looking through the code I can't see any other way that the value can be set, but it's possible there's some way I missed. So I've tried to place a watch point on the variable instead.

The mEmbedderElement variable is a member of the BrowsingContext class, so the variable won't exist until an instance of the class has been constructed. So I've first placed a breakpoint on the BrowsingContext constructor. Once that's called the object will actually exist and I can then find the memory location that mEmbedderElement points to. I can then add a watch on that memory location, which should trigger a break whenever the variable is written to.

Now I know the value on ESR 91 remains null, so I'm particularly interested to find out where the value gets set on ESR 78:

(gdb) b BrowsingContext::BrowsingContext
(gdb) r
[...]
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, mozilla::dom::
    BrowsingContext::BrowsingContext (this=this@entry=0x7f809a02a0, 
    aParentWindow=0x0,
    aGroup=0x7f80bef930, aBrowsingContextId=1, aType=mozilla::dom::
    BrowsingContext::Type::Content, aFields=...)
    at docshell/base/BrowsingContext.cpp:392
392     BrowsingContext::BrowsingContext(WindowContext* aParentWindow,
(gdb) n
92      ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsWrapperCache.h: No such 
    file or directory.
(gdb) n
410           mUseRemoteSubframes(false) {
(gdb) print &mEmbedderElement
$1 = (RefPtr<mozilla::dom::Element> *) 0x7f809a04f8
(gdb) watch *0x7f809a04f8
Hardware watchpoint 2: *0x7f809a04f8
(gdb) c
Continuing.

This watchpoint is never hit. There are several reasons I can think of for why this might happen. I don't use watchpoints often, so one possibility is that I'm using it wrong. I've also noticed the debugger crashing out occasionally with this particular ESR 78 build, so it could also be an issue with the debugger. Finally it could simply be that the value never gets written to.

There is another way to check this, which is to find out the value it takes when it gets read. To test this I placed a breakpoint on GetEmbedderElement(). On ESR 91 the breakpoint is hit when I request the microphone permission. At that point it's null, as anticipated:

(gdb) b GetEmbedderElement
Breakpoint 3 at 0x7fbc79cbbc: GetEmbedderElement. (7 locations)
(gdb) r
Starting program: /usr/bin/sailfish-browser 
[...]
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 3, mozilla::dom::
    BrowsingContext::GetEmbedderElement (this=0x7fb8bb2850)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/dom/
    BrowsingContext.h:335
335     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/dom/
    BrowsingContext.h: No such file or directory.
(gdb) p mEmbedderElement
$3 = {mRawPtr = 0x0}
(gdb)

On ESR 78 I don't have so much luck. The breakpoint won't stick to GetEmbedderElement() at all: apparently it doesn't exist as a debug symbol.

Okay, fair enough, there's another way to do this too, which is to print out the value from inside the AudioPlaybackParent::receiveMessage() method in the JavaScript. This is the method that's generating the error on ESR 91.

But when I do this, I find this JavaScript method is never hit either. It means that the fix may not be to ensure mEmbedderElement is set to a non-null value at all, but rather making sure that the AudioPlaybackParent::receiveMessage() is never called.

This would seem to tally with Mark's experiences. And alongside Mark's comment this gives me a slightly new direction to look in to. Unfortunately I don't have time to look in to this new direction today. That means that tomorrow I'll need to look at where the AudioPlaybackParent::receiveMessage() is getting called from.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

12 Jul 2024 : Day 286 #

Still working on WebRTC today, after some success getting video to work, but less success with audio. I have two things explicitly on my to-do list. The first being the error when using GlobalMuteListener, the second the AudioPlaybackParent.jsm error triggered when trying to grant access to the microphone.

Let's start with the GlobalMuteListener class then. This is a new class added to ESR 91 with the following description associated with it, taken from changeset D86719:

$ git log -1 -S &quot;GlobalMuteListener&quot; browser/actors/WebRTCChild.jsm
commit 01fa52e2f5946f511199b6879919e2547f63fa25
Author: Mike Conley <mconley@mozilla.com>
Date:   Thu Aug 27 15:46:26 2020 +0000

    Bug 1643027 - Add GlobalMuteListener to WebRTCChild for globally muting the 
    microphone and camera. r=pbz
    
    This listens to state changes set via SharedData. Those state changes are 
    executed by
    front-end code in a later patch in this series.
    
    Depends on D87129
    
    Differential Revision: https://phabricator.services.mozilla.com/D86719

The associated bug is 1643027 and there are two further changesets associated with this bug. The first is D86619 which turns the Firefox microphone and camera icons into global mute toggles. The second is D86762 which adds unit tests for the other changes.

At some point I should take a look at the first of these to see whether it relates to anything in the Sailfish user interface. This could have privacy implications, so is of no small importance.

Right now though, I'm just going to add in the missing GlobalMuteListener class so that we're matching at least this part of the upstream functionality and which we might need to hook the user interface changes in to later.

The new class isn't especially large; in fact it's small enough that I can justify sharing all of it here.

/**
 * GlobalMuteListener is a process-global object that listens for changes to
 * the global mute state of the camera and microphone. When it notices a
 * change in that state, it tells the underlying platform code to mute or
 * unmute those devices.
 */
const GlobalMuteListener = {
  _initted: false,

  /**
   * Initializes the listener if it hasn't been already. This will also
   * ensure that the microphone and camera are initially in the right
   * muting state.
   */
  init() {
    if (!this._initted) {
      Services.cpmm.sharedData.addEventListener(&quot;change&quot;, this);
      this._updateCameraMuteState();
      this._updateMicrophoneMuteState();
      this._initted = true;
    }
  },

  handleEvent(event) {
    if (event.changedKeys.includes(&quot;WebRTC:GlobalCameraMute&quot;)) {
      this._updateCameraMuteState();
    }
    if (event.changedKeys.includes(&quot;WebRTC:GlobalMicrophoneMute&quot;)) {
      this._updateMicrophoneMuteState();
    }
  },

  _updateCameraMuteState() {
    let shouldMute = Services.cpmm.sharedData.get(&quot;WebRTC:
    GlobalCameraMute&quot;);
    let topic = shouldMute
      ? &quot;getUserMedia:muteVideo&quot;
      : &quot;getUserMedia:unmuteVideo&quot;;
    Services.obs.notifyObservers(null, topic);
  },

  _updateMicrophoneMuteState() {
    let shouldMute = Services.cpmm.sharedData.get(
      &quot;WebRTC:GlobalMicrophoneMute&quot;
    );
    let topic = shouldMute
      ? &quot;getUserMedia:muteAudio&quot;
      : &quot;getUserMedia:unmuteAudio&quot;;

    Services.obs.notifyObservers(null, topic);
  },
};

All this is really doing is listening to a couple of events for muting the camera and microphone and, in the event that they're received, broadcasting a notification to indicate the change (microphone/camera mute/unmute) that can be picked up by the user interface and acted upon.

Crucially it doesn't appear to depend on anything that's not already available in our EmbedLiteWebrtcUK.js source file. So I've added it towards the top of the file and uncommented the line that calls it.

When I use this updated code I don't get any obvious changes, except that there's now no longer an error output to the console.

I'll need to create a task to check the status of global mute on Sailfish OS, but this change is going to be enough for what we need right now.

Let's move on to the second error, which was the following:

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 20: 
    TypeError: browser is null

The AudioPlaybackParent.jsm source is available upstream for both ESR 78 and ESR 91. Both versions are tiny files (29 lines and 45 lines respectively) but there have been changes between the two.

The change that is triggering the error relates to the browser variable. It's being initialised in two different ways. On ESR 78 we have this: let topBrowsingContext = this.browsingContext.top; let browser = topBrowsingContext.embedderElement; Whereas on ESR 91 we have this:

    const browser = this.browsingContext.top.embedderElement;

Looking carefully at and comparing both of these there's not much practical difference. On ESR 91 the variable is marked as const but that seems to be the only practical difference.

The top level class doesn't include a browsingContext member so in both cases this must be being inherited from the parent class, which is JSWindowActorParent:

class AudioPlaybackParent extends JSWindowActorParent {

The file I'm checking in the ESR 78 codebase is the patched version with the Sailfish OS changes applied, so it's worth checking to see whether these lines in particular are from upstream or from us:

$ git blame toolkit/actors/AudioPlaybackParent.jsm -L 10,+4
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 10)
   receiveMessage(aMessage) {
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 11)
     let topBrowsingContext = this.browsingContext.top;
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 12)
     let browser = topBrowsingContext.embedderElement;
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 13)

They're from upstream. So then I wonder why they were changed? Let's find out.

$ git blame toolkit/actors/AudioPlaybackParent.jsm -L 15,+4
Blaming lines:   8% (4/45), done.
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 15)
   receiveMessage(aMessage) {
ebf5370d519e5 (alwu            2020-10-02 03:56:09 +0000 16)
     const browser = this.browsingContext.top.embedderElement;
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 17)
     switch (aMessage.name) {
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 18)
       case &quot;AudioPlayback:Start&quot;:
$ git log -1 ebf5370d519e5
commit ebf5370d519e5fe34c4a0b7aeb210de3c851c00a
Author: alwu <alwu@mozilla.com>
Date:   Fri Oct 2 03:56:09 2020 +0000

    Bug 1656414 - part1 : stop audio playback and media block if needed when 
    destroying an parent actor. r=farre
    
    If content process dispatches event after actor close, then we are not able 
    to clear the tab media indicator. Therefore, we should do corresponding 
    cleanup when actor destroys.
    
    Differential Revision: https://phabricator.services.mozilla.com/D91367

It's not quite clear to me what's going on with this, but it's beginning to feel like my brain needs to get some sleep. So I'll return to this fresh in the morning and take another look.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

11 Jul 2024 : Day 285 #

This morning I'm continuing my analysis of how the browser responds to the getUserMedia:request notification. I've been taking a second look at the three methods we were examining yesterday. I'm going to be referring back to them today, so it may be worth having them to hand if you want to follow along.

There are a few things that I want to note about them. First is that they're all triggered in the same way, namely in response to the getUserMedia:request notification. This notification is sent from the C++ code in MediaManager.cpp and picked up by the observer in WebRTCChild.jsm in the upstream code and in EmbedLiteWebrtcUI.js on the Sailfish browser.

On the Sailfish OS side we can see the case that filters for this message in the snippet I posted yesterday. But I skipped that part on the upstream snippets. I think it's worth taking a look at it though, just to tie everything together:

  // This observer is called from BrowserProcessChild to avoid
  // loading this .jsm when WebRTC is not in use.
  static observe(aSubject, aTopic, aData) {
    switch (aTopic) {
      case &quot;getUserMedia:request&quot;:
        handleGUMRequest(aSubject, aTopic, aData);
        break;
[...]
    }
  }

There's nothing magical going on here. The notification is fired, it's picked up by the observer and one of the handleGUMRequest() methods that I shared yesterday is executed depending on whether we're running ESR 78 or ESR 91.

On Sailfish OS a similar thing happens but with a different observer in a different source file. The main difference is that there's no handleGUMRequest() method that gets called. Instead the code is executed inline inside the switch statement.

This is one of two main differences between the two. The other main difference is due to what happens inside that code. The upstream code opens a prompt like this on ESR 78:

      prompt(
        contentWindow,
        aSubject.windowID,
        aSubject.callID,
        constraints,
        devices,
        secure,
        isHandlingUserInput
      );

This aligns with some equivalent but different code used to open a prompt on Sailfish OS:

      EmbedLiteWebrtcUI.prototype._prompt(
        contentWindow,
        aSubject.callID,
        constraints,
        devices,
        aSubject.isSecure);
    }

On upstream ESR 91 we also have some code that does something similar. There are differences here as well though, most notably the parameters that are passed in and where they come from.

  prompt(
    aSubject.type,
    contentWindow,
    aSubject.windowID,
    aSubject.callID,
    constraints,
    aSubject.devices,
    aSubject.isSecure,
    aSubject.isHandlingUserInput
  );

The part of this we're particularly interesting in is where the parameters are coming from, and we're especially concerned with the parameters that are needed for the _prompt() call on Sailfish OS.

Most of the parameters needed by our Sailfish OS code have stayed the same between ESR 78 and ESR 91 in the upstream code. That covers contentWindow, aSubject.callID, constraints and aSubject.isSecure.

That leaves just the devices parameter that has changed and that we need to worry about. Let's examine a bit more closely what's going on with this. In both ESR 78 and the Sailfish OS code we essentially had the following, where I've simplified things to remove all of the obfuscatory cruft.

  contentWindow.navigator.mozGetUserMediaDevices(
    function(devices) {
      prompt(devices);
    },
    function(error) {[...]},
  );

Although we don't have a Promise object, this is nevertheless essentially a promise on devices. The call to mozGetUserMediaDevices() will eventually either resolve to call the prompt() method with the fulfilled value for devices or it will call the error method.

In ESR 91 the promise is no longer needed: it's presumably been refactored higher up the call chain so that the devices value has already been resolved before the getUserMedia:request notification is sent out. Rather than us having to request the value asynchronously, in ESR 91 it's passed directly as part of the notification. That's why we have devices in ESR 78 and aSubject.devices in ESR 91.

This is true for ESR 91 upstream, but that means it's also likely to be true for our Sailfish OS version that's downstream of ESR 91. We should therefore, in theory, be able to do something like this:

      case &quot;getUserMedia:request&quot;:
        // Now that a getUserMedia request has been created, we should check
        // to see if we're supposed to have any devices muted. This needs
        // to occur after the getUserMedia request is made, since the global
        // mute state is associated with the GetUserMediaWindowListener, which
        // is only created after a getUserMedia request.
        GlobalMuteListener.init();

        let constraints = aSubject.getConstraints();
        let contentWindow = Services.wm.getOuterWindowWithId(aSubject.windowID);

        EmbedLiteWebrtcUI.prototype._prompt(
          contentWindow,
          aSubject.callID,
          constraints,
          aSubject.devices,
          aSubject.isSecure);
        break;

The good news about this is that the EmbedLiteWebrtcUI.js file is contained in the embedlite-components project, built separately from the gecko packages. This project is incredibly swift to compile, so I can test these changes almost immediately.

When I do I still get an error, but it's different this time:

JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteWebrtcUI.js, line 295: ReferenceError: GlobalMuteListener is not 
    defined

Looking back at the file I can see that GlobalMuteListener is a class defined in WebRTCChild.jsm. I may need to do something with this, but just to test things, I'm commenting out the line where it's used so as to see whether it makes any difference.

And it does! Now when I visit the getUserMedia example and select to use the camera I get the usual nice Sailfish OS permissions request page appearing. If I grant the permission, the camera even works!

Screenshots showing the camera and microphone permissions pages, one showing the camera working and another showing the error message when access to the microphone is denied

Denying the permissions results in an error showing on the web page stating that the request was not allowed, which is exactly what should happen. The only wrinkle is when I try to use the microphone. The correct permissions page appears, but if I allow access I then get the following error output to the console:

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 20: 
    TypeError: browser is null

I'll need to fix this, but not today. I'll take another look at this tomorrow and try to fix both the GlobalMuteListener error and this AudioPlaybackParent.jsm error. One further thing that's worth commenting on is the fact the video stream colours are a bit strange as you can see in the screenshot. From what I recall this isn't a new issue, the release version of the browser exhibits this as well. It'd be nice to get it fixed, but having thought about it, I'm not really sure what the problem is. Maybe a way to improve it will become apparent later.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

10 Jul 2024 : Day 284 #

Yesterday I applied the final two WebRTC patches, transferred over from ESR 78, with mixed results. I left a build running overnight after having manually applied the changes to the moz.build file after I failed to get the changes autogenerated from changes to the BUILD.gn file.

When I test the new version I do get slightly different results compared to the previous install. On Mozilla's getUserMedia example page I now get the same results whether I'm trying to use the camera or the microphone:

library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteWebrtcUI.js, line 293: TypeError: 
    contentWindow.navigator.mozGetUserMediaDevices is not a function

I'm not sure about that libandroidicu.so error. I've not had a chance to find the code that's attempting to load this in dynamically — it doesn't look like the sort of library we should be needing — so I'll have to add this task to my queue.

The fact that both the microphone and camera produce this error is I think a sign that the changes to the moz.build file are useful, but it brings us to the same conclusion, which is that we need to fix the process for requesting permissions from the user. So the obvious thing to do is to follow that error and try to find out why it's happening. But when I check the EmbedLiteWebrtcUI.js file the reason for the error is immediately evident: the mozGetUserMediaDevices() method is no longer part of the Navigator interface, as defined in Navigator.webidl.

I can use git blame to find out when it was removed and why:

$ git log -1 -S &quot;mozGetUserMediaDevices&quot; dom/webidl/Navigator.webidl
commit 9f60769705be72d6057feb073c87aca32e0baf27
Author: Karl Tomlinson <karlt+@karlt.net>
Date:   Thu May 6 05:16:49 2021 +0000

    Bug 1709474 move mozGetUserMediaDevices from Navigator to 
    GetUserMediaRequest r=jib,webidl,geckoview-reviewers,smaug,agi
    
    Differential Revision: https://phabricator.services.mozilla.com/D111565

Referring to the upstream D111565 changeset, not only has the method been removed from the interface definition, it's also been removed from the source file, in this case Navigator.cpp. This is the method that could be found there in ESR 78:

void Navigator::MozGetUserMediaDevices(
    MozGetUserMediaDevicesSuccessCallback& aOnSuccess,
    NavigatorUserMediaErrorCallback& aOnError, uint64_t aInnerWindowID,
    const nsAString& aCallID, ErrorResult& aRv) {
  if (!mWindow || !mWindow->GetOuterWindow() ||
      mWindow->GetOuterWindow()->GetCurrentInnerWindow() != mWindow) {
    aRv.Throw(NS_ERROR_NOT_AVAILABLE);
    return;
  }

Again, referring to the same changeset it looks like the method has been moved to the GetUserMediaRequest.cpp file, or at least that there's a new equivalent to be found there. This is what this looks like in ESR 91:

void GetUserMediaRequest::GetDevices(
    nsTArray<RefPtr<nsIMediaDevice>>& retval) const {
  MOZ_ASSERT(retval.Length() == 0);
  if (!mMediaDeviceSet) {
    return;
  }
  for (const auto& device : *mMediaDeviceSet) {
    retval.AppendElement(device);
  }
}

These are clearly not identical methods, but with a bit of adjustment to the calling code it may be that we'll need to make use of this as an alternative to MozGetUserMediaDevices(). I'm not sure how this will pan out, so this will require some more investigation. First up, this is the code in EmbedLiteWebrtcUI.js where this is called, which is currently present in both ESR 78 and ESR 91 because it's part of the embedlite-components package:

      case &quot;getUserMedia:request&quot;:
        let constraints = aSubject.getConstraints();
        let contentWindow = Services.wm.getOuterWindowWithId(aSubject.windowID);

        contentWindow.navigator.mozGetUserMediaDevices(
          constraints,
          function(devices) {
            if (!contentWindow.closed) {
              EmbedLiteWebrtcUI.prototype._prompt(
                contentWindow,
                aSubject.callID,
                constraints,
                devices,
                aSubject.isSecure);
            }
          },
          function(error) {
            Services.obs.notifyObservers(null, &quot;getUserMedia:request:
    deny&quot;, aSubject.callID);
            Cu.reportError(error);
          },
          aSubject.innerWindowID,
          aSubject.callID
        );
        break;

Note that this is JavaScript code, as are the following snippets as well. The equivalent upstream call as used by Firefox happening in the ESR 78 code is in the WebRTCChild.jsm file:

function handleGUMRequest(aSubject, aTopic, aData) {
  let constraints = aSubject.getConstraints();
  let secure = aSubject.isSecure;
  let isHandlingUserInput = aSubject.isHandlingUserInput;
  let contentWindow = Services.wm.getOuterWindowWithId(aSubject.windowID);

  contentWindow.navigator.mozGetUserMediaDevices(
    constraints,
    function(devices) {
      // If the window has been closed while we were waiting for the list of
      // devices, there's nothing to do in the callback anymore.
      if (contentWindow.closed) {
        return;
      }

      prompt(
        contentWindow,
        aSubject.windowID,
        aSubject.callID,
        constraints,
        devices,
        secure,
        isHandlingUserInput
      );
    },
    function(error) {
      // Device enumeration is done ahead of handleGUMRequest, so we're not
      // responsible for handling the NotFoundError spec case.
      denyGUMRequest({ callID: aSubject.callID });
    },
    aSubject.innerWindowID,
    aSubject.callID
  );
}

And here's the equivalent code, as it would run on Firefox, on ESR 91. As we can see, this no longer uses the mozGetUserMediaDevices() method:

function handleGUMRequest(aSubject, aTopic, aData) {
  // Now that a getUserMedia request has been created, we should check
  // to see if we're supposed to have any devices muted. This needs
  // to occur after the getUserMedia request is made, since the global
  // mute state is associated with the GetUserMediaWindowListener, which
  // is only created after a getUserMedia request.
  GlobalMuteListener.init();

  let constraints = aSubject.getConstraints();
  let contentWindow = Services.wm.getOuterWindowWithId(aSubject.windowID);

  prompt(
    aSubject.type,
    contentWindow,
    aSubject.windowID,
    aSubject.callID,
    constraints,
    aSubject.devices,
    aSubject.isSecure,
    aSubject.isHandlingUserInput
  );
}

I'm going to need to compare these three methods further, but it's late here already, so that will have to wait until tomorrow. I'm not anticipating having much time to look at this tomorrow, but will try to get something done and posted about it nonetheless.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

9 Jul 2024 : Day 283 #

This morning I've picked up where I left off, attempting to get the library to link after apply equivalents for patches 77, 78 and 79. These three patches represent the first of two stages needed to get WebRTC back up and running. Once these changes are going through I'll then need to apply patches 80 and 81, which represent the second stage, which should hopefully be enough to get things working.

Having applied the first three patches I was still left with a single undefined reference during linking. Apparently the following constructor is missing an implementation:

webrtc::EncoderSimulcastProxy::EncoderSimulcastProxy()

Checking through the code I discovered that in ESR 78 the EncoderSimulcastProxy class isn't used, with a VP8EncoderSimulcastProxy class being used in its stead. It seems likely this has something to do with the problem. In ESR 78 this class is defined in the vp8_encoder_simulcast_proxy.cc and vp8_encoder_simulcast_proxy.h files, whereas the ESR 91 code is found in the encoder_simulcast_proxy.cc and encoder_simulcast_proxy.h files.

If you've been following along for the last few days you'll know that the WebRTC code is being built using the recipes in the following GN build files:

$ ls dom/media/webrtc/third_party_build/gn-configs/
arm64_False_arm64_freebsd.json  x64_False_x64_netbsd.json
arm64_False_arm64_linux.json    x64_False_x64_openbsd.json
arm64_False_arm64_netbsd.json   x64_False_x64_win.json
arm64_False_arm64_openbsd.json  x64_False_x86_android.json
arm64_True_arm64_freebsd.json   x64_False_x86_linux.json
arm64_True_arm64_linux.json     x64_True_arm64_android.json
arm64_True_arm64_netbsd.json    x64_True_arm64_mac.json
arm64_True_arm64_openbsd.json   x64_True_arm64_win.json
arm_False_arm_freebsd.json      x64_True_arm_android.json
arm_False_arm_linux.json        x64_True_x64_android.json
arm_False_arm_netbsd.json       x64_True_x64_dragonfly.json
arm_False_arm_openbsd.json      x64_True_x64_freebsd.json
arm_True_arm_freebsd.json       x64_True_x64_linux.json
arm_True_arm_linux.json         x64_True_x64_mac.json
arm_True_arm_netbsd.json        x64_True_x64_netbsd.json
arm_True_arm_openbsd.json       x64_True_x64_openbsd.json
mips64_False_mips64_linux.json  x64_True_x64_win.json
mips64_True_mips64_linux.json   x64_True_x86_android.json
ppc64_False_ppc64_linux.json    x64_True_x86_linux.json
ppc64_True_ppc64_linux.json     x86_False_x86_freebsd.json
x64_False_arm64_android.json    x86_False_x86_linux.json
x64_False_arm64_mac.json        x86_False_x86_netbsd.json
x64_False_arm64_win.json        x86_False_x86_openbsd.json
x64_False_arm_android.json      x86_False_x86_win.json
x64_False_x64_android.json      x86_True_x86_freebsd.json
x64_False_x64_dragonfly.json    x86_True_x86_netbsd.json
x64_False_x64_freebsd.json      x86_True_x86_openbsd.json
x64_False_x64_linux.json        x86_True_x86_win.json
x64_False_x64_mac.json

Of these the arm64_False_arm64_linux.json, arm_False_arm_linux.json and x86_False_x86_linux.json files were all transferred over from ESR 78, having been amended by patch 78. When I examined these files I notice something relevant: they differ slightly from all of the other files in that in the ESR 91 files encoder_simulcast_proxy is referenced where vp8_encoder_simulcast_proxy was referenced on ESR 78. So this could well be our problem.

And it makes sense as well: in the code the V8 versions of the files aren't used any more, having been replaced by the generic encoder simulcaast proxy implementation. So it makes sense that the build scripts might need updating to reference the new files rather than the previous ones.

If this is the case, then the solution should be straightforward. All I need to do is switch the file names in the GN build files, then regenerate the moz.build files from these.

So, here is the change I've made to the GN build files. This mimics the changes that have already been made to the other GN build files (but which we're not actually using).

$ git diff -U0
diff --git a/dom/media/webrtc/third_party_build/gn-configs/
    arm64_False_arm64_linux.json b/dom/media/webrtc/third_party_build/
    gn-configs/arm64_False_arm64_linux.json
index f1e7bfac33b8..beeded9b8602 100644
--- a/dom/media/webrtc/third_party_build/gn-configs/arm64_False_arm64_linux.json
+++ b/dom/media/webrtc/third_party_build/gn-configs/arm64_False_arm64_linux.json
@@ -4143,2 +4143,2 @@
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.cc&quot;,
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.h&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.cc&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.h&quot;,
diff --git a/dom/media/webrtc/third_party_build/gn-configs/
    arm_False_arm_linux.json b/dom/media/webrtc/third_party_build/gn-configs/
    arm_False_arm_linux.json
index 9885557f56cf..f3a87ac3b260 100644
--- a/dom/media/webrtc/third_party_build/gn-configs/arm_False_arm_linux.json
+++ b/dom/media/webrtc/third_party_build/gn-configs/arm_False_arm_linux.json
@@ -4408,2 +4408,2 @@
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.cc&quot;,
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.h&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.cc&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.h&quot;,
diff --git a/dom/media/webrtc/third_party_build/gn-configs/
    x86_False_x86_linux.json b/dom/media/webrtc/third_party_build/gn-configs/
    x86_False_x86_linux.json
index 27b49b928398..796fc1e920a5 100644
--- a/dom/media/webrtc/third_party_build/gn-configs/x86_False_x86_linux.json
+++ b/dom/media/webrtc/third_party_build/gn-configs/x86_False_x86_linux.json
@@ -4124,2 +4124,2 @@
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.cc&quot;,
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.h&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.cc&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.h&quot;,

Following this I'll need to regenerate the gecko build files that are actually used during the build process.

$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ cd gecko-dev/
$ ./mach build-backend -b GnMozbuildWriter

I can see this has resulted in a large number of changes to the moz.build files, which I take to be a good sign. Now time to do a full clean rebuild of the code.

$ sfdk build -d -p --with git_workaround
[...]
Wrote: RPMS/SailfishOS-devel-aarch64/
    xulrunner-qt5-devel-91.9.1+git1+sailfishos.esr91.20240707214235.e32f3a780e7e-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/
    xulrunner-qt5-misc-91.9.1+git1+sailfishos.esr91.20240707214235.e32f3a780e7e-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/
    xulrunner-qt5-91.9.1+git1+sailfishos.esr91.20240707214235.e32f3a780e7e-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/
    xulrunner-qt5-debugsource-91.9.1+git1+sailfishos.esr91.20240707214235.e32f3a780e7e-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/
    xulrunner-qt5-debuginfo-91.9.1+git1+sailfishos.esr91.20240707214235.e32f3a780e7e-1.aarch64.rpm
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.B5jeR8
+ umask 022
+ cd gecko-dev-esr91/gecko-dev/gecko-dev
+ /bin/rm -rf /home/deploy/installroot
+ RPM_EC=0
++ jobs -p
+ exit 0

Fantastic! The build has gone through without errors either during compilation or linking. This is exciting progress. But it's not time to test the packages just yet as I also have to apply the remaining two patches. Both of these are marked as being part of task JB#53982, so I should apply them both together.

As we saw a couple of days ago, I've already updated patch 81 to use the new location of the libwebrtc code and I didn't notice any obvious issues with patch 80, so with any luck these will now apply without too much drama.

$ git am --3way ../rpm/
    0080-sailfishos-webrtc-Enable-GMP-for-encoding-decoding.-.patch
Applying: Enable GMP for encoding/decoding. JB#53982
Using index info to reconstruct a base tree...
M       dom/media/gmp/GMPSharedMemManager.cpp
A       media/webrtc/signaling/src/media-conduit/GmpVideoCodec.cpp
A       media/webrtc/signaling/src/media-conduit/GmpVideoCodec.h
A       media/webrtc/signaling/src/media-conduit/VideoConduit.cpp
A       media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp
A       media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.h
A       media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
Falling back to patching base and 3-way merge...
Auto-merging dom/media/webrtc/libwebrtcglue/WebrtcGmpVideoCodec.h
Auto-merging dom/media/webrtc/libwebrtcglue/WebrtcGmpVideoCodec.cpp
CONFLICT (content): Merge conflict in dom/media/webrtc/libwebrtcglue/
    WebrtcGmpVideoCodec.cpp
Auto-merging dom/media/webrtc/libwebrtcglue/VideoConduit.cpp
Auto-merging dom/media/webrtc/libwebrtcglue/GmpVideoCodec.h
Auto-merging dom/media/webrtc/jsapi/PeerConnectionImpl.cpp
Auto-merging dom/media/gmp/GMPSharedMemManager.cpp
error: Failed to merge in the changes.
Patch failed at 0001 Enable GMP for encoding/decoding. JB#53982
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run &quot;git am --continue&quot;.
If you prefer to skip this patch, run &quot;git am --skip&quot; instead.
To restore the original branch and stop patching, run &quot;git am 
    --abort&quot;.

Ah, sadly it's not going to be quite as straightforward as I'd hoped. No matter. There are three hunks that have failed to apply cleanly to the WebrtcGmpVideoCode.cpp file. It looks like the reason might be an upstream change that has slightly amended the way strings are handled. The line that used to look like this:

tags.AppendElement(NS_LITERAL_CSTRING(&quot;h264&quot;));

Now looks like this:

tags.AppendElement(&quot;h264&quot;_ns);

We've seen this before all the way back on Day 35. Thankfully fixing it is a straightforward process, but I notice that NS_LITERAL_CSTRING() is used in several places in the new code as well, so I'll need to change those as well.

With these changes tackled the merge now goes through successfully.

$ git add dom/media/webrtc/libwebrtcglue/WebrtcGmpVideoCodec.cpp
$ git am --continue
Applying: Enable GMP for encoding/decoding. JB#53982

Now on to patch 81. Unfortunately this also fails to apply cleanly.

$ git am --3way ../rpm/
    0081-sailfishos-webrtc-Implement-video-capture-module.-JB.patch
Applying: Implement video capture module. JB#53982
Using index info to reconstruct a base tree...
M       old-configure.in
M       third_party/libwebrtc/webrtc/modules/video_capture/BUILD.gn
M       third_party/libwebrtc/webrtc/modules/video_capture/
    video_capture_internal_impl_gn/moz.build
error: patch failed: third_party/libwebrtc/webrtc/modules/video_capture/
    video_capture_internal_impl_gn/moz.build:60
error: third_party/libwebrtc/webrtc/modules/video_capture/
    video_capture_internal_impl_gn/moz.build: patch does not apply
error: Did you hand edit your patch?
It does not apply to blobs recorded in its index.
Patch failed at 0001 Implement video capture module. JB#53982
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run &quot;git am --continue&quot;.
If you prefer to skip this patch, run &quot;git am --skip&quot; instead.
To restore the original branch and stop patching, run &quot;git am 
    --abort&quot;.

In this case the issue is different. It's specifically the moz.build file it fails on. But as we've seen, these moz.build files are autogenerated. In this case, it's the change to BUILD.gn that we can see in the patch (and which applied without issue) that I believe should then be propagated to moz.build when the build files are regenerated.

So it looks like the simple solution will be to remove the changes to moz.build from the patch, apply the patch, then regenerate the build files. I should be left with some similar changes to moz.build autogenerated for me. I've therefore manually edited patch 81 to remove the moz.build changes listed there.

$ git am --3way ../rpm/0081-sailfishos-webrtc-Implement-video-capture-modul
e.-JB.patch
Applying: Implement video capture module. JB#53982

Patch goes through okay. Now to regenerate the build files.

$ sfdk engine exec
$ sb2 -t SailfishOS-esr78-aarch64.default
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ cd gecko-dev/
$ ./mach build-backend -b GnMozbuildWriter
[...]
Finished reading 1165 moz.build files in 12.36s
Read 13 gyp files in parallel contributing 0.00s to total wall time
Processed into 7258 build config descriptors in 10.27s
GnMozbuildWriter backend executed in 25.27s
  1 total backend files; 1 created; 0 updated; 0 unchanged; 0 deleted
Total wall time: 48.95s; CPU time: 47.91s; Efficiency: 98%; Untracked: 1.06s
Glean could not be found, so telemetry will not be reported. You may need to 
    run |mach bootstrap|.

I'm a bit perturbed to discover that this hasn't changed the moz.build file in the way I was expecting. To be clear: it hasn't changed it at all. I'm not happy with this and it's going to require a bit more investigation.

Perhaps I need to clean out the build before attempting to regenerate the build files?

$ sfdk engine exec
$ sb2 -t SailfishOS-esr78-aarch64.default
$ cd ..
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ cd gecko-dev/
$ ./mach build-backend -b GnMozbuildWriter
config.status not found.  Please run |mach configure| or |mach build| prior to 
    building the ['GnMozbuildWriter'] build backend.
Glean could not be found, so telemetry will not be reported. You may need to 
    run |mach bootstrap|.

Well, no, clearly that's not going to do the job. After doing a complete rebuild I've tried again, but still without any changes resulting in the moz.build files.

$ ./mach build-backend -b GnMozbuildWriter
 0:03.89 ${PROJECT}/obj-build-mer-qt-xr/_virtualenvs/common/bin/python 
    ${PROJECT}/obj-build-mer-qt-xr/config.status --backend GnMozbuildWriter
Reticulating splines...
 0:04.87 File already read. Skipping: ${PROJECT}/gecko-dev/intl/components/
    moz.build
 0:08.18 File already read. Skipping: ${PROJECT}/gecko-dev/gfx/angle/targets/
    angle_common/moz.build

There are some warnings about files already having been read. I did some reading around this but there's no indication that it could be part of the problem. Part of the output generated from the process contains a list of all the GN build files it's consuming. Here's the list (edited for clarity):

Writing moz.build files based on the following gn configs: [
    'x64_False_arm_android.json', 'arm_True_arm_freebsd.json',
    'mips64_True_mips64_linux.json', 'arm64_True_arm64_linux.json',
    'x64_False_x64_openbsd.json', 'x64_True_x64_mac.json', 
    x64_False_x64_win.json', 'x64_True_x86_linux.json',
    'x86_True_x86_win.json', 'mips64_False_mips64_linux.json',
    'x64_True_arm64_win.json', 'arm64_True_arm64_openbsd.json',
    'ppc64_False_ppc64_linux.json', 'x64_False_arm64_mac.json',
    'x86_False_x86_linux.json', 'x86_False_x86_win.json',
    'arm_False_arm_linux.json', 'x64_False_x64_mac.json',
    'arm64_True_arm64_freebsd.json', 'x64_True_x64_netbsd.json',
    'arm64_False_arm64_linux.json', 'x64_False_x86_android.json',
    'x86_False_x86_openbsd.json', 'x64_True_x64_win.json',
    'arm64_False_arm64_openbsd.json', 'x86_True_x86_netbsd.json',
    'x86_True_x86_openbsd.json', 'x64_False_x64_android.json',
    'x64_False_x64_freebsd.json', 'x64_True_x86_android.json',
    'x64_True_x64_openbsd.json', 'arm_True_arm_netbsd.json',
    'x64_False_x64_netbsd.json', 'arm64_False_arm64_netbsd.json',
    'x64_True_arm_android.json', 'x64_False_x64_dragonfly.json',
    'x64_False_arm64_android.json', 'x64_True_x64_freebsd.json',
    'arm64_False_arm64_freebsd.json', 'x64_True_arm64_mac.json',
    'x64_False_x64_linux.json', 'x86_False_x86_freebsd.json',
    'arm_False_arm_openbsd.json', 'ppc64_True_ppc64_linux.json',
    'x64_True_x64_linux.json', 'x64_False_x86_linux.json',
    'x64_True_arm64_android.json', 'arm_True_arm_openbsd.json',
    'x64_False_arm64_win.json', 'arm64_True_arm64_netbsd.json',
    'arm_False_arm_netbsd.json', 'arm_False_arm_freebsd.json',
    'x64_True_x64_android.json', 'x64_True_x64_dragonfly.json',
    'x86_False_x86_netbsd.json', 'x86_True_x86_freebsd.json',
    'arm_True_arm_linux.json'
]

The three files of importance to us are the ones we've amended, which are the arm64_False_arm64_linux.json, arm_False_arm_linux.json and x86_False_x86_linux.json files. All of these appear in the list, which I'd take as a good sign if it weren't for the fact it doesn't appear to be having any effect.

In spite of all this I left the build running all day and it went through fine, so I thought I'd have a go at running it and testing it with some WebRTC test sites. First up I tried the LiveKit WebRTC Browser Test. The results are a bit of a mixed bag. Although the video codecs seem to be working correctly, when the site requests audio input it gets stuck in a holding position indefinitely. When I perform the same action on my desktop browser a permission box opens and the process continues once I've agreed for the permissions to be granted.

Screenshots of the LiveKit WebRTC Browser Test: running on the desktop with Firefox; running on Sailfish OS. The later shows a spinner when asking for Media devices audioinput

There are some errors output to the console as well:

Created LOG for EmbedLiteLayerManager
JavaScript warning: https://livekit.io/_next/static/chunks/
    main-4e25e037b50f6258.js, line 1: unreachable code after return statement
JavaScript warning: https://livekit.io/_next/static/chunks/
    main-4e25e037b50f6258.js, line 1: unreachable code after return statement
JavaScript error: resource://gre/modules/amInstallTrigger.jsm, line 43: 
    TypeError: this.mm is null
JavaScript error: resource://gre/modules/amInstallTrigger.jsm, line 43: 
    TypeError: this.mm is null
JavaScript error: resource://gre/modules/amInstallTrigger.jsm, line 43: 
    TypeError: this.mm is null
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteWebrtcUI.js, line 293: TypeError: 
    contentWindow.navigator.mozGetUserMediaDevices is not a function

###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, 
    messages will be lost

This all suggests to me that the permission dialogues are currently broken on the ESR 91 build, so this is something I'll need to look into. I also had a play with Mozilla's WebRTC Tests.

Testing out the getUserMedia example shows similar problems. Pressing the camera button has no effect, while pressing the Microphone button generates a similar error to the console:

JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteWebrtcUI.js, line 293: TypeError: 
    contentWindow.navigator.mozGetUserMediaDevices is not a function

I have better success with the Single host DataChannels page. In this case I'm able to send messages backwards and forwards just as I can on a desktop browser.

So this is so far all pointing to an issue with the page for requesting permissions. I'm not going to have time to check this now, but there is one other thing I'd like to try.

Although I wasn't able to get the moz.build files updated automatically by generating them from the GN build scripts, I do have the patch that shows what ought to be changing in the moz.build. So I'm going to apply this change manually and run another rebuild. It'll be interesting to establish whether this has any effect. If all goes to plan, we can find out tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

8 Jul 2024 : Day 282 #

So far I've been trying to apply WebRTC patches. There are five of them, 77 – 81, of which 77 and 78 have already been applied. Well, sort of. The first applied fine but for the second I copied over the GN build configuration json files and regenerated the build scripts hoping that would do the trick.

I left the build running overnight and, unlike my previous attempts, it now compiles without error. That's a positive step forwards. But although compilation goes through, the linking now fails.

There are a lot of missing symbols. Here are some of the errors; I won't give them all as they're repetitive to say the least. But this gives a decently representative sample of them.

285:37.36 toolkit/library/build/libxul.so
289:04.49 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: ../../../dom/media/systemservices/
    Unified_cpp_systemservices0.o: in function `webrtc::ScreenDeviceInfoImpl::
    GetDeviceName(unsigned int, char*, unsigned int, char*, unsigned int, 
    char*, unsigned int, int*)':
289:04.49 ${PROJECT}/gecko-dev/dom/media/systemservices/video_engine/
    desktop_capture_impl.cc:63: undefined reference to `webrtc::
    DesktopDisplayDevice::DesktopDisplayDevice()'
289:04.49 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: ${PROJECT}/gecko-dev/dom/media/systemservices/
    video_engine/desktop_capture_impl.cc:63: undefined reference to `webrtc::
    DesktopDisplayDevice::~DesktopDisplayDevice()'
289:04.49 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: ${PROJECT}/gecko-dev/dom/media/systemservices/
    video_engine/desktop_capture_impl.cc:81: undefined reference to `webrtc::
    DesktopDisplayDevice::getDeviceName()'
[...]
289:04.51 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: libxul.so: hidden symbol 
    `_ZN6webrtc15DesktopCapturer20CreateScreenCapturerERKNS_21DesktopCaptureOptionsE' isn't defined
289:04.51 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: final link failed: bad value
289:04.51 collect2: error: ld returned 1 exit status

All bar one of the issues are undefined reference errors. There's also one hidden symbol error which I'll come on to. But first, the missing symbols, all of which are inside the webrtc namespace, are the following:

ScreenDeviceInfoImpl::GetDeviceName()
DesktopDisplayDevice::DesktopDisplayDevice()
DesktopDisplayDevice::~DesktopDisplayDevice()
DesktopDisplayDevice::getDeviceName()
DesktopDisplayDevice::getUniqueIdName()
WindowDeviceInfoImpl::GetDeviceName()
DesktopDisplayDevice::getPid()
BrowserDeviceInfoImpl::GetDeviceName()
DesktopTab::DesktopTab()
DesktopTab::~DesktopTab()
DesktopTab::getTabName()
DesktopTab::getUniqueIdName()
DesktopDeviceInfoImpl::Create()
DesktopCaptureOptions::CreateDefault()
DesktopCaptureOptions::~DesktopCaptureOptions()
DesktopCapturer::CreateScreenCapturer()
MouseCursorMonitor::CreateForScreen()
DesktopAndCursorComposer::DesktopAndCursorComposer()
DesktopCapturer::CreateWindowCapturer()
MouseCursorMonitor::CreateForWindow(webrtc::DesktopCaptureOptions()
DesktopCapturer::CreateTabCapturer()
DesktopRegion::DesktopRegion()
DesktopRegion::~DesktopRegion()
EncoderSimulcastProxy::EncoderSimulcastProxy()

The single hidden symbol is for the following:

_ZN6webrtc15DesktopCapturer20CreateScreenCapturerERKNS_21
    DesktopCaptureOptionsE

Running this through the demangler gives the following name:

$ c++filt _ZN6webrtc15DesktopCapturer20CreateScreenCapturerERKNS_21
    DesktopCaptureOptionsE
webrtc::DesktopCapturer::CreateScreenCapturer(
    webrtc::DesktopCaptureOptions const&
)

As we can see this is also one of the missing symbols. A missing symbol usually means one of two things. It can mean that the header file is being compiled without a related source file, so that there's a call to a method that the compiler can provide a symbol for, but then the linker can't hook it up to the actual function. Alternatively it can mean that there's some object code being generated that's not being included in the final linking step.

I'm not sure what's going on here, but I do note that there are implementations for most of these methods in the code. For example DesktopDisplayDevice::getDeviceName() can be found in the desktop_device_info.cc file:

const char *DesktopDisplayDevice::getDeviceName() {
  return deviceNameUTF8_;
}

Amongst the patches I don't see any obvious change that would immediately try to link in a missing object file. There are some changes to the build configuration that would affect what's compiled, for example the following:

     UNIFIED_SOURCES += [
-        &quot;/third_party/libwebrtc/webrtc/modules/video_capture/linux/
    device_info_linux.cc&quot;,
-        &quot;/third_party/libwebrtc/webrtc/modules/video_capture/linux/
    video_capture_linux.cc&quot;
+        &quot;/third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc&quot;,
+        &quot;/third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    video_capture_sfos.cc&quot;
     ]

There also seem to be some relevant relationships between the missing symbols and some of the later patches too. For example patch 81 contains the a new method called DeviceInfoSFOS::GetDeviceName(). Now that's not going to generate an exact replacement for the ScreenDeviceInfoImpl::GetDeviceName() symbol, but it's surely related.

From the descriptions in the patches it's also clear that patches 77, 78 and 79 are all supposed to be applied together, since they were all worked on under bug JB#53756 (The "JB" here stands for "Jolla Bug" and the number is the bug number in Jolla's Bugzilla database).

My conclusion is that I should probably try applying patch 79 and attempt to build again before I start attempting any manual fixes myself. It's possible the linking errors will just go away if I do this.

Patch 79 makes changes to just a single file and that file doesn't appear to have moved, so maybe it'll just apply?

$ git am --3way ../rpm/
    0079-sailfishos-webrtc-Disable-desktop-sharing-feature-on.patch
Applying: Disable desktop sharing feature on SFOS. JB#53756
Using index info to reconstruct a base tree...
M       dom/media/systemservices/VideoEngine.cpp
Falling back to patching base and 3-way merge...
Auto-merging dom/media/systemservices/VideoEngine.cpp

The patch failed initially, but after adding the -3way option to fall back to a three-way merge it then applies fine. The patch is quite small and I was able to easily check manually that all of the elements had correctly applied.

So I've set the build going again. It's another fully clean build, so I'm expecting it to take some time. Looking at the output the last build took four and three quarter hours before it reached the linking stage. Right now it's 16:45, so with any luck it'll be completed (either with or without errors) by 21:30 which might still give me some time to do some more work on it afterwards.

[...]

The build ran pretty much to schedule. Unfortunately it didn't go through, failing once again at the final linking stage. But the results are encouraging nevertheless. While last time there were 24 undefined references, now there's only one:

274:07.07 toolkit/library/build/libxul.so
277:37.26 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: ../../../dom/media/webrtc/libwebrtcglue/
    Unified_cpp_libwebrtcglue0.o: in function `mozilla::WebrtcVideoConduit::
    CreateEncoder(webrtc::VideoCodecType)':
277:37.26 ${PROJECT}/gecko-dev/dom/media/webrtc/libwebrtcglue/VideoConduit.cpp:
    1774: undefined reference to `webrtc::EncoderSimulcastProxy::
    EncoderSimulcastProxy(webrtc::VideoEncoderFactory*, webrtc::SdpVideoFormat 
    const&)'
277:37.26 collect2: error: ld returned 1 exit status

Extracting the crucial information from this error, the undefined reference is for the following method:

webrtc::EncoderSimulcastProxy::EncoderSimulcastProxy()

A quick check shows that this class doesn't exist in ESR 78, although there was what appears to be a similar VP8EncoderSimulcastProxy class that's used instead. I also notice that the WebrtcVideoConduit::CreateEncoder() method where the constructor for the class is used (which is causing the linker error) is heavily amended by our patch 80.

This is going to require further investigation, but it does at least give us a clear set of leads to follow. But as it's late now, following these leads will have to wait until tomorrow. I'm feeling quite upbeat about this though: it's very clear progress.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

7 Jul 2024 : Day 281 #

Today I'm continuing my attempts to restore WebRTC to the browser. Last night I set off a completely clean build after re-enabling the WebRTC option in mozconfig.merqtxulrunner. I didn't expect the build to complete successfully and, indeed, it didn't. After four and a half hours it crashed out with an error while attempting a header include:

272:51.05 In file included from ${PROJECT}/gecko-dev/third_party/libwebrtc/
    webrtc/modules/desktop_capture/desktop_capturer.cc:17,
272:51.05                  from Unified_cpp_p_capture_generic_gn0.cpp:56:
272:51.05 ${PROJECT}/obj-build-mer-qt-xr/dist/system_wrappers/gtk/gtk.h:3:15: 
    fatal error: gtk/gtk.h: No such file or directory
272:51.05  #include_next <gtk/gtk.h>
272:51.06                ^~~~~~~~~~~
272:51.06 compilation terminated.
272:51.06 make[4]: *** [${PROJECT}/gecko-dev/config/rules.mk:694: 
    Unified_cpp_p_capture_generic_gn0.o] Error 1

Our build is Qt-, rather than GTK- based, so it's not surprising that there's a problem if it tries to include a GTK header. If we look at the source code we can see that the likely cause is that our change to the build options has resulted in the WEBRTC_USE_PIPEWIRE define macro being set.

#if defined(WEBRTC_USE_PIPEWIRE) || defined(USE_X11)
#include <gtk/gtk.h>
#include <gtk/gtkx.h>
#endif

Interestingly this code is the same in ESR 78, even after all of the patches have been applied. That suggests to me that, in fact, WEBRTC_USE_PIPEWIRE shouldn't be defined on our build even if WebRTC is enabled. That would make sense because it can't be the case that all builds of Gecko require the GTK headers.

Checking the GN build file we can see the following:

  if (rtc_use_pipewire) {
    sources += [
      &quot;linux/base_capturer_pipewire.cc&quot;,
      &quot;linux/base_capturer_pipewire.h&quot;,
      &quot;linux/screen_capturer_pipewire.cc&quot;,
      &quot;linux/screen_capturer_pipewire.h&quot;,
      &quot;linux/window_capturer_pipewire.cc&quot;,
      &quot;linux/window_capturer_pipewire.h&quot;,
    ]

    if (!build_with_mozilla) {
      configs += [
        &quot;:gio&quot;,
        &quot;:pipewire&quot;,
      ]
    } else {
      defines += [ &quot;WEBRTC_USE_PIPEWIRE&quot; ]
      include_dirs += [ &quot;/third_party/pipewire&quot; ]
    }
  }

This shows that the define is dependent on both the rtc_use_pipewire option being set and the build_with_mozilla option being set. Searching for rtc_use_pipewire and comparing against ESR 78 shows that this wasn't set in ESR 78:

$ diff \
    gecko-dev-project/gecko-dev/gecko-dev/media/webrtc/trunk/webrtc/webrtc.gni \
    gecko-dev-esr91/gecko-dev/gecko-dev/third_party/libwebrtc/webrtc/webrtc.gni 
74c74
<   rtc_use_pipewire = false
---
>   rtc_use_pipewire = true

This appears to be a change to the upstream default. So the thing to do here is to unset this option and then, I suppose, regenerate all of the build scripts.

$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ cd gecko-dev/
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/arm64_False_arm64_linux.json
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/arm_False_arm_linux.json
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/x86_False_x86_linux.json

However, running these gives the same result as before: no change at all. That doesn't seem right to me, but it's always possible that the change will be propagated at build time automatically somehow. So I've set another clean build running again:

git clean -xdf
cd gecko-dev
git clean -xdf
cd ..
sfdk build -d -p --with git_workaround
[...]

As I mentioned earlier, yesterday the build ran for four and a half hours before hitting the error, so we should expect it to run for at least the same length of time again. So I'm off to focus on other things for a bit and will return once the build has completed, error or otherwise.

[...]

Unfortunately it's given the same error as before. What's frustrating is that I can quite clearly see the following in moz.build:

if CONFIG[&quot;CPU_ARCH&quot;] == &quot;aarch64&quot; and 
    CONFIG[&quot;OS_TARGET&quot;] == &quot;Linux&quot;:

    DEFINES[&quot;DISABLE_NACL&quot;] = True
    DEFINES[&quot;NO_TCMALLOC&quot;] = True
    DEFINES[&quot;WEBRTC_USE_PIPEWIRE&quot;] = True

    LOCAL_INCLUDES += [
        &quot;/third_party/pipewire/&quot;
    ]

    UNIFIED_SOURCES += [
        &quot;/third_party/libwebrtc/webrtc/modules/desktop_capture/linux/
    base_capturer_pipewire.cc&quot;,
        &quot;/third_party/libwebrtc/webrtc/modules/desktop_capture/linux/
    screen_capturer_pipewire.cc&quot;,
        &quot;/third_party/libwebrtc/webrtc/modules/desktop_capture/linux/
    window_capturer_pipewire.cc&quot;
    ]

So I probably should have just checked this before starting an entirely fresh build off again: it could have saved me some time. Anyhow, I'm a bit stumped. How to regenerate these GN json build files?

After trying a few things and not making much progress, I've decide to take a closer look at patch 78. That's the patch with the changes to the build configuration. The various moz.build files there look autogenerated, but the three json files — the GN input scripts — don't. Could it be that Denis edited these by hand?

It seems unlikely given they each have 6000 lines of changed code. But I don't see where else they would have come from. The WebRTC code has been moved in the build tree but otherwise seems pretty similar. So now I'm wondering if I can get away with simply copying these GN build files over from ESR 78 straight into ESR 91.

This is a bit of a risky way to proceed, but I don't have a lot to lose at this point. Besides, the way my repository is set up right now, it'd be super-easy to revert the changes. So I'm going to give it a go.

$ cp gecko-dev-project/gecko-dev/gecko-dev/media/webrtc/gn-configs/
    arm64_False_arm64_linux.json gecko-dev-esr91/gecko-dev/gecko-dev/dom/media/
    webrtc/third_party_build/gn-configs/
$ cp gecko-dev-project/gecko-dev/gecko-dev/media/webrtc/gn-configs/
    arm_False_arm_linux.json gecko-dev-esr91/gecko-dev/gecko-dev/dom/media/
    webrtc/third_party_build/gn-configs/
$ cp gecko-dev-project/gecko-dev/gecko-dev/media/webrtc/gn-configs/
    x86_False_x86_linux.json gecko-dev-esr91/gecko-dev/gecko-dev/dom/media/
    webrtc/third_party_build/gn-configs/

By looking through various bugs on Bugzilla (specifically Bug 1466254 and Bug 1658853) I've also discovered I can regenerate all of the build files from the GN json files in one go using just one command. This needs to be executed inside the scratchbox2 target of course:

$ ./mach build-backend -b GnMozbuildWriter

After running this I can see a large number of changed files, primarily moz.build files, all related to the build configuration. Maybe this has done the job? Given all of these changes I'm willing to run a build to see how it goes. Given there are changes to the build scripts it's going to have to be another completely clean build though.

git clean -xdf
cd gecko-dev
git clean -xdf
cd ..
sfdk build -d -p --with git_workaround
[...]

Once again, it'll likely take a while for this to either fail or go through. So that'll probably be my lot for the day.

[...]

It's now bed time. The build has been running for a bit over three and a half hours and continues onwards. So I'll definitely have to leave this until morning. Let's see how things look then.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

6 Jul 2024 : Day 280 #

Before I continue my investigation of the changes needed to get WebRTC integrated back into the browser, I want to first take stock. What's the current situation with the browser; what still needs to be done?

Well, the browser is now quite usable in my experience. There are some glitches, such as sporadic failures in the dark mode detection. But I wouldn't count that as critical failures. I've not really driven it hard (meaning, visited a large number of sites on the Web) but I've not generally experienced problems on the sites I've visited.

I've not been able to yet test the WebView running in the email client, but it works fine in the Settings app, so I expect it to work similarly well with the email app.

With these working correctly, it does feel like now would be a good time to roll out packages for others to try and lodge issues against. For users that have a second phone, or who are really willing to live on the cutting edge, I don't think it will be an unusable experience.

Nevertheless, it's also worth highlighting that I've not installed ESR 91 on my daily device. That's mostly just laziness on my part. I'm not sure I'd find it problematic if I did.

WebRTC and multimedia playback are the two obvious areas that need attention. Video works, but I'm not sure how it's happening or what backend it's using. Video playback isn't my area of expertise so I'm hoping Jolla will ultimately be able to help with this part.

There are some tidying up jobs to be done as well. Most notably I have to corral all of the commits I've made to the FIREFOX_ESR_91_9_X_RELBRANCH_patches branch into actual patches. I'm expecting this to be quite a bit of work; work that will require care and attention, but hopefully not too much thought.

So that's where things are at. Right now then, it's back to WebRTC so that I can cross the largest remaining item off this list. The ESR 78 WebRTC patches were developed by Denis Grigorev (d-grigorev), who I had the privilege of working with while I was at Jolla. Denis is a fantastic developer and did amazing work with these patches and I wouldn't want to have to replicate all of that from scratch. Frankly I'm not sure I could. My hope therefore is that I'll be able to apply the patches without too much drama.

As we saw yesterday there are five sequentially numbered patches that must be applied in order from 77 to 81 inclusive. Starting with the first patch, attempting to apply it gives disastrous results. None of the hunks apply successfully.

Checking the details of the patch it quickly becomes clear why: none of the files listed in the patch exist in the source tree any more. Or at least, they're not where they're supposed to be.

But checking the log for one of the missing files shows that it's been moved rather than removed:

$ git log -1 -- media/webrtc/trunk/webrtc/BUILD.gn
commit 9f1d2b5af54877aacf06605681a5b313c0441c51
Author: Dan Minor <dminor@mozilla.com>
Date:   Thu Sep 24 18:28:41 2020 +0000

    Bug 1665166 - Move media/webrtc/trunk/* to third-party/libwebrtc; r=ng
    
    Differential Revision: https://phabricator.services.mozilla.com/D91317

After a bit of experimentation I soon discover the description for this commit isn't quite correct. The file has actually been moved to third_party/libwebrtc (with an underscore rather than a hyphen). But having made this adjustment it appears that all of the relevant files have been moved there which, if correct, is going to make things much easier.

Looking through the patches I can see that the same issue is going to arise for patch 81. A quick bit of search-and-replace should fix this for both of them:

$ sed -i -e &quot;s/media\/webrtc\/trunk\//third_party\/libwebrtc\//g&quot; 
    0077-*.patch
$ sed -i -e &quot;s/media\/webrtc\/trunk\//third_party\/libwebrtc\//g&quot; 
    0081-*.patch

Now when I attempt to apply the first of these patches I get much better results:

$ git am ../rpm/0077-sailfishos-webrtc-Adapt-build-configuration-for-Sail.patch
Applying: Adapt build configuration for SailfishOS. JB#53756

Great! So that's patch 77 dealt with. Patch 78 is going to be a bit trickier, because this is the patch that applies the changes following the regeneration of the build process.

As we saw yesterday, there is in theory a simple command for regenerating the build configuration:

./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/webrtc.json

Checking the patch, we can see that the following files are the ones that need to be regenerated:

 .../gn-configs/arm64_False_arm64_linux.json   |  5874 +-------
 .../gn-configs/arm_False_arm_linux.json       |  6114 +--------
 .../gn-configs/x86_False_x86_linux.json       | 11339 ++++++++++++++++

I run the mach command inside the build engine like this:

$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ cd gecko-dev/
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/arm64_False_arm64_linux.json
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/arm_False_arm_linux.json
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/x86_False_x86_linux.json

This all executes without incident, but in practice it doesn't have any effect. The build scripts are identical to what they were before: git status shows that nothing has changed. Could that be because the WebRTC is still disabled in our build configuration?

Looking back through my diary records I quickly find the option that's supposed to control this; we talked about it all the way back on Day 38. Here's the change that I committed back then:

$ git log -1 embedding/embedlite/config/mozconfig.merqtxulrunner
commit 577ff63ae39d49c83bab7fca86f2e1d32053d167
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Thu Sep 21 08:43:12 2023 +0100

    Disable WebRTC
    
    The WebRTC patches are large and tricky to apply. This change disables
    building of the WebRTC portions of the code so that reapplying the
    patches can be tackled later.
$ git diff 577ff63ae~ 577ff63ae
diff --git a/embedding/embedlite/config/mozconfig.merqtxulrunner b/embedding/
    embedlite/config/mozconfig.merqtxulrunner
index 2593d32fd5a9..bec8b01bb07b 100644
--- a/embedding/embedlite/config/mozconfig.merqtxulrunner
+++ b/embedding/embedlite/config/mozconfig.merqtxulrunner
@@ -46,7 +46,7 @@ ac_add_options --with-app-name=xulrunner-qt5
 # ac_add_options --enable-warnings-as-errors
 
 # disabling for now, since the build fails...
-ac_add_options --enable-webrtc
+ac_add_options --disable-webrtc
 ac_add_options --enable-profiling
 ac_add_options --disable-dbus
 ac_add_options --disable-necko-wifi

If I restored this so that the option is set to "enable-webrtc" rather than "disable-webrtc" then maybe I'll get better results? Let's see. In order for this change to work it'll need to be propagated through the build system, which means cleaning everything out and starting the build completely from scratch. So that's what I've done:

git clean -xdf
cd gecko-dev
git clean -xdf
cd ..
sfdk build -d -p --with git_workaround
[...]

I've set this running but I'm not exactly sure what to expect. My expectation was that the build would fail soon after the new build configuration had been generated, but after more than a quarter of an hour the build is still happily chugging away. It's also possible that it'll all just rebuild fine. That would be more unexpected, but you never know. Either way, it looks like it's not going to be the fast failure I thought it would be, so I'll need to leave the build running, probably overnight. We'll come back to this tomorrow to see how it's got on.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

5 Jul 2024 : Day 279 #

Today I'm continuing with the plan I hatched yesterday. I'm half way through Step 6 ("Build and check everything still works"), having built the packages overnight.

This morning I've done some basic testing: installed the packages, run the browser to view a page with some WebGL content, run the harbour-webview app.

I was a bit perturbed to find the browser failed to render WebGL if I executed it from the app drawer rather than the command line. But this got fixed when I deleted the startupCache:

$ rm ~/.local/share/org.sailfishos/browser/.mozilla/startupCache/*

It's luck that I noticed this, but I'm glad I did: glitches like this can cause an almighty headache if you don't discover them early on. I also tested the Settings app to check the WebView still works on the various Account creation pages. Using the ESR 91 WebView I was successfully able to view the Twitter (nee X) Terms of Service; I didn't go as far as agreeing to them of course.

So, Step 6 is completed.

Step 7: Take a deep breath. Commit the staged changes.

After making all of the changes and discarding anything unnecessary, there are now nineteen modified files to commit.

$ git diff --stat --cached
 dom/base/nsGlobalWindowOuter.cpp                 |   6 -
 gfx/gl/GLContext.cpp                             |  34 +--
 gfx/gl/GLContext.h                               |   6 +-
 gfx/gl/GLContextEGL.h                            |   6 +-
 gfx/gl/GLContextProviderEGL.cpp                  | 242 +++------------------
 gfx/gl/GLLibraryEGL.cpp                          |  11 +-
 gfx/gl/GLLibraryEGL.h                            |   6 +-
 gfx/gl/GLScreenBuffer.cpp                        |   2 +-
 gfx/gl/GLTextureImage.cpp                        |  26 +--
 gfx/gl/SharedSurface.cpp                         |   2 -
 gfx/gl/SharedSurface.h                           |  17 +-
 gfx/gl/SharedSurfaceEGL.h                        |   4 -
 gfx/gl/SharedSurfaceGL.cpp                       |  12 +-
 gfx/gl/SharedSurfaceGL.h                         |  16 ++
 gfx/layers/CompositorTypes.h                     |   1 -
 gfx/layers/client/TextureClient.cpp              |  42 +---
 gfx/layers/client/TextureClient.h                |   3 -
 gfx/layers/client/TextureClientSharedSurface.cpp |   1 -
 gfx/layers/ipc/CompositorVsyncScheduler.cpp      |   1 +
 19 files changed, 93 insertions(+), 345 deletions(-)

Perhaps the more interesting thing to check is what will happen to the commit I'm merging this into. Part of the aim here has been to reduce the size of this commit to make it more manageable and I'm interested to know whether or not this has been successful. Here's the incumbent commit as it currently looks:

$ git diff --stat HEAD~..HEAD
 dom/base/nsGlobalWindowOuter.cpp                 |   6 +
 gfx/gl/GLContext.cpp                             |  74 +++++-
 gfx/gl/GLContext.h                               |  21 +-
 gfx/gl/GLContextEGL.h                            |   6 +-
 gfx/gl/GLContextProviderEGL.cpp                  | 240 +++++++++++++++++-
 gfx/gl/GLContextProviderImpl.h                   |  23 ++
 gfx/gl/GLLibraryEGL.cpp                          |  11 +-
 gfx/gl/GLLibraryEGL.h                            |   6 +-
 gfx/gl/GLScreenBuffer.cpp                        | 274 +++++++++++++++++++++
 gfx/gl/GLScreenBuffer.h                          | 132 +++++++++-
 gfx/gl/GLTextureImage.cpp                        |  34 ++-
 gfx/gl/SharedSurface.cpp                         | 116 ++++++++-
 gfx/gl/SharedSurface.h                           | 122 ++++++++-
 gfx/gl/SharedSurfaceEGL.cpp                      |  89 +++++--
 gfx/gl/SharedSurfaceEGL.h                        |  34 ++-
 gfx/gl/SharedSurfaceGL.cpp                       |  64 ++++-
 gfx/gl/SharedSurfaceGL.h                         |  32 ++-
 gfx/gl/SurfaceTypes.h                            |   2 +
 gfx/gl/TextureImageEGL.cpp                       | 221 +++++++++++++++++
 gfx/gl/TextureImageEGL.h                         |  80 ++++++
 gfx/gl/moz.build                                 |   1 +
 gfx/layers/CompositorTypes.h                     |   1 +
 gfx/layers/client/TextureClient.cpp              |  42 +++-
 gfx/layers/client/TextureClient.h                |   3 +
 gfx/layers/client/TextureClientSharedSurface.cpp |  50 ++++
 gfx/layers/client/TextureClientSharedSurface.h   |  25 ++
 gfx/layers/ipc/CompositorVsyncScheduler.cpp      |   2 -
 gfx/layers/opengl/CompositorOGL.cpp              |  24 +-
 28 files changed, 1631 insertions(+), 104 deletions(-)

After merging in these changes to the commit, it'll look like this:

$ git diff --stat HEAD~
 gfx/gl/GLContext.cpp                             |  40 ++-
 gfx/gl/GLContext.h                               |  17 +-
 gfx/gl/GLContextProviderEGL.cpp                  |  36 +++
 gfx/gl/GLContextProviderImpl.h                   |  23 ++
 gfx/gl/GLScreenBuffer.cpp                        | 274 +++++++++++++++++++++
 gfx/gl/GLScreenBuffer.h                          | 132 +++++++++-
 gfx/gl/GLTextureImage.cpp                        |  10 +
 gfx/gl/SharedSurface.cpp                         | 114 ++++++++-
 gfx/gl/SharedSurface.h                           | 105 +++++++-
 gfx/gl/SharedSurfaceEGL.cpp                      |  89 +++++--
 gfx/gl/SharedSurfaceEGL.h                        |  30 ++-
 gfx/gl/SharedSurfaceGL.cpp                       |  74 +++++-
 gfx/gl/SharedSurfaceGL.h                         |  46 ++++
 gfx/gl/SurfaceTypes.h                            |   2 +
 gfx/gl/TextureImageEGL.cpp                       | 221 +++++++++++++++++
 gfx/gl/TextureImageEGL.h                         |  80 ++++++
 gfx/gl/moz.build                                 |   1 +
 gfx/layers/client/TextureClientSharedSurface.cpp |  49 ++++
 gfx/layers/client/TextureClientSharedSurface.h   |  25 ++
 gfx/layers/ipc/CompositorVsyncScheduler.cpp      |   1 -
 gfx/layers/opengl/CompositorOGL.cpp              |  24 +-
 21 files changed, 1334 insertions(+), 59 deletions(-)

There are big gains in reducing the diff for GLContextProviderEGL.cpp and the only losers are SharedSurfaceGL.cpp and SharedSurfaceGL.h. But the wins are much greater than the losses, so I'm happy with this. That means it's time to apply the changes to the previous commit.

Deep breath.

$ git log -1
commit 7437a9d17284bd9a4427502b1af6e94a9d7ee356 (HEAD -> 
    FIREFOX_ESR_91_9_X_RELBRANCH_patches)
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Wed May 22 08:35:49 2024 +0100

    Restore GLScreenBuffer and TextureImageEGL
    
    TextureImageEGL was removed with Bug 1716559:
    
    https://phabricator.services.mozilla.com/D117904
    
    GLScreenBuffer was removed with Bug 1632249:
    
    https://phabricator.services.mozilla.com/D75055
    
    Both are used as part of the Sailfish OS WebView offscreen render
    pipeline. This patch reintroduces these two classes and allows them to
    be made use of.
$ git commit --amend
[FIREFOX_ESR_91_9_X_RELBRANCH_patches 1765dd0460a9] Restore GLScreenBuffer and 
    TextureImageEGL
 Date: Wed May 22 08:35:49 2024 +0100
 21 files changed, 1334 insertions(+), 59 deletions(-)
 create mode 100644 gfx/gl/TextureImageEGL.cpp
 create mode 100644 gfx/gl/TextureImageEGL.h

Step 7 completed. On to Step 8.

Step 8: Admire the resulting, much cleaner, GLSCreenBuffer commit now at HEAD.

Looking through the diff, I think it'll make a decent patch. It remains uncomfortably big — not ideal — but I've now spent a lot of time cutting it down to the bare minimum and I'm satisfied that this is the best I can hope for, at least for now. Certainly it's not going to be worth spending more time trying to cut it down further.

Which leave me with just one more thing to do.

Step 9: Another deep breath. Force push everything to the server. Done.

At least, the plan was to force push. But interestingly, it turns out I don't need to force push at all. It's been so long since I started work on restoring GLScreenBuffer that I thought I must have pushed some of the changes to the remote. But it turns out I hadn't.

This is great news! It means I won't be doing anything destructive by pushing. No need to take any deep breaths after all!

$ git push
Enumerating objects: 92, done.
Counting objects: 100% (92/92), done.
Delta compression using up to 16 threads
Compressing objects: 100% (51/51), done.
Writing objects: 100% (51/51), 22.70 KiB | 2.84 MiB/s, done.
Total 51 (delta 40), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (40/40), completed with 37 local objects.
To github.com:llewelld/gecko-dev-mirror.git
   c6ea49286566..1765dd0460a9  FIREFOX_ESR_91_9_X_RELBRANCH_patches -> 
    FIREFOX_ESR_91_9_X_RELBRANCH_patches
$ cd ..
$ git push
Enumerating objects: 13, done.
Counting objects: 100% (13/13), done.
Delta compression using up to 16 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.46 KiB | 373.00 KiB/s, done.
Total 7 (delta 5), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (5/5), completed with 5 local objects.
To github.com:llewelld/gecko-dev.git
   9f64ce35a187..5b9c330998db  sailfishos-esr91 -> sailfishos-esr91

Done!

So there we are. I can finally draw a line under the WebView rendering changes. It's a line drawn in pencil of course as I'm sure I'll find things I need to come back to, but for now it's time to move on.

The next thing I want to tackle is another Web-related acronym: WebRTC. I intentionally disabled this back on Day 38 in order to get the build to go through. And as we'll see there's a reason I didn't tackle it head on at the time. Now there's no avoiding it: it's time to swallow the bitter pill and get to it.

There are five patches which relate to WebRTC. Some look tractable. But others... well... I don't fancy the chances of that 2.4 MiB patch applying cleanly:

$ grep -rInl &quot;webrtc&quot; * --include=&quot;*.patch&quot; | xargs  ls 
    -lah | awk '{print t$9 &quot;\t&quot; $5}'
0077-sailfishos-webrtc-Adapt-build-configuration-for-Sail.patch 5.6K
0078-sailfishos-webrtc-Regenerate-moz.build-files.-JB-537.patch 2.4M
0079-sailfishos-webrtc-Disable-desktop-sharing-feature-on.patch 2.1K
0080-sailfishos-webrtc-Enable-GMP-for-encoding-decoding.-.patch 33K
0081-sailfishos-webrtc-Implement-video-capture-module.-JB.patch 29K

In practice the really big patch — patch 0078 — is due to changes to the build system which have been auto-generated. Specifically it relates to components that would usually be built using GN, but where the build configuration has been automgically converted into the moz.build files used by the Gecko build system. Here's the text from the docs explaining what's going on here;

GN is a third-party build tool used by chromium and some related projects that are vendored in mozilla-central. Rather than requiring GN to build or writing our own build definitions for these projects, we have support in the build system for translating GN configuration files into moz.build files. In most cases these moz.build files will be like any others in the tree (except that they shouldn’t be modified by hand), however those updating vendored code or building on platforms not supported by Mozilla automation may need to re-generate these files. This is a per-project process, described in dom/media/webrtc/third_party_build/gn-configs/README.md for webrtc. As of writing, it is very specific to webrtc, and likely doesn’t work as-is for other projects.

The problem is that the README.md file mentioned in the explanation is missing from the source tree. A bit of a search throws up a potentially useful comment in Bugzilla that links through to some text that's likely the README.md being referenced. It says the following:

Generate new gn json files and moz.build files for building libwebrtc in our tree
/!\ This is only supported on Linux and macOS. If you are on Windows, you can run the script under WSL.

1. The script should be run from the top directory of our firefox tree.

./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/webrtc.json

2. Checkin all the generated/modified files and try your build!

Adding new configurations to the build

Edit the main function in the python/mozbuild/mozbuild/gn_processor.py file.

Honestly that doesn't seem as bad as I'd feared. But this will of course constitute only one of the five patches that need to be applied. So tomorrow I'll start looking at the other patches to see what's really going to be involved here.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

4 Jul 2024 : Day 278 #

Finally, after much investigation, refinement and guesswork, I finally got the WebView and WebGL working simultaneously yesterday. The key realisation was that the SharedSurface_Basic class needed different functionality depending on whether it was used for one or the other. This was needed as a result of the changes that were made upstream to the WebGL rendering pipeline.

But solving this has also brought other benefits. The process I used to get here was to start with a version with working WebView but broken WebGL, then I started removing all of the GLScreenBuffer code (needed for the WebView) until I had working WebGL as well. But I overshot (embarrassingly and for reasons I won't go in to here) and so then started restoring the GLScreenBuffer code again gently until eventually I hit the critical point where adding or removing a single line resulted in either the WebGL or the WebView working, but not both. That gave me the info I needed to fix the issue, after which I added in just enough more of the remaining code to tidy things up.

As a result I've now added 101 lines to the previous code in order to fix things, but more importantly I've removed 349 lines as well:

$ git diff --stat
 dom/base/nsGlobalWindowOuter.cpp                 |   6 -
 gfx/gl/GLContext.cpp                             |  35 +--
 gfx/gl/GLContext.h                               |   7 +-
 gfx/gl/GLContextEGL.h                            |   6 +-
 gfx/gl/GLContextProviderEGL.cpp                  | 242 +++------------------
 gfx/gl/GLLibraryEGL.cpp                          |  11 +-
 gfx/gl/GLLibraryEGL.h                            |   6 +-
 gfx/gl/GLScreenBuffer.cpp                        |   2 +-
 gfx/gl/GLTextureImage.cpp                        |  26 +--
 gfx/gl/SharedSurface.cpp                         |   3 +-
 gfx/gl/SharedSurface.h                           |  20 +-
 gfx/gl/SharedSurfaceEGL.cpp                      |   1 -
 gfx/gl/SharedSurfaceEGL.h                        |   4 -
 gfx/gl/SharedSurfaceGL.cpp                       |  12 +-
 gfx/gl/SharedSurfaceGL.h                         |  12 +
 gfx/layers/CompositorTypes.h                     |   1 -
 gfx/layers/client/TextureClient.cpp              |  42 +---
 gfx/layers/client/TextureClient.h                |   3 -
 gfx/layers/client/TextureClientSharedSurface.cpp |   2 +-
 gfx/layers/ipc/CompositorVsyncScheduler.cpp      |   1 +
 gfx/layers/opengl/CompositorOGL.cpp              |   8 +
 21 files changed, 101 insertions(+), 349 deletions(-)

So the delta represents the removal of 248 lines, which is what I was hoping for. It should mean the patch for reintroducing GLSCreenBuffer will be much simpler than it would otherwise have been.

I now have to format things into commits. As I just explained the current diff on my local machine removes a bunch of code and eventually I want to merge this diff with the last commit, described as follows:

Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Wed May 22 08:35:49 2024 +0100

    Restore GLScreenBuffer and TextureImageEGL
    
    TextureImageEGL was removed with Bug 1716559:
    
    https://phabricator.services.mozilla.com/D117904
    
    GLScreenBuffer was removed with Bug 1632249:
    
    https://phabricator.services.mozilla.com/D75055
    
    Both are used as part of the Sailfish OS WebView offscreen render
    pipeline. This patch reintroduces these two classes and allows them to
    be made use of.

However, the code my local diff removes actually is there for a reason. Looking through it, I can see three separate reasons for why I made the changes in the first place:

To pass the EGLDisplay value through the system.
To support Wayland surfaces.
To ensure texture data is only destroyed on the main thread.

The first of these involved changes added by me when I was scrabbling around trying to fix the WebView first time around. There were upstream changes to the way the EGLDisplay value was being handled that made me suspect it might be causing the WebView to fail. In retrospect this didn't turn out to be a factor, so I think it'll be safe to remove these changes. I'll just drop them.

The second is functionality that may be needed for native ports and the emulator. It doesn't seem to have any impact on my Xperia 10 III, but that's a libhyris port and doesn't appear to need these changes. Consequently I'm going to drop these changes too, but before I do I want to capture a diff of them, so that I can restore them later if it turns out they're important for native ports.

The third set of changes were to revert changeset D106879 the result of which was to restore the horrifically named mWorkaroundAnnoyingSharedSurfaceLifetimeIssues variable.

commit 6f458ab05edc6c086e304c999365f6b375c46640
Author: sotaro <sotaro.ikeda.g@gmail.com>
Date:   Tue Mar 2 10:48:55 2021 +0000

    Bug 1695846 - Remove unused flags of TextureClient r=nical
    
    Flags were used for WebGL SharedSurface
    
    Differential Revision: https://phabricator.services.mozilla.com/D106879

I'm not able to recall off the top of my head why I reversed this changes, but the good news is that I did write about it, back on Day 187. Referring back I can see that at that time I thought it might have been the reason for the browser seizing up. I eventually found the actual underlying reason, which was unrelated. But I never removed this attempted workaround.

So it should now be safe to have this removed. It'll be good not to have that variable included any more. I'll drop this code and moreover I notice there are some further changes in the changeset which I can apply as well and which should safely reduce the size of the patch a little more.

Okay, here's my rubric for the next few steps.

Take a diff of the my current, uncommitted, local changes. But make it a reverse diff so that it switches additions and removals. This will allow me to get back to my current position from the last commit in case I screw something up.
Stage all of the changes that aren't related to the three points above (EGLDisplay, Wayland surfaces and texture destruction) using git add -p -u . so that things are a bit cleaner. Add these changes to the latest commit using git commit --amend.
Stage all of the changes related to 1 and 3 above (EGLDisplay and texture destruction) using git add -p -u .. Add these changes to the latest commit.
Create a patch from the remaining changes, which should represent those needed to support Wayland surfaces. Keep this patch somewhere safe.
Apply all of the remaining changes from D106879.
Build and check that everything still works.
Take a deep breath. Commit the staged changes.
Admire the resulting, much cleaner, GLSCreenBuffer commit now at HEAD.
Another deep breath. Force push everything to the server. Done.

That looks very much like a plan. Time to execute it.

Step 1. First I'm going to save out a reverse of the diff containing all of my current changes. This is in case something goes wrong, it'll give me something to refer back to so that I can restore things to the current state.

$ git diff -R > ../../glscreenbuffer-changes.diff
$ head ../../glscreenbuffer-changes.diff 
diff --git b/dom/base/nsGlobalWindowOuter.cpp a/dom/base/nsGlobalWindowOuter.cpp
index 41c93c51cf3b..31a93ea1cc70 100644
--- b/dom/base/nsGlobalWindowOuter.cpp
+++ a/dom/base/nsGlobalWindowOuter.cpp
@@ -4542,10 +4542,16 @@ nsresult nsGlobalWindowOuter::SetFullscreenInternal(
    FullscreenReason aReason,
   if (rootItem != mDocShell)
     return window->SetFullscreenInternal(aReason, aFullscreen);
 
+  // We don't have chrome from doc shell point of view. This commit
+  // sha1 3116f3bf53df offends fullscreen API to work without chrome

That looks good.

Step 2. Clean up and stage all of the changes that are clearly needed and not related to the three areas listed above.

$ git add -p -u .
[...]

Great; all done, no issues.

Step 3. In this step I'm going to add all of the code that's not related to the Wayland surface.

$ git add -p -u .
[...]

This turned out to be more straightforward than I'd feared. The Wayland surface changes are now the only thing remaining and in fact all live inside the one file: GLContextProviderEGL.cpp.

Step 4. Create a diff of the changes needed to remove the Wayland surface changes. Reverse it so that it shows how to add rather than remove the changes.

$ git diff -R > ../../wayland-surface-changes.diff
flypig@chattan:~/Documents/Development/jolla/gecko-dev-esr91/gecko-dev/
    gecko-dev$ head ../../wayland-surface-changes.diff 
diff --git b/gfx/gl/GLContextProviderEGL.cpp a/gfx/gl/GLContextProviderEGL.cpp
index 1632d663d101..b5ff16a9e522 100644
--- b/gfx/gl/GLContextProviderEGL.cpp
+++ a/gfx/gl/GLContextProviderEGL.cpp
@@ -85,6 +85,14 @@
 #  endif
 #endif
 
+#ifdef MOZ_WIDGET_QT
+#include <qpa/qplatformnativeinterface.h>

Okay, good, all looking fine.

Step 5. Apply all of the remaining changes from D106879. I'm going to do this carefully in hunks just so that I don't make any errors.

$ git add -p -u .
[...]

Step 6. I'm going to do a full build, so this will be an overnight activity. It's a shame I can't get all of the steps completed today, but this is definitely the right thing to do. Steady as she goes.

$ cd ..
$ sfdk build -d --with git_workaround
[...]

So, I've set the build going. I'll continue with the rest of Step 6 and the remaining steps tomorrow. We're getting there and, honestly, I think it's looking pretty good!

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

3 Jul 2024 : Day 277 #

I'm continuing to restore bits of WebView functionality today, in the hope that the renderer will click into place and start rendering. I've convinced myself it's only a matter of time now.

Yesterday I tried restoring the TextureClient destruction code, but that didn't have any obvious effects so I removed the changes again. Today I'm trying something else: restoring the augmented fBindFramebuffer() functionality.

This is unusual because in default Gecko fBindFramebuffer() is just a wrapper for glBindFramebuffer from the OpenGL library. In previous EmbedLite versions of Gecko the wrapper has been given additional functionality which we can find in GLContext.cpp and which looks like this:

void GLContext::fBindFramebuffer(GLenum target, GLuint framebuffer) {
  if (!mScreen) {
    raw_fBindFramebuffer(target, framebuffer);
    return;
  }

  switch (target) {
    case LOCAL_GL_DRAW_FRAMEBUFFER_EXT:
      mScreen->BindDrawFB(framebuffer);
      return;

    case LOCAL_GL_READ_FRAMEBUFFER_EXT:
      mScreen->BindReadFB(framebuffer);
      return;

    case LOCAL_GL_FRAMEBUFFER:
      mScreen->BindFB(framebuffer);
      return;

    default:
      // Nothing we care about, likely an error.
      break;
  }

  raw_fBindFramebuffer(target, framebuffer);
}

With this change the raw_fBindFramebuffer() becomes the new name for the glBindFramebuffer() wrapper. As you can see, now when fBindFramebuffer() is called some framebuffer binding is also performed.

The additional functionality isn't needed everywhere though, and in fact there are a few places where we then have to switch the calls from using fBindFramebuffer() to using the new raw wrapper instead:

@ -215,11 +215,11 @@ void GLScreenBuffer::BindFB(GLuint fb) {
   mInternalReadFB = (fb == 0) ? readFB : fb;
 
   if (mInternalDrawFB == mInternalReadFB) {
+    mGL->raw_fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, mInternalDrawFB);
-    mGL->fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, mInternalDrawFB);
   } else {
     MOZ_ASSERT(mGL->IsSupported(GLFeature::split_framebuffer));
+    mGL->raw_fBindFramebuffer(LOCAL_GL_DRAW_FRAMEBUFFER_EXT, mInternalDrawFB);
+    mGL->raw_fBindFramebuffer(LOCAL_GL_READ_FRAMEBUFFER_EXT, mInternalReadFB);
-    mGL->fBindFramebuffer(LOCAL_GL_DRAW_FRAMEBUFFER_EXT, mInternalDrawFB);
-    mGL->fBindFramebuffer(LOCAL_GL_READ_FRAMEBUFFER_EXT, mInternalReadFB);
   }
 
 #ifdef DEBUG
@@ -235,7 +235,7 @@ void GLScreenBuffer::BindDrawFB(GLuint fb) {
   mUserDrawFB = fb;
   mInternalDrawFB = (fb == 0) ? drawFB : fb;
 
+  mGL->raw_fBindFramebuffer(LOCAL_GL_DRAW_FRAMEBUFFER_EXT, mInternalDrawFB);
-  mGL->fBindFramebuffer(LOCAL_GL_DRAW_FRAMEBUFFER_EXT, mInternalDrawFB);
 
 #ifdef DEBUG
   mInInternalMode_DrawFB = false;
@@ -249,7 +249,7 @@ void GLScreenBuffer::BindReadFB(GLuint fb) {
   mUserReadFB = fb;
   mInternalReadFB = (fb == 0) ? readFB : fb;
 
+  mGL->raw_fBindFramebuffer(LOCAL_GL_READ_FRAMEBUFFER_EXT, mInternalReadFB);
-  mGL->fBindFramebuffer(LOCAL_GL_READ_FRAMEBUFFER_EXT, mInternalReadFB);
 
 #ifdef DEBUG
   mInInternalMode_ReadFB = false;

These aren't all of the uses though and in the remaining cases the new augmented functionality will be called instead. This is another nicely self-contained change without which I could perfectly well imagine the rendering might be broken. So restoring this functionality might have positive effects.

Let's see. I've made these changes and set the partial build running. Soon I'll be able to test this out.

[...]

Two screenshots, both look identical with WebGL content; on the left the browser, on the right the WebView; shader written by Shane

And it works! The WebView renders nicely. The browser renders nicely. Both of them render WebGL nicely (props to Shane for the nice shader example). Great!

Phew. This concludes the difficult part of this process. Sadly it doesn't mean the end of the journey just yet though, oh no: I still need to turn this into a tidy commit and figure out what to do with the remaining changes (the ones that don't appear to be necessary). And to avoid any undue excitement, there are more things to fix after that as well.

Those will all be for tomorrow and beyond though. For today, I'm going to celebrate (in a quiet contemplative way!) and enjoy the win.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

2 Jul 2024 : Day 276 #

Last night the new packages built so that I now have a version with working WebGL and partially working WebView. Crucially, this version splits off the functionality needed for working WebGL from the WebView functionality, meaning I should now be able to reintroduce the WebView functionality without it affecting the WebGL rendering.

So the next steps will be to reintroduce as much of the WebView changes needed in order to get WebView working, but preferably no more than that.

Before getting on to that I have one other issue I want to address. Although the browser is now working, along with the WebGL rendering, there are also now spurious crashes happening, presumably as a result of some of these changes. The backtrace associated with the crash is basically useless, even with a fully installed and correct set of debug symbols:

Thread 8 &quot;GeckoWorkerThre&quot; received signal SIGSEGV, Segmentation 
    fault.
[Switching to LWP 30258]
0x0000007fe5ee29a0 in ?? ()
(gdb) bt
#0  0x0000007fe5ee29a0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

My hunch is that this relates to the changes to SharedSurface_Basic. In particular, although I changed the code so that a different constructor is used to create the SharedSurface_Basic objects depending on the type of rendering, the process of destruction, which used to rely on just the default destructor, now has bespoke functionality:

SharedSurface_Basic::~SharedSurface_Basic() {
  if (!mDesc.gl || !mDesc.gl->MakeCurrent()) return;

  if (mFB) mDesc.gl->fDeleteFramebuffers(1, &mFB);

  if (mOwnsTex) mDesc.gl->fDeleteTextures(1, &mTex);
}

This code needs to be there for the WebView to work, but it was never there in the WebGL version. My guess is that as surfaces are destroyed periodically, they're calling some of this code which is then triggering a crash. In particular, I notice that mFB isn't being set in the WebGL constructor, which means it could have a value other than zero. If this were the case, it could be causing problems when this uninitialised value gets passed to the OpenGL library, which could in turn explain why the backtrace is so poor.

As an attempt to fix this I've added in a line to initialise mFB to zero, which should be enough to prevent the new lines in the destructor from executing.

 SharedSurface_Basic::SharedSurface_Basic(const SharedSurfaceDesc& desc,
                                          UniquePtr<MozFramebuffer>&& fb)
     : SharedSurface(desc, std::move(fb)),
       mTex(0),
       mOwnsTex(false),
+      mFB(0)
 {
 }

Great, that's done the trick! No more random crashes. So, now I can move on to the larger task of getting the WebView to work. It's a bigger task, but I'm also hoping it'll be relatively formulaic, given that I already have a working version to compare against. I just need to gradually reintroduce the code from that version until it all clicks into place and works. The plan is that, with the WebGL rendering now separated from it, what I'll end up with is a working WebView and working WebGL.

So I'm looking through the code that's been amended and it looks like there are some relatively self-contained changes related to destruction of TextureClient instances. By reverting the changes from the following files, I should get a slice of the changes back with minimal fuss:

gfx/layers/CompositorTypes.h
gfx/layers/client/TextureClient.h
gfx/layers/client/TextureClient.cpp
gfx/layers/client/TextureClientSharedSurface.cpp
gfx/layers/ipc/CompositorVsyncScheduler.cpp

I'm not sure if this will actually have any effect though, so I'm going to make a copy of the diff before reverting it. That way I can restore the changes if needed.

$ git diff gfx/layers/CompositorTypes.h \
    gfx/layers/client/TextureClient.h \
    gfx/layers/client/TextureClient.cpp \
    gfx/layers/client/TextureClientSharedSurface.cpp \
    gfx/layers/ipc/CompositorVsyncScheduler.cpp \
    > destroy-texture-changes.diff
$ git checkout gfx/layers/CompositorTypes.h \
    gfx/layers/client/TextureClient.h \
    gfx/layers/client/TextureClient.cpp \
    gfx/layers/client/TextureClientSharedSurface.cpp \
    gfx/layers/ipc/CompositorVsyncScheduler.cpp

I'm a little surprised to discover that after resetting the files to revert the changes the code still builds just fine without needing any further adjustments. But when I test out the new library I don't see any obvious positive (or indeed negative) effects. I was hoping for something different. I guess this isn't the critical difference that I'm looking for.

So I'm going to restore thee changes and try something else tomorrow. I feel like getting the WebGL not to crash was a good step forwards today. It's a shame the TextureClient changes didn't work, but that's still something to cross off my list of changes to try. So that's progress.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

1 Jul 2024 : Day 275 #

After completing a partial build yesterday we were left with working WebGL when using the browser and a partially working WebView. My suspicion was that a regression has resulted in the same hanging that we experienced much earlier in the process, but I have yet to investigate that.

More to the point, I want to find out where SharedSurface_Basic objects get created (there may be multiple) during each of the different permutations of rendering involving the browser, WebView, non-WebGL-adorned pages and pages with WebGL content.

So that I can do this properly using the debugger I left a build running overnight. When I got up this morning it hadn't yet completed, but it was in the next best state: everything was already being packaged up into rpms. I've never seen it fail from this point onwards and it would usually take no more than a few minutes from here, so this was good to see.

A console showing the gecko build about to finish

And indeed it did complete very quickly. So the build was successful and I now have packages to test. But before I do test them, I'm going to run with the original working WebGL packages to find out where the ShareSurface_Basic objects, if any, get created. To test this I've placed a breakpoint on the ShareSurface_Basic::Create() method.

What I find is that there's no call to ShareSurface_Basic::Create() when using the browser generally, unless there's WebGL content on the page. If there is WebGL content it gets called like this:

Thread 8 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, mozilla::gl::
    SharedSurface_Basic::Create (desc=...)
    at gfx/gl/SharedSurfaceGL.cpp:20
20          const SharedSurfaceDesc& desc) {
(gdb) bt
#0  mozilla::gl::SharedSurface_Basic::Create (desc=...)
    at gfx/gl/SharedSurfaceGL.cpp:20
#1  0x0000007ff28a8130 in mozilla::gl::SurfaceFactory_Basic::CreateSharedImpl (
    this=<optimized out>, desc=...)
    at gfx/gl/SharedSurfaceGL.h:41
#2  0x0000007ff28aaee8 in mozilla::gl::SurfaceFactory::CreateShared (size=..., 
    this=0x7fc98a1980)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefCounted.h:240
#3  mozilla::gl::SwapChain::Acquire (this=this@entry=0x7fc949ac48, size=...)
    at gfx/gl/GLScreenBuffer.cpp:56
#4  0x0000007ff369d340 in mozilla::WebGLContext::PresentInto (
    this=this@entry=0x7fc949a7b0, swapChain=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#5  0x0000007ff369d7ac in mozilla::WebGLContext::Present (
    this=this@entry=0x7fc949a7b0, xrFb=<optimized out>,
    consumerType=consumerType@entry=mozilla::layers::TextureType::Unknown, 
    webvr=webvr@entry=false)
    at dom/canvas/WebGLContext.cpp:9
36
#6  0x0000007ff36656a8 in mozilla::HostWebGLContext::Present (webvr=false, 
    t=mozilla::layers::TextureType::Unknown, xrFb=<optimized out>,
    this=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/
    mozilla/RefPtr.h:280
#7  mozilla::ClientWebGLContext::Run<void (mozilla::HostWebGLContext::*)(
    unsigned long, mozilla::layers::TextureType, bool) const, &(mozilla::
    HostWebGLCont
ext::Present(unsigned long, mozilla::layers::TextureType, bool) const), 
    unsigned long, mozilla::layers::TextureType const&, bool const&> (
    this=<optimized out>, args#0=@0x7fdf2f26c0: 0, args#1=@0x7fdf2f26bf: 
    mozilla::layers::TextureType::Unknown, args#2=@0x7fdf2f26be: false)
    at dom/canvas/ClientWebGLContext
.cpp:313
#8  0x0000007ff3665810 in mozilla::ClientWebGLContext::Present (
    this=this@entry=0x7fc9564c10, xrFb=xrFb@entry=0x0, type=<optimized out>,
    webvr=<optimized out>, webvr@entry=false)
    at dom/canvas/ClientWebGLContext
.cpp:363
#9  0x0000007ff3691020 in mozilla::ClientWebGLContext::OnBeforePaintTransaction 
    (this=0x7fc9564c10)
    at dom/canvas/ClientWebGLContext
.cpp:345
#10 0x0000007ff290058c in mozilla::layers::CanvasRenderer::
    FirePreTransactionCallback (this=this@entry=0x7fc9840c50)
    at gfx/layers/CanvasRenderer.cpp
:75
#11 0x0000007ff29b1228 in mozilla::layers::ShareableCanvasRenderer::
    UpdateCompositableClient (this=0x7fc9840c50)
    at gfx/layers/ShareableCanvasRen
derer.cpp:192
#12 0x0000007ff29f0dc4 in mozilla::layers::ClientCanvasLayer::RenderLayer (
    this=0x7fc989ee20)
    at gfx/layers/client/ClientCanva
sLayer.cpp:25
#13 0x0000007ff29efee4 in mozilla::layers::ClientLayer::RenderLayerWithReadback 
    (this=<optimized out>, aReadback=<optimized out>)
    at gfx/layers/client/ClientLayer
Manager.h:365
#14 0x0000007ff2a00224 in mozilla::layers::ClientContainerLayer::RenderLayer (
    this=0x7fc9898670)
    at gfx/layers/Layers.h:1051
#15 0x0000007ff29efee4 in mozilla::layers::ClientLayer::RenderLayerWithReadback 
    (this=<optimized out>, aReadback=<optimized out>)
    at gfx/layers/client/ClientLayer
Manager.h:365
#16 0x0000007ff2a00224 in mozilla::layers::ClientContainerLayer::RenderLayer (
    this=0x7fc963f3f0)
    at gfx/layers/Layers.h:1051
#17 0x0000007ff29efee4 in mozilla::layers::ClientLayer::RenderLayerWithReadback 
    (this=<optimized out>, aReadback=<optimized out>)
    at gfx/layers/client/ClientLayer
Manager.h:365
#18 0x0000007ff2a00224 in mozilla::layers::ClientContainerLayer::RenderLayer (
    this=0x7fc984bb20)
    at gfx/layers/Layers.h:1051
#19 0x0000007ff2a06f08 in mozilla::layers::ClientLayerManager::
    EndTransactionInternal (this=this@entry=0x7fc8b18820, 
    aCallback=aCallback@entry=
    0x7ff46a379c <mozilla::FrameLayerBuilder::DrawPaintedLayer(mozilla::layers::
    PaintedLayer*, gfxContext*, mozilla::gfx::IntRegionTyped<mozilla::gfx::Unkn
ownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> 
    const&, mozilla::layers::DrawRegionClip, mozilla::gfx::
    IntRegionTyped<mozilla::g
fx::UnknownUnits> const&, void*)>, 
    aCallbackData=aCallbackData@entry=0x7fdf2f3268)
    at gfx/layers/client/ClientLayerManager.cpp:341
#20 0x0000007ff2a11e08 in mozilla::layers::ClientLayerManager::EndTransaction (
    this=0x7fc8b18820, 
    aCallback=0x7ff46a379c <mozilla::FrameLayerBuilder::DrawPaintedLayer(
    mozilla::layers::PaintedLayer*, gfxContext*, mozilla::gfx::
    IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::
    IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::layers::
    DrawRegionClip, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> 
    const&, void*)>, aCallbackData=0x7fdf2f3268, aFlags=mozilla::layers::
    LayerManager::END_DEFAULT)
    at gfx/layers/client/ClientLayerManager.cpp:397
#21 0x0000007ff46a0bbc in nsDisplayList::PaintRoot (
    this=this@entry=0x7fdf2f5078, aBuilder=aBuilder@entry=0x7fdf2f3268, 
    aCtx=aCtx@entry=0x0, 
    aFlags=aFlags@entry=13, aDisplayListBuildTime=...)
    at layout/painting/nsDisplayList.cpp:2622
#22 0x0000007ff442cf18 in nsLayoutUtils::PaintFrame (
    aRenderingContext=aRenderingContext@entry=0x0, 
    aFrame=aFrame@entry=0x7fc93460b0, aDirtyRegion=..., 
    aBackstop=aBackstop@entry=4294967295, 
    aBuilderMode=aBuilderMode@entry=nsDisplayListBuilderMode::Painting, 
    aFlags=aFlags@entry=(nsLayoutUtils::PaintFrameFlags::WidgetLayers | 
    nsLayoutUtils::PaintFrameFlags::ExistingTransaction | nsLayoutUtils::
    PaintFrameFlags::NoComposite)) at ${PROJECT}/obj-build-mer-qt-xr/dist/
    include/mozilla/MaybeStorageBase.h:80
#23 0x0000007ff43b760c in mozilla::PresShell::Paint (
    this=this@entry=0x7fc92c2810, aViewToPaint=aViewToPaint@entry=0x7fc92a0d20, 
    aDirtyRegion=..., 
    aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers)
    at layout/base/PresShell.cpp:6400
#24 0x0000007ff41ef4dc in nsViewManager::ProcessPendingUpdatesPaint (
    this=this@entry=0x7fc92a0cb0, aWidget=aWidget@entry=0x7fc9250910)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/RectAbsolute.h:43
[...]
#57 0x0000007fefbae89c in ?? () from /lib64/libc.so.6
(gdb)

That's a rather long backtrace, but it captures what I need to know, which is that it's not getting called from the CompositorOGL code, but rather from the painting code. That's because it's being created specifically for the WebGL entity on the page.

It'll also be useful to find out whether CompositorOGL::CreateContext() gets called when rendering browser pages. That's useful because it'll tell us why the code there isn't creating a SharedSurface_Basic object. So I've placed a breakpoint on the method and will step through to find out. This, again, is using the code installed from our original packages with working WebGL and on a page that includes WebGL content:

Thread 37 &quot;Compositor&quot; hit Breakpoint 3, mozilla::layers::
    CompositorOGL::CreateContext (this=this@entry=0x7ed81a1c80)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() 
    {
(gdb) n
231       nsIWidget* widget = mWidget->RealWidget();
(gdb) n
232       void* widgetOpenGLContext =
(gdb) n
234       if (widgetOpenGLContext) {
(gdb) p widgetOpenGLContext
$1 = (void *) 0x7ed81a6fe0
(gdb) n
mozilla::layers::CompositorOGL::Initialize (this=0x7ed81a1c80, 
    out_failureReason=0x7f1651d510)
    at gfx/layers/opengl/CompositorOGL.cpp:395
395         mGLContext = CreateContext();

Stepping through the code, after reaching the condition on widgetOpenGLContext, we then find ourselves outside the CompositorOGL::CreateContext() method:

(gdb) bt 2
#0  mozilla::layers::CompositorOGL::Initialize (this=0x7ed81a1c80, 
    out_failureReason=0x7f1651d510)
    at gfx/layers/opengl/CompositorOGL.cpp:395
#1  0x0000007ff2a66d88 in mozilla::layers::CompositorBridgeParent::
    NewCompositor (this=this@entry=0x7fc89ab730, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
(More stack frames follow...)
(gdb)

That's because of the early return from near the start of the method, as we can see here:

already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
  RefPtr<GLContext> context;

  // Used by mock widget to create an offscreen context
  nsIWidget* widget = mWidget->RealWidget();
  void* widgetOpenGLContext =
      widget ? widget->GetNativeData(NS_NATIVE_OPENGL_CONTEXT) : nullptr;
  if (widgetOpenGLContext) {
    GLContext* alreadyRefed = reinterpret_cast<GLContext*>(widgetOpenGLContext);
    return already_AddRefed<GLContext>(alreadyRefed);
  }
[...]

So what we can infer from this is, in short, that widget->GetNativeData(NS_NATIVE_OPENGL_CONTEXT) is set in the case of WebGL rendering. but not when it's WebView rendering.

It turns out we get the same flow when rendering a page without WebGL, so this isn't a consequence of the WebGL on the page, it's just because we're using the browser rather than a WebView, as we can see here:

Thread 39 &quot;Compositor&quot; hit Breakpoint 3, mozilla::layers::
    CompositorOGL::CreateContext (this=this@entry=0x7ed81a1c70)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() 
    {
(gdb) bt 3
#0  mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7ed81a1c70)
    at gfx/layers/opengl/CompositorOGL.cpp:227
#1  0x0000007ff2950ff8 in mozilla::layers::CompositorOGL::Initialize (
    this=0x7ed81a1c70, out_failureReason=0x7f3633e510)
    at gfx/layers/opengl/CompositorOGL.cpp:395
#2  0x0000007ff2a66d88 in mozilla::layers::CompositorBridgeParent::
    NewCompositor (this=this@entry=0x7fc89a9d80, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
(More stack frames follow...)
(gdb) n
231       nsIWidget* widget = mWidget->RealWidget();
(gdb) n
232       void* widgetOpenGLContext =
(gdb) n
234       if (widgetOpenGLContext) {
(gdb) p widgetOpenGLContext
$2 = (void *) 0x7ed81a6fd0
(gdb) n
79      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h: No such 
    file or directory.
(gdb) n
mozilla::layers::CompositorOGL::Initialize (this=0x7ed81a1c70, 
    out_failureReason=0x7f3633e510)
    at gfx/layers/opengl/CompositorOGL.cpp:395
395         mGLContext = CreateContext();
(gdb)

Okay, great. There's one other important piece of information that we need in order to unravel all this and that's that the code we added to the SharedSurface_Basic class in order to get the WebView working (it's not working yet, but we now know we'll need it) is actually only an extension of the SharedSurface_Basic code, without removing any existing code. In fact it adds a new constructor and new Create() method (as an override with a different signature), but without removing the original constructor or Create() method. Here are the old and new Create() methods as they appear side-by-side in the code:

  static UniquePtr<SharedSurface_Basic> Create(const SharedSurfaceDesc&);

  static UniquePtr<SharedSurface_Basic> Create(GLContext* gl,
                                               const gfx::IntSize& size);

The only code that's not strictly an addition happens in the SurfaceFactory_Basic class where the Create() method that's used to create the SharedSurface_Basic object in the CreateSharedImpl() method is switched from the old one to the new one.

 class SurfaceFactory_Basic final : public SurfaceFactory {
  public:
   explicit SurfaceFactory_Basic(GLContext& gl);
   explicit SurfaceFactory_Basic(GLContext* gl,
                                 const layers::TextureFlags& flags);
 
   virtual UniquePtr<SharedSurface> CreateSharedImpl(
       const SharedSurfaceDesc& desc) override {
-    return SharedSurface_Basic::Create(desc);
+    return SharedSurface_Basic::Create(mDesc.gl, desc.size);
   }
 };

It would be helpful if I could confirm this, so I've built a version of the library that's exactly the same as the version with broken WebGL and I've switched it to use the original SharedSurface_Basic::Create() method. That's the only change. And when I run it, the WebGL is working. So that's confirmed, it's definitely the new code that's been added to the SharedSurface_Basic class that's causing the problem. It's great to have this finally pinned down, but what's even better is that it looks like the problem can be bypassed simply by choosing a different constructor.

What does this all mean? It means that we know all of the following:

The code added to the SharedSurface_Basic needed for WebView rendering is what's causing the WebGL to fail in the browser.
The change that actually causes WebView rendering to fail is just one line: the switch of the Create() method in the SurfaceFactory_Basic::CreateSharedImpl() method.
The calls to SurfaceFactory_Basic that are used to create the SharedSurface_Basic objects are in different places depending on whether they're used for the WebGL or the WebView. In the former case they happen in the painting loop. In the latter case they happen in CompositorOGL::CreateContext().

We can use all of this to our advantage. I've taken the SurfaceFactory_Basic class and created a new version, which I've called SurfaceFactory_GL. This performs the new constructor type. So we now have two versions, like this:

class SurfaceFactory_Basic final : public SurfaceFactory {
 public:
  explicit SurfaceFactory_Basic(GLContext& gl);
  explicit SurfaceFactory_Basic(GLContext* gl,
                                const layers::TextureFlags& flags);

  virtual UniquePtr<SharedSurface> CreateSharedImpl(
      const SharedSurfaceDesc& desc) override {
    return SharedSurface_Basic::Create(desc);
  }
};

class SurfaceFactory_GL final : public SurfaceFactory {
 public:
  explicit SurfaceFactory_GL(GLContext& gl);
  explicit SurfaceFactory_GL(GLContext* gl,
                                const layers::TextureFlags& flags);

  virtual UniquePtr<SharedSurface> CreateSharedImpl(
      const SharedSurfaceDesc& desc) override {
    return SharedSurface_Basic::Create(mDesc.gl, desc.size);
  }
};

Notice how they differ only in the parameters passed to the SharedSurface_Basic::Create() method. Together they'll allow us to create the SharedSurface_Basic objects needed for the two different pipelines. I may find I need something more nuanced in the future and, if so, I may end up having to create an entirely new type of SharedSurface. But if that does turn out to be the case then I now know it'll work out fine: it's all nice self-contained code.

Alright, I've made this change and I want to see it in action. The partial build shows good results, but I want to check things with the debugger again. So I've set off another full build. Let's find out where things are at when it's done.

Before I sign off for the day, let me say that I'm getting increasingly hopeful about this. It really feels like we've got all the information needed to fix the problem and that we now just need to carefully move all of the metaphorical girders into their correct positions.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comment

flypig.co.uk

Location

List items

Gecko

All items from July 2024

Navigate

Actions

Archives