flypig.co.uk

List items

Items from the current list are shown below.

Gecko

All items from July 2024

22 Jul 2024 : Day 296 #
Today I've been trying — really quite hard actually — to fix the hang that happens when switching to private browsing mode. Having looked at this before I refreshed myself yesterday to the point of understanding that it's due to the call to setBrowserCover() that happens when mode switch is made on the tab screen.

I've got a bit further today, the results of which only makes things even more confusing. To explain all this, it'll help to take a look at the setBrowserCover() method, which looks like this:
s
    function setBrowserCover(model) {
        if (!model || model.count === 0 || !WebUtils.firstUseDone) {
            cover = Qt.resolvedUrl("cover/NoTabsCover.qml")
        } else {
            if (cover != null && window.webView) {
                window.webView.clearSurface()
            }
            cover = null
        }
    }
Let's break this down a bit. The browser manages two tab models, one for normal browsing and the other for private browsing. When switching between the two modes one model is switched out for the other. This setBrowserCover() method is called just proceeding the change in model. So by the time we find ourselves in this method we've already switched the model to the one for private browsing.

Whenever the browser is closed any private browsing state — including the associated tab model — is either destroyed or forgotten. This includes the private browsing tab model. That means that having just opened the browser we know the private browsing tab model will have no tabs and so model.count in the above code will be zero.

That means we're going through the first half of the if statement above. There's only one line of functionality that therefore gets called as a result and that's the following:
            cover = Qt.resolvedUrl("cover/NoTabsCover.qml")
Typically the cover model for the browser will be set to null so that it shows the contents of the current page. If there are no pages open the cover is replaced, as we can see with this line of code, with the cover layout defined in the NoTabsCover.qml file.

So far so good. This is exactly what happens when we move to private browsing mode for the very first time. If I comment out the above line there are two consequences:
 
  1. When there are no active web pages the cover just shows a blank background.
  2. There's no hang.


We can conclude that it seems to be the act of setting the cover that's triggering the hang. This feels very strange because there's nothing special or magical about the cover or the way it gets switched in and out. I've tried a whole host of things in an attempt to get a clearer picture.

For example I wondered whether this was related to private browser mode or not, so I added a timer that switched out the cover after a delay of five seconds, irrespective of what's happening at the time. What I found was that this also hangs the browser, even you just have a static web page open and there's nothing exceptional happening. This suggests that it's not private browsing per se that's causing the problem, but rather the switching of the cover.

Intriguingly, if you do the switch while performing a pan and zoom, there's a crash instead of a hang. This has allowed me to collect the following backtrace:
[D] onTriggered:45 - Set cover: file:///usr/share/sailfish-browser/cover/
    NoTabsCover.qml
[New LWP 2607]
sailfish-browser: ../../../platforms/wayland/wayland_window_common.cpp:256: 
    void WaylandNativeWindow::releaseBuffer(wl_buffer*): Assertion `it != 
    fronted.end()' failed.

Thread 38 "Compositor" received signal SIGABRT, Aborted.
[Switching to LWP 2574]
0x0000007fef49a344 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000007fef49a344 in raise () from /lib64/libc.so.6
#1  0x0000007fef47fce8 in abort () from /lib64/libc.so.6
#2  0x0000007fef48ebd8 in ?? () from /lib64/libc.so.6
#3  0x0000007fef48ec40 in __assert_fail () from /lib64/libc.so.6
#4  0x0000007fe74e3044 in WaylandNativeWindow::releaseBuffer(wl_buffer*) () 
    from /usr/lib64/libhybris//eglplatform_wayland.so
#5  0x0000007fee8fa050 in ?? () from /usr/lib64/libffi.so.8
#6  0x0000007fee8f65f8 in ?? () from /usr/lib64/libffi.so.8
#7  0x0000007fe7795f98 in ?? () from /usr/lib64/libwayland-client.so.0
#8  0x0000007fe7792d80 in ?? () from /usr/lib64/libwayland-client.so.0
#9  0x0000007fe7794038 in wl_display_dispatch_queue_pending () from /usr/lib64/
    libwayland-client.so.0
#10 0x0000007fe74e3204 in WaylandNativeWindow::readQueue(bool) () from /usr/
    lib64/libhybris//eglplatform_wayland.so
#11 0x0000007fe74e23ec in WaylandNativeWindow::finishSwap() () from /usr/lib64/
    libhybris//eglplatform_wayland.so
#12 0x0000007fef090210 in _my_eglSwapBuffersWithDamageEXT () from /usr/lib64/
    libEGL.so.1
#13 0x0000007ff2397110 in mozilla::gl::GLLibraryEGL::fSwapBuffers (
    surface=0x5555991a60, dpy=<optimized out>, this=<optimized out>)
    at gfx/gl/GLLibraryEGL.h:303
#14 mozilla::gl::EglDisplay::fSwapBuffers (surface=0x5555991a60, 
    this=<optimized out>)
    at gfx/gl/GLLibraryEGL.h:694
#15 mozilla::gl::GLContextEGL::SwapBuffers (this=0x7ed41a6e30)
    at gfx/gl/GLContextProviderEGL.cpp:558
#16 0x0000007ff2440e00 in mozilla::layers::CompositorOGL::EndFrame (
    this=0x7ed41a1d70)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#17 0x0000007ff25174dc in mozilla::layers::LayerManagerComposite::Render (
    this=this@entry=0x7ed41a8a70, aInvalidRegion=..., aOpaqueRegion=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#18 0x0000007ff2517728 in mozilla::layers::LayerManagerComposite::
    UpdateAndRender (this=this@entry=0x7ed41a8a70)
    at gfx/layers/composite/LayerManagerComposite.cpp:657
#19 0x0000007ff2517ad8 in mozilla::layers::LayerManagerComposite::
    EndTransaction (this=this@entry=0x7ed41a8a70, aTimeStamp=..., 
    aFlags=aFlags@entry=mozilla::layers::LayerManager::END_DEFAULT)
    at gfx/layers/composite/LayerManagerComposite.cpp:572
#20 0x0000007ff2559274 in mozilla::layers::CompositorBridgeParent::
    CompositeToTarget (this=0x7fb89aba80, aId=..., aTarget=0x0, 
    aRect=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#21 0x0000007ff253e9bc in mozilla::layers::CompositorVsyncScheduler::Composite (
    this=0x7fb8b500e0, aVsyncEvent=...)
    at gfx/layers/ipc/CompositorVsyncScheduler.cpp:256
#22 0x0000007ff2536e34 in mozilla::detail::RunnableMethodArguments<mozilla::
    VsyncEvent>::applyImpl<mozilla::layers::CompositorVsyncScheduler, void (
    mozilla::layers::CompositorVsyncScheduler::*)(mozilla::VsyncEvent const&), 
    StoreCopyPassByConstLRef<mozilla::VsyncEvent>, 0ul> (args=..., m=<optimized 
    out>, 
    o=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/
    nsThreadUtils.h:887
#23 mozilla::detail::RunnableMethodArguments<mozilla::VsyncEvent>::
    apply<mozilla::layers::CompositorVsyncScheduler, void (mozilla::layers::
    CompositorVsyncScheduler::*)(mozilla::VsyncEvent const&)> (m=<optimized 
    out>, o=<optimized out>, this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1154
#24 mozilla::detail::RunnableMethodImpl<mozilla::layers::
    CompositorVsyncScheduler*, void (mozilla::layers::CompositorVsyncScheduler::
    *)(mozilla::VsyncEvent const&), true, (mozilla::RunnableKind)1, mozilla::
    VsyncEvent>::Run (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1201
[...]
#34 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb) 
This hints at the possibility that the render buffers may be being swapped in the wrong thread. But my attempts to dig deeper into this haven't as yet thrown up anything that could give more of a hint about what's going on.

I've also added some debug output to the setBrowserCover() method so it now looks like this:
    function setBrowserCover(model) {
        console.log(&quot;model: &quot; + model);
        console.log(&quot;model.count: &quot; + model.count);
        if (!model || model.count === 0 || !WebUtils.firstUseDone) {
            console.log(&quot;Setting cover&quot;);
            cover = Qt.resolvedUrl(&quot;cover/NoTabsCover.qml&quot;)
            console.log(&quot;Set cover: &quot; + cover);
        } else {
            console.log(&quot;Not setting cover&quot;);
            if (cover != null && window.webView) {
                window.webView.clearSurface()
            }
            cover = null
        }
        console.log(&quot;Exiting&quot;);
    }
When switching to private browsing mode, whether it's done via the menu or the tab list, the following output is triggered:
[D] setBrowserCover:20 - model: PrivateTabModel(0x62bf23d0e0)
[D] setBrowserCover:21 - model.count: 0
[D] setBrowserCover:23 - Setting cover
[D] setBrowserCover:25 - Set cover: file:///usr/share/sailfish-browser/cover/
    NoTabsCover.qml
[D] setBrowserCover:33 - Exiting
Immediately after the last debug print here the browser hangs. I've been trying hard to find some method inside the browser that's executed between the last line of this debug output and the actual hang, but without success. I've been doing this by adding breakpoints to various methods, switching to private browsing and watching to see if any of the breakpoints are hit.

So far without luck. Here are just a few of the methods I've attached breakpoints to and tested this way:
GLContextEGL::SwapBuffers()
GLContextEGL::SetDamage()
GLContextEGL::RenewSurface()
GLScreenBuffer::Swap()
ReadBuffer::Attach()
BeginTransaction()
EndEmptyTransaction()
NeedsPaint()
QOpenGLWebPage::onDrawOverlay()
Many of the breakpoints on these methods are triggered at other points in the browsing process, but if this happens I've just been continuing execution until the point at which I manually switch to private browsing. I get the same output and the same hang as when there's no breakpoint, like this:
Thread 39 &quot;Compositor&quot; hit Breakpoint 1, mozilla::layers::
    LayerManagerComposite::BeginTransaction (this=0x7ed41a8c20, aURL=...)
    at gfx/layers/composite/LayerManagerComposite.cpp:232
232     bool LayerManagerComposite::BeginTransaction(const nsCString& aURL) {
(gdb) c
Continuing.
[D] setBrowserCover:20 - model: PrivateTabModel(0x7fd800da50)
[D] setBrowserCover:21 - model.count: 0
[D] setBrowserCover:23 - Setting cover
[D] setBrowserCover:25 - Set cover: file:///usr/share/sailfish-browser/cover/
    NoTabsCover.qml
[D] setBrowserCover:33 - Exiting
There are a few other things I think it's worth mentioning. The hang happens when the cover is set, but not when it's cleared. If the cover is set right at the start and left as it is (so it's never set to null), everything runs fine. So it very much seems to be the act of switching from null to non-null that causes the problem.

Having not managed to find any methods that are fired between the cover being set and the hang occurring, I got frustrated and went for a walk outside. We have a lake nearby that's beautifully calm at this time of year. The air is warm and calm without being oppressive, which makes going for a walk a great way for me to clear my thoughts and come back feeling calmer.

I didn't have any revelations while walking, but I did think about whether I can approach this from a different angle. Rather than trying to find the gecko methods that are causing the problem by seeing if they're being used, what if I were to try to disable gecko functionality in the hope that the hang might suddenly vanish.

If the hang goes away with a particular piece of functionality disabled, then it may indicate some kind of clash between the cover change and the disabled functionality.

So I've tried a whole bunch of things, for example, setting it so that the page is always inactive by forcing the state to be always set to false:
 void QOpenGLWebPage::setActive(bool active)
 {
+    active = false;
     // WebPage is in inactive state until the view is initialized.
     // ::processViewInitialization always forces active state so we
     // can just ignore early activation calls.
     if (!d || !d->mViewInitialized)
         return;
 
     if (d->mActive != active) {
         d->mActive = active;
         d->mView->SetIsActive(d->mActive);
         Q_EMIT activeChanged();
     }
 }
I also tried disabling the initialisation code:
 void QOpenGLWebPage::initialize()
 {
-    d->createView();
 }
Plus a whole bunch of other similar things, from disconnecting various signals to preventing the EGL Display from being initialised. Many of these changes prevented rendering, but none of them prevented the hang.

I don't have an answer for why this is happening, but I'll persevere with it. As with everything computer-related, there is definitely an answer, it's just a case of finding it.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
21 Jul 2024 : Day 295 #
Yesterday, you may recall, I didn't do any coding but rather started looking through some changes from Raine that I needed to incorporate into my local codebase. I'm continuing that today, in the hope of wrapping that up and then moving on to some of the other issues in the issue tracker on GitHub.

So I've reviewed and approved both of Raine's PRs. The first, which I already looked at yesterday, updates the sailfish-components-webview code to support the fromExternal flag. This change is needed to get the code to compile against the changes already made to gecko.

The second switches instances of XPCOMUtils for ComponentUtils since upstream has moved the generateNSGetFactory() method from the former to the latter. I'd already made this change in a few places, but Raine has propagated it across the entire codebase, which is great. After looking through and checking it, I've now approved it.

Those are the two PRs I can see, so I've moved on to the issues. I've marked Issue 1024 ("Restore WebRTC code to ESR 91") as completed. I addressed this issue in some of the most recent changes I made (between days 278 and 286). I've also closed Issue 1020 ("Fix ESR 91 rendering pipeline"). If you've been following along you'll know that this was one of the largest pieces of work I had to do, which I spent a total of 152 days on. First for the main browser rendering between days 51 and 83. Then for the WebView rendering between days 159 and 245. Then following on from this up to Day 277 for the WebGL. So I'm frankly pretty pleased to be able to finally close it.

To wrap up I've decided to start looking at Issue 1053, which is an epic test issue comprised of 22 separate things to check. It looks long, but at this stage I'm hoping I can go through and tick them all off pretty swiftly (we'll have to see whether this actually happens or not!).

The very first thing I try is private browsing mode as it looks like one of the easier things to check.

Immediately I hit a problem. You'll recall that yesterday I closed Issue 1051 ("Fix hang when calling window.setBrowserCover()") thinking it was fixed. Well, I was wrong, it is still there after all. So now I'll need to look in to this further.

The problem manifests itself when you switch between normal and private browsing. As I discovered first time around, the bug causes a hang rather than a crash. Which means I have to interrupt the execution in order to get a backtrace:
Thread 1 &quot;sailfish-browse&quot; received signal SIGINT, Interrupt.
0x0000007fef866718 in pthread_cond_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000007fef866718 in pthread_cond_wait () from /lib64/libpthread.so.0
#1  0x0000007fef979924 in QWaitCondition::wait(QMutex*, unsigned long) () from /
    usr/lib64/libQt5Core.so.5
#2  0x0000007ff0afde08 in ?? () from /usr/lib64/libQt5Quick.so.5
#3  0x0000007ff0b00aa8 in ?? () from /usr/lib64/libQt5Quick.so.5
#4  0x0000007ff0b01270 in ?? () from /usr/lib64/libQt5Quick.so.5
#5  0x0000007ff05503dc in QWindow::event(QEvent*) () from /usr/lib64/
    libQt5Gui.so.5
#6  0x0000007ff0b307b8 in QQuickWindow::event(QEvent*) () from /usr/lib64/
    libQt5Quick.so.5
#7  0x0000007fefb31144 in QCoreApplication::notify(QObject*, QEvent*) () from /
    usr/lib64/libQt5Core.so.5
#8  0x0000007fefb312e8 in QCoreApplication::notifyInternal2(QObject*, QEvent*) (
    ) from /usr/lib64/libQt5Core.so.5
#9  0x0000007ff0546488 in QGuiApplicationPrivate::processExposeEvent(
    QWindowSystemInterfacePrivate::ExposeEvent*) () from /usr/lib64/
    libQt5Gui.so.5
#10 0x0000007ff05470b4 in QGuiApplicationPrivate::processWindowSystemEvent(
    QWindowSystemInterfacePrivate::WindowSystemEvent*) ()
   from /usr/lib64/libQt5Gui.so.5
#11 0x0000007ff05256e4 in QWindowSystemInterface::sendWindowSystemEvents(
    QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQt5Gui.so.5
#12 0x0000007fe7882c4c in ?? () from /usr/lib64/libQt5WaylandClient.so.5
#13 0x0000007fef2cfd34 in g_main_context_dispatch () from /usr/lib64/
    libglib-2.0.so.0
#14 0x0000007fef2cffa0 in ?? () from /usr/lib64/libglib-2.0.so.0
#15 0x0000007fef2d0034 in g_main_context_iteration () from /usr/lib64/
    libglib-2.0.so.0
#16 0x0000007fefb83a90 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop:
    :ProcessEventsFlag>) () from /usr/lib64/libQt5Core.so.5
#17 0x0000007fefb2f608 in QEventLoop::exec(QFlags<QEventLoop::
    ProcessEventsFlag>) () from /usr/lib64/libQt5Core.so.5
#18 0x0000007fefb371d4 in QCoreApplication::exec() () from /usr/lib64/
    libQt5Core.so.5
#19 0x000000555557bf88 in main (argc=<optimized out>, argv=<optimized out>) at 
    main.cpp:203
(gdb) 
Getting a backtrace this was is always unsatisfactory because you don't know whether you stopped on the correct thread, or somewhere totally unrelated in one of the other threads that's whirring away at the same time. Gecko has tens of threads running simultaneously (last time this happened it was 70 in total), so that adds a pretty big dose of uncertainty.

I looked at this originally back on Day 122 and kept a record of all of the backtraces for all of the threads back then. They weren't, I have to say, super enlightening and I don't plan to repeat the process again.

The way I fixed it at the time was by amending BrowserPage.qml to remove the following line:
            window.setBrowserCover(webView.tabModel)
Here's that line in context:
    // Use Connections so that target updates when model changes.
    Connections {
        target: AccessPolicy.browserEnabled && webView && webView.tabModel || 
    null
        ignoreUnknownSignals: true
        // Animate overlay to top if needed.
        onCountChanged: {
            if (webView.tabModel.count === 0) {
                webView.handleModelChanges(false)
            }
            window.setBrowserCover(webView.tabModel)
        }
    }
Today I'm going to comment that line out again, just to see whether this is actually an error with private browsing or not. When I do this there are immediately a couple of further errors coming from the QML:
[W] unknown:130 - file:///usr/share/sailfish-browser/pages/components/
    Overlay.qml:130: Error: Insufficient arguments
Thankfully these QML errors are easier to fix. You may recall that some time back I was having trouble with DuckDuckGo rendering. The issue turned out to be related to the Sec-Fetch-* headers being set incorrectly. You can refer back to what I wrote on Day 140 for some of the background if you're interested. But the crucial point is that I had to add a fromExternal flag to various methods, so that it could be passed on for use with the Sec-Fetch-* headers.

It turns out I'd not added a parameter for this flag to all the places where it's called. To fix this I've added the fromExternal to all of the methods that need it.

Following this change the private browsing now seems to work as expected. I still need to fix the hang, but that'll be a task for tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
20 Jul 2024 : Day 294 #
No coding today. Almost every day the first thing I do is dive into the code: reading it, writing it, debugging it. After 294 days (with a few gaps, admittedly) I've grown accustomed to this way of working. Diving in to the code each morning has been comforting: a natural context switch between the harsh vagaries of the world and the reassuring consistency of the source.

(I'm overdoing it of course. Many times the code has been frustrating, incomprehensible and capricious. But I'm going to forget that today for the sake of a more convenient narrative).

The point I'm trying to get to is that, like a furry friend in a nineties Grolsch advert, the browser is not ready yet. But despite that, it feels like it's time for a step change. From focusing entirely on development to figuring out where things go from here and how the browser actually gets into users' hands.

So no coding today. Instead I'm going to be looking through the issues on GitHub, replying to comments and trying to figure out the best way to make packages available.

Things could fail disastrously at this point. If the browser goes out too early and doesn't have the functionality people are expecting, then all this effort will be wasted. On the other hand, I know the Sailfish community is driven by optimism and a sense of adventure: willing to try experimental software and also understanding that there are some sacrifices needed for the sake of user freedoms.

But the entire motivation for this work comes from an understanding that a phone needs a browser. It's critical infrastructure, not nice-to-have. So it really does have to work. In fact, ESR 91 has to work better than ESR 78 if the effort is going to have been worth it.

So let me talk briefly about what I've actually done today. Raine (rainemak) has gone to the trouble of testing out the code using the latest internal Jolla development build of Sailfish OS and the Sailfish SDK. When the next version of Sailfish OS is released, it'll be a version of this build and built using this SDK, so the browser has to work with it.

Raine had to make some changes to get it to work and I had a look through the changes. The changes Raine made to embedlite-components aligned with changes that I'd also had to make, so that's an easy case. But for the gecko-dev repository things are a bit more unexpected. To get things to run Raine has also had to make various changes to the privileged JavaScript.

After looking through Raine's PRs I need to incorporate his changes into mine, update my SDK to the latest Early Access release (4.6.0.1) and rebuild everything using that from scratch. Unfortunately I didn't get a chance to do that today, but I'll pick it up tomorrow and spend a bit more time on it.

Doing this non-coding work is important, but I find I'm not very efficient at it and it's harder to gauge progress compared to making commits. But I'm hoping it's also a sign of where things are at with the browser. It feels like this is the start of the descent towards our destination.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
19 Jul 2024 : Day 293 #
Yesterday I was able to complete a couple of commits to fix some issues. I really want to get on to doing some non-coding work, but before I do, there's one last issue that I really need to address. Back on Day 126 I was having trouble with the cover of the browser app. I created a bug to remind myself to come back to it. Despite that I still managed to forget about it until I suddenly realised this morning.

I do need to fix it though. The bug prevented the cover app from showing a preview of the current page and it'd be rubbish if we had to release the browser without this working properly. On the other hand I'm really hoping that the issue is related to the offscreen rendering and that all of the changes made to fix that will magically allow the cover rendering to work again.

As the issue on GitHub explains:
 
The call to window.setBrowserCover(webView.tabModel) in BrowserPage.qml causes the user interface to hang. To reproduce:
  1. Run the browser.
  2. Select the "Tab" icon on the far left hand side of the toolbar.
  3. Select the "Private" browser tab icon (Batman).
  4. Notice that the app hangs (but doesn't crash) as the page switches from persistent to private mode.

The change made back then to work around the issue was the following:
git diff 51a72ef86825dcf0deca5ab3adc493247768eaee 
    8a8c474abed7d95ac9b32cbfa6fe90275cf97631
diff --git a/apps/browser/qml/pages/BrowserPage.qml b/apps/browser/qml/pages/
    BrowserPage.qml
index 4373eeef..e0fb48c5 100644
--- a/apps/browser/qml/pages/BrowserPage.qml
+++ b/apps/browser/qml/pages/BrowserPage.qml
@@ -225,7 +225,7 @@ Page {
             if (webView.tabModel.count === 0) {
                 webView.handleModelChanges(false)
             }
-            window.setBrowserCover(webView.tabModel)
+            //window.setBrowserCover(webView.tabModel)
         }
     }
This is a great place to start from because the change is in QML code. This means reversing this change is not only just a matter of uncommenting a single line, but it can also be done entirely without needing to do any recompilation.
devel-su vim /usr/share/sailfish-browser/pages/BrowserPage.qml
[...]
sailfish-browser
[...]
Having made the change, the good news is that it does now work, immediately, without any further changes. This isn't entirely a shock. As I mentioned earlier, there was always a good chance that the existing changes would fix this. The cover rendering makes use of the offscreen render pipeline, which was broken before but is now fixed.

I also checked that WebGL still works when rendered on the cover as well. So with this done I just have to now commit my minimal changes and that should be enough to justify closing the issue on GitHub. For reference, here's the commit I'll be reversing:
$ git log -1 8a8c474abed7d95ac9b32cbfa6fe90275cf97631
commit 8a8c474abed7d95ac9b32cbfa6fe90275cf97631
Author: David Llewellyn-Jones <david.llewellyn-jones@jolla.com>
Date:   Fri Dec 29 11:10:03 2023 +0000

    [browser] Disable window.setBrowserCover
    
    The window.setBrowserCover() call causes the browser to hang when
    rendering with ESR 91. This needs to be fixed, but in the meantime
    disabling the call prevents the hang from occurring.
    
    This change prevents the browser page from being rendered on the app
    cover.
With this change committed my plan is to take a break from coding for a bit now. I'm going to continue writing these diary entries and working on the browser, but I've built up a backlog of administration that I really need to deal with. This means reading through (and responding to) comments on the issue tracker, getting some test packages built for others to use, performing some tests, that kind of thing.

I enjoy working on the code and use it as a way to recover from the real world. Maybe even to hide from it sometimes. It allows me to focus on something controllable while ignoring all of the messy uncontrollable stuff going on elsewhere. I don't think I'm unusual in this. I'm sure other people find their own ways to escape from reality periodically.

But with a project like this it's also important to occasionally step back, take stock, make course corrections and deal with the points of intersection with reality. And that's what I plan to do tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
18 Jul 2024 : Day 292 #
The build I started yesterday had completed by the time I got up this morning. I'm still working on avoiding a crash in the hyphenation code with the latest build by adding an early return to the find_hyphen_values() method of the Hyphenator class in mapped_hyph/src/lib.rs.

This time our test page loads and renders without crashing. This tells us two things. First that the partial build wasn't covering the Rust code, as I suspected. Second that the problem is happening in the Rust code itself. This doesn't preclude the possibility that the error is originating outside the Rust code, but it does at least give us something to focus on.

The header of the find_hyphen_values() method mentions a couple of situations in which a panic can be triggered. It's worth taking a look at these since a panic could be responsible for bringing down the browser. One obvious situation in which a panic will be triggered from the Rust code is as a result of the following line, the first line of the method:
        assert!(values.len() >= word.len());
That's not a debug assert, so it'll apply in the release build as well. I can't just remove it, since although that might stop the crash happening at that point in the code, it'll leave the browser in an unsafe state. The assert is there for a reason after all. So instead I've turned it into a condition with an early return to avoid the possibility of a panic:
$ git diff -U1
diff --git a/third_party/rust/mapped_hyph/src/lib.rs b/third_party/rust/
    mapped_hyph/src/lib.rs
index 848c93d25790..c205bb09359c 100644
--- a/third_party/rust/mapped_hyph/src/lib.rs
+++ b/third_party/rust/mapped_hyph/src/lib.rs
@@ -477,3 +477,6 @@ impl Hyphenator<'_> {
     pub fn find_hyphen_values(&self, word: &str, values: &mut [u8]) -> isize {
-        assert!(values.len() >= word.len());
+        if (values.len() < word.len()) {
+            return 0;
+        }
+        //assert!(values.len() >= word.len());
         values.iter_mut().for_each(|x| *x = 0);
However, as I read through the code and look again at the backtrace, this looks highly unlikely to be the problem. Although we don't know the exact line, it's the level() method that's at the top of the stack when the crash occurs. So actually we know the crash is happening there. Here's what the code there looks like:
    fn level(&self, i: usize) -> Level {
        let offset = u32::from_le_bytes(*array_ref!(self.0, FILE_HEADER_SIZE + 
    4 * i, 4)) as usize;
        let limit = if i == self.num_levels() - 1 {
            self.0.len()
        } else {
            u32::from_le_bytes(*array_ref!(self.0, FILE_HEADER_SIZE + 4 * i + 
    4, 4)) as usize
        };
        debug_assert!(offset + LEVEL_HEADER_SIZE <= limit && limit <= 
    self.0.len());
        debug_assert_eq!(offset & 3, 0);
        debug_assert_eq!(limit & 3, 0);
        Level::new(&self.0[offset .. limit])
    }
It's challenging to understand what might be going wrong here. It looks to me like this is where the requirement for self.0 to be valid (mentioned as the other reason for a potential panic) comes into play. I'm now wondering whether the problem might be an invalid hyphenator file being loaded in, or it being invalid for some other reason.

So I've made an additional change to check that the hyphenator is invalid when the call to hyphenate a string is made, like this:
$ git diff -U1
diff --git a/intl/hyphenation/glue/nsHyphenator.cpp b/intl/hyphenation/glue/
    nsHyphenator.cpp
index c3b377767e3f..974596b17115 100644
--- a/intl/hyphenation/glue/nsHyphenator.cpp
+++ b/intl/hyphenation/glue/nsHyphenator.cpp
@@ -348,2 +348,6 @@ nsresult nsHyphenator::Hyphenate(const nsAString& aString,
 
+  if (!IsValid()) {
+    return NS_ERROR_FAILURE;
+  }
+
   bool inWord = false;
I've set it rebuilding with these changes, which will take at least a few hours to complete. In the meantime, I noticed the following error when loading the test page:
JavaScript error: resource://gre/modules/amInstallTrigger.jsm, line 43: 
    TypeError: this.mm is null
That error seems to be coming from the gecko code itself, rather than some JavaScript loaded by the page, so this is also something it would be worth fixing. I'm sure it's unrelated to the crash, but I may as well look into it while the build completes.

Here's the method inside amInstallTrigger.jsm where this error is being generated:
function RemoteMediator(window) {
  this._windowID = window.windowGlobalChild.innerWindowId;

  this.mm = window.docShell.messageManager;
  this.mm.addWeakMessageListener(MSG_INSTALL_CALLBACK, this);

  this._lastCallbackID = 0;
  this._callbacks = new Map();
}
It looks likely that the call to get window.docShell.messageManager is returning null. Here's the code from nsDocShell which is supposed to return it:
NS_IMETHODIMP
nsDocShell::GetMessageManager(ContentFrameMessageManager** aMessageManager) {
  RefPtr<ContentFrameMessageManager> mm;
  if (RefPtr<BrowserChild> browserChild = BrowserChild::GetFrom(this)) {
    mm = browserChild->GetMessageManager();
  } else if (nsPIDOMWindowOuter* win = GetWindow()) {
    mm = win->GetMessageManager();
  }
  mm.forget(aMessageManager);
  return NS_OK;
}
So the error goes up, potentially, to EmbedLite code. It looks like this might be related to patch 0043, which patches this method and which hasn't get been applied. The patch has the title "Get ContentFrameMessageManager via nsIDocShellTreeOwner" with the following description:
 
nsDocShellTreeOwner has a reference to WebBrowserChrome which in turn can return ContentFrameMessageManager from BrowserChildHelper.

So I'm going to try applying the patch; maybe that will fix this error?
$ git am --3way ../rpm/
    0043-sailfishos-docshell-Get-ContentFrameMessageManager-v.patch
Applying: Get ContentFrameMessageManager via nsIDocShellTreeOwner. JB#55336 
    OMP#55336
Nice: the patch applies cleanly! Now the GetMessageManager() method looks like this:
NS_IMETHODIMP
nsDocShell::GetMessageManager(ContentFrameMessageManager** aMessageManager) {
  RefPtr<ContentFrameMessageManager> mm;

  nsCOMPtr<nsIBrowserChild> bc;
  if (mTreeOwner) {
    bc = do_GetInterface(mTreeOwner);
  }

  if (bc) {
    bc->GetMessageManager(getter_AddRefs(mm));
  } else if (RefPtr<BrowserChild> browserChild = BrowserChild::GetFrom(this)) {
    mm = browserChild->GetMessageManager();
  } else if (nsPIDOMWindowOuter* win = GetWindow()) {
    mm = win->GetMessageManager();
  }
  mm.forget(aMessageManager);
  return NS_OK;
}
As we can see, an extra clause has been added to the condition, to catch the case where the mTreeOwner variable has an nsIBrowserChild XPCOM interface. If it does then the message manager will be extracted from there.

The hyphenation build has gone through and I've copied the packages over to my phone. That means I can now set this change building instead. In the meantime let's return to the hyphenation crash. I've tried a few different arrangements of things now, rebuilding the code and transferring the result over to my phone. What I've discovered is that adding a condition to check the validity of the hyphenator made no difference to the crash; so I've removed that.

However, rather unexpectedly, after converting the assert that tests whether the values return buffer is at least as large as the word string into a condition with an early return, the site now no longer crashes the browser.

Half of me is surprised, the other half isn't. I'm not surprised because this was clearly signposted code designed to cause a panic. On the other hand, I'm surprised because this shouldn't be happening; and also increasing the output buffer size didn't appear to help. So there may be something deeper going on here, but for now I'm going to take the win.

I feel satisfied with these changes today. I've been able to fix a JavaScript error and a crash bug. While I can't claim any credit for finding a solution (Raine, who authored the patch, deserves all of that), it's nonetheless good to be making progress.

Tomorrow I'll need to pick another task from the issue tracker to take a look at. Although, in fact, I think I'll do some housekeeping and try to catch up on some of the work others have been doing. Now that I look at the issue tracker I see some exciting developments there that I really need to examine.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
17 Jul 2024 : Day 291 #
Yesterday we were looking at a crash in the hyphenator, triggered by visiting a page on amazon.de. It makes for an interesting bug because it's the first time — that I can recall at least — that we've been hit with an issue related to the Rust code, other than as part of the build pipeline. I'm not really certain what the problem is, but given the comments we saw yesterday, it could be either an overflow in the output buffer, or the block of memory represented by self.0 being invalid. According to the header, either of these could cause a panic.

It looks like debugging the Rust code may turn out to be tricky, so in the first instance I've gone about disabling one of the calls further down the call stack so that the crashing code is never called.

This is a precursor to trying to establish the real solution. At this stage I just want to find out whether this fixes the issue or not. My fake fix is to early return from the nsHyphenator::HyphenateWord() method like this:
   AutoTArray<uint8_t, 200> hyphenValues;
   hyphenValues.SetLength(utf8.Length());
 
+  return;
   int32_t result = mDict.match(
       [&](const void*& ptr) {
         return mapped_hyph_find_hyphen_values_raw(
             static_cast<const uint8_t*>(ptr), mDictSize, utf8.BeginReading(),
             utf8.Length(), hyphenValues.Elements(), hyphenValues.Length());
       },
You may notice that I've made the change to the C++ code rather than the Rust code. That's mostly for convenience because I don't actually know whether the partial build process will notice and rebuild changes in the Rust code. I'm sure I'll find out, but right now I'm just trying to find an answer as quickly as possible.

Other than the panic theory, there are other reasons why the code may be crashing. It could simply be that the stack is being exhausted. I must admit that I'm not convinced by this theory: the backtrace doesn't look corrupted and although the stack is very large (177 frames) it's not uncommon for the gecko stack to get quite large as the DOM is traversed. The change I've made now should at least help shed some light on this: if the crash persists, then the exhausted-stack theory becomes more persuasive.

The partial build completed with this small change and I've copied the library over to my phone. Now will it crash?

No; there's no crash. Presumably hyphenation is now no longer working correctly, but it's hard to tell that just from the way the page is laid out. Nevertheless, it looks like we've pinned down the location of the problem, even if it's not clear why it's happening. I thought I'd also have a go at increasing the size of the buffer for the return values. I've extended it by an additional 1024 bytes:
   AutoTArray<uint8_t, 200> hyphenValues;
-  hyphenValues.SetLength(utf8.Length());
+  hyphenValues.SetLength(utf8.Length() + 1024);
   int32_t result = mDict.match(
       [&](const void*& ptr) {
         return mapped_hyph_find_hyphen_values_raw(
             static_cast<const uint8_t*>(ptr), mDictSize, utf8.BeginReading(),
             utf8.Length(), hyphenValues.Elements(), hyphenValues.Length());
       },
But with this change the crash still occurs. It's possible I didn't choose a large enough amount to add to the buffer, but this seems unlikely to me.

As the next test I've amended the Rust code so that the find_hyphen_values() method in mapped_hyph/src/lib.rs returns early.
     pub fn find_hyphen_values(&self, word: &str, values: &mut [u8]) -> isize {
         assert!(values.len() >= word.len());
         values.iter_mut().for_each(|x| *x = 0);
         let top_level = self.level(0);
         let (lh_min, rh_min, clh_min, crh_min) = top_level.word_boundary_mins(
    );
         if word.len() < lh_min + rh_min {
             return 0;
         }
+        if word.len() >= lh_min + rh_min {
+            return 0;
+        }
This may look a little contrived: the two conditions together are equivalent to a guaranteed return. But I didn't want to just add a return here because the Rust compiler is quite thorough. If it had noticed the guaranteed early return it might have also complained about unused functionality later in the method.

I'm interested to know whether this will be enough to fix the crash. If it is, I can play around with the code a bit more to try to isolate the error. However I'm not even certain whether the partial build is actually rebuilding the Rust code or not. I think not, but this test should help determine this with more certainty.

Having run the partial build and copied over the binary, I find that the crash still happens on the page. That makes me think that the partial build wasn't enough to capture the changes to the Rust code. So my next step will be to do a full rebuild and have a go with the output from that once it's ready in the morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
16 Jul 2024 : Day 290 #
It was a huge relief to finally wrap up the WebRTC changes yesterday. When I started this process I had my suspicions that it would be the hardest part, hence I left it to the end. It turned out not to be as hard as fixing the rendering pipeline. Actually it turned out to be surprisingly straightforward, notwithstanding a few bumps along the way, and either way it's now done. So I'm happy.

During the process, back on Day 285, we did see that the colours for the WebRTC video feed looked wrong. After playing around for a bit this morning I've concluded that the issue is with the red and blue channels being swapped over. I'm not planning to spend the time fixing this just yet; it needs fixing for sure, but there are other more important tasks to complete. In the meantime, I've lodged an issue on the browser bug tracker so that it doesn't get forgotten about.
 
Three images of a face (my face, as it happens). The first image from ESR 91 has an obvious blue tint; next to this is a channel mixer showing red and blue channels swapped; then there's the ESR 91 image with channels swapped and the colours look good; finally there's the image taken using ESR 78 and the colours look good

I was happy to rapidly receive replies back from Frajo (krnlyng) and Björn (Thaodan). They suggest that the issue is happening in gecko-camera. If that's the case, then I'm satisfied that it doesn't have to be fixed within the gecko code.

That brings me on nicely to my next task, which is to work my way through the remaining bugs on this tracker to see what really needs addressing. I have three other important tasks as well, which in totality gives me the following on my to-do list:
  1. Work through the outstanding issues tagged with ESR 91 in the bug tracker.
  2. Respond to Simon Schmeisser and Raine's comments on various PRs.
  3. Package up versions of all the projects so that brave members of the community can test them out.
  4. Convert all of the commits on the FIREFOX_ESR_91_9_X_RELBRANCH_patches branch into patches.
The last of these is more than just a mechanical process. Not only do I need to try to align the patches with those in the existing ESR 78 build, I also want to try to rationalise the remaining patches to minimise overlaps between them.

So that's my four-point plan. And as has been the case throughout this process, there's no point delaying, so I'm going to get straight on to working through the issues on the bug tracker.

First up is bug #1055, an apparent crash when visiting amazon.de. When I try a few pages on the site I don't experience any issues, but there's an example page given in my earlier bug report that, according to what I've written, consistently causes the ESR 91 browser to crash. And when I try this I do indeed still get a crash. There's a backtrace in the bug report, but here's the latest one that I got today.

This is the most enormous backtrace with depth of 177 frames, so I'm not going to include all of them here. But I want to flag up it's surprising depth as it may turn out to be relevant.
Thread 10 &quot;GeckoWorkerThre&quot; received signal SIGSEGV, Segmentation 
    fault.
[Switching to LWP 31398]
0x0000007ff57c9eb4 in mapped_hyph::Hyphenator::level () from /usr/lib64/
    xulrunner-qt5-91.9.1/libxul.so
(gdb)
(gdb) bt
#0  0x0000007ff57c9eb4 in mapped_hyph::Hyphenator::level () from /usr/lib64/
    xulrunner-qt5-91.9.1/libxul.so
#1  0x0000007ff57ca0a0 in mapped_hyph::Hyphenator::find_hyphen_values () from /
    usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#2  0x0000007ff57caed8 in mapped_hyph_find_hyphen_values_raw () from /usr/lib64/
    xulrunner-qt5-91.9.1/libxul.so
#3  0x0000007ff1a2856c in nsHyphenator::<lambda(mozilla::UniquePtr<base::
    SharedMemory, mozilla::DefaultDelete<base::SharedMemory> >&)>::operator() (
    shm=..., __closure=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsTArray.h:1183
#4  mozilla::detail::VariantImplementation<unsigned char, 1, mozilla::
    UniquePtr<base::SharedMemory, mozilla::DefaultDelete<base::SharedMemory> >, 
    mozilla::UniquePtr<HyphDic const, mozilla::DefaultDelete<HyphDic const> > >:
    :matchN<mozilla::Variant<void const*, mozilla::UniquePtr<base::
    SharedMemory, mozilla::DefaultDelete<base::SharedMemory> >, mozilla::
    UniquePtr<const HyphDic, mozilla::DefaultDelete<const HyphDic> > >&, 
    nsHyphenator::HyphenateWord(const nsAString&, uint32_t, uint32_t, 
    nsTArray<bool>&)::<lambda(mozilla::UniquePtr<base::SharedMemory>&)>, 
    nsHyphenator::HyphenateWord(const nsAString&, uint32_t, uint32_t, 
    nsTArray<bool>&)::<lambda(mozilla::UniquePtr<const HyphDic, mozilla::
    DefaultDelete<const HyphDic> >&)> > (aMi=..., aV=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Variant.h:309
#5  mozilla::detail::VariantImplementation<unsigned char, 0, void const*, 
    mozilla::UniquePtr<base::SharedMemory, mozilla::DefaultDelete<base::
    SharedMemory> >, mozilla::UniquePtr<HyphDic const, mozilla::
    DefaultDelete<HyphDic const> > >::matchN<mozilla::Variant<void const*, 
    mozilla::UniquePtr<base::SharedMemory, mozilla::DefaultDelete<base::
    SharedMemory> >, mozilla::UniquePtr<const HyphDic, mozilla::
    DefaultDelete<const HyphDic> > >&, nsHyphenator::HyphenateWord(const 
    nsAString&, uint32_t, uint32_t, nsTArray<bool>&)::<lambda(void const*&)>, 
    nsHyphenator::HyphenateWord(const nsAString&, uint32_t, uint32_t, 
    nsTArray<bool>&)::<lambda(mozilla::UniquePtr<base::SharedMemory, mozilla::
    DefaultDelete<base::SharedMemory> >&)>, nsHyphenator::HyphenateWord(const 
    nsAString&, uint32_t, uint32_t, nsTArray<bool>&)::<lambda(mozilla::
    UniquePtr<const HyphDic, mozilla::DefaultDelete<const HyphDic> >&)> > (
    aMi=..., aV=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Variant.h:318
#6  mozilla::Variant<void const*, mozilla::UniquePtr<base::SharedMemory, 
    mozilla::DefaultDelete<base::SharedMemory> >, mozilla::UniquePtr<HyphDic 
    const, mozilla::DefaultDelete<HyphDic const> > >::matchN<mozilla::
    Variant<void const*, mozilla::UniquePtr<base::SharedMemory, mozilla::
    DefaultDelete<base::SharedMemory> >, mozilla::UniquePtr<const HyphDic, 
    mozilla::DefaultDelete<const HyphDic> > >&, nsHyphenator::HyphenateWord(
    const nsAString&, uint32_t, uint32_t, nsTArray<bool>&)::<lambda(void 
    const*&)>, nsHyphenator::HyphenateWord(const nsAString&, uint32_t, 
    uint32_t, nsTArray<bool>&)::<lambda(mozilla::UniquePtr<base::
    SharedMemory>&)>, nsHyphenator::HyphenateWord(const nsAString&, uint32_t, 
    uint32_t, nsTArray<bool>&)::<lambda(mozilla::UniquePtr<const HyphDic, 
    mozilla::DefaultDelete<const HyphDic> >&)> > (aM1=..., aM0=..., 
    aVariant=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Variant.h:902
#7  mozilla::Variant<void const*, mozilla::UniquePtr<base::SharedMemory, 
    mozilla::DefaultDelete<base::SharedMemory> >, mozilla::UniquePtr<HyphDic 
    const, mozilla::DefaultDelete<HyphDic const> > >::match<nsHyphenator::
    HyphenateWord(const nsAString&, uint32_t, uint32_t, nsTArray<bool>&)::
    <lambda(void const*&)>, nsHyphenator::HyphenateWord(const nsAString&, 
    uint32_t, uint32_t, nsTArray<bool>&)::<lambda(mozilla::UniquePtr<base::
    SharedMemory>&)>, nsHyphenator::HyphenateWord(const nsAString&, uint32_t, 
    uint32_t, nsTArray<bool>&)::<lambda(mozilla::UniquePtr<const HyphDic, 
    mozilla::DefaultDelete<const HyphDic> >&)> > (
    aM1=..., aM0=..., this=0x7e512e3188)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Variant.h:857
#8  nsHyphenator::HyphenateWord (this=this@entry=0x7e512e3180, aString=..., 
    aStart=aStart@entry=0, aLimit=aLimit@entry=7, aHyphens=...)
    at intl/hyphenation/glue/nsHyphenator.cpp:444
#9  0x0000007ff1a2884c in nsHyphenator::Hyphenate (
    this=this@entry=0x7e512e3180, aString=..., aHyphens=...)
    at intl/hyphenation/glue/nsHyphenator.cpp:378
#10 0x0000007ff28e3890 in nsLineBreaker::FindHyphenationPoints (
    this=this@entry=0x7fde7c6e90, aHyphenator=aHyphenator@entry=0x7e512e3180,
    aTextStart=<optimized out>, aTextLimit=<optimized out>, 
    aBreakState=0x7fde7c4a08 &quot;&quot;)
    at dom/base/nsLineBreaker.cpp:322
[...]
#177 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb)
As I mentioned, this has 177 frames, which makes me think it could be a stack overflow that's the problem. It's not clear to me whether this is therefore a bug (in the sense of an error in the code) or just a mismatch between the memory available and the complexity of this particular page. The error is happening during an attempt to hyphenate a word, which doesn't seem like an especially atypical or extraordinary activity.

I'm not planning to fix all of the remaining issues in the bug tracker, but having a crashing browser is no good so I do feel I need to get to the bottom of this.

The crash is happening inside some Rust code, which is interesting since, as far as I'm aware, this is the first time there's been a problem with the Rust code outside of the build process. We're not getting full debugging information as a result, but it looks like the error is happening with one of the calls to level() in mapped_hyph/src/lib.rs here:
    /// Identify acceptable hyphenation positions in the given `word`.
    ///
    /// The caller-supplied `values` must be at least as long as the `word`.
    ///
    /// On return, any elements with an odd value indicate positions in the word
    /// after which a hyphen could be inserted.
    ///
    /// Returns the number of possible hyphenation positions that were found.
    ///
    /// # Panics
    /// If the given `values` slice is too small to hold the results.
    ///
    /// If the block of memory represented by `self.0` is not in fact a valid
    /// hyphenation dictionary, this function may panic with an overflow or
    /// array bounds violation.
    pub fn find_hyphen_values(&self, word: &str, values: &mut [u8]) -> isize {
        assert!(values.len() >= word.len());
        values.iter_mut().for_each(|x| *x = 0);
        let top_level = self.level(0);
        let (lh_min, rh_min, clh_min, crh_min) = top_level.word_boundary_mins();
        if word.len() < lh_min + rh_min {
            return 0;
        }
        let mut hyph_count = top_level.find_hyphen_values(word, values, lh_min, 
    rh_min);
        let compound = hyph_count > 0;
        // Subsequent levels are applied to fragments between potential breaks
        // already found:
        for l in 1 .. self.num_levels() {
            let level = self.level(l);
            if hyph_count > 0 {
[...]
I've included the comment here in case the part about it potentially panicking in two different situations turns out to be important. Could one of these panics be the cause of the crash?

The first panic about the values slide being too small looks like the easiest one to follow back through the code. The length of this buffer seems to be determined by the length of the hyphenValues array created in the nsHyphenator::HyphenateWord() method in the nsHyphenator.cpp file:
  AutoTArray<uint8_t, 200> hyphenValues;
  hyphenValues.SetLength(utf8.Length());
  int32_t result = mDict.match(
      [&](const void*& ptr) {
        return mapped_hyph_find_hyphen_values_raw(
            static_cast<const uint8_t*>(ptr), mDictSize, utf8.BeginReading(),
            utf8.Length(), hyphenValues.Elements(), hyphenValues.Length());
      },
      [&](UniquePtr<base::SharedMemory>& shm) {
        return mapped_hyph_find_hyphen_values_raw(
            static_cast<const uint8_t*>(shm->memory()), mDictSize,
            utf8.BeginReading(), utf8.Length(), hyphenValues.Elements(),
            hyphenValues.Length());
      },
      [&](UniquePtr<const HyphDic>& hyph) {
        return mapped_hyph_find_hyphen_values_dic(
            hyph.get(), utf8.BeginReading(), utf8.Length(),
            hyphenValues.Elements(), hyphenValues.Length());
      });
This is then getting passed on to this method in mapped_hyph/src/ffi.rs:
pub unsafe extern &quot;C&quot; fn mapped_hyph_find_hyphen_values_raw(dic_buf: 
    *const u8, dic_len: u32,
                                                            word: *const 
    c_char, word_len: u32,
                                                            hyphens: *mut u8, 
    hyphens_len: u32) -> i32 {
    if word_len > hyphens_len {
        return -1;
    }
    let (word_str, hyphen_buf) = params_from_c(word, word_len, hyphens, 
    hyphens_len);
    if word_str.is_err() {
        return -1;
    }
    Hyphenator::new(slice::from_raw_parts(dic_buf, dic_len as usize))
        .find_hyphen_values(word_str.unwrap(), hyphen_buf) as i32
}
I'm going to need a bit more time and mental energy to figure out what's happening here, so I'm going to have to leave the rest of the analysis until the morning. I'm thinking that I could do a couple of things: increase the size of the return buffer; add a check to the Rust code; remove the call to the Rust code entirely. The aim wouldn't be to fix the problem, but rather to explore whether we're in the right place or not.

I'll sleep on it and see how these ideas feel in the morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
15 Jul 2024 : Day 289 #
We finished the diary entry day yesterday with a mystery. The C++ code in AudioChannelServices.cpp that sends audio-playback messages is being called on both ESR 78 and ESR 91. On ESR 91 the JavaScript code in AudioPlaybackChild.jsm that receives these messages, converts them into AudioPlayback messages and forwards them on to AudioPlaybackParent.jsm is also working correctly. But on ESR 78 the AudioPlaybackChild never receives the messages.

To add to the mystery, Mark Washeim (poetaster) flagged up the fact that he's experienced the same error when developing apps that use a WebView. That suggests there are at least some occasions when the AudioPlayback class is successfully receiving the messages.

Why are the messages being received in ESR 91 but only on ESR 78 sporadically?

Today I plan to delve into this, but be warned, it's a bit of a lengthy one today. I'm hoping the payoff will be worth it.

Considering the reasons for the peculiar differences I'm seeing between ESR 78 and ESR 91, one possibility is that the actor isn't being properly instantiated on ESR 78. To help determine if this is the case I've added the following code to the AudioPlaybackChild class on both ESR 78 and ESR 91. I've added similar code (but with the output changed accordingly) to the AudioPlaybackParent class, in addition to the debug output that I already added yesterday.
  constructor() {
    super();
    dump(&quot;AudioPlaybackChild: constructor()\n&quot;);
  }
Here's what I get when running this on ESR 91. As you can see I've not removed the other debug output lines, so we can watch both classes be instantiated before the messages are passed to their respective observers.
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;

AudioPlaybackChild: constructor()
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: active
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:19:22

AudioPlaybackParent: constructor()
AudioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Start
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:20:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 27: 
    TypeError: browser is null
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: inactive-pause
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:19:22
On ESR 78 we get none of this output, despite the same debug print lines being present in the code.

I'm beginning to feel really frustrated by this now. There must be some reason why the actor is being initialised on ESR 91 but not on ESR 78. There are two possibilities that I need to pursue to try to get to the bottom of this. The first is to try to find out if it's just this specific actor that's being skipped, or second is whether the ActorManager — the class that performs the initialisation for all of the actors — isn't starting at all.

To help with this I've placed some debug output in the loop that starts all of the actors. I notice that there's some conditionality here: actors can be disabled using a preference if it's listed as such in the actor configuration. This configuration is listed in the ActorManagerParent.jsm source file itself. Here's the initialisation loop taken from this file:
    for (let [actorName, actor] of Object.entries(actors)) {
      dump(&quot;ActorManagerParent: actor: &quot; + actorName + &quot;, 
    enabledPreference: &quot; + actor.emablePreference + &quot;\n&quot;);
      // If enablePreference is set, only register the actor while the
      // preference is set to true.
      if (actor.enablePreference) {
        let actorNameProp = actorName + &quot;_Preference&quot;;
        XPCOMUtils.defineLazyPreferenceGetter(
          this,
          actorNameProp,
          actor.enablePreference,
          false,
          (prefName, prevValue, isEnabled) => {
            if (isEnabled) {
              dump(&quot;ActorManagerParent: pref registering &quot; + 
    actorName + &quot;\n&quot;);
              register(actorName, actor);
            } else {
              dump(&quot;ActorManagerParent: pref unregistering &quot; + 
    actorName + &quot;\n&quot;);
              unregister(actorName, actor);
            }
            if (actor.onPreferenceChanged) {
              actor.onPreferenceChanged(prefName, prevValue, isEnabled);
            }
          }
        );
        if (!this[actorNameProp]) {
          continue;
        }
      }
   
      dump(&quot;ActorManagerParent: registering &quot; + actorName + 
    &quot;\n&quot;); 
      register(actorName, actor);
    }
As we can see, the loop goes through each of the Actors and checks whether they have an enablePreference entry; if they do it searches for the appropriate preference which, in our case, would be AudioPlayback_Preference. If this is set to true or if there's no preference enabled then the Actor is registered; otherwise if it's set to false the actor is unregistered. The calls to dump() in this code were added by me just now so that we can trace progress through the method (sadly gdb won't work on JavaScript code).

The entry for the AudioPlayback actor, as listed higher up in the same file, looks like this:
  AudioPlayback: {
    parent: {
      moduleURI: &quot;resource://gre/actors/AudioPlaybackParent.jsm&quot;,
    },

    child: {
      moduleURI: &quot;resource://gre/actors/AudioPlaybackChild.jsm&quot;,
      observers: [&quot;audio-playback&quot;],
    },

    allFrames: true,
  },
As we can see from this, the actorName is AudioPlayback and the actor is everything else. But we can also see there's noemabledPreference entry here. As such what I'd expect to see is the preference code being entirely skipped and for the actor to be directly registered through the call at the end of the loop.

When I execute this on ESR 91, here's the output I get. It turns out there are very many actors (36 in total) so there's quite a lot of output. I've cut out the majority of them for brevity, but they all basically look the same anyway so you're not missing much.
Created LOG for EmbedLite
ActorManagerParent: actor: AsyncPrefs, enabledPreference: undefined
ActorManagerParent: registering AsyncPrefs
[...]
ActorManagerParent: actor: AudioPlayback, enabledPreference: undefined
ActorManagerParent: registering AudioPlayback
ActorManagerParent: actor: AutoComplete, enabledPreference: undefined
ActorManagerParent: registering AutoComplete
ActorManagerParent: actor: Autoplay, enabledPreference: undefined
ActorManagerParent: registering Autoplay
[...]
ActorManagerParent: actor: UnselectedTabHover, enabledPreference: undefined
ActorManagerParent: registering UnselectedTabHover
Created LOG for EmbedPrefs
What we see here is what we expect to see: the enablePreference is undefined and so the Actor gets registered. That's on ESR 91. When I add exactly the same debug annotations to the ESR 78 code I get... nothing. I've also added some code to the initialisation steps of the ActorManager and they don't get executed either. So it's looking clearly like the ActorManager isn't being executed at all.

So I need to find where the ActorManager initialisation is triggered. Searching through the code I can see a few potential places, for example in the tab-content.js file and in the browser_startup.js file. Both of these reference the ActorManagerParent. But neither are included in our omni.ja file so presumably they can't be the culprits.

At this point I've placed debug output all over the place and I don't seem to be getting any closer to finding out the reason. This is deeply frustrating.

But suddenly I realise that the relevant code might not be in the gecko repository at all: it could be in the embedlite-components repository. And when I check there, it looks plausible that the ActorManagerChild is getting initialised in the embedhelper.js script which is part of embedlite-components. Could this be what I've been searching for?

I've added some debug output to this file as well, as you can see in the dump() calls in this extract from the file:
function canInitializeActorManagerChild() {
  if (actorManagerChildAttached)
    return false;
  
  const { sharedData } = Services.cpmm;
  dump(&quot;canInitializeActorManagerChild: sharedData.get(
    CHILD_SINGLETON_ACTORS) = &quot; + sharedData.get(CHILD_SINGLETON_ACTORS) + 
    &quot;\n&quot;);
  return sharedData.get(CHILD_SINGLETON_ACTORS);
}

function initializeActorManagerChild() {
  dump(&quot;initializeActorManagerChild\n&quot;);
  try {
    if (canInitializeActorManagerChild()) {
      dump(&quot;canInitializeActorManagerChild: true\n&quot;);
      const { ActorManagerChild } = ChromeUtils.import(&quot;resource://gre/
    modules/ActorManagerChild.jsm&quot;);
      ActorManagerChild.attach(this);
      actorManagerChildAttached = true;
    }
  } catch (e) {}
} 
  
initializeActorManagerChild();
The code here calls initializeActorManagerChild(), which then checks whether initialisation is possible and if it is, initialises it. The two situations in which the initialisation may not be possible are if either it's already been initialised, or the value returned by sharedData.get(CHILD_SINGLETON_ACTORS) resolves to false. Here's the output I get when I run this on ESR 78:
Created LOG for EmbedLiteLayerManager
initializeActorManagerChild
canInitializeActorManagerChild: sharedData.get(CHILD_SINGLETON_ACTORS) = null
This initially looks encouraging: the value returned from the CHILD_SINGLETON_ACTORS check is null, which is interpreted as false in the condition and so the initialisation is skipped. Could this be the reason the ActorManager isn't being initialised?

Sadly my hopes are dashed when I discover that ESR 91 generates exactly the same output. There as well I get a return value of null with no initialisation.

I thought this would be it... I was wrong and I'm no further forwards than before. This all feels incredibly frustrating.

But having wandered into the embedlite-components code, I now notice there's one more place which could be triggering the initialisation and that's in the EmbedLiteGlobalHelper.js file. When I open this up I discover something unexpected. Compare the code in the file from ESR 78 with that from ESR 91. Here first is code taken from my device running ESR 78:
//ChromeUtils.defineModuleGetter(
//  this,
//  &quot;ActorManagerParent&quot;,
//  &quot;resource://gre/modules/ActorManagerParent.jsm&quot;
//);  

Services.scriptloader.loadSubScript(&quot;chrome://embedlite/content/
    Logger.js&quot;);

// Common helper service
  
function EmbedLiteGlobalHelper()
{ 
  //ActorManagerParent.flush();
[...]
And now here's the same code taken from my device running ESR 91:
ChromeUtils.defineModuleGetter(
  this,
  &quot;ActorManagerParent&quot;,
  &quot;resource://gre/modules/ActorManagerParent.jsm&quot;
);

Services.scriptloader.loadSubScript(&quot;chrome://embedlite/content/
    Logger.js&quot;);

// Common helper service

function EmbedLiteGlobalHelper()
{
  // Touch ActorManagerParent so that it gets initialised
  var actor = ActorManagerParent;
[...]
In the first block of code from ESR 78 the lines for getting the ActorManagerParent are completely commented out. That doesn't look right; they're not commented out on my ESR 91 device.

There's another difference too, which is less relevant for the issue at hand, but which explains how this came about. Slightly further down we can see that the call to ActorMangerParent.flush() (which is also commented out) has been switched for a line that just creates an instance of the ActorManagerParent.

I think I've finally found the culprit I've been looking for. The nice thing is that this explains fully what's going on. To understand we need to look at the history of the commits related to these changes:
$ git log -1
commit 377ed10e9cffece90458b1d8152a579b438b5dc0 (HEAD -> sailfishos-esr91, 
    origin/sailfishos-esr91)
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Mon Feb 12 21:33:19 2024 +0000

    [embedlite-components] Ensure ActorManagerParent is initialised
    
    In diff 0bf2601425ec1d8d63925 a call to ActorManagerParent.flush() was
    removed because the function was removed from upstream. However this
    caused ActorManagerParent not to be initialised, preventing the Window
    Actors from being registered.
    
    This change adds code to touch ActorManagerParent to ensure it gets
    reinitialised again.
$ git log -1 0bf2601425ec1d8d63925
commit 0bf2601425ec1d8d639255d6a7c32231e7e38eae
Author: David Llewellyn-Jones <david.llewellyn-jones@jolla.com>
Date:   Thu Nov 23 21:55:13 2023 +0000

    Remove EmbedLiteGlobalHelper.js errors
    
    Makes three changes to address errors that were coming from
    EmbedLiteGlobalHelper.js:
    
    1. Use ComponentUtils.generateNSGetFactory() instead of
       XPCOMUtils.generateNSGetFactory().
    
    2. Remove call to ActorManagerParent.flush(). See Bug 1649843.
    
    3. Use L10nRegistry.registerSources() instead of
       L10nRegistry.registerSource(). See Bug 1648631.
    
    See the following related upstream changes:
    
    https://phabricator.services.mozilla.com/D95206
    
    https://phabricator.services.mozilla.com/D81243
What does this all tell us? Well, back on Day 88 I removed the call to ActorManager.flush() following a change made upstream. In the process of doing this I broke the initialisation of the ActorManager and so went about restoring it on Day 256 by adding in the new line that creates an instance of the class.

As part of testing the original change I must have commented out the initialisation code on my ESR 78 device. Not in the project code on my laptop, but just on my phone. And so we ended up where we are now. This explains not only why it's not working on my ESR 78 device and why a re-installation of the gecko package didn't help (it's a change made in the embedlite-components package!) but also why the error is appearing on Mark's device and not mine: Mark never hacked around with the code on his device to break it.

It's a little frustrating that this is all down to changes I've made, things I've recorded in this diary but which I'd completely forgotten about. I've been led on a bit of a wild goose chase by myself.

Nevertheless I've reversed the changes on my device running ESR 78. Now when I grant permissions to the microphone I get the same error as I get on ESR 91.
JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 16: 
    TypeError: browser is null
Just as important to know, even with the error now showing, the audio continues to work as it did before.

This is all just as Mark described it in his message on Mastodon.

Even though it's presumably pretty harmless, I still feel compelled to fix it so there's no error message on ESR 91. Not only is it ugly, but it's also just lost me three days of development time. It would be good to avoid that happening again in the future.

To try to understand the issue with the error I want to understand what the embedding element actually is. To that end I've been reading through various pieces of documentation, including the Firefox Source docs on the topic and a Gecko Embedding Basics document. The latter turns out to be a bit of a red herring, since the embedder element isn't specifically related to embedding Gecko in other applications. But it was a useful read anyway,

The conclusion I'm left with is that in practice an attempt is being made to pass the message on to the embedding element, but this is never actually used. It could potentially be used, for example by an extension, but on Sailfish OS it's not bringing any benefit.

So the appropriate thing to do is to prevent the AutioPlaybackParent attempting to call methods on a null browser variable. There are multiple ways we might go about achieving this. One possibility would be to try to override the AudioPlaybackParent.jsm file so as to replace it with our own version. Sadly, after testing a few things out, I've not been able to get this to work. So I've given up and have decided to go for the simplest possible option, which is just to patch the file with a check for the null value:
$ git diff -U1
diff --git a/toolkit/actors/AudioPlaybackParent.jsm b/toolkit/actors/
    AudioPlaybackParent.jsm
index 5c54058528c5..7cfc5134e886 100644
--- a/toolkit/actors/AudioPlaybackParent.jsm
+++ b/toolkit/actors/AudioPlaybackParent.jsm
@@ -16,2 +16,5 @@ class AudioPlaybackParent extends JSWindowActorParent {
     const browser = this.browsingContext.top.embedderElement;
+    if (!browser) {
+      return;
+    }
     switch (aMessage.name) {
The good news is that this fixes the error, which now no longer appears in the console output. What's more there was a completely separate reason, unrelated to gecko, for the sound not working on my device. Having addressed that I now have a build that has working microphone and no error showing.

Nearly no error. There's still the following appearing:
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
Once again, despite the fact that the audio is now working, I still feel the need to track down the reason behind this error. It's possible this is nothing to do with gecko... but I'm not convinced and I'd like to be certain.

From the error it looks like the trigger is from the loading of libmedia.so rather than an attempt to load libandroidicu.so directly. In amongst the Gecko code there are two methods that attempt to dynamically load libmedia.so. Here's one of them:
media_lib *
cubeb_load_media_library()
{
  media_lib ml = {0};
  ml.libmedia = dlopen(&quot;libmedia.so&quot;, RTLD_LAZY);
  if (!ml.libmedia) {
    return NULL;
  }
[...]
The other one is in a method called audiotrack_init. Both of these relate to cubeb which appears to be a cross platform media library. However, when I fire up the debugger and try to put a breakpoint on either of these methods, nothing sticks. It looks like they're not part of xulrunner. So, with nothing else to try, I've just stuck a breakpoint on dlopen() on the off-chance something useful comes from it.

It gets hit many, many times. But I'm interested in a particular time, which is just after the website has requested to use the microphone, and just before I've granted access as a user. And there is a hit which happens right then. And the backtrace looks promising.
Thread 70 &quot;VideoCapture&quot; hit Breakpoint 3, 0x0000007ff0df02ac in 
    dlopen () from /lib64/libdl.so.2
(gdb) bt
#0  0x0000007ff0df02ac in dlopen () from /lib64/libdl.so.2
#1  0x0000007f10057c30 in ?? () from /usr/lib64/gecko-camera/plugins/
    libgeckocamera-droid.so
#2  0x0000007f1005aa4c in droid_media_init () from /usr/lib64/gecko-camera/
    plugins/libgeckocamera-droid.so
#3  0x0000007f1004c2d4 in DroidCameraManager::init() () from /usr/lib64/
    gecko-camera/plugins/libgeckocamera-droid.so
#4  0x0000007fee922844 in gecko::camera::RootCameraManager::init() () from /usr/
    lib64/libgeckocamera.so.0
#5  0x0000007fee922a2c in gecko_camera_manager () from /usr/lib64/
    libgeckocamera.so.0
#6  0x0000007ff48d65a4 in webrtc::videocapturemodule::DeviceInfoSFOS::Init (
    this=this@entry=0x7e60002fc0)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:40
#7  0x0000007ff48dad68 in webrtc::videocapturemodule::DeviceInfoSFOS::
    DeviceInfoSFOS (this=0x7e60002fc0)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:35
#8  0x0000007ff48dad90 in webrtc::videocapturemodule::VideoCaptureImpl::
    CreateDeviceInfo ()
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#9  0x0000007ff4867a68 in webrtc::VideoCaptureFactory::CreateDeviceInfo ()
    at third_party/libwebrtc/webrtc/modules/video_capture/
    video_capture_factory.cc:28
#10 0x0000007ff37de984 in mozilla::camera::VideoEngine::
    GetOrCreateVideoCaptureDeviceInfo (this=0x7e60002f00)
    at dom/media/systemservices/VideoEngine.cpp:180
[...]
#24 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
As you can see, the error is generated straight after I continue running the code. It's hard to judge whether this is really happening straight after or not (my phone runs pretty fast), but it looks promising. Let's dig deeper.

So here's the code that the backtrace is pointing to. This is the last piece of code before the backtrace leaves the gecko library.
int32_t DeviceInfoSFOS::Init()
{
    cameraManager = gecko_camera_manager();
    // Initialize parent class member
    _lastUsedDeviceName = (char *)malloc(1);
    _lastUsedDeviceName[0] = 0;
    return 0;
}
The key line is the one that calls gecko_camera_manager() that we can also see in the backtrace. This method isn't part of gecko but is instead part of the libgeckocamera.so library, which is part of the Sailfish stack but which lives in a separate repository.

I want to know whether it's precisely this line that's generating the error message. So I've placed a breakpoint on the DeviceInfoSFOS::Init() method so that I can step through it a line at a time. If this is indeed the right place we should see the error message generated as we step through.

And so we do. Here's the output, with the error generated exactly as we step over the call to gecko_camera_manager():
Thread 69 &quot;VideoCapture&quot; hit Breakpoint 4, webrtc::videocapturemodule:
    :DeviceInfoSFOS::Init (this=this@entry=0x7e68002fa0)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:39
39      {
(gdb) n
40          cameraManager = gecko_camera_manager();
(gdb) n
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
42          _lastUsedDeviceName = (char *)malloc(1);
(gdb) n
43          _lastUsedDeviceName[0] = 0;
(gdb) 
If, rather than stepping over the call, we instead jump over it, so that it doesn't get executed at all, the error never shows:
Thread 71 &quot;VideoCapture&quot; hit Breakpoint 4, webrtc::videocapturemodule:
    :DeviceInfoSFOS::Init (this=this@entry=0x7e9c004210)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:39
39      {
(gdb) n
40          cameraManager = gecko_camera_manager();
(gdb) tbreak +1
Temporary breakpoint 5 at 0x7ff48d65a8: file third_party/libwebrtc/webrtc/
    modules/video_capture/sfos/device_info_sfos.cc, line 42.
(gdb) jump +1
Continuing at 0x7ff48d65a8.

Thread 71 &quot;VideoCapture&quot; hit Temporary breakpoint 5, webrtc::
    videocapturemodule::DeviceInfoSFOS::Init (this=this@entry=0x7e9c004210)
    at third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc:42
42          _lastUsedDeviceName = (char *)malloc(1);
(gdb) n
43          _lastUsedDeviceName[0] = 0;
(gdb) c
Continuing.
Interestingly, having jumped over it, this doesn't seem to cause any issues for the microphone or audio. That rather makes sense: the name of the method suggests it's camera-related rather than audio-related. And having jumped over it, attempting to then access the camera does cause a crash:
Thread 71 &quot;VideoCapture&quot; received signal SIGSEGV, Segmentation fault.
0x0000007ff48dc5c4 in webrtc::videocapturemodule::DeviceInfoSFOS::
    NumberOfDevices (this=0x7e9c004210)
    at include/c++/8.3.0/ext/new_allocator.h:86
86      include/c++/8.3.0/ext/new_allocator.h: No such file or directory.
This isn't a crash that could ever happen in real use: it's only happened here because I forcefully skipped a line of code using the debugger that should have been executed.

My conclusion from all this is that the libandroidicu.so error falls outside the scope of the gecko upgrade. It doesn't seem to be doing any harm in practice and there isn't an obvious way to prevent it without wading into the gecko-camera library. I've reached the end of where I can reasonably take this to.

On testing with the LiveKit WebRTC Browser Test I now get the same results using ESR 91 as I do with ESR 78.
 
Screenshots showing the LiveKit WebRTC Browser Test results. On the left the ESR 78 results, on the right the ESR 91 results. There are both green (success) and red (failure) lines on both, but they both tally with one another

One interesting thing to notice is that the ESR 78 version is showing three rear cameras, whereas the ESR 91 output shows only one. But that's to be expected as the former is running on an Xperia 10 II on which all three cameras are supported, whereas the latter is on an Xperia 10 III which supports only one. So the results tally nicely with what we'd expect.

So today was rather a deep dive into WebRTC errors. Practically speaking I'm not sure I accomplished a great deal, but finding the answers to all of these open questions was cathartic to say the least. I'm glad it was possible to get to the bottom of them all. Where does this leave things? The WebRTC changes seem to be looking pretty good now, so I've pushed all of my commits related to them to the remote repository.

And that means things are looking pretty decent more generally as well. Tomorrow I'm going to do some more general testing of the browser. But I genuinely think we may be reaching a conclusion here.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
14 Jul 2024 : Day 288 #
I'm still continuing today to try to find out what's causing an error coming from the AudioPlayback window actor that's triggered by granting a Web page permission to use the microphone.

To get to the bottom of this, I think it'll be helpful to understand a bit better about how it's all structured. As a window actor, there isn't any code that explicitly constructs an instance of the AudioPlaybackParent JavaScript class where the error is coming from. Instead the Parent and Child pair that makes up the AudioPlayback actor is listed in the ActorManagerParent.jsm file, like this:
  AudioPlayback: {
    parent: {
      moduleURI: &quot;resource://gre/actors/AudioPlaybackParent.jsm&quot;,
    },

    child: {
      moduleURI: &quot;resource://gre/actors/AudioPlaybackChild.jsm&quot;,
      observers: [&quot;audio-playback&quot;],
    },

    allFrames: true,
  },
This is then used to instantiate the two classes and hook them together and register the child class as a listener for the messages given in the observers list. In this case, the actor is only interested in audio-playback messages. The messages get picked up by the child, which then sends further messages to the parent to act on them.

The child class does some conversion of the messages. Here's the code from which audio-playback messages are received and AudioPlayback messages are sent out.
class AudioPlaybackChild extends JSWindowActorChild {
  observe(subject, topic, data) {
    dump(&quot;AudioPlayback: observe()\n&quot;);
    dump(&quot;AudioPlayback: topic: &quot; + topic + &quot;\n&quot;);
    dump(&quot;AudioPlayback: data: &quot; + data + &quot;\n&quot;);

    var stackTrace = Error().stack;
    dump(&quot;Stacktrace: &quot; + stackTrace + &quot;\n&quot;);

    if (topic === &quot;audio-playback&quot;) {
      let name = &quot;AudioPlayback:&quot;;
      if (data === &quot;activeMediaBlockStart&quot;) {
        name += &quot;ActiveMediaBlockStart&quot;;
      } else if (data === &quot;activeMediaBlockStop&quot;) {
        name += &quot;ActiveMediaBlockStop&quot;;
      } else {
        name += data === &quot;active&quot; ? &quot;Start&quot; : 
    &quot;Stop&quot;;
      }   
      this.sendAsyncMessage(name);
    }   
  }
}
In order to try to understand what's going on I've inserted some debug printouts to this method (the lines that call dump()). This should tell us exactly when the observe() method gets called. And when it doesn't.

Having received such a message the child method then sends a message to the parent, which is where the error is triggered. I've added some additional debug output to the parent class as well. This is the version from ESR 91, but I've added similar annotations to the ESR 78 version too.
class AudioPlaybackParent extends JSWindowActorParent {
  constructor() {
    super();
    this._hasAudioPlayback = false;
    this._hasBlockMedia = false;
  }
  receiveMessage(aMessage) {
    dump(&quot;AudioPlayback: receiveMessage()\n&quot;);
    dump(&quot;AudioPlayback: aMessage: &quot; + aMessage.name + 
    &quot;\n&quot;);

    var stackTrace = Error().stack;
    dump(&quot;Stacktrace: &quot; + stackTrace + &quot;\n&quot;);

    const browser = this.browsingContext.top.embedderElement;
    switch (aMessage.name) {
      case &quot;AudioPlayback:Start&quot;:
        this._hasAudioPlayback = true;
        browser.audioPlaybackStarted();
        break;
[...]
    }
  }
}
When I execute this on ESR 91 all of the console output is generated at exactly the time and in exactly the way that's expected:
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: active
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:15:22

AudioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Start
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:19:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 26: 
    TypeError: browser is null
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: inactive-pause
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:15:22

AudioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Stop
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:19:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 30: 
    TypeError: browser is null
All of this output can be matched up with the child and parent AudioPlayback classes and, as we can see, the end result is the same null browser element error that we saw before.

Since the way in to all this is via the sending of an audio-playback message, the next thing I want to know is where this gets sent from. It turns out there's only one place where this happens and that's in the AudioPlaybackRunnable class that can be found in AudioChannelService.cpp.

I've added breakpoints to the AudioPlaybackRunnable constructor and to the AudioPlaybackRunnable::Run() method to check whether they get used. On ESR 91, when pressing the "Microphone" button, after agreeing to the permissions request, the breakpoint then hits:
(gdb) break AudioPlaybackRunnable::Run
Breakpoint 2 at 0x7ff3c3ab1c: file dom/audiochannel/AudioChannelService.cpp, 
    line 45.
(gdb) break AudioPlaybackRunnable::AudioPlaybackRunnable
Breakpoint 3 at 0x7ff3c3b714: file include/c++/8.3.0/bits/atomic_base.h, line 
    256.
(gdb) c
Continuing.
[...]
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 3, (anonymous namespace)::
    AudioPlaybackRunnable::AudioPlaybackRunnable (
    aReason=mozilla::dom::AudioChannelService::eDataAudibleChanged, 
    aActive=true, aWindow=0x7fb85d1640, this=0x7fb861eaf0)
    at dom/audiochannel/AudioChannelService.cpp:43
43              mReason(aReason) {}
(gdb) bt
#0  (anonymous namespace)::AudioPlaybackRunnable::AudioPlaybackRunnable (
    aReason=mozilla::dom::AudioChannelService::eDataAudibleChanged, 
    aActive=true, 
    aWindow=0x7fb85d1640, this=0x7fb861eaf0)
    at dom/audiochannel/AudioChannelService.cpp:43
#1  mozilla::dom::AudioChannelService::AudioChannelWindow::
    NotifyAudioAudibleChanged (this=this@entry=0x7fb8e450a0, 
    aWindow=0x7fb85d1640, 
    aAudible=aAudible@entry=mozilla::dom::AudioChannelService::eAudible, 
    aReason=aReason@entry=mozilla::dom::AudioChannelService::
    eDataAudibleChanged)
    at dom/audiochannel/AudioChannelService.cpp:580
[...]
#35 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, (anonymous namespace)::
    AudioPlaybackRunnable::Run (this=0x7fb861eaf0)
    at dom/audiochannel/AudioChannelService.cpp:45
45        NS_IMETHOD Run() override {
(gdb) bt
#0  (anonymous namespace)::AudioPlaybackRunnable::Run (this=0x7fb861eaf0)
    at dom/audiochannel/AudioChannelService.cpp:45
#1  0x0000007ff19a4e50 in mozilla::RunnableTask::Run (this=0x7fb956a580)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
[...]
#34 0x0000007fef54989c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: active
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:15:22

udioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Start
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:19:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 26: 
    TypeError: browser is null
Directly after this the page gives the option to stop recording. If I then press on this the breakpoints trigger again in a similar way:
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 3, (anonymous namespace)::
    AudioPlaybackRunnable::AudioPlaybackRunnable (
    aReason=mozilla::dom::AudioChannelService::ePauseStateChanged, 
    aActive=false, aWindow=0x7fb85d1640, this=0x7fb8d105c0)
    at dom/audiochannel/AudioChannelService.cpp:43
43              mReason(aReason) {}
(gdb) c
Continuing.

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, (anonymous namespace)::
    AudioPlaybackRunnable::Run (this=0x7fb8d105c0)
    at dom/audiochannel/AudioChannelService.cpp:45
45        NS_IMETHOD Run() override {
(gdb) c
Continuing.
AudioPlayback: observe()
AudioPlayback: topic: audio-playback
AudioPlayback: data: inactive-pause
Stacktrace: observe@resource://gre/actors/AudioPlaybackChild.jsm:15:22

AudioPlayback: receiveMessage()
AudioPlayback: aMessage: AudioPlayback:Stop
Stacktrace: receiveMessage@resource://gre/actors/AudioPlaybackParent.jsm:19:22

JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 30: 
    TypeError: browser is null
This all makes sense and is as expected. But the flow seems to diverge from what we see with ESR 78. There the C++ methods get called just as before. But as you can see in the output below, this never translates into the problematic JavaScript being executed. The following shows the breakpoint being hit when the "Microphone" button is pressed and the user then agrees to grant the site access to the microphone. Crucially, although the breakpoints are hit, the debug output added to the AudioPlayback classes are never triggered.
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, 0x0000007fbbbc47e0 in (
    anonymous namespace)::AudioPlaybackRunnable::AudioPlaybackRunnable (
    aReason=<optimized out>, aActive=<optimized out>, aWindow=<optimized out>, 
    this=<optimized out>)
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
33      obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h: No such file or 
    directory.
(gdb) bt
#0  0x0000007fbbbc47e0 in (anonymous namespace)::AudioPlaybackRunnable::
    AudioPlaybackRunnable (aReason=<optimized out>, aActive=<optimized out>,
    aWindow=<optimized out>, this=<optimized out>)
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#1  mozilla::dom::AudioChannelService::AudioChannelWindow::
    NotifyAudioAudibleChanged (this=this@entry=0x7f80fdd0b0, 
    aWindow=0x7f80cb6060,
    aAudible=aAudible@entry=mozilla::dom::AudioChannelService::eAudible, 
    aReason=aReason@entry=mozilla::dom::AudioChannelService::
    eDataAudibleChanged)
    at dom/audiochannel/AudioChannelService.cpp:592
[...]
#37 0x0000007fb735c89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c
Continuing.

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, (anonymous namespace)::
    AudioPlaybackRunnable::Run (this=0x7f8137d4f0)
    at dom/audiochannel/AudioChannelService.cpp:44
44        NS_IMETHOD Run() override {
(gdb) bt
#0  (anonymous namespace)::AudioPlaybackRunnable::Run (this=0x7f8137d4f0)
    at dom/audiochannel/AudioChannelService.cpp:44
#1  0x0000007fb9c014c4 in nsThread::ProcessNextEvent (aResult=0x7fa69f3ba7, 
    aMayWait=<optimized out>, this=0x7f8002fd60)
    at xpcom/threads/nsThread.cpp:1211
[...]
#24 0x0000007fb735c89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c            
Continuing.
[...]
Selecting the "Stop" button on ESR 78 shows a similar pattern: the breakpoints are hit but the JavaScript debug output never appears:
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, 0x0000007fbbbc47e0 in (
    anonymous namespace)::AudioPlaybackRunnable::AudioPlaybackRunnable (
    aReason=<optimized out>, aActive=<optimized out>, aWindow=<optimized out>, 
    this=<optimized out>)
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
33      obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h: No such file or 
    directory.
(gdb) bt
#0  0x0000007fbbbc47e0 in (anonymous namespace)::AudioPlaybackRunnable::
    AudioPlaybackRunnable (aReason=<optimized out>, aActive=<optimized out>,
    aWindow=<optimized out>, this=<optimized out>)
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#1  mozilla::dom::AudioChannelService::AudioChannelWindow::
    NotifyAudioAudibleChanged (this=this@entry=0x7f80f97430, 
    aWindow=0x7f80b985a0,
    aAudible=aAudible@entry=mozilla::dom::AudioChannelService::eAudible, 
    aReason=aReason@entry=mozilla::dom::AudioChannelService::
    eDataAudibleChanged)
    at dom/audiochannel/AudioChannelService.cpp:592
[...]
#37 0x0000007fb735c89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c                            
Continuing.

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, (anonymous namespace)::
    AudioPlaybackRunnable::Run (this=0x7f80f52980)
    at dom/audiochannel/AudioChannelService.cpp:44
44        NS_IMETHOD Run() override {
(gdb) bt
#0  (anonymous namespace)::AudioPlaybackRunnable::Run (this=0x7f80f52980)
    at dom/audiochannel/AudioChannelService.cpp:44
#1  0x0000007fb9c014c4 in nsThread::ProcessNextEvent (aResult=0x7fa69f3ba7, 
    aMayWait=<optimized out>, this=0x7f8002fd60)
    at xpcom/threads/nsThread.cpp:1211
[...]
#24 0x0000007fb735c89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c
Continuing.
[...]
I'm currently rather baffled by this. Everything works as expected on ESR 91, but while on ESR 78 the messages are very clearly sent from the C++ code, they're never received by the JavaScript actor.

Unfortunately I've run out of time to investigate this further today. But I'll continue looking at it tomorrow, when I'll also have to come to a decision about how to actually tackle the error. Do I add in a check for a null browser variable, remove the AudioPlayback actor entirely, or dig deeper to try to establish the underlying for why the embedder element is never defined. With any luck our investigations tomorrow will lead us to a suitable answer.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
13 Jul 2024 : Day 287 #
Today I'm continuing my investigation of an error that happens just after the user has granted permissions for a site to use the microphone. The problem generates an error to the console that looks like this:
JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 20: 
    TypeError: browser is null
As a quick recap, the line of code that's triggering the error, as it appears in the ESR 91 codebase, is the following:
    const browser = this.browsingContext.top.embedderElement;
This is a slight simplification of the code as it appears in ESR 78, like this:
    let topBrowsingContext = this.browsingContext.top;
    let browser = topBrowsingContext.embedderElement;
By looking at the history of this source file I was able to ascertain that the code has been merged into a single line but is otherwise the same, save for the fact that browser is now annotated with a const keyword.

The two variants of this code really appear to do very similar things, so what's interesting is that despite this change being barely different in terms of functionality, the error doesn't trigger under the same circumstances in ESR 78.

It seems unlikely that this const-ness is going to be the problem, but I suppose it's worth checking to make sure. The AudioPlaybackParent.jsm file is part of the gecko package and I don't want to have to rebuild everything, so I'm going to edit it in situ by accessing the omni.ja file.
cd omni/
./omni.sh unpack
find . -iname &quot;AudioPlaybackParent.jsm&quot; -exec vim {} \;
./omni.sh pack
I've made the following edits so that the version used is equivalent to the ESR 78 formulation:
  receiveMessage(aMessage) {
    //const browser = this.browsingContext.top.embedderElement;
    let topBrowsingContext = this.browsingContext.top;
    let browser = topBrowsingContext.embedderElement;
[...]
As expected, this has no practical effect, other than changing the line number the error is triggered on:
JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 22: 
    TypeError: browser is null
This means the error must be elsewhere. I've reversed the change I just made to avoid anything unexpected happening. But I now need to figure out why the browsing context is returning a null value for the embedderElement when, presumably, it didn't before.

Except, it turns out that's not quite the full story. While I thought the error didn't trigger at all in ESR 78, it seems it's not quite as simple as that. Earlier this morning I received the following messages from fellow Sailfish OS user and developer Mark Washeim (poetaster) on Mastodon suggesting otherwise:
 
@flypig I don't think it helps in any manner but I had noticed that AudioPlaybackParent message here: forum.sailfishos.org/t/javascr... That was with 4.4 and the error went away with 4.5. I was testing audio stuff for the mahjong webwrapper I maintain.
 
@flypig Ah, and I'm wrong, the error can still be triggered in 4.5 I just launched 'harbour-scaler' the web wrapper around scaler and it produces that error as soon as you hit the play button in the app. The audio does play though.

This is fascinating. I can't seem to get it to trigger at all myself, but clearly it does under the right conditions. This was a really helpful intervention from Mark.

It nevertheless feels like something is different, so it's worth taking a look at the code involved. Here's the code in BrowsingContext for extracting the top window, taken from BrowsingContext.cpp:
BrowsingContext* BrowsingContext::Top() {
  BrowsingContext* bc = this;
  while (bc->mParentWindow) {
    bc = bc->GetParent();
  }
  return bc;
}
Since this can't reasonably be null here, the method is guaranteed to return a non-null value. But that's not our problem, our problem is the embedderElement hanging off this. Here's the code for this, taken this time from BrowsingContext.h
  // Get the embedder element for this BrowsingContext if the embedder is
  // in-process, or null if it's not.
  Element* GetEmbedderElement() const { return mEmbedderElement; }
There aren't any other immediately obvious differences in this between the two versions. When running the two versions of the browser the results are interesting. Not only is there no indication of this error on the console for ESR 78, but having granted permission to the site to use the microphone, I actually got to hear some resulting feedback. That doesn't happen on my ESR 91 build. We don't know whether this difference is related to the error though.

But while I don't know whether the error output to the console is the reason why the microphone is left inert on ESR 91, fixing nevertheless would seem like the right thing to do at this point.

So I'm going to have to follow that embedderElement variable down the rabbit hole to see where it leads.

None of the code related to embedderElement in the BrowsingContext class differs between ESR 78 and ESR 91. The external interface to it appears to be the SetEmbedderElement() method. This is quite a complex method and there's also the question of how it gets called. So I think some stepping through of the code will be in order to compare how ESR 78 is handling it with the ESR 91 flow.

To start off I've placed a breakpoint on BrowsingContext::SetEmbedderElemenet() to find out where the value actually gets set. But whether I do this on ESR 78 or ESR 91 I don't get a hit. Looking through the code I can't see any other way that the value can be set, but it's possible there's some way I missed. So I've tried to place a watch point on the variable instead.

The mEmbedderElement variable is a member of the BrowsingContext class, so the variable won't exist until an instance of the class has been constructed. So I've first placed a breakpoint on the BrowsingContext constructor. Once that's called the object will actually exist and I can then find the memory location that mEmbedderElement points to. I can then add a watch on that memory location, which should trigger a break whenever the variable is written to.

Now I know the value on ESR 91 remains null, so I'm particularly interested to find out where the value gets set on ESR 78:
(gdb) b BrowsingContext::BrowsingContext
(gdb) r
[...]
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, mozilla::dom::
    BrowsingContext::BrowsingContext (this=this@entry=0x7f809a02a0, 
    aParentWindow=0x0,
    aGroup=0x7f80bef930, aBrowsingContextId=1, aType=mozilla::dom::
    BrowsingContext::Type::Content, aFields=...)
    at docshell/base/BrowsingContext.cpp:392
392     BrowsingContext::BrowsingContext(WindowContext* aParentWindow,
(gdb) n
92      ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsWrapperCache.h: No such 
    file or directory.
(gdb) n
410           mUseRemoteSubframes(false) {
(gdb) print &mEmbedderElement
$1 = (RefPtr<mozilla::dom::Element> *) 0x7f809a04f8
(gdb) watch *0x7f809a04f8
Hardware watchpoint 2: *0x7f809a04f8
(gdb) c
Continuing.
This watchpoint is never hit. There are several reasons I can think of for why this might happen. I don't use watchpoints often, so one possibility is that I'm using it wrong. I've also noticed the debugger crashing out occasionally with this particular ESR 78 build, so it could also be an issue with the debugger. Finally it could simply be that the value never gets written to.

There is another way to check this, which is to find out the value it takes when it gets read. To test this I placed a breakpoint on GetEmbedderElement(). On ESR 91 the breakpoint is hit when I request the microphone permission. At that point it's null, as anticipated:
(gdb) b GetEmbedderElement
Breakpoint 3 at 0x7fbc79cbbc: GetEmbedderElement. (7 locations)
(gdb) r
Starting program: /usr/bin/sailfish-browser 
[...]
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 3, mozilla::dom::
    BrowsingContext::GetEmbedderElement (this=0x7fb8bb2850)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/dom/
    BrowsingContext.h:335
335     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/dom/
    BrowsingContext.h: No such file or directory.
(gdb) p mEmbedderElement
$3 = {mRawPtr = 0x0}
(gdb) 
On ESR 78 I don't have so much luck. The breakpoint won't stick to GetEmbedderElement() at all: apparently it doesn't exist as a debug symbol.

Okay, fair enough, there's another way to do this too, which is to print out the value from inside the AudioPlaybackParent::receiveMessage() method in the JavaScript. This is the method that's generating the error on ESR 91.

But when I do this, I find this JavaScript method is never hit either. It means that the fix may not be to ensure mEmbedderElement is set to a non-null value at all, but rather making sure that the AudioPlaybackParent::receiveMessage() is never called.

This would seem to tally with Mark's experiences. And alongside Mark's comment this gives me a slightly new direction to look in to. Unfortunately I don't have time to look in to this new direction today. That means that tomorrow I'll need to look at where the AudioPlaybackParent::receiveMessage() is getting called from.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
12 Jul 2024 : Day 286 #
Still working on WebRTC today, after some success getting video to work, but less success with audio. I have two things explicitly on my to-do list. The first being the error when using GlobalMuteListener, the second the AudioPlaybackParent.jsm error triggered when trying to grant access to the microphone.

Let's start with the GlobalMuteListener class then. This is a new class added to ESR 91 with the following description associated with it, taken from changeset D86719:
$ git log -1 -S &quot;GlobalMuteListener&quot; browser/actors/WebRTCChild.jsm
commit 01fa52e2f5946f511199b6879919e2547f63fa25
Author: Mike Conley <mconley@mozilla.com>
Date:   Thu Aug 27 15:46:26 2020 +0000

    Bug 1643027 - Add GlobalMuteListener to WebRTCChild for globally muting the 
    microphone and camera. r=pbz
    
    This listens to state changes set via SharedData. Those state changes are 
    executed by
    front-end code in a later patch in this series.
    
    Depends on D87129
    
    Differential Revision: https://phabricator.services.mozilla.com/D86719
The associated bug is 1643027 and there are two further changesets associated with this bug. The first is D86619 which turns the Firefox microphone and camera icons into global mute toggles. The second is D86762 which adds unit tests for the other changes.

At some point I should take a look at the first of these to see whether it relates to anything in the Sailfish user interface. This could have privacy implications, so is of no small importance.

Right now though, I'm just going to add in the missing GlobalMuteListener class so that we're matching at least this part of the upstream functionality and which we might need to hook the user interface changes in to later.

The new class isn't especially large; in fact it's small enough that I can justify sharing all of it here.
/**
 * GlobalMuteListener is a process-global object that listens for changes to
 * the global mute state of the camera and microphone. When it notices a
 * change in that state, it tells the underlying platform code to mute or
 * unmute those devices.
 */
const GlobalMuteListener = {
  _initted: false,

  /**
   * Initializes the listener if it hasn't been already. This will also
   * ensure that the microphone and camera are initially in the right
   * muting state.
   */
  init() {
    if (!this._initted) {
      Services.cpmm.sharedData.addEventListener(&quot;change&quot;, this);
      this._updateCameraMuteState();
      this._updateMicrophoneMuteState();
      this._initted = true;
    }
  },

  handleEvent(event) {
    if (event.changedKeys.includes(&quot;WebRTC:GlobalCameraMute&quot;)) {
      this._updateCameraMuteState();
    }
    if (event.changedKeys.includes(&quot;WebRTC:GlobalMicrophoneMute&quot;)) {
      this._updateMicrophoneMuteState();
    }
  },

  _updateCameraMuteState() {
    let shouldMute = Services.cpmm.sharedData.get(&quot;WebRTC:
    GlobalCameraMute&quot;);
    let topic = shouldMute
      ? &quot;getUserMedia:muteVideo&quot;
      : &quot;getUserMedia:unmuteVideo&quot;;
    Services.obs.notifyObservers(null, topic);
  },

  _updateMicrophoneMuteState() {
    let shouldMute = Services.cpmm.sharedData.get(
      &quot;WebRTC:GlobalMicrophoneMute&quot;
    );
    let topic = shouldMute
      ? &quot;getUserMedia:muteAudio&quot;
      : &quot;getUserMedia:unmuteAudio&quot;;

    Services.obs.notifyObservers(null, topic);
  },
};
All this is really doing is listening to a couple of events for muting the camera and microphone and, in the event that they're received, broadcasting a notification to indicate the change (microphone/camera mute/unmute) that can be picked up by the user interface and acted upon.

Crucially it doesn't appear to depend on anything that's not already available in our EmbedLiteWebrtcUK.js source file. So I've added it towards the top of the file and uncommented the line that calls it.

When I use this updated code I don't get any obvious changes, except that there's now no longer an error output to the console.

I'll need to create a task to check the status of global mute on Sailfish OS, but this change is going to be enough for what we need right now.

Let's move on to the second error, which was the following:
JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 20: 
    TypeError: browser is null
The AudioPlaybackParent.jsm source is available upstream for both ESR 78 and ESR 91. Both versions are tiny files (29 lines and 45 lines respectively) but there have been changes between the two.

The change that is triggering the error relates to the browser variable. It's being initialised in two different ways. On ESR 78 we have this: let topBrowsingContext = this.browsingContext.top; let browser = topBrowsingContext.embedderElement; Whereas on ESR 91 we have this:
    const browser = this.browsingContext.top.embedderElement;
Looking carefully at and comparing both of these there's not much practical difference. On ESR 91 the variable is marked as const but that seems to be the only practical difference.

The top level class doesn't include a browsingContext member so in both cases this must be being inherited from the parent class, which is JSWindowActorParent:
class AudioPlaybackParent extends JSWindowActorParent {
The file I'm checking in the ESR 78 codebase is the patched version with the Sailfish OS changes applied, so it's worth checking to see whether these lines in particular are from upstream or from us:
$ git blame toolkit/actors/AudioPlaybackParent.jsm -L 10,+4
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 10)
   receiveMessage(aMessage) {
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 11)
     let topBrowsingContext = this.browsingContext.top;
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 12)
     let browser = topBrowsingContext.embedderElement;
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 13) 
They're from upstream. So then I wonder why they were changed? Let's find out.
$ git blame toolkit/actors/AudioPlaybackParent.jsm -L 15,+4
Blaming lines:   8% (4/45), done.
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 15)
   receiveMessage(aMessage) {
ebf5370d519e5 (alwu            2020-10-02 03:56:09 +0000 16)
     const browser = this.browsingContext.top.embedderElement;
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 17)
     switch (aMessage.name) {
5ce82c5c12dfa (Abdoulaye O. Ly 2019-08-19 21:17:21 +0000 18)
       case &quot;AudioPlayback:Start&quot;:
$ git log -1 ebf5370d519e5
commit ebf5370d519e5fe34c4a0b7aeb210de3c851c00a
Author: alwu <alwu@mozilla.com>
Date:   Fri Oct 2 03:56:09 2020 +0000

    Bug 1656414 - part1 : stop audio playback and media block if needed when 
    destroying an parent actor. r=farre
    
    If content process dispatches event after actor close, then we are not able 
    to clear the tab media indicator. Therefore, we should do corresponding 
    cleanup when actor destroys.
    
    Differential Revision: https://phabricator.services.mozilla.com/D91367
It's not quite clear to me what's going on with this, but it's beginning to feel like my brain needs to get some sleep. So I'll return to this fresh in the morning and take another look.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
11 Jul 2024 : Day 285 #
This morning I'm continuing my analysis of how the browser responds to the getUserMedia:request notification. I've been taking a second look at the three methods we were examining yesterday. I'm going to be referring back to them today, so it may be worth having them to hand if you want to follow along.

There are a few things that I want to note about them. First is that they're all triggered in the same way, namely in response to the getUserMedia:request notification. This notification is sent from the C++ code in MediaManager.cpp and picked up by the observer in WebRTCChild.jsm in the upstream code and in EmbedLiteWebrtcUI.js on the Sailfish browser.

On the Sailfish OS side we can see the case that filters for this message in the snippet I posted yesterday. But I skipped that part on the upstream snippets. I think it's worth taking a look at it though, just to tie everything together:
  // This observer is called from BrowserProcessChild to avoid
  // loading this .jsm when WebRTC is not in use.
  static observe(aSubject, aTopic, aData) {
    switch (aTopic) {
      case &quot;getUserMedia:request&quot;:
        handleGUMRequest(aSubject, aTopic, aData);
        break;
[...]
    }
  }
There's nothing magical going on here. The notification is fired, it's picked up by the observer and one of the handleGUMRequest() methods that I shared yesterday is executed depending on whether we're running ESR 78 or ESR 91.

On Sailfish OS a similar thing happens but with a different observer in a different source file. The main difference is that there's no handleGUMRequest() method that gets called. Instead the code is executed inline inside the switch statement.

This is one of two main differences between the two. The other main difference is due to what happens inside that code. The upstream code opens a prompt like this on ESR 78:
      prompt(
        contentWindow,
        aSubject.windowID,
        aSubject.callID,
        constraints,
        devices,
        secure,
        isHandlingUserInput
      );
This aligns with some equivalent but different code used to open a prompt on Sailfish OS:
      EmbedLiteWebrtcUI.prototype._prompt(
        contentWindow,
        aSubject.callID,
        constraints,
        devices,
        aSubject.isSecure);
    }
On upstream ESR 91 we also have some code that does something similar. There are differences here as well though, most notably the parameters that are passed in and where they come from.
  prompt(
    aSubject.type,
    contentWindow,
    aSubject.windowID,
    aSubject.callID,
    constraints,
    aSubject.devices,
    aSubject.isSecure,
    aSubject.isHandlingUserInput
  );
The part of this we're particularly interesting in is where the parameters are coming from, and we're especially concerned with the parameters that are needed for the _prompt() call on Sailfish OS.

Most of the parameters needed by our Sailfish OS code have stayed the same between ESR 78 and ESR 91 in the upstream code. That covers contentWindow, aSubject.callID, constraints and aSubject.isSecure.

That leaves just the devices parameter that has changed and that we need to worry about. Let's examine a bit more closely what's going on with this. In both ESR 78 and the Sailfish OS code we essentially had the following, where I've simplified things to remove all of the obfuscatory cruft.
  contentWindow.navigator.mozGetUserMediaDevices(
    function(devices) {
      prompt(devices);
    },
    function(error) {[...]},
  );
Although we don't have a Promise object, this is nevertheless essentially a promise on devices. The call to mozGetUserMediaDevices() will eventually either resolve to call the prompt() method with the fulfilled value for devices or it will call the error method.

In ESR 91 the promise is no longer needed: it's presumably been refactored higher up the call chain so that the devices value has already been resolved before the getUserMedia:request notification is sent out. Rather than us having to request the value asynchronously, in ESR 91 it's passed directly as part of the notification. That's why we have devices in ESR 78 and aSubject.devices in ESR 91.

This is true for ESR 91 upstream, but that means it's also likely to be true for our Sailfish OS version that's downstream of ESR 91. We should therefore, in theory, be able to do something like this:
      case &quot;getUserMedia:request&quot;:
        // Now that a getUserMedia request has been created, we should check
        // to see if we're supposed to have any devices muted. This needs
        // to occur after the getUserMedia request is made, since the global
        // mute state is associated with the GetUserMediaWindowListener, which
        // is only created after a getUserMedia request.
        GlobalMuteListener.init();

        let constraints = aSubject.getConstraints();
        let contentWindow = Services.wm.getOuterWindowWithId(aSubject.windowID);

        EmbedLiteWebrtcUI.prototype._prompt(
          contentWindow,
          aSubject.callID,
          constraints,
          aSubject.devices,
          aSubject.isSecure);
        break;
The good news about this is that the EmbedLiteWebrtcUI.js file is contained in the embedlite-components project, built separately from the gecko packages. This project is incredibly swift to compile, so I can test these changes almost immediately.

When I do I still get an error, but it's different this time:
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteWebrtcUI.js, line 295: ReferenceError: GlobalMuteListener is not 
    defined
Looking back at the file I can see that GlobalMuteListener is a class defined in WebRTCChild.jsm. I may need to do something with this, but just to test things, I'm commenting out the line where it's used so as to see whether it makes any difference.

And it does! Now when I visit the getUserMedia example and select to use the camera I get the usual nice Sailfish OS permissions request page appearing. If I grant the permission, the camera even works!
 
Screenshots showing the camera and microphone permissions pages, one showing the camera working and another showing the error message when access to the microphone is denied

Denying the permissions results in an error showing on the web page stating that the request was not allowed, which is exactly what should happen. The only wrinkle is when I try to use the microphone. The correct permissions page appears, but if I allow access I then get the following error output to the console:
JavaScript error: resource://gre/actors/AudioPlaybackParent.jsm, line 20: 
    TypeError: browser is null
I'll need to fix this, but not today. I'll take another look at this tomorrow and try to fix both the GlobalMuteListener error and this AudioPlaybackParent.jsm error. One further thing that's worth commenting on is the fact the video stream colours are a bit strange as you can see in the screenshot. From what I recall this isn't a new issue, the release version of the browser exhibits this as well. It'd be nice to get it fixed, but having thought about it, I'm not really sure what the problem is. Maybe a way to improve it will become apparent later.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
10 Jul 2024 : Day 284 #
Yesterday I applied the final two WebRTC patches, transferred over from ESR 78, with mixed results. I left a build running overnight after having manually applied the changes to the moz.build file after I failed to get the changes autogenerated from changes to the BUILD.gn file.

When I test the new version I do get slightly different results compared to the previous install. On Mozilla's getUserMedia example page I now get the same results whether I'm trying to use the camera or the microphone:
library &quot;libandroidicu.so&quot; needed or dlopened by &quot;/system/lib64/
    libmedia.so&quot; is not accessible for the namespace &quot;(default)&quot;
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteWebrtcUI.js, line 293: TypeError: 
    contentWindow.navigator.mozGetUserMediaDevices is not a function
I'm not sure about that libandroidicu.so error. I've not had a chance to find the code that's attempting to load this in dynamically — it doesn't look like the sort of library we should be needing — so I'll have to add this task to my queue.

The fact that both the microphone and camera produce this error is I think a sign that the changes to the moz.build file are useful, but it brings us to the same conclusion, which is that we need to fix the process for requesting permissions from the user. So the obvious thing to do is to follow that error and try to find out why it's happening. But when I check the EmbedLiteWebrtcUI.js file the reason for the error is immediately evident: the mozGetUserMediaDevices() method is no longer part of the Navigator interface, as defined in Navigator.webidl.

I can use git blame to find out when it was removed and why:
$ git log -1 -S &quot;mozGetUserMediaDevices&quot; dom/webidl/Navigator.webidl
commit 9f60769705be72d6057feb073c87aca32e0baf27
Author: Karl Tomlinson <karlt+@karlt.net>
Date:   Thu May 6 05:16:49 2021 +0000

    Bug 1709474 move mozGetUserMediaDevices from Navigator to 
    GetUserMediaRequest r=jib,webidl,geckoview-reviewers,smaug,agi
    
    Differential Revision: https://phabricator.services.mozilla.com/D111565
Referring to the upstream D111565 changeset, not only has the method been removed from the interface definition, it's also been removed from the source file, in this case Navigator.cpp. This is the method that could be found there in ESR 78:
void Navigator::MozGetUserMediaDevices(
    MozGetUserMediaDevicesSuccessCallback& aOnSuccess,
    NavigatorUserMediaErrorCallback& aOnError, uint64_t aInnerWindowID,
    const nsAString& aCallID, ErrorResult& aRv) {
  if (!mWindow || !mWindow->GetOuterWindow() ||
      mWindow->GetOuterWindow()->GetCurrentInnerWindow() != mWindow) {
    aRv.Throw(NS_ERROR_NOT_AVAILABLE);
    return;
  }
Again, referring to the same changeset it looks like the method has been moved to the GetUserMediaRequest.cpp file, or at least that there's a new equivalent to be found there. This is what this looks like in ESR 91:
void GetUserMediaRequest::GetDevices(
    nsTArray<RefPtr<nsIMediaDevice>>& retval) const {
  MOZ_ASSERT(retval.Length() == 0);
  if (!mMediaDeviceSet) {
    return;
  }
  for (const auto& device : *mMediaDeviceSet) {
    retval.AppendElement(device);
  }
}
These are clearly not identical methods, but with a bit of adjustment to the calling code it may be that we'll need to make use of this as an alternative to MozGetUserMediaDevices(). I'm not sure how this will pan out, so this will require some more investigation. First up, this is the code in EmbedLiteWebrtcUI.js where this is called, which is currently present in both ESR 78 and ESR 91 because it's part of the embedlite-components package:
      case &quot;getUserMedia:request&quot;:
        let constraints = aSubject.getConstraints();
        let contentWindow = Services.wm.getOuterWindowWithId(aSubject.windowID);

        contentWindow.navigator.mozGetUserMediaDevices(
          constraints,
          function(devices) {
            if (!contentWindow.closed) {
              EmbedLiteWebrtcUI.prototype._prompt(
                contentWindow,
                aSubject.callID,
                constraints,
                devices,
                aSubject.isSecure);
            }
          },
          function(error) {
            Services.obs.notifyObservers(null, &quot;getUserMedia:request:
    deny&quot;, aSubject.callID);
            Cu.reportError(error);
          },
          aSubject.innerWindowID,
          aSubject.callID
        );
        break;
Note that this is JavaScript code, as are the following snippets as well. The equivalent upstream call as used by Firefox happening in the ESR 78 code is in the WebRTCChild.jsm file:
function handleGUMRequest(aSubject, aTopic, aData) {
  let constraints = aSubject.getConstraints();
  let secure = aSubject.isSecure;
  let isHandlingUserInput = aSubject.isHandlingUserInput;
  let contentWindow = Services.wm.getOuterWindowWithId(aSubject.windowID);

  contentWindow.navigator.mozGetUserMediaDevices(
    constraints,
    function(devices) {
      // If the window has been closed while we were waiting for the list of
      // devices, there's nothing to do in the callback anymore.
      if (contentWindow.closed) {
        return;
      }

      prompt(
        contentWindow,
        aSubject.windowID,
        aSubject.callID,
        constraints,
        devices,
        secure,
        isHandlingUserInput
      );
    },
    function(error) {
      // Device enumeration is done ahead of handleGUMRequest, so we're not
      // responsible for handling the NotFoundError spec case.
      denyGUMRequest({ callID: aSubject.callID });
    },
    aSubject.innerWindowID,
    aSubject.callID
  );
}
And here's the equivalent code, as it would run on Firefox, on ESR 91. As we can see, this no longer uses the mozGetUserMediaDevices() method:
function handleGUMRequest(aSubject, aTopic, aData) {
  // Now that a getUserMedia request has been created, we should check
  // to see if we're supposed to have any devices muted. This needs
  // to occur after the getUserMedia request is made, since the global
  // mute state is associated with the GetUserMediaWindowListener, which
  // is only created after a getUserMedia request.
  GlobalMuteListener.init();

  let constraints = aSubject.getConstraints();
  let contentWindow = Services.wm.getOuterWindowWithId(aSubject.windowID);

  prompt(
    aSubject.type,
    contentWindow,
    aSubject.windowID,
    aSubject.callID,
    constraints,
    aSubject.devices,
    aSubject.isSecure,
    aSubject.isHandlingUserInput
  );
}
I'm going to need to compare these three methods further, but it's late here already, so that will have to wait until tomorrow. I'm not anticipating having much time to look at this tomorrow, but will try to get something done and posted about it nonetheless.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
9 Jul 2024 : Day 283 #
This morning I've picked up where I left off, attempting to get the library to link after apply equivalents for patches 77, 78 and 79. These three patches represent the first of two stages needed to get WebRTC back up and running. Once these changes are going through I'll then need to apply patches 80 and 81, which represent the second stage, which should hopefully be enough to get things working.

Having applied the first three patches I was still left with a single undefined reference during linking. Apparently the following constructor is missing an implementation:
webrtc::EncoderSimulcastProxy::EncoderSimulcastProxy()
Checking through the code I discovered that in ESR 78 the EncoderSimulcastProxy class isn't used, with a VP8EncoderSimulcastProxy class being used in its stead. It seems likely this has something to do with the problem. In ESR 78 this class is defined in the vp8_encoder_simulcast_proxy.cc and vp8_encoder_simulcast_proxy.h files, whereas the ESR 91 code is found in the encoder_simulcast_proxy.cc and encoder_simulcast_proxy.h files.

If you've been following along for the last few days you'll know that the WebRTC code is being built using the recipes in the following GN build files:
$ ls dom/media/webrtc/third_party_build/gn-configs/
arm64_False_arm64_freebsd.json  x64_False_x64_netbsd.json
arm64_False_arm64_linux.json    x64_False_x64_openbsd.json
arm64_False_arm64_netbsd.json   x64_False_x64_win.json
arm64_False_arm64_openbsd.json  x64_False_x86_android.json
arm64_True_arm64_freebsd.json   x64_False_x86_linux.json
arm64_True_arm64_linux.json     x64_True_arm64_android.json
arm64_True_arm64_netbsd.json    x64_True_arm64_mac.json
arm64_True_arm64_openbsd.json   x64_True_arm64_win.json
arm_False_arm_freebsd.json      x64_True_arm_android.json
arm_False_arm_linux.json        x64_True_x64_android.json
arm_False_arm_netbsd.json       x64_True_x64_dragonfly.json
arm_False_arm_openbsd.json      x64_True_x64_freebsd.json
arm_True_arm_freebsd.json       x64_True_x64_linux.json
arm_True_arm_linux.json         x64_True_x64_mac.json
arm_True_arm_netbsd.json        x64_True_x64_netbsd.json
arm_True_arm_openbsd.json       x64_True_x64_openbsd.json
mips64_False_mips64_linux.json  x64_True_x64_win.json
mips64_True_mips64_linux.json   x64_True_x86_android.json
ppc64_False_ppc64_linux.json    x64_True_x86_linux.json
ppc64_True_ppc64_linux.json     x86_False_x86_freebsd.json
x64_False_arm64_android.json    x86_False_x86_linux.json
x64_False_arm64_mac.json        x86_False_x86_netbsd.json
x64_False_arm64_win.json        x86_False_x86_openbsd.json
x64_False_arm_android.json      x86_False_x86_win.json
x64_False_x64_android.json      x86_True_x86_freebsd.json
x64_False_x64_dragonfly.json    x86_True_x86_netbsd.json
x64_False_x64_freebsd.json      x86_True_x86_openbsd.json
x64_False_x64_linux.json        x86_True_x86_win.json
x64_False_x64_mac.json
Of these the arm64_False_arm64_linux.json, arm_False_arm_linux.json and x86_False_x86_linux.json files were all transferred over from ESR 78, having been amended by patch 78. When I examined these files I notice something relevant: they differ slightly from all of the other files in that in the ESR 91 files encoder_simulcast_proxy is referenced where vp8_encoder_simulcast_proxy was referenced on ESR 78. So this could well be our problem.

And it makes sense as well: in the code the V8 versions of the files aren't used any more, having been replaced by the generic encoder simulcaast proxy implementation. So it makes sense that the build scripts might need updating to reference the new files rather than the previous ones.

If this is the case, then the solution should be straightforward. All I need to do is switch the file names in the GN build files, then regenerate the moz.build files from these.

So, here is the change I've made to the GN build files. This mimics the changes that have already been made to the other GN build files (but which we're not actually using).
$ git diff -U0
diff --git a/dom/media/webrtc/third_party_build/gn-configs/
    arm64_False_arm64_linux.json b/dom/media/webrtc/third_party_build/
    gn-configs/arm64_False_arm64_linux.json
index f1e7bfac33b8..beeded9b8602 100644
--- a/dom/media/webrtc/third_party_build/gn-configs/arm64_False_arm64_linux.json
+++ b/dom/media/webrtc/third_party_build/gn-configs/arm64_False_arm64_linux.json
@@ -4143,2 +4143,2 @@
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.cc&quot;,
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.h&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.cc&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.h&quot;,
diff --git a/dom/media/webrtc/third_party_build/gn-configs/
    arm_False_arm_linux.json b/dom/media/webrtc/third_party_build/gn-configs/
    arm_False_arm_linux.json
index 9885557f56cf..f3a87ac3b260 100644
--- a/dom/media/webrtc/third_party_build/gn-configs/arm_False_arm_linux.json
+++ b/dom/media/webrtc/third_party_build/gn-configs/arm_False_arm_linux.json
@@ -4408,2 +4408,2 @@
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.cc&quot;,
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.h&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.cc&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.h&quot;,
diff --git a/dom/media/webrtc/third_party_build/gn-configs/
    x86_False_x86_linux.json b/dom/media/webrtc/third_party_build/gn-configs/
    x86_False_x86_linux.json
index 27b49b928398..796fc1e920a5 100644
--- a/dom/media/webrtc/third_party_build/gn-configs/x86_False_x86_linux.json
+++ b/dom/media/webrtc/third_party_build/gn-configs/x86_False_x86_linux.json
@@ -4124,2 +4124,2 @@
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.cc&quot;,
-                &quot;//media/engine/vp8_encoder_simulcast_proxy.h&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.cc&quot;,
+                &quot;//media/engine/encoder_simulcast_proxy.h&quot;,
Following this I'll need to regenerate the gecko build files that are actually used during the build process.
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ cd gecko-dev/
$ ./mach build-backend -b GnMozbuildWriter
I can see this has resulted in a large number of changes to the moz.build files, which I take to be a good sign. Now time to do a full clean rebuild of the code.
$ sfdk build -d -p --with git_workaround
[...]
Wrote: RPMS/SailfishOS-devel-aarch64/
    xulrunner-qt5-devel-91.9.1+git1+sailfishos.esr91.20240707214235.e32f3a780e7e-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/
    xulrunner-qt5-misc-91.9.1+git1+sailfishos.esr91.20240707214235.e32f3a780e7e-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/
    xulrunner-qt5-91.9.1+git1+sailfishos.esr91.20240707214235.e32f3a780e7e-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/
    xulrunner-qt5-debugsource-91.9.1+git1+sailfishos.esr91.20240707214235.e32f3a780e7e-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/
    xulrunner-qt5-debuginfo-91.9.1+git1+sailfishos.esr91.20240707214235.e32f3a780e7e-1.aarch64.rpm
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.B5jeR8
+ umask 022
+ cd gecko-dev-esr91/gecko-dev/gecko-dev
+ /bin/rm -rf /home/deploy/installroot
+ RPM_EC=0
++ jobs -p
+ exit 0
Fantastic! The build has gone through without errors either during compilation or linking. This is exciting progress. But it's not time to test the packages just yet as I also have to apply the remaining two patches. Both of these are marked as being part of task JB#53982, so I should apply them both together.

As we saw a couple of days ago, I've already updated patch 81 to use the new location of the libwebrtc code and I didn't notice any obvious issues with patch 80, so with any luck these will now apply without too much drama.
$ git am --3way ../rpm/
    0080-sailfishos-webrtc-Enable-GMP-for-encoding-decoding.-.patch
Applying: Enable GMP for encoding/decoding. JB#53982
Using index info to reconstruct a base tree...
M       dom/media/gmp/GMPSharedMemManager.cpp
A       media/webrtc/signaling/src/media-conduit/GmpVideoCodec.cpp
A       media/webrtc/signaling/src/media-conduit/GmpVideoCodec.h
A       media/webrtc/signaling/src/media-conduit/VideoConduit.cpp
A       media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp
A       media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.h
A       media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
Falling back to patching base and 3-way merge...
Auto-merging dom/media/webrtc/libwebrtcglue/WebrtcGmpVideoCodec.h
Auto-merging dom/media/webrtc/libwebrtcglue/WebrtcGmpVideoCodec.cpp
CONFLICT (content): Merge conflict in dom/media/webrtc/libwebrtcglue/
    WebrtcGmpVideoCodec.cpp
Auto-merging dom/media/webrtc/libwebrtcglue/VideoConduit.cpp
Auto-merging dom/media/webrtc/libwebrtcglue/GmpVideoCodec.h
Auto-merging dom/media/webrtc/jsapi/PeerConnectionImpl.cpp
Auto-merging dom/media/gmp/GMPSharedMemManager.cpp
error: Failed to merge in the changes.
Patch failed at 0001 Enable GMP for encoding/decoding. JB#53982
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run &quot;git am --continue&quot;.
If you prefer to skip this patch, run &quot;git am --skip&quot; instead.
To restore the original branch and stop patching, run &quot;git am 
    --abort&quot;.
Ah, sadly it's not going to be quite as straightforward as I'd hoped. No matter. There are three hunks that have failed to apply cleanly to the WebrtcGmpVideoCode.cpp file. It looks like the reason might be an upstream change that has slightly amended the way strings are handled. The line that used to look like this:
tags.AppendElement(NS_LITERAL_CSTRING(&quot;h264&quot;));
Now looks like this:
tags.AppendElement(&quot;h264&quot;_ns);
We've seen this before all the way back on Day 35. Thankfully fixing it is a straightforward process, but I notice that NS_LITERAL_CSTRING() is used in several places in the new code as well, so I'll need to change those as well.

With these changes tackled the merge now goes through successfully.
$ git add dom/media/webrtc/libwebrtcglue/WebrtcGmpVideoCodec.cpp
$ git am --continue
Applying: Enable GMP for encoding/decoding. JB#53982
Now on to patch 81. Unfortunately this also fails to apply cleanly.
$ git am --3way ../rpm/
    0081-sailfishos-webrtc-Implement-video-capture-module.-JB.patch
Applying: Implement video capture module. JB#53982
Using index info to reconstruct a base tree...
M       old-configure.in
M       third_party/libwebrtc/webrtc/modules/video_capture/BUILD.gn
M       third_party/libwebrtc/webrtc/modules/video_capture/
    video_capture_internal_impl_gn/moz.build
error: patch failed: third_party/libwebrtc/webrtc/modules/video_capture/
    video_capture_internal_impl_gn/moz.build:60
error: third_party/libwebrtc/webrtc/modules/video_capture/
    video_capture_internal_impl_gn/moz.build: patch does not apply
error: Did you hand edit your patch?
It does not apply to blobs recorded in its index.
Patch failed at 0001 Implement video capture module. JB#53982
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run &quot;git am --continue&quot;.
If you prefer to skip this patch, run &quot;git am --skip&quot; instead.
To restore the original branch and stop patching, run &quot;git am 
    --abort&quot;.
In this case the issue is different. It's specifically the moz.build file it fails on. But as we've seen, these moz.build files are autogenerated. In this case, it's the change to BUILD.gn that we can see in the patch (and which applied without issue) that I believe should then be propagated to moz.build when the build files are regenerated.

So it looks like the simple solution will be to remove the changes to moz.build from the patch, apply the patch, then regenerate the build files. I should be left with some similar changes to moz.build autogenerated for me. I've therefore manually edited patch 81 to remove the moz.build changes listed there.
$ git am --3way ../rpm/0081-sailfishos-webrtc-Implement-video-capture-modul
e.-JB.patch
Applying: Implement video capture module. JB#53982
Patch goes through okay. Now to regenerate the build files.
$ sfdk engine exec
$ sb2 -t SailfishOS-esr78-aarch64.default
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ cd gecko-dev/
$ ./mach build-backend -b GnMozbuildWriter
[...]
Finished reading 1165 moz.build files in 12.36s
Read 13 gyp files in parallel contributing 0.00s to total wall time
Processed into 7258 build config descriptors in 10.27s
GnMozbuildWriter backend executed in 25.27s
  1 total backend files; 1 created; 0 updated; 0 unchanged; 0 deleted
Total wall time: 48.95s; CPU time: 47.91s; Efficiency: 98%; Untracked: 1.06s
Glean could not be found, so telemetry will not be reported. You may need to 
    run |mach bootstrap|.
I'm a bit perturbed to discover that this hasn't changed the moz.build file in the way I was expecting. To be clear: it hasn't changed it at all. I'm not happy with this and it's going to require a bit more investigation.

Perhaps I need to clean out the build before attempting to regenerate the build files?
$ sfdk engine exec
$ sb2 -t SailfishOS-esr78-aarch64.default
$ cd ..
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ cd gecko-dev/
$ ./mach build-backend -b GnMozbuildWriter
config.status not found.  Please run |mach configure| or |mach build| prior to 
    building the ['GnMozbuildWriter'] build backend.
Glean could not be found, so telemetry will not be reported. You may need to 
    run |mach bootstrap|.
Well, no, clearly that's not going to do the job. After doing a complete rebuild I've tried again, but still without any changes resulting in the moz.build files.
$ ./mach build-backend -b GnMozbuildWriter
 0:03.89 ${PROJECT}/obj-build-mer-qt-xr/_virtualenvs/common/bin/python 
    ${PROJECT}/obj-build-mer-qt-xr/config.status --backend GnMozbuildWriter
Reticulating splines...
 0:04.87 File already read. Skipping: ${PROJECT}/gecko-dev/intl/components/
    moz.build
 0:08.18 File already read. Skipping: ${PROJECT}/gecko-dev/gfx/angle/targets/
    angle_common/moz.build
There are some warnings about files already having been read. I did some reading around this but there's no indication that it could be part of the problem. Part of the output generated from the process contains a list of all the GN build files it's consuming. Here's the list (edited for clarity):
Writing moz.build files based on the following gn configs: [
    'x64_False_arm_android.json', 'arm_True_arm_freebsd.json',
    'mips64_True_mips64_linux.json', 'arm64_True_arm64_linux.json',
    'x64_False_x64_openbsd.json', 'x64_True_x64_mac.json', 
    x64_False_x64_win.json', 'x64_True_x86_linux.json',
    'x86_True_x86_win.json', 'mips64_False_mips64_linux.json',
    'x64_True_arm64_win.json', 'arm64_True_arm64_openbsd.json',
    'ppc64_False_ppc64_linux.json', 'x64_False_arm64_mac.json',
    'x86_False_x86_linux.json', 'x86_False_x86_win.json',
    'arm_False_arm_linux.json', 'x64_False_x64_mac.json',
    'arm64_True_arm64_freebsd.json', 'x64_True_x64_netbsd.json',
    'arm64_False_arm64_linux.json', 'x64_False_x86_android.json',
    'x86_False_x86_openbsd.json', 'x64_True_x64_win.json',
    'arm64_False_arm64_openbsd.json', 'x86_True_x86_netbsd.json',
    'x86_True_x86_openbsd.json', 'x64_False_x64_android.json',
    'x64_False_x64_freebsd.json', 'x64_True_x86_android.json',
    'x64_True_x64_openbsd.json', 'arm_True_arm_netbsd.json',
    'x64_False_x64_netbsd.json', 'arm64_False_arm64_netbsd.json',
    'x64_True_arm_android.json', 'x64_False_x64_dragonfly.json',
    'x64_False_arm64_android.json', 'x64_True_x64_freebsd.json',
    'arm64_False_arm64_freebsd.json', 'x64_True_arm64_mac.json',
    'x64_False_x64_linux.json', 'x86_False_x86_freebsd.json',
    'arm_False_arm_openbsd.json', 'ppc64_True_ppc64_linux.json',
    'x64_True_x64_linux.json', 'x64_False_x86_linux.json',
    'x64_True_arm64_android.json', 'arm_True_arm_openbsd.json',
    'x64_False_arm64_win.json', 'arm64_True_arm64_netbsd.json',
    'arm_False_arm_netbsd.json', 'arm_False_arm_freebsd.json',
    'x64_True_x64_android.json', 'x64_True_x64_dragonfly.json',
    'x86_False_x86_netbsd.json', 'x86_True_x86_freebsd.json',
    'arm_True_arm_linux.json'
]
The three files of importance to us are the ones we've amended, which are the arm64_False_arm64_linux.json, arm_False_arm_linux.json and x86_False_x86_linux.json files. All of these appear in the list, which I'd take as a good sign if it weren't for the fact it doesn't appear to be having any effect.

In spite of all this I left the build running all day and it went through fine, so I thought I'd have a go at running it and testing it with some WebRTC test sites. First up I tried the LiveKit WebRTC Browser Test. The results are a bit of a mixed bag. Although the video codecs seem to be working correctly, when the site requests audio input it gets stuck in a holding position indefinitely. When I perform the same action on my desktop browser a permission box opens and the process continues once I've agreed for the permissions to be granted.

Screenshots of the LiveKit WebRTC Browser Test: running on the desktop with Firefox; running on Sailfish OS. The later shows a spinner when asking for Media devices audioinput

There are some errors output to the console as well:
Created LOG for EmbedLiteLayerManager
JavaScript warning: https://livekit.io/_next/static/chunks/
    main-4e25e037b50f6258.js, line 1: unreachable code after return statement
JavaScript warning: https://livekit.io/_next/static/chunks/
    main-4e25e037b50f6258.js, line 1: unreachable code after return statement
JavaScript error: resource://gre/modules/amInstallTrigger.jsm, line 43: 
    TypeError: this.mm is null
JavaScript error: resource://gre/modules/amInstallTrigger.jsm, line 43: 
    TypeError: this.mm is null
JavaScript error: resource://gre/modules/amInstallTrigger.jsm, line 43: 
    TypeError: this.mm is null
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteWebrtcUI.js, line 293: TypeError: 
    contentWindow.navigator.mozGetUserMediaDevices is not a function

###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, 
    messages will be lost
This all suggests to me that the permission dialogues are currently broken on the ESR 91 build, so this is something I'll need to look into. I also had a play with Mozilla's WebRTC Tests.

Testing out the getUserMedia example shows similar problems. Pressing the camera button has no effect, while pressing the Microphone button generates a similar error to the console:
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteWebrtcUI.js, line 293: TypeError: 
    contentWindow.navigator.mozGetUserMediaDevices is not a function
I have better success with the Single host DataChannels page. In this case I'm able to send messages backwards and forwards just as I can on a desktop browser.

So this is so far all pointing to an issue with the page for requesting permissions. I'm not going to have time to check this now, but there is one other thing I'd like to try.

Although I wasn't able to get the moz.build files updated automatically by generating them from the GN build scripts, I do have the patch that shows what ought to be changing in the moz.build. So I'm going to apply this change manually and run another rebuild. It'll be interesting to establish whether this has any effect. If all goes to plan, we can find out tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
8 Jul 2024 : Day 282 #
So far I've been trying to apply WebRTC patches. There are five of them, 77 – 81, of which 77 and 78 have already been applied. Well, sort of. The first applied fine but for the second I copied over the GN build configuration json files and regenerated the build scripts hoping that would do the trick.

I left the build running overnight and, unlike my previous attempts, it now compiles without error. That's a positive step forwards. But although compilation goes through, the linking now fails.

There are a lot of missing symbols. Here are some of the errors; I won't give them all as they're repetitive to say the least. But this gives a decently representative sample of them.
285:37.36 toolkit/library/build/libxul.so
289:04.49 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: ../../../dom/media/systemservices/
    Unified_cpp_systemservices0.o: in function `webrtc::ScreenDeviceInfoImpl::
    GetDeviceName(unsigned int, char*, unsigned int, char*, unsigned int, 
    char*, unsigned int, int*)':
289:04.49 ${PROJECT}/gecko-dev/dom/media/systemservices/video_engine/
    desktop_capture_impl.cc:63: undefined reference to `webrtc::
    DesktopDisplayDevice::DesktopDisplayDevice()'
289:04.49 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: ${PROJECT}/gecko-dev/dom/media/systemservices/
    video_engine/desktop_capture_impl.cc:63: undefined reference to `webrtc::
    DesktopDisplayDevice::~DesktopDisplayDevice()'
289:04.49 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: ${PROJECT}/gecko-dev/dom/media/systemservices/
    video_engine/desktop_capture_impl.cc:81: undefined reference to `webrtc::
    DesktopDisplayDevice::getDeviceName()'
[...]
289:04.51 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: libxul.so: hidden symbol 
    `_ZN6webrtc15DesktopCapturer20CreateScreenCapturerERKNS_21DesktopCaptureOptionsE' isn't defined
289:04.51 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: final link failed: bad value
289:04.51 collect2: error: ld returned 1 exit status
All bar one of the issues are undefined reference errors. There's also one hidden symbol error which I'll come on to. But first, the missing symbols, all of which are inside the webrtc namespace, are the following:
  1. ScreenDeviceInfoImpl::GetDeviceName()
  2. DesktopDisplayDevice::DesktopDisplayDevice()
  3. DesktopDisplayDevice::~DesktopDisplayDevice()
  4. DesktopDisplayDevice::getDeviceName()
  5. DesktopDisplayDevice::getUniqueIdName()
  6. WindowDeviceInfoImpl::GetDeviceName()
  7. DesktopDisplayDevice::getPid()
  8. BrowserDeviceInfoImpl::GetDeviceName()
  9. DesktopTab::DesktopTab()
  10. DesktopTab::~DesktopTab()
  11. DesktopTab::getTabName()
  12. DesktopTab::getUniqueIdName()
  13. DesktopDeviceInfoImpl::Create()
  14. DesktopCaptureOptions::CreateDefault()
  15. DesktopCaptureOptions::~DesktopCaptureOptions()
  16. DesktopCapturer::CreateScreenCapturer()
  17. MouseCursorMonitor::CreateForScreen()
  18. DesktopAndCursorComposer::DesktopAndCursorComposer()
  19. DesktopCapturer::CreateWindowCapturer()
  20. MouseCursorMonitor::CreateForWindow(webrtc::DesktopCaptureOptions()
  21. DesktopCapturer::CreateTabCapturer()
  22. DesktopRegion::DesktopRegion()
  23. DesktopRegion::~DesktopRegion()
  24. EncoderSimulcastProxy::EncoderSimulcastProxy()
The single hidden symbol is for the following:
_ZN6webrtc15DesktopCapturer20CreateScreenCapturerERKNS_21
    DesktopCaptureOptionsE
Running this through the demangler gives the following name:
$ c++filt _ZN6webrtc15DesktopCapturer20CreateScreenCapturerERKNS_21
    DesktopCaptureOptionsE
webrtc::DesktopCapturer::CreateScreenCapturer(
    webrtc::DesktopCaptureOptions const&
)
As we can see this is also one of the missing symbols. A missing symbol usually means one of two things. It can mean that the header file is being compiled without a related source file, so that there's a call to a method that the compiler can provide a symbol for, but then the linker can't hook it up to the actual function. Alternatively it can mean that there's some object code being generated that's not being included in the final linking step.

I'm not sure what's going on here, but I do note that there are implementations for most of these methods in the code. For example DesktopDisplayDevice::getDeviceName() can be found in the desktop_device_info.cc file:
const char *DesktopDisplayDevice::getDeviceName() {
  return deviceNameUTF8_;
}
Amongst the patches I don't see any obvious change that would immediately try to link in a missing object file. There are some changes to the build configuration that would affect what's compiled, for example the following:
     UNIFIED_SOURCES += [
-        &quot;/third_party/libwebrtc/webrtc/modules/video_capture/linux/
    device_info_linux.cc&quot;,
-        &quot;/third_party/libwebrtc/webrtc/modules/video_capture/linux/
    video_capture_linux.cc&quot;
+        &quot;/third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    device_info_sfos.cc&quot;,
+        &quot;/third_party/libwebrtc/webrtc/modules/video_capture/sfos/
    video_capture_sfos.cc&quot;
     ]
There also seem to be some relevant relationships between the missing symbols and some of the later patches too. For example patch 81 contains the a new method called DeviceInfoSFOS::GetDeviceName(). Now that's not going to generate an exact replacement for the ScreenDeviceInfoImpl::GetDeviceName() symbol, but it's surely related.

From the descriptions in the patches it's also clear that patches 77, 78 and 79 are all supposed to be applied together, since they were all worked on under bug JB#53756 (The "JB" here stands for "Jolla Bug" and the number is the bug number in Jolla's Bugzilla database).

My conclusion is that I should probably try applying patch 79 and attempt to build again before I start attempting any manual fixes myself. It's possible the linking errors will just go away if I do this.

Patch 79 makes changes to just a single file and that file doesn't appear to have moved, so maybe it'll just apply?
$ git am --3way ../rpm/
    0079-sailfishos-webrtc-Disable-desktop-sharing-feature-on.patch
Applying: Disable desktop sharing feature on SFOS. JB#53756
Using index info to reconstruct a base tree...
M       dom/media/systemservices/VideoEngine.cpp
Falling back to patching base and 3-way merge...
Auto-merging dom/media/systemservices/VideoEngine.cpp
The patch failed initially, but after adding the -3way option to fall back to a three-way merge it then applies fine. The patch is quite small and I was able to easily check manually that all of the elements had correctly applied.

So I've set the build going again. It's another fully clean build, so I'm expecting it to take some time. Looking at the output the last build took four and three quarter hours before it reached the linking stage. Right now it's 16:45, so with any luck it'll be completed (either with or without errors) by 21:30 which might still give me some time to do some more work on it afterwards.

[...]

The build ran pretty much to schedule. Unfortunately it didn't go through, failing once again at the final linking stage. But the results are encouraging nevertheless. While last time there were 24 undefined references, now there's only one:
274:07.07 toolkit/library/build/libxul.so
277:37.26 SailfishOS-devel-aarch64.default/opt/cross/bin/
    aarch64-meego-linux-gnu-ld: ../../../dom/media/webrtc/libwebrtcglue/
    Unified_cpp_libwebrtcglue0.o: in function `mozilla::WebrtcVideoConduit::
    CreateEncoder(webrtc::VideoCodecType)':
277:37.26 ${PROJECT}/gecko-dev/dom/media/webrtc/libwebrtcglue/VideoConduit.cpp:
    1774: undefined reference to `webrtc::EncoderSimulcastProxy::
    EncoderSimulcastProxy(webrtc::VideoEncoderFactory*, webrtc::SdpVideoFormat 
    const&)'
277:37.26 collect2: error: ld returned 1 exit status
Extracting the crucial information from this error, the undefined reference is for the following method:
webrtc::EncoderSimulcastProxy::EncoderSimulcastProxy()
A quick check shows that this class doesn't exist in ESR 78, although there was what appears to be a similar VP8EncoderSimulcastProxy class that's used instead. I also notice that the WebrtcVideoConduit::CreateEncoder() method where the constructor for the class is used (which is causing the linker error) is heavily amended by our patch 80.

This is going to require further investigation, but it does at least give us a clear set of leads to follow. But as it's late now, following these leads will have to wait until tomorrow. I'm feeling quite upbeat about this though: it's very clear progress.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
7 Jul 2024 : Day 281 #
Today I'm continuing my attempts to restore WebRTC to the browser. Last night I set off a completely clean build after re-enabling the WebRTC option in mozconfig.merqtxulrunner. I didn't expect the build to complete successfully and, indeed, it didn't. After four and a half hours it crashed out with an error while attempting a header include:
272:51.05 In file included from ${PROJECT}/gecko-dev/third_party/libwebrtc/
    webrtc/modules/desktop_capture/desktop_capturer.cc:17,
272:51.05                  from Unified_cpp_p_capture_generic_gn0.cpp:56:
272:51.05 ${PROJECT}/obj-build-mer-qt-xr/dist/system_wrappers/gtk/gtk.h:3:15: 
    fatal error: gtk/gtk.h: No such file or directory
272:51.05  #include_next <gtk/gtk.h>
272:51.06                ^~~~~~~~~~~
272:51.06 compilation terminated.
272:51.06 make[4]: *** [${PROJECT}/gecko-dev/config/rules.mk:694: 
    Unified_cpp_p_capture_generic_gn0.o] Error 1
Our build is Qt-, rather than GTK- based, so it's not surprising that there's a problem if it tries to include a GTK header. If we look at the source code we can see that the likely cause is that our change to the build options has resulted in the WEBRTC_USE_PIPEWIRE define macro being set.
#if defined(WEBRTC_USE_PIPEWIRE) || defined(USE_X11)
#include <gtk/gtk.h>
#include <gtk/gtkx.h>
#endif
Interestingly this code is the same in ESR 78, even after all of the patches have been applied. That suggests to me that, in fact, WEBRTC_USE_PIPEWIRE shouldn't be defined on our build even if WebRTC is enabled. That would make sense because it can't be the case that all builds of Gecko require the GTK headers.

Checking the GN build file we can see the following:
  if (rtc_use_pipewire) {
    sources += [
      &quot;linux/base_capturer_pipewire.cc&quot;,
      &quot;linux/base_capturer_pipewire.h&quot;,
      &quot;linux/screen_capturer_pipewire.cc&quot;,
      &quot;linux/screen_capturer_pipewire.h&quot;,
      &quot;linux/window_capturer_pipewire.cc&quot;,
      &quot;linux/window_capturer_pipewire.h&quot;,
    ]

    if (!build_with_mozilla) {
      configs += [
        &quot;:gio&quot;,
        &quot;:pipewire&quot;,
      ]
    } else {
      defines += [ &quot;WEBRTC_USE_PIPEWIRE&quot; ]
      include_dirs += [ &quot;/third_party/pipewire&quot; ]
    }
  }
This shows that the define is dependent on both the rtc_use_pipewire option being set and the build_with_mozilla option being set. Searching for rtc_use_pipewire and comparing against ESR 78 shows that this wasn't set in ESR 78:
$ diff \
    gecko-dev-project/gecko-dev/gecko-dev/media/webrtc/trunk/webrtc/webrtc.gni \
    gecko-dev-esr91/gecko-dev/gecko-dev/third_party/libwebrtc/webrtc/webrtc.gni 
74c74
<   rtc_use_pipewire = false
---
>   rtc_use_pipewire = true
This appears to be a change to the upstream default. So the thing to do here is to unset this option and then, I suppose, regenerate all of the build scripts.
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ cd gecko-dev/
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/arm64_False_arm64_linux.json
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/arm_False_arm_linux.json
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/x86_False_x86_linux.json
However, running these gives the same result as before: no change at all. That doesn't seem right to me, but it's always possible that the change will be propagated at build time automatically somehow. So I've set another clean build running again:
git clean -xdf
cd gecko-dev
git clean -xdf
cd ..
sfdk build -d -p --with git_workaround
[...]
As I mentioned earlier, yesterday the build ran for four and a half hours before hitting the error, so we should expect it to run for at least the same length of time again. So I'm off to focus on other things for a bit and will return once the build has completed, error or otherwise.

[...]

Unfortunately it's given the same error as before. What's frustrating is that I can quite clearly see the following in moz.build:
if CONFIG[&quot;CPU_ARCH&quot;] == &quot;aarch64&quot; and 
    CONFIG[&quot;OS_TARGET&quot;] == &quot;Linux&quot;:

    DEFINES[&quot;DISABLE_NACL&quot;] = True
    DEFINES[&quot;NO_TCMALLOC&quot;] = True
    DEFINES[&quot;WEBRTC_USE_PIPEWIRE&quot;] = True

    LOCAL_INCLUDES += [
        &quot;/third_party/pipewire/&quot;
    ]

    UNIFIED_SOURCES += [
        &quot;/third_party/libwebrtc/webrtc/modules/desktop_capture/linux/
    base_capturer_pipewire.cc&quot;,
        &quot;/third_party/libwebrtc/webrtc/modules/desktop_capture/linux/
    screen_capturer_pipewire.cc&quot;,
        &quot;/third_party/libwebrtc/webrtc/modules/desktop_capture/linux/
    window_capturer_pipewire.cc&quot;
    ]
So I probably should have just checked this before starting an entirely fresh build off again: it could have saved me some time. Anyhow, I'm a bit stumped. How to regenerate these GN json build files?

After trying a few things and not making much progress, I've decide to take a closer look at patch 78. That's the patch with the changes to the build configuration. The various moz.build files there look autogenerated, but the three json files — the GN input scripts — don't. Could it be that Denis edited these by hand?

It seems unlikely given they each have 6000 lines of changed code. But I don't see where else they would have come from. The WebRTC code has been moved in the build tree but otherwise seems pretty similar. So now I'm wondering if I can get away with simply copying these GN build files over from ESR 78 straight into ESR 91.

This is a bit of a risky way to proceed, but I don't have a lot to lose at this point. Besides, the way my repository is set up right now, it'd be super-easy to revert the changes. So I'm going to give it a go.
$ cp gecko-dev-project/gecko-dev/gecko-dev/media/webrtc/gn-configs/
    arm64_False_arm64_linux.json gecko-dev-esr91/gecko-dev/gecko-dev/dom/media/
    webrtc/third_party_build/gn-configs/
$ cp gecko-dev-project/gecko-dev/gecko-dev/media/webrtc/gn-configs/
    arm_False_arm_linux.json gecko-dev-esr91/gecko-dev/gecko-dev/dom/media/
    webrtc/third_party_build/gn-configs/
$ cp gecko-dev-project/gecko-dev/gecko-dev/media/webrtc/gn-configs/
    x86_False_x86_linux.json gecko-dev-esr91/gecko-dev/gecko-dev/dom/media/
    webrtc/third_party_build/gn-configs/
By looking through various bugs on Bugzilla (specifically Bug 1466254 and Bug 1658853) I've also discovered I can regenerate all of the build files from the GN json files in one go using just one command. This needs to be executed inside the scratchbox2 target of course:
$ ./mach build-backend -b GnMozbuildWriter
After running this I can see a large number of changed files, primarily moz.build files, all related to the build configuration. Maybe this has done the job? Given all of these changes I'm willing to run a build to see how it goes. Given there are changes to the build scripts it's going to have to be another completely clean build though.
git clean -xdf
cd gecko-dev
git clean -xdf
cd ..
sfdk build -d -p --with git_workaround
[...]
Once again, it'll likely take a while for this to either fail or go through. So that'll probably be my lot for the day.

[...]

It's now bed time. The build has been running for a bit over three and a half hours and continues onwards. So I'll definitely have to leave this until morning. Let's see how things look then.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
6 Jul 2024 : Day 280 #
Before I continue my investigation of the changes needed to get WebRTC integrated back into the browser, I want to first take stock. What's the current situation with the browser; what still needs to be done?

Well, the browser is now quite usable in my experience. There are some glitches, such as sporadic failures in the dark mode detection. But I wouldn't count that as critical failures. I've not really driven it hard (meaning, visited a large number of sites on the Web) but I've not generally experienced problems on the sites I've visited.

I've not been able to yet test the WebView running in the email client, but it works fine in the Settings app, so I expect it to work similarly well with the email app.

With these working correctly, it does feel like now would be a good time to roll out packages for others to try and lodge issues against. For users that have a second phone, or who are really willing to live on the cutting edge, I don't think it will be an unusable experience.

Nevertheless, it's also worth highlighting that I've not installed ESR 91 on my daily device. That's mostly just laziness on my part. I'm not sure I'd find it problematic if I did.

WebRTC and multimedia playback are the two obvious areas that need attention. Video works, but I'm not sure how it's happening or what backend it's using. Video playback isn't my area of expertise so I'm hoping Jolla will ultimately be able to help with this part.

There are some tidying up jobs to be done as well. Most notably I have to corral all of the commits I've made to the FIREFOX_ESR_91_9_X_RELBRANCH_patches branch into actual patches. I'm expecting this to be quite a bit of work; work that will require care and attention, but hopefully not too much thought.

So that's where things are at. Right now then, it's back to WebRTC so that I can cross the largest remaining item off this list. The ESR 78 WebRTC patches were developed by Denis Grigorev (d-grigorev), who I had the privilege of working with while I was at Jolla. Denis is a fantastic developer and did amazing work with these patches and I wouldn't want to have to replicate all of that from scratch. Frankly I'm not sure I could. My hope therefore is that I'll be able to apply the patches without too much drama.

As we saw yesterday there are five sequentially numbered patches that must be applied in order from 77 to 81 inclusive. Starting with the first patch, attempting to apply it gives disastrous results. None of the hunks apply successfully.

Checking the details of the patch it quickly becomes clear why: none of the files listed in the patch exist in the source tree any more. Or at least, they're not where they're supposed to be.

But checking the log for one of the missing files shows that it's been moved rather than removed:
$ git log -1 -- media/webrtc/trunk/webrtc/BUILD.gn
commit 9f1d2b5af54877aacf06605681a5b313c0441c51
Author: Dan Minor <dminor@mozilla.com>
Date:   Thu Sep 24 18:28:41 2020 +0000

    Bug 1665166 - Move media/webrtc/trunk/* to third-party/libwebrtc; r=ng
    
    Differential Revision: https://phabricator.services.mozilla.com/D91317
After a bit of experimentation I soon discover the description for this commit isn't quite correct. The file has actually been moved to third_party/libwebrtc (with an underscore rather than a hyphen). But having made this adjustment it appears that all of the relevant files have been moved there which, if correct, is going to make things much easier.

Looking through the patches I can see that the same issue is going to arise for patch 81. A quick bit of search-and-replace should fix this for both of them:
$ sed -i -e &quot;s/media\/webrtc\/trunk\//third_party\/libwebrtc\//g&quot; 
    0077-*.patch
$ sed -i -e &quot;s/media\/webrtc\/trunk\//third_party\/libwebrtc\//g&quot; 
    0081-*.patch
Now when I attempt to apply the first of these patches I get much better results:
$ git am ../rpm/0077-sailfishos-webrtc-Adapt-build-configuration-for-Sail.patch
Applying: Adapt build configuration for SailfishOS. JB#53756
Great! So that's patch 77 dealt with. Patch 78 is going to be a bit trickier, because this is the patch that applies the changes following the regeneration of the build process.

As we saw yesterday, there is in theory a simple command for regenerating the build configuration:
./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/webrtc.json
Checking the patch, we can see that the following files are the ones that need to be regenerated:
 .../gn-configs/arm64_False_arm64_linux.json   |  5874 +-------
 .../gn-configs/arm_False_arm_linux.json       |  6114 +--------
 .../gn-configs/x86_False_x86_linux.json       | 11339 ++++++++++++++++
I run the mach command inside the build engine like this:
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ cd gecko-dev/
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/arm64_False_arm64_linux.json
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/arm_False_arm_linux.json
$ ./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/x86_False_x86_linux.json
This all executes without incident, but in practice it doesn't have any effect. The build scripts are identical to what they were before: git status shows that nothing has changed. Could that be because the WebRTC is still disabled in our build configuration?

Looking back through my diary records I quickly find the option that's supposed to control this; we talked about it all the way back on Day 38. Here's the change that I committed back then:
$ git log -1 embedding/embedlite/config/mozconfig.merqtxulrunner
commit 577ff63ae39d49c83bab7fca86f2e1d32053d167
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Thu Sep 21 08:43:12 2023 +0100

    Disable WebRTC
    
    The WebRTC patches are large and tricky to apply. This change disables
    building of the WebRTC portions of the code so that reapplying the
    patches can be tackled later.
$ git diff 577ff63ae~ 577ff63ae
diff --git a/embedding/embedlite/config/mozconfig.merqtxulrunner b/embedding/
    embedlite/config/mozconfig.merqtxulrunner
index 2593d32fd5a9..bec8b01bb07b 100644
--- a/embedding/embedlite/config/mozconfig.merqtxulrunner
+++ b/embedding/embedlite/config/mozconfig.merqtxulrunner
@@ -46,7 +46,7 @@ ac_add_options --with-app-name=xulrunner-qt5
 # ac_add_options --enable-warnings-as-errors
 
 # disabling for now, since the build fails...
-ac_add_options --enable-webrtc
+ac_add_options --disable-webrtc
 ac_add_options --enable-profiling
 ac_add_options --disable-dbus
 ac_add_options --disable-necko-wifi
If I restored this so that the option is set to "enable-webrtc" rather than "disable-webrtc" then maybe I'll get better results? Let's see. In order for this change to work it'll need to be propagated through the build system, which means cleaning everything out and starting the build completely from scratch. So that's what I've done:
git clean -xdf
cd gecko-dev
git clean -xdf
cd ..
sfdk build -d -p --with git_workaround
[...]
I've set this running but I'm not exactly sure what to expect. My expectation was that the build would fail soon after the new build configuration had been generated, but after more than a quarter of an hour the build is still happily chugging away. It's also possible that it'll all just rebuild fine. That would be more unexpected, but you never know. Either way, it looks like it's not going to be the fast failure I thought it would be, so I'll need to leave the build running, probably overnight. We'll come back to this tomorrow to see how it's got on.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
5 Jul 2024 : Day 279 #
Today I'm continuing with the plan I hatched yesterday. I'm half way through Step 6 ("Build and check everything still works"), having built the packages overnight.

This morning I've done some basic testing: installed the packages, run the browser to view a page with some WebGL content, run the harbour-webview app.

I was a bit perturbed to find the browser failed to render WebGL if I executed it from the app drawer rather than the command line. But this got fixed when I deleted the startupCache:
$ rm ~/.local/share/org.sailfishos/browser/.mozilla/startupCache/*
It's luck that I noticed this, but I'm glad I did: glitches like this can cause an almighty headache if you don't discover them early on. I also tested the Settings app to check the WebView still works on the various Account creation pages. Using the ESR 91 WebView I was successfully able to view the Twitter (nee X) Terms of Service; I didn't go as far as agreeing to them of course.

So, Step 6 is completed.

Step 7: Take a deep breath. Commit the staged changes.

After making all of the changes and discarding anything unnecessary, there are now nineteen modified files to commit.
$ git diff --stat --cached
 dom/base/nsGlobalWindowOuter.cpp                 |   6 -
 gfx/gl/GLContext.cpp                             |  34 +--
 gfx/gl/GLContext.h                               |   6 +-
 gfx/gl/GLContextEGL.h                            |   6 +-
 gfx/gl/GLContextProviderEGL.cpp                  | 242 +++------------------
 gfx/gl/GLLibraryEGL.cpp                          |  11 +-
 gfx/gl/GLLibraryEGL.h                            |   6 +-
 gfx/gl/GLScreenBuffer.cpp                        |   2 +-
 gfx/gl/GLTextureImage.cpp                        |  26 +--
 gfx/gl/SharedSurface.cpp                         |   2 -
 gfx/gl/SharedSurface.h                           |  17 +-
 gfx/gl/SharedSurfaceEGL.h                        |   4 -
 gfx/gl/SharedSurfaceGL.cpp                       |  12 +-
 gfx/gl/SharedSurfaceGL.h                         |  16 ++
 gfx/layers/CompositorTypes.h                     |   1 -
 gfx/layers/client/TextureClient.cpp              |  42 +---
 gfx/layers/client/TextureClient.h                |   3 -
 gfx/layers/client/TextureClientSharedSurface.cpp |   1 -
 gfx/layers/ipc/CompositorVsyncScheduler.cpp      |   1 +
 19 files changed, 93 insertions(+), 345 deletions(-)
Perhaps the more interesting thing to check is what will happen to the commit I'm merging this into. Part of the aim here has been to reduce the size of this commit to make it more manageable and I'm interested to know whether or not this has been successful. Here's the incumbent commit as it currently looks:
$ git diff --stat HEAD~..HEAD
 dom/base/nsGlobalWindowOuter.cpp                 |   6 +
 gfx/gl/GLContext.cpp                             |  74 +++++-
 gfx/gl/GLContext.h                               |  21 +-
 gfx/gl/GLContextEGL.h                            |   6 +-
 gfx/gl/GLContextProviderEGL.cpp                  | 240 +++++++++++++++++-
 gfx/gl/GLContextProviderImpl.h                   |  23 ++
 gfx/gl/GLLibraryEGL.cpp                          |  11 +-
 gfx/gl/GLLibraryEGL.h                            |   6 +-
 gfx/gl/GLScreenBuffer.cpp                        | 274 +++++++++++++++++++++
 gfx/gl/GLScreenBuffer.h                          | 132 +++++++++-
 gfx/gl/GLTextureImage.cpp                        |  34 ++-
 gfx/gl/SharedSurface.cpp                         | 116 ++++++++-
 gfx/gl/SharedSurface.h                           | 122 ++++++++-
 gfx/gl/SharedSurfaceEGL.cpp                      |  89 +++++--
 gfx/gl/SharedSurfaceEGL.h                        |  34 ++-
 gfx/gl/SharedSurfaceGL.cpp                       |  64 ++++-
 gfx/gl/SharedSurfaceGL.h                         |  32 ++-
 gfx/gl/SurfaceTypes.h                            |   2 +
 gfx/gl/TextureImageEGL.cpp                       | 221 +++++++++++++++++
 gfx/gl/TextureImageEGL.h                         |  80 ++++++
 gfx/gl/moz.build                                 |   1 +
 gfx/layers/CompositorTypes.h                     |   1 +
 gfx/layers/client/TextureClient.cpp              |  42 +++-
 gfx/layers/client/TextureClient.h                |   3 +
 gfx/layers/client/TextureClientSharedSurface.cpp |  50 ++++
 gfx/layers/client/TextureClientSharedSurface.h   |  25 ++
 gfx/layers/ipc/CompositorVsyncScheduler.cpp      |   2 -
 gfx/layers/opengl/CompositorOGL.cpp              |  24 +-
 28 files changed, 1631 insertions(+), 104 deletions(-)
After merging in these changes to the commit, it'll look like this:
$ git diff --stat HEAD~
 gfx/gl/GLContext.cpp                             |  40 ++-
 gfx/gl/GLContext.h                               |  17 +-
 gfx/gl/GLContextProviderEGL.cpp                  |  36 +++
 gfx/gl/GLContextProviderImpl.h                   |  23 ++
 gfx/gl/GLScreenBuffer.cpp                        | 274 +++++++++++++++++++++
 gfx/gl/GLScreenBuffer.h                          | 132 +++++++++-
 gfx/gl/GLTextureImage.cpp                        |  10 +
 gfx/gl/SharedSurface.cpp                         | 114 ++++++++-
 gfx/gl/SharedSurface.h                           | 105 +++++++-
 gfx/gl/SharedSurfaceEGL.cpp                      |  89 +++++--
 gfx/gl/SharedSurfaceEGL.h                        |  30 ++-
 gfx/gl/SharedSurfaceGL.cpp                       |  74 +++++-
 gfx/gl/SharedSurfaceGL.h                         |  46 ++++
 gfx/gl/SurfaceTypes.h                            |   2 +
 gfx/gl/TextureImageEGL.cpp                       | 221 +++++++++++++++++
 gfx/gl/TextureImageEGL.h                         |  80 ++++++
 gfx/gl/moz.build                                 |   1 +
 gfx/layers/client/TextureClientSharedSurface.cpp |  49 ++++
 gfx/layers/client/TextureClientSharedSurface.h   |  25 ++
 gfx/layers/ipc/CompositorVsyncScheduler.cpp      |   1 -
 gfx/layers/opengl/CompositorOGL.cpp              |  24 +-
 21 files changed, 1334 insertions(+), 59 deletions(-)
There are big gains in reducing the diff for GLContextProviderEGL.cpp and the only losers are SharedSurfaceGL.cpp and SharedSurfaceGL.h. But the wins are much greater than the losses, so I'm happy with this. That means it's time to apply the changes to the previous commit.

Deep breath.
$ git log -1
commit 7437a9d17284bd9a4427502b1af6e94a9d7ee356 (HEAD -> 
    FIREFOX_ESR_91_9_X_RELBRANCH_patches)
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Wed May 22 08:35:49 2024 +0100

    Restore GLScreenBuffer and TextureImageEGL
    
    TextureImageEGL was removed with Bug 1716559:
    
    https://phabricator.services.mozilla.com/D117904
    
    GLScreenBuffer was removed with Bug 1632249:
    
    https://phabricator.services.mozilla.com/D75055
    
    Both are used as part of the Sailfish OS WebView offscreen render
    pipeline. This patch reintroduces these two classes and allows them to
    be made use of.
$ git commit --amend
[FIREFOX_ESR_91_9_X_RELBRANCH_patches 1765dd0460a9] Restore GLScreenBuffer and 
    TextureImageEGL
 Date: Wed May 22 08:35:49 2024 +0100
 21 files changed, 1334 insertions(+), 59 deletions(-)
 create mode 100644 gfx/gl/TextureImageEGL.cpp
 create mode 100644 gfx/gl/TextureImageEGL.h
Step 7 completed. On to Step 8.

Step 8: Admire the resulting, much cleaner, GLSCreenBuffer commit now at HEAD.

Looking through the diff, I think it'll make a decent patch. It remains uncomfortably big — not ideal — but I've now spent a lot of time cutting it down to the bare minimum and I'm satisfied that this is the best I can hope for, at least for now. Certainly it's not going to be worth spending more time trying to cut it down further.

Which leave me with just one more thing to do.

Step 9: Another deep breath. Force push everything to the server. Done.

At least, the plan was to force push. But interestingly, it turns out I don't need to force push at all. It's been so long since I started work on restoring GLScreenBuffer that I thought I must have pushed some of the changes to the remote. But it turns out I hadn't.

This is great news! It means I won't be doing anything destructive by pushing. No need to take any deep breaths after all!
$ git push
Enumerating objects: 92, done.
Counting objects: 100% (92/92), done.
Delta compression using up to 16 threads
Compressing objects: 100% (51/51), done.
Writing objects: 100% (51/51), 22.70 KiB | 2.84 MiB/s, done.
Total 51 (delta 40), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (40/40), completed with 37 local objects.
To github.com:llewelld/gecko-dev-mirror.git
   c6ea49286566..1765dd0460a9  FIREFOX_ESR_91_9_X_RELBRANCH_patches -> 
    FIREFOX_ESR_91_9_X_RELBRANCH_patches
$ cd ..
$ git push
Enumerating objects: 13, done.
Counting objects: 100% (13/13), done.
Delta compression using up to 16 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.46 KiB | 373.00 KiB/s, done.
Total 7 (delta 5), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (5/5), completed with 5 local objects.
To github.com:llewelld/gecko-dev.git
   9f64ce35a187..5b9c330998db  sailfishos-esr91 -> sailfishos-esr91
Done!

So there we are. I can finally draw a line under the WebView rendering changes. It's a line drawn in pencil of course as I'm sure I'll find things I need to come back to, but for now it's time to move on.

The next thing I want to tackle is another Web-related acronym: WebRTC. I intentionally disabled this back on Day 38 in order to get the build to go through. And as we'll see there's a reason I didn't tackle it head on at the time. Now there's no avoiding it: it's time to swallow the bitter pill and get to it.

There are five patches which relate to WebRTC. Some look tractable. But others... well... I don't fancy the chances of that 2.4 MiB patch applying cleanly:
$ grep -rInl &quot;webrtc&quot; * --include=&quot;*.patch&quot; | xargs  ls 
    -lah | awk '{print t$9 &quot;\t&quot; $5}'
0077-sailfishos-webrtc-Adapt-build-configuration-for-Sail.patch 5.6K
0078-sailfishos-webrtc-Regenerate-moz.build-files.-JB-537.patch 2.4M
0079-sailfishos-webrtc-Disable-desktop-sharing-feature-on.patch 2.1K
0080-sailfishos-webrtc-Enable-GMP-for-encoding-decoding.-.patch 33K
0081-sailfishos-webrtc-Implement-video-capture-module.-JB.patch 29K
In practice the really big patch — patch 0078 — is due to changes to the build system which have been auto-generated. Specifically it relates to components that would usually be built using GN, but where the build configuration has been automgically converted into the moz.build files used by the Gecko build system. Here's the text from the docs explaining what's going on here;
 
GN is a third-party build tool used by chromium and some related projects that are vendored in mozilla-central. Rather than requiring GN to build or writing our own build definitions for these projects, we have support in the build system for translating GN configuration files into moz.build files. In most cases these moz.build files will be like any others in the tree (except that they shouldn’t be modified by hand), however those updating vendored code or building on platforms not supported by Mozilla automation may need to re-generate these files. This is a per-project process, described in dom/media/webrtc/third_party_build/gn-configs/README.md for webrtc. As of writing, it is very specific to webrtc, and likely doesn’t work as-is for other projects.

The problem is that the README.md file mentioned in the explanation is missing from the source tree. A bit of a search throws up a potentially useful comment in Bugzilla that links through to some text that's likely the README.md being referenced. It says the following:
 
Generate new gn json files and moz.build files for building libwebrtc in our tree
/!\ This is only supported on Linux and macOS. If you are on Windows, you can run the script under WSL.

1. The script should be run from the top directory of our firefox tree.
./mach python python/mozbuild/mozbuild/gn_processor.py dom/media/webrtc/
    third_party_build/gn-configs/webrtc.json
2. Checkin all the generated/modified files and try your build!

Adding new configurations to the build

Edit the main function in the python/mozbuild/mozbuild/gn_processor.py file.


Honestly that doesn't seem as bad as I'd feared. But this will of course constitute only one of the five patches that need to be applied. So tomorrow I'll start looking at the other patches to see what's really going to be involved here.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
4 Jul 2024 : Day 278 #
Finally, after much investigation, refinement and guesswork, I finally got the WebView and WebGL working simultaneously yesterday. The key realisation was that the SharedSurface_Basic class needed different functionality depending on whether it was used for one or the other. This was needed as a result of the changes that were made upstream to the WebGL rendering pipeline.

But solving this has also brought other benefits. The process I used to get here was to start with a version with working WebView but broken WebGL, then I started removing all of the GLScreenBuffer code (needed for the WebView) until I had working WebGL as well. But I overshot (embarrassingly and for reasons I won't go in to here) and so then started restoring the GLScreenBuffer code again gently until eventually I hit the critical point where adding or removing a single line resulted in either the WebGL or the WebView working, but not both. That gave me the info I needed to fix the issue, after which I added in just enough more of the remaining code to tidy things up.

As a result I've now added 101 lines to the previous code in order to fix things, but more importantly I've removed 349 lines as well:
$ git diff --stat
 dom/base/nsGlobalWindowOuter.cpp                 |   6 -
 gfx/gl/GLContext.cpp                             |  35 +--
 gfx/gl/GLContext.h                               |   7 +-
 gfx/gl/GLContextEGL.h                            |   6 +-
 gfx/gl/GLContextProviderEGL.cpp                  | 242 +++------------------
 gfx/gl/GLLibraryEGL.cpp                          |  11 +-
 gfx/gl/GLLibraryEGL.h                            |   6 +-
 gfx/gl/GLScreenBuffer.cpp                        |   2 +-
 gfx/gl/GLTextureImage.cpp                        |  26 +--
 gfx/gl/SharedSurface.cpp                         |   3 +-
 gfx/gl/SharedSurface.h                           |  20 +-
 gfx/gl/SharedSurfaceEGL.cpp                      |   1 -
 gfx/gl/SharedSurfaceEGL.h                        |   4 -
 gfx/gl/SharedSurfaceGL.cpp                       |  12 +-
 gfx/gl/SharedSurfaceGL.h                         |  12 +
 gfx/layers/CompositorTypes.h                     |   1 -
 gfx/layers/client/TextureClient.cpp              |  42 +---
 gfx/layers/client/TextureClient.h                |   3 -
 gfx/layers/client/TextureClientSharedSurface.cpp |   2 +-
 gfx/layers/ipc/CompositorVsyncScheduler.cpp      |   1 +
 gfx/layers/opengl/CompositorOGL.cpp              |   8 +
 21 files changed, 101 insertions(+), 349 deletions(-)
So the delta represents the removal of 248 lines, which is what I was hoping for. It should mean the patch for reintroducing GLSCreenBuffer will be much simpler than it would otherwise have been.

I now have to format things into commits. As I just explained the current diff on my local machine removes a bunch of code and eventually I want to merge this diff with the last commit, described as follows:
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Wed May 22 08:35:49 2024 +0100

    Restore GLScreenBuffer and TextureImageEGL
    
    TextureImageEGL was removed with Bug 1716559:
    
    https://phabricator.services.mozilla.com/D117904
    
    GLScreenBuffer was removed with Bug 1632249:
    
    https://phabricator.services.mozilla.com/D75055
    
    Both are used as part of the Sailfish OS WebView offscreen render
    pipeline. This patch reintroduces these two classes and allows them to
    be made use of.
However, the code my local diff removes actually is there for a reason. Looking through it, I can see three separate reasons for why I made the changes in the first place:
 
  1. To pass the EGLDisplay value through the system.
  2. To support Wayland surfaces.
  3. To ensure texture data is only destroyed on the main thread.


The first of these involved changes added by me when I was scrabbling around trying to fix the WebView first time around. There were upstream changes to the way the EGLDisplay value was being handled that made me suspect it might be causing the WebView to fail. In retrospect this didn't turn out to be a factor, so I think it'll be safe to remove these changes. I'll just drop them.

The second is functionality that may be needed for native ports and the emulator. It doesn't seem to have any impact on my Xperia 10 III, but that's a libhyris port and doesn't appear to need these changes. Consequently I'm going to drop these changes too, but before I do I want to capture a diff of them, so that I can restore them later if it turns out they're important for native ports.

The third set of changes were to revert changeset D106879 the result of which was to restore the horrifically named mWorkaroundAnnoyingSharedSurfaceLifetimeIssues variable.
commit 6f458ab05edc6c086e304c999365f6b375c46640
Author: sotaro <sotaro.ikeda.g@gmail.com>
Date:   Tue Mar 2 10:48:55 2021 +0000

    Bug 1695846 - Remove unused flags of TextureClient r=nical
    
    Flags were used for WebGL SharedSurface
    
    Differential Revision: https://phabricator.services.mozilla.com/D106879
I'm not able to recall off the top of my head why I reversed this changes, but the good news is that I did write about it, back on Day 187. Referring back I can see that at that time I thought it might have been the reason for the browser seizing up. I eventually found the actual underlying reason, which was unrelated. But I never removed this attempted workaround.

So it should now be safe to have this removed. It'll be good not to have that variable included any more. I'll drop this code and moreover I notice there are some further changes in the changeset which I can apply as well and which should safely reduce the size of the patch a little more.

Okay, here's my rubric for the next few steps.
  1. Take a diff of the my current, uncommitted, local changes. But make it a reverse diff so that it switches additions and removals. This will allow me to get back to my current position from the last commit in case I screw something up.
  2. Stage all of the changes that aren't related to the three points above (EGLDisplay, Wayland surfaces and texture destruction) using git add -p -u . so that things are a bit cleaner. Add these changes to the latest commit using git commit --amend.
  3. Stage all of the changes related to 1 and 3 above (EGLDisplay and texture destruction) using git add -p -u .. Add these changes to the latest commit.
  4. Create a patch from the remaining changes, which should represent those needed to support Wayland surfaces. Keep this patch somewhere safe.
  5. Apply all of the remaining changes from D106879.
  6. Build and check that everything still works.
  7. Take a deep breath. Commit the staged changes.
  8. Admire the resulting, much cleaner, GLSCreenBuffer commit now at HEAD.
  9. Another deep breath. Force push everything to the server. Done.
That looks very much like a plan. Time to execute it.

Step 1. First I'm going to save out a reverse of the diff containing all of my current changes. This is in case something goes wrong, it'll give me something to refer back to so that I can restore things to the current state.
$ git diff -R > ../../glscreenbuffer-changes.diff
$ head ../../glscreenbuffer-changes.diff 
diff --git b/dom/base/nsGlobalWindowOuter.cpp a/dom/base/nsGlobalWindowOuter.cpp
index 41c93c51cf3b..31a93ea1cc70 100644
--- b/dom/base/nsGlobalWindowOuter.cpp
+++ a/dom/base/nsGlobalWindowOuter.cpp
@@ -4542,10 +4542,16 @@ nsresult nsGlobalWindowOuter::SetFullscreenInternal(
    FullscreenReason aReason,
   if (rootItem != mDocShell)
     return window->SetFullscreenInternal(aReason, aFullscreen);
 
+  // We don't have chrome from doc shell point of view. This commit
+  // sha1 3116f3bf53df offends fullscreen API to work without chrome
That looks good.

Step 2. Clean up and stage all of the changes that are clearly needed and not related to the three areas listed above.
$ git add -p -u .
[...]
Great; all done, no issues.

Step 3. In this step I'm going to add all of the code that's not related to the Wayland surface.
$ git add -p -u .
[...]
This turned out to be more straightforward than I'd feared. The Wayland surface changes are now the only thing remaining and in fact all live inside the one file: GLContextProviderEGL.cpp.

Step 4. Create a diff of the changes needed to remove the Wayland surface changes. Reverse it so that it shows how to add rather than remove the changes.
$ git diff -R > ../../wayland-surface-changes.diff
flypig@chattan:~/Documents/Development/jolla/gecko-dev-esr91/gecko-dev/
    gecko-dev$ head ../../wayland-surface-changes.diff 
diff --git b/gfx/gl/GLContextProviderEGL.cpp a/gfx/gl/GLContextProviderEGL.cpp
index 1632d663d101..b5ff16a9e522 100644
--- b/gfx/gl/GLContextProviderEGL.cpp
+++ a/gfx/gl/GLContextProviderEGL.cpp
@@ -85,6 +85,14 @@
 #  endif
 #endif
 
+#ifdef MOZ_WIDGET_QT
+#include <qpa/qplatformnativeinterface.h>
Okay, good, all looking fine.

Step 5. Apply all of the remaining changes from D106879. I'm going to do this carefully in hunks just so that I don't make any errors.
$ git add -p -u .
[...]
Step 6. I'm going to do a full build, so this will be an overnight activity. It's a shame I can't get all of the steps completed today, but this is definitely the right thing to do. Steady as she goes.
$ cd ..
$ sfdk build -d --with git_workaround
[...]
So, I've set the build going. I'll continue with the rest of Step 6 and the remaining steps tomorrow. We're getting there and, honestly, I think it's looking pretty good!

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
3 Jul 2024 : Day 277 #
I'm continuing to restore bits of WebView functionality today, in the hope that the renderer will click into place and start rendering. I've convinced myself it's only a matter of time now.

Yesterday I tried restoring the TextureClient destruction code, but that didn't have any obvious effects so I removed the changes again. Today I'm trying something else: restoring the augmented fBindFramebuffer() functionality.

This is unusual because in default Gecko fBindFramebuffer() is just a wrapper for glBindFramebuffer from the OpenGL library. In previous EmbedLite versions of Gecko the wrapper has been given additional functionality which we can find in GLContext.cpp and which looks like this:
void GLContext::fBindFramebuffer(GLenum target, GLuint framebuffer) {
  if (!mScreen) {
    raw_fBindFramebuffer(target, framebuffer);
    return;
  }

  switch (target) {
    case LOCAL_GL_DRAW_FRAMEBUFFER_EXT:
      mScreen->BindDrawFB(framebuffer);
      return;

    case LOCAL_GL_READ_FRAMEBUFFER_EXT:
      mScreen->BindReadFB(framebuffer);
      return;

    case LOCAL_GL_FRAMEBUFFER:
      mScreen->BindFB(framebuffer);
      return;

    default:
      // Nothing we care about, likely an error.
      break;
  }

  raw_fBindFramebuffer(target, framebuffer);
}
With this change the raw_fBindFramebuffer() becomes the new name for the glBindFramebuffer() wrapper. As you can see, now when fBindFramebuffer() is called some framebuffer binding is also performed.

The additional functionality isn't needed everywhere though, and in fact there are a few places where we then have to switch the calls from using fBindFramebuffer() to using the new raw wrapper instead:
@ -215,11 +215,11 @@ void GLScreenBuffer::BindFB(GLuint fb) {
   mInternalReadFB = (fb == 0) ? readFB : fb;
 
   if (mInternalDrawFB == mInternalReadFB) {
+    mGL->raw_fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, mInternalDrawFB);
-    mGL->fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, mInternalDrawFB);
   } else {
     MOZ_ASSERT(mGL->IsSupported(GLFeature::split_framebuffer));
+    mGL->raw_fBindFramebuffer(LOCAL_GL_DRAW_FRAMEBUFFER_EXT, mInternalDrawFB);
+    mGL->raw_fBindFramebuffer(LOCAL_GL_READ_FRAMEBUFFER_EXT, mInternalReadFB);
-    mGL->fBindFramebuffer(LOCAL_GL_DRAW_FRAMEBUFFER_EXT, mInternalDrawFB);
-    mGL->fBindFramebuffer(LOCAL_GL_READ_FRAMEBUFFER_EXT, mInternalReadFB);
   }
 
 #ifdef DEBUG
@@ -235,7 +235,7 @@ void GLScreenBuffer::BindDrawFB(GLuint fb) {
   mUserDrawFB = fb;
   mInternalDrawFB = (fb == 0) ? drawFB : fb;
 
+  mGL->raw_fBindFramebuffer(LOCAL_GL_DRAW_FRAMEBUFFER_EXT, mInternalDrawFB);
-  mGL->fBindFramebuffer(LOCAL_GL_DRAW_FRAMEBUFFER_EXT, mInternalDrawFB);
 
 #ifdef DEBUG
   mInInternalMode_DrawFB = false;
@@ -249,7 +249,7 @@ void GLScreenBuffer::BindReadFB(GLuint fb) {
   mUserReadFB = fb;
   mInternalReadFB = (fb == 0) ? readFB : fb;
 
+  mGL->raw_fBindFramebuffer(LOCAL_GL_READ_FRAMEBUFFER_EXT, mInternalReadFB);
-  mGL->fBindFramebuffer(LOCAL_GL_READ_FRAMEBUFFER_EXT, mInternalReadFB);
 
 #ifdef DEBUG
   mInInternalMode_ReadFB = false;
These aren't all of the uses though and in the remaining cases the new augmented functionality will be called instead. This is another nicely self-contained change without which I could perfectly well imagine the rendering might be broken. So restoring this functionality might have positive effects.

Let's see. I've made these changes and set the partial build running. Soon I'll be able to test this out.

[...]
 
Two screenshots, both look identical with WebGL content; on the left the browser, on the right the WebView; shader written by Shane

And it works! The WebView renders nicely. The browser renders nicely. Both of them render WebGL nicely (props to Shane for the nice shader example). Great!

Phew. This concludes the difficult part of this process. Sadly it doesn't mean the end of the journey just yet though, oh no: I still need to turn this into a tidy commit and figure out what to do with the remaining changes (the ones that don't appear to be necessary). And to avoid any undue excitement, there are more things to fix after that as well.

Those will all be for tomorrow and beyond though. For today, I'm going to celebrate (in a quiet contemplative way!) and enjoy the win.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
2 Jul 2024 : Day 276 #
Last night the new packages built so that I now have a version with working WebGL and partially working WebView. Crucially, this version splits off the functionality needed for working WebGL from the WebView functionality, meaning I should now be able to reintroduce the WebView functionality without it affecting the WebGL rendering.

So the next steps will be to reintroduce as much of the WebView changes needed in order to get WebView working, but preferably no more than that.

Before getting on to that I have one other issue I want to address. Although the browser is now working, along with the WebGL rendering, there are also now spurious crashes happening, presumably as a result of some of these changes. The backtrace associated with the crash is basically useless, even with a fully installed and correct set of debug symbols:
Thread 8 &quot;GeckoWorkerThre&quot; received signal SIGSEGV, Segmentation 
    fault.
[Switching to LWP 30258]
0x0000007fe5ee29a0 in ?? ()
(gdb) bt
#0  0x0000007fe5ee29a0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 
My hunch is that this relates to the changes to SharedSurface_Basic. In particular, although I changed the code so that a different constructor is used to create the SharedSurface_Basic objects depending on the type of rendering, the process of destruction, which used to rely on just the default destructor, now has bespoke functionality:
SharedSurface_Basic::~SharedSurface_Basic() {
  if (!mDesc.gl || !mDesc.gl->MakeCurrent()) return;

  if (mFB) mDesc.gl->fDeleteFramebuffers(1, &mFB);

  if (mOwnsTex) mDesc.gl->fDeleteTextures(1, &mTex);
}
This code needs to be there for the WebView to work, but it was never there in the WebGL version. My guess is that as surfaces are destroyed periodically, they're calling some of this code which is then triggering a crash. In particular, I notice that mFB isn't being set in the WebGL constructor, which means it could have a value other than zero. If this were the case, it could be causing problems when this uninitialised value gets passed to the OpenGL library, which could in turn explain why the backtrace is so poor.

As an attempt to fix this I've added in a line to initialise mFB to zero, which should be enough to prevent the new lines in the destructor from executing.
 SharedSurface_Basic::SharedSurface_Basic(const SharedSurfaceDesc& desc,
                                          UniquePtr<MozFramebuffer>&& fb)
     : SharedSurface(desc, std::move(fb)),
       mTex(0),
       mOwnsTex(false),
+      mFB(0)
 {
 }
Great, that's done the trick! No more random crashes. So, now I can move on to the larger task of getting the WebView to work. It's a bigger task, but I'm also hoping it'll be relatively formulaic, given that I already have a working version to compare against. I just need to gradually reintroduce the code from that version until it all clicks into place and works. The plan is that, with the WebGL rendering now separated from it, what I'll end up with is a working WebView and working WebGL.

So I'm looking through the code that's been amended and it looks like there are some relatively self-contained changes related to destruction of TextureClient instances. By reverting the changes from the following files, I should get a slice of the changes back with minimal fuss:
  1. gfx/layers/CompositorTypes.h
  2. gfx/layers/client/TextureClient.h
  3. gfx/layers/client/TextureClient.cpp
  4. gfx/layers/client/TextureClientSharedSurface.cpp
  5. gfx/layers/ipc/CompositorVsyncScheduler.cpp
I'm not sure if this will actually have any effect though, so I'm going to make a copy of the diff before reverting it. That way I can restore the changes if needed.
$ git diff gfx/layers/CompositorTypes.h \
    gfx/layers/client/TextureClient.h \
    gfx/layers/client/TextureClient.cpp \
    gfx/layers/client/TextureClientSharedSurface.cpp \
    gfx/layers/ipc/CompositorVsyncScheduler.cpp \
    > destroy-texture-changes.diff
$ git checkout gfx/layers/CompositorTypes.h \
    gfx/layers/client/TextureClient.h \
    gfx/layers/client/TextureClient.cpp \
    gfx/layers/client/TextureClientSharedSurface.cpp \
    gfx/layers/ipc/CompositorVsyncScheduler.cpp
I'm a little surprised to discover that after resetting the files to revert the changes the code still builds just fine without needing any further adjustments. But when I test out the new library I don't see any obvious positive (or indeed negative) effects. I was hoping for something different. I guess this isn't the critical difference that I'm looking for.

So I'm going to restore thee changes and try something else tomorrow. I feel like getting the WebGL not to crash was a good step forwards today. It's a shame the TextureClient changes didn't work, but that's still something to cross off my list of changes to try. So that's progress.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
1 Jul 2024 : Day 275 #
After completing a partial build yesterday we were left with working WebGL when using the browser and a partially working WebView. My suspicion was that a regression has resulted in the same hanging that we experienced much earlier in the process, but I have yet to investigate that.

More to the point, I want to find out where SharedSurface_Basic objects get created (there may be multiple) during each of the different permutations of rendering involving the browser, WebView, non-WebGL-adorned pages and pages with WebGL content.

So that I can do this properly using the debugger I left a build running overnight. When I got up this morning it hadn't yet completed, but it was in the next best state: everything was already being packaged up into rpms. I've never seen it fail from this point onwards and it would usually take no more than a few minutes from here, so this was good to see.
 
A console showing the gecko build about to finish

And indeed it did complete very quickly. So the build was successful and I now have packages to test. But before I do test them, I'm going to run with the original working WebGL packages to find out where the ShareSurface_Basic objects, if any, get created. To test this I've placed a breakpoint on the ShareSurface_Basic::Create() method.

What I find is that there's no call to ShareSurface_Basic::Create() when using the browser generally, unless there's WebGL content on the page. If there is WebGL content it gets called like this:
Thread 8 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, mozilla::gl::
    SharedSurface_Basic::Create (desc=...)
    at gfx/gl/SharedSurfaceGL.cpp:20
20          const SharedSurfaceDesc& desc) {
(gdb) bt
#0  mozilla::gl::SharedSurface_Basic::Create (desc=...)
    at gfx/gl/SharedSurfaceGL.cpp:20
#1  0x0000007ff28a8130 in mozilla::gl::SurfaceFactory_Basic::CreateSharedImpl (
    this=<optimized out>, desc=...)
    at gfx/gl/SharedSurfaceGL.h:41
#2  0x0000007ff28aaee8 in mozilla::gl::SurfaceFactory::CreateShared (size=..., 
    this=0x7fc98a1980)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefCounted.h:240
#3  mozilla::gl::SwapChain::Acquire (this=this@entry=0x7fc949ac48, size=...)
    at gfx/gl/GLScreenBuffer.cpp:56
#4  0x0000007ff369d340 in mozilla::WebGLContext::PresentInto (
    this=this@entry=0x7fc949a7b0, swapChain=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#5  0x0000007ff369d7ac in mozilla::WebGLContext::Present (
    this=this@entry=0x7fc949a7b0, xrFb=<optimized out>,
    consumerType=consumerType@entry=mozilla::layers::TextureType::Unknown, 
    webvr=webvr@entry=false)
    at dom/canvas/WebGLContext.cpp:9
36
#6  0x0000007ff36656a8 in mozilla::HostWebGLContext::Present (webvr=false, 
    t=mozilla::layers::TextureType::Unknown, xrFb=<optimized out>,
    this=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/
    mozilla/RefPtr.h:280
#7  mozilla::ClientWebGLContext::Run<void (mozilla::HostWebGLContext::*)(
    unsigned long, mozilla::layers::TextureType, bool) const, &(mozilla::
    HostWebGLCont
ext::Present(unsigned long, mozilla::layers::TextureType, bool) const), 
    unsigned long, mozilla::layers::TextureType const&, bool const&> (
    this=<optimized out>, args#0=@0x7fdf2f26c0: 0, args#1=@0x7fdf2f26bf: 
    mozilla::layers::TextureType::Unknown, args#2=@0x7fdf2f26be: false)
    at dom/canvas/ClientWebGLContext
.cpp:313
#8  0x0000007ff3665810 in mozilla::ClientWebGLContext::Present (
    this=this@entry=0x7fc9564c10, xrFb=xrFb@entry=0x0, type=<optimized out>,
    webvr=<optimized out>, webvr@entry=false)
    at dom/canvas/ClientWebGLContext
.cpp:363
#9  0x0000007ff3691020 in mozilla::ClientWebGLContext::OnBeforePaintTransaction 
    (this=0x7fc9564c10)
    at dom/canvas/ClientWebGLContext
.cpp:345
#10 0x0000007ff290058c in mozilla::layers::CanvasRenderer::
    FirePreTransactionCallback (this=this@entry=0x7fc9840c50)
    at gfx/layers/CanvasRenderer.cpp
:75
#11 0x0000007ff29b1228 in mozilla::layers::ShareableCanvasRenderer::
    UpdateCompositableClient (this=0x7fc9840c50)
    at gfx/layers/ShareableCanvasRen
derer.cpp:192
#12 0x0000007ff29f0dc4 in mozilla::layers::ClientCanvasLayer::RenderLayer (
    this=0x7fc989ee20)
    at gfx/layers/client/ClientCanva
sLayer.cpp:25
#13 0x0000007ff29efee4 in mozilla::layers::ClientLayer::RenderLayerWithReadback 
    (this=<optimized out>, aReadback=<optimized out>)
    at gfx/layers/client/ClientLayer
Manager.h:365
#14 0x0000007ff2a00224 in mozilla::layers::ClientContainerLayer::RenderLayer (
    this=0x7fc9898670)
    at gfx/layers/Layers.h:1051
#15 0x0000007ff29efee4 in mozilla::layers::ClientLayer::RenderLayerWithReadback 
    (this=<optimized out>, aReadback=<optimized out>)
    at gfx/layers/client/ClientLayer
Manager.h:365
#16 0x0000007ff2a00224 in mozilla::layers::ClientContainerLayer::RenderLayer (
    this=0x7fc963f3f0)
    at gfx/layers/Layers.h:1051
#17 0x0000007ff29efee4 in mozilla::layers::ClientLayer::RenderLayerWithReadback 
    (this=<optimized out>, aReadback=<optimized out>)
    at gfx/layers/client/ClientLayer
Manager.h:365
#18 0x0000007ff2a00224 in mozilla::layers::ClientContainerLayer::RenderLayer (
    this=0x7fc984bb20)
    at gfx/layers/Layers.h:1051
#19 0x0000007ff2a06f08 in mozilla::layers::ClientLayerManager::
    EndTransactionInternal (this=this@entry=0x7fc8b18820, 
    aCallback=aCallback@entry=
    0x7ff46a379c <mozilla::FrameLayerBuilder::DrawPaintedLayer(mozilla::layers::
    PaintedLayer*, gfxContext*, mozilla::gfx::IntRegionTyped<mozilla::gfx::Unkn
ownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> 
    const&, mozilla::layers::DrawRegionClip, mozilla::gfx::
    IntRegionTyped<mozilla::g
fx::UnknownUnits> const&, void*)>, 
    aCallbackData=aCallbackData@entry=0x7fdf2f3268)
    at gfx/layers/client/ClientLayerManager.cpp:341
#20 0x0000007ff2a11e08 in mozilla::layers::ClientLayerManager::EndTransaction (
    this=0x7fc8b18820, 
    aCallback=0x7ff46a379c <mozilla::FrameLayerBuilder::DrawPaintedLayer(
    mozilla::layers::PaintedLayer*, gfxContext*, mozilla::gfx::
    IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::
    IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::layers::
    DrawRegionClip, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> 
    const&, void*)>, aCallbackData=0x7fdf2f3268, aFlags=mozilla::layers::
    LayerManager::END_DEFAULT)
    at gfx/layers/client/ClientLayerManager.cpp:397
#21 0x0000007ff46a0bbc in nsDisplayList::PaintRoot (
    this=this@entry=0x7fdf2f5078, aBuilder=aBuilder@entry=0x7fdf2f3268, 
    aCtx=aCtx@entry=0x0, 
    aFlags=aFlags@entry=13, aDisplayListBuildTime=...)
    at layout/painting/nsDisplayList.cpp:2622
#22 0x0000007ff442cf18 in nsLayoutUtils::PaintFrame (
    aRenderingContext=aRenderingContext@entry=0x0, 
    aFrame=aFrame@entry=0x7fc93460b0, aDirtyRegion=..., 
    aBackstop=aBackstop@entry=4294967295, 
    aBuilderMode=aBuilderMode@entry=nsDisplayListBuilderMode::Painting, 
    aFlags=aFlags@entry=(nsLayoutUtils::PaintFrameFlags::WidgetLayers | 
    nsLayoutUtils::PaintFrameFlags::ExistingTransaction | nsLayoutUtils::
    PaintFrameFlags::NoComposite)) at ${PROJECT}/obj-build-mer-qt-xr/dist/
    include/mozilla/MaybeStorageBase.h:80
#23 0x0000007ff43b760c in mozilla::PresShell::Paint (
    this=this@entry=0x7fc92c2810, aViewToPaint=aViewToPaint@entry=0x7fc92a0d20, 
    aDirtyRegion=..., 
    aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers)
    at layout/base/PresShell.cpp:6400
#24 0x0000007ff41ef4dc in nsViewManager::ProcessPendingUpdatesPaint (
    this=this@entry=0x7fc92a0cb0, aWidget=aWidget@entry=0x7fc9250910)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/RectAbsolute.h:43
[...]
#57 0x0000007fefbae89c in ?? () from /lib64/libc.so.6
(gdb) 
That's a rather long backtrace, but it captures what I need to know, which is that it's not getting called from the CompositorOGL code, but rather from the painting code. That's because it's being created specifically for the WebGL entity on the page.

It'll also be useful to find out whether CompositorOGL::CreateContext() gets called when rendering browser pages. That's useful because it'll tell us why the code there isn't creating a SharedSurface_Basic object. So I've placed a breakpoint on the method and will step through to find out. This, again, is using the code installed from our original packages with working WebGL and on a page that includes WebGL content:
Thread 37 &quot;Compositor&quot; hit Breakpoint 3, mozilla::layers::
    CompositorOGL::CreateContext (this=this@entry=0x7ed81a1c80)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() 
    {
(gdb) n
231       nsIWidget* widget = mWidget->RealWidget();
(gdb) n
232       void* widgetOpenGLContext =
(gdb) n
234       if (widgetOpenGLContext) {
(gdb) p widgetOpenGLContext
$1 = (void *) 0x7ed81a6fe0
(gdb) n
mozilla::layers::CompositorOGL::Initialize (this=0x7ed81a1c80, 
    out_failureReason=0x7f1651d510)
    at gfx/layers/opengl/CompositorOGL.cpp:395
395         mGLContext = CreateContext();
Stepping through the code, after reaching the condition on widgetOpenGLContext, we then find ourselves outside the CompositorOGL::CreateContext() method:
(gdb) bt 2
#0  mozilla::layers::CompositorOGL::Initialize (this=0x7ed81a1c80, 
    out_failureReason=0x7f1651d510)
    at gfx/layers/opengl/CompositorOGL.cpp:395
#1  0x0000007ff2a66d88 in mozilla::layers::CompositorBridgeParent::
    NewCompositor (this=this@entry=0x7fc89ab730, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
(More stack frames follow...)
(gdb) 
That's because of the early return from near the start of the method, as we can see here:
already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
  RefPtr<GLContext> context;

  // Used by mock widget to create an offscreen context
  nsIWidget* widget = mWidget->RealWidget();
  void* widgetOpenGLContext =
      widget ? widget->GetNativeData(NS_NATIVE_OPENGL_CONTEXT) : nullptr;
  if (widgetOpenGLContext) {
    GLContext* alreadyRefed = reinterpret_cast<GLContext*>(widgetOpenGLContext);
    return already_AddRefed<GLContext>(alreadyRefed);
  }
[...]
So what we can infer from this is, in short, that widget->GetNativeData(NS_NATIVE_OPENGL_CONTEXT) is set in the case of WebGL rendering. but not when it's WebView rendering.

It turns out we get the same flow when rendering a page without WebGL, so this isn't a consequence of the WebGL on the page, it's just because we're using the browser rather than a WebView, as we can see here:
Thread 39 &quot;Compositor&quot; hit Breakpoint 3, mozilla::layers::
    CompositorOGL::CreateContext (this=this@entry=0x7ed81a1c70)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() 
    {
(gdb) bt 3
#0  mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7ed81a1c70)
    at gfx/layers/opengl/CompositorOGL.cpp:227
#1  0x0000007ff2950ff8 in mozilla::layers::CompositorOGL::Initialize (
    this=0x7ed81a1c70, out_failureReason=0x7f3633e510)
    at gfx/layers/opengl/CompositorOGL.cpp:395
#2  0x0000007ff2a66d88 in mozilla::layers::CompositorBridgeParent::
    NewCompositor (this=this@entry=0x7fc89a9d80, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
(More stack frames follow...)
(gdb) n
231       nsIWidget* widget = mWidget->RealWidget();
(gdb) n
232       void* widgetOpenGLContext =
(gdb) n
234       if (widgetOpenGLContext) {
(gdb) p widgetOpenGLContext
$2 = (void *) 0x7ed81a6fd0
(gdb) n
79      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h: No such 
    file or directory.
(gdb) n
mozilla::layers::CompositorOGL::Initialize (this=0x7ed81a1c70, 
    out_failureReason=0x7f3633e510)
    at gfx/layers/opengl/CompositorOGL.cpp:395
395         mGLContext = CreateContext();
(gdb) 
Okay, great. There's one other important piece of information that we need in order to unravel all this and that's that the code we added to the SharedSurface_Basic class in order to get the WebView working (it's not working yet, but we now know we'll need it) is actually only an extension of the SharedSurface_Basic code, without removing any existing code. In fact it adds a new constructor and new Create() method (as an override with a different signature), but without removing the original constructor or Create() method. Here are the old and new Create() methods as they appear side-by-side in the code:
  static UniquePtr<SharedSurface_Basic> Create(const SharedSurfaceDesc&);

  static UniquePtr<SharedSurface_Basic> Create(GLContext* gl,
                                               const gfx::IntSize& size);
The only code that's not strictly an addition happens in the SurfaceFactory_Basic class where the Create() method that's used to create the SharedSurface_Basic object in the CreateSharedImpl() method is switched from the old one to the new one.
 class SurfaceFactory_Basic final : public SurfaceFactory {
  public:
   explicit SurfaceFactory_Basic(GLContext& gl);
   explicit SurfaceFactory_Basic(GLContext* gl,
                                 const layers::TextureFlags& flags);
 
   virtual UniquePtr<SharedSurface> CreateSharedImpl(
       const SharedSurfaceDesc& desc) override {
-    return SharedSurface_Basic::Create(desc);
+    return SharedSurface_Basic::Create(mDesc.gl, desc.size);
   }
 };
It would be helpful if I could confirm this, so I've built a version of the library that's exactly the same as the version with broken WebGL and I've switched it to use the original SharedSurface_Basic::Create() method. That's the only change. And when I run it, the WebGL is working. So that's confirmed, it's definitely the new code that's been added to the SharedSurface_Basic class that's causing the problem. It's great to have this finally pinned down, but what's even better is that it looks like the problem can be bypassed simply by choosing a different constructor.

What does this all mean? It means that we know all of the following:
 
  1. The code added to the SharedSurface_Basic needed for WebView rendering is what's causing the WebGL to fail in the browser.
  2. The change that actually causes WebView rendering to fail is just one line: the switch of the Create() method in the SurfaceFactory_Basic::CreateSharedImpl() method.
  3. The calls to SurfaceFactory_Basic that are used to create the SharedSurface_Basic objects are in different places depending on whether they're used for the WebGL or the WebView. In the former case they happen in the painting loop. In the latter case they happen in CompositorOGL::CreateContext().


We can use all of this to our advantage. I've taken the SurfaceFactory_Basic class and created a new version, which I've called SurfaceFactory_GL. This performs the new constructor type. So we now have two versions, like this:
class SurfaceFactory_Basic final : public SurfaceFactory {
 public:
  explicit SurfaceFactory_Basic(GLContext& gl);
  explicit SurfaceFactory_Basic(GLContext* gl,
                                const layers::TextureFlags& flags);

  virtual UniquePtr<SharedSurface> CreateSharedImpl(
      const SharedSurfaceDesc& desc) override {
    return SharedSurface_Basic::Create(desc);
  }
};

class SurfaceFactory_GL final : public SurfaceFactory {
 public:
  explicit SurfaceFactory_GL(GLContext& gl);
  explicit SurfaceFactory_GL(GLContext* gl,
                                const layers::TextureFlags& flags);

  virtual UniquePtr<SharedSurface> CreateSharedImpl(
      const SharedSurfaceDesc& desc) override {
    return SharedSurface_Basic::Create(mDesc.gl, desc.size);
  }
};
Notice how they differ only in the parameters passed to the SharedSurface_Basic::Create() method. Together they'll allow us to create the SharedSurface_Basic objects needed for the two different pipelines. I may find I need something more nuanced in the future and, if so, I may end up having to create an entirely new type of SharedSurface. But if that does turn out to be the case then I now know it'll work out fine: it's all nice self-contained code.

Alright, I've made this change and I want to see it in action. The partial build shows good results, but I want to check things with the debugger again. So I've set off another full build. Let's find out where things are at when it's done.

Before I sign off for the day, let me say that I'm getting increasingly hopeful about this. It really feels like we've got all the information needed to fix the problem and that we now just need to carefully move all of the metaphorical girders into their correct positions.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment