flypig.co.uk

Blog

28 Mar 2024 : Day 199 #
Adam Pigg (piggz, who you'll know from his daily Ofono blog amongst other Sailfish-related things) asked a good question on Mastodon yesterday. "How" Adam asks "was the texture deletion missed? Was it present in 71?" This is in relation to how I ended up solving the app seizure problem. With apologies to those who have been following along and already know, but to recap, the problem turned out to be that an EGL texture was being created for each SharedSurface_EGLImage, but then in the destructor there was no code to delete the texture.

It's a classic resource leakage problem and one you might rightly ask (and that's exactly what Adam did) how such an obvious failure could have snuck through.

The answer is that the code to delete the texture was in ESR 78 but was removed from ESR 91. That sounds strange until you also realise that the code to create the texture was removed as well. A whole raft of changes were made as a result of changeset D75055 and the changes specifically to the SharedSurfaceEGL.cpp file involved stripping out the EGL code that the WebView relies on. When attempting to reintroduce this code I restored the texture creation code but somehow missed the texture deletion code.

And that's the challenge of where I'm at with the WebView rendering changes now. It's all down to whether I've successfully reversed these changes or not. I know it can work because it works with ESR 78. But getting all of the pieces to balance together is turning out to be a bit of a challenge. It's just a matter of time before everything fits in the right way, but it is, I'm afraid, taking a lot of time.

So thanks Adam for asking the question. It's good to reflect on these things and hopefully in my case learn how to avoid similar problems happening in future. Now on to the work of finding and fixing the other issues.

So today I've been looking through the diffs I mentioned yesterday, but admittedly without making a huge amount of progress. The one thing I've discovered, that I think may be important, is that there's a difference in the way the display is being configured between ESR 78 and ESR 91.

On ESR 78 the display is collected via a call to GetAppDisplay(). This function can be found in GLContextProviderEGL.cpp and looks like this:
// Use the main app's EGLDisplay to avoid creating multiple Wayland connections
// See JB#56215
static EGLDisplay GetAppDisplay() {
#ifdef MOZ_WIDGET_QT
  QPlatformNativeInterface* interface = QGuiApplication::
    platformNativeInterface();
  MOZ_ASSERT(interface);
  return (EGLDisplay)(interface->nativeResourceForIntegration(QByteArrayLiteral(
    "egldisplay")));
#else
  return EGL_NO_DISPLAY;
#endif
}
In our case we have MOZ_WIDGET_QT defined, so it's the first half of the ifdef that's getting compiled. This gets passed in to the GLLibraryEGL::EnsureInitialized() method when the library is initialised.

The initialisation process has been changed in ESR 91. But there's still a similar process that happens when the library is initialised, the difference being that currently EGL_NO_DISPLAY is passed into the method instead.
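To make it clearer why the value passed in matters, here's a minimal sketch using mock types rather than real EGL. It's loosely modelled on the ESR 78 behaviour described above and isn't gecko code: FakeDisplay, kNoDisplay, openNewConnection() and getAndInitDisplay() are all hypothetical names I've made up for illustration.

```cpp
#include <cassert>

// Hypothetical stand-ins for EGLDisplay and EGL_NO_DISPLAY (not real EGL types).
using FakeDisplay = const void*;
constexpr FakeDisplay kNoDisplay = nullptr;

// Counts how many times the sketch had to open a fresh connection itself.
int gConnectionsOpened = 0;

// Mock for the "get and initialise a brand new display" path.
FakeDisplay openNewConnection() {
  static const int sConnection = 0;  // its address doubles as a non-null handle
  ++gConnectionsOpened;
  return &sConnection;
}

// If the caller supplies a display it's reused as-is; only when the
// "no display" sentinel is passed does the sketch open a new connection.
FakeDisplay getAndInitDisplay(FakeDisplay display = kNoDisplay) {
  if (display == kNoDisplay) {
    display = openNewConnection();
  }
  assert(display != kNoDisplay);
  return display;
}
```

In this sketch, passing in the app's own handle leaves gConnectionsOpened untouched, whereas passing the sentinel forces a second connection to be opened. That's the situation the ESR 78 code comments say they're trying to avoid.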

Eventually this gets passed on to the CreateDisplay() method, which is where we need the correct value to be. Using the debugger I've checked exactly how this gets called on ESR 91. It's clear from this that the display isn't being set up as it should be.
(gdb) b GLLibraryEGL::CreateDisplay
Function "GLLibraryEGL::CreateDisplay" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (GLLibraryEGL::CreateDisplay) pending.
(gdb) r
[...]
Thread 37 "Compositor" hit Breakpoint 1, mozilla::gl::GLLibraryEGL::
    CreateDisplay (this=this@entry=0x7ed8003200, 
    forceAccel=forceAccel@entry=false, 
    out_failureId=out_failureId@entry=0x7f176081c8, aDisplay=aDisplay@entry=0x0)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:747
747     ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp: No such file or directory.
(gdb) bt
#0  mozilla::gl::GLLibraryEGL::CreateDisplay (this=this@entry=0x7ed8003200, 
    forceAccel=forceAccel@entry=false, 
    out_failureId=out_failureId@entry=0x7f176081c8, aDisplay=aDisplay@entry=0x0)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:747
#1  0x0000007ff111d850 in mozilla::gl::GLLibraryEGL::DefaultDisplay (
    this=0x7ed8003200, out_failureId=out_failureId@entry=0x7f176081c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:740
#2  0x0000007ff112ef28 in mozilla::gl::DefaultEglDisplay (
    out_failureId=0x7f176081c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextEGL.h:33
#3  mozilla::gl::GLContextProviderEGL::CreateHeadless (desc=..., 
    out_failureId=out_failureId@entry=0x7f176081c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1246
#4  0x0000007ff112f804 in mozilla::gl::GLContextProviderEGL::CreateOffscreen (
    size=..., minCaps=..., 
    flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE, 
    out_failureId=out_failureId@entry=0x7f176081c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1288
#5  0x0000007ff11982f8 in mozilla::layers::CompositorOGL::CreateContext (
    this=this@entry=0x7ed8002f10)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:254
#6  0x0000007ff11ad8e8 in mozilla::layers::CompositorOGL::Initialize (
    this=0x7ed8002f10, out_failureReason=0x7f17608520)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:391
#7  0x0000007ff12c3584 in mozilla::layers::CompositorBridgeParent::
    NewCompositor (this=this@entry=0x7fc46c2070, aBackendHints=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#8  0x0000007ff12ce600 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7fc46c2070, aBackendHints=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1436
[...]
#26 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) 
Besides this investigation I've also started making changes to try to fix the code, but this is still very much a work-in-progress. Hopefully tomorrow I'll have something more concrete to show for my efforts.

So it's just a short one today, but rest assured I'll be writing more about all this tomorrow.

Before finishing up, I also just want to reiterate my commitment to base-2 milestones. The fact I'll be hitting a day that just happens to be represented neatly in base 10 is of no interest to me and I won't be making a big deal out of it.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
27 Mar 2024 : Day 198 #
I'm looking forward to getting back to a more balanced cadence with gecko development. It's been frustrating to be stuck on app seizing for the last couple of weeks, but now that it's out of the way it'll be nice to focus on other parts of the code. But I'm not going to be wandering too far afield as I continue to try to get the WebView render pipeline working.

What's become clear is that the front and back buffer are being successfully created (and now destroyed!). So now there are two other potential places where the rendering could be failing. It could be that the paint from the Web pages is failing to get on the texture. Or it could be that the paint from the texture is failing to get on the screen.

I'd like to devise ways to test both of those things, but before I do that I want to first check another area that's ripe for failure in my opinion, and that's the setting of the display value. The EGL library uses an EGLDisplay object to control where rendering happens. Although it's part of the Khronos EGL specification, the official documentation is frustratingly vague about what an EGLDisplay actually is. Thankfully the PowerVR documentation has a note that summarises it quite clearly.
 
EGL uses the concept of a "display". This is an abstract object which will show the rendered graphical output. In most environments it corresponds to a single physical screen. After creating a native display for a given windowing system, EGL can use this handle to get a corresponding EGLDisplay handle for use in rendering.

The shift from ESR 78 to ESR 91 brought with it a more flexible handling of displays. In particular, while ESR 78 had just a single instance of a display, ESR 91 allows multiple displays to be configured. I'm not entirely certain what the practical benefit of this is, but handling of EGLDisplay storage has become more complex as a result.

So whereas previously gecko had a single mDisplay value that got used everywhere, the EGLDisplay is now wrapped in a gecko-specific EglDisplay class, defined in GLLibraryEGL.h. This class captures a collection of functionalities, one of which is to store an EGLDisplay value. There can be multiple instances of EglDisplay live at any one time.

The subtle distinction in the capitalisation (EGLDisplay vs. EglDisplay) is critical. The former belongs to EGL whereas the latter belongs to gecko. The fact they're so similar, and that the shift from ESR 78 to ESR 91 has resulted in a switch from one to the other in many parts of the code, makes things all the more confusing.

There's plenty of opportunity for errors here. So I'm thinking: this is something to check.

An obvious place to start these checks is with display initialisation. A quick grep of the code for eglInitialize doesn't give any useful results. However as we saw at some length on Monday, all of these EGL library calls have been abstracted away. And eglInitialize() is no different. The gecko code uses a method called GLLibraryEGL::fInitialize() instead.

Grepping for that throws up some more useful references. The most promising one being this:
static EGLDisplay GetAndInitDisplay(GLLibraryEGL& egl, void* displayType,
                                    EGLDisplay display = EGL_NO_DISPLAY) {
  if (display == EGL_NO_DISPLAY) {
    display = egl.fGetDisplay(displayType);
    if (display == EGL_NO_DISPLAY) return EGL_NO_DISPLAY;
    if (!egl.fInitialize(display, nullptr, nullptr)) return EGL_NO_DISPLAY;
  }
  return display;
}
That's on ESR 78. On ESR 91 things are different and for good reason. The GetAndInitDisplay() method assumes a single instance of EGLDisplay as discussed earlier. On ESR 91 the display is initialised when its EglDisplay wrapper is created:
// static
std::shared_ptr<EglDisplay> EglDisplay::Create(GLLibraryEGL& lib,
                                               const EGLDisplay display,
                                               const bool isWarp) {
  // Retrieve the EglDisplay if it already exists
  {
    const auto itr = lib.mActiveDisplays.find(display);
    if (itr != lib.mActiveDisplays.end()) {
      const auto ret = itr->second.lock();
      if (ret) {
        return ret;
      }
    }
  }

  if (!lib.fInitialize(display, nullptr, nullptr)) {
    return nullptr;
  }
[...]
}
I've chopped off the end of the method there, but the section shown highlights the important part. It's also worth mentioning that in ESR 78 this and the surrounding functionality were all amended by Patch 0038 "Fix mesa egl display and buffer initialisation". I attempted to apply this patch all the way back on Day 55 and it does contain plenty of relevant changes. Here's the way the patch describes itself:
 
Ensure the same display is used for all initialisations to avoid creating multiple wayland connections. Fallback to a wayland window surface in case pixel buffers aren't supported. This is needed on the emulator.

Unfortunately applying the patch, especially due to the differences in the way EGLDisplay is handled, turned out to be a challenge.
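To get a feel for the lookup-before-initialise pattern at the start of EglDisplay::Create(), here's a minimal self-contained sketch. It uses plain standard library types and no EGL at all. Handle, DisplayWrapper, Library and createDisplay() are hypothetical names, and since the end of the real method was cut from the listing, the step that registers the new wrapper in the map is my assumption rather than what gecko actually does.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Hypothetical stand-in for a native display handle (EGLDisplay in gecko).
using Handle = int;

// Hypothetical wrapper, playing the role of gecko's EglDisplay.
struct DisplayWrapper {
  Handle handle;
  explicit DisplayWrapper(Handle h) : handle(h) {}
};

struct Library {
  // Weak references: the cache never keeps a wrapper alive on its own.
  std::unordered_map<Handle, std::weak_ptr<DisplayWrapper>> activeDisplays;
  int initCalls = 0;  // how many times the mock initialise step has run

  bool initialize(Handle) {
    ++initCalls;
    return true;
  }
};

std::shared_ptr<DisplayWrapper> createDisplay(Library& lib, Handle handle) {
  // Reuse a live wrapper for this handle if one already exists.
  const auto itr = lib.activeDisplays.find(handle);
  if (itr != lib.activeDisplays.end()) {
    if (auto existing = itr->second.lock()) {
      assert(existing->handle == handle);
      return existing;
    }
  }

  // Otherwise initialise the display and wrap it.
  if (!lib.initialize(handle)) {
    return nullptr;
  }
  auto created = std::make_shared<DisplayWrapper>(handle);
  lib.activeDisplays[handle] = created;  // assumed: not shown in the listing
  return created;
}
```

Asking twice for the same handle returns the same wrapper and only initialises once. Once every shared_ptr to the wrapper has gone, the weak_ptr expires and the next request initialises the display again.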

Consequently I'm now working my way through this patch again. It'll take me longer than just today, so I'll continue with it until it's all applied properly and report back if I find anything important tomorrow. Raine (rainemak) also flagged up patches 0045 and 0065. The former claims to "Prioritize GMP plugins over all others, and support decoding video for h264, vp8 & vp9"; whereas the latter will:
 
Hardcode loopback address for profile lock filename. When engine started without network PR_GetHostByName takes 20 seconds when connman tries to resolve host name. As this is only used as part of the profile lock filename it can as well be like "127.0.0.1:+<pid>".

It'll take me a while to work through these as well, which means that's it for today. I'll write more about all this tomorrow.

26 Mar 2024 : Day 197 #
If you've been following any of these diary entries over the last couple of weeks you'll know I've been struggling to diagnose a problem related to graphics surfaces. A serious bug prevented the graphics surface from being properly created, but as soon as that was fixed another serious issue appeared: after a short period of time using the WebView the app started to seize up, rapidly progressing to the entire phone. After a while the watchdog kicked in causing the phone to reboot itself.

This is, as a general rule, not considered ideal behaviour for an application.

Since then I've been generally debugging, monitoring and annotating the code to try to figure out what was causing the problem. As of yesterday I'd narrowed the issue down to the creation of the EGL image associated with an EGL texture. Each frame the app would create the texture, then create the image from the texture and then create a surface from that.

Returning early at any point before the image creation prevented the seizing up, whereas returning at any point after it allowed the seizing up to happen. This led me to the EGL instructions: creating and destroying the image.

I've been looking at this code in SharedSurfaceEGL.cpp quite deeply for a couple of weeks now. And narrowing down the area of consideration has finally thrown up something useful.

It turns out that while the surface destructor is called correctly and that this calls fDestroyImage() correctly, that's not all it's supposed to be doing.

All of this was stuff we checked yesterday: fDestroyImage() was being called for every call to fCreateImage() except two, allowing for the front and back buffer to exist at all times.

But looking at the code today I realised there was something missing. When the image is created in SharedSurface_EGLImage::Create() it needs a texture to work with. And so we have this code:
  GLuint prodTex = CreateTextureForOffscreen(prodGL, formats, size);
  if (!prodTex) {
    return ret;
  }

  EGLClientBuffer buffer =
      reinterpret_cast<EGLClientBuffer>(uintptr_t(prodTex));
  EGLImage image = egl->fCreateImage(context,
                                     LOCAL_EGL_GL_TEXTURE_2D, buffer, nullptr);
First create the texture then pass this in to the image creation routine. But while the image is deleted in the destructor, the texture is not!

Here is our destructor code in ESR 91:
SharedSurface_EGLImage::~SharedSurface_EGLImage() {
  const auto& gle = GLContextEGL::Cast(mDesc.gl);
  const auto& egl = gle->mEgl;
  egl->fDestroyImage(mImage);

  if (mSync) {
    // We can't call this unless we have the ext, but we will always have
    // the ext if we have something to destroy.
    egl->fDestroySync(mSync);
    mSync = 0;
  }
}
The image and sync are both destroyed, but the texture never is. So what happens if we add in the texture deletion? To test this I've added it in and the code now looks like this:
SharedSurface_EGLImage::~SharedSurface_EGLImage() {
  const auto& gle = GLContextEGL::Cast(mDesc.gl);
  const auto& egl = gle->mEgl;
  egl->fDestroyImage(mImage);

  if (mSync) {
    // We can't call this unless we have the ext, but we will always have
    // the ext if we have something to destroy.
    egl->fDestroySync(mSync);
    mSync = 0;
  }

  if (!mDesc.gl || !mDesc.gl->MakeCurrent()) return;

  mDesc.gl->fDeleteTextures(1, &mProdTex);
  mProdTex = 0;
}
And now, after building and running this new version, the app no longer seizes up!

To be clear, there's still no rendering happening to the screen, but this is nevertheless an important step forwards and I'm pretty chuffed to have noticed the missing code. In retrospect, it's something I should have noticed a lot earlier, but this goes to show both how intricate these things are, and where my limitations are as a developer. It's hard to keep all of the execution paths in my head all at the same time. As a result I'm left using these often trial-and-error based approaches to finding fixes.
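One general way to make this kind of omission harder to commit is to have a single owner acquire both resources and release both in its destructor. The following is only a sketch of that idea using a mock driver that counts live resources. MockDriver, SurfaceSketch and leakedTexturesAfter() are hypothetical names, none of this is gecko or EGL API, and it isn't how SharedSurface_EGLImage is actually structured.

```cpp
#include <cassert>

// Mock resource tracker standing in for the GL/EGL driver (hypothetical).
struct MockDriver {
  int liveTextures = 0;
  int liveImages = 0;

  unsigned createTexture() { ++liveTextures; return 1; }
  void deleteTexture(unsigned) { --liveTextures; }
  void* createImage(unsigned) { ++liveImages; return this; }
  void destroyImage(void*) { --liveImages; }
};

// Owns both the texture and the image built from it, and releases both.
class SurfaceSketch {
 public:
  explicit SurfaceSketch(MockDriver& driver)
      : mDriver(driver),
        mTex(driver.createTexture()),
        mImage(driver.createImage(mTex)) {}

  ~SurfaceSketch() {
    mDriver.destroyImage(mImage);
    mDriver.deleteTexture(mTex);  // the step that's easy to forget
  }

  SurfaceSketch(const SurfaceSketch&) = delete;
  SurfaceSketch& operator=(const SurfaceSketch&) = delete;

 private:
  MockDriver& mDriver;
  unsigned mTex;
  void* mImage;
};

// Create and destroy `frames` surfaces, then report the textures left behind.
int leakedTexturesAfter(int frames) {
  assert(frames >= 0);
  MockDriver driver;
  for (int i = 0; i < frames; ++i) {
    SurfaceSketch surface(driver);  // destroyed at the end of each iteration
  }
  return driver.liveTextures;
}
```

With the deleteTexture() line in place the leak count stays at zero however many frames are run. Comment it out and the count grows by one per frame, which is the same shape as the problem described above.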

It's a small victory. But it means that tomorrow I can continue on with the proper job of finding out why the render never makes it to the screen. With this resolved I'm feeling more confident again that it will be possible to get to the bottom of it.

25 Mar 2024 : Day 196 #
Yesterday I finally narrowed down the error causing the WebView app to seize up during execution to a call to EglDisplay::fCreateImage(). Now it may not be this call that's the problem, it might be the way the result is used or the fact that it's not being freed properly, or maybe the parameters that are being passed in to it. But the fact that we've narrowed it down is likely to be a big help in figuring things out.

The call itself goes through to a method that looks like this:
  EGLImage fCreateImage(EGLContext ctx, EGLenum target, EGLClientBuffer buffer,
                        const EGLint* attribList) const {
    MOZ_ASSERT(HasKHRImageBase());
    return mLib->fCreateImage(mDisplay, ctx, target, buffer, attribList);
  }
Here mLib is an instance of GLLibraryEGL. It looks like we have several layers of wrappers here so let's continue digging. This goes through to the following method that's part of GLLibraryEGL:
  EGLImage fCreateImage(EGLDisplay dpy, EGLContext ctx, EGLenum target,
                        EGLClientBuffer buffer,
                        const EGLint* attrib_list) const {
    WRAP(fCreateImageKHR(dpy, ctx, target, buffer, attrib_list));
  }
That looks similar but it's not quite the same. It is just another wrapper though, this time going through to a dynamically created method. The WRAP() macro looks like this:
#define WRAP(X)                \
  PROFILE_CALL                 \
  BEFORE_CALL                  \
  const auto ret = mSymbols.X; \
  AFTER_CALL                   \
  return ret
The PROFILE_CALL, BEFORE_CALL and AFTER_CALL lines are all macros which turn into something functional in the Android build, but in our build are just empty. That means that the WRAP(fCreateImageKHR(dpy, ctx, target, buffer, attrib_list)) statement actually reduces down to just the following:
  const auto ret = mSymbols.fCreateImageKHR(dpy, ctx, target, buffer, 
    attrib_list);
  return ret;
The mSymbols object has the following defined on it:
    EGLImage(GLAPIENTRY* fCreateImageKHR)(EGLDisplay dpy, EGLContext ctx,
                                          EGLenum target,
                                          EGLClientBuffer buffer,
                                          const EGLint* attrib_list);
Here EGLImage is a typedef of void* and GLAPIENTRY is an empty define, giving us a final result that looks like this:
    void* (*fCreateImageKHR)(EGLDisplay dpy, EGLContext ctx,
                             EGLenum target,
                             EGLClientBuffer buffer,
                             const EGLint* attrib_list);
We're still not quite there though. Inside GLLibraryEGL.cpp we find this:
    const SymLoadStruct symbols[] = {SYMBOL(CreateImageKHR),
                                     SYMBOL(DestroyImageKHR), END_OF_SYMBOLS};
    (void)fnLoadSymbols(symbols);
This is packing symbols with some data which is then passed in to fnLoadSymbols(), a method for loading symbols from a dynamically loaded library. The define that's used here is the following:
#define SYMBOL(X)                 \
  {                               \
    (PRFuncPtr*)&mSymbols.f##X, { \
      { "egl" #X }                \
    }                             \
  }
Notice how here it's playing around with the input argument so that, with a little judicious simplification for clarity, SYMBOL(CreateImageKHR) becomes:
  mSymbols.fCreateImageKHR, {{ "eglCreateImageKHR" }}
In other words (big reveal, but no big surprise) a call to mSymbols.fCreateImageKHR() will get converted into a call to the EGL function eglCreateImageKHR, loaded in from the EGL driver.
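The two preprocessor tricks at work here, token pasting with ## and stringizing with #, can be tried out in isolation. The sketch below is a simplified stand-in and not the gecko code: fakeCreateImageKHR(), resolve(), SYMBOL_NAME, SYMBOL_SLOT and LibrarySketch are hypothetical names, the function signature is reduced to a single int, and the "loader" just recognises one hard-coded name instead of talking to a real EGL driver.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for eglCreateImageKHR, so no EGL driver is needed.
int fakeCreateImageKHR(int value) { return value + 1; }

using CreateFn = int (*)(int);

// Table of function pointers filled in at runtime (the role of mSymbols).
struct Symbols {
  CreateFn fCreateImageKHR = nullptr;
};

// Stringizing builds the driver-side name: CreateImageKHR -> "eglCreateImageKHR".
#define SYMBOL_NAME(X) "egl" #X
// Token pasting picks the matching member: CreateImageKHR -> fCreateImageKHR.
#define SYMBOL_SLOT(table, X) ((table).f##X)

// Empty hooks, as in the non-Android build described above.
#define PROFILE_CALL
#define BEFORE_CALL
#define AFTER_CALL
#define WRAP(X)                \
  PROFILE_CALL                 \
  BEFORE_CALL                  \
  const auto ret = mSymbols.X; \
  AFTER_CALL                   \
  return ret

// Mock loader: resolves exactly one known name.
CreateFn resolve(const char* name) {
  return std::strcmp(name, "eglCreateImageKHR") == 0 ? &fakeCreateImageKHR
                                                     : nullptr;
}

class LibrarySketch {
 public:
  bool load() {
    SYMBOL_SLOT(mSymbols, CreateImageKHR) =
        resolve(SYMBOL_NAME(CreateImageKHR));
    return mSymbols.fCreateImageKHR != nullptr;
  }

  int fCreateImage(int value) const {
    assert(mSymbols.fCreateImageKHR);
    WRAP(fCreateImageKHR(value));
  }

 private:
  Symbols mSymbols;
};
```

After load() succeeds, a call to fCreateImage() goes straight through the function pointer to the stand-in function, with the WRAP() macro contributing nothing more than a temporary and a return.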

What does this do? According to the documentation:
 
eglCreateImage is used to create an EGLImage object from an existing image resource buffer. display specifies the EGL display used for this operation. context specifies the EGL client API context used for this operation, or EGL_NO_CONTEXT if a client API context is not required. target specifies the type of resource being used as the EGLImage source (examples include two-dimensional textures in OpenGL ES contexts and VGImage objects in OpenVG contexts). buffer is the name (or handle) of a resource to be used as the EGLImage source, cast into the type EGLClientBuffer. attrib_list is a list of attribute-value pairs which is used to select sub-sections of buffer for use as the EGLImage source, such as mipmap levels for OpenGL ES texture map resources, as well as behavioral options, such as whether to preserve pixel data during creation. If attrib_list is non-NULL, the last attribute specified in the list must be EGL_NONE.

Super. Where does that leave us? Well, it tells us that the call to fCreateImage() in our SharedSurface_EGLImage::Create() is really just a bunch of simple wrapper calls that ends up calling an EGL function. What could be going wrong? One obvious potential problem is that the input parameters may be messed up. Another one is that each call to eglCreateImageKHR() creating an EGLImage object should be balanced out with a call to eglDestroyImageKHR() to destroy it.

We do have a call to eglDestroyImageKHR() happening in our SharedSurface_EGLImage destructor. It looks like this:
SharedSurface_EGLImage::~SharedSurface_EGLImage() {
  const auto& gle = GLContextEGL::Cast(mDesc.gl);
  const auto& egl = gle->mEgl;
  egl->fDestroyImage(mImage);
[...]
There's an unexpected difference with the way it's called in ESR 78, where the code looks like this:
SharedSurface_EGLImage::~SharedSurface_EGLImage() {
  const auto& gle = GLContextEGL::Cast(mGL);
  const auto& egl = gle->mEgl;
  egl->fDestroyImage(egl->Display(), mImage);
[...]
Notice the extra egl->Display() value being passed in as a parameter on ESR 78. It's not needed on ESR 91 because there the EglDisplay wrapper stores its own copy of the EGLDisplay:
  EGLBoolean fDestroyImage(EGLImage image) const {
    MOZ_ASSERT(HasKHRImageBase());
    return mLib->fDestroyImage(mDisplay, image);
  }
That gives us a couple of things to look into: first, is the correct value being passed in for image? Second, is the value stored for mDisplay valid? The underlying call to eglDestroyImage also has a Boolean return value which will return EGL_FALSE in case something goes wrong. A nice first step would be to check this return value in case it's indicating a problem. To do this I've added some additional debug output to the code:
  EGLBoolean result = egl->fDestroyImage(mImage);
  printf_stderr("RENDER: fDestroyImage() return value: %d\n", result);
The result of running it shows a large number of successful calls to fDestroyImage():
[...]
[JavaScript Warning: "Layout was forced before the page was fully loaded. 
    If stylesheets are not yet loaded this may cause a flash of unstyled 
    content." {file: "https://jolla.com/themes/unlike/js/
    modernizr.js?x98582&ver=2.6.2" line: 4}]
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
[...]
Since this output looks okay I've taken it a step further and added a count to the creation and deletion calls in case it shows any imbalance between the two.
[...]
Frame script: embedhelper.js loaded
RENDER: fCreateImage() return value: 1, 0
RENDER: fCreateImage() return value: 1, 1
CONSOLE message:
[JavaScript Warning: "This page uses the non standard property “zoom”. 
    Consider using calc() in the relevant property values, or using “transform” 
    along with “transform-origin: 0 0”." {file: "https://jolla.com/
    " line: 0}]
CONSOLE message:
[JavaScript Warning: "Layout was forced before the page was fully loaded. 
    If stylesheets are not yet loaded this may cause a flash of unstyled 
    content." {file: "https://jolla.com/themes/unlike/js/
    modernizr.js?x98582&ver=2.6.2" line: 4}]
RENDER: fCreateImage() return value: 1, 2
RENDER: fDestroyImage() return value: 1, 0
RENDER: fCreateImage() return value: 1, 3
RENDER: fDestroyImage() return value: 1, 1
[...]

RENDER: fCreateImage() return value: 1, 316
RENDER: fDestroyImage() return value: 1, 314
RENDER: fCreateImage() return value: 1, 317
RENDER: fDestroyImage() return value: 1, 315
[...]
The increasing numbers (going up to 317 and 315 here) tell us that the balance between creates and destroys is pretty clean. There are two creates at the start which don't have matching destroys, after which everything is balanced. It seems unlikely therefore that this is the cause of the seize-ups. What's more, it all makes sense too: at any point in time there should be a front and a back buffer, so there should always be exactly two images in existence at any one time. That's a situation that's confirmed by the numbers.
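The arithmetic can be checked with a small simulation. This is just a sketch under the assumption that each frame creates one new image and then, once more than two exist, destroys the oldest, which is the order the log suggests. Counters and simulateFrames() are hypothetical names and nothing here touches EGL. Note that the second number in each log line is a zero-based running count, so a create tagged 317 is the 318th create.

```cpp
#include <cassert>
#include <deque>

struct Counters {
  int creates = 0;
  int destroys = 0;
  int live() const { return creates - destroys; }
};

// Simulate `frames` frames of double-buffered rendering: each frame creates a
// new image and, if that leaves more than two alive, destroys the oldest.
Counters simulateFrames(int frames) {
  assert(frames >= 0);
  Counters counters;
  std::deque<int> images;  // ids of the images currently alive, oldest first
  for (int frame = 0; frame < frames; ++frame) {
    images.push_back(counters.creates++);
    if (images.size() > 2) {
      images.pop_front();
      ++counters.destroys;
    }
  }
  return counters;
}
```

Running 318 frames through this gives 318 creates and 316 destroys, leaving two images alive. That lines up with the final pair of log lines, where the last create is tagged 317 and the last destroy 315.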

Just to ensure this matches the behaviour of the previous version I've also tested the same using the debugger on ESR 78. I got the same sequence of calls. First two creates, followed by balanced create and destroy calls so that there are exactly two images in existence at any one time:
fCreateImage
fCreateImage
fCreateImage
fDestroyImage
fCreateImage
fDestroyImage
fCreateImage
[...]
In conclusion everything here looks in order on ESR 91. So tomorrow I'll move on to checking that the display value is set correctly.

24 Mar 2024 : Day 195 #
I'm working my way through the SharedSurface_EGLImage::Create() method and gradually increasing the steps that are executed. Over the last few days I first established that preventing the SurfaceFactory_EGLImage from being created was enough to prevent the app from seizing up. Without the factory the surfaces themselves weren't going to get created. Next I enabled the factory but disabled the image creation.

Today I'm allowing the offscreen texture to be created by allowing this call to take place:
  GLuint prodTex = CreateTextureForOffscreen(prodGL, formats, size);
But I've placed a return immediately afterwards so that neither the image nor the surface that builds on this are created. Once again the objective is to find out whether the app seizes up or not. If it does then that would point to the texture being the culprit. If not, it's likely something that follows it.
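The underlying technique is a linear search over "how far do we let the method run". In practice I'm moving a hard-coded return and rebuilding each time, but the idea can be sketched with a runtime parameter instead. Everything in this sketch is hypothetical (Stage, Trace, createUpTo() and firstBadStage() are made-up names) and the stages just set flags rather than creating any real resources.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical stages of a creation routine, in execution order.
enum class Stage { None, Texture, Image, Surface };

// Records which steps actually ran.
struct Trace {
  bool textureCreated = false;
  bool imageCreated = false;
  bool surfaceCreated = false;
};

// Run the routine but bail out as soon as `stopAfter` has completed.
Trace createUpTo(Stage stopAfter) {
  Trace trace;
  if (stopAfter == Stage::None) return trace;
  trace.textureCreated = true;
  if (stopAfter == Stage::Texture) return trace;
  trace.imageCreated = true;
  if (stopAfter == Stage::Image) return trace;
  trace.surfaceCreated = true;
  return trace;
}

// Move the stop point forwards one stage at a time until the symptom first
// appears. Returns Stage::None if it never does.
template <typename Symptom>
Stage firstBadStage(Symptom symptomAppears) {
  for (Stage stage : {Stage::Texture, Stage::Image, Stage::Surface}) {
    if (symptomAppears(createUpTo(stage))) return stage;
  }
  return Stage::None;
}
```

Here the symptom is supplied as a predicate over the trace. On the real device the "predicate" is me running the app and watching to see whether it seizes up, which is rather slower.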

Change made, code built, binary transferred and library installed. Now running the app, there's no seizing up. So that takes us one more step closer to finding the culprit. Now I've moved the early return one step later, until after the EGLImage has been created using the texture, after these lines:
  EGLClientBuffer buffer =
      reinterpret_cast<EGLClientBuffer>(uintptr_t(prodTex));
  EGLImage image = egl->fCreateImage(context,
                                     LOCAL_EGL_GL_TEXTURE_2D, buffer, nullptr);
  if (!image) {
    prodGL->fDeleteTextures(1, &prodTex);
    return ret;
  }
Once again, I've built, transferred and installed the updated library. And now when I run it... the app seizes up! So we have our culprit. The problem seems to be the creation of the image from the texture that's either causing the problem in itself, or triggering something else to cause the problem. The most likely offender in the latter case would be if the created image weren't being freed:
  EGLImage image = egl->fCreateImage(context,
                                     LOCAL_EGL_GL_TEXTURE_2D, buffer, nullptr);
This is reminiscent of a problem I experienced earlier which resulted in me having to disable the texture capture for the cover image. Now that it's narrowed down I can look into the underlying reason. That will be my task for tomorrow morning.

23 Mar 2024 : Day 194 #
Today I'm trying to narrow things down after reconfirming that when there's no SurfaceFactory_EGLImage the app doesn't seize up. I want to focus on two methods. First to check if anything is up with SurfaceFactory_EGLImage::Create(). Second I want to try disabling elements of SharedSurface_EGLImage::Create() in case that makes any difference. It's just a short one today, but still important tasks for helping get to the bottom of things.

First up SurfaceFactory_EGLImage::Create(). There's nothing to disable here (all it does is return the factory) but there are some input parameters to check. I added some debug print code for this:
  if (HasEglImageExtensions(*gle)) {
    printf_stderr("RENDER: prodGL: %p\n", prodGL);
    printf_stderr("RENDER: caps: any %d, color %d, alpha %d, bpp16 %d, "
                  "depth %d, stencil %d, premultAlpha %d, preserve %d\n",
                  caps.any, caps.color, caps.alpha, caps.bpp16, caps.depth,
                  caps.stencil, caps.premultAlpha, caps.preserve);
    printf_stderr("RENDER: allocator: %p\n", allocator.get());
    printf_stderr("RENDER: flags: %#x\n", (uint32_t)flags);
    printf_stderr("RENDER: context: %p\n", context);

    // The surface allocator that we want to create this
    // for.  May be null.
    RefPtr<layers::LayersIPCChannel> surfaceAllocator;

    ret.reset(new ptrT({prodGL, SharedSurfaceType::Basic, layers::TextureType::
    Unknown, true}, caps, allocator, flags, context));
  }
On ESR 78 the values from this method are the following:
=============== Preparing offscreen rendering context ===============
RENDER: prodGL: 0x7eac109140
RENDER: caps: any 0, color 1, alpha 0, bpp16 0, depth 0, stencil 0, 
    premultAlpha 1, preserve 0
RENDER: allocator: (nil)
RENDER: flags: 0x2
RENDER: context: 0x7eac004d50
And the values in ESR 91 are identical, other than certain structures residing in different places in memory:
=============== Preparing offscreen rendering context ===============
RENDER: prodGL: 0x7ed819aa50
RENDER: caps: any 0, color 1, alpha 0, bpp16 0, depth 0, stencil 0, 
    premultAlpha 1, preserve 0
RENDER: allocator: (nil)
RENDER: flags: 0x2
RENDER: context: 0x7ed8004be0
I forgot to add caps.surfaceAllocator to this list, but using the debugger I was able to confirm that this is set to null in both cases.

Next up is SharedSurface_EGLImage::Create(). For the first check I've got it to return almost immediately with a null return value. This may or may not prevent the seizing up from happening. Either way it will be useful to know. If it does, then I'm focusing my attention in the correct place. If it doesn't I know I need to focus elsewhere.

The builds for this aren't taking the same seven hours that a full gecko build does, but they do still take tens of minutes. This seems to make me even more impatient. I think it's because 10 minutes is long enough to be noticeable, but not long enough that it's worth me context switching to some other task.

On copying over the library and executing the WebView app I find that this change has indeed stopped the app from seizing up. So this is excellent news. It means that tomorrow I can continue through the SharedSurface_EGLImage::Create() method, narrowing down where the problem starts.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
22 Mar 2024 : Day 193 #
We took a bit of an interlude from my usual plan yesterday to consider some of the many useful suggestions that I've received over the last couple of weeks. I generated quite a few log files but didn't find any obvious discrepancies in them. That's not to say we've seen the end of those ideas and I'm still interested to know if anyone else can spot anything that looks worth following up. In the meantime I'm dropping back into my usual cadence with a plan to test out changes to the source code.

The problem I'm trying to solve is the seizing up of the app. I first noticed this after fixing a bug in the SharedSurface_EGLImage::Create() method. Over the next couple of days I'm planning to work through this method disabling various parts of it to try to pin down exactly which parts are causing the problem.

The start of the method looks like this:
/*static*/
UniquePtr<SurfaceFactory_EGLImage> SurfaceFactory_EGLImage::Create(
    GLContext* prodGL, const SurfaceCaps& caps,
    const RefPtr<layers::LayersIPCChannel>& allocator,
    const layers::TextureFlags& flags) {
  const auto& gle = GLContextEGL::Cast(prodGL);
  //const auto& egl = gle->mEgl;
  const auto& context = gle->mContext;

  typedef SurfaceFactory_EGLImage ptrT;
  UniquePtr<ptrT> ret;

  if (HasEglImageExtensions(*gle)) {
    ret.reset(new ptrT({prodGL, SharedSurfaceType::Basic, layers::TextureType::
    Unknown, true}, caps, allocator, flags, context));
  }

  return ret;
}
The original problem was that the HasExtensions() condition was causing the method to exit early, before it had really done anything at all. So I'm forcing an early return to simulate the same behaviour, like this:
[...]
  typedef SurfaceFactory_EGLImage ptrT;
  UniquePtr<ptrT> ret;

  return ret;
[...]
I've rebuilt it using the standard partial-build process, copied the resulting libxul.so over to my phone and installed it.

The rendering was already broken and this change will only break it further, but the question I want to know the answer to is: "will this stop the app from seizing up?"

The answer is: "yes". This one small change means I can now leave the app running for long periods, swiping between the lipstick home screen and the app, swiping up and down on the page (which has no obvious effect, but the touch input is still going through). The app remains responsive, my phone's watchdog doesn't bite and the OS doesn't reboot.

I'm glad about this. If it hadn't been the case it would have meant I'd misunderstood where the problem was coming from. Now reassured I can continue on to partition the code up and try to narrow down the error.

But that will be for tomorrow. For today, it's good to be armed with this backstop of knowledge about where the error is emanating from.

21 Mar 2024 : Day 192 #
Over the last week or so while I've been struggling with what's almost certainly a texture or surface related bug, I've received many more suggestions than I expected from other Sailfish OS developers, both from Sailors (Jolla employees) and independent developers. Receiving all of this helpful input is hugely motivational and much appreciated. I'm always keen to try out others' suggestions, not least because I'm very aware of how much amazing knowledge there is out there, which it's always going to be beneficial to draw from. The Sailfish OS hivemind is impressively well-informed when it comes to technical matters.

However, these diary entries all form part of a pipeline. There's the pipeline of the plans I'm working my way through, on top of which there's also the "editing" pipeline, which means there's a lag of a couple of days between me writing a post and it going online.

Since I already had a set collection of things I wanted to try (checking creation/deletion balance, measuring memory consumption, running the app through valgrind), I've essentially not been in a position to try out some of these suggestions until now, when the other tasks have finally worked their way through the pipeline.

Now that I've moved off valgrind and am going back to the source code, it seems like a good time to take a look at some of the suggestions I've received. While on the one hand I want to give each suggestion a fair crack at solving the problem, on the other hand many of the suggestions touch on areas I'm not so familiar with. Consequently I'm going to explain what's happening here, but I may need further input for some of them.

First up, Tone (tortoisedoc) has made a number of useful suggestions recently. The latest is about making use of extra Wayland debugging output:
 
It seems you have exited browser world and are stepping into wayland lands, thats a good sign, problems yes but somewhere else 🙂 you could try the WAYLAND_DEBUG=1 (or "all" iirc) to complete your logs with the wayland surface creation protocol (it might be qt-wayland doesnt support swapchains?). Its a fairly simple protocol.

Thank you for this Tone. Running using WAYLAND_DEBUG=1 certainly produces a lot of output:
$ WAYLAND_DEBUG=1 harbour-webview 2>debug-harbour-webview-esr91-01.txt
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[D] main:30 - WebView Example
[D] main:44 - Using default start URL:  "https://www.flypig.co.uk/search/
    "
[D] main:47 - Opening webview
[1522475.832]  -> wl_display@1.get_registry(new id wl_registry@2)
[1522477.064]  -> wl_display@1.sync(new id wl_callback@3)
[1522479.962] wl_display@1.delete_id(3)
[1522480.054] wl_registry@2.global(1, "wl_compositor", 3)
[...]
Having completed the initialisation sequence the output from the app then settles into a loop containing the following:
[1523976.770] wl_buffer@4278190081.release()
[1523976.951] wl_callback@45.done(5115868)
[1523985.299]  -> wl_surface@20.frame(new id wl_callback@45)
[1523985.422]  -> wl_surface@20.attach(wl_buffer@4278190080, 0, 0)
[1523985.486]  -> wl_surface@20.damage(0, 0, 1080, 2520)
[1523985.557]  -> wl_surface@20.commit()
[1523985.581]  -> wl_display@1.sync(new id wl_callback@44)
[1523995.055] wl_display@1.delete_id(44)
[1524013.656] wl_display@1.delete_id(45)
Periodically there are also touch events that punctuate the output, presumably the result of me scrolling the page.
[1531447.943] qt_touch_extension@10.touch(5123658, 68, 65538, 7950000, 
    16980000, 7367, 6740, 50027, 50027, 255, 0, 0, 196608, array)
[1531448.190] qt_touch_extension@10.touch(5123665, 68, 65538, 7940000, 
    16920000, 7358, 6716, 50027, 50027, 255, 0, 0, 196608, array)
This continues right up until the app seizes up. Working through the file I don't see any changes towards the end that might explain why things are going wrong, but maybe Tone or another developer with a keener eye and greater expertise than I have can spot something?

Please do feel free to download the output file yourself to take a look. I've also generated a similar file for the working ESR 78 build, which I'm hoping may be useful for comparison. On ESR 78 the loop it settles into is similar:
[2484510.845] wl_buffer@4278190080.release()
[2484515.820] wl_display@1.delete_id(44)
[2484515.928] wl_callback@44.done(186889767)
[2484515.967]  -> wl_surface@20.frame(new id wl_callback@44)
[2484516.006]  -> wl_surface@20.attach(wl_buffer@4278190082, 0, 0)
[2484516.062]  -> wl_surface@20.damage(0, 0, 1080, 2520)
[2484516.129]  -> wl_surface@20.commit()
[2484516.152]  -> wl_display@1.sync(new id wl_callback@49)
[2484516.942] wl_display@1.delete_id(49)
There is a slight difference in ordering: one of the deletes is out of sync across the two versions. Could this be significant? It's quite possible these logs are hiding a critical piece of information, but nothing is currently leaping out at me unfortunately.
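One way to make the two loops easier to compare is to strip out the parts that are expected to differ between runs, namely the timestamps and the numeric object ids. The following is a minimal sketch under that assumption; it isn't part of any Wayland tooling, the name normaliseWaylandLine() is hypothetical and it relies only on the line format visible in the logs above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reduce a WAYLAND_DEBUG line to a form that can be
// compared across runs by dropping the timestamp and numeric object ids.
std::string normaliseWaylandLine(const std::string& line) {
  // Matches the leading "[1523985.299]" style timestamp
  static const std::regex timestamp(R"(^\[\s*\d+\.\d+\]\s*)");
  // Matches "@20" style object ids
  static const std::regex objectId(R"(@\d+)");

  std::string result = std::regex_replace(line, timestamp, "");
  // Replace every object id with a fixed placeholder
  result = std::regex_replace(result, objectId, "@N");
  return result;
}
```

Running both log files through a filter like this and then comparing them with a standard diff tool should leave only differences in message ordering and arguments, such as the out-of-sync delete_id() call. Arguments that aren't object ids, like the 44 in delete_id(44), are deliberately left untouched.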

A second nice suggestion, this one made by Tomi (tomin) in the same thread, was to check in case the textures aren't being properly released, resulting in the reserve of file descriptors becoming exhausted:
 
The issue that @flypig is now trying to figure out sounds a bit like the one I had with Qt Scene Graph recently. I wasn’t properly releasing textures so it kept reserving dmabuf file descriptors and then eventually the app crashed because it run out of fds (or rather I think it tried to use fd > 1024 for something that’s not compatible with such “high” value). Anyway lsof -p output might be worth looking at.

This sounds very plausible indeed. My suspicion has always been that the particular issue I'm experiencing relates to textures/surfaces not being released, but it hadn't occurred to me at all that it might be file-descriptor related and it certainly wouldn't have occurred to me to try using lsof to list them.

Executing lsof while the app is running shows over 400 open file descriptors. But there's nothing too dramatic that I can see there. Again, please feel free to check the output file in case you can spot something of interest. Running the command repeatedly to show open file descriptors over time shows a steady increase until it hits 440, at which point it stays fairly steady:
$ while true ; do bash -c \
    'lsof -p $(pgrep "harbour-webview") | wc -l' ; \
    sleep 0.5 ; done
[...]
121
280
351
434
435
444
443
443
443
440
440
440
From this it doesn't look like it's a case of file descriptor exhaustion, but I'd be very interested in others' opinions on this.
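A similar count can be taken from inside a process by listing /proc/self/fd, which on Linux contains one entry per open file descriptor. The following is a minimal sketch assuming a Linux /proc filesystem; countOpenFds() is a hypothetical name and not something that exists in gecko. It's also worth bearing in mind that lsof lists more than just descriptors (memory-mapped files, for example), so its line count will be higher than the number of descriptors actually in use.

```cpp
#include <dirent.h>
#include <cstring>

// Hypothetical helper: count the open file descriptors of the current
// process by listing /proc/self/fd (Linux only). Returns -1 on failure.
int countOpenFds() {
  DIR* dir = opendir("/proc/self/fd");
  if (!dir) {
    return -1;
  }

  int count = 0;
  while (const dirent* entry = readdir(dir)) {
    // Skip the "." and ".." directory entries
    if (std::strcmp(entry->d_name, ".") == 0 ||
        std::strcmp(entry->d_name, "..") == 0) {
      continue;
    }
    ++count;
  }
  closedir(dir);

  // The directory stream itself held one descriptor while we were counting
  return count - 1;
}
```

Logging a value like this periodically would show whether the number of descriptors keeps climbing towards a limit or, as the lsof output suggests, levels off.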

I've also received some helpful advice from Raine (rainemak). As it was a private message I don't want to quote it here, but Raine suggests several ways to make better use of the gecko logging capabilities which I'll definitely be making use of. He also suggested looking at a couple of the patches:
 
  1. rpm/0065-sailfishos-gecko-Prioritize-loading-of-extension-ver.patch
  2. rpm/0047-sailfishos-egl-Drop-swap_buffers_with_damage-extensi.patch


I'll take a careful look at these two and report back.

Finally, I need to reiterate my thanks to Florian Xaver (wosrediinanatour). While I've not had a chance to document in this diary all of the useful points Florian has been making over the last few weeks, I still very much intend to do so.

As I round off, I also want to mention this nice article by Arthur Schiwon (blizzz) which details his impressive attempts to reduce the memory footprint of his Nextcloud Talk app. Very relevant to some of the discussion here in recent days and with thanks to Patrick (pherjung) for pointing it out.

A big thank you for all of the great suggestions, feedback and general encouragement. I'm always happy to get this input and gradually it's all helping to create a clearer picture.

20 Mar 2024 : Day 191 #
This morning I'm taking the first opportunity to test out the new libxul.so binary generated from the build yesterday. Firing up the app there's no render and although initially things seem to be running fine (apart from the complete lack of anything useful on-screen!) after a short time the browser seizes up again.

So sadly this hasn't fixed the underlying issues. But I'm hoping it still fixed something. I can at least push the app through valgrind to see if it had some effect or not.

So this is exactly what I've done.
$ valgrind --log-file=valgrind-harbour-webview-esr91-02.txt harbour-webview
The resulting output file still, unsurprisingly, contains no shortage of memory mess, but checking the Compositor thread there's no sign of anything related to fGetDisplay(), GetAndInitDisplay() or CreateOffscreen().

This is good news: it seems that the change did the trick and, although the render isn't working, the change also doesn't cause the WebView to crash. It feels like this is one step closer to getting our desired result.

For those who are still following along, or just desperate to get your hands on a build to test, I appreciate it's slow progress. But each fix is a fix that's necessary and I remain confident that eventually all of the pieces will fall into place. This is very definitely a marathon and not a sprint. An Odyssey not a citybreak. We'll get there!

Having exhausted valgrind as a means to find the problem, we now need a new approach. In order to figure out where to focus, I'm going to go back to the SharedSurface_EGLImage::Create() method. You may recall that the seizing up of execution only happened after we persuaded this method to do something sensible and not just return null all the time.

I want to revisit this. My plan is to go through this method and judiciously remove parts of it until the seizing up no longer happens. The plan is to narrow down exactly which part is causing the issue, which should also narrow down where to look for a possible fix.

Unfortunately other commitments mean it's only a short one today. I'm very much hoping to be able to get to this task of cutting out elements to see the effects tomorrow.

19 Mar 2024 : Day 190 #
Yesterday I collected valgrind logs from ESR 78 and ESR 91. Even though both logs contain a huge number of errors, I whittled the interesting ones down to just two.

First there's an invalid read of size 8 coming from GetAndInitDisplay(). This should be straightforward to track down.
==29044== Thread 32 Compositor:
==29044== Invalid read of size 8
==29044==    at 0xCCF62E0: hybris_egl_display_get_mapping (in /usr/lib64/
    libEGL.so.1.0.0)
==29044==    by 0xCCF63BB: ??? (in /usr/lib64/libEGL.so.1.0.0)
==29044==    by 0x758848B: fGetDisplay (GLLibraryEGL.h:193)
==29044==    by 0x758848B: mozilla::gl::GetAndInitDisplay(mozilla::gl::
    GLLibraryEGL&, void*, void*) (GLLibraryEGL.cpp:151)
==29044==    by 0x75889C7: mozilla::gl::GLLibraryEGL::CreateDisplay(bool, 
    nsTSubstring<char>*, void*) (GLLibraryEGL.cpp:813)
==29044==    by 0x7589917: mozilla::gl::GLLibraryEGL::DefaultDisplay(
    nsTSubstring<char>*) (GLLibraryEGL.cpp:745)
==29044==    by 0x759AFBF: DefaultEglDisplay (GLContextEGL.h:33)
==29044==    by 0x759AFBF: mozilla::gl::GLContextProviderEGL::CreateHeadless(
    mozilla::gl::GLContextCreateDesc const&, nsTSubstring<char>*) (
    GLContextProviderEGL.cpp:1246)
==29044==    by 0x759B89B: mozilla::gl::GLContextProviderEGL::CreateOffscreen(
    mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gl::
    SurfaceCaps const&, mozilla::gl::CreateContextFlags, nsTSubstring<char>*) (
    GLContextProviderEGL.cpp:1288)
==29044==    by 0x76042BB: mozilla::layers::CompositorOGL::CreateContext() (
    CompositorOGL.cpp:256)
[...]

==29044==  Address 0x14c39ef0 is 0 bytes inside a block of size 40 free'd
==29044==    at 0x484EAD8: operator delete(void*, unsigned long) (
    vg_replace_malloc.c:935)
==29044==    by 0xCCF64BB: eglTerminate (in /usr/lib64/libEGL.so.1.0.0)
==29044==    by 0x7587473: fTerminate (GLLibraryEGL.h:234)
==29044==    by 0x7587473: fTerminate (GLLibraryEGL.h:639)
==29044==    by 0x7587473: mozilla::gl::EglDisplay::~EglDisplay() (
    GLLibraryEGL.cpp:734)
==29044==    by 0x758752B: destroy<mozilla::gl::EglDisplay> (new_allocator.h:
    140)
==29044==    by 0x758752B: destroy<mozilla::gl::EglDisplay> (alloc_traits.h:487)
==29044==    by 0x758752B: std::_Sp_counted_ptr_inplace<mozilla::gl::
    EglDisplay, std::allocator<mozilla::gl::EglDisplay>, (__gnu_cxx::
    _Lock_policy)2>::_M_dispose() (shared_ptr_base.h:554)
==29044==    by 0x7575F33: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::
    _M_release() (shared_ptr_base.h:155)
==29044==    by 0x758937F: ~__shared_count (shared_ptr_base.h:728)
==29044==    by 0x758937F: ~__shared_ptr (shared_ptr_base.h:1167)
==29044==    by 0x758937F: ~shared_ptr (shared_ptr.h:103)
==29044==    by 0x758937F: mozilla::gl::GLLibraryEGL::Init(bool, 
    nsTSubstring<char>*, void*) (GLLibraryEGL.cpp:504)
==29044==    by 0x75895F7: mozilla::gl::GLLibraryEGL::Create(
    nsTSubstring<char>*) (GLLibraryEGL.cpp:345)
==29044==    by 0x758974F: mozilla::gl::DefaultEglLibrary(nsTSubstring<char>*) (
    GLContextProviderEGL.cpp:1331)
==29044==    by 0x759AFAB: DefaultEglDisplay (GLContextEGL.h:29)
==29044==    by 0x759AFAB: mozilla::gl::GLContextProviderEGL::CreateHeadless(
    mozilla::gl::GLContextCreateDesc const&, nsTSubstring<char>*) (
    GLContextProviderEGL.cpp:1246)
==29044==    by 0x759B89B: mozilla::gl::GLContextProviderEGL::CreateOffscreen(
    mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gl::
    SurfaceCaps const&, mozilla::gl::CreateContextFlags, nsTSubstring<char>*) (
    GLContextProviderEGL.cpp:1288)
==29044==    by 0x76042BB: mozilla::layers::CompositorOGL::CreateContext() (
    CompositorOGL.cpp:256)
[...]

==29044==  Block was alloc'd at
==29044==    at 0x484BF24: operator new(unsigned long) (vg_replace_malloc.c:422)
==29044==    by 0x14E619FB: waylandws_GetDisplay (in /usr/lib64/libhybris/
    eglplatform_wayland.so)
==29044==    by 0xCCF63C7: ??? (in /usr/lib64/libEGL.so.1.0.0)
==29044==    by 0x14960F73: ??? (in /usr/lib64/qt5/plugins/
    wayland-graphics-integration-client/libwayland-egl.so)
==29044==    by 0x14778947: QtWaylandClient::QWaylandIntegration::
    initializeClientBufferIntegration() (in /usr/lib64/
    libQt5WaylandClient.so.5.6.3)
[...]
Let's dig into this one first. The code that's identified in the error output is the following:
static std::shared_ptr<EglDisplay> GetAndInitDisplay(GLLibraryEGL& egl,
                                          void* displayType,
                                          EGLDisplay display = EGL_NO_DISPLAY) {
  if (display == EGL_NO_DISPLAY) {
    display = egl.fGetDisplay(displayType);
  }
  if (!display) return nullptr;
  return EglDisplay::Create(egl, display, false);
}
It's the call to egl.fGetDisplay() that we can see towards the top of the error backtrace. But note that there are two call stacks in the output. The first is for the data being read that leads to the code above, the second is for where the memory being accessed was previously deleted:
==29044==  Address 0x14c39ef0 is 0 bytes inside a block of size 40 free'd
==29044==    at 0x484EAD8: operator delete(void*, unsigned long) (
    vg_replace_malloc.c:935)
The error itself is "Invalid read of size 8" which makes sense: we're reading from memory that's no longer allocated. Checking the call stack, here's the code (it's specifically the call to fTerminate()) that frees the memory that's subsequently read:
EglDisplay::~EglDisplay() {
  fTerminate();
  mLib->mActiveDisplays.erase(mDisplay);
}
It looks like the value of egl that's being passed into GetAndInitDisplay() is an instance of EglDisplay that has already been deleted. Not good.

The reason for the EGLDisplay being deleted is this call here:
  std::shared_ptr<EglDisplay> defaultDisplay = CreateDisplay(forceAccel, 
    out_failureId, aDisplay);
It's presumably not the call to CreateDisplay() that's causing the deletion, but rather the replacement of the EglDisplay contained in the shared_ptr with the one that's returned from it.

In ESR 78 the logic around EglDisplay didn't exist. Back then there was just an EGLDisplay value, which is a pointer to an opaque EGL structure and which could be passed around everywhere. In ESR 91 an effort has been made to generalise this, so that multiple such displays can be handled simultaneously. As part of this move the value was wrapped in the similarly — but not identically — named EglDisplay, which contains an EGLDisplay pointer along with some other values and helper methods.

It looks like there's something going wrong with this. Disentangling it could be horrific, but there is something that works in our favour, which is that the two calls don't happen too far apart from one another.

In fact, the call stack for both seems to originate in this method:
inline std::shared_ptr<EglDisplay> DefaultEglDisplay(
    nsACString* const out_failureId) {
  const auto lib = DefaultEglLibrary(out_failureId);
  if (!lib) {
    return nullptr;
  }
  return lib->DefaultDisplay(out_failureId);
}
The call to DefaultEglLibrary() in this code ends up calling eglTerminate() which deletes the value. Then the call to DefaultDisplay() attempts to read the value in again. It reads the value that was just deleted, whereas it should — presumably — be using the newly created value.

The actual failure is happening inside libEGL. I looked at the code there and, to be honest, it's not clear to me what the underlying reason is. Nevertheless, it's clear that there's something going wrong here in the gecko code that can be fixed and which should address it.

When the WebView is initialised it has to create an EGL display. Starting at the DefaultEglDisplay() method I copied out above there's a problem and the problem is that the display is created not once, but twice. To understand this, we have to note the fact that some time ago inside the GLLibraryEGL::Init() method I added the following code:
  std::shared_ptr<EglDisplay> defaultDisplay = CreateDisplay(forceAccel, 
    out_failureId, aDisplay);
  if (!defaultDisplay) {
    return false;
  }
  mDefaultDisplay = defaultDisplay;
Why did I add this? It was to replace the following code which had been removed from ESR 78 by D85496 in order to "Support multiple EglDisplays per GLLibraryEGL":
  mEGLDisplay = CreateDisplay(forceAccel, gfxInfo, out_failureId, aDisplay);
  if (!mEGLDisplay) {
    return false;
  }
It seemed natural, when reversing some of the changes moving from ESR 78 to ESR 91, to reintroduce the creation of the display, albeit it's now an EglDisplay wrapper rather than the EGLDisplay itself. I remember these changes being a huge source of angst back when I made them around Day 77.

This piece of code is a problem for two reasons, which become clear when we follow the flow through DefaultEglDisplay():
 
  1. DefaultEglLibrary() is called. At this stage things are still being initialised and so there is no GLLibraryEGL just yet. As a result the code is called to create it.
  2. This leads to a call to GLLibraryEGL::Init() which initialises the library.
  3. Inside this Init() method the code I added to call CreateDisplay() is called, the intention being to mimic the behaviour of ESR 78.
  4. defaultDisplay is a shared pointer so when the assignment happens there's a call made to the EglDisplay destructor. I wasn't expecting this, but it seems to be borne out by what the debugger shows. This is the first problem: there should be no construction or destruction happening here apart from via the call to CreateDisplay().
  5. Execution continues until we return back into the body of the DefaultEglDisplay() method. The local lib variable now contains a usable GLLibraryEGL, off of which the DefaultDisplay() method is called.
  6. DefaultDisplay() then goes ahead and calls CreateDisplay() all over again in order to set up the default EGL display. This is the second problem: the display has already been created by this point; we shouldn't be doing it again.
If we run the code through the debugger this is exactly the flow that we see. I've abridged the output below to just go down to DefaultEglLibrary() since that's where everything converges, but that should be enough to confirm this.
(gdb) b EglDisplay::~EglDisplay
(gdb) b GetAndInitDisplay
(gdb) r
[...]
Thread 38 "Compositor" hit Breakpoint 4, mozilla::gl::
    GetAndInitDisplay (egl=..., displayType=displayType@entry=0x0, 
    display=display@entry=0x0)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:149
149     ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp: No such file or directory.
(gdb) bt
#0  mozilla::gl::GetAndInitDisplay (egl=..., displayType=displayType@entry=0x0, 
    display=display@entry=0x0)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:149
#1  0x0000007ff111c9c8 in mozilla::gl::GLLibraryEGL::CreateDisplay (
    this=this@entry=0x7ed80031c0, forceAccel=forceAccel@entry=false, 
    out_failureId=out_failureId@entry=0x7fd40eb1c8, aDisplay=aDisplay@entry=0x0)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:813
#2  0x0000007ff111cdb0 in mozilla::gl::GLLibraryEGL::Init (
    this=this@entry=0x7ed80031c0, forceAccel=forceAccel@entry=false, 
    out_failureId=out_failureId@entry=0x7fd40eb1c8, aDisplay=aDisplay@entry=0x0)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:504
#3  0x0000007ff111d5f8 in mozilla::gl::GLLibraryEGL::Create (
    out_failureId=out_failureId@entry=0x7fd40eb1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:345
#4  0x0000007ff111d750 in mozilla::gl::DefaultEglLibrary (
    out_failureId=out_failureId@entry=0x7fd40eb1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1331
[...]
#29 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c
Continuing.

Thread 38 "Compositor" hit Breakpoint 3, mozilla::gl::EglDisplay::
    ~EglDisplay (this=0x7ed8003550, __in_chrg=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:733
733     in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp
(gdb) bt
#0  mozilla::gl::EglDisplay::~EglDisplay (this=0x7ed8003550, 
    __in_chrg=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:733
#1  0x0000007ff111b52c in __gnu_cxx::new_allocator<mozilla::gl::EglDisplay>::
    destroy<mozilla::gl::EglDisplay> (__p=<optimized out>, this=<optimized out>)
    at include/c++/8.3.0/ext/new_allocator.h:140
#2  std::allocator_traits<std::allocator<mozilla::gl::EglDisplay> >::
    destroy<mozilla::gl::EglDisplay> (__p=<optimized out>, __a=...)
    at include/c++/8.3.0/bits/alloc_traits.h:487
#3  std::_Sp_counted_ptr_inplace<mozilla::gl::EglDisplay, std::
    allocator<mozilla::gl::EglDisplay>, (__gnu_cxx::_Lock_policy)2>::_M_dispose 
    (
    this=<optimized out>) at include/c++/8.3.0/bits/shared_ptr_base.h:554
#4  0x0000007ff1109f34 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::
    _M_release (this=0x7ed8003540)
    at include/c++/8.3.0/ext/atomicity.h:69
#5  0x0000007ff111d380 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::
    ~__shared_count (this=0x7fd40ea790, __in_chrg=<optimized out>)
    at include/c++/8.3.0/bits/shared_ptr_base.h:1167
#6  std::__shared_ptr<mozilla::gl::EglDisplay, (__gnu_cxx::_Lock_policy)2>::
    ~__shared_ptr (this=0x7fd40ea788, __in_chrg=<optimized out>)
    at include/c++/8.3.0/bits/shared_ptr_base.h:1167
#7  std::shared_ptr<mozilla::gl::EglDisplay>::~shared_ptr (this=0x7fd40ea788, 
    __in_chrg=<optimized out>)
    at include/c++/8.3.0/bits/shared_ptr.h:103
#8  mozilla::gl::GLLibraryEGL::Init (this=this@entry=0x7ed80031c0, 
    forceAccel=forceAccel@entry=false, 
    out_failureId=out_failureId@entry=0x7fd40eb1c8, 
    aDisplay=aDisplay@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/
    GLLibraryEGL.cpp:504
#9  0x0000007ff111d5f8 in mozilla::gl::GLLibraryEGL::Create (
    out_failureId=out_failureId@entry=0x7fd40eb1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:345
#10 0x0000007ff111d750 in mozilla::gl::DefaultEglLibrary (
    out_failureId=out_failureId@entry=0x7fd40eb1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1331
[...]
#35 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c
Continuing.

Thread 38 "Compositor" hit Breakpoint 4, mozilla::gl::
    GetAndInitDisplay (egl=..., displayType=displayType@entry=0x0, 
    display=display@entry=0x0)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:149
149     in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp
(gdb) bt
#0  mozilla::gl::GetAndInitDisplay (egl=..., displayType=displayType@entry=0x0, 
    display=display@entry=0x0)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:149
#1  0x0000007ff111c9c8 in mozilla::gl::GLLibraryEGL::CreateDisplay (
    this=this@entry=0x7ed80031c0, forceAccel=forceAccel@entry=false, 
    out_failureId=out_failureId@entry=0x7fd40eb1c8, aDisplay=aDisplay@entry=0x0)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:813
#2  0x0000007ff111d918 in mozilla::gl::GLLibraryEGL::DefaultDisplay (
    this=0x7ed80031c0, out_failureId=out_failureId@entry=0x7fd40eb1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:745
#3  0x0000007ff112efc0 in mozilla::gl::DefaultEglDisplay (
    out_failureId=0x7fd40eb1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextEGL.h:33
#4  mozilla::gl::GLContextProviderEGL::CreateHeadless (desc=..., 
    out_failureId=out_failureId@entry=0x7fd40eb1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1246
#5  0x0000007ff112f89c in mozilla::gl::GLContextProviderEGL::CreateOffscreen (
    size=..., minCaps=..., 
    flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE, 
    out_failureId=out_failureId@entry=0x7fd40eb1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1288
#6  0x0000007ff11982bc in mozilla::layers::CompositorOGL::CreateContext (
    this=this@entry=0x7ed8002ed0)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:254
[...]
#27 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) c
[...]
Assuming that all of the above is correct, on the face of it there seems to be a simple solution, which is to remove the code I added to GLLibraryEGL::Init() which calls CreateDisplay(). An important consequence of this is that mDefaultDisplay will then no longer be set. We can't just set this inside DefaultEglDisplay() because that's in the wrong context. But it's already being set by GLLibraryEGL::DefaultDisplay(), which is in the right context (it's part of the GLLibraryEGL) and should be enough already.

I'm about to make this change when something else peculiar jumps out at me. The call to GLLibraryEGL::DefaultDisplay() checks whether the value of mDefaultDisplay is valid and only calls CreateDisplay() to create a new one if it's not. That's strange, since the value should already have been set in GLLibraryEGL::Init(). The exception is if CreateDisplay() returns null. It is possible that this is what's causing this issue. Let's check:
(gdb) b GLLibraryEGL.cpp:504
Breakpoint 5 at 0x7ff111cd98: file ${PROJECT}/gecko-dev/gfx/gl/
    GLLibraryEGL.cpp, line 504.
(gdb) r
[...]
Thread 37 "Compositor" hit Breakpoint 6, mozilla::gl::GLLibraryEGL::
    Init (this=this@entry=0x7ee00031c0, forceAccel=forceAccel@entry=false, 
    out_failureId=out_failureId@entry=0x7f1f7bf1c8, aDisplay=aDisplay@entry=0x0)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:504
504     ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp: No such file or directory.
(gdb) p defaultDisplay
$14 = std::shared_ptr<mozilla::gl::EglDisplay> (use count -134204000, weak 
    count 125) = {get() = 0x0}
(gdb) p mDefaultDisplay
$15 = std::weak_ptr<mozilla::gl::EglDisplay> (empty) = {get() = 0x0}
(gdb) n
505     in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp
(gdb) p defaultDisplay
$16 = std::shared_ptr<mozilla::gl::EglDisplay> (use count 1, weak count 1) = 
    {get() = 0x7ee0003570}
(gdb) p mDefaultDisplay
$17 = std::weak_ptr<mozilla::gl::EglDisplay> (empty) = {get() = 0x0}
(gdb) n
508     in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp
(gdb) n
510     in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp
(gdb) p mDefaultDisplay
$18 = std::weak_ptr<mozilla::gl::EglDisplay> (use count 1, weak count 2) = {get(
    ) = 0x7ee0003570}
(gdb) p this
$19 = (mozilla::gl::GLLibraryEGL * const) 0x7ee00031c0
As we can see that's not what's happening. The actual reason is that when it comes to the call to GLLibraryEGL::DefaultDisplay() the pointer has expired and hence it's created all over again.
(gdb) b GLLibraryEGL::DefaultDisplay
Breakpoint 7 at 0x7ff111d7f4: file ${PROJECT}/gecko-dev/gfx/gl/
    GLLibraryEGL.cpp, line 741.
(gdb) c
Continuing.

Thread 37 "Compositor" hit Breakpoint 7, mozilla::gl::GLLibraryEGL::
    DefaultDisplay (this=0x7ee00031c0, 
    out_failureId=out_failureId@entry=0x7f1f7bf1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:741
741     in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp
(gdb) n
742     in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp
(gdb) p this
$20 = (mozilla::gl::GLLibraryEGL * const) 0x7ee00031c0
(gdb) p mDefaultDisplay
$21 = std::weak_ptr<mozilla::gl::EglDisplay> (expired, weak count 1) = {get() = 
    0x7ee0003570}
(gdb) p ret
$22 = std::shared_ptr<mozilla::gl::EglDisplay> (use count 1634882661, weak 
    count 25954) = {get() = 0x7f1f7bf160}
(gdb) n
745     in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp
(gdb) 
The flow all makes sense now and the solution to all of these issues is the same: remove the code I added to Init(). As long as nothing is attempting to access the mDefaultDisplay before GLLibraryEGL::DefaultDisplay() is called, it should all work out fine.
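To make the expiry behaviour concrete, here's a minimal sketch of the pattern. It isn't the gecko code: Library, Display, InitCreatingDisplay() and the counters are hypothetical stand-ins, and I'm assuming the shared pointer created during initialisation simply goes out of scope, which is only one way the weak pointer could end up expired.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for mozilla::gl::EglDisplay.
struct Display {
  int id;
};

// Hypothetical stand-in for GLLibraryEGL; only the caching logic is modelled.
class Library {
 public:
  // Models the code removed from Init(): the display is created and cached
  // weakly, then the only strong reference is dropped on return.
  void InitCreatingDisplay() {
    std::shared_ptr<Display> display = CreateDisplay();
    mDefaultDisplay = display;
  }  // 'display' is destroyed here, so mDefaultDisplay expires.

  // Models DefaultDisplay(): reuse the cached display if it's still alive,
  // otherwise create a fresh one and cache that instead.
  std::shared_ptr<Display> DefaultDisplay() {
    std::shared_ptr<Display> ret = mDefaultDisplay.lock();
    if (!ret) {
      ret = CreateDisplay();
      mDefaultDisplay = ret;
    }
    return ret;
  }

  bool CacheExpired() const { return mDefaultDisplay.expired(); }
  int DisplaysCreated() const { return mCreated; }

 private:
  std::shared_ptr<Display> CreateDisplay() {
    return std::make_shared<Display>(Display{++mCreated});
  }

  std::weak_ptr<Display> mDefaultDisplay;
  int mCreated = 0;
};

// Replays the sequence seen in the debugger and returns how many displays
// ended up being created.
int RunScenario() {
  Library lib;
  lib.InitCreatingDisplay();
  assert(lib.CacheExpired());
  std::shared_ptr<Display> display = lib.DefaultDisplay();
  assert(!lib.CacheExpired());
  return lib.DisplaysCreated();
}
```

In this toy version RunScenario() creates two displays, whereas calling DefaultDisplay() alone creates just one and keeps returning it for as long as a caller holds on to the shared pointer.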

I've deleted the five lines. I've set the build running.

While that chugs away, let's briefly look at the other potentially relevant issue that valgrind threw up. This issue relates to APZCTreeManager::PrintAPZCInfo(). Although that doesn't sound especially relevant, the code is from the Compositor thread, so may be worth looking into just in case.
==29044== Thread 32 Compositor:
==29044== Mismatched free() / delete / delete []
==29044==    at 0x484E858: operator delete(void*) (vg_replace_malloc.c:923)
==29044==    by 0x5CEFFFF: std::__cxx11::basic_stringstream<char, std::
    char_traits<char>, std::allocator<char> >::~basic_stringstream() (in /usr/
    lib64/libstdc++.so.6.0.25)
==29044==    by 0x767C23F: void mozilla::layers::APZCTreeManager::
    PrintAPZCInfo<mozilla::layers::LayerMetricsWrapper>(mozilla::layers::
    LayerMetricsWrapper const&, mozilla::layers::AsyncPanZoomController const*) 
    (APZCTreeManager.cpp:1014)
==29044==    by 0x768DC63: mozilla::layers::HitTestingTreeNode* mozilla::layers:
    :APZCTreeManager::PrepareNodeForLayer<mozilla::layers::LayerMetricsWrapper>(
    mozilla::RecursiveMutexAutoLock const&, mozilla::layers::
    LayerMetricsWrapper const&, mozilla::layers::FrameMetrics const&, mozilla::
    layers::LayersId, mozilla::Maybe<mozilla::layers::ZoomConstraints> const&, 
    mozilla::layers::AncestorTransform const&, mozilla::layers::
    HitTestingTreeNode*, mozilla::layers::HitTestingTreeNode*, mozilla::layers::
    APZCTreeManager::TreeBuildingState&) (APZCTreeManager.cpp:1323)
==29044==    by 0x768E55F: mozilla::layers::APZCTreeManager::
    UpdateHitTestingTreeImpl<mozilla::layers::LayerMetricsWrapper>(mozilla::
    layers::LayerMetricsWrapper const&, bool, mozilla::layers::LayersId, 
    unsigned int)::{lambda(mozilla::layers::LayerMetricsWrapper)#2}::operator()(
    mozilla::layers::LayerMetricsWrapper) const (APZCTreeManager.cpp:481)
[...]

==29044==  Address 0x18c00050 is 0 bytes inside a block of size 513 alloc'd
==29044==    at 0x484B7C0: malloc (vg_replace_malloc.c:381)
==29044==    by 0x48F61EB: moz_xmalloc (in /usr/lib64/
    libqt5embedwidget.so.1.53.9)
==29044==    by 0x48F644B: std::__cxx11::basic_string<char, std::
    char_traits<char>, std::allocator<char> >::_M_create(unsigned long&, 
    unsigned long) (in /usr/lib64/libqt5embedwidget.so.1.53.9)
==29044==    by 0x5CFC2E7: std::__cxx11::basic_string<char, std::
    char_traits<char>, std::allocator<char> >::reserve(unsigned long) (in /usr/
    lib64/libstdc++.so.6.0.25)
==29044==    by 0x5CF14D7: std::__cxx11::basic_stringbuf<char, std::
    char_traits<char>, std::allocator<char> >::overflow(int) (in /usr/lib64/
    libstdc++.so.6.0.25)
==29044==    by 0x5CFA7B7: std::basic_streambuf<char, std::char_traits<char> >::
    xsputn(char const*, long) (in /usr/lib64/libstdc++.so.6.0.25)
==29044==    by 0x5CEC1AB: std::basic_ostream<char, std::char_traits<char> >& 
    std::__ostream_insert<char, std::char_traits<char> >(std::
    basic_ostream<char, std::char_traits<char> >&, char const*, long) (in /usr/
    lib64/libstdc++.so.6.0.25)
==29044==    by 0x765AC0B: operator<< <std::char_traits<char> > (ostream:561)
==29044==    by 0x765AC0B: mozilla::layers::operator<<(std::ostream&, mozilla::
    layers::ScrollableLayerGuid const&) (ScrollableLayerGuid.cpp:52)
==29044==    by 0x767BE3F: void mozilla::layers::APZCTreeManager::
    PrintAPZCInfo<mozilla::layers::LayerMetricsWrapper>(mozilla::layers::
    LayerMetricsWrapper const&, mozilla::layers::AsyncPanZoomController const*) 
    (APZCTreeManager.cpp:1015)
==29044==    by 0x768DC63: mozilla::layers::HitTestingTreeNode* mozilla::layers:
    :APZCTreeManager::PrepareNodeForLayer<mozilla::layers::LayerMetricsWrapper>(
    mozilla::RecursiveMutexAutoLock const&, mozilla::layers::
    LayerMetricsWrapper const&, mozilla::layers::FrameMetrics const&, mozilla::
    layers::LayersId, mozilla::Maybe<mozilla::layers::ZoomConstraints> const&, 
    mozilla::layers::AncestorTransform const&, mozilla::layers::
    HitTestingTreeNode*, mozilla::layers::HitTestingTreeNode*, mozilla::layers::
    APZCTreeManager::TreeBuildingState&) (APZCTreeManager.cpp:1323)
[...]
The definition of the APZCTreeManager::PrintAPZCInfo() method that's referred to in the output above looks like this:
template <class ScrollNode>
void APZCTreeManager::PrintAPZCInfo(const ScrollNode& aLayer,
                                    const AsyncPanZoomController* apzc) {
  const FrameMetrics& metrics = aLayer.Metrics();
  std::stringstream guidStr;
  guidStr << apzc->GetGuid();
  mApzcTreeLog << "APZC " << guidStr.str()
               << "\tcb=" << metrics.GetCompositionBounds()
               << "\tsr=" << metrics.GetScrollableRect()
               << (metrics.IsScrollInfoLayer() ? "\tscrollinfo" : "")
               << (apzc->HasScrollgrab() ? "\tscrollgrab" : "") << "\t"
               << aLayer.Metadata().GetContentDescription().get();
}
There may be something to do here, but this is so far removed from the rendering pipeline that I don't see how it can be related to the issues I'm trying to fix. So I'm going to ignore this one and potentially come back to it later. I'm thinking there's some work to be done cleaning up the valgrind output and if I do eventually do so, this would fall under that.
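For context, valgrind raises "Mismatched free() / delete / delete []" when a block is released by a different family of function from the one that allocated it. In the trace above the allocation bottoms out in malloc (via moz_xmalloc) while the release goes through operator delete. The sketch below isn't a fix for gecko; it just shows, with hypothetical helper names, one common way of keeping each allocation tied to its matching release.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>

// Deleter that releases memory with free(), matching a malloc() allocation.
struct FreeDeleter {
  void operator()(void* ptr) const { std::free(ptr); }
};

using MallocString = std::unique_ptr<char, FreeDeleter>;

// Copies 'text' into a malloc()-allocated buffer. Because the buffer is owned
// by a MallocString, it can only ever be released with free().
MallocString DuplicateWithMalloc(const std::string& text) {
  char* buffer = static_cast<char*>(std::malloc(text.size() + 1));
  if (!buffer) {
    return MallocString();
  }
  std::memcpy(buffer, text.c_str(), text.size() + 1);
  return MallocString(buffer);
}

// By contrast, memory from new[] is owned by a unique_ptr<char[]>, whose
// default deleter calls delete[].
std::unique_ptr<char[]> DuplicateWithNew(const std::string& text) {
  std::unique_ptr<char[]> buffer(new char[text.size() + 1]);
  std::memcpy(buffer.get(), text.c_str(), text.size() + 1);
  return buffer;
}

// Quick self-check exercising both ownership types.
void CheckDuplicates() {
  assert(std::strcmp(DuplicateWithMalloc("APZC").get(), "APZC") == 0);
  assert(std::strcmp(DuplicateWithNew("APZC").get(), "APZC") == 0);
}
```

Encoding the release function in the type means the pairing can't drift apart as the code changes, which is the property valgrind is checking for at runtime.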

It's been a long one today and I apologise for that. Sometimes it takes a lot to unravel the code and structure my thoughts around it. The key question now is whether the small change I've made will make any difference to the rendering. We'll find that out tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
18 Mar 2024 : Day 189 #
The memory graphs we looked at yesterday told a different story to the one we were expecting. The memory usage didn't shoot through the roof; instead it stayed stable at between 200 and 250 MB. So if memory allocation isn't the problem, it does raise the question of what is. It could be that it's graphics related, that is, to do with the way GL and EGL textures are being created and destroyed.

Despite this I still think it may be worthwhile running the harbour-webview app through valgrind. Valgrind is a debugging tool that, amongst other tricks, will tell you the state of the memory when the application is exited. In case some portion of the allocated memory isn't freed up at the end, it will highlight the fact. It'll also tell you all sorts of other useful memory information. But crucially, if memory isn't being freed, then that's typically a sign of a memory leak.

In the case of EmbedLite the memory management is pretty flaky, making it hard to disentangle what's normal from what's exceptional. I've nevertheless collected valgrind logs from both harbour-webview and sailfish-browser. I did this on both ESR 78 and ESR 91 in the hope that performing some kind of comparison might be useful. First on my phone running ESR 78:
$ valgrind --log-file=valgrind-harbour-webview-esr78-01.txt harbour-webview
$ valgrind --log-file=valgrind-sailfish-browser-esr78-01.txt sailfish-browser
And then following that on a different phone running ESR 91:
$ valgrind --log-file=valgrind-harbour-webview-esr91-01.txt harbour-webview
$ valgrind --log-file=valgrind-sailfish-browser-esr91-01.txt sailfish-browser
The task now is to manually compare the two. The ESR 91 valgrind output shows a lot of errors:
==29044== ERROR SUMMARY: 7057 errors from 159 contexts (suppressed: 0 from 0)
That's more than the number of errors in the ESR 78 output, but it's not an order of magnitude different:
==17128== ERROR SUMMARY: 5737 errors from 155 contexts (suppressed: 0 from 0)
Skimming through, there are a lot of mismatched frees shown for QtWaylandClient; in fact these seem to be the majority. However, these appear for both ESR 78 and ESR 91. I don't recommend it, but I've uploaded both the ESR 78 valgrind output and the ESR 91 valgrind output, so you're very welcome to take a look yourself. You might spot something I missed.

There are multiple mismatched frees related to QMozViewPrivate::createView(), android_dlopen(), do_dlopen(), QCoreApplicationPrivate::sendPostedEvents(), QMozSecurity, QSGNode::destroy(), QMozContext and others. These all look unrelated to the parts of the code I'm looking at, plus they also appear for both ESR 78 and ESR 91. So I'm going to leave those to one side for now. It may be that there's some useful work to be done cleaning all of these up, but I don't believe they relate to the rendering issues I'm experiencing.

There is also a collection of memory bugs that happen in the Compositor thread. These are far more likely to relate to the changes I've been looking at. There are some that reference CompositorOGL::GetShaderProgramFor(), but which appear for both ESR 78 and ESR 91. Given they're in both, I think I'll ignore those as well.

There are four further errors which I can see appear in ESR 91 but not ESR 78. These are the ones I want to focus on. The first one doesn't seem to be directly associated with the gecko code. It references EGL, so it's possible it's related, but without something more concrete I'm not sure what to do with it.
==29044== Mismatched free() / delete / delete []
==29044==    at 0x484E2B8: free (vg_replace_malloc.c:872)
==29044==    by 0x15CE6943: ??? (in /usr/libexec/droid-hybris/system/lib64/
    libEGL.so)
==29044==    by 0x15CE590B: ??? (in /usr/libexec/droid-hybris/system/lib64/
    libEGL.so)
==29044==    by 0x15CE523F: ??? (in /usr/libexec/droid-hybris/system/lib64/
    libEGL.so)
==29044==    by 0x15CDC4BF: ??? (in /usr/libexec/droid-hybris/system/lib64/
    libEGL.so)
==29044==    by 0x15CDC7A3: eglGetDisplay (in /usr/libexec/droid-hybris/system/
    lib64/libEGL.so)
==29044==    by 0xCCF63AF: ??? (in /usr/lib64/libEGL.so.1.0.0)
==29044==    by 0x14960F73: ??? (in /usr/lib64/qt5/plugins/
    wayland-graphics-integration-client/libwayland-egl.so)
==29044==    by 0x14778947: QtWaylandClient::QWaylandIntegration::
    initializeClientBufferIntegration() (in /usr/lib64/
    libQt5WaylandClient.so.5.6.3)
==29044==    by 0x14778C6F: QtWaylandClient::QWaylandIntegration::
    clientBufferIntegration() const (in /usr/lib64/libQt5WaylandClient.so.5.6.3)
==29044==    by 0x147784F3: QtWaylandClient::QWaylandIntegration::hasCapability(
    QPlatformIntegration::Capability) const (in /usr/lib64/
    libQt5WaylandClient.so.5.6.3)
==29044==    by 0x4B3EA1F: QSGRenderLoop::instance() (in /usr/lib64/
    libQt5Quick.so.5.6.3)
==29044==  Address 0x14bd40c0 is 0 bytes inside a block of size 48 alloc'd
==29044==    at 0x484BD54: operator new(unsigned long) (vg_replace_malloc.c:420)
==29044==    by 0x15A5C167: ??? (in /usr/libexec/droid-hybris/system/lib64/
    libc++.so)
Similarly, although the following is apparently happening in the Compositor thread, I don't see a way to tie it to the gecko code.
==29044== Mismatched free() / delete / delete []
==29044==    at 0x484E2B8: free (vg_replace_malloc.c:872)
==29044==    by 0x174836EF: ??? (in /odm/lib64/libllvm-glnext.so)
==29044==  Address 0x2156db90 is 0 bytes inside a block of size 48 alloc'd
==29044==    at 0x484BD54: operator new(unsigned long) (vg_replace_malloc.c:420)
==29044==    by 0x163E9167: std::__1::basic_string<char, std::__1::
    char_traits<char>, std::__1::allocator<char> >::__grow_by_and_replace(
    unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, 
    unsigned long, char const*) (in /apex/com.android.vndk.v30/lib64/libc++.so)
==29044==    by 0x163E9273: std::__1::basic_string<char, std::__1::
    char_traits<char>, std::__1::allocator<char> >::append(char const*) (in /
    apex/com.android.vndk.v30/lib64/libc++.so)
==29044==    by 0x17483623: ??? (in /odm/lib64/libllvm-glnext.so)
That leaves two remaining issues, both occurring in the Compositor thread. These look like the most promising avenues to look into, but I'm not going to look at them today; they'll need a bit more time than I have right now. I'll give full details and start working through them tomorrow.

17 Mar 2024 : Day 188 #
I was still trying to find memory leaks yesterday, although I came to the conclusion that it may not be a memory leak causing the problem after all. I need to continue investigating to try to get a clearer picture, and there are a couple of obvious things to try.

One approach worth trying is running the app through valgrind. I'm not super-hopeful that this will yield helpful results because gecko struggles to maintain a healthy memory map at the best of times. As a result valgrind is likely to generate a huge number of results, most of which are just the consequence of "normal" (for some definition of the word) EmbedLite behaviour and so unrelated to the problem I'm trying to fix.

So before trying out valgrind I'm going to try something simpler. That is, I thought I'd just measure the memory usage and see whether it's growing out of control, or remaining stable.

There are many ways to check memory usage: top might be one way. But as I'm writing this up as a diary entry I thought it would be better to generate some graphs. This will also give a more concrete idea of what's happening over time.

There's also no shortage of tools for generating memory usage graphs. I've plumped for psrecord, a small Python tool, since it's easy to install and use and generates nice clean graphs of memory usage against time.
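The underlying idea is just to sample a process's resident memory at regular intervals. As a rough illustration of what such a sample involves on Linux, here's a sketch that reads the VmRSS field from /proc/<pid>/status. The function names are my own, and this is an assumption about how one might do it by hand rather than a description of how psrecord is implemented.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Extracts the VmRSS value (in kB) from text in the format of
// /proc/<pid>/status. Returns -1 if no usable VmRSS line is present.
long ParseVmRssKb(const std::string& statusText) {
  std::istringstream lines(statusText);
  std::string line;
  while (std::getline(lines, line)) {
    if (line.compare(0, 6, "VmRSS:") == 0) {
      std::istringstream fields(line.substr(6));
      long kb = -1;
      if (fields >> kb) {
        return kb;
      }
      return -1;
    }
  }
  return -1;
}

// Reads the current resident set size of a process on Linux, or -1 on error.
long ReadVmRssKb(int pid) {
  assert(pid > 0);
  std::ifstream file("/proc/" + std::to_string(pid) + "/status");
  if (!file) {
    return -1;
  }
  std::stringstream contents;
  contents << file.rdbuf();
  return ParseVmRssKb(contents.str());
}
```

Calling ReadVmRssKb() in a loop with a short sleep between calls and logging each value against the elapsed time would give the raw data for a memory-against-time graph.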

The fact it's so easy to install and run a random Python utility is one of the things I really love about Sailfish OS. All I have to do is drop to a shell (I usually have an SSH session running already anyway), create a virtual environment and use pip to install it:
$ python3 -m venv venv
$ . ./venv/bin/activate
$ pip install --upgrade pip
$ pip install psrecord matplotlib
Beautiful! I'm sure there's a nice way to use Python on Android (using Termux?) and iOS (Pythonista?) but I'm not sure I'd be able to install and use psrecord quite so easily.
$ harbour-webview & psrecord --plot mem-harbour-webview-esr91.png \
    --interval 0.2 "$!"
$ sailfish-browser & psrecord --plot mem-sailfish-browser-esr91.png \
    --interval 0.2 "$!"
Then, when I'm done, I can deactivate and delete the virtual environment to restore my phone to its previous state.
$ deactivate
$ rm -rf venv
The resulting graphs are pretty clear, giving both memory usage in MB and CPU usage as a percentage of the total available. We're really only interested in memory usage though. For context, these graphs were collected by running each of the apps for 20 seconds. After 10 seconds I started scrolling the page up and down with my finger, since the problem doesn't seem to manifest itself when the display is static.
 
Four graphs, each showing CPU (%) and Real Memory (MB) against time (20 seconds) for sailfish-browser and harbour-webview ESR 78 and ESR 91; the memory lines increase over time up to around 200-400 MB but there's not much to distinguish between the behaviours shown in the graphs

What do we find from these? With both ESR 78 and ESR 91 the browser increases memory to around 250-300 MB. It's nice to see that the memory footprint on ESR 91 is no higher than for ESR 78; in fact, if anything, it's lower. ESR 78 seems to accumulate memory until the app is shut down, whereas ESR 91 is more consistent.

We see a similar pattern for the WebView. ESR 78 quickly rises to 300 MB of memory usage before jumping up to 350 MB when I start scrolling the page. On ESR 91 the memory rapidly climbs to 200 MB where it stays pretty consistently throughout. Scrolling does cause some memory jitter, but not as much as on ESR 78.

This is all both encouraging and discouraging. It's encouraging to see ESR 91 isn't more of a memory hog than ESR 78. If anything it seems to be leaner. It's discouraging because the lack of excessive memory usage on ESR 91 suggests I may be looking in the wrong place for the solution to the issue I'm trying to solve.

Is that discouraging? Maybe it's encouraging. I have more information than I had before, but in truth I don't feel closer to finding a solution.

I spent a surprising amount of time investigating different ways to collect memory usage data. On top of that, generating the graphs also took a while, given it involved using two different phones and two different apps on each. So I'm going to call it a day. I still want to run the apps through valgrind (maybe this will pick up something new) but I'm going to leave that task until tomorrow.

16 Mar 2024 : Day 187 #
Yesterday I was struggling. And getting this WebView render pipeline working is starting to feel drawn out. I don't want to make any claims about the larger task, but I'm hoping that a good night's sleep and some time to ponder on the best approach in the shower this morning will have helped me focus on the task I'm struggling with.

And this task is figuring out what, if anything, is different between the execution of TextureClient::Destroy() on ESR 78 compared to ESR 91. I'm still labouring under the hypothesis that it's this method that's causing the (hypothetical) memory leak that's causing ESR 91 execution to seize up over time.

The difficulties I experienced yesterday were twofold. First, on ESR 78 the actual section of code that reclaims the allocated memory appeared never to be executed. Second, applying the debugger to ESR 91 gave peculiar results: what appeared to be an infinite loop of calls to Destroy() that never allowed me to step into the method.

To tackle these difficulties I'm going to try two things. First I need to stick a breakpoint inside the conditional code that reclaims the memory on ESR 78, to establish whether it ever gets called. Second I'm going to annotate the ESR 91 code with debug prints. That should allow me to get a better idea of the true execution flow. If the debugger isn't playing by the rules, I'll take my ball somewhere else.
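As an illustration of the kind of annotation I have in mind, here's a small sketch of a scoped tracer that logs when a function is entered and when it's left. ScopedTrace, TracedDestroy() and CaptureTrace() are hypothetical names made up for this example; they aren't part of gecko, and real debug prints there would more likely go through the existing logging facilities.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Logs one line when constructed and another when destroyed, so the true
// entry and exit order of a function shows up in the output.
class ScopedTrace {
 public:
  ScopedTrace(std::ostream& out, std::string name)
      : mOut(out), mName(std::move(name)) {
    assert(!mName.empty());
    mOut << "enter " << mName << "\n";
  }
  ~ScopedTrace() { mOut << "exit " << mName << "\n"; }

  ScopedTrace(const ScopedTrace&) = delete;
  ScopedTrace& operator=(const ScopedTrace&) = delete;

 private:
  std::ostream& mOut;
  std::string mName;
};

// Hypothetical function standing in for the method being investigated.
void TracedDestroy(std::ostream& log, bool hasData) {
  ScopedTrace trace(log, "Destroy");
  if (hasData) {
    log << "  deallocating\n";
  }
}

// Runs the function once and returns everything that was logged.
std::string CaptureTrace(bool hasData) {
  std::ostringstream log;
  TracedDestroy(log, hasData);
  return log.str();
}
```

If a method really were being called in an endless loop, output like this would show a repeating run of enter and exit lines, which is easier to trust than a debugger working from mismatched source.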

So, first up, checking the ESR 78 flow. The structure of the Destroy() method looks like this on ESR 78:
void TextureClient::Destroy() {
[...]
  RefPtr<TextureChild> actor = mActor;
[...]
  TextureData* data = mData;
  if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) {
    mData = nullptr;
  }

  if (data || actor) {
[...]
    DeallocateTextureClient(params);
  }
}
I'm interested in whether it ever goes inside the condition at the end in order to ultimately call DeallocateTextureClient(). If it doesn't — or if this only happens occasionally, say on shutdown — then I'm likely to be looking in the wrong place.

The reason I've never seen it enter this condition is because data (which is derived from the mData class variable) and actor (which is derived from the mActor class variable) have always been null when entering this method.

Let's do this then.
(gdb) break TextureClient.cpp:583
Breakpoint 5 at 0x7fb8e9b2fc: file gfx/layers/client/TextureClient.cpp, line 
    585.
(gdb) r
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::layers::
    TextureClient::Destroy (this=this@entry=0x7f8ea53bb0)
    at gfx/layers/client/TextureClient.cpp:585
585         params.allocator = mAllocator;
(gdb) c
Continuing.
[LWP 16174 exited]

Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::layers::
    TextureClient::Destroy (this=this@entry=0x7f8effa780)
    at gfx/layers/client/TextureClient.cpp:585
585         params.allocator = mAllocator;
(gdb) p mWorkaroundAnnoyingSharedSurfaceLifetimeIssues
$21 = false
(gdb) p data
$22 = (mozilla::layers::TextureData *) 0x7f8dac3a70
(gdb) p actor
$23 = <optimized out>
(gdb) b DeallocateTextureClient
Breakpoint 6 at 0x7fb8e9a908: file gfx/layers/client/TextureClient.cpp, line 
    490.
(gdb) c
Continuing.

Thread 7 "GeckoWorkerThre" hit Breakpoint 6, mozilla::layers::
    DeallocateTextureClient (params=...)
    at gfx/layers/client/TextureClient.cpp:490
490     void DeallocateTextureClient(TextureDeallocParams params) {
(gdb) p params
$24 = {data = 0x7f8dac3a70, actor = {mRawPtr = 0x7f8d648720}, allocator = 
    {mRawPtr = 0x7f8cb40ff8}, clientDeallocation = false, syncDeallocation = 
    false, workAroundSharedSurfaceOwnershipIssue = false}
(gdb) n
491       if (!params.actor && !params.data) {
(gdb) n
324     obj-build-mer-qt-xr/dist/include/nsCOMPtr.h: No such file or directory.
(gdb) n
499       if (params.allocator) {
(gdb) n
500         ipdlThread = params.allocator->GetThread();
(gdb) n
501         if (!ipdlThread) {
(gdb) n
510       if (ipdlThread && !ipdlThread->IsOnCurrentThread()) {
(gdb) n
532       if (!ipdlThread) {
(gdb) n
540       if (!actor) {
(gdb) n
555       actor->Destroy(params);
(gdb) n
497       nsCOMPtr<nsISerialEventTarget> ipdlThread;
(gdb) n
505           return;
(gdb) c
Continuing.
[...]
So that clears things up: it does go inside the condition and it does deallocate the actor. This time the data and actor values are non-null and, because in this case we're executing on the IPDL thread, the actor is destroyed directly.
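The check on IsOnCurrentThread() in the trace follows a common pattern: do the work immediately if we're already on the owning thread, otherwise hand it over to that thread. Here's a toy model of that pattern. OwnerThread and DestroyOnOwner() are hypothetical, and the assumption that the off-thread case queues the work for later is mine, since the trace above only exercises the direct path.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for the thread that owns an actor.
class OwnerThread {
 public:
  explicit OwnerThread(std::thread::id id) : mId(id) {}

  bool IsOnCurrentThread() const {
    return std::this_thread::get_id() == mId;
  }

  // Queues work to be run later by the owning thread.
  void Dispatch(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mMutex);
    mPending.push(std::move(task));
  }

  // Called by the owning thread to run everything queued so far.
  void RunPending() {
    assert(IsOnCurrentThread());
    for (;;) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mPending.empty()) {
          return;
        }
        task = std::move(mPending.front());
        mPending.pop();
      }
      task();  // Run outside the lock so tasks may dispatch more work.
    }
  }

  std::size_t PendingCount() const {
    std::lock_guard<std::mutex> lock(mMutex);
    return mPending.size();
  }

 private:
  const std::thread::id mId;
  mutable std::mutex mMutex;
  std::queue<std::function<void()>> mPending;
};

// Destroys immediately when already on the owning thread, otherwise hands the
// job over to that thread. Returns true if the destruction ran directly.
bool DestroyOnOwner(OwnerThread& owner, std::function<void()> destroy) {
  if (!owner.IsOnCurrentThread()) {
    owner.Dispatch(std::move(destroy));
    return false;
  }
  destroy();
  return true;
}
```

The direct path corresponds to what the debugger showed on ESR 78; the queued path is there only to complete the picture.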

Let's now try the same thing on ESR 91.
(gdb) break TextureClient.cpp:574
Breakpoint 4 at 0x7ff1148ddc: TextureClient.cpp:574. (2 locations)
(gdb) r
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 4, mozilla::layers::
    TextureClient::Destroy (this=this@entry=0x7fc5612680)
    at gfx/layers/client/TextureClient.cpp:587
587         if (actor) {
(gdb) c
Continuing.

Thread 7 "GeckoWorkerThre" hit Breakpoint 4, mozilla::layers::
    TextureClient::Destroy (this=this@entry=0x7fc55b8690)
    at gfx/layers/client/TextureClient.cpp:587
587         if (actor) {
(gdb) c
Continuing.

Thread 7 "GeckoWorkerThre" hit Breakpoint 4, mozilla::layers::
    TextureClient::Destroy (this=this@entry=0x7fc55b6c10)
    at gfx/layers/client/TextureClient.cpp:587
587         if (actor) {
(gdb) p data
$12 = (mozilla::layers::TextureData *) 0x7fc4d79e80
(gdb) p actor
$13 = <optimized out>
(gdb) b DeallocateTextureClient
Breakpoint 5 at 0x7ff1148394: file gfx/layers/client/TextureClient.cpp, line 
    489.
(gdb) c
Continuing.

Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::layers::
    DeallocateTextureClient (params=...)
    at gfx/layers/client/TextureClient.cpp:489
489     void DeallocateTextureClient(TextureDeallocParams params) {
(gdb) p params
$14 = {data = 0x7fc4d79e80, actor = {mRawPtr = 0x7fc59e6620}, allocator = 
    {mRawPtr = 0x7fc46684e0}, clientDeallocation = false, syncDeallocation = 
    false}
(gdb) n
490       if (!params.actor && !params.data) {
(gdb) n
496       nsCOMPtr<nsISerialEventTarget> ipdlThread;
(gdb) n
[New LWP 5954]
498       if (params.allocator) {
(gdb) n
499         ipdlThread = params.allocator->GetThread();
(gdb) n
[LWP 5954 exited]
[New LWP 6110]
500         if (!ipdlThread) {
(gdb) n
509       if (ipdlThread && !ipdlThread->IsOnCurrentThread()) {
(gdb) n
[LWP 6110 exited]
867     ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h: No such file or 
    directory.
(gdb) n
539       if (!actor) {
(gdb) n
548       actor->Destroy(params);
(gdb) n
496       nsCOMPtr<nsISerialEventTarget> ipdlThread;
(gdb) n
mozilla::layers::TextureClient::Destroy (this=this@entry=0x7fc55b6c10)
    at gfx/layers/client/TextureClient.cpp:591
591         DeallocateTextureClient(params);
(gdb) c
Continuing.
[...]
Here we see something similar. The inner condition is entered and the DeallocateTextureClient() method is ultimately called on the same thread. The data and actor values are both non-null.

To return to our original questions, I think this has answered both of them. First we can see that on ESR 78 this is definitely a place where memory is actually being freed. But on the other hand we also see it being freed on ESR 91. That doesn't mean that there isn't a problem here, but it does make it less likely.

Nevertheless there has been a change to this code. The mWorkaroundAnnoyingSharedSurfaceLifetimeIssues flag was removed by upstream. It's possible that this is causing the issue we're experiencing, so I'm going to reverse this and reinsert the removed code. I'm not really expecting this to fix things, but having travelled out into the sticks I now need to check under every stone. I've no choice but to figure this thing out if the WebView is going to get back up and running again.

[...]

Having worked carefully through the code and reintroduced the mWorkaroundAnnoyingSharedSurfaceLifetimeIssues variable and its associated logic, it's disappointing to find it's not fixed the issue. I'm not out of ideas yet though. Tomorrow I'm going to have a go at profiling the application and using specific memory tools (e.g. valgrind) to try to figure out which memory is being allocated but not deallocated. I'm not sure I hold out much hope of success using valgrind given that gecko is so big and messy and suffering from leakage as it is, but you never know.

15 Mar 2024 : Day 186 #
Today I'm still searching for memory leaks and one memory leak in particular: after fixing the SurfaceFactory_EGLImage::Create() method which, as the name implies, creates EGL surface textures, there's now a massive memory leak that grinds not just the browser but the entire phone to a halt.

I'm hypothesising that there's something created by the method that should get freed. The texture itself is the most likely culprit, since even just a few 1080 by 2520 textures are going to consume a lot of memory. Generate a few of those each frame without freeing them and bad things will happen.

We're definitely seeing bad things happen, so this is my guess. However yesterday I checked a few of the associated destructors and they seem to be getting called.

So today I'm going to try to tackle it from a different angle. Rather than try to figure out what's not getting freed I'm going to try to find out what's causing the crash. I've started off by adjusting the size of the texture created as part of the render loop. Rather than a 1080 by 2520 texture, I've set it to always generate an 8 by 8 texture, by altering this code from GLContext.cpp (the commented part is the old code, replaced by the equivalent lines directly below):
GLuint CreateTexture(GLContext* aGL, GLenum aInternalFormat, GLenum aFormat,
                     GLenum aType, const gfx::IntSize& aSize, bool linear) {
[...]
//  aGL->fTexImage2D(LOCAL_GL_TEXTURE_2D, 0, aInternalFormat, aSize.width,
//                   aSize.height, 0, aFormat, aType, nullptr);
  aGL->fTexImage2D(LOCAL_GL_TEXTURE_2D, 0, aInternalFormat, 8,
                   8, 0, aFormat, aType, nullptr);

  return tex;
}
This won't prevent the memory leak, but if this is the texture that's not being freed, it should at least slow it down, which should be discernible in use.
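To put some rough numbers on this, the sketch below works out the size of a single texture and how quickly memory would be lost if one were leaked every frame. The figure of 4 bytes per pixel (an RGBA format with 8 bits per channel) and the 60 frames per second used in the example are both assumptions on my part; the actual format and frame rate may differ.

```cpp
#include <cstdint>

// Bytes needed for one uncompressed texture. The bytes-per-pixel default is an
// assumption: 4 for an RGBA format with 8 bits per channel.
constexpr std::uint64_t TextureBytes(std::uint64_t width, std::uint64_t height,
                                     std::uint64_t bytesPerPixel = 4) {
  return width * height * bytesPerPixel;
}

// Bytes lost per second if one texture of this size is leaked every frame.
constexpr std::uint64_t LeakPerSecond(std::uint64_t width,
                                      std::uint64_t height,
                                      std::uint64_t framesPerSecond) {
  return TextureBytes(width, height) * framesPerSecond;
}

// A full-screen texture is 10,886,400 bytes, or roughly 10.4 MiB.
static_assert(TextureBytes(1080, 2520) == 10886400, "full-size texture");

// The shrunken test texture is just 256 bytes.
static_assert(TextureBytes(8, 8) == 256, "8 by 8 test texture");
```

Under these assumptions, leaking one full-size texture per frame at 60 frames per second would lose around 623 MiB every second, compared with about 15 KiB per second for the 8 by 8 texture. So if this texture is the culprit, the difference ought to be obvious.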

While the updated code builds and gets transferred over to my phone I can continue looking through the code. Here's the backtrace from the SharedSurface_EGLImage::Create() method, the change to which has triggered the memory leak. It's possible, maybe even likely, that the code for reclaiming the resources will live somewhere in or close to the methods in this stack.
#0  SharedSurface_EGLImage::Create, SharedSurfaceEGL.cpp:58
#1  SurfaceFactory_EGLImage::CreateSharedImpl, WeakPtr.h:185
#2  SurfaceFactory::CreateShared, RefCounted.h:240
#3  SurfaceFactory::NewTexClient, SharedSurface.cpp:406
#4  GLScreenBuffer::Swap, UniquePtr.h:290
#5  GLScreenBuffer::PublishFrame, GLScreenBuffer.h:229
#6  EmbedLiteCompositorBridgeParent::PresentOffscreenSurface,
    EmbedLiteCompositorBridgeParent.cpp:191
#7  embedlite::nsWindow::PostRender, embedshared/nsWindow.cpp:248
#8  InProcessCompositorWidget::PostRender, InProcessCompositorWidget.cpp:60
#9  LayerManagerComposite::Render, Compositor.h:575
#10 LayerManagerComposite::UpdateAndRender, LayerManagerComposite.cpp:657
#11 LayerManagerComposite::EndTransaction, LayerManagerComposite.cpp:572
I should take a closer look at SharedSurfaceTextureClient::~SharedSurfaceTextureClient() which is, in theory, called each time Swap() is called, when the contents of the mBack reference-counted pointer are freed.
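The expectation here can be modelled with a toy example: each swap replaces the object held by a reference-counted pointer, and the previous object's destructor runs once its last reference goes away. ToyTextureClient, ToyScreenBuffer and LiveClientsAfter() are hypothetical names, and std::shared_ptr stands in for gecko's own reference counting.

```cpp
#include <cassert>
#include <memory>

// Hypothetical texture client that counts how many instances are alive.
struct ToyTextureClient {
  static int sLive;
  ToyTextureClient() { ++sLive; }
  ~ToyTextureClient() { --sLive; }
  ToyTextureClient(const ToyTextureClient&) = delete;
  ToyTextureClient& operator=(const ToyTextureClient&) = delete;
};
int ToyTextureClient::sLive = 0;

// Hypothetical screen buffer holding a single reference-counted back buffer.
class ToyScreenBuffer {
 public:
  // Replaces the back buffer with a new client. The previous client loses
  // its last reference during the assignment, so its destructor runs.
  void Swap() { mBack = std::make_shared<ToyTextureClient>(); }

 private:
  std::shared_ptr<ToyTextureClient> mBack;
};

// Performs 'frames' swaps and returns how many clients are still alive
// while the screen buffer itself still exists.
int LiveClientsAfter(int frames) {
  assert(frames >= 0);
  ToyScreenBuffer buffer;
  for (int i = 0; i < frames; ++i) {
    buffer.Swap();
  }
  return ToyTextureClient::sLive;
}
```

In this model only one client is ever left alive, no matter how many swaps happen. Of course, a destructor being called only helps if it releases everything the object acquired, which is why the destructor itself deserves a closer look.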

My new binary has copied over to my phone. The running app exhibits very similar symptoms to before: responsive at first, but it quickly seizes up and becomes unresponsive. I killed the process before it took down my phone, but a visual check suggests reducing the texture size didn't have any obvious benefit.

Back to looking through the code, I'm working through SharedSurfaceTextureClient::~SharedSurfaceTextureClient() and I notice this method called Destroy(). Although it's not immediately clear from the way the code is written, this call actually goes through to TextureClient::Destroy(). I recall making changes to this and scanning through my notes confirms it: it was back on Day 176. At the time I wrote this:
 
This mWorkaroundAnnoyingSharedSurfaceLifetimeIssues is used in ESR 78 to decide whether or not to deallocate the TextureData when the TextureClient is destroyed. This has been removed in ESR 91 and to be honest, I'm comfortable leaving it this way... Maybe this will cause problems later, but my guess is that this will show up with the debugger if that's the case, at which point we can refer back to the ESR 78 code to restore these checks.

Could it be that the changes I made back then are causing the issues I'm experiencing now?

While I can step through the ESR 78 build, unfortunately all of the changes integrated using partial builds have messed up the debug source for the ESR 91 build. Stepping through gives me just a slew of useless "TextureClientSharedSurface.cpp: No such file or directory." messages. So I've decided to kick off a full build. With a bit of luck it'll be completed before the end of the day and I'll be able to come back to this in the evening to compare the two execution flows by stepping through them.

[...]

It was just before 9:00 this morning that I set the build going. I could well imagine it running into the night and leaving me with no more time to get the benefit from it. But by 17:13 this evening the build had completed. That means there's still time to perform the comparison of TextureClient::Destroy() running on the two versions.

The results, however, are not what I was expecting. At least on ESR 78 things seem to act normally. The breakpoint is hit and it's possible to step through the code. It seems a little anomalous that both the data and actor variables are null, meaning that the actual deallocation step gets skipped:
(gdb) b TextureClient::Destroy
Breakpoint 4 at 0x7fb8e9b1b8: file gfx/layers/client/TextureClient.cpp,
    line 558.
(gdb) c
Continuing.
[LWP 7428 exited]
[Switching to LWP 7404]

Thread 37 "Compositor" hit Breakpoint 4, mozilla::layers::TextureClient::
    Destroy (this=this@entry=0x7ea0118ab0)
    at gfx/layers/client/TextureClient.cpp:558
558     void TextureClient::Destroy() {
(gdb) n
[LWP 7308 exited]
[LWP 7374 exited]
[LWP 7375 exited]
560       MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0);
(gdb) n
562       if (mActor && !mIsLocked) {
(gdb) n
566       mBorrowedDrawTarget = nullptr;
(gdb) n
567       mReadLock = nullptr;
(gdb) n
[New LWP 7538]
569       RefPtr<TextureChild> actor = mActor;
(gdb) n
577       TextureData* data = mData;
(gdb) n
578       if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) {
(gdb) n
582       if (data || actor) {
(gdb) p mWorkaroundAnnoyingSharedSurfaceLifetimeIssues
$11 = true
(gdb) p data
$12 = (mozilla::layers::TextureData *) 0x0
(gdb) p actor
$13 = {mRawPtr = 0x0}
(gdb) 
Here are the most relevant parts of the code associated with the above debug output to help follow what's happening:
void TextureClient::Destroy() {
  // Async paints should have been flushed by now.
  MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0);

  if (mActor && !mIsLocked) {
    mActor->Lock();
  }

  mBorrowedDrawTarget = nullptr;
  mReadLock = nullptr;

  RefPtr<TextureChild> actor = mActor;
  mActor = nullptr;

  if (actor && !actor->mDestroyed.compareExchange(false, true)) {
    actor->Unlock();
    actor = nullptr;
  }

  TextureData* data = mData;
  if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) {
    mData = nullptr;
  }

  if (data || actor) {
[...]
    DeallocateTextureClient(params);
  }
}
Although the code does step through okay, on ESR 91 something happens which I can't explain. The debugger suggests execution is getting stuck in a loop calling TextureClient::Destroy():
(gdb) b TextureClient::Destroy
Breakpoint 3 at 0x7ff1148c90: file layers/client/
    TextureClient.cpp, line 551.
(gdb) c
Continuing.
[LWP 18393 exited]
[Switching to LWP 18284]

Thread 7 "GeckoWorkerThre" hit Breakpoint 3, mozilla::layers::TextureClient::
    Destroy (this=this@entry=0x7fc5975380)
    at layers/client/TextureClient.cpp:551
551     void TextureClient::Destroy() {
(gdb) n
553       MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0);
(gdb) 

Thread 7 "GeckoWorkerThre" hit Breakpoint 3, mozilla::layers::TextureClient::
    Destroy (this=this@entry=0x7fc595de20)
    at layers/client/TextureClient.cpp:551
551     void TextureClient::Destroy() {
(gdb) 
553       MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0);
(gdb) 

Thread 7 "GeckoWorkerThre" hit Breakpoint 3, mozilla::layers::TextureClient::
    Destroy (this=this@entry=0x7fc55d2280)
    at layers/client/TextureClient.cpp:551
551     void TextureClient::Destroy() {
(gdb) 
The code is quite similar to the ESR 78 code, so I'm not sure why this might be happening. There's no obviously nested call to Destroy():
void TextureClient::Destroy() {
  // Async paints should have been flushed by now.
  MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0);

  if (mActor && !mIsLocked) {
    mActor->Lock();
  }

  mBorrowedDrawTarget = nullptr;
  mReadLock = nullptr;

  RefPtr<TextureChild> actor = mActor;
  mActor = nullptr;

  if (actor && !actor->mDestroyed.compareExchange(false, true)) {
    actor->Unlock();
    actor = nullptr;
  }

  TextureData* data = mData;
  mData = nullptr;

  if (data || actor) {
[...]
    DeallocateTextureClient(params);
  }
}
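Comparing the two listings, the visible difference is how mData is handled. A stripped-down model may make the contrast easier to see. The types here are hypothetical stand-ins rather than the real Gecko classes, and only the mData handling is modelled:

```cpp
// Hypothetical stand-ins; these are not the real Gecko types.
struct FakeTextureData {};

struct FakeTextureClient {
  FakeTextureData* mData = nullptr;
  // Models mWorkaroundAnnoyingSharedSurfaceLifetimeIssues.
  bool mWorkaround = false;

  // ESR 78 style: mData is only cleared when the workaround flag is off.
  // Returns the pointer the rest of Destroy() would go on to inspect.
  FakeTextureData* DestroyEsr78Style() {
    FakeTextureData* data = mData;
    if (!mWorkaround) {
      mData = nullptr;
    }
    return data;
  }

  // ESR 91 style: mData is always cleared.
  FakeTextureData* DestroyEsr91Style() {
    FakeTextureData* data = mData;
    mData = nullptr;
    return data;
  }
};
```

With the flag set, the ESR 78 style leaves mData pointing at the data after the call, while the ESR 91 style always nulls it.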
The backtrace is showing nested calls to TextureClient::~TextureClient(). But none of this explains the repeated hits of the TextureClient::Destroy() breakpoint.
#0  mozilla::layers::TextureClient::Destroy (this=this@entry=0x7fc6f7a3a0)
    at layers/client/TextureClient.cpp:551
#1  0x0000007ff1149144 in mozilla::layers::TextureClient::~TextureClient
    (this=0x7fc6f7a3a0, __in_chrg=<optimized out>)
    at layers/client/TextureClient.cpp:769
#2  0x0000007ff1149310 in mozilla::layers::TextureClient::~TextureClient
    (this=0x7fc6f7a3a0, __in_chrg=<optimized out>)
    at layers/client/TextureClient.cpp:764
#3  0x0000007ff110507c in mozilla::AtomicRefCountedWithFinalize
    <mozilla::layers::TextureClient>::Release (this=0x7fc6f7a3a8)
    at include/c++/8.3.0/bits/basic_ios.h:282
#4  0x0000007ff1268420 in mozilla::RefPtrTraits
    <mozilla::layers::TextureClient>::Release (aPtr=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:381
#5  RefPtr<mozilla::layers::TextureClient>::ConstRemovingRefPtrTraits
    <mozilla::layers::TextureClient>::Release (aPtr=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:381
[...]
I was hoping to get this part resolved today, but the situation is confusing and my head has stopped focusing, so I'm going to have to continue tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
14 Mar 2024 : Day 185 #
By the time I'd finished writing my diary entry yesterday I was pretty tired; my mind wasn't entirely with it. But it wasn't just the need for sleep that was causing me confusion. I was also confused as to why HasEglImageExtensions() was returning false on ESR 91 while HasExtensions(), which is essentially the same functionality, was returning true.

A good night's sleep hasn't helped answer this question unfortunately. My manual inspection of the code coupled with output from the debugger suggested that HasEglImageExtensions() should have been returning true.

What I'd really like to see is explicit output for each of the items in the condition to figure out which is returning false. The debugger on its own won't be any further help with this as there are too many steps optimised out. But if I expand the code a bit, rebuild and redeploy, then I may be able to get a clearer picture. So at least that's a clear path for today.

The first step is to make some changes to the code. I've added in variables to store return values for each of the four flags, all marked as volatile in the hope this will prevent the compiler from optimising them away. Then I print them all out. In practice I don't really care whether they actually get printed out or not, since my plan is to inspect them using the debugger. But I need to do something with them; printing them out is as good as anything.
static bool HasEglImageExtensions(const GLContextEGL& gl) {
  const auto& egl = *(gl.mEgl);

  volatile bool imagebase = egl.HasKHRImageBase();
  volatile bool tex2D = egl.IsExtensionSupported(
    EGLExtension::KHR_gl_texture_2D_image);
  volatile bool external = gl.IsExtensionSupported(
    GLContext::OES_EGL_image_external);
  volatile bool image = gl.IsExtensionSupported(GLContext::OES_EGL_image);

  printf_stderr("RENDER: egl HasKHRImageBase: %d\n", imagebase);
  printf_stderr("RENDER: egl KHR_gl_texture_2D_image: %d\n", tex2D);
  printf_stderr("RENDER: gl OES_EGL_image_external: %d\n", external);
  printf_stderr("RENDER: gl OES_EGL_image: %d\n", image);

  return egl.HasKHRImageBase() &&
         egl.IsExtensionSupported(EGLExtension::KHR_gl_texture_2D_image) &&
         (gl.IsExtensionSupported(GLContext::OES_EGL_image_external) ||
          gl.IsExtensionSupported(GLContext::OES_EGL_image));
}
Now I've set it building. As I write this I'm on a different mode of transport: travelling by bus. It's surprising that using a laptop on a bus feels far more socially awkward compared to using a laptop on a train. It's true the ride is more bumpy and the space more cramped, but I still find it odd that I don't see other people doing it. Everyone is on their phones; nobody (apart from me) ever seems to have a laptop out.

On executing the updated code I'm surprised to discover that it does actually output to the console. And the results aren't what I was expecting. Well, not exactly.
RENDER: egl HasKHRImageBase: 1
RENDER: egl KHR_gl_texture_2D_image: 1
RENDER: gl OES_EGL_image_external: 1
RENDER: gl OES_EGL_image: 1
They're all coming back true, so I must have been mistaken about why the SurfaceFactory_EGLImage::Create() method is exiting early. I've therefore annotated the Create() method with some more debug output in the hope this will shed more light on things. See the added printf_stderr() calls in the code below.
  if (HasEglImageExtensions(*gle)) {
    printf_stderr("RENDER: !HasEglImageExtensions()\n");
    return ret;
  }

  MOZ_ALWAYS_TRUE(prodGL->MakeCurrent());
  GLuint prodTex = CreateTextureForOffscreen(prodGL, formats, size);
  if (!prodTex) {
    printf_stderr("RENDER: !prodTex\n");
    return ret;
  }

  EGLClientBuffer buffer =
      reinterpret_cast<EGLClientBuffer>(uintptr_t(prodTex));
  EGLImage image = egl->fCreateImage(context,
                                     LOCAL_EGL_GL_TEXTURE_2D, buffer, nullptr);
  if (!image) {
    prodGL->fDeleteTextures(1, &prodTex);
    printf_stderr("RENDER: !image\n");
    return ret;
  }

  ret.reset(new SharedSurface_EGLImage(prodGL, size, hasAlpha, formats, prodTex,
                                       image));
  printf_stderr("RENDER: returning normally\n");
  return ret;
None of this should be necessary: it should be possible to extract all of this execution flow from the debugger. But for some reason the conclusion I came to from using the debugger doesn't make sense based on the values HasEglImageExtensions() is returning. Maybe I made a mistake somewhere. Nevertheless, this approach should hopefully answer the question.

Here's the output I get. And as soon as I see this output I realise the stupid mistake I've made.
RENDER: egl HasKHRImageBase: 1
RENDER: egl KHR_gl_texture_2D_image: 1
RENDER: gl OES_EGL_image_external: 1
RENDER: gl OES_EGL_image: 1
RENDER: !HasEglImageExtensions()
So did you notice the stupid mistake? Here's the relevant ESR 78 code:
  if (!HasExtensions(egl, prodGL)) {
    return ret;
  }
And here — oh dear — is what I replaced it with:
  if (HasEglImageExtensions(*gle)) {
    return ret;
  }
See what's missing? It's the crucial negation of the condition. Oh boy. I can see why I made this mistake: it's because elsewhere in the same file the condition is used — correctly — with the opposite effect, like this:
  if (HasEglImageExtensions(*gle)) {
    ret.reset(new ptrT({prodGL, SharedSurfaceType::Basic,
      layers::TextureType::Unknown, true}, caps, allocator, flags, context));
  }
In one case the method should return early if the extension check fails; in the other case it should reset the returned texture if the extension check succeeds.

I feel more than a little bit silly. But it's okay, the important point is that it's fixed now. I've added in that crucial missing ! and this should now work as expected:
  if (!HasEglImageExtensions(*gle)) {
    return ret;
  }
I'm not expecting this change to miraculously fix the entire rendering pipeline, but it should certainly help.
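The two ways of using the same check are easy to mix up, so here is a minimal sketch of both side by side. Surface and the two function names are hypothetical, chosen only to illustrate the pattern:

```cpp
#include <memory>

struct Surface {};  // hypothetical stand-in for the surface being created

// Guard-clause form: bail out early when the check FAILS, so the
// condition has to be negated.
std::unique_ptr<Surface> CreateWithGuard(bool hasExtensions) {
  std::unique_ptr<Surface> ret;
  if (!hasExtensions) {
    return ret;  // still null
  }
  ret = std::make_unique<Surface>();
  return ret;
}

// Positive form: do the work only when the check SUCCEEDS, so the
// condition is used as-is.
std::unique_ptr<Surface> CreateWithPositiveCheck(bool hasExtensions) {
  std::unique_ptr<Surface> ret;
  if (hasExtensions) {
    ret = std::make_unique<Surface>();
  }
  return ret;
}
```

Both functions behave identically; dropping the ! from the first one inverts it, which is the slip described above.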

On executing the app and with this change in place we still don't unfortunately get a render. In fact the app now seems to hog CPU cycles and make my phone unresponsive. I have a feeling this is a memory leak, but a bit more digging will help confirm it (or otherwise).

If this change has triggered a memory leak, it's likely because the surface being created by SurfaceFactory_EGLImage::Create() is never being freed. Creating a new 1080 by 2520 texture each frame will start to eat up memory pretty fast. So an obvious next step is to find out where it's being freed on ESR 78 and establish whether the same thing is happening on ESR 91 or not.
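To put a rough number on "pretty fast", here is a back-of-the-envelope calculation. The surface size comes from the debugger output, but the 4 bytes per pixel and 60 frames per second are assumptions rather than measured values:

```cpp
#include <cstdint>

// Figures from the text: one surface per frame at 1080 x 2520.
constexpr std::uint64_t kWidth = 1080;
constexpr std::uint64_t kHeight = 2520;
// Assumptions, not from the text: 4 bytes per pixel and 60 frames/s.
constexpr std::uint64_t kBytesPerPixel = 4;
constexpr std::uint64_t kFramesPerSecond = 60;

constexpr std::uint64_t BytesPerSurface() {
  return kWidth * kHeight * kBytesPerPixel;
}

constexpr std::uint64_t BytesLeakedPerSecond() {
  return BytesPerSurface() * kFramesPerSecond;
}

// 1080 * 2520 * 4 bytes, or roughly 10.4 MiB per surface.
static_assert(BytesPerSurface() == 10886400, "unexpected surface size");
```

Under those assumptions a leak of one surface per frame would approach 623 MiB per second, which would be consistent with a phone becoming unresponsive very quickly.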

Unfortunately it turns out to be harder to find than I'd expected. There are quite a few methods that are used for deleting textures or memory associated with them. I've tried adding breakpoints to all of the following:
  1. SharedSurface_EGLImage::~SharedSurface_EGLImage()
  2. GLContext::Readback()
  3. GLContext::fDeleteFramebuffers()
  4. GLContext::raw_fDeleteTextures()
  5. SharedSurfaceTextureClient::~SharedSurfaceTextureClient()
And they're either not hit in ESR 78, or they're hit in both ESR 78 and ESR 91. So I've yet to find the smoking gun. I think I've reached the limit for my day today though, so the investigation will have to continue in the morning.
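Breakpoints show whether a function is reached at all, but not whether creations and deletions balance. One alternative, sketched here with hypothetical names that don't exist in Gecko, is to tally both and watch the difference:

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical counter: tracks how many textures are currently alive.
struct TextureTally {
  long created = 0;
  long deleted = 0;
  void OnCreate() { ++created; }
  void OnDelete() { ++deleted; }
  long Live() const { return created - deleted; }
};

// Hypothetical stand-in for a surface that owns one texture.
class TalliedSurface {
 public:
  explicit TalliedSurface(TextureTally& tally) : mTally(tally) {
    mTally.OnCreate();
  }
  ~TalliedSurface() { mTally.OnDelete(); }
  TalliedSurface(const TalliedSurface&) = delete;
  TalliedSurface& operator=(const TalliedSurface&) = delete;

 private:
  TextureTally& mTally;
};

// Simulates some frames, creating one surface per frame. When
// freeEachFrame is false the surfaces are held on to, standing in for
// surfaces that are never released while rendering continues.
long LiveAfterFrames(int frames, bool freeEachFrame) {
  TextureTally tally;
  std::vector<std::unique_ptr<TalliedSurface>> kept;
  for (int i = 0; i < frames; ++i) {
    auto surface = std::make_unique<TalliedSurface>(tally);
    if (!freeEachFrame) {
      kept.push_back(std::move(surface));
    }
  }
  return tally.Live();
}
```

A live count that grows with every frame would point towards a missing deletion, whereas a count that stays near zero would suggest looking elsewhere.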

13 Mar 2024 : Day 184 #
Yesterday we determined that a problem in CreateShared() meant that the method was returning a null SharedSurface_EGLImage on ESR 91 when it should have been returning a valid pointer. The question I want to answer today is: "why"?

Stepping through the code, the program counter jumps all over the place, making it hard to follow. But eventually it becomes clear that it's the HasEglImageExtensions() method that's returning false, causing CreateShared() to return early with a null return value. Although the method is called HasEglImageExtensions() in ESR 91, in ESR 78 it's called something else: just HasExtensions(). But they're otherwise largely the same. Let's take a look at the two versions of it. First the ESR 78 version:
bool SharedSurface_EGLImage::HasExtensions(GLLibraryEGL* egl, GLContext* gl) {
  return egl->HasKHRImageBase() &&
         egl->IsExtensionSupported(GLLibraryEGL::KHR_gl_texture_2D_image) &&
         (gl->IsExtensionSupported(GLContext::OES_EGL_image_external) ||
          gl->IsExtensionSupported(GLContext::OES_EGL_image));
}
Followed by the ESR 91 version:
static bool HasEglImageExtensions(const GLContextEGL& gl) {
  const auto& egl = *(gl.mEgl);
  return egl.HasKHRImageBase() &&
         egl.IsExtensionSupported(EGLExtension::KHR_gl_texture_2D_image) &&
         (gl.IsExtensionSupported(GLContext::OES_EGL_image_external) ||
          gl.IsExtensionSupported(GLContext::OES_EGL_image));
}
As you can see, they're similar but not quite identical. Unfortunately the debugger claims the IsExtensionSupported() methods have been optimised out. But it's a pretty simple method, just returning as it does the value in the mAvailableExtensions array referenced by aKnownExtension.
  bool IsExtensionSupported(EGLExtensions aKnownExtension) const {
    return mAvailableExtensions[aKnownExtension];
  }
There's a change on ESR 91 where the aKnownExtension is first redirected via the UnderlyingValue() method. Here's the ESR 91 version:
  bool IsExtensionSupported(EGLExtension aKnownExtension) const {
    return mAvailableExtensions[UnderlyingValue(aKnownExtension)];
  }
We'll come back to UnderlyingValue() in a bit. Now that we know the implementations we can make use of this info when debugging to get around the fact the methods have been optimised out: we can just access the mAvailableExtensions array used by each directly instead. Let's take a look at that. First let's look at the values in ESR 78:
(gdb) b HasExtensions
Breakpoint 2 at 0x7fb8e84d70: HasExtensions. (2 locations)
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 2, mozilla::gl::SharedSurface_EGLImage::
    HasExtensions (egl=0x7eac0036a0, gl=0x7eac109140)
    at gfx/gl/SharedSurfaceEGL.cpp:59
59        return egl->HasKHRImageBase() &&
(gdb) p egl.mAvailableExtensions
$1 = std::bitset = {  [0] = 1,   [2] = 1,   [3] = 1,   [5] = 1,   [6] = 1,
                      [7] = 1,  [13] = 1,  [21] = 1,  [22] = 1}
(gdb) p gl.mAvailableExtensions
$2 = std::bitset = {  [1] = 1,  [57] = 1,  [58] = 1,  [60] = 1,  [72] = 1,
                     [75] = 1,  [77] = 1,  [78] = 1,  [86] = 1,  [87] = 1,
                     [96] = 1,  [97] = 1, [100] = 1, [111] = 1, [112] = 1,
                    [113] = 1, [114] = 1, [115] = 1, [117] = 1, [118] = 1,
                    [120] = 1, [121] = 1, [122] = 1, [123] = 1, [125] = 1,
                    [126] = 1, [127] = 1, [128] = 1, [129] = 1, [130] = 1,
                    [131] = 1, [132] = 1}
(gdb) 
And for contrast, let's see what happens on ESR 91 using the same process:
(gdb) b HasEglImageExtensions
Breakpoint 1 at 0x7ff11322a0: file include/c++/8.3.0/bitset, line 1163.
(gdb) c
Continuing.
[LWP 26957 exited]
[LWP 26952 exited]
[New LWP 27078]
[LWP 27037 exited]
[Switching to LWP 26961]

Thread 38 "Compositor" hit Breakpoint 1, mozilla::gl::HasEglImageExtensions
    (gl=...)
    at ${PROJECT}/gfx/gl/SharedSurfaceEGL.cpp:28
28      ${PROJECT}/gfx/gl/SharedSurfaceEGL.cpp: No such file or directory.
(gdb) p egl.mAvailableExtensions
$1 = std::bitset = {  [0] = 1,   [2] = 1,   [4] = 1,   [5] = 1,   [6] = 1,
                      [7] = 1,   [8] = 1,  [11] = 1,  [16] = 1,  [17] = 1,
                     [22] = 1}
(gdb) p gl.mAvailableExtensions
$2 = std::bitset = {  [1] = 1,  [57] = 1,  [58] = 1,  [60] = 1,  [72] = 1,
                     [75] = 1,  [77] = 1,  [78] = 1,  [86] = 1,  [87] = 1,
                     [88] = 1,  [97] = 1,  [99] = 1, [101] = 1, [102] = 1,
                    [113] = 1, [114] = 1, [115] = 1, [116] = 1, [117] = 1,
                    [119] = 1, [120] = 1, [122] = 1, [123] = 1, [124] = 1,
                    [125] = 1, [127] = 1, [128] = 1, [129] = 1, [130] = 1,
                    [131] = 1, [132] = 1, [133] = 1, [134] = 1}
(gdb) 
It's noticeable that neither the egl nor the gl values are identical across the two versions. The obvious question is whether this is a real difference, or whether the UnderlyingValue() method is obscuring the fact that they're the same. Here's what the code has to say about UnderlyingValue():
/**
 * Get the underlying value of an enum, but typesafe.
 *
 * example:
 *
 *   enum class Pet : int16_t {
 *     Cat,
 *     Dog,
 *     Fish
 *   };
 *   enum class Plant {
 *     Flower,
 *     Tree,
 *     Vine
 *   };
 *   UnderlyingValue(Pet::Fish) -> int16_t(2)
 *   UnderlyingValue(Plant::Tree) -> int(1)
 */
template <typename T>
inline constexpr auto UnderlyingValue(const T v) {
  static_assert(std::is_enum_v<T>);
  return static_cast<typename std::underlying_type<T>::type>(v);
}
So this isn't actually changing the value, it's checking and casting it to the appropriate type. So we can ignore this when we're comparing values and conclude that the mAvailableExtensions array definitely has different indices set to true between ESR 78 and ESR 91. But we still need to check the enums that these represent in order to be sure that these are real differences.
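As a small illustration of why the cast is there at all, here is a sketch that pairs the helper with a bitset lookup. DemoExtension and DemoLibrary are hypothetical and much smaller than the real enums; Pet is the example from the quoted comment:

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Same shape as the helper quoted above.
template <typename T>
constexpr auto UnderlyingValue(const T v) {
  static_assert(std::is_enum_v<T>);
  return static_cast<std::underlying_type_t<T>>(v);
}

// The example enum from the quoted comment.
enum class Pet : std::int16_t { Cat, Dog, Fish };

// Hypothetical three-entry extension enum, for illustration only.
enum class DemoExtension { First, Second, Third, Max };

struct DemoLibrary {
  std::bitset<static_cast<std::size_t>(DemoExtension::Max)>
      mAvailableExtensions;

  bool IsExtensionSupported(DemoExtension aKnownExtension) const {
    // A scoped enum won't convert to an index implicitly, hence the cast.
    return mAvailableExtensions[UnderlyingValue(aKnownExtension)];
  }
};

// The numeric value is unchanged; only the type differs.
static_assert(UnderlyingValue(Pet::Fish) == 2);
static_assert(
    std::is_same_v<decltype(UnderlyingValue(Pet::Fish)), std::int16_t>);
```

The index used for the lookup is exactly the enumerator's position, so the bit positions printed by the debugger can be read directly against the enum declaration.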

Both egl and gl use different enums, so we'll need to consider them separately.

Here's the enum associated with egl in ESR 78, found in GLLibraryEGL.h:
 0: KHR_image_base
 2: KHR_gl_texture_2D_image
 3: KHR_lock_surface
 5: EXT_create_context_robustness
 6: KHR_image
 7: KHR_fence_sync
13: KHR_create_context
21: KHR_surfaceless_context
22: KHR_create_context_no_error
Based on the HasExtensions() implementation we're interested in KHR_gl_texture_2D_image, KHR_image and KHR_image_base; all of which are present in the list above (indices 2, 6 and 0).

On ESR 91, the related enum, also found in GLLibraryEGL.h, looks like this:
 0: KHR_image_base
 2: KHR_gl_texture_2D_image
 4: ANGLE_surface_d3d_texture_2d_share_handle
 5: EXT_create_context_robustness
 6: KHR_image
 7: KHR_fence_sync
 8: ANDROID_native_fence_sync
11: ANGLE_platform_angle_d3d
16: EXT_device_query
17: NV_stream_consumer_gltexture_yuv
22: KHR_create_context_no_error
Again, looking at the code and based on HasEglImageExtensions() we're interested in the same flags: KHR_gl_texture_2D_image, KHR_image and KHR_image_base. All of these are also present in the ESR 91 list (indices 2, 6 and 0).

So, no obvious problems on the egl side. Let's now check the longer enum for gl. Here are the active values based on the ESR 78 list available in GLContext.h:
  1: AMD_compressed_ATC_texture
 57: EXT_color_buffer_float
 58: EXT_color_buffer_half_float
 60: EXT_disjoint_timer_query
 72: EXT_multisampled_render_to_texture
 75: EXT_read_format_bgra
 77: EXT_sRGB
 78: EXT_sRGB_write_control
 86: EXT_texture_filter_anisotropic
 87: EXT_texture_format_BGRA8888
 96: IMG_texture_npot
 97: KHR_debug
100: KHR_robustness
111: NV_transform_feedback
112: NV_transform_feedback2
113: OES_EGL_image
114: OES_EGL_image_external
115: OES_EGL_sync
117: OES_depth24
118: OES_depth32
120: OES_element_index_uint
121: OES_fbo_render_mipmap
122: OES_framebuffer_object
123: OES_packed_depth_stencil
125: OES_standard_derivatives
126: OES_stencil8
127: OES_texture_3D
128: OES_texture_float
129: OES_texture_float_linear
130: OES_texture_half_float
131: OES_texture_half_float_linear
132: OES_texture_npot
From the ESR 78 code the ones we're interested in are just OES_EGL_image_external and OES_EGL_image. These are both in the list (indices 114 and 113). What about ESR 91? Here's the enum list in this case:
  1: AMD_compressed_ATC_texture
 57: EXT_color_buffer_float
 58: EXT_color_buffer_half_float
 60: EXT_disjoint_timer_query
 72: EXT_multisampled_render_to_texture
 75: EXT_read_format_bgra
 77: EXT_sRGB
 78: EXT_sRGB_write_control
 86: EXT_texture_filter_anisotropic
 87: EXT_texture_format_BGRA8888
 88: EXT_texture_norm16
 97: KHR_debug
 99: KHR_robust_buffer_access_behavior
101: KHR_texture_compression_astc_hdr
102: KHR_texture_compression_astc_ldr
113: OES_EGL_image
114: OES_EGL_image_external
115: OES_EGL_sync
116: OES_compressed_ETC1_RGB8_texture
117: OES_depth24
119: OES_depth_texture
120: OES_element_index_uint
122: OES_framebuffer_object
123: OES_packed_depth_stencil
124: OES_rgb8_rgba8
125: OES_standard_derivatives
127: OES_texture_3D
128: OES_texture_float
129: OES_texture_float_linear
130: OES_texture_half_float
131: OES_texture_half_float_linear
132: OES_texture_npot
133: OES_vertex_array_object
134: OVR_multiview2
Once again from the ESR 91 code we can see the ones we're interested in are the same: OES_EGL_image_external and OES_EGL_image. These are both in the list (indices 114 and 113). So what gives? Both methods have the appropriate flags set, so why is one succeeding and the other failing?
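The manual comparison can also be replayed mechanically. The sketch below feeds the bit indices copied from the two debugger sessions into the same condition. The bitset sizes are assumptions chosen to be large enough, and HasKHRImageBase() is assumed to be true when either KHR_image or KHR_image_base is set:

```cpp
#include <bitset>
#include <cstddef>
#include <initializer_list>

// Bitset sizes are assumptions, picked to fit the indices observed.
using EglBits = std::bitset<32>;
using GlBits = std::bitset<160>;

template <std::size_t N>
std::bitset<N> FromIndices(std::initializer_list<std::size_t> indices) {
  std::bitset<N> bits;
  for (std::size_t i : indices) bits.set(i);
  return bits;
}

// Enum positions as listed in the text.
constexpr std::size_t kKHR_image_base = 0;
constexpr std::size_t kKHR_gl_texture_2D_image = 2;
constexpr std::size_t kKHR_image = 6;
constexpr std::size_t kOES_EGL_image = 113;
constexpr std::size_t kOES_EGL_image_external = 114;

// Mirrors the condition shared by HasExtensions() and
// HasEglImageExtensions(), under the assumption stated above.
bool HasImageExtensions(const EglBits& egl, const GlBits& gl) {
  const bool imageBase = egl[kKHR_image] || egl[kKHR_image_base];
  return imageBase && egl[kKHR_gl_texture_2D_image] &&
         (gl[kOES_EGL_image_external] || gl[kOES_EGL_image]);
}

// Values copied from the ESR 78 and ESR 91 debugger output.
const EglBits kEgl78 = FromIndices<32>({0, 2, 3, 5, 6, 7, 13, 21, 22});
const EglBits kEgl91 =
    FromIndices<32>({0, 2, 4, 5, 6, 7, 8, 11, 16, 17, 22});
const GlBits kGl78 = FromIndices<160>(
    {1,   57,  58,  60,  72,  75,  77,  78,  86,  87,  96,
     97,  100, 111, 112, 113, 114, 115, 117, 118, 120, 121,
     122, 123, 125, 126, 127, 128, 129, 130, 131, 132});
const GlBits kGl91 = FromIndices<160>(
    {1,   57,  58,  60,  72,  75,  77,  78,  86,  87,  88,  97,
     99,  101, 102, 113, 114, 115, 116, 117, 119, 120, 122, 123,
     124, 125, 127, 128, 129, 130, 131, 132, 133, 134});
```

Calling HasImageExtensions() with either pair of values returns true, which agrees with the manual check: on this evidence the flags themselves aren't what differs between the two builds.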

It's not clear to me right now. Something is wrong, but I can't see where. I'd love to dig deeper into this today but my mind has reached its limit. I'll have to pick this up again tomorrow.

12 Mar 2024 : Day 183 #
Before we get into my usual dev diary I want to first draw attention to a couple of new development blogs that have started up recently, both related to mobile operating system development and Linux in particular.

First up is Adventures with Sailfish and Ofono from Adam Pigg (piggz). Adam will be well-known to many Sailfish OS users and already featured here back on Day 178.

A long-time porter of Sailfish OS, Adam is responsible for the native PinePhone, PinePhone Pro and Volla ports, amongst others.

He's recently turned his hand to contributing Sailfish-specific Ofono changes to upstream, with the aim of reversing the divergence that's grown between the two over the years. If successful, this would not only add functionality to upstream Ofono, it would also allow updating the Sailfish OS version and benefiting from recent upstream improvements as well.

Adam recently started writing a semi-daily blog about it. He's up to Day 5 already and it's a great read.

But it's not just Adam. Peter Mack (1peter10) from LINMOB.net has started a developer diary to chart his explorations of — and improvements to — Mobile Config Firefox. The project is aimed at getting Firefox nicely configured for mobile use, part of the postmarketOS project. Peter has already written about his first pull request to tidy up the URL bar for use on smaller displays.

I'm really enjoying reading about others' approaches to development and watching as things progress. I'll be avidly following both.

It might seem a little obvious, but I also recommend Peter's weekly Mobile Linux Update as the best way to catch up on all the latest activity in the mobile Linux space. I like to think I'm keeping up with developments in the world of Sailfish OS, but keeping up with activity across all of the various distributions is a real challenge. I'd say it was impossible, except that this is exactly what Peter does, making it possible for the rest of us to keep up in the process.

On a separate but related note, I also want to give a shout-out to Florian Xaver (wosrediinanatour). Florian has been extremely helpful reviewing some of the code changes mentioned here in my diary. He's been sharing useful advice and tips. I'm going to go into this in more detail in a future diary entry, but for now, let me just say that I'm grateful for the input.

Alright, let's move back on to the gecko development track. After taking some steps to align the ESR 78 and ESR 91 offscreen rendering pipeline yesterday, I'm following on with more of the same today. My plan is to step through various methods I know to be relevant as part of the render process and see whether they differ between ESR 78 and ESR 91. I have a pretty good setup for this. Two phones, one with ESR 78 another with ESR 91. Two SSH sessions, one for each phone, running my test application through the debugger. Then on another display I have Qt Creator running with ESR 78 code on one side and ESR 91 code on the other.
 
My desktop arrangement: laptop, two phones and a screen; plus some mess

With this setup I can step through the code simultaneously on ESR 78 and ESR 91 to establish whether they diverge or not, and if so where. The first method I'm going to look at is the same one we started with yesterday, which is GLScreenBuffer::Swap(). What I'd really like to do is show the debugger output side-by-side here, but the line lengths are too wide for it to comfortably fit, so I'm just going to have to list them here serially.

Working this way it doesn't take long before I'm able to identify a critical issue. First ESR 78.
(gdb) b GLScreenBuffer::Swap
Breakpoint 2 at 0x7fb8e672b8: file dist/include/mozilla/UniquePtr.h, line 287.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 2, mozilla::gl::GLScreenBuffer::Swap
    (this=this@entry=0x7eac003c00, size=...)
    at dist/include/mozilla/UniquePtr.h:287
287     in dist/include/mozilla/UniquePtr.h
(gdb) p size
$1 = (const mozilla::gfx::IntSize &) @0x7eac1deaa8:
    {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx::
    UnknownUnits> >> = {{{width = 1080, height = 2520}, components =
    {1080, 2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>},
    <No data fields>}
(gdb) n
382       if (!newBack) return false;
(gdb) p newBack
$2 = {mRawPtr = 0x7eac138860}
(gdb) 
And now ESR 91.
(gdb) break GLScreenBuffer::Swap
Breakpoint 2 at 0x7ff1106f8c: file ${PROJECT}/gfx/gl/GLScreenBuffer.cpp, line 506.
(gdb) c
Continuing.
[LWP 22652 exited]

Thread 36 "Compositor" hit Breakpoint 2, mozilla::gl::GLScreenBuffer::Swap
    (this=this@entry=0x5555642d50, size=...)
    at ${PROJECT}/gfx/gl/GLScreenBuffer.cpp:506
506     in ${PROJECT}/gfx/gl/GLScreenBuffer.cpp
(gdb) p size
$1 = (const mozilla::gfx::IntSize &) @0x7edc1a21e4:
    {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx::
    UnknownUnits> >> = {{{width = 1080, height = 2520}, components =
    {1080, 2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>},
    <No data fields>}
(gdb) n
[LWP 22657 exited]
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) n
509     ${PROJECT}/gfx/gl/GLScreenBuffer.cpp: No such file or directory.
(gdb) p newBack
$2 = {mRawPtr = 0x0}
(gdb) 
Debugging on the ESR 91 build isn't so clean due to the partial build messing up some of the debug source alignment, but we can nevertheless see that the call to SurfaceFactory_EGLImage::NewTexClient() is returning something sensible in ESR 78, but null in ESR 91. Here's the relevant code:
  RefPtr<layers::SharedSurfaceTextureClient> newBack =
      mFactory->NewTexClient(size);
  if (!newBack) return false;
Let's ensure that mFactory is valid and of the correct type. First on ESR 78:
(gdb) p mFactory.mTuple.mFirstA
$4 = (mozilla::gl::SurfaceFactory *) 0x7eac139b60
(gdb) set print object on
(gdb) p mFactory.mTuple.mFirstA
$5 = (mozilla::gl::SurfaceFactory_EGLImage *) 0x7eac139b60
(gdb) set print object off
(gdb) 
And then, to compare, on ESR 91:
(gdb) p mFactory.mTuple.mFirstA
$7 = (mozilla::gl::SurfaceFactory *) 0x7edc1dc470
(gdb) set print object on
(gdb) p mFactory.mTuple.mFirstA
$8 = (mozilla::gl::SurfaceFactory_EGLImage *) 0x7edc1dc470
(gdb) set print object off
(gdb) 
That all looks similar, so the next thing to check is what's happening inside SurfaceFactory_EGLImage::NewTexClient() that's preventing it from doing what it's supposed to. But when I try to place a breakpoint on SurfaceFactory_EGLImage::NewTexClient() I discover I can't: it doesn't exist.
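A plausible reason for a breakpoint on a derived-class method name failing is that the method is only defined in the base class. The sketch below uses hypothetical classes with the same shape to show this, along with the static versus dynamic type distinction that "set print object on" exposes:

```cpp
#include <string>
#include <type_traits>
#include <typeinfo>

// Hypothetical classes mirroring the shape described in the text.
class Factory {
 public:
  virtual ~Factory() = default;
  // Defined only here; derived classes inherit it unchanged.
  std::string NewTexClient() {
    return "Factory::NewTexClient -> " + CreateShared();
  }

 protected:
  virtual std::string CreateShared() = 0;
};

class Factory_EGLImage : public Factory {
 protected:
  std::string CreateShared() override {
    return "Factory_EGLImage::CreateShared";
  }
};

// Naming the method through the derived class still refers to the base
// class member, so there is no separate derived function to break on.
static_assert(std::is_same_v<decltype(&Factory_EGLImage::NewTexClient),
                             std::string (Factory::*)()>);

// The static type of the reference is Factory, but typeid reports the
// dynamic type, much as the debugger does with "set print object on".
bool IsEglImageFactory(const Factory& factory) {
  return typeid(factory) == typeid(Factory_EGLImage);
}
```

Calling NewTexClient() through a Factory reference runs the base class method, which in turn dispatches to the derived CreateShared().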

The SurfaceFactory_EGLImage class must be inheriting the method from SurfaceFactory::NewTexClient(). So we can check by stepping through that. First the ESR 78 code.
(gdb) b SurfaceFactory::NewTexClient
Breakpoint 3 at 0x7fb8e6f338: file gfx/gl/SharedSurface.cpp, line 287.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 3, mozilla::gl::SurfaceFactory::
    NewTexClient (this=0x7eac139b60, size=...)
    at gfx/gl/SharedSurface.cpp:287
287     SurfaceFactory::NewTexClient(const gfx::IntSize& size) {
(gdb) bt
#0  mozilla::gl::SurfaceFactory::NewTexClient (this=0x7eac139b60, size=...)
    at gfx/gl/SharedSurface.cpp:287
#1  0x0000007fb8e672d8 in mozilla::gl::GLScreenBuffer::Swap
    (this=this@entry=0x7eac003c00, size=...)
    at dist/include/mozilla/UniquePtr.h:287
[...]
#25 0x0000007fbe70d89c in ?? () from /lib64/libc.so.6
(gdb) p size
$6 = (const mozilla::gfx::IntSize &) @0x7eac140ec8:
    {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx::
    UnknownUnits> >> = {{{width = 1080, height = 2520}, components =
    {1080, 2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>},
    <No data fields>}
(gdb) p mRecycleFreePool
$7 = {mQueue = std::queue wrapping: std::deque with 0 elements}
(gdb) n
1367    include/c++/8.3.0/bits/stl_deque.h: No such file or directory.
(gdb) n
300       UniquePtr<SharedSurface> surf = CreateShared(size);
(gdb) n
301       if (!surf) return nullptr;
(gdb) p surf.mTuple.mFirstA
$8 = (mozilla::gl::SharedSurface *) 0x7eac1802b0
(gdb) n
292     dist/include/mozilla/UniquePtr.h: No such file or directory.
(gdb) n
289     dist/include/mozilla/RefPtr.h: No such file or directory.
(gdb) n
305                                                        mAllocator, mFlags);
(gdb) n
307       StartRecycling(ret);
(gdb) p ret
$10 = {mRawPtr = 0x7eac138910}
(gdb) p mAllocator
$11 = {mRawPtr = 0x0}
(gdb) p mFlags
$12 = mozilla::layers::TextureFlags::ORIGIN_BOTTOM_LEFT
(gdb) 
Notice how the call to CreateShared() returns an object which, when moved into Create(), results in a second object being returned. The allocator is null and the flags are set to ORIGIN_BOTTOM_LEFT.

On ESR 91 there's a big difference: although mAllocator and mFlags are set correctly, the call to CreateShared returns null. Immediately afterwards the method notices this and returns early.
(gdb) b SurfaceFactory::NewTexClient
Breakpoint 3 at 0x7ff111d888: file ${PROJECT}/gfx/gl/SharedSurface.cpp, line 393.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 3, mozilla::gl::SurfaceFactory::
    NewTexClient (this=0x7edc1dc470, size=...)
    at ${PROJECT}/gfx/gl/SharedSurface.cpp:393
393     ${PROJECT}/gfx/gl/SharedSurface.cpp: No such file or directory.
(gdb) bt
#0  mozilla::gl::SurfaceFactory::NewTexClient (this=0x7edc1dc470, size=...)
    at ${PROJECT}/gfx/gl/SharedSurface.cpp:393
#1  0x0000007ff1106fac in mozilla::gl::GLScreenBuffer::Swap
    (this=this@entry=0x5555642d50, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
[...]
#24 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) p size
$17 = (const mozilla::gfx::IntSize &) @0x7ed81a22e4:
    {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx::
    UnknownUnits> >> = {{{width = 1080, height = 2520}, components =
    {1080, 2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>},
    <No data fields>}
(gdb) p mRecycleFreePool
$18 = {mQueue = std::queue wrapping: std::deque with 0 elements}
(gdb) p mAllocator
$19 = {mRawPtr = 0x0}
(gdb) p mFlags
$20 = mozilla::layers::TextureFlags::ORIGIN_BOTTOM_LEFT
(gdb) p mRecycleFreePool
$21 = {mQueue = std::queue wrapping: std::deque with 0 elements}
(gdb) n
394     in ${PROJECT}/gfx/gl/SharedSurface.cpp
(gdb) n
406     in ${PROJECT}/gfx/gl/SharedSurface.cpp
(gdb) n
407     in ${PROJECT}/gfx/gl/SharedSurface.cpp
(gdb) p surf.mTuple.mFirstA
$22 = (mozilla::gl::SharedSurface *) 0x0
(gdb) n
79      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:
    No such file or directory.
(gdb) n
mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555643a10, size=...)
    at ${PROJECT}/gfx/gl/GLScreenBuffer.cpp:509
509     ${PROJECT}/gfx/gl/GLScreenBuffer.cpp: No such file or directory.
(gdb) 
So it would seem that there's a problem in CreateShared(), so the next step will be to drill down into that. That's all I've time for today though; we'll pick this up again tomorrow.
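
To make the control flow traced above a little more concrete, here's a minimal, self-contained sketch of the pattern the debugger output suggests: check the recycle pool, fall back to creating a new surface, and bail out early if creation fails. It uses std::unique_ptr and std::shared_ptr as stand-ins for the Mozilla pointer types, and Surface, TexClient and SketchFactory are hypothetical names chosen for illustration rather than the real gecko classes.

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical stand-ins; these are not the real gecko types.
struct Surface {
  int width;
  int height;
};

struct TexClient {
  std::unique_ptr<Surface> surf;
};

class SketchFactory {
 public:
  explicit SketchFactory(bool canCreate) : mCanCreate(canCreate) {}

  // Follows the shape of the flow seen in the debugger: try the recycle pool,
  // otherwise create a surface, returning early if creation fails.
  std::shared_ptr<TexClient> NewTexClient(int width, int height) {
    if (!mRecycleFreePool.empty()) {
      std::shared_ptr<TexClient> recycled = mRecycleFreePool.front();
      mRecycleFreePool.pop();
      return recycled;
    }

    std::unique_ptr<Surface> surf = CreateShared(width, height);
    if (!surf) return nullptr;  // The early return observed on ESR 91

    auto ret = std::make_shared<TexClient>();
    ret->surf = std::move(surf);
    return ret;
  }

  // Hands a client back to the pool so that it can be reused later.
  void Recycle(std::shared_ptr<TexClient> client) {
    assert(client);
    mRecycleFreePool.push(std::move(client));
  }

 private:
  // Returns null on failure, which the caller has to check for.
  std::unique_ptr<Surface> CreateShared(int width, int height) {
    if (!mCanCreate) return nullptr;
    return std::make_unique<Surface>(Surface{width, height});
  }

  bool mCanCreate;
  std::queue<std::shared_ptr<TexClient>> mRecycleFreePool;
};
```

With the pool empty, a factory constructed with canCreate set to false always returns null from NewTexClient(), which is the behaviour we're seeing on ESR 91. The real method is more involved than this, so the sketch is only intended to show why a null from CreateShared() short-circuits everything that follows.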

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
11 Mar 2024 : Day 182 #
We reached an important milestone yesterday when our WebView-augmented app finally ran without crashing for the first time. There's still nothing being rendered to the screen, but preventing the app from crashing feels like an important waypoint on the journey towards the result we want.

But crashes have their uses. In particular, when an app crashes the debugger provides a backtrace which can be the bedrock of an investigation; a place to start and fall back to in case things spiral out of control. Now we're presented with something more nebulous: somewhere in the program there are one or many bugs, or differences in execution, between the ESR 78 and ESR 91 rendering pipelines that we need to find and rewire. We've been here before, with the original browser rendering pipeline. That time it took a few weeks before I managed to find the root cause. I'm not expecting this to be any easier.

The first thing to check is whether the render loop is actually being called. Here we have something to go on in the form of the GLScreenBuffer::Swap() method. This should be called every frame in order to move the image on the back buffer onto the front buffer. We can use the debugger to see whether it's being called.
(gdb) b GLScreenBuffer::Swap
Breakpoint 1 at 0x7ff1106f8c: file ${PROJECT}/gecko-dev/gfx/gl/
    GLScreenBuffer.cpp, line 506.
(gdb) c
Continuing.
[LWP 1224 exited]
[LWP 1218 exited]	
[New LWP 1735]
[LWP 1255 exited]
No hits. In one sense this is bad: something is broken. On the other hand, it's also good: we already knew something was broken, and this at least gives us something concrete to fix. So the next step is to see whether it's actually being called on ESR 78. It could be that I've misunderstood how this rendering pipeline is supposed to work.
(gdb) b GLScreenBuffer::Swap
Breakpoint 1 at 0x7fb8e672b8: file obj-build-mer-qt-xr/dist/include/mozilla/
    UniquePtr.h, line 287.
(gdb) c
Continuing.
[LWP 20447 exited]
[LWP 20440 exited]
[LWP 20511 exited]
[LWP 20460 exited]
Also no hit! Ah... that is until I touch the screen. Then suddenly this:
[New LWP 20679]	
[New LWP 20680]
[New LWP 20681]
[New LWP 20682]
[Switching to LWP 20444]

Thread 36 "Compositor" hit Breakpoint 1, mozilla::gl::GLScreenBuffer::Swap
    (this=this@entry=0x7eac003c00, size=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:287
287     obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) 
We should get a backtrace from ESR 78 to compare against.
(gdb) bt
#0  mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x7eac003c00, size=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:287
#1  0x0000007fbb2e2d8c in mozilla::gl::GLScreenBuffer::PublishFrame
    (size=..., this=0x7eac003c00)
    at obj-build-mer-qt-xr/dist/include/GLScreenBuffer.h:171
#2  mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    PresentOffscreenSurface (this=0x7f8c99d3b0)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:183
#3  0x0000007fbb2f8600 in mozilla::embedlite::nsWindow::PostRender
    (this=0x7f8c87db30, aContext=<optimized out>)
    at mobile/sailfishos/embedshared/nsWindow.cpp:395
#4  0x0000007fb8fbff4c in mozilla::layers::LayerManagerComposite::Render
    (this=this@entry=0x7eac1988c0, aInvalidRegion=..., aOpaqueRegion=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/TimeStamp.h:452
#5  0x0000007fb8fc02c4 in mozilla::layers::LayerManagerComposite::
    UpdateAndRender (this=this@entry=0x7eac1988c0)
    at gfx/layers/composite/LayerManagerComposite.cpp:647
#6  0x0000007fb8fc0514 in mozilla::layers::LayerManagerComposite::
    EndTransaction (aFlags=mozilla::layers::LayerManager::END_DEFAULT,
    aTimeStamp=..., this=0x7eac1988c0) at gfx/layers/composite/
    LayerManagerComposite.cpp:566
#7  mozilla::layers::LayerManagerComposite::EndTransaction (this=0x7eac1988c0,
    aTimeStamp=..., aFlags=mozilla::layers::LayerManager::END_DEFAULT)
    at gfx/layers/composite/LayerManagerComposite.cpp:536
#8  0x0000007fb8fe7f9c in mozilla::layers::CompositorBridgeParent::
    CompositeToTarget (this=0x7f8c99d3b0, aId=..., aTarget=0x0,
    aRect=<optimized out>)
    at obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#9  0x0000007fbb2e288c in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7f8c99d3b0, aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:159
#10 0x0000007fb8fc7988 in mozilla::layers::CompositorVsyncScheduler::Composite
    (this=0x7f8cbacc40, aId=..., aVsyncTimestamp=...)
    at gfx/layers/ipc/CompositorVsyncScheduler.cpp:249
#11 0x0000007fb8fc5ff0 in mozilla::detail::RunnableMethodArguments
    <mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>,
    mozilla::TimeStamp>::applyImpl<mozilla::layers::CompositorVsyncScheduler,
    void (mozilla::layers::CompositorVsyncScheduler::*)(mozilla::layers::
    BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp), 
    StoreCopyPassByConstLRef<mozilla::layers::BaseTransactionId<mozilla::
    VsyncIdType> >, StoreCopyPassByConstLRef<mozilla::TimeStamp>, 0ul, 1ul>
    (args=..., m=<optimized out>, o=<optimized out>)
    at obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:925
[...]
#24 0x0000007fbe70d89c in ?? () from /lib64/libc.so.6
(gdb) 
Going back to ESR 91, it turns out the method isn't missing after all, it just needs a bit of prodding to get it to be called. So on touching the screen I get the same result. We should compare the backtraces. Here's what we get from ESR 91:
Thread 38 "Compositor" hit Breakpoint 1, mozilla::gl::GLScreenBuffer::Swap
    (this=this@entry=0x5555643950, size=...)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:506
506     ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp: No such file or directory.
(gdb) bt
#0  mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555643950, size=...)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:506
#1  0x0000007ff36667a8 in mozilla::gl::GLScreenBuffer::PublishFrame
    (size=..., this=0x5555643950)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/GLScreenBuffer.h:229
#2  mozilla::embedlite::EmbedLiteCompositorBridgeParent::PresentOffscreenSurface
    (this=0x7fc49fde60)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:191
#3  0x0000007ff367fc0c in mozilla::embedlite::nsWindow::PostRender
    (this=0x7fc49fc290, aContext=<optimized out>)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/nsWindow.cpp:248
#4  0x0000007ff2a64ec0 in mozilla::widget::InProcessCompositorWidget::PostRender
    (this=0x7fc4b8df00, aContext=0x7f1f970848)
    at ${PROJECT}/gecko-dev/widget/InProcessCompositorWidget.cpp:60
#5  0x0000007ff128f9f4 in mozilla::layers::LayerManagerComposite::Render
    (this=this@entry=0x7ed41bb450, aInvalidRegion=..., aOpaqueRegion=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/Compositor.h:575
#6  0x0000007ff128fe70 in mozilla::layers::LayerManagerComposite::
    UpdateAndRender (this=this@entry=0x7ed41bb450)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/LayerManagerComposite.cpp:657
#7  0x0000007ff1290220 in mozilla::layers::LayerManagerComposite::EndTransaction
    (this=this@entry=0x7ed41bb450, aTimeStamp=..., 
    aFlags=aFlags@entry=mozilla::layers::LayerManager::END_DEFAULT)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/LayerManagerComposite.cpp:572
#8  0x0000007ff12d19bc in mozilla::layers::CompositorBridgeParent::
    CompositeToTarget (this=0x7fc49fde60, aId=..., aTarget=0x0,
    aRect=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#9  0x0000007ff12b7104 in mozilla::layers::CompositorVsyncScheduler::Composite
    (this=0x7fc429cac0, aVsyncEvent=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorVsyncScheduler.cpp:256
#10 0x0000007ff12af57c in mozilla::detail::RunnableMethodArguments<mozilla::
    VsyncEvent>::applyImpl<mozilla::layers::CompositorVsyncScheduler, void
    (mozilla::layers::CompositorVsyncScheduler::*)(mozilla::VsyncEvent const&),
    StoreCopyPassByConstLRef<mozilla::VsyncEvent>, 0ul> (args=...,
    m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:887
[...]
#22 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
Both are largely similar and both are initially triggered from the vsync scheduler CompositorVsyncScheduler(), which makes sense for a render update pipeline. But there are some differences too. In between the vsync scheduler and the layer manager's EndTransaction() call we have this on ESR 78:
#7  LayerManagerComposite::EndTransaction, LayerManagerComposite.cpp:536
#8  CompositorBridgeParent::CompositeToTarget, mozilla/RefPtr.h:313
#9  EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget,
    EmbedLiteCompositorBridgeParent.cpp:159
#10 CompositorVsyncScheduler::Composite, CompositorVsyncScheduler.cpp:249
Whereas on ESR 91 we have this:
#7  LayerManagerComposite::EndTransaction, LayerManagerComposite.cpp:572
#8  CompositorBridgeParent::CompositeToTarget, RefPtr.h:313
#9  CompositorVsyncScheduler::Composite, CompositorVsyncScheduler.cpp:256
It could be that this is a completely benign change, either due to us hitting a breakpoint at a slightly different point in the cycle, or due to upstream changes that are unrelated to the rendering issues we're experiencing. But I'm homing in on it because I remember there being differences in the way the target is set up and this immediately looks suspicious to me. Especially suspicious is the fact that EmbedLiteCompositorBridgeParent has been written out of the ESR 91 execution flow. That's Sailfish-specific code, so that could well indicate a problem.

So let's try and find out why. In ESR 78 the code that's being called from CompositorVsyncScheduler::Composite() is the following:
    // Tell the owner to do a composite
    mVsyncSchedulerOwner->CompositeToDefaultTarget(aId);
    mVsyncNotificationsSkipped = 0;
In ESR 91 we have a strange addition to this.
    // Tell the owner to do a composite
    mVsyncSchedulerOwner->CompositeToTarget(aVsyncEvent.mId, nullptr, nullptr);
    mVsyncSchedulerOwner->CompositeToDefaultTarget(aVsyncEvent.mId);

    mVsyncNotificationsSkipped = 0;
The extra line spacing makes this look very intentional, but the real question I'd like to know the answer to is: "was this change done by upstream or by me?". If it was upstream then it's almost certainly intentional. If it was me, well, then it could well be a mistake. We can find out, as always, using git.
$ git blame gfx/layers/ipc/CompositorVsyncScheduler.cpp -L 255,259
Blaming lines:   1% (5/370), done.
7a2ef4343bb1d (Kartikaya Gupta       2018-02-01 16:28:53 -0500 255)
    // Tell the owner to do a composite
97287dc1b1d82 (Markus Stange         2020-07-18 05:17:39 +0000 256)
    mVsyncSchedulerOwner->CompositeToTarget(aVsyncEvent.mId, nullptr, nullptr);
0acadeba1ac39 (David Llewellyn-Jones 2023-08-28 14:55:57 +0100 257)
    mVsyncSchedulerOwner->CompositeToDefaultTarget(aVsyncEvent.mId);
7a2ef4343bb1d (Kartikaya Gupta       2018-02-01 16:28:53 -0500 258)
133e28473a0f8 (Sotaro Ikeda          2016-11-18 02:37:04 -0800 259)
    mVsyncNotificationsSkipped = 0;
Well, that's interesting. There is a line inserted by me, but it's not the line I was expecting. It looks very much like I added my line intending to also remove the line before it, but forgot. I'm going to rectify this.
$ git diff gfx/layers/ipc/CompositorVsyncScheduler.cpp
diff --git a/gfx/layers/ipc/CompositorVsyncScheduler.cpp
           b/gfx/layers/ipc/CompositorVsyncScheduler.cpp
index 2e8e58a2c46b..3abe24ceeeea 100644
--- a/gfx/layers/ipc/CompositorVsyncScheduler.cpp
+++ b/gfx/layers/ipc/CompositorVsyncScheduler.cpp
@@ -253,9 +253,7 @@ void CompositorVsyncScheduler::Composite
    (const VsyncEvent& aVsyncEvent) {
     mLastComposeTime = SampleTime::FromVsync(aVsyncEvent.mTime);
 
     // Tell the owner to do a composite
-    mVsyncSchedulerOwner->CompositeToTarget(aVsyncEvent.mId, nullptr, nullptr);
     mVsyncSchedulerOwner->CompositeToDefaultTarget(aVsyncEvent.mId);
-
     mVsyncNotificationsSkipped = 0;
 
     TimeDuration compositeFrameTotal = TimeStamp::Now() - aVsyncEvent.mTime;
Line removed. Now to build and see what happens when we try to run it.
(gdb) b GLScreenBuffer::Swap
Breakpoint 1 at 0x7ff1106f8c: file ${PROJECT}/gecko-dev/gfx/gl/
    GLScreenBuffer.cpp, line 506.
(gdb) c
Continuing.
[LWP 22665 exited]
[New LWP 22740]
[Switching to LWP 22660]

Thread 36 "Compositor" hit Breakpoint 1, mozilla::gl::GLScreenBuffer::Swap
    (this=this@entry=0x5555642d50, size=...)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:506
506     ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:
    No such file or directory.
(gdb) bt
#0  mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555642d50, size=...)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:506
#1  0x0000007ff3666788 in mozilla::gl::GLScreenBuffer::PublishFrame
    (size=..., this=0x5555642d50)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/GLScreenBuffer.h:229
#2  mozilla::embedlite::EmbedLiteCompositorBridgeParent::PresentOffscreenSurface
    (this=0x7fc49fde10)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:191
#3  0x0000007ff367fbec in mozilla::embedlite::nsWindow::PostRender
    (this=0x7fc49fc240, aContext=<optimized out>)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/nsWindow.cpp:248
#4  0x0000007ff2a64ea4 in mozilla::widget::InProcessCompositorWidget::PostRender
    (this=0x7fc4b88770, aContext=0x7f177f87e8)
    at ${PROJECT}/gecko-dev/widget/InProcessCompositorWidget.cpp:60
#5  0x0000007ff128f9f4 in mozilla::layers::LayerManagerComposite::Render
    (this=this@entry=0x7edc1bb450, aInvalidRegion=..., aOpaqueRegion=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    Compositor.h:575
#6  0x0000007ff128fe70 in mozilla::layers::LayerManagerComposite::
    UpdateAndRender (this=this@entry=0x7edc1bb450)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/LayerManagerComposite.cpp:657
#7  0x0000007ff1290220 in mozilla::layers::LayerManagerComposite::EndTransaction
    (this=this@entry=0x7edc1bb450, aTimeStamp=..., 
    aFlags=aFlags@entry=mozilla::layers::LayerManager::END_DEFAULT)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/LayerManagerComposite.cpp:572
#8  0x0000007ff12d19a0 in mozilla::layers::CompositorBridgeParent::
    CompositeToTarget (this=0x7fc49fde10, aId=..., aTarget=0x0,
    aRect=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#9  0x0000007ff3666488 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc49fde10, aId=...)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:165
#10 0x0000007ff12b70fc in mozilla::layers::CompositorVsyncScheduler::Composite
    (this=0x7fc4b82ef0, aVsyncEvent=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorVsyncScheduler.cpp:256
#11 0x0000007ff12af57c in mozilla::detail::RunnableMethodArguments<mozilla::
    VsyncEvent>::applyImpl<mozilla::layers::CompositorVsyncScheduler, void
    (mozilla::layers::CompositorVsyncScheduler::*)(mozilla::VsyncEvent const&),
    StoreCopyPassByConstLRef<mozilla::VsyncEvent>, 0ul> (args=...,
    m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:887
[...]
#23 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
Let's strip out the part we're interested in.
#7  LayerManagerComposite::EndTransaction, LayerManagerComposite.cpp:572
#8  CompositorBridgeParent::CompositeToTarget, RefPtr.h:313
#9  EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget,
    EmbedLiteCompositorBridgeParent.cpp:165
#10 CompositorVsyncScheduler::Composite, CompositorVsyncScheduler.cpp:256
If we compare this with the previous backtrace from ESR 78 we can see that the two now align fully. Unfortunately, despite fixing this issue, it doesn't give us a working render: the screen is still just plain white. But it will be one step on the way to fixing things fully.
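
The effect of the stray CompositeToTarget() call can be illustrated with a small standalone sketch. The class names here are hypothetical stand-ins and the logic is greatly simplified. The only assumption is that the platform-specific owner overrides CompositeToDefaultTarget() and forwards on to CompositeToTarget(), which is what the backtraces suggest.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the vsync scheduler's owner.
class SchedulerOwner {
 public:
  virtual ~SchedulerOwner() = default;

  virtual void CompositeToTarget(std::vector<std::string>& trace) {
    trace.push_back("CompositeToTarget");
  }

  virtual void CompositeToDefaultTarget(std::vector<std::string>& trace) {
    CompositeToTarget(trace);
  }
};

// Stand-in for a platform-specific owner that hooks the default-target path.
class EmbedOwner : public SchedulerOwner {
 public:
  void CompositeToDefaultTarget(std::vector<std::string>& trace) override {
    trace.push_back("Embed::CompositeToDefaultTarget");
    CompositeToTarget(trace);
  }
};

// Before the fix: both calls are made, so each vsync composites twice and the
// first composite bypasses the platform-specific hook.
std::vector<std::string> CompositeBeforeFix(SchedulerOwner& owner) {
  std::vector<std::string> trace;
  owner.CompositeToTarget(trace);
  owner.CompositeToDefaultTarget(trace);
  return trace;
}

// After the fix: a single call that goes through the platform-specific hook.
std::vector<std::string> CompositeAfterFix(SchedulerOwner& owner) {
  std::vector<std::string> trace;
  owner.CompositeToDefaultTarget(trace);
  return trace;
}
```

Calling CompositeBeforeFix() on an EmbedOwner records three entries, starting with a direct CompositeToTarget that never passes through the embed hook, whereas CompositeAfterFix() records just the hook followed by the composite. That matches the change in shape between the two ESR 91 backtraces, although the real classes do far more than push strings onto a vector.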

I'm going to leave it there for today. Tomorrow we'll look further into the differences between ESR 78 and ESR 91.

10 Mar 2024 : Day 181 #
Today it's time to test out the changes I made yesterday, adding in the SharedSurface_Basic functionality that got lost in the transition to ESR 91. The key change is that the ProdTexture() method will now be overridden and so the call to it from SurfaceFactory during initialisation should, I'm hoping, no longer trigger a crash.

One of the downsides of using the massive partial libxul.so builds packed full of debugging information is that they just take up so much room on the device. But after deleting the library, the reinstallation of it then goes through without a hitch. It does strike me as a little odd that the calculation for how much space is needed doesn't take into account how much will be removed as well as how much will be added. I guess this is important for allowing transactional updates.
$ rpm -U xulrunner-qt5-91.*.rpm xulrunner-qt5-debuginfo-91.*.rpm \
    xulrunner-qt5-debugsource-91.*.rpm xulrunner-qt5-misc-91.*.rpm
        installing package xulrunner-qt5-91.9.1+git1.aarch64 needs 7MB more
        space on the / filesystem
$ rm /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
$ rpm -U xulrunner-qt5-91.*.rpm xulrunner-qt5-debuginfo-91.*.rpm \
    xulrunner-qt5-debugsource-91.*.rpm xulrunner-qt5-misc-91.*.rpm
When running the new code it still almost immediately crashes. And that's not a surprise; I'm expecting at least a few more cycles of this "run-crash-debug-fix" process before we have something working.
Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 4396]
0x0000007ff1107dc4 in mozilla::gl::SharedSurface::ProdTexture
    (this=<optimized out>)
    at gfx/gl/SharedSurface.h:157
157         MOZ_CRASH("GFX: Did you forget to override this function?");
(gdb) bt
#0  0x0000007ff1107dc4 in mozilla::gl::SharedSurface::ProdTexture
    (this=<optimized out>)
    at gfx/gl/SharedSurface.h:157
#1  0x0000007ff1106cc4 in mozilla::gl::ReadBuffer::Attach (this=0x7ed41a1700,
    surf=surf@entry=0x7ed419f9c0)
    at gfx/gl/GLScreenBuffer.cpp:718
#2  0x0000007ff1106ebc in mozilla::gl::GLScreenBuffer::Attach
    (this=this@entry=0x5555642e30, surf=0x7ed419f9c0, size=...)
    at gfx/gl/GLScreenBuffer.cpp:486
#3  0x0000007ff1106f60 in mozilla::gl::GLScreenBuffer::Swap
    (this=this@entry=0x5555642e30, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
[...]
#25 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) p this
$6 = <optimized out>
(gdb) frame 1
#1  0x0000007ff1106cc4 in mozilla::gl::ReadBuffer::Attach (this=0x7ed41a1700, surf=surf@entry=0x7ed419f9c0)
    at gfx/gl/GLScreenBuffer.cpp:718
718             colorTex = surf->ProdTexture();
(gdb) p surf
$3 = (mozilla::gl::SharedSurface *) 0x7ed419f9c0
(gdb) set print object on
(gdb) p surf
$5 = (mozilla::gl::SharedSurface_EGLImage *) 0x7ed419f9c0
(gdb) set print object off
(gdb) 
My initial reaction is that this is the same error that I spent yesterday trying to fix. But on closer inspection it's actually a little different. So maybe the changes made yesterday were actually worthwhile after all?

Nevertheless, this still seems to be a crash due to a missing override, as we can see from the error message that's output: "Did you forget to override this function". It's an induced crash again. But this time the surface is of type SharedSurface_EGLImage. Probably we'll have to add the overrides into this class as well. This will be similar to the work I did yesterday, but this time applied to a different class that's also inheriting from SharedSurface.

Looking at the SharedSurface_EGLImage class definition in SharedSurfaceEGL.h there are some very distinct differences between ESR 78 and ESR 91, including the lack of a ProdTexture() override in ESR 91. Here are the relevant code pieces from ESR 78 (I've rearranged some of the line orders for clarity):
class SharedSurface_EGLImage : public SharedSurface {
[...]
 protected:
  mutable Mutex mMutex;
  const GLFormats mFormats;
  GLuint mProdTex;
[...]
  virtual GLuint ProdTexture() override { return mProdTex; }
[...]
In comparison, the ProdTexture() method and associated mProdTex member variable are both missing in ESR 91. I'll need to add them in, along with all the logic associated with them.
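
The mechanism behind this crash can be shown with a cut-down sketch. The class names below are hypothetical, and where the real base class calls MOZ_CRASH() this version throws an exception instead, purely so the failure can be observed without aborting the process.

```cpp
#include <cassert>
#include <stdexcept>

using GLuint = unsigned int;  // Stand-in for the OpenGL typedef

// Hypothetical, cut-down base class. The real code calls MOZ_CRASH(); this
// sketch throws so that the failure can be caught and inspected.
class SurfaceBase {
 public:
  virtual ~SurfaceBase() = default;

  virtual GLuint ProdTexture() {
    throw std::logic_error("Did you forget to override this function?");
  }
};

// A subclass without the override inherits the failing default.
class SurfaceWithoutOverride : public SurfaceBase {};

// A subclass that stores a texture name and overrides the accessor.
class SurfaceWithOverride : public SurfaceBase {
 public:
  explicit SurfaceWithOverride(GLuint prodTex) : mProdTex(prodTex) {}

  GLuint ProdTexture() override { return mProdTex; }

 protected:
  GLuint mProdTex;
};

// Reports whether a call made through the base pointer succeeds.
bool CanGetProdTexture(SurfaceBase* surf) {
  assert(surf);
  try {
    surf->ProdTexture();
    return true;
  } catch (const std::logic_error&) {
    return false;
  }
}
```

The caller only ever holds a pointer to the base type, just as ReadBuffer::Attach() holds a SharedSurface pointer, so whether the call succeeds depends entirely on the dynamic type of the object behind it. That's why the debugger's "set print object on" was needed to see which class was missing its override.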

The mFormats variable is also missing from ESR 91, but I can't see anywhere that's used in a meaningful way, so I'll leave that out. The Cast() method has also been removed. But the logic for this is pretty simple and it looks like this has just been replaced with the same logic and direct cast in the various places it's used in the code, rather than in a separate method. Given this, there looks to be no need to revert this particular change.
  static SharedSurface_EGLImage* Cast(SharedSurface* surf) {
    MOZ_ASSERT(surf->mType == SharedSurfaceType::EGLImageShare);

    return (SharedSurface_EGLImage*)surf;
  }
Possibly these were changes I made myself at some point in the (now distant!) past while performing this update.

The other change I've had to make is to the SharedSurface_EGLImage::Create() and SurfaceFactory_EGLImage::Create() methods. I've changed their implementations slightly and redirected them to use the new (old?) constructors.

With all of these changes in place the partial build now goes through. I've linked the partial build, copied it over to my phone and manually copied it to the correct directory. Now to see whether it's had any effect.
$ harbour-webview 
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[D] main:30 - WebView Example
[D] main:47 - Opening webview
[D] unknown:0 - Using Wayland-EGL
[...]
Created LOG for EmbedLiteLayerManager
=============== Preparing offscreen rendering context ===============
CONSOLE message:
OpenGL compositor Initialized Succesfully.
[...]
Frame script: embedhelper.js loaded
CONSOLE message:
[JavaScript Warning: "This page uses the non standard property “zoom”. Consider
    using calc() in the relevant property values, or using “transform” along
    with “transform-origin: 0 0”." {file: "https://sailfishos.org/" line: 0}]
CONSOLE message:
[JavaScript Warning: "Layout was forced before the page was fully loaded. If
    stylesheets are not yet loaded this may cause a flash of unstyled content."
    {file: "https://sailfishos.org/wp-includes/js/jquery/
    jquery.min.js?ver=3.5.1" line: 2}]
The good news is the WebView test app is no longer crashing. It stays running and even responds to touch input. But it's not rendering. The screen is just showing a completely white page. This is definitely good progress though. It means that tomorrow I can dive back into the debugger to compare execution with ESR 78, see where they're diverging and hopefully gradually get them to align closer until the rendering works.

I'm afraid to say, there's still a long journey ahead of us. But we are still, slowly but surely, moving forwards.

9 Mar 2024 : Day 180 #
I'm picking up where I left off, reintroducing code around the SharedSurface_Basic class. I've added back in the missing methods and code that ESR 78 includes as part of SharedSurface_Basic:
  1. Create(). There's already an existing Create() method, but this new implementation provides a new signature.
  2. Wrap(). This was missing from the ESR 91 version. It feels like we'll need it.
  3. Cast(). Similarly this might turn out to be needed.
  4. SharedSurface_Basic(). Again, there was already a constructor but this one provides some new features.
  5. ~SharedSurface_Basic(). In ESR 91 the destructor had been removed.
  6. LockProdImpl(). This is an override; it's not clear that it's needed.
  7. UnlockProdImpl(). Similarly here.
  8. ProducerAcquireImpl(). And here.
  9. ProducerReleaseImpl(). And also here.
  10. ProdTexture(). However this override we need; the fact this was missing from ESR 91 is the reason we saw a segfault.
As well as adding all of these I also had to tweak some other code, adding an alternative SharedSurface constructor, along with the following new member variables:
  1. const GLuint mTex.
  2. const bool mOwnsTex.
  3. GLuint mFB.
Having done all this I must now make sure that the new functionality is being used. That means finding out where the SharedSurface_Basic::Create() method is called and replacing it with our new version. Plus I'll need to check whether SharedSurface_Basic::Wrap() is used anywhere in ESR 78 and if so, see whether it should also be applied in ESR 91.
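
The mTex and mOwnsTex members hint at an ownership pattern, with Create() and Wrap() differing in whether the surface is responsible for deleting its texture. The sketch below shows that idea in isolation. It's an assumption about the intent rather than a copy of the gecko code: FakeGL and BasicSurface are hypothetical names, the framebuffer member mFB is left out, and std::unique_ptr stands in for mozilla::UniquePtr.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using GLuint = unsigned int;  // Stand-in for the OpenGL typedef

// Hypothetical fake GL context that records texture deletions, so the sketch
// can run without a real OpenGL implementation.
struct FakeGL {
  GLuint next = 1;
  std::vector<GLuint> deleted;

  GLuint CreateTexture() { return next++; }
  void DeleteTexture(GLuint tex) { deleted.push_back(tex); }
};

class BasicSurface {
 public:
  // Creates a fresh texture which the surface then owns.
  static std::unique_ptr<BasicSurface> Create(FakeGL* gl) {
    assert(gl);
    return std::unique_ptr<BasicSurface>(
        new BasicSurface(gl, gl->CreateTexture(), true));
  }

  // Wraps an existing texture that's owned by someone else.
  static std::unique_ptr<BasicSurface> Wrap(FakeGL* gl, GLuint tex) {
    assert(gl);
    return std::unique_ptr<BasicSurface>(new BasicSurface(gl, tex, false));
  }

  BasicSurface(const BasicSurface&) = delete;
  BasicSurface& operator=(const BasicSurface&) = delete;

  ~BasicSurface() {
    // Only release the texture if this surface created it.
    if (mOwnsTex) mGL->DeleteTexture(mTex);
  }

  GLuint ProdTexture() const { return mTex; }

 private:
  BasicSurface(FakeGL* gl, GLuint tex, bool ownsTex)
      : mGL(gl), mTex(tex), mOwnsTex(ownsTex) {}

  FakeGL* const mGL;
  const GLuint mTex;
  const bool mOwnsTex;
};
```

When a surface made with Create() is destroyed its texture is deleted, while one made with Wrap() leaves the texture alone for its real owner to clean up. Getting the creation and deletion sides of this to balance is exactly the sort of thing that's easy to miss when reintroducing code piecemeal.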

The good news is that with SharedSurface_Basic::Create() being a static function, it'll need to be fully qualified (in other words, prefixed with the class name) which makes it a lot easier to search the code for. On ESR 78 we have these cases:
$ grep -rIn "SharedSurface_Basic::Create(" *
gecko-dev/gfx/gl/SharedSurfaceGL.h:82:
    return SharedSurface_Basic::Create(mGL, mFormats, size, hasAlpha);
gecko-dev/gfx/gl/SharedSurfaceGL.cpp:22:
    UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Create(
gecko-dev/gfx/gl/SharedSurface.cpp:41:
    tempSurf = SharedSurface_Basic::Create(gl, factory->mFormats, src->mSize,
Comparing that to the instances in ESR 91, we can see they're surprisingly similar:
$ grep -rIn "SharedSurface_Basic::Create(" *
gecko-dev/gfx/gl/SharedSurfaceGL.h:72:
    return SharedSurface_Basic::Create(desc);
gecko-dev/gfx/gl/SharedSurfaceGL.cpp:21:
    UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Create(
gecko-dev/gfx/gl/SharedSurfaceGL.cpp:57:
    UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Create(
gecko-dev/gfx/gl/SharedSurface.cpp:66:
    tempSurf = SharedSurface_Basic::Create({{gl, SharedSurfaceType::Basic,
On ESR 91 there are two different versions of SharedSurface_Basic::Create() because we just added a new one. But there are also differences in the way the method is called and I'd like to fix that so that the ESR 91 code better aligns with the ESR 78 code.

I've now changed the ESR 91 code so that it uses the same Create() overload as the single version available in ESR 78. I'm really hoping to match up this particular set of functionality as closely as possible.

Finally I need to do a similar check for the SharedSurface_Basic::Wrap() method. This is also static, which again helps with discovery.
$ grep -rIn "SharedSurface_Basic::Wrap(" *
gecko-dev/gfx/gl/SharedSurfaceGL.cpp:44:
    UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Wrap(GLContext* gl,
$ grep -rIn "Wrap(" gecko-dev/gfx/gl/SharedSurfaceGL.h
38:
    static UniquePtr<SharedSurface_Basic> Wrap(GLContext* gl,
$ grep -rIn "Wrap(" gecko-dev/gfx/gl/SharedSurfaceGL.cpp 
44:
    UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Wrap(GLContext* gl,
The Wrap() method isn't an override, so this should be enough to demonstrate that the method isn't actually being used at all in ESR 78. So there should be no need to worry too much about it in ESR 91. And in fact once things are working it should almost certainly be removed. But I want to get a working executable before worrying about that.

Since I've only just now added the Wrap() method back in to ESR 91 there's no point doing a search there. All we'll find is the new code I just added. But let's do it anyway for completeness.
$ grep -rIn "SharedSurface_Basic::Wrap(" *
gecko-dev/gfx/gl/SharedSurfaceGL.cpp:79:
    UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Wrap(GLContext* gl,
$ grep -rIn "Wrap(" gecko-dev/gfx/gl/SharedSurfaceGL.h
26:
    static UniquePtr<SharedSurface_Basic> Wrap(GLContext* gl,
$ grep -rIn "Wrap(" gecko-dev/gfx/gl/SharedSurfaceGL.cpp
79:
    UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Wrap(GLContext* gl,
Having made these changes the partial builds have now all gone through, so it's time to set off the full build so I can test the executable again tomorrow.

8 Mar 2024 : Day 179 #
The build started yesterday has now completed; let's not waste any time and get straight to testing it out.
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) r
Starting program: /usr/bin/harbour-webview 
[...]
Created LOG for EmbedLiteLayerManager
[New LWP 4065]

Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 4059]
0x0000007ff125d16c in mozilla::layers::SharedSurfaceTextureData::
    SharedSurfaceTextureData (this=0x7ed81af3c0, surf=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:283
283     obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) bt
#0  0x0000007ff125d16c in mozilla::layers::SharedSurfaceTextureData::
    SharedSurfaceTextureData (this=0x7ed81af3c0, surf=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:283
#1  0x0000007ff125d210 in mozilla::layers::SharedSurfaceTextureClient::
    Create (surf=..., factory=factory@entry=0x7ed80043d0, aAllocator=0x0, 
    aFlags=<optimized out>) at obj-build-mer-qt-xr/dist/include/mozilla/
    cxxalloc.h:33
#2  0x0000007ff111e038 in mozilla::gl::SurfaceFactory::NewTexClient
    (this=0x7ed80043d0, size=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:289
#3  0x0000007ff1107088 in mozilla::gl::GLScreenBuffer::Resize
    (this=0x5555643e90, size=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#4  0x0000007ff1131c04 in mozilla::gl::GLContext::CreateScreenBufferImpl
    (this=this@entry=0x7ed819ee40, size=..., caps=...)
    at gfx/gl/GLContext.cpp:2150
#5  0x0000007ff1131cc8 in mozilla::gl::GLContext::CreateScreenBuffer
    (caps=..., size=..., this=0x7ed819ee40)
    at gfx/gl/GLContext.h:3555
#6  mozilla::gl::GLContext::InitOffscreen
    (this=this@entry=0x7ed819ee40, size=..., caps=...)
    at gfx/gl/GLContext.cpp:2398
[...]
#29 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
This is progress: the application got a bit further today. But just from reading through the backtrace I'm not entirely certain what's going on here. The issues — and their locations — are being obscured by the various pointer wrappers (UniquePtr and RefPtr) in use. I'm finding it hard to get a purchase. The first actually useful location is line 2150 of GLContext.cpp but that's way down in frame 4.

One thing that we are able to work with is the differences between the ESR 78 code (working) and the ESR 91 code (broken). And they are different. In ESR 78 the SharedSurfaceTextureData constructor at the top of the stack looks like this:
SharedSurfaceTextureData::SharedSurfaceTextureData(
    UniquePtr<gl::SharedSurface> surf)
    : mSurf(std::move(surf)) {}
Whereas in ESR 91 I appear to have added some additional initialisation steps:
SharedSurfaceTextureData::SharedSurfaceTextureData(
    UniquePtr<gl::SharedSurface> surf)
    : mSurf(std::move(surf)),
      mDesc(),
      mFormat(),
      mSize(surf->mDesc.size)
{
}
One possibility is that the value of surf is null. This wouldn't necessarily cause a problem until we try to read the mDesc entry while setting mSize. I'm not able to extract the value of surf directly as the debugger informs me it's been "optimized out". But if I go up (down?) a stack frame I can see what was passed in for its value.
(gdb) p surf
$3 = <optimized out>
(gdb) frame 1
#1  0x0000007ff125d210 in mozilla::layers::SharedSurfaceTextureClient::Create
    (surf=..., factory=factory@entry=0x7ed80043d0, aAllocator=0x0, 
    aFlags=<optimized out>) at obj-build-mer-qt-xr/dist/include/mozilla/
    cxxalloc.h:33
33      obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:
    No such file or directory.
(gdb) p surf
$4 = {
  mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SharedSurface*,
    mozilla::DefaultDelete<mozilla::gl::SharedSurface>,
    (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> =
    {<mozilla::DefaultDelete<mozilla::gl::SharedSurface>> = {<No data fields>},
    mFirstA = 0x0}, <No data fields>}}
(gdb) p surf.mTuple.mFirstA
$5 = (mozilla::gl::SharedSurface *) 0x0
(gdb) 
So the fact that surf is null provides us with an immediate reason for the crash, but it leaves us with a question: should the value be null and we shouldn't be attempting to access its contents, or should it be a non-null value going in to this method?

I could fire up my second development phone and place a breakpoint on the SharedSurfaceTextureClient constructor to compare, but I'm on the train and one laptop and two phones is already leaving me cramped. A laptop and three phones would crowd me out entirely. So let's find out why surf is null by looking through the ESR 91 code instead.

The odd thing is that the parent method has plenty of checks for it not being null:
already_AddRefed<SharedSurfaceTextureClient> SharedSurfaceTextureClient::Create(
    UniquePtr<gl::SharedSurface> surf, gl::SurfaceFactory* factory,
    LayersIPCChannel* aAllocator, TextureFlags aFlags) {
  if (!surf) {
    return nullptr;
  }
  TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags();
  SharedSurfaceTextureData* data =
      new SharedSurfaceTextureData(std::move(surf));
  return MakeAndAddRef<SharedSurfaceTextureClient>(data, flags, aAllocator);
}
So the value going in isn't null. And now that I look at it carefully, I see that I have this all wrong: the surf variable isn't an instance of SharedSurface, it's a UniquePtr wrapping an instance of SharedSurface. When the value inside the unique pointer is moved, the value inside the unique pointer that it's coming from gets set to zero. That's the whole point of unique pointers.

So accessing this value that's been optimised out is actually more difficult than I'd thought. I can't just go up (down) a stack frame and check it there after all.

The solution will be to place a breakpoint on SharedSurfaceTextureClient::Create() and inspect the value before it's moved. Let's try that out.
(gdb) b SharedSurfaceTextureClient::Create
Breakpoint 1 at 0x7ff125d1a4: file gfx/layers/client/
    TextureClientSharedSurface.cpp, line 113.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview
[...]
Thread 36 "Compositor" hit Breakpoint 1, mozilla::layers::
    SharedSurfaceTextureClient::Create (surf=...,
    factory=factory@entry=0x7ee0004310, aAllocator=
    0x0, aFlags=mozilla::layers::TextureFlags::ORIGIN_BOTTOM_LEFT)
    at gfx/layers/client/TextureClientSharedSurface.cpp:113
113         LayersIPCChannel* aAllocator, TextureFlags aFlags) {
(gdb) p surf
$6 = {
  mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SharedSurface*,
    mozilla::DefaultDelete<mozilla::gl::SharedSurface>,
    (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> =
    {<mozilla::DefaultDelete<mozilla::gl::SharedSurface>> = {<No data fields>}, 
      mFirstA = 0x7ee01a1c00}, <No data fields>}}
(gdb) p surf.mTuple.mFirstA
$7 = (mozilla::gl::SharedSurface *) 0x7ee01a1c00
(gdb) p surf.mTuple.mFirstA->mDesc.size
$9 = {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx::
    UnknownUnits> >> = {{{width = 1080, height = 2520}, components = {1080,
    2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>},
    <No data fields>}
(gdb) p surf.mTuple.mFirstA->mDesc.size.width
$10 = 1080
(gdb) p surf.mTuple.mFirstA->mDesc.size.height
$11 = 2520
(gdb) 
This is all now looking a lot more healthy than I thought. Since it's not that the value is null going in, what else could be causing the crash here? It could be that there are multiple calls to this Create() method and this isn't the one causing the problem. But that's easy to check as well:
(gdb) n
114       if (!surf) {
(gdb) 
49      obj-build-mer-qt-xr/dist/include/mozilla/TypedEnumBits.h:
    No such file or directory.
(gdb) 
117       TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags();
(gdb) 
119           new SharedSurfaceTextureData(std::move(surf));
(gdb) 

Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
0x0000007ff125d16c in mozilla::layers::SharedSurfaceTextureData::
    SharedSurfaceTextureData (this=0x7ee01af300, surf=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:283
283     obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) 

Stepping forwards to the next call to the SharedSurfaceTextureData constructor triggers the crash. So we're definitely in the right place and the problem isn't a null value for the surf parameter after all.

And now suddenly it's hit me. There's something very wrong with this ordering:
    : mSurf(std::move(surf)),
      mDesc(),
      mFormat(),
      mSize(surf->mDesc.size)
The sequence here is going to be:
  1. Move surf into mSurf. This will leave the value stored inside surf as null.
  2. Create mDesc and mFormat in their default state.
  3. Attempt to access a value from inside surf. But surf has been moved out of the unique pointer wrapper and into another one, so we can't do this.
So it looks like we have an easy solution: rather than attempt to use surf, we should use the value of mSurf instead, since this contains the value that was moved from surf just a couple of lines prior. Here's the tweaked implementation:
SharedSurfaceTextureData::SharedSurfaceTextureData(
    UniquePtr<gl::SharedSurface> surf)
    : mSurf(std::move(surf)),
      mDesc(),
      mFormat(),
      mSize(mSurf->mDesc.size)
{
}
There may be other reasons why this constructor causes problems later, such as having the default values for mDesc and mFormat. I'm not sure how important these are. But this fix should get us closer to finding out.

I'm going to attempt to run the library generated from the partial build. Unfortunately the partial builds mess up the symbol references so it's not always possible to debug with them. I've also been stripping them of debug symbols to make uploading them quicker, so even if they don't get messed up, I still can't use them. But running it may nevertheless help to find out whether this fix has made any difference at all.
$ make -j1 -C obj-build-mer-qt-xr/gfx/layer
$ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit
$ strip obj-build-mer-qt-xr/toolkit/library/build/libxul.so
[...]
$ scp obj-build-mer-qt-xr/toolkit/library/build/libxul.so \
    defaultuser@172.28.172.2:~/Documents/Development/gecko/
$ ssh defaultuser@172.28.172.2
[...]
$ devel-su cp libxul.so /usr/lib64//xulrunner-qt5-91.9.1/
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) r
Starting program: /usr/bin/harbour-webview
[...]
Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 29907]
0x0000007ff1107e00 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
(gdb) bt
#0  0x0000007ff1107e00 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#1  0x0000007ff1106948 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#2  0x0000007ff1106c30 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#3  0x0000007ff1106e5c in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#4  0x0000007ff1107104 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#5  0x0000007ff1131c64 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#6  0x0000007ff1131d28 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#7  0x0000007ff1131e74 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#8  0x0000007ff11999f8 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#9  0x0000007ff11aefe8 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#10 0x0000007ff12c4c98 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#11 0x0000007ff12cfd14 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
[...]
#25 0x0000007ff07fe8d8 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so
#26 0x0000007feca3c9f0 in ?? () from /usr/lib64/libnspr4.so
#27 0x0000007fefd00a4c in ?? () from /lib64/libpthread.so.0
#28 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
Clearly that's not as helpful as we might have hoped. Maybe it's worth a try without stripping out the debug symbols.
$ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit
[...]
$ scp obj-build-mer-qt-xr/toolkit/library/build/libxul.so \
    defaultuser@172.28.172.2:~/Documents/Development/gecko/
The problem with using the non-stripped version is that it's a large 2.7GiB file (that's three times larger than the RPM packages, even including the debuginfo). The consequence is that it's actually taking my entire train journey to copy the file over to my phone (via my other phone using a Wifi hotspot). It's finally got there... but with only a few minutes to spare before we pull in to Cambridge station. I'm going to have to be quick!
$ ssh defaultuser@172.28.172.2
[...]
$ devel-su cp libxul.so /usr/lib64//xulrunner-qt5-91.9.1/
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) r
Starting program: /usr/bin/harbour-webview
[...]
Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 2262]
0x0000007ff1107e00 in mozilla::gl::SharedSurface::ProdTexture
    (this=<optimized out>)
    at gecko-dev/gfx/gl/SharedSurface.h:154
154     gecko-dev/gfx/gl/SharedSurface.h: No such file or directory.
(gdb) bt
#0  0x0000007ff1107e00 in mozilla::gl::SharedSurface::ProdTexture
    (this=<optimized out>)
    at gecko-dev/gfx/gl/SharedSurface.h:154
#1  0x0000007ff1106948 in mozilla::gl::ReadBuffer::Create (gl=0x7ed819ee40,
    caps=..., formats=..., surf=surf@entry=0x7ed81a1cc0)
    at gecko-dev/gfx/gl/GLScreenBuffer.cpp:653
#2  0x0000007ff1106c30 in mozilla::gl::GLScreenBuffer::CreateRead
    (this=this@entry=0x55556434d0, surf=surf@entry=0x7ed81a1cc0)
    at gecko-dev/gfx/gl/GLScreenBuffer.cpp:584
#3  0x0000007ff1106e5c in mozilla::gl::GLScreenBuffer::Attach
    (this=this@entry=0x55556434d0, surf=0x7ed81a1cc0, size=...)
    at gecko-dev/gfx/gl/GLScreenBuffer.cpp:488
#4  0x0000007ff1107104 in mozilla::gl::GLScreenBuffer::Resize
    (this=0x55556434d0, size=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#5  0x0000007ff1131c64 in mozilla::gl::GLContext::CreateScreenBufferImpl
    (this=this@entry=0x7ed819ee40, size=..., caps=...)
    at gecko-dev/gfx/gl/GLContext.cpp:2150
#6  0x0000007ff1131d28 in mozilla::gl::GLContext::CreateScreenBuffer
    (caps=..., size=..., this=0x7ed819ee40)
    at gecko-dev/gfx/gl/GLContext.h:3555
#7  mozilla::gl::GLContext::InitOffscreen (this=this@entry=0x7ed819ee40,
    size=..., caps=...)
    at gecko-dev/gfx/gl/GLContext.cpp:2398
#8  0x0000007ff1131e74 in mozilla::gl::GLContextProviderEGL::CreateOffscreen
    (size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags::
    REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7f1778a1c8)
    at gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1308
[...]
#30 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
This is a more interesting backtrace! And it's certainly different from the one we had yesterday. But now we really are pulling in to the station so it's time to close up my laptop and prepare for the next stage of my journey home. When I get back, I'll dig in to this crash further.

[...]

I'm home now. After digging around in the code a bit and comparing with the execution of ESR 78, the reason for the crash has become clear. At the top of the backtrace is SharedSurface::ProdTexture(). But this method is designed to crash; it looks like this:
  virtual GLuint ProdTexture() {
    MOZ_ASSERT(mAttachType == AttachmentType::GLTexture);
    MOZ_CRASH("GFX: Did you forget to override this function?");
  }
As you can see from the text passed to the crash macro, this function is never supposed to be called. It's supposed to be overridden by something else in a class that inherits from SharedSurface.

For the same reason, when I place a breakpoint on SharedSurface::ProdTexture() in the ESR 78 version of the code it doesn't get hit. On the other hand, a breakpoint on ReadBuffer::Create(), which is further down the stack trace, does get hit. If we then place a breakpoint on all instances of ProdTexture() we do get a hit, but it's from SharedSurface_Basic::ProdTexture() rather than from SharedSurface:
(gdb) b ReadBuffer::Create
Breakpoint 2 at 0x7fb8e66720: file gfx/gl/GLScreenBuffer.cpp, line 501.
(gdb) r
[...]
Thread 36 "Compositor" hit Breakpoint 2, mozilla::gl::ReadBuffer::Create
    (gl=0x7eac109130, caps=..., formats=..., surf=surf@entry=0x7eac10a6e0)
    at gfx/gl/GLScreenBuffer.cpp:501
501                                              SharedSurface* surf) {
(gdb) p surf
$1 = (mozilla::gl::SharedSurface *) 0x7eac10a6e0
(gdb) p surf->mAttachType
$2 = mozilla::gl::AttachmentType::GLTexture
(gdb) b ProdTexture
Breakpoint 3 at 0x7fb83835d0: ProdTexture. (5 locations)
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 3, mozilla::gl::SharedSurface_Basic::
    ProdTexture (this=0x7eac10a6e0)
    at gfx/gl/SharedSurfaceGL.h:65
65        virtual GLuint ProdTexture() override { return mTex; }
(gdb)
The SharedSurface_Basic class is defined in the SharedSurfaceGL.h file and in ESR 78 it does override the ProdTexture() method:
// For readback and bootstrapping:
class SharedSurface_Basic : public SharedSurface {
[...]
  virtual GLuint ProdTexture() override { return mTex; }
[...]
};
However in the ESR 91 code there's no such override. I should have added one in, but the need hadn't made it on to my radar. So I'll make this change now. Unfortunately there's a collection of cascading changes that make this a slightly larger job than I'd hoped. Nothing crazy, but too much to do tonight, so I'll have to leave it until the morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
7 Mar 2024 : Day 178 #
Good morning! Well, it is for me at least. By the time this gets posted it'll be the evening. But for me right now it's early and when I check in on my laptop I can see that the build has completed.

We'll take a look at the build presently, but before we do, thank you to Adam Pigg (piggz) for commenting on my meander into virtual methods yesterday. As Adam rightly pointed out, my C++ code was missing some accoutrements that he'd have preferred to see included.
 
Sorry, but i have to deduct points for using raw pointers and not smart pointers, and missing the override keyword ;)

Adam is right of course, not only are these sensible ways to improve the code, but the use of override would also have improved clarity. Adam has kindly provided his own, improved, version of the code which I'll share below. He then went on to make a few, perhaps more controversial, suggestions:
 
I'd also drop using namespace, and maybe mix in some C++23, and swap cout for std::print 🙃

I'll leave it to the reader to judge whether these would actually be improvements or not. Here's Adam's updated version of the code. One thing to note is that support for C++23's std::print requires at least GCC 14 or Clang 18 if you want to compile this at home, but Adam has confirmed it all works as expected using the online Godbolt Compiler Explorer.
#include <memory>
#include <print>
#include <string>

class Parent {
public:
    virtual ~Parent() {};
    std::string hello() { return std::string("Hello son"); }    
    std::string wave() { return std::string("Waves"); }
    virtual std::string goodbye() { return std::string("Goodbye son"); }
};

class Child : public Parent {
public:
    std::string hello() { return std::string("Hello Mum"); }
    std::string goodbye() override { return std::string("Goodbye Mum"); }
};

int main() {
    std::unique_ptr<Parent> parent(new Parent);
    std::shared_ptr<Child> child = std::make_shared<Child>();
    std::shared_ptr<Parent> vparent = std::dynamic_pointer_cast<Parent>(child);


    std::println("1. {} ", parent->hello());
    std::println("2. {} ", child->hello());
    std::println("3. {} ", vparent->hello());

    std::println("4. {} ", parent->wave());
    std::println("5. {} ", child->wave());
    std::println("6. {} ", vparent->wave());
    
    std::println("7. {} ", parent->goodbye());
    std::println("8. {} ", child->goodbye());
    std::println("9. {} ", vparent->goodbye());
    
    std::println("10. {} ", reinterpret_cast<void*>(parent.get()));
    std::println("11. {} ", reinterpret_cast<void*>(child.get()));
    std::println("12. {} ", reinterpret_cast<void*>(vparent.get()));
  
    return 0;
}
Thanks for that Adam! I'm always up for a bit of blog-based code review. Okay, now back to the Gecko changes from yesterday.

The latest Gecko build incorporates all of the GLScreenBuffer code that I've been adding in and, following the changes made yesterday, should no longer make use of the SwapChain class. I've copied the packages over to my development phone, installed them and now it's time to test them.
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) r
[...]
=============== Preparing offscreen rendering context ===============
[New LWP 20044]

Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 20040]
0x0000007ff110a0a8 in mozilla::gl::GLScreenBuffer::Size (this=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) bt
#0  0x0000007ff110a0a8 in mozilla::gl::GLScreenBuffer::Size (this=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#1  mozilla::gl::GLContext::OffscreenSize (this=this@entry=0x7ed419ee40)
    at gfx/gl/GLContext.cpp:2141
#2  0x0000007ff3666804 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4ba91a0, aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:156
#3  0x0000007ff12b6f50 in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4cabe80, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/LayersTypes.h:82
#4  0x0000007ff12b6fac in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4ba91a0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#5  0x0000007ff12b7038 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4ba91a0, x=<optimized out>,
    y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#6  0x0000007ff12afbd4 in mozilla::detail::RunnableMethodArguments<int, int,
    int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul, 2ul,
    3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#18 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) frame 1
#1  mozilla::gl::GLContext::OffscreenSize (this=this@entry=0x7ed419ee40)
    at gfx/gl/GLContext.cpp:2141
2141      return mScreen->Size();
(gdb) p mScreen
$1 = {
  mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::GLScreenBuffer*,
    mozilla::DefaultDelete<mozilla::gl::GLScreenBuffer>,
    (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> =
    {<mozilla::DefaultDelete<mozilla::gl::GLScreenBuffer>> = {<No data fields>}, 
      mFirstA = 0x0}, <No data fields>}}
(gdb) p mScreen.mTuple.mFirstA
$2 = (mozilla::gl::GLScreenBuffer *) 0x0
(gdb) 
So that's an immediate crash, the reason being that GLContext::OffscreenSize() is calling the Size() method on the mScreen member variable, which is a UniquePtr wrapping a GLScreenBuffer, but mScreen is null. That is, it's not been initialised yet.

Looking through the ESR 78 code there's only one place I can discern that sets the mScreen variable and that's this one:
bool GLContext::CreateScreenBufferImpl(const IntSize& size,
                                       const SurfaceCaps& caps) {
  UniquePtr<GLScreenBuffer> newScreen =
      GLScreenBuffer::Create(this, size, caps);
[...]
  mScreen = std::move(newScreen);

  return true;
}
The only place this is called is here:
  bool CreateScreenBuffer(const gfx::IntSize& size, const SurfaceCaps& caps) {
    if (!IsOffscreenSizeAllowed(size)) return false;

    return CreateScreenBufferImpl(size, caps);
  }
But this is just some indirection. Clearly the method we're really interested in is CreateScreenBuffer(). This is also only called in one place:
bool GLContext::InitOffscreen(const gfx::IntSize& size,
                              const SurfaceCaps& caps) {
  if (!CreateScreenBuffer(size, caps)) return false;
[...]
  mCaps = mScreen->mCaps;
  MOZ_ASSERT(!mCaps.any);

  return true;
}
It's worth noticing here that soon after calling the CreateScreenBuffer() method and in the same call, the mCaps member of mScreen is being accessed. If mScreen were null at the time of this access, this would immediately trigger a segmentation fault. So clearly, by the end of the InitOffscreen() method, it's expected that mScreen should be a valid instance of GLScreenBuffer.

Let's continue digging backwards and find out where InitOffscreen() gets called. It turns out it's called in quite a few places:
$ grep -rIn "InitOffscreen(" * --include="*.cpp"
gecko-dev/gfx/gl/GLContextProviderGLX.cpp:1035:
    if (!gl->InitOffscreen(size, minCaps)) {
gecko-dev/gfx/gl/GLContextProviderWGL.cpp:532:
    if (!gl->InitOffscreen(size, minCaps)) return nullptr;
gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1443:
    if (!gl->InitOffscreen(size, minOffscreenCaps)) {
gecko-dev/gfx/gl/GLContext.cpp:2576:
    bool GLContext::InitOffscreen(const gfx::IntSize& size,
The only one of these context providers we care about, given it's the only one used by sailfish-browser, is GLContextProviderEGL. The instance in GLContext.cpp is the method definition so we can ignore that. So there's only one case to concern ourselves with. The call in GLContextProviderEGL occurs in the GLContextProviderEGL::CreateOffscreen() method, which is rather long so I won't list the entire contents here. Suffice it to say that this is the call we need to be finding some equivalent of in ESR 91. This itself is only called from CompositorOGL::CreateContext().

We're building up a pretty clear path for how mScreen ends up initialised in ESR 78. The CreateContext() method is the first time we've reached something on this path which also exists in ESR 91. So this would seem to be a good place to switch back to ESR 91 and try to reconstruct the path in the opposite direction.

Just to make sure we're not being led in the wrong direction, it's also worth checking that CreateContext() is actually being called in ESR 91 and that this is happening before the segmentation fault causes the application to crash.
(gdb) b CompositorOGL::CreateContext
Breakpoint 1 at 0x7ff119a348: file gfx/layers/opengl/CompositorOGL.cpp, line 227.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]

Thread 38 "Compositor" hit Breakpoint 1, mozilla::layers::CompositorOGL::
    CreateContext (this=this@entry=0x7ed8002f10)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
(gdb)
Yes, it looks to be the case. So now back to building the path. The call to CreateOffscreen() in ESR 78 has been replaced by a call to CreateHeadless() in ESR 91. This is code we've looked at before, but finally we're getting to unravel it a bit. Here's the ESR 78 version:
    SurfaceCaps caps = SurfaceCaps::ForRGB();
    caps.preserve = false;
    caps.bpp16 = gfxVars::OffscreenFormat() == SurfaceFormat::R5G6B5_UINT16;

    nsCString discardFailureId;
    context = GLContextProvider::CreateOffscreen(
        mSurfaceSize, caps, CreateContextFlags::REQUIRE_COMPAT_PROFILE,
        &discardFailureId);
And here's the equivalent code, also executed from within CompositorOGL::CreateContext(), from ESR 91.
    nsCString discardFailureId;
    context = GLContextProvider::CreateHeadless(
        {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId);
    if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) {
      context = nullptr;
    }
This gives us all the pieces we need — both mSurfaceSize and caps are available in ESR 91 — so we just need to reconstruct the call to reflect what's happening in ESR 78.

So the next step is to add the CreateOffscreen() code to GLContextProviderEGL. There's code in ESR 78 to copy over, so that part's straightforward. However various interfaces it makes use of have been changed or are missing. I've had to make some quite heavy, but nevertheless justifiable, changes to the code. For example the GLLibraryEGL::EnsureInitialized() method has been replaced by GLLibraryEGL::Init(). That's more than just a name change: the former can be safely called multiple times, whereas the latter can only be called once (or so it appears). Consequently I've removed the call entirely, on the assumption that the Init() method is being called somewhere else. We'll find out whether that's true or not when we come to execute the program.

Nevertheless the partial builds now compile, so it's time to run the full build again, which will take more hours than there are left in the day. So we'll return to this again tomorrow.

6 Mar 2024 : Day 177 #
Yesterday I spent the day fixing linker errors. By the evening there was one last error that popped up at the end of the full build that looked like this:
TextureClientSharedSurface.cpp:108: undefined reference to `vtable for mozilla::layers::SharedSurfaceTextureClient'
I promised to spend a bit of time today explaining what was going on and how I fixed it.

One of the reasons I want to explain is that these vtable errors are a little cryptic. They also relate to an interesting implementation detail of C++. They're telling us that there's something wrong with the SharedSurfaceTextureClient implementation, but not telling us exactly what. The line number the error refers to is in the SharedSurfaceTextureClient constructor, which is also not particularly helpful:
SharedSurfaceTextureClient::SharedSurfaceTextureClient(
    SharedSurfaceTextureData* aData, TextureFlags aFlags,
    LayersIPCChannel* aAllocator)
    : TextureClient(aData, aFlags, aAllocator) {
}
The vtable the error is referring to is the "virtual table" of methods that can be dynamically overridden. When you subclass a class in C++ you can override certain methods from the parent class. In other words, when the code calls the method in the subclass, you can make sure it uses a method written for the subclass, rather than using the inherited method from the parent class. Both overriding and not overriding are useful. The whole point of creating a subclass is that you re-use some functionality from the parent class, so if you don't override a method it's great that the implementation from the parent class can be used.

However, sometimes you want to change the behaviour — or at least some of it — from the parent class. That might be where you'd override some of that functionality.

Where am I going with this? Bear with me! The point is that there are also two ways that a method can be overridden. It can be statically overridden or dynamically overridden. The former case is useful when the compiler can always tell that the method to be used is the one from the child class; in particular, when you never cast a class to make it look like its parent. If you do cast it to the parent, any statically overridden methods will use the parent class's implementation.

However, if you dynamically override a method, the child's implementation will be used even if you cast the class upwards. Let's take a simple example (this doesn't appear in the Gecko codebase!).
#include <iostream>
#include <string>

using namespace std;

class Parent {
public:
    string hello() { return string("Hello son"); }    
    string wave() { return string("Waves"); }
    virtual string goodbye() { return string("Goodbye son"); }
};

class Child : public Parent {
public:
    string hello() { return string("Hello Mum"); }
    string goodbye() { return string("Goodbye Mum"); }
};

int main() {
    Parent* parent = new Parent();
    Child* child = new Child();
    Parent* vparent = dynamic_cast<Parent*>(child);

    cout << "1.  " << parent->hello() << "\n";
    cout << "2.  " << child->hello() << "\n";
    cout << "3.  " << vparent->hello() << "\n";

    cout << "4.  " << parent->wave() << "\n";
    cout << "5.  " << child->wave() << "\n";
    cout << "6.  " << vparent->wave() << "\n";
    
    cout << "7.  " << parent->goodbye() << "\n";
    cout << "8.  " << child->goodbye() << "\n";
    cout << "9.  " << vparent->goodbye() << "\n";
    
    cout << "10. " << parent << "\n";
    cout << "11. " << child << "\n";
    cout << "12. " << vparent << "\n";

    delete child;
    delete parent;

    return 0;
}
Here we can see from the class definition that the Child class is inheriting from the Parent class. The Child class doesn't have any implementation of the wave() method because we're expecting it to be inherited from the Parent. The Child class statically overrides its parent's hello() implementation but dynamically overrides its parent's goodbye() method. We can see this because of the virtual keyword that's added before the name of the method. Note that it's actually specified on the parent's goodbye() method. It's parents that decide whether their children can inherit methods virtually or not.

Now let's see what happens when we build and run this code.
$ g++ main.cpp
$ ./a.out 
1.  Hello son
2.  Hello Mum
3.  Hello son
4.  Waves
5.  Waves
6.  Waves
7.  Goodbye son
8.  Goodbye Mum
9.  Goodbye Mum
10. 0x563a9f694eb0
11. 0x563a9f694ed0
12. 0x563a9f694ed0
Notice how the Child class is still able to successfully call the wave() method (line 5). When the child class calls hello() and goodbye() it calls its own implementations of these methods.

However, there's also this peculiar vparent variable. This is the instance of Child that's been dynamically cast to look like a Parent. We can see that child and vparent are actually the same object because they point to the same place (lines 11 and 12) compared to the parent object which has a different pointer (line 10).

Casting a Child to a Parent is perfectly safe because the former is inheriting from the latter. In other words, the code knows that everything a Parent has a Child will have as well. Safe.

But in the case of the dynamic cast, the calls to any overridden virtual methods are resolved at runtime rather than compile time. In particular, even though vparent is of type Parent, calling the goodbye() method on it calls the child's implementation of goodbye() (line 9).

The way this is achieved is with a vtable (this is where we're going!). This is a list of pointers to methods stored at runtime. Because goodbye() is marked as virtual in the Parent class, a pointer to this method is stored in the vtable. When the compiled code calls the method it gets the address to call from the vtable, rather than from some fixed value stored at compile time.

Now when the Child dynamically overrides goodbye() it overwrites the value in the vtable. The result is that when the goodbye() method is called from vparent, the Child's implementation is used even though the variable looks otherwise just like a Parent.

Finally we get to our error. The error is suggesting that there's some method that should be virtual, but no vtable has been created. This is most likely because there's a signature for a virtual method, but no implementation for any virtual method.

It's all a bit obscure, made more so because there are no virtual methods in the definition. However we do have ~SharedSurfaceTextureClient() in the class definition that doesn't have an implementation. Plus the class is inheriting from TextureData and this has a virtual destructor. As a result the ~SharedSurfaceTextureClient() destructor will also need to be added to the vtable.

But while ~SharedSurfaceTextureClient() appears in the class definition, there is no implementation of it. This is therefore likely what's causing our error.

The solution, after this very long explanation, is that we need to add in the implementation of the class destructor. Thankfully there's a destructor implementation in the ESR 78 code we can use:
SharedSurfaceTextureClient::~SharedSurfaceTextureClient() {
  // XXX - Things break when using the proper destruction handshake with
  // SharedSurfaceTextureData because the TextureData outlives its gl
  // context. Having a strong reference to the gl context creates a cycle.
  // This needs to be fixed in a better way, though, because deleting
  // the TextureData here can race with the compositor and cause flashing.
  TextureData* data = mData;
  mData = nullptr;

  Destroy();

  if (data) {
    // Destroy mData right away without doing the proper deallocation handshake,
    // because SharedSurface depends on things that may not outlive the
    // texture's destructor so we can't wait until we know the compositor isn't
    // using the texture anymore. It goes without saying that this is really bad
    // and we should fix the bugs that block doing the right thing such as bug
    // 1224199 sooner rather than later.
    delete data;
  }
}
As I mentioned yesterday, the partial builds all passed. But last night I also set off the full build again. So what was the outcome of this? The good news is that the full build went through as well, as a consequence of which I now have five shiny new xulrunner-qt5 packages to test on my phone. I'm not expecting them to work first time, but testing them is an essential step in finding out how to progress. Let's give them a go...
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) r
Starting program: /usr/bin/harbour-webview 
[...]
=============== Preparing offscreen rendering context ===============
[New LWP 29323]

Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 29318]
mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:141
141       return mPresenter->mBackBuffer->mFb->mSize;
(gdb) bt
#0  mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:141
#1  0x0000007ff3666884 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4b7c500, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff12b6fbc in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4d23b20, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    LayersTypes.h:82
#3  0x0000007ff12b7018 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4b7c500)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b70a4 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4b7c500, x=<optimized out>, y=<optimized out>, 
    width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12afc40 in mozilla::detail::RunnableMethodArguments<int, int,
    int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul,
    2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#17 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) p this
$1 = (const mozilla::gl::SwapChain * const) 0x0
(gdb) 
Immediately there's a segmentation fault and it feels like we've been here before. Here's where the crash is triggered:
const gfx::IntSize& SwapChain::OffscreenSize() const {
  return mPresenter->mBackBuffer->mFb->mSize;
}
It's being triggered because the instance of SwapChain this is being called from is null. But hang on; didn't we just spend days getting rid of the SwapChain code and swapping in GLScreenBuffer precisely to avoid this error? Well, yes, we did. Clearly after changing all that code I missed a case of SwapChain being used.

The fix is to switch back from SwapChain to the previous GLScreenBuffer interface, like so:
@@ -151,8 +153,7 @@ EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget(VsyncId aId)
 
   if (context->IsOffscreen()) {
     MutexAutoLock lock(mRenderMutex);
-    if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize
-      && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) {
+    if (context->OffscreenSize() != mEGLSurfaceSize &&
+        !context->ResizeOffscreen(mEGLSurfaceSize)) {
       return;
     }
   }
I've also checked through the rest of the EmbedLiteCompositorBridgeParent code to ensure there are no other SwapChain references in there. And, with that, we're done for the day. I've spent a long time explaining vtables today and very little time actually coding, but the change is an essential one and I can't test it without another build. So I've set it building again and with any luck I'll be able to test it again tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
5 Mar 2024 : Day 176 #
Just as I was explaining in my diary entry yesterday, the first thing I like to do after I wake up following a night of hard work — all performed by my laptop building the latest Gecko changes of course — is to scan the output for red. Red indicates errors.

When I peeked in this morning there was no red showing on the console. But my excitement was short-lived. The exception to the "errors are red" rule is when they come from the linker rather than the compiler and that's what seems to have happened here.

I'm not sure why the linker doesn't bother highlighting errors in red, but on close inspection they definitely are errors.

The fact it compiled without error is nevertheless exciting in itself, just not as exciting as actually having a binary to test. So what are these errors? As is the way with the linker, they're all "undefined reference" or symbol-related errors. This happens when the code calls something that has, for example, a method signature but no method definition. That can be common with pure virtual functions that aren't subsequently overridden, say, but there are other reasons too. For example I might simply have added a method signature without adding in the method body. Here's a sample of the errors that came out:
403:37.88 toolkit/library/build/libxul.so
410:16.07 aarch64-meego-linux-gnu-ld: ../../../gfx/gl/GLScreenBuffer.o:
    in function `mozilla::gl::GLScreenBuffer::~GLScreenBuffer()':
410:16.07 ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:296:
    undefined reference to `mozilla::gl::SurfaceCaps::~SurfaceCaps()'
410:16.07 aarch64-meego-linux-gnu-ld: ${PROJECT}/gecko-dev/gfx/gl/
    GLScreenBuffer.cpp:296: undefined reference to
    `mozilla::gl::SurfaceCaps::~SurfaceCaps()'
410:16.07 aarch64-meego-linux-gnu-ld: ../../../gfx/gl/Unified_cpp_gfx_gl0.o:
    in function `mozilla::gl::GLContext::GLContext(mozilla::gl::GLContextDesc
    const&, mozilla::gl::GLContext*, bool)':
410:16.07 ${PROJECT}/gecko-dev/gfx/gl/GLContext.cpp:290: undefined reference to
    `mozilla::gl::SurfaceCaps::SurfaceCaps()'
[...]
410:16.10 aarch64-meego-linux-gnu-ld: libxul.so: hidden symbol
    `_ZNK7mozilla2gl9GLContext15ChooseGLFormatsERKNS0_11SurfaceCapsE'
    isn't defined
410:16.10 aarch64-meego-linux-gnu-ld: final link failed: bad value
410:16.10 collect2: error: ld returned 1 exit status
Let me summarise all of the errors we have here and make things a bit clearer. The following are all the error locations followed by the names of the missing implementations.
GLScreenBuffer.cpp:296: SurfaceCaps::~SurfaceCaps()
GLContext.cpp:290: SurfaceCaps::SurfaceCaps()
SharedSurface.cpp:362: SurfaceCaps::SurfaceCaps(mozilla::gl::SurfaceCaps const&)
SharedSurface.cpp:361: GLContext::ChooseGLFormats(
    mozilla::gl::SurfaceCaps const&) const
SharedSurface.cpp:362: SurfaceCaps::SurfaceCaps()
SharedSurface.cpp:338: SurfaceCaps::SurfaceCaps()
SurfaceTypes.h:38: SurfaceCaps::SurfaceCaps()
SurfaceTypes.h:38: SurfaceCaps::operator=(mozilla::gl::SurfaceCaps const&)
SurfaceTypes.h:38: SurfaceCaps::~SurfaceCaps()
SharedSurface.cpp:350: SurfaceCaps::operator=(mozilla::gl::SurfaceCaps const&)
SharedSurface.cpp:338: SurfaceCaps::~SurfaceCaps()
SharedSurface.cpp:367: SurfaceCaps::~SurfaceCaps()
GLContext.cpp:296: SurfaceCaps::~SurfaceCaps()
SharedSurfaceGL.cpp:39: SurfaceCaps::SurfaceCaps()
SharedSurfaceGL.cpp:39: SurfaceCaps::~SurfaceCaps()
RefPtr.h:590: SharedSurfaceTextureClient::SharedSurfaceTextureClient(
    mozilla::layers::SharedSurfaceTextureData*, mozilla::layers::TextureFlags,
    mozilla::layers::LayersIPCChannel*)
So, what's going on with these? Let's take a look. First up we have line 296 of GLScreenBuffer.cpp. This line, it turns out, is where the GLScreenBuffer destructor implementation starts in the code:
GLScreenBuffer::~GLScreenBuffer() {
  mFactory = nullptr;
  mRead = nullptr;

  if (!mBack) return;

  // Detach mBack cleanly.
  mBack->Surf()->ProducerRelease();
}
Any instance of GLScreenBuffer will also hold an instance of SurfaceCaps as we can see in the header:
 public:
  const SurfaceCaps mCaps;
This is an actual instance of SurfaceCaps not just a pointer to it, so when the GLScreenBuffer instance is destroyed mCaps will be too, triggering a call to the SurfaceCaps destructor. If we look in the SurfaceTypes.h header file we can see that there is a destructor specified:
struct SurfaceCaps final {
[...]
  SurfaceCaps();
  SurfaceCaps(const SurfaceCaps& other);
  ~SurfaceCaps();
[...]
But nowhere is the body of the method defined. Compare that to the ESR 78 code where the SurfaceCaps struct looks the same. The difference is that checking in the GLContext.cpp source we find these:
// These are defined out of line so that we don't need to include
// ISurfaceAllocator.h in SurfaceTypes.h.
SurfaceCaps::SurfaceCaps() = default;
SurfaceCaps::SurfaceCaps(const SurfaceCaps& other) = default;
SurfaceCaps& SurfaceCaps::operator=(const SurfaceCaps& other) = default;
SurfaceCaps::~SurfaceCaps() = default;
They're pretty much the simplest implementations you can get. But they are at least implementations. So with any luck adding these in to the ESR 91 code as well will solve quite a few of the linker errors. That's not everything though. After we take away the instances that these four methods deal with we're left with these two:
SharedSurface.cpp:361: GLContext::ChooseGLFormats(
    mozilla::gl::SurfaceCaps const&) const
RefPtr.h:590: SharedSurfaceTextureClient::SharedSurfaceTextureClient(
    mozilla::layers::SharedSurfaceTextureData*, mozilla::layers::TextureFlags,
    mozilla::layers::LayersIPCChannel*)
Let's check SharedSurface.cpp line 361 next. The line looks like this:
      mFormats(partialDesc.gl->ChooseGLFormats(caps)),
If we look in the GLContext class, sure enough we can see the method signature:
  // Only varies based on bpp16 and alpha.
  GLFormats ChooseGLFormats(const SurfaceCaps& caps) const;
We can see the same signature in the ESR 78 code. Once again though, the body of the method is defined in GLContext.cpp of ESR 78, whereas it's nowhere to be found in ESR 91. So I've copied the missing code over: 
GLFormats GLContext::ChooseGLFormats(const SurfaceCaps& caps) const {
  GLFormats formats;

  // If we're on ES2 hardware and we have an explicit request for 16 bits of
  // color or less OR we don't support full 8-bit color, return a 4444 or 565
  // format.
  bool bpp16 = caps.bpp16;
[...]
Finally we have a reference to SharedSurfaceTextureClient. The reference to line 590 of RefPtr.h is unhelpful; what we really want to know is where the RefPtr is being used. A quick search suggests it's inside GLScreenBuffer where there are a couple of cases that look like this:
bool GLScreenBuffer::Swap(const gfx::IntSize& size) {
  RefPtr<layers::SharedSurfaceTextureClient> newBack =
      mFactory->NewTexClient(size);
[...]
There's no definition of SharedSurfaceTextureClient::SharedSurfaceTextureClient() in the ESR 91 code, but there is in the ESR 78 code, in the TextureClientSharedSurface.cpp source:
SharedSurfaceTextureClient::SharedSurfaceTextureClient(
    SharedSurfaceTextureData* aData, TextureFlags aFlags,
    LayersIPCChannel* aAllocator)
    : TextureClient(aData, aFlags, aAllocator) {
  mWorkaroundAnnoyingSharedSurfaceLifetimeIssues = true;
}
I've copied that code over as well. By my reading that's all of them. So this may mean we're ready to kick off a full rebuild. Before doing so, it's worth spending a bit of extra time to check that the changes pass our three partial builds already.
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/
[...]
$ make -j1 -C obj-build-mer-qt-xr/dom/canvas
[...]
$ make -j1 -C obj-build-mer-qt-xr/gfx/layers
[...]
The first two work fine, but the last generates a new error:
${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp:109:3:
    error: ‘mWorkaroundAnnoyingSharedSurfaceLifetimeIssues’
    was not declared in this scope
   mWorkaroundAnnoyingSharedSurfaceLifetimeIssues = true;
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This mWorkaroundAnnoyingSharedSurfaceLifetimeIssues variable is present in ESR 78, used to decide whether or not to deallocate the TextureData when the TextureClient is destroyed. Here's the logic it's part of:
void TextureClient::Destroy() {
[...]
  TextureData* data = mData;
  if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) {
    mData = nullptr;
  }

  if (data || actor) {
[...]
    params.workAroundSharedSurfaceOwnershipIssue =
        mWorkaroundAnnoyingSharedSurfaceOwnershipIssues;
    if (mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) {
      params.data = nullptr;
    } else {
      params.data = data;
    }
[...]
    DeallocateTextureClient(params);
This logic has been removed in ESR 91 and, if I'm honest, I'm comfortable leaving it this way. Here's what the equivalent ESR 91 code looks like:
void TextureClient::Destroy() {
[...]
  TextureData* data = mData;
  mData = nullptr;

  if (data || actor) {
[...]
    DeallocateTextureClient(params);
  }
}
As you can see, there are no references to mWorkaroundAnnoyingSharedSurfaceLifetimeIssues, or anything like it, there at all. Consequently I'm just going to remove the references to it from our new SharedSurfaceTextureClient constructor as well:
SharedSurfaceTextureClient::SharedSurfaceTextureClient(
    SharedSurfaceTextureData* aData, TextureFlags aFlags,
    LayersIPCChannel* aAllocator)
    : TextureClient(aData, aFlags, aAllocator) {
}
Maybe this will cause problems later, but my guess is that if there are going to be problems, they'll be pretty straightforward to track down with the debugger, at which point we can refer back to the ESR 78 code to restore these checks.

The partial builds all now pass, so it's time to do a full rebuild. Here's the updated tally of changes once again to highlight today's progress:
$ git diff --numstat
145     16      gfx/gl/GLContext.cpp
73      8       gfx/gl/GLContext.h
11      0       gfx/gl/GLContextTypes.h
590     0       gfx/gl/GLScreenBuffer.cpp
171     1       gfx/gl/GLScreenBuffer.h
305     5       gfx/gl/SharedSurface.cpp
132     4       gfx/gl/SharedSurface.h
16      10      gfx/gl/SharedSurfaceEGL.cpp
13      3       gfx/gl/SharedSurfaceEGL.h
81      2       gfx/gl/SharedSurfaceGL.cpp
61      0       gfx/gl/SharedSurfaceGL.h
60      0       gfx/gl/SurfaceTypes.h
27      0       gfx/layers/client/TextureClientSharedSurface.cpp
25      0       gfx/layers/client/TextureClientSharedSurface.h
$ git diff --shortstat
 14 files changed, 1710 insertions(+), 49 deletions(-)
So off the build goes again:
$ sfdk build -d --with git_workaround
NOTICE: Appending changelog entries to the RPM SPEC file…
Setting version: 91.9.1+git1+sailfishos.esr91.20240302180401.
    2f1f19ac7d73+gecko.dev.5292b747b036
Directory walk started
[...]
This will likely take until the morning, at which point I'll be eagerly checking for errors once again.

[...]

An early and unusually short build run (just 5 hours 12 minutes) meant that I've got a second bite of the cherry! The build finished before the end of the day. It's not a successful build unfortunately, but by discovering this now rather than in the morning I'm able to claim an entire day back. So I'm very happy about that.

Here are the linker errors. Well, actually, I think it's just one error that's spread over many lines:
308:11.87 toolkit/library/build/libxul.so
311:51.74 aarch64-meego-linux-gnu-ld: ../../../gfx/layers/
    Unified_cpp_gfx_layers6.o: in function `mozilla::layers::
    SharedSurfaceTextureClient::SharedSurfaceTextureClient
    (mozilla::layers::SharedSurfaceTextureData*, mozilla::layers::TextureFlags,
    mozilla::layers::LayersIPCChannel*)':
311:51.74 ${PROJECT}/gecko-dev/gfx/layers/client/
    TextureClientSharedSurface.cpp:108: undefined reference to
    `vtable for mozilla::layers::SharedSurfaceTextureClient'
311:51.74 aarch64-meego-linux-gnu-ld: ${PROJECT}/gecko-dev/gfx/layers/client/
    TextureClientSharedSurface.cpp:108: undefined reference to
    `vtable for mozilla::layers::SharedSurfaceTextureClient'
311:51.74 aarch64-meego-linux-gnu-ld: libxul.so: hidden symbol
    `_ZTVN7mozilla6layers26SharedSurfaceTextureClientE' isn't defined
311:51.74 aarch64-meego-linux-gnu-ld: final link failed: bad value
311:51.75 collect2: error: ld returned 1 exit status
The essence of the error is the following:
TextureClientSharedSurface.cpp:108: undefined reference to
    `vtable for mozilla::layers::SharedSurfaceTextureClient'
I'm going to hold off explaining what's going on here until tomorrow. This post is already rather long and I don't have the energy to go into the details tonight. But I have added in a fix.

With these latest changes the partial builds still all pass. So it's time to set the full build off again. But this also means it's the end of my gecko development for the day. More tomorrow when we'll see how things have gone.

4 Mar 2024 : Day 175 #
I've been really struggling to get the GLScreenBuffer code back compiling again over the last few days. This is partly because a whole raft of interfaces have changed and these changes have seeped through into many different places in the code. Often the changes are quite small, such as using a reference rather than a pointer. But other times they're more significant, such as the way member variables in SharedSurface have changed. To look into the latter in a little more detail, the previous version looked like this:
class SharedSurface {
 public:
[...]
  const SharedSurfaceType mType;
  const AttachmentType mAttachType;
  const WeakPtr<GLContext> mGL;
  const gfx::IntSize mSize;
  const bool mHasAlpha;
  const bool mCanRecycle;
But the new code bundles all this up into a structure so that it now looks more like this:
struct PartialSharedSurfaceDesc {
  const WeakPtr<GLContext> gl;
  const SharedSurfaceType type;
  const layers::TextureType consumerType;
  const bool canRecycle;
};
struct SharedSurfaceDesc : public PartialSharedSurfaceDesc {
  gfx::IntSize size = {};
};

class SharedSurface {
 public:
[...]
  const SharedSurfaceDesc mDesc;
Even this might seem like a small change, but it can make it really hard to match up method signatures given the ESR 78 version takes a collection of individual parameters, whilst the ESR 91 version requires just a single SharedSurfaceDesc instance.

Besides all this I've also had what feels like several long days at work. The result is that when I come to work on Gecko of an evening the code swims around in front of my eyes and refuses to stay still as my mind drifts hither and thither.

Nevertheless I've been persevering with my "fix, compile, examine" routine and have continued to make some slight progress. Once again, I don't want to go into the full detail of the changes I've had to make, because it really is just reintroducing code that was removed upstream. So there's a lot of it, and I'm not even spending any time attempting to understand it.

But it has now got to the point where all three directories I've been making changes to are compiling when I do the partial builds:
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/
[...]
$ make -j1 -C obj-build-mer-qt-xr/dom/canvas
[...]
$ make -j1 -C obj-build-mer-qt-xr/gfx/layers
[...]
I've been here before of course. Just because the partial builds compile without errors doesn't mean the full build will pass. So the next step for me tonight is to set the full build running so we can come back to it and check its status in the morning.

Before I do that, here are the latest stats showing the changes I've made in relation to the offscreen rendering pipeline:
$ git diff --numstat
73      16      gfx/gl/GLContext.cpp
73      8       gfx/gl/GLContext.h
11      0       gfx/gl/GLContextTypes.h
590     0       gfx/gl/GLScreenBuffer.cpp
171     1       gfx/gl/GLScreenBuffer.h
305     5       gfx/gl/SharedSurface.cpp
132     4       gfx/gl/SharedSurface.h
16      10      gfx/gl/SharedSurfaceEGL.cpp
13      3       gfx/gl/SharedSurfaceEGL.h
81      2       gfx/gl/SharedSurfaceGL.cpp
61      0       gfx/gl/SharedSurfaceGL.h
60      0       gfx/gl/SurfaceTypes.h
21      0       gfx/layers/client/TextureClientSharedSurface.cpp
25      0       gfx/layers/client/TextureClientSharedSurface.h
$ git diff --shortstat
 14 files changed, 1632 insertions(+), 49 deletions(-)
That's not a huge increase on yesterday, but is at least non-zero. This reflects the fact that as things have progressed the fixes have become smaller, but thornier. All of the main changes this time seem to have been to the GLContext code.

Alright, time to set the build off. Hopefully this will complete without errors, but I'm not betting on it.

3 Mar 2024 : Day 174 #
As I walk, half asleep, from bedroom to kitchen for breakfast each morning I have to go past the office space where I work on my laptop. Whenever I leave a build running overnight I can't resist the urge to peek in and see how things have gone as I go past.

This morning I peeked in and saw a barrage of errors, highlighted in red, spanning back into the history of the terminal buffer. That's not quite the morning wake-up I was hoping for!

Nevertheless, as I like to tell myself, it still represents progress. And so today I have the task of fixing these fresh errors. Compile-time errors are generally easier to fix than runtime errors, so I don't mind at all really.

That was earlier today though and as I write this I'm already on the train. I've made the smart move of connecting all my devices to the same Wifi hotspot on my phone, which means no wires today. It's all surprisingly well arranged, although that hasn't prevented me from knocking my phone under the seat in front of me already once this journey.

This calls for another one of thigg's amazing images. This one feels a lot less chaotic than the last, albeit with a stalker fox in the background adding an ominous air to proceedings!
 
A pig with wings wearing a suit sits in a train at a table using a laptop; resting on the table is a phone and in the background a fox enters the carriage wheeling a case

So that's me today. Before I start, let's have some stats from git so we can compare what happens now with what we get at the end of the day.
$ git diff --numstat
38      4       gfx/gl/GLContext.cpp
59      7       gfx/gl/GLContext.h
11      0       gfx/gl/GLContextTypes.h
590     0       gfx/gl/GLScreenBuffer.cpp
171     1       gfx/gl/GLScreenBuffer.h
305     5       gfx/gl/SharedSurface.cpp
130     3       gfx/gl/SharedSurface.h
16      10      gfx/gl/SharedSurfaceEGL.cpp
13      3       gfx/gl/SharedSurfaceEGL.h
8       2       gfx/gl/SharedSurfaceGL.cpp
3       0       gfx/gl/SharedSurfaceGL.h
60      0       gfx/gl/SurfaceTypes.h
21      0       gfx/layers/client/TextureClientSharedSurface.cpp
25      0       gfx/layers/client/TextureClientSharedSurface.h
$ git diff --shortstat
 14 files changed, 1450 insertions(+), 35 deletions(-)
Now on to the bug fixing!

[...]

I've managed to plough through quite a few issues, almost all related to the SharedSurfaceGL and EmbedLiteCompositorBridgeParent classes. Unfortunately I've not managed to get to the point where the partial builds are going through without error. That means there's no point in setting the full build to run overnight. Instead, I'll have to pick up the errors where I left off in the morning.

Just to demonstrate that some progress has been made, here are the new stats for this evening.
$ git diff --numstat
58      4       gfx/gl/GLContext.cpp
70      7       gfx/gl/GLContext.h
11      0       gfx/gl/GLContextTypes.h
590     0       gfx/gl/GLScreenBuffer.cpp
171     1       gfx/gl/GLScreenBuffer.h
305     5       gfx/gl/SharedSurface.cpp
131     4       gfx/gl/SharedSurface.h
16      10      gfx/gl/SharedSurfaceEGL.cpp
13      3       gfx/gl/SharedSurfaceEGL.h
72      2       gfx/gl/SharedSurfaceGL.cpp
60      0       gfx/gl/SharedSurfaceGL.h
60      0       gfx/gl/SurfaceTypes.h
21      0       gfx/layers/client/TextureClientSharedSurface.cpp
25      0       gfx/layers/client/TextureClientSharedSurface.h
$ git diff --shortstat
 14 files changed, 1603 insertions(+), 36 deletions(-)
2 Mar 2024 : Day 173 #
This morning I wake up to find the build has failed while attempting to compile dom/canvas/WebGLContext.cpp. This follows from all of the changes I've been making to try to bring back GLScreenBuffer. Yesterday I used partial builds, performed inside the scratchbox build target, to test my changes. Now that I've tried to build the whole of gecko it's failed with the following error:
66:19.17 ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    In instantiation of ‘typename mozilla::detail::UniqueSelector<T>::
    SingleObject mozilla::MakeUnique(Args&& ...) [with T = mozilla::gl::
    SurfaceFactory_Basic; Args = {mozilla::gl::GLContext&};
    typename mozilla::detail::UniqueSelector<T>::SingleObject = mozilla::
    UniquePtr<mozilla::gl::SurfaceFactory_Basic, mozilla::DefaultDelete
    <mozilla::gl::SurfaceFactory_Basic> >]’:
66:19.17 ${PROJECT}/gecko-dev/dom/canvas/WebGLContext.cpp:929:67:
    required from here
66:19.17 ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:609:23:
    error: no matching function for call to ‘mozilla::gl::SurfaceFactory_Basic::
    SurfaceFactory_Basic(mozilla::gl::GLContext&)’
66:19.17    return UniquePtr<T>(new T(std::forward<Args>(aArgs)...));
66:19.17                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
66:19.17 In file included from ${PROJECT}/gecko-dev/dom/canvas/
    WebGLContext.cpp:52,
66:19.17                  from Unified_cpp_dom_canvas1.cpp:119:
66:19.18 ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.h:31:12: note:
    candidate: ‘mozilla::gl::SurfaceFactory_Basic::SurfaceFactory_Basic(
    mozilla::gl::GLContext*, const mozilla::gl::SurfaceCaps&, const
    mozilla::layers::TextureFlags&)’
66:19.18    explicit SurfaceFactory_Basic(GLContext* gl,
66:19.18             ^~~~~~~~~~~~~~~~~~~~
66:19.18 ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.h:31:12: note:
    candidate expects 3 arguments, 1 provided
This can happen and, in fact, I was almost expecting it. A partial build will only build the files inside a particular subdirectory and its children (depending on how the build scripts are structured). So if there's some other bit of code in a different branch of the directory hierarchy that uses something in this subdirectory, then it won't be compiled against the changes. If the interface changes and a consumer of that interface is neither updated nor compiled against it, then we end up where we are now.

It's not an ideal sign though as it suggests the changes I've made aren't self-contained. Ideally they would be. But let's run with it and focus on fixing the issue.

The good news is that I can trigger the same error back inside the scratchbox target by building the directory containing the file that's causing the error:
$ make -j1 -C obj-build-mer-qt-xr/dom/canvas
[...]
${PROJECT}/gecko-dev/dom/canvas/WebGLContext.cpp:929:67:   required from here
${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:609:23: error:
    no matching function
[...]
This makes it much easier to test my fixes. So with all this in place I can get down to fixing the error. Here's the code causing the problem:
  if (!swapChain->mFactory) {
    NS_WARNING("Failed to make an ideal SurfaceFactory.");
    swapChain->mFactory = MakeUnique<gl::SurfaceFactory_Basic>(*gl);
  }
These lines and all the code around it have changed quite considerably since ESR 78, but I'm still able to find a portion of code in the ESR 78 codebase that looks to be doing something similar:
  if (!factory) {
    // Absolutely must have a factory here, so create a basic one
    factory = MakeUnique<gl::SurfaceFactory_Basic>(gl, gl->Caps(), flags);
    mBackend = layers::LayersBackend::LAYERS_BASIC;
  }
The error is caused by the fact I restored the previously removed SurfaceCaps and TextureFlags parameters to the SurfaceFactory_Basic() constructor. All of these end up being fed into the SurfaceFactory constructor.

Performing a diff on the change we've made to SurfaceFactory_Basic() we can see how things are and how they were for comparison:
$ git diff gfx/gl/SharedSurfaceGL.cpp
[...]
-SurfaceFactory_Basic::SurfaceFactory_Basic(GLContext& gl)
-    : SurfaceFactory({&gl, SharedSurfaceType::Basic,
-                      layers::TextureType::Unknown, true}) {}
+SurfaceFactory_Basic::SurfaceFactory_Basic(GLContext* gl,
+                                           const SurfaceCaps& caps,
+                                           const layers::TextureFlags& flags)
+    : SurfaceFactory({gl, SharedSurfaceType::Basic,
+                  layers::TextureType::Unknown, true}, caps, nullptr, flags) {}
This doesn't really help us though: the caps and flags, which are the ones causing the problem, get passed straight through. The important change is further down. The reason I'm chasing these is that I need to find sensible default values to set them to. There may not be sensible defaults, but if there are then it will allow us to create a new constructor that doesn't require these additional parameters, which is what we need for the code to compile.

So, we need to look inside SharedSurface.cpp and specifically at the SurfaceFactory constructor. I fear this is going to get messy.
$ git diff gfx/gl/SharedSurface.cpp
[...]
-SurfaceFactory::SurfaceFactory(const PartialSharedSurfaceDesc& partialDesc)
-    : mDesc(partialDesc), mMutex("SurfaceFactor::mMutex") {}
[...]
+SurfaceFactory::SurfaceFactory(const PartialSharedSurfaceDesc& partialDesc,
+                             const SurfaceCaps& caps,
+                             const RefPtr<layers::LayersIPCChannel>& allocator,
+                             const layers::TextureFlags& flags)
+    : mDesc(partialDesc),
+      mCaps(caps),
+      mAllocator(allocator),
+      mFlags(flags),
+      mFormats(partialDesc.gl->ChooseGLFormats(caps)),
+      mMutex("SurfaceFactor::mMutex")
+{
+  ChooseBufferBits(caps, &mDrawCaps, &mReadCaps);
+}
The constructor itself is just storing the values and calling ChooseBufferBits() with some of them in order to set the mDrawCaps and mReadCaps member variables. It's unlikely that mDrawCaps or mReadCaps are being used by the WebGLContext code because I added them in as part of these changes (it's possible they're used by a method that's called in WebGLContext, but I don't believe that's the case).

The mCaps, mAllocator and mFlags members only seem to get used directly in SurfaceFactory::NewTexClient(). They're public, so that doesn't mean they don't get used elsewhere, but I suspect the only other places they're used are in new code I've added for GLScreenBuffer. This won't be used by WebGLContext.

Similarly the NewTexClient() method is only called within GLScreenBuffer::Swap() and GLScreenBuffer::Resize(). The WebGLContext code doesn't have to worry or care about these as the WebGLContext code is entirely orthogonal.

In conclusion, it should be safe to set all the new parameters to some default or null values. So I've added in a new version of the SurfaceFactory_Basic constructor like this:
SurfaceFactory_Basic::SurfaceFactory_Basic(GLContext& gl)
    : SurfaceFactory({&gl, SharedSurfaceType::Basic,
            layers::TextureType::Unknown, true}, SurfaceCaps(), nullptr, 0) {}
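To make the shape of this fix a little more concrete, here's a minimal standalone sketch of the same idea. It's not the Gecko code: Caps, Flags, Context and BasicFactory are hypothetical stand-ins for SurfaceCaps, TextureFlags, GLContext and SurfaceFactory_Basic, and std::make_unique stands in for MakeUnique. The point is just that a one-argument call site only compiles if there's a constructor that fills in the remaining values itself.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-ins for SurfaceCaps, TextureFlags and GLContext.
struct Caps {
  bool alpha = false;
  bool depth = false;
};
using Flags = std::uint32_t;
struct Context {};

class BasicFactory {
 public:
  // Full constructor, analogous to the restored three-parameter version.
  BasicFactory(Context* gl, const Caps& caps, Flags flags)
      : mGl(gl), mCaps(caps), mFlags(flags) {}

  // Convenience overload for callers that only have a context. It
  // delegates to the full constructor using default values.
  explicit BasicFactory(Context& gl) : BasicFactory(&gl, Caps(), 0) {}

  Context* mGl;
  Caps mCaps;
  Flags mFlags;
};

// Mirrors the shape of the failing call site: one argument only.
inline std::unique_ptr<BasicFactory> MakeBasicFactory(Context& gl) {
  return std::make_unique<BasicFactory>(gl);
}
```

Without the second constructor the make_unique call fails with the same "candidate expects 3 arguments, 1 provided" style of error seen in the build output.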
I've run partial builds on the original source directory that contains the changes we've been looking at, as well as the directory containing the code that failed during the full build, like this:
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/
[...]
$ make -j1 -C obj-build-mer-qt-xr/dom/canvas
[...]
Both complete successfully without errors. Hooray!

Okay, it's now time to set off a full build again. It's quite possible something else will fail, but running the build is the simplest way to find out. This will, of course, take a while. If it's done by the end of the day there may be time to test it this evening; let's see.

In the meantime, it's worth noting that ultimately the mCaps, mAllocator and mFlags parameters we've been discussing only seem to get used by the two different versions of GLScreenBuffer::CreateFactory(). However I can't find any code that actually calls either of these. So it's quite possible that eventually all of the code related to these parameters will be found to be redundant and can be removed. But we'll come back to that later.

[...]

Sadly this wasn't the last of the errors; the build failed again. The error is rather interesting:
193:24.93 In file included from Unified_cpp_gfx_layers6.cpp:128:
193:24.93 ${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp:
    In static member function ‘static already_AddRefed<mozilla::layers::
    SharedSurfaceTextureClient> mozilla::layers::SharedSurfaceTextureClient::
    Create(mozilla::UniquePtr<mozilla::gl::SharedSurface>,
    mozilla::gl::SurfaceFactory*, mozilla::layers::LayersIPCChannel*,
    mozilla::layers::TextureFlags)’:
193:24.94 ${PROJECT}/gecko-dev/gfx/layers/client/
    TextureClientSharedSurface.cpp:102:63: error:
    ‘class mozilla::gl::SharedSurface’ has no member named ‘GetTextureFlags’
193:24.94    TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags();
193:24.94                                                                ^~~~~~~~~~~~~~~
193:24.94 ${PROJECT}/gecko-dev/gfx/layers/client/
    TextureClientSharedSurface.cpp:104:51: error: no matching function for call
    to ‘mozilla::layers::SharedSurfaceTextureData::SharedSurfaceTextureData(
    std::remove_reference<mozilla::UniquePtr<mozilla::gl::SharedSurface>&>::
    type)’
193:24.94        new SharedSurfaceTextureData(std::move(surf));
193:24.94                                                    ^
193:24.94 ${PROJECT}/gecko-dev/gfx/layers/client/
    TextureClientSharedSurface.cpp:34:1: note: candidate:
    ‘mozilla::layers::SharedSurfaceTextureData::SharedSurfaceTextureData(
    const mozilla::layers::SurfaceDescriptor&, mozilla::gfx::SurfaceFormat,
    mozilla::gfx::IntSize)’
193:24.94  SharedSurfaceTextureData::SharedSurfaceTextureData(
193:24.94  ^~~~~~~~~~~~~~~~~~~~~~~~
193:24.94 ${PROJECT}/gecko-dev/gfx/layers/client/
    TextureClientSharedSurface.cpp:34:1: note:   candidate expects 3 arguments,
    1 provided
It's interesting because this is in a part of the code that I changed in order to get the partial build to complete. But it turns out I wasn't actually building this code at all; the partial build was just using the header.

So I'll need to add the GetTextureFlags() method to SharedSurface.h. It's a simple method because all of the functionality is supposed to come from it being overridden. So I've added this in to SharedSurface.h:
  // Specifies to the TextureClient any flags which
  // are required by the SharedSurface backend.
  virtual layers::TextureFlags GetTextureFlags() const;
In addition to this I've inserted a simple implementation for it to SharedSurface.cpp:
layers::TextureFlags SharedSurface::GetTextureFlags() const {
  return layers::TextureFlags::NO_FLAGS;
}
But to actually match the functionality of ESR 78 I've also added this override, directly into SharedSurfaceEGL.h:
  virtual layers::TextureFlags GetTextureFlags() const override {
    return layers::TextureFlags::DEALLOCATE_CLIENT;
  }
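For anyone wanting to see how these three pieces fit together, here's a small self-contained sketch of the pattern. The names (TexFlags, Surface, EglSurface, CombineFlags) and the flag values are hypothetical and chosen purely for illustration; the real TextureFlags enum is defined elsewhere in Gecko with its own values. The sketch shows a base class default that returns no flags, an override that requests client-side deallocation, and a caller that ORs the result into its own flags in the same shape as the line that failed to compile.

```cpp
#include <cstdint>

// Hypothetical flag values; the real TextureFlags enum differs.
enum class TexFlags : std::uint32_t {
  NO_FLAGS = 0,
  RECYCLE = 1u << 0,
  DEALLOCATE_CLIENT = 1u << 1,
};

constexpr TexFlags operator|(TexFlags a, TexFlags b) {
  return static_cast<TexFlags>(static_cast<std::uint32_t>(a) |
                               static_cast<std::uint32_t>(b));
}

constexpr bool HasFlag(TexFlags value, TexFlags flag) {
  return (static_cast<std::uint32_t>(value) &
          static_cast<std::uint32_t>(flag)) != 0;
}

class Surface {
 public:
  virtual ~Surface() = default;
  // Base class default: no backend-specific flags.
  virtual TexFlags GetTextureFlags() const { return TexFlags::NO_FLAGS; }
};

class EglSurface : public Surface {
 public:
  // The EGL backend asks for client-side deallocation.
  TexFlags GetTextureFlags() const override {
    return TexFlags::DEALLOCATE_CLIENT;
  }
};

// Combines the caller's flags with whatever the surface backend requires.
inline TexFlags CombineFlags(TexFlags aFlags, const Surface& surf) {
  return aFlags | TexFlags::RECYCLE | surf.GetTextureFlags();
}
```

Because the call goes through the virtual method, the caller doesn't need to know which backend it's dealing with.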
They're all super-simple methods and with any luck shouldn't trigger any additional errors. I now have to do a partial build of the three directories that I've now touched with changes:
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/
[...]
$ make -j1 -C obj-build-mer-qt-xr/dom/canvas
[...]
$ make -j1 -C obj-build-mer-qt-xr/gfx/layers
[...]
Happily these all compile successfully and without errors. So it's time to kick off another full build. I don't expect this to be done before I go to bed this evening, so I'll have to pick this up again tomorrow. In the morning we'll find out what new gruesome errors there are to deal with!

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
1 Mar 2024 : Day 172 #
Having failed to fix the offscreen (WebView) rendering pipeline using a scalpel, over the last few days I've resorted to using a sledgehammer. What I'm now doing is essentially reverting the changes made upstream that stripped out all of the GLScreenBuffer goodness that the Sailfish WebView relied on. Unfortunately the changes aren't included in a single neat commit, so I'm having to do this by hand, copying and pasting over the changes from ESR 78 back into ESR 91.

Ever present and willing, git is happy to give me a summary of how many changes I've made so far:
$ git diff --shortstat
 11 files changed, 1012 insertions(+), 22 deletions(-)
So that's over a thousand new lines added to ESR 91. Let's break that down further to find out what's been changed:
$ git diff --numstat
38      4       gfx/gl/GLContext.cpp
59      7       gfx/gl/GLContext.h
11      0       gfx/gl/GLContextTypes.h
592     0       gfx/gl/GLScreenBuffer.cpp
171     1       gfx/gl/GLScreenBuffer.h
26      0       gfx/gl/SharedSurface.cpp
14      0       gfx/gl/SharedSurface.h
15      9       gfx/gl/SharedSurfaceEGL.cpp
4       1       gfx/gl/SharedSurfaceEGL.h
60      0       gfx/gl/SurfaceTypes.h
22      0       gfx/layers/client/TextureClientSharedSurface.h
As we can see, all of the changes are to the rendering code. That's encouraging. And the majority, as expected, are re-introducing code to the GLScreenBuffer class. In fact, not just adding code into the class, but reintroducing the class in its entirety. It's this GLScreenBuffer class that was completely removed and replaced with a class called SwapChain in the upstream transition from ESR 78 to ESR 91.

The changes to the other files are all intended to accommodate the reintroduction of GLScreenBuffer.
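As a quick sanity check on how the two git outputs relate, the per-file numstat figures should sum to the shortstat totals. Here's a minimal sketch that does the arithmetic; DiffTotals and SumNumstat are hypothetical helper names of my own, and the parser assumes the plain "added, removed, path" layout shown above (lines it can't parse as two integers, such as the "-" entries git prints for binary files, are simply skipped).

```cpp
#include <sstream>
#include <string>

struct DiffTotals {
  int files = 0;
  int insertions = 0;
  int deletions = 0;
};

// Sums whitespace-separated "added removed path" lines in the layout
// printed by git diff --numstat.
inline DiffTotals SumNumstat(const std::string& numstat) {
  DiffTotals totals;
  std::istringstream lines(numstat);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    int added = 0;
    int removed = 0;
    std::string path;
    // Skip anything that doesn't start with two integers and a path.
    if (fields >> added >> removed >> path) {
      ++totals.files;
      totals.insertions += added;
      totals.deletions += removed;
    }
  }
  return totals;
}
```

Feeding in the eleven lines of numstat output gives 1012 insertions and 22 deletions, matching the shortstat summary.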

I could live with some big changes being made to a single file, but having to make a large number of changes to many files is really not ideal for the project. It'll contribute to the burden of maintenance and future upgrades. But my suspicion is that not all of the code in GLScreenBuffer is actually used. Rather than try to trim out the fat as I go along, my plan is to introduce everything and get the renderer to a state where it's working, then work on trimming out the unnecessary code afterwards.

Once that's done, I'll then move on to re-architecting the code to try to minimise the changes to the Gecko library itself. It may be that we can move some of the changes into the EmbedLite code, say, in a similar way to what we did with the printing changes. If that can be done, it'll make future maintenance that much easier.

That's the summary of where things are at. Now I'm heading back into the code to perform my bug fix cycle:
  1. Build code.
  2. Examine compile-time errors.
  3. Fix the first one or two errors shown.
  4. Go to step 1.
There's not much to say about all this: I fix, I compile, I examine, I fix, I compile, I examine... Describing each of the steps in any detail here would slow things down too much whilst also being even more dull than usual. So I'm going to just dive in and then summarise at the end.

[...]

After several hours of going around the bug fix cycle I've finally reached the point where the partial build completes without any compiler errors. I've added in a lot of code, not always fully understanding the purpose, but nevertheless with the intention of matching the structure and purpose of the GLScreenBuffer code from ESR 78.

Here are the final stats — for comparison — after these changes.
$ git diff --numstat
38      4       gfx/gl/GLContext.cpp
59      7       gfx/gl/GLContext.h
11      0       gfx/gl/GLContextTypes.h
590     0       gfx/gl/GLScreenBuffer.cpp
171     1       gfx/gl/GLScreenBuffer.h
301     5       gfx/gl/SharedSurface.cpp
126     3       gfx/gl/SharedSurface.h
16      10      gfx/gl/SharedSurfaceEGL.cpp
9       3       gfx/gl/SharedSurfaceEGL.h
6       4       gfx/gl/SharedSurfaceGL.cpp
3       1       gfx/gl/SharedSurfaceGL.h
60      0       gfx/gl/SurfaceTypes.h
12      0       gfx/layers/client/TextureClientSharedSurface.cpp
22      0       gfx/layers/client/TextureClientSharedSurface.h
$ git diff --shortstat
 14 files changed, 1424 insertions(+), 38 deletions(-)
Most of the changes I've made this evening seem to have been to SharedSurface.cpp. That makes sense as it seems there were many GL/EGL related calls that had been removed.

So far I've only been running quick partial builds. To find out if things are really working correctly I'll have to do a full rebuild, which will need to run overnight. So this seems like a good time to stop for the day, ready to start again in the morning.

29 Feb 2024 : Day 171 #
After some considerable thought about the various options and after trying to fix the rendering pipeline with minimal changes, I've decided to change tack today. There's so much code that's been removed from the GLScreenBuffer.h and GLScreenBuffer.cpp files that I don't see any way to resurrect the functionality without moving large parts of the removed code back in again.

Now, ideally it would be possible to add this to the EmbedLite code, rather than the gecko code. But as a first step I'm going to try to add it back in just as it was before. Following that I can then look at re-architecting it to minimise the changes needed to the gecko code itself. It would be a shame to end up with a patch that essentially just reverts a whole load of changes from upstream, but if that's what it takes to get a working offscreen renderer, then maybe that's what we'll have to live with.

Over the last few days I've already made a few changes to the code, but ironically they've only so far been to the EmbedLite portion of the code. But they're all also aimed at getting the SwapChain object working correctly. If I'm now going to reverse the upstream changes to this particular pipeline, then the SwapChain will be lost (it might get restored later; let's see). So I don't need the changes I made any more.
$ git diff
diff --git a/embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
           b/embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
index 34cff71f6e07..82cdf357f926 100644
--- a/embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
+++ b/embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
@@ -109,6 +109,7 @@ EmbedLiteCompositorBridgeParent::PrepareOffscreen()
 
   GLContext* context = static_cast<CompositorOGL*>(state->mLayerManager->GetCompositor())->gl();
   NS_ENSURE_TRUE(context, );
+  bool initSwapChain = false;
 
   // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
   // See: https://phabricator.services.mozilla.com/D75055
@@ -125,6 +126,7 @@ EmbedLiteCompositorBridgeParent::PrepareOffscreen()
 
     SwapChain* swapChain = context->GetSwapChain();
     if (swapChain == nullptr) {
+      initSwapChain = true;
       swapChain = new SwapChain();
       new SwapChainPresenter(*swapChain);
       context->mSwapChain.reset(swapChain);
@@ -133,6 +135,13 @@ EmbedLiteCompositorBridgeParent::PrepareOffscreen()
     if (factory) {
       swapChain->Morph(std::move(factory));
     }
+
+    if (initSwapChain) {
+      bool success = context->ResizeScreenBuffer(mEGLSurfaceSize);
+      if (!success) {
+          NS_WARNING("Failed to create SwapChain back buffer");
+      }
+    }
   }
 }
 
$ git status
On branch sailfishos-esr91
Your branch is up-to-date with 'origin/sailfishos-esr91'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
        modified:   embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
        modified:   gecko-dev (untracked content)

no changes added to commit (use "git add" and/or "git commit -a")
$ git checkout embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
Updated 1 path from the index
As you can see, the changes weren't very major anyway. So I've started reconstructing the GLScreenBuffer code. It's actually quite extensive and there are a lot of edge cases related to the EGL code. It's going to take quite a few rounds of changes, failed compilations, following up on missing or changed code and then recompilations. Each of these takes quite a while, so I'm bracing myself for quite a long haul here.

I've made some changes, I'm going to set it to compile and see what errors come out. It's also time for my work day, so I'll return to this — and all of the errors that come out of it — later on this evening.

[...]

I'm back to looking at this again and it's time to consider the errors that came out of the most recent partial build. They look a bit like this.
64:31.86 ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:98:60: error:
    ‘SurfaceCaps’ does not name a type; did you mean ‘SurfaceFactory’?
64:31.86    static UniquePtr<ReadBuffer> Create(GLContext* gl, const SurfaceCaps& caps,
64:31.86                                                             ^~~~~~~~~~~
64:31.86                                                             SurfaceFactory
64:31.88 ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:99:45:
    error: ‘GLFormats’ does not name a type; did you mean ‘eFormData’?
64:31.88                                        const GLFormats& formats,
64:31.88                                              ^~~~~~~~~
64:31.88                                              eFormData
[...]
64:32.20 ${PROJECT}/gecko-dev/gfx/gl/GLContext.h:3537:54:
    error: ‘SurfaceCaps’ does not name a type; did you mean ‘SurfaceFormat’?
64:32.20    bool InitOffscreen(const gfx::IntSize& size, const SurfaceCaps& caps);
64:32.20                                                       ^~~~~~~~~~~
64:32.20                                                       SurfaceFormat
64:32.23 ${PROJECT}/gecko-dev/gfx/gl/GLContext.h:3546:59:
    error: ‘SurfaceCaps’ does not name a type; did you mean ‘SurfaceFormat’?
64:32.24    bool CreateScreenBuffer(const gfx::IntSize& size, const SurfaceCaps& caps) {
64:32.24                                                            ^~~~~~~~~~~
64:32.24                                                            SurfaceFormat
[...]
There are, as you can see, many, many, many errors. For the rest of this evening my recipe will be this:
  1. Build code.
  2. Examine compile-time errors.
  3. Fix the first one or two errors shown.
  4. Go to step 1.
This would take an inordinate amount of time with a standard build, but thankfully I can do partial builds just of the gfx/gl code.
$ sfdk engine exec
$ sb2 -t SailfishOS-devel-aarch64.default
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/
Initially these partial builds were taking less than a second, with errors being hit almost immediately. Now after a couple of hours of fixing compile-time errors the builds are taking longer, maybe nearing ten seconds or so. That's how I'm judging success right now.

My guess is that it'll be a few days at least before I've got all of these errors resolved. I'll continue charting my progress with these diary entries of course, but they may be a little shorter than usual, since the last thing anyone wants to read about is this iterative build-check-fix churn.

At least it's quite fulfilling for me, gradually watching the errors seep away. It's mundane but fulfilling work, just a little laborious. Let's see how far I've got by the end of tomorrow.

28 Feb 2024 : Day 170 #
Today I woke up to discover a bad result. The build I started yesterday stalled about half way through. This does happen very occasionally, but honestly since I dropped down to just using a single process, it's barely happened at all. So that's more than a little annoying. Nevertheless I've woken up early today and it does at least mean that my first task of the day is an easy one: kick off the build once again.

So here goes... Once it's done, I'll give the changes I made yesterday a go to see whether they've fixed the segfault.

[...]

Finally the build completed, second time lucky it seems. So now the SwapChain, SurfaceFactory and SharedSurface back buffer should all be created in that order. And this should also be the correct order. Let's find out.

Now there's still a crash, but it does at least get further than last time:
$ harbour-webview 
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[D] main:30 - WebView Example
[D] main:44 - Using default start URL:  "https://www.flypig.co.uk/search/"
[D] main:47 - Opening webview
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
[...]
Created LOG for EmbedLiteLayerManager
=============== Preparing offscreen rendering context ===============
CONSOLE message:
OpenGL compositor Initialized Succesfully.
Version: OpenGL ES 3.2 V@0502.0 (GIT@704ecd9a2b, Ib3f3e69395, 1609240670)
    (Date:12/29/20)
Vendor: Qualcomm
Renderer: Adreno (TM) 619
FBO Texture Target: TEXTURE_2D
JSScript: ContextMenuHandler.js loaded
JSScript: SelectionPrototype.js loaded
JSScript: SelectionHandler.js loaded
JSScript: SelectAsyncHelper.js loaded
JSScript: FormAssistant.js loaded
JSScript: InputMethodHandler.js loaded
EmbedHelper init called
Available locales: en-US, fi, ru
Frame script: embedhelper.js loaded
Segmentation fault
That's without the debugger. To find out where precisely it's crashing we can execute it again, but this time with the debugger attached:
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) r
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 13568]
0x0000007ff110a378 in mozilla::gl::SwapChain::Size
    (this=this@entry=0x7ed81ce090)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) bt
#0  0x0000007ff110a378 in mozilla::gl::SwapChain::Size
    (this=this@entry=0x7ed81ce090)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#1  0x0000007ff3667cc8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    PresentOffscreenSurface (this=0x7fc4b41c20)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:199
#2  0x0000007ff3680fe0 in mozilla::embedlite::nsWindow::PostRender
    (this=0x7fc4c331e0, aContext=<optimized out>)
    at mobile/sailfishos/embedshared/nsWindow.cpp:248
#3  0x0000007ff2a664fc in mozilla::widget::InProcessCompositorWidget::PostRender
    (this=0x7fc4658990, aContext=0x7f17ae4848)
    at widget/InProcessCompositorWidget.cpp:60
#4  0x0000007ff1291074 in mozilla::layers::LayerManagerComposite::Render
    (this=this@entry=0x7ed81afa80, aInvalidRegion=..., aOpaqueRegion=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    Compositor.h:575
#5  0x0000007ff12914f0 in mozilla::layers::LayerManagerComposite::
    UpdateAndRender (this=this@entry=0x7ed81afa80)
    at gfx/layers/composite/LayerManagerComposite.cpp:657
#6  0x0000007ff12918a0 in mozilla::layers::LayerManagerComposite::
    EndTransaction (this=this@entry=0x7ed81afa80, aTimeStamp=..., 
    aFlags=aFlags@entry=mozilla::layers::LayerManager::END_DEFAULT)
    at gfx/layers/composite/LayerManagerComposite.cpp:572
#7  0x0000007ff12d303c in mozilla::layers::CompositorBridgeParent::
    CompositeToTarget (this=0x7fc4b41c20, aId=..., aTarget=0x0,
    aRect=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#8  0x0000007ff12b8784 in mozilla::layers::CompositorVsyncScheduler::Composite
    (this=0x7fc4d01e30, aVsyncEvent=...)
    at gfx/layers/ipc/CompositorVsyncScheduler.cpp:256
#9  0x0000007ff12b0bfc in mozilla::detail::RunnableMethodArguments
    <mozilla::VsyncEvent>::applyImpl<mozilla::layers::CompositorVsyncScheduler,
    void (mozilla::layers::CompositorVsyncScheduler::*)(mozilla::VsyncEvent
    const&), StoreCopyPassByConstLRef<mozilla::VsyncEvent>, 0ul> (args=...,
    m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:887
[...]
#21 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) p mFrontBuffer
$1 = std::shared_ptr<mozilla::gl::SharedSurface> (empty) = {get() = 0x0}
(gdb) 
Looking at the above, it seems that the back buffer isn't causing a crash any more. The problem now seems to be the front buffer. That's okay: that's progress! There are only two situations in which the front buffer gets set. First it happens if the SwapChainPresenter destructor is called. In this case the back buffer held by the presenter is moved into the front buffer, then the presenter's back buffer is set to null. Second it happens when the SwapChain::Swap() method is called. In this case the back buffer held by the presenter and the front buffer held by the swap chain are switched. In some sense, the Swap() method isn't really going to help us because if the front buffer is null beforehand, afterwards the back buffer will be null, which is also no good.
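To help keep the two cases straight, here's a toy model of the buffer hand-over as I've described it. It's a sketch only: Surface, Chain and Presenter are hypothetical simplifications rather than the real SwapChain and SwapChainPresenter classes, and the FrontWidth() guard is just one possible way to avoid dereferencing an empty front buffer, not something the Gecko code actually does.

```cpp
#include <memory>
#include <utility>

struct Surface {
  int width = 0;
  int height = 0;
};

class Presenter;

class Chain {
 public:
  std::shared_ptr<Surface> mFrontBuffer;  // Starts out empty.
  Presenter* mPresenter = nullptr;

  // Exchanges the presenter's back buffer with the front buffer.
  void Swap();

  // Null-safe query: reports zero if nothing has reached the front yet.
  int FrontWidth() const { return mFrontBuffer ? mFrontBuffer->width : 0; }
};

class Presenter {
 public:
  explicit Presenter(Chain& chain) : mChain(&chain) { chain.mPresenter = this; }

  // On destruction the back buffer is moved into the chain's front
  // buffer and the presenter's own pointer is cleared.
  ~Presenter() {
    mChain->mFrontBuffer = std::move(mBackBuffer);
    mBackBuffer = nullptr;
    mChain->mPresenter = nullptr;
  }

  std::shared_ptr<Surface> mBackBuffer;

 private:
  Chain* mChain;
};

inline void Chain::Swap() {
  if (mPresenter) {
    std::swap(mFrontBuffer, mPresenter->mBackBuffer);
  }
}
```

In this model, calling Swap() while the front buffer is empty simply moves the emptiness to the back buffer, which is the "also no good" situation mentioned above.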

Checking the ESR 78 code, there is no mFrontBuffer variable, but there is an mFront which appears to be doing essentially the same thing. The mFront is only ever used to switch the back buffer into it, or to be accessed by EmbedLiteCompositorBridgeParent::GetPlatformImage(). In the latter case it's used, but not set.

So the arrangement isn't so dissimilar. Perhaps the main difference is that in ESR 78 there's no call to get the size of the front buffer as there is in ESR 91. Just as a reminder again: it's this size request that's causing the crash.

In ESR 78 the Swap() method is called from PublishFrame(), which is called from EmbedLiteCompositorBridgeParent::PresentOffscreenSurface(). It would be good to try to find out whether there's anything tying these together, to understand the sequencing, but the code is too convoluted for me to figure that out by hand.

So, instead, I'm going to look at the call to SwapChain::Size(). This is a call I added myself as part of my changes on top of ESR 91 and which doesn't have an immediately obvious equivalent call in ESR 78, so there must have been some reason why I added it.

Looking at the code in ESR 78 I can see that this is the reason I added this call:
  GLScreenBuffer* screen = context->Screen();
  MOZ_ASSERT(screen);

  if (screen->Size().IsEmpty() || !screen->PublishFrame(screen->Size())) {
    NS_ERROR("Failed to publish context frame");
  }
Compare that to the attempt I made to replicate the functionality in ESR 91:
  // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
  // See: https://phabricator.services.mozilla.com/D75055
  SwapChain* swapChain = context->GetSwapChain();
  MOZ_ASSERT(swapChain);

  const gfx::IntSize& size = swapChain->Size();
  if (size.IsEmpty() || !swapChain->PublishFrame(size)) {
    NS_ERROR("Failed to publish context frame");
  }
The obvious question is, what is context->Screen() returning in ESR 78 and where is it created? Unfortunately the answer is complex. It's returning the following member of GLContext:
  UniquePtr<GLScreenBuffer> mScreen;
[...]
  GLScreenBuffer* Screen() const { return mScreen.get(); }
This gets created from a call to GLContext::InitOffscreen(), like this:
Delete all breakpoints? (y or n) y
(gdb) b CreateScreenBufferImpl
Breakpoint 7 at 0x7fb8e837d8: file gfx/gl/GLContext.cpp, line 2120.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" hit Breakpoint 7, mozilla::gl::GLContext::
    CreateScreenBufferImpl (this=this@entry=0x7eac109140, size=..., caps=...)
    at gfx/gl/GLContext.cpp:2120
2120                                           const SurfaceCaps& caps) {
(gdb) bt
#0  mozilla::gl::GLContext::CreateScreenBufferImpl
    (this=this@entry=0x7eac109140, size=..., caps=...)
    at gfx/gl/GLContext.cpp:2120
#1  0x0000007fb8e838ec in mozilla::gl::GLContext::CreateScreenBuffer
    (caps=..., size=..., this=0x7eac109140)
    at gfx/gl/GLContext.h:3517
#2  mozilla::gl::GLContext::InitOffscreen (this=0x7eac109140, size=...,
    caps=...)
    at gfx/gl/GLContext.cpp:2578
#3  0x0000007fb8e83ac8 in mozilla::gl::GLContextProviderEGL::CreateOffscreen
    (size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags::
    REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7fa50ed378)
    at gfx/gl/GLContextProviderEGL.cpp:1443
#4  0x0000007fb8ee475c in mozilla::layers::CompositorOGL::CreateContext
    (this=0x7eac003420)
    at gfx/layers/opengl/CompositorOGL.cpp:250
#5  mozilla::layers::CompositorOGL::CreateContext (this=0x7eac003420)
    at gfx/layers/opengl/CompositorOGL.cpp:223
#6  0x0000007fb8f053bc in mozilla::layers::CompositorOGL::Initialize
    (this=0x7eac003420, out_failureReason=0x7fa50ed730)
    at gfx/layers/opengl/CompositorOGL.cpp:374
#7  0x0000007fb8fdcf7c in mozilla::layers::CompositorBridgeParent::NewCompositor
    (this=this@entry=0x7f8c99d3f0, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1534
#8  0x0000007fb8fe65e8 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7f8c99d3f0, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1491
#9  0x0000007fb8fe6730 in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7f8c99d3f0,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1587
#10 0x0000007fbb2e31b4 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7f8c99d3f0, aBackendHints=..., 
    aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:77
#11 0x0000007fb88c13d0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7f8c99d3f0, msg__=...)
    at PCompositorBridgeParent.cpp:1391
[...]
#27 0x0000007fbe70d89c in ?? () from /lib64/libc.so.6
(gdb) 
Recall that the call to CreateOffscreen() at frame 3 is now a call to CreateHeadless(). And it looks like that's where things really start to diverge.

After thinking long and hard about this I don't think it's going to be possible to fit everything that's needed into the current SwapChain structure. So tomorrow I'm going to start putting back in all of the pieces from ESR 78 that were ripped out of ESR 91. This should be a much more tractable exercise than trying to reconstruct the functionality from scratch. Once I've got a working renderer I can then take the diff and try to fit as much of what's needed as possible into the swap chain structure.

But I'm not going to be able to do that today as it's time for me to head to bed. I'll pick this up in the morning.

27 Feb 2024 : Day 169 #
I've been trying to get the swap chain to initialise correctly over the last few days. This is part of the code that I made large changes to early on in this process, before the build would fully compile. I'm now having to simultaneously unravel the changes I made, while at the same time finally figuring out what they're supposed to be doing. It's quite a relief to finally get the chance to fix the mistakes I made in the past.

But the task right now is a little more prosaic. I'm just trying to get the thing to run without crashing. Getting the actual rendering working will be stage two of this process.

So I'm still trying to get the back buffer to be initialised before it's accessed. Sounds simple, but the code is a bit of a web. We have a call to Resize() which is crashing and a call to PrepareOffscreen() which creates the swap chain. We need to create the swap chain and initialise the back buffer before the Resize() happens.

If we follow the backtraces back, the ordering problem seems to end up here:
PLayerTransactionParent*
EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent
    (const nsTArray<LayersBackend>& aBackendHints, const LayersId& aId)
{
  PLayerTransactionParent* p =
    CompositorBridgeParent::AllocPLayerTransactionParent(aBackendHints, aId);

  EmbedLiteWindowParent *parentWindow = EmbedLiteWindowParent::From(mWindowId);
  if (parentWindow) {
    parentWindow->GetListener()->CompositorCreated();
  }

  if (!StaticPrefs::embedlite_compositor_external_gl_context()) {
    // Prepare Offscreen rendering context
    PrepareOffscreen();
  }
  return p;
}
That's because the call stack for CreateContext(), which is where the SwapChain::Resize() gets called, includes CompositorBridgeParent::AllocPLayerTransactionParent(), whereas the SwapChain object is created in PrepareOffscreen(). As we can see, these happen in the wrong order.

One thing that seems worth trying is configuring the back buffer immediately after creating the swap chain. So I've given it a go by adding a call to ResizeScreenBuffer() directly after the SwapChain constructor is called, like this:
    SwapChain* swapChain = context->GetSwapChain();
    if (swapChain == nullptr) {
      swapChain = new SwapChain();
      new SwapChainPresenter(*swapChain);
      context->mSwapChain.reset(swapChain);

      bool success = context->ResizeScreenBuffer(mEGLSurfaceSize);
      if (!success) {
          NS_WARNING("Failed to create SwapChain back buffer");
      }
    }
When I execute the updated code, this call to resize the screen buffer now triggers a crash.
$ gdb harbour-webview
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
=============== Preparing offscreen rendering context ===============

Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 16991]
mozilla::gl::SwapChain::Resize (this=0x7ed81ce090, size=...)
    at gfx/gl/GLScreenBuffer.cpp:134
134           mFactory->CreateShared(size);
(gdb) bt
#0  mozilla::gl::SwapChain::Resize (this=0x7ed81ce090, size=...)
    at gfx/gl/GLScreenBuffer.cpp:134
#1  0x0000007ff110dc14 in mozilla::gl::GLContext::ResizeScreenBuffer
    (this=this@entry=0x7ed819ee40, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff366824c in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    PrepareOffscreen (this=this@entry=0x7fc4bef570)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:132
#3  0x0000007ff36682b8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4bef570, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:90
#4  0x0000007ff0c65ad0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7fc4bef570, msg__=...)
    at PCompositorBridgeParent.cpp:1285
[...]
#19 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) p size.width
$2 = 1080
(gdb) p size.height
$3 = 2520
(gdb) p mFactory.mTuple.mFirstA
$5 = (mozilla::gl::SurfaceFactory *) 0x0
(gdb) 
As we can see from the above value of mFactory.mTuple.mFirstA and the code below, the reason for the crash is that the SurfaceFactory needed to generate the surface hasn't yet been initialised. As before, it's all about the sequencing.
bool SwapChain::Resize(const gfx::IntSize& size) {
  UniquePtr<SharedSurface> newBack =
      mFactory->CreateShared(size);
  if (!newBack) return false;

  if (mPresenter->mBackBuffer) mPresenter->mBackBuffer->ProducerRelease();

  mPresenter->mBackBuffer.reset(newBack.release());

  mPresenter->mBackBuffer->ProducerAcquire();

  return true;
}
It turns out the factory is created before the swap chain, but isn't passed to the swap chain, via the Morph() call, until after the point where I added the resize:
  if (context->IsOffscreen()) {
    UniquePtr<SurfaceFactory> factory;
    if (context->GetContextType() == GLContextType::EGL) {
      // [Basic/OGL Layers, OMTC] WebGL layer init.
      factory = SurfaceFactory_EGLImage::Create(*context);
    } else {
      // [Basic Layers, OMTC] WebGL layer init.
      // Well, this *should* work...
      factory = MakeUnique<SurfaceFactory_Basic>(*context);
    }

    SwapChain* swapChain = context->GetSwapChain();
    if (swapChain == nullptr) {
      swapChain = new SwapChain();
      new SwapChainPresenter(*swapChain);
      context->mSwapChain.reset(swapChain);
    }

    if (factory) {
      swapChain->Morph(std::move(factory));
    }
  }
So I've rejigged things. Crucially, although the factory should be reset regardless of whether we're creating a new swap chain, we only want the resize to happen when the swap chain is new. I've therefore added a new initSwapChain Boolean to capture whether this is a new swap chain or not. If it is, we can then perform the resize after the factory code has executed.
  bool initSwapChain = false;
  // TODO: The switch from GLScreenBuffer to SwapChain needs completing
  // See: https://phabricator.services.mozilla.com/D75055
  if (context->IsOffscreen()) {
[...]
    SwapChain* swapChain = context->GetSwapChain();
    if (swapChain == nullptr) {
      initSwapChain = true;
      swapChain = new SwapChain();
      new SwapChainPresenter(*swapChain);
      context->mSwapChain.reset(swapChain);
    }

    if (factory) {
      swapChain->Morph(std::move(factory));
    }

    if (initSwapChain) {
      bool success = context->ResizeScreenBuffer(mEGLSurfaceSize);
      if (!success) {
          NS_WARNING("Failed to create SwapChain back buffer");
      }
    }
  }
This seems worth a try, so I've set the build off running again and we'll see how it pans out when it's done.
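The sequencing constraint can be shown in isolation. The following is a toy model rather than the EmbedLite code: ToyFactory, ToySwapChain, ToyContext and this PrepareOffscreen() are hypothetical stand-ins, Morph() is assumed to do nothing more than store the factory, and the toy Resize() checks for a missing factory where the real one doesn't. It's only meant to show that the factory is attached on every call, while the resize happens only when the swap chain is newly created.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-ins; none of these are the real Gecko classes.
struct ToyFactory {};

struct ToySwapChain {
  std::unique_ptr<ToyFactory> mFactory;
  int mResizeCount = 0;

  // Attach (or replace) the factory used to create surfaces.
  void Morph(std::unique_ptr<ToyFactory> factory) {
    mFactory = std::move(factory);
  }

  // Unlike the real Resize(), fail cleanly if no factory has been set.
  bool Resize() {
    if (!mFactory) return false;
    ++mResizeCount;
    return true;
  }
};

struct ToyContext {
  std::unique_ptr<ToySwapChain> mSwapChain;
};

// Returns false only if a resize was attempted and failed.
bool PrepareOffscreen(ToyContext& context) {
  bool initSwapChain = false;
  if (!context.mSwapChain) {
    initSwapChain = true;
    context.mSwapChain = std::make_unique<ToySwapChain>();
  }

  // The factory is reset whether or not the swap chain is new...
  context.mSwapChain->Morph(std::make_unique<ToyFactory>());

  // ...but the back buffer is only sized for a new swap chain, and only
  // once the factory is in place.
  if (initSwapChain) {
    return context.mSwapChain->Resize();
  }
  return true;
}
```

Calling PrepareOffscreen() twice on the same ToyContext performs a single resize, and calling Resize() on a ToySwapChain that has never been given a factory returns false instead of crashing.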

As always the build is taking a very long time, so we'll have to wait until the morning to find out how this has gone.

26 Feb 2024 : Day 168 #
Overnight the build I started yesterday successfully finished. That, in itself, is a bit of a surprise (no stupid syntax errors in my code!). This morning I've copied over the packages and installed them, and now I'm on the train ready to debug.

I optimistically run the app without the debugger. The window appears again. There's no rendering, just a white screen, but there's also no immediate crash and no obvious errors in the debug output.

After running for around twenty seconds or so, the app then crashes.
$ time harbour-webview 
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[D] main:30 - WebView Example
[D] main:44 - Using default start URL:  "https://www.flypig.co.uk/search/"
[D] main:47 - Opening webview
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
[...]
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12

Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
Command terminated by signal 11
real    0m 20.82s
user    0m 0.87s
sys     0m 0.23s
This is quite unexpected behaviour if I'm honest. Something is causing it to crash after a prolonged period ("prolonged" meaning from the perspective of computation, rather than from the perspective of the user).

That was without the debugger; I'd better try it with the debugger to find out why it's crashing.
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) r
Starting program: /usr/bin/harbour-webview 
[...]
Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 18684]
mozilla::gl::SwapChain::Resize (this=0x0, size=...)
    at gfx/gl/GLScreenBuffer.cpp:134
134           mFactory->CreateShared(size);
(gdb) bt
#0  mozilla::gl::SwapChain::Resize (this=0x0, size=...)
    at gfx/gl/GLScreenBuffer.cpp:134
#1  0x0000007ff110dc14 in mozilla::gl::GLContext::ResizeScreenBuffer
    (this=this@entry=0x7edc19ee40, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff119b8d4 in mozilla::layers::CompositorOGL::CreateContext
    (this=this@entry=0x7edc002f10)
    at gfx/layers/opengl/CompositorOGL.cpp:264
#3  0x0000007ff11b0ea8 in mozilla::layers::CompositorOGL::Initialize
    (this=0x7edc002f10, out_failureReason=0x7f17aac520)
    at gfx/layers/opengl/CompositorOGL.cpp:394
#4  0x0000007ff12c68e8 in mozilla::layers::CompositorBridgeParent::NewCompositor
    (this=this@entry=0x7fc4b7b450, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#5  0x0000007ff12d1964 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7fc4b7b450, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#6  0x0000007ff12d1a94 in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7fc4b7b450,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#7  0x0000007ff36682b8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4b7b450, aBackendHints=..., 
    aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#8  0x0000007ff0c65ad0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7fc4b7b450, msg__=...)
    at PCompositorBridgeParent.cpp:1285
#9  0x0000007ff0ca9fe4 in mozilla::layers::PCompositorManagerParent::
    OnMessageReceived (this=<optimized out>, msg__=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/
    ProtocolUtils.h:675
#10 0x0000007ff0bc985c in mozilla::ipc::MessageChannel::DispatchAsyncMessage
    (this=this@entry=0x7fc4d82fb8, aProxy=aProxy@entry=0x7edc002aa0, aMsg=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/
    ProtocolUtils.h:675
[...]
#23 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
As before, it runs for around twenty seconds, then crashes. The line that's causing the crash is this one:
bool SwapChain::Resize(const gfx::IntSize& size) {
  UniquePtr<SharedSurface> newBack =
      mFactory->CreateShared(size);
[...]
}
And the reason isn't that mFactory is null; it's that this (meaning the SwapChain instance) is null. But when I try to use the debugger to access the memory and show that it's null, I start getting strange errors:
(gdb) p mFactory
Cannot access memory at address 0x8
(gdb) frame 1
#1  0x0000007ff110dc14 in mozilla::gl::GLContext::ResizeScreenBuffer
    (this=this@entry=0x7edc19ee40, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) p mSwapChain
Cannot access memory at address 0x7edc19f838
(gdb) p this
$1 = (mozilla::gl::GLContext * const) 0x7edc19ee40
(gdb) frame 2
#2  0x0000007ff119b8d4 in mozilla::layers::CompositorOGL::CreateContext
    (this=this@entry=0x7edc002f10)
    at gfx/layers/opengl/CompositorOGL.cpp:264
264       bool success = context->ResizeScreenBuffer(mSurfaceSize);
(gdb) p context
$2 = {mRawPtr = 0x7edc19ee40}
(gdb) p context->mRawPtr
Attempt to take address of value not located in memory.
(gdb) p context->mRawPtr->mSwapChain
Attempt to take address of value not located in memory.
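The address 0x8 in the first of those errors is itself a hint. With this equal to null, the debugger resolves mFactory as a member at a fixed offset from address zero, so the address it fails to read is simply that member's offset. Here's a minimal sketch of the arithmetic, assuming a hypothetical layout in which one pointer-sized member precedes mFactory. The real SwapChain layout isn't shown in this post, so ToySwapChain, its fields and FactoryAddress() are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in: one pointer-sized member followed by the factory
// pointer. This is NOT the real mozilla::gl::SwapChain layout.
struct ToySwapChain {
  void* mPresenter;
  void* mFactory;
};

// The address a debugger would try to read for `base->mFactory`, computed
// from the base address alone, without dereferencing anything.
std::uintptr_t FactoryAddress(std::uintptr_t base) {
  return base + offsetof(ToySwapChain, mFactory);
}
```

On a target where pointers are eight bytes wide, FactoryAddress(0) comes out as 8, which is consistent with the "Cannot access memory at address 0x8" message for a null this.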
I wonder if these errors are being caused by a memory leak that quickly gets out of hand? Placing a breakpoint on GLContext::ResizeScreenBuffer() shows that it's not due to repeated calls to this method: it gets called only once, at which point there's an immediate segfault.
(gdb) b GLContext::ResizeScreenBuffer
Breakpoint 1 at 0x7ff110dbdc: file gfx/gl/GLContext.cpp, line 1885.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview
[...]
Thread 37 "Compositor" hit Breakpoint 1, mozilla::gl::GLContext::
    ResizeScreenBuffer (this=this@entry=0x7ed419ee40, size=...)
    at gfx/gl/GLContext.cpp:1885
1885    bool GLContext::ResizeScreenBuffer(const gfx::IntSize& size) {
(gdb) c
Continuing.

Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
mozilla::gl::SwapChain::Resize (this=0x0, size=...)
    at gfx/gl/GLScreenBuffer.cpp:134
134           mFactory->CreateShared(size);
(gdb)                            
I'm curious to know what's happening after twenty seconds that would cause this. Looking more carefully at the backtrace for the crash above, it's strange that an attempt is being made to create the compositor. Shouldn't that have already been created? I wonder if this delay is related to network connectivity.

As usual I'm attempting this debugging on the train. But my development phone has no Internet connectivity here. So perhaps it's waiting for a connection before creating the compositor? Maybe the connection fails after twenty seconds at which point the compositor is created and the library segfaults.

This seems plausible, even if it doesn't quite explain the peculiar nature of the debugging that followed, where I couldn't access any of the variables.

Let's assume this is the case, back up a bit, and try to capture some state before the crash happens. If the crash is causing memory corruption, that might explain the lack of accessible variables. And if that's the case, then catching execution before the memory gets messed up should allow us to get a clearer picture.
(gdb) b CompositorOGL::CreateContext
Breakpoint 2 at 0x7ff119b764: file gfx/layers/opengl/CompositorOGL.cpp,
    line 227.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
We're coming in to London now, so time to pause and rapidly pack up my stuff before we pull in to the station!

[...]

You'll be pleased to hear I made it off the train safely and with all my belongings. It was touch-and-go for a few seconds there though. I'm now travelling in the opposite direction on (I hope) the adjacent tracks. Time to return to that debugging.

I'm happy to discover, despite having literally pulled the plug on my phone mid-debug, that on reattaching the cable and restoring my GNU Screen session, the debugger is still in exactly the same state that I left it. Linux is great!

And now we have a bit more luck again from the captured backtrace:
Thread 37 "Compositor" hit Breakpoint 2, mozilla::layers::CompositorOGL::
    CreateContext (this=this@entry=0x7edc002ed0)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
(gdb) p context
$3 = <optimized out>
(gdb) p mSwapChain
No symbol "mSwapChain" in current context.
(gdb) p context
$4 = <optimized out>
(gdb) bt
#0  mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7edc002ed0)
    at gfx/layers/opengl/CompositorOGL.cpp:227
#1  0x0000007ff11b0ea8 in mozilla::layers::CompositorOGL::Initialize
    (this=0x7edc002ed0, out_failureReason=0x7f17a6b520)
    at gfx/layers/opengl/CompositorOGL.cpp:394
#2  0x0000007ff12c68e8 in mozilla::layers::CompositorBridgeParent::NewCompositor
    (this=this@entry=0x7fc4beb0e0, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#3  0x0000007ff12d1964 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7fc4beb0e0, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#4  0x0000007ff12d1a94 in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7fc4beb0e0,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#5  0x0000007ff36682b8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4beb0e0, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#6  0x0000007ff0c65ad0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7fc4beb0e0, msg__=...)
    at PCompositorBridgeParent.cpp:1285
[...]
#21 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) n
[New LWP 32378]
231       nsIWidget* widget = mWidget->RealWidget();
(gdb) n
[New LWP 32389]
[LWP 7850 exited]
232       void* widgetOpenGLContext =
(gdb) n
[New LWP 32476]
[LWP 32389 exited]
234       if (widgetOpenGLContext) {
(gdb) n
248       if (!context && gfxEnv::LayersPreferOffscreen()) {
(gdb) n
249         nsCString discardFailureId;
(gdb) n
250         context = GLContextProvider::CreateHeadless(
(gdb) n
252         if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) {
(gdb) n
249         nsCString discardFailureId;
(gdb) n
257       if (!context) {
(gdb) n
264       bool success = context->ResizeScreenBuffer(mSurfaceSize);
(gdb) p context
$7 = {mRawPtr = 0x7edc19ee40}
(gdb) p context.mRawPtr
$8 = (mozilla::gl::GLContext *) 0x7edc19ee40
(gdb) p context.mRawPtr.mSwapChain
$9 = {
  mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SwapChain*, mozilla::DefaultDelete<mozilla::gl::SwapChain>, (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> = {<mozilla::DefaultDelete<mozilla::gl::SwapChain>> = {<No data fields>}, mFirstA = 0x0}, <No data fields>}}
(gdb) p context.mRawPtr.mSwapChain.mTuple.mFirstA
$10 = (mozilla::gl::SwapChain *) 0x0
(gdb)
We can conclude that the SwapChain hasn't been created yet. Which means this new bit of code I added, which is the code that's crashing, is being called too early. That's not quite what I was expecting. Just to check I've added a breakpoint to EmbedLiteCompositorBridgeParent::PrepareOffscreen(), which is where the SwapChain is created. This is just to double-check the ordering.
(gdb) b EmbedLiteCompositorBridgeParent::PrepareOffscreen
Breakpoint 3 at 0x7ff366810c: file mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp, line 104.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" hit Breakpoint 2, mozilla::layers::CompositorOGL::
    CreateContext (this=this@entry=0x7ed8002da0)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::
    CreateContext() {
(gdb) 
This confirms it: the CreateContext() call is happening before the PrepareOffscreen() call. I'll need to think about this again then.

The train is now coming in to Cambridge. I'm not taking any chances this time and will be packing up with plenty of time to spare! Sadly that's going to have to be it for today, but I'll pick this up again tomorrow.

25 Feb 2024 : Day 167 #
It felt like I had to stop halfway through a thought yesterday. But sometimes tiredness gets the better of me, I can feel myself spiralling into incoherence and the only sensible thing to do is head to bed. Sometimes it's a slow descent while relaxing or reading a book; other times I reach the incoherent end of the spectrum before my mind has even caught up with the fact I'm heading there.

So let me try to regroup. What we learnt yesterday was that OffscreenSize() previously returned mScreen->Size() and mScreen was created in InitOffscreen(). This InitOffscreen() method no longer exists — it was removed in D75055 — but was originally called in GLContextProviderEGL::CreateOffscreen().

The GLContextProviderEGL::CreateOffscreen() method also no longer exists in the codebase, replaced as it was in D79390:
$ git log -1 -S "GLContextProviderEGL::CreateOffscreen" \
    gecko-dev/gfx/gl/GLContextProviderEGL.cpp
commit 4232c2c466220d42223443bd5bd2f3c849123380
Author: Jeff Gilbert <jgilbert@mozilla.com>
Date:   Mon Jun 15 18:26:12 2020 +0000

    Bug 1632249 - Replace GLContextProvider::CreateOffscreen with
    GLContext::CreateOffscreenDefaultFb. r=lsalzman
    
    Differential Revision: https://phabricator.services.mozilla.com/D79390
Looking at the ESR 91 code and the diffs applied to it, it's not immediately obvious to me where this was getting called from and what's replacing it now, but we can get a callstack for how it was being called using the debugger on ESR 78. Here's the (abridged) backtrace:
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b GLContextProviderEGL::CreateOffscreen
Breakpoint 6 at 0x7fb8e839f0: file gfx/gl/GLContextProviderEGL.cpp, line 1400.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" hit Breakpoint 6, mozilla::gl::GLContextProviderEGL::
    CreateOffscreen (size=..., minCaps=..., 
    flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE,
    out_failureId=out_failureId@entry=0x7fa516f378)
    at gfx/gl/GLContextProviderEGL.cpp:1400
1400        CreateContextFlags flags, nsACString* const out_failureId) {
(gdb) bt
#0  mozilla::gl::GLContextProviderEGL::CreateOffscreen (size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE, 
    out_failureId=out_failureId@entry=0x7fa516f378)
    at gfx/gl/GLContextProviderEGL.cpp:1400
#1  0x0000007fb8ee475c in mozilla::layers::CompositorOGL::CreateContext
    (this=0x7eac003420)
    at gfx/layers/opengl/CompositorOGL.cpp:250
#2  mozilla::layers::CompositorOGL::CreateContext (this=0x7eac003420)
    at gfx/layers/opengl/CompositorOGL.cpp:223
#3  0x0000007fb8f053bc in mozilla::layers::CompositorOGL::Initialize
    (this=0x7eac003420, out_failureReason=0x7fa516f730)
    at gfx/layers/opengl/CompositorOGL.cpp:374
#4  0x0000007fb8fdcf7c in mozilla::layers::CompositorBridgeParent::NewCompositor
    (this=this@entry=0x7f8c99dc60, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1534
#5  0x0000007fb8fe65e8 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7f8c99dc60, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1491
#6  0x0000007fb8fe6730 in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7f8c99dc60,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1587
#7  0x0000007fbb2e31b4 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7f8c99dc60, aBackendHints=..., 
    aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:77
#8  0x0000007fb88c13d0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7f8c99dc60, msg__=...)
    at PCompositorBridgeParent.cpp:1391
#9  0x0000007fb88f86ac in mozilla::layers::PCompositorManagerParent::
    OnMessageReceived (this=<optimized out>, msg__=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:866
[...]
#24 0x0000007fbe70d89c in ?? () from /lib64/libc.so.6
(gdb) 
So examining the code at frame 1, we see that what was a call to CreateOffscreen():
  // Allow to create offscreen GL context for main Layer Manager
  if (!context && gfxEnv::LayersPreferOffscreen()) {
    SurfaceCaps caps = SurfaceCaps::ForRGB();
    caps.preserve = false;
    caps.bpp16 = gfxVars::OffscreenFormat() == SurfaceFormat::R5G6B5_UINT16;

    nsCString discardFailureId;
    context = GLContextProvider::CreateOffscreen(
        mSurfaceSize, caps, CreateContextFlags::REQUIRE_COMPAT_PROFILE,
        &discardFailureId);
  }
Is now a call to CreateHeadless():
  // Allow to create offscreen GL context for main Layer Manager
  if (!context && gfxEnv::LayersPreferOffscreen()) {
    nsCString discardFailureId;
    context = GLContextProvider::CreateHeadless(
        {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId);
    if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) {
      context = nullptr;
    }
  }
Whether we're on ESR 78 or ESR 91, this is all happening inside CompositorOGL::CreateContext().

Looking at the difference between the previous code that was called in CreateOffscreen() and the new code being called in CreateHeadless() my heart sinks a bit. There's so much that's been removed. It's true that CreateOffscreen() did go on to call CreateHeadless() in ESR 78, but there's so much other initialisation code in ESR 78, I just can't believe we can safely throw it all away.

But I'm going to persevere down this road I've started on, gradually building things back up only where they're needed to get things working. Right now that still means fixing the crash when OffscreenSize() is called.

I've not quite reached the point where the two sides of this circle meet up and the correct position to create the mBackBuffer emerges, but I feel like this exploration is getting us closer.

It's time for work now, I'll pick this up later on today.

[...]

After thinking on this some more, I've come to the conclusion that the right place to set up the mBackBuffer variable is in, or near, the call to GLContextProviderEGL::CreateHeadless(). It's there that the mScreen object would have been created in ESR 78 and checking with the debugger shows that the ordering is appropriate: CreateHeadless() gets called before CompositeToDefaultTarget(), which is what we need.
(gdb) delete break
(gdb) b EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget
Breakpoint 3 at 0x7ff3667880: EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget. (2 locations)
(gdb) b GLContextProviderEGL::CreateHeadless
Breakpoint 4 at 0x7ff1133740: file gfx/gl/GLContextProviderEGL.cpp, line 1245.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" hit Breakpoint 4, mozilla::gl::GLContextProviderEGL::
    CreateHeadless (desc=..., out_failureId=out_failureId@entry=0x7f1faad1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1245
1245        const GLContextCreateDesc& desc, nsACString* const out_failureId) {
(gdb) bt
#0  mozilla::gl::GLContextProviderEGL::CreateHeadless (desc=...,
    out_failureId=out_failureId@entry=0x7f1faad1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1245
#1  0x0000007ff119b81c in mozilla::layers::CompositorOGL::CreateContext
    (this=this@entry=0x7ee0002f50)
    at gfx/layers/opengl/CompositorOGL.cpp:250
#2  0x0000007ff11b0e24 in mozilla::layers::CompositorOGL::Initialize
    (this=0x7ee0002f50, out_failureReason=0x7f1faad520)
    at gfx/layers/opengl/CompositorOGL.cpp:389
#3  0x0000007ff12c6864 in mozilla::layers::CompositorBridgeParent::
    NewCompositor (this=this@entry=0x7fc4b3d260, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#4  0x0000007ff12d18e0 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7fc4b3d260, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#5  0x0000007ff12d1a10 in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7fc4b3d260,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#6  0x0000007ff3668238 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4b3d260, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#7  0x0000007ff0c65ad0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7fc4b3d260, msg__=...)
    at PCompositorBridgeParent.cpp:1285
#8  0x0000007ff0ca9fe4 in mozilla::layers::PCompositorManagerParent::
    OnMessageReceived (this=<optimized out>, msg__=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/
    ProtocolUtils.h:675
[...]
#22 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.
[New LWP 2694]
=============== Preparing offscreen rendering context ===============
[New LWP 2695]

Thread 36 "Compositor" hit Breakpoint 3, non-virtual thunk to mozilla::embedlite::EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget
    (mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>) ()
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.h:58
58        virtual void CompositeToDefaultTarget(VsyncId aId) override;
(gdb) bt
#0  non-virtual thunk to mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget(mozilla::layers::BaseTransactionId
    <mozilla::VsyncIdType>) ()
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.h:58
#1  0x0000007ff12b808c in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4d0df60, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    LayersTypes.h:82
#2  0x0000007ff12b80e8 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4b3d260)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#3  0x0000007ff12b8174 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4b3d260, x=<optimized out>,
    y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#4  0x0000007ff12b0d10 in mozilla::detail::RunnableMethodArguments<int,
    int, int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul, 2ul,
    3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#16 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
The only problem is that the SwapChain object is held by GLContext whereas CreateHeadless() is part of GLContextProviderEGL, which has access to very little, let alone the SwapChain. The good news is that GLContextProviderEGL does have access to GLContext. The structure is something like this:
class GLContext {
public:
    UniquePtr<SwapChain> mSwapChain;
};

class GLContextEGL final : public GLContext {
};

void EmbedLiteCompositorBridgeParent::PrepareOffscreen() {
    const CompositorBridgeParent::LayerTreeState* state =
        CompositorBridgeParent::GetIndirectShadowTree(RootLayerTreeId());
    GLContext* context = static_cast<CompositorOGL*>(
        state->mLayerManager->GetCompositor())->gl();

    // The SwapChain is created here and handed over to the context
    SwapChain* swapChain = new SwapChain();
    new SwapChainPresenter(*swapChain);
    context->mSwapChain.reset(swapChain);
}

already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
    context = GLContextProvider::CreateHeadless(
        {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId);
    // Here we have access to context->mSwapChain;
}
Based on this, it looks like CompositorOGL::CreateContext(), after the call to GLContextProvider::CreateHeadless() (which is actually a call to GLContextProviderEGL::CreateHeadless()), might be the right place to put the code to create mBackBuffer.

So, after the calls in CompositorOGL::CreateContext() I've inserted the following code:
  bool success = context->ResizeScreenBuffer(mSurfaceSize);
  if (!success) {
      NS_WARNING("Failed to create SwapChain back buffer");
  }
This will make for my first attempt to fix this. I've set the build running and we'll have to find out for sure whether that's improved the situation tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
24 Feb 2024 : Day 166 #
We're back on to looking at the SwapChain code, part of the offscreen rendering pipeline, today. This is all for the purpose of getting the WebView working. Currently opening a WebView will cause the parent app to crash; it's clearly something that needs fixing before ESR 91 will be acceptable for daily use.

We managed to persuade the SwapChain object to be created by force-disabling the embedlite.compositor.external_gl_context static preference. Now we've found that the mBackBuffer is null which also results in a segfault. The mBackBuffer variable is a member of SwapChainPresenter *mPresenter, which is itself contained within the SwapChain class.

Looking through the code in GLScreenBuffer.cpp there are currently only two ways for the mBackBuffer variable to be initialised. Either it gets swapped in as a result of a call to SwapBackBuffer() like this:
std::shared_ptr<SharedSurface> SwapChainPresenter::SwapBackBuffer(
    std::shared_ptr<SharedSurface> back) {
[...]
  auto old = mBackBuffer;
  mBackBuffer = back;
  if (mBackBuffer) {
    mBackBuffer->WaitForBufferOwnership();
    mBackBuffer->ProducerAcquire();
    mBackBuffer->LockProd();
  }
  return old;
}

Or it gets created as a result of a call to the Resize() method like this:
bool SwapChain::Resize(const gfx::IntSize& size) {
  UniquePtr<SharedSurface> newBack =
      mFactory->CreateShared(size);
  if (!newBack) return false;

  if (mPresenter->mBackBuffer) mPresenter->mBackBuffer->ProducerRelease();

  mPresenter->mBackBuffer.reset(newBack.release());

  mPresenter->mBackBuffer->ProducerAcquire();

  return true;
}
We should check whether either of these is being called. It's possible that SwapBackBuffer() is being called but with an empty back parameter, or that the CreateShared() call in Resize() is failing. Either of these would leave us in our current situation. However, it's just as likely that neither is being called and no attempt is being made to initialise mBackBuffer at all. We need to know!
(gdb) delete break
(gdb) b SwapChainPresenter::SwapBackBuffer
Breakpoint 1 at 0x7ff1109c14: file gfx/gl/GLScreenBuffer.cpp, line 82.
(gdb) b SwapChain::Resize
Breakpoint 2 at 0x7ff110a398: file gfx/gl/GLScreenBuffer.cpp, line 132.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]

Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 4362]
0x0000007ff110a38c in mozilla::gl::SwapChain::OffscreenSize
    (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) frame 1
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4b3f070, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter->mBackBuffer
$1 = std::shared_ptr<mozilla::gl::SharedSurface> (empty) = {get() = 0x0}
(gdb) 
No breakpoint hits before the segfault, so neither of those methods is being called. Armed with this knowledge we must now turn our thoughts to solutions, and there are several we could choose from: guard against a null access in SwapChain::OffscreenSize(); ensure mBackBuffer is created at the same time as the mPresenter that contains it; or find out whether the process flow should be calling one of the swap or resize methods prior to this call.
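To make the first of these options a bit more concrete, here's a minimal sketch of what a null-guarded OffscreenSize() might look like. The types are simplified stand-ins that I've made up for illustration (they share names with the Gecko classes but none of their real contents), and returning an empty size when there's no back buffer is just one possible policy, not necessarily the right fix.

```cpp
#include <memory>

// Simplified stand-ins for the Gecko types. These are assumptions for
// illustration only; just the pointer chain mirrors the real code.
struct IntSize {
  int width;
  int height;
};

struct Framebuffer {
  IntSize mSize;
};

struct SharedSurface {
  std::unique_ptr<Framebuffer> mFb;
};

struct SwapChainPresenter {
  std::shared_ptr<SharedSurface> mBackBuffer;
};

struct SwapChain {
  SwapChainPresenter* mPresenter = nullptr;

  // Guarded variant: check every link in the chain before dereferencing it,
  // and fall back to an empty size if any link is missing.
  const IntSize& OffscreenSize() const {
    static const IntSize kEmpty{0, 0};
    if (!mPresenter || !mPresenter->mBackBuffer ||
        !mPresenter->mBackBuffer->mFb) {
      return kEmpty;
    }
    return mPresenter->mBackBuffer->mFb->mSize;
  }
};
```

With a guard like this a SwapChain that has no back buffer yet reports a size of 0 by 0 rather than segfaulting, so a caller comparing it against the real surface size would see a mismatch and could go on to request a resize.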

It's now time for my work day, so answers to these questions will have to wait until tonight. Still, this is progress.

[...]

To try to get a handle on where this mBackBuffer ought to be created, I thought it might help to figure out some sequencing. Here's how things seem to be happening:
  1. The SwapChain is created in EmbedLiteCompositorBridgeParent::PrepareOffscreen().
  2. The SwapChainPresenter() is created immediately afterwards.
  3. In EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget() the SwapChain::OffscreenSize() method is called, causing the crash.
  4. Immediately after this SwapChain::Resize() is called which, if done earlier, would prevent the crash.
Here's that sequence of code where the crash is caused:
    if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize
      && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) {
      return;
    }
It's also worth noting, I think, that SwapChain::Acquire() calls SwapBackBuffer() with a non-null back buffer parameter, which if called early enough would also prevent the code from crashing subsequently when OffscreenSize() is read.
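The ordering problem can be sketched in isolation. What follows is a simplified model rather than the real code: MiniSwapChain, HasBackBuffer() and EnsureSize() are hypothetical names, the back buffer is reduced to nothing more than a size, and the rule that Resize() rejects non-positive sizes is my own assumption. The point is only that the size gets read when a back buffer exists, with Resize() used to create one otherwise.

```cpp
#include <memory>

struct Size {
  int width;
  int height;
};

inline bool sameSize(const Size& a, const Size& b) {
  return a.width == b.width && a.height == b.height;
}

// Hypothetical, heavily simplified swap chain: the "back buffer" is just a
// Size held by pointer, so it can be missing in the same way mBackBuffer is.
class MiniSwapChain {
 public:
  bool HasBackBuffer() const { return mBackBuffer != nullptr; }

  // Only valid once a back buffer exists, like the unguarded original.
  const Size& OffscreenSize() const { return *mBackBuffer; }

  // Stand-in for SwapChain::Resize(): allocates a fresh back buffer.
  bool Resize(const Size& size) {
    if (size.width <= 0 || size.height <= 0) return false;
    mBackBuffer.reset(new Size(size));
    return true;
  }

 private:
  std::unique_ptr<Size> mBackBuffer;
};

// Returns true if the chain ends up with a back buffer of the target size.
inline bool EnsureSize(MiniSwapChain& chain, const Size& target) {
  // HasBackBuffer() is tested first, so OffscreenSize() is never reached
  // while the back buffer is still missing.
  if (chain.HasBackBuffer() && sameSize(chain.OffscreenSize(), target)) {
    return true;  // already the right size, nothing to do
  }
  return chain.Resize(target);  // covers both "missing" and "wrong size"
}
```

Calling EnsureSize() on a freshly created chain goes straight to Resize(), which is the behaviour the existing condition can't provide because it always reads the size first.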

Having re-reviewed the original D75055 changeset that introduced the SwapChain, along with the history of the related files that aren't part of the changeset, I'm beginning to get a better picture.

For example, the code that makes the failing call to OffscreenSize() was added by me earlier on in this process:
$ git blame \
    embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp \
    -L 153,156
^01b1a0352034 embedthread/EmbedLiteCompositorBridgeParent.cpp
    (Raine Makelainen  2020-07-24 16:25:17 +0300 153)
    MutexAutoLock lock(mRenderMutex);
d59d44a5bccaf embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
    (David Llewellyn-Jones 2023-09-04 09:03:49 +0100 154)
    if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize
d59d44a5bccaf embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
    (David Llewellyn-Jones 2023-09-04 09:03:49 +0100 155)
    && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) {
^01b1a0352034 embedthread/EmbedLiteCompositorBridgeParent.cpp
    (Raine Makelainen      2020-07-24 16:25:17 +0300 156)
    return;
Prior to this the logic was somewhat different:
$ git diff d59d44a5bccaf~ d59d44a5bccaf

@@ -153,7 +157,8 @@ EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget
    (VsyncId aId)
[...] 
   if (context->IsOffscreen()) {
     MutexAutoLock lock(mRenderMutex);
-    if (context->OffscreenSize() != mEGLSurfaceSize && !context->
    ResizeOffscreen(mEGLSurfaceSize)) {
+    if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize
+      && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) {
       return;
     }
   }
The code for returning the offscreen size was also added by me slightly earlier on the same day:
$ git blame gfx/gl/GLScreenBuffer.cpp -L 128,130
f536dbf9f6f8a (David Llewellyn-Jones 2023-09-04 08:52:00 +0100 128)
    const gfx::IntSize& SwapChain::OffscreenSize() const {
f536dbf9f6f8a (David Llewellyn-Jones 2023-09-04 08:52:00 +0100 129)
      return mPresenter->mBackBuffer->mFb->mSize;
f536dbf9f6f8a (David Llewellyn-Jones 2023-09-04 08:52:00 +0100 130)
    }
And as you can see, a lot of code was changed in this commit, especially this code:
diff --git a/gfx/gl/GLScreenBuffer.cpp b/gfx/gl/GLScreenBuffer.cpp
index 0398dd7dc6a2..e71263068777 100644
--- a/gfx/gl/GLScreenBuffer.cpp
+++ b/gfx/gl/GLScreenBuffer.cpp
@@ -116,4 +116,38 @@ SwapChain::~SwapChain() {
   }
 }
 
[...]
+
+const gfx::IntSize& SwapChain::OffscreenSize() const {
+  return mPresenter->mBackBuffer->mFb->mSize;
+}
+
[...]
So much — if not all — of this faulty code is down to me. Back then I was coding without being able to test, so this isn't a huge surprise. But it does also mean I have more scope to control the situation and make changes to the implementation.

Prior to all these changes the OffscreenSize() implementation looked like this:
const gfx::IntSize& GLContext::OffscreenSize() const {
  MOZ_ASSERT(IsOffscreen());
  return mScreen->Size();
}
The mScreen being used here is analogous to the mBackBuffer and is created in this method:
bool GLContext::CreateScreenBufferImpl(const IntSize& size,
                                       const SurfaceCaps& caps) {
  UniquePtr<GLScreenBuffer> newScreen =
      GLScreenBuffer::Create(this, size, caps);
  if (!newScreen) return false;

  if (!newScreen->Resize(size)) {
    return false;
  }

  // This will rebind to 0 (Screen) if needed when
  // it falls out of scope.
  ScopedBindFramebuffer autoFB(this);

  mScreen = std::move(newScreen);

  return true;
}
And the creation of mScreen is kicked off from a method called InitOffscreen(), which calls CreateScreenBuffer().
bool GLContext::InitOffscreen(const gfx::IntSize& size,
                              const SurfaceCaps& caps) {
  if (!CreateScreenBuffer(size, caps)) return false;
[...]
Finally InitOffscreen() is called in GLContextProviderCGL::CreateOffscreen():
already_AddRefed<GLContext> GLContextProviderCGL::CreateOffscreen(
    const IntSize& size,
    const SurfaceCaps& minCaps,
    CreateContextFlags flags,
    nsACString* const out_failureId) {
  RefPtr<GLContext> gl = CreateHeadless(flags, out_failureId);
  if (!gl) {
    return nullptr;
  }

  if (!gl->InitOffscreen(size, minCaps)) {
    *out_failureId = NS_LITERAL_CSTRING("FEATURE_FAILURE_CGL_INIT");
    return nullptr;
  }

  return gl.forget();
}
It feels like we've nearly come full circle, which would be good because then that would be enough to make a clear decision about how to address this. But it's already late here now and time for me to call it a night, so that decision will have to wait for tomorrow.

23 Feb 2024 : Day 165 #
Today I'm following up the tasks I set out for myself yesterday:
 
Tomorrow I'll do a sweep of the other code to check whether any attempt is being made to initialise it somewhere else. If not I'll add in some initialisation code to see what happens.

As you may recall the WebView is crashing because the SwapChain returned from the context is null. It should be created somewhere, but it's not yet clear where. So the first question is whether there's some code to create it that isn't being called, or whether there's nowhere in the code currently set to create it.

A quick grep of the code throws up a few potential places where it could be being created:
$ grep -rIn "new SwapChain(" *
embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp:128:
    swapChain = new SwapChain();
gecko-dev/dom/webgpu/CanvasContext.cpp:95:
    mSwapChain = new SwapChain(aDesc, extent, mExternalImageId, format);
gecko-dev/dom/webgpu/CanvasContext.cpp:139:
    mSwapChain = new SwapChain(desc, extent, mExternalImageId, gfxFormat);
The most promising of these is going to be the code in EmbedLiteCompositorBridgeParent.cpp since that's EmbedLite-specific code. In fact, probably something I added myself during the ESR 91 changes (since SwapChain is new to ESR 91):
$ git blame \
    embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp \
    -L 128,128
d59d44a5bccaf (David Llewellyn-Jones 2023-09-04 09:03:49 +0100 128)
    swapChain = new SwapChain();
Confirmed. I even added a note to myself at the time to explain that this might need fixing:
$ git blame \
    embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp \
    -L 113,114
d59d44a5bccaf (David Llewellyn-Jones 2023-09-04 09:03:49 +0100 113)
    // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
d59d44a5bccaf (David Llewellyn-Jones 2023-09-04 09:03:49 +0100 114)
    // See: https://phabricator.services.mozilla.com/D75055
Here's the relevant bit of code:
void
EmbedLiteCompositorBridgeParent::PrepareOffscreen()
{
  fprintf(stderr,
      "=============== Preparing offscreen rendering context ===============\n");

  const CompositorBridgeParent::LayerTreeState* state =
      CompositorBridgeParent::GetIndirectShadowTree(RootLayerTreeId());
  NS_ENSURE_TRUE(state && state->mLayerManager, );

  GLContext* context = static_cast<CompositorOGL*>(state->mLayerManager->
      GetCompositor())->gl();
  NS_ENSURE_TRUE(context, );

  // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
  // See: https://phabricator.services.mozilla.com/D75055
  if (context->IsOffscreen()) {
    UniquePtr<SurfaceFactory> factory;
    if (context->GetContextType() == GLContextType::EGL) {
      // [Basic/OGL Layers, OMTC] WebGL layer init.
      factory = SurfaceFactory_EGLImage::Create(*context);
    } else {
      // [Basic Layers, OMTC] WebGL layer init.
      // Well, this *should* work...
      factory = MakeUnique<SurfaceFactory_Basic>(*context);
    }

    SwapChain* swapChain = context->GetSwapChain();
    if (swapChain == nullptr) {
      swapChain = new SwapChain();
      new SwapChainPresenter(*swapChain);
      context->mSwapChain.reset(swapChain);
    }

    if (factory) {
      swapChain->Morph(std::move(factory));
    }
  }
}
So either this method isn't being run, or context->IsOffscreen() must be set to false. Let's find out. Conveniently I can once again continue with my debugging session from yesterday:
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b EmbedLiteCompositorBridgeParent::PrepareOffscreen
Breakpoint 10 at 0x7ff366808c: file mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp, line 104.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]

Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 14342]
mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:129
129       return mPresenter->mBackBuffer->mFb->mSize;
(gdb) bt
#0  mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:129
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4aebc90, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff12b808c in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4d0b230, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/LayersTypes.h:82
#3  0x0000007ff12b80e8 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4aebc90)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b8174 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4aebc90, x=<optimized out>,
    y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12b0d10 in mozilla::detail::RunnableMethodArguments<int, int,
    int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul,
    2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
#6  mozilla::detail::RunnableMethodArguments<int, int, int, int>::apply
    <mozilla::layers::CompositorBridgeParent, void (mozilla::layers::
    CompositorBridgeParent::*)(int, int, int, int)> (m=<optimized out>,
    o=<optimized out>, this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1154
[...]
#17 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
So the crash is happening before PrepareOffscreen() is called. Looking through the code there's actually only one place it is called and that's inside EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent(). There it's wrapped in a condition:
  if (!StaticPrefs::embedlite_compositor_external_gl_context()) {
    // Prepare Offscreen rendering context
    PrepareOffscreen();
  }
So I should check whether the problem is that this method isn't being called, or that the condition is false. A quick debug confirms that it's the latter: the method is entered but the value of the static preference means the PrepareOffscreen() call is never made:
Thread 36 "Compositor" hit Breakpoint 11, mozilla::embedlite::
    EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent
    (this=0x7fc4bc68c0, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:79
79      {
(gdb) n
80        PLayerTransactionParent* p =
(gdb) n
83        EmbedLiteWindowParent *parentWindow = EmbedLiteWindowParent::From(mWindowId);
(gdb) n
84        if (parentWindow) {
(gdb) n
85          parentWindow->GetListener()->CompositorCreated();
(gdb) n
88        if (!StaticPrefs::embedlite_compositor_external_gl_context()) {
(gdb) n
mozilla::layers::PCompositorBridgeParent::OnMessageReceived
    (this=0x7fc4bc68c0, msg__=...) at PCompositorBridgeParent.cpp:1286
1286    PCompositorBridgeParent.cpp: No such file or directory.
(gdb) 
As we can see, it comes down to this embedlite.compositor.external_gl_context static preference, which needs to be set to false for the condition to be entered.

This preference isn't being set for the WebView, although it is set for the browser:
$ pushd ../sailfish-browser/
$ grep -rIn "embedlite.compositor.external_gl_context" *
apps/core/declarativewebutils.cpp:239:
    webEngineSettings->setPreference(QString(
    "embedlite.compositor.external_gl_context"), QVariant(true));
data/prefs.js:5:
    user_pref("embedlite.compositor.external_gl_context", true);
tests/auto/mocks/declarativewebutils/declarativewebutils.cpp:62:
    webEngineSettings->setPreference(QString(
    "embedlite.compositor.external_gl_context"), QVariant(true));
$ popd
I'm going to set it to false explicitly for the WebView. But this immediately makes me feel nervous: this setting isn't new and there's a reason it's not being touched in the WebView code. It makes me think that I'm travelling down a rendering pipeline path that I shouldn't be.

So as well as trying out this change I'm also going to ask for some expert advice from the Jolla team about this, just in case it's actually important that I don't set this to false and that the real issue is somewhere else.

But, it's the start of my work day, so that will all have to wait until later.

[...]

I've added in the change to set the embedlite.compositor.external_gl_context static preference to false:
diff --git a/lib/webenginesettings.cpp b/lib/webenginesettings.cpp
index de9e4b86..780f6555 100644
--- a/lib/webenginesettings.cpp
+++ b/lib/webenginesettings.cpp
@@ -110,6 +110,12 @@ void SailfishOS::WebEngineSettings::initialize()
     engineSettings->setPreference(QStringLiteral("intl.accept_languages"),
                                   QVariant::fromValue<QString>(langs));
 
+    // Ensure the renderer is configured correctly
+    engineSettings->setPreference(QStringLiteral("gfx.webrender.force-disabled"),
+                                  QVariant(true));
+    engineSettings->setPreference(QStringLiteral("embedlite.compositor.external_gl_context"),
+                                  QVariant(false));
+
     Silica::Theme *silicaTheme = Silica::Theme::instance();
 
     // Notify gecko when the ambience switches between light and dark
The code has successfully built; now it's time to test it. On a dry run of the new code it crashes seemingly somewhere close to where it was crashing before. But crucially the debug print from inside the PrepareOffscreen() method is now being output. So we've definitely moved a step forwards.
$ harbour-webview
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[D] main:30 - WebView Example
[D] main:44 - Using default start URL:  "https://www.flypig.co.uk/search/"
[D] main:47 - Opening webview
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
[...]
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change" {file:
    "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12

Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
=============== Preparing offscreen rendering context ===============
Segmentation fault
To find out what's going on we can step through. And after the new install it's finally time to start a new debugging session.
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) b EmbedLiteCompositorBridgeParent::PrepareOffscreen
Function "EmbedLiteCompositorBridgeParent::PrepareOffscreen" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (EmbedLiteCompositorBridgeParent::PrepareOffscreen) pending.
(gdb) r
[...]
Thread 36 "Compositor" hit Breakpoint 1, mozilla::embedlite::
    EmbedLiteCompositorBridgeParent::PrepareOffscreen
    (this=this@entry=0x7fc4be7560)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:104
104     {
(gdb) n
[LWP 18280 exited]
105       fprintf(stderr,
    "=============== Preparing offscreen rendering context ===============\n");
(gdb) 
=============== Preparing offscreen rendering context ===============
107       const CompositorBridgeParent::LayerTreeState* state =
    CompositorBridgeParent::GetIndirectShadowTree(RootLayerTreeId());
(gdb) 
108       NS_ENSURE_TRUE(state && state->mLayerManager, );
(gdb) 
110       GLContext* context = static_cast<CompositorOGL*>
    (state->mLayerManager->GetCompositor())->gl();
(gdb) p context
$1 = <optimized out>
(gdb) n
111       NS_ENSURE_TRUE(context, );
(gdb) n
3540    ${PROJECT}/obj-build-mer-qt-xr/dist/include/GLContext.h:
    No such file or directory.
(gdb) n
117         if (context->GetContextType() == GLContextType::EGL) {
(gdb) n
119           factory = SurfaceFactory_EGLImage::Create(*context);
(gdb) n
126         SwapChain* swapChain = context->GetSwapChain();
(gdb) n
127         if (swapChain == nullptr) {
(gdb) n
128           swapChain = new SwapChain();
(gdb) n
129           new SwapChainPresenter(*swapChain);
(gdb) n
130           context->mSwapChain.reset(swapChain);
(gdb) n
133         if (factory) {
(gdb) n
134           swapChain->Morph(std::move(factory));
(gdb) n
mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4be7560, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:92
92        return p;
(gdb) c
Continuing.

Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
0x0000007ff110a38c in mozilla::gl::SwapChain::OffscreenSize
    (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) bt
#0  0x0000007ff110a38c in mozilla::gl::SwapChain::OffscreenSize
    (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4be7560, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff12b808c in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4d0b1a0, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    LayersTypes.h:82
#3  0x0000007ff12b80e8 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4be7560)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b8174 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4be7560, x=<optimized out>,
    y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12b0d10 in mozilla::detail::RunnableMethodArguments<int, int,
    int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul,
    2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#17 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
Here's the code that's causing the crash:
const gfx::IntSize& SwapChain::OffscreenSize() const {
  return mPresenter->mBackBuffer->mFb->mSize;
}
It would be helpful to know which of these values is null, but unhelpfully the values have been optimised out.
(gdb) p mPresenter
value has been optimized out
(gdb) p this
$2 = <optimized out>
However, if we go up a stack frame we have better luck, applying the trick we used on Day 164 to extract the SwapChain object from the context via the UniquePtr class:
(gdb) frame 1
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4be7560, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) p context
$3 = (mozilla::gl::GLContext *) 0x7ee019ee40
(gdb) p context->mSwapChain.mTuple.mFirstA
$5 = (mozilla::gl::SwapChain *) 0x7ee01ce090
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter
$6 = (mozilla::gl::SwapChainPresenter *) 0x7ee01a1380
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter->mBackBuffer
$7 = std::shared_ptr<mozilla::gl::SharedSurface> (empty) = {get() = 0x0}
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter->mBackBuffer->mFb
Cannot access memory at address 0x20
(gdb) 
As we can see from this, the missing value is the mBackBuffer value inside the SwapChainPresenter object of the SwapChain class.
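The 0x20 in that final error is a useful clue in its own right. Most likely it's the offset of mFb within the SharedSurface class being added to the null pointer held by mBackBuffer. Here's a small sketch of the arithmetic. FakeSharedSurface and its 32 bytes of padding are assumptions chosen purely to reproduce the number; the real class layout isn't shown here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for SharedSurface. The padding is an assumption
// picked so that mFb lands at offset 0x20, matching the gdb message.
struct FakeSharedSurface {
  std::uint64_t mOtherMembers[4];  // 4 * 8 = 32 bytes of earlier members
  void* mFb;                       // stands in for the real mFb member
};

// The address a debugger has to read to evaluate `base->mFb`: the base
// address of the object plus the offset of the member within it.
constexpr std::uintptr_t memberAddress(std::uintptr_t base) {
  return base + offsetof(FakeSharedSurface, mFb);
}

// With a null base the address is just the member offset.
static_assert(memberAddress(0) == 0x20, "null base + offset gives 0x20");
```

So a small non-zero address in a fault like this generally points to a member access through a null object pointer, rather than to a wild pointer.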

It's clear what the next step is: find out why the mBackBuffer value isn't being set and, if necessary, set it. But that's a task that'll have to wait until tomorrow.

22 Feb 2024 : Day 164 #
Working on the WebView implementation, yesterday we reached the point where the WebView component no longer crashed the app hosting it. We did this by ensuring the correct layer manager was used for rendering.

But now we're left with a bunch of errors. The ones that need fixing immediately are the following:
[W] unknown:7 - file:///usr/share/harbour-webview/qml/harbour-webview.qml:7:30:
    Type WebViewPage unavailable 
         initialPage: Component { WebViewPage { } } 
                                  ^
[W] unknown:13 - file:///usr/share/harbour-webview/qml/pages/
    WebViewPage.qml:13:5: Type WebView unavailable 
         WebView { 
         ^
[W] unknown:141 - file:///usr/lib64/qt5/qml/Sailfish/WebView/
    WebView.qml:141:9: Type TextSelectionController unavailable 
             TextSelectionController { 
             ^
[W] unknown:14 - file:///usr/lib64/qt5/qml/Sailfish/WebView/Controls/
    TextSelectionController.qml:14:1: module "QOfono" is not installed 
     import QOfono 0.2 
     ^
This cascade of errors all reduces to the last:
[W] unknown:14 - file:///usr/lib64/qt5/qml/Sailfish/WebView/Controls/
    TextSelectionController.qml:14:1: module "QOfono" is not installed 
     import QOfono 0.2 
     ^
The reason for this is also clear. The spec file for sailfish-components-webview states that libqofono 0.117 or above is needed. I don't have this on my system for whatever reason (I'll need to investigate), but to work around this I hacked the spec file so that it wouldn't refuse to install on a system with a lower version, like this:
diff --git a/rpm/sailfish-components-webview.spec
           b/rpm/sailfish-components-webview.spec
index 766933ba..c311ebcf 100644
--- a/rpm/sailfish-components-webview.spec
+++ b/rpm/sailfish-components-webview.spec
@@ -18,7 +18,7 @@ Requires: sailfishsilica-qt5 >= 1.1.123
 Requires: sailfish-components-media-qt5
 Requires: sailfish-components-pickers-qt5
 Requires: embedlite-components-qt5 >= 1.21.2
-Requires: libqofono-qt5-declarative >= 0.117
+Requires: libqofono-qt5-declarative >= 0.115
 
 %description
 %{summary}.
There's no build-time requirement, so I thought I might get away with it. But clearly not.

It seems a bit odd that a text selector component should be requiring an entire separate phone library in order to work. Let's take a look at why.

The ofono code comes at the end of the file. There are two OfonoNetworkRegistration components called cellular1Status and cellular2Status. These represent the state of the two SIM card slots in the device. You might ask why there are only two; can't you have more than two SIM card slots? Well, yes, but I guess this is a problem for future developers to deal with.

These two components feed into the following Boolean value at the top of the code:
    readonly property bool _canCall: cellular1Status.registered
        || cellular2Status.registered
Later on in the code we see this being used, like this:
        isPhoneNumber = _canCall && _phoneNumberSelected
So what's this all for? When you select some text the browser will present you with some options for what to do with it. Copy to clipboard? Open a link? If it thinks it's a phone number it will offer to make a call to it for you. Unless you don't have a SIM card installed. So that's why libqofono is needed here.

You might wonder how it knows it's a phone number at all. The answer to this question isn't in the sailfish-components-webview code. The answer is in embedlite-components, in the SelectionPrototype.js file where we find this code:
  _phoneRegex: /^\+?[0-9\s,-.\(\)*#pw]{1,30}$/,

  _getSelectedPhoneNumber: function sh_getSelectedPhoneNumber() {
    return this._isPhoneNumber(this._getSelectedText().trim());
  },

  _isPhoneNumber: function sh_isPhoneNumber(selectedText) {
    return (this._phoneRegex.test(selectedText) ? selectedText : null);
  },
So the decision about whether something is a phone number or not comes down to whether it satisfies the regex /^\+?[0-9\s,-.\(\)*#pw]{1,30}$/ and whether you have a SIM card installed.
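Putting the two conditions together, the logic can be sketched in C++ like this. The function names are hypothetical and std::regex is standing in for the JavaScript regex engine, so treat it as an approximation of the behaviour rather than the actual implementation, which is split between QML and JavaScript.

```cpp
#include <regex>
#include <string>

// Same pattern as _phoneRegex, written as a C++ raw string literal and
// compiled with std::regex's default ECMAScript grammar.
inline const std::regex& phoneRegex() {
  static const std::regex re(R"(^\+?[0-9\s,-.\(\)*#pw]{1,30}$)");
  return re;
}

// Mirrors _isPhoneNumber(): the whole (already trimmed) selection must match.
inline bool looksLikePhoneNumber(const std::string& selectedText) {
  return std::regex_match(selectedText, phoneRegex());
}

// Mirrors `_canCall && _phoneNumberSelected`: a call is only offered when at
// least one SIM is registered and the selection looks like a phone number.
inline bool shouldOfferCall(bool sim1Registered, bool sim2Registered,
                            const std::string& selectedText) {
  const bool canCall = sim1Registered || sim2Registered;
  return canCall && looksLikePhoneNumber(selectedText);
}
```

A selection such as "+44 1234 567890" passes the regex, but shouldOfferCall() still returns false for it unless at least one of the two SIM flags is set.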

But that's a bit of a diversion. We only care about this new libqofono. Why is this newer version needed and why don't I have it on my system? Let's find out when and why it was changed.
$ git blame import/controls/TextSelectionController.qml -L 14,14
16ef5cdf4 (Pekka Vuorela 2023-01-05 12:09:27 +0200 14) import QOfono 0.2
$ git log -1 16ef5cdf4
commit 16ef5cdf44c2eafd7d93e17a41927ef5da700c2b
Author: Pekka Vuorela <pekka.vuorela@jolla.com>
Date:   Thu Jan 5 12:09:27 2023 +0200

    [components-webview] Migrate to new qofono import. JB#59690

    Also dependency was missing.
The actual change here was pretty small.
$ git diff 16ef5cdf44c2eafd7d93e17a41927ef5da700c2b~ \
    16ef5cdf44c2eafd7d93e17a41927ef5da700c2b
diff --git a/import/controls/TextSelectionController.qml
           b/import/controls/TextSelectionController.qml
index 5c8f2845..71bd83cc 100644
--- a/import/controls/TextSelectionController.qml
+++ b/import/controls/TextSelectionController.qml
@@ -11,7 +11,7 @@
 
 import QtQuick 2.1
 import Sailfish.Silica 1.0
-import MeeGo.QOfono 0.2
+import QOfono 0.2
 
 MouseArea {
     id: root
diff --git a/rpm/sailfish-components-webview.spec
           b/rpm/sailfish-components-webview.spec
index 9a2a3154..5729a8d9 100644
--- a/rpm/sailfish-components-webview.spec
+++ b/rpm/sailfish-components-webview.spec
@@ -18,6 +18,7 @@ Requires: sailfishsilica-qt5 >= 1.1.123
 Requires: sailfish-components-media-qt5
 Requires: sailfish-components-pickers-qt5
 Requires: embedlite-components-qt5 >= 1.21.2
+Requires: libqofono-qt5-declarative >= 0.117
 
 %description
 %{summary}.
The import has been updated as have the requirements. But there's been no change to the code, so the libqofono version requirement is probably only needed to deal with the name change of the import.

None of this seems essential for ESR 91. My guess is that this change has gone into the development code but hasn't yet made it into a release. So I'm going to hack around it for now (being careful not to commit my hacked changes into the repository).

I've already amended the version number in the spec file, so to get things to work I should just have to reverse this change:
-import MeeGo.QOfono 0.2
+import QOfono 0.2
I can do that on-device. This should do it:
sed -i -e 's/QOfono/MeeGo.QOfono/g' \
    /usr/lib64/qt5/qml/Sailfish/WebView/Controls/TextSelectionController.qml
Great! That's removed the QML error. But now the app is back to crashing again before it gets to even try to render something on-screen:
$ harbour-webview 
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[...]
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change" {file:
    "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12

Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
Segmentation fault
So it's back to the debugger again. But this will have to wait until this evening.

[...]

It's the evening and time to put the harbour-webview example through the debugger.
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 24061]
mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:129
129       return mPresenter->mBackBuffer->mFb->mSize;
(gdb) bt
#0  mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:129
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4be8da0, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff12b808c in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4d0c0b0, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    LayersTypes.h:82
#3  0x0000007ff12b80e8 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4be8da0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b8174 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4be8da0, x=<optimized out>,
    y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12b0d10 in mozilla::detail::RunnableMethodArguments<int, int,
    int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul,
    2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#17 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
This is now a proper crash, not something induced intentionally by the code. Here's the actual code causing the crash taken from GLScreenBuffer.cpp:
const gfx::IntSize& SwapChain::OffscreenSize() const {
  return mPresenter->mBackBuffer->mFb->mSize;
}
The problem here being that the SwapChain object itself is null. So we should look in the calling method to find out what's going on there. Here's the relevant code this time from EmbedLiteCompositorBridgeParent.cpp:
void
EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget(VsyncId aId)
{
  GLContext* context = static_cast<CompositorOGL*>(state->mLayerManager->
      GetCompositor())->gl();
[...]
  if (context->IsOffscreen()) {
    MutexAutoLock lock(mRenderMutex);
    if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize
      && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) {
      return;
    }
  }
With a bit of digging we can see that the value being returned by context->GetSwapChain() is null:
(gdb) frame 1
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4be8da0, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) p context
$2 = (mozilla::gl::GLContext *) 0x7ed819ee00
(gdb) p context->GetSwapChain()
Cannot evaluate function -- may be inlined
(gdb) p context.mSwapChain
$3 = {
  mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SwapChain*,
    mozilla::DefaultDelete<mozilla::gl::SwapChain>,
    (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> =
    {<mozilla::DefaultDelete<mozilla::gl::SwapChain>> = {<No data fields>},
    mFirstA = 0x0}, <No data fields>}}
(gdb) p context.mSwapChain.mTuple
$4 = {<mozilla::detail::CompactPairHelper<mozilla::gl::SwapChain*,
    mozilla::DefaultDelete<mozilla::gl::SwapChain>,
    (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> =
    {<mozilla::DefaultDelete<mozilla::gl::SwapChain>> = {<No data fields>},
    mFirstA = 0x0}, <No data fields>}
(gdb) p context.mSwapChain.mTuple.mFirstA
$5 = (mozilla::gl::SwapChain *) 0x0
(gdb) 
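To make the failure mode concrete, here's a small stand-alone sketch of the same shape of code with a null check added. The IntSize, SwapChain and GLContext types below are cut-down stand-ins I've made up for illustration, not Gecko's real classes, and needsResize() is a hypothetical helper. A guard like this would only avoid the segfault; it wouldn't get anything rendered, because the swap chain still wouldn't exist.

```cpp
#include <memory>

// Stand-in types for illustration only; these are NOT Gecko's real classes.
struct IntSize {
  int width;
  int height;
  bool operator!=(const IntSize& other) const {
    return width != other.width || height != other.height;
  }
};

struct SwapChain {
  IntSize size{0, 0};
  const IntSize& OffscreenSize() const { return size; }
};

struct GLContext {
  std::unique_ptr<SwapChain> mSwapChain;  // left null, as in the backtrace
  SwapChain* GetSwapChain() const { return mSwapChain.get(); }
};

// Hypothetical guard: returns false rather than dereferencing a null
// SwapChain, and true if the sizes differ so a resize would be needed.
bool needsResize(const GLContext& context, const IntSize& surfaceSize) {
  SwapChain* swapChain = context.GetSwapChain();
  if (!swapChain) {
    // This is the case the debugger shows: mSwapChain was never set.
    return false;
  }
  return swapChain->OffscreenSize() != surfaceSize;
}
```

In the real code the call to OffscreenSize() is made on the null pointer directly, which is why the backtrace shows this=0x0 in frame 0.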
You may recall that way back in the first three weeks of working on Gecko I hit a problem with the rendering pipeline. The GLScreenBuffer structure that the WebView has been using for a long time had been completely removed and replaced with this SwapChain class.

At the time I struggled with how to rearrange the code so that it compiled. I made changes that I couldn't test. And while I did get it to compile, these changes are now coming back to haunt me. Now I need to actually fix this rendering pipeline properly.

There's a bit of me that is glad I'm finally having to do this. I really want to know how it's actually supposed to work.

Clearly the first task will be to figure out why the mSwapChain member of GLContext is never being set. With any luck this will be at the easier end of the difficulty spectrum.

I'm going to try to find where mSwapChain is being — or should be being — set. To do that I'll need to find out where the context is coming from. The context is being passed by CompositorOGL so that would seem to be a good place to start.

Looking through the CompositorOGL.cpp file we can see that the mGLContext member is being initialised from a value passed in to CompositorOGL::Initialize(). The debugger can help us work back from there.
(gdb) break CompositorOGL::Initialize
Breakpoint 1 at 0x7ff11b0c3c: file gfx/layers/opengl/CompositorOGL.cpp,
    line 380.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" hit Breakpoint 1, mozilla::layers::CompositorOGL::
    Initialize (this=0x7ee0002f50, out_failureReason=0x7f1faac520)
    at gfx/layers/opengl/CompositorOGL.cpp:380
380     bool CompositorOGL::Initialize(nsCString* const out_failureReason) {
(gdb)
Ah! This is interesting. It's not being passed in because there are two different overloads of the CompositorOGL::Initialize() method and the code is using the other one. In this other piece of code the context is created directly:
bool CompositorOGL::Initialize(nsCString* const out_failureReason) {
  ScopedGfxFeatureReporter reporter("GL Layers");

  // Do not allow double initialization
  MOZ_ASSERT(mGLContext == nullptr || !mOwnsGLContext,
             "Don't reinitialize CompositorOGL");

  if (!mGLContext) {
    MOZ_ASSERT(mOwnsGLContext);
    mGLContext = CreateContext();
  }
[...]
Let's see what happens with the context creation.
Thread 36 "Compositor" hit Breakpoint 5, mozilla::layers::CompositorOGL::
    CreateContext (this=this@entry=0x7ee0002f50)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
(gdb) n
231       nsIWidget* widget = mWidget->RealWidget();
(gdb) 
232       void* widgetOpenGLContext =
(gdb) 
234       if (widgetOpenGLContext) {
(gdb) 
248       if (!context && gfxEnv::LayersPreferOffscreen()) {
(gdb) b GLContextProviderEGL::CreateHeadless
Breakpoint 6 at 0x7ff1133740: file gfx/gl/GLContextProviderEGL.cpp, line 1245.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 6, mozilla::gl::GLContextProviderEGL::
    CreateHeadless (desc=..., out_failureId=out_failureId@entry=0x7f1faed1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1245
1245        const GLContextCreateDesc& desc, nsACString* const out_failureId) {
(gdb) n
1246      const auto display = DefaultEglDisplay(out_failureId);
(gdb) 
1247      if (!display) {
(gdb) p display
$8 = std::shared_ptr<mozilla::gl::EglDisplay> (use count 1, weak count 2)
    = {get() = 0x7ee0004cb0}
(gdb) n
1250      mozilla::gfx::IntSize dummySize = mozilla::gfx::IntSize(16, 16);
(gdb) b GLContextEGL::CreateEGLPBufferOffscreenContext
Breakpoint 7 at 0x7ff11335b8: file gfx/gl/GLContextProviderEGL.cpp, line 1233.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 7, mozilla::gl::GLContextEGL::
    CreateEGLPBufferOffscreenContext (
    display=std::shared_ptr<mozilla::gl::EglDisplay> (use count 2, weak count 2)
    = {...}, desc=..., size=..., 
    out_failureId=out_failureId@entry=0x7f1faed1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1233
1233        const mozilla::gfx::IntSize& size, nsACString* const
    out_failureId) {
(gdb) b GLContextEGL::CreateEGLPBufferOffscreenContextImpl
Breakpoint 8 at 0x7ff1133160: file gfx/gl/GLContextProviderEGL.cpp, line 1185.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 8, mozilla::gl::GLContextEGL::
    CreateEGLPBufferOffscreenContextImpl (
    egl=std::shared_ptr<mozilla::gl::EglDisplay> (use count 3, weak count 2) =
    {...}, desc=..., size=..., useGles=useGles@entry=false, 
    out_failureId=out_failureId@entry=0x7f1faed1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1185
1185        nsACString* const out_failureId) {
(gdb) n
1186      const EGLConfig config = ChooseConfig(*egl, desc, useGles);
(gdb) 
1187      if (config == EGL_NO_CONFIG) {
(gdb) 
1193      if (GLContext::ShouldSpew()) {
(gdb) 
1197      mozilla::gfx::IntSize pbSize(size);
(gdb) 
1307    include/c++/8.3.0/bits/shared_ptr_base.h: No such file or directory.
(gdb) 
1208      if (!surface) {
(gdb) 
1214      auto fullDesc = GLContextDesc{desc};
(gdb) 
1215      fullDesc.isOffscreen = true;
(gdb) 
1217          egl, fullDesc, config, surface, useGles, out_failureId);
(gdb) b GLContextEGL::CreateGLContext
Breakpoint 9 at 0x7ff1132548: file gfx/gl/GLContextProviderEGL.cpp, line 618.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 9, mozilla::gl::GLContextEGL::
    CreateGLContext (egl=std::shared_ptr<mozilla::gl::EglDisplay>
    (use count 4, weak count 2) = {...}, desc=...,
    config=config@entry=0x55558fc450, surface=surface@entry=0x7ee0008f40,
    useGles=useGles@entry=false, out_failureId=out_failureId@entry=0x7f1faed1c8)
    at gfx/gl/GLContextProviderEGL.cpp:618
618         nsACString* const out_failureId) {
(gdb) n
621       std::vector<EGLint> required_attribs;
(gdb) 
We're getting down into the depths now. It's surprisingly thrilling to be seeing this code again. I recall that this GLContextEGL::CreateGLContext() method is where a lot of the action happens.

But my head is full and this feels like a good place to leave things. Inside this method might be the right place to initialise mSwapChain, but it's definitely not happening here.

Tomorrow I'll do a sweep of the other code to check whether any attempt is being made to initialise it somewhere else. If not I'll add in some initialisation code to see what happens.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
21 Feb 2024 : Day 163 #
We're making good progress with the WebView rendering pipeline. The first issue to fix, which we've been looking at for the last couple of days, has been ensuring the layer manager is of the Client type, rather than the WebRender type. There's a new WEBRENDER_SOFTWARE feature that was introduced between ESR 78 and ESR 91 which is causing the trouble. In previous builds we disabled the WEBRENDER feature, but now with the new feature it's being enabled again. We need to ensure it's not enabled.

So the key questions to answer today are: how was WEBRENDER being disabled on ESR 78; and can we do something equivalent for WEBRENDER_SOFTWARE on ESR 91.

In the gfxConfigManager.cpp file there are a couple of encouraging looking methods called gfxConfigManager::ConfigureWebRender() and gfxConfigManager::ConfigureWebRenderSoftware(). These enable and disable the web renderer and software web renderer features respectively. Unsurprisingly, the latter is a new method for ESR 91, but the former is available in both ESR 78 and ESR 91, so I'll concentrate on that one first.

When looking at the code in these we also need to refer back to the initialisation method, because that's where some key variables are being created:
void gfxConfigManager::Init() {
[...]
  mFeatureWr = &gfxConfig::GetFeature(Feature::WEBRENDER);
[...]
  mFeatureWrSoftware = &gfxConfig::GetFeature(Feature::WEBRENDER_SOFTWARE);
[...]
So these two variables, mFeatureWr and mFeatureWrSoftware, are feature objects which we can then use to enable and disable various features.

In ESR 78 the logic for whether mFeatureWr should be enabled or not is serpentine. I'm not going to try to work through it by hand, rather I'll set the debugger on it and see which way it slithers.

Happily my debug session is still running from yesterday (I think it's been running for three days now), so I can continue straight with that. I'll include the full step-through, but there's a lot of it so don't feel you have to follow along; I'll summarise the important parts afterwards.
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b gfxConfigManager::ConfigureWebRender
Breakpoint 5 at 0x7fb90a8d88: file gfx/config/gfxConfigManager.cpp, line 194.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::gfx::gfxConfigManager::
    ConfigureWebRender (this=this@entry=0x7fa7972598)
    at gfx/config/gfxConfigManager.cpp:194
194     void gfxConfigManager::ConfigureWebRender() {
(gdb) n
206       mFeatureWrCompositor->SetDefaultFromPref("gfx.webrender.compositor",
    true,
(gdb) n
209       if (mWrCompositorForceEnabled) {
(gdb) n
213       ConfigureFromBlocklist(nsIGfxInfo::FEATURE_WEBRENDER_COMPOSITOR,
(gdb) n
219       if (!mHwStretchingSupport && mScaledResolution) {
(gdb) n
225       bool guardedByQualifiedPref = ConfigureWebRenderQualified();
(gdb) n
300     obj-build-mer-qt-xr/dist/include/nsTStringRepr.h: No such file or directory.
(gdb) p *mFeatureWr
$15 = {mDefault = {mMessage = '\000' <repeats 63 times>, mStatus =
    mozilla::gfx::FeatureStatus::Unused}, mUser = {mMessage = '\000'
    <repeats 63 times>, mStatus = mozilla::gfx::FeatureStatus::Unused},
    mEnvironment = {mMessage = '\000' <repeats 63 times>,
    mStatus = mozilla::gfx::FeatureStatus::Unused}, mRuntime = {mMessage =
    '\000' <repeats 63 times>, mStatus = mozilla::gfx::FeatureStatus::Unused}, 
  mFailureId = {<nsTSubstring<char>> = {<mozilla::detail::nsTStringRepr<char>> =
    {mData = 0x7fbc7d4f42 <gNullChar> "", mLength = 0, mDataFlags =
    mozilla::detail::StringDataFlags::TERMINATED, mClassFlags =
    mozilla::detail::StringClassFlags::NULL_TERMINATED}, 
      static kMaxCapacity = 2147483637}, <No data fields>}}
(gdb) p mFeatureWr->GetValue()
$16 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWr->IsEnabled()
$17 = false
(gdb) p mFeatureWr->mDefault.mStatus
$30 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWr->mRuntime.mStatus
$31 = mozilla::gfx::FeatureStatus::Unused
(gdb) n
235       if (mWrEnvForceEnabled) {
(gdb) p mWrEnvForceEnabled
$18 = false
(gdb) n
237       } else if (mWrForceEnabled) {
(gdb) p mWrForceEnabled
$19 = false
(gdb) n
239       } else if (mFeatureWrQualified->IsEnabled()) {
(gdb) p mFeatureWrQualified->IsEnabled()
$20 = false
(gdb) n
253       if (mWrForceDisabled ||
(gdb) p mWrForceDisabled
$21 = false
(gdb) p mWrEnvForceDisabled
$22 = false
(gdb) p mWrQualifiedOverride.isNothing()
Cannot evaluate function -- may be inlined
(gdb) n
261       if (!mFeatureHwCompositing->IsEnabled()) {
(gdb) n
268       if (mSafeMode) {
(gdb) n
276       if (mIsWindows && !mIsWin10OrLater && !mDwmCompositionEnabled) {
(gdb) p mIsWindows
$23 = false
(gdb) p mIsWin10OrLater
$24 = false
(gdb) p mDwmCompositionEnabled
$25 = true
(gdb) n
283           NS_LITERAL_CSTRING("FEATURE_FAILURE_DEFAULT_OFF"));
(gdb) n
285       if (mFeatureD3D11HwAngle && mWrForceAngle) {
(gdb) n
301       if (!mFeatureWr->IsEnabled() && mDisableHwCompositingNoWr) {
(gdb) p mFeatureWr->IsEnabled()
$26 = false
(gdb) p mDisableHwCompositingNoWr
$27 = false
(gdb) n
324           NS_LITERAL_CSTRING("FEATURE_FAILURE_DEFAULT_OFF"));
(gdb) n
326       if (mWrDCompWinEnabled) {
(gdb) n
334       if (!mWrPictureCaching) {
(gdb) n
340       if (!mFeatureWrDComp->IsEnabled() && mWrCompositorDCompRequired) {
(gdb) n
348       if (mWrPartialPresent) {
(gdb) n
gfxPlatform::InitWebRenderConfig (this=<optimized out>)
    at gfx/thebes/gfxPlatform.cpp:2733
2733      if (Preferences::GetBool("gfx.webrender.program-binary-disk", false)) {
(gdb) c
[...]
That's a bit too much detail there, but the key conclusion is that mFeatureWr (which represents the state of the WEBRENDER feature) starts off disabled and the value is never changed. So by the end of the gfxConfigManager::ConfigureWebRender() method the feature remains disabled. It's not changed anywhere else and so we're left with our layer manager being created as a Client layer manager, which is what we need.

We can see that it's set to disabled from the following sequence, copied from the full debugging session above:
(gdb) p mFeatureWr->IsEnabled()
$17 = false
(gdb) p mFeatureWr->mDefault.mStatus
$30 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWr->mRuntime.mStatus
$31 = mozilla::gfx::FeatureStatus::Unused
Features are made from multiple layers of states. Each layer can be either set or unused. To determine the state of a feature each layer is examined in order until one of them is set to something other than Unused. The first layer that isn't Unused provides the actual state of the feature.

The layers, each of which carries its own mStatus value, are the following:
  1. mRuntime
  2. mUser
  3. mEnvironment
  4. mDefault
The mDefault layer provides a backstop: if all other layers are Unused then whatever value the mDefault layer takes is the value of the feature (even if that value is Unused).

So, to summarise and bring all this together, the mFeatureWr feature is enabled if all of the following hold:
  1. mFeatureWr->mDefault.mStatus is set to anything other than Unused.
  2. The first layer in the order above with an mStatus other than Unused (which may be mDefault itself) is set to either Available or ForceEnabled.
Looking at the values from the debugging session above, we can therefore see exactly why mFeatureWr->IsEnabled() is returning false: it's simply never had any other value set on it.
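The layering is easier to see written out as code. What follows is a deliberately simplified model of my own, not Gecko's actual feature class: the FeatureModel and Layer names are made up, it only uses the four layer fields visible in the debugger dump of *mFeatureWr, and the lookup order is an assumption based on the description above.

```cpp
// Simplified model for illustration only; not Gecko's actual feature class.
enum class FeatureStatus { Unused, Available, ForceEnabled, Disabled };

struct Layer {
  FeatureStatus mStatus = FeatureStatus::Unused;
};

struct FeatureModel {
  // The four layers visible in the debugger dump of *mFeatureWr.
  Layer mRuntime;
  Layer mUser;
  Layer mEnvironment;
  Layer mDefault;

  // Walk the layers in order; the first one that isn't Unused wins,
  // with mDefault acting as the backstop.
  FeatureStatus GetValue() const {
    const Layer* ordered[] = {&mRuntime, &mUser, &mEnvironment};
    for (const Layer* layer : ordered) {
      if (layer->mStatus != FeatureStatus::Unused) {
        return layer->mStatus;
      }
    }
    return mDefault.mStatus;
  }

  // Enabled only if a default has been set and the resolved value is
  // one of the two "success" states.
  bool IsEnabled() const {
    if (mDefault.mStatus == FeatureStatus::Unused) {
      return false;
    }
    const FeatureStatus value = GetValue();
    return value == FeatureStatus::Available ||
           value == FeatureStatus::ForceEnabled;
  }
};
```

A freshly constructed FeatureModel has every layer Unused, so IsEnabled() returns false, which matches what the debugger reports for mFeatureWr on ESR 78. Setting mDefault.mStatus to Available flips it to enabled, and setting mRuntime.mStatus to Disabled on top of that turns it back off again.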

Now we need to compare this to the process for ESR 91. Before we get into it, it's worth noting that the WEBRENDER feature in ESR 91 is also (correctly) disabled, so we may not see any big differences here with this. Let's see.

Again, I can continue with the debugging session I've been running for the last few days:
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b gfxConfigManager::ConfigureWebRender
Breakpoint 9 at 0x7ff138d708: file gfx/config/gfxConfigManager.cpp, line 215.
(gdb) b gfxConfigManager::ConfigureWebRenderSoftware
Breakpoint 10 at 0x7ff138d41c: file gfx/config/gfxConfigManager.cpp, line 125.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 9, mozilla::gfx::gfxConfigManager::
    ConfigureWebRender (this=this@entry=0x7fd7da72f8)
    at gfx/config/gfxConfigManager.cpp:215
215     void gfxConfigManager::ConfigureWebRender() {
(gdb) p mFeatureWr->IsEnabled()
$13 = false
(gdb) p mFeatureWr->mDefault.mStatus
$14 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWr->mRuntime.mStatus
$15 = mozilla::gfx::FeatureStatus::Unused
So as we go in to the ConfigureWebRender() method the value is set to disabled. This is the same as for ESR 78.
(gdb) n
230       mFeatureWrCompositor->SetDefaultFromPref("gfx.webrender.compositor",
    true,
(gdb)
233       if (mWrCompositorForceEnabled) {
(gdb)
237       ConfigureFromBlocklist(nsIGfxInfo::FEATURE_WEBRENDER_COMPOSITOR,
(gdb)
243       if (!mHwStretchingSupport.IsFullySupported() && mScaledResolution) {
(gdb)
253       ConfigureWebRenderSoftware();
(gdb) n
At this point we're jumping in to the ConfigureWebRenderSoftware() method. We're going to continue into it, since we're interested to know what happens there. But it's worth noting that this is a departure from what happens on ESR 78.
Thread 7 "GeckoWorkerThre" hit Breakpoint 10, mozilla::gfx::gfxConfigManager::
    ConfigureWebRenderSoftware (this=this@entry=0x7fd7da72f8)
    at gfx/config/gfxConfigManager.cpp:125
125     void gfxConfigManager::ConfigureWebRenderSoftware() {
(gdb) p mFeatureWrSoftware->IsEnabled()
$16 = false
(gdb) p mFeatureWrSoftware->mDefault.mStatus
$17 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWrSoftware->mDefault.mStatus
$18 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWrSoftware->mRuntime.mStatus
$19 = mozilla::gfx::FeatureStatus::Unused
Going in we also see that the mFeatureWrSoftware feature is disabled.
(gdb) n
128       mFeatureWrSoftware->EnableByDefault();
(gdb) n
134       if (mWrSoftwareForceEnabled) {
(gdb) p mFeatureWrSoftware->IsEnabled()
$20 = true
(gdb) p mFeatureWrSoftware->mDefault.mStatus
$21 = mozilla::gfx::FeatureStatus::Available
(gdb) p mFeatureWrSoftware->mRuntime.mStatus
$22 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWrSoftware->mUser.mStatus
$23 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWrSoftware->mEnvironment.mStatus
$24 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWrSoftware->mDefault.mStatus
$25 = mozilla::gfx::FeatureStatus::Available
But this is immediately switched to being enabled; in this case set as having a default value of Available. So far there have been no conditions on the execution, so we're guaranteed to reach this state every time. Let's continue.
(gdb) p mWrSoftwareForceEnabled
$33 = false
(gdb) n
136       } else if (mWrForceDisabled || mWrEnvForceDisabled) {
(gdb) p mWrForceDisabled
$26 = false
(gdb) p mWrEnvForceDisabled
$27 = false
Here there was an opportunity to disable the feature if either mWrForceDisabled or mWrEnvForceDisabled were set to true, but since both were set to false we skip over this possibility. This might be our way in to disabling it, so we may want to return to this. But let's continue on with the rest of the debugging for now.
(gdb) n
141       } else if (gfxPlatform::DoesFissionForceWebRender()) {
(gdb) n
145       if (!mHasWrSoftwareBlocklist) {
(gdb) p mHasWrSoftwareBlocklist
$28 = false
At this point the mHasWrSoftwareBlocklist variable is set to false which causes us to jump out of the ConfigureWebRenderSoftware() method early. So we'll return back up the stack to the ConfigureWebRender() method and continue from there.
(gdb) n
mozilla::gfx::gfxConfigManager::ConfigureWebRender
    (this=this@entry=0x7fd7da72f8)
    at gfx/config/gfxConfigManager.cpp:254
254       ConfigureWebRenderQualified();
(gdb) n
256       mFeatureWr->EnableByDefault();
(gdb) n
262       if (mWrSoftwareForceEnabled) {
(gdb) p mFeatureWr->IsEnabled()
$29 = true
(gdb) n
Here we see another change from ESR 78. The mFeatureWr feature is enabled here. We already know it's ultimately disabled so we should keep an eye out for where that happens.
266       } else if (mWrEnvForceEnabled) {
(gdb) 
268       } else if (mWrForceDisabled || mWrEnvForceDisabled) {
(gdb)
275       } else if (mWrForceEnabled) {
(gdb) p mWrForceEnabled
$30 = false
(gdb) n
279       if (!mFeatureWrQualified->IsEnabled()) {
(gdb) p mFeatureWrQualified->IsEnabled()
$31 = false
(gdb) n
282         mFeatureWr->Disable(FeatureStatus::Disabled, "Not qualified",
(gdb) n
287       if (!mFeatureHwCompositing->IsEnabled()) {
(gdb) p mFeatureWr->IsEnabled()
$32 = false
So here it gets disabled again and the reason is that mFeatureWrQualified is disabled. Here's the comment text that goes alongside this in the code (the debugger skips these comments):
    // No qualified hardware. If we haven't allowed software fallback,
    // then we need to disable WR.
So we'll end up with this being disabled whatever happens. There's not much to see in the remainder of the method, but let's skip through the rest of the steps for completeness.
(gdb) n
293       if (mSafeMode) {
(gdb) n
302       if (mXRenderEnabled) {
(gdb) n
312       mFeatureWrAngle->EnableByDefault();
(gdb) n
313       if (mFeatureD3D11HwAngle) {
(gdb) n
335         mFeatureWrAngle->Disable(FeatureStatus::Unavailable,
    "OS not supported",
(gdb) n
339       if (mWrForceAngle && mFeatureWr->IsEnabled() &&
(gdb) n
347       if (!mFeatureWr->IsEnabled() && mDisableHwCompositingNoWr) {
(gdb) n
367       mFeatureWrDComp->EnableByDefault();
(gdb) n
368       if (!mWrDCompWinEnabled) {
(gdb) n
369         mFeatureWrDComp->UserDisable("User disabled via pref",
(gdb) n
373       if (!mIsWin10OrLater) {
(gdb) n
375         mFeatureWrDComp->Disable(FeatureStatus::Unavailable,
(gdb) n
380       if (!mIsNightly) {
(gdb) n
383         nsAutoString adapterVendorID;
(gdb) n
384         mGfxInfo->GetAdapterVendorID(adapterVendorID);
(gdb) n
385         if (adapterVendorID == u"0x10de") {
(gdb) n
383         nsAutoString adapterVendorID;
(gdb) n
396       mFeatureWrDComp->MaybeSetFailed(
(gdb) n
399       mFeatureWrDComp->MaybeSetFailed(mFeatureWrAngle->IsEnabled(),
(gdb) n
403       if (!mFeatureWrDComp->IsEnabled() && mWrCompositorDCompRequired) {
(gdb) n
411       if (mWrPartialPresent) {
(gdb) n
654     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/StaticPrefList_gfx.h:
    No such file or directory.
(gdb) n
433       ConfigureFromBlocklist(nsIGfxInfo::FEATURE_WEBRENDER_SHADER_CACHE,
(gdb) n
435       if (!mFeatureWr->IsEnabled()) {
(gdb) n
436         mFeatureWrShaderCache->ForceDisable(FeatureStatus::Unavailable,
(gdb) n
441       mFeatureWrOptimizedShaders->EnableByDefault();
(gdb) n
442       if (!mWrOptimizedShaders) {
(gdb) n
446       ConfigureFromBlocklist(nsIGfxInfo::FEATURE_WEBRENDER_OPTIMIZED_SHADERS,
(gdb) n
448       if (!mFeatureWr->IsEnabled()) {
(gdb) n
449         mFeatureWrOptimizedShaders->ForceDisable(FeatureStatus::Unavailable,
(gdb) n
And we're out of the method. So that's it: we can see that mFeatureWr is disabled here, as expected. However when it comes to mFeatureWrSoftware it's a different story. The value is enabled by default; to get it disabled we'll need to ensure one of mWrForceDisabled or mWrEnvForceDisabled is set to true.
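The branch that matters for mFeatureWrSoftware can be boiled down to a few lines. This is a sketch modelled only on the conditions visible in the debugger trace, not a copy of the Gecko method. The WrSoftwareInputs struct and softwareWebRenderEnabled() function are hypothetical names, and I'm assuming that the force-enable branch leaves the feature on and the force-disable branch turns it off. The later Fission and blocklist steps aren't modelled at all.

```cpp
// Hypothetical inputs mirroring the member variables seen in the trace.
struct WrSoftwareInputs {
  bool wrSoftwareForceEnabled = false;  // mWrSoftwareForceEnabled
  bool wrForceDisabled = false;         // mWrForceDisabled
  bool wrEnvForceDisabled = false;      // mWrEnvForceDisabled
};

// Simplified model of the start of ConfigureWebRenderSoftware(): the
// feature is enabled by default, then the force flags are consulted.
// Assumption: force-enable leaves the feature on and force-disable
// turns it off.
bool softwareWebRenderEnabled(const WrSoftwareInputs& in) {
  bool enabled = true;  // stands in for mFeatureWrSoftware->EnableByDefault()
  if (in.wrSoftwareForceEnabled) {
    enabled = true;
  } else if (in.wrForceDisabled || in.wrEnvForceDisabled) {
    enabled = false;
  }
  return enabled;
}
```

With all three inputs false, as they are in the debugging session, the result is true. Either way, it comes down to getting mWrForceDisabled or mWrEnvForceDisabled set.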

Both of these are set in the initialisation method, like this:
void gfxConfigManager::Init() {
[...]
  mWrForceDisabled = StaticPrefs::gfx_webrender_force_disabled_AtStartup();
[...]
  mWrEnvForceDisabled = gfxPlatform::WebRenderEnvvarDisabled();
[...]
Here's the code that creates the former:
ONCE_PREF(
  "gfx.webrender.force-disabled",
   gfx_webrender_force_disabled,
   gfx_webrender_force_disabled_AtStartup,
  bool, false
)
That's from the autogenerated obj-build-mer-qt-xr/modules/libpref/init/StaticPrefList_gfx.h file. This is being generated from the gecko-dev/modules/libpref/init/StaticPrefList.yaml file, the relevant part of which looks like this:
# Also expose a pref to allow users to force-disable WR. This is exposed
# on all channels because WR can be enabled on qualified hardware on all
# channels.
- name: gfx.webrender.force-disabled
  type: bool
  value: false
  mirror: once
The latter is set using an environment variable:
/*static*/
bool gfxPlatform::WebRenderEnvvarDisabled() {
  const char* env = PR_GetEnv("MOZ_WEBRENDER");
  return (env && *env == '0');
}
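The environment variable test is simple enough to try out in isolation. Here's a stand-alone sketch that uses std::getenv from the C++ standard library in place of NSPR's PR_GetEnv. The isDisabledValue() and webRenderEnvvarDisabled() names are my own, chosen so that the comparison can be exercised without touching the real environment.

```cpp
#include <cstdlib>

// Same test as WebRenderEnvvarDisabled(): the variable must be set and
// its first character must be '0'.
bool isDisabledValue(const char* env) {
  return env && *env == '0';
}

// Hypothetical wrapper using std::getenv in place of NSPR's PR_GetEnv.
bool webRenderEnvvarDisabled() {
  return isDisabledValue(std::getenv("MOZ_WEBRENDER"));
}
```

One consequence of only looking at the first character is that a value such as "00" counts as disabling it just as "0" does, whereas an unset or empty variable doesn't.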
Okay, we've reached the end of this piece of investigation. What's clear is that there may not be any Sailfish-specific code for disabling the web render layer manager because it's being disabled by default anyway.

For the software web render layer manager we could set the MOZ_WEBRENDER environment variable to 0 to force it to be disabled and this will be handy for testing. But in the longer term we should probably put some code into sailfish-browser to explicitly set the gfx.webrender.force-disabled static preference to true.

As I look in to this I discover something surprising. Even though web render is disabled by default, doing some grepping around the code threw the following up in the sailfish-browser code:
void DeclarativeWebUtils::setRenderingPreferences()
{
    SailfishOS::WebEngineSettings *webEngineSettings =
        SailfishOS::WebEngineSettings::instance();

    // Use external Qt window for rendering content
    webEngineSettings->setPreference(
        QString("gfx.compositor.external-window"), QVariant(true));
    webEngineSettings->setPreference(
        QString("gfx.compositor.clear-context"), QVariant(false));
    webEngineSettings->setPreference(
        QString("gfx.webrender.force-disabled"), QVariant(true));
    webEngineSettings->setPreference(
        QString("embedlite.compositor.external_gl_context"), QVariant(true));
}
This is fine for the browser, but it's not going to get executed for the WebView, so I'll need to set this in WebEngineSettings::initialize() as well. Thankfully, making this change turns out to be pretty straightforward:
diff --git a/lib/webenginesettings.cpp b/lib/webenginesettings.cpp
index de9e4b86..13b21d5b 100644
--- a/lib/webenginesettings.cpp
+++ b/lib/webenginesettings.cpp
@@ -110,6 +110,10 @@ void SailfishOS::WebEngineSettings::initialize()
     engineSettings->setPreference(QStringLiteral("intl.accept_languages"),
                                   QVariant::fromValue<QString>(langs));
 
+    // Ensure the web renderer is disabled
+    engineSettings->setPreference(QStringLiteral("gfx.webrender.force-disabled"),
+                                  QVariant(true));
+
     Silica::Theme *silicaTheme = Silica::Theme::instance();
 
     // Notify gecko when the ambience switches between light and dark
As well as this change I also had to amend the rawwebview.cpp file to accommodate some of the API changes I made earlier to gecko. I guess I've not built the sailfish-components-webview packages recently or this would have come up. Nevertheless the fix isn't anything too dramatic:
diff --git a/import/webview/rawwebview.cpp b/import/webview/rawwebview.cpp
index 1b1bb92a..2eab77f5 100644
--- a/import/webview/rawwebview.cpp
+++ b/import/webview/rawwebview.cpp
@@ -37,7 +37,7 @@ public:
     ViewCreator();
     ~ViewCreator();
 
-    quint32 createView(const quint32 &parentId, const uintptr_t &parentBrowsingContext) override;
+    quint32 createView(const quint32 &parentId, const uintptr_t &parentBrowsingContext, bool hidden) override;
 
     static std::shared_ptr<ViewCreator> instance();
 
@@ -54,9 +54,10 @@ ViewCreator::~ViewCreator()
     SailfishOS::WebEngine::instance()->setViewCreator(nullptr);
 }
 
-quint32 ViewCreator::createView(const quint32 &parentId, const uintptr_t &parentBrowsingContext)
+quint32 ViewCreator::createView(const quint32 &parentId, const uintptr_t &parentBrowsingContext, bool hidden)
 {
     Q_UNUSED(parentBrowsingContext)
+    Q_UNUSED(hidden)
 
     for (RawWebView *view : views) {
         if (view->uniqueId() == parentId) {
Having fixed all this, I've built and transferred the new packages over to my phone. Now when I run the harbour-webview example app I get something quite different to the crash we were seeing before:
[defaultuser@Xperia10III gecko]$ harbour-webview 
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[D] main:30 - WebView Example
[D] main:44 - Using default start URL:  "https://www.flypig.co.uk/search/"
[D] main:47 - Opening webview
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
library "libcutils.so" not found
library "libhardware.so" not found
library "android.hardware.graphics.mapper@2.0.so" not found
library "android.hardware.graphics.mapper@2.1.so" not found
library "android.hardware.graphics.mapper@3.0.so" not found
library "android.hardware.graphics.mapper@4.0.so" not found
library "libc++.so" not found
library "libhidlbase.so" not found
library "libgralloctypes.so" not found
library "android.hardware.graphics.common@1.2.so" not found
library "libion.so" not found
library "libz.so" not found
library "libhidlmemory.so" not found
library "android.hidl.memory@1.0.so" not found
library "vendor.qti.qspmhal@1.0.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[W] unknown:7 - file:///usr/share/harbour-webview/qml/harbour-webview.qml:7:30:
    Type WebViewPage unavailable 
         initialPage: Component { WebViewPage { } } 
                                  ^
[W] unknown:13 - file:///usr/share/harbour-webview/qml/pages/
    WebViewPage.qml:13:5: Type WebView unavailable 
         WebView { 
         ^
[W] unknown:141 - file:///usr/lib64/qt5/qml/Sailfish/WebView/WebView.qml:141:9:
    Type TextSelectionController unavailable 
             TextSelectionController { 
             ^
[W] unknown:14 - file:///usr/lib64/qt5/qml/Sailfish/WebView/Controls/
    TextSelectionController.qml:14:1: module "QOfono" is not installed 
     import QOfono 0.2 
     ^
Created LOG for EmbedLite
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
JSComp: EmbedLiteGlobalHelper.js loaded
EmbedLiteGlobalHelper app-startup
JSComp: EmbedLiteOrientationChangeHandler.js loaded
JSComp: EmbedLiteSearchEngine.js loaded
JSComp: EmbedLiteSyncService.js loaded
EmbedLiteSyncService app-startup
JSComp: EmbedLiteWebrtcUI.js: loaded
JSComp: EmbedLiteWebrtcUI.js: got app-startup
JSComp: EmbedPrefService.js loaded
EmbedPrefService app-startup
JSComp: EmbedliteDownloadManager.js loaded
JSComp: LoginsHelper.js loaded
JSComp: PrivateDataManager.js loaded
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change" {file:
    "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12

Created LOG for EmbedPrefs
No crash, several errors, but (of course) still a blank screen: no actual rendering is taking place. But this is still really good progress. The WebView application, which was completely crashing before, is now running, just not rendering. That means we now have the opportunity to debug and fix it. One more step forwards.

I'll look into the rendering more tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
20 Feb 2024 : Day 162 #
Yesterday we were looking into the WebView rendering pipeline. We got to the point where we had a backtrace showing the flow that resulted in a WebRender layer manager being created, when the EmbedLite code was expecting a Client layer manager. The consequence was that the EmbedLite code forcefully killed itself.

That was on ESR 91. Today I want to find the equivalent flow on ESR 78 to see how it differs. To do this I need to first install the same harbour-webview-example code that I'm using for testing on my ESR 78 device. Then set it off with the debugger:
$ gdb harbour-webview
[...]
(gdb) b nsBaseWidget::CreateCompositorSession
Function "nsBaseWidget::CreateCompositorSession" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (nsBaseWidget::CreateCompositorSession) pending.
(gdb) r
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 1, nsBaseWidget::
    CreateCompositorSession (this=this@entry=0x7f8ccbf3d0, aWidth=1080,
    aHeight=2520, aOptionsOut=aOptionsOut@entry=0x7fa7972ac0)
    at widget/nsBaseWidget.cpp:1176
1176        int aWidth, int aHeight, CompositorOptions* aOptionsOut) {
(gdb) n
1180        CreateCompositorVsyncDispatcher();
(gdb) n
1182        gfx::GPUProcessManager* gpu = gfx::GPUProcessManager::Get();
(gdb) n
1186        gpu->EnsureGPUReady();
(gdb) n
67      obj-build-mer-qt-xr/dist/include/mozilla/StaticPtr.h:
    No such file or directory.
(gdb) n
1193        bool enableAPZ = UseAPZ();
(gdb) n
1194        CompositorOptions options(enableAPZ, enableWR);
(gdb) n
1198        bool enableAL =
(gdb) n
1203        options.SetUseWebGPU(StaticPrefs::dom_webgpu_enabled());
(gdb) n
50      obj-build-mer-qt-xr/dist/include/mozilla/layers/CompositorOptions.h:
    No such file or directory.
(gdb) n
1210        options.SetInitiallyPaused(CompositorInitiallyPaused());
(gdb) n
53      obj-build-mer-qt-xr/dist/include/mozilla/layers/CompositorOptions.h:
    No such file or directory.
(gdb) 
39      in obj-build-mer-qt-xr/dist/include/mozilla/layers/CompositorOptions.h
(gdb) 
1217          lm = new ClientLayerManager(this);
(gdb) p enableWR
$1 = false
(gdb) p enableAPZ
$2 = <optimized out>
(gdb) p enableAL
$3 = <optimized out>
(gdb) p gfx::gfxConfig::IsEnabled(gfx::Feature::ADVANCED_LAYERS)
$4 = false
(gdb) p mFissionWindow
$5 = false
(gdb) p StaticPrefs::layers_advanced_fission_enabled()
No symbol "layers_advanced_fission_enabled" in namespace "mozilla::StaticPrefs".
(gdb) p StaticPrefs::dom_webgpu_enabled()
$6 = false
(gdb) p options.UseWebRender()
Cannot evaluate function -- may be inlined
(gdb) p options
$7 = {mUseAPZ = true, mUseWebRender = false, mUseAdvancedLayers = false,
    mUseWebGPU = false, mInitiallyPaused = false}
(gdb) 
As we can see, on ESR 78 things are different: the options.mUseWebRender field is set to false, whereas on ESR 91 it's set to true. What's feeding into these values?

The options structure and its functionality are defined in CompositorOptions.h. Checking through the code there we can see that mUseWebRender is set at initialisation, either to the default value of false if the default constructor is used, or to an explicit value if the following constructor overload is used:
  CompositorOptions(bool aUseAPZ, bool aUseWebRender,
                    bool aUseSoftwareWebRender)
      : mUseAPZ(aUseAPZ),
        mUseWebRender(aUseWebRender),
        mUseSoftwareWebRender(aUseSoftwareWebRender) {
    MOZ_ASSERT_IF(aUseSoftwareWebRender, aUseWebRender);
  }
It's never changed after that. So going back to our nsBaseWidget::CreateCompositorSession() code, the only part we need to concern ourselves with is the value that's passed in to the constructor.
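The MOZ_ASSERT_IF() line in that constructor reads as a logical implication: software WebRender can only be requested if WebRender itself is requested. Here's a minimal sketch of the same set-once pattern. Note that OptionsSketch is a hypothetical stand-in I've made up for illustration, not the gecko class, and it throws an exception where gecko would assert, purely so the behaviour is easy to check.

```cpp
#include <stdexcept>

// Hypothetical stand-in for gecko's CompositorOptions, for illustration only.
class OptionsSketch {
 public:
  OptionsSketch(bool aUseAPZ, bool aUseWebRender, bool aUseSoftwareWebRender)
      : mUseAPZ(aUseAPZ),
        mUseWebRender(aUseWebRender),
        mUseSoftwareWebRender(aUseSoftwareWebRender) {
    // Same implication as MOZ_ASSERT_IF(aUseSoftwareWebRender, aUseWebRender):
    // asking for software WebRender without WebRender is rejected.
    if (aUseSoftwareWebRender && !aUseWebRender) {
      throw std::invalid_argument("software WebRender requires WebRender");
    }
  }

  // Getters only, so these values can't change after construction.
  bool UseAPZ() const { return mUseAPZ; }
  bool UseWebRender() const { return mUseWebRender; }
  bool UseSoftwareWebRender() const { return mUseSoftwareWebRender; }

 private:
  const bool mUseAPZ;
  const bool mUseWebRender;
  const bool mUseSoftwareWebRender;
};
```

Constructing OptionsSketch(true, true, true) mirrors the values seen on ESR 91, while OptionsSketch(true, false, false) mirrors ESR 78.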

For both ESR 78 and ESR 91, the value that's passed in is that of the local enableWR variable. The logic for this value is really straightforward for ESR 78:
    bool enableWR =
        gfx::gfxVars::UseWebRender() && WidgetTypeSupportsAcceleration();
Let's find out how this value is being set:
(gdb) p WidgetTypeSupportsAcceleration()
$8 = true
(gdb) p gfx::gfxVars::UseWebRender()
Cannot evaluate function -- may be inlined
We can't call the UseWebRender() method directly, but we can extract the value it would return by digging into the data structures. This is all following from the code in gfxVars.h:
(gdb) p gfx::gfxVars::sInstance.mRawPtr.mVarUseWebRender.mValue
$11 = false
That's useful, but it doesn't tell us everything we need to know. The next step is to find out where and why this value is being set to false.
$ grep -rIn "gfxVars::SetUseWebRender(" * --include="*.cpp"
gecko-dev/gfx/thebes/gfxPlatform.cpp:2750:    gfxVars::SetUseWebRender(true);
gecko-dev/gfx/thebes/gfxPlatform.cpp:3297:    gfxVars::SetUseWebRender(false);
gecko-dev/gfx/ipc/GPUProcessManager.cpp:479:  gfx::gfxVars::SetUseWebRender(false);
These are being set in gfxPlatform::InitWebRenderConfig(), gfxPlatform::NotifyGPUProcessDisabled() and GPUProcessManager::DisableWebRender() respectively.

Let's find out which is responsible.
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) break gfxPlatform::InitWebRenderConfig
Breakpoint 2 at 0x7fb9013328: file gfx/thebes/gfxPlatform.cpp, line 2691.
(gdb) b gfxPlatform::NotifyGPUProcessDisabled
Breakpoint 3 at 0x7fb9016fb0: file gfx/thebes/gfxPlatform.cpp, line 3291.
(gdb) b GPUProcessManager::DisableWebRender
Breakpoint 4 at 0x7fb907f858: GPUProcessManager::DisableWebRender. (3 locations)
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 2, gfxPlatform::InitWebRenderConfig
    (this=0x7f8c8bbf60)
    at gfx/thebes/gfxPlatform.cpp:2691
2691    void gfxPlatform::InitWebRenderConfig() {
(gdb) n
2692      bool prefEnabled = WebRenderPrefEnabled();
(gdb) n
2693      bool envvarEnabled = WebRenderEnvvarEnabled();
(gdb) n
2698      gfxVars::AddReceiver(&nsCSSProps::GfxVarReceiver());
(gdb) n
2708      ScopedGfxFeatureReporter reporter("WR", prefEnabled || envvarEnabled);
(gdb) n
2709      if (!XRE_IsParentProcess()) {
(gdb) n
2723      gfxConfigManager manager;
(gdb) n
2725      manager.ConfigureWebRender();
(gdb) n
2733      if (Preferences::GetBool("gfx.webrender.program-binary-disk", false)) {
(gdb) n
2738      if (StaticPrefs::gfx_webrender_use_optimized_shaders_AtStartup()) {
(gdb) n
2739        gfxVars::SetUseWebRenderOptimizedShaders(
(gdb) n
2743      if (Preferences::GetBool("gfx.webrender.software", false)) {
(gdb) p gfxConfig::IsEnabled(Feature::WEBRENDER)
$12 = false
(gdb) n
2749      if (gfxConfig::IsEnabled(Feature::WEBRENDER)) {
(gdb) n
2791      if (gfxConfig::IsEnabled(Feature::WEBRENDER_COMPOSITOR)) {
(gdb) p gfxConfig::IsEnabled(Feature::WEBRENDER_COMPOSITOR)
$13 = false
(gdb) n
2795      Telemetry::ScalarSet(
(gdb) n
2799      if (gfxConfig::IsEnabled(Feature::WEBRENDER_PARTIAL)) {
(gdb) n
2805      gfxVars::SetUseGLSwizzle(
(gdb) n
2810      gfxUtils::RemoveShaderCacheFromDiskIfNecessary();
(gdb) r
[...]
No other breakpoints are hit. So as we can see here, on ESR 78 the value for UseWebRender() is left as the default value of false. The reason for this is that gfxConfig::IsEnabled(Feature::WEBRENDER) is returning false. We might need to investigate further where this Feature::WEBRENDER configuration value is coming from or being set, but let's switch to ESR 91 now to find out how things are happening there.

The value of enableWR has a much more complex derivation in ESR 91 compared to that in ESR 78. Here's the logic (note that I've simplified the code to remove the unnecessary parts):
    bool supportsAcceleration = WidgetTypeSupportsAcceleration();
    bool enableWR;
    if (supportsAcceleration ||
        StaticPrefs::gfx_webrender_unaccelerated_widget_force()) {
      enableWR = gfx::gfxVars::UseWebRender();
    } else if (gfxPlatform::DoesFissionForceWebRender() ||
               StaticPrefs::
                   gfx_webrender_software_unaccelerated_widget_allow()) {
      enableWR = gfx::gfxVars::UseWebRender();
    } else {
      enableWR = false;
    }
In practice supportsAcceleration is going to be set to true, which simplifies things and brings us back to this condition:
      enableWR = gfx::gfxVars::UseWebRender();
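To make the difference between the two versions easier to see, here's a sketch that recasts both decisions as plain functions of their inputs. The WidgetInputs structure and the function names are hypothetical, invented for this illustration; only the boolean logic is taken from the excerpts above, and the ESR 91 branch follows the simplified version rather than the full gecko code.

```cpp
// Hypothetical bundle of the values CreateCompositorSession() consults.
struct WidgetInputs {
  bool supportsAcceleration;        // WidgetTypeSupportsAcceleration()
  bool unacceleratedForce;          // gfx_webrender_unaccelerated_widget_force()
  bool fissionForcesWebRender;      // gfxPlatform::DoesFissionForceWebRender()
  bool softwareUnacceleratedAllow;  // gfx_webrender_software_unaccelerated_widget_allow()
  bool useWebRender;                // gfx::gfxVars::UseWebRender()
};

// ESR 78: WebRender needs both the global flag and an accelerated widget.
bool EnableWebRender78(const WidgetInputs& aIn) {
  return aIn.useWebRender && aIn.supportsAcceleration;
}

// ESR 91 (simplified): the preferences can let the global flag through
// even when the widget isn't accelerated.
bool EnableWebRender91(const WidgetInputs& aIn) {
  if (aIn.supportsAcceleration || aIn.unacceleratedForce) {
    return aIn.useWebRender;
  }
  if (aIn.fissionForcesWebRender || aIn.softwareUnacceleratedAllow) {
    return aIn.useWebRender;
  }
  return false;
}
```

With supportsAcceleration set to true both functions reduce to the value of useWebRender, which is why the global flag is the thing to chase on both versions.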
Let's follow the same investigatory path that we did for ESR 78.
$ grep -rIn "gfxVars::SetUseWebRender(" * --include="*.cpp"
gecko-dev/gfx/thebes/gfxPlatform.cpp:2713:    gfxVars::SetUseWebRender(true);
gecko-dev/gfx/thebes/gfxPlatform.cpp:3435:      gfxVars::SetUseWebRender(true);
gecko-dev/gfx/thebes/gfxPlatform.cpp:3475:    gfxVars::SetUseWebRender(false);
The second of these appears in some code that's compile-time conditional on the platform being Windows, so we can ignore it. The other two appear in gfxPlatform::InitWebRenderConfig() and gfxPlatform::FallbackFromAcceleration() respectively. I'm going to go out on a limb and say that we're interested in the former, but let's check using the debugger to make sure.
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b gfxPlatform::InitWebRenderConfig
Breakpoint 7 at 0x7ff12ef954: file gfx/thebes/gfxPlatform.cpp, line 2646.
(gdb) b gfxPlatform::FallbackFromAcceleration
Breakpoint 8 at 0x7ff12f3048: file gfx/thebes/gfxPlatform.cpp, line 3381.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 7, gfxPlatform::InitWebRenderConfig
    (this=0x7fc4a48c90)
    at gfx/thebes/gfxPlatform.cpp:2646
2646    void gfxPlatform::InitWebRenderConfig() {
(gdb) n
2647      bool prefEnabled = WebRenderPrefEnabled();
(gdb) n
2648      bool envvarEnabled = WebRenderEnvvarEnabled();
(gdb)
[New LWP 27297]
2653      gfxVars::AddReceiver(&nsCSSProps::GfxVarReceiver());
(gdb) 
2663      ScopedGfxFeatureReporter reporter("WR", prefEnabled || envvarEnabled);
(gdb) 
32      ${PROJECT}/obj-build-mer-qt-xr/dist/include/gfxCrashReporterUtils.h:
    No such file or directory.
(gdb) 
2664      if (!XRE_IsParentProcess()) {
(gdb) 
2678      gfxConfigManager manager;
(gdb) 
2679      manager.Init();
(gdb) 
2680      manager.ConfigureWebRender();
(gdb) 
2682      bool hasHardware = gfxConfig::IsEnabled(Feature::WEBRENDER);
(gdb) 
2683      bool hasSoftware = gfxConfig::IsEnabled(Feature::WEBRENDER_SOFTWARE);
(gdb) 
2684      bool hasWebRender = hasHardware || hasSoftware;
(gdb) p hasHardware
$10 = false
(gdb) p hasSoftware
$11 = true
(gdb) p hasWebRender
$12 = <optimized out>
(gdb) n
2701      if (gfxConfig::IsEnabled(Feature::WEBRENDER_SHADER_CACHE)) {
(gdb) n
2705      gfxVars::SetUseWebRenderOptimizedShaders(
(gdb) n
2708      gfxVars::SetUseSoftwareWebRender(!hasHardware && hasSoftware);
(gdb) n
2712      if (hasWebRender) {
(gdb) n
2713        gfxVars::SetUseWebRender(true);
(gdb) c
[...]
So there we can see that the WebRender layer manager is being activated in ESR 91 due to Feature::WEBRENDER_SOFTWARE being enabled.

So we have a clear difference. In ESR 78 Feature::WEBRENDER is disabled, so UseWebRender() stays false. In ESR 91 Feature::WEBRENDER_SOFTWARE has been added, and having it enabled is enough for the WebRender layer manager to be used.
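The two debugger walks through InitWebRenderConfig() can be condensed into a small sketch. The GfxVarsSketch structure and both function names are hypothetical, made up for this illustration. The logic only covers the lines the debugger actually stepped over, so everything else the real method does is left out.

```cpp
// Hypothetical stand-in for the two gfxVars values of interest.
struct GfxVarsSketch {
  bool useWebRender = false;          // gfxVars::UseWebRender()
  bool useSoftwareWebRender = false;  // gfxVars::UseSoftwareWebRender()
};

// ESR 78: only Feature::WEBRENDER can switch WebRender on.
GfxVarsSketch ConfigureWebRender78(bool aFeatureWebRender) {
  GfxVarsSketch vars;
  if (aFeatureWebRender) {
    vars.useWebRender = true;
  }
  return vars;
}

// ESR 91: Feature::WEBRENDER_SOFTWARE is enough on its own.
GfxVarsSketch ConfigureWebRender91(bool aFeatureWebRender,
                                   bool aFeatureWebRenderSoftware) {
  const bool hasHardware = aFeatureWebRender;
  const bool hasSoftware = aFeatureWebRenderSoftware;
  const bool hasWebRender = hasHardware || hasSoftware;

  GfxVarsSketch vars;
  vars.useSoftwareWebRender = !hasHardware && hasSoftware;
  if (hasWebRender) {
    vars.useWebRender = true;
  }
  return vars;
}
```

Feeding in the values from the two sessions, ConfigureWebRender78(false) leaves useWebRender as false, while ConfigureWebRender91(false, true) sets both fields to true.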

This is good progress. The next step is to figure out where Feature::WEBRENDER_SOFTWARE is being set to enabled and find out how to disable it. I'll take a look at that tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
19 Feb 2024 : Day 161 #
Yesterday I was complaining about the difficulty debugging while travelling by train. Before I'd even posted the diary entry I'd received some beautiful new creations from Thigg to illustrate my experiences. I think he's captured it rather too well and it's a real joy to be able to share this creation with you.
 
A pig with wings sitting in the storage department of a train with a laptop on its lap, entangled into way too many usb cables.

This is just so great! But although this was the most representative, it wasn't my favourite of the images Thigg created. I'll be sharing some of the others at other times when I have the pleasure of enjoying train-based-development, so watch out for more!

On to a fresh day, and this morning the package I started building yesterday evening on the train has finally finished. But that's not as helpful to me as I was hoping it would be when I kicked it off. The change I made was to annotate the code with some debug output. Since then I've been able to find out all the same information using the debugger.

To recap the situation, we've been looking at WebView rendering. Currently any attempt to use the WebView will result in a crash. That's because the EmbedLite PuppetWidgetBase code, on discovering that the layer manager is of type LAYERS_WR (Web Renderer), intentionally triggers a crash. It requires the layer manager to be of type LAYERS_CLIENT to prevent this crash from happening.

So my task for today is to find out where the layer manager is being created and establish why the wrong type is being used. To get a good handle on the situation I'll also need to compare this against the same paths in ESR 78 to find out why they're different.

Looking through the code there are two obvious places where a WebRenderLayerManager is created. First there's code in PuppetWidget that looks like this:
bool PuppetWidget::CreateRemoteLayerManager(
    const std::function<bool(LayerManager*)>& aInitializeFunc) {
  RefPtr<LayerManager> lm;
  MOZ_ASSERT(mBrowserChild);
  if (mBrowserChild->GetCompositorOptions().UseWebRender()) {
    lm = new WebRenderLayerManager(this);
  } else {
    lm = new ClientLayerManager(this);
  }
[...]
Second there's some code in nsBaseWidget that looks like this (I've left some of the comments in, since they're relevant):
already_AddRefed<LayerManager> nsBaseWidget::CreateCompositorSession(
    int aWidth, int aHeight, CompositorOptions* aOptionsOut) {
[...]
    gfx::GPUProcessManager* gpu = gfx::GPUProcessManager::Get();
    // Make sure GPU process is ready for use.
    // If it failed to connect to GPU process, GPU process usage is disabled in
    // EnsureGPUReady(). It could update gfxVars and gfxConfigs.
    gpu->EnsureGPUReady();

    // If widget type does not supports acceleration, we may be allowed to use
    // software WebRender instead. If not, then we use ClientLayerManager even
    // when gfxVars::UseWebRender() is true. WebRender could coexist only with
    // BasicCompositor.
[...]
    RefPtr<LayerManager> lm;
    if (options.UseWebRender()) {
      lm = new WebRenderLayerManager(this);
    } else {
      lm = new ClientLayerManager(this);
    }
[...]
It should be pretty easy to check using the debugger whether either of these are the relevant routes when setting up the layer manager. I still have the debugging session open from yesterday:
(gdb) break nsBaseWidget.cpp:1364
Breakpoint 3 at 0x7ff2a57b64: file widget/nsBaseWidget.cpp, line 1364.
(gdb) break PuppetWidget.cpp:616
Breakpoint 4 at 0x7ff2a67d48: file widget/PuppetWidget.cpp, line 616.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Created LOG for EmbedLiteLayerManager

Thread 7 "GeckoWorkerThre" hit Breakpoint 3, nsBaseWidget::
    CreateCompositorSession (this=this@entry=0x7fc4dad520,
    aWidth=aWidth@entry=1080, aHeight=aHeight@entry=2520,
    aOptionsOut=aOptionsOut@entry=0x7fd7da7770)
    at widget/nsBaseWidget.cpp:1364
1364        options.SetInitiallyPaused(CompositorInitiallyPaused());
(gdb) n
43      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    CompositorOptions.h: No such file or directory.
(gdb) 
1369          lm = new WebRenderLayerManager(this);
(gdb) p options
$4 = {mUseAPZ = true, mUseWebRender = true, mUseSoftwareWebRender = true,
    mAllowSoftwareWebRenderD3D11 = false, mAllowSoftwareWebRenderOGL = false, 
  mUseAdvancedLayers = false, mUseWebGPU = false, mInitiallyPaused = false}
(gdb) 
The options structure is really clean and it's helpful to be able to see all of the contents like this.

So we now know that the Web Render version of the layer manager is being created in nsBaseWidget::CreateCompositorSession(). There are two questions that immediately spring to mind: first, if the Client version of the layer manager were being created at this point, would it fix things? Second, is it possible to run with the Web Render layer manager instead?

I also want to know exactly what inputs are being used to decide which type of layer manager to use. Stepping through nsBaseWidget::CreateCompositorSession() is likely to help with this, so let's give that a go.
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) break nsBaseWidget::CreateCompositorSession
Breakpoint 5 at 0x7ff2a578f8: file widget/nsBaseWidget.cpp, line 1308.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]

Thread 7 "GeckoWorkerThre" hit Breakpoint 5, nsBaseWidget::
    CreateCompositorSession (this=this@entry=0x7fc4db8a30,
    aWidth=aWidth@entry=1080, aHeight=aHeight@entry=2520,
    aOptionsOut=aOptionsOut@entry=0x7fd7da7770)
    at widget/nsBaseWidget.cpp:1308
1308        int aWidth, int aHeight, CompositorOptions* aOptionsOut) {
(gdb) n
1312        CreateCompositorVsyncDispatcher();
(gdb) n
1314        gfx::GPUProcessManager* gpu = gfx::GPUProcessManager::Get();
(gdb) n
1318        gpu->EnsureGPUReady();
(gdb) n
1324        bool supportsAcceleration = WidgetTypeSupportsAcceleration();
(gdb) n
1327        if (supportsAcceleration ||
(gdb) n
1329          enableWR = gfx::gfxVars::UseWebRender();
(gdb) n
195     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/gfxVars.h:
    No such file or directory.
(gdb) n
1338        bool enableAPZ = UseAPZ();
(gdb) n
1339        CompositorOptions options(enableAPZ, enableWR, enableSWWR);
(gdb) p supportsAcceleration
$8 = <optimized out>
(gdb) p enableAPZ
$5 = true
(gdb) p enableWR
$6 = true
(gdb) p enableSWWR
$7 = true
(gdb) n
1357        options.SetUseWebGPU(StaticPrefs::dom_webgpu_enabled());
(gdb) p StaticPrefs::dom_webgpu_enabled()
$9 = false
(gdb) n
mozilla::Atomic<bool, (mozilla::MemoryOrdering)0, void>::operator bool
    (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    CompositorOptions.h:67
67      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    CompositorOptions.h: No such file or directory.
(gdb) n
nsBaseWidget::CreateCompositorSession (this=this@entry=0x7fc4db8a30,
    aWidth=aWidth@entry=1080, aHeight=aHeight@entry=2520, 
    aOptionsOut=aOptionsOut@entry=0x7fd7da7770)
    at widget/nsBaseWidget.cpp:1364
1364        options.SetInitiallyPaused(CompositorInitiallyPaused());
(gdb) n
43      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    CompositorOptions.h: No such file or directory.
(gdb) n
1369          lm = new WebRenderLayerManager(this);
(gdb) 
That gives us some things to work with, but to actually dig into what this all means will have to wait until the morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
18 Feb 2024 : Day 160 #
It's been a long couple of days running an event at work, but now I'm on the train heading home and looking forward to a change of focus for a bit.

And part of that is getting the opportunity to take a look at the backtrace generated yesterday for the WebView rendering pipeline. I won't copy it out again in full, but it might be worth giving a high-level summary.
#0  PuppetWidgetBase::Invalidate (this=0x7fc4dac130, aRect=...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:274
#1  PuppetWidgetBase::UpdateBounds (...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:395
#2  EmbedLiteWindowChild::CreateWidget (this=0x7fc4d626d0)
    at xpcom/base/nsCOMPtr.h:851
#3  RunnableMethodArguments<>::applyImpl...
    at obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#28 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
Now that I've mentally parsed the backtrace, it's clearly not as useful as I was hoping. But it is something to go on. The line that's causing the crash is the one with MOZ_CRASH() in it below.
void
PuppetWidgetBase::Invalidate(const LayoutDeviceIntRect &aRect)
{
[...]

  if (mozilla::layers::LayersBackend::LAYERS_CLIENT == lm->GetBackendType()) {
    // No need to do anything, the compositor will handle drawing
  } else {
    MOZ_CRASH("Unexpected layer manager type");
  }
[...]
That means that lm->GetBackendType() is returning something other than LAYERS_CLIENT.

It would be nice to know what value is actually being returned, but it looks like this will be easier said than done with the code in its present form. There's nowhere to place the required breakpoint and no variable to extract it from. The LayerManager is an interface and it's not clear what will be inheriting it at this point.

While I'm on the train it's also particularly challenging for me to do any debugging. It is technically possible and I've done it before, but it requires me to attach USB cables between my devices, which is fine until I lose track of time and find I've arrived at my destination. I prefer to spend my time on the train coding, or reviewing code, if I can.

So I'm going to examine the code visually first. Let's suppose it's EmbedLiteAppProcessParentManager that's inheriting from LayerManager. This isn't an absurd suggestion, it's quite possibly the case. In that case the value returned will be a constant:
  virtual mozilla::layers::LayersBackend GetBackendType() override {
    return LayersBackend::LAYERS_OPENGL; }
Again, there's nothing to hang a breakpoint from there. So I've added a debug output so the value can be extracted explicitly.
  LOGW("WEBVIEW: Invalidate LAYERS_CLIENT: %d", lm->GetBackendType());
  if (mozilla::layers::LayersBackend::LAYERS_CLIENT == lm->GetBackendType()) {
    // No need to do anything, the compositor will handle drawing
  } else {
    MOZ_CRASH("Unexpected layer manager type");
  }
There's nothing wrong with this approach, except that it requires a rebuild of the code, which I've just set going. Hopefully it'll forge through the changes swiftly.
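One thing I might do differently if I were keeping the debug print: LayersBackend is a scoped enum, which isn't implicitly converted to int, so casting it explicitly before handing it to a printf-style function avoids relying on how the value happens to travel through a variadic call. Here's a minimal sketch, assuming the enum values from LayersTypes.h. BackendName() and DescribeBackend() are hypothetical helpers of my own, not gecko functions.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Values as listed in LayersTypes.h.
enum class LayersBackend : int8_t {
  LAYERS_NONE = 0,
  LAYERS_BASIC,
  LAYERS_OPENGL,
  LAYERS_D3D11,
  LAYERS_CLIENT,
  LAYERS_WR,
  LAYERS_LAST
};

// Hypothetical helper: map a backend value to a readable name.
const char* BackendName(LayersBackend aBackend) {
  switch (aBackend) {
    case LayersBackend::LAYERS_NONE: return "LAYERS_NONE";
    case LayersBackend::LAYERS_BASIC: return "LAYERS_BASIC";
    case LayersBackend::LAYERS_OPENGL: return "LAYERS_OPENGL";
    case LayersBackend::LAYERS_D3D11: return "LAYERS_D3D11";
    case LayersBackend::LAYERS_CLIENT: return "LAYERS_CLIENT";
    case LayersBackend::LAYERS_WR: return "LAYERS_WR";
    case LayersBackend::LAYERS_LAST: return "LAYERS_LAST";
  }
  return "unknown";
}

// Hypothetical helper: format the name and numeric value for a log line.
std::string DescribeBackend(LayersBackend aBackend) {
  char buffer[64];
  // The explicit cast makes the argument an int, matching %d.
  std::snprintf(buffer, sizeof(buffer), "%s (%d)", BackendName(aBackend),
                static_cast<int>(aBackend));
  return std::string(buffer);
}
```

The result of DescribeBackend() could then be passed to the logging macro with a %s format specifier, which also saves having to look the number up in the enum afterwards.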

In the meantime, let's continue with our thought that the layer manager is of type EmbedLiteAppProcessParentManager and that the method is therefore returning LAYERS_OPENGL. The enum in LayersTypes.h shows that this definitely takes a different value from LAYERS_CLIENT:
enum class LayersBackend : int8_t {
  LAYERS_NONE = 0,
  LAYERS_BASIC,
  LAYERS_OPENGL,
  LAYERS_D3D11,
  LAYERS_CLIENT,
  LAYERS_WR,
  LAYERS_LAST
};
Which does make me wonder how this has come about. Isn't it inevitable that the code will crash in this case?

I'll need to check if either the return value or the test condition has changed since ESR 78. But the other possibility is that it's something else inheriting the LayerManager class.

[...]

Now I'm back home and have access to the debugger. The code is still building (no surprise there), so while I wait let's attach the debugger and see what it throws up.
(gdb) p lm->GetBackendType()
$2 = mozilla::layers::LayersBackend::LAYERS_WR
(gdb) ptype lm
type = class mozilla::layers::LayerManager : public mozilla::layers::FrameRecorder {
  protected:
    nsAutoRefCnt mRefCnt;
[...]
    virtual mozilla::layers::LayersBackend GetBackendType(void);
[...]
  protected:
    ~LayerManager();
[...]
} *
(gdb) p this->GetLayerManager(0, mozilla::layers::LayersBackend::LAYERS_NONE, LAYER_MANAGER_CURRENT)
$2 = (mozilla::layers::LayerManager *) 0x7fc4db1250
Direct examination of the LayerManager doesn't show what the original object type is that's inheriting it. But there is a trick you can do with gdb to get it to tell you:
(gdb) set print object on
(gdb) p this->GetLayerManager(0, mozilla::layers::LayersBackend::LAYERS_NONE, LAYER_MANAGER_CURRENT)
$3 = (mozilla::layers::WebRenderLayerManager *) 0x7fc4db1250
(gdb) set print object off
So the actual type of the layer manager is WebRenderLayerManager. This is clearly a problem, because this will always return LAYERS_WR as its backend type:
  LayersBackend GetBackendType() override { return LayersBackend::LAYERS_WR; }
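What gdb is doing with "set print object on" is reporting the dynamic type of the object rather than the static type of the pointer. The same distinction can be seen from inside C++ using typeid. The classes below are heavily reduced, hypothetical stand-ins for the gecko ones, written just for this sketch. In particular, I'm assuming the client layer manager reports LAYERS_CLIENT, since that's what the check in Invalidate() expects.

```cpp
#include <cstdint>
#include <typeinfo>

enum class LayersBackend : int8_t {
  LAYERS_NONE = 0,
  LAYERS_BASIC,
  LAYERS_OPENGL,
  LAYERS_D3D11,
  LAYERS_CLIENT,
  LAYERS_WR,
  LAYERS_LAST
};

// Hypothetical, reduced stand-in for the LayerManager interface.
class LayerManager {
 public:
  virtual ~LayerManager() = default;
  virtual LayersBackend GetBackendType() = 0;
};

// Stand-in for the client layer manager (assumed to report LAYERS_CLIENT).
class ClientLayerManager : public LayerManager {
 public:
  LayersBackend GetBackendType() override {
    return LayersBackend::LAYERS_CLIENT;
  }
};

// Stand-in for the WebRender layer manager.
class WebRenderLayerManager : public LayerManager {
 public:
  LayersBackend GetBackendType() override { return LayersBackend::LAYERS_WR; }
};

// True if the object behind the base reference is really a
// WebRenderLayerManager, whatever the static type says.
bool IsWebRenderLayerManager(const LayerManager& aManager) {
  return typeid(aManager) == typeid(WebRenderLayerManager);
}

// The condition tested in PuppetWidgetBase::Invalidate(): anything other
// than LAYERS_CLIENT ends in MOZ_CRASH().
bool WouldCrashInInvalidate(LayerManager& aManager) {
  return aManager.GetBackendType() != LayersBackend::LAYERS_CLIENT;
}
```

Because LayerManager has virtual methods, typeid looks at the real object, just as gdb does once object printing is switched on.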
All this debugging has been useful; so useful in fact that it's made the debug prints I added on the train completely redundant. No matter, I'll leave the build running anyway.

Tomorrow I must find out where the layer manager is being created and also what the layer manager type is on ESR 78 for comparison.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
17 Feb 2024 : Day 159 #
I've been working through a singular bug stack for so long now that it feels strange to have a completely open ended set of possibilities for which direction to go next. We had a small diversion yesterday into creating random grids of avatars but before that were focused for a while on the task of getting DuckDuckGo working. Today I have to make some new decisions about where to go next.

There are two things I'd really like to look into. While working on the browser over the last few weeks it's been stable enough to use pretty successfully as a browser. But occasionally the renderer crashes out completely, pulling the browser down, for no obvious reason. It's sporadic enough that there's no obvious cause. But if I could get a backtrace from the crash that might be enough to start looking into it.

So my first option is looking into these sporadic crashes. They're not nice for users and might signify a deeper issue.

The second option is fixing the webview rendering pipeline. That needs a little explanation. On Sailfish OS the browser is used in one of two ways, either as a web browser, or as a webview embedded in some other application.

The best example of this is the email client, which uses the webview to render messages. These often contain HTML, so it makes perfect sense to use a good embedded browser rendering engine for them.

So these are two different use-cases. But they also happen to have two different rendering pipelines. Currently the browser pipeline works nicely, but the webview pipeline is broken. I'd really like to fix it.

I've decided to go with the native rendering pipeline task first (issue 1043 on GitHub). It's clearly defined, but also potentially a big job, so needs some attention. But if I continue to see browser crashes I may switch focus to those instead.

For the native rendering pipeline the first step is straightforward: install a webview project on my phone. There are plenty out there, but I also have several basic examples already written in the "projects" folder on my laptop, which I should be able to just build and install on my phone for testing.

Digging through projects I find one called "harbour-webview-example" (sounds promising) with the following as the main page of the app:
import QtQuick 2.0
import Sailfish.Silica 1.0
import Sailfish.WebView 1.0
import Sailfish.WebEngine 1.0
import uk.co.flypig.webview 1.0

Page {
    allowedOrientations: Orientation.All

    WebView {
        anchors.fill: parent
        active: true
        url: "http://www.sailfishos.org"
        onLinkClicked: {
          WebEngine.notifyObservers("exampleTopic", url)
        }
    }
}
Straightforward. But containing a webview. Attempting to run it, the results sadly aren't good:
$ harbour-webview
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[...]
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12

Created LOG for EmbedPrefs
Content JS: resource://gre/modules/SearchSettings.jsm, function: get, message:
    [JavaScript Warning: "get: No settings file exists, new profile?
    NotFoundError: Could not open the file at
    .cache/harbour-webview/harbour-webview/.mozilla/search.json.mozlz4"]
Created LOG for EmbedLiteLayerManager
Segmentation fault
There are a couple of JavaScript errors (may or may not be related) and a crash (definitely relevant). Let's see if we can get a backtrace from the crash:
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
Copyright (C) 2018 Free Software Foundation, Inc.
[...]
Thread 7 "GeckoWorkerThre" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 12581]
0x0000007ff367c0a8 in mozilla::embedlite::PuppetWidgetBase::Invalidate
    (this=0x7fc4dac130, aRect=...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:274
274         MOZ_CRASH("Unexpected layer manager type");
(gdb) bt
#0  0x0000007ff367c0a8 in mozilla::embedlite::PuppetWidgetBase::Invalidate
    (this=0x7fc4dac130, aRect=...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:274
#1  0x0000007ff368093c in mozilla::embedlite::PuppetWidgetBase::UpdateBounds
    (this=0x7fc4dac130, aRepaint=aRepaint@entry=true)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:395
#2  0x0000007ff3689b28 in mozilla::embedlite::EmbedLiteWindowChild::CreateWidget
    (this=0x7fc4d626d0)
    at xpcom/base/nsCOMPtr.h:851
#3  0x0000007ff367a094 in mozilla::detail::RunnableMethodArguments<>::applyImpl
    <mozilla::embedlite::EmbedLiteWindowChild, void
    (mozilla::embedlite::EmbedLiteWindowChild::*)()>
    (mozilla::embedlite::EmbedLiteWindowChild*, void
    (mozilla::embedlite::EmbedLiteWindowChild::*)(), mozilla::Tuple<>&,
    std::integer_sequence<unsigned long>) (args=..., m=<optimized out>,
    o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#28 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
That's definitely something to go on. But unfortunately I'm tight for time today, so digging in to this backtrace will have to wait until tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
16 Feb 2024 : Day 158 #
A couple of days back I included a small graphic that showed avatars and usernames of everyone who has helped, commented on, liked, boosted and generally supported this dev diary. Here it is again (because you can never say thank you enough!).
 
Many, many avatars with names underneath, showing all of the people I could find who I think have interacted with the dev diary, along with the words 'Thank you'

It's actually a slide from my FOSDEM presentation, and while I do like it, that's not the real reason I'm showing it again here. During my presentation I mentioned that it had taken me a long time to create the slide. And that's true. But I thought it might be interesting to find out how it was created. Maybe it could have been done in a simpler, better, or quicker way?

Creating it required four steps:
  1. Collect the names and avatars.
  2. Process the avatars.
  3. Plot the names and avatars on the canvas.
  4. Tidy things up.
I did my best to automate the process, with small scripts to handle steps two and three. While most of my energy was spent on step one, it's the automated steps that might be of interest to others.

Let me first give a quick overview of how I collected the data in step one. This did take a long time (longer than I expected), primarily because there were more people interacting with my dev diary than I'd quite appreciated. Initially I collected names from the Sailfish Forum. There's a thread about the dev diary and I picked up most of the names from there, or from direct messages.

Each user on the forum has an avatar, even if it's been auto-generated by the Discourse forum software. Whenever someone posts, their avatar is shown next to their comment. But this is a small version of the avatar. Select the username at the top of a post and a more detailed summary of the account is shown, including a larger version of the same image. Right click on this and it can be saved out to disc.

If you try this you'll notice the avatar is actually a square image, even though it's shown in the forum as circular. A mask is being applied to it on the client side. This will be important for step two.

At this point I also added other users I could think of who, while they may not have made a post on the forum, had nevertheless interacted in important ways with the dev diary. This included many different types of interactions such as comments on IRC or Matrix. In this case, I also found their avatars and usernames on the forum.

While doing this I kept a CSV file containing two columns: username and avatar filename.

Finally I checked my Mastodon account for all the users who had interacted with my posts there. I stepped through all 149 of my dev diary Mastodon posts (as it was at the time), checking who had favourited, boosted, or replied to a post there. Once again I took a copy of their avatar and added their details to the CSV file.

So far so manual. What I was left with was a directory containing 145 images and a CSV file with 145 rows. Here are the first few rows of the CSV file to give you an idea:
000exploit, 000exploit.png
aerique, aerique.png
Adrian McEwen, amcewen.png
ApB, apb.png
Adam T, atib.png
[...]
You'll notice that it's in alphabetical order. That's because after collecting all the details I ran it through sort on the command line.

That brings us on to step two, the processing of the avatars (which sounds far more grand than it is). Different avatars were in different formats (JPEG or PNG), with different sizes, but mostly square in shape. They needed to all be the same size, the same format (PNG) and with a circular mask applied.

For this I used the convert tool, which is part of the brilliant ImageMagick suite. It performs a vast array of image manipulations all from the command line. Here's just a small flavour from its help output:
$ convert --help
Version: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
Copyright: (C) 1999-2021 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP(4.5) 
Delegates (built-in): bzlib djvu fftw fontconfig freetype heic jbig jng jp2
    jpeg lcms lqr ltdl lzma openexr pangocairo png tiff webp wmf x xml zlib
Usage: convert-im6.q16 [options ...] file [ [options ...] file ...]
    [options ...] file

Image Settings:
  -adjoin              join images into a single multi-image file
  -affine matrix       affine transform matrix
  -alpha option        activate, deactivate, reset, or set the alpha channel
  -antialias           remove pixel-aliasing
  -authenticate password
                       decipher image with this password
  -attenuate value     lessen (or intensify) when adding noise to an image
[...]
Happily for me, all three of my desired conversions (changing format, changing size, applying a mask) were available using the tool. I put together a very simple bash script which cycles through all of the files with a given format in a folder, processes them in the way that was needed and then outputs them to a different folder with a suffix added to the filename.

All of the work is done by the convert tool, via this simple addmask.sh script. It's so simple in fact that I can give the entirety of it here without it taking up too much space:
SUFFIX_FROM=$1
OUT_FOLDER=$2
SUFFIX_TO=$3

set -e

# All three arguments are required
if [ $# -ne 3 ]; then
    echo "Syntax: $0 <extension-in> <out-folder> <out-suffix>"
    exit 1
fi

echo "Converting <image>.$SUFFIX_FROM images to <image>$SUFFIX_TO.png"
echo

for name in *."$SUFFIX_FROM"; do
    # Swap the extension for the suffix, so avatar.jpg becomes avatar-masked.png
    newname="${name%.*}$SUFFIX_TO.png"
    echo "Converting $name to $OUT_FOLDER/$newname"
    # Resize both images to 240x240 and use the mask as the avatar's alpha channel
    convert "$name" masks/mask.png -sample 240x240 -alpha Off -compose \
        CopyOpacity -composite -density 180 -background transparent \
        "$OUT_FOLDER/$newname"
done
I then called it a couple of times inside the folder with the images to produce the results needed:
$ ./addmask.sh png masked -masked
$ ./addmask.sh jpg masked -masked
After processing, each of the updated images is given a new name, so I had to perform a regex search and replace on my CSV file to update them appropriately. The file now looks like this:
000exploit, 000exploit-masked.png
aerique, aerique-masked.png
Adrian McEwen, amcewen-masked.png
ApB, apb-masked.png
Adam T, atib-masked.png
[...]
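For anyone wanting to script that search and replace rather than do it in an editor, it might be sketched in Python as below. This is only an illustration: the helper name add_suffix is made up, and the pattern assumes every avatar filename sits at the end of the row and ends in .png, .jpg or .jpeg.

```python
import re

# Hypothetical helper: swap the image extension at the end of a CSV row
# for "<suffix>.png", mirroring the naming that addmask.sh produces.
def add_suffix(row, suffix="-masked"):
    return re.sub(r"\.(png|jpe?g)\s*$", suffix + ".png", row,
                  flags=re.IGNORECASE)

rows = [
    "000exploit, 000exploit.png",
    "Adrian McEwen, amcewen.png",
]
updated = [add_suffix(row) for row in rows]
for row in updated:
    print(row)
```

Anchoring the pattern to the end of the line means a username that happens to contain ".png" is left alone.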
I have to admit that I cheated a bit with this script. I originally wrote it back in September 2022 for Jolla's "Sailing for Ten Years" party held in Berlin on 14th October of the same year. Nico Cartron wrote up a nice summary of the event in the Sailfish Community News at the time. I was asked to give a presentation at the event; one of the slides I created for it was a thank you slide not unlike the one above. In that case it was for translators and apps, but it never actually got used during the event.

Nevertheless the script lived on in my file system and finally found itself a use. To be honest, I was pretty tight for time writing up my presentation for FOSDEM so I'm not sure if I'd have gone down this route if I didn't already have something to build on. I made some small changes to it to handle resizing, but that was pretty much the only change.

That brings us to step three. Now having a directory full of nicely processed images, I needed them to be arranged on a canvas, ideally in SVG format, so that I could then embed the result on a slide.

Since starting my role as a Research Data Scientist I've been immersed in random Python scripts. Python has the benefit of a huge array of random libraries to draw from, SVG-writing included in the form of the drawsvg 2 project. It is, honestly, a really simple way to generate SVGs quickly and easily. Now that I've tried it I think I'll be using it more often.

My aim was to arrange the avatars and names "randomly" on the page. I started by creating a method that placed a single avatar on the canvas with the username underneath. Getting the scale, font and formatting correct took a little trial and error, but I was happy that the final values all made sense. The drawsvg coordinate system works just as it should!

Arranging them at random requires an algorithm. My first instinct was to arrange them all in a grid, but with a random "jitter" applied. That is, for each image select a random angle and distance and move it by that amount on the page.
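
The grid-with-jitter idea can be sketched in a few lines of Python. This is a simplified illustration rather than the code from the real script: the function name jittered_grid and the column count, spacing and jitter values are all assumptions picked for the example.

```python
import math
import random

# Illustrative sketch only; the sizes and spacing here are assumed values.
def jittered_grid(count, columns, spacing, max_jitter, rng=random):
    """Return (x, y) centres laid out on a grid, each nudged by a random
    distance (up to max_jitter) in a random direction."""
    positions = []
    for index in range(count):
        col = index % columns
        row = index // columns
        # Pick a random angle and distance to move this item by
        angle = rng.uniform(0.0, 2.0 * math.pi)
        distance = rng.uniform(0.0, max_jitter)
        x = col * spacing + distance * math.cos(angle)
        y = row * spacing + distance * math.sin(angle)
        positions.append((x, y))
    return positions

positions = jittered_grid(count=145, columns=16, spacing=100.0,
                          max_jitter=20.0)
print(len(positions))
```

With 145 items and an assumed 16 columns, the final row holds a single avatar, which is where a gap at the end comes from.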

The script I created for this is a little too long to show here, but you can check it out on GitHub.

Here's how I ran it:
$ python3 createthanks.py names.csv thanks-grid.svg --grid
Using grid formation
Exported to thanks-grid.svg
The results weren't good, as you can see for yourself in this image.


 
The same avatars but arranged in a grid formation with random jitter; but the grid formation is still very clear

The avatars have been placed, but the grid formation is still very clear despite the added jitter. There's also a gap at the end because the total number of avatars isn't a multiple of the number in each row. I wasn't happy with it.

So I came up with an alternative solution: for each avatar a random location is chosen on the canvas. As each avatar is added its position is stored in an array, then when the next position is chosen it's compared against all of these positions. If it's within a certain distance (60 units) of any of the other points, it's rejected and a new random position is chosen.
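
As a rough Python sketch of this rejection approach (again, not the code from the real script): the function name place_randomly and the canvas dimensions are assumptions, and the max_attempts cap isn't part of the approach as described. It's added here so that the sketch raises an error instead of looping forever when no free spot can be found.

```python
import math
import random

def place_randomly(count, width, height, min_distance=60.0,
                   max_attempts=10000, rng=random):
    """Pick `count` points on a width x height canvas so that no two are
    closer than min_distance. Raises RuntimeError rather than looping
    forever if a free spot can't be found."""
    placed = []
    for _ in range(count):
        for _ in range(max_attempts):
            x, y = rng.uniform(0, width), rng.uniform(0, height)
            # Compare the candidate against every position chosen so far
            if all(math.hypot(x - px, y - py) >= min_distance
                   for px, py in placed):
                placed.append((x, y))
                break
        else:
            # The inner loop ran out of attempts without finding a spot
            raise RuntimeError("no free position found; canvas may be full")
    return placed

points = place_randomly(145, 1920, 1080)
print(len(points))
```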

Again, you can see this algorithm given in the same file on GitHub. Here's how I ran it:
$ python3 createthanks.py names.csv thanks-random.svg
Using random formation
100%
Exported to thanks-random.svg
This is the approach I ended up using, so you can see the results in the original slide. But it's not ideal for several reasons. Most crucially it's not guaranteed to complete: if there isn't enough space to fit all of the avatars, the algorithm will hang indefinitely while it tries to find a place to position the next avatar. It's also inefficient, with each location being compared with every other and a potentially large number of rejections being made before a suitable location is found at each step.

But I found that, given enough space to locate the avatars, the process actually finished pretty quickly. And since I only need to run it once, not multiple times, that's actually not a problem.

In retrospect a better algorithm would have been to partition the canvas up into a grid of cells much smaller than an avatar. Ideally it would be one grid cell per pixel, but in practice we don't really know what a pixel means in this context. Besides which, something approaching this is likely to be fine. Now store a Boolean associated with each of these grid points indicating whether it's available or used.

After placing an avatar, mark the grid points around the location as being used, to the extent that there will be no overlap if another avatar is placed in any of the unused spots. Keep a count of the available locations.

Then a random number can be chosen between zero and the total remaining available locations in order to select the next spot.

I didn't implement this, but in my head it works really well.
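
For what it's worth, the idea could be sketched in Python along these lines. It's an untested-in-anger illustration of the description above: the function name place_on_grid, the 10-unit cell size and the canvas dimensions are all assumptions.

```python
import random

def place_on_grid(count, width, height, min_distance=60.0, cell=10.0,
                  rng=random):
    """Place up to `count` points using an availability grid. Returns the
    points placed; fewer than `count` means the canvas filled up."""
    cols = int(width // cell)
    rows = int(height // cell)
    free = [[True] * cols for _ in range(rows)]
    remaining = cols * rows                # count of available cells
    reach = int(min_distance // cell) + 1  # cells to inspect either side
    placed = []

    while len(placed) < count and remaining > 0:
        # Choose the k-th free cell, counting in row-major order
        k = rng.randrange(remaining)
        for r in range(rows):
            free_in_row = sum(free[r])
            if k < free_in_row:
                break
            k -= free_in_row
        c = [i for i, ok in enumerate(free[r]) if ok][k]
        placed.append(((c + 0.5) * cell, (r + 0.5) * cell))

        # Mark every cell whose centre is closer than min_distance as used
        for rr in range(max(0, r - reach), min(rows, r + reach + 1)):
            for cc in range(max(0, c - reach), min(cols, c + reach + 1)):
                dx = (cc - c) * cell
                dy = (rr - r) * cell
                if free[rr][cc] and dx * dx + dy * dy < min_distance ** 2:
                    free[rr][cc] = False
                    remaining -= 1
    return placed

points = place_on_grid(145, 1920, 1080)
print(len(points))
```

Every pass through the loop uses up at least one cell, so unlike the rejection approach this always terminates, and a short result list signals that the canvas was too small.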

Finally step four involved tidying up the files. Some of the avatars and usernames were overlapping so needed a bit of manual tweaking (but thankfully not too much). Plus I also had to manually make room for the "Thank you" text in the top left of the window. This required a bit more manual shuffling of avatars, but it all worked out in the end. I'm quite happy with how it came out.

That's it. Just a little diversion into the scripts used to create the image; I hope it's been of some interest.

There will be more of the usual gecko dev diary tomorrow.

15 Feb 2024 : Day 157 #
The new gecko packages completed their build during the gap. I'm not expecting anything special from them, apart from the debug output that I added yesterday having been removed. But it will be nice to test whether the packages are working. If they are that means I'll have dealt with the session history issues and the actor issues and it'll be time to wind back down the stack to checking DuckDuckGo.

I've just now fired up sailfish-browser and all of the JavaScript errors seem to have been resolved. If nothing else, that's pleasing to see. So now it's time to return to DuckDuckGo again.

You'll recall that I spent a long time getting DuckDuckGo to work. The main problem was the Sec-Fetch-* headers. This allowed the main search page to load. But there was still an issue in that the search results weren't being displayed. I thought this could potentially be related to the session history issue, which is why I started looking at it. But I wasn't at all certain.

Now that the session history is fixed and the JavaScript output is clean, there's a small chance this will have fixed DuckDuckGo as well.

And it has! Searching with DuckDuckGo now produces nice results. And the page feels nice and smooth as well. With the Forwards and Back buttons now also working as expected, it's really starting to feel like a usable browser finally.
 
The screenshots of DuckDuckGo with captions: DuckDuckGo landing page; Search as you type; Search results now appear!

But my euphoria is short-lived. I'm able to get the browser into a state where DuckDuckGo is no longer working correctly: it just displays a blank page again.

After closing the browser and firing it up again, everything works as expected.

It takes me ages to figure out how to get it back into the state where DuckDuckGo is broken. The breakage looks the same as the Sec-Fetch-* header issues that we were experiencing a couple of weeks back. If that's the case, then it's likely there's some route to getting to the DuckDuckGo page that's getting the wrong flags from the user interface and then offering up the wrong Sec-Fetch-* headers as a result.

What do I mean by a route? I mean some set of interactions to get there: loading the page from the history, a bookmark, the URL bar, pressing back, as a page loaded from the command line. All of these are routes and each will potentially need slightly different flags so that the gecko engine knows what's going on and can send the correct Sec-Fetch-* headers to match.

I thought I'd fixed them all, but it would have been easy to miss some. And so it appears.

So I try all the ways to open the page I can think of from different starting states. After having exhausted what I think are all the routes I realise I've missed something important.

So far I've been running the browser from the command line each time. What if the issue is to do with the way the browser is starting up?

Starting the browser from the command line isn't the same as starting it by pressing on its icon in the app grid. There are multiple reasons for this, but the most significant two are:
  1. When launched from the app grid the sandbox is used. It's not used when launched from the command line.
  2. When launched from the app grid the booster is used, unlike when launched from the command line.
And indeed, when I launch the app from the grid, DuckDuckGo fails. This is a serious issue: we can't have users being required to launch the app without sandboxing from the command line each time. But it's not immediately clear to me why sandboxing and/or the booster would make any difference.

To find out what's going wrong I need to establish the Sec-Fetch-* header values that are being sent to the site. That's going to be a little tricky because when launching from the app grid there's no debug output being sent to the console. But it might be possible to extract the same info from the system log. Let's try it:
$ devel-su journalctl --system -f
[...]
booster-browser[14470]: [D] unknown:0 - Using Wayland-EGL
autologind[5343]: library "libutils.so" not found
autologind[5343]: library "libcutils.so" not found
autologind[5343]: library "libhardware.so" not found
autologind[5343]: library "android.hardware.graphics.mapper@2.0.so" not found
autologind[5343]: library "android.hardware.graphics.mapper@2.1.so" not found
autologind[5343]: library "android.hardware.graphics.mapper@3.0.so" not found
autologind[5343]: library "android.hardware.graphics.mapper@4.0.so" not found
autologind[5343]: library "libc++.so" not found
autologind[5343]: library "libhidlbase.so" not found
autologind[5343]: library "libgralloctypes.so" not found
autologind[5343]: library "android.hardware.graphics.common@1.2.so" not found
autologind[5343]: library "libion.so" not found
autologind[5343]: library "libz.so" not found
autologind[5343]: library "libhidlmemory.so" not found
autologind[5343]: library "android.hidl.memory@1.0.so" not found
autologind[5343]: library "vendor.qti.qspmhal@1.0.so" not found
[...]
It looks like the logging output is all there. The next step is to figure out a way to set the EMBED_CONSOLE="network" environment variable, so that we can capture the headers used.

But as I write this I'm hurtling towards King's Cross Station on the train and due to arrive shortly. So I'll have to leave that question open until this evening.

[...]

It's the evening now, so time to check out the logging from the SailJailed browser. Before running it I want to forcefully enable network logging. This will be easier than configuring a special environment for it.

I've made some amendments to the Logger.js directly on-device for this:
vim /usr/lib64/mozembedlite/chrome/embedlite/content/Logger.js
The changes applied are the following, in order to force the "extreme" network logging to be used:
[...]
  get stackTraceEnabled() {
    return true;
    //return this._consoleEnv.indexOf("stacktrace") !== -1;
  },

  get devModeNetworkEnabled() {
    return true;
    //return this._consoleEnv.indexOf("network") !== -1;
  }, 
[...]
Next we must set the logger running, also on the device:
$ devel-su journalctl --system -f | grep "dbus-daemon" > sailjail-log-01.txt
And finally run sailfish-browser. The DuckDuckGo page loads, but doesn't render. That's good: that's the bug we want to examine. After closing the browser down I'm left with a surprisingly tractable 80 KiB log file to investigate:
$ ls -lh sailjail-log-01.txt 
-rw-rw-r--    1 defaultu defaultu   80.0K Feb 13 22:35 sailjail-log-01.txt
The bits we're interested in are the Sec-Fetch-* request headers. This is what's in the log file for these:
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/?t=h_
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,
            image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Cookie : l=wt-wt
        Upgrade-Insecure-Requests : 1
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : none
        Sec-Fetch-User : ?1
        If-None-Match : "65cbc605-4716"
Going back to the non-sandboxed run, below are the equivalent headers sent. There are two here because the browser is downloading a second file.
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/?t=h_
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,
            image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Cookie : l=wt-wt
        Upgrade-Insecure-Requests : 1
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : none
        Sec-Fetch-User : ?1
        If-None-Match : "65cbc606-4716"
[ Request details ------------------------------------------- ]
    Request: GET status: 304 Not Modified
    URL: https://content-signature-2.cdn.mozilla.net/chains/remote-settings.
        content-signature.mozilla.org-2024-03-20-10-07-03.chain
    [ Request headers --------------------------------------- ]
        Host : content-signature-2.cdn.mozilla.net
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : */*
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Sec-Fetch-Dest : empty
        Sec-Fetch-Mode : cors
        Sec-Fetch-Site : cross-site
        If-Modified-Since : Tue, 30 Jan 2024 10:07:04 GMT
        If-None-Match : "b437816fe2de8ff3d429925523643815"
To be honest I'm a bit confused by this: for the initial request there is no difference between the values provided for the Sec-Fetch-* headers in either case. I was expecting to see a difference.
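
Comparing headers by eye is error-prone, so a check like this could also be scripted. The Python sketch below is an illustration only: the function names are made up, and it assumes log text in the tidied form shown above, with a "URL:" line ahead of each request's headers and each URL on a single line.

```python
import re

HEADER = re.compile(r"(Sec-Fetch-[A-Za-z]+)\s*:\s*(.*\S)")
URL = re.compile(r"URL:\s*(\S+)")

def sec_fetch_by_url(log_text):
    """Map each request URL to a dict of its Sec-Fetch-* headers."""
    requests = {}
    current = None
    for line in log_text.splitlines():
        url = URL.search(line)
        if url:
            current = requests.setdefault(url.group(1), {})
            continue
        header = HEADER.search(line)
        if header and current is not None:
            current[header.group(1)] = header.group(2)
    return requests

def differences(log_a, log_b):
    """Return {url: {header: (value_a, value_b)}} where the values differ,
    looking only at URLs that appear in both logs."""
    a, b = sec_fetch_by_url(log_a), sec_fetch_by_url(log_b)
    result = {}
    for url in a.keys() & b.keys():
        changed = {
            name: (a[url].get(name), b[url].get(name))
            for name in a[url].keys() | b[url].keys()
            if a[url].get(name) != b[url].get(name)
        }
        if changed:
            result[url] = changed
    return result
```

An empty result from differences() would correspond to what's seen here: the same Sec-Fetch-* values in both runs.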

Maybe a caching issue? Let's try deleting the profile, killing any running booster processes and trying again.
$ mv ~/.local/share/org.sailfishos/browser/.mozilla/ \
    ~/.local/share/org.sailfishos/browser/mozilla.bak
$ ps aux | grep browser
 4628 defaultu /usr/bin/invoker [...] --application=sailfish-browser
 4640 defaultu /usr/bin/firejail [...] --application=sailfish-browser
 4674 defaultu /usr/libexec/mapplauncherd/booster-browser
    --application=sailfish-browser
 4675 defaultu booster [browser]
 4702 defaultu grep browser
 6184 defaultu /usr/bin/invoker --type=silica-session [...]
    --application=jolla-email
 6243 defaultu /usr/bin/firejail [...] --application=sailfish-browser
 6248 defaultu /usr/bin/firejail [...] --application=jolla-email
 6442 defaultu /usr/libexec/mapplauncherd/booster-browser
    --application=jolla-email
 6452 defaultu booster [browser]
$ kill -1 4640
$ kill -1 6243
After deleting the browser's local config storage and killing the SailJail/booster processes, DuckDuckGo then works successfully again. So probably it was a caching issue.

When running the browser from the launcher there's an added complication that usually the booster is still running in the background for performance reasons. So changes to the browser code and files may not always be applied if this is the case. After killing the booster processes, this seems to have been fixed.

After this additional investigation I'm satisfied that this doesn't look like an issue with the code after all.

I'm going to leave it there for today. Tomorrow I'll be looking for a completely new task to work on. But I think the browser has got to the stage where it would be worth having more hands to test it, so the next task may be to figure out how that can happen.

14 Feb 2024 : Day 156 #
It is the morning, time to take a look into why EmbedLiteGlobalHelper.js isn't being initialised on ESR 91. We saw yesterday that this was why the Window Actors weren't being registered. If we want them registered we're going to have to get EmbedLiteGlobalHelper.js back up and running.

Now I think about it, it's not that EmbedLiteGlobalHelper.js isn't being loaded. We know it's being loaded because we see this line in the debug output when running ESR 91:
$ EMBED_CONSOLE=1 sailfish-browser
[...]
JSComp: EmbedLiteGlobalHelper.js loaded
[...]
That line is being directly output from the EmbedLiteGlobalHelper.js file itself:
function EmbedLiteGlobalHelper()
{
  L10nRegistry.registerSources([new FileSource(
                                   "0-mozembedlite",
                                   ["en-US", "fi", "ru"],
                                   "chrome://browser/content/localization/
                                   {locale}/")])

  Logger.debug("JSComp: EmbedLiteGlobalHelper.js loaded");
}
But! There is a line missing from here that's in the ESR 78 code:
function EmbedLiteGlobalHelper()
{
  ActorManagerParent.flush();

  L10nRegistry.registerSource(new FileSource(
                                  "0-mozembedlite",
                                  ["en-US", "fi", "ru"],
                                  "chrome://browser/content/localization/
                                  {locale}/"))

  Logger.debug("JSComp: EmbedLiteGlobalHelper.js loaded");
}
Could it be that the call to ActorManagerParent.flush() is what kicks the ActorManagerParent into life? Presumably there's a reason it was removed. It was most likely me that removed it, but I don't recall. But we should be able to find out.
$ git log -S "ActorManagerParent" -1 -- jscomps/EmbedLiteGlobalHelper.js
commit 0bf2601425ec1d8d639255d6a7c32231e7e38eae
Author: David Llewellyn-Jones <david.llewellyn-jones@jolla.com>
Date:   Thu Nov 23 21:55:13 2023 +0000

    Remove EmbedLiteGlobalHelper.js errors
    
    Makes three changes to address errors that were coming from
    EmbedLiteGlobalHelper.js:
    
    1. Use ComponentUtils.generateNSGetFactory() instead of
       XPCOMUtils.generateNSGetFactory().
    
    2. Remove call to ActorManagerParent.flush(). See Bug 1649843.
    
    3. Use L10nRegistry.registerSources() instead of
       L10nRegistry.registerSource(). See Bug 1648631.
    
    See the following related upstream changes:
    
    https://phabricator.services.mozilla.com/D95206
    
    https://phabricator.services.mozilla.com/D81243
That "Bug 1648631" that's being referred to there is described as "Remove legacy JS actors infrastructure and migrate remaining actors to JSWindowActors".

Presumably there was some error resulting from the call to ActorManagerParent.flush(), but November was a long time ago now (way back on Day 88 in fact). According to the diary entry then, the change was to remove the following error:
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteGlobalHelper.js,
    line 32: TypeError: ActorManagerParent.flush is not a function
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteGlobalHelper.js,
    line 34: TypeError: L10nRegistry.registerSource is not a function
With all of the other changes we've been making, restoring the removed code to see what happens doesn't sound like the worst idea right now. Let's try it.
$ sailfish-browser
[D] unknown:0 - Using Wayland-EGL
[...]
Created LOG for EmbedLite
ACTOR: addJSProcessActors
ACTOR: RegisterProcessActor: 
ACTOR: RegisterProcessActor: 
ACTOR: RegisterProcessActor: 
ACTOR: addJSWindowActors
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
AJavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteGlobalHelper.js, line 31: TypeError: ActorManagerParent.flush
    is not a function
CTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
Call EmbedLiteApp::StopChildThread()
It turns out the code I added to output the name of the actors is broken: the log output is being generated, but without the name of the actor concerned. But I don't fancy doing a complete rebuild to fix that so we'll just have to do without for now.

The results here are interesting. This has clearly brought the ActorManagerParent back to life. But there is still this error about flush() not being a function. And when comparing the code in ActorManagerParent.jsm between ESR 78 and ESR 91 it is true that this flush() method has been removed.

That leaves us with an interesting quandary. In ESR 78 the flush() function was being used to trigger initialisation of the module. Now we need something else to do the same.

Thankfully there is a really simple solution. We can just instantiate an ActorManagerParent object without calling any functions on it.
function EmbedLiteGlobalHelper()
{
  // Touch ActorManagerParent so that it gets initialised
  var actor = new ActorManagerParent();

  L10nRegistry.registerSources([new FileSource(
                                   "0-mozembedlite",
                                   ["en-US", "fi", "ru"],
                                   "chrome://browser/content/localization/
                                   {locale}/")])

  Logger.debug("JSComp: EmbedLiteGlobalHelper.js loaded");
}
Now we get a nice clean startup without any errors:
$ sailfish-browser
[D] unknown:0 - Using Wayland-EGL
[...]
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
ACTOR: addJSProcessActors
ACTOR: RegisterProcessActor: 
ACTOR: RegisterProcessActor: 
ACTOR: RegisterProcessActor: 
ACTOR: addJSWindowActors
ACTOR: RegisterWindowActor: 
[...]
ACTOR: RegisterWindowActor: 
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
Call EmbedLiteApp::StopChildThread()
Excellent! The final step then is to remove all of the debugging code I added and see where that leaves us. With any luck, this will resolve the session history issues and allow us to head back in to checking DuckDuckGo once again.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
13 Feb 2024 : Day 155 #
This morning the build I started yesterday has completed successfully. Now to test it.

You'll recall that the purpose of the build was to add some debugging output to find out what's going on with the Window Actors and the LoginManager actor in particular. Is it being registered? Are others being registered? Are others being requested? Every time one of these events occurs we should get some appropriate debug output so that we know.

To actually see the output we'll need to activate the BrowsingContext log output, like this:
$ MOZ_LOG="BrowsingContext:5" sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
library "libcutils.so" not found
library "libhardware.so" not found
library "android.hardware.graphics.mapper@2.0.so" not found
library "android.hardware.graphics.mapper@2.1.so" not found
library "android.hardware.graphics.mapper@3.0.so" not found
library "android.hardware.graphics.mapper@4.0.so" not found
library "libc++.so" not found
library "libhidlbase.so" not found
library "libgralloctypes.so" not found
library "android.hardware.graphics.common@1.2.so" not found
library "libion.so" not found
library "libz.so" not found
library "libhidlmemory.so" not found
library "android.hidl.memory@1.0.so" not found
library "vendor.qti.qspmhal@1.0.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
[Parent 4610: Unnamed thread 7908002670]: D/BrowsingContext Creating 0x00000003
    in Parent
[Parent 4610: Unnamed thread 7908002670]: D/BrowsingContext Parent: Connecting
    0x00000003 to 0x00000000 (private=0, remote=0, fission=0, oa=)
ACTOR: Getting LoginManager
[Parent 4610: Unnamed thread 7908002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
ACTOR: Getting LoginManager
[Parent 4610: Unnamed thread 7908002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
[Parent 4610: Unnamed thread 7908002670]: D/BrowsingContext Parent: Detaching
    0x00000003 from 0x00000000
Call EmbedLiteApp::StopChildThread()
Redirecting call to abort() to mozalloc_abort
It's a little hard to tell, but there are actually only a couple of relevant lines of output in there. Let's clean that up a bit:
$ MOZ_LOG="BrowsingContext:5" sailfish-browser 2>&1 | grep -E "ACTOR:|error"
ACTOR: Getting LoginManager
[Parent 4720: Unnamed thread 7668002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
ACTOR: Getting LoginManager
[Parent 4720: Unnamed thread 7668002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
So what we have is a call to GetActor() but without any apparent registrations for this or any other actor.

This... looks suspicious to me. I'm not entirely convinced that the logging I've added is working, particularly for the printf() output I added to the JSActorService class.

I don't want to have to make another build, but thankfully we can check this using the debugger.
$ MOZ_LOG="BrowsingContext:5" gdb sailfish-browser
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) b JSActorService::RegisterWindowActor
Function "JSActorService::RegisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (JSActorService::RegisterWindowActor) pending.
(gdb) b JSActorService::UnregisterWindowActor
Function "JSActorService::UnregisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (JSActorService::UnregisterWindowActor) pending.
(gdb) b JSActorService::RegisterProcessActor
Function "JSActorService::RegisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 3 (JSActorService::RegisterProcessActor) pending.
(gdb) b JSActorService::UnregisterProcessActor
Function "JSActorService::UnregisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 4 (JSActorService::UnregisterProcessActor) pending.
(gdb) r
[...]
ACTOR: Getting LoginManager
[Parent 8108: Unnamed thread 7fc0002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
[...]
ACTOR: Getting LoginManager
[Parent 8108: Unnamed thread 7fc0002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
[Parent 8108: Unnamed thread 7fc0002670]: D/BrowsingContext Parent: Detaching
    0x00000003 from 0x00000000
Call EmbedLiteApp::StopChildThread()
Redirecting call to abort() to mozalloc_abort
[...]
(gdb) info break
Num Type       Disp Enb  What
1   breakpoint keep y    in mozilla::dom::JSActorService::RegisterWindowActor
                            (nsTSubstring<char> const&, mozilla::dom::
                            WindowActorOptions const&, mozilla::ErrorResult&) 
                         at dom/ipc/jsactor/JSActorService.cpp:60
2   breakpoint keep y    in mozilla::dom::JSActorService::UnregisterWindowActor
                            (nsTSubstring<char> const&) 
                         at dom/ipc/jsactor/JSActorService.cpp:109
3   breakpoint keep y    in mozilla::dom::JSActorService::RegisterProcessActor
                            (nsTSubstring<char> const&, mozilla::dom::
                            ProcessActorOptions const&, mozilla::ErrorResult&) 
                         at dom/ipc/jsactor/JSActorService.cpp:231
4   breakpoint keep y    in mozilla::dom::JSActorService::UnregisterProcessActor
                            (nsTSubstring<char> const&) 
                         at dom/ipc/jsactor/JSActorService.cpp:275
(gdb) 
Even though the breakpoints found places to attach, there are no hits. It really does look like the entire actor functionality is missing, which may or may not be intended.

The obvious thing to do now is to check the same thing using ESR 78. So let's do that.
$ gdb sailfish-browser 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) b JSActorService::RegisterWindowActor
Function "JSActorService::RegisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (JSActorService::RegisterWindowActor) pending.
(gdb) b JSActorService::UnregisterWindowActor
Function "JSActorService::UnregisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (JSActorService::UnregisterWindowActor) pending.
(gdb) b JSActorService::RegisterProcessActor
Function "JSActorService::RegisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 3 (JSActorService::RegisterProcessActor) pending.
(gdb) b JSActorService::UnregisterProcessActor
Function "JSActorService::UnregisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 4 (JSActorService::UnregisterProcessActor) pending.
(gdb) r
[...]
Thread 10 "GeckoWorkerThre" hit Breakpoint 1, mozilla::dom::JSActorService::
    RegisterWindowActor (this=this@entry=0x7f80480850, aName=..., aOptions=..., 
    aRv=...) at dom/ipc/JSActorService.cpp:51
51	                                         ErrorResult& aRv) {
(gdb) handle SIGPIPE nostop
Signal        Stop	Print	Pass to program	Description
SIGPIPE       No	Yes	Yes		Broken pipe
(gdb) p aName
$1 = (const nsACString &) @0x7fa69e3d80: {<mozilla::detail::nsTStringRepr
    <char>> = {mData = 0x7fa69e3d94 "AboutHttpsOnlyError", mLength = 19, 
    mDataFlags = (mozilla::detail::StringDataFlags::TERMINATED |
    mozilla::detail::StringDataFlags::INLINE), 
    mClassFlags = mozilla::detail::StringClassFlags::INLINE},
    static kMaxCapacity = 2147483637}
(gdb) p aName.mData
$2 = (mozilla::detail::nsTStringRepr<char>::char_type *) 0x7fa69e3d94
    "AboutHttpsOnlyError"
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 1, mozilla::dom::JSActorService::
    RegisterWindowActor (this=this@entry=0x7f80480850, aName=..., aOptions=..., 
    aRv=...) at dom/ipc/JSActorService.cpp:51
51	                                         ErrorResult& aRv) {
(gdb) p aName.mData
$3 = (mozilla::detail::nsTStringRepr<char>::char_type *) 0x7fa69e3d94
    "AudioPlayback"
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 1, mozilla::dom::JSActorService::
    RegisterWindowActor (this=this@entry=0x7f80480850, aName=..., aOptions=..., 
    aRv=...) at dom/ipc/JSActorService.cpp:51
51	                                         ErrorResult& aRv) {
(gdb) p aName.mData
$4 = (mozilla::detail::nsTStringRepr<char>::char_type *) 0x7fa69e3d94
    "AutoComplete"
(gdb) c
[...] 
There are dozens more of them. But also, crucially, this one: 
Thread 10 "GeckoWorkerThre" hit Breakpoint 1, mozilla::dom::JSActorService::
    RegisterWindowActor (this=this@entry=0x7f80480850, aName=..., aOptions=..., 
    aRv=...) at dom/ipc/JSActorService.cpp:51
51	                                         ErrorResult& aRv) {
(gdb) p aName.mData
$17 = (mozilla::detail::nsTStringRepr<char>::char_type *) 0x7fa69e3a84
    "LoginManager"
(gdb) bt
#0  mozilla::dom::JSActorService::RegisterWindowActor
    (this=this@entry=0x7f80480850, aName=..., aOptions=..., aRv=...)
    at dom/ipc/JSActorService.cpp:51
#1  0x0000007fba9881b0 in mozilla::dom::ChromeUtils::RegisterWindowActor
    (aGlobal=..., aName=..., aOptions=..., aRv=...)
    at dom/base/ChromeUtils.cpp:1243
#2  0x0000007fbae8b4f4 in mozilla::dom::ChromeUtils_Binding::registerWindowActor
    (cx_=<optimized out>, argc=<optimized out>, vp=0x7fa69e3b00)
    at ChromeUtilsBinding.cpp:4263
#3  0x0000007f00460260 in ?? ()
#4  0x0000007fbcb1ba80 in Interpret (cx=0x7fa69e3ad0, state=...)
    at js/src/vm/Activation.h:541
#5  0x0000007fbcb1ba80 in Interpret (cx=0x7f802270c0, state=...)
    at js/src/vm/Activation.h:541
#6  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 
Eventually it gets through them all and the page loads. None of the other breakpoints get hit, so there don't appear to be any deregistrations.

This is enlightening. It means that this is an iceberg moment: the error from LoginManager is exposing a much more serious problem under the surface. Honestly, I'm puzzled as to why this hasn't caused more breakages of other elements of the browser user interface.

The next step is to look at the backtrace and find out why the same code isn't executing in ESR 91. It's not the best backtrace to be honest because it's backstopped by the interpreter, but it at least gives us something to work with.

So back to ESR 91:
(gdb) b ChromeUtils::RegisterWindowActor
Breakpoint 5 at 0x7ff2c9a0d8: file dom/base/ChromeUtils.cpp, line 1350.
(gdb) b ChromeUtils_Binding::registerWindowActor
Breakpoint 6 at 0x7ff31bf858: file ChromeUtilsBinding.cpp, line 5237.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/sailfish-browser 
[...]
No hits.

The backtrace gets lost when it hits JavaScript, but it looks like the JavaScript action is happening in the ActorManagerParent.jsm file. So back to ESR 78, where I've added a couple of extra debug prints to the ActorManagerParent.jsm file:
  addJSProcessActors(actors) {
    dump("ACTOR: addJSProcessActors\n");
    this._addActors(actors, "JSProcessActor");
  },
  addJSWindowActors(actors) {
    dump("ACTOR: addJSWindowActors\n");
    this._addActors(actors, "JSWindowActor");
  },
And there's an immediate hit:
$ EMBED_CONSOLE=1 sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libGLESv2_adreno.so" not found
library "eglSubDriverAndroid.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
ACTOR: addJSProcessActors
ACTOR: addJSWindowActors
JSComp: EmbedLiteGlobalHelper.js loaded
EmbedLiteGlobalHelper app-startup
[...]
And for a bit more clarity:
$ sailfish-browser 2>&1 | grep "ACTOR:"
ACTOR: addJSProcessActors
ACTOR: addJSWindowActors
I've copied those same debug lines over to the ESR 91 code to see if the same methods are being called there.
$ EMBED_CONSOLE=1 sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
library "libcutils.so" not found
library "libhardware.so" not found
library "android.hardware.graphics.mapper@2.0.so" not found
library "android.hardware.graphics.mapper@2.1.so" not found
library "android.hardware.graphics.mapper@3.0.so" not found
library "android.hardware.graphics.mapper@4.0.so" not found
library "libc++.so" not found
library "libhidlbase.so" not found
library "libgralloctypes.so" not found
library "android.hardware.graphics.common@1.2.so" not found
library "libion.so" not found
library "libz.so" not found
library "libhidlmemory.so" not found
library "android.hidl.memory@1.0.so" not found
library "vendor.qti.qspmhal@1.0.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
JSComp: EmbedLiteGlobalHelper.js loaded
EmbedLiteGlobalHelper app-startup
[...]
Nothing that I can see, but just to be certain:
$ sailfish-browser 2>&1 | grep "ACTOR:"
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
So somewhere between EmbedLiteFaviconService.js being loaded and EmbedLiteGlobalHelper.js being loaded the actors are registered on ESR 78, but that's not happening on ESR 91.
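
When the logs get longer, eyeballing the difference becomes less reliable, so here is a small sketch of the same comparison done mechanically. The missingLines() helper is hypothetical, written just for this illustration, and the two arrays are abbreviated excerpts of the ESR 78 and ESR 91 output shown above.

```javascript
// Return the lines of `reference` that never appear in `candidate`,
// preserving their original order.
function missingLines(reference, candidate) {
  const seen = new Set(candidate);
  return reference.filter((line) => !seen.has(line));
}

// Abbreviated excerpts from the two runs shown above.
const esr78 = [
  "JSComp: EmbedLiteFaviconService.js loaded",
  "ACTOR: addJSProcessActors",
  "ACTOR: addJSWindowActors",
  "JSComp: EmbedLiteGlobalHelper.js loaded",
];
const esr91 = [
  "JSComp: EmbedLiteFaviconService.js loaded",
  "JSComp: EmbedLiteGlobalHelper.js loaded",
];

// Lines that ESR 78 prints but ESR 91 does not.
const missing = missingLines(esr78, esr91);
console.log(missing);
```

For these excerpts the only lines missing from the ESR 91 run are the two ACTOR: lines, which matches the reading of the logs above.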

Time for a break, but when I get back I'll look into this further.

[...]

Back to it. So the question I need to answer now is "where in ESR 78 are the actors getting registered?" There are two clear candidates. The first is BrowserGlue.jsm. This includes code to get the ActorManagerParent:
ChromeUtils.defineModuleGetter(
  this,
  "ActorManagerParent",
  "resource://gre/modules/ActorManagerParent.jsm"
);
It even adds some actors of its own during initialisation:
  // initialization (called on application startup)
  _init: function BG__init() {
[...]
    ActorManagerParent.addJSProcessActors(JSPROCESSACTORS);
    ActorManagerParent.addJSWindowActors(JSWINDOWACTORS);
    ActorManagerParent.addLegacyActors(LEGACY_ACTORS);
    ActorManagerParent.flush();
[...]
  },
Another, potentially more promising candidate, is EmbedLiteGlobalHelper.js. It's more promising for multiple reasons. First, it's part of embedlite-components, which means it's intended for use with sailfish-browser. Second, something in the back of my mind tells me sailfish-browser uses its own version of the browser glue. Third, and perhaps most compelling, the debug output messages come straight before EmbedLiteGlobalHelper.js claims to be initialised, which would fit with the actor initialisation being part of the initialisation of EmbedLiteGlobalHelper.js.

I should be able to check this pretty straightforwardly. If I comment out the code in EmbedLiteGlobalHelper.js related to the actors like this:
//ChromeUtils.defineModuleGetter(
//  this,
//  "ActorManagerParent",
//  "resource://gre/modules/ActorManagerParent.jsm"
//);

Services.scriptloader.loadSubScript("chrome://embedlite/content/Logger.js");

// Common helper service

function EmbedLiteGlobalHelper()
{
  //ActorManagerParent.flush();
[...]
Then the errors received start to look very similar to those for ESR 91:
$ sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libGLESv2_adreno.so" not found
library "eglSubDriverAndroid.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 442:
    NotSupportedError: WindowGlobalChild.getActor: Could not get
    JSWindowActorProtocol: LoginManager is not registered
To double-check, we can run sailfish-browser using the debugger with breakpoints on the relevant methods like this:
$ gdb sailfish-browser
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) b JSActorService::RegisterWindowActor
Function "JSActorService::RegisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (JSActorService::RegisterWindowActor) pending.
(gdb) b JSActorService::UnregisterWindowActor
Function "JSActorService::UnregisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (JSActorService::UnregisterWindowActor) pending.
(gdb) b JSActorService::RegisterProcessActor
Function "JSActorService::RegisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 3 (JSActorService::RegisterProcessActor) pending.
(gdb) b JSActorService::UnregisterProcessActor
Function "JSActorService::UnregisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 4 (JSActorService::UnregisterProcessActor) pending.
(gdb) r
[...]
No hits. So, in conclusion it seems that EmbedLiteGlobalHelper.js isn't being initialised on ESR 91. The task now is to find out why.

Once again, this feels like progress, but an answer for this question will have to wait until tomorrow.

12 Feb 2024 : Day 154 #
I'm still trying to track down the reason for the following error today:
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
It's being a bit elusive, because there doesn't seem to be any clear difference between the way the LoginManager is set up on ESR 78 compared to ESR 91.

So I thought I should let it settle overnight. One possibility I came up with during this pondering process was that there's an error during the initialisation of LoginManager that's resulting in it not becoming available for use later.

But a careful check of the debug output prior to the above error doesn't give any indication that anything like this is going wrong. Here's the entire output.
[defaultuser@Xperia10III gecko]$ sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
library "libcutils.so" not found
library "libhardware.so" not found
library "android.hardware.graphics.mapper@2.0.so" not found
library "android.hardware.graphics.mapper@2.1.so" not found
library "android.hardware.graphics.mapper@3.0.so" not found
library "android.hardware.graphics.mapper@4.0.so" not found
library "libc++.so" not found
library "libhidlbase.so" not found
library "libgralloctypes.so" not found
library "android.hardware.graphics.common@1.2.so" not found
library "libion.so" not found
library "libz.so" not found
library "libhidlmemory.so" not found
library "android.hidl.memory@1.0.so" not found
library "vendor.qti.qspmhal@1.0.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
Call EmbedLiteApp::StopChildThread()
Apart from the LoginManager error the output is looking surprisingly clean now. But I do still need to figure this one out.

Another possibility that crossed my mind is that there's a problem with the way WindowGlobalChild::GetActor() is working in general. In other words, the issue isn't really with the LoginManager at all but rather with the code for accessing it.

On the face of it this seems unlikely: the error is only showing for the LoginManager and no other actors. And there are plenty of other uses. I count 167 instances, only 9 of which are for LoginManager.
$ grep -rIn ".getActor(\"" * --include="*.js*" | wc -l
167
$ grep -rIn ".getActor(\"LoginManager\")" * --include="*.js*" | wc -l
9
Nevertheless it is possible that LoginManager is the only actor actually being requested at runtime. Unlikely, but possible.

To try to find out, I've added some debugging code so that something will be output whenever GetActor() is called off the WindowGlobalChild:
already_AddRefed<JSWindowActorChild> WindowGlobalChild::GetActor(
    JSContext* aCx, const nsACString& aName, ErrorResult& aRv) {

  MOZ_LOG(BrowsingContext::GetLog(), LogLevel::Debug, ("ACTOR: GetActor: ",
      PromiseFlatCString(aName)));

  return JSActorManager::GetActor(aCx, aName, aRv)
      .downcast<JSWindowActorChild>();
}
I've also added some debugging closer to the action in LoginManagerChild.jsm:
  static forWindow(window) {
    let windowGlobal = window.windowGlobalChild;
    if (windowGlobal) {
      dump("ACTOR: Getting LoginManager\n");
      return windowGlobal.getActor("LoginManager");
    }

    return null;
  }
Let's see if either of those provide any insight. Unfortunately these changes require a build, so it will take a while. During the build, I'm going to look into the third possibility I thought about: that the LoginManager isn't being registered correctly.

[...]

Before the build completes I realise that all of the action registering and unregistering actors happens in JSActorService.cpp. There's a hash table called mWindowActorDescriptors which stores all of the actors, alongside registration and removal methods. To help understand what's going on there it will be helpful to add some debug output here too, to expose any actors that are added or removed here. So I've cancelled the build while I add it in.

Here's an example:
void JSActorService::RegisterWindowActor(const nsACString& aName,
                                         const WindowActorOptions& aOptions,
                                         ErrorResult& aRv) {
  MOZ_ASSERT(NS_IsMainThread());
  MOZ_ASSERT(XRE_IsParentProcess());

  printf("ACTOR: RegisterWindowActor: %s", PromiseFlatCString(aName));
[...]
I've added similar debug output for UnregisterWindowActor(), RegisterProcessActor() and UnregisterProcessActor() as well.
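
To make the idea concrete, here is a toy sketch of a name-keyed registry with logging on every registration, removal and lookup. It is only an illustration in plain JavaScript: the ToyActorRegistry class, its method names, its error messages and the allFrames option are all hypothetical, and the real JSActorService is C++ with considerably more going on.

```javascript
// Toy stand-in for the descriptor table in JSActorService: a Map keyed by
// actor name, with logging on every registration, removal and lookup.
class ToyActorRegistry {
  constructor(log = console.log) {
    this.descriptors = new Map();
    this.log = log;
  }

  register(name, options = {}) {
    this.log(`ACTOR: register: ${name}`);
    if (this.descriptors.has(name)) {
      throw new Error(`'${name}' actor is already registered`);
    }
    this.descriptors.set(name, options);
  }

  unregister(name) {
    this.log(`ACTOR: unregister: ${name}`);
    this.descriptors.delete(name);
  }

  get(name) {
    this.log(`ACTOR: get: ${name}`);
    if (!this.descriptors.has(name)) {
      throw new Error(`No such actor '${name}'`);
    }
    return this.descriptors.get(name);
  }
}

// Collect the log lines in an array so they can be inspected afterwards.
const lines = [];
const registry = new ToyActorRegistry((line) => lines.push(line));

// A lookup for a name that was never registered has to fail.
let lookupError = null;
try {
  registry.get("LoginManager");
} catch (e) {
  lookupError = e.message;
}

// After registration the same lookup succeeds.
registry.register("LoginManager", { allFrames: true });
const descriptor = registry.get("LoginManager");
console.log(lines.join("\n"));
```

With logging in all of these places, the order of the output shows whether a lookup happened before, after, or entirely without a matching registration.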

Now I've set it building again. Time for a break while my laptop does all the work for me.

[...]

It turns out the break was longer than I anticipated. I thought the build might finish quickly but it's still chugging away several hours later as it heads towards bed time.

So I'll have to pick this up in the morning once it's built. I'm looking forward to finding out what's really happening with this Actor code.

11 Feb 2024 : Day 153 #
Yesterday we finally saw the session history start to work. But it wasn't without errors and we were still left with the following showing in the console:
Warning: couldn't PurgeHistory. Was it a file download? TypeError:
    legacyHistory.PurgeHistory is not a function

JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
That's two separate errors. We started to look into the first of these, which is emanating from embedhelper.js. The PurgeHistory() method has been renamed to purgeHistory() in nsISHistory.idl. So with any luck if we just make the same change in embedhelper.js it'll fix the first of these.

Happily the embedhelper.js file is part of embedlite-components which makes it super-quick to test. No need to rebuild gecko. And the change does the trick: no more PurgeHistory errors. I experience a couple of cases where it seems to drop some items from the history, but I think this might be teething troubles (and also not just an ESR 91 issue... I'm sure I've seen it in the current release build as well). I'll have to see if I can find a way to reliably reproduce it.

What about the second error relating to LoginManagerChild.jsm then?
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
Here's the bit of code from LoginManagerChild.jsm causing the error:
  static forWindow(window) {
    let windowGlobal = window.windowGlobalChild;
    if (windowGlobal) {
      return windowGlobal.getActor("LoginManager");
    }

    return null;
  }
There's no change between this bit of code and the code in ESR 78. So the reason for the error must be buried a little deeper.

The code in WindowGlobalChild.cpp related to this has changed — become much simpler in fact — but I'm not yet convinced that this is the reason for the error.
already_AddRefed<JSWindowActorChild> WindowGlobalChild::GetActor(
    JSContext* aCx, const nsACString& aName, ErrorResult& aRv) {
  return JSActorManager::GetActor(aCx, aName, aRv)
      .downcast<JSWindowActorChild>();
}
The simplification is because the code has been moved into JSActorManager.cpp, but it's really doing something that looks pretty similar.

There are no obvious differences in the LoginManager.jsm code itself. Just some additional telemetry and minor reformatting.

I've checked a bunch of things. The LoginManager.jsm file is contained within omni.ja. It's apparently accessed in multiple other places in both ESR 78 and ESR 91 in the same way. There is a very small change to the way it's being registered. From this:
    {
        'cid': '{cb9e0de8-3598-4ed7-857b-827f011ad5d8}',
        'contract_ids': ['@mozilla.org/login-manager;1'],
        'jsm': 'resource://gre/modules/LoginManager.jsm',
        'constructor': 'LoginManager',
    },
To this:
    {
        'js_name': 'logins',
        'cid': '{cb9e0de8-3598-4ed7-857b-827f011ad5d8}',
        'contract_ids': ['@mozilla.org/login-manager;1'],
        'interfaces': ['nsILoginManager'],
        'jsm': 'resource://gre/modules/LoginManager.jsm',
        'constructor': 'LoginManager',
    },
But I don't see why that change would make any difference. So right now I'm unfortunately a bit stumped. Maybe sleeping on it will help. I guess I'll find out in the morning.

So that's it for today, but hopefully I'll make a bit more progress on this tomorrow.

10 Feb 2024 : Day 152 #
This morning I woke to find the build had completed. Hooray! That means I can test the changes I made yesterday to add an InitSessionHistory() call into the EmbedLite code.

After installing and running the code I still see the occasional related error, but the main errors we were getting before about the sessionHistory being null are no longer appearing.

That's a small but important step. But even more important is the fact that the Back and Forwards buttons are now working as well. Not only a good sign, but also important functionality being restored. I've been finding it quite challenging using the browser without the navigation buttons. So this is a very welcome result.

A couple of additional errors are also appearing now. These are new; first another error related to the history:
Warning: couldn't PurgeHistory. Was it a file download? TypeError: 
    legacyHistory.PurgeHistory is not a function
But also an error about the LoginManager:
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
Starting with the first error, the PurgeHistory() method certainly exists in nsISHistory.idl:
  /**
   * Called to purge older documents from history.
   * Documents can be removed from session history for various
   * reasons. For example to  control memory usage of the browser, to
   * prevent users from loading documents from history, to erase evidence of
   * prior page loads etc...
   *
   * @param numEntries        The number of toplevel documents to be
   *                          purged from history. During purge operation,
   *                          the latest documents are maintained and older
   *                          'numEntries' documents are removed from history.
   * @throws                  <code>NS_SUCCESS_LOSS_OF_INSIGNIFICANT_DATA</code>
   *                          Purge was vetod.
   * @throws                  <code>NS_ERROR_FAILURE</code> numEntries is
   *                          invalid or out of bounds with the size of history.
   */
  void PurgeHistory(in long aNumEntries);
It's no longer present in ESR 91 though. Let's find out why not.
$ git log -1 -S "PurgeHistory" docshell/shistory/nsISHistory.idl
commit f9f96d23ca42f359e143d0ae98234021e86179a7
Author: Andreas Farre <farre@mozilla.com>
Date:   Wed Sep 16 14:51:01 2020 +0000

    Bug 1662410 - Part 1: Fix usage of ChildSHistory.legacySHistory . r=peterv
    
    ChildSHistory.legacySHistory isn't valid for content processes when
    session history in the parent is enabled. We try to fix this by either
    delegating to the parent by IPC or move the implementation partially
    or as a whole to the parent.
    
    Differential Revision: https://phabricator.services.mozilla.com/D89353
This is definitely the change that's causing the issue, as we can see in the diff:
$ git diff f9f96d23ca42f359e143d0ae98234021e86179a7~ \
    f9f96d23ca42f359e143d0ae98234021e86179a7 \
    -- docshell/shistory/nsISHistory.idl
diff --git a/docshell/shistory/nsISHistory.idl b/docshell/shistory/
    nsISHistory.idl
index 3d914924c94d..1f5b9c5477e9 100644
--- a/docshell/shistory/nsISHistory.idl
+++ b/docshell/shistory/nsISHistory.idl
@@ -87,7 +87,7 @@ interface nsISHistory: nsISupports
    * @throws                  <code>NS_ERROR_FAILURE</code> numEntries is
    *                          invalid or out of bounds with the size of history.
    */
-  void PurgeHistory(in long aNumEntries);
+  void purgeHistory(in long aNumEntries);
 
   /**
    * Called to register a listener for the session history component.
@@ -255,7 +255,7 @@ interface nsISHistory: nsISupports
    * Collect docshellIDs from aEntry's children and remove those
    * entries from history.
    *
-   * @param aEntry           Children docshellID's will be collected from 
+   * @param aEntry           Children docshellID's will be collected from
    *                         this entry and passed to RemoveEntries as aIDs.
   */
   [noscript, notxpcom]
@@ -265,7 +265,7 @@ interface nsISHistory: nsISupports
   void Reload(in unsigned long aReloadFlags);
 
   [notxpcom] void EnsureCorrectEntryAtCurrIndex(in nsISHEntry aEntry);
-  
+
   [notxpcom] void EvictContentViewersOrReplaceEntry(in nsISHEntry aNewSHEntry,
       in bool aReplace);
 
   nsISHEntry createEntry();
Helpfully we can immediately see from this that the call hasn't exactly been removed. It's just been slightly renamed, switching the initial uppercase "P" for a lowercase "p". With any luck then, this should be an easy fix.

This is where the call is defined, but to fix it we also now need to know where the call comes from. Doing a quick grep on the code highlights that it's being called in embedhelper.js which is part of the embedlite-components package.

Here's the relevant section:
      case "embedui:addhistory": {
        // aMessage.data contains: 1) list of 'links' loaded from DB, 2) current 'index'.

        let docShell = content.docShell;
        let sessionHistory = docShell.QueryInterface(Ci.nsIWebNavigation).sessionHistory;
        let legacyHistory = sessionHistory.legacySHistory;
        let ioService = Cc["@mozilla.org/network/io-service;1"].getService(Ci.nsIIOService);

        try {
          // Initially we load the current URL and that creates an unneeded entry in History -> purge it.
          if (legacyHistory.count > 0) {
            legacyHistory.PurgeHistory(1);
          }
        } catch (e) {
            Logger.warn("Warning: couldn't PurgeHistory. Was it a file download?", e);
        }

Changing legacyHistory.PurgeHistory(1) to legacyHistory.purgeHistory(1) will hopefully do the trick. Unfortunately I'm already out of time for today, so we'll have to wait until tomorrow to find out for certain. But I feel like we're making progress.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
9 Feb 2024 : Day 151 #
Yesterday we were comparing ESR 78 and ESR 91 to see why the session history is initialised in the former but not in the latter. We reached this bit of code in ESR 91:
  // If we are an in-process browser, we want to set up our session history.
  if (mIsTopLevelContent && mOwnerContent->IsXULElement(nsGkAtoms::browser) &&
      !mOwnerContent->HasAttr(kNameSpaceID_None, nsGkAtoms::disablehistory)) {
    // XXX(nika): Set this up more explicitly?
    mPendingBrowsingContext->InitSessionHistory();
  }
As you can see, this could potentially initialise the session history as long as the condition is true going into it.

But when we stepped through the program we found it wasn't true. That's because mIsTopLevelContent was set to false, which itself was because BrowsingContext::mParentWindow was non-null.

That makes sense intuitively: if the window has a parent, then it's not a top level element.
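
As a rough illustration of why a single false value is enough to skip the call, here's a minimal sketch that flattens the condition quoted above into plain booleans. The FrameState struct, its field names and WouldInitSessionHistory() are all hypothetical names made up for this example; they aren't part of gecko.

```cpp
#include <cassert>

// Hypothetical flattening of the ESR 91 condition quoted above. The struct
// and its field names are invented for illustration; they are not gecko types.
struct FrameState {
  bool isTopLevelContent;      // stands in for mIsTopLevelContent
  bool ownerIsXulBrowser;      // stands in for IsXULElement(nsGkAtoms::browser)
  bool hasDisableHistoryAttr;  // stands in for HasAttr(..., disablehistory)
};

// All three parts must line up before InitSessionHistory() would be reached.
bool WouldInitSessionHistory(const FrameState& s) {
  return s.isTopLevelContent && s.ownerIsXulBrowser &&
         !s.hasDisableHistoryAttr;
}

void CheckNestedFrame() {
  // mIsTopLevelContent was false in the debugger, so the other two values
  // (chosen arbitrarily here) can't rescue the condition.
  assert(!WouldInitSessionHistory({false, true, false}));
}
```

Because the first operand is already false, the remaining checks are never what decides the outcome in our case.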

However with ESR 78 we have a similar situation, because the code looks like this:
  if (aBrowsingContext->IsTop()) {
    aBrowsingContext->InitSessionHistory();
  }
But this is being executed, suggesting that in this case there is no parent window. Why the difference?

Although that seems like a fair question, there's another question I'd like to try to answer first. The D100348 change that caused this problem in the first place moves the InitSessionHistory() call from nsWebBrowser::Create() to nsFrameLoader::TryRemoteBrowserInternal(). Clearly the authors of that change thought the execution flow would go via the latter method, so it would be good to know why it doesn't for us when using EmbedLite.

Looking through the code, there is at least one route to getting to InitSessionHistory() via TryRemoteBrowserInternal() that looks like this (I routed this by hand... quite laborious):
nsFrameLoader::LoadURI()
Document::InitializeFrameLoader()
Document::MaybeInitializeFinalizeFrameLoaders()
nsFrameLoader::ReallyStartLoading()
nsFrameLoader::ReallyStartLoadingInternal()
nsFrameLoader::EnsureRemoteBrowser()
nsFrameLoader::TryRemoteBrowser()
nsFrameLoader::TryRemoteBrowserInternal()
BrowsingContext::InitSessionHistory()
In other words a call to LoadURI() can then call InitializeFrameLoader() which can then call MaybeInitializeFinalizeFrameLoaders() and so on, all the way to InitSessionHistory().

The top half of this, from LoadURI() to ReallyStartLoadingInternal() is certainly being called when the code is executed. We can see that by placing some breakpoints on various methods and checking the results:
Thread 10 "GeckoWorkerThre" hit Breakpoint 4, nsFrameLoader::
    LoadURI (this=this@entry=0x7fc1419ef0, aURI=0x7fc15feb90,
    aTriggeringPrincipal=0x7fc11a47d0, aCsp=0x7ee8004780,
    aOriginalSrc=aOriginalSrc@entry=true) at dom/base/nsFrameLoader.cpp:600
600                                     bool aOriginalSrc) {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 3, mozilla::dom::Document::
    InitializeFrameLoader (this=this@entry=0x7fc11a1960, 
    aLoader=aLoader@entry=0x7fc1419ef0) at dom/base/Document.cpp:8999
8999    nsresult Document::InitializeFrameLoader(nsFrameLoader* aLoader) {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 2, mozilla::dom::Document::
    MaybeInitializeFinalizeFrameLoaders (this=0x7fc11a1960)
    at dom/base/Document.cpp:9039
9039    void Document::MaybeInitializeFinalizeFrameLoaders() {
(gdb) c
Continuing.
[...]
Thread 10 "GeckoWorkerThre" hit Breakpoint 9, nsFrameLoader::
    ReallyStartLoadingInternal (this=this@entry=0x7fc147afe0)
    at dom/base/nsFrameLoader.cpp:664
664     nsresult nsFrameLoader::ReallyStartLoadingInternal() {
(gdb) p mIsRemoteFrame
$3 = false
(gdb) 
But that's as far as it goes. Beyond that, it's the fact that mIsRemoteFrame is false that prevents InitSessionHistory() from being called, because the relevant code inside ReallyStartLoadingInternal() looks like this:
  if (IsRemoteFrame()) {
    if (!EnsureRemoteBrowser()) {
      NS_WARNING("Couldn't create child process for iframe.");
      return NS_ERROR_FAILURE;
    }
[...]
We can understand this better by also considering what IsRemoteFrame() is doing:
bool nsFrameLoader::IsRemoteFrame() {
  if (mIsRemoteFrame) {
    MOZ_ASSERT(!GetDocShell(), "Found a remote frame with a DocShell");
    return true;
  }
  return false;
}
If it jumps past this, then it looks like the next place where InitSessionHistory() could potentially get called is in nsFrameLoader::MaybeCreateDocShell(). Maybe that's the place it's supposed to happen for non-remote frames. But in that case we're back to the same problem of mIsTopLevelContent being false that we started with.

Once again, the reason is that there's a parent browsing context:
(gdb) p mPendingBrowsingContext.mRawPtr->mParentWindow.mRawPtr->
    mBrowsingContext.mRawPtr
$8 = (mozilla::dom::BrowsingContext *) 0x7fc0ba3f00
Contrast this with the ESR 78 way of doing things. In that case the call to nsWebBrowser::Create() is coming from EmbedLiteViewChild::InitGeckoWindow(). There the BrowsingContext is explicitly created detached and so has no parent widget.

The route via LoadURI() in ESR 91 is far less direct. Given this, one potential solution, which I think I rather like, is to explicitly initialise the session history where EmbedLite initialises the window, like this:
void
EmbedLiteViewChild::InitGeckoWindow(const uint32_t parentId,
                                    mozilla::dom::BrowsingContext
                                    *parentBrowsingContext,
                                    const bool isPrivateWindow,
                                    const bool isDesktopMode,
                                    const bool isHidden)
{
[...]
  RefPtr<BrowsingContext> browsingContext = BrowsingContext::CreateDetached
      (nullptr, parentBrowsingContext, nullptr, EmptyString(),
      BrowsingContext::Type::Content);
  browsingContext->SetUsePrivateBrowsing(isPrivateWindow); // Needs to be called before attaching
  browsingContext->EnsureAttached();
  browsingContext->InitSessionHistory();
[...]
  mWebBrowser = nsWebBrowser::Create(mChrome, mWidget, browsingContext,
                                     nullptr);
In this case the session history is created broadly speaking where it would have been in ESR 78. This should be safe given that it isn't initialised at any later time. And the nice thing about it is that it doesn't require any changes to the core gecko code, only to the EmbedLite code.

So, with this change made, it's time to set off the build. It'll take a while to complete, but once it has, we can give it a spin to test it.

For the record, and while the build is running, here is the backtrace for nsFrameLoader::MaybeCreateDocShell():
(gdb) bt
#0  nsFrameLoader::MaybeCreateDocShell (this=this@entry=0x7fc1476da0)
    at dom/base/nsFrameLoader.cpp:2241
#1  0x0000007ff2def6c0 in nsFrameLoader::ReallyStartLoadingInternal
    (this=this@entry=0x7fc1476da0)
    at dom/base/nsFrameLoader.cpp:751
#2  0x0000007ff2defa88 in nsFrameLoader::ReallyStartLoading
    (this=this@entry=0x7fc1476da0)
    at dom/base/nsFrameLoader.cpp:656
#3  0x0000007ff2d25198 in mozilla::dom::Document::
    MaybeInitializeFinalizeFrameLoaders (this=0x7fc11f88d0)
    at dom/base/Document.cpp:9068
#4  0x0000007ff2cceebc in mozilla::detail::RunnableMethodArguments<>::applyImpl
    <mozilla::dom::Document, void (mozilla::dom::Document::*)()>
    (mozilla::dom::Document*, void (mozilla::dom::Document::*)(),
    mozilla::Tuple<>&, std::integer_sequence<unsigned long>)
    (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
#5  mozilla::detail::RunnableMethodArguments<>::apply<mozilla::dom::Document,
    void (mozilla::dom::Document::*)()> (m=<optimized out>, o=<optimized out>, 
    this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1154
[...]
#20 0x0000007ff4e7e564 in js::RunScript (cx=cx@entry=0x7fc0233bd0, state=...)
    at js/src/vm/Interpreter.cpp:395
#21 0x0000007ff4e7e9b0 in js::InternalCallOrConstruct (cx=cx@entry=0x7fc0233bd0,
    args=..., construct=construct@entry=js::NO_CONSTRUCT, 
    reason=reason@entry=js::CallReason::Call) at js/src/vm/Interpreter.cpp:543
If we get the right outcome from this build this backtrace may not be useful any more, but it's worth taking a record just in case.

The build is running, so that's it for today. We'll find out whether it worked or not tomorrow.

8 Feb 2024 : Day 150 #
FOSDEM'24 is over and it's back to semi-normal again now. In particular, as promised, I'm going to be back to writing daily dev diaries and hopefully getting the Sailfish Browser into better shape running ESR 91. My first action is going to be to continue in my attempt to get the back and forwards buttons working by fixing the sessionHistory. Then I'll move back up my task stack to return to the Set-Fetch-* headers and DuckDuckGo.

But before jumping back into coding let me also say how great it was to meet so many interesting people, both old friends and new, at FOSDEM. The Linux on Mobile stand was buzzing the entire time with a crowd of curious FOSS enthusiasts. I was also really happy to have the chance to talk about these dev diaries in the FOSS on Mobile Devices devroom. In case you missed it there's a video available, as well as the slides and the LaTeX source for the slides.

I'm not going to go over my talk again here, but I will take the opportunity to share two of the images from the slides.

First, to reiterate my thanks to everyone who has been following this dev diary, helping with coding, sharing ideas, performing testing, writing their own code changes (to either this or other packages that gecko relies on), producing images, boosting or liking on Mastodon, commenting on things. I really appreciate it all. I've tried to capture everyone, but I apologise if I've managed to miss anyone.
 
Many, many avatars with names underneath, showing all of the people I could find who I think have interacted with the dev diary, along with the words 'Thank you'

The other graphic I wanted to share summarises my progress to date. This is of course a significant simplification: it's been a lot more messy than this in practice. But it serves as some kind of overview.
 
A timeline that loops backwards and forwards across the page, showing 149 days. Along the timeline at various points the progress details are marked: 45: First successful build; 50: First successful execution; 83: Rendering works; 85: APZ; 90: JS errors; 96: Search; 100: Static prefs; 128: PDF printing; 143: Sec-Fetch-* headers; 149: Session history

As you can see, this takes us right up to today and the session history.

Before this latest two week break I wrote myself some notes to help me get back up to speed when I returned. Here's what my notes say:
 
When I get back I'll be back to posting these dev diaries again. And as a note to myself, when I do, my first action must be to figure out why nsFrameLoader is being created in ESR 78 but not ESR 91. That might hold the key to why the sessionHistory isn't getting called in ESR 91. And as a last resort, I can always revert commit 140a4164598e0c9ed53.

I'm glad I wrote this, because otherwise I'd have completely forgotten the details of what I was working on by now. So let's continue where we left off.

First off I'll try debugging ESR 78 to get a backtrace for the call to the nsFrameLoader constructor.

Unfortunately, although when I place a breakpoint on nsFrameLoader::Create() it does get hit when using ESR 78, it's not possible to extract a backtrace. The debugger just complains and tries to output a core dump.

Happily by placing breakpoints on promising looking methods that call it or its parents, I'm able to step back through the calls and place a breakpoint on nsGenericHTMLFrameElement::GetContentWindow() which, if hit, is guaranteed to then call nsFrameLoader::Create(). And it does get hit. And from this method it's possible to extract a backtrace:
Thread 10 "GeckoWorkerThre" hit Breakpoint 4, nsGenericHTMLFrameElement::
    GetContentWindowInternal (this=this@entry=0x7fba4f9560)
    at dom/html/nsGenericHTMLFrameElement.cpp:100
100       EnsureFrameLoader();
(gdb) bt
#0  nsGenericHTMLFrameElement::GetContentWindowInternal
    (this=this@entry=0x7fba4f9560)
    at dom/html/nsGenericHTMLFrameElement.cpp:100
#1  0x0000007ff39194bc in nsGenericHTMLFrameElement::GetContentWindow
    (this=this@entry=0x7fba4f9560)
    at dom/html/nsGenericHTMLFrameElement.cpp:116
#2  0x0000007ff3518bf0 in mozilla::dom::HTMLIFrameElement_Binding::
    get_contentWindow (cx=0x7fb82263f0, obj=..., void_self=0x7fba4f9560,
    args=...)
    at HTMLIFrameElementBinding.cpp:857
#3  0x0000007ff35b8290 in mozilla::dom::binding_detail::GenericGetter
    <mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::
    binding_detail::ThrowExceptions> (cx=0x7fb82263f0, argc=<optimized out>,
    vp=<optimized out>)
    at dist/include/js/CallArgs.h:245
#4  0x0000007ff4e19610 in CallJSNative (args=..., reason=js::CallReason::Getter, 
    native=0x7ff35b80c0 <mozilla::dom::binding_detail::GenericGetter
    <mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::
    binding_detail::ThrowExceptions>(JSContext*, unsigned int, JS::Value*)>,
    cx=0x7fb82263f0)
    at dist/include/js/CallArgs.h:285
#5  js::InternalCallOrConstruct (cx=cx@entry=0x7fb82263f0, args=...,
    construct=construct@entry=js::NO_CONSTRUCT, 
    reason=reason@entry=js::CallReason::Getter) at js/src/vm/Interpreter.cpp:585
#6  0x0000007ff4e1a268 in InternalCall (reason=js::CallReason::Getter, args=...,
    cx=0x7fb82263f0)
    at js/src/vm/Interpreter.cpp:648
#7  js::Call (reason=js::CallReason::Getter, rval=..., args=..., thisv=...,
    fval=..., cx=0x7fb82263f0)
    at js/src/vm/Interpreter.cpp:665
#8  js::CallGetter (cx=cx@entry=0x7fb82263f0, thisv=..., getter=...,
    getter@entry=..., rval=...)
    at js/src/vm/Interpreter.cpp:789
[...]
#44 0x0000007fefa864ac in QThread::exec() () from /usr/lib64/libQt5Core.so.5
#45 0x0000007fefa8b0e8 in ?? () from /usr/lib64/libQt5Core.so.5
#46 0x0000007fef971a4c in ?? () from /lib64/libpthread.so.0
#47 0x0000007fef65b89c in ?? () from /lib64/libc.so.6
(gdb) 
By reading through all of the functions in this backtrace, starting at the top and moving downwards, I need to find out where the ESR 91 code diverges from ESR 78 in a way that means this method doesn't get called with ESR 91.

This involves reading through each of the methods as they are in ESR 91 to see if they're different at any point. If they are I can run ESR 91 to establish whether that difference is the actual cause of the different execution paths between the two.

But before I do that I want to review things again.

On both ESR 78 and ESR 91 the nsFrameLoader::Create() method is called. However on ESR 91 changeset D100348 means that this no longer calls InitSessionHistory().

Instead, and because of this change, on ESR 91 it's the TryRemoteBrowserInternal() method that needs to be called in order for the session history to be initialised. On ESR 91 the route to that method appears to be via nsFrameLoader::GetBrowsingContext().

On ESR 78 the GetBrowsingContext() method gets called like this:
Thread 8 "GeckoWorkerThre" hit Breakpoint 3, nsFrameLoader::GetBrowsingContext
    (this=0x7f89886f80)
    at dom/base/nsFrameLoader.cpp:3194
3194      if (IsRemoteFrame()) {
(gdb) bt
#0  nsFrameLoader::GetBrowsingContext (this=0x7f89886f80)
    at dom/base/nsFrameLoader.cpp:3194
#1  0x0000007fbb5ad250 in mozilla::dom::HTMLIFrameElement::
    MaybeStoreCrossOriginFeaturePolicy (this=0x7f895858f0)
    at dom/html/HTMLIFrameElement.cpp:252
#2  mozilla::dom::HTMLIFrameElement::MaybeStoreCrossOriginFeaturePolicy
    (this=0x7f895858f0)
    at dom/html/HTMLIFrameElement.cpp:241
#3  0x0000007fbb5ad390 in mozilla::dom::HTMLIFrameElement::RefreshFeaturePolicy
    (this=0x7f895858f0, aParseAllowAttribute=aParseAllowAttribute@entry=true)
    at dom/html/HTMLIFrameElement.cpp:329
#4  0x0000007fbb5ad4b8 in mozilla::dom::HTMLIFrameElement::BindToBrowsingContext
    (this=<optimized out>, aBrowsingContext=<optimized out>)
    at dom/html/HTMLIFrameElement.cpp:72
#5  0x0000007fbc7abf60 in mozilla::dom::BrowsingContext::Embed
    (this=<optimized out>)
    at docshell/base/BrowsingContext.cpp:505
#6  0x0000007fbaae320c in nsFrameLoader::MaybeCreateDocShell (this=0x7f89886f80)
    at dist/include/mozilla/RefPtr.h:313
#7  0x0000007fbaae5468 in nsFrameLoader::ReallyStartLoadingInternal
    (this=this@entry=0x7f89886f80)
    at dom/base/nsFrameLoader.cpp:612
#8  0x0000007fbaae5864 in nsFrameLoader::ReallyStartLoading (
dwarf2read.c:10473: internal-error: process_die_scope::process_die_scope
    (die_info*, dwarf2_cu*): Assertion `!m_die->in_process' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
On ESR 91 the GetBrowsingContext() method never seems to get called at all. But I've put breakpoints on most of the other places in the backtrace to check whether they get hit on ESR 91:
Breakpoint 3 at 0x7ff3925d4c: file dom/html/HTMLIFrameElement.cpp, line 221.
(gdb) b HTMLIFrameElement::MaybeStoreCrossOriginFeaturePolicy
Note: breakpoint 3 also set at pc 0x7ff3925d4c.
Breakpoint 4 at 0x7ff3925d4c: file dom/html/HTMLIFrameElement.cpp, line 221.
(gdb) b HTMLIFrameElement::RefreshFeaturePolicy
Breakpoint 5 at 0x7ff3925fb0: file dom/html/HTMLIFrameElement.cpp, line 266.
(gdb) b HTMLIFrameElement::BindToBrowsingContext
Breakpoint 6 at 0x7ff3926134: file dom/html/HTMLIFrameElement.cpp, line 70.
(gdb) b BrowsingContext::Embed
Breakpoint 7 at 0x7ff4ab2b0c: file ${PROJECT}/obj-build-mer-qt-xr/dist/include/
    mozilla/RefPtr.h, line 313.
(gdb) b nsFrameLoader::MaybeCreateDocShell
Breakpoint 8 at 0x7ff2dedc98: file dom/base/nsFrameLoader.cpp, line 2179.
(gdb) b  nsFrameLoader::ReallyStartLoadingInternal
Breakpoint 9 at 0x7ff2def63c: file dom/base/nsFrameLoader.cpp, line 664.
(gdb) info break
Num Type       Disp Enb What
2   breakpoint keep y   in nsFrameLoader::GetBrowsingContext() 
                        at dom/base/nsFrameLoader.cpp:3489
3   breakpoint keep y   in mozilla::dom::HTMLIFrameElement::
                           MaybeStoreCrossOriginFeaturePolicy() 
                        at dom/html/HTMLIFrameElement.cpp:221
4   breakpoint keep y   in mozilla::dom::HTMLIFrameElement::
                           MaybeStoreCrossOriginFeaturePolicy() 
                        at dom/html/HTMLIFrameElement.cpp:221
5   breakpoint keep y   in mozilla::dom::HTMLIFrameElement::
                           RefreshFeaturePolicy(bool) 
                        at dom/html/HTMLIFrameElement.cpp:266
6   breakpoint keep y   in mozilla::dom::HTMLIFrameElement::
                           BindToBrowsingContext(mozilla::dom::BrowsingContext*) 
                        at dom/html/HTMLIFrameElement.cpp:70
7   breakpoint keep y   in mozilla::dom::BrowsingContext::Embed() 
                        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/
                           RefPtr.h:313
8   breakpoint keep y   in nsFrameLoader::MaybeCreateDocShell() 
                        at dom/base/nsFrameLoader.cpp:2179
9   breakpoint keep y   in nsFrameLoader::ReallyStartLoadingInternal() 
                        at dom/base/nsFrameLoader.cpp:664
(gdb) r
And one of them does get hit:
Thread 10 "GeckoWorkerThre" hit Breakpoint 9, nsFrameLoader::
    ReallyStartLoadingInternal (this=this@entry=0x7fc160da30)
    at dom/base/nsFrameLoader.cpp:664
664     nsresult nsFrameLoader::ReallyStartLoadingInternal() {
(gdb) c
Continuing.
Let's see what happens next...
Thread 10 "GeckoWorkerThre" hit Breakpoint 8, nsFrameLoader::
    MaybeCreateDocShell(this=this@entry=0x7fc160da30)
    at dom/base/nsFrameLoader.cpp:2179
2179    nsresult nsFrameLoader::MaybeCreateDocShell() {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 7, mozilla::dom::BrowsingContext::
    Embed (this=0x7fc161aa70)
    at docshell/base/BrowsingContext.cpp:711
711       if (auto* frame = HTMLIFrameElement::FromNode(mEmbedderElement)) {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 6, mozilla::dom::HTMLIFrameElement::
    BindToBrowsingContext (this=0x7fc14db4c0)
    at dom/html/HTMLIFrameElement.cpp:70
70      void HTMLIFrameElement::BindToBrowsingContext(BrowsingContext*) {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 5, mozilla::dom::HTMLIFrameElement::
    RefreshFeaturePolicy (this=0x7fc14db4c0, 
    aParseAllowAttribute=aParseAllowAttribute@entry=true) at
    dom/html/HTMLIFrameElement.cpp:266
266     void HTMLIFrameElement::RefreshFeaturePolicy(bool aParseAllowAttribute) {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 3, mozilla::dom::HTMLIFrameElement::
    MaybeStoreCrossOriginFeaturePolicy (this=this@entry=0x7fc14db4c0)
    at dom/html/HTMLIFrameElement.cpp:221
221     void HTMLIFrameElement::MaybeStoreCrossOriginFeaturePolicy() {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 2, nsFrameLoader::GetBrowsingContext
    (this=0x7fc160da30)
    at dom/base/nsFrameLoader.cpp:3489
3489      if (mNotifyingCrash) {
(gdb) 
As a quick reminder, this is what the inside of this method looks like:
BrowsingContext* nsFrameLoader::GetBrowsingContext() {
  if (mNotifyingCrash) {
    if (mPendingBrowsingContext && mPendingBrowsingContext->EverAttached()) {
      return mPendingBrowsingContext;
    }
    return nullptr;
  }
  if (IsRemoteFrame()) {
    Unused << EnsureRemoteBrowser();
  } else if (mOwnerContent) {
    Unused << MaybeCreateDocShell();
  }
  return GetExtantBrowsingContext();
}
Let's compare this to the relevant variable values.
(gdb) p mNotifyingCrash
$1 = false
(gdb) p mIsRemoteFrame
$2 = false
(gdb) p mOwnerContent
$3 = (mozilla::dom::Element *) 0x7fc14db4c0
(gdb) 
With these values IsRemoteFrame() will return false and the MaybeCreateDocShell() path will be entered, rather than the EnsureRemoteBrowser() path that we want.
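
To make the branch selection explicit, here's a small standalone sketch that mirrors the order of the checks in the ESR 91 method body quoted above. The Branch enum and ChooseBranch() function are hypothetical names invented for this example, and the real method does rather more than pick a label.

```cpp
#include <cassert>

// Which branch of the ESR 91 nsFrameLoader::GetBrowsingContext() would run.
// The enum and function are hypothetical; only the order of the checks
// follows the method body quoted above.
enum class Branch {
  NotifyingCrash,
  EnsureRemoteBrowser,
  MaybeCreateDocShell,
  Neither
};

Branch ChooseBranch(bool notifyingCrash, bool isRemoteFrame,
                    bool hasOwnerContent) {
  if (notifyingCrash) {
    return Branch::NotifyingCrash;
  }
  if (isRemoteFrame) {
    return Branch::EnsureRemoteBrowser;
  }
  if (hasOwnerContent) {
    return Branch::MaybeCreateDocShell;
  }
  return Branch::Neither;
}

void CheckObservedValues() {
  // Values printed by gdb: mNotifyingCrash = false, mIsRemoteFrame = false,
  // mOwnerContent = non-null.
  assert(ChooseBranch(false, false, true) == Branch::MaybeCreateDocShell);
}
```

Only flipping the second argument to true would send execution down the EnsureRemoteBrowser() route.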

In ESR 78 we have a slightly different version of this method:
BrowsingContext* nsFrameLoader::GetBrowsingContext() {
  if (IsRemoteFrame()) {
    Unused << EnsureRemoteBrowser();
  } else if (mOwnerContent) {
    Unused << MaybeCreateDocShell();
  }
  return GetExtantBrowsingContext();
}
But given the values of the variables, we'll get the same result:
(gdb) p mIsRemoteFrame
$1 = false
(gdb) p mOwnerContent
$2 = (
dwarf2read.c:10473: internal-error: process_die_scope::process_die_scope
    (die_info*, dwarf2_cu*): Assertion `!m_die->in_process' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n
(gdb) p (mOwnerContent != 0)
$3 = true
(gdb) 
It's beginning to look like the problem here is that IsRemoteFrame() is always returning false, so that the code we want to get called never gets called.

Having said that, there is also a reference to InitSessionHistory() in MaybeCreateDocShell() so we ought to check that too:
Thread 10 "GeckoWorkerThre" hit Breakpoint 8, nsFrameLoader::MaybeCreateDocShell
    (this=this@entry=0x7fc160da30)
    at dom/base/nsFrameLoader.cpp:2179
2179    nsresult nsFrameLoader::MaybeCreateDocShell() {
(gdb) n
2180      if (GetDocShell()) {
(gdb) 
nsFrameLoader::GetBrowsingContext (this=0x7fc160da30) at dom/base/
    nsFrameLoader.cpp:3500
3500      return GetExtantBrowsingContext();
(gdb) 
mozilla::dom::HTMLIFrameElement::MaybeStoreCrossOriginFeaturePolicy
    (this=this@entry=0x7fc14db4c0)
    at dom/html/HTMLIFrameElement.cpp:232
232       RefPtr<BrowsingContext> browsingContext = mFrameLoader->
    GetBrowsingContext();
(gdb) 
234       if (!browsingContext || !browsingContext->IsContentSubframe()) {
(gdb) 
238       if (ContentChild* cc = ContentChild::GetSingleton()) {
(gdb) 
232       RefPtr<BrowsingContext> browsingContext = mFrameLoader->
    GetBrowsingContext();
(gdb) 
nsFrameLoader::MaybeCreateDocShell (this=this@entry=0x7fc160da30) at
    dom/base/nsFrameLoader.cpp:2237
2237      InvokeBrowsingContextReadyCallback();
(gdb) n
2239      mIsTopLevelContent = mPendingBrowsingContext->IsTopContent();
(gdb) 
2241      if (mIsTopLevelContent) {
(gdb) 
2252      nsCOMPtr<nsIDocShellTreeOwner> parentTreeOwner;
(gdb) 
1363    ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:
    No such file or directory.
(gdb) 
859     in ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h
(gdb) 
2258      RefPtr<EventTarget> chromeEventHandler;
(gdb) 
2259      bool parentIsContent = parentDocShell->GetBrowsingContext()->
    IsContent();
(gdb) 
2260      if (parentIsContent) {
(gdb) 
2263        parentDocShell->GetChromeEventHandler(getter_AddRefs
    (chromeEventHandler));
(gdb) 
289     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:
    No such file or directory.
(gdb) 
2278      nsCOMPtr<nsPIDOMWindowOuter> newWindow = docShell->GetWindow();
(gdb) n
2285      newWindow->SetFrameElementInternal(mOwnerContent);
(gdb) 
2288      if (mOwnerContent->IsXULElement(nsGkAtoms::browser) &&
(gdb) 
2295      if (!docShell->Initialize()) {
(gdb) 
2301      NS_ENSURE_STATE(mOwnerContent);
(gdb) 
We've now reached this condition, which we really want the program counter to enter:
  // If we are an in-process browser, we want to set up our session history.
  if (mIsTopLevelContent && mOwnerContent->IsXULElement(nsGkAtoms::browser) &&
      !mOwnerContent->HasAttr(kNameSpaceID_None, nsGkAtoms::disablehistory)) {
    // XXX(nika): Set this up more explicitly?
    mPendingBrowsingContext->InitSessionHistory();
  }
But it isn't to be:
(gdb) p mIsTopLevelContent 
$4 = false
(gdb) n
2304      if (mIsTopLevelContent && mOwnerContent->IsXULElement(nsGkAtoms::browser) &&
(gdb) 
2315      HTMLIFrameElement* iframe = HTMLIFrameElement::FromNode(mOwnerContent);
(gdb) 
The value of mIsTopLevelContent comes from earlier in the same method:
  mIsTopLevelContent = mPendingBrowsingContext->IsTopContent();
Checking in BrowsingContext.h we see this:
  bool IsTopContent() const { return IsContent() && IsTop(); }
[...]
  bool IsContent() const { return mType == Type::Content; }
[...]
  bool IsTop() const { return !GetParent(); }
And to round this off, the GetParent() method is defined in BrowsingContext.cpp like this:
BrowsingContext* BrowsingContext::GetParent() const {
  return mParentWindow ? mParentWindow->GetBrowsingContext() : nullptr;
}
This IsTop() check matches the one that the call to InitSessionHistory() is conditioned on elsewhere as well. Let's check the relevant values:
(gdb) p mPendingBrowsingContext.mRawPtr->mType
$5 = mozilla::dom::BrowsingContext::Type::Content
(gdb) p mPendingBrowsingContext.mRawPtr->mParentWindow
$6 = {mRawPtr = 0x7fc11bb430}
(gdb) 
This means that GetParent() is returning a non-null value, hence IsTop() must be returning false.

Which is why the session history isn't being initialised here.
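
For anyone wanting to see the chain of checks in one place, here's a minimal standalone sketch of how IsTopContent() ends up false. ToyWindow and ToyBrowsingContext are hypothetical stand-ins, not the real gecko classes; only the four small methods follow the BrowsingContext snippets quoted above.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real gecko classes. Only GetParent(),
// IsContent(), IsTop() and IsTopContent() follow the snippets quoted above.
struct ToyBrowsingContext;

struct ToyWindow {
  ToyBrowsingContext* browsingContext = nullptr;
};

struct ToyBrowsingContext {
  enum class Type { Chrome, Content };

  Type type = Type::Content;
  ToyWindow* parentWindow = nullptr;

  ToyBrowsingContext* GetParent() const {
    return parentWindow ? parentWindow->browsingContext : nullptr;
  }
  bool IsContent() const { return type == Type::Content; }
  bool IsTop() const { return !GetParent(); }
  bool IsTopContent() const { return IsContent() && IsTop(); }
};

// Recreates the situation seen in the debugger: a content context whose
// parent window is set and itself has a browsing context.
bool TopContentWithParent() {
  ToyBrowsingContext parent;
  ToyWindow window;
  window.browsingContext = &parent;

  ToyBrowsingContext child;
  child.parentWindow = &window;
  return child.IsTopContent();
}

void CheckParentedContext() {
  assert(!TopContentWithParent());
}
```

A default-constructed ToyBrowsingContext, with no parent window, reports IsTopContent() as true, which is the state we'd need for the session history to be initialised at this point.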

That's enough digging for today. More tomorrow.

25 Jan 2024 : Day 149 #
Today's is my last dev diary before taking a 14 day break from gecko development. I'm not convinced that I've made the right decision: there's a part of me that thinks I should forge on ahead and just make all of the things fit into the time I have available. But there's also the realist in me that says something has to give.

So there will be no dev diary tomorrow or until the 8th February, at which point I'll start up again. Ostensibly the reason is so that I can get my presentations together for FOSDEM. I want to do a decent job with the presentations. But I also have a lot going on at work right now. So it's necessary.

But there's still development to do today. Yesterday I set the build running after having added /browser/components/sessionstore to the embedlite/moz.build file. I was hoping this would result in SessionStore.jsm and its dependencies being added to omni.ja.

The package built fine. But after installing it on my device, the new files weren't to be found: they didn't make it into the archive. Worse than that, they've not even made it into the obj-build-mer-qt-xr folder on my laptop. That means they're still not getting included in the build.

It rather makes sense as well. The LOCAL_INCLUDES value should, if I'm understanding correctly, list places where C++ headers might be found. This shouldn't affect JavaScript files at all.

So I've spent the day digging around in the build system trying to figure out what needs to be changed to get them where they need to be. I'm certain there's an easy answer, but I just can't seem to figure it out.

I thought about trying to override SessionStore.jsm as a component, but since it doesn't actually seem to be a component itself, this didn't work either.

So after much flapping around, I've decided just to patch out the functionality from the SessionStoreFunctions.jsm file. That doesn't feel like the right way to do this, but until someone suggests a better way (which I'm all open to!) this should at least be a pretty simple fix.

Let's see.

I've built a new version of gecko-dev with the changes applied, installed them on my phone and it's now time to run them.
$ sailfish-browser 
[D] unknown:0 - Using Wayland-EGL
library "libGLESv2_adreno.so" not found
library "eglSubDriverAndroid.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
JavaScript error: chrome://embedlite/content/embedhelper.js, line 259:
    TypeError: sessionHistory is null
Call EmbedLiteApp::StopChildThread()
Redirecting call to abort() to mozalloc_abort
The log output is encouragingly quiet. There is one error stating that sessionHistory is null. I think this is unrelated to the SessionStore changes I've made, but it's still worth looking into this to fix it. Maybe this will be what fixes the Back and Forwards buttons?

What's clear is that the SessionStore errors have now gone, which is great. But fixing those errors sadly hasn't fixed the Back and Forwards buttons.

Let's look at this other error then. It's caused by the last line in this code block:
      case "embedui:addhistory": {
        // aMessage.data contains: 1) list of 'links' loaded from DB, 2) current
        // 'index'.

        let docShell = content.docShell;
        let sessionHistory = docShell.QueryInterface(Ci.nsIWebNavigation).
            sessionHistory;
        let legacyHistory = sessionHistory.legacySHistory;
The penultimate line is essentially saying "View the docShell object as a WebNavigation object and access the sessionHistory value stored inside it".

So there are three potential reasons why this might be failing. First it could be that docShell no longer supports the WebNavigation interface. Second it could be that WebNavigation has changed so it no longer contains a sessionHistory value. Third it could be that the value is still there, but it's set to null.
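
To make those three cases concrete, here's a minimal sketch that tells them apart. It runs against a mock object rather than the real docShell: diagnoseSessionHistory() and makeMockDocShell() are hypothetical names made up for illustration, and the only assumption about Gecko is that a failed QueryInterface() throws.

```javascript
// Hypothetical helper: classify which of the three failure modes applies.
// "docShell" and "iface" are stand-ins; in Gecko the interface would be
// Ci.nsIWebNavigation and an unsupported QueryInterface() call throws.
function diagnoseSessionHistory(docShell, iface) {
  let webNav;
  try {
    webNav = docShell.QueryInterface(iface);
  } catch (e) {
    return "interface-unsupported"; // Case 1: interface not supported
  }
  if (!("sessionHistory" in webNav)) {
    return "attribute-missing"; // Case 2: attribute removed from interface
  }
  if (webNav.sessionHistory === null) {
    return "value-null"; // Case 3: attribute exists but is null
  }
  return "ok";
}

// Mock docShell for illustration only; this is not the real nsDocShell.
function makeMockDocShell(options) {
  const { supportsIface = true, hasAttribute = true, history = null } =
    options || {};
  const shell = {
    QueryInterface(iface) {
      if (!supportsIface) {
        throw new Error("NS_NOINTERFACE");
      }
      return shell;
    },
  };
  if (hasAttribute) {
    shell.sessionHistory = history;
  }
  return shell;
}

// A mock that supports the interface but has a null history.
console.log(diagnoseSessionHistory(makeMockDocShell({}), "nsIWebNavigation"));
```

The real checks here are done by reading the source and using the debugger, but the ordering is the same: rule out the interface first, then the attribute, then the value.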

From the nsDocShell class definition in nsDocShell.h it's clear that the interface is still supported:
class nsDocShell final : public nsDocLoader,
                         public nsIDocShell,
                         public nsIWebNavigation,
                         public nsIBaseWindow,
                         public nsIRefreshURI,
                         public nsIWebProgressListener,
                         public nsIWebPageDescriptor,
                         public nsIAuthPromptProvider,
                         public nsILoadContext,
                         public nsINetworkInterceptController,
                         public nsIDeprecationWarner,
                         public mozilla::SupportsWeakPtr {
So let's check that WebNavigation interface, defined in the nsIWebNavigation.idl file. The field is still there in the interface definition:
  /**
   * The session history object used by this web navigation instance. This
   * object will be a mozilla::dom::ChildSHistory object, but is returned as
   * nsISupports so it can be called from JS code.
   */
  [binaryname(SessionHistoryXPCOM)]
  readonly attribute nsISupports sessionHistory;
Although the interface is being accessed from nsDocShell, when we look at the code we can see that the history itself is coming from elsewhere:
  mozilla::dom::ChildSHistory* GetSessionHistory() {
    return mBrowsingContext->GetChildSessionHistory();
  }
[...]
  RefPtr<mozilla::dom::BrowsingContext> mBrowsingContext;
This provides us with an opportunity, because it means we can place a breakpoint here to see what it's doing.
$ gdb sailfish-browser
[...]
(gdb) b nsDocShell::GetSessionHistory
Breakpoint 1 at 0x7fbc7b37c4: nsDocShell::GetSessionHistory. (10 locations)
(gdb) b BrowsingContext::GetChildSessionHistory
Breakpoint 2 at 0x7fbc7b376c: file docshell/base/BrowsingContext.cpp, line 3314.
(gdb) c
Thread 10 "GeckoWorkerThre" hit Breakpoint 1, nsDocShell::GetSessionHistory
    (this=0x7f80aa9280)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
313     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:
    No such file or directory.
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 2, mozilla::dom::BrowsingContext::
    GetChildSessionHistory (this=0x7f80c58e90)
    at docshell/base/BrowsingContext.cpp:3314
3314    ChildSHistory* BrowsingContext::GetChildSessionHistory() {
(gdb) b mChildSessionHistory
Function "mChildSessionHistory" not defined.
Make breakpoint pending on future shared library load? (y or [n]) n
(gdb) p mChildSessionHistory
$1 = {mRawPtr = 0x0}
(gdb) 
So this value is unset, but we'd expect it to be set as a consequence of a call to CreateChildSHistory():
void BrowsingContext::CreateChildSHistory() {
  MOZ_ASSERT(IsTop());
  MOZ_ASSERT(GetHasSessionHistory());
  MOZ_DIAGNOSTIC_ASSERT(!mChildSessionHistory);

  // Because session history is global in a browsing context tree, every process
  // that has access to a browsing context tree needs access to its session
  // history. That is why we create the ChildSHistory object in every process
  // where we have access to this browsing context (which is the top one).
  mChildSessionHistory = new ChildSHistory(this);

  // If the top browsing context (this one) is loaded in this process then we
  // also create the session history implementation for the child process.
  // This can be removed once session history is stored exclusively in the
  // parent process.
  mChildSessionHistory->SetIsInProcess(IsInProcess());
}
As I'm looking through this code in ESR 91 and ESR 78 I notice that the above method has changed: the call to SetIsInProcess() is new. I wonder if that will ultimately be related to why this isn't being set? I'm thinking that the location where the creation happens may be different.

There are indeed some differences. In ESR 91 it's called in BrowsingContext::CreateFromIPC() and BrowsingContext::Attach() whereas in ESR 78 it's called in BrowsingContext::Attach(). Both versions also have it being called from BrowsingContext::DidSet().

I should put some breakpoints on those methods to see which, if any, is being called. And I should do it for both versions.

So here's the result for ESR 91:
$ gdb sailfish-browser 
[...]
(gdb) b BrowsingContext::CreateChildSHistory
Function "BrowsingContext::CreateChildSHistory" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (BrowsingContext::CreateChildSHistory) pending.
(gdb) r
[...]
The breakpoint is never hit and the creation never occurs. In contrast, on ESR 78 we get a hit before the first page has loaded:
$ gdb sailfish-browser
[...]
(gdb) b BrowsingContext::CreateChildSHistory
Function "BrowsingContext::CreateChildSHistory" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (BrowsingContext::CreateChildSHistory) pending.
(gdb) r
[...]

Thread 8 "GeckoWorkerThre" hit Breakpoint 1, mozilla::dom::BrowsingContext::
    CreateChildSHistory (this=this@entry=0x7f889dc120)
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
33      obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:
    No such file or directory.
(gdb) bt
#0  mozilla::dom::BrowsingContext::CreateChildSHistory
    (this=this@entry=0x7f889dc120)
    at obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#1  0x0000007fbc7c933c in mozilla::dom::BrowsingContext::DidSet
    (aOldValue=<optimized out>, this=0x7f889dc120)
    at docshell/base/BrowsingContext.cpp:2356
#2  mozilla::dom::syncedcontext::Transaction<mozilla::dom::BrowsingContext>::
    Apply(mozilla::dom::BrowsingContext*)::{lambda(auto:1)#1}::operator()
    <std::integral_constant<unsigned long, 37ul> >(std::integral_constant
    <unsigned long, 37ul>) const (this=<optimized out>, this=<optimized out>,
    idx=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/dom/SyncedContextInlines.h:137
[...]
#8  0x0000007fbc7cb484 in mozilla::dom::BrowsingContext::SetHasSessionHistory
    <bool> (this=this@entry=0x7f889dc120, 
    aValue=aValue@entry=@0x7fa6e1da57: true)
    at obj-build-mer-qt-xr/dist/include/mozilla/OperatorNewExtensions.h:47
#9  0x0000007fbc7cb54c in mozilla::dom::BrowsingContext::InitSessionHistory
    (this=0x7f889dc120)
    at docshell/base/BrowsingContext.cpp:2316
#10 0x0000007fbc7cb590 in mozilla::dom::BrowsingContext::InitSessionHistory
    (this=this@entry=0x7f889dc120)
    at obj-build-mer-qt-xr/dist/include/mozilla/dom/BrowsingContext.h:161
#11 0x0000007fbc901fac in nsWebBrowser::Create (aContainerWindow=
    <optimized out>, aParentWidget=<optimized out>, 
    aBrowsingContext=aBrowsingContext@entry=0x7f889dc120,
    aInitialWindowChild=aInitialWindowChild@entry=0x0)
    at toolkit/components/browser/nsWebBrowser.cpp:158
#12 0x0000007fbca950e8 in mozilla::embedlite::EmbedLiteViewChild::
    InitGeckoWindow (this=0x7f88bca840, parentId=0, parentBrowsingContext=0x0, 
    isPrivateWindow=<optimized out>, isDesktopMode=false)
    at obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:847
[...]
#33 0x0000007fb735e89c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.
[...]
This could well be our smoking gun. I need to check back through the backtrace to understand the process happening on ESR 78 and then establish why something similar isn't happening on ESR 91. Progress!

Immediately on examining the backtrace it's clear something odd is happening. The caller is BrowsingContext::DidSet(), which is one of the locations where the call is made in both ESR 78 and ESR 91. So that does rather raise the question of why it's not getting called in ESR 91.

Digging back through the backtrace further, it eventually materialises that the difference is happening in nsWebBrowser::Create(). There's this bit of code in ESR 78 that looks like this:
  // If the webbrowser is a content docshell item then we won't hear any
  // events from subframes. To solve that we install our own chrome event
  // handler that always gets called (even for subframes) for any bubbling
  // event.

  if (aBrowsingContext->IsTop()) {
    aBrowsingContext->InitSessionHistory();
  }

  NS_ENSURE_SUCCESS(docShellAsWin->Create(), nullptr);

  docShellTreeOwner->AddToWatcher();  // evil twin of Remove in SetDocShell(0)
  docShellTreeOwner->AddChromeListeners();
You can see the InitSessionHistory() call in there which eventually leads to the creation of our sessionHistory object. In ESR 91 that same bit of code looks like this:
  // If the webbrowser is a content docshell item then we won't hear any
  // events from subframes. To solve that we install our own chrome event
  // handler that always gets called (even for subframes) for any bubbling
  // event.

  nsresult rv = docShell->InitWindow(nullptr, docShellParentWidget, 0, 0, 0, 0);
  if (NS_WARN_IF(NS_FAILED(rv))) {
    return nullptr;
  }

  docShellTreeOwner->AddToWatcher();  // evil twin of Remove in SetDocShell(0)
  docShellTreeOwner->AddChromeListeners();
Where has the InitSessionHistory() gone? It should be possible to find out using a bit of git log searching. Here we're following the rule of using git blame to find out about lines that have been added and git log -S to find out about lines that have been removed.
$ git log -1 -S InitSessionHistory toolkit/components/browser/nsWebBrowser.cpp
commit 140a4164598e0c9ed537a377cf66ef668a7fbc25
Author: Randell Jesup <rjesup@wgate.com>
Date:   Mon Feb 1 22:57:12 2021 +0000

    Bug 1673617 - Create BrowsingContext::mChildSessionHistory more
    aggressively, r=nika
    
    Differential Revision: https://phabricator.services.mozilla.com/D100348
Just looking at this change, it removes the call to InitSessionHistory() in nsWebBrowser::Create() and moves it to nsFrameLoader::TryRemoteBrowserInternal. There are some related changes in the parent Bug 1673617, but looking through those it doesn't seem that they're anything we need to worry about.
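
To illustrate why moving that one call matters, here's a deliberately simplified model of the flow seen in the backtrace. It's a sketch and not Gecko code: MockBrowsingContext and the two createWebBrowser functions are hypothetical stand-ins, and the only behaviour modelled is that the child history object is created as a side effect of initialising session history.

```javascript
// Simplified model of the flow in the backtrace; not actual Gecko code.
class MockBrowsingContext {
  constructor() {
    this.hasSessionHistory = false;
    this.childSessionHistory = null;
  }

  // Stands in for BrowsingContext::InitSessionHistory().
  initSessionHistory() {
    if (!this.hasSessionHistory) {
      this.hasSessionHistory = true;
      this.didSetHasSessionHistory();
    }
  }

  // Stands in for the DidSet() hook that calls CreateChildSHistory().
  didSetHasSessionHistory() {
    if (this.childSessionHistory === null) {
      this.childSessionHistory = { entries: [] };
    }
  }

  // Stands in for BrowsingContext::GetChildSessionHistory().
  getChildSessionHistory() {
    return this.childSessionHistory;
  }
}

// ESR 78 shape: the creation path initialises the session history itself.
function createWebBrowserOldStyle(context) {
  context.initSessionHistory();
  return context;
}

// ESR 91 shape: the call has gone, so some other path (the frame loader in
// upstream Firefox) has to do the initialisation instead.
function createWebBrowserNewStyle(context) {
  return context;
}

const oldStyle = createWebBrowserOldStyle(new MockBrowsingContext());
const newStyle = createWebBrowserNewStyle(new MockBrowsingContext());
console.log(oldStyle.getChildSessionHistory() !== null); // history created
console.log(newStyle.getChildSessionHistory() === null); // nothing created it
```

In the model, as on the device, nothing is wrong with the getter. The history is simply never created unless something on the active code path asks for it.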

Placing a breakpoint on nsFrameLoader::TryRemoteBrowserInternal() shows that it's not being called on ESR 91.

The interesting thing is what happens when EnsureRemoteBrowser() gets called in this bit of code:
BrowsingContext* nsFrameLoader::GetBrowsingContext() {
  if (mNotifyingCrash) {
    if (mPendingBrowsingContext && mPendingBrowsingContext->EverAttached()) {
      return mPendingBrowsingContext;
    }
    return nullptr;
  }
  if (IsRemoteFrame()) {
    Unused << EnsureRemoteBrowser();
  } else if (mOwnerContent) {
    Unused << MaybeCreateDocShell();
  }
  return GetExtantBrowsingContext();
}
If the first path in the second condition is followed then InitSessionHistory() would get called too. If this gets called but InitSessionHistory() doesn't, it would imply that IsRemoteFrame() must be false. But if it is false then MaybeCreateDocShell() could get called, which also has a call to InitSessionHistory() like this:
  // If we are an in-process browser, we want to set up our session history.
  if (mIsTopLevelContent && mOwnerContent->IsXULElement(nsGkAtoms::browser) &&
      !mOwnerContent->HasAttr(kNameSpaceID_None, nsGkAtoms::disablehistory)) {
    // XXX(nika): Set this up more explicitly?
    mPendingBrowsingContext->InitSessionHistory();
  }
I wonder what's going on around about this code then. Checking with the debugger the answer turns out to be that apparently, nsFrameLoader::GetExtantBrowsingContext() simply doesn't get called either.
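
For reference, the branching I've been reasoning about can be summarised in a small decision function. This is only a sketch of my reading of the two snippets above: sessionHistoryInitPath() and the field names on its argument are hypothetical, and the "remote" result assumes, as discussed, that EnsureRemoteBrowser() leads on to the InitSessionHistory() call.

```javascript
// Hypothetical summary of the branches in nsFrameLoader::GetBrowsingContext()
// and MaybeCreateDocShell() quoted above. Field names are illustrative only.
function sessionHistoryInitPath(state) {
  if (state.notifyingCrash) {
    // Early return: neither EnsureRemoteBrowser() nor MaybeCreateDocShell().
    return "none";
  }
  if (state.isRemoteFrame) {
    // EnsureRemoteBrowser() path.
    return "remote";
  }
  if (state.hasOwnerContent) {
    // MaybeCreateDocShell() path: only in-process top-level XUL browsers
    // without the disablehistory attribute get session history.
    if (state.isTopLevelContent && state.isXulBrowser && !state.disableHistory) {
      return "in-process";
    }
  }
  return "none";
}

console.log(sessionHistoryInitPath({ isRemoteFrame: true }));
console.log(
  sessionHistoryInitPath({
    hasOwnerContent: true,
    isTopLevelContent: true,
    isXulBrowser: true,
    disableHistory: false,
  })
);
```

Of course none of these branches matter if the method is never entered at all, which is what the debugger is telling me.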

From here, things pan out. The EnsureRemoteBrowser() method is called by all of these methods:
  1. nsFrameLoader::GetBrowsingContext()
  2. nsFrameLoader::ShowRemoteFrame()
  3. nsFrameLoader::ReallyStartLoadingInternal()
None of these are static methods and when I place a breakpoint on the nsFrameLoader constructor it doesn't get hit. So it's not possible for any of these methods to be called and there's no point trying to dig any deeper via them.

However this isn't true in ESR 78 where the constructor does get called. It's almost certainly worthwhile finding out about this difference. But unfortunately I'm out of time for today.

I have to wrap things up. As I mentioned previously, I'm taking a break for two weeks to give myself a bit more breathing space as I prepare for FOSDEM, which I'm really looking forward to. If you're travelling to Brussels yourself then I hope to see you there. You'll be able to find me mostly on the Linux on Mobile stand.

When I get back I'll be back to posting these dev diaries again. And as a note to myself, when I do, my first action must be to figure out why nsFrameLoader is being created in ESR 78 but not ESR 91. That might hold the key to why the sessionHistory isn't getting created in ESR 91. And as a last resort, I can always revert commit 140a4164598e0c9ed53.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
24 Jan 2024 : Day 148 #
After much digging around in the code and gecko project structure I eventually decided that the best thing to do is implement a Sailfish-specific version of the SessionStore.jsm module.

Unfortunately this isn't just a case of copying the file over to embedlite-components, because it has some dependencies. These are listed at the top of the file. Let's go through and figure out which ones are already available, which we can remove, which we can copy over directly to embedlite-components and which we have to reimplement ourselves.

Here's the code that relates to the dependencies:
const { PrivateBrowsingUtils } = ChromeUtils.import(
  "resource://gre/modules/PrivateBrowsingUtils.jsm"
);
const { Services } = ChromeUtils.import("resource://gre/modules/Services.jsm");
const { TelemetryTimestamps } = ChromeUtils.import(
  "resource://gre/modules/TelemetryTimestamps.jsm"
);
const { XPCOMUtils } = ChromeUtils.import(
  "resource://gre/modules/XPCOMUtils.jsm"
);

ChromeUtils.defineModuleGetter(
  this,
  "SessionHistory",
  "resource://gre/modules/sessionstore/SessionHistory.jsm"
);

XPCOMUtils.defineLazyServiceGetters(this, {
  gScreenManager: ["@mozilla.org/gfx/screenmanager;1", "nsIScreenManager"],
});

XPCOMUtils.defineLazyModuleGetters(this, {
  AppConstants: "resource://gre/modules/AppConstants.jsm",
  AsyncShutdown: "resource://gre/modules/AsyncShutdown.jsm",
  BrowserWindowTracker: "resource:///modules/BrowserWindowTracker.jsm",
  DevToolsShim: "chrome://devtools-startup/content/DevToolsShim.jsm",
  E10SUtils: "resource://gre/modules/E10SUtils.jsm",
  GlobalState: "resource:///modules/sessionstore/GlobalState.jsm",
  HomePage: "resource:///modules/HomePage.jsm",
  PrivacyFilter: "resource://gre/modules/sessionstore/PrivacyFilter.jsm",
  PromiseUtils: "resource://gre/modules/PromiseUtils.jsm",
  RunState: "resource:///modules/sessionstore/RunState.jsm",
  SessionCookies: "resource:///modules/sessionstore/SessionCookies.jsm",
  SessionFile: "resource:///modules/sessionstore/SessionFile.jsm",
  SessionSaver: "resource:///modules/sessionstore/SessionSaver.jsm",
  SessionStartup: "resource:///modules/sessionstore/SessionStartup.jsm",
  TabAttributes: "resource:///modules/sessionstore/TabAttributes.jsm",
  TabCrashHandler: "resource:///modules/ContentCrashHandlers.jsm",
  TabState: "resource:///modules/sessionstore/TabState.jsm",
  TabStateCache: "resource:///modules/sessionstore/TabStateCache.jsm",
  TabStateFlusher: "resource:///modules/sessionstore/TabStateFlusher.jsm",
  setTimeout: "resource://gre/modules/Timer.jsm",
});
Here's a table to summarise. I've ordered them by their current status to help highlight what needs work and what type of work it is.
 
| Module | Variable | Status |
|---|---|---|
| gre/modules/PrivateBrowsingUtils.jsm | PrivateBrowsingUtils | Available |
| gre/modules/Services.jsm | Services | Available |
| gre/modules/TelemetryTimestamps.jsm | TelemetryTimestamps | Available |
| gre/modules/XPCOMUtils.jsm | XPCOMUtils | Available |
| gre/modules/sessionstore/SessionHistory.jsm | SessionHistory | Available |
| gre/modules/AppConstants.jsm | AppConstants | Available |
| gre/modules/AsyncShutdown.jsm | AsyncShutdown | Available |
| gre/modules/E10SUtils.jsm | E10SUtils | Available |
| gre/modules/sessionstore/PrivacyFilter.jsm | PrivacyFilter | Available |
| gre/modules/PromiseUtils.jsm | PromiseUtils | Available |
| gre/modules/Timer.jsm | setTimeout | Available |
| @mozilla.org/gfx/screenmanager;1 | gScreenManager | Drop |
| modules/ContentCrashHandlers.jsm | TabCrashHandler | Drop |
| devtools-startup/content/DevToolsShim.jsm | DevToolsShim | Drop |
| modules/sessionstore/TabAttributes.jsm | TabAttributes | Copy |
| modules/sessionstore/GlobalState.jsm | GlobalState | Copy |
| modules/sessionstore/RunState.jsm | RunState | Copy |
| modules/BrowserWindowTracker.jsm | BrowserWindowTracker | Drop |
| modules/sessionstore/SessionCookies.jsm | SessionCookies | Drop? |
| modules/HomePage.jsm | HomePage | Drop? |
| modules/sessionstore/SessionFile.jsm | SessionFile | Copy/drop? |
| modules/sessionstore/SessionSaver.jsm | SessionSaver | Copy/drop? |
| modules/sessionstore/SessionStartup.jsm | SessionStartup | Copy/drop? |
| modules/sessionstore/TabState.jsm | TabState | Copy/drop? |
| modules/sessionstore/TabStateCache.jsm | TabStateCache | Copy/drop? |
| modules/sessionstore/TabStateFlusher.jsm | TabStateFlusher | Copy/drop? |

In addition to the above there's also the SessionStore.jsm file itself.

As you can see there's still a fair bit of uncertainty in the table. But also, quite a large number of the dependencies are already available.
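
To put some numbers on that, here's a small sketch that groups the dependencies by status. The names and statuses are taken from the list above; groupByStatus() is just a hypothetical helper for the tally and isn't part of any of the gecko code.

```javascript
// Dependency triage from the summary above, as [variable, status] pairs.
const dependencies = [
  ["PrivateBrowsingUtils", "Available"],
  ["Services", "Available"],
  ["TelemetryTimestamps", "Available"],
  ["XPCOMUtils", "Available"],
  ["SessionHistory", "Available"],
  ["AppConstants", "Available"],
  ["AsyncShutdown", "Available"],
  ["E10SUtils", "Available"],
  ["PrivacyFilter", "Available"],
  ["PromiseUtils", "Available"],
  ["setTimeout", "Available"],
  ["gScreenManager", "Drop"],
  ["TabCrashHandler", "Drop"],
  ["DevToolsShim", "Drop"],
  ["TabAttributes", "Copy"],
  ["GlobalState", "Copy"],
  ["RunState", "Copy"],
  ["BrowserWindowTracker", "Drop"],
  ["SessionCookies", "Drop?"],
  ["HomePage", "Drop?"],
  ["SessionFile", "Copy/drop?"],
  ["SessionSaver", "Copy/drop?"],
  ["SessionStartup", "Copy/drop?"],
  ["TabState", "Copy/drop?"],
  ["TabStateCache", "Copy/drop?"],
  ["TabStateFlusher", "Copy/drop?"],
];

// Hypothetical helper: collect variable names under each status.
function groupByStatus(deps) {
  const groups = new Map();
  for (const [name, status] of deps) {
    if (!groups.has(status)) {
      groups.set(status, []);
    }
    groups.get(status).push(name);
  }
  return groups;
}

const groups = groupByStatus(dependencies);
for (const [status, names] of groups) {
  console.log(`${status}: ${names.length}`);
}
```

Eleven of the twenty-six are already available, which is why the job looks more tractable than the length of the import list suggests.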

From the code it looks like the functionality is around saving and restoring sessions, including tab data, cookies, window positions and the like. Some of this isn't relevant on Sailfish OS (there's no point saving and restoring window sizes) or is already handled by other parts of the system (cookie storage). In fact, it's not clear that this module is providing any additional functionality that sailfish-browser actually needs.

Given this my focus will be on creating a minimal implementation that doesn't error when called but performs very little functionality in practice. That will hopefully make the task tractable.
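
As a rough illustration of what "doesn't error when called" could look like, here's a sketch of a do-nothing stand-in built with a JavaScript Proxy. To be clear, this isn't the approach I'm taking with the real file (there I'm trimming down a copy of SessionStore.jsm by hand), and makeNoOpStub() and callLog are hypothetical names used only for this example.

```javascript
// Hypothetical no-op stand-in: every property read yields a function that
// records the call and returns undefined, so callers never hit an error.
function makeNoOpStub(moduleName, callLog) {
  return new Proxy(
    {},
    {
      get(target, property) {
        if (typeof property === "symbol") {
          // Leave symbol lookups (iteration, string conversion) undefined.
          return undefined;
        }
        return (...args) => {
          callLog.push({ moduleName, method: property, argCount: args.length });
          return undefined;
        };
      },
    }
  );
}

const callLog = [];
const SessionStoreStub = makeNoOpStub("SessionStore", callLog);

// Mimic the shape of the call made from SessionStoreFunctions.jsm; the
// arguments here are placeholders.
SessionStoreStub.updateSessionStoreForWindow({}, {}, "key", 0, {});
console.log(callLog.length);
console.log(callLog[0].method);
```

One caveat with this sketch is that the stub answers to every property name, including ones like "then" that other code may probe for, so an explicit list of empty methods is the more predictable choice for anything long-lived.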

It's early in the morning here still, but time for me to start work; so I'll pick this up again tonight.

[...]

It's now late evening and I have just a bit of time to move some files around. I've started by copying the SessionStore.jsm file into the embedlite-components project, alongside the other files I think I can copy without making changes. Apart from SessionStore.jsm, I've tried to copy over only files that don't require dependencies, or where the dependencies are all available.
$ find . -iname "SessionStore.jsm"
./gecko-dev/browser/components/sessionstore/SessionStore.jsm
$ cp ./gecko-dev/browser/components/sessionstore/SessionStore.jsm \
    ../../embedlite-components/jscomps/
$ cp ./gecko-dev/browser/components/sessionstore/TabAttributes.jsm \
    ../../embedlite-components/jscomps/
$ cp ./gecko-dev/browser/components/sessionstore/GlobalState.jsm \
     ../../embedlite-components/jscomps/
$ cp ./gecko-dev/browser/components/sessionstore/RunState.jsm \
    ../../embedlite-components/jscomps/
I've also been through and removed all of the code that used any of the dropped dependencies. And in fact I've gone ahead and dropped all of the modules marked as "Drop" or "Copy/drop" in the table above. Despite the quantity of code in the original files, it really doesn't look like there's much functionality that's needed for sailfish-browser in these scripts. But having the functions available may turn out to be useful at some point in the future and in the meantime if the module just provides methods that don't do anything, then they will at least be successful in suppressing the errors.

The final step is to hook them up into the system so that they get included and can be accessed by other parts of the code. And this is where I hit a problem. The embedlite-components package contains two types of JavaScript entity. The first are in the jscomps folder. These all seem to be components that have a defined interface (they satisfy a "contract") as specified in the EmbedLiteJSComponents.manifest file. Here's an example of the entry for the AboutRedirector component:
# AboutRedirector.js
component {59f3da9a-6c88-11e2-b875-33d1bd379849} AboutRedirector.js
contract @mozilla.org/network/protocol/about;1?what= {59f3da9a-6c88-11e2-b875-33d1bd379849}
contract @mozilla.org/network/protocol/about;1?what=embedlite {59f3da9a-6c88-11e2-b875-33d1bd379849}
contract @mozilla.org/network/protocol/about;1?what=certerror {59f3da9a-6c88-11e2-b875-33d1bd379849}
contract @mozilla.org/network/protocol/about;1?what=home {59f3da9a-6c88-11e2-b875-33d1bd379849}
Our SessionStore.jsm files can't be added like this because they're not components with defined interfaces in this way. The other type are in the jsscripts folder. These aren't components and the files would fit perfectly in there. But they are all accessed using a particular path and the chrome:// scheme, like this:
const { NetErrorHelper } = ChromeUtils.import("chrome://embedlite/content/NetErrorHelper.jsm")
This won't work for SessionStore.jsm, which is expected to be accessed like this:
XPCOMUtils.defineLazyModuleGetters(this, {
  SessionStore: "resource:///modules/sessionstore/SessionStore.jsm",
});
Different location; different approach.

So I'm going to need to find some other way to do this. As it's late, it will take me a while to come up with an alternative. But my immediate thought is that maybe I can just add the missing files in to the EmbedLite build process. It looks like this is being controlled by the embedlite/moz.build file. So I've added the component folder /browser/components/sessionstore into the list of directories there:
LOCAL_INCLUDES += [
    '!/build',
    '/browser/components/sessionstore',
    '/dom/base',
    '/dom/ipc',
    '/gfx/layers',
    '/gfx/layers/apz/util',
    '/hal',
    '/js/xpconnect/src',
    '/netwerk/base/',
    '/toolkit/components/resistfingerprinting',
    '/toolkit/xre',
    '/widget',
    '/xpcom/base',
    '/xpcom/build',
    '/xpcom/threads',
    'embedhelpers',
    'embedprocess',
    'embedshared',
    'embedthread',
    'modules',
    'utils',
]
I've added the directory, cleaned out the build directory and started off a fresh build to run overnight. Let's see whether that worked in the morning.

23 Jan 2024 : Day 147 #
Last night I woke up in a mild panic. This happens sometimes, usually when I have a lot going on and I feel like I'm in danger of dropping the ball.

This seems to be my mind's (or maybe my body's?) way of telling me that I need to get my priorities straight. That there's something that I need to get done, resolved or somehow dealt with, and I need to do it urgently or I'll continue to have sleepless nights until I do.

The reason for this particular panic was my FOSDEM preparations, combined with a build-up of projects at work that are coming to a head. For FOSDEM I have two talks to prepare (one about this blog, the other related to my work), the Linux on Mobile stand to help organise, the Sailfish community dinner on the Saturday to help organise, support for the HPC, Big Data & Data Science devroom on the Saturday, and also for the Python devroom on the Sunday. It's going to be a crazy busy event. But it's actually the run-up to it, and the fact I've still to write my presentations, that's losing me sleep.

It's all fine: it's under control. But in order to prevent it spiralling out of control, I'm going to be taking a break from gecko development for a couple of weeks until things have calmed down. This will slow down development, which of course saddens me because more than anything else I just want ESR 91 to end up in a good, releasable, state. But as others wiser than I am have already cautioned, this also means keeping a positive and healthy state of mind.

I'll finish my last post on Day 149 (that's this Thursday). I'll start back up again on Thursday the 8th February, assuming all goes to plan!

But for the next couple of days there's still development to be done, so let's get straight back to it.

Today I'm still attempting to fix the NS_ERROR_FILE_NOT_FOUND error coming from SessionStore.jsm which I believe may be causing the Back and Forwards buttons in the browser to fail. It's become clear that the SessionStore.jsm file itself is missing (along with a bunch of other files listed in gecko-dev/browser/components/sessionstore/moz.build) but what's not clear is whether the problem is that the files should be there, or that the calls to the methods in this file shouldn't be there.

Overnight while lying in bed I came up with some kind of plan to help move things forwards. The call to the failing method is happening in updateSessionStoreForWindow() and this is only called by the exported method UpdateSessionStoreForWindow(). As far as I can tell this isn't executed by any JavaScript code, but because it has an IPDL interface it's possible for it to be called by C++ code as well.

And sure enough, it's being called twice in WindowGlobalParent.cpp. Once at the end of the WriteFormDataAndScrollToSessionStore() method on line 1269, like this:
nsresult WindowGlobalParent::WriteFormDataAndScrollToSessionStore(
    const Maybe<FormData>& aFormData, const Maybe<nsPoint>& aScrollPosition,
    uint32_t aEpoch) {
[...]
  return funcs->UpdateSessionStoreForWindow(GetRootOwnerElement(), context, key,
                                            aEpoch, update);
}
And another time at the end of the ResetSessionStore() method on line 1310, like this:
nsresult WindowGlobalParent::ResetSessionStore(uint32_t aEpoch) {
[...]
  return funcs->UpdateSessionStoreForWindow(GetRootOwnerElement(), context, key,
                                            aEpoch, update);
}
I've placed a breakpoint on these two locations to find out whether these are actually where it's being fired from. If you're being super-observant you'll notice I've not placed the breakpoints exactly where I want them; I've had to place them earlier in the code (but crucially, still within the same methods). That's because it's not always possible to place breakpoints on the exact line you want.

The debugger will place it on the first line it can after the point you request. Because both of the cases I'm interested in are right at the end of the methods they're called in, when I attempt to put a breakpoint on the exact line the debugger places it instead in the next method along in the source code. That isn't much use for what I needed.

Hence I've placed them a little earlier in the code instead: on the first lines where they actually stick.
bash-5.0$ EMBED_CONSOLE=1 MOZ_LOG="EmbedLite:5" gdb sailfish-browser
(gdb) b WindowGlobalParent.cpp:1260
Breakpoint 5 at 0x7fbbc8e31c: file dom/ipc/WindowGlobalParent.cpp, line 1260.
(gdb) b WindowGlobalParent.cpp:1294
Breakpoint 9 at 0x7fbbc8e688: file dom/ipc/WindowGlobalParent.cpp, line 1294.
(gdb) r
[...]
JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 541:
    NS_ERROR_FILE_NOT_FOUND: 
CONSOLE message:
[JavaScript Error: "NS_ERROR_FILE_NOT_FOUND: " {file:
    "resource:///modules/sessionstore/SessionStore.jsm" line: 541}]
@resource:///modules/sessionstore/SessionStore.jsm:541:3
SSF_updateSessionStoreForWindow@resource://gre/modules/
    SessionStoreFunctions.jsm:120:5
UpdateSessionStoreForStorage@resource://gre/modules/
    SessionStoreFunctions.jsm:54:35

Thread 10 "GeckoWorkerThre" hit Breakpoint 5, mozilla::dom::WindowGlobalParent::
    WriteFormDataAndScrollToSessionStore (this=this@entry=0x7f81164520, 
    aFormData=..., aScrollPosition=..., aEpoch=0)
    at dom/ipc/WindowGlobalParent.cpp:1260
1260      windowState.mHasChildren.Construct() = !context->Children().IsEmpty();
(gdb) n
709     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Span.h:
        No such file or directory.
(gdb) 
1260      windowState.mHasChildren.Construct() = !context->Children().IsEmpty();
(gdb) 
1262      JS::RootedValue update(jsapi.cx());
(gdb) 
1263      if (!ToJSValue(jsapi.cx(), windowState, &update)) {
(gdb) 
1267      JS::RootedValue key(jsapi.cx(), context->Top()->PermanentKey());
(gdb) 
1269      return funcs->UpdateSessionStoreForWindow(GetRootOwnerElement(),
          context, key,
(gdb) n
1297    ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/RootingAPI.h:
        No such file or directory.
(gdb) 
JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 541:
    NS_ERROR_FILE_NOT_FOUND: 
1267      JS::RootedValue key(jsapi.cx(), context->Top()->PermanentKey());
(gdb) c
Continuing.
It's clear from the above that the WriteFormDataAndScrollToSessionStore() method is being called and is then going on to call the missing JavaScript method. We can even see the error coming from the JavaScript as we step out of the calling method.

You'll notice from the output that there's another — earlier — NS_ERROR_FILE_NOT_FOUND error before the breakpoint hits. This is coming from a different spot: line 120 of SessionStoreFunctions.jsm rather than the line 105 we were looking at here.

This new error is called from line 54 of the same file (we can see this from the backtrace in the log output) which is in the UpdateSessionStoreForStorage() method in the file. So where is this being called from?
$ grep -rIn "UpdateSessionStoreForStorage" * --include="*.js" \
    --include="*.jsm" --include="*.cpp" --exclude-dir="obj-build-mer-qt-xr"
gecko-dev/docshell/base/CanonicalBrowsingContext.cpp:2233:
    return funcs->UpdateSessionStoreForStorage(Top()->GetEmbedderElement(), this,
gecko-dev/docshell/base/CanonicalBrowsingContext.cpp:2255:
    void CanonicalBrowsingContext::UpdateSessionStoreForStorage(
gecko-dev/dom/storage/SessionStorageManager.cpp:854:
    CanonicalBrowsingContext::UpdateSessionStoreForStorage(
gecko-dev/toolkit/components/sessionstore/SessionStoreFunctions.jsm:47:
    function UpdateSessionStoreForStorage(
gecko-dev/toolkit/components/sessionstore/SessionStoreFunctions.jsm:66:
    "UpdateSessionStoreForStorage",
From this we can see it's being called in a few places, all of them from C++ code. Again, that means we can explore them with gdb using breakpoints. Let's give this a go as well.
(gdb) break CanonicalBrowsingContext.cpp:2213
Breakpoint 7 at 0x7fbc7c6abc: file docshell/base/CanonicalBrowsingContext.cpp,
    line 2213.
(gdb) r
[...]

Thread 10 "GeckoWorkerThre" hit Breakpoint 7, mozilla::dom::
    CanonicalBrowsingContext::WriteSessionStorageToSessionStore
    (this=0x7f80b6dee0, aSesssionStorage=..., aEpoch=0)
    at docshell/base/CanonicalBrowsingContext.cpp:2213
2213      AutoJSAPI jsapi;
(gdb) bt
#0  mozilla::dom::CanonicalBrowsingContext::WriteSessionStorageToSessionStore
    (this=0x7f80b6dee0, aSesssionStorage=..., aEpoch=0)
    at docshell/base/CanonicalBrowsingContext.cpp:2213
#1  0x0000007fbc7c6f54 in mozilla::dom::CanonicalBrowsingContext::
    <lambda(const mozilla::MozPromise<nsTArray<mozilla::dom::SSCacheCopy>,
    mozilla::ipc::ResponseRejectReason, true>::ResolveOrRejectValue&)>::
    operator() (valueList=..., __closure=0x7f80bd70c8)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Variant.h:768
#2  mozilla::MozPromise<nsTArray<mozilla::dom::SSCacheCopy>, mozilla::ipc::
    ResponseRejectReason, true>::InvokeMethod<mozilla::dom::
    CanonicalBrowsingContext::UpdateSessionStoreSessionStorage(const
    std::function<void()>&)::<lambda(const mozilla::MozPromise<nsTArray
    <mozilla::dom::SSCacheCopy>, mozilla::ipc::ResponseRejectReason, true>::
    ResolveOrRejectValue&)>, void (mozilla::dom::CanonicalBrowsingContext::
    UpdateSessionStoreSessionStorage(const std::function<void()>&)::
    <lambda(const mozilla::MozPromise<nsTArray<mozilla::dom::SSCacheCopy>,
    mozilla::ipc::ResponseRejectReason, true>::ResolveOrRejectValue&)>::*)
    (const mozilla::MozPromise<nsTArray<mozilla::dom::SSCacheCopy>,
    mozilla::ipc::ResponseRejectReason, true>::ResolveOrRejectValue&) const,
    mozilla::MozPromise<nsTArray<mozilla::dom::SSCacheCopy>, mozilla::ipc::
    ResponseRejectReason, true>::ResolveOrRejectValue>
    (aValue=..., aMethod=<optimized out>, 
[...]
#25 0x0000007fb78b289c in ?? () from /lib64/libc.so.6
(gdb) n
[LWP 8594 exited]
2214      if (!jsapi.Init(wrapped->GetJSObjectGlobal())) {
(gdb) 
2218      JS::RootedValue key(jsapi.cx(), Top()->PermanentKey());
(gdb) 
2220      Record<nsCString, Record<nsString, nsString>> storage;
(gdb) 
2221      JS::RootedValue update(jsapi.cx());
(gdb) 
2223      if (!aSesssionStorage.IsEmpty()) {
(gdb) 
2230        update.setNull();
(gdb) 
2233      return funcs->UpdateSessionStoreForStorage(Top()->
          GetEmbedderElement(), this,
(gdb) 
1297    ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/RootingAPI.h:
        No such file or directory.
(gdb) 
JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 541:
    NS_ERROR_FILE_NOT_FOUND: 
2221      JS::RootedValue update(jsapi.cx());
(gdb) 
2220      Record<nsCString, Record<nsString, nsString>> storage;
(gdb) 
This new breakpoint is hit and once again, stepping through the code shows the problem method being called and also triggering the JavaScript error we're concerned about.

This is all good stuff. Looking through the ESR 78 code there doesn't appear to be anything equivalent in CanonicalBrowsingContext.cpp. But now that I know this is where the problem is happening, I can at least find the commit that introduced the changes and use that to find out more. Here's the output from git blame, but please forgive the terrible formatting: it's very hard to line-wrap this output cleanly.
$ git blame docshell/base/CanonicalBrowsingContext.cpp \
    -L :WriteSessionStorageToSessionStore
dd51467 (Andreas Farre 2021-05-26 2204) nsresult CanonicalBrowsingContext::
                                        WriteSessionStorageToSessionStore(
dd51467 (Andreas Farre 2021-05-26 2205)     const nsTArray<SSCacheCopy>&
                                            aSesssionStorage, uint32_t aEpoch) {
dd51467 (Andreas Farre 2021-05-26 2206)   nsCOMPtr<nsISessionStoreFunctions> funcs =
dd51467 (Andreas Farre 2021-05-26 2207)       do_ImportModule("resource://gre/
                                              modules/SessionStoreFunctions.jsm");
dd51467 (Andreas Farre 2021-05-26 2208)   if (!funcs) {
dd51467 (Andreas Farre 2021-05-26 2209)     return NS_ERROR_FAILURE;
dd51467 (Andreas Farre 2021-05-26 2210)   }
dd51467 (Andreas Farre 2021-05-26 2211) 
dd51467 (Andreas Farre 2021-05-26 2212)   nsCOMPtr<nsIXPConnectWrappedJS>
                                          wrapped = do_QueryInterface(funcs);
dd51467 (Andreas Farre 2021-05-26 2213)   AutoJSAPI jsapi;
dd51467 (Andreas Farre 2021-05-26 2214)   if (!jsapi.Init(wrapped->
                                            GetJSObjectGlobal())) {
dd51467 (Andreas Farre 2021-05-26 2215)     return NS_ERROR_FAILURE;
dd51467 (Andreas Farre 2021-05-26 2216)   }
dd51467 (Andreas Farre 2021-05-26 2217) 
2b70b9d (Kashav Madan  2021-06-26 2218)   JS::RootedValue key(jsapi.cx(),
                                            Top()->PermanentKey());
2b70b9d (Kashav Madan  2021-06-26 2219) 
dd51467 (Andreas Farre 2021-05-26 2220)   Record<nsCString, Record<nsString,
                                            nsString>> storage;
dd51467 (Andreas Farre 2021-05-26 2221)   JS::RootedValue update(jsapi.cx());
dd51467 (Andreas Farre 2021-05-26 2222) 
dd51467 (Andreas Farre 2021-05-26 2223)   if (!aSesssionStorage.IsEmpty()) {
dd51467 (Andreas Farre 2021-05-26 2224)     SessionStoreUtils::
                                              ConstructSessionStorageValues(this,
                                              aSesssionStorage,
dd51467 (Andreas Farre 2021-05-26 2225)                                storage);
dd51467 (Andreas Farre 2021-05-26 2226)     if (!ToJSValue(jsapi.cx(), storage,
                                              &update)) {
dd51467 (Andreas Farre 2021-05-26 2227)       return NS_ERROR_FAILURE;
dd51467 (Andreas Farre 2021-05-26 2228)     }
dd51467 (Andreas Farre 2021-05-26 2229)   } else {
dd51467 (Andreas Farre 2021-05-26 2230)     update.setNull();
dd51467 (Andreas Farre 2021-05-26 2231)   }
dd51467 (Andreas Farre 2021-05-26 2232) 
2b70b9d (Kashav Madan  2021-06-26 2233)   return funcs->
                                            UpdateSessionStoreForStorage(
                                            Top()->GetEmbedderElement(), this,
2b70b9d (Kashav Madan  2021-06-26 2234)                    key, aEpoch, update);
dd51467 (Andreas Farre 2021-05-26 2235) }
dd51467 (Andreas Farre 2021-05-26 2236) 
If you can get past the terrible formatting you should be able to see there are two commits of interest here. The first is dd51467c228cb from Andreas Farre and the second which was layered on top is 2b70b9d821c8e from Kashav Madan. Let's find out more about them both.
$ git log -1 dd51467c228cb
commit dd51467c228cb5c9ec9d9efbb6e0339037ec7fd5
Author: Andreas Farre <farre@mozilla.com>
Date:   Wed May 26 07:14:06 2021 +0000

    Part 7: Bug 1700623 - Make session storage session store work with Fission.
        r=nika
    
    Use the newly added session storage data getter to access the session
    storage in the parent and store it in session store without a round
    trip to content processes.
    
    Depends on D111433
    
    Differential Revision: https://phabricator.services.mozilla.com/D111434
For some reason the Phabricator link doesn't work for me, but we can still see the revision directly in the repository. 
$ git log -1 2b70b9d821c8e
commit 2b70b9d821c8eaf0ecae987cfc57e354f0f9cc20
Author: Kashav Madan <kshvmdn@gmail.com>
Date:   Sat Jun 26 20:25:29 2021 +0000

    Bug 1703692 - Store the latest embedder's permanent key on
        CanonicalBrowsingContext, r=nika,mccr8
    
    And include it in Session Store flushes to avoid dropping updates in case
    the browser is unavailable.
    
    Differential Revision: https://phabricator.services.mozilla.com/D118385
There aren't many clues in these changes, in particular there's no hint of how the build system was changed to have these files included. However, digging around in this code has given me a better understanding of the structure and purpose of the different directories.

It seems to me that, in essence, the gecko-dev/browser/components/ directory where the missing files can be found contains modules that relate to the browser chrome and Firefox user interface, rather than the rendering engine. Typically this kind of content would be replicated on Sailfish OS by adding amended versions into the embedlite-components project. If it's browser-specific material, that would make sense.

As an example, in Firefox we can find AboutRedirector.h and AboutRedirector.cpp files, whereas on Sailfish OS there's a replacement AboutRedirectory.js file in embedlite-components. Similarly Firefox has a DownloadsManager.jsm file that can be found in gecko-dev/browser/components/newtab/lib/. This seems to be replaced by EmbedliteDownloadManager.js in embedlite-components. Both have similar functionality based on the names of the methods contained in them, but the implementations are quite different.

Assuming this is correct, probably the right way to tackle the missing SessionStore.jsm and related files would be to move copies into embedlite-components. They'll need potentially quite big changes to align them with sailfish-browser, although hopefully this will largely be removing functionality that has already been implemented elsewhere (for example cookie save and restore).

I think I'll give these changes a go tomorrow.

Another thing I've pretty much concluded while looking through this code is that it probably has nothing to do with the issue that's blocking the Back and Forward buttons from working after all. So I'll also need to make a separate task to track down the real source of that problem.

Right now I have a stack of tasks: SessionStore; Back/Forward failures; DuckDuckGo rendering. I mustn't lose sight of the fact that the real goal right now is to get DuckDuckGo rendering correctly. The other tasks are secondary, albeit with DuckDuckGo rendering potentially dependent on them.

That's it for today. More tomorrow and Thursday, but then a bit of a break.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
22 Jan 2024 : Day 146 #
I'm still in the process of fixing the Sec-Fetch-* headers. The data I collected yesterday resulted in a few conclusions:
  1. Opening a URL at the command line with a tab that was already open gives odd results.
  2. Opening a URL as a homepage gives odd results.
  3. The Back and Forwards buttons are broken so couldn't be tested.
  4. I didn't get time to test the JavaScript case.
I'm going to tackle the JavaScript case first today. It turns out even for this one situation there are at least two cases to consider:
  1. Open a URL using JavaScript simulating an HREF selection.
  2. Open a URL using a JavaScript redirect.
To test this I've created a minimal web page with a couple of links that perform these actions. Here's the content of the page:
<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    <meta name="viewport" content="width=device-width, user-scalable=no"/>
    <script type="text/javascript">
      function reloadHref(url) {
        setTimeout(() => window.location.href = url, 2000);
      }
      function reloadRedirect(url) {
        setTimeout(() => window.location.replace(url), 2000);
      }
    </script>
  </head>
  <body>
    <p><a href="javascript:reloadHref('https://duckduckgo.com');">
      Simulate an HREF selection
    </a></p>
    <p><a href="javascript:reloadRedirect('https://duckduckgo.com');">
      Simulate a redirect
    </a></p>
  </body>
</html>
Pretty straightforward stuff, but should do the trick. This allows me to test the effects of the URL being changed and record the results. In fact, let's get straight to the results. Here's what I found out by testing using this page:
 
Situation                                                 Expected  Flag set
Open a URL using JavaScript simulating a HREF selection.  0         0
Open a URL using a JavaScript redirect.                   0         0

I do, however, also notice that DuckDuckGo doesn't load correctly for these cases, so presumably there's still a problem with the headers here. I'll have to come back to that.

The inability to use the Back or Forwards buttons is also beginning to cause me trouble in day-to-day use, so now might be the time to fix that as well. Once I have I can test the remaining cases.

My suspicion is that the reason they don't work is related to this error that appears periodically when using the browser:
JavaScript error: resource://gre/modules/SessionStoreFunctions.jsm, line 105: NS_ERROR_FILE_NOT_FOUND: 
Here's the code in the file that's generating this error:
    SessionStore.updateSessionStoreFromTablistener(
      aBrowser,
      aBrowsingContext,
      aPermanentKey,
      { data: { windowstatechange: aData }, epoch: aEpoch }
    );
It could be an error happening inside SessionStore.updateSessionStoreFromTablistener() but two reasons make me think this is unlikely. First the error message clearly targets the calling location and if the error were inside this method I'd expect the error message to reflect that instead. Second there isn't anything obvious in the updateSessionStoreFromTablistener() body that might be causing an error like this. No obvious file accesses or anything like that.

A different possibility is that this, at the top of the SessionStoreFunctions.jsm file, is causing the problem:
XPCOMUtils.defineLazyModuleGetters(this, {
  SessionStore: "resource:///modules/sessionstore/SessionStore.jsm",
});
This is a lazy getter, meaning that an attempt will be made to load the resource only at the point where a method from the module is used. Could it be that the SessionStore.jsm file is inaccessible? Then when a method from it is called the JavaScript interpreter tries to load the code in and fails, triggering the error.
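To make the lazy-loading behaviour concrete, here's a minimal sketch in plain JavaScript. It isn't the real XPCOMUtils implementation: availableModules, importModule() and this cut-down defineLazyModuleGetter() are hypothetical stand-ins I've made up purely to show why a missing file would only produce an error at the point of first use, rather than when the getter is defined.

```javascript
// Hypothetical stand-in for the set of modules packed into the archive.
// Only Services.jsm "exists" here; SessionStore.jsm is deliberately absent.
const availableModules = {
  "resource://gre/modules/Services.jsm": { name: "Services" },
};

// Hypothetical loader: throws if the resource can't be found.
function importModule(url) {
  if (!(url in availableModules)) {
    throw new Error("NS_ERROR_FILE_NOT_FOUND: " + url);
  }
  return availableModules[url];
}

// Define a property that loads its module only when first read.
function defineLazyModuleGetter(target, name, url) {
  Object.defineProperty(target, name, {
    configurable: true,
    enumerable: true,
    get() {
      const module = importModule(url);
      // Replace the getter with the loaded value so later reads are cheap.
      Object.defineProperty(target, name, {
        value: module,
        writable: true,
        configurable: true,
        enumerable: true,
      });
      return module;
    },
  });
}

const scope = {};
// Both definitions succeed, even though one file is missing.
defineLazyModuleGetter(scope, "Services",
  "resource://gre/modules/Services.jsm");
defineLazyModuleGetter(scope, "SessionStore",
  "resource:///modules/sessionstore/SessionStore.jsm");

// Returns null if the first use works, or the error message if it fails.
function firstUseError(name) {
  try {
    void scope[name];
    return null;
  } catch (e) {
    return e.message;
  }
}

console.log(firstUseError("Services"));
console.log(firstUseError("SessionStore"));
```

In this sketch the failure is raised from whichever line first touches scope.SessionStore, which would be consistent with the error message pointing at the calling location rather than somewhere inside the module.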

A quick search inside the omni archive suggests this file is indeed missing:
$ find . -iname "SessionStore.jsm"
$ find . -iname "SessionStoreFunctions.jsm"
./omni/modules/SessionStoreFunctions.jsm
$ find . -iname "Services.jsm"
./omni/modules/Services.jsm
As we can see, in contrast the SessionStoreFunctions.jsm and Services.jsm files are both present and correct. Well, present at least. To test out the theory that this is the problem I've parachuted the file into omni. First from my laptop:
$ scp gecko-dev/browser/components/sessionstore/SessionStore.jsm \
    defaultuser@10.0.0.116:./omni/modules/sessionstore/
SessionStore.jsm                              100%  209KB   1.3MB/s   00:00    
[...]
And then on my phone:
$ ./omni.sh pack
Omni action: pack
Packing from: ./omni
Packing to:   /usr/lib64/xulrunner-qt5-91.9.1
This hasn't fixed the Back and Forwards buttons, but it has resulted in a new error. The fact that this error is now coming from inside SessionStore.jsm is encouraging.
JavaScript error: chrome://embedlite/content/embedhelper.js, line 259:
    TypeError: sessionHistory is null
JavaScript error: resource:///modules/sessionstore/SessionStore.jsm, line 541:
    NS_ERROR_FILE_NOT_FOUND: 
Line 541 of SessionStore.jsm looks like this:
  _globalState: new GlobalState(),
This also looks lazy-getter-related, since the only other reference to GlobalState() in this file is at the top, in this chunk of lazy-getter code:
XPCOMUtils.defineLazyModuleGetters(this, {
  AppConstants: "resource://gre/modules/AppConstants.jsm",
  AsyncShutdown: "resource://gre/modules/AsyncShutdown.jsm",
  BrowserWindowTracker: "resource:///modules/BrowserWindowTracker.jsm",
  DevToolsShim: "chrome://devtools-startup/content/DevToolsShim.jsm",
  E10SUtils: "resource://gre/modules/E10SUtils.jsm",
  GlobalState: "resource:///modules/sessionstore/GlobalState.jsm",
  HomePage: "resource:///modules/HomePage.jsm",
  PrivacyFilter: "resource://gre/modules/sessionstore/PrivacyFilter.jsm",
  PromiseUtils: "resource://gre/modules/PromiseUtils.jsm",
  RunState: "resource:///modules/sessionstore/RunState.jsm",
  SessionCookies: "resource:///modules/sessionstore/SessionCookies.jsm",
  SessionFile: "resource:///modules/sessionstore/SessionFile.jsm",
  SessionSaver: "resource:///modules/sessionstore/SessionSaver.jsm",
  SessionStartup: "resource:///modules/sessionstore/SessionStartup.jsm",
  TabAttributes: "resource:///modules/sessionstore/TabAttributes.jsm",
  TabCrashHandler: "resource:///modules/ContentCrashHandlers.jsm",
  TabState: "resource:///modules/sessionstore/TabState.jsm",
  TabStateCache: "resource:///modules/sessionstore/TabStateCache.jsm",
  TabStateFlusher: "resource:///modules/sessionstore/TabStateFlusher.jsm",
  setTimeout: "resource://gre/modules/Timer.jsm",
});
Sure enough, when I check, the GlobalState.jsm file is missing. It looks like these missing files are ones referenced in gecko-dev/browser/components/sessionstore/moz.build:
EXTRA_JS_MODULES.sessionstore = [
    "ContentRestore.jsm",
    "ContentSessionStore.jsm",
    "GlobalState.jsm",
    "RecentlyClosedTabsAndWindowsMenuUtils.jsm",
    "RunState.jsm",
    "SessionCookies.jsm",
    "SessionFile.jsm",
    "SessionMigration.jsm",
    "SessionSaver.jsm",
    "SessionStartup.jsm",
    "SessionStore.jsm",
    "SessionWorker.js",
    "SessionWorker.jsm",
    "StartupPerformance.jsm",
    "TabAttributes.jsm",
    "TabState.jsm",
    "TabStateCache.jsm",
    "TabStateFlusher.jsm",
]
It's not at all clear to me why these files aren't being included. The problem must be arising because either they're not being included when they should be, or they're being accessed when they shouldn't be.
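One way to keep track of which files are affected is to compare the moz.build list against what's actually in the archive. Here's a rough sketch of that comparison. The expected list is a subset of the EXTRA_JS_MODULES.sessionstore entries above; the present list is an assumption standing in for a directory listing of omni/modules/sessionstore/ after I copied SessionStore.jsm across, and missingModules() is just a helper name I've invented.

```javascript
// Return the names in `expected` that don't appear in `present`.
function missingModules(expected, present) {
  const have = new Set(present);
  return expected.filter((name) => !have.has(name));
}

// Subset of the modules listed in moz.build.
const expected = [
  "GlobalState.jsm",
  "RunState.jsm",
  "SessionCookies.jsm",
  "SessionStore.jsm",
  "TabState.jsm",
];

// Assumed directory contents: only the file copied over by hand.
const present = ["SessionStore.jsm"];

console.log(missingModules(expected, present));
```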

But it's late now, so I'm going to have to figure that out tomorrow.

21 Jan 2024 : Day 145 #
Yesterday was a light day of gecko development (heavy on everything else but light on gecko). I managed to update the user agent overrides but not a lot else.

The one thing I did do was think about next steps, which brings us to today. To recap, the DuckDuckGo main page is now working. The search page inexplicably has no search results on it and so needs fixing. But the first thing I need to do is check whether the user interaction flags are propagating properly. My working assumption is that in the cases where they're needed they're being set. What I'm less certain about is whether they're not being set when they're not needed.

The purpose of the Sec-Fetch-* headers is to allow the browser to work in collaboration with the server. The user doesn't necessarily trust the page they're viewing and the server doesn't necessarily trust the browser. But the user should trust the browser. And the user should trust the browser to send the correct Sec-Fetch-* headers to the server. Assuming they're set correctly a trustworthy site can then act on them accordingly; for example, by only showing private data when the page isn't being displayed in an iframe, say.

Anyway, the point is, setting the value of these headers is a security feature. The implicit contract between user and browser requires that they're set correctly and the user trusts the browser will do this. The result of not doing so could make it easier for attackers to trick the user. So getting the flags set correctly is really important.
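As an illustration of how a site might act on these headers, here's a small sketch of a server-side check. The policy itself is my own invention for the sake of example, as is the mayShowPrivateData() name; the header names and values (document, iframe, same-origin, none and so on) are the standard Fetch Metadata ones. I'm assuming the headers arrive as an object with lower-cased keys, which is common once a server framework has parsed them.

```javascript
// Destinations that mean the response will be embedded in another page.
const EMBEDDED_DESTINATIONS = new Set(["iframe", "frame", "embed", "object"]);

// Hypothetical policy: only show private data for a top-level load that
// was either user-initiated (site "none") or came from the same origin.
function mayShowPrivateData(headers) {
  const dest = headers["sec-fetch-dest"];
  const site = headers["sec-fetch-site"];
  if (dest === undefined || site === undefined) {
    // Browsers that don't send the headers are refused in this sketch.
    return false;
  }
  if (EMBEDDED_DESTINATIONS.has(dest)) {
    return false;
  }
  return site === "same-origin" || site === "none";
}

// A direct, top-level navigation.
console.log(mayShowPrivateData({
  "sec-fetch-dest": "document",
  "sec-fetch-site": "none",
}));

// The same page loaded inside a cross-site iframe.
console.log(mayShowPrivateData({
  "sec-fetch-dest": "iframe",
  "sec-fetch-site": "cross-site",
}));
```

A check like this is only as good as the values the browser sends, which is why it matters that the flags feeding into them are right.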

When it comes to understanding the header values and the flags that control them, the key gateway is EmbedLiteViewChild::RecvLoadURL(). The logic for deciding whether to set the flags happens before this is called and all of the logic that uses the flag happens after it. So I'll place a breakpoint on this method and check the value of the flag in various situations.

Which situations? Here are the ones I can think of where the flag should be set to true:
  1. Open a URL at the command line with no existing tabs.
  2. Open a URL at the command line with existing tabs.
  3. Open a URL via D-Bus with no existing tabs.
  4. Open a URL via D-Bus with existing tabs.
  5. Open a URL using xdg-open with no existing tabs.
  6. Open a URL using xdg-open with existing tabs.
And for the following situations the flag should be set to false.
  1. Open a URL as the homepage.
  2. Enter a URL in the address bar.
  3. Open an item from the history.
  4. Open a bookmark.
  5. Select a link on a page.
  6. Open a URL using JavaScript.
  7. Open a page using the Back button.
  8. Open a page using the Forwards button.
  9. Reloading a page.
Here are the results of one debugging cycle. I've skipped the others that are similar to this.
$ gdb sailfish-browser
(gdb) b EmbedLiteViewChild::RecvLoadURL
Function "EmbedLiteViewChild::RecvLoadURL" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (EmbedLiteViewChild::RecvLoadURL) pending.
(gdb) r https://duckduckgo.com

Thread 8 "GeckoWorkerThre" hit Breakpoint 1, mozilla::embedlite::
    EmbedLiteViewChild::RecvLoadURL (this=0x7f88ad1c60, url=..., 
    aFromExternal=@0x7f9f3d3598: true) at mobile/sailfishos/embedshared/
    EmbedLiteViewChild.cpp:482
482     {
(gdb) p aFromExternal
$2 = (const bool &) @0x7f9f3d3598: true
(gdb) n
483       LOGT("url:%s", NS_ConvertUTF16toUTF8(url).get());
(gdb) n
867     ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:
    No such file or directory.
(gdb) n
487       if (Preferences::GetBool("keyword.enabled", true)) {
(gdb) n
493       if (aFromExternal) {
(gdb) n
497       LoadURIOptions loadURIOptions;
(gdb) n
498       loadURIOptions.mTriggeringPrincipal = nsContentUtils::GetSystemPrincipal();
(gdb) p /x flags
$3 = 0x341000
(gdb) p/x flags & nsIWebNavigation::LOAD_FLAGS_FIXUP_SCHEME_TYPOS
$6 = 0x200000
(gdb) p/x flags & nsIWebNavigation::LOAD_FLAGS_ALLOW_THIRD_PARTY_FIXUP
$7 = 0x100000
(gdb) p/x flags & nsIWebNavigation::LOAD_FLAGS_DISALLOW_INHERIT_PRINCIPAL
$8 = 0x40000
(gdb) p/x flags & nsIWebNavigation::LOAD_FLAGS_FROM_EXTERNAL
$9 = 0x1000
(gdb) p /x flags - (nsIWebNavigation::LOAD_FLAGS_ALLOW_THIRD_PARTY_FIXUP
    | nsIWebNavigation::LOAD_FLAGS_FIXUP_SCHEME_TYPOS
    | nsIWebNavigation::LOAD_FLAGS_DISALLOW_INHERIT_PRINCIPAL
    | nsIWebNavigation::LOAD_FLAGS_FROM_EXTERNAL)
$10 = 0x0
(gdb) 
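The arithmetic in that gdb session can be spelled out as a short sketch. The four flag values are copied from the gdb output above; decodeLoadFlags() is a helper name I've made up, and the table only covers the flags seen in this session rather than the full nsIWebNavigation set.

```javascript
// Flag values as printed by gdb in the session above.
const LOAD_FLAGS = {
  LOAD_FLAGS_FROM_EXTERNAL: 0x1000,
  LOAD_FLAGS_DISALLOW_INHERIT_PRINCIPAL: 0x40000,
  LOAD_FLAGS_ALLOW_THIRD_PARTY_FIXUP: 0x100000,
  LOAD_FLAGS_FIXUP_SCHEME_TYPOS: 0x200000,
};

// Split a flags value into the known flag names it contains, plus
// whatever bits are left over once those have been cleared.
function decodeLoadFlags(flags) {
  const names = [];
  let remainder = flags;
  for (const [name, bit] of Object.entries(LOAD_FLAGS)) {
    if ((flags & bit) !== 0) {
      names.push(name);
      remainder &= ~bit;
    }
  }
  return { names, remainder };
}

const decoded = decodeLoadFlags(0x341000);
console.log(decoded.names.join(" | "));
console.log("remainder: 0x" + decoded.remainder.toString(16));
```

A remainder of zero matches the final gdb result: the value is made up of exactly these four flags and nothing else.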
In a couple of places (selecting links and reloading the page) the EmbedLiteViewChild::RecvLoadURL() method doesn't get called. For those cases I put a breakpoint on LoadInfo::GetLoadTriggeredFromExternal() instead. The process looks a little different:
(gdb) disable break
(gdb) break GetLoadTriggeredFromExternal
Breakpoint 2 at 0x7fb9d3430c: GetLoadTriggeredFromExternal. (2 locations)
(gdb) c
Continuing.

Thread 8 "GeckoWorkerThre" hit Breakpoint 2, mozilla::net::LoadInfo::
    GetLoadTriggeredFromExternal (this=0x7f89170cf0, 
    aLoadTriggeredFromExternal=0x7f9f3d3150) at netwerk/base/LoadInfo.cpp:1478
1478      *aLoadTriggeredFromExternal = mLoadTriggeredFromExternal;
(gdb) p mLoadTriggeredFromExternal
$1 = false
(gdb) 
I've been through and checked the majority of the cases separately. Here's a summary of the results once I apply these processes for all of the cases.
 
Situation                                                  Expected  Flag set
Open a URL at the command line with no existing tabs.      1         1
Open a URL at the command line with the same tab open.     1         0
Open a URL at the command line with a different tab open.  1         1
Open a URL via D-Bus with no existing tabs.                1         1
Open a URL via D-Bus with the same tab open.               1         No effect
Open a URL via D-Bus with a different tab open.            1         1
Open a URL using xdg-open with no existing tabs.           1         1
Open a URL using xdg-open with the same tab open.          1         No effect
Open a URL using xdg-open with a different tab open.       1         1
Open a URL as the homepage.                                0         1
Enter a URL in the address bar.                            0         0
Open an item from the history.                             0         0
Open a bookmark.                                           0         0
Select a link on a page.                                   0         0
Open a URL using JavaScript.                               0         Not tested
Open a page using the Back button.                         0         Unavailable
Open a page using the Forwards button.                     0         Unavailable
Reloading a page.                                          0         0

There are some notable entries in the table although broadly speaking the results are what I was hoping for. For example, when using D-Bus or xdg-open to open the same website that's already available, there is no effect. I hadn't expected this, but having now seen the behaviour in action, it makes perfect sense and looks correct. For the case of opening a URL via the command line with the same tab open, I'll need to look into whether some other flag should be set instead; but on the face of it, this looks like something that may need fixing.

Similarly for opening a URL as the home page. I think the result is the reverse of what it should be, but I need to look into this more to check.

The forward and back button interactions are marked as "Unavailable". That's because the back and forward functionality are currently broken. I'm hoping that fixing Issue 1024 will also restore this functionality, after which I'll need to test this again.

Finally I didn't get time to test the JavaScript case. I'll have to do that tomorrow.

So a few things still to fix, but hopefully over the next couple of days these can all be ironed out.

20 Jan 2024 : Day 144 #
It was gratifying to finally see the DuckDuckGo main search page appearing with an ESR 91 build of the browser. Even the search suggestions are working nicely. But there's trouble beneath the surface. Dig just a little further and it turns out the search results page shows no results. In terms of search engine utility, this is sub-optimal. I'll need to look into this and fix it.

But it's not the first task I need to tackle today. Although the front page at least looked like it was working yesterday, it relied on some changes that still need to be fully implemented and checked.

The easy task is adding an override to the user agent string override list. What I found yesterday is that while an ESR 91 user agent string only works with the correct Sec-Fetch headers, it also has to be the mobile version of the user agent. This works:
Mozilla/5.0 (Mobile; rv:91.0) Gecko/91.0 Firefox/91.0
This fails:
Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101 Firefox/91.0
All of these things have to align. Updating the ua-update.json.in file in the sailfish-browser repository is straightforward. Just make the change, run the preprocess-ua-update-json batch file and copy the result into the correct folder.
$ git diff data/ua-update.json.in
diff --git a/data/ua-update.json.in b/data/ua-update.json.in
index 584720c0..338b8faf 100644
--- a/data/ua-update.json.in
+++ b/data/ua-update.json.in
@@ -116,0 +116,1 @@
+  "duckduckgo.com": "Mozilla/5.0 (Mobile; rv:91.0) Gecko/91.0 Firefox/91.0"

$ ./preprocess-ua-update-json
[sailfishos-esr91 fbf5b15e] [user-agent] Update preprocessed user agent overrides
 1 file changed, 2 insertions(+), 1 deletion(-)
$ ls ua/
38.8.0  45.9.1  52.9.1  60.0  60.9.1  78.0  91.0
$ cp ua-update.json ua/91.0/
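For anyone wondering how an override table like this gets used, here's a sketch of a lookup. The duckduckgo.com entry and both user agent strings come from above, but the matching rule (try the full hostname, then each parent domain) is an assumption on my part and may not be how gecko actually resolves overrides. The userAgentFor() name is hypothetical too.

```javascript
// The override added to ua-update.json.in above.
const overrides = {
  "duckduckgo.com": "Mozilla/5.0 (Mobile; rv:91.0) Gecko/91.0 Firefox/91.0",
};

// The desktop-style string from above, used here as the fallback.
const DEFAULT_UA =
  "Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101 Firefox/91.0";

// Assumed matching rule: check the hostname itself, then strip one
// leading label at a time and check each parent domain in turn.
function userAgentFor(hostname, table, fallback) {
  const labels = hostname.toLowerCase().split(".");
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (Object.prototype.hasOwnProperty.call(table, candidate)) {
      return table[candidate];
    }
  }
  return fallback;
}

console.log(userAgentFor("www.duckduckgo.com", overrides, DEFAULT_UA));
console.log(userAgentFor("example.org", overrides, DEFAULT_UA));
```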
The second task is to check the code changes I made over the last couple of days. I added flags that pass certain user interaction signals on to the engine, such as whether an action was user-performed or not. Yesterday I checked, both using gdb and by observing the resulting Sec-Fetch-* headers, that the flags were making their way through the system correctly.

However what I didn't check, and what I still need to check, is that the flags are correct when different paths are used to get there. For example, the system needs to distinguish between entering a URL in the toolbar and triggering a URL via D-Bus. The resulting request headers should be the same, but the logic for how we get to the same place is different. This is a debugging task.

Unfortunately time has run away today already, so I'll have to pick up on the actual task of debugging tomorrow.

19 Jan 2024 : Day 143 #
Over the last couple of days I've been building versions of gecko and its related packages that support the LOAD_FLAGS_FROM_EXTERNAL flag. According to comments in the code, when this flag is set the engine considers the request to be the result of user-interaction, so I'm hoping this change will help ensure the Sec-Fetch headers are given the correct values.

But of course not all page loads are triggered externally. There are other flags to identify other types of user interaction which I'll need to fix as well. But one thing at a time. First of all I need to find out if the changes I've made to the code are having any effect.

The changes affect sixteen packages and comprise 771 MiB of data in all (including the debug packages), which I'm now copying over to my device so I can install and test them. Just doing this can take quite a while!

[...]

I've installed them and run them and they don't fix the problem. But as soon as I ran them I remembered that I've not yet hooked up things on the QML side to make them work. I still have a bit more coding to do before we'll see any good results.

[...]

I've updated the QML so that the flag is now passed in for the routes I could find that relate to page loads being triggered externally. There are some potential gotchas here: am I catching all of the situations in which this can happen? Are some of the paths used by other routes as well, such as entering a URL at the address bar? Once I've confirmed that these changes are having an effect I'll need to go back and check these other things as well.

First things first. Let's find out whether the flag is being applied in the case where a URL is passed on the command line.
$ gdb sailfish-browser
[...]
(gdb) b DeclarativeTabModel::newTab
Breakpoint 1 at 0x6faf8: DeclarativeTabModel::newTab. (2 locations)
(gdb) r https://www.duckduckgo.com
[...]
Thread 1 "sailfish-browse" hit Breakpoint 1, DeclarativeTabModel::newTab
    (this=0x55559d7b80, url=..., fromExternal=true)
    at ../history/declarativetabmodel.cpp:196
196         return newTab(url, 0, 0, false, fromExternal);
(gdb) p fromExternal
$1 = true
(gdb) b EmbedLiteViewChild::RecvLoadURL
Breakpoint 2 at 0x7fbcb105f0: file mobile/sailfishos/embedshared/
    EmbedLiteViewChild.cpp, line 482.
(gdb) n

Thread 8 "GeckoWorkerThre" hit Breakpoint 2, mozilla::embedlite::
    EmbedLiteViewChild::RecvLoadURL (this=0x7f88690d50, url=..., 
    aFromExternal=@0x7f9f3d3598: true)
    at mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp:482
482     {
(gdb) n
483       LOGT("url:%s", NS_ConvertUTF16toUTF8(url).get());
(gdb) n
867     ${PROJECT}/gecko-dev/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h: No such file or directory.
(gdb) n
487       if (Preferences::GetBool("keyword.enabled", true)) {
(gdb) n
493       if (aFromExternal) {
(gdb) n
497       LoadURIOptions loadURIOptions;
(gdb) p aFromExternal
$2 = (const bool &) @0x7f9f3d3598: true
(gdb) n
498       loadURIOptions.mTriggeringPrincipal = nsContentUtils::GetSystemPrincipal();
(gdb) n
499       loadURIOptions.mLoadFlags = flags;
(gdb) p /x flags
$4 = 0x341000
(gdb) p /x nsIWebNavigation::LOAD_FLAGS_FROM_EXTERNAL
$5 = 0x1000
(gdb) p /x (flags & nsIWebNavigation::LOAD_FLAGS_FROM_EXTERNAL)
$6 = 0x1000
Great! The flag is being set in the user-interface and is successfully passing through from sailfish-browser to qtmozembed and on to the EmbedLite wrapper for gecko itself. Let's see if it makes it all the way to the request header methods.
(gdb) b SecFetch::AddSecFetchUser
Breakpoint 3 at 0x7fbba7b680: file dom/security/SecFetch.cpp, line 333.
(gdb) b GetLoadTriggeredFromExternal
Breakpoint 4 at 0x7fb9d3430c: GetLoadTriggeredFromExternal. (5 locations)
(gdb) c
Continuing.
[New LWP 8921]

Thread 8 "GeckoWorkerThre" hit Breakpoint 4, nsILoadInfo::
    GetLoadTriggeredFromExternal (this=0x7f88eabce0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsILoadInfo.h:664
664     ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsILoadInfo.h: No such file or directory.
(gdb) n

Thread 8 "GeckoWorkerThre" hit Breakpoint 4, mozilla::net::LoadInfo::
    GetLoadTriggeredFromExternal (this=0x7f88eabce0, 
    aLoadTriggeredFromExternal=0x7f9f3d3150) at netwerk/base/LoadInfo.cpp:1478
1478      *aLoadTriggeredFromExternal = mLoadTriggeredFromExternal;
(gdb) p mLoadTriggeredFromExternal
$9 = true
(gdb) c
[...]
mozilla::dom::SecFetch::AddSecFetchUser
    (aHTTPChannel=aHTTPChannel@entry=0x7f88eabed0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/DebugOnly.h:97
97      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/DebugOnly.h:
        No such file or directory.
(gdb) 
350       nsAutoCString user("?1");
(gdb) 
The Sec-Fetch-User flag is now being set correctly. Hooray! It's clear from this that the LOAD_FLAGS_FROM_EXTERNAL flag is making it through all the way to the place where the flag values are set. So let's check what the actual values are using the debug output on the console:
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101 Firefox/91.0
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Upgrade-Insecure-Requests : 1
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : none
        Sec-Fetch-User : ?1
        If-None-Match : "65a81e81-4858"
That looks like just the set of Sec-Fetch flags we wanted.

So it's disappointing to discover that the page is still not rendering. Why could this be?

I set the user agent to the ESR 91 value I used previously:
  "duckduckgo.com": "Mozilla/5.0 (Mobile; rv:91.0) Gecko/91.0 Firefox/91.0"
Close the browser; clear out the cache; try again:
$ rm -rf ~/.local/share/org.sailfishos/browser/.mozilla/cache2/ \
    ~/.local/share/org.sailfishos/browser/.mozilla/startupCache/ \
    ~/.local/share/org.sailfishos/browser/.mozilla/cookies.sqlite 
$ sailfish-browser https://duckduckgo.com/
 
Three screenshots: the DuckDuckGo main page showing in ESR 91; search suggestions for the search term 'DuckDuckGo'; the search results page with no search results on it

The last thing I see before I head to bed is the DuckDuckGo logo staring back at me from the screen. There is a glitch with search though. The search action works, in the sense that it takes me to the result page, but no results are shown. So there's still some more work to be done.

Nevertheless I'll sleep a lot easier tonight after this progress today.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
18 Jan 2024 : Day 142 #
Astonishingly I wake up to find the build I started last night has successfully completed this morning. That's not quite the end of the process though. It means I have a version of gecko implementing a new API. But the API is due to be accessed by qtmozembed. When I rebuild qtmozembed it installs the new gecko packages (which are named "xulrunner-qt5" for historical reasons) and then fails because it can't be built against the new interfaces.
$ sfdk -c no-fix-version build -d -p
[...]
Loading repository data...
Reading installed packages...
Computing distribution upgrade...
Force resolution: No

The following 5 packages are going to be reinstalled:
  xulrunner-qt5              91.9.1-1
  xulrunner-qt5-debuginfo    91.9.1-1
  xulrunner-qt5-debugsource  91.9.1-1
  xulrunner-qt5-devel        91.9.1-1
  xulrunner-qt5-misc         91.9.1-1

5 packages to reinstall.
[...]
qmozview_p.cpp: In member function ‘void QMozViewPrivate::load(const QString&)’:
qmozview_p.cpp:491:39: error: no matching function for call to
  ‘mozilla::embedlite::EmbedLiteView::LoadURL(char*)’
     mView->LoadURL(url.toUtf8().data());
                                       ^
In file included from qmozview_p.h:26,
                 from qmozview_p.cpp:35:
usr/include/xulrunner-qt5-91.9.1/mozilla/embedlite/EmbedLiteView.h:82:16:
  note: candidate: ‘virtual void mozilla::embedlite::EmbedLiteView::LoadURL
  (const char*, bool)’
   virtual void LoadURL(const char* aUrl, bool aFromExternal);
                ^~~~~~~
usr/include/xulrunner-qt5-91.9.1/mozilla/embedlite/EmbedLiteView.h:82:16:
  note:   candidate expects 2 arguments, 1 provided
This is entirely expected. In fact it's a good sign: it means that things are aligning as intended. The next step will be to update qtmozembed so that it matches the new interface.

The process of fixing this is pretty mechanical: I just have to go through and add the extra fromExternal parameter where it's missing so that it can be passed on directly to EmbedLiteView::LoadURL(). There is one small nuance, which is that on some occasions when the view hasn't been initialised yet, the URL to be loaded is cached until the view is ready. In this case I have to cache the fromExternal state as well, which I'm going to store in QMozViewPrivate::mPendingFromExternal: in the same class as where the URL is cached.
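To illustrate the caching nuance, here's a minimal standalone sketch of the pattern. It isn't the qtmozembed code: the PendingLoadView class, its Loader callback and every member name apart from the idea behind mPendingFromExternal are hypothetical, chosen only to show the URL and the fromExternal state being stored together and replayed once the view is ready.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a view wrapper. The Loader callback plays the
// role of the underlying LoadURL(url, fromExternal) call.
class PendingLoadView {
public:
    using Loader = std::function<void(const std::string&, bool)>;

    explicit PendingLoadView(Loader loader) : mLoader(std::move(loader)) {}

    // Load straight away if the view is ready, otherwise cache both the
    // URL and the fromExternal state until it is.
    void load(const std::string& url, bool fromExternal) {
        if (!mInitialised) {
            mPendingUrl = url;
            mPendingFromExternal = fromExternal;
            return;
        }
        mLoader(url, fromExternal);
    }

    // Called once the underlying view exists; replays any cached request.
    void setInitialised() {
        mInitialised = true;
        if (!mPendingUrl.empty()) {
            mLoader(mPendingUrl, mPendingFromExternal);
            mPendingUrl.clear();
            mPendingFromExternal = false;
        }
    }

private:
    Loader mLoader;
    bool mInitialised = false;
    std::string mPendingUrl;
    bool mPendingFromExternal = false;
};
```

The point of keeping the two values side by side is that a load requested before initialisation reaches the engine with the same fromExternal value it would have had if the view had been ready at the time.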

Having made these changes the package now builds successfully. But we're not quite there yet, because these changes are going to cause the exported qtmozembed interfaces to change. These will be picked up by the code in sailfish-browser.

Unlike qtmozembed the changes may not cause sailfish-browser to fail at build time, because it's possible the changes will only affect interpreted QML code rather than compiled C++ code. Let's see...

As I think I've mentioned before, sailfish-browser takes quite a while to build (we're talking tens of minutes rather than hours, but still enough time to make a coffee before it completes).
$ sfdk -c no-fix-version build -d -p
[...]
Loading repository data...
Reading installed packages...
Computing distribution upgrade...
Force resolution: No

The following 2 packages are going to be reinstalled:
  qtmozembed-qt5        1.53.9-1
  qtmozembed-qt5-devel  1.53.9-1

2 packages to reinstall.
[...]
../../../apps/qtmozembed/declarativewebpage.cpp: In member function
  ‘void DeclarativeWebPage::loadTab(const QString&, bool)’:
../../../apps/qtmozembed/declarativewebpage.cpp:247:20: error: no matching
  function for call to ‘DeclarativeWebPage::load(const QString&)’
         load(newUrl);
                    ^
compilation terminated due to -Wfatal-errors.
As we can see, there are some errors coming from the C++ build relating to the API changes. This is good, but it doesn't negate the fact that some changes will still be needed in the QML as well and these won't be flagged as errors during compilation. So we need to take care, but fixing these C++ errors will be a good start.

Once again the changes look pretty simple; for example here I'm adding the fromExternal parameter:
void DeclarativeWebPage::loadTab(const QString &newUrl, bool force,
    bool fromExternal)
{
    // Always enable chrome when load is called.
    setChrome(true);
    QString oldUrl = url().toString();
    if ((!newUrl.isEmpty() && oldUrl != newUrl) || force) {
        load(newUrl, fromExternal);
    }
}
These changes have some cascading effects and it takes five or six build cycles to get everything straightened out. But eventually it gets there and the build goes through.

But that's still not quite enough. As we noted earlier this just tackles the compiled code. We can say that the C++ code is probably now in a pretty decent state, but the QML code is a different matter. The build process won't check the QML for inconsistencies and if there are any, it'll just fail at runtime. So we'll need to look through that next. That will have to be a task for tomorrow.

17 Jan 2024 : Day 141 #
Preparations are well underway for FOSDEM'24 and I spent this morning putting together a slide template for my talk on Gecko Development. There's not much content yet, but I still have a bit of time to fill that out. It'll be taking up a bit of time between now and then though, so gecko development may slow down a little.

Continuing on though, yesterday we were looking at the Sec-Fetch headers. With the updated browser these are now added as request headers and it turned out this was the reason DuckDuckGo was serving up broken pages. If the headers are skipped a good version of the page is served. But if an ESR 91 user agent is used then the headers are needed and DuckDuckGo expects them to be set to sensible values.

Right now they don't provide sensible values because the context needed isn't available. For example, the browser isn't distinguishing between pages opened at the command line versus those entered as URLs or through clicking on links. We need to pass that info down from the front-end and into the engine.

The flag we need to set in the case of the page being triggered via command line or D-Bus is LOAD_FLAGS_FROM_EXTERNAL:
  /**
   * A hint this load was prompted by an external program: take care!
   */
  const unsigned long LOAD_FLAGS_FROM_EXTERNAL   = 0x1000;
Once this is set it will end up being used in the following conditional inside SecFetch.cpp:
  if (!loadInfo->GetLoadTriggeredFromExternal() &&
      !loadInfo->GetHasValidUserGestureActivation()) {
For the GetHasValidUserGestureActivation() portion of this the logic ends up looking like this:
bool WindowContext::HasValidTransientUserGestureActivation() {
  MOZ_ASSERT(IsInProcess());

  if (GetUserActivationState() != UserActivation::State::FullActivated) {
    MOZ_ASSERT(mUserGestureStart.IsNull(),
               "mUserGestureStart should be null if the document hasn't ever "
               "been activated by user gesture");
    return false;
  }

  MOZ_ASSERT(!mUserGestureStart.IsNull(),
             "mUserGestureStart shouldn't be null if the document has ever "
             "been activated by user gesture");
  TimeDuration timeout = TimeDuration::FromMilliseconds(
      StaticPrefs::dom_user_activation_transient_timeout());

  return timeout <= TimeDuration() ||
         (TimeStamp::Now() - mUserGestureStart) <= timeout;
}
Clearly GetUserActivationState() is critical here. This leads back to WindowContext::NotifyUserGestureActivation() and it looks like this is handled automatically inside the gecko code, so not something to worry about. At least, whether we need to worry about it will become clearer as we progress.
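To make the timing logic a little more concrete, here's a minimal standalone sketch of the same check. It's not gecko code: ActivationState, hasValidTransientActivation() and the use of std::chrono are all assumptions made for illustration, standing in for WindowContext, TimeStamp, TimeDuration and the timeout preference. Unlike the original, which asserts that the gesture start time is set, this version simply returns false when it's missing.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical, simplified model of the state the real check consults.
struct ActivationState {
    bool fullyActivated = false;      // has the document ever been activated?
    bool hasGestureStart = false;     // is gestureStart meaningful?
    Clock::time_point gestureStart{}; // when the last user gesture happened
};

// Returns true if the most recent user gesture is still "fresh".
bool hasValidTransientActivation(const ActivationState& state,
                                 Clock::time_point now,
                                 std::chrono::milliseconds timeout) {
    if (!state.fullyActivated || !state.hasGestureStart) {
        return false;
    }
    // A zero or negative timeout means the activation never expires.
    if (timeout <= std::chrono::milliseconds::zero()) {
        return true;
    }
    return (now - state.gestureStart) <= timeout;
}
```

Passing the current time in as a parameter, rather than reading the clock inside the function, is just a choice to make the sketch easy to exercise with fixed time points.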

I'm reminded of the fact that back on Day 47 I made amendments to the API to add an aUserActivation parameter to the GoBack() and GoForward() methods of PEmbedLiteView.ipdl and that these probably aren't currently being set properly:
    async GoBack(bool aRequireUserInteraction, bool aUserActivation);
    async GoForward(bool aRequireUserInteraction, bool aUserActivation);
It's quite possible these will need fixing as well, although right now I can't see any execution path that runs from either of these down to HasValidTransientUserGestureActivation(). This is looking for one path amongst a thousand meandering paths though, so it would be easy to miss it.

Let's go back to focusing on LOAD_FLAGS_FROM_EXTERNAL. To accommodate this I've made some simple changes to EmbedLiteViewChild:
--- a/embedding/embedlite/embedshared/EmbedLiteViewChild.cpp
+++ b/embedding/embedlite/embedshared/EmbedLiteViewChild.cpp
@@ -478,7 +478,7 @@ EmbedLiteViewChild::WebWidget()
 
 /*----------------------------TabChildIface-----------------------------------------------------*/
 
-mozilla::ipc::IPCResult EmbedLiteViewChild::RecvLoadURL(const nsString &url)
+mozilla::ipc::IPCResult EmbedLiteViewChild::RecvLoadURL(const nsString &url, const bool& aFromExternal)
 {
   LOGT("url:%s", NS_ConvertUTF16toUTF8(url).get());
   NS_ENSURE_TRUE(mWebNavigation, IPC_OK());
@@ -490,6 +490,10 @@ mozilla::ipc::IPCResult EmbedLiteViewChild::RecvLoadURL(const nsString &url)
   }
   flags |= nsIWebNavigation::LOAD_FLAGS_DISALLOW_INHERIT_PRINCIPAL;
 
+  if (aFromExternal) {
+    flags |= nsIWebNavigation::LOAD_FLAGS_FROM_EXTERNAL;
+  }
+
   LoadURIOptions loadURIOptions;
   loadURIOptions.mTriggeringPrincipal = nsContentUtils::GetSystemPrincipal();
   loadURIOptions.mLoadFlags = flags;
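The bit manipulation in the patch can be tried out in isolation. The sketch below is a standalone approximation rather than the gecko code: only the 0x1000 value comes from the IDL snippet quoted earlier, while the helper names applyFromExternal() and isFromExternal() are hypothetical.

```cpp
#include <cstdint>

// Value taken from the nsIWebNavigation snippet quoted earlier.
constexpr std::uint32_t LOAD_FLAGS_FROM_EXTERNAL = 0x1000;

// Mirrors the patched logic: OR the bit in only for externally triggered loads.
constexpr std::uint32_t applyFromExternal(std::uint32_t flags, bool fromExternal) {
    return fromExternal ? (flags | LOAD_FLAGS_FROM_EXTERNAL) : flags;
}

// True if the external bit is present in a combined flags value.
constexpr bool isFromExternal(std::uint32_t flags) {
    return (flags & LOAD_FLAGS_FROM_EXTERNAL) != 0;
}
```

For example, starting from an arbitrary illustrative value of 0x340000, applyFromExternal(0x340000, true) gives 0x341000, and masking that with LOAD_FLAGS_FROM_EXTERNAL leaves 0x1000. This is the same kind of check that can be made from the debugger with p /x (flags & nsIWebNavigation::LOAD_FLAGS_FROM_EXTERNAL).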
This change will have a cascading effect through qtmozembed and potentially all the way to sailfish-browser. But the easiest way for me to figure out what's missing or needs changing is for me to build the package and find out what breaks.

So for the first time in a long time I've set the gecko-dev building once again.

Tomorrow I'll find out what sort of mess I've caused.

16 Jan 2024 : Day 140 #
I'm up early today in the hope of finding evidence of why DuckDuckGo is serving ESR 91 a completely different — and quite broken — version of the search site. Yesterday we established, pretty convincingly I think, that the page being served is different and that it's not the ESR 91 renderer that's the problem. If I view the exact same pages on ESR 78 or a desktop browser I get the same blank page.

Today I plan to look into the request and response headers, and especially the user agent in more detail. Maybe there's something in the request headers which differs between ESR 78 and ESR 91 that I've not noticed.

But I have to warn you: it's a long one today. I recommend skipping the response headers and backtraces if you're in a rush.

And before getting into the headers I also want to thank Simon (simonschmeisser) and Adrian (amcewen) for their helpful interjections on the Sailfish Forum and Mastodon respectively.

Both of you correctly noticed that one of the files was broken on the copy I made of DuckDuckGo built from the data downloaded using the ESR 91 library. This shows up quite clearly in a desktop browser using the dev console (as in Adrian's picture below) and if you try to download the file directly.
 
A browser developer console showing the filename home-74b6f93f10631d81.js with a very clear 403 error next to it

The broken URL is the following:
https://duckduckgo.com/_next/static/chunks/pages/%5Blocale%5D/home-74b6f93f10631d81.js
It's broken because of the %5B and %5D in the path. These are characters encoded using "URL" or "percent" encoding as codified in RFC 3986. They actually represent the left and right square brackets respectively. What's happened is that the Python script I used back on Day 135 has quite correctly output the URL in this encoded format. When I uploaded it to S3 I should have decoded it so that the file was stored in a folder called [locale] like this:
https://duckduckgo.com/_next/static/chunks/pages/[locale]/home-74b6f93f10631d81.js
Instead I left the encoded form in. Oh dear! The file itself is tiny, containing just the following text:
(self.webpackChunk_N_E=self.webpackChunk_N_E||[]).push([[43803],{98387:function
(n,_,u){(window.__NEXT_P=window.__NEXT_P||[]).push(["/[locale]/home",function()
{return u(39265)}])}},function(n){n.O(0,[41966,93432,18040,81125,39337,94623,
95665,55015,61754,55672,38407,49774,92888,40179],(function(){return _=98387,
n(n.s=_);var _}));var _=n.O();_N_E=_}]);
On hearing about my mistake from Simon and Adrian I thought this error would be too small to make any real difference. How wrong I was! Now that I've fixed this the page actually renders, both in the Sailfish Browser on ESR 78 and on desktop Firefox.
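The decoding step that got missed during the upload is simple enough to sketch. This is an illustrative standalone version only, not the Python script from Day 135: the percentDecode() and hexValue() names are hypothetical, and the behaviour of copying malformed sequences through unchanged is a design choice for this sketch rather than anything mandated.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Convert a single hexadecimal digit to its numeric value.
// Callers are expected to have checked the character with std::isxdigit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;
}

// Decode every well-formed %XX sequence; anything malformed is copied as-is.
std::string percentDecode(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool escaped = in[i] == '%' && i + 2 < in.size()
            && std::isxdigit(static_cast<unsigned char>(in[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(in[i + 2]));
        if (escaped) {
            out += static_cast<char>(hexValue(in[i + 1]) * 16 + hexValue(in[i + 2]));
            i += 2;  // skip the two hex digits just consumed
        } else {
            out += in[i];
        }
    }
    return out;
}
```

Applied to the path segment %5Blocale%5D this yields [locale], which is the folder name the file needed to be stored under.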

Interestingly it still doesn't render with ESR 91. Which makes me wonder if the issue is related to the locale. But I'm going to have to come back to this because I've committed myself to looking at request headers today. Nevertheless, a big thank you to both Simon and Adrian. Not only is it reassuring to know my work is being checked, but this could also provide a critical piece of the puzzle.

We have a lot to get through today though, so let's now move on to request headers. We're particularly interested in the index.html file because, as anyone who's worked with webservers before will know, this is the first file to get downloaded and the one that kicks off all of the other downloads. Only files referenced in index.html, or where there's a reference chain back to index.html are going to get downloaded.

Here are the request headers for the index page sent to DuckDuckGo when using ESR 78.
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (Mobile; rv:78.0) Gecko/78.0 Firefox/78.0
        Accept : text/html,application/xhtml+xml,application/xml;
            q=0.9,image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Upgrade-Insecure-Requests : 1
And here are the equivalent headers when accessing the same page of DuckDuckGo when using ESR 91 using the default user agent.
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,
            image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Upgrade-Insecure-Requests : 1
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : cross-site
Finally, when setting the user agent to match that of the iPhone the request headers look like this:
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (iPhone12,1; U; CPU iPhone OS 13_0 like
            Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0
            Mobile/15E148 Safari/602.1
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,
            image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Upgrade-Insecure-Requests : 1
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : cross-site
What do we notice from these? Obviously the user agents are different. The standard ESR 91 user agent isn't identifying itself as a Mobile variant and that's something that needs to be fixed for all sites. The remaining fields in the ESR 78 list are identical to those in ESR 91. However ESR 91 does have some additional fields:
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : cross-site
Let's find out what these do. According to the Mozilla docs, Sec-Fetch-Dest...
 
...allows servers determine whether to service a request based on whether it is appropriate for how it is expected to be used. For example, a request with an audio destination should request audio data, not some other type of resource (for example, a document that includes sensitive user information).

The emphasis here is theirs. In our case we're sending document as the value, which from the docs means:
 
The destination is a document (HTML or XML), and the request is the result of a user-initiated top-level navigation (e.g. resulting from a user clicking a link).

This feels pretty accurate. How about Sec-Fetch-Mode and its value of navigate?
 
Broadly speaking, this allows a server to distinguish between: requests originating from a user navigating between HTML pages, and requests to load images and other resources. For example, this header would contain navigate for top level navigation requests, while no-cors is used for loading an image.

navigate: The request is initiated by navigation between HTML documents.

Again, this all looks pretty unexceptional. Finally the Sec-Fetch-Site header and its value cross-site? According to the docs...
 
...this header tells a server whether a request for a resource is coming from the same origin, the same site, a different site, or is a "user initiated" request. The server can then use this information to decide if the request should be allowed.

cross-site: The request initiator and the server hosting the resource have a different site (i.e. a request by "potentially-evil.com" for a resource at "example.com").

This last one looks suspicious to me. ESR 91 is sending a cross-site value, which is the most risky of all the options, because it's essentially telling DuckDuckGo that the request is happening across different domains. From the docs, it looks like none would be a more appropriate value for this:
 
none: This request is a user-originated operation. For example: entering a URL into the address bar, opening a bookmark, or dragging-and-dropping a file into the browser window.

Checking the code I notice there's also a fourth Sec-Fetch header in the set which is the Sec-Fetch-User header. This isn't being sent and this might also be important, because the request we're making is user-initiated, so it might be reasonable to expect this header to be included. The docs say this about the header:
 
The Sec-Fetch-User fetch metadata request header is only sent for requests initiated by user activation, and its value will always be ?1.

A server can use this header to identify whether a navigation request from a document, iframe, etc., was originated by the user.

In effect, by omitting the header the browser is telling the site that the request isn't initiated by the user, but rather by something else, such as being a document referenced inside some other document.
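As a rough illustration of that "present or absent" behaviour, here's a minimal sketch of how the decision might be modelled. It's an assumption-laden simplification, not the SecFetch.cpp implementation: the LoadContext struct, its two boolean fields and the secFetchUserValue() function are hypothetical, and real browsers take more context into account.

```cpp
#include <string>

// Hypothetical inputs standing in for what the browser knows about a load.
struct LoadContext {
    bool triggeredFromExternal;  // e.g. opened from the command line
    bool hasValidUserGesture;    // e.g. the user tapped a link
};

// Returns the header value to send, or an empty string if the header
// should be omitted entirely. When present the value is always "?1".
std::string secFetchUserValue(const LoadContext& ctx) {
    if (!ctx.triggeredFromExternal && !ctx.hasValidUserGesture) {
        return "";
    }
    return "?1";
}
```

The header carries no information beyond its presence, which is why a missing header reads to the server as "not user initiated".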

An obvious thing to test would be to remove these headers and see whether that makes any difference. But before getting on to that let's take a look at the response headers as well.

First the response headers when using ESR 78:
    [ Response headers -------------------------------------- ]
        server : nginx
        date : Sun, 07 Jan 2024 16:07:57 GMT
        content-type : text/html; charset=UTF-8
        content-length : 2357
        vary : Accept-Encoding
        etag : "65983d82-935"
        content-encoding : br
        strict-transport-security : max-age=31536000
        permissions-policy : interest-cohort=()
        content-security-policy : [...] ;
        x-frame-options : SAMEORIGIN
        x-xss-protection : 1;mode=block
        x-content-type-options : nosniff
        referrer-policy : origin
        expect-ct : max-age=0
        expires : Sun, 07 Jan 2024 16:07:56 GMT
        cache-control : no-cache
        X-Firefox-Spdy : h2
I've removed the content-security-policy value for brevity in all of these responses. They all happen to be the same.

Next the response headers when using ESR 91 and the standard user agent:
    [ Response headers -------------------------------------- ]
        server : nginx
        date : Sun, 07 Jan 2024 16:09:35 GMT
        content-type : text/html; charset=UTF-8
        content-length : 18206
        vary : Accept-Encoding
        etag : "65983d80-471e"
        content-encoding : br
        strict-transport-security : max-age=31536000
        permissions-policy : interest-cohort=()
        content-security-policy : [...] ;
        x-frame-options : SAMEORIGIN
        x-xss-protection : 1;mode=block
        x-content-type-options : nosniff
        referrer-policy : origin
        expect-ct : max-age=0
        expires : Sun, 07 Jan 2024 16:09:34 GMT
        cache-control : no-cache
        X-Firefox-Spdy : h2
And finally the response headers when using ESR 91 and the iPhone user agent:
    [ Response headers -------------------------------------- ]
        server : nginx
        date : Sat, 13 Jan 2024 22:20:49 GMT
        content-type : text/html; charset=UTF-8
        content-length : 18221
        vary : Accept-Encoding
        etag : "65a1788c-472d"
        content-encoding : br
        strict-transport-security : max-age=31536000
        permissions-policy : interest-cohort=()
        content-security-policy : [...] ;
        x-frame-options : SAMEORIGIN
        x-xss-protection : 1;mode=block
        x-content-type-options : nosniff
        referrer-policy : origin
        expect-ct : max-age=0
        expires : Sat, 13 Jan 2024 22:20:48 GMT
        cache-control : no-cache
        X-Firefox-Spdy : h2
All three share the same response header keys and values apart from the following four, which have different values across all three responses:
        date : Sun, 07 Jan 2024 16:07:57 GMT
        content-length : 2357
        etag : "65983d82-935"
        expires : Sun, 07 Jan 2024 16:07:56 GMT
It's not surprising these are different across the three. In fact, if they weren't, that might suggest something more sinister. The only one that's really problematic is the content-length value, not because it's incorrect, but because the three different values highlight the fact we're being served three different pages depending on the request.
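Comparing header sets by eye gets tedious, so here's a minimal sketch of how the comparison could be automated. Everything in it is hypothetical scaffolding for illustration (the Headers alias and the differingKeys() function aren't part of any tool used above); it simply reports the header names whose values aren't identical across all of the responses it's given.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Headers = std::map<std::string, std::string>;

// Return the sorted header names whose value is not identical across all
// responses. A header missing from any response also counts as a difference.
std::vector<std::string> differingKeys(const std::vector<Headers>& responses) {
    // Collect every header name seen in any response.
    std::set<std::string> names;
    for (const Headers& r : responses) {
        for (const auto& kv : r) {
            names.insert(kv.first);
        }
    }

    std::vector<std::string> result;
    for (const std::string& name : names) {
        bool differs = false;
        const std::string* first = nullptr;
        for (const Headers& r : responses) {
            auto it = r.find(name);
            if (it == r.end()) { differs = true; break; }
            if (!first) {
                first = &it->second;
            } else if (*first != it->second) {
                differs = true;
                break;
            }
        }
        if (differs) {
            result.push_back(name);
        }
    }
    return result;
}
```

Fed the three sets of response headers shown above, a function like this would be expected to single out date, content-length, etag and expires.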

If there's nothing particularly interesting to see in the response headers it means we can go back to experimenting with the three Sec-Fetch headers discussed earlier.

Digging through the code and the SecFetch.cpp file in particular, I can see that the headers are added in this method:
void mozilla::dom::SecFetch::AddSecFetchHeader(nsIHttpChannel* aHTTPChannel) {
  // if sec-fetch-* is prefed off, then there is nothing to do
  if (!StaticPrefs::dom_security_secFetch_enabled()) {
    return;
  }

  nsCOMPtr<nsIURI> uri;
  nsresult rv = aHTTPChannel->GetURI(getter_AddRefs(uri));
  if (NS_WARN_IF(NS_FAILED(rv))) {
    return;
  }

  // if we are not dealing with a potentially trustworthy URL, then
  // there is nothing to do here
  if (!nsMixedContentBlocker::IsPotentiallyTrustworthyOrigin(uri)) {
    return;
  }

  AddSecFetchDest(aHTTPChannel);
  AddSecFetchMode(aHTTPChannel);
  AddSecFetchSite(aHTTPChannel);
  AddSecFetchUser(aHTTPChannel);
}
The bit I'm interested in is the first condition which bails out of the method if a specific preference is disabled. Happily this preference was easy to establish as being exposed through about:config as dom.security.secFetch.enabled. I've now disabled it and can try loading the site again.

This time the headers no longer have any of the Sec-Fetch headers included:
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,
            image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Upgrade-Insecure-Requests : 1
Sadly this doesn't fix the issue: I still get the same blank screen. I try again with the iPhone user agent and get the same result.

But then I try with the ESR 78 user agent and Sec-Fetch headers disabled... and it works! The site shows correctly, just as it does in ESR 78.

It's a little hard to express the strange mix of jubilation and frustration that I'm feeling right now. Jubilation because we've finally reached the point where it's certain that it will be possible to fix this. Frustration because it's taken quite so long to reach this point.

This feeling pretty much sums up my experience of software development in general. Despite the frustration, this is what I really love about it!

Before claiming that this is fixed, it's worth focusing a little on what the real problem is here. There is a user-agent issue for sure. And arguably DuckDuckGo is trying to frustrate certain users (bots, essentially) by serving different pages to different clients. But the real issue here is that these Sec-Fetch headers are actually broken in our version of gecko ESR 91. That's not the fault of the upstream code: it's a failure of the way the front-end is interacting with the backend code.

So the correct way to fix this issue (at least from the client-side) is to fix those headers. Fixing it for DuckDuckGo is likely to have a positive effect on other sites as well, so fixing it will be a worthwhile effort.

That's what I'll now move on to.

As noted above the current (incorrect) values set for the headers are the following:
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : cross-site
What we'd expect and want to see for a user-triggered access to DuckDuckGo is this:
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : none
        Sec-Fetch-User : ?1
Let's check the code for the two incorrect values. First the site value:
void mozilla::dom::SecFetch::AddSecFetchSite(nsIHttpChannel* aHTTPChannel) {
  nsAutoCString site("same-origin");

  bool isSameOrigin = IsSameOrigin(aHTTPChannel);
  if (!isSameOrigin) {
    bool isSameSite = IsSameSite(aHTTPChannel);
    if (isSameSite) {
      site = "same-site"_ns;
    } else {
      site = "cross-site"_ns;
    }
  }

  if (IsUserTriggeredForSecFetchSite(aHTTPChannel)) {
    site = "none"_ns;
  }

  nsresult rv =
      aHTTPChannel->SetRequestHeader("Sec-Fetch-Site"_ns, site, false);
  mozilla::Unused << NS_WARN_IF(NS_FAILED(rv));
}
We can infer that isSameOrigin and isSameSite are both set to false. This is a bit strange but it's actually not the bit we have to worry about. The result of going through the initial condition will be overwritten if IsUserTriggeredForSecFetchSite() returns true so that's where we should focus.
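The precedence between these checks is easier to see when the channel plumbing is stripped away. The following is a simplified sketch, assuming the three checks can be reduced to plain booleans; the secFetchSite() function is hypothetical and just mirrors the order of decisions in the method above.

```cpp
#include <string>

// Simplified model of the Sec-Fetch-Site decision. The user-triggered
// check is applied last in the original, so it overrides everything else.
std::string secFetchSite(bool sameOrigin, bool sameSite, bool userTriggered) {
    if (userTriggered) {
        return "none";
    }
    if (sameOrigin) {
        return "same-origin";
    }
    return sameSite ? "same-site" : "cross-site";
}
```

With sameOrigin and sameSite both false the result is cross-site unless userTriggered is true, in which case it becomes none regardless of the other two inputs.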

I'm going to use the debugger to try to find out how these values get set.
$ rm -rf ~/.local/share/org.sailfishos/browser/.mozilla/cache2/ \
    ~/.local/share/org.sailfishos/browser/.mozilla/startupCache/ \
    ~/.local/share/org.sailfishos/browser/.mozilla/cookies.sqlite
$ gdb sailfish-browser
[...]
(gdb) r
[...]

(gdb) b LoadInfo::SetHasValidUserGestureActivation
Breakpoint 4 at 0x7fb9d34424: file netwerk/base/LoadInfo.cpp, line 1609.
(gdb) b LoadInfo::SetLoadTriggeredFromExternal
Breakpoint 5 at 0x7fb9d34300: file netwerk/base/LoadInfo.cpp, line 1472.
(gdb) c

Thread 8 "GeckoWorkerThre" hit Breakpoint 4, mozilla::net::LoadInfo::
    SetHasValidUserGestureActivation (this=this@entry=0x7f89b00da0, 
    aHasValidUserGestureActivation=false) at netwerk/base/LoadInfo.cpp:1609
1609      mHasValidUserGestureActivation = aHasValidUserGestureActivation;
(gdb) bt
#0  mozilla::net::LoadInfo::SetHasValidUserGestureActivation
    (this=this@entry=0x7f89b00da0, aHasValidUserGestureActivation=false)
    at netwerk/base/LoadInfo.cpp:1609
#1  0x0000007fb9ffd86c in mozilla::net::CreateDocumentLoadInfo
    (aBrowsingContext=aBrowsingContext@entry=0x7f88c6dde0, 
    aLoadState=aLoadState@entry=0x7f894082d0)
    at netwerk/ipc/DocumentLoadListener.cpp:149
#2  0x0000007fb9ffd9b8 in mozilla::net::DocumentLoadListener::OpenDocument
    (this=0x7f89a8c3e0, aLoadState=0x7f894082d0, aCacheKey=0, aChannelId=..., 
    aAsyncOpenTime=..., aTiming=0x0, aInfo=..., aUriModified=...,
    aIsXFOError=..., aPid=aPid@entry=0, aRv=aRv@entry=0x7f9f3eda34)
    at netwerk/ipc/DocumentLoadListener.cpp:744
#3  0x0000007fb9ffe88c in mozilla::net::ParentProcessDocumentChannel::AsyncOpen
    (this=0x7f89bf02e0, aListener=0x7f895a7770)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h:80
#4  0x0000007fba4663bc in nsURILoader::OpenURI (this=0x7f887b46c0,
    channel=0x7f89bf02e0, aFlags=0, aWindowContext=0x7f886044f0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#5  0x0000007fbc7fa824 in nsDocShell::OpenInitializedChannel
    (this=this@entry=0x7f886044c0, aChannel=0x7f89bf02e0,
    aURILoader=0x7f887b46c0, aOpenFlags=0)
    at docshell/base/nsDocShell.cpp:10488
#6  0x0000007fbc7fb5e4 in nsDocShell::DoURILoad (this=this@entry=0x7f886044c0,
    aLoadState=aLoadState@entry=0x7f894082d0, aCacheKey=..., 
    aRequest=aRequest@entry=0x7f9f3edf90)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#7  0x0000007fbc7fc1a4 in nsDocShell::InternalLoad
    (this=this@entry=0x7f886044c0, aLoadState=aLoadState@entry=0x7f894082d0,
    aCacheKey=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:1363
#8  0x0000007fbc801c20 in nsDocShell::ReloadDocument
    (aDocShell=aDocShell@entry=0x7f886044c0, aDocument=<optimized out>,
    aLoadType=aLoadType@entry=2, aBrowsingContext=0x7f88c6dde0,
    aCurrentURI=0x7f887e62c0, aReferrerInfo=0x0,
    aNotifiedBeforeUnloadListeners=aNotifiedBeforeUnloadListeners@entry=false)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h:79
#9  0x0000007fbc803984 in nsDocShell::Reload (this=0x7f886044c0, aReloadFlags=0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#10 0x0000007fbc94d05c in nsWebBrowser::Reload (this=<optimized out>,
    aReloadFlags=<optimized out>)
    at toolkit/components/browser/nsWebBrowser.cpp:507
#11 0x0000007fbcb0ab70 in mozilla::embedlite::EmbedLiteViewChild::RecvReload
    (this=<optimized out>, aHardReload=<optimized out>)
    at mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp:533
#12 0x0000007fba1915d0 in mozilla::embedlite::PEmbedLiteViewChild::
    OnMessageReceived (this=0x7f88767020, msg__=...)
    at PEmbedLiteViewChild.cpp:1152
#13 0x0000007fba17f05c in mozilla::embedlite::PEmbedLiteAppChild::
    OnMessageReceived (this=<optimized out>, msg__=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:675
#14 0x0000007fba06b85c in mozilla::ipc::MessageChannel::DispatchAsyncMessage
    (this=this@entry=0x7f88b3e8a8, aProxy=aProxy@entry=0x7ebc003140, aMsg=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:675
[...]
#40 0x0000007fb78b289c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.

Thread 8 "GeckoWorkerThre" hit Breakpoint 5, mozilla::net::LoadInfo::
    SetLoadTriggeredFromExternal (this=this@entry=0x7f89b00da0, 
    aLoadTriggeredFromExternal=false) at netwerk/base/LoadInfo.cpp:1472
1472      mLoadTriggeredFromExternal = aLoadTriggeredFromExternal;
(gdb) bt
#0  mozilla::net::LoadInfo::SetLoadTriggeredFromExternal
    (this=this@entry=0x7f89b00da0, aLoadTriggeredFromExternal=false)
    at netwerk/base/LoadInfo.cpp:1472
#1  0x0000007fbc7f20f8 in nsDocShell::CreateAndConfigureRealChannelForLoadState
    (aBrowsingContext=aBrowsingContext@entry=0x7f88c6dde0, 
    aLoadState=aLoadState@entry=0x7f894082d0,
    aLoadInfo=aLoadInfo@entry=0x7f89b00da0,
    aCallbacks=aCallbacks@entry=0x7f89724110, 
    aDocShell=aDocShell@entry=0x0, aOriginAttributes=...,
    aLoadFlags=aLoadFlags@entry=2689028, aCacheKey=aCacheKey@entry=0, 
    aRv=@0x7f9f3eda34: nsresult::NS_OK, aChannel=aChannel@entry=0x7f89a8c438)
    at docshell/base/nsDocShellLoadState.cpp:709
#2  0x0000007fb9ff9c4c in mozilla::net::DocumentLoadListener::Open
    (this=this@entry=0x7f89a8c3e0, aLoadState=aLoadState@entry=0x7f894082d0, 
    aLoadInfo=aLoadInfo@entry=0x7f89b00da0, aLoadFlags=2689028,
    aCacheKey=aCacheKey@entry=0, aChannelId=..., aAsyncOpenTime=..., 
    aTiming=aTiming@entry=0x0, aInfo=..., aUrgentStart=aUrgentStart@entry=false,
    aPid=aPid@entry=0, aRv=aRv@entry=0x7f9f3eda34)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:359
#3  0x0000007fb9ffda34 in mozilla::net::DocumentLoadListener::OpenDocument
    (this=0x7f89a8c3e0, aLoadState=0x7f894082d0, aCacheKey=0, aChannelId=..., 
    aAsyncOpenTime=..., aTiming=0x0, aInfo=..., aUriModified=...,
    aIsXFOError=..., aPid=aPid@entry=0, aRv=aRv@entry=0x7f9f3eda34)
    at netwerk/ipc/DocumentLoadListener.cpp:750
#4  0x0000007fb9ffe88c in mozilla::net::ParentProcessDocumentChannel::AsyncOpen
    (this=0x7f89bf02e0, aListener=0x7f895a7770)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h:80
#5  0x0000007fba4663bc in nsURILoader::OpenURI (this=0x7f887b46c0,
    channel=0x7f89bf02e0, aFlags=0, aWindowContext=0x7f886044f0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#6  0x0000007fbc7fa824 in nsDocShell::OpenInitializedChannel
    (this=this@entry=0x7f886044c0, aChannel=0x7f89bf02e0,
    aURILoader=0x7f887b46c0, aOpenFlags=0)
    at docshell/base/nsDocShell.cpp:10488
#7  0x0000007fbc7fb5e4 in nsDocShell::DoURILoad (this=this@entry=0x7f886044c0,
    aLoadState=aLoadState@entry=0x7f894082d0, aCacheKey=..., 
    aRequest=aRequest@entry=0x7f9f3edf90)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#8  0x0000007fbc7fc1a4 in nsDocShell::InternalLoad (this=this@entry=0x7f886044c0,
    aLoadState=aLoadState@entry=0x7f894082d0, aCacheKey=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:1363
#9  0x0000007fbc801c20 in nsDocShell::ReloadDocument
    (aDocShell=aDocShell@entry=0x7f886044c0, aDocument=<optimized out>,
    aLoadType=aLoadType@entry=2, aBrowsingContext=0x7f88c6dde0,
    aCurrentURI=0x7f887e62c0, aReferrerInfo=0x0,
    aNotifiedBeforeUnloadListeners=aNotifiedBeforeUnloadListeners@entry=false)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h:79
#10 0x0000007fbc803984 in nsDocShell::Reload (this=0x7f886044c0, aReloadFlags=0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#11 0x0000007fbc94d05c in nsWebBrowser::Reload (this=<optimized out>,
    aReloadFlags=<optimized out>)
    at toolkit/components/browser/nsWebBrowser.cpp:507
#12 0x0000007fbcb0ab70 in mozilla::embedlite::EmbedLiteViewChild::RecvReload
    (this=<optimized out>, aHardReload=<optimized out>)
    at mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp:533
#13 0x0000007fba1915d0 in mozilla::embedlite::PEmbedLiteViewChild::
    OnMessageReceived (this=0x7f88767020, msg__=...)
    at PEmbedLiteViewChild.cpp:1152
#14 0x0000007fba17f05c in mozilla::embedlite::PEmbedLiteAppChild::
    OnMessageReceived (this=<optimized out>, msg__=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:675
#15 0x0000007fba06b85c in mozilla::ipc::MessageChannel::DispatchAsyncMessage
    (this=this@entry=0x7f88b3e8a8, aProxy=aProxy@entry=0x7ebc003140, aMsg=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:675
[...]
#41 0x0000007fb78b289c in ?? () from /lib64/libc.so.6
(gdb) c
[...]
Next I dig deeper to find out where the value for mHasValidUserGestureActivation is coming from. It's being set to false and we need to know why.
(gdb) disable break
(gdb) break nsDocShell.cpp:4108
Breakpoint 6 at 0x7fbc801bec: file docshell/base/nsDocShell.cpp, line 4108.
(gdb) c
Continuing.
[Switching to LWP 29933]

Thread 8 "GeckoWorkerThre" hit Breakpoint 6, nsDocShell::ReloadDocument
    (aDocShell=aDocShell@entry=0x7f886044c0, aDocument=<optimized out>, 
    aLoadType=aLoadType@entry=2, aBrowsingContext=0x7f88c6dde0,
    aCurrentURI=0x7f887e62c0, aReferrerInfo=0x0, 
    aNotifiedBeforeUnloadListeners=aNotifiedBeforeUnloadListeners@entry=false)
    at docshell/base/nsDocShell.cpp:4108
4108      loadState->SetHasValidUserGestureActivation(
(gdb) p context
$14 = {mRawPtr = 0x7ed4009980}
(gdb) p context.mRawPtr
$15 = (mozilla::dom::WindowContext *) 0x7ed4009980
(gdb) p *(context.mRawPtr)
$16 = {<nsISupports> = {_vptr.nsISupports = 0x7fbf75c7a0 <vtable for
    mozilla::dom::WindowGlobalParent+16>}, <nsWrapperCache> = {
[...]
    mInnerWindowId = 16, mOuterWindowId = 1, mBrowsingContext = { mRawPtr =
    0x7f88c6dde0}, mWindowGlobalChild = {mRef = {mRawPtr = 0x7f895aa790}}, 
  mChildren = {<nsTArray_Impl<RefPtr<mozilla::dom::BrowsingContext>,
      nsTArrayInfallibleAllocator>> = {
      <nsTArray_base<nsTArrayInfallibleAllocator,
      nsTArray_RelocateUsingMemutils>> = { mHdr = 0x7fbe0c86b8
      <sEmptyTArrayHeader>},
      <nsTArray_TypedBase<RefPtr<mozilla::dom::BrowsingContext>,
      nsTArray_Impl<RefPtr<mozilla::dom::BrowsingContext>,
      nsTArrayInfallibleAllocator> >> = {
      <nsTArray_SafeElementAtHelper<RefPtr<mozilla::dom::BrowsingContext>,
      nsTArray_Impl<RefPtr<mozilla::dom::BrowsingContext>,
      nsTArrayInfallibleAllocator> >> =
      {<nsTArray_SafeElementAtSmartPtrHelper<RefPtr<mozilla::dom::BrowsingContext>,
      nsTArray_Impl<RefPtr<mozilla::dom::BrowsingContext>,
      nsTArrayInfallibleAllocator> >> = {<detail::nsTArray_CopyDisabler> =
      {<No data fields>}, <No data fields>}, <No data fields>},
      <No data fields>}, static NoIndex = 18446744073709551615},
      <No data fields>}, mIsDiscarded = false, mIsInProcess = true,
      mCanExecuteScripts = true, 
  mUserGestureStart = {mValue = {mUsedCanonicalNow = 0, mTimeStamp = 0}}}
(gdb) 
I eventually end up here in nsDocShell.cpp:
  aLoadInfo->SetLoadTriggeredFromExternal(
      aLoadState->HasLoadFlags(LOAD_FLAGS_FROM_EXTERNAL));
This is important because requests triggered by an external application are considered user triggered, as mentioned in the comments in SecFetch::AddSecFetchUser():
  // sec-fetch-user only applies if the request is user triggered.
  // requests triggered by an external application are considerd user triggered.
  if (!loadInfo->GetLoadTriggeredFromExternal() &&
      !loadInfo->GetHasValidUserGestureActivation()) {
    return;
  }
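To make the condition concrete, here's a minimal Python sketch of the same early-return logic. It's a model for reasoning only, not Gecko code: the function name is_user_triggered and its parameter names are made up for illustration.

```python
def is_user_triggered(load_triggered_from_external: bool,
                      has_valid_user_gesture: bool) -> bool:
    """Model of the guard at the top of SecFetch::AddSecFetchUser().

    The C++ code returns early only when BOTH values are false,
    so either one being true is enough to count as user triggered.
    """
    return load_triggered_from_external or has_valid_user_gesture


# The debugger session showed both values being set to false.
print(is_user_triggered(False, False))
```

With both inputs false, as seen at the two breakpoints, the guard takes the early return. Flipping either input to true is enough to get past it.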
These load flags are set all over the place, although the LOAD_FLAGS_FROM_EXTERNAL flag specifically is only set in tabbrowser.js in the upstream code, which I believe is code we don't use for the Sailfish Browser. Instead we set these flags in EmbedLiteViewChild.cpp. It's possible there are some new flags which we need to add there. Let's check the history using git blame. I happen to see from the source code that the flag is being set on line 1798 of the tabbrowser.js file:
$ git blame browser/base/content/tabbrowser.js -L1798,1798
f9f59140398bc (Victor Porof 2019-07-05 09:48:57 +0200 1798)
    flags |= Ci.nsIWebNavigation.LOAD_FLAGS_FROM_EXTERNAL;
$ git log -1 --oneline f9f59140398bc
f9f59140398b Bug 1561435 - Format browser/base/, a=automatic-formatting
$ git log -1 f9f59140398bc
commit f9f59140398bc4d04d840e8217c04e0d7eafafb9
Author: Victor Porof <vporof@mozilla.com>
Date:   Fri Jul 5 09:48:57 2019 +0200

    Bug 1561435 - Format browser/base/, a=automatic-formatting
    
    # ignore-this-changeset
    
    Differential Revision: https://phabricator.services.mozilla.com/D36041
    
    --HG--
    extra : source : 96b3895a3b2aa2fcb064c85ec5857b7216884556
This commit is just reformatting the file so we need to look at the change prior to this one. Checking the diff of the automatic formatting I can see that before this the relevant line of code was line 1541 in the same file.
$ git blame f9f59140398bc~ browser/base/content/tabbrowser.js -L1541,1541
082b6eb1e7ed2 (James Willcox 2019-03-12 20:20:58 +0000 1541)
    flags |= Ci.nsIWebNavigation.LOAD_FLAGS_FROM_EXTERNAL;
$ git log -1 082b6eb1e7ed2
commit 082b6eb1e7ed20de7424aea94fb7ce40b1b39c36
Author: James Willcox <snorp@snorp.net>
Date:   Tue Mar 12 20:20:58 2019 +0000

    Bug 1524992 - Treat command line URIs as external r=mconley
    
    Differential Revision: https://phabricator.services.mozilla.com/D20890
    
    --HG--
    extra : moz-landing-system : lando
This is the important change. It added the flag specifically for URIs triggered at the command line. As it happens that's currently one of the ways I've been testing, so I should fix this. It's worth noting that this flag was introduced during the Firefox 67 cycle, ahead of ESR 68, so has actually been around for a while. But I guess skipping it didn't have any obvious negative effects, so nobody noticed it needed to be handled.

That'll have to change now. But it looks like this will be just one of many such flags that will need adding in to the Sailfish code.

Let's focus on LOAD_FLAGS_FROM_EXTERNAL first. This is set in browser.js when a call is made to getContentWindowOrOpenURI(). This ends up running this piece of code:
      default:
        // OPEN_CURRENTWINDOW or an illegal value
        browsingContext = window.gBrowser.selectedBrowser.browsingContext;
        if (aURI) {
          let loadFlags = Ci.nsIWebNavigation.LOAD_FLAGS_NONE;
          if (isExternal) {
            loadFlags |= Ci.nsIWebNavigation.LOAD_FLAGS_FROM_EXTERNAL;
          } else if (!aTriggeringPrincipal.isSystemPrincipal) {
            // XXX this code must be reviewed and changed when bug 1616353
            // lands.
            loadFlags |= Ci.nsIWebNavigation.LOAD_FLAGS_FIRST_LOAD;
          }
          gBrowser.loadURI(aURI.spec, {
            triggeringPrincipal: aTriggeringPrincipal,
            csp: aCsp,
            loadFlags,
            referrerInfo,
          });
        }
We end up here when there's a call made to handURIToExistingBrowser() in BrowserContentHandler.jsm which then calls browserDOMWindow.openURI() with the aFlags parameter set to Ci.nsIBrowserDOMWindow.OPEN_EXTERNAL.

What's the equivalent for Sailfish Browser? There are three ways it may end up opening an external URL that I can think of. The first is if the application is executed with a URL on the command line:
$ sailfish-browser https://www.flypig.co.uk/gecko
Another is if xdg-open is called:
$ xdg-open https://www.flypig.co.uk/gecko
The third is if the browser is called via its D-Bus interface:
$ gdbus call --session --dest org.sailfishos.browser.ui --object-path /ui \
    --method org.sailfishos.browser.ui.openUrl \
    "['https://www.flypig.co.uk/gecko']"
All of these should end up setting the LOAD_FLAGS_FROM_EXTERNAL flag.

Let's follow the path in the case of the D-Bus call. The entry point for this is in browserservice.cpp where the D-Bus object is registered. The call looks like this:
void BrowserUIService::openUrl(const QStringList &args)
{
    if(args.count() > 0) {
        emit openUrlRequested(args.first());
    } else {
        emit openUrlRequested(QString());
    }
}
This gets picked up by the browser object via a connection in main.cpp:
        browser->connect(uiService, &BrowserUIService::openUrlRequested,
                        browser, &Browser::openUrl);
Which triggers the following method:
void Browser::openUrl(const QString &url)
{
    Q_D(Browser);
    DeclarativeWebUtils::instance()->openUrl(url);
}
The version of the method in DeclarativeWebUtils sanitises the URL before sending it on its way by emitting a signal:
    emit openUrlRequested(tmpUrl);
Finally this is picked up by BrowserPage.qml which ends up doing one of three things with it:
    Connections {
        target: WebUtils
        onOpenUrlRequested: {
[...]
            if (webView.tabModel.activateTab(url)) {
                webView.releaseActiveTabOwnership()
            } else if (!webView.tabModel.loaded) {
                webView.load(url)
            } else {
                webView.clearSelection()
                webView.tabModel.newTab(url)
                overlay.dismiss(true, !Qt.application.active /* immadiate */)
            }
So it either activates an existing tab, calls the user interface to create the very first tab, or adds a new tab to the existing list. This approach contrasts with the route taken when the user enters a URL. In this case the process is handled in Overlay.qml and the loadPage() function there. This does a bunch of checks before calling this:
                webView.load(pageUrl)
Notice that this is also one of the methods that's called in the case of the D-Bus trigger. That's important, because we need to distinguish between these two routes. The WebView component inherits load() from DeclarativeWebContainer where the method looks like this:
void DeclarativeWebContainer::load(const QString &url, bool force)
{
    QString tmpUrl = url;
    if (tmpUrl.isEmpty() || !browserEnabled()) {
        tmpUrl = ABOUT_BLANK;
    }

    if (!canInitialize()) {
        m_initialUrl = tmpUrl;
    } else if (m_webPage && m_webPage->completed()) {
        if (m_loading) {
            m_webPage->stop();
        }
        m_webPage->loadTab(tmpUrl, force);
        Tab *tab = m_model->getTab(m_webPage->tabId());
        if (tab) {
            tab->setRequestedUrl(tmpUrl);
        }
    } else if (m_model && m_model->count() == 0) {
        // Browser running all tabs are closed.
        m_model->newTab(tmpUrl);
    }
}
Notice that this mimics the D-Bus route with there being three options: record the URL for later if the container can't be initialised yet, load the URL into an existing tab or create a new tab if there are none already available.
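The branching in load() can be modelled in a few lines of Python. This is only a sketch for keeping the cases straight: the function name choose_load_action, its parameters and the returned labels are all hypothetical, with page_completed standing in for "m_webPage && m_webPage->completed()" and tab_count for the model's count.

```python
def choose_load_action(can_initialize: bool,
                       page_completed: bool,
                       tab_count: int) -> str:
    """Return a label for the branch DeclarativeWebContainer::load() would take."""
    if not can_initialize:
        # Too early to load anything: remember the URL for later.
        return "store-initial-url"
    if page_completed:
        # There's a usable page: load the URL into the current tab.
        return "load-in-current-tab"
    if tab_count == 0:
        # Browser is running but every tab has been closed.
        return "new-tab"
    # None of the conditions matched, so the call does nothing.
    return "no-op"


print(choose_load_action(can_initialize=True, page_completed=False, tab_count=0))
```

Writing it out this way also makes the implicit fourth case visible: if the page isn't completed and there are tabs in the model, nothing happens at all.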

Trying to follow all these routes is like trying to follow a droplet of water down a waterfall. I think I've reached the limit of my indirection capacity here; I'm going to need to pick this up again tomorrow.

But, so that I don't lose my thread, a note about two further things I need to do with this.

First I need to follow the process through until I hit the point at which EmbedLiteViewParent::SendLoadURL() is called. This is the point at which the flag needs to be set. It looks like the common way for this to get called is through the following call:
void
EmbedLiteView::LoadURL(const char* aUrl)
{
  LOGT("url:%s", aUrl);
  Unused << mViewParent->SendLoadURL(NS_ConvertUTF8toUTF16(nsDependentCString(aUrl)));
}
I should check this with the debugger to make sure.

Second I need to ensure the flag gets passed from the point at which we know what its value needs to be (which is the D-Bus interface) to this call to SendLoadURL(), so that EmbedLiteViewChild::RecvLoadURL() can set it appropriately.

Once I have those two pieces everything will tie together and at that point it will be possible to set the LOAD_FLAGS_FROM_EXTERNAL flag appropriately.
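As a note to myself, the end result I'm aiming for can be sketched like this in Python. Everything here is an assumption for illustration: the source labels, the load_flags_for() helper and the numeric value given to FROM_EXTERNAL are placeholders, not the real Sailfish Browser or Gecko interfaces.

```python
from enum import IntFlag


class LoadFlags(IntFlag):
    NONE = 0x0000
    FROM_EXTERNAL = 0x1000  # placeholder value, assumed for this sketch


# Hypothetical labels for the three external routes described above.
EXTERNAL_SOURCES = {"command-line", "xdg-open", "d-bus"}


def load_flags_for(source: str) -> LoadFlags:
    """Pick the load flags based on where the URL request came from."""
    flags = LoadFlags.NONE
    if source in EXTERNAL_SOURCES:
        flags |= LoadFlags.FROM_EXTERNAL
    return flags


print(load_flags_for("d-bus"))
print(load_flags_for("url-bar"))
```

The point is that the decision is made once, at the entry point where the origin of the request is known, and only the resulting flags need to travel down to the load call.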

That's it for today. Tomorrow, onwards.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
15 Jan 2024 : Day 139 #
This morning I transferred the DuckDuckGo testing website from S3 where we had it yesterday to Amazon CloudFront. This still uses the S3 bucket to serve the pages as a static site, but now with a cloudfront.net URL (which shouldn't make any difference for the tests) and using HTTPS (which might make a difference). I want to use HTTPS not because I'm worried about the integrity of the site, but because I want to eliminate differences between how I'm accessing the real DuckDuckGo and the test page version I'm using.

A small, overly-optimistic, bit of me was hoping that the switch to HTTPS might cause the site to break when using ESR 91, but it didn't. The DuckDuckGo logo shows just fine, the page looks sensible, albeit still showing the peculiar differences in functionality that I described yesterday.

The next steps are to capture the full output from accessing the original site using ESR 91, followed by the same process accessing the test site using ESR 91. In order to do this I'll need to build and run the recent changes I made to the EmbedLiteConsoleListener.js code so that it can be installed on my ESR 91 device. I made the changes manually on my ESR 78 device, and although I did already manually copy those changes over to the local repository on my development machine, I've not yet built and installed them for my other phone.

Having done that, I now need to capture a copy of the site using the ESR 91 version of the browser. I'll capture both a copy of the real site and a copy of the replicated version of the site (the version downloaded using ESR 78):
$ rm -rf ~/.local/share/org.sailfishos/browser
$ EMBED_CONSOLE="network" sailfish-browser \
    https://d3gf5xld99gmbj.cloudfront.net/ 2>&1 > esr91-ddg-copy.txt
[...]
$ rm -rf ~/.local/share/org.sailfishos/browser
$ EMBED_CONSOLE="network" sailfish-browser \
    https://duckduckgo.com/ 2>&1 > esr91-ddg-orig.txt
[...]
Looking through the output of both it's clear that they're quite different. Just starting with the index.html page, the very first line of each differs significantly. So it really does seem to be the case that DuckDuckGo is serving a different version of the page.

I also tried downloading a copy using ESR 91 and the iPhone user agent string from yesterday. But the site downloaded was the same.

What I want to do now is create a copy of the site downloaded when I use ESR 91. This is the site that's broken (showing just a blank page) when rendered using the ESR 91 renderer. But although the page is blank it is still downloading a bunch of files, so there's definitely something to replicate.

Having done this process before I'm getting quite proficient at it now. The process goes like this:
  1. Take the log output from loading the page, with the full network dump enabled
  2. Work through this log file and whenever there's some text content in the log, cut and paste this out into its own file. This is then a copy of the file as it was downloaded by the Sailfish Browser.
  3. Carefully save this file out using the same file structure as served by the server (matching the suffix of the URL of the file downloaded).
  4. Having recreated all of these files, create an S3 bucket on AWS and copy all of these files into it.
  5. Create a CloudFront distribution of the bucket. To reiterate, I do it this way rather than just serving the bucket as a static site so that I can offer the site with HTTPS access.
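Step 3 is the fiddly part, so here's a small Python sketch of how a downloaded URL could be mapped to a path in the local copy. I did this by hand, so this isn't the script I used: the function name local_path_for and the rule of mapping a trailing slash to index.html are assumptions for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def local_path_for(url: str, root: str) -> PurePosixPath:
    """Map a URL to a file path under root, mirroring the server's layout."""
    path = urlsplit(url).path  # drops the scheme, host and any query string
    if path == "" or path.endswith("/"):
        # A directory-style URL is assumed to be served from index.html.
        path += "index.html"
    return PurePosixPath(root) / path.lstrip("/")


print(local_path_for("https://duckduckgo.com/", "ddg9"))
print(local_path_for(
    "https://duckduckgo.com/font/ProximaNova-Reg-webfont.woff2", "ddg9"))
```

Keeping the directory structure identical to the URL paths is what allows the copy to be served from the root of a bucket without touching any of the links in the files.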
I've been through all of these steps again and am now using CloudFront to serve two copies of the site:
  1. ddg8: the original DuckDuckGo site as downloaded by ESR 78: https://d3gf5xld99gmbj.cloudfront.net/.
  2. ddg9: the original DuckDuckGo site as downloaded by ESR 91: https://dd53jyxmgchu8.cloudfront.net/.
If you take a look at these sites you'll see that the version downloaded using ESR 78 looks pretty decent when viewed using a desktop browser. But the one downloaded by ESR 91 is blank, just as it is when rendering it on Sailfish OS using ESR 91.

There's one final check to make and that's to access the copy of the site originally downloaded using ESR 91 and now being served on CloudFront (the ddg9 version of the site), but using ESR 91. Why do this? Once I've got this I can compare the files downloaded this way with the files downloaded directly from DuckDuckGo using ESR 91.

If the copy is good, the files downloaded should be very similar, if not identical.

I've done this now, so have two copies of the log output. Let's compare them using the comparison command I put together a few days back.
$ diff --side-by-side \
  <(sed -e 's/^.*\/\(.*\)/\1/g' <(grep "duckduckgo" esr91-ddg-orig-urls.txt) | sort) \
  <(sed -e 's/^.*\/\(.*\)/\1/g' <(grep "cloudfront" esr91-ddg-esr91-urls.txt) | sort)

18040-1287342b1f839f70.js                    18040-1287342b1f839f70.js
38407-070351ade350c8e4.js                    38407-070351ade350c8e4.js
39337-cd8caeeff0afb1c4.js                    39337-cd8caeeff0afb1c4.js
41966-c9d76895b4f9358f.js                    41966-c9d76895b4f9358f.js
55015-29fec414530c2cf6.js                    55015-29fec414530c2cf6.js
55672-0a2c33e517ba92f5.js                    55672-0a2c33e517ba92f5.js
61754-cfebc3ba4c97208e.js                    61754-cfebc3ba4c97208e.js
6a4833195509cc3d.css                         6a4833195509cc3d.css
703c9a9a057785a9.css                         703c9a9a057785a9.css
81125-b74d1b6f4908497b.js                    81125-b74d1b6f4908497b.js
93432-ebd443fe69061b19.js                    93432-ebd443fe69061b19.js
94623-d5bfa67fc3bada59.js                    94623-d5bfa67fc3bada59.js
95665-f2a003aa56f899b0.js                    95665-f2a003aa56f899b0.js
a2a29f84956f2aac.css                         a2a29f84956f2aac.css
_app-a1aac13e30ca1ed6.js                     _app-a1aac13e30ca1ed6.js
_buildManifest.js                            _buildManifest.js
c89114cfe55133c4.css                         c89114cfe55133c4.css
ed8494aa71104fdc.css                         ed8494aa71104fdc.css
f0b3f7da285c9dbd.css                         f0b3f7da285c9dbd.css
framework-f8115f7fae64930e.js                framework-f8115f7fae64930e.js
home-34dda07336cb6ee1.js                     home-34dda07336cb6ee1.js
main-17a05b704438cdd6.js                     main-17a05b704438cdd6.js
ProximaNova-Bold-webfont.woff2               ProximaNova-Bold-webfont.woff2
ProximaNova-ExtraBold-webfont.woff2          ProximaNova-ExtraBold-webfont.woff2
ProximaNova-RegIt-webfont.woff2              ProximaNova-RegIt-webfont.woff2
ProximaNova-Reg-webfont.woff2                ProximaNova-Reg-webfont.woff2
ProximaNova-Sbold-webfont.woff2              ProximaNova-Sbold-webfont.woff2
_ssgManifest.js                              _ssgManifest.js
webpack-96503cdd116848e8.js                  webpack-96503cdd116848e8.js
The two sets of downloaded files are identical. This is really good, because it means that the ddg9 version of the site is an accurate reflection of what's being served to ESR 91 when Sailfish Browser accesses the real DuckDuckGo site using the ESR 91 engine.
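For anyone who finds the sed and grep incantation hard to read, the same comparison can be sketched in Python. This is an equivalent written for illustration rather than what I actually ran: the function names are made up and the URLs in the example are shortened, hypothetical ones. Unlike the diff, it compares sets, so duplicate downloads aren't counted.

```python
def basenames(urls, host_fragment):
    """Keep URLs containing host_fragment and strip everything up to the last slash."""
    return sorted(url.rsplit("/", 1)[-1] for url in urls if host_fragment in url)


def compare(orig_urls, copy_urls):
    """Return (files only in the original, files only in the copy)."""
    orig = set(basenames(orig_urls, "duckduckgo"))
    copy = set(basenames(copy_urls, "cloudfront"))
    return sorted(orig - copy), sorted(copy - orig)


# Hypothetical, shortened URL lists standing in for the two log extracts.
orig_urls = [
    "https://duckduckgo.com/a/main-17a05b704438cdd6.js",
    "https://duckduckgo.com/a/_buildManifest.js",
]
copy_urls = [
    "https://dd53jyxmgchu8.cloudfront.net/a/main-17a05b704438cdd6.js",
    "https://dd53jyxmgchu8.cloudfront.net/a/_buildManifest.js",
]
print(compare(orig_urls, copy_urls))
```

Two empty lists coming back corresponds to the side-by-side diff showing no differences.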

Visiting this copy of the site in other browsers, including ESR 78 and the desktop version of Firefox, shows that the site doesn't render there either.

It's been a long journey to get to this point, but this is clear evidence that the problem is the site being served to ESR 91, rather than the ESR 91 rendering engine or JavaScript interpreter getting stuck after the site has been received.

This means I have to concentrate on persuading DuckDuckGo to serve the same version of the page that it's serving to ESR 78. It's taken far too long to get here, but at least I feel I've learnt something in the process about how to perform these checks, how to download accurate copies of a website and how to serve them using CloudFront so that they don't have to go in a sub-directory.

I've already tried with multiple different user agents and that hasn't been enough to persuade DuckDuckGo to serve the correct version of the page, so I'm not quite sure how to get around this issue. One possibility is that I'm not actually using the user agents I think I am. So this will be something to check tomorrow.

14 Jan 2024 : Day 138 #
Finally, after much meandering and misdirected effort, an important reason for my copied version of DuckDuckGo failing became clear yesterday. I admit I'm a bit embarrassed that I didn't spot it earlier, but there's no hiding here.

Simply put, DuckDuckGo uses paths with a preceding / in the site HTML file rather than fully relative URLs. So by placing the files in a "tests/ddg8" folder on my server, I was breaking most of the links.

Now, admittedly, I've not yet had a chance to see what happens when this is fixed, so there could well be other issues as well. But what's for sure is that without fixing this, the copied site will remain broken.

My plan is to make use of the HTML base element to try to work around the issue. This can be added to the head of an HTML file to direct the browser to the root of the site, so that all relative URLs are resolved relative to the base address.

I should also check for URLs that start with https://duckduckgo.com/ or similar as these won't be fixed by this change.

Since the location of my test site is https://www.flypig.co.uk/tests/ddg8/ the addition I need to make is a line inside the head element of the index.html file like this:
   <base href="https://www.flypig.co.uk/tests/ddg8/" />
On trying this, in practice and contrary to what I'd expected, it turns out that a URL with a preceding / is treated as having an absolute path. The spec wasn't clear on this point for me, but it means the URL is resolved relative to the domain name, not relative to the path in the base. That's rubbish, but observable behaviour. Rubbish because it means I can't use this as my solution after all.
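The same resolution rules can be checked outside the browser. Here's a short Python sketch using urljoin from the standard library, which follows the same general rules for resolving references against a base. The resolve() helper is just a wrapper I've named for illustration, and the file name comes from the copied site.

```python
from urllib.parse import urljoin

BASE = "https://www.flypig.co.uk/tests/ddg8/"


def resolve(reference: str) -> str:
    """Resolve a URL reference against the base address of the test site."""
    return urljoin(BASE, reference)


# No leading slash: resolved inside the tests/ddg8/ folder, as I wanted.
print(resolve("dist/o.2988a52fdfb14b7eff16.css"))

# Leading slash: the path of the base is discarded, only the host is kept.
print(resolve("/dist/o.2988a52fdfb14b7eff16.css"))

# Full URL: the base is ignored entirely.
print(resolve("https://duckduckgo.com/dist/o.2988a52fdfb14b7eff16.css"))
```

Only the first form ends up under tests/ddg8/, which is why the base element doesn't help with a page that uses leading slashes throughout.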

So I'm going to have to make more intrusive changes, removing these preceding slashes from all instances of the URL in the page and all files that get loaded with it. I was hoping to avoid that.

There are alternatives to this intrusive fix though. Here are the three alternatives I can think of:
  1. Move the site to the root of the URL. This will get it mixed up with the rest of my site so I'd rather not do that.
  2. Move it to a completely new URL. This is definitely an option. I could spin up a Cloud server for this pretty easily.
  3. Configure the server so that it serves the directory from the root URL. This would be cheaper than using a Cloud service.
To save myself some time and effort I've decided to go with the second option. I don't think I'll need to keep the site up for long so the cost will be minimal and it'll prevent me causing a mess. Once I'm done I can shut down the site and everything will be as before.

So this is what I've done. I didn't end up spinning up a server but rather copied the files over to an S3 bucket on Amazon Web Services. There's an option to serve static files from an S3 bucket as a website, which is exactly what I need.
 
Using AWS S3 to serve a test copy of the DuckDuckGo site

Testing the site using desktop Firefox shows a much better result than before. It's not perfect: there are still some missing images, but the copy I made to the bucket is the mobile version, so that's to be expected. Nevertheless it makes for a pretty reasonable facsimile of the real DuckDuckGo site.
 
The DDG test site rendered using a desktop browser

But what about if I try it on mobile?

Using ESR 78 it's uncannily similar to the real DuckDuckGo site. Even the menu on the left hand side works. Search suggestions and search itself are of course both broken, but again this is entirely to be expected since these features require dynamic aspects of the site which can't be replicated using an S3 bucket.
 
The same DDG test site rendered using ESR 78 (left) and ESR 91 (right) renderers. They're almost identical, apart from the downward scroll arrow which only shows on ESR 78.

But the real test is ESR 91. When I try it there... I also get a good rendition of the DuckDuckGo site. This is both good and bad. If it had failed to render that would have been ideal, because then I'd immediately have a comparison to work with. But the fact that it works well means I can now compare it with the real version of the site and try to figure out what's different.

It's also worth noting that the results aren't the same. On ESR 78 I can scroll down to view all the "bathroomguy" images telling me how great the site is. On ESR 91 the rendered down arrow is missing and I'm not able to scroll, as you can see in the screenshots if you peer closely.

So, what's the difference? That'll be my task for tomorrow!

This lends more weight to the claim that the problem here is DuckDuckGo serving different pages rather than the ESR 91 renderer or JavaScript engine choking on something. I have a plan for how to test this categorically tomorrow.

13 Jan 2024 : Day 137 #
I'm still trying to get myself a copy of the DuckDuckGo website. To recap the latest situation, I still want a copy I can serve from my personal server, but which triggers the same errors as when accessing DuckDuckGo from the real website using ESR 91.

I feel confident that yesterday I got myself a full verbatim copy of the site. The catch is that it's woven into the logging output from ESR 91. My task today is to disentangle it.

The log file is 1.3 MiB of data. That's not crazy quantities, but it would work better if the text files used line-wrapping, rather than including massively long lines that seemingly go on for ever... my text editor really doesn't like having to deal with them and just hangs up for tens of minutes at a time.

[...]

Nevertheless, and although it took an age, I have managed to get it all done. The file structure is very similar to the one I showed yesterday:
$ tree ddg8/
ddg8/
├── assets
│   ├── logo_homepage.alt.v109.svg
│   ├── logo_homepage.normal.v109.svg
│   └── onboarding
│       ├── arrow.svg
│       └── bathroomguy
│           ├── 1-monster-v2--no-animation.svg
│           ├── 2-ghost-v2.svg
│           ├── 3-bathtub-v2--no-animation.svg
│           ├── 4-alpinist-v2.svg
│           └── teaser-2@2x.png
├── dist
│   ├── b.9e45618547aaad15b744.js
│   ├── d.01ff355796b8725c8dad.js
│   ├── h.2d6522d4f29f5b108aed.js
│   ├── lib
│   │   └── l.656ceb337d61e6c36064.js
│   ├── o.2988a52fdfb14b7eff16.css
│   ├── p.f5b58579149e7488209f.js
│   ├── s.b49dcfb5899df4f917ee.css
│   ├── ti.b07012e30f6971ff71d3.js
│   ├── tl.3db2557c9f124f3ebf92.js
│   └── util
│       └── u.a3c3a6d4d7bf9244744d.js
├── font
│   ├── ProximaNova-ExtraBold-webfont.woff2
│   ├── ProximaNova-Reg-webfont.woff2
│   └── ProximaNova-Sbold-webfont.woff2
├── index.html
├── locale
│   └── en_GB
│       └── duckduckgo85.js
└── post3.html

9 directories, 24 files
And not just that, but in fact the contents are very similar overall:
$ diff -q ddg ddg8/
Only in ddg: 3.html
Common subdirectories: ddg/assets and ddg8/assets
Common subdirectories: ddg/dist and ddg8/dist
Common subdirectories: ddg/font and ddg8/font
Files ddg/index.html and ddg8/index.html differ
Common subdirectories: ddg/locale and ddg8/locale
Only in ddg8/: post3.html
Although the index.html file is quite different to the equivalent one I downloaded earlier, it is similar to a previous one that was downloaded using the python script:
$ diff ddg5/index.html ddg8/index.html 
2,7c2,7
< <!--[if IEMobile 7 ]> <html lang="en-US" class="no-js iem7"> <![endif]-->
< <!--[if lt IE 7]> <html class="ie6 lt-ie10 lt-ie9 lt-ie8 lt-ie7 no-js"
  lang="en-US"> <![endif]-->
< <!--[if IE 7]>    <html class="ie7 lt-ie10 lt-ie9 lt-ie8 no-js" lang="en-US">
  <![endif]-->
< <!--[if IE 8]>    <html class="ie8 lt-ie10 lt-ie9 no-js" lang="en-US">
  <![endif]-->
< <!--[if IE 9]>    <html class="ie9 lt-ie10 no-js" lang="en-US"> <![endif]-->
< <!--[if (gte IE 9)|(gt IEMobile 7)|!(IEMobile)|!(IE)]><!--><html class="no-js"
  lang="en-US" data-ntp-features="tracker-stats-widget:off"><!--<![endif]-->
---
> <!--[if IEMobile 7 ]> <html lang="en-GB" class="no-js iem7"> <![endif]-->
> <!--[if lt IE 7]> <html class="ie6 lt-ie10 lt-ie9 lt-ie8 lt-ie7 no-js"
  lang="en-GB"> <![endif]-->
> <!--[if IE 7]>    <html class="ie7 lt-ie10 lt-ie9 lt-ie8 no-js" lang="en-GB">
  <![endif]-->
> <!--[if IE 8]>    <html class="ie8 lt-ie10 lt-ie9 no-js" lang="en-GB">
  <![endif]-->
> <!--[if IE 9]>    <html class="ie9 lt-ie10 no-js" lang="en-GB"> <![endif]-->
> <!--[if (gte IE 9)|(gt IEMobile 7)|!(IEMobile)|!(IE)]><!--><html class="no-js"
  lang="en-GB" data-ntp-features="tracker-stats-widget:off"><!--<![endif]-->
48,49c48,49
<       <title>DuckDuckGo — Privacy, simplified.</title>
< <meta property="og:title" content="DuckDuckGo — Privacy, simplified." />
---
>       <title>DuckDuckGo â Privacy, simplified.</title>
> <meta property="og:title" content="DuckDuckGo â Privacy, simplified." />
64c64
< <script type="text/javascript" src="/locale/en_US/duckduckgo14.js"
  onerror="handleScriptError(this)"></script>
---
> <script type="text/javascript" src="/locale/en_GB/duckduckgo85.js"
  onerror="handleScriptError(this)"></script>
107c107
<                                               <!-- en_US All Settings -->
---
>                                               <!-- en_GB All Settings -->
146a147,148
> 
> 
As you can see, the only real differences are the switch from en-US to en-GB, a one-character difference in the title of the page and the name of the locale file.

The result is also the same when viewing the page with either ESR 78 or the desktop browser: just a blank page.

Once again we find ourselves in an unsatisfactory position. But I will persevere and we will get to the bottom of this!

The next step is to check the network output from opening the page in the browser. And there's something important in there! There are many entries that look a bit like this:
[ Request details ------------------------------------------- ]
    Request: GET status: 404 Not Found
    URL: https://www.flypig.co.uk/dist/o.2988a52fdfb14b7eff16.css
And now the penny drops: the page is expecting to be in the root of the domain. So while it's expecting to find the file at this location:
https://www.flypig.co.uk/dist/o.2988a52fdfb14b7eff16.css
I've instead been storing the file in this location:
https://www.flypig.co.uk/tests/ddg8/dist/o.2988a52fdfb14b7eff16.css
Checking inside the index.html file the reason is clear. The paths are being given as absolute paths from the root of the domain, with a preceding slash, like this:
<link rel="stylesheet" href="/dist/o.2988a52fdfb14b7eff16.css" type="text/css">
That / in front of the dist is causing all the trouble. It frustrates me that I didn't notice this before. But at least now I have something clear to fix. That'll be my task for tomorrow. Thankfully it should be really easy to fix. I feel a bit silly now.
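
As a rough sketch of one possible fix (not necessarily the approach I'll end up using), a few lines of Python could rewrite the root-absolute href and src attributes so that they point inside the test directory instead. The /tests/ddg8 prefix is taken from the URL above; the function name and the regular expression are purely illustrative, and paths embedded inside the JavaScript or CSS files themselves wouldn't be touched by this.

```python
import re

# Prefix under which the copied site is actually stored (from the URL above)
PREFIX = '/tests/ddg8'

# Matches href="/... or src="/... but not protocol-relative "//host/..." URLs
ROOT_ABSOLUTE = re.compile(r'''\b(href|src)=(["'])/(?!/)''')

def rebase_paths(html, prefix=PREFIX):
    """Prepend prefix to root-absolute href/src attribute values."""
    def replace(match):
        # Rebuild the attribute start with the prefix inserted before the slash
        return '{}={}{}/'.format(match.group(1), match.group(2), prefix)
    return ROOT_ABSOLUTE.sub(replace, html)

line = '<link rel="stylesheet" href="/dist/o.2988a52fdfb14b7eff16.css" type="text/css">'
print(rebase_paths(line))
```

The alternative, which needs no rewriting at all, would be to serve the copy from the root of a domain so that the absolute paths resolve as they are.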

12 Jan 2024 : Day 136 #
Downloading all of the files served by DuckDuckGo individually didn't work: I end up with a site that simply triggers multiple "404 Not Found" errors due to files being requested but not available.

But I'm not giving up on this approach just yet. On the Sailfish Forum attah today made a nice suggestion in relation to this:
 
Finally remembering to post thoughts i have accumulated after the last weeks of following the blog: Have you tried with a wildly different User Agent override for DucDuckGo, like iPhone or something? The hanging parallel compile - could that be related to some syscall that gets used in synchronization, but which is stubbed in sb2?

There are actually two points here, the first of which relates to DuckDuckGo and the second of which relates to the issue of the build hanging when using more than one process using scratchbox2, part of the Sailfish SDK. Let me leave the compile query to one side for now, because, although it's a good point, I unfortunately don't know the answer (but it sounds like an interesting point to investigate).

Going back to DuckDuckGo, so far I've tried the ESR 78 user agent and the Firefox user agent, but I admit I've not tried anything else. It's a good idea — thank you attah — definitely worth trying. So let's see what happens.

I don't have an iPhone to compare with, but of course there are plenty of places on the Web that claim to list its user agent string. I'm going to use this one from the DeviceAtlas blog:
Mozilla/5.0 (iPhone12,1; U; CPU iPhone OS 13_0 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/15E148 Safari/602.1
I used the Python script from yesterday to download the files twice, first using the ESR 78 user agent (stored to the ddg-esr78 directory) and then again using the iPhone user agent above (stored to the ddg-iphone12 directory). Each directory contains 34 entries and here's what I get when I diff them:
$ find ddg-esr78/ | wc -l
34
$ find ddg-iphone12 | wc -l
34
$ diff --brief ddg-esr78/ ddg-iphone12/
Common subdirectories: ddg-esr78/assets and ddg-iphone12/assets
Common subdirectories: ddg-esr78/font and ddg-iphone12/font
Common subdirectories: ddg-esr78/ist and ddg-iphone12/ist
Common subdirectories: ddg-esr78/locale and ddg-iphone12/locale
So the resulting downloads are identical. That's too bad (although also a little reassuring). It's hard not to conclude that the user agent isn't the important factor here then. Nevertheless, I'm still concerned that I'm not getting the right files when I download using this Python script. If the problem is that DuckDuckGo is recognising a different browser when I download the files with my Python script (even if I've set the User Agent string to match) then the solution will have to be to download the files with the Sailfish Browser itself. It could be another issue entirely, but, well, this is a process of elimination.
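
One caveat worth flagging: without the -r flag, diff only compares the files at the top level of the two directories and simply reports the subdirectories as being common. A recursive, byte-for-byte check could be sketched in Python using the standard filecmp module. The function name is my own invention, so treat this as an illustration rather than a polished tool.

```python
import filecmp
import os

def tree_differences(left, right):
    """Recursively list paths that differ or exist on only one side."""
    differences = []

    def walk(cmp, rel):
        for name in cmp.left_only:
            differences.append(('left only', os.path.join(rel, name)))
        for name in cmp.right_only:
            differences.append(('right only', os.path.join(rel, name)))
        # Compare file contents rather than relying on os.stat() signatures
        _, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False)
        for name in mismatch + errors:
            differences.append(('differs', os.path.join(rel, name)))
        # Descend into the subdirectories that exist on both sides
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(rel, name))

    walk(filecmp.dircmp(left, right), '')
    return sorted(differences)
```

Calling tree_differences('ddg-esr78', 'ddg-iphone12') would return an empty list only if every file in every subdirectory matched.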

I already have the means to do this, in theory. The EMBED_CONSOLE="network" setting gives a preview of any text files it downloads. But by default that's restricted to showing the first 32 KiB of data. That's not enough for everything and some files get truncated. So I've spent a bit of time improving the output.

First I've increased this value to 32 MiB. In practice I really want it to be set to have no limit, but 32 MiB should be more than enough (and if it isn't it should be obvious and can easily be bumped up). But when I first wrote this component I was always disappointed that the request and response headers could be output at a different time to the document content. That meant that it wasn't always possible to tie the content to the request headers (and in particular, the URL the content was downloaded from).

The reason the two can get separated is that the headers are output as soon as they've been received. And the content is output as soon as it's received in full. But between the headers being output and the content being received it's quite possible for some smaller file to be received in full. In this case, the smaller file would get printed out between the headers and content of the larger file.

My solution has been to store a copy of the URL in the content receiver callback object. That way, the URL can be output at the same time as the content. Now the headers and the content can be tied together since the URL is output with them both.
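
To illustrate the idea (this is not the code of the actual component, just a simplified Python sketch with hypothetical class and method names), each receiver object keeps hold of its URL from the moment it's created. When its content finally completes, the URL is output alongside it, no matter what else was logged in the meantime.

```python
class ContentReceiver:
    """Collects the body of one response and remembers its URL."""

    def __init__(self, url, output):
        self.url = url
        self.output = output
        self.chunks = []

    def on_data(self, chunk):
        # Data may arrive in several pieces, interleaved with other downloads
        self.chunks.append(chunk)

    def on_complete(self):
        # The URL is output together with the content so the two stay paired
        self.output.append('[ Document URL ]')
        self.output.append('    URL: {}'.format(self.url))
        self.output.append('[ Document content ]')
        self.output.append(''.join(self.chunks))

log = []
large = ContentReceiver('https://example.com/large.js', log)
small = ContentReceiver('https://example.com/small.js', log)
large.on_data('part one, ')
small.on_data('tiny')
small.on_complete()  # the smaller file finishes first
large.on_data('part two')
large.on_complete()
print('\n'.join(log))
```

Even though the smaller file completes in the middle of the larger download, each block of content in the log is immediately preceded by the URL it came from.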

Here's an example (slightly abridged to keep things manageable):
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/dist/p.f5b58579149e7488209f.js
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (Mobile; rv:78.0) Gecko/78.0 Firefox/78.0
        Accept : */*
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Referer : https://duckduckgo.com/
        Connection : keep-alive
        TE : Trailers
    [ Response headers -------------------------------------- ]
        server : nginx
        date : Wed, 10 Jan 2024 22:27:17 GMT
        content-type : application/x-javascript
        content-length : 157
        last-modified : Fri, 27 Oct 2023 12:03:07 GMT
        vary : Accept-Encoding
        etag : "653ba6fb-9d"
        content-encoding : br
        strict-transport-security : max-age=31536000
        permissions-policy : interest-cohort=()
        content-security-policy : [...] ;
        x-frame-options : SAMEORIGIN
        x-xss-protection : 1;mode=block
        x-content-type-options : nosniff
        referrer-policy : origin
        expect-ct : max-age=0
        expires : Thu, 09 Jan 2025 22:27:17 GMT
        cache-control : max-age=31536000
        vary : Accept-Encoding
        X-Firefox-Spdy : h2
    [ Document URL ------------------------------------------ ]
        URL: https://duckduckgo.com/dist/p.f5b58579149e7488209f.js
        Charset: 
        ContentType: application/x-javascript
    [ Document content -------------------------------------- ]
function post(t){if(t.source===parent&&t.origin===location.protocol+"//"+
    location.hostname&&"string"==typeof t.data){var o=t.data.indexOf(":"),
    a=t.data.substr(0,o),n=t.data.substr(o+1);"ddg"===a&&(parent.window.
    location.href=n)}}window.addEventListener&&window.addEventListener
    ("message",post,!1);
    [ Document content ends --------------------------------- ]
Notice how the actual document content (which is only a few lines of text in this case) is right at the end. But directly beforehand the URL is output, which as a result can now be tied to the URL at the start of the request.

After downloading the mobile version of DuckDuckGo using the ESR 78 engine and these changes I can see they've made a difference when I compare the previous and newly collected data:
$ ls -lh ddg-urls-esr78-mobile-*.txt
-rw-rw-r-- 1 flypig flypig 2.5K Jan  8 19:04 ddg-urls-esr78-mobile-01.txt
-rw-rw-r-- 1 flypig flypig 1.3M Jan 10 22:27 ddg-urls-esr78-mobile-02.txt
Previously 2.5 KiB of data was collected, but with these changes that goes up to 1.3 MiB.

The log file is a bit unwieldy, but it should hopefully contain all the data we need: every bit of textual data that was downloaded. Tomorrow I'll try to disentangle the output and turn it back into files. With a bit of luck, I'll end up with a working copy of the DuckDuckGo site (famous last words!).
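
For what it's worth, the disentangling could in principle be automated. Here's a minimal sketch of a parser, assuming the log uses exactly the section markers shown in the example above ("Document URL", "Document content" and "Document content ends"). The function name is hypothetical, and the sketch would be fooled by any downloaded file that happens to contain one of these markers itself.

```python
def extract_documents(lines):
    """Map each document URL in the log to the text logged as its content."""
    documents = {}
    url = None
    content = None  # None while outside a content section
    in_url_section = False
    for line in lines:
        marker = line.strip()
        if marker.startswith('[ Document URL'):
            in_url_section = True
        elif marker.startswith('[ Document content ends'):
            # End of a document: store it if a URL was seen for it
            if url is not None and content is not None:
                documents[url] = ''.join(content)
            url, content = None, None
        elif marker.startswith('[ Document content'):
            in_url_section = False
            content = []
        elif content is not None:
            # Inside a content section: keep the line exactly as logged
            content.append(line)
        elif in_url_section and marker.startswith('URL:'):
            url = marker[len('URL:'):].strip()
    return documents
```

Passing an open file object as lines would give a dictionary that could then be written out to disk, one file per URL.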

11 Jan 2024 : Day 135 #
After collecting together and comparing the lists of files downloaded yesterday, today I'm actually downloading those files from the server.

I've created a very simple Python script that will take each line from the output, then reconstruct a local copy of each of the files, using the same relative directory hierarchy. The script is short and simple enough to show here in full.
#!/bin/python3

import os
import urllib.request

BASE_DIR = 'ddg'
USER_AGENT = 'Mozilla/5.0 (Android 8.1.0; Mobile; rv:78.0) Gecko/78.0 Firefox/78.8'

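SITE_PREFIX = 'https://duckduckgo.com/'

def split_url(url):
	url = url.rstrip()
	# Remove the site prefix. Note that str.lstrip() would instead strip any
	# leading characters found in its argument, turning "dist/" into "ist/"
	# and "post3.html" into "3.html"
	path = url[len(SITE_PREFIX):] if url.startswith(SITE_PREFIX) else url
	path, leaf = os.path.split(path)
	# A URL ending in a slash has no leafname, so treat it as the index page
	leaf = 'index.html' if not leaf else leaf
	path = os.path.join(BASE_DIR, path)
	filename = os.path.join(path, leaf)
	return path, filename, url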

def make_dir(directory):
	print('Dir: {}'.format(directory))
	os.makedirs(directory, exist_ok=True)

def download_file(url, filename):
	print('URL: {}'.format(url))
	print('File: {}'.format(filename))
	opener = urllib.request.build_opener()
	opener.addheaders = [('User-agent', USER_AGENT)]
	urllib.request.install_opener(opener)
	urllib.request.urlretrieve(url, filename)

with open('download.txt') as fp:
	for line in fp:
		directory, filepath, url = split_url(line)
		make_dir(directory)
		download_file(url, filepath)
It really is a very linear process; they don't get much simpler than this. All it does is read in a file line by line. Each line is interpreted as a URL. For example suppose the line was the following:
https://duckduckgo.com/dist/lib/l.656ceb337d61e6c36064.js
Then the script will extract the directory dist/lib/, create a local directory ddg/dist/lib/, then download the file from the URL and save it in the directory with the filename l.656ceb337d61e6c36064.js.
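
The URL-to-path mapping can also be sketched using urllib.parse, which avoids doing string manipulation on the URL prefix altogether and drops any query string along the way. The function name local_path and its base_dir parameter are just for illustration.

```python
import os
from urllib.parse import urlsplit

def local_path(url, base_dir='ddg'):
    """Return (directory, filename) under base_dir mirroring the URL's path."""
    # Take only the path component, without the leading slash
    path = urlsplit(url.strip()).path.lstrip('/')
    directory, leaf = os.path.split(path)
    # A path ending in a slash has no leafname, so use index.html
    leaf = leaf or 'index.html'
    directory = os.path.join(base_dir, directory)
    return directory, os.path.join(directory, leaf)

print(local_path('https://duckduckgo.com/dist/lib/l.656ceb337d61e6c36064.js'))
```

Here the directory part is what would need creating locally and the filename part is where the download would be saved.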

We'll end up with a directory structure that should match the root directory structure of DuckDuckGo:
$ tree ddg
ddg
├── 3.html
├── assets
│   ├── logo_homepage.alt.v109.svg
│   ├── logo_homepage.normal.v109.svg
│   └── onboarding
│       ├── arrow.svg
│       └── bathroomguy
│           ├── 1-monster-v2--no-animation.svg
│           ├── 2-ghost-v2.svg
│           ├── 3-bathtub-v2--no-animation.svg
│           ├── 4-alpinist-v2.svg
│           └── teaser-2@2x.png
├── dist
│   ├── b.9e45618547aaad15b744.js
│   ├── d.01ff355796b8725c8dad.js
│   ├── h.2d6522d4f29f5b108aed.js
│   ├── lib
│   │   └── l.656ceb337d61e6c36064.js
│   ├── o.2988a52fdfb14b7eff16.css
│   ├── p.f5b58579149e7488209f.js
│   ├── s.b49dcfb5899df4f917ee.css
│   ├── ti.b07012e30f6971ff71d3.js
│   ├── tl.3db2557c9f124f3ebf92.js
│   └── util
│       └── u.a3c3a6d4d7bf9244744d.js
├── font
│   ├── ProximaNova-ExtraBold-webfont.woff2
│   ├── ProximaNova-Reg-webfont.woff2
│   └── ProximaNova-Sbold-webfont.woff2
├── index.html
└── locale
    └── en_GB
        └── duckduckgo85.js

9 directories, 24 files
The intention is that this will make a verbatim copy of all the files that the browser used when rendering the page. Unfortunately servers don't always serve the same file every time, but to try to avoid it serving up the wrong file, I've also set the user agent to be the same as for ESR 78.

That's no guarantee that the server will identify us as that — servers use all sorts of nasty tricks to try to identify misidentified browsers — but it's probably the best we can reasonably do.

Once I've got a local copy of the site structure I copy this over to my server and get the browser to render it.

But unfortunately without success. For reasons I can't figure out, when I attempt to open the page, the browser requests a wholly different set of files to download. And not just different leafnames, but a totally different file structure as well. So rather than it downloading the files I've collected together, I just get a bunch of "404 File Not Found" errors.

Frustrating. But the nature of me writing this up daily is that I can't just summarise all the things that work. As anyone who's been following along will no doubt have noticed by now, often things I try just don't work. But from the comments I've been getting from others, it's reassuring to know it's not just me. Sometimes failure is still progress.

Maybe I'll have better luck with a new approach tomorrow.

10 Jan 2024 : Day 134 #
I was getting a little despondent trying to fix DuckDuckGo on ESR 91, but the tests I performed yesterday have invigorated me somewhat. It's now clear that there's a decent quantity of accesses being made when using ESR 91 and that wouldn't be happening unless at least some of the data was being interpreted as HTML.

But it's also clear that not everything is as it should be: on ESR 78 there's a mixture of SVG and PNG files being downloaded to provide the images on the page. In contrast, on ESR 91, there are no images being downloaded at all. There are two possible reasons for this I can think of:
  1. DuckDuckGo is serving pages that don't contain any images. Seems unlikely, but nevertheless a possibility.
  2. There are images but ESR 91 isn't turning them into access requests. I have no idea why this might happen, yet it still feels the more likely of the two scenarios.
I think it would be useful to know how much overlap there is between the sets of files that are being downloaded. So I put together a command line monstrosity to compare them. Before giving the final command, let me break down what it's doing.

First, there are some extraneous accesses in the list that have nothing to do with DuckDuckGo but are a consequence of gecko collecting settings data after the profile was wiped; lines like this:
https://firefox.settings.services.mozilla.com/v1/buckets/monitor/collections/
    changes/records?collection=hijack-blocklists&bucket=main
So the command filters out all of the lines that don't include duckduckgo.

Second, I noticed that some files appear to be the same but located at different URLs. For example, this can be found in the ESR 78 list:
https://duckduckgo.com/font/ProximaNova-Reg-webfont.woff2
While this can be found in the ESR 91 list:
https://duckduckgo.com/static-assets/font/ProximaNova-Reg-webfont.woff2
These are both font files; they must surely be the same file, right? But they're located in different folders. So the command then strips the URLs down to the leafnames.

Third, it then sorts the results alphabetically so that any identical lines in both files will appear in the same order. If there are any matching lines, this will make any diff of the two much cleaner.

Finally the command then performs a side-by-side diff on the result. Here's all of that put together:
$ diff --side-by-side \
    <(sed -e 's/^.*\/\(.*\)/\1/g' <(grep "duckduckgo" log1.txt) | sort) \
    <(sed -e 's/^.*\/\(.*\)/\1/g' <(grep "duckduckgo" log2.txt) | sort)
When I came up with this approach I thought it would give amazing results, but in practice it's not as exciting as I was hoping for.
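
For anyone who finds the shell version hard to read, the same steps (filter, strip to the leafname, compare) can be sketched in Python. The function names are my own. Unlike the diff, this uses sets, so duplicate leafnames collapse into one and the result is three lists rather than a side-by-side view.

```python
def leafnames(lines, keyword='duckduckgo'):
    """Return the set of leafnames of the URLs that mention keyword."""
    # Keep only matching lines, then take everything after the final slash
    return {line.strip().rsplit('/', 1)[-1] for line in lines if keyword in line}

def compare(lines_a, lines_b):
    """Return (shared, only_a, only_b) as sorted lists of leafnames."""
    a, b = leafnames(lines_a), leafnames(lines_b)
    return sorted(a & b), sorted(a - b), sorted(b - a)
```

To use it, the two URL lists would be opened as files and passed in as lines_a and lines_b.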

Let's concentrate on the mobile version of the page first. Here's the diff:
$ diff --side-by-side \
    <(sed -e 's/^.*\/\(.*\)/\1/g' <(grep "duckduckgo" ddg-urls-esr78-mobile.txt) | sort) \
    <(sed -e 's/^.*\/\(.*\)/\1/g' <(grep "duckduckgo" ddg-urls-esr91-mobile.txt) | sort)

1-monster-v2--no-animation.svg             | 18040-1287342b1f839f70.js
2-ghost-v2.svg                             | 38407-070351ade350c8e4.js
3-bathtub-v2--no-animation.svg             | 39337-cd8caeeff0afb1c4.js
4-alpinist-v2.svg                          | 41966-c9d76895b4f9358f.js
arrow.svg                                  | 55015-29fec414530c2cf6.js
b.9e45618547aaad15b744.js                  | 55672-19856920a309aea5.js
d.01ff355796b8725c8dad.js                  | 61754-29df12bb83d71c7b.js
duckduckgo85.js                            | 6a4833195509cc3d.css
h.2d6522d4f29f5b108aed.js                  | 703c9a9a057785a9.css
hi?7857271&b=firefox&ei=true&i=false&[...] | 81125-b74d1b6f4908497b.js
l.656ceb337d61e6c36064.js                  | 93432-ebd443fe69061b19.js
logo_homepage.alt.v109.svg                 | 94623-d5bfa67fc3bada59.js
logo_homepage.normal.v109.svg              | 95665-30dd494bea911abd.js
o.2988a52fdfb14b7eff16.css                 | a2a29f84956f2aac.css
p.f5b58579149e7488209f.js                  | _app-ce0b94ea69138577.js
post3.html                                 | _buildManifest.js
                                           > c89114cfe55133c4.css
                                           > ed8494aa71104fdc.css
                                           > f0b3f7da285c9dbd.css
                                           > framework-f8115f7fae64930e.js
                                           > home-34dda07336cb6ee1.js
                                           > main-17a05b704438cdd6.js
                                           > ProximaNova-Bold-webfont.woff2
ProximaNova-ExtraBold-webfont.woff2          ProximaNova-ExtraBold-webfont.woff2
                                           > ProximaNova-RegIt-webfont.woff2
ProximaNova-Reg-webfont.woff2                ProximaNova-Reg-webfont.woff2
ProximaNova-Sbold-webfont.woff2              ProximaNova-Sbold-webfont.woff2
s.b49dcfb5899df4f917ee.css                 | _ssgManifest.js
teaser-2@2x.png                            | webpack-7358ea7cdec0aecf.js
ti.b07012e30f6971ff71d3.js                 <
tl.3db2557c9f124f3ebf92.js                 <
u.a3c3a6d4d7bf9244744d.js                  <
Only three files are shared across the two collections. The remaining 22 and 27 files respectively are apparently different. I was honestly hoping for there to be more similarity.

For completeness let's do the same for the desktop collections:
$ diff --side-by-side \
    <(sed -e 's/^.*\/\(.*\)/\1/g' <(grep "duckduckgo" ddg-urls-esr78-desktop.txt) | sort) \
    <(sed -e 's/^.*\/\(.*\)/\1/g' <(grep "duckduckgo" ddg-urls-esr91-d.txt) | sort)

18040-1287342b1f839f70.js                    18040-1287342b1f839f70.js
38407-070351ade350c8e4.js                    38407-070351ade350c8e4.js
39337-cd8caeeff0afb1c4.js                    39337-cd8caeeff0afb1c4.js
41966-c9d76895b4f9358f.js                    41966-c9d76895b4f9358f.js
48292.8c8d6cb394d25a15.js                  <
55015-29fec414530c2cf6.js                    55015-29fec414530c2cf6.js
55672-19856920a309aea5.js                    55672-19856920a309aea5.js
61754-29df12bb83d71c7b.js                    61754-29df12bb83d71c7b.js
6a4833195509cc3d.css                         6a4833195509cc3d.css
703c9a9a057785a9.css                         703c9a9a057785a9.css
81125-b74d1b6f4908497b.js                    81125-b74d1b6f4908497b.js
93432-ebd443fe69061b19.js                    93432-ebd443fe69061b19.js
94623-d5bfa67fc3bada59.js                    94623-d5bfa67fc3bada59.js
95665-30dd494bea911abd.js                    95665-30dd494bea911abd.js
a2a29f84956f2aac.css                         a2a29f84956f2aac.css
add-firefox.f0890a6c.svg                   <
_app-ce0b94ea69138577.js                     _app-ce0b94ea69138577.js
app-protection-back-dark.png               <
app-protection-front-dark.png              <
app-protection-ios-dark.png                <
app-store.501fe17a.png                     <
atb_home_impression?9836955&b=firefox[...] <
_buildManifest.js                            _buildManifest.js
burn@2x.be0bd36d.png                       <
c89114cfe55133c4.css                         c89114cfe55133c4.css
chrome-lg.a4859fb2.png                     <
CNET-DARK.e3fd496e.png                     <
dark-mode@2x.3e150d01.png                  <
devices-dark.png                           <
ed8494aa71104fdc.css                         ed8494aa71104fdc.css
edge-lg.36af7682.png                       <
email-protection-back-dark.png             <
email-protection-front-light.png           <
email-protection-ios-dark.png              <
f0b3f7da285c9dbd.css                         f0b3f7da285c9dbd.css
firefox-lg.8efad702.png                    <
flame.1241f020.png                         <
flame@2x.40e1cfa0.png                      <
flame-narrow.70589b7c.png                  <
framework-f8115f7fae64930e.js                framework-f8115f7fae64930e.js
home-34dda07336cb6ee1.js                     home-34dda07336cb6ee1.js
legacy-homepage-btf-dark.png               <
legacy-homepage-btf-mobile-dark.png        <
macos.61889438.png                         <
main-17a05b704438cdd6.js                     main-17a05b704438cdd6.js
night@2x.4ca79636.png                      <
opera-lg.237c4418.png                      <
page_home_commonImpression?2448534&[...]   <
play-store.e5d5ed36.png                    <
ProximaNova-Bold-webfont.woff2               ProximaNova-Bold-webfont.woff2
ProximaNova-ExtraBold-webfont.woff2          ProximaNova-ExtraBold-webfont.woff2
ProximaNova-RegIt-webfont.woff2              ProximaNova-RegIt-webfont.woff2
ProximaNova-Reg-webfont.woff2                ProximaNova-Reg-webfont.woff2
ProximaNova-Sbold-webfont.woff2              ProximaNova-Sbold-webfont.woff2
safari-lg.8406694a.png                     <
search-protection-back-light.png           <
search-protection-front-dark.png           <
search-protection-ios-dark.png             <
set-as-default.d95c3465.svg                <
_ssgManifest.js                              _ssgManifest.js
UT-DARK-DEFAULT.6cd0020d.png               <
VERGE-DARK-DEFAULT.8850a2d2.png            <
webpack-7358ea7cdec0aecf.js                  webpack-7358ea7cdec0aecf.js
web-protection-back-dark.png               <
web-protection-front-dark.png              <
web-protection-ios-dark.png                <
widget-big@2x.a260ccf6.png                 <
widget-small@2x.07c865df.png               <
windows.477fa143.png                       <
WIRED-DARK-DEFAULT.b4d48a49.png            <
Here we see something more interesting. There are 30 files the same across the two collections with 43 and 0 files being different respectively. In other words, the ESR 91 collection is a subset of the ESR 78 collection.

There might be something in this, but initially I'm more interested in the mobile version of the site and there the overlap is far less. However, now that I have the URLs, one thing I can do is download all of the files and try to use them to recreate the site on my own server. It's possible this will give better results than saving out the files from the desktop browser, so I'll be giving that a go tomorrow.

9 Jan 2024 : Day 133 #
It's time to pick up from where we left off yesterday, hanging on a couple of breakpoints on two different phones, one running ESR 78 and the other running ESR 91.

It feels weird leaving the phones in this state of limbo overnight. Astonishingly, even though I put my laptop to sleep, the SSH connections weren't dropped. Returning to discover things in exactly the same state I left them in is disconcerting. But in a good way.

Before I get on to the next steps I also want to take the time to acknowledge the help of TrickyPR who sent this helpful message via Mastodon:
 
my knowledge of the fancy debugger is a lot more modern, but here are some tips. For the server:
  • make sure you are building devtools/ (you probably already are)
  • you can launch the devtools server through js: https://github.com/ajvincent/motherhen/blob/main/cleanroom/source/modules/DevtoolsServer.jsm
  • you will have better luck with about:debugging on a similar version browser
client (browser toolbox):
  • you need to set MOZ_DEVTOOLS to all
  • you will need to implement a command line handler for -chrome (you can steal this, but it’s esm (needs esr +100?): https://github.com/ajvincent/motherhen/pull/34 ) Feel free to ping if you want clarification or get stuck.

This is really helpful. I've not had a chance to look into the remote debugging (I think it'll be something for the weekend) so please bear with me, but it looks like this info will definitely make for useful reference material.

To recap, yesterday we were stepping through nsDisplayBackgroundImage::AppendBackgroundItemsToTop() because we know that on ESR 78 we eventually reach a call to nsDisplayBackgroundImage::GetInitData(), whereas on ESR 91 this method is never called. The two versions of AppendBackgroundItemsToTop() have changed between ESR 78 and ESR 91, but they're close enough to be able to follow them both side-by-side.

But stepping through didn't work as well as I'd hoped because the first time the breakpoint is hit on both devices, the method returns early in both cases. The AppendBackgroundItemsToTop() method gets called multiple times during a render, so it's not surprising that in some cases it returns early and in others it goes all the way through to GetInitData(). I need a call on ESR 78 that goes all the way to GetInitData() and (here's the tricky bit) I need the same call on ESR 91 that returns early.

To get closer to this situation I'm going to move the breakpoints closer to the call to GetInitData() on both systems.

Here's the relevant bit of code, which is pretty similar on both ESR 78 and ESR 91, but this copy happens to come from ESR 91:
  if (!bg || !drawBackgroundImage) {
    if (!bgItemList.IsEmpty()) {
      aList->AppendToTop(&bgItemList);
      return AppendedBackgroundType::Background;
    }

    return AppendedBackgroundType::None;
  }

  const ActiveScrolledRoot* asr = aBuilder->CurrentActiveScrolledRoot();

  bool needBlendContainer = false;

  // Passing bg == nullptr in this macro will result in one iteration with
  // i = 0.
  NS_FOR_VISIBLE_IMAGE_LAYERS_BACK_TO_FRONT(i, bg->mImage) {
    if (bg->mImage.mLayers[i].mImage.IsNone()) {
      continue;
    }
[...]

    nsDisplayList thisItemList;
    nsDisplayBackgroundImage::InitData bgData =
        nsDisplayBackgroundImage::GetInitData(aBuilder, aFrame, i, bgOriginRect,
                                              bgSC);
I've chopped a bit out, but there's no way for the method to return in the part I've removed, so the bits shown here are the important pieces for our purposes today.

The following line is the last statement before the loop around the image layers is entered.
  bool needBlendContainer = false;
If I place a breakpoint on this line, on ESR 91 the breakpoint remains untouched:
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep n   0x0000007fbc3511d4 in nsDisplayBackgroundImage::
        GetInitData(nsDisplayListBuilder*, nsIFrame*, unsigned short,
        nsRect const&, mozilla::ComputedStyle*)
        at layout/painting/nsDisplayList.cpp:3409
2       breakpoint     keep y   0x0000007fbc3a8730 in nsDisplayBackgroundImage::
        AppendBackgroundItemsToTop(nsDisplayListBuilder*, nsIFrame*,
        nsRect const&, nsDisplayList*, bool, mozilla::ComputedStyle*,
        nsRect const&, nsIFrame*,
        mozilla::Maybe<nsDisplayListBuilder::AutoBuildingDisplayList>*) 
        at layout/painting/nsDisplayList.cpp:3632
        breakpoint already hit 2 times
(gdb) handle SIGPIPE nostop
Signal        Stop      Print   Pass to program Description
SIGPIPE       No        Yes     Yes             Broken pipe
(gdb) disable break 2
(gdb) break nsDisplayList.cpp:3766
Breakpoint 3 at 0x7fbc3a9344: file layout/painting/nsDisplayList.cpp, line 3766.
(gdb) c
[...]
So the method must be being exited — in all cases — before this line occurs. From my recollection of stepping through the method earlier I'm pretty sure we reached this condition:
  if (!bg || !drawBackgroundImage) {
So I'll put a breakpoint there and try again.
(gdb) disable break 3
(gdb) break nsDisplayList.cpp:3755
Breakpoint 4 at 0x7fbc3a8dac: file layout/painting/nsDisplayList.cpp, line 3755.
(gdb) c
Continuing.
[New LWP 28203]
[LWP 27764 exited]
[LWP 28203 exited]
[Switching to LWP 31016]

Thread 10 "GeckoWorkerThre" hit Breakpoint 4, nsDisplayBackgroundImage::
    AppendBackgroundItemsToTop (aBuilder=0x7f9efa2378, aFrame=0x7f803c9eb8, 
    aBackgroundRect=..., aList=0x7f9ef9ff08, aAllowWillPaintBorderOptimization=
    <optimized out>, aComputedStyle=<optimized out>, aBackgroundOriginRect=...,
    aSecondaryReferenceFrame=0x0, aAutoBuildingDisplayList=0x0)
    at layout/painting/nsDisplayList.cpp:3755
3755      if (!bg || !drawBackgroundImage) {
(gdb) 

Now it hits. Let's try putting some breakpoints on the return statements.
(gdb) disable break 4
(gdb) break nsDisplayList.cpp:3758
Note: breakpoint 3 (disabled) also set at pc 0x7fbc3a9344.
Breakpoint 5 at 0x7fbc3a9344: file layout/painting/nsDisplayList.cpp, line 3766.
(gdb) break nsDisplayList.cpp:3761
Note: breakpoints 3 (disabled) and 5 also set at pc 0x7fbc3a9344.
Breakpoint 6 at 0x7fbc3a9344: file layout/painting/nsDisplayList.cpp, line 3766.
(gdb) 
That's not helpful: the debugger won't let me place a breakpoint on either of these lines, presumably because after compilation to machine code it's no longer possible to distinguish between the lines properly.

However, if I leave the breakpoint on the start of the condition I notice that during rendering the breakpoint is hit precisely four times on ESR 91.

In contrast, when rendering the page on ESR 78 the breakpoint is hit 108 times. That's a red flag, because it really suggests that there are a lot of background images being rendered on ESR 78, and almost no attempt to even render backgrounds on ESR 91.

What's more, on ESR 91, even though we can't place a breakpoint on the return lines, we can infer the value that's being returned by looking at the values of the variables going in to the condition. Here's the condition again on ESR 91 for reference:
  if (!bg || !drawBackgroundImage) {
    if (!bgItemList.IsEmpty()) {
      aList->AppendToTop(&bgItemList);
      return AppendedBackgroundType::Background;
    }

    return AppendedBackgroundType::None;
  }
And here are the values the debugger will give us access to:
Thread 10 "GeckoWorkerThre" hit Breakpoint 9, nsDisplayBackgroundImage::
    AppendBackgroundItemsToTop (aBuilder=0x7f9efa2378, aFrame=0x7f80b5fa48, 
    aBackgroundRect=..., aList=0x7f9ef9ff08, aAllowWillPaintBorderOptimization=
    <optimized out>, aComputedStyle=<optimized out>, aBackgroundOriginRect=...,
    aSecondaryReferenceFrame=0x0, aAutoBuildingDisplayList=0x0)
    at layout/painting/nsDisplayList.cpp:3755
3755      if (!bg || !drawBackgroundImage) {
(gdb) p bg
$8 = <optimized out>
(gdb) p drawBackgroundImage
$9 = false
(gdb) p bgItemList.mLength
$19 = 0
(gdb) c
Continuing.
I get the same results for all four cases. Working through the logic: drawBackgroundImage is false, so !drawBackgroundImage evaluates to true, and hence (!bg || !drawBackgroundImage) is true regardless of the value of bg, so the outer condition is entered. At that point we know that bgItemList has no members, so !bgItemList.IsEmpty() is false and the nested condition isn't entered.

The method will then return with a value of AppendedBackgroundType::None.

In contrast we know that on ESR 78 execution gets beyond this condition in order to actually render background images, just as we would expect for any relatively complex page.

Given all of the above it looks to me very much like the problem isn't happening in the render loop. Rendering is happening, there just isn't anything to render. There could be multiple reasons for this. One possibility is that the page is being received but somehow lost or dropped. The other is that the page isn't being sent in full. It could be JavaScript related.

To return to something discussed earlier, this behaviour distinguishes the problem from the issue we experience with Amazon on ESR 78. In that case, rendering most definitely occurs because we briefly see a copy of the page. The page visibly disappears in front of our eyes. In that case we'd expect there to be many breakpoint hits with rendering of multiple background images all prior to the page disappearing.

To make more sense of what's going on I need to go back down to the other end of the stack: the networking end. I realised after some thought that the experiments I ran earlier with EMBED_CONSOLE="network" set were not very effective. What I really want is a list of all the network access requests made from the browser in order to understand where the problem is happening. For example, are there any images being downloaded?

The last time I did this there weren't any images downloaded, but in retrospect that might have been because they were being cached. I should do this more robustly, and because I'm really only interested in the URLs, I can filter out all of the other details to make my life easier.

There are two reasons for doing this. First to compare with the URLs accessed using ESR 78, which is something I also failed to do previously. Second because if I have all the URLs, that might also help me piece together a full version of the page without all of the mysterious transformations that happen when I save the page manually from the desktop browser.

So to kick things off, here are all of the URLs accessed for ESR 78 when requesting the mobile version of the site. Note that the first thing I do is delete the profile entirely. That's to avoid anything being cached.

Mobile version of the site on ESR 78
$ rm -rf ~/.local/share/org.sailfishos/browser
$ EMBED_CONSOLE="network" sailfish-browser https://duckduckgo.com/ 2>&1 | grep "URL: "
URL: https://duckduckgo.com/
URL: https://duckduckgo.com/dist/s.b49dcfb5899df4f917ee.css
URL: https://duckduckgo.com/dist/o.2988a52fdfb14b7eff16.css
URL: https://duckduckgo.com/dist/tl.3db2557c9f124f3ebf92.js
URL: https://duckduckgo.com/dist/b.9e45618547aaad15b744.js
URL: https://duckduckgo.com/dist/lib/l.656ceb337d61e6c36064.js
URL: https://firefox.settings.services.mozilla.com/v1/buckets/monitor/
     collections/changes/records?collection=hijack-blocklists&bucket=main
URL: https://duckduckgo.com/locale/en_GB/duckduckgo85.js
URL: https://firefox.settings.services.mozilla.com/v1/buckets/monitor/
     collections/changes/records?collection=anti-tracking-url-decoration&bucket=main
URL: https://firefox.settings.services.mozilla.com/v1/buckets/monitor/
     collections/changes/records?collection=url-classifier-skip-urls&bucket=main
URL: https://duckduckgo.com/dist/util/u.a3c3a6d4d7bf9244744d.js
URL: https://duckduckgo.com/dist/d.01ff355796b8725c8dad.js
URL: https://firefox.settings.services.mozilla.com/v1/buckets/main/
     collections/hijack-blocklists/changeset?_expected=1605801189258
URL: https://firefox.settings.services.mozilla.com/v1/buckets/main/
     collections/url-classifier-skip-urls/changeset?_expected=1701090424142
URL: https://firefox.settings.services.mozilla.com/v1/buckets/main/
     collections/anti-tracking-url-decoration/changeset?_expected=1564511755134
URL: https://duckduckgo.com/dist/h.2d6522d4f29f5b108aed.js
URL: https://duckduckgo.com/dist/ti.b07012e30f6971ff71d3.js
URL: https://duckduckgo.com/font/ProximaNova-Reg-webfont.woff2
URL: https://content-signature-2.cdn.mozilla.net/chains/
     remote-settings.content-signature.mozilla.org-2024-02-08-20-06-04.chain
URL: https://duckduckgo.com/post3.html
URL: https://duckduckgo.com/assets/logo_homepage.normal.v109.svg
URL: https://duckduckgo.com/font/ProximaNova-Sbold-webfont.woff2
URL: https://duckduckgo.com/assets/onboarding/bathroomguy/teaser-2@2x.png
URL: https://duckduckgo.com/assets/onboarding/arrow.svg
URL: https://duckduckgo.com/assets/logo_homepage.alt.v109.svg
URL: https://duckduckgo.com/font/ProximaNova-ExtraBold-webfont.woff2
URL: https://duckduckgo.com/assets/onboarding/bathroomguy/
     1-monster-v2--no-animation.svg
URL: https://duckduckgo.com/assets/onboarding/bathroomguy/2-ghost-v2.svg
URL: https://duckduckgo.com/assets/onboarding/bathroomguy/
     3-bathtub-v2--no-animation.svg
URL: https://duckduckgo.com/assets/onboarding/bathroomguy/4-alpinist-v2.svg
URL: https://duckduckgo.com/dist/p.f5b58579149e7488209f.js
URL: https://improving.duckduckgo.com/t/hi?7857271&b=firefox&ei=true&i=false&
     d=m&l=en-GB&p=other&pre_atb=v411-5&ax=false&ak=false&pre_va=_&pre_atbva=_
Now that's a lot more accesses than I had last time, which is a good sign. That's 32 URL accesses in total. Interestingly all bar one of the images are SVG format: all vector apart from a single bitmap.

It's also worth noting that some of the requests are to Mozilla servers rather than DuckDuckGo servers. I suspect that's a consequence of having wiped the profile: the browser is making accesses to retrieve some of the data that I deleted. We should ignore those requests.
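As a rough sketch of how those requests could be dropped automatically, the filter below keeps only the "URL: " lines and discards any whose host belongs to Mozilla. The function name site_urls is made up for illustration, and the host pattern is an assumption based only on the hosts that appear in the listings here. It also assumes the real console output puts each URL on a single line; the listings in this post have been wrapped by hand for readability.

```shell
#!/bin/sh
# Keep only request URLs that are not addressed to Mozilla services.
# Assumptions: one unwrapped "URL: <url>" line per request, and a host
# pattern derived from the hosts seen in the listings (illustrative only).
site_urls() {
    grep '^URL: ' \
        | sed 's/^URL: //' \
        | grep -v -E '^https?://[^/]*(mozilla\.(com|net|org)|firefox\.com)(/|$)'
}
```

It would be used at the end of the same pipeline as before, for example: EMBED_CONSOLE="network" sailfish-browser https://duckduckgo.com/ 2>&1 | site_urls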

Next up the desktop version of the same site. These were generated by selecting the "Desktop Mode" button in the browser immediately after downloading the mobile site. Consequently it's possible some data was cached.

Desktop version of the site on ESR 78
URL: https://duckduckgo.com/
URL: https://duckduckgo.com/_next/static/css/c89114cfe55133c4.css
URL: https://duckduckgo.com/_next/static/css/6a4833195509cc3d.css
URL: https://duckduckgo.com/_next/static/css/a2a29f84956f2aac.css
URL: https://duckduckgo.com/_next/static/css/f0b3f7da285c9dbd.css
URL: https://duckduckgo.com/_next/static/css/ed8494aa71104fdc.css
URL: https://duckduckgo.com/_next/static/css/703c9a9a057785a9.css
URL: https://duckduckgo.com/_next/static/chunks/webpack-7358ea7cdec0aecf.js
URL: https://duckduckgo.com/_next/static/chunks/framework-f8115f7fae64930e.js
URL: https://duckduckgo.com/_next/static/chunks/main-17a05b704438cdd6.js
URL: https://duckduckgo.com/_next/static/chunks/pages/_app-ce0b94ea69138577.js
URL: https://duckduckgo.com/_next/static/chunks/41966-c9d76895b4f9358f.js
URL: https://duckduckgo.com/_next/static/chunks/93432-ebd443fe69061b19.js
URL: https://duckduckgo.com/_next/static/chunks/18040-1287342b1f839f70.js
URL: https://duckduckgo.com/_next/static/chunks/81125-b74d1b6f4908497b.js
URL: https://duckduckgo.com/_next/static/chunks/39337-cd8caeeff0afb1c4.js
URL: https://duckduckgo.com/_next/static/chunks/94623-d5bfa67fc3bada59.js
URL: https://duckduckgo.com/_next/static/chunks/95665-30dd494bea911abd.js
URL: https://duckduckgo.com/_next/static/chunks/55015-29fec414530c2cf6.js
URL: https://duckduckgo.com/_next/static/chunks/61754-29df12bb83d71c7b.js
URL: https://duckduckgo.com/_next/static/chunks/55672-19856920a309aea5.js
URL: https://duckduckgo.com/_next/static/chunks/38407-070351ade350c8e4.js
URL: https://duckduckgo.com/_next/static/chunks/pages/%5Blocale%5D/
     home-34dda07336cb6ee1.js
URL: https://duckduckgo.com/_next/static/ZNJGkb3SCV-nLlzz3kvNx/_buildManifest.js
URL: https://duckduckgo.com/_next/static/ZNJGkb3SCV-nLlzz3kvNx/_ssgManifest.js
URL: https://improving.duckduckgo.com/t/page_home_commonImpression?2448534&
     b=firefox&d=d&l=en-GB&p=linux&atb=v411-5&pre_va=_&pre_atbva=_&atbi=true&
     i=false&ak=false&ax=false
URL: https://duckduckgo.com/_next/static/media/set-as-default.d95c3465.svg
URL: https://duckduckgo.com/static-assets/image/pages/legacy-home/
     devices-dark.png
URL: https://duckduckgo.com/static-assets/backgrounds/
     legacy-homepage-btf-mobile-dark.png
URL: https://duckduckgo.com/_next/static/media/macos.61889438.png
URL: https://duckduckgo.com/_next/static/media/windows.477fa143.png
URL: https://duckduckgo.com/_next/static/media/app-store.501fe17a.png
URL: https://duckduckgo.com/_next/static/media/play-store.e5d5ed36.png
URL: https://duckduckgo.com/_next/static/media/chrome-lg.a4859fb2.png
URL: https://duckduckgo.com/_next/static/media/edge-lg.36af7682.png
URL: https://duckduckgo.com/_next/static/media/firefox-lg.8efad702.png
URL: https://duckduckgo.com/_next/static/media/opera-lg.237c4418.png
URL: https://duckduckgo.com/_next/static/media/safari-lg.8406694a.png
URL: https://duckduckgo.com/_next/static/chunks/48292.8c8d6cb394d25a15.js
URL: https://duckduckgo.com/static-assets/backgrounds/legacy-homepage-btf-dark.png
URL: https://duckduckgo.com/static-assets/font/ProximaNova-Reg-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-ExtraBold-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-Bold-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-Sbold-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-RegIt-webfont.woff2
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     desktop/search-protection-back-light.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     desktop/search-protection-front-dark.png
URL: https://duckduckgo.com/_next/static/media/flame.1241f020.png
URL: https://duckduckgo.com/_next/static/media/burn@2x.be0bd36d.png
URL: https://duckduckgo.com/_next/static/media/flame@2x.40e1cfa0.png
URL: https://duckduckgo.com/_next/static/media/widget-big@2x.a260ccf6.png
URL: https://duckduckgo.com/_next/static/media/night@2x.4ca79636.png
URL: https://duckduckgo.com/_next/static/media/dark-mode@2x.3e150d01.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     desktop/email-protection-front-light.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     desktop/app-protection-back-dark.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     desktop/app-protection-front-dark.png
URL: https://duckduckgo.com/_next/static/media/flame-narrow.70589b7c.png
URL: https://duckduckgo.com/_next/static/media/widget-small@2x.07c865df.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     search-protection-ios-dark.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     web-protection-ios-dark.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     email-protection-ios-dark.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     desktop/web-protection-back-dark.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     desktop/web-protection-front-dark.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     desktop/email-protection-back-dark.png
URL: https://duckduckgo.com/static-assets/image/pages/home/devices/how-it-works/
     app-protection-ios-dark.png
URL: https://duckduckgo.com/_next/static/media/add-firefox.f0890a6c.svg
URL: https://duckduckgo.com/_next/static/media/WIRED-DARK-DEFAULT.b4d48a49.png
URL: https://duckduckgo.com/_next/static/media/VERGE-DARK-DEFAULT.8850a2d2.png
URL: https://duckduckgo.com/_next/static/media/UT-DARK-DEFAULT.6cd0020d.png
URL: https://duckduckgo.com/_next/static/media/CNET-DARK.e3fd496e.png
URL: https://improving.duckduckgo.com/t/atb_home_impression?9836955&b=firefox&
     d=d&l=en-GB&p=linux&atb=v411-5&pre_va=_&pre_atbva=_&atbi=true&i=false&
     ak=false&ax=false
The desktop version generates many more access requests, 71 in total, including a large number of PNG files (I count 36 in total). These are the results for ESR 78, so that's a working copy of the site. On the face of it there shouldn't be any reason for ESR 91 to be served anything different to this.
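Counting by hand is error prone, so here's a minimal sketch of how the tally by file type could be automated. The function name count_by_extension is hypothetical, and it assumes one URL per line on standard input (with or without the "URL: " prefix). Query strings are stripped before the extension is taken, and URLs with no file extension are grouped under "(none)".

```shell
#!/bin/sh
# Tally requests by file extension, most frequent first.
# Assumption: one URL per line on stdin, optionally prefixed with "URL: ".
count_by_extension() {
    sed -e 's/^URL: //' -e 's/[?#].*$//' -e 's|^[a-z]*://[^/]*||' \
        | awk -F/ '{
              name = $NF                      # last path component
              n = split(name, parts, ".")
              ext = (n > 1) ? parts[n] : "(none)"
              count[ext]++
          }
          END { for (e in count) print count[e], e }' \
        | LC_ALL=C sort -k1,1nr -k2,2
}
```

Each output line is a count followed by an extension, so the png and svg rows can be compared directly between the mobile and desktop runs.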

So now using ESR 91, here are the URLs accessed with an empty profile and the mobile version of the site.

Mobile version of the site on ESR 91
$ rm -rf ~/.local/share/org.sailfishos/browser
$ EMBED_CONSOLE="network" sailfish-browser https://duckduckgo.com/ 2>&1 | grep "URL: "
URL: http://detectportal.firefox.com/success.txt?ipv4
URL: https://duckduckgo.com/
URL: https://location.services.mozilla.com/v1/country?key=no-mozilla-api-key
URL: https://duckduckgo.com/static-assets/font/ProximaNova-RegIt-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-ExtraBold-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-Reg-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-Sbold-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-Bold-webfont.woff2
URL: https://duckduckgo.com/_next/static/css/c89114cfe55133c4.css
URL: https://duckduckgo.com/_next/static/css/6a4833195509cc3d.css
URL: https://duckduckgo.com/_next/static/css/a2a29f84956f2aac.css
URL: https://duckduckgo.com/_next/static/css/f0b3f7da285c9dbd.css
URL: https://duckduckgo.com/_next/static/css/ed8494aa71104fdc.css
URL: https://duckduckgo.com/_next/static/css/703c9a9a057785a9.css
URL: https://duckduckgo.com/_next/static/chunks/webpack-7358ea7cdec0aecf.js
URL: https://duckduckgo.com/_next/static/chunks/framework-f8115f7fae64930e.js
URL: https://firefox.settings.services.mozilla.com/v1/buckets/monitor/
     collections/changes/changeset?collection=anti-tracking-url-decoration&
     bucket=main&_expected=0
URL: https://firefox.settings.services.mozilla.com/v1/buckets/monitor/
     collections/changes/changeset?collection=query-stripping&bucket=main&
     _expected=0
URL: https://firefox.settings.services.mozilla.com/v1/buckets/monitor/
     collections/changes/changeset?collection=hijack-blocklists&
     bucket=main&_expected=0
URL: https://firefox.settings.services.mozilla.com/v1/buckets/monitor/
     collections/changes/changeset?collection=url-classifier-skip-urls&
     bucket=main&_expected=0
URL: https://duckduckgo.com/_next/static/chunks/main-17a05b704438cdd6.js
URL: https://duckduckgo.com/_next/static/chunks/pages/_app-ce0b94ea69138577.js
URL: https://duckduckgo.com/_next/static/chunks/41966-c9d76895b4f9358f.js
URL: https://duckduckgo.com/_next/static/chunks/93432-ebd443fe69061b19.js
URL: https://firefox.settings.services.mozilla.com/v1/buckets/monitor/
     collections/changes/changeset?collection=password-recipes&bucket=main&
     _expected=0
URL: https://duckduckgo.com/_next/static/chunks/18040-1287342b1f839f70.js
URL: https://duckduckgo.com/_next/static/chunks/81125-b74d1b6f4908497b.js
URL: https://duckduckgo.com/_next/static/chunks/39337-cd8caeeff0afb1c4.js
URL: https://duckduckgo.com/_next/static/chunks/94623-d5bfa67fc3bada59.js
URL: https://duckduckgo.com/_next/static/chunks/95665-30dd494bea911abd.js
URL: https://duckduckgo.com/_next/static/chunks/55015-29fec414530c2cf6.js
URL: https://duckduckgo.com/_next/static/chunks/61754-29df12bb83d71c7b.js
URL: https://duckduckgo.com/_next/static/chunks/55672-19856920a309aea5.js
URL: https://duckduckgo.com/_next/static/chunks/38407-070351ade350c8e4.js
URL: https://duckduckgo.com/_next/static/chunks/pages/%5Blocale%5D/
     home-34dda07336cb6ee1.js
URL: https://duckduckgo.com/_next/static/ZNJGkb3SCV-nLlzz3kvNx/_buildManifest.js
URL: https://duckduckgo.com/_next/static/ZNJGkb3SCV-nLlzz3kvNx/_ssgManifest.js
URL: https://firefox.settings.services.mozilla.com/v1/buckets/monitor/
     collections/changes/changeset?collection=search-config&bucket=main&
     _expected=0
URL: https://firefox.settings.services.mozilla.com/v1/buckets/main/collections/
     anti-tracking-url-decoration/changeset?_expected=1564511755134
URL: https://firefox.settings.services.mozilla.com/v1/buckets/main/collections/
     query-stripping/changeset?_expected=1694689843914
URL: https://firefox.settings.services.mozilla.com/v1/buckets/main/collections/
     hijack-blocklists/changeset?_expected=1605801189258
URL: https://firefox.settings.services.mozilla.com/v1/buckets/main/collections/
     password-recipes/changeset?_expected=1674595048726
URL: https://firefox.settings.services.mozilla.com/v1/buckets/main/collections/
     url-classifier-skip-urls/changeset?_expected=1701090424142
URL: https://firefox.settings.services.mozilla.com/v1/buckets/main/collections/
     search-config/changeset?_expected=1701806851414
URL: https://content-signature-2.cdn.mozilla.net/chains/
     remote-settings.content-signature.mozilla.org-2024-02-08-20-06-04.chain
Now that's certainly more files than I thought previously were being accessed using ESR 91. In fact it's 45 access requests, which is more than we saw for ESR 78. What about the desktop version?

Desktop version of the site on ESR 91
URL: https://duckduckgo.com/
URL: https://duckduckgo.com/static-assets/font/ProximaNova-RegIt-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-ExtraBold-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-Reg-webfont.woff2
URL: https://duckduckgo.com/static-assets/font/ProximaNova-Sbold-webfont.woff2
URL: https://duckduckgo.com/_next/static/css/c89114cfe55133c4.css
URL: https://duckduckgo.com/_next/static/css/6a4833195509cc3d.css
URL: https://duckduckgo.com/_next/static/css/a2a29f84956f2aac.css
URL: https://duckduckgo.com/_next/static/css/f0b3f7da285c9dbd.css
URL: https://duckduckgo.com/_next/static/css/ed8494aa71104fdc.css
URL: https://duckduckgo.com/_next/static/css/703c9a9a057785a9.css
URL: https://duckduckgo.com/static-assets/font/ProximaNova-Bold-webfont.woff2
URL: https://duckduckgo.com/_next/static/chunks/webpack-7358ea7cdec0aecf.js
URL: https://duckduckgo.com/_next/static/chunks/framework-f8115f7fae64930e.js
URL: https://duckduckgo.com/_next/static/chunks/main-17a05b704438cdd6.js
URL: https://duckduckgo.com/_next/static/chunks/pages/_app-ce0b94ea69138577.js
URL: https://duckduckgo.com/_next/static/chunks/41966-c9d76895b4f9358f.js
URL: https://duckduckgo.com/_next/static/chunks/93432-ebd443fe69061b19.js
URL: https://duckduckgo.com/_next/static/chunks/18040-1287342b1f839f70.js
URL: https://duckduckgo.com/_next/static/chunks/81125-b74d1b6f4908497b.js
URL: https://duckduckgo.com/_next/static/chunks/39337-cd8caeeff0afb1c4.js
URL: https://duckduckgo.com/_next/static/chunks/94623-d5bfa67fc3bada59.js
URL: https://duckduckgo.com/_next/static/chunks/95665-30dd494bea911abd.js
URL: https://duckduckgo.com/_next/static/chunks/55015-29fec414530c2cf6.js
URL: https://duckduckgo.com/_next/static/chunks/61754-29df12bb83d71c7b.js
URL: https://duckduckgo.com/_next/static/chunks/55672-19856920a309aea5.js
URL: https://duckduckgo.com/_next/static/chunks/38407-070351ade350c8e4.js
URL: https://duckduckgo.com/_next/static/chunks/pages/%5Blocale%5D/
     home-34dda07336cb6ee1.js
URL: https://duckduckgo.com/_next/static/ZNJGkb3SCV-nLlzz3kvNx/_buildManifest.js
URL: https://duckduckgo.com/_next/static/ZNJGkb3SCV-nLlzz3kvNx/_ssgManifest.js
Only 30 accesses. That's really unexpected. I'm also going to collect myself a version of the full log output in case I need to refer back to it. First on ESR 78:
$ rm -rf ~/.local/share/org.sailfishos/browser
$ EMBED_CONSOLE="network" sailfish-browser https://duckduckgo.com/ \
  > log-ddg-78.txt 2>&1
Then also on ESR 91:
$ rm -rf ~/.local/share/org.sailfishos/browser
$ EMBED_CONSOLE="network" sailfish-browser https://duckduckgo.com/ \
  > log-ddg-91.txt 2>&1
The next step will be to compare the results from ESR 78 with those from ESR 91 properly. That's where I'll pick this up tomorrow.
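One way the comparison might be done is sketched below, assuming both log files contain unwrapped "URL: <url>" lines. The function name urls_only_in_first is made up for illustration. It sorts the URLs from each log and uses comm to print those that appear in the first log but not the second.

```shell
#!/bin/sh
# Print URLs present in the first log but absent from the second.
# Assumption: both logs contain unwrapped "URL: <url>" lines.
urls_only_in_first() {
    first_sorted=$(mktemp) || return 1
    second_sorted=$(mktemp) || { rm -f "$first_sorted"; return 1; }
    grep '^URL: ' "$1" | sed 's/^URL: //' | LC_ALL=C sort -u > "$first_sorted"
    grep '^URL: ' "$2" | sed 's/^URL: //' | LC_ALL=C sort -u > "$second_sorted"
    # -2 hides lines unique to the second file, -3 hides lines common to both
    LC_ALL=C comm -23 "$first_sorted" "$second_sorted"
    rm -f "$first_sorted" "$second_sorted"
}
```

Calling urls_only_in_first log-ddg-78.txt log-ddg-91.txt would list what ESR 78 fetched that ESR 91 didn't, and swapping the arguments gives the reverse. The improving.duckduckgo.com requests appear to carry a number that changes between runs, so those are likely to show up as differences either way.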

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
8 Jan 2024 : Day 132 #
For the last twenty hours or so my development phone has been loyally running ESR 91 in the debugger. That's because I've been searching for a suitable breakpoint to distinguish a working render of DuckDuckGo from a failing render of DuckDuckGo.

The time spent in between has been fruitful, but before I get on to that I want to first thank simonschmeisser for the useful comment on the Sailfish Forum about remote debugging. Simon highlights the fact that Firefox can be remote debugged by a second copy of Firefox. Here's how Simon explains it:
 
Unfortunately my attempts (via about:config) failed so far. But maybe someone else knows some more tricks?
user_pref("devtools.chrome.enabled", true);
user_pref("devtools.debugger.remote-enabled", true);
user_pref("devtools.debugger.prompt-connection", false);
and then some sources claim you need to add --start-debugger-server and a port (this would obviously need to be implemented for sfos-browser...)

and finally you could connect using about:debugging from desktop to debug what’s happening on the phone.

This approach isn't something I've tried before and it's true that it may need some code changes, but they may not be significant changes. I've not had a chance to try this, but plan to give it a go some time over the next week if I have time. Thanks Simon for the really nice suggestion!

Going back to the current situation, the local gdb debugger is still running so I'd better spend a little time today making use of it. Recall that yesterday we found that on ESR 91 a breakpoint on nsDisplayBackgroundImage::AppendBackgroundItemsToTop() was triggered, whereas a breakpoint on nsDisplayBackgroundImage::GetInitData() wasn't.

These two are potentially significant because the latter can be found one frame above the former in the stack when we visit the HTML version of DuckDuckGo. Consequently, it could be that something is happening in AppendBackgroundItemsToTop() to prevent the site from rendering.

This feels like a long shot, especially because the stack for the HTML version of the page is hardly likely to be equivalent to the stack for the JavaScript version of the page. Nevertheless, we can pin this down a bit more using an ESR 78 version of the browser. If we run this on the JavaScript version of the page, we might find there's some useful comparison to be made.

I've placed breakpoints on nsDisplayBackgroundImage::GetInitData() for both versions of the browser and have run them simultaneously. On ESR 91 I get a blank page, nothing rendered, and the breakpoint remains untriggered. In contrast to this, on ESR 78 the breakpoint hits with the following backtrace:
Thread 8 "GeckoWorkerThre" hit Breakpoint 1,
    nsDisplayBackgroundImage::GetInitData (aBuilder=aBuilder@entry=0x7fa6e2e630,
    aFrame=aFrame@entry=0x7f88744060, aLayer=aLayer@entry=0,
    aBackgroundRect=..., aBackgroundStyle=aBackgroundStyle@entry=0x7f89d21578)
    at layout/painting/nsDisplayList.cpp:3233
3233                                          ComputedStyle* aBackgroundStyle) {
(gdb) bt
#0  nsDisplayBackgroundImage::GetInitData (aBuilder=aBuilder@entry=0x7fa6e2e630,
    aFrame=aFrame@entry=0x7f88744060, aLayer=aLayer@entry=0, 
    aBackgroundRect=..., aBackgroundStyle=aBackgroundStyle@entry=0x7f89d21578)
    at layout/painting/nsDisplayList.cpp:3233
#1  0x0000007fbc1fab50 in nsDisplayBackgroundImage::AppendBackgroundItemsToTop
    (aBuilder=aBuilder@entry=0x7fa6e2e630, aFrame=aFrame@entry=0x7f88744060, 
    aBackgroundRect=..., aList=<optimized out>, 
    aAllowWillPaintBorderOptimization=aAllowWillPaintBorderOptimization@entry=true, 
    aComputedStyle=aComputedStyle@entry=0x0, aBackgroundOriginRect=..., 
    aSecondaryReferenceFrame=aSecondaryReferenceFrame@entry=0x0, 
    aAutoBuildingDisplayList=<optimized out>, aAutoBuildingDisplayList@entry=0x0)
    at layout/painting/nsDisplayList.cpp:3605
#2  0x0000007fbbffea94 in nsFrame::DisplayBackgroundUnconditional
    (this=this@entry=0x7f88744060, aBuilder=aBuilder@entry=0x7fa6e2e630, aLists=..., 
    aForceBackground=aForceBackground@entry=false)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsRect.h:42
[...]
#74 0x0000007fb735e89c in ?? () from /lib64/libc.so.6
(gdb) 
This is interesting because the second frame in the stack is for nsDisplayBackgroundImage::AppendBackgroundItemsToTop(). That matches what we were expecting before, and there's a good chance that DuckDuckGo will be serving the same data to ESR 78 as to the ESR 91 version of the renderer.

This strengthens my suspicion that there's something to be found inside AppendBackgroundItemsToTop(). Let's take a look at it.

So now I'm stepping through the code side-by-side using two phones; ESR 78 on the phone on the left and ESR 91 on the phone on the right. On my laptop display I have Qt Creator open with the ESR 78 copy of nsDisplayList.cpp on the left and the ESR 91 copy of the file on the right.

As I step through on both phones I'm comparing the source code line-by-line. They're not identical but close enough that I can keep track of them line-by-line and keep them broadly synchronised.

Eventually we hit this bit of code that's part of ESR 78, which is where the method returns:
  if (!bg) {
    aList->AppendToTop(&bgItemList);
    return false;
  }
The ESR 91 version of this code looks like this:
  if (!bg || !drawBackgroundImage) {
    if (!bgItemList.IsEmpty()) {
      aList->AppendToTop(&bgItemList);
      return AppendedBackgroundType::Background;
    }

    return AppendedBackgroundType::None;
  }
That's similar but not quite the same and in particular, bgItemList.mLength is zero meaning that !bgItemList.IsEmpty() is false.
(gdb) p bgItemList.mLength
$6 = 0
Both methods return at this point, but on ESR 78 aList->AppendToTop() has been called whereas on ESR 91 it's been skipped. There are multiple reasons why this might be and it's hard to imagine this is the reason rendering is failing on ESR 91, but it's a possibility that I need to investigate further.

But also in both cases the method is returning before the call to GetInitData() and what I'm really after is a case where it's called on ESR 78 but not on ESR 91. To examine that I'm going to have to step through the method a few more times, maybe place a breakpoint closer to the call. And for that, unfortunately, I've run out of time for today; it'll have to be something I explore tomorrow.

So I guess the phones will have to be left stuck on a breakpoint overnight yet again.

7 Jan 2024 : Day 131 #
I'm still confused about why, specifically, the main DuckDuckGo search page doesn't render using ESR 91. As we saw yesterday, rotating the screen doesn't fix it. Copying the entire page and serving it from a different location works okay, which is the opposite of what I was hoping for and makes narrowing down the problem that much more difficult.

The only thing I can think to do now is crack open the debugger and see whether any attempt is being made to render anything of the site.

If the page is rendering, there's a good chance that some part of it will be an image. Consequently I've placed a breakpoint on nsImageRenderer::PrepareImage() and have set the browser running. Let's see if it hits.
$ EMBED_CONSOLE=1 gdb sailfish-browser https://duckduckgo.com/
[...]
(gdb) b nsImageRenderer::PrepareImage
Function "nsImageRenderer::PrepareImage" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (nsImageRenderer::PrepareImage) pending.
(gdb) r
Starting program: /usr/bin/sailfish-browser 
[...]
An interesting result: there are no hits on the breakpoint at all when loading the default DuckDuckGo page. But when loading the HTML-only page, the breakpoint hits immediately:
Thread 10 "GeckoWorkerThre" hit Breakpoint 1,
    mozilla::nsImageRenderer::PrepareImage (this=this@entry=0x7f9efb0848)
    at layout/painting/nsImageRenderer.cpp:66
66      bool nsImageRenderer::PrepareImage() {
In fact it hits, and continues hitting, while the page is displayed. Which is interesting: it suggests that for the JavaScript page no attempt is being made to render the page at all.

Here's the backtrace from the breakpoint hit with the HTML-only page. This could be useful, because it may allow us to place a breakpoint further up the stack to figure out at which point the render is getting blocked. That's assuming that the problem is in the render loop, which of course it may well not be.

The 69-frame backtrace is huge, so I'm just copying out the relevant parts of it below. Rendering involves recursive calls to BuildDisplayList() which seems to be one of the reasons these stacks get so large.
Thread 10 "GeckoWorkerThre" hit Breakpoint 1, mozilla::nsImageRenderer::
    PrepareImage (this=this@entry=0x7f9efa0848)
    at layout/painting/nsImageRenderer.cpp:66
66      bool nsImageRenderer::PrepareImage() {
(gdb) bt
#0  mozilla::nsImageRenderer::PrepareImage (this=this@entry=0x7f9efa0848)
    at layout/painting/nsImageRenderer.cpp:66
#1  0x0000007fbc350c5c in nsCSSRendering::PrepareImageLayer
    (aPresContext=aPresContext@entry=0x7f8111dde0,
    aForFrame=aForFrame@entry=0x7f813062c0, 
    aFlags=<optimized out>, aBorderArea=..., aBGClipRect=..., aLayer=..., 
    aOutIsTransformedFixed=aOutIsTransformedFixed@entry=0x7f9efa0847)
    at layout/painting/nsCSSRendering.cpp:2976
#2  0x0000007fbc351254 in nsDisplayBackgroundImage::GetInitData
    (aBuilder=aBuilder@entry=0x7f9efa6268, aFrame=aFrame@entry=0x7f813062c0, 
    aLayer=aLayer@entry=1, aBackgroundRect=...,
    aBackgroundStyle=aBackgroundStyle@entry=0x7f8130f608)
    at layout/painting/nsDisplayList.cpp:3416
#3  0x0000007fbc3a97a4 in nsDisplayBackgroundImage::AppendBackgroundItemsToTop
    (aBuilder=0x7f9efa6268, aFrame=0x7f813062c0, aBackgroundRect=..., 
    aList=0x7f9efa11c8, aAllowWillPaintBorderOptimization=<optimized out>,
    aComputedStyle=<optimized out>, aBackgroundOriginRect=..., 
    aSecondaryReferenceFrame=0x0, aAutoBuildingDisplayList=0x0)
    at layout/painting/nsDisplayList.cpp:3794
#4  0x0000007fbc18c3d0 in nsIFrame::DisplayBackgroundUnconditional
    (this=this@entry=0x7f813062c0, aBuilder=aBuilder@entry=0x7f9efa6268,
    aLists=..., aForceBackground=aForceBackground@entry=false)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsRect.h:40
#5  0x0000007fbc191c90 in nsIFrame::DisplayBorderBackgroundOutline
    (this=this@entry=0x7f813062c0, aBuilder=aBuilder@entry=0x7f9efa6268,
    aLists=..., aForceBackground=aForceBackground@entry=false)
    at layout/generic/nsIFrame.cpp:2590
#6  0x0000007fbc16fd10 in nsBlockFrame::BuildDisplayList
    (this=0x7f813062c0, aBuilder=0x7f9efa6268, aLists=...)
    at layout/generic/nsBlockFrame.cpp:6963
#7  0x0000007fbc1c040c in nsIFrame::BuildDisplayListForChild
    (this=this@entry=0x7f81306200, aBuilder=aBuilder@entry=0x7f9efa6268, 
    aChild=aChild@entry=0x7f813062c0, aLists=..., aFlags=..., aFlags@entry=...)
    at layout/generic/nsIFrame.cpp:4278
#8  0x0000007fbc159ccc in DisplayLine (aBuilder=aBuilder@entry=0x7f9efa6268,
    aLine=..., aLineInLine=aLineInLine@entry=false, aLists=..., 
    aFrame=aFrame@entry=0x7f81306200, aTextOverflow=aTextOverflow@entry=0x0,
    aLineNumberForTextOverflow=aLineNumberForTextOverflow@entry=0, 
    aDepth=aDepth@entry=0, aDrawnLines=@0x7f9efa14fc: 127)
    at layout/generic/nsBlockFrame.cpp:6924
#9  0x0000007fbc170220 in nsBlockFrame::BuildDisplayList (this=0x7f81306200,
    aBuilder=0x7f9efa6268, aLists=...)
    at layout/generic/nsBlockFrame.cpp:7082
[...]
#33 0x0000007fbc1bce14 in nsIFrame::BuildDisplayListForStackingContext
    (this=this@entry=0x7f81304940, aBuilder=<optimized out>, 
    aBuilder@entry=0x7f9efa6268, aList=aList@entry=0x7f9efa8078,
    aCreatedContainerItem=aCreatedContainerItem@entry=0x0)
    at layout/generic/nsIFrame.cpp:3416
#34 0x0000007fbc12d7ac in nsLayoutUtils::PaintFrame
    (aRenderingContext=aRenderingContext@entry=0x0,
    aFrame=aFrame@entry=0x7f81304940, aDirtyRegion=...,
    aBackstop=aBackstop@entry=4294967295,
    aBuilderMode=aBuilderMode@entry=nsDisplayListBuilderMode::Painting,
    aFlags=aFlags@entry=(nsLayoutUtils::PaintFrameFlags::WidgetLayers |
    nsLayoutUtils::PaintFrameFlags::ExistingTransaction |
    nsLayoutUtils::PaintFrameFlags::NoComposite))
    at layout/base/nsLayoutUtils.cpp:3445
#35 0x0000007fbc0b9008 in mozilla::PresShell::Paint
    (this=this@entry=0x7f811424a0, aViewToPaint=aViewToPaint@entry=0x7e780077f0,
    aDirtyRegion=..., aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers)
    at layout/base/PresShell.cpp:6400
[...]
#69 0x0000007fb78b289c in ?? () from /lib64/libc.so.6
(gdb) 
This breakpoint was set on nsImageRenderer::PrepareImage(). That hit on the HTML page but not on the JavaScript page. Let's try something further down the stack. At stack frame 34 we're inside nsLayoutUtils::PaintFrame(), so let's put a breakpoint on that and display the JavaScript version of the page to see what happens.
(gdb) delete break 1
(gdb) b nsLayoutUtils::PaintFrame
Breakpoint 2 at 0x7fbc12ce1c: file layout/base/nsLayoutUtils.cpp, line 3144.
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 2, nsLayoutUtils::PaintFrame
    (aRenderingContext=aRenderingContext@entry=0x0,
    aFrame=aFrame@entry=0x7f809f3c00, 
    aDirtyRegion=..., aBackstop=aBackstop@entry=4294967295,
    aBuilderMode=aBuilderMode@entry=nsDisplayListBuilderMode::Painting, 
    aFlags=aFlags@entry=(nsLayoutUtils::PaintFrameFlags::WidgetLayers |
    nsLayoutUtils::PaintFrameFlags::ExistingTransaction |
    nsLayoutUtils::PaintFrameFlags::NoComposite))
    at layout/base/nsLayoutUtils.cpp:3144
3144                                       PaintFrameFlags aFlags) {
(gdb) bt
#0  nsLayoutUtils::PaintFrame (aRenderingContext=aRenderingContext@entry=0x0,
    aFrame=aFrame@entry=0x7f809f3c00, aDirtyRegion=..., 
    aBackstop=aBackstop@entry=4294967295,
    aBuilderMode=aBuilderMode@entry=nsDisplayListBuilderMode::Painting, 
    aFlags=aFlags@entry=(nsLayoutUtils::PaintFrameFlags::WidgetLayers |
    nsLayoutUtils::PaintFrameFlags::ExistingTransaction |
    nsLayoutUtils::PaintFrameFlags::NoComposite))
    at layout/base/nsLayoutUtils.cpp:3144
#1  0x0000007fbc0b9008 in mozilla::PresShell::Paint
    (this=this@entry=0x7f80456d90, aViewToPaint=aViewToPaint@entry=0x7f80b73520,
    aDirtyRegion=..., aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers)
    at layout/base/PresShell.cpp:6400
#2  0x0000007fbbef0ec8 in nsViewManager::ProcessPendingUpdatesPaint
    (this=this@entry=0x7f80b79520, aWidget=aWidget@entry=0x7f806e6f60)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/RectAbsolute.h:43
[...]
#30 0x0000007fb78b289c in ?? () from /lib64/libc.so.6
(gdb) 
And we get a hit. That means the two cases diverge somewhere between the call to nsLayoutUtils::PaintFrame() and the call to nsImageRenderer::PrepareImage().

I won't bore you with all of the steps, but I've followed the path up through the stack, setting breakpoints on each method call, and have found that the following breakpoint does hit for the standard rendering of the DuckDuckGo site:
Thread 10 "GeckoWorkerThre" hit Breakpoint 6,
    nsDisplayBackgroundImage::AppendBackgroundItemsToTop (aBuilder=0x7f9efa6378,
    aFrame=0x7f81599428, aBackgroundRect=..., aList=0x7f9efa3f08,
    aAllowWillPaintBorderOptimization=true, aComputedStyle=0x0,
    aBackgroundOriginRect=..., aSecondaryReferenceFrame=0x0,
    aAutoBuildingDisplayList=0x0)
    at layout/painting/nsDisplayList.cpp:3632
3632            aAutoBuildingDisplayList) {
(gdb) 
However, moving one frame up the stack we find nsDisplayBackgroundImage::GetInitData() and a breakpoint placed on this method never triggers:
(gdb) delete break 6
(gdb) break nsDisplayBackgroundImage::GetInitData
Breakpoint 7 at 0x7fbc3511d4: file layout/painting/nsDisplayList.cpp, line 3409.
(gdb) c
Continuing.
So that's between stack frame two and three of our original backtrace. This might suggest — although I'm not on firm ground here by any means — that the problem is potentially happening inside the call to AppendBackgroundItemsToTop(). On the other hand, that might just be a consequence of the two versions of the site having different structures. I'm going to step through the method to try to find out.
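
The manual search described above, walking the working backtrace and checking which functions are still reached on the failing page, can be sketched in a few lines of Python. This is only an illustration: the find_boundary helper and the observed_hits set are hypothetical, with the latter simply recording the breakpoint results reported in this entry rather than talking to gdb.

```python
from typing import Callable, Optional, Sequence, Tuple


def find_boundary(
    frames: Sequence[str], is_hit: Callable[[str], bool]
) -> Optional[Tuple[str, str]]:
    """Find where a failing run leaves the call path of a working run.

    frames lists function names from the working backtrace, innermost
    (frame #0) first. Returns (caller, callee) for the pair nearest the
    outermost frame where the caller is reached but the callee is not.
    """
    # Walk from the outermost caller towards frame #0.
    for index in range(len(frames) - 1, 0, -1):
        caller, callee = frames[index], frames[index - 1]
        if is_hit(caller) and not is_hit(callee):
            return caller, callee
    return None


# A subset of the frames from the HTML-only backtrace, innermost first.
frames = [
    "nsImageRenderer::PrepareImage",
    "nsDisplayBackgroundImage::GetInitData",
    "nsDisplayBackgroundImage::AppendBackgroundItemsToTop",
    "nsLayoutUtils::PaintFrame",
]

# Breakpoints that did hit on the JavaScript page, as reported above.
observed_hits = {
    "nsLayoutUtils::PaintFrame",
    "nsDisplayBackgroundImage::AppendBackgroundItemsToTop",
}

boundary = find_boundary(frames, lambda name: name in observed_hits)
print(boundary)
```

In a real session each is_hit check corresponds to setting a breakpoint in gdb and continuing to see whether it triggers.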

But unfortunately not today: it's late and this needs a bit more time, so I'll pick it up in the morning. This is actually quite a good place to pause, having a very clear direction to explore and pick up on tomorrow.

Let's see what tomorrow brings.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
7 Jan 2024 : Life as a Christmas tree #
The 6th January is traditionally the day Christmas decorations are dismantled in the UK. In Finland it's the 13th January, partly because the Christmas lights are needed to counteract the shorter daylight hours and partly to avoid angering the Yulegoat. But I'm in the UK so this weekend Joanna and I took down our Christmas decorations.

In previous years we've always tried to get a Christmas tree with roots. Our success rate in keeping it alive until the next Christmas currently stands at zero percent.

This year I went out of my way to care for our Christmas tree, carefully keeping the soil in its pot moist with daily watering, avoiding bumps and bashes, not overburdening the branches with crazy decorative figurines.

It's definitely fared better than any of our previous trees and today I dug a hole in the back garden and planted it solidly.

Here are the three stages of its life I've so far been involved with, from left-to-right: sitting in our living room right after we introduced it; with decorations ready for Christmas; and now transplanted to our back garden.
 
Three photos of the same tree: undecorated in a pot; decorated in a pot; planted in the back garden


I'm no gardener and I don't rate its chances highly, but I'd love it to survive. Not only would it be wonderful to have a Norwegian Spruce living in our garden, but it would also feel like a real achievement to have a multi-year Christmas tree. I'm also counting this as one of the ecological acts needed to fulfil my New Year's Resolutions.

I'll report back later in the year on how the tree is doing. It feels like its success is now very much down to weather, nature and its will to survive. Maybe that's not the right way to look at these things, but that's why I'm not a gardener.
6 Jan 2024 : Day 130 #
We didn't make much progress yesterday despite capturing all of the data flowing through the browser while visiting DuckDuckGo. While using ESR 91 the page just comes up completely blank, with no obvious error messages in the console output to indicate why.

I left things overnight to settle in my mind and by the morning I'd come up with a plan. So I suppose letting it settle was the right move.

But before getting into that let me first thank PeperJohnny for sympathising with my testing plight. While of course I'm not happy that PeperJohnny suffers these frustrations as well, I am reassured by the knowledge that I'm not the only one! I share, and am encouraged by, PeperJohnny's belief that the reason can't hide itself forever!

Thanks also to hschmitt and lkraav for the useful comments about similar experiences rendering the Amazon Germany Web pages using the ESR 78 engine.

As it happens I also get this with Amazon UK using ESR 78. It's mighty unhelpful because pages appear briefly before blanking out, making the site next-to-useless. I wasn't aware of the portrait-to-landscape fix which, thanks to the useful comments, I now find works very well. So thank you both for this nice input.

Unfortunately the problem with DuckDuckGo on ESR 91 seems to be different. Unlike the Amazon case the rendering doesn't even start, so the symptoms appear to be slightly different. Following the advice I did of course also try the portrait-to-landscape trick, but sadly this doesn't have any obvious effect.

One positive I can report is that Amazon no longer exhibits this annoying behaviour when using ESR 91. It does display other problematic behaviour, but fixing that will have to wait for a future investigation.

Thanks for all of the input, I always appreciate it, and am always pleased to follow up useful tips like these. When attempting to fix this kind of thing, ideas can be the scarcest of resources, so all suggestions are valuable.

So it looks like I still need a plan of attack, which brings me back to the plan I cooked up overnight. Unfortunately I don't have much time to execute it today, but maybe I can make a start on it. The approach I'm going to try is "divide and conquer". I've actually used it before when trying to fix other sites broken with earlier versions of the browser. Given this I'm not sure why it didn't occur to me last night; sometimes these things just take a little while to work their way out.

So my plan is to take a complete static copy of the DuckDuckGo HTML and store it on a server I have control over and can access from my development phones. I can then visit the site and, hopefully, the page will be similarly broken.

Having got to that point I can then start removing parts of the site to figure out which particular aspect of it is causing the rendering to fail. I call it "divide and conquer" because each time I'll take away a chunk of the code and see whether that solves the problem. If not, I'll remove a different chunk. Eventually something will have to render.
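
One way to make the chunk removal systematic is to halve the set of suspect chunks each time. The sketch below assumes a single chunk is responsible and that removing it is enough for the page to render. The find_culprit function, the chunk names and the renders callback are all hypothetical stand-ins for the manual step of uploading a trimmed copy of the page and loading it in the browser.

```python
from typing import Callable, List, Optional, Sequence


def find_culprit(
    chunks: Sequence[str], renders: Callable[[List[str]], bool]
) -> Optional[str]:
    """Return the chunk whose removal lets the page render, or None.

    Assumes exactly one chunk breaks rendering; with several interacting
    culprits this simple halving would not be reliable.
    """
    if not chunks or renders(list(chunks)):
        return None  # Nothing to search, or the full page already works.

    suspects = list(range(len(chunks)))
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        removed = set(half)
        page = [chunk for i, chunk in enumerate(chunks) if i not in removed]
        if renders(page):
            # Removing this half fixed things, so the culprit is in it.
            suspects = half
        else:
            # Still broken, so the culprit is among the other suspects.
            suspects = suspects[len(suspects) // 2:]
    return chunks[suspects[0]]


# Hypothetical page split into chunks, with one chunk breaking the render.
page_chunks = ["header", "styles", "script-a", "script-b", "footer"]


def fake_renders(page: List[str]) -> bool:
    # Stand-in for loading the trimmed page on the phone.
    return "script-b" not in page


culprit = find_culprit(page_chunks, fake_renders)
print(culprit)
```

In practice each call to renders means uploading a trimmed copy and reloading it on the phone, so halving keeps the number of round trips small.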

The tricky part here is getting an adequate copy of the site. If it's not close enough the issue won't manifest itself.

I've started off by exporting a full copy of the site using desktop Firefox. For the first attempt I took a copy of the page while using the Responsive Design developer option to make the site believe it's running on a phone. But having posted this copy on a server and accessed it using ESR 91, I find it loads perfectly.

So I tried a second time, this time capturing the full desktop site. Again, when I load this using ESR 91 on my phone it looks just fine, automatically displaying the mobile version.

This is all rather unhelpful.

Unfortunately it's now super-late here and I've still not got to the bottom of this. Maybe another night's sleep will help.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
6 Jan 2024 : How lightly did I tread in 2023 #
For the last four years I've been offsetting my carbon emissions. In the long run I accept that offsetting isn't a sustainable way to address the climate crisis, but until my CO2 output reaches zero I still think it's better to offset than to not. Apart from attempting to address the balance of my impact on the world it also offers two other benefits.

First there's the personal financial cost I incur from having to pony up a hundred quid or thereabouts each year. That's a good way to incentivize myself to reduce my carbon footprint in the future. Second there's the active process of interrogating my consumption: working through the calculations is a great way to focus the mind, confront the consequences of my personal decisions and think about what I could improve on in the future.

Last year it took until April for me to run the calculations and act on them. This year I've done much better. That's partly driven by my New Year's Resolution to make at least one ecological improvement per month during the year. Even though this isn't a new thing for me, when I made the resolution the intention was always to count this as one of the tasks. And so it is.

Here's the table that shows which carbon emissions came from which activities. I've included all previous years so that some trends can be captured. I should emphasise that this represents household emissions, so covers two people, both Joanna and me. For comparison, average emissions for individuals in the UK are 5.40 tonnes (10.80 tonnes for two people).
 
Source                  CO2, 2019 (t)  CO2, 2020 (t)  CO2, 2021 (t)  CO2, 2022 (t)  CO2, 2023 (t)
Electricity                      0.50           0.40           0.59           1.14           1.66
Natural gas                      1.18           1.26           1.66           0.81          -0.25
Flights                          5.76           2.26           1.90           5.34           1.32
Car                              1.45           0.39           0.39           1.01           1.00
Bus                              0.00           0.01           0.02           0.01           0.31
National rail                    0.08           0.01           0.02           0.00           0.70
International rail               0.02           0.01           0.00           0.04           0.01
Taxi                             0.01           0.01           0.01           0.01           0.01
Food and drink                   1.69           1.11           1.05           1.35           1.07
Pharmaceuticals                  0.26           0.32           0.31           0.06           0.13
Clothing                         0.03           0.06           0.06           0.12           0.23
Paper-based products             0.34           0.15           0.14           0.37           0.38
Computer usage                   1.30           1.48           0.75           0.93           0.23
Electrical                       0.12           0.29           0.19           0.03           0.01
Non-fuel car                     0.00           0.10           0.00           0.12           0.92
Manufactured goods               0.50           0.03           0.03           0.05           0.11
Hotels, restaurants              0.51           0.16           0.15           0.10           1.21
Telecoms                         0.15           0.05           0.04           0.03           0.05
Finance                          0.24           0.24           0.22           0.04           0.02
Insurance                        0.19           0.11           0.10           0.04           0.04
Education                        0.05           0.00           0.04           0.01           0.00
Recreation                       0.09           0.06           0.05           0.03           0.06
Total                           14.47           8.50           7.73          11.65           9.25

The headline result is that our total carbon emissions fell in 2023 compared to 2022. That's mostly driven by a large decrease in the number of flights, from twenty in 2022 to just four in 2023. Twenty flights is a large number, a consequence of living in Finland. I moved back to the UK in February 2023. That meant some flights to tidy up my life in Finland, but I've not flown again since then. In 2024 I'm hoping to push that down to zero flights.

The reduction in flights was partly offset by increased train and bus travel, largely due to my weekly commute between Cambridge and London for work. I took the journey 88 times, giving me a massive total distance travelled of 19 638 km by national rail. Thankfully trains are also far more carbon efficient than planes, so while distance travelled only reduced by a factor of 1.5, carbon emissions reduced by a factor of 5.75.
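
For a rough per-kilometre comparison, the 2023 flight and national rail figures from the two tables in this post can be divided through. This is a back-of-the-envelope sketch which assumes the emissions and distance rows cover the same journeys. It's a different measure from the year-on-year factors quoted above.

```python
# 2023 figures taken from the emissions and distance tables in this post.
FLIGHTS_TONNES, FLIGHTS_KM = 1.32, 7233
RAIL_TONNES, RAIL_KM = 0.70, 19638


def kg_per_km(tonnes: float, km: float) -> float:
    """Convert a yearly total in tonnes of CO2 into kilograms per kilometre."""
    return tonnes * 1000.0 / km


flights_intensity = kg_per_km(FLIGHTS_TONNES, FLIGHTS_KM)
rail_intensity = kg_per_km(RAIL_TONNES, RAIL_KM)
ratio = flights_intensity / rail_intensity

print(f"Flights: {flights_intensity:.3f} kg CO2 per km")
print(f"National rail: {rail_intensity:.3f} kg CO2 per km")
print(f"Flights emit about {ratio:.1f} times more per km")
```

On these figures national rail comes out at roughly a fifth of the per-kilometre emissions of flying.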

One potentially confusing thing about the numbers is that natural gas usage is a negative figure. We switched from a gas boiler to a heat pump, with the result that our gas usage tumbled. But of course it wasn't negative! The negative value is due to our power company overestimating our gas usage as a result of our heating change. The overestimate was included in the figures for last year and this negative figure redresses that.

The following table gives more detail about the numbers used to perform the calculations. After pulling these together I then fed them into Carbon Footprint Ltd's carbon calculator as I have in previous years to generate the results.
 
Source                         2019         2020         2021         2022         2023
Electricity               1 794 kWh    1 427 kWh    3 009 kWh    4 101 kWh    5 975 kWh
Natural gas               6 433 kWh    6 869 kWh    9 089 kWh    4 439 kWh   -1 362 kWh
Flights                   36 580 km    14 632 km    25 542 km    36 042 km     7 233 km
                         20 flights    8 flights   14 flights   20 flights    4 flights
Car                       11 910 km     2 000 km     3 219 km     8 458 km     8 369 km
Bus                        1 930 km        40 km       168 km       133 km     3 080 km
National rail              5 630 km       400 km       676 km         0 km    19 638 km
International rail            64 km     1 368 km       513 km     8 684 km     2 322 km
Taxi                          64 km        37 km       100 km       100 km       100 km
Tube                           0 km         0 km         0 km         0 km       100 km

As in previous years I've used the UN Framework Convention on Climate Change to offset my carbon output. The money will go to pay for improved cooking stoves in Malawi, a scheme managed by Ripple Africa.
 
Cancellation Certificate from offset.climateneutralnow.org, 10 CERs, equivalent to 10 tonnes of CO2

 
5 Jan 2024 : Day 129 #
Yesterday we were looking at the User Agent strings, which I updated for ESR 91, but which we concluded weren't the cause of the poorly rendering websites I've been experiencing.

One site that I'm particularly disappointed at being broken — and which I also mentioned yesterday — is DuckDuckGo. As this is currently my preferred search engine (although I admit I'm on the look out for a new one) it would be rather convenient for me personally if it was working well with the Sailfish Browser.

There is a workaround, in that DuckDuckGo already provides a non-JavaScript version of their site at https://duckduckgo.com/html. This site already works just fine with ESR 91 and I admit it's these kinds of considerate touches that keep me going back to use the site. But that doesn't detract from the fact that the standard JavaScript-enabled site really ought to be working just fine with ESR 91.

So today I want to investigate further what the problem is. I'm a bit short on time today, so not expecting to get to the bottom of it, but it would be good to make a start.

It's also worth recalling that there were some problems with getting DuckDuckGo to work on previous versions of the browser. Digging through the git logs I eventually hit this, in the embedlite-components repository, which is likely what I'm thinking of:
$ git log --grep "DuckDuckGo"
commit 04bf236f9a57bdd01d02d83f02e55b848a8ed1af (upstream/jb45659, origin/jb45659)
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Fri Aug 14 16:24:22 2020 +0000

    [embedlite-components] Ensure touch events also fire mouse clicks.
    Fixes JB#45659
    
    Gesture single taps were previously configured to send out mouse events
    only if they were preceded by a touchstart event, and only if
    preventDefault() wasn't applied to the event.
    
    Other browsers send out the event independent of this. The difference
    manifests itself if stopPropagation() is applied to the touch event,
    which supresses the mouse events on our browser, but not on others.
    
    For example, this meant that the input field of DuckDuckGo couldn't be
    focussed, and also prevented the Google Maps touch controls from
    working.
    
    This change alters things so that the mouse event is sent out even if
    single tap isn't preceeded by a touch event.
Please try to ignore the fact that back in 2020 I apparently didn't know how to spell either preceded or suppresses. I want to focus on the fact that today I'm experiencing something different on ESR 91: now the site just doesn't render at all.

When rendering the site, the only output in the console log is the following:
[JavaScript Warning: "Cookie “_ga_NL3SRVXF92” will be soon rejected because it
    has the “SameSite” attribute set to “None” or an invalid value, without the
    “secure” attribute. To know more about the “SameSite“ attribute, read
    https://developer.mozilla.org/docs/Web/HTTP/Headers/Set-Cookie/SameSite"
    {file: "https://www.googletagmanager.com/gtag/js?id=G-NL3SRVXF92&l=dataLayer&cx=c"
    line: 307}]
That's a warning not an error and even if it was an error it doesn't look like it would be too serious. Not serious enough to prevent the site from rendering at any rate. So this doesn't help us much.

Given the complete lack of rendering it's hard to even know whether the correct site is being transferred or not: it may be that the browser is getting stuck on a redirect or something mundane like that.

To check this we can set the EMBED_CONSOLE environment variable to the value "network". This is a special value that, as well as printing out full debug information, also outputs everything textual that's sent between the network and the browser. This can generate crazy quantities of debug output, so it's almost never a good idea to use it. But in this case where we're just checking the contents of a single page being downloaded it can be really helpful.

Let's give it a go.

In all of the output below I've made quite significant abridgements to prevent the output from filling up multiple browser pages. I've tried to remove only uninteresting parts (mostly, the returned HTML code).
$ EMBED_CONSOLE="network" MOZ_LOG="EmbedLite:5" sailfish-browser \
    https://duckduckgo.com/
[...]
[JavaScript Error: "Unexpected event profile-after-change" {file:
    "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12
[...]
CONSOLE message:
OpenGL compositor Initialized Succesfully.
Version: OpenGL ES 3.2 V@415.0 (GIT@248cd04, I42b5383e2c, 1569430435)
    (Date:09/25/19)
Vendor: Qualcomm
Renderer: Adreno (TM) 610
FBO Texture Target: TEXTURE_2D
[Parent 11746: Unnamed thread 7bfc002670]: I/EmbedLite WARN: EmbedLite::virtual
    void* mozilla::embedlite::EmbedLitePuppetWidget::GetNativeData(uint32_t):127
    EmbedLitePuppetWidget::GetNativeData not implemented for this type
JSScript: ContextMenuHandler.js loaded
JSScript: SelectionPrototype.js loaded
JSScript: SelectionHandler.js loaded
JSScript: SelectAsyncHelper.js loaded
JSScript: FormAssistant.js loaded
JSScript: InputMethodHandler.js loaded
EmbedHelper init called
Available locales: en-US, fi, ru
Frame script: embedhelper.js loaded
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,
            image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Upgrade-Insecure-Requests : 1
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : cross-site
        If-None-Match : "65959a92-471e"
        TE : trailers
    [ Response headers -------------------------------------- ]
        server : nginx
        date : Wed, 03 Jan 2024 21:07:55 GMT
        content-type : text/html; charset=UTF-8
        content-length : 18206
        vary : Accept-Encoding
        etag : "65959a94-471e"
        content-encoding : br
        strict-transport-security : max-age=31536000
        permissions-policy : interest-cohort=()
        content-security-policy : default-src 'none' ; connect-src
            https://duckduckgo.com https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; manifest-src  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; media-src  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; script-src blob:  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com 'unsafe-inline' 'unsafe-eval' ;
            font-src data:  https://duckduckgo.com https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; img-src data:  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; style-src  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com 'unsafe-inline' ; object-src 'none' ;
            worker-src blob: ; child-src blob:  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; frame-src blob:  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; form-action  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; frame-ancestors 'self' ;
            base-uri 'self' ; block-all-mixed-content ;
        x-frame-options : SAMEORIGIN
        x-xss-protection : 1;mode=block
        x-content-type-options : nosniff
        referrer-policy : origin
        expect-ct : max-age=0
        expires : Wed, 03 Jan 2024 21:07:54 GMT
        cache-control : no-cache
        X-Firefox-Spdy : h2
    [ Document content -------------------------------------- ]
<!DOCTYPE html><html lang="en-US"><head><meta charSet="utf-8"/>
    <meta name="viewport" content="width=device-width, initial-scale=1,
    user-scalable=1 , viewport-fit=auto"/><link rel="preload"
    href="/static-assets/font/ProximaNova-RegIt-webfont.woff2" as="font"
    type="font/woff2" crossorigin="anonymous"/>
    [...]
    <script>window.onerror = function _onerror(msg, source, lineno, colno) {
  var url = "https://improving.duckduckgo.com/t/" + "static_err_global?" +
      Math.ceil(Math.random() * 1e7) + "&msg=" + encodeURIComponent(msg) +
      "&url=" + encodeURIComponent(source) + "&pathname=" +
      encodeURIComponent(window.location && window.location.pathname || "") +
      "&line=" + lineno + "&col=" + colno;

  try {   
    if (window.navigator.sendBeacon) {
      window.navigator.sendBeacon(url);
    } else {
      var img = document.createElement("img");
      img.src = url;
    }
  } catch (e) {// noop
  }
};</script>
[...]
        Document output truncated by 73144 bytes
    [ Document content ends --------------------------------- ]
JavaScript error: chrome://embedlite/content/embedhelper.js, line 259:
    TypeError: sessionHistory is null
CONSOLE message:
[JavaScript Error: "TypeError: sessionHistory is null"
    {file: "chrome://embedlite/content/embedhelper.js" line: 259}]
receiveMessage@chrome://embedlite/content/embedhelper.js:259:29

There's clearly plenty of document data being downloaded. I'm not sure what to make of this. It doesn't look like the site content is being held back; there's no obvious redirect. Frankly it all looks in order.
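
With this much output, checking for redirects by eye is error-prone. A short script can pull out just the method, status and URL of each request. This is a sketch that assumes the log has been captured as text and keeps the "Request: ... status: ..." and "URL: ..." line layout shown above. The function names are illustrative rather than anything provided by the browser.

```python
import re
from typing import List, Tuple

# Patterns for the two log lines of interest, as they appear in the output.
REQUEST_RE = re.compile(r"^\s*Request:\s+(\S+)\s+status:\s+(\d{3})\b")
URL_RE = re.compile(r"^\s*URL:\s+(\S+)")

Entry = Tuple[str, int, str]  # (method, status code, URL)


def summarise_requests(log_text: str) -> List[Entry]:
    """Pair each 'Request:' line with the 'URL:' line that follows it."""
    entries: List[Entry] = []
    pending = None
    for line in log_text.splitlines():
        match = REQUEST_RE.match(line)
        if match:
            pending = (match.group(1), int(match.group(2)))
            continue
        match = URL_RE.match(line)
        if match and pending is not None:
            entries.append((pending[0], pending[1], match.group(1)))
            pending = None
    return entries


def redirects(entries: List[Entry]) -> List[Entry]:
    """Keep only the entries with a 3xx status code."""
    return [entry for entry in entries if 300 <= entry[1] < 400]


# Lines copied from the log output above.
SAMPLE = "\n".join([
    "[ Request details ------------------------------------------- ]",
    "    Request: GET status: 200 OK",
    "    URL: https://duckduckgo.com/",
    "    [ Request headers --------------------------------------- ]",
    "        Host : duckduckgo.com",
])

entries = summarise_requests(SAMPLE)
print(entries)
print(redirects(entries))
```

An empty list from redirects would back up the impression that the page isn't getting stuck on a redirect.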

As a separate test I also tried turning off JavaScript to see how this might affect the rendering, but in that case the site was clever enough to notice and redirected the browser to https://html.duckduckgo.com/, which of course then rendered just fine.

We can compare the output with JavaScript disabled to that generated from the plain HTML version of the page.
$ EMBED_CONSOLE="network" MOZ_LOG="EmbedLite:5" sailfish-browser \
    https://html.duckduckgo.com/
[...]
[W] unknown:0 - Unable to open bookmarks  "/home/defaultuser/.local/share/
    org.sailfishos/browser/bookmarks.json"
[...]
[Parent 14102: Unnamed thread 776c002670]: E/EmbedLite FUNC::static bool
    GeckoLoader::InitEmbedding(const char*):230 InitEmbedding successfully
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change" {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12
[...]
JSScript: ContextMenuHandler.js loaded
JSScript: SelectionPrototype.js loaded
JSScript: SelectionHandler.js loaded
JSScript: SelectAsyncHelper.js loaded
JSScript: FormAssistant.js loaded
JSScript: InputMethodHandler.js loaded
EmbedHelper init called
Available locales: en-US, fi, ru
Frame script: embedhelper.js loaded
[ Request details ------------------------------------------- ]
    Request: GET status: 302 Found
    URL: https://html.duckduckgo.com/
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://html.duckduckgo.com/html/
    [ Request headers --------------------------------------- ]
        Host : html.duckduckgo.com
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,
            image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Upgrade-Insecure-Requests : 1
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : cross-site
    [ Response headers -------------------------------------- ]
        server : nginx
        date : Wed, 03 Jan 2024 21:31:43 GMT
        content-type : text/html
        content-length : 138
        location : https://html.duckduckgo.com/html/
        strict-transport-security : max-age=31536000
        permissions-policy : interest-cohort=()
        content-security-policy : default-src 'none' ; connect-src
            https://duckduckgo.com https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; manifest-src  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; media-src  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; script-src blob:  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com 'unsafe-inline' 'unsafe-eval' ;
            font-src data:  https://duckduckgo.com https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; img-src data:  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; style-src  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com 'unsafe-inline' ; object-src 'none' ;
            worker-src blob: ; child-src blob:  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; frame-src blob:  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; form-action  https://duckduckgo.com
            https://*.duckduckgo.com
            https://duckduckgogg42xjoc72x3sjasowoarfbgcmvfimaftt6twagswzczad.onion/
            https://spreadprivacy.com ; frame-ancestors 'self' ;
            base-uri 'self' ; block-all-mixed-content ;
        x-frame-options : SAMEORIGIN
        x-xss-protection : 1;mode=block
        x-content-type-options : nosniff
        referrer-policy : origin
        expect-ct : max-age=0
        expires : Thu, 02 Jan 2025 21:31:43 GMT
        cache-control : max-age=31536000
        x-duckduckgo-locale : en_GB
        X-Firefox-Spdy : h2
    [ Request headers --------------------------------------- ]
        Host : html.duckduckgo.com
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,
            image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Upgrade-Insecure-Requests : 1
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : cross-site
        TE : trailers
    [ Response headers -------------------------------------- ]
        server : nginx
        date : Wed, 03 Jan 2024 21:31:43 GMT
        content-type : text/html; charset=UTF-8
        vary : Accept-Encoding
        server-timing : total;dur=14;desc="Backend Total"
        strict-transport-security : max-age=31536000
        permissions-policy : interest-cohort=()
        content-security-policy : default-src 'none' ; connect-src
            https://duckduckgo.com https://*.duckduckgo.com
[...]
        x-frame-options : SAMEORIGIN
        x-xss-protection : 1;mode=block
        x-content-type-options : nosniff
        referrer-policy : origin
        expect-ct : max-age=0
        expires : Wed, 03 Jan 2024 21:31:44 GMT
        cache-control : max-age=1
        x-duckduckgo-locale : en_GB
        content-encoding : br
        X-Firefox-Spdy : h2
    [ Document content -------------------------------------- ]
        Document output skipped, content-type non-text or unknown
    [ Document content ends --------------------------------- ]
    [ Document content -------------------------------------- ]
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<!--[if IE 6]><html class="ie6" xmlns="http://www.w3.org/1999/xhtml"><![endif]-->
<!--[if IE 7]><html class="lt-ie8 lt-ie9" xmlns="http://www.w3.org/1999/xhtml"><![endif]-->
<!--[if IE 8]><html class="lt-ie9" xmlns="http://www.w3.org/1999/xhtml"><![endif]-->
<!--[if gt IE 8]><!--><html xmlns="http://www.w3.org/1999/xhtml"><!--<![endif]-->
<head>
  <link rel="canonical" href="https://duckduckgo.com/" />
  <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0,
      maximum-scale=3.0, user-scalable=1" />
  <meta name="referrer" content="origin" />
  <title>DuckDuckGo</title>
  <link title="DuckDuckGo (HTML)" type="application/opensearchdescription+xml"
      rel="search" href="//duckduckgo.com/opensearch_html_v2.xml" />
  <link rel="icon" href="//duckduckgo.com/favicon.ico" type="image/x-icon" />
  <link rel="apple-touch-icon" href="//duckduckgo.com/assets/logo_icon128.v101.png" />
  <link rel="image_src" href="//duckduckgo.com/assets/logo_homepage.normal.v101.png" />
  <link rel="stylesheet" media="handheld, all" href="//duckduckgo.com/dist/h.aeda52882d97098ab9ec.css" type="text/css"/>
</head>

<body class="body--home body--html">
[...]
</html>

    [ Document content ends --------------------------------- ]
JavaScript error: chrome://embedlite/content/embedhelper.js, line 259:
    TypeError: sessionHistory is null
CONSOLE message:
[JavaScript Error: "TypeError: sessionHistory is null"
    {file: "chrome://embedlite/content/embedhelper.js" line: 259}]
receiveMessage@chrome://embedlite/content/embedhelper.js:259:29

JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor 'LoginManager'
CONSOLE message:
[JavaScript Error: "NotFoundError: WindowGlobalChild.getActor: No such
    JSWindowActor 'LoginManager'"
    {file: "resource://gre/modules/LoginManagerChild.jsm" line: 541}]
forWindow@resource://gre/modules/LoginManagerChild.jsm:541:27
handleEvent@chrome://embedlite/content/embedhelper.js:433:29
The main difference in this second case is that now we have a 302 "Found" redirect returned from the site, which then takes us to the JavaScript-free HTML version. Other than this, comparing the two, I unfortunately don't see any obvious signs of why one might be working but the other not.
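
Comparing long header dumps by eye is error-prone, so here's a rough sketch of how the comparison could be done mechanically. This isn't something the browser or the logging code does: the diff_headers() function is made up for illustration, and the sample data is just a handful of the response headers copied from the log above.

```python
def diff_headers(first, second):
    """Return the headers that differ between two responses.

    Each result maps a header name to a (first_value, second_value) pair,
    with None standing in where a header is missing from one side.
    """
    changes = {}
    for name in sorted(set(first) | set(second)):
        a, b = first.get(name), second.get(name)
        if a != b:
            changes[name] = (a, b)
    return changes


# A few response headers copied from the log above: the 302 redirect and
# the 200 response that follows it. This is a subset, not the full set.
redirect = {
    "content-type": "text/html",
    "content-length": "138",
    "location": "https://html.duckduckgo.com/html/",
    "cache-control": "max-age=31536000",
    "x-frame-options": "SAMEORIGIN",
}
final = {
    "content-type": "text/html; charset=UTF-8",
    "content-encoding": "br",
    "cache-control": "max-age=1",
    "x-frame-options": "SAMEORIGIN",
}

for name, (before, after) in diff_headers(redirect, final).items():
    print(f"{name}: {before!r} -> {after!r}")
```

Headers that are identical on both sides, such as x-frame-options, drop out of the result, which leaves only the differences to look through.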

So I'm a bit stumped. At this point my head is spinning a bit; it's late here. But I'm frustrated and don't want to stop just yet. Maybe there's another — simpler — site that exhibits the same problem? If I could find something like that, it might help.

So to try to take this further I tested all of the pages from the Sailfish Browser test suite. I thought maybe something might show up as broken there that, if fixed, might also fix the DuckDuckGo page.

I was surprised to discover that most of these pages already work as expected, including screen taps, SVGs and (some) videos. Audio wasn't working, neither was the file picker and, perhaps most problematic, the back button is also broken (although, weirdly, the page history seems to be working just fine).

So several things to fix, but none of the pages failed to render at all. So this also brought me no closer to figuring out what the problem might be with DuckDuckGo.

It's been a day of testing today it seems, rather than fixing. If I'm honest I'm a bit stumped and very frustrated right now. Maybe I need to sleep on it to try to think of another angle to approach it from.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
4 Jan 2024 : Day 128 #
It's the big one today! Finally we reached 2^7 days of Gecko development work. Looking at how things are progressing, it's looking hopeful that the work may be completed before we hit the 2^8th day of work. But I'm not willing to make any more of a prediction than that!

Before moving on to coding, let's dwell on this a little further. I've been spending, I think (I've not actually been measuring this), between two and three hours a day working on this. Probably half that time is spent coding and the other half writing these posts. So that means 90 minutes of coding time spent on gecko for each of the 2^7 days.

That makes 192 hours of work, or 24 work days (eight hours a day), or nearly five weeks of "Full Time Equivalent" work. During holidays I've probably spent a bit more than this, so let's call it seven weeks.
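
For anyone who wants to check the sums, here they are as a few lines of Python. The only thing not taken from the text above is the five-day working week, which is my assumption for turning work days into weeks.

```python
DAYS = 128                   # Day 128 of the diary
CODING_HOURS_PER_DAY = 1.5   # the 90 minutes estimated above

total_hours = DAYS * CODING_HOURS_PER_DAY
work_days = total_hours / 8  # eight-hour working days
work_weeks = work_days / 5   # assumption: five-day working weeks

print(total_hours, work_days, work_weeks)  # 192.0 24.0 4.8
```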

That might sound like a lot, but in practice it's not much time at all when it comes to software development. I can imagine progress probably looks very slow to anyone reading this, but unfortunately it's more a product of the fact I can only spend some limited amount of time on this per day. I do wish there were more hours in the day to work with!

Alright, let's waste no more time on this side discussion then and get straight back to coding.

It's finally time to move on from printing, as promised, to User Agent string testing. This falls under Issue #1052 which I've now assigned to myself as my next task.

Let's get some bearings. The magic for User Agent fixes happens inside a file called ua-update.json that can be found inside your browser profile folder. It's a JSON file but un-prettified into a single line which makes it a little tricky to read. We can make it a little clearer by re-prettifying it. I'm just putting the first few lines here to give an idea about the contents, but feel free to run this on your own phone if you'd like to see the full list of fixes.
$ python3 -m json.tool ~/.local/share/org.sailfishos/browser/.mozilla/ua-update.json
{
    "msn.com": "Mozilla/5.0 (Android 8.1.0; Mobile; rv:78.0) Gecko/78.0 Firefox/78.0",
    "r7.com": "Mozilla/5.0 (Android; Mobile; rv:78.0) Gecko/78.0 Firefox/78.0",
    "baidu.com": "Mozilla/5.0 (Android 8.1.0; Mobile; rv:78.0) Gecko/78.0 Firefox/78.0",
    "bing.com": "Mozilla/5.0 (Android; Mobile; rv:78.0) Gecko/78.0 Firefox/78.0",
    "kuruc.info": "Mozilla/5.0 (Android; Mobile; rv:78.0) Gecko/78.0 Firefox/78.0",
[...]
}
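
To give a feel for how a table like this might be used, here's a minimal sketch of a lookup. It isn't the browser's actual matching code, which isn't shown in this post. The user_agent_for() function is hypothetical, and I'm assuming that an entry for msn.com should also apply to subdomains such as www.msn.com. The default string is the one that appears in the network logs.

```python
import json

# Two of the entries from the prettified file shown above.
OVERRIDES = json.loads("""
{
    "msn.com": "Mozilla/5.0 (Android 8.1.0; Mobile; rv:78.0) Gecko/78.0 Firefox/78.0",
    "bing.com": "Mozilla/5.0 (Android; Mobile; rv:78.0) Gecko/78.0 Firefox/78.0"
}
""")

# The default User Agent string as seen in the network logs.
DEFAULT_UA = "Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101 Firefox/91.0"


def user_agent_for(host, overrides=OVERRIDES, default=DEFAULT_UA):
    """Hypothetical lookup: try the host, then each parent domain in turn.

    The bare top-level domain (e.g. "com") is never tried.
    """
    labels = host.lower().split(".")
    for start in range(len(labels) - 1):
        candidate = ".".join(labels[start:])
        if candidate in overrides:
            return overrides[candidate]
    return default


print(user_agent_for("www.msn.com"))
print(user_agent_for("duckduckgo.com"))
```

With these two entries, www.msn.com picks up the msn.com override, while duckduckgo.com has no entry and so falls back to the default.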
The ua-update.json file is generated by a preprocessing step, so what we see here is the end result of that process. The processing happens during the sailfish-browser build and the raw pieces can be found in the data folder:
$ tree data
data
├── 70-browser.conf
├── prefs.js
├── preprocess-ua-update-json
├── README
├── ua
│   ├── 38.8.0
│   │   └── ua-update.json
│   ├── 45.9.1
│   │   └── ua-update.json
│   ├── 52.9.1
│   │   └── ua-update.json
│   ├── 60.0
│   │   └── ua-update.json
│   ├── 60.9.1
│   │   └── ua-update.json
│   └── 78.0
│       └── ua-update.json
├── ua-update.json
└── ua-update.json.in

7 directories, 12 files
There are also some helpful instructions to be found in this folder:
$ cat data/README 
How to update ua-update.json
============================

0) git checkout next
1) Edit ua-update.json.in
2) Run preprocess-ua-update-json
3) Copy preprocessed ua-update.json to ua/<engine-version>
4) git commit ua-update.json.in ua/<engine-version>/ua-update.json
5) Create merge request

Before we get into these steps let's first find out what User Agent string is actually being used when we visit a Web site. The easiest way to do this is to actually go to a site that echoes the User Agent back to us.
 
DuckDuckGo showing the original and overridden User Agent strings; Bing and Google don't show the string

On DuckDuckGo the User Agent string is announced as "Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101 Firefox/91.0". That doesn't look too bad. But it's also not one of the sites we currently have an override set for.

Neither Bing nor Google, both of which are in the list, do the courtesy of echoing back the user agent when you search on it and none of the other sites in the list look promising for doing this either. So I'm going to hack a user agent string in for DuckDuckGo to see whether or not we get any results.

I'll do this by just switching the msn.com string to duckduckgo.com instead.
$ cd ~/.local/share/org.sailfishos/browser/.mozilla/
$ cp ua-update.json ua-update.json.bak
$ sed -i -e 's/msn\.com/duckduckgo\.com/g' ua-update.json
Now when we visit the site we get a newly updated User Agent string: "Mozilla/5.0 (Android 8.1.0; Mobile; rv:78.0) Gecko/78.0 Firefox/78.0".

We can conclude that the ua-update.json file is working as expected, but we will need to update it to reflect the new rendering engine version. Before proceeding I'm going to undo the change I just made. It's not like it really matters on my development phone, but it helps keep things neat and tidy.
$ mv ua-update.json.bak ua-update.json
Back to the sailfish-browser repository and I'll do a search-and-replace on the references to "78.0" and replace them with "91.0". The file may need a little more refinement based on how the rendering engine works with different sites, but this simple change should give us a more solid base to build on.
$ pushd data
$ sed -i -e 's/78\.0/91\.0/g' ua-update.json.in 
$ git diff | head -n 10
diff --git a/data/ua-update.json.in b/data/ua-update.json.in
index be4352e6..584720c0 100644
--- a/data/ua-update.json.in
+++ b/data/ua-update.json.in
@@ -1,116 +1,116 @@
 // Everything after the first // on a line will be removed by the preproccesor.
 // Send these sites a custom user-agent. Bugs should be included with an entry.
 {
-  // Ref: "Mozilla/5.0 (Mobile; rv:78.0) Gecko/78.0 Firefox/78.0"
+  // Ref: "Mozilla/5.0 (Mobile; rv:91.0) Gecko/91.0 Firefox/91.0"
That looks pretty healthy. Let's complete the rest of the steps as suggested in the README.
$ ./preprocess-ua-update-json 
[sailfishos-esr91 e00b4335] [user-agent] Update preprocessed user agent overrides
 1 file changed, 89 insertions(+), 89 deletions(-)
 rewrite data/ua-update.json (98%)
$ gedit ua-update.json
$ ls ua
38.8.0  45.9.1  52.9.1  60.0  60.9.1  78.0
$ mkdir ua/91.0
$ cp ua-update.json ua/91.0/
$ git add ua-update.json.in ua/91.0/ua-update.json
$ git commit -m "[sailfish-browser] Update user agent overrides for ESR 91.0"
$ popd
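
The preprocess-ua-update-json script itself isn't shown here, but the comment at the top of ua-update.json.in tells us the rule it applies: everything after the first // on a line gets removed. As a rough idea of what such a step might involve, here's a sketch in Python. The preprocess() function and the sample input are made up in the style of the real file, and writing the result out on a single line is my assumption based on how the file looks in the profile folder.

```python
import json


def preprocess(text):
    """Strip // comments, then re-serialise the JSON on a single line.

    Applies the rule from the header of ua-update.json.in: everything
    after the first // on a line is removed. A value that contains //
    (a URL, say) would be truncated by this naive approach.
    """
    stripped = "\n".join(line.split("//", 1)[0] for line in text.splitlines())
    return json.dumps(json.loads(stripped), separators=(",", ":"))


# Made-up sample in the style of ua-update.json.in, not the real contents.
SAMPLE = """\
// Send these sites a custom user-agent.
{
  // Ref: "Mozilla/5.0 (Mobile; rv:91.0) Gecko/91.0 Firefox/91.0"
  "msn.com": "Mozilla/5.0 (Android 8.1.0; Mobile; rv:91.0) Gecko/91.0 Firefox/91.0",
  "bing.com": "Mozilla/5.0 (Android; Mobile; rv:91.0) Gecko/91.0 Firefox/91.0"
}
"""

print(preprocess(SAMPLE))
```

Parsing the stripped text with json.loads() before writing it back out has the handy side effect of catching any syntax errors introduced while editing the input file.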
So I've now built and installed the updated sailfish-browser package. Checking the ua-update.json file on my phone I can see that the file hasn't changed after the installation. However, after deleting the ua-update.json file in the profile folder and re-running the browser, the changes have now taken root. Great.

Although that's updated the overrides, unfortunately it hasn't fixed some of the broken sites such as DuckDuckGo.com, which still just displays a blank page. My recollection is that this happened with ESR 78 too, but there must have been a different reason for it. There may even be an answer in amongst the patches. This will have to be my next task for tomorrow.

3 Jan 2024 : Day 127 #
As we rounded things off yesterday I was heading into a new task. My plan is to look into the User Agent string functionality of ESR 91. My hunch is that the overrides aren't working correctly in the updated build, although I'm not yet sure whether this is the case or why it might be the case. But I'm hoping that this is the reason for some sites currently working poorly with the updated browser engine.

But before I do that I want to try to address an issue raised by rob_kouw on the Sailfish forum. Or, to be more accurate, raised by Rob's wife. Here's Rob's comment:
 
I just read this morning’s post where you solved the pdf printing (hurray!) I remarked to my wife about the digging into this long, long list of threads: where to start?

After some more explanation she yells: “Who wants hidden browser tabs?”

And in fact, couldn’t these be misused at some point?


Apart from the fact that I'm chuffed that Rob is taking the time to read these posts, it's also a very good point: are these hidden browser tabs I've spent the last couple of weeks implementing going to be safe? Wouldn't it be a bad thing if websites have the ability to spawn hidden tabs without the user realising it?

I also want to acknowledge the useful input I received privately from thigg on this topic as well. These comments combined have provided really useful input when considering the final architecture.

So when we consider the security of these hidden tabs my claim is that it's safe because although there is now functionality to hide tabs, there is in fact only one way that this can happen and that's as a direct consequence of the user selecting the "Save web page to PDF" option in the Sailfish Browser menu. There's no other way it can happen and especially not through some malicious Web site creating hidden tabs without the user's consent.

But right now this is just a hypothesis. It deserves exploring in more depth.

Let's start at the sharp end. The tabs are hidden or shown based on the value stored in the hidden role in the DeclarativeTabModel class:
QVariant DeclarativeTabModel::data(const QModelIndex & index, int role) const {
    if (index.row() < 0 || index.row() >= m_tabs.count())
        return QVariant();

    const Tab &tab = m_tabs.at(index.row());
    if (role == ThumbPathRole) {
        return tab.thumbnailPath();
    } else if (role == TitleRole) {
        return tab.title();
    } else if (role == UrlRole) {
        return tab.url();
    } else if (role == ActiveRole) {
        return tab.tabId() == m_activeTabId;
    } else if (role == TabIdRole) {
        return tab.tabId();
    } else if (role == DesktopModeRole) {
        return tab.desktopMode();
    } else if (role == HiddenRole) {
        return tab.hidden();
    }
    return QVariant();
}
This hidden value is exposed in the QML, but it's not exposed to the JavaScript of the page in the tab and it's also not a value that can be set after the page has been created: it's essentially constant.

As we can see from the last few lines, the hidden state comes from the Tab class. The only way to set it is when the tab is instantiated:
Tab::Tab(int tabId, const QString &url, const QString &title,
    const QString &thumbPath, const bool hidden)
    : m_tabId(tabId)
    , m_requestedUrl(url)
    , m_url(url)
    , m_title(title)
    , m_thumbPath(thumbPath)
    , m_desktopMode(false)
    , m_hidden(hidden)
    , m_browsingContext(0)
    , m_parentId(0)
{
}
As we've seen in previous posts, the route that the hidden flag takes to get to this point is circuitous at best. But it all goes through only C++ code in the EmbedLite portion of the Sailfish gecko-dev project. None of this is exposed, either for reading or writing, to JavaScript. And it all leads back to this one location:
NS_IMETHODIMP
WindowCreator::CreateChromeWindow(nsIWebBrowserChrome *aParent,
                                  uint32_t aChromeFlags,
                                  nsIOpenWindowInfo *aOpenWindowInfo,
                                  bool *aCancel,
                                  nsIWebBrowserChrome **_retval)
{
[...]
  const bool isForPrinting = aOpenWindowInfo->GetIsForPrinting();

  mChild->CreateWindow(parentID, reinterpret_cast<uintptr_t>(parentBrowsingContext.get()), aChromeFlags, isForPrinting, &createdID, aCancel);
This requires just a little explanation: it's the isForPrinting flag which becomes the hidden flag. This point here is the only route to the flag being set. But that's not enough to give us the confidence that there's only one way to control this flag. We also need to confirm how the isForPrinting flag gets set.

There are two places this gets called from, both of them in nsWindowWatcher.cpp. The first:
NS_IMETHODIMP
nsWindowWatcher::OpenWindowWithRemoteTab(nsIRemoteTab* aRemoteTab,
                                         const nsACString& aFeatures,
                                         bool aCalledFromJS,
                                         float aOpenerFullZoom,
                                         nsIOpenWindowInfo* aOpenWindowInfo,
                                         nsIRemoteTab** aResult) {
And the second:
nsresult nsWindowWatcher::OpenWindowInternal(
    mozIDOMWindowProxy* aParent, const nsACString& aUrl,
    const nsACString& aName, const nsACString& aFeatures, bool aCalledFromJS,
    bool aDialog, bool aNavigate, nsIArray* aArgv, bool aIsPopupSpam,
    bool aForceNoOpener, bool aForceNoReferrer, PrintKind aPrintKind,
    nsDocShellLoadState* aLoadState, BrowsingContext** aResult) {
[...]
For the first, the crucial data going into this method is aOpenWindowInfo. For the second, the flag gets set like this:
    openWindowInfo->mIsForPrinting = aPrintKind != PRINT_NONE;
Consequently the crucial value going in is aPrintKind.

Let's now follow back the first of these. It comes via ContentParent::CommonCreateWindow() where it's passed in as the aForPrinting parameter like this:
mozilla::ipc::IPCResult ContentParent::CommonCreateWindow(
    PBrowserParent* aThisTab, BrowsingContext* aParent, bool aSetOpener,
    const uint32_t& aChromeFlags, const bool& aCalledFromJS,
    const bool& aWidthSpecified, const bool& aForPrinting,
    const bool& aForWindowDotPrint, nsIURI* aURIToLoad,
    const nsCString& aFeatures, const float& aFullZoom,
    BrowserParent* aNextRemoteBrowser, const nsString& aName, nsresult& aResult,
    nsCOMPtr<nsIRemoteTab>& aNewRemoteTab, bool* aWindowIsNew,
    int32_t& aOpenLocation, nsIPrincipal* aTriggeringPrincipal,
    nsIReferrerInfo* aReferrerInfo, bool aLoadURI,
    nsIContentSecurityPolicy* aCsp, const OriginAttributes& aOriginAttributes) {
[...]
  openInfo->mIsForPrinting = aForPrinting;
[...]
  aResult = pwwatch->OpenWindowWithRemoteTab(thisBrowserHost, aFeatures,
                                             aCalledFromJS, aFullZoom, openInfo,
                                             getter_AddRefs(aNewRemoteTab));
[...]
Where this is called, in one case we have a hard-coded false value passed in for aForPrinting:
  mozilla::ipc::IPCResult ipcResult = CommonCreateWindow(
      aThisTab, parent, /* aSetOpener = */ false, aChromeFlags, aCalledFromJS,
      aWidthSpecified, /* aForPrinting = */ false,
      /* aForPrintPreview = */ false, aURIToLoad, aFeatures, aFullZoom,
      /* aNextRemoteBrowser = */ nullptr, aName, rv, newRemoteTab, &windowIsNew,
      openLocation, aTriggeringPrincipal, aReferrerInfo,
      /* aLoadUri = */ true, aCsp, aOriginAttributes);
In the other case the value is triggered via a call to ContentParent::RecvCreateWindow(). Following this back via SendCreateWindow() we find it eventually gets back to a call to nsWindowWatcher::OpenWindowInternal(). In other words, we find ourselves in the same place as the other path.

So let's move over to this path. We want to follow back to see what calls nsWindowWatcher::OpenWindowInternal() and in particular the value passed in for aPrintKind.

There are several ways this method can be called, all from inside nsWindowWatcher.cpp. One of them straightforwardly sets aPrintKind to PRINT_NONE:
  MOZ_TRY(OpenWindowInternal(aParent, aUrl, aName, aFeatures,
                             /* calledFromJS = */ false, dialog,
                             /* navigate = */ true, argv,
                             /* aIsPopupSpam = */ false,
                             /* aForceNoOpener = */ false,
                             /* aForceNoReferrer = */ false, PRINT_NONE,
                             /* aLoadState = */ nullptr, getter_AddRefs(bc)));
The others both go through OpenWindow2(), which is called from two separate places in nsGlobalWindowOuter.cpp:
$ grep -rIn "OpenWindow2" * --include="*.cpp"
gecko-dev/dom/base/nsGlobalWindowOuter.cpp:7134:
    rv = pwwatch->OpenWindow2(this, url, name, options,
gecko-dev/dom/base/nsGlobalWindowOuter.cpp:7154:
    rv = pwwatch->OpenWindow2(this, url, name, options,
gecko-dev/toolkit/components/windowwatcher/nsWindowWatcher.cpp:353:
    nsWindowWatcher::OpenWindow2(mozIDOMWindowProxy* aParent,
Both of these have the aPrintKind parameter set like this:
  const auto wwPrintKind = [&] {
    switch (aPrintKind) {
      case PrintKind::None:
        return nsPIWindowWatcher::PRINT_NONE;
      case PrintKind::InternalPrint:
        return nsPIWindowWatcher::PRINT_INTERNAL;
      case PrintKind::WindowDotPrint:
        return nsPIWindowWatcher::PRINT_WINDOW_DOT_PRINT;
    }
    MOZ_ASSERT_UNREACHABLE("Wat");
    return nsPIWindowWatcher::PRINT_NONE;
  }();
In all cases the crucial aPrintKind value seen here is passed in as a parameter to nsGlobalWindowOuter::OpenInternal() which, in four out of five cases, looks like this:
  return OpenInternal(aUrl, aName, aOptions,
                      true,                     // aDialog
                      false,                    // aContentModal
                      true,                     // aCalledNoScript
                      false,                    // aDoJSFixups
                      true,                     // aNavigate
                      nullptr, aExtraArgument,  // Arguments
                      nullptr,                  // aLoadState
                      false,                    // aForceNoOpener
                      PrintKind::None, _retval);
In these cases we can see it passed in explicitly as PrintKind::None. The exception is this call here:
      auto printKind = aForWindowDotPrint == IsForWindowDotPrint::Yes
                           ? PrintKind::WindowDotPrint
                           : PrintKind::InternalPrint;
      aError = OpenInternal(u""_ns, u""_ns, u""_ns,
                            false,             // aDialog
                            false,             // aContentModal
                            true,              // aCalledNoScript
                            false,             // aDoJSFixups
                            true,              // aNavigate
                            nullptr, nullptr,  // No args
                            nullptr,           // aLoadState
                            false,             // aForceNoOpener
                            printKind, getter_AddRefs(bc));
Crucially, this route is found inside the following method:
Nullable<WindowProxyHolder> nsGlobalWindowOuter::Print(
    nsIPrintSettings* aPrintSettings, nsIWebProgressListener* aListener,
    nsIDocShell* aDocShellToCloneInto, IsPreview aIsPreview,
    IsForWindowDotPrint aForWindowDotPrint,
    PrintPreviewResolver&& aPrintPreviewCallback, ErrorResult& aError) {
[...]
In other words, only a call to Print() will result in a value other than PRINT_NONE being passed in for aPrintKind, and consequently this is the only way isForPrinting can be set to true.
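
To summarise the argument, here's a toy model of the call paths in Python. It's only a sketch of the reasoning, not the gecko code: the function names open_internal(), open_window() and print_page() are stand-ins I've made up for the much longer C++ call chains traced above.

```python
from enum import Enum


class PrintKind(Enum):
    NONE = 0
    INTERNAL_PRINT = 1
    WINDOW_DOT_PRINT = 2


def open_internal(print_kind):
    """Stand-in for the window-opening path; returns the hidden flag."""
    # Mirrors: openWindowInfo->mIsForPrinting = aPrintKind != PRINT_NONE;
    is_for_printing = print_kind is not PrintKind.NONE
    return is_for_printing


def open_window():
    """Stand-in for the non-printing callers, which pass None explicitly."""
    return open_internal(PrintKind.NONE)


def print_page(for_window_dot_print):
    """Stand-in for Print(), the only route choosing a kind other than NONE."""
    kind = (PrintKind.WINDOW_DOT_PRINT if for_window_dot_print
            else PrintKind.INTERNAL_PRINT)
    return open_internal(kind)


print(open_window())       # False: an ordinary window is never hidden
print(print_page(True))    # True
print(print_page(False))   # True
```

The point of the model is that the hidden flag is derived purely from the print kind, and there's exactly one caller that ever supplies anything other than NONE.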

We've followed all of the paths from beginning to end and hopefully it's clear that only this printing route will end up generating any hidden tabs. That's what we need. But in the future we'll also have to stay vigilant to the fact that this is possible; it's important we don't open any paths that allow it to be controlled by JavaScript coming from external sites.

My plan was to continue on today to looking into the User Agent strings, but this is already rather a long post so I'll leave it there for today. It remains to thank rob_kouw once again for the excellent question, as well as the useful and always-appreciated input from thigg.

2 Jan 2024 : Day 126 #
We're finally reaching the end of the "Print web page to PDF" saga. Yesterday I re-implemented the JavaScript changes that were in gecko-dev as part of the embedlite-components package. Today I'm going to commit all of the changes so that they're available in the remote repositories as well as my local ones.

Going over the changes I previously made to the gecko-dev code to support the functionality I'm happy to see that all of them are now unnecessary. That means I can completely remove the last two commits.
$ git checkout -b temp-FIREFOX_ESR_91_9_X_RELBRANCH_patches
$ git checkout FIREFOX_ESR_91_9_X_RELBRANCH_patches
$ git log -5 --oneline

2fb912372c04 (HEAD -> FIREFOX_ESR_91_9_X_RELBRANCH_patches,
    temp-FIREFOX_ESR_91_9_X_RELBRANCH_patches)
    Call Print from the CanonicalBrowingContext
26259399358f Reintroduce PDF printing code
e035d6ff3a78 (origin/FIREFOX_ESR_91_9_X_RELBRANCH_patches)
    Add embedlite static prefs
5ae7644719f8 Allow file scheme when loading OpenSearch providers
1fe7c56c7363 Supress URLQueryStrippingListService.jsm error

$ git reset --hard HEAD~~
HEAD is now at e035d6ff3a78 Add embedlite static prefs
$ git log -5 --oneline

e035d6ff3a78 (HEAD -> FIREFOX_ESR_91_9_X_RELBRANCH_patches,
    origin/FIREFOX_ESR_91_9_X_RELBRANCH_patches)
    Add embedlite static prefs
5ae7644719f8 Allow file scheme when loading OpenSearch providers
1fe7c56c7363 Supress URLQueryStrippingListService.jsm error
80fa227aee92 [sailfishos][gecko] Fix gfxPlatform::AsyncPanZoomEnabled
    for embedlite. JB#50863
b58093d23fb2 Add patch to fix 32-bit builds
These commits would have had to be turned into one or maybe two patches to the upstream code, so the fact I can remove them completely leaves me with a warm feeling inside.

Almost all of these changes are to the JavaScript code, so in themselves wouldn't require a rebuild of gecko-dev to apply them on my device. However there is one change to the nsIWebBrowserPrint interface which will have a knock-on effect on the compiled (C++ and Rust) code as well, which is this one:
@@ -98,7 +98,7 @@ interface nsIWebBrowserPrint : nsISupports
    * @param aWPListener - is updated during the print
    * @return void
    */
-  [noscript] void print(in nsIPrintSettings aThePrintSettings,
+  void print(in nsIPrintSettings aThePrintSettings,
                         in nsIWebProgressListener aWPListener);
 
   /**
Since we've shifted from calling print() on the window to calling print() on the canonical browsing context, this change shouldn't be needed any more. But to check that this really is the case I'm going to have to do a full rebuild. Having reverted the changes above I can therefore kick things off:
$ sfdk -c no-fix-version build -d -p --with git_workaround
While this is building I've also created the issue that we considered yesterday. It's now Issue #1051: "Fix hang when calling window.setBrowserCover()".

I also created another issue that I've been noticing during testing of the ESR 91 code. There are various Web pages which don't work as well as expected, meaning that they don't work as well on ESR 91 as they do on ESR 78. That's not good for the end user. Previous versions of the browser have had similar problems and invariably the issue relates to what the site is serving to the browser. If the server doesn't recognise the User Agent string it will often send a different version of the page. Often these versions are broken or at least don't work well with the gecko rendering engine. It's pretty astonishing that this is still necessary in a world where Web standards are so much more clearly defined than they were even a decade ago, but there it is.

I've not yet looked into it but I have a suspicion that the reason some of these sites may not be working is that the Sailfish Browser's user agent string handling code is no longer sending the "adjusted" user agent strings to the sites that need them. So I've created Issue #1052 for looking into this and trying to fix it if there's something to fix.
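To make the idea of "adjusted" user agent strings concrete, here's a minimal sketch of a per-site override lookup in plain JavaScript. It is not the Sailfish Browser implementation: the function name userAgentFor(), the override table and the user agent strings themselves are all assumptions made up for illustration.

```javascript
// Hypothetical default and override strings, for illustration only.
const DEFAULT_UA = "Mozilla/5.0 (Mobile; rv:91.0) Gecko/91.0 Firefox/91.0";
const UA_OVERRIDES = new Map([
  [
    "example.com",
    "Mozilla/5.0 (Android 8.1.0; Mobile; rv:91.0) Gecko/91.0 Firefox/91.0",
  ],
]);

// Returns the user agent string to send for a given URL. Subdomains
// inherit the override of the closest parent domain in the table.
function userAgentFor(url) {
  const labels = new URL(url).hostname.split(".");
  // Walk up the domain: "m.example.com" is tried first, then "example.com".
  for (let i = 0; i < labels.length - 1; i++) {
    const candidate = labels.slice(i).join(".");
    if (UA_OVERRIDES.has(candidate)) {
      return UA_OVERRIDES.get(candidate);
    }
  }
  return DEFAULT_UA;
}
```

If a lookup like this silently stopped matching, every site would receive the default string, which is the kind of failure suspected here.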

The gecko-dev build has completed (it took most of the day, but at least not all of the day). I'm now rebuilding the packages that depend on it so that I can then test them out.

It's almost the full suite: gecko-dev, qtmozembed-qt5, embedlite-components and sailfish-browser. The only package I've not rebuilt is sailfish-components-webview. I've not yet made any changes to that at all since starting this ESR 91 work.

All built. All uploaded.
$ devel-su rpm -U --force xulrunner-qt5-91.*.rpm \
    xulrunner-qt5-debugsource-91.*.rpm xulrunner-qt5-debuginfo-91.*.rpm \
    xulrunner-qt5-misc-91.*.rpm qtmozembed-qt5-1.*.rpm \
    qtmozembed-qt5-debuginfo-1.*.rpm qtmozembed-qt5-debugsource-1.*.rpm \
    sailfish-browser-2.*.rpm sailfish-browser-debuginfo-2.*.rpm \
    sailfish-browser-debugsource-2.*.rpm sailfish-browser-settings-2.*.rpm \
    embedlite-components-qt5-1.*.rpm \
    embedlite-components-qt5-debuginfo-1.*.rpm \
    embedlite-components-qt5-debugsource-1.*.rpm
All installed.

Now to test the changes... and it doesn't work. That's no fun.

But as soon as I look at the code it's clear what the problem is. Here's the error:
[JavaScript Error: "aSerializable.url is undefined"
    {file: "resource://gre/modules/DownloadCore.jsm" line: 1496}]
DownloadSource.fromSerializable@resource://gre/modules/DownloadCore.jsm:1496:5
Download.fromSerializable@resource://gre/modules/DownloadCore.jsm:1282:38
D_createDownload@resource://gre/modules/Downloads.jsm:108:39
DownloadPDFSaver2.fromSerializable@file:///usr/lib64/mozembedlite/components/
    EmbedliteDownloadManager.js:420:34
observe/<@file:///usr/lib64/mozembedlite/components/
    EmbedliteDownloadManager.js:272:56
Apart from the error in DownloadCore.jsm I'm also a little perturbed at the reference to DownloadPDFSaver2. While I had two versions of DownloadPDFSaver in the code at the same time I did call one of them DownloadPDFSaver2, but subsequently switched all references to just use DownloadPDFSaver (or so I thought) once I'd removed the unused version. So that really shouldn't be appearing at all.

Let's find out why.

First the reference to DownloadPDFSaver2. When I check the actual code on device it certainly does still reference this, but that's different to the code I have on my laptop. I must have made a mistake when building the package. I've rebuilt it, re-uploaded it and reinstalled it. Now when I run it I get this:
[JavaScript Error: "aSerializable.url is undefined"
    {file: "resource://gre/modules/DownloadCore.jsm" line: 1496}]
DownloadSource.fromSerializable@resource://gre/modules/DownloadCore.jsm:1496:5
Download.fromSerializable@resource://gre/modules/DownloadCore.jsm:1282:38
D_createDownload@resource://gre/modules/Downloads.jsm:108:39
DownloadPDFSaver.createDownload@file:///usr/lib64/mozembedlite/components/
    EmbedliteDownloadManager.js:426:34
observe/<@file:///usr/lib64/mozembedlite/components/
    EmbedliteDownloadManager.js:272:55
That looks more healthy. Now for the main error. It looks to me like this is due to the removal of the following from DownloadCore.jsm:
@@ -1499,12 +1491,6 @@ DownloadSource.fromSerializable = function(aSerializable) {
     source.url = aSerializable.toString();
   } else if (aSerializable instanceof Ci.nsIURI) {
     source.url = aSerializable.spec;
-  } else if (aSerializable instanceof Ci.nsIDOMWindow) {
-    source.url = aSerializable.location.href;
-    source.isPrivate = PrivateBrowsingUtils.isContentWindowPrivate(
-      aSerializable
-    );
-    source.windowRef = Cu.getWeakReference(aSerializable);
   } else {
     // Convert String objects to primitive strings at this point.
     source.url = aSerializable.url.toString();
In the new code in EmbedliteDownloadManager.js we send in the source like this:
  let download = await DownloadPDFSaver.createDownload({
    source: Services.ww.activeWindow,
    target: data.to
  });
But now that I removed the code from DownloadCore.jsm it no longer knows how to handle this Services.ww.activeWindow value. I should be handling that in EmbedliteDownloadManager.js instead. At present the code that passes this value on looks like this:
DownloadPDFSaver.createDownload = async function(aProperties) {
  let download = await Downloads.createDownload({
    source: aProperties.source,
    target: aProperties.target,
    contentType: "application/pdf"
  });
So I should be able to fix this by handling it more judiciously. Let's try this:
DownloadPDFSaver.createDownload = async function(aProperties) {
  let download = await Downloads.createDownload({
    source: aProperties.source.location.href,
    target: aProperties.target,
    contentType: "application/pdf"
  });
  download.source.isPrivate = PrivateBrowsingUtils.isContentWindowPrivate(
    aProperties.source
  );
  download.source.windowRef = Cu.getWeakReference(aProperties.source);
This way the deserialisation code in DownloadCore will just accept the source value as a string and store it in source.url (just as we need). We can then embellish it with the additional isPrivate and windowRef elements as in the code above.
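The failure and the fix can be reproduced outside gecko with a simplified model. In this sketch sourceFromSerializable() is a cut-down stand-in for DownloadSource.fromSerializable() (the nsIURI branch is omitted), while fakeWindow and describeSource() are hypothetical names standing in for Services.ww.activeWindow and for the relevant part of DownloadPDFSaver.createDownload().

```javascript
// Simplified model of the branch logic that remains after the upstream
// change: strings are accepted directly, anything else must have a "url".
function sourceFromSerializable(aSerializable) {
  const source = {};
  if (typeof aSerializable === "string" || aSerializable instanceof String) {
    source.url = aSerializable.toString();
  } else {
    // A window object has no "url" property, so this throws for windows.
    source.url = aSerializable.url.toString();
  }
  return source;
}

// Hypothetical stand-in for Services.ww.activeWindow.
const fakeWindow = { location: { href: "https://example.com/page" } };

// Pass the plain URL string through, then attach the window-specific
// field afterwards, mirroring the approach taken in the text.
function describeSource(win) {
  const source = sourceFromSerializable(win.location.href);
  // Stand-in for Cu.getWeakReference(), which returns an object with get().
  source.windowRef = { get: () => win };
  return source;
}
```

Calling sourceFromSerializable(fakeWindow) throws a TypeError because the url property is undefined, which corresponds to the error seen on the device. Calling describeSource(fakeWindow) succeeds because only a string ever reaches the deserialisation code.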

In order to get this to work I also have to pull in PrivateBrowsingUtils at the top of the file:
const { PrivateBrowsingUtils } = ChromeUtils.import(
  "resource://gre/modules/PrivateBrowsingUtils.jsm"
);
The end result is working. I've pushed all the changes to the remote repositories. Phew. Finally I'm going to call this task done.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
1 Jan 2024 : Day 125 #
It's 2024! Happy New Year to everyone. The start of the year is all about tidying up the PDF printing code for me, following the final steps to get things working yesterday. Since everything is now working, this is my chance to break things by making my changes cleaner, attempting to fix any edge cases and moving code around so there are more changes to Sailfish code and fewer changes to gecko-dev code.

Back on Day 99 when we first started looking at the problem I reverted an upstream change that removed a JavaScript class called DownloadPDFSaver. That is, the upstream change removed it, then when I reverted the change it put it back.

But it doesn't look like there's much in DownloadPDFSaver that's actually dependent on being inside gecko-dev, so it would make sense to move it into the embedlite-embedding JavaScript codebase instead. The embedlite-embedding code is specific to Sailfish OS and so entirely under our control. If I can move the changes there it will not only make things cleaner by avoiding a patch to the upstream code, it will also make the code much simpler to maintain for the future.

So my plan is to move everything related to DownloadPDFSaver into the EmbedliteDownloadManager.js file, which is the place where we actually need to make use of it.

So I've copied over the class prototype code with zero changes. I won't show it all here because there's quite a lot of it, but here's what the start of it looks like along with its docstring:
/**
 * This DownloadSaver type creates a PDF file from the current document in a
 * given window, specified using the windowRef property of the DownloadSource
 * object associated with the download.
 *
 * In order to prevent the download from saving a different document than the one
 * originally loaded in the window, any attempt to restart the download will fail.
 *
 * Since this DownloadSaver type requires a live document as a source, it cannot
 * be persisted across sessions, unless the download already succeeded.
 */
DownloadPDFSaver.prototype = {
  __proto__: DownloadSaver.prototype,
[...]
This code essentially allows printing to happen but with a DownloadSaver interface which means we can hook it into our "downloads" code really easily.
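For anyone unfamiliar with the prototype style used here, the following is a rough, self-contained sketch of the pattern. The DownloadSaver below is only a stand-in for the real base in DownloadCore.jsm, which has more members, and SketchPDFSaver is a hypothetical name rather than the real DownloadPDFSaver.

```javascript
// Stand-in for the gecko DownloadSaver base (heavily simplified).
function DownloadSaver() {}
DownloadSaver.prototype = {
  download: null,
  async execute() {
    throw new Error("Not implemented.");
  },
  async cancel() {},
};

// Hypothetical saver that "prints" rather than fetching from the network.
function SketchPDFSaver() {}
SketchPDFSaver.prototype = {
  // Inherit everything we don't override from the base saver.
  __proto__: DownloadSaver.prototype,
  async execute() {
    // A real implementation would call into the print pipeline here.
    const { source, target } = this.download;
    return `printed ${source.url} to ${target.path}`;
  },
};
```

Because the downloads code only ever talks to the saver through this shared interface, it doesn't need to know that the "download" is really a print job.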

Now previously we'd have called Downloads.createDownload() with the saver key set to "pdf". The original code that did this looked like this:
            if (Services.ww.activeWindow) {
              (async function() {
                let list = await Downloads.getList(Downloads.ALL);
                let download = await Downloads.createDownload({
                  source: Services.ww.activeWindow,
                  target: data.to,
                  saver: "pdf",
                  contentType: "application/pdf"
                });
                download["saveAsPdf"] = true;
                download.start();
                list.add(download);
              })().then(null, Cu.reportError);
You can see the saver: "pdf" entry in there. This was consumed by the Downloads class from gecko-dev which handed it off to the DownloadCore class. However the upstream change carefully excised all of the functionality that handled this PDF code. All of the other download types were maintained, only the PDF code was removed.
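As a rough model of what that hand-off amounts to, the saver value is used to pick a saver class. The sketch below is a simplification under stated assumptions: the class bodies are empty placeholders, and makeSaverFactory(), withPdf and withoutPdf are names invented for this illustration rather than anything in gecko-dev.

```javascript
// Placeholder saver classes; the real ones do the actual work.
class CopySaver {}
class LegacySaver {}
class PDFSaver {}

// Builds a function that maps a serialised saver value to an instance.
function makeSaverFactory(savers) {
  return function saverFromSerializable(aSerializable) {
    // Accept either a bare string or an object with a "type" field,
    // defaulting to "copy" when nothing is given.
    const type =
      typeof aSerializable === "string"
        ? aSerializable
        : (aSerializable && aSerializable.type) || "copy";
    const Saver = savers[type];
    if (!Saver) {
      throw new Error("Unrecognised download saver type: " + type);
    }
    return new Saver();
  };
}

// Before the upstream change "pdf" is known; afterwards it isn't.
const withPdf = makeSaverFactory({
  copy: CopySaver,
  legacy: LegacySaver,
  pdf: PDFSaver,
});
const withoutPdf = makeSaverFactory({ copy: CopySaver, legacy: LegacySaver });
```

With the "pdf" entry gone, withoutPdf("pdf") throws while withoutPdf("copy") carries on working, which is the situation the upstream change leaves us in.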

So while previously I reverted the changes, I've now created a new method inside EmbedliteDownloadManager.js with the aim of achieving the same thing. It looks like this:
/**
 * Creates a new DownloadPDFSaver object, with its initial state derived from
 * the provided properties.
 *
 * @param aProperties
 *        Provides the initial properties for the newly created download.
 *        This matches the serializable representation of a Download object.
 *        Some of the most common properties in this object include:
 *        {
 *          source: An object providing a Ci.nsIDOMWindow interface.
 *          target: String containing the path of the target file.
 *        }
 *
 * @return The newly created DownloadPDFSaver object.
 */
DownloadPDFSaver.createDownload = async function(aProperties) {
  let download = await Downloads.createDownload({
    source: aProperties.source,
    target: aProperties.target,
    contentType: "application/pdf"
  });

  download.saver = new DownloadPDFSaver();
  download.saver.download = download;
  download["saveAsPdf"] = true;

  return download;
};
Note that DownloadPDFSaver is now in the same file, so can be used here just fine. This code will end up creating the structure that we talked about yesterday; the end result should look like this:
download = Download
{
	source: {
	    url: Services.ww.activeWindow.location.href,
	    isPrivate: PrivateBrowsingUtils.isContentWindowPrivate(
	        Services.ww.activeWindow
	    ),
	    windowRef: Cu.getWeakReference(Services.ww.activeWindow),
	},
	target: {
	    path: data.to,
	},
	saver: DownloadPDFSaver(
	    download: download,
	),
	contentType: "application/pdf",
	saveAsPdf: true,
}
Some of the functionality is hidden inside the existing call to Downloads.createDownload() which creates a basic Download object that we then add our stuff to. If you look carefully you should be able to match up the code shown in our new DownloadPDFSaver.createDownload() method with the elements of the structure shown above.

With all this in place we can now change our code that creates the download like this:
@@ -254,13 +269,10 @@ EmbedliteDownloadManager.prototype = {
             if (Services.ww.activeWindow) {
               (async function() {
                 let list = await Downloads.getList(Downloads.ALL);
-                let download = await Downloads.createDownload({
+                let download = await DownloadPDFSaver.createDownload({
                   source: Services.ww.activeWindow,
-                  target: data.to,
-                  saver: "pdf",
-                  contentType: "application/pdf"
+                  target: data.to
                 });
-                download["saveAsPdf"] = true;
                 download.start();
                 list.add(download);
               })().then(null, Cu.reportError);
To give our final version, which looks like this:
          case "saveAsPdf":
            if (Services.ww.activeWindow) {
              (async function() {
                let list = await Downloads.getList(Downloads.ALL);
                let download = await DownloadPDFSaver.createDownload({
                  source: Services.ww.activeWindow,
                  target: data.to
                });
                download.start();
                list.add(download);
              })().then(null, Cu.reportError);
            } else {
              Logger.warn("No active window to print to pdf")
            }
            break;
Nice! But when I try to execute this code I hit several errors. Various classes and objects that were available in the DownloadCore.jsm file are no longer available here: DownloadSaver, DownloadError, OS and gPrintSettingsService.

Thankfully we can still pull these into our EmbedliteDownloadManager.js code to make use of them. I've added the following at the top of the file for this:
const { DownloadSaver, DownloadError } = ChromeUtils.import(
  "resource://gre/modules/DownloadCore.jsm"
);

XPCOMUtils.defineLazyModuleGetters(this, {
  OS: "resource://gre/modules/osfile.jsm",
});

XPCOMUtils.defineLazyServiceGetter(
  this,
  "gPrintSettingsService",
  "@mozilla.org/gfx/printsettings-service;1",
  Ci.nsIPrintSettingsService
);
Now when I try to print everything works just as before. Lovely! It remains to revert the revert applied previously and to commit these changes to my working tree.
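The two XPCOMUtils calls above define lazy getters, so nothing is loaded until OS or gPrintSettingsService is first touched. For the curious, the general idea can be sketched in plain JavaScript as below. This is an illustrative re-implementation rather than the XPCOMUtils code, and defineLazyGetter(), scope and the fake OS object are all assumptions made for the example.

```javascript
// Defines a property whose value is computed on first access and then
// cached by replacing the getter with a plain data property.
function defineLazyGetter(target, name, factory) {
  Object.defineProperty(target, name, {
    configurable: true,
    enumerable: true,
    get() {
      const value = factory();
      Object.defineProperty(target, name, {
        value,
        writable: true,
        configurable: true,
        enumerable: true,
      });
      return value;
    },
  });
}

const scope = {};
let loads = 0;
defineLazyGetter(scope, "OS", () => {
  loads += 1; // stands in for the cost of importing a module
  return { Path: { join: (...parts) => parts.join("/") } };
});
```

Until scope.OS is read the factory never runs, and however many times it's read afterwards the factory runs just once.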

There is also now just one last task I need to do to round off this PDF print work and that's to "Record an issue for the setBrowserCover() hang." This is probably the easiest task of the lot, which is why I've left it until last. But that'll be a task for tomorrow.

Once this last step is done I can finally move on to something unrelated to the printing stack, which will be a relief!

1 Jan 2024 : Reckoning and Renewal, Part IV #
Greetings from 2024! As this is my fourth post on the topic of annual resolutions I think I can now consider it to be a habit. This is where I look back at the resolutions I made a year ago and assess how I did in the cold light of day; and then on the back of what is usually a mixed set of results I go on to pretend there's some value in me doing the same thing again for the year ahead.

But to be clear, these plans aren't supposed to be life-goals, sweeping changes or major reevaluations of the self. Rather they're supposed to be incremental improvements; achievable tasks that help me focus on specific changes I can make during the year that would be easy to overlook otherwise.
 

Before getting into the resolutions themselves I think it's worth reflecting on the kind of year 2023 was for me. The world seemingly had a torrid year and while it's impossible not to acknowledge how traumatic world events have been, I hope you'll forgive me for focusing on only my own small corner of the year in this post.

It was a positive, albeit tumultuous, year for me. Most of the turmoil happened in February when I moved countries, from living in Finland to living in the UK, and changed jobs, from working at Jolla Oy to working at The Alan Turing Institute. Although I loved living in Finland and working for Jolla, I still consider these changes to be positive ones, mostly because it means I'm now back living in the same home as my wife Joanna. It's hard to overstate the improvement in quality of life that this brings to me.

February was also the month of FOSDEM'23 which I enjoyed hugely. I'm hoping to have a similarly enjoyable trip to Belgium this coming February for FOSDEM'24. Related to this is the fact that Jolla Oy, which has had its own tumultuous year, now looks to be in a better place. I still use Sailfish OS as my main phone operating system, so I'm encouraged that it's now on a more stable footing.

Although most of these changes were anticipated when I wrote my 2023 resolutions, the real ramifications of them were always going to be unclear beforehand. So my resolutions have to be considered in this context.

So, looking back at my resolutions for 2023, here's what they looked like:
  1. Learn quantum programming.
  2. Make the most of London with Joanna.
  3. Take the bisection work to the next stage.
And how did I do? The headline figure is one out of three, which is less than fifty percent and therefore not the success I was hoping for. But that doesn't mean it was a futile exercise. Let's look in more detail.

First the plan to learn quantum computing. Although I did make some progress on this by reading through "Programming Quantum Computers" by Eric R. Johnston, Nic Harrigan and Mercedes Gimeno-Segovia, I can't really consider it to be a success because I didn't finish the book. That would have been the minimum criterion. Despite this, it was still a worthwhile goal and having it in my list did give me more focus than I would have otherwise.

In practice my move to The Alan Turing Institute brought with it a huge amount of new things to learn and new opportunities for learning them. The institute has a much stronger commitment to continuous learning than other organisations I've worked for — an incredibly positive thing — and consequently I was able to join a Transformers reading group, a Rust reading group and a Linear Algebra reading group. Plus I had to learn the ropes of the job. Amidst all this other learning, my quantum computing plans took a back seat. Maybe I'll make more headway in 2024.

Second was to make the most of London with Joanna. This, I think, was a success. We went to see Hamilton, Matilda, the "Titanosaur" Patagotitan at the British Natural History Museum and a lecture about generative AI at the Royal Institute. We also enjoyed some nice meals out and I met up with friends in London on multiple occasions. It's something I should try to maintain in the coming year.

Finally I completely failed to take the bisection work forwards. This was overtaken by my commitment to upgrade Gecko to ESR 91 and publish a daily blog covering my progress. If I'm honest, although I'd love to have wrapped up the bisection work, there was no chance of me doing both and I'm happy with the choice I made. I still hope to get the bisection work published at some point, but not until this ESR 91 work is complete.
 

So what does 2024 have to offer? I don't want lots of New Year's resolutions, but I decided I do want there to be exactly one resolution from each of four topics that I care deeply about. As such I've split my resolutions into four categories: maths, computing, ecology and fun. I've picked the things that I most want to achieve for each of these in 2024. Here they are:
  1. Start working through "Information Theory: A Tutorial Introduction" by James V. Stone.
  2. Do something practical in Rust.
  3. Make twelve incremental ecological improvements to my life.
  4. Go to at least three events or exhibitions at the British Library.
A little context about these. The first was suggested by Rahul and both Alun and I agreed this would be a great thing to do. I've paused on this until after the ESR 91 work is complete, but if I don't get that completed this year I have bigger problems. I've always wanted to understand Information Theory more and this book seems to include many of the really interesting results I want to better understand: Shannon's Source Coding Theorem and Kolmogorov Complexity for example. It doesn't cover Hamming Distance, so that might require delving into a different text.

As I mentioned I've joined a Rust reading group. That's fun and useful but I need to do something practical to cement the knowledge I've been accumulating. I already started doing some development on top of the existing (but incomplete) rust_gpiozero codebase so my intention is to continue with this work. But if I switch to some other piece of practical Rust programming that's fine too.

My third resolution is to make a series of incremental improvements to my life in terms of reducing my environmental impact. I really wanted to make one of my resolutions ecological, but this was by far the hardest to decide upon. In previous years I've recorded my waste output, had a heat pump installed and offset my annual carbon emissions each year. But in order to achieve continual improvement I needed something different this year. I have a list of things I'd like to do: look into getting solar panels installed, commute by bike rather than by bus, switch my bank and pension to use green investment funds, avoid having to get on a plane. If I can make one such improvement per month, then I think that will be better than having just a single large headline improvement that may turn out to be unachievable.

Finally Joanna gifted me membership of the British Library for the year. Since I work in the building I've already started enjoying many of the facilities: bookshop, gift shop, cafés, reading rooms. In 2023 I really wanted to visit some of the exhibitions during my lunch break, but the entrance fee made it uneconomical (£16 for a 30 minute visit just didn't add up). With membership I'll be able to enter the exhibitions at no additional cost, so I plan to make the most of the fact.

The current exhibition on Fantasy: Realms of Imagination looks amazing, but I've held off going in the hope that someone might be generous enough to offer me membership. That will be the first exhibition I go to. As it runs from 27 October 2023 to 25 February 2024 it looks like there's a four-month exhibition cycle, meaning three per year. My plan is to go to all of them, along with other events that might be happening in the library. Sounds like fun to me.

So that's it. Four resolutions which certainly look achievable and which I'll be able to easily assess in twelve months' time. Here's to 2024 and the hope that I have more success with these than I did with my 2023 list.
31 Dec 2023 : Day 124 #
Here in the UK as I write this it's verging on 2024. From the fireworks outside I can tell that new year celebrations have already started for some. But I still have just enough time to squeeze in a final Gecko Dev Diary for 2023.


 

What I'm really looking forward to in 2024 is not having to worry about the printing pipeline any more. And thankfully we're reaching the end game for this particular issue. But there are still plenty more things to work on to get this ESR 91 version of the engine working as well as ESR 78, so this is far from being close to my final post on the topic.

So please expect posts to continue as we fly headlong into the new year. To everyone who has been following along over the last 124 days, thank you so much for your time, commitment, Mastodon favourites, boosts, generous comments and generally hugely motivational interactions. It's been a wonderful experience for me.

But I mustn't get carried away when there's development to be done, so let's continue with today's post.

Following the plan I cooked up to suspend rendering when the "hidden" window is created, yesterday I attempted to avoid the flicker at the point when the user selects the "Save page to PDF" option. The approach failed, so I'm dropping the idea. Dropping it doesn't affect the functionality and, in retrospect, suspending rendering was always likely to leave the display showing the background colour rather than the last thing rendered, so without a fair bit more engineering it was never likely to succeed.

That means I'm moving on to refactoring the QML and DownloadPDFSaver code today.

To kick things off I've worked through all the changes I made to gecko-dev to expose the isForPrinting flag, converting it to a hidden flag in the process. All of these changes live in the EmbedLite portion of the code, which is ideally what we want because it avoids having to patch the upstream gecko-dev code itself.

I've also committed the changes to qtmozembed and sailfish-browser that are needed to support this too.

The next step is to backtrack and take another look at the changes made to gecko-dev that we inserted in order to support printing. This is covered by two commits, the earlier of which, now I look at it, I made right at the start of December. It makes me think that I've been stalled on these printing changes for far too long.
$ git log -2
commit 2fb912372c0475c1ca84c381cf9927f75fe32595
    (HEAD -> FIREFOX_ESR_91_9_X_RELBRANCH_patches)
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Tue Dec 19 20:22:51 2023 +0000

    Call Print from the CanonicalBrowingContext
    
    Call Print from the CanonicalBrowsingContext rather than on the
    Ci.nsIWebBrowserPrint of the current window.

commit 26259399358f14e9695d7b9497aeb3a8577285a9
Author: David Llewellyn-Jones <david@flypig.co.uk>
Date:   Tue Dec 5 22:29:55 2023 +0000

    Reintroduce PDF printing code
    
    Reintroduces code to allow printing of a window. This essentially
    reverts the following three upstream commits:
    
    https://phabricator.services.mozilla.com/D84012
    
    https://phabricator.services.mozilla.com/D84137
    
    https://phabricator.services.mozilla.com/D83264
The plan was always to try to move the reverted changes in the "Reintroduce PDF printing code" commit out of the gecko-dev code and into the EmbedLite code. More specifically, that means moving the changes in DownloadCore.jsm to EmbedliteDownloadManager.js. This may turn out not to be practical, but I'd like to give it a go.

The DownloadPDFSaver class prototype itself looks pretty self-contained, so moving that to EmbedliteDownloadManager.js looks plausible. However there's also deserialisation code which looks a bit more baked in. In order to move the code, we'd have to perform the deserialisation in EmbedliteDownloadManager.js as well.

Thankfully I looked through this code quite carefully already on Day 99 at the start of December. It's these situations in which I'm glad to have recorded these notes, because reading through the post means I don't have to dig through the code all over again.

The flow is the following:
  1. EmbedliteDownloadManager.observe() in EmbedliteDownloadManager.js.
  2. Downloads.createDownload() in Downloads.jsm.
  3. Download.fromSerializable() in DownloadCore.jsm.
  4. DownloadSource.fromSerializable() and DownloadTarget.fromSerializable() in DownloadCore.jsm.
  5. DownloadSaver.fromSerializable() in DownloadCore.jsm.
  6. DownloadPDFSaver.fromSerializable() in DownloadCore.jsm.
The aim is to move DownloadPDFSaver into EmbedliteDownloadManager.js. But in order for this to work all of the steps in between will need moving there too. In practice most of the logic in the intermediate fromSerializable() methods consists of conditions on the contents of the data passed in. If we're moving DownloadPDFSaver into EmbedliteDownloadManager.js then most of the logic becomes redundant because we'll only have to cover the one case. Moreover we don't really need to perform any deserialisation: we can just configure the DownloadPDFSaver class with the values we have directly.

So it looks like it should be straightforward, but will need a little care and testing.

I've hatched a plan for how to proceed that will start with me moving the DownloadPDFSaver code, then collecting together the data to be configured into this, all of which comes from the following "serialised" data structure that's created in EmbedliteDownloadManager.js:
{
  source: Services.ww.activeWindow,
  target: data.to,
  saver: "pdf",
  contentType: "application/pdf"
}
Thankfully that's not a lot of data to have to deal with. We'll need to end up with a promise that resolves to a Download structure that will look like this:
download = Download
{
	source: {
	    url: Services.ww.activeWindow.location.href,
	    isPrivate: PrivateBrowsingUtils.isContentWindowPrivate(
	        Services.ww.activeWindow
	    ),
	    windowRef: Cu.getWeakReference(Services.ww.activeWindow),
	},
	target: {
	    path: data.to,
	},
	saver: DownloadPDFSaver(
	    download: download,
	),
	contentType: "application/pdf",
	saveAsPdf: true,
}
That's simply the structure you get when you follow all of the fromSerializable() steps through the code. Once DownloadPDFSaver has been moved it doesn't look like there's anything else there that can't be accessed from the EmbedliteDownloadManager.js code.

But we'll find that out tomorrow when I actually try to implement this. Roll on 2024!

30 Dec 2023 : Day 123 #
I finished yesterday in high spirits, having figured out what was causing the hang on switching to private mode although, to be clear, that's not the same thing as having a solution. Nevertheless this puts us in a good position with the printing situation, because it means the window hiding code isn't causing the hang and it should now be pretty straightforward to get it to a state where it's ready to be merged in.

I concluded my post yesterday by highlighting four tasks which I now need to focus on to get this over the line.
  1. Check whether the flicker can be removed by skipping the activation of the hidden window.
  2. Tidy up the QML implementation of the new proxy filter model.
  3. Move the DownloadPDFSaver class from DownloadCore.jsm to EmbedliteDownloadManager.js.
  4. Record an issue for the setBrowserCover() hang.
I'm going to spend a bit of time on the first of these today. Unfortunately I don't have as much time to spend on this today as I do usually, so this will be a terse investigation. And if it doesn't pan out then my plan is to simply drop the idea rather than spend any more time trying to figure something more complex out. So this may not work out and in case it doesn't, well, that's just fine.

The bits I'm interested in are the following, which are part of qtmozembed and can be found in the qopenglwebpage.cpp file:
void QOpenGLWebPage::suspendView()
{
    if (!d || !d->mViewInitialized) {
        return;
    }
    setActive(false);
    d->mView->SuspendTimeouts();
}

void QOpenGLWebPage::resumeView()
{
    if (!d || !d->mViewInitialized) {
        return;
    }
    setActive(true);

    // Setting view as active, will reset RefreshDriver()->SetThrottled at
    // PresShell::SetIsActive (nsPresShell). Thus, keep on throttling
    // if should keep on throttling.
    if (mThrottlePainting) {
        d->setThrottlePainting(true);
    }

    d->mView->ResumeTimeouts();
}
I'm interested in these because of the flicker that happens when the hidden page is initially created. If we can arrange things so that the page is in a suspended state when it's created and never resumed, then rendering may never take place and it's possible the flicker can be avoided. The aim here is to suspend the rendering but not execution of the page itself: we want the page to continue running JavaScript and doing whatever else it needs to do in the background. It's only the rendering we want to avoid.
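The distinction between pausing rendering and pausing execution can be captured in a toy model. The real code is C++ in qtmozembed, so this JavaScript sketch is only an approximation of the two methods above; the PageModel class, its field names and the deactivateOnly() method are hypothetical.

```javascript
// Toy model of the state touched by suspendView() and resumeView().
class PageModel {
  constructor() {
    this.active = true; // rendering is active
    this.timeoutsSuspended = false; // page timers are paused
    this.throttlePainting = false; // caller has asked for throttled painting
    this.throttled = false; // throttling currently applied
  }

  suspendView() {
    this.active = false;
    this.timeoutsSuspended = true;
  }

  resumeView() {
    // Activating resets throttling, so reapply it if it was requested.
    this.active = true;
    this.throttled = this.throttlePainting;
    this.timeoutsSuspended = false;
  }

  // What a hidden page would ideally want: no rendering, timers running.
  deactivateOnly() {
    this.active = false;
  }
}
```

In this model suspendView() stops both rendering and timers, whereas deactivateOnly() stops only the rendering, which is closer to what the hidden page needs.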

These methods are called from a couple of places both of which appear to be in the webpages.cpp file of sailfish-browser. Here's one of them:
WebPageActivationData WebPages::page(const Tab& tab)
{
    const int tabId = tab.tabId();

    if (m_activePages.active(tabId)) {
        DeclarativeWebPage *activePage = m_activePages.activeWebPage();
        activePage->resumeView();
        return WebPageActivationData(activePage, false);
    }

    DeclarativeWebPage *webPage = 0;
    DeclarativeWebPage *oldActiveWebPage = m_activePages.activeWebPage();
    if (!m_activePages.alive(tabId)) {
        webPage = m_pageFactory->createWebPage(m_webContainer, tab);
        if (webPage) {
            m_activePages.prepend(tabId, webPage);
        } else {
            return WebPageActivationData(nullptr, false);
        }
    }

    DeclarativeWebPage *newActiveWebPage = m_activePages.activate(tabId);
    updateStates(oldActiveWebPage, newActiveWebPage);

    if (m_memoryLevel == MemCritical) {
        handleMemNotify(m_memoryLevel);
    }

    return WebPageActivationData(newActiveWebPage, true);
}
And here's the other (notice that this gets called in the code above):
void WebPages::updateStates(DeclarativeWebPage *oldActivePage,
    DeclarativeWebPage *newActivePage)
{
    if (oldActivePage) {
        // Allow suspending only the current active page if it is not the
        // creator (parent).
        if (newActivePage->parentId() != (int)oldActivePage->uniqueId()) {
            if (oldActivePage->loading()) {
                oldActivePage->stop();
            }
            oldActivePage->suspendView();
        } else {
            // Sets parent to inactive and suspends rendering keeping
            // timeouts running.
            oldActivePage->setActive(false);
        }
    }

    if (newActivePage) {
        newActivePage->resumeView();
        newActivePage->update();
    }
}
The second of these we've seen before (I was playing around with this block of code on Day 118). In order to try to get the page to stay inactive I've amended the WebPages::page() method so that rather than calling updateStates() in all circumstances, it now does something slightly different if the tab is hidden:
    DeclarativeWebPage *newActiveWebPage = m_activePages.activate(tabId);
    if (tab.hidden()) {
        newActiveWebPage->setActive(false);
        newActiveWebPage->suspendView();
    } else {
        updateStates(oldActiveWebPage, newActiveWebPage);
    }
The idea here is that if the tab is hidden it will be set to inactive and suspended rather than being activated through the call to updateStates().

The theory looks plausible, but the practice shows otherwise: after making this change there's still a visible flicker when the hidden page is created and then immediately hidden.

However, while investigating this I also notice that the WebView.qml has this readyToPaint value that also controls whether rendering occurs:
    readyToPaint: resourceController.videoActive ? webView.visible
        && !resourceController.displayOff : webView.visible
        && webView.contentItem
        && (webView.contentItem.domContentLoaded || webView.contentItem.painted)
If I comment out this line, so that readyToPaint is never set to true, then I find that none of the pages render at all. That's good, because it means that using this flag it may be possible to avoid the rendering of the page that's causing the flicker.

I've added some extra variables into the class so that when a hidden page is created the readyToPaint value is skipped on the next occasion it's set. This is a hack and if this works I'll need to figure out a more satisfactory approach, but for testing this might be enough.

Unfortunately my attempts to control it fail: it has precisely the opposite effect, so that the page turns white and then rendering never starts up again. I'm left with a blank screen and no page being rendered at all.

I give it one more go, this time with a little more sophistication in my approach. Essentially I'm recording the hidden state, skipping any readyToPaint updates while it's set, then restoring the flag as soon as the page has reverted back to the non-hidden page.
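To make the idea concrete, here's a minimal sketch of that gating logic in plain C++. It isn't the code from the browser: PaintGate and its methods are hypothetical names, and it assumes that "restoring the flag" means applying the most recent value that was requested while the page was hidden.

```cpp
#include <cassert>

// Hypothetical helper, not part of sailfish-browser: it gates updates to a
// readyToPaint-style flag while a hidden page is in play.
class PaintGate {
public:
    // Request a new readyToPaint value; the request is deferred while hidden.
    void setReadyToPaint(bool ready)
    {
        if (m_hidden) {
            m_pending = ready;      // remember the latest request for later
            m_hasPending = true;
        } else {
            m_readyToPaint = ready;
        }
    }

    // Enter or leave the hidden state; leaving applies any deferred value.
    void setHidden(bool hidden)
    {
        m_hidden = hidden;
        if (!m_hidden && m_hasPending) {
            m_readyToPaint = m_pending;
            m_hasPending = false;
        }
    }

    bool readyToPaint() const { return m_readyToPaint; }

private:
    bool m_hidden = false;
    bool m_readyToPaint = false;
    bool m_pending = false;
    bool m_hasPending = false;
};

// Small usage example showing the three phases.
void paintGateExample()
{
    PaintGate gate;
    gate.setReadyToPaint(true);
    assert(gate.readyToPaint());     // applied straight away
    gate.setHidden(true);
    gate.setReadyToPaint(false);
    assert(gate.readyToPaint());     // update skipped while hidden
    gate.setHidden(false);
    assert(!gate.readyToPaint());    // deferred value applied on restore
}
```

The design choice here is to remember the skipped value rather than throw it away, so that the flag ends up in the state it would have reached anyway once the hidden page is gone.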

Now the rendering state is at least restored afterwards, but there's still a flicker on screen as the page appears and then is hidden. And when I check the debug output it's clear that there's no change of readyToPaint state occurring between the time the new page is created and the old page is reinstated. During this time the rendering state is set to false.

So I don't think there's anything more to test here. Suspending the page appears to still render to the screen as a white page, rather than simply leaving behind the page that was there before. This shouldn't be such a surprise; the texture used for rendering is almost certainly getting cleared somewhere, even when rendering is suspended.

But this task simply isn't worth spending more time on. Maybe at some point in the future the path to avoiding the new page render for a frame will become clearer, but in the meantime the impact on the user is minimal.

So, tomorrow I'll get started on cleaning up the QML and JavaScript code so that this can all be finalised.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
29 Dec 2023 : Day 122 #
Over the last couple of days the plan had been to wrap up creating a QSortFilterProxyModel to hide the "hidden" tabs, stick it in the QML and call it a good 'un. But while doing that I discovered that switching from persistent to private tab mode causes the user interface thread of the browser to hang.

It's turning out to be quite nasty because, since there's no crash, the debugger won't tell us where the problem is. Looking through the code carefully hasn't revealed anything obvious either.

So to try to figure out what's going on I have to wait until the hang occurs, manually break the execution (using Ctrl-c) and then check each of the threads to find out what they're up to.

First of all, let's find out what threads are running.
(gdb) c
Continuing.
[Parent 23887: Unnamed thread 7f88002670]: E/EmbedLite NON_IMPL:
    EmbedLite::virtual nsresult WebBrowserChrome::GetDimensions(uint32_t,
    int32_t*, int32_t*, int32_t*, int32_t*):542 GetView dimensitions
[New LWP 26091]
^C
Thread 1 "sailfish-browse" received signal SIGINT, Interrupt.
0x0000007fb7bcf718 in pthread_cond_wait () from /lib64/libpthread.so.0
(gdb) info thread
  Id   Target Id                   Frame 
* 1    LWP 23887 "sailfish-browse" 0x0000007fb7bcf718 in pthread_cond_wait ()
  2    LWP 24132 "QQmlThread"      0x0000007fb78a8740 in poll ()
  3    LWP 24133 "QDBusConnection" 0x0000007fb78a8740 in poll ()
  4    LWP 24134 "gmain"           0x0000007fb78a8740 in poll ()
  5    LWP 24135 "dconf worker"    0x0000007fb78a8740 in poll ()
  6    LWP 24137 "gdbus"           0x0000007fb78a8740 in poll ()
  7    LWP 24147 "QThread"         0x0000007fb78a8740 in poll ()
  8    LWP 24149 "GeckoWorkerThre" StringMatch (text=0x7f9df3bf40,
                                   pat=0x7f9e0400c0, start=start@entry=0)
                                   at js/src/builtin/String.cpp:1944
  10   LWP 24151 "IPC I/O Parent"  0x0000007fb78ade24 in syscall ()
  11   LWP 24152 "QSGRenderThread" 0x0000007fb7bcf718 in pthread_cond_wait ()
  12   LWP 24153 "Netlink Monitor" 0x0000007fb78a8740 in poll ()
  13   LWP 24154 "Socket Thread"   0x0000007fb78a8740 in poll ()
  15   LWP 24156 "TaskCon~read #0" 0x0000007fb7bcf718 in pthread_cond_wait ()
  16   LWP 24157 "TaskCon~read #1" 0x0000007fb7bcf718 in pthread_cond_wait ()
  17   LWP 24158 "TaskCon~read #2" 0x0000007fb7bcf718 in pthread_cond_wait ()
  18   LWP 24159 "TaskCon~read #3" 0x0000007fb7bcf718 in pthread_cond_wait ()
  19   LWP 24160 "TaskCon~read #4" 0x0000007fb7bcf718 in pthread_cond_wait ()
  20   LWP 24161 "TaskCon~read #5" 0x0000007fb7bcf718 in pthread_cond_wait ()
  21   LWP 24162 "TaskCon~read #6" 0x0000007fb7bcf718 in pthread_cond_wait ()
  22   LWP 24163 "TaskCon~read #7" 0x0000007fb7bcf718 in pthread_cond_wait ()
  24   LWP 24165 "Timer"           0x0000007fb7bcfb80 in pthread_cond_timedwait ()
  25   LWP 24167 "IPDL Background" 0x0000007fb7bcf718 in pthread_cond_wait ()
  26   LWP 24168 "Cache2 I/O"      0x0000007fb7bcf718 in pthread_cond_wait ()
  27   LWP 24169 "Cookie"          0x0000007fb7bcf718 in pthread_cond_wait ()
  32   LWP 24174 "Worker Launcher" 0x0000007fb7bcf718 in pthread_cond_wait ()
  33   LWP 24175 "QuotaManager IO" 0x0000007fb7bcf718 in pthread_cond_wait ()
  35   LWP 24177 "Softwar~cThread" 0x0000007fb7bcfb80 in pthread_cond_timedwait ()
  36   LWP 24178 "Compositor"      0x0000007fb7bcf718 in pthread_cond_wait ()
  37   LWP 24179 "ImageIO"         0x0000007fb7bcf718 in pthread_cond_wait ()
  38   LWP 24181 "DOM Worker"      0x0000007fb7bcf718 in pthread_cond_wait ()
  40   LWP 24183 "ImageBridgeChld" 0x0000007fb7bcf718 in pthread_cond_wait ()
  42   LWP 24185 "Permission"      0x0000007fb7bcf718 in pthread_cond_wait ()
  43   LWP 24186 "TRR Background"  0x0000007fb7bcf718 in pthread_cond_wait ()
  44   LWP 24187 "URL Classifier"  0x0000007fb7bcf718 in pthread_cond_wait ()
  48   LWP 24191 "ProxyResolution" 0x0000007fb7bcf718 in pthread_cond_wait ()
  49   LWP 24193 "mozStorage #1"   0x0000007fb7bcf718 in pthread_cond_wait ()
  50   LWP 24195 "HTML5 Parser"    0x0000007fb7bcf718 in pthread_cond_wait ()
  51   LWP 24196 "localStorage DB" 0x0000007fb7bcf718 in pthread_cond_wait ()
  53   LWP 24198 "StyleThread#0"   0x0000007fb7bcf718 in pthread_cond_wait ()
  54   LWP 24199 "StyleThread#1"   0x0000007fb7bcf718 in pthread_cond_wait ()
  55   LWP 24200 "StyleThread#2"   0x0000007fb7bcf718 in pthread_cond_wait ()
  56   LWP 24202 "StyleThread#3"   0x0000007fb7bcf718 in pthread_cond_wait ()
  57   LWP 24203 "StyleThread#4"   0x0000007fb7bcf718 in pthread_cond_wait ()
  58   LWP 24204 "StyleThread#5"   0x0000007fb7bcf718 in pthread_cond_wait ()
  65   LWP 24293 "mozStorage #2"   0x0000007fb7bcf718 in pthread_cond_wait ()
  66   LWP 24294 "mozStorage #3"   0x0000007fb7bcf718 in pthread_cond_wait ()
  69   LWP 24942 "Backgro~Pool #5" 0x0000007fb7bcfb80 in pthread_cond_timedwait ()
  70   LWP 26091 "QSGRenderThread" 0x0000007fb78a8740 in poll ()
That's rather a lot of threads to check through, but nonetheless it's still likely to be our most fruitful approach right now. We can ask the debugger to print a backtrace for every single thread by calling thread apply all bt.

I won't copy out all of the resulting output here. Instead I'll include just the interesting backtraces and summarise the others.
(gdb) thread apply all bt

Thread 70 (LWP 26091):
#0  0x0000007fb78a8740 in poll () from /lib64/libc.so.6
#1  0x0000007fafb38bfc in ?? () from /usr/lib64/libwayland-client.so.0
#2  0x0000007fafb3a258 in wl_display_dispatch_queue ()
    from /usr/lib64/libwayland-client.so.0
#3  0x0000007faf885204 in WaylandNativeWindow::readQueue(bool) ()
    from /usr/lib64/libhybris//eglplatform_wayland.so
#4  0x0000007faf8843ec in WaylandNativeWindow::finishSwap() ()
    from /usr/lib64/libhybris//eglplatform_wayland.so
#5  0x0000007fb73f9210 in _my_eglSwapBuffersWithDamageEXT ()
    from /usr/lib64/libEGL.so.1
#6  0x0000007fafa4e080 in ?? () from
    /usr/lib64/qt5/plugins/wayland-graphics-integration-client/libwayland-egl.so
#7  0x0000007fb88e5180 in QOpenGLContext::swapBuffers(QSurface*) ()
    from /usr/lib64/libQt5Gui.so.5
#8  0x0000007fb8e64c68 in ?? () from /usr/lib64/libQt5Quick.so.5
#9  0x0000007fb8e6ac10 in ?? () from /usr/lib64/libQt5Quick.so.5
#10 0x0000007fb7ce20e8 in ?? () from /usr/lib64/libQt5Core.so.5
#11 0x0000007fb7bc8a4c in ?? () from /lib64/libpthread.so.0
#12 0x0000007fb78b289c in ?? () from /lib64/libc.so.6

[...]

Thread 11 (LWP 24152):
#0  0x0000007fb7bcf718 in pthread_cond_wait () from /lib64/libpthread.so.0
#1  0x0000007fb7ce2924 in QWaitCondition::wait(QMutex*, unsigned long) ()
    from /usr/lib64/libQt5Core.so.5
#2  0x0000007fb8e6a7cc in ?? () from /usr/lib64/libQt5Quick.so.5
#3  0x0000007fb8e6ac60 in ?? () from /usr/lib64/libQt5Quick.so.5
#4  0x0000007fb7ce20e8 in ?? () from /usr/lib64/libQt5Core.so.5
#5  0x0000007fb7bc8a4c in ?? () from /lib64/libpthread.so.0
#6  0x0000007fb78b289c in ?? () from /lib64/libc.so.6

Thread 10 (LWP 24151):
#0  0x0000007fb78ade24 in syscall () from /lib64/libc.so.6
#1  0x0000007fba03cdd0 in epoll_wait (epfd=<optimized out>,
    events=events@entry=0x7f8c0018a0, maxevents=<optimized out>,
    timeout=timeout@entry=-1)
    at ipc/chromium/src/third_party/libevent/epoll_sub.c:62
#2  0x0000007fba03f7a0 in epoll_dispatch (base=0x7f8c0015e0, tv=<optimized out>)
    at ipc/chromium/src/third_party/libevent/epoll.c:462
#3  0x0000007fba041568 in event_base_loop (base=0x7f8c0015e0,
    flags=flags@entry=1) at ipc/chromium/src/third_party/libevent/event.c:1947
#4  0x0000007fba01e248 in base::MessagePumpLibevent::Run (this=0x7f8c001560,
    delegate=0x7f9eeb9de0) at ipc/chromium/src/base/message_pump_libevent.cc:346
#5  0x0000007fba01f5bc in MessageLoop::RunInternal (this=this@entry=0x7f9eeb9de0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#6  0x0000007fba01f800 in MessageLoop::RunHandler (this=0x7f9eeb9de0)
    at ipc/chromium/src/base/message_loop.cc:352
#7  MessageLoop::Run (this=this@entry=0x7f9eeb9de0)
    at ipc/chromium/src/base/message_loop.cc:334
#8  0x0000007fba0336d8 in base::Thread::ThreadMain (this=0x7f88031040)
    at ipc/chromium/src/base/thread.cc:187
#9  0x0000007fba01dba0 in ThreadFunc (closure=<optimized out>)
    at ipc/chromium/src/base/platform_thread_posix.cc:40
#10 0x0000007fb7bc8a4c in ?? () from /lib64/libpthread.so.0
#11 0x0000007fb78b289c in ?? () from /lib64/libc.so.6

Thread 8 (LWP 24149):
#0  StringMatch (text=0x7f9df3bf40, pat=0x7f9e0400c0, start=start@entry=0)
    at js/src/builtin/String.cpp:1944
#1  0x0000007fbcc68b0c in js::str_indexOf (cx=<optimized out>,
    argc=<optimized out>, vp=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/CallArgs.h:245
#2  0x0000007f00708994 in ?? ()
#3  0x0000007fbd165538 in js::jit::MaybeEnterJit (cx=0xc021, state=...)
    at js/src/jit/Jit.cpp:207
#4  0x0000007fbd165538 in js::jit::MaybeEnterJit (cx=0x7f881cd720, state=...)
    at js/src/jit/Jit.cpp:207
#5  0x0000007ef4101401 in ?? ()
Backtrace stopped: Cannot access memory at address 0x84041a32f7d6

[...]
(gdb)
Of the other threads 35 are in a pthread_cond_wait() state and three are in a pthread_cond_timedwait() state. Nine are in a poll() state.
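Counting states by eye across this many threads is error prone, so a small helper can do the tallying. The following is a rough sketch in plain C++ that takes pasted info thread output and counts the innermost function per thread. The function name tallyFrames is made up for illustration, and the parsing assumes the "address in function ()" layout seen in the listing above; lines that don't match, such as the header row or thread 8, are simply skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Tally the innermost frame of each thread from pasted `info thread` output.
// Only lines of the form "... 0xADDRESS in function ()" are counted.
std::map<std::string, int> tallyFrames(const std::string &infoThreadOutput)
{
    std::map<std::string, int> counts;
    std::istringstream input(infoThreadOutput);
    std::string line;
    const std::string marker = " in ";
    while (std::getline(input, line)) {
        const std::size_t start = line.find(marker);
        if (start == std::string::npos) {
            continue;   // no "in" part, so nothing to count on this line
        }
        const std::size_t nameStart = start + marker.size();
        const std::size_t nameEnd = line.find(" (", nameStart);
        if (nameEnd == std::string::npos) {
            continue;   // function name not followed by an argument list
        }
        ++counts[line.substr(nameStart, nameEnd - nameStart)];
    }
    return counts;
}

// Usage example with three lines taken from the listing above.
void tallyExample()
{
    const std::string sample =
        "  2    LWP 24132 \"QQmlThread\"      0x0000007fb78a8740 in poll ()\n"
        "  3    LWP 24133 \"QDBusConnection\" 0x0000007fb78a8740 in poll ()\n"
        "  24   LWP 24165 \"Timer\"           0x0000007fb7bcfb80 in pthread_cond_timedwait ()\n";
    const std::map<std::string, int> counts = tallyFrames(sample);
    assert(counts.at("poll") == 2);
    assert(counts.at("pthread_cond_timedwait") == 1);
}
```

A thread name that happened to contain " in " would confuse this simple parser, so it's only suitable for a quick sanity check.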

That leaves thread 8 and thread 10 in slightly different states, but in practice their backtraces still don't tell us anything particularly interesting.

The one that looks most interesting to me is thread 70. It looks to be a rendering thread, but thread 11 is already there as a rendering thread and during normal operation I'd only expect to see one of these.

Thread 70 is doing something to do with EGL and wayland.

Unfortunately this still doesn't give me anything actionable. So I'm going to try something else.

The hang occurs when we switch to private browsing mode. So let's take a look at the code that happens as a result of pressing the button to do this in the user interface. This can be found in declarativewebcontainer.cpp and looks like this:
void DeclarativeWebContainer::updateMode()
{
    setTabModel((BrowserApp::captivePortal() || m_privateMode)
        ? m_privateTabModel.data() : m_persistentTabModel.data());
    emit tabIdChanged();

    // Reload active tab from new mode
    if (m_model->count() > 0) {
        reload(false);
    } else {
        setWebPage(NULL);
        emit contentItemChanged();
    }
}
By commenting out different bits of the code and rebuilding I should be able to narrow the issue down.

And it turns out that here it's the call to setTabModel() that triggers the hang. Comment this line out and everything continues without hanging. Of course the functionality isn't all working (it doesn't correctly switch to private browsing mode), but this does at least get us somewhere to start. Let's dig deeper into this setTabModel() method. The implementation for this method looks like this:
void DeclarativeWebContainer::setTabModel(DeclarativeTabModel *model)
{
    if (m_model != model) {
        int oldCount = 0;
        if (m_model) {
            disconnect(m_model, 0, 0, 0);
            oldCount = m_model->count();
        }

        m_model = model;
        int newCount = 0;
        if (m_model) {
            connect(m_model.data(), &DeclarativeTabModel::activeTabChanged,
                    this, &DeclarativeWebContainer::onActiveTabChanged);
            connect(m_model.data(), &DeclarativeTabModel::activeTabChanged,
                    this, &DeclarativeWebContainer::tabIdChanged);
            connect(m_model.data(), &DeclarativeTabModel::loadedChanged,
                    this, &DeclarativeWebContainer::initialize);
            connect(m_model.data(), &DeclarativeTabModel::tabClosed,
                    this, &DeclarativeWebContainer::releasePage);
            connect(m_model.data(), &DeclarativeTabModel::newTabRequested,
                    this, &DeclarativeWebContainer::onNewTabRequested);
            newCount = m_model->count();
        }
        emit tabModelChanged();
        if (m_model && oldCount != newCount) {
            emit m_model->countChanged();
        }
    }
}
Again, by commenting out different parts of the code it should be possible to narrow down what's causing the problem. And indeed in this case the only line that consistently causes the hang to happen is the following:
        emit tabModelChanged();
If I comment out just this line and leave all of the others in, the hang dissipates. But keep this line in, even with the others removed, and the hang returns.

So that means there's something hooked into this signal that's causing the hang. There's only one connection made to this in the C++ code and commenting that out doesn't make any difference. So the problem must be in the QML code.

Unfortunately there are many different bits of code that hang off this signal. I've commented large chunks of code related to this signal out to see whether skipping them prevents the hang. Most seem to have no effect on the outcome. But if I comment out enough stuff the hang no longer occurs.

Now I'm working backwards adding code back in until the hang returns. It's laborious, but at least guaranteed to move things forwards one step at a time.
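This comment-out-and-retest process is essentially a bisection. As a rough illustration, here's how it could be expressed in plain C++. Everything in it is hypothetical: findCulprit is a made-up name, and the fails callback stands in for the manual step of rebuilding, running the browser and checking whether it hangs. It also assumes there's exactly one offending block, that the blocks can be enabled independently, and that the hang is deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Find the single block whose presence triggers a failure.
// `fails(enabled)` receives one flag per block (true = left in,
// false = commented out) and reports whether the hang occurs.
std::size_t findCulprit(
    std::size_t blockCount,
    const std::function<bool(const std::vector<bool> &)> &fails)
{
    assert(blockCount > 0);
    std::size_t low = 0;             // the culprit lies in [low, high)
    std::size_t high = blockCount;
    while (high - low > 1) {
        const std::size_t mid = low + (high - low) / 2;
        // Leave in only the lower half of the remaining candidates.
        std::vector<bool> enabled(blockCount, false);
        for (std::size_t i = low; i < mid; ++i) {
            enabled[i] = true;
        }
        if (fails(enabled)) {
            high = mid;   // the hang needs something in [low, mid)
        } else {
            low = mid;    // otherwise it must be in [mid, high)
        }
    }
    return low;
}
```

With eight candidate blocks this needs three test runs rather than eight, which matters when each run involves a rebuild and a deployment to the device.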

Eventually, after a lot of trial and error, I've been able to narrow down the problem to this small snippet of code in BrowserPage.qml:
    // Use Connections so that target updates when model changes.
    Connections {
        target: AccessPolicy.browserEnabled && webView
                && webView.tabModel || null
        ignoreUnknownSignals: true
        // Animate overlay to top if needed.
        onCountChanged: {
            if (webView.tabModel.count === 0) {
                webView.handleModelChanges(false)
            }
            window.setBrowserCover(webView.tabModel)
        }
    }
Comment this code out and everything seems to work okay. At least, everything except the cover preview. In particular, the proxy filter model doesn't seem to be causing any issues.

With a bit more trial and error I'm able to narrow it down even further, to just this line:
            window.setBrowserCover(webView.tabModel)
With this line commented out things are looking up: the print to PDF functionality works nicely; there are no extraneous tabs added during printing; there's just the slightest of flickers when the printing starts; but crucially there are no other obvious issues and no hanging.

I'm really happy with this. It means that the code for getting the print to PDF functionality working is now all there. It'll be important to deal with the hang, but it's actually unrelated to these changes.

Given all this, that leaves me four things to deal with:
  1. Check whether the flicker can be removed by skipping the activation of the hidden window.
  2. Tidy up the QML implementation of the new proxy filter model.
  3. Move the DownloadPDFSaver class from DownloadCore.js to EmbedliteDownloadManager.js.
  4. Record an issue for the setBrowserCover() hang we were just looking at.
That's still quite a lot to do. But at least I'm happy that everything is now under control. What's now clear is that it'll definitely be possible to restore the PDF printing functionality; all I really have to do now is clean up the implementation. Fixing the hang can go in a separate issue.

Today I'm happy: it's been a productive day, we got to the bottom of the hang, and all of the printing changes are coming together into something usable. Now it's time for bed and hopefully a very sound sleep as a result.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
28 Dec 2023 : Day 121 #
Yesterday we looked at creating a DeclarativeTabFilterModel that allowed the hidden windows to be filtered out. The implementation turned out to be really straightforward, although to be fair that's because Qt is doing the majority of the hard work for us.

Today I'm going to try adding it in to the QML code to see what happens. The interesting file here is TabView.qml. This is the component that displays the tabs to the user. The design of the user interface means it has to handle multiple models. First there's a model to allow switching between standard and private tabs called the modeModel, which is defined directly in the QML. Then there's a model for handling the standard tab list, accessed as webView.persistentTabModel. The use of "persistent" in the name is a reference to the fact that these tabs will survive the browser being shut down and restarted. Finally there are the private tabs, accessed through webView.privateTabModel. These lose the "persistent" moniker because the private tabs are all discarded when the user closes the browser.

We're interested in the last two of these models because we want to wrap them in our filtering proxy. The key bit of code for this is the following (found in the same TabView.qml file):
    TabGridView {
        id: _tabView

        portrait: tabView.portrait
        model: tabItem.privateMode ? webView.privateTabModel
                                   : webView.persistentTabModel
        header: Item {
            width: 1
            height: Theme.paddingLarge
        }

        onHide: tabView.hide()
        onEnterNewTabUrl: tabView.enterNewTabUrl()
        onActivateTab: tabView.activateTab(index)
        onCloseTab: tabView.closeTab(index)
        onCloseAll: tabView.closeAll()
        onCloseAllCanceled: tabView.closeAllCanceled()
        onCloseAllPending: tabView.closeAllPending()
    }
As you can see, the tabs are set to use a model that depends on whether private mode is active or not. There's a ternary operator there that chooses between the two models:
        model: tabItem.privateMode ? webView.privateTabModel
                                   : webView.persistentTabModel
This line hooks the correct tab model up to the model property of the TabGridView. The code here looks pretty slick and straightforward, but I recall when it was first implemented it caused a lot of trouble. The difficulty is that when the component switches between persistent and private tabs there's a short period during the animation when both models are used simultaneously. Ensuring that they can both exist without the tabs suddenly switching from one model to the other in an instant needed some work.

It wasn't me that had to implement that, but I probably reviewed the code at some stage.

So now let's add in our filtering proxy model.
    TabGridView {
        id: _tabView

        portrait: tabView.portrait
        model: TabFilterModel {
            sourceModel: tabItem.privateMode ? webView.privateTabModel
                                             : webView.persistentTabModel
            showHidden: false
        }
        header: Item {
            width: 1
            height: Theme.paddingLarge
        }
[...]
When I test this out on-device it works surprisingly well. The hidden tab is indeed hidden in the tab view. I need to sort the tab numbering out, but that's expected because I've not started using the filter model for the numbering yet. As soon as I switch to using it for the count as well, it should all fall into line.

However, there is one more serious problem. Switching between persistent and private modes causes the browser to hang. My guess is that this is to do with the code used to switch between the two, but I'm not certain yet.

But that's okay, this is what building this sort of stuff is all about: try something out, find the glitches, fix the glitches. It's not quite as clean and logical as we might always like, but this is software engineering, not computer science.

As we can see from the code shown above, the model is being used in a TabGridView component. When the tab switches it comes up with this error:
[W] unknown:65 - file:///usr/share/sailfish-browser/pages/components/
    TabGridView.qml:65:19: Unable to assign [undefined] to int
When we look inside the code for TabGridView we can see that this relates to this line:
    currentIndex: model.activeTabIndex
This is making use of the activeTabIndex property, which is a member of DeclarativeTabModel (and so therefore also a member of the persistent and private models that inherit from it) but not a member of our filter proxy model. So that would explain why it's so unhappy.

I can pass through the property quite easily. It looks like the count property is also used, but not the loaded property, so probably we just need to pass through those two.
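One thing to watch when passing an index through a filter is that the source index and the filtered index diverge as soon as a row is filtered out. The sketch below illustrates this in plain C++ rather than Qt, so that it stands alone. SourceTabModel and FilteredTabView are hypothetical stand-ins for the tab model and the proxy, and the translation of the active index is an assumption about what a view would want; in real Qt code QSortFilterProxyModel::mapFromSource() performs this kind of mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical plain-C++ stand-ins; the real classes are Qt models.
struct TabEntry {
    int tabId;
    bool hidden;
};

struct SourceTabModel {
    std::vector<TabEntry> tabs;
    int activeTabIndex = -1;   // index into tabs, or -1 for no active tab
};

class FilteredTabView {
public:
    explicit FilteredTabView(const SourceTabModel &source) : m_source(source) {}

    // Number of visible (non-hidden) tabs.
    int count() const
    {
        int visible = 0;
        for (const TabEntry &tab : m_source.tabs) {
            if (!tab.hidden) {
                ++visible;
            }
        }
        return visible;
    }

    // Active index in filtered coordinates, or -1 if the active tab is
    // unset, out of range or hidden.
    int activeTabIndex() const
    {
        const int active = m_source.activeTabIndex;
        if (active < 0 || active >= static_cast<int>(m_source.tabs.size())) {
            return -1;
        }
        if (m_source.tabs[static_cast<std::size_t>(active)].hidden) {
            return -1;
        }
        int filteredIndex = 0;
        for (int i = 0; i < active; ++i) {
            if (!m_source.tabs[static_cast<std::size_t>(i)].hidden) {
                ++filteredIndex;   // count visible rows before the active one
            }
        }
        return filteredIndex;
    }

private:
    const SourceTabModel &m_source;
};

// Usage example: one hidden tab sits in front of the active tab.
void filteredViewExample()
{
    SourceTabModel model;
    model.tabs = { {10, false}, {11, true}, {12, false} };
    model.activeTabIndex = 2;
    FilteredTabView view(model);
    assert(view.count() == 2);
    assert(view.activeTabIndex() == 1);
}
```

In the example the active tab is at index 2 in the source but index 1 in the filtered view, which is the value a grid showing only the visible tabs would need for its current index.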

Having added the required properties and tested out the code, it seems the problem still persists: switching from persistent to private tabs causes the browser to hang. I tried out a few changes to my code to see if that would help, including setting the filter flag to false, but that still doesn't fix it.

Thinking that my changes might have corrupted the tab database details stored on disk I also tried removing the profile data stored at ~/.local/share/org.sailfishos/browser. This cleared all of the tabs, but to my surprise the browser now hangs when creating the first persistent tab as well. So the issue isn't necessarily to do with switching between persistent and private views; more likely it's to do with creating the very first tab in each case.

After reverting my changes to the filter proxy it's now clear that it's not the proxy that's causing the issue here at all: it is indeed the creation of the very first tab. With the profile restored there are a bunch of tabs pre-existing in the persistent tab list. But in the private tab list there are none. So switching to this empty list is what's causing the crash.

So it looks like I need to double back and fix this. Here's the backtrace I get from the crash:
Thread 1 "sailfish-browse" received signal SIGSEGV, Segmentation fault.
Tab::tabId (this=0x48) at ../storage/tab.cpp:38
38          return m_tabId;
(gdb) bt
#0  Tab::tabId (this=0x48) at ../storage/tab.cpp:38
#1  0x00000055555cb3d8 in DeclarativeWebPage::tabId (this=<optimized out>)
    at ../qtmozembed/declarativewebpage.cpp:127
#2  0x0000005555591fd0 in DeclarativeWebContainer::onNewTabRequested
    (this=0x555569b370, tab=...) at include/c++/8.3.0/bits/atomic_base.h:390
#3  0x0000007fb7ec4204 in QMetaObject::activate(QObject*, int, int, void**) ()
    from /usr/lib64/libQt5Core.so.5
#4  0x00000055555f8bb0 in DeclarativeTabModel::newTabRequested
    (this=this@entry=0x55559c7e30, _t1=...) at moc_declarativetabmodel.cpp:366
#5  0x00000055555c5148 in DeclarativeTabModel::newTab
    (this=this@entry=0x55559c7e30, url=..., parentId=parentId@entry=0, 
    browsingContext=browsingContext@entry=0, hidden=hidden@entry=false)
    at ../history/declarativetabmodel.cpp:233
#6  0x00000055555c5318 in DeclarativeTabModel::newTab
    (this=this@entry=0x55559c7e30, url=...)
    at ../history/declarativetabmodel.cpp:199
#7  0x00000055555f8f3c in DeclarativeTabModel::qt_static_metacall
    (_o=_o@entry=0x55559c7e30, _c=_c@entry=QMetaObject::InvokeMetaMethod,
    _id=_id@entry=18, _a=_a@entry=0x7fffffbdc8)
    at moc_declarativetabmodel.cpp:182
[...]
#18 0x0000005555c61460 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
(gdb) frame 1
#1  0x00000055555cb3d8 in DeclarativeWebPage::tabId (this=<optimized out>)
    at ../qtmozembed/declarativewebpage.cpp:127
127         return m_initialTab.tabId();
(gdb) p m_initialTab
value has been optimized out
(gdb) b DeclarativeWebPage::setInitialState
Breakpoint 1 at 0x55555cb3d8: file ../qtmozembed/declarativewebpage.cpp, line 134.
(gdb) r
[...]

Thread 1 "sailfish-browse" received signal SIGSEGV, Segmentation fault.
Tab::tabId (this=0x48) at ../storage/tab.cpp:38
38          return m_tabId;
(gdb) 
Notice that the issue here is that m_initialTab isn't set correctly. In fact, by putting additional breakpoints on the code I'm able to see that it's never actually getting set at all. It seems that the DeclarativeWebPage::setInitialState() method that sets it is never getting called.

This is supposed to be called in the WebPageFactory::createWebPage() method, which a breakpoint confirms is also not being called. From inspection of the code it seems that this is supposed to be being called from the WebPages::page() method. But that's also not being called.

Finally, this page() call is supposed to be made from the DeclarativeWebContainer::activatePage() method. You might recall this is a method I messed around with earlier.

It looks like the problem line might actually be one of the debug lines I added:
    qDebug() << "PRINT: onNewTabRequested post activeTab: "
        << m_webPage->tabId();
This line works fine if m_webPage::m_initialTab has been set correctly, but if it's the very first tab it won't yet have been set at all. When this happens the code tries to dereference an uninitialised pointer and boom. If this really is the problem then that will be nice: understandable and easy to fix. I've rebuilt sailfish-browser without the debug output line to see what happens.
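If the debug output were worth keeping, a guarded version would avoid the problem. Here's a minimal sketch in plain C++, with a hypothetical Page struct standing in for DeclarativeWebPage and a made-up describeActiveTab() helper. It assumes the page pointer can be null before the first tab exists. That seems consistent with the this=0x48 shown for Tab::tabId() in the backtrace, which looks like a small member offset from a null pointer, although that's an inference rather than something verified here.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the page class; only tabId() matters here.
struct Page {
    int id;
    int tabId() const { return id; }
};

// Build the debug message without touching the page unless it exists.
std::string describeActiveTab(const Page *page)
{
    const std::string prefix = "PRINT: onNewTabRequested post activeTab: ";
    if (page == nullptr) {
        return prefix + "<no page yet>";
    }
    return prefix + std::to_string(page->tabId());
}

// Usage example covering both the empty and the populated case.
void describeExample()
{
    assert(describeActiveTab(nullptr)
           == "PRINT: onNewTabRequested post activeTab: <no page yet>");
    const Page page{7};
    assert(describeActiveTab(&page)
           == "PRINT: onNewTabRequested post activeTab: 7");
}
```

A check like this only helps when the pointer is null; a genuinely uninitialised pointer would need to be initialised to nullptr first for the test to be meaningful.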

This has changed things a bit: there's no longer a crash when creating the very first page. But switching to private browsing mode is still causing problems, even without the changes I made today to add the filter proxy model. That might suggest the issue has been there for much longer than I thought: I've just not switched to private browsing recently.

Running the app through the debugger it becomes clear that the app really is hanging rather than crashing. Or, even more specifically, the front-end is hanging. The JavaScript interpreter seems to happily continue running in the background. It's not at all clear why this might be happening and since it's already quite late now, finding the answer is going to have to wait until the morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
27 Dec 2023 : Day 120 #
Over the last few days we've been looking at hiding the print window that the page gets cloned into when saving it as a PDF using the print routines. We got to the point where the print succeeds, the page appears for only the briefest of moments, but the tab for the page still appears in the tab model view.

Today we're going to start work on filtering out the page from the tab view.

After some discussion online with thigg, who rightly pointed out some potential dangers with doing this, I've decided to add the functionality but using a config setting that allows it to be enabled and disabled.

Before we implement the config setting I first want to put the filter together. To do this we're going to use a QSortFilterProxyModel. This will look exactly like the original model but with the ability to filter on specific values, in our case the "hidden" flag.

The Qt docs have a nice example of using this filtering approach which we can pretty much copy. All of this is happening in the sailfish-browser code as a wrapper for DeclarativeTabModel. There are also plenty of existing examples in the sailfish-browser code, including BookmarkFilterModel and LoginFilterModel, used to search on bookmarks and logins respectively.

Following these examples we're going to call ours the DeclarativeTabFilterModel.

I've put together the filter. It's generally very simple code, to the extent that I think I can get away with posting the entire header here.
#include <QSortFilterProxyModel>

class DeclarativeTabFilterModel : public QSortFilterProxyModel
{
    Q_OBJECT
    Q_PROPERTY(bool showHidden READ showHidden WRITE setShowHidden NOTIFY
        showHiddenChanged)
public:
    DeclarativeTabFilterModel(QObject *parent = nullptr);

    Q_INVOKABLE int getIndex(int currentIndex);

    bool filterAcceptsRow(int sourceRow, const QModelIndex &sourceParent)
        const override;
    void setSourceModel(QAbstractItemModel *sourceModel) override;

    bool showHidden() const;
    void setShowHidden(const bool showHidden);

signals:
    void showHiddenChanged(bool showHidden);

private:
    bool m_showHidden;
};
In essence all it's doing is accepting the model, then filtering the rows based on whether they're hidden or not. The key piece of code in the implementation is for the filterAcceptsRow() method which looks like this:
bool DeclarativeTabFilterModel::filterAcceptsRow(int sourceRow,
    const QModelIndex &sourceParent) const
{
    QModelIndex index = sourceModel()->index(sourceRow, 0, sourceParent);

    return (m_showHidden || !sourceModel()->data
        (index, DeclarativeTabModel::HiddenRole).toBool());
}
The underlying Qt implementation does all the rest. It's very nice stuff.
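To give a feel for what "the rest" amounts to, here's a minimal sketch in plain C++, without Qt, of the row mapping a filter proxy has to maintain. The TabRow struct and the acceptsRow(), visibleRows() and proxyRowFor() functions are hypothetical names made up for this illustration. They're not part of sailfish-browser or of QSortFilterProxyModel, which handles all of this bookkeeping internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one row of the tab model.
struct TabRow {
    int tabId;
    std::string title;
    bool hidden;
};

// Same decision as filterAcceptsRow(): keep the row when hidden rows are
// being shown, or when the row itself isn't hidden.
bool acceptsRow(const TabRow &row, bool showHidden)
{
    return showHidden || !row.hidden;
}

// Builds the proxy-to-source mapping: element i holds the source row that
// proxy row i refers to.
std::vector<std::size_t> visibleRows(const std::vector<TabRow> &source,
                                     bool showHidden)
{
    std::vector<std::size_t> mapping;
    for (std::size_t row = 0; row < source.size(); ++row) {
        if (acceptsRow(source[row], showHidden)) {
            mapping.push_back(row);
        }
    }
    return mapping;
}

// Translates a source row into its proxy row, or -1 if it's filtered out.
int proxyRowFor(const std::vector<TabRow> &source, std::size_t sourceRow,
                bool showHidden)
{
    assert(sourceRow < source.size());
    const std::vector<std::size_t> mapping = visibleRows(source, showHidden);
    for (std::size_t proxyRow = 0; proxyRow < mapping.size(); ++proxyRow) {
        if (mapping[proxyRow] == sourceRow) {
            return static_cast<int>(proxyRow);
        }
    }
    return -1;
}
```

The point to take away is that row numbers in the filtered view and row numbers in the underlying model can differ as soon as one row is hidden, so any index passed between the two has to be translated first.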

I'm hoping I can drop this right in as a replacement for the model in the TabView.qml code. Or, to be more precise, as a drop in for the two models in the TabView.qml code, because there's a separate model for the persistent (standard) and private tab lists.

However, there may be a catch because the DeclarativeTabModel provides a bunch of other properties which potentially might get used for various things. I'll have to be careful not to accidentally switch out a model for the filtered model where any existing code relies on these additional properties. The additional properties won't automatically be provided by the filtered model.

Looking carefully through the code, I think it's safe though. I should be able to replace the model for the filtered proxy model close enough to the view controller that it will only use a minimal set of additional properties.

In order to actually make use of this new declarative component, we must register it with the QML typing system. Along with all of the other components built in this way we do this in the startup code of the application, most easily done in the main() function like this:
    qmlRegisterType<DeclarativeTabFilterModel>(uri, 1, 0, "TabFilterModel");
The only other thing I need to do is add the new file to the Qt project files; in this case the apps/history/history.pri file.

Having done that I'm pleased to see it compiles and builds the packages fine. But it's only a short post today and I'm not planning to test the new model now. Instead I'll pick this up and add it to the QML front-end tomorrow.

26 Dec 2023 : Day 119 #
We're getting perilously close to 27 days on this now. I admit this printing issue has turned out to be more gnarly than I'd hoped. But I don't feel ready to give up on it just yet. Yesterday we got the page hidden but couldn't figure out why that prevented the print from successfully completing. Today I plan to debug the print process again to try to find out why.

Let's get the error output that we're dealing with. There are some additional debug print outputs that I've added (all of the ones starting with PRINT:) to help clarify the flow. Whenever I add temporary debug prints in this way I like to give them all the same prefix. It makes them easier to spot, but also makes it easier to filter on them all using grep in case there's so much debug output that it makes them hard to spot.
$ EMBED_CONSOLE=1 MOZ_LOG="EmbedLite:5" sailfish-browser
[...]
PRINT: Window: [object Window]
PRINT: Window document: [object HTMLDocument]
PRINT: Window MozElement: undefined
[Parent 15251: Unnamed thread 7718002670]: E/EmbedLite FUNC::virtual nsresult mozilla::embedlite::EmbedLiteAppChild::Observe(nsISupports*, const char*,
    const char16_t*):68 topic:embed:download
[D] unknown:0 - PRINT: onNewTabRequested pre activeTab:  10
[D] unknown:0 - PRINT: new tab is hidden; recording previous ID:  10
[D] unknown:0 - PRINT: onNewTabRequested post activeTab:  11
[W] unknown:0 - bool DBWorker::execute(QSqlQuery&) failed execute query
[W] unknown:0 - "INSERT INTO tab (tab_id, tab_history_id) VALUES (?,?);"
[W] unknown:0 - QSqlError("19", "Unable to fetch row", "UNIQUE constraint
    failed: tab.tab_id")
[D] unknown:0 - PRINT: onActiveTabChanged:  11
[D] unknown:0 - PRINT: onActiveTabChanged hidden:  true
[D] unknown:0 - PRINT: new tab is hidden, activating previous ID:  10
[D] unknown:0 - PRINT: activateTab: old:  11
[D] unknown:0 - PRINT: activateTab: new:  4
[D] unknown:0 - PRINT: activateTab: activate tab:  4 Tab(tabId = 10,
    parentId = 0, isValid = true, url = "https://jolla.com/",
    requested url = "", url resolved: true, title = "Jolla", thumbnailPath =
    "~/.cache/org.sailfishos/browser/tab-10-thumb.jpg", desktopMode = false)
[D] unknown:0 - PRINT: onActiveTabChanged:  10
[D] unknown:0 - PRINT: onActiveTabChanged hidden:  false
EmbedliteDownloadManager error: [Exception... "Abort"  nsresult: "0x80004004
    (NS_ERROR_ABORT)"  location: "JS frame ::
    resource://gre/modules/DownloadCore.jsm :: DownloadError :: line 1755"
    data: no]
[Parent 15251: Unnamed thread 7718002670]: E/EmbedLite FUNC::virtual nsresult 
    mozilla::embedlite::EmbedLiteAppChild::Observe(nsISupports*, const char*,
    const char16_t*):68 topic:embed:download
[Parent 15251: Unnamed thread 7718002670]: I/EmbedLite WARN: EmbedLite::virtual
    void* mozilla::embedlite::EmbedLitePuppetWidget::GetNativeData(uint32_t):127
    EmbedLitePuppetWidget::GetNativeData not implemented for this type
JavaScript error: , line 0: uncaught exception: Object
JSScript: ContextMenuHandler.js loaded
JSScript: SelectionPrototype.js loaded
JSScript: SelectionHandler.js loaded
JSScript: SelectAsyncHelper.js loaded
JSScript: FormAssistant.js loaded
JSScript: InputMethodHandler.js loaded
EmbedHelper init called
Available locales: en-US, fi, ru
Frame script: embedhelper.js loaded
CONSOLE message:
[JavaScript Error: "uncaught exception: Object"]
[...]
There are two things I need to look into from this debug output. First the database error:
[W] unknown:0 - QSqlError("19", "Unable to fetch row", "UNIQUE constraint failed:
    tab.tab_id")
It looks like something has messed up the tab database. That's not so surprising given that I made some (not very carefully thought-through) changes to the dbworker.cpp code. However my suspicion is that these database changes won't be affecting printing adversely. Second the download error:
EmbedliteDownloadManager error: [Exception... "Abort"  nsresult: "0x80004004
    (NS_ERROR_ABORT)"  location: "JS frame ::
    resource://gre/modules/DownloadCore.jsm :: DownloadError :: line 1755"
    data: no]
I'm going to start with the latter because it seems more likely to be the underlying issue. The exception is being thrown here, in the last of the three conditional blocks found in DownloadCore.jsm:
var DownloadError = function(aProperties) {
[...]
  if (aProperties.message) {
    this.message = aProperties.message;
  } else if (
    aProperties.becauseBlocked ||
    aProperties.becauseBlockedByParentalControls ||
    aProperties.becauseBlockedByReputationCheck ||
    aProperties.becauseBlockedByRuntimePermissions
  ) {
    this.message = "Download blocked.";
  } else {
    let exception = new Components.Exception("", this.result);
    this.message = exception.toString();
  }
That's not super helpful though because this is just the code that's called when there's an error of any kind. We need to find out where this is being called from.

There are 19 places in the DownloadCore.jsm file that might trigger an error using this DownloadError() method. Of these, nine include a message field and so will fall into the first block of the condition, while four have one of the becauseBlocked... flags set and so fall into the second block of the condition. Finally one of them is the method used for deserialisation of the message.

That leaves five left which could potentially be one of the ones triggering the error we're seeing. These five can be found on lines 518, 557, 2164, 2740 and 3035 of the DownloadCore.jsm file.
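The elimination can be sketched in C++ as follows. This is only a simplified model of the JavaScript logic shown above: the ErrorProperties struct, the MessageSource enum and both functions are hypothetical names invented for the illustration, and a single becauseBlocked flag stands in for all four of the becauseBlocked... properties.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified mirror of the properties passed to DownloadError().
struct ErrorProperties {
    std::string message;          // empty means no message was supplied
    bool becauseBlocked = false;  // stands in for all becauseBlocked... flags
};

// Which of the three conditional blocks supplies the error message.
enum class MessageSource { Explicit, Blocked, Exception };

MessageSource messageSource(const ErrorProperties &properties)
{
    if (!properties.message.empty()) {
        return MessageSource::Explicit;
    }
    if (properties.becauseBlocked) {
        return MessageSource::Blocked;
    }
    // Only this branch builds the message from an exception object.
    return MessageSource::Exception;
}

// Counts the call sites that can still reach the final branch.
int remainingCallSites(int total, int withMessage, int blocked,
                       int deserialising)
{
    assert(withMessage + blocked + deserialising <= total);
    return total - withMessage - blocked - deserialising;
}
```

With the counts from the text, remainingCallSites(19, 9, 4, 1) gives the five candidates listed above.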

That's not too many; let's find out which one exactly by adding in different messages to all of these five entries and seeing which one pops up. Here's the result:
EmbedliteDownloadManager error: Line 3035
That means the error is happening inside this block:
    try {
      await new Promise((resolve, reject) => {
        this._browsingContext.print(printSettings)
        .then(() => {
          resolve();
        })
        .catch(exception => {
          reject(new DownloadError({ result: exception, inferCause: true }));
        });
      });
    }
That's not too surprising or revealing. That means there's an exception being thrown inside the C++ code from somewhere following this call:
already_AddRefed<Promise> CanonicalBrowsingContext::Print(
    nsIPrintSettings* aPrintSettings, ErrorResult& aRv)
Inside this method there are all sorts of calls to get the window details. So it might be worth checking whether we're switching tabs before or after all this is happening. Maybe it's just a timing issue?

A horrible thought suddenly occurred to me: what if it's the printing part that's broken and not the window changes causing it to break? I've been reinstalling various files, but maybe I didn't re-apply some manual change that had fixed the printing earlier?

Just to double check this I've removed the code that switches the tab back again to check whether the printing works correctly with this removed:
void DeclarativeWebContainer::onActiveTabChanged(int activeTabId)
{
    if (activeTabId <= 0) {
        return;
    }

    reload(false);

    if (m_model->activeTab().hidden()) {
        // Switch back to the old tab
        //m_model->activateTabById(mPreviousTabWhenHidden);
    }
}
Rebuild, reinstall, rerun. And now the printing does work. So it's definitely the introduction of this one line that's causing the issue. I'm going to put the line back in again, but this time with just a slight 100 millisecond delay to see whether that makes any difference. This will help confirm or deny my suspicion that this may be a timing issue.
void DeclarativeWebContainer::onActiveTabChanged(int activeTabId)
{
    if (activeTabId <= 0) {
        return;
    }

    reload(false);

    if (m_model->activeTab().hidden()) {
        // Switch back to the old tab
        hiddenTabTimer.setSingleShot(true);
        disconnect(&hiddenTabTimer, &QTimer::timeout, this, nullptr);
        connect(&hiddenTabTimer, &QTimer::timeout, this, [this]() {
            m_model->activateTabById(mPreviousTabWhenHidden);
        });
        hiddenTabTimer.start(100);
    }
}
Now with this slight delay the window appears visibly to the user for a fraction of a second, then disappears. The print then continues and the PDF is output successfully, no longer a zero-byte file:
$ ls -l
total 4596
-rw-rw-r-- 1 defaultuser defaultuser 4703901 Dec 24 18:39 Jolla.pdf
So that confirms it: it's a timing issue. Most likely the print code is expecting to get details from the current window. But if the window switches before it can do this, it'll end up getting the details from the original window, causing the print to fail.

If this is indeed what's going wrong, we should be able to pull that delay right down to zero. Adding a delay of zero time isn't the same as adding in no timer at all. The big difference is that by using the timer, execution will be forced once round the event loop before the switch back to the window occurs. In this case, it could be enough for the print code to pick up all of the info it needs in order to avoid the print failing.
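The difference between calling something directly and deferring it by one trip round the event loop can be shown with a toy model in plain C++. To be clear, this isn't Qt's event loop and it isn't the real print code: ToyEventLoop, simulate() and the logged strings are all hypothetical, and the ordering it demonstrates is only my suspicion about what's happening, not something I've confirmed in the gecko code.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A toy event loop: a queued task only runs once the current task returns.
class ToyEventLoop {
public:
    // Rough analogue of starting a single-shot timer with a zero timeout.
    void post(std::function<void()> task) { m_queue.push_back(std::move(task)); }

    // Runs tasks in order until the queue is empty.
    void run()
    {
        while (!m_queue.empty()) {
            std::function<void()> task = std::move(m_queue.front());
            m_queue.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> m_queue;
};

// Records the order of events when the switch back to the previous tab is
// made directly (deferSwitch == false) or posted to the loop (true).
std::vector<std::string> simulate(bool deferSwitch)
{
    ToyEventLoop loop;
    std::vector<std::string> log;

    auto switchBack = [&log]() { log.push_back("switch back to previous tab"); };

    // Stands in for the work that creates the hidden window and then reads
    // details from it.
    loop.post([&]() {
        log.push_back("hidden tab activated");
        // This is the point where onActiveTabChanged() would fire.
        if (deferSwitch) {
            loop.post(switchBack);
        } else {
            switchBack();
        }
        log.push_back("print code reads window details");
    });

    loop.run();
    assert(log.size() == 3);  // every run records exactly three events
    return log;
}
```

Calling simulate(false) puts the switch back before the print code reads the window details, whereas simulate(true) puts it after, even though no actual waiting is involved in either case.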

So I've made this change:
        hiddenTabTimer.start(0);
With a timeout time set to zero like this the window barely appears: it's more like a flicker of the screen, no different to when we had no timer at all. What's more the file is still generated in the background and this time gets filled up with data; the error that we were seeing earlier doesn't occur. Printing multiple times also seems to work correctly.

So this is a good result. I don't think I'm going to be able to remove the initial flicker entirely, which is a shame, but maybe further down the line someone will think of a way to address that. I'm also not totally done because the "hidden" window is also currently still appearing in the tab view. I need to filter it out. The good news is that I'm much more comfortable with how to do that as there are standard approaches to filtering items from models using the QSortFilterProxyModel class.

I'm not going to do that right now though, I'll pick that up tomorrow.

Over the last few weeks there have been many times when I wasn't convinced I'd be able to get the "Save page to PDF" functionality back up and working satisfactorily again. So it's a huge relief to get to this stage where I know it will get to a place that I'm happy with. That seems like a great place to end for the day.

25 Dec 2023 : Day 118 #
It's Christmas Day. I've spent most of the day eating, opening presents and watching films. If you celebrate Christmas I hope you've had a wonderful day as well. Alongside general celebration and relaxation I've also had the chance to do a bit of gecko development. I'm very keen to get this printing user interface working correctly and it's definitely getting there.

As something a bit different, here's a photo of our Christmas tree before and after the decorations. But it comes with a warning: if you read on beyond this photo there will be backtraces. You have been warned!
On the left a Nordic Pine; on the right the same tree covered in Christmas decorations

The day before yesterday we got to the point where the hidden tabs were glowing red in the tab view, leaving the non-hidden tabs their usual colour. We don't actually want them to be red, but it does demonstrate that the hidden flag is being exposed correctly in the user interface.

Then yesterday I worked my way carefully through the code to come up with a plan for how to actually hide a tab based on the flag being set. My plan was to wait for the new tab to be activated, then immediately set the tab that was activated directly before as the active tab again.

If that happens quickly enough the user should hopefully not even be aware that it's happened.

So today I'm planning to harness that information in order to trigger the activation. With any luck we can get this done just by amending the front-end sailfish-browser code. No need to make changes to the gecko library itself.

To do this I'm going to keep track of the previous tab by recording the tab id in the DeclarativeWebContainer class. I've added the code to do this into the onNewTabRequested() method. As you can see, if the new tab has the hidden flag set we'll copy the tab ID into the mPreviousTabWhenHidden class member:
void DeclarativeWebContainer::onNewTabRequested(const Tab &tab)
{
    if (tab.hidden()) {
        mPreviousTabWhenHidden = m_webPage->tabId();
    }

    if (activatePage(tab, false)) {
        m_webPage->loadTab(tab.requestedUrl(), false);
    }
}
As we saw yesterday, once the tab has been created an activeTabChanged signal will be emitted which is connected to a DeclarativeWebContainer::onActiveTabChanged() slot. At that point I've added in some code to then switch back to the previous tab using the tab ID we recorded earlier:
void DeclarativeWebContainer::onActiveTabChanged(int activeTabId)
{
    if (activeTabId <= 0) {
        return;
    }

    reload(false);

    if (m_model->activeTab().hidden()) {
        // Switch back to the old tab
        m_model->activateTabById(mPreviousTabWhenHidden);
    }
}
When I test this out it actually works, which I'm a little surprised about. There's a slight flicker when the new tab opens but then it's immediately hidden again. It's perceptible, but really looks okay to me.

The only problem is that now the print doesn't complete. The printing starts, the tab opens, the tab closes, but then there's an error in the debug output and the data is never written to the file.

Here's the output from the debug console. Note that there are quite a few debug prints that I've added here to try to help me figure out what's going on, so they're non-standard and I'll remove them once I'm done.
[D] unknown:0 - PRINT: onNewTabRequested pre activeTab:  10
[D] unknown:0 - PRINT: new tab is hidden; recording previous ID:  10
[D] unknown:0 - PRINT: onNewTabRequested post activeTab:  11
[W] unknown:0 - bool DBWorker::execute(QSqlQuery&) failed execute query
[W] unknown:0 - "INSERT INTO tab (tab_id, tab_history_id) VALUES (?,?);"
[W] unknown:0 - QSqlError("19", "Unable to fetch row", "UNIQUE constraint failed: tab.tab_id")
[D] unknown:0 - PRINT: onActiveTabChanged:  11
[D] unknown:0 - PRINT: onActiveTabChanged hidden:  true
[D] unknown:0 - PRINT: new tab is hidden, activating previous ID:  10
[D] unknown:0 - PRINT: activateTab: old:  11
[D] unknown:0 - PRINT: activateTab: new:  4
[D] unknown:0 - PRINT: activateTab: activate tab:  4 Tab(tabId = 10,
    parentId = 0, isValid = true, url = "https://jolla.com/",
    requested url = "", url resolved: true, title = "Jolla", thumbnailPath =
    "/home/defaultuser/.cache/org.sailfishos/browser/tab-10-thumb.jpg",
    desktopMode = false)
[D] unknown:0 - PRINT: onActiveTabChanged:  10
[D] unknown:0 - PRINT: onActiveTabChanged hidden:  false
EmbedliteDownloadManager error: [Exception... "Abort"  nsresult:
    "0x80004004 (NS_ERROR_ABORT)"  location: "JS frame ::
    resource://gre/modules/DownloadCore.jsm :: DownloadError :: line 1755"
    data: no]
[Parent 24026: Unnamed thread 7e54002670]: E/EmbedLite FUNC::virtual nsresult 
    mozilla::embedlite::EmbedLiteAppChild::Observe(nsISupports*, const char*,
    const char16_t*):68 topic:embed:download
[Parent 24026: Unnamed thread 7e54002670]: I/EmbedLite WARN: EmbedLite::virtual
    void* mozilla::embedlite::EmbedLitePuppetWidget::GetNativeData(uint32_t):127
    EmbedLitePuppetWidget::GetNativeData not implemented for this type
JavaScript error: , line 0: uncaught exception: Object
JSScript: ContextMenuHandler.js loaded
JSScript: SelectionPrototype.js loaded
JSScript: SelectionHandler.js loaded
JSScript: SelectAsyncHelper.js loaded
JSScript: FormAssistant.js loaded
JSScript: InputMethodHandler.js loaded
EmbedHelper init called
Available locales: en-US, fi, ru
Frame script: embedhelper.js loaded
CONSOLE message:
[JavaScript Error: "uncaught exception: Object"]
JavaScript error: resource://gre/modules/SessionStoreFunctions.jsm, line 120:
    NS_ERROR_FILE_NOT_FOUND:
CONSOLE message:
It's not quite clear to me whether it's the error shown here causing the problem, or the fact that the page is being deactivated and suspended. The latter could be causing problems, even though it might not necessarily trigger any error output (it would just freeze the page in the background). To explore the possibility of it being the page being set to inactive I've put a breakpoint on the place where this happens: the WebPages::updateStates() method. With any luck this will tell me where the deactivation takes place so I can disable it.

When we execute the app the breakpoint gets hit three times. First on creation of the app when the page we're loading gets created. We can see this from the fact that onNewTabRequested() is in the backtrace. I then press the "Save page to PDF" option after which the blank print page is created, so that a call to onNewTabRequested() triggers the breakpoint again. Following that the tab will switch from the blank page back to the previous page. On this occasion there's no call to onNewTabRequested(). Instead there's a call to activateTabById() which will trigger the breakpoint a third and final time.

This last call to activateTabById() is the one we just added.

Here are the backtraces for all three of these calls to updateStates(). As I mentioned yesterday, I'm very well aware that these backtraces make for horrible reading. I'm honestly sorry about this. I keep the backtraces here for anyone who's really keen to see all of the details, or for a future me who might need this information for reference. If you're a normal human I strongly recommend just skipping past this bit. You'll not lose anything by doing so.
(gdb) b WebPages::updateStates
Breakpoint 2 at 0x55555ad150: WebPages::updateStates. (2 locations)
(gdb) c
Continuing.
Thread 1 "sailfish-browse" hit Breakpoint 2, WebPages::updateStates
    (this=0x55559bf880, oldActivePage=0x5555c4e0c0, newActivePage=0x7f8c063720)
    at ../core/webpages.cpp:203
203         if (oldActivePage) {
(gdb) bt
#0  WebPages::updateStates (this=0x55559bf880, oldActivePage=0x5555c4e0c0,
    newActivePage=0x7f8c063720) at ../core/webpages.cpp:203
#1  0x00000055555ad69c in WebPages::page (this=0x55559bf880, tab=...)
    at ../core/webpages.cpp:166
#2  0x000000555559148c in DeclarativeWebContainer::activatePage
    (this=this@entry=0x55559bf220, tab=..., force=force@entry=false)
    at include/c++/8.3.0/bits/atomic_base.h:390
#3  0x00000055555917a4 in DeclarativeWebContainer::onNewTabRequested
    (this=0x55559bf220, tab=...) at ../core/declarativewebcontainer.cpp:1062
#4  0x0000007fb7ec4204 in QMetaObject::activate(QObject*, int, int, void**) ()
    from /usr/lib64/libQt5Core.so.5
#5  0x00000055555f7808 in DeclarativeTabModel::newTabRequested
    (this=this@entry=0x55559c69a0, _t1=...) at moc_declarativetabmodel.cpp:366
#6  0x00000055555c4428 in DeclarativeTabModel::newTab (this=0x55559c69a0,
    url=..., parentId=1, browsingContext=547754475280, hidden=<optimized out>)
    at ../history/declarativetabmodel.cpp:233
#7  0x00000055555d1900 in DeclarativeWebPageCreator::createView
    (this=0x55559c69f0, parentId=<optimized out>,
    parentBrowsingContext=<optimized out>, 
    hidden=<optimized out>) at /usr/include/qt5/QtCore/qarraydata.h:240
#8  0x0000007fbfb71ef0 in QMozContextPrivate::CreateNewWindowRequested
    (this=<optimized out>, chromeFlags=<optimized out>,
    hidden=@0x7fffffe922: true, aParentView=0x5555bfad90,
    parentBrowsingContext=@0x7fffffe938: 547754475280) at qmozcontext.cpp:218
#9  0x0000007fbcb10eec in mozilla::embedlite::EmbedLiteApp::
    CreateWindowRequested (this=0x555585a3a0, chromeFlags=@0x7fffffe928: 4094, 
    hidden=@0x7fffffe922: true, parentId=@0x7fffffe924: 1,
    parentBrowsingContext=@0x7fffffe938: 547754475280)
    at mobile/sailfishos/EmbedLiteApp.cpp:543
#10 0x0000007fbcb1ea68 in mozilla::embedlite::EmbedLiteAppThreadParent::
    RecvCreateWindow (this=<optimized out>, parentId=<optimized out>, 
    parentBrowsingContext=<optimized out>, chromeFlags=<optimized out>,
    hidden=<optimized out>, createdID=0x7fffffe92c, cancel=0x7fffffe923)
    at mobile/sailfishos/embedthread/EmbedLiteAppThreadParent.cpp:70
#11 0x0000007fba183aa0 in mozilla::embedlite::PEmbedLiteAppParent::
    OnMessageReceived (this=0x7f88a0fe90, msg__=..., reply__=@0x7fffffea38: 0x0)
    at PEmbedLiteAppParent.cpp:924
#12 0x0000007fba06b618 in mozilla::ipc::MessageChannel::DispatchSyncMessage
    (this=this@entry=0x7f88a0ff58, aProxy=aProxy@entry=0x5555c69ea0, aMsg=..., 
    aReply=@0x7fffffea38: 0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:675
[...]
#32 0x000000555557b360 in main (argc=<optimized out>, argv=<optimized out>)
    at main.cpp:201
(gdb) c
Continuing.
[New LWP 21735]

Thread 1 "sailfish-browse" hit Breakpoint 2, WebPages::updateStates
    (this=<optimized out>, oldActivePage=<optimized out>,
    newActivePage=0x7f8c063720) at ../core/webpages.cpp:218
218             newActivePage->resumeView();
(gdb) bt
#0  WebPages::updateStates (this=<optimized out>, oldActivePage=<optimized out>,
    newActivePage=0x7f8c063720) at ../core/webpages.cpp:218
#1  WebPages::updateStates (this=<optimized out>, oldActivePage=<optimized out>,
    newActivePage=0x7f8c063720) at ../core/webpages.cpp:201
#2  0x00000055555ad69c in WebPages::page (this=0x55559bf880, tab=...)
    at ../core/webpages.cpp:166
#3  0x000000555559148c in DeclarativeWebContainer::activatePage
    (this=this@entry=0x55559bf220, tab=..., force=force@entry=false)
    at include/c++/8.3.0/bits/atomic_base.h:390
#4  0x00000055555917a4 in DeclarativeWebContainer::onNewTabRequested
    (this=0x55559bf220, tab=...) at ../core/declarativewebcontainer.cpp:1062
#5  0x0000007fb7ec4204 in QMetaObject::activate(QObject*, int, int, void**) ()
    from /usr/lib64/libQt5Core.so.5
#6  0x00000055555f7808 in DeclarativeTabModel::newTabRequested
    (this=this@entry=0x55559c69a0, _t1=...) at moc_declarativetabmodel.cpp:366
#7  0x00000055555c4428 in DeclarativeTabModel::newTab (this=0x55559c69a0,
    url=..., parentId=1, browsingContext=547754475280, hidden=<optimized out>)
    at ../history/declarativetabmodel.cpp:233
#8  0x00000055555d1900 in DeclarativeWebPageCreator::createView
    (this=0x55559c69f0, parentId=<optimized out>,
    parentBrowsingContext=<optimized out>, 
    hidden=<optimized out>) at /usr/include/qt5/QtCore/qarraydata.h:240
#9  0x0000007fbfb71ef0 in QMozContextPrivate::CreateNewWindowRequested
    (this=<optimized out>, chromeFlags=<optimized out>,
    hidden=@0x7fffffe922: true, aParentView=0x5555bfad90,
    parentBrowsingContext=@0x7fffffe938: 547754475280) at qmozcontext.cpp:218
#10 0x0000007fbcb10eec in mozilla::embedlite::EmbedLiteApp::
    CreateWindowRequested (this=0x555585a3a0, chromeFlags=@0x7fffffe928: 4094, 
    hidden=@0x7fffffe922: true, parentId=@0x7fffffe924: 1,
    parentBrowsingContext=@0x7fffffe938: 547754475280)
    at mobile/sailfishos/EmbedLiteApp.cpp:543
#11 0x0000007fbcb1ea68 in mozilla::embedlite::EmbedLiteAppThreadParent::
    RecvCreateWindow (this=<optimized out>, parentId=<optimized out>, 
    parentBrowsingContext=<optimized out>, chromeFlags=<optimized out>,
    hidden=<optimized out>, createdID=0x7fffffe92c, cancel=0x7fffffe923)
    at mobile/sailfishos/embedthread/EmbedLiteAppThreadParent.cpp:70
#12 0x0000007fba183aa0 in mozilla::embedlite::PEmbedLiteAppParent::
    OnMessageReceived (this=0x7f88a0fe90, msg__=..., reply__=@0x7fffffea38: 0x0)
    at PEmbedLiteAppParent.cpp:924
#13 0x0000007fba06b618 in mozilla::ipc::MessageChannel::DispatchSyncMessage
    (this=this@entry=0x7f88a0ff58, aProxy=aProxy@entry=0x5555c69ea0, aMsg=..., 
    aReply=@0x7fffffea38: 0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:675
[...]
#33 0x000000555557b360 in main (argc=<optimized out>, argv=<optimized out>)
    at main.cpp:201
(gdb) c
Continuing.

[D] unknown:0 - PRINT: onNewTabRequested post activeTab:  11
[W] unknown:0 - bool DBWorker::execute(QSqlQuery&) failed execute query
[W] unknown:0 - "INSERT INTO tab (tab_id, tab_history_id) VALUES (?,?);"
[W] unknown:0 - QSqlError("19", "Unable to fetch row", "UNIQUE constraint failed: tab.tab_id")
[D] unknown:0 - PRINT: onActiveTabChanged:  11
[D] unknown:0 - PRINT: onActiveTabChanged hidden:  true
[D] unknown:0 - PRINT: new tab is hidden, activating previous ID:  10
[D] unknown:0 - PRINT: activateTab: old:  11
[D] unknown:0 - PRINT: activateTab: new:  4
[D] unknown:0 - PRINT: activateTab: activate tab:  4 Tab(tabId = 10, parentId = 0, isValid = true, url = "https://jolla.com/", requested url = "", url resolved: true, title = "Jolla", thumbnailPath = "/home/defaultuser/.cache/org.sailfishos/browser/tab-10-thumb.jpg", desktopMode = false)
[D] unknown:0 - PRINT: onActiveTabChanged:  10

Thread 1 "sailfish-browse" hit Breakpoint 2, WebPages::updateStates
    (this=0x55559bf880, oldActivePage=0x7f8c063720, newActivePage=0x5555c4e0c0)
    at ../core/webpages.cpp:203
203         if (oldActivePage) {
(gdb) bt
#0  WebPages::updateStates (this=0x55559bf880, oldActivePage=0x7f8c063720,
    newActivePage=0x5555c4e0c0) at ../core/webpages.cpp:203
#1  0x00000055555ad69c in WebPages::page (this=0x55559bf880, tab=...)
    at ../core/webpages.cpp:166
#2  0x000000555559148c in DeclarativeWebContainer::activatePage
    (this=this@entry=0x55559bf220, tab=..., force=force@entry=true)
    at include/c++/8.3.0/bits/atomic_base.h:390
#3  0x00000055555928d4 in DeclarativeWebContainer::loadTab (this=0x55559bf220,
    tab=..., force=false) at ../core/declarativewebcontainer.cpp:1187
#4  0x0000005555592b78 in DeclarativeWebContainer::onActiveTabChanged
    (this=0x55559bf220, activeTabId=10)
    at ../core/declarativewebcontainer.cpp:960
#5  0x0000007fb7ec4204 in QMetaObject::activate(QObject*, int, int, void**) ()
    from /usr/lib64/libQt5Core.so.5
#6  0x00000055555f7768 in DeclarativeTabModel::activeTabChanged
    (this=this@entry=0x55559c69a0, _t1=<optimized out>)
    at moc_declarativetabmodel.cpp:339
#7  0x00000055555c3eb4 in DeclarativeTabModel::updateActiveTab
    (this=this@entry=0x55559c69a0, activeTab=..., reload=reload@entry=false)
    at ../history/declarativetabmodel.cpp:429
#8  0x00000055555c4870 in DeclarativeTabModel::activateTab
    (this=this@entry=0x55559c69a0, index=4, reload=reload@entry=false)
    at ../history/declarativetabmodel.cpp:167
#9  0x00000055555c4d28 in DeclarativeTabModel::activateTabById
    (this=0x55559c69a0, tabId=<optimized out>)
    at ../history/declarativetabmodel.cpp:174
#10 0x0000005555592dbc in DeclarativeWebContainer::onActiveTabChanged
    (this=0x55559bf220, activeTabId=<optimized out>)
    at include/c++/8.3.0/bits/atomic_base.h:390
#11 0x0000007fb7ec4204 in QMetaObject::activate(QObject*, int, int, void**) ()
    from /usr/lib64/libQt5Core.so.5
#12 0x00000055555f7768 in DeclarativeTabModel::activeTabChanged
    (this=this@entry=0x55559c69a0, _t1=<optimized out>)
    at moc_declarativetabmodel.cpp:339
#13 0x00000055555c3eb4 in DeclarativeTabModel::updateActiveTab
    (this=this@entry=0x55559c69a0, activeTab=..., reload=reload@entry=false)
    at ../history/declarativetabmodel.cpp:429
#14 0x00000055555c4004 in DeclarativeTabModel::addTab
    (this=this@entry=0x55559c69a0, tab=..., index=index@entry=5)
    at ../history/declarativetabmodel.cpp:78
#15 0x00000055555c4438 in DeclarativeTabModel::newTab
    (this=0x55559c69a0, url=..., parentId=1, browsingContext=547754475280,
    hidden=<optimized out>) at ../history/declarativetabmodel.cpp:235
#16 0x00000055555d1900 in DeclarativeWebPageCreator::createView
    (this=0x55559c69f0, parentId=<optimized out>,
    parentBrowsingContext=<optimized out>, hidden=<optimized out>)
    at /usr/include/qt5/QtCore/qarraydata.h:240
#17 0x0000007fbfb71ef0 in QMozContextPrivate::CreateNewWindowRequested
    (this=<optimized out>, chromeFlags=<optimized out>,
    hidden=@0x7fffffe922: true, aParentView=0x5555bfad90,
    parentBrowsingContext=@0x7fffffe938: 547754475280) at qmozcontext.cpp:218
#18 0x0000007fbcb10eec in mozilla::embedlite::EmbedLiteApp::
    CreateWindowRequested (this=0x555585a3a0, chromeFlags=@0x7fffffe928: 4094, 
    hidden=@0x7fffffe922: true, parentId=@0x7fffffe924: 1,
    parentBrowsingContext=@0x7fffffe938: 547754475280)
    at mobile/sailfishos/EmbedLiteApp.cpp:543
#19 0x0000007fbcb1ea68 in mozilla::embedlite::EmbedLiteAppThreadParent::
    RecvCreateWindow (this=<optimized out>, parentId=<optimized out>, 
    parentBrowsingContext=<optimized out>, chromeFlags=<optimized out>,
    hidden=<optimized out>, createdID=0x7fffffe92c, cancel=0x7fffffe923)
    at mobile/sailfishos/embedthread/EmbedLiteAppThreadParent.cpp:70
#20 0x0000007fba183aa0 in mozilla::embedlite::PEmbedLiteAppParent::
    OnMessageReceived (this=0x7f88a0fe90, msg__=..., reply__=@0x7fffffea38: 0x0)
    at PEmbedLiteAppParent.cpp:924
#21 0x0000007fba06b618 in mozilla::ipc::MessageChannel::DispatchSyncMessage
    (this=this@entry=0x7f88a0ff58, aProxy=aProxy@entry=0x5555c69ea0, aMsg=..., 
    aReply=@0x7fffffea38: 0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:675
[...]
#41 0x000000555557b360 in main (argc=<optimized out>, argv=<optimized out>)
    at main.cpp:201
(gdb) 
From all of these backtraces we can see that WebPages::updateStates() is a good place to look if we want to avoid the deactivation and suspension of the old page.

So I've edited this method to remove the calls that perform the deactivation and suspension of the page. In theory this should leave the page not just running, but even rendering, even though it's no longer visible.
void WebPages::updateStates(DeclarativeWebPage *oldActivePage,
    DeclarativeWebPage *newActivePage)
{
    if (oldActivePage) {
        // Allow suspending only the current active page if it is not the
        // creator (parent).
        if (newActivePage->parentId() != (int)oldActivePage->uniqueId()) {
            if (oldActivePage->loading()) {
                //oldActivePage->stop();
            }
            //oldActivePage->suspendView();
        } else {
            // Sets parent to inactive and suspends rendering keeping
            // timeouts running.
            //oldActivePage->setActive(false);
        }
    }

    if (newActivePage) {
        newActivePage->resumeView();
        newActivePage->update();
    }
}
Unfortunately even after making this change the problem persists: the error still triggers and the PDF data still doesn't get stored in the file, which is created but left empty. Given this I'm going to have to look into the JavaScript error and try to figure out what's happening there instead. Maybe it's something that I've not yet considered.

That'll be a task for Boxing Day!

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
24 Dec 2023 : Day 117 #
It's Christmas eve as I write this and there's a sense of excitement and anticipation in the air. I like Christmas for many reasons, but mostly because it's a time for relaxation and not worrying about things for a day or two. Since gecko development is one of my means of relaxation I still plan to write an update tomorrow. I don't expect anyone to be reading it though. Given this, let me take the chance now to wish everybody reading this a very Merry Christmas. It amazes me how many of you continue to read this diary and today I want to give a special shout-out to Sylvain (Sthocs) who said nice things today on the Sailfish Forum. I'm thoroughly uplifted by such kind posts.

And because it's Christmas eve, here's one of Thigg's amazing images to add some colour and energy to the post today.
 
A lizard running through an autumnal forest carrying a pig (with wings) on its back

Now on to today's development. Yesterday we were able to confirm that the hidden flag was making it all the way from the print request out to the QML user interface of sailfish-browser. Today we have to think of something useful to do with the information.

I spent some time earlier today sitting in a coffee shop in the centre of Cambridge looking over the code and trying to think of what might be the right way to approach this. I've concluded that it might be a little tricky. But there are some things to try.

Because most of the changes will either happen in the QML or in the sailfish-browser code it should be possible to get a pretty swift turnaround on the possibilities I plan to try, which will hopefully mean we'll arrive at a solution all the more quickly.

To get the clearest picture it'll help to understand what happens when a new window is created, so let's break that down a bit.

The request comes in to the DeclarativeWebPageCreator::createView() method which looks like this:
quint32 DeclarativeWebPageCreator::createView(const quint32 &parentId,
    const uintptr_t &parentBrowsingContext, bool hidden)
{
    QPointer<DeclarativeWebPage> oldPage = m_activeWebPage;
    m_model->newTab(QString(), parentId, parentBrowsingContext, hidden);

    if (m_activeWebPage && oldPage != m_activeWebPage) {
        return m_activeWebPage->uniqueId();
    }
    return 0;
}
Notice the expectation that the value of m_activeWebPage will change across the call to m_model->newTab(). That's because the call triggers a lot more activity in the background. The newTab() call goes through to DeclarativeTabModel::newTab() which creates a new instance of the Tab class and adds it to the tab model. This will trigger a dataChanged signal through a call to addTab(), as well as a newTabRequested signal sent explicitly in the newTab code.
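To make the before-and-after comparison concrete, here's a minimal sketch in plain C++ with no Qt involved. The Page, TabModel and Creator types are hypothetical stand-ins for the real classes, and a std::function plays the part of the newTabRequested signal, so treat it as an illustration of the pattern rather than the actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for DeclarativeWebPage.
struct Page {
    std::uint32_t id;
};

// Hypothetical stand-in for the tab model. The callback plays the role of
// the newTabRequested signal and is invoked synchronously.
struct TabModel {
    std::function<void()> onNewTabRequested;

    void newTab() {
        if (onNewTabRequested) {
            onNewTabRequested();
        }
    }
};

// Hypothetical stand-in for DeclarativeWebPageCreator.
struct Creator {
    TabModel *model = nullptr;
    Page *activePage = nullptr;

    // Returns the id of the newly activated page, or 0 if the active page
    // didn't change as a side effect of newTab().
    std::uint32_t createView() {
        assert(model != nullptr);
        Page *oldPage = activePage;
        model->newTab();
        if (activePage && activePage != oldPage) {
            return activePage->id;
        }
        return 0;
    }
};
```

If nothing reacts to the request then the active page stays the same and createView() returns 0, which mirrors the fallback in the real method.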

The newTabRequested signal is connected in declarativewebcontainer.cpp to the onNewTabRequested() method:
    connect(m_model.data(), &DeclarativeTabModel::newTabRequested,
            this, &DeclarativeWebContainer::onNewTabRequested);
This onNewTabRequested() method ends up calling two other methods: activatePage() and loadTab(), like this:
void DeclarativeWebContainer::onNewTabRequested(const Tab &tab)
{
    if (activatePage(tab, false)) {
        m_webPage->loadTab(tab.requestedUrl(), false);
    }
}
The activatePage() method does quite a lot of work here, but in particular activates the page. I'm thinking that we probably do want this to happen, but then might want to switch back immediately to the previous page. Otherwise the new tab is going to get shown to the user, which we want to avoid.

It might also be helpful to compare this flow against that which happens when a user selects a tab from the tab view. In that case the TabItem component sends out a TabView.activateTab() signal. This results in the following bit of code from the Overlay QML component getting called:
    onActivateTab: {
        webView.tabModel.activateTab(index)
        pageStack.pop()
    }
Although this bit of code is in the Overlay component, that component itself doesn't directly have a webView property. That's because both the Overlay and webView can be found in BrowserPage.qml. I must admit I'm not too keen on this type of approach, where a QML component relies on some implied features of its parent component. Nevertheless it's not uncommon in QML code and perfectly legal. Here's where the webView property is defined in the BrowserPage.qml file:
    Shared.WebView {
        id: webView
[...]
The Shared.WebView component is an instance of DeclarativeWebContainer, which we can tell from this call in the DeclarativeWebContainer C++ constructor code:
DeclarativeWebContainer::DeclarativeWebContainer(QWindow *parent)
[...]
{
[...]
    setTitle("BrowserContent");
    setObjectName("WebView");
[...]
The tabModel property is an instance of DeclarativeTabModel, which has a compatible activateTab() overload available:
    Q_INVOKABLE void activateTab(int index, bool reload = false);
With the method definition looking like this:
void DeclarativeTabModel::activateTab(int index, bool reload)
{
    if (m_tabs.isEmpty()) {
        return;
    }

    index = qBound<int>(0, index, m_tabs.count() - 1);
    const Tab &newActiveTab = m_tabs.at(index);
    updateActiveTab(newActiveTab, reload);
}
Finally the call to updateActiveTab() sends out the dataChanged signal and then an activeTabChanged signal too. The latter is tied to onActiveTabChanged() due to this line in declarativewebcontainer.cpp:
    connect(m_model.data(), &DeclarativeTabModel::activeTabChanged,
            this, &DeclarativeWebContainer::onActiveTabChanged);
Which therefore causes this code to be run:
void DeclarativeWebContainer::onActiveTabChanged(int activeTabId)
{
    if (activeTabId <= 0) {
        return;
    }

    reload(false);
}
The reload() method contains a call to loadTab() which looks just like the onNewTabRequested() method that was called in the case of the new window creation route:
void DeclarativeWebContainer::loadTab(const Tab& tab, bool force)
{
    if (activatePage(tab, true) || force) {
        // Note: active pages containing a "link" between each other (parent-child relationship)
        // are not destroyed automatically e.g. in low memory notification.
        // Hence, parentId is not necessary over here.
        m_webPage->loadTab(tab.url(), force);
    }
}
Which means we're back to the same place, in particular the activatePage() method which actually does the work. I've gone through both the process for creating a new tab and the process for switching tabs to highlight the crucial bit that we're interested in. We want to switch tabs right after the new tab has been activated, so it's the point where these two paths converge that's likely to make the best place for us to amend the code.

As well as working through the code as we have done above, we can also find the place where the two paths converge using backtraces in the debugger. Now I know these backtraces can be a bit hard to parse on a website, especially on a small screen. If you're a human reading this then please do feel free to just skip these, they're more for my future reference, as it's really helpful for me to keep a record of them. In any case here's the "changing tab" case in backtrace form.
(gdb) bt
#0  DeclarativeWebContainer::setWebPage (this=0x55559bd820, webPage=0x5555ff3960,
    triggerSignals=false) at ../core/declarativewebcontainer.cpp:165
#1  0x00000055555914a0 in DeclarativeWebContainer::activatePage
    (this=0x55559bd820, tab=..., force=true)
    at ../core/declarativewebcontainer.cpp:589
#2  0x000000555559257c in DeclarativeWebContainer::loadTab (this=0x55559bd820,
    tab=..., force=false) at ../core/declarativewebcontainer.cpp:1171
#3  0x0000007fb7ec4204 in QMetaObject::activate(QObject*, int, int, void**) ()
    from /usr/lib64/libQt5Core.so.5
#4  0x00000055555f6ce0 in DeclarativeTabModel::activeTabChanged
    (this=this@entry=0x55559d2b30, _t1=<optimized out>)
    at moc_declarativetabmodel.cpp:339
#5  0x00000055555c37b4 in DeclarativeTabModel::updateActiveTab
    (this=this@entry=0x55559d2b30, activeTab=..., reload=reload@entry=false)
    at ../history/declarativetabmodel.cpp:426
#6  0x00000055555c3f38 in DeclarativeTabModel::activateTab
    (this=this@entry=0x55559d2b30, index=<optimized out>,
    reload=reload@entry=false)
    at ../history/declarativetabmodel.cpp:164
#7  0x00000055555f70ec in DeclarativeTabModel::qt_static_metacall
    (_o=_o@entry=0x55559d2b30, _c=_c@entry=QMetaObject::InvokeMetaMethod,
    _id=_id@entry=16, 
    _a=_a@entry=0x7fffff64c8) at moc_declarativetabmodel.cpp:180
[...]
#17 0x0000007fb8643bf8 in QQmlJavaScriptExpression::evaluate(QV4::CallData*,
    bool*) () from /usr/lib64/libQt5Qml.so.5
#18 0x0000007fa83c2410 in ?? ()
#19 0x00000055556d2c90 in ?? ()
(gdb) 
So, in short, it looks like if we emit an activeTabChanged() signal straight after the new tab has been created, using the index of the previous tab (the one that was active before the window opened), then with a bit of luck the browser will immediately (and maybe imperceptibly? We'll have to see about that) switch back to the previous tab.
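Here's a rough sketch of that idea in plain C++. It's untested against the real browser and makes several assumptions: TabModel is a hypothetical, heavily simplified model, it works with tab ids rather than indices, and a std::function stands in for the activeTabChanged signal.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct Tab {
    int id;
    bool hidden;
};

// Hypothetical, simplified tab model.
class TabModel {
public:
    // Stand-in for the activeTabChanged signal.
    std::function<void(int)> activeTabChanged;

    int activeTabId() const { return m_activeId; }

    // Creates a tab and activates it so that its page gets created. If the
    // tab is hidden, the previously active tab is re-activated straight away.
    int newTab(bool hidden) {
        const int previousId = m_activeId;
        const Tab tab{m_nextId++, hidden};
        m_tabs.push_back(tab);
        activate(tab.id);
        if (hidden && previousId > 0) {
            activate(previousId);
        }
        return tab.id;
    }

private:
    void activate(int id) {
        assert(id > 0);
        m_activeId = id;
        if (activeTabChanged) {
            activeTabChanged(id);
        }
    }

    std::vector<Tab> m_tabs;
    int m_activeId = 0;
    int m_nextId = 1;
};
```

In this sketch a listener would see the hidden tab become active and then see the previous tab take over again. Whether that double switch is visible to the user is exactly the open question.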

I'm keen to try this out, but all of this digging through code has left me a bit exhausted, so testing it out will have to wait until tomorrow.

23 Dec 2023 : Day 116 #
Yesterday we were looking at the window creation code and I made some changes to "surface" (as they say) the hidden flag of a window. The idea is to set the flag on the window generated as a result of the page being cloned during printing so that the front-end can filter the window out from the user interface.

Here's the new "Hidden" role from the tab model that I've added and which I'm hoping to use for the filtering, taken from the declarativetabmodel.h file:
    enum TabRoles {
        ThumbPathRole = Qt::UserRole + 1,
        TitleRole,
        UrlRole,
        ActiveRole,
        TabIdRole,
        DesktopModeRole,
        HiddenRole,
    };
Although in theory the pieces are in place to implement the filtering I don't want to do that just yet. I want to do something easier to check whether it's working as expected. I've already made quite a number of changes and experience tells me that periodically checking my work is a sensible thing to do. I'm not always where I think I am.

So rather than the filtering I want to add some kind of indicator to the tab view that shows whether a page is supposed to be hidden or not. I'm thinking maybe a colour change or something like that.

I've found a nice place to put this indicator in the TabItem.qml component of the sailfish-browser user interface code. There's some QML code there to set the background colour depending on the ambience:
    color: Theme.colorScheme === Theme.LightOnDark ? "black" : "white"
I'm going to embellish this a little — I can do it directly on the device — like this:
    color: hidden ? "red" : Theme.colorScheme === Theme.LightOnDark ? "black" : "white"
This is about as simple as a change gets. When I apply the change, run the browser and open the tab view I get the following error:
[W] unknown:60 - file:///usr/share/sailfish-browser/pages/components/TabItem.qml:60:
    ReferenceError: hidden is not defined
That's okay though: I've not yet installed my updated packages where the hidden role is defined, so the error is expected. Now to install the packages to see whether that changes. I have to update three sets of packages (xulrunner, qtmozembed and sailfish-browser) simultaneously to make it work.
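The reason QML can't see hidden until the new packages are installed is that delegates look up model data by role name, and in Qt those names come from the model's roleNames() mapping. The sketch below mimics that lookup in plain C++. Apart from hidden, the role names are placeholders, and roleForName() is a hypothetical helper rather than anything provided by Qt.

```cpp
#include <cassert>
#include <map>
#include <string>

constexpr int UserRole = 0x0100; // same value as Qt::UserRole

enum TabRoles {
    ThumbPathRole = UserRole + 1,
    TitleRole,
    UrlRole,
    ActiveRole,
    TabIdRole,
    DesktopModeRole,
    HiddenRole,
};

using RoleNames = std::map<int, std::string>;

// Placeholder role names; only "hidden" is taken from the QML above.
RoleNames roleNames(bool includeHidden) {
    RoleNames names = {
        {ThumbPathRole, "thumbnailPath"},
        {TitleRole, "title"},
        {UrlRole, "url"},
        {ActiveRole, "activeTab"},
        {TabIdRole, "tabId"},
        {DesktopModeRole, "desktopMode"},
    };
    if (includeHidden) {
        names[HiddenRole] = "hidden";
    }
    return names;
}

// Hypothetical helper: returns the role id registered for a name, or -1 if
// the name is unknown (the situation that produces the ReferenceError).
int roleForName(const RoleNames &names, const std::string &name) {
    for (const auto &entry : names) {
        if (entry.second == name) {
            return entry.first;
        }
    }
    return -1;
}
```

With the old packages the lookup for hidden fails, and with the new ones it resolves to HiddenRole.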

All of the packages installed fine. The browser runs. Opening the tab window now no longer generates any errors. But things haven't quite gone to plan because now all of the tabs have turned red. Hrmf. I guess I need to check my logic.
 
Two screenshots of the tab view in the browser; on the left all of the titles of the tabs have dark backgrounds; on the right the backgrounds are all red

After checking with the debugger, all of the tabs have their hidden flag set to false. Odd. Annoyingly though, the debugger also tells me that when the print page is created its hidden state is also set to false. So that's now two errors to fix. Well, I've found the first error in my logic, caused by a hack added on top of a hack:
    color: hidden ? "red" : 
        Theme.colorScheme === Theme.LightOnDark ? "red" : "white"
It's annoyingly obvious now I see it. The second colour should be "black" not "red", like this:
    color: hidden ? "red" : 
        Theme.colorScheme === Theme.LightOnDark ? "black" : "white"
Having fixed this now all the tabs are the correct colour again. Apart from the printing page which should be red but now isn't. For some reason the information isn't being "surfaced" as it should be.
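Nested ternaries like this are easy to get wrong, so here's the same decision written out as a small C++ function that can be checked case by case. The ColourScheme enum and tabBackground() are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <string>

enum class ColourScheme { LightOnDark, DarkOnLight };

// Mirrors the corrected QML expression: hidden tabs are always red,
// otherwise the background follows the ambience colour scheme.
std::string tabBackground(bool hidden, ColourScheme scheme) {
    if (hidden) {
        return "red";
    }
    return scheme == ColourScheme::LightOnDark ? "black" : "white";
}
```

The buggy version would have returned "red" for the non-hidden LightOnDark case, which is why every tab changed colour.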

I've picked a point suitably high up the stack — on EmbedLiteViewChild::InitGeckoWindow() — to attach a breakpoint and set the print running. Hopefully this will give us an idea about where the chain is being broken.
(gdb) b EmbedLiteViewChild::InitGeckoWindow
Breakpoint 1 at 0x7fbcb195a4: file mobile/sailfishos/embedshared/
    EmbedLiteViewChild.cpp, line 179.
(gdb) c
Continuing.
[...]

Thread 8 "GeckoWorkerThre" hit Breakpoint 1, mozilla::embedlite::
    EmbedLiteViewChild::InitGeckoWindow (this=0x7e37699810, parentId=1, 
    parentBrowsingContext=0x7f88b7a680, isPrivateWindow=false,
    isDesktopMode=false, isHidden=false)
    at mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp:179
179     {
(gdb) bt
#0  mozilla::embedlite::EmbedLiteViewChild::InitGeckoWindow (this=0x7e37699810,
    parentId=1, parentBrowsingContext=0x7f88b7a680, isPrivateWindow=false, 
    isDesktopMode=false, isHidden=false)
    at mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp:179
#1  0x0000007fbcb0ae48 in mozilla::detail::RunnableMethodArguments<unsigned int
    const, mozilla::dom::BrowsingContext*, bool const, bool const, bool 
    const>::applyImpl<mozilla::embedlite::EmbedLiteViewChild, void 
    (mozilla::embedlite::EmbedLiteViewChild::*)(unsigned int,
    mozilla::dom::BrowsingContext*, bool, bool, bool),
    StoreCopyPassByConstLRef<unsigned int const>,
    StoreRefPtrPassByPtr<mozilla::dom::BrowsingContext>,
    StoreCopyPassByConstLRef<bool const>, StoreCopyPassByConstLRef<bool const>,
    StoreCopyPassByConstLRef<bool const>, 0ul, 1ul, 2ul, 3ul, 4ul>
    (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:280
#2  mozilla::detail::RunnableMethodArguments<unsigned int const,
    mozilla::dom::BrowsingContext*, bool const, bool const,
    bool const>::apply<mozilla::embedlite::EmbedLiteViewChild, void
    (mozilla::embedlite::EmbedLiteViewChild::*)(unsigned int,
    mozilla::dom::BrowsingContext*, bool, bool, bool)> (
    m=<optimized out>, o=<optimized out>, this=<optimized out>)
    at xpcom/threads/nsThreadUtils.h:1154
#3  mozilla::detail::RunnableMethodImpl<mozilla::embedlite::EmbedLiteViewChild*,
    void (mozilla::embedlite::EmbedLiteViewChild::*)(unsigned int,
    mozilla::dom::BrowsingContext*, bool, bool, bool), true,
    (mozilla::RunnableKind)0, unsigned int const, mozilla::dom::BrowsingContext*,
    bool const, bool const, bool const>::Run (this=<optimized out>)
    at xpcom/threads/nsThreadUtils.h:1201
#4  0x0000007fb9c8b99c in mozilla::RunnableTask::Run (this=0x7ed8004590)
[...]
Oh, that backtrace quickly degenerates because EmbedLiteViewChild::InitGeckoWindow() is being called as a posted runnable task. I'll need to pick somewhere lower down. It's being run from the EmbedLiteViewChild::EmbedLiteViewChild() constructor, so let's try that instead.
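The reason the backtrace loses its caller can be shown with a toy event loop. In the sketch below, which uses made-up names (CallStack, Frame, TaskQueue, constructView) and plain C++ rather than Gecko's runnable machinery, the constructor posts the initialisation step instead of calling it directly. By the time the step runs, only the event loop is on the stack.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Records which functions are currently active, like a tiny call stack.
class CallStack {
public:
    void push(const std::string &name) { m_frames.push_back(name); }
    void pop() {
        assert(!m_frames.empty());
        m_frames.pop_back();
    }
    const std::vector<std::string> &frames() const { return m_frames; }

private:
    std::vector<std::string> m_frames;
};

// RAII helper: pushes a frame on construction and pops it on destruction.
class Frame {
public:
    Frame(CallStack &stack, const std::string &name) : m_stack(stack) {
        m_stack.push(name);
    }
    ~Frame() { m_stack.pop(); }

private:
    CallStack &m_stack;
};

// Toy event loop: tasks are stored now and executed later by runAll().
class TaskQueue {
public:
    explicit TaskQueue(CallStack &stack) : m_stack(stack) {}

    void post(std::function<void()> task) { m_tasks.push(std::move(task)); }

    void runAll() {
        Frame frame(m_stack, "runAll");
        while (!m_tasks.empty()) {
            std::function<void()> task = std::move(m_tasks.front());
            m_tasks.pop();
            task();
        }
    }

private:
    CallStack &m_stack;
    std::queue<std::function<void()>> m_tasks;
};

// Stand-in for the constructor: it posts the init step rather than calling
// it, so the init step never sees "constructView" on the stack.
void constructView(CallStack &stack, TaskQueue &queue,
                   std::vector<std::string> &seenByInit) {
    Frame frame(stack, "constructView");
    queue.post([&stack, &seenByInit]() {
        Frame initFrame(stack, "initWindow");
        seenByInit = stack.frames();
    });
}
```

That's why setting the breakpoint on the function that does the posting gives a more useful backtrace than setting it on the posted function.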

The debugger is running incredibly slowly today for some reason. But eventually it gets there.
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b EmbedLiteViewChild::EmbedLiteViewChild
Breakpoint 2 at 0x7fbcb143f0: file mobile/sailfishos/embedshared/
    EmbedLiteViewChild.cpp, line 71.
(gdb) c
Continuing.

Thread 8 "GeckoWorkerThre" hit Breakpoint 1, mozilla::embedlite::
    EmbedLiteViewChild::EmbedLiteViewChild (this=0x7e30cc9000,
    aWindowId=@0x7f9f3cfc3c: 1, aId=@0x7f9f3cfc40: 2,
    aParentId=@0x7f9f3cfc44: 1, parentBrowsingContext=0x7f88bd2f40,
    isPrivateWindow=@0x7f9f3cfc2e: false,
    isDesktopMode=@0x7f9f3cfc2f: false, isHidden=@0x7f9f3cfc30: false)
    at mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp:71
71      EmbedLiteViewChild::EmbedLiteViewChild(const uint32_t &aWindowId,
(gdb) bt
#0  mozilla::embedlite::EmbedLiteViewChild::EmbedLiteViewChild
    (this=0x7e30cc9000, aWindowId=@0x7f9f3cfc3c: 1, aId=@0x7f9f3cfc40: 2, 
    aParentId=@0x7f9f3cfc44: 1, parentBrowsingContext=0x7f88bd2f40,
    isPrivateWindow=@0x7f9f3cfc2e: false, isDesktopMode=@0x7f9f3cfc2f: false, 
    isHidden=@0x7f9f3cfc30: false)
    at mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp:71
#1  0x0000007fbcb23aa4 in mozilla::embedlite::EmbedLiteViewThreadChild::
    EmbedLiteViewThreadChild (this=0x7e30cc9000, windowId=<optimized out>, 
    id=<optimized out>, parentId=<optimized out>,
    parentBrowsingContext=<optimized out>, isPrivateWindow=<optimized out>,
    isDesktopMode=<optimized out>, isHidden=<optimized out>)
    at mobile/sailfishos/embedthread/EmbedLiteViewThreadChild.cpp:15
#2  0x0000007fbcb2a69c in mozilla::embedlite::EmbedLiteAppThreadChild::
    AllocPEmbedLiteViewChild (this=0x7f889f48e0, windowId=@0x7f9f3cfc3c: 1, 
    id=@0x7f9f3cfc40: 2, parentId=@0x7f9f3cfc44: 1,
    parentBrowsingContext=<optimized out>, isPrivateWindow=@0x7f9f3cfc2e: false, 
    isDesktopMode=@0x7f9f3cfc2f: false, isHidden=@0x7f9f3cfc30: false)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#3  0x0000007fba17f48c in mozilla::embedlite::PEmbedLiteAppChild::
    OnMessageReceived (this=0x7f889f48e0, msg__=...) at PEmbedLiteAppChild.cpp:529
#4  0x0000007fba06b85c in mozilla::ipc::MessageChannel::DispatchAsyncMessage
    (this=this@entry=0x7f889f49a8, aProxy=aProxy@entry=0x7ebc001cb0, aMsg=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:675
[...]
#49 0x0000007fbd165538 in js::jit::MaybeEnterJit (cx=0x7f88234ba0, state=...)
    at js/src/jit/Jit.cpp:207
#50 0x0000007f8830d951 in ?? ()
Backtrace stopped: Cannot access memory at address 0xfa512247ea
(gdb) 
That gets us a little further, but not massively so. Let's go back a bit further again.
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b PEmbedLiteAppParent::SendPEmbedLiteViewConstructor
Breakpoint 2 at 0x7fba19099c: PEmbedLiteAppParent::SendPEmbedLiteViewConstructor.
    (2 locations)
(gdb) c
Continuing.

Thread 1 "sailfish-browse" hit Breakpoint 2, mozilla::embedlite::
    PEmbedLiteAppParent::SendPEmbedLiteViewConstructor
    (this=this@entry=0x7f88a64fc0, windowId=@0x7fffffe414: 1,
    id=@0x7fbfb29440: 3, parentId=@0x7fffffe40c: 1,
    parentBrowsingContext=@0x7fffffe400: 547754946368, 
    isPrivateWindow=@0x7fffffe40b: false, isDesktopMode=@0x7fffffe40a: false,
    isHidden=@0x7fffffe409: false) at PEmbedLiteAppParent.cpp:168
168     PEmbedLiteAppParent.cpp: No such file or directory.
(gdb) bt
#0  mozilla::embedlite::PEmbedLiteAppParent::SendPEmbedLiteViewConstructor
    (this=this@entry=0x7f88a64fc0, windowId=@0x7fffffe414: 1,
    id=@0x7fbfb29440: 3, parentId=@0x7fffffe40c: 1,
    parentBrowsingContext=@0x7fffffe400: 547754946368,
    isPrivateWindow=@0x7fffffe40b: false, isDesktopMode=@0x7fffffe40a: false,
    isHidden=@0x7fffffe409: false) at PEmbedLiteAppParent.cpp:168
#1  0x0000007fbcb18850 in mozilla::embedlite::EmbedLiteApp::CreateView
    (this=0x5555859800, aWindow=0x5555c67e80, aParent=<optimized out>, 
    aParentBrowsingContext=<optimized out>, aIsPrivateWindow=<optimized out>,
    isDesktopMode=<optimized out>, isHidden=<optimized out>)
    at mobile/sailfishos/EmbedLiteApp.cpp:478
#2  0x0000007fbfb8693c in QMozViewPrivate::createView (this=0x555569b070)
    at qmozview_p.cpp:862
#3  QMozViewPrivate::createView (this=0x555569b070) at qmozview_p.cpp:848
#4  0x00000055555d10ec in WebPageFactory::createWebPage (this=0x55559bc500,
    webContainer=0x55559bc110, initialTab=...)
    at ../factories/webpagefactory.cpp:44
#5  0x00000055555acf68 in WebPages::page (this=0x55559bc650, tab=...)
    at include/c++/8.3.0/bits/atomic_base.h:390
#6  0x000000555559148c in DeclarativeWebContainer::activatePage
    (this=this@entry=0x55559bc110, tab=..., force=force@entry=false)
    at include/c++/8.3.0/bits/atomic_base.h:390
#7  0x000000555559160c in DeclarativeWebContainer::onNewTabRequested
    (this=0x55559bc110, tab=...) at ../core/declarativewebcontainer.cpp:1047
#8  0x0000007fb7ec4204 in QMetaObject::activate(QObject*, int, int, void**) ()
    from /usr/lib64/libQt5Core.so.5
#9  0x00000055555f6d80 in DeclarativeTabModel::newTabRequested
    (this=this@entry=0x55559d6130, _t1=...) at moc_declarativetabmodel.cpp:366
#10 0x00000055555c3d28 in DeclarativeTabModel::newTab (this=0x55559d6130,
    url=..., parentId=1, browsingContext=547754946368, hidden=<optimized out>)
    at ../history/declarativetabmodel.cpp:230
#11 0x00000055555d0e78 in DeclarativeWebPageCreator::createView
    (this=0x55559c50f0, parentId=<optimized out>,
    parentBrowsingContext=<optimized out>, hidden=<optimized out>)
    at /usr/include/qt5/QtCore/qarraydata.h:240
#12 0x0000007fbfb71ef0 in QMozContextPrivate::CreateNewWindowRequested
    (this=<optimized out>, chromeFlags=<optimized out>, hidden=<optimized out>, 
    aParentView=0x5555bf74e0, parentBrowsingContext=@0x7fffffe938: 547754946368)
    at qmozcontext.cpp:218
#13 0x0000007fbcb10eec in mozilla::embedlite::EmbedLiteApp::CreateWindowRequested
    (this=0x5555859800, chromeFlags=@0x7fffffe928: 4094,
    hidden=@0x7fffffe922: true, parentId=@0x7fffffe924: 1,
    parentBrowsingContext=@0x7fffffe938: 547754946368)
    at mobile/sailfishos/EmbedLiteApp.cpp:543
#14 0x0000007fbcb1ea68 in mozilla::embedlite::EmbedLiteAppThreadParent::
    RecvCreateWindow (this=<optimized out>, parentId=<optimized out>, 
    parentBrowsingContext=<optimized out>, chromeFlags=<optimized out>,
    hidden=<optimized out>, createdID=0x7fffffe92c, cancel=0x7fffffe923)
    at mobile/sailfishos/embedthread/EmbedLiteAppThreadParent.cpp:70
#15 0x0000007fba183aa0 in mozilla::embedlite::PEmbedLiteAppParent::
    OnMessageReceived (this=0x7f88a64fc0, msg__=..., reply__=@0x7fffffea38: 0x0)
    at PEmbedLiteAppParent.cpp:924
#16 0x0000007fba06b618 in mozilla::ipc::MessageChannel::DispatchSyncMessage
    (this=this@entry=0x7f88a65088, aProxy=aProxy@entry=0x7fa0010590, aMsg=..., 
    aReply=@0x7fffffea38: 0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:675
[...]
#36 0x000000555557b360 in main (argc=<optimized out>, argv=<optimized out>)
    at main.cpp:201
(gdb) 
Frustratingly a lot of the values we'd like to investigate have been optimised out, making them hard to access. However, what we can see is that for the call to EmbedLiteApp::CreateWindowRequested() the value of hidden is set to true:
#13 0x0000007fbcb10eec in mozilla::embedlite::EmbedLiteApp::CreateWindowRequested
    (this=0x5555859800, chromeFlags=@0x7fffffe928: 4094,
    hidden=@0x7fffffe922: true, parentId=@0x7fffffe924: 1,
    parentBrowsingContext=@0x7fffffe938: 547754946368)
That's at stack frame 13. The next time we see it in non-optimised form is at stack frame 0 where it's set to false (albeit with a slightly different name):
#0  mozilla::embedlite::PEmbedLiteAppParent::SendPEmbedLiteViewConstructor
    (this=this@entry=0x7f88a64fc0, windowId=@0x7fffffe414: 1,
    id=@0x7fbfb29440: 3, parentId=@0x7fffffe40c: 1,
    parentBrowsingContext=@0x7fffffe400: 547754946368,
    isPrivateWindow=@0x7fffffe40b: false, isDesktopMode=@0x7fffffe40a: false,
    isHidden=@0x7fffffe409: false) at PEmbedLiteAppParent.cpp:168
Somewhere between these two frames the value is getting lost. The debugger isn't being very helpful in identifying exactly where, so I'll need to read through the code manually.

Luckily it doesn't take long to find out... it's in stack frame 12 where we have this:
uint32_t QMozContextPrivate::CreateNewWindowRequested(const uint32_t &chromeFlags,
    const bool &hidden, EmbedLiteView *aParentView,
    const uintptr_t &parentBrowsingContext)
{
    Q_UNUSED(chromeFlags)

    uint32_t parentId = aParentView ? aParentView->GetUniqueID() : 0;
    qCDebug(lcEmbedLiteExt) << "QtMozEmbedContext new Window requested: parent:"
        << (void *)aParentView << parentId;
    uint32_t viewId = QMozContext::instance()->createView(parentId,
        parentBrowsingContext);
    return viewId;
}
As we can see, the value of hidden just isn't being used at all in this method. I must have missed it when I made all my changes earlier. It slipped past because I gave it a default value in the header, so it compiled even though I forgot to pass it on:
    quint32 createView(const quint32 &parentId = 0,
        const uintptr_t &parentBrowsingContext = 0, const bool hidden = false);
Luckily qtmozembed is one of the quicker browser packages to compile, so it should be possible to fix, build and test the changes pretty swiftly.
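The way a default argument can hide a forgotten parameter is easy to reproduce in isolation. This is a minimal sketch with hypothetical names (View, requestBuggy, requestFixed) and a cut-down createView() that only loosely resembles the real one.

```cpp
#include <cassert>
#include <cstdint>

struct View {
    std::uint32_t parentId;
    bool hidden;
};

// The default value for hidden means callers may silently omit it.
View createView(std::uint32_t parentId = 0, bool hidden = false) {
    return View{parentId, hidden};
}

// Buggy forwarder: receives hidden but never passes it on. This compiles
// without complaint because createView() falls back to the default.
View requestBuggy(std::uint32_t parentId, bool hidden) {
    (void)hidden; // accepted and then dropped, which is the bug
    return createView(parentId);
}

// Fixed forwarder: passes the flag through explicitly.
View requestFixed(std::uint32_t parentId, bool hidden) {
    return createView(parentId, hidden);
}
```

Dropping the default value from the declaration would have turned the omission into a compile error, at the cost of having to update every other caller.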
 
A screenshot of the tab view in the browser; most of the tabs have a dark background but the one at the bottom is blank and coloured red

Rather excitingly all of the windows now appear with the standard colour, except for the window created during printing which is glaringly red. Just what we need!

Okay, that's it tested and working. The next step is to add the view filtering so that the red windows aren't shown in the user interface. I have a suspicion that this step is going to be harder than I think it is. But that's something to think about tomorrow.

22 Dec 2023 : Day 115 #
This morning the build had not completed successfully. There were a few instances of the hidden and isHidden parameters that I'd failed to add. It's quite a web inside the EmbedLite structures with what seems to be every combination of {App, View}, {Thread, Process, Nothing} and {Parent, Child, Interface}. It was always going to be easy to miss something.

Thankfully that's not the disaster it might have been because the idl interface definitions went through their regeneration cycle to create new source and header files. That means I can now do partial builds to check that things are working.

As I add the final arguments to the code I notice that the final resting place of the flag seems to be here in EmbedLiteViewChild:
void
EmbedLiteViewChild::InitGeckoWindow(const uint32_t parentId,
                                    mozilla::dom::BrowsingContext
                                    *parentBrowsingContext,
                                    const bool isPrivateWindow,
                                    const bool isDesktopMode,
                                    const bool &isHidden)
At this point the flag is simply discarded, when it really ought to be used for something. This is a note to myself to figure out what.

Another note to myself is to decide whether we need to store isHidden in EmbedLiteViewParent or not. Here's the relevant method signature:
EmbedLiteViewParent::EmbedLiteViewParent(const uint32_t &windowId,
                                         const uint32_t &id,
                                         const uint32_t &parentId,
                                         const uintptr_t &parentBrowsingContext,
                                         const bool &isPrivateWindow,
                                         const bool &isDesktopMode,
                                         const bool &isHidden)
Currently this gets stored in a class member, but we don't seem to use it anywhere. It can probably be removed, although that raises the question of why we pass it in at all. I should come back and check this later.

Now that the partial build completed successfully I'm going to run it through the full build again so that I have a package to install. That's necessary for me to be able to build qtmozembed and sailfish-browser against the new header files from these changes.

[...]

It's towards the end of the day now and the build completed successfully. At least, I'm pretty sure it did. I stupidly closed the build output window by accident during the day. But the packages have a modified time of 15:30 today, which sounds about right.

The next step is to find out how to push the hidden flag through to the sailfish-browser front-end. In theory, with the interface changed, if I now try to build qtmozembed against the latest packages I've just built, it should fail in some way. If I find the place it's failing, that should give me a good place to start.

Let's build them and see what happens.
$ cd ../qtmozembed/
$ sfdk -c no-fix-version build -d
[...]
The following 5 packages are going to be reinstalled:
  xulrunner-qt5              91.9.1-1
  xulrunner-qt5-debuginfo    91.9.1-1
  xulrunner-qt5-debugsource  91.9.1-1
  xulrunner-qt5-devel        91.9.1-1
  xulrunner-qt5-misc         91.9.1-1

5 packages to reinstall.
[...]
Wrote: RPMS/SailfishOS-devel-aarch64/qtmozembed-qt5-tests-1.53.9-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/qtmozembed-qt5-devel-1.53.9-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/qtmozembed-qt5-debugsource-1.53.9-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/qtmozembed-qt5-1.53.9-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/qtmozembed-qt5-debuginfo-1.53.9-1.aarch64.rpm
[...]
Surprisingly the qtmozembed build all completed without error. What about sailfish-browser?
$ cd ../sailfish-browser/
$ sfdk -c no-fix-version build -d
[...]
The following 2 packages are going to be downgraded:
  qtmozembed-qt5        1.53.25+sailfishos.esr91.20231003080118.8b9a009-1 -> 1.53.9-1
  qtmozembed-qt5-devel  1.53.25+sailfishos.esr91.20231003080118.8b9a009-1 -> 1.53.9-1

2 packages to downgrade.
[...]
Wrote: RPMS/SailfishOS-devel-aarch64/sailfish-browser-debugsource-2.2.45-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/sailfish-browser-ts-devel-2.2.45-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/sailfish-browser-settings-2.2.45-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/sailfish-browser-2.2.45-1.aarch64.rpm
Wrote: RPMS/SailfishOS-devel-aarch64/sailfish-browser-tests-2.2.45-1.aarch64.rpm
[...]
Well that's all a bit strange. I thought there would be at least some exposure to some of the elements I changed. I guess I got that wrong.

The connection point is supposed to happen in qmozcontext.cpp where the QMozContextPrivate class implements EmbedLiteAppListener:
class QMozContextPrivate : public QObject, public EmbedLiteAppListener
{
[...]
    virtual uint32_t CreateNewWindowRequested(const uint32_t &chromeFlags,
                                              EmbedLiteView *aParentView,
                                              const uintptr_t
                                              &parentBrowsingContext) override;
Compare this with the method it's supposed to be overriding:
class EmbedLiteAppListener
{
[...]
  // New Window request which is usually coming from WebPage new window request
  virtual uint32_t CreateNewWindowRequested(const uint32_t &chromeFlags,
                                            const bool &hidden,
                                            EmbedLiteView *aParentView,
                                            const uintptr_t
                                            &parentBrowsingContext) { return 0; }
The override should be causing an error.

Since I didn't do a clean rebuild of qtmozembed I'm thinking maybe the reason is that it just didn't know to rebuild some of the files. Let's try it again, but this time with passion.
$ cd ../qtmozembed/
$ git clean -xdf
$ sfdk -c no-fix-version build -d
[...]
In file included from qmozcontext.cpp:22:
qmozcontext_p.h:53:22: error: ‘virtual uint32_t QMozContextPrivate::
    CreateNewWindowRequested(const uint32_t&, mozilla::embedlite::EmbedLiteView*,
    const uintptr_t&)’ marked ‘override’, but does not override
     virtual uint32_t CreateNewWindowRequested(const uint32_t &chromeFlags,
                      ^~~~~~~~~~~~~~~~~~~~~~~~
qmozcontext_p.h:53:22: warning:   by ‘virtual uint32_t QMozContextPrivate::
    CreateNewWindowRequested(const uint32_t&, mozilla::embedlite::EmbedLiteView*,
    const uintptr_t&)’ [-Woverloaded-virtual]
[...]
make[1]: *** [Makefile:701: ../src/moc_qmozcontext_p.o] Error 1
Okay, that's more like it. So now I know for sure that this is the place to start.

I've added the hidden parameter to QMozContextPrivate::CreateNewWindowRequested() and let the changes cascade through the rest of the code. There's quite a lot of abstraction in qtmozembed, but despite that the changes all seem quite reasonable and, crucially, result in a new Q_PROPERTY being created to expose the hidden flag to the front end.
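To make the shape of that change concrete, here's a minimal sketch of a listener interface and an implementation whose signature has been brought back in line with it. The names AppListener, ContextPrivate and requestWindow are hypothetical stand-ins rather than the real EmbedLite or qtmozembed classes; only the parameter list mirrors the method shown above.

```cpp
#include <cassert>
#include <cstdint>

using std::uint32_t;
using std::uintptr_t;

// Hypothetical stand-in for the real view class.
struct EmbedLiteView {};

// Hypothetical stand-in for the listener interface. The parameter list
// mirrors the updated method, including the new `hidden` flag.
class AppListener {
public:
    virtual ~AppListener() = default;
    virtual uint32_t CreateNewWindowRequested(const uint32_t & /*chromeFlags*/,
                                              const bool & /*hidden*/,
                                              EmbedLiteView * /*aParentView*/,
                                              const uintptr_t & /*parentBrowsingContext*/)
    {
        return 0;
    }
};

// Hypothetical stand-in for the implementing class. Because the method is
// marked `override`, the compiler rejects it unless the parameter list
// matches the base class exactly, which is what surfaced the mismatch.
class ContextPrivate : public AppListener {
public:
    uint32_t CreateNewWindowRequested(const uint32_t & /*chromeFlags*/,
                                      const bool &hidden,
                                      EmbedLiteView * /*aParentView*/,
                                      const uintptr_t & /*parentBrowsingContext*/) override
    {
        lastHidden = hidden;   // remember the flag so it can be exposed later
        return ++nextId;       // hand back a fresh window id
    }

    bool lastHidden = false;
    uint32_t nextId = 0;
};

// Calls through the base interface, as the engine side would.
uint32_t requestWindow(AppListener &listener, bool hidden)
{
    return listener.CreateNewWindowRequested(0, hidden, nullptr, 0);
}
```

Dropping the hidden parameter from the derived method in this sketch should reproduce the same kind of "marked 'override', but does not override" error as the one in the build output.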

After a few build-fail-fix cycles the packages now build fully without compiler or linker errors. The sailfish-browser code also links against them without issue... but it shouldn't. I probably need to clean out the existing build again first and give it another go. I've cleaned it out, but sailfish-browser takes a surprising amount of time to complete (nothing compared to gecko, about 20 mins or so, but that's still a lot of code to build).

I've set it going, let's see if there are errors now (there will be!).

[...]

And indeed there are! Thank goodness for that.
$ cd ../sailfish-browser/
$ git clean -xdf
$ sfdk -c no-fix-version build -d -p
[...]
../qtmozembed/declarativewebpagecreator.h:37:21: error: ‘virtual quint32
    DeclarativeWebPageCreator::createView(const quint32&, const uintptr_t&)’
    marked ‘override’, but does not override
     virtual quint32 createView(const quint32 &parentId, const uintptr_t &parentBrowsingContext) override;
                     ^~~~~~~~~~
compilation terminated due to -Wfatal-errors.
make[2]: *** [Makefile:988: declarativewebpagecreator.o] Error 1
I've been through and made the changes needed. Ultimately they all flowed through the code rather neatly, touching 12 files in qtmozembed and 10 files in sailfish-browser. Crucially the tab model now has a new HiddenRole role that can be used to hide certain tabs.

The next step will be to create a QSortFilterProxyModel on it so that the pages can be hidden. But it's already late now so this will be a task for tomorrow.
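As a rough picture of what that filtering amounts to, here's a minimal sketch in plain C++ with no Qt involved. The Tab struct and both function names are hypothetical; in the real code I'd expect the same decision to live in a QSortFilterProxyModel subclass that overrides filterAcceptsRow() and reads the flag via HiddenRole.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified tab record. In the real model the flag would be
// read through the new HiddenRole rather than a struct member.
struct Tab {
    int id;
    std::string url;
    bool hidden;
};

// Plays the part that filterAcceptsRow() would play in a proxy model:
// a row is passed through only if it isn't flagged as hidden.
bool acceptsTab(const Tab &tab)
{
    return !tab.hidden;
}

// Builds the filtered view the front end would see.
std::vector<Tab> visibleTabs(const std::vector<Tab> &tabs)
{
    std::vector<Tab> result;
    for (const Tab &tab : tabs) {
        if (acceptsTab(tab)) {
            result.push_back(tab);
        }
    }
    return result;
}
```

The source model keeps every tab, including the hidden print window, while anything displaying tabs only ever looks at the filtered result.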

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
21 Dec 2023 : Day 114 #
It has to be said, I'm pretty frustrated with myself. I've looked over the code in printUtils.js over and over again and I just can't seem to figure out what I'm doing wrong. I've been poring over this code for days now.

In theory it might be possible to create a browser element and add it to the current page, and use that as the print source. This seems to be roughly what the print preview is doing in the code that emilio highlighted:
  startPrintWindow(aBrowsingContext, aOptions) {
[...]
    if (openWindowInfo) {
      let printPreview = new PrintPreview({
        sourceBrowsingContext: aBrowsingContext,
        openWindowInfo,
      });
      let browser = printPreview.createPreviewBrowser("source");
      document.documentElement.append(browser);
      // Legacy print dialog or silent printing, the content process will print
      // in this <browser>.
      return browser;
    }

    let settings = this.getPrintSettings();
    settings.printSelectionOnly = printSelectionOnly;
    this.printWindow(aBrowsingContext, settings);
    return null;
  },
That call to createPreviewBrowser() goes on to do something like this:
  createPreviewBrowser(sourceVersion) {
    let browser = document.createXULElement("browser");
[...]
      browser.openWindowInfo = this.openWindowInfo;
[...]
    return browser;
  }
So it looks like the code is creating a browser element, appending it to the document and then... well, then what? It doesn't call printWindow() on the result, it just returns it. That could be because it's opening the print preview and waiting for the user to hit the "Print" button, but if that's the case it's no good for what I need, because I want it to go ahead and print straight away.

To assuage my frustration I'm going to leave this and — even if there is a solution that avoids it — just go ahead and implement the page hiding functionality that I've been talking about for the last few days. It feels like a defeat though, because there should be some other way, I just can't quite put my finger on it. And that's because I can't properly follow the flow of the code when it's all so abstract.

Alright, it's time to cut my losses and move on.

At least finally this means I've been able to actually do some coding. I've added a hidden (in some cases isHidden to match the style of a particular interface) parameter to various window creation methods scattered around the EmbedLite code. Changes like this:
--- a/embedding/embedlite/PEmbedLiteApp.ipdl
+++ b/embedding/embedlite/PEmbedLiteApp.ipdl
@@ -18,12 +18,12 @@ nested(upto inside_cpow) sync protocol PEmbedLiteApp {
 parent:
   async Initialized();
   async ReadyToShutdown();
-  sync CreateWindow(uint32_t parentId, uintptr_t parentBrowsingContext, uint32_t chromeFlags)
+  sync CreateWindow(uint32_t parentId, uintptr_t parentBrowsingContext, uint32_t chromeFlags, bool hidden)
     returns (uint32_t createdID, bool cancel);
   async PrefsArrayInitialized(Pref[] prefs);
This has knock-on effects to a lot of files, but all of them follow similar lines: a parameter is added to the method signature which is then passed on to some other method that gets called inside the method definition. I'm not going to list all of the changes here, but just note that it cascades throughout the EmbedLite portion of the code, across 22 files in total:
$ git status
On branch sailfishos-esr91
Your branch is up-to-date with 'origin/sailfishos-esr91'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
        modified:   embedding/embedlite/EmbedLiteApp.cpp
        modified:   embedding/embedlite/EmbedLiteApp.h
        modified:   embedding/embedlite/PEmbedLiteApp.ipdl
        modified:   embedding/embedlite/embedprocess/EmbedLiteAppProcessParent.cpp
        modified:   embedding/embedlite/embedprocess/EmbedLiteAppProcessParent.h
        modified:   embedding/embedlite/embedprocess/EmbedLiteViewProcessParent.cpp
        modified:   embedding/embedlite/embedprocess/EmbedLiteViewProcessParent.h
        modified:   embedding/embedlite/embedshared/EmbedLiteAppChild.cpp
        modified:   embedding/embedlite/embedshared/EmbedLiteAppChild.h
        modified:   embedding/embedlite/embedshared/EmbedLiteAppChildIface.h
        modified:   embedding/embedlite/embedshared/EmbedLiteAppParent.h
        modified:   embedding/embedlite/embedshared/EmbedLiteViewChild.cpp
        modified:   embedding/embedlite/embedshared/EmbedLiteViewChild.h
        modified:   embedding/embedlite/embedshared/EmbedLiteViewParent.cpp
        modified:   embedding/embedlite/embedshared/EmbedLiteViewParent.h
        modified:   embedding/embedlite/embedthread/EmbedLiteAppThreadParent.cpp
        modified:   embedding/embedlite/embedthread/EmbedLiteAppThreadParent.h
        modified:   embedding/embedlite/embedthread/EmbedLiteViewThreadChild.cpp
        modified:   embedding/embedlite/embedthread/EmbedLiteViewThreadChild.h
        modified:   embedding/embedlite/embedthread/EmbedLiteViewThreadParent.cpp
        modified:   embedding/embedlite/embedthread/EmbedLiteViewThreadParent.h
        modified:   embedding/embedlite/utils/WindowCreator.cpp
        modified:   gecko-dev (new commits, untracked content)

no changes added to commit (use "git add" and/or "git commit -a")
These changes will also have impacts elsewhere; in particular I expect them to bubble up into the qtmozembed and sailfish-browser code as well. In fact, I'm absolutely hoping they will, because the entire point is to get the details about the need to hide the window into the sailfish-browser code where it can actually be made to do something useful.
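The pattern of the cascade is easier to see in miniature. This is only a sketch under heavy simplification: the three structs are hypothetical stand-ins for the child, parent and listener layers, with the IPC machinery replaced by direct calls. The point is just that each layer gains a hidden parameter and hands it on unchanged.

```cpp
#include <cassert>
#include <cstdint>

using std::uint32_t;

// Hypothetical listener at the top of the chain (the embedding side).
struct Listener {
    bool sawHidden = false;

    uint32_t CreateNewWindowRequested(uint32_t /*chromeFlags*/, bool hidden)
    {
        sawHidden = hidden;   // the flag finally arrives where it can be used
        return 1;             // pretend a window with id 1 was created
    }
};

// Hypothetical parent-side handler: receives the request and forwards it.
struct AppParent {
    Listener *listener;

    bool RecvCreateWindow(uint32_t /*parentId*/, uint32_t chromeFlags,
                          bool hidden, uint32_t *createdId)
    {
        assert(listener != nullptr && createdId != nullptr);
        *createdId = listener->CreateNewWindowRequested(chromeFlags, hidden);
        return true;
    }
};

// Hypothetical child-side caller: where the request originates.
struct AppChild {
    AppParent *parent;

    uint32_t CreateWindow(uint32_t parentId, uint32_t chromeFlags, bool hidden)
    {
        assert(parent != nullptr);
        uint32_t createdId = 0;
        parent->RecvCreateWindow(parentId, chromeFlags, hidden, &createdId);
        return createdId;
    }
};
```

Every real file in the list follows the same shape: one extra parameter in the signature and one extra argument at the call site.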

However, having made these changes, a partial build doesn't work because the PEmbedLiteApp.ipdl code needs to be regenerated, so I'm going to have to set this going on a full build overnight. I've tested the changes as far as possible by running partial builds but it won't fully go through because of this. I'll just have to hope that I covered all of the changes needed.

While it builds I could start looking at the code in the other packages, but without the gecko headers to build against I won't be able to compile any of those other changes either, so I'm going to leave those until tomorrow as well.

Here's hoping the build goes through okay!

20 Dec 2023 : Day 113 #
It's a new dawn, it's a new day. As I write this the sun is peeking up over the horizon casting all the clouds orange against the pastel blue sky, trees silhouetted against the skyline. It feels like a good day to start moving forwards with this and implementing "hidden windows" in the sailfish-browser code.
 
Trees silhouetted against an orange-blue sunrise

Thankfully I've nearly returned to full health again and as I write this am feeling a lot better. Everyone has been so generous with their kind wishes, it's made a real difference. Wherever you are in the world, I hope you're in good health, and if not I hope your recovery is swift and full!

I'm also very excited about the fact my talk on all this gecko development has been accepted for presentation at FOSDEM in the FOSS on Mobile devroom! If you're planning to be in Brussels yourself on the 3-4 February I really hope to see you there.

I'm just about to start coding when I notice a notification icon on my Matrix client. It seems emilio has got back to me. Yesterday I asked on the Mozilla Matrix "Printing" channel whether I'd need to implement some "window hiding" feature in the front-end to handle these empty print windows. Here's the reply.
 
Well instead of a window you can create a browser like this: https://searchfox.org/mozilla-central/search?q=symbol:%23handleStaticCloneCreatedForPrint

It's a brief but pithy answer. Digging through the links provided, a few things spring to mind:
  1. The handleStaticCloneCreatedForPrint() method in printUtils.js doesn't exist in ESR 91. Maybe I'll have to create it?
  2. Nor does the createParentBrowserForStaticClone() method in the same file; I'll probably have to back port that too.
  3. The OPEN_PRINT_BROWSER case in browser.js exists in ESR 91, but I'm not sure whether it's getting executed. I should check.
  4. I'm wondering if, in order to go down this route, I'd need to call startPrintWindow() in printUtils.js to initiate a print, rather than the CanonicalBrowsingContext::print() method I'm currently using.
Lots of questions. The first thing to check is whether the OPEN_PRINT_BROWSER execution path is being taken at all in my current build. As this is JavaScript code I can't use the debugger to find out; I'll need to put some debug print statements into the code instead.

As I dig around in the code I quickly come to realise that browser.js isn't a file that's used by sailfish-browser. I guess this functionality is largely handled by the front-end QML code instead. However the printUtils.js file is there in omni.ja so I can still add some debug output to that.
  /**  
   * Initialize a print, this will open the tab modal UI if it is enabled or
   * defer to the native dialog/silent print.
   *
   * @param aBrowsingContext
   *        The BrowsingContext of the window to print.
   *        Note that the browsing context could belong to a subframe of the
   *        tab that called window.print, or similar shenanigans.
   * @param aOptions
   *        {openWindowInfo}      Non-null if this call comes from window.print().
   *                              This is the nsIOpenWindowInfo object that has to
   *                              be passed down to createBrowser in order for the
   *                              child process to clone into it.
   *        {printSelectionOnly}  Whether to print only the active selection of
   *                              the given browsing context.
   *        {printFrameOnly}      Whether to print the selected frame.
   */
  startPrintWindow(aBrowsingContext, aOptions) {
    dump("PRINT: startPrintWindow\n");
[...]

  /**  
   * Starts the process of printing the contents of a window.
   *
   * @param aBrowsingContext
   *        The BrowsingContext of the window to print.
   * @param {Object?} aPrintSettings
   *        Optional print settings for the print operation
   */
  printWindow(aBrowsingContext, aPrintSettings) {
    dump("PRINT: printWindow\n");
[...]
I pack up these changes into omni.ja, run the browser and set the print running. There's plenty of debug output, but neither of these new entries shows up.

So the next step is to switch out the call to CanonicalBrowsingContext::print() for a call to startPrintWindow() if that's possible.

Doing this directly doesn't give good results, for multiple reasons. First we get an error stating that MozElements isn't defined. That's coming from the definition of PrintPreview in the printUtils.js file:
class PrintPreview extends MozElements.BaseControl {
[...]
The definition of MozElements happens in customElements.js and should be a global within the context of a chrome window. As the text at the top of the file explains: "This is loaded into chrome windows with the subscript loader." For the time being I can work around this by calling PrintUtils.printWindow() instead of PrintUtils.startPrintWindow() because the former doesn't make use of the preview window.

However, when I do this the script complains that document is undefined, caused by this condition in printWindow():
    const printPreviewIsOpen = !!document.getElementById(
      "print-preview-toolbar"
    );
Again, I can work around this by setting printPreviewIsOpen to false and commenting out these lines. Having made these two changes, the print works but opens a window, just as before.

Looking more carefully through the ESR 91 code and the code that emilio mentioned, it looks to me like this is the critical part (although I'm not in the least bit certain about this):
    if (openWindowInfo) {
      let printPreview = new PrintPreview({
        sourceBrowsingContext: aBrowsingContext,
        openWindowInfo,
      });
      let browser = printPreview.createPreviewBrowser("source");
      document.documentElement.append(browser);
      // Legacy print dialog or silent printing, the content process will print
      // in this <browser>.
      return browser;
    }
This will create the preview window which will be used for the silent printing. But even this doesn't seem to be doing what we need: it's still creating the browser window.

The fact that neither document nor MozElements is defined makes me think that this needs to be called from inside a chrome window for this to work. And I don't know how to do that.

I've come full circle again. I've sent a message to emilio for clarification, but I'm yet again brought back to the thought that I'm going to need to hide this window manually, whether it's the print preview browser, or a tab that's opened to hold the cloned document in.

I feel like I've been chasing my tail yet again. I need some positive thoughts: tomorrow I plan to be back to full health and doing some actual coding.

19 Dec 2023 : Day 112 #
I'm still feeling poorly today, which is very frustrating. When I've got a temperature I just can't focus well on the code, everything swims around and refuses to settle into a comprehensible form. I find myself drifting from one file to another and losing the thread of where I've been and why I'm here. So you'll have to excuse me if some things I write today look a little nonsensical.

There's a silver lining though, in the form of the kind words I received from Thigg, Valorsoguerriero97, poetaster and (privately) throwaway69 via the Sailfish Forum. Thank you! It really helps motivate me to continue onwards. And it amazes me that you have the stamina to keep up with these posts!

As we discussed yesterday, looking at this bit of code inside nsGlobalWindowOuter::Print() we can see that if the condition is entered, the browsing context is just copied over directly from the source to the new context:
  nsAutoSyncOperation sync(docToPrint, SyncOperationBehavior::eAllowInput);
  AutoModalState modalState(*this);

  nsCOMPtr<nsIContentViewer> cv;
  RefPtr<BrowsingContext> bc;
  bool hasPrintCallbacks = false;
  if (docToPrint->IsStaticDocument() &&
      (aIsPreview == IsPreview::Yes ||
       StaticPrefs::print_tab_modal_enabled())) {
    if (aForWindowDotPrint == IsForWindowDotPrint::Yes) {
      aError.ThrowNotSupportedError(
          "Calling print() from a print preview is unsupported, did you intend "
          "to call printPreview() instead?");
      return nullptr;
    }
    // We're already a print preview window, just reuse our browsing context /
    // content viewer.
    bc = sourceBC;
I'm left wondering whether, if we went through that branch, we might end up just using the existing context. It's clear from the code later that if this were to happen, the code that opens the new window would be skipped.

It's also clearly not meant to be doing this: forcing execution through this branch would be a dubious long shot at best. This bit of code is only supposed to be used if the print preview window context is to be re-used. We're not creating a print preview window, so if we hack this, it'll be attempting to use the actual context for the page instead.

So while I don't expect it to produce good results, I'm interested to see what will happen.

Looking at the condition that's gating the code block, the following part of the condition will be false:
  aIsPreview == IsPreview::Yes
Therefore what we'll need is for both docToPrint->IsStaticDocument() and StaticPrefs::print_tab_modal_enabled() to be true. Checking the about:config page I note that the print.tab_modal.enabled value is already set to true, so we just need the docToPrint->IsStaticDocument() call to return true for the condition to hold.
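The logic of that gate can be written out on its own. This is just a sketch of the boolean structure of the condition quoted above, with a hypothetical function name and plain bool arguments standing in for the real calls.

```cpp
#include <cassert>

enum class IsPreview { No, Yes };

// Mirrors the shape of the gating condition: the document must be static
// AND either this is a preview OR the tab-modal preference is enabled.
// The function name and parameters are hypothetical.
constexpr bool reusesSourceContext(bool isStaticDocument,
                                   IsPreview isPreview,
                                   bool tabModalEnabled)
{
    return isStaticDocument
        && (isPreview == IsPreview::Yes || tabModalEnabled);
}

// The situation described in the text: not a preview, preference enabled,
// so the outcome depends entirely on whether the document is static.
static_assert(!reusesSourceContext(false, IsPreview::No, true),
              "a non-static document never enters the branch");
static_assert(reusesSourceContext(true, IsPreview::No, true),
              "a static document enters it when the pref is enabled");
```

With the preview flag fixed at No and the preference fixed at true, the static-document flag is the only input left to flip.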

To test all this out I put a breakpoint just before this block of code using the debugger and examine the state of the system when it hits. It turns out that docToPrint->IsStaticDocument() is indeed false, but we can switch its value using the debugger to force our way into the conditional code block.
$ EMBED_CONSOLE=1 MOZ_LOG="EmbedLite:5" gdb sailfish-browser
[...]
(gdb) b nsGlobalWindowOuter.cpp:5287
Breakpoint 1 at 0x7fba96f364: file dom/base/nsGlobalWindowOuter.cpp, line 5287.
(gdb) c
Continuing.
[...]
Thread 8 "GeckoWorkerThre" hit Breakpoint 1, nsGlobalWindowOuter::Print
    (nsIPrintSettings*, nsIWebProgressListener*, nsIDocShell*,
    nsGlobalWindowOuter::IsPreview, nsGlobalWindowOuter::IsForWindowDotPrint,
    std::function<void (mozilla::dom::PrintPreviewResultInfo const&)>&&,
    mozilla::ErrorResult&) (this=this@entry=0x7f88ab8100,
    aPrintSettings=aPrintSettings@entry=0x7e3553a790,
    aListener=aListener@entry=0x7f8a6730d0,
    aDocShellToCloneInto=aDocShellToCloneInto@entry=0x0,
    aIsPreview=aIsPreview@entry=nsGlobalWindowOuter::IsPreview::No,
    aForWindowDotPrint=aForWindowDotPrint@entry=nsGlobalWindowOuter::
    IsForWindowDotPrint::No, aPrintPreviewCallback=..., aError=...)
    at dom/base/nsGlobalWindowOuter.cpp:5287
5287      nsAutoSyncOperation sync(docToPrint, SyncOperationBehavior::eAllowInput);
(gdb) n
5288      AutoModalState modalState(*this);
(gdb) p docToPrint->IsStaticDocument()
Attempt to take address of value not located in memory.
(gdb) p docToPrint
$1 = {mRawPtr = 0x7f89077650}
(gdb) p aIsPreview
$2 = nsGlobalWindowOuter::IsPreview::No
(gdb) p docToPrint.mRawPtr.mIsStaticDocument
$3 = false
(gdb) set variable docToPrint.mRawPtr.mIsStaticDocument = true
(gdb) p docToPrint.mRawPtr.mIsStaticDocument
$4 = true
(gdb) c
Continuing.
[New LWP 19986]

Thread 13 "Socket Thread" received signal SIGPIPE, Broken pipe.

Thread 8 "GeckoWorkerThre" received signal SIGSEGV, Segmentation fault.
nsPrintJob::FindFocusedDocument (this=this@entry=0x7f885e3020,
    aDoc=aDoc@entry=0x7f89077650)
    at layout/printing/nsPrintJob.cpp:2411
2411      nsPIDOMWindowOuter* window = aDoc->GetOriginalDocument()->GetWindow();
(gdb) bt
#0  nsPrintJob::FindFocusedDocument (this=this@entry=0x7f885e3020,
    aDoc=aDoc@entry=0x7f89077650) at layout/printing/nsPrintJob.cpp:2411
#1  0x0000007fbc3bac6c in nsPrintJob::DoCommonPrint
    (this=this@entry=0x7f885e3020, aIsPrintPreview=aIsPrintPreview@entry=false,
    aPrintSettings=aPrintSettings@entry=0x7e3553a790,
    aWebProgressListener=aWebProgressListener@entry=0x7f8a6730d0,
    aDoc=aDoc@entry=0x7f89077650) at layout/printing/nsPrintJob.cpp:548
#2  0x0000007fbc3bb718 in nsPrintJob::CommonPrint
    (this=this@entry=0x7f885e3020, aIsPrintPreview=aIsPrintPreview@entry=false,
    aPrintSettings=aPrintSettings@entry=0x7e3553a790,
    aWebProgressListener=aWebProgressListener@entry=0x7f8a6730d0,
    aSourceDoc=aSourceDoc@entry=0x7f89077650)
    at layout/printing/nsPrintJob.cpp:488
#3  0x0000007fbc3bb840 in nsPrintJob::Print (this=this@entry=0x7f885e3020,
    aSourceDoc=<optimized out>, aPrintSettings=aPrintSettings@entry=0x7e3553a790,
    aWebProgressListener=aWebProgressListener@entry=0x7f8a6730d0)
    at layout/printing/nsPrintJob.cpp:824
#4  0x0000007fbc108fe4 in nsDocumentViewer::Print (this=0x7f8905e190,
    aPrintSettings=0x7e3553a790, aWebProgressListener=0x7f8a6730d0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#5  0x0000007fba96f49c in nsGlobalWindowOuter::Print(nsIPrintSettings*,
    nsIWebProgressListener*, nsIDocShell*, nsGlobalWindowOuter::IsPreview,
    nsGlobalWindowOuter::IsForWindowDotPrint, std::function<void
    (mozilla::dom::PrintPreviewResultInfo const&)>&&, mozilla::ErrorResult&)
    (this=this@entry=0x7f88ab8100,
    aPrintSettings=aPrintSettings@entry=0x7e3553a790,
    aListener=aListener@entry=0x7f8a6730d0,
    aDocShellToCloneInto=aDocShellToCloneInto@entry=0x0,
    aIsPreview=aIsPreview@entry=nsGlobalWindowOuter::IsPreview::No,
    aForWindowDotPrint=aForWindowDotPrint@entry=nsGlobalWindowOuter::
    IsForWindowDotPrint::No, aPrintPreviewCallback=..., aError=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#6  0x0000007fbc7eb714 in mozilla::dom::CanonicalBrowsingContext::Print
    (this=this@entry=0x7f88cb8b70, aPrintSettings=0x7e3553a790, aRv=...)
    at include/c++/8.3.0/bits/std_function.h:402
#7  0x0000007fbab82f08 in mozilla::dom::CanonicalBrowsingContext_Binding::print
    (args=..., void_self=0x7f88cb8b70, obj=..., cx_=0x7f881df400)
    at BrowsingContextBinding.cpp:4674
#8  mozilla::dom::CanonicalBrowsingContext_Binding::print_promiseWrapper
    (cx=0x7f881df400, obj=..., void_self=0x7f88cb8b70, args=...)
    at BrowsingContextBinding.cpp:4688
[...]
#34 0x0000007fbd16635c in js::jit::MaybeEnterJit (cx=0x7f881df400, state=...)
    at js/src/jit/Jit.cpp:207
#35 0x0000007f8824be41 in ?? ()
Backtrace stopped: Cannot access memory at address 0x56e206215288
(gdb)
Okay, so this plan clearly isn't going to work.
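I haven't verified the exact cause, but one plausible reading of the backtrace is that flipping the flag by hand leaves the document claiming to be a static clone while having no original document behind it, so the chained call on line 2411 dereferences a null pointer. Here's a minimal model of that situation. The Document and Window structs are hypothetical simplifications, not the real gecko classes, and the null check is there only to make the failure mode visible rather than as a proposed fix.

```cpp
#include <cassert>

struct Window {};

// Hypothetical, heavily simplified document. In this model a genuine
// static clone has both the flag set and a pointer to its original.
struct Document {
    bool isStaticDocument = false;
    Document *originalDocument = nullptr;
    Window *window = nullptr;

    Document *GetOriginalDocument() const { return originalDocument; }
    Window *GetWindow() const { return window; }
};

// Checked equivalent of GetOriginalDocument()->GetWindow(): returns
// nullptr where the unchecked chain would crash.
Window *findOriginalWindow(const Document &doc)
{
    Document *original = doc.GetOriginalDocument();
    return original ? original->GetWindow() : nullptr;
}
```

In this model a document with the flag forced to true but no original attached yields nullptr, which is the case that would segfault without the check.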

Putting this dead-end to one side, we still have two options on the table:
  1. Figure out what's happening in ESR 78 where the window isn't being created.
  2. Re-code the front-end to hide the additional window.
The more I look at the code the more I think I'm going to have to go down the second route. Nevertheless I'd like to try to compare execution with ESR 78 one more time to try to figure out how it can get away without creating a clone. In particular, there is this BuildNestedPrintObjects() method in ESR 78 which I still suspect is performing the role of the cloning. This code still exists in ESR 91 and the debugger tells me it's not being called. But for comparison I'd really like to know whether it's being called in ESR 78.

Unfortunately the debugger just refuses to work properly on the ESR 78 code as installed from the repository:
$ EMBED_CONSOLE=1 MOZ_LOG="EmbedLite:5" gdb sailfish-browser
(gdb) r
(gdb) b BuildNestedPrintObjects
Breakpoint 1 at 0x7fbc20b7a8: file layout/printing/nsPrintJob.cpp, line 403.
(gdb) c
Continuing.

Thread 8 "GeckoWorkerThre" hit Breakpoint 1, BuildNestedPrintObjects (aDocument=
dwarf2read.c:10473: internal-error: process_die_scope::process_die_scope
    (die_info*, dwarf2_cu*): Assertion `!m_die->in_process' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n

This is a bug, please report it.  For instructions, see:
<http://www.gnu.org/software/gdb/bugs/>.

dwarf2read.c:10473: internal-error: process_die_scope::process_die_scope
    (die_info*, dwarf2_cu*): Assertion `!m_die->in_process' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
Command aborted.
(gdb) 
This is basically scuppering any attempt I make to breakpoint on these functions, or indeed anything nearby. I've tried this on two separate devices now and get the same results, so I'm pretty sure this is due to the debug symbols in the repository or a bug in gdb rather than some specific misconfiguration of gdb on my phone.

To try to address this, I'm going to rebuild ESR 78 and install completely new debug packages on my device. That will at least discount the possibility of it being something corrupt about the debug symbols coming from the official repositories.

I've set the build going. While it builds I can't do much else except read through the code some more. But I also decide to try to ask around on the Mozilla Matrix "Printing" channel to see if anyone has any advice about how to tackle this. Here's the message I posted:
 
Hi. I have a query about cloning the document for printing PDF to file. I'm upgrading gecko from ESR78 to ESR91 for Sailfish OS (a mobile Linux variant) where we have a "Save page to PDF" option in the Qt UI. In ESR78 the nsPrintJob::DoCommonPrint() method is called but in ESR91 this no longer seems to work so I call CanonicalBrowsingContext::Print() instead (seems to relate to D87063). The former clones the doc using (I think) BuildNestedPrintObjects(), but the latter seems to call OpenInternal() inside nsGlobalWindowOuter::Print() to open a new window for it instead. Is this correct, or am I misunderstanding the changes? The reason I ask is that the latter opens a new blank window for the clone to go into, which I'm trying to avoid.

I post this at 15:47 and wait. By the evening I've received a reply from emilio:
 
We still clone the old doc. In order to avoid the new window you need to handle OPEN_PRINT_BROWSER flag

That's really helpful; it suggests that hiding the extra window in the user interface really is the right way to address this. I post a follow-up to get clarification on this.
 
Thanks for your reply emilio. So I have to allow the window to be created, but hide it in the front-end based on the fact the OPEN_PRINT_BROWSER flag (from nsOpenWindowInfo::GetIsForPrinting()) is set?

As I write this I'm still awaiting a reply, but I've already come to a conclusion on this: I'll need to add code so that some windows can be created but hidden from the front-end. I already have some ideas for how this can work. It will require some bigger changes than I was hoping for, but on the plus side, most if not all of the changes will happen in Sailfish code, rather than gecko code.

And you never know, there may be some other use for the functionality in the future as well.

The ESR 78 build is still chugging away. It's quite late here and I'm feeling rotten. I've completed barely a fraction of the things I was hoping to get done today, but I have at least reached a conclusion for how to proceed with this printing situation. So the day hasn't been a complete wipe-out.

I really hope I'm feeling better in the morning.

18 Dec 2023 : Day 111 #
I'm still feeling unwell. Plus I had a long couple of days at work yesterday and the day before. All in all this has left me in a bit of a mess and I spent last night trying to figure out how the parent browsing context and the new browsing context relate to one another. Although I did get to a piece of code that looked relevant, I wasn't able to piece together the processes either side that led up to it and that led away from it, all of which should have re-converged inside nsWindowWatcher::OpenWindowInternal().

What I did do was add in some code — mimicking the code further up the stack which usually performs the task elsewhere — to create the new browser context from the parent browser context into this method, in an attempt to short circuit the process that would usually require a new window to be created.

It built overnight and now I'm testing it out.

Unfortunately it leaves us with a segfault occurring inside nsGlobalWindowOuter::Print(). Let's take a record of the backtrace.
Thread 8 "GeckoWorkerThre" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 6812]
nsGlobalWindowOuter::Print(nsIPrintSettings*, nsIWebProgressListener*,
    nsIDocShell*, nsGlobalWindowOuter::IsPreview,
    nsGlobalWindowOuter::IsForWindowDotPrint, std::function<void
    (mozilla::dom::PrintPreviewResultInfo const&)>&&, mozilla::ErrorResult&)
    (this=this@entry=0x7f8843e530, 
    aPrintSettings=aPrintSettings@entry=0x7e33d83c80,
    aListener=aListener@entry=0x7f88f44c00,
    aDocShellToCloneInto=aDocShellToCloneInto@entry=0x0, 
    aIsPreview=aIsPreview@entry=nsGlobalWindowOuter::IsPreview::No, 
    aForWindowDotPrint=aForWindowDotPrint@entry=nsGlobalWindowOuter::
    IsForWindowDotPrint::No, aPrintPreviewCallback=..., aError=...)
    at dom/base/nsGlobalWindowOuter.cpp:5351
5351        cloneDocShell->GetContentViewer(getter_AddRefs(cv));
(gdb) bt
#0  nsGlobalWindowOuter::Print(nsIPrintSettings*, nsIWebProgressListener*,
    nsIDocShell*, nsGlobalWindowOuter::IsPreview,
    nsGlobalWindowOuter::IsForWindowDotPrint, std::function<void
    (mozilla::dom::PrintPreviewResultInfo const&)>&&, mozilla::ErrorResult&)
    (this=this@entry=0x7f8843e530, 
    aPrintSettings=aPrintSettings@entry=0x7e33d83c80,
    aListener=aListener@entry=0x7f88f44c00,
    aDocShellToCloneInto=aDocShellToCloneInto@entry=0x0, 
    aIsPreview=aIsPreview@entry=nsGlobalWindowOuter::IsPreview::No, 
    aForWindowDotPrint=aForWindowDotPrint@entry=nsGlobalWindowOuter::
    IsForWindowDotPrint::No, aPrintPreviewCallback=..., aError=...)
    at dom/base/nsGlobalWindowOuter.cpp:5351
#1  0x0000007fbc7eb714 in mozilla::dom::CanonicalBrowsingContext::Print
    (this=this@entry=0x7f88c83400, aPrintSettings=0x7e33d83c80, aRv=...)
    at include/c++/8.3.0/bits/std_function.h:402
#2  0x0000007fbab82f08 in mozilla::dom::CanonicalBrowsingContext_Binding::print
    (args=..., void_self=0x7f88c83400, obj=..., cx_=0x7f881df400)
    at BrowsingContextBinding.cpp:4674
#3  mozilla::dom::CanonicalBrowsingContext_Binding::print_promiseWrapper
    (cx=0x7f881df400, obj=..., void_self=0x7f88c83400, args=...)
    at BrowsingContextBinding.cpp:4688
#4  0x0000007fbb2e2960 in mozilla::dom::binding_detail::GenericMethod
    <mozilla::dom::binding_detail::NormalThisPolicy,
    mozilla::dom::binding_detail::ConvertExceptionsToPromises>
    (cx=0x7f881df400, argc=<optimized out>, vp=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/CallArgs.h:207
[...]
#29 0x0000007fbd1662cc in js::jit::MaybeEnterJit (cx=0x7f881df400, state=...)
    at js/src/jit/Jit.cpp:207
#30 0x0000007f8824be41 in ?? ()
Backtrace stopped: Cannot access memory at address 0x4d673b58921e
(gdb) 
Here's the code (it's the last line that's causing the segfault):
    nsCOMPtr<nsIDocShell> cloneDocShell = bc->GetDocShell();
    MOZ_DIAGNOSTIC_ASSERT(cloneDocShell);
    cloneDocShell->GetContentViewer(getter_AddRefs(cv));
And here's the reason for the segfault:
(gdb) p cloneDocShell
$1 = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}
(gdb) 
So, from this, it looks very much like the browser context simply doesn't have a docShell. Maybe there's a whole bunch of other stuff it doesn't — but should — have as well?
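
As a rough illustration of the failure, here's a self-contained sketch that swaps the crash for an explicit null check. The MOZ_DIAGNOSTIC_ASSERT above the crashing line evidently didn't halt execution in this build, so the null pointer was dereferenced. All of the types below are hypothetical stand-ins, not the real gecko classes, and the error text is borrowed from the "No docshell" case that appears elsewhere in nsGlobalWindowOuter::Print().

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins for the content viewer, docShell and browsing
// context. They only model the ownership relationships, nothing more.
struct ContentViewer {};

struct DocShell {
  std::shared_ptr<ContentViewer> viewer = std::make_shared<ContentViewer>();
};

struct BrowsingContext {
  std::shared_ptr<DocShell> docShell;  // stays null for a detached context
  DocShell* GetDocShell() const { return docShell.get(); }
};

// Returns the content viewer for printing, or nullptr with `error` set when
// the browsing context has no docShell. The explicit check turns what would
// otherwise be a null dereference into a reportable error.
std::shared_ptr<ContentViewer> ViewerForPrint(const BrowsingContext& bc,
                                              std::string& error) {
  DocShell* cloneDocShell = bc.GetDocShell();
  if (!cloneDocShell) {
    error = "No docshell";
    return nullptr;
  }
  return cloneDocShell->viewer;
}
```

A guard like this wouldn't make printing work, since the docShell is still missing, but it would turn the segfault into an error that can be reported.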

Looking back at the original code that creates the new browser context inside EmbedLiteViewChild::InitGeckoWindow(), there's this snippet of code following the creation that looks particularly relevant:
  // nsWebBrowser::Create creates nsDocShell, calls InitWindow for nsIBaseWindow,
  // and finally creates nsIBaseWindow. When browsingContext is passed to
  // nsWebBrowser::Create, typeContentWrapper type is passed to the nsWebBrowser
  // upon instantiation.
  mWebBrowser = nsWebBrowser::Create(mChrome, mWidget, browsingContext,
                                     nullptr);
With the short-circuit we introduced yesterday this is no longer being executed. So no docShell for the browser context. So segfaults. So no good.

Back to the drawing board and, in particular, to figuring out the path to how this EmbedLiteViewChild::InitGeckoWindow() method gets called.

It would be so nice to cheat on this. I could execute the code in the debugger, stick a breakpoint on the method and observe the path in the backtrace, were it not for the fact I introduced that short circuit. Since I did, this method no longer gets called, so the breakpoint will no longer fire.

The short circuit is broken anyway so I've removed it and set the build running. But it will be hours before it completes so I may as well peruse the code in the meantime in case I can figure it out manually.

I've also just noticed that just after creation of the mWebBrowser that we saw above, there's this call here:
  rv = mWebBrowser->SetVisibility(true);
I'm also interested to know whether this could be of use to us, so that's also something to follow in the code (although I don't think it'll make any difference on the tab display side).

So from manual inspection, this InitGeckoWindow() gets triggered by the EmbedLiteViewChild constructor. The constructor isn't called directly, but runs as part of constructing the EmbedLiteViewThreadChild and EmbedLiteViewProcessChild subclasses. I think the browser uses the thread version so I'm going to follow that route.

An instance of EmbedLiteViewThreadChild is created in the EmbedLiteAppThreadChild::AllocPEmbedLiteViewChild() method, which isn't called from anywhere in the main codebase. However, just to highlight how frustratingly convoluted this is, there is some generated code which is presumably the place responsible for calling it:
auto PEmbedLiteAppChild::OnMessageReceived(const Message& msg__) ->
    PEmbedLiteAppChild::Result
{
[...]
    switch (msg__.type()) {
    case PEmbedLiteApp::Msg_PEmbedLiteViewConstructor__ID:
        {
[...]
            msg__.EndRead(iter__, msg__.type());
            PEmbedLiteViewChild* actor =
                (static_cast<EmbedLiteAppChild*>(this))->
                AllocPEmbedLiteViewChild(windowId, id, parentId,
                parentBrowsingContext, isPrivateWindow, isDesktopMode);
[...]
The sending of this message (through a bit of generated indirection) happens as a result of a call to PEmbedLiteAppParent::SendPEmbedLiteViewConstructor(), which (finally we got there) is called here:
EmbedLiteView*
EmbedLiteApp::CreateView(EmbedLiteWindow* aWindow, uint32_t aParent,
  uintptr_t aParentBrowsingContext, bool aIsPrivateWindow, bool isDesktopMode)
{
  LOGT();
  NS_ASSERTION(mState == INITIALIZED, "The app must be up and runnning by now");
  static uint32_t sViewCreateID = 0;
  sViewCreateID++;

  PEmbedLiteViewParent* viewParent = static_cast<PEmbedLiteViewParent*>(
      mAppParent->SendPEmbedLiteViewConstructor(aWindow->GetUniqueID(),
          sViewCreateID, aParent, aParentBrowsingContext, aIsPrivateWindow,
          isDesktopMode));
  EmbedLiteView* view = new EmbedLiteView(this, aWindow, viewParent, sViewCreateID);
  mViews[sViewCreateID] = view;
  return view;
}
This is part of the main codebase, not auto-generated.
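
The bookkeeping in CreateView() is simple enough to model in isolation. This is only a sketch with hypothetical names (ViewRegistry, View), not the EmbedLite implementation. It shows the one property that matters later: because the counter is incremented before use, the first view gets ID 1 and an ID of 0 is never handed out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical stand-in for a view; only the fields needed for the sketch.
struct View {
  uint32_t id;
  uint32_t parentId;
};

// Hypothetical registry modelling the counter-plus-map pattern.
class ViewRegistry {
 public:
  // Allocates the next ID and stores the new view under it.
  uint32_t CreateView(uint32_t parentId) {
    ++mLastId;  // incremented before use, so 0 is never allocated
    mViews[mLastId] = std::make_unique<View>(View{mLastId, parentId});
    return mLastId;
  }

  // Returns the view registered under `id`, or nullptr if there is none.
  const View* GetViewByID(uint32_t id) const {
    auto it = mViews.find(id);
    return it == mViews.end() ? nullptr : it->second.get();
  }

 private:
  uint32_t mLastId = 0;
  std::map<uint32_t, std::unique_ptr<View>> mViews;
};
```

Creating two views in a row yields IDs 1 and 2, and looking up ID 0 always returns nullptr.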

There's always a balance between minimising changes to the entire codebase and minimising changes to the gecko codebase. My preference is for the latter wherever possible since it will help minimise maintenance in the long run. In this case, working on this principle, it looks like the best thing to do is to allow the print status to be passed down through the EmbedLite code and into the front end. This will allow the front end to choose how to deal with the new window, which after some additional front end changes, could be to hide the window.
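
To make the idea concrete, here's a minimal sketch of what passing the print status through to the front end might look like. Everything in it is an assumption for illustration: the `hidden` parameter, the AppListener and FrontEnd classes and the RequestWindow() helper are hypothetical, loosely modelled on the listener pattern, and none of them exist in the EmbedLite or sailfish-browser code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical listener interface. The `hidden` parameter is the assumed
// addition that carries the print status to the front end.
class AppListener {
 public:
  virtual ~AppListener() = default;
  virtual uint32_t CreateNewWindowRequested(uint32_t chromeFlags,
                                            uint32_t parentId,
                                            bool hidden) = 0;
};

// Hypothetical front end: every request still creates a tab, so the engine
// always gets a usable ID back, but hidden tabs are never shown.
class FrontEnd : public AppListener {
 public:
  struct Tab {
    uint32_t id;
    uint32_t parentId;
    bool visible;
  };

  uint32_t CreateNewWindowRequested(uint32_t /*chromeFlags*/,
                                    uint32_t parentId,
                                    bool hidden) override {
    Tab tab{static_cast<uint32_t>(mTabs.size() + 1), parentId, !hidden};
    mTabs.push_back(tab);
    return tab.id;
  }

  // Counts the tabs the user would actually see.
  std::size_t VisibleTabCount() const {
    std::size_t count = 0;
    for (const Tab& tab : mTabs) {
      if (tab.visible) {
        ++count;
      }
    }
    return count;
  }

 private:
  std::vector<Tab> mTabs;
};

// Engine side: forwards the request, returning 0 when there is no listener.
uint32_t RequestWindow(AppListener* listener, uint32_t parentId,
                       bool isForPrinting) {
  return listener
             ? listener->CreateNewWindowRequested(0, parentId, isForPrinting)
             : 0;
}
```

The design choice is that the engine still receives a real view ID for the print window. Only the decision about whether to display it moves to the front end.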

Before I do that, I'm going to double check the reason why all of this isn't necessary in ESR 78. I think I've reached the stage where I understand the process flow in the ESR 91 print, browser context and window creation code sufficiently to make it worthwhile attempting a comparison.

So what I want to do is to check whether nsGlobalWindowOuter::Print() is called as part of the ESR 78 process and, if so, figure out what happens at this point:
    // We're already a print preview window, just reuse our browsing context /
    // content viewer.
    bc = sourceBC;
    nsCOMPtr<nsIDocShell> docShell = bc->GetDocShell();
    if (!docShell) {
      aError.ThrowNotSupportedError("No docshell");
      return nullptr;
    }
If that code is being executed it might imply a route that could work on ESR 91 that also avoids having to open the new window.

That's it for today though. I'll have to pick this question up again tomorrow.

17 Dec 2023 : Day 110 #
Unfortunately I woke up this morning with some kind of Winter ailment: headache, sore throat, cough and all the rest. This has knocked me out for pretty much the entire day, including all of my planned gecko work.

The thing I hate most about being unwell is that it totally throws off all of my plans. I really hate that. It's not just my gecko work, but I had big plans to work on a Rust project today as well.

As a consequence it'll be a short one today, and potentially also for the next couple of days. Hopefully I'll be able to gain momentum just as soon as I'm back to my normal self.

Nevertheless, as yesterday we're continuing to work through the printer code to try to remove the window that's appearing when the print starts. Now that our examination has reached the sailfish-browser code I think we can start calling it a tab now. This is the code we ended up at last night:
quint32 DeclarativeWebPageCreator::createView(const quint32 &parentId,
    const uintptr_t &parentBrowsingContext)
{
    QPointer<DeclarativeWebPage> oldPage = m_activeWebPage;
    m_model->newTab(QString(), parentId, parentBrowsingContext);

    if (m_activeWebPage && oldPage != m_activeWebPage) {
        return m_activeWebPage->uniqueId();
    }
    return 0;
}
Now it's time to check out the m_model which we can see declared in the class header:
    QPointer<DeclarativeTabModel> m_model;
I've looked through the code a few times now and exactly what's happening is confusing me greatly. It looks very much like the parentBrowsingContext is passed in to createView() which passes it along to newTab(). This transition causes a name change to browsingContext and this value gets stored with the new tab:
    Tab tab;
    tab.setTabId(nextTabId());
    tab.setRequestedUrl(url);
    tab.setBrowsingContext(browsingContext);
    tab.setParentId(parentId);
There's no particular magic happening here:
void Tab::setBrowsingContext(uintptr_t browsingContext)
{
    Q_ASSERT_X(m_browsingContext == 0, Q_FUNC_INFO,
        "Browsing context can be set only once.");
    m_browsingContext = browsingContext;
}
My concern is that this is the same browsing context that's eventually going to be returned by nsWindowWatcher::OpenWindowInternal(). After we've gone all the way to sailfish-browser and back again the new browser context gets extracted like this:
    nsCOMPtr<nsIWebBrowserChrome> newChrome;
    rv = CreateChromeWindow(parentChrome, chromeFlags, openWindowInfo,
                            getter_AddRefs(newChrome));

    // Inside CreateChromeWindow
    RefPtr<BrowsingContext> parentBrowsingContext = aOpenWindowInfo->GetParent();
    Tab tab;
    tab.setBrowsingContext(parentBrowsingContext);

    // After CreateChromeWindow
    nsCOMPtr<nsIDocShellTreeItem> newDocShellItem = do_GetInterface(newChrome);
    RefPtr<BrowsingContext> newBC = newDocShellItem->GetBrowsingContext();
Note that the above isn't real code, it's just a summary of the steps that happen at different stages to give an idea of how the variables are moving around.

This is such a web. Somewhere the parentBrowsingContext going in has to touch the newBC coming out. It looks like the place may be in the EmbedLiteViewChild::InitGeckoWindow() method where we have this:
  RefPtr<BrowsingContext> browsingContext = BrowsingContext::CreateDetached
    (nullptr, parentBrowsingContext, nullptr, EmptyString(),
    BrowsingContext::Type::Content);
This seems to happen when an instance of EmbedLiteViewChild is created.

I'm going to try to short-circuit this by adding the following code to nsWindowWatcher::OpenWindowInternal() before all this happens.
  if (!newBC) {
    bool isForPrinting = openWindowInfo->GetIsForPrinting();
    if (isForPrinting) {
      RefPtr<BrowsingContext> parentBrowsingContext = openWindowInfo->GetParent();
      newBC = BrowsingContext::CreateDetached(nullptr, parentBrowsingContext,
        nullptr, EmptyString(), BrowsingContext::Type::Content);
    }
  }
Taking a look at this with fresh eyes in the morning will, I'm sure, help. Maybe I'll feel a little more cogent tomorrow. This is as far as I can take it today.

It's building now so it's a good time for me to pause. I'll aim to pick this up again in the morning.

16 Dec 2023 : Day 109 #
It's back to trying to hide the extraneous print window today. If you've been following over the last few days, you'll know that the "Save page to PDF" functionality is now working, but plagued by an errant window that insists on opening during the print.

The reason for the window needing to exist is completely clear: the page needs to be cloned into a browser context and the process of creating a browser context involves creating a window for it to live in. This wasn't a problem for ESR 78... I'm not exactly sure why and that's something I should look into.

But first I'm going to look at the code in WindowCreator.cpp that lives in the EmbedLite portion of the gecko project. Recall that we were working through this yesterday and it looked like this:
  if (isForPrinting) {
    return NS_OK;
  }

  mChild->CreateWindow(parentID, reinterpret_cast<uintptr_t>
    (parentBrowsingContext.get()), aChromeFlags, &createdID, aCancel);

  if (*aCancel) {
    return NS_OK;
  }

  nsresult rv(NS_OK);
  nsCOMPtr<nsIWebBrowserChrome> browser;
  nsCOMPtr<nsIThread> thread;
  NS_GetCurrentThread(getter_AddRefs(thread));
  while (!browser && NS_SUCCEEDED(rv)) {
    bool processedEvent;
    rv = thread->ProcessNextEvent(true, &processedEvent);
    if (NS_SUCCEEDED(rv) && !processedEvent) {
      rv = NS_ERROR_UNEXPECTED;
    }
    EmbedLiteViewChildIface* view = mChild->GetViewByID(createdID);
    if (view) {
      view->GetBrowserChrome(getter_AddRefs(browser));
    }
  }

  // check to make sure that we made a new window
  if (_retval) {
      NS_ADDREF(*_retval = browser);
      return NS_OK;
  }
I've included slightly more of the code today because the chunk in the middle is important. Just to dissect this a little, the first conditional return is code I added in the hope that it would avoid creation of the window. The problem with this is that it means _retval never gets set and this is the return value needed in order for the browser context to be created.

We can tell more than this though. In order for _retval to be set we need browser to be set and that will only happen if:
  1. Execution goes inside the while loop;
  2. and createdID has been set;
  3. which requires that CreateWindow() is called.
In summary, we can't avoid creating the window at this point. The fact that the while loop is waiting for the browser chrome to exist makes it look like we can't avoid creating the window at all.
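
The dependency described in the three steps above can be modelled with a toy event queue. This is a sketch under stated assumptions rather than the real thread machinery: ToyThread, Browser and WaitForBrowser() are hypothetical names, and unlike the real event loop the toy version reports an empty queue immediately rather than waiting for an event to arrive.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <queue>
#include <utility>

enum class Result { Ok, Unexpected };

// Hypothetical stand-in for the browser chrome belonging to a view.
struct Browser {
  uint32_t viewId;
};

// Hypothetical single-threaded event queue.
class ToyThread {
 public:
  void Dispatch(std::function<void()> event) {
    mEvents.push(std::move(event));
  }

  // Runs one queued event; returns false if the queue was empty.
  bool ProcessNextEvent() {
    if (mEvents.empty()) {
      return false;
    }
    std::function<void()> event = std::move(mEvents.front());
    mEvents.pop();
    event();
    return true;
  }

 private:
  std::queue<std::function<void()>> mEvents;
};

// Spins the event queue until the view with ID `createdId` shows up in
// `views`, mirroring the shape of the while loop in the snippet above.
Result WaitForBrowser(ToyThread& thread,
                      const std::map<uint32_t, Browser>& views,
                      uint32_t createdId, const Browser** out) {
  *out = nullptr;
  Result rv = Result::Ok;
  while (!*out && rv == Result::Ok) {
    if (!thread.ProcessNextEvent()) {
      rv = Result::Unexpected;  // nothing left that could create the view
    }
    auto it = views.find(createdId);
    if (it != views.end()) {
      *out = &it->second;
    }
  }
  return rv;
}
```

If nothing has been dispatched to create the view, the loop gives up with an error and the out-parameter stays null. Once an event that registers the view has been queued, which is the job CreateWindow() does in the real code, the loop picks it up on the next pass.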

But that doesn't mean we can't hide it, so let's pursue that goal for now.

Checking the WindowCreator.h header we can see that the mChild that handles the call to CreateWindow() is a class that implements the EmbedLiteAppChildIface interface. There's actually only one concrete class that does this which is EmbedLiteAppChild. All this does is send a CreateWindow message.

There are a few candidates for classes that might receive this. It could be EmbedLiteAppThreadParent, EmbedLiteAppProcessParent or ContentParent. Only the first two are part of the EmbedLite code and they both end up doing the same thing, which is making the following call:
  *createdID = mApp->CreateWindowRequested(chromeFlags, parentId, parentBrowsingContext);
In both cases mApp is an instance of EmbedLiteApp. I'm wondering why we have both of these. I'm wondering if one is used for the browser and the other is used for the WebView. Or maybe Sailfish OS only uses one of them. I must remember to investigate this further at some point.

Also worth noting is that the createdID returned by this call is exactly the value we're interested in. We need this to be set.

Let's continue on into EmbedLiteApp.

In this method there's a snippet of code that searches for the parent based on the parent ID and then calls this:
  uint32_t viewId = mListener ? mListener->CreateNewWindowRequested(
    chromeFlags, view, parentBrowsingContext) : 0;
Once again it's the return value that we need. The mListener is an instance of EmbedLiteAppListener, that's actually defined in the same header file as EmbedLiteApp. This is an interface and we need something to implement it. But there's nothing in gecko that does.

After some scrabbling around and scratching of my head the reason becomes clear: there's nothing that implements it in gecko because the class that implements it is in qtmozembed. Which means we've finally broken through to the sailfish-browser frontend. The implementation is in qmozcontext.cpp:
uint32_t QMozContextPrivate::CreateNewWindowRequested(const uint32_t &chromeFlags,
  EmbedLiteView *aParentView, const uintptr_t &parentBrowsingContext)
{
    Q_UNUSED(chromeFlags)

    uint32_t parentId = aParentView ? aParentView->GetUniqueID() : 0;
    qCDebug(lcEmbedLiteExt) << "QtMozEmbedContext new Window requested: parent:"
      << (void *)aParentView << parentId;
    uint32_t viewId = QMozContext::instance()->createView(parentId,
      parentBrowsingContext);
    return viewId;
}
Once again, it's the return value, viewId, that we're particularly concerned about. The call to createView() is bounced by the QMozContext instance (it's presumably a singleton) to mViewCreator which is an instance of QMozViewCreator which is an abstract class that has no implementation in qtmozembed.

And that's because the implementation comes from the sailfish-browser repository in the form of the DeclarativeWebPageCreator class.

Here's the implementation:
quint32 DeclarativeWebPageCreator::createView(const quint32 &parentId,
  const uintptr_t &parentBrowsingContext)
{
    QPointer<DeclarativeWebPage> oldPage = m_activeWebPage;
    m_model->newTab(QString(), parentId, parentBrowsingContext);

    if (m_activeWebPage && oldPage != m_activeWebPage) {
        return m_activeWebPage->uniqueId();
    }
    return 0;
}
Now it really feels like we're getting close! We're switching terminology from windows to tabs. Plus we have no more repositories to go in to. Somewhere around here we're going to have to start thinking about adding in some way to hide the window, if we're going to go down that route.
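
One detail of createView() looks relevant to any hiding strategy, although this is an inference from reading the code rather than something tested: the returned ID depends on the new page having become the active one. The sketch below models just that check. PageCreator and the `activate` flag are hypothetical, with the flag standing in for whatever newTab() does to activate the new page.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a web page with a unique ID.
struct Page {
  uint32_t id;
};

class PageCreator {
 public:
  // Creates a page and returns its ID only if it became the active page.
  // `activate` is a hypothetical knob; the real code has no such flag.
  uint32_t CreateView(bool activate) {
    const Page* oldPage = mActive;
    mPages.push_back(std::make_unique<Page>(Page{mNextId++}));
    if (activate) {
      mActive = mPages.back().get();
    }
    // Same shape as the check in createView(): a changed active page
    // means success, anything else is reported as 0.
    if (mActive && oldPage != mActive) {
      return mActive->id;
    }
    return 0;
  }

 private:
  uint32_t mNextId = 1;
  const Page* mActive = nullptr;
  std::vector<std::unique_ptr<Page>> mPages;
};
```

In this model a page that's created but never activated reports 0, even though it exists. If the real code behaves the same way, a hidden tab would need some other route for handing its ID back.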

Alright, my train is coming in to King's Cross and I have a busy evening tonight with my work Christmas Party, so this is likely to be all I have time for today. It feels like we made good progress though and are reaching a point where I might be able to start making changes to the code to actually fix the issue. That's always the goal!

15 Dec 2023 : Day 108 #
This morning: a successful build. That's not really so surprising given the minimal changes I made yesterday, but I've messed up smaller pieces of code before, so you can never be sure.

So, packages built, installed and run, and what do we have? In order to get the debug output I have to set the MOZ_LOG environment variable to include "EmbedLite:5" so that the LOGE() messages will show. Here's what happens when I press the "Save web page as PDF" option:
$ EMBED_CONSOLE=1 MOZ_LOG="EmbedLite:5" sailfish-browser
[...]
[Parent 15259: Unnamed thread 7060002670]: E/EmbedLite FUNC::virtual nsresult 
    mozilla::embedlite::EmbedLiteAppChild::Observe(nsISupports*, const char*,
    const char16_t*):68 topic:embed:download
[Parent 15259: Unnamed thread 7060002670]: W/EmbedLite ERROR:
    EmbedLite::virtual nsresult WindowCreator::CreateChromeWindow
    (nsIWebBrowserChrome*, uint32_t, nsIOpenWindowInfo*, bool*,
    nsIWebBrowserChrome**):61 PRINT: isForPrinting: 1
EmbedliteDownloadManager error: [Exception... "The request is not allowed."
    nsresult: "0x80530021 (NS_ERROR_DOM_NOT_ALLOWED_ERR)"  location:
    "JS frame :: resource://gre/modules/DownloadCore.jsm ::
    DownloadError :: line 1755"  data: no]
[Parent 15259: Unnamed thread 7060002670]: E/EmbedLite FUNC::virtual nsresult mozilla::embedlite::EmbedLiteAppChild::Observe(nsISupports*, const char*,
    const char16_t*):68 topic:embed:download
JavaScript error: , line 0: uncaught exception: Object
CONSOLE message:
[JavaScript Error: "uncaught exception: Object"]
That's a bit messy, but if you look through carefully it's possible to see the output PRINT: isForPrinting: 1. That's a good sign: it shows that in WindowCreator::CreateChromeWindow() we can find out whether this is a window that needs to be hidden or not.

But the rest is less encouraging. Apart from the logging I also added an early return to the method to prevent the window from actually being created. I got the method to return NS_OK in the hope that whatever happens further down the stack might not notice. But from this debug output we can see that it did notice, throwing an exception "The request is not allowed."

From DOMException.h we can see that this NS_ERROR_DOM_NOT_ALLOWED_ERR that we're getting is equivalent to NotAllowedError which appears once in DownloadCore.jsm. However in this particular instance it's just some code conditioned on the error. What we need is some code that's actually generating the error. Looking through the rest of the code, it all looks a bit peculiar: this error is usually triggered by an authentication failure, which doesn't fit with what we're doing here at all.

There are only a few places where it seems to be used for other purposes. One of them is the StyleSheet::InsertRuleIntoGroup() method where it seems to be caused by a failed attempt to modify a group.
nsresult StyleSheet::InsertRuleIntoGroup(const nsACString& aRule,
                                         css::GroupRule* aGroup,
                                         uint32_t aIndex) {
  NS_ASSERTION(IsComplete(), "No inserting into an incomplete sheet!");
  // check that the group actually belongs to this sheet!
  if (this != aGroup->GetStyleSheet()) {
    return NS_ERROR_INVALID_ARG;
  }

  if (IsReadOnly()) {
    return NS_OK;
  }

  if (ModificationDisallowed()) {
    return NS_ERROR_DOM_NOT_ALLOWED_ERR;
  }
[...]
I'm running sailfish-browser through the debugger with a break point on this method to see whether this is where it's coming from. If it is, I'm not sure what that will tell us.

But the breakpoint isn't triggered, so it's not this bit of code anyway. After grepping the code a bit more and sifting carefully through various files, I eventually realise that there's an instance of this error that could potentially be generated directly after our call to OpenInternal() in nsGlobalWindowOuter.cpp:
[...]
      aError = OpenInternal(u""_ns, u""_ns, u""_ns,
                            false,             // aDialog
                            false,             // aContentModal
                            true,              // aCalledNoScript
                            false,             // aDoJSFixups
                            true,              // aNavigate
                            nullptr, nullptr,  // No args
                            nullptr,           // aLoadState
                            false,             // aForceNoOpener
                            printKind, getter_AddRefs(bc));
      if (NS_WARN_IF(aError.Failed())) {
        return nullptr;
      }
    }
    if (!bc) {
      aError.ThrowNotAllowedError("No browsing context");
      return nullptr;
    }
That looks like a far more promising case. The debugger won't let me put a breakpoint directly on the line that's throwing the exception here, but it will let me put one on the OpenInternal() call, so I can set that and step through to check whether this error is the one causing the output.
(gdb) break nsGlobalWindowOuter.cpp:5329
Breakpoint 4 at 0x7fba96fad0: file dom/base/nsGlobalWindowOuter.cpp, line 5329.
(gdb) c
Continuing.
[LWP 17314 exited]
[Parent 16702: Unnamed thread 7f88002670]: E/EmbedLite FUNC::virtual nsresult mozilla::embedlite::EmbedLiteAppChild::Observe(nsISupports*, const char*,
    const char16_t*):68 topic:embed:download
[Switching to LWP 16938]

Thread 8 "GeckoWorkerThre" hit Breakpoint 4, nsGlobalWindowOuter::Print
    (nsIPrintSettings*, nsIWebProgressListener*, nsIDocShell*,
    nsGlobalWindowOuter::IsPreview, nsGlobalWindowOuter::IsForWindowDotPrint,
    std::function<void (mozilla::dom::PrintPreviewResultInfo const&)>&&,
    mozilla::ErrorResult&) (this=this@entry=0x7f88564870,
    aPrintSettings=aPrintSettings@entry=0x7e352ed060,
    aListener=aListener@entry=0x7f89bfa6b0, 
    aDocShellToCloneInto=aDocShellToCloneInto@entry=0x0,
    aIsPreview=aIsPreview@entry=nsGlobalWindowOuter::IsPreview::No, 
    aForWindowDotPrint=aForWindowDotPrint@entry=nsGlobalWindowOuter::
    IsForWindowDotPrint::No, aPrintPreviewCallback=..., aError=...)
    at dom/base/nsGlobalWindowOuter.cpp:5329
5329          aError = OpenInternal(u""_ns, u""_ns, u""_ns,
(gdb) n
[Parent 16702: Unnamed thread 7f88002670]: W/EmbedLite ERROR:
    EmbedLite::virtual nsresult WindowCreator::CreateChromeWindow
    (nsIWebBrowserChrome*, uint32_t, nsIOpenWindowInfo*, bool*,
    nsIWebBrowserChrome**):61 PRINT: isForPrinting: 1
5329          aError = OpenInternal(u""_ns, u""_ns, u""_ns,
(gdb) n
30      ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsError.h:
    No such file or directory.
(gdb) n
5343        if (!bc) {
(gdb) p bc
$1 = {mRawPtr = 0x0}
(gdb) n
5344          aError.ThrowNotAllowedError("No browsing context");
(gdb) n
5345          return nullptr;
(gdb) 
So that's the one. It's also clear from this why the error is happening: by not creating the window we're obviously also causing the creation of the browser context bc to fail.

The obvious follow-up question is why the lack of window is preventing the browsing context from getting created. We'll have to follow the metaphorical rabbit down the rabbit hole to find out. So the context is returned as the last parameter of the OpenInternal() call:
nsresult nsGlobalWindowOuter::OpenInternal(
    const nsAString& aUrl, const nsAString& aName, const nsAString& aOptions,
    bool aDialog, bool aContentModal, bool aCalledNoScript, bool aDoJSFixups,
    bool aNavigate, nsIArray* argv, nsISupports* aExtraArgument,
    nsDocShellLoadState* aLoadState, bool aForceNoOpener, PrintKind aPrintKind,
    BrowsingContext** aReturn)
In practice it's the domReturn variable inside this method that interests us. This is set as the last parameter of OpenWindow2() called inside this method:
      rv = pwwatch->OpenWindow2(this, url, name, options,
                                /* aCalledFromScript = */ true, aDialog,
                                aNavigate, argv, isPopupSpamWindow,
                                forceNoOpener, forceNoReferrer, wwPrintKind,
                                aLoadState, getter_AddRefs(domReturn));
And then this comes back from the call to OpenWindowInternal() that's being called from inside this method, again as the last parameter:
  return OpenWindowInternal(aParent, aUrl, aName, aFeatures, aCalledFromScript,
                            dialog, aNavigate, argv, aIsPopupSpam,
                            aForceNoOpener, aForceNoReferrer, aPrintKind,
                            aLoadState, aResult);
In this method the variable we're interested in is newBC which ends up turning in to the returned browser context value. Now this doesn't get directly returned by the next level. Instead there's some code that looks like this:
      /* We can give the window creator some hints. The only hint at this time
         is whether the opening window is in a situation that's likely to mean
         this is an unrequested popup window we're creating. However we're not
         completely honest: we clear that indicator if the opener is chrome, so
         that the downstream consumer can treat the indicator to mean simply
         that the new window is subject to popup control. */
      rv = CreateChromeWindow(parentChrome, chromeFlags, openWindowInfo,
                              getter_AddRefs(newChrome));
      if (parentTopInnerWindow) {
        parentTopInnerWindow->Resume();
      }

      if (newChrome) {
        /* It might be a chrome AppWindow, in which case it won't have
            an nsIDOMWindow (primary content shell). But in that case, it'll
            be able to hand over an nsIDocShellTreeItem directly. */
        nsCOMPtr<nsPIDOMWindowOuter> newWindow(do_GetInterface(newChrome));
        nsCOMPtr<nsIDocShellTreeItem> newDocShellItem;
        if (newWindow) {
          newDocShellItem = newWindow->GetDocShell();
        }
        if (!newDocShellItem) {
          newDocShellItem = do_GetInterface(newChrome);
        }
        if (!newDocShellItem) {
          rv = NS_ERROR_FAILURE;
        }
        newBC = newDocShellItem->GetBrowsingContext();
      }
Looking at this code, the most likely explanation for newBC being null is that newChrome is being returned as null from the CreateChromeWindow() call. So let's follow this lead into CreateChromeWindow(). Now we're interested in the newWindowChrome variable and we have some code that looks like this:
  bool cancel = false;
  nsCOMPtr<nsIWebBrowserChrome> newWindowChrome;
  nsresult rv = mWindowCreator->CreateChromeWindow(
      aParentChrome, aChromeFlags, aOpenWindowInfo, &cancel,
      getter_AddRefs(newWindowChrome));

  if (NS_SUCCEEDED(rv) && cancel) {
    newWindowChrome = nullptr;
    return NS_ERROR_ABORT;
  }

  newWindowChrome.forget(aResult);
The mWindowCreator->CreateChromeWindow() call there is important, because that's the line calling the method which we've hacked around with. I carefully arranged things so that the method would leave NS_SUCCEEDED(rv) as true, so it must be the last parameter which is returning null.

So finally we reach high enough up the stack that we're in the EmbedLite code, and the reason for the null return is immediately clear from looking at the WindowCreator::CreateChromeWindow() implementation. In this method it's the _retval variable that's of interest and the code I added causes the method to return before it gets set.
  if (isForPrinting) {
    return NS_OK;
  }

  mChild->CreateWindow(parentID, reinterpret_cast<uintptr_t>
    (parentBrowsingContext.get()), aChromeFlags, &createdID, aCancel);
[...]
  // check to make sure that we made a new window
  if (_retval) {
      NS_ADDREF(*_retval = browser);
      return NS_OK;
  }
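The whole chain can be condensed into a small model, which shows why returning a success code wasn't enough to fool the callers. This is a sketch with hypothetical types and simplified signatures, not the real gecko functions: each layer reports success, but the out-parameter is never filled in, so the null check at the top is what fires.

```cpp
#include <cassert>
#include <memory>
#include <string>

enum class Status { Ok, Failure };

// Hypothetical stand-ins for the browsing context and window chrome.
struct BrowsingContext {};

struct Chrome {
  std::shared_ptr<BrowsingContext> bc = std::make_shared<BrowsingContext>();
};

// Innermost layer: models the early return added to the window creator.
// It reports success but leaves the out-parameter untouched.
Status CreateChromeWindow(bool isForPrinting,
                          std::shared_ptr<Chrome>& outChrome) {
  if (isForPrinting) {
    return Status::Ok;
  }
  outChrome = std::make_shared<Chrome>();
  return Status::Ok;
}

// Middle layer: only derives a browsing context if chrome came back.
Status OpenWindow(bool isForPrinting,
                  std::shared_ptr<BrowsingContext>& outBC) {
  std::shared_ptr<Chrome> chrome;
  Status rv = CreateChromeWindow(isForPrinting, chrome);
  if (chrome) {
    outBC = chrome->bc;
  }
  return rv;
}

// Outermost layer: the status check passes, so the null check is the one
// that reports the problem. Returns an empty string on success.
std::string Print(bool isForPrinting) {
  std::shared_ptr<BrowsingContext> bc;
  if (OpenWindow(isForPrinting, bc) != Status::Ok) {
    return "open failed";
  }
  if (!bc) {
    return "No browsing context";
  }
  return "";
}
```

In this model Print(true) comes back with "No browsing context", matching the error seen in the debugger, while Print(false) succeeds.
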
We're going to need a better way to solve this. Unfortunately that won't happen this evening as I have a very early start tomorrow, so I'm going to have to leave it there for today. Still, this will be a good place — with something tangible — to pick up from in the morning.

14 Dec 2023 : Day 107 #
We're still splashing around in the docks looking at printing today, hoping to set sail. Yesterday it actually felt like we made pretty good progress getting the print promise to act as expected, so that the user interface works as it should. The remaining issue is that there's a blank window opening every time we print. The window closes once the print is complete, but it's a messy experience for the end user.

Yesterday it was possible to narrow down the code that's triggering the window to open. There's a call to nsGlobalWindowOuter::OpenInternal() in nsGlobalWindowOuter::Print() that eventually leads to a call to nsWindowWatcher::OpenWindowInternal(), which seems to be doing most of the work.

What I'm interested in today is the link between this and the Qt code in sailfish-browser that handles the windows (or "tabs" in sailfish-browser parlance) and actually creates the window on screen.

There's a parameter passed in to nsWindowWatcher::OpenWindowInternal() which indicates that the window is being created for the purposes of printing: the PrintKind aPrintKind parameter. If this parameter can be accessed from the Sailfish-specific part of the code then it may just be possible to persuade sailfish-browser to open the window "in the background" so that the user doesn't know it's there.

While I'm looking into this I'll also be trying to figure out whether we can avoid the call to open the window altogether. Everything up to nsWindowWatcher::OpenWindowInternal() in the call stack is essential, because it's there that the browser context is created and that's a bit we definitely need. We need a browser context to clone the document into. But the actual window chrome being displayed on screen? Hopefully that part can be skipped.

I've placed a breakpoint on nsWindowWatcher::OpenWindowInternal() and plan to see where that takes us.

Once the breakpoint hits and after stepping through most of the method I eventually get to this:
939             parentTopInnerWindow->Suspend();
(gdb) 
[LWP 30328 exited]
948           rv = CreateChromeWindow(parentChrome, chromeFlags, openWindowInfo,
(gdb) n
359     ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:
        No such file or directory.
(gdb) 
[W] unknown:0 - bool DBWorker::execute(QSqlQuery&) failed execute query
[W] unknown:0 - "INSERT INTO tab (tab_id, tab_history_id) VALUES (?,?);"
[W] unknown:0 - QSqlError("19", "Unable to fetch row",
                          "UNIQUE constraint failed: tab.tab_id")
[LWP 30138 exited]
[LWP 30313 exited]
[LWP 30314 exited]
[LWP 30123 exited]
950           if (parentTopInnerWindow) {
(gdb) 
That group of LWP messages ("lightweight processes", or threads as they are otherwise known) appears as a result of the window opening. So it's clearly the CreateChromeWindow() call that's triggering the window to open. The errors that follow it could well be coming from sailfish-browser rather than the gecko library, but I'm not sure whether they're errors to worry about, or just artefacts of everything being slowed down due to debugging. I don't recall having seen them when running the code normally.

Let's follow this code a bit more. The nsWindowWatcher::CreateChromeWindow() method is mercifully short. The active ingredient of the method is this bit here:
  nsCOMPtr<nsIWebBrowserChrome> newWindowChrome;
  nsresult rv = mWindowCreator->CreateChromeWindow(
      aParentChrome, aChromeFlags, aOpenWindowInfo, &cancel,
      getter_AddRefs(newWindowChrome));
The mWindowCreator variable is an instance of nsIWindowCreator so the next step is to find out what that is. Stepping through gives us a clue.
419       nsCOMPtr<nsIWebBrowserChrome> newWindowChrome;
(gdb) 
1363    ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:
        No such file or directory.
(gdb) s
WindowCreator::CreateChromeWindow (this=0x7f889ce190, aParent=0x7f88ba5450,
    aChromeFlags=4094, aOpenWindowInfo=0x7f8854a7b0, aCancel=0x7f9f3d019f, 
    _retval=0x7f9f3d01a0) at mobile/sailfishos/utils/WindowCreator.cpp:44
44        NS_ENSURE_ARG_POINTER(_retval);
(gdb) 
So we've finally reached some Sailfish-specific code. If there's some way to check whether this is a print window or not, it may be possible to stop the window being shown at this point. There is this aOpenWindowInfo object being passed in which is of type nsIOpenWindowInfo. Checking the nsIOpenWindowInfo.idl file we can see that there is a relevant attribute that's part of the object:
  /** Whether this is a window opened for printing */
  [infallible]
  readonly attribute boolean isForPrinting;
Disappointingly the object refuses to yield its contents using the debugger.
(gdb) p aOpenWindowInfo
$3 = (nsIOpenWindowInfo *) 0x7f8854a7b0
(gdb) p *aOpenWindowInfo
$4 = {<nsISupports> = {_vptr.nsISupports = 0x7fbf7dfc00
      <vtable for nsOpenWindowInfo+16>}, <No data fields>}
(gdb) p aOpenWindowInfo->GetIsForPrinting()
Cannot evaluate function -- may be inlined
(gdb) 
Never mind, let's continue digging down into the code. So from here the method calls mChild->CreateWindow() which sends the stack down a rabbit hole of different calls which I've not yet followed to the end. However I do notice that the aOpenWindowInfo object doesn't go any further. So if the info about this being a print window needs extracting, it has to be done here.

I'm going to put some debug prints in here, but I'll also amend the code to cancel the window opening at this point, on condition that the window is a print window. Then I'll have to build the library to see how that's worked out.

Here's the small piece of code I've added, just before the window is created (which you can see on the last line):
  bool isForPrinting = aOpenWindowInfo->GetIsForPrinting();
  LOGE("PRINT: isForPrinting: %d", isForPrinting);

  if (isForPrinting) {
    return NS_OK;
  }

  mChild->CreateWindow(parentID, reinterpret_cast<uintptr_t>
    (parentBrowsingContext.get()), aChromeFlags, &createdID, aCancel);
I've set it building, which may take a little time, so I'm going to take a break from this while it does. I'll return with some results, hopefully, tomorrow.

13 Dec 2023 : Day 106 #
We've not quite left the printing harbour, but we're getting closer to sailing. Yesterday we got to the point where it was possible to print a copy of the page, but there were some glitches: a new blank tab is opened when the print starts, and the user interface doesn't recognise when the print finishes.

I'm hoping these might be easy to fix, but at this point, without further investigation, that's impossible to tell.

So, let's get stuck in to getting things seaworthy.

First up, I've amended the print() method to assume it returns a promise rather than the progress and status change messages that the previous code assumed. I've also added some debug prints so that it now looks like this:
    this._browsingContext = BrowsingContext.getFromWindow(win)

    try {
      dump("PRINT: printing\n");
      this._browsingContext.print(printSettings)
        .then(() => dump("PRINT: Printing finished\n"))
        .catch(exception => dump("PRINT: Printing exception: " + exception + "\n"));
    } finally {
      // Remove the print object to avoid leaks
      this._browsingContext = null;
      dump("PRINT: finally\n");
    }

    dump("PRINT: returning\n");
    let fileInfo = await OS.File.stat(targetPath);
    aSetProgressBytesFn(fileInfo.size, fileInfo.size, false);
When I execute this the following output is generated:
PRINT: printing
PRINT: finally
PRINT: returning
JSScript: ContextMenuHandler.js loaded
JSScript: SelectionPrototype.js loaded
JSScript: SelectionHandler.js loaded
JSScript: SelectAsyncHelper.js loaded
JSScript: FormAssistant.js loaded
JSScript: InputMethodHandler.js loaded
EmbedHelper init called
Available locales: en-US, fi, ru
Frame script: embedhelper.js loaded
PRINT: Printing finished
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteChromeManager.js, line 170: TypeError: chromeListener is undefined
onWindowClosed@file:///usr/lib64/mozembedlite/components/EmbedLiteChromeManager.js:170:7
observe@file:///usr/lib64/mozembedlite/components/EmbedLiteChromeManager.js:201:12
That all looks pretty healthy. The promise is clearly being returned and eventually resolved. There are also positive results in the user interface: now that this is working correctly, the window that opens at the start of the printing process also closes at the end of it. The menu item is no longer greyed out.

So probably the code that was there before was failing because of the incorrect semantics, which also caused the print process not to complete cleanly. One small issue is that the menu item doesn't get greyed out at all now. It should be greyed out while the printing is taking place.

My suspicion is that this is because the function is returning immediately, leaving the promise to run asynchronously, rather than waiting for the promise to resolve before returning. To test the theory out I've updated the code inside the try clause to look like this:
      dump("PRINT: printing\n");
      await new Promise((resolve, reject) => {
        this._browsingContext.print(printSettings)
        .then(() => {
          dump("PRINT: Printing finished\n")
          resolve();
        })
        .catch(exception => {
          dump("PRINT: Printing exception: " + exception + "\n");
          reject(new DownloadError({ result: exception, inferCause: true }));
        });
      });
Now when it's run we see the following output:
PRINT: printing
JSScript: ContextMenuHandler.js loaded
JSScript: SelectionPrototype.js loaded
JSScript: SelectionHandler.js loaded
JSScript: SelectAsyncHelper.js loaded
JSScript: FormAssistant.js loaded
JSScript: InputMethodHandler.js loaded
EmbedHelper init called
Available locales: en-US, fi, ru
Frame script: embedhelper.js loaded
PRINT: Printing finished
PRINT: finally
PRINT: returning
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteChromeManager.js, line 170: TypeError: chromeListener is undefined
onWindowClosed@file:///usr/lib64/mozembedlite/components/EmbedLiteChromeManager.js:170:7
observe@file:///usr/lib64/mozembedlite/components/EmbedLiteChromeManager.js:201:12
That's pretty similar to the output from last time, but notice how the "finally" and "returning" output now waits for the printing to finish before appearing. That's because of the await we've added to the promise. But the good news is that this also produces better results with the user interface too: the menu item is greyed out until the printing finishes, at which point it's restored. The remorse timer and notifications work better too.
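The change in ordering can be reproduced in isolation. What follows is a minimal sketch, not the DownloadCore.jsm code: fakePrint() is a hypothetical stand-in for this._browsingContext.print() that simply resolves a little later, and the log array takes the place of the dump() calls.

```javascript
// Hypothetical stand-in for this._browsingContext.print(): resolves later.
function fakePrint() {
  return new Promise(resolve => setTimeout(resolve, 10));
}

// First version: the promise is started but not awaited.
async function printWithoutAwait(log) {
  try {
    log.push("printing");
    fakePrint().then(() => log.push("finished"));
  } finally {
    log.push("finally");
  }
  log.push("returning");
}

// Second version: awaiting stops "finally" from running early.
async function printWithAwait(log) {
  try {
    log.push("printing");
    await fakePrint();
    log.push("finished");
  } finally {
    log.push("finally");
  }
  log.push("returning");
}

async function demo() {
  const first = [];
  await printWithoutAwait(first);
  // Give the un-awaited promise time to settle before comparing.
  await new Promise(resolve => setTimeout(resolve, 50));
  const second = [];
  await printWithAwait(second);
  return { first, second };
}

demo().then(({ first, second }) => {
  console.log(first.join(" > "));
  console.log(second.join(" > "));
});
```

In the first log "finished" arrives after "returning"; in the second it arrives before "finally", matching the two console outputs above. The explicit new Promise wrapper used in the real code has the same ordering effect, and additionally converts a rejection into a DownloadError.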

So it now appears to be completing cleanly with good results. That means the only thing now to try to fix is the new window that's opening.

One of the nice things about the fact the printing is working is that I can now debug it properly and exclusively on the ESR 91 side to see what's happening (in the C++ portions at least).

My first breakpoints go on the various ShowPrintDialog() methods. This could potentially be the source of the extra window. But when I run the executable and trigger a PDF save the breakpoint isn't hit. So I guess it's not. Instead the SetupSilentPrinting() method is being called, following the explanation in nsIPrintSettings.idl:
  /**
   * We call this function so that anything that requires a run of the event loop
   * can do so safely. The print dialog runs the event loop but in silent printing
   * that doesn't happen.
   *
   * Either this or ShowPrintDialog (but not both) MUST be called by the print
   * engine before printing, otherwise printing can fail on some platforms.
   */
  [noscript] void SetupSilentPrinting();
The backtrace when this hits looks like this:
Thread 8 "GeckoWorkerThre" hit Breakpoint 2,
    nsPrintSettingsQt::SetupSilentPrinting (this=0x7e2c255760)
    at widget/qt/nsPrintSettingsQt.cpp:383
383         return NS_OK;
(gdb) bt
#0  nsPrintSettingsQt::SetupSilentPrinting (this=0x7e2c255760)
    at widget/qt/nsPrintSettingsQt.cpp:383
#1  0x0000007fbc3bb4e4 in nsPrintJob::DoCommonPrint (this=this@entry=0x7e2c14ba30, aIsPrintPreview=aIsPrintPreview@entry=false, 
    aPrintSettings=aPrintSettings@entry=0x7e2c255760,
    aWebProgressListener=aWebProgressListener@entry=0x7f8a0902b0,
    aDoc=aDoc@entry=0x7f8abf4040)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:869
#2  0x0000007fbc3bb718 in nsPrintJob::CommonPrint (this=this@entry=0x7e2c14ba30,
    aIsPrintPreview=aIsPrintPreview@entry=false, 
    aPrintSettings=aPrintSettings@entry=0x7e2c255760,
    aWebProgressListener=aWebProgressListener@entry=0x7f8a0902b0, 
    aSourceDoc=aSourceDoc@entry=0x7f8abf4040)
    at layout/printing/nsPrintJob.cpp:488
#3  0x0000007fbc3bb840 in nsPrintJob::Print (this=this@entry=0x7e2c14ba30,
    aSourceDoc=<optimized out>, aPrintSettings=aPrintSettings@entry=0x7e2c255760, 
    aWebProgressListener=aWebProgressListener@entry=0x7f8a0902b0)
    at layout/printing/nsPrintJob.cpp:824
#4  0x0000007fbc108fe4 in nsDocumentViewer::Print (this=0x7e2ffabd80,
    aPrintSettings=0x7e2c255760, aWebProgressListener=0x7f8a0902b0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#5  0x0000007fba96f49c in nsGlobalWindowOuter::Print(nsIPrintSettings*,
    nsIWebProgressListener*, nsIDocShell*, nsGlobalWindowOuter::IsPreview,
    nsGlobalWindowOuter::IsForWindowDotPrint, std::function<void
    (mozilla::dom::PrintPreviewResultInfo const&)>&&, mozilla::ErrorResult&)
    (this=this@entry=0x7f88b85de0,
    aPrintSettings=aPrintSettings@entry=0x7e2c255760,
    aListener=aListener@entry=0x7f8a0902b0,
    aDocShellToCloneInto=aDocShellToCloneInto@entry=0x0, 
    aIsPreview=aIsPreview@entry=nsGlobalWindowOuter::IsPreview::No, 
    aForWindowDotPrint=aForWindowDotPrint@entry=nsGlobalWindowOuter::
    IsForWindowDotPrint::No, aPrintPreviewCallback=..., aError=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:859
#6  0x0000007fbc7eb714 in mozilla::dom::CanonicalBrowsingContext::Print
    (this=this@entry=0x7f88b85870, aPrintSettings=0x7e2c255760, aRv=...)
    at cross/aarch64-meego-linux-gnu/include/c++/8.3.0/bits/std_function.h:402
#7  0x0000007fbab82f08 in mozilla::dom::CanonicalBrowsingContext_Binding::print
    (args=..., void_self=0x7f88b85870, obj=..., cx_=0x7f881df460)
    at BrowsingContextBinding.cpp:4674
#8  mozilla::dom::CanonicalBrowsingContext_Binding::print_promiseWrapper
    (cx=0x7f881df460, obj=..., void_self=0x7f88b85870, args=...)
    at BrowsingContextBinding.cpp:4688
[...]
#34 0x0000007fbd16635c in js::jit::MaybeEnterJit (cx=0x7f881df460, state=...)
    at js/src/jit/Jit.cpp:207
#35 0x0000007f882b8211 in ?? ()
Backtrace stopped: Cannot access memory at address 0x228ea24c1e0c
(gdb) 
The thing I find interesting about this is that it's entering the nsPrintSettingsQt version of the SetupSilentPrinting() method. That's Sailfish-specific code coming from patch 002 "Bring back Qt Layer" and so worth taking a look at.

Unfortunately after looking through it carefully the nsPrintSettingsQt implementation doesn't yield any secrets. The SetupSilentPrinting() method is essentially empty, which isn't out of line with the GTK or default implementations. I don't see anything else in there that shouts "open new window"; it just looks like a largely passive class for capturing settings.

Nevertheless this callstack can still be useful for us. I notice that, even though the app is still stuck on this SetupSilentPrinting() breakpoint, the new blank window has already opened — hanging in a state of suspended animation — on my phone. We can also see the call to CanonicalBrowsingContext::Print() is item six in the stack.

That means that the trigger for opening the window must be somewhere between these two points in the callstack. My next task will be to work my way through all of them to see if one of them could be the culprit. Six methods to work through isn't too many. Here's a slightly cleaner version of the stack to work with:
  1. nsPrintSettingsQt::SetupSilentPrinting() file nsPrintSettingsQt.cpp line 383
  2. nsPrintJob::DoCommonPrint() file nsPrintJob.cpp line 768
  3. nsPrintJob::CommonPrint() file nsPrintJob.cpp line 488
  4. nsPrintJob::Print() file nsPrintJob.cpp line 824
  5. nsDocumentViewer::Print() file nsDocumentViewer.cpp line 2930
  6. nsGlobalWindowOuter::Print() file nsGlobalWindowOuter.cpp line 5412
  7. CanonicalBrowsingContext::Print() file CanonicalBrowsingContext.cpp line 682
Inside the nsGlobalWindowOuter::Print() method, between the start of the method and the call to nsDocumentViewer::Print() on line 5412, I see the following bit of code:
      aError = OpenInternal(u""_ns, u""_ns, u""_ns,
                            false,             // aDialog
                            false,             // aContentModal
                            true,              // aCalledNoScript
                            false,             // aDoJSFixups
                            true,              // aNavigate
                            nullptr, nullptr,  // No args
                            nullptr,           // aLoadState
                            false,             // aForceNoOpener
                            printKind, getter_AddRefs(bc));
I'm wondering whether that might be the opening of a window; all the signs are that it is. I've placed a breakpoint on nsGlobalWindowOuter::Print() and plan to step through the method to this point to try to find out.

As I step through, the moment I step over this OpenInternal() call, the window opens in the browser. The printKind parameter is set to PrintKind::InternalPrint which makes me think that the window should be hidden, or something to that effect. Here's the debugging step-through for anyone interested:
Thread 8 "GeckoWorkerThre" hit Breakpoint 2,
    nsGlobalWindowOuter::Print(nsIPrintSettings*, nsIWebProgressListener*,
    nsIDocShell*, nsGlobalWindowOuter::IsPreview,
    nsGlobalWindowOuter::IsForWindowDotPrint, std::function<void
    (mozilla::dom::PrintPreviewResultInfo const&)>&&, mozilla::ErrorResult&) (
    this=this@entry=0x7f888d23e0,
    aPrintSettings=aPrintSettings@entry=0x7e2f2f0ce0,
    aListener=aListener@entry=0x7ea013c4a0, 
    aDocShellToCloneInto=aDocShellToCloneInto@entry=0x0,
    aIsPreview=aIsPreview@entry=nsGlobalWindowOuter::IsPreview::No, 
    aForWindowDotPrint=aForWindowDotPrint@entry=nsGlobalWindowOuter::
    IsForWindowDotPrint::No, aPrintPreviewCallback=..., aError=...)
    at dom/base/nsGlobalWindowOuter.cpp:5258
5258        PrintPreviewResolver&& aPrintPreviewCallback, ErrorResult& aError) {
(gdb) n
[LWP 8816 exited]
5261          do_GetService("@mozilla.org/gfx/printsettings-service;1");
(gdb) n
867     ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h: No such file or directory.
(gdb) n
5268      nsCOMPtr<nsIPrintSettings> ps = aPrintSettings;
(gdb) n
867     ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h: No such file or directory.
(gdb) n
5274      RefPtr<Document> docToPrint = mDoc;
(gdb) n
5280      RefPtr<BrowsingContext> sourceBC = docToPrint->GetBrowsingContext();
(gdb) n
5287      nsAutoSyncOperation sync(docToPrint, SyncOperationBehavior::eAllowInput);
(gdb) n
5288      AutoModalState modalState(*this);
(gdb) n
5290      nsCOMPtr<nsIContentViewer> cv;
(gdb) p modalState
$1 = {mModalStateWin = {mRawPtr = 0x7f888d23e0}}
(gdb) n
5291      RefPtr<BrowsingContext> bc;
(gdb) n
5292      bool hasPrintCallbacks = false;
(gdb) n
5293      if (docToPrint->IsStaticDocument() &&
(gdb) n
5320        if (aDocShellToCloneInto) {
(gdb) p aDocShellToCloneInto
$2 = (nsIDocShell *) 0x0
(gdb) n
5325          AutoNoJSAPI nojsapi;
(gdb) n
5326          auto printKind = aForWindowDotPrint == IsForWindowDotPrint::Yes
(gdb) n
5329          aError = OpenInternal(u""_ns, u""_ns, u""_ns,
(gdb) p printKind
$3 = nsGlobalWindowOuter::PrintKind::InternalPrint
(gdb) n
[LWP 8736 exited]
[LWP 8737 exited]
[LWP 8711 exited]
[New LWP 9166]
[New LWP 9167]
5329          aError = OpenInternal(u""_ns, u""_ns, u""_ns,
(gdb) p aError
$4 = (mozilla::ErrorResult &) @0x7f9f3d8dc0: {<mozilla::binding_danger::
    TErrorResult<mozilla::binding_danger::AssertAndSuppressCleanupPolicy>> =
    {mResult = nsresult::NS_OK, mExtra = {mMessage = 0x41e,
    mJSException = {asBits_ = 1054}, mDOMExceptionInfo = 0x41e}},
    <No data fields>}
(gdb) 
Looking through the nsGlobalWindowOuter code eventually leads me to the nsWindowWatcher::OpenWindowInternal() method. There's a lot happening in this method. About halfway through there are some comments about visibility, which pique my interest because I'm wondering whether the window that's opening has a visibility flag which is either not being set properly, or being ignored.

But I've reached the end of my mental capacity today, it's time for bed. So I'll have to come back to this tomorrow.

12 Dec 2023 : Day 105 #
We're back looking at printing today after about a week of work now. Yesterday we looked at the parent-child structure of the PBrowser interface and came to the conclusion that we should be calling CanonicalBrowsingContext::Print() in the DownloadPDFSaver code rather than the nsIWebBrowserPrint::Print() call that's there now. It'd be a good thing to try at least. Our route in to the call is through windowRef which is a Ci.nsIDOMWindow. So the question I want to answer today is: "given a Ci.nsIDOMWindow, how do I find my way to calling something in a CanonicalBrowsingContext object?"

It's a pretty simple question, but as is often the case with object-oriented code, finding a route from one to the other is not always obvious. It's obfuscated by the class hierarchy and child-parent message-passing, and made even more complex by gecko's approach to reflection using nsISupports runtime type discovery.

I'll need to look through this code carefully again.

[...]

I've been poring over the code for some time now and reading around the BrowsingContext documentation, but I still haven't made any real headway. The one thing I did find was that it's possible to collect the browsing context from the DOM window:
      browsingContext = BrowsingContext.getFromWindow(domWin);
This was taken from Prompt.jsm which executes some code similar to the above. That's making use of the following static call in the BrowsingContext.h header:
  static already_AddRefed<BrowsingContext> GetFromWindow(
      WindowProxyHolder& aProxy);
As far as I can tell BrowsingContext isn't pulled into the JavaScript as any sort of prototype or object. It's just there already.

There's also this potentially useful static function for getting the CanonicalBrowsingContext from an ID:
  static already_AddRefed<CanonicalBrowsingContext> Get(uint64_t aId);
Unfortunately I'm not really sure where I'm supposed to get the ID from. It's added into a static hash table and if I had a BrowsingContext already I could get the ID, but without that first, it's not clear where I might extract it from.
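The chicken-and-egg problem here can be sketched with a toy registry. This is purely illustrative: ToyBrowsingContext and contextsById are hypothetical names, and the real table lives in C++ and may well be organised differently.

```javascript
// Toy registry keyed by ID. Names are hypothetical and only mirror the idea
// of a static table of contexts described above.
const contextsById = new Map();
let nextId = 1;

class ToyBrowsingContext {
  constructor(label) {
    this.id = nextId++;
    this.label = label;
    contextsById.set(this.id, this);
  }

  // Static lookup, analogous in spirit to a Get(aId) call.
  static get(id) {
    return contextsById.get(id) || null;
  }
}

const bc = new ToyBrowsingContext("tab");
console.log(ToyBrowsingContext.get(bc.id) === bc);
console.log(ToyBrowsingContext.get(9999));
```

The lookup only helps if the ID is already known, and in this sketch the only place to read the ID from is the object itself, which is exactly the difficulty.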

So I'm going to try going down the BrowsingContext.getFromWindow() route. If I'm going to use this I've already got a good idea about where it should go, which is in the DownloadPDFSaver code.

So I've added some debug prints to the DownloadPDFSaver in DownloadCore.jsm to try to figure out if we can extract the BrowsingContext from the windowRef using this method. Here's what I added:
    dump("PRINT: win: " + win + "\n");
    this._webBrowserPrint = win.getInterface(Ci.nsIWebBrowserPrint);
    dump("PRINT: webBrowserPrint: " + this._webBrowserPrint + "\n");
    this._browsingContext = BrowsingContext.getFromWindow(win)
    dump("PRINT: BrowsingContext: " + this._browsingContext + "\n");
I've not changed any of the functional code though, so I'm not expecting this to fix the segfault; this is just to extract some hints. Here's what it outputs to the console when I try running this and selecting the "Save web page as PDF" option:
PRINT: win: [object Window]
PRINT: webBrowserPrint: [xpconnect wrapped nsIWebBrowserPrint]
PRINT: BrowsingContext: [object CanonicalBrowsingContext]
Segmentation fault (core dumped)
This is... well it's pretty exciting for me if I'm honest. That last print output suggests that it's successfully extracted some kind of CanonicalBrowsingContext object, which is exactly what we're after. So the next step is to call the print() method on it to see what happens.

Having added that code and selected the option to save as PDF, there's now some rather strange and dubious looking output sent to the console. It's the same output that we see when the browser starts:
PRINT: win: [object Window]
PRINT: webBrowserPrint: [xpconnect wrapped nsIWebBrowserPrint]
PRINT: BrowsingContext: [object CanonicalBrowsingContext]
JSScript: ContextMenuHandler.js loaded
JSScript: SelectionPrototype.js loaded
JSScript: SelectionHandler.js loaded
JSScript: SelectAsyncHelper.js loaded
JSScript: FormAssistant.js loaded
JSScript: InputMethodHandler.js loaded
EmbedHelper init called
Available locales: en-US, fi, ru
Frame script: embedhelper.js loaded
[...]
On the other hand, looking into the downloads folder, there's a new PDF output that looks encouragingly non-empty. I wonder what it will contain?
$ cd Downloads/
$ ls -l
total 4624
-rw------- 1 defaultuser defaultuser       0 Dec  7 21:38 'Jolla(10).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  8 23:36 'Jolla(11).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  8 23:47 'Jolla(12).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec 10 21:50 'Jolla(13).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec 10 21:53 'Jolla(14).pdf'
-rw-rw-r-- 1 defaultuser defaultuser 4673253 Dec 10 21:57 'Jolla(15).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  5 22:17 'Jolla(2).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  5 22:17 'Jolla(3).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  6 08:23 'Jolla(4).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  6 08:27 'Jolla(5).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  6 22:32 'Jolla(6).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  6 23:05 'Jolla(7).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  7 19:23 'Jolla(8).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  7 21:24 'Jolla(9).pdf'
-rw------- 1 defaultuser defaultuser       0 Dec  5 22:16  Jolla.pdf
That 4673253 byte file that's been output must have something interesting inside it, surely?
 
Four screenshots showing the printing process: first the Jolla webpage in the browser; second a blank window during printing; third the output PDF with similar graphics on; fourth the browser menu with saving to PDF disabled

Well as you can see the actual PDF print out is a bit rubbish, but I don't think that's anything I've done: that's just the inherent difficulty of providing decent PDF printouts of dynamic webpages. This is actually exactly the PDF output we need.

I admit I'm pretty happy about this. All that reading of documentation and scattered code seems to have paid off. There is a slight problem though, in that the process seems to open a new window when the printing takes place. That's not ideal and will have to be fixed.

Also, having printed out a page, the "Save web page as PDF" option is now completely greyed out in the user interface. It's not possible to print another page. That feels more like the consequence of a promise not resolving, or some completion message not being received, rather than anything more intrinsic though.

I've done some brief testing of other functionality in the browser. Nothing else seems to be broken and the browser didn't crash. So that's also rather encouraging.

I'm going to call it a night: finish on a high. There's still plenty of work to be done with the PDF printing: prevent the extra window from opening; ensure the option to print is restored once the printing is complete. But those will have to be for tomorrow. The fact saving to PDF is working is a win already.

11 Dec 2023 : Day 104 #
We're back on printing again today. Yesterday we tracked down the nsGlobalWindowOuter::Print() method which appears to be responsible for cloning the document ready for printing. That, in turn, appears to be called by BrowserChild::RecvPrint(). And this method deserves some explanation.

We've discussed the gecko IPC mechanism in previous posts, in fact way back on Day 7. If you read those then... well, first, kudos for keeping up! But second you'll recall there are ipdl files which define an interface and that the build process generates parent and child interfaces from these that allow message passing from one to the other.

Whenever we see a SendName() or RecvName() method like this RecvPrint() method we have here, it's a good sign that this is related to this message passing. The Send method is called on one thread and the message, which is a bit like a remote method call, triggers the Recv method to be called on a (potentially different) thread. That's my rather basic understanding of it, at any rate.

The message, along with the sending and receiving mechanisms, are generated by the build process in the form of a file that has a P at the start of the name. I'm not exactly sure what the P stands for. The sender tends to be called the parent actor, and the receiver the child actor. That's why we're seeing RecvPrint() in the BrowserChild class.
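As a rough mental model only, the Send/Recv pairing can be sketched as two objects sharing a message queue. This is not how the generated code is written: ToyChannel, ToyBrowserParent and ToyBrowserChild are hypothetical names that loosely echo the real ones, and the queue stands in for whatever transport gecko actually uses.

```javascript
// Toy model of an actor pair. Names are illustrative only and do not
// correspond to the real generated parent and child classes.
class ToyChannel {
  constructor() {
    this.queue = [];
    this.receiver = null;
  }

  post(name, args) {
    this.queue.push({ name, args });
  }

  // Deliver queued messages, as the receiving side's event loop would.
  flush() {
    while (this.queue.length > 0) {
      const { name, args } = this.queue.shift();
      this.receiver["recv" + name](...args);
    }
  }
}

class ToyBrowserParent {
  constructor(channel) {
    this.channel = channel;
  }

  // sendPrint() only enqueues a message; it doesn't do the printing itself.
  sendPrint(browsingContextId, printData) {
    this.channel.post("Print", [browsingContextId, printData]);
    return true;
  }
}

class ToyBrowserChild {
  constructor(channel) {
    this.printed = [];
    channel.receiver = this;
  }

  // recvPrint() runs later, when the message is delivered.
  recvPrint(browsingContextId, printData) {
    this.printed.push({ browsingContextId, printData });
  }
}

const channel = new ToyChannel();
const parentActor = new ToyBrowserParent(channel);
const childActor = new ToyBrowserChild(channel);

parentActor.sendPrint(7, { toFile: true });
console.log(childActor.printed.length); // nothing delivered yet
channel.flush();
console.log(childActor.printed.length); // the child has handled the message
```

The useful property to take from this is the decoupling: calling the Send side returns straight away, and the Recv side does the work whenever the message arrives.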

If we look at the class definition for BrowserChild in the header file it looks like this:
/**
 * BrowserChild implements the child actor part of the PBrowser protocol. See
 * PBrowser for more information.
 */
class BrowserChild final : public nsMessageManagerScriptExecutor,
                           public ipc::MessageManagerCallback,
                           public PBrowserChild,
                           public nsIWebBrowserChrome,
                           public nsIEmbeddingSiteWindow,
                           public nsIWebBrowserChromeFocus,
                           public nsIInterfaceRequestor,
                           public nsIWindowProvider,
                           public nsSupportsWeakReference,
                           public nsIBrowserChild,
                           public nsIObserver,
                           public nsIWebProgressListener2,
                           public TabContext,
                           public nsITooltipListener,
                           public mozilla::ipc::IShmemAllocator {
[...]
Notice how it inherits from the PBrowserChild class. That's the child actor class interface that's autogenerated from the PBrowser.ipdl file. If we look in the PBrowser.ipdl file we can see the definition of the Print() method we're interested in:
    /**
     * Tell the child to print the current page with the given settings.
     *
     * @param aBrowsingContext the browsing context to print.
     * @param aPrintData the serialized settings to print with
     */
    async Print(MaybeDiscardedBrowsingContext aBC, PrintData aPrintData);
That's not quite the end of it though, because — to round things off — there's also a PBrowserParent class that's been generated from the IPDL file as well. Like the child class it has both a header and a source file. In the source file we can find the definition for the SendPrint() method like this:
auto PBrowserParent::SendPrint(
        const MaybeDiscardedBrowsingContext& aBC,
        const PrintData& aPrintData) -> bool
{
All of this is inherited by the BrowserParent class in the BrowserParent.h file like this:
/**
 * BrowserParent implements the parent actor part of the PBrowser protocol. See
 * PBrowser for more information.
 */
class BrowserParent final : public PBrowserParent,
                            public nsIDOMEventListener,
                            public nsIAuthPromptProvider,
                            public nsSupportsWeakReference,
                            public TabContext,
                            public LiveResizeListener {
[...]
This class doesn't override the SendPrint() method but it does inherit it. So there's quite a structure and class hierarchy that's built up from these IPDL files.

The key takeaway for what we're trying to achieve is that if we want to trigger the nsGlobalWindowOuter::Print() method, we're going to need to call the BrowserParent::SendPrint() method from somewhere. Checking through the code it's clear that nothing is inheriting from BrowserParent but there are plenty of places which give access to the BrowserParent interface.

For example the BrowserBridgeParent class has this method:
  BrowserParent* GetBrowserParent() { return mBrowserParent; }
There are quite a few other similar methods scattered around the place and I honestly don't know which I'm supposed to end up using.
$ grep -rIn "BrowserParent\* Get" * --include="*.h"
layout/base/PresShell.h:181:
    static dom::BrowserParent* GetCapturingRemoteTarget() {
docshell/base/CanonicalBrowsingContext.h:222:
    BrowserParent* GetBrowserParent() const;
dom/base/nsFrameLoader.h:338:
    BrowserParent* GetBrowserParent() const;
dom/base/PointerLockManager.h:38:
    static dom::BrowserParent* GetLockedRemoteTarget();
dom/base/nsContentUtils.h:501:
    static mozilla::dom::BrowserParent* GetCommonBrowserParentAncestor(
dom/ipc/BrowserHost.h:54:
    BrowserParent* GetActor() { return mRoot; }
dom/ipc/WindowGlobalParent.h:105:
    BrowserParent* GetBrowserParent();
dom/ipc/BrowserParent.h:114:
    static BrowserParent* GetFocused();
dom/ipc/BrowserParent.h:116:
    static BrowserParent* GetLastMouseRemoteTarget();
dom/ipc/BrowserParent.h:118:
    static BrowserParent* GetFrom(nsFrameLoader* aFrameLoader);
dom/ipc/BrowserParent.h:120:
    static BrowserParent* GetFrom(PBrowserParent* aBrowserParent);
dom/ipc/BrowserParent.h:122:
    static BrowserParent* GetFrom(nsIContent* aContent);
dom/ipc/BrowserParent.h:124:
    static BrowserParent* GetBrowserParentFromLayersId(
dom/ipc/BrowserBridgeParent.h:43:
    BrowserParent* GetBrowserParent() { return mBrowserParent; }
dom/events/TextComposition.h:82:
    BrowserParent* GetBrowserParent() const { return mBrowserParent; }
dom/events/IMEStateManager.h:55:
    static BrowserParent* GetActiveBrowserParent() {
dom/events/PointerEventHandler.h:95:
    static dom::BrowserParent* GetPointerCapturingRemoteTarget(
dom/events/EventStateManager.h:1050:
    dom::BrowserParent* GetCrossProcessTarget();
Hopefully this will all become clear in due course.

I'm also interested to discover that the CanonicalBrowsingContext class has a Print() method that calls SendPrint():
already_AddRefed<Promise> CanonicalBrowsingContext::Print(
    nsIPrintSettings* aPrintSettings, ErrorResult& aRv) {
  RefPtr<Promise> promise = Promise::Create(GetIncumbentGlobal(), aRv);
  if (NS_WARN_IF(aRv.Failed())) {
    return promise.forget();
  }
[...]

  auto* browserParent = GetBrowserParent();
  if (NS_WARN_IF(!browserParent)) {
    promise->MaybeReject(ErrorResult(NS_ERROR_FAILURE));
    return promise.forget();
  }

  RefPtr<embedding::PrintingParent> printingParent =
      browserParent->Manager()->GetPrintingParent();

  embedding::PrintData printData;
  nsresult rv = printingParent->SerializeAndEnsureRemotePrintJob(
      aPrintSettings, listener, nullptr, &printData);
  if (NS_WARN_IF(NS_FAILED(rv))) {
    promise->MaybeReject(ErrorResult(rv));
    return promise.forget();
  }

  if (NS_WARN_IF(!browserParent->SendPrint(this, printData))) {
    promise->MaybeReject(ErrorResult(NS_ERROR_FAILURE));
  }
  return promise.forget();
#endif
}
Maybe we're going to either have to call this or do something similar. Let's head back to the code we're already using to do the printing. Recall that this lives in DownloadCore.jsm in the form of our newly added DownloadPDFSaver class.

Now that I'm comparing the two there are some aspects that I think are worth taking note of. The nsIWebBrowserPrint::Print() method that we're currently calling in DownloadPDFSaver takes in an object that implements the nsIPrintSettings interface and returns a promise. From the code I listed above for CanonicalBrowsingContext::Print() you'll notice that this also takes in an object that implements the nsIPrintSettings interface and returns a promise.

So switching to call the latter may require only minimal changes. The question then is where to get the CanonicalBrowsingContext from. The method we're currently using is hanging off of a windowRef:
    let win = this.download.source.windowRef.get();
[...]
    this._webBrowserPrint = win.getInterface(Ci.nsIWebBrowserPrint);
Because it's JavaScript there's absolutely no indication from this of what type win is, of course. I'm going to need to know in order to make progress.

Digging back through the source in DownloadCore.jsm I can see that windowRef gets set to something that implements Ci.nsIDOMWindow:
DownloadSource.fromSerializable = function(aSerializable) {
[...]
  } else if (aSerializable instanceof Ci.nsIDOMWindow) {
    source.url = aSerializable.location.href;
    source.isPrivate = PrivateBrowsingUtils.isContentWindowPrivate(
      aSerializable
    );
    source.windowRef = Cu.getWeakReference(aSerializable);
[...]
I've spent a lot of time analysing the code today without actually making any changes at all to the code itself. I'm keen to actually try some things out in the JavaScript to see whether I can extract something useful from the windowRef that will allow us to call the SendPrint() method that we're so interested in. But that will have to wait for tomorrow now.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
10 Dec 2023 : Day 103 #
Yesterday and over the last few days we've been looking into printing. There were some errors that needed fixing in the JavaScript and now we're digging down into the C++ code. Something has changed in the neath and the print code is expecting the page to have been "cloned" when it hasn't been. That's where we've got up to. But figuring out where the cloning is supposed to happen is proving to be difficult. It's been made harder by the fact that placing breakpoints on the print code causes the debugger to crash — apparently a bug in our version of gdb — and so we don't have anything to compare against.

So my plan is to read through the print code once again. The answer must be in there somewhere.

One thing I did manage to establish using the debugger is that the nsPrintObject::InitAsRootObject() method is being called in the ESR 91 code. That could turn out to be useful because, although not in the ESR 91 version, in the ESR 78 version this is where the clone appears to take place:
nsresult nsPrintObject::InitAsRootObject(nsIDocShell* aDocShell, Document* aDoc,
                                         bool aForPrintPreview) {
[...]
  mDocument = aDoc->CreateStaticClone(mDocShell);
[...]
  return NS_OK;
}

If we look at the history of the file it may give some hints about where this clone was moved to. I need to do a git blame on a line that I can see has changed between the two.

It turns out that git blame isn't too helpful because the change appears to have mostly deleted lines rather than added or changed them. Unfortunately git blame simply doesn't work very well for deleted lines. I want to use what I think of as reverse git blame, which is using git log with the -S parameter. This searches the diff of every commit for a particular string, which will include deleted items.

Here's what comes up as the first hit:
$ git log -1 -S "CreateStaticClone" layout/printing/nsPrintObject.cpp
commit 044b3c4332134ac0c94d4916458f9930d5091c6a
Author: Emilio Cobos Álvarez <emilio@crisal.io>
Date:   Tue Aug 25 17:45:12 2020 +0000

    Bug 1636728 - Centralize printing entry points in nsGlobalWindowOuter, and
    move cloning out of nsPrintJob. r=jwatt,geckoview-reviewers,smaug,agi
    
    This centralizes our print and preview setup in nsGlobalWindowOuter so
    that we never re-clone a clone, and so that we reuse the window.open()
    codepath to create the browsing context to clone into.
    
    For window.print, for both old print dialog / silent printing and new
    print preview UI, we now create a hidden browser (as in with visibility:
    collapse, which takes no space but still gets a layout box).
    
     * In the modern UI case, this browser is swapped with the actual print
       preview clone, and the UI takes care of removing the browser.
    
     * In the print dialog / silent printing case, the printing code calls
       window.close() from nsDocumentViewer::OnDonePrinting().
    
     * We don't need to care about the old print preview UI for this case
       because it can't be open from window.print().
    
    We need to fall back to an actual window when there's no
    nsIBrowserDOMWindow around for WPT print tests and the like, which don't
    have one. That seems fine, we could special-case this code path more if
    needed but it doesn't seem worth it.
    
    Differential Revision: https://phabricator.services.mozilla.com/D87063
"Move cloning out of nsPrintJob". That sounds relevant. A lot of the changes are happening inside nsGlobalWindowOuter.cpp as the commit message suggests. And sure enough, right in there is a brand new nsGlobalWindowOuter::Print() method which now performs the cloning. It's a very long method which I'll have to read through in full, but here's the active ingredient in relation to cloning:
Nullable<WindowProxyHolder> nsGlobalWindowOuter::Print(
    nsIPrintSettings* aPrintSettings, nsIWebProgressListener* aListener,
    nsIDocShell* aDocShellToCloneInto, IsPreview aIsPreview,
    IsForWindowDotPrint aForWindowDotPrint,
    PrintPreviewResolver&& aPrintPreviewCallback, ErrorResult& aError) {
#ifdef NS_PRINTING
[...]
  if (docToPrint->IsStaticDocument() &&
      (aIsPreview == IsPreview::Yes ||
       StaticPrefs::print_tab_modal_enabled())) {
[...]
  } else {
[...]
    nsAutoScriptBlocker blockScripts;
    RefPtr<Document> clone = docToPrint->CreateStaticClone(
        cloneDocShell, cv, ps, &hasPrintCallbacks);
    if (!clone) {
      aError.ThrowNotSupportedError("Clone operation for printing failed");
      return nullptr;
    }
  }
[...]
  return WindowProxyHolder(std::move(bc));
#else
  return nullptr;
#endif  // NS_PRINTING
}
The next obvious thing we should check is whether this method is getting called at all. My guess is not, in which case it would be nice to know how we're supposed to be calling it, and maybe we can figure that out by comparing the diff in the above commit with a callstack we get from the debugger.

First things first: does it get called?
$ EMBED_CONSOLE=1 gdb sailfish-browser
(gdb) b nsGlobalWindowOuter::Print
(gdb) info break
Num Type       Disp Enb Address            What
1   breakpoint keep y   0x0000007fba96f28c in nsGlobalWindowOuter::Print
                                           (nsIPrintSettings*,
                                           nsIWebProgressListener*, nsIDocShell*,
                                           nsGlobalWindowOuter::IsPreview,
                                           nsGlobalWindowOuter::IsForWindowDotPrint,
                                           std::function<void (mozilla::dom::
                                           PrintPreviewResultInfo const&)>&&,
                                           mozilla::ErrorResult&) 
                                           at dom/base/nsGlobalWindowOuter.cpp:5258
(gdb) r
[...]
Thread 8 "GeckoWorkerThre" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 1018]
nsPrintJob::FindFocusedDocument (this=this@entry=0x7e31d9f2e0,
    aDoc=aDoc@entry=0x7f89079430)
    at layout/printing/nsPrintJob.cpp:2411
2411      nsPIDOMWindowOuter* window = aDoc->GetOriginalDocument()->GetWindow();
(gdb) 
So that's a segfault before a breakpoint: the code isn't being executed. The next step then is to try to find out where it gets executed in the changes of the upstream commit.

It looks like it might happen in the BrowserChild::RecvPrint() method:
mozilla::ipc::IPCResult BrowserChild::RecvPrint(const uint64_t& aOuterWindowID,
                                                const PrintData& aPrintData);
This goes on to call nsGlobalWindowOuter::Print() like this:
    outerWindow->Print(printSettings,
                       /* aListener = */ nullptr,
                       /* aWindowToCloneInto = */ nullptr,
                       nsGlobalWindowOuter::IsPreview::No,
                       nsGlobalWindowOuter::IsForWindowDotPrint::No,
                       /* aPrintPreviewCallback = */ nullptr, rv);
This is also not getting executed and I'm wondering whether maybe it should be.

I have to be honest, I'm not really sure which direction this is going to take tomorrow. What I am fairly sure about is that spending a night asleep won't make things any less clear! So until tomorrow it is.
