flypig.co.uk

List items

Items from the current list are shown below.

Blog

24 Aug 2024 : Day 329 #
Yesterday I achieved two semi-wins. The first was applying patch 0038 "Fix mesa egl display and buffer initialisation". The second was implementing a code-only fix to replace LD_PRELOAD. I say semi-wins and that's because neither achieved quite what I had hoped. The patch was supposed to help fix the Wayland EGL crash, which it didn't do. But the patch is needed to allow the browser to work on native devices, so I'm hoping it'll be helpful in achieving that at least. So a semi-win. The code-only fix does remove the need to use LD_PRELOAD which circumvents the Wayland EGL crash, but I'm not convinced it fixes the underlying issue. It's a hackish workaround rather than a real fix. So also a semi-win. I'd love to properly understand what's going on, but I'll have to leave the deep understanding to Frajo and Raine.

There is one outstanding issue that I'm still keen to address. It's been a bit of a thorn that I've not been able to fix, which is the hang occurring when the cover view switches from a gecko screenshot to the "no tabs" cover. At present I have this "no tabs" cover disabled. This is a shame, but as Rob (rob_kouw) highlights on the forum, it's not the end of the world if this can't be fixed. But still, it's annoying that the fix has eluded me.

I thought I'd take another look. Recent changes have caused the behaviour to change. Instead of the browser hanging it now crashes. On the face of it this sounds like a sideways step rather than a move forwards, but for me this is a very positive development. This change is likely due to the dynamic library loading changes that are now in place.

The benefit of having a crash rather than a hang is that we can now get a clean backtrace at the point of failure. Here's what we get:
sailfish-browser: ../src/wayland-client.c:2339: wl_proxy_set_queue: Assertion 
    `proxy->display == queue->display' failed.

Thread 67 "QSGRenderThread" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fb58dd830 (LWP 15988)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50        return ret;
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000007fef285d20 in __GI_abort () at abort.c:79
#2  0x0000007fef294c98 in __assert_fail_base (fmt=0x7fef3a5718 "%s%s%s:%u: 
    %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7fef11d990 "proxy->display == 
    queue->display", file=file@entry=0x7fef11d6d0 "../src/
    wayland-client.c", 
    line=line@entry=2339, function=function@entry=0x7fef11de80 
    "wl_proxy_set_queue") at assert.c:92
#3  0x0000007fef294cfc in __GI___assert_fail (assertion=0x7fef11d990 
    "proxy->display == queue->display" , file=0x7fef11d6d0 "../
    src/wayland-client.c", 
    line=2339, function=0x7fef11de80 "wl_proxy_set_queue") at 
    assert.c:101
#4  0x0000007fef119d18 in wl_proxy_set_queue () from /usr/lib64/
    libwayland-client.so.0
#5  0x0000007ff7fa27e0 in WaylandNativeWindow::finishSwap() () from /usr/lib64/
    libhybris/eglplatform_wayland.so
#6  0x0000007feed23dc4 in _my_eglSwapBuffersWithDamageEXT () from /usr/lib64/
    libEGL.so.1
#7  0x0000007fe7424be0 in ?? () from /usr/lib64/qt5/plugins/
    wayland-graphics-integration-client/libwayland-egl.so
#8  0x0000007ff09827f8 in ?? () from /usr/lib64/libQt5Quick.so.5
#9  0x0000007ff098363c in ?? () from /usr/lib64/libQt5Quick.so.5
#10 0x0000007fef7d96a8 in ?? () from /usr/lib64/libQt5Core.so.5
#11 0x0000007fef6b8b98 in start_thread (arg=0x7fb58dd130) at pthread_create.c:
    479
#12 0x0000007fef3497cc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) 
This is really fascinating stuff. The crash is happening in Wayland, which isn't a total surprise because this is happening in the QML code rather than the browser code. QML should never bring an app down in such an unclean way, so it was always clear that this was going to be quite a low-level failure.

What's more, the crash is happening because of an assertion failure, so it's a controlled crash. That's something that's worth investigating further. Something else that jumps out is that the crash seems to be triggered by a call to _my_eglSwapBuffersWithDamageEXT(). That's also of interest because eglSwapBuffersWithDamageEXT() has been a source of trouble in the past. We actually have a patch — patch 0047 "Drop swap_buffers_with_damage extension support" — that tries to address something similar by disabling the use of eglSwapBuffersWithDamageEXT() in the gecko code.

I applied this patch back on Day 83, but in ESR 91 the way EGL extensions are disabled has changed compared to ESR 78, so this is code I had to make changes to. In particular, the display-related EGL calls (of which this is one) have moved from GLLibraryEGL and into EglDisplay. The split makes sense structurally, but resulted in code being moved around, including the code for disabling extensions, which had to be split into two pieces. Maybe I messed something up when I updated the 0047 patch to overlay these changes?

So I want to make sure that the extensions are properly disabled. There's only one method that makes use of eglSwapBuffersWithDamage() directly and that's the GLContextEGL::SwapBuffers() method. Just to be clear, this isn't what's causing the crash, but I'd like to use it as a means of checking that the call isn't being made here. Once I'm done with this I'll also add a breakpoint to fSwapBuffersWithDamage() but I'm not expecting that to throw up any surprises.

Here's the SwapBuffers() method. As you can see, the call to fSwapBuffersWithDamage() only happens if at least one of the two relevant extensions hasn't been disabled (that's a few too many confusing negatives for a single sentence to bear, but the code should make this clear, even if I've struggled to successfully describe it!).
bool GLContextEGL::SwapBuffers() {
  EGLSurface surface =
      mSurfaceOverride != EGL_NO_SURFACE ? mSurfaceOverride : mSurface;
  if (surface) {
    if ((mEgl->IsExtensionSupported(
             EGLExtension::EXT_swap_buffers_with_damage) ||
         mEgl->IsExtensionSupported(
             EGLExtension::KHR_swap_buffers_with_damage))) {
      std::vector<EGLint> rects;
      for (auto iter = mDamageRegion.RectIter(); !iter.Done(); iter.Next()) {
        const IntRect& r = iter.Get();
        rects.push_back(r.X());
        rects.push_back(r.Y());
        rects.push_back(r.Width());
        rects.push_back(r.Height());
      }
      mDamageRegion.SetEmpty();
      return mEgl->fSwapBuffersWithDamage(surface, rects.data(),
                                          rects.size() / 4);
    }
    return mEgl->fSwapBuffers(surface);
  } else {
    return false;
  }
}
So the plan is to breakpoint on this method. If I get a hit I'll step through to ensure that the fSwapBuffersWithDamage() section is skipped. Assuming it is I can then add a breakpoint inside the block just to make sure it never gets entered. Here are the results of this:
$ gdb sailfish-browser
[...]
(gdb) break GLContextEGL::SwapBuffers
Breakpoint 2 at 0x7ff2358b74: file gfx/gl/GLContextProviderEGL.cpp, line 538.
(gdb) c
Continuing.
[...]
Thread 39 &quot;Compositor&quot; hit Breakpoint 2, mozilla::gl::GLContextEGL::
    SwapBuffers (this=0x7ed01a8020) at gfx/gl/GLContextProviderEGL.cpp:538
538     bool GLContextEGL::SwapBuffers() {
(gdb) n
539       EGLSurface surface =
(gdb) n
541       if (surface) {
(gdb) n
542         if ((mEgl->IsExtensionSupported(
(gdb) n
558         return mEgl->fSwapBuffers(surface);
Okay, so the block is correctly skipped. That's positive. Now I'm going to add a breakpoint inside the block just to ensure this never changes:
(gdb) b GLContextProviderEGL.cpp:708
Breakpoint 5 at 0x7ff237c934: file gfx/gl/GLContextProviderEGL.cpp, line 709.
(gdb) info break
Num     Type           Disp Enb Address            What
2       breakpoint     keep y   0x0000007ff2358b74 in mozilla::gl::GLContextEGL:
    :SwapBuffers() 
                                                   at gfx/gl/
    GLContextProviderEGL.cpp:538
        breakpoint already hit 1 time
5       breakpoint     keep y   0x0000007ff237c934 in mozilla::gl::GLContextEGL:
    :CreateGLContext(std::shared_ptr<mozilla::gl::EglDisplay>, mozilla::gl::
    GLContextDesc const&, void*, void*, bool, nsTSubstring<char>*) at gfx/gl/
    GLContextProviderEGL.cpp:709
(gdb) delete break 2
(gdb) c
Continuing.
[...]
There are no hits of this new breakpoint. So the extensions are certainly being correctly disabled, which is a good thing. I'm just going to quickly check there are no other rogue calls to fSwapBuffersWithDamage() by adding a breakpoint to it directly. There should be no hits:
(gdb) delete break
(gdb) break fSwapBuffersWithDamage
Breakpoint 6 at 0x7ff2358d3c: file gfx/gl/GLLibraryEGL.h, line 448.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/sailfish-browser 
[...]
I'm able to open a Website and browse its pages without any hits. So that's looking good; it means I can rule it out as being the cause of the issue. So I'm going to switch to a different approach and take a look at the Wayland code where the assertion is being triggered. Here's the method in question, taken from the Sailfish mirror of the Wayland repository:
WL_EXPORT void
wl_proxy_set_queue(struct wl_proxy *proxy, struct wl_event_queue *queue)
{
	pthread_mutex_lock(&proxy->display->mutex);

	wl_list_remove(&proxy->queue_link);

	if (queue) {
		assert(proxy->display == queue->display);
		proxy->queue = queue;
	} else {
		proxy->queue = &proxy->display->default_queue;
	}

	wl_list_insert(&proxy->queue->proxy_list, &proxy->queue_link);

	pthread_mutex_unlock(&proxy->display->mutex);
}
The assertion that's failing is the one requiring proxy->display == queue->display to be true. What does this mean? The details aren't entirely clear to me, but the display value being checked is likely an EGL display value. It looks like there are two different values in use when there should be only one.

That fits with some of the things I noticed when trying to fix what I now know to be the Wayland dynamic library issue. I noted that there were displays with value both 0x00 and 0x01. I'm fairly sure it should just be the latter. If that's the case, then getting the code to always use 0x01 would be a worthwhile goal.

I feel like I've collected enough information about this, but I don't want to dwell on it further. I can add what I've learnt to the bug report and return to it at a later date. In terms of priorities, I think this should be low down the list. Getting packages available for others to install should be a higher priority, as should tidying up all these changes to turn them into patches.

So that's my plan for the next couple of days. First is get some install instructions written up. Second is moving on to "tidying up" mode. These will be the most fruitful next steps.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comments

Uncover Disqus comments