flypig.co.uk

22 Aug 2024 : Day 327
Before getting into development, Leif-Jöran Olsson shared this beautiful "half verse" (his description!) on the topic of travelling to Jersey. It captures my trip back in a few pithy words and I really can't tell you how much I enjoy receiving these. Thank you ljo!
 
Flypig just caught the Jersey(town) ferry, it is not a cold day in August like he said it would be. A case of gone was all he carried ... General wakeup call at 5 am. Might the egl build be ready? Fingers crossed.

Apart from my overnight ferry build, if you've been following along you may also recall that I recently updated one of my dev devices to Sailfish OS 4.6, so that I could debug the failing ESR 91 build on it. Previously I'd been using the device to run ESR 78 for comparison purposes. Because it wasn't my main development device I was using an Xperia 10 II, with my Xperia 10 III as my main dev device running the latest ESR 91 build.

After doing my best using the Xperia 10 II for debugging the ESR 91 code I eventually got fed up of waiting for the debugger. The 10 II isn't a bad device, but it's now getting on a bit (released in 2020) and it just doesn't have the raw power of the Xperia 10 III. This really shows itself when debugging code; I found myself waiting ten minutes or more for gdb to print out a member variable of a class instance.

So this morning I decided to upgrade my Xperia 10 III dev device to Sailfish OS 4.6 as well. This is a bit of a bold move because it means I'll have no devices on which ESR 91 will actually run. It works nicely on Sailfish OS 4.5 but I've yet to find the reason for it failing on 4.6. I really do need the extra oomph of the 10 III though. It'll motivate me (if I wasn't already motivated enough) to get this issue fixed as soon as possible.

And I really do want to get it fixed. This is potentially the last issue I need to deal with before moving on to the "tidying up" phase of this whole process. It'd be a real weight off my shoulders if it were working.

Having performed the upgrade, copied over and installed all of the packages, I'm now ready to actually benefit from that additional debugging power. As I left things yesterday it looked like there was an issue occurring in the CompositorBridgeParent::NewCompositor() method. There were two reasons I felt the problem might be happening there. First, when stepping through the code, the segmentation fault happened mid-execution of the method.

That in itself isn't a guarantee of it being the source of the problem because the crash is happening in a different thread. Still, the two could be related. But there's a second reason as well, and that's that it looks like the compositor isn't being successfully created. If that's true, it would be a prime candidate for a big, potentially browser-crashing, problem that needs fixing.

Here's the code that we're stepping through for reference (abridged to aid clarity):
RefPtr<Compositor> CompositorBridgeParent::NewCompositor(
    const nsTArray<LayersBackend>& aBackendHints) {
  for (size_t i = 0; i < aBackendHints.Length(); ++i) {
    RefPtr<Compositor> compositor;
    if (aBackendHints[i] == LayersBackend::LAYERS_OPENGL) {
      compositor =
          new CompositorOGL(this, mWidget, mEGLSurfaceSize.width,
                            mEGLSurfaceSize.height, mUseExternalSurfaceSize);
    } else if (aBackendHints[i] == LayersBackend::LAYERS_BASIC) {
      compositor = new BasicCompositor(this, mWidget);
    }
    nsCString failureReason;

    const int max_fb_size = 32767;
    const LayoutDeviceIntSize size = mWidget->GetClientSize();
    if (size.width > max_fb_size || size.height > max_fb_size) {
      failureReason = "FEATURE_FAILURE_MAX_FRAMEBUFFER_SIZE";
      return nullptr;
    }

    if (compositor && compositor->Initialize(&failureReason)) {
      if (failureReason.IsEmpty()) {
        failureReason = "SUCCESS";
      }

      // should only report success here
      if (aBackendHints[i] == LayersBackend::LAYERS_OPENGL) {
        Telemetry::Accumulate(Telemetry::OPENGL_COMPOSITING_FAILURE_ID,
                              failureReason);
      }

      return compositor;
    }

    // report any failure reasons here
    if (aBackendHints[i] == LayersBackend::LAYERS_OPENGL) {
      gfxCriticalNote << "[OPENGL] Failed to init compositor with reason: "
                      << failureReason.get();
      Telemetry::Accumulate(Telemetry::OPENGL_COMPOSITING_FAILURE_ID,
                            failureReason);
    }
  }

  return nullptr;
}
When I stepped through this yesterday the program counter jumped from the start of the for loop straight through to the line that writes to gfxCriticalNote for failure reporting. That suggested to me that the code in between was being skipped, most likely because none of the conditions were being met to execute it. In particular, it looked like the call to create a CompositorOGL instance was being skipped.

When I try stepping through the code again today, I realise that this was a misreading of what's happening. The jump is actually due to a compiler optimisation and, in fact, the code is creating a CompositorOGL instance after all.
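
To make the control flow easier to see in isolation, here's a minimal sketch of the same "try each backend hint in order" pattern. It isn't the Gecko code: Backend, FakeOpenGLCompositor, FakeBasicCompositor and NewCompositorSketch() are all hypothetical stand-ins, and the assumption that the basic backend always initialises is purely for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are the real Gecko classes.
enum class Backend { OpenGL, Basic };

struct Compositor {
  virtual ~Compositor() = default;
  // Returns true on success; on failure writes a reason string.
  virtual bool Initialize(std::string* failureReason) = 0;
  virtual std::string Name() const = 0;
};

struct FakeOpenGLCompositor : Compositor {
  explicit FakeOpenGLCompositor(bool contextAvailable)
      : mContextAvailable(contextAvailable) {}
  bool Initialize(std::string* failureReason) override {
    if (!mContextAvailable) {
      // Same reason string as appears in the browser's log output.
      *failureReason = "FEATURE_FAILURE_OPENGL_CREATE_CONTEXT";
      return false;
    }
    return true;
  }
  std::string Name() const override { return "opengl"; }
  bool mContextAvailable;
};

struct FakeBasicCompositor : Compositor {
  // Assumption for this sketch only: the basic backend never fails.
  bool Initialize(std::string*) override { return true; }
  std::string Name() const override { return "basic"; }
};

// Try each hinted backend in order and return the first one that
// initialises. Failure reasons are collected rather than sent to telemetry.
std::unique_ptr<Compositor> NewCompositorSketch(
    const std::vector<Backend>& hints, bool glContextAvailable,
    std::vector<std::string>* failures) {
  for (Backend hint : hints) {
    std::unique_ptr<Compositor> compositor;
    if (hint == Backend::OpenGL) {
      compositor = std::make_unique<FakeOpenGLCompositor>(glContextAvailable);
    } else if (hint == Backend::Basic) {
      compositor = std::make_unique<FakeBasicCompositor>();
    }
    std::string failureReason;
    if (compositor && compositor->Initialize(&failureReason)) {
      return compositor;
    }
    failures->push_back(failureReason);
  }
  return nullptr;
}

// Example usage: with no GL context available the loop falls through to
// the second hint and records why the first one failed.
void ExampleFallback() {
  std::vector<std::string> failures;
  auto compositor = NewCompositorSketch({Backend::OpenGL, Backend::Basic},
                                        false, &failures);
  assert(compositor && compositor->Name() == "basic");
  assert(failures.size() == 1);
}
```

The point to take from the sketch is that the failure-reporting code sits at the bottom of the loop body, after the creation and initialisation attempt, so seeing the debugger visit that line first says nothing about whether the creation step ran.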

Here's the step through of the code from this morning. I know these debugging step-throughs are hard to follow on their own, so I've added some annotations using comments as I've gone along to try to explain my thinking.
Thread 38 "Compositor" hit Breakpoint 1, mozilla::layers::
    CompositorBridgeParent::NewCompositor (this=this@entry=0x7fbcbb0190, 
    aBackendHints=...) at gfx/layers/ipc/CompositorBridgeParent.cpp:1455
1455        const nsTArray<LayersBackend>& aBackendHints) {
(gdb) n
1456      for (size_t i = 0; i < aBackendHints.Length(); ++i) {
(gdb) p aBackendHints.mHdr->mLength
$7 = 2
(gdb) n
1515          gfxCriticalNote << "[OPENGL] Failed to init compositor with 
    reason: "
(gdb) # Don't be fooled, this is an optimisation
(gdb) n
1458        if (aBackendHints[i] == LayersBackend::LAYERS_OPENGL) {
(gdb) # We're back up top again
(gdb) n
1461                                mEGLSurfaceSize.height, 
    mUseExternalSurfaceSize);
(gdb) # We have a LAYERS_OPENGL situation, which is good
(gdb) n
1476        nsCString failureReason;
(gdb) n
1485        const LayoutDeviceIntSize size = mWidget->GetClientSize();
(gdb) n
1486        if (size.width > max_fb_size || size.height > max_fb_size) {
(gdb) n
1493        if (compositor && compositor->Initialize(&failureReason)) {
(gdb) # We want to go into that Initialize() call
(gdb) s
mozilla::layers::CompositorOGL::Initialize (this=0x7ed81a2b90, 
    out_failureReason=0x7fb50ed5a0) at gfx/layers/opengl/CompositorOGL.cpp:378
378     bool CompositorOGL::Initialize(nsCString* const out_failureReason) {
(gdb) n
379       ScopedGfxFeatureReporter reporter("GL Layers");
(gdb) n
385       if (!mGLContext) {
(gdb) p mGLContext
$8 = {mRawPtr = 0x0}
(gdb) n
387         mGLContext = CreateContext();
(gdb) # We're going to head into this too
(gdb) s
mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7ed81a2b90) at 
    gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() 
    {
(gdb) n
231       nsIWidget* widget = mWidget->RealWidget();
(gdb) n
232       void* widgetOpenGLContext =
(gdb) n

Thread 1 "sailfish-browse" received signal SIGSEGV, Segmentation 
    fault.
[Switching to Thread 0x7feb174010 (LWP 2399)]
0x0000007fe761e2d0 in wl_display_read_events () from /usr/lib64/
    libwayland-client.so.0
(gdb) 
So while we do see the crash again here, it doesn't seem to be due to a missing compositor. Looking at the debugging above, it looks more likely the call to CompositorOGL::CreateContext() is the problem. Unfortunately, despite stepping through that method and trying my best to pin things down, I've still yet to establish whether this is actually causing the crash.

However, something has happened since that's caused me to change tack. Mid-afternoon I received a message from Raine letting me know that he and Frajo have managed to get the new ESR 91 version working on a Sailfish OS 4.6 device. Here's Raine's comment discussing what Frajo has been working on:
 
Could you try this?
LD_DEBUG=libs LD_PRELOAD=/usr/lib64/libhybris/eglplatform_wayland.so 
    sailfish-browser
Above worked for him. This should keep the eglplatform_wayland.so loaded and prevent unloading. The reason looks to be somewhere where dlunload happens, qtmozembed spots didn’t help but the issue might in libhybris itself.
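
As a rough mental model of why preloading might help, here's a toy sketch of a loader's bookkeeping. It is not glibc or libhybris code: ToyLoader and SurvivesOpenClose() are hypothetical, and the sketch assumes that a library mapped at process start (for example via LD_PRELOAD) is never unmapped by a later close, whereas one opened on demand is unmapped when its last reference is dropped.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model only: mimics reference counting plus "pinned" start-up
// libraries. It does not call the real dlopen() or dlclose().
class ToyLoader {
 public:
  // Models a library mapped at process start, e.g. through LD_PRELOAD.
  void Preload(const std::string& name) { mLibs[name].pinned = true; }

  // Models dlopen(): map the library if needed and take a reference.
  void Open(const std::string& name) { ++mLibs[name].refCount; }

  // Models dlclose(): drop a reference, then unmap the library once
  // nobody holds one, unless it is pinned.
  void Close(const std::string& name) {
    auto it = mLibs.find(name);
    if (it == mLibs.end()) {
      return;
    }
    if (it->second.refCount > 0) {
      --it->second.refCount;
    }
    if (it->second.refCount == 0 && !it->second.pinned) {
      mLibs.erase(it);
    }
  }

  bool IsLoaded(const std::string& name) const {
    return mLibs.count(name) != 0;
  }

 private:
  struct Entry {
    int refCount = 0;
    bool pinned = false;
  };
  std::map<std::string, Entry> mLibs;
};

// Returns whether the library is still mapped after one Open()/Close() pair.
bool SurvivesOpenClose(bool preloaded) {
  const std::string lib = "eglplatform_wayland.so";
  ToyLoader loader;
  if (preloaded) {
    loader.Preload(lib);
  }
  loader.Open(lib);
  loader.Close(lib);
  return loader.IsLoaded(lib);
}

// Example usage: the preload is the only thing keeping the library mapped.
void ExamplePreload() {
  assert(!SurvivesOpenClose(false));
  assert(SurvivesOpenClose(true));
}
```

In this model any code still holding pointers into the library after the close would be left dangling in the first case but not in the second, which is consistent with the preload working as a stopgap rather than a fix.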


If this is indeed the problem then this will be a great result. So let's try it on my own device here. When I set LD_DEBUG=libs as Raine suggested there's a huge amount of debug output generated (way more than I can reasonably include here) so I've dropped that part for the sake of brevity. But I get the same overall result whether it's set or not:
$ LD_PRELOAD=/usr/lib64/libhybris/eglplatform_wayland.so sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libui_compat_layer.so" not found
library "libutils.so" not found
library "libcutils.so" not found
library "libhardware.so" not found
library "android.hardware.graphics.mapper@2.0.so" not found
library "android.hardware.graphics.mapper@2.1.so" not found
library "android.hardware.graphics.mapper@3.0.so" not found
library "android.hardware.graphics.mapper@4.0.so" not found
library "libc++.so" not found
library "libhidlbase.so" not found
library "libgralloctypes.so" not found
library "android.hardware.graphics.common@1.2.so" not found
library "libion.so" not found
library "libz.so" not found
library "libhidlmemory.so" not found
library "android.hidl.memory@1.0.so" not found
library "vendor.qti.qspmhal@1.0.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[W] unknown:0 - MeeGo.QOfono QML module name is deprecated and subject for 
    removal. Please adapt code to "import QOfono".
[D] onCompleted:108 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
[D] unknown:0 - Updating services as GetServices returns
[D] unknown:0 - No default route set, services: 19
[D] unknown:0 - Selected service "Moominland" path "/net/connman/
    service/wifi_3c38f400343d_4d6f6f6d696e6c616e64_managed_psk"
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create 
    EGLConfig! (t=1.32095) [GFX1-]: Failed to create EGLConfig!
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create 
    EGLConfig! (t=1.32095) |[1][GFX1-]: Failed to create EGLConfig! (t=1.32193) 
    [GFX1-]: Failed to create EGLConfig!
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create 
    EGLConfig! (t=1.32095) |[1][GFX1-]: Failed to create EGLConfig! (t=1.32193) 
    |[2][GFX1-]: [OPENGL] Failed to init compositor with reason: 
    FEATURE_FAILURE_OPENGL_CREATE_CONTEXT (t=1.32273) [GFX1-]: [OPENGL] Failed 
    to init compositor with reason: FEATURE_FAILURE_OPENGL_CREATE_CONTEXT
sailfish-browser: ws.c:150: ws_prepareSwap: Assertion `ws != NULL' failed.
Aborted
Okay, so that's definitely not bringing up a usable browser just yet. However, you may recall that a couple of days back I made the following change to add a condition around the call to eglInitialize():
  if (display == EGL_NO_DISPLAY) {
    if (!lib.fInitialize(display, nullptr, nullptr)) {
      return nullptr;
    }
  }
This change looked like it was giving positive results, but I've never actually had the browser working with it like this. So I really need to try Frajo's fix with the version of the build that Frajo is also using, which is without this change.

I've therefore reverted this change and have started building a fresh version of the packages to test out. I'm excited to find out whether this will make the difference. I'm trying not to get too hopeful just yet, but if it does, that would be a real step forwards.

After a lengthy wait for the build to complete it's now late at night, but I'm eager to test out Frajo's fix. I run the command with some trepidation...
$ LD_PRELOAD=/usr/lib64/libhybris/eglplatform_wayland.so gdb sailfish-browser
[...]
And it works! With this LD_PRELOAD variable set, ESR 91 now runs nicely on Sailfish OS 4.6. I'm pretty certain I'd have been scrabbling around with the debugger for weeks without this fix from Frajo. Both Frajo and Raine are top-tier developers, so I'm not surprised they were able to come up with a solution so quickly. It really highlights to me how important it is to have amazing devs like Frajo and Raine working on Sailfish OS.

I feel like we're in touching distance of the finish line now. Although this trick gets the browser to work, it still needs a proper fix of course, but it offers hope that the real fix isn't too far away from this. Once this is sorted, and unless something else comes up, this will be the last thing on my list before I move on to tidying up mode.

So this is an exciting time. It's been over a year now; I'll be happy when I can finally say ESR 91 is ready.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
