flypig.co.uk

Personal Blog

View the blog index.

RSS feed Click the icon for the blog RSS feed.

Blog

5 most recent items

17 Apr 2024 : Day 219 #
Today has been a bit of a disheartening day from a Gecko perspective. So disheartening in fact that I considered pausing this blog, taking some time off and then trying to fix it in private to avoid having to admit how badly things are going right now.

But it's important to show the downs as well as the ups. Software development can be a messy business at times, things rarely go to plan and even when they do there's still often an awful lot of angst and frustration preceding the enjoyment of getting something working.

So here it is, warts and all.

Overnight I rebuilt the packages with the new patches installed. Running the WebView showed no changes in the output: still a white screen, no page rendering or coloured backgrounds. I can live with that, it's not what I wanted but it's also no worse than before.

So I decided to head off in the direction that was set last Thursday when I laid plans to check the implementation that happens between the DeclarativeWebContainer::clearWindowSurface() method and the CompositorOGL::EndFrame() method. This direction is in response to the useful discussion in last week's Sailfish Community Meeting.

First up I wanted to establish what was happening on the window side, so starting at clearWindowSurface() I added some code first to change the colour used to clear the texture from white to green.

This didn't affect the rendering, which remains white. Okay, so far so normal. That's a little unexpected, but nevertheless adds new and useful information.

I then added some code to read off the colour at the centre of the texture. This is pretty simple code and, as before, I added some debug output lines so we can find out what's going wrong. Here's what the method looks like now:
void DeclarativeWebContainer::clearWindowSurface()
{
    Q_ASSERT(m_context);
    // The GL context should always be used from the same thread in which it 
    was created.
    Q_ASSERT(m_context->thread() == QThread::currentThread());
    m_context->makeCurrent(this);
    QOpenGLFunctions_ES2* funcs = 
    m_context->versionFunctions<QOpenGLFunctions_ES2>();
    Q_ASSERT(funcs);

    funcs->glClearColor(0.0, 1.0, 0.0, 0.0);
    funcs->glClear(GL_COLOR_BUFFER_BIT);
    m_context->swapBuffers(this);

    QSize screenSize = QGuiApplication::primaryScreen()->size();
    size_t bufferSize = screenSize.width() * screenSize.height() * 4;
    uint8_t* buf = static_cast<uint8_t*>(calloc(sizeof(uint8_t), bufferSize));

    funcs->glReadPixels(0, 0, screenSize.width(), screenSize.height(),
                            GL_RGBA, GL_UNSIGNED_BYTE, buf);

    int xpos = screenSize.width() / 2;
    int ypos = screenSize.height() / 2;
    int pos = xpos * ypos * 4;

    volatile char red = buf[pos];
    volatile char green = buf[pos + 1];
    volatile char blue = buf[pos + 2];
    volatile char alpha = buf[pos + 3];

    printf(&quot;Colour: (%d, %d, %d, %d)\n&quot;, red, green, blue, alpha);
    free(buf);
}
Everything after the call to swapBuffers() is newly added.

Executing this I was surprised to find that there was no new output from this. So I tried to set a breakpoint on it, but gdb claims it doesn't exist in the binary. A bit of reflection made me realise why: this is code in sailfish-browser, which forms part of the browser, not the WebView.

So for the WebView I'll need to find the similar equivalent call. Before trying to find out where the correct call lives, I thought I'd first see what happens when I run the browser instead of the WebView. Will gdb have more luck with that?

And this is where things start to go wrong. When running with the browser the method is called once. The screen turns green. But there's no other rendering. The fact the screen goes green is good. The fact there's no other rendering is bad. Very bad.

It means that over the last few weeks while I've been trying to fix the WebView render pipeline, I've been doing damage to the browser render pipeline in the process. I honestly thought they didn't interact at this level, but it turns out I was wrong.

So now I have both a broken WebView and a broken browser to fix. Grrrr.

While I was initially quite downhearted about this, on reflection it's not quite as bad as it sounds. The working browser version is still safely stored in the repository, so if necessary I can just revert back. But I've also got all of the changes on top of this easily visible using git locally on my system. So I can at least now work through the changes to see whether I can figure out what's responsible.

I'm not going to make more progress on this tonight, so I'll have to return to it tomorrow to try to understand what I've broken.

So very frustrating. But that's software development for you.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
16 Apr 2024 : Day 218 #
I've been continuing my patch audit today. It remains a bit of a lengthy and laborious process, but I have at least managed to get through them all.

Until now I've been keeping track of patches that have been applied, patches that aren't needed — either because changes have become redundant or because the code has changed so much that they're just no longer applicable — and patches that haven't yet been applied.

As I worked through today I found a new category to log: patches that haven't been applied but really should be. For example there are patches to ensure libcontentaction is used for handling URLs, or patches for managing extensions better on mobile. These are changes that won't necessarily have an immediately obvious effect on whether or not the browser runs, but will have functionality or efficiency impacts that need to be included.

The list of all such patches is longer than I was expecting:

0056, 0057, 0058, 0059, 0060, 0061, 0062, 0063, 0067, 0068, 0069, 0074, 0077, 0078, 0079, 0080, 0081, 0086.

You might well ask, if I know for sure that these should be applied, why didn't I just apply them? That's because I don't want to muddy the waters with them while I focus on fixing the offscreen render pipeline. If I applied these patches now, not only would it make debugging harder, it would also make the partitioning of code changes into clean patches harder as well. What I am sure about, however, is that none of these are affecting the render pipeline problem.

On top of these the following patches haven't been applied, but potentially should be. I say potentially because these are patches which either probably should be applied but I've not had time to properly decide yet, or where it's possible the reason for the patch was fixed upstream, but it's not entirely clear yet.

For example this includes patches for integrating with FFmpeg version 5.0 and above, patches to fix bugs or fixes for memory leaks. When I checked these they definitely hadn't yet been applied, but maybe they can safely be skipped. I should go through all of these more carefully at some point. Here's the list:

0008, 0012, 0014, 0041, 0042, 0043, 0044, 0045, 0046, 0049, 0050, 0051, 0052, 0064, 0066, 0073, 0076, 0082, 0083, 0084, 0085, 0087, 0089, 0090, 0091, 0094, 0095, 0096, 0097, 0099.

Then there are the patches that are not needed. I didn't find any more of these, so the list remains the same as yesterday with the following:

0004, 0005, 0013, 0035.

Finally the patches that have already been applied:

0001, 0002, 0003, 0006, 0007, 0009, 0010, 0011, 0015, 0016, 0017, 0018, 0019, 0020, 0021, 0022, 0023, 0024, 0025, 0026, 0027, 0028, 0029, 0030, 0031, 0032, 0033, 0034, 0036, 0037, 0038, 0039, 0040, 0047, 0048, 0053, 0054, 0055, 0065, 0070, 0071, 0072, 0075, 0088, 0092, 0093, 0099.

That's 47 patches in total; just under half. Plus the four that won't get applied at all sums to 51 patches dealt with, leaving 48 that still need dealing with; just under half.

Of those patches that have been applied, the following are the ones that hadn't been applied but looked potentially relevant to the render pipeline issue I'm tryig to fix. So I went ahead and applied all of these just in case:

0053, 0054, 0055, 0065, 0075.

My suspicion is that none of these will be important enough to fix the issue, but all contributions are helpful. The next step is to build myself some new packages, install them and see where we are.

Then, tomorrow, I can finally move on to the task that was discussed at the community meeting last Thursday. I'm going to start testing the render status at points in between DeclarativeWebContainer::clearWindowSurface() and CompositorOGL::EndFrame(). This will help narrow down where the working render turns into a broken render. That'll leave me in a much better position to focus on why.

I'm just starting the build now.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
15 Apr 2024 : Day 217 #
After re-reading patch 0053 yesterday, today I've been reintegrating the TextureImageEGL class back into the codebase. For context, patch 0053 makes a very small change to a single file, the TextureImageEGL.h file. But when I tried to check whether the patch had been applied or not I discovered the file it's supposed to apply to has been removed entirely from the ESR 91 codebase entirely.

The upstream changeset that removed the file mentions that the class doesn't add anything over and above what BasicTextureImage already offers. When I compared the two classes myself I came to pretty much the same conclusion: there really isn't much that's different between them. The key difference seems to be that TextureImageEGL allows for a wide variety of surface formats, whereas BasicTextureImage supports only RGBA format. There's also a slight difference in that TextureImageEGL seems to be more aggressive in resizing the texture.

Neither of these strike me as significant enough to prevent rendering, but you never know. So I've copied over the two relevant files — TextureImageEGL.h and TextureImageEGL.cpp — and integrated them into the build system. I also had to adjust the code in a few places in order to support the new EglDisplay class and some slight differences in the overridable parent method signatures that necessitated altering the signatures in the class.

Since then I've been trying to perform a partial rebuild of the code, trying to avoid having to do a full rebuild. Previously I would have run a full build overnight, but I've been getting increasingly confident with the way the build system works. Consequently I persuaded mach to rebuild the configuration based on the advice it offered up:
$ sfdk engine exec
$ sb2 -t SailfishOS-esr78-aarch64.default
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/
make: Entering directory '${PROJECT}/obj-build-mer-qt-xr/gfx/gl'
${PROJECT}/gecko-dev/config/rules.mk:338: *** Build configuration changed. 
    Build with |mach build| or run |mach build-backend| to regenerate build 
    config.  Stop.
make: Leaving directory '${PROJECT}/obj-build-mer-qt-xr/gfx/gl'
Following that advice seems to give good results:
$ ./gecko-dev/mach build-backend
 0:05.01 ${PROJECT}/obj-build-mer-qt-xr/_virtualenvs/common/bin/python 
    ${PROJECT}/obj-build-mer-qt-xr/config.status
Reticulating splines...
 0:05.22 File already read. Skipping: ${PROJECT}/gecko-dev/intl/components/
    moz.build
 0:08.32 File already read. Skipping: ${PROJECT}/gecko-dev/gfx/angle/targets/
    angle_common/moz.build
Finished reading 984 moz.build files in 19.20s
Read 11 gyp files in parallel contributing 0.00s to total wall time
Processed into 5326 build config descriptors in 17.35s
RecursiveMake backend executed in 28.37s
  1959 total backend files; 0 created; 2 updated; 1957 unchanged; 0 deleted; 21 
    -> 664 Makefile
FasterMake backend executed in 2.77s
  8 total backend files; 0 created; 1 updated; 7 unchanged; 0 deleted
Total wall time: 69.45s; CPU time: 68.55s; Efficiency: 99%; Untracked: 1.75s
Glean could not be found, so telemetry will not be reported. You may need to 
    run |mach bootstrap|.
I can then follow my usual partial build commands in order to get a newly built version of the library out.

After the build I went through the usual process of transferring and installing the new library to my phone. Sadly I'm still left with a blank screen, so this wasn't enough to get the render working. That may mean these changes can safely reverted, but I'll leave them as they are for now and aim to come back to that later.

That's because it should be easier to remove redundant code once things are working than to guess exactly which changes are needed to get things to work. Or at least, that's my rationale.

This leaves me having applied another patch today (patch 0053), so that we now have the following applied:

0001, 0002, 0003, 0006, 0007, 0009, 0010, 0011, 0015, 0016, 0017, 0018, 0019, 0020, 0021, 0022, 0023, 0024, 0025, 0026, 0027, 0028, 0029, 0030, 0031, 0032, 0033, 0034, 0036, 0037. 0038, 0039, 0040, 0047, 0048, 0053

The following remain uneeded:

0004, 0005, 0013, 0035,

While the following haven't yet been applied:

0008, 0012, 0014, 0041, 0042, 0043, 0044, 0045, 0046, 0049, 0050, 0051, 0052,

Tomorrow I'll continue working through the patches. It's slow going, but this interlude with TextureImageEGL serves to highlight that there are still relevant patches still to be applied, which means it's worth checking. But I also want to reiterate that I've not forgotten the discussion from the Sailfish Community Meeting last Thursday. I still have some work to do following the advice provided there.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
14 Apr 2024 : Day 216 #
After a motivational day getting really useful feedback from Raine, Damien and others, I'm back to going through patches today. I will of course need to act on the helpful suggestions, which I#m sure will be key to getting the rendering to work, but having started on my journey of checking patches I need to now finish it. This entire activity is an in-order execution pipeline you see.

I had reached patch 0038, otherwise named "Fix mesa egl display and buffer initialisation" and originally written by Adam Pigg, Frajo and myself back in 2021. It's a big one and also a relevant one to the current problem, so I'm going to spend a bit of time working my way through it.

The good news is that the process of checking, while a little laborious, doesn't require a lot of concentration. So it's good work to do while I'm still half asleep on my morning commute.

[...]

Having worked through patch 0038 it all looks present and correct, at least to the extent that makes sense with the way the new code is structured. I still have 61 patches to check through so I'm going to continue on with them.

So far the following are all applied, present and correct:

0001, 0002, 0003, 0006, 0007, 0009, 0010, 0011, 0015, 0016, 0017, 0018, 0019, 0020, 0021, 0022, 0023, 0024, 0025, 0026, 0027, 0028, 0029, 0030, 0031, 0032, 0033, 0034, 0036, 0037. 0038, 0039, 0040, 0047, 0048,

The following are no longer needed:

0004, 0005, 0013, 0035,

While the following haven't been applied, but might potentially be important for the browser overall, so may need to be applied at some point in the future:

0008, 0012, 0014, 0041, 0042, 0043, 0044, 0045, 0046, 0049, 0050, 0051, 0052,

I got up to patch 0051 and then I hit something unexpected. Patch 0051 makes "TextureImageEGL hold a reference to GLContext". There's a detailed explanation in the patch about why this is needed: it ensures that freed memory isn't used during the shutdown of EmbedLite port objects. The actual change required is minimal:
  protected:
   typedef gfxImageFormat ImageFormat;
 
-  GLContext* mGLContext;
+  RefPtr<GLContext> mGLContext;
 
   gfx::SurfaceFormat mUpdateFormat;
   EGLImage mEGLImage;
Changes don't come much simpler than this. But when I tried to check whether it had been applied I ran in to problems: the TextureImageEGL.h file which it's intended to apply to doesn't exist.

Well that's odd.

It would appear that the TextureImageEGL class has been completely removed. And because it's created in a factory it's apparently not affected any of the other EmbedLite changes. Here's the creation code from ESR 78:
already_AddRefed<TextureImage> CreateTextureImage(
    GLContext* gl, const gfx::IntSize& aSize,
    TextureImage::ContentType aContentType, GLenum aWrapMode,
    TextureImage::Flags aFlags, TextureImage::ImageFormat aImageFormat) {
  switch (gl->GetContextType()) {
    case GLContextType::EGL:
      return CreateTextureImageEGL(gl, aSize, aContentType, aWrapMode, aFlags,
                                   aImageFormat);
    default: {
      GLint maxTextureSize;
      gl->fGetIntegerv(LOCAL_GL_MAX_TEXTURE_SIZE, &maxTextureSize);
      if (aSize.width > maxTextureSize || aSize.height > maxTextureSize) {
        NS_ASSERTION(aWrapMode == LOCAL_GL_CLAMP_TO_EDGE,
                     &quot;Can't support wrapping with tiles!&quot;);
        return CreateTiledTextureImage(gl, aSize, aContentType, aFlags,
                                       aImageFormat);
      } else {
        return CreateBasicTextureImage(gl, aSize, aContentType, aWrapMode,
                                       aFlags);
      }
    }
  }
}
Notice how in ESR 91 the code for generating TextureImageEGL objects has been completely removed:
already_AddRefed<TextureImage> CreateTextureImage(
    GLContext* gl, const gfx::IntSize& aSize,
    TextureImage::ContentType aContentType, GLenum aWrapMode,
    TextureImage::Flags aFlags, TextureImage::ImageFormat aImageFormat) {
  GLint maxTextureSize;
  gl->fGetIntegerv(LOCAL_GL_MAX_TEXTURE_SIZE, &maxTextureSize);
  if (aSize.width > maxTextureSize || aSize.height > maxTextureSize) {
    NS_ASSERTION(aWrapMode == LOCAL_GL_CLAMP_TO_EDGE,
                 &quot;Can't support wrapping with tiles!&quot;);
    return CreateTiledTextureImage(gl, aSize, aContentType, aFlags,
                                   aImageFormat);
  } else {
    return CreateBasicTextureImage(gl, aSize, aContentType, aWrapMode, aFlags);
  }
}
Let's find out why that is.
$ git log -1 -S &quot;CreateTextureImageEGL&quot; -- gfx/gl/GLTextureImage.cpp
commit a7817816b708c8b1fc3362d21cbcadbff3c46741
Author: Jeff Muizelaar <jmuizelaar@mozilla.com>
Date:   Tue Jun 15 21:10:47 2021 +0000

    Bug 1716559 - Remove TextureImageEGL. r=jnicol,jgilbert
    
    TextureImageEGL doesn't seem to provide any value beyond
    BasicTextureImage. It's last usage was bug 814159.
    
    Removing this has the side effect of using BasicTextureImage
    for small images instead of always using TilingTextureImage.
    
    Differential Revision: https://phabricator.services.mozilla.com/D117904
It surprises me that I've not run in to this before, but one of the benefits of keeping this diary is that I can very quickly verify this:... I have not.

The upstream diff suggests that TextureImageEGL provides no value, but it would be easy to miss the value if it applied only to a project that's external to gecko and which isn't officially supported. So I wonder if it's potentially offering value to us?

Unfortunately I've run out of time today to investigate this further today. I'd especially like to compare the functionality of TextureImageEGL from ESR 78 with that of BasicTextureImage in ESR 91. That will be my task for tomorrow morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
13 Apr 2024 : Day 215 #
This morning I had a really fruitful discussion with everyone attending the fortnightly Sailfish Community Meeting. As you'll know I've reached the point where it's clear that rendering is happening to the surface, but there's something not working between the surface and the screen that's preventing any colour getting to the screen.

There were two really excellent outcomes from the meeting. The first involved potential solutions to this problem and I'd like to spend a little time discussing that now. The second... well I'll come to that later.

So let's first look at those potential solutions. The first thing was that there's a process that happens between the sailfish-browser code and the gecko code that performs the WebView rendering. It takes the offscreen render and ensures it gets to the screen. And as Raine (rainemak) explained during the meeting, that starts in the DeclarativeWebContainer::clearWindowSurface() method. That's part of the sailfish-browser code.

Recently I've been playing around with the CompositorOGL::EndFrame() method. That's at the other end of the process. So that means the part of the code I need to check should be somewhere between these two points. Here's some of the relevant conversation (slightly amended by me for brevity and clarify):
 
rainemak: flypig, have you tried to change gl clear color?
flypig: Yes, I tried glClearColor. The change shows up with fReadPixels, but not on screen.
rainemak: has to be that off screen is not there
flypig: Could you elaborate?
rainemak: flypig, one thing worth looking is that gecko doesn't create any headless things in addition to what you're expecting it to create
flypig: rainemak, how would that manifest itself? I've been careful to ensure there's only one display value
created. Is there anything else to check for this you can think of?
rainemak: flypig, also gl makeCurrent and swap I'd check
rainemak: DeclarativeWebContainer::clearWindowSurface
rainemak: ^ is that being called properly
flypig: That's in sailfish-browser rainemak? I've not checked that; I'll do so.

So that's some clear things to check, specifically the code around clearWindowSurface() and the calling of glMakeCurrent() (which is most likely called fMakeCurrent() if it's in the gecko code). So this is the reason for the first part of my upcoming plan:
 
flypig: Alright, I'm going to check on the clearWindowSurface side of things and everything between that and CompositorOGL::EndFrame().
rainemak: flypig, that's at least scheduling glclear
flypig: Okay, that's useful to know. Then maybe I can put lots of readPixels in between the two to find out exactly which step is failing.

But there was more useful info from Raine in addition to this. He also suggested to take a look at the DrawUnderlay() method — if it exists — or the functionality that's replaced it.
 
rainemak: flypig, and draw underlay is getting called properly as well? that call makeCurrent
rainemak: that calls makeCurrent
flypig: One second, let me check.
rainemak: let me think
flypig: That's called in EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget() ?
rainemak: it starts around there
rainemak: did we remove on esr78 DrawUnderlay like methon ( there's still QMozWindowPrivate::DrawOverlay )
rainemak: hmm
rainemak: I'm not seeing that makeCurrent

Finally, he also suggested to check the updatePaintNode() method which is part of the qtmozembed codebase:
 
rainemak: flypig, QuickMozView::updatePaintNode I'd check this one as browser rendering works
rainemak: i.e. does getPlatform and MozMaterialNode::preprocess work as expected
rainemak: thinking if there's something wrong now with previous scenegraph integration

All of these are excellent suggestions and they're now on my list of things to check. Yesterday I'd started working through the patches and I'll have to finish that task before I move on to these suggestions, since the patches have ramifications for other parts of the code as well, but I'm really hoping — and quite energised by the prospect — that these tips from Raine might lead to a solution for the offscreen rendering problems.

It was also great to see that the discussion during the meeting also led to some helpful OpenGL debugging suggestions on the forum. I've already thanked Tone (tortoisedoc) previously but now I must also add Damien Caliste (dcaliste) to the list of people who deserve my thanks.

So that was the first nice outcome, what about the second? Well that's of a slightly different nature. During the meeting thigg also made an unexpected by also really superb offer:
 
thilo[m]: #info thigg, community
thilo[m]: Flypig: you have a milestone in gecko after which you have a big party planned? ;)
flypig: thilo[m], once the offscreen render is working, I'll generate patches, and then it will be time to celebrate (for me at least!).
thilo[m]: Cool, i'll prepare pictures ;)

I can't get enough of Thilo's pictures and am already looking forward to being able to share them in these diaries when the time comes.

That's it for today, but I'll return tomorrow with updates on patches.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment