Blog
All items from June 2024
30 Jun 2024 : Day 274 #
Yesterday I spent a lot of time using the debugger, stepping through the code to find out why the GLContext isn't being created successfully. Deep in the code I found that a call to fCheckFramebufferStatus() is returning with the following error:
GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT
The OpenGL docs tell us that this is returned if the framebuffer doesn't have at least one image attached to it. That's a little surprising because just a few lines above where the check takes place we find the following line:
gl->AttachBuffersToFB(colorTex, 0, 0, 0, fb, target);
Maybe I'm misunderstanding how framebuffer attachment is supposed to work, or maybe there's something wrong with either the buffer or framebuffer. Either way, something needs fixing.
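One way these two observations can coexist is sketched below. This is a toy model, not Gecko code: ToyFramebuffer, attachBuffers() and checkStatus() are hypothetical names, and the only rule modelled is the "no attachments" one. The branch structure follows the if (colorTex) / else if (colorRB) / if (depthRB) / if (stencilRB) lines visible in the Day 273 debugger trace, where an attachment is only made for a non-zero name. That same trace prints colorTex@entry=0 on entry to AttachBuffersToFB(), so a zero texture name may be worth checking, although nothing here confirms it as the cause.

```cpp
#include <cassert>
#include <cstdint>

// Status values as defined by OpenGL ES 2.0.
constexpr std::uint32_t kFramebufferComplete = 0x8CD5;
constexpr std::uint32_t kIncompleteMissingAttachment = 0x8CD7;

// Hypothetical stand-in for a framebuffer object: it only records which
// object names are attached. A name of 0 means "nothing attached".
struct ToyFramebuffer {
  std::uint32_t colorTex = 0;
  std::uint32_t colorRB = 0;
  std::uint32_t depthRB = 0;
  std::uint32_t stencilRB = 0;
};

// Each attachment is only made when its name is non-zero, so a call with
// all-zero names leaves the framebuffer exactly as it was.
inline void attachBuffers(ToyFramebuffer& fb, std::uint32_t colorTex,
                          std::uint32_t colorRB, std::uint32_t depthRB,
                          std::uint32_t stencilRB) {
  if (colorTex) {
    fb.colorTex = colorTex;
  } else if (colorRB) {
    fb.colorRB = colorRB;
  }
  if (depthRB) fb.depthRB = depthRB;
  if (stencilRB) fb.stencilRB = stencilRB;
}

// Toy completeness check covering only the "at least one image" rule.
inline std::uint32_t checkStatus(const ToyFramebuffer& fb) {
  const bool hasAny = fb.colorTex || fb.colorRB || fb.depthRB || fb.stencilRB;
  return hasAny ? kFramebufferComplete : kIncompleteMissingAttachment;
}

// Usage: an attach call was made, yet the framebuffer still has no image.
inline void example() {
  ToyFramebuffer fb;
  attachBuffers(fb, /*colorTex=*/0, 0, 0, 0);
  assert(checkStatus(fb) == kIncompleteMissingAttachment);
}
```

In this model the attach call is present in the code, exactly as in ReadBuffer::Create(), but it has no effect when every name passed in is zero.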
I've read carefully through the code and compared it with the changes from the previous working version. The remaining changes can be categorised into a few related areas and it seems to me that any one of these could be causing the problem:
- The compacted version of GLContextProvider::CreateOffscreen() (maybe I got something wrong when I merged the three methods into one?).
- Removal of the Wayland integration.
- Removal of raw_fBindFramebuffer().
- Changes to the EGLDisplay flow.
- Changes to the SharedSurface_Basic class.
- Simplifications of the texture destruction flow.
I've decided to start by looking at the SharedSurface_Basic() changes. It looks like these are used before the failure occurs and are part of the same execution flow, so they would seem to be good candidates.
Thankfully git will do the hard work of restoring the previous versions of the relevant files, but I also want to keep a record of the changes I'm making in case they don't have any effect. In that case, I may well want to revert the changes I'm making now. So the following steps store a record of the current diff, then revert it.
$ git diff gfx/gl/SharedSurfaceGL.cpp gfx/gl/SharedSurfaceGL.h > ../../surface-factory-changes.diff
$ git checkout gfx/gl/SharedSurfaceGL.cpp
Updated 1 path from the index
$ git checkout gfx/gl/SharedSurfaceGL.h
Updated 1 path from the index
Note that I'm only changing two files here: SharedSurfaceGL.h and SharedSurfaceGL.cpp. I'm still trying to keep my changes as minimal as possible.
Having made these changes the partial build has gone through successfully. On testing it I discover two things:
- When using the browser, general rendering works fine, but WebGL rendering now no longer works. It doesn't crash, just doesn't show the WebGL content on the page.
- WebView behaviour has changed slightly, although not in a way that fixes it. The screen goes full black and the app hangs. This is reminiscent of the behaviour we saw back around Day 185.
 SharedSurface_EGLImage::~SharedSurface_EGLImage() {
   const auto& gle = GLContextEGL::Cast(mDesc.gl);
   const auto& egl = gle->mEgl;
   egl->fDestroyImage(mImage);
   if (mSync) {
     // We can't call this unless we have the ext, but we will always have
     // the ext if we have something to destroy.
     egl->fDestroySync(mSync);
     mSync = 0;
   }
+  if (!mDesc.gl || !mDesc.gl->MakeCurrent()) return;
+
+  mDesc.gl->fDeleteTextures(1, &mProdTex);
+  mProdTex = 0;
 }
The only reason I'm mentioning this now is in case we need to make a similar fix again. But I don't have time to focus on it right now, so I'm just keeping a note of it in case it turns out to be helpful for later.
So this is the state of the latest build. What I need to do now is some debugging to find out why these two issues are happening. In particular I want to find out whether SharedSurface_Basic gets used during a WebGL render.
But to do that I'll need to run a full build. It's late now, so I can run the build overnight and let my laptop do the work while I sleep. That sounds like a good plan to me, so see you tomorrow!
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
29 Jun 2024 : Day 273 #
Continuing on with our investigation from yesterday, we've got a WebView that now runs but doesn't render. Today I need to find out the reason. The good news is that the WebGL on this build is still working, so we're definitely getting closer.
If you've been following along you'll know that this isn't the first time I've needed to get the WebView render pipeline working. The difference is that this time I have the delta between where we are now and where there's a working WebView; I just need to tread carefully enough between here and there so as not to destroy the WebGL in the process.
So, first up, why is the WebView not working? From what we saw yesterday I already know that this comes down to a broken call to GLContextProvider::CreateOffscreen() which is returning null. I want to step through the code to find out why.
Last night I couldn't do this because I only had a partial build (meaning the debug symbols and source were misaligned with the binary). I ran a build overnight to fix that. So now it's time to step through.
I've placed a breakpoint on CompositorOGL::CreateContext(). Let's see what we can see when we step through the code from there. Keep in mind that we're interested in the call to CreateOffscreen() and what it's returning.
Thread 38 "Compositor" hit Breakpoint 1, mozilla::layers::
    CompositorOGL::CreateContext (this=this@entry=0x7ed4002ed0)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227   already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
(gdb) n
231   nsIWidget* widget = mWidget->RealWidget();
(gdb) n
232   void* widgetOpenGLContext =
(gdb) n
234   if (widgetOpenGLContext) {
(gdb) n
248   if (!context && gfxEnv::LayersPreferOffscreen()) {
(gdb) n
249   nsCString discardFailureId;
(gdb) n
257   context = GLContextProvider::CreateOffscreen(
(gdb) p context
$1 = {mRawPtr = 0x0}
(gdb) s
mozilla::gl::GLContextProviderEGL::CreateOffscreen (size=...,
    flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE,
    out_failureId=out_failureId@entry=0x7f1f96b1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1264
1264  gl = CreateHeadless({CreateContextFlags::REQUIRE_COMPAT_PROFILE}, out_failureId);
(gdb) p gl
$2 = {mRawPtr = 0x0}
(gdb) n
1267  if (!gl || !gl->IsOffscreenSizeAllowed(size)) {
(gdb) p gl
$3 = {mRawPtr = 0x7ed419ee40}
(gdb) n
1271  UniquePtr<GLScreenBuffer> newScreen = GLScreenBuffer::Create(gl, size);
(gdb) p size
$4 = (const mozilla::gfx::IntSize &) @0x7ed4002fc4: {<mozilla::gfx::
    BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> >> = {{{
    width = 1080, height = 2520}, components = {1080, 2520}}}, <mozilla::
    gfx::UnknownUnits> = {<No data fields>}, <No data fields>}
(gdb) n
1272  if ((!newScreen) || (!newScreen->Resize(size))) {
(gdb) p newScreen.mTuple.mFirstA
$6 = (mozilla::gl::GLScreenBuffer *) 0x7ed4003ba0
(gdb) n
1271  UniquePtr<GLScreenBuffer> newScreen = GLScreenBuffer::Create(gl, size);
(gdb) n
1262  RefPtr<GLContext> gl;
(gdb) n
mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7ed4002ed0)
    at gfx/layers/opengl/CompositorOGL.cpp:249
249   nsCString discardFailureId;
(gdb) n
263   if (!context) {
(gdb) p context
$7 = {mRawPtr = 0x0}
(gdb)
As we can see from this, we drop in to the CreateOffscreen() method. This calls CreateHeadless() which gives us what appears to be a valid context. Then we get a call to GLScreenBuffer::Create() which gives us a valid GLScreenBuffer object.
It's hard to tell from the debug trace, but the line that's failing is the following:
if ((!newScreen) || (!newScreen->Resize(size)))
We can see from the trace that newScreen is valid, so it's the call to Resize() which is returning false. We can't yet tell why. But we can find out by stepping inside the Resize() method to check. That'll require us to re-run the application. Let's give that a go.
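The reason the debugger has to do the work here is that the compound condition folds two distinct failures into one branch. As a minimal sketch of the alternative, assuming hypothetical names throughout (ToyScreen, ScreenFailure and classify() are not part of Gecko), the same short-circuit order can be kept while reporting each operand separately:

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for GLScreenBuffer; resizeOk decides the outcome.
struct ToyScreen {
  bool resizeOk;
  bool Resize(int /*width*/, int /*height*/) { return resizeOk; }
};

enum class ScreenFailure { None, NullScreen, ResizeFailed };

// Same evaluation order as `(!newScreen) || (!newScreen->Resize(size))`:
// the null check comes first, so Resize() is never called on a null pointer.
inline ScreenFailure classify(const std::unique_ptr<ToyScreen>& screen,
                              int width, int height) {
  if (!screen) return ScreenFailure::NullScreen;
  if (!screen->Resize(width, height)) return ScreenFailure::ResizeFailed;
  return ScreenFailure::None;
}

// Usage: a valid screen whose resize fails, as in the trace above.
inline void example() {
  std::unique_ptr<ToyScreen> screen(new ToyScreen{false});
  assert(classify(screen, 1080, 2520) == ScreenFailure::ResizeFailed);
}
```

With the two cases separated like this, a single log line or breakpoint would distinguish "no screen buffer" from "resize failed" without needing to print newScreen by hand.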
To help with this I've set a bunch of breakpoints that will allow me to skip between the relevant parts of the code:
(gdb) info break
Num     Type           Disp Enb Address            What
3       breakpoint     keep y   0x0000007ff11988a4 in mozilla::layers::
    CompositorOGL::CreateContext() at gfx/layers/opengl/CompositorOGL.cpp:227
        breakpoint already hit 1 time
4       breakpoint     keep y   0x0000007ff1131540 in mozilla::gl::
    GLContextProviderEGL::CreateOffscreen(mozilla::gfx::IntSizeTyped<mozilla::
    gfx::UnknownUnits> const&, mozilla::gl::CreateContextFlags,
    nsTSubstring<char>*) at gfx/gl/GLContextProviderEGL.cpp:1260
        breakpoint already hit 1 time
5       breakpoint     keep y   0x0000007ff1107644 in mozilla::gl::
    GLScreenBuffer::Resize(mozilla::gfx::IntSizeTyped<mozilla::gfx::
    UnknownUnits> const&) at gfx/gl/GLScreenBuffer.cpp:339
        breakpoint already hit 1 time
6       breakpoint     keep y   0x0000007ff110695c in mozilla::gl::
    GLScreenBuffer::Attach(mozilla::gl::SharedSurface*, mozilla::gfx::
    IntSizeTyped<mozilla::gfx::UnknownUnits> const&) at gfx/gl/GLScreenBuffer.cpp:275
(gdb)
Now let's step through, hitting these breakpoints as we go, and take special care not to jump past the Resize() call.
Thread 37 "Compositor" hit Breakpoint 3, mozilla::layers::
    CompositorOGL::CreateContext (this=this@entry=0x7ed8002f10)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227   already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
(gdb) c
Continuing.
Thread 37 "Compositor" hit Breakpoint 4, mozilla::gl::
    GLContextProviderEGL::CreateOffscreen (size=...,
    flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE,
    out_failureId=out_failureId@entry=0x7f179ac1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1260
1260  CreateContextFlags flags, nsACString* const out_failureId) {
(gdb) c
Continuing.
Thread 37 "Compositor" hit Breakpoint 5, mozilla::gl::GLScreenBuffer::
    Resize (this=0x7ed8003ba0, size=...) at gfx/gl/GLScreenBuffer.cpp:339
339   bool GLScreenBuffer::Resize(const gfx::IntSize& size) {
That's all of our breakpoints hit. We're now at the start of the Resize() method. Let's step through it.
339   bool GLScreenBuffer::Resize(const gfx::IntSize& size) {
(gdb) n
342   if (!newBack) return false;
(gdb) p mFactory.mTuple->mFirstA
$15 = (mozilla::gl::SurfaceFactory *) 0x7ed8004190
(gdb) p newBack.mRawPtr
$17 = (mozilla::layers::SharedSurfaceTextureClient *) 0x7ed81af430
(gdb) c
Continuing.
Thread 37 "Compositor" hit Breakpoint 6, mozilla::gl::GLScreenBuffer::
    Attach (this=this@entry=0x7ed8003ba0, surf=0x7ed81a1c80, size=...)
    at gfx/gl/GLScreenBuffer.cpp:275
275   bool GLScreenBuffer::Attach(SharedSurface* surf, const gfx::IntSize& size) {
Now we're inside the Attach() method. Let's step through this.
275   bool GLScreenBuffer::Attach(SharedSurface* surf, const gfx::IntSize& size) {
(gdb) n
276   ScopedBindFramebuffer autoFB(mGL);
(gdb) p size
$18 = (const mozilla::gfx::IntSize &) @0x7ed8003004: {<mozilla::gfx::
    BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> >> = {{{
    width = 1080, height = 2520}, components = {1080, 2520}}}, <mozilla::
    gfx::UnknownUnits> = {<No data fields>}, <No data fields>}
(gdb) n
278   const bool readNeedsUnlock = (mRead && SharedSurf());
(gdb) n
283   surf->LockProd();
(gdb) n
285   if (mRead && size == Size()) {
(gdb) p mRead.mTuple.mFirstA
$20 = (mozilla::gl::ReadBuffer *) 0x0
(gdb) n
289   UniquePtr<ReadBuffer> read = ReadBuffer::Create(mFactory->mDesc.gl, surf);
(gdb) n
291   if (!read) {
(gdb) n
292   surf->UnlockProd();
(gdb) p read.mTuple.mFirstA
$22 = (mozilla::gl::ReadBuffer *) 0x0
(gdb)
So read is null and that must be because ReadBuffer::Create() is returning null. We need to find out why and the same drill applies: place a breakpoint on ReadBuffer::Create() and step through the code to try to figure out where it's failing.
Once again, we hit all of the breakpoints in order before we get there. There are more of them this time:
Thread 37 "Compositor" hit Breakpoint 3, mozilla::layers::
    CompositorOGL::CreateContext (this=this@entry=0x7ee0002f10)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227   already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
(gdb) c
Continuing.
Thread 37 "Compositor" hit Breakpoint 4, mozilla::gl::
    GLContextProviderEGL::CreateOffscreen (size=...,
    flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE,
    out_failureId=out_failureId@entry=0x7f1f98d1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1260
1260  CreateContextFlags flags, nsACString* const out_failureId) {
(gdb) c
Continuing.
Thread 37 "Compositor" hit Breakpoint 5, mozilla::gl::GLScreenBuffer::
    Resize (this=0x7ee0003bc0, size=...) at gfx/gl/GLScreenBuffer.cpp:339
339   bool GLScreenBuffer::Resize(const gfx::IntSize& size) {
(gdb) c
Continuing.
Thread 37 "Compositor" hit Breakpoint 6, mozilla::gl::GLScreenBuffer::
    Attach (this=this@entry=0x7ee0003bc0, surf=0x7ee01a1ca0, size=...)
    at gfx/gl/GLScreenBuffer.cpp:275
275   bool GLScreenBuffer::Attach(SharedSurface* surf, const gfx::IntSize& size) {
(gdb) c
Continuing.
Thread 37 "Compositor" hit Breakpoint 7, mozilla::gl::ReadBuffer::
    Create (gl=0x7ee019ee60, surf=surf@entry=0x7ee01a1ca0)
    at gfx/gl/GLScreenBuffer.cpp:358
358   UniquePtr<ReadBuffer> ReadBuffer::Create(GLContext* gl, SharedSurface* surf) {
Now we're in the right place. Let's step through the ReadBuffer::Create() method to find out what's going wrong.
358   UniquePtr<ReadBuffer> ReadBuffer::Create(GLContext* gl, SharedSurface* surf) {
(gdb) n
361   GLContext::LocalErrorScope localError(*gl);
(gdb) n
366   colorTex = surf->ProdTexture();
(gdb) n
367   target = surf->ProdTextureTarget();
(gdb) n
370   GLuint fb = 0;
(gdb) n
371   gl->fGenFramebuffers(1, &fb);
(gdb) n
372   gl->AttachBuffersToFB(colorTex, 0, 0, 0, fb, target);
(gdb) p fb
$38 = 2
(gdb) p target
$39 = 3553
(gdb) n
374   UniquePtr<ReadBuffer> ret(new ReadBuffer(gl, fb, 0, 0, surf));
(gdb) n
376   GLenum err = localError.GetError();
(gdb) p ret.mTuple.mFirstA
$41 = (mozilla::gl::ReadBuffer *) 0x7ee01a16e0
(gdb) n
378   if (err) return nullptr;
(gdb) p err
$42 = 0
(gdb) n
381   if (needsAcquire) {
(gdb) p needsAcquire
$43 = 255
(gdb) n
382   surf->ProducerReadAcquire();
(gdb) n
384   const bool isComplete = gl->IsFramebufferComplete(fb);
(gdb) n
386   surf->ProducerReadRelease();
(gdb) p isComplete
$44 = false
(gdb) set isComplete=true
(gdb) p isComplete
$45 = true
(gdb) n
389   if (!isComplete) return nullptr;
(gdb) n
374   UniquePtr<ReadBuffer> ret(new ReadBuffer(gl, fb, 0, 0, surf));
(gdb) n
361   GLContext::LocalErrorScope localError(*gl);
(gdb) c
Continuing.
Starting at the top, we can see that it's creating the buffers. It generates a framebuffer, then a ReadBuffer; this all appears to work fine. Then it acquires the buffer. Then things start to go wrong. A check is made to see whether the framebuffer is complete, but the result comes back negative.
This results in the overall method returning early with an error. Is it a real error though? It might be interesting to find out what happens if we force the code to ignore the error and continue on regardless.
But when I forcefully clear the error value using the debugger, I still don't get rendering. There's no crash, but there's also no web page: just a blank screen.
So the question I want to now answer is "Why is IsFramebufferComplete() returning false?". It's worth checking what the method does.
bool GLContext::IsFramebufferComplete(GLuint fb, GLenum* pStatus) {
  MOZ_ASSERT(fb);
  ScopedBindFramebuffer autoFB(this, fb);
  MOZ_GL_ASSERT(this, fIsFramebuffer(fb));
  GLenum status = fCheckFramebufferStatus(LOCAL_GL_FRAMEBUFFER);
  if (pStatus) *pStatus = status;
  return status == LOCAL_GL_FRAMEBUFFER_COMPLETE;
}
The call to the ScopedBindFramebuffer constructor starts by recording the currently bound framebuffers using the Init() method (the two-argument overload used here then goes on to bind fb):
ScopedBindFramebuffer::ScopedBindFramebuffer(GLContext* aGL) : mGL(aGL) {
  Init();
}
/* ScopedBindFramebuffer - Saves and restores with GetUserBoundFB and
 * BindUserFB. */
void ScopedBindFramebuffer::Init() {
  if (mGL->IsSupported(GLFeature::split_framebuffer)) {
    mOldReadFB = mGL->GetReadFB();
    mOldDrawFB = mGL->GetDrawFB();
  } else {
    mOldReadFB = mOldDrawFB = mGL->GetFB();
  }
}
So IsFramebufferComplete() is essentially calling fCheckFramebufferStatus() on the bound framebuffer. Maybe we can find out what the error status is from the value returned by fCheckFramebufferStatus(). That might help. Back to the top of the method we go.
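As an aside, the save-and-restore pattern behind ScopedBindFramebuffer can be sketched in isolation. This is a simplified illustration under stated assumptions, not the Gecko class: ToyContext and ToyScopedBind are hypothetical names, and only a single combined binding is tracked rather than separate read and draw framebuffers.

```cpp
#include <cassert>

// Hypothetical stand-in for GLContext: tracks one bound framebuffer name.
struct ToyContext {
  unsigned boundFB = 0;
  unsigned GetFB() const { return boundFB; }
  void BindFB(unsigned fb) { boundFB = fb; }
};

// Saves the current binding on construction, optionally binds a new
// framebuffer, and restores the saved binding when it goes out of scope.
class ToyScopedBind {
 public:
  explicit ToyScopedBind(ToyContext* gl) : mGL(gl), mOldFB(gl->GetFB()) {}
  ToyScopedBind(ToyContext* gl, unsigned newFB) : mGL(gl), mOldFB(gl->GetFB()) {
    mGL->BindFB(newFB);
  }
  ~ToyScopedBind() { mGL->BindFB(mOldFB); }
  ToyScopedBind(const ToyScopedBind&) = delete;
  ToyScopedBind& operator=(const ToyScopedBind&) = delete;

 private:
  ToyContext* const mGL;
  const unsigned mOldFB;
};

// Returns the framebuffer that is bound while the guard is alive.
inline unsigned boundDuring(ToyContext& gl, unsigned fb) {
  assert(fb != 0);  // mirrors the MOZ_ASSERT(fb) in IsFramebufferComplete()
  ToyScopedBind autoFB(&gl, fb);
  return gl.GetFB();
}
```

Calling boundDuring(gl, 2) on a context that had framebuffer 7 bound returns 2, and the context is back on 7 once the call returns, which is why a status check made inside the scope applies to fb rather than to whatever was bound before.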
Thread 37 "Compositor" hit Breakpoint 7, mozilla::gl::ReadBuffer::
    Create (gl=0x7ee019ee40, surf=surf@entry=0x7ee01a1c80)
    at gfx/gl/GLScreenBuffer.cpp:358
358   UniquePtr<ReadBuffer> ReadBuffer::Create(GLContext* gl, SharedSurface* surf) {
(gdb) n
361   GLContext::LocalErrorScope localError(*gl);
(gdb) n
366   colorTex = surf->ProdTexture();
(gdb) n
367   target = surf->ProdTextureTarget();
(gdb) n
370   GLuint fb = 0;
(gdb) n
371   gl->fGenFramebuffers(1, &fb);
(gdb) n
372   gl->AttachBuffersToFB(colorTex, 0, 0, 0, fb, target);
(gdb) n
Thread 37 "Compositor" hit Breakpoint 11, mozilla::gl::
    ScopedBindFramebuffer::ScopedBindFramebuffer (this=0x7f1f9db048,
    aGL=0x7ee019ee40, aNewFB=2) at gfx/gl/ScopedGLHelpers.cpp:60
60    ScopedBindFramebuffer::ScopedBindFramebuffer(GLContext* aGL, GLuint aNewFB)
(gdb) n
62    Init();
(gdb) n
63    mGL->BindFB(aNewFB);
(gdb) n
mozilla::gl::GLContext::AttachBuffersToFB (this=this@entry=0x7ee019ee40,
    colorTex=colorTex@entry=0, colorRB=colorRB@entry=0,
    depthRB=depthRB@entry=0, stencilRB=stencilRB@entry=0, fb=<optimized out>,
    target=target@entry=3553) at gfx/gl/GLContext.cpp:1755
1755  if (colorTex) {
(gdb) n
1761  } else if (colorRB) {
(gdb) n
1769  if (depthRB) {
(gdb) n
1776  if (stencilRB) {
(gdb) n
mozilla::gl::ReadBuffer::Create (gl=0x7ee019ee40, surf=surf@entry=0x7ee01a1c80)
    at gfx/gl/GLScreenBuffer.cpp:374
374   UniquePtr<ReadBuffer> ret(new ReadBuffer(gl, fb, 0, 0, surf));
(gdb) n
376   GLenum err = localError.GetError();
(gdb) n
378   if (err) return nullptr;
(gdb) n
381   if (needsAcquire) {
(gdb) n
382   surf->ProducerReadAcquire();
(gdb) n
384   const bool isComplete = gl->IsFramebufferComplete(fb);
(gdb) s
Thread 37 "Compositor" hit Breakpoint 9, mozilla::gl::GLContext::
    IsFramebufferComplete (this=this@entry=0x7ee019ee40, fb=2,
    pStatus=pStatus@entry=0x0) at gfx/gl/GLContext.cpp:1734
1734  bool GLContext::IsFramebufferComplete(GLuint fb, GLenum* pStatus) {
(gdb) n
1737  ScopedBindFramebuffer autoFB(this, fb);
(gdb) s
Thread 37 "Compositor" hit Breakpoint 11, mozilla::gl::
    ScopedBindFramebuffer::ScopedBindFramebuffer (this=0x7f1f9db048,
    aGL=0x7ee019ee40, aNewFB=2) at gfx/gl/ScopedGLHelpers.cpp:60
60    ScopedBindFramebuffer::ScopedBindFramebuffer(GLContext* aGL, GLuint aNewFB)
(gdb) n
62    Init();
(gdb) p mOldReadFB
$46 = 530428000
(gdb) p mOldDrawFB
$47 = 127
(gdb) n
63    mGL->BindFB(aNewFB);
(gdb) p aNewFB
$48 = 2
(gdb) p mGL
$49 = (mozilla::gl::GLContext * const) 0x7ee019ee40
(gdb) n
mozilla::gl::GLContext::IsFramebufferComplete (this=this@entry=0x7ee019ee40,
    fb=<optimized out>, pStatus=pStatus@entry=0x0)
    at gfx/gl/GLContext.cpp:1740
1740  GLenum status = fCheckFramebufferStatus(LOCAL_GL_FRAMEBUFFER);
(gdb) n
1741  if (pStatus) *pStatus = status;
(gdb) p status
$50 = 36055
(gdb) p/x status
$51 = 0x8cd7
(gdb)
So we have an error value coming back from fCheckFramebufferStatus() of 0x8cd7. Checking the documentation and the GLConsts.h file, we can see that the error returned is the following:
#define LOCAL_GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT 0x8CD7
And checking the docs, they tell us that:
GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT is returned if the framebuffer does not have at least one image attached to it.
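Looking these values up by hand gets tedious, so here is a small sketch of a decoder for the raw numbers the debugger prints. The helper names framebufferStatusName() and textureTargetName() are made up for illustration; the numeric values are the standard OpenGL ES 2.0 ones (plus the OES external texture target), and only a handful are covered.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Maps a glCheckFramebufferStatus() result to its symbolic name.
inline std::string framebufferStatusName(std::uint32_t status) {
  switch (status) {
    case 0x8CD5: return "GL_FRAMEBUFFER_COMPLETE";
    case 0x8CD6: return "GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT";
    case 0x8CD7: return "GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT";
    case 0x8CD9: return "GL_FRAMEBUFFER_INCOMPLETE_DIMENSIONS";
    case 0x8CDD: return "GL_FRAMEBUFFER_UNSUPPORTED";
    default: return "unknown";
  }
}

// Maps a texture target value to its symbolic name (two targets only).
inline std::string textureTargetName(std::uint32_t target) {
  switch (target) {
    case 0x0DE1: return "GL_TEXTURE_2D";
    case 0x8D65: return "GL_TEXTURE_EXTERNAL_OES";
    default: return "unknown";
  }
}

// Usage: the two decimal values printed in the traces above.
inline void example() {
  assert(framebufferStatusName(36055) ==
         "GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT");
  assert(textureTargetName(3553) == "GL_TEXTURE_2D");
}
```

So the status of 36055 is the missing attachment error, and the target value of 3553 printed earlier in ReadBuffer::Create() is a plain GL_TEXTURE_2D.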
I'm going to look into this further, but it's the end of the day here, so I'll have to pick this up again tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
28 Jun 2024 : Day 272 #
We're getting a crash with the WebView because the GLScreenBuffer isn't defined before it's first accessed. We've been here before, but the code has changed and I need to go through the process of figuring out why, again.
I've done a partial build and hooked everything together. Now when I run this I get interesting results. Running the browser gives good results: rendering works and the WebGL is functional as well.
Running the WebView is a bit more mixed: there's no longer a crash (yay!) but rendering is broken (boo!). The errors generated look really serious given all the wailing and gnashing of teeth coming from the debug output, but in practice they're not blocking execution or triggering a crash. Here's the console output:
The easiest way to find out why it's doing this will be to step through the code, but I can't do that using a binary from a partial build; I'll need to run a full build first.
So I've set the build going. With any luck it'll be completed by morning.
Despite this error I'm excited by these developments. The WebGL remains intact and the WebView isn't crashing. I feel like I'm homing in on where the problem is, and that makes me hopeful that we'll be able to reach a solution soon.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Finding out shouldn't be difficult, just a little laborious. I need to install the old working packages, place a breakpoint on the GLScreenBuffer constructor, run the code and wait for a backtrace to magically appear. Here's the result:
Thread 38 "Compositor" hit Breakpoint 1, mozilla::gl::GLScreenBuffer::
    GLScreenBuffer (this=0x7edc003900, gl=0x7edc19aa50, factory=...)
    at gfx/gl/GLScreenBuffer.cpp:183
183   GLScreenBuffer::GLScreenBuffer(GLContext* gl, UniquePtr<SurfaceFactory> factory)
(gdb) bt
#0  mozilla::gl::GLScreenBuffer::GLScreenBuffer (this=0x7edc003900,
    gl=0x7edc19aa50, factory=...) at gfx/gl/GLScreenBuffer.cpp:183
#1  0x0000007ff1104e00 in mozilla::gl::GLScreenBuffer::Create (
    gl=gl@entry=0x7edc19aa50, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#2  0x0000007ff112f49c in mozilla::gl::GLContext::CreateScreenBuffer (
    this=this@entry=0x7edc19aa50, size=...) at gfx/gl/GLContext.cpp:2076
#3  0x0000007ff112f51c in mozilla::gl::GLContext::InitOffscreen (
    this=this@entry=0x7edc19aa50, size=...) at gfx/gl/GLContext.cpp:2346
#4  0x0000007ff112f65c in mozilla::gl::GLContextProviderEGL::CreateOffscreen (
    size=...,
    flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE,
    out_failureId=out_failureId@entry=0x7f1f7471c8)
    at gfx/gl/GLContextProviderEGL.cpp:1462
#5  0x0000007ff11982fc in mozilla::layers::CompositorOGL::CreateContext (
    this=this@entry=0x7edc002f10) at gfx/layers/opengl/CompositorOGL.cpp:250
#6  0x0000007ff11adad4 in mozilla::layers::CompositorOGL::Initialize (
    this=0x7edc002f10, out_failureReason=0x7f1f747520)
    at gfx/layers/opengl/CompositorOGL.cpp:387
#7  0x0000007ff12c3850 in mozilla::layers::CompositorBridgeParent::
    NewCompositor (this=this@entry=0x7fc4b7a7e0, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#8  0x0000007ff12ce8cc in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7fc4b7a7e0, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#9  0x0000007ff12ce9fc in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7fc4b7a7e0,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#10 0x0000007ff3665628 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4b7a7e0, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#11 0x0000007ff0c5e490 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7fc4b7a7e0, msg__=...)
    at PCompositorBridgeParent.cpp:1285
[...]
#26 0x0000007ff6a0289c in ?? () from /lib64/libc.so.6
(gdb)
I've needed this backtrace so many times, I should probably have it printed on a poster to hang on my wall.
Next up I've installed my latest WebGL-broken version and have placed breakpoints on the top n methods that appear in the backtrace:
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   <PENDING>          GLContextProviderEGL::CreateOffscreen
2       breakpoint     keep y   <PENDING>          GLContext::InitOffscreen
3       breakpoint     keep y   <PENDING>          GLContext::CreateScreenBuffer
4       breakpoint     keep y   <PENDING>          GLScreenBuffer::Create
If I've chosen a list that's long enough, something at the bottom end will trigger a breakpoint, while the code at the top end will never be reached, all happening before the segfault occurs.
That will tell me the method I need to look in to find out why the GLScreenBuffer isn't being created. I don't expect the result to be a surprise, but going through this process is actually going to be quicker than me trying to remember.
First time around I got no hit... I guess I didn't cast my hook far enough. I'll need to add extra breakpoints and try again:
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   <PENDING>          GLScreenBuffer::GLScreenBuffer
2       breakpoint     keep y   <PENDING>          GLScreenBuffer::Create
3       breakpoint     keep y   <PENDING>          GLContext::CreateScreenBuffer
4       breakpoint     keep y   <PENDING>          GLContext::InitOffscreen
5       breakpoint     keep y   <PENDING>          GLContextProviderEGL::CreateOffscreen
6       breakpoint     keep y   <PENDING>          CompositorOGL::CreateContext
7       breakpoint     keep y   <PENDING>          CompositorOGL::Initialize
8       breakpoint     keep y   <PENDING>          CompositorBridgeParent::NewCompositor
9       breakpoint     keep y   <PENDING>          CompositorBridgeParent::
    InitializeLayerManager
With the extra layers added, now the result comes back positive:
Thread 38 "Compositor" hit Breakpoint 9, mozilla::layers::
    CompositorBridgeParent::InitializeLayerManager (
    this=this@entry=0x7fc4b01f70, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1432
1432  const nsTArray<LayersBackend>& aBackendHints) {
(gdb) c
Continuing.
Thread 38 "Compositor" hit Breakpoint 8, mozilla::layers::
    CompositorBridgeParent::NewCompositor (this=this@entry=0x7fc4b01f70,
    aBackendHints=...) at gfx/layers/ipc/CompositorBridgeParent.cpp:1455
1455  const nsTArray<LayersBackend>& aBackendHints) {
(gdb) c
Continuing.
Thread 38 "Compositor" hit Breakpoint 7, mozilla::layers::
    CompositorOGL::Initialize (this=0x7ed4002ed0, out_failureReason=0x7f1f9a5520)
    at gfx/layers/opengl/CompositorOGL.cpp:384
384   bool CompositorOGL::Initialize(nsCString* const out_failureReason) {
(gdb) c
Continuing.
Thread 38 "Compositor" hit Breakpoint 6, mozilla::layers::
    CompositorOGL::CreateContext (this=this@entry=0x7ed4002ed0)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227   already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
(gdb) c
Continuing.
=============== Preparing offscreen rendering context ===============
Thread 38 "Compositor" received signal SIGSEGV, Segmentation fault.
0x0000007ff366520c in mozilla::gl::GLScreenBuffer::Size (this=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290   ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory.
(gdb)
This tells us that CompositorOGL::CreateContext() is being called, but then that method is failing to call GLContextProviderEGL::CreateOffscreen(). Let's find out why. A diff on the source file immediately clarifies the reason:
$ git diff gfx/layers/opengl/CompositorOGL.cpp
diff --git a/gfx/layers/opengl/CompositorOGL.cpp b/gfx/layers/opengl/CompositorOGL.cpp
index 122709eaf2de..06c84a9ebdaa 100644
--- a/gfx/layers/opengl/CompositorOGL.cpp
+++ b/gfx/layers/opengl/CompositorOGL.cpp
@@ -247,9 +247,15 @@ already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
   // Allow to create offscreen GL context for main Layer Manager
   if (!context && gfxEnv::LayersPreferOffscreen()) {
     nsCString discardFailureId;
-    context = GLContextProvider::CreateOffscreen(
-        mSurfaceSize, CreateContextFlags::REQUIRE_COMPAT_PROFILE,
-        &discardFailureId);
+
+    context = GLContextProvider::CreateHeadless(
+        {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId);
+    if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) {
+      context = nullptr;
+    }
   }
   if (!context) {
Looking at this code and checking the diff of the other code I've removed since my last commit, I can see that the entire chain of calls at the top of the above backtrace is missing from the source. Here are the missing methods:
CreateOffscreen()
InitOffscreen()
CreateScreenBuffer()
GLScreenBuffer::Create()
That's because I removed all of them, hoping that GLContextProvider::CreateHeadless() would provide a suitable alternative. But if I'm going to use CreateHeadless() as an alternative I'm going to have to ensure a GLScreenBuffer instance is created somewhere in the process.
I now need to decide whether I'm going to restore the five missing methods, or alternatively try to extract their core functionality to be merged together into a single method to hook into the CreateHeadless() flow. This will require some thought and a bit of deeper investigation.
Let's summarise things. First, this is the code that's currently being executed.
context = GLContextProvider::CreateHeadless( {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId); if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) { context = nullptr; }The active ingredients here are CreateHeadless() and CreateOffscreenDefaultFb(). This replaces the following three lines which were in use before:
context = GLContextProvider::CreateOffscreen( mSurfaceSize, CreateContextFlags::REQUIRE_COMPAT_PROFILE, &discardFailureId);Here we can see that the active ingredient is CreateOffscreen(). We want to restore this functionality, so the next question it'd be helpful to know the answer to is exactly what CreateOffscreen() does.
We already know that it calls InitOffscreen() which calls CreateScreenBuffer() which calls GLScreenBuffer::Create(). So let's be explicit about this and take a look at all of those methods. Here they are, out of context and lined up in a row like ducks:
already_AddRefed<GLContext> GLContextProviderEGL::CreateOffscreen( const mozilla::gfx::IntSize& size, CreateContextFlags flags, nsACString* const out_failureId) { RefPtr<GLContext> gl; const GLContextCreateDesc desc{flags = flags}; gl = CreateHeadless(desc, out_failureId); if (!gl) { return nullptr; } // Init the offscreen with the updated offscreen caps. if (!gl->InitOffscreen(size)) { *out_failureId = "FEATURE_FAILURE_EGL_OFFSCREEN"_ns; return nullptr; } return gl.forget(); } bool GLContext::InitOffscreen(const gfx::IntSize& size) { if (!CreateScreenBuffer(size)) return false; if (!MakeCurrent()) { return false; } fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, 0); fScissor(0, 0, size.width, size.height); fViewport(0, 0, size.width, size.height); return true; } bool GLContext::CreateScreenBuffer(const IntSize& size) { if (!IsOffscreenSizeAllowed(size)) return false; UniquePtr<GLScreenBuffer> newScreen = GLScreenBuffer::Create(this, size); if (!newScreen) return false; if (!newScreen->Resize(size)) { return false; } // This will rebind to 0 (Screen) if needed when // it falls out of scope. ScopedBindFramebuffer autoFB(this); mScreen = std::move(newScreen); return true; }The neat thing is that almost all of the functionality being accessed by these has public accessibility. That means we can roll all three of these into a single method that we could, say, include as part of GLContextProviderEGL(). Here's the refactored functionality combined together:
already_AddRefed<GLContext> GLContextProviderEGL::CreateOffscreen( const mozilla::gfx::IntSize& size, CreateContextFlags flags, nsACString* const out_failureId) { RefPtr<GLContext> gl; gl = CreateHeadless({CreateContextFlags::REQUIRE_COMPAT_PROFILE}, out_failureId); // Init the offscreen with the updated offscreen caps. if (!gl || !gl->IsOffscreenSizeAllowed(size)) { return nullptr; } UniquePtr<GLScreenBuffer> newScreen = GLScreenBuffer::Create(gl, size); if ((!newScreen) || (!newScreen->Resize(size))) { return nullptr; } // This will rebind to 0 (Screen) if needed when // it falls out of scope. ScopedBindFramebuffer autoFB(gl); gl->mScreen = std::move(newScreen); if (!gl->MakeCurrent()) { return nullptr; } gl->fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, 0); gl->fScissor(0, 0, size.width, size.height); gl->fViewport(0, 0, size.width, size.height); return gl.forget(); }Now this compacted version is rather theoretical at this stage, I've just manually combined everything by inspection. It's easy to make mistakes with this kind of thing, so what I want to do now is try it to see whether this works.
I've done a partial build and hooked everything together. Now when I run this I get interesting results. Running the browser gives good results: rendering works and the WebGL is functional as well.
Running the WebView is a bit more mised: there's no longer a crash (yay!) but rendering is broken (boo!). The errors generated look really serious given all the wailing and gnashing of teeth coming from the debug output, but in practice they're not blocking execution or triggering a crash. Here's the console output:
Created LOG for EmbedLiteLayerManager Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create EGLConfig! (t=0.830283) [GFX1-]: Failed to create EGLConfig! Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create EGLConfig! (t=0.830283) |[1][GFX1-]: Failed to create EGLConfig! ( t=0.831443) [GFX1-]: Failed to create EGLConfig! Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create EGLConfig! (t=0.830283) |[1][GFX1-]: Failed to create EGLConfig! ( t=0.831443) |[2][GFX1-]: [OPENGL] Failed to init compositor with reason: FEATURE_FAILURE_OPENGL_CREATE_CONTEXT (t=0.832053) [GFX1-]: [OPENGL] Failed to init compositor with reason: FEATURE_FAILURE_OPENGL_CREATE_CONTEXT =============== Preparing offscreen rendering context ===============The important part of this output is the error reason of FEATURE_FAILURE_OPENGL_CREATE_CONTEXT. That suggets that our rolled up method is returning null when it should be returning a GLContext instance.
The easiest way to find out why it's doing this will be to step through the code, but I can't do that using a binary from a partial build, I'll need to run a full build first.
So I've set the build going. With any luck it'll be completed by morning.
Despite this error I'm excited by these developments. The WebGL remains intact and the WebView isn't crashing. I feel like I'm honing in on where the problem is, and that makes me hopefully that we'll be able to reach a solution soon.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
28 Jun 2024 : Day 272 #
We're getting a crash with the WebView because the GLScreenBuffer isn't defined before it's first accessed. We've been here before, but the code has changed and I need to go through the process of figuring out why, again.
Finding out shouldn't be difficult, just a little laborious. I need to install the old working packages, place a breakpoint on the GLScreenBuffer constructor, run the code and wait for a backtrace to magically appear. Here's the result:
    Thread 38 "Compositor" hit Breakpoint 1, mozilla::gl::GLScreenBuffer::
        GLScreenBuffer (this=0x7edc003900, gl=0x7edc19aa50, factory=...)
        at gfx/gl/GLScreenBuffer.cpp:183
    183     GLScreenBuffer::GLScreenBuffer(GLContext* gl,
        UniquePtr<SurfaceFactory> factory)
    (gdb) bt
    #0  mozilla::gl::GLScreenBuffer::GLScreenBuffer (this=0x7edc003900,
        gl=0x7edc19aa50, factory=...)
        at gfx/gl/GLScreenBuffer.cpp:183
    #1  0x0000007ff1104e00 in mozilla::gl::GLScreenBuffer::Create (
        gl=gl@entry=0x7edc19aa50, size=...)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
    #2  0x0000007ff112f49c in mozilla::gl::GLContext::CreateScreenBuffer (
        this=this@entry=0x7edc19aa50, size=...)
        at gfx/gl/GLContext.cpp:2076
    #3  0x0000007ff112f51c in mozilla::gl::GLContext::InitOffscreen (
        this=this@entry=0x7edc19aa50, size=...)
        at gfx/gl/GLContext.cpp:2346
    #4  0x0000007ff112f65c in mozilla::gl::GLContextProviderEGL::CreateOffscreen (
        size=...,
        flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE,
        out_failureId=out_failureId@entry=0x7f1f7471c8)
        at gfx/gl/GLContextProviderEGL.cpp:1462
    #5  0x0000007ff11982fc in mozilla::layers::CompositorOGL::CreateContext (
        this=this@entry=0x7edc002f10)
        at gfx/layers/opengl/CompositorOGL.cpp:250
    #6  0x0000007ff11adad4 in mozilla::layers::CompositorOGL::Initialize (
        this=0x7edc002f10, out_failureReason=0x7f1f747520)
        at gfx/layers/opengl/CompositorOGL.cpp:387
    #7  0x0000007ff12c3850 in mozilla::layers::CompositorBridgeParent::
        NewCompositor (this=this@entry=0x7fc4b7a7e0, aBackendHints=...)
        at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
    #8  0x0000007ff12ce8cc in mozilla::layers::CompositorBridgeParent::
        InitializeLayerManager (this=this@entry=0x7fc4b7a7e0, aBackendHints=...)
        at gfx/layers/ipc/CompositorBridgeParent.cpp:1436
    #9  0x0000007ff12ce9fc in mozilla::layers::CompositorBridgeParent::
        AllocPLayerTransactionParent (this=this@entry=0x7fc4b7a7e0,
        aBackendHints=..., aId=...)
        at gfx/layers/ipc/CompositorBridgeParent.cpp:1546
    #10 0x0000007ff3665628 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
        AllocPLayerTransactionParent (this=0x7fc4b7a7e0, aBackendHints=..., aId=...)
        at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
    #11 0x0000007ff0c5e490 in mozilla::layers::PCompositorBridgeParent::
        OnMessageReceived (this=0x7fc4b7a7e0, msg__=...)
        at PCompositorBridgeParent.cpp:1285
    [...]
    #26 0x0000007ff6a0289c in ?? () from /lib64/libc.so.6
    (gdb)
I've needed this backtrace so many times, I should probably have it printed on a poster to hang on my wall.
Next up I've installed my latest WebGL-broken version and have placed breakpoints on the top n methods that appear in the backtrace:
    (gdb) info break
    Num     Type           Disp Enb Address            What
    1       breakpoint     keep y   <PENDING>          GLContextProviderEGL::CreateOffscreen
    2       breakpoint     keep y   <PENDING>          GLContext::InitOffscreen
    3       breakpoint     keep y   <PENDING>          GLContext::CreateScreenBuffer
    4       breakpoint     keep y   <PENDING>          GLScreenBuffer::Create
If I've chosen a list that's long enough, something at the bottom end will trigger a breakpoint, while the code at the top end will never be reached, all happening before the segfault occurs.
That will tell me the method I need to look in to find out why the GLScreenBuffer isn't being created. I don't expect the result to be a surprise, but going through this process is actually going to be quicker than me trying to remember.
First time around I got no hit... I guess I didn't cast my hook far enough. I'll need to add extra breakpoints and try again:
    (gdb) info break
    Num     Type           Disp Enb Address            What
    1       breakpoint     keep y   <PENDING>          GLScreenBuffer::GLScreenBuffer
    2       breakpoint     keep y   <PENDING>          GLScreenBuffer::Create
    3       breakpoint     keep y   <PENDING>          GLContext::CreateScreenBuffer
    4       breakpoint     keep y   <PENDING>          GLContext::InitOffscreen
    5       breakpoint     keep y   <PENDING>          GLContextProviderEGL::CreateOffscreen
    6       breakpoint     keep y   <PENDING>          CompositorOGL::CreateContext
    7       breakpoint     keep y   <PENDING>          CompositorOGL::Initialize
    8       breakpoint     keep y   <PENDING>          CompositorBridgeParent::NewCompositor
    9       breakpoint     keep y   <PENDING>          CompositorBridgeParent::InitializeLayerManager
With the extra layers added, now the result comes back positive:
    Thread 38 "Compositor" hit Breakpoint 9, mozilla::layers::
        CompositorBridgeParent::InitializeLayerManager (
        this=this@entry=0x7fc4b01f70, aBackendHints=...)
        at gfx/layers/ipc/CompositorBridgeParent.cpp:1432
    1432        const nsTArray<LayersBackend>& aBackendHints) {
    (gdb) c
    Continuing.
    Thread 38 "Compositor" hit Breakpoint 8, mozilla::layers::
        CompositorBridgeParent::NewCompositor (this=this@entry=0x7fc4b01f70,
        aBackendHints=...)
        at gfx/layers/ipc/CompositorBridgeParent.cpp:1455
    1455        const nsTArray<LayersBackend>& aBackendHints) {
    (gdb) c
    Continuing.
    Thread 38 "Compositor" hit Breakpoint 7, mozilla::layers::
        CompositorOGL::Initialize (this=0x7ed4002ed0,
        out_failureReason=0x7f1f9a5520)
        at gfx/layers/opengl/CompositorOGL.cpp:384
    384     bool CompositorOGL::Initialize(nsCString* const out_failureReason) {
    (gdb) c
    Continuing.
    Thread 38 "Compositor" hit Breakpoint 6, mozilla::layers::
        CompositorOGL::CreateContext (this=this@entry=0x7ed4002ed0)
        at gfx/layers/opengl/CompositorOGL.cpp:227
    227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
    (gdb) c
    Continuing.
    =============== Preparing offscreen rendering context ===============
    Thread 38 "Compositor" received signal SIGSEGV, Segmentation fault.
    0x0000007ff366520c in mozilla::gl::GLScreenBuffer::Size (this=0x0)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
    290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
        No such file or directory.
    (gdb)
This tells us that CompositorOGL::CreateContext() is being called, but then that method is failing to call GLContextProviderEGL::CreateOffscreen(). Let's find out why. A diff on the source file immediately clarifies the reason:
    $ git diff gfx/layers/opengl/CompositorOGL.cpp
    diff --git a/gfx/layers/opengl/CompositorOGL.cpp b/gfx/layers/opengl/
        CompositorOGL.cpp
    index 122709eaf2de..06c84a9ebdaa 100644
    --- a/gfx/layers/opengl/CompositorOGL.cpp
    +++ b/gfx/layers/opengl/CompositorOGL.cpp
    @@ -247,9 +247,15 @@ already_AddRefed<mozilla::gl::GLContext> CompositorOGL::
        CreateContext() {
       // Allow to create offscreen GL context for main Layer Manager
       if (!context && gfxEnv::LayersPreferOffscreen()) {
         nsCString discardFailureId;
    -    context = GLContextProvider::CreateOffscreen(
    -        mSurfaceSize, CreateContextFlags::REQUIRE_COMPAT_PROFILE,
    -        &discardFailureId);
    +
    +    context = GLContextProvider::CreateHeadless(
    +        {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId);
    +    if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) {
    +      context = nullptr;
    +    }
       }
       if (!context) {
Looking at this code and checking the diff of the other code I've removed since my last commit, I can see that the entire chain of calls shown in the backtrace above is missing from the source. Here are the missing methods:
    CreateOffscreen()
    InitOffscreen()
    CreateScreenBuffer()
    GLScreenBuffer::Create()
That's because I removed all of them, hoping that GLContextProvider::CreateHeadless() would provide a suitable alternative. But if I'm going to use CreateHeadless() as an alternative I'm going to have to ensure a GLScreenBuffer instance is created somewhere in the process.
I now need to decide whether I'm going to restore the missing methods, or alternatively try to extract their core functionality and merge it into a single method that hooks into the CreateHeadless() flow. This will require some thought and a bit of deeper investigation.
Let's summarise things. First, this is the code that's currently being executed.
    context = GLContextProvider::CreateHeadless(
        {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId);
    if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) {
      context = nullptr;
    }
The active ingredients here are CreateHeadless() and CreateOffscreenDefaultFb(). This replaces the following three lines which were in use before:
    context = GLContextProvider::CreateOffscreen(
        mSurfaceSize, CreateContextFlags::REQUIRE_COMPAT_PROFILE,
        &discardFailureId);
Here we can see that the active ingredient is CreateOffscreen(). We want to restore this functionality, so the next question is exactly what CreateOffscreen() does.
We already know that it calls InitOffscreen() which calls CreateScreenBuffer() which calls GLScreenBuffer::Create(). So let's be explicit about this and take a look at all of those methods. Here they are, out of context and lined up in a row like ducks:
    already_AddRefed<GLContext> GLContextProviderEGL::CreateOffscreen(
        const mozilla::gfx::IntSize& size,
        CreateContextFlags flags, nsACString* const out_failureId) {
      RefPtr<GLContext> gl;
      const GLContextCreateDesc desc{flags = flags};
      gl = CreateHeadless(desc, out_failureId);
      if (!gl) {
        return nullptr;
      }
      // Init the offscreen with the updated offscreen caps.
      if (!gl->InitOffscreen(size)) {
        *out_failureId = "FEATURE_FAILURE_EGL_OFFSCREEN"_ns;
        return nullptr;
      }
      return gl.forget();
    }

    bool GLContext::InitOffscreen(const gfx::IntSize& size) {
      if (!CreateScreenBuffer(size)) return false;
      if (!MakeCurrent()) {
        return false;
      }
      fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, 0);
      fScissor(0, 0, size.width, size.height);
      fViewport(0, 0, size.width, size.height);
      return true;
    }

    bool GLContext::CreateScreenBuffer(const IntSize& size) {
      if (!IsOffscreenSizeAllowed(size)) return false;
      UniquePtr<GLScreenBuffer> newScreen =
          GLScreenBuffer::Create(this, size);
      if (!newScreen) return false;
      if (!newScreen->Resize(size)) {
        return false;
      }
      // This will rebind to 0 (Screen) if needed when
      // it falls out of scope.
      ScopedBindFramebuffer autoFB(this);
      mScreen = std::move(newScreen);
      return true;
    }
The neat thing is that almost all of the functionality being accessed by these is publicly accessible. That means we can roll all three of these into a single method that we could, say, include as part of GLContextProviderEGL. Here's the functionality combined into a single method:
    already_AddRefed<GLContext> GLContextProviderEGL::CreateOffscreen(
        const mozilla::gfx::IntSize& size,
        CreateContextFlags flags, nsACString* const out_failureId) {
      RefPtr<GLContext> gl;
      gl = CreateHeadless({CreateContextFlags::REQUIRE_COMPAT_PROFILE},
          out_failureId);
      // Init the offscreen with the updated offscreen caps.
      if (!gl || !gl->IsOffscreenSizeAllowed(size)) {
        return nullptr;
      }
      UniquePtr<GLScreenBuffer> newScreen = GLScreenBuffer::Create(gl, size);
      if ((!newScreen) || (!newScreen->Resize(size))) {
        return nullptr;
      }
      // This will rebind to 0 (Screen) if needed when
      // it falls out of scope.
      ScopedBindFramebuffer autoFB(gl);
      gl->mScreen = std::move(newScreen);
      if (!gl->MakeCurrent()) {
        return nullptr;
      }
      gl->fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, 0);
      gl->fScissor(0, 0, size.width, size.height);
      gl->fViewport(0, 0, size.width, size.height);
      return gl.forget();
    }
Now this compacted version is rather theoretical at this stage: I've just manually combined everything by inspection. It's easy to make mistakes with this kind of thing, so what I want to do now is try it to see whether it works.
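The shape of this rolling-up can be tried out in isolation. Here's a minimal sketch using stand-in types: FakeContext, FakeScreenBuffer, CreateHeadlessSketch() and CreateOffscreenSketch() are all hypothetical names invented for illustration, not the real Gecko classes, and the size limit is an assumed value. It just shows each check from the three original methods becoming an early return, with the screen buffer attached only once it has been created and resized successfully.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-ins for illustration only; not the Gecko classes.
struct Size {
  int width;
  int height;
};

struct FakeScreenBuffer {
  Size size{0, 0};
  // Rejects empty sizes, otherwise records the new size.
  bool Resize(const Size& s) {
    if (s.width <= 0 || s.height <= 0) return false;
    size = s;
    return true;
  }
};

struct FakeContext {
  std::unique_ptr<FakeScreenBuffer> screen;  // plays the role of mScreen
  int maxSize = 4096;                        // assumed limit for the sketch
  bool IsOffscreenSizeAllowed(const Size& s) const {
    return s.width <= maxSize && s.height <= maxSize;
  }
  bool MakeCurrent() { return true; }
};

// Plays the role of CreateHeadless(): a context with no screen buffer.
std::unique_ptr<FakeContext> CreateHeadlessSketch() {
  return std::make_unique<FakeContext>();
}

// The three chained methods flattened into one, using early returns.
std::unique_ptr<FakeContext> CreateOffscreenSketch(const Size& size) {
  auto gl = CreateHeadlessSketch();
  if (!gl || !gl->IsOffscreenSizeAllowed(size)) return nullptr;

  auto newScreen = std::make_unique<FakeScreenBuffer>();
  if (!newScreen->Resize(size)) return nullptr;

  gl->screen = std::move(newScreen);
  if (!gl->MakeCurrent()) return nullptr;
  return gl;
}
```

Calling CreateHeadlessSketch() on its own leaves screen empty, which mirrors the situation that led to the segfault, whereas CreateOffscreenSketch() either returns a context with a screen buffer attached or returns nothing at all.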
I've done a partial build and hooked everything together. Now when I run this I get interesting results. Running the browser gives good results: rendering works and the WebGL is functional as well.
Running the WebView is a bit more mixed: there's no longer a crash (yay!) but rendering is broken (boo!). The errors generated look really serious given all the wailing and gnashing of teeth coming from the debug output, but in practice they're not blocking execution or triggering a crash. Here's the console output:
    Created LOG for EmbedLiteLayerManager
    Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create
        EGLConfig! (t=0.830283) [GFX1-]: Failed to create EGLConfig!
    Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create
        EGLConfig! (t=0.830283) |[1][GFX1-]: Failed to create EGLConfig! (
        t=0.831443) [GFX1-]: Failed to create EGLConfig!
    Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create
        EGLConfig! (t=0.830283) |[1][GFX1-]: Failed to create EGLConfig! (
        t=0.831443) |[2][GFX1-]: [OPENGL] Failed to init compositor with reason:
        FEATURE_FAILURE_OPENGL_CREATE_CONTEXT (t=0.832053) [GFX1-]: [OPENGL]
        Failed to init compositor with reason:
        FEATURE_FAILURE_OPENGL_CREATE_CONTEXT
    =============== Preparing offscreen rendering context ===============
The important part of this output is the error reason of FEATURE_FAILURE_OPENGL_CREATE_CONTEXT. That suggests that our rolled up method is returning null when it should be returning a GLContext instance.
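One thing the original CreateOffscreen() did that the rolled up version doesn't is write to out_failureId when a step fails. Here's a sketch of how tagging each early return might narrow down the failing step from the log alone. It's purely illustrative: StepResults and CreateOffscreenTagged() are hypothetical, and every failure string is invented for the sketch (only FEATURE_FAILURE_EGL_OFFSCREEN appears in the real code above).

```cpp
#include <cassert>
#include <string>

// Hypothetical outcome of each step in the rolled up method. In the real
// code these would come from CreateHeadless(), IsOffscreenSizeAllowed(),
// GLScreenBuffer::Create(), Resize() and MakeCurrent().
struct StepResults {
  bool headless = true;
  bool sizeAllowed = true;
  bool screenCreated = true;
  bool resized = true;
  bool madeCurrent = true;
};

// Returns true on success. On failure, records which step failed first.
// Every failure string below is invented for this sketch.
bool CreateOffscreenTagged(const StepResults& r, std::string* outFailureId) {
  if (!r.headless) {
    *outFailureId = "SKETCH_FAILURE_HEADLESS";
    return false;
  }
  if (!r.sizeAllowed) {
    *outFailureId = "SKETCH_FAILURE_SIZE";
    return false;
  }
  if (!r.screenCreated) {
    *outFailureId = "SKETCH_FAILURE_SCREEN_CREATE";
    return false;
  }
  if (!r.resized) {
    *outFailureId = "SKETCH_FAILURE_RESIZE";
    return false;
  }
  if (!r.madeCurrent) {
    *outFailureId = "SKETCH_FAILURE_MAKE_CURRENT";
    return false;
  }
  return true;
}
```

With something like this in place the failure reason would point at a specific step, although it wouldn't explain why that step failed.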
The easiest way to find out why it's doing this will be to step through the code, but I can't do that using a binary from a partial build. I'll need to run a full build first.
So I've set the build going. With any luck it'll be completed by morning.
Despite this error I'm excited by these developments. The WebGL remains intact and the WebView isn't crashing. I feel like I'm homing in on where the problem is, and that makes me hopeful that we'll be able to reach a solution soon.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
27 Jun 2024 : Day 271 #
It's been just a short day of development today. After discovering the source of the browser crashes yesterday, I've been collecting evidence to try to find out where the WebView rendering path and the WebGL rendering path need to diverge.
To do this, I'm going to add breakpoints to the SwapChain and GLScreenBuffer constructors. Why these? Because SwapChain is what allows WebGL to work while breaking the WebView, whereas GLScreenBuffer allows the WebView to work but breaks WebGL. Somewhere between these two lies a sweet spot that we have to inhabit.
First up, the breakpoints for sailfish-browser:
    (gdb) info break
    Num     Type           Disp Enb Address            What
    1       breakpoint     keep y   <PENDING>          SwapChain::SwapChain
    2       breakpoint     keep y   <PENDING>          GLScreenBuffer::GLScreenBuffer
Executing this and showing a page that renders some WebGL content, we quickly hit the first of these breakpoints:
    Thread 8 "GeckoWorkerThre" hit Breakpoint 1, mozilla::gl::SwapChain::
        SwapChain (this=0x7fc95168f8)
        at gfx/gl/GLScreenBuffer.h:63
    63      class SwapChain final {
    (gdb) bt
    #0  mozilla::gl::SwapChain::SwapChain (this=0x7fc95168f8)
        at gfx/gl/GLScreenBuffer.h:63
    #1  0x0000007ff3699528 in mozilla::WebGLContext::WebGLContext (
        this=0x7fc9516460, host=..., desc=...)
        at include/c++/8.3.0/bits/move.h:74
    #2  0x0000007ff36a8c6c in mozilla::WebGLContext::<lambda()>::operator() (
        __closure=<optimized out>)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
    #3  mozilla::WebGLContext::Create (host=..., desc=...,
        out=out@entry=0x7fc9516068)
        at dom/canvas/WebGLContext.cpp:562
    #4  0x0000007ff36608fc in mozilla::HostWebGLContext::Create (ownerData=...,
        desc=..., out=out@entry=0x7fc9516068)
        at dom/canvas/HostWebGLContext.cpp:59
    #5  0x0000007ff3690350 in mozilla::ClientWebGLContext::<lambda()>::operator() (
        __closure=<optimized out>)
        at dom/canvas/ClientWebGLContext.cpp:625
    #6  mozilla::ClientWebGLContext::CreateHostContext (
        this=this@entry=0x7ee01d6150, requestedSize=...)
        at dom/canvas/ClientWebGLContext.cpp:654
    #7  0x0000007ff3690e38 in mozilla::ClientWebGLContext::SetDimensions (
        this=0x7ee01d6150, signedWidth=<optimized out>,
        signedHeight=<optimized out>)
        at dom/canvas/ClientWebGLContext.cpp:563
    #8  0x0000007ff362a1ec in mozilla::dom::CanvasRenderingContextHelper::
        UpdateContext (this=0x7fc94782c0, aCx=<optimized out>,
        aNewContextOptions=..., aRvForDictionaryInit=...)
        at dom/canvas/CanvasRenderingContextHelper.cpp:238
    #9  0x0000007ff36392b8 in mozilla::dom::CanvasRenderingContextHelper::
        GetContext (this=this@entry=0x7fc94782c0, aCx=0x7fc8222fa0,
        aContextId=..., aContextOptions=..., aRv=...)
        at dom/canvas/CanvasRenderingContextHelper.cpp:190
    #10 0x0000007ff390aef4 in mozilla::dom::HTMLCanvasElement::GetContext (
        this=this@entry=0x7fc9478240, aCx=aCx@entry=0x7fc8222fa0,
        aContextId=..., aContextOptions=aContextOptions@entry=..., aRv=...)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/Value.h:670
    #11 0x0000007ff35486d4 in mozilla::dom::HTMLCanvasElement_Binding::getContext (
        cx=0x7fc8222fa0, obj=..., void_self=0x7fc9478240, args=...)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/RootingAPI.h:1297
    #12 0x0000007ff35dfb5c in mozilla::dom::binding_detail::GenericMethod<mozilla::
        dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::
        ThrowExceptions> (cx=0x7fc8222fa0, argc=<optimized out>,
        vp=<optimized out>)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/CallArgs.h:207
    [..]
    #63 0x0000007fefbae89c in ?? () from /lib64/libc.so.6
    (gdb)
Next up, the same arrangement with harbour-webview. Here are the breakpoints again:
    (gdb) info break
    Num     Type           Disp Enb Address            What
    1       breakpoint     keep y   <PENDING>          SwapChain::SwapChain
    2       breakpoint     keep y   <PENDING>          GLScreenBuffer::GLScreenBuffer
This time it's the GLScreenBuffer code that gets hit. This is what I was expecting, although I wasn't expecting it to be the GLScreenBuffer::Size() method. In fact, on closer inspection it's not our breakpoint being hit at all, but rather a segmentation fault that happens to relate to the GLScreenBuffer code:
    Thread 38 "Compositor" received signal SIGSEGV, Segmentation fault.
    [Switching to LWP 14178]
    0x0000007ff366520c in mozilla::gl::GLScreenBuffer::Size (this=0x0)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
    290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
        No such file or directory.
    (gdb) bt
    #0  0x0000007ff366520c in mozilla::gl::GLScreenBuffer::Size (this=0x0)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
    #1  mozilla::embedlite::EmbedLiteCompositorBridgeParent::
        CompositeToDefaultTarget (this=0x7fc4b7ede0, aId=...)
        at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:151
    #2  0x0000007ff12b5448 in mozilla::layers::CompositorVsyncScheduler::
        ForceComposeToTarget (this=0x7fc4c37f90, aTarget=aTarget@entry=0x0,
        aRect=aRect@entry=0x0)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/LayersTypes.h:
        82
    #3  0x0000007ff12b54a4 in mozilla::layers::CompositorBridgeParent::
        ResumeComposition (this=this@entry=0x7fc4b7ede0)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
    #4  0x0000007ff12b5530 in mozilla::layers::CompositorBridgeParent::
        ResumeCompositionAndResize (this=0x7fc4b7ede0, x=<optimized out>,
        y=<optimized out>, width=<optimized out>, height=<optimized out>)
        at gfx/layers/ipc/CompositorBridgeParent.cpp:794
    #5  0x0000007ff12ae0cc in mozilla::detail::RunnableMethodArguments<int, int,
        int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void (mozilla:
        :layers::CompositorBridgeParent::*)(int, int, int, int),
        StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
        StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul, 2ul,
        3ul> (args=..., m=<optimized out>, o=<optimized out>)
        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
    [...]
    #17 0x0000007ff6a0289c in ?? () from /lib64/libc.so.6
    (gdb)
The next step has to be to fix this segmentation fault, which looks like it's happening precisely because GLScreenBuffer hasn't been constructed yet. But that'll have to be a task for tomorrow now.
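The this=0x0 in frame #0 is the telltale sign of a method being called through an empty smart pointer. As a minimal sketch of how that arises, and how a guard would turn the crash into a checkable failure, here's a cut-down model. It uses std::unique_ptr in place of mozilla::UniquePtr, and ScreenSketch, ContextSketch and OffscreenSize() are hypothetical names, so this is an assumption about the general pattern rather than a copy of what the code at EmbedLiteCompositorBridgeParent.cpp:151 actually does.

```cpp
#include <cassert>
#include <memory>

struct Size {
  int width;
  int height;
};

// Hypothetical stand-in for GLScreenBuffer.
struct ScreenSketch {
  Size size;
};

// Hypothetical stand-in for the context that owns the screen buffer.
struct ContextSketch {
  // Empty until something explicitly creates the screen buffer.
  std::unique_ptr<ScreenSketch> screen;

  // Calling screen->size directly while screen is empty would be
  // undefined behaviour (in practice, a segfault with this=0x0).
  // This guarded accessor reports the missing buffer instead.
  bool OffscreenSize(Size* out) const {
    if (!screen) return false;
    *out = screen->size;
    return true;
  }
};
```

A guard like this wouldn't fix the underlying problem, since the screen buffer still needs to be created somewhere, but it does make the missing initialisation visible without a crash.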
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
26 Jun 2024 : Day 270 #
I've been frustrated now for a while because I don't understand how the same libxul.so library can work when installed on top of one set of packages, but then break when installed on top of another set of packages. I find it baffling. Originally I thought it might be to do with API changes causing dynamic linking to fail against the browser or QtMozEmbed code. But I know this can't be the case, because the sailfish-browser binary and QtMozEmbed libraries are staying the same. They both access libxul.so directly, so there can't be anything else in the xulrunner packages that are changing their relationship to them either.
Yesterday I ran the various different arrangements, including using the harbour-webview app, which failed. But when I checked the backtrace none of the method names were defined. That's because I'd copied the new libxul.so library in place of the old one. Having done that the debug source and symbols were no longer valid, making it impossible to generate a sensible backtrace.
Consequently, if I want to continue development to track down the error, I'm going to have to figure out this peculiar situation with the library working when copied on top of some packages, but not others.
That means there must be something else that's being installed by the package which is causing the incompatibility. But what?
In the hope it might shed some light on things I'm going to checksum all of the installed files to see which ones change.
First, here are the checksums for the installed files from the new packages I've built (the non-working packages):
$ find /usr/lib64/xulrunner-qt5-91.9.1/ -type f -exec sha256sum {} \;
7f267b67c763f9dcee815b63f3d9beda01e05a8633ba0229ed0692877943 iblgpllibs.so
96638b0343cba81b65e13f835c5b7b554a26c6a2a65f0ee95ef2c00875ea mni.ja
46628085d9d2912973453035688e6d6c632a51a2cade000a8180defa93b8 lugin-container
8cdd39dfc3f2f982d909314aa7af160a3144564f354126c85d2a07f258b4 ependentlibs.list
facb73cd418a6647bd9b4d7914206257a4a97e5857355ab430b3737d2917 latform.ini
b40edc79cfa6e82237381a1e92f2c562f4fa844637779739e687967c0f6a ibmozavcodec.so
233025ceb162a45f5ca2d7dac3e511c6c7f188539b7859a88b16277264a3 ibxul.so
fd7bd48c5d68e6d9c13fed833d2a6880ab614d0af54dd4d0c903e2752775 ibmozavutil.so
b84c50b60b264df22ccc0bcc262eeb86f7764f20ffec9a073e3ba99ef703 pplication.ini
We can compare those against the checksums for the files that come from the package which has working WebGL but broken WebView:
$ find /usr/lib64/xulrunner-qt5-91.9.1/ -type f -exec sha256sum {} \;
ed00ccd7d2faadbf872f61436dc5041857d4464c05ba080147f88fc3e35c iblgpllibs.so
224d5f398864a708b6bf6a9a091d101adb2c9d94c7374837fd38ee8090ce mni.ja
7375b5d4a9445e3e6e169bda464253b33e132bad6bdd9b9de96a7c7399d1 lugin-container
8cdd39dfc3f2f982d909314aa7af160a3144564f354126c85d2a07f258b4 ependentlibs.list
ce90838911024163f58961d92cf6d810389e730c08ada4e364f1d592050a latform.ini
b60a6fa988fb3c763c21322e898721c5cd0f1aea5fad3b9ce4f938a0569c ibmozavcodec.so
bb0068273b939c4352f20a8e8d3095b387a33a24b82ecbf6bd1df280fcd9 ibxul.so
0f559421c9fded2b93e25818b98b178960d1956e2b0139721ba5085622df ibmozavutil.so
d007134ac436f8e806e3e6760855fd67235d523fe1fcb13c9a25f75dc3cb pplication.ini
Unfortunately this is less enlightening than I was hoping for: all of the files have changed, apart from the dependentlibs.list file. This unchanged file isn't very interesting, given it simply contains the name of the library:
$ cat /usr/lib64/xulrunner-qt5-91.9.1/dependentlibs.list
libxul.so
Literally all of the other files have changed. That's going to make it harder to pin down where the problem lies. I'm thinking that this avenue of investigation using checksums isn't going to be especially fruitful.
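Comparing two checksum listings by eye is error-prone, so here's a minimal sketch of how the comparison could be automated. It isn't something I actually ran; the function names parseListing() and unchangedFiles() are made up for illustration, and it assumes each line has the "checksum filename" layout printed by sha256sum, with no spaces in the filenames.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse "checksum filename" lines into a
// filename -> checksum map. Lines that don't have two fields are skipped.
std::map<std::string, std::string> parseListing(const std::string& text) {
  std::map<std::string, std::string> sums;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string sum;
    std::string name;
    if (fields >> sum >> name) {
      sums[name] = sum;
    }
  }
  return sums;
}

// Hypothetical helper: return the names of the files that appear in both
// listings with an identical checksum, in alphabetical order.
std::vector<std::string> unchangedFiles(const std::string& before,
                                        const std::string& after) {
  const std::map<std::string, std::string> a = parseListing(before);
  const std::map<std::string, std::string> b = parseListing(after);
  std::vector<std::string> same;
  for (const auto& entry : a) {
    const auto match = b.find(entry.first);
    if (match != b.end() && match->second == entry.second) {
      same.push_back(entry.first);
    }
  }
  return same;
}
```

Fed the two listings above, a helper like this would be expected to report only the dependentlibs.list entry, since that's the one line that's identical in both.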
It's possible that there's a problem with the boundary between the gecko and EmbedLite code, although I'm not sure I really understand how this can be. Nevertheless I've reverted the changes in EmbedLite just in case and set the build going again.
Unfortunately the build fails:
177:27.12 mobile/sailfishos
177:36.83 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp: In member function ‘void mozilla::
    embedlite::EmbedLiteCompositorBridgeParent::PrepareOffscreen()’:
177:36.83 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:121:20: error: ‘class mozilla::gl::
    GLScreenBuffer’ has no member named ‘mCaps’
177:36.83       if (!screen->mCaps.premultAlpha) {
177:36.83                    ^~~~~
177:36.84 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:127:68: error: ‘class mozilla::gl::
    GLScreenBuffer’ has no member named ‘mCaps’
177:36.84         factory = SurfaceFactory_EGLImage::Create(context, screen->mCaps, nullptr, flags);
177:36.84                                                                    ^~~~~
177:36.84 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:131:30: error:
    ‘SurfaceFactory_GLTexture’ was not declared in this scope
177:36.84         factory = MakeUnique<SurfaceFactory_GLTexture>(context, screen->mCaps, nullptr, flags);
177:36.84                              ^~~~~~~~~~~~~~~~~~~~~~~~
177:36.92 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:131:30: note: suggested alternative:
    ‘SurfaceDescriptorSharedGLTexture’
177:36.92         factory = MakeUnique<SurfaceFactory_GLTexture>(context, screen->mCaps, nullptr, flags);
177:36.92                              ^~~~~~~~~~~~~~~~~~~~~~~~
177:36.92                              SurfaceDescriptorSharedGLTexture
177:36.92 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:131:73: error: ‘class mozilla::gl::
    GLScreenBuffer’ has no member named ‘mCaps’
177:36.92         factory = MakeUnique<SurfaceFactory_GLTexture>(context, screen->mCaps, nullptr, flags);
177:36.92                                                                         ^~~~~
177:36.92 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp: In member function ‘virtual void
    mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget(mozilla::layers::PCompositorBridgeParent::
    VsyncId)’:
177:36.92 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:156:18: error: ‘class mozilla::gl::
    GLContext’ has no member named ‘OffscreenSize’; did you mean ‘IsOffscreen’?
177:36.92     if (context->OffscreenSize() != mEGLSurfaceSize && !context->ResizeOffscreen(mEGLSurfaceSize)) {
177:36.92                  ^~~~~~~~~~~~~
177:36.92                  IsOffscreen
177:36.92 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:156:66: error: ‘class mozilla::gl::
    GLContext’ has no member named ‘ResizeOffscreen’; did you mean
    ‘IsOffscreen’?
177:36.92     if (context->OffscreenSize() != mEGLSurfaceSize && !context->ResizeOffscreen(mEGLSurfaceSize)) {
177:36.92                                                                  ^~~~~~~~~~~~~~~
177:36.92                                                                  IsOffscreen
177:38.59 make[4]: *** [${PROJECT}/gecko-dev/config/rules.mk:694:
    EmbedLiteCompositorBridgeParent.o] Error 1
The interesting thing with these errors is that the members they complain about (mCaps on GLScreenBuffer, plus OffscreenSize() and ResizeOffscreen() on GLContext) no longer exist in the updated code. It turns out that's because although I added only a single commit to the gecko side, that was split into two commits on the EmbedLite side. Removing the extra embedlite commit should remove these extra elements and allow the build to go through.
But as I'm checking the diff between the two different versions of the code, I notice that it's not just the C++ code that has changed. There's also this addition to the embedding.js file:
// Make gecko compositor use GL context/surface provided by the application.
pref("embedlite.compositor.external_gl_context", false);
// Request the application to create GLContext for the compositor as
// soon as the top level PuppetWidget is created for the view. Setting
// this pref only makes sense when using external compositor gl context.
pref("embedlite.compositor.request_external_gl_context_early", false);
This is significant in being the only non-C++ change. This gets packaged into the omni.ja file which, crucially, won't get updated when I copy over a new libxul.so library.
So, it looks like this could be where the problem is.
If you cast your mind back to Day 94, you may recall that I have a script for cleanly packing and unpacking the omni.ja archive. So I can test this out really easily using that. The steps are:
- Install the full set of new packages that are broken.
- Unpack omni.ja.
- Edit the embedding.js file to comment out the code shown above.
- Repack omni.ja.
- Test the browser.
If my hypothesis is correct, this should fix the problem.
$ cd omni
$ ./omni.sh unpack
$ vim omni/defaults/pref/embedding.js
$ ./omni.sh pack
$ cd ..
$ sailfish-browser
[...]
And indeed now the browser works correctly. So that clarifies the mystery that's been baffling me for the last week. I feel much better now.
To wrap things up, I've edited the embedding.js file in the source tree as well, so that the change will get baked into the package in future. At some stage I'll need to restore it, because this is essential for getting the WebView to work. But in the meantime, at least I now have an answer to my conundrum which will make things much easier to handle in the future.
I now have a set of packages that contain GLScreenBuffer, which have working WebGL and which don't crash on start-up. The WebView is broken, so the next step will be to hook in the GLScreenBuffer and try to find out why that's breaking the WebGL. That's my plan for tomorrow.
I'm going to sleep much more soundly tonight.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
25 Jun 2024 : Day 269 #
This morning I awake to a failed build. It got right through to the end, so all the compilation went through successfully, but the linking step failed. That means I won't be able to test my changes from yesterday just yet. Here's the output. Linking errors are notoriously obtuse, but I've added linebreaks in to separate out the individual errors in an attempt to make things a little clearer:
226:05.55 toolkit/library/build/libxul.so

229:37.11 opt/cross/bin/aarch64-meego-linux-gnu-ld: ../../../gfx/layers/
    Unified_cpp_gfx_layers6.o: in function `mozilla::layers::
    SharedSurfaceTextureClient::Create(mozilla::UniquePtr<mozilla::gl::
    SharedSurface, mozilla::DefaultDelete<mozilla::gl::SharedSurface> >,
    mozilla::gl::SurfaceFactory*, mozilla::layers::LayersIPCChannel*, mozilla::
    layers::TextureFlags)':
229:37.11 ${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp:
    104: undefined reference to `mozilla::layers::SharedSurfaceTextureData::
    SharedSurfaceTextureData(mozilla::UniquePtr<mozilla::gl::SharedSurface,
    mozilla::DefaultDelete<mozilla::gl::SharedSurface> >)'

229:37.12 opt/cross/bin/aarch64-meego-linux-gnu-ld: ../../../gfx/layers/
    Unified_cpp_gfx_layers6.o: in function `already_AddRefed<mozilla::layers::
    SharedSurfaceTextureClient> mozilla::MakeAndAddRef<mozilla::layers::
    SharedSurfaceTextureClient, mozilla::layers::SharedSurfaceTextureData*&,
    mozilla::layers::TextureFlags&, mozilla::layers::LayersIPCChannel*&>(
    mozilla::layers::SharedSurfaceTextureData*&, mozilla::layers::
    TextureFlags&, mozilla::layers::LayersIPCChannel*&)':
229:37.12 ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:590:
    undefined reference to `mozilla::layers::SharedSurfaceTextureClient::
    SharedSurfaceTextureClient(mozilla::layers::SharedSurfaceTextureData*,
    mozilla::layers::TextureFlags, mozilla::layers::LayersIPCChannel*)'

229:37.12 opt/cross/bin/aarch64-meego-linux-gnu-ld: libxul.so: hidden symbol
    `_ZN7mozilla6layers24SharedSurfaceTextureDataC1ENS_9UniquePtrINS_2gl13
    SharedSurfaceENS_13DefaultDeleteIS4_EEEE' isn't defined

229:37.12 opt/cross/bin/aarch64-meego-linux-gnu-ld: final link failed: bad value
229:37.12 collect2: error: ld returned 1 exit status
That's a bit dense, but alongside the header and footer, we can see that this breaks down into three failures. First there's an undefined reference to a SharedSurfaceTextureData() constructor.
The missing method has the following signature:
SharedSurfaceTextureData::SharedSurfaceTextureData(
  UniquePtr<SharedSurface,DefaultDelete<SharedSurface> >
)
The output also tells us that the reference to it happens in SharedSurfaceTextureClient::Create() on line 104 of TextureClientSharedSurface.cpp.
This is usually a sign that something was declared in the header but without an implementation in the source. And sure enough, checking the diff from the previous version I can see that there's this method that used to be in TextureClientSharedSurface.cpp but has been removed:
-SharedSurfaceTextureData::SharedSurfaceTextureData(
-    UniquePtr<gl::SharedSurface> surf)
-    : mSurf(std::move(surf)),
-      mDesc(),
-      mFormat(),
-      mSize(mSurf->mDesc.size)
-{
-}
Adding it back in was nice and easy and should fix this error. Next up there's another undefined reference. This one relates to this method:
SharedSurfaceTextureClient::SharedSurfaceTextureClient(
  SharedSurfaceTextureData*,
  TextureFlags,
  LayersIPCChannel*
)
The error shows that an attempt is being made to use this inside a MakeAndAddRef() call, but that's not very helpful for us because that's just a wrapper obscuring the real location. Nevertheless, checking the diff shows that the following relevant code has been removed from TextureClientSharedSurface.cpp:
-SharedSurfaceTextureClient::SharedSurfaceTextureClient(
-    SharedSurfaceTextureData* aData, TextureFlags aFlags,
-    LayersIPCChannel* aAllocator)
-    : TextureClient(aData, aFlags, aAllocator) {
-  mWorkaroundAnnoyingSharedSurfaceLifetimeIssues = true;
-}
This matches the missing signature and I can see there's also a matching signature for this already in the header file. So everything is aligning for this one as well. I've added the missing method body into the code.
Finally we have this hidden symbol error, which suggests that the following hasn't been defined:
_ZN7mozilla6layers24SharedSurfaceTextureDataC1ENS_9UniquePtrINS_2gl13
    SharedSurfaceENS_13DefaultDeleteIS4_EEEE
That's a horribly mangled name, but happily binutils provides the neat c++filt utility which will demangle it for us:
$ c++filt '_ZN7mozilla6layers24SharedSurfaceTextureDataC1ENS_9UniquePtrINS_2gl13
    SharedSurfaceENS_13DefaultDeleteIS4_EEEE'
mozilla::layers::SharedSurfaceTextureData::SharedSurfaceTextureData(
    mozilla::UniquePtr<mozilla::gl::SharedSurface,
    mozilla::DefaultDelete<mozilla::gl::SharedSurface> >)
Simplifying this output a bit, we can reduce it down to a missing symbol for the following:
SharedSurfaceTextureData::SharedSurfaceTextureData(
  UniquePtr<SharedSurface, DefaultDelete<SharedSurface> >
)
It looks like this is a repeat of our first error, so adding in the body for the missing SharedSurfaceTextureData should have already done the job of fixing this error.
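As an aside, the same demangling can be done from code rather than on the command line. The following is a minimal sketch, assuming a GCC or Clang toolchain that provides the abi::__cxa_demangle() function in the cxxabi.h header. The demangle() wrapper is my own hypothetical helper, not something from the gecko tree.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

#include <cxxabi.h>

// Hypothetical wrapper: demangle an Itanium C++ ABI symbol name. If the
// name can't be demangled the input is returned unchanged.
std::string demangle(const std::string& mangled) {
  int status = 0;
  // Passing null for the output buffer asks the runtime to malloc() one.
  char* result =
      abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || result == nullptr) {
    std::free(result);  // free(nullptr) is a harmless no-op
    return mangled;
  }
  std::string readable(result);
  std::free(result);  // the caller owns the buffer returned on success
  return readable;
}
```

Note that when copying a symbol out of wrapped build output like the above, any whitespace introduced by the wrapping has to be removed first, otherwise the name won't be recognised as a mangled symbol.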
With all of the errors apparently resolved, I've kicked the build off again.
[...]
The build quickly hits another error, but this error is happening during compilation as a result of the new code I added:
104:32.38 ${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp:
    109:3: error: ‘mWorkaroundAnnoyingSharedSurfaceLifetimeIssues’ was not
    declared in this scope
104:32.38   mWorkaroundAnnoyingSharedSurfaceLifetimeIssues = true;
104:32.38   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This does appear in the new code we just added, but right now that's the only place it appears. It also appears in a bunch of code I've not added back in, all related to carefully choosing when to destroy shared surfaces. Since I'm not planning to add these back in just yet, I've commented out the offending line for now instead of adding the missing variable in. If any of the other pieces of infrastructure that need it get added in later, I can restore it. I've kicked off another build.
[...]
Now the compilation is going through successfully, but the build hit another few errors during linking:
222:47.06 toolkit/library/build/libxul.so

226:31.24 opt/cross/bin/aarch64-meego-linux-gnu-ld: ../../../gfx/layers/
    Unified_cpp_gfx_layers6.o: in function `mozilla::layers::
    SharedSurfaceTextureClient::SharedSurfaceTextureClient(mozilla::layers::
    SharedSurfaceTextureData*, mozilla::layers::TextureFlags, mozilla::layers::
    LayersIPCChannel*)':
226:31.24 ${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp:
    108: undefined reference to `vtable for mozilla::layers::
    SharedSurfaceTextureClient'

226:31.24 opt/cross/bin/aarch64-meego-linux-gnu-ld: ${PROJECT}/gecko-dev/gfx/
    layers/client/TextureClientSharedSurface.cpp:108: undefined reference to
    `vtable for mozilla::layers::SharedSurfaceTextureClient'

226:31.25 opt/cross/bin/aarch64-meego-linux-gnu-ld: libxul.so: hidden symbol
    `_ZTVN7mozilla6layers26SharedSurfaceTextureClientE' isn't defined

226:31.25 opt/cross/bin/aarch64-meego-linux-gnu-ld: final link failed: bad value
Let's break these down once again. From TextureClientSharedSurface.cpp line 108 we have an undefined reference to the SharedSurfaceTextureClient class's vtable. It looks to me like that's because the destructor is declared but not implemented. Here's the implementation that was removed:
-SharedSurfaceTextureClient::~SharedSurfaceTextureClient() {
-  // XXX - Things break when using the proper destruction handshake with
-  // SharedSurfaceTextureData because the TextureData outlives its gl
-  // context. Having a strong reference to the gl context creates a cycle.
-  // This needs to be fixed in a better way, though, because deleting
-  // the TextureData here can race with the compositor and cause flashing.
-  TextureData* data = mData;
-  mData = nullptr;
-
-  Destroy();
-
-  if (data) {
-    // Destroy mData right away without doing the proper deallocation handshake,
-    // because SharedSurface depends on things that may not outlive the
-    // texture's destructor so we can't wait until we know the compositor isn't
-    // using the texture anymore. It goes without saying that this is really bad
-    // and we should fix the bugs that block doing the right thing such as bug
-    // 1224199 sooner rather than later.
-    delete data;
-  }
-}
The second error looks to be the same thing. The third is mangled, so let's demangle it:
$ c++filt '_ZTVN7mozilla6layers26SharedSurfaceTextureClientE'
vtable for mozilla::layers::SharedSurfaceTextureClient
So apparently the third error is the same thing as well. I've added in the destructor, so let's give it another go.
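The reason a missing destructor shows up as a missing vtable is worth a quick illustration. With GCC and Clang on Linux the vtable for a class is typically emitted alongside the definition of its first non-inline virtual method (often called the key function), so if that one definition goes missing, so does the vtable. Here's a minimal sketch using a made-up ExampleTexture class and destroyOne() helper rather than the real gecko classes:

```cpp
#include <cassert>

// Hypothetical stand-in for a class like SharedSurfaceTextureClient.
class ExampleTexture {
 public:
  explicit ExampleTexture(int* destroyedCount);
  // Declared here but defined out of line, which makes it the class's
  // first non-inline virtual method.
  virtual ~ExampleTexture();

 private:
  int* mDestroyedCount;
};

ExampleTexture::ExampleTexture(int* destroyedCount)
    : mDestroyedCount(destroyedCount) {}

// Deleting this definition while leaving the declaration above in place
// still compiles, but would typically fail at link time with an
// "undefined reference to `vtable for ExampleTexture'" error.
ExampleTexture::~ExampleTexture() { ++*mDestroyedCount; }

// Create and destroy a single object, returning how many times the
// destructor ran.
int destroyOne() {
  int destroyed = 0;
  ExampleTexture* texture = new ExampleTexture(&destroyed);
  delete texture;
  return destroyed;
}
```

The compiler is happy in either case because it only needs the declaration; it's the linker that eventually notices nothing ever provided the body.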
[...]
Finally the build goes through. There's bad news and good news and good news and bad news.
The bad news is that after installing the packages the browser crashes with the same Wayland errors we were getting before.
The good news is that when I install the previous packages, then link in the new library to replace the previous version, the browser then works okay.
The subsequent good news is that the WebGL is also working when I do this.
And the final piece of news — bad news as it happens — is that the WebView doesn't work in this case.
Overall though, I take this to be positive. The build is working with the GLScreenBuffer restored (even if it's not being used). I now just need to figure out how to prevent the crash. After that I can focus on the WebView.
So it's gradually coming together.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
24 Jun 2024 : Day 268 #
Today hasn't quite been the day of development I was planning. That's okay, it happens sometimes, and while I've not been doing development, the sun has been shining and nature has been making its lazy hum. It's not been bad to take the opportunity to relax.
What's more, my day was made emphatically better by receiving this Gecko-dev related poem from Leif-Jöran Olsson (ljo) on Mastodon:
Summer solstice and a supporting full moon ends the code removal phase. A sea of browser backtrace ejects gives support for switching to incremental introduction of nibbles of code. The WebGL context path is buried together with any remaining anxiety. The energy collected awakens the concavenator to pair up in dynamic duo with flypig's rejuvenated gecko.
Genuine art! It sums up where things are at nicely: as you may recall, I've recently restored GLScreenBuffer alongside a minimal set of changes (the convex hull of its dependencies) needed to get the build to compile. Partial compile that is.
I kicked off a build overnight, but by the morning it's hit some errors. They look like this:
[...]
254:42.53 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp: In member function ‘void mozilla::
    embedlite::EmbedLiteCompositorBridgeParent::GetPlatformImage(const std::
    function<void(void*, int, int)>&)’:
254:42.53 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:227:37: error: ‘class mozilla::gl::
    GLContext’ has no member named ‘Screen’
254:42.53   GLScreenBuffer* screen = context->Screen();
254:42.53                                     ^~~~~~

254:42.53 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp: In member function ‘void* mozilla::
    embedlite::EmbedLiteCompositorBridgeParent::GetPlatformImage(int*, int*)’:
254:42.53 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:257:37: error: ‘class mozilla::gl::
    GLContext’ has no member named ‘Screen’
254:42.53   GLScreenBuffer* screen = context->Screen();
254:42.53                                     ^~~~~~

254:44.41 make[4]: *** [${PROJECT}/gecko-dev/config/rules.mk:694:
    EmbedLiteCompositorBridgeParent.o] Error 1
In this and the following error output I've added some newlines to try to separate out the errors and hopefully make them a little clearer.
All of these errors amount to the same thing and will be easy to fix. The necessary change is to restore the GLContext::Screen() method, which I've done, and set the build off again. It presumably got past the partial build because the call is being made from inside EmbedLiteCompositorBridgeParent.cpp, which as I also discussed yesterday, doesn't get touched by the partial build.
It was a pretty obvious error and someone more astute than I am could certainly have picked it up just by observation, without the need to do the build. But it's also easy when working with compiled languages to rely on the compiler to pick these kinds of errors up. So I missed it and it lost me some time.
My second build failed as well, this time due to the following variable being missing from the GLContext class:
UniquePtr<GLScreenBuffer> mScreen;
In my defence I had added it, but it got removed again while performing a git checkout -d command to restore the Screen() method. It's a poor defence, but it's how it went down.
So I'm now on to my third build of the day. So far so good, I'm hoping it'll complete before bed-time so as to give me the chance to test it.
Frustratingly it gets all the way to the linker before it fails again.
394:09.19 toolkit/library/build/libxul.so

401:01.08 /home/flypig/Programs/sailfish-sdk/sailfish-sdk/mersdk/targets/
    SailfishOS-devel-aarch64.default/opt/cross/bin/aarch64-meego-linux-gnu-ld:
    ../../../gfx/gl/Unified_cpp_gfx_gl0.o: in function `mozilla::gl::
    SurfaceFactory::NewTexClient(mozilla::gfx::IntSizeTyped<mozilla::gfx::
    UnknownUnits> const&)':
401:01.10 ${PROJECT}/gecko-dev/gfx/gl/SharedSurface.cpp:204: undefined
    reference to `mozilla::layers::SharedSurfaceTextureClient::Create(mozilla::
    UniquePtr<mozilla::gl::SharedSurface, mozilla::DefaultDelete<mozilla::gl::
    SharedSurface> >, mozilla::gl::SurfaceFactory*, mozilla::layers::
    LayersIPCChannel*, mozilla::layers::TextureFlags)'

401:01.10 /home/flypig/Programs/sailfish-sdk/sailfish-sdk/mersdk/targets/
    SailfishOS-devel-aarch64.default/opt/cross/bin/aarch64-meego-linux-gnu-ld:
    libxul.so: hidden symbol
    `_ZN7mozilla6layers26SharedSurfaceTextureClient6CreateENS_9UniquePtr
    INS_2gl13SharedSurfaceENS_13DefaultDeleteIS4_EEEEPNS3_14SurfaceFactory
    EPNS0_16LayersIPCChannelENS0_12TextureFlagsE' isn't defined

401:01.10 /home/flypig/Programs/sailfish-sdk/sailfish-sdk/mersdk/targets/
    SailfishOS-devel-aarch64.default/opt/cross/bin/aarch64-meego-linux-gnu-ld:
    final link failed: bad value
The problem here is a method that's being declared in a header but not implemented in the source file. By carefully working through the error output we can see that the missing code is the implementation for SharedSurfaceTextureClient::Create(). Here's the method shown in the error message, but cleaned up and reformatted to make things clearer:
SharedSurfaceTextureClient::Create(
  UniquePtr<SharedSurface, DefaultDelete<SharedSurface> >,
  SurfaceFactory*,
  LayersIPCChannel*,
  TextureFlags
)
We can also see from the error messages that it's being called here:
RefPtr<layers::SharedSurfaceTextureClient> ret;
ret = layers::SharedSurfaceTextureClient::Create(std::move(surf), this,
    mAllocator, mFlags);
In TextureClientSharedSurface.h we can see the method signature in the header. The fact there's a signature is the reason the compiler didn't notice and it wasn't until the linker that the error was uncovered:
class SharedSurfaceTextureClient : public TextureClient {
 public:
[...]
  static already_AddRefed<SharedSurfaceTextureClient> Create(
      UniquePtr<gl::SharedSurface> surf, gl::SurfaceFactory* factory,
      LayersIPCChannel* aAllocator, TextureFlags aFlags);
[...]
};
But the implementation is indeed missing from TextureClientSharedSurface.cpp. We can get the implementation that we were using before using git diff, which gives us the following:
$ git diff
[...]
-already_AddRefed<SharedSurfaceTextureClient> SharedSurfaceTextureClient::Create(
-    UniquePtr<gl::SharedSurface> surf, gl::SurfaceFactory* factory,
-    LayersIPCChannel* aAllocator, TextureFlags aFlags) {
-  if (!surf) {
-    return nullptr;
-  }
-  TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags();
-  SharedSurfaceTextureData* data =
-      new SharedSurfaceTextureData(std::move(surf));
-  return MakeAndAddRef<SharedSurfaceTextureClient>(data, flags, aAllocator);
-}
There's also a mangled method name appearing in the error output. We can demangle it to try to find out if this is something separate we need to fix:
$ c++filt '_ZN7mozilla6layers26SharedSurfaceTextureClient6CreateENS_9
    UniquePtrINS_2gl13SharedSurfaceENS_13DefaultDeleteIS4_EEEEPNS3_14
    SurfaceFactoryEPNS0_16LayersIPCChannelENS0_12TextureFlagsE'
mozilla::layers::SharedSurfaceTextureClient::Create(mozilla::UniquePtr<
    mozilla::gl::SharedSurface, mozilla::DefaultDelete<mozilla::gl::
    SharedSurface> >, mozilla::gl::SurfaceFactory*,
    mozilla::layers::LayersIPCChannel*, mozilla::layers::TextureFlags)
Cleaning that up, we get this:
SharedSurfaceTextureClient::Create(
  UniquePtr<SharedSurface, DefaultDelete<SharedSurface> >,
  SurfaceFactory*,
  LayersIPCChannel*,
  TextureFlags
)
Having demangled and cleaned it up, it's clear this is the same error as before, so nothing more to do on this front.
After making these fixes and running the partial build again, it now throws up the following error:
In file included from Unified_cpp_gfx_layers6.cpp:128:
${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp: In
    static member function ‘static already_AddRefed<mozilla::layers::
    SharedSurfaceTextureClient> mozilla::layers::SharedSurfaceTextureClient::
    Create(mozilla::UniquePtr<mozilla::gl::SharedSurface>, mozilla::gl::
    SurfaceFactory*, mozilla::layers::LayersIPCChannel*, mozilla::layers::
    TextureFlags)’:
${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp:102:63:
    error: ‘class mozilla::gl::SharedSurface’ has no member named
    ‘GetTextureFlags’
   TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags();
To fix this I need to add in the removed GetTextureFlags() method to SharedSurface.cpp and the related signature in the SharedSurface.h header:
-  // Specifies to the TextureClient any flags which
-  // are required by the SharedSurface backend.
-  virtual layers::TextureFlags GetTextureFlags() const;
[...]
-layers::TextureFlags SharedSurface::GetTextureFlags() const {
-  return layers::TextureFlags::NO_FLAGS;
-}
With this change the partial build finally goes through, including the final linking stage. But I'll still need to run the full build again before I can test anything. So I've kicked it off. There's no way it'll complete before the morning, so that'll have to be it for the day.
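To make the role of GetTextureFlags() a bit more concrete, here's a simplified sketch of the pattern: a base class that contributes no flags by default, a backend that overrides this, and a caller that ORs everything together in the way Create() does. All of the names here (ExampleFlags, ExampleSurface, ExampleBackendSurface, combineFlags() and the BACKEND_SPECIFIC flag) are invented for illustration; the real TextureFlags type in gecko has many more values and its own operator definitions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, cut-down flag type standing in for TextureFlags.
enum class ExampleFlags : uint32_t {
  NO_FLAGS = 0,
  RECYCLE = 1u << 0,
  BACKEND_SPECIFIC = 1u << 1,  // invented purely for this example
};

// Bitwise OR for the scoped enum, so that flags can be combined.
constexpr ExampleFlags operator|(ExampleFlags a, ExampleFlags b) {
  return static_cast<ExampleFlags>(static_cast<uint32_t>(a) |
                                   static_cast<uint32_t>(b));
}

// Hypothetical base class: by default a surface needs no extra flags.
class ExampleSurface {
 public:
  virtual ~ExampleSurface() = default;
  virtual ExampleFlags GetTextureFlags() const {
    return ExampleFlags::NO_FLAGS;
  }
};

// Hypothetical backend that asks for an additional flag.
class ExampleBackendSurface : public ExampleSurface {
 public:
  ExampleFlags GetTextureFlags() const override {
    return ExampleFlags::BACKEND_SPECIFIC;
  }
};

// Combine the caller's flags with RECYCLE and whatever the surface's
// backend requires.
ExampleFlags combineFlags(ExampleFlags requested,
                          const ExampleSurface& surface) {
  return requested | ExampleFlags::RECYCLE | surface.GetTextureFlags();
}
```

Because the base implementation returns NO_FLAGS, restoring it shouldn't change the flags passed along for any surface type that doesn't override it.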
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
23 Jun 2024 : Day 267 #
I have a bit more time for development today than I had yesterday, so I'm hoping I can properly follow up on this issue I noticed yesterday with the library working or not, depending on which version of the package is installed.
As part of this, I want to explore what happens when I run a configuration with the "working WebGL" packages (i.e. the ones with all of the changes from my latest commit reverted), plus my latest library, but also running the WebView rather than the browser.
I'm expecting this to fail, but it'll be interesting to see where.
[...]
And it does fail. But now I have a backtrace to inspect from it and it's a lot more interesting than the backtraces from the Wayland failure we've been getting so often recently. Here's the backtrace:
To be honest, this is just what I'd expect. But it also tells us that this whole process hasn't been in vain: cutting out things brought us to a similar point to before, but we're closer to resolving both the WebGL and WebView issues this time.
The next step is to establish whether the new SwapChain is getting used. I'd previously thought it was never used by the browser, but I have a new perspective now: although it's not used when rendering general web pages, maybe it's used when rendering WebGL within a page? Most pages don't do this, but when they do, I'm now expecting there to be some offscreen rendering.
I've placed a breakpoint on the SwapChain constructor. To start with, here's where the SwapChain gets created when using a WebView component. This is for comparison, captured using the latest code:
When I render a website without WebGL (e.g. the Jolla site) the constructor goes unused. But if I visit a site that uses WebGL (e.g. my personal website where the animated background is generated using a WebGL shader) it does get hit. It comes with a crazy long backtrace that shows it's happening inside a DOM element, which is again what I'd expect. I've chopped quite a lot out from the below backtrace, but still kept the parts I think are most relevant:
[...]
I've now spent a good few hours looking through the WebGLContext code, since this is what we see in the backtrace above. There's definitely something in the idea that we should be using this instead of GLContext. But WebGLContext isn't inheriting anything from GLContext and their interfaces look quite different to me. It certainly isn't the case that one would be a drop-in replacement for the other. Quite the contrary in fact. While switching to use WebGLContext might be a better solution in the long-term, I've convinced myself (again) that this isn't what we need right now.
So I'm going back to my original plan, but now we're going in the opposite direction. Rather than removing code I'll now start to reintroduce code. In particular, the one thing I'm convinced that we can't do without is the GLScreenBuffer object, as encapsulated in the GLContext::mScreen member variable.
So I'm adding this class back in. Thankfully git makes this a very easy process:
With just these two files reverted, attempting to build throws up a whole host of errors. Here are just a few:
It's going to be the convex hull of the GLScreenBuffer dependencies.
[...]
I've got to the stage where the partial build seems to be compiling. But it required changes to the EmbedLite code, which I don't yet have a method of including in the partial build. But it's already late here, so I'm going to set the full build running overnight and see where that gets us.
Today has been a very productive day of development. If I can be similarly productive tomorrow, I'll feel like all of the work I've been putting in over the last week, despite the slow progress, will nevertheless have been worth it.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
As part of this, I want to explore what happens when I run a configuration with the "working WebGL" packages (i.e. the ones with all of the changes from my latest commit reverted), plus my latest library, but also running the WebView rather than the browser.
I'm expecting this to fail, but it'll be interesting to see where.
[...]
And it does fail. But now I have a backtrace to inspect from it and it's a lot more interesting than the backtraces from the Wayland failure we've been getting so often recently. Here's the backtrace:
Thread 38 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 9220]
0x0000007ff110864c in mozilla::gl::SwapChain::OffscreenSize (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory.
(gdb) bt
#0  0x0000007ff110864c in mozilla::gl::SwapChain::OffscreenSize (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#1  0x0000007ff3666230 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget (this=0x7fc4ad76f0, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff12b64d8 in mozilla::layers::CompositorVsyncScheduler::ForceComposeToTarget (this=0x7fc4c39f60, aTarget=aTarget@entry=0x0, aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/LayersTypes.h:82
#3  0x0000007ff12b6534 in mozilla::layers::CompositorBridgeParent::ResumeComposition (this=this@entry=0x7fc4ad76f0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b65c0 in mozilla::layers::CompositorBridgeParent::ResumeCompositionAndResize (this=0x7fc4ad76f0, x=<optimized out>, y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12af15c in mozilla::detail::RunnableMethodArguments<int, int, int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int), StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul, 2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
#6  mozilla::detail::RunnableMethodArguments<int, int, int, int>::apply<mozilla::layers::CompositorBridgeParent, void (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int)> (m=<optimized out>, o=<optimized out>, this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1154
#7  mozilla::detail::RunnableMethodImpl<mozilla::layers::CompositorBridgeParent*, void (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int), true, (mozilla::RunnableKind)0, int, int, int, int>::Run (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1201
#8  0x0000007ff0801ab8 in nsThread::ProcessNextEvent (this=0x7fc4c01730, aMayWait=<optimized out>, aResult=0x7f1796bcb7)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:869
#9  0x0000007ff07f098c in NS_ProcessNextEvent (aThread=<optimized out>, aThread@entry=0x7fc4c01730, aMayWait=aMayWait@entry=false)
    at ${PROJECT}/gecko-dev/xpcom/threads/nsThreadUtils.cpp:466
#10 0x0000007ff0bbcab0 in mozilla::ipc::MessagePumpForNonMainThreads::Run (this=0x7edc001840, aDelegate=0x7f1796bdc0)
    at ${PROJECT}/gecko-dev/ipc/glue/MessagePump.cpp:300
#11 0x0000007ff0b7b87c in MessageLoop::RunInternal (this=this@entry=0x7f1796bdc0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#12 0x0000007ff0b7bac0 in MessageLoop::RunHandler (this=0x7f1796bdc0)
    at ${PROJECT}/gecko-dev/ipc/chromium/src/base/message_loop.cc:352
#13 MessageLoop::Run (this=this@entry=0x7f1796bdc0)
    at ${PROJECT}/gecko-dev/ipc/chromium/src/base/message_loop.cc:334
#14 0x0000007ff08034b8 in nsThread::ThreadFunc (aArg=0x7fc4c018d0)
    at ${PROJECT}/gecko-dev/xpcom/threads/nsThread.cpp:392
#15 0x0000007feca419f0 in ?? () from /usr/lib64/libnspr4.so
#16 0x0000007fefd05a4c in ?? () from /lib64/libpthread.so.0
#17 0x0000007ff6a0289c in ?? () from /lib64/libc.so.6
While we're here, let's do a little exploration into why this crash occurred using the debugger.
(gdb) frame 1
#1  0x0000007ff3666230 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget (this=0x7fc4ad76f0, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory.
(gdb) p context
$1 = (mozilla::gl::GLContext *) 0x7edc19ede0
(gdb) p context->mSwapChain
$2 = {
  mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SwapChain*, mozilla::DefaultDelete<mozilla::gl::SwapChain>, (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> = {<mozilla::DefaultDelete<mozilla::gl::SwapChain>> = {<No data fields>}, mFirstA = 0x7edc1ce070}, <No data fields>}}
(gdb) p context->mSwapChain.mTuple
$3 = {<mozilla::detail::CompactPairHelper<mozilla::gl::SwapChain*, mozilla::DefaultDelete<mozilla::gl::SwapChain>, (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> = {<mozilla::DefaultDelete<mozilla::gl::SwapChain>> = {<No data fields>}, mFirstA = 0x7edc1ce070}, <No data fields>}
(gdb) p context->mSwapChain.mTuple.mFirstA
$4 = (mozilla::gl::SwapChain *) 0x7edc1ce070
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter
$5 = (mozilla::gl::SwapChainPresenter *) 0x7edc1a1300
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter->mBackBuffer
$6 = std::shared_ptr<mozilla::gl::SharedSurface> (empty) = {get() = 0x0}
(gdb)
What's this telling us? Well, it's very similar to the crash we got back on Day 177 when we first started trying out the WebView. The SwapChain is being created and accessed, but the problem occurs deep inside the object. It's the SharedSurface back buffer that's not been set: this is stored inside the SwapChainPresenter object, which hangs off the SwapChain, which is itself held by a smart pointer inside the GLContext:
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter->mBackBuffer
$6 = std::shared_ptr<mozilla::gl::SharedSurface> (empty) = {get() = 0x0}
This might be an initialisation issue, or it might be more involved. It's not quite the same as what was happening on Day 177 since the code is different this time. But the underlying issue is the same.
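To make the chain the debugger walked through a little more concrete, here's a minimal sketch using stand-in types. None of these are the real Gecko classes: everything with a Sketch prefix is hypothetical and mirrors only the shape seen in gdb (context, then mSwapChain, then mPresenter, then mBackBuffer). It shows one way a size query could check each link before dereferencing it, rather than crashing on an empty back buffer. Whether the real fix should be a guard like this or proper initialisation of the back buffer is a separate question.

```cpp
#include <memory>

// Hypothetical stand-ins; only the member names echo the gdb output.
struct SketchSize {
  int width;
  int height;
};

struct SketchSharedSurface {
  SketchSize size;
};

struct SketchPresenter {
  // Empty until a surface has been attached, as seen in the crash.
  std::shared_ptr<SketchSharedSurface> mBackBuffer;
};

struct SketchSwapChain {
  // Non-owning pointer to the presenter.
  SketchPresenter* mPresenter = nullptr;

  // Writes the back buffer size to `out` and returns true, or returns
  // false if any link in the chain is missing.
  bool OffscreenSize(SketchSize& out) const {
    if (!mPresenter || !mPresenter->mBackBuffer) {
      return false;
    }
    out = mPresenter->mBackBuffer->size;
    return true;
  }
};

struct SketchContext {
  // Owning smart pointer, analogous to the UniquePtr seen in gdb.
  std::unique_ptr<SketchSwapChain> mSwapChain;
};
```

With this shape, a caller can tell "no back buffer yet" apart from a valid size, instead of reading through a null pointer.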
To be honest, this is just what I'd expect. But it also tells us that this whole process hasn't been in vain: cutting out things brought us to a similar point to before, but we're closer to resolving both the WebGL and WebView issues this time.
The next step is to establish whether the new SwapChain is getting used. I'd previously thought it was never used by the browser, but I have a new perspective now: although it's not used when rendering general web pages, maybe it's used when rendering WebGL within a page? Most pages don't do this, but when they do, I'm now expecting there to be some offscreen rendering.
I've placed a breakpoint on the SwapChain constructor. To start with, here's where the SwapChain gets created when using a WebView component. This is for comparison, captured using the latest code:
=============== Preparing offscreen rendering context ===============
[Switching to LWP 9891]

Thread 37 "Compositor" hit Breakpoint 1, mozilla::gl::SwapChain::SwapChain (this=0x7ee01ce090)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:63
63      ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h: No such file or directory.
(gdb) bt
#0  mozilla::gl::SwapChain::SwapChain (this=0x7ee01ce090)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:63
#1  0x0000007ff3666ac0 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::PrepareOffscreen (this=this@entry=0x7fc4b01c50)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#2  0x0000007ff3666b7c in mozilla::embedlite::EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent (this=0x7fc4b01c50, aBackendHints=..., aId=...)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:90
#3  0x0000007ff0c63d90 in mozilla::layers::PCompositorBridgeParent::OnMessageReceived (this=0x7fc4b01c50, msg__=...)
    at PCompositorBridgeParent.cpp:1285
[...]
#18 0x0000007ff6a0289c in ?? () from /lib64/libc.so.6
(gdb)
As we can see from this, it's created inside the EmbedLiteCompositorBridgeParent::PrepareOffscreen() method. Here's what the code looks like that's creating it, for reference:
void
EmbedLiteCompositorBridgeParent::PrepareOffscreen()
{
  fprintf(stderr, "=============== Preparing offscreen rendering context ===============\n");

  const CompositorBridgeParent::LayerTreeState* state = CompositorBridgeParent::GetIndirectShadowTree(RootLayerTreeId());
  NS_ENSURE_TRUE(state && state->mLayerManager, );

  GLContext* context = static_cast<CompositorOGL*>(state->mLayerManager->GetCompositor())->gl();
  NS_ENSURE_TRUE(context, );

  // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
  // See: https://phabricator.services.mozilla.com/D75055
  if (context->IsOffscreen()) {
    UniquePtr<SurfaceFactory> factory;
    if (context->GetContextType() == GLContextType::EGL) {
      // [Basic/OGL Layers, OMTC] WebGL layer init.
      factory = SurfaceFactory_EGLImage::Create(*context);
    } else {
      // [Basic Layers, OMTC] WebGL layer init.
      // Well, this *should* work...
      factory = MakeUnique<SurfaceFactory_Basic>(*context);
    }

    SwapChain* swapChain = context->GetSwapChain();
    if (swapChain == nullptr) {
      swapChain = new SwapChain();
      new SwapChainPresenter(*swapChain);
      context->mSwapChain.reset(swapChain);
    }

    if (factory) {
      swapChain->Morph(std::move(factory));
    }
  }
}
Now I want to know whether it's ever used by the browser using an execution flow that doesn't depend on EmbedLite.
When I render a website without WebGL (e.g. the Jolla site) the constructor goes unused. But if I visit a site that uses WebGL (e.g. my personal website where the animated background is generated using a WebGL shader) it does get hit. It comes with a crazy long backtrace that shows it's happening inside a DOM element, which is again what I'd expect. I've chopped quite a lot out from the below backtrace, but still kept the parts I think are most relevant:
Thread 8 "GeckoWorkerThre" hit Breakpoint 1, mozilla::gl::SwapChain::SwapChain (this=0x7fc9ce3588)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:63
63      ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h: No such file or directory.
(gdb) bt
#0  mozilla::gl::SwapChain::SwapChain (this=0x7fc9ce3588)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:63
#1  0x0000007ff369a54c in mozilla::WebGLContext::WebGLContext (this=0x7fc9ce30f0, host=..., desc=...)
    at include/c++/8.3.0/bits/move.h:74
#2  0x0000007ff36a9c90 in mozilla::WebGLContext::<lambda()>::operator() (__closure=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#3  mozilla::WebGLContext::Create (host=..., desc=..., out=out@entry=0x7fcb9660c8)
    at ${PROJECT}/gecko-dev/dom/canvas/WebGLContext.cpp:562
#4  0x0000007ff3661920 in mozilla::HostWebGLContext::Create (ownerData=..., desc=..., out=out@entry=0x7fcb9660c8)
    at ${PROJECT}/gecko-dev/dom/canvas/HostWebGLContext.cpp:59
#5  0x0000007ff3691374 in mozilla::ClientWebGLContext::<lambda()>::operator() (__closure=<optimized out>)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:625
#6  mozilla::ClientWebGLContext::CreateHostContext (this=this@entry=0x7fc9991820, requestedSize=...)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:654
#7  0x0000007ff3691e5c in mozilla::ClientWebGLContext::SetDimensions (this=0x7fc9991820, signedWidth=<optimized out>, signedHeight=<optimized out>)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:563
#8  0x0000007ff362b27c in mozilla::dom::CanvasRenderingContextHelper::UpdateContext (this=0x7e6036c790, aCx=<optimized out>, aNewContextOptions=..., aRvForDictionaryInit=...)
    at ${PROJECT}/gecko-dev/dom/canvas/CanvasRenderingContextHelper.cpp:238
#9  0x0000007ff363a348 in mozilla::dom::CanvasRenderingContextHelper::GetContext (this=this@entry=0x7e6036c790, aCx=0x7fc81defd0, aContextId=..., aContextOptions=..., aRv=...)
    at ${PROJECT}/gecko-dev/dom/canvas/CanvasRenderingContextHelper.cpp:190
#10 0x0000007ff390bf18 in mozilla::dom::HTMLCanvasElement::GetContext (this=this@entry=0x7e6036c710, aCx=aCx@entry=0x7fc81defd0, aContextId=..., aContextOptions=aContextOptions@entry=..., aRv=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/Value.h:670
#11 0x0000007ff3549764 in mozilla::dom::HTMLCanvasElement_Binding::getContext (cx=0x7fc81defd0, obj=..., void_self=0x7e6036c710, args=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/RootingAPI.h:1297
#12 0x0000007ff35e0bec in mozilla::dom::binding_detail::GenericMethod<mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::ThrowExceptions> (cx=0x7fc81defd0, argc=<optimized out>, vp=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/CallArgs.h:207
#13 0x0000007ff4e7d5d4 in CallJSNative (args=..., reason=js::CallReason::Call, native=0x7ff35e09ac <mozilla::dom::binding_detail::GenericMethod<mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::ThrowExceptions>(JSContext*, unsigned int, JS::Value*)>, cx=0x7fc81defd0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/CallArgs.h:285
#14 js::InternalCallOrConstruct (cx=cx@entry=0x7fc81defd0, args=..., construct=construct@entry=js::NO_CONSTRUCT, reason=reason@entry=js::CallReason::Call)
    at ${PROJECT}/gecko-dev/js/src/vm/Interpreter.cpp:511
[...]
#63 0x0000007fefbb189c in ?? () from /lib64/libc.so.6
(gdb)
For comparison, it's interesting to check whether the backbuffer is already created at this point. The debugger suggests not:
(gdb) p mPresenter->mBackBuffer
$3 = std::shared_ptr<mozilla::gl::SharedSurface> (expired, weak count 0) = {get() = 0x21}
On this version of the browser the WebGL is working, using offscreen rendering, but the WebView is broken. So now I'm rethinking my need to introduce all the old GLScreenBuffer code. Could I try to use the SwapChain after all? You may recall that I already considered this much earlier, tried it, failed and then reassessed. Maybe I now know enough to make it work? I'm going to look carefully through the code and reconsider.
[...]
I've now spent a good few hours looking through the WebGLContext code, since this is what we see in the backtrace above. There's definitely something in the idea that we should be using this instead of GLContext. But WebGLContext isn't inheriting anything from GLContext and their interfaces look quite different to me. It certainly isn't the case that one would be a drop-in replacement for the other. Quite the contrary in fact. While switching to use WebGLContext might be a better solution in the long-term, I've convinced myself (again) that this isn't what we need right now.
So I'm going back to my original plan, but now we're going in the opposite direction. Rather than removing code I'll now start to reintroduce code. In particular, the one thing I'm convinced that we can't do without is the GLScreenBuffer object, as encapsulated in the GLContext::mScreen member variable.
So I'm adding this class back in. Thankfully git makes this a very easy process:
$ git checkout gfx/gl/GLScreenBuffer.cpp
$ git checkout gfx/gl/GLScreenBuffer.h
This is the minimal change I think is needed to get the WebView working again. I'm building from a base where WebGL is working. So I feel like I'm back on track again.
With just these two files reverted, attempting to build throws up a whole host of errors. Here are just a few:
${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp: In destructor ‘virtual mozilla::gl::GLScreenBuffer::~GLScreenBuffer()’:
${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:205:8: error: invalid use of incomplete type ‘class mozilla::layers::SharedSurfaceTextureClient’
   mBack->Surf()->ProducerRelease();
        ^~
In file included from ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:23,
                 from ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:6:
${PROJECT}/gecko-dev/gfx/gl/SharedSurface.h:45:7: note: forward declaration of ‘class mozilla::layers::SharedSurfaceTextureClient’
 class SharedSurfaceTextureClient;
       ^~~~~~~~~~~~~~~~~~~~~~~~~~
${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp: In member function ‘void mozilla::gl::GLScreenBuffer::BindFB(GLuint)’:
${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:218:10: error: ‘class mozilla::gl::GLContext’ has no member named ‘raw_fBindFramebuffer’; did you mean ‘raw_fBlitFramebuffer’?
     mGL->raw_fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, mInternalDrawFB);
          ^~~~~~~~~~~~~~~~~~~~
          raw_fBlitFramebuffer
This isn't unexpected. My process now is to reintroduce removed code, but only where absolutely necessary to get the build working again. So I'm essentially doing the opposite of what I was doing before: adding code rather than removing it. As before, git is my biggest help here because it's kept a neat record of everything I've changed. I'm reverting it in small pieces, so it's taking a while to make the changes, but I'm still satisfied that what I'll end up with is the smallest set of changes I can reasonably expect, given that we've added the GLScreenBuffer class back in.
It's going to be the convex hull of the GLScreenBuffer dependencies.
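The first of the errors above, "invalid use of incomplete type", is a good illustration of why reverting one file drags others along with it. Here's a minimal sketch of the mechanism using made-up names (SketchTextureClient and SketchScreenBuffer are hypothetical stand-ins, not the Gecko classes). A forward declaration is enough to hold a pointer, but calling a method through that pointer needs the full class definition to be visible first, which in a real code base means including the right header.

```cpp
#include <string>

// Forward declaration only: the compiler knows the name, not the members.
class SketchTextureClient;

// Holding a pointer to an incomplete type is fine.
struct SketchScreenBuffer {
  SketchTextureClient* mBack = nullptr;
  std::string Describe() const;  // defined below, once the type is complete
};

// Uncommenting the next line would reproduce "invalid use of incomplete
// type", because the member access happens before the class is defined:
// inline std::string TooEarly(SketchTextureClient* c) { return c->Name(); }

// The full definition. In a real project this would come from including
// the header that defines the class.
class SketchTextureClient {
 public:
  std::string Name() const { return "texture-client"; }
};

// The type is complete at this point, so calling through the pointer compiles.
std::string SketchScreenBuffer::Describe() const {
  return mBack ? mBack->Name() : "no back buffer";
}
```

So each restored file can require the definitions it uses to be restored too, which is how the set of reverted changes grows beyond the two files I started with.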
[...]
I've got to the stage where the partial build seems to be compiling. But it required changes to the EmbedLite code, which I don't yet have a method of including in the partial build. But it's already late here, so I'm going to set the full build running overnight and see where that gets us.
Today has been a very productive day of development. If I can be similarly productive tomorrow, I'll feel like all of the work I've been putting in over the last week, despite the slow progress, will nevertheless have been worth it.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
22 Jun 2024 : Day 266 #
I'm continuing to strip out methods today after eviscerating the code yesterday. There aren't many changes left and yesterday I established that none of the changed code is being executed before the crash. So I'm not confident that removing the changes will have any effect. But I don't have much else to do at this point so I may as well continue until there are no changes left to make.
It's anomalous though. I have packages built against the code with the commit reverted. I know that reverting all of the changes leaves me with a working browser. So something is clearly amiss.
Nevertheless I'm left with only a few changes now. The code is building and we'll see where this leaves us.
[...]
Code built, installed, executed. And we get the same result: a crash early on in the execution of the browser. I've removed so much code now that this doesn't feel right, so I need to check that something else hasn't broken along the way.
I've tried a whole bunch of things, including removing the profile, using different websites, restarting lipstick. None of this makes any difference.
Installing the packages for the version with working WebGL shows that things are still working for that. And when I then replace the library with the version of the library I've just built... well now that version works too, and with working WebGL as well. But of course the WebView is still broken with this version. But this clearly highlights that the problem isn't where I expected it to be.
So with much frustration I have to concede that something else — something bad — must have been happening elsewhere in the code.
Trying a second test, I install the packages for the version with broken WebGL. That gives the expected result (browser working; WebGL broken). Now I replace the library with my newly built one.
And now it crashes.
So the pattern is:
- Install working WebGL packages followed by latest libxul.so... works.
- Install broken WebGL packages followed by latest libxul.so... crashes.
This is food for thought for sure. This suggests that the problem sits somewhere in the interface between the updated code and one of the EmbedLite code, the QtMozEmbed code, or the sailfish-browser code.
This at least gives me something to go on. I'm going to ruminate on this overnight and try to tackle it tomorrow. This is definitely progress, just not without raising new questions which I'll need to answer.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
21 Jun 2024 : Day 265 #
Sadly, when I checked my machine this morning, I discovered the build I kicked off overnight didn't complete successfully. There have been a couple of errors during the compilation step. The first looks like this:
330:03.83 mobile/sailfishos
330:22.42 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp: In member function ‘void mozilla::embedlite::EmbedLiteCompositorBridgeParent::PrepareOffscreen()’:
330:22.42 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:116:39: error: ‘class mozilla::gl::GLContext’ has no member named ‘Screen’
330:22.42      GLScreenBuffer* screen = context->Screen();
330:22.42                                        ^~~~~~
The second looks like this:
330:22.43 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:124:74: error: no matching function for call to ‘mozilla::gl::SurfaceFactory_EGLImage::Create(mozilla::gl::GLContext*&, std::nullptr_t, mozilla::layers::TextureFlags&)’
330:22.43        factory = SurfaceFactory_EGLImage::Create(context, nullptr, flags);
330:22.43                                                                           ^
There are some further errors, but they look like variations on these two. You might think it's odd that the full build failed when the partial build completed successfully last night. This is an occupational hazard of running partial builds. When running a partial build we have to specify the folder to start in. For example, this is the command I used last night:
$ make -j1 -C obj-build-mer-qt-xr/gfx/
This is going to rebuild everything in the gfx directory and anything that depends on it. I chose this as the root because, as far as I could recall, all of the changes I made were in this directory or one of its children. But that's not always enough. For example, if something elsewhere in the project shares a dependency that sits higher up the directory hierarchy than gfx, it won't necessarily get rebuilt.
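To illustrate the hazard, here's a toy model of it. This isn't how the Gecko build system works internally: the graph, the directory names used in the example and the assumption that a partial build only visits directories under its root are all simplifications I'm making for the sake of the sketch. It computes everything that ought to be rebuilt after a change, then lists the parts that a build rooted at one directory would skip.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps a directory to the directories that depend on it (hypothetical graph).
using DependentsGraph = std::map<std::string, std::vector<std::string>>;

// Collects everything that needs rebuilding after `changed` is edited:
// the changed directory plus all of its transitive dependents.
std::set<std::string> NeedsRebuild(const DependentsGraph& graph,
                                   const std::string& changed) {
  std::set<std::string> seen;
  std::vector<std::string> pending{changed};
  while (!pending.empty()) {
    std::string current = pending.back();
    pending.pop_back();
    if (!seen.insert(current).second) {
      continue;  // already visited
    }
    auto found = graph.find(current);
    if (found == graph.end()) {
      continue;  // nothing depends on this directory
    }
    for (const auto& dependent : found->second) {
      pending.push_back(dependent);
    }
  }
  return seen;
}

// Returns the entries of `required` that a build rooted at `root` would
// skip, treating "inside the root" as a simple path-prefix test.
std::set<std::string> MissedByPartialBuild(const std::set<std::string>& required,
                                           const std::string& root) {
  std::set<std::string> missed;
  for (const auto& dir : required) {
    if (dir.compare(0, root.size(), root) != 0) {
      missed.insert(dir);
    }
  }
  return missed;
}
```

Feeding this a graph in which mobile/sailfishos/embedthread depends on gfx/gl, a change to gfx/gl with the build rooted at gfx/ leaves the EmbedLite directory in the missed set, which matches where the full build's errors showed up.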
If I'd run the final linker stage at the end of the process, the error may have been exposed by an undefined reference, but I was so tired last night by the time I'd made all of the changes to the source code that I could barely think straight. So I had neither the energy nor the wit to do this.
Never mind, with any luck I can fix it this morning and get another build running during the day.
The fix appears to involve reverting almost all of the changes made to EmbedLiteCompositorBridgeParent.cpp to re-accommodate GLContext::mScreen, which I'd previously switched for GLContext::mSwapChain in line with changes that happened upstream between ESR 78 and ESR 91. Given that I removed GLScreenBuffer, which mScreen was an instance of, these changes aren't too surprising in retrospect. But I was never going to notice them given my state of tiredness last night.
So anyway, here we are, it's still early and a fresh build is running. Hopefully this one will enjoy more success!
[...]
And happily it does: the build has completed without any errors, in time for some evening development.
Now time to execute it. And the result is...
Thread 38 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 13378]
0x0000007fe7e374cc in wl_proxy_marshal_constructor () from /usr/lib64/libwayland-client.so.0
(gdb) bt
#0  0x0000007fe7e374cc in wl_proxy_marshal_constructor () from /usr/lib64/libwayland-client.so.0
#1  0x0000007fe7b8742c in ServerWaylandBuffer::ServerWaylandBuffer(unsigned int, unsigned int, int, int, android_wlegl*, wl_event_queue*) () from /usr/lib64/libhybris//eglplatform_wayland.so
#2  0x0000007fe7b874c8 in WaylandNativeWindow::addBuffer() () from /usr/lib64/libhybris//eglplatform_wayland.so
#3  0x0000007fe7b86728 in WaylandNativeWindow::dequeueBuffer(BaseNativeWindowBuffer**, int*) () from /usr/lib64/libhybris//eglplatform_wayland.so
#4  0x0000007fe7b4d124 in BaseNativeWindow::_dequeueBuffer(ANativeWindow*, ANativeWindowBuffer**, int*) () from /usr/lib64/libhybris-platformcommon.so.1
#5  0x0000007fe4fa9188 in ?? ()
#6  0x0000000000000438 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
Honestly, I really thought that I'd removed enough of the changes that this error would no longer occur, so I'm surprised that it's still here. I must be missing something important. There aren't really so many active changes in the code now and I'm really struggling to figure out what the problem is. So I've placed breakpoints on the main remaining edited methods. One of these will have to be hit before the crash occurs.
Here's the list of breakpoints I've set:
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   <PENDING>          DirectUpdate
2       breakpoint     keep y   <PENDING>          TextureImageEGL::Resize
3       breakpoint     keep y   <PENDING>          TextureImageEGL::ReleaseTexImage
4       breakpoint     keep y   <PENDING>          TextureImageEGL::TextureImageEGL
5       breakpoint     keep y   <PENDING>          DestroyTextureData
6       breakpoint     keep y   <PENDING>          TextureClient::Destroy
7       breakpoint     keep y   <PENDING>          CompositorOGL::PrepareViewport
8       breakpoint     keep y   <PENDING>          CompositorOGL::DrawGeometry
(gdb)
Astonishingly not one of these hits. This is crazy. As I try to add the final few breakpoints, even on the methods that look unused, I notice that there are a couple of classes that have signatures but no implementation. Is it possible this could be the reason and I've been missing this all along?
I've now removed those classes and the few methods that also had signatures without definitions. I had thought that if these were the problems either the compiler would pick up on them or it would just fail when an attempt was made to load the library. Maybe I was wrong.
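As a way of checking my intuition about signatures without definitions, here's a minimal sketch. The Surface and NeverDefined classes are hypothetical and aren't taken from the gecko source. It's intended to show that a non-virtual method which is declared but never defined is harmless to the compiler and linker, so long as nothing calls it.

```cpp
#include <cassert>

// Hypothetical class for illustration only; not part of gecko-dev.
class Surface {
 public:
  int Width() const { return mWidth; }

  // Declared but never defined anywhere. This is legal as long as
  // nothing calls it, because the linker never has to resolve it.
  void Resize(int aWidth);

 private:
  int mWidth = 64;
};

// A whole class can also be declared without ever being defined.
class NeverDefined;

// Uses only the parts of Surface that have definitions.
int SurfaceWidth() {
  Surface s;
  assert(s.Width() > 0);
  // Only Width() is called, so Surface::Resize() is never needed.
  return s.Width();
}
```

If SurfaceWidth() were changed to call s.Resize(32) the link step would be expected to fail with an undefined reference. Virtual methods are a different matter, since the vtable typically references them whether or not they're called.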
So, as I say, I've removed the class signatures and related code. The good news is that the partial build completed fine, including the linking stage. Does it run?
No. No it doesn't. The crash, along with its backtrace, remains identical.
Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 31572]
0x0000007fe7e364bc in wl_proxy_marshal_constructor () from /usr/lib64/libwayland-client.so.0
(gdb) bt
#0  0x0000007fe7e364bc in wl_proxy_marshal_constructor () from /usr/lib64/libwayland-client.so.0
#1  0x0000007fe7b8642c in ServerWaylandBuffer::ServerWaylandBuffer(unsigned int, unsigned int, int, int, android_wlegl*, wl_event_queue*) () from /usr/lib64/libhybris//eglplatform_wayland.so
#2  0x0000007fe7b864c8 in WaylandNativeWindow::addBuffer() () from /usr/lib64/libhybris//eglplatform_wayland.so
#3  0x0000007fe7b85728 in WaylandNativeWindow::dequeueBuffer(BaseNativeWindowBuffer**, int*) () from /usr/lib64/libhybris//eglplatform_wayland.so
#4  0x0000007fe7b4c124 in BaseNativeWindow::_dequeueBuffer(ANativeWindow*, ANativeWindowBuffer**, int*) () from /usr/lib64/libhybris-platformcommon.so.1
#5  0x0000007fe4f69188 in ?? ()
#6  0x0000000000000438 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
That's not what I was hoping to see and this is deeply frustrating. I've reached the end of my usable hours for today, so I'll have to continue with this tomorrow. I'm not sure how much further I can strip code out until there's nothing left to remove, but I'll continue onward. Right now that seems like the only sane thing to do.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
20 Jun 2024 : Day 264 #
Overnight I ran a build having removed the Wayland code added in the last commit, or at least a decent proportion of it. This morning everything was built and it's time to test it out.
Unfortunately we still get a crash. The backtrace isn't identical to the backtrace we were getting a couple of days ago, but it's similar. Similar enough to make me think we've not actually fixed anything yet. Here's the backtrace:
Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 25343]
0x0000007fefec60e4 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000007fefec60e4 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x0000007fe7e34170 in wl_proxy_marshal_array_constructor_versioned () from /usr/lib64/libwayland-client.so.0
#2  0x0000007fe7e344e8 in wl_proxy_marshal_constructor () from /usr/lib64/libwayland-client.so.0
#3  0x0000007fe7b8442c in ServerWaylandBuffer::ServerWaylandBuffer(unsigned int, unsigned int, int, int, android_wlegl*, wl_event_queue*) () from /usr/lib64/libhybris//eglplatform_wayland.so
#4  0x0000007fe7b844c8 in WaylandNativeWindow::addBuffer() () from /usr/lib64/libhybris//eglplatform_wayland.so
#5  0x0000007fe7b83728 in WaylandNativeWindow::dequeueBuffer(BaseNativeWindowBuffer**, int*) () from /usr/lib64/libhybris//eglplatform_wayland.so
#6  0x0000007fe7b4a124 in BaseNativeWindow::_dequeueBuffer(ANativeWindow*, ANativeWindowBuffer**, int*) () from /usr/lib64/libhybris-platformcommon.so.1
#7  0x0000007fe4f69188 in ?? ()
#8  0x0000000000000438 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
So following this I've been working away at the code again. It's mostly business as usual: I continue to remove code, reverting pieces back to the code as it was before my latest commit. I went a bit further than usual today though, removing GLScreenBuffer entirely. That's a really significant change and it required plenty of refinement: fixing up code that made use of it; fixing up code that made use of code that made use of it. Each set of changes brings us a little closer to a working WebGL.
So this turned out to be a really big change. Having made it, I now have to build and test it. It's going to be a short post today, but that reflects the fact I've spent all my time stripping out code. Big changes, but there's not so much to say about it if I'm honest.
I've checked that the partial build goes through. But it's really late now, so I may as well set a full build running overnight. Then I can test the changes in the morning.
The one nice thing about the process I'm currently undertaking with these WebGL changes is that I know that the process is bounded. I have a broken version, I have a working version, I just need to find the tipping point between the two where it switches from broken to fixed. Once I have that it won't be the end of the story, because the changes will inevitably have broken the WebView, but at that point, once I know what's breaking the WebGL, I can go back to the working WebView and reapply that specific change. Or so the theory goes.
Before I sign off for today, I'm going to take a moment to indulge in some development philosophy.
The big difference between detective stories and real life detective work is that in the fictional version the clues are all there, you just need to find them. Real life doesn't come with that guarantee: you can spend your entire life looking for clues that don't exist. I'm sure that's a big part of the reason why, as humans, we prefer computer games over real life. In a computer game we know in advance there's a finite, bounded, solution. A big portion of the uncertainty has already been taken away.
With this particular WebGL bug I'm inhabiting this same happy computer-game place where I know there's a solution, I just have to find it. It may be a slow process, it may be a little arduous at times, but the solution is there, I just need to find it. It exists somewhere between the last commit and the current commit. We'll get there.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
19 Jun 2024 : Day 263 #
Finally I have my new packages installed and ready to go. My development was disrupted today by the need to have a new keyboard installed in my laptop (the original having been broken from overuse it seems). The technician who came to replace it was highly skilled and incredibly efficient. I was, I have to admit, impressed.
But with the new keyboard in place I'm able to type and develop again, so let's get back to it.
Time to step through the code. Yesterday I was anxious about mDefaultDisplay which gets set, then immediately expires. After pondering this in more detail I concluded that the expiration was inevitable. The value gets stored in a weak pointer without any related shared pointer. It expires immediately.
That's just the logic as it is. So what happens in the previous working version? I've installed my backup copy of the old package and, on testing out this bit of the code, it does exactly the same:
Thread 37 "Compositor" hit Breakpoint 2, mozilla::gl::GLLibraryEGL::DefaultDisplay (this=this@entry=0x7ed81a23e0, out_failureId=out_failureId@entry=0x7f2e381f60)
    at gfx/gl/GLLibraryEGL.cpp:741
741         nsACString* const out_failureId) {
(gdb) n
742       auto ret = mDefaultDisplay.lock();
(gdb) n
745       ret = CreateDisplay(false, out_failureId);
(gdb)
Interestingly I also discovered while checking this that in the working version CreateDisplay() gets called additional times if there's WebGL content on the page. Alongside the initial creation it also gets called from Init() like this when creating the canvas texture to render to:
Thread 8 "GeckoWorkerThre" hit Breakpoint 1, mozilla::gl::GLLibraryEGL::CreateDisplay (this=this@entry=0x7fca67d450, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7fdf2ddd18, aDisplay=aDisplay@entry=0x0)
    at gfx/gl/GLLibraryEGL.cpp:752
752         EGLDisplay aDisplay) {
(gdb) bt
#0  mozilla::gl::GLLibraryEGL::CreateDisplay (this=this@entry=0x7fca67d450, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7fdf2ddd18, aDisplay=aDisplay@entry=0x0)
    at gfx/gl/GLLibraryEGL.cpp:752
#1  0x0000007ff28c27e8 in mozilla::gl::GLLibraryEGL::Init (this=this@entry=0x7fca67d450, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7fdf2ddd18, aDisplay=aDisplay@entry=0x0)
    at gfx/gl/GLLibraryEGL.cpp:504
[...]
That's interesting. But the conclusion is the same, which is that this particular change isn't causing the crash. I'll need to look elsewhere to fix this.
The crash location after making the most recent changes is still inside wl_proxy_marshal_constructor() which relates to Wayland. I did add a load of code related to Wayland in the last commit, although I'm not totally sure it gets used. Most of it has been added inside the GLContextProviderEGL.cpp file. As I'm looking through it I notice that the GetAppDisplay() method is part of this bundle of code and if I get rid of it, it'd make sense to remove GetAppDisplay() at the same time.
But that will have a knock-on effect, since this is the "source" of the display variable, so to speak. Before this existed, the display value appeared, as if by magic, as the result of the aDisplay parameter of the CreateDisplay() method having a default value set on it. That value being EGL_NO_DISPLAY.
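To make the default-value mechanism concrete, here's a small sketch. FakeDisplay, kNoDisplay and UsesProvidedDisplay() are hypothetical stand-ins I've made up for illustration; the real EGLDisplay type and EGL_NO_DISPLAY constant come from the EGL headers and the real CreateDisplay() does a great deal more than this.

```cpp
#include <cassert>

// Hypothetical stand-ins for EGLDisplay and EGL_NO_DISPLAY.
using FakeDisplay = void*;
constexpr FakeDisplay kNoDisplay = nullptr;

// With a default value on the parameter, callers that don't care about
// the display can omit it and the "no display" value appears by itself.
bool UsesProvidedDisplay(FakeDisplay aDisplay = kNoDisplay) {
  return aDisplay != kNoDisplay;
}
```

Calling UsesProvidedDisplay() with no argument behaves as if kNoDisplay had been passed in, which matches the aDisplay=0x0 value seen in the backtrace from the working version.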
So now I'm wondering what happens if I revert that part and restore this default value again. It's possible that this is where the problem lies: the WebGL is creating a display, using that provided by Wayland and getting in a mess.
I'm not sure, but I'm going to try making this change before getting rid of all the new Wayland code.
The changes I've had to make to get it to compile make me a little uncomfortable, but let's see.
[...]
Having made the changes I still get a crash, and it's in the same location as before. But the code is already now in a better state for me to remove the Wayland changes, so I'm going to do that now and set a build running overnight.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
18 Jun 2024 : Day 262 #
I'm continuing to look into WebGL today. Yesterday I made a change to the GLLibraryEGL::Init() method to restore a call to GLLibraryEGL::CreateDisplay(). Debugging was getting tricky due to the debug source getting out of sync with the binary, which prevented gdb from tying addresses to source lines properly. So I kicked off a build overnight.
Unfortunately the build hasn't finished yet this morning, so I'm working on the theory, rather than the practice, in the meantime. One thing I notice is that now the CreateDisplay() method gets called twice. It's possible this is intentional, but I'm not convinced so want to explore it a little further.
The first time it gets called is in the new location, inside the Init() method. This is as expected; it's the new call I just added yesterday:
Thread 37 "Compositor" hit Breakpoint 3, mozilla::gl::GLLibraryEGL::CreateDisplay (this=this@entry=0x7ed81a21e0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7f2e4e7f60, aDisplay=aDisplay@entry=0x1)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:752
752     ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp: No such file or directory.
(gdb) bt
#0  mozilla::gl::GLLibraryEGL::CreateDisplay (this=this@entry=0x7ed81a21e0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7f2e4e7f60, aDisplay=aDisplay@entry=0x1)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:752
#1  0x0000007ff28c09e0 in mozilla::gl::GLLibraryEGL::Init (this=this@entry=0x7ed81a21e0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7f2e4e7f60, aDisplay=aDisplay@entry=0x1)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:504
#2  0x0000007ff28c191c in mozilla::gl::GLContextProviderEGL::CreateWrappingExisting (aContext=0x7ed80049a0, aSurface=0x5555988610, aDisplay=0x1)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1177
#3  0x0000007ff4e20f4c in mozilla::embedlite::nsWindow::GetGLContext (this=this@entry=0x7fc8ab77c0)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/nsWindow.cpp:405
#4  0x0000007ff4e21118 in mozilla::embedlite::nsWindow::GetNativeData (this=0x7fc8ab77c0, aDataType=12)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/nsWindow.cpp:173
#5  0x0000007ff293ac00 in mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7ed8002f10)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:232
#6  0x0000007ff29504d4 in mozilla::layers::CompositorOGL::Initialize (this=0x7ed8002f10, out_failureReason=0x7f2e4e8510)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:393
#7  0x0000007ff2a66244 in mozilla::layers::CompositorBridgeParent::NewCompositor (this=this@entry=0x7fc8a7f630, aBackendHints=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#8  0x0000007ff2a712c0 in mozilla::layers::CompositorBridgeParent::InitializeLayerManager (this=this@entry=0x7fc8a7f630, aBackendHints=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#9  0x0000007ff2a713f0 in mozilla::layers::CompositorBridgeParent::AllocPLayerTransactionParent (this=this@entry=0x7fc8a7f630, aBackendHints=..., aId=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#10 0x0000007ff4e08008 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent (this=0x7fc8a7f630, aBackendHints=..., aId=...)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#11 0x0000007ff24022f0 in mozilla::layers::PCompositorBridgeParent::OnMessageReceived (this=0x7fc8a7f630, msg__=...)
    at PCompositorBridgeParent.cpp:1285
[...]
#26 0x0000007fefbac89c in ?? () from /lib64/libc.so.6
(gdb)
But it hits a second time as well. It's possible this is when it would have been called without the addition I just made, but I'm not certain. This time it's happening inside GLLibraryEGL::DefaultDisplay(). This is odd. I'll come on to why it's odd in a moment, but first here are the relevant parts of the backtrace:
Thread 37 "Compositor" hit Breakpoint 3, mozilla::gl::GLLibraryEGL::CreateDisplay (this=this@entry=0x7ed81a21e0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7f2e4e7f60, aDisplay=aDisplay@entry=0x1)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:752
752     in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp
(gdb) bt
#0  mozilla::gl::GLLibraryEGL::CreateDisplay (this=this@entry=0x7ed81a21e0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7f2e4e7f60, aDisplay=aDisplay@entry=0x1)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:752
#1  0x0000007ff28c1554 in mozilla::gl::GLLibraryEGL::DefaultDisplay (this=this@entry=0x7ed81a21e0, out_failureId=out_failureId@entry=0x7f2e4e7f60, aDisplay=aDisplay@entry=0x1)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:745
#2  0x0000007ff28c19a4 in mozilla::gl::GLContextProviderEGL::CreateWrappingExisting (aContext=0x7ed80049a0, aSurface=0x5555988610, aDisplay=0x1)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1183
#3  0x0000007ff4e20f4c in mozilla::embedlite::nsWindow::GetGLContext (this=this@entry=0x7fc8ab77c0)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/nsWindow.cpp:405
#4  0x0000007ff4e21118 in mozilla::embedlite::nsWindow::GetNativeData (this=0x7fc8ab77c0, aDataType=12)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/nsWindow.cpp:173
#5  0x0000007ff293ac00 in mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7ed8002f10)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:232
#6  0x0000007ff29504d4 in mozilla::layers::CompositorOGL::Initialize (this=0x7ed8002f10, out_failureReason=0x7f2e4e8510)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:393
#7  0x0000007ff2a66244 in mozilla::layers::CompositorBridgeParent::NewCompositor (this=this@entry=0x7fc8a7f630, aBackendHints=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#8  0x0000007ff2a712c0 in mozilla::layers::CompositorBridgeParent::InitializeLayerManager (this=this@entry=0x7fc8a7f630, aBackendHints=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#9  0x0000007ff2a713f0 in mozilla::layers::CompositorBridgeParent::AllocPLayerTransactionParent (this=this@entry=0x7fc8a7f630, aBackendHints=..., aId=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#10 0x0000007ff4e08008 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent (this=0x7fc8a7f630, aBackendHints=..., aId=...)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#11 0x0000007ff24022f0 in mozilla::layers::PCompositorBridgeParent::OnMessageReceived (this=0x7fc8a7f630, msg__=...)
    at PCompositorBridgeParent.cpp:1285
#12 0x0000007ff2446804 in mozilla::layers::PCompositorManagerParent::OnMessageReceived (this=<optimized out>, msg__=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:675
[...]
#26 0x0000007fefbac89c in ?? () from /lib64/libc.so.6
(gdb)
If this is where it would have been called previously, why is the call here unexpected? It's because of the code directly before the call to CreateDisplay(). Here's the call that's made in the Init() method:
  std::shared_ptr<EglDisplay> defaultDisplay =
      CreateDisplay(forceAccel, out_failureId, aDisplay);
  if (!defaultDisplay) {
    return false;
  }
  mDefaultDisplay = defaultDisplay;
Notice how mDefaultDisplay is set directly after the call returns. There's no call to Init() to be found in the second backtrace so we can assume that the first call to CreateDisplay() safely completed before the second call was made. But now the code that appears inside DefaultDisplay() looks like this:
std::shared_ptr<EglDisplay> GLLibraryEGL::DefaultDisplay(
    nsACString* const out_failureId, EGLDisplay aDisplay) {
  auto ret = mDefaultDisplay.lock();
  if (ret) return ret;
  ret = CreateDisplay(false, out_failureId, aDisplay);
  mDefaultDisplay = ret;
  return ret;
}
Given the previous chunk of code, by this time the mDefaultDisplay should be non-null and the if (ret) return ret line should therefore be returning immediately as a consequence. In other words, it should be returning before the call to CreateDisplay().
There is another possibility, which is that the first call to CreateDisplay() is returning a null pointer. Initially I thought I'd need line-by-line debugging to check this, but it turns out there's another way:
    out_failureId=out_failureId@entry=0x7f16603f60, aDisplay=aDisplay@entry=0x1)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:752
752     ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp: No such file or directory.
(gdb) p mDefaultDisplay
$2 = std::weak_ptr<mozilla::gl::EglDisplay> (empty) = {get() = 0x0}
(gdb) c
Continuing.
Thread 37 "Compositor" hit Breakpoint 3, mozilla::gl::GLLibraryEGL::CreateDisplay (this=this@entry=0x7ed81a20a0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7f16603f60, aDisplay=aDisplay@entry=0x1)
    at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:752
752     in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp
(gdb) p mDefaultDisplay
$3 = std::weak_ptr<mozilla::gl::EglDisplay> (expired, weak count 1) = {get() = 0x7ed819b120}
(gdb)
This little sequence initially looks odd. The second time CreateDisplay() is called the value of mDefaultDisplay is non-null. But if that's the case, why isn't DefaultDisplay() returning early?
Assuming for the time being the first call isn't null, that might imply that there are two different instances of GLLibraryEGL in operation. I thought that maybe this might be as a consequence of there being WebGL content on the page, but when I checked using a page without any such content, I got exactly the same result.
But taking another look at that debug output above, the reason is right there after all. It's because in the second case the pointer (which is a weak pointer) has already expired, meaning that its reference count reached zero. Consequently the call to lock it in DefaultDisplay will be returning null.
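The expiry behaviour can be reproduced in isolation. What follows is a minimal sketch rather than the real gecko code: Display and Library are hypothetical stand-ins for EglDisplay and GLLibraryEGL, stripped down to just the weak pointer logic, and it assumes that nothing else holds a shared pointer to the display between the two calls.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for mozilla::gl::EglDisplay.
struct Display {};

// Hypothetical stand-in for GLLibraryEGL, reduced to the weak pointer logic.
class Library {
 public:
  // Same shape as Init(): the only owning shared_ptr is a local variable.
  bool Init() {
    std::shared_ptr<Display> created = std::make_shared<Display>();
    mDefaultDisplay = created;
    return true;
  }  // 'created' goes out of scope here, so the Display is destroyed.

  // Same shape as DefaultDisplay(): reuse if still alive, else recreate.
  std::shared_ptr<Display> DefaultDisplay() {
    auto ret = mDefaultDisplay.lock();
    if (ret) return ret;
    ++mRecreations;
    ret = std::make_shared<Display>();
    mDefaultDisplay = ret;
    return ret;
  }

  bool Expired() const { return mDefaultDisplay.expired(); }
  int Recreations() const { return mRecreations; }

 private:
  std::weak_ptr<Display> mDefaultDisplay;
  int mRecreations = 0;
};

// Returns true if DefaultDisplay() had to create the display a second time.
bool InitThenDefaultRecreates() {
  Library lib;
  lib.Init();
  // The weak pointer was set, but its target has already been destroyed.
  assert(lib.Expired());
  auto display = lib.DefaultDisplay();
  return display != nullptr && lib.Recreations() == 1;
}
```

In the sketch the weak pointer only stays alive for as long as a caller keeps hold of the shared pointer returned from DefaultDisplay(). A second call made while that's still held returns the same object without recreating it.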
That still seems a little odd if I'm honest. I'd have expected it not to expire until either the window or the browser is shut.
Now I don't know for sure that this is wrong, but it certainly feels wrong. Why create a display object and then immediately drop it?
I want to investigate this further, but it's time for my work day to start so I'll have to pause until this evening. This is no bad thing: it'll give time for the build to complete, by which time I'll have access to line-by-line debugging and the git history, both of which I anticipate being super-helpful in getting to the bottom of this. Until later then.
[...]
The build took a lot longer than I'd anticipated, too long to do any debugging today. But I have at least been able to transfer over and install the new packages. That means tomorrow debugging will be much easier and I should be able to find out why this weak pointer is getting released before there's a proper chance to make use of it.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
17 Jun 2024 : Day 261 #
Yesterday I collected a bunch of backtraces. These were all from methods added in the last commit for the purpose of WebView rendering, but which also trigger when rendering WebGL content on the browser. They're the methods I'm really interested in.
The very first of the methods to hit a breakpoint was the CompositorOGL::CreateContext() method. Checking the diff between the previous commit and HEAD, we can see that I made the following changes to this method:
@@ -247,11 +247,9 @@ already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
   // Allow to create offscreen GL context for main Layer Manager
   if (!context && gfxEnv::LayersPreferOffscreen()) {
     nsCString discardFailureId;
-    context = GLContextProvider::CreateHeadless(
-        {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId);
-    if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) {
-      context = nullptr;
-    }
+    context = GLContextProvider::CreateOffscreen(
+        mSurfaceSize, CreateContextFlags::REQUIRE_COMPAT_PROFILE,
+        &discardFailureId);
   }
   if (!context) {
In other words, I messed around with the generation of the RefPtr<GLContext> context variable so that it gets created using GLContextProvider::CreateOffscreen() rather than GLContextProvider::CreateHeadless() as it was doing before.
Okay, that's something I can change back easily. So I've added this change on top, effectively reversing the change that I made before:
diff --git a/gfx/layers/opengl/CompositorOGL.cpp b/gfx/layers/opengl/CompositorOGL.cpp
index 122709eaf2de..06c84a9ebdaa 100644
--- a/gfx/layers/opengl/CompositorOGL.cpp
+++ b/gfx/layers/opengl/CompositorOGL.cpp
@@ -247,9 +247,15 @@ already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
   // Allow to create offscreen GL context for main Layer Manager
   if (!context && gfxEnv::LayersPreferOffscreen()) {
     nsCString discardFailureId;
-    context = GLContextProvider::CreateOffscreen(
-        mSurfaceSize, CreateContextFlags::REQUIRE_COMPAT_PROFILE,
-        &discardFailureId);
+
+    context = GLContextProvider::CreateHeadless(
+        {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId);
+    if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) {
+      context = nullptr;
+    }
+//    context = GLContextProvider::CreateOffscreen(
+//        mSurfaceSize, CreateContextFlags::REQUIRE_COMPAT_PROFILE,
+//        &discardFailureId);
   }
   if (!context) {
As you can see, this comments out the new call to CreateOffscreen() and replaces it with a call to CreateHeadless() so that it's doing the same as it did before. I've left the removed code in as a comment to make it easier for me to compare, but actually git is keeping track of all this for me already, so I could have safely just deleted that code. I may yet do that!
Having rebuilt, reinstalled and executed these changes, there's no change on the browser side, but for the WebView I now get a segfault crash occurring like this:
Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 305]
0x0000007ff366470c in mozilla::gl::GLScreenBuffer::Size (this=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory.
(gdb) bt
#0  0x0000007ff366470c in mozilla::gl::GLScreenBuffer::Size (this=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#1  mozilla::embedlite::EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget (this=0x7fc4b0a5a0, aId=...)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:151
#2  0x0000007ff12b49d8 in mozilla::layers::CompositorVsyncScheduler::ForceComposeToTarget (this=0x7fc4c98760, aTarget=aTarget@entry=0x0, aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/LayersTypes.h:82
#3  0x0000007ff12b4a34 in mozilla::layers::CompositorBridgeParent::ResumeComposition (this=this@entry=0x7fc4b0a5a0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b4ac0 in mozilla::layers::CompositorBridgeParent::ResumeCompositionAndResize (this=0x7fc4b0a5a0, x=<optimized out>, y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12ad65c in mozilla::detail::RunnableMethodArguments<int, int, int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int), StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul, 2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
#6  mozilla::detail::RunnableMethodArguments<int, int, int, int>::apply<mozilla::layers::CompositorBridgeParent, void (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int)> (m=<optimized out>, o=<optimized out>, this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1154
#7  mozilla::detail::RunnableMethodImpl<mozilla::layers::CompositorBridgeParent*, void (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int), true, (mozilla::RunnableKind)0, int, int, int, int>::Run (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1201
#8  0x0000007ff07fd018 in nsThread::ProcessNextEvent (this=0x7fc4c25780, aMayWait=<optimized out>, aResult=0x7f1f608cb7)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:869
[...]
#17 0x0000007ff6a0289c in ?? () from /lib64/libc.so.6
(gdb)

It seems that a call is now being made to GLScreenBuffer::Size() where the GLScreenBuffer object that the method belongs to doesn't exist.
This is fine. At present I really only care about getting the browser to work. If the change I just made turns out to be necessary I'll have to return to this and fix it, so I'll need this backtrace, but right now I have to forge on and focus on the WebGL rendering.
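For when I do return to it, the this=0x0 in frame #0 suggests the caller reached Size() through a null screen buffer pointer. Here's a minimal sketch of the kind of guard that would avoid the crash. FakeScreenBuffer, FakeGLContext and SafeScreenSize() are hypothetical names used purely for illustration, not the real EmbedLite code, and whether a zero size is an acceptable fallback there is an assumption.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for mozilla::gl::GLScreenBuffer.
struct FakeScreenBuffer {
  int width;
  int height;
  std::pair<int, int> Size() const { return {width, height}; }
};

// Hypothetical stand-in for a GL context that may or may not own a
// screen buffer.
struct FakeGLContext {
  std::unique_ptr<FakeScreenBuffer> screen;  // null if never created
};

// Returns the screen size, or {0, 0} when there is no screen buffer,
// rather than calling Size() through a null pointer.
std::pair<int, int> SafeScreenSize(const FakeGLContext& context) {
  if (!context.screen) {
    return {0, 0};
  }
  return context.screen->Size();
}
```

A guard like this would only hide the symptom, of course. The more interesting question is why the screen buffer never gets created on this code path.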
I'm now checking the following change:
@@ -1081,7 +1251,11 @@ static void FillContextAttribs(bool es3, bool useGles, nsTArray<EGLint>* out) {
   } else
 #endif
   {
-    out->AppendElement(LOCAL_EGL_PBUFFER_BIT);
+    if (useWindow) {
+      out->AppendElement(LOCAL_EGL_WINDOW_BIT);
+    } else {
+      out->AppendElement(LOCAL_EGL_PBUFFER_BIT);
+    }
   }
   if (useGles) {

As we can see, previously LOCAL_EGL_PBUFFER_BIT was always appended. Now there's a switch, so the value could potentially change. Here's what the debugger shows us:
Thread 8 "GeckoWorkerThre" hit Breakpoint 1, mozilla::gl::FillContextAttribs (out=0x7fdf293b78, useWindow=false, useGles=false, es3=false)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1245
1245    ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp: No such file or directory.
(gdb) p useWindow
$1 = false
(gdb)

With useWindow set to false we'll still be appending LOCAL_EGL_PBUFFER_BIT, so this isn't a relevant change. It also appears to get called at initialisation, but not when the WebGL view is created. So this looks safe to leave as it is. We see something similar happening in CreateEGLPBufferOffscreenContextImpl():
@@ -1202,9 +1387,14 @@ RefPtr<GLContextEGL> GLContextEGL::CreateEGLPBufferOffscreenContextImpl(
   } else
 #endif
   {
-    surface = GLContextEGL::CreatePBufferSurfaceTryingPowerOfTwo(
-        *egl, config, LOCAL_EGL_NONE, pbSize);
+    if (useWindow) {
+      surface = CreateEmulatorBufferSurface(egl->mLib, config, pbSize);
+    } else {
+      surface = GLContextEGL::CreatePBufferSurfaceTryingPowerOfTwo(
+          *egl, config, LOCAL_EGL_NONE, pbSize);
+    }
   }

Again, what we want is for useWindow to be set to false and debugging the app shows this to be the case. We have to work a little harder for it this time though, because the variable is set based on conditions at the start of the method and the condition occurs further down. I would have placed a breakpoint on the exact line, but our partial build shenanigans have broken line-by-line debugging, so I've placed breakpoints on each of the called methods instead:
(gdb) break CreateEmulatorBufferSurface
Breakpoint 6 at 0x7ff28c1538: file ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp, line 982.
(gdb) break GLContextEGL::CreatePBufferSurfaceTryingPowerOfTwo
Breakpoint 7 at 0x7ff28ae934: file ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp, line 911.
(gdb) c
Continuing.
Thread 8 "GeckoWorkerThre" hit Breakpoint 7, mozilla::gl::GLContextEGL::CreatePBufferSurfaceTryingPowerOfTwo (egl=..., config=config@entry=0x5555ab3950, bindToTextureFormat=bindToTextureFormat@entry=12344, pbsize=...)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:911
911     in ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp
(gdb)

It's the CreatePBufferSurfaceTryingPowerOfTwo() breakpoint that gets hit, not the CreateEmulatorBufferSurface() one, which confirms that useWindow is false here too.

I'm going to try to tackle GLLibraryEGL::Init() now. There seems to be one significant chunk of code removed from this method. Adding it back in would be straightforward were it not for the fact the display value is no longer passed in to the method. Adding it back in required a cascade of changes, but in essence the important change is the following:
-bool GLLibraryEGL::Init(bool forceAccel, nsACString* const out_failureId) {
+bool GLLibraryEGL::Init(bool forceAccel, nsACString* const out_failureId, EGLDisplay aDisplay) {
   MOZ_RELEASE_ASSERT(!mSymbols.fTerminate);
   mozilla::ScopedGfxFeatureReporter reporter("EGL");
@@ -501,6 +501,11 @@ bool GLLibraryEGL::Init(bool forceAccel, nsACString* const out_failureId) {
   }
   // -
+  std::shared_ptr<EglDisplay> defaultDisplay = CreateDisplay(forceAccel, out_failureId, aDisplay);
+  if (!defaultDisplay) {
+    return false;
+  }
+  mDefaultDisplay = defaultDisplay;
   InitLibExtensions();

Now when I execute this code I get an almost immediate segfault:
Thread 38 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 32407]
0x0000007fe7e324cc in wl_proxy_marshal_constructor () from /usr/lib64/libwayland-client.so.0
(gdb) bt
#0  0x0000007fe7e324cc in wl_proxy_marshal_constructor () from /usr/lib64/libwayland-client.so.0
#1  0x0000007fe7b8242c in ServerWaylandBuffer::ServerWaylandBuffer(unsigned int, unsigned int, int, int, android_wlegl*, wl_event_queue*) () from /usr/lib64/libhybris//eglplatform_wayland.so
#2  0x0000007fe7b824c8 in WaylandNativeWindow::addBuffer() () from /usr/lib64/libhybris//eglplatform_wayland.so
#3  0x0000007fe7b81728 in WaylandNativeWindow::dequeueBuffer(BaseNativeWindowBuffer**, int*) () from /usr/lib64/libhybris//eglplatform_wayland.so
#4  0x0000007fe7b48124 in BaseNativeWindow::_dequeueBuffer(ANativeWindow*, ANativeWindowBuffer**, int*) () from /usr/lib64/libhybris-platformcommon.so.1
#5  0x0000007fe4f69188 in ?? ()
#6  0x0000000000000438 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

It's not clear to me why the change I made has this effect, given all of this seems to be happening inside the Wayland code. However, by stepping through the code, even without the source lines, I've been able to establish that the Init() method is called, as is the new code I added in. This new code executes without error, so there's nothing in the new code that's directly triggering the segfault. But some consequence of the change certainly is.
I've continued stepping through the method, out of the method and up the stack, but so far without hitting the crash. I'm going to continue looking at this, but that will have to wait until tomorrow now. I'm sure I'll get to the bottom of it and that we're approaching a solution, even if it is now taking longer than I'd have preferred.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
16 Jun 2024 : Day 260 #
I'm getting well into the swing of things now, doing my best to get to the bottom of the WebGL issue. Yesterday I tried removing some changes I made in the last commit. Today I'll be placing breakpoints on methods to find out which get hit and which sail through without incident.
A key difference I'm interested in is whether the methods get hit while running the browser on a page that displays WebGL content. Most of the methods should already be used when running a WebView — that is, after all, their purpose — and in theory few of them should be used — by default — by the browser. That's because they were added specifically to be used for offscreen rendering, which is what the WebView is all about.
The exception is WebGL, which also uses offscreen rendering, so it's the methods that are used on both the WebView and the browser that are of interest to us today.
I've been testing a few out. For example the following SharedSurfaceTextureClient constructor is used by the WebView, but not by the Sailfish Browser. Here it is being hit when executing the WebView:
Thread 37 "Compositor" hit Breakpoint 1, mozilla::layers::SharedSurfaceTextureClient::SharedSurfaceTextureClient (this=0x7ee01ab110, aData=0x7ee01ab060, aFlags=34, aAllocator=0x0)
    at ${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp:105
105     ${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp: No such file or directory.
(gdb)

This surprised me a little because I thought it'd be used for offscreen rendering more generally. But apparently not. I can conclude that this method is unlikely to be the source of the WebGL problems.
In contrast, I found that the following method is neither used by the WebView nor the browser:
(gdb) info break
Num     Type           Disp Enb Address            What
2       breakpoint     keep y   0x0000007ff28c14fc in mozilla::gl::CreateEmulatorBufferSurface(mozilla::gl::GLLibraryEGL*, void*, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits>&)
                                                   at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:982

This is also surprising. I thought maybe that I should therefore think about getting rid of this method entirely, but then I realised there's a reason for this method existing: it's intended for use with native rendering and for the emulator. These are both cases I've not yet tested, but they're still important for Sailfish OS. I added these to try to map ESR 78 changes and while I've not tested them I can only hope they work, which is still preferable to not including them at all.
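To make the role of this method a little more concrete, here's a small sketch of the selection that the useWindow flag drives in the GLContextProviderEGL.cpp diffs: a window-backed surface created by CreateEmulatorBufferSurface() when it's set, a pbuffer otherwise. The bit values are the standard EGL ones (EGL_PBUFFER_BIT is 0x0001 and EGL_WINDOW_BIT is 0x0004), but SurfaceTypeBit() and SurfaceCreatorName() are hypothetical helpers written for this illustration, not functions from the Gecko tree.

```cpp
#include <string>

// Standard EGL surface type bits (values as defined in the EGL headers).
constexpr int kEglPbufferBit = 0x0001;  // EGL_PBUFFER_BIT
constexpr int kEglWindowBit = 0x0004;   // EGL_WINDOW_BIT

// Hypothetical helper: the surface type bit requested in the config
// attributes for a given useWindow value.
constexpr int SurfaceTypeBit(bool useWindow) {
  return useWindow ? kEglWindowBit : kEglPbufferBit;
}

// Hypothetical helper: names the function the diff calls in each case.
std::string SurfaceCreatorName(bool useWindow) {
  return useWindow ? "CreateEmulatorBufferSurface"
                   : "CreatePBufferSurfaceTryingPowerOfTwo";
}
```

With useWindow false in every session captured so far, only the pbuffer branch is ever exercised, which would be consistent with the breakpoint on CreateEmulatorBufferSurface() never hitting.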
The following is the CreateEGLPBufferOffscreenContextImpl() method used by the WebView, but which appears not to be used by the browser:
Thread 38 "Compositor" hit Breakpoint 2, mozilla::gl::GLContextEGL::CreateEGLPBufferOffscreenContextImpl (egl=std::shared_ptr<mozilla::gl::EglDisplay> (use count 3, weak count 2) = {...}, desc=..., size=..., useGles=useGles@entry=false, out_failureId=out_failureId@entry=0x7f1f94c1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1359
1359    ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp: No such file or directory.
(gdb) bt
#0  mozilla::gl::GLContextEGL::CreateEGLPBufferOffscreenContextImpl (egl=std::shared_ptr<mozilla::gl::EglDisplay> (use count 3, weak count 2) = {...}, desc=..., size=..., useGles=useGles@entry=false, out_failureId=out_failureId@entry=0x7f1f94c1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1359
#1  0x0000007ff112f9f8 in mozilla::gl::GLContextEGL::CreateEGLPBufferOffscreenContext (display=std::shared_ptr<mozilla::gl::EglDisplay> (use count 3, weak count 2) = {...}, desc=..., size=..., out_failureId=out_failureId@entry=0x7f1f94c1c8)
    at include/c++/8.3.0/ext/atomicity.h:96
#2  0x0000007ff112fbcc in mozilla::gl::GLContextProviderEGL::CreateHeadless (desc=..., out_failureId=out_failureId@entry=0x7f1f94c1c8)
    at include/c++/8.3.0/ext/atomicity.h:96
#3  0x0000007ff1130454 in mozilla::gl::GLContextProviderEGL::CreateOffscreen (size=..., flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7f1f94c1c8)
    at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1456
#4  0x0000007ff1198048 in mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7ed4002f50)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:250
#5  0x0000007ff11ad820 in mozilla::layers::CompositorOGL::Initialize (this=0x7ed4002f50, out_failureReason=0x7f1f94c520)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:387
#6  0x0000007ff12c359c in mozilla::layers::CompositorBridgeParent::NewCompositor (this=this@entry=0x7fc4b671f0, aBackendHints=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#7  0x0000007ff12ce618 in mozilla::layers::CompositorBridgeParent::InitializeLayerManager (this=this@entry=0x7fc4b671f0, aBackendHints=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#8  0x0000007ff12ce748 in mozilla::layers::CompositorBridgeParent::AllocPLayerTransactionParent (this=this@entry=0x7fc4b671f0, aBackendHints=..., aId=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#9  0x0000007ff3665368 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent (this=0x7fc4b671f0, aBackendHints=..., aId=...)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#10 0x0000007ff0c5f2f0 in mozilla::layers::PCompositorBridgeParent::OnMessageReceived (this=0x7fc4b671f0, msg__=...)
    at PCompositorBridgeParent.cpp:1285
[...]
#25 0x0000007ff6a0289c in ?? () from /lib64/libc.so.6
(gdb)

As you can see I've captured a backtrace; I'm honestly not expecting it to be useful but I'd rather keep it just in case.
I've checked a few methods now but this is beginning to feel like a rather hit and miss approach: debugging as Brownian motion. I'm not averse to a bit of Brownian debugging, but I prefer something more structured if there's an alternative available. To this end, rather than testing methods individually I've now placed all of the following breakpoints on the executing browser to find out which ones hit:
1   mozilla::gl::GLScreenBuffer::Create(mozilla::gl::GLContext*, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:171
2   mozilla::gl::GLScreenBuffer::GLScreenBuffer(mozilla::gl::GLContext*, mozilla::UniquePtr<mozilla::gl::SurfaceFactory, mozilla::DefaultDelete<mozilla::gl::SurfaceFactory> >)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:183
3   mozilla::gl::GLScreenBuffer::Swap(mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:308
4   mozilla::gl::GLScreenBuffer::Resize(mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:339
5   mozilla::gl::ReadBuffer::Attach(mozilla::gl::SharedSurface*)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:408
6   mozilla::gl::TileGenFunc
    at ${PROJECT}/gecko-dev/gfx/gl/GLTextureImage.cpp:352
7   mozilla::gl::SurfaceFactory_Basic::SurfaceFactory_Basic(mozilla::gl::GLContext&)
    at ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.cpp:43
7.2 mozilla::gl::SurfaceFactory_Basic::SurfaceFactory_Basic(mozilla::gl::GLContext*, mozilla::layers::TextureFlags const&)
    at ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.cpp:47
8   mozilla::gl::SharedSurface_Basic::Create(mozilla::gl::GLContext*, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&)
    at ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.cpp:56
9   GLFormatForImage
10  GLTypeForImage
11  TextureImageEGL::TextureImageEGL
12  TextureImageEGL::BindTexture
13  TextureImageEGL::BindTexImage
14  mozilla::gl::CreateTextureImageEGL(mozilla::gl::GLContext*, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, gfxContentType, unsigned int, mozilla::gl::TextureImage::Flags, mozilla::gfx::SurfaceFormat)
    at ${PROJECT}/gecko-dev/gfx/gl/TextureImageEGL.cpp:185
15  mozilla::layers::SharedSurfaceTextureClient::Create(mozilla::UniquePtr<mozilla::gl::SharedSurface, mozilla::DefaultDelete<mozilla::gl::SharedSurface> >, mozilla::gl::SurfaceFactory*, mozilla::layers::LayersIPCChannel*, mozilla::layers::TextureFlags)
    at ${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp:114
16  mozilla::layers::CompositorOGL::CreateContext()
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:227
17  mozilla::layers::CompositorOGL::ClearRect(mozilla::gfx::RectTyped<mozilla::gfx::UnknownUnits, float> const&)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:766

I get some results back. Each time something hits I've recorded a backtrace, since these are the really crucial methods I need to look into. I won't have time to look into them today, but I can at least take a record of their details. First up is CompositorOGL::CreateContext(). This is a totally new method added as part of my changes, so a good candidate for where the issue might live:
Thread 37 "Compositor" hit Breakpoint 16, mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7ed8002f10)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:227
227     ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp: No such file or directory.
(gdb) bt
#0  mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7ed8002f10)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:227
#1  0x0000007ff2950820 in mozilla::layers::CompositorOGL::Initialize (this=0x7ed8002f10, out_failureReason=0x7f2e4ed510)
    at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:387
#2  0x0000007ff2a6659c in mozilla::layers::CompositorBridgeParent::NewCompositor (this=this@entry=0x7fc8a7e820, aBackendHints=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#3  0x0000007ff2a71618 in mozilla::layers::CompositorBridgeParent::InitializeLayerManager (this=this@entry=0x7fc8a7e820, aBackendHints=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#4  0x0000007ff2a71748 in mozilla::layers::CompositorBridgeParent::AllocPLayerTransactionParent (this=this@entry=0x7fc8a7e820, aBackendHints=..., aId=...)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#5  0x0000007ff4e08368 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent (this=0x7fc8a7e820, aBackendHints=..., aId=...)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#6  0x0000007ff24022f0 in mozilla::layers::PCompositorBridgeParent::OnMessageReceived (this=0x7fc8a7e820, msg__=...)
    at PCompositorBridgeParent.cpp:1285
[...]
#21 0x0000007fefbac89c in ?? () from /lib64/libc.so.6
(gdb)

Next we have the SurfaceFactory_Basic constructor. This isn't new but has an updated signature. This seems less likely to be the cause, but worth checking just in case.
Thread 8 "GeckoWorkerThre" hit Breakpoint 7, mozilla::gl::SurfaceFactory_Basic::SurfaceFactory_Basic (this=0x7e9c005810, gl=...)
    at ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.cpp:43
43      ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.cpp: No such file or directory.
(gdb) bt
#0  mozilla::gl::SurfaceFactory_Basic::SurfaceFactory_Basic (this=0x7e9c005810, gl=...)
    at ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.cpp:43
#1  0x0000007ff369ce20 in mozilla::MakeUnique<mozilla::gl::SurfaceFactory_Basic, mozilla::gl::GLContext&> ()
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#2  mozilla::WebGLContext::Present (this=this@entry=0x7fc961d590, xrFb=<optimized out>, consumerType=consumerType@entry=mozilla::layers::TextureType::Unknown, webvr=webvr@entry=false)
    at ${PROJECT}/gecko-dev/dom/canvas/WebGLContext.cpp:929
#3  0x0000007ff3664e68 in mozilla::HostWebGLContext::Present (webvr=false, t=mozilla::layers::TextureType::Unknown, xrFb=<optimized out>, this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:280
#4  mozilla::ClientWebGLContext::Run<void (mozilla::HostWebGLContext::*)(unsigned long, mozilla::layers::TextureType, bool) const, &(mozilla::HostWebGLContext::Present(unsigned long, mozilla::layers::TextureType, bool) const), unsigned long, mozilla::layers::TextureType const&, bool const&> (this=<optimized out>, args#0=@0x7fdf2926c0: 0, args#1=@0x7fdf2926bf: mozilla::layers::TextureType::Unknown, args#2=@0x7fdf2926be: false)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:313
#5  0x0000007ff3664fd0 in mozilla::ClientWebGLContext::Present (this=this@entry=0x7fc9611ef0, xrFb=xrFb@entry=0x0, type=<optimized out>, webvr=<optimized out>, webvr@entry=false)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:363
#6  0x0000007ff36907e0 in mozilla::ClientWebGLContext::OnBeforePaintTransaction (this=0x7fc9611ef0)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:345
#7  0x0000007ff28ffc7c in mozilla::layers::CanvasRenderer::FirePreTransactionCallback (this=this@entry=0x7fc989b9a0)
    at ${PROJECT}/gecko-dev/gfx/layers/CanvasRenderer.cpp:75
[...]
#54 0x0000007fefbac89c in ?? () from /lib64/libc.so.6
(gdb)

Similarly for SharedSurface_Basic::Create(). The sequencing here makes perfect sense: first you'd want to create a factory, next up you'd want to use it to create an instance of the object it's designed to create, which is exactly what we've just seen above. Here we have a SurfaceFactory_Basic that's generating a SharedSurface_Basic object:
Thread 8 "GeckoWorkerThre" hit Breakpoint 8, mozilla::gl::SharedSurface_Basic::Create (gl=0x7fc985a8e0, size=...)
    at ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.cpp:56
56      in ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.cpp
(gdb) bt
#0  mozilla::gl::SharedSurface_Basic::Create (gl=0x7fc985a8e0, size=...)
    at ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.cpp:56
#1  0x0000007ff28a6578 in mozilla::gl::SurfaceFactory_Basic::CreateSharedImpl (this=<optimized out>, desc=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/WeakPtr.h:185
#2  0x0000007ff28a6488 in mozilla::gl::SurfaceFactory::CreateShared (this=0x7e9c005810, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefCounted.h:240
#3  0x0000007ff28a9504 in mozilla::gl::SwapChain::Acquire (this=this@entry=0x7fc961da28, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#4  0x0000007ff369ca3c in mozilla::WebGLContext::PresentInto (this=this@entry=0x7fc961d590, swapChain=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#5  0x0000007ff369ce6c in mozilla::WebGLContext::Present (this=this@entry=0x7fc961d590, xrFb=<optimized out>, consumerType=consumerType@entry=mozilla::layers::TextureType::Unknown, webvr=webvr@entry=false)
    at ${PROJECT}/gecko-dev/dom/canvas/WebGLContext.cpp:936
#6  0x0000007ff3664e68 in mozilla::HostWebGLContext::Present (webvr=false, t=mozilla::layers::TextureType::Unknown, xrFb=<optimized out>, this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:280
#7  mozilla::ClientWebGLContext::Run<void (mozilla::HostWebGLContext::*)(unsigned long, mozilla::layers::TextureType, bool) const, &(mozilla::HostWebGLContext::Present(unsigned long, mozilla::layers::TextureType, bool) const), unsigned long, mozilla::layers::TextureType const&, bool const&> (this=<optimized out>, args#0=@0x7fdf2926c0: 0, args#1=@0x7fdf2926bf: mozilla::layers::TextureType::Unknown, args#2=@0x7fdf2926be: false)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:313
#8  0x0000007ff3664fd0 in mozilla::ClientWebGLContext::Present (this=this@entry=0x7fc9611ef0, xrFb=xrFb@entry=0x0, type=<optimized out>, webvr=<optimized out>, webvr@entry=false)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:363
#9  0x0000007ff36907e0 in mozilla::ClientWebGLContext::OnBeforePaintTransaction (this=0x7fc9611ef0)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:345
[...]
#57 0x0000007fefbac89c in ?? () from /lib64/libc.so.6
(gdb)

Next we have a call to CreateTextureImageEGL(). It's not clear to me how this relates to the earlier calls; I think this may be one to look into in a bit more depth:
Thread 37 "Compositor" hit Breakpoint 14, mozilla::gl::CreateTextureImageEGL (gl=gl@entry=0x7ed81a29e0, aSize=..., aContentType=aContentType@entry=gfxContentType::COLOR_ALPHA, aWrapMode=aWrapMode@entry=33071, aFlags=aFlags@entry=mozilla::gl::TextureImage::OriginBottomLeft, aImageFormat=aImageFormat@entry=mozilla::gfx::SurfaceFormat::B8G8R8A8)
    at ${PROJECT}/gecko-dev/gfx/gl/TextureImageEGL.cpp:185
185     ${PROJECT}/gecko-dev/gfx/gl/TextureImageEGL.cpp: No such file or directory.
(gdb) bt
#0  mozilla::gl::CreateTextureImageEGL (gl=gl@entry=0x7ed81a29e0, aSize=..., aContentType=aContentType@entry=gfxContentType::COLOR_ALPHA, aWrapMode=aWrapMode@entry=33071, aFlags=aFlags@entry=mozilla::gl::TextureImage::OriginBottomLeft, aImageFormat=aImageFormat@entry=mozilla::gfx::SurfaceFormat::B8G8R8A8)
    at ${PROJECT}/gecko-dev/gfx/gl/TextureImageEGL.cpp:185
#1  0x0000007ff28b9154 in mozilla::gl::CreateTextureImage (gl=gl@entry=0x7ed81a29e0, aSize=..., aContentType=aContentType@entry=gfxContentType::COLOR_ALPHA, aWrapMode=aWrapMode@entry=33071, aFlags=aFlags@entry=mozilla::gl::TextureImage::OriginBottomLeft, aImageFormat=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/gl/GLTextureImage.cpp:30
#2  0x0000007ff294e9d4 in mozilla::layers::TextureImageTextureSourceOGL::Update (this=0x7ed81a2940, aSurface=0x7ed822aa70, aDestRegion=0x0, aSrcOffset=0x0, aDstOffset=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/gfx2DGlue.h:70
#3  0x0000007ff2a43bf4 in mozilla::layers::BufferTextureHost::Upload (this=this@entry=0x7ed81fca30, aRegion=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff2a4422c in mozilla::layers::BufferTextureHost::MaybeUpload (this=this@entry=0x7ed81fca30, aRegion=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/TextureHost.cpp:1046
#5  0x0000007ff2a44554 in mozilla::layers::BufferTextureHost::UploadIfNeeded (this=this@entry=0x7ed81fca30)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/TextureHost.cpp:1031
#6  0x0000007ff2a44570 in mozilla::layers::BufferTextureHost::Lock (this=0x7ed81fca30)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/TextureHost.cpp:650
#7  0x0000007ff2a35ad8 in mozilla::layers::ImageHost::Lock (this=0x7ed81fbe20)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#8  0x0000007ff2a35f68 in mozilla::layers::AutoLockCompositableHost::AutoLockCompositableHost (aHost=0x7ed81fbe20, this=0x7f2e4ecca0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#9  mozilla::layers::ImageHost::Composite (this=this@entry=0x7ed81fbe20, aCompositor=aCompositor@entry=0x7ed8002f10, aLayer=aLayer@entry=0x7ed81db4d0, aEffectChain=..., aOpacity=1, aTransform=..., aSamplingFilter=<optimized out>, aClipRect=..., aVisibleRegion=aVisibleRegion@entry=0x0, aGeometry=...)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/ImageHost.cpp:197
#10 0x0000007ff2a26a88 in mozilla::layers::CanvasLayerComposite::<lambda(mozilla::layers::EffectChain&, const IntRect&)>::operator() (clipRect=..., effectChain=..., __closure=<synthetic pointer>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h:50
#11 mozilla::layers::RenderWithAllMasks<mozilla::layers::CanvasLayerComposite::RenderLayer(const IntRect&, const mozilla::Maybe<mozilla::gfx::PolygonTyped<mozilla::gfx::UnknownUnits> >&)::<lambda(mozilla::layers::EffectChain&, const IntRect&)> >(mozilla::layers::Layer *, mozilla::layers::Compositor *, const mozilla::gfx::IntRect &, mozilla::layers::CanvasLayerComposite::<lambda(mozilla::layers::EffectChain&, const IntRect&)>) (aLayer=aLayer@entry=0x7ed81db0c0, aCompositor=<optimized out>, aClipRect=..., aRenderCallback=aRenderCallback@entry=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/LayerManagerCompositeUtils.h:69
#12 0x0000007ff2a26ddc in mozilla::layers::CanvasLayerComposite::RenderLayer (this=0x7ed81db0c0, aClipRect=..., aGeometry=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:289
#13 0x0000007ff2a32cd4 in mozilla::layers::RenderLayers<mozilla::layers::ContainerLayerComposite> (aContainer=aContainer@entry=0x7ed81f1600, aManager=aManager@entry=0x7ed81a44e0, aClipRect=..., aGeometry=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Maybe.h:443
[...]
#41 0x0000007fefbac89c in ?? () from /lib64/libc.so.6
(gdb)

Next up we have a call to TileGenFunc(). You may recall that we already fixed this to align its functionality with the code as it was before the last commit, so this one should already be safe. I'll still keep its backtrace here though, both for completeness and in case I missed something earlier:
Thread 37 "Compositor" hit Breakpoint 6, 0x0000007ff28b8b88 in
    mozilla::gl::TileGenFunc (aImageFormat=<optimized out>,
    aFlags=<optimized out>, aContentType=<optimized out>, aSize=...,
    gl=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/gl/GLTextureImage.cpp:352
352     ${PROJECT}/gecko-dev/gfx/gl/GLTextureImage.cpp: No such file or directory.
(gdb) bt
#0  0x0000007ff28b8b88 in mozilla::gl::TileGenFunc (
    aImageFormat=<optimized out>, aFlags=<optimized out>,
    aContentType=<optimized out>, aSize=..., gl=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/gl/GLTextureImage.cpp:352
#1  mozilla::gl::TiledTextureImage::Resize (this=this@entry=0x7ed81b5b10,
    aSize=...)
    at ${PROJECT}/gecko-dev/gfx/gl/GLTextureImage.cpp:402
#2  0x0000007ff28b8fd0 in mozilla::gl::TiledTextureImage::TiledTextureImage (
    this=0x7ed81b5b10, aGL=0x7ed81a29e0, aSize=...,
    aContentType=<optimized out>, aFlags=<optimized out>,
    aImageFormat=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/gl/GLTextureImage.cpp:224
#3  0x0000007ff28d4908 in mozilla::gl::CreateTextureImageEGL (
    gl=gl@entry=0x7ed81a29e0, aSize=...,
    aContentType=aContentType@entry=gfxContentType::COLOR_ALPHA,
    aWrapMode=aWrapMode@entry=33071,
    aFlags=aFlags@entry=mozilla::gl::TextureImage::OriginBottomLeft,
    aImageFormat=aImageFormat@entry=mozilla::gfx::SurfaceFormat::B8G8R8A8)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#4  0x0000007ff28b9154 in mozilla::gl::CreateTextureImage (
    gl=gl@entry=0x7ed81a29e0, aSize=...,
    aContentType=aContentType@entry=gfxContentType::COLOR_ALPHA,
    aWrapMode=aWrapMode@entry=33071,
    aFlags=aFlags@entry=mozilla::gl::TextureImage::OriginBottomLeft,
    aImageFormat=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/gl/GLTextureImage.cpp:30
#5  0x0000007ff294e9d4 in mozilla::layers::TextureImageTextureSourceOGL::Update
    (this=0x7ed81a2940, aSurface=0x7ed822aa70, aDestRegion=0x0,
    aSrcOffset=0x0, aDstOffset=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/
    include/gfx2DGlue.h:70
#6  0x0000007ff2a43bf4 in mozilla::layers::BufferTextureHost::Upload (
    this=this@entry=0x7ed81fca30, aRegion=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#7  0x0000007ff2a4422c in mozilla::layers::BufferTextureHost::MaybeUpload (
    this=this@entry=0x7ed81fca30, aRegion=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/TextureHost.cpp:1046
#8  0x0000007ff2a44554 in mozilla::layers::BufferTextureHost::UploadIfNeeded (
    this=this@entry=0x7ed81fca30)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/TextureHost.cpp:1031
#9  0x0000007ff2a44570 in mozilla::layers::BufferTextureHost::Lock (
    this=0x7ed81fca30)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/TextureHost.cpp:650
#10 0x0000007ff2a35ad8 in mozilla::layers::ImageHost::Lock (this=0x7ed81fbe20)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#11 0x0000007ff2a35f68 in mozilla::layers::AutoLockCompositableHost::
    AutoLockCompositableHost (aHost=0x7ed81fbe20, this=0x7f2e4ecca0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#12 mozilla::layers::ImageHost::Composite (this=this@entry=0x7ed81fbe20,
    aCompositor=aCompositor@entry=0x7ed8002f10,
    aLayer=aLayer@entry=0x7ed81db4d0, aEffectChain=..., aOpacity=1,
    aTransform=..., aSamplingFilter=<optimized out>, aClipRect=...,
    aVisibleRegion=aVisibleRegion@entry=0x0, aGeometry=...)
    at ${PROJECT}/gecko-dev/gfx/layers/composite/ImageHost.cpp:197
#13 0x0000007ff2a26a88 in mozilla::layers::CanvasLayerComposite::<lambda(
    mozilla::layers::EffectChain&, const IntRect&)>::operator() (
    clipRect=..., effectChain=..., __closure=<synthetic pointer>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h:50
#14 mozilla::layers::RenderWithAllMasks<mozilla::layers::CanvasLayerComposite::
    RenderLayer(const IntRect&, const mozilla::Maybe<mozilla::gfx::
    PolygonTyped<mozilla::gfx::UnknownUnits> >&)::<lambda(mozilla::layers::
    EffectChain&, const IntRect&)> >(mozilla::layers::Layer *, mozilla::layers::
    Compositor *, const mozilla::gfx::IntRect &, mozilla::layers::
    CanvasLayerComposite::<lambda(mozilla::layers::EffectChain&,
    const IntRect&)>) (
    aLayer=aLayer@entry=0x7ed81db0c0, aCompositor=<optimized out>,
    aClipRect=..., aRenderCallback=aRenderCallback@entry=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    LayerManagerCompositeUtils.h:69
[...]
#44 0x0000007fefbac89c in ?? () from /lib64/libc.so.6
(gdb)

Phew: that's a lot to check. Plenty to look into, but as I mentioned at the outset, this is just the preliminary info-capturing work. I'll do my best to use this information tomorrow when I plan to investigate it all in much more depth. That's it for now though. More tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
16 Jun 2024 : Day 259 #
I've spent this last week participating in a HackWeek. Organised at my place of work for the Research Engineering Group I'm part of, it allowed us to work with a different team on different projects and with different technologies than we usually would do.
My team developed a "Plantcraft" simulation, built on a three-dimensional grid where each cell satisfies a small set of physical rules approximating those of reality. A cell can be one of Air, Soil, Rock or Plant, each with a state that determines its water content, energy level, colour and memory. Despite the naming the four types of cell are actually identical, save for the state values and the fact that Plant cells execute a state-machine programme which determines their behaviour.
If you're interested all the documentation and executable code is available in the project's git repository. Please don't judge the code too harshly: it was all written under tight time constraints!
While it was huge fun to work with such a great team during the week, it's also nice to now be getting back to gecko once again. Time to get things moving and maybe apply some of the pressure of a time-constrained project to getting the gecko changes over the line too.
Let's recap where we were at last weekend before I paused during the week. I'll be continuing to work on getting WebGL rendering once again. The WebGL code was working nicely at one point, but since it makes use of the same offscreen rendering pipeline as the WebView, the changes I made to get the latter working seem to have broken the former.
I've already established and tweaked some of the relevant changes, namely that TileGenFunc() now executes CreateBasicTextureImage() in all circumstances and GLContext::ResizeScreenBuffer() now acts on SwapChain rather than GLScreenBuffer. Here's the full diff showing the changes I've made up to now to reverse these:
$ git diff
diff --git a/gfx/gl/GLContext.cpp b/gfx/gl/GLContext.cpp
index 1177768bb92e..aac6912bb914 100644
--- a/gfx/gl/GLContext.cpp
+++ b/gfx/gl/GLContext.cpp
@@ -1875,7 +1875,8 @@ void GLContext::MarkDestroyed() {
   // Null these before they're naturally nulled after dtor, as we want GLContext
   // to still be alive in *their* dtors.
-  mScreen = nullptr;
+  //mScreen = nullptr;
+  mSwapChain = nullptr;
   mBlitHelper = nullptr;
   mReadTexImageHelper = nullptr;
@@ -1886,7 +1887,7 @@ void GLContext::MarkDestroyed() {
 bool GLContext::ResizeScreenBuffer(const gfx::IntSize& size) {
   if (!IsOffscreenSizeAllowed(size)) return false;
-  return mScreen->Resize(size);
+  return mSwapChain->Resize(size);
 }
 // -
diff --git a/gfx/gl/GLTextureImage.cpp b/gfx/gl/GLTextureImage.cpp
index c2def2dedb18..8152128bdc9c 100644
--- a/gfx/gl/GLTextureImage.cpp
+++ b/gfx/gl/GLTextureImage.cpp
@@ -47,6 +47,9 @@ already_AddRefed<TextureImage> CreateTextureImage(
 static already_AddRefed<TextureImage> TileGenFunc(
     GLContext* gl, const IntSize& aSize, TextureImage::ContentType aContentType,
     TextureImage::Flags aFlags, TextureImage::ImageFormat aImageFormat) {
+  return CreateBasicTextureImage(gl, aSize, aContentType,
+                                 LOCAL_GL_CLAMP_TO_EDGE, aFlags);
+
   switch (gl->GetContextType()) {
     case GLContextType::EGL:
       return TileGenFuncEGL(gl, aSize, aContentType, aFlags, aImageFormat);

As I left things last weekend these changes were triggering a segfault. My task for today is to check the backtrace of the crash. It's bound to reveal something useful...
But oddly it doesn't. Or it might have were it not for the fact there's no crash after all. So since there's no backtrace I've had to go with a different approach. Instead I've been through the diff of the previous commit again to see whether it reveals any further gentle differences I can try to reverse. Ones that are unlikely to cause damage while at the same time might help resolve the WebGL issue.
One such change can be found in the SurfaceFactory constructor. This accepts an allocator and a flags parameter, neither of which appear to be used. So I've removed them to see what happens, setting the allocator to be nullptr where it's needed later instead.
Here's the diff of the changes I made:
diff --git a/gfx/gl/SharedSurface.cpp b/gfx/gl/SharedSurface.cpp
index 687d18b95893..1d911b84379a 100644
--- a/gfx/gl/SharedSurface.cpp
+++ b/gfx/gl/SharedSurface.cpp
[...]
@@ -149,10 +164,105 @@ UniquePtr<SurfaceFactory> SurfaceFactory::Create(
   return nullptr;
 }
-SurfaceFactory::SurfaceFactory(const PartialSharedSurfaceDesc& partialDesc)
-    : mDesc(partialDesc), mMutex("SurfaceFactor::mMutex") {}
+SurfaceFactory::SurfaceFactory(const PartialSharedSurfaceDesc& partialDesc,
+    const RefPtr<layers::LayersIPCChannel>& allocator,
+    const layers::TextureFlags& flags)
+    : mDesc(partialDesc),
+      mAllocator(allocator),
+      mFlags(flags),
+      mMutex("SurfaceFactor::mMutex")
+{
+}
[...]

Changing, building, installing and testing this doesn't result in any change. The browser and WebView both work as before, but the WebGL functionality is still broken.
Given that I'm not getting a crash and that the various changes I've made today haven't had any apparent effect, tomorrow I'm going to go through the methods in the previous commit again, set breakpoints on them and see which are being used by the browser. Hopefully this will shed more light, while also giving me the opportunity to refresh my memory about the changes. A refresh is going to be helpful given I spent last week thinking about other things.
So, more on this tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
9 Jun 2024 : Day 258 #
As I mentioned a couple of days back, I'm taking part in a hackathon for my work during the next week, so I'm not planning to make any posts for the next five days. This coming Saturday I'll pick up right where I leave off at the end of today, though.
For today, I'm looking further into why WebGL might not be doing what it's supposed to be doing. So far I've found that there are two methods in my commit diff that get hit when executing the broken code. These are:
- SurfaceFactory::SurfaceFactory()
- TextureImageEGL::TextureImageEGL()
Looking at the code and observing the execution using the debugger I can see that the stack trace for the second of these includes TileGenFunc(), which calls TileGenFuncEGL() which then calls TextureImageEGL::TextureImageEGL(). And the flow is definitely being affected by what happens in TileGenFunc().
Here's the diff between the two versions:
 static already_AddRefed<TextureImage> TileGenFunc(
     GLContext* gl, const IntSize& aSize, TextureImage::ContentType aContentType,
     TextureImage::Flags aFlags, TextureImage::ImageFormat aImageFormat) {
-  return CreateBasicTextureImage(gl, aSize, aContentType,
-                                 LOCAL_GL_CLAMP_TO_EDGE, aFlags);
+  switch (gl->GetContextType()) {
+    case GLContextType::EGL:
+      return TileGenFuncEGL(gl, aSize, aContentType, aFlags, aImageFormat);
+    default:
+      return CreateBasicTextureImage(gl, aSize, aContentType,
+                                     LOCAL_GL_CLAMP_TO_EDGE, aFlags);
+  }
 }

As we can see, the original version always calls CreateBasicTextureImage(), whereas in the new version there's a switch to contend with. That means the new version, rather than doing the same thing as the original, will on occasion call TileGenFuncEGL() instead. So this is clearly a candidate for where things are going wrong.
To see whether this is having an important effect I've amended the method so that it has the same approach as previously, by changing it to this:
$ git diff
diff --git a/gfx/gl/GLTextureImage.cpp b/gfx/gl/GLTextureImage.cpp
index c2def2dedb18..8152128bdc9c 100644
--- a/gfx/gl/GLTextureImage.cpp
+++ b/gfx/gl/GLTextureImage.cpp
@@ -47,6 +47,9 @@ already_AddRefed<TextureImage> CreateTextureImage(
 static already_AddRefed<TextureImage> TileGenFunc(
     GLContext* gl, const IntSize& aSize, TextureImage::ContentType aContentType,
     TextureImage::Flags aFlags, TextureImage::ImageFormat aImageFormat) {
+  return CreateBasicTextureImage(gl, aSize, aContentType,
+                                 LOCAL_GL_CLAMP_TO_EDGE, aFlags);
+
   switch (gl->GetContextType()) {
     case GLContextType::EGL:
       return TileGenFuncEGL(gl, aSize, aContentType, aFlags, aImageFormat);

Now when this gets called, it will immediately call CreateBasicTextureImage() rather than going into the switch conditional. This isn't a long-term solution, it's just a way for me to test things out.
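To make the effect of that early return concrete, here's a minimal standalone sketch of the same dispatch pattern. None of these names are the real Gecko API: ContextType, Context, makeBasicImage(), makeEglImage() and the forceBasic flag are all hypothetical stand-ins, and the strings simply label which path was taken.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for GLContextType; not the real Gecko enum.
enum class ContextType { EGL, Other };

// Hypothetical stand-in for GLContext, reduced to the one field we need.
struct Context {
  ContextType type;
};

// Stand-in for CreateBasicTextureImage(); returns a label, not a texture.
std::string makeBasicImage() { return "basic"; }

// Stand-in for TileGenFuncEGL(); returns a label, not a texture.
std::string makeEglImage() { return "egl"; }

// Chooses an image backend. With forceBasic set, the switch is bypassed,
// which mirrors the temporary early return added for testing.
std::string tileGen(const Context& ctx, bool forceBasic) {
  if (forceBasic) {
    return makeBasicImage();
  }
  switch (ctx.type) {
    case ContextType::EGL:
      return makeEglImage();
    default:
      return makeBasicImage();
  }
}
```

Calling tileGen() with an EGL context and forceBasic set to true takes the basic path, whereas with forceBasic set to false it takes the EGL path. The real change hard-codes the early return rather than using a flag; the flag is only here so both behaviours can be compared side by side.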
Unfortunately though, rebuilding and executing this change gives me the same result as before, in that the WebGL is still not showing signs of life.
So it's back to the code again. There's also another important change in that in some cases in GLScreenBuffer I've switched use of mSwapChain for mScreen instead. The two have quite different characteristics, so I should try switching this back as well, for example like this:
 bool GLContext::ResizeScreenBuffer(const gfx::IntSize& size) {
   if (!IsOffscreenSizeAllowed(size)) return false;
-  return mScreen->Resize(size);
+  return mSwapChain->Resize(size);
 }

When I build and try this, something different happens: the app now crashes when it tries to render the WebGL. That's not a bad thing, because the debugger will tell me where the crash is taking place.
I'll need to investigate this further. Not today though as I'm out of time, and I won't be picking this up tomorrow either. Instead there will be the five-day pause I mentioned at the top of this post, but I'll be back to continue this where I've left it this coming Saturday.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
8 Jun 2024 : Day 257 #
It's an early start for me today as I'm travelling to London and back. But I had difficulty sleeping last night and am up even earlier than I usually would be, so I'm pleased to discover that the build I kicked off last night has already completed.
This means I now have two sets of RPM packages. One set that represents the last commit of ESR 91 when WebGL was working and a second set that adds a commit on top of this, but which breaks WebGL.
Here's a list of the packages, where the sailfishos.esr91 represents the most recent changes that caused the breakage, while the temp branch has these changes reverted.
$ ls webgl-broken/ webgl-working/
webgl-broken/:
xulrunner-qt5-91.9.1+git1+sailfishos.esr91.
    20240604225626.a84dc7d4765d+gecko.dev.7437a9d17284-1.aarch64.rpm
xulrunner-qt5-debuginfo-91.9.1+git1+sailfishos.esr91.
    20240604225626.a84dc7d4765d+gecko.dev.7437a9d17284-1.aarch64.rpm
xulrunner-qt5-debugsource-91.9.1+git1+sailfishos.esr91.
    20240604225626.a84dc7d4765d+gecko.dev.7437a9d17284-1.aarch64.rpm
xulrunner-qt5-devel-91.9.1+git1+sailfishos.esr91.
    20240604225626.a84dc7d4765d+gecko.dev.7437a9d17284-1.aarch64.rpm
xulrunner-qt5-misc-91.9.1+git1+sailfishos.esr91.
    20240604225626.a84dc7d4765d+gecko.dev.7437a9d17284-1.aarch64.rpm

webgl-working/:
xulrunner-qt5-91.9.1+git1+temp.
    20240212214917.9f64ce35a187-1.aarch64.rpm
xulrunner-qt5-debuginfo-91.9.1+git1+temp.
    20240212214917.9f64ce35a187-1.aarch64.rpm
xulrunner-qt5-debugsource-91.9.1+git1+temp.
    20240212214917.9f64ce35a187-1.aarch64.rpm
xulrunner-qt5-devel-91.9.1+git1+temp.
    20240212214917.9f64ce35a187-1.aarch64.rpm
xulrunner-qt5-misc-91.9.1+git1+temp.
    20240212214917.9f64ce35a187-1.aarch64.rpm

While the newly built RPMs transfer over to my phone, let me summarise what I'm expecting.
Previously the broken RPMs were crashing on a call to ToSurfaceDescriptor(). The reason for the crash is that I'd added an explicit request for the app to crash if this was ever called:
Maybe<layers::SurfaceDescriptor> SharedSurface_Basic::ToSurfaceDescriptor() {
  MOZ_CRASH("GFX: ToSurfaceDescriptor");
  return Nothing();
}

I added it for debugging purposes while working on the WebView changes. These latest packages have this MOZ_CRASH statement removed, so I'm no longer expecting a crash to happen here. However, I do expect it to crash nevertheless, just in some other location. Removing the MOZ_CRASH would be too simple a fix for it to actually work as a solution!
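For anyone unfamiliar with the technique, the idea of a deliberate crash marker can be sketched in plain C++ as follows. This is only an illustration under my own assumptions: crashWithMessage() is a hypothetical helper rather than the real MOZ_CRASH macro, SurfaceDescriptor is an empty stand-in type, and the trapEnabled parameter exists purely so the two behaviours can be contrasted.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <memory>

// Hypothetical stand-in for layers::SurfaceDescriptor.
struct SurfaceDescriptor {
  int handle;
};

// Hypothetical helper: report a message and abort, so that a debugger
// attached to the process stops here with a usable backtrace.
[[noreturn]] void crashWithMessage(const char* message) {
  std::fprintf(stderr, "Crash: %s\n", message);
  std::abort();
}

// With the trap enabled, any call aborts immediately, revealing the caller.
// With it disabled, the function quietly returns "nothing" (a null pointer
// here, standing in for Nothing()).
std::unique_ptr<SurfaceDescriptor> toSurfaceDescriptor(bool trapEnabled) {
  if (trapEnabled) {
    crashWithMessage("GFX: ToSurfaceDescriptor");
  }
  return nullptr;
}
```

The trap is useful for confirming whether a code path is ever reached, but as soon as it's removed the method just returns an empty result, so any caller relying on a real descriptor will fail later and less visibly.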
So I'm expecting to get a new backtrace from the crash. The question will be: what is this crash and how does it compare with the execution of the working version? As soon as I have this backtrace, the path the execution took to get there will hopefully be clear. Then I'll reinstall the working version and compare against the equivalent path there to establish what's changed.
This is the plan, at least.
The packages have copied over, so let's get to work.
$ sailfish-browser https://shadertoy.com
[...]
Created LOG for EmbedLiteLayerManager
JavaScript warning: https://www.shadertoy.com/, line 2388: WebGL warning:
    drawArraysInstanced: Tex image TEXTURE_2D level 0 is incurring lazy
    initialization.
[...]

Well, that's interesting. There is now no crash, so that failure was entirely self-induced. However the WebGL is broken. It's just displaying an empty canvas where the WebGL should be rendered. This makes things somewhat harder to debug, because now there's no obvious place to start from.
So my new plan is to debug the same piece of code that I debugged yesterday on the working version. Let's see if anything has changed.
(gdb) info break
Num     Type           Disp Enb Address            What
2       breakpoint     keep y   0x0000007ff29b0cfc in mozilla::layers::
    ShareableCanvasRenderer::UpdateCompositableClient() at gfx/layers/
    ShareableCanvasRenderer.cpp:191
        breakpoint already hit 1 time
(gdb) c
[...]
Thread 8 "GeckoWorkerThre" hit Breakpoint 2, mozilla::layers::
    ShareableCanvasRenderer::UpdateCompositableClient (this=0x7fc963c520)
    at gfx/layers/ShareableCanvasRenderer.cpp:192
192       FirePreTransactionCallback();
(gdb) n
195       auto tc = fnGetExistingTc();
(gdb) n
196       if (!tc) {
(gdb) p tc
$1 = {mRawPtr = 0x0}
(gdb) n
198         tc = fnMakeTcFromSnapshot();
(gdb) n
200       if (tc != mFrontBufferFromDesc) {
(gdb) p tc
$2 = {mRawPtr = 0x7fc8ceb370}
(gdb) p tc.mRawPtr
$3 = (mozilla::layers::TextureClient *) 0x7fc8ceb370
(gdb)

This matches the flow in the working version, so it seems this isn't where the problem is. I'm going to have to look further afield.
To help with this search I've attached breakpoints to the majority of the new functions that have been added or seen significant changes in the latest commit. Here they all are (there are quite a few):
(gdb) break GLScreenBuffer::Create
Breakpoint 4 at 0x7ff28a7d94: file gfx/gl/GLScreenBuffer.cpp, line 171.
(gdb) break InitOffscreen
Breakpoint 5 at 0x7ff28d2500: file gfx/gl/GLContext.cpp, line 2345.
(gdb) break GLContext::CreateScreenBuffer
Breakpoint 6 at 0x7ff28d2428: file gfx/gl/GLContext.cpp, line 2073.
(gdb) b WaylandGLSurface::WaylandGLSurface
Breakpoint 7 at 0x7ff28c084c: file gfx/gl/GLContextProviderEGL.cpp, line 954.
(gdb) b GLContextProviderEGL::CreateOffscreen
Breakpoint 8 at 0x7ff28d2610: file gfx/gl/GLContextProviderEGL.cpp, line 1451.
(gdb) b ReadBuffer::Create
Breakpoint 9 at 0x7ff28a6ea8: file gfx/gl/GLScreenBuffer.cpp, line 358.
(gdb) b SurfaceFactory::SurfaceFactory
Breakpoint 10 at 0x7ff28acdc4: file gfx/gl/SharedSurface.cpp, line 167.
(gdb) b SharedSurface_EGLImage::SharedSurface_EGLImage
Breakpoint 11 at 0x7ff28d363c: file gfx/gl/SharedSurfaceEGL.cpp, line 95.
(gdb) b TextureImageEGL::TextureImageEGL
Breakpoint 12 at 0x7ff28d3e80: file gfx/gl/TextureImageEGL.cpp, line 46.
(gdb) r
[...]

If any of these breakpoints hit, that means they'd be good candidates for comparing against the working version. If they're new (rather than just heavily amended) methods then that'll be even more relevant, because that'll indicate a wholesale change of flow. In that case I'll need to work backwards through the call stack to see where, and why, the divergence happened.
Contrariwise if they're not hit then they're not part of the execution flow and it should be safe for me to ignore them in my investigation.
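The triage rule described here (hit means worth comparing, never hit means safe to ignore) is simple enough to sketch as a small helper. This is purely illustrative: the Triage struct and triage() function are hypothetical names of my own, and the hit counts would have to be collected by hand from the debugger output.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Result of sorting breakpoint locations by whether they were reached.
struct Triage {
  std::vector<std::string> candidates;  // hit at least once: compare these
  std::vector<std::string> ignorable;   // never hit: outside the execution flow
};

// hitCounts maps a function name to the number of times its breakpoint fired.
// Names come back in alphabetical order because std::map keeps keys sorted.
Triage triage(const std::map<std::string, int>& hitCounts) {
  Triage result;
  for (const auto& entry : hitCounts) {
    if (entry.second > 0) {
      result.candidates.push_back(entry.first);
    } else {
      result.ignorable.push_back(entry.first);
    }
  }
  return result;
}
```

Feeding it a map of function names to hit counts returns the two lists, with anything that fired at least once ending up in candidates regardless of how many times it was hit.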
When I now debug the program there are three breakpoints that hit; or rather two breakpoints are hit a total of three times:
Thread 8 "GeckoWorkerThre" hit Breakpoint 10, mozilla::gl::
    SurfaceFactory::SurfaceFactory (this=0x7fc95eb220, partialDesc=...,
    allocator=...,
    flags=@0x7fdf29256c: mozilla::layers::TextureFlags::NO_FLAGS)
    at gfx/gl/SharedSurface.cpp:167
167     SurfaceFactory::SurfaceFactory(const PartialSharedSurfaceDesc& partialDesc,

Thread 37 "Compositor" hit Breakpoint 12, mozilla::gl::
    TextureImageEGL::TextureImageEGL (this=0x7ed81ab2d0, aTexture=20,
    aSize=..., aWrapMode=33071, aContentType=gfxContentType::COLOR_ALPHA,
    aContext=0x7ed81a2780,
    aFlags=mozilla::gl::TextureImage::OriginBottomLeft,
    aTextureState=mozilla::gl::TextureImage::Created,
    aImageFormat=mozilla::gfx::SurfaceFormat::B8G8R8A8)
    at gfx/gl/TextureImageEGL.cpp:46
46      TextureImageEGL::TextureImageEGL(GLuint aTexture, const gfx::IntSize& aSize,

Thread 37 "Compositor" hit Breakpoint 12, mozilla::gl::
    TextureImageEGL::TextureImageEGL (this=0x7ed825ed80, aTexture=21,
    aSize=..., aWrapMode=33071, aContentType=gfxContentType::COLOR_ALPHA,
    aContext=0x7ed81a2780,
    aFlags=mozilla::gl::TextureImage::OriginBottomLeft,
    aTextureState=mozilla::gl::TextureImage::Created,
    aImageFormat=mozilla::gfx::SurfaceFormat::B8G8R8A8)
    at gfx/gl/TextureImageEGL.cpp:46
46      TextureImageEGL::TextureImageEGL(GLuint aTexture, const gfx::IntSize& aSize,

Let's get some backtraces from those. These are really long backtraces and I do apologise for that. I want to keep copies here for future reference, but there's no need to look at them in any detail. Certainly not right now anyway. Here's the first one:
Thread 8 "GeckoWorkerThre" hit Breakpoint 10, mozilla::gl:: SurfaceFactory::SurfaceFactory (this=0x7fc95e9c10, partialDesc=..., allocator=..., flags=@0x7fdf29256c: mozilla::layers::TextureFlags::NO_FLAGS) at gfx/gl/SharedSurface.cpp:167 167 SurfaceFactory::SurfaceFactory(const PartialSharedSurfaceDesc& partialDesc, (gdb) bt #0 mozilla::gl::SurfaceFactory::SurfaceFactory (this=0x7fc95e9c10, partialDesc=..., allocator=..., flags=@0x7fdf29256c: mozilla::layers::TextureFlags::NO_FLAGS) at gfx/gl/SharedSurface.cpp:167 #1 0x0000007ff28d3950 in mozilla::gl::SurfaceFactory_Basic:: SurfaceFactory_Basic (this=0x7fc95e9c10, gl=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:113 #2 0x0000007ff369d0d4 in mozilla::MakeUnique<mozilla::gl:: SurfaceFactory_Basic, mozilla::gl::GLContext&> () at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33 #3 mozilla::WebGLContext::Present (this=this@entry=0x7fc93ab2a0, xrFb=<optimized out>, consumerType=consumerType@entry=mozilla::layers::TextureType::Unknown, webvr=webvr@entry=false) at dom/canvas/WebGLContext.cpp:929 #4 0x0000007ff366511c in mozilla::HostWebGLContext::Present (webvr=false, t=mozilla::layers::TextureType::Unknown, xrFb=<optimized out>, this=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/ mozilla/RefPtr.h:280 #5 mozilla::ClientWebGLContext::Run<void (mozilla::HostWebGLContext::*)( unsigned long, mozilla::layers::TextureType, bool) const, &(mozilla:: HostWebGLContext::Present(unsigned long, mozilla::layers::TextureType, bool) const), unsigned long, mozilla::layers::TextureType const&, bool const&> ( this=<optimized out>, args#0=@0x7fdf2926c0: 0, args#1=@0x7fdf2926bf: mozilla::layers::TextureType::Unknown, args#2=@0x7fdf2926be: false) at dom/canvas/ClientWebGLContext.cpp:313 #6 0x0000007ff3665284 in mozilla::ClientWebGLContext::Present ( this=this@entry=0x7f28004210, xrFb=xrFb@entry=0x0, type=<optimized out>, webvr=<optimized out>, webvr@entry=false) at 
dom/canvas/ClientWebGLContext.cpp:363 #7 0x0000007ff3690a94 in mozilla::ClientWebGLContext::OnBeforePaintTransaction (this=0x7f28004210) at dom/canvas/ClientWebGLContext.cpp:345 #8 0x0000007ff28fff30 in mozilla::layers::CanvasRenderer:: FirePreTransactionCallback (this=this@entry=0x7fc93fb900) at gfx/layers/CanvasRenderer.cpp:75 #9 0x0000007ff29b0d04 in mozilla::layers::ShareableCanvasRenderer:: UpdateCompositableClient (this=0x7fc93fb900) at gfx/layers/ShareableCanvasRenderer.cpp:192 #10 0x0000007ff29f08a0 in mozilla::layers::ClientCanvasLayer::RenderLayer ( this=0x7fc95fc380) at gfx/layers/client/ClientCanvasLayer.cpp:25 #11 0x0000007ff29ef9c0 in mozilla::layers::ClientLayer::RenderLayerWithReadback (this=<optimized out>, aReadback=<optimized out>) at gfx/layers/client/ClientLayerManager.h:365 #12 0x0000007ff29ffd08 in mozilla::layers::ClientContainerLayer::RenderLayer ( this=0x7fc92fc450) at gfx/layers/Layers.h:1051 #13 0x0000007ff29ef9c0 in mozilla::layers::ClientLayer::RenderLayerWithReadback (this=<optimized out>, aReadback=<optimized out>) at gfx/layers/client/ClientLayerManager.h:365 #14 0x0000007ff29ffd08 in mozilla::layers::ClientContainerLayer::RenderLayer ( this=0x7fc934a230) at gfx/layers/Layers.h:1051 #15 0x0000007ff29ef9c0 in mozilla::layers::ClientLayer::RenderLayerWithReadback (this=<optimized out>, aReadback=<optimized out>) at gfx/layers/client/ClientLayerManager.h:365 #16 0x0000007ff29ffd08 in mozilla::layers::ClientContainerLayer::RenderLayer ( this=0x7fc8d123e0) at gfx/layers/Layers.h:1051 #17 0x0000007ff2a069ec in mozilla::layers::ClientLayerManager:: EndTransactionInternal (this=this@entry=0x7fc8a5ea90, aCallback=aCallback@entry= 0x7ff46a31ec <mozilla::FrameLayerBuilder::DrawPaintedLayer(mozilla::layers:: PaintedLayer*, gfxContext*, mozilla::gfx::IntRegionTyped<mozilla::gfx:: UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx:: UnknownUnits> const&, mozilla::layers::DrawRegionClip, mozilla::gfx:: 
IntRegionTyped<mozilla::gfx::UnknownUnits> const&, void*)>, aCallbackData=aCallbackData@entry=0x7fdf293268) at gfx/layers/client/ClientLayerManager.cpp:341 #18 0x0000007ff2a118ec in mozilla::layers::ClientLayerManager::EndTransaction ( this=0x7fc8a5ea90, aCallback=0x7ff46a31ec <mozilla::FrameLayerBuilder::DrawPaintedLayer( mozilla::layers::PaintedLayer*, gfxContext*, mozilla::gfx:: IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx:: IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::layers:: DrawRegionClip, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, void*)>, aCallbackData=0x7fdf293268, aFlags=mozilla::layers:: LayerManager::END_DEFAULT) at gfx/layers/client/ClientLayerManager.cpp:397 #19 0x0000007ff46a060c in nsDisplayList::PaintRoot ( this=this@entry=0x7fdf295078, aBuilder=aBuilder@entry=0x7fdf293268, aCtx=aCtx@entry=0x0, aFlags=aFlags@entry=13, aDisplayListBuildTime=...) at layout/painting/nsDisplayList.cpp:2622 #20 0x0000007ff442c968 in nsLayoutUtils::PaintFrame ( aRenderingContext=aRenderingContext@entry=0x0, aFrame=aFrame@entry=0x7fc9280d10, aDirtyRegion=..., aBackstop=aBackstop@entry=4294967295, aBuilderMode=aBuilderMode@entry=nsDisplayListBuilderMode::Painting, aFlags=aFlags@entry=(nsLayoutUtils::PaintFrameFlags::WidgetLayers | nsLayoutUtils::PaintFrameFlags::ExistingTransaction | nsLayoutUtils:: PaintFrameFlags::NoComposite)) at ${PROJECT}/obj-build-mer-qt-xr/dist/ include/mozilla/MaybeStorageBase.h:80 #21 0x0000007ff43b705c in mozilla::PresShell::Paint ( this=this@entry=0x7fc921c9a0, aViewToPaint=aViewToPaint@entry=0x7fc8563cb0, aDirtyRegion=..., aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers) at layout/base/PresShell.cpp:6400 #22 0x0000007ff41eef2c in nsViewManager::ProcessPendingUpdatesPaint ( this=this@entry=0x7fc8563c70, aWidget=aWidget@entry=0x7fc90d0760) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/RectAbsolute.h:43 [...] #55 0x0000007fefbab89c in ?? 
() from /lib64/libc.so.6
(gdb)
Here's the second one:
Thread 37 "Compositor" hit Breakpoint 12, mozilla::gl:: TextureImageEGL::TextureImageEGL (this=0x7ee01faa90, aTexture=21, aSize=..., aWrapMode=33071, aContentType=gfxContentType::COLOR_ALPHA, aContext=0x7ee01a28a0, aFlags=mozilla::gl::TextureImage::OriginBottomLeft, aTextureState=mozilla::gl::TextureImage::Created, aImageFormat=mozilla::gfx: :SurfaceFormat::B8G8R8A8) at gfx/gl/TextureImageEGL.cpp:46 46 TextureImageEGL::TextureImageEGL(GLuint aTexture, const gfx::IntSize& aSize, (gdb) bt #0 mozilla::gl::TextureImageEGL::TextureImageEGL (this=0x7ee01faa90, aTexture=21, aSize=..., aWrapMode=33071, aContentType=gfxContentType:: COLOR_ALPHA, aContext=0x7ee01a28a0, aFlags=mozilla::gl::TextureImage::OriginBottomLeft, aTextureState=mozilla::gl::TextureImage::Created, aImageFormat=mozilla::gfx::SurfaceFormat::B8G8R8A8) at gfx/gl/TextureImageEGL.cpp:46 #1 0x0000007ff28d42e0 in mozilla::gl::TileGenFuncEGL ( gl=gl@entry=0x7ee01a28a0, aSize=..., aContentType=aContentType@entry=gfxContentType::COLOR_ALPHA, aFlags=aFlags@entry=mozilla::gl::TextureImage::OriginBottomLeft, aImageFormat=aImageFormat@entry=mozilla::gfx::SurfaceFormat::B8G8R8A8) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33 #2 0x0000007ff28b7ec8 in mozilla::gl::TileGenFunc (aImageFormat=mozilla::gfx:: SurfaceFormat::B8G8R8A8, aFlags=mozilla::gl::TextureImage::OriginBottomLeft, aContentType=gfxContentType::COLOR_ALPHA, aSize=..., gl=0x7ee01a28a0) at gfx/gl/GLTextureImage.cpp:52 #3 mozilla::gl::TiledTextureImage::Resize (this=this@entry=0x7ee01d7660, aSize=...) 
at gfx/gl/GLTextureImage.cpp:399 #4 0x0000007ff28b81cc in mozilla::gl::TiledTextureImage::TiledTextureImage ( this=0x7ee01d7660, aGL=0x7ee01a28a0, aSize=..., aContentType=<optimized out>, aFlags=<optimized out>, aImageFormat=<optimized out>) at gfx/gl/GLTextureImage.cpp:221 #5 0x0000007ff28d41f8 in mozilla::gl::CreateTextureImageEGL ( gl=gl@entry=0x7ee01a28a0, aSize=..., aContentType=aContentType@entry=gfxContentType::COLOR_ALPHA, aWrapMode=aWrapMode@entry=33071, aFlags=aFlags@entry=mozilla::gl::TextureImage::OriginBottomLeft, aImageFormat=aImageFormat@entry=mozilla::gfx::SurfaceFormat::B8G8R8A8) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33 #6 0x0000007ff28b8350 in mozilla::gl::CreateTextureImage ( gl=gl@entry=0x7ee01a28a0, aSize=..., aContentType=aContentType@entry=gfxContentType::COLOR_ALPHA, aWrapMode=aWrapMode@entry=33071, aFlags=aFlags@entry=mozilla::gl::TextureImage::OriginBottomLeft, aImageFormat=<optimized out>) at gfx/gl/GLTextureImage.cpp:30 #7 0x0000007ff294ec88 in mozilla::layers::TextureImageTextureSourceOGL::Update (this=0x7ee01c70f0, aSurface=0x7ee019b290, aDestRegion=0x0, aSrcOffset=0x0, aDstOffset=0x0) at ${PROJECT}/obj-build-mer-qt-xr/dist/ include/gfx2DGlue.h:70 #8 0x0000007ff2a43ea8 in mozilla::layers::BufferTextureHost::Upload ( this=this@entry=0x7ee01bb470, aRegion=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313 #9 0x0000007ff2a444e0 in mozilla::layers::BufferTextureHost::MaybeUpload ( this=this@entry=0x7ee01bb470, aRegion=<optimized out>) at gfx/layers/composite/TextureHost.cpp:1046 #10 0x0000007ff2a44808 in mozilla::layers::BufferTextureHost::UploadIfNeeded ( this=this@entry=0x7ee01bb470) at gfx/layers/composite/TextureHost.cpp:1031 #11 0x0000007ff2a44824 in mozilla::layers::BufferTextureHost::Lock ( this=0x7ee01bb470) at gfx/layers/composite/TextureHost.cpp:650 #12 0x0000007ff2a35d8c in mozilla::layers::ImageHost::Lock (this=0x7ee01b7c60) at 
${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313 #13 0x0000007ff2a3621c in mozilla::layers::AutoLockCompositableHost:: AutoLockCompositableHost (aHost=0x7ee01b7c60, this=0x7f364e8ca0) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313 #14 mozilla::layers::ImageHost::Composite (this=this@entry=0x7ee01b7c60, aCompositor=aCompositor@entry=0x7ee0002ed0, aLayer=aLayer@entry=0x7ee0265370, aEffectChain=..., aOpacity=1, aTransform=..., aSamplingFilter=<optimized out>, aClipRect=..., aVisibleRegion=aVisibleRegion@entry=0x0, aGeometry=...) at gfx/layers/composite/ImageHost.cpp:197 #15 0x0000007ff2a26d3c in mozilla::layers::CanvasLayerComposite::<lambda( mozilla::layers::EffectChain&, const IntRect&)>::operator() (clipRect=..., effectChain=..., __closure=<synthetic pointer>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h:50 #16 mozilla::layers::RenderWithAllMasks<mozilla::layers::CanvasLayerComposite:: RenderLayer(const IntRect&, const mozilla::Maybe<mozilla::gfx:: PolygonTyped<mozilla::gfx::UnknownUnits> >&)::<lambda(mozilla::layers:: EffectChain&, const IntRect&)> >(mozilla::layers::Layer *, mozilla::layers:: Compositor *, const mozilla::gfx::IntRect &, mozilla::layers:: CanvasLayerComposite::<lambda(mozilla::layers::EffectChain&, const IntRect&)>) (aLayer=aLayer@entry= 0x7ee0264f60, aCompositor=<optimized out>, aClipRect=..., aRenderCallback=aRenderCallback@entry=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/ LayerManagerCompositeUtils.h:69 #17 0x0000007ff2a27090 in mozilla::layers::CanvasLayerComposite::RenderLayer ( this=0x7ee0264f60, aClipRect=..., aGeometry=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:289 #18 0x0000007ff2a32f88 in mozilla::layers::RenderLayers<mozilla::layers:: ContainerLayerComposite> (aContainer=aContainer@entry=0x7ee025d580, aManager=aManager@entry=0x7ee01a43a0, aClipRect=..., aGeometry=...) 
at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Maybe.h:443 #19 0x0000007ff2a33e78 in mozilla::layers::ContainerRender<mozilla::layers:: ContainerLayerComposite> (aContainer=0x7ee025d580, aManager=0x7ee01a43a0, aClipRect=..., aGeometry=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/BaseRect.h:53 #20 0x0000007ff2a33fc0 in mozilla::layers::ContainerLayerComposite::RenderLayer (this=<optimized out>, aClipRect=..., aGeometry=...) at gfx/layers/composite/ContainerLayerComposite.cpp:745 #21 0x0000007ff2a32f88 in mozilla::layers::RenderLayers<mozilla::layers:: ContainerLayerComposite> (aContainer=aContainer@entry=0x7ee01d0140, aManager=aManager@entry=0x7ee01a43a0, aClipRect=..., aGeometry=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Maybe.h:443 #22 0x0000007ff2a33e78 in mozilla::layers::ContainerRender<mozilla::layers:: ContainerLayerComposite> (aContainer=0x7ee01d0140, aManager=0x7ee01a43a0, aClipRect=..., aGeometry=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/BaseRect.h:53 #23 0x0000007ff2a33fc0 in mozilla::layers::ContainerLayerComposite::RenderLayer (this=<optimized out>, aClipRect=..., aGeometry=...) at gfx/layers/composite/ContainerLayerComposite.cpp:745 #24 0x0000007ff2a32f88 in mozilla::layers::RenderLayers<mozilla::layers:: ContainerLayerComposite> (aContainer=aContainer@entry=0x7ee01b0d00, aManager=aManager@entry=0x7ee01a43a0, aClipRect=..., aGeometry=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Maybe.h:443 #25 0x0000007ff2a33e78 in mozilla::layers::ContainerRender<mozilla::layers:: ContainerLayerComposite> (aContainer=0x7ee01b0d00, aManager=0x7ee01a43a0, aClipRect=..., aGeometry=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/BaseRect.h:53 #26 0x0000007ff2a33fc0 in mozilla::layers::ContainerLayerComposite::RenderLayer (this=<optimized out>, aClipRect=..., aGeometry=...) 
at gfx/layers/composite/ContainerLayerComposite.cpp:745 #27 0x0000007ff2a1bc84 in mozilla::layers::LayerManagerComposite::<lambda(const IntRect&)>::operator()(const mozilla::gfx::IntRect &) const ( __closure=__closure@entry=0x7f364e98c8, aClipRect=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h:50 #28 0x0000007ff2a30e68 in mozilla::layers::LayerManagerComposite::Render ( this=this@entry=0x7ee01a43a0, aInvalidRegion=..., aOpaqueRegion=...) at gfx/layers/composite/LayerManagerComposite.cpp:1237 #29 0x0000007ff2a3148c in mozilla::layers::LayerManagerComposite:: UpdateAndRender (this=this@entry=0x7ee01a43a0) at gfx/layers/composite/LayerManagerComposite.cpp:657 #30 0x0000007ff2a3183c in mozilla::layers::LayerManagerComposite:: EndTransaction (this=this@entry=0x7ee01a43a0, aTimeStamp=..., aFlags=aFlags@entry=mozilla::layers::LayerManager::END_DEFAULT) at gfx/layers/composite/LayerManagerComposite.cpp:572 #31 0x0000007ff2a72fbc in mozilla::layers::CompositorBridgeParent:: CompositeToTarget (this=0x7fc89b9920, aId=..., aTarget=0x0, aRect=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313 #32 0x0000007ff4e07e38 in mozilla::embedlite::EmbedLiteCompositorBridgeParent:: CompositeToDefaultTarget (this=0x7fc89b9920, aId=...) at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:160 #33 0x0000007ff2a58718 in mozilla::layers::CompositorVsyncScheduler::Composite ( this=0x7fc8bd6dd0, aVsyncEvent=...) at gfx/layers/ipc/CompositorVsyncScheduler.cpp:256 #34 0x0000007ff2a50b98 in mozilla::detail::RunnableMethodArguments<mozilla:: VsyncEvent>::applyImpl<mozilla::layers::CompositorVsyncScheduler, void ( mozilla::layers::CompositorVsyncScheduler::*)(mozilla::VsyncEvent const&), StoreCopyPassByConstLRef<mozilla::VsyncEvent>, 0ul> (args=..., m=<optimized out>, o=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/ nsThreadUtils.h:887 [...] #46 0x0000007fefbab89c in ?? 
() from /lib64/libc.so.6
(gdb)
To prevent this becoming tiresome I'm going to skip the last backtrace, since it relates to the same TextureImageEGL::TextureImageEGL() call we've just seen.
That feels like plenty to be getting on with. Tomorrow I'll need to compare these backtraces with the working ESR 91 code to see whether it's possible to get to the same place or not and, if it is, what might have changed.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
7 Jun 2024 : Day 256 #
It's the big one! A full 2^8 days of development have gone into this now, which seems like an absurd amount of effort.
Unfortunately, while numerically this is very exciting, the actual work I'm doing right now isn't, so there's no big reveal to impress you with. Instead I'm going to continue hacking away at the WebGL bug I discovered a couple of days back.
To elaborate, I'm currently trying to find out why the WebView rendering fix has caused WebGL rendering to fail. Both are types of offscreen rendering, so it's not surprising that one has affected the other, but it's important that both of them are working correctly.
Over the last couple of days I discovered that the problem definitely exists in the latest commit added to the code. I checked that by rolling the repository back one commit, rebuilding and checking that the problem doesn't happen with the slightly older version.
Now I need to find out what has changed in the flow of the code to make the problem appear.
From the earlier backtraces we know that the problem is a call to SharedSurface_Basic::ToSurfaceDescriptor(), which itself is called from WebGLContext::GetFrontBuffer(). Stepping through this method I can see that there's no immediate crashing happening there, and execution continues into ShareableCanvasRenderer::UpdateCompositableClient(). The code being executed there looks like this:
// First, let's see if we can get a no-copy TextureClient from the canvas.
auto tc = fnGetExistingTc();
if (!tc) {
  // Otherwise, snapshot the surface and copy into a TexClient.
  tc = fnMakeTcFromSnapshot();
}
if (tc != mFrontBufferFromDesc) {
  mFrontBufferFromDesc = nullptr;
}
Both fnGetExistingTc() and fnMakeTcFromSnapshot() are lambda functions defined inside the method. But the first of these is where the call to SharedSurface_Basic::ToSurfaceDescriptor() occurs. This is returning null because a call to SharedSurface_Basic::ToSurfaceDescriptor() always returns Nothing().
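To make that fallback concrete, here's a minimal standalone sketch of the same control flow. None of it is Gecko code: SurfaceDescriptor, TextureClient and the free functions are simplified stand-ins made up for illustration, with std::optional and std::shared_ptr standing in for Maybe and RefPtr.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>

// Hypothetical stand-ins for illustration only; these are not the Gecko types.
struct SurfaceDescriptor { int handle; };
struct TextureClient { std::string origin; };

// Mirrors the behaviour described above: the basic surface never offers a
// descriptor, so callers always see an empty optional.
std::optional<SurfaceDescriptor> ToSurfaceDescriptor() { return std::nullopt; }

// No-copy path: only succeeds when a descriptor is available.
std::shared_ptr<TextureClient> GetExistingTc() {
  const auto desc = ToSurfaceDescriptor();
  if (!desc) return nullptr;
  return std::make_shared<TextureClient>(TextureClient{"descriptor"});
}

// Copy path: always produces a texture client in this sketch.
std::shared_ptr<TextureClient> MakeTcFromSnapshot() {
  return std::make_shared<TextureClient>(TextureClient{"snapshot"});
}

// Try the no-copy path first, then fall back to the snapshot copy.
std::shared_ptr<TextureClient> AcquireTc() {
  auto tc = GetExistingTc();
  if (!tc) {
    tc = MakeTcFromSnapshot();
  }
  assert(tc && "the snapshot path is expected to succeed in this sketch");
  return tc;
}
```

With a descriptor that's always empty, the no-copy path yields a null pointer every time, so whatever AcquireTc() hands back has to come from the snapshot path.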
However, the following call to fnMakeTcFromSnapshot() is returning a value, as we can see in the following debug steps:
(gdb) n
32        return Nothing();
(gdb) n
50      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h: No such file or directory.
(gdb) n
mozilla::ClientWebGLContext::GetFrontBuffer (this=this@entry=0x7fc8b4a4b0,
    fb=fb@entry=0x0, vr=<optimized out>, vr@entry=false)
    at dom/canvas/ClientWebGLContext.cpp:368
368       const auto notLost = mNotLost;
(gdb) n
mozilla::layers::ShareableCanvasRenderer::<lambda()>::operator() (
    __closure=<synthetic pointer>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Maybe.h:443
443     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/Maybe.h: No such file or directory.
(gdb)
149         if (!desc) return nullptr;
(gdb) n
148         const auto desc = webgl->GetFrontBuffer(nullptr);
(gdb) n
mozilla::layers::ShareableCanvasRenderer::UpdateCompositableClient (
    this=0x7fc98a98e0) at gfx/layers/ShareableCanvasRenderer.cpp:196
196       if (!tc) {
(gdb) p tc
$8 = {mRawPtr = 0x0}
(gdb) n
198         tc = fnMakeTcFromSnapshot();
(gdb) n
200       if (tc != mFrontBufferFromDesc) {
(gdb) p tc
$9 = {mRawPtr = 0x7fc93bc9a0}
(gdb)
This will need comparing against what happens in our newer build where the crash occurs. Thinking back, I'm now a little concerned that the sole reason for the crash is this line that I added to SharedSurface_Basic::ToSurfaceDescriptor():
Maybe<layers::SurfaceDescriptor> SharedSurface_Basic::ToSurfaceDescriptor() {
  MOZ_CRASH("GFX: ToSurfaceDescriptor");
  return Nothing();
}
Certainly this will cause a crash, but I thought I'd also tested it without this. Now I'm not so sure...
Sadly I didn't keep copies of the newer packages to install back again, but I do have a copy of the libxul.so library from back then. I'm not sure if I'll be able to debug using it, but it's worth a try. If it turns out not to be debuggable I'll just have to do another complete rebuild (although, this time, I'll keep a copy of the current packages so I can reinstall them if I need to do another comparison!).
Sadly I don't get any joy testing the library:
Thread 8 "GeckoWorkerThre" received signal SIGSEGV, Segmentation fault.
0x0000007fe5ee13a8 in ?? ()
(gdb) bt
#0  0x0000007fe5ee13a8 in ?? ()
#1  0x0000007fdf293e08 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)
I'm going to have to do a rebuild. This means restoring the original branch, then performing the build to create the full set of RPM packages.
$ cd gecko-dev
$ git checkout -b temp
$ git checkout FIREFOX_ESR_91_9_X_RELBRANCH_patches
$ git log --oneline -5
7437a9d17284 (HEAD -> FIREFOX_ESR_91_9_X_RELBRANCH_patches) Restore GLScreenBuffer and TextureImageEGL
d3ba4df29a32 (temp) Restore NotifyDidPaint event and timers
f55057391ac0 Prevent errors from DownloadPrompter
eab04b8c0d80 Enable dconf
c6ea49286566 (origin/FIREFOX_ESR_91_9_X_RELBRANCH_patches) Disable SessionStore functionality
$ cd ..
Before performing the build I must remove the code that's guaranteed to cause a crash:
Maybe<layers::SurfaceDescriptor> SharedSurface_Basic::ToSurfaceDescriptor() {
  return Nothing();
}
Now to build:
$ sfdk build -d --with git_workaround
[...]
The build won't be ready until the morning at the earliest. So I'm going to pause there and come back to this tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
6 Jun 2024 : Day 255 #
It's day 0b011111111 today, or to put it another way, day (2^8 - 1). That means tomorrow is the big one. I'm certainly hoping I won't need until 2^9 before ESR 91 is released, which hopefully means this will be the last big one, numerically speaking, for this project.
A couple of months back Adam Pigg (piggz) claimed he suspected me of holding out on a solution:
[M]y theory is that its all working just fine, and he's just dragging it out to the big reveal on day 2^8 :)
The truth was that at that stage I wasn't at all convinced I'd be able to get the WebView working in time. Thankfully it is now working, in the nick of time as it turns out, but nevertheless the task isn't quite complete. Even once I've finalised this WebView patch, there'll still be more work to do in areas including video rendering, WebRTC videoconferencing, patch refactoring and a bunch of smaller glitches to iron out. So I'm sorry to say there's still no release on the horizon just yet. But as I hope is clear by now, I'm playing the long game. Not only am I committed to getting it finished, but I'm also doing my best to help ensure the process is as streamlined as possible for the future too. Hopefully, when it comes to the next release, things will be easier.
Before I get back to coding, I also need to give advance warning that I'll not be posting entries next week. Next week is Hackweek at work, which means a week-long intensive coding session with my colleagues. There's a good chance that this won't leave much in the way of free time for me to be working on Gecko. That'll be from Monday 10th June to Friday 14th June. I'll start up right back where I leave things off on the Saturday though.
Alright, now back to coding. Yesterday you'll recall I discovered a problem with WebGL rendering. I know this was working back in February because I demoed it at FOSDEM, but some change I've made between then and now has broken it.
Yesterday I recorded a couple of backtraces around the crash. My suspicion is that the problem relates to the recent changes to offscreen rendering.
To test this theory out I've created a new branch and rolled the project back a single commit to before I started making the WebView changes. The wonders of version control! During the day today I set it building a completely fresh set of RPM packages based on this slightly older version of the code.
$ cd gecko-dev
$ git checkout -b temp
$ git log FIREFOX_ESR_91_9_X_RELBRANCH_patches_temp --oneline -5
eb40ffd47432 (FIREFOX_ESR_91_9_X_RELBRANCH_patches_temp) Restore GLScreenBuffer and TextureImageEGL
d3ba4df29a32 (HEAD -> temp) Restore NotifyDidPaint event and timers
f55057391ac0 Prevent errors from DownloadPrompter
eab04b8c0d80 Enable dconf
c6ea49286566 (origin/FIREFOX_ESR_91_9_X_RELBRANCH_patches) Disable SessionStore functionality
$ git reset --hard d3ba4df29a32d53c38c68e4512d1fa82073ecdf4
$ git log --oneline -4
d3ba4df29a32 (HEAD -> temp) Restore NotifyDidPaint event and timers
f55057391ac0 Prevent errors from DownloadPrompter
eab04b8c0d80 Enable dconf
c6ea49286566 (origin/FIREFOX_ESR_91_9_X_RELBRANCH_patches) Disable SessionStore functionality
$ cd ..
$ sfdk build -d --with git_workaround
[...]
Testing these new packages this evening I find that WebGL is indeed working with this one-commit-older version. That narrows down the problem to somewhere in the most recent commit eb40ffd47432.
That's a big help. With the two backtraces captured yesterday my plan is to compare the execution flow with the working version to see how they differ. Here's what I believe to be the equivalent backtrace:
Thread 10 "GeckoWorkerThre" hit Breakpoint 2, mozilla::gl::
    SharedSurface_Basic::SharedSurface_Basic (this=0x7f81347dc0, gl=0x7f815d82c0,
    size=..., hasAlpha=true, tex=1, ownsTex=true) at gfx/gl/SharedSurfaceGL.cpp:54
54      SharedSurface_Basic::SharedSurface_Basic(GLContext* gl, const IntSize& size,
This leads us to the second backtrace, for the ToSurfaceDescriptor conversion method:
Thread 8 "GeckoWorkerThre" hit Breakpoint 5, mozilla::gl::
    SharedSurface_Basic::ToSurfaceDescriptor (this=0x7fc8d8c9f0)
    at gfx/gl/SharedSurfaceGL.cpp:31
31      Maybe<layers::SurfaceDescriptor> SharedSurface_Basic::
    ToSurfaceDescriptor() {
(gdb) bt
#0  mozilla::gl::SharedSurface_Basic::ToSurfaceDescriptor (this=0x7fc8d8c9f0)
    at gfx/gl/SharedSurfaceGL.cpp:31
#1  0x0000007ff3694278 in mozilla::WebGLContext::GetFrontBuffer (
    this=this@entry=0x7fc94b8d10, xrFb=<optimized out>, webvr=webvr@entry=false)
    at dom/canvas/WebGLContext.cpp:949
#2  0x0000007ff365c528 in mozilla::HostWebGLContext::GetFrontBuffer (
    this=<optimized out>, xrFb=<optimized out>, webvr=false)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:280
#3  0x0000007ff365c5d8 in mozilla::ClientWebGLContext::GetFrontBuffer (
    this=this@entry=0x7fc8b4a4b0, fb=fb@entry=0x0, vr=<optimized out>,
    vr@entry=false) at dom/canvas/ClientWebGLContext.cpp:373
#4  0x0000007ff29b2410 in mozilla::layers::ShareableCanvasRenderer::<lambda()>::
    operator() (__closure=<synthetic pointer>)
    at gfx/layers/ShareableCanvasRenderer.cpp:148
#5  mozilla::layers::ShareableCanvasRenderer::UpdateCompositableClient (
    this=0x7fc98a98e0) at gfx/layers/ShareableCanvasRenderer.cpp:195
#6  0x0000007ff29f1e10 in mozilla::layers::ClientCanvasLayer::RenderLayer (
    this=0x7fc959bd60) at gfx/layers/client/ClientCanvasLayer.cpp:25
#7  0x0000007ff29f0f30 in mozilla::layers::ClientLayer::RenderLayerWithReadback
    (this=<optimized out>, aReadback=<optimized out>)
    at gfx/layers/client/ClientLayerManager.h:365
#8  0x0000007ff2a01054 in mozilla::layers::ClientContainerLayer::RenderLayer (
    this=0x7fc9798e60) at gfx/layers/Layers.h:1051
#9  0x0000007ff29f0f30 in mozilla::layers::ClientLayer::RenderLayerWithReadback
    (this=<optimized out>, aReadback=<optimized out>)
    at gfx/layers/client/ClientLayerManager.h:365
#10 0x0000007ff2a01054 in mozilla::layers::ClientContainerLayer::RenderLayer (
    this=0x7fc8d810f0) at gfx/layers/Layers.h:1051
#11 0x0000007ff29f0f30 in mozilla::layers::ClientLayer::RenderLayerWithReadback
    (this=<optimized out>, aReadback=<optimized out>)
    at gfx/layers/client/ClientLayerManager.h:365
#12 0x0000007ff2a01054 in mozilla::layers::ClientContainerLayer::RenderLayer (
    this=0x7fc93748a0) at gfx/layers/Layers.h:1051
#13 0x0000007ff2a08270 in mozilla::layers::ClientLayerManager::
    EndTransactionInternal (this=this@entry=0x7fc8b18a30,
    aCallback=aCallback@entry=0x7ff46a44d0 <mozilla::FrameLayerBuilder::
    DrawPaintedLayer(mozilla::layers::PaintedLayer*, gfxContext*, mozilla::gfx::
    IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::
    IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::layers::
    DrawRegionClip, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits>
    const&, void*)>, aCallbackData=aCallbackData@entry=0x7fdf2dd268)
    at gfx/layers/client/ClientLayerManager.cpp:341
#14 0x0000007ff2a12be4 in mozilla::layers::ClientLayerManager::EndTransaction (
    this=0x7fc8b18a30,
    aCallback=0x7ff46a44d0 <mozilla::FrameLayerBuilder::DrawPaintedLayer(
    mozilla::layers::PaintedLayer*, gfxContext*, mozilla::gfx::
    IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::
    IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::layers::
    DrawRegionClip, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits>
    const&, void*)>, aCallbackData=0x7fdf2dd268, aFlags=mozilla::layers::
    LayerManager::END_DEFAULT) at gfx/layers/client/ClientLayerManager.cpp:397
#15 0x0000007ff46a18f0 in nsDisplayList::PaintRoot (
    this=this@entry=0x7fdf2df078, aBuilder=aBuilder@entry=0x7fdf2dd268,
    aCtx=aCtx@entry=0x0, aFlags=aFlags@entry=13, aDisplayListBuildTime=...)
    at layout/painting/nsDisplayList.cpp:2622
#16 0x0000007ff442dc4c in nsLayoutUtils::PaintFrame (
    aRenderingContext=aRenderingContext@entry=0x0,
    aFrame=aFrame@entry=0x7fc9362940, aDirtyRegion=...,
    aBackstop=aBackstop@entry=4294967295,
    aBuilderMode=aBuilderMode@entry=nsDisplayListBuilderMode::Painting,
    aFlags=aFlags@entry=(nsLayoutUtils::PaintFrameFlags::WidgetLayers |
    nsLayoutUtils::PaintFrameFlags::ExistingTransaction | nsLayoutUtils::
    PaintFrameFlags::NoComposite)) at ${PROJECT}/obj-build-mer-qt-xr/dist/
    include/mozilla/MaybeStorageBase.h:80
#17 0x0000007ff43b8340 in mozilla::PresShell::Paint (
    this=this@entry=0x7fc92df890, aViewToPaint=aViewToPaint@entry=0x7fc8570b20,
    aDirtyRegion=..., aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers)
    at layout/base/PresShell.cpp:6400
#18 0x0000007ff41f0210 in nsViewManager::ProcessPendingUpdatesPaint (
    this=this@entry=0x7fc8570ae0, aWidget=aWidget@entry=0x7fc8570ba0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/RectAbsolute.h:43
#19 0x0000007ff41f05c4 in nsViewManager::ProcessPendingUpdatesForView (
    this=this@entry=0x7fc8570ae0, aView=<optimized out>,
    aFlushDirtyRegion=aFlushDirtyRegion@entry=true)
    at view/nsViewManager.cpp:394
#20 0x0000007ff41f0bb4 in nsViewManager::ProcessPendingUpdates (
    this=this@entry=0x7fc8570ae0) at view/nsViewManager.cpp:972
[...]
#51 0x0000007fefbb189c in ?? () from /lib64/libc.so.6
(gdb)
There's actually very little difference between these calls, as we can see if we look at just the first couple of frames of each next to each other:
#0  0x0000007ff28d1ca4 in mozilla::gl::SharedSurface_Basic::ToSurfaceDescriptor
    (this=<optimized out>) at gfx/gl/SharedSurfaceGL.cpp:38
#1  0x0000007ff36920a4 in mozilla::WebGLContext::GetFrontBuffer (
    this=this@entry=0x7fc94889c0, xrFb=<optimized out>, webvr=webvr@entry=false)
    at dom/canvas/WebGLContext.cpp:949

#0  mozilla::gl::SharedSurface_Basic::ToSurfaceDescriptor (this=0x7fc8d8c9f0)
    at gfx/gl/SharedSurfaceGL.cpp:31
#1  0x0000007ff3694278 in mozilla::WebGLContext::GetFrontBuffer (
    this=this@entry=0x7fc94b8d10, xrFb=<optimized out>, webvr=webvr@entry=false)
    at dom/canvas/WebGLContext.cpp:949
The first of these is the broken version, while the second is working. In order to get a deeper understanding, I'm going to want to step through the code between here and the crash.
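When comparing frames like these by eye, the addresses tend to get in the way. As a rough aid, here's a small sketch that masks hexadecimal values and collapses whitespace, so that two frames can be compared on function name, arguments and source location alone. NormaliseFrame() and SameFrame() are hypothetical helpers written for this example; they aren't part of gdb or Gecko.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Replace every hexadecimal value (frame addresses, pointer arguments) with a
// fixed token and collapse runs of whitespace into a single space, leaving the
// parts of a gdb frame that are comparable between two builds.
std::string NormaliseFrame(const std::string& frame) {
  static const std::regex hexValue("0x[0-9a-fA-F]+");
  static const std::regex whitespace("\\s+");
  std::string out = std::regex_replace(frame, hexValue, "ADDR");
  out = std::regex_replace(out, whitespace, " ");
  return out;
}

// Two frames are treated as equivalent if they match after normalisation.
bool SameFrame(const std::string& a, const std::string& b) {
  return NormaliseFrame(a) == NormaliseFrame(b);
}
```

Applied to the frames above, the two #1 frames come out identical once the addresses are masked. The two #0 frames still differ: the line numbers don't match (38 versus 31), and the broken build reports the this pointer as optimised out and prints a frame address where the working one doesn't.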
Unfortunately, after running the build during the day, I'm a bit short on time to delve deeper into this now. But I'll pick this up again tomorrow to try to figure out what the difference is. Once I have that it will hopefully give a much clearer idea about how to fix the problem with my latest changeset. I can then roll back to my original commit, fix it, and... well, let's see.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
5 Jun 2024 : Day 254 #
I'm continuing with my attempts to simplify the latest commit by removing unnecessary changes and doing my best to align the changes with the upstream ESR 91 changes.
To do this I've been looking carefully through the code to try to find unused methods that were added in my latest commit. I managed to find a couple. Both GLContext::ResizeOffscreen() and GLContext::OffscreenSize() have equivalents in the GLScreenBuffer class which can be used as drop-in replacements. So those two methods, which were previously added, are now added no more.
Another change I've made is to remove the following member variable from GLContext:
    std::map<GLuint, SharedSurface*> mFBOMapping
With this removed I'm also able to remove the related code from the source file as well as the dependency on the standard map header. I've tested the result and it doesn't seem to have any negative effects on either the browser or the WebView.
The other change I've made today is to simplify how the EGLDisplay is passed around when things are still being initialised. I'd created quite a web of methods to pass this between, with all of these needing this variable to be passed in. Here they are, with the aDisplay parameter at the end being the one I'd really like to avoid the need for.
    RefPtr<GLLibraryEGL> DefaultEglLibrary(nsACString* const out_failureId, EGLDisplay aDisplay);
    inline std::shared_ptr<EglDisplay> DefaultEglDisplay(nsACString* const out_failureId, EGLDisplay aDisplay);
    RefPtr<GLLibraryEGL> GLLibraryEGL::Create(nsACString* const out_failureId, EGLDisplay aDisplay);
    bool GLLibraryEGL::Init(bool forceAccel, nsACString* const out_failureId, EGLDisplay aDisplay);
Just by rearranging things a little I've been able to remove the aDisplay parameter from all of these. It turns out that most of them were just passing the value on to one of the other methods. Once they were taken away from one, they weren't needed in the others either.
After these changes the patch is still pretty large, but looking considerably better.
However, during my testing I hit another problem. It turns out that somewhere, some of the changes I made (probably related to offscreen rendering) have broken the WebGLContext capabilities. That means that if a web page uses WebGL it'll now trigger a crash. I know for sure that this was working earlier, so something has changed.
You'll have to forgive me for including a lengthy backtrace. I admit it's not very enlightening, but I want to keep a copy here so I have something to refer to. This is the backtrace for the crash:
    Thread 8 "GeckoWorkerThre" received signal SIGSEGV, Segmentation fault.
    [Switching to LWP 10166]
    0x0000007ff28d1ca4 in mozilla::gl::SharedSurface_Basic::ToSurfaceDescriptor (this=<optimized out>) at gfx/gl/SharedSurfaceGL.cpp:38
    38 MOZ_CRASH("GFX: ToSurfaceDescriptor");
    (gdb) bt
    #0 0x0000007ff28d1ca4 in mozilla::gl::SharedSurface_Basic::ToSurfaceDescriptor (this=<optimized out>) at gfx/gl/SharedSurfaceGL.cpp:38
    #1 0x0000007ff36920a4 in mozilla::WebGLContext::GetFrontBuffer (this=this@entry=0x7fc94889c0, xrFb=<optimized out>, webvr=webvr@entry=false) at dom/canvas/WebGLContext.cpp:949
    #2 0x0000007ff365a410 in mozilla::HostWebGLContext::GetFrontBuffer (this=<optimized out>, xrFb=<optimized out>, webvr=false) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:280
    #3 0x0000007ff365a4c0 in mozilla::ClientWebGLContext::GetFrontBuffer (this=this@entry=0x7fc8b48ec0, fb=fb@entry=0x0, vr=<optimized out>, vr@entry=false) at dom/canvas/ClientWebGLContext.cpp:373
    #4 0x0000007ff29b0084 in mozilla::layers::ShareableCanvasRenderer::<lambda()>::operator() (__closure=<synthetic pointer>) at gfx/layers/ShareableCanvasRenderer.cpp:148
    #5 mozilla::layers::ShareableCanvasRenderer::UpdateCompositableClient (this=0x7fc96d7db0) at gfx/layers/ShareableCanvasRenderer.cpp:195
    #6 0x0000007ff29efa84 in mozilla::layers::ClientCanvasLayer::RenderLayer (this=0x7fc978af90) at gfx/layers/client/ClientCanvasLayer.cpp:25
    #7 0x0000007ff29eeba4 in mozilla::layers::ClientLayer::RenderLayerWithReadback (this=<optimized out>, aReadback=<optimized out>) at gfx/layers/client/ClientLayerManager.h:365
    #8 0x0000007ff29feeec in mozilla::layers::ClientContainerLayer::RenderLayer (this=0x7fc978a680) at gfx/layers/Layers.h:1051
    #9 0x0000007ff29eeba4 in mozilla::layers::ClientLayer::RenderLayerWithReadback (this=<optimized out>, aReadback=<optimized out>) at gfx/layers/client/ClientLayerManager.h:365
    #10 0x0000007ff29feeec in mozilla::layers::ClientContainerLayer::RenderLayer (this=0x7fc9424e20) at gfx/layers/Layers.h:1051
    #11 0x0000007ff29eeba4 in mozilla::layers::ClientLayer::RenderLayerWithReadback (this=<optimized out>, aReadback=<optimized out>) at gfx/layers/client/ClientLayerManager.h:365
    #12 0x0000007ff29feeec in mozilla::layers::ClientContainerLayer::RenderLayer (this=0x7fc8dd8d50) at gfx/layers/Layers.h:1051
    #13 0x0000007ff2a05bd0 in mozilla::layers::ClientLayerManager::EndTransactionInternal (this=this@entry=0x7fc8b17440, aCallback=aCallback@entry=0x7ff46a23d0 <mozilla::FrameLayerBuilder::DrawPaintedLayer(mozilla::layers::PaintedLayer*, gfxContext*, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::layers::DrawRegionClip, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, void*)>, aCallbackData=aCallbackData@entry=0x7fdf293268) at gfx/layers/client/ClientLayerManager.cpp:341
    #14 0x0000007ff2a10ad0 in mozilla::layers::ClientLayerManager::EndTransaction (this=0x7fc8b17440, aCallback=0x7ff46a23d0 <mozilla::FrameLayerBuilder::DrawPaintedLayer(mozilla::layers::PaintedLayer*, gfxContext*, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::layers::DrawRegionClip, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, void*)>, aCallbackData=0x7fdf293268, aFlags=mozilla::layers::LayerManager::END_DEFAULT) at gfx/layers/client/ClientLayerManager.cpp:397
    #15 0x0000007ff469f7f0 in nsDisplayList::PaintRoot (this=this@entry=0x7fdf295078, aBuilder=aBuilder@entry=0x7fdf293268, aCtx=aCtx@entry=0x0, aFlags=aFlags@entry=13, aDisplayListBuildTime=...) at layout/painting/nsDisplayList.cpp:2622
    #16 0x0000007ff442bb4c in nsLayoutUtils::PaintFrame (aRenderingContext=aRenderingContext@entry=0x0, aFrame=aFrame@entry=0x7fc932f550, aDirtyRegion=..., aBackstop=aBackstop@entry=4294967295, aBuilderMode=aBuilderMode@entry=nsDisplayListBuilderMode::Painting, aFlags=aFlags@entry=(nsLayoutUtils::PaintFrameFlags::WidgetLayers | nsLayoutUtils::PaintFrameFlags::ExistingTransaction | nsLayoutUtils::PaintFrameFlags::NoComposite)) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h:80
    #17 0x0000007ff43b6240 in mozilla::PresShell::Paint (this=this@entry=0x7fc92a8a70, aViewToPaint=aViewToPaint@entry=0x7fc9287140, aDirtyRegion=..., aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers) at layout/base/PresShell.cpp:6400
    #18 0x0000007ff41ee110 in nsViewManager::ProcessPendingUpdatesPaint (this=this@entry=0x7fc92870d0, aWidget=aWidget@entry=0x7fc92871c0) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/RectAbsolute.h:43
    [...]
    #51 0x0000007fefba989c in ?? () from /lib64/libc.so.6
    (gdb)
Reading through just the first few items in this backtrace, it's clear that the crash is caused by the call to SharedSurface_Basic::ToSurfaceDescriptor(), which intentionally triggers a crash via the MOZ_CRASH() call at line 38 of SharedSurfaceGL.cpp. Given this, it's possible the problem is that the SharedSurface_Basic object being used should have been one of the several other alternative surface variants.
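To make the shape of the problem a little more concrete, here's a minimal sketch of the pattern the backtrace suggests. The types Surface, BasicSurface and ShareableSurface and the function CanPresent() are hypothetical stand-ins rather than the Gecko classes, and where the real SharedSurface_Basic calls MOZ_CRASH() this sketch simply reports failure so that it can be exercised safely.

```cpp
#include <string>

// Hypothetical stand-in for the Gecko SharedSurface base class.
struct Surface {
  virtual ~Surface() = default;
  // Fills in a descriptor the compositor could consume and returns true,
  // or returns false if this kind of surface can't be shared.
  virtual bool ToDescriptor(std::string* out) const = 0;
};

// Plays the role of SharedSurface_Basic: it has no shareable descriptor.
// The real method calls MOZ_CRASH() at this point instead of returning.
struct BasicSurface : Surface {
  bool ToDescriptor(std::string*) const override { return false; }
};

// Plays the role of one of the alternative surface variants.
struct ShareableSurface : Surface {
  bool ToDescriptor(std::string* out) const override {
    *out = "shareable-descriptor";
    return true;
  }
};

// Plays the role of GetFrontBuffer(): it asks whatever surface it has been
// given for a descriptor, without knowing the concrete type.
inline bool CanPresent(const Surface& surface) {
  std::string descriptor;
  return surface.ToDescriptor(&descriptor);
}
```

The point is that the caller can't tell which variant it has been handed, so if the factory produces the basic variant the failure only shows up later, at presentation time.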
So because I think it could be helpful in getting to the bottom of this, I'm also going to record how and where this surface is being created. Here's the backtrace for its creation:
    #0 mozilla::gl::SharedSurface_Basic::SharedSurface_Basic (this=0x7fc8d77e40, gl=0x7fc97f9830, size=..., hasAlpha=false, tex=1, ownsTex=true) at gfx/gl/SharedSurfaceGL.cpp:78
    #1 0x0000007ff28d3e6c in mozilla::gl::SharedSurface_Basic::Create (gl=0x7fc97f9830, formats=..., size=..., hasAlpha=false) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
    #2 0x0000007ff28a3720 in mozilla::gl::SurfaceFactory_Basic::CreateSharedImpl (this=<optimized out>, desc=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/WeakPtr.h:185
    #3 0x0000007ff28a3628 in mozilla::gl::SurfaceFactory::CreateShared (this=0x7fc93e17c0, size=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefCounted.h:240
    #4 0x0000007ff28a5f24 in mozilla::gl::SwapChain::Acquire (this=this@entry=0x7fc92ed298, size=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
    #5 0x0000007ff369bed4 in mozilla::WebGLContext::PresentInto (this=this@entry=0x7fc92ece00, swapChain=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
    #6 0x0000007ff369c304 in mozilla::WebGLContext::Present (this=this@entry=0x7fc92ece00, xrFb=<optimized out>, consumerType=consumerType@entry=mozilla::layers::TextureType::Unknown, webvr=webvr@entry=false) at dom/canvas/WebGLContext.cpp:936
    #7 0x0000007ff3664300 in mozilla::HostWebGLContext::Present (webvr=false, t=mozilla::layers::TextureType::Unknown, xrFb=<optimized out>, this=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:280
    #8 mozilla::ClientWebGLContext::Run<void (mozilla::HostWebGLContext::*)(unsigned long, mozilla::layers::TextureType, bool) const, &(mozilla::HostWebGLContext::Present(unsigned long, mozilla::layers::TextureType, bool) const), unsigned long, mozilla::layers::TextureType const&, bool const&> (this=<optimized out>, args#0=@0x7fdf2926c0: 0, args#1=@0x7fdf2926bf: mozilla::layers::TextureType::Unknown, args#2=@0x7fdf2926be: false) at dom/canvas/ClientWebGLContext.cpp:313
    #9 0x0000007ff3664468 in mozilla::ClientWebGLContext::Present (this=this@entry=0x7fc9558160, xrFb=xrFb@entry=0x0, type=<optimized out>, webvr=<optimized out>, webvr@entry=false) at dom/canvas/ClientWebGLContext.cpp:363
    #10 0x0000007ff368fc78 in mozilla::ClientWebGLContext::OnBeforePaintTransaction (this=0x7fc9558160) at dom/canvas/ClientWebGLContext.cpp:345
    #11 0x0000007ff28ff0dc in mozilla::layers::CanvasRenderer::FirePreTransactionCallback (this=this@entry=0x7fc9646e50) at gfx/layers/CanvasRenderer.cpp:75
    #12 0x0000007ff29afee8 in mozilla::layers::ShareableCanvasRenderer::UpdateCompositableClient (this=0x7fc9646e50) at gfx/layers/ShareableCanvasRenderer.cpp:192
    #13 0x0000007ff29efa84 in mozilla::layers::ClientCanvasLayer::RenderLayer (this=0x7fc98f02c0) at gfx/layers/client/ClientCanvasLayer.cpp:25
    #14 0x0000007ff29eeba4 in mozilla::layers::ClientLayer::RenderLayerWithReadback (this=<optimized out>, aReadback=<optimized out>) at gfx/layers/client/ClientLayerManager.h:365
    #15 0x0000007ff29feeec in mozilla::layers::ClientContainerLayer::RenderLayer (this=0x7fc991a8d0) at gfx/layers/Layers.h:1051
    #16 0x0000007ff29eeba4 in mozilla::layers::ClientLayer::RenderLayerWithReadback (this=<optimized out>, aReadback=<optimized out>) at gfx/layers/client/ClientLayerManager.h:365
    #17 0x0000007ff29feeec in mozilla::layers::ClientContainerLayer::RenderLayer (this=0x7fc935acd0) at gfx/layers/Layers.h:1051
    #18 0x0000007ff29eeba4 in mozilla::layers::ClientLayer::RenderLayerWithReadback (this=<optimized out>, aReadback=<optimized out>) at gfx/layers/client/ClientLayerManager.h:365
    #19 0x0000007ff29feeec in mozilla::layers::ClientContainerLayer::RenderLayer (this=0x7fc8d8f2f0) at gfx/layers/Layers.h:1051
    #20 0x0000007ff2a05bd0 in mozilla::layers::ClientLayerManager::EndTransactionInternal (this=this@entry=0x7fc8b17e40, aCallback=aCallback@entry=0x7ff46a23d0 <mozilla::FrameLayerBuilder::DrawPaintedLayer(mozilla::layers::PaintedLayer*, gfxContext*, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::layers::DrawRegionClip, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, void*)>, aCallbackData=aCallbackData@entry=0x7fdf293268) at gfx/layers/client/ClientLayerManager.cpp:341
    #21 0x0000007ff2a10ad0 in mozilla::layers::ClientLayerManager::EndTransaction (this=0x7fc8b17e40, aCallback=0x7ff46a23d0 <mozilla::FrameLayerBuilder::DrawPaintedLayer(mozilla::layers::PaintedLayer*, gfxContext*, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::layers::DrawRegionClip, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, void*)>, aCallbackData=0x7fdf293268, aFlags=mozilla::layers::LayerManager::END_DEFAULT) at gfx/layers/client/ClientLayerManager.cpp:397
    #22 0x0000007ff469f7f0 in nsDisplayList::PaintRoot (this=this@entry=0x7fdf295078, aBuilder=aBuilder@entry=0x7fdf293268, aCtx=aCtx@entry=0x0, aFlags=aFlags@entry=13, aDisplayListBuildTime=...) at layout/painting/nsDisplayList.cpp:2622
    #23 0x0000007ff442bb4c in nsLayoutUtils::PaintFrame (aRenderingContext=aRenderingContext@entry=0x0, aFrame=aFrame@entry=0x7fc93240a0, aDirtyRegion=..., aBackstop=aBackstop@entry=4294967295, aBuilderMode=aBuilderMode@entry=nsDisplayListBuilderMode::Painting, aFlags=aFlags@entry=(nsLayoutUtils::PaintFrameFlags::WidgetLayers | nsLayoutUtils::PaintFrameFlags::ExistingTransaction | nsLayoutUtils::PaintFrameFlags::NoComposite)) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/MaybeStorageBase.h:80
    #24 0x0000007ff43b6240 in mozilla::PresShell::Paint (this=this@entry=0x7fc9295080, aViewToPaint=aViewToPaint@entry=0x7fc9273470, aDirtyRegion=..., aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers) at layout/base/PresShell.cpp:6400
    #25 0x0000007ff41ee110 in nsViewManager::ProcessPendingUpdatesPaint (this=this@entry=0x7fc9273430, aWidget=aWidget@entry=0x7fc92734f0) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/RectAbsolute.h:43
    [...]
    #58 0x0000007fefba989c in ?? () from /lib64/libc.so.6
    (gdb)
Again, apologies for the lengthy backtraces. I'm not going to try to debug and fix this today, so I'm thinking it might be helpful to have these backtraces as something to refer back to in the future.
Sadly I don't have the energy to dig into this crash further today, so I'll have to leave things there. I hope it won't be too hard to fix, but at this point it's already looking a bit tricky, so we'll have to see.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
4 Jun 2024 : Day 253 #
Overnight I've been mulling options. They are to either strip out the code that chooses texture format based on device capabilities, or leave the existing large patch as it is. Going with the former will be more work and could result in incompatibilities on some devices, but has the potential to massively simplify the patch.
My decision is that I'm going to strip out the code. Already the way the decision is performed is messy. Stripping it all back and going with the bare essentials is exactly what's happened upstream; I think it makes sense to mirror these changes as much as possible. If it does cause problems for some devices, then we can look to reintroduce some of these changes. But at that point, it should be possible to do so in a much cleaner and more structured way.
The first step needed is to strip out the code from GLContext::ChooseGLFormats() so that it returns the values we saw yesterday, which were the following.
    bool any = false;
    bool color = true;
    bool alpha = false;
    bool bpp16 = false;
    bool depth = false;
    bool stencil = false;
    bool premultAlpha = true;
Stepping through the ChooseGLFormats() method we can see the consequences of these values, combined with the results returned from the GLES driver. When following this and subsequent comments it may be helpful to refer to the method as it looks upstream in ESR 78.
    1763 formats.color_texType = LOCAL_GL_UNSIGNED_BYTE;
    (gdb) p bpp16
    $6 = false
    (gdb) p caps.alpha
    $7 = false
    (gdb) n
    1765 if (caps.alpha) {
    (gdb) n
    1772 formats.color_texFormat = LOCAL_GL_RGB;
    (gdb) n
    1773 formats.color_rbFormat = LOCAL_GL_RGB8;
    (gdb) n
    1779 if (IsSupported(GLFeature::packed_depth_stencil)) {
    (gdb) n
    1780 formats.depthStencil = LOCAL_GL_DEPTH24_STENCIL8;
    (gdb) n
    260 return mProfile == ContextProfile::OpenGLES;
    (gdb) n
    1785 if (IsExtensionSupported(OES_depth24)) {
    (gdb) n
    1794 formats.stencil = LOCAL_GL_STENCIL_INDEX8;
    (gdb) n
    1796 return formats;
    (gdb) p formats
    $8 = {color_texInternalFormat = 6407, color_texFormat = 6407, color_texType = 5121, color_rbFormat = 32849, depthStencil = 35056, depth = 33190, stencil = 36168}
    (gdb)
The key values needed to determine flow through the method are these:
    bpp16 == false
    caps.alpha == false
    IsSupported(GLFeature::packed_depth_stencil) == true
    IsGLES() == true
    IsExtensionSupported(OES_depth24) == true
At the end of the method, we're left with the following formats:
    color_texInternalFormat = 6407 = 0x1907 = LOCAL_GL_RGB
    color_texFormat = 6407 = 0x1907 = LOCAL_GL_RGB
    color_texType = 5121 = 0x1401 = LOCAL_GL_UNSIGNED_BYTE
    color_rbFormat = 32849 = 0x8051 = LOCAL_GL_RGB8
    depthStencil = 35056 = 0x88f0 = LOCAL_GL_DEPTH24_STENCIL8
    depth = 33190 = 0x81a6 = LOCAL_GL_DEPTH_COMPONENT24
    stencil = 36168 = 0x8d48 = LOCAL_GL_STENCIL_INDEX8
To perform the conversion from hex value to GLenum I cross-referenced against the gfx/gl/GLConsts.h file. Based on this analysis, I'm testing with the following newly reworked ChooseGLFormats() method:
    GLFormats GLContext::ChooseGLFormats(const SurfaceCaps& caps) const {
      GLFormats formats;

      formats.color_texType = LOCAL_GL_UNSIGNED_BYTE;
      formats.color_texInternalFormat = LOCAL_GL_RGB;
      formats.color_texFormat = LOCAL_GL_RGB;
      formats.color_rbFormat = LOCAL_GL_RGB8;
      formats.depthStencil = LOCAL_GL_DEPTH24_STENCIL8;
      formats.depth = LOCAL_GL_DEPTH_COMPONENT24;
      formats.stencil = LOCAL_GL_STENCIL_INDEX8;

      return formats;
    }
I should make clear that if this works, my plan is to remove this method entirely. I'm doing this check just to ensure I have everything correct before propagating these changes at a more fundamental level.
    $ make -j1 -C obj-build-mer-qt-xr/gfx/
    $ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit
A quick test shows things are still working as expected, so the task now is to propagate this fixed configuration throughout the changes I've made. With any luck this will simplify things considerably, potentially allowing the use of SurfaceCaps to be eliminated entirely. The SurfaceCaps structure was removed between ESR 78 and ESR 91, so it'd be great if there's no need to restore it. The same goes for the AttachmentType enumeration.
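As an aside, the conversion from the decimal values printed by the debugger to hex values and GLenum names, as described above, can be sketched in a few lines. The helper names ToHex() and GLEnumName() are hypothetical; the values and names are the seven listed in the text, which I did by hand against GLConsts.h rather than with code like this.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Format a value the way it appears in the listing above, e.g. 6407 -> "0x1907".
inline std::string ToHex(uint32_t value) {
  std::ostringstream out;
  out << "0x" << std::hex << value;
  return out.str();
}

// Map the values seen in the debugger back to their names. Only the values
// that appear in the text are covered; anything else is reported as unknown.
inline std::string GLEnumName(uint32_t value) {
  switch (value) {
    case 0x1907: return "LOCAL_GL_RGB";
    case 0x1401: return "LOCAL_GL_UNSIGNED_BYTE";
    case 0x8051: return "LOCAL_GL_RGB8";
    case 0x88f0: return "LOCAL_GL_DEPTH24_STENCIL8";
    case 0x81a6: return "LOCAL_GL_DEPTH_COMPONENT24";
    case 0x8d48: return "LOCAL_GL_STENCIL_INDEX8";
    default: return "unknown";
  }
}
```

For example, ToHex(35056) gives "0x88f0" and GLEnumName(35056) gives "LOCAL_GL_DEPTH24_STENCIL8", matching the depthStencil row.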
[...]
I've now spent a good few hours removing all references to SurfaceCaps from the code, to the extent that the structure isn't even defined in SurfaceTypes.h any more. That means big changes to the code: 79 new lines added, but more importantly 379 lines removed. That'll make a big difference to the final patch. I've also checked that the code compiles and, in theory, it's not doing anything different to the previous version that had the values fixed.
I'm going to test this out now, which means a partial build, but that will still take half an hour or so.
[...]
I've tested it; it all works well. So now I'm going to move on to the AttachmentType enumeration. Looking through the code, this attachment type is only ever set in three places: in the SharedSurface_EGLImage constructor and in the two SharedSurface_Basic constructors. In all cases it gets set to AttachmentType::GLTexture. Here are the three places:
    SharedSurface_EGLImage::SharedSurface_EGLImage(GLContext* gl,
                                                   const gfx::IntSize& size,
                                                   const GLFormats& formats,
                                                   GLuint prodTex, EGLImage image)
        : SharedSurface(
              SharedSurfaceType::EGLImageShare, AttachmentType::GLTexture, gl, size,
              false)  // Can't recycle, as mSync changes never update TextureHost.
          ,
    [...]
    SharedSurface_Basic::SharedSurface_Basic(const SharedSurfaceDesc& desc,
                                             UniquePtr<MozFramebuffer>&& fb)
        : SharedSurface(desc, std::move(fb), AttachmentType::GLTexture),
    [...]
    SharedSurface_Basic::SharedSurface_Basic(GLContext* gl, const IntSize& size,
                                             GLuint tex, bool ownsTex)
        : SharedSurface(SharedSurfaceType::Basic, AttachmentType::GLTexture, gl, size,
                        true),
Hopefully that gives the right idea. Given this is the case, it should be safe to remove the enumeration entirely and assume that the value is AttachmentType::GLTexture.
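The reasoning here is that a member which only ever holds one value carries no information, so any check against it can be folded away. Here's a minimal sketch of that simplification; SurfaceBefore, SurfaceAfter and UsesTexture() are hypothetical names for illustration, and the enumerators are a simplified stand-in for the real AttachmentType.

```cpp
// Simplified stand-in for the enumeration being removed.
enum class AttachmentType { GLTexture, GLRenderbuffer };

// Before: every surface stores its attachment type, even though every
// constructor passes in AttachmentType::GLTexture.
struct SurfaceBefore {
  AttachmentType mAttachType = AttachmentType::GLTexture;
  bool UsesTexture() const { return mAttachType == AttachmentType::GLTexture; }
};

// After: the member is gone and the answer is fixed. Any branch that handled
// the renderbuffer case can be deleted along with it.
struct SurfaceAfter {
  bool UsesTexture() const { return true; }
};
```

The behaviour is unchanged as long as nothing ever sets a different value, which is what the search through the constructors above is intended to establish.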
That simplifies the code a bunch more: we're now up to 98 inserted lines and 444 removed lines. Finally the GLFormats structure that's defined in GLContextTypes.h no longer exists in the upstream ESR 91 code. If you look back at the GLContext::ChooseGLFormats() listed out above, you'll notice that the values get set to default values; and in fact they never get changed. So this structure also looks to be a good candidate for removal.
Thankfully removing it is also pretty clean and straightforward. That leaves things in a much better state. These changes add a total of 95 lines but remove a total of 472 lines. I've checked that both the browser and WebView are still working as expected after these changes, so that feels like a good result for today.
That still leaves behind a potentially huge patch though. After these changes the patch removes 145 lines and adds 1702 lines of code, compared with 156 removals and 2090 additions before making these improvements. We want both of these numbers to be as small as possible.
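I think we can do better; there are still further candidates for simplification. For example, one big change I had to make was to pass around the EGLDisplay value, which adds a new parameter to several methods. If I can remove this requirement, that'll take us closer again to the upstream ESR 91 code. But I've reached my limit for today, so this will have to wait until tomorrow.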
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
3 Jun 2024 : Day 252 #
Before I get started, I want to mention that I've been heartened by all of the encouraging comments on the Sailfish Forum recently. I've not had a chance to reply there (I will do), but let me just say that while I appreciate the incredibly generous offers of donations, there's absolutely no need. It might seem a bit strange, but I'm doing this for my own enjoyment and because, just like everyone else, I'd love to see the browser get a boost to the next version. But I also know there are many fabulous Sailfish developers who are far more deserving than I am, so if you want to splash the cash, you'll find willing recipients on the excellent The big Thank You & Coffee thread and I encourage you to donate to them!
With that said, let's get into the day's development. Yesterday I was able to remove sixteen unused methods from my offscreen rendering patch. That's good progress, but it still leaves us with a very large patch. Large enough that it's worth continuing with the process in the hope of trimming it down further.
There are a few places where small changes upstream have caused the code to diverge, but in a way that could potentially be ironed out. If I can do this it'll remove code from the patch that can safely follow the upstream changes instead.
Chief among these is a change to the way GLContext is passed between methods. It's quite a large structure and so in ESR 78 it's typically passed as a pointer. In ESR 91 this has been changed in places so that it's now passed by reference.
In C++ there's not much practical difference between these apart from syntax: dereferencing isn't needed in the latter case. But the syntax changes propagate throughout the methods it's passed to. This can lead to what appear to be significant changes where they're actually pretty minor.
Let's take an example. Here's a method taken from ESR 78:
    GLuint CreateTexture(GLContext* aGL, GLenum aInternalFormat, GLenum aFormat,
                         GLenum aType, const gfx::IntSize& aSize, bool linear) {
      GLuint tex = 0;
      aGL->fGenTextures(1, &tex);
      ScopedBindTexture autoTex(aGL, tex);

      aGL->fTexParameteri(LOCAL_GL_TEXTURE_2D, LOCAL_GL_TEXTURE_MIN_FILTER,
                          linear ? LOCAL_GL_LINEAR : LOCAL_GL_NEAREST);
      aGL->fTexParameteri(LOCAL_GL_TEXTURE_2D, LOCAL_GL_TEXTURE_MAG_FILTER,
                          linear ? LOCAL_GL_LINEAR : LOCAL_GL_NEAREST);
      aGL->fTexParameteri(LOCAL_GL_TEXTURE_2D, LOCAL_GL_TEXTURE_WRAP_S,
                          LOCAL_GL_CLAMP_TO_EDGE);
      aGL->fTexParameteri(LOCAL_GL_TEXTURE_2D, LOCAL_GL_TEXTURE_WRAP_T,
                          LOCAL_GL_CLAMP_TO_EDGE);

      aGL->fTexImage2D(LOCAL_GL_TEXTURE_2D, 0, aInternalFormat, aSize.width,
                       aSize.height, 0, aFormat, aType, nullptr);

      return tex;
    }
The equivalent code in ESR 91 looks like this:
    UniquePtr<Texture> CreateTexture(GLContext& gl, const gfx::IntSize& size) {
      const GLenum target = LOCAL_GL_TEXTURE_2D;
      const GLenum format = LOCAL_GL_RGBA;

      auto tex = MakeUnique<Texture>(gl);
      ScopedBindTexture autoTex(&gl, tex->name, target);

      gl.fTexParameteri(target, LOCAL_GL_TEXTURE_MIN_FILTER, LOCAL_GL_LINEAR);
      gl.fTexParameteri(target, LOCAL_GL_TEXTURE_MAG_FILTER, LOCAL_GL_LINEAR);
      gl.fTexParameteri(target, LOCAL_GL_TEXTURE_WRAP_S, LOCAL_GL_CLAMP_TO_EDGE);
      gl.fTexParameteri(target, LOCAL_GL_TEXTURE_WRAP_T, LOCAL_GL_CLAMP_TO_EDGE);

      gl.fTexImage2D(target, 0, format, size.width, size.height, 0, format,
                     LOCAL_GL_UNSIGNED_BYTE, nullptr);

      return tex;
    }
As you can see, several lines differ only in their access operator: an arrow ("->") in the ESR 78 version versus a dot (".") in the ESR 91 version. That's dereferenced access vs. direct access.
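To isolate just the pointer vs. reference difference from everything else going on in those two methods, here's a minimal sketch. FakeContext, ConfigurePtr() and ConfigureRef() are hypothetical names; FakeContext simply records the calls made on it so that the two styles can be compared without a real GL context.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for GLContext that records each call it receives.
struct FakeContext {
  std::vector<std::string> calls;
  void TexParameter(const std::string& name) { calls.push_back(name); }
};

// ESR 78 style: the context is passed as a pointer, so members are
// reached with the arrow operator.
inline void ConfigurePtr(FakeContext* aGL) {
  aGL->TexParameter("MIN_FILTER");
  aGL->TexParameter("MAG_FILTER");
}

// ESR 91 style: the context is passed by reference, so members are
// reached with the dot operator. The calls made are identical.
inline void ConfigureRef(FakeContext& gl) {
  gl.TexParameter("MIN_FILTER");
  gl.TexParameter("MAG_FILTER");
}
```

Calling ConfigurePtr(&context) and ConfigureRef(context) leaves the same sequence of calls recorded, which is why a diff between the two styles looks much bigger than the behavioural change really is.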
The changes needed to get offscreen rendering to work mean I've switched out the ESR 91 version for a copy of the ESR 78 version. While there are clearly differences between the two, they're smaller than they look at first glance. If I can refactor the code so it's more like the ESR 91 version, that should save effort in the future when the patch is applied to the next upstream changes beyond ESR 91.
This isn't just an idle preference. Much of the work I've put into the upgrade to ESR 91 has been caused by having to refactor patches where the underlying code has changed upstream. I predict that for every line of code I can remove from a patch, I'll save future developers two or three times the effort, since they won't have to worry about that change in the future. The smaller and more efficient the patches, the less work is needed when reapplying them later.
Apart from the pointer vs. reference difference, another important difference between these two methods is that the ESR 78 version has the format, type and linear flag passed in, whereas in ESR 91 they're all defined statically.
Let's tackle the last of these first: the linear flag. In the ESR 78 version of the method this defaults to a value of true if left unspecified. It looks like the only place this is called is from CreateTextureForOffscreen() where the parameter is left as its default value. So I'm going to remove the parameter and assume it's always true. The compiler will tell us if there are cases where it's set to false which I missed.
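Here's a small sketch of why leaning on the compiler is safe here. The function names and the cut-down signatures are my own invention (the real CreateTexture() takes more arguments), and it needs C++17 for std::is_invocable_v. Once the defaulted parameter is removed, any call site that still passes the flag explicitly, whether true or false, simply stops compiling.

```cpp
#include <type_traits>

constexpr unsigned kNearest = 0x2600;  // numeric value of GL_NEAREST
constexpr unsigned kLinear = 0x2601;   // numeric value of GL_LINEAR

// Before: a defaulted trailing parameter, as in the ESR 78 version.
constexpr unsigned FilterOld(bool linear = true) {
  return linear ? kLinear : kNearest;
}

// After: the parameter is gone and linear filtering is assumed.
constexpr unsigned FilterNew() { return kLinear; }

// Passing the flag is fine with the old signature...
static_assert(std::is_invocable_v<decltype(&FilterOld), bool>,
              "old signature accepts the flag");
// ...but is rejected by the new one, so a missed call site can't slip by.
static_assert(!std::is_invocable_v<decltype(&FilterNew), bool>,
              "new signature rejects the flag");
// Call sites that relied on the default get the same result as before.
static_assert(FilterOld() == FilterNew(), "default behaviour is unchanged");
```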
To understand whether the other parameters could change we have to take a look at the GLContext::ChooseGLFormats() method, which is where the format value is chosen. And the return value depends on the SurfaceCaps structure that's passed in.
Some potential changes of the SurfaceCaps are performed in GLContextProviderEGL::CreateOffscreen(), but looking carefully at this, that can only happen if canOffscreenUseHeadless is set to false, which itself can only happen if MOZ_WIDGET_ANDROID is defined. It's never defined for us (we're not Android!) So I can safely remove not just the assumption, but the related code as well. That's much cleaner.
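The reasoning can be sketched like this. It's only a model of the shape of the logic, not the real GLContextProviderEGL code: the Caps struct is a hypothetical cut-down version of SurfaceCaps, and the adjustment made on the non-headless path is invented purely so there's something to observe.

```cpp
// Hypothetical, cut-down version of the capability flags.
struct Caps {
  bool depth = false;
  bool stencil = false;
};

// Models canOffscreenUseHeadless: it can only become false when
// MOZ_WIDGET_ANDROID is defined at build time.
constexpr bool CanOffscreenUseHeadless() {
#ifdef MOZ_WIDGET_ANDROID
  return false;  // simplification: stands in for the Android-only check
#else
  return true;
#endif
}

// The caps are only touched on the non-headless path. The change made
// here is made up for the sketch.
constexpr Caps AdjustCaps(Caps caps) {
  if (!CanOffscreenUseHeadless()) {
    caps.depth = true;
  }
  return caps;
}
```

With MOZ_WIDGET_ANDROID undefined, AdjustCaps() returns its input unchanged, so both the branch and the flag can go.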
Now we go through to CompositorOGL::CreateContext() where the SurfaceCaps objects are originally created. Here's the code that creates them:
SurfaceCaps caps = SurfaceCaps::ForRGB();
caps.bpp16 = gfxVars::OffscreenFormat() == SurfaceFormat::R5G6B5_UINT16;
By default gfxVars::OffscreenFormat() will return SurfaceFormat::X8R8G8B8_UINT32. I don't see any reason why that would be changed, which would mean that caps.bpp16 is by default set to false. What does the SurfaceCaps::ForRGB() method return? Apparently these values, according to the code:
bool any = false;
bool color = true;
bool alpha = false;
bool bpp16 = false;
bool depth = false;
bool stencil = false;
bool premultAlpha = true;
This tallies with the values I'm seeing in practice using the debugger as well:
(gdb) p minOffscreenCaps
$1 = {any = false, color = true, alpha = false, bpp16 = false, depth = false,
  stencil = false, premultAlpha = true, surfaceAllocator = {mRawPtr = 0x0}}
Armed with this knowledge I can now simplify the GLContext::ChooseGLFormats() method appropriately. Although the internal format might change, the color_texFormat value is now always set to LOCAL_GL_RGB. That gets passed in to CreateTexture() as the third parameter (aFormat). So we can remove that parameter and simply set it to LOCAL_GL_RGB in all cases. In ESR 91 this is almost what's happening as well, except there it's set to LOCAL_GL_RGBA (an extra alpha channel). I think this is a difference we're going to have to maintain. We're also going to have to keep the aInternalFormat parameter as this might change depending on the GLES capabilities of the device.
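To illustrate the conclusion, here's a rough sketch of a format chooser once alpha and bpp16 are both known to be false. It's loosely modelled on the idea behind ChooseGLFormats() rather than copied from it: the Caps and Formats structs, the isGLES parameter and the rule used to pick the internal format are all assumptions made for the example. The numeric values are the standard OpenGL ones.

```cpp
#include <cstdint>

constexpr uint32_t kRGB = 0x1907;    // GL_RGB
constexpr uint32_t kRGBA = 0x1908;   // GL_RGBA
constexpr uint32_t kRGB8 = 0x8051;   // GL_RGB8
constexpr uint32_t kRGBA8 = 0x8058;  // GL_RGBA8

// Hypothetical subsets of SurfaceCaps and GLFormats.
struct Caps {
  bool alpha = false;
};
struct Formats {
  uint32_t texInternalFormat;
  uint32_t texFormat;
};

// The external format depends only on the alpha flag, while the internal
// format may also vary with the kind of GL in use (assumed rule).
constexpr Formats ChooseFormats(Caps caps, bool isGLES) {
  return caps.alpha ? Formats{isGLES ? kRGBA : kRGBA8, kRGBA}
                    : Formats{isGLES ? kRGB : kRGB8, kRGB};
}
```

With alpha fixed at false the texFormat field comes out as GL_RGB whatever isGLES is, which is why that parameter can be dropped while the internal format has to stay.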
Given all of this, I'm not convinced we're going to get much more out of trying to simplify this CreateTexture() method.
It's getting quite late here now and I think I've reached the end of my viable energy for the day. I'll have to return to the topic of simplification tomorrow. Overnight I'll be pondering whether to lock the texture down to 24-bit RGB or allow other textures (primarily 16-bit textures) to be supported as well. Checking the ESR 91 code, it seems to be always assuming a LOCAL_GL_RGBA texture. Maybe we'd be safe just to do that?
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
2 Jun 2024 : Day 251 #
Unsurprisingly the build I started last night had completed by this morning. Unsurprising because it already passed through multiple partial builds yesterday. I've uploaded and installed the packages, now it's time to find out which methods are being called and which are not.
The approach I've devised for this is to start my test harbour-webview app running using the debugger. I'll then attach breakpoints to all of the methods I listed back on Day 247, then execute the application and record which breakpoints get hit.
As each is hit I'll disable the breakpoint so it doesn't trigger multiple times. Eventually enough breakpoints will be disabled that the app will be running without interference from the debugger.
At that point I should have captured a pretty good list of which methods are necessary and which are unused.
I've ended up running this in three batches to keep things manageable. The result is that the breakpoints for all of the 39 methods listed below were triggered, meaning that these methods are definitely being used by the WebView renderer.
Batch 1:
- GLContextProviderEGL::CreateOffscreen()
- DefaultEglLibrary()
- GLContext::fBindFramebuffer()
- GLContext::raw_fBindFramebuffer()
- GLContext::InitOffscreen()
- GLContext::CreateScreenBuffer()
- GLScreenBuffer::Create()
- GLScreenBuffer::GLScreenBuffer()
- CreateTextureForOffscreen()
- CreateTexture()
- GLContext::OffscreenSize()
Batch 2:
- SurfaceFactory::SurfaceFactory()
- ChooseBufferBits()
- GLScreenBuffer::Resize()
- SurfaceFactory::NewTexClient()
- SharedSurface::SharedSurface()
- SharedSurface::GetTextureFlags()
- SurfaceFactory::StartRecycling()
- GLScreenBuffer::Attach()
- GLScreenBuffer::CreateRead()
- ReadBuffer::Create()
- CreateRenderbuffersForOffscreen()
- GLScreenBuffer::BindFB()
- GLScreenBuffer::Morph()
- SurfaceFactory::~SurfaceFactory()
- SurfaceFactory::StopRecycling()
- ReadBuffer::Size()
- GLScreenBuffer::Swap()
- ReadBuffer::Attach()
- SurfaceFactory::RecycleCallback()
Batch 3:
- SurfaceFactory_Basic::SurfaceFactory_Basic()
- SharedSurface_Basic::SharedSurface_Basic()
- SharedSurfaceTextureClient::Create()
- SharedSurfaceTextureClient::SharedSurfaceTextureClient()
- SharedSurface_EGLImage::SharedSurface_EGLImage()
- SharedSurfaceTextureClient::~SharedSurfaceTextureClient()
- SharedSurface_Basic::~SharedSurface_Basic()
- CreateTextureImageEGL()
- TileGenFuncEGL()
The above methods were all hit before the Web site had been fully rendered. The following were also hit, but a little later in the process; after rendering had apparently completed. I'm not sure whether this is really relevant, but I found it interesting:
- TextureImageEGL::TextureImageEGL()
- TextureImageEGL::BindTexture()
- TextureImageEGL::Resize()
- GLFormatForImage()
- GLTypeForImage()
- TextureImageEGL::DirectUpdate()
- TextureImageEGL::~TextureImageEGL()
- TextureImageEGL::ReleaseTexImage()
- TextureImageEGL::DestroyEGLSurface()
That leaves the following methods that were added as a result of the changes I've made in order to get the WebView renderer working, but don't actually appear to be used by it. These are all candidates to be removed. I've actually marked the ones which I eventually did remove, but will come to those as we progress.
- GLContext::GuaranteeResolve() - Removed
- GLContext::CreateScreenBufferImpl() - Removed
- GLScreenBuffer::CreateFactory() - Removed
- GLScreenBuffer::~GLScreenBuffer()
- GLScreenBuffer::BindDrawFB()
- GLScreenBuffer::BindReadFB()
- GLScreenBuffer::BindReadFB_Internal() - Removed
- GLScreenBuffer::GetDrawFB() - Removed
- GLScreenBuffer::GetReadFB() - Removed
- GLScreenBuffer::GetFB() - Removed
- GLScreenBuffer::CopyTexImage2D() - Removed
- GLScreenBuffer::ReadPixels() - Removed
- ReadBuffer::~ReadBuffer()
- SharedSurface::ProdCopy() - Removed
- SurfaceFactory::Recycle()
- SharedSurface_EGLImage::ReadPixels() - Removed
- SharedSurface_Basic::Wrap() - Removed
- SharedSurface_GLTexture::Create() - Removed
- SharedSurface_GLTexture::~SharedSurface_GLTexture() - Removed
- SharedSurface_GLTexture::ProducerReleaseImpl() - Removed
- SharedSurface_GLTexture::ToSurfaceDescriptor() - Removed
- TextureImageEGL::BindTexImage()
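The bookkeeping behind the list above is really just a set difference: everything I put a breakpoint on, minus everything that was actually hit. I did this by hand, but as a sketch it could look like the following. UnusedMethods() is a hypothetical helper, not something from the gecko tree.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Returns the candidate methods whose breakpoints were never hit.
// std::set keeps its elements sorted, which std::set_difference requires.
std::vector<std::string> UnusedMethods(const std::set<std::string>& candidates,
                                       const std::set<std::string>& hit) {
  std::vector<std::string> unused;
  std::set_difference(candidates.begin(), candidates.end(), hit.begin(),
                      hit.end(), std::back_inserter(unused));
  assert(unused.size() <= candidates.size());  // can only shrink the list
  return unused;
}
```

Feeding it the full list of methods with breakpoints and the three batches of hits would produce the candidates for removal, in alphabetical order.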
I'm now working through the code, checking where each of the above methods is actually called, if it is at all. This is quite an intricate process because even if a method isn't called within gecko, that doesn't mean it's not exported and called in code that links to libxul.so, such as qtmozembed or the sailfish-components-webview code.
After carefully working through the list above, it looks like I should be able to safely remove the following nine methods:
- GLContext::GuaranteeResolve() - Removed
- GLContext::CreateScreenBufferImpl() - Removed
- GLScreenBuffer::CreateFactory() - Removed
- GLScreenBuffer::BindReadFB_Internal() - Removed
- GLScreenBuffer::GetDrawFB() - Removed
- GLScreenBuffer::GetReadFB() - Removed
- GLScreenBuffer::GetFB() - Removed
- GLScreenBuffer::CopyTexImage2D() - Removed
- GLScreenBuffer::ReadPixels() - Removed
In addition to these, it looks like I'm also safe to remove the SurfaceCaps::preserve flag, since this is always set to false. Removing this flag also allows me to remove additional methods which are only ever called in the situations when this flag is set to true. So I've both removed the flag and simplified the code that's conditional on it.
Having made these changes I've built the library, installed it and run the test app again. The code built fine and everything appears to be working correctly, so these changes look to be safe. With these having gone through successfully, there are now some more methods which have become orphans as a result, so I can safely remove these as well:
- SharedSurface::ProdCopy() - Removed
- SharedSurface_EGLImage::ReadPixels() - Removed
- SharedSurface_Basic::Wrap() - Removed
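Here's a toy model of how removing an always-false flag leaves orphans behind. None of this is the real code: SwapPath, CapsBefore and the function names are made up, and the only thing it borrows from gecko is the idea of a helper that's reachable solely through the preserve branch.

```cpp
enum class SwapPath { Direct, CopyFirst };

// Before: a flag that is never set to true in practice.
struct CapsBefore {
  bool preserve = false;
};

// Helper that is only reachable when preserve is true.
constexpr SwapPath CopyThenSwap() { return SwapPath::CopyFirst; }

constexpr SwapPath SwapBefore(CapsBefore caps) {
  if (caps.preserve) {
    return CopyThenSwap();
  }
  return SwapPath::Direct;
}

// After: with the flag removed the branch disappears, and CopyThenSwap()
// has no callers left, so it can be deleted as well.
constexpr SwapPath SwapAfter() { return SwapPath::Direct; }

static_assert(SwapBefore(CapsBefore{}) == SwapAfter(),
              "behaviour is unchanged when preserve is false");
```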
None of the methods from SharedSurface_GLTexture are being called either, so I'm wondering whether it would make sense to remove the entire class. But on doing a quick search I can see that there is a single place where a SharedSurface_GLTexture instance is created. This is in the EmbedLiteCompositorBridgeParent::PrepareOffscreen() method where the relevant code looks like this:
if (context->GetContextType() == GLContextType::EGL) {
  // [Basic/OGL Layers, OMTC] WebGL layer init.
  factory = SurfaceFactory_EGLImage::Create(context, screen->mCaps, nullptr,
                                            flags);
} else {
  // [Basic Layers, OMTC] WebGL layer init.
  // Well, this *should* work...
  factory = MakeUnique<SurfaceFactory_GLTexture>(context, screen->mCaps,
                                                 nullptr, flags);
}
On the device I'm using for testing the context type is set to GLContextType::EGL and so it's always SurfaceFactory_EGLImage that's used to create the surface. But can I be sure there won't be situations in which the other branch will be executed? Perhaps on other devices with different hardware and drivers available?
The value returned by context->GetContextType() is determined by the class that context is an instance of. There are multiple classes that inherit from GLContext and so could potentially be in use here, each overriding the GetContextType() method to return a different value. Here's an example from GLContextEGL:
virtual GLContextType GetContextType() const override {
  return GLContextType::EGL;
}
To find out whether anything other than GLContextType::EGL will ever be returned I need to find out how the context object is created. Unfortunately this is a bit of a maze. For example, this is how the context is pulled in for use inside the PrepareOffscreen() method:
GLContext* context = static_cast<CompositorOGL*>(
    state->mLayerManager->GetCompositor())->gl();
Not pretty. The place where the context appears to be created is in CompositorOGL::CreateContext(). The logic there is also a bit serpentine, but at least for the WebView, it eventually leads to a call of the following:
context = GLContextProvider::CreateOffscreen(
    mSurfaceSize, caps, CreateContextFlags::REQUIRE_COMPAT_PROFILE,
    &discardFailureId);
That GLContextProvider::CreateOffscreen() actually goes through to GLContextProviderEGL::CreateOffscreen(). Why is that? That's because GLContextProvider provides only a macro which is actually implemented by GLContextProviderEGL. In fact, there are no other instances of CreateOffscreen() implemented by any other GLContextProvider types.
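The dispatch described above can be sketched as follows. The classes here are simplified stand-ins that I've made up for illustration (the real GLContext hierarchy is far larger, and the enum shows only a hypothetical subset of values), and the factory is represented by its name rather than a real object. The point is just that if the only provider ever returns the EGL subclass, the else branch can never be taken.

```cpp
#include <cassert>
#include <memory>
#include <string>

enum class GLContextType { Unknown, EGL, GLX };  // hypothetical subset

// Simplified stand-in for GLContext.
struct Context {
  virtual ~Context() = default;
  virtual GLContextType GetContextType() const {
    return GLContextType::Unknown;
  }
};

// Simplified stand-in for GLContextEGL.
struct ContextEGL : Context {
  GLContextType GetContextType() const override { return GLContextType::EGL; }
};

// Stand-in for the only provider available: it always builds the EGL class.
std::unique_ptr<Context> CreateOffscreen() {
  return std::make_unique<ContextEGL>();
}

// Mirrors the branch in PrepareOffscreen(), returning the factory's name.
std::string ChooseFactory(const Context& context) {
  return context.GetContextType() == GLContextType::EGL
             ? "SurfaceFactory_EGLImage"
             : "SurfaceFactory_GLTexture";
}

std::string FactoryForOffscreen() {
  std::unique_ptr<Context> context = CreateOffscreen();
  assert(context != nullptr);
  return ChooseFactory(*context);
}
```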
For completeness, CreateOffscreen() goes on to call the following, where as we progress through the list we go deeper down the call stack:
- GLContextProviderEGL::CreateHeadless()
- GLContextEGL::CreateEGLPBufferOffscreenContext()
- GLContextEGL::CreateEGLPBufferOffscreenContextImpl()
- GLContextEGL::CreateGLContext()
- new GLContextEGL()
As you can see, the deepest of these calls will create a GLContextEGL, whose GetContextType() returns GLContextType::EGL. So it really does look like the context will always be of type GLContextEGL when this runs on a Sailfish device. I've therefore decided to remove the SharedSurface_GLTexture class completely, along with all of its associated code (e.g. SurfaceFactory_GLTexture) since this will never get used. That means removing all of the following:
- SharedSurface_GLTexture::Create() - Removed
- SharedSurface_GLTexture::~SharedSurface_GLTexture() - Removed
- SharedSurface_GLTexture::ProducerReleaseImpl() - Removed
- SharedSurface_GLTexture::ToSurfaceDescriptor() - Removed
Now to build and test the result:
$ make -j1 -C obj-build-mer-qt-xr/gfx/
$ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit
Having removed it the code compiles fine, but hits a problem during linkage:
aarch64-meego-linux-gnu-ld: libxul.so: hidden symbol `_ZN7mozilla2gl23SharedSurface_GLTexture6CreateEPNS0_9GLContextERKNS0_9GLFormatsERKNS_3gfx12IntSizeTypedINS7_12UnknownUnitsEEEb' isn't defined
aarch64-meego-linux-gnu-ld: final link failed: bad value
The reason for the failure is that I've also made changes to EmbedLiteCompositorBridgeParent.cpp, but the commands I ran to recompile the code didn't incorporate this file. That's how these partial builds work: you have to be careful to include all relevant directories in the make commands. So I'll need to ask the compiler to do a bit more work.
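As an aside, the mangled symbol in that linker error can be decoded to confirm which method is missing. Here's a minimal sketch using abi::__cxa_demangle(), which is available with GCC and Clang through the cxxabi.h header. The Demangle() wrapper and the kMissingSymbol constant are my own names, and the symbol is written as one unbroken token.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol name, or an empty
// string if the name can't be demangled.
std::string Demangle(const char* mangled) {
  assert(mangled != nullptr);
  int status = 0;
  char* demangled = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || demangled == nullptr) {
    return std::string();
  }
  std::string result(demangled);
  std::free(demangled);  // the buffer is allocated with malloc()
  return result;
}

// The symbol reported as undefined by the linker.
const char* const kMissingSymbol =
    "_ZN7mozilla2gl23SharedSurface_GLTexture6CreateEPNS0_9GLContextERKNS0_"
    "9GLFormatsERKNS_3gfx12IntSizeTypedINS7_12UnknownUnitsEEEb";
```

Calling Demangle(kMissingSymbol) should give a name in the mozilla::gl namespace for SharedSurface_GLTexture::Create(), which is exactly the method I've just removed. The c++filt command line tool does the same job.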
$ make -j1 -C obj-build-mer-qt-xr/mobile/sailfishos/
$ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit
$ strip obj-build-mer-qt-xr/toolkit/library/build/libxul.so
This time it builds successfully and I'm able to copy the resulting library over to my device. With this new version of the library installed both the browser and WebView app run successfully without any problems.
That's at least three edit-rebuild-test cycles we've been round today. More than enough for one day I'd say. Tomorrow I'll continue stripping out unused code from the offscreen rendering patch. It's still a very large patch, so if there's any more redundant code it'd be really good to get rid of it.
After having spent so long floundering around trying to fix offscreen rendering over the last few months, it's really nice to be making steady progress again. Solid; mundane; and steady; but progress nonetheless.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
1 Jun 2024 : Day 250 #
The build completed overnight and since then I've also cleaned up and rebuilt all of the other components as well: qtmozembed, embedlite-components, sailfish-browser and sailfish-components-webview.
After installing them all on my development phone both the browser and the WebView are now working correctly and the additional debug prints I added for testing are no longer showing in the console output. So things are looking in decent shape. I'm now ready for this edit-rebuild-test cycle that I've been going on about for the last few days.
To prove the point (to myself, as much as to anyone else) that this can be a tight cycle, I'm going to start off small. The changes to mCaps look like they may not be necessary, so let's see what happens if I remove them. To do this I've made the following changes to the code.
git diff
diff --git a/gfx/gl/GLContext.cpp b/gfx/gl/GLContext.cpp
index 03ee715a8f35..9e07257837f8 100644
--- a/gfx/gl/GLContext.cpp
+++ b/gfx/gl/GLContext.cpp
@@ -288,8 +288,6 @@ GLContext::GLContext(const GLContextDesc& desc, GLContext* sharedContext,
       mSharedContext(sharedContext),
       mWorkAroundDriverBugs(
           StaticPrefs::gfx_work_around_driver_bugs_AtStartup()) {
-  mCaps.any = true;
-  mCaps.color = true;
   mOwningThreadId = PlatformThread::CurrentId();
   MOZ_ALWAYS_TRUE(sCurrentContext.init());
   sCurrentContext.set(0);
@@ -952,13 +950,6 @@ bool GLContext::InitImpl() {
   // We're ready for final setup.
   fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, 0);

-  // TODO: Remove SurfaceCaps::any.
-  if (mCaps.any) {
-    mCaps.any = false;
-    mCaps.color = true;
-    mCaps.alpha = false;
-  }
-
   MOZ_GL_ASSERT(this, IsCurrent());

   if (ShouldSpew() && IsExtensionSupported(KHR_debug)) {
Having made these changes I can now perform a quick partial build directly from inside the build target using the following commands:
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/
$ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit
$ strip obj-build-mer-qt-xr/toolkit/library/build/libxul.so
This gives us a new libxul.so file from which I've stripped the debug output in order to keep the filesize down. I've transferred this new library over to my phone and am now checking both the browser and the WebView to see whether I can spot any regressions following the changes:
sailfish-browser
harbour-webview
After running both of these and checking that the rendering is working correctly, I don't see any problems. Both are working nicely. So these changes look good and I can add them to the commit. Adding these changes to the commit will actually be reversing some changes that are in the commit already, so this will make the final patch simpler.
Next up, GLContext::GuaranteeResolve() looks unused, so I've removed it. I also notice that CreateScreenBufferImpl() only ever gets called by CreateScreenBuffer() so the two may as well be combined into a single method.
Finally there are some changes to GLContext::fCopyTexImage2D() which may or may not be important. I'm going to revert them to see whether that breaks anything.
Having made these changes it's now time to move to step two of the edit-rebuild-test cycle.
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/
$ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit
$ strip obj-build-mer-qt-xr/toolkit/library/build/libxul.so
This is the same as before, which is why this is a cycle. With the freshly built library installed on my phone, time to check both the browser and the WebView once again for any potential regressions:
sailfish-browser
harbour-webview
All working! Great! Next, I notice that sSafeModeInitialized is set to true in gfxPlatform.cpp. I suspect this was a change I made back when I was basically trying everything and anything I could to get the code to work. But the upstream code has it set to false and the eventual patch will be simpler if I can leave it in this original state of being set to false. So I've made the following change to reflect this:
git diff
diff --git a/gfx/thebes/gfxPlatform.cpp b/gfx/thebes/gfxPlatform.cpp
index 9c217dfc81e9..79e261e54f83 100644
--- a/gfx/thebes/gfxPlatform.cpp
+++ b/gfx/thebes/gfxPlatform.cpp
@@ -2045,7 +2045,7 @@ BackendType gfxPlatform::GetBackendPref(const char* aBackendPrefName,
 }

 bool gfxPlatform::InSafeMode() {
-  static bool sSafeModeInitialized = true;
+  static bool sSafeModeInitialized = false;
   static bool sInSafeMode = false;

   if (!sSafeModeInitialized) {
Time now for step two of the edit-rebuild-test cycle again.
All working!
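For anyone wondering why the initial value of sSafeModeInitialized matters, here's a small model of the guard pattern. It's an assumption-laden sketch rather than the real gfxPlatform::InSafeMode(): the function name is mine, and actualSafeMode stands in for whatever the real code looks up the first time through.

```cpp
// Models the first call to a function guarded by a one-time
// initialisation flag, in the style of sSafeModeInitialized.
constexpr bool FirstCallResult(bool startInitialized, bool actualSafeMode) {
  bool initialized = startInitialized;  // models sSafeModeInitialized
  bool inSafeMode = false;              // models sInSafeMode
  if (!initialized) {
    initialized = true;
    inSafeMode = actualSafeMode;  // the real lookup would happen here
  }
  return inSafeMode;
}

// Starting with the flag set to true skips the lookup entirely, so the
// answer is always false regardless of the real state.
static_assert(!FirstCallResult(true, true), "lookup skipped");
// Starting with it set to false, as upstream does, performs the lookup.
static_assert(FirstCallResult(false, true), "lookup performed");
```

So my earlier hack effectively forced safe mode to be reported as off, and restoring false simply puts the real check back in play.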
All of these changes have now been tested and committed. I need to continue working through small changes and tests like these tomorrow, and keep going until the commit that fixes the offscreen rendering pipeline is looking more manageable in terms of size. Unfortunately, as we've discussed before, the partial builds I've been doing mess up the debugging symbols, and in fact I've been stripping the debug symbols from the library so that it can be copied to my phone over the network more rapidly. So I'm going to run another full build overnight to get the debug symbols back.
$ sfdk build -d --with git_workaround
So with that running overnight, that's it for today. Tomorrow I'll be back to simplifying the code.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.