List items
Items from the current list are shown below.
Gecko
8 Mar 2024 : Day 179 #
The build started yesterday has now completed; let's not waste any time and get straight to testing it out.
One thing that we are able to work with is the differences between the ESR 78 code (working) and the ESR 91 code (broken). And they are different. In ESR 78 the SharedSurfaceTextureData constructor at the top of the stack looks like this:
I could fire up my second development phone and place a breakpoint on the SharedSurfaceTextureClient constructor to compare, but I'm on the train and one laptop and two phones is already leaving me cramped. A laptop and three phones would crowd me out entirely. So let's find out why surf is null by looking through the ESR 91 code instead.
The odd thing is that the parent method has plenty of checks for it not being null:
So accessing this value that's been optimised out is actually more difficult than I'd thought. I can't just go up (down) a stack frame and check it there after all.
The solution will be to place a breakpoint on SharedSurfaceTextureClient::Create() and inspect the value before it's moved. Let's try that out.
And now suddenly it's hit me. There's something very wrong with this ordering:
I'm going to attempt to run the library generated from the partial build. Unfortunately the partial builds mess up the symbol references so it's not always possible to debug with them. I've also stripping them of debug symbols to make uploading them quicker, so even if they don't get messed up, I still can't use them. But running it may nevertheless help to find out whether this fix has made any difference at all.
[...]
I'm home now. After digging around in the code a bit and comparing with the execution of ESR 78, the reason for the crash has become clear. At the top of the backtrace is SharedSurface::ProdTexture(). But this method is designed to crash; it looks like this:
For the same reason, when I place a breakpoint on SharedSurface::ProdTexture() in the ESR 78 version of the code it doesn't get hit. On the other hand, when I place a breakpoint on ReadBuffer::Create() which is further down the stack trace that does get hit. After which if we place a breakpoint on all instances of ProdTexture() we do get a hit, but it's from SharedSurface_Basic::ProdTexture() rather than from SharedSurface:
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
$ gdb harbour-webview GNU gdb (GDB) Mer (8.2.1+git9) [...] (gdb) r Starting program: /usr/bin/harbour-webview [...] Created LOG for EmbedLiteLayerManager [New LWP 4065] Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault. [Switching to LWP 4059] 0x0000007ff125d16c in mozilla::layers::SharedSurfaceTextureData:: SharedSurfaceTextureData (this=0x7ed81af3c0, surf=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:283 283 obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory. (gdb) bt #0 0x0000007ff125d16c in mozilla::layers::SharedSurfaceTextureData:: SharedSurfaceTextureData (this=0x7ed81af3c0, surf=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:283 #1 0x0000007ff125d210 in mozilla::layers::SharedSurfaceTextureClient:: Create (surf=..., factory=factory@entry=0x7ed80043d0, aAllocator=0x0, aFlags=<optimized out>) at obj-build-mer-qt-xr/dist/include/mozilla/ cxxalloc.h:33 #2 0x0000007ff111e038 in mozilla::gl::SurfaceFactory::NewTexClient (this=0x7ed80043d0, size=...) at obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:289 #3 0x0000007ff1107088 in mozilla::gl::GLScreenBuffer::Resize (this=0x5555643e90, size=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290 #4 0x0000007ff1131c04 in mozilla::gl::GLContext::CreateScreenBufferImpl (this=this@entry=0x7ed819ee40, size=..., caps=...) at gfx/gl/GLContext.cpp:2150 #5 0x0000007ff1131cc8 in mozilla::gl::GLContext::CreateScreenBuffer (caps=..., size=..., this=0x7ed819ee40) at gfx/gl/GLContext.h:3555 #6 mozilla::gl::GLContext::InitOffscreen (this=this@entry=0x7ed819ee40, size=..., caps=...) at gfx/gl/GLContext.cpp:2398 [...] #29 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb)This is progress: the application got a bit further today. But just from reading through the backtrace I'm not entirely certain what's going on here. The issues — and their locations — are being obscured by the various pointer wrappers (UniquePtr and RefPtr) in use. I'm finding it hard to get a purchase. The first actually useful location is line 2150 of GLContext.cpp but that's way down in frame 4.
One thing that we are able to work with is the differences between the ESR 78 code (working) and the ESR 91 code (broken). And they are different. In ESR 78 the SharedSurfaceTextureData constructor at the top of the stack looks like this:
SharedSurfaceTextureData::SharedSurfaceTextureData( UniquePtr<gl::SharedSurface> surf) : mSurf(std::move(surf)) {}Whereas in ESR 91 I appear to have added some additional initialisation steps:
SharedSurfaceTextureData::SharedSurfaceTextureData( UniquePtr<gl::SharedSurface> surf) : mSurf(std::move(surf)), mDesc(), mFormat(), mSize(surf->mDesc.size) { }One possibility is that the value of surf is null. This wouldn't necessarily cause a problem until we try to read the mDesc entry while setting mSize. I'm not able to extract the value of surf directly as the debugger informs me it's been "optimized out". But if I go up (down?) a stack frame I can seer what was passed in for its value.
(gdb) p surf $3 = <optimized out> (gdb) frame 1 #1 0x0000007ff125d210 in mozilla::layers::SharedSurfaceTextureClient::Create (surf=..., factory=factory@entry=0x7ed80043d0, aAllocator=0x0, aFlags=<optimized out>) at obj-build-mer-qt-xr/dist/include/mozilla/ cxxalloc.h:33 33 obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h: No such file or directory. (gdb) p surf $4 = { mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SharedSurface*, mozilla::DefaultDelete<mozilla::gl::SharedSurface>, (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> = {<mozilla::DefaultDelete<mozilla::gl::SharedSurface>> = {<No data fields>}, mFirstA = 0x0}, <No data fields>}} (gdb) p surf.mTuple.mFirstA $5 = (mozilla::gl::SharedSurface *) 0x0 (gdb)So the fact that surf is null provides us with an immediate reason for the crash, but it leaves us with a question: should the value be null and we shouldn't be attempting to access its contents, or should it be a non-null value going in to this method?
I could fire up my second development phone and place a breakpoint on the SharedSurfaceTextureClient constructor to compare, but I'm on the train and one laptop and two phones is already leaving me cramped. A laptop and three phones would crowd me out entirely. So let's find out why surf is null by looking through the ESR 91 code instead.
The odd thing is that the parent method has plenty of checks for it not being null:
already_AddRefed<SharedSurfaceTextureClient> SharedSurfaceTextureClient::Create( UniquePtr<gl::SharedSurface> surf, gl::SurfaceFactory* factory, LayersIPCChannel* aAllocator, TextureFlags aFlags) { if (!surf) { return nullptr; } TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags(); SharedSurfaceTextureData* data = new SharedSurfaceTextureData(std::move(surf)); return MakeAndAddRef<SharedSurfaceTextureClient>(data, flags, aAllocator); }So the value going in isn't null. And now that I look at it carefully, I see that I have this all wrong: the surf variable isn't an instance of SharedSurface, it's a UniquePtr wrapping an instance of SharedSurface. When the value inside the unique pointer is moved, the value inside the unique pointer that it's coming from gets set to zero. That's the whole point of unique pointers.
So accessing this value that's been optimised out is actually more difficult than I'd thought. I can't just go up (down) a stack frame and check it there after all.
The solution will be to place a breakpoint on SharedSurfaceTextureClient::Create() and inspect the value before it's moved. Let's try that out.
(gdb) b SharedSurfaceTextureClient::Create Breakpoint 1 at 0x7ff125d1a4: file gfx/layers/client/ TextureClientSharedSurface.cpp, line 113. (gdb) r The program being debugged has been started already. Start it from the beginning? (y or n) y Starting program: /usr/bin/harbour-webview [...] Thread 36 "Compositor" hit Breakpoint 1, mozilla::layers:: SharedSurfaceTextureClient::Create (surf=..., factory=factory@entry=0x7ee0004310, aAllocator= 0x0, aFlags=mozilla::layers::TextureFlags::ORIGIN_BOTTOM_LEFT) at gfx/layers/client/TextureClientSharedSurface.cpp:113 113 LayersIPCChannel* aAllocator, TextureFlags aFlags) { (gdb) p surf $6 = { mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SharedSurface*, mozilla::DefaultDelete<mozilla::gl::SharedSurface>, (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> = {<mozilla::DefaultDelete<mozilla::gl::SharedSurface>> = {<No data fields>}, mFirstA = 0x7ee01a1c00}, <No data fields>}} (gdb) p surf.mTuple.mFirstA $7 = (mozilla::gl::SharedSurface *) 0x7ee01a1c00 (gdb) p surf.mTuple.mFirstA->mDesc.size $9 = {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx:: UnknownUnits> >> = {{{width = 1080, height = 2520}, components = {1080, 2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>}, <No data fields>} (gdb) p surf.mTuple.mFirstA->mDesc.size.width $10 = 1080 (gdb) p surf.mTuple.mFirstA->mDesc.size.height $11 = 2520 (gdb)This is all now looking a lot more healthy than I thought. Since it's not that the value is null going in, what else could be causing the crash here? It could be that there are multiple calls to this Create() method and this isn't the one causing the problem. But that's easy to check as well:
(gdb) n 114 if (!surf) { (gdb) 49 obj-build-mer-qt-xr/dist/include/mozilla/TypedEnumBits.h: No such file or directory. (gdb) 117 TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags(); (gdb) 119 new SharedSurfaceTextureData(std::move(surf)); (gdb) Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault. 0x0000007ff125d16c in mozilla::layers::SharedSurfaceTextureData:: SharedSurfaceTextureData (this=0x7ee01af300, surf=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:283 283 obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory. (gdb)Stepping forwards to the next call to the SharedSurfaceTextureData constructor triggers the crash. So we're definitely in the right place and the problem isn't a null value for the surf parameter after all.
And now suddenly it's hit me. There's something very wrong with this ordering:
: mSurf(std::move(surf)), mDesc(), mFormat(), mSize(surf->mDesc.size)The sequence here is going to be:
- Move surf into mSurf. This will leave the value stored inside surf as null.
- Create mDesc and mFormat in their default state.
- Attempt to access a value from inside surf. But surf has been moved out of the unique pointer wrapper and into another one, so we can't do this.
SharedSurfaceTextureData::SharedSurfaceTextureData( UniquePtr<gl::SharedSurface> surf) : mSurf(std::move(surf)), mDesc(), mFormat(), mSize(mSurf->mDesc.size) { }There may be other reasons why this constructor causes problems later, such as having the default values for mDesc and mFormat. I'm not sure how important these are. But this fix should get us closer to finding out.
I'm going to attempt to run the library generated from the partial build. Unfortunately the partial builds mess up the symbol references so it's not always possible to debug with them. I've also stripping them of debug symbols to make uploading them quicker, so even if they don't get messed up, I still can't use them. But running it may nevertheless help to find out whether this fix has made any difference at all.
$ make -j1 -C obj-build-mer-qt-xr/gfx/layer $ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit $ strip obj-build-mer-qt-xr/toolkit/library/build/libxul.so [...] $ scp obj-build-mer-qt-xr/toolkit/library/build/libxul.so \ defaultuser@172.28.172.2:~/Documents/Development/gecko/ $ ssh defaultuser@172.28.172.2 [...] $ devel-su cp libxul.so /usr/lib64//xulrunner-qt5-91.9.1/ $ gdb harbour-webview GNU gdb (GDB) Mer (8.2.1+git9) [...] (gdb) r Starting program: /usr/bin/harbour-webview [...] Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault. [Switching to LWP 29907] 0x0000007ff1107e00 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so (gdb) bt #0 0x0000007ff1107e00 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #1 0x0000007ff1106948 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #2 0x0000007ff1106c30 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #3 0x0000007ff1106e5c in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #4 0x0000007ff1107104 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #5 0x0000007ff1131c64 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #6 0x0000007ff1131d28 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #7 0x0000007ff1131e74 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #8 0x0000007ff11999f8 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #9 0x0000007ff11aefe8 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #10 0x0000007ff12c4c98 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #11 0x0000007ff12cfd14 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so [...] #25 0x0000007ff07fe8d8 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #26 0x0000007feca3c9f0 in ?? () from /usr/lib64/libnspr4.so #27 0x0000007fefd00a4c in ?? () from /lib64/libpthread.so.0 #28 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb)Clearly that's not as helpful as we might have hoped. Maybe it's worth a try without stripping out the debug symbols.
$ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit [...] $ scp obj-build-mer-qt-xr/toolkit/library/build/libxul.so \ defaultuser@172.28.172.2:~/Documents/Development/gecko/The problem with using the non-stripped version is that it's a large 2.7GiB file (that's three times larger than the RPM packages, even including the debuginfo). The consequence is that it's actually taking my entire train journey to copy the file over to my phone (via my other phone using a Wifi hotspot). It's finally got there... but with only a few minutes to spare before we pull in to Cambridge station. I'm going to have to be quick!
$ ssh defaultuser@172.28.172.2 [...] $ devel-su cp libxul.so /usr/lib64//xulrunner-qt5-91.9.1/ $ gdb harbour-webview GNU gdb (GDB) Mer (8.2.1+git9) [...] (gdb) r Starting program: /usr/bin/harbour-webview [...] Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault. [Switching to LWP 2262] 0x0000007ff1107e00 in mozilla::gl::SharedSurface::ProdTexture (this=<optimized out>) at gecko-dev/gfx/gl/SharedSurface.h:154 154 gecko-dev/gfx/gl/SharedSurface.h: No such file or directory. (gdb) bt #0 0x0000007ff1107e00 in mozilla::gl::SharedSurface::ProdTexture (this=<optimized out>) at gecko-dev/gfx/gl/SharedSurface.h:154 #1 0x0000007ff1106948 in mozilla::gl::ReadBuffer::Create (gl=0x7ed819ee40, caps=..., formats=..., surf=surf@entry=0x7ed81a1cc0) at gecko-dev/gfx/gl/GLScreenBuffer.cpp:653 #2 0x0000007ff1106c30 in mozilla::gl::GLScreenBuffer::CreateRead (this=this@entry=0x55556434d0, surf=surf@entry=0x7ed81a1cc0) at gecko-dev/gfx/gl/GLScreenBuffer.cpp:584 #3 0x0000007ff1106e5c in mozilla::gl::GLScreenBuffer::Attach (this=this@entry=0x55556434d0, surf=0x7ed81a1cc0, size=...) at gecko-dev/gfx/gl/GLScreenBuffer.cpp:488 #4 0x0000007ff1107104 in mozilla::gl::GLScreenBuffer::Resize (this=0x55556434d0, size=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290 #5 0x0000007ff1131c64 in mozilla::gl::GLContext::CreateScreenBufferImpl (this=this@entry=0x7ed819ee40, size=..., caps=...) at gecko-dev/gfx/gl/GLContext.cpp:2150 #6 0x0000007ff1131d28 in mozilla::gl::GLContext::CreateScreenBuffer (caps=..., size=..., this=0x7ed819ee40) at gecko-dev/gfx/gl/GLContext.h:3555 #7 mozilla::gl::GLContext::InitOffscreen (this=this@entry=0x7ed819ee40, size=..., caps=...) at gecko-dev/gfx/gl/GLContext.cpp:2398 #8 0x0000007ff1131e74 in mozilla::gl::GLContextProviderEGL::CreateOffscreen (size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags:: REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7f1778a1c8) at gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1308 [...] #30 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb)This is a more interesting backtrace! And it's certainly different from the one we had yesterday. But now we really are pulling in to the station so it's time to close up my laptop and prepare for the next stage of my journey home. When I get back, I'll dig in to this crash further.
[...]
I'm home now. After digging around in the code a bit and comparing with the execution of ESR 78, the reason for the crash has become clear. At the top of the backtrace is SharedSurface::ProdTexture(). But this method is designed to crash; it looks like this:
virtual GLuint ProdTexture() { MOZ_ASSERT(mAttachType == AttachmentType::GLTexture); MOZ_CRASH("GFX: Did you forget to override this function?"); }As you can see from the text passed to the crash macro, this function is never supposed to be called. It's supposed to be overridden by something else in a class that inherits from SharedSurface.
For the same reason, when I place a breakpoint on SharedSurface::ProdTexture() in the ESR 78 version of the code it doesn't get hit. On the other hand, when I place a breakpoint on ReadBuffer::Create() which is further down the stack trace that does get hit. After which if we place a breakpoint on all instances of ProdTexture() we do get a hit, but it's from SharedSurface_Basic::ProdTexture() rather than from SharedSurface:
(gdb) b ReadBuffer::Create Breakpoint 2 at 0x7fb8e66720: file gfx/gl/GLScreenBuffer.cpp, line 501. (gdb) r [...] Thread 36 "Compositor" hit Breakpoint 2, mozilla::gl::ReadBuffer::Create (gl=0x7eac109130, caps=..., formats=..., surf=surf@entry=0x7eac10a6e0) at gfx/gl/GLScreenBuffer.cpp:501 501 SharedSurface* surf) { (gdb) p surf $1 = (mozilla::gl::SharedSurface *) 0x7eac10a6e0 (gdb) p surf->mAttachType $2 = mozilla::gl::AttachmentType::GLTexture (gdb) b ProdTexture Breakpoint 3 at 0x7fb83835d0: ProdTexture. (5 locations) (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 3, mozilla::gl::SharedSurface_Basic:: ProdTexture (this=0x7eac10a6e0) at gfx/gl/SharedSurfaceGL.h:65 65 virtual GLuint ProdTexture() override { return mTex; } (gdb)The SharedSurface_Basic class is defined in the SharedSurfaceGL.h file and in ESR 78 it does override the ProdTexture() method:
// For readback and bootstrapping: class SharedSurface_Basic : public SharedSurface { [...] virtual GLuint ProdTexture() override { return mTex; } [...] };However in the ESR 91 code there's no such override. I should have added one in, but the need hadn't made it on to my radar. So I'll make this change now. Unfortunately there are a collection of cascading changes that make this a slightly larger job than I'd hoped. Nothing crazy, but too much to do tonight, so I'll have to leave it until the morning.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comments
Uncover Disqus comments