List items
Items from the current list are shown below.
Blog
15 Mar 2024 : Day 186 #
Today I'm still searching for memory leaks and one memory leak in particular: after fixing the SurfaceFactory_EGLImage::Create() which, as the name implies, creates EGL surface textures, there's now a massive memory leak that grinds not just the browser but the entire phone to a halt.
I'm hypothesising that there's something created by the method that should get freed. The texture itself is the most likely culprit, since even just a few 1080 by 2520 textures are going to consume a lot of memory. Generate a few of those each frame without freeing them and bad things will happen.
We're definitely seeing bad things happen, so this is my guess. However yesterday I checked a few of the associated destructors and they seem to be getting called.
So today I'm going to try to tackle it from a different angle. Rather than try to figure out what's not getting freed I'm going to try to find out what's causing the crash. I've started off by adjusting the size of the texture created as part of the render loop. Rather than a 1080 by 2520 texture, I've set it to always generate an 8 by 8 texture, by altering this code from GLContext.cpp (the commented part is the old code, replaced by the equivalent lines directly below):
While the updated code builds and gets transferred over to my phone I can continue looking through the code. Here's the backtrace from the SharedSurface_EGLImage::Create() method, the change to which has triggered the memory leak. It's possible, maybe even likely, that the code for reclaiming the resources will live somewhere in or close to the methods in this stack.
My new binary has copied over to my phone. The running app exhibits very similar symptoms as before: responsive at first but quickly seizes up and becoming unresponsive. I killed the process before it took down my phone, but a visual check suggests reducing the texture size didn't have any obvious benefit.
Back to looking through the code, I'm working through SharedSurfaceTextureClient::~SharedSurfaceTextureClient() and I notice this method called Destroy(). Although it's not immediately clear from the way the code is written, this call actually goes through to TextureClient::Destroy(). I recall making changes to this and scanning through my notes confirms it: it was back on Day 176. At the time I wrote this:
Could it be that the changes I made back then are causing the issues I'm experiencing now?
While I can step through the ESR 78 build, unfortunately all of changes integrated using partial builds have messed up the debug source for the ESR 91 build. Stepping through gives me just a slew of useless "TextureClientSharedSurface.cpp: No such file or directory." messages. So I've decided to kick off a full build. With a bit of luck it'll be completed before the end of the day and I'll be able to come back to this in the evening to compare the two executing flows by stepping through them.
[...]
It was just before 9:00 this morning that I set the build going. I could well imagine it running into the night and leaving me with no more time to get the benefit from it. But by 17:13 this evening the build had completed. That means there's still time to perform the comparison of TextureClient::Destroy() running on the two versions.
The results, however, are not what I was expecting. At least on ESR 78 things seem to act normally. The breakpoint is hit and it's possible to step through the code. It seems a little anomalous that both the data and actor variables are null, meaning that the actual dealocation step gets skipped:
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
I'm hypothesising that there's something created by the method that should get freed. The texture itself is the most likely culprit, since even just a few 1080 by 2520 textures are going to consume a lot of memory. Generate a few of those each frame without freeing them and bad things will happen.
We're definitely seeing bad things happen, so this is my guess. However yesterday I checked a few of the associated destructors and they seem to be getting called.
So today I'm going to try to tackle it from a different angle. Rather than try to figure out what's not getting freed I'm going to try to find out what's causing the crash. I've started off by adjusting the size of the texture created as part of the render loop. Rather than a 1080 by 2520 texture, I've set it to always generate an 8 by 8 texture, by altering this code from GLContext.cpp (the commented part is the old code, replaced by the equivalent lines directly below):
GLuint CreateTexture(GLContext* aGL, GLenum aInternalFormat, GLenum aFormat, GLenum aType, const gfx::IntSize& aSize, bool linear) { [...] // aGL->fTexImage2D(LOCAL_GL_TEXTURE_2D, 0, aInternalFormat, aSize.width, // aSize.height, 0, aFormat, aType, nullptr); aGL->fTexImage2D(LOCAL_GL_TEXTURE_2D, 0, aInternalFormat, 8, 8, 0, aFormat, aType, nullptr); return tex; }This won't prevent the memory leak, but if this is the texture that's not being freed, it should at least slow it down, which should be discernible in use.
While the updated code builds and gets transferred over to my phone I can continue looking through the code. Here's the backtrace from the SharedSurface_EGLImage::Create() method, the change to which has triggered the memory leak. It's possible, maybe even likely, that the code for reclaiming the resources will live somewhere in or close to the methods in this stack.
#0 SharedSurface_EGLImage::Create, SharedSurfaceEGL.cpp:58 #1 SurfaceFactory_EGLImage::CreateSharedImpl, WeakPtr.h:185 #2 SurfaceFactory::CreateShared, RefCounted.h:240 #3 SurfaceFactory::NewTexClient, SharedSurface.cpp:406 #4 GLScreenBuffer::Swap, UniquePtr.h:290 #5 GLScreenBuffer::PublishFrame, GLScreenBuffer.h:229 #6 EmbedLiteCompositorBridgeParent::PresentOffscreenSurface, EmbedLiteCompositorBridgeParent.cpp:191 #7 embedlite::nsWindow::PostRender, embedshared/nsWindow.cpp:248 #8 InProcessCompositorWidget::PostRender, InProcessCompositorWidget.cpp:60 #9 LayerManagerComposite::Render, Compositor.h:575 #10 LayerManagerComposite::UpdateAndRender, LayerManagerComposite.cpp:657 #11 LayerManagerComposite::EndTransaction, LayerManagerComposite.cpp:572I should take a look more closely at SharedSurfaceTextureClient::~SharedSurfaceTextureClient() which is — in theory — being called each time Swap() is called when the contents of the mBack reference-counted pointer is freed.
My new binary has copied over to my phone. The running app exhibits very similar symptoms as before: responsive at first but quickly seizes up and becoming unresponsive. I killed the process before it took down my phone, but a visual check suggests reducing the texture size didn't have any obvious benefit.
Back to looking through the code, I'm working through SharedSurfaceTextureClient::~SharedSurfaceTextureClient() and I notice this method called Destroy(). Although it's not immediately clear from the way the code is written, this call actually goes through to TextureClient::Destroy(). I recall making changes to this and scanning through my notes confirms it: it was back on Day 176. At the time I wrote this:
This mWorkaroundAnnoyingSharedSurfaceLifetimeIssues is used in ESR 78 to decide whether or not to deallocate the TextureData when the TextureClient is destroyed. This has been removed in ESR 91 and to be honest, I'm comfortable leaving it this way... Maybe this will cause problems later, but my guess is that this will show up with the debugger if that's the case, at which point we can refer back to the ESR 78 code to restore these checks.
Could it be that the changes I made back then are causing the issues I'm experiencing now?
While I can step through the ESR 78 build, unfortunately all of changes integrated using partial builds have messed up the debug source for the ESR 91 build. Stepping through gives me just a slew of useless "TextureClientSharedSurface.cpp: No such file or directory." messages. So I've decided to kick off a full build. With a bit of luck it'll be completed before the end of the day and I'll be able to come back to this in the evening to compare the two executing flows by stepping through them.
[...]
It was just before 9:00 this morning that I set the build going. I could well imagine it running into the night and leaving me with no more time to get the benefit from it. But by 17:13 this evening the build had completed. That means there's still time to perform the comparison of TextureClient::Destroy() running on the two versions.
The results, however, are not what I was expecting. At least on ESR 78 things seem to act normally. The breakpoint is hit and it's possible to step through the code. It seems a little anomalous that both the data and actor variables are null, meaning that the actual dealocation step gets skipped:
(gdb) b TextureClient::Destroy Breakpoint 4 at 0x7fb8e9b1b8: file gfx/layers/client/TextureClient.cpp, line 558. (gdb) c Continuing. [LWP 7428 exited] [Switching to LWP 7404] Thread 37 "Compositor" hit Breakpoint 4, mozilla::layers::TextureClient:: Destroy (this=this@entry=0x7ea0118ab0) at gfx/layers/client/TextureClient.cpp:558 558 void TextureClient::Destroy() { (gdb) n [LWP 7308 exited] [LWP 7374 exited] [LWP 7375 exited] 560 MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0); (gdb) n 562 if (mActor && !mIsLocked) { (gdb) n 566 mBorrowedDrawTarget = nullptr; (gdb) n 567 mReadLock = nullptr; (gdb) n [New LWP 7538] 569 RefPtr<TextureChild> actor = mActor; (gdb) n 577 TextureData* data = mData; (gdb) n 578 if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) { (gdb) n 582 if (data || actor) { (gdb) p mWorkaroundAnnoyingSharedSurfaceLifetimeIssues $11 = true (gdb) p data $12 = (mozilla::layers::TextureData *) 0x0 (gdb) p actor $13 = {mRawPtr = 0x0} (gdb)Here are the most relevant parts of the code associated with the above debug output to help follow what's happening:
void TextureClient::Destroy() { // Async paints should have been flushed by now. MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0); if (mActor && !mIsLocked) { mActor->Lock(); } mBorrowedDrawTarget = nullptr; mReadLock = nullptr; RefPtr<TextureChild> actor = mActor; mActor = nullptr; if (actor && !actor->mDestroyed.compareExchange(false, true)) { actor->Unlock(); actor = nullptr; } TextureData* data = mData; if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) { mData = nullptr; } if (data || actor) { [...] DeallocateTextureClient(params); } }Although the code does step through okay, on ESR 91 something happens which I can't explain. The debugger suggests execution is getting stuck in a loop calling TextureClient::Destroy():
(gdb) b TextureClient::Destroy Breakpoint 3 at 0x7ff1148c90: file layers/client/ TextureClient.cpp, line 551. (gdb) c Continuing. [LWP 18393 exited] [Switching to LWP 18284] Thread 7 "GeckoWorkerThre" hit Breakpoint 3, mozilla::layers::TextureClient:: Destroy (this=this@entry=0x7fc5975380) at layers/client/TextureClient.cpp:551 551 void TextureClient::Destroy() { (gdb) n 553 MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0); (gdb) Thread 7 "GeckoWorkerThre" hit Breakpoint 3, mozilla::layers::TextureClient:: Destroy (this=this@entry=0x7fc595de20) at layers/client/TextureClient.cpp:551 551 void TextureClient::Destroy() { (gdb) 553 MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0); (gdb) Thread 7 "GeckoWorkerThre" hit Breakpoint 3, mozilla::layers::TextureClient:: Destroy (this=this@entry=0x7fc55d2280) at layers/client/TextureClient.cpp:551 551 void TextureClient::Destroy() { (gdb)The code is quite similar to the ESR 78 code, so I'm not sure why this might be happening. There's no obviously nested call to Destroy():
void TextureClient::Destroy() { // Async paints should have been flushed by now. MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0); if (mActor && !mIsLocked) { mActor->Lock(); } mBorrowedDrawTarget = nullptr; mReadLock = nullptr; RefPtr<TextureChild> actor = mActor; mActor = nullptr; if (actor && !actor->mDestroyed.compareExchange(false, true)) { actor->Unlock(); actor = nullptr; } TextureData* data = mData; mData = nullptr; if (data || actor) { [...] DeallocateTextureClient(params); } }The backtrace is showing nested calls to TextureClient::~TextureClient(). But none of this explains the repeated hits of the TextureClient::Destroy() breakpoint.
#0 mozilla::layers::TextureClient::Destroy (this=this@entry=0x7fc6f7a3a0) at layers/client/TextureClient.cpp:551 #1 0x0000007ff1149144 in mozilla::layers::TextureClient::~TextureClient (this=0x7fc6f7a3a0, __in_chrg=<optimized out>) at layers/client/TextureClient.cpp:769 #2 0x0000007ff1149310 in mozilla::layers::TextureClient::~TextureClient (this=0x7fc6f7a3a0, __in_chrg=<optimized out>) at layers/client/TextureClient.cpp:764 #3 0x0000007ff110507c in mozilla::AtomicRefCountedWithFinalize <mozilla::layers::TextureClient>::Release (this=0x7fc6f7a3a8) at include/c++/8.3.0/bits/basic_ios.h:282 #4 0x0000007ff1268420 in mozilla::RefPtrTraits <mozilla::layers::TextureClient>::Release (aPtr=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:381 #5 RefPtr<mozilla::layers::TextureClient>::ConstRemovingRefPtrTraits <mozilla::layers::TextureClient>::Release (aPtr=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:381 [...]I was hoping to get this part resolved today, but the situation is confusing and my head has stopped focusing, so I'm going to have to continue tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comments
Uncover Disqus comments