16 Mar 2024 : Day 187
Yesterday I was struggling. And getting this WebView render pipeline working is starting to feel drawn out. I don't want to make any claims about the larger task, but I'm hoping that a good night's sleep and some time to ponder on the best approach in the shower this morning will have helped me focus on the task I'm struggling with.
And this task is figuring out what, if anything, is different between the execution of TextureClient::Destroy() on ESR 78 compared to ESR 91. I'm still labouring under the hypothesis that it's this method that's causing the (hypothetical) memory leak that's causing ESR 91 execution to seize up over time.
The difficulties I experienced yesterday were twofold. First, on ESR 78 the actual section of code that reclaims the allocated memory appeared never to be executed. Second, applying the debugger to ESR 91 gave peculiar results: what appeared to be an infinite loop of calls to Destroy() that never allowed me to step into the method.
To tackle these difficulties I'm going to try two things. First I need to stick a breakpoint inside the conditional code that reclaims the memory on ESR 78, to establish whether it ever gets called. Second I'm going to annotate the ESR 91 code with debug prints. That should allow me to get a better idea of the true execution flow. If the debugger isn't playing by the rules, I'll take my ball somewhere else.
So, first up, checking the ESR 78 flow. The structure of the Destroy() method looks like this on ESR 78:
    void TextureClient::Destroy() {
    [...]
      RefPtr<TextureChild> actor = mActor;
    [...]
      TextureData* data = mData;
      if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) {
        mData = nullptr;
      }

      if (data || actor) {
    [...]
        DeallocateTextureClient(params);
      }
    }

I'm interested in whether it ever goes inside the condition at the end in order to ultimately call DeallocateTextureClient(). If it doesn't (or if this only happens occasionally, say on shutdown) then I'm likely to be looking in the wrong place.
The reason I've never seen it enter this condition is because data (which is derived from the mData class variable) and actor (which is derived from the mActor class variable) have always been null when entering this method.
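To make the shape of that condition concrete, here's a minimal self-contained sketch of the same control flow. It's a model rather than the gecko code: FakeTextureClient, FakeData, FakeActor, FakeDeallocate() and gDeallocCalls are all hypothetical names, std::shared_ptr stands in for RefPtr, and the mWorkaroundAnnoyingSharedSurfaceLifetimeIssues flag is left out (treated as always false).

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for TextureData and TextureChild.
struct FakeData {
  int id = 0;
};
struct FakeActor {
  int id = 0;
};

struct FakeDeallocParams {
  FakeData* data = nullptr;
  std::shared_ptr<FakeActor> actor;
};

// Counts how many times the deallocation path is actually reached.
static int gDeallocCalls = 0;

void FakeDeallocate(FakeDeallocParams params) {
  ++gDeallocCalls;
  delete params.data;  // Reclaim the data; the actor is reference counted.
}

class FakeTextureClient {
 public:
  FakeTextureClient(FakeData* data, std::shared_ptr<FakeActor> actor)
      : mData(data), mActor(std::move(actor)) {}

  ~FakeTextureClient() { Destroy(); }

  void Destroy() {
    // Take local copies and clear the members, so that a second call
    // sees nulls and skips the deallocation entirely.
    std::shared_ptr<FakeActor> actor = std::move(mActor);
    FakeData* data = mData;
    mData = nullptr;

    if (data || actor) {
      FakeDeallocParams params;
      params.data = data;
      params.actor = actor;
      FakeDeallocate(params);
    }
  }

 private:
  FakeData* mData;
  std::shared_ptr<FakeActor> mActor;
};
```

In this model a client constructed with null data and a null actor never reaches FakeDeallocate(), which matches what you'd see if a breakpoint were only ever hit on calls where both values were already null.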
Let's do this then.
    (gdb) break TextureClient.cpp:583
    Breakpoint 5 at 0x7fb8e9b2fc: file gfx/layers/client/TextureClient.cpp, line 585.
    (gdb) r
    [...]
    Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::layers::
        TextureClient::Destroy (this=this@entry=0x7f8ea53bb0)
        at gfx/layers/client/TextureClient.cpp:585
    585       params.allocator = mAllocator;
    (gdb) c
    Continuing.
    [LWP 16174 exited]

    Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::layers::
        TextureClient::Destroy (this=this@entry=0x7f8effa780)
        at gfx/layers/client/TextureClient.cpp:585
    585       params.allocator = mAllocator;
    (gdb) p mWorkaroundAnnoyingSharedSurfaceLifetimeIssues
    $21 = false
    (gdb) p data
    $22 = (mozilla::layers::TextureData *) 0x7f8dac3a70
    (gdb) p actor
    $23 = <optimized out>
    (gdb) b DeallocateTextureClient
    Breakpoint 6 at 0x7fb8e9a908: file gfx/layers/client/TextureClient.cpp, line 490.
    (gdb) c
    Continuing.

    Thread 7 "GeckoWorkerThre" hit Breakpoint 6, mozilla::layers::
        DeallocateTextureClient (params=...)
        at gfx/layers/client/TextureClient.cpp:490
    490     void DeallocateTextureClient(TextureDeallocParams params) {
    (gdb) p params
    $24 = {data = 0x7f8dac3a70, actor = {mRawPtr = 0x7f8d648720}, allocator =
        {mRawPtr = 0x7f8cb40ff8}, clientDeallocation = false, syncDeallocation = false,
        workAroundSharedSurfaceOwnershipIssue = false}
    (gdb) n
    491       if (!params.actor && !params.data) {
    (gdb) n
    324     obj-build-mer-qt-xr/dist/include/nsCOMPtr.h: No such file or directory.
    (gdb) n
    499       if (params.allocator) {
    (gdb) n
    500         ipdlThread = params.allocator->GetThread();
    (gdb) n
    501         if (!ipdlThread) {
    (gdb) n
    510       if (ipdlThread && !ipdlThread->IsOnCurrentThread()) {
    (gdb) n
    532       if (!ipdlThread) {
    (gdb) n
    540       if (!actor) {
    (gdb) n
    555       actor->Destroy(params);
    (gdb) n
    497       nsCOMPtr<nsISerialEventTarget> ipdlThread;
    (gdb) n
    505           return;
    (gdb) c
    Continuing.
    [...]

So that clears things up: it does go inside the condition and it does deallocate the actor. But the data and actor values are non-null and ultimately, in this case because we're executing on the IPDL thread, the actor is destroyed directly.
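The trace above steps past an IsOnCurrentThread() check before calling actor->Destroy() directly. It doesn't show what the other branch does, so the following is only a sketch of the general pattern, assuming that work arriving on the wrong thread gets handed over to the owning thread rather than run in place. OwnerThreadQueue, DestroyOnOwner() and DestroyPath are hypothetical names invented for the example, and none of this is the gecko implementation.

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for the thread that owns the actor.
class OwnerThreadQueue {
 public:
  explicit OwnerThreadQueue(std::thread::id owner) : mOwner(owner) {}

  bool IsOnCurrentThread() const {
    return std::this_thread::get_id() == mOwner;
  }

  // Queue work to be run later by the owning thread.
  void Dispatch(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mMutex);
    mTasks.push(std::move(task));
  }

  // Called by the owning thread to run everything that was queued.
  void Drain() {
    for (;;) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mTasks.empty()) {
          return;
        }
        task = std::move(mTasks.front());
        mTasks.pop();
      }
      task();  // Run outside the lock so tasks may dispatch more work.
    }
  }

 private:
  std::thread::id mOwner;
  std::mutex mMutex;
  std::queue<std::function<void()>> mTasks;
};

enum class DestroyPath { Direct, Dispatched };

// Destroy in place when already on the owning thread (the path seen in
// the trace); otherwise hand the work over (an assumption of this sketch).
DestroyPath DestroyOnOwner(OwnerThreadQueue& owner,
                           std::function<void()> destroy) {
  if (!owner.IsOnCurrentThread()) {
    owner.Dispatch(std::move(destroy));
    return DestroyPath::Dispatched;
  }
  destroy();
  return DestroyPath::Direct;
}
```

The direct path is the interesting one here: when the caller is already on the owning thread nothing is queued, so the destruction has completed by the time the call returns.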
Let's now try the same thing on ESR 91.
    (gdb) break TextureClient.cpp:574
    Breakpoint 4 at 0x7ff1148ddc: TextureClient.cpp:574. (2 locations)
    (gdb) r
    [...]
    Thread 7 "GeckoWorkerThre" hit Breakpoint 4, mozilla::layers::
        TextureClient::Destroy (this=this@entry=0x7fc5612680)
        at gfx/layers/client/TextureClient.cpp:587
    587       if (actor) {
    (gdb) c
    Continuing.

    Thread 7 "GeckoWorkerThre" hit Breakpoint 4, mozilla::layers::
        TextureClient::Destroy (this=this@entry=0x7fc55b8690)
        at gfx/layers/client/TextureClient.cpp:587
    587       if (actor) {
    (gdb) c
    Continuing.

    Thread 7 "GeckoWorkerThre" hit Breakpoint 4, mozilla::layers::
        TextureClient::Destroy (this=this@entry=0x7fc55b6c10)
        at gfx/layers/client/TextureClient.cpp:587
    587       if (actor) {
    (gdb) p data
    $12 = (mozilla::layers::TextureData *) 0x7fc4d79e80
    (gdb) p actor
    $13 = <optimized out>
    (gdb) b DeallocateTextureClient
    Breakpoint 5 at 0x7ff1148394: file gfx/layers/client/TextureClient.cpp, line 489.
    (gdb) c
    Continuing.

    Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::layers::
        DeallocateTextureClient (params=...)
        at gfx/layers/client/TextureClient.cpp:489
    489     void DeallocateTextureClient(TextureDeallocParams params) {
    (gdb) p params
    $14 = {data = 0x7fc4d79e80, actor = {mRawPtr = 0x7fc59e6620}, allocator =
        {mRawPtr = 0x7fc46684e0}, clientDeallocation = false, syncDeallocation = false}
    (gdb) n
    490       if (!params.actor && !params.data) {
    (gdb) n
    496       nsCOMPtr<nsISerialEventTarget> ipdlThread;
    (gdb) n
    [New LWP 5954]
    498       if (params.allocator) {
    (gdb) n
    499         ipdlThread = params.allocator->GetThread();
    (gdb) n
    [LWP 5954 exited]
    [New LWP 6110]
    500         if (!ipdlThread) {
    (gdb) n
    509       if (ipdlThread && !ipdlThread->IsOnCurrentThread()) {
    (gdb) n
    [LWP 6110 exited]
    867     ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h: No such file or directory.
    (gdb) n
    539       if (!actor) {
    (gdb) n
    548       actor->Destroy(params);
    (gdb) n
    496       nsCOMPtr<nsISerialEventTarget> ipdlThread;
    (gdb) n
    mozilla::layers::TextureClient::Destroy (this=this@entry=0x7fc55b6c10)
        at gfx/layers/client/TextureClient.cpp:591
    591         DeallocateTextureClient(params);
    (gdb) c
    Continuing.
    [...]

Here we see something similar. The inner condition is entered and the DeallocateTextureClient() method is ultimately called on the same thread. The data and actor values are both non-null.
To return to our original questions, I think this has answered both of them. First we can see that on ESR 78 this is definitely a place where memory is actually being freed. But on the other hand we also see it being freed on ESR 91. That doesn't mean that there isn't a problem here, but it does make it less likely.
Nevertheless there has been a change to this code. The mWorkaroundAnnoyingSharedSurfaceLifetimeIssues flag was removed by upstream. It's possible that this is causing the issue we're experiencing, so I'm going to reverse this and reinsert the removed code. I'm not really expecting this to fix things, but having travelled out into the sticks I now need to check under every stone. I've no choice but to figure this thing out if the WebView is going to get back up and running again.
[...]
Having worked carefully through the code and reintroduced the mWorkaroundAnnoyingSharedSurfaceLifetimeIssues variable and its associated logic, it's disappointing to find it's not fixed the issue. I'm not out of ideas yet though. Tomorrow I'm going to have a go at profiling the application and using specific memory tools (e.g. valgrind) to try to figure out which memory is being allocated but not deallocated. I'm not sure I hold out much hope of success using valgrind given that gecko is so big and messy and suffering from leakage as it is, but you never know.
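Separately from valgrind, and purely as an illustration of what "allocated but not deallocated" means in practice, a crude way to spot a leaking type is to count live instances. The sketch below is a different, much simpler technique than a memory profiler, and everything in it is hypothetical: LiveCounted and TrackedTexture are names invented for the example and aren't gecko classes.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical mixin: counts how many objects of type T currently exist.
template <typename T>
class LiveCounted {
 public:
  static std::size_t LiveCount() { return Counter().load(); }

 protected:
  LiveCounted() { ++Counter(); }
  LiveCounted(const LiveCounted&) { ++Counter(); }
  ~LiveCounted() { --Counter(); }

 private:
  // One counter per T; atomic so objects can be created and destroyed
  // on different threads.
  static std::atomic<std::size_t>& Counter() {
    static std::atomic<std::size_t> count{0};
    return count;
  }
};

// Hypothetical type to be tracked.
struct TrackedTexture : LiveCounted<TrackedTexture> {
  TrackedTexture(int w, int h) : width(w), height(h) {}
  int width;
  int height;
};
```

If a count like TrackedTexture::LiveCount() were logged periodically and only ever climbed, that would point at objects of that type being created without being destroyed. It wouldn't say where they were allocated, which is the part a tool like valgrind is meant to help with.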
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.