flypig.co.uk

List items

Items from the current list are shown below.

Gecko

3 Aug 2024 : Day 308 #
Yesterday I completed step one of my three step programme: find where the problematic instance of EmbedLitePuppetWidget is being created on both ESR 78 and ESR 91.

Today it's step two, which is to find where the mLayerManager for the problematic EmbedLitePuppetWidget instance is being created. Eventually I'll get to digging for this using the debugger, but first I'm going to look through the backtraces from yesterday and see if there are clues in the code that might point to something useful.

There is something notable in amongst those backtraces which comes out from closer inspection. You may recall that I was a little surprised by the fact that there are three instances of EmbedLitePuppetWidget created on ESR 78, but five on ESR 91. That's an unusual difference. The widgets start as windows and form a tree, each related to an active item on the page. The pages are the same, there's only the one window; why the discrepancy?

This is intriguing, so while it might turn out to be completely explainable and unrelated to the issue I'm trying to fix, I do think it's worth spending the time to investigate further.

It seems like it might be relevant though, because the widget that's causing the crash appears to be one of the widgets that's not being created on ESR 78. There's a widget on ESR 78 that relates to the Select item on the page, but the backtrace for the widget on ESR 91 shows that it's being created as a child of another widget. We don't have to go too far through the backtrace to notice this either, just the top two items will do:
#0  mozilla::embedlite::EmbedLitePuppetWidget::EmbedLitePuppetWidget (
    this=0x7fb86664a0, view=0x0)
    at mobile/sailfishos/embedshared/EmbedLitePuppetWidget.cpp:47
#1  0x0000007ff4c4d828 in mozilla::embedlite::EmbedLitePuppetWidget::
    CreateChild (this=0x7fb8aaf500, aRect=..., aInitData=0x7fde7be0c0,
    aForceUseIWidgetParent=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
Notice how the value for this is different between the first and second frames: 0x7fb86664a0 and 0x7fb8aaf500 respectively. That's become one is creating the other, executed using the CreateChild() method which looks like this:
already_AddRefed<nsIWidget>
EmbedLitePuppetWidget::CreateChild(const LayoutDeviceIntRect &aRect,
                                   nsWidgetInitData* aInitData,
                                   bool              aForceUseIWidgetParent)
{
  if (Destroyed()) {
    return nullptr;
  }

  LOGT();
  bool isPopup = IsPopup(aInitData);
  nsCOMPtr<nsIWidget> widget = new EmbedLitePuppetWidget(nullptr);
  nsresult rv = widget->Create(isPopup ? nullptr : this, nullptr, aRect, 
    aInitData);
  return NS_FAILED(rv) ? nullptr : widget.forget();
}
Here's the signature of the widget->Create() call in there:
nsresult PuppetWidgetBase::Create(
    nsIWidget *aParent,
    nsNativeWidget aNativeParent,
    const LayoutDeviceIntRect &aRect,
    nsWidgetInitData *aInitData
)
One of my immediate thoughts when looking at the code is that if the value for isPopup is set to true, then the aParent value going in to this Create() method will be set to null. I wonder if this is the problem: that the new widget has no parent, so when it goes to traverse the tree to find the root where the layer manager is stored, it has no where to go?

To test out this hypothesis I set a breakpoint on CreateChild(). Sure enough the value of isPopup is set to true, so I've attempted to change the value to false using the debugger. Unfortunately this isn't possible due to the way the underlying code has been optimised (the isPopup variable has been optimised away). So I've ended up stepping into the Create() method and changing the value of aParent from null to the memory location of the parent instead. this was all done on ESR 91, since as I explained, there is no equivalent flow on ESR 78. Here's me changing the value:
mozilla::embedlite::PuppetWidgetBase::Create (this=0x7fb8f94bb0, aParent=0x0, 
    aNativeParent=0x0, aRect=..., aInitData=0x7fde7be0c0)
    at mobile/sailfishos/embedshared
/PuppetWidgetBase.cpp:46
46        LOGT(&quot;Puppet: %p, parent: %p&quot;, this, aParent);
(gdb) set var aParent = 0x7fb913f480
(gdb) print aParent
$20 = (nsIWidget *) 0x7fb913f480
(gdb) finish
But after continuing execution after making this change, the browser still crashes as before. Okay, so it's not that. Let's go a bit further back then and find out why this additional EmbedLitePuppetWidget is being created at all. The backtrace for its creation happens as a result of tapping on the Select widget on the screen. At the point when I press the screen, ESR 78 and ESR 91 execution should be in pretty much the same, or at least similar, state. So there should be a point in the ESR 91 backtrace for the creation of the EmbedLitePuppetWidget that exists on both ESR 78 and ESR 91. If we continue up the backtrace towards the PC end the two will eventually diverge. If we can find the point of divergence, there should be something in the code that's causing that divergence to happen.

So here's my plan: while running on ESR 78, place breakpoints on each of the calls from the ESR 91 backtrace. Starting with the call at the PC end, we already know that the backtrace won't be hit (because the additional EmbedLitePuppetWidget doesn't get created on ESR 78). And indeed that's the case; there is no hit:
Thread 1 &quot;sailfish-browse&quot; received signal SIGINT, Interrupt.
0x0000007fb7352740 in __GI___poll (fds=0x5555c7e8b0, nfds=5, timeout=<optimized 
    out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41        return SYSCALL_CANCEL (ppoll, fds, nfds, timeout_ts_p, NULL, 0);
(gdb) b EmbedLitePuppetWidget::CreateChild
Breakpoint 3 at 0x7fbca925d0: EmbedLitePuppetWidget::CreateChild. (2 locations)
(gdb) c
Continuing.
[...]
So we have to go one further down the ESR 91 backtrace, add a breakpoint, run the code, press the button and see whether it hits. No hit:
Thread 1 &quot;sailfish-browse&quot; received signal SIGINT, Interrupt.
0x0000007fb7352740 in __GI___poll (fds=0x5555c7e8b0, nfds=5, timeout=<optimized 
    out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41        return SYSCALL_CANCEL (ppoll, fds, nfds, timeout_ts_p, NULL, 0);
(gdb) b nsView::CreateWidgetForPopup
Breakpoint 4 at 0x7fbbdb05b0: file view/nsView.cpp, line 587.
(gdb) c
Continuing.
[..]
Do the same again. No hit this time either:
(gdb) b nsComboboxControlFrame::ShowList
Breakpoint 6 at 0x7fbc0b19b0: file layout/forms/nsComboboxControlFrame.cpp, 
    line 324.
(gdb) c
Continuing.
[...]
Until eventually we get to the call to nsComboboxControlFrame::SetFocus() from the ESR 91 backtrace. This time the breakpoint hits:
(gdb) b nsComboboxControlFrame::SetFocus
Breakpoint 8 at 0x7fbc0b4148: file layout/forms/nsComboboxControlFrame.cpp, 
    line 262.
(gdb) c
Continuing.
[W] unknown:0 - QConnmanEngine: Unable to translate the bearer type of the 
    unknown connection type: &quot;&quot;
[W] unknown:0 - QConnmanEngine: Unable to translate the bearer type of the 
    unknown connection type: &quot;&quot;

Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 8, nsComboboxControlFrame::
    SetFocus (this=0x7f8124b078, aOn=true, aRepaint=true)
    at layout/forms/nsComboboxControlFrame.cpp:262
262     void nsComboboxControlFrame::SetFocus(bool aOn, bool aRepaint) {
(gdb) 
Well that is interesting. That suggests that somewhere in the SetFocus() method there's some control flow which goes one way on ESR 78 and the other way on ESR 91. The code is the same for them both and looks like this:
void nsComboboxControlFrame::SetFocus(bool aOn, bool aRepaint) {
  AutoWeakFrame weakFrame(this);
  if (aOn) {
    nsListControlFrame::ComboboxFocusSet();
    sFocused = this;
    if (mDelayedShowDropDown) {
      ShowDropDown(true);  // might destroy us
      if (!weakFrame.IsAlive()) {
        return;
      }
    }
  } else {
    sFocused = nullptr;
    mDelayedShowDropDown = false;
    if (mDroppedDown) {
      mListControlFrame->ComboboxFinish(mDisplayedIndex);  // might destroy us
      if (!weakFrame.IsAlive()) {
        return;
      }
    }
    // May delete |this|.
    mListControlFrame->FireOnInputAndOnChange();
  }

  if (!weakFrame.IsAlive()) {
    return;
  }

  // This is needed on a temporary basis. It causes the focus
  // rect to be drawn. This is much faster than ReResolvingStyle
  // Bug 32920
  InvalidateFrame();
}
The important line here is the call to ShowDropDown() because this is in the ESR 91 backtrace, but isn't being called on ESR 78. It goes on to call ShowList() that we can also see in the ESR 91 backtrace. I'm going to step through the SetFocus() method on ESR 78 to find out why this isn't getting called.
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 9, nsComboboxControlFrame::
    SetFocus (this=0x7f8124b078, aOn=true, aRepaint=true)
    at layout/forms/nsComboboxControlFrame.cpp:262
262     void nsComboboxControlFrame::SetFocus(bool aOn, bool aRepaint) {
(gdb) n
5078      MOZ_IMPLICIT AutoWeakFrame(nsIFrame* aFrame)
(gdb) n
264       if (aOn) {
(gdb) p aOn
$2 = true
(gdb) n
265         nsListControlFrame::ComboboxFocusSet();
(gdb) n
266         sFocused = this;
(gdb) n
267         if (mDelayedShowDropDown) {
(gdb) p mDelayedShowDropDown
$3 = false
(gdb) n
5105      bool IsAlive() const { return !!mFrame; }
(gdb) n
293       InvalidateFrame();
(gdb) n
263       AutoWeakFrame weakFrame(this);
(gdb) 
As we can clearly see from this execution, the reason is because mDelayedShowDropDown is set to false. In order to go down the branch that calls ShowDropDown() this variable would have to be set to true. Let's do the same step-through on ESR 91:
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, nsComboboxControlFrame::
    SetFocus (this=0x7fb916f718, aOn=true, aRepaint=true)
    at layout/forms/nsComboboxControlFrame.cpp:260
260     void nsComboboxControlFrame::SetFocus(bool aOn, bool aRepaint) {
(gdb) n
261       AutoWeakFrame weakFrame(this);
(gdb) n
262       if (aOn) {
(gdb) p aOn
$1 = true
(gdb) n
263         nsListControlFrame::ComboboxFocusSet();
(gdb) n
264         sFocused = this;
(gdb) n
265         if (mDelayedShowDropDown) {
(gdb) p mDelayedShowDropDown
$2 = true
(gdb) n
266           ShowDropDown(true);  // might destroy us
(gdb) n
5610      bool IsAlive() const { return !!mFrame; }
(gdb) n
291       InvalidateFrame();
(gdb) n
And there it is. The value of mDelayedShowDropDown is true!

This is surprising and also useful material. A search of the full gecko code shows that mDelayedShowDropDown is only ever set inside the nsComboboxControlFrame method (it's protected and not exposed through any other method or referenced in any other files). There is, in fact, only one place in the entire codebase where the value gets set to true and that's in the nsComboboxControlFrame::ShowDropDown() method. It's not a very long method, so it's worth me copying out the full source here:
void nsComboboxControlFrame::ShowDropDown(bool aDoDropDown) {
  MOZ_ASSERT(!XRE_IsContentProcess());
  mDelayedShowDropDown = false;
  EventStates eventStates = mContent->AsElement()->State();
  if (aDoDropDown && eventStates.HasState(NS_EVENT_STATE_DISABLED)) {
    return;
  }

  if (!mDroppedDown && aDoDropDown) {
    nsFocusManager* fm = nsFocusManager::GetFocusManager();
    if (!fm || fm->GetFocusedElement() == GetContent()) {
      DropDownPositionState state = AbsolutelyPositionDropDown();
      if (state == eDropDownPositionFinal) {
        ShowList(aDoDropDown);  // might destroy us
      } else if (state == eDropDownPositionPendingResize) {
        // Delay until after the resize reflow, see nsAsyncResize.
        mDelayedShowDropDown = true;
      }
    } else {
      // Delay until we get focus, see SetFocus().
      mDelayedShowDropDown = true;
    }
  } else if (mDroppedDown && !aDoDropDown) {
    ShowList(aDoDropDown);  // might destroy us
  }
}
The critical parts are the line in the middle that sets mDelayedShowDropDown to true with the explanatory comment stating "Delay until after the resize reflow, see nsAsyncResize" and the identical line towards the end with the comment "Delay until we get focus, see SetFocus()". So we should step through this method to take a look at whether either of these branches that set mDelayedShowDropDown to true are being executed or not. Here's the step through of this method on ESR 91:
Thread 10 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, nsComboboxControlFrame::
    ShowDropDown (this=0x7fb9117528, aDoDropDown=aDoDropDown@entry=true)
    at layout/forms/nsComboboxControlFrame.cpp:890
890     void nsComboboxControlFrame::ShowDropDown(bool aDoDropDown) {
(gdb) n
892       mDelayedShowDropDown = false;
(gdb) p mDelayedShowDropDown
$1 = false
(gdb) n
894       if (aDoDropDown && eventStates.HasState(NS_EVENT_STATE_DISABLED)) {
(gdb) n
898       if (!mDroppedDown && aDoDropDown) {
(gdb) p mDroppedDown
$2 = false
(gdb) p aDoDropDwown
No symbol &quot;aDoDropDwown&quot; in current context.
(gdb) p aDoDropDown
$3 = true
(gdb) n
899         nsFocusManager* fm = nsFocusManager::GetFocusManager();
(gdb) n
900         if (!fm || fm->GetFocusedElement() == GetContent()) {
(gdb) n
827       nsIContent* GetContent() const { return mContent; }
(gdb) n
910           mDelayedShowDropDown = true;
(gdb) p fm
$4 = <optimized out>
And we can see that it is being set to true here, in the second of the two cases. The reason it's going this way is because the condition (!fm || fm->GetFocusedElement() == GetContent()) is false. Unfortunately fm has been optimised out which makes it harder for us to find out exactly which part of the condition is failing.

This is all rather a lot to absorb, but I'm hoping that all this will turn out to be key to determining what's causing ESR 91 to crash. It's super late now here, so time for me to pause for the day. I'll pick this up again tomorrow morning to find out what the equivalent flow is on ESR 78.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comments

Uncover Disqus comments