flypig.co.uk

List items

Items from the current list are shown below.

Blog

6 Mar 2024 : Day 177 #
Yesterday I spent the day fixing linker errors. By the evening there was one last error that popped up at the end of the full build that looked like this:
TextureClientSharedSurface.cpp:108: undefined reference to `vtable for mozilla::layers::SharedSurfaceTextureClient'
I promised to spend a bit of time today explaining what was going on and how I fixed it.

One of the reasons I want to explain is that these vtable errors are a little cryptic. They also related to an interesting implementation detail of C++. They're telling us that there's something wrong with the SharedSurfaceTextureClient implementation, but not telling us exactly what. The line number the error refers to is in the SharedSurfaceTextureClient constructor, which is also not particularly helpful:
SharedSurfaceTextureClient::SharedSurfaceTextureClient(
    SharedSurfaceTextureData* aData, TextureFlags aFlags,
    LayersIPCChannel* aAllocator)
    : TextureClient(aData, aFlags, aAllocator) {
}
The vtable the error is referring to is the "virtual table" of methods that can be dynamically overridden. When you subclass a class in C++ you can override certain methods from the parent class. In other words, when the code calls the method in the subclass, you can make sure it uses a method written for the subclass, rather than using the inherited method from the parent class. Both overriding and not overriding are useful. The whole point of creating a subclass is that you re-use some functionality from the parent class, so if you don't override a method it's great that the implementation from the parent class can be used.

However, sometimes you want to change the behaviour — or at least some of it — from the parent class. That might be where you'd override some of that functionality.

Where am I going with this? Bear with me! The point is that there are also two ways that a method can be overridden. It can be statically overridden or dynamically overridden. The former case is useful when the compiler can always tell that the method to be used is the one from the child class. In particular, if you never cast a class to make it look like its parent. If you cast it to the parent, any statically overridden methods will use the parent class's implementation.

However, if you dynamically override a method, the child's implementation will be used even if you cast the class upwards. Let's take some simple example (this doesn't appear in the Gecko codebase!).
#include <iostream>

using namespace std;

class Parent {
public:
    string hello() { return string("Hello son"); }    
    string wave() { return string("Waves"); }
    virtual string goodbye() { return string("Goodbye son"); }
};

class Child : public Parent {
public:
    string hello() { return string("Hello Mum"); }
    string goodbye() { return string("Goodbye Mum"); }
};

int main() {
    Parent* parent = new Parent();
    Child* child = new Child();
    Parent* vparent = dynamic_cast<Parent*>(child);

    cout << "1.  " << parent->hello() << "\n";
    cout << "2.  " << child->hello() << "\n";
    cout << "3.  " << vparent->hello() << "\n";

    cout << "4.  " << parent->wave() << "\n";
    cout << "5.  " << child->wave() << "\n";
    cout << "6.  " << vparent->wave() << "\n";
    
    cout << "7.  " << parent->goodbye() << "\n";
    cout << "8.  " << child->goodbye() << "\n";
    cout << "9.  " << vparent->goodbye() << "\n";
    
    cout << "10. " << parent << "\n";
    cout << "11. " << child << "\n";
    cout << "12. " << vparent << "\n";

    delete child;
    delete parent;

    return 0;
}
Here we can see from the class definition that the Child class is inheriting from the Parent class. The Child class doesn't have any implementation of the wave() method because we're expecting it to be inherited from the Parent. The Child class statically overrides its parent's hello() implementation but dynamically overrides its parent's goodbye() method. We can see this because of the virtual keyword that's added before the name of the method. Note that it's actually specified on the parent's goodbye() method. It's parents that decide whether their children can inherit methods virtually or not.

Now let's see what happens when we build and run this code.
$ g++ main.cpp
$ ./a.out 
1.  Hello son
2.  Hello Mum
3.  Hello son
4.  Waves
5.  Waves
6.  Waves
7.  Goodbye son
8.  Goodbye Mum
9.  Goodbye Mum
10. 0x563a9f694eb0
11. 0x563a9f694ed0
12. 0x563a9f694ed0
Notice how the Child class is still able to successfully call the wave() method (line 5). When the child class calls hello() and goodbye() it calls its own implementations of these methods.

However, there's also this peculiar vparent variable. This is the instance of Child that's been dynamically cast to look like a Parent. We can see that child and parent are actually the same object because they point to the same place (lines 11 and 12) compared to the parent object which has a different pointer (line 10).

Casing a Child to a Parent is perfectly safe because the former is inheriting from the latter. In other words, the code knows that everything a Parent has a Child will have as well. Safe.

But in the case of the dynamic cast, the calls to any overridden virtual methods are resolved at runtime rather than compile time. In particular, even though vparent is of type Parent, calling the goodbye() method on it calls the child's implementation of goodbye() (line 9).

The way this is achieved is with a vtable (this is where we're going!). This is a list of pointers to methods stored at runtime. Because goodbye() is marked as virtual in the Parent class, a pointer to this method is stored in the vtable. When the compiler calls the method it gets the address to call from the vtable, rather than from some fixed value stored at compile time.

Now when the Child dynamically overrides goodbye() it overwrites the value in the vtable. The result is that when the goodbye() class is called from vparent, the Child instance is referenced even though the class looks otherwise just like a Parent.

Finally we get to our error. The error is suggesting that there's some method that should be virtual, but no vtable has been created. This is most likely because there's a signature for a virtual method, but no implementation for any virtual method.

It's all a bit obscure, made more so because there are no virtual methods in the definition. However we do have ~SharedSurfaceTextureClient() in the class definition that doesn't have an implementation. Plus the class is inheriting from TextureData and this has a virtual destructor. As a result the ~SharedSurfaceTextureClient() destructor will also need to be added to the vtable.

But while ~SharedSurfaceTextureClient() appears in the class definition, there is no implementation of it. This is therefore likely what's causing our error.

The solution, after this very long explanation, is that we need to add in the implementation of the class destructor. Thankfully there's a destructor implementation in the ESR 78 code we can use:
SharedSurfaceTextureClient::~SharedSurfaceTextureClient() {
  // XXX - Things break when using the proper destruction handshake with
  // SharedSurfaceTextureData because the TextureData outlives its gl
  // context. Having a strong reference to the gl context creates a cycle.
  // This needs to be fixed in a better way, though, because deleting
  // the TextureData here can race with the compositor and cause flashing.
  TextureData* data = mData;
  mData = nullptr;

  Destroy();

  if (data) {
    // Destroy mData right away without doing the proper deallocation handshake,
    // because SharedSurface depends on things that may not outlive the
    // texture's destructor so we can't wait until we know the compositor isn't
    // using the texture anymore. It goes without saying that this is really bad
    // and we should fix the bugs that block doing the right thing such as bug
    // 1224199 sooner rather than later.
    delete data;
  }
}
As I mentioned yesterday, the partial builds all passed. But last night I also set off the full build again. So what was the outcome of this? The good news is that the full build went through as well, as a consequence of which I now have five shiny new xulrunner-qt5 packages to test on my phone. I'm not expecting them to work first time, but testing them is an essential step in finding out how to progress. Let's give them a go...
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) r
Starting program: /usr/bin/harbour-webview 
[...]
=============== Preparing offscreen rendering context ===============
[New LWP 29323]

Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 29318]
mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:141
141       return mPresenter->mBackBuffer->mFb->mSize;
(gdb) bt
#0  mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:141
#1  0x0000007ff3666884 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4b7c500, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff12b6fbc in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4d23b20, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    LayersTypes.h:82
#3  0x0000007ff12b7018 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4b7c500)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b70a4 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4b7c500, x=<optimized out>, y=<optimized out>, 
    width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12afc40 in mozilla::detail::RunnableMethodArguments<int, int,
    int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul,
    2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#17 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) p this
$1 = (const mozilla::gl::SwapChain * const) 0x0
(gdb) 
Immediately there's a segmentation fault and it feels like we've been here before. Here's where the crash is triggered:
const gfx::IntSize& SwapChain::OffscreenSize() const {
  return mPresenter->mBackBuffer->mFb->mSize;
}
It's being triggered because the instance of SwapChain this is being called from is null. But hang on; didn't we just spend days getting rid of the SwapChain code and swapping in GLScreenBuffer precisely to avoid this error? Well, yes, we did. Clearly after changing all that code I missed a case of SwapChain being used.

The fix is to switch back from SwapChain to the previous GLScreenBuffer interface, like to:
@@ -151,8 +153,7 @@ EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget(VsyncId aId)
 
   if (context->IsOffscreen()) {
     MutexAutoLock lock(mRenderMutex);
-    if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize
-      && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) {
+    if (context->OffscreenSize() != mEGLSurfaceSize &&
+        !context->ResizeOffscreen(mEGLSurfaceSize)) {
       return;
     }
   }
I've also checked through the rest of the EmbedliteCompositionBridgeParent code to ensure there are no other SwapChain references in there. And, with that, we're done for the day. I've spent a long time explaining vtables today and very little time actually coding, but the change is an essential one and I can't test it without another build. So I've set it building again and with any luck I'll be able to test it again tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

Comments

Uncover Disqus comments