flypig.co.uk

Blog

25 Mar 2024 : Day 196
Yesterday I finally narrowed down the error causing the WebView app to seize up during execution to a call to EglDisplay::fCreateImage(). Now it may not be this call itself that's the problem: it might be the way the result is used, the fact that it's not being freed properly, or the parameters that are being passed in to it. But the fact that we've narrowed it down is likely to be a big help in figuring things out.

The call itself goes through to a method that looks like this:
  EGLImage fCreateImage(EGLContext ctx, EGLenum target, EGLClientBuffer buffer,
                        const EGLint* attribList) const {
    MOZ_ASSERT(HasKHRImageBase());
    return mLib->fCreateImage(mDisplay, ctx, target, buffer, attribList);
  }
Here mLib is an instance of GLLibraryEGL. It looks like we have several layers of wrappers here so let's continue digging. This goes through to the following method that's part of GLLibraryEGL:
  EGLImage fCreateImage(EGLDisplay dpy, EGLContext ctx, EGLenum target,
                        EGLClientBuffer buffer,
                        const EGLint* attrib_list) const {
    WRAP(fCreateImageKHR(dpy, ctx, target, buffer, attrib_list));
  }
That looks similar but it's not quite the same. It is just another wrapper though, this time going through to a dynamically created method. The WRAP() macro looks like this:
#define WRAP(X)                \
  PROFILE_CALL                 \
  BEFORE_CALL                  \
  const auto ret = mSymbols.X; \
  AFTER_CALL                   \
  return ret
The PROFILE_CALL, BEFORE_CALL and AFTER_CALL lines are all macros which turn into something functional in the Android build, but in our build are just empty. That means that the WRAP(fCreateImageKHR(dpy, ctx, target, buffer, attrib_list)) statement actually reduces down to just the following:
  const auto ret = mSymbols.fCreateImageKHR(dpy, ctx, target, buffer,
    attrib_list);
  return ret;
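To make the expansion concrete, here's a minimal sketch that can be compiled on its own. It assumes the three hook macros are empty, as they are in our build, and swaps the real EGL symbol table for a toy one: ToySymbols, ToyLibrary and AddOneImpl are hypothetical names that exist only for this illustration.

```cpp
#include <cassert>

// Assumption: the three hook macros are empty, as in the non-Android build.
#define PROFILE_CALL
#define BEFORE_CALL
#define AFTER_CALL

#define WRAP(X)                \
  PROFILE_CALL                 \
  BEFORE_CALL                  \
  const auto ret = mSymbols.X; \
  AFTER_CALL                   \
  return ret

// Hypothetical symbol table holding a single function pointer.
struct ToySymbols {
  int (*fAddOne)(int value);
};

// Hypothetical function standing in for a symbol loaded from a library.
inline int AddOneImpl(int value) { return value + 1; }

struct ToyLibrary {
  ToySymbols mSymbols{&AddOneImpl};

  // After preprocessing the body reads:
  //   const auto ret = mSymbols.fAddOne(value); return ret;
  // The trailing semicolon comes from the call site, not from the macro.
  int fAddOne(int value) const {
    assert(mSymbols.fAddOne != nullptr);
    WRAP(fAddOne(value));
  }
};
```

Calling ToyLibrary().fAddOne(41) goes through the wrapper and the function pointer to reach AddOneImpl(), the same shape of indirection as the fCreateImage() wrappers.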
The mSymbols object has the following defined on it:
    EGLImage(GLAPIENTRY* fCreateImageKHR)(EGLDisplay dpy, EGLContext ctx,
                                          EGLenum target,
                                          EGLClientBuffer buffer,
                                          const EGLint* attrib_list);
Here EGLImage is a typedef of void* and GLAPIENTRY is an empty define, giving us a final result that looks like this:
    void* (*fCreateImageKHR)(EGLDisplay dpy, EGLContext ctx,
                             EGLenum target,
                             EGLClientBuffer buffer,
                             const EGLint* attrib_list);
We're still not quite there though. Inside GLLibraryEGL.cpp we find this:
    const SymLoadStruct symbols[] = {SYMBOL(CreateImageKHR),
                                     SYMBOL(DestroyImageKHR), END_OF_SYMBOLS};
    (void)fnLoadSymbols(symbols);
This is packing symbols with some data which is then passed in to fnLoadSymbols(), a method for loading symbols from a dynamically loaded library. The define that's used here is the following:
#define SYMBOL(X)                 \
  {                               \
    (PRFuncPtr*)&mSymbols.f##X, { \
      { "egl" #X }                \
    }                             \
  }
Notice how here it's playing around with the input argument so that, with a little judicious simplification for clarity, SYMBOL(CreateImageKHR) becomes:
  mSymbols.fCreateImageKHR, {{ "eglCreateImageKHR" }}
In other words (big reveal, but no big surprise) a call to mSymbols.fCreateImageKHR() will get converted into a call to the EGL function eglCreateImageKHR, loaded in from the EGL driver.
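
The token pasting and stringification can be tried out in isolation. The sketch below is a toy under stated assumptions, not the Gecko loader: ToySymLoadStruct, ToyLookup() and ToyLoadSymbols() are hypothetical stand-ins for the real types and for the dynamic library lookup, and it copies the pointer with memcpy() rather than using the PRFuncPtr cast from the original macro.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical generic function pointer type used by the toy loader.
using ToyFuncPtr = void (*)();

// Hypothetical table entry: where to store the address, and the name to find.
struct ToySymLoadStruct {
  void* storage;
  const char* symName;
};

struct ToySymbols {
  void* (*fCreateImageKHR)() = nullptr;
  int (*fDestroyImageKHR)() = nullptr;
};

// f##X pastes the "f" prefix onto the member name; "egl" #X builds the
// string to look up, so CreateImageKHR gives "eglCreateImageKHR".
#define TOY_SYMBOL(S, X) {&(S).f##X, "egl" #X}

// Hypothetical functions standing in for the driver's implementations.
inline void* FakeCreateImage() { return nullptr; }
inline int FakeDestroyImage() { return 1; }

// Pretend dynamic lookup, mapping a name to an address.
inline ToyFuncPtr ToyLookup(const char* name) {
  if (std::strcmp(name, "eglCreateImageKHR") == 0)
    return reinterpret_cast<ToyFuncPtr>(&FakeCreateImage);
  if (std::strcmp(name, "eglDestroyImageKHR") == 0)
    return reinterpret_cast<ToyFuncPtr>(&FakeDestroyImage);
  return nullptr;
}

// Walk the table up to the {nullptr, nullptr} sentinel, filling in each slot.
inline bool ToyLoadSymbols(const ToySymLoadStruct* symbols) {
  assert(symbols != nullptr);
  bool ok = true;
  for (; symbols->storage != nullptr; ++symbols) {
    const ToyFuncPtr fn = ToyLookup(symbols->symName);
    std::memcpy(symbols->storage, &fn, sizeof fn);
    ok = ok && (fn != nullptr);
  }
  return ok;
}
```

A table built from TOY_SYMBOL(syms, CreateImageKHR) and TOY_SYMBOL(syms, DestroyImageKHR), followed by a {nullptr, nullptr} sentinel, can be handed to ToyLoadSymbols(); afterwards syms.fDestroyImageKHR() calls straight through to FakeDestroyImage().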

What does this do? According to the documentation:
 
eglCreateImage is used to create an EGLImage object from an existing image resource buffer. display specifies the EGL display used for this operation. context specifies the EGL client API context used for this operation, or EGL_NO_CONTEXT if a client API context is not required. target specifies the type of resource being used as the EGLImage source (examples include two-dimensional textures in OpenGL ES contexts and VGImage objects in OpenVG contexts). buffer is the name (or handle) of a resource to be used as the EGLImage source, cast into the type EGLClientBuffer. attrib_list is a list of attribute-value pairs which is used to select sub-sections of buffer for use as the EGLImage source, such as mipmap levels for OpenGL ES texture map resources, as well as behavioral options, such as whether to preserve pixel data during creation. If attrib_list is non-NULL, the last attribute specified in the list must be EGL_NONE.

Super. Where does that leave us? Well, it tells us that the call to fCreateImage() in our SharedSurface_EGLImage::Create() is really just a bunch of simple wrapper calls that ends up calling an EGL function. What could be going wrong? One obvious potential problem is that the input parameters may be messed up. Another one is that each call to eglCreateImageKHR() creating an EGLImage object should be balanced out with a call to eglDestroyImageKHR() to destroy it.
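
One way to picture the balance requirement is a scoped handle that creates in its constructor and destroys in its destructor. This is only a sketch: ToyEgl and ScopedToyImage are hypothetical names, and the toy create and destroy methods merely count live images rather than calling into EGL.

```cpp
#include <cassert>

// Hypothetical stand-in for the EGL create/destroy pair; it only counts.
struct ToyEgl {
  int live = 0;

  void* CreateImage() {
    ++live;
    return this;  // any non-null value will do as a toy handle
  }

  bool DestroyImage(void* image) {
    if (image == nullptr) return false;
    --live;
    return true;
  }
};

// Hypothetical scoped owner: one create per object, one destroy per object.
class ScopedToyImage {
 public:
  explicit ScopedToyImage(ToyEgl& egl)
      : mEgl(egl), mImage(egl.CreateImage()) {}

  ~ScopedToyImage() {
    const bool ok = mEgl.DestroyImage(mImage);
    assert(ok);
    (void)ok;  // silence unused-variable warnings when asserts are disabled
  }

  // Copying would lead to a double destroy, so forbid it.
  ScopedToyImage(const ScopedToyImage&) = delete;
  ScopedToyImage& operator=(const ScopedToyImage&) = delete;

  void* get() const { return mImage; }

 private:
  ToyEgl& mEgl;
  void* mImage;
};
```

With two ScopedToyImage objects in scope the live count is two, and it drops back to zero when they go out of scope.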

We do have a call to eglDestroyImageKHR() happening in our SharedSurface_EGLImage destructor. It looks like this:
SharedSurface_EGLImage::~SharedSurface_EGLImage() {
  const auto& gle = GLContextEGL::Cast(mDesc.gl);
  const auto& egl = gle->mEgl;
  egl->fDestroyImage(mImage);
[...]
There's an unexpected difference with the way it's called in ESR 78, where the code looks like this:
SharedSurface_EGLImage::~SharedSurface_EGLImage() {
  const auto& gle = GLContextEGL::Cast(mGL);
  const auto& egl = gle->mEgl;
  egl->fDestroyImage(egl->Display(), mImage);
[...]
Notice the extra egl->Display() value being passed in as a parameter. That's because in ESR 91 EglDisplay is storing its own copy of the EGLDisplay:
  EGLBoolean fDestroyImage(EGLImage image) const {
    MOZ_ASSERT(HasKHRImageBase());
    return mLib->fDestroyImage(mDisplay, image);
  }
That gives us a couple of things to look into: first, is the correct value being passed in for image? Second, is the value stored for mDisplay valid? The underlying call to eglDestroyImage also has a Boolean return value, which will be EGL_FALSE if something goes wrong. A nice first step would be to check this return value in case it's indicating a problem. To do this I've added some additional debug output to the code:
  EGLBoolean result = egl->fDestroyImage(mImage);
  printf_stderr("RENDER: fDestroyImage() return value: %d\n", result);
The result of running it shows a large number of successful calls to fDestroyImage():
[...]
[JavaScript Warning: "Layout was forced before the page was fully loaded. 
    If stylesheets are not yet loaded this may cause a flash of unstyled 
    content." {file: "https://jolla.com/themes/unlike/js/
    modernizr.js?x98582&ver=2.6.2" line: 4}]
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
RENDER: fDestroyImage() return value: 1
[...]
Since this output looks okay I've taken it a step further and added a count to the creation and deletion calls in case it shows any imbalance between the two.
[...]
Frame script: embedhelper.js loaded
RENDER: fCreateImage() return value: 1, 0
RENDER: fCreateImage() return value: 1, 1
CONSOLE message:
[JavaScript Warning: "This page uses the non standard property “zoom”. 
    Consider using calc() in the relevant property values, or using “transform” 
    along with “transform-origin: 0 0”." {file: "https://jolla.com/
    " line: 0}]
CONSOLE message:
[JavaScript Warning: "Layout was forced before the page was fully loaded. 
    If stylesheets are not yet loaded this may cause a flash of unstyled 
    content." {file: "https://jolla.com/themes/unlike/js/
    modernizr.js?x98582&ver=2.6.2" line: 4}]
RENDER: fCreateImage() return value: 1, 2
RENDER: fDestroyImage() return value: 1, 0
RENDER: fCreateImage() return value: 1, 3
RENDER: fDestroyImage() return value: 1, 1
[...]

RENDER: fCreateImage() return value: 1, 316
RENDER: fDestroyImage() return value: 1, 314
RENDER: fCreateImage() return value: 1, 317
RENDER: fDestroyImage() return value: 1, 315
[...]
The increasing numbers (going up to 317 and 315 here) tell us that the balance between creates and destroys is pretty clean. There are two creates at the start which don't have matching destroys, after which every create is paired with a destroy. It seems unlikely therefore that this is the cause of the seize-ups. What's more, it all makes sense too: there should be a front and a back buffer, so there should be exactly two images in existence once each pair of calls has completed. That's a situation that's confirmed by the numbers.
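
The arithmetic can be checked with a small counter. The counts are zero-indexed, so a final create index of 317 means 318 creates and a final destroy index of 315 means 316 destroys, leaving 318 - 316 = 2 images alive. The sketch below is hypothetical instrumentation rather than the code I actually added: ImageCallCounter is an invented name, and the first number printed is assumed to be whatever result value the caller passes in.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper that logs in the same "result, index" shape as above.
struct ImageCallCounter {
  int creates = 0;
  int destroys = 0;

  // Print the result alongside the zero-based index of this create call.
  void OnCreate(int result) {
    std::fprintf(stderr, "RENDER: fCreateImage() return value: %d, %d\n",
                 result, creates);
    ++creates;
  }

  // Print the result alongside the zero-based index of this destroy call.
  void OnDestroy(int result) {
    std::fprintf(stderr, "RENDER: fDestroyImage() return value: %d, %d\n",
                 result, destroys);
    ++destroys;
  }

  // Number of images created but not yet destroyed.
  int Live() const {
    assert(destroys <= creates);
    return creates - destroys;
  }
};
```

Replaying the pattern from the log (two creates, then alternating create and destroy) leaves Live() at two after every destroy.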

Just to ensure this matches the behaviour of the previous version I've also tested the same using the debugger on ESR 78. I got the same sequence of calls: first two creates, followed by balanced create and destroy calls, so that two images remain in existence after each pair:
fCreateImage
fCreateImage
fCreateImage
fDestroyImage
fCreateImage
fDestroyImage
fCreateImage
[...]
In conclusion everything here looks in order on ESR 91. So tomorrow I'll move on to checking that the display value is set correctly.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
