29 Aug 2023 : Day 13
Yesterday we discussed a mysterious compiler segfault, which we worked around by reducing the optimisation level of the build from O2 to O1. We ended up with a Rust error that also triggered a segfault.
Before we look at that Rust error, a couple of readers made really good suggestions that deserve a mention.
First Fabrice Desré suggested on Mastodon that it would be worth switching from using gcc to clang for the build:
@flypig you're not building with clang instead of gcc? [...] I'm just wondering if clang would not fare better than gcc. Gecko prebuilt toolchain that you can install with ./mach bootstrap is clang based these days.
That's a really useful point to make. I double-checked the build and it's definitely using gcc, and it sounds like we really should be using clang. At the outset of this journey I did try using ./mach bootstrap but I didn't get good results and ended up using ./mach create-mach-environment instead.
I don't plan to look into this now as changing the optimisation seems to have done the job. But this is definitely something to look into, either once the build is working, or in case there are later failures that don't seem to have another solution.
The reason I'm not looking into this now is that, in spite of all evidence to the contrary, there is a plan here, which I need to stick to: get a successful build working as quickly as possible. Once the build is working, there will be more time to try to do things properly. I'm really grateful for the input from Fabrice, and it's important not to forget along the way that this needs investigation.
I also got a great suggestion from Nephros on the Sailfish Forum.
I'm sure I'm telling nothing new here, but -O2 is just a shorthand for a set of optimization switches. Maybe it's useful to try specifying them manually, and "bisect" the list to find which precise switch is behind this.
gcc -Q --help=optimizers
This is also a really neat idea. If it's possible to narrow this down to a single compiler option, not only will it help avoid unnecessary de-optimisation, it might also clarify what the underlying compiler bug is (if that's what it is) that's causing the segfault.
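The bisection Nephros describes can be sketched as a simple halving search. This is a minimal sketch under strong assumptions: that exactly one flag is responsible, that flags act independently, and that building with -O1 plus a subset of the extra -O2 flags is a fair stand-in for -O2 (the GCC documentation notes that not every optimisation is controlled by a flag, so this may not hold). The function name, the `build_fails` callback and the choice of culprit are hypothetical; the flag list is just an illustrative subset of real GCC options.

```python
from typing import Callable, List, Sequence


def bisect_flags(candidates: Sequence[str],
                 build_fails: Callable[[List[str]], bool]) -> str:
    """Return the one flag whose presence makes the build fail.

    Assumes exactly one flag in `candidates` is responsible.
    `build_fails(flags)` should run the build with -O1 plus `flags`
    and report whether the compiler crashed.
    """
    remaining = list(candidates)
    while len(remaining) > 1:
        half = remaining[: len(remaining) // 2]
        if build_fails(half):
            # The culprit is somewhere in the half we just tried.
            remaining = half
        else:
            # The half we tried is clean, so look in the other half.
            remaining = remaining[len(half):]
    return remaining[0]


# Illustrative subset of flags that -O2 enables on top of -O1.
O2_ONLY_FLAGS = [
    "-fcode-hoisting",
    "-fgcse",
    "-finline-small-functions",
    "-fipa-sra",
    "-fstrict-aliasing",
    "-ftree-vrp",
]

# Stand-in for a real build: pretend one (arbitrarily chosen) flag crashes gcc.
HYPOTHETICAL_CULPRIT = "-ftree-vrp"


def fake_build_fails(flags: List[str]) -> bool:
    return HYPOTHETICAL_CULPRIT in flags


found = bisect_flags(O2_ONLY_FLAGS, fake_build_fails)
print(found)
```

In a real run `build_fails` would rebuild the offending file, so each step costs one compile; the halving keeps that to roughly log2 of the number of flags.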
What's more, direc85 has very generously offered to try this out:
That sounds like a lot of grunt work that needs some horsepower... No promises yet, but that’s something I may be able to help with.
Having now read the daily nerd snipe over lunch, one thought popped into my mind: what if it is architecture specific? If it compiles for armv7hl but not aarch64, that would indeed indicate a compiler bug... Something that may have been fixed already, somewhat likely... I’ll do some research after work, let’s see what I can find.
I think nephros and direc85 are onto something here. And once again, these interactions have really highlighted to me how important it is to have external input on all this.
Let's now move on to the Rust components that we were looking at yesterday and try to figure out why they're failing to build. First we had glslopt, which was being built for the wrong architecture. Then we had a strange optimisation failure building webrender. The last thing we saw was the style component failing to build.
The error generated was this:
6:29.21 fatal runtime error: Rust cannot catch foreign exceptions
6:29.35 warning: `style` (lib) generated 5 warnings
6:29.35 error: could not compile `style`; 5 warnings emitted
6:29.35 Caused by:
6:29.35 process didn't exit successfully: `/usr/bin/rustc --crate-name [...]` (signal: 6, SIGABRT: process abort signal)
This is slightly different from the previous errors: it's not immediately obvious that it's an optimisation error, but it's also not obvious that it's a genuine coding or configuration error. It could be a compiler bug, or something else entirely. It's also not something we saw when working on ESR 78. Finally, it's not something that appears to be well-documented on the Web.
Things I've not seen before make me nervous, especially if a Web search isn't throwing up good leads.
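The "process didn't exit successfully" line has a regular shape, so the failing command and the signal that killed it can be pulled out of a saved build log mechanically. A minimal sketch, assuming the message format stays exactly as shown in this post; the function name and the regular expression are illustrative and not part of cargo or mach.

```python
import re
from typing import Optional, Tuple

# Matches the failure line format seen in the build output above.
FAILURE_LINE = re.compile(
    r"process didn't exit successfully: `(?P<cmd>.*)` "
    r"\(signal: (?P<num>\d+), (?P<name>SIG[A-Z0-9]+)"
)


def find_failed_command(log_text: str) -> Optional[Tuple[str, int, str]]:
    """Return (command, signal number, signal name) for the first failure.

    Returns None if the log contains no matching failure line.
    """
    match = FAILURE_LINE.search(log_text)
    if match is None:
        return None
    return match.group("cmd"), int(match.group("num")), match.group("name")


# The (abridged) line from this post's build output.
sample = (
    "6:29.35 process didn't exit successfully: "
    "`/usr/bin/rustc --crate-name [...]` "
    "(signal: 6, SIGABRT: process abort signal)"
)

result = find_failed_command(sample)
print(result)
```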
The good news is that the error output includes the final command that caused the problem. The entire command is 9712 characters long (that's nearly enough text to fill up the available memory on a BBC B running in Mode 2) so I'm not going to copy it out here. But here's an abridged version with the most important parts retained.
/usr/bin/rustc --crate-name style --edition=2018 servo/components/style/lib.rs \
    --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat \
    --crate-type lib --emit=dep-info,metadata,link -C opt-level=2 -C panic=abort \
    -C embed-bitcode=no --cfg 'feature="bindgen"' --cfg 'feature="gecko"' \
    [...] debuginfo=2 -C force-frame-pointers=yes --cap-lints warn -Cembed-bitcode=yes \
    -C codegen-units=1
By setting the OUT_DIR environment variable appropriately I'm able to execute this command inside the scratchbox2 target in order to generate the same error.
fatal runtime error: Rust cannot catch foreign exceptions
Aborted (core dumped)
This is really helpful because it converts a multi-hour build odyssey into a few-minute build excursion. My immediate impulse is to try this with a lower optimisation setting. Checking the rustc options I see that -C opt-level=2 is setting the optimisation level for the build:
$ rustc -C help | grep "opt-level"
-C opt-level=val -- optimization level (0-3, s, or z; default: 0)
When I try to build with the optimisation level set to zero, a slightly different error occurs.
LLVM ERROR: out of memory
Allocation failed
Aborted (core dumped)
Maybe memory is the underlying issue here? We've had memory problems with previous gecko builds, so this is definitely a possibility. They seem to come in two flavours. First there are memory errors that come from a lack of physical memory in the machine doing the build. From previous experience I know the build will require at least 16GiB of memory. The machine I'm using has 32GiB of RAM and 16GiB of swap memory, which really should be enough. Second there are memory errors caused by Sailfish SDK (scratchbox2 and QEMU) restrictions.
You may recall Thigg highlighted this second issue a while back, flagging up a problem experienced with QEMU related to high memory usage. Upgrading QEMU was seen as the solution in that case, but that's going to be a long way around for us.
With previous gecko builds one of the challenges was building using debug symbols, as this in particular seemed to push the build over the edge at the final linking step. We're nowhere near final linking yet (I wish!), but it makes me think that dropping debug symbols might have some beneficial effect here.
Checking the rustc options again I see that this is controlled by the debuginfo option in the command we're running.
$ rustc -C help | grep " debuginfo"
-C debuginfo=val -- debug info emission level (0 = no debug info, 1 = line tables only, 2 = full debug info with variable and type information; default: 0)
If I drop this down from 2 to 0, I discover the command builds without error.
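Trying the command with different opt-level and debuginfo values means editing a very long command line by hand each time. A small helper can do the substitution instead. This is a minimal sketch rather than anything from the actual build: the function names are hypothetical, the demo command is a shortened stand-in for the real 9712-character one, and it assumes options appear either as `-C key=value` or as `-Ckey=value`.

```python
import os
import shlex
import subprocess
from typing import List


def set_codegen_option(argv: List[str], key: str, value: str) -> List[str]:
    """Return a copy of argv with rustc's `-C key=...` option set to value.

    Handles both the split form (`-C key=val`) and the joined form
    (`-Ckey=val`); appends the option if it was not present at all.
    """
    out: List[str] = []
    found = False
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "-C" and i + 1 < len(argv) and argv[i + 1].split("=", 1)[0] == key:
            out += ["-C", f"{key}={value}"]
            found = True
            i += 2
        elif arg.startswith("-C") and arg[2:].split("=", 1)[0] == key:
            out.append(f"-C{key}={value}")
            found = True
            i += 1
        else:
            out.append(arg)
            i += 1
    if not found:
        out += ["-C", f"{key}={value}"]
    return out


def run_with_out_dir(argv: List[str], out_dir: str) -> int:
    """Run the command with OUT_DIR set, returning its exit status.

    On POSIX a process killed by a signal yields a negative return
    code, so SIGABRT (signal 6) would show up as -6.
    """
    env = dict(os.environ, OUT_DIR=out_dir)
    return subprocess.run(argv, env=env).returncode


# Shortened stand-in for the real command line.
command = ("rustc --crate-name style -C opt-level=2 -C debuginfo=2 "
           "-Cembed-bitcode=yes lib.rs")
patched = set_codegen_option(shlex.split(command), "debuginfo", "0")
print(shlex.join(patched))
```

The patched argument list could then be passed to `run_with_out_dir` inside the build target, with the directory chosen to match whatever OUT_DIR the original build used.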
That's great, but now I have to get that into the build process to see if it will have the same effect there.
In build/moz.configure/rust.configure I find this code:
debug_info = None
[...]
if debug_symbols:
    debug_info = "2"
I thought this might be controlled by the MOZ_DEBUG_SYMBOLS environment variable (set in mozconfig.merqtxulrunner). But a quick test showed this not to be the case. So instead I just adjusted this piece of code to fix the debuginfo level for Rust.
Ideally I'd like to filter based on the component to be built, to minimise the extent to which the debug information is removed. But I'll start with this and see if it at least helps the build run further.
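Filtering by component might look something like the following. This is only a sketch of the idea, not the code in build/moz.configure/rust.configure: the function name, the crate set and the way the crate name reaches the function are all hypothetical, and the real configure logic may not expose a per-crate hook at all.

```python
from typing import Optional

# Hypothetical: crates we are willing to build without debug info
# because full debug info pushes rustc over the memory limit.
NO_DEBUGINFO_CRATES = frozenset({"style"})


def rust_debuginfo_level(crate: str, debug_symbols: bool) -> Optional[str]:
    """Pick the rustc debuginfo level to use for one crate.

    Mirrors the shape of the original snippet: None when debug
    symbols are off, "2" when they are on, except that crates in
    NO_DEBUGINFO_CRATES are forced down to "0".
    """
    if not debug_symbols:
        return None
    if crate in NO_DEBUGINFO_CRATES:
        return "0"
    return "2"


for name in ("style", "webrender"):
    print(name, rust_debuginfo_level(name, debug_symbols=True))
```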
Changing the build like this means scrapping my current incremental build, so after making the edit and kicking off a full build, it's going to be at least an hour before the results come in.
So it looks like that's it until tomorrow.
Today is the last day of my two-week holiday, so I'll be back to work tomorrow and things may slow down a bit as a result. Let's see what happens, as it feels like I have some momentum right now.
Don't forget, my Gecko Dev Diary has all the other posts.