12 Jan 2024 : Day 136

Downloading all of the files served by DuckDuckGo individually didn't work: I end up with a site that simply triggers multiple "404 Not Found" errors due to files being requested but not available.

But I'm not giving up on this approach just yet. On the Sailfish Forum attah today made a nice suggestion in relation to this:

"Finally remembering to post thoughts I have accumulated after the last weeks of following the blog: Have you tried with a wildly different User Agent override for DuckDuckGo, like iPhone or something? The hanging parallel compile - could that be related to some syscall that gets used in synchronization, but which is stubbed in sb2?"

There are actually two points here, the first of which relates to DuckDuckGo and the second of which relates to the build hanging when it uses more than one process under scratchbox2, part of the Sailfish SDK. Let me leave the compile query to one side for now because, although it's a good point, I unfortunately don't know the answer (but it sounds like an interesting thing to investigate).

Going back to DuckDuckGo, so far I've tried the ESR 78 user agent and the Firefox user agent, but I admit I've not tried anything else. It's a good idea — thank you attah — definitely worth trying. So let's see what happens.

I don't have an iPhone to compare with, but of course there are plenty of places on the Web that claim to list its user agent string. I'm going to use this one from the DeviceAtlas blog:

Mozilla/5.0 (iPhone12,1; U; CPU iPhone OS 13_0 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/15E148 Safari/602.1
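For illustration, here's a minimal sketch of how a user agent string like this can be attached to a request in Python using the standard library. This isn't the download script itself; the helper names build_request and fetch are hypothetical and chosen just for this example.

```python
import urllib.request

# The iPhone user agent string quoted above (from the DeviceAtlas blog).
IPHONE_UA = (
    "Mozilla/5.0 (iPhone12,1; U; CPU iPhone OS 13_0 like Mac OS X) "
    "AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 "
    "Mobile/15E148 Safari/602.1"
)


def build_request(url, user_agent):
    """Create a GET request that presents the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=30):
    """Download url and return the raw response body as bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Building the request doesn't touch the network; calling fetch() does.
request = build_request("https://duckduckgo.com/", IPHONE_UA)
```

Swapping in a different string for the user_agent parameter is all that's needed to repeat the download as a different browser.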

I used the Python script from yesterday to download the files twice, first using the ESR 78 user agent (stored to the ddg-esr78 directory) and then again using the iPhone user agent above (stored to the ddg-iphone12 directory). Each directory contains 34 files and here's what I get when I diff them:

$ find ddg-esr78/ | wc -l
34
$ find ddg-iphone12 | wc -l
34
$ diff --brief ddg-esr78/ ddg-iphone12/
Common subdirectories: ddg-esr78/assets and ddg-iphone12/assets
Common subdirectories: ddg-esr78/font and ddg-iphone12/font
Common subdirectories: ddg-esr78/ist and ddg-iphone12/ist
Common subdirectories: ddg-esr78/locale and ddg-iphone12/locale

So the resulting downloads are identical. That's too bad (although also a little reassuring). It's hard not to conclude that the user agent isn't the important factor here. Nevertheless, I'm still concerned that I'm not getting the right files when I download using this Python script. If the problem is that DuckDuckGo is recognising a different browser when I download the files with my Python script, even though I've set the User Agent string to match, then the solution will have to be to download the files with the Sailfish Browser itself. It could be another issue entirely, but, well, this is a process of elimination.
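One caveat about the comparison: without the -r flag, diff reports common subdirectories but doesn't descend into them, so only the top-level files were actually compared. As a cross-check, a recursive comparison could be sketched in Python along the following lines. The function name differing_paths is my own invention for this example, and it assumes both trees are small enough to compare byte by byte.

```python
import filecmp
import os


def differing_paths(left, right):
    """Return sorted relative paths that differ between two directory trees."""
    found = []

    def walk(comparison, prefix):
        # Names that exist on one side only.
        for name in comparison.left_only + comparison.right_only:
            found.append(os.path.join(prefix, name))
        # Names that are a file on one side and a directory on the other.
        for name in comparison.common_funny:
            found.append(os.path.join(prefix, name))
        # Files on both sides: compare contents rather than just metadata.
        _, mismatch, errors = filecmp.cmpfiles(
            comparison.left, comparison.right,
            comparison.common_files, shallow=False,
        )
        for name in mismatch + errors:
            found.append(os.path.join(prefix, name))
        # Descend into the subdirectories present on both sides.
        for name, sub in comparison.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(left, right), "")
    return sorted(found)
```

Calling differing_paths("ddg-esr78", "ddg-iphone12") returns an empty list only if every file in every subdirectory matches. Whatever that turns up, downloading the files with the Sailfish Browser itself is still the next step.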

I already have the means to do this, in theory. The EMBED_CONSOLE="network" setting gives a preview of any text files it downloads. But by default that's restricted to showing the first 32 KiB of data. That's not enough for everything and some files get truncated. So I've spent a bit of time improving the output.

First I've increased this value to 32 MiB. Ideally there'd be no limit at all, but 32 MiB should be more than enough (and if it isn't, it should be obvious and the value can easily be bumped up). There's a second issue too. Ever since I first wrote this component I've been disappointed that the request and response headers could be output at a different time to the document content. That meant it wasn't always possible to tie the content to the request headers (and, in particular, to the URL the content was downloaded from).

The reason the two can get separated is that the headers are output as soon as they've been received. And the content is output as soon as it's received in full. But between the headers being output and the content being received it's quite possible for some smaller file to be received in full. In this case, the smaller file would get printed out between the headers and content of the larger file.

My solution has been to store a copy of the URL in the content receiver callback object. That way, the URL can be output at the same time as the content. Now the headers and the content can be tied together since the URL is output with them both.
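The idea can be modelled with a small Python sketch. This is a toy simulation rather than the component's real code: the ContentReceiver class, the on_headers function and the example.com URLs are all hypothetical, made up to show the ordering problem and the fix.

```python
class ContentReceiver:
    """Hypothetical stand-in for the content receiver callback object."""

    def __init__(self, url, log):
        self.url = url      # copy of the URL, kept until the body completes
        self.chunks = []
        self.log = log

    def on_data(self, chunk):
        self.chunks.append(chunk)

    def on_complete(self):
        # Output the URL together with the content so they stay adjacent.
        self.log.append(f"[ Document URL ] {self.url}")
        self.log.append(f"[ Document content ] {''.join(self.chunks)}")


def on_headers(url, log):
    """Log the headers as soon as they arrive; return a receiver for the body."""
    log.append(f"[ Request details ] {url}")
    return ContentReceiver(url, log)


log = []
large = on_headers("https://example.com/large.js", log)
large.on_data("first half, ")
small = on_headers("https://example.com/small.js", log)
small.on_data("tiny")
small.on_complete()     # the smaller file finishes first
large.on_data("second half")
large.on_complete()
```

Here both sets of headers get logged before either body, and the small file's content lands between the large file's headers and its content. Because each receiver carries its own URL, every content entry is still immediately preceded by the URL it belongs to. That's a toy model, of course; the real output from the browser is more detailed.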

Here's an example (slightly abridged to keep things manageable):

[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/dist/p.f5b58579149e7488209f.js
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (Mobile; rv:78.0) Gecko/78.0 Firefox/78.0
        Accept : */*
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Referer : https://duckduckgo.com/
        Connection : keep-alive
        TE : Trailers
    [ Response headers -------------------------------------- ]
        server : nginx
        date : Wed, 10 Jan 2024 22:27:17 GMT
        content-type : application/x-javascript
        content-length : 157
        last-modified : Fri, 27 Oct 2023 12:03:07 GMT
        vary : Accept-Encoding
        etag : "653ba6fb-9d"
        content-encoding : br
        strict-transport-security : max-age=31536000
        permissions-policy : interest-cohort=()
        content-security-policy : [...] ;
        x-frame-options : SAMEORIGIN
        x-xss-protection : 1;mode=block
        x-content-type-options : nosniff
        referrer-policy : origin
        expect-ct : max-age=0
        expires : Thu, 09 Jan 2025 22:27:17 GMT
        cache-control : max-age=31536000
        vary : Accept-Encoding
        X-Firefox-Spdy : h2
    [ Document URL ------------------------------------------ ]
        URL: https://duckduckgo.com/dist/p.f5b58579149e7488209f.js
        Charset: 
        ContentType: application/x-javascript
    [ Document content -------------------------------------- ]
function post(t){if(t.source===parent&&t.origin===location.protocol+"//"+
    location.hostname&&"string"==typeof t.data){var o=t.data.indexOf(":"),
    a=t.data.substr(0,o),n=t.data.substr(o+1);"ddg"===a&&(parent.window.
    location.href=n)}}window.addEventListener&&window.addEventListener
    ("message",post,!1);
    [ Document content ends --------------------------------- ]

Notice how the actual document content (which is only a few lines of text in this case) is right at the end. Directly before it the URL is output again, which means the content can now be tied to the URL at the start of the request.

After downloading the mobile version of DuckDuckGo using the ESR 78 engine and these changes I can see they've made a difference when I compare the previous and newly collected data:

$ ls -lh ddg-urls-esr78-mobile-*.txt
-rw-rw-r-- 1 flypig flypig 2.5K Jan  8 19:04 ddg-urls-esr78-mobile-01.txt
-rw-rw-r-- 1 flypig flypig 1.3M Jan 10 22:27 ddg-urls-esr78-mobile-02.txt

Previously 2.5 KiB of data was collected, but with these changes that goes up to 1.3 MiB.

The log file is a bit unwieldy, but it should hopefully contain all the data we need: every bit of textual data that was downloaded. Tomorrow I'll try to disentangle the output and turn it back into files again. With a bit of luck, I'll end up with a working copy of the DuckDuckGo site (famous last words!).
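As a rough idea of what that disentangling might involve, here's a minimal Python sketch that pulls each document out of a log laid out like the example above. It assumes the section markers appear exactly as shown in that (abridged) example, and that no document contains a line starting with the end marker. The function name extract_documents and the sample log are hypothetical.

```python
import re

URL_LINE = re.compile(r"^\s*URL: (\S+)")


def extract_documents(log_text):
    """Map each document URL to the text logged between its content markers."""
    documents = {}
    url = None              # URL from the most recent "Document URL" section
    in_url_section = False
    content = None          # list of lines while inside a content section

    for line in log_text.splitlines():
        stripped = line.strip()
        if content is not None:
            if stripped.startswith("[ Document content ends"):
                documents[url] = "\n".join(content)
                content, url = None, None
            else:
                content.append(line)
        elif stripped.startswith("[ Document URL"):
            in_url_section = True
        elif stripped.startswith("[ Document content"):
            in_url_section = False
            if url is not None:
                content = []
        elif in_url_section:
            match = URL_LINE.match(line)
            if match:
                url = match.group(1)
    return documents


sample = """\
[ Request details ------------------------------------------- ]
    URL: https://example.com/a.js
    [ Document URL ------------------------------------------ ]
        URL: https://example.com/a.js
        ContentType: application/x-javascript
    [ Document content -------------------------------------- ]
console.log("a");
    [ Document content ends --------------------------------- ]
"""
documents = extract_documents(sample)
```

Only the URL inside the "Document URL" section is used, since that's the one guaranteed to sit right next to the content. Mapping each URL to a path on disk and writing the files out would be a separate step.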

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
