flypig.co.uk

Blog

11 Jan 2024 : Day 135
After collecting together and comparing the lists of files downloaded yesterday, today I'm actually downloading those files from the server.

I've created a very simple Python script that takes each line from the output and reconstructs a local copy of each file, using the same relative directory hierarchy. The script is short and simple enough to show here in full.
#!/bin/python3

import os
import urllib.request

BASE_DIR = 'ddg'
USER_AGENT = 'Mozilla/5.0 (Android 8.1.0; Mobile; rv:78.0) Gecko/78.0 Firefox/78.8'

def split_url(url):
	url = url.rstrip()
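	prefix = 'https://duckduckgo.com/'
	# lstrip() strips a set of characters rather than a prefix, so slice instead
	path = url[len(prefix):] if url.startswith(prefix) else url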
	path, leaf = os.path.split(path)
	leaf = 'index.html' if not leaf else leaf
	path = os.path.join(BASE_DIR, path)
	filename = os.path.join(path, leaf)
	return path, filename, url

def make_dir(directory):
	print('Dir: {}'.format(directory))
	os.makedirs(directory, exist_ok=True)

def download_file(url, filename):
	print('URL: {}'.format(url))
	print('File: {}'.format(filename))
	opener = urllib.request.build_opener()
	opener.addheaders = [('User-agent', USER_AGENT)]
	urllib.request.install_opener(opener)
	urllib.request.urlretrieve(url, filename)

with open('download.txt') as fp:
	for line in fp:
		directory, filepath, url = split_url(line)
		make_dir(directory)
		download_file(url, filepath)
It really is a very linear process; they don't get much simpler than this. All it does is read in a file line by line. Each line is interpreted as a URL. For example, suppose the line was the following:
https://duckduckgo.com/dist/lib/l.656ceb337d61e6c36064.js
Then the script will extract the directory dist/lib/, create a local directory ddg/dist/lib/, then download the file from the URL and save it in that directory with the filename l.656ceb337d61e6c36064.js.
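
The mapping from URL to local path can be checked without touching the network. The following is a minimal sketch rather than the script itself: it assumes urllib.parse.urlsplit is an acceptable way to pick out the path component, and the function name local_paths is made up for illustration.

```python
import os
from urllib.parse import urlsplit

BASE_DIR = 'ddg'  # same local root as used by the script

def local_paths(url):
    """Return (directory, filename) for the local copy of url.

    Hypothetical helper for illustration only.
    """
    # Keep only the path component, dropping scheme, host and query string
    path = urlsplit(url.strip()).path.lstrip('/')
    directory, leaf = os.path.split(path)
    # A URL ending in a slash has no leaf name, so fall back to index.html
    leaf = leaf or 'index.html'
    directory = os.path.join(BASE_DIR, directory)
    return directory, os.path.join(directory, leaf)

print(local_paths('https://duckduckgo.com/dist/lib/l.656ceb337d61e6c36064.js'))
print(local_paths('https://duckduckgo.com/'))
```

Running this on a few lines from download.txt before starting the real download is a cheap way to confirm that the directories come out as expected.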

We'll end up with a directory structure that should match the root directory structure of DuckDuckGo:
$ tree ddg
ddg
├── 3.html
├── assets
│   ├── logo_homepage.alt.v109.svg
│   ├── logo_homepage.normal.v109.svg
│   └── onboarding
│       ├── arrow.svg
│       └── bathroomguy
│           ├── 1-monster-v2--no-animation.svg
│           ├── 2-ghost-v2.svg
│           ├── 3-bathtub-v2--no-animation.svg
│           ├── 4-alpinist-v2.svg
│           └── teaser-2@2x.png
├── dist
│   ├── b.9e45618547aaad15b744.js
│   ├── d.01ff355796b8725c8dad.js
│   ├── h.2d6522d4f29f5b108aed.js
│   ├── lib
│   │   └── l.656ceb337d61e6c36064.js
│   ├── o.2988a52fdfb14b7eff16.css
│   ├── p.f5b58579149e7488209f.js
│   ├── s.b49dcfb5899df4f917ee.css
│   ├── ti.b07012e30f6971ff71d3.js
│   ├── tl.3db2557c9f124f3ebf92.js
│   └── util
│       └── u.a3c3a6d4d7bf9244744d.js
├── font
│   ├── ProximaNova-ExtraBold-webfont.woff2
│   ├── ProximaNova-Reg-webfont.woff2
│   └── ProximaNova-Sbold-webfont.woff2
├── index.html
└── locale
    └── en_GB
        └── duckduckgo85.js

9 directories, 24 files
The intention is that this will make a verbatim copy of all the files that the browser used when rendering the page. Unfortunately servers don't always serve the same file every time, so to reduce the chance of being served the wrong file, I've also set the user agent to match the one used by ESR 78.

That's no guarantee that the server will identify us as that (servers use all sorts of nasty tricks to detect browsers that misreport their identity) but it's probably the best we can reasonably do.
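
One way to confirm which user agent string a request will carry, before anything is sent, is to build the request object and inspect its headers. This is only a sketch: it uses urllib.request.Request directly rather than installing a global opener, and the function name build_request is illustrative.

```python
import urllib.request

# The ESR 78 user agent string used by the script
USER_AGENT = 'Mozilla/5.0 (Android 8.1.0; Mobile; rv:78.0) Gecko/78.0 Firefox/78.8'

def build_request(url):
    """Build a request carrying the ESR 78 user agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})

req = build_request('https://duckduckgo.com/')
# urllib stores header names capitalised as 'User-agent'
print(req.get_header('User-agent'))
```

Passing the request to urllib.request.urlopen() would then send it with that header. This only shows what we send; it says nothing about how the server chooses to interpret it.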

Once I've got a local copy of the site structure I copy this over to my server and get the browser to render it.

But unfortunately without success. For reasons I can't figure out, when I attempt to open the page, the browser requests a wholly different set of files to download. And not just different leafnames, but a totally different file structure as well. So rather than it downloading the files I've collected together, I just get a bunch of "404 File Not Found" errors.
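
To see how far the two sets of files diverge, the paths the browser asks for could be compared against the paths that were downloaded. The following is a rough sketch under assumptions: it supposes the requested URLs have been gathered into a list somehow (for example from the server's access log), the function names are invented, and the example inputs are made up purely to show the shape of the output.

```python
from urllib.parse import urlsplit

def url_paths(urls):
    """Return the set of path components for an iterable of URLs."""
    return {urlsplit(u.strip()).path for u in urls if u.strip()}

def compare_requests(downloaded, requested):
    """Split requested paths into those we have and those we lack."""
    have = url_paths(downloaded)
    want = url_paths(requested)
    return sorted(want & have), sorted(want - have)

# Made-up inputs purely for illustration; the hosts differ because the
# local copy is served from a different server (example.com is a placeholder)
downloaded = ['https://duckduckgo.com/index.html',
              'https://duckduckgo.com/dist/lib/l.656ceb337d61e6c36064.js']
requested = ['https://example.com/index.html',
             'https://example.com/some/other/file.js']

found, missing = compare_requests(downloaded, requested)
print('Found:', found)
print('Missing:', missing)
```

Comparing on the path alone ignores the host, which is what we want here since the same files are being served from a different machine.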

Frustrating. But the nature of me writing this up daily is that I can't just summarise all the things that work. As anyone who's been following along will no doubt have noticed by now, often things I try just don't work. But from the comments I've been getting from others, it's reassuring to know it's not just me. Sometimes failure is still progress.

Maybe I'll have better luck with a new approach tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
