Depth Limiting and Path Filtering in Lighthouse Parade

Written by Caleb Eby

In case you missed it, last month we released Lighthouse Parade, a CLI tool to automatically run and aggregate Lighthouse performance reports across an entire site. One of the most requested features has been the ability to limit which pages are crawled. We’re excited to release Lighthouse Parade 1.1, which introduces three new flags to accommodate these use cases.

We can run lighthouse-parade without installing it by using npx, and we'll use cloudfour.com as our example site.

npx lighthouse-parade https://cloudfour.com

At a glance this doesn’t look like a large site, but once you count all the blog posts and index pages, there are a lot of pages to run Lighthouse on, so a full crawl takes a while. We can reduce the number of pages that are crawled by limiting the crawl depth with the new --max-crawl-depth flag. Depth limiting controls how far the crawler traverses, measured in “clicks” from the starting page. We’ll set it to two so that it crawls the home page and only the pages linked directly from the home page:

npx lighthouse-parade https://cloudfour.com --max-crawl-depth 2

This helps speed up the crawling (only twelve pages get crawled). But maybe we want to crawl more pages than that. Let’s bump the crawl depth up to three and filter out blog posts (which have URLs like https://cloudfour.com/thinks/*). The new --exclude-path-glob flag lets us do that. Keep in mind that the glob must be quoted; otherwise your shell may try to expand it before lighthouse-parade ever sees it.

npx lighthouse-parade https://cloudfour.com --max-crawl-depth 3 --exclude-path-glob "/thinks/*"

This works pretty well. It provides a broader picture of the site’s performance than simply limiting the depth to 2 (specifically, this covers more kinds of pages) without being slowed down by running Lighthouse for every single blog post.

This option is especially useful on e-commerce sites where you wouldn’t want Lighthouse to run on every single product page.
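As a sketch of that scenario, suppose a hypothetical store at example-store.com keeps its individual product pages under /products/ (both the domain and the path here are assumptions for illustration, not a real site). Excluding that path lets Lighthouse cover the home page, category pages, and other templates without auditing every product:

```shell
# Hypothetical e-commerce site; "/products/*" is assumed to be where
# individual product detail pages live. One representative product page
# could still be audited by running lighthouse-parade on it directly.
npx lighthouse-parade https://example-store.com --exclude-path-glob "/products/*"
```

Since product pages usually share a single template, auditing one of them separately typically tells you as much as auditing all of them.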

Going back to the cloudfour.com example, maybe we don’t want to limit the depth, but we still want to exclude blog posts. If we try that, the crawler starts picking up sitemap pages like https://cloudfour.com/sitemap-pt-post-2020-12.html, as well as paginated links, so we’ll exclude those too by passing the --exclude-path-glob flag two more times:

npx lighthouse-parade https://cloudfour.com --exclude-path-glob "/thinks/*" --exclude-path-glob "/sitemap-*" --exclude-path-glob "**/page/*"

We’ll look at another example to show the last new flag, --include-path-glob. Maybe we want to run Lighthouse only on blog posts, so that we can see which posts might have unoptimized images or other resources that slow them down. The --include-path-glob flag tells the crawler to ignore any URL that doesn’t match the specified glob:

npx lighthouse-parade https://cloudfour.com --include-path-glob "/thinks/*"

Another example would be sites that are internationalized with URL prefixes like /en/. The --include-path-glob flag could be used to make it so that Lighthouse only runs on one version of translated pages.
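As a sketch of that idea, suppose a hypothetical site serves its translations under prefixes like /en/, /fr/, and /de/ (the domain and prefixes here are assumptions for illustration). Including only one prefix restricts the audit to a single language’s pages:

```shell
# Hypothetical internationalized site; only URLs under "/en/" are crawled,
# skipping the /fr/ and /de/ translations of the same templates.
npx lighthouse-parade https://example.com --include-path-glob "/en/*"
```

Since translated pages usually share templates and assets, one language version is often representative of the rest.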

Combining these new features gives you fine-grained control over which pages are crawled. We hope that these new features are helpful, and feel free to leave feedback on GitHub!

Caleb Eby

Caleb Eby is a developer passionate about web performance, efficient JavaScript, and helpful web tooling. You can find his abandoned side-projects on GitHub.
