Fast Sitemap Generator

Header-first crawling — fast, no HTML parsing when possible

What is a sitemap?

A sitemap is a file — usually named sitemap.xml — that lists all the important URLs on your website. It tells search engine crawlers like Googlebot exactly which pages exist, so they don't have to discover everything by following links alone.

The most common format is the XML sitemap, defined by the Sitemaps protocol. Each entry is a <url> element containing at minimum a <loc> tag with the full page address. You can optionally add <lastmod> (last modified date), <changefreq>, and <priority> hints, though Google has stated that it ignores <changefreq> and <priority> and uses <lastmod> only when it is consistently accurate.
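
As a minimal sketch of that structure, the following Python builds a sitemap from a list of URLs using only the standard library. The function name `build_sitemap` and the example.com URLs are illustrative assumptions, not part of this tool.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as bytes from (loc, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        # <loc> is the only required child; special characters are escaped by ElementTree.
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            # <lastmod> uses W3C Datetime format, for example YYYY-MM-DD.
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Hypothetical pages, for illustration only.
    pages = [
        ("https://example.com/", "2024-01-15"),
        ("https://example.com/about", None),
    ]
    print(build_sitemap(pages).decode("utf-8"))
```

<changefreq> and <priority> are left out here because they are optional.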

Once generated, place the file at https://yourdomain.com/sitemap.xml and submit it via Google Search Console under Indexing → Sitemaps. Submitting a sitemap is a reliable way to help search engines discover your pages, especially new content, deep pages, or URLs that aren't well linked internally, although it does not guarantee that every URL will be indexed.

How it works

This tool uses a header-first crawling technique inspired by Metehan Yesilyurt's experiment at SEO Week 2026. HTTP response headers travel before any HTML body — and some servers publish structured link data there.

For each page the crawler checks for an X-Internal-Links header containing a base64url-encoded JSON array of internal URLs. If found, link discovery happens with zero HTML parsing — just decode and go. If the header is absent, it falls back to scanning <a href> tags in the page body as usual.
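
The header-first lookup with an HTML fallback can be sketched in Python as follows. This is an illustration, not the tool's actual code: the function names are hypothetical, `headers` is assumed to be a plain dict of response headers, and `fetch_body` is an assumed callable that downloads the page body only when it is needed.

```python
import base64
import json
from html.parser import HTMLParser


def decode_internal_links(header_value):
    """Decode a base64url-encoded JSON array of URLs; return None if it is invalid."""
    try:
        # base64url values often omit '=' padding, so restore it before decoding.
        padded = header_value + "=" * (-len(header_value) % 4)
        links = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return None
    if isinstance(links, list) and all(isinstance(u, str) for u in links):
        return links
    return None


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags (the fallback path)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_links(headers, fetch_body):
    """Return (links, source), where source is 'header' or 'html'."""
    # Header names are case-insensitive, so normalise the keys first.
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("x-internal-links")
    if value:
        links = decode_internal_links(value)
        if links is not None:
            return links, "header"  # no body download or HTML parsing needed
    parser = AnchorCollector()
    parser.feed(fetch_body())
    return parser.hrefs, "html"
```

A missing or malformed header is treated the same way here: the sketch quietly falls back to scanning the body.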

The result is a standard sitemap.xml file you can submit to Google Search Console or place at yourdomain.com/sitemap.xml.

Speed via the X-Internal-Links header

The article describes crawling 65,000 pages in 99 seconds by reading only HTTP headers. When a server publishes X-Internal-Links, a crawler can discover all internal links on a page without downloading or parsing the full HTML — the links arrive as compact JSON in the header, before even one byte of body is sent.

Most sites today don't emit this header yet, so the tool falls back to HTML parsing. You can add it to your own site using a Cloudflare Worker, nginx add_header, or PHP header() — see the open-source code on GitHub.
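
As a rough sketch of the server side, the header value can be produced by JSON-encoding the link list and base64url-encoding the result. The example below uses a minimal Python WSGI app instead of the Cloudflare Worker, nginx, or PHP options named above. The function names and the link list are hypothetical, and stripping the trailing `=` padding is an assumption, since the exact encoding details are not spelled out here.

```python
import base64
import json

# Hypothetical internal links for the page being served.
PAGE_LINKS = ["/", "/about", "/blog/"]


def encode_internal_links(urls):
    """Return a base64url-encoded JSON array suitable for a header value."""
    payload = json.dumps(list(urls), separators=(",", ":")).encode("utf-8")
    # Assumption: padding is stripped to keep the header value compact.
    return base64.urlsafe_b64encode(payload).decode("ascii").rstrip("=")


def app(environ, start_response):
    """Minimal WSGI app that publishes its internal links in a response header."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("X-Internal-Links", encode_internal_links(PAGE_LINKS)),
    ]
    start_response("200 OK", headers)
    return [b"<!doctype html><title>Home</title>"]
```

Servers and proxies cap the size of response headers, so pages with very long link lists may need to publish only a subset.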

Frequently Asked Questions

Do I need a sitemap if my site is already well linked?
Even on a tightly linked site, a sitemap is still worth having. Search engines use it as a cross-check: if a page isn't reachable by following links from the homepage within a few clicks, the sitemap is often its most dependable path to being discovered. It also tells Google when pages were last updated, which helps with recrawl scheduling for frequently changing content.

Where should I put my sitemap?
The standard location is the root of your domain: https://yourdomain.com/sitemap.xml. You should also reference it in your robots.txt file with a line like Sitemap: https://yourdomain.com/sitemap.xml. This lets any crawler discover it automatically, without you having to submit it manually to each search engine.

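To illustrate how a crawler can pick up that robots.txt line, here is a short Python sketch using the standard library's `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or later). The robots.txt content and the helper name `sitemaps_from_robots` are examples only.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content with a Sitemap directive.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt document."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(sitemaps_from_robots(ROBOTS_TXT))
```
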
How often should I regenerate my sitemap?
Regenerate your sitemap whenever you add, remove, or significantly update pages. For blogs or e-commerce sites that publish frequently, a daily or weekly regeneration is typical. Static sites can get away with updating it only when content changes. The <lastmod> tag is useful here: Google can use it when deciding whether to recrawl a URL, so keeping it accurate matters more than regenerating on a fixed schedule.

Why is the crawl limited to 200 pages?
PHP processes requests synchronously, so large crawls run in a single thread. The 200-page cap keeps the request within server time limits. For bigger sites, run this from the command line or split the crawl by section. Depth is also capped at 4 clicks from the homepage.

What do "Via Header" and "Via HTML" mean?
"Via Header" counts pages where the X-Internal-Links HTTP response header was present and decoded successfully. For those pages, no HTML body parsing was needed at all; links were extracted directly from the header. "Via HTML" means the header was absent and the tool fell back to parsing <a href> tags in the page body.

Does the crawler find links rendered by JavaScript?
No. The crawler fetches raw HTTP responses without a headless browser, so links rendered by JavaScript won't be discovered via HTML parsing. However, if a site publishes X-Internal-Links headers (as the article proposes), those links are visible even on JS-heavy pages, because headers are sent before any script runs.

Does the tool respect robots.txt?
Currently this tool does not parse robots.txt. It only crawls pages on the same domain you entered and does not follow external links. Use the generated sitemap only for sites you own or have permission to crawl.

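One way to express the same-domain rule is sketched below in Python. It assumes that "same domain" means an exact hostname match, so www.example.com and example.com would count as different hosts. The tool's real rule may differ, and the function name is hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only http(s) links on the same host."""
    host = urlsplit(page_url).hostname
    result = []
    for href in hrefs:
        # Make the link absolute and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skips mailto:, tel:, javascript:, and similar
        if parts.hostname == host and absolute not in result:
            result.append(absolute)
    return result
```
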
Why are some pages missing from my sitemap?
Pages can be missed if they are only reachable via JavaScript navigation, if they are deeper than 4 clicks from the homepage, if the total page budget of 200 was reached, or if the server returned a 4xx/5xx error. Only 2xx HTTP responses are included in the final sitemap.xml.
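
The depth cap, the page budget, and the 2xx filter can be seen together in a small breadth-first crawl sketch. This is an approximation written in Python, not the tool's PHP code: `fetch` is an assumed callable returning a status code and the page's internal links, and counting the budget per fetched page (not per included page) is also an assumption.

```python
from collections import deque

MAX_PAGES = 200  # total page budget
MAX_DEPTH = 4    # clicks from the homepage


def crawl(start_url, fetch, max_pages=MAX_PAGES, max_depth=MAX_DEPTH):
    """Breadth-first crawl; fetch(url) must return (status_code, internal_links)."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    fetched = 0
    sitemap_urls = []
    while queue and fetched < max_pages:
        url, depth = queue.popleft()
        status, links = fetch(url)
        fetched += 1
        if not 200 <= status < 300:
            continue  # non-2xx responses are left out of the sitemap
        sitemap_urls.append(url)
        if depth >= max_depth:
            continue  # links on pages at the depth cap are not followed
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return sitemap_urls
```

In this sketch a page that sits five clicks deep is never queued, and a page behind a failing URL is only found if another reachable page links to it.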