A sitemap is a file — usually named sitemap.xml — that lists all the important URLs on your website. It tells search engine crawlers like Googlebot exactly which pages exist, so they don't have to discover everything by following links alone.
The most common format is the XML sitemap, defined by the Sitemaps protocol. Each entry is a <url> element containing at minimum a <loc> tag with the full page address. You can optionally add <lastmod> (last modified date), <changefreq>, and <priority> hints, though Google has stated that it ignores <changefreq> and <priority> and relies on <loc> and, when it is kept accurate, <lastmod>.
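To make the structure above concrete, here is a minimal sketch of building such a file with Python's standard library. The function name build_sitemap and the (loc, lastmod) tuple layout are assumptions made for illustration, not part of this tool.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (loc, lastmod) pairs.

    `entries` is a hypothetical input shape: each item is a tuple of the
    full page URL and either a date string (e.g. "2024-01-15") or None.
    Returns UTF-8 encoded bytes including the XML declaration.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        # <loc> is the only required child; ElementTree escapes "&" etc.
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap([
        ("https://yourdomain.com/", "2024-01-15"),
        ("https://yourdomain.com/blog/?page=2&sort=new", None),
    ])
    print(xml_bytes.decode("utf-8"))
```

The sketch leaves out <changefreq> and <priority> on purpose, since they are optional hints.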
Once generated, place the file at https://yourdomain.com/sitemap.xml and submit it via Google Search Console under Indexing → Sitemaps. Submitting a sitemap is a reliable way to help search engines discover your pages, especially new content, deep pages, or URLs that aren't well linked internally, though it does not guarantee that every URL will be indexed.
This tool uses a header-first crawling technique inspired by Metehan Yesilyurt's experiment at SEO Week 2026. HTTP response headers travel before any HTML body — and some servers publish structured link data there.
For each page the crawler checks for an X-Internal-Links header containing a base64url-encoded JSON array of internal URLs. If found, link discovery happens with zero HTML parsing — just decode and go. If the header is absent, it falls back to scanning <a href> tags in the page body as usual.
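The header-first check with HTML fallback described above can be sketched roughly as follows. This is an illustrative outline, not the tool's actual code: the function names are hypothetical, the headers are assumed to arrive as a plain dict, and the decoder assumes the base64url value may or may not carry "=" padding.

```python
import base64
import json
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def decode_internal_links(header_value):
    """Decode a base64url-encoded JSON array of URLs, or return None."""
    if not header_value:
        return None
    try:
        value = header_value.strip()
        value += "=" * (-len(value) % 4)  # restore padding if it was stripped
        links = json.loads(base64.urlsafe_b64decode(value).decode("utf-8"))
    except ValueError:  # covers bad base64, bad UTF-8 and bad JSON
        return None
    if not isinstance(links, list):
        return None
    return [link for link in links if isinstance(link, str)]


class _AnchorCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def discover_links(headers, body_html, base_url):
    """Return (same-host links, source), where source is 'header' or 'html'."""
    lowered = {k.lower(): v for k, v in headers.items()}
    links = decode_internal_links(lowered.get("x-internal-links"))
    source = "header"
    if links is None:
        # Header absent or unreadable: fall back to scanning <a href> tags.
        collector = _AnchorCollector()
        collector.feed(body_html)
        links, source = collector.hrefs, "html"
    host = urlparse(base_url).netloc
    resolved = [urljoin(base_url, link) for link in links]
    return [u for u in resolved if urlparse(u).netloc == host], source
```

A malformed header is treated the same as a missing one here, so a bad value on one page never blocks discovery; whether the real tool behaves this way is not stated in the text.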
The result is a standard sitemap.xml file you can submit to Google Search Console or place at yourdomain.com/sitemap.xml.
The article describes crawling 65,000 pages in 99 seconds by reading only HTTP headers. When a server publishes X-Internal-Links, a crawler can discover all internal links on a page without downloading or parsing the full HTML — the links arrive as compact JSON in the header, before even one byte of body is sent.
Most sites today don't emit this header yet, so the tool falls back to HTML parsing. You can add it to your own site using a Cloudflare Worker, nginx add_header, or PHP header() — see the open-source code on GitHub.
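For a rough idea of what the server side involves, the following Python sketch produces a header value in the format described above (a base64url-encoded JSON array). It is an illustration only, not the GitHub code referenced above: the function names are hypothetical, and stripping the trailing "=" padding is an assumption about the expected encoding.

```python
import base64
import json


def encode_internal_links(urls):
    """Encode a list of internal URLs as a base64url JSON array."""
    # Compact separators keep the header value as small as possible.
    payload = json.dumps(list(urls), separators=(",", ":")).encode("utf-8")
    # Assumption: padding is stripped, as is common for base64url values.
    return base64.urlsafe_b64encode(payload).rstrip(b"=").decode("ascii")


def with_internal_links_header(headers, urls):
    """Return a new list of (name, value) header pairs with X-Internal-Links added."""
    return [*headers, ("X-Internal-Links", encode_internal_links(urls))]


if __name__ == "__main__":
    response_headers = with_internal_links_header(
        [("Content-Type", "text/html; charset=utf-8")],
        ["/about", "/blog/", "/contact"],
    )
    for name, value in response_headers:
        print(f"{name}: {value}")
```

The same encoding step would apply in a Cloudflare Worker or PHP; only the way the header is attached to the response differs.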
https://yourdomain.com/sitemap.xml. You should also reference it in your robots.txt file with a line like Sitemap: https://yourdomain.com/sitemap.xml; this lets any crawler discover it automatically, without you having to submit it manually to each search engine.
The <lastmod> tag is useful here: Google uses it to decide whether to recrawl a URL, so keeping it accurate matters more than regenerating on a fixed schedule.
X-Internal-Links HTTP response header was present and decoded successfully. For those pages, no HTML body parsing was needed at all; links were extracted directly from the header. "Via HTML" means the header was absent and the tool fell back to parsing <a href> tags in the page body.
X-Internal-Links headers (as the article proposes), those links are visible even on JS-heavy pages, because headers are sent before any script runs.
robots.txt. It only crawls pages on the same domain you entered and does not follow external links. Use the generated sitemap only for sites you own or have permission to crawl.
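As an illustration of how a crawler can pick up the Sitemap: line from robots.txt and check crawl rules, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content and the Disallow rule are made-up examples, and nothing here is claimed to be how this tool reads robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://yourdomain.com/sitemap.xml
"""


def read_robots(robots_txt):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = read_robots(ROBOTS_TXT)

# site_maps() (Python 3.8+) returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()

# can_fetch() reports whether a given user agent may crawl a URL.
allowed_public = parser.can_fetch("*", "https://yourdomain.com/blog/")
allowed_private = parser.can_fetch("*", "https://yourdomain.com/private/report")

if __name__ == "__main__":
    print("Sitemaps:", sitemaps)
    print("May crawl /blog/:", allowed_public)
    print("May crawl /private/report:", allowed_private)
```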