robots.txt + sitemap.xml Cross-Validator

Validate robots.txt and sitemap.xml together — including cross-checks (sitemap URL blocked by robots).

Overview

The robots.txt + sitemap.xml cross-validator checks both files together and flags the contradictions that confuse crawlers and hurt indexing: a URL listed in the sitemap but blocked by robots.txt, a Sitemap reference in robots.txt that returns a 404, or Disallow rules that accidentally hide your homepage. Paste both files, or supply a base URL, and the tool walks every entry.

SEO consultants performing an audit, developers shipping a new site, and content teams troubleshooting a sudden traffic drop all benefit from a combined robots and sitemap validator. Typical tasks include validating robots.txt and the sitemap together, detecting sitemap URLs blocked by robots.txt, and checking for crawl errors before going live.

How it works

robots.txt is a plain-text file at the site root. RFC 9309 (published in 2022) standardises its User-agent, Allow, and Disallow lines; Sitemap is a widely supported extension, and Crawl-delay is a non-standard directive that some crawlers honour and others (including Googlebot) ignore. sitemap.xml follows the sitemaps.org schema: a <urlset> of <url> entries, each with a required <loc> and optional <lastmod>, <changefreq>, and <priority>. A sitemap index file (<sitemapindex>) points to multiple per-section sitemaps.
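As a rough illustration of the sitemap side, the sketch below pulls the <loc> values out of a sitemap or sitemap index using only the Python standard library. The function name extract_locs is hypothetical and is not the tool's own code; it assumes the document declares the standard sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace declared by sitemaps.org 0.9 documents (assumed to be present).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text: str) -> tuple[str, list[str]]:
    """Return (kind, locs), where kind is the root element's local name."""
    root = ET.fromstring(xml_text)
    # root.tag looks like '{namespace}urlset'; keep only the local name.
    kind = root.tag.rsplit("}", 1)[-1]
    # A <urlset> holds <url> children; a <sitemapindex> holds <sitemap> children.
    child = "sm:url" if kind == "urlset" else "sm:sitemap"
    locs = []
    for el in root.findall(child, SITEMAP_NS):
        loc = (el.findtext("sm:loc", default="", namespaces=SITEMAP_NS) or "").strip()
        if loc:  # skip entries with a missing or empty <loc>
            locs.append(loc)
    return kind, locs


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod></url>"
    "<url><loc>https://example.com/admin/dashboard</loc></url>"
    "</urlset>"
)
kind, locs = extract_locs(sample)  # kind is "urlset"; locs holds the two <loc> values
```

When kind comes back as "sitemapindex", each returned value is the address of a child sitemap that would need to be fetched and parsed in turn.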

The validator parses both, normalises every URL to absolute form, evaluates the matching Disallow / Allow rules per user-agent (longest-match wins), and reports overlaps. It also confirms each Sitemap: directive in robots.txt returns a valid XML response, checks for nested sitemaps, and warns when a single sitemap exceeds the hard limits of 50,000 URLs or 50 MB uncompressed.
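The cross-check described above can be sketched in a few lines of Python. This is a simplified illustration, not the validator's actual implementation: the helper names (parse_rules, is_allowed, blocked_sitemap_urls) are hypothetical, only groups that name the requested agent literally are considered, and the * and $ wildcards from RFC 9309 are not handled.

```python
from urllib.parse import urlsplit


def parse_rules(robots_txt: str, agent: str = "*") -> list[tuple[str, str]]:
    """Collect (directive, path) pairs from groups that name `agent` literally."""
    rules, group_agents, in_rules = [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif key in ("allow", "disallow"):
            in_rules = True
            # An empty value (e.g. "Disallow:") restricts nothing, so skip it.
            if agent.lower() in group_agents and value:
                rules.append((key, value))
    return rules


def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_for_allow = len(prefix) == best_len and directive == "allow"
            if longer or tie_for_allow:
                best_len, allowed = len(prefix), directive == "allow"
    return allowed


def blocked_sitemap_urls(robots_txt: str, sitemap_urls: list[str], agent: str = "*") -> list[str]:
    """Return the sitemap URLs that the given agent may not crawl."""
    rules = parse_rules(robots_txt, agent)
    blocked = []
    for url in sitemap_urls:
        parts = urlsplit(url)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        if not is_allowed(path, rules):
            blocked.append(url)
    return blocked


robots = "User-agent: *\nDisallow: /admin\nAllow: /admin/public\n"
urls = [
    "https://example.com/",
    "https://example.com/admin/dashboard",
    "https://example.com/admin/public/page",
]
print(blocked_sitemap_urls(robots, urls))
# prints ['https://example.com/admin/dashboard']
```

The standard library's urllib.robotparser is deliberately not used here, because it applies rules in file order rather than by longest match.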

Examples

A sitemap entry for /admin/dashboard while robots has Disallow: /admin → flagged as "URL in sitemap is disallowed".
A Sitemap: https://example.com/sitemap.xml in robots that returns 404 → flagged as "sitemap reference broken".
An accidental Disallow: / under a wildcard user-agent → flagged as "site blocked entirely".
60,000 URLs in a single sitemap.xml → flagged because it exceeds the 50,000 cap.
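
The size check in the last example could look roughly like the following sketch. The function name sitemap_limit_warnings and the warning wording are hypothetical; the limits are the sitemaps.org figures of 50,000 URLs and 50 MB (52,428,800 bytes) uncompressed.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes, measured uncompressed


def sitemap_limit_warnings(xml_bytes: bytes) -> list[str]:
    """Return human-readable warnings for a single, uncompressed sitemap file."""
    warnings = []
    if len(xml_bytes) > MAX_BYTES:
        warnings.append(f"file is {len(xml_bytes)} bytes, over the {MAX_BYTES}-byte limit")
    root = ET.fromstring(xml_bytes)
    # Count direct <url> children, ignoring the namespace part of each tag.
    url_count = sum(1 for el in root if el.tag.rsplit("}", 1)[-1] == "url")
    if url_count > MAX_URLS:
        warnings.append(f"sitemap lists {url_count} URLs, over the {MAX_URLS} cap")
    return warnings
```

A gzip-compressed sitemap would need to be decompressed before being passed in, since both limits apply to the uncompressed file.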

FAQ

Does Google still honour robots.txt?

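Yes, for crawling. A page disallowed in robots.txt is not fetched. It can still appear in search results if other sites link to it. To keep it out of the index, use noindex on the page itself and leave the page crawlable, because a crawler that is blocked by robots.txt never sees the noindex.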

What is the longest-match rule?

When multiple Allow / Disallow directives match a URL, the one with the longest path prefix wins. Allow: /admin/public overrides Disallow: /admin for /admin/public/page.

Should I list a sitemap in robots.txt?

Yes — it is the quickest way to point a new crawler at your URL list. You can list multiple sitemap entries if you split by content type or section.
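
For illustration, here is a minimal Python sketch that collects every Sitemap line from a robots.txt body and checks that each value is a full URL, which is the form crawlers generally expect. The helper names sitemap_directives and is_absolute_http_url are hypothetical, and the comment stripping is naive: it would also cut a URL that contains a # character.

```python
from urllib.parse import urlsplit


def sitemap_directives(robots_txt: str) -> list[str]:
    """Return the value of every Sitemap line, in file order."""
    found = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # naive comment stripping
        key, sep, value = line.partition(":")
        # The directive name is case-insensitive; the URL keeps its own colons.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def is_absolute_http_url(url: str) -> bool:
    """True when the value has an http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


robots = (
    "User-agent: *\n"
    "Disallow: /admin\n"
    "Sitemap: https://example.com/sitemap-posts.xml\n"
    "sitemap: /sitemap-pages.xml\n"
)
for url in sitemap_directives(robots):
    print(url, "ok" if is_absolute_http_url(url) else "not an absolute URL")
```

Sitemap lines sit outside the User-agent groups, so the sketch reads them wherever they appear in the file.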

Do all bots respect robots.txt?

Major search engines and well-behaved crawlers do. Malicious scrapers ignore it entirely. Treat robots.txt as a hint, not a security boundary.
