robots.txt + sitemap.xml Cross-Validator
Validate robots.txt and sitemap.xml together — including cross-checks (sitemap URL blocked by robots).
Overview
The robots.txt + sitemap.xml cross-validator checks both files together and flags the contradictions that cause real crawling and indexing problems: a URL listed in the sitemap but blocked by robots, a sitemap reference in robots that points to a 404, or disallow rules that accidentally hide your homepage. Paste both files, or supply a base URL, and the tool walks every entry.
SEO consultants performing an audit, developers shipping a new site, and content teams troubleshooting a sudden traffic drop all benefit from a combined robots and sitemap validator. Typical uses are validating robots.txt and the sitemap together, detecting sitemap URLs blocked by robots, and checking for crawl errors before going live.
How it works
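robots.txt (standardised as RFC 9309 in 2022) is a plain-text file at the site root made up of User-agent groups, each with Allow and Disallow rules. Sitemap lines are widely supported alongside them, while Crawl-delay is a non-standard extension that some crawlers honour and Google ignores. sitemap.xml follows the sitemaps.org schema: a <urlset> of <url> entries, each with a required <loc> and optional <lastmod>, <changefreq>, and <priority>. A sitemap index file (a <sitemapindex> of <sitemap> entries) points to multiple per-section sitemaps.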
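As an illustration of the sitemap side, here is a minimal sketch that reads the <loc> values from either a <urlset> or a <sitemapindex> using Python's standard library. The function name parse_sitemap and the SAMPLE document are assumptions made for this example, not part of the tool.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemaps.org 0.9 schema.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ("urlset" or "sitemapindex", list of <loc> values)."""
    root = ET.fromstring(xml_text)
    if root.tag == NS + "urlset":
        kind, child = "urlset", "url"
    elif root.tag == NS + "sitemapindex":
        kind, child = "sitemapindex", "sitemap"
    else:
        raise ValueError(f"unexpected root element: {root.tag}")

    locs = []
    for entry in root.findall(NS + child):
        loc = entry.find(NS + "loc")
        if loc is not None and loc.text:
            locs.append(loc.text.strip())
    return kind, locs


# Hypothetical sample document for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod></url>
  <url><loc>https://example.com/admin/dashboard</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(parse_sitemap(SAMPLE))
```

When the result is a sitemapindex, each returned value is the address of a child sitemap to fetch and parse in turn.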
The validator parses both files, normalises every URL to absolute form, evaluates the matching Disallow / Allow rules per user-agent (longest match wins), and reports every sitemap URL that those rules block. It also confirms that each Sitemap: directive in robots.txt returns a valid XML response, checks for nested sitemaps, and warns when a single sitemap file exceeds the hard limits of 50,000 URLs or 50 MB uncompressed.
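The parsing and normalisation steps might look roughly like the sketch below. It is a simplified reading of the robots.txt format under stated assumptions, not the validator's own code: parse_robots and to_path are hypothetical names, and only User-agent, Allow, Disallow, and Sitemap lines are handled.

```python
from urllib.parse import urljoin, urlsplit


def parse_robots(text: str) -> tuple[dict[str, list[tuple[str, str]]], list[str]]:
    """Return ({user_agent: [(directive, path), ...]}, [sitemap URLs])."""
    groups: dict[str, list[tuple[str, str]]] = {}
    sitemaps: list[str] = []
    current: list[str] = []   # user-agents of the group being read
    in_rules = False          # True once the current group has a rule line

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if in_rules:      # a User-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current:
                groups[agent].append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)    # Sitemap lines belong to no group
    return groups, sitemaps


def to_path(base_url: str, loc: str) -> str:
    """Resolve loc against base_url and return the path (plus query) that rules match."""
    parts = urlsplit(urljoin(base_url, loc))
    return (parts.path or "/") + (f"?{parts.query}" if parts.query else "")
```

Rules are matched against the path and query rather than the full URL, which is why to_path strips the scheme and host after resolving relative entries.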
Examples
- A sitemap entry for /admin/dashboard while robots has Disallow: /admin → flagged as "URL in sitemap is disallowed".
- A Sitemap: https://example.com/sitemap.xml in robots that returns 404 → flagged as "sitemap reference broken".
- An accidental Disallow: / under a wildcard user-agent → flagged as "site blocked entirely".
- 60,000 URLs in a single sitemap.xml → flagged because it exceeds the 50,000 cap.
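The last example comes down to a simple size check. The sketch below assumes the URL count and the uncompressed byte size have already been measured; check_sitemap_limits and the message wording are hypothetical, chosen for illustration.

```python
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024   # 50 MB, measured on the uncompressed file


def check_sitemap_limits(url_count: int, size_bytes: int) -> list[str]:
    """Return one finding per exceeded limit for a single sitemap file."""
    findings = []
    if url_count > MAX_URLS:
        findings.append(
            f"sitemap has {url_count:,} URLs, over the {MAX_URLS:,} cap"
        )
    if size_bytes > MAX_BYTES:
        findings.append(
            f"sitemap is {size_bytes:,} bytes, over the {MAX_BYTES:,}-byte cap"
        )
    return findings


if __name__ == "__main__":
    for finding in check_sitemap_limits(60_000, 12_000_000):
        print(finding)
```

A sitemap that fails either check is normally split into several files listed in a sitemap index.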
FAQ
Does Google still honour robots.txt?
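Yes, for crawling. A page disallowed in robots is not fetched. It can still appear in search results if other sites link to it. To keep a page out of the index, use noindex on the page itself (a robots meta tag or an X-Robots-Tag header) and leave the page crawlable, because a crawler that is blocked by robots.txt never sees the noindex.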
What is the longest-match rule?
When multiple Allow / Disallow directives match a URL, the one with the longest path prefix wins. Allow: /admin/public overrides Disallow: /admin for /admin/public/page.
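A minimal sketch of that rule, assuming plain path prefixes only (the * and $ pattern characters are not handled here). The function name is_allowed and the rule tuples are illustrative. On a tie in length the sketch prefers Allow, following RFC 9309's recommendation to use the allow rule when an allow and a disallow rule are equivalent.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: [("allow" | "disallow", path_prefix), ...] for one user-agent group."""
    best_len, allowed = -1, True   # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue               # an empty value matches nothing
        is_allow = directive == "allow"
        # The longer prefix wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


RULES = [("disallow", "/admin"), ("allow", "/admin/public")]

if __name__ == "__main__":
    print(is_allowed("/admin/public/page", RULES))   # True
    print(is_allowed("/admin/dashboard", RULES))     # False
```

Running the same function over every sitemap path for a given user-agent group yields the list of sitemap URLs that are disallowed.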
Should I list a sitemap in robots.txt?
Yes — it is the quickest way to point a new crawler at your URL list. You can list multiple sitemap entries if you split by content type or section.
Do all bots respect robots.txt?
Major search engines and well-behaved crawlers do. Malicious scrapers ignore it entirely. Treat robots.txt as a hint, not a security boundary.