Most websites have a sitemap. Far fewer have a sitemap that is actually correct. Broken URLs, duplicate entries, redirect chains, and files that balloon past the protocol's 50,000-URL-per-file limit are all common enough that search consoles regularly flag them as errors.
A good XML sitemap generator takes care of all of this automatically. It validates each URL, strips duplicates, splits oversized inventories into a sitemap index, and hands you a file that is ready to submit. This guide walks through how the whole thing works and what to look for when picking or building one.
What an XML Sitemap Actually Does
A sitemap is a machine-readable list of URLs that tells search engine crawlers which pages exist on your site and, optionally, when they were last updated. It does not guarantee that those pages will be indexed, but it does make sure crawlers know where to look.
The XML format has been the standard since Google, Yahoo, and Microsoft jointly adopted the Sitemaps protocol in 2006. The spec defines a small set of elements: <urlset>, <url>, <loc>, <lastmod>, <changefreq>, and <priority>. Within each <url> entry, only <loc> is required; everything else is advisory.
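To make that concrete, here is a small sketch, using Python's standard library, of what a single-entry sitemap built from those elements looks like. The URL and date are placeholders.

```python
import xml.etree.ElementTree as ET

# Smallest useful sitemap: a <urlset> wrapping one <url> entry.
# Within each <url>, only <loc> is required; <lastmod> is optional.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)
entry = ET.SubElement(urlset, "url")
ET.SubElement(entry, "loc").text = "https://example.com/page"  # required
ET.SubElement(entry, "lastmod").text = "2024-05-01"            # optional, W3C date

ET.indent(urlset)  # pretty-printing, Python 3.9+
print(ET.tostring(urlset, encoding="unicode"))
# <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
#   <url>
#     <loc>https://example.com/page</loc>
#     <lastmod>2024-05-01</lastmod>
#   </url>
# </urlset>
```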
The practical effect is straightforward. A crawler that finds your sitemap can schedule URLs for crawling without having to discover every page through links. That matters most for large sites, newly launched pages, and content that is not well-linked internally.
Sitemaps also help with crawl budget. Large sites are not crawled exhaustively on every pass. If the crawler has to follow link trails to discover pages, orphaned or deeply nested content may go weeks between crawls. A sitemap brings those URLs directly to the crawler's attention so they are at least visible in the scheduling queue, even if indexing is not immediate.
Why a Generator Beats Hand-Writing One
For sites with fewer than a hundred pages, writing a sitemap by hand is feasible. For anything larger, it is a maintenance problem. Pages get added and removed constantly, and a manually maintained sitemap drifts out of sync almost immediately.
A generator solves this by producing the file from a canonical URL list rather than from memory. More importantly, it can validate each entry before it goes in. Invalid URL syntax, known 404s, and redirect loops are all easier to catch at generation time than after you have submitted the file and waited for a crawl.
The EvvyTools XML Sitemap Generator handles validation and duplicate removal as part of the core workflow. Paste in your URL list, and the tool flags any entries that fail RFC 3986 syntax checks before generating the final file.
There is also a practical ceiling to account for. The sitemaps protocol caps each file at 50,000 URLs and 50 MB uncompressed. Sites with large catalogs need to split their sitemap into multiple files and reference them from a sitemap index. Doing that split by hand is tedious; a generator that handles it automatically saves real time.
Key Features to Look For
Sitemap tools are not interchangeable. These three capabilities separate useful generators from ones that produce a syntactically valid but practically broken file.
URL Validation
The generator should check each URL against the RFC 3986 specification before including it. That means verifying the scheme (http or https), confirming there are no illegal characters, and flagging anything that looks malformed. A URL that fails the spec will be rejected or silently ignored by some crawlers. Better to know before you submit.
Validation should also catch common formatting mistakes like trailing spaces, bare domain names without a scheme, or fragments (#anchors), which are not valid in sitemaps. These slip in easily when URLs come from a CMS export or a spreadsheet.
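What those checks look like in practice varies by tool; the following is a minimal sketch of the kind of pre-flight validation a generator might run, using only Python's standard library. The specific rules a tool like the EvvyTools generator applies may differ.

```python
from urllib.parse import urlparse

def sitemap_url_problems(raw: str) -> list[str]:
    """Return the reasons a URL should not go into a sitemap (empty list if none)."""
    problems = []
    if raw != raw.strip():
        problems.append("leading or trailing whitespace")
    parsed = urlparse(raw.strip())
    if parsed.scheme not in ("http", "https"):
        problems.append("missing or unsupported scheme")
    if not parsed.netloc:
        problems.append("no host (bare domain or path without a scheme)")
    if parsed.fragment:
        problems.append("contains a #fragment")
    if " " in raw.strip():
        problems.append("unescaped space")
    return problems

for candidate in ["https://example.com/page", "example.com/page", "https://example.com/#top"]:
    print(candidate, "->", sitemap_url_problems(candidate) or "ok")
```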
Duplicate Removal
Sitemaps should not list the same URL twice, and they should not list the same canonical resource under multiple paths if you are treating those paths as identical. A sitemap generator that normalizes trailing slashes and deduplicates entries before output saves you from submitting a file that repeats entries.
This is especially relevant for sites that serve content at both www and non-www versions, or at both HTTP and HTTPS. Pick the canonical version and make sure the generator outputs only that.
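A sketch of that normalization step, assuming https and the bare (non-www) host are the canonical forms and that trailing slashes are not significant; adjust those assumptions to match your own canonical URLs.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Reduce a URL to a single canonical form before deduplication."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")  # assume non-www is canonical
    path = parts.path.rstrip("/") or "/"              # assume trailing slash is not significant
    return urlunsplit(("https", host, path, parts.query, ""))

def dedupe(urls: list[str]) -> list[str]:
    seen, out = set(), []
    for url in urls:
        canonical = normalize(url)
        if canonical not in seen:
            seen.add(canonical)
            out.append(canonical)
    return out

print(dedupe([
    "http://www.example.com/about/",
    "https://example.com/about",
]))  # -> ['https://example.com/about']
```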
Sitemap Index Splitting
When your URL count exceeds 50,000, you need a sitemap index file that references multiple sitemap files. The generator should handle this split automatically and output a valid <sitemapindex> document pointing to the individual files. The index file itself is then what you submit to Google Search Console and Bing Webmaster Tools.
Manually splitting a 200,000-URL catalog into four files and writing the index by hand is an error-prone afternoon. A generator that does it deterministically is worth using.
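A rough sketch of the split, again with Python's standard library. A production generator would also watch the 50 MB size limit and usually gzip the output; the base URL and file names here are placeholders.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # protocol limit per sitemap file

def write_sitemaps(urls, base="https://example.com"):
    """Split a URL list into <=50,000-entry sitemap files plus a sitemap index."""
    chunks = [urls[i:i + MAX_URLS] for i in range(0, len(urls), MAX_URLS)]
    index = ET.Element("sitemapindex", xmlns=NS)
    for n, chunk in enumerate(chunks, start=1):
        urlset = ET.Element("urlset", xmlns=NS)
        for url in chunk:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        filename = f"sitemap-{n}.xml"
        ET.ElementTree(urlset).write(filename, encoding="utf-8", xml_declaration=True)
        # The index entry points at the hosted location of each file.
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base}/{filename}"
    ET.ElementTree(index).write("sitemap_index.xml", encoding="utf-8", xml_declaration=True)
```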
Step-by-Step: Building a Sitemap with a Generator
The process is short once you have a working URL list.
Step 1: Export your URL list. Pull a complete list of canonical URLs from your CMS, your analytics platform, or a crawl of your live site. The exact source matters less than completeness. If your site has pagination, category pages, or search result pages you do not want indexed, filter those out now.
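If the export is a flat text file, a few lines of Python are enough to apply that filter. The file name and exclusion patterns below are hypothetical; adapt them to your own URL structure.

```python
# Hypothetical input file and exclusion patterns - adjust to your own site.
EXCLUDE = ("/search?", "?page=", "/tag/")

with open("exported-urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

keep = [u for u in urls if not any(pattern in u for pattern in EXCLUDE)]
print(f"kept {len(keep)} of {len(urls)} URLs")
```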
Step 2: Paste into the generator and validate. Open the XML Sitemap Generator and paste your URL list. Let the tool run its validation pass. Review any flagged entries and decide whether to fix or remove them. Do not include URLs that return 404 or 301 in the final sitemap, since submitting a sitemap with broken URLs signals to crawlers that your site is poorly maintained.
Step 3: Review the output. The generated sitemap should be well-formed XML. Open it in a text editor or browser and spot-check a few entries. The <loc> values should be the full canonical URLs including scheme. If you added <lastmod> dates, verify they are in W3C datetime format (YYYY-MM-DD at minimum).
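For a more systematic spot-check, the generated file can be parsed and sanity-checked programmatically. This sketch assumes a local copy named sitemap.xml and verifies only the two points mentioned above: absolute <loc> values and W3C-formatted <lastmod> dates.

```python
import re
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
W3C_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}")  # YYYY-MM-DD, optionally extended with a time

tree = ET.parse("sitemap.xml")  # assumed local copy of the generated file
for entry in tree.getroot().iter(NS + "url"):
    loc = entry.findtext(NS + "loc", default="")
    lastmod = entry.findtext(NS + "lastmod")
    assert loc.startswith(("http://", "https://")), f"loc is not a full URL: {loc!r}"
    if lastmod is not None:
        assert W3C_DATE.match(lastmod), f"bad lastmod on {loc}: {lastmod!r}"
print("spot-check passed")
```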
Step 4: Host and reference the file. Upload the sitemap to your web root, typically at /sitemap.xml or /sitemap_index.xml. Reference it from your robots.txt with a Sitemap: directive, a single line such as Sitemap: https://example.com/sitemap.xml. This lets crawlers find it automatically without requiring a manual submission every time.
Submitting to Search Engines
Referencing the sitemap from robots.txt is sufficient for discovery, but direct submission through search console gives you faster feedback. Google Search Console and Bing Webmaster Tools both have dedicated sitemap submission interfaces. After submission you can monitor crawl coverage, see how many URLs were indexed versus submitted, and spot errors that the crawler encountered.
The most common submission error is a mismatch between the sitemap location and the verified property. If you verified the bare domain version of your site in Search Console but uploaded your sitemap to the www subdomain version, the submission may fail or report no URLs. Confirm the URLs in your sitemap match the exact property variant you verified.
For large sites, resubmit the sitemap after major content changes rather than waiting for it to be re-crawled organically. Search Console shows the last time the sitemap was read, which gives you a rough idea of how quickly changes are being picked up.
One underused feature of Search Console's sitemap report is the breakdown between submitted URLs and indexed URLs. A large gap between the two usually means the unindexed pages have thin content, canonical conflicts, or are being blocked by robots.txt. The sitemap itself is not the problem, but the report makes the problem visible. That is reason enough to submit directly rather than relying on auto-discovery.
"The sitemap is one of the cheapest technical SEO wins available. It costs almost nothing to get right, and a broken one is a quiet tax on every new page you publish." - Dennis Traina, founder of 137Foundry
Common Mistakes to Avoid
A few patterns show up repeatedly when auditing sitemaps.
Including redirects. A sitemap should contain destination URLs, not redirect sources. If /old-page 301s to /new-page, only /new-page belongs in the sitemap. Including the redirect source forces the crawler to follow an extra hop every time.
Including noindex pages. Pages with a noindex directive should not appear in the sitemap. Submitting them sends a contradictory signal: you are telling the indexer "here is a page" while the page itself says "do not index me." Most crawlers will follow the noindex directive, but the inconsistency is unnecessary noise.
Setting changefreq and priority arbitrarily. These fields are advisory and largely ignored by modern crawlers, but setting every URL to priority: 1.0 and changefreq: always is a known anti-pattern that some systems interpret as spammy. If you include them, set values that reflect actual update frequency.
Forgetting to update after major site changes. A sitemap that references pages deleted six months ago and omits pages added last week is actively misleading. Regenerate the sitemap after any significant structural change to the site.
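The redirect and stale-URL mistakes above can both be caught before generation with a quick status check. Here is a rough sketch using the third-party requests library; it keeps only URLs that answer 200 directly, without following redirects, so redirect sources and deleted pages both fall out of the list.

```python
import requests

def live_urls(urls, timeout=10):
    """Keep only URLs that respond 200 directly, without following redirects."""
    keep = []
    for url in urls:
        try:
            # Some servers reject HEAD requests; falling back to GET is a reasonable refinement.
            status = requests.head(url, allow_redirects=False, timeout=timeout).status_code
        except requests.RequestException:
            status = None
        if status == 200:
            keep.append(url)
        else:
            print(f"dropping {url} (status {status})")
    return keep
```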
Putting It Together
A clean XML sitemap is a background task that pays off quietly over months. Pages get discovered faster, crawl budget is spent on real content, and search console stops flagging errors that were always preventable.
The right tool makes the generation step fast enough that it does not feel like work. The EvvyTools dev and tech toolkit includes the XML Sitemap Generator alongside other utilities for the same workflow. If you are exploring what is available, the EvvyTools blog has more guides on technical site maintenance.
The protocol itself is documented at sitemaps.org. For authoritative guidance on what search engines do with your sitemap after submission, Google's sitemap documentation is the clearest reference available.