
HTML Cleaner & Formatter

Clean messy HTML from Word, Google Docs, and CMS editors

Markup Sanitizer

About the HTML Cleaner

The HTML Cleaner strips the markup that Word, Google Docs, email clients, and rich-text CMS editors silently inject into copy-pasted content: MS Office namespaces (xmlns:o, xmlns:w), mso-* CSS properties, conditional comments, inline styles, empty tags, deprecated <font> elements, and Word-generated classes such as MsoNormal. The output is semantic HTML5, and the aggressiveness is configurable (Light / Standard / Deep / Custom).

It is built for content marketers pasting from Google Docs into WordPress (where the inline styles can override the theme), editors migrating archives from one CMS to another, email-template designers cleaning Outlook-exported HTML, and developers de-bloating user-submitted content that came in via a rich-text editor that didn’t sanitize aggressively enough.

All cleaning runs locally in JavaScript. Pasted HTML — including unpublished editorial drafts, internal documentation, and client work under NDA — never leaves your device. The page makes no network call after first load. The Network tab in DevTools will confirm zero outbound requests during the clean.

Choose a preset based on where the content is going. Light keeps classes (use it when migrating to a CMS with matching styles). Standard strips inline styles and Word junk while keeping semantic structure (the default for most uses). Deep reduces the markup to bare semantic tags (best for hand-styled blog templates). Custom exposes individual toggles for fine control. After cleaning, validate the output with the W3C Nu HTML Checker before publishing, because some severely broken Office HTML survives basic cleaning and trips up downstream parsers. For user-submitted content, also run the result through DOMPurify or a similar sanitizer before rendering it in a browser.

Privacy: 100% client-side · HTML never transmitted
Presets: Light / Standard / Deep / Custom
Last reviewed: 2026-05-14 by Dennis Traina

How to Use the HTML Cleaner

Paste your HTML into the left panel and the tool immediately runs it through the active cleaning rules. The cleaned output appears in the right panel in real time, and a visual preview below shows how the cleaned HTML renders in a browser. Use the Copy Clean HTML button to grab the result, or click Format / Prettify to add proper indentation before copying. Choose one of the four aggressiveness presets: Light removes Office-specific markup only, Standard also strips inline styles, classes, and empty tags, Deep strips everything down to pure semantic HTML, and Custom lets you toggle individual rules.
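
One way to picture the presets is as bundles of on/off rules. The Python sketch below is an illustration only: the tool itself runs in JavaScript and its internals are not documented here, so `CleanOptions`, `PRESETS`, `resolve_options`, and every flag name are hypothetical, and the flag assignments simply mirror the preset descriptions above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CleanOptions:
    """Hypothetical rule toggles; names are illustrative, not the tool's real settings."""
    strip_office_markup: bool = True   # namespaces, mso-* properties, conditional comments
    strip_inline_styles: bool = False
    strip_classes: bool = False
    remove_empty_tags: bool = False
    semantic_tags_only: bool = False   # reduce to a bare semantic tag set


# Assumed mapping of presets to rules, based on the descriptions on this page.
PRESETS = {
    "light": CleanOptions(),
    "standard": CleanOptions(strip_inline_styles=True, strip_classes=True,
                             remove_empty_tags=True),
    "deep": CleanOptions(strip_inline_styles=True, strip_classes=True,
                         remove_empty_tags=True, semantic_tags_only=True),
}


def resolve_options(preset: str, **overrides: bool) -> CleanOptions:
    """Return a preset's options; unknown names (including "custom") start from Standard."""
    base = PRESETS.get(preset.lower(), PRESETS["standard"])
    return replace(base, **overrides)


# "Custom" modelled as Standard plus individual overrides.
custom = resolve_options("custom", strip_classes=False)
```

Modelling Custom as "Standard plus overrides" is a design assumption made for the sketch; the real tool may start Custom from a different baseline.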

Why Word and Google Docs HTML Is So Messy

When you copy content from Microsoft Word or Google Docs and paste it into a web editor, the clipboard carries an enormous amount of hidden formatting. Word inserts proprietary XML namespaces (xmlns:o, xmlns:w), conditional comments targeting specific Office versions, mso-* CSS properties that no browser understands, and deeply nested <span> tags with inline styles that attempt to replicate the document's exact appearance. Google Docs produces similarly bloated markup with extensive inline styles and wrapper divs that serve no structural purpose. Cleaning this markup is not optional — it is a necessary step in any content workflow that involves word processors.
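
To make the kinds of Office residue described above concrete, here is a minimal Python sketch that removes conditional comments, Office-namespaced tags such as `<o:p>`, `xmlns` attributes, and `mso-*` declarations inside `style` attributes. It is not the tool's implementation (which runs in JavaScript and is not shown here); `strip_office_markup` and the regular expressions are assumptions for illustration, and a regex approach like this only copes with reasonably well-formed, double-quoted attributes.

```python
import re

# <!--[if gte mso 9]> ... <![endif]--> blocks
CONDITIONAL_COMMENT = re.compile(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->",
                                 re.DOTALL | re.IGNORECASE)
# Office-namespaced tags such as <o:p>, </o:p>, <w:WordDocument>
OFFICE_TAG = re.compile(r"</?[ovw]:[^>]*>", re.IGNORECASE)
# xmlns="..." and xmlns:o="..." attributes (double-quoted only in this sketch)
XMLNS_ATTR = re.compile(r'\s+xmlns(?::\w+)?\s*=\s*"[^"]*"', re.IGNORECASE)
STYLE_ATTR = re.compile(r'\s+style\s*=\s*"([^"]*)"', re.IGNORECASE)


def _drop_mso_declarations(match: re.Match) -> str:
    """Keep only the declarations in a style attribute that do not start with mso-."""
    declarations = [d.strip() for d in match.group(1).split(";") if d.strip()]
    kept = [d for d in declarations if not d.lower().startswith("mso-")]
    if not kept:
        return ""  # the attribute held nothing but mso-* properties
    return ' style="' + "; ".join(kept) + '"'


def strip_office_markup(html: str) -> str:
    html = CONDITIONAL_COMMENT.sub("", html)
    html = OFFICE_TAG.sub("", html)
    html = XMLNS_ATTR.sub("", html)
    return STYLE_ATTR.sub(_drop_mso_declarations, html)


word_html = (
    '<p class="MsoNormal" style="mso-margin-top-alt:auto; color:red" '
    'xmlns:o="urn:schemas-microsoft-com:office:office">Hi<o:p></o:p></p>'
    "<!--[if gte mso 9]><xml>junk</xml><![endif]-->"
)
print(strip_office_markup(word_html))
# <p class="MsoNormal" style="color:red">Hi</p>
```

The sketch deliberately leaves the `MsoNormal` class and the remaining inline style in place; those belong to separate rules (class and style stripping) rather than to Office-specific cleanup.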

Cleaning HTML for CMS Migration

Migrating content between content management systems is one of the most common use cases for HTML cleaning. The HTML you export almost always carries platform-specific classes, inline styles tied to the old theme, and structural markup that conflicts with the target platform. The ideal approach is to strip content down to semantic HTML — paragraphs, headings, lists, links, emphasis, and strong text — and let the new system's stylesheets handle presentation. Use the Deep preset or build a custom tag whitelist to define exactly which elements your target CMS expects.
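
A tag whitelist of the kind mentioned above can be sketched with Python's standard `html.parser` module. This is an assumed approach for illustration, not the tool's own code: `SEMANTIC_TAGS`, `KEPT_ATTRS`, `WhitelistCleaner`, and `clean_to_whitelist` are hypothetical names, and the tag set is only an example of what a target CMS might expect. Tags outside the whitelist are unwrapped (their text is kept), while `<script>` and `<style>` are dropped together with their contents.

```python
from html import escape
from html.parser import HTMLParser

# Example whitelist; adjust to whatever the target CMS actually accepts.
SEMANTIC_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li",
                 "a", "em", "strong", "blockquote"}
KEPT_ATTRS = {"a": {"href"}}            # every other attribute is discarded
DROP_WITH_CONTENT = {"script", "style"}


class WhitelistCleaner(HTMLParser):
    def __init__(self, allowed=SEMANTIC_TAGS):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.parts = []
        self._skip_depth = 0            # > 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth += 1
        elif tag in self.allowed:
            kept = [(k, v) for k, v in attrs
                    if k in KEPT_ATTRS.get(tag, ()) and v is not None]
            attr_text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
            self.parts.append(f"<{tag}{attr_text}>")
        # any other tag is unwrapped: the tag is dropped, its text is kept

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.allowed:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(escape(data, quote=False))


def clean_to_whitelist(html: str, allowed=SEMANTIC_TAGS) -> str:
    parser = WhitelistCleaner(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


old_cms = ('<div class="old-theme"><p style="color:red">Read '
           '<a href="/docs" class="btn">the <span>docs</span></a></p></div>')
print(clean_to_whitelist(old_cms))
# <p>Read <a href="/docs">the docs</a></p>
```

Comments, including Office conditional comments, fall away as a side effect because the parser subclass never emits them. A whitelist like this is a formatting aid, not a security sanitizer; untrusted input still needs a dedicated sanitizer before rendering.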

Semantic HTML Best Practices

Semantic HTML uses tags that convey meaning rather than appearance. A <strong> element communicates importance, while a <b> element draws attention to text without marking it as more important. Assistive technologies and search engine crawlers can use these distinctions to understand content structure. This tool automatically upgrades <b> to <strong> and <i> to <em>, bringing your markup in line with modern standards without changing how browsers render it by default.
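
The tag upgrade described above amounts to a rename of opening and closing tags. The following Python sketch shows one way to do it with a regular expression; `TAG_UPGRADES` and `upgrade_presentational_tags` are hypothetical names, and the tool's actual JavaScript implementation is not shown on this page. The lookahead keeps look-alike tags such as `<br>`, `<body>`, and `<img>` from being touched.

```python
import re

TAG_UPGRADES = {"b": "strong", "i": "em"}

# Match <b>, </b>, <i ...>, </i>; the lookahead requires whitespace, ">" or "/"
# right after the tag name, so <br>, <body>, <img>, <input> are not matched.
_PRESENTATIONAL = re.compile(r"<(/?)(b|i)(?=[\s>/])([^>]*)>", re.IGNORECASE)


def upgrade_presentational_tags(html: str) -> str:
    def _swap(match: re.Match) -> str:
        slash, tag, rest = match.groups()
        if slash:
            rest = ""  # closing tags carry no attributes
        return f"<{slash}{TAG_UPGRADES[tag.lower()]}{rest}>"

    return _PRESENTATIONAL.sub(_swap, html)


sample = '<p><b>Bold</b> and <i class="x">italic</i><br><img src="a.png"></p>'
print(upgrade_presentational_tags(sample))
# <p><strong>Bold</strong> and <em class="x">italic</em><br><img src="a.png"></p>
```

A blanket rename is a convenience rather than a guarantee of correct semantics: text that was bold purely for visual reasons becomes "important" after the swap, so a quick review of the result is worthwhile.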

When to Clean vs. Rewrite

Use automated cleaning when the content structure is sound but the markup is cluttered with presentation artifacts. Rewrite manually when the HTML structure itself is fundamentally wrong — layout tables, deeply nested divs used as a substitute for semantic elements, or content that mixes data and presentation in ways that cannot be separated by removing attributes alone.

Common HTML Formatting Issues

Beyond Word and Google Docs artifacts, email HTML is notoriously messy because email clients have inconsistent CSS support, which forces inline styles and table-based layouts. WYSIWYG editors in older CMS platforms generate excessive <br> tags, wrap every text node in <span> tags, and leave empty elements scattered throughout. Non-breaking spaces (&nbsp;) accumulate as content is edited and reformatted. This tool removes non-breaking spaces that are used only for spacing while preserving those between words where a line break should not occur.
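
The page does not spell out how spacing-only non-breaking spaces are told apart from meaningful ones, so the Python sketch below uses an assumed heuristic: a run of several non-breaking spaces, or a non-breaking space next to an ordinary space, is treated as spacing and collapsed to one regular space; an element containing nothing but a non-breaking space is emptied; a single non-breaking space between two non-space characters is kept. `normalize_nbsp` is a hypothetical name.

```python
import re

NBSP = "\u00a0"
_SPACE_RUN = re.compile(r"[ \t\u00a0]+")
# An element whose only content is one non-breaking space, e.g. <p>&nbsp;</p>
_NBSP_ONLY_ELEMENT = re.compile(r"(<(\w+)[^>]*>)\u00a0(</\2>)")


def _collapse_run(match: re.Match) -> str:
    run = match.group(0)
    if NBSP not in run or run == NBSP:
        return run          # ordinary whitespace, or a lone nbsp: leave alone
    return " "              # nbsp used as padding: collapse to one space


def normalize_nbsp(html: str) -> str:
    text = html.replace("&nbsp;", NBSP).replace("&#160;", NBSP)
    text = _SPACE_RUN.sub(_collapse_run, text)
    text = _NBSP_ONLY_ELEMENT.sub(r"\1\3", text)
    return text.replace(NBSP, "&nbsp;")


messy = "<p>&nbsp;</p><p>Total:&nbsp;&nbsp;&nbsp;10&nbsp;kg and&nbsp; more</p>"
print(normalize_nbsp(messy))
# <p></p><p>Total: 10&nbsp;kg and more</p>
```

The emptied `<p></p>` is left for a separate empty-tag rule to remove, which keeps each rule small enough to toggle on its own.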

For validating and formatting other code formats, try the JSON Formatter & Validator for JSON data, or the CSS Generator when you need to create clean stylesheets from scratch rather than cleaning existing markup.

Frequently Asked Questions

Why does Word HTML break my website layout?

Word exports include proprietary xmlns:o and xmlns:w namespaces, conditional comments targeting specific Office versions, and mso-* CSS properties that browsers ignore but that remain in the source. The mso-* properties themselves have no effect, but the inline styles and Word classes that travel with them can override your site styles, and the extra markup can bloat a page to 5 to 10 times its cleaned size.

What is the difference between semantic HTML and presentational HTML?

Semantic HTML uses tags like article, nav, h1, and strong that describe content meaning, while presentational HTML relies on style attributes and tags like font or b to define appearance. The HTML5 specification strongly favors semantic markup because it improves accessibility, SEO, and maintainability.

Will removing class attributes affect my CSS?

Yes. Class attributes are hooks for CSS selectors and JavaScript. Only strip classes when migrating content into a new CMS where the original classes do not apply. The Light preset leaves classes intact for this reason.

Is it safe to paste HTML with sensitive content into this tool?

Yes. All cleaning runs in your browser using client-side JavaScript, and no data is transmitted to a server. You can verify this by opening DevTools and checking the Network tab while pasting content.

Does the tool fix invalid HTML or just strip it?

It normalizes common issues such as unclosed tags, stray attributes, and empty elements, but it is not a full HTML5 parser like html5lib. For documents with severely broken structure, run the output through the W3C Nu HTML Checker to confirm validity before publishing.
