October 7, 2025 𝄪 Engineering 𝄪 7 mins to read

Near-Duplicate Detection with Simhash

There's this specific kind of frustration that only comes from watching your crawl budget burn through thousands of dollars fetching the same exact article fifty different times because someone's CMS decided to add a timestamp to every page load. Different URL. Different HTML. Same. Damn. Article.

I was three weeks into building a news aggregator when I realized I'd fetched the same Economic Times story about gold price increases 127 times. One hundred and twenty-seven. The only things that changed? The ad rotation, some tracking pixels, and (this is my favorite) a "Last updated: [current timestamp]" in the footer that updated every single time you loaded the page.

ETags didn't help. Last-Modified headers were a lie. MD5 hashing the HTML was like asking "are these two snowflakes identical?" when what I really needed was "are these both... you know... snow?"

Enter simhash. (And no, I didn't know what that was either. Stay with me.)

The Thing Nobody Tells You About Web Crawling

Here's what happens when you're crawling at any kind of scale: you drown in near-duplicates.

Not actual duplicates (those would be easy). No, these are pages that are like... 95% the same content, wrapped in slightly different HTML gift paper. The article is identical, but:

  • The navigation changed (A/B test, probably)
  • The ads rotated (of course they did)
  • Someone added a new comment (one comment! one!)
  • The "related articles" sidebar shuffled itself
  • A timestamp updated somewhere
  • The weather widget refreshed

And your traditional caching strategies just... give up? ETags work great until someone's CDN forgets to set them. Last-Modified headers are fantastic in theory, adorable in practice. And MD5 hashing the full HTML? One changed <span class="timestamp"> and boom, completely different hash.
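
To see just how unforgiving that is, here's a quick sketch using Node's built-in crypto module (the HTML strings are made up for illustration):

import { createHash } from 'node:crypto';

const md5 = (s: string) => createHash('md5').update(s).digest('hex');

// Two "pages" that differ only by a one-character timestamp in the footer
const pageA = '<article><h1>Gold prices rise</h1><p>Some body text.</p><footer>Last updated: 10:01</footer></article>';
const pageB = '<article><h1>Gold prices rise</h1><p>Some body text.</p><footer>Last updated: 10:02</footer></article>';

console.log(md5(pageA)); // bears no resemblance to...
console.log(md5(pageB)); // ...this one, despite a one-character difference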

What I needed was something that could say: "yeah these two 50KB HTML blobs differ by 200 bytes of rotating ads and a timestamp, but the actual article is identical, skip it."

The Pipeline: HTML → Clean → Markdown → Simhash

I ended up building this three-stage pipeline, and honestly? The simhash part is almost the boring part. The interesting part is the normalization.

(Which sounds backwards, I know. The hash with "sim" in the name should be the exciting bit. But stick with me.)

Stage 1: Cleaning the Mess (Because HTML is Never Just HTML)

You know what raw HTML is full of? Lies. Beautiful, structured lies.

So first, we parse it into a DOM using jsdom:

import { JSDOM } from 'jsdom';

function createDom(html: string): Document {
    const dom = new JSDOM(html);
    return dom.window.document;
}

Then we start removing all the stuff that's technically HTML but definitely not content:

// The usual suspects
const BOILERPLATE_SELECTORS = [
    'script',
    'style',
    'nav',
    'header',
    'footer',
    '[role="navigation"]',
    '[role="complementary"]',
    '.sidebar',
    '.advertisement',
    '.cookie-banner',
    'form',
    'input',
    'button',
];

function removeElements(document: Document, selector: string): number {
    const elements = document.querySelectorAll(selector);
    elements.forEach((el) => el.remove());
    return elements.length;
}

But that's not enough. Oh no. People hide navigation in <div class="totally-not-a-nav"> and call it a day.

So we get clever. Detecting boilerplate by sniffing class and id names:

function hasBoilerplateTokens(element: Element): boolean {
    const boilerplatePatterns = [
        'nav',
        'menu',
        'sidebar',
        'footer',
        'header',
        'advertisement',
        'ad-',
        'social',
        'share',
        'comment-form',
        'cookie',
        'popup',
        'modal',
    ];

    // getAttribute avoids surprises on SVG elements, where className isn't a plain string
    const className = (element.getAttribute('class') || '').toLowerCase();
    const id = (element.id || '').toLowerCase();

    return boilerplatePatterns.some(
        (pattern) => className.includes(pattern) || id.includes(pattern)
    );
}

And then (then!) we deal with link-dense navigation lists. You know those menus that are just 40 links crammed into a <ul>? Yeah, those:

function isLinkDenseList(element: Element): boolean {
    const links = element.querySelectorAll('a');
    const textLength = (element.textContent || '').length;
    const linkTextLength = Array.from(links)
        .map((link) => (link.textContent || '').length)
        .reduce((sum, len) => sum + len, 0);

    // If >80% of text is links, it's navigation, probably
    return textLength > 0 && linkTextLength / textLength > 0.8;
}

Putting it all together:

function cleanDom(document: Document): void {
    // Remove the obvious stuff
    removeElements(document, BOILERPLATE_SELECTORS.join(', '));

    // Remove elements with sus class/id names
    const allElements = document.querySelectorAll('*');
    for (const element of Array.from(allElements)) {
        if (hasBoilerplateTokens(element)) {
            element.remove();
        }
    }

    // Remove link spam (navigation menus)
    const lists = document.querySelectorAll('ul, ol, div');
    for (const list of Array.from(lists)) {
        if (isLinkDenseList(list)) {
            list.remove();
        }
    }

    // Also: fix lazy-loaded images, resolve relative URLs
    // (skipping that code because this post is already long)
}

This stage alone removes 80-90% of the noise. Eighty to ninety percent. Most of the HTML on the modern web is just... not content.

Stage 2: Finding the Article (The "Please Just Give Me The Words" Stage)

Okay so we've removed the obvious junk. Now we need to find the actual article.

This is where Mozilla's Readability algorithm comes in. (Yes, the same one Firefox uses for Reader View. It's good.)

import { Readability } from '@mozilla/readability';

function extractMainContent(document: Document): Element {
    // Readability modifies the DOM, so clone it first
    const clonedDoc = document.cloneNode(true) as Document;

    // Try Readability
    const reader = new Readability(clonedDoc, {
        charThreshold: 400, // "Valid content" = at least 400 characters
    });
    const article = reader.parse();

    if (article && article.content) {
        // Parse the extracted HTML back into a DOM element
        const template = document.createElement('template');
        template.innerHTML = article.content;
        return template.content.firstElementChild as Element;
    }

    // Readability failed? Okay, fallback time.
    // Try known content selectors
    const selectors = ['article', 'main', '[role="main"]', '#content'];
    for (const selector of selectors) {
        const element = document.querySelector(selector);
        if (element && (element.textContent?.length || 0) >= 400) {
            return element;
        }
    }

    // Still nothing? Find the biggest chunk of text and hope for the best
    return findLargestTextBlock(document.body) || document.body;
}

That last fallback (finding the largest text block) is kind of a last-ball sixer, but it works surprisingly often:

function findLargestTextBlock(root: Element): Element | null {
    let largest = root;
    let maxLength = (root.textContent || '').length;

    const candidates = root.querySelectorAll('div, article, section, main');
    candidates.forEach((element) => {
        const length = (element.textContent || '').length;
        if (length > maxLength) {
            maxLength = length;
            largest = element;
        }
    });

    return largest;
}

"Just find the element with the most words in it" is caveman logic, but sometimes caveman logic is exactly what you need.

Stage 3: HTML → Markdown (The Part That Actually Made Everything Work)

Here's the insight that took me way too long to figure out: converting to Markdown first makes everything better.

Why? Because Markdown is normalized by design. It doesn't care about your CSS classes or inline styles or whether you used <em> or <i>. It's just... text with some semantic markers.

I use Turndown for this:

import TurndownService from 'turndown';
import { gfm } from 'turndown-plugin-gfm';

function createMarkdownConverter(): TurndownService {
    const turndown = new TurndownService({
        headingStyle: 'atx', // # Headers
        codeBlockStyle: 'fenced', // ``` code blocks
    });

    // Add GitHub Flavored Markdown support
    turndown.use(gfm);

    return turndown;
}
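
As a quick sanity check of the "Markdown is normalized" point, here's a hypothetical example using the converter we just built: two snippets that differ in tags, classes, and inline styles but carry the same content should collapse to the same Markdown.

const turndown = createMarkdownConverter();

const a = turndown.turndown('<p class="text-lg"><em>Gold</em> prices <b>rise</b></p>');
const b = turndown.turndown('<p style="font-size: 18px"><i>Gold</i> prices <strong>rise</strong></p>');

console.log(a); // _Gold_ prices **rise**
console.log(b); // _Gold_ prices **rise**
console.log(a === b); // true: the presentation noise is gone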

But wait, there's more. Code blocks on the web often have the language in the CSS class (language-python, hljs-javascript). We want to preserve that:

function getCodeLanguage(element: Element | null): string {
    if (!element) return '';

    const className = element.className || '';

    // Match: language-js, lang-python, hljs-typescript
    const patterns = [/language-(\w+)/, /lang-(\w+)/, /hljs-(\w+)/];

    for (const pattern of patterns) {
        const match = className.match(pattern);
        if (match) return match[1];
    }

    return '';
}

// Add this rule inside createMarkdownConverter(), before `return turndown;`
turndown.addRule('codeWithLanguage', {
    filter: (node) => {
        return node.nodeName === 'PRE' && node.querySelector('code') !== null;
    },
    replacement: (content, node) => {
        const codeElement = (node as HTMLElement).querySelector('code');
        const code = codeElement?.textContent || '';
        const lang = getCodeLanguage(codeElement);
        return `\n\`\`\`${lang}\n${code}\n\`\`\`\n`;
    },
});

Now we have clean, stable, normalized Markdown. This is what we hash.

Simhash (Finally)

Okay so simhash. What is it?

It's a locality-sensitive hash, which is a fancy way of saying "similar inputs produce similar outputs." The opposite of cryptographic hashes (MD5, SHA) where changing one bit creates a completely different hash.

Here's the beautiful part: if two documents are similar, their simhashes will differ by only a few bits. And we can count differing bits with Hamming distance.

The Algorithm (It's Cooler Than It Sounds)

Step 1: Tokenize the markdown

function tokenize(markdown: string): string[] {
    return markdown
        .toLowerCase()
        .split(/\s+/)
        .filter((token) => token.length > 2);
}

// Or use 3-word shingles for better accuracy
function createShingles(text: string, size: number = 3): string[] {
    const words = text.toLowerCase().split(/\s+/);
    const shingles: string[] = [];

    for (let i = 0; i <= words.length - size; i++) {
        shingles.push(words.slice(i, i + size).join(' '));
    }

    return shingles;
}
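
(If shingles are new to you: for a made-up sentence, createShingles produces overlapping 3-word windows, roughly like this.)

createShingles('gold prices rise again on weak dollar');
// → ['gold prices rise', 'prices rise again', 'rise again on', 'again on weak', 'on weak dollar']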

Step 2: Hash each token with MurmurHash

MurmurHash is this wonderful non-cryptographic hash function that's fast and produces well-distributed outputs. Perfect for this:

import murmur from 'murmurhash3js-revisited';

function hash64(token: string): bigint {
    // murmurhash3js-revisited hashes raw bytes, so encode the token first
    const bytes = new TextEncoder().encode(token);

    // MurmurHash gives us 32 bits; run it twice with different seeds and combine for 64
    const h1 = murmur.x86.hash32(bytes, 0);
    const h2 = murmur.x86.hash32(bytes, h1);
    return (BigInt(h1) << 32n) | BigInt(h2);
}

Step 3: The Voting Mechanism (This is where it gets good)

Each token "votes" on each of the 64 bits:

function computeSimhash(tokens: string[]): bigint {
    // 64-dimensional vector, start at 0
    const vector = new Array(64).fill(0);

    // Each token votes on every bit
    for (const token of tokens) {
        const hash = hash64(token);

        for (let i = 0; i < 64; i++) {
            const bit = (hash >> BigInt(i)) & 1n;

            // If bit is 1, vote +1. If bit is 0, vote -1
            vector[i] += bit === 1n ? 1 : -1;
        }
    }

    // Final hash: if vector[i] > 0, set bit i to 1
    let simhash = 0n;
    for (let i = 0; i < 64; i++) {
        if (vector[i] > 0) {
            simhash |= 1n << BigInt(i);
        }
    }

    return simhash;
}

Why this works: similar documents share many tokens, so their bit vectors vote similarly. Different documents have different tokens, different votes, different final hashes. But (and this is key) slightly different documents produce slightly different hashes.

Hamming Distance (Or: Counting Differing Bits)

Hamming distance is just "how many bits are different?"

function hammingDistance(hash1: bigint, hash2: bigint): number {
    let xor = hash1 ^ hash2; // XOR gives 1 where bits differ
    let distance = 0;

    while (xor > 0n) {
        distance += Number(xor & 1n); // Count the 1s
        xor >>= 1n;
    }

    return distance;
}

function similarityPercentage(hash1: bigint, hash2: bigint): number {
    const distance = hammingDistance(hash1, hash2);
    return ((64 - distance) / 64) * 100;
}
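
To see the contrast in action, here's a toy (hypothetical) check using the functions above. With strings this short the gap is less dramatic than with full articles, but the shape holds: near-identical text lands much closer than unrelated text, which hovers around half the bits (~32).

const docA = 'gold prices rose sharply on tuesday as the dollar weakened against major currencies';
const docB = 'gold prices rose sharply on tuesday as the dollar weakened against most major currencies';
const docC = 'the local football club announced a new head coach after a disappointing season';

const hashA = computeSimhash(createShingles(docA));
const hashB = computeSimhash(createShingles(docB));
const hashC = computeSimhash(createShingles(docC));

console.log(hammingDistance(hashA, hashB)); // small-ish: most shingles are shared
console.log(hammingDistance(hashA, hashC)); // large: roughly half the 64 bits differ
console.log(similarityPercentage(hashA, hashB)); // correspondingly high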

What the Numbers Mean

Based on Google's research on near-duplicate detection:

  • 0-3 bits different (95%+ similar): Identical content
    • Same article, maybe minor template changes
    • Skip the fetch, use cache
  • 4-8 bits different (85-95% similar): Near-duplicate
    • Similar content with meaningful differences
    • Could be an updated article
    • Judgment call, depends on your use case
  • 9+ bits different (below 85% similar): Different content
    • Substantially different documents
    • Fetch it

(These thresholds are tunable. Start here, adjust based on your false positive/negative tolerance.)
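
If you'd rather have those thresholds as code, a small helper (the names are mine, not from any library) might look like this:

type Verdict = 'identical' | 'near-duplicate' | 'different';

function classifyDistance(distance: number): Verdict {
    if (distance <= 3) return 'identical'; // skip the fetch, use cache
    if (distance <= 8) return 'near-duplicate'; // judgment call
    return 'different'; // fetch it
}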

Putting It All Together (The Actual Caching Part)

interface CachedHash {
    url: string;
    hash: bigint;
    fetchedAt: Date;
}

async function shouldFetch(url: string, html: string): Promise<boolean> {
    // 1. Clean the HTML
    const document = createDom(html);
    removeElements(document, BOILERPLATE_SELECTORS.join(', '));
    const mainContent = extractMainContent(document);

    // 2. Convert to Markdown
    const markdown = createMarkdownConverter().turndown(mainContent.outerHTML);

    // 3. Compute simhash
    const tokens = tokenize(markdown);
    const newHash = computeSimhash(tokens);

    // 4. Check recent hashes from same domain
    const domain = new URL(url).hostname;
    const cached = await db.getRecentHashes(domain, { limit: 100 });

    // 5. Look for near-duplicates
    for (const entry of cached) {
        const distance = hammingDistance(newHash, entry.hash);

        if (distance <= 3) {
            // 95%+ match, we've seen this before
            const similarity = similarityPercentage(newHash, entry.hash);
            console.log(`Cache hit: ${similarity}% similar to ${entry.url}`);
            return false; // Don't fetch
        }
    }

    // 6. New content! Store the hash
    await db.storeHash({ url, hash: newHash, fetchedAt: new Date() });
    return true; // Fetch it
}

A note on indexing: For small scale (thousands of hashes), this naive O(n) comparison is fine. For millions? You need Locality-Sensitive Hashing indices or bit-split multi-indexing. Libraries like FAISS can help. But start simple, optimize later.
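
For a flavor of what "bit-split multi-indexing" means, here's a rough sketch (my naming, loosely following the idea in Google's near-duplicate paper by Manku et al.): split each 64-bit hash into four 16-bit bands and bucket by band. Any two hashes within Hamming distance 3 must agree exactly on at least one band (pigeonhole: 3 differing bits can touch at most 3 of the 4 bands), so a lookup only compares against hashes that share a band with the query.

const BANDS = 4;
const BAND_BITS = 16n;

// Derive one bucket key per 16-bit band of the hash
function bandKeys(hash: bigint): string[] {
    const keys: string[] = [];
    for (let i = 0; i < BANDS; i++) {
        const band = (hash >> (BigInt(i) * BAND_BITS)) & 0xffffn;
        keys.push(`${i}:${band.toString(16)}`);
    }
    return keys;
}

class SimhashIndex {
    private buckets = new Map<string, bigint[]>();

    add(hash: bigint): void {
        for (const key of bandKeys(hash)) {
            const bucket = this.buckets.get(key) ?? [];
            bucket.push(hash);
            this.buckets.set(key, bucket);
        }
    }

    // Return every stored hash within maxDistance of the query
    // (exhaustive only for maxDistance <= 3 with 4 bands)
    query(hash: bigint, maxDistance = 3): bigint[] {
        const candidates = new Set<bigint>();
        for (const key of bandKeys(hash)) {
            for (const candidate of this.buckets.get(key) ?? []) {
                candidates.add(candidate);
            }
        }
        return [...candidates].filter(
            (candidate) => hammingDistance(hash, candidate) <= maxDistance
        );
    }
}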

Doing This in PostgreSQL

If you're using Postgres (and you probably are), you can calculate Hamming distance directly in SQL using the bit_count() function (Postgres 14+):

CREATE TABLE content_hashes (
    id SERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    domain TEXT NOT NULL,
    hash BIGINT NOT NULL,
    fetched_at TIMESTAMP DEFAULT NOW()
);

-- Find similar hashes (distance <= 3)
SELECT url, hash, bit_count((hash # $1)::bit(64)) AS distance
FROM content_hashes
WHERE domain = $2
  AND bit_count((hash # $1)::bit(64)) <= 3
ORDER BY distance
LIMIT 10;

The # operator does XOR on integers, the ::bit(64) cast reinterprets the result as a bit string, and bit_count() counts the set bits: exactly what we need for Hamming distance. It's fast enough for tens of thousands of hashes per domain. Beyond that, you'll want specialized indices.
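
One wrinkle worth flagging (my addition, not something the schema above handles): BIGINT is signed, while the simhash we computed lives in unsigned 64-bit space. Reinterpret it as a signed 64-bit value on the way in and undo that on the way out; the XOR and bit_count math is unaffected because the underlying bit pattern is identical.

// Reinterpret the unsigned 64-bit simhash as signed for a Postgres BIGINT column
function toSignedBigint(hash: bigint): bigint {
    return BigInt.asIntN(64, hash);
}

// ...and back to the unsigned form after reading it out of the database
function toUnsignedBigint(stored: bigint): bigint {
    return BigInt.asUintN(64, stored);
}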

Why This Works (The Part I Should Have Led With)

Direct HTML hashing is fragile. One changed CSS class, one rotated ad, one updated timestamp, different hash.

But converting to Markdown first?

  • All the HTML noise disappears: No more <span class="text-blue-500 font-bold"> nonsense
  • Semantic structure stays: Headers are headers, lists are lists
  • Stable tokens: Fewer false negatives from presentation changes
  • Faster hashing: Simpler input, less computation

The pipeline creates fingerprints that survive template updates while still catching actual content changes. It's... kind of elegant, actually.

The Mistakes I Made (So You Don't Have To)

I tried direct HTML simhash first. Terrible results. Template changes created false negatives everywhere. Normalize first, always.

I picked too strict a threshold initially. Distance ≤ 1 seemed safer, but I was missing obvious duplicates. 3-5 is the sweet spot for most content.

What's Next (If You Want to Get Fancy)

This is just the foundation. You could:

  • Weight tokens by structure: Headers worth more than body text (a sketch follows this list)
  • Add time decay: Expire old hashes from the index
  • Tune thresholds by domain: News sites vs docs need different sensitivity
  • Ensemble fingerprints: Combine simhash with DOM structure hashes
  • Track drift over time: Monitor how content changes, detect site redesigns
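
As a taste of that first idea, weighting tokens is a small change to the voting loop. Here's a sketch (the weight values are mine, purely illustrative; you'd derive them from the Markdown structure, say 3 for tokens that came from a heading line and 1 for body text):

function computeWeightedSimhash(
    tokens: { text: string; weight: number }[]
): bigint {
    const vector = new Array(64).fill(0);

    // Same voting mechanism as before, but each token votes with its weight
    for (const { text, weight } of tokens) {
        const hash = hash64(text);
        for (let i = 0; i < 64; i++) {
            const bit = (hash >> BigInt(i)) & 1n;
            vector[i] += bit === 1n ? weight : -weight;
        }
    }

    let simhash = 0n;
    for (let i = 0; i < 64; i++) {
        if (vector[i] > 0) {
            simhash |= 1n << BigInt(i);
        }
    }
    return simhash;
}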

But honestly? Start with this. It works. It's simple. It'll save you money.

So yeah. Simhash works. Normalize your content, hash it, compare the bits.

And maybe (maybe) you won't spend three hours debugging why your crawler thinks the same article is new just because someone updated a timestamp.

(Though knowing web development, something else will break. It always does.)
