Detecting Phishing Kits via HTML Structure Similarity Analysis
The modern phishing landscape is characterized by extreme ephemerality. Threat actors no longer rely on static, long-lived infrastructure; instead, they deploy automated pipelines that spin up hundreds of unique domains, often on hijacked legitimate infrastructure, to host nearly identical phishing kits. Traditional detection mechanisms (URL blacklisting, reputation-based filtering, and even basic keyword scanning) are fundamentally reactive: by the time a URL is flagged by a threat intelligence feed, the attacker has often already rotated to a new domain, rendering the previous signature obsolete.
To move from reactive to proactive defense, we must shift our focus from the content of the phishing page (which is easily mutated) to the structural DNA of the phishing kit itself. This approach, known as HTML Structure Similarity Analysis, treats the underlying DOM (Document Object Model) architecture as a fingerprint that remains relatively constant even when the branding, text, and URLs are altered.
The Anatomy of a Phishing Kit
A phishing kit is essentially a collection of web assets designed to mimic a legitimate service (e.g., Microsoft 365, PayPal, or a banking portal). While the text content, images, and CSS values change to target different victims, the underlying HTML skeleton (the hierarchy of `<div>`, `<table>`, `<form>`, and `<input>` tags) remains remarkably consistent across different deployments of the same kit.
Attackers frequently use the same kit for weeks, simply swapping out the `action` attribute in a `<form>` tag to point to a new data-exfiltration endpoint. Because the structural arrangement of the login fields, the placement of the "Submit" button, and the nesting of the layout containers are hardcoded into the kit, they provide a much more stable signal for detection than the surface-level strings.
The Methodology: Structural Fingerprinting
Detecting these kits requires a pipeline that transforms raw, noisy HTML into a normalized, comparable structural representation. The process can be broken down into four distinct stages: Parsing, Normalization, Feature Extraction, and Similarity Computation.
1. DOM Parsing and Tree Construction
The first step is to parse the raw HTML into a structured DOM tree. This allows us to ignore the "noise" of the raw byte stream and focus on the parent-child relationships between elements. Using a robust parser (like `BeautifulSoup` in Python or `html5lib`) is critical to ensure that malformed HTML (often used by attackers to bypass simple regex-based scrapers) is correctly interpreted.
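As a minimal, dependency-free sketch, the same idea can be expressed with Python's built-in `html.parser` (in production, `BeautifulSoup` backed by `html5lib` handles malformed markup far more gracefully). The `DOMTreeBuilder` class and the sample fragment below are illustrative assumptions, not part of any library:

```python
from html.parser import HTMLParser

class DOMTreeBuilder(HTMLParser):
    """Builds a nested (tag, children) tree, discarding text and attributes."""
    # Void elements never receive a closing tag, so they are not pushed
    VOID = {"input", "img", "br", "hr", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self._stack[-1][1].append(node)
        if tag not in self.VOID:
            self._stack.append(node)

    def handle_endtag(self, tag):
        if len(self._stack) > 1 and self._stack[-1][0] == tag:
            self._stack.pop()

fragment = "<div><form><div><input></div><div><button>Sign in</button></div></form></div>"
builder = DOMTreeBuilder()
builder.feed(fragment)
# builder.root now holds only parent-child relationships; the "Sign in" text is gone
```

Because the class defines no `handle_data` method, text nodes are silently dropped during parsing, which conveniently anticipates the normalization stage below.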
2. Structural Normalization
Raw HTML contains significant entropy that must be stripped away to reveal the underlying structure. To achieve this, we must implement a normalization layer that:
- Removes Text Nodes: All human-readable text (the "payload") is discarded.
- Anonymizes Attributes: Unique identifiers (e.g., `id="login-btn-8472"`) and randomized CSS classes are stripped or regex-replaced with generic tokens (e.g., `id="[ID]"`).
- Prunes Non-Structural Elements: Elements like `<script>`, `<style>`, and `<meta>` tags, which contain high-variance data, are either removed or reduced to structural placeholders.
- Unifies Whitespace: Collapsing all whitespace ensures that formatting changes do not affect the structural hash.
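A minimal normalization pass along these lines can be sketched with regular expressions (a real implementation would operate on the parsed DOM rather than raw strings; `normalize_html` and its `[TOKEN]` placeholder are assumptions for illustration):

```python
import re

def normalize_html(raw: str) -> str:
    """Reduce raw HTML to a whitespace-stable structural skeleton."""
    # Prune non-structural, high-variance elements entirely
    raw = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", "", raw, flags=re.S | re.I)
    raw = re.sub(r"<meta\b[^>]*>", "", raw, flags=re.I)
    # Anonymize unique identifiers and randomized CSS classes
    raw = re.sub(r'\b(id|class|name)\s*=\s*"[^"]*"', r'\1="[TOKEN]"', raw, flags=re.I)
    # Discard human-readable text nodes between tags
    raw = re.sub(r">[^<]+<", "><", raw)
    # Unify whitespace so formatting changes cannot affect the structural hash
    return re.sub(r"\s+", " ", raw).strip()
```

Two deployments of the same kit that differ only in IDs, text payload, and indentation now normalize to the same string, which is the property the fingerprinting stages depend on.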
3. Feature Extraction: The N-gram Approach
Once the HTML is normalized, we represent the DOM tree as a sequence of structural tokens. A common and effective technique is to use Structural N-grams.
An N-gram is a contiguous sequence of $n$ items from a given sample. In our context, we can traverse the DOM tree (using a Depth-First Search) and generate a sequence of tags. For example, a simplified fragment of a login form might yield the sequence:
`[DIV, FORM, DIV, INPUT, DIV, BUTTON]`
By applying a sliding window of size $n$ (where $n$ is typically 3 or 4), we create a set of structural features. If $n=3$, our features are:
- `(DIV, FORM, DIV)`
- `(FORM, DIV, INPUT)`
- `(DIV, INPUT, DIV)`
- `(INPUT, DIV, BUTTON)`
This sequence captures the local topology of the page, making the detection resilient to the insertion of "junk" `<div>` tags used to obfuscate the layout.
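The traversal and windowing described above can be sketched as follows (the `(tag, children)` tuple representation of the tree is an assumed convention, not a library format):

```python
def dfs_tags(node):
    """Flatten a (tag, children) tree into a tag sequence via pre-order DFS."""
    tag, children = node
    seq = [tag.upper()]
    for child in children:
        seq.extend(dfs_tags(child))
    return seq

def structural_ngrams(tags, n=3):
    """Slide a window of size n over the tag sequence."""
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

# The simplified login-form fragment from the text, as a (tag, children) tree
form = ("div", [("form", [("div", [("input", [])]),
                          ("div", [("button", [])])])])

seq = dfs_tags(form)
# ['DIV', 'FORM', 'DIV', 'INPUT', 'DIV', 'BUTTON']
grams = structural_ngrams(seq, n=3)
# yields the four trigrams listed above
```

Representing the features as a set (rather than a list) is deliberate: set semantics feed directly into the Jaccard comparison in the next stage.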
4. Similarity Computation via MinHash
Comparing every new URL against a database of millions of known phishing structures is computationally expensive. To solve this, we use Locality Sensitive Hashing (LSH) via the MinHash algorithm.
MinHash allows us to estimate the Jaccard Similarity Coefficient between two sets of N-grams without performing a direct, exhaustive comparison. The Jaccard index is defined as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
By hashing the N-gram sets into small "signatures," we can quickly determine if a new, unseen webpage has a high structural overlap with a known phishing kit. If the similarity score exceeds a predefined threshold (e.g., 0.85), the page is flagged for investigation.
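A self-contained MinHash sketch in pure Python makes the mechanics concrete (a production system would typically reach for a library such as `datasketch`; the salted `blake2b` trick below is one way to simulate independent hash permutations):

```python
import hashlib

def _perm_hash(seed: int, feature) -> int:
    """A seeded 64-bit hash per slot simulates an independent permutation."""
    digest = hashlib.blake2b(f"{seed}:{feature}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(features, num_perm=128):
    """The signature keeps only the minimum hash value per permutation."""
    return [min(_perm_hash(seed, f) for f in features) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching slots converges to the true Jaccard index."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

known_kit = {("DIV", "FORM", "DIV"), ("FORM", "DIV", "INPUT"),
             ("DIV", "INPUT", "DIV"), ("INPUT", "DIV", "BUTTON")}
# The same kit with one junk wrapper div inserted (true Jaccard = 4/5)
variant = known_kit | {("DIV", "DIV", "FORM")}

score = estimate_jaccard(minhash_signature(known_kit), minhash_signature(variant))
# score is a noisy estimate near 0.8; compare it against the flagging threshold
```

The payoff is that each page is reduced to 128 integers: comparing two signatures is a fixed-cost operation regardless of page size, and LSH banding over those signatures extends this to sub-linear lookup across millions of stored fingerprints.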
Implementation and Operational Considerations
Deploying this in a production environment (such as an email gateway or a web proxy) requires careful engineering:
- The "Template Problem": Many legitimate websites use common frameworks like Bootstrap or Tailwind. If your normalization is too aggressive, you may inadvertently flag every website using Bootstrap as a "phishing kit." To mitigate this, you must weight structural features by rarity: N-grams shared by thousands of benign pages should contribute far less to the similarity score than N-grams unique to a known kit.
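One way to implement such rarity weighting is an inverse-document-frequency scheme learned from a benign corpus, so that framework boilerplate contributes almost nothing to the score. A small sketch, where the corpus, the feature names, and the `weighted_similarity` helper are all illustrative assumptions:

```python
import math
from collections import Counter

def idf_weights(benign_corpus):
    """Down-weight n-grams that appear on many benign pages."""
    n_docs = len(benign_corpus)
    doc_freq = Counter(gram for page in benign_corpus for gram in page)
    return {gram: math.log(n_docs / doc_freq[gram]) for gram in doc_freq}

def weighted_similarity(a, b, weights, unseen_weight=1.0):
    """Weighted Jaccard: kit-specific structure dominates the score."""
    w = lambda g: weights.get(g, unseen_weight)
    union = sum(w(g) for g in a | b)
    return sum(w(g) for g in a & b) / union if union else 0.0

# Four benign pages that all share framework-style boilerplate features
benign = [{"nav_bar", "footer", f"unique_{i}"} for i in range(4)]
weights = idf_weights(benign)
# "nav_bar" appears on every benign page -> weight log(4/4) = 0.0

page = {"nav_bar", "footer", "kit_form"}
other = {"nav_bar", "footer", "other_form"}
# Unweighted Jaccard would be 2/4 = 0.5; weighted, the shared
# boilerplate counts for nothing and the pages no longer look related
```

The same weights can be folded into the MinHash stage by dropping zero-weight features before signing, which keeps the LSH index free of framework noise.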
Conclusion
As shown across "The Anatomy of a Phishing Kit", "The Methodology: Structural Fingerprinting", and "Implementation and Operational Considerations", an effective system for detecting phishing kits via HTML structure similarity analysis depends on execution discipline as much as design.
The practical pipeline is to parse with a forgiving HTML parser, normalize away text and attribute entropy, extract structural N-grams, and compare signatures against known kits with MinHash-based LSH. Because the structural skeleton is the most expensive part of a kit for an attacker to mutate, this approach forces adversaries to re-engineer their kits rather than merely re-register domains, raising their costs across multiple independent layers.
Operational confidence should be measured, not assumed: track detection precision and false-positive rates against popular framework templates, monitor how quickly new kit variants drift below the similarity threshold, and use those results to tune normalization aggressiveness, N-gram size, and the flagging threshold on a fixed review cadence.