Lookalike Domains: What Makes Them Hard to Catch

Why lookalikes work

A lookalike domain attack succeeds not because users are careless but because the visual difference between a legitimate domain and a crafted lookalike can be invisible. A single character substitution in a 15-character domain name is not something most people catch when they are reading quickly. An attacker who knows this does not need to trick experts. They need to trick anyone who clicks a link in a hurry.

The techniques vary in sophistication. Simple typosquats are caught by basic brand monitoring. Unicode homoglyph attacks, subdomain abuse, and combo-squatting require detection methods that most standard monitoring tools do not implement correctly. Understanding the full taxonomy tells you what your monitoring has to cover.

The taxonomy

Typosquatting

The simplest technique: register a domain with a common typing error in the brand name. Common patterns include character transposition (claisrec.com for clairsec.com), character doubling (clairssecc.com), character omission (clarsec.com), and adjacent-key substitution on QWERTY keyboards (clairsex.com, vlairsec.com).

Detection requirement: fuzzy string matching against a brand dictionary with Levenshtein distance thresholds, applied continuously to domain registration feeds. Most commercial monitoring tools handle this adequately.
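
As a rough illustration of the matching step, here is a minimal sketch of Levenshtein-threshold matching in Python. The brand list, the threshold of 2, and the function names are assumptions made for the example, and the feed is assumed to deliver registrable domains whose first label is the name to compare.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


BRANDS = ["clairsec"]  # hypothetical brand dictionary


def is_typosquat(label: str, brand: str, max_distance: int = 2) -> bool:
    """True when label is close to the brand but not identical to it."""
    distance = levenshtein(label.lower(), brand.lower())
    return 0 < distance <= max_distance


def flag_typosquats(domains, brands=BRANDS, max_distance=2):
    """Return (domain, brand) pairs for every domain within the threshold."""
    hits = []
    for domain in domains:
        # Assumption: the first label is the part to compare against the brand.
        label = domain.lower().rstrip(".").split(".")[0]
        for brand in brands:
            if is_typosquat(label, brand, max_distance):
                hits.append((domain, brand))
    return hits
```

The threshold is a trade-off: a larger value catches more variants but produces more false positives, especially for short brand names.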

Homoglyph substitution

Homoglyphs are characters that look identical or near-identical across different Unicode scripts. The Cyrillic "a" (U+0430) is visually indistinguishable from the Latin "a" (U+0061) in most fonts. A domain using Cyrillic characters can render as a legitimate-looking string wherever the Unicode form is displayed, but it is registered and resolved under an entirely different Punycode (xn--) name.

Example: clаіrsec.com (where "а" is Cyrillic U+0430 and "і" is Cyrillic U+0456) displays as clairsec.com but is a different domain entirely.

Detection requirement: Unicode normalization and cross-script character mapping. Registration feeds carry internationalized domains in their Punycode (xn--) form, so the monitoring system must decode candidates back to Unicode, map each confusable character to its ASCII look-alike using character-level substitution tables, and then test the result against the brand profile. Many tools match ASCII representations only and miss this entirely.
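
A minimal sketch of that idea in Python, using only the standard library. The CONFUSABLES table below is a tiny hand-picked subset chosen for illustration, and the function names are assumptions; a real deployment would load a much fuller mapping, such as the confusables data published with Unicode Technical Standard #39.

```python
import unicodedata

# Illustrative subset only: confusable code point -> ASCII look-alike.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0445": "x",  # Cyrillic ha
    "\u0443": "y",  # Cyrillic u
    "\u0456": "i",  # Cyrillic Byelorussian-Ukrainian i
    "\u0455": "s",  # Cyrillic dze
    "\u03b1": "a",  # Greek alpha
    "\u03bf": "o",  # Greek omicron
}


def to_unicode_label(label: str) -> str:
    """Decode a single xn-- label to Unicode; pass other labels through."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def skeleton(label: str) -> str:
    """Reduce a label to an ASCII-ish form for comparison with the brand."""
    text = unicodedata.normalize("NFKC", to_unicode_label(label)).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def is_homoglyph_of(label: str, brand: str) -> bool:
    """True when the label is not the brand but collapses to it."""
    return to_unicode_label(label).lower() != brand and skeleton(label) == brand


# Build the article's example label (Cyrillic U+0430 and U+0456) and its
# Punycode form, as it would appear in a registration feed.
spoofed = "cl\u0430\u0456rsec"
ace_label = "xn--" + spoofed.encode("punycode").decode("ascii")
```

Calling is_homoglyph_of(ace_label, "clairsec") returns True, while the genuine ASCII label returns False. The skeleton can also be fed into a fuzzy matcher so that a homoglyph combined with a typo is still caught.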

Combo-squatting

A combo-squat combines the legitimate brand name with another word to create a plausible-looking domain name. Common additions: -login, -secure, -portal, -verify, -support, my-, get-, account-. The brand name is spelled correctly, which defeats pure string-distance matching.

Examples: clairsec-login.com, myclairsec.com, clairsec-portal.io.

Detection requirement: brand keyword detection within domain strings, not just whole-domain matching. The monitoring system must flag any newly registered domain containing your brand terms as a substring, regardless of what surrounds them.
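
A minimal sketch of substring matching in Python. The brand terms, the allowlist of legitimately owned domains, and the function name are assumptions made for the example.

```python
BRAND_TERMS = ["clairsec"]   # hypothetical brand keywords
ALLOWLIST = {"clairsec.com"}  # hypothetical set of domains the brand owns


def matched_brand_terms(domain, brand_terms=BRAND_TERMS, allowlist=ALLOWLIST):
    """Return the brand terms found anywhere in the domain string."""
    name = domain.lower().rstrip(".")
    if name in allowlist:
        return []  # the brand's own domain is not a finding
    return [term for term in brand_terms if term in name]


for candidate in ["clairsec-login.com", "myclairsec.com", "clairsec-portal.io"]:
    print(candidate, matched_brand_terms(candidate))
```

Substring matching is deliberately broad. Short or dictionary-word brand terms will match many unrelated domains, so the output is best treated as a candidate list for scoring or review rather than a verdict.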

Transposition attacks

Adjacent characters in the brand name are swapped. Examples: calirsec.com, claisrec.com. These are a subset of typosquatting but worth listing separately because they produce strings that look plausible on fast reading, particularly when the transposition affects characters that are visually similar in shape (r/n, u/v, m/n).
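
Plain Levenshtein distance counts an adjacent swap as two edits, so a tight threshold can let transpositions through. One way around that, sketched here in Python, is to enumerate the swaps directly and test candidates against the set. The function names are assumptions made for the example.

```python
def adjacent_transpositions(brand: str) -> set:
    """Every string produced by swapping one pair of neighbouring characters."""
    variants = set()
    for i in range(len(brand) - 1):
        if brand[i] != brand[i + 1]:  # swapping identical characters changes nothing
            variants.add(brand[:i] + brand[i + 1] + brand[i] + brand[i + 2:])
    return variants


def is_transposition(label: str, brand: str) -> bool:
    """True when label is the brand with exactly one adjacent pair swapped."""
    return label.lower() in adjacent_transpositions(brand.lower())
```

For example, is_transposition("claisrec", "clairsec") is True because only the r and s have traded places. An alternative with the same effect is a Damerau-Levenshtein style distance, which scores an adjacent swap as a single edit.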

TLD substitution

The brand name is registered correctly but under an alternative top-level domain: clairsec.io, clairsec.net, clairsec.co, clairsec.info. Risk varies by TLD credibility. A .co registration impersonating a .com brand is common enough to warrant monitoring as a high-severity signal. A .info registration is lower-value but still used in mass phishing campaigns.

Detection requirement: monitoring brand keywords across all gTLDs and relevant ccTLDs, not just the registered TLDs.
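
A minimal sketch of a TLD-swap check in Python. The registered set and the function names are assumptions; the "high" and "low" ratings follow the .co and .info examples above, and the "medium" default for every other TLD is an assumption added for the example.

```python
REGISTERED = {"clairsec.com"}              # hypothetical: domains the brand owns
SEVERITY = {"co": "high", "info": "low"}   # ratings taken from the text above
DEFAULT_SEVERITY = "medium"                # assumption for TLDs not listed


def tld_variants(brand, tlds, registered=REGISTERED):
    """Brand name under each watched TLD, minus what the brand already owns."""
    candidates = (f"{brand}.{tld}" for tld in tlds)
    return sorted(c for c in candidates if c not in registered)


def classify_tld_swap(domain, brand, registered=REGISTERED):
    """Return a severity label if domain is the exact brand under another TLD."""
    name = domain.lower().rstrip(".")
    label, _, tld = name.partition(".")  # everything after the first dot is the suffix
    if label != brand or not tld or name in registered:
        return None
    return SEVERITY.get(tld, DEFAULT_SEVERITY)
```

tld_variants can seed a watchlist of names to check proactively, while classify_tld_swap scores names as they arrive from a registration feed.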

Subdomain abuse

The attacker does not register a lookalike domain at all. They host a phishing page at a hostname or path that embeds the brand: as a subdomain of a domain they control, such as clairsec.com.malicious-host.ru, or on a compromised legitimate website at legitimate-site.com/clairsec-login. The full URL contains the brand name, which is what users and some security tools check.

Detection requirement: passive DNS and URL scanning, not just domain registration monitoring. This is the hardest variant to catch because it does not appear in domain registration feeds at all.
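
A minimal sketch of the URL-side check in Python, for URLs gathered from passive DNS or URL scanning. The brand, the owned-domain set, and the function names are assumptions. The registrable-domain helper is deliberately naive (it keeps the last two labels), so a real system would consult the Public Suffix List instead.

```python
from urllib.parse import urlsplit

BRAND = "clairsec"               # hypothetical brand keyword
OWNED_DOMAINS = {"clairsec.com"}  # hypothetical: domains the brand controls


def registrable_domain(host: str) -> str:
    """Naive approximation: the last two labels of the hostname."""
    return ".".join(host.split(".")[-2:])


def brand_abuse_location(url, brand=BRAND, owned=OWNED_DOMAINS):
    """Return 'hostname' or 'path' if the brand appears on infrastructure
    the brand does not own, otherwise None."""
    parts = urlsplit(url if "://" in url else "//" + url)
    host = (parts.hostname or "").lower().rstrip(".")
    if registrable_domain(host) in owned:
        return None  # the brand's own site
    if brand in host:
        return "hostname"
    if brand in parts.path.lower():
        return "path"
    return None
```

With these assumptions, clairsec.com.malicious-host.ru is reported as "hostname" and legitimate-site.com/clairsec-login as "path", while URLs under clairsec.com itself are ignored.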

Why automated detection misses the hard cases

Most domain monitoring tools are built around two inputs: a brand keyword list and a stream of newly registered domains. Matching is ASCII-based. This catches typosquats and TLD substitutions reliably. It catches combo-squats if the keyword matching is implemented correctly.

It misses homoglyph attacks because the punycode representation of a Cyrillic-substituted domain does not match the ASCII brand string. It misses subdomain abuse entirely because the attack does not involve a newly registered domain. And it misses infrastructure reuse, where an attacker uses a domain registered months ago (which passed monitoring as a low-risk registration at the time) and newly points it at a phishing page.

The sophistication ceiling of most brand monitoring tools is exactly where the sophisticated attacks begin. Commodity phishing kits use typosquats. Targeted attacks use homoglyphs and infrastructure reuse.

What complete detection requires

Closing the detection gap requires four capabilities working together:

Unicode-aware fuzzy matching that converts candidate domains to canonical form before comparison, with substitution tables covering the most-abused Unicode ranges (Cyrillic, Greek, Latin look-alikes).
Certificate transparency monitoring as a parallel signal. Phishing infrastructure typically obtains TLS certificates before it goes live. CT logs are a faster signal than some registrar feeds and catch infrastructure that is not yet resolving.
Active probing of flagged domains. A domain that looks like a lookalike but serves no content today might serve a credential harvesting page tomorrow. Periodic active checks on flagged domains catch late-activation attacks.
URL and passive DNS scanning to catch subdomain abuse on existing infrastructure.
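
The active-probing capability can be sketched as two small bookkeeping steps in Python: deciding which flagged domains are due for another check, and spotting domains that served nothing on the previous pass but serve content now. The record type, the 24-hour interval, and the function names are assumptions, and the probe itself (the HTTP request) is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FlaggedDomain:
    name: str
    last_checked: datetime


def due_for_probe(flagged, now, interval=timedelta(hours=24)):
    """Names of flagged domains whose last check is at least one interval old."""
    return [d.name for d in flagged if now - d.last_checked >= interval]


def late_activations(previous, current):
    """Domains serving content now that were not serving content before.

    Both arguments map a domain name to True (served content) or False.
    """
    return sorted(
        name for name, live in current.items()
        if live and not previous.get(name, False)
    )
```

A domain that appears in late_activations is a candidate for immediate review, since the change from empty to live is the signal this capability exists to catch.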

The organizations most consistently victimized by brand impersonation attacks are the ones relying on a single monitoring source and a single matching method. Coverage requires overlap: multiple feeds, multiple detection methods, and human review of high-confidence matches before they are escalated as confirmed threats.
