Using AI to Improve Hreflang at Scale
What We Learned from the Proof of Concept (Part 2)
AI hreflang TL;DR
AI can identify candidate links for hreflang tags with two important caveats:
- Language sites with far fewer pages than others pose a problem for current embedding-based AI models,
- Pages with a high proportion of generic copy on fairly broad topics can rank artificially high.
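To make the second caveat concrete, here is a minimal sketch of how embedding-based matching can over-score generic pages. The `cosine_similarity` function is standard, but the three vectors are purely illustrative stand-ins for real model output, not data from the PoC:

```python
import math

def cosine_similarity(a, b):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings: a specific product page, its true German
# counterpart, and a generic "about us" page whose vector sits close
# to many pages at once because its copy is broad and typical.
product_en = [0.9, 0.1, 0.2]
product_de = [0.85, 0.15, 0.25]
about_de = [0.7, 0.5, 0.5]

print(cosine_similarity(product_en, product_de))  # high: the true match
print(cosine_similarity(product_en, about_de))    # also fairly high: generic copy
```

The point is not the exact numbers but the gap: a generic page can sit uncomfortably close to a true match, which is why small, uneven page sets make ranking unreliable.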
In this second part, we move from planning to interpretation.
Rather than discussing implementation details, we focus on:
- what the Proof of Concept revealed,
- where it struggled,
- and how those signals could translate into useful product behaviour inside a real hreflang management interface.
The goal is to understand how AI might improve a WordPress hreflang manager plugin.
What the Proof of Concept Actually Revealed
The most useful outcome of the PoC wasn’t whether suggestions were “right” or “wrong”, but how confident the system was in different situations.
When pages clearly covered the same topic and played the same role across languages – for example, straightforward product or category pages – the system behaved consistently.
Where things became less clear, confidence dropped. Not because anyone had made an obvious mistake, but because the pages themselves had diverged over time – or because they relied on very generic business language that is genuinely difficult for AI to differentiate.
This is a common reality on large multilingual sites:
- markets expand at different speeds,
- content is updated unevenly,
- pages end up appearing as thin content because they cover a broad topic using typical language (think ‘about-us’),
- some pages accumulate detail while others remain more general.
None of this implies poor SEO practice. It reflects the fact that hreflang decisions are often made with incomplete or evolving information.
The important point is this:
Even when hreflang tags are valid and thoughtfully chosen, the underlying pages don’t always remain equally comparable as a site grows and changes.
That loss of clarity is exactly what the PoC surfaced.
From AI Output to Product Signals
A key lesson was that raw AI output isn’t useful on its own.
Confidence scores or ranked suggestions only become valuable when they are translated into signals that:
- are easy to understand,
- don’t override human decisions,
- and fit naturally into existing SEO workflows.
So rather than treating AI output as answers, we treated it as context.
UI Signal: “Not Enough Data to Assess Reliably”
One of the clearest product ideas to emerge was a neutral readiness state, rather than a warning or error.
For example:
German (de-DE): not enough data to assess reliably
This reflects situations such as:
- very small page sets in a given language,
- use of generic language on generic pages,
- recently launched or partially completed markets.
Importantly, this is not a judgement on content quality or on the AI’s ability to handle the language, and it doesn’t block anything.
In product terms, it simply explains:
- why strong suggestions aren’t available yet,
- and that no automatic changes have been made.
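A minimal sketch of how such a readiness state might be computed, assuming a simple per-language page count as the input. The `MIN_PAGES_FOR_ASSESSMENT` threshold and the function name are hypothetical, not taken from the actual plugin:

```python
# Hypothetical minimum sample size below which suggestions are withheld.
MIN_PAGES_FOR_ASSESSMENT = 20

def readiness_state(language_code, page_count):
    """Return a neutral UI state string, never a warning or an error."""
    if page_count < MIN_PAGES_FOR_ASSESSMENT:
        return f"{language_code}: not enough data to assess reliably"
    return f"{language_code}: ready for candidate ranking"

print(readiness_state("de-DE", 7))
# de-DE: not enough data to assess reliably
```

The design choice here is deliberate: the small-sample case returns an informational state, not an error code, so nothing downstream is blocked or changed.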
This helps set expectations without second-guessing professional decisions.
UI Signal: Ranked Candidates Per Page
Another strong outcome of the PoC was the usefulness of ranked suggestions instead of yes/no answers.
Rather than asking:
“Is this the correct hreflang page?”
the system worked better when framed as:
“Which pages are the strongest candidates for this page in another language?”
For each canonical page, the PoC could show:
- a short list of likely alternatives,
- relative confidence rather than absolutes,
- and clear cases where no strong candidate existed.
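One way to sketch this ranking behaviour, assuming precomputed similarity scores per candidate page. The scores, URLs, and `NO_CANDIDATE_CUTOFF` threshold are all illustrative, not values from the PoC:

```python
# Below this similarity, we report "no strong candidate" rather than guess.
NO_CANDIDATE_CUTOFF = 0.6

def rank_candidates(scores, top_k=3):
    """scores: {candidate_url: similarity}. Return up to top_k strong
    candidates as (url, score) pairs, ordered by relative confidence."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    strong = [(url, s) for url, s in ranked if s >= NO_CANDIDATE_CUTOFF]
    return strong[:top_k]

suggestions = rank_candidates({
    "/de/produkt-a/": 0.91,
    "/de/ueber-uns/": 0.72,
    "/de/kontakt/": 0.40,
})
print(suggestions)  # two candidates, ordered by relative confidence
print(rank_candidates({"/de/kontakt/": 0.40}))  # [] — no strong candidate
```

Returning an empty list is itself a useful signal: it surfaces the "no good equivalent exists yet" case instead of forcing a weak match.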
This supports several real-world workflows:
- double-checking existing mappings,
- spotting pages that may need editorial alignment,
- identifying gaps where no good equivalent exists yet.
Crucially, it treats existing hreflang mappings as informed decisions, not mistakes – while still acknowledging that circumstances change.
Why We Avoided Binary Validation
The PoC reinforced why simple pass/fail validation is often unhelpful.
Many real-world hreflang cases are:
- technically valid,
- reasonable when created,
- but harder to evaluate years later as sites evolve.
Reducing these cases to “correct” or “incorrect” hides that nuance.
By contrast, ranked and confidence-based signals:
- make uncertainty visible,
- encourage review rather than blind trust,
- and scale better across large, uneven sites.
This reflects how experienced SEO teams actually work.
What the Plugin Should Not Do
Just as important as what the system can do is what it deliberately should not do.
Based on the PoC, a responsible tool should not:
- automatically add or remove hreflang tags,
- silently change user decisions,
- or claim to know how search engines will interpret a site.
Its role is to support judgement, not replace it.
A useful rule of thumb is:
The system doesn’t decide. It explains.
Where This Leaves Us
The Proof of Concept confirmed that AI can be useful in hreflang workflows – when used carefully.
Its strengths lie in:
- highlighting uncertainty,
- ranking plausible alternatives at scale,
- and revealing situations where equivalence has become less clear over time.
Its limits are equally clear:
- editorial intent,
- market strategy,
- and final decisions remain human responsibilities.
Those boundaries aren’t a weakness. They’re what make the approach practical.
From here, the questions are no longer experimental:
- how these signals are presented without noise,
- how often they are refreshed,
- and how much explanation is helpful rather than distracting.
Those are product questions – and they’ll determine what ships.