Is your site the victim of internal site search spam?

Over the last year or so, we’ve seen large-scale, widespread SEO spam ‘attacks’ on WordPress sites, all targeting their internal site search functionality. In most cases, these attacks aren’t harmful from an SEO perspective, but they do come with time and resource costs – for both the attacker and the victim. Most sites won’t need to worry about this, but if you have a large or popular site, you might have been ‘hit’ and not even know about it. So, what’s going on?

The SEO industry is divided over whether ‘negative SEO’ exists. Could another site harm your visibility and rankings by linking to you from nefarious or spam sites? Google says that most sites won’t need to worry about this, but the reality is more complex.

Even if negative SEO doesn’t exist, there are many people out there who think that it does. And some are actively ‘attacking’ other sites via their internal site search. That has real-world implications, which are worth exploring and understanding. Here’s what’s happening and what we’re — already — doing to protect you in Yoast SEO.

Spammers can use your internal site search to advertise

Many WordPress sites have an internal site search feature, which you can get to via example.com/?s=example (or example.com/search/example/).

You can put anything you want in those URLs. And in many cases, the words you search for will be output on the site’s search results page. That means anyone can write an advert for illicit goods or services, like https://staging-platform.yoast.com/?s=buy my fake rolex watch from www.example.com, and ‘create’ a page on your website that features their ‘advert’.

An example ‘spam’ search result on yoast.com

You could also write scripts and software to generate requests to URLs like this at scale across many websites. Those URLs might also appear in places like analytics accounts and server logs. At scale, this is a crude but cheap form of ‘advertising’.
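If you want to check your own server logs for this pattern, a minimal sketch along the following lines could help. It assumes an access log in the common/combined format and the two WordPress search URL shapes mentioned above; the function names are hypothetical, and the sample log lines in the usage note are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs, unquote

# Matches the request part of a common/combined log line, e.g. "GET /?s=foo HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def search_term(path):
    """Return the search term carried by a WordPress-style search URL, or None."""
    parts = urlsplit(path)
    query = parse_qs(parts.query)
    if "s" in query:
        # ?s=term format; parse_qs decodes "+" and percent-escapes
        return query["s"][0]
    match = re.match(r"/search/([^/]+)", parts.path)
    if match:
        # /search/term/ format; decode percent-escapes only
        return unquote(match.group(1))
    return None


def count_search_terms(log_lines):
    """Count how often each search term was requested in the given log lines."""
    counts = Counter()
    for line in log_lines:
        request = REQUEST_RE.search(line)
        if not request:
            continue
        term = search_term(request.group(1))
        if term is not None:
            counts[term] += 1
    return counts
```

Calling `count_search_terms(open("access.log"))` (the file name is an assumption) and looking at `most_common(20)` gives a quick view of which search terms are being requested most often. Long, advert-like phrases near the top are a strong hint that you have been targeted.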

This kind of thing is an annoyance, but only a minor one. It becomes more serious when attackers start linking to these types of URLs.

More advanced spammers using these tactics want to reach as large an audience as possible. They try to do that by taking advantage of — and compromising — your site’s SEO.

It’s common for the perpetrators of these attacks to have a readily-available network of low-quality spam websites — which all link to each other. They’re generally not interested in getting those to rank, as long as they’re getting crawled. Because if they’re getting crawled, search engines are likely to discover and then crawl anything they link to. So, what happens if they link to search results on your website?

Now the impact of the attack starts to increase. Real humans might discover and click those links. At the very least, that probably represents a brand risk. You probably don’t want your site promoting whatever the spammers are selling. But that’s far from the worst possible outcome. Now that search engines will find and follow those links, a few things might happen.

What are the possible impacts?

  1. If you don’t manage the SEO settings for your site, these pages might start getting crawled, indexed, and ranked. That’s going to cause all sorts of brand and SEO damage. Thankfully, Yoast SEO automatically adds a noindex meta robots directive to your internal search results page, which prevents them from being indexed.
  2. If you’re actively taking steps to protect yourself by blocking internal search results in your robots.txt file, then these adverts might start showing up in the search results. Remember, robots.txt prevents crawling, but not indexing — and as far as search engines are concerned, these pages look like they’re pretty popular, and deserve to be indexed. They’re getting all sorts of links, from all sorts of websites, after all.
  3. If you’re setting a noindex directive, then these pages still get crawled, and your Google Search Console account is going to fill up with reports of “Crawled - currently not indexed” URLs.
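To make the second scenario more concrete, here is a small sketch using Python’s standard-library robots.txt parser. The robots.txt content and the `crawl_allowed` helper are assumptions for illustration, not a recommended configuration. It shows that a well-behaved crawler won’t fetch a blocked search URL at all, which also means it never gets to see a noindex directive on that page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the /search/ URL format
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def crawl_allowed(url, robots_txt=ROBOTS_TXT):
    """Return True if a crawler obeying robots_txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


# The search URL is never fetched, so a noindex tag on it is never read
print(crawl_allowed("https://example.com/search/spam/"))
# A normal page is still crawlable
print(crawl_allowed("https://example.com/blog/post/"))
```

Because the crawler only knows the blocked URL from the links pointing at it, the URL itself can still end up indexed. That is why blocking in robots.txt and setting noindex lead to different outcomes.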

Many folks with WordPress sites will find themselves in this third category. They’ll discover reports like these in their Google Search Console accounts.

Site search URLs promoting a dating website, amongst seemingly unrelated spam text.
The word/phrase ‘KaKaoTalk’ frequently occurs alongside a 【example】 notation format. These are generally usernames and adverts (often for illicit or adult services from users) on the popular South Korean chat app ‘KakaoTalk’.
A site with ‘only’ a few thousand articles has more than 90,000 recorded spam URLs.

Even though examples like this probably don’t harm your SEO, this kind of report can be concerning – and there’s still some real-world impact here.

If Google crawls these URLs at scale, that may consume ‘crawl budget’ — a theoretical, finite amount of energy they’re willing to expend on exploring your site. It also makes it harder to identify or diagnose other (legitimate) SEO problems or concerns with your site.

More significantly, it wastes electricity and server resources for the attacker, the victim, and the search engine. At scale — particularly across many websites — that wastage and impact add up.

But that’s not all…

If we dig deeper, we can see more to these attacks than meets the eye. In our example images above, we can see some URL variations which suggest some nuance to the attacks. For example:

  • URLs target both ?s=example and /search/example formats; where sites might use either, or both, and sometimes have different template logic on each version. That increases their chance of successfully getting their text onto the page, and might help them to work around noindex directives.
  • They target paginated states, like /page/5/?s=example or /search/page/2/?s=example. This is particularly nasty because pagination links in WordPress pass query parameters to pagination URLs. That means that if I have 100 pages of results for a search query, the ‘next/previous’ links at the bottom of each of those include the spam search parameter. Now your own site is linking to these spam URLs, and ‘validating’ them. That creates a huge mess in Google Search Console of ‘self-referring’ spam URLs, and makes it hard to track down the original sources.
  • They target RSS feed versions of search results (e.g., /search/[spam]/feed/rss2/). This is particularly clever, and I suspect the main (or most impactful) example. That’s because other systems actively seek out and consume RSS feeds, and often convert URLs into links. That creates a link back to the attacking site on many more sites. Your WordPress site is just part of a ‘man in the middle’ attack.
  • These attacks can be successful even if your site doesn’t have a site search input field or results page. Most WordPress sites/themes support site search out of the box, even if they don’t have a dedicated page or template for the results (in which case, a fallback such as the homepage or index page is used).
A URL targeting the RSS feed of a search result
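
The URL variations above can be told apart with a few simple checks. The following is a rough sketch only: the label names and the `classify_search_url` function are made up for this example, and the patterns are based purely on the URL formats described in this post.

```python
import re
from urllib.parse import urlsplit, parse_qs


def classify_search_url(url):
    """Return a set of labels for the search-URL patterns that the URL matches."""
    parts = urlsplit(url)
    labels = set()

    # ?s=example format (an empty ?s= still counts as a search request)
    if "s" in parse_qs(parts.query, keep_blank_values=True):
        labels.add("query")
    # /search/example/ format
    if re.match(r"/search/", parts.path):
        labels.add("pretty")

    # Only look for pagination and feeds on URLs that are searches at all
    if labels:
        if re.search(r"/page/\d+/?", parts.path):
            labels.add("paginated")
        if re.search(r"/feed(/|$)", parts.path):
            labels.add("feed")
    return labels
```

Running this over the URLs in a Google Search Console export or a server log makes it easier to see which formats an attack is using, for example how many of the reported URLs are feeds versus paginated results.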

Interactions with Cloudflare and IndexNow

The larger WordPress sites likely to be victims of this attack often use Cloudflare — a content delivery network, performance, and security platform. Cloudflare has a ‘Crawler Hints’ feature that monitors pages on your website and automatically submits them to IndexNow. Bing, Yandex, and others will now crawl those URLs.

Because paginated search results in WordPress persist the spam parameters in the pagination links, these URLs look like they come from your site. This system will pick them up and automatically push them to IndexNow. Now your site is actively telling search engines that you want them to crawl (and, by extension, index) these spam URLs. Ouch.

That also means that the spammer, Cloudflare, Bing, Yandex, and your site are wasting electricity creating, promoting, and crawling spam URLs. Double ouch.

The good news

Most sites shouldn’t need to worry about these kinds of attacks. Yoast SEO automatically applies a noindex directive to your search results page, which keeps these URLs out of Google. Even if you’re seeing this kind of data in Google Search Console, it’s not affecting your SEO.

The environmental impact, and the cost impact on your hosting, can still be significant, though. That’s why we’ve been adding a series of ‘crawl cleanup’ and optimization features to Yoast SEO Premium in recent months. These features allow you to disable URL formats and features that most sites won’t need search engines to have access to.

Our crawl cleanup features also enable you to protect your internal site search URLs from some forms of attack. For example, we let you limit the maximum length of search queries and give you the option to disable common attack patterns (like searches containing emojis). Blocked search formats will return 404 errors.
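
To illustrate the idea (this is not how Yoast SEO implements it), the sketch below shows the kind of check involved. The 50-character limit, the function names, and the emoji heuristic are all assumptions made for this example. The heuristic treats any character in the Unicode ‘Symbol, other’ category as an emoji, which is approximate and will also catch a few non-emoji symbols.

```python
import unicodedata

# Hypothetical limit; pick a value that suits your own site
MAX_QUERY_LENGTH = 50


def contains_emoji(text):
    """Approximate emoji check: any character in Unicode category 'So'."""
    return any(unicodedata.category(ch) == "So" for ch in text)


def search_status(query, max_length=MAX_QUERY_LENGTH):
    """Return the HTTP status a filtered site search might respond with."""
    if len(query) > max_length or contains_emoji(query):
        return 404  # blocked search format
    return 200  # normal search results page
```

A short, plain query such as `search_status("wordpress seo")` passes through, while an overly long advert or a query containing an emoji gets a 404.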

These kinds of tweaks ‘close the door’ on some attack formats. That should discourage search engines from crawling and indexing those URLs, which removes a key incentive for the spammers to create them. If you’re worried that you might be under attack and haven’t explored these settings, we’d encourage you to do so.

Moving forwards

Meanwhile, we’re looking for opportunities to improve WordPress core’s handling of these scenarios. For example, we’re pushing for improvements in how pagination URLs are constructed. We’re also in touch with Cloudflare about trying to exclude URL patterns like this from their IndexNow integration and even exploring options for ‘disabling’ the /search/ URL format by default. We’re planning to continue to explore this problem, and solve as much of it as possible via Yoast SEO plugins or in WordPress’ core code.

If you see these attacks in your data, please let us know in the comments. The more examples and kinds of URL formats we see, the more we can try to reverse-engineer the patterns, mechanics, and incentives behind these links!
