Presumably you saw my post about talkorigins.org, a site that was recently hacked so that the front page had spammy porn text and links. Google temporarily removed talkorigins.org from our index, but we emailed talkorigins.org to alert them that they had been hacked. We also made it possible for talkorigins.org to confirm the penalty in our Webmaster console tools. Once the spammy porn links/text were gone, Google reincluded the site in our index within days.
So Google tries to alert hacked sites to problems; that’s good. But we also email many sites for violations of our quality guidelines, such as hidden text. Take for example the case of trouw.nl, a leading Dutch newspaper. They wrote an article criticizing the fact that Google temporarily removed trouw.nl from our index for hidden text, and emphasized their belief that Google should have alerted them to the removal. In fact, Google did email trouw.nl.
What exactly was trouw.nl doing? Using Cascading Style Sheets (CSS), the site hid dozens of words on hundreds or thousands of pages. Here’s the code that was on the front page of trouw.nl in October, for example:
dagblad trouw, podium, nieuws, achtergronden, kranten, verdieping, opvoeding, onderwijs, religie, filosofie, natuurtochten, gezondheid(s)zorg, cultuur, natuur, milieu, stijlboek, recensies, boeken, chat, polderpeil, maandaggids, dinsdaggids, woensdaggids, donderdaggids, vrijdaggids, weekendgids, letter, geest, letter&geest, boekrecensies, novum, laatstenieuws, rss, handheld, dossiers, trouwkabinet, illegaletrouw, ephimenco, schouten, spotprenten, spotprent, len, tom, modernemanieren, cryptogram, zusje, kritieken, nieuwskoppen, horizonreizen, relatie, parship, schrijfboek, webshop, trouwcompact, compact, animatie(s), Flash, video, radio, strip(s).</div>
That list sat inside a div named indexKeywords. If you examine http://www.trouw.nl/trouw.nl/styles/basic.css, you’ll find the properties that the stylesheet assigns to that div.
The net effect of that CSS div is to hide those 60+ keywords in a way that is completely invisible to users. In case you’re wondering, trouw.nl also used the indexLinks div style to hide multiple links as well. It’s interesting that the definition of indexKeywords remains in the CSS of trouw.nl, even though they’ve removed the actual hidden text.
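As a rough illustration of the technique described above, here is a minimal Python sketch that scans a stylesheet for rules that commonly make an element invisible. Everything in it is an assumption made for illustration: the `find_hiding_selectors` helper, the list of hiding patterns, and the sample rules (including the `#` id-style selectors) are hypothetical. It does not reproduce the actual contents of trouw.nl's basic.css, and it is not a description of how Google detects hidden text.

```python
import re
from typing import List

# Hypothetical heuristics: declarations that often make content invisible.
# Illustrative only; not the real trouw.nl rules and not Google's method.
HIDING_PATTERNS = [
    re.compile(r"display\s*:\s*none", re.I),
    re.compile(r"visibility\s*:\s*hidden", re.I),
    # An element pushed far off-screen with a large negative offset.
    re.compile(r"(left|top|text-indent)\s*:\s*-\d{3,}", re.I),
]

# Matches a simple "selector { declarations }" rule (no nested blocks).
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def find_hiding_selectors(css: str) -> List[str]:
    """Return selectors whose declaration block matches a hiding pattern."""
    flagged = []
    for selector, body in RULE.findall(css):
        if any(pattern.search(body) for pattern in HIDING_PATTERNS):
            flagged.append(selector.strip())
    return flagged


# Invented sample stylesheet; the selector names echo the divs mentioned
# in the post, but the declarations are made up for this example.
sample_css = """
#indexKeywords { display: none; }
#indexLinks { position: absolute; left: -9999px; }
#content { color: #333; }
"""

print(find_hiding_selectors(sample_css))
```

A check like this is only a first pass. Plenty of legitimate pages hide elements with CSS (menus, tabs, dialogs), so a flagged selector is a prompt to look at what text the element contains, not proof of a guideline violation.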
In general, I do not feel that Google is obligated to notify every site that we remove from Google’s index for violating our quality guidelines. Our webspam team does not have infinite resources, and our primary goal has to be to protect Google users by keeping our index clean. However, in this case Google did email trouw.nl (in Dutch) to alert them about their hidden text. I’ll include an excerpt, translated here into English, of the email that we sent to multiple email addresses, including webmaster at trouw.nl and support at trouw.nl:
Dear owner or webmaster of trouw.nl/,
While indexing your web pages, we found that some of your pages use techniques that violate our quality guidelines. You can find these guidelines at: http://www.google.nl/webmasters/guidelines.html
To safeguard the quality of our search engine, some of your pages will be temporarily removed from our search results. At present, the pages of trouw.nl/ are about to be removed for a period of at least 30 days.
In particular, the following techniques were found on your pages:
* The hidden text below on trouw.nl/:
dagblad trouw, podium, nieuws, achtergronden, kranten, verdieping, opvoeding, onderwijs, religie, filosofie, natuurtochten, gezondheid(s)zorg, cultuur, natuur, milieu, stijlboek, recensies, boeken, chat, polderpeil, maandaggids, dinsdaggids, woensdaggids, donderdaggids, vrijdaggids, weekendgids, letter, geest, letter&geest, boekrecensies, novum, laatstenieuws, rss, handheld, dossiers, trouwkabinet, illegaletrouw, ephimenco, schouten, spotprenten, spotprent, len, tom, modernemanieren, cryptogram, zusje, kritieken, nieuwskoppen, horizonreizen, relatie, parship, schrijfboek, webshop, trouwcompact, compact, animatie(s), Flash, video, radio, strip(s).
As you can see, we tried to alert trouw.nl that we were taking action on their hidden text and hidden links. We mentioned the page with the issue (in this case, the root page), and we included the actual hidden text. The rest of the email goes on to describe how to request that Google reconsider the site for reinclusion in our index. After trouw.nl removed the hidden text and hidden links, Google reincluded the site.
I understand that trouw.nl was frustrated to be removed from Google’s index, but our users have told Google repeatedly that they hate webspam and don’t like seeing pages with hidden text secretly buried on the page. Hidden text is also not fair to other sites that try to compete for similar queries without hiding words from users.
In this case, I believe that Google did more than any other search engine does:
- We provided our webmaster guidelines in Dutch at http://www.google.nl/webmasters/guidelines.html (“Avoid hidden text or hidden links” becomes “Vermijd verborgen teksten en verborgen links.”)
- We scheduled the site to be removed for 30+ days so that users wouldn’t get hidden-text, hidden-link pages back in response to searches.
- We made it possible for trouw.nl to confirm that they had a penalty via our Webmaster console.
- We emailed trouw.nl in Dutch with the exact page to check and the exact text to look for.
- Once the site removed the hidden text and hidden links, we reincluded trouw.nl.
In reviewing this situation, I believe that the webspam team handled this issue in a way that protected our users but also tried to alert the site to issues. We will continue to work to improve our communication so that legitimate sites receive even more information to help them with webspam-related issues.
Q: So you’re not just working on webspam in English?
A: No! We are continuing our anti-spam efforts in many different languages, as you can see from this situation. In fact, I expect Google to focus even more effort on other languages in coming months. I’m extremely proud of our webspam team members who are located in Europe (and other places around the world). I’m also sending one of the best people on our Mountain View-based webspam team, Brian White, to Europe for six months in 2007 to provide the webspam team in Europe with even more visibility and more support.
Q: Matt, you still love Dutch sites and the Netherlands, right?
A: Yes. One of my favorite authors is Dutch. Janwillem van de Wetering is the best existential mystery writer in the world, without question. Because of him, I can’t wait to drink jenever in Amsterdam someday. But we also have to protect Google’s users and the quality of our index.
Update: I’ve been in contact with someone at Trouw.nl, and as always, there are two sides to the story. The email addresses we tried to use to contact Trouw didn’t exist, so Trouw couldn’t have received our message. This situation shows that the idea of contacting site owners is solid, but we can still find ways to improve our communication and webmaster outreach. Trouw has also added an update at the end of their article saying much the same thing.