<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" > <channel><title>Comments on: Other Google fetching</title> <atom:link href="http://www.mattcutts.com/blog/other-google-fetching/feed/" rel="self" type="application/rss+xml" /><link>http://www.mattcutts.com/blog/other-google-fetching/</link> <description>neat fun stuff</description> <lastBuildDate>Wed, 08 Feb 2012 21:30:01 +0000</lastBuildDate> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.1</generator> <item><title>By: Mister Bark</title><link>http://www.mattcutts.com/blog/other-google-fetching/#comment-114698</link> <dc:creator>Mister Bark</dc:creator> <pubDate>Sun, 21 Oct 2007 20:08:29 +0000</pubDate> <guid isPermaLink="false">http://www.mattcutts.com/blog/other-google-fetching/#comment-114698</guid> <description>I&#039;ve made a regexp ;) I don&#039;t know if all ranges are here, but I can say that all of these are Google, and these are the main ranges. $ip =~ /^((64\.68&#124;66\.249)\.(6[4-9]&#124;[78]\d&#124;9[0-5])&#124;64\.233\.(1[6-8]\d&#124;19[01])&#124;216\.239\.(3[2-9]&#124;[45]\d&#124;6[0-3]))\./ bye ;)</description> <content:encoded><![CDATA[<p>I&#8217;ve made a regexp <img src='http://www.mattcutts.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /><br /> I don&#8217;t know if all ranges are here, but I can say that all of these are Google, and these are the main ranges.</p><p>$ip =~ /^((64\.68|66\.249)\.(6[4-9]|[78]\d|9[0-5])|64\.233\.(1[6-8]\d|19[01])|216\.239\.(3[2-9]|[45]\d|6[0-3]))\./</p><p>bye <img src='http://www.mattcutts.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /></p> ]]></content:encoded> </item> <item><title>By: Petter 
Nilsen</title><link>http://www.mattcutts.com/blog/other-google-fetching/#comment-96615</link> <dc:creator>Petter Nilsen</dc:creator> <pubDate>Fri, 09 Feb 2007 13:29:24 +0000</pubDate> <guid isPermaLink="false">http://www.mattcutts.com/blog/other-google-fetching/#comment-96615</guid> <description>And to add to my concern, it sounds like it could be used to create a DDoS situation where the source IP is one of the hundreds of Google IPs and there is no easy way to block it.</description> <content:encoded><![CDATA[<p>And to add to my concern, it sounds like it could be used to create a DDoS situation where the source IP is one of the hundreds of Google IPs and there is no easy way to block it.</p> ]]></content:encoded> </item> <item><title>By: Petter Nilsen</title><link>http://www.mattcutts.com/blog/other-google-fetching/#comment-96614</link> <dc:creator>Petter Nilsen</dc:creator> <pubDate>Fri, 09 Feb 2007 13:16:28 +0000</pubDate> <guid isPermaLink="false">http://www.mattcutts.com/blog/other-google-fetching/#comment-96614</guid> <description>Won&#039;t Google&#039;s Feedfetcher check the robots.txt at the top level of the site? I do not want Google to fetch any of the 200k+ entries that are available on a feed from me, as it adds CPU load to an already taxed system. Trying to dig out the IP ranges to block is not good enough, really.</description> <content:encoded><![CDATA[<p>Won&#8217;t Google&#8217;s Feedfetcher check the robots.txt at the top level of the site? I do not want Google to fetch any of the 200k+ entries that are available on a feed from me, as it adds CPU load to an already taxed system. 
Trying to dig out the IP ranges to block is not good enough, really.</p> ]]></content:encoded> </item> <item><title>By: George R.</title><link>http://www.mattcutts.com/blog/other-google-fetching/#comment-92936</link> <dc:creator>George R.</dc:creator> <pubDate>Wed, 03 Jan 2007 10:02:19 +0000</pubDate> <guid isPermaLink="false">http://www.mattcutts.com/blog/other-google-fetching/#comment-92936</guid> <description>Found your page/blog via Altavista SE when searching for Google Web Accelerator Cache Warmer found in my log the &quot;full UA&quot; as &quot;Mozilla (Google Web Accelerator Cache Warmer; Google-TR-4-GT)&quot; the IP for that entry was 65.124.120.130 which was for some sort of financial business in Mass, US.Here are my &quot;thoughts and ideas&quot;.Immediately before these multiple lines with that &quot;accelerator&quot; I found that a person in the UK was trying to visit my site, of course &quot;one bad apple ruined the whole bunch&quot; so that I had the entire 81.x.x.x for the UK (and Europe) blocked. Likewise I don&#039;t allow almost all of the APNIC group of IP&#039;s.... too many email harvesters in Asia, thus many spam originate from Asia ... first received line in emails cannot be forged... and seeing too much bad activity from Asia and Europe got most of those areas blocked. Unfortunate for innocent parties in those areas. (I block many US and most Canadian IP&#039;s due to repeated bad activities whether its email spam or bad web activity.)Nevertheless, found a UK visitor trying and trying to view my site, of course 403&#039;d. Then after many of those I found the &quot;accelerator&quot; UA entry for this Google tool. The IP for a business in Mass, USA. My guess is that this Google tool looks for a cached copy of your site sitting on any number of NON-SE CACHES and if it finds a copy of your site or the wanted page on a cache near to the viewer then such copy is served to the viewer.Good and bad. Here&#039;s the good, first. 
That user agent is a real person, so a site owner can allow it or disallow. Site owner choice. The bad items now. It &quot;in effect&quot; hides the actual viewer&#039;s IP, instead showing the IP of the cache site (which may vary). A site owner can decide to allow this anonymous viewing or decide to block it via user-agent of Google Web Accelerator Cache Warmer. A good part is that with this new user agent roaming sites, there is one advantage. If a site owner monitors their log file (monthly log file would be best but could be huge), simply search the log for that UA and jot down the IP numbers. That should give a listing of some &quot;unknown&quot; and possibly undesired caches around the country or the world. The ending part of the full UA after the semicolon, I believe &quot;may vary&quot;. Not sure if it is a version number or an indicator of the visitor&#039;s origination point, but it might be interesting to note that portion with the IP list that would be in the results. Now, a good item. If that UA is allowed (at least at first) then a site owner can find many NON-SE CACHES around the country or world. Not knowing how old they may allow their caches to become can be an issue. They might be serving VERY OLD copies of your site to visitors. The option is then to block those IP&#039;s, or CIDR groups for those IP&#039;s especially if a CIDR group is for a business. That might stop or block some of the &quot;unknown&quot; bots/crawlers. The latter has always been an issue with me. Unknown bots! Especially if it is a bot owned by EDU facilities or by businesses. You never know how they might use your content. And whether they have a name or not if they don&#039;t have full info about their bots at their site and if they&#039;re not a normal SE then most usually I don&#039;t want them. 
They use my bandwidth, they could be infringing on my writings or copyright and they could be using code of my own design (I only did one &quot;visitor counter&quot; script) thus the latter is not a big issue.I believe this &quot;info&quot; may be correct about the Google accelerator, but it might take some verification by one or more people. Most definitely I &quot;could not&quot; see why someone at a business would view just one page of my site, but I can very well associate it with it being a &quot;cache&quot; source and that it was associated with the one site visitor that was blocked.Most probably I will allow it for a short term, just to watch some of its activity (does it obey robots.txt or fetch too fast) or maybe I will block it completely.Actually on an update.... it might be that the &quot;accelerator&quot; tool may have found that such Mass USA site was a super-high-speed source with less routing for getting to my server in southern california. Nevertheless, it does &quot;hide&quot; the visitor&#039;s IP, and if that IP gets my pages they may in fact actually cache some of my pages.It could be an interesting study. Who owns those IP&#039;s and where are they located. And what is the relationship or info at the end of the UA string, a program version number or origination indicator of the visitor?Best regards,(ps please don&#039;t put my email onto ANY lists of any kind and please ensure that it is not &quot;anywhere&quot; in the blog code so that it can&#039;t be harvested.... it &quot;is&quot; a disposable email but I really don&#039;t want to dispose of it. 
thanks.)</description> <content:encoded><![CDATA[<p>Found your page/blog via Altavista SE when searching for<br /> Google Web Accelerator Cache Warmer<br /> found in my log the &#8220;full UA&#8221; as<br /> &#8220;Mozilla (Google Web Accelerator Cache Warmer; Google-TR-4-GT)&#8221;<br /> the IP for that entry was 65.124.120.130 which was for some sort of financial business in Mass, US.</p><p>Here are my &#8220;thoughts and ideas&#8221;.</p><p>Immediately before these multiple lines with that &#8220;accelerator&#8221; I found that a person in the UK was trying to visit my site, of course &#8220;one bad apple ruined the whole bunch&#8221; so that I had the entire 81.x.x.x for the UK (and Europe) blocked. Likewise I don&#8217;t allow almost all of the APNIC group of IP&#8217;s&#8230;. too many email harvesters in Asia, thus many spam originate from Asia &#8230; first received line in emails cannot be forged&#8230; and seeing too much bad activity from Asia and Europe got most of those areas blocked. Unfortunate for innocent parties in those areas. (I block many US and most Canadian IP&#8217;s due to repeated bad activities whether its email spam or bad web activity.)</p><p>Nevertheless, found a UK visitor trying and trying to view my site, of course 403&#8242;d. Then after many of those I found the &#8220;accelerator&#8221; UA entry for this Google tool. The IP for a business in Mass, USA. My guess is that this Google tool looks for a cached copy of your site sitting on any number of NON-SE CACHES and if it finds a copy of your site or the wanted page on a cache near to the viewer then such copy is served to the viewer.</p><p>Good and bad. Here&#8217;s the good, first. That user agent is a real person, so a site owner can allow it or disallow. Site owner choice.</p><p>The bad items now. It &#8220;in effect&#8221; hides the actual viewers IP, instead showing the IP of the cache site (which may vary). 
A site owner can decide to allow this anonymous viewing or decide to block it via user-agent of Google Web Accelerator Cache Warmer. A good part is that with this new user agent roaming sites, there is one advantage. If a site owner monitors their log file (monthly log file would be best but could be huge), simply search the log for that UA and jot down the IP numbers. That should give a listing of some &#8220;unknown&#8221; and possibly undesired caches around the country or the world. The ending part of the full UA after the semicolon, I believe &#8220;may vary&#8221;. Not sure if it is a version number or an indicator of the visitor&#8217;s origination point, but it might be interesting to note that portion with the IP list that would be in the results.</p><p>Now, a good item. If that UA is allowed (at least at first) then a site owner can find many NON-SE CACHES around the country or world. Not knowing how old they may allow their caches to become can be an issue. They might be serving VERY OLD copies of your site to visitors. The option is then to block those IP&#8217;s, or CIDR groups for those IP&#8217;s especially if a CIDR group is for a business. That might stop or block some of the &#8220;unknown&#8221; bots/crawlers.</p><p>The latter has always been an issue with me. Unknown bots! Especially if it is a bot owned by EDU facilities or by businesses. You never know how they might use your content. And whether they have a name or not if they don&#8217;t have full info about their bots at their site and if they&#8217;re not a normal SE then most usually I don&#8217;t want them. They use my bandwidth, they could be infringing on my writings or copyright and they could be using code of my own design (I only did one &#8220;visitor counter&#8221; script) thus the latter is not a big issue.</p><p>I believe this &#8220;info&#8221; may be correct about the Google accelerator, but it might take some verification by one or more people. 
Most definitely I &#8220;could not&#8221; see why someone at a business would view just one page of my site, but I can very well associate it with it being a &#8220;cache&#8221; source and that it was associated with the one site visitor that was blocked.</p><p>Most probably I will allow it for a short term, just to watch some of its activity (does it obey robots.txt or fetch too fast) or maybe I will block it completely.</p><p>Actually on an update&#8230;. it might be that the &#8220;accelerator&#8221; tool may have found that such Mass USA site was a super-high-speed source with less routing for getting to my server in southern california. Nevertheless, it does &#8220;hide&#8221; the visitor&#8217;s IP, and if that IP gets my pages they may in fact actually cache some of my pages.</p><p>It could be an interesting study. Who owns those IP&#8217;s and where are they located. And what is the relationship or info at the end of the UA string, a program version number or origination indicator of the visitor?</p><p>Best regards,</p><p>(ps please don&#8217;t put my email onto ANY lists of any kind and please ensure that it is not &#8220;anywhere&#8221; in the blog code so that it can&#8217;t be harvested&#8230;. it &#8220;is&#8221; a disposable email but I really don&#8217;t want to dispose of it. 
thanks.)</p> ]]></content:encoded> </item> <item><title>By: raheel</title><link>http://www.mattcutts.com/blog/other-google-fetching/#comment-89289</link> <dc:creator>raheel</dc:creator> <pubDate>Wed, 08 Nov 2006 17:54:31 +0000</pubDate> <guid isPermaLink="false">http://www.mattcutts.com/blog/other-google-fetching/#comment-89289</guid> <description>Google is crawling my site continuously. Yesterday I had 16k hits, of which 9k came from one IP, 66.249.65.202: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)</description> <content:encoded><![CDATA[<p>Google is crawling my site continuously. Yesterday I had 16k hits, of which 9k came from one IP, 66.249.65.202: Mozilla/5.0 (compatible; Googlebot/2.1; +<a href="http://www.google.com/bot.html" rel="nofollow">http://www.google.com/bot.html</a>)</p> ]]></content:encoded> </item> <item><title>By: Timmy Dred</title><link>http://www.mattcutts.com/blog/other-google-fetching/#comment-87955</link> <dc:creator>Timmy Dred</dc:creator> <pubDate>Sun, 15 Oct 2006 01:57:56 +0000</pubDate> <guid isPermaLink="false">http://www.mattcutts.com/blog/other-google-fetching/#comment-87955</guid> <description>I think this is the correct place to bring this up... We have several clients&#039; sites where our stats (visitor analysis) pages showed up in Google&#039;s index. I submitted a sitemap (our script is set to crawl the webpages rather than the filesystem) and made sure that those pages/folders were not included. I also password protected the pages. They still show up in the Google Webmaster Tools (sitemaps) section as an HTTP error (401). I had to explain this first, but my real question is... Does the Google Toolbar use personal information or something? Those pages are definitely not linked from anywhere. The only way they could get into Google&#039;s index is if the Google Toolbar or some other Google application was sending info from the address bar back to Google. The only time I ever go to those pages is by manually typing in the URL. Love Google&#039;s services, by the way. Not trying to expose anything... just very curious. Thanks, Tim</description> <content:encoded><![CDATA[<p>I think this is the correct place to bring this up&#8230;</p><p>We have several clients&#8217; sites where our stats (visitor analysis) pages showed up in Google&#8217;s index. I submitted a sitemap (our script is set to crawl the webpages rather than the filesystem) and made sure that those pages/folders were not included. I also password protected the pages. They still show up in the Google Webmaster Tools (sitemaps) section as an HTTP error (401).</p><p>I had to explain this first, but my real question is&#8230;</p><p>Does the Google Toolbar use personal information or something? Those pages are definitely not linked from anywhere. The only way they could get into Google&#8217;s index is if the Google Toolbar or some other Google application was sending info from the address bar back to Google. The only time I ever go to those pages is by manually typing in the URL.</p><p>Love Google&#8217;s services, by the way. Not trying to expose anything&#8230; just very curious.</p><p>Thanks,</p><p>Tim</p> ]]></content:encoded> </item> <item><title>By: Az</title><link>http://www.mattcutts.com/blog/other-google-fetching/#comment-87811</link> <dc:creator>Az</dc:creator> <pubDate>Thu, 12 Oct 2006 08:09:00 +0000</pubDate> <guid isPermaLink="false">http://www.mattcutts.com/blog/other-google-fetching/#comment-87811</guid> <description>Hello Folks, Could anyone tell me how Google finds new website names? Let&#039;s say there is a new website with a brand new IP address in Kazakhstan. How does Googlebot find out about that? Does it scan the whole .kz IP range and do reverse DNS lookups, or does it simply have access to the DNS database of the Kazakh name registrar? Or does it have its own BIND server? Sincerely, Az</description> <content:encoded><![CDATA[<p>Hello Folks,<br /> Could anyone tell me how Google finds new website names?<br /> Let&#8217;s say there is a new website with a brand new IP address in Kazakhstan. How does Googlebot find out about that?<br /> Does it scan the whole .kz IP range and do reverse DNS lookups, or does it simply have access to the DNS database of the Kazakh name registrar? Or does it have its own BIND server?<br /> Sincerely,<br /> Az</p> ]]></content:encoded> </item> <item><title>By: Revan</title><link>http://www.mattcutts.com/blog/other-google-fetching/#comment-87743</link> <dc:creator>Revan</dc:creator> <pubDate>Wed, 11 Oct 2006 11:19:02 +0000</pubDate> <guid isPermaLink="false">http://www.mattcutts.com/blog/other-google-fetching/#comment-87743</guid> <description>It is a great pity that there is no list of Googlebot IP addresses... That would be very... mmm... good :)</description> <content:encoded><![CDATA[<p>It is a great pity that there is no list of Googlebot IP addresses&#8230; That would be very&#8230; mmm&#8230; good <img src='http://www.mattcutts.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p> ]]></content:encoded> </item> <item><title>By: LeeKO</title><link>http://www.mattcutts.com/blog/other-google-fetching/#comment-87708</link> <dc:creator>LeeKO</dc:creator> <pubDate>Wed, 11 Oct 2006 00:43:57 +0000</pubDate> <guid isPermaLink="false">http://www.mattcutts.com/blog/other-google-fetching/#comment-87708</guid> <description>Hi Matt, May I ask a quick question? I operate a Canadian business. Can I register a .com domain and use a DNS and web host in the USA? If I do, will that harm my ranking on google.ca? Leeko</description> <content:encoded><![CDATA[<p>Hi Matt,</p><p>May I ask a quick question?<br /> I operate a Canadian business. Can I register a .com domain and use a DNS and web host in the USA?<br /> If I do, will that harm my ranking on google.ca?</p><p>Leeko</p> ]]></content:encoded> </item> <item><title>By: Damon Hart-Davis</title><link>http://www.mattcutts.com/blog/other-google-fetching/#comment-87605</link> <dc:creator>Damon Hart-Davis</dc:creator> <pubDate>Sun, 08 Oct 2006 12:55:02 +0000</pubDate> <guid isPermaLink="false">http://www.mattcutts.com/blog/other-google-fetching/#comment-87605</guid> <description>Hi Matt, I do use If-Modified-Since, but this is a potential way for you to avoid ever downloading even the first *redundant* copy of (say) an image... Get your folks to look at the Content-MD5 header definition. I don&#039;t use ETag (knowingly); heck, I don&#039;t even understand it (yet) after ~10 years serving Web pages! Rgds, Damon</description> <content:encoded><![CDATA[<p>Hi Matt,</p><p>I do use If-Modified-Since, but this is a potential way for you to avoid ever downloading even the first *redundant* copy of (say) an image&#8230;</p><p>Get your folks to look at the Content-MD5 header definition.</p><p>I don&#8217;t use ETag (knowingly); heck, I don&#8217;t even understand it (yet) after ~10 years serving Web pages!</p><p>Rgds</p><p>Damon</p> ]]></content:encoded> </item> </channel> </rss>
