Okay, let’s try tackling a few questions from the Grab bag thread. Just a hint for next time: if your question takes three paragraphs to ask, your odds of getting an answer go down.
Q: “Is Bigdaddy fully deployed?”
A: Yes, I believe every data center now has the Bigdaddy upgrade in software infrastructure, as of this weekend.
Q: “What’s the story on the Mozilla Googlebot? Is that what Bigdaddy sends out?”
A: Yes, I believe so. You will probably see less crawling by the older Googlebot, which has a User-Agent of “Googlebot/2.1 (+http://www.google.com/bot.html)”. I believe crawling from the Bigdaddy infrastructure has a new User-Agent, which is “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
Q: “Do you take Emmy with you to San Francisco?”
A: Nope, Emmy is a true indoors cat; she doesn’t like to travel.
Q: “Any new word on sites that were showing more supplemental results?”
A: An additional crawling change to show more sites from those sites was checked in late last week, but it may still take a little bit of time (another few days) for that to show up in the index. I’ll keep an eye on sites that people have given as examples to see how those sites are showing up.
Q: “Is the RK parameter turned off, or should we expect to see it again?”
A: I wouldn’t expect to see the RK parameter have a non-zero value again.
Q: “What’s an RK parameter?”
A: It’s a parameter that you could see in a Google toolbar query. Some people outside of Google had speculated that it was live PageRank, that PageRank differed between Bigdaddy and the older infrastructure, etc.
Q: “Now that Bigdaddy is out, will there be a new export of PageRank anytime soon?” and “Will the deployment of BigDaddy stabilise the rolling PR issues we are experiencing at present?”
A: I’ll ask around about that. If there aren’t any logistical obstacles, I’ll ask if we could make a new set of PageRanks visible within the next couple weeks. I’d expect that as Bigdaddy stabilizes everywhere, the variation in toolbar PR for individual urls is more like to settle down too.
Q: “This datacentre http://64.233.185.104/ works differently to all of the others. Noticed just a few hours ago. . . . . Where does that DC fit into the scheme of things? Is it mainly made from newly spidered data?”
A: Sharp eyes, g1smd. That wouldn’t surprise me. As Bigdaddy cools down, that frees us up to do new/other things.
Q: “Not so much a question… GET A PSP!”
A: I got one today, TallTroll. I picked up Me and My Katamari (MAMK) and a PSP that turned out to have firmware v1.52 on it. So I could upgrade to 2.0, then downgrade to 1.5 so I could run homebrew programs. But I think MAMK requires firmware 2.5 or 2.6 to play, which means a one-way upgrade or maybe using RunUMD or a similar program. Suffice it to say I’m having fun just geeking around.
Q: “Can you give us a general way of getting a good idea in front of Google?”
A: If it’s bizdev, there’s a bizdev dept. at Google you could contact. If it’s not a business/patent/proprietary idea, I’d mention it here or blog about it somewhere. Writing a snail mail letter could work well too.
Q: “Did you check out the guys all painted in silver doing the robot on milk crates in San Fran?”
A: Nope, that’s down by Fisherman’s Wharf. We’re hanging near Union Square.
Q: “Why do you focus your attention so much on SEOs and not at webmasters who make actual quality websites?”
A: I think that’s an issue I have personally, because I spend so much of my time looking at spam. Lots of other people focus on helping general webmasters, like the Sitemaps team, for example. I have started to do “SEO Advice” posts instead of just “SEO Mistakes” posts, but you’re right: I personally could use a reminder to keep focusing on the sites that make quality content and how to pull those sites up, not just how to counter sites that cheat. Thanks for bringing that up.
Q: “My sitemap has about 1350 urls in it. . . . . its been around for 2+ years, but I cannot seem to get all the pages indexed. Am I missing something here?”
A: One of the classic crawling strategies that Google has used is the amount of PageRank on your pages. So just because your site has been around for a couple years (or that you submit a sitemap), that doesn’t mean that we’ll automatically crawl every page on your site. In general, getting good quality links would probably help us know to crawl your site more deeply. You might also want to look at the remaining unindexed urls; do they have a ton of parameters (we typically prefer urls with 1-2 parameters)? Is there a robots.txt? Is it possible to reach the unindexed urls easily by following static text links (no Flash, JavaScript, AJAX, cookies, frames, etc. in the way)? That’s what I would recommend looking at.
Q: “When I change a robots.txt to exclude more existing files from being crawled, how long does it take for them to be removed from the index? Perhaps the answer is a function of how often the site is crawled and it’s PR?”
A: It is a function of how often the site is crawled. I believe in the past that every several hundred page fetches or several days, the bot would re-check the robots.txt. Note that for supplemental results, you need recrawling to happen by the supplemental Googlebot in order for the robots.txt file to take affect on those pages. If you’re really sure you never want those pages to be seen, you can use our url removal tool to remove urls for six months at a time. But I’d be very careful with the url removal tool unless you’re an expert. If you make a mistake and (for example) remove your entire site, that’s your responsibility. Google can sometimes clear out self-removals, but we don’t guarantee it.
Q: “I would love to be able to search for html code and see how that ranks.”
A: I would like that too. Indexing non-visible things like punctuation, JavaScript, and HTML would be great, but it would also bulk up the size of the index. Any time you’re considering a new feature (e.g. our numrange search), you have to trade off how much the index would get bigger versus the utility of the feature. My guess is that we wouldn’t offer this any time soon.
Q: “Seriously, How do you plan on picking which of these questions to answer?”
A: I’m tackling the ones that looked interesting, short, and general enough that more than one person would be interested.
Q: “I am seeing a lot of sites with “%09″ (tab) and “%20″ (space) in front of the URL in Googles index.”
A: I’ll ask someone about that.
Q: (paraphrasing) The sitemaps validation fetch seems to happen with a User-Agent of “-”? My auto-reject rules reject that user agent.
A: I’ll ask someone about that. You could whitelist the IP range that Googlebot comes from in the mean time.
Q: “If one were to offer to sell space on their site (or consider purchasing it on another), would it be a good idea to offer to add a NOFOLLOW tag so to generate the traffic from the advertisement, but not have the appearence of artificial PR manipulation through purchasing of links?”
A: Yes, if you sell links, you should mark them with the nofollow tag. Not doing so can affect your reputation in Google.
Q: “On sites directed to international audiences with the same (high quality) content in several languages is it better to do several TLDs like mydomain.com, mydomain.de, mydomain.fr, mydomain.eu and so on or do subdomains like en.mydomain.eu, de.mydomain.eu, fr.mydomain.eu or something else like mydomain.com/en, mydomain.com/de, mydomain.com/fr?”
A: Good question. If you’ve only got a small number of pages, I might start out with subdomains, e.g. de.mydomain.eu or de.mydomain.com. Once you develop a substantial presence or number of pages in each language, that’s where it often makes sense to start developing separate domains.
Q: “Any results on why IDN Domains don’t show pagerank?”
A: I’ve seen a couple that do, but I’ll check into why most don’t. My guess is that there’s a normalization issue somewhere in the toolbar PageRank pathway.
Q: “Would it be possible to add a date range to queries? I might get 91,000,000 results, but the first 200 are 2-3 years old. I would like to limit results to items no more than 6-12 months old.”
A: Check out our advanced search page for this option. Tara Calashain also did some really interesting digging into this too, e.g. this info she uncovered. Google Hacks is a pretty solid book if you’d like to read more fun Google hacks.
Q: “What about the problem of directories and shopping comparison spam overriding real pages?”
A: Fair feedback. I heard that recently from a Googler, too. Sometimes we think of spam as strictly things like hidden text, cloaking, etc. But users think of spam as noise: things that they don’t want. If they’re trying to get information, fix a problem, read reviews, etc., then sites that like aren’t as helpful.
Q: “Are you planning to visit/speak in the UK at all in the near future?”
A: Sadly not. I’m hitting the Boston Pubcon and SES San Jose, but I can only do 4-5 conferences a year.
Q: “The one thing that seems to be getting to people generally, is what are the post Big Daddy intentions? Fixes, spam issues, regeneration of ‘pure’ indices, supp. issues, PR and BL update, etc.”
A: I can’t give a timeline (e.g. “scaling up communication in April, more work on canonicalization in May”) because priorities can change, esp. depending on machine issues, deployments of new binaries, webspam developments, etc. Short-term, I wouldn’t be surprised to see some refreshing in supplemental results relatively soon, and potentially different PageRanks visible in the next couple weeks.
Q: “Even Matt is afraid to use a redirect from www.mattcutts.com/ to www.mattcutts.com/blog/ because Google might penalize his website and put it into supplemental hell.”
A: Heh. No, that’s not it. I’m deliberately leaving them separate as a test case to see how we do now and down the road.
Q: “Just like you told me a couple of months ago, the Supplemental Googlebot (SG) got around to my site and things got sorted out. Thanks. . . . . If you are in San Fran and want to check out the Monterey Aquarium, could you please write a short review? I’ve been thinking of visiting and wondering if it is worth the trip.”
A: I would definitely recommend the Monterey Bay Aquarium, especially if you can find a coupon or other good deal. I highly recommend the otters, the kelp forest, and the jellyfish area.