Google "number of hits" data is wacked 
11th-May-2011 01:29 pm
slanty, martha
Number of hits for the word "shrift": 1 million.
Number of hits for the phrase "short shrift": 2 million.

I'm unable to think of any approximate counting algorithm that would produce these results.
11th-May-2011 03:41 pm (UTC)
I get 626,000 results for "short shrift" without quotes (and about 1 million for "shrift").
11th-May-2011 03:42 pm (UTC)
Oh, but "short shrift" with quotes is 2 million! Weird.
11th-May-2011 03:49 pm (UTC)
Officer Short Shrift is a character in The Phantom Tollbooth.
I suspect that does _something_ to the numbers, though yes, I'd expect the more general to be more numerous than the more specific.
11th-May-2011 04:00 pm (UTC)
The reason Short Shrift bears that name is that it was already a common phrase, but it didn't seem to get appreciably more common when The Phantom Tollbooth was published in 1961.

But yeah, it's the specific beating the general that's so bizarre. Almost 2/3 of the pages that include the phrase "short shrift" don't include the words "short" and "shrift". Half of them don't even include the word "shrift".
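
For anyone who wants to check those fractions, here is a rough sketch in Python using only the approximate counts quoted in this thread. The variable names are made up for illustration, and the subset assumption in the comments is exactly the thing the numbers contradict; nothing here reflects how Google actually counts.

```python
# Approximate hit counts as quoted in this thread (May 2011).
shrift_hits = 1_000_000         # shrift
words_hits = 626_000            # short shrift, without quotes
phrase_hits = 2_000_000         # "short shrift", with quotes

# Assumption: if a "hit" meant "page containing the terms", every page with
# the quoted phrase would also be counted by the two looser searches, so
# phrase_hits could not exceed either of the other counts.
consistent = phrase_hits <= words_hits and phrase_hits <= shrift_hits

# Share of phrase pages that the looser searches apparently fail to count.
missing_both_words = 1 - words_hits / phrase_hits
missing_shrift = 1 - shrift_hits / phrase_hits

print(f"counts consistent with simple containment: {consistent}")
print(f"phrase pages missing from the unquoted search: {missing_both_words:.1%}")
print(f"phrase pages missing from the 'shrift' search: {missing_shrift:.1%}")
```

The first share comes out a little over two thirds and the second is exactly one half, which is where the "almost 2/3" and "half" figures come from.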

Edited at 2011-05-25 04:14 am (UTC)
11th-May-2011 03:52 pm (UTC)
Yeah, don't trust the nominal result counts. I'd suggest looking at this instead:

11th-May-2011 04:05 pm (UTC)
Yes, the ngram viewer is much more useful for things like this. But the "result count" field must be felt to have some use some of the time, or it wouldn't be included in the search results. And my question isn't "why isn't this perfect?" or "what other tool would be more useful?" but "what possible approximate counting system could yield such inconsistent results?"
11th-May-2011 04:08 pm (UTC)
My guess is it has something to do with SEO.
11th-May-2011 05:00 pm (UTC)
Not sure what you mean here. SEO to me means "techniques used to get your web site to the top of google searches". Any page that's crawled at all that contains the word "shrift" should count towards the "total hits for shrift", shouldn't it? And while there's some approximation done for speed, why should this approximation consistently overcount "short shrift" or consistently undercount "shrift"?
11th-May-2011 05:02 pm (UTC)
My guess was that the pages that don't seem to contain "shrift" actually do, hidden somewhere. Or, that there are links to them that contain "shrift", even if the pages themselves don't, and somehow that affects the counts? No clue, really.
11th-May-2011 04:12 pm (UTC)
Also interestingly, the count for "shrift" is different than the count for shrift.
11th-May-2011 04:53 pm (UTC)
We had this happen recently; I don't remember what we were searching for, but it was a multi-word string (without quotes), and many of the pages didn't have some of the words, as is often the case. We put a plus sign in front of one of the words, and the number of hits went UP.
11th-May-2011 06:07 pm (UTC)
Well, clearly a "hit" to Google isn't "a page that contains this search term." In fact, such an algorithm would be pretty bad by the standards of search.

Suppose we represent the Internet and various search strings as a weighted graph, and induce some distance metric on it by applying some algorithm to its link structure. Then define a "hit" to be a node in this graph that lies within a ball of a certain radius around the search string node under that metric.

This would exhibit all the behaviors described here: superset search strings could get more hits because the additional terms place them in a denser part of the graph; nodes that don't contain the search strings at all would be returned because they are related, in some algorithmic way, to pages that do; and so on.
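
To make that concrete, here is a toy sketch in Python. Everything in it is hypothetical: the node names, the edge weights, the radius, and the choice of shortest-path distance as the metric are all assumptions made for illustration, not a description of what Google does. It only shows that a "ball around the query node" definition of a hit can give the more specific query more hits.

```python
import heapq

def hits_within_radius(graph, source, radius):
    """Return page nodes whose shortest-path distance from source is <= radius."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            # Only keep nodes that stay inside the ball.
            if nd <= radius and nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return {n for n in dist if n.startswith("page:")}

# Hypothetical toy graph: query nodes link to pages, pages link to pages.
# Smaller weight means "more closely related".
graph = {
    "q:shrift": {"page:dictionary": 1.0, "page:confession": 1.0},
    "q:short shrift": {"page:idiom": 0.5, "page:tollbooth": 0.5},
    "page:idiom": {"page:dictionary": 0.5, "page:news": 0.25},
    "page:tollbooth": {"page:fan-site": 0.5},
}

general = hits_within_radius(graph, "q:shrift", radius=1.0)
specific = hits_within_radius(graph, "q:short shrift", radius=1.0)
print(len(general), sorted(general))
print(len(specific), sorted(specific))
```

In this toy graph the phrase query sits in a denser neighborhood, so it picks up more pages than the single word does, including a page (the hypothetical "page:fan-site") that is only reached through another page's links.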
24th-May-2011 10:25 pm (UTC) - greetings from the Cat Sanctuary
Hello, I'm actually cat_sanctuary.livejournal (or priscillascatsanctuary.weebly.com). Found a link on Ozarque's LJ and thought I'd comment on the number of hits for "dog shelter" coming ahead of my new site which has the actual name of "Cat Sanctuary"...when what I search is "Cat Sanctuary."