Officer Short Shrift is a character in The Phantom Tollbooth. I suspect that does _something_ to the numbers, though yes, I'd expect the more general to be more numerous than the more specific.
But yeah, it's the specific beating the general that's so bizarre. Almost 2/3 of the pages returned for the phrase "short shrift" don't include both of the words "short" and "shrift"; half of them don't even include the word "shrift".
Yes, the ngram viewer is much more useful for things like this. But the "result count" field must presumably have some use some of the time, or it wouldn't be included in the search results. And my question isn't "why isn't this perfect?" or "what other tool would be more useful?" but "what possible approximate counting system could yield such inconsistent results?".
Not sure what you mean here. SEO, to me, means "techniques used to get your web site to the top of Google searches". Any page that's crawled at all and contains the word "shrift" should count towards the total hits for "shrift", shouldn't it? And while there's presumably some approximation done for speed, why should that approximation consistently overcount "short shrift" or consistently undercount "shrift"?
My guess was that the pages that don't seem to contain "shrift" actually do, hidden somewhere. Or, that there are links to them that contain "shrift", even if the pages themselves don't, and somehow that affects the counts? No clue, really.
We had this happen recently; I don't remember what we were searching for, but it was a multi-word string (without quotes), and many of the pages didn't have some of the words, as is often the case. We put a plus sign in front of one of the words, and the number of hits went UP.
Well clearly a "hit" to Google isn't "a page that contains this search term." In fact, such an algorithm would be pretty bad by the standards of search.
Suppose we represent the Internet and various search strings as a weighted graph, and induce a distance metric on it by applying some algorithm to its link structure. Then define a "hit" to be a node in this graph that lies within a ball of a certain radius around the search-string node under that metric.
This would exhibit all the behaviors described here: superset search strings could return more hits because the extra words place them in a denser part of the graph, nodes could be returned that don't contain the search terms at all because they are related, in some algorithmic way, to pages that do, and so on.
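To make that concrete, here is a minimal sketch of the idea in Python. Everything in it is assumed for illustration: the toy graph, the edge weights, the radius, and the choice of Dijkstra's algorithm as the distance metric are stand-ins, not anything Google is known to do.

```python
# Hypothetical sketch of the "hits = ball around the query node" model.
# Node names, edge weights, and the radius are made up for illustration.
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from `source` over a weighted adjacency dict."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

def hits(graph, query_node, radius):
    """'Hits' = every page node within `radius` of the query node."""
    dist = dijkstra(graph, query_node)
    return {node for node, d in dist.items()
            if d <= radius and not node.startswith("query:")}

# Toy graph: query nodes link to pages, pages link to each other.
web = {
    "query:short shrift": {"page_a": 1.0, "page_b": 1.5},
    "query:shrift":       {"page_a": 2.5},
    "page_a": {"page_c": 0.5},
    "page_b": {"page_c": 1.0},
    "page_c": {},
}

print(hits(web, "query:short shrift", radius=2.0))  # {'page_a', 'page_b', 'page_c'}
print(hits(web, "query:shrift", radius=2.0))        # set()
```

With these made-up weights, the more specific query sits in a denser neighborhood, so it picks up more "hits" than the broader one, including a page (page_c) that nothing says contains the phrase at all, which is exactly the sort of inconsistency described above.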
Hello, I'm actually cat_sanctuary.livejournal (or priscillascatsanctuary.weebly.com). Found a link on Ozarque's LJ and thought I'd comment on the number of hits for "dog shelter" coming ahead of my new site, which has the actual name "Cat Sanctuary"... when what I search for is "Cat Sanctuary."
http://ngrams.googlelabs.com/graph?content=shrift%2C+short+shrift&year_start=1800&year_end=2000&corpus=0&smoothing=3