top || fun | art | media | issues | net | tech | science | history | search | shop
This page was extensively revised in December 1999 as a selection of the best links from the Robot Wisdom Weblog archives, with the aim of constructing a concise introduction to net.literacy about search engines.
If you take away only one thing from this page, let it be an appreciation of the superiority of Google.com as a general search engine: it delivers better results than Yahoo for almost all searches.
Knowing how to use search-engines effectively is the primary skill for internet literacy.
Newsgroups: comp.infosystems.search and alt.internet.search-engines are two good places to ask questions and pick up tips. And alt.fan.dejanews has finally got some traffic (mostly bitter complaints).
The three fundamental challenges for search.literacy are:
- choosing the right search-words,
- grouping them with quotation marks, and
- refining them with boolean operators.
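Here's a rough sketch of those three steps as code. It assumes the AltaVista-style convention where double-quotes group a phrase, a plus sign marks a required term, and a minus sign marks an excluded one. The function name buildQuery and its option names are invented for illustration, and the minus-sign syntax in particular is an assumption that may not hold for every engine.

```javascript
// Hypothetical helper: assembles a query string from search-words,
// quoted phrases, and +/- operators. The syntax is an assumption;
// check each engine's help page for what it actually supports.
function buildQuery({ words = [], phrases = [], required = [], excluded = [] }) {
  // Wrap multi-word terms in double-quotes so they travel as one phrase.
  const quote = (term) => (/\s/.test(term) ? `"${term}"` : term);
  const parts = [
    ...words.map(quote),
    ...phrases.map((p) => `"${p}"`),
    ...required.map((t) => "+" + quote(t)),
    ...excluded.map((t) => "-" + quote(t)),
  ];
  return parts.join(" ");
}

const query = buildQuery({
  phrases: ["James Joyce"],
  required: ["links"],
  excluded: ["auction"],
});
console.log(query); // "James Joyce" +links -auction
```

The point of keeping the three kinds of term separate is that each one maps onto one of the three challenges: which words, which groupings, which operators.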
(Here's an everchanging random sampling of the last 500 search-patterns used at WebCrawler, most of which are probably very poorly formed. On 16Dec99 for example, about 150 used plus signs, about 50 used double-quotes, and about 50 wrote out a question ending with a "?".)
This basic overview includes links to top engines, basic search strategies, and recent developments.
Searchtip from Daniel Farinha:
I noticed how good the word 'links' is when searching for a number of pages on any subject. This might sound trivial, but it's interesting to see that getting pages like 'Other links to blah...' is the best way to start a search. ... I've now been using the '+links' string in all my queries in Altavista ... Could 'links' be the search magic word?
Every search engine uses a different set of criteria for ranking results. You can experiment with this problem very easily via Dogpile meta-search, which collates results from the major engines: Yahoo!, Lycos' A2Z, Excite Guide, GoTo.com, PlanetSearch, Thunderstone, What U Seek, Magellan, Lycos, WebCrawler, InfoSeek, Excite, and AltaVista.
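To get a feel for what collating involves, here's a minimal sketch of one way to merge ranked lists from several engines. This is not Dogpile's actual method, which I don't know; the reciprocal-rank scoring, the function name collate, and the engine names and URLs are all assumptions made up for the example.

```javascript
// Hypothetical collation: each engine returns an ordered list of URLs.
// A URL earns 1/rank points from every engine that lists it, so pages
// that several engines rank highly float to the top.
function collate(resultsByEngine) {
  const scores = new Map();
  for (const [engine, urls] of Object.entries(resultsByEngine)) {
    urls.forEach((url, index) => {
      const entry = scores.get(url) || { url, score: 0, engines: [] };
      entry.score += 1 / (index + 1);
      entry.engines.push(engine);
      scores.set(url, entry);
    });
  }
  // Highest combined score first.
  return [...scores.values()].sort((a, b) => b.score - a.score);
}

const merged = collate({
  engineA: ["http://a.example/", "http://b.example/"],
  engineB: ["http://b.example/", "http://c.example/"],
});
console.log(merged.map((r) => r.url));
```

Here b.example comes out first because both engines list it, even though only one of them ranks it at the top.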
Search bookmarklets are short Javascript programs that look like simple links or bookmarks, but act like super-speedy searchform pages.
They're a little bit tricky to master, but will save you vast amounts of time once you have.
If your browser offers a personal toolbar for your most-frequently used links, you probably want to put at least Google there, perhaps many more.
Deluxe search bookmarklets even detect the word or phrase you've highlighted on a webpage, and send that pattern to your search engine automatically. (This can be problematic for various reasons: frames disable it, it works differently on Netscape and MSIE, etc.)
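For the curious, a sketch of what such a bookmarklet might look like inside. It assumes a current browser, where window.getSelection() returns the highlighted text (the 1999-era Netscape and MSIE calls differed), and it uses Google's ordinary /search?q= results URL. The helper name googleUrl is made up for the example.

```javascript
// Pure helper: turn a search-pattern into a Google results URL.
function googleUrl(pattern) {
  return "https://www.google.com/search?q=" + encodeURIComponent(pattern.trim());
}

// Source of the bookmarklet itself. It only does anything in a browser:
// it takes the highlighted text if there is any, otherwise asks for a
// pattern, then jumps to the results page.
const bookmarklet =
  "javascript:(function(){" +
  "var s=String(window.getSelection?window.getSelection():'');" +
  "if(!s){s=prompt('Search Google for:','')||'';}" +
  "if(s){location.href='https://www.google.com/search?q='+encodeURIComponent(s);}" +
  "})();";

console.log(googleUrl('"James Joyce" links'));
console.log(bookmarklet);
```

To try it, paste the printed javascript: line into the location field of a new bookmark. The frames problem mentioned above still applies, since the selection inside a frame belongs to that frame's own window.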
Even if bookmarklets are too complicated-sounding, you should be aware of 'searchlinks'-- bookmarks or links that send a pre-defined query to a search-engine.
For example, I click this eBay link once a day to see what's being auctioned there related to 'James Joyce'. Most search engines will work correctly if you bookmark the results-page after you do a search, so that selecting that bookmark saves you re-typing the search-pattern each time.
(Deja.com requires special handling, see below.)
I strongly recommend building a range of these into your own custom startpage. If you don't like bookmarklets, you may also add local search forms.
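A startpage of searchlinks can be generated rather than typed by hand. The sketch below assumes each engine is described by a URL template with %s standing in for the search-pattern; the Google template is its ordinary query URL, while the second engine and all the function names are placeholders invented for the example.

```javascript
// Each entry pairs a label with a URL template; %s marks where the
// URL-encoded search-pattern goes. The second template is a placeholder,
// not a real engine's address.
const engines = [
  { label: "Google", template: "https://www.google.com/search?q=%s" },
  { label: "Example engine", template: "https://search.example.com/find?query=%s" },
];

// Escape text so it is safe to show inside HTML.
const escapeHtml = (text) =>
  text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;");

// Build one searchlink: a plain link that re-runs a fixed query.
function searchLink(engine, pattern) {
  const href = engine.template.replace("%s", encodeURIComponent(pattern));
  return `<a href="${href}">${escapeHtml(engine.label)}: ${escapeHtml(pattern)}</a>`;
}

// Build a bare-bones startpage listing every engine/pattern pair.
function startPage(patterns) {
  const items = [];
  for (const pattern of patterns) {
    for (const engine of engines) {
      items.push("<li>" + searchLink(engine, pattern) + "</li>");
    }
  }
  return "<ul>\n" + items.join("\n") + "\n</ul>";
}

console.log(startPage(["James Joyce"]));
```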
Google hasn't done any advertising yet, but it uses a radically different (and better) way of rating the pages that match your query. Instead of counting the number of appearances of the phrase, it calculates the popularity of each matching page, according to how many other pages link to it.
So if I search Google for the words 'search engines' it rates highest the most popular page that uses those words.
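A toy version of that idea, as a sketch only: it counts raw inbound links and nothing more, which is surely a simplification of whatever Google really computes. The four-page "web", the page names, and the function names are all invented for the example.

```javascript
// Toy web: each page has its text and the URLs it links to.
const pages = {
  "a.example": { text: "guide to search engines", links: ["b.example"] },
  "b.example": { text: "search engines compared", links: [] },
  "c.example": { text: "more on search engines", links: ["b.example", "a.example"] },
  "d.example": { text: "unrelated page about cheese", links: ["b.example"] },
};

// Count how many other pages link to each URL.
function inboundCounts(web) {
  const counts = {};
  for (const [url, page] of Object.entries(web)) {
    for (const target of new Set(page.links)) {
      if (target !== url) counts[target] = (counts[target] || 0) + 1;
    }
  }
  return counts;
}

// Keep pages whose text contains every query word (simple substring
// match), then order them by inbound-link count, most-linked first.
function rankByPopularity(web, query) {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  const counts = inboundCounts(web);
  return Object.keys(web)
    .filter((url) => words.every((w) => web[url].text.toLowerCase().includes(w)))
    .sort((x, y) => (counts[y] || 0) - (counts[x] || 0));
}

console.log(rankByPopularity(pages, "search engines"));
```

Note that b.example wins without mentioning the phrase any more often than the others; it simply has the most pages pointing at it, including one page that doesn't match the query at all.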
Google bookmarklets: simple, NN, MSIE
Google supports the 'link:' syntax to find pages that link to a given site: [example search]
At least 140 matches for "link:www.robotwisdom.com/"
They also have a 'What's related' function: [example]
Google backgrounder
Rankdex was an earlier variant on the Google approach, not currently available.
Clever is an IBM approach that we'll be hearing more about.
June 1999 Scientific American feature; PC Week
The next most useful search-engine after Google, imho, is Deja.com, which searches recent (or older) postings to the Usenet newsgroups.
This is the likeliest place to find discussions of topics only a few hours old, that the other search-engines haven't gotten to yet.
Unfortunately, Deja.com has a corporate culture that seems to hate this side of their operation, favoring instead a much less useful product-ratings service. Because Deja has burdened their pages with this new, irrelevant function, it's wiser to approach it with this streamlined alternative searchform page.
Around 250M pages in Dec99.
Counting comparisons-- Elvis, misspellings
This review of search engines claims that AllTheWeb now indexes 200M pages: http://www.freep.com/tech/qtchb2.htm
Brodin, who holds a doctorate in philosophy, first became interested in searches and research when he cataloged the 20,000 pages of Austrian philosopher Ludwig Wittgenstein's diaries.
AllTheWeb's targeted billion URLs will be pretty useless without 'link:' and 'host' and 'url' filters: http://news2.thls.bbc.co.uk/hi/english/sci/tech/newsid%5F410000/410251.stm
Fast says a typical query can search all 200m documents in less than a second and the parallel systems mean a document index can be built in only 12 hours, compared to several days or even weeks for some of the other search engines.
Around 250M pages in Dec99.
You can register an AltaVista pattern and get email notification when new matches are indexed
Regular: numbers unreliable
Advanced: different syntax?
Another: If you find a URL you really like, go to AltaVista and type in "+link:the.url.you.like". This will return all the known pages that like that page, too...
Digital to Compaq to CMGI
CMGI buys AltaVista for $2.3B: http://www.cmgi.com/press/99/altavista.htm [Slashdot]
CMGI's majority-owned subsidiaries include Activerse, Adsmart, Engage, iCast, Magnitude Network, NaviSite, NaviNet, Planet Direct and ZineZone. The Company's @Ventures affiliates have ownership interests in Lycos, Inc, Critical Path, Silknet, Ancestry.com, Asimba, blaxxun, BizBuyer.com, CarParts.com, Chemdex, eCircles.com, Furniture.com, HotLinks, KOZ.com, MotherNature.com, NextMonet.com, NextPlanetOver.com, OneCore.com, ONElist, Productopia, Promedix.com, Raging Bull, Softway Systems, Speech Machines, ThingWorld.com, Universal Learning Technology, Vicinity, Virtual Ink and Visto....In addition, CMGI will issue a $220 million three-year note to Compaq, bringing total consideration for CMGI's 83% ownership in the AltaVista business to approximately $2.3 billion, implying a total value of $2.7 billion for AltaVista.
CMGI will acquire control of Compaq's AltaVista business and its related properties (Shopping.com and Zip2) and will integrate the popular AltaVista search engine into its network of 40 leading Internet operating companies to deliver a superior online experience to Internet users both at home and at work.
Great backgrounder on Yahoo
DMoz's low standards for editors
Ambitious value-added portal strategy: http://www.adweek.com/interactive/iqnews01.asp
Looking to simplify online searches for Web users, a new search and community portal, 4anything.com, launches tomorrow with a network of over 600 affinity-based Web sites housed under 16 categories, ranging from 4shopping.com to 4news.com. Each domain, branded with the number "4" plus an array of most frequently searched terms, such as 4dogs.com and 4weather.com, will feature editor-selected links to other Web sites related to searchers' desired topics.
Inktomi building its own Yahoo, algorithmically? http://news.excite.com/news/r/990615/02/net-internet-inktomi
The company says its new directory engine uses breakthrough technology that leverages human intelligence to categorize millions of documents within an automated Web directory.
(This can only be IBM's Clever, or an equivalent. Great to see search engines getting smarter, finally! ...Now if only they'd scale up to the current billion-plus pages.)
A few details about Inktomi's Yahoo-buster: http://www.eet.com:80/story/OEG19990617S0011
Paul Gauthier, chief technical officer at Inktomi (San Mateo, Calif.), said that human developers were used to set up category taxonomies and statistics modeling, under the goal of making directory linkages relevant to a search at hand, and high-quality in terms of responding to human inference patterns. The latter steps of creating entries for the broad directory categories were automated, using several proprietary Inktomi concepts of concept induction to find the most appropriate directory links.
Good day for RBuzz: http://www.researchbuzz.com/news/
ResearchBuzz -- Internet Research News
SearchEngineWatch
Summary chart of search engine features [qv]
Which search engines support which syntax features [qv]
Which search engines offer which auxiliary modes [qv]
Which search engines use which indexing techniques [qv]
Search-Engine Annoyances FAQ (v0.2): http://www.deja.com/=dnc/getdoc.xp?AN=503573385
Please post suggestions for additions (or corrections) to this FAQ, either about a specific search-engine or about search-engines in general. These should be constructive criticisms, about things that could reasonably be fixed.
Northern Light
Lycos
New NY Observer offers Chris Byron on Lycos-Diller: http://www.observer.com/cgi-win/homepage.exe?nyo1/CB032299
Lycos, as you may know, is generally regarded as one of the weaker of the four major search engine companies (the others being Yahoo Inc., Excite Inc. and Infoseek Corporation). It has roughly half the revenues of Yahoo or Excite, yet during the 12-month period that ended Dec. 31, 1998, it racked up nearly seven times the losses ($121.3 million) of the rest of the group combined....And since Lycos, at $131 per share, was being valued on Wall Street at $5.6 billion while USA Networks carried a $6.5 billion market cap, this meant that Mr. Diller was actually valuing Lycos at $84 per share at most (30 percent of $12.1 billion).
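The arithmetic behind that last figure can be checked in a few lines. This is only a sketch, and it assumes the 30 percent is the Lycos shareholders' stake in the combined company, which is how I read the quote; the variable names are mine.

```javascript
// Figures quoted in the Observer piece above.
const lycosSharePrice = 131;   // dollars per share
const lycosMarketCap = 5.6e9;  // dollars
const usaNetworksCap = 6.5e9;  // dollars
const lycosStake = 0.30;       // assumed: Lycos holders' share of the combined company

const combined = lycosMarketCap + usaNetworksCap;  // 12.1 billion
const impliedLycosValue = lycosStake * combined;   // about 3.63 billion
// Scale the share price by implied value over market value.
const impliedSharePrice = lycosSharePrice * (impliedLycosValue / lycosMarketCap);

console.log(impliedSharePrice.toFixed(2));
```

On these inputs the implied price lands a little under $85 a share, within a dollar of the figure Byron quotes.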
Frontiers of paying-for-linking: http://www.usatoday.com/life/cyber/tech/ctf841.htm
Lycos will pay third-party sites 2 cents for every referral from a HotBot search box and 3 cents for every referral from one of the other content boxes.
The best way I know to compare search-engine performance is with http://www.dogpile.com/ and I was surprised to find, in testing it on my authors' names over the last few days, that the winner by a mile was: http://www.infoseek.com/ which only has one-quarter as many pages and rarely scores well in the ratings!??
GoTo
Goto.com's rate-schedule for search-engine placements: http://www.usatoday.com/life/cyber/tech/ctf596.htm
According to Goto's metrics, Mohammed is worth 1 cent, Moses is 10, Buddha is 4. And it perhaps is a telling comment on the state of Western civilization that while Jesus Christ is worth 31 cents a hit, he's not the highest-ranked name on the site. That honor, at $1.27, goes to movie actress Cameron Diaz from There's Something About Mary.
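Those rates are easy to play with. The sketch below uses only the five figures quoted above and assumes a flat rate per hit, which is a simplification; the function names are invented for the example.

```javascript
// Per-hit rates in cents, as quoted in the USA Today piece above.
const centsPerHit = {
  "Mohammed": 1,
  "Moses": 10,
  "Buddha": 4,
  "Jesus Christ": 31,
  "Cameron Diaz": 127,
};

// List keywords from most to least expensive.
function byPrice(rates) {
  return Object.entries(rates)
    .sort((a, b) => b[1] - a[1])
    .map(([keyword]) => keyword);
}

// What an advertiser would owe, in cents, for a number of hits
// (assuming a flat rate per hit).
function costInCents(rates, keyword, hits) {
  if (!(keyword in rates)) throw new Error("No rate listed for " + keyword);
  return rates[keyword] * hits;
}

console.log(byPrice(centsPerHit));
console.log(costInCents(centsPerHit, "Cameron Diaz", 100)); // 12700 cents, i.e. $127
```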
'The magazine for database professionals' (Searcher)
The BotSpot monthly newsletter looks pretty invaluable, given the rate of new search-agent development
Search-app Copernic99 is now available for Mac: (freeware, 1.5Mb) http://www.copernic.com/download/index.html [RBuzz] [big fun!]
I never hear much about this top-rated Windows web-search app: http://thewebtools.com/features.htm
Mata Hari creates a desktop database of Web search results which, themselves, can be further searched, manipulated, edited or annotated.
This search-engine app sounds like a genre-buster [not for Mac yet]. Ditto.
Atomz.com
Free site-search service looks very well-designed:
http://www.atomz.com/help/faq.tk
Decent-sounding tips for improving your site's search-engine placement: http://www.kaleidoscope-dts.com/secrets.html
Note that I wrote "naked truth exposed!" and not "the naked truth exposed!" Yes, it's true that using "the" would have made this phrase sound better. However, "the" can be referred to as a "stop word". Stop words are often conjunctions, prepositions, and articles and other words like "and", "to" and "a" that often appear in documents yet alone may have little meaning. Many search engines ignore stop words, so using them won't enhance your search engine placement; they just take up valuable space.
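To see what a stop-word filter does to a phrase, here's a minimal sketch. The stop-word list is a tiny sample I made up; each engine keeps its own list, so treat both the list and the function name as assumptions.

```javascript
// A tiny sample stop-word list; real engines each keep their own.
const STOP_WORDS = new Set(["a", "an", "and", "the", "to", "of", "in", "or"]);

// Drop stop words from a phrase, keeping the remaining words in order.
// Punctuation is ignored when checking a word against the list.
function stripStopWords(phrase) {
  return phrase
    .split(/\s+/)
    .filter((word) => word && !STOP_WORDS.has(word.toLowerCase().replace(/[^a-z0-9]/g, "")))
    .join(" ");
}

console.log(stripStopWords("the naked truth exposed!")); // naked truth exposed!
```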
New variant on the evil short-pages conspiracy (in a moderately interesting piece on search-engine placement): http://www.hotwired.com/webmonkey/templates/print_template.htmlt?meta=/webmonkey/99/31/index1a_meta.html
Most pages have too much text in them to score well for one particular phrase. A page that has a few words, including "Web consulting," will score higher than one that goes on and on and doesn't mention Web consulting until the 10th paragraph.
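The claim is easy to illustrate with a naive density score: phrase occurrences divided by total words. This is purely a sketch of the reasoning; I have no evidence any engine scores pages exactly this way, and the function name and sample pages are invented.

```javascript
// Naive density score: occurrences of the phrase divided by the page's
// word count. Illustrative only.
function phraseDensity(text, phrase) {
  const toWords = (s) => s.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/).filter(Boolean);
  const words = toWords(text);
  const target = toWords(phrase);
  if (words.length === 0 || target.length === 0) return 0;
  let hits = 0;
  for (let i = 0; i + target.length <= words.length; i++) {
    if (target.every((w, j) => words[i + j] === w)) hits++;
  }
  return hits / words.length;
}

const shortPage = "Web consulting for small firms";
const longPage =
  "We have been in business for many years and offer many services " +
  "including design hosting training and Web consulting";

console.log(phraseDensity(shortPage, "Web consulting") > phraseDensity(longPage, "Web consulting"));
```

Both pages mention the phrase once, but the short page scores higher because that one mention is a bigger share of its text.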
A 'pay-for-placement' search engine, GoTo
John Berger's witty keyword-saturation story
Playing 'cloaking' games with search-engine spiders for ethically dubious profit: http://www.internetday.com/archives/022399.html [More]
Two pages are created. The first is the VISIBLE page, the one that we want our visitors to see. The second is a HIDDEN page, visitors don't see it but search engines do.
NOINDEX
Beautifully written intro to controlled vocabularies: http://webreview.com/wr/pub/1999/07/09/web_architect/index.html
Whether you realize it or not, you're already familiar with controlled vocabularies. The Library of Congress subject headings and Yahoo's search criteria are a couple of examples. So, as you've probably guessed by now, controlled vocabularies are predetermined sets of terms that fit together to describe a specific domain such as kitchen appliances, nuclear engineering, or dirt biking.
Why some web-pages need a 'META GEOLOCATION' header: [Deja URL]
Have you ever searched the net for a pizza-service? Ok, you get results, but you are probably looking for one that, at least theoretically, is able to deliver you a hot pizza. ...In short: the web does not know distance...
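If pages did declare a location, filtering by distance would be straightforward. The sketch below assumes each result somehow carries a latitude and longitude (the actual format of the proposed header isn't given here, so I don't try to parse one) and uses the standard haversine formula with a mean Earth radius of 6371 km. The listings and coordinates are invented.

```javascript
// Great-circle distance in kilometres between two latitude/longitude
// points, using the haversine formula and a mean Earth radius of 6371 km.
function distanceKm(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(h));
}

// Keep only results whose declared location lies within maxKm of the searcher.
function nearby(results, searcher, maxKm) {
  return results.filter((r) => distanceKm(r, searcher) <= maxKm);
}

// Invented listings with made-up coordinates.
const pizzerias = [
  { name: "Nearby pizzeria", lat: 52.37, lon: 9.73 },
  { name: "Faraway pizzeria", lat: 41.88, lon: -87.63 },
];
console.log(nearby(pizzerias, { lat: 52.4, lon: 9.7 }, 20).map((r) => r.name));
```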
Apparently the search-engine keyword lawsuits have one small valid point: http://www.thestandard.net/articles/display/0,1449,3871,00.html
In the case of the porn companies that bought "Playboy" from Excite and Netscape, most of the banner ads served do not contain information about the advertiser. This could spell trouble for the search engines, according to Neil Shapiro, a First Amendment and intellectual-property lawyer in San Francisco. Trademark law, he explains, is not designed to protect trademark owners but to prevent the public from being confused.
This thread on indexing tech-news stories is in reverse order due to propagation problems: [Messy URL]
But happily, Netscape includes a handy little drag-n-drop hierarchy editor, in the form of its 'Edit Bookmarks' window, that allows me to work with the headlines themselves, along with (at no extra charge) the associated URLs -- in case I need to doublecheck the full stories.
Dvorak calls for a net.dewey.decimal system: http://boardwatch.internet.com/mag/98/dec/bwm33.html
Is it too complex?
Do most sites really want their content scraped?
XML-driven search engine (via WebWord)
Sherlock Find file
Search-engine design ideas: http://search.dejanews.com/getdoc.xp?AN=470492869
If 'search engineers' ever get their asses in gear, this is what a results-page could look like in a year or three, entirely without human intervention...
A bunch of ways search-engines could be improved: http://search.dejanews.com/getdoc.xp?AN=467730817
Frequent query-patterns could be identified, and special pages maintained for each of them, with some kind of raters-feedback system to gravitate the best stuff toward the top.
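The first step, spotting the frequent patterns, might look something like this sketch. It assumes the engine has a plain log of raw query strings; the normalisation rules (lowercasing, squeezing whitespace) and the function names are my own guesses at what "the same pattern" should mean.

```javascript
// Normalise a raw query so trivial variants count as the same pattern.
function normalizeQuery(query) {
  return query.toLowerCase().replace(/\s+/g, " ").trim();
}

// Tally a query log and return the n most frequent patterns,
// breaking ties alphabetically.
function topQueries(log, n) {
  const tally = new Map();
  for (const raw of log) {
    const key = normalizeQuery(raw);
    if (key) tally.set(key, (tally.get(key) || 0) + 1);
  }
  return [...tally.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, n)
    .map(([pattern, count]) => ({ pattern, count }));
}

const log = ["Elvis", "elvis ", "james  joyce", "mp3", "MP3", "elvis"];
console.log(topQueries(log, 2));
```

Each pattern that makes the cut would then get its own maintained page, with the raters' feedback deciding the order of links on it.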
A Deja-specific sequel to yesterday's search-engine speculations (below): http://search.dejanews.com/getdoc.xp?AN=470627131
Tomorrow's DejaNews search-results page design?
Some inconclusive mathematical thinking about search: http://search.dejanews.com/getdoc.xp?AN=465071971
Here are some tools for thinking about net.search:- hypothetically, one could arrange all the questions that searchers want answered, in order from most common to least common. (Ask Jeeves is trying to classify all the top ones. They have 10,000 'templates' so far, according to Forbes [1].)
A discussion-forum on standardising search-engine syntax: [Messy URL may include Danny Sullivan's password?] [CSky]
PC Mag analyses search-engine sites: [multipage] http://www.zdnet.com/pcmag/features/websearch98/edchoice.html
Editors' Choices:
Simple searching: Yahoo!
Advanced searching: Northern Light
Kids' search: Ask Jeeves for Kids
News search: None
Metasearch: MetaCrawler
Find an acronym: Acronym Finder
Find anyone: Yahoo! People Search
Find an apartment: AllApartments
Find an ATM: Visa and MasterCard
Find an attorney: lawyers.com
Find company information: Hoover's Online
Find a computer article: Computer Magazine Archive
Find a definition: OneLook Dictionaries
Find a discussion forum: Forum One
Find a domain name: NetNames USA
Find driving directions: MapQuest
Find a file: Filez
Find a house: Realtor.com
Find an ISP: The List
Find a job: The CareerBuilder Network
Find legislators' voting records: Project Vote Smart
Find a long-distance carrier: TeleWorth
Find a mailing list: Liszt
Find medical information: Medical World Search
Find an old friend: ClassMates and PlanetAll
Find online events: On Now and Yahoo! Net Events
Find a personal home page: WhoWhere?
Find public records: KnowX
Find stock filings: Edgar Online
Find technical information: developer.com
Find tech-support information: No Wonder
Find the weather: AccuWeather
Find yellow-pages information: Switchboard
Find a ZIP code: USPS ZIP Code Lookup
Jesse's five top people finders: [multipage] http://www.zdnet.com/anchordesk/story/story_3421.html
AnyWho...
InfoSpace...
WhoWhere...
Yahoo People Search...
SwitchBoard...
Here's a simple bookmarklet for Britannica (requires Javascript).
MP3
movie reviews
Overview of databases not indexed by search-engines (via RBuzz)
Including a politics-only search engine: (not bookmarklet-able, I fear) http://www.politicalinformation.com/
Search-engine for cultural events: http://www.culturefinder.com/ [Newsweek]
CultureFinder has listings for 300,000 events in more than 1000 cities nationwide.
Awesome inventory of specialty search-engines includes these categories and many more: http://www.leidenuniv.nl/ub/biv/specials.htm [Coppersky]
American political rhetoric - Archaeology - Audio, Midi, MP3 - Birds, birding - Books - Celebrities - Cheese - Cigars - Cybercafes - Dance - Fashion - Humor - Mysticism - Orchids - Weather - Women - Work - Y2K
New TidBits offers tips on searching for images. Ditto for images.
Image search:
Find an email address: http://mesa.rrzn.uni-hannover.de/
A reverse phone directory. Another. Another.
Switchboard sorts the yellow pages by neighborhood
PC Mag picks its favorite search engines for various tasks
IBM's patent server is miraculously great, if you need to do any sort of patent research
This Internet 'new-book shelf' adds a dozen new titles daily: [150k]
A page of 'new and obscure' search engines
That new Echelon paper: (long page w/big pix) http://www.iptvreports.mcmail.com/ic2kreport.htm
This report describes how Comint organisations have for more than 80 years made arrangements to obtain access to much of the world's international communications. Contrary to reports in the press, effective "word spotting" search systems automatically to select telephone calls of intelligence interest are not yet available, despite 30 years of research. However, speaker recognition systems - in effect, "voiceprints" - have been developed and are deployed to recognise the speech of targeted individuals making international telephone calls.
Besides UKUSA, there at least 30 other nations operating major Comint organisations. The largest is the Russian FAPSI, with 54,000 employees.(8) China maintains a substantial Sigint system, two stations of which are directed at Russia and operate in collaboration with the United States. Most Middle Eastern and Asian nations have invested substantially in Sigint, in particular Israel, India and Pakistan.
Processing may also involve translation or "gisting" (replacing a verbatim text with the sense or main points of a communication). Translation and gisting can to some degree be automated.
The most advanced type of HF monitoring system deployed during this period for Comint purposes was a large circular antenna array known as AN/FLR-9. AN/FLR-9 antennae are more than 400 metres in diameter. They can simultaneously intercept and determine the bearing of signals from as many directions and on as many frequencies as may be desired. In 1964, AN/FLR-9 receiving systems were installed at San Vito dei Normanni, Italy; Chicksands, England, and Karamursel, Turkey.
...The main stations are at Buckley Field, Denver, Colorado; Pine Gap, Australia; Menwith Hill, England; and Bad Aibling, Germany. The satellites and their processing facilities are exceptionally costly (of the order of $1 billion US each).
...Thus, although European communications passing on inter-city microwave routes can be collected, it is likely that they are normally ignored. But it is very highly probable that communications to or from Europe and which pass through the microwave communications networks of Middle Eastern states are collected and processed.
NSA and other Comint agencies have spent a great deal of money on research into tapping optical fibres, reportedly with little success.
[Internet tapping:] Although the quantities of data involved are immense, NSA is normally legally restricted to looking only at communications that start or finish in a foreign country. Unless special warrants are issued, all other data should normally be thrown away by machine before it can be examined or recorded.
In the UK, the Defence Evaluation and Research Agency maintains a 1 Terabyte database containing the previous 90 days of Usenet messages.(35) A similar service, called "Deja News", is available to users of the World Wide Web (WWW). Messages for Usenet are readily distinguishable. It is pointless to collect them clandestinely.
The same article alleged that a leading US Internet and telecommunications company had contracted with NSA to develop software to capture Internet data of interest, and that deals had been struck with the leading manufacturers Microsoft, Lotus, and Netscape to alter their products for foreign use.
NSA and CIA then [1970s?] discovered that Sigint collection from space was more effective than had been anticipated, resulting in accumulations of recordings that outstripped the available supply of linguists and analysts.
The use of strong cryptography is slowly impinging on Comint agencies' capabilities. This difficulty for Comint agencies has been offset by covert and overt activities which have subverted the effectiveness of cryptographic systems supplied from and/or used in Europe.
[Clipper, etc:] Viewed in retrospect, the actual purpose of these proposals was to provide NSA with a single (or very few) point(s) of access to keys, enabling them to continue to access private and commercial communications.
Between 1993 to 1998, the United States conducted sustained diplomatic activity seeking to persuade EU nations and the OECD to adopt their "key recovery" system. Throughout this period, the US government insisted that the purpose of the initiative was to assist law enforcement agencies. Documents obtained for this study suggest that these claims wilfully misrepresented the true intention of US policy.
In 1970, according to its former Executive Director, the US Foreign Intelligence Advisory Board recommended that "henceforth economic intelligence be considered a function of the national security, enjoying a priority equivalent to diplomatic, military, technological intelligence".
"Signals intelligence is in a crisis. ... Over the last fifty years ... In the past, technology has been the friend of NSA, but in the last four or five years technology has moved from being the friend to being the enemy of Sigint."
How Inktomi uses parallel processing:
http://www.forbes.com:80/Forbes/98/1130/6212336a.htm