Internet search tools have been created to answer a very pressing need. They are evolving rapidly - some would say more rapidly than the Internet itself. By the end of 1996, it is estimated that the Internet will consist of no fewer than 150 million pages, containing 50 or 60 billion words. To make matters worse, this great mass of data exists without any kind of bibliographic control, standard numbering system, or classification scheme. Clearly, automated tools of some sort are necessary to sift through this mass of material (Venditto, 1996).
My own interest in the Internet has increased in direct proportion to the growth, power, and flexibility of the excellent search tools that have appeared over the past year or two. In my opinion, they have elevated the Web from simply a browser's paradise to a more respectable, searchable, and interesting world-wide reference source. In fact, the same critical skills that are used to locate books, journal articles, musical scores, or any other information resources can and should be applied to finding information on the Internet (Tillman, 1996).
This paper will concentrate on search engines and their characteristics only. My target audience is our class - that is, sophisticated and experienced students of electronic databases who are well aware of the established methodology used to search them. Unfortunately, a discussion of the many wonderful Internet subject guides and annotated directories will have to be left to another writer, another time.
Search engines, if used properly, match search terms with corresponding terms contained in specific Web sites. Many of the newer engines use spider or robot software to index Web sites. This automated process actually visits each new Web page and records the full text of every page (including as many as three of the page's links). Other engines may base their indexing only on the title, the headings, and, say, the first 200 words of the body. Still others may analyze the number of links that point to the page being indexed to determine its usefulness. The point is, each search engine goes about the job of indexing in a different way. The other half of the process, the front end offered to the user via the search screen form, also varies widely in terms of the operations and features engineered into the software. Some engines permit the user to key in all the necessary control language, such as Boolean operators, proximity operators, and various limiting schemes. Others simply present forms with pull-down menus that allow the user to select the proper limiting terms. The latter technique is referred to as "form-based" controls (see Comparison Table). The bottom line is that search engines rarely yield identical results when presented with identical search terms. To use each engine effectively, the user needs to understand the differences in how each engine is constructed and operated, in order to make an informed choice of product.
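To make the "limited" indexing strategy above concrete, here is a minimal sketch in Python of an indexer that records only the title, the headings, and the first 200 words of the body. The page data, URL, and function names are all invented for illustration; this is the general technique, not any real engine's code.

    # Minimal sketch of a "limited" indexing strategy: record only the
    # title, the headings, and the first 200 words of the body.
    # All names and data here are illustrative.

    def index_page(url, title, headings, body, inverted_index):
        """Add one page to a simple inverted index (term -> set of URLs)."""
        words = title.split() + [w for h in headings for w in h.split()]
        words += body.split()[:200]          # only the first 200 body words
        for word in words:
            term = word.lower().strip('.,;:!?"()')
            if term:
                inverted_index.setdefault(term, set()).add(url)

    index = {}
    index_page("http://example.com/ovi.html",    # hypothetical URL
               "Letters from the 126th Ohio Volunteer Infantry",
               ["Introduction"],
               "The 126th Ohio Volunteer Infantry was mustered in 1862 ...",
               index)
    print(index["infantry"])   # {'http://example.com/ovi.html'}

A full-text engine would differ only in the third line of index_page: it would take every word of the body rather than the first 200.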
All search engines match the user's search terms to documents in roughly the same way (Sullivan, 1996).
Subject catalogs are hierarchically organized indexes of subject categories that permit the searcher to browse through lists of Web sites by subject in search of relevant information (Tyner, 1996). The analysis of sites by subject is done by humans, not computers, and therein lies both their advantage and their disadvantage. First the disadvantage: the pool of indexed sites is necessarily smaller in comparison to search engines that use an automated robot spider to collect indexing information. However, no amount of word-frequency counting or proximity calculation can compare with the interpretative ability of the human mind. So, when browsing a subject catalog, one can be assured of subject relevancy (high precision), but not comprehensiveness (high recall). What, then, is the best answer for the poor researcher?
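This trade-off is exactly the standard information-retrieval distinction between precision and recall, which can be stated as:

    \text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|},
    \qquad
    \text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}

A human-built subject catalog maximizes the first ratio at the expense of the second; a spider-built index tends to do the reverse.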
In the case of search engines, the more powerful the controls the searcher has to sort and manipulate the hits in a predictable and intuitive fashion, the better. As in all other forms of electronic querying, the user simply must take time beforehand to analyze and list as many relevant, synonymous, and necessary terms as possible. The more precise the query, the more likely the material retrieved will be useful. The searcher also needs to consider the level of response needed. Stated simply, the user may want to approach the subject very broadly in order to gain an idea of just how large the body of information relevant to his topic is. Or, he may want very specific, exacting information about the topic to answer questions or to help confirm a hypothesis.
The sections that follow address individual search engines, ranked in order of preference.
1. AltaVista
The searcher should proceed immediately to the AltaVista Advanced Search option, because this engine indexes Web pages full-text (it claims 30 million of them). The searcher needs every control tool AltaVista offers to avoid being hit by a tidal wave of sites. AltaVista also offers searching of newsgroups on the Web. It's not unusual for an unfiltered search to return over 100,000 hits for a single query - in one second! One should always head straight for the advanced search mode - or for the beta page - in any search engine; it will always provide the tools for a more controlled search.
Ever since AltaVista first exploded on the scene in December 1995, it has been recognized as the premier search engine. It is regarded as the most comprehensive of the search engines in terms of URLs indexed, although, interestingly enough, no one seems to agree about just how many Web sites are out there. At any rate, AltaVista's search results are also consistently more comprehensive than its competitors' (Venditto, 1996). I concur absolutely with this conclusion.
The performance of the major search engines is similar for fairly simple searches, but as the concepts become more complex, the differences between the engines become more apparent. The searcher can construct search phrases for AltaVista much like the phrases used in DIALOG and many other similar electronic databases. This has not always been the case for Internet search engines. Boolean searching, proximity searching, phrase searching, and field searching are all allowed, and can be stated in the syntax that has been well established over the years (why reinvent the wheel?). Also available are wildcards (an AltaVista exclusive) and case sensitivity. Gray (1996) gives examples of well-formed search strings for AltaVista.
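One reason the established Boolean syntax transfers so naturally from DIALOG-style databases to the Web is that Boolean operators reduce to simple set operations against an inverted index. The sketch below illustrates this in Python; the index contents and the query are invented, and no particular engine's implementation is implied.

    # Boolean operators over an inverted index reduce to set operations.
    # The index contents here are invented for illustration.
    inverted_index = {
        "ohio":     {"page1", "page2", "page3"},
        "infantry": {"page1", "page3"},
        "cavalry":  {"page2"},
    }
    all_pages = {"page1", "page2", "page3"}

    def AND(a, b): return a & b       # both terms must appear
    def OR(a, b):  return a | b       # either term may appear
    def NOT(a):    return all_pages - a

    def lookup(term):
        return inverted_index.get(term, set())

    # Evaluates the query: ohio AND infantry AND NOT cavalry
    hits = AND(AND(lookup("ohio"), lookup("infantry")),
               NOT(lookup("cavalry")))
    print(hits)   # {'page1', 'page3'}

Proximity operators like NEAR require the index to store word positions as well, but the principle is the same.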
2. HotBot
HotBot is so new on the scene that few have had time to actually test and review it. It seems that once AltaVista paved the way, HotBot and several other search engines created Internet tools that are very similar in speed and control, while also offering some unique features of their own.
HotBot boasts of having indexed no fewer than 54,000,000 net sites (as of October 29, 1996), and supports the Boolean AND and OR, phrase searching, and limiting by date, media type, and location in its form-based menu. Once again, the experienced user should head straight for the "Expert Search" mode to gain maximum control of the 54 million options. The feature that permits the user to limit by media type is unique to HotBot. With it, the user can access all the sites that feature specific software add-ons like Java, JavaScript, Shockwave, Acrobat, audio, or VRML viewers. This is a great way to find sites on which to test newly downloaded software. Also, I found the graphic layout of this page to be attractive in an austere, "generation-X" sort of way. In terms of speed, all other variables considered, all of these major engines are amazingly fast. Somehow, the program is able to search all 54 million sites in about one second.
3. infoseek ultra
This new engine was introduced on August 14, 1996, and offers a major improvement over its predecessor, infoseek guide, which is still very much alive. This very impressive new product also boasts of having over 50 million URLs in its index, but what really sets it apart from the others is what infoseek calls its "real-time index" of the Web (Grady, 1996). This rather opaque phrase means that infoseek updates its index continuously: its spider senses new and changed pages and updates the index immediately.
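Mechanically, a continuously updated index just means applying each change as it arrives instead of rebuilding the whole index in periodic batches. Here is a minimal sketch of such an incremental update, with an invented inverted index; it illustrates the general idea, not infoseek's actual software.

    # Sketch of an incremental ("real-time") index update: when a page
    # changes, retract its old postings and insert the new ones at once.
    inverted_index = {}          # term -> set of URLs
    page_terms = {}              # URL  -> terms currently indexed for it

    def update_page(url, new_text):
        # Retract the page's old postings, if any.
        for term in page_terms.get(url, set()):
            inverted_index[term].discard(url)
        # Insert the new postings; the page is searchable immediately.
        terms = set(new_text.lower().split())
        for term in terms:
            inverted_index.setdefault(term, set()).add(url)
        page_terms[url] = terms

    update_page("http://example.com/junkin.html", "B F Junkin home page")
    print(inverted_index["junkin"])   # {'http://example.com/junkin.html'}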
I must admit to a healthy doubt concerning this real-time claim, so I put it to a test. I clicked on the Save URL link on the home page and submitted all three of my personal home pages in a very short and simple process (it may have taken 25 seconds). I immediately went back to the infoseek search screen and entered appropriate search terms for my pages, and all three came up in the first ten hits! Take note, all Web authors! No other search engine I tested came close to the instantaneous index refreshment that infoseek has perfected. The only other engine that is even close is AltaVista, at under twenty-four hours from posting to index. This is, to me, a critical factor, because I feel that one of the Internet's most positive characteristics is its currency. To say that this engine is consistently the most current of all the engines is high praise indeed!
Some estimates claim that almost half of the URLs on the Web are either duplicates or dead/invalid links (INFOSEEK, 1996). Infoseek ultra has created software that filters out duplicate and/or dead links, and this too is a major feature of this engine. I have yet to get an invalid-link message in any of my infoseek ultra searches. These searches are lean and accurate, with a very high "signal-to-noise ratio" - in other words, high precision.
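A simple way to approximate this kind of filtering is to checksum each page's text (identical text means a duplicate) and to probe each URL before keeping it. The sketch below uses only Python's standard library; it is my own illustration of the general idea, not infoseek's method.

    # Sketch of duplicate and dead-link filtering, standard library only.
    import hashlib
    import urllib.request

    seen_checksums = set()

    def is_alive(url):
        """A URL is 'dead' if it cannot be fetched successfully."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status == 200
        except Exception:
            return False

    def is_duplicate(page_text):
        """Two pages with identical text hash to the same checksum."""
        digest = hashlib.md5(page_text.encode("utf-8")).hexdigest()
        if digest in seen_checksums:
            return True
        seen_checksums.add(digest)
        return False

    def keep_in_index(url, page_text):
        return is_alive(url) and not is_duplicate(page_text)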
Other useful search features include case sensitivity, proper-name recognition (the search term "Junkin" alone sends my B.F. Junkin Home Page to near the top of the hit list), limiting search terms to particular fields, and eliminating terms with a minus sign ("-"). I would prefer more traditional syntax for some of these controls, but all in all, it is very difficult to find much to criticize in infoseek ultra.
4. Excite
Excite is the first engine discussed here that qualifies as both an effective Web directory organized by category and a Web search engine. It also lists 50 million indexed URLs, so it cannot be criticized for having a smaller pool of pages like the other Web directories. In fact, "Excite provides the fullest range of services of all the Web search sites" (Venditto, 1996). The user can search the text of at least 10,000 newsgroups, a daily news summary, opinion columns, cartoons, and Web site reviews.
Excite allows searching by keyword or concept, and offers searching in all the above-mentioned areas: usenet newsgroups, reviews, web documents, or classifieds. Gray (1996) lists the search terms and operators Excite allows.
However, the tests that I performed on Excite, which involved trying to access my three home pages using my own specified search terms, produced some very strange results. For my first home page, with the HTML title "Letters from the 126th Ohio Volunteer Infantry", I keyed in "126th Ohio Volunteer Infantry" and got 236 hits. My page was not in the first sixty of them. I then keyed in the complete title verbatim, and got zero hits. A little unnerved, I decided to try my second page, entitled "Decedents of Johann Tobias Horine". This time, my Horine page was hit number 1 (as it should have been), but amazingly enough, my 126th page (which is a link off the Horine page) showed up as hit number 5. Go figure! Not only that, but all my links from the Horine page were listed in the top 7 hits. One of them, "Civil War Ohio - A Special Collection", was listed even though it is a link from the first page (the 126th OVI). In addition, the Excite document summary for this link consisted of a couple of random sentences from the middle of the document. This is totally inexplicable to me, so I won't attempt an explanation here. Suffice it to say, if I can't get predictable results when I key in my own search terms for my own pages, I tend to distrust the keyword-matching ability of this engine across the board.
Lastly, I find the Excite screen cluttered and more than a little confusing. Don't bother clicking on the Advanced Search link unless all you're after is information, because you cannot enter search terms from the advanced screen; you have to back out to the original screen to perform a search.
5. Lycos
Many veteran Web searchers have very soft spots in their hearts for Lycos, because for a while after its 1994 inception at Carnegie Mellon, it was alone in its class. After all, how can anyone dislike a search engine that was developed by a man named Dr. "Fuzzy" Mauldin? At any rate, Lycos is still quite popular, but objectively speaking, it hasn't quite kept pace with some of the newer, shinier engines. It does claim an index of 68 million URLs.
In the tests I performed with my own URLs, Lycos performed perfectly and predictably. Generally, though, Lycos is known for high recall but poor precision (Venditto, 1996). I must agree. For example, I keyed in the exact title of my 126th OVI page and got back 364 documents, with my page right where it should be, at number 1. With a search this precise, I wonder why Lycos retrieved so many other documents. In the identical search in AltaVista, I got one hit: my page. If the search terms are this precise, I think the response of the database should be equally precise. I found the Lycos response soft; if I had wanted to retrieve related documents, I would have made a more general query. This is a small point perhaps, but it makes me wonder just how many "soft" hits I would get with a more general query - probably far too many. The level of the response should match the level of the query, and this, I believe, is a basic database heuristic.
The summaries of the retrieved documents are informative, with the search terms bolded - a feature that all engines would do well to incorporate. Lycos's use of Boolean operators is frankly a little confusing, but the searcher can specify degrees of relevancy for search terms.
Generally, Lycos retrieves lots of documents, so it's probably not the best engine for finding something quickly. It is very comprehensive, but its control language is inferior to that of several of the newer, shinier engines listed previously.
6. Open Text
Open Text is a little secretive about the size of its index. Estimates are that it is in the range of 1.5 million URLs (Sullivan, 1996). This is considerably smaller than the 50+ million claimed by Excite, HotBot, Lycos, and Infoseek Ultra. Ironically, in the FAQ information linked from its main page, Open Text goes way out of its way to kick around WebCrawler for indexing only 100,000 or so sites (Opentext, 1996). The truth is that of the major search engines, Open Text is next to last in index size, and WebCrawler is the only smaller one. At least Open Text picks on someone its own size - the only one!
These concerns aside, Open Text is arguably the best-designed search site on the Web (Venditto, 1996). Open Text offers seemingly every conceivable search option. Its robot indexes each page full-text, my own preferred method of access. It offers a "power search" that can include up to five search terms and the use of five Boolean operators between terms, selected from pull-down menus. You can specify field searching per term: anywhere, title, summary, first heading, or URL. And finally, you can specify a weighted search for up to four search terms. These options are mostly quite accommodating, but I found them to be quite linear. For example, when I entered a complex series of terms in the main menu, it retrieved only documents in which these terms occurred in the order I entered them.
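Weighted searching, as described here, lets the searcher say that some terms matter more than others. The following minimal sketch shows one plausible way such term-weighted scoring can work; the weights and documents are invented, and Open Text's actual algorithm is not public.

    # Sketch of a term-weighted search: each query term carries a weight,
    # and a document's score is the sum of the weights of the terms it
    # contains. Weights and documents are invented for illustration.
    query = {"ohio": 4, "infantry": 2, "letters": 1}

    documents = {
        "page1": "letters from the 126th ohio volunteer infantry",
        "page2": "ohio travel and tourism guide",
        "page3": "infantry tactics of the civil war",
    }

    def score(text):
        words = set(text.split())
        return sum(weight for term, weight in query.items() if term in words)

    ranked = sorted(documents, key=lambda p: score(documents[p]), reverse=True)
    for page in ranked:
        print(page, score(documents[page]))
    # page1 scores 7, page2 scores 4, page3 scores 2

The effect is a ranked hit list in which documents matching the heavily weighted terms float to the top, rather than a strict yes/no Boolean match.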
The bottom line for Open Text is that this engine offers nice control options, but it's not nearly comprehensive enough. It is better to stick with the big indexes, and these days there are quite a few excellent ones from which to choose.
7. WebCrawler
As previously mentioned, WebCrawler has the smallest index of the major search engines, estimated at 500,000 URLs (Sullivan, 1996). It does index its sites full-text, but WebCrawler's principal criterion for selecting sites to add to the index is page popularity - that is, the sites that are the most well-traveled in terms of visitors. To my mind, this method would tend to yield sites that are "pop" in nature, or concerned with mainstream information. This type of construction is very well suited to its new sponsor, America Online. I would not look in WebCrawler for scholarly or esoteric information, however.
Another problem is that only the page title of each retrieved URL is displayed for the searcher. The title may or may not be descriptive enough to provide intellectual access to the document, so the searcher is forced to link to each page to get a sense of its content.
If the object of your search is mainstream information, such as information on high-profile corporations, television networks, sports, or movie stars, WebCrawler should be your first choice. That is the character of this index, and it does occupy a distinct niche. I must add, however, that judicious use of control language in the more comprehensive engines like AltaVista, HotBot, or Infoseek ultra should enable the searcher to locate the same material.
WebCrawler is fast and easy to use. It does offer a browsable subject catalog, and in its "advanced mode" it offers Boolean and proximity searching to hone your search. But, once again, WebCrawler's index is only 1% of the size of the big indexes, so I really cannot conceive of a good reason for using it as a search engine. "Compared with the newer speed merchants such as AltaVista and HotBot, WebCrawler isn't the fastest or most up-to-date search engine" (Page, 1996).
Subject indexes/catalogs & meta search engines
Of the subject indexes on the Web, Yahoo is generally regarded as the largest and best tool (Gray, 1996). If you would prefer an Internet index based on the Dewey Decimal System, check out the BUBL (BUlletin Board for Libraries) Information Service, where the URLs are divided into subject hierarchies based on Dewey.
Other very good subject indexes exist as well, as do a number of excellent meta-search engines (also known as multi-threaded search engines).
These meta-search tools allow the searcher to submit a single query to several search engines at once, in a customized combination specified by the searcher. The user is presented with a merged list of hits, along with information about which search engines (AltaVista, Lycos, HotBot, etc.) each came from. The user can then simply click on any of these documents, just as he would in a single-engine search.
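In outline, a meta-search engine simply fans the query out to several engines and merges the hit lists, tagging each hit with its source. The sketch below fakes the per-engine queries with canned results; a real tool would issue an HTTP request to each engine and parse what comes back. All names and URLs are hypothetical.

    # Outline of a meta-search: fan one query out to several engines,
    # then merge the hit lists, remembering where each hit came from.
    # The per-engine results are canned here for illustration.

    def search_altavista(query):
        return ["http://a.example/1", "http://shared.example/x"]

    def search_lycos(query):
        return ["http://b.example/2", "http://shared.example/x"]

    ENGINES = {"AltaVista": search_altavista, "Lycos": search_lycos}

    def metasearch(query, engines):
        merged = {}   # URL -> list of engines that returned it
        for name, engine in engines.items():
            for url in engine(query):
                merged.setdefault(url, []).append(name)
        return merged

    for url, sources in metasearch("126th ohio", ENGINES).items():
        print(url, "from", ", ".join(sources))

Note that a URL returned by more than one engine appears only once in the merged list, with every source engine named - the behavior described above.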
Before a researcher logs onto the Internet, he needs to answer a few simple questions to help him determine the best type of search tool for his purposes. If he is looking for specific information, the best choice is a search engine, in the order of preference indicated in the body of this paper. If the purpose is merely to browse sites to learn what is available on a subject of interest, the subject indexes are the place to start. The meta-search engines are alluring, but in theory at least, comprehensive search engines like AltaVista, HotBot, Infoseek ultra, Lycos, and Excite should yield much the same results. When using these comprehensive engines, the searcher needs to be as explicit as necessary to retrieve the level of results desired. Also, if precise information is needed, the search terms likewise need to be as precise and limiting as possible. As previously mentioned, AltaVista seems to be the best at matching the level of its retrieved documents to the level of the search terms, and for this and many other reasons, it is my first choice for an Internet search tool.
References

Babb, Chris. March 27, 1996. Babb's bookmarks. Boardwatch, vol. X, issue 3. Hypertext document: http://www.boardwatch.com/mag/96/apr/bwm8.htm

Berkeley Digital Library SunSITE. September 17, 1996. Internet search tool details. Hypertext document: http://sunsite.berkeley.edu/Help/searchdetails.html

Dvorak, John C. November 7, 1996. Indexing the internet. PC Online. Hypertext document: http://www.pcmag.com/dvorak/jd961007.htm

Grady, Steve. August 14, 1996. Infoseek introduces Infoseek Ultra. Hypertext document: Http://C%7C/Program%20Files/Netscape/Na...32.19960930112925.006d5d78@corp&number=51

Gray, Terry A. July 29, 1996. How to search the web: A guide to search tools. Palomar University Library. Hypertext document: http://issfw.palomar.edu/Library/TGSEARCH.HTM

Infoseek Ultra. November 2, 1996. Comparison of world wide web search engines. Hypertext document: http://www.infoseek.com/doc?pg=comparison.html

Jones-Catalana, Cynthia N. November 1996. One-stop surfing. Internet World, vol. 7, no. 11.

LYCOS, Inc. October 23, 1996. LYCOS, Inc. profile. Answer to e-mail inquiry by the author.

Mitchell, Steve. May 14, 1996. General interest resource finding tools: A review and list of those used to build INFOMINE. University of California, Riverside. Hypertext document: http://lib-www.ucr.edu/pubs/navigato.html#1

Northwestern University. June 10, 1996. Evaluation of selected internet search tools. Hypertext document: http://www.library.nwu.edu/resources/internet/search/evaluate.html

Open Text Index. November 2, 1996. Frequently asked questions about the Open Text Index. Hypertext document: http://index.opentext.net/main/faq.html

Page, Adam. September 12, 1996. The search is over. PC Computing. Hypertext document: http://www.zdnet.com/pccomp/features/960912/sub2.html

Solock, Jack. June 3, 1996. End user's corner: Site-ation pearl growing. InterNIC News. Hypertext document: http://rs.internic.net/nic-support/nicnews/archive/june96/enduser.html

Solock, Jack. 1996. Searching the internet part II: Subject catalogs, annotated directories, and subject guides. InterNIC News. Hypertext document: http://rs.internic.net/nic-support/nicnews.endusers.html

Sullivan, Danny. 1996. The webmaster's guide to search engines and directories. Hypertext document: http://calafia.com/webmasters/study.html

Tillman, Hope N. July 10, 1996. Evaluating quality on the net. Hypertext document: http://www.tiac.net/users/hope/findqual.html

Tyner, Ross. May 10, 1996. Sink or swim: Internet search tools & techniques. Okanagan University College. Hypertext document: http://www.sci.ouc.bc.ca/libr/connect96/search.htm

Venditto, Gus. May 1996. Search engine showdown. Internet World, vol. 7, no. 5, pp. 79-86.

Webster, Kathleen, and Paul, Kathryn. January 1996. Beyond surfing: Tools and techniques for searching the web. Feliciter, vol. 42, no. 1, pp. 48-54. Hypertext document: http://magi.com/~mmelick/it96jan.htm

Winship, Ian. 1995. World wide web searching tools. Vine, no. 99, pp. 49-54. Hypertext document: http://www.bubl.bath.ac.uk/BUBL/IWinship.html