In the Search Engines Corner for this issue, Dave Beckett shows you the best of the search engine features and gives tips on managing the output when a query returns thousands of answers. This article appears in the Web, and not the print, version of Ariadne.
Do you ever search for something difficult and end up with enormous and completely useless results like this?
Search Engine | Est. Size (May 1998) | Results of searching for The Internet |
---|---|---|
AltaVista[1] | 140M | About 47,565,550 matches were found |
HotBot[2] | 110M | Web Results 7939803 matches. Breakdown: internet 4602458, the 47280842 |
The query above is rather contrived but indicates the problem. In this article I hope to help you with these kinds of situations by pointing you to the right place, showing you the best of the search engine (SE) features and giving tips on managing the output that you get when a query returns thousands of answers.
I run several search engines myself, including the Academic Directory[3] and UK Internet Sites[4], and do research in the area.
The first choice to make is where to search, and this may not be at a web crawler SE. This table lists some good choices for specialised searches:
Searching ...? | Try these |
---|---|
USENET news | DejaNews[5] (also available via other SEs) |
Stuff printed on dead trees | Your local OPAC, NISS[6], BIDS[7], eLib[8] projects, Northern Light[9] |
Pictures | Lycos[10], Yahoo! Image Surfer[11] |
Sounds | Lycos |
News | BBC News[12], CNN[13], NewBot[14], NewsIndex[15], most big SEs |
UK Web | Yahoo! UK[16], Search UK[17], UK Index[18], InfoSeek UK[19], Excite[20], Lycos |
.ac.uk Web | Academic Directory (I run this) |
European Web | Excite, Euroferret[21], HotBot |
Particular country Web | HotBot |
In a date range | HotBot |
In a particular language | AltaVista (translations too) |
If you are very lucky, there may be a subject-specific site that you can use where professional cataloguers have been paid to find web resources and provide high quality records. For example there are several eLib Subject Based Information Gateways (SBIGs). Some of those projects are experimenting with web crawls of the sites they have hand-picked, so searching there should return highly relevant results.
So how do you find one of these sites? Try using an appropriate authority in your subject. If you have a professional body or association, look at their web site for links. If that isn't possible, try a general SE or a directory service. Search.com[22] contains a list of over 100 specialised SEs and may be a good place to start.
There are a few well-known large directories, of which Yahoo! is the largest and best, with over 500K web sites listed (it is in fact the most used site on the Web). It is always a good idea to search there, or at one of the smaller directories such as LookSmart[23]. There are also some non-commercial directories such as WWWVL[24] (a site started by a creator of the web, Tim Berners-Lee) and the new NewHoo![25].
If you have got this far, either you want something that is on the web, or the specialised searches didn't find quite what you wanted.
Coverage and freshness of the web crawls are important, and the web crawlers all have different sizes and activity patterns. Search Engine Watch[26] keeps up-to-date estimates of the sizes[27], but this isn't the full story. A paper[28] at the WWW7 conference estimated the size of the web in November 1997 as 200M static pages, but the joint coverage of the 4 largest SEs was only 160M and the overlap between them was only 2.2M, or 1.4% of the pages they covered!
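The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch using the numbers quoted above; the variable names are my own, and it assumes that "joint coverage" means pages indexed by at least one of the four SEs. It suggests that the quoted 1.4% is the overlap as a share of the 160M covered pages; as a share of the estimated 200M pages on the web it would be nearer 1.1%.

```python
# Figures quoted in the text, in millions of pages (estimates for November 1997).
web_size = 200.0        # estimated static pages on the web
joint_coverage = 160.0  # assumed: pages indexed by at least one of the 4 largest SEs
overlap = 2.2           # pages in the overlap between those SEs

covered_share = joint_coverage / web_size      # fraction of the web any of them sees
missed_pages = web_size - joint_coverage       # pages none of them has indexed
overlap_of_covered = overlap / joint_coverage  # overlap relative to covered pages
overlap_of_web = overlap / web_size            # overlap relative to the whole web

print(f"Covered by the four SEs together: {covered_share:.0%} of the web")
print(f"Not indexed by any of them: {missed_pages:.0f}M pages")
print(f"Overlap as a share of covered pages: {overlap_of_covered:.2%}")
print(f"Overlap as a share of the whole web: {overlap_of_web:.2%}")
```

The practical point is the same either way: a page missing from one SE may well be in another, so it is worth trying more than one.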
At the current date, AltaVista is probably still the largest, with HotBot a close second and Northern Light growing rapidly. The crawlers of most of the largest SEs are very active, checking or adding millions of pages each day, so for freshness, choose the largest crawlers. Each of the SEs has a different set of indexing and search features that can be exploited, which may make the difference.
The worst thing you can do to an SE is to present a one-word query with a very common term. If you look at some of the internet search engine spy pages[29], many people really do this. So you should choose some extra words. Can't think of any? Try your one-word search on Excite and use its suggested words feature: 10 related words are suggested with the results of every search (JavaScript support is required to pick them by clicking, but you can always type them in too).
Assuming you have managed to get a couple of words, you can now do something with them. If you try a general query on an SE like the one above, you end up with millions of results, so it is a good idea to modify the query. Most SEs allow you to use these methods, although the syntax varies and sometimes they are found only on an advanced search page:
Method | What it does |
---|---|
Match Any | Results may contain any of the words |
Match All | Results must contain all of the words |
Exclude / Require | Exclude or require particular words |
Phrase Searching | Require words in the exact order given |
Proximity | Look for words near other words |
Wildcards | Partial match on words; do you know how to spell it? |
For the details of which engines support which features, see the Search Engine Watch Power Searching page at [30].
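To make the differences between these methods concrete, here is a minimal Python sketch that applies each of them to a single piece of text. It is an illustration only: the function names, the tokeniser and the sample sentence are my own assumptions, and real SEs each have their own syntax and work against an index rather than scanning text like this.

```python
import fnmatch
import re

def tokenize(text):
    """Lower-case the text and split it into a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_any(words, query):
    """Match Any: at least one query word appears."""
    return any(q in words for q in query)

def match_all(words, query):
    """Match All: every query word appears."""
    return all(q in words for q in query)

def require_exclude(words, required, excluded):
    """Exclude / Require: all required words appear and no excluded word does."""
    return match_all(words, required) and not match_any(words, excluded)

def phrase(words, query):
    """Phrase Searching: the query words appear together, in the order given."""
    n = len(query)
    return any(words[i:i + n] == query for i in range(len(words) - n + 1))

def near(words, a, b, distance=10):
    """Proximity: word a occurs within `distance` words of word b."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def wildcard(words, pattern):
    """Wildcards: some word matches a pattern such as 'engin*'."""
    return any(fnmatch.fnmatchcase(w, pattern) for w in words)

# A made-up sentence standing in for a web page.
words = tokenize("Search engines crawl the web and index millions of pages")

print(match_any(words, ["spider", "crawl"]))
print(match_all(words, ["spider", "crawl"]))
print(require_exclude(words, ["web"], ["usenet"]))
print(phrase(words, ["search", "engines"]))
print(near(words, "crawl", "index", distance=5))
print(wildcard(words, "engin*"))
```

Each method narrows or widens the result set in a different way, which is why combining a phrase with a required or excluded word usually cuts down the results far more than adding another loose word.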
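So you have chosen an SE and tweaked the query to match what you want but still end up with an unmanageable set of results. There are two types of page that you really want to identify[31]: authorities, which are good sources of information on the topic, and hubs, which link to many good authorities.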
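The cited paper[31] scores pages as authorities and hubs from the links between them: a good hub links to good authorities, and a good authority is linked to by good hubs. The following Python sketch shows that idea on a tiny link graph. It is a simplified illustration under my own assumptions, not the paper's full method and not what any SE mentioned here does; the page names and the `hits` and `normalise` functions are hypothetical.

```python
import math

def normalise(scores):
    """Scale the scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=50):
    """Return (authority, hub) scores for a dict of page -> pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                new_auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, ())) for p in pages}
        auth = normalise(new_auth)
        hub = normalise(new_hub)
    return auth, hub

# A made-up link graph: two link lists and three content sites.
links = {
    "list-a": ["site-1", "site-2"],
    "list-b": ["site-1", "site-2", "site-3"],
    "site-3": ["site-1"],
}

auth, hub = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
print("Best authority:", best_authority)
print("Best hub:", best_hub)
```

In this toy graph the page that most link lists point to comes out as the top authority, and the list that points to the most authorities comes out as the top hub.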
People who create such pages do tend to submit them to directories and search engines, or get links made from other related authorities or hubs. Check there first to see if you can find those sites, but here are some techniques I suggest when working with SEs:
"So what search engine do you use?"
HotBot mostly, with all the options on; AltaVista for coverage; and others as need be.
"Why?"
HotBot has loads of juicy features, is large enough and crawls often enough to keep the index fresh. One downside is that it is US-based and has no local partner.
Here, .com sites are in the USA and .uk sites are in the UK unless otherwise indicated.
[1] AltaVista
http://www.altavista.digital.com/
Note: Since the merger with Compaq, the European AltaVista service
at altavista.telia.com does not seem to be updated. You will have to use
the US one.
[2] HotBot
http://www.hotbot.com/
Inktomi provide the technology and data behind HotBot and also
power searches at Yahoo!, CNET's Snap and Disney's Internet Guide (DIG).
Searches there mostly have the same functionality as HotBot.
[3] The Academic Directory, HENSA Unix
http://acdc.hensa.ac.uk/
I made this.
[4] UK Internet Sites, HENSA Unix
http://www.hensa.ac.uk/uksites/
and this.
[5] DejaNews
http://www.dejanews.com/
[6] National Information Services and Systems (NISS)
http://www.niss.ac.uk/
[7] Bath Information & Data Services (BIDS)
http://www.bids.ac.uk/
[8] The Electronic Libraries Programme (eLib)
http://www.ukoln.ac.uk/services/elib/
[9] Northern Light
http://www.northernlight.com/
See last month's article for more information.
[10] Lycos UK (actually in Germany)
http://www.lycos.co.uk/
[11] Yahoo! (US) Image Surfer
http://isurf.yahoo.com/
[12] BBC News
http://news.bbc.co.uk/
[13] CNN (European mirror)
http://europe.cnn.com/
[14] NewBot
http://www.newbot.com/
[15] News Index
http://www.newsindex.com/
[16] Yahoo! UK & Ireland
http://www.yahoo.co.uk/
[17] Search UK
http://www.searchuk.com/
[18] UK Index
http://www.ukindex.co.uk/
[19] InfoSeek UK (actually in USA)
http://www.infoseek.co.uk/
[20] Excite UK (actually in USA)
http://www.excite.co.uk/
[21] EuroFerret (actually in UK)
http://www.euroferret.com/
[22] Search.com, CNET
http://www.search.com/
[23] LookSmart
http://www.looksmart.com/
[24] World Wide Web Virtual Library
http://www.vlib.org/
UK Mirror: http://www.mth.uea.ac.uk/VL/Overview.html
[25] NewHoo!
http://www.newhoo.com/
[26] Search Engine Watch, Mecklermedia
http://www.SearchEngineWatch.com/
[27] Search Engine Sizes, Search Engine Watch
http://www.searchenginewatch.com/reports/sizes.html
[28] A Technique for measuring the relative size and overlap of public Web search engines, Bharat and Broder, Digital Systems Research Center, USA in Proceedings of WWW7, April 1998.
[29] Yahoo! Search Engine Spying Page
http://www.yahoo.co.uk/Computers_and_Internet/Internet/World_Wide_Web/Searching_the_Web/Indices_to_Web_Documents/Random_Links/Search_Engine_Spying/
[30] Search Engine Watch - Power Searching
http://www.searchenginewatch.com/facts/powersearch.html
[31] Authoritative sources in a hyperlinked environment, Jon Kleinberg, in Proceedings of 9th ACM-SIAM Symposium on Discrete Algorithms, 1998; also appears as IBM Research Report RJ 10076(91892), May 1997
http://www.cs.cornell.edu/home/kleinber/auth.ps
[32] Planet Search
http://www.planetsearch.com/
[33] Metacrawler
http://www.metacrawler.com/
Material on this page is copyright Ariadne/original authors. This article last updated/links checked on 22-Jul-1998