About Searching and Search Tools

- Why, where and how to find the information you need -

by: Albert Benschop

  1. From surfing to searching
  2. Types of search tools
    1. Tools for browsing: Subject Catalogs
      • Virtual Libraries
      • Subject Guides
      • Geographical Guides
    2. Tools for keyword searching: Navigators
      • General search engines, robots, crawlers, worms
      • Uniform meta-search engines
      • Multiform meta-search engines
      • Geographical search engines
      • Specialized search engines
  3. Which is the best?
  4. Watch the money makers ...
  5. UltraSeek: Claims to be the best
  6. Using search engines
  7. Understanding and comparing Web tools


FROM SURFING TO SEARCHING

Searching for information on the World Wide Web (WWW) can be a long and tedious task. Finding the information you need amongst this huge collection of resources can be very difficult without effective tools. The WWW has grown phenomenally since its origin as a small-scale resource for sharing information. Manually browsing through a significant portion of the hypertext structure is no longer possible, let alone an effective method for resource discovery. There is too much material, and it is too dynamic.

There are millions of Web pages out there. The latest estimate (March 1998) is that there are around 275 million pages of information on the Web, up from 110 million in mid-1997. Nobody really knows. The Web is growing and mutating very fast, which is to be expected when so many people are able to communicate online. Up to 20 million new pages appear each month. If this phenomenal growth rate continues, everyone on Earth would have a personal page on the Web within four years.

Until recently, surfing was the typical approach for finding information on the Web. Surfing is unstructured and serendipitous browsing: you start with a particular Web page and follow links from page to page, making educated guesses along the way and hoping sooner or later to arrive at the desired piece of information. Surfing is browsing without tools. It is fun when you have the time to explore. But if you need to find a specific piece of information quickly, or need to find that same information again, surfing and serendipity soon lose their charm.

As the WWW has grown, it has become necessary to provide a quick and easy method of searching webspace. Search tools - often known as search engines - have been developed to perform this task. Search engines provide a front end to a database of indexed WWW resources, into which search keywords can be typed.

The number of search tools available on the WWW has grown quickly in a short period. This poses new problems for WWW users. There is now a bewildering variety of search tools available, each offering different features and interfaces. Many are linked to sizeable catalogs of WWW resources, and some claim to offer a comprehensive index of the entire WWW. Some search only on machine names or directory and file names (the URLs), while others also search on the titles and headers of HTML pages. Some allow searching several different indexes.

The upshot is (a) that there is no one best way of searching, and (b) that there is no perfect search tool. All search tools and engines have their strengths and weaknesses. So your best bet is to learn how to use an entire arsenal of them. Most of us end up preferring different suites of finding tools, depending on their effectiveness within our chosen subject domain and on our personal searching style. For your own subject interests, you need to explore and determine for yourself which tools work best for your purposes.



TYPES OF SEARCH TOOLS

There are two complementary approaches to finding information on the Web: (1) browsing through subject trees and hierarchies, and (2) keyword searching using navigators or search engines. So in general we can say that there are two types of search tools:
  1. The tools for browsing are the Subject Catalogs
  2. The tools for keyword searching are the Navigators
Be aware that people use different words for these tools. The tools for browsing are called 'virtual libraries' or 'link libraries', 'subject catalogs' or 'subject indexes', or 'searchable directories'. We will use the term 'subject catalogs' as a general name for all the tools for browsing. The tools for keyword searching also go by several names: 'search engines', 'navigators', 'robots', 'crawlers', 'worms', etc. We will use the term 'navigators' as a general name for all the tools for keyword searching.

1 Subject Catalogs or virtual libraries

One way to organize information on the Internet is to create a document or collection that maintains lists of links organized by their content. A subject catalog is a tool that provides a structured and organized hierarchy of categories for browsing for information by subject (we call them 'subject trees'). Under each category and/or subcategory, links to appropriate Web pages are listed. Web pages are assigned categories either by the Web page author or by the subject tree administrator. Many subject catalogs also have their own keyword searchable indexes.

A subject catalog contains substantial numbers of links to Internet resources organized via subject categories created by someone familiar both with the topic and how people would seek information within it. It is an intelligently designed 'links library' or 'links index' that has been organized and compiled by subject experts. The intent is to guide searchers within a high-quality, large domain of selected resources.

Subject catalogs or virtual libraries are often organized hierarchically to make it easy to navigate from a general to a specific topic of interest. Well-written catalogs also contain cross-references between related topics under different headings.

The searchable domain of subject catalogs is smaller than that of most navigators and quality is dependent on the subject expertise and Internet experience of those doing the selecting. Subject catalogs are like subject guides but usually much more comprehensive and less narrative.

General Virtual Libraries Some subject catalogs or virtual libraries present their links with brief annotations; they are typically large, with minimal restrictions on what is accepted for inclusion. Well-known virtual libraries of this kind are: Galaxy, Infomine, Internet Public Library, Internet Sleuth (it's not only a search engine), Planet Earth, WWW Virtual Library Series Subject Catalog, WebSurfer, and the most popular of all, Yahoo.

Reviewing Virtual Libraries Some virtual libraries provide significant added value to each link, with commentaries and ratings provided by skilled reviewers. Here are some examples: NetReviews (from Excite), Magellan, Point Communications, and WIC (formerly GNN's Whole Internet Catalog).

Subject-specific Guides Subject-specific virtual libraries that function as subject bibliographies of Internet resources are authored by subject specialists. These subject guides are specialized subject trees for specific disciplines. Examples are: ArchNet WWW Virtual Library (for archeology), and the Clearinghouse for Subject-Oriented Internet Resource Guides (for sociologists it provides access to Social Sciences and Social Issues).

Geographical Guides A special kind of subject catalog is the geographical guide. In these guides you can search by continent, country, region, city, etc. Most of them use a series of increasingly specific clickable maps that take you to the locale you wish to visit. This is browsing in geographical space. Examples are: CityNet, Virtual Tourist2, and the GeoSurfer.

2 Navigators: Search Engines, Robots, Crawlers, Worms etc.

Navigators or search engines feature indexes that are automatically compiled by computer programs, such as robots and spiders, that go out over the Internet to discover and collect resources.

The words robot, spider, crawler, wanderer and worm are all used to describe computer programs which are designed to explore and compile information about the Web. These programs usually have a database to organize the data about the sites they encounter. Often, the database is put on the Web so that you the user can search it. Because each robot is programmed to search the Web in a different way, the information stored in each database can be very different.

A Web spider or robot examines a document and indexes it - enters it into the database - based on words extracted from the title or text. In addition, the software searches the document for pointers or URLs to other documents that have not been indexed yet. Search engines work on the principle that the information content of a document can be summarized by extracting words already present in the title or text. By ranking the extracted words by their position in the title or text, the number of times they appear in the document, and other criteria, the indexing software separates incidental words or phrases (known as 'false drops') from those relevant to the topic.
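
To make the idea concrete, here is a minimal Python sketch of such an indexer. The weights, the sample document and the URL are all invented for illustration; real engines use far more elaborate ranking criteria.

  import re
  from collections import defaultdict

  # Invented weights: words in the title count more than words in the body.
  TITLE_WEIGHT = 5
  BODY_WEIGHT = 1

  def index_document(url, title, body, index):
      """Add one document to a simple in-memory inverted index."""
      for field, weight in ((title, TITLE_WEIGHT), (body, BODY_WEIGHT)):
          for word in re.findall(r"[a-z]+", field.lower()):
              # Score per (word, URL): occurrences times field weight.
              index[word][url] += weight

  index = defaultdict(lambda: defaultdict(int))
  index_document("http://www.example.org/a", "Rural Sociology",
                 "Sociology of rural labor markets.", index)
  print(dict(index["sociology"]))  # {'http://www.example.org/a': 6}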

Searchers can connect to a search engine site and enter keywords to query the index. Web pages and other Internet resources that satisfy the query are identified and listed.

Not all search engines are created equal. They vary in the size of the index, the frequency with which the index is updated, the search options, the speed of returning a result set, the presentation of the result set, the relevancy of the items included in a result set, and the overall ease of use.

There are several types of navigators or search engines:

  1. General Search Engines
  2. Uniform Meta-Search Engines (unified interface)
  3. Multiform Meta-Search Engines (multiple interfaces)
  4. Geographical Search Engines
  5. Specialized Search Engines
General Search Engines General-purpose navigators that search either the Web alone or the Web as well as other types of Internet resources. They vary a great deal in comprehensiveness, search features, display features and relevance to particular subject domains. Examples are: AliWeb, Alta Vista (web, gopher, newsgroups), Excite (web, newsgroups), HotBot (web, newsgroups), Infoseek (web, newsgroups), Inktomi, Lycos (web, ftp, gopher), OpenText (web, gopher), Ultraseek, and WebCrawler (web, gopher and ftp).

Uniform Meta-Search Engines There is a special kind of navigator that allows you to search other search engines. They are usually called meta-search engines. They allow you to search enormous domains of servers and documents from one point. The multi-threaded search engines are the most intelligent navigators. I will call them 'uniform' meta-search engines, because they typically use only one form to call up the search engines. They allow you to enter one search that will be automatically - and often simultaneously - conducted in several large search engines. Examples of good meta-search engines with a unified interface are: All4One, Dogpile, iFind, Javabot, MetaCrawler, SavvySearch.
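
The multi-threaded principle is easy to sketch. In this illustrative Python fragment the engine names are made up and the query function is a stub; a real meta-search engine would send an HTTP request to each engine and merge the parsed results.

  import threading

  def query_engine(engine, terms, results):
      # Stub: a real system would fetch and parse the engine's result page here.
      results[engine] = ["%s: hit for '%s'" % (engine, " ".join(terms))]

  def meta_search(terms, engines=("EngineA", "EngineB", "EngineC")):
      results, threads = {}, []
      for engine in engines:           # query all engines in parallel
          t = threading.Thread(target=query_engine, args=(engine, terms, results))
          t.start()
          threads.append(t)
      for t in threads:
          t.join(timeout=5)            # don't let one slow engine stall the search
      return results

  print(meta_search(["labor", "market"]))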

Multiform Meta-Search Engines The second sort of meta-search engines are compilations, or multi-form front-ends, for other search engines. These meta-search engines offer, from one site, forms for entering searches in several navigators in serial fashion. This may be convenient, but sometimes you need to intervene and tune your search as the display features of the individual navigators allow. Examples are: All-in-One (a compilation of forms for more than 120 search engines), 2ask (several hundred search engines), Internet Sleuth (several hundred), Infomine (more than 90), Search.com (more than 250), W3, and Cusi.

Geographically Focused Navigators These navigators are arranged by continent, country, city, etc. They often use a series of increasingly specific maps that take you to the locale you wish to visit (and therefore they could also be seen as geographical guides). They can be the quickest way to get to a resource in a specific locale, and they are fast at finding Internet sites in remote areas. Examples are CityNet (from Excite) and the GeoSurfer.

Specialized or Focused Search Engines Immense amounts of very useful material can be found in other parts of the Internet. There is a whole range of specialized navigators that focus on: Gophers, FTP sites, newsgroups and mailing lists, libraries, e-journals, software, shareware, products and services. Some of them are explained in the section on specialized search engines; others are dealt with on special pages of the SocioSite, such as the "Libraries", "Electronic Journals", "Who's where?", "What's New?", and "NewsPaper and NewsServices" pages.



WHICH SEARCH ENGINE IS BEST?

The answer to this question depends on what you are looking for. Each engine does some things better or faster than the rest. In using them you will learn what the differences are.

Most people start with Yahoo, because it is the best-known search engine. This is not a bad choice, because it also happens to be one of the best. Yahoo is a good place to start, but it is a man-made directory and therefore carries only a limited amount of information.

For sociologists the Clearinghouse for Subject-Oriented Internet Resource Guides is a very productive source of information. This metacatalog has specialized guides for many subjects in the social sciences.

Many people are impressed with Excite and InfoSeek. In many tests they get the highest honours. But there are some very good and fast alternatives: HotBot, Alta Vista and Google also get high marks. Together they will give you almost all the results you need. And if you include the WWW Worm, you have little need for anything else.

If you're searching for keywords in documents, Open Text is extremely fast.

The meta-search engines provide one-stop access to several engines. You will see more mega- or meta-search machines on the market, especially multi-threaded search pages, which can rummage through multiple engines simultaneously. MetaCrawler is probably the best and fastest uniform meta-search engine, but newer systems like iFind and JavaBot are closing in.

Although the metapages are very useful, they have their own limitations. They don't usually offer the full search customizability of the original engines, so the results are generally much less precise. And they can be painfully slow: a multi-threaded search page must work its way through several sites on the Net, any one of which might be tied up doing other work (the servers where these free services reside occasionally have to earn their keep by doing some real work, like accounting or data processing). Such delays can grind your search to a standstill.

The best results are achieved by using one query word only. The reason is that there is no standard for search engines on the Internet, and they all have their own way of treating the words that you enter. When you enter two or more words, some search engines will treat them as an implicit OR, while others will handle them as an implicit AND or as a phrase. Meta-searches are only really useful if you are doing a very broad search or are familiar with the databases that you are searching. It's just like all the other things in life: you will enjoy it most when you are fully aware of the limitations.

On the search pages of the SocioSite you will find short reviews of the major engines. They give a description of the features and benefits of the engine, the content and size of the databases, the ways to search and the character of the results, and their pros and cons.



WATCH THE MONEY MAKERS...

Free search engines are an endangered species. Until recently, the best search tools were available free of charge. The latest raves, InfoSeek and Ultraseek, are configured on a fee-for-search basis. Magellan is now free, but the developers are clear that the intent is to make it available only through licensed Internet providers. EINet Galaxy is now part of EINet Corporation, acquired by SunRiver, presumably for commercial purposes. America Online recently bought Yahoo, GNN's Whole Internet Catalogue, WebCrawler and long-time Internet database specialists WAIS Inc., with the potential for future subscription requirements. The Microsoft Network has licensed Lycos.

What these acquisitions mean for the cost of searching on the Internet remains uncertain. But one thing is clear: they want to make huge profits, and you are a potential victim of their exploitative strategies. So watch your wallet, and support all initiatives to decommercialize the Internet.

Some people think that there might be a silver lining. They expect the tools to become more refined once we start paying for searches. We all want good search instruments that deliver precise results easily and quickly. If you think that the commodification of the Internet will create the instruments to reach this goal, you must be prepared to pay a price. And that will not only be the money you'll have to pay for every search action. It will also be the loss of free access to information on the Internet. Commercialization of the search engines implies that electronic access to information will be sold as a commodity to those who are able to pay for it. This would mean a serious infringement of the rights of netizens - the loss of a liberated area that was conquered with great effort.

The popular search tools have moved from the laboratories of computer scientists and are now affiliated with for-profit organizations. That's the way things normally go in the post-modern age of cybercapitalism. But this is not a natural law. Social structures are made by men, so they can always be changed by men. That's also true for the way we organize the electronic access to the enormous wealth of information on the Internet. Erecting financial barriers would substantially affect the open character of the Internet. We may and can resist this. Nobody has to be ashamed of democratic initiatives - on the contrary.



ULTRASEEK - CLAIMS TO BE THE BEST

Ultraseek is a powerful new search technology from Infoseek. It is intended to be the fastest, most comprehensive, and first virtually real-time search technology available. Infoseek wants to provide the best information navigation services on the Internet. Here is what you'll get with Ultraseek, if they keep their promises:
Speed
Ultraseek was designed with speed in mind. It processes multi-word queries 100 times faster than any of the search engines it was tested against (Alta Vista, Lycos, WebCrawler, Open Text, Inktomi, and Excite). It processes phrase queries more than 6,000 times faster than its closest competitor.
By using a totally new search algorithm, Ultraseek's search speed decreases only by a factor of 10 for every thousand-fold increase in database size. That means it can search a database of 10 billion documents 10 times faster than the competitors can search a 10-million-document database today.
Most search engines slow down the more query terms you give them, but Ultraseek actually speeds up: longer queries take less time than shorter queries!
It does not sacrifice quality for speed. Case information (i.e., upper/lower case) is retained, as is proximity information. You can search for the phrase "To be or not to be" and get precise results. You can search for AIDS and not get matches on "aids". The only other engine that can do both of these queries is Alta Vista.
Size
It uses a new multi-threaded worm and the Matisse object-oriented database for storing WWW pages. This allows more accuracy in tracking pages. It can check over 25 million pages a week.
Ultraseek was designed to be able to perform 1,000 queries per second on a database of over 1 billion documents (that's 50 times larger than the largest WWW databases today). It utilizes a patent-pending distributed search algorithm that can accurately merge results from multiple collections.
With their new indexing and crawler technology Infoseek wants to realize an aggressive goal: to maintain the largest collection of WWW pages in the world.
Currency
Ultraseek is the first virtually 'real-time' index of the Internet. You can submit your WWW page to their index and they will immediately download it. Within 100 msec from the time the page is fully downloaded, it can be found in a search. The document counts for each search term are constantly changing. With Ultraseek, you can search the WWW as it is now, rather than search it as it was in the past.
A new worm keeps track of how often pages change and downloads each page at its frequency of change. You'll get the most complete and up-to-date WWW search available.
Accuracy
Infoseek is already critically acclaimed for highly accurate search technology. With Ultraseek, they raise the bar. It uses a new, highly accurate relevance ranking algorithm. Search features that provide high precision include such things as automatic name recognition, phrase searching, field searching and advanced query operators, such as require and reject.
If you don't believe it, try it for yourself. You can try the beta version of UltraSeek for free.



USING SEARCH ENGINES

WWW interfaces to search engines generally consist of a form which appears on a WWW page. Keywords can be typed into the form and a button is provided which can be clicked on with a mouse in order to activate the search. Other features such as small menus for selecting Boolean operators may also be present.

Some search engines - like Alta Vista - support full Boolean searching. You can use 'and', 'or', 'not' and 'near' to broaden or narrow a search.

Many search engines have two interfaces - one for simple keyword searching and another for more advanced queries using Boolean operators. Simple keyword search interfaces are located on the home page of each search engine, so these are the first interfaces that the user sees, and many users tend to use these by default rather than exploring the other options available. These interfaces generally provide a fast and easy to use tool for very simple WWW searching, but their use may be problematic given the size of the WWW and the generally diverse nature of material available.

Each search engine has its own distinct features and capabilities. In most cases, instructions for using the search engine are included somewhere on the site. These instructions may contain unfamiliar terms that relate to specific functions. Here are definitions of eight common functions (not all search engines provide all of them).

Natural Language Queries: For novice Internet users, this is probably the easiest way to search the Web. Users enter questions in natural English, and the server software extracts relevant keywords to create a database query. For example, the query "Find pages about INEQUALITY, labor market, or segmentation" would resolve into the individual keywords INEQUALITY, labor, market, and segmentation.
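
A crude version of this keyword extraction can be sketched in a few lines of Python; the stop-word list here is a tiny invented sample of the much larger lists real engines use.

  # Hypothetical stop-word list; real engines use far larger ones.
  STOPWORDS = {"find", "pages", "about", "or", "and", "the", "a", "of"}

  def extract_keywords(question):
      """Reduce a natural-language question to bare query keywords."""
      words = [w.strip(",.?") for w in question.lower().split()]
      return [w for w in words if w and w not in STOPWORDS]

  print(extract_keywords("Find pages about INEQUALITY, labor market, or segmentation"))
  # -> ['inequality', 'labor', 'market', 'segmentation']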

Boolean Searching: Allows terms to be put into logical groups by the use of connective terms. One of the most popular ways servers handle multiple keywords is by linking them with a Boolean AND or a Boolean OR. For example, cats AND dogs narrows a search, cats OR dogs broadens a search, and cats NOT dogs narrows a search. Each service explains its connective terms for Boolean searching in its help or FAQ file. Some systems default to a certain connective when no connective is given: in some cases cats dogs is treated as cats OR dogs.
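
If you picture the index as a mapping from words to sets of documents, the three connectives are just set operations. A toy Python illustration (the index contents are invented):

  # Toy inverted index: word -> set of document ids.
  INDEX = {"cats": {1, 2, 3}, "dogs": {2, 3, 4}}

  def docs(word):
      return INDEX.get(word, set())

  print(docs("cats") & docs("dogs"))  # cats AND dogs narrows: {2, 3}
  print(docs("cats") | docs("dogs"))  # cats OR dogs broadens: {1, 2, 3, 4}
  print(docs("cats") - docs("dogs"))  # cats NOT dogs narrows: {1}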

Keyword Controls: Rather than requiring some relation between keywords, some search engines allow each keyword to be qualified individually. Each keyword in the query can be prefixed with special characters like + or - to indicate that it is required (much like Boolean AND) or that it must not appear in the document. Unqualified keywords are often linked by a Boolean OR by default. For example, the keyword-control query "inequality labor market -income +rights" is equivalent to the Boolean "(inequality OR labor OR market) AND (NOT income) AND rights".
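
Here is a sketch of how such a query string might be split into its three groups; the function name and the exact qualifier syntax are illustrative, not any particular engine's.

  def parse_keyword_controls(query):
      """Split a query like 'inequality labor market -income +rights'
      into (optional, required, excluded) keyword lists."""
      optional, required, excluded = [], [], []
      for term in query.split():
          if term.startswith("+"):
              required.append(term[1:])   # much like Boolean AND
          elif term.startswith("-"):
              excluded.append(term[1:])   # much like Boolean AND NOT
          else:
              optional.append(term)       # linked by Boolean OR
      return optional, required, excluded

  print(parse_keyword_controls("inequality labor market -income +rights"))
  # -> (['inequality', 'labor', 'market'], ['rights'], ['income'])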

Keyword in Context (KWIC): These searches return the keyword and the N words nearest it, to give the user the context in which the keyword was found.
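
In outline, a KWIC display can be produced like this (a simplified sketch that ignores punctuation and markup):

  def kwic(text, keyword, n=3):
      """Return each occurrence of keyword with n words of context on either side."""
      words = text.split()
      hits = []
      for i, w in enumerate(words):
          if w.lower().strip(".,") == keyword.lower():
              hits.append(" ".join(words[max(0, i - n):i + n + 1]))
      return hits

  print(kwic("The sociology of rural labor markets is a branch of sociology.",
             "labor", n=2))
  # -> ['of rural labor markets is']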

Phrase Searching: Allows searching for phrases, when available. Some systems can be confusing if you think that "Rural Sociology" searches the two words together as a phrase, when in fact the engine is searching Rural OR Sociology.
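
The difference is easy to see in code: a true phrase match requires the words to be adjacent and in order, as in this simplified Python sketch.

  def phrase_match(text, phrase):
      """True only if the words of the phrase occur consecutively, in order."""
      words = [w.lower().strip(".,") for w in text.split()]
      target = phrase.lower().split()
      return any(words[i:i + len(target)] == target
                 for i in range(len(words) - len(target) + 1))

  print(phrase_match("A journal of rural sociology and economics",
                     "rural sociology"))   # True: adjacent and in order
  print(phrase_match("Rural economics and urban sociology",
                     "rural sociology"))   # False: both words, but not a phrase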

Proximity Searching: Allows searching of one term within N words of another term, narrowing the search.
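
A minimal sketch of the idea - the window size N and the symmetric treatment of word order are choices that vary from engine to engine:

  def near(text, term1, term2, n=10):
      """True if term1 occurs within n words of term2 (in either order)."""
      words = [w.lower().strip(".,") for w in text.split()]
      pos1 = [i for i, w in enumerate(words) if w == term1.lower()]
      pos2 = [i for i, w in enumerate(words) if w == term2.lower()]
      return any(abs(i - j) <= n for i in pos1 for j in pos2)

  print(near("Income inequality in the urban labor market",
             "inequality", "market", n=5))   # True: five words apart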

Relevance Feedback: Attempts to measure how closely the retrieval matches the query, usually in quantitative terms between 0 and 100 or 0 and 1,000.
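
The scale itself says nothing about how the score is computed; every engine has its own formula. As a deliberately crude illustration, the following sketch scores a document simply by the share of query terms it contains.

  def relevance(doc_words, query_terms):
      """Crude 0-100 score: share of query terms present in the document.
      Real engines also weigh frequency, position and rarity of each term."""
      doc = set(w.lower() for w in doc_words)
      matched = sum(1 for t in query_terms if t.lower() in doc)
      return round(100 * matched / len(query_terms))

  print(relevance("inequality in the labor market".split(),
                  ["inequality", "labor", "segmentation"]))  # -> 67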

Truncation Searching: Allows searching on different word endings or plurals with the use of a truncation wildcard symbol (a sort of suffix management on keywords). This helps users get the most from their queries by generalizing each keyword to its root and expanding the search to include all forms of that root word. For example, if the truncation symbol is *, then the search term econ* will return items that contain economics, economy, economic, and econometric. Car* will return items that contain cars as well as cartoon, so it is advisable to use truncation symbols judiciously. Most servers perform the truncation automatically according to their own rules. Some servers let users choose which words are truncated, typically by appending a * character to the end of the root word. See the individual help files for the specific truncation symbol used by each engine, when available.
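
The mechanics of a wildcard suffix are simple, as this sketch shows; it also makes clear why car* happily matches cartoon.

  def truncation_match(term, word):
      """Match a term like 'econ*' against a word, treating '*' as a wildcard suffix."""
      if term.endswith("*"):
          return word.lower().startswith(term[:-1].lower())
      return word.lower() == term.lower()

  for w in ("economics", "economy", "econometric", "ecology"):
      print(w, truncation_match("econ*", w))
  # economics, economy and econometric match; ecology does not

  print(truncation_match("car*", "cartoon"))  # True - hence the need for care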



© Albert Benschop, University of Amsterdam
1998