GHuRU - Search Engine Interaction

http://www.cs.hmc.edu/~dbethune/ghuru/search.html



At the heart of GHuRU, and what makes it able to independently expand its knowledge base, is the World Wide Web. A search engine is needed to extract information from the vast nebula of pages and sites. Ideally, GHuRU would have an integrated search engine that searches both its own knowledge base and the external base of information (the web), looking for new information and incorporating it. Any query would consist of a brief check for new information, followed by a search through the knowledge base to try to resolve the question. The resolving concept would be the answer, returned as natural English (thanks to the Natural Language Processor).
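
As a very rough illustration, that query flow might look something like the Python sketch below. Every name here (the GHuRU class, the knowledge base, the web search bridge, and the lookup/incorporate/resolve/generate methods) is a hypothetical placeholder, not part of any existing implementation:

    # Hypothetical sketch of the ideal, integrated query flow; the component
    # and method names are placeholders invented for illustration.
    class GHuRU:
        def __init__(self, knowledge_base, web_search, nlp):
            self.kb = knowledge_base   # GHuRU's own store of weighted facts
            self.web = web_search      # bridge to the external web
            self.nlp = nlp             # the Natural Language Processor

        def answer(self, question):
            # 1. Brief check of the web for new information on the question.
            for page in self.web.lookup(question):
                self.kb.incorporate(self.nlp.parse(page))
            # 2. Search the knowledge base for a concept that resolves it.
            concept = self.kb.resolve(self.nlp.parse(question))
            # 3. Return the resolving concept as natural English.
            return self.nlp.generate(concept)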

Developing another search engine is time-consuming and resource-intensive. It might make more sense to use information available through existing public, free search engines (such as AltaVista). This would only require writing an interface that understands the particular nuances of the engine's syntax. The interface would act as a direct bridge between the NLP and the search engine: it would translate a question into a search string and, when the results are delivered, fetch the contents of the pages and run those through the NLP as well. To use a different search engine, only a new interface would need to be written. Obviously, using an outside engine to gather the outside information would necessitate developing a tool to manage GHuRU's own database of knowledge. The nature of such a system is fairly open and left to the implementation.
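
As a sketch, such a per-engine interface might be structured like this in Python. The base class, its method names, and the query-URL template are assumptions made for illustration; the result-page parsing, which differs for every engine, is left abstract:

    from urllib.parse import quote_plus
    from urllib.request import urlopen

    class SearchEngineInterface:
        # Each concrete interface supplies its engine's query-URL template
        # and knows how to pull result URLs out of that engine's result page.
        query_template = "http://www.example.com/search?q={terms}"

        def to_search_string(self, question):
            # Translate a natural-language question into the engine's syntax.
            # A real interface would drop stop words, add quoting, and so on.
            return quote_plus(question)

        def parse_results(self, result_page):
            # Engine-specific: extract result URLs from the raw result page.
            raise NotImplementedError

        def fetch_pages(self, question):
            # Submit the query, then fetch each result page so its contents
            # can be handed to the NLP.
            query_url = self.query_template.format(
                terms=self.to_search_string(question))
            with urlopen(query_url) as reply:
                urls = self.parse_results(reply.read())
            for url in urls:
                with urlopen(url) as page:
                    yield page.read()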

Regardless of the search technique used by a particular instance of GHuRU, any information retrieved through publicly accessible means (the web, for instance) would need to be analyzed for reliability. The HRU would use this reliability weighting to determine the outcome of conflicting information. Also, by combining various beliefs and disbeliefs, each with an associated weight, complex logical concepts can be developed.
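
As a toy example of what such a combination could look like, the sketch below simply sums the supporting and contradicting weights for a single statement; the actual combination rule would of course be up to the HRU:

    # Toy combination rule: net belief in one statement, given weighted
    # sources that either assert (True) or deny (False) it. Purely
    # illustrative; GHuRU leaves the real rule to the HRU.
    def net_belief(evidence):
        support = sum(w for believes, w in evidence if believes)
        dissent = sum(w for believes, w in evidence if not believes)
        return support - dissent   # positive: accept, negative: reject

    # Two fairly reliable pages assert a fact, one weak page denies it;
    # the result is positive, so the statement is believed.
    print(net_belief([(True, 0.8), (True, 0.6), (False, 0.3)]))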

A system to determine the reliability of information on a web page could take into account the author (people we have believed before are likely to be believed again), the top-level domain (school and government pages are more likely to be believed than corporate pages, for instance), and the amount of content (fuller pages tend to know what they're talking about more often). The exact weightings would be left up to the individual implementor. It might be an interesting experiment to play with the weightings assigned to different attributes of a page and see which give the truest information.
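
For concreteness, one possible scoring function along these lines is sketched below in Python. The attribute weights are the experimental knobs just mentioned; the category values, thresholds, and default weights are invented purely for illustration:

    # Invented reliability score: a weighted mix of author, top-level domain,
    # and amount of content. All values here are placeholders to experiment with.
    TRUSTED_AUTHORS = {"someone.believed.before@example.edu"}
    DOMAIN_SCORES = {"edu": 0.9, "gov": 0.9, "org": 0.6, "com": 0.4}

    def reliability(author, top_level_domain, content_length,
                    w_author=0.4, w_domain=0.4, w_content=0.2):
        author_score = 1.0 if author in TRUSTED_AUTHORS else 0.5
        domain_score = DOMAIN_SCORES.get(top_level_domain, 0.3)
        # Treat anything over ~5000 characters as a "full" page.
        content_score = min(content_length / 5000.0, 1.0)
        return (w_author * author_score +
                w_domain * domain_score +
                w_content * content_score)

    # An unknown author on an .edu site with a long page scores fairly well;
    # varying w_author, w_domain, and w_content is the experiment suggested above.
    print(reliability("unknown@example.com", "edu", 12000))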


questions or comments should be sent to dbethune@hmc.edu