SmartScape: intelligent keyword generation

Design team members: Shyam Sheth, Debbie Yau

Supervisor: Dr. A.K.C. Wong, Dr. Yang Wang

Background

Due to the proliferation of information on the Internet in recent years, tools are required to archive and retrieve information efficiently. This problem has been addressed by search engines and drill-down subject directories. Search engines rely on techniques such as keyword searching (matching the most frequently occurring words), relevancy ranking (returned hits ranked by a confidence level assigned by the search engine), and concept-based searching (an abstract understanding of a subject).

As the popularity and the amount of on-line information increase, current search engine algorithms may not be adequate for providing links to useful information. Glimpses of this can be seen today. For example, a quick query on the google.com search engine, “What is the weather in Toronto?”, elicited links ranging from websites with weather forecasts to “Pilot Project: Clothing Optional Beach at Hanlan’s Point”.

The general goal of information storage is to find ways to represent the content of a document in a manner that is easily searchable and thus retrievable. The most common method is to represent the contents of a document with the keywords or terms that best summarize its material. This is known as the indexing problem: how to best store documents so that they can be retrieved most efficiently when required.

"There exists a large document collection together with a population of individuals (potential retrieval system customers), each of whom wants information that they think might be supplied by documents in the collection. How should the documents in the collection be identified so that the collection can be searched to the maximal collective benefit of the customers." [1]

There are essentially four basic families of statistical techniques for determining the weights of keywords or index terms:
a) Simple word count techniques
b) Inter- vs. Intra- document frequency
c) Poisson distribution
d) Term Discrimination Value model
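To make the first two families concrete, the sketch below computes simple word-count weights and a tf-idf style weight that balances intra-document frequency against inter-document frequency. The toy corpus, tokenization, and function names are illustrative assumptions, not part of the project's design.

```python
from collections import Counter
import math

def word_count_weights(doc_tokens):
    """Family (a): weight each term by its relative frequency in the document."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def tf_idf_weights(doc_tokens, corpus):
    """Family (b): intra-document frequency (tf) scaled by an
    inverse inter-document frequency (idf) over a corpus of token lists."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, tf in counts.items():
        df = sum(1 for doc in corpus if term in doc)  # documents containing term
        idf = math.log(n_docs / df)
        weights[term] = tf * idf
    return weights

# Illustrative corpus: each document is a list of tokens.
corpus = [
    ["weather", "forecast", "toronto", "weather"],
    ["toronto", "beach", "clothing"],
    ["forecast", "rain", "weather"],
]
w = tf_idf_weights(corpus[0], corpus)
```

Here “weather” (twice in the document, present in two of three documents) outweighs “toronto” (once in the document, also in two of three), matching the intuition that terms frequent within a document but rare across the collection are content-bearing.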

Project description

The overall expectation of this project is to apply key insights from the statistical approaches to automatic text analysis in order to design and implement an algorithm that extracts a set of keywords (no more than ten) representing the content/theme of a webpage/website.

This project is divided into two phases. In phase I, various statistical methods of keyword generation have been explored, and an understanding of each method and of the structure of websites and webpages has been acquired to enable the development and implementation of the algorithm in phase II. In phase II, a keyword extraction prototype will be developed, applied to webpages, and used to evaluate the different statistical methods.

The goal of this project is to understand, evaluate, and implement automated information indexing techniques for Internet applications.

*It is important to note that the focus of this project is on the information storage side. Results obtained from this project can be used to further enhance the information retrieval aspect.

Design methodology

In this project, the Engineering Design Methodology by Professor Barry Wills has been chosen as a guideline for our design process, following the seven stages stated below:

1. Accept
Due to the vast amount of information on the Internet, it is believed that information storage and retrieval methods need to be improved for better knowledge management.

2. Analyze
As the popularity and the amount of on-line information increase, current search engine algorithms may not be adequate for providing links to useful information. There is also the problem of biased indexing, which leads to unequal access.

3. Define
The goal is to understand, evaluate, and implement automated information indexing techniques for Internet applications.

4. Ideate
This problem may be solved by (1) computational linguistics or (2) various statistical weighting techniques. Upon further research, it was found that computational linguistic analysis is very expensive to implement. Moreover, it is not evident from the research in this area how a linguistic approach can enhance the information storage and retrieval process.

5. Select
The weighting methods of 1-Poisson, intra- vs. inter-document frequency, and simple word count have been chosen for implementation and evaluation.
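Since the report does not give the exact formulation, the following is one common reading of the 1-Poisson idea, offered as an assumption rather than the project's actual method: a non-content word is spread randomly across documents, so the fraction of documents containing it should be roughly what a single Poisson model predicts; content words cluster, appearing in fewer documents than predicted.

```python
import math
from collections import Counter

def poisson_deviation(corpus):
    """Score each term by how far its observed document frequency falls
    below the single-Poisson prediction n * (1 - exp(-lambda)), where
    lambda = total occurrences / number of documents.  Positive scores
    suggest clustering, i.e. content-bearing terms."""
    n = len(corpus)
    totals = Counter()   # total occurrences of each term across the corpus
    df = Counter()       # number of documents containing each term
    for doc in corpus:
        counts = Counter(doc)
        for term, k in counts.items():
            totals[term] += k
            df[term] += 1
    scores = {}
    for term, total in totals.items():
        lam = total / n
        expected_df = n * (1 - math.exp(-lam))
        scores[term] = expected_df - df[term]
    return scores

# Illustrative corpus: "keyword" clusters in one document,
# "the" is spread evenly, yet both occur four times in total.
corpus = [
    ["keyword", "keyword", "keyword", "keyword", "the"],
    ["the"],
    ["the"],
    ["the"],
]
scores = poisson_deviation(corpus)
```

With both terms occurring four times, the model predicts each should appear in about 2.5 of the 4 documents; “keyword” appears in only one (positive score, content-bearing) while “the” appears in all four (negative score, non-content).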

6. Implementation
First, a keyword extraction algorithm will be developed. Thereafter, a prototype (a software program capable of extracting keywords from a body of text using the chosen weighting techniques) will be implemented in phase II.

7. Evaluate
Upon completion of the prototype, the implemented algorithm will be tested on a collection of documents for which the "correct" keywords are already known. This will serve as a basis for determining how well the goal of this project has been met, as well as its contribution and applications to information storage and retrieval on the Internet.
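A standard way to score extracted keywords against a known "correct" set is precision and recall. The helper below is a hypothetical sketch of such a comparison, not part of the project's prototype.

```python
def precision_recall(extracted, gold):
    """Compare an extracted keyword list against the known correct set.

    precision = fraction of extracted keywords that are correct
    recall    = fraction of correct keywords that were extracted
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Illustrative example with hypothetical keyword sets.
p, r = precision_recall(
    ["weather", "toronto", "beach"],      # extracted by the prototype
    ["weather", "toronto", "forecast"],   # known correct keywords
)
```

Two of the three extracted keywords are correct, and two of the three correct keywords were found, so both precision and recall are 2/3 here.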