Design team members: Shyam Sheth, Debbie Yau
Supervisor: Dr. A.K.C. Wong, Dr. Yang Wang
Background
Due to the proliferation of information on the Internet in recent years, tools are required to archive and retrieve information efficiently. This problem has been addressed by search engines and drill-down subject directories. Search engines use techniques such as keyword searches (matching the most frequently occurring words), relevancy rankings (returned hits ranked by the search engine's confidence level), and concept-based searching (an abstract understanding of a subject).
As the popularity and amount of on-line information increase, current search engine algorithms may not be adequate in providing links to useful information. Glimpses of this can be seen today. For example, a quick query on the google.com search engine, “What is the weather in Toronto?”, returned links ranging from multiple websites with weather forecasts to “Pilot Project: Clothing Optional Beach at Hanlan’s Point”.
The general goal of information storage is to find ways to represent the content of a document in a manner that is easily searchable and thus retrievable. The most common method is to represent a document by keywords or terms that best summarize its material. This is known as the indexing problem: how best to store documents so that they can be retrieved most efficiently when required.
"There exists a large document collection together with a population of individuals (potential retrieval system customers), each of whom wants information that they think might be supplied by documents in the collection. How should the documents in the collection be identified so that the collection can be searched to the maximal collective benefit of the customers." [1]
There are essentially four basic families of statistical techniques used to determine the weights for keyword or index terms:
a) Simple word count techniques
b) Inter- vs. intra-document frequency
c) Poisson distribution
d) Term Discrimination Value model
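As a rough illustration (not the project's implementation), the first two families can be sketched in Python. The corpus, the tokenizer, and the tf-idf formulation used for the inter- vs. intra-document family below are all illustrative assumptions:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def word_count_weights(doc):
    """(a) Simple word count: weight = raw frequency in the document."""
    return Counter(tokenize(doc))

def tf_idf_weights(doc, corpus):
    """(b) Inter- vs. intra-document frequency: terms frequent in this
    document (intra) but rare across the corpus (inter) score highest,
    here via the classic tf-idf formulation."""
    tf = Counter(tokenize(doc))
    n_docs = len(corpus)
    df = Counter()                      # document frequency of each term
    for d in corpus:
        df.update(set(tokenize(d)))
    return {t: freq * math.log(n_docs / df[t]) for t, freq in tf.items()}

corpus = [
    "rain and snow in the toronto weather forecast",
    "sunny weather expected in the toronto area",
    "stock markets closed higher in the toronto afternoon",
]
weights = tf_idf_weights(corpus[0], corpus)
```

Note how "toronto", which appears in every document, receives zero weight under tf-idf, while "rain", which is unique to the first document, scores highest; a simple word count would not make that distinction.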
Project description
The overall expectation of this project is to apply the key insights from statistical approaches to automatic text analysis in order to design and implement an algorithm that extracts a set of keywords (no more than ten) representing the content and theme of a webpage or website.
This project is divided into two phases. In phase I, various statistical methods of keyword generation have been explored, and an understanding of each method and of the structure of websites and webpages has been acquired to enable the development and implementation of the algorithm in phase II. In phase II, a keyword extraction prototype will be developed, applied to webpages, and used to evaluate the different statistical methods.
The goal of this project is to understand, evaluate, and implement automated information indexing techniques for Internet applications.
*It is important to note that the focus of this project is on the information storage side. Results obtained from this project can be used to further enhance the information retrieval aspect.
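To make the extraction target concrete, here is a minimal sketch of such a pipeline: strip markup, weight terms by simple word count (the first statistical family), and keep at most ten keywords. The tag-stripping regex, the stopword list, and the sample page are illustrative assumptions, not the project's design:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "for"}

def extract_keywords(html, max_keywords=10):
    """Return up to `max_keywords` terms that summarize a webpage,
    weighted here by simple word counts."""
    text = re.sub(r"<[^>]+>", " ", html)            # strip HTML tags
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    return [term for term, _ in Counter(tokens).most_common(max_keywords)]

page = ("<html><body><h1>Toronto Weather</h1>"
        "<p>The weather forecast for Toronto: rain.</p></body></html>")
keywords = extract_keywords(page)
```

A real prototype would swap the word-count weighting for the other statistical families while keeping the same surrounding pipeline.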
Design methodology
In this project, the Engineering Design Methodology by Professor Barry Wills has been chosen as a guideline for our design process, following the seven stages stated below:
1. Accept
Due to the vast amount of information on the Internet, it is believed that information storage and retrieval methods need to be improved for better knowledge management.
2. Analyze
As the popularity and amount of on-line information increase, current search engine algorithms may not be adequate in providing links to useful information. There is also the problem of biased indexing, which leads to unequal access.
3. Define
The goal is to understand, evaluate, and implement automated information indexing techniques for Internet applications.
4. Ideate
This problem may be solved by (1) computational linguistics or (2) various statistical weighting techniques. Upon further research, it has been found that computational linguistic analysis is very expensive to implement. Moreover, it is not evident from the research in this area how a linguistic approach could enhance the information storage and retrieval process.
5. Select
The weighting methods of 1-Poisson, intra- vs. inter-document frequency, and simple word count have been chosen for implementation and evaluation.
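The 1-Poisson method rests on the observation that function words tend to be scattered evenly across a collection, so their per-document counts fit a single Poisson distribution, while content-bearing words cluster in a few documents and deviate from that fit. A rough sketch of scoring that deviation; the chi-square-style statistic and the sample counts are illustrative assumptions:

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_deviation(counts):
    """Score how badly a term's per-document counts fit a single Poisson
    with the same mean; a higher score suggests a content-bearing term."""
    lam = sum(counts) / len(counts)
    observed = Counter(counts)          # histogram of per-document counts
    n = len(counts)
    score = 0.0
    for k, obs in observed.items():
        expected = n * poisson_pmf(k, lam)
        score += (obs - expected) ** 2 / expected
    return score

# Per-document occurrence counts of two hypothetical terms;
# both have the same mean (2.5), but very different spreads.
function_word = [2, 3, 2, 3, 2, 3, 2, 3]    # evenly scattered
content_word  = [0, 0, 0, 0, 0, 0, 10, 10]  # clustered in two documents
```

Under this scoring, `content_word` deviates far more from its Poisson expectation than `function_word`, even though both occur the same number of times overall.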
6. Implement
First, a keyword extraction algorithm will be developed. Thereafter, a prototype, that is, a software program capable of extracting keywords from a body of text using the chosen weighting techniques, will be implemented in phase II.
7. Evaluate
Upon completion of the prototype, the implemented algorithm will be tested on a collection of documents for which the "correct" keywords are already known. This will serve as a basis for determining how well the goal of this project has been met, as well as its contribution and applications to information storage and retrieval on the Internet.
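One common way to score such a test, assuming the evaluation compares extracted keywords against the known set, is precision and recall. This sketch, including the sample keyword sets, is illustrative rather than the project's evaluation protocol:

```python
def precision_recall(extracted, correct):
    """Precision: fraction of extracted keywords that are correct.
    Recall: fraction of correct keywords that were extracted."""
    extracted, correct = set(extracted), set(correct)
    hits = len(extracted & correct)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall

p, r = precision_recall(
    ["weather", "toronto", "beach"],     # produced by the prototype
    ["weather", "toronto", "forecast"],  # known "correct" keywords
)
```

Here two of the three extracted keywords are correct and two of the three known keywords were found, so both precision and recall are 2/3.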