Design of an intelligent matching algorithm

Design team members: Edward Kim, Brian Lee, William Wong

Supervisor: Prof. Kamel

Background

As the Internet continues its rapid growth, at approximately 1 million web pages a day, so do the virtual communities needed to accommodate its ever-growing number of netizens. These communities serve as pools of information resources and often take the form of venues in which people can search through documents posted by community members and possibly interact with those members.

The sheer amount of data across these communities, however, comes at the price of information overload. This is a serious limitation for those who wish to find and enlist the expertise of a particular community on matters of mutual interest. Search engines and directory listings can partially help, but they lack the contextual sensitivity needed for accurate matching. A more effective matching tool is needed.

Project description

This project entails the design and implementation of a tool that allows people in university and technical web communities to efficiently find and contact other community members who share the same research interests. More specifically, the project involves the development of an algorithm that can intelligently match web documents based on distinguishing features mined from those documents. Additionally, the matching tool will be semantically sensitive enough to return only the most relevant documents to the user.

Design methodology

The design task can be broken down into two parts: profile generation and profile matching.

The profile generation task involves the analysis of web documents, the extraction of their key features based on this analysis, and the representation of these key features in a data structure. More specifically, the most important features of the web document (such as technical terms and the names of people and organizations) will be extracted and represented in the form of a key-word vector. Additionally, each element in the key-word vector will be assigned an importance weighting based on factors such as any emphasis tags (e.g., headings, titles, bolding) associated with the key-word in the original document.
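
As an illustration of the profile generation step, the following Python sketch builds a weighted key-word vector from an HTML document. It is a minimal sketch, not the project's actual implementation: the names TAG_WEIGHTS, STOPWORDS, ProfileBuilder and build_profile, the specific weight values, and the simple stop-word filter are all assumptions made for this example. Extraction of technical terms and names is reduced here to plain word tokenization.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights for emphasis tags. The project description names the
# factors (headings, titles, bolding) but not the values, so these are assumed.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "h3": 1.5, "b": 1.5, "strong": 1.5}

# Tiny illustrative stop-word list (assumed); a real list would be much larger.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on", "is"}


class ProfileBuilder(HTMLParser):
    """Accumulates a key-word -> weight mapping while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                # emphasis tags currently open
        self.weights = defaultdict(float)  # the key-word vector

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Remove the most recently opened occurrence of this tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        # A word inside emphasis tags gets the largest applicable weight;
        # plain body text counts with weight 1.0.
        boost = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for word in re.findall(r"[a-z][a-z\-]+", data.lower()):
            if word not in STOPWORDS:
                self.weights[word] += boost


def build_profile(html):
    """Return the weighted key-word vector of an HTML document as a dict."""
    builder = ProfileBuilder()
    builder.feed(html)
    builder.close()
    return dict(builder.weights)


page = "<h1>Neural Networks</h1><p>Research on <b>neural</b> models and networks.</p>"
profile = build_profile(page)
print(sorted(profile.items(), key=lambda item: item[1], reverse=True))
```

In this sketch a word that appears in a heading or in bold contributes more to its vector element than the same word in plain text, which is one simple way of realizing the importance weighting described above.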

The profile matching task involves the generation of an ordered list of profiles that meet a minimum interest similarity criterion. To achieve this, a vector space model will be used. This involves applying a measure of similarity to pairs of key-word vectors (obtained from the profile generator), such as the cosine of the angle between the two vectors being compared. The profile matching module will also have an adaptive component in which the key-word vectors can be modified and improved based on user feedback on the quality of the matches.
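
The matching step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the project's actual design: key-word vectors are represented as dictionaries mapping key-words to weights, and the function names, the min_similarity threshold of 0.2, and the additive feedback rule with its rate of 0.25 are all hypothetical choices made for illustration. The project description does not specify how the adaptive component updates the vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse key-word vectors (dicts)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_profiles(query, profiles, min_similarity=0.2):
    """Return (name, score) pairs meeting the threshold, best match first.

    min_similarity is an assumed value for the minimum similarity criterion.
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in profiles.items()]
    kept = [(name, score) for name, score in scored if score >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def apply_feedback(query, matched, relevant, rate=0.25):
    """Assumed update rule: move the query vector toward a match the user
    rated as relevant, or away from one rated as irrelevant."""
    sign = 1.0 if relevant else -1.0
    updated = dict(query)
    for term, weight in matched.items():
        new_weight = updated.get(term, 0.0) + sign * rate * weight
        if new_weight > 0.0:
            updated[term] = new_weight
        else:
            updated.pop(term, None)  # drop key-words whose weight reaches zero
    return updated


query = {"neural": 4.0, "networks": 3.0}
profiles = {
    "alice": {"neural": 4.0, "networks": 3.0},
    "bob": {"neural": 1.0, "robotics": 1.0},
    "carol": {"databases": 2.0},
}
ranked = rank_profiles(query, profiles)
print(ranked)
print(apply_feedback(query, profiles["bob"], relevant=True))
```

Because the cosine measure depends only on the angle between the vectors, a long document and a short one with the same relative key-word weights score as identical. Profiles that share no key-words with the query score zero and are dropped by the threshold.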