Gregg, Dawn G. & Walczak, Steven
Communications of the ACM Vol. 49, Issue 5, pp. 78-84
Extracting information from Web pages for internal applications is difficult. An effective Web information extraction system needs to interpret a wide variety of HTML pages and adapt to changes in those pages without breaking. Such a system should recognize different Web page structures and use that knowledge to adjust the extraction techniques it employs. In addition, the system should be customizable for a variety of domains and data-object types. This paper examines the characteristics of effective Web information extraction systems and presents a prototype adaptive Web information extraction system for building intelligent systems that mine information from Web pages.
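
The idea of recognizing a page's structure and then choosing an extraction technique to match can be illustrated with a minimal sketch. This is not the prototype described in the paper. The class names, the `detect_structure` and `extract` functions, and the tag-counting heuristic are all assumptions made for illustration, using only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class StructureProbe(HTMLParser):
    """Counts start tags so the caller can guess how a page lays out records."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


class TextCollector(HTMLParser):
    """Collects the text found inside every element with the given tag name."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.depth += 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == self.target and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.items[-1] += data


# Hypothetical mapping from a detected structure to the tag that holds the data.
EXTRACTORS = {"table": "td", "list": "li"}


def detect_structure(html):
    """Assumed heuristic: whichever of <tr> or <li> dominates names the layout."""
    probe = StructureProbe()
    probe.feed(html)
    rows = probe.counts.get("tr", 0)
    bullets = probe.counts.get("li", 0)
    if rows == 0 and bullets == 0:
        return "unknown"
    return "table" if rows >= bullets else "list"


def extract(html):
    """Pick the extraction technique that matches the detected structure."""
    kind = detect_structure(html)
    if kind not in EXTRACTORS:
        return kind, []
    collector = TextCollector(EXTRACTORS[kind])
    collector.feed(html)
    return kind, [text.strip() for text in collector.items if text.strip()]


table_page = "<table><tr><td>Widget</td><td>9.99</td></tr></table>"
list_page = "<ul><li>Widget 9.99</li><li>Gadget 4.50</li></ul>"
print(extract(table_page))
print(extract(list_page))
```

If a site switches from a table layout to a list layout, the same `extract` call keeps returning data because the technique is selected per page instead of being hard-coded. A real system would need far richer structure recognition than counting two tag names.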