raijack1.htm: How to search the web, by fravia+ (¯`·.¸(¯`·.¸ raijack1.htm ¸.·´¯)¸.·´¯)

~ Essays ~

(Courtesy of fravia's searchlore.org)

Gathering News Headlines and Text Classification through Distributed Efforts and Regular Expression Web Templates

by rai.jack (rai.jack(AT)gmx.net)

very slightly edited by fravia+, published @ searchlores in January 2001

I found the short essay by rai.jack that you can read below on our [~S~ Seekers' msgboard] on 26 January 2001. The text is not only interesting but points to a worthy bibliography as well. Mere coincidence? Only one day earlier I had received an e-mail from +Forseti, pointing to one of the resources that rai cites below (Lisa Guernsey's recent article in the NYT):

Dear Fravia, As you predicted 4 or so years ago the 'Art of searching the web' has become so important that the New York Times (the 'Paper of Record' here in Metropolis) has run today an article on 'Deep Web' searching. I have archived this article on qu00l.net and edited it to remove all useless frills. This is because you may wish to read it without NYTimes cookies et alia, and also because after a short time all back articles from the Times cost $2.50 to retrieve. Note this copy exists for critical fair use purposes only.
It seems that seeking has finally become a necessity even to the media-fed consciousness of the ever sleeping consumer-zombie NYT readership. Enjoy the article, it seems that someone's been reading your site...
+Forseti

Yep: searching is slowly going 'mainstream'. Read the following text, you'll enjoy rai's 'diplomatic' tone (that barely covers his sarcastic nature :-) and you'll maybe have some answers of your own, which of course may differ from the proposals and/or possible approaches that rai lists below. Your own contribution would be welcome.
Indeed rai is correctly pointing out that a "useful, and enlightened discussion" is requested on these matters. I may just (optimistically) point out that Laurent's current project (a new HCUsearching bot) could (underlined: could) deliver some interesting relevant answers before the end of next month.
And on these sibyllic words :-) I'll now leave you, o reader, to rai's reasoning...

Gathering News Headlines and Text Classification through Distributed Efforts and Regular Expression Web Templates

The Nature of Late-Breaking News Content and the Internet
by rai.jack (rai.jack(AT)gmx.net)
January 2001

The web is extremely dynamic, and for many the need to stay abreast of current issues is a pressing matter. The web does not currently offer efficient, free, and relevant ways of searching for the latest news and current issues. "Search engines can be pitifully inadequate, partly because they rely on Web-page indexes that were compiled weeks before."(1) If you want to know late-breaking information about current news events from multiple sources, then either a manual search of your favorite news sources is in order, or you must rely on `weblogs' that cater to your specific interest on the day of search, or you must depend on 3rd party commercial entities to cull news sites regularly for you, and to provide you with categorized information. All three manners of searching for current news events have serious drawbacks, and a fourth manner of searching for breaking news, both efficient and free, must be found. I propose such an effort here.

Manually searching reliable news sources is a time-consuming task, and not one suited to combing through thousands of news sources. The time constraints involved in such a search are unacceptable, to say the least. Computers and networks are sufficiently advanced that harnessing their power is inevitable, and is to be desired.

Relying on `weblogs' that cater to your specific interest is undesirable because of the perceived, if not actual, lack of professionalism that is rampant in the weblog arena. These sites are not always on top of news events, and are often full of editorial mistakes. Even one of the largest weblog sources for `geek news', Slashdot(2), often shows such mistakes, along with serious bias.

Being dependent on 3rd party commercial entities for appropriate search results can lead to skewed results, and extreme bias. In the past, it has not been uncommon for search engines to sell their page ranking results to the highest bidder(3). A cynical and skeptical nature is to be appreciated when dealing with commercial entities.

Gathering Content

When doing large, professional searching, it is customary to use bots, or software search agents, to comb the World Wide Web for valuable information. There are many tools available for writing bots(4), and much literature on the subject.

Regular Expression Web Templates

Large web sites are not easy to maintain, hence the rise of dynamic web sites and scripting tools, such as PHP, Perl, ASP, JSP/Servlets, Python, etc. Because of the use of such dynamic tools, most web sites fit a `template.' Logical tools create logical web sites, even if structures are masked by extra long and obscure URLs. A human can decipher and extricate such a structure, and hence map out the essence of a web site. This work is manifest by "sending out finely tuned software agents, or bots, that learn not only which pages to search, but also what information to grab from those pages."(1)
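To make the idea of a `template' concrete, here is a minimal sketch in Python, assuming a made-up page layout. The HTML in PAGE, the STORY pattern and the headlines function are all hypothetical and exist only for illustration; a real site would need its own pattern, worked out by a human looking at the page source.

```python
import re

# Hypothetical HTML, as a script-generated news page might emit it.
# Every story row has the same shape because one script wrote them all.
PAGE = """
<div class="story"><a href="/sci/1001.html">Probe reaches Eros</a></div>
<div class="story"><a href="/sci/1002.html">New comet sighted</a></div>
<div class="ad"><a href="/buy.html">Buy now</a></div>
"""

# One regular expression describes every story row of this (assumed) layout.
STORY = re.compile(
    r'<div class="story"><a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a></div>'
)


def headlines(html):
    """Return (url, title) pairs for every row that fits the template."""
    return [(m.group("url"), m.group("title")) for m in STORY.finditer(html)]


for url, title in headlines(PAGE):
    print(url, title)
```

The advertisement row is skipped simply because it does not fit the pattern. The weakness is equally plain: as soon as the site changes its layout, the pattern stops matching and has to be worked out again.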

This structure, once extricated by a human, is used once, in the form of a custom bot, then lost because of the nature of the ever-changing web, and because no standard way exists of communicating the structure of a website to another person, or bot. RDF(5) and other meta tag standards are not useful, because of the voluntary nature of their use. RDF is a great idea for a perfect world. We need pragmatic solutions for an imperfect world.

A standard form for communicating the structure of a web site is needed, so that this structure can be fed to a bot, and information gathered efficiently, without the extreme duplication of work that is so rampant today in the creation of custom search agents. Regular expressions are the natural choice for modelling a single page, but an appropriate form for the structure of a whole website is needed as well. This structural form must be completely modelled in one file, be operating system agnostic, and must cater specifically to the HTTP protocol. There must also be a table of metadata at the head of the file that indicates all the data that can be culled from the web site in question. For example, when modelling a news site, the table of metadata must indicate that `Science Headlines', as well as `Economic Headlines', are available. Thus, a robot that is able to digest this standard form need only be told what data is relevant, and not how to retrieve it and parse it.
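One possible shape for such a file is sketched below in Python. This is an assumption made for the sake of discussion, not a proposed standard: the JSON layout, the key names (site, metadata, fields, path, pattern), the host news.example.com and the functions available and cull are all hypothetical. The template is a single plain-text file with its metadata table at the head, and the generic bot is told only which field is wanted.

```python
import json
import re
from urllib.request import urlopen

# Hypothetical template: one text file describing one site.
# "metadata" lists what can be culled; "fields" says where and how.
TEMPLATE = r"""
{
  "site": "http://news.example.com",
  "metadata": ["Science Headlines", "Economic Headlines"],
  "fields": {
    "Science Headlines": {
      "path": "/science/",
      "pattern": "<h2><a href=\"([^\"]+)\">([^<]+)</a></h2>"
    },
    "Economic Headlines": {
      "path": "/economy/",
      "pattern": "<h2><a href=\"([^\"]+)\">([^<]+)</a></h2>"
    }
  }
}
"""


def http_get(url):
    """Fetch a page over HTTP and return its text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def available(template_text):
    """Read the metadata table: what this template can deliver."""
    return json.loads(template_text)["metadata"]


def cull(template_text, wanted, fetch=http_get):
    """Generic bot: told WHAT is wanted, it looks up HOW in the template."""
    template = json.loads(template_text)
    field = template["fields"][wanted]
    html = fetch(template["site"] + field["path"])
    # With two groups in the pattern, findall yields (url, title) tuples.
    return re.findall(field["pattern"], html)
```

A caller would first ask available(TEMPLATE) what the site offers, then call cull(TEMPLATE, "Science Headlines"). The fetch argument is there so that the bot can be tried against canned pages without touching the network. When the site changes, only the template file needs repair, and the bot itself stays untouched.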

Please note that this standard form for communicating the structure of a web site is the general case, and can be specifically used for the culling of news headlines. News headlines were chosen for discussion because the Author perceives an interest in them among the Readers.

Distributed Effort

By harnessing the many hands of the Internet, it will be possible to keep abreast of the changes of the structure to different web sites, by utilizing the good will of World Wide Volunteers. A central repository of Web Templates will be needed to house the structures of web sites, and this will necessarily need to be completely Free. Make no mistake, this technology must be available to one and all, and I don't care if we all have to live in a cardboard box to do it.

Text Classification

When News Content is made available by these generic robots, text classification technology can be utilized to categorize content, and to match said content against user preferences. It could be organized in such a manner as to extricate certain patterns in content, with the goal of finding valuable information. Envision a Library, or if the immensity of that thought is too grand, then perhaps a News Library. Proven algorithms could be used, for example Bayesian Classification, or perhaps cutting edge technology, such as Support Vector Machines (SVM). This is an extremely fruitful area of research, and I recommend it to all who are interested in understanding the nature of information, and hence, the nature of the 'net.
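As a rough illustration of Bayesian Classification applied to headlines, here is a minimal naive Bayes sketch in Python with add-one smoothing. It is not the implementation promised in this text; the class name NaiveBayes, the tokens helper and the tiny training set are all hypothetical, and a usable classifier would need far more training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> documents seen
        self.vocabulary = set()

    def train(self, category, text):
        words = tokens(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the category with the highest log posterior score."""
        words = tokens(text)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            # Add-one smoothing: unseen words get a small nonzero probability.
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


nb = NaiveBayes()
nb.train("Science", "Probe lands on asteroid Eros")
nb.train("Science", "Astronomers sight new comet")
nb.train("Economy", "Central bank cuts interest rates")
nb.train("Economy", "Stock markets fall on rate fears")
print(nb.classify("comet probe launched"))
```

Headlines culled by the generic bots would be the training and test material. Logarithms are summed rather than probabilities multiplied, so that long texts do not underflow to zero.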

Conclusion

I would like to thank you, Gentle Reader, for staying the course and reaching this place of rest. I have written this text in the hopes of sparking useful, and enlightened discussion. I know that I will not be disappointed. Your comments are welcome, and anxiously awaited.

by rai.jack (rai.jack(AT)gmx.net)

Sources

(1) Mining the 'Deep Web' With Specialized Drills, Lisa Guernsey,
http://partners.nytimes.com/2001/01/25/technology/25SEAR.html or also @ +Forseti's: http://qu00l.net/seeking-nyt.html

(2) Slashdot: News for Nerds. Stuff that matters.,
http://www.slashdot.org

(3) Pay For Placement?, Danny Sullivan,
http://searchenginewatch.com/resources/paid-listings.html

(4) Bot Writing, Bot Trapping & Bot Wars: How to search the web, fravia+,
http://www.searchlore.org/bots.htm or http://www.searchlores.org/bots.htm

(5) Resource Description Framework (RDF), W3C,
http://www.w3.org/RDF/

Bibliography

Information Retrieval on the Web (2000), Mei Kobayashi, Koichi Takeda,
http://citeseer.nj.nec.com/kobayashi00information.html
(citeseer is an excellent source for computer science papers that span text classification technology as well as the future of bots - you won't be disappointed! also, perhaps do some searches there on `bayesian classifier', `bayesian networks', `support vector machines' for some text classification algorithms - please note that implementations are forthcoming - to be integrated with the generic bot)

2000 Search Engine Watch Awards: Best Specialty Search, Danny Sullivan,
http://www.searchenginewatch.com/awards/index.html#specialty
(mentions www.moreover.com, and is informative)

Moreover: Business Intelligence and Dynamic Content,
http://www.moreover.com
(commercial implementation of the culling of news sources, currently offering free searches of their database - could be much greater were it Free, our Aim)

Autonomy: Automating the Digital Economy,
http://www.autonomy.com
(commercial implementation of basic text classification algorithms, aimed at diverse content types to automate the `understanding' of text - see their White Papers for an intro to their tech - a Free implementation will be completed soon)

(c) 1952-2032: [fravia+], all rights reserved