The World Wide Web provides a wealth of data that can be harnessed to help improve information retrieval and increase understanding of the relationships between different entities. In many cases, we are interested in determining how similar two entities are to each other, where the entities may be pieces of text or descriptions of some object. In this work, we examine multiple instances of this problem and show how they can be addressed by applying data mining techniques to large web-based data sets. Specifically, we examine the problems of determining the similarity of short texts (even those that share no terms in common) and of learning similarity functions for semi-structured data to address tasks such as record linkage between objects. While we present rather different techniques for each problem, we show how measuring similarity between entities in these domains has a direct application to the overarching goal of improving information access for users of web-based systems.
About the Speaker
Mehran Sahami is a Senior Research Scientist at Google. His research interests include machine learning, data mining, and information retrieval on the Web. Mehran was previously a Lecturer in the Computer Science Department at Stanford University (where he received his PhD) and, prior to Google, was involved in a number of commercial and research machine learning projects at Epiphany, Xerox PARC and Microsoft Research. He has published dozens of refereed technical papers, served on numerous conference program/organizing committees and has several patents pending. This year he is serving as Track Chair for the Industrial Practice and Experience track at WWW-07 and is Co-Chair of the Student Abstract and Poster program at AAAI-07.
Official Website: http://sfbayacm.org/events/2007-04-11.php
Added by marstein on April 9, 2007