Workshop on Human-Computer Interaction
and Information Retrieval
MIT CSAIL, Cambridge, Massachusetts, USA
October 23, 2007
Workshop Proceedings
For this workshop, researchers and practitioners were invited to present ideas,
research results, work in progress, and system demonstrations related to the
intersection of Human Computer Interaction (HCI) and Information Retrieval (IR). The
intent of the workshop is not archival publication, but rather to provide a forum to
build community and to stimulate discussion, new insight and experimentation. This
workshop proceedings is provided for convenience.
Workshop Chairs:
Michael Bernstein (MIT)
Robin Stewart (MIT)
Program Committee:
David R. Karger (MIT)
Rob Miller (MIT)
Daniel Tunkelang (Endeca)
Martin Wattenberg (IBM)
hcir@csail.mit.edu
http://projects.csail.mit.edu/hcir/
2
Table of Contents
Talk Sessions
Partitioning the Web: Shaping Online Consumer Choice ...................................................... 5
Jolie M. Martin and Michael I. Norton
Record Relationship Navigation: Implications for Information Access and
Discovery ........................................................................................................................................... 6
Christina Anderson
Faceted Browsing, Dynamic Interfaces, and Exploratory Search: Experiences and
Challenges .......................................................................................................................................... 7
Robert Capra and Gary Marchionini
Idea Navigation: Structured Browsing for Unstructured Text ........................................... 10
Robin Stewart, Gregory Scott, and Vladimir Zelevinsky
GK: A post-search information retrieval system ................................................................... 12
Joseph Barillari
A Knowledge-Based Search Engine Powered by Wikipedia ................................................ 14
David Milne, Ian H. Witten, and David M. Nichols
Visual Text Analysis by Lay Users: The Case of “Many Eyes†.......................................... 16
Fernanda B. Viégas and Martin Wattenberg
Personal Information Management, Personal Information Retrieval? ............................. 18
Michael Bernstein, Max Van Kleek, David R. Karger, and mc schraefel
Collaborative Exploratory Search .............................................................................................. 21
Jeremy Pickens and Gene Golovchinsky
Poster/Demo Sessions
Codifier: A Programmer-Centric Search User Interface ....................................................... 23
Andrew Begel
Authority Facets: Judging Trust in Unfamiliar Content ...................................................... 25
Peter Bell
Mapping the Design Space of Faceted Search Interfaces .................................................... 26
Bill Kules
3
Images as Supportive Elements for Search ............................................................................. 28
Giridhar Kumaran and Xiaobing Xue
Searching Conversational Speech .............................................................................................. 29
Mark Maybury
Natural Language Access to Information for Mobile Users ................................................ 31
Alexander Ran and Raimondas Lencevicius
AnalogySpace and ConceptNet ................................................................................................... 33
Rob Speer and Catherine Havasi
Normalized Clarity and Guided Query Interpretation ......................................................... 34
Daniel Tunkelang
Less Searching, More Finding: Improving Human Search Productivity .......................... 35
William A. Woods
Mediating between User Query and User Model with Adaptive Relevance-Based
Visualization .................................................................................................................................... 37
Jae-wook Ahn and Peter Brusilovsky
Characterization of Diagrams and Retrieval Strategies for Them .................................... 39
Robert Futrelle
Human Computation for HCIR Evaluation .............................................................................. 40
Shiry Ginosar
Jigsaw: A Visual Index on Large Document Collections ...................................................... 43
Carsten
Görg and John Stasko
Reasoning and Learning in Mixed-Initiative Tasks ............................................................... 45
Yifen Huang
Resonance: Penalty-Free Deep Look-Ahead With Dynamic Summarization of Result
Sets ..................................................................................................................................................... 46
Blade Kotelly
Visual Concept Explorer ............................................................................................................... 48
Xia Lin
Navigating Document Networks ................................................................................................ 50
Mark D. Smucker
Integrating the "Deep Web" With the "Shallow Web" ........................................................... 52
Michael Stonebraker
Multimodal Question Answering for Mobile Devices .......................................................... 54
Tom Yeh
4
Partitioning the Web: Shaping Online Consumer Choice
Jolie M. Martin & Michael I. Norton
Harvard University
Imagine you want to spend the weekend with your partner in some city at a nice hotel,
with reasonable prices, located near the water. Because you have not been to this city
before, you decide to visit one of the many online websites that aggregate information
about different hotels, allowing you to view ratings for each hotel’s various attributes,
narrowing the options until you pick your eventual winner. This research explores the
ways in which online vendors structure this search process by categorizing the attributes
available for viewing, thereby impacting both the search process and, more importantly,
consumers’ ultimate purchases. For instance, if you happened to visit a website that
displayed the ratings given by prior guests to hotel value, we suggest you might be likely
to overweight this criteria and underweight your other key criteria – proximity to the
water. On the other hand, if the website showed ratings for location, you would be apt to
give this criterion more weight, perhaps changing your decision from a cheap option far
from the water to a more expensive option closer to the water. In both cases, though your
underlying preference (a reasonably-priced hotel near the water) has not changed, the
ways in which information is partitioned changes the search process and your stated
preference: your ultimate selection of hotel.
We first survey existing websites to demonstrate the wide variability of rating categories
even within the same product category, highlighting the seemingly arbitrary way that
attributes are partitioned by web designers for use by consumers. In order to demonstrate
the impact of such partitioning on consumer choice, we present results from a series of
laboratory studies that demonstrate the impact of partitioning on both explicitly stated
attribute weightings and implicit attribute weightings in decisions between options.
Across two domains – buying a car and finding a date – Studies 1 and 2 demonstrate that
individuals report option attributes to be more or less important depending on the way
that those attributes are partitioned. In Studies 3 and 4, we show that participants’
preferences for options – selecting a hotel or choosing a date – are impacted by how
option attributes are partitioned. Note that in all four studies, the total information
available was the same in that participants could see each options’ rating on each attribute,
but it was the manner in which that information was partitioned that drove decisions.
We conclude by describing the implications of these studies for the design of more
effective information interfaces. In particular, online vendors can use knowledge about
the impact of partitioning to help consumers make better choices – by partitioning
information in ways that reflects consumers’ preferences – or make choices that are more
closely aligned with the vendors’ interest – by partitioning information in ways that drive
consumers toward some option that the vendor desires (for example, a hotel that is
neither cheap nor near the water!).
5
ï€ï€‚ï€ï€‚ï€ï€Žï€‹ï€ï€‡ï€ï€‰ï€‘
ï€ï€‚ï€ï€‰ï€Žï€Œï€ï€Šï€ï€‰ï€Žï€‚ï€ï€†ï€†ï€ï€‹ï€‹ï€Œï€‡ï€Šï€‘ï€ï€Žï€”
ï€ï€‚ï€ï€‡ï€Žï€‰ï€ï€‡ï€‹ï€Œï€ï€ˆï€‰ï€‘ï€ï€‚ï€ï€’ï€ï€“
ï€ï€ï€ƒï€‹ï€‰ï€ƒï€Œï€’ï€ï€‡ï€…ï€ï€‡ï€‰ï€Œï€—ï€ï€Œï€†ï€Œï€‹ï€‰ï€‡ï€ˆï€–ï€ï€‡ï€‰ï€™ï€ï€‹ï€Œï€’
ï€ï€‡ï€…ï€ï€Œï€Œï€‡ï€‰ï€ƒï€Œï€ï€ï€ƒï€‹ï€…ï€ï€ƒï€‰ï€„ï€ï€ƒï€™ï€ˆï€†ï€„ï€ï€‡ï€‰ï€Œï€—ï€ï€ƒï€ˆï€†ï€„ï€ï€‡ï€Žï€‰ï€ƒï€ˆï€†ï€‚ï€ï€ƒï€Œï€ˆï€†ï€„
ï€ï€‰ï€ƒï€Œï€’ï€ï€‡ï€…ï€ï€Œï€Œï€‡ï€‰ï€ƒï€Œï€ï€ï€ƒï€‹ï€…ï€ï€Œï€†ï€…ï€ï€‰ï€‹ï€Œï€…ï€ï€ƒï€„
ï€ï€ï€ƒï€‹ï€‰ï€ƒï€Œï€’ï€ï€‡ï€…ï€ï€‡ï€‰ï€ï€ˆï€‡ï€‰ï€…ï€ï€ˆï€‡ï€†ï€’ï€ï€Œï€‰ï€ˆï€‡ï€‰ï€Œï€‡ï€‹ï€‰ï€œï€…ï€Œï€ƒï€ ï€…ï€‰ï€‹ï€„ï€…ï€ï€ï€–
ï€ï€Œï€Žï€‰ï€„ï€ï€‰ï€ƒï€ˆï€„ï€ï€˜ï€‰ï€„ï€ï€ƒï€™ï€ˆï€†ï€„ï€ï€‡ï€‰ï€ˆï€ï€ï€Œï€…ï€ï€‡ï€…ï€ï€’ï€ï€‚
ï€ï€†ï€„ï€ï€‡ï€‰ï€ˆï€‰ï€…ï€ï€‚ï€ï€—ï€ï€œï€’ï€ï€…ï€ï€‚
ï€ï€ï€ƒï€‹ï€‰ï€ï€ï€œï€‡ï€†ï€…ï€ï€‚ï€ï€†ï€’ï€ï€ï€‡ï€…ï€ï€ï€ƒï€‹ï€žï€‰
ï€ï€„ï€ï€‡ï€…ï€ï€ï€ƒï€‹ï€‰ï€ƒï€Œï€’ï€ï€‡ï€…ï€ï€‡ï€‰ï€„ï€ï€†ï€ï€†ï€›ï€•ï€Œï€‰ï€Œï€ï€ï€™ï€™ï€Œï€ƒï€ï€Œï€‰ï€ï€Œï€šï€…
ï€ï€‚ï€ï€…ï€ï€ï€ƒï€‹ï€…ï€ï€’ï€ï€‹ï€œï€ï€†ï€…ï€ï€…ï€ï€‚ï€ï€ƒï€‹ï€Œï€ƒï€…ï€ï€ï€ƒï€‹ï€…
ï€ï€ï€ƒï€‹ï€…ï€ï€„ï€ï€Œï€Œï€‡ï€‰ï€ƒï€Œï€ï€ï€ƒï€‹ï€…ï€ï€œï€’ï€ï€Œï€‰ï€ï€„ï€ï€ï€ƒï€‹ï€‰
ï€ï€‡ï€…ï€ï€‡ï€‰ï€†ï€ï€‰ï€ˆï€ï€‚ï€ï€ï€„ï€ï€‡ï€‰ï€Œï€—ï€ï€Œï€Žï€‰ï€ï€‚
ï€ï€ˆï€‡ï€‰ï€Œï€—ï€ï€„ï€ï€ï€™ï€šï€„ï€ï€Œï€‰ï€ï€˜ï€‰ï€•ï€ƒï€ï€‹ï€œï€ï€†ï€…ï€ï€…ï€ï€ƒï€‰ï€Œï€—
ï€ï€ˆï€‡ï€‰ï€¦ï€§ï€¨ï€‰ï€…ï€ï€…ï€ï€ƒï€„ï€ï€œï€†ï€‰ï€†ï€‚ï€ï€‰ï€›ï€Œï€ˆï€ƒï€‰ï€ï€’ï€ï€ƒï€‰ï€ï€ˆï€‡ï€‰
ï€ï€‚ï€ï€‚ï€ï€ˆï€™ï€Œï€ƒï€ˆï€…ï€ï€Œï€„ï€ï€˜ï€‰ï€ï€‰ï€ï€ƒï€‰ï€®ï€‰ï€˜ï€ƒï€ï€™ï€‰ï€˜ï€Œï€™ï€ˆï€’ï€ï€®ï€‰ï€›ï€Œï€ˆï€ƒï€…ï€ï€ƒï€‰
ï€ï€’ï€ï€Œï€ƒï€Œï€‰ï€ƒï€Œï€–ï€ï€Œï€‹ï€‰ï€„ï€ï€‡ï€†ï€‚
ï€ï€‡ï€…ï€ï€‡ï€‰ï€ï€„ï€ï€˜ï€‰ï€ƒï€Œï€ï€ï€ƒï€‹ï€‰ï€ƒï€Œï€’ï€ï€‡ï€…ï€ï€‡ï€Žï€‰ï€ˆï€…ï€ï€Œï€’
ï€ï€™ï€Œï€‰ï€ï€˜ï€‰ï€†ï€‚ï€ï€•ï€Œï€‡ï€‰ï€„ï€ï€ƒï€™ï€ˆï€†ï€„ï€ï€‡ï€‰ï€‹ï€„ï€ï€ï€–ï€ï€ï€Œï€…
ï€ï€•ï€Œï€‡ï€‰ï€„ï€ï€’ï€ï€‚ï€ï€˜ï€‰ï€…ï€ï€„ï€ï€ƒï€‰ï€ï€˜ï€‰ï€ˆï€‰ï€…ï€ï€‚ï€ï€—
ï€ï€Œï€‰ï€ï€˜ï€‰ï€ƒï€Œï€’ï€ï€ï€ƒï€‹ï€…ï€ï€˜ï€‰ï€‹ï€„ï€ï€ï€ƒï€‹ï€‰ï€ï€ï€œï€‡ï€†ï€…ï€ï€ï€„ï€ï€„
ï€ï€ƒï€Œï€™ï€Œï€‡ï€†ï€ˆï€’ï€ï€‡ï€‰ï€…ï€ï€‚ï€ï€‰ï€‹ï€„ï€ï€ï€ƒï€‹ï€‰ï€†ï€›ï€•ï€Œï€…
ï€ï€ˆï€’ï€ï€œï€’ï€ï€ƒï€‰ï€˜ï€Œï€ˆï€†ï€œï€ƒï€Œï€…ï€ï€ï€ƒï€‹ï€…ï€ï€†ï€Œï€‡ï€†ï€„ï€ï€ƒï€‰
ï€ï€‰ï€šï€’ï€ï€ï€«ï€œï€•ï€žï€‰
6
Faceted Browsing, Dynamic Interfaces, and
Exploratory Search: Experiences and Challenges
Robert Capra, Gary Marchionini
School of Information and Library Science
University of North Carolina at Chapel Hill
100 Manning Hall
rcapra3@unc.edu, march@ils.unc.edu
Introduction
The Relation Browser (RB) is a graphical interface for exploring information spaces, developed by the
Interaction Design Lab at the University of North Carolina at Chapel Hill for use in research on how to
support users’ needs to understand and explore information. In this abstract, we describe the Relation
Browser, results of recent studies, and the design goals for the next-generation RB in current development.
At the workshop, we will demonstrate the current RB and a prototype of our next-generation RB.
Current Relation Browser (RB++)
The Relation Browser is designed as a tool for understanding relationships between items in a collection
and for exploring an information space (i.e., a set of documents). It has been through a number of major
design revisions [2,3]. The current version is called the RB++ (Figure 1). Facets are central to the RB and
are displayed at the top of the interface. Results of queries are shown at the bottom of the screen in tabular
format. Blue bars and numbers to the left of each facet category indicate how many documents match that
category. Previews of queries can be issued by simply mousing over facet categories. For example,
mousing over the topic “inflation†will dynamically
update the blue bars and number to reflect only
documents that are in the inflation topic. By clicking
on a facet category and then pressing the “Searchâ€
button, results are retrieved and displayed in the
lower part of the screen. Results are tightly coupled
with the facet categories and with search boxes
displayed above each result field. For example,
typing “occupations†into the text box above the title
field will not only narrow the results shown at the
bottom of the screen, but will also update the blue
bars and numbers shown at the top of the screen.
The current RB allows searching the metadata fields
in the result sets, but does not support full-text
keyword searches. Because of this, the current RB
encourages a “facets first†strategy of exploration.
The RB is designed as a generic interface that can accept and display data for many different types of
collections. RB instances have been developed for a variety of data sets including U.S. federal statistics
(Bureau of Labor Statistics, Energy Information Administration, NSF Science and Engineering Indicators),
classical music, the Open Video collection, a university movie database, the CIA World Factbook, and a
database of roller coasters. A new version of the RB, called RB07, is currently in development.
Structure and Interaction Study
In the summer of 2006, we conducted two studies to compare three different interface styles (handcrafted
web site, simple facet interface, and the Relation Browser) for three different task types (simple lookup,
complex lookup, and exploratory search) for the U.S. Bureau of Labor Statistics (BLS) web site data. This
data set was fairly large (over 67,000 documents) and semi-structured, providing a good test set for
examining facet use for data that does not have a fully defined set of metadata on which to organize.
Figure 1. Relation Browser displaying BLS data
7
The BLS web site uses a polyhierarchical structure with two levels of topics displayed on the home page.
The design of the BLS site was handcrafted based on a series of needs assessments and user studies. For
the simple facet (SF) and Relation Browser (RB) interfaces, we created a facet set using a variety of semi-
automated techniques [1]. The results of the studies surprised us: we found no significant differences
among the three interfaces for measures of task completion time, accuracy, confidence, or mental effort.
The semi-automated facet interfaces (SF & RB) performed just as well as the handcrafted BLS site and no
significant two-way interactions between task type and interface were found. These results indicate that
facet sets generated using semi-automated methods can provide useful interfaces to large, semi-structured
data sets.
Perhaps even more interesting than the quantitative results are the qualitative data and observations we
made during the study. One of the common observations was that our participants (recruited from the UNC
community, aged 18-35) often attempted to use a “keyword search first†strategy, even in the interfaces that
did not directly support this. The BLS web site provided a keyword search feature one click away from the
home page, but the SF and RB did not (they both only allowed search on the metadata). Related to this, a
number of users expressed feelings of “not being in control†for the RB interface. The current RB strongly
emphasizes facets and encourages users to adopt a “facets first†strategy that may be at odds with users’
preference for using keyword search first. Despite this conflict, many users appreciated and noted the
benefits that facets provide. Thus, we concluded that interfaces should support agile, user-controllable
searching and browsing. This has been a design goal for the next-generation of the Relation Browser,
described in more detail in the next section.
Next-Generation Relation Browser (RB07)
One of the primary goals of the RB is to provide a tool for exploring data spaces – for gaining a better
understanding of the documents and how they are related to each other. Adding full-text search support
while maintaining agile exploration is the challenge for our next-generation RB, called RB07.
RB07 is currently in development and we will show a demonstration of the prototype at the workshop. The
new design includes flexible facet views so that the user can control how the facets and document counts
are presented. The initial two facet views are the traditional view as in the current RB++, and a
representation of the facets as a dynamic tag cloud. The new design also includes a choice of ways to view
the results of a search. A “grid view†that is similar to the current RB++ results table provides a way to see
a concise summary of essential metadata about matching records. A “list view†presents results in a format
that is typical of search engines with a document title, matching text snippets, and a URL (if available).
The new views for facets and results continue to be tightly coupled using an extensible architecture that can
support additional “plug-ins†for displaying facets and results.
Whereas the current RB++ is a pop-up applet that displays in a new window, the new RB07 is designed to
be an embedded component of a web page, providing tighter integration with an existing web site if
desired. For the implementation of the new RB07, we considered Java and JavaScript as two well-
supported client-side languages. JavaScript has easy-to-use access to its surrounding web page, which
would have benefits for RB-website integration. However, in the RB, the dynamic updating of the display
based on each mouse movement over a facet category triggers a series of computations that are a function
of the number of facets, categories, and documents. After extensive testing with JavaScript, we found that
it was not fast enough to support the dynamic updating aspects of the RB on current hardware for document
collections larger than 5000 to 10000 documents. Thus, we are developing the new RB07 as a Java applet.
Support for full-text keyword searching is being implemented using the Apache Lucene search engine as
packaged in the Apache SOLR search server. Although SOLR provides some support for facets, we are not
currently leveraging that support, but instead implement facets through the RB itself (this is in large part
due to the need to do fast dynamic updates for the mouseovers). Documents are linked between SOLR and
the RB07 through a unique document identifier. One of the interface design issues that arose during our
development is how to provide clear distinctions between using keywords to search within a result set
versus starting a new keyword search. In our current design, new searches are started using a keyword
textbox at the top of the display and refinement searches are issued through a textbox closer to the result
8
set. We hope that the new RB07 will help address users expectations of how to explore document spaces,
while still providing powerful interface components for seeing relationships in the collection and refining
result sets.
Future Questions
While facets are widely regarded as being useful to information seeking, especially in large data sets with
complete metadata that can be used as facets (such as shopping domains), a number of research questions
still remain: How are facets used during the information seeking process? When, how, and why are facets
helpful (i.e. facets first versus facets to refine)? How is facet use affected by task type? What role do
facets play in exploring and gaining an understanding of the information space? Do facets help users refind
documents they have seen before, acting as waypoints?
Understanding these issues will help us create better tools for information discovery and exploratory
searching and we expect that user studies and other empirical investigations will lead to answers to these
questions and guide future system designs.
References
[1] Capra, R., Marchionini, G., Oh, J. S., Stutzman, F., and Zhang, Y. (2007). Effects of structure and
interaction style on distinct search tasks. In Proceedings of the 2007 Conference on Digital Libraries
(Vancouver, BC, Canada, June 18 - 23, 2007). JCDL '07.
[2] Marchionini, G. & Brunk, B. (2003). Toward a General Relation Browser: A GUI for Information
Architects. Journal of Digital Information, 4(1), http://jodi.ecs.soton.ac.uk/Articles/v04/i01/Marchionini/
[3] Zhang, J., and Marchionini, G. (2004). Coupling Browse and Search in Highly Interactive User
Interfaces: A Study of the Relation Browser++. Proceedings of the 4th ACM/IEEE Joint Conference on
Digital Libraries (Tucson, AZ: June 7-11, 2004), 384.
9
1
Idea Navigation:
Structured Browsing for Unstructured Text
Robin Stewart
MIT
stewart@csail.mit.edu
Gregory Scott
Tufts University
Greg.Scott@tufts.edu
Vladimir Zelevinsky
Endeca
vzelevinsky@Endeca.com
INTRODUCTION
It is well established that the omnipresent search box is
insufficient for supporting many common information-
seeking tasks and strategies [1]. In order to provide a better
interface, many commercial websites now use faceted
browsing, which provides the ability to narrow a search
result set by choosing to view only a particular slice of
metadata [2]. For example, one can search for “televisionsâ€
and then narrow the results by clicking on facets such as
“flat screen†or “$1500-$3000â€. This type of interface is
particularly useful for exploratory search tasks where users
may not know how to define a priori the best query to solve
their task – whether because they don’t know in advance
what information is available in a particular collection or
they cannot anticipate which keywords would best describe
their desired results.
A number of interfaces have recently been introduced that
aim to further expand support for exploratory search over
various domains [3]. However, the facets exposed by these
interfaces have been limited to explicitly assigned metadata
and keywords extracted from text. This limitation is
acceptable if the search happens to be a conjunction of the
available keywords or metadata, as in a search for a flat-
screen television that costs $1500-$3000. But what if we
are looking for something more abstract or subjective? For
example, we may want to find “historical events that are
interesting in the present context†or “quotations that one
might find controversial.†The available metadata is
unlikely to be useful for these tasks.
Further, even when relevant facets do exist, the query may
depend on a relationship
between
facets. Consider a search
for campaign proposals made by Hillary Clinton. It would
not be sufficient to look for documents that contain the
terms “Hillary Clinton†and “proposalsâ€, because the result
set will contain many documents in which Clinton is not the
person doing the proposing. This type of query is
particularly common in scientific, legal, and patent searches
(e.g.: find molecules that target a particular cell; find
inventions that burn solids for locomotion). Current
interfaces make such search tasks awkward at best, since
users need to scan through the list of matching articles for
passages that might indicate relevance.
To help users better answer these types of queries, we have
developed a richer summarization of natural language text
documents that takes into account the semantic information
provided by English sentence structure. Our system
initially scans every sentence in the document collection to
extract sets of terms (subject–verb–object) that indicate the
Figure 1. The full system interface after clicking on “Clintonâ€. Further refinement options are on the left.
Sentences from the document collection that have a Clinton as their subject appear on the right.
10
2
presence of what we term an
idea
, such as “Clinton–
proposed–reforms.†It then groups similar terms together
under broader categories so that they can be more
effectively summarized. The system then dynamically
aggregates all of the ideas in response to user input,
allowing navigation through the corpus using the same
interaction style as faceted browsing (figure 1). This gives
users (a) a view into the most common ideas in the result
set (since they are presented with the list of concepts
extracted from this set) and (b) an easy way to narrow in on
concepts that look interesting. We call this process
idea
navigation
.
IDEA NAVIGATION
We demonstrate idea navigation by answering one of the
questions above: What did Hillary Clinton propose in
previous campaigns? Our prototype system contains all
~9000 articles published in October 2000 by a popular news
source. At that time, Clinton was running for the Senate.
We first select “Clinton†in the Subject column, revealing a
variety of narrow terms such as “Mr. Clinton†and
“President Clinton†(figure 1). We choose “Mrs. Clintonâ€
to refine the results to sentences containing Mrs. Clinton as
their subject. Selecting the broad action “express†in the
Verb column refines the result set to things that Mrs.
Clinton said. This results in 117 matching ideas, so we
select the narrow verb “proposeâ€. We have now narrowed
our result set down to five sentences, all of which provide
answers to our query. One is: “Mrs. Clinton, for example,
has proposed federally financed scholarships for college
students who commit to teaching.â€
We chose to represent ideas as sets of three terms (subject–
verb–object) because this representation has been
successfully used in question answering systems to help
answer focused questions such as “Who did Hillary Clinton
marry?†[4]. This representation is also the basic
underpinning of the semantic web’s Resource Description
Framework. Numerous interfaces for searching such semi-
structured data have been built, including ESTER [5],
which proactively displays refinement options that it
considers relevant. However, to our knowledge, our system
is the first to present the ternary representation as a faceted
browsing interface.
EVALUATION
To better understand the extent to which Idea Navigator is
understandable and helpful for information seeking, we
carried out a formative evaluation with 11 participants.
Presumably due to the dominance of keyword search
interfaces today, most subjects had a clear initial bias
towards formulating queries as keywords in the provided
search box, such as “roger clemens throw bat†or “offensive
statement.†Even so, subjects consistently and successfully
used the idea navigation refinements to improve upon their
search box results. As the session progressed, they tended
to use these refinements more and more. Across all
subjects and all tasks, a total of 100 idea navigation
refinements were performed, versus only 61 searches. This
is strong evidence that users understood the new interface,
expected it to be useful, and continued to use it.
CONCLUSIONS AND FUTURE WORK
In this paper, we demonstrate how document search
interfaces could be enhanced by extending faceted
browsing to the subject–verb–object representation of ideas.
Our user study demonstrated that such an interface is
understandable to first-time users and useful for solving
realistic search tasks that are poorly supported by existing
systems.
Future work includes: increasing the number of sentence
types that our system understands; trying different ways of
grouping the idea components; improving the interface
design; and exploring alternate idea representation schemes
that better handle adjectives, prepositional phrases, or other
natural language information. We plan to integrate idea
navigation into a full-featured search interface with a large
health science document set, allowing us to perform a more
extensive, comparative user study that examines user
performance when various search components are
available.
Beyond this, we would like to further investigate the use of
idea navigation as an interface for exploring facts, either
automatically extracted from documents or entered
manually as semantic web data. Just as faceted browsing
supports exploratory search in document collections, we
posit that the techniques of idea navigation may well
support the related task of exploratory question answering.
REFERENCES
1.
White, R. W., Kules, B., Drucker, S. M., schraefel, m.
c. (2006) Supporting Exploratory Search, Introduction,
Special Issue. In
Communications of the ACM
49(4),
pp. 36-39.
2.
Hearst, M., English, J., Sinha, R., Swearingen, K., and
Yee, K.-P. (2002) Finding the Flow in Web Site
Search. In
Communications of the ACM
45(9), pp. 42-
49.
3.
Wilson, M., schraefel, m.c., White, R. (2007)
Evaluating advanced interfaces using established
information-seeking models. Technical Report, School
of Electronics and Computer Science, University of
Southampton. http://eprints.ecs.soton.ac.uk/13737/
4.
Katz, B., Lin, J. (2003) Selectively Using Relations to
Improve Precision in Question Answering. In
Proc.
EACL 2003 Workshop on Natural Language
Processing for Question Answering
.
5.
Bast, H., Chitea, A., Suchanek, F., Weber, I. (2007)
ESTER: Efficient Search on Text, Entities, and
Relations. In
Proc. SIGIR 2007
, ACM Press, pp. 671-
678.
11
GK: A post-search information retrieval system
Joseph Barillari
1
This abstract introduces GK, a
web-based software system for research
support. GK is designed to help users
store, organize, and navigate large
quantities of text, currently targeting
but by no means limited to biomedical
research.
2
Motivation.
Information retrieval
researchers have addressed with great
success the search problem:
very
roughly speaking, the process of re-
trieving a small collection of relevant
documents from a collection many or-
ders of magnitude larger.
GK ad-
dresses a different part of information
retrieval: the
post-search
problem. Af-
ter he or she collects several hundred to
a few tens of thousands of documents
that a search engine has indicated were
relevant, GK helps the user make sense
of them.
Use case.
GK’s canonical use
case involves a biomedical researcher
(we’ll call her U.) investigating a new
field. PubMed, Google Scholar, and
ISI Web of Science
3
will happily lo-
cate a few hundred relevant papers,
but what does U. do with them? She’d
like to find the most interesting parts
of the papers without reading every-
thing (since there are bound to be re-
dundancies and irrelevant pieces). U.
begins by dragging the papers into
her GK volume. (GK has a network
filesystem interface, so uploading data
is simple.) GK indexes the papers and
draws a map of the concepts
4
men-
tioned therein. The map lets U. navi-
gate the documents horizontally: when
she encounters the gene
fgfr2
in paper
A., the map shows her that paper B.
associates it with
VEGF
and paper C.
associates it with
wnt
. The map also
shows her the sentences from which it
inferred those associations and lets her
jump from A. directly to those parts of
documents B. and C.
As she reads, U. can highlight and
annotate the documents in her collec-
tion. She can highlight HTML doc-
uments directly within GK, or use
Adobe Acrobat to add annotations to
PDF files, which GK can read and in-
dex. She can then ask GK to draw an
editable outline that collects her high-
lights and annotations from all of the
documents in one place.
Design goals.
GK is designed
to solve the three biggest post-search
problems: storage, organization, and
navigation.
Storage.
Storage is mundane but
too-often ignored. An ideal system
would let a researcher access his or
her collection of documents from any
device at any location.
GK pro-
1
Harvard School of Engineering and Applied Sciences and the Harvard-MIT Division of
Health Science and Technology. (barillar@fas.harvard.edu)
2
For simplicity’s sake, the examples below are all biomedical.
3
Scientific search engines. PubMed is run by the National Institutes of Health and indexes
the biomedical literature. Google Scholar indexes scientific papers. ISI Web of Science is a
citation index: it lets the user search for papers that cite a particular paper.
4
“Concepts†is defined loosely as “interesting n-grams.†Currently, GK identifies inter-
esting concepts using lists of known-interesting terms (for instance, gene names). In other
fields, it might use proper nouns (by examining capitalization patterns) or unlikely phrases
(c.f. Amazon.com).
1
12
vides both a through-the-web and
network-filesystem-based access to a
user’s collection. The filesystem uses
the HTTP-based WebDAV protocol,
which is supported on all modern plat-
forms. It also supports transactions,
versioning, and arbitrary metadata.
The web interface provides access to
a search system that indexes both the
full text and the metadata of the user’s
documents. (For instance, if a doc-
ument is indexed in MEDLINE and
GK can find its DOI,
5
GK will auto-
matically download its metadata from
PubMed and index it.)
Organization.
In the physical
world, users highlight and annotate pa-
pers with pens and file them into fold-
ers. GK supports highlighting, annota-
tion, and filing. It also incorporates or-
ganizational techniques impossible in
the physical world: for instance, the
ability to place items in multiple fold-
ers (tagging), to search for documents
by the notes one has taken on them,
and to pull highlights and annotations
from a group of documents into an ed-
itable outline.
6
Navigation.
Navigation is by far
the most difficult and most important
post-search problem—given a set of
documents, all of which are relevant,
how does one help the user locate the
sections he or she would find most in-
teresting? GK’s key navigational fea-
ture is the automated drawing of con-
cept maps which show how how indi-
vidual fragments of documents relate
to one another. This granularity lets
users jump directly from a paragraph
of interest in one document directly to
a paragraph of interest in another. The
functions which detect concepts of in-
terest and infer interrelations between
them are under heavy development.
Direction.
GK is under active de-
velopment, with a particular emphasis
on improving its navigational features.
While many IR systems have incorpo-
rated graph-based navigational tools,
few such tools have gained mass ap-
peal. A key aim for GK at present is
to determine if graphs and maps are
too cumbersome to be helpful, or if a
properly-designed mapping tool can be
genuinely useful.
5
MEDLINE: the NIH’s index of the medical literature. PubMed: the web interface for
MEDLINE. DOI: Document Object Identifier, a numbering system used by many publishers.
6
Support for sharing annotations between users is under development.
2
13
A Knowledge-Based Search Engine
Powered by Wikipedia
David Milne
Ian H. Witten
David M. Nichols
Department of Computer Science, University of Waikato
Private Bag 3105, Hamilton, New Zealand
+64 7 838 4021
{dnk2, ihw, dmn}@cs.waikato.ac.nz
Introduction
This paper describes Koru: a new search interface that offersÂ
effective domainÂindependent knowledgeÂbased informationÂ
retrieval [1]. Koru exhibits an understanding of the topicsÂ
involved in both queries and documents, allowing them to beÂ
matched more accurately. It helps users express queries moreÂ
precisely and evolve them interactively. This understanding isÂ
mined from the vast investment of manual effort and judgmentÂ
that is Wikipedia. This open, constantly evolving encyclopediaÂ
yields manuallyÂdefined yet inexpensive structures that can beÂ
specifically tailored to expose the topics, terminology andÂ
semantics of individual document collections. This paperÂ
describes a brief overview of Koru and the knowledge base itÂ
extracts. A more detailed description and evaluation of the systemÂ
can be found elsewhere [2].
Koru
!"#$% &'%()*%+,"#&%-"#.%/"#%()*%0*-1"#02%$0/$#3&04%/*#0%/#"0.5%6%
delicate spiral of expanding fractal shapes. For indigenous NewÂ
Zealanders it symbolizes growth; expansion; evolution. Likewise,Â
the Koru topic browsing system aims to provide an environmentÂ
in which users can progressively work towards what they seek.Â
Koru’s interface is illustrated in Figure 1. The uppermost area is aÂ
classic search box in which the user has entered the queryÂ
american airlines security
. Below are three panels: query topics,Â
query results, and the document tray.Â
The latter two panels are fairly standard. The query results list
documents in much the same fashion as other search engines,Â
while the document tray allows the reader to collect multipleÂ
documents they wish to peruse. The first panel—query topics
—
isÂ
intended to display Koru’s interpretation of the query and provide
a base from which to evolve it.Â
The query topics listed here—
American Airlines
,Â
Security
,Â
Security (finance)
,Â
Airline
 andÂ
Americas
—are identifiedÂ
automatically by checking words and consecutive sequences ofÂ
words in the query against articles in Wikipedia. Synonyms minedÂ
from the same resource are listed below that term. For example,Â
amongst the topicÂ
Airline
’s synonyms areÂ
air carrier
,Â
airline
company,
andÂ
scheduled air transport
. These are (a) usedÂ
internally to improve queries, and (b) presented to the user to helpÂ
them understand the sense of the topic. The user can also learnÂ
more about a topic by clicking the adjacent Wikipedia link.
Query terms are often ambiguous and relate to multiple entries inÂ
Wikipedia. ByÂ
security
, for example, the user could also meanÂ
property pledged as collateral for a loan. Each sense is included,Â
and ranked according to the likelihood that it is a relevant,Â
significant topic for this particular document collection. Only theÂ
Figure 1:
 Browsing Koru for topics and documents related toÂ
american airlines security
14
topÂranked topics that cover all the query terms are used forÂ
retrieval (in the example,Â
American Airlines
 and the first meaningÂ
ofÂ
security
). This can be overridden manually using theÂ
checkboxes to the left of each topic.Â
Topics which are recognized in the query can be investigated inÂ
isolation by using them to browse Wikipedia for other topics ofÂ
interest. In Figure 1 the user has chosen to expand topics relatedÂ
toÂ
airline
 (by clicking the triangle to the right), and canÂ
investigate further topics of interest such asÂ
Singapore Airlines
andÂ
British Airways
. Any of these could be incorporated into theÂ
query by clicking the appropriate checkbox. As with alternateÂ
senses, these topics are ranked according to their expectedÂ
relevance.
What the figure does not convey is that to avoid clutter not all theÂ
panels in Figure 1 are visible at any given time. Initially only theÂ
first two are shown. The user builds an effective query by addingÂ
and removing phrases and topics until the query and resulting listÂ
of documents satisfies the user’s information need. Once aÂ
suitable query is formed, the user must determine the mostÂ
relevant ones and judge whether they warrant further reading. AtÂ
this point the panels slide across so that only the query results andÂ
document tray are visible.
Creating a Relevant Knowledge Base
To work well, Koru relies on a large and comprehensiveÂ
knowledge base. From Wikipedia we derive a thesaurus that isÂ
specific to each particular document collection. Wikipedia isÂ
particularly attractive for this work because it represents a vastÂ
domainÂindependent pool of manually defined terms, conceptsÂ
and relations. By intersecting this with individual documentÂ
collections, we are able to provide thesauri that are individuallyÂ
tailored to those who seek knowledge from the documents.Â
The basic idea is to use Wikipedia’s articles as the building blocksÂ
of a knowledge base and its skeleton structure of hyperlinks toÂ
determine which blocks we need and how these should fitÂ
together. Each article describes a single concept; its title is aÂ
succinct, wellÂformed phrase that resembles a term in aÂ
conventional thesaurus—and we treat it as such. Concepts areÂ
often referred to by multiple terms— e.g. in Figure 1Â
airline
isÂ
grouped withÂ
air carrier,
and
passenger aircraft
—and WikipediaÂ
handles these using “redirectsâ€: pseudoÂarticles that exist only toÂ
connect an alternative title of an article with the preferred one.Â
Related topics—
british airways, qantas
andÂ
air safety
—are minedÂ
from the hyperlinks within Wikipedia’s article on airlines, and theÂ
categories in which it is placed.Â
The danger in using Wikipedia’s structure is that because it is soÂ
huge (~2 million topics, plus a further 2 million synonyms) theÂ
Koru user will become swamped with irrelevant topics and links.Â
It is essential to identify the concepts relevant a particularÂ
document collection, and place these in a structure that allowsÂ
navigation between related concepts. This process is described inÂ
detail in [2].
Evaluation
To gain insight into the performance of Koru for documentÂ
retrieval, we conducted an experiment in which participantsÂ
performed tasks for which the relevant documents had beenÂ
identified manually. These tasks were specifically selected toÂ
encourage a high degree of interaction. To provide a baseline weÂ
created another version that provides as much of the sameÂ
functionality as possible without using a thesaurus, and whoseÂ
interface is otherwise identical. Comparison of these two systemsÂ
provided concrete evidence of the effectiveness of Koru andÂ
Wikipedia for assisting information retrieval.
Due to Wikipedia’s use of contemporary language andÂ
exceptional size, Koru was able to recognize and expand upon
almost all queries that were issued. This assistance was effective;Â
as shown in Table 1, it resulted in significant improvements in FÂ
measure.Â
Koru’s design was also validated, in that it allowed users to applyÂ
the knowledge found in Wikipedia to their retrieval process easily,Â
effectively and efficiently. The following quote, given by oneÂ
participant at the conclusion of their session, summarizes Koru’sÂ
performance best:
It feels like a more powerful searching method, and allows
you to search for topics that you may not have thought of …
… it could use some improvements but the ability to
graphically turn topics on/off is useful, and the way the
system compresses synonymous terms together saves the user
from having to search for the variations themselves. The
ability to see a list of related terms also makes it easier to
refine a search, where as with keyword searching you have to
think up related terms yourself.
Unfortunately we found that the interactive browsing facilitiesÂ
offered by Koru are significantly flawed. More work is required toÂ
identify related topics that are of interest given the context of theÂ
user’s task at hand, and to allow users to explore them efficiently.Â
Koru currently provides topic recognition and automaticÂ
expansion that helps users express their information needs moreÂ
easily and consistently, but only as a oneÂstep process ofÂ
improvement—which can only take queries so far. Our goal inÂ
future is to improve Koru’s interactive query expansion andÂ
exploratory searching facilities until it provides this ability toÂ
unfurl queries, and thus lives up to its name.Â
References
[1] http://www.nzdl.org/koru
[2] Milne, D., Witten, I.H. and Nichols, D.M. (2007). AÂ
KnowledgeÂBased Search Engine Powered by Wikipedia. Â
In
Proc. of CIKM'07
, Lisbon, Portugal.Â
Baseline
Koru
Recall
43.4%
51.5%
Precision
10.2%
11.6%
F-measure
13.2%
17.3%
Table 1:
 performance of tasks
15
1
Visual Text Analysis by Lay Users:
The Case of “Many Eyesâ€
Fernanda B. Viégas
IBM Research
viegasf@us.ibm.com
Martin Wattenberg
IBM Research
mwatten@us.ibm.com
Abstract
Many Eyes (
http://www.many-eyes.com
) is a public
web site that provides "visualization as a service,"
allowing users to upload data sets, visualize them, and
comment on each other's visualizations. Among the
visualization techniques offered by the site are two
aimed at unstructured text: a “tag cloud†view and a
“word tree,†a type of visual concordance view. Both
techniques have seen heavy usage. Our talk will break
into two parts. First, we will introduce and
demonstrate the two text visualization techniques.
Second, we will describe the patterns of usage we
have observed, particularly around political speeches
and testimony and artistic expression.
Visualizations
The first text visualization introduced on Many Eyes
was a tag cloud, that is, a simple display in which
word frequencies in a text are indicated by font size.
Tag clouds have become a familiar presence on many
web sites. The Many Eyes tag cloud, however,
includes distinctive features such as instant letter-by-
letter search and a two-word-phrase view. Figure 1
(next page) is an example of a tag cloud on Many
Eyes, showing the recent controversial speech by
Iran’s president at Columbia University.
Our second text visualization, dubbed a “word tree,â€
is a new kind of visual concordance. When a user
types in a word or phrase, the visualization displays a
tree structure showing all the different contexts in
which the word has been used. (From a computer
scientist’s perspective, it shows a subtree of the suffix
tree of the sequence of words in the text.) Figure 2
shows an example, applied to Martin Luther King’s
famous “I have a dream†speech. Users can navigate
the word tree by typing or by clicking on branches of
the tree to zoom, and can see a view of either leading
or trailing context for a word.
Usage
What sorts of analyses have users been creating with
these visualizations? Probably the most common
application is to political texts. The words of George
Bush, Gordon Brown, and Nicholas Sarkozy have all
come under scrutiny, as has testimony from Alberto
Gonzales and Bill Clinton. In many cases, the
visualizations are used by amateur pundits to make
political points on blogs. We have also seen
professionals who use these tools to bolster their
arguments, ranging from a well-known journalism
professor to employees of a respected foundation that
aims for political transparency.
A second common use case involves personal artistic
expression. Several users have created miniature art
projects based on tag clouds. One user made a sort of
visual poem based on the contents of their freezer.
Another has created a series of “litmashes†in which
he concatenates two novels and visualizes the results.
A third began with a political objective—analyzing
grants to artists—and ended up with a commission to
create an art project based on his tag cloud.
We’ll conclude our talk with a discussion of the
implications of these examples. The usage we have
seen on our site, as well as the activity on hundreds of
blogs that refer to our site, indicates an intense
amateur interest in text analysis. We believe this
interest points to new directions for visualization, and
also suggests there may be unexplored ways in which
other types of textual data mining and information
retrieval may be used by citizen activists and artists.
16
2
Figure 1.
Figure 2.
17
Personal Information Management,
Personal Information Retrieval?
Michael Bernstein, Max Van Kleek,
David R. Karger
MIT CSAIL
32 Vassar Street
Cambridge, MA 02139
msbernst@mit.edu, emax@csail.mit.edu
karger@mit.edu
mc schraefel
Electronics and Computer Science
University of Southampton
Southampton, UK
mc+hcir@ecs.soton.co.uk
ABSTRACT
Traditional information retrieval has focused on the task of
finding information or documents in a largely unknown space
such as the Web or a library collection. In this paper we
propose that the space of Personal Information Management
(PIM) holds a great number of problems and untapped po-
tential for research at the intersection of HCI and IR. In
this position paper we focus on the problem of
information
scraps
, or unstructured notes and thoughts, as a particularly
interesting space for future research in HCI and IR.
INTRODUCTION
Information retrieval has traditionally assumed a user goal
of finding information or documents within a search space
whose contents may not be known
a priori
to the user, such
as the Web. The user’s challenge is to specify an accurate
information query (e.g., “What is the capital of Uruguay?â€)
and the system’s challenge is to return the most relevant an-
swers or documents.
Contrast this situation with Personal Information Manage-
ment (PIM). Here, the user’s challenge is instead to organize,
find and manipulate
his or her own
information, of which the
user has intimate knowledge. Though users are still perform-
ing information retrieval or search tasks, the tasks’ nature
may change dramatically. In PIM, users may bring much
more highly contextual queries (“When is that meeting with
Kerry that I set up while I was at lunch on Friday?â€), and
have personally authored much of the information they are
attempting to retrieve. Further, the user’s task does not finish
with the retrieval of the document or datum; it often contin-
ues through cycles of editing and reorganization.
PIM embeds many IR tasks that users deal with on a daily
basis, yet these tasks have been largely overlooked by the
IR community. We believe that PIM as an area of research
has much to gain from information retrieval techniques and
that IR, in turn, may benefit from focusing some of its ef-
fort on PIM tasks. In this paper, we outline our research
on
information scraps
[1], detailing how this PIM problem
interacts closely with information retrieval, and how taking
a traditional IR approach might overly decontextualize the
problem.
INFORMATION SCRAPS
Figure 1. Information Scraps collected in our investigation of existing
practice.
Despite the number of personal information management
tools available today, a striking amount of our data remains
out of their reach: the content is instead scribbled on Post-it
notes, scrawled on corners of sheets of paper, buried inside
the bodies of e-mail messages sent to ourselves, and typed
haphazardly into text files (Figure 1). This scattered data
contains our great ideas, sketches, notes, reminders, driving
directions, and even our poetry. We refer to these pieces of
personal information as
information scraps
.
We conducted a study consisting of 27 semi-structured in-
terviews and artifact examinations of participants’ physical
and digital information scraps, at 5 different organizations
[1]. Here, we summarize one piece of the study in support
of our argument, focusing on general uses our participants
found for information scraps.
1
18
Roles that Information Scraps Play
We consolidated a list of common information scrap roles
from participants’ responses regarding how and why they
chose to store information of various types in scraps.
Temporary Storage.
Information scraps’ small, discardable
presence enabled their common use as temporary storage.
One participant kept Post-it notes on her laptop palm rest
for just this purpose, recording visitors’ names and contact
information, later to be disposed of.
Archiving.
Many information scraps were intended to reli-
ably hold on to important personal information for long peri-
ods of time. Participants commonly used information scraps
to archive notes from meetings and passwords.
Work-in-progress.
Our participants shared with us many work-
in-progress scraps, such as half-written emails, ideas for busi-
ness plans, brainstorms, and interface designs. “Before I put
anything in the computer, I like to put it on the whiteboard
first,†one participant explained of her newsletter layout de-
sign process.
Reminding.
Many participants took advantage of informa-
tion scraps’ visibility and mobility by placing them in the
way of their future movements to create reminders for them-
selves. Participants used techniques such as colored Post-its
or unread or unfiled e-mails, reminding them to take action
later.
Unusual Data Types.
Taking advantage of information scraps’
freeform nature, participants managed unique data types that
might have otherwise remained unorganized. For example,
one participant created an information scrap system to man-
age a library-style checkout for his privately owned con-
struction tools, and one participant maintained a complex
document of contact information annotated with private notes
on clients.
Information Scraps, PIM and IR
Information scraps present a challenging information retrieval
task. It is appropriate to consider IR with respect to scraps’
lifecycle because of the sheer number of scraps our partici-
pants compiled, in tension with the need to re-find specific
scraps later. However, information scraps do not easily lend
themselves to traditional IR approaches. Scraps are often
recorded incompletely, written tersely, or intentionally left
ambiguous – making it more difficult for IR algorithms to
parse the content. Conversely, the user may recall the con-
tent via a completely different set of cues than the content
itself: for example, context surrounding note creation (“I
wrote it while in the elevator.â€) or gestalt meaning (“My
notes from that meeting about funding.â€).
Depending on the role an information scrap is playing, its
information retrieval needs may further vary. For exam-
ple, work-in-progress scraps may often contain very little
explicit information to index – for example, consider a note-
book page full of rough interface sketches. It is also ques-
tionable whether traditional IR metrics and tasks are even
Figure 2. Jourknow, our prototype information scrap management
tool.
appropriate for information scraps. Reminder scraps, for ex-
ample, are intended to proactively remind rather than be in-
dexed, searched or visualized later, and temporary storage
scraps’s relevance to the user decreases quickly as the scraps
age.
JOURKNOW: A NEW INFORMATION SCRAP TOOL
In parallel with our ethnographic study, we have been de-
signing an information scrap management tool (Figure 2)
[2]. We briefly mention Jourknow, our information scrap
client, which focuses on the following ideas:
Structure Extraction.
In order to elevate ambiguous or un-
structured text to searchable, sortable data, we can assist in
the conversion of the raw scrap rich in implicit structure to
data with explicit metadata structure. Thus, “mtg. w/ Karger
@ 5†becomes reified as a calendar event in the user’s calen-
dar application and searchable as such.
Context Association.
We can further assist in information
scrap retrieval tasks by allowing the user to query by the
situation surrounding the note capture. Jourknow automat-
ically captures and associates information surrounding the
user’s situation. This data includes day and time, location
hints (e.g., wifi ssid), events scheduled on the calendar, and
activity traces including web pages, active applications, peo-
ple the user communicated with, and pictures of the desktop
and the user. Users may then use a faceted browsing inter-
face to search for notes fulfilling specific criteria.
Mobility.
We are constructing a mobile version of the Jour-
know client to run on users’ cellular phones. The mobile
version of the client is intended to support users when away
2
19
from the computer, tailored to the information capture needs
when mobile and affordances the cell phone offers.
We recently completed a weeklong deployment evaluation
of the Jourknow client to help determine its strengths and
weaknesses in information scrap management.
CONCLUSION
In this position paper we have raised the information re-
trieval problem as it affects and is affected by personal in-
formation management. We examined information scraps as
a particularly interesting case of personal information in this
respect, considering features which might bear on IR tasks.
We have seen that importing traditional IR goals and met-
rics into the space of information scraps fails to account for
many of the distinct qualities of scraps’ encoding and usual
retrieval cues. We report on our efforts to bridge this gap
via our prototype system Jourknow, which attempts to make
easier the capture and retrieval of information scraps.
REFERENCES
1. M. Bernstein, M. V. Kleek, D. Karger, and mc schraefel.
Information scraps: How and why information eludes
our personal information management tools.
In
Submission to ACM Transactions on Information
Systems
, 2007.
2. M. V. Kleek, M. Bernstein, D. Karger, and mc schraefel.
Gui? – phooey! the case for text input. In
Proc. UIST
’07
, September 2007.
3
20
!"##$%"&$'()*+,-.#"&$'"&/+0*$&12+
!"#"$%&'()*"+,&-+.&/"+"&/0102)3(+,*%&
!"#$%&''(%
)*+,-.*/%0*-1*2%
%
3456789!5374+
3*-1%4*56+2%728*+4917*2%+61+76:9;%<-69+=.>%-?-164-%9+6%
@69+65%1*A9+5%.6;/72@%9%B-6+%CB7=,;?%925%6886=17:6;?%8725%
9%/9+17=B;9+%7164D%E.91%7164%49?%F6%9%5*=B4621G%9%
@6*@+9/.7=%;*=917*2G%9%89=1*75G%61=D%E.7-%9//+*9=.%7-%@**5%
A.62%9%-72@;6%/76=6%*8%728*+4917*2%=92%8B;87;;%1.6%B-6+H-%
728*+4917*2%2665D%%
%
#2%492?%-71B917*2-G%.*A6:6+G%4B;17/;6%7164-%925I*+%+7=.6+%
*:6+:76A-%*8%1.6%6217+6%728*+4917*2%-/9=6%9+6%26=6--9+?D%
#2%-B//*+1%*8%1.7-G%6J/;*+91*+?%-69+=.%-?-164-%.9:6%F662%
56:6;*/65D%KJ/;*+91*+?%-69+=.%-?-164-%1?/7=9;;?%F;625%9%
:9+761?%*8%728*+4917*2%-66,72@%1**;-%925%19=17=-%<6D@DG%
CB6+?72@G%F+*A-72@G%5*=B4621%=;B-16+72@G%61=D>%1*%.6;/%1.6%
B-6+%F6116+%B256+-1925%1.6%+92@6%*8%9:97;9F;6%728*+4917*2D%
LB=.%19=17=-%9+6%=*4F7265%72%*//*+1B27-17=%A9?-%9-%B-6+-H%
B256+-192572@%*8%1.67+%728*+4917*2%2665-%6:*;:6-%M0916-G%
NOPOQG%
%
"B++621;?G%1.6+6%7-%9;-*%4B=.%A*+,%9+*B25%49--%-*=79;%
=*;;9F*+917*2G%*+%9@@+6@917*2%*8%;9+@6R-=9;6%B-6+%7216217*2%
925%728*+4917*2%-66,72@%F6.9:7*+-D%%$6=*446256+%
-?-164-%-B=.%9-%S49T*2%M=716G%U;72562QG%/6+-*29;7T917*2%
-?-164-%F9-65%*2%F6.9:7*+9;%*+%1*/7=9;%=;B-16+72@%-B=.%9-%
V9;17J%<U**@;6>%M=716G%&''WQG%=*44B271?R46579165%-69+=.%
=*;;9F*+917*2%-B=.%9-%7L/?%M=716%09++?%L4?1.Q%925%
F**,49+,72@%925%:*172@%-716-%-B=.%9-%56;D7=7*DB-%925%X7@@%
9;;%49,6%B-6%*8%1.6%YA7-5*4%*8%=+*A5-Y%9//+*9=.%1*%
=*;;9F*+917*2D%%E.6%B256+;?72@%=*44*29;71?%94*2@%1.6-6%
-?-164-%7-%1.6%2*17*2%*8%B-72@%=*++6;9165%%9@@+6@916%
F6.9:7*+%1*%-166+%1.6%7257:75B9;%-69+=.6+%1*A9+5%1.6%4*-1%
+6;6:921%/76=6-%*8%728*+4917*2%F9-65%*2%+6-B;1-%*F197265%
F?%*1.6+-D%%S25%7257:75B9;G%A*+,72@%9;*26G%7-%74/;7=71;?%
728*+465%F?%/+7*+%-69+=.%F6.9:7*+-%*8%1.6%=+*A5D%%%
%
0B1%A.91%78%9%-69+=.6+%7-%;**,72@%8*+%2*:6;%728*+4917*2G%*+%
8*+%728*+4917*2%1.91%5*6-%2*1%.9:6%9%;9+@6%/66+%@+*B/%8+*4%
A.7=.%1*%5+9A%+6=*44625917*2-Z%).91%78%1.6%-69+=.6+H-%
@*9;-%9+6%57886+621%8+*4%*1.6+-%A.*%8*+4B;9165%-747;9+%
-69+=.%+6CB6-1-Z%%%%E.6%/+*F;64%A71.%1.6%=+*A5RF9-65%
9//+*9=.6-%7-%1A*R8*;5[%1.6+6%49?%F6%;9+@6%2B4F6+-%*8%
5*=B4621-%72%9%-?-164%A71.%2*%/+7*+%B-6+%9116217*2G%925%
1.6%728*+4917*2%2665%*8%1.6%=+*A5%47@.1%2*1%491=.%1.6%
2665%*8%1.6%=B++621%-69+=.6+D%%E.6-6%57-/9+7176-%49?%;7471%
1.6%6886=17:626--%*8%1.6%*1.6+A7-6%56-7+9F;6%-1+916@?%*8%
B-72@%4B;17/;6%/6*/;6H-%72/B1%1*%5616+4726%1.6%*B1=*46%*8%
9%-69+=.%-6--7*2D%%
%
)6%/+*/*-6%1*%4717@916%1.6%5687=762=76-%*8%=*++6;9165%
-69+=.%A71.%=*;;9F*+917:6%-69+=.G%1.91%7-G%-69+=.%72%A.7=.%9%
-49;;%@+*B/%*8%/6*/;6%-.9+6-%9%=*44*2%728*+4917*2%2665%
925%9=17:6;?%<925%-?2=.+*2*B-;?>%=*;;9F*+916-%1*%9=.76:6%
71D%\B+1.6+4*+6G%A6%/+*/*-6%9%-?-164%9+=.716=1B+6%1.91%
4657916-%-69+=.%9=17:71?%*8%4B;17/;6%/6*/;6%F?%=*4F7272@%
1.67+%72/B1-%925%F?%-/6=79;7T72@%+6-B;1-%56;7:6+65%1*%1.64%
1*%19,6%95:9219@6%*8%1.67+%-,7;;-%925%,2*A;65@6D%%
%
\*+%6J94/;6G%1.6%@*9;%*8%/+6:7*B-%=+*A5RF9-65%-?-164-%7-%
1*%.6;/%1.6%=B++621%-69+=.6+%57-=*:6+%+6;6:921%7164-%1.91%
*1.6+-%.9:6%9;+695?%57-=*:6+65D%%#216+6-172@%16=.2*;*@7=9;%
=.9;;62@6-%72:*;:6%57-=*:6+72@%B-68B;%=+*A5%=;B-16+-%*+%
;91621%-/9=6-G%87@B+72@%*B1%A.7=.%/+7*+%B-6+-H%9=17*2-%9+6%
+6;6:921%1.6%=B++621%B-6+]-%728*+4917*2%2665G%61=D%%#2%92%
6J/;7=71;?%=*;;9F*+917:6G%-49;;%1694RF9-65%62:7+*24621G%
*2%1.6%*1.6+%.925G%2*%@B6--A*+,%7-%266565%8*+%B-6+%-69+=.%
F6.9:7*+%=;B-16+72@^%*26]-%=*;;9F*+91*+-%9+6%,2*A2D%E.6%
8*=B-%*8%1.6%-?-164%-.781-%8+*4%.6;/72@%7257:75B9;-%+6R8725%
728*+4917*2%1.91%.9-%9;+695?%F662%8*B25%F?%-*46*26%6;-6%
1*%.6;/72@%9%1694%464F6+%8725%728*+4917*2%1.91%2*%*1.6+%
464F6+%.9-%?61%8*B25G%FB1%1.91%7-%+6;6:921%1*%1.6%*:6+9;;%
728*+4917*2%2665-D%%KJ/;7=71%=*;;9F*+917*2%1.6+68*+6%
9;;*A-%9%8B25946219;%=*2=6/1B9;%-.781%72%-?-164%56-7@2G%
9A9?%8+*4%9;@*+71.4-%925%7216+89=6-%1.91%-B//*+1%+6R
872572@%925%+6R57-=*:6+72@%1*%9;@*+71.4-%925%7216+89=6-%
1.91%-B//*+1%26A%57-=*:6+?%925%6J/;*+917*2D%%%
%
&
4561()(7&
8$61()(7&
,%
+)
3#
0
+0
9,
&
:011-;0#-7(2"&
"5610#-70#%&,"-#)3&
<=>'?@A&
B"-17($"&-C-#"+",,&
-+.&)0+7(+9-1D96.-7"&
)0+7"57&,%,7"$,&<"EFE&
8$(7%G&0#A&
&
-,
%+
)3
#0
+0
9,
&
H,"-#)3&I#-(1,H&JKL&
&
M";&NEOG&
<#")0$$"+."#&
,%,7"$,G&/00F1"P,&
Q-17(5G&C(,.0$&0R&
)#0C.,A&
E9F;6%ND%"*;;9F*+917:6%KJ/;*+917*2%L69+=.%56-7@2%-/9=6D%
%
5:,+8,03;4+0<=!,+
)6%=92%=.9+9=16+7T6%1.6%56-7@2%-/9=6%9;*2@%1A*%
57462-7*2-G%9-%-.*A2%72%E9F;6%ND%L69+=.%=92%F6%6J/;7=71%
*+%74/;7=71G%9-?2=.+*2*B-%*+%-?2=.+*2*B-D%E.6%4972%
57886+62=6%F61A662%6J/;7=71%925%74/;7=1%=*;;9F*+917*2%7-%
21
A.61.6+%/6*/;6%1*%A*+,%1*A9+5%1.6%-946%@*9;%<6J/;7=71>%*+%
-747;9+%FB1%7256/625621%@*9;-%<74/;7=71>D%E.6%,6?%
57886+62=6%F61A662%-?2=.+*2*B-%925%9-?2=.+*2*B-%
=*;;9F*+917*2%7-%1.91%-?2=.+*2*B-%=*;;9F*+917*2%*==B+-%72%9%
17@.1%+69;%<*+%269+R+69;>%1746%8665F9=,%;**/G%A.6+69-%
9-?2=.+*2*B-%=*;;9F*+917*2%;9=,-%1.91%744657916%
7216+9=17*2D%%
E.6-6%,6?%57886+62=6-%74/;?%1.6%2665%8*+%57886+621%B-6+%
7216+89=6-%925%7216+9=17*2%-1?;6-G%925%/*-6%7216+6-172@%
56-7@2%=.9;;62@6-%2*1%*2;?%8*+%1.6%B256+;?72@%9+=.716=1B+6%
FB1%9;-*%8*+%1.6%7216+89=6-%A71.%A.7=.%/6*/;6%7216+9=1D%
_26%,6?%95:9219@6%*8%-49;;%1694-%7-%1.6%F+6951.%*8%
6J/6+762=6%1.6?%F+72@%1*%1.6%-69+=.%19-,D%E.6%291B+6%*8%
1.7-%6J/6+762=6%49?%F6%6J/+6--65%9-%+*;6-D%#2%-*46%=9-6-G%
69=.%1694%464F6+%/;9?-%9%-747;9+%+*;6G%A.6+69-%72%*1.6+%
=9-6-G%1.6%+*;6-%9+6%57886+621D%%
67>,0+
E.6+6%9+6%9%2B4F6+%*8%9+69-%72%A.7=.%=*;;9F*+917:6%
6J/;*+91*+?%-69+=.%*8%1.7-%291B+6%7-%B-68B;D%%!6+6%9+6%1A*%
6J94/;6-D%
%
!
&
X*4972% 6J/6+1I5*4972% 6J/6+1[% 1A*% 5*=1*+-% 8+*4%
57886+621% 5*4972-% 2665% 1*% =*;;9F*+916% *2% 9% =*4/;6J%
579@2*-17=% CB6-17*2G% 72:*;:72@% 6-*16+7=% 728*+4917*2%
8+*4% 69=.% 5*4972D% K9=.% @626+916-% 5*4972R-/6=787=%
-69+=.% 16+4-% 1.91% 1.6% -?-164% 7216@+916-% 721*% `*7265%
CB6+76-D% $6-B;1-% 9+6% 9;;*=9165% 1*% 69=.% B-6+% F9-65% *2%
5*4972%6J/6+17-6D%
!
&
X*4972% 6J/6+1I-69+=.% 6J/6+1% =*;;9F*+917*2[% 1.6%
=*;;9F*+91*+-% 9+6% 9% 5*4972% 6J/6+1% <96+*-/9=6%
62@7266+>%925%9%-69+=.%6J/6+1%<;7F+9+792>D%E.6%-?-164%
9;;*A-%1.6%;7F+9+792%1*%-6;6=1%-*B+=6-%925%1*%8*+4B;916%
CB6+76-G% A.7;6% 1.6% 62@7266+% -B@@6-1-% 16+4-% 925%
/+*:756-%+6;6:92=6%8665F9=,D%
:9?=4@!7?<95,6+345,6=!5374+
S%-?-164%1.91%-B//*+1-%-?2=.+*2*B-%=*;;9F*+917:6%-69+=.%
4B-1%F6%9F;6%1*%=*;;6=1%5919%8+*4%1A*%*+%4*+6%/6*/;6%925%
1*%B-6%1.91%5919%1*%-?21.6-7T6%CB6+76-%925%*1.6+%*/6+917*2-%
*2%9%-69+=.%62@726G%925%71%4B-1%9;;*=916%-69+=.%+6-B;1-%1*%
/9+17=7/921-%9-%9//+*/+7916%1*%1.67+%+*;6-D%%
LB=.%9%-?-164%4B-1%72=;B56%7216+89=6-%8*+%7257:75B9;%
6J/;*+917*2G%7216+89=6-%1.91%-B//*+1%74/;7=71%9A9+626--%*8%
*1.6+-H%9=17:7176-G%925%7216+89=6-%8*+%6J/;7=71%=*;;9F*+917*2D%
S%4755;6A9+6%;9?6+%8*+%=**+5729172@%1.6%9=17:7176-%*8%1.6%
@+*B/G%925%92%9;@*+71.47=%62@726%*/1747T65%8*+%
=*;;9F*+917:6%6J/;*+91*+?%-69+=.%4B-1%6J7-1%1*%-B//*+1%1.7-%
+7=.%728*+4917*2%-.9+72@D%%
E.7-%-?-164%9+=.716=1B+6%=92%9;;*A%9%-49;;%@+*B/%*8%
8*=B-65%728*+4917*2%-66,6+-%1*%-69+=.%=*;;6=17*2-%72%
=*2=6+1D%E.6%-?-164%/+*:756-%8665F9=,%F9-65%2*1%*2;?%*2%
92%7257:75B9;H-%-69+=.%F6.9:7*+G%FB1%9;-*%*2%1.6%=B++621G%
9=17:6%-69+=.%F6.9:7*+%*8%*26H-%-69+=.%9;;76-D%%
a-6+%7216+89=6%56-7@2%72%-B=.%92%62:7+*24621%4B-1%5+9A%
*2%872572@-%8+*4%"L")G%!"#G%925%#$%876;5-D%%
S,"#&K&S8
S,"#&N&S8
T3-#".&S8
8+697&
:00#.(+-70#
U97697&
:00#.(+-70#
?1F0#(73$()&:011-;0#-7(0+&4+F(+"
+
!"#$%&'()'*+,-&.'/%01"-&0-$%&'
,A=>9=5374+
#1%7-%56-7+9F;6%1*%6:9;B916%=*;;9F*+917:6%728*+4917*2%
-66,72@%72%1+95717*29;%#$%16+4-G%FB1%=9+6%4B-1%F6%19,62%
A.62%=*4/9+72@%-69+=.%+6-B;1-%*8%-?-164-%A71.%57886+621%
2B4F6+-%*8%B-6+-%A*+,72@%-?2=.+*2*B-;?D%_26%/*--7F;6%
9//+*9=.%7-%1*%/+*5B=6%9%=*4F7265%+92,65%;7-1%F9-65%*2%
=*21+7FB17*2-%*8%9;;%-69+=.6+-G%925%1.62%=*4/B16%4692%
9:6+9@6%/+6=7-7*2%<3Sb>%-=*+6-%1.91%9+6%95`B-165%1*%
9==*B21%8*+%1.6%2B4F6+%*8%/6*/;6%=*21+7FB172@%1*%1.6%
+6-B;1-D%
_26%/*--7F;6%A9?%1*%95`B-1%3Sb%-=*+6-%7-%1*%/629;7T6%
.7@.;?R+92,65%2*2R+6;6:921%5*=B4621-%F9-65%*2%1.6%
2B4F6+%*8%-69+=.6+-D%E.6%;*@7=%7-%F9-65%*2%1.6%9--B4/17*2%
1.91%756217=9;%+6-B;1-%/+*5B=65%F?%9%-49;;6+%1694%9+6%
F6116+D%E.B-%1.6%4*+6%/6*/;6%897;%1*%49,6%9%=*++6=1%
`B5@4621%9F*B1%9%5*=B4621G%1.6%A*+-6%1.91%1694%
/6+8*+4-D%%
"
#
>
<
c
c
c
c
c
c
!
"
#$%#&$'$(
#$)$'*+%
#$%#&$'$(
#$)$'*+%
#$%#&$'$(
#$)$'*+%
$,&-&.+
/.(&"&$(0#
$
%
&
%
%
'
%
).6+6%8<L>%7-%1.6%/629;71?%8B2=17*2G%-B=.%1.91%8<N>%d%N%925%
8<LeN>%e%ND%).7;6%9%2B4F6+%*8%-B=.%8B2=17*2-%9+6%
/*--7F;6G%A6%-B@@6-1%1.91%8<L>dL%7-%9%@**5%/;9=6%1*%-19+1D%
E.6%6J9=1%=.*7=6%*8%8B2=17*2%A7;;%56/625%*2%9%:9+761?%*8%
89=1*+-G%925%A6%2665%4B=.%4*+6%6J/6+762=6%A71.%1.6-6%
8B2=17*2-%1*%5616+4726%A.7=.%9+6%9//+*/+7916%A.62D%
!74!>903740+
6,B,6,4!,0+
ND
&
".7G%KD!DG%b7+*;;7G%bDG%".62G%VD%925%b71,*AG%fG%a-72@%
728*+4917*2%-=621%1*%4*56;%B-6+%728*+4917*2%2665-%
925%9=17*2-%925%1.6%)6FD%#2%
1#.,$$(&+2-3."3%4$3
!567853,.+"$#$+,$3.+389/*+3"*,%.#-3&+3,./09%&+23
-:-%$/-
G%<L6911;6G%)SG%a27165%L1916-G%&''N>G%S"3%
b+6--G%gO'RgO(D%
&D
&
3*++7-G%3D$DG%#216+89=6-%8*+%"*;;9F*+917:6%
KJ/;*+91*+?%)6F%L69+=.[%3*17:917*2-%925%X7+6=17*2-%
8*+%3B;17Ra-6+%X6-7@2-D%#2%b+*=66572@-%*8%1.6%"!#%
&''(%)*+,-.*/%*2%KJ/;*+91*+?%L69+=.%925%!"#DG%
<&''(>D%
22
Codifi
e
r: A Programm
e
r-C
e
ntri
c
S
e
ar
c
h U
s
e
r Int
e
rfa
ce
Andrew Begel
Microsoft Research
Redmond, WA 98052
andrew.begel@microsoft.com
Search tools have transformed knowledge discovery
by exposing information from previously hidden re-
positories to the workers who need it. Search engines
like Google and Live.com provide search capabilities
via a simple one-line text query box, and present re-
sults in a paged HTML list. When the repository be-
ing searched contains structured information with
extractable metadata (e.g. program source code), it
can be advantageous to index the metadata and use it
to enable queries that are more task-centric and suita-
ble for an domain-specific audience.
Codifier is a programmer-centric search user inter-
face that enables software developers to ask domain-
specific questions related to programming languages
and software. For example, developers might ask
ï‚·
Where is this API or data structure defined?
ï‚·
Where is this API used?
ï‚·
Where is this variable assigned a value?
ï‚·
I know this function writes data to the disk, but I
forget exactly what its name is.
ï‚·
Even if I spell it wrong, I still want to find IPer-
sistentItemsChangedSink.
ï‚·
Find all functions where Open() is called with
Init().
ï‚·
Show me all calls to this method, so I can refac-
tor it by hand.
We index C, C++, C# and VBScript program source
code using a modified compiler to extract and store
lexical and syntactic metadata into a SQL Server
2005 or Windows Desktop Search database. The Co-
difier user interface, presented in Figure 1, enables
software developers to search in this database for
symbols found anywhere in the indexed source code
(not just their definitions). Searches supported by
metadata can be quite powerful. In additional to
source code symbols, we can search for language-
specific connectors (e.g.
foo::bar, foo.bar,Â
Figur
e
1:
Codifier search user interface showing a search for the symbol COM.
23
fooÂ>bar
), synonyms, homophones, abbreviations,
concept keywords (e.g. searching for COM finds
COMString, ComException, ICOMPointer
),
kind and usage of symbols (e.g. searching for defini-
tions of methods named
WriteString
(kind:method
usage:def WriteString), newly instantiated objects of
class
IEnumString
(kind:class usage:use IEnum-
String), assignments to local variables named
firstTimeThroughLoop
(kind:localvar
usage:assign firstTimeThroughLoop)), lexical scop-
ing (e.g. searching for calls to method
Open()
in
classes named
MemoryAccess
), and keywords for
searching by programming language, source control
information and file path.
Filtering, sorting and refinement capabilities are im-
portant for winnowing the thousands of answers re-
sulting from a search over a large source code base.
For example, Microsoft Windows 2003 Server con-
tains several hundred million symbols in its source
code
–
when
searching for code, the “rightâ€Â r
esult
may be one out of thousands, or may be
all
of them.
Codifier provides support for filtering results based
on the lexical, syntactic, and file path scope of the
result. In addition, by presenting the results in a grid,
with one row per symbol found, the UI enables sort-
ing based on any of the metadata values. Refinement
of queries is supported by metadata facet. A top 10
list of results is shown for each facet. When the user
clicks on one of them, an additional filtering term is
added to the query, which is then re-executed.
Codifier stores one symbol per row when using SQL
Server 2005. Using as-of-yet unoptimized schema,
each symbol
’
s metadata is stored in about 2,300 bytes
of space on disk, so even the largest bodies of source
code we index fit into less than 500 GB. Indexing
time is about 2 million symbols per hour. When using
the less scalable Windows Desktop Search 3.0, Co-
difier stores all metadata for the symbols in each file
in the inverted index, enabling metadata-based
searches, but requiring reanalysis and extraction of
metadata when each search result is retrieved. Index-
ing time with WDS is about 80,000 files per hour.
Other search engines such as Google Code Search,
Krugle.com, and Koders.com have been targeted at
program source code, but these index minimal meta-
data, and mostly function by restricting the scope of
full-text search to source code files. Numerous IDEs
such as Visual Studio and Eclipse support symbolic
searches with full metadata support, but have limita-
tions as well. Eclipse has a GUI-based interface to
metadata specification which can be onerous to enter,
and both IDEs limit searches to a single managed
project at a time. The Source Insight IDE supports a
larger search scope using heuristically-evaluated me-
tadata, but does not support synonyms, homophones,
concept keywords or lexical scoping in queries or
results. Various research projects such as GENOA
[4], SCRUPLE [7], TAWK [1], and a project by
Clarke, Cox and Sim [3], have emphasized the back-
end techniques of indexing code and left their front
ends to technical pattern-matching languages. Strath-
cona [6] searches for code by example, alleviating the
query language problem. Sourcerer [2] and Assieme
[5] search for links within public code (and Assieme
links in web-based descriptions) to enable program-
mers to learn how to use new APIs. We have made
Codifier
’
s query language straightforward, like a typ-
ical web search engine, and concentrate mainly on
improving the usability of the UI for understanding
and manipulating search results.
Codifier will be demoed at the workshop and com-
ments and feedback will be gratefully appreciated.
[1] Atkinson, D. C. and Griswold, W. 2006. Effective pattern
matching of source code using abstract syntax patterns.
S
o
ft
w
.
Pra
c
t
.
Exp
e
r
.
36, 4 (Apr 2006), 413-447.
[2] Bajracharya, S., Ngo, T., Linstead, E., Dou, Y., Rigor, P., Bal-
di, P., and Lopes, C. 2006. Sourcerer: a search engine for open
source code supporting structure-based search. In
Companion To
OOPSLA
.
(Portland, Oregon, USA, Oct 22 - 26, 2006). ACM
Press, 681-682.
[3] Clarke, C., Cox, A., and Sim, S. 1999. Searching program
source code with a structured text retrieval system (poster ab-
stract). In
Pro
cee
ding
s
o
f
SIGIR
.
(Berkeley, California, United
States, Aug 15 - 19, 1999). ACM Press, 307-308.
[4] Devanbu, P. T. 1992. GENOA: a customizable language- and
front-end independent code analyzer. In
Pro
cee
ding
s
o
f
ICSE
.
(Melbourne, Australia, May 11 - 15, 1992). ACM Press, 307-317.
[5] Hoffman, R., Fogarty, J., and Weld, D. Assieme: Finding and
Leveraging Implicit References in a Web Search Interface for
Programmers. To appear in
Pro
cee
ding
s
o
f
UIST
. (Newport,
Rhode Island, Oct 7-10, 2007). ACM Press.
[6] Holmes, R., Walker, R. J., and Murphy, G. C. 2005. Strathcona
example recommendation tool. In
Pro
cee
ding
s
o
f
ESEC
.
(Lisbon,
Portugal, Sept 05 - 09, 2005). ACM Press, 237-240.
[7] Paul, S. and Prakash, A. 1994. A Framework for Source Code
Search Using Program Patterns.
IEEE Tran
s
.
So
ft
w
.
Eng
.
20, 6
(Jun. 1994), 463-475.
24
Authority facets: Judging trust in unfamiliar content
Peter Bell
Endeca
How do people judge trust in content that has no provenance, like the stamp of an editorial
process or audit trail? Most digital content lacks the authority of traditional publication models,
and in content created through social collaboration, like the Wikipedia, it even lacks a clear
single author. Nevertheless, informal content is widely consumed, albeit with healthy skepticism.
Trust and authority are not binary — present or absent. In fact, trust in a source varies depending
on a user’s task. For example, a financial analyst might:
!
Trust numbers reported in
The
Economist
enough to prepare a report that will be filed
with the SEC.
!
Disagree with the editorial opinion of an Economist blog on tax law reforms, but will
follow links to sources provided by the author to research more.
!
Might not trust an anonymous posting on an Economist message board, yet finds a tip
worthy of emailing the author for further due diligence.
In each case, content with widely varying degrees of authority proved to be helpful, given proper
context.
People judge trust for themselves constantly, and rely on nuanced evidence to do so. In a faceted
navigation system it is possible to supply readers with abundant evidence to determine trust for
themselves through “authority facets.†Just as facets help users filter and browse content by
subject or attributes, authority facets can add rich information about the trustworthiness of
content by an author for a given task. Examples include facets on skills, ratings, certifications,
affiliations, and social network ties.
There are at least two approaches to implementing authority facets:
!
Content is tagged with authority facets and users can navigate based on those tags.
Example: at an ecommerce site with user-generated product reviews, users can navigate
to camera reviews written just by novices, intermediates, or experts. A review stating a
camera is “heavy†might be judged to carry different meanings coming from novice vs.
expert authors.
!
Facets themselves can have authority facets, and users can filter the values in a facet by
its authority facets. Example: the author facet in a university Wiki is associated with its
own facets, like department, seniority, and certifications. A student searching for
comments in a blog might narrow the author facet to find just authors that are on faculty,
and then use that subset to filter blog comments.
Authority facets are just beginning to emerge in faceted navigation systems, and show promise
as a way to enrich content so users can determine its suitability for a given task. This brief survey
of examples will show how authority facets are being used today and suggest future directions.
25
Mapping the Design Space of Faceted Search Interfaces
Bill Kules
School of Library and Information Science
The Catholic University of America
kules@cua.edu
Introduction
Faceted search, guided search, and categorized overviews are becoming accepted techniques to support complex
information seeking tasks like exploratory search. There are a growing number of applications that use these
techniques for library catalogs, web search, shopping, image collections, and other domains (Antelman, Lynema, &
Pace, 2006; Hearst et al., 2002; Tunkelang, 2006; Yee, Swearingen, Li, & Hearst, 2003). Design guidelines for the
application of these techniques are starting to emerge (Hearst, 2006; Kules & Shneiderman, to appear), but there is
no systematic description of the design space for faceted search interfaces. An understanding of the design space
will aid designers by alerting them to design options and decisions they should address. It will aid researchers by
suggesting a framework for guidelines as well as additional areas of study. In particular, it may help understand the
actions, tactics and strategems (Bates, 1990) supported by faceted search interfaces.
The objective of this paper is to begin identifying and structuring a set of dimensions of the design space for
categorized overviews of search results. This paper proposes a set of dimensions for the design space of faceted
search interfaces and two structures for meaningfully organizing them. These dimensions and the organizing
structures emerged from analysis of recent literature and applications from several domains.
Method
We began with design dimensions extracted from Hearst (2006), Smith & Kules (2006), and Kules (2006). Hearst
(2006) makes detailed design recommendations for hierarchical faceted search interfaces. Kules (2006) identifies a
set of ten dimensions and their corresponding design options for categorized overviews. Smith & Kules (2006)
proposes 14 dimensions in three areas (organization, display, and interaction). We extend those 14 dimensions by
analyzing additional interfaces from different domains: mSpace (schraefel et al., 2005), the Relation Browser
(Marchionini & Brunk, 2003), and several commercial shopping interfaces.
This analysis yielded 28 dimensions. By considering the three primary conceptual elements (the facets, the
categories within the facets, and the individual search result items), we structure the interactions by examining how
an action on each element can affect that element and the other two. Actions include, but are not limited to, clicking
on, dragging, and “hovering†over an interface element. For example, clicking on a category in a faceted search
interface often affects the items (by narrowing the set of results to that category) and the categories displayed (by
displaying the subcategories). Structuring the design dimensions in this manner yields two tables. Table 1 contains
Table 1. Design dimensions related to the organization and display of facets, categories, and items.
Organization
Display
Facet
• Ordering
• Grouping
• Semantics/relationship represented
• Location & layout
• Method for determining which facets are displayed (e.g.,
predetermined, user-selected)
• Form of display (textual or graphical)
• Display of facets with 0 or 1 non-empty categories (e.g., no
change, shrink or hide facet)
Category
• Ordering
• Breadth & depth
• Labels & terms
• Categorization method (e.g., use
existing metadata, extract or infer
categories, automated clustering)
• Visible levels of hierarchy
• Method of determining which categories to display (e.g. all,
most common, show/hide empty categories, provision of
keyword search/filter on category name)
• Sorting/grouping of displayed categories
• Display of an “Uncategorized†pseudo-category for
uncategorized items
Item
• Ordering
• Method of determining which items to display (e.g. display N
per page, display a sample for each visible category)
26
design dimensions related to the organization and display of facets, categories, and items. Table 2 structures the
interaction dimensions into a 3x3 array. The rows represent the element being acted upon and the columns represent
the element being affected. The empty cells suggest areas for additional study. One remaining dimension does not fit
in these structures: Breadcrumbs (how they are ordered; the effect of removing an element of the breadcrumb list).
Conclusion
The organization and dimensions described here are a work in progress, intended to stimulate discussion. This is one
step in developing an understanding of the design space, starting with the current literature and a sample of
applications targeted at desktop-based web browsers. Additional actions, interactions, and dimensions will certainly
emerge from the study of other applications and non-desktop devices (e.g. PDAs).
Table 2. Design dimensions related to the interaction of facets, categories, and items. The rows contain the
elements being acted upon by the user and the columns contain the elements being affected.
Affected element
Facet
Category
Item
Facet
• Selection (simultaneous
or sequential)
Category
• Effect of category
selection on other facets
(e.g., display
subcategories as a new
pseudo-facet)
• Effect of category selection on
display of other categories
within facet (e.g. are multiple
selections supported)
• Previews of subcategories
• Narrowing, broadening
• Sorting/grouping items (e.g.,
group by children of most
recently selected category)
• Previews
E
le
me
nt
b
ein
g
ac
te
d
up
on
b
y
us
er
Item
• Highlight related facet
• Highlight category
membership
• Find related items (e.g.
“More like thisâ€
References
Antelman, K., Lynema, E., & Pace, A. (2006). Toward a 21st Century Library Catalog.
Information Technology and
Libraries, 25
(3), 128-139.
Bates, M. (1990). Where should the person stop and the information search interface start.
Information Processing
and Management, 26
(5), 575-591.
Hearst, M. (2006).
Design recommendations for hierarchical faceted search interfaces
. Paper presented at the ACM
SIGIR 2006 Workshop on Faceted Search. Retrieved September 27, 2007, from
http://flamenco.berkeley.edu/papers/faceted-workshop06.pdf
.
Hearst, M., Elliot, A., English, J., Sinha, R., Swearingen, K., & Yee, P. (2002). Finding the flow in web site search.
Communications of the ACM, 45
(9), 42-49.
Kules, B. (2006).
Supporting Exploratory Web Search with Meaningful and Stable Categorized Overviews
(Unpublished doctoral dissertation): University of Maryland, College Park. Retrieved May 30, 2007, from
http://hcil.cs.umd.edu/trs/2006-14/2006-14.pdf
.
Kules, B., & Shneiderman, B. (to appear). Users can change their web search tactics: Design guidelines for
categorized overviews.
Information Processing & Management
.
Marchionini, G., & Brunk, B. (2003). Towards a General Relation Browser: A GUI for Information Architects.
Journal of Digital Information, 4
(1), Article No. 179.
schraefel, m. c., Smith, D. A., Owens, A., Russell, A., Harris, C., & Wilson, M. (2005).
The evolving mSpace
platform: Leveraging the semantic web on the trail of the memex
. Paper presented at the Proceedings of the
Sixteenth ACM Conference on Hypertext and Hypermedia.
Smith, J., & Kules, B. (2006).
Toward a design space for categorized overviews of search results
. Retrieved
September 27, 2007, from
http://faculty.cua.edu/kules/Papers/SmithKules_DesignSpace.pdf
.
Tunkelang, D. (2006).
Dynamic Category Sets: an approach for faceted search
. Paper presented at the ACM SIGIR
2006 Workshop on Faceted Search. Retrieved September 27, 2007, from
http://www.cs.cmu.edu/~quixote/DynamicCategorySets.pdf
.
Yee, K.-P., Swearingen, K., Li, K., & Hearst, M. (2003). Faceted metadata for image search and browsing. In
Proceedings of the SIGCHI Conference on Human factors in Computing Systems, Ft. Lauderdale, FL
(pp.
401-408). New York: ACM Press.
27
Images as Supportive Elements for Search
Giridhar Kumaran and Xiaobing Xue
Center for Intelligent Information Retrieval, Department of Computer Science
University of Massachusetts Amherst, Amherst, MA 01002
{giridhar, xuexb}@cs.umass.edu
Introduction
The dominant paradigm of search today is heavily biased towards textual interfaces. Users enter textual
queries, and navigate to potentially relevant content guided by short textual snippets offering summaries
of retrieved information. This interaction paradigm is not only quite successful in practice but also
provides an opportunity for improved techniques that are potentially even easier and effective. Our work
focuses on that part of the interactive retrieval process where users are offered textual cues to guide them
towards relevant content. By using images in lieu of text, we believe we can provide a user experience
that is not only more effective, but also more efficient.
Images as supportive elements.
A study reported in (Coltheart, 1999) observed that a person can get the gist of an image in 110ms or less
while in the same time she can only read less than 1 word, or skim 2 words. This was the basis for
previous research (Xue et al, 2006) in the web domain that showed that using images in conjunction with
text improves user experience and satisfaction. Usually, web documents have images included in them,
making the task of finding appropriate supportive images easier. We are interested in extending this idea
to collections where documents do not have associated images. We believe that this is important as vast
amounts of information in historical archives, corporate intranets, scanned books etc. do not have images
associated with them.
Our proposed approach is to build on standard information retrieval techniques and available resources
like image search APIs to bring the same advantages of image-supported search to the collections we are
interested in. We envision a procedure starting with retrieval of a set of documents from the collection in
response to a user’s query. The next step is to cluster the retrieved documents. Once the clusters are
created we propose to create concise textual summaries representing each cluster, and using those
summaries as queries for image search using an image search API. Alternately, we propose to search the
web for documents similar to the cluster centroids, and use the images associated with them as supportive
images for the user to work with.
We hypothesize that such image-supported search will not only improve precision, but also recall since
the user can quickly sift though the images summarizing the ranked list, indirectly accessing documents
further down.
References
Coltheart, V., ed. 1999.
Fleeting Memories: Cognition of Brief Visual Stimuli
. Cambridge, MA: MIT
Press.
X.-B. Xue, Z.-H. Zhou, and Z. Zhang.
Improve web search using image snippets
. In: Proceedings of the
21st National Conference on Artificial Intelligence (AAAI'06), Boston, MA, 2006, pp.1431-1436.
28
MIT-Endeca Workshop on Human-Computer Interaction and Information Retrieval,
projects.csail.mit.edu/hcir
Extracted From Public Release Case Number: 07-0980 25 July 2007
Searching Conversational Speech
Mark Maybury
Information Technology Center
The MITRE Corporation
202 Burlington Road
Bedford, MA 01730, USA
maybury@mitre.org
itc.mitre.org
ABSTRACT
This paper summarizes two MITRE efforts to address the
speech search challenge. We first describe Audio Hot Spot-
ting (AHS) and then Cross Language Automated Speech
Recognition (CLASR).
HUMAN LANGUAGE TECHNOLOGY AT MITRE
MITRE has engaged in a highly diverse HLT program over
multiple decades. This has resulted in operational systems
of integrated capabilities such as the DARPA MITRE Text
and Audio Processing System (MiTAP) and the Translin-
gual Instant Messaging (TrIM). MITRE has made a num-
ber of its contributions available via open source including:
-
DARPA Galaxy Communicator architecture
(~800 downloads at communicator.sourceforge.net)
-
Midiki MITRE dialog manager toolkit
(200+ downloads at midiki.sourceforge.net)
-
Callisto annotation tool framework
(~900 downloads at callisto.mitre.org)
We have also been active in facilitating the community to
advance a number of key standards such as TIMEX2 (Ferro
et al 2005), TimeML (timeml.org, Pustejovsky, et al. 2005),
and more recently an effort to create SpatialML (2007). We
have been awarded a patent for our effort in Broadcast
News Navigation (US Patent 6,961,954 ; Maybury et al
1997) and have patents submitted for Personalcasting and
Audio Hot Spotting and have advocated advanced Question
Answering (Maybury 2004).
AUDIO HOT SPOTTING
The Audio Hot Spotting project (Hu et al. 2004) aims to
support natural querying of audio and video, including
meetings, news broadcasts, telephone conversations, and
tactical communications/surveillance. As Figure 1 illus-
trates, the architecture of AHS integrates a variety of tech-
nologies including speaker ID, language ID, non speech
audio detection, keyword spotting, transcription, prosodic
feature and speech rate detection (e.g., for speaker emo-
tional detection), and cross language search.
Figure 1. AHS Architecture
An important innovation of AHS is the combination of
word-based speech recognition with phoneme-based audio
retrieval for mutual compensation for keyword queries.
Phoneme-based audio retrieval is fast, more robust to spell-
ing variations and audio quality, and may have more false
positives for short-word queries. In addition, phoneme-
based engines can retrieve proper names or words not in the
dictionary (e.g., “Shengzhenâ€) but, unfortunately, produces
no transcripts for downstream processes. In contrast, word-
based retrieval is more precise for single-word queries in
good quality audio and provides transcripts for automatic
downstream processes. Of course it has its limitations too.
For example, it may miss hits for phrasal queries, out-of-
vocabulary words, and in noisy audio and is slower in pre-
processing.
Figure 2 illustrates the user interface for speech search, and
includes a speaker and keyword search facility against both
video and audio collections. The user can also search by
non speech audio (e.g., clapping, laughter). A recent exten-
sion enables a user to query in English, have this query
translated to a foreign language (e.g., Spanish, Arabic), use
this query to retrieve hot spots in a transcription of the tar-
get media, which is then retrieved and translated into the
query language.
29
MIT-Endeca Workshop on Human-Computer Interaction and Information Retrieval,
projects.csail.mit.edu/hcir
Extracted From Public Release Case Number: 07-0980 25 July 2007
Figure 2. AHS Search Interface
CROSS LANGUAGE ASR (CLASR)
Access to foreign language spoken discourse is challeng-
ing. Building systems to do so is even more difficult when
no written resources for that language exist. The Cross
Language Automated Speech Recognition (CLASR) effort
investigates a new approach for spoken language transla-
tion of languages that lack significant written resources.
This effort is exploring the hypothesis that recent advances
in both speech recognition and machine translation enable a
fresh approach.
In particular, CLASR aims to build a process that goes
from audio in a foreign language to text in English, ad-
dressing languages that do not have the right quantity and
type of language resources for the current approaches. Cur-
rent approaches to this challenge go from source language
acoustics to source language written form, then from the
source language written form to the English written form.
Typically they use 1-best ASR output although some use n-
best, but in all cases they output written form. CLASR sim-
plifies this process and folds the translation model and
acoustic model into one cross-language acoustic model.
While CLASR aims to address low resource languages,
experiments are being performed on well-known languages
(Spanish and Mandarin) to compare the new single stage
approach to the traditional two stage pipe-line system, i.e.,
ASR+MT. In particular, CLASR uses an open source tool-
kit for ASR (HTK from Cambridge University) and a de-
velopment kit for MT (GIZA++ and PHARAOH (JHU,
MIT)). Our Spanish experiments are based on 30 hours of
broadcast news audio using audio from Central America
and transcripts in Spanish which have been translated into
English. Initial results with Spanish with no additional
language model have been promising as assessed by BLEU
(BiLingual Evaluation Understudy), i.e., the portion of
4-word sequences in MT output that are found in reference
translations with a range from 0 (poor) to 100 (good). The
very first single-stage score, an initial foothold as we begin
hill climbing, was a BLEU score of 8. By contrast, the 2-
stage ASR+MT scores achieved a word error rate of 45 and
and a BLEU score of 13. Our recent system Spanish-
English MT system has a BLEU score of 21, outperforming
the two stage baseline.
In summary, this approach is analogous to the results re-
ported in this workshop by Olsson (2007) in which a single,
integrated model outperforms a sequence of transcription
and retrieval. Notably, CLASR’s combined approach
shows promise both performance-wise as well as in terms
of its limited requirement for language resources.
REFERENCES
1.
Ferro, L., Gerber, L., Mani, I., Sundheim, B. and Wil-
son G. (2005) "TIDES 2005 Standard for the Annota-
tion of Temporal Expressions" April 2005, Updated
September 2005.
http://timex2.mitre.org/annotation_guideli-
nes/2005_timex2_standard_v1.1.pdf
2.
Fiscus, J., Ajot, J., Garofolo, J. and Doddington, G.
Results of the 2006 Spoken Term Detection Evalua-
tion. 2007 SIGIR Workshop on Searching Spontaneous
Conversational Speech, Amsterdam, 27 July 2007. p.
45-51.
3.
Hu, Q., Goodman, F., Boykin, S., Fish, R., and Greiff
,
W. 2004. "Audio Hot Spotting and Retrieval Using
Multiple Audio Features and Multiple ASR Engines".
Rich Transcription 2004 Spring Meeting Recognition
Workshop at ICASSP 2004, Montreal, Can-
ada. http://www.nist.gov/speech/test_beds/mr_proj/ica
ssp_program.html
4.
Maybury, M. (ed.) 1997.
Intelligent Multimedia In-
formation Retrieval
. Menlo Park: AAAI/MIT Press.
(http://www.aaai.org:80/Press/Books/Maybury-2/)
5.
Maybury, M. editor. 2004.
New Directions in Ques-
tion Answering
. AAAI/MIT Press.
6.
Olsson, S. Improved Measures for Predicting the Use-
fulness of Recognition Lattices in Ranked Utterance
Retrieval. 2007 SIGIR Workshop on Searching Spon-
taneous Conversational Speech, Amsterdam, 27 July
2007. p. 1-5.
7.
Pustejovsky, J., Ingria, B. Sauri, R., Castano, J., Litt-
man, J., Gaizauskas, R., Setzer, A., Katz, G. and Mani,
I. 2005. The Specification Language TimeML. In
Mani, I., Pustejovsky, J. and Gaizauskas, R. (eds.), The
Language of Time: A Reader, 545-557. Oxford Uni-
versity Press. http://timeml.org
8.
SpatialML: Annotation Scheme for Marking Spatial
Expressions in Natural Language, March 30, 2007.
MITRE Technical Report.
http://sourceforge.net/projects/spatialm
ACKNOWLEDGEMENTS
This paper includes a summary of the contributions of Qian
Hu’s AHS project and John Henderson’s CLASR project
and their associated technical teams.
30
Natural Language Access to Information for Mobile Users
Alexander Ran
Nokia Research Center Cambridge
3 Cambridge Center
Cambridge, MA 02142
Alexander.Ran@nokia.com
Raimondas Lencevicius
Nokia Research Center Cambridge
3 Cambridge Center
Cambridge, MA 02142
Raimondas.Lencevicius@nokia.com
Mobile devices store a rich set of structured information. The
phone book application contains names, phone numbers,
addresses and affiliations of personal contacts. The calendar
application contains entries for meetings with participants,
meeting location and time. Logs of dialed received calls are stored
on the device as well as sent and received messages and emails.
Unfortunately, users can only access this information using
specially designed applications that manage different subsets of
this information. There are several significant problems with this
situation.
The applications that manage data on a mobile device only
provide a fixed and limited set of ways to access the information.
The contacts in the phone book have affiliations with
organizations, professions, titles, home and office addresses, and
other attributes. However the only way you can access this
information with current applications is using contact’s name.
There is no direct way to find answers for many reasonable
questions like: Who do you know at Nokia? What is the office
address of your lawyer? Who is the sales manager at AT&T store
in Burlington? Although the information about your meeting
includes subject, location and participants, the only way for you
to access the meeting information is browsing it chronologically.
There is no direct way to find answers for many reasonable
questions like: When is your next meeting at MIT? Where do you
meet John next week? Are you free on Wednesday afternoon in
October?
Although data sets owned by different applications are
semantically related, there is no simple way to make these
relations explicit. You receive calls, exchange messages and have
meetings with people in your address book. Places that you visit
often correspond to addresses of people and organizations listed
as your contacts and may appear as meeting places on your
calendar. For an intuitive interaction with information the
semantic relations between different data items need to be
explicitly represented.
Clearly we need a better way to manage the information. But
is this enough? Having a rich semantic repository that integrates
all information into a semantic network is a significant step
forward compared to arbitrarily partitioned, disconnected subsets
of information, but it does not solve the essential problem of user
interaction with complex information sets. It is the language
barrier. The richer the information set we are dealing with, the
richer language we need to interact with this information set. It
seems plausible that the only general solution to this problem
would involve some form of natural language based access to
information.
Natural language interfaces to databases is not a new idea.
Besides some inherent limitations of natural language interaction
with a machine, a major factor for limited success of this
technology was the fact that NLI were not easily portable between
different databases. This is due to significant dependencies on
data organization and content that had to be introduced in the
language system component in order to generate database specific
semantic representation of natural language request.
Let us assume the user asks the system about contacts in
some organization and geographical location:
Who do I know at IBM Ulm?
Who are my contacts at IBM in Ulm?
What are the names of my contacts at IBM in Ulm?
1
The operational semantics of these questions can be
adequately represented with a database query. Let us consider
how this request would need to be posed to an RDF [2]
repository. SPARQL [3] query corresponding to our example
question over our extended PIM ontology looks as follows:
SELECT DISTINCT
?person ?givenName ?familyName
FROM
<http://localhost/pim.rdf>
WHERE
{?person
a
pim:Person;
pim:givenName
?givenName; pim:familyName ?familyName; pim:affiliation
?affiliation; pim:address ?person_address.
?affiliation pim:organization ?organization.
?organization
pim:address
?organization_address;
pim:name “IBMâ€.
{?person_address pim:locality “Ulmâ€} UNION
{?organization_address pim:locality “Ulmâ€}}
Unfortunately in order for a language system to generate
such semantic representation from the original questions, the
language system must contain a large amount of information
about the structure of the database and its content. Such
information includes the facts that IBM is a name of an
organization and Ulm is a name of a city, cities can be related to
organization through their addresses, organizations are related to
people through their affiliations, people are related to cities
through their home and office addresses, and all these
relationships and objects are represented by the specific structures
and entities of the database.
Entering such information into a language system is a
tedious and costly process that is not only domain dependent but
also is sensitive to specific choices of database organization.
However if it were possible to automate the integration of
language system with a data repository the natural language
interface proposition would become very attractive.
We are attempting to do exactly that, exploring the rules for
design of semantic repositories that can be interpreted as a
1
The name of the organization and the city were selected for
shortness and carry no other information
31
grammar and a model for semantic interpretation of natural
language requests concerned with access to information in the
repository.
We have designed and implemented the Natural Query (NQ)
language and engine [1] that accepts database independent
semantic representation of natural language requests and using
heuristics produces operational interpretation the natural language
question over a given semantic repository. NQ requires attaching
basic linguistic information to structural elements of semantic
repository and imposes certain rules on the design of its ontology.
NQ relies on language tags being attached to database
elements such as classes and properties. Multiple tags can be
attached to a single element and a single tag can be attached to
multiple elements. Language tags could correspond to the names
of semantic categories used by the language system(s) or include
them in their semantic class. Given a form like the one in our
example,
contact.name: ?
organization: IBM
city: Ulm
NQ interprets it as:
find the attributes tagged as “name†of an instance of the
class tagged as “contact†related through properties tagged as
“organization†and “city†to values “IBM†and “Ulmâ€
respectively
While a formal query defines a connected subgraph, the
database independent meaning representation only identifies some
nodes and edges of this subgraph. Identified fragment might be
disconnected. In the example above it identifies “
Person
†and
“
Organization
†classes as well as “
Ulm
†value of “
locality
â€
property (by reference to its language tag “
city
â€) and “
IBM
†as a
value of “
name
†property of an instance of “
Organization
†class.
This leads to an important idea: that the knowledge embedded in
the formal queries that
know
the database organization can be also
extracted from the natural language meaning representation and
the data repository itself. For a given set of elements identified by
a meaning representation of natural language request it is possible
to identify the query subgraph by searching the database. In other
words, a program could find paths connecting the nodes known
from the meaning representation, such as
“Personâ€, “nameâ€,
“Organizationâ€, “Cityâ€, “Ulmâ€,
and
“IBMâ€.
Therefore, while traditional approaches to semantic analysis
of natural language questions over databases rely on hand crafted
code or data for representing the information about the
organization of the database, NQ extracts such knowledge from
the data repository by using graph search. Given a question “
Who
are my contacts at IBM in Ulm?
â€, NQ finds paths connecting the
nodes known from the database independent meaning
representation, such as
“Personâ€, “nameâ€, “Organizationâ€,
“Cityâ€, “Ulmâ€,
and
“IBMâ€
.
NQ may find multiple subgraphs that connect all given
elements. In such cases we apply heuristic ranking of these
subgraphs in order to determine the most relevant ones. So far we
experimented with several ranking mechanisms all of which are
variations on path length (weight) between the elements specified
by the meaning representation. In all our experiments the results
retrieved by the system in response to natural language questions
correspond well with intuition of human subjects.
We are currently working on integrating NQ with the Galaxy
natural language system [3] and exploring the potential of our
approach to provide mobile users natural language access to
semantic repositories and on and off the mobile device.
1.
REFERENCES
[1]
Ran, A., and Lencevicius, R., “Natural Language Query
System for RDF Repositoriesâ€, To appear in the Proceedings
of International Symposium on Natural Language
Processing, SNLP 2007.
[2]
Resource Description Framework, http://www.w3.org/RDF/,
2007.
[3]
S. Seneff, E. Hurley, R. Lau, C. Pao, P. Schmid, and V. Zue,
"GALAXY-II: A Reference Architecture for Conversational
System Development,"
Proc. ICSLP 98
, Sydney, Australia,
November 1998.
[4]
SPARQL Query Language for RDF,
http://www.w3.org/TR/rdf-sparql-query/
, 2007
32
AnalogySpace and ConceptNet
Rob Speer and Catherine Havasi
October 19, 2007
When people communicate with each other, their conversation relies on many ba-
sic, unspoken assumptions, and they often learn the basis behind these assumptions
long before they can write at all, much less write the text found in corpora. These
assumptions underlie all forms of human communication, from teaching, to giving
directions, to ordering dinner at a restaurant.
A user who interacts with a computer interface, however, can become frustrated
because the computer does not understand their goals and motivations. For human-
computer interaction to become as fluent as communication between humans, com-
puters need to be able to understand the user’s basic, unspoken assumptions. These
assumptions form the body of knowledge known as “common senseâ€.
Grice’s theory of pragmatics states that when communicating, people tend not
to provide information which is obvious or extraneous. If someone says “I bought
groceriesâ€, he is unlikely to add that he used money to do so, unless the context
made this fact surprising or questionable. Thus, it is difficult to collect common
sense knowledge automatically from the Internet or a lexical resource.
Since 2000, the Open Mind Common Sense project has been collecting common
sense information from volunteers on the Internet. This information is converted,
using automatic NLP techniques, to a semantic network called ConceptNet. Over
the years ConceptNet has grown to contain over 250,000 predicates in English and
has recently been expanding to include many new languages.
Using principal component analysis on the graph structure of ConceptNet yields
AnalogySpace, a vector space representation of common sense knowledge. This rep-
resentation reveals large-scale patterns in the data, while smoothing over noise, and
predicts new knowledge that the database should contain. The inferred knowledge,
which a user survey shows is often correct, is used as part of a feedback loop that
shows contributors what the system is learning, and guides them to contribute useful
new knowledge.
1
We feel that information retrieval would benefit from our work in several ways.
First, interfaces used in IR could benefit from the â€sanity checking†features that
adding common sense to a system provides. In the past, this has been used in speech
recognition, predictive text entry, and other UI applications. Secondly, we would like
to explore a representation similar to AnalogySpace, or even built on it, for other
types of complex data such as those found in IR. We feel that AnalogySpace and
principal component analysis shows great potential in reasoning which can extend
to other areas.
2
33
ï€ï€‚ï€ï€Žï€‹ï€…ï€ï€Šï€‹
ï€ï€‘ï€ï€ï€‰ï€ƒï€”ï€ï€…ï€ï€‡ï€‚ï€ï€‹
ï€ï€‡ï€‰ï€†ï€‹ï€–ï€ï€—ï€ï€˜ï€‹
ï€ï€Šï€‰ï€šï€…
ï€
ï€ï€…ï€ï€ˆï€‰ï€Šï€…ï€ï€…ï€ï€‹ï€‡ï€Œï€‰ï€‹ï€‡ï€Šï€ï€ï€Žï€ƒï€Šï€…ï€ï€„ï€ï€ï€ï€ï€ï€ï€‘ï€ï€ƒï€ï€ï€„ï€ï€‡ï€’ï€ï€†ï€‰ï€Šï€ï€‰ï€Šï€Šï€”ï€ï€‰ï€ï€„ï€ï€•ï€‡ï€‹ï€
ï€ï€‡ï€’ï€ï€Žï€ƒï€‹ï€‡ï€“ï€ï€ï€ï€ï€ï€–ï€ï€Œï€†ï€ï€‰ï€ï€—ï€ï€„ï€ï€Šï€ï€‰ï€ï€„ï€ï€•ï€ï€ï€™ï€ï€ƒï€‘ï€ï€Œï€ƒï€„ï€ï€‹ï€ƒï€Œï€”ï€ï€…ï€ï€šï€˜ï€ï€…ï€ï€„ï€
ï€ï€Žï€‰ï€…ï€ï€„ï€ï€Œï€‡ï€ï€…ï€ï€…ï€ï€Šï€ï€—ï€ï€œï€†ï€ï€Šï€ï€Žï€ƒï€‹ï€‡ï€“ï€ï€Žï€‰ï€ˆï€Šï€ï€…ï€ï€…ï€ï€‘ï€ï€“ï€ï€‰ï€„ï€ï€”ï€ï€ï€ï€…ï€ï€ƒï€‘ï€ï€Žï€ƒï€Šï€…ï€
ï€ï€‰ï€“ï€ï€‰ï€ï€‹ï€ï€‰ï€Œï€‰ï€‹ï€‡ï€Žï€ï€Œï€ï€Šï€‡ï€‰ï€„ï€ï€‡ï€ï€™ï€ï€ï€‡ï€Šï€›ï€
ï€
ï€ï€‡ï€Šï€ˆï€ï€…ï€ï€ï€…ï€ï€ˆï€ƒï€ˆï€”ï€ï€…ï€ï€ï€…ï€ï€„ï€ï€•ï€‡ï€‹ï€ï€„ï€ï€‡ï€’ï€ï€Žï€ƒï€‹ï€‡ï€“ï€ï€Šï€”ï€ï€šï€‡ï€Œï€‰ï€”ï€ï€ï€…ï€ï€‹ï€ƒï€‡ï€Šï€ï€ï€ƒï€…ï€ï€ˆï€„ï€ï€‹ï€‡ï€ï€‰ï€
ï€ï€Šï€ˆï€“ï€ï€…ï€ï€šï€‡ï€…ï€ï€ï€„ï€ï€…ï€ï€‰ï€ï€‹ï€ï€ï€„ï€ï€…ï€ï€‹ï€ƒï€Œï€”ï€ï€…ï€ï€œï€†ï€ï€Šï€ï€–ï€ï€‡ï€Šï€Šï€ï€Žï€‰ï€•ï€‡ï€Šï€ï€ï€…ï€
ï€ï€Œï€ï€…ï€ï€ƒï€šï€…ï€ï€ï€ï€‡ï€’ï€ï€ï€šï€‰ï€Šï€ï€Œï€ï€‰ï€ï€‰ï€“ï€ï€Šï€ï€ƒï€‘ï€ï€…ï€ï€—ï€ï€„ï€ï€ï€Šï€”ï€ï€‰ï€Šï€ï€…ï€ï€ï€”ï€ï€ƒï€‘ï€
ï€ï€…ï€ï€‹ï€ƒï€Œï€”ï€ï€…ï€ï€ï€“ï€ï€‰ï€“ï€ï€‡ï€ï€‰ï€ï€Žï€ƒï€„ï€ï€Œï€ƒï€Žï€ˆï€“ï€ï€Œï€‰ï€…ï€ï€ƒï€ï€‡ï€ï€ï€Šï€”ï€ï€‰ï€Šï€ï€…ï€ï€„ï€ï€—ï€ï€…ï€
ï€
ï€ï€ï€Œï€ƒï€ï€…ï€ï€ï€‰ï€ï€Šï€‡ï€…ï€ï€„ï€ï€‡ï€’ï€ï€Žï€ƒï€‹ï€‡ï€“ï€ï€ˆï€‰ï€„ï€ï€…ï€ï€ƒï€ï€Šï€ï€…ï€ï€Œï€ƒï€„ï€ï€ï€ï€…ï€ï€…ï€ï€Šï€”ï€ï€ƒï€‘ï€ï€‹ï€ƒï€Œï€”ï€ï€…ï€
ï€ï€…ï€ï€‰ï€„ï€ï€Œï€ƒï€ï€Šï€ï€‹ï€‡ï€„ï€ï€„ï€ï€…ï€ï€ï€‰ï€ï€‹ï€ï€…ï€ï€…ï€ï€‰ï€„ï€ï€ï€ƒï€…ï€ï€ ï€ï€Šï€‡ï€…ï€ï€„ï€ï€‡ï€’ï€ï€Žï€ƒï€‹ï€‡ï€“ï€ï€‹ï€ƒï€‡ï€Šï€ï€ï€ƒï€…ï€
ï€ï€•ï€ï€…ï€ï€„ï€ï€‡ï€’ï€ï€‹ï€ƒï€Œï€”ï€ï€…ï€ï€ï€ï€Šï€…ï€ï€ï€ï€…ï€ï€‡ï€Šï€…ï€ï€Šï€†ï€‡ï€Šï€ï€‰ï€ï€Œï€“ï€ï€Šï€ˆï€“ï€ï€…ï€ï€šï€‡ï€…ï€ï€ï€‹ï€ƒï€Œï€”ï€ï€…ï€ï€…ï€
ï€ï€ï€ï€ï€‰ï€ï€‹ï€ï€ƒï€”ï€ï€ƒï€‘ï€ï€…ï€ï€„ï€ï€‡ï€’ï€ï€Šï€‡ï€…ï€ï€œï€†ï€ï€Šï€ï€‡ï€¢ï€ˆï€“ï€ï€Œï€ï€…ï€ï€‹ï€‡ï€…ï€ï€ï€‰ï€…ï€ï€ƒï€ï€ï€ƒï€‘ï€ï€…ï€ï€Šï€‡ï€…ï€ï€ƒï€‘ï€ï€‹ï€ƒï€Œï€”ï€ï€…ï€ï€…ï€
ï€ï€Šï€…ï€ï€…ï€ï€…ï€ï€—ï€ï€„ï€ï€Žï€‰ï€•ï€‡ï€Šï€ï€ï€…ï€ï€ˆï€ƒï€Šï€Šï€ï€šï€“ï€ï€…ï€ï€‰ï€ï€‰ï€“ï€ï€…ï€ï€Šï€ï€Šï€‡ï€…ï€ï€ï€‰ï€ï€‹ï€ï€…ï€ï€ï€…ï€ï€‰ï€ˆï€ˆï€“ï€ï€…ï€ï€Šï€
ï€ï€‰ï€“ï€ï€Šï€ï€…ï€ï€ï€Žï€ˆï€„ï€ï€”ï€ï€‡ï€¢ï€ˆï€‡ï€„ï€ï€‡ï€ï€Œï€‡ï€›ï€
ï€
ï€ï€ï€ˆï€‰ï€„ï€ï€Œï€”ï€ï€ï€–ï€ï€Œï€‰ï€ï€ï€„ï€ï€Šï€ï€…ï€ï€‰ï€ï€ï€‰ï€ï€‰ï€“ï€ï€Šï€ï€…ï€ï€ï€—ï€ï€‹ï€‡ï€Šï€ï€™ï€ï€‡ï€‹ï€ï€…ï€ï€Žï€‡ï€‰ï€Šï€”ï€ï€„ï€ï€—ï€ï€…
ï€ï€‰ï€Žï€‡ï€“ï€ï€ï€…ï€ï€—ï€ï€Œï€“ï€ï€…ï€ï€Šï€Œï€ƒï€„ï€ï€ï€ï€…ï€ï€šï€˜ï€ï€¥ï€„ï€ï€‡ï€ï€¦ï€œï€ƒï€–ï€ï€Šï€‡ï€ï€‹ï€ï€‰ï€ï€‹ï€ï€¥ï€„ï€ï€ï€ï€ï€§ï€¨ï€¨ï€§ï€ï€©ï€ªï€«ï€›ï€
ï€ï€Šï€ï€Šï€Œï€ƒï€„ï€ï€‡ï€’ï€ï€—ï€ï€„ï€ï€šï€‰ï€Šï€‡ï€‹ï€ï€ƒï€ï€ï€…ï€ï€ï€ï€‘ï€ï€ƒï€ï€ï€™ï€‰ï€ï€ï€ï€ƒï€šï€Šï€‡ï€„ï€ï€ï€ï€ï€…ï€ï€Šï€‡ï€…ï€ï€ƒï€‘ï€
ï€ï€•ï€‡ï€‹ï€ï€—ï€ï€„ï€ï€žï€ï€ï€…ï€ï€ï€Šï€”ï€ï€‰ï€ï€‰ï€“ï€ï€Šï€ï€ƒï€‘ï€ï€—ï€ï€„ï€ï€Œï€ƒï€”ï€ï€šï€‡ï€ï€”ï€ï€…ï€
ï€ï€ï€‡ï€ï€…ï€ï€Šï€”ï€ï€ƒï€‘ï€ï€…ï€ï€Šï€˜ï€Šï€…ï€ï€ï€ï€ï€‰ï€ï€Šï€–ï€ï€ï€™ï€ï€‰ï€ï€—ï€ï€¬ï€ƒï€–ï€ï€ï€„ï€ï€šï€˜ï€ï€œï€”ï€ï€ï€
ï€ï€‹ï€ï€¬ï€‡ï€„ï€ï€ï€ï€ï€§ï€¨ï€¨ï€ï€ï€©ï€§ï€«ï€ï€Šï€†ï€ƒï€–ï€ï€‰ï€ï€“ï€ï€ƒï€‘ï€ï€Œï€ƒï€„ï€ï€ƒï€ï€ï€šï€‡ï€…ï€ï€ï€Œï€“ï€ï€…ï€ï€Šï€Œï€ƒï€„ï€ï€‰ï€ï€‹ï€ï€”ï€
ï€ï€Œï€‡ï€›ï€
ï€
ï€ï€Œï€“ï€ï€Žï€ï€…ï€ï€—ï€ï€Œï€“ï€ï€…ï€ï€ï€Šï€ï€‰ï€ï€ï€‡ï€‘ï€ï€’ï€ï€Žï€‡ï€‰ï€Šï€”ï€ï€ï€šï€”ï€ï€…ï€ï€ï€…ï€ï€ï€‡ï€‡ï€‹ï€Šï€ï€…ï€ï€šï€‡ï€ï€„ï€ï€Šï€‡ï€‹ï€ï€…ï€
ï€ï€Œï€ï€…ï€ï€“ï€ï€‰ï€ï€Šï€‡ï€…ï€ï€„ï€ï€‡ï€’ï€ï€Žï€ƒï€‹ï€‡ï€“ï€ï€®ï€‡ï€ï€ˆï€„ï€ï€‰ï€ï€ï€ƒï€„ï€ï€£ï€‡ï€‹ï€ï€Œï€“ï€ï€…ï€ï€Šï€Œï€ƒï€„ï€ï€…ï€
ï€ï€…ï€ï€Œï€“ï€ï€…ï€ï€ƒï€‘ï€ï€‰ï€ï€—ï€ï€„ï€ï€Šï€‡ï€…ï€ï€„ï€ï€’ï€ï€…ï€ï€Šï€ï€Žï€ï€“ï€ï€Šï€ï€£ï€‡ï€‹ï€ï€‹ï€ƒï€Œï€”ï€ï€…ï€ï€Šï€‡ï€…ï€
ï€
ï€ï€ˆï€„ï€ï€…ï€ï€ƒï€”ï€ï€–ï€ï€ï€ï€ï€ˆï€„ï€ï€ƒï€ï€ï€ï€ƒï€„ï€ï€£ï€‡ï€‹ï€ï€Œï€“ï€ï€…ï€ï€ï€Šï€†ï€ƒï€–ï€ï€ï€™ï€ï€šï€ƒï€…ï€ï€…ï€ï€Œï€‰ï€“ï€ï€‰ï€ï€‹ï€
ï€ï€„ï€ï€Œï€‰ï€“ï€ï€„ï€ï€®ï€‡ï€ï€‰ï€“ï€ï€ƒï€”ï€ï€ï€‡ï€ï€‰ï€ï€ï€‰ï€ˆï€ˆï€“ï€ï€Œï€‰ï€…ï€ï€ƒï€ï€ï€ƒï€‘ï€ï€ï€ƒï€„ï€ï€£ï€‡ï€‹ï€ï€Œï€“ï€ï€…ï€ï€…ï€ï€™ï€”ï€ï€‹ï€‡ï€ï€‰ï€ï€ˆï€„ï€
ï€ï€ï€ï€…ï€ï€’ï€ï€—ï€ï€„ï€ï€ƒï€ï€›ï€
ï€
ï€ï€¥ï€„ï€ï€‡ï€ï€¦ï€œï€ƒï€–ï€ï€Šï€‡ï€ï€‹ï€ï€ï€¯ï€›ï€ï€‰ï€ï€‹ï€ï€®ï€›ï€ï€°ï€›ï€ï€¥ï€„ï€ï€ï€®ï€›ï€ï€°ï€›ï€ï€±ï€”ï€ï€…ï€ï€‘ï€ï€ï€™ï€ï€—ï€ï€‰ï€Žï€šï€ï€™ï€”ï€ï€…ï€ï€žï€ï€ï€²ï€„ï€ï€ƒï€‘ï€ï€¬ï€”ï€ï€
ï€ï€™ï€”ï€ï€œï€‡ï€Œï€†ï€ï€ƒï€“ï€ï€§ï€¨ï€¨ï€§ï€ï€ï€ˆï€‰ï€™ï€‡ï€Šï€ï€´ï€ï€¦ï€¦ï€´ï€µï€ï€ï€¶ï€‰ï€„ï€ï€§ï€¨ï€¨ï€§ï€›ï€
ï€
ï€ï€œï€”ï€ï€ï€ï€ï€ ï€ï€‰ï€ï€‹ï€ï€¬ï€‡ï€„ï€ï€ï€®ï€›ï€ï€ï€ƒï€ï€Œï€“ï€ï€…ï€ï€Šï€Œï€ƒï€„ï€ï€‘ï€ï€—ï€ï€‡ï€Šï€ï€Œï€ƒï€„ï€ï€–ï€ï€…ï€ï€”ï€ï€ˆï€‡ï€„ï€ï€Œï€‡ï€·ï€›ï€ï€žï€ï€ï€²ï€„ï€ï€ï€™ï€Šï€
ï€ï€…ï€ï€ªï€¸ï€…ï€ï€ ï€ï€‰ï€ï€ï€ï€‰ï€…ï€ï€¥ï€ƒï€ï€‘ï€ï€Œï€‡ï€ï€¦ï€ï€¹ï€ƒï€“ï€ï€§ï€ºï€ï€»ï€ï€”ï€ï€‡ï€‹ï€ï€ï€ï€ï€¼ï€‡ï€–ï€ï€½ï€‡ï€‰ï€“ï€ï€‹ï€¾ï€›ï€ï€¿ï€›ï€ï€¯ï€Œï€†ï€‡ï€–ï€ï€‰ï€ï€‹ï€ï€¬ï€›ï€
ï€ï€“ï€ï€‰ï€Žï€Šï€ï€ï€ï€‹ï€Šï€›ï€ï€ ï€ï€žï€ï€…ï€ï€‰ï€…ï€ï€ƒï€ï€‰ï€“ï€ï€¥ï€ƒï€ï€‘ï€ï€Œï€‡ï€ï€²ï€„ï€ï€ï€™ï€ï€¯ï€‡ï€„ï€ï€‡ï€Šï€ï€ï€’ï€ï€¸ï€§ï€›ï€ï€ ï€ï€‰ï€ï€ï€¥ï€ƒï€Žï€ˆï€”ï€ï€¯ï€ƒï€Œï€ï€‡ï€…ï€ï€
ï€ï€‰ï€„ï€ï€ï€™ï€†ï€”ï€ï€ï€ ï€ï€‰ï€ï€ï€µï€¸ï€¦ï€´ï€ªï€ï€ï€§ï€¨ï€¨ï€ï€›ï€
34
Less Searching, More Finding: Improving Human Search Productivity
William A. Woods
ITA Software
Finding information and organizing information so that it can be found are two key aspects of any
company's knowledge management and knowledge delivery strategy. This talk will describe what I
have learned from years of thinking about these problems and from a project I led at Sun
Microsystems Laboratories that addressed these problems by combining the respective strengths of
humans and computers in a knowledge-based system to help people find information. It explored a
new search technology aimed at addressing problems that hinder human search effectiveness, and it
developed techniques that provide a user with the necessary information to quickly decide whether a
document has the information being sought. Unlike many previous attempts to improve search
effectiveness, this system demonstrated a substantial improvement in human search productivity.
The system combines a technique for automatically constructing a semantic conceptual index of the
material to be searched with a new passage retrieval algorithm that finds specific passages of text that
are likely to contain the information sought. The first technique can supplement or replace expensive
manual indexing of material, and it provides a semantically oriented conceptual structure that can be
intuitively browsed by information seekers. The information in this conceptual index can also be used
by the passage retrieval algorithm for find passages that are relevant to a request but use different,
semantically related terms than those used in the query.
The system makes use of linguistic and world knowledge and exploits sophisticated knowledge
representation techniques. It combines linguistic knowledge and natural language processing with
knowledge representation techniques to automatically construct an intuitive conceptual taxonomy of
all the words and phrases found in the material, augmented with additional semantically related terms,
and organized by generality. More general terms occur higher in the taxonomy, while more specific
terms are linked below more general terms that "subsume" them. This provides useful information for
the search algorithm, which automatically includes any terms that are subsumed by the requested
query terms, and it is also an intuitive structure for human browsing. The system provides an
integrated framework for combining searching and browsing and allows a user to switch easily
between the two perspectives.
Why is searching so difficult?
For years I have been asking myself this question, beginning when I took a course in information
retrieval as a graduate student and was appalled to discover what information retrieval systems
actually did. I expected the system to understand what I was asking about and find documents that
were about that. I discovered that all these systems did was count occurrences of words and push the
numbers through a simple equation to compute a ranking. Since then, I have been trying to develop
systems that come closer to my original assumption. It turns out that this involves a lot of linguistic
and cognitive science insights and the use of some real knowledge (in addition to good algorithms and
human factors).
One of the problems that makes searching difficult is that people often ask for information using
different terminology from that used in the material that they need to find. To be successful a search
engine needs to use knowledge to make connections between what the user asks for and what they
need to find. Two sources of such connections are morphological and semantic relationships. These
35
are addressed by automatically constructed conceptual taxonomy mentioned above. Another problem
is that of finding out whether a retrieved document actually has the information that you are seeking.
This is addressed by the specific passage retrieval algorithm and features provided in the human
interface to the system. Both techniques will be described, together with some examples and some
experimental results.
References:
"Linguistic Knowledge can Improve Information Retrieval," William A. Woods, Lawrence A.
Bookman, Ann Houston, Robert J. Kuhns, Paul Martin, and Stephen Green, Proceedings of ANLP-
2000, Seattle, WA, May 1-3, 2000, (preliminary version: Technical Report SMLI TR-99-83, Sun
Microsystems Laboratories, Mountain View, CA, December, 1999. Online at:
http://www.sun.com/research/techrep/1999/abstract-83.html
"Aggressive Morphology for Robust Lexical Coverage," William A. Woods, Proceedings of ANLP-
2000, Seattle, WA, May 1-3, 2000, (preliminary version: Technical Report SMLI TR-99-82, Sun
Microsystems Laboratories, Mountain View, CA, December, 1999. Online at:
http://www.sun.com/research/techrep/1999/abstract-82.html)
"Conceptual Indexing: Practical Large-Scale AI for Efficient Information Access," William A. Woods,
Proceedings AAAI 2000, Austin, TX, August 2, 2000.
"Searching vs. Finding," William A. Woods, ACM Queue, Vol. 2, Issue 2 (April 2004), pp 27-35.
(available online at: http://portal.acm.org/citation.cfm?id=988405, or at
http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=137).
36
Mediating between User Query and User Model
with Adaptive Relevance-Based Visualization
Jae-wook Ahn and Peter Brusilovsky
School of Information Sciences, University of Pittsburgh
{jaa38, peterb}@pitt.edu
Abstract
Personalized information retrieval systems [1] seek to adapt search results to long term interests of an individual
users represented in a user profile (also known as user model). One of the problems of these systems is how to
“fuse†query-based and profile-based document rankings in search result presentation. The traditional solution to
this problem, which is applied in several adaptive search systems, is to select a fixed mediation point
!
between 0
and 1 and to produce a personalized rank by fusing query- and profile-based rankings with coefficients
!
and (1-
!
).
By manipulating
!
, the system designers can give more priority to documents similar to the query or documents
similar to the profile. This paper presents a more flexible approach to “fusing†query- and profile-based rankings.
The idea of this approach is to allow the analysts to dynamically decide whether they are interested in documents
which are closer to the query or documents which are closer to the user profile – with the ability to navigate on a
continuum between the query to the user profile and back again.
The core component of our approach is the relevance-based visualization originally implemented in VIBE [2].
VIBE is known as an excellent tool for visual query results analysis. VIBE supports POI (Point of Interest)-based
browsing. POIs represent key concepts or keywords and are displayed as user-draggable icons on the screen. The
documents are placed according to their similarities to the POIs. Users can drag and move POIs anywhere they want
and the locations of the documents are dynamically updated depending on their similarities to the POIs. It allows the
user to explore the connection between search results and query terms, for example, enabling the user to pick a
subset of results that is more relevant to a specific query term or group of terms.
In our project we applied relevance-based visualization to help the user to mediate between the query terms and
terms from the user profile.
Our key idea is to use both query terms and profile terms as POIs.
The application of
the user profile makes the relevance-based visualization adaptive. The results of the visualization are different for
different users who have submitted the same query and even different over time for the same user, if the interests of
the user represented in the user profile evolve. Our poster presents our implementation of VIBE for adaptive
relevance-based visualization, stresses several features that are critical for this type of visualization, and shares some
evaluation results. This work is a part of our broader agenda on using adaptive visualization to increase the
interactivity and expressiveness of personalized information access systems. The rest of the position paper presents
the idea of our approach using a practical example.
Figure 1 shows an example of applying VIBE to the query and profile fusion problem. The circles colored in pink
and green are POIs representing two different sets of terms: query terms (pink) and profile terms (green). In this
example a user entered a query “NUCLEAR WEAPON†and the system retrieved relevant articles with high
similarity scores (which will be discussed in detail in Section 3). White squares represent these retrieved documents
and users can examine their titles and summaries by hovering the mouse cursor over the square icons. We extracted
10 profile terms and displayed the top 5 of them as green circles on the screen. The rest of the profile terms are
disabled temporarily and docked in a white box at the corner of the screen (4 in this case because one term,
NUCLEAR, overlapped the query). Users are able to freely move both query and profile terms and explore which
document is related to which POI (or term).
This example clearly demonstrates the difference between the traditional search result (query-based ranked list),
an adaptively re-ranked result (profile-based ranked list) and our flexible approach exploiting VIBE. Originally, the
search engine results contain the top 5 articles on Iranian nuclear weapon development. The ranked list sorts the
documents by their relevance score and users typically examine the top ones first. This result is appropriate if the
user in this example was most interested in recent events in Iran. However, let’s consider a user who is interested in
Korean affairs including North Korean nuclear weapon development. Over the weeks of using the system, this user
37
has been accumulating terms like KOREA, NUCLEAR, JAPAN, and NORTH in her profile of interests. Proponents
of adaptive search and filtering systems would argue that this user would be most interested to see information about
North Korean, not Iranian nuclear programs and would prefer to see news ranked according to her profile with North
Korean news emerging on the top of the list. Unfortunately, in a realistic context it is hard to decide what is the real
need of the user because of a lack of information. Her interests may have remained the same (i.e., she does prefer
news on North Korean nuclear developments) or may have switched to a different direction (i.e., she is interested in
seeing up-to-date news about other programs).
Figure 1. VIBE Visualization for Query and Profile-based Ranking Fusion
“Fusing†query-based ranking and profile-based ranking is a more reliable way to assist the user in an
ambiguous context. VIBE allows users interactively explore the query terms, profile terms, and the retrieved
documents simultaneously. The users are able to understand the relationships among these three components and
discover relevant information more easily. The example above clearly shows the benefits of our approach. By
examining the locations of the articles using VIBE, it is surprising that a lot of articles are placed closer to a profile
term (KOREA) than the query terms (NUCLEAR and WEAPON). This result is very interesting because the
documents visualized here are exactly the same set of articles displayed in the query-based ranked list retrieved by a
conventional search engine, where the top 5, most important articles were about Iranian nuclear weapon
development. Our approach can provide users with the flexibility to intuitively discern which documents are more
related to the query or the profile terms by just glancing at the picture. We don’t have to choose either of the two:
query or user profile-based ranked list. We can merely show the relatedness of each document to each of the
concepts and let users visually explore to understand what the situation really is.
Acknowledgements
This paper is partially supported by DARPA GALE project and by the National Science Foundation under Grant
No. 0447083.
References
1. Micarelli, A., Gasparetti, F., Sciarrone, F., Gauch, S.: Personalized search on the World Wide Web. In:
Brusilovsky, P., Kobsa, A., Neidl, W. (eds.): The Adaptive Web: Methods and Strategies of Web
Personalization. Lecture Notes in Computer Science, Vol. 4321. Springer-Verlag, Berlin Heidelberg New York
(2007) 195-230
2. Olsen, K.A., Korfhage, R.R., Sochats, K.M., Spring, M.B., Williams, J.G.: Visualisation of a document
collection: The VIBE system. Information Processing and Management 29,
1 (1993)
Profile terms
Query terms
38
Characterization of Diagrams and Retrieval Strategies for Them
Robert Futrelle
Northeastern University
There are a few hundred million diagrams available on the web, by rough estimate. They cover
every imaginable topic. But quality retrieval of the "diagram you want" is close to impossible,
because virtually all current methods rely entirely on the accompanying or referring text to
characterize diagram content. Our lab has worked on a variety of aspects of diagrams and their
internal content for a number of years, with one of the major goals being how to build IR
systems for them. We have published diagram-related papers on machine learning for
classification, constraint-grammar-based parsing, ambiguity, summarization, text-diagram
interrelations, ontologies, and vectorization of diagram images. Much of our work has been
focused on the diagrams that typically appear in papers from the biomedical research literature.
This talk will range over the portions of our research most relevant to IR, arguing that many of
the topics we have studied need to be kept in mind in building future systems for diagram
analysis, representation, interaction, and retrieval.
39
HUMAN COMPUTATION FOR HCIR EVALUATION
Shiry Ginosar
Endeca Technologies, Inc.
101 Main Street, Cambridge, MA 02142
sginosar@endeca.com
ABSTRACT
A novel method for the evaluation of Interactive IR systems
is presented. It is based on Human Computation, the
engagement of people in helping computers solve hard
problems. The Phetch image-describing game is proposed as
a paradigmatic example for the novel method. Research
challenges for the new approach are outlined.
Index Terms—
Interactive IR and HCIR evaluation,
Web-based games.
1. INTRODUCTION
There are currently two main approaches to evaluation of IR
systems - the TREC conference approach and the HCI
approach, and neither is optimal across the wide range of
systems that exist today. In particular, evaluation paradigms
for Interactive IR systems are interesting to investigate,
since on the one hand the TREC evaluation method cannot
be applied here [7,8] while on the other hand HCI methods
tend to be hard to generalize.
In this paper, we consider the relative value of the two
primary approaches to this problem and propose and discuss
a novel approach to evaluating IR and Interactive IR
systems that uses Human Computation [1]. This approach
extends TREC evaluation metrics so that it can be
applicable to interactive systems, and it improves upon HCI
methods by reducing their subjectivity.
2. EXISTING IR EVALUATION METHODS
The first approach for IR systems evaluation, taken by
TREC [http://trec.nist.gov] is based on a batch evaluation.
The queries and corpus to be used are decided upon
a priori
and the entire corpus is relevance-ranked by hand for each
of the queries. Each IR system is then queried using a batch
process with the pre-compiled queries over the given
corpus. The resulting relevance-ranked set of documents is
then compared to the pre-annotated “gold standard†and
scores such as
precision
and
recall
are computed [10,13].
The batch process approach is arguably a successful
measure of goodness for the effectiveness of the IR system
itself [10,11].
However, evaluating an IR system using a batch
process may fail to capture the intended use of systems that
are designed to support other information discovery
processes [13]. This is especially true in regards to
evaluating Interactive IR systems. On the one hand, classic
IR evaluation relies on a one click paradigm where queries
are first composed in full and then sent to the systems to
compute a static set of answers [10,13]. On the other hand,
Interactive IR systems are often designed to enable a user to
iteratively formalize the query. Since the query as a whole is
not known
a priori
, there is no way to assess the relevance
of documents in the corpus in advance and therefore there is
no way to compose a gold standard with which to compare
results returned from different systems. Thus, alternative
methods must be used in order to evaluate such systems [7,
8].
The second approach to evaluation of IR systems, used
primarily within the HCI community, focuses on task level
evaluations rather than evaluating the results for individual
queries. Such evaluations often employ a mix of objective
and subjective metrics such as completion time, user
satisfaction and perceived user success [8,14]. Since the
metrics used by HCI are partially subjective and since the
tasks performed during the evaluation are highly correlated
with the specific system and the specific corpus used [8,14],
it is hard to compare different systems and the results of
these evaluations are rarely accepted by the greater IR
community.
Furthermore, HCI evaluations that are set up as user
studies are often stymied by the lack of willing participants,
the need to compensate participants and the difficulties of
recruiting participants from outside the specific university
or company where the study is conducted. These hardships
can result in lack of data or lack of a sufficiently varied
participant population, both of which make it difficult to
make statistically significant claims.
3. HUMAN COMPUTATION EVALUATION OF IR
In this paper, we propose a new approach to evaluation of
IR and Interactive IR systems. The goal of this line of
40
thought is to design a system that will allow users to
perform search tasks in a natural way, while assessing the
quality of the system as well as the success and satisfaction
of the users in the background. This approach is not
intended to achieve a mapping to the classical evaluation
scores used by TREC (unlike [11] which claim to
successfully do so, or [7,13] which claim that there is no
correlation between user success and TREC metrics).
Rather, we seek a new scoring system that will be able to
compare different user-systems combinations.
Moreover, to overcome the hardships of recruiting
individuals for participation in user studies, we propose to
incorporate the concept of Human Computation [1] into the
design of our system. Human Computation engages people
to aid computers in completing tasks which are either too
hard or too expensive for computers to do on their own.
Most Human Computation systems are designed as games
[1,2,3,4,5,6] which people enjoy playing, or as verification
systems which act as gateways to information that people
want
to
access
[http://www.captcha.net/,
http://recaptcha.net/]. However, a Human Computation
system is more than a game: it is cleverly designed such that
as a side effect of game play or everyday tasks such as
logging in to an email account, useful information can be
collected.
4. AN EXAMPLE EVALUATION USING A GAME
As an example of the Human Computation evaluation
paradigm, we will investigate in more detail the possible use
of the online game Phetch [4] that can be hooked up to
different IR systems [3]. Phetch requires players to perform
search tasks in order to advance in the game. In Phetch, a
describer generates a text description of an image and
multiple seekers race to identify the described image out of
a large collection of similar images. People play the game
because it is fun, and as a side effect of game play the set of
IR systems supporting the game may be evaluated. Since the
game is interactive in nature, this type of evaluation is
suited for IR as well as Interactive IR systems.
There are several advantages for using a game like
Phetch for evaluation. First and foremost, since the game
involves users performing search tasks while trying to fulfill
an information need, it naturally lends itself to evaluation of
not only IR systems but also of Interactive IR systems.
Second, the search task itself within the game is done in
a natural way. Players are presented with an item (in this
case, an image) that they need to find, and they are expected
to devise their own ways in which to find it. This type of
search task is very similar to search tasks that users of IR
systems perform in real life scenarios and therefore would
eliminate the need to come up with a contrived
simulated
work task situation
for the purpose of the evaluation [9].
Third, a game like Phetch outputs a clean scoring
number for players in the game. This score encapsulates the
success of the player both as a seeker who searches for
images as well as a describer who describes images for
others to find. It depends on the randomly chosen image as
the goal of the search, on the speed in which the player
processes visual and language information, on the
opponents she played against and even on the speed of her
internet connection. However, an average scoring over
many players, many images and many game sessions could
potentially serve as a form of measure of goodness for the
combination of a generic user with the specific IR system
that was hooked onto the game. This scoring could later be
incorporated with other metrics from HCI user studies or
batch processes performed against all or part of the IR
system to produce a more accurate metric. Repeating the
same setting of game play with the same corpus using other
IR systems would produce similar scoring which could then
be compared with the first, resulting in an overall
comparison between the two IR systems and the ways in
which they allow users to interact with them.
In this way, a Human Computation system could
potentially bridge between the two different approaches to
IR evaluation. It could provide a clean score to aid the
current ways of comparing different IR systems while also
taking into account user interaction with the system as well
as the performance of the system itself.
From our experience with Phetch we learned that it can
be employed as a possible Human Computation evaluation
tool, but an interesting problem is how to apply the concepts
from Phetch to non-image domains, in particular text
documents.
5. CHALLENGES AND OPEN QUESTIONS
Using Human Computation for evaluation of IR systems
requires further research. In particular, this paradigm should
be correlated with accepted figures of merit of IR systems
that are used by TREC and HCI methods, such as accuracy,
precision, recall, success and satisfaction of users.
Additional work may be required for interfacing a Human
Computation game with other types of IR systems. For
example, in systems that support faceted metadata browsing
such as Flamenco [14] and Endeca’s [www.endeca.com]
Guided Navigation, the corpus should be pre-processed to
organize flat tags hierarchically, for which many automatic
and semi-automatic methods are available [12]. It is clear
that when applying different IR interfaces to the same
corpus, the quality of data preprocessing to tailor it to the
specific interface could dramatically impact the results of
the evaluation. Dealing with preprocessing could potentially
be achieved in a manner similar to the one taken by TREC,
where competing teams are required to submit a system that
interfaces with the Game. The corpus would be known
ahead of time, so each team could do their best effort on
data preprocessing. Assuming no
a priori
knowledge of the
41
corpus may also lead to interesting results, but is not
currently under consideration.
6. ACKNOWLEDGEMENTS
I thank Daniel Tunkelang, Michael Tucker, Robin Stewart
and Andrew Schlaikjer for their insightful comments. I
further thank Dr. Luis von Ahn for introducing me to the
exciting domain of Human Computation.
7. REFERENCES
[1] von Ahn
,
L.
Games With A Purpose. In
IEEE Computer
Magazine,
June 2006
,
pp. 96-98.
[2] von Ahn, L., and Dabbish, L. Labeling Images with a
Computer Game. In
ACM Conference on Human Factors in
Computing Systems (CHI),
2004, pp. 319-326.
[3] von Ahn, L., Ginosar, S., Kedia, M., and Blum, M. Improving
Image Search with Phetch. In
IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP)
2007, Vol. 4
pp. IV-1209-IV-1212.
[4] von Ahn, L., Ginosar, S., Kedia, M., Liu, R., and Blum, M.
Improving Accessibility of the Web with a Computer Game. In
ACM Conference on Human Factors in Computing Systems (CHI),
2006, pp. 79-82.
[5] von Ahn, L., Kedia, M., Liu, R., and Blum, M.
Verbosity: a
game for collecting common-sense facts. In
ACM Conference on
Human Factors in Computing Systems (CHI),
2006, pp. 75 – 78.
[6] von Ahn, L., Liu, R., and Blum, M.
Peekaboom: a game for
locating objects in images. In
ACM Conference on Human Factors
in Computing Systems (CHI),
2006, pp. 55-64.
[7] Al-Maskari., A. Beyond Classical Measures: How to Evaluate
the Effectiveness of Interactive Information Retrieval System?. In
ACM Special Interest Group on Information Retrieval (SIGIR),
2007, pp. 915.
[8] Borlund, P., The IIR Evaluation Model: A Framework for
Evaluation of Interactive Information Retrieval Systems.
Information Research,
2003, Vol. 8, No. 3.
[9] Borlund, P., and Ingwersen, P. The Development of a Method
for the Evaluation of Interactive Information Retrieval Systems.
Journal of Documentation,
53(3), 1997, pp. 225-250.
[10] Cleverdon, C. The Cranfield Tests on Index Language
Devices.
Aslib Proceedings,
19:173-192, 1967. (Reprinted in K.
Spark Jones and P. Willet, editors.
Readings in Information
Retrieval.
Morgan Kaufmann Publishers Inc., 1997)
[11] Huffman, S., and Hochster, M. How Well does Result
Relevance Predict Session Satisfaction?.
ACM Special Interest
Group on Information Retrieval (SIGIR),
2007. pp. 567-573.
[12] Stoica, E. and Hearst, M. Nearly Automated Metadata
Hierarchy Creation.
HLT-NAACL,
2004. Companion Volume.
[13] Turpin, A., and Scholer, F. User Performance versus Precision
Measures for Simple Search Tasks.
ACM Special Interest Group
on Information Retrieval (SIGIR),
2006. pp. 11-18.
[14] Yee, K.P., Swearingen, K., Li, K. and Hearst, M. Faceted
Metadata for Image Search and Browsing. In
ACM Conference on
Human Factors in Computing Systems (CHI),
2003.
42
Jigsaw: a Visual Index on Large Document Collections
Carsten Görg
John Stasko
School of Interactive Computing & GVU Center
Georgia Institute of Technology
{goerg,stasko}@cc.gatech.edu
Abstract
Investigative analysts who work with collections of text documents connect embedded
threads of evidence in order to formulate hypotheses about plans and activities of
potential interest. As the number of documents and the corresponding number of
concepts and entities within the documents grow larger, sense-making processes become
more and more difficult for the analysts. We have developed a visual analytic system
called
Jigsaw
that represents documents and their entities visually in order to help
analysts examine reports more efficiently and develop theories about potential actions
more quickly.
The
Jigsaw
system provides multiple coordinated views that show connections between
entities (like people, places, organizations, dates,
etc.
) across documents. A connection
between two entities is defined as a co-occurrence in at least one document.
To allow
Jigsaw
to handle large datasets, the system does not show the entire dataset at
once but uses an incremental query-based approach to show a subset of the dataset. The
query window allows analysts to search for entities and also provides a text search within
the documents.
Jigsaw
’s query approach is different from traditional search engines.
Getting a list of ranked documents as a result of a query would not be sufficient for
analysts’ tasks because their activities go beyond just looking for a set of documents.
Analysts also care about understanding what is inside of a document and how those
entities are connected to entities in other documents. To support that task
Jigsaw
acts as a
visual index on the document collection: the query results, consisting of entities and
documents, are sent to multiple views that show different perspectives on the connections
between those elements. The analysts can interact with the views, apply filters, or expand
the context to gain more insight about the document collection. This exploration then
spurs further queries and retrieves other documents and entities. While acting as a visual
index,
Jigsaw
guides the analysts to related documents and facilitates the information
retrieval process.
Jigsaw
presents documents and entities resulting from queries through six different types
of views. Therefore, the availability of significant screen space is very beneficial. The
Text View shows document text, allowing analysts to validate connections, providing
their context, and giving access to information that is not extracted as an entity. The List
and Graph Views display connections between entities and allow analysts to explore the
connection network. The Scatter Plot View highlights pairwise relationships between any
43
two entity types. The Time Line and Calendar Views organize entities and reports by date
to ease the search for time patterns.
Figure 1 shows three different views after querying for “Faron Gardner†and exploring
the query result. The Text View shows documents related to the query in three tabs.
Entities within those documents are color coded accordingly to their type. The List View
shows people and organizations connected to Gardner, with a darker shade of orange
indicating a stronger connection. The Graph View displays the documents in which
Gardner is mentioned, as well as the entities within these documents. Thus, it is easy to
see which entities are mentioned in multiple documents.
Figure 1: The Text View, List View, and Graph View showing different perspectives after querying
for “Faron Gardner†and exploring the query result.
Jigsaw
’s views are coordinated using an event mechanism: interactions with one view
(selecting, adding, removing, or expanding entities) are transformed into events that are
then broadcast to all other views. Thus, the views of the system stay consistent and
provide different perspectives on the same data.
For a detailed description of the system, we refer the reader to an article
1
about
Jigsaw
in
the VAST '07 proceedings and to a video on the project website
2
that shows interaction
with the system.
1
J. Stasko, C. Görg, Z. Liu, and K. Singhal. Supporting Investigative Analysis through Interactive
Visualization. In
IEEE Symposium on Visual Analytics Science and Technology
, October 2007.
2
Jigsaw project. http://www.gvu.gatech.edu/ii/jigsaw/.
44
!"#$%&'&()#&*)+"#,&'&()'&)-'."*/0&'1'#1'2")3#$4$)
!"#$%&'()%*&+,-"#$%./01/2(1$3(4
5)6%$*"$&7$889%&:%";$60"<-&
=%#962)<"9%&2)%)*$2$%<&>$/92$0&296$&)%3&296$&/,)88$%*"%*&?"<,&"%#962)<"9%&/922(%"/)<$3&
$8$/<69%"/)88-&96&6$<6"$;$3ʴ&<,$&?$>1&@&,(2)%&(0$6&"0&9#<$%&9;$6?,$82$3&>-&<,$&)29(%<&9#&
"%#962)<"9%&<,)<&"0&$)0"8-&)//$00">8$&<9&<,$21&A,$&)BB69)/,&9#&'(2)%C592B(<$6&=%<$6)/<"9%&+'5=4&
3$0"*%0&"%<$6#)/$0&<9&8$;$6)*$&9##&)&(0$6D0&$##96<01&&E%&<,$&9<,$6&,)%3F&@6<"#"/")8&=%<$88"*$%/$&+@=4&
3$;$89B0&296$&)(<9%929(0&<$/,%"G($0&"%&"%#962)<"9%&6$<6"$;)8F&3)<)&2"%"%*F&)%3&8$)6%"%*1&&@8<,9(*,&
>9<,&)BB69)/,$0&)"2&<9?)63&<,$&0)2$&*9)8F&<,$-&,);$&B69/$$3$3&"%&<,$&B)0<&?"<,&<99&8"<<8$&"%<$6)/<"9%1
=%&)&0"2B8"#"$3&%9<"9%F&)&/92B(<$6&,)0&<?9&/)B)>"8"<"$0H&+I4&)%&"%#$6$%/$&#(%/<"9%&<,)<&<)J$0&"%B(<0F&$1*1F&
)&0$<&9#&39/(2$%<0F&)%3&B693(/$0&9(<B(<0F&$1*1F&)&6)%J&8"0<&9#&<,$0$&39/(2$%<0F&)%3&+K4&6$B6$0$%<)<"9%&
9#&"%#$6$%/$&6$0(8<01&A,$&#"60<&/)B)>"8"<-&"0&@=&6$8)<$3&)%3&<,$&0$/9%3&/)B)>"8"<-&"0&'5=&6$8)<$31&=&?)%<&<9&
B69B90$&)&%$?&2"L$3C"%"<")<";$)2$?96J&<,)<&"%;98;$0&>9<,&)BB69)/,$0&)%3&,9B$#(88-&>990<0&<,$&
G()8"<-&9#&<,$&098(<"9%&M9"%<8-1&A,$&"3$)&"0&)0&<,$⚡?"%*H&"%&)33"<"9%&<9&)&/92B(<$6F&)&(0$6&/)%&)809&
*";$&#$$3>)/J&9%&?,)<&6$B6$0$%<)<"9%&?$&0,9?&<9&<,$&(0$6&)%3&<,$&"%#$6$%/$&#(%/<"9%&/)%&>$&6$C<6)"%$3&
>)0$3&9%&(0$6&#$$3>)/J1&A,"0▔&)&3"6$/<$3&899Bʴ&<,$&@=&)BB69)/,&<9&<,$&'5=&)BB69)/,&<9&<,$&
(0$6&)%3&>)/J&<9&<,$&@=&)BB69)/,1
@⚡?"%*&G($0<"9%&"0&?,)<&J"%3&9#&"%#962)<"9%&/)%&>$&B)00$3&<,69(*,&<,"0&899B&)%3&,9?&"<&/9%<6">(<$0&
<9&<,$&9;$6)88&B$6#962)%/$1&N$&,);$	(%3&"<&"0&G("<$&(0$#(8&<9&$L<6)/<&B69B$6<"$0&9#&<,$&"%#$6$%/$&
#(%/<"9%&)%3&0,9?&<,$0$&,-B9<,$0"O$3&B69B$6<"$0&<9&)&(0$6&>$/)(0$&)&(0$6&"0&9#<$%&/8($8$00&"#&9%8-&
B6$0$%<$3&>-&"%#$6$%/$&6$0(8<01&
&
P96&$L)2B8$F&Q$<#8"L&6$/922$%3$3&)&29;"$F&A,$&8)0<&J"%*&9#&
R/9<8)%3F&<9&2$1&R"%/$&=&?)0&%9<&#)2"8")6&?"<,&<,$&6$/922$%3$3&
29;"$F&=&,)3&%9&"3$)&"#&"<&?)0&)&*993&6$/922$%3)<"9%&96&%9<&"#&<,"0&
?$6$&<,$&9%8-&"%#962)<"9%&=&*9<1&'9?$;$6F&Q$<#8"L&0,9?$3&2$&<,)<&
"<&0(**$0<$3&<,"0&29;"$&>$/)(0$&9#&092$&29;"$0&=&6)%J$3&,"*,&
>$#96$1&S$0B"<$&<,$&#)/<&=&,)3%D<&,$)6<&)>9(<&<,"0&6$/922$%3$3&
29;"$F&=&*9<&<,$&>)0"/&"3$)&>$#96$&/8"/J"%*&"%<9&"<0&"%<693(/<96-&
B)*$&>$/)(0$&=	(%3&0"2"8)6"<"$0&>$<?$$%&B6$;"9(0&29;"$01&
R,9?"%*&B6$;"9(08-&$%M9-$3&29;"$0&"0&)%&$L)2B8$&9#&6$B6$0$%<"%*&
,-B9<,$0"O$3&B69B$6<"$0&9#&"%#$6$%/$&#(%/<"9%01
P(6<,$6296$F&)&0-0<$2&/)%&/988$/<&(0$6&#$$3>)/J&9%&%9<&9%8-&
"%#$6$%/$&6$0(8<0&>(<&)809&,-B9<,$0"O$3&B69B$6<"$0&9#&"%#$6$%/$&
#(%/<"9%01&:0$6&#$$3>)/J&/)%&>$&"%<$6B6$<$3&)0&293"#"$3&
B69B$6<"$01&N"<,&B69B$6&293$8&)3)B<)<"9%F&?$&/)%&6$C<6)"%&)%&
"%#$6$%/$&#(%/<"9%&09&"<0&%$?&,-B9<,$0"O$3&B69B$6<"$0&*$<0&/890$6&
<9&)&(0$6D0&293"#"$3&B69B$6<"$0&9%&<,$&0)2$&<)0J1&P96&$L)2B8$F&"#&
Q$<#8"L&B69;"3$0&)&?)-`&2$&<9&0)-&T7)6")&P(88&9#&U6)/$&)%3&'9<$8&
V?)%3)&)6$&)%&"%<$6$0<"%*&29;"$&B)"6`&<,"0&6$/922$%3$3&
29;"$FT&?$&/)%&(0$&<,"0&293"#"$3&B69B$6<-&"%&#(<(6$&
6$/922$%3)<"9%0&CC&?,$%&<,$&0-0<$2&#"%30&<?9&%$?&
6$/922$%3)<"9%0H&29;"$&@&"0&)009/")<$3&?"<,&7)6")&P(88&9#&U6)/$&
)%3&'9<$8&V?)%3)F&)%3&29;"$&W&"0&)009/")<$3&?"<,&7)6")&P(88&9#&
U6)/$&)%3&V)-1&A,$&0-0<$2&/)%&8$)6%&>-&B)0<&#$$3>)/J&<,)<&29;"$&
@&"0&B69>)>8-&)&>$<<$6&6$/922$%3)<"9%&>$/)(0$&"<&"0&)009/")<$3&
?"<,&)&>$<<$6&B69B$6<-1&
X)0<F&?$&?9(83&8"J$&<9&J$$B&B6$;"9(08-&9><)"%$3&(0$6&#$$3>)/J&"%&<,$&899B&09&)&(0$6&,)0&%9&%$$3&<9&*";$&
<,$&0)2$&#$$3>)/J&<?"/$1&A,"0&/)%&>$&39%$&>-&;)8"3)<"%*&"#&B6$;"9(0&293"#"$3&B69B$6<"$0&0<"88&$L"0<&"%&
/(66$%<&"%#$6$%/$&6$0(8<0&96&,-B9<,$0"O$3&B69B$6<"$01&=#&<,$-&0<"88&$L"0<F&?$&/)%&0,9?&<,$2&"%&<,$&
"%<$6#)/$&<9&6$2"%3&)&(0$6&)>9(<&<,$"6&B6$;"9(0&3$/"0"9%01&
45
1
Resonance: Introducing the concept of penalty-free deep look
ahead with dynamic summarization of arbitrary result sets
Blade Kotelly
Endeca Technologies
101 Main St. Cambridge, MA, USA
blade@endeca.com
INTRODUCTION
Providing the ability for someone to be able to see what’s
around the corner is a hallmark of good HCIR. A great
example of this is when there are various refinements that a
user could make to refine a set, and next to each refinement
is a count which tells the user how many results would be
left in the set if they choose that particular refinement.
EXAMPLE:
The refinement count next to “Projection TVs†tells me that
if I chose that option, my next set would contain only 20
items. This kind of look-ahead is very useful and easy to
understand. But how can I give a similar ability for a user to
see what’s under the covers when it comes to learning about
what people are saying about a product? Right now, there
are a plethora of sites that use user-reviews and user-ratings
to help people evaluate a product.
These reviews are very helpful, but the problem starts when
a user is confronted with a number of reviews that is too
numerous to read in a timely manor. Often there are several
reviews, from as few as 20 to more than 70, associated with
a single product on sites like CircuitCity.com. If you were
to look at a set of televisions say, all 151 plasma TVs,
you’d see 363 reviews covering the first 10 products alone!
Even if you get down to one product, but see that the
product has more than 100 reviews it wouldn’t be
reasonable to expect most users to read them all. They’ll
typically read the most favorable reviews (why should I buy
it?) and the least favorable ones (what’s wrong with it?) If
you can winnow the set of reviews down by some method,
perhaps by looking only at reviews from people who
consider themselves professional users, or people who are
in a particular age range, then the remaining reviews, which
would normally be unread, would become more useful.
The problem is that you can’t examine the reviews before
you start reading by any method that will give you a sense
of what the reviews are talking about, in general, and you
can’t read the reviews across multiple products to see what
people are talking about within a certain category to
compare 2 arbitrary categories against each other (e.g.,
what are people saying about large, Panasonic plasma TVs
compared to large Sony TVs? Or what are people saying
about large, Panasonic TVs compared to small, Panasonic
TVs?)
PROPOSED SOLUTION
Here is a proposed technique that could be used to provide
some deep look ahead that would give the user an
understanding of what people are saying, in aggregate,
about an arbitrary collection of items.
Applying certain technologies, it’s easy to automatically go
through an unstructured document and extract terminology,
such as a set of terms associated with reviews of consumer
products, like “banding†and “exceptional picture qualityâ€.
It is also possible to cluster those phrases so that they form
logical groupings. For example, when clustering
technology is used to sort through all the text returned from
a query on Wikipedia for the term Apple, we see that term-
based clusters are formed in which one cluster has words
like, core, juice production, etc. and a different cluster has
words like, iPod, Computer and Steve Jobs.
If we can take the same technology and apply it to the
unstructured text of a set of user reviews, we could again
extract all the terms in the set of reviews, form clusters, and
present those to the user so that they could use a cluster to
further filter the set of reviews. However, this isn’t that
useful because it would require the user to attempt to
understand the meaning about each cluster and then
46
2
determine if choosing that cluster would help them get to a
set of reviews that were talking about a particular topic.
Also, when terms and phrases are presented without
context, the user doesn’t know the valence of a particular
idea, i.e., “
contrast ratio
†could be used in the phrase
“phenomenal
contrast ratio
†as well as “the worst
contrast
ratio
of all the TVs on the market.â€
A more useful interaction would be for the system to
automatically extract the terms, and perhaps when the user
rolls over a dimension refinement (e.g., “ 33†– 42†(8))
they would be presented with a short set of snippets which
talked about the various hot topics in that set. It’s almost as
if we were able to get people in a room, with some arbitrary
set of televisions (in this case, say the 8 available
televisions which are between 33†and 42†inches) and see
what they were talking about the most – to understand what
thoughts would resonate in the head of a person listening to
the reviewers in that room.
This solution would primarily be used when the domain is
somewhat specific (televisions, compared to all of
consumer-electronics) and where a user would normally get
stuck as to what they can do to advance their ability to dive
deeper. When I need to choose a television, I can easily
get myself into a position where I’m not sure what to do
next. In the below example, the split between LCD and
Plasma TVs is almost even, the price range is heavily
centered around the $1500-3000 zone, ad the brands all
have just a few televisions to choose from.
So if I don’t care if I get an LCD or Plasma TV, if I know
that it will cost me between $1500 and $3000 and I am not
brand loyal, then I have to start looking at all my results for
other things that will help me further refine my set. It’s at
this point that I could really use a way to look more deeply
at the results, and take advantage of all the unstructured
information from things like user-reviews or other
collections of information.
To accomplish this, we could take all the terms from all the
reviews in the result set (say, 8 televisions, which have a
total of 250 reviews among them), cluster the all the terms
from all the reviews and then apply the terms found in each
cluster as a set of search-terms against the whole corpus of
250 reviews from those 8 products. We’d then rank the
results (presumably with the top result as the one which is
most about the terms in a particular cluster) and present the
user with a snippet that contains the searched-terms from
the top ranked review.
Using the reviews to generate terms, then clusters from the
terms:
We could then present the snippets in a readable form,
which also indicated to the user the set from which the
snippets came by presenting them when the user rolled over
a dimension value:
This allows people to be able to make judgments about
what the information in the reviews indicates about an
arbitrary set and compare that to any other set (or example
comparing 2 different manufacturers) without having to
click through and read all – or any of the reviews. Even
better, is that the user won’t see the three or four most
common phrases, which in a set of televisions might be
very similar.
However, problems with this technique might occur when
people don’t have the ability to directly compare the
snippets against each other because the clusters are
inherently about different things, i.e., while one set of
snippets of reviews may talk about
price
and
picture
quality
, another set may focus on
ease-of-set-up
and
viewing angle.
That difference could lead a user to think
that one TV is a better value, while the other is easier-to-
use,
which
may
in
fact,
not
be
accurate.
47
Visual Concept Explorer
Xia Lin
College of Information Science and Technology
Drexel University
Philadelphia, PA 19104
Visual Concept Explorer (VCE) is a visualization system developed to explore potentials
of visual mapping and information retrieval. Implemented as a client-server application,
VCE contains a Java-based server that interacts with a very large ontology, the National
Library of Medicine’s UMLS Knowledge Source (UMLS, 2007) and a live search engine,
the free PUBMED search engine available on the Web. Its visual interface is
implemented in FLASH with various mapping and interactive functions. Figure 1 is a
sample interface screen for searching the keyword “cognition.†VCE is available for
testing at:
http://cluster.cis.drexel.edu/vce/
Figure 1. A screen dump of the VCE interface.
Visual Concept Mapping
The first function of VCE is to create dynamic visual concept maps for user’s queries.
When a user’s query term is received, VCE first checks the UMLS ontology to see if the
term is a standardized term that can be mapped to a unique concept ID. If it is not, the
term will be sent to the PUBMED search engine to retrieve documents. The top 20
controlled vocabulary terms (MESH terms) from the top 200 retrieved documents will be
presented to the user. The user then can select a term that best matches his or her query
for concept mapping. In the example in Figure 1, the user’s query term is “cognitionâ€
which is a concept term in UMLS. VCE first identifies 24 terms that co-occur most often
48
with the term “cognition†and then map the 25 terms based on some mapping algorithms.
The display shown in figure 1 is generated by Pathfinder Associative Networks (PFNET)
(Schvaneveldt, 1990). The map clearly shows that the concept “cognition†is most
closely related to “brain,†“intelligence,†“vocabulary,†“Models, psychologicalâ€, etc.,
and the term “brain†is connected to “language,†“emotion,†and “electroence
phalography,†etc. Such a concept map helps the user explore semantic relationships of
terms and navigate through the concept relationships to find the best term to represent his
or her information needs. The map is interactive. The user can explore any of the terms
shown on the map by double-clicking on it. The clicked term will become the new “focus
center†and a new concept map will be generated for the term. The user can also select
different mapping models to see concept maps in different styles. Another mapping
model currently implemented is the Kohonen’s self-organizing feature map algorithm
(Kohonen 1997). Other mapping algorithms can be easily added to the system as needed.
Visual Search Interface
VCE is also a search interface for the PUBMED search engine. The user can use the
visual display to develop their search strategies while exploring the term relationships
shown on the display. Right-clicking on any of the terms would allow the user to add the
term to search boxes on the right-hand side of the screen. Each time a term is added to
the search boxes, a search will run automatically against the PUBMED search engine (or
other search engines that the user chooses) and the number of hits will be shown or
updated on the screen. The search is executed using a Boolean query constructed from
the terms in the search boxes – terms in the same box are ORed and terms in different
boxes are ANDed together. In this way, the search process becomes a natural result of the
user’s interaction and exploration of concept semantics. VCE assists the user in at least
three aspects of searching: (1) converting free-text query terms to controlled vocabulary
terms, (2) constructing Boolean queries through choosing-and-clicking from dynamic
concept maps, and (3) providing immediate feedback of query results. In the example
shown, the user starts with the term “cognition†and constructs a query “Emotions AND
Cognition Disorder AND Child Development,†yielding 16 results. The query can
represent user’s information needs much better.
The VCE prototype will be demonstrated during the conference and viewers are invited
to explore various mapping and interactive functions implemented in VCE.
References:
Kohonen, T. (1997). Self-Organizing Maps. Springer-Verlag, Berlin.
Schvaneveldt, R. W. (Ed.). (1990). Pathfinder associative networks: Studies in
knowledge organization. Norwood, NJ: Ablex.
UMLS (2007). Unified Medical Language System.
http://www.nlm.nih.gov/research/umls/
49
Work in Progress: Navigating Document Networks
Mark D. Smucker
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts Amherst
smucker@cs.umass.edu
ABSTRACT
While much research effort has been expended on innova-
tive user interfaces for information retrieval (IR), deployed
IR user interfaces have adopted few innovations. Rather
than design another novel user interface tool that users never
adopt, we decided that our first step would be to better un-
derstand the nature of an adopted tool. In that vein, we
are in the process of studying the potential and the actual
performance of find-similar, which is a widely adopted tool
that allows a user to request documents similar to a given
document. Find-similar is a compelling IR interface tool for
the very reason that users appear to have adopted it and
that it has the potential to provide to users the power long
known to be available via relevance feedback. Our hope is
that by better understanding find-similar, we’ll be able to
take that understanding and apply it to other user interface
tools that will both be powerful and be adopted by users.
1. INTRODUCTION
Find-similar allows a user to request a list of documents
similar to a given document. As such, find-similar provides
a way for users to navigate from one document to another
and browse by document similarity. This feature is typically
instantiated as a button or link next to each result in the
list of search results. For example, the Excite search engine
labeled its find-similar link “More Like This: Click here for
a list of documents like this one.â€
Find-similar can be an important and valuable tool for im-
proving IR systems. Spink et al. [6, 7] analyzed samples of
Excite’s query logs and reported that between 5 and 9.7 per-
cent of the queries came from the use of the “more like thisâ€
find-similar feature. Lin et al. [2] have reported that for the
US National Library of Medicine’s search engine, PubMed,
18.5% of non-trivial search sessions involve clicks on articles
suggested by PubMed’s find-similar, which PubMed refers
to as
related articles
[3].
While relevance feedback is well known to be a powerful
technique for improving retrieval performance, it has seen
little adoption by popular search systems. We’ve shown that
find-similar has the potential to match the performance of
relevance feedback [4]. Earlier work by Wilbur and Coffee [8]
found that certain browsing patterns could improve over the
original query’s ranking.
HCIR’07 Workshop on Human-Computer Interaction and Infor-
mation Retrieval, October 23, 2007, Cambridge, Massachusetts.
2. DOCUMENT NETWORKS
Find-similar can be studied and understood in graph the-
oretic terms. Each document or web page is a node in a
graph. When find-similar is applied to a document, find-
similar provides the user with links to the similar documents
and these links are effectively added to the document. For a
web page, these automatically created links join the already
existing links on the page. If the added links are good, users
should be able to use the links to navigate to other relevant
documents. These links make documents in the graph closer
to each other, which is good, but these links also increase
the amount of time that a user needs to spend examining
the page, which is bad.
In a broad sense, find-similar aims to add links to doc-
uments such that the time for a user to get from relevant
document to relevant document is minimized. These added
links can represent many different types of similarity. The
most studied form of similarity is content-based, which typ-
ically involves comparison of the terms in each document.
The web’s hyperlinks are another form of similarity; a simi-
larity defined by the authors of the web pages. Find-similar
could add links to documents from a similar time period or
documents written by the same author, for example.
We’ve proposed a pair of metrics that can be used to mea-
sure the navigability of documents [5] and preliminary ex-
periments show that existing web hyperlinks are ill-suited
for navigation from relevant document to relevant document
compared to links produced by a content similarity measure.
Different types of content similarity measures create dif-
ferent document networks. Using simulated browsing be-
haviors, we have found that a query-biased document-to-
document similarity consistently outperforms a baseline “reg-
ular†similarity. Figure 1 shows an example of how query-
biased similarity can result in a better clustering of rele-
vant documents. We intend to study query-biased similar-
ity using our navigability metrics and obtain a better un-
derstanding of the more navigable networks produced by
query-biased similarity.
3. USER BEHAVIOR
While understanding of the find-similar document net-
works is important to understanding find-similar and max-
imizing its performance, we also want to better understand
how people use find-similar.
In a study by Huggett and Lanir [1], users found more
relevant documents using an interface that provided find-
similar over an interface without find-similar. Huggett and
Lanir’s study used small newswire collections of 2000 doc-
50
(a)
(b)
Figure 1: Simplified depictions of the relevant document networks for TREC topic 337, “viral hepatitis.†The
network on the left (a) uses regular similarity while the network on the right (b) uses query-biased similarity,
which better clusters relevant documents. The documents are closer in figure (b) because they are higher
ranked in each other’s ranked lists. Links are drawn between two documents when one of the pair is close to
the other. The actual relevant document network is a weighted, directed graph [5].
uments and limited test subjects to two minutes for each
search. We would like to examine find-similar’s usage on
much larger TREC collections and on the web. As Huggett
and Lanir did, we will also compare our find-similar im-
plementations to IR systems allowing query reformulation
and one of our measures of performance will be the rate at
which relevant documents are found. We hypothesize that
users adopt IR interface features that provide better rates of
information discovery as opposed to tools that may improve
ranking performance but overall slow the rate of finding rel-
evant documents.
We also want to learn about how users navigate the doc-
ument networks formed by find-similar and what forms of
user interface support are needed to maximize performance.
For example, how far away from the original query will users
navigate? Do users apply find-similar to documents that are
non-relevant but which they think might lead them to rele-
vant documents? How is find-similar usage interleaved with
query reformulation? We intend to answer these and other
questions as part of a planned user study.
4. CONCLUSION
Find-similar provides a chance for us to study a user in-
terface feature that has been adopted by search engines and
shown to be frequently used by users. To date, we’ve shown
that find-similar has the potential to match a traditionally
styled multiple item relevance and that different forms of
similarity offer different levels of inherent navigability. Our
next steps include a closer examination of similarity func-
tions such as query-biased similarity and actual user stud-
ies. As much as possible, we hope to learn what has allowed
find-similar to become a useful tool for search when so many
other user interface features have failed to succeed and be
adopted outside of the laboratory.
5. ACKNOWLEDGMENTS
This work was supported in part by the Center for In-
telligent Information Retrieval and in part by the Defense
Advanced Research Projects Agency (DARPA) under con-
tract number HR0011-06-C-0023. Any opinions, findings
and conclusions or recommendations expressed in this ma-
terial are the authors’ and do not necessarily reflect those of
the sponsor.
6. REFERENCES
[1] M. Huggett and J. Lanir. Static reformulation: a user
study of static hypertext for query-based reformulation.
In
JCDL ’07: Proceedings of the 2007 conference on
Digital libraries
, pages 319–328. ACM Press, 2007.
[2] J. Lin, M. DiCuccio, V. Grigoryan, and W. J. Wilbur.
Exploring the effectiveness of related article search in
pubmed. Technical Report LAMP-TR-145/CS-TR-
4877/UMIACS-TR-2007-36/HCIL-2007-10, College of
Information Studies, University of Maryland, College
Park, July 2007.
[3] Pubmed,
www.pubmed.gov
. “Related articlesâ€:
www.nlm.nih.gov/bsd/pubmed_tutorial/m5002.html
.
[4] M. D. Smucker and J. Allan. Find-similar: Similarity
browsing as a search tool. In
SIGIR ’06
, pages 461–468.
ACM Press, 2006.
[5] M. D. Smucker and J. Allan. Measuring the
navigability of document networks. In
SIGIR ’07 Web
Information-Seeking and Interaction Workshop
, 2007.
[6] A. Spink, B. J. Jansen, and H. C. Ozmultu. Use of
query reformulation and relevance feedback by excite
users.
Internet Research: Electronic Networking
Applications and Policy
, 10(4):317–328, 2000.
[7] A. Spink, D. Wolfram, B. J. Jansen, and T. Saracevic.
Searching the web: The public and their queries.
JASIST
, 52(3):226–234, 2001.
[8] W. J. Wilbur and L. Coffee. The effectiveness of
document neighboring in search enhancement.
IPM
,
30(2):253–266, 1994.
51
Integrating the “Deep Web†With the “Shallow Webâ€
Michael Stonebraker
MIT
Public web integration services such as Google and Yahoo provide access to the “shallow webâ€,
i.e. those sites that are reachable by a text-oriented crawler. Although this service is very useful,
it misses large portion of the information available on the web. Specifically, it misses the “deep
webâ€, which is data available behind form-oriented user interfaces. Examples of deep web sites
include all airline sites, 411.com, weather.com, and most retail sites. These all require one to
issue a query through a form-oriented interface to an underlying data base. Obviously, current
crawlers are incapable of accessing the deep web. In this paper, we propose a methodology for
integrating the deep web with the shallow web.
Our proposal builds on a research prototype, Morpheus, which we have built over the last two
years at MIT and the University of Florida. Morpheus is focused on enterprise data integration
and database schema heterogeneity. For example, a multinational corporation would have
employees in several countries, each with a local salary. Hence, the French employee data base
would record French salaries in Euros, after taxes, and including a lunch allowance. In contrast,
the US employee data base would contain salary records, gross in dollars with no lunch
allowance. Obviously, retrieving the two collections of salaries will produce garbage. Hence, a
global schema must be defined and the local schemas must be mapped to this global schema.
There are many approaches to this issue of schema integration, including forcing standardization,
using description languages such as Owl, and performing automatic translation based on some
sort of logic description of the source and the target objects. Our point of view is that a
transform, written in some programming notation, is generally required to convert between
enterprise data elements, which we call data types. The objective of Morpheus is to facilitate
building and reusing such transforms.
As such, Morpheus is a software system that contains a metadata repository about transforms and
data types, a sophisticated browser that allows a human to find objects of interest in the repository
and a high level transform construction tool (TCT) that allows a human to build transforms from
scratch as well as “morph†existing ones into a form that meets his needs. Morpheus is built on
top of Postgres; hence transforms are Postgres user-defined functions. As such, transforms can be
called by running Postgres queries on input data stored in Postgres, producing output data in other
Postgres tables. As a result of this architecture, web services can be “wrapped†to convert them
to user-defined functions, allowing Morpheus to interact with remote data and services.
We are working on several Morpheus extensions, which will allow Morpheus to perform the
desired integration of the shallow and deep web:
1)
We are building a crawler that will explore the public web, looking for Javascript forms.
We plan to semi-automatically wrap each such site to turn it into a Morpheus transform
and register it in the Morpheus repository. Human involvement is required to ascertain
how (if at all) the new transform is related to existing Morpheus objects.
This will extend Morpheus to know about the subset of the deep web that is behind
Javascript forms
52
2)
It is straightforward to model our metadata tables as RDF objects. Hence, we can build a
mapping layer that will support Sparkl access to our Postgres repository.
This will support Sparkl access to the deep web through Morpheus.
3)
We will build a GUI that will support a simple query interface and convert it to Sparkl.
Using this interface, an end user could, for example, type professor(“Mike Stonebrakerâ€)
4)
We will build a run time system that performs the following actions. It will decide which
of the Javascript wrappers to invoke, run them, converts the result into RDF and load the
result into a temporary Postgres table, modeling RDF objects in a table in a
straightforward manner. Any returned object can be “enriched†by passing it through any
local Morpheus transforms that would convert the object to another Morpheus data type.
Furthermore, it will submit the same query to Google, pass the top X results through a
natural language parser to find “subject-verb-object†triples, and load the results into the
same Postgres table.
5)
The result is a collection of RDF objects from the shallow and deep web that deal with
the query: professor(“Mike Stonebrakerâ€)
6)
A graph of the result of 5) will be presented to the user, perhaps through Haystack or
some other means. The user can browse this representation. Alternately he can refine the
result using Sparkl directly.
Jointly with researchers at the University of Florida, we are building out this prototype. Research
support is available to one or more interested MIT students.
53
Multimodal Question Answering for Mobile Devices
Tom Yeh
Vision Interface Group
INTRODUCTION
In recent years, community-based question answering
(QA) services, such as Yahoo Answers! and Naver, have
enjoyed growing popularity. These services operate web
sites to invite anyone to post open-ended questions for
free. The questions can be on a variety of topics ranging
from car buying tips to dating advice. They are viewed
by millions of users in an online community, some of
whom may happen to possess the right expertise to give
satisfactory answers to them. Users frequent these sites
to seek answers about topics they know little of; at the
same time, they often volunteer answers about other
topics they happen to be familiar with. It is in the
spirit of such reciprocity that people are willing to con-
tribute their expertise and knowledge for the benefits of
the whole, without seeking any form of monetary com-
pensation.
Over time, these community-based question answering
services have accumulated a considerable amount of hu-
man knowledge, which can be readily tapped into with
an intelligent search engine. In Yahoo Answers!, when
a user asks a question, a search engine can quickly lists
previously asked questions that are relevant to the new
question. The automatic search mechanism allows the
user to receive immediate feedback instead of having to
wait for some human expert to answer it. If nothing
the search engine provides is satisfactory, the question
can still be left as as an open question for the whole
community to solve.
However, these QA services have some fundamental lim-
itations that can be improved upon. The first limitation
is that conventional QA services provide only a single
input modality—text. Sometimes we may find it dif-
ficult to phrase a question about an unfamiliar object
identifiable only by its distinct visual features. For ex-
ample, we may see an exotic bird outside of a window
and want to know about it. Not knowing how to name
the bird, our only resort is to actually describe the bird
in our question based on its appearance, such as
a red-
feathered, large-beak, round-eye bird
. The second limi-
tation is that most QA services were designed primarily
for the desktop platform. Yet, we often encounter new
and interesting things around us that arouse our curios-
ity and instill questions in our mind. Thus, there is a
need to support situated and pervasive question answer-
ing from a mobile platform, so that we can ask questions
as soon as we become curious about something.
Therefore, in this paper, we propose a multi-modal ques-
tion answering system designed for mobile users. As a
solution to the limitation of text-only QA, we introduce
an additional modality—photo—to be used directly as
part of a multi-modal query that includes both visual
and textual cues. Using our system, whenever some-
thing in the environment catches a user’s attention, the
user can learn about it immediately, simply by taking
a photo of and asking any question about it.
Unlike conventional question answering services, ours
is multi-modal (photo and question) rather than text-
only, mobile rather than desktop-centric. Unlike location-
based information services such as those based on GPS
or RFID, ours is active (users decide when and what
to inquire) rather than passive (users receive whatever
the system decides to show based on the location cue).
Moreover, the idea of using photos directly as queries
to an information retrieval system has been explored by
various works such as a system that recognizes flowers
[Flower] and another system that recognizes fish [Fish]
based on photos taken by camera phones. Unlike these
photo-based information services, ours is open-ended
(users can ask about any topic) instead of domain spe-
cific.
SYSTEM ARCHITECTURE
This section gives a high-level overview of the architec-
ture of our multi-modal information retrieval system.
The architecture we propose consists of five major com-
ponents described as follows:
Mobile User Interface
To let users issue multi-modal
queries from mobile devices, we need to design an in-
terface for users to take photos and enter questions
that can be embedded in the queries. This interface
also allows users to submit queries to a server and to
view relevant questions and answers returned by the
server.
Expert Community
To answer multi-modal queries
users submitted, we rely on a community of experts
willing to share their knowledge and expertise with
others. The success of existing community-based ques-
tion answering services ,such as Yahoo Answers! and
Naver, has demonstrated that an incentive model based
on reciprocity and reputation is sufficient for solicit-
54
54
Figure 1. A typical usage scenario (left) and our pro-
posed system architecture (right) of a multi-modal ques-
tion answering system.
ing altruistic behaviors from community users. It is
quite likely the same model that has made these ser-
vices so successful can be extended to our case.
Multi-modal Query Database
To store multi-modal
queries ever submitted so that they can be looked up
in the future, we need to build a suitable database.
Also must be stored in this database are the answers
provided by the community users. In effect, this
database represents the collective multi-modal knowl-
edge contributed by the community over time. When
a multi-modal query is submitted, it will be matched
against the queries stored in this database to look up
relevant information, before the expert community is
consulted.
Photo-matching Engine
To check whether someone
else has taken a relevant photo in the past, we need a
photo-matching engine that can match a new photo
against the photos in the database to identify those
that are visually similar. Because people can take
photos of the same object or scene in various ways,
the photo-matching engine needs to provides robust-
ness to image variations due to scale, translation, ro-
tation, and occlusion.
Question-matching Engine
To check whether some-
one else has asked a relevant question in the past, we
need a question-matching engine that can match a
new question against the questions in the database
to identify those that are semantically related.
ALGORITHM
This section describes an algorithm that takes a multi-
modal query and lookups relevant queries in a database.
The basic idea of the algorithm is to use the photo-
matching engine to retrieve a list of candidate photos
similar to the photo in the query, and use the question-
matching engine to re-rank the candidate photos based
on the similarities between the question in the query
and the question corresponding to each candidate photo.
Formally, let
x
:
!
p, q
"
denote a multi-modal query that
consists of a photo
p
and a question
q
. Let
X
n
=
{
x
1
. . . x
n
}
denote a multimodal database that has seen
n
queries. The operation of the photo-matching engine
can be denoted as:
f
i
:
!
p, P
" →
¯
p,
whose job is to find finds the
i
-th most similar photo
¯
p
∈
P
of a photo
p
. Similarly, the operation of the
question-matching engine can be denoted as
g
i
:
!
q, Q
" →
¯
q,
whose job is to find the
i
-th most similar question ¯
q
∈
Q
of a question
q
.
Given a new multi-modal query
x
n
+1
:
!
p
∗
, q
∗
"
, the al-
gorithm to find the
k
most relevant multi-modal queries
proceeds in three steps:
First, we find a set of candidate photos
C
p
that consists
of the
k
photos most similar to the query photo
p
∗
:
C
p
=
{
p
i
|
p
i
â†
f
i
(
p
∗
, P
n
) ,
i
= 1
. . . k
}
(1)
where
P
n
is the set of
n
photos already in the database
X
n
.
Next, we find a set of candidate questions
C
q
that con-
sists of the questions submitted together with the can-
didate photos in the same multi-modal queries:
C
q
=
{
q
i
|!
q
i
, p
i
" ∈
X
n
,
i
= 1
. . . k
}
(2)
Finally, as results, we return a set of relevant multi-
modal queries
R
, ranked by similarity to the new ques-
tion:
R
=
{!
p
j
, q
i
"|!
p
j
,
g
i
(
q
∗
, C
q
)
" ∈
X
n
,
i
= 1
. . . k
}
(3)
55
55