Developing a Desktop Knowledge System LO11648

Marvin Schneider (marvin@falcon.idiscover.co.uk)
Fri, 3 Jan 1997 21:35:27 -0000

Integrating State of the Art Document Imaging, Full Text Indexing, Search
and Delivery Technologies to Create a Fully Functional Desktop Knowledge
System

What does mankind gain from the use of technology if technology is merely
used to generate more information? We are already suffering from an
overload of information. In the new millennium, will we be able to say
that we have learned to harness technology so as to allow us to learn and
generate knowledge?

A number of technologies have emerged over the last 3-4 years that deliver
the core components required to turn the desktop computing environment
into a workable knowledge base capable of supporting continuous learning.
Yet no one has managed to integrate all of the required features into a
fully functional desktop document imaging, full text indexing, search and
delivery system. In my mind, there is an opportunity for an
entrepreneurial software development company to license the key
technologies from a handful of firms and perform the required systems
integration. Potential candidates to fill the gap in the product market
would include:

* Ambia (http://www.ambia.com)
* Adobe (http://www.adobe.com)
* Visioneer (http://www.visioneer.com)
* Microsoft (http://www.microsoft.com)
* Hewlett Packard (http://www.hp.com)
* Xerox (http://www.xerox.com)
* Folio (http://www.folio.com)
* Fore Front (http://www.ffg.com)

Specification of a Desktop Knowledge Based System

Visioneer has got 50% of the desktop document imaging problem solved in
the form of their Paper Port product. Paper Port 3.0 performs well as the
interface between the scanning process, rudimentary image correction
process (page rotation, page edge alignment, line enhancement and some
document mark up tools), image compression process, and linking process to
other applications and devices (such as email, fax, word processing and
optical character recognition). In order to be a truly useful component
of the overall puzzle, the following enhancements would be required to the
Paper Port style of document imaging:

* include as part of the desktop interface, an expandable folder structure
(similar to the left hand side of the Windows 95 Explorer interface) to
manage an unlimited number of documents organised in a hierarchical folder
structure. At present, the Paper Port desktop is limited to 100 stacks on
a single dimensional desktop

* enhance the current generation of scanner drivers to allow complex
documents to be scanned at multiple resolutions. At present, desktop
scanners are only capable of scanning an entire document at a single
resolution. Surely it would be possible to scan text and vector graphics
at one resolution (say black and white 300 dpi) and bitmap images at a
different resolution (say 24 bit colour at 150 dpi) in a number of
scanning passes on a single page

* in addition to the line enhancement feature, why not perform text and
vector graphics enhancement using core Adobe technology to overcome poor
quality scans

* include page touch up tools available in many of the vector graphics
packages such as line, curve and box drawing, different polygon fill tools
and an eraser tool

Adobe has got 70% of the electronic publishing (document delivery) problem
solved in the form of their Acrobat 3.0 product. In my mind, PDF is the
page description language of the future and Adobe has (finally) managed to
exploit their expertise in this technology to create enhancements that
bring us closer than ever to "knowledge based nirvana". Features such as
converting PostScript files (which any document authoring application can
generate) into (almost) fully searchable text while retaining the full
integrity of the original document layout, hypertext linking to other PDF
documents or the web, navigation tools, ability to serve PDF documents in
standard web browsers, annotation tools and an open architecture to allow
third party developers like Ambia to develop plug-ins, are a notable
contribution. However, the interface of Acrobat's PDF authoring and
reading components does not comply with the state-of-the-art in user
interface design and would be best brought into the fold of the enhanced
Visioneer style desktop. In addition, Adobe has failed miserably in their
text indexing and search engines.

In my mind, the text indexing and searching engine is the heart of any
knowledge based system. While many search engines allow you to perform
key word searching (with some Boolean logic and word proximity
refinements), the search results of such a system will invariably suffer
from too many irrelevant hits.

Imagine an alternative search and retrieval paradigm that relies on
documents being classified into a hierarchical taxonomy of subject matter
(similar to the Dewey Decimal system). Clearly the key to the
effectiveness of such a system would be to define the taxonomy of the
subject matter correctly (but then again, isn't this the real value add of
a knowledge base anyway?). I could imagine that an entire consulting
industry could emerge where experts in their field would sell their
subject taxonomies. My consulting firm is a world recognised expert in
the field of Value Based Management. We have developed a taxonomy of our
field of expertise which reflects the nature of our management consulting
practice.

To find documents, users would search through the hierarchy of the
taxonomy. Additionally, a key word searching facility to identify the
relevant branches of the taxonomy tree (including known synonyms) would be
provided. Retrieved (PDF) documents can be viewed in a reader application
(in the form of a stand alone application, plug-in to a conventional web
browser, or integrated into the enhanced Visioneer style desktop).
Additionally, readers will be able to annotate documents (with
notes, highlighting, attached files, hand drawing, hypertext linking to
other PDF files or web pages, etc). Annotations should be saved as a
shadow of the original document, and be available to other readers and
indexed for later document classification (similar to the features
provided by Folio).
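
To make the taxonomy search idea concrete, here is a minimal sketch of a
subject tree with a key word lookup that returns the matching branches,
synonyms included. The class name, the function name and the sample
branches are all invented for illustration; nothing here reflects an
existing product.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """One branch of a subject taxonomy (hypothetical structure)."""
    name: str
    synonyms: set[str] = field(default_factory=set)
    children: list["TaxonomyNode"] = field(default_factory=list)
    documents: list[str] = field(default_factory=list)

    def add(self, child: "TaxonomyNode") -> "TaxonomyNode":
        """Attach a child branch and return it so calls can be chained."""
        self.children.append(child)
        return child


def find_branches(root, keyword, path=()):
    """Return the path of every branch whose name or synonyms contain keyword."""
    keyword = keyword.lower()
    here = path + (root.name,)
    terms = {root.name.lower()} | {s.lower() for s in root.synonyms}
    hits = [here] if any(keyword in t for t in terms) else []
    for child in root.children:
        hits.extend(find_branches(child, keyword, here))
    return hits


# Illustrative taxonomy; the branch names are invented for this example.
root = TaxonomyNode("Management")
strategy = root.add(TaxonomyNode("Strategy", {"planning"}))
finance = root.add(TaxonomyNode("Finance", {"capital", "funding"}))
valuation = finance.add(TaxonomyNode("Valuation", {"appraisal"}))
valuation.documents.append("valuation-primer.pdf")

matches = find_branches(root, "appraisal")
for match in matches:
    print(" > ".join(match))
```

A user who types a synonym is taken to the branch it belongs to, and can
then browse up or down the hierarchy from there rather than wading
through a flat list of hits.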

It remains to classify PDF documents (as and when they are generated) into
the defined subject taxonomy to create an on-going, living knowledge base.
This is where state-of-the-art indexing technologies play their role.
Documents which are known to be exemplar representations of each end
branch of the subject taxonomy are identified. These documents are
indexed and analysed for their content using advanced language analysis
technologies to describe the content of the document (in much the same way
as a spectrum analyser describes the content of an electrical signal). The
summary statistics of these known documents are used to define the
"expected" content of documents in each end branch of the taxonomy. Then
as new documents are made available, their information "spectrum" is
compared to the "exemplar" and one or more categorisations are made and
maintained in a relational database. The relational database which describes
the current state of the knowledge base (including all indexed documents)
should be able to be exported to a standard desktop database program such
as Microsoft Access for off-line analysis.
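
The classification step can be sketched roughly as follows. This is only
an illustration under simplifying assumptions: the "spectrum" is reduced
to plain word counts, the comparison is cosine similarity, and the
relational database is an in-memory SQLite table. The advanced language
analysis I have in mind would be considerably richer. The function names,
branch names, threshold and sample texts are all hypothetical.

```python
import math
import re
import sqlite3
from collections import Counter


def spectrum(text):
    """Word-count profile of a document (a crude stand-in for its 'spectrum')."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count profiles."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(exemplars):
    """Sum the spectra of the exemplar documents for each end branch."""
    profiles = {}
    for branch, texts in exemplars.items():
        total = Counter()
        for text in texts:
            total.update(spectrum(text))
        profiles[branch] = total
    return profiles


def classify(text, profiles, threshold=0.2):
    """Return (branch, score) pairs at or above the threshold, best first."""
    doc = spectrum(text)
    scored = [(branch, cosine(doc, prof)) for branch, prof in profiles.items()]
    return sorted((s for s in scored if s[1] >= threshold), key=lambda s: -s[1])


# Invented exemplar texts, one per end branch of a toy taxonomy.
exemplars = {
    "Finance/Valuation": ["discounted cash flow valuation of equity cash flow"],
    "Strategy/Competition": ["competitor analysis and market share strategy"],
}
profiles = build_profiles(exemplars)

# The categorisations are kept in a relational table (SQLite, in memory).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE classification (document TEXT, branch TEXT, score REAL)")

new_doc = "estimating equity value from discounted cash flow"
for branch, score in classify(new_doc, profiles):
    db.execute("INSERT INTO classification VALUES (?, ?, ?)",
               ("new.pdf", branch, score))
db.commit()
rows = db.execute("SELECT branch FROM classification").fetchall()
```

Because a document may score above the threshold for several branches,
the table holds one row per document and branch pair, which is what
allows "one or more categorisations" to be recorded and later analysed
off-line.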

In my mind, it is not until all (or a large set) of these technologies are
integrated into one clean desktop environment, and "client" applications
are available for under US$500, that we can say that we are using
technology for the sake of learning and knowledge creation, rather than
just information creation.

So Ambia, Adobe, Visioneer, Microsoft, Hewlett Packard, Xerox, Folio and
Fore Front, if you are listening, hear the plea of at least one person
with a desire to see technology used the way it ought to be used.

Marvin Schneider
Senior Associate
Marakon Associates
1-3 Strand
London WC2N 5HP
+44 171 321 3683
marvin@falcon.idiscover.co.uk

-- 

Marvin Schneider <marvin@falcon.idiscover.co.uk>

Learning-org -- An Internet Dialog on Learning Organizations For info: <rkarash@karash.com> -or- <http://world.std.com/~lo/>