Position Paper by Christopher Mueller

For the Workshop on "Information Visualization Software Infrastructures" at IEEE 2004 Visualization,
Organized by Katy Börner, Indiana University, USA and Jean-Daniel Fekete, INRIA, France

Part I

I.1) What functionality should a general InfoVis infrastructure provide?

'Infrastructure' has many different meanings and connotations. In this discussion we try to use the term in its broadest and most inclusive sense: a set of tools, protocols, and policies that support a common goal. In this case, the common goal is, of course, software and support for InfoVis. The tools in an infrastructure are the software applications and libraries, databases, and repositories that make up and manage the hard assets. Protocols address the requirements for the software and data components of the infrastructure, from how they interact with each other to what languages should be used. The policies help direct the social aspects of the infrastructure, e.g. who maintains software, minimum levels of implementation and documentation required to be part of the infrastructure, versioning and release management, and licensing issues.

Tools

A Web-based central information repository, modeled after the successful biological and software source code repositories (e.g. NCBI, FlyBase, SourceForge, CPAN), should make up the core of the infrastructure.

This repository would provide:

Protocols

Visualization software presents many unique challenges to implementers and integrators. Often, tools are written in different languages or use different interfaces for data access and user interaction. At one extreme, some tools exist as standalone applications that provide full support for data I/O and interaction, but little support for integration with other tools (e.g. Spotfire). Other tools provide software frameworks for extending and embedding visualizations along with extensible I/O toolkits (e.g. Matlab, IDL). Finally, some packages exist only as libraries, meant to be used as plugins in new and existing applications.

The InfoVis infrastructure should not preclude any models, but should provide a common set of component protocols for guiding new development towards libraries to ensure the most flexibility for InfoVis researchers and end users. To encourage the use of the protocols, the infrastructure should provide a reference implementation of a container framework that provides support for common tasks, such as data I/O and user interaction, that can be used by anyone developing new InfoVis tools (e.g. Katy Boerner's IVC). The component protocols should follow the general philosophy of Microsoft's COM framework, enabling different levels of interface support and allowing the dynamic discovery of new interfaces.
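As a rough illustration of the interface discovery described above, the following C++ sketch mimics the spirit of COM's QueryInterface with a dynamic_cast. It is only a sketch under stated assumptions: every name in it (Component, DataReader, Renderer, query_interface, CsvSource, BarChart, run_pipeline) is hypothetical and belongs to neither COM, the IVC, nor any existing framework. C++ is used here only because it is the language of the BGL; a Java version could rely on instanceof checks in the same way.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical root interface that every component implements.
struct Component {
    virtual ~Component() = default;
    virtual std::string name() const = 0;
};

// Hypothetical optional capability: the component can supply data.
struct DataReader {
    virtual ~DataReader() = default;
    virtual std::vector<double> read() = 0;
};

// Hypothetical optional capability: the component can draw data.
struct Renderer {
    virtual ~Renderer() = default;
    virtual std::string render(const std::vector<double>& values) = 0;
};

// Rough analogue of COM's QueryInterface: returns a pointer to the
// requested interface, or nullptr if the component does not support it.
template <typename Interface>
Interface* query_interface(Component& component) {
    return dynamic_cast<Interface*>(&component);
}

// A component that only knows how to provide data.
class CsvSource : public Component, public DataReader {
public:
    std::string name() const override { return "csv-source"; }
    std::vector<double> read() override { return {1.0, 2.0, 3.0}; }
};

// A component that only knows how to render data.
class BarChart : public Component, public Renderer {
public:
    std::string name() const override { return "bar-chart"; }
    std::string render(const std::vector<double>& values) override {
        return "bar-chart(" + std::to_string(values.size()) + " values)";
    }
};

// The container discovers at run time which components can read and
// which can render; it never needs to know their concrete types.
std::string run_pipeline(const std::vector<std::unique_ptr<Component>>& components) {
    std::vector<double> data;
    for (const auto& c : components) {
        assert(c != nullptr);
        if (auto* reader = query_interface<DataReader>(*c)) {
            const std::vector<double> chunk = reader->read();
            data.insert(data.end(), chunk.begin(), chunk.end());
        }
    }
    std::string output;
    for (const auto& c : components) {
        if (auto* renderer = query_interface<Renderer>(*c)) {
            output += renderer->render(data);
        }
    }
    return output;
}
```

Because the container asks each component what it supports rather than assuming a fixed base class, a component can implement one interface today and gain others later without breaking existing containers.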

The reference implementation and component framework should be implemented in one language. Given that the InfoVis community has adopted Java, Java would be a good choice for this language. However, if it is important to make the InfoVis tools accessible outside of academia, .NET bindings should be considered.

Component Protocols (along with implementation suggestions) should include, but not be limited to:

One thing worth noting is that components will not need to be implemented entirely in Java. As long as they expose themselves via the component interfaces, they can use any language that can talk to Java. This is important for supporting large datasets, since Java memory usage does not scale well; the Boost Graph Library (BGL) integration is one example.

The component interfaces should evolve early in their lifecycles. While this may present some compatibility issues, the flexibility will be important as design issues get worked out.

Policies

The infrastructure policies are designed to provide a basic set of guidelines that encourage people to use the framework and prevent people from abusing it.

They should:

Unlike the Java or C++ standard libraries, the InfoVis tools do not need to be industrial strength. Both tools under active development and complete tools should be encouraged, to help share ideas and promote collaboration. A few minimum requirements should be met by projects. These include:

This requirement ensures that anyone can use the tool 'out of the box'.

This is only a partial list, but is meant to address many common problems people encounter with unknown and poorly supported software. Since this repository will be how most people interact with the InfoVis community, it should provide a pleasant experience!

I.2) What do you see as the main technical challenges for creating a central but flexible and universally useful (information) visualization software infrastructure (as opposed to 100 different ones)?

For the infrastructure described above, there are very few technical challenges. Techniques for developing large repository-style Web sites are well known, and most of the common approaches (PHP/MySQL, Java/EJBs, Plone/content management, hybrid approaches, etc.) would be fine for the repository.

Developing a flexible set of component protocols will be more challenging, but as long as a careful, iterative design process is employed, there is no reason it will not be successful. The same holds for the reference framework. Plugin-based applications are common, and Java provides good support for developing easily extended applications.

One technical area that will be challenging is providing scalable components that support large datasets. While Java works well for small and medium-sized datasets, it tends to degrade quickly when the number of elements reaches the tens and hundreds of millions. Since large datasets are important for many modern applications (e.g. genomics and Homeland Security), care should be taken to design the component system in a way that encourages the use of more appropriate languages for large-scale and high-performance visualizations.

While the technical challenges are fairly tractable, the cultural ones will be far more important to the overall success of the infrastructure. Some cultural challenges that will arise and some suggested solutions are:

Part II

Please describe the (information) visualization software infrastructure you are working on.

II.1) Project Name and Web Address

The Boost Graph Library (BGL) is the main visualization infrastructure component we are developing. The BGL is a high-performance graph library written in C++ that we are integrating into Katy Börner's InfoVis CyberInfrastructure (IVC).

In addition to the BGL, we are also in the early stages of exploring large-scale and high-performance visualization systems for Bioinformatics.

II.2) Core Team Members (Please list in order, Role of Project Member, Full Name, E-mail. e.g.: Developer, John Doe, jdoe@univ.edu)

Big Chief, Dr. Andrew Lumsdaine, lums at osl dot indiana dot edu
Lead Developer/Researcher, Dr. Douglas Gregor, dgregor at cs dot indiana dot edu
Developer/Researcher, Christopher Mueller, chemuell at cs dot indiana dot edu

II.3) Project Start Date

1997 (sequential BGL), June 2004 (parallel and distributed BGL)

II.4) Targeted User Group

Infrastructure developers, graph algorithm developers

II.5) Supported User Tasks

High performance data structures and algorithms for graphs

II.6) Major Features of the System Architecture

Algorithms are generic, allowing them to operate efficiently on any graph-like data structure. For instance, we have simple bindings permitting BGL algorithms to operate on LEDA graphs, and could even use Java Native Interface (JNI) to allow BGL algorithms to operate on JUNG graphs.
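To make the idea of a generic algorithm concrete, here is a minimal sketch in standard C++ (no Boost required). It is not the BGL's actual interface: the types AdjacencyList and EdgeList, the free functions num_vertices and neighbors as defined here, and the function bfs_distances are all assumptions made for illustration. The point is only that one algorithm template can run unchanged on two unrelated graph representations, as long as each supplies the same small set of free functions.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Graph representation 1 (hypothetical): a plain adjacency list.
struct AdjacencyList {
    std::vector<std::vector<std::size_t>> adj;
};
std::size_t num_vertices(const AdjacencyList& g) { return g.adj.size(); }
std::vector<std::size_t> neighbors(const AdjacencyList& g, std::size_t v) {
    return g.adj[v];
}

// Graph representation 2 (hypothetical): an undirected edge list,
// standing in for a "foreign" structure owned by another library.
struct EdgeList {
    std::size_t n = 0;
    std::vector<std::pair<std::size_t, std::size_t>> edges;
};
std::size_t num_vertices(const EdgeList& g) { return g.n; }
std::vector<std::size_t> neighbors(const EdgeList& g, std::size_t v) {
    std::vector<std::size_t> out;
    for (const auto& e : g.edges) {
        if (e.first == v) out.push_back(e.second);
        else if (e.second == v) out.push_back(e.first);
    }
    return out;
}

// Marker for vertices that the search never reaches.
constexpr std::size_t kUnreachable = static_cast<std::size_t>(-1);

// Generic breadth-first search: works on any Graph type for which
// num_vertices(g) and neighbors(g, v) are defined. The adapter
// functions above are the only per-representation code.
template <typename Graph>
std::vector<std::size_t> bfs_distances(const Graph& g, std::size_t source) {
    assert(source < num_vertices(g));
    std::vector<std::size_t> dist(num_vertices(g), kUnreachable);
    std::queue<std::size_t> frontier;
    dist[source] = 0;
    frontier.push(source);
    while (!frontier.empty()) {
        const std::size_t u = frontier.front();
        frontier.pop();
        for (std::size_t v : neighbors(g, u)) {
            if (dist[v] == kUnreachable) {
                dist[v] = dist[u] + 1;
                frontier.push(v);
            }
        }
    }
    return dist;
}
```

Adapting a third-party graph type in this style means writing only the two small free functions for it; the algorithm itself is never copied or modified.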

II.7) Algorithms Provided

Various shortest paths (single-source and all-pairs), minimum spanning tree, connected components, maximum flow, isomorphism, betweenness centrality, Kamada-Kawai layout, etc. See the more complete list in the table of contents.

II.8) Snapshot of the Interface

Not applicable, because the BGL is a low-level graph manipulation library.

II.9) Development Platform

Standard C++

II.10) Supported Operating Systems

Any.

II.11) Software Dependencies/Required Libraries

Requires a standards-compliant C++ compiler. Requires (and is part of) the Boost C++ Libraries.

II.12) Current License

The Boost Software License, which is a BSD-style license.

II.13) Number of Users/Downloads

The BGL is part of the larger Boost C++ Libraries, a collection of free, peer-reviewed libraries. Boost itself has had approximately 194,000 downloads (currently averaging around 1,000 per day) and its developer mailing list consists of 2,200 developers and users. The Boost libraries are highly regarded in the C++ community, with several libraries having been accepted for standardization by the ANSI/ISO C++ committee. However, it is difficult to characterize how much traffic is due to the BGL itself.

II.14) Pros and Cons

The BGL is a fast, stable, extensible library for manipulation of graphs. Its implementation in C++ is both a pro and a con: there is essentially no unnecessary overhead in a BGL graph or graph algorithm, even when adapting to other graph libraries. For instance, the (sequential) BGL computes betweenness centrality between 20 and 40 times faster than JUNG and requires much less memory. The parallel BGL implementation of betweenness centrality then scales linearly with the number of processors, permitting additional speedups.

On the other hand, the learning curve for the BGL is much steeper than for, e.g., JUNG, and it may prove too steep for casual developers. Also, integrating C++ into a Java project is workable, and probably necessary for large datasets, but it complicates development, maintenance, and release management.

II.15) Planned Work

Our current research involves improvements to the existing BGL (new algorithms, etc.), parallel and distributed enhancements to the BGL (for computation on very large graphs), large graph visualization (eventually), and integration of both the sequential and parallel BGL into an InfoVis framework (currently working with IVC).

Part III

Please describe your main interest in participating in the workshop

We are interested in researching high-performance and scalable tools for data processing and visualization.


Send the completed paper by Sept. 30, 2004 to katy@indiana.edu and Jean-Daniel.Fekete@inria.fr.


Created by Jean-Daniel Fekete and Katy Börner on Thur Aug 12 11:15:27 2004