Pixel Based Visual Data Mining in Large Geo-Spatial Point Sets

Pixel Based Visual Data Mining in Large Geo-Spatial Point Sets (under construction)

People

Daniel Keim Stephen North Mike Sips

Spatial Data Mining

Spatial data describes objects or phenomena with specific real-world locations. Large spatial data sets occur naturally when accumulating many samples or readings of phenomena in the real world while moving through two dimensions in space. Spatial data mining methods can be applied to understand spatial phenomena and to discover relationships between spatial and non-spatial data. In the development of modern data mining, researchers have proposed various automated data mining algorithms for discovering patterns in large spatial databases. These automated data mining algorithms combine methods from more mature research areas within statistics, pattern recognition and machine learning.

Visualizing geospatial data

Geospatial data describe real-world objects or phenomena with specific locations and associated statistical values or attributes. By considering just one statistical attribute at a time, we can interpret geospatial data sets as points in a 3D data space-that is, two geographical dimensions and a statistical dimension. Because real-world data set distributions are often nonuniform, the data points form readily identifiable 3D point clouds. Figure 1 shows a household income distribution in a 3D data space spanned by longitude, latitude, and median household income.

3D-Point clouds
Figure 1: 3D-Point clouds even in small real-world data sets - Plotting of geospatial data points on longitude, latitude, and statistical attribute (income in this example); Example shows a small sample, 1 percent, from the US Census (http://www.census.gov) 1999 New York Household Income data set

Figure 2 shows an xy-plot of the 3D point clouds. Visualizing large geospatial data sets involves mapping the two geographical dimensions to screen coordinates and encoding the statistical value by color. The difficulty is finding a useful mapping function f.

3D-Point clouds
Figure 2: 3D-Point clouds even in small real-world data sets - an xy-plot of the 3D-point clouds. The goal is to display all 3D-point clouds in a single continuous display without overlap. (Example shows a small sample, 1 percent, from the US Census (http://www.census.gov) 1999 New York Household Income data set).

When using a simple dot plot mapping function f, developers must solve two important visualization challenges:

Traditional Visual Models

Point phenomena with statistical values can be displayed as colored pixels. This simple visualization is called Dot Map. Dot Maps can be an elegant medium for communicating a wealth of information about the spatial relationships of spatial point phenomena, in a compact, convenient and familiar format. However, when large spatial data sets are drawn on a map, the problem of overlapping or overplotting of data points arises in highly populated areas, while low-population areas are virtually empty since spatial data are highly non-uniformly distributed in real world data sets. One widely used method is a 2.5D visualization showing data points aggregated up to map regions.

traditional map gridfit agg_bar_map
Figure 1: Approaches to Visualize Large Spatial Data Sets - An Overview

This technique is commercially available in systems such as VisualInsight's In3D and ESRI's ArcView. An alternative that shows more detail is a visualization of individual data points as bars according to their statistical value on a map. This technique is embodied in systems such as SGI's MineSet and AT&T's Swift 3D. A problem here is that a large number of data points are plotted at the same position, and therefore, only a small portion of the data is actually visible. Moreover, due to occlusion in 3D, a significant fraction of the data may not be visible unless the viewpoint is changed.

Local Density PixelMaps

PixelMaps solves the optimization problem by a hybrid cluster and visualization approach, based on kernel density estimation and an iterative scheme for local repositioning of data points.

Efficient Implementation

A key feature of our efficient implementation PixelMap, is the rescaling of regions of the map to better fit dense point clouds to unique output positions. This is effective because spatial data sets in the real world are often highly non-uniformly distributed.

First, the PixelMap algorithm approximates the two-dimensional kernel-density-estimation based clustering in the two spatial dimensions by finding a recursive partitioning of the dataset in 2D screen space, applying split operations according to the spatial parameters of the data points and the extensions of the 2D display space. The goal is (a) to find areas with high density in the two geometric dimensions and (b) to allocate enough pixels to place all data points of these dense regions at unique positions close to each other. The top-down partitioning of the dataset and 2D screen space results in the distortion of certain map regions. That means, however, virtually empty areas will shrink and dense areas will expand to achieve pixel coherence. For the efficient partitioning of the dataset and the 2D screen space, and efficient scaling to new boundaries, we implement a combination of gridfiles and quadtrees.

split_series
Figure: Pixel placement step for real 3D Clusters in the Manhattan Area - starting with the smallest class from the real cluster partition and placing all cluster members to free positions close to their centroid without overwriting pixels

Second, the PixelMap algorithm approximates the three-dimensional kernel-density-estimation-based clustering in the three dimensions performing an array based clustering for each dataset partition. After rescaling the data points to the new boundaries, the iterative positioning of the data points (pixel placement step) starts with the densest regions and within the dense regions the smallest clusters are processed first.

manhattan_placement
Figure: Pixel placement step for real 3D Clusters in the Manhattan Area - starting with the smallest class from the real cluster partition and placing all cluster members to free positions close to their centroid without overwriting pixels

Application Examples

Our first comparison is a map of the United States showing the U.S. Year 2000 Median Household Income Data. The left visualization is a traditional map, and the right visualization was made by the PixelMap algorithm. The traditional map provides random results in areas with high degree of overlap while sparsely populated areas are virtually empty. The visualization on the right shows the advantages of the PixelMap algorithm. First, we can easily see that New York City and Los Angeles County are the population areas of greatest interest in the United States. In the PixelMap visualization, the densest regions get the space needed to place all data points close to each other. We can even see the distribution of U.S. Year 2000 Median Household Incomes in these regions. Finally, for the U.S. Year 2000 Median Household Income we can observe a sharp decline between high and low income. Most Americans had income below 75000$, and ca. 10% of all Americans live below the federal poverty level.

AppEx2
Figure: Comparison of Traditional Map versus PixelMap - United States, Year 2000 Median Household Income

Our second comparison is a zoomed view to an interesting region, the State of New York. We can easy see that households with very high median household income are located in Manhattan and Queens, and households with low median household income are in the Bronx and Brooklyn. Especially, very wealthy inhabitants live on the east side of Central Park.

AppEx2
Figure: PixelMap displays cluster regions - Note high-income clusters on the East side of Manhattan's Central Park, and low-income clusters on the West end of Brooklyn)

Geo-Spatial Data Viewer

Geo-Spatial Data Viewer follows a three step process: Overview first, zoom and filter, and then details-on-demand which has been called the Information Seeking Mantra. In other words, in the exploratory data analysis (EDA) of a data set, an analyst first obtains an overview. This may reveal potentially interesting patterns or certain subsets of the data that deserve further investigation. Our {\em Geo-Spatial Data Viewer} provides an overview of the geo-spatial data using familiar land-covering maps. In this overview, the user can identify interesting patterns or groups in the geo-spatial data and focus on one or more of them. To focus on one or more of them, the data analyst can choose different distortion levels to view the geo-spatial phenomena in more detail. The different distortion levels can be chosen on the fly using the interactive distortion slider. The efficiency of our approach enables a smooth change between the different distortion levels. To get access to the geo-spatial data in the different distortion levels, the data analyst can influence the pixel-placement. The goal is to get a reasonable clustering of the data. To avoid non-practicable visualizations, the pixel placement step depends on the distortion level.

series
Figure: From familiar land-covering (level 0) to arbitrary distorted geo-spatial quadtree maps (level 14)

Key Advantages

Some of the key advantages of our Geo-Spatial Data Viewer over automatic data mining techniques alone are:

Publication

Current Stable Releases

Release Info