![]() |
![]() |
![]() |
| Daniel Keim | Stephen North | Mike Sips |
Spatial data describes objects or phenomena with specific real-world locations. Large spatial data sets occur naturally when accumulating many samples or readings of phenomena in the real world while moving through two dimensions in space. Spatial data mining methods can be applied to understand spatial phenomena and to discover relationships between spatial and non-spatial data. In the development of modern data mining, researchers have proposed various automated data mining algorithms for discovering patterns in large spatial databases. These automated data mining algorithms combine methods from more mature research areas within statistics, pattern recognition and machine learning.
Geospatial data describe real-world objects or phenomena with specific locations and associated statistical values or attributes. By considering just one statistical attribute at a time, we can interpret geospatial data sets as points in a 3D data space-that is, two geographical dimensions and a statistical dimension. Because real-world data set distributions are often nonuniform, the data points form readily identifiable 3D point clouds. Figure 1 shows a household income distribution in a 3D data space spanned by longitude, latitude, and median household income.
Point phenomena with statistical values can be displayed as colored pixels. This simple visualization is called Dot Map. Dot Maps can be an elegant medium for communicating a wealth of information about the spatial relationships of spatial point phenomena, in a compact, convenient and familiar format. However, when large spatial data sets are drawn on a map, the problem of overlapping or overplotting of data points arises in highly populated areas, while low-population areas are virtually empty since spatial data are highly non-uniformly distributed in real world data sets. One widely used method is a 2.5D visualization showing data points aggregated up to map regions.
|
| Figure 1: Approaches to Visualize Large Spatial Data Sets - An Overview |
This technique is commercially available in systems such as VisualInsight's In3D and ESRI's ArcView. An alternative that shows more detail is a visualization of individual data points as bars according to their statistical value on a map. This technique is embodied in systems such as SGI's MineSet and AT&T's Swift 3D. A problem here is that a large number of data points are plotted at the same position, and therefore, only a small portion of the data is actually visible. Moreover, due to occlusion in 3D, a significant fraction of the data may not be visible unless the viewpoint is changed.
PixelMaps solves the optimization problem by a hybrid cluster and visualization approach, based on kernel density estimation and an iterative scheme for local repositioning of data points.
A key feature of our efficient implementation PixelMap, is the rescaling of regions of the map to better fit dense point clouds to unique output positions. This is effective because spatial data sets in the real world are often highly non-uniformly distributed.
First, the PixelMap algorithm approximates the two-dimensional kernel-density-estimation based clustering in the two spatial dimensions by finding a recursive partitioning of the dataset in 2D screen space, applying split operations according to the spatial parameters of the data points and the extensions of the 2D display space. The goal is (a) to find areas with high density in the two geometric dimensions and (b) to allocate enough pixels to place all data points of these dense regions at unique positions close to each other. The top-down partitioning of the dataset and 2D screen space results in the distortion of certain map regions. That means, however, virtually empty areas will shrink and dense areas will expand to achieve pixel coherence. For the efficient partitioning of the dataset and the 2D screen space, and efficient scaling to new boundaries, we implement a combination of gridfiles and quadtrees.
Second, the PixelMap algorithm approximates the three-dimensional kernel-density-estimation-based clustering in the three dimensions performing an array based clustering for each dataset partition. After rescaling the data points to the new boundaries, the iterative positioning of the data points (pixel placement step) starts with the densest regions and within the dense regions the smallest clusters are processed first.
Our first comparison is a map of the United States showing the U.S. Year 2000 Median Household Income Data. The left visualization is a traditional map, and the right visualization was made by the PixelMap algorithm. The traditional map provides random results in areas with high degree of overlap while sparsely populated areas are virtually empty. The visualization on the right shows the advantages of the PixelMap algorithm. First, we can easily see that New York City and Los Angeles County are the population areas of greatest interest in the United States. In the PixelMap visualization, the densest regions get the space needed to place all data points close to each other. We can even see the distribution of U.S. Year 2000 Median Household Incomes in these regions. Finally, for the U.S. Year 2000 Median Household Income we can observe a sharp decline between high and low income. Most Americans had income below 75000$, and ca. 10% of all Americans live below the federal poverty level.
|
| Figure: Comparison of Traditional Map versus PixelMap - United States, Year 2000 Median Household Income |
Our second comparison is a zoomed view to an interesting region, the State of New York. We can easy see that households with very high median household income are located in Manhattan and Queens, and households with low median household income are in the Bronx and Brooklyn. Especially, very wealthy inhabitants live on the east side of Central Park.
|
| Figure: PixelMap displays cluster regions - Note high-income clusters on the East side of Manhattan's Central Park, and low-income clusters on the West end of Brooklyn) |
Geo-Spatial Data Viewer follows a three step process: Overview first, zoom and filter, and then details-on-demand which has been called the Information Seeking Mantra. In other words, in the exploratory data analysis (EDA) of a data set, an analyst first obtains an overview. This may reveal potentially interesting patterns or certain subsets of the data that deserve further investigation. Our {\em Geo-Spatial Data Viewer} provides an overview of the geo-spatial data using familiar land-covering maps. In this overview, the user can identify interesting patterns or groups in the geo-spatial data and focus on one or more of them. To focus on one or more of them, the data analyst can choose different distortion levels to view the geo-spatial phenomena in more detail. The different distortion levels can be chosen on the fly using the interactive distortion slider. The efficiency of our approach enables a smooth change between the different distortion levels. To get access to the geo-spatial data in the different distortion levels, the data analyst can influence the pixel-placement. The goal is to get a reasonable clustering of the data. To avoid non-practicable visualizations, the pixel placement step depends on the distortion level.
|
| Figure: From familiar land-covering (level 0) to arbitrary distorted geo-spatial quadtree maps (level 14) |
Some of the key advantages of our Geo-Spatial Data Viewer over automatic data mining techniques alone are: