23 July 2015

Data Vis Highlights from useR! 2015

  • tessera
    • Interactive plotting for large, complex data
  • tmap
    • Thematic maps
  • oaPlots
    • Density-plot-type legends
  • dendextend
    • Dendrograms
  • plotROC
    • Interactive ROC plots

Tessera

Consists of two main packages:

  • datadr
    • Divide & Recombine paradigm
    • Parallel data and processing back-ends (Hadoop and Spark)
    • A simple consistent interface
  • trelliscope
    • Flexible, detailed, scalable visualization of large, complex data

datadr

  • Fundamentally, all data types are stored in a back-end as key/value pairs

  • Data type abstractions on top of the key/value pairs
    • Distributed data frame (ddf):
    • A data frame that is split into chunks
    • Each chunk contains a subset of the rows of the data frame
    • Each subset may be distributed across the nodes of a cluster

Housing Data Example

# look at housing data
str(housing)
## 'data.frame':    224369 obs. of  7 variables:
##  $ fips            : Factor w/ 3235 levels "01001","01003",..: 187 187 187 187 187 187 187 187 187 187 ...
##  $ county          : Factor w/ 1969 levels "Abbeville County",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ state           : Factor w/ 57 levels "AK","AL","AR",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ time            : Date, format: "2008-10-01" "2008-11-01" ...
##  $ nSold           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ medListPriceSqft: num  308 299 NA 290 288 ...
##  $ medSoldPriceSqft: num  326 NA 318 306 292 ...

Divide

  • Divide the housing data set by the variables "county" and "state"
    • Akin to dplyr::group_by()
  • Creates a distributed data frame (ddf)
# divide by county and state
byCounty <- divide(housing, by = c("county", "state"), update = TRUE)
## * Running map/reduce to get missing attributes...
class(byCounty)
## [1] "ddf"      "ddo"      "kvMemory"

Each subset of the ddf has a key

byCounty[[1]]
## $key
## [1] "county=Abbeville County|state=SC"
## 
## $value
##    fips       time nSold medListPriceSqft medSoldPriceSqft
## 1 45001 2008-10-01    NA         73.06226               NA
## 2 45001 2008-11-01    NA         70.71429               NA
## 3 45001 2008-12-01    NA         70.71429               NA
## 4 45001 2009-01-01    NA         73.43750               NA
## 5 45001 2009-02-01    NA         78.69565               NA
## ...

Trelliscope

  • Divide and recombine visualization tool
  • Based on Trellis display
  • Apply a visualization method to each subset of a ddf or ddo
  • Interactively sort and filter plots
  • Define a function to apply to each subset that creates a plot
    • Plots can be created using base R graphics, ggplot, lattice, rbokeh, conceptually any htmlwidget

Make a panel function

# medListPriceSqft and medSoldPriceSqft by time
timePanel <- function(x) xyplot(medListPriceSqft + medSoldPriceSqft ~ 
    time, data = x, auto.key = TRUE, ylab = "$ / Sq. Ft.")
# test the panel function on one division
head(byCounty[[1]][[2]], 6)
##    fips       time nSold medListPriceSqft medSoldPriceSqft
## 1 45001 2008-10-01    NA         73.06226               NA
## 2 45001 2008-11-01    NA         70.71429               NA
## 3 45001 2008-12-01    NA         70.71429               NA
## 4 45001 2009-01-01    NA         73.43750               NA
## 5 45001 2009-02-01    NA         78.69565               NA
## 6 45001 2009-03-01    NA         76.38889               NA
class(byCounty[[1]][[2]])
## [1] "data.frame"

Test the panel function on one panel

timePanel(byCounty[[1]][[2]])

make a cognostics function

# slope of fitted line of list price for each county
lmCoef <- function(x)
  coef(lm(medListPriceSqft ~ time, data = x))[2]
  
priceCog <- function(x) { list(
  slope = cog(lmCoef(x), desc = "list price slope"),
  meanList = cogMean(x$medListPriceSqft),
  listRange = cogRange(x$medListPriceSqft),
  nObs = cog(sum(!is.na(x$medListPriceSqft)),
             desc = "number of non-NA list prices")
)}

Test the cognostics function on one panel

priceCog(byCounty[[1]][[2]])
## $slope
##          time 
## -0.0002323686 
## 
## $meanList
## [1] 72.76927
## 
## $listRange
## [1] 23.08482
## 
## $nObs
## [1] 66

Visualization database (vdb)

  • Trelliscope creates a directory with all the data to render the plots
  • Can later re-launch the Trelliscope display without all the prior data analysis
vdbConn("housing_vdb", autoYes = TRUE)

Make a display

makeDisplay(byCounty,
   name = "list_sold_vs_time_datadr",
   desc = "List and sold price over time",
   panelFn = timePanel,
   width = 400, height = 400,
   lims = list(x = "same")
)

view()

tmap package for thematic maps

A ggplot-like plotting plackage for thematic maps:

library(tmap)
data(Europe)
qtm(Europe)

quick plot for thematic maps

qtm(Europe,
    fill = "gdp_cap_est",
    text = "iso_a3",
    text.size = "AREA",
    root = 5,
    fill.title = "GDP per capita", 
    fill.textNA = "Non-European countries",
    theme = "Europe")

tmap Elements

Plotting with tmap elements

The main plotting method, the equivalent to ggplot2's ggplot, consists of elements that start with tm_. The first element to start with is tm_shape, which specifies the shape object. Next, one, or a combination of the following drawing layers should be specified:

  • tm_fill
  • tm_borders
  • tm_bubbles
  • tm_lines
  • tm_lines
  • tm_text
  • tm_grid
  • tm_credits
  • tm_scale_bar

Flexible way

tm_shape(Europe) +
    tm_fill("gdp_cap_est", 
            textNA="Non-European countries", 
            title="GDP per capita")

Flexible way

tm_shape(Europe) +
    tm_fill("gdp_cap_est",
            textNA="Non-European countries",
            title="GDP per capita") +
    tm_borders()

Flexible way

tm_shape(Europe) +
    tm_fill("gdp_cap_est",
            textNA="Non-European countries", 
            title="GDP per capita") +
    tm_borders() +
    tm_text("iso_a3", size="AREA", root=5)

Flexible way

tm_shape(Europe) +
    tm_fill("gdp_cap_est", 
            textNA="Non-European countries", 
            title="GDP per capita") +
    tm_borders() +
    tm_text("iso_a3", size="AREA", root=5) + 
    tm_layout_Europe()

Adding points

  • Use the metro data set:
    • Spatial data of metropolitan areas, of class SpatialPointsDataFrame.
    • Includes a population times series from 1950 to (forecasted) 2030.
    • All metro areas with over 1 million inhabitants in 2010 are included.

data(metro)

tm_shape(metro) +
    tm_bubbles("pop2010", "red", alpha = 0.5, size.lim = c(0, 11e6), 
               sizes.legend = seq(2e6,10e6, by=2e6), title.size="Metropolitan Population")

tm_shape(metro) +
    tm_bubbles("pop2010", "red", alpha = 0.5, size.lim = c(0, 11e6), 
               sizes.legend = seq(2e6,10e6, by=2e6), title.size="Metropolitan Population") +
    tm_text("name", size="pop2010", scale=1, ymod=-.02, root=4, size.lowerbound = .60) 

tm_shape(Europe) +
    tm_fill("pop_est_dens", style="kmeans", textNA="Non-European countries", 
    title="Country population density (per km2)") +
    tm_borders() +
    tm_text("iso_a3", size="area", scale=1.5, root=8, size.lowerbound = .40,
            fontface="bold", case=NA, fontcolor = "gray35")

tm_shape(Europe) +
    tm_fill("pop_est_dens", style="kmeans", textNA="Non-European countries", 
    title="Country population density (per km2)") +
    tm_borders() +
    tm_text("iso_a3", size="area", scale=1.5, root=8, size.lowerbound = .40,
            fontface="bold", case=NA, fontcolor = "gray35") +
tm_shape(metro) +
    tm_bubbles("pop2010", "red", alpha = 0.5, size.lim = c(0, 11e6), 
               sizes.legend = seq(2e6,10e6, by=2e6), title.size="Metropolitan Population") +
  tm_layout_Europe()

Density plots legends

install.packages("oaPlots", repos = "http://repos.openanalytics.eu", 
    type = "source")
library(oaPlots)

# dsub is a subset of the diamonds data frame

# define color pallette, color vector and color region breaks
colorPalette <- brewer.pal(9, "Blues")[4:9] # RColorBrewer function
colorObj <- splitColorVar(colorVar = dsub$z, colorPalette) # oaPlots function
colorVec <- colorObj$colorVec
breaks <- colorObj$breaks

# plot the data
prepLegend(side = "right", proportion = 0.3) # oaPlots function
oaTemplate(xlim = range(dsub$x), ylim = range(dsub$y),
           xlab = "X", ylab = "Y") # oaPlots function
points(x = dsub$x, y = dsub$y, col = colorVec, pch = 19, cex = 0.6)
# add the legend
densityLegend(x = dsub$z, colorPalette = colorPalette, side = "right",
              main = "Z", colorBreaks = breaks)

Dendrograms

## 
## Welcome to dendextend version 1.1.0
## 
## Type ?dendextend to access the overall documentation and
## browseVignettes(package = 'dendextend') for the package vignette.
## You can execute a demo of the package via: demo(dendextend)
## 
## More information is available on the dendextend project web-site:
## https://github.com/talgalili/dendextend/
## 
## Contact: <tal.galili@gmail.com>
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## 
##          To suppress the this message use:
##          suppressPackageStartupMessages(library(dendextend))
## 
## 
## Attaching package: 'dendextend'
## 
## The following object is masked from 'package:datadr':
## 
##     %>%
## 
## The following object is masked from 'package:stats':
## 
##     cutree
library(dendextend)
dend <- c(1:5) %>% 
            dist %>% 
            hclust("ave") %>% 
            as.dendrogram

plot(dend)

# Labels
labels(dend) <- c("A", "B", "extend", "dend", "C")
# Label colors
labels_colors(dend) <- rainbow(5)
plot(dend)

dend <- color_branches(dend, k = 2)
plot(dend)
dend2 <- sort(dend)
plot(dend2)

tanglegram(dend, dend2)

Interactive ROC curves

library(plotROC)
shiny_plotROC()

test

print(Cervantes)
## [1] "Facts are the enemy of truth"