Heatmaps of Spherical Densities in R

DISCLAIMER: While I know a thing or two, there's a reasonable chance I got some things wrong or at very least there are certainly more efficient ways to go about things. Feedback always appreciated!

Last time we made contour maps of densities of points on a globe; now it is time to take another step and make heatmaps. We created all the data we needed when building the contours, but heatmaps add the new challenge of dealing with large amounts of raster and polygon data. Let's get to it.

Set-Up

First, we'll make use of a number of libraries and set up our plotting environment:

library(rgdal)       # For coordinate transforms
library(sp)          # For plotting grid images
library(sf)
library(lwgeom)
library(Directional) # For spherical density functions
library(spData)      # worldmap
library(raster)

We'll also use the same vmf_density_grid function we introduced in the Intro post.

vmf_density_grid <- function(u, ngrid = 100) {
  # Translate to (0,180) and (0,360)
  u[,1] <- u[,1] + 90
  u[,2] <- u[,2] + 180
  res <- vmf.kerncontour.new(u, thumb = "none", ret.all = T, full = T,
                             ngrid = ngrid)

  # Translate back to (-90, 90) and (-180, 180) and create a grid of
  # coordinates
  ret <- expand.grid(Lat = res$Lat - 90, Long = res$Long - 180)
  ret$Density = c(res$d)
  ret
}

Global Earthquakes Again

The global earthquake catalog from the Northern California Earthquake Data Center is a great dataset we'll continue to use, so we start with the same set of quakes since Jan 1, 1950 of magnitude 5.9 or higher.

For all our heatmaps, we'll start the same as we did for contours, calculating the density map:

grid.size = 100
earthquakes <- read.csv(file.path("..", "data", "earthquakes.csv"))
earthquake.densities <- vmf_density_grid(earthquakes[,c("Latitude",
                                                        "Longitude")],
                                         ngrid = grid.size)

Once we have the densities, we need to coerce them into a spatial format – in this case we'll create a SpatialGridDataFrame, matching the grid of densities we calculated with vmf_density_grid.

density_matrix <- matrix(earthquake.densities$Density, nrow = grid.size)
density_matrix <- t(apply(density_matrix, 2, rev))
gridVals <- data.frame(att=as.vector(density_matrix))
gt <- GridTopology(cellcentre.offset = c(-180 + 180 / grid.size,
                                         -90 + 90 / grid.size),
                   cellsize = c( 360 / grid.size, 180 / grid.size),
                   cells.dim = c(grid.size, grid.size))
sGDF <- SpatialGridDataFrame(gt,
                             data = gridVals,
                             proj = "+proj=longlat +datum=WGS84 +no_defs")

plot(sGDF)
plot(gridlines(sGDF), add = TRUE, col = "grey30", alpha = .1)
plot(st_geometry(world), add = TRUE, col = NA, border = "grey")

plot of chunk earthquake_plot

Great, we have a heatmap! But it is in plain latitude/longitude coordinates, and we want to project it to something nicer, like a Winkel tripel. There's a problem though… We can't just re-project our SpatialGridDataFrame – it gets interpolated into points, losing our nice smooth heatmap.
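
A quick sketch of what I mean (this relies on rgdal falling back to point coercion when it can't warp a grid – treat it as illustrative rather than part of the pipeline):

# Naively transforming the grid doesn't keep it a grid...
sgdf.wintri <- spTransform(sGDF, CRS("+proj=wintri"))
class(sgdf.wintri)
# ...we get points back instead of a SpatialGridDataFrame, so the smooth
# heatmap surface is gone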

There are two real options for us:

  • Convert to raster data, then project the raster
  • Convert to raster, convert to polygons, project the polygons

Projecting Raster Data

This is really slow, so we have to turn the resolution way down.

r <- raster(sGDF)
crs1 <- "+proj=wintri"
world.crs1 <- st_transform_proj(world, crs = crs1)

pr1 <- projectExtent(r, crs1)
res(pr1) <- 9e5
pr2 <- projectRaster(r, pr1, method = "bilinear", over = TRUE)
plot(pr2)
plot(st_geometry(world.crs1), add = TRUE, col = NA, border = "grey")

plot of chunk earthquake_proj_raster

I guess this works, but the low resolution suggests we can do better.

Using Polygons

We'll use raster data again, but we'll immediately convert it into a grid of square polygons which we can then project.

r2 <- raster(sGDF)
# We'll manually colorize
r2 <- cut(r2,
          pretty(r2[], 50),
          include.lowest = F)
color.vals <- rev(terrain.colors(50))
pol <- rasterToPolygons(r2)
crs1 <- "+proj=wintri"
world.crs1 <- st_transform_proj(world, crs = crs1)
pol.crs1 <- spTransform(pol, crs1)
plot(pol.crs1, col=color.vals[r2[]], border = NA)
# plot(gridlines(sgdf.crs1), add = TRUE, col = "grey30", alpha = .1)
plot(st_geometry(world.crs1), add = TRUE, col = NA, border = "grey")

plot of chunk earthquakes_projected

Now that looks good!

One thing to keep in mind, however: because our polygons are rectangles in plain latitude/longitude coordinates, they will warp and distort as the projection gets more severe. The small sketch below illustrates the effect, and you can see it in the animation as well.
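
To make the distortion concrete, here is a small standalone sketch (not part of the pipeline above) that projects a single oversized "grid cell" with and without extra vertices; st_segmentize densifies the edges so they can bend with the projection:

# One 30 x 30 degree cell defined in plain long/lat coordinates
cell <- st_sfc(st_polygon(list(rbind(c(0, 40), c(30, 40), c(30, 70),
                                     c(0, 70), c(0, 40)))),
               crs = 4326)
# Add a vertex roughly every 50 km before projecting
cell.dense <- st_segmentize(cell, dfMaxLength = 50000)
crs.demo <- "+proj=ortho +lat_0=30 +lon_0=15"
plot(st_transform(cell.dense, crs.demo), border = "red")
plot(st_transform(cell, crs.demo), add = TRUE)
# The original cell keeps straight edges between its four corners, while the
# densified version curves to follow the projection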

Animating

We're projecting into an orthographic projection to simulate the rotating globe. A few things you'll see in the code where I jump through hoops:

  • Cropping the top – If I leave the top polygons in place, they bunch up in an ugly fashion
  • Making features valid – For both the world and our heatmap polygons, I make sure only valid polygons get through to the final plot.

r3 <- raster(sGDF)

# Crop down because projecting the poles causes problems
r.crop <- res(r3)
rc <- crop(r3, extent(-180, 180,
                      -90 + r.crop[2], 90 - r.crop[2]))
pol <- rasterToPolygons(rc)
pol.breaks <- pretty(pol$att, 20)
pol.colors <- rev(terrain.colors(length(pol.breaks) - 1, alpha = 1))
# Make the lowest color transparent by zeroing its alpha channel
# (alpha = 1 above ensures the hex codes carry a trailing alpha byte to overwrite)
substr(pol.colors[1], 8, 9) <- "00"

par_old <- par()
par(mar = c(0, 0, 0, 0))
n.frames <- 30
grad <- st_graticule(ndiscr = 1e4)
for (i in 1:n.frames) {
  long <- -180 + (i - 1) * 360 / n.frames
  crs.ani <- paste0("+proj=ortho +lat_0=0 +lon_0=", long)
  grad.ani <- st_geometry(st_transform(grad, crs.ani))

  world.ani <- st_transform(st_geometry(world), crs = crs.ani)
  world.ani <- st_make_valid(world.ani)
  # We don't want the points
  world.ani <- world.ani[st_geometry_type(world.ani) %in% c('POLYGON',
                                                            'MULTIPOLYGON')]

  # There are inevitably some bad polygons coming out of the transform
  world.ani <- world.ani[st_is_valid(world.ani)]

  pol.ani <- st_transform(as(pol, "sf"), crs.ani)
  pol.ani.geo <- lwgeom::st_make_valid(pol.ani)
  pol.ani.geo <- pol.ani.geo[st_geometry_type(pol.ani.geo) %in% c('POLYGON',
                                                                  'MULTIPOLYGON',
                                                                  'GEOMETRYCOLLECTION'), ]
  pol.ani.geo <- pol.ani.geo[st_is_valid(pol.ani.geo), ]
  pol.ani.geo <- pol.ani.geo[!st_is_empty(pol.ani.geo), ]

  plot(grad.ani, col = "black")
  plot(world.ani, add = TRUE, col = "grey30", border = "grey")
  plot(pol.ani.geo, border = NA, breaks = pol.breaks, pal = pol.colors,
       add = TRUE, main = NA, key.pos = NULL)
}

plot of chunk earthquake_ani

par(par_old)

Looks pretty good, but we do have some interesting world map problems with countries popping out as they reach the edge… Something to investigate another day.

Final Notes

In both these examples we've used global data as it shows the problems of using “traditional” density estimators, but the same issue exists at all scales. It is just a question of when a simpler approximation is reasonable.

You can also see a bit of blockiness which we could reduce with an increase in grid size, but that will be very dependent on need.

Next, some real data…

Appendix

Spherical Density Function

This calculates a grid of densities which can then be used with geom_contour. The code comes almost directly from Directional's vmf.kerncontour, only returning the computed coordinates and density matrix instead of plotting the output.

vmf.kerncontour.new <- function(u, thumb = "none", ret.all = FALSE, full = FALSE,
                            ngrid = 100) {
  ## u contains the data in latitude and longitude
  ## the first column is the latitude and the
  ## second column is the longitude
  ## thumb is either 'none' (default), or 'rot' (Garcia-Portugues, 2013)
  ## ret.all if set to TRUE returns a matrix with latitude, longitude and density
  ## full if set to TRUE calculates densities for the full sphere, otherwise
  ##   using extents of the data
  ## ngrid specifies the number of points taken at each axis
  n <- dim(u)[1]  ## sample size
  x <- euclid(u)

  if (thumb == "none") {
    h <- as.numeric( vmfkde.tune(x, low = 0.1, up = 1)[1] )
  } else if (thumb == "rot") {
    k <- vmf(x)$kappa
    h <- ( (8 * sinh(k)^2) / (k * n * ( (1 + 4 * k^2) * sinh(2 * k) -
    2 * k * cosh(2 * k)) ) ) ^ ( 1/6 )
  }

  if (full) {
    x1 <- seq( 0, 180, length = ngrid )  ## latitude
    x2 <- seq( 0, 360, length = ngrid )  ## longitude
  } else {
    x1 <- seq( min(u[, 1]) - 5, max(u[, 1]) + 5, length = ngrid )  ## latitude
    x2 <- seq( min(u[, 2]) - 5, max(u[, 2]) + 5, length = ngrid )  ## longitude
  }
  cpk <- 1 / (  ( h^2)^0.5 *(2 * pi)^1.5 * besselI(1/h^2, 0.5) )
  mat <- matrix(nrow = ngrid, ncol = ngrid)

  for (i in 1:ngrid) {
    for (j in 1:ngrid) {
      y <- euclid( c(x1[i], x2[j]) )
      a <- as.vector( tcrossprod(x, y / h^2) )
      can <- sum( exp(a + log(cpk)) ) / ngrid
      if (abs(can) < Inf)   mat[i, j] <- can
    }
  }

  if (ret.all) {
    return(list(Lat = x1, Long = x2, h = h, d = mat))
  } else {
    contour(x1, x2, mat, nlevels = 10, col = 2, xlab = "Latitude",
            ylab = "Longitude")
    points(u[, 1], u[, 2])
  }
}

References

Introduction to Spherical Densities in R

DISCLAIMER: While I know a thing or two, there's a reasonable chance I got some things wrong or at very least there are certainly more efficient ways to go about things. Feedback always appreciated!

It always happens… I get interested in what I think will be a small data project to scratch some itch and end up down a deep rabbit hole. In this case, a passing interest in the geographic distribution of some samples (more on that in a future post) led to a deep dive into spherical distributions and densities.

Motivation

While I got interested in densities for the purpose of mapping the density of points on a globe, there are plenty of other cases where you might be interested in the distribution of points on a sphere. The trouble is that most commonly available functions, e.g. geom_density_2d from ggplot2, only handle regular grid coordinates.

The problems take two forms:

  • Global densities simply fail at the 'edge' of coordinates – e.g. near the poles or near +/- 180 degrees longitude.
  • Projection issues. On small scales and near the equator, it is generally safe to make the simplification that longitude/latitude form a square grid. At larger scales and closer to the poles, that assumption breaks down – the quick calculation below shows how fast.
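
To put the second point in numbers, here's a back-of-the-envelope check (assuming a spherical Earth, which is plenty accurate for the point being made):

lat <- c(0, 30, 60, 85)
# Kilometers spanned by one degree of longitude at each latitude,
# assuming a spherical Earth with radius ~6371 km
round((pi / 180) * 6371 * cos(lat * pi / 180), 1)
# Roughly 111, 96, 56 and 10 km – a "square" degree cell gets ever narrower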

I think it is worth pointing out that there are many tutorials on plotting event densities on maps (e.g. crime occurrences), but these all work at the city scale, where the error from using existing methods is small enough to be a reasonable approximation.

Set-Up

First, we'll make use of a number of libraries and set up our plotting environment:

library(ggplot2)     # For most of our plotting
library(cowplot)     # grid arrangement of plots
library(Directional) # For spherical density functions
library(maps)        # vector maps of the world
library(hrbrthemes)  # hrbrmstr themes

# And set some theme defaults
theme_set(theme_ipsum())
# Axis settings we'll reuse a lot
no.axis <- theme(axis.ticks.y = element_blank(), axis.text.y = element_blank(),
                 axis.ticks.x = element_blank(), axis.text.x = element_blank(),
                 axis.title.x = element_blank(), axis.title.y = element_blank())

Next, for this example, we'll be using a random blob of points placed on a sphere, generated with the rvmf function from the Directional package. Directional is a general-purpose library that defines latitude from 0 to 180 degrees and longitude from 0 to 360, instead of the familiar -90 to 90 and -180 to 180. The random_points function here gives us points in the coordinate system we're used to.

random_points <- function(n_points, lat, lon, concentration) {
  # Directional defines lat + long as 0-180 and 0-360 respectively so we
  # have to shift back and forth
  mu <- euclid(c(lat + 90, lon + 180))[1,]
  pts <- euclid.inv(rvmf(n_points, mu, concentration))
  pts[,1] <- pts[,1] - 90
  pts[,2] <- pts[,2] - 180
  data.frame(pts)
}

Problem

To visualize the problem, we'll create 2 sets of points, one centered on the map, the other near the pole and near 180 degrees. We'll then plot the contours of the densities to show the issue.

offset.pos <- list(Lat = 75, Long = 175)
positions.center <- random_points(1000, 0, 0, 10)
positions.offset <- random_points(1000, offset.pos$Lat, offset.pos$Long, 10)
plot.colors <- hcl(h = c(0:3)*90, c = 50 , l = 70)
g.base <- ggplot(positions.center, aes(x = Long, y = Lat)) +
          scale_y_continuous(breaks = (-2:2) * 30, limits = c(-90, 90)) +
          scale_x_continuous(breaks = (-4:4) * 45, limits = c(-180, 180)) +
          coord_map()

g.broken <- g.base +
     # The centered random points
     geom_density_2d(color = plot.colors[1]) +
     geom_point(size = 0.5, stroke = 0, color = plot.colors[1]) +
     # The offset random points
     geom_density_2d(data = positions.offset, color = plot.colors[2]) +
     geom_point(data = positions.offset, size = 0.5, stroke = 0,
                color = plot.colors[2])

ortho.projections <- plot_grid(
  g.broken + coord_map("ortho", orientation = c(0, 0, 0)) + no.axis,
  g.broken + coord_map("ortho", orientation = c(offset.pos$Lat, offset.pos$Long, 0))
           + no.axis,
  labels = NULL,
  align = 'h')
g.broken
ortho.projections

plot of chunk problem

We can quickly see the problem looking at the blue offset density plot – there are multiple “centers” and the contours don't connect cleanly.

Spherical Densities

The solution is to use spherical densities, and fortunately the Directional package provides functions for spherical distributions (and in fact circular distributions and spheres of arbitrary dimension) using the von Mises-Fisher distribution.

Our basic approach will be the following steps:

  • Calculate a “grid” of densities manually, covering the entire globe
  • Use geom_contour to turn those density maps into contour curves
  • Plot away!

Before we fix the problem using spherical densities, we first need to do some setup. We'll be using vmf.kerncontour from the Directional library, but in the current CRAN version (3.2), that function plots the contours itself. We want the underlying data so we can make the plots ourselves, so we need a version that returns it. The next version of the package will have that option, but in the meantime we put the code for the revised function in the Appendix as vmf.kerncontour.new.

Similar to what we did for random_points, we also need to perform some translation of vmf.kerncontour's input and output to our more familiar formats.

vmf_density_grid <- function(u, ngrid = 100) {
  # Translate to (0,180) and (0,360)
  u[,1] <- u[,1] + 90
  u[,2] <- u[,2] + 180
  res <- vmf.kerncontour.new(u, thumb = "none", ret.all = T, full = T,
                             ngrid = ngrid)

  # Translate back to (-90, 90) and (-180, 180) and create a grid of
  # coordinates
  ret <- expand.grid(Lat = res$Lat - 90, Long = res$Long - 180)
  ret$Density = c(res$d)
  ret
}

Now we can go ahead and calculate the densities and plot the contours. We'll keep the “bad” contours for comparison.

densities.center <- vmf_density_grid(positions.center)
densities.offset <- vmf_density_grid(positions.offset)

g.broken <- g.base +
     geom_density_2d(color = plot.colors[1], alpha = .5) +
     geom_point(size = 0.5, stroke = 0, color = plot.colors[1], alpha = .5) +
     geom_density_2d(data = positions.offset, color = plot.colors[2], alpha = .5) +
     geom_point(data = positions.offset, size = 0.5, stroke = 0, color =
                plot.colors[2], alpha = .5)

g.densities <- g.broken +
  geom_contour(data = densities.center,
               aes(x=Long, y=Lat, z=Density),
               color = plot.colors[3]) +
  geom_contour(data = densities.offset,
               aes(x=Long, y=Lat, z=Density),
               color = plot.colors[4])

ortho.projections <- plot_grid(
  g.densities + coord_map("ortho", orientation = c(0, 0, 0)) + no.axis,
  g.densities + coord_map("ortho",
                          orientation = c(offset.pos$Lat, offset.pos$Long, 0))
              + no.axis,
  labels = NULL,
  align = 'h')
g.densities
ortho.projections

plot of chunk fixed_densities

Particularly looking at the orthographic plots, it is easy to see that the spherical density process gives the same rings in both locations, with continuous curves.

Practical Example: Global Earthquakes

Earthquake density is used in one of the few existing attempts on R-Bloggers to perform density calculations with spherical coordinates. The Northern California Earthquake Data Center provides an archive of earthquakes for download, so we start with a set of quakes since Jan 1, 1950 of magnitude 5.9 or higher. Given that data, we then follow the same process as we did with our random data to plot both the default 2d density contours and the contours from the spherical density.

earthquakes <- read.csv(file.path("..", "data", "earthquakes.csv"))
earthquake.densities <- vmf_density_grid(earthquakes[,c("Latitude",
                                                        "Longitude")],
                                         ngrid = 300)
world <- map_data("world")
g.earthquakes <- ggplot() +
  geom_map(data = world, map = world,
           mapping = aes(map_id = region),
           color = "grey90", fill = "grey80") +
  geom_point(data = earthquakes,
             mapping = aes(x = Longitude, y = Latitude),
             color = "red", alpha = .2, size = .5, stroke = 0) +
  geom_density_2d(data = earthquakes,
                  aes(x=Longitude, y=Latitude),
                  color = plot.colors[2], alpha = 1) +
  geom_contour(data = earthquake.densities, aes(x=Long, y=Lat, z=Density),
               color = plot.colors[4]) +
  scale_y_continuous(breaks = (-2:2) * 30, limits = c(-90, 90)) +
  scale_x_continuous(breaks = (-4:4) * 45, limits = c(-180, 180)) +
  coord_map("mercator")

g.earthquakes

plot of chunk earthquake_plot

# We use the built-in knitr options to animate the globe
n.frames <- 30
for (i in 1:n.frames) {
  long <- 170 + (i - 1) * 360 / n.frames
  # We Explicitly use the 'plot' command to show the ggplot
  plot(g.earthquakes + coord_map("ortho", orientation = c(0, long, 0)) + no.axis)
}

plot of chunk earthquake_ani

The yellow shows default 2d density, and you can again see the continuity problems. The blue shows the expected Ring of Fire thanks to the spherical density. It isn't perfect – if we were really interested in the most accurate results, we'd probably want to turn up the grid size to better follow the chains of quakes or tweak the contour breakpoints to see the fine features.
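If you do want to play with the breakpoints, a hedged tweak (not run here) would be to hand geom_contour an explicit set of breaks instead of its defaults:

# Hypothetical: more breakpoints to pull out finer structure in the density
g.earthquakes +
  geom_contour(data = earthquake.densities,
               aes(x = Long, y = Lat, z = Density),
               breaks = pretty(range(earthquake.densities$Density), n = 30),
               color = plot.colors[3])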

This should be a good first step toward looking at densities of geographic events.

Next

While this should have given a good introduction to densities on a sphere and the issues with using the default density functions, there is still more we can do. We've got a few more posts coming:

  • Heatmaps – Working with heatmaps means generating raster data and projections with raster data adds more complexity
  • More Real Examples – I mentioned I had an actual project I was curious about, right?

Appendix

Spherical Density Function

This calculates a grid of densities which can then be used with geom_contour. The code comes almost directly from Directional's vmf.kerncontour, only returning the computed coordinates and density matrix instead of plotting the output.

vmf.kerncontour.new <- function(u, thumb = "none", ret.all = FALSE, full = FALSE,
                            ngrid = 100) {
  ## u contains the data in latitude and longitude
  ## the first column is the latitude and the
  ## second column is the longitude
  ## thumb is either 'none' (default), or 'rot' (Garcia-Portugues, 2013)
  ## ret.all if set to TRUE returns a matrix with latitude, longitude and density
  ## full if set to TRUE calculates densities for the full sphere, otherwise
  ##   using extents of the data
  ## ngrid specifies the number of points taken at each axis
  n <- dim(u)[1]  ## sample size
  x <- euclid(u)

  if (thumb == "none") {
    h <- as.numeric( vmfkde.tune(x, low = 0.1, up = 1)[1] )
  } else if (thumb == "rot") {
    k <- vmf(x)$kappa
    h <- ( (8 * sinh(k)^2) / (k * n * ( (1 + 4 * k^2) * sinh(2 * k) -
    2 * k * cosh(2 * k)) ) ) ^ ( 1/6 )
  }

  if (full) {
    x1 <- seq( 0, 180, length = ngrid )  ## latitude
    x2 <- seq( 0, 360, length = ngrid )  ## longitude
  } else {
    x1 <- seq( min(u[, 1]) - 5, max(u[, 1]) + 5, length = ngrid )  ## latitude
    x2 <- seq( min(u[, 2]) - 5, max(u[, 2]) + 5, length = ngrid )  ## longitude
  }
  cpk <- 1 / (  ( h^2)^0.5 *(2 * pi)^1.5 * besselI(1/h^2, 0.5) )
  mat <- matrix(nrow = ngrid, ncol = ngrid)

  for (i in 1:ngrid) {
    for (j in 1:ngrid) {
      y <- euclid( c(x1[i], x2[j]) )
      a <- as.vector( tcrossprod(x, y / h^2) )
      can <- sum( exp(a + log(cpk)) ) / ngrid
      if (abs(can) < Inf)   mat[i, j] <- can
    }
  }

  if (ret.all) {
    return(list(Lat = x1, Long = x2, h = h, d = mat))
  } else {
    contour(x1, x2, mat, nlevels = 10, col = 2, xlab = "Latitude",
            ylab = "Longitude")
    points(u[, 1], u[, 2])
  }
}

References

Parsing Functions in edgarWebR

New to edgarWebR 0.2.0 are functions for parsing SEC documents. While there are good R packages for XBRL processing, there is a gap in extracting information from the other document types available via the site. edgarWebR currently provides functions for 2 of those –

  • parse_submission() – Processes a raw SGML filing into component documents
  • parse_filing() – Processes a narrative filing (e.g. 10-K, 10-Q) into paragraphs annotated with part and item numbers

This vignette will show how to use both functions to find the risks reported in a company's recent filing.

Find a Submission

Using edgarWebR functions, we'll first look up a recent filing.

ticker <- "STX"

filings <- company_filings(ticker, type="10-Q", count=40)
# Specifying the type provides all forms that start with 10-, so we need to
# manually filter.
filings <- filings[filings$type == "10-Q",]
# We're only interested in the latest filing this time
filing <- filings[1,]
filing$md_href <- paste0("[Link](", filing$href, ")")
knitr::kable(filing[,c("type", "filing_date", "accession_number", "size",
                              "md_href")],
             col.names=c("Type", "Filing Date", "Accession No.", "Size", "Link"),
             digits = 2,
             format.args = list(big.mark=","))
Type Filing Date Accession No. Size Link
10-Q 2017-04-28 0001193125-17-148855 8 MB Link

Get the Complete Submission File

We'll next get the list of files and find the link to the complete submission.

docs <- filing_documents(filing$href)
doc <- docs[docs$description == 'Complete submission text file',]
doc$md_href <- paste0("[Link](", doc$href, ")")

knitr::kable(doc[,c("seq", "description", "document", "size",
                              "md_href")],
             col.names=c("Sequence", "Description", "Document", "Size", "Link"),
             digits = 2,
             format.args = list(big.mark=","))
Sequence Description Document Size Link
5 NA Complete submission text file 0001193125-17-148855.txt 8,112,579 Link

Normally, we would use filing_documents() to get to the 10-Q directly, but as an example we'll be using the complete submission file to demonstrate the parse_submission() function. You would want to use the complete submission file if you want to access the full list of files – e.g. in this case there are 80 files in the submission, but only 10 available on the website and therefore available to filing_documents() – or if you worry about efficiency and are downloading all of the documents.

Parse the Complete Submission File

Now that we have the link to the complete submission file, we can parse it into components.

parsed_docs <- parse_submission(doc$href)
knitr::kable(head(parsed_docs[,c("SEQUENCE", "TYPE", "DESCRIPTION", "FILENAME")]),
             col.names=c("Sequence", "Type", "Description", "Document"),
             digits = 2,
             format.args = list(big.mark=","))
Sequence Type Description Document
1 10-Q 10-Q d381726d10q.htm
2 EX-31.1 EX-31.1 d381726dex311.htm
3 EX-31.2 EX-31.2 d381726dex312.htm
4 EX-32.1 EX-32.1 d381726dex321.htm
5 EX-101.INS XBRL INSTANCE DOCUMENT stx-20170331.xml
6 EX-101.SCH XBRL TAXONOMY EXTENSION SCHEMA stx-20170331.xsd
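
As a quick check of the 80-versus-10 gap mentioned above (a small sketch using the objects we already have in hand):

# Files listed on the website for this filing vs. components recovered from
# the complete submission file
nrow(docs)
nrow(parsed_docs)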

And just as an example, here's the end of the full list – note, for instance, the Excel file that isn't available on the SEC site.

knitr::kable(tail(parsed_docs[,c("SEQUENCE", "TYPE", "DESCRIPTION", "FILENAME")]),
             col.names=c("Sequence", "Type", "Description", "Document"),
             digits = 2,
             format.args = list(big.mark=","))
Sequence Type Description Document
75 75 XML IDEA: XBRL DOCUMENT R65.htm
76 76 EXCEL IDEA: XBRL DOCUMENT Financial_Report.xlsx
77 77 XML IDEA: XBRL DOCUMENT Show.js
78 78 XML IDEA: XBRL DOCUMENT report.css
79 80 XML IDEA: XBRL DOCUMENT FilingSummary.xml
80 82 ZIP IDEA: XBRL DOCUMENT 0001193125-17-148855-xbrl.zip

The 10-Q Filing document is Seq. 1, with the full text of the document in the TEXT column.

# NOTE: the filing document is not always #1, so it is a good idea to also look
# at the type & Description
filing_doc <- parsed_docs[parsed_docs$TYPE == '10-Q' &
                          parsed_docs$DESCRIPTION == '10-Q', 'TEXT']
substr(filing_doc,1,80)
#> [1] "<HTML><HEAD>\n<TITLE>10-Q</TITLE>\n</HEAD>\n <BODY BGCOLOR=\"WHITE\">\n<h5 align=\"left"

We can see that it contains the raw document. For document types that are not plain text, e.g. the XBRL zip file, the content is uuencoded and would need further processing.

Parse the Filing Document

Fortunately, edgarWebR functions that take URLs will also accept a string containing the document, so rather than passing the URL of the online document we can just pass in the full text we already have.

doc <- parse_filing(filing_doc, include.raw = TRUE)
unique(doc$part.name)
#> [1] ""        "PART I"  "PART II"
unique(doc$item.name)
#>  [1] ""                                                                                                   
#>  [2] "ITEM 1.\nFINANCIAL STATEMENTS"                                                                      
#>  [3] "ITEM 2.\nMANAGEMENT\u0092S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS"
#>  [4] "ITEM 3.\nQUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK"                                
#>  [5] "ITEM 4.\nCONTROLS AND PROCEDURES"                                                                   
#>  [6] "ITEM 1.\nLEGAL PROCEEDINGS"                                                                         
#>  [7] "ITEM 1A.\nRISK FACTORS"                                                                             
#>  [8] "ITEM 2.\nUNREGISTERED SALES OF EQUITY SECURITIES AND USE OF PROCEEDS"                               
#>  [9] "ITEM 3.\nDEFAULTS UPON SENIOR SECURITIES"                                                           
#> [10] "ITEM 4.\nMINE SAFETY DISCLOSURES"                                                                   
#> [11] "ITEM 5.\nOTHER INFORMATION"                                                                         
#> [12] "ITEM 6.\nEXHIBITS"
head(doc[grepl("market risk", doc$item.name, ignore.case=TRUE),"text"], 3)
#> [1] "ITEM 3.\nQUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK"                                                                                                                                                                                                                                                                                                                                                                                                                               
#> [2] "We have exposure to market\nrisks due to the volatility of interest rates, foreign currency exchange rates, equity and bond markets. A portion of these risks are hedged, but fluctuations could impact our results of operations, financial position and cash flows. Additionally,\nwe have exposure to downgrades in the credit ratings of our counterparties as well as exposure related to our credit rating changes."                                                                         
#> [3] "Interest Rate Risk. Our exposure to market risk for changes in interest rates relates primarily to our investment portfolio. As of\nMarch 31, 2017, we had no material available-for-sale securities that had been in a continuous unrealized loss position for a period greater than 12 months. We\ndetermined no available-for-sale securities were other-than-temporarily impaired as of March 31, 2017. We currently do not use derivative financial instruments in\nour investment portfolio."
risks <- doc[grepl("market risk", doc$item.name, ignore.case=TRUE),"raw"]

Now the document is all ready for whatever further processing we want. As a quick example we'll pull out all the italicized risks.

risks <- risks[grep('<i>',risks)]
risks <- gsub("^.*<i>|</i>.*$", "", risks)
risks <- gsub("\n", " ", risks)
risks
#> [1] "Interest Rate Risk"             "Foreign Currency Exchange Risk"
#> [3] "Derivatives and Hedging."       "Other Market Risks"

This is a fairly simplistic example, but should serve as a good tutorial on processing filings.

How to Download

edgarWebR is available from CRAN, so it can be installed simply via

install.packages("edgarWebR")

If you want the latest and greatest, you can get a copy of the development version from github by using devtools:

# install.packages("devtools")
devtools::install_github("mwaldstein/edgarWebR")

Introduction to Sentiment Analysis of 10-K Reports in R

Nothing in this article should be considered as investment advice

Overview

In the US, publicly traded companies are required to publish an annual report, called a 10-K. In addition to basic financial information, these reports include management commentary and a disclosure of perceived risks. While the financials are where analysts typically focus, attention is also given to reading between the lines of the generally bland and risk-averse narrative sections.

I got interested in using R to automate the process of grabbing the 10-K from the SEC website, parsing out the narrative sections, and applying basic sentiment and text analysis.

Basics

It was working on this project that led to the creation of the edgarWebR package, and we'll be using it extensively as we look up a company, find its annual filings, and fetch the management report.

First, we'll specify the ticker we're interested in and the number of years we want to analyze. Using that we can find the company information.

ticker <- 'EA'
years <- 5
company.details <- company_details(ticker, type = "10-K", count = years)

# TODO: better formatting of company information
kable(company.details$information %>% select(name, cik, fiscal_year_end),
      col.names=c('Company Name', 'CIK', 'Fiscal Year End'))
Company Name CIK Fiscal Year End
ELECTRONIC ARTS INC. 0000712515 0331

Next, we look up the reports in the SEC EDGAR system and dig into each filing to find the main document. A lot is happening in a few lines, but effectively we're going from a list of annual filings to URLs for the main document in each filing.

One note – while we specified both the type (10-K) and the count (5), the SEC site's type filter isn't precise, so we need to re-filter.

# Each filing is made up of multiple documents,
# this extracts the main narrative
filing_doc <- function(href) {
  sapply(href, function(x) {
         filing_documents(x) %>%
         filter( type == "10-K" ) %>% select(href) }) %>%
  unlist(recursive = TRUE, use.names = FALSE)
}
company.reports <- company.details$filings %>%
  filter(type == "10-K") %>%
  slice(1:years) %>% 
  mutate(doc.href = filing_doc(href),
         mdlink = paste0("[Filing Link](", href, ")"),
         reportLink = paste0("[10-K Link](", doc.href, ")")) %>%
  select(filing_date, accession_number, mdlink, reportLink, href, doc.href)

knitr::kable(company.reports %>% select(-href, -doc.href),
     col.names=c('Filing Date', 'Accession #', 'Filing Link', '10-K Link'))
Filing Date Accession # Filing Link 10-K Link
2017-05-24 0000712515-17-000035 Filing Link 10-K Link
2016-05-27 0000712515-16-000111 Filing Link 10-K Link
2015-05-21 0000712515-15-000033 Filing Link 10-K Link
2014-05-21 0000712515-14-000024 Filing Link 10-K Link
2013-05-22 0000712515-13-000022 Filing Link 10-K Link

We now have a link to the narrative 10-K for each of the past 5 years.

Parse out 10K

To perform sentiment analysis, the next big step is to take the HTML formatted 10-K and turn it into a set of words. While we won't use it right now, we'll also parse the various sections of the document for later filtering.

Warning: this gets hairy…

knitr::spin_child("../R/parse10k.R")
parse10k <- function(uri) {
  # 10-K HTML files are very flat with a long list of nodes. This pulls all
  # the relevant nodes.
  nodes <- read_html(uri) %>% html_nodes('text') %>% xml_children()
  nodes <- nodes[xml_name(nodes) != "hr"]

  # Unfortunately there isn't much of a workaround to this loop - we need
  # to track position in the file so it has to be a bit sequential...
  doc.parts <- tibble(nid = seq(length(nodes)),
                     node = nodes,
                     text = xml_text(nodes) ) %>% 
              filter(text != "") # way to get columns defined properly

  parts <- doc.parts %>%
    filter(grepl("^part",text, ignore.case=TRUE)) %>%
    select(nid,text)
  #  mutate(next.nid = c(nid[-1],length(nodes)+1)) %>%
  if (parts$nid[1] > 1) {
    parts <- bind_rows(tibble(nid = 0, text= "PART 0"), parts)
  }
  parts <- bind_rows(parts,
                     tibble(nid = doc.parts$nid[length(doc.parts$nid)] + 1,
                            text = "NA"))

  items <- doc.parts %>%
    filter(grepl("^item",text, ignore.case=TRUE)) %>%
    select(nid,text) %>%
    mutate(next.nid = c(nid[-1],length(nodes)+1),
           part.next = parts$nid[findInterval(nid,parts$nid) + 1],
           next.nid = ifelse(part.next < next.nid, part.next, next.nid),
           prev.end = c(0,next.nid[-length(nid)]))

  # Fill in item gaps w/ N/A
  n <- 0
  for(i in seq(length(items$nid))) {
    j <- i + n
    if(items$prev.end[j] != items$nid[j]) {
      items <- items %>% 
        add_row(nid = items$prev.end[j], text = NA, .before = j)
      n <- n + 1
    }
  }

  doc.parts <- doc.parts %>% 
    mutate( part = parts$text[findInterval(nid, parts$nid)],
            item = items$text[findInterval(nid, items$nid)]) %>%
    select(nid,part,item,text)

  return(doc.parts)
}

We then take our parsing function and run it against each of the reports.

data <- company.reports %>% rowwise() %>%
  mutate(nodes = map(doc.href, parse10k)) %>%
  select(-accession_number, -href, -mdlink, -doc.href, -reportLink) %>%
  ungroup() %>%
  group_by(filing_date)

Finally, we use unnest_tokens from tidytext to prepare our documents for sentiment analysis.

words <- data %>%
  unnest(nodes) %>%
  select(-nid) %>%
  unnest_tokens(word, text)
words
## # A tibble: 271,533 x 4
## # Groups:   filing_date [5]
##    filing_date   part  item       word
##         <dttm>  <chr> <chr>      <chr>
##  1  2017-05-24 PART 0  <NA>   document
##  2  2017-05-24 PART 0  <NA>     united
##  3  2017-05-24 PART 0  <NA>     states
##  4  2017-05-24 PART 0  <NA> securities
##  5  2017-05-24 PART 0  <NA>        and
##  6  2017-05-24 PART 0  <NA>   exchange
##  7  2017-05-24 PART 0  <NA> commission
##  8  2017-05-24 PART 0  <NA> washington
##  9  2017-05-24 PART 0  <NA>        d.c
## 10  2017-05-24 PART 0  <NA>      20549
## # ... with 271,523 more rows

We can also use this to run a quick wordcloud!

words %>%
  ungroup() %>%
  anti_join(stop_words) %>%
  ungroup() %>%
  count(word) %>%
  with(wordcloud(word,n,max.words = 75, use.r.layout=FALSE, rot.per=0.35))
## Joining, by = "word"

plot of chunk wordcloud

Basic Sentiment Analysis

Basic sentiment analysis just assigns a sentiment to each word, so getting the sentiment of a particular document just requires adding up the sentiments. There are a lot of sentiment data sets included in tidytext, but for this proof of concept, we'll use the simplest (and most widely used), bing. The bing lexicon assigns each word to be positive, negative or neutral.
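
As a minimal sketch of that idea (toy words rather than the filing text), a join against the lexicon plus a tally is really all there is to it:

library(dplyr)
library(tidytext)
# Assign bing sentiments to a handful of made-up words and count them
toy <- tibble::tibble(word = c("strong", "improved", "loss", "decline"))
toy %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)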

Following the process outlined by Text Mining with R, we compute a sentiment for each filing.

word.counts <- words %>% 
  group_by(filing_date) %>% 
  summarize(words = n())

bing <- words %>%
  inner_join(get_sentiments("bing"), by=c("word")) %>% 
  count(filing_date, sentiment) %>%
  spread(sentiment,n,fill=0) %>%
  left_join(word.counts, by=("filing_date")) %>%
  mutate(sentiment = positive-negative,
         sentiment.ratio = sentiment/words, 
         positive.ratio = positive/words, 
         negative.ratio = negative/words ) %>%
  filter(TRUE) # Noop so we can comment things in the middle for testing
ggplot(bing, aes(x=filing_date, y=sentiment)) +
  geom_col() +
  labs(x='Filing Date', y='Sentiment', title='10-K Sentiment')

plot of chunk sentiment

knitr::kable(bing,
  col.names=c('Filing Date', "Neg.", "Pos.", 'Total Words', 'Sentiment',
              'Sentiment Ratio', 'Pos. Ratio', 'Neg. Ratio'))
Filing Date Neg. Pos. Total Words Sentiment Sentiment Ratio Pos. Ratio Neg. Ratio
2013-05-22 918 1211 60450 293 0.0048 0.0200 0.0152
2014-05-21 949 1150 58383 201 0.0034 0.0197 0.0163
2015-05-21 911 1073 53697 162 0.0030 0.0200 0.0170
2016-05-27 906 1004 51284 98 0.0019 0.0196 0.0177
2017-05-24 819 908 47719 89 0.0019 0.0190 0.0172

Discussion

A cursory glance suggests something interesting might be going on, with an apparent decline in sentiment over the preceding 5 years. It is important to point out a few limitations for the current level of analysis –

  • Use of a general sentiment lexicon – There are finance-specific sentiment lexicons that would be more appropriate; e.g. in a financial context, 'gross' is fairly neutral, as in 'gross receipts'. A sketch of swapping one in follows this list.
  • Whole report analysis – There is a lot of material in a 10-K but for sentiment analysis we likely only want Management's Report and Risks.
  • Relative comparison – Without comparing these results to other companies and their financials, it is impossible to say if this drop is statistically significant.
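
To give a flavor of the first point, here is a sketch (not run here) of swapping in the finance-specific Loughran-McDonald lexicon, which tidytext exposes as "loughran" via the textdata package:

# Tally the Loughran-McDonald categories (positive, negative, uncertainty, ...)
# for each filing, mirroring the bing pipeline above
loughran <- words %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(filing_date, sentiment) %>%
  spread(sentiment, n, fill = 0)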

I'll be addressing all of these in future posts and comparing report sentiment to financial results.

While only a first step, we've managed to go fully programmatically from a ticker symbol to a multi-year sentiment analysis, opening up a lot of interesting possibilities.

Introducing edgarWebR

There are plenty of packages for R that allow for fetching and manipulating companies’ financial data, often pulled directly from public filings with the SEC. All of these packages have the goal of getting to the XBRL data containing financial statements, typically in annual (10-K) or quarterly (10-Q) filings.

SEC filings however contain far more information. edgarWebR is the first step in accessing that data by providing an interface to the SEC EDGAR search tools and the metadata they provide.

Current Features

  • Search for companies and mutual funds.
  • List filings for a company or mutual fund.
  • Get all information associated with a particular filing (a quick sketch of the last two follows).
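
For a flavor of the last two bullets, here is a quick sketch – the ticker and the columns shown are just illustrative:

library(edgarWebR)
# List recent annual filings for a company...
filings <- company_filings("EA", type = "10-K", count = 10)
# ...then pull the documents attached to the most recent one
docs <- filing_documents(filings$href[1])
head(docs[, c("type", "description", "document")])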

Simple Usecase

Using information about filings, we can use edgarWebR to see how long after the close of a financial period it takes for a company to make a filing. We can also see how that time has changed.

Get Data

First, we’ll fetch a bunch of 10-K and 10-Q filings for our given company using company_filings(). To make sure we’re going far enough back, we’ll take a peek at the tail of our results:

ticker <- "EA"

filings <- company_filings(ticker, type="10-", count=100)
initial_count <- nrow(filings)
# Specifying the type provides all forms that start with 10-, so we need to
# manually filter.
filings <- filings[filings$type == "10-K" | filings$type == "10-Q",]

Note that explicitly filtering caused us to go from 93 to 89 rows.

filings$md_href <- paste0("[Link](", filings$href, ")")
knitr::kable(tail(filings)[,c("type", "filing_date", "accession", "size",
                              "md_href")],
             col.names=c("Type", "Filing Date", "Accession No.", "Size", "Link"),
             digits = 2,
             format.args = list(big.mark=","))
Type Filing Date Accession No. Size Link
87 10-Q 1996-08-14 0000950005-96-000615 72 KB Link
88 10-K 1996-07-01 0000912057-96-013563 197 KB Link
89 10-Q 1996-02-14 0000912057-96-002551 85 KB Link
90 10-Q 1995-11-14 0000912057-95-009843 83 KB Link
91 10-Q 1995-08-10 0000912057-95-006218 142 KB Link
93 10-Q 1995-02-13 0000912057-95-000644 83 KB Link

We’ve received filings dating back to 1995 which seems good enough for our purposes, so next we’ll get the filing information for each filing.

So far we’ve done everything in base R, but now we’ll use some useful functions from dplyr and purrr to make things a bit easier.

# this can take a while - we're fetching ~100 html files!
filing_infos <- map_df(filings$href, filing_information)

filings <- bind_cols(filings, filing_infos)
filings$filing_delay <- filings$filing_date - filings$period_date

# Take a peek at the data
knitr::kable(head(filings) %>% select(type, filing_date, period_date,
                                      filing_delay, documents, filing_bytes) %>% 
             mutate(filing_delay = as.numeric(filing_delay)),
             col.names=c("Type", "Filing Date", "Period Date", "Delay",
                         "Documents", "Size (B)"),
             digits = 2,
             format.args = list(big.mark=","))
Type Filing Date Period Date Delay Documents Size (B)
10-K 2017-05-24 2017-03-31 54.00 127 14,944,223
10-Q 2017-02-07 2016-12-31 38.00 89 10,322,322
10-Q 2016-11-08 2016-09-30 39.04 89 10,233,460
10-Q 2016-08-09 2016-06-30 40.00 89 9,207,799
10-K 2016-05-27 2016-03-31 57.00 133 15,865,077
10-Q 2016-02-08 2015-12-31 39.00 97 11,389,536

Basic Analysis

Now that our data is arranged, we can run some summary statistics and plot using ggplot2.

knitr::kable(filings %>%
             group_by(type) %>% summarize(
               n=n(),
               avg_delay = as.numeric(mean(filing_delay)),
               median_delay = as.numeric(median(filing_delay)),
               avg_size = mean(filing_bytes / 1024),
               avg_docs = mean(documents)
             ),
             col.names=c("Type", "Count", "Avg. Delay", "Median Delay",
                         "Avg. Size", "Avg. Docs"),
             digits = 2,
             format.args = list(big.mark=","))
Type Count Avg. Delay Median Delay Avg. Size Avg. Docs
10-K 22 67.98 62.48 6,848.91 22.18
10-Q 67 40.04 40.00 4,357.80 13.12

No surprise, yearly filings (10-K’s) are larger and take more time than quarterly filings (10-Q’s). Let’s see how the times are distributed:

ggplot(filings, aes(x = factor(type), y=filing_delay)) +
  geom_violin() + geom_jitter(height = 0, width = 0.1) +
  labs(x = "Filing Date", y = "Filing delay (days)")

plot of chunk unnamed-chunk-6

We can also examine how the filing delay has changed over time:

ggplot(filings, aes(x = filing_date, y=filing_delay, group=type, color=type)) +
  geom_point() + geom_line() +
  labs(x = "Filing Type", y = "Filing delay (days)")

plot of chunk unnamed-chunk-7

Whether because of some internal change or a change to SEC rules, the time between the end of the fiscal year and the 10-K filing has dropped quite a bit, though there is still some variation.

An interesting extension would be to compare the filing delays to the company’s stock price, overall market performance and other companies to see if there are particular drivers to the pattern.

ggplot(filings, aes(x = filing_date, y=filing_bytes/1024, group=type, color=type)) +
  geom_point() + geom_line() +
  labs(x = "Filing Type", y = "Filing Size (KB)")

plot of chunk unnamed-chunk-8

The jump in size around 2010 lines up with the requirement to start including financial data files, but there are still interesting variations.

More to come

The SEC contains a wealth of information and we’re only scratching the surface. edgarWebR itself has a lot more functionality than what we’ve explored here and there is more coming.

How to Download

edgarWebR is not yet on CRAN, as the API hasn’t stabilized yet. In the meantime, you can get a copy from GitHub by using devtools:

# install.packages("devtools")
devtools::install_github("mwaldstein/edgarWebR")