Parsing Functions in edgarWebR

New to edgarWebR 0.2.0 are funtions for parsing SEC documents. While there are good R packages for XBRL processing, there is a gap in extracting information from other document types available via the site. edgarWebR currently provides functions for 2 of those –

  • parse_submission() – Processes a raw SGML filing into component documents
  • parse_filing() – Processes a narrative filing (e.g. 10-K, 10-Q) into paragraphs annotated with part and item numbers

This vignette will show how to use both functions to find the risks reported by in a company's recent filing.

Find a Submission

Using edgarWebR functions, we'll first look up a recent filing.

ticker <- "STX"

filings <- company_filings(ticker, type="10-Q", count=40)
# Specifying the type provides all forms that start with 10-, so we need to
# manually filter.
filings <- filings[filings$type == "10-Q",]
# We're only interested in the latest filing this time
filing <- filings[1,]
filing$md_href <- paste0("[Link](", filing$href, ")")
knitr::kable(filing[,c("type", "filing_date", "accession_number", "size",
             col.names=c("Type", "Filing Date", "Accession No.", "Size", "Link"),
             digits = 2,
             format.args = list(big.mark=","))
Type Filing Date Accession No. Size Link
10-Q 2017-04-28 0001193125-17-148855 8 MB Link

Get the Complete Submission File

We'll next get the list of files and find the link to the complete submission.

docs <- filing_documents(filing$href)
doc <- docs[docs$description == 'Complete submission text file',]
doc$md_href <- paste0("[Link](", doc$href, ")")

knitr::kable(doc[,c("seq", "description", "document", "size",
             col.names=c("Sequence", "Description", "Document", "Size", "Link"),
             digits = 2,
             format.args = list(big.mark=","))
Sequence Description Document Size Link
5 NA Complete submission text file 0001193125-17-148855.txt 8,112,579 Link

Normally, we would use filing_documents() to get to the 10-Q directly, but as an example we'll be using the complete submission file to demonstrate the parse_submission() function. You would want to use the complete submission file if you want to access the full list of files – e.g. in this case there are 80 files in the submission, but only 10 available on the website and therefore available to filing_documents() – or if you worry about efficiency and are downloading all of the documents.

Parse the Complete Submission File

Now that we have the link to the complete submission file, we can parse it into components.

parsed_docs <- parse_submission(doc$href)
knitr::kable(head(parsed_docs[,c("SEQUENCE", "TYPE", "DESCRIPTION", "FILENAME")]),
             col.names=c("Sequence", "Type", "Description", "Document"),
             digits = 2,
             format.args = list(big.mark=","))
Sequence Type Description Document
1 10-Q 10-Q d381726d10q.htm
2 EX-31.1 EX-31.1 d381726dex311.htm
3 EX-31.2 EX-31.2 d381726dex312.htm
4 EX-32.1 EX-32.1 d381726dex321.htm
5 EX-101.INS XBRL INSTANCE DOCUMENT stx-20170331.xml

And just for example, here's the end of the full list – note the excel that isn't on the SEC site for instance.

knitr::kable(tail(parsed_docs[,c("SEQUENCE", "TYPE", "DESCRIPTION", "FILENAME")]),
             col.names=c("Sequence", "Type", "Description", "Document"),
             digits = 2,
             format.args = list(big.mark=","))
Sequence Type Description Document
76 76 EXCEL IDEA: XBRL DOCUMENT Financial_Report.xlsx
78 78 XML IDEA: XBRL DOCUMENT report.css
79 80 XML IDEA: XBRL DOCUMENT FilingSummary.xml

The 10-Q Filing document is Seq. 1, with the full text of the document in the TEXT column.

# NOTE: the filing document is not always #1, so it is a good idea to also look
# at the type & Description
filing_doc <- parsed_docs[parsed_docs$TYPE == '10-Q' &
                          parsed_docs$DESCRIPTION == '10-Q', 'TEXT']
#> [1] "<HTML><HEAD>\n<TITLE>10-Q</TITLE>\n</HEAD>\n <BODY BGCOLOR=\"WHITE\">\n<h5 align=\"left"

We can see that contains the raw document. For document types which are not plain text, e.g. the XBRL zip file, the content is uuencoded and would been further processing.

Parse the Filing Document

Fortunately edgaWebR functions that take URL's will also take a string containing the document, so to parse the document, while we could have passed the URL to the online document we can just pass in the full string.

doc <- parse_filing(filing_doc, include.raw = TRUE)
#> [1] ""        "PART I"  "PART II"
#>  [1] ""                                                                                                   
#>  [2] "ITEM 1.\nFINANCIAL STATEMENTS"                                                                      
#>  [5] "ITEM 4.\nCONTROLS AND PROCEDURES"                                                                   
#>  [6] "ITEM 1.\nLEGAL PROCEEDINGS"                                                                         
#>  [7] "ITEM 1A.\nRISK FACTORS"                                                                             
#>  [9] "ITEM 3.\nDEFAULTS UPON SENIOR SECURITIES"                                                           
#> [10] "ITEM 4.\nMINE SAFETY DISCLOSURES"                                                                   
#> [11] "ITEM 5.\nOTHER INFORMATION"                                                                         
#> [12] "ITEM 6.\nEXHIBITS"
head(doc[grepl("market risk", doc$,,"text"], 3)
#> [1] "ITEM 3.\nQUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK"                                                                                                                                                                                                                                                                                                                                                                                                                               
#> [2] "We have exposure to market\nrisks due to the volatility of interest rates, foreign currency exchange rates, equity and bond markets. A portion of these risks are hedged, but fluctuations could impact our results of operations, financial position and cash flows. Additionally,\nwe have exposure to downgrades in the credit ratings of our counterparties as well as exposure related to our credit rating changes."                                                                         
#> [3] "Interest Rate Risk. Our exposure to market risk for changes in interest rates relates primarily to our investment portfolio. As of\nMarch 31, 2017, we had no material available-for-sale securities that had been in a continuous unrealized loss position for a period greater than 12 months. We\ndetermined no available-for-sale securities were other-than-temporarily impaired as of March 31, 2017. We currently do not use derivative financial instruments in\nour investment portfolio."
risks <- doc[grepl("market risk", doc$,,"raw"]

Now the document is all ready for whatever further processing we want. As a quick example we'll pull out all the italicized risks.

risks <- risks[grep('<i>',risks)]
risks <- gsub("^.*<i>|</i>.*$", "", risks)
risks <- gsub("\n", " ", risks)
#> [1] "Interest Rate Risk"             "Foreign Currency Exchange Risk"
#> [3] "Derivatives and Hedging."       "Other Market Risks"

This is a fairly simplistic example, but should serve as a good tutorial on processing filings.

How to Download

edgarWebR is available from CRAN, so can be simply installed via


If you want the latest and greatest, you can get a copy of the development version from github by using devtools:

# install.packages("devtools")

Introduction to Sentiment Analysis of 10-K Reports in R

Nothing in this article should be considered as investment advice


In the US, publicly traded companier are required to publish an annual report, called a 10-K. In addition to basic financial information, these reports include management commentary and a disclosure of perceived risks. While the financials are typically where analysts focus, attention is given to reading between the lines of the typically bland and risk adverse narrative sections.

I got interested in using R to automate the process of grabbing the 10K from the SEC website, parsing out the narrative sections, and applying basic sentiment and text analysis.


It was working on this project that led to the creation of the edgarWebR package and we'll be using it extensively as we look up a company, find its annual filings, and fetching the management report.

First, we'll specify the ticker we're interested in and the number of years we want to analyze. Using that we can find the company information.

ticker <- 'EA'
years <- 5
company.details <- company_details(ticker, type = "10-K", count = years)

#TODO: better formatting company information**
kable(company.details$information %>% select(name, cik, fiscal_year_end),
      col.names=c('Company Name', 'CIK', 'Fiscal Year End'))
Company Name CIK Fiscal Year End
ELECTRONIC ARTS INC. 0000712515 0331

Next, look up reports in the SEC EDGAR system and look in each filing to find the report. A lot is happening in a few lines, but effectively we're going from a list of annual filings to urls for the main document in each filing.

One note – while, we specified both the type (10-K) and count (5), because the SEC site isn't precise, we need to re-filter.

# Each filing is made up of multiple documents,
# this extracts the main narrative
filing_doc <- function(href) {
  sapply(href, function(x) {
         filing_documents(x) %>%
         filter( type == "10-K" ) %>% select(href) }) %>%
  unlist(recursive = TRUE, use.names = FALSE)
company.reports <- company.details$filings %>%
  filter(type == "10-K") %>%
  slice(1:years) %>% 
  mutate(doc.href = filing_doc(href),
         mdlink = paste0("[Filing Link](", href, ")"),
         reportLink = paste0("[10-K Link](", doc.href, ")")) %>%
  select(filing_date, accession_number, mdlink, reportLink, href, doc.href)

knitr::kable(company.reports %>% select(-href, -doc.href),
     col.names=c('Filing Date', 'Accession #', 'Filing Link', '10-K Link'))
Filing Date Accession # Filing Link 10-K Link
2017-05-24 0000712515-17-000035 Filing Link 10-K Link
2016-05-27 0000712515-16-000111 Filing Link 10-K Link
2015-05-21 0000712515-15-000033 Filing Link 10-K Link
2014-05-21 0000712515-14-000024 Filing Link 10-K Link
2013-05-22 0000712515-13-000022 Filing Link 10-K Link

We now have a link to the narrative 10-K for each of the past 5 years.

Parse out 10K

To perform sentiment analysis, the next big step is to take the HTML formatted 10-K and turn it into a set of words. While we won't use it right now, we'll also parse the various sections of the document for later filtering.

Warning: this gets hairy…

parse10k <- function(uri) {
  # 10-K HTML files are very flat with a long list of nodes. This pulls all
  # the relevant nodes.
  nodes <- read_html(uri) %>% html_nodes('text') %>% xml_children()
  nodes <- nodes[xml_name(nodes) != "hr"]

  # Unfortunately there isn't much of a workaround to this loop - we need
  # to track position in the file so it has to be a bit sequential... <- tibble(nid = seq(length(nodes)),
                     node = nodes,
                     text = xml_text(nodes) ) %>% 
              filter(text != "") # way to get columns defined properly

  parts <- %>%
    filter(grepl("^part",text, %>%
  #  mutate(next.nid = c(nid[-1],length(nodes)+1)) %>%
  if (parts$nid[1] > 1) {
    parts <- bind_rows(tibble(nid = 0, text= "PART 0"), parts)
  parts <- bind_rows(parts,
                     tibble(nid =$nid[length($nid)] + 1,
                            text = "NA"))

  items <- %>%
    filter(grepl("^item",text, %>%
    select(nid,text) %>%
    mutate(next.nid = c(nid[-1],length(nodes)+1),
  = parts$nid[findInterval(nid,parts$nid) + 1],
           next.nid = ifelse( < next.nid,, next.nid),
           prev.end = c(0,next.nid[-length(nid)]))

  # Fill in item gaps w/ N/A
  n <- 0
  for(i in seq(length(items$nid))) {
    j <- i + n
    if(items$prev.end[j] != items$nid[j]) {
      items <- items %>% 
        add_row(nid = items$prev.end[j], text = NA, .before = j)
      n <- n + 1
  } <- %>% 
    mutate( part = parts$text[findInterval(nid, parts$nid)],
            item = items$text[findInterval(nid, items$nid)]) %>%


We then take our parsing function and run it against each of the reports

data <- company.reports %>% rowwise() %>%
  mutate(nodes = map(doc.href, parse10k)) %>%
  select(-accession_number, -href, -mdlink, -doc.href, -reportLink) %>%
  ungroup() %>%

Finally, we use unnest_tokens from TidyText to prepare our documents for sentiment analysis

words <- data %>%
  unnest(nodes) %>%
  select(-nid) %>%
  unnest_tokens(word, text)
## # A tibble: 271,533 x 4
## # Groups:   filing_date [5]
##    filing_date   part  item       word
##         <dttm>  <chr> <chr>      <chr>
##  1  2017-05-24 PART 0  <NA>   document
##  2  2017-05-24 PART 0  <NA>     united
##  3  2017-05-24 PART 0  <NA>     states
##  4  2017-05-24 PART 0  <NA> securities
##  5  2017-05-24 PART 0  <NA>        and
##  6  2017-05-24 PART 0  <NA>   exchange
##  7  2017-05-24 PART 0  <NA> commission
##  8  2017-05-24 PART 0  <NA> washington
##  9  2017-05-24 PART 0  <NA>        d.c
## 10  2017-05-24 PART 0  <NA>      20549
## # ... with 271,523 more rows

We can also use this to run a quick wordcloud!

words %>%
  ungroup() %>%
  anti_join(stop_words) %>%
  ungroup() %>%
  count(word) %>%
  with(wordcloud(word,n,max.words = 75, use.r.layout=FALSE, rot.per=0.35))
## Joining, by = "word"

plot of chunk wordcloud

Basic Sentiment Analysis

Basic sentiment analysis just assigns a sentiment to each word, so getting the sentiment of a particular document just requires adding up the sentiments. There are a lot of sentiment data sets included in tidytext, but for this proof of concept, we'll use the simplest (and most widely used), bing. The bing lexicon assigns each word to be positive, negative or neutral.

Following the process outlined by Text Mining with R, we compute a sentiment for each filing.

word.counts <- words %>% 
  group_by(filing_date) %>% 
  summarize(words = n())

bing <- words %>%
  inner_join(get_sentiments("bing"), by=c("word")) %>% 
  count(filing_date, sentiment) %>%
  spread(sentiment,n,fill=0) %>%
  left_join(word.counts, by=("filing_date")) %>%
  mutate(sentiment = positive-negative,
         sentiment.ratio = sentiment/words, 
         positive.ratio = positive/words, 
         negative.ratio = negative/words ) %>%
  filter(TRUE) # Noop so we can comment things in the middle for testing
ggplot(bing, aes(x=filing_date, y=sentiment)) +
  geom_col() +
  labs(x='Filing Date', y='Sentiment', title='10-K Sentiment')

plot of chunk sentiment

  col.names=c('Filing Date', "Neg.", "Pos.", 'Total Words', 'Sentiment',
              'Sentiment Ratio', 'Pos. Ratio', 'Neg. Ratio'))
Filing Date Neg. Pos. Total Words Sentiment Sentiment Ratio Pos. Ratio Neg. Ratio
2013-05-22 918 1211 60450 293 0.0048 0.0200 0.0152
2014-05-21 949 1150 58383 201 0.0034 0.0197 0.0163
2015-05-21 911 1073 53697 162 0.0030 0.0200 0.0170
2016-05-27 906 1004 51284 98 0.0019 0.0196 0.0177
2017-05-24 819 908 47719 89 0.0019 0.0190 0.0172


A cursory glance suggests something interesting might be going on, with an apparent decline in sentiment over the preceding 5 years. It is important to point out a few limitations for the current level of analysis –

  • Use of a general sentiment lexicon – There are financial specific sentiment lexicons that would be more appropriate. e.g. in the financial context, 'gross' is fairly neutral, as in 'gross receipts'.
  • Whole report analysis – There is a lot of material in a 10-K but for sentiment analysis we likely only want Management's Report and Risks.
  • Relative comparison – Without comparing these results to other companies and their financials, it is impossible to say if this drop is statistically significant.

I'll be addressing all of these in future posts and comparing report sentiment to financial results.

While only a first step, we've managed to fully programaically go from a ticker symbol to a multi-year sentiment analysis, opening up a lot of interesting possibilities.

Introducting edgarWebR

There are plenty of packages for R that allow for fetching and manipulation of companies’ financial data, often fetching that direct from public filings with the SEC. All of these packages have the goal of getting to the XBRL data, containing financial statements, typically in annual (10-K) or quarterly (10-Q)

SEC filings however contain far more information. edgarWebR is the first step in accessing that data by providing an interface to the SEC EDGAR search tools and the metadata they provide.

Current Features

  • Search for companies and mutual funds.
  • List filings for a company or mutual fund.
  • Get all information associated with a particular filing

Simple Usecase

Using information about filings, we can use edgarWebR to see how long after the close of a financial period it takes for a company to make a filing. We can also see how that time has changed.

Get Data

First, we’ll fetch a bunch of 10-K and 10-Q filings for our given company using company_filings(). To make sure we’re going far enough back we’ll take a peak at the tail of our results:

ticker <- "EA"

filings <- company_filings(ticker, type="10-", count=100)
initial_count <- nrow(filings)
# Specifying the type provides all forms that start with 10-, so we need to
# manually filter.
filings <- filings[filings$type == "10-K" | filings$type == "10-Q",]

Note that explicitly filtering caused us to go from 93 to 89 rows.

filings$md_href <- paste0("[Link](", filings$href, ")")
knitr::kable(tail(filings)[,c("type", "filing_date", "accession", "size",
             col.names=c("Type", "Filing Date", "Accession No.", "Size", "Link"),
             digits = 2,
             format.args = list(big.mark=","))
Type Filing Date Accession No. Size Link
87 10-Q 1996-08-14 0000950005-96-000615 72 KB Link
88 10-K 1996-07-01 0000912057-96-013563 197 KB Link
89 10-Q 1996-02-14 0000912057-96-002551 85 KB Link
90 10-Q 1995-11-14 0000912057-95-009843 83 KB Link
91 10-Q 1995-08-10 0000912057-95-006218 142 KB Link
93 10-Q 1995-02-13 0000912057-95-000644 83 KB Link

We’ve received filings dating back to 1995 which seems good enough for our purposes, so next we’ll get the filing information for each filing.

So far we’ve done everything in base R, but now we’ll use some useful functions from dplyr and purrr to make things a bit easier.

# this can take a while - we're fetching ~100 html files!
filing_infos <- map_df(filings$href, filing_information)

filings <- bind_cols(filings, filing_infos)
filings$filing_delay <- filings$filing_date - filings$period_date

# Take a peak at the data
knitr::kable(head(filings) %>% select(type, filing_date, period_date,
                                      filing_delay, documents, filing_bytes) %>% 
             mutate(filing_delay = as.numeric(filing_delay)),
             col.names=c("Type", "Filing Date", "Period Date", "Delay",
                         "Documents", "Size (B)"),
             digits = 2,
             format.args = list(big.mark=","))
Type Filing Date Period Date Delay Documents Size (B)
10-K 2017-05-24 2017-03-31 54.00 127 14,944,223
10-Q 2017-02-07 2016-12-31 38.00 89 10,322,322
10-Q 2016-11-08 2016-09-30 39.04 89 10,233,460
10-Q 2016-08-09 2016-06-30 40.00 89 9,207,799
10-K 2016-05-27 2016-03-31 57.00 133 15,865,077
10-Q 2016-02-08 2015-12-31 39.00 97 11,389,536

Basic Analysis

Now our data is arranged, we can run some summary statistics and plot using ggplot2.

knitr::kable(filings %>%
             group_by(type) %>% summarize(
               avg_delay = as.numeric(mean(filing_delay)),
               median_delay = as.numeric(median(filing_delay)),
               avg_size = mean(filing_bytes / 1024),
               avg_docs = mean(documents)
             col.names=c("Type", "Count", "Avg. Delay", "Median Delay",
                         "Avg. Size", "Avg. Docs"),
             digits = 2,
             format.args = list(big.mark=","))
Type Count Avg. Delay Median Delay Avg. Size Avg. Docs
10-K 22 67.98 62.48 6,848.91 22.18
10-Q 67 40.04 40.00 4,357.80 13.12

No surprise, yearly filings (10-K’s) are larger and take more time than quarterly filings (10-K’s). Lets see how the times are destributed:

ggplot(filings, aes(x = factor(type), y=filing_delay)) +
  geom_violin() + geom_jitter(height = 0, width = 0.1) +
  labs(x = "Filing Date", y = "Filing delay (days)")

plot of chunk unnamed-chunk-6

We can also examine how the filing delay has changed over time:

ggplot(filings, aes(x = filing_date, y=filing_delay, group=type, color=type)) +
  geom_point() + geom_line() +
  labs(x = "Filing Type", y = "Filing delay (days)")

plot of chunk unnamed-chunk-7
Whether because of some internal change or change to SEC rules, the time
between the end of the fiscal year and the 10-K has dropped quite a bit, though there is still a bit of variation.

An interesting extension would be to compare the filing delays to the company’s stock price, overall market performance and other companies to see if there are particular drivers to the pattern.

ggplot(filings, aes(x = filing_date, y=filing_bytes/1024, group=type, color=type)) +
  geom_point() + geom_line() +
  labs(x = "Filing Type", y = "Filing Size (KB)")

plot of chunk unnamed-chunk-8
The jump in size ~2010 is the requirement for inclusion of financial datafiles starting, but there is still interesting variations.

More to come

The SEC contains a wealth of information and we’re only scratching the surface. edgarWebR itself has a lot more functionality than what we’ve explored here and there is more coming.

How to Download

edgarWebR is not yet of CRAN, as the API hasn’t stabilized yet. In the meantime, you can get a copy from github by using devtools:

# install.packages("devtools")

Fetching Patreon Data

Patreon is a delight to scrape. Actually, scrapping is the wrong word for it – the frontend of Patreon is a react application that calls a number of very sensibly designed json end points. Call the same endpoints and you get delightfully clean json that exactly matches what gets displayed on the site.

A disclaimer – this is undocumented as far as I can tell – the publicly documented API (JS Implementation) is targeted at creators and provides access to privately information only visible to creators. All of this could change at any point.

I was all ready to parse HTML, but looking at the source there was a beautiful JS object containing all the data needed to display most pages.

"data": {
  "attributes": {
    "created_at": "2016-04-30T13:58:22+00:00",
    "creation_name": "Entertainment",
    "display_patron_goals": false,
    "earnings_visibility": null,

Even better, at the tail end of the long object is the call to fetch just the JSON:

"links": {
  "self": ""

So as long as I can get the ID of a campaign, I can get all the information about it in an easily processed format. Thanks to the explore pages and a bit of network monitoring reveals calls to the following URL’s:[user]=full_name,image_url,url&fields[campaign]=creation_name,patron_count,pledge_sum,is_monthly,earnings_visibility&page[count]=20&json-api-version=1.0';

Inspection of the different category pages reveals the number after ‘/category/’ runs from 1 through 14, and 99 for the ‘All’ category. This way, I can fetch all the top campaigns then use the campaign API to retrieve detailed information.

An interesting note – the data structures reveal a lot about how site has been designed and where complexity can be added later – multiple campaigns per user, links between campaigns, etc.

Full code for my scrapper is after the break – I’ll be diving into analysis next.
Continue reading “Fetching Patreon Data”

Quick Analysis – Patreon Funding

Inspired by the launch yesterday of the Patreon funding campaign for Movies with Mikey, a movie analysis YouTube channel, I’ve performed some rudimentary analysis of how Patreon donors fit into donation tiers.

Movies with Mikey Patreon

A quick primer – Patreon allows creators to collect donations from supporters on an ongoing basis as opposed to a one-time engagement as with Kickstarter. Donations can be by month or by produced work. While various donation levels provide perks or recognition, Patreon tends to be more focused on “support” than perk compared to other funding platforms. As a result, tiers tend to be less about “value for money” and therefore more interesting to analyze.

Patreon, as with Twitch, limits the information publicly available to summary information, but still enough to assess donation breakdown given some reasonable assumptions. For Movies with Mikey (MWM), we are given the total donations and how those are broken down by tier.

While we would like to know the specific distribution of donations, we can estimate the average donation per tier to get an idea of how donations break down. One caveat is that there is some missing data – the sum of donors in tiers only adds up to 744, so 39 supporters or about 5% didn’t select a tier and we have no idea where they fall. To analyze the breakdown, I made a simple spreadsheet to allow easy estimation of donation amounts.

This estimation underestimates the total by 1.8%, which given the ‘missing’ donations suggests a fair degree of accuracy. There are a few hypothesis that come out of this:

  • Donors give the minimum to be in a given tier.
  • There are likely a few high-end donations that pull the average of the top tier up.

I’d stress that this is exploratory work at the moment that uncovered some reasonable hypotheses. To confirm them and to make any recommendations will entail seeing if this pattern holds for other Patreon campaigns which I hope to do in the next few days looking across categories and sizes of campaigns.

One last comment I also plan of returning to is the nature of sites that provide summary statistics. By limiting the information available, they are creating a market for 3rd party scrapers and losing control over the data. For Patreon, there is Graphtreon and Kickstarter has Kicktraq. I’m torn on how the platforms and their users should feel about these, particularly as personal finances are often involved, but I’ll delve more into the issue later.