{"id":450,"date":"2017-10-12T17:45:21","date_gmt":"2017-10-12T21:45:21","guid":{"rendered":"https:\/\/micah.waldste.in\/?p=450"},"modified":"2018-06-20T16:26:48","modified_gmt":"2018-06-20T20:26:48","slug":"parsing-functions-in-edgarwebr","status":"publish","type":"post","link":"https:\/\/micah.waldste.in\/blog\/blog\/2017\/10\/parsing-functions-in-edgarwebr\/","title":{"rendered":"Parsing Functions in edgarWebR"},"content":{"rendered":"<p>New to edgarWebR 0.2.0 are functions for parsing SEC documents. While there are good R packages for XBRL processing, there is a gap in extracting information from other document types available via the site. edgarWebR currently provides functions for 2 of those &#8211;<\/p>\r\n\r\n<ul>\r\n<li><code>parse_submission()<\/code> &#8211; Processes a raw SGML filing into component documents<\/li>\r\n<li><code>parse_filing()<\/code> &#8211; Processes a narrative filing (e.g. 10-K, 10-Q) into\r\nparagraphs annotated with part and item numbers<\/li>\r\n<\/ul>\r\n\r\n<p>This vignette will show how to use both functions to find the risks reported in a company&#39;s recent filing.<\/p>\r\n\r\n<h2>Find a Submission<\/h2>\r\n\r\n<p>Using edgarWebR functions, we&#39;ll first look up a recent filing.<\/p>\r\n\r\n<pre><code class=\"r\">ticker &lt;- &quot;STX&quot;\r\n\r\nfilings &lt;- company_filings(ticker, type=&quot;10-Q&quot;, count=40)\r\n# Specifying the type provides all forms that start with 10-, so we need to\r\n# manually filter.\r\nfilings &lt;- filings[filings$type == &quot;10-Q&quot;,]\r\n# We&#39;re only interested in the latest filing this time\r\nfiling &lt;- filings[1,]\r\nfiling$md_href &lt;- paste0(&quot;[Link](&quot;, filing$href, &quot;)&quot;)\r\nknitr::kable(filing[,c(&quot;type&quot;, &quot;filing_date&quot;, &quot;accession_number&quot;, &quot;size&quot;,\r\n                              &quot;md_href&quot;)],\r\n             col.names=c(&quot;Type&quot;, &quot;Filing Date&quot;, &quot;Accession No.&quot;, 
&quot;Size&quot;, &quot;Link&quot;),\r\n             digits = 2,\r\n             format.args = list(big.mark=&quot;,&quot;))\r\n<\/code><\/pre>\r\n\r\n<table><thead>\r\n<tr>\r\n<th align=\"left\">Type<\/th>\r\n<th align=\"left\">Filing Date<\/th>\r\n<th align=\"left\">Accession No.<\/th>\r\n<th align=\"left\">Size<\/th>\r\n<th align=\"left\">Link<\/th>\r\n<\/tr>\r\n<\/thead><tbody>\r\n<tr>\r\n<td align=\"left\">10-Q<\/td>\r\n<td align=\"left\">2017-04-28<\/td>\r\n<td align=\"left\">0001193125-17-148855<\/td>\r\n<td align=\"left\">8 MB<\/td>\r\n<td align=\"left\"><a href=\"https:\/\/www.sec.gov\/Archives\/edgar\/data\/1137789\/000119312517148855\/0001193125-17-148855-index.htm\">Link<\/a><\/td>\r\n<\/tr>\r\n<\/tbody><\/table>\r\n\r\n<h2>Get the Complete Submission File<\/h2>\r\n\r\n<p>We&#39;ll next get the list of files and find the link to the complete submission.<\/p>\r\n\r\n<pre><code class=\"r\">docs &lt;- filing_documents(filing$href)\r\ndoc &lt;- docs[docs$description == &#39;Complete submission text file&#39;,]\r\ndoc$md_href &lt;- paste0(&quot;[Link](&quot;, doc$href, &quot;)&quot;)\r\n\r\nknitr::kable(doc[,c(&quot;seq&quot;, &quot;description&quot;, &quot;document&quot;, &quot;size&quot;,\r\n                              &quot;md_href&quot;)],\r\n             col.names=c(&quot;Sequence&quot;, &quot;Description&quot;, &quot;Document&quot;, &quot;Size&quot;, &quot;Link&quot;),\r\n             digits = 2,\r\n             format.args = list(big.mark=&quot;,&quot;))\r\n<\/code><\/pre>\r\n\r\n<table><thead>\r\n<tr>\r\n<th align=\"left\"><\/th>\r\n<th align=\"right\">Sequence<\/th>\r\n<th align=\"left\">Description<\/th>\r\n<th align=\"left\">Document<\/th>\r\n<th align=\"right\">Size<\/th>\r\n<th align=\"left\">Link<\/th>\r\n<\/tr>\r\n<\/thead><tbody>\r\n<tr>\r\n<td align=\"left\">5<\/td>\r\n<td align=\"right\">NA<\/td>\r\n<td align=\"left\">Complete submission text file<\/td>\r\n<td align=\"left\">0001193125-17-148855.txt<\/td>\r\n<td 
align=\"right\">8,112,579<\/td>\r\n<td align=\"left\"><a href=\"https:\/\/www.sec.gov\/Archives\/edgar\/data\/1137789\/000119312517148855\/0001193125-17-148855.txt\">Link<\/a><\/td>\r\n<\/tr>\r\n<\/tbody><\/table>\r\n\r\n<p>Normally, we would use <code>filing_documents()<\/code> to get to the 10-Q directly, but as an example we&#39;ll be using the complete submission file to demonstrate the <code>parse_submission()<\/code> function. You would want to use the complete submission file if you want to access the full list of files &#8211; e.g. in this case there are 80 files in the submission, but only 10 available on the website and therefore available to <code>filing_documents()<\/code> &#8211; or if you worry about efficiency and are\r\ndownloading all of the documents.<\/p>\r\n\r\n<h2>Parse the Complete Submission File<\/h2>\r\n\r\n<p>Now that we have the link to the complete submission file, we can parse it into\r\ncomponents.<\/p>\r\n\r\n<pre><code class=\"r\">parsed_docs &lt;- parse_submission(doc$href)\r\nknitr::kable(head(parsed_docs[,c(&quot;SEQUENCE&quot;, &quot;TYPE&quot;, &quot;DESCRIPTION&quot;, &quot;FILENAME&quot;)]),\r\n             col.names=c(&quot;Sequence&quot;, &quot;Type&quot;, &quot;Description&quot;, &quot;Document&quot;),\r\n             digits = 2,\r\n             format.args = list(big.mark=&quot;,&quot;))\r\n<\/code><\/pre>\r\n\r\n<table><thead>\r\n<tr>\r\n<th align=\"left\">Sequence<\/th>\r\n<th align=\"left\">Type<\/th>\r\n<th align=\"left\">Description<\/th>\r\n<th align=\"left\">Document<\/th>\r\n<\/tr>\r\n<\/thead><tbody>\r\n<tr>\r\n<td align=\"left\">1<\/td>\r\n<td align=\"left\">10-Q<\/td>\r\n<td align=\"left\">10-Q<\/td>\r\n<td align=\"left\">d381726d10q.htm<\/td>\r\n<\/tr>\r\n<tr>\r\n<td align=\"left\">2<\/td>\r\n<td align=\"left\">EX-31.1<\/td>\r\n<td align=\"left\">EX-31.1<\/td>\r\n<td align=\"left\">d381726dex311.htm<\/td>\r\n<\/tr>\r\n<tr>\r\n<td align=\"left\">3<\/td>\r\n<td align=\"left\">EX-31.2<\/td>\r\n<td 
align=\"left\">EX-31.2<\/td>\r\n<td align=\"left\">d381726dex312.htm<\/td>\r\n<\/tr>\r\n<tr>\r\n<td align=\"left\">4<\/td>\r\n<td align=\"left\">EX-32.1<\/td>\r\n<td align=\"left\">EX-32.1<\/td>\r\n<td align=\"left\">d381726dex321.htm<\/td>\r\n<\/tr>\r\n<tr>\r\n<td align=\"left\">5<\/td>\r\n<td align=\"left\">EX-101.INS<\/td>\r\n<td align=\"left\">XBRL INSTANCE DOCUMENT<\/td>\r\n<td align=\"left\">stx-20170331.xml<\/td>\r\n<\/tr>\r\n<tr>\r\n<td align=\"left\">6<\/td>\r\n<td align=\"left\">EX-101.SCH<\/td>\r\n<td align=\"left\">XBRL TAXONOMY EXTENSION SCHEMA<\/td>\r\n<td align=\"left\">stx-20170331.xsd<\/td>\r\n<\/tr>\r\n<\/tbody><\/table>\r\n\r\n<p>And just as an example, here&#39;s the end of the full list &#8211; note, for instance, the Excel file that\r\nisn&#39;t on the SEC site.<\/p>\r\n\r\n<pre><code class=\"r\">knitr::kable(tail(parsed_docs[,c(&quot;SEQUENCE&quot;, &quot;TYPE&quot;, &quot;DESCRIPTION&quot;, &quot;FILENAME&quot;)]),\r\n             col.names=c(&quot;Sequence&quot;, &quot;Type&quot;, &quot;Description&quot;, &quot;Document&quot;),\r\n             digits = 2,\r\n             format.args = list(big.mark=&quot;,&quot;))\r\n<\/code><\/pre>\r\n\r\n<table><thead>\r\n<tr>\r\n<th align=\"left\"><\/th>\r\n<th align=\"left\">Sequence<\/th>\r\n<th align=\"left\">Type<\/th>\r\n<th align=\"left\">Description<\/th>\r\n<th align=\"left\">Document<\/th>\r\n<\/tr>\r\n<\/thead><tbody>\r\n<tr>\r\n<td align=\"left\">75<\/td>\r\n<td align=\"left\">75<\/td>\r\n<td align=\"left\">XML<\/td>\r\n<td align=\"left\">IDEA: XBRL DOCUMENT<\/td>\r\n<td align=\"left\">R65.htm<\/td>\r\n<\/tr>\r\n<tr>\r\n<td align=\"left\">76<\/td>\r\n<td align=\"left\">76<\/td>\r\n<td align=\"left\">EXCEL<\/td>\r\n<td align=\"left\">IDEA: XBRL DOCUMENT<\/td>\r\n<td align=\"left\">Financial_Report.xlsx<\/td>\r\n<\/tr>\r\n<tr>\r\n<td align=\"left\">77<\/td>\r\n<td align=\"left\">77<\/td>\r\n<td align=\"left\">XML<\/td>\r\n<td align=\"left\">IDEA: XBRL DOCUMENT<\/td>\r\n<td 
align=\"left\">Show.js<\/td>\r\n<\/tr>\r\n<tr>\r\n<td align=\"left\">78<\/td>\r\n<td align=\"left\">78<\/td>\r\n<td align=\"left\">XML<\/td>\r\n<td align=\"left\">IDEA: XBRL DOCUMENT<\/td>\r\n<td align=\"left\">report.css<\/td>\r\n<\/tr>\r\n<tr>\r\n<td align=\"left\">79<\/td>\r\n<td align=\"left\">80<\/td>\r\n<td align=\"left\">XML<\/td>\r\n<td align=\"left\">IDEA: XBRL DOCUMENT<\/td>\r\n<td align=\"left\">FilingSummary.xml<\/td>\r\n<\/tr>\r\n<tr>\r\n<td align=\"left\">80<\/td>\r\n<td align=\"left\">82<\/td>\r\n<td align=\"left\">ZIP<\/td>\r\n<td align=\"left\">IDEA: XBRL DOCUMENT<\/td>\r\n<td align=\"left\">0001193125-17-148855-xbrl.zip<\/td>\r\n<\/tr>\r\n<\/tbody><\/table>\r\n\r\n<p>The 10-Q filing document is sequence 1, with the full text of the document in the\r\nTEXT column.<\/p>\r\n\r\n<pre><code class=\"r\"># NOTE: the filing document is not always #1, so it is a good idea to also look\r\n# at the type &amp; Description\r\nfiling_doc &lt;- parsed_docs[parsed_docs$TYPE == &#39;10-Q&#39; &amp;\r\n                          parsed_docs$DESCRIPTION == &#39;10-Q&#39;, &#39;TEXT&#39;]\r\nsubstr(filing_doc,1,80)\r\n#&gt; [1] &quot;&lt;HTML&gt;&lt;HEAD&gt;\\n&lt;TITLE&gt;10-Q&lt;\/TITLE&gt;\\n&lt;\/HEAD&gt;\\n &lt;BODY BGCOLOR=\\&quot;WHITE\\&quot;&gt;\\n&lt;h5 align=\\&quot;left&quot;\r\n<\/code><\/pre>\r\n\r\n<p>We can see that <code>filing_doc<\/code> contains the raw document. For document types that are not plain text, e.g. 
the XBRL zip file, the content is uuencoded and would need further processing.<\/p>\r\n\r\n<h2>Parse the Filing Document<\/h2>\r\n\r\n<p>Fortunately, edgarWebR functions that take URLs will also take a string containing the document. So although we could have passed the URL of the online document, we can just pass in the full string.<\/p>\r\n\r\n<pre><code class=\"r\">doc &lt;- parse_filing(filing_doc, include.raw = TRUE)\r\nunique(doc$part.name)\r\n#&gt; [1] &quot;&quot;        &quot;PART I&quot;  &quot;PART II&quot;\r\nunique(doc$item.name)\r\n#&gt;  [1] &quot;&quot;                                                                                                   \r\n#&gt;  [2] &quot;ITEM 1.\\nFINANCIAL STATEMENTS&quot;                                                                      \r\n#&gt;  [3] &quot;ITEM 2.\\nMANAGEMENT\\u0092S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS&quot;\r\n#&gt;  [4] &quot;ITEM 3.\\nQUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK&quot;                                \r\n#&gt;  [5] &quot;ITEM 4.\\nCONTROLS AND PROCEDURES&quot;                                                                   \r\n#&gt;  [6] &quot;ITEM 1.\\nLEGAL PROCEEDINGS&quot;                                                                         \r\n#&gt;  [7] &quot;ITEM 1A.\\nRISK FACTORS&quot;                                                                             \r\n#&gt;  [8] &quot;ITEM 2.\\nUNREGISTERED SALES OF EQUITY SECURITIES AND USE OF PROCEEDS&quot;                               \r\n#&gt;  [9] &quot;ITEM 3.\\nDEFAULTS UPON SENIOR SECURITIES&quot;                                                           \r\n#&gt; [10] &quot;ITEM 4.\\nMINE SAFETY DISCLOSURES&quot;                                                                   \r\n#&gt; [11] &quot;ITEM 5.\\nOTHER INFORMATION&quot;                                                                         \r\n#&gt; [12] &quot;ITEM 
6.\\nEXHIBITS&quot;\r\nhead(doc[grepl(&quot;market risk&quot;, doc$item.name, ignore.case=TRUE),&quot;text&quot;], 3)\r\n#&gt; [1] &quot;ITEM\u00a03.\\nQUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK&quot;                                                                                                                                                                                                                                                                                                                                                                                                                               \r\n#&gt; [2] &quot;We have exposure to market\\nrisks due to the volatility of interest rates, foreign currency exchange rates, equity and bond markets. A portion of these risks are hedged, but fluctuations could impact our results of operations, financial position and cash flows. Additionally,\\nwe have exposure to downgrades in the credit ratings of our counterparties as well as exposure related to our credit rating changes.&quot;                                                                         \r\n#&gt; [3] &quot;Interest Rate Risk.\u00a0Our exposure to market risk for changes in interest rates relates primarily to our investment portfolio. As of\\nMarch\u00a031, 2017, we had no material available-for-sale securities that had been in a continuous unrealized loss position for a period greater than 12\u00a0months. We\\ndetermined no available-for-sale securities were other-than-temporarily impaired as of March\u00a031, 2017. We currently do not use derivative financial instruments in\\nour investment portfolio.&quot;\r\nrisks &lt;- doc[grepl(&quot;market risk&quot;, doc$item.name, ignore.case=TRUE),&quot;raw&quot;]\r\n<\/code><\/pre>\r\n\r\n<p>Now the document is all ready for whatever further processing we want. 
As a\r\nquick example, we&#39;ll pull out all the italicized risks.<\/p>\r\n\r\n<pre><code class=\"r\">risks &lt;- risks[grep(&#39;&lt;i&gt;&#39;,risks)]\r\nrisks &lt;- gsub(&quot;^.*&lt;i&gt;|&lt;\/i&gt;.*$&quot;, &quot;&quot;, risks)\r\nrisks &lt;- gsub(&quot;\\n&quot;, &quot; &quot;, risks)\r\nrisks\r\n#&gt; [1] &quot;Interest Rate Risk&quot;             &quot;Foreign Currency Exchange Risk&quot;\r\n#&gt; [3] &quot;Derivatives and Hedging.&quot;       &quot;Other Market Risks&quot;\r\n<\/code><\/pre>\r\n\r\n<p>This is a fairly simplistic example, but should serve as a good tutorial on\r\nprocessing filings.<\/p>\r\n\r\n<h2>How to Download<\/h2>\r\n\r\n<p>edgarWebR is available from CRAN, so it can be installed via<\/p>\r\n\r\n<pre><code class=\"r\">install.packages(&quot;edgarWebR&quot;)\r\n<\/code><\/pre>\r\n\r\n<p>If you want the latest and greatest, you can get a copy of the development version\r\nfrom GitHub by using devtools:<\/p>\r\n\r\n<pre><code class=\"r\"># install.packages(&quot;devtools&quot;)\r\ndevtools::install_github(&quot;mwaldstein\/edgarWebR&quot;)\r\n<\/code><\/pre>","protected":false},"excerpt":{"rendered":"New to edgarWebR 0.2.0 are functions for parsing SEC documents. While there are good R packages for XBRL processing, there is a gap in extracting information from other document types available via the site. 
edgarWebR currently provides functions for 2 of those &#8211; parse_submission() &#8211; Processes a raw SGML filing into component documents parse_filing() &#8211; &hellip; <p class=\"link-more\"><a href=\"https:\/\/micah.waldste.in\/blog\/blog\/2017\/10\/parsing-functions-in-edgarwebr\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Parsing Functions in edgarWebR&#8221;<\/span><\/a><\/p>","protected":false},"author":1,"featured_media":453,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16],"tags":[],"class_list":["post-450","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-rstats","no-wpautop"],"_links":{"self":[{"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/posts\/450","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/comments?post=450"}],"version-history":[{"count":2,"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/posts\/450\/revisions"}],"predecessor-version":[{"id":454,"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/posts\/450\/revisions\/454"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/media\/453"}],"wp:attachment":[{"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/media?parent=450"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/categories?post=450"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/micah.waldste.in\/blog\/wp-json\/wp\/v2\/tags?post=450"}],"curies":[{"name":"wp","href":"https:\
/\/api.w.org\/{rel}","templated":true}]}}