Most public companies are required to file an annual Form 10-K with the SEC, and these filings are publicly available through the SEC’s EDGAR system. Parsing these free-form, loosely structured documents is challenging because of inconsistent formatting, regulatory changes over time, and filing requirements that vary across companies. In this paper, I document the numerous assumptions and decisions made while developing version 3.0 of the DocNet system, an automated tool that parses 10-Ks (and other narratives) into their component sections and allows manual adjustments through a web-based editor. The core logic is written in Python, the web interface is built with Flask, and the backend combines MongoDB and SQLite databases.
Work in Process, 2017.