Jan 4, 2017
The wording of an amendment can play a significant role in shifting the direction or intent of a bill. Knowing what words are being amended and who’s amending them are valuable insights for effective legislative tracking.
However, identifying accurate amendment information has been a consistent pain for those following Capitol Hill. Amendments made available on Congress.gov only include text as a link to the page in the Congressional Record where they’re mentioned, requiring complicated text parsing to process the full text of each amendment. Fortunately, recent efforts by the House Rules Committee to publish the text of House amendments in XML format have allowed Quorum to deliver a strategic solution to this problem.
What is XML?
XML is a simple text format designed to meet the challenges of large-scale electronic publishing. Since its creation in 1996, XML has been the foundation for other notable formats like RSS and XHTML, the XML-based reformulation of HTML, the markup language used for websites across the internet. In 2013, House bills were made available as XML bulk data on the Government Publishing Office’s FDsys Bulk Data Repository (Senate bills were later added in 2015). Last February marked a significant step towards transparency when House and Senate bill status information became available in XML Bulk Data. This release laid the groundwork for Quorum to improve the speed and accuracy of its federal bill data and build a more robust system that needs less engineering oversight to maintain quality.
The engineering team at Quorum started by building scrapers designed to visit the House Rules website each day as new amendment information is made available. The scrapers are programmed to identify PDF files on the website, extract XML from them, and automatically pull in the full text of House amendments. The system then uses data from XML files published by the Government Publishing Office (GPO) website to enrich the amendment text with additional information. Aaron Pelz, a software engineer at Quorum, explains, “House Rules is the best source for amendment text, but the GPO has a better timeline of what actually happens to the amendment. The GPO data helps us parse out when things were voted on and if they were passed or not.” By combining data from House Rules and the GPO, Quorum engineers successfully built a database of all House amendments from the 114th Congress that will continue to operate in the 115th Congress without additional effort.
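The combining step described above can be sketched in a few lines of Python. This is a minimal illustration, not Quorum's actual code: the XML element names, attribute names, and sample records below are all invented for the example and do not reflect the real House Rules or GPO schemas. It only shows the idea of joining amendment text from one source with a timeline of actions from another, keyed on the amendment number.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML shapes, invented for illustration only.
# The real House Rules and GPO files use their own schemas.
RULES_XML = """
<amendments>
  <amendment number="H.AMDT.101">
    <sponsor>Rep. Example</sponsor>
    <text>Strike section 2 and insert a new section 2.</text>
  </amendment>
</amendments>
"""

STATUS_XML = """
<amendmentStatus>
  <amendment number="H.AMDT.101">
    <action date="2016-05-17">Offered</action>
    <action date="2016-05-18">Agreed to by recorded vote</action>
  </amendment>
</amendmentStatus>
"""


def parse_texts(xml_string):
    """Map amendment number -> sponsor and full text (text source)."""
    root = ET.fromstring(xml_string.strip())
    return {
        a.get("number"): {
            "sponsor": a.findtext("sponsor"),
            "text": a.findtext("text"),
        }
        for a in root.iter("amendment")
    }


def parse_actions(xml_string):
    """Map amendment number -> list of (date, action) pairs (timeline source)."""
    root = ET.fromstring(xml_string.strip())
    return {
        a.get("number"): [(act.get("date"), act.text) for act in a.findall("action")]
        for a in root.iter("amendment")
    }


def merge(texts, actions):
    """Enrich each text record with its timeline; empty list if none is known."""
    return {
        number: dict(record, timeline=actions.get(number, []))
        for number, record in texts.items()
    }


profiles = merge(parse_texts(RULES_XML), parse_actions(STATUS_XML))
```

Keying both sources on the amendment number keeps the merge simple, and an amendment with no recorded actions still gets a profile with an empty timeline.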
Unfortunately, the Senate has yet to publish amendments in XML format, so Senate amendments require a more extensive process of text identification. Quorum engineers first pull in a list of amendment numbers from bulk XML files published by the GPO and then match those numbers against the Congressional Record to identify the text of Senate amendments. Pelz elaborates, “We look for the same amendment number in the Congressional Record following a certain pattern and then take that as our amendment. Under this approach it is much more difficult to get the text of every Senate amendment.” To make matters more difficult, the Congressional Record is not published in an easily machine-readable format, forcing Quorum to write advanced parsers to identify the different sections.
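To make the pattern-matching idea concrete, here is a rough Python sketch. The header pattern ("SA" followed by a number at the start of a line), the sample text, and the function name are assumptions made for illustration; the article does not say which pattern Quorum's parsers actually use, and real Congressional Record text is considerably messier.

```python
import re

# Invented sample text; the "SA <number>." header format is an assumption.
RECORD_TEXT = """\
TEXT OF AMENDMENTS

SA 4001. Mr. EXAMPLE submitted an amendment intended to be proposed to the bill S. 1; as follows:
At the end of title I, add the following new section.

SA 4002. Ms. SAMPLE submitted an amendment intended to be proposed to the bill S. 1; as follows:
Strike section 3.
"""

# An amendment header is assumed to start a line with "SA <number>. ".
HEADER = re.compile(r"^SA (\d+)\. ", re.MULTILINE)


def extract_amendments(record_text, wanted_numbers):
    """Return {number: text} for each wanted amendment found in the record.

    Each amendment's text runs from its header to the next header
    (or to the end of the record for the last one).
    """
    matches = list(HEADER.finditer(record_text))
    found = {}
    for i, match in enumerate(matches):
        number = int(match.group(1))
        if number not in wanted_numbers:
            continue
        end = matches[i + 1].start() if i + 1 < len(matches) else len(record_text)
        found[number] = record_text[match.start():end].strip()
    return found


# 4003 is in the wanted list but absent from the record, so it is simply missing.
amendments = extract_amendments(RECORD_TEXT, {4002, 4003})
```

Numbers that appear in the list but never match a header are left out of the result, which mirrors the difficulty Pelz describes in recovering the text of every Senate amendment.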
After the data is pulled and the amendment profiles constructed, users are presented with a searchable database of House and Senate amendments. “The ability to search across all of the amendments is a remarkable feat. People used to go through by hand matching up amendments with bills and trying to track down the text,” said Quorum’s Cofounder Alex Wirth.
Amendments are a valuable component of the legislative process, and with Quorum’s easy-to-use profiles, individuals can gain insight into the amendment sponsors, text, timeline, votes, and additional documents. As more and more government offices begin publishing data in machine-readable formats like XML, Quorum will integrate that data into its comprehensive database of legislative information.