Data Extraction


Job Description

Goal: I want to extract multiple pieces of data from financial press releases and have them delivered to me in a CSV file.

Details: I have a sample set of 500 U.S. publicly traded companies. Each of these companies issues press releases, which can be accessed through either 1) email, 2) an RSS feed, or 3) the company's investor relations website. Some of the press releases contain event scheduling information (scheduling an earnings call, announcing a dividend, announcing an analyst day, a stock split).

I want to develop an easy-to-use and easy-to-update system that can take one of the three inputs detailed above (email, RSS feed, or the IR website address where press releases are published) for each of 3000+ companies, filter the press releases for the ones that contain event information, and, for each one that does, extract the appropriate details (the details differ by event). These details then need to be placed into the correct format and made accessible in some fashion (RSS feed, email, etc.).
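To make the filtering step concrete, here is a minimal sketch in Python, assuming that a simple keyword match on the press release title is enough for a first pass. The keyword lists and the function names (classify_release, filter_releases) are illustrative assumptions, not a tested rule set; real releases would need tuning against the sample companies.

```python
# Hypothetical keyword lists per event type (assumption: title text is enough
# to spot most event announcements; tune against real press releases).
EVENT_KEYWORDS = {
    "earnings": ("financial results", "earnings call", "earnings release"),
    "dividend": ("dividend",),
    "analyst_day": ("analyst day",),
    "stock_split": ("stock split",),
}


def classify_release(title):
    """Return the first matching event type for a title, or None."""
    text = title.lower()
    for event_type, keywords in EVENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return event_type
    return None


def filter_releases(titles):
    """Keep only titles that look like event announcements."""
    matches = []
    for title in titles:
        event_type = classify_release(title)
        if event_type is not None:
            matches.append((title, event_type))
    return matches
```

Releases that match no keyword are dropped, so the keyword lists control how many event announcements slip through unnoticed.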

My strong preference is to use cloud-based tools that require minimal coding, possibly making use of data/entity extraction services like Alchemy. I will need to be able to edit the input sources, as companies are acquired or go bankrupt, and new companies become public through an IPO.

As an example, Google's IR site is ( The site has a viewable news feed, and it can be subscribed to via RSS or email. The third item on the list (Google Announces Date of Third Quarter 2012 Financial Results Conference Call) contains one of the event scheduling items I am interested in (an earnings release being scheduled). Within the text content of this press release, I would need to extract the company symbol [GOOG], event type [financial results/earnings], conference call date/time [10/18/12 4:30 pm EST], webcast link if available [], conference call dial-in number if available [none--N/A] and conference call passcode if available [none--N/A].
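As a rough illustration of the extraction step, the sketch below pulls a ticker symbol, call date/time, and webcast link out of release text with regular expressions. Everything here is an assumption for illustration: the sample sentence is made up (it is not the actual Google release), the patterns only cover "Month D, YYYY" dates and "(NASDAQ: XXXX)" style tickers, and time zones, dial-in numbers, and passcodes are not handled.

```python
import re
from datetime import datetime

# Illustrative patterns only; real releases vary widely in wording.
TICKER_RE = re.compile(r"\((?:NASDAQ|NYSE):\s*([A-Z.]+)\)")
DATE_RE = re.compile(
    r"(?:January|February|March|April|May|June|July|August|September"
    r"|October|November|December)\s+\d{1,2},\s+\d{4}"
)
TIME_RE = re.compile(r"\b(\d{1,2}:\d{2})\s*([ap])\.?m\.?", re.IGNORECASE)
URL_RE = re.compile(r"https?://[^\s)\]]+")


def extract_call_details(text):
    """Extract earnings-call fields from release text; missing fields become 'N/A'."""
    ticker = TICKER_RE.search(text)
    date = DATE_RE.search(text)
    time = TIME_RE.search(text)
    url = URL_RE.search(text)

    call_dt = None
    if date and time:
        # Normalize "4:30 p.m." to "4:30 PM" so strptime can parse it.
        stamp = f"{date.group(0)} {time.group(1)} {time.group(2).upper()}M"
        call_dt = datetime.strptime(stamp, "%B %d, %Y %I:%M %p")

    return {
        "symbol": ticker.group(1) if ticker else "N/A",
        "event_type": "earnings",
        "call_datetime": call_dt.isoformat(sep=" ", timespec="minutes") if call_dt else "N/A",
        "webcast_link": url.group(0).rstrip(".,") if url else "N/A",
    }


# Made-up sample text, not taken from any real press release.
SAMPLE = (
    "Example Corp (NASDAQ: EXMP) will hold its quarterly conference call on "
    "October 18, 2012 at 4:30 p.m. Eastern Time. A webcast will be available "
    "at http://investor.example.com/webcast."
)
details = extract_call_details(SAMPLE)
```

Hand-written patterns like these break easily when wording changes, which is one reason an entity extraction service may be a better fit for 3000+ companies.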

Requirements: Must accept input via email, RSS feed or website for 3000+ entities. Filter press releases for those containing event information, and extract the event information from those that contain it. Format the data into the appropriate columns (CSV or RSS preferably), and make it available daily (preferably via email).
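For the output format, a minimal sketch of the CSV step might look like the following, using Python's standard csv module. The column names and the events_to_csv helper are assumptions chosen to mirror the fields in the Google example above; any field that was not extracted is written as N/A.

```python
import csv
import io

# Assumed column layout, mirroring the fields listed in the example.
COLUMNS = [
    "symbol",
    "event_type",
    "call_datetime",
    "webcast_link",
    "dial_in",
    "passcode",
]


def events_to_csv(events):
    """Render a list of event dicts as CSV text, filling gaps with 'N/A'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        restval="N/A",          # value used for columns missing from a row
        extrasaction="ignore",  # silently drop keys that are not columns
        lineterminator="\n",
    )
    writer.writeheader()
    for event in events:
        writer.writerow(event)
    return buffer.getvalue()


csv_text = events_to_csv(
    [{"symbol": "GOOG", "event_type": "earnings", "call_datetime": "2012-10-18 16:30"}]
)
```

The resulting text could then be attached to a daily email or written to a file, depending on the delivery method chosen.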

Entry requirement #1: Specify the methodology/tools you would use to accomplish the task in a step-by-step process. (E.g., aggregate all RSS feeds in Yahoo Pipes, filter the aggregated feed through Yahoo Pipes, use Pipes to push to the Alchemy web service for data extraction, use Alchemy to format the extracted data in the correct columnar format, then push the data back into Yahoo Pipes and deliver it as needed via RSS feed.)

Entry requirement #2: To ensure you read all the requirements, please write "IR Press Release" as the first words in your response to this request.