We need a system/platform that will give us better business intelligence on, and understanding of, a certain e-commerce industry segment in the South East Asia region, with an initial focus on Singapore and Malaysia. The intelligence will be gathered via automated web scraping and data mining. The system and the scraped data will need to be accessible via an online front-end.
* Initially, there will be around 10 websites scraped (5 in Singapore, 5 in Malaysia). More will be added later.
* The information that needs to be scraped and captured will be specified to selected contractors, but will include data such as the name of the site, product number, product category, description, availability, original price, discounted prices, the name of the seller, etc.
* The system has to monitor and scrape the sites on an ongoing basis and store the data in a database.
* An alert system needs to be in place that will notify us if there is an issue with scraping any of the sites and/or if the layout of a scraped site has changed and the information we are scraping is no longer correct.
* The system needs to be modular and the scraping processes must run in parallel (sites can be added and removed without having to restart everything else). At the same time, the progress of each scraping process needs to be tracked, so that if it is interrupted or needs to be restarted, the scraping doesn’t start again from the beginning.
* The scraped information is stored in a database that is regularly backed up.
* The scraping is anonymous and spaced out in time, so that it doesn’t negatively affect the scraped sites.
* The system is high-performing and will remain so even after several years of scraping (and a large database of records).
* The system can be easily modified and new functionalities and scraping scripts added.
* Suggested technologies - Python, Ruby, or Node.js. MongoDB or MySQL. (open to suggestions and recommendations)
* User roles and content access levels (admin vs. regular user). Admins will be able to control which scraped sites and information the regular users can view. The regular users will be further sub-divided into several categories with different content access rights (e.g. users who can only view the overall industry reports [see below], users who have access to the raw data, etc.)
* Ability to generate reports on the fly and for a specified time period. Reports can be generated using all available fields or only a selected subset.
* Ability to print out a generated report or download it as a PDF.
* Ability to export raw data as CSV.
* Pre-generated overall industry reports, including visual charts (these may, for example, include: industry summary; company performance based on total revenue; company performance based on the number of products sold; industry category breakdown; top performing products in each category; etc.).
* Attractive, easy to use and navigate, “web 2.0” look and feel.
* Suggested technologies - Python (Django), Ruby (Rails), or PHP. AJAX, jQuery, CSS, HTML5. MySQL. (open to suggestions and recommendations)
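To make two of the back-end requirements above more concrete (resuming an interrupted scrape, and alerting when scraped data stops looking correct), here is a minimal Python sketch. It is an illustration only, not a required design. The field names, the JSON checkpoint file, and the `run_site` and `alert` names are all assumptions made for this example, and real page fetching is replaced by in-memory lists of records.

```python
import json
from pathlib import Path

# Hypothetical field names, loosely based on the data listed in this brief.
REQUIRED_FIELDS = ("site", "product_number", "category", "original_price")


def missing_fields(record):
    """Return the required fields that are absent or empty in a scraped record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def load_checkpoint(path):
    """Return the index of the next page to scrape, or 0 if no checkpoint exists."""
    p = Path(path)
    if not p.exists():
        return 0
    return json.loads(p.read_text())["next_page"]


def save_checkpoint(path, next_page):
    """Persist the index of the next page so a restart can resume from it."""
    Path(path).write_text(json.dumps({"next_page": next_page}))


def run_site(pages, checkpoint_path, alert):
    """Process pages starting at the checkpoint and alert on incomplete records.

    `pages` is a list of lists of record dicts standing in for fetched pages.
    `alert` is any callable that accepts a message string (assumed interface).
    """
    start = load_checkpoint(checkpoint_path)
    stored = []
    for i in range(start, len(pages)):
        for record in pages[i]:
            missing = missing_fields(record)
            if missing:
                # An empty required field often means the site layout changed.
                alert(f"page {i}: missing {missing}")
            else:
                stored.append(record)
        # Save progress after each page so a restart skips completed pages.
        save_checkpoint(checkpoint_path, i + 1)
    return stored
```

In a real system the checkpoint would more likely live in the database next to the scraped data, and `alert` would send an email or chat notification; both are left open here.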
In terms of priorities, the front-end is secondary. Scraping and extracting the data is more crucial.
It is important to understand that we are looking for someone to help us build this system, not only scrape and deliver the data to us. We will own the code and the system developed, and we will continue to actively maintain, expand and develop it.
We are a startup operating in Singapore and Malaysia and have much more ambitious plans than this initial platform. We are also hiring full-time developers for this project, so if you have a strong background in data scraping and web development and are interested in working with us on a more permanent basis, let us know.