Web data scraping from a local directory of office buildings
Closed - This job posting has been filled.
Need to scrape an office building directory site which claims to have 50k records. The data is presented neatly on templated pages. All pages are linked from a rigid 2-tier category structure with pagination. The top tier has only 3 categories.
You will need to write a PHP scraper to crawl all pages and write the data to a UTF-8 tab-delimited text file. Given the 50k records, you may set the bot to start crawling from each top-tier category separately, so the records will be split into 3 txt files, one per category.
Sample page: primeoffice dot com dot hk slash hong_kong_office slash building_index slash Building_Profile.asp?B=10422
PS: The URL of the first image for each building is also needed.
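
For illustration only, here is a rough sketch of how one record might be written as a UTF-8 tab-delimited line. The helper name `formatTsvRow`, the column order, and the sample values (including the example.com image URL) are made up for this sketch and are not part of the requirements. It assumes the field values have already been converted to UTF-8.

```php
<?php
// Hypothetical helper: turns one record into a single tab-delimited line.
// Assumes every value is already a UTF-8 string.
function formatTsvRow(array $fields): string
{
    $clean = array_map(function ($field) {
        // Tabs and line breaks inside a value would break the column layout,
        // so replace them with a plain space.
        return trim(str_replace(["\r\n", "\r", "\n", "\t"], ' ', (string) $field));
    }, $fields);

    return implode("\t", $clean) . "\n";
}

// Made-up sample record: ID, building name, first image URL.
// The real column set is not specified in this posting.
$record = ['10422', "Example\tTower", 'http://example.com/images/10422_1.jpg'];

// Write to a temporary file here; the real scraper would use one file per top-tier category.
$path = tempnam(sys_get_temp_dir(), 'tsv');
$out = fopen($path, 'w');
fwrite($out, formatTsvRow($record));
fclose($out);
```

Replacing embedded tabs and line breaks by hand, instead of relying on quoting, keeps each record on exactly one line with a fixed number of columns.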