Solr filesystem monitor prototype


Job Description


I would like to index a filesystem with Solr and keep that index up to date through periodic runs that pick up changes, remove deleted items, and so on.

The expected deliverables are:
- schema.xml matching the filesystem information (file name, directory path, creation date, modification date, owner, size, plus everything Tika can extract: content, author, etc.). It needs a facet hierarchy that handles the filesystem structure and browsing requests cleanly.
- A standalone Java program that walks the filesystem and synchronizes a remote Solr instance with what it observes.
- The solr-web portlet, configured to search with the defined schema.
- A simple additional portlet to "browse" the indexed filesystem.
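
To illustrate the facet hierarchy requested for schema.xml, here is a minimal sketch of one common Solr approach: each document stores one depth-prefixed token per ancestor directory, so a browse portlet could list the children of a folder with a facet.prefix filter. The class name PathFacets and the method facetTokens are hypothetical, and the token layout is an assumption, not a requirement of this job.

```java
import java.util.ArrayList;
import java.util.List;

final class PathFacets {

    /**
     * Builds one token per ancestor directory, prefixed with its depth.
     * "/home/user/docs" yields "0/home", "1/home/user", "2/home/user/docs".
     */
    static List<String> facetTokens(String directoryPath) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder prefix = new StringBuilder();
        int depth = 0;
        for (String segment : directoryPath.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the leading slash and any doubled slashes
            }
            if (depth > 0) {
                prefix.append('/');
            }
            prefix.append(segment);
            tokens.add(depth + "/" + prefix);
            depth++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : facetTokens("/home/user/docs")) {
            System.out.println(token);
        }
    }
}
```

With tokens like these in a multi-valued string field, the direct children of /home/user would be the facet values starting with "2/home/user/".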

Please note the specific constraints:
- Solr is the only reference store (in particular, the standalone Java program should not use any cache).
- Pay close attention to the "update" and "delete" cases.
- Rely on the last-modification date to identify updates.
- The Solr instance itself won't have access to the filesystem (i.e. Tika extraction must be handled by the standalone Java program, which pushes fully extracted indexing requests to Solr).
- Performance matters (number of Solr calls, network round trips, etc.).
- The portlets must work with Liferay.
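
The update and delete constraints above can be sketched as a plain comparison of two path-to-timestamp maps: one read from Solr at the start of a run (so Solr stays the only reference store), one built by walking the filesystem. This is only a sketch under assumptions: the names SyncPlan and SyncPlanner are hypothetical, and it treats any difference in last-modification time as an update.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Result of one comparison run: paths to (re)index and paths to delete. */
final class SyncPlan {
    final List<String> toIndex = new ArrayList<String>();
    final List<String> toDelete = new ArrayList<String>();
}

final class SyncPlanner {

    /**
     * @param indexed  path -> last-modified millis, as currently stored in Solr
     * @param observed path -> last-modified millis, as seen on the filesystem
     */
    static SyncPlan plan(Map<String, Long> indexed, Map<String, Long> observed) {
        SyncPlan plan = new SyncPlan();
        for (Map.Entry<String, Long> entry : observed.entrySet()) {
            Long indexedModified = indexed.get(entry.getKey());
            // New file, or timestamp differs from the indexed one: (re)index it.
            if (indexedModified == null || !indexedModified.equals(entry.getValue())) {
                plan.toIndex.add(entry.getKey());
            }
        }
        for (String path : indexed.keySet()) {
            // Known to Solr but no longer on disk: delete it from the index.
            if (!observed.containsKey(path)) {
                plan.toDelete.add(path);
            }
        }
        Collections.sort(plan.toIndex);  // deterministic order for batching
        Collections.sort(plan.toDelete);
        return plan;
    }
}
```

Only the files in toIndex would need Tika extraction on the client side, and both lists could then be sent in batches (for example a single add of many documents and a single delete by id list, followed by one commit) to keep the number of Solr calls low.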

Versions:
Solr 4.0.0
Liferay Portal Community Edition 6.1.1 CE GA2
Solr-web portlet : solr-web-

Fixed-price quotes only, please.
Happy bidding.