We work with files that are usually distributed as plain text and contain tabular data. These tables need to be stored in a database (Hive, Impala, Oracle, and others). The same data can often be represented in several file formats, and a user should be able to retrieve a file in any of those formats on request. The system should provide pluggable rules for converting from the internal format to external formats, and conversion should happen on the fly when the file is requested.
We need to develop a file system that can be mounted on Linux and is recognized as a regular file system. It should also support web-based access (HttpFS, for example).
In this project Hive will be used as the backend, but the Hive connector should be a replaceable plugin. Import into the database should be performed with a regular copy command to the mounted file system. The system should recognize the input format from the file extension.
The user should be able to download the file back in a different format, selected by the extension. The original data should also remain available.
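One possible way to keep the backend replaceable is a small connector interface discovered through Java's standard `ServiceLoader`. The sketch below is only an illustration: `BackendConnector`, `InMemoryConnector`, and `ConnectorRegistry` are hypothetical names, and the in-memory implementation is a stand-in so the example runs without a Hive cluster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.ServiceLoader;

/** Hypothetical contract that every storage backend (Hive, Impala, Oracle, ...) would implement. */
interface BackendConnector {
    String name();
    void storeTable(String table, List<String[]> rows);
    List<String[]> readTable(String table);
}

/** In-memory stand-in (an assumption for this sketch, not a real backend). */
final class InMemoryConnector implements BackendConnector {
    private final Map<String, List<String[]>> tables = new HashMap<>();

    public String name() { return "memory"; }

    public void storeTable(String table, List<String[]> rows) {
        tables.put(table, new ArrayList<>(rows));
    }

    public List<String[]> readTable(String table) {
        return tables.getOrDefault(table, List.of());
    }
}

final class ConnectorRegistry {
    /** Returns the connector registered under the given name, or the fallback if none is found. */
    static BackendConnector find(String wanted, BackendConnector fallback) {
        // ServiceLoader picks up implementations listed under META-INF/services on the classpath.
        for (BackendConnector connector : ServiceLoader.load(BackendConnector.class)) {
            if (connector.name().equals(wanted)) {
                return connector;
            }
        }
        return fallback;
    }
}
```

With this layout, a Hive connector would ship as its own jar with a `META-INF/services` entry, and replacing it would not require changes to the file system code.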
Examples of commands
import GFF file to db:
cp OriginalFile /ourFilesystem/XXX.gff
export the same data in different formats:
cp /ourFilesystem/XXX.gff destination #(in gff format)
cp /ourFilesystem/XXX.gtf destination #(in gtf format)
cp /ourFilesystem/XXX.gb destination #(in GenBank format)
cp /ourFilesystem/XXX.raw destination #(in original format)
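The extension-based selection in the examples above could be sketched roughly as follows. `FormatResolver` is a hypothetical helper, and the format labels in the map are illustrative. Only the extensions themselves come from the examples.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: maps the extension of a virtual path to an export format. */
final class FormatResolver {
    // Extensions are taken from the examples; the labels on the right are assumptions.
    private static final Map<String, String> FORMATS = Map.of(
            "gff", "GFF",
            "gtf", "GTF",
            "gb", "GenBank",
            "raw", "ORIGINAL");

    /** Returns the format for the path's extension, or empty if it is unknown or missing. */
    static Optional<String> resolve(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return Optional.empty();
        }
        String extension = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(FORMATS.get(extension));
    }

    /** Returns the dataset name shared by all views, e.g. "XXX" for "/ourFilesystem/XXX.gtf". */
    static String datasetName(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        return dot < 0 ? name : name.substring(0, dot);
    }
}
```

Here `FormatResolver.resolve("/ourFilesystem/XXX.gtf")` returns an `Optional` containing `"GTF"`, while a path without a known extension yields an empty `Optional`, which the file system could report as "no such file".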
Conversion rules should be added at a specific location in the file system and be usable immediately after they are added, without restarting the system.
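A simple way to get the "no restart" behaviour is to rescan the rules location on every lookup instead of caching it at startup. The sketch below assumes, purely for illustration, that each rule is a file named `<extension>.rule` in one directory. The spec does not define the rule format, and `RuleDirectory` is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

/** Hypothetical registry: one file per rule, named "<extension>.rule" (an assumed convention). */
final class RuleDirectory {
    private static final String SUFFIX = ".rule";
    private final Path dir;

    RuleDirectory(Path dir) {
        this.dir = dir;
    }

    /** Rescans the directory on every call, so newly added rules are visible immediately. */
    Set<String> availableFormats() throws IOException {
        Set<String> formats = new TreeSet<>();
        try (Stream<Path> files = Files.list(dir)) {
            files.map(p -> p.getFileName().toString())
                 .filter(n -> n.endsWith(SUFFIX))
                 .forEach(n -> formats.add(n.substring(0, n.length() - SUFFIX.length())));
        }
        return formats;
    }

    /** Reports whether a rule for the given extension currently exists. */
    boolean supports(String extension) throws IOException {
        return availableFormats().contains(extension);
    }
}
```

Rescanning trades some lookup cost for simplicity. If the directory grows large, a `java.nio.file.WatchService` could keep a cached set up to date instead.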
Some additional points:
1. Metadata should be available. It can include the submission time, the original format, the command used for submission, user-provided metadata, and tool-provided metadata.
2. It should be possible to pass additional parameters to Hive during import, for example file compression or the storage format (RCFile or regular file).
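As an illustration of point 2, import parameters could be translated into Hive statements before the data is loaded. This is a sketch under assumptions: `ImportOptions` and `HiveStatements` are hypothetical names, "RCF" is taken to mean Hive's RCFile storage format (`STORED AS RCFILE`), and compression is mapped to the `hive.exec.compress.output` setting. How the user supplies these options through the file system is left open here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical import options; the field names are illustrative, not part of the spec. */
final class ImportOptions {
    final String storageFormat;   // Hive STORED AS value, e.g. "RCFILE" or "TEXTFILE"
    final boolean compressOutput; // whether Hive should compress the data it writes

    ImportOptions(String storageFormat, boolean compressOutput) {
        this.storageFormat = storageFormat;
        this.compressOutput = compressOutput;
    }
}

final class HiveStatements {
    // Allowlist keeps arbitrary user text out of the STORED AS clause.
    private static final Set<String> ALLOWED = Set.of("TEXTFILE", "RCFILE");

    /**
     * Builds the statements to run, in order, before loading data into the table.
     * The table name and column definitions are assumed to be validated by the caller.
     */
    static List<String> forImport(String table, List<String> columnDefs, ImportOptions options) {
        if (!ALLOWED.contains(options.storageFormat)) {
            throw new IllegalArgumentException("Unsupported storage format: " + options.storageFormat);
        }
        List<String> statements = new ArrayList<>();
        if (options.compressOutput) {
            statements.add("SET hive.exec.compress.output=true");
        }
        statements.add("CREATE TABLE " + table
                + " (" + String.join(", ", columnDefs) + ")"
                + " STORED AS " + options.storageFormat);
        return statements;
    }
}
```

Keeping this translation in one place would also make it easier to swap the Hive-specific statements for another backend's equivalents.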
Code should be commented and fully covered by functional and unit tests.
Project should be written in Java using Maven.
If third-party modules are used, or code is copied from any other project (including open source), that code or those modules must be under one of the following licenses: MIT, BSD, LGPL, Apache, EPL, or Mozilla Public License.
Any other license must be approved before use. The general rule for approval: the license should allow creating both open-source and closed-source projects without restrictions, and should allow using the code in commercial projects. GPL is not allowed.
I will be happy to answer any questions.