Perl or python script to calculate the majority value from a set of values

Perl or python script to calculate the majority value from a set of values

Closed - This job posting has been filled and work has been completed.

Job Description

The input file is always the same number of columns/structure will be dumped from a database. It will have approximately 50,000 rows.

See the attached file for a cleaner description.
...
(403, 'FR940419-0-00065', 153.4, 6, 34, -1, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL),
(403, 'FR940419-2-00002', 5733.2, 3, 2, 1, 0, 0, 0, 0, 0, NULL, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NULL, 0, 0, 0, NULL),
(403, 'FR940419-2-00002', 5733.2, 3, 2, -1, 0, 0, 0, 0, 0, NULL, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NULL, 0, 0, 0, NULL),
(403, 'FR940419-2-00012', 38.6, 8, 69, -1, 0, NULL, NULL, 1, NULL, 0, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, 0, NULL, 0, NULL, NULL, NULL, NULL, NULL),
...

If a -1 appears in the 6th column, it is an unknown value and should be evaluated the same as a null value for comparison purposes.

The 6th column should only be one of the following values: (-1, 0, 1).

The last 24 columns should each only be one of the following values: (NULL, 0, 1).

If the 6th column is different from the majority choice of the last 24 columns, write out the row as described below. Take the majority action (see below on what to do if a tie occurs).

For the following, keep in mind that we take the majority non-null vote, even if only one of them is non-null. Only if all 24 values are null, do we take the "all are NULL" decision.

col 6 majority of 24 action
------ ------------------ -----------------------
-1 all are NULL don't write (skip it)
-1 0 write
-1 1 write

0 all are NULL write
0 0 don't write (skip it)
0 1 write

1 all are NULL write
1 0 write
1 1 don't write (skip it)


What to write?

one file with a single line with 7 comma-separated values, each enclosed in double quotes for loading into a database.

column 1 (from orig file), column2 (from orig file), column 6 (from orig file), majority decision of last 24 non-null columns, a count of the number of the last 24 columns that are not null, the percentage of non-null terms in the majority (should always be at least 0.500, or 50%)

So for:

(1, '88666130', 3855.0, 1, 7, 1, 0, 0, 0, NULL, 0, 0, 1, 0, 0, 0, 0, 0, NULL, 0, 0, 0, NULL, 0, NULL, 1, 0, 0, 0, 0),
(1, '88056144', 745.2, 1, 80, 0, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL),

The program would write (note the double quotes around each term):

"1", "88666130","1","0","20", "0.900"
"1","88056144","0","NULL","0", "0"

If there is a tie (since you are taking the majority of 24 columns it is possible), write it out and put a "T" where the average should be (5th column):

(403, 'LA041090-0027', 6825.2, 4, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, NULL, 0, 0, NULL, 1),

The program would write:
"403","LA041090-0027","0","T","22", "0.500"

Open Attachment