I have been asked to assess which technology we should use for the problem described below. The options are Hadoop, Hive, and Pig. I do not have much experience with any of them, so I would appreciate a pointer to a good source to read. I have googled and found tons of references, but it is hard to find a step-by-step explanation or comparison.
Here is the task I need to solve.
Users enter sentences into the system. Each sentence is split into words, which are stored in a Cassandra column family. Each row is keyed by a single word, and the column names are the timestamps at which that word was entered; the columns have no values.
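To make the layout concrete, here is a minimal in-memory sketch in Python. It only models the shape of the data and is not Cassandra client code. The name `store_sentence`, the whitespace tokenization, and the integer timestamps are assumptions for illustration.

```python
from collections import defaultdict

def store_sentence(store, sentence, timestamp):
    """Split a sentence on whitespace and record one timestamp per word.

    `store` maps word (row key) -> set of timestamps (column names).
    A set is used because column names are unique within a row, so a word
    repeated in one sentence with the same timestamp is recorded once.
    """
    for word in sentence.split():
        store[word].add(timestamp)

# Hypothetical example data: two sentences entered at times 1000 and 1005.
store = defaultdict(set)
store_sentence(store, "the quick fox", 1000)
store_sentence(store, "the lazy dog", 1005)
```

After these two calls the row for "the" holds two columns (1000 and 1005), and every other word holds one.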
I need to be able to query the database and extract N words that are taken from the following breakdown:
a_1% must be the top words from period T1 from now into the past
a_2% must be the top words from period T2 from now into the past
a_3% must be the top words from period T3 from now into the past
a_n% must be the top words from period T_n from now into the past
a_1 + a_2 + ... + a_n = 100%
and T1, T2, etc. are arbitrary time intervals.
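To pin down what I mean, here is a minimal Python sketch of the selection logic on in-memory data. It is independent of Hadoop, Hive, or Pig and only illustrates the expected result. The function name `top_words_mix`, the rounding of each quota, and the rule that a word already picked for one period is skipped in later ones are my assumptions and are not fixed requirements.

```python
from collections import Counter

def top_words_mix(store, now, periods, n):
    """Pick up to n words, taking a share of the top words from each period.

    store:   dict mapping word -> iterable of timestamps
    periods: list of (fraction, length) pairs; fractions should sum to 1,
             and each length is measured from `now` into the past
    """
    result = []
    for fraction, length in periods:
        # Assumption: the quota is rounded, so the quotas may not add up to n exactly.
        quota = round(n * fraction)
        counts = Counter({
            word: sum(1 for ts in stamps if now - length <= ts <= now)
            for word, stamps in store.items()
        })
        # Assumption: skip words with no hits and words chosen for an earlier period.
        top = [w for w, c in counts.most_common() if c > 0 and w not in result]
        result.extend(top[:quota])
    return result[:n]

# Hypothetical data: word -> timestamps, with "now" at 100.
example = {"a": [95, 96, 97], "b": [98, 99], "c": [10, 20, 30, 40, 99]}
# 50% from the last 10 time units, 50% from the last 100.
picked = top_words_mix(example, now=100, periods=[(0.5, 10), (0.5, 100)], n=2)
print(picked)  # ['a', 'c']
```

In this example "a" is the most frequent word in the short window, while "c" wins over the long window, so each period contributes one word.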
Any suggestion on which technology to use for this task would be greatly appreciated. We are already using Cassandra and are quite familiar with it; now we need to decide which analytical tool to put on top of it.