Thursday 15 May 2014

Kafka Storm HDFS/S3 data flow


It is not clear whether you can do a fan-out (duplication) in Kafka the way you can in Flume.

I want Kafka to save data to HDFS or S3 and send a duplicate of that data to Storm for real-time processing. The output of Storm's aggregations/analysis would then be stored in Cassandra. I have seen implementations where all of the data flows through Storm and two outputs come out of Storm. However, I want to remove the dependency on Storm for raw data storage.
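Roughly, what I am hoping is possible is something like the sketch below, assuming Kafka's Java consumer API (the topic name "raw-events" and the group ids are made up for illustration): two consumers in different consumer groups reading the same topic, one archiving raw data and one feeding the analysis side.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FanOutSketch {

    // Consumers with *different* group.ids each receive the full stream of the
    // topic; Kafka tracks offsets independently per consumer group.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("group.id", groupId);
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("raw-events"));
        return consumer;
    }

    public static void main(String[] args) {
        // One group feeds the raw-data archiver (HDFS/S3), the other the analytics path.
        KafkaConsumer<String, String> archiver  = consumerFor("hdfs-archiver");
        KafkaConsumer<String, String> analytics = consumerFor("storm-topology");

        // Both groups see every record; neither takes data away from the other.
        // A real archiver or spout would poll in a loop rather than once.
        ConsumerRecords<String, String> forArchive = archiver.poll(Duration.ofSeconds(1));
        ConsumerRecords<String, String> forStorm   = analytics.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> r : forArchive) {
            System.out.println("archive <- " + r.value());
        }
        for (ConsumerRecord<String, String> r : forStorm) {
            System.out.println("analyze <- " + r.value());
        }
        archiver.close();
        analytics.close();
    }
}
```

In practice the Storm side would presumably be a KafkaSpout configured with its own consumer group rather than a hand-rolled consumer, but the idea is the same: different groups keep different offsets, and each sees every record.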

Is this possible? Are you aware of any documentation / examples / implementations like this?

Also, does Kafka have good support for S3 storage?

I saw Camus for storing to HDFS. Do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous one has finished? Finally, would Camus work with S3?

Thanks - I appreciate it! Regarding Camus, yes, a scheduler that launches the job should work.

What they use at LinkedIn is Azkaban; you could look at that too.

If one job launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.
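To make that overlap concrete, here is a simplified sketch. The offset store is reduced to a single in-memory variable; Camus actually persists its offsets per partition on HDFS, but the race is the same.

```java
public class OverlappingRunsSketch {

    // Simplified stand-in for the job's offset bookkeeping: the last offset that a
    // *finished* run has recorded.
    static long committedOffset = 0;

    // Phase 1 of a run: decide which offsets to read, based on the committed offset.
    static long startReading(String name, long endOfTopic) {
        long start = committedOffset;
        System.out.println(name + " reads offsets " + start + ".." + (endOfTopic - 1));
        return endOfTopic;
    }

    // Phase 2 of a run: record progress, which only happens once the run finishes.
    static void finishRun(String name, long newOffset) {
        committedOffset = newOffset;
        System.out.println(name + " committed offset " + newOffset);
    }

    public static void main(String[] args) {
        long endOfTopic = 1000;

        long aEnd = startReading("run A", endOfTopic); // A starts, reads 0..999
        long bEnd = startReading("run B", endOfTopic); // B launched before A finished:
                                                       // it also starts at offset 0 and
                                                       // re-reads the same 1000 records
        finishRun("run A", aEnd);
        finishRun("run B", bEnd);
    }
}
```

So whatever scheduler you use should avoid starting a new run while the previous one is still in flight, unless double reads are acceptable downstream.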

Regarding Camus with S3, I don't think that is currently in place.
