Apache NiFi for Data Collection
Hortonworks DataFlow (Apache NiFi) is a great open source solution for the data ingestion problem faced by many Hadoop implementations. We have seen many log analysis projects work well in test and in the data center, only to fall over in real production. Why?
Life isn't perfect, and neither are networks or the Internet. What happens when your log producer generates more data than your centralized Hadoop environment can handle at that moment? What happens when the Internet connection goes down? How do you retain the data until the connection is restored? Unfortunately, many previous solutions relied on the assumption that some data loss was acceptable. However, this just kicked the can down the road into reporting and analysis. To provide consistent reports, the missing data had to be extrapolated, producing inaccurate but acceptable results. Developing for the unreliable realities of a connected world required too much effort to justify the expense.
Then along came NiFi. NiFi handles these complications by processing the data as close to the source as possible, in a distributed way, and by managing data throttling and delivery for you. As an added benefit, its web UI is trivial to use, and as data sources and types change, it is easy to swap out the collecting component for a new one that matches the new data source.
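To make the throttling and delivery idea concrete, here is a minimal Python sketch of the store-and-forward pattern described above. It is not NiFi code and does not use any NiFi API. The class name `StoreAndForwardBuffer`, the `send` callable, and the `max_records` limit are all hypothetical names chosen for illustration. The sketch only shows the two behaviors the text mentions: holding data until the connection is restored, and pushing back on a producer that is generating more than can be handled.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Hypothetical edge-side buffer (illustration only, not NiFi's implementation)."""

    def __init__(self, send, max_records=1000):
        # send: assumed callable that raises ConnectionError when the link is down
        self.send = send
        self.max_records = max_records
        self.pending = deque()

    def offer(self, record):
        """Queue a record, or return False to signal back pressure to the producer."""
        if len(self.pending) >= self.max_records:
            return False
        self.pending.append(record)
        return True

    def flush(self):
        """Deliver queued records in order; stop at the first failure and keep the rest."""
        delivered = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                break  # link is down: retain remaining records for the next attempt
            self.pending.popleft()  # remove only after a successful send
            delivered += 1
        return delivered
```

A producer would call `offer` for each log record and slow down when it returns False, while a timer calls `flush` periodically. A record leaves the buffer only after `send` succeeds, so an outage delays delivery instead of losing data. A real deployment would also persist the queue to disk so that it survives a restart, which this in-memory sketch does not do.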
We recommend that you give NiFi a try for all your data ingestion needs. It is available for free at http://www.hortonworks.com/hdf or from http://nifi.apache.org.