To complement digital transformation efforts, many organisations are looking for a data-first approach to growth, security and performance. In a distributed, disintermediated world, in which all communications and interactions are achieved through the transfer of data, the need to analyse historic data is paramount.
Preparedness for tomorrow comes from understanding what happened yesterday, or the month or year before. Data, as is now widely understood, is immeasurably valuable; failing to record or store it is the equivalent of leaving the tap running. The waste is enormous, and the missed opportunity can have significant consequences down the line.
Where it starts
You’re looking for a stack of solutions that, together, work as a complete answer for the collection, storage and analysis of data. It starts with nProbe: a network traffic solution that collects and exports NetFlow flows. It features both a NetFlow probe – v5, v9 and IPFIX – and a collector, to collect and export flows generated by border gateways, switches and routers (or any device that can generate NetFlow v5 or v9 flows).
Significantly, nProbe is able to analyse multi-gigabit networks at full speed with no or very moderate packet loss. But once you’ve collected your NetFlow data, the question of where to export it remains. And even before then, there’s the intermediary step of ingestion: to process and broker the data.
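To make the idea of a “flow” concrete, here is a minimal Python sketch of the kind of per-flow summary a NetFlow-style export carries. The field names are illustrative rather than nProbe’s exact template, but the shape – who talked to whom, over which ports and protocol, how much, and when – is the essence of flow data.

from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class FlowRecord:
    """Illustrative NetFlow-style flow record: the kind of per-flow
    summary a probe exports (field names are our own, not nProbe's
    exact template)."""
    src_addr: str       # source IP address
    dst_addr: str       # destination IP address
    src_port: int       # source transport port
    dst_port: int       # destination transport port
    protocol: int       # IP protocol number (6 = TCP, 17 = UDP)
    packets: int        # packets observed in this flow
    bytes: int          # bytes observed in this flow
    first_seen: datetime
    last_seen: datetime

# Example: a short HTTPS exchange summarised as a single flow
flow = FlowRecord(
    src_addr="192.0.2.10", dst_addr="198.51.100.20",
    src_port=51432, dst_port=443, protocol=6,
    packets=42, bytes=31337,
    first_seen=datetime(2023, 1, 1, 12, 0, 0),
    last_seen=datetime(2023, 1, 1, 12, 0, 5),
)
print(asdict(flow))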
Step two: Apache Kafka
This is where Apache Kafka comes in (hereafter referred to as Kafka, not to be confused with the author or visions of grotesque lifeforms). Kafka is an open-source distributed event streaming platform used for high-performance data pipelines, streaming analytics and data integration.
A pipeline is a useful visual analogy. Acting as a message broker, Kafka funnels the NetFlow flows probed and collected by nProbe. Event streaming ensures a continuous flow of data, crucial for an always-connected world, so that the right information is funnelled to the right place. Kafka features out-of-the-box connections to hundreds of event sources. Crucially, for our purposes, that includes NetFlow. It scales to production clusters of up to a thousand brokers, processing trillions of messages per day and petabytes of data.
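To illustrate the brokering step, here is a minimal sketch of publishing flow records to Kafka from Python using the kafka-python client. The broker address and the “netflow” topic name are assumptions for the example, and serialising records as JSON is one convenient choice among several.

import json
from kafka import KafkaProducer  # pip install kafka-python

# Broker address and topic name are assumptions for this sketch;
# substitute your own cluster details.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# A flow record serialised as JSON (fields as in the earlier sketch).
flow = {
    "src_addr": "192.0.2.10", "dst_addr": "198.51.100.20",
    "src_port": 51432, "dst_port": 443, "protocol": 6,
    "packets": 42, "bytes": 31337,
}

# Publish the record to the hypothetical "netflow" topic; Kafka takes
# care of buffering, partitioning and delivery to consumers.
producer.send("netflow", value=flow)
producer.flush()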
A home for data: Apache Hadoop
With all this NetFlow data probed, collected and brokered, it needs a place to go: somewhere that can store it in the quantity necessary to maintain historic insight, and scale to the volume of data a large network produces. After all, the bigger the network, the greater the requirement for insight, and that means more data.
As with our intermediary step, the answer to this final step also comes from Apache. An Apache Hadoop data lake is built on highly scalable, distributed file storage. It has numerous benefits over a traditional data warehouse, but none greater than its capacity: with rising amounts of data flowing through networks and collected by monitoring tools, any storage solution must be able to scale with volume. Data lakes are a direct answer to the requirements of Big Data. They help to reduce the cost of data analysis, act as a unified storage solution – with a single set of employee access policies – and provide the analytics tools to easily extract insight from the data stored.
With Hadoop, you’ve got a data solution that can sustainably store the NetFlow flows collected by nProbe and brokered by Kafka.
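Tying the pieces together, the sketch below consumes flow records from the assumed “netflow” topic and lands them in HDFS as newline-delimited JSON, partitioned by day. It uses the kafka-python and hdfs (WebHDFS) client libraries; the addresses, paths and partitioning scheme are all assumptions, and a production pipeline might instead use a dedicated connector such as a Kafka Connect HDFS sink, but the movement of data is the same: probe, broker, store.

import json
from datetime import date

from kafka import KafkaConsumer   # pip install kafka-python
from hdfs import InsecureClient   # pip install hdfs (WebHDFS client)

# Addresses, topic and paths below are assumptions for this sketch.
consumer = KafkaConsumer(
    "netflow",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,  # stop iterating if no new messages arrive
)
hdfs_client = InsecureClient("http://namenode:9870", user="hadoop")

# Collect a batch of flow records from the topic...
batch = [message.value for message in consumer]

# ...and land them in the data lake as newline-delimited JSON,
# partitioned by day so later analysis can prune by date.
if batch:
    path = f"/data/netflow/dt={date.today().isoformat()}/flows.json"
    hdfs_client.write(
        path,
        data="\n".join(json.dumps(record) for record in batch),
        encoding="utf-8",
        overwrite=True,
    )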
We’ve always advocated that the best solution is an amalgamation of solutions, each playing a key role in the data-collection-and-analysis process. For help integrating any single solution described here, or all three for a comprehensive solution to maximising data collection and insight, contact us today.