Lately, I've been thinking quite a bit about the value that different types of data provide. Specifically, I've been considering the analytical, monitoring, and forensics value of different data types. Typically, organizations instrument their networks to collect a wide variety of data (flow, IDS, DNS, router logs, firewall logs, proxy logs, PCAP, etc.). Very quickly, the volume and diversity of the data collected can confuse and complicate an organization's network monitoring goals. An organized, well-structured approach is critical to successfully monitoring a network, and "data overload" is a serious detractor. I've seen it with my own eyes many times.
In thinking about why organizations end up in an overloaded/confused/complicated state, I've come up with two primary reasons:
1) No one data type by itself gives them what they need analytically/forensically/legally.
2) There is great uncertainty about what data needs to be collected and retained to ensure adequate "network knowledge", so organizations err on the side of caution and collect everything.
To me, this seems quite wasteful. It's not only wasteful of computing resources (storage, instrumentation hardware, etc.), but it's also wasteful of precious analytical/monitoring/forensics cycles. With so few individuals skilled in properly monitoring a network, the last thing we want to do is make that work harder, more confusing, and more opaque.
The good news is, I think there is a way forward here. As I discussed in a previous post, enriching layer 4 meta-data (e.g., network flow data) with some layer 7 (application layer) meta-data can open up a world of analytical/monitoring/forensics possibilities. I believe one could take the standard netflow (layer 4) fields and enrich them with selected application layer (layer 7) meta-data to create an "uber" data source that would meet the network monitoring needs of most organizations. I'm not sure exactly what that "uber" data source would look like, but I know it would be much easier to collect, store, and analyze than the current state of the art. The idea would be to find the right balance between the extremes of netflow (extremely compact, but no context) and full packet capture (full context, but extremely large). The "uber" data source would be somewhat compact in size and carry some context. Exactly how to tweak that dial should be the subject of further thought and dialogue.
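To make the idea concrete, here is a minimal sketch of what one such enriched record could look like. The field names and the choice of layer 7 attributes (HTTP host, DNS query, TLS SNI, etc.) are my own hypothetical illustration, not a proposed standard: the point is simply that a handful of optional application-layer fields attached to the familiar flow 5-tuple and counters adds context without approaching PCAP size.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EnrichedFlow:
    # Standard netflow-style (layer 4) fields
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str            # e.g. "tcp", "udp"
    byte_count: int
    packet_count: int
    start_ts: float          # epoch seconds
    end_ts: float

    # Hypothetical layer 7 enrichment: populated only when the
    # application protocol is recognized; None otherwise.
    app_protocol: Optional[str] = None   # e.g. "http", "dns", "tls"
    http_host: Optional[str] = None
    http_user_agent: Optional[str] = None
    dns_query: Optional[str] = None
    tls_sni: Optional[str] = None

# Example: an HTTP flow enriched with host and user-agent meta-data
flow = EnrichedFlow(
    src_ip="10.0.0.5", dst_ip="93.184.216.34",
    src_port=49152, dst_port=80, protocol="tcp",
    byte_count=12456, packet_count=18,
    start_ts=1700000000.0, end_ts=1700000002.5,
    app_protocol="http",
    http_host="example.com",
    http_user_agent="Mozilla/5.0",
)
print(flow.app_protocol, flow.http_host)
```

Even this toy record shows the trade-off: a few extra string fields per flow, versus the ability to answer questions ("which hosts were contacted?", "what was queried in DNS?") that bare netflow cannot.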
This is something I intend to continue thinking about, as I see great promise here.