When managing complex data flows, it is important to have an effective notification system to alert teams when something goes wrong. Event-driven notifications can be overwhelming, so data-driven notifications should be used to detect issues with data processing. Anomaly-driven notifications are the most sophisticated and require a significant investment, but can be worth it for larger data processing systems.
Almost everyone starts with event-driven notifications. You identify a few key jobs that are critical to the business. You set up emails that get sent based on the success or failure of those jobs. As you scale up the number of jobs, this approach will result in notification storms every time something goes wrong (or right, depending on your configuration). The emails fill up your inbox, making this way of triggering notifications less palatable.
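As a rough illustration of the event-driven pattern, here is a minimal sketch in Python. The names `run_with_notification`, `notify`, and the job names are hypothetical, and a plain list stands in for the email system; the point is only that every job outcome produces its own message.

```python
from typing import Callable, List


def run_with_notification(
    job_name: str,
    job: Callable[[], None],
    notify: Callable[[str], None],
    notify_on_success: bool = False,
) -> bool:
    """Run one job and emit a notification for its outcome (hypothetical helper)."""
    try:
        job()
    except Exception as exc:
        # One failure event produces one message.
        notify(f"[FAILED] {job_name}: {exc}")
        return False
    if notify_on_success:
        notify(f"[OK] {job_name}")
    return True


# A list stands in for an inbox; a real setup would send email instead.
sent: List[str] = []


def failing_job() -> None:
    raise RuntimeError("upstream file missing")


run_with_notification("nightly_load", failing_job, sent.append)
run_with_notification("nightly_report", lambda: None, sent.append, notify_on_success=True)
```

With two jobs this is manageable. With hundreds of jobs that share an upstream dependency, a single outage would append one message per job, which is the notification storm described above.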
Eventually, you tire of event-driven notifications. You disable the less critical alerts. You also create more error-resistant and self-healing data pipes. No individual failure impacts the business significantly or requires any hands-on fixing to get things flowing again.
This also means that, to determine if there’s anything wrong with processing, you need to look into the data at various stages and get statistics on freshness, quantity, and rate of errors over time.
For most companies, alerts triggered by this type of analysis are sufficient. But, at a certain scale and volume, this way of handling notifications can also become overwhelming and trigger false alerts.
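A data-driven check might look something like the following sketch. It assumes you can already collect per-stage statistics; the `StageStats` fields, the `check_stage` function, and every threshold value are illustrative assumptions rather than recommended settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class StageStats:
    """Statistics gathered for one pipeline stage (hypothetical shape)."""
    stage: str
    last_record_at: datetime  # freshness
    row_count: int            # quantity
    error_count: int          # errors observed in the same window


def check_stage(
    stats: StageStats,
    now: datetime,
    max_staleness: timedelta = timedelta(hours=2),  # assumed threshold
    min_rows: int = 1000,                           # assumed threshold
    max_error_rate: float = 0.01,                   # assumed threshold
) -> List[str]:
    """Return a list of problems found in the data, empty if the stage looks healthy."""
    problems: List[str] = []
    if now - stats.last_record_at > max_staleness:
        problems.append(f"{stats.stage}: data is stale")
    if stats.row_count < min_rows:
        problems.append(f"{stats.stage}: only {stats.row_count} rows")
    # With zero rows the quantity check above already fires, so treat the rate as 0.
    error_rate = stats.error_count / stats.row_count if stats.row_count else 0.0
    if error_rate > max_error_rate:
        problems.append(f"{stats.stage}: error rate {error_rate:.1%}")
    return problems


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
healthy = StageStats("ingest", now - timedelta(minutes=30), 5000, 10)
degraded = StageStats("transform", now - timedelta(hours=5), 400, 20)

healthy_problems = check_stage(healthy, now)
degraded_problems = check_stage(degraded, now)
```

Here the alert depends on what the data looks like, not on whether any single job succeeded, so a retried or self-healed failure that leaves the data fresh and complete raises nothing.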
Anomaly-driven notification is the most sophisticated of these strategies. When you start down this path, you will begin with something simple, like a trigger that fires when a statistic deviates from prior history by more than a certain percentage. Eventually, this will lead into machine learning models designed to detect anomalies under different circumstances. It is a significant investment to implement this strategy, but if your data processing is at the appropriate scale, it can be worth it.
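The simple starting point, a trigger based on percentage change from prior history, could be sketched as below. The function name `is_anomalous`, the use of a plain mean as the baseline, and the 50% threshold are all assumptions made for illustration; a real system would likely account for seasonality and trends.

```python
from statistics import mean
from typing import Sequence


def is_anomalous(
    history: Sequence[float],
    current: float,
    max_pct_change: float = 0.5,  # assumed threshold: 50% deviation from baseline
) -> bool:
    """Flag a value that deviates from the historical mean by more than max_pct_change."""
    if not history:
        return False  # no prior history to compare against
    baseline = mean(history)
    if baseline == 0:
        # Percentage change is undefined; treat any non-zero value as a change.
        return current != 0
    return abs(current - baseline) / abs(baseline) > max_pct_change


# Hypothetical daily row counts for one table.
daily_rows = [10_000, 10_400, 9_800, 10_100, 9_900]

normal_day = is_anomalous(daily_rows, 10_300)
bad_day = is_anomalous(daily_rows, 3_000)
```

A fixed percentage like this is easy to reason about but tends to misfire on data with weekly cycles or steady growth, which is one reason this approach may eventually give way to models that learn what normal looks like under different circumstances.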
Regardless of the level of sophistication in your notification trigger design, you will most likely have a mix of the strategies above. However, the crucial goal is to create an environment where the on-call teams responsible for your data can preserve their sanity, and consumers can continue to get the quality and continuity of service that they deserve.