Run this between Prometheus and your real exporters. Watch Prometheus log parse errors and mark the target as down, then verify that your alerts fire correctly.
    @app.route('/metrics')
    def metrics():
        if random.random() < 0.2:  # 20% of the time
            return "malformed_metric{ invalid syntax", 200
        return Response(real_metrics(), mimetype='text/plain')
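If you want to check the failure rate without standing up a server, one option is to pull the random decision out of the route into a pure function. The sketch below is only an illustration under that assumption: `maybe_corrupt`, `MALFORMED_BODY`, and the `failure_rate` parameter are hypothetical names, not part of the handler above, and the malformed body is simply the same string the handler returns.

```python
import random

# Same deliberately broken exposition text the handler returns.
MALFORMED_BODY = "malformed_metric{ invalid syntax"


def maybe_corrupt(payload: str, rng: random.Random, failure_rate: float = 0.2) -> str:
    """Return an unparseable body with probability failure_rate, else the payload."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    # rng.random() is in [0.0, 1.0), so 0.0 never corrupts and 1.0 always does.
    if rng.random() < failure_rate:
        return MALFORMED_BODY
    return payload


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs behave the same
    sample = "up 1\n"
    corrupted = sum(
        maybe_corrupt(sample, rng) == MALFORMED_BODY for _ in range(10_000)
    )
    print(f"corrupted {corrupted} of 10000 responses")
```

Passing the random generator in as an argument keeps the decision repeatable in tests, while the route itself can keep calling it with a shared `random.Random()` instance.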
What happens when your Prometheus server runs out of memory? What if a metric scrape takes 30 seconds because a target is thrashing? What if your alerting rules become corrupt?
    # Inject 5s latency into 50% of scrape requests for 2 minutes
    curl -X POST http://localhost:9091/inject/latency \
      -d '{"duration":"2m","percent":50,"delay":"5s"}'

If you run Prometheus Operator, pair it with Chaos Mesh (a CNCF project) and a NetworkChaos experiment:
Prometheus Chaos Edition turns the old monitoring paradox on its head. Instead of trusting your monitoring blindly, you break it on purpose – gently, repeatedly, and observably.