Monday, August 6, 2018

Detecting Data Feed Issues with Splunk - Part II

by Tony Lee

As a Splunk admin, you don’t always control the devices that generate your data. As a result, you may only have control of the data once it reaches Splunk. But what happens when that data stops being sent to Splunk? How long does it take anyone to notice and how much data is lost in the meantime?

We have seen many customers struggle with monitoring and detecting data feed issues, so we figured we would shed some light on the subject. Part I of this series (http://www.securitysynapse.com/2017/11/detecting-data-feed-issues-with-splunk.html) discusses the challenges and steps required to build a potential solution. We highly recommend a quick read, since it lays the groundwork for the dashboard shown here.

In this article, we build on that work and provide a handy dashboard (screenshot shown below) that can be used for heads-up awareness.


Figure 1:  Data Feed Monitor dashboard

Dashboard Explanation

The search that generates the percentage drop is similar to the search we created in Part I of this series. It looks back over the past two days, calculates each day's worth of traffic per index, takes the difference between the two days, and expresses it as a percentage drop. Any index that drops by more than 50% is displayed as a tile. Notice that we are also excluding a few indexes, such as test, main, and lastchanceindex; this list can be customized to fit your needs.
 
| tstats prestats=t count where earliest=-2d@d latest=-0d@d index!=lastchanceindex index!=test index!=main index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index | eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=(100-((Yesterday/TwoDaysAgo)*100)) | where PercentageDiff>50 AND DataSource!="catch_all" | table DataSource, PercentageDiff | eval tmp="anything" | xyseries tmp DataSource PercentageDiff | fields - tmp | sort PercentageDiff
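
If heads-up awareness on a dashboard is not enough, the same logic could also be scheduled as an alert. The stanza below is a minimal savedsearches.conf sketch (not part of the original dashboard), assuming a daily 7:00 AM run and a hypothetical email recipient; adjust the schedule, exclusions, and alert action to fit your environment.

[Data Feed Percentage Drop]
# Run daily at 07:00; the tstats where clause pins the two-day comparison window
enableSched = 1
cron_schedule = 0 7 * * *
dispatch.earliest_time = -2d@d
dispatch.latest_time = -0d@d
# Same search as the dashboard, minus the xyseries formatting used for the trellis tiles
search = | tstats prestats=t count where earliest=-2d@d latest=-0d@d index!=lastchanceindex index!=test index!=main index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index | eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=(100-((Yesterday/TwoDaysAgo)*100)) | where PercentageDiff>50 AND DataSource!="catch_all" | table DataSource, PercentageDiff
# Fire when at least one data source exceeds the 50% drop threshold
counttype = number of events
relation = greater than
quantity = 0
# Hypothetical email action -- swap in your preferred alert action and recipients
action.email = 1
action.email.to = splunk-admins@example.com
action.email.subject = Data feed drop detected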


The dashboard code uses a trellis layout in which a tile is dynamically created for each data source whose percentage drop exceeds 50%. Range colors then indicate severity: anything below 50% (which typically is not shown) is green, 50-80% is yellow, and over 80% is red. These thresholds and colors can also be customized to fit your needs, as sketched below.
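
For example, to flag problems earlier, the thresholds could be lowered. The snippet below is a sketch of swapping the 50/80 boundaries for 40/70; the two rangeValues split the tiles into the three color ranges defined by rangeColors (green below the first value, yellow between, red above).

<!-- Hypothetical tweak: tiles turn yellow at a 40% drop and red at 70% -->
<option name="rangeColors">["0x65a637","0xf58f39","0xd93f3c"]</option>
<option name="rangeValues">[40,70]</option>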

Conclusion

This dashboard can be one more tool used to help detect data loss. It is not as real-time as it could be, but if it is made too real-time, there can be false positives when legitimate dips in traffic occur (e.g., employees go home for the day). Because you have the code, you are welcome to adjust it as needed to fit your situation; one possible hourly variation is sketched below. Enjoy!
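
As one example of such an adjustment, the search below compares the last full hour against the hour before it instead of day over day. This is only a rough sketch; with hourly granularity you may want a higher threshold or time-of-day exclusions to avoid the false positives described above.

| tstats prestats=t count where earliest=-2h@h latest=-0h@h index!=lastchanceindex index!=test index!=main index=* by index, _time span=1h | timechart useother=false limit=0 span=1h count by index | eval _time=strftime(_time,"%Y-%m-%d %H:%M") | transpose | rename column AS DataSource, "row 1" AS PreviousHour, "row 2" AS LastHour | eval PercentageDiff=(100-((LastHour/PreviousHour)*100)) | where PercentageDiff>50 AND DataSource!="catch_all" | table DataSource, PercentageDiff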

Dashboard Code

<dashboard>
  <label>Data Feed Monitor</label>
  <description>Percentage Drop Shown Below</description>
  <row>
    <panel>
      <single>
        <search>
          <query>| tstats prestats=t count where earliest=-2d@d latest=-0d@d index!=test index!=main index!=lastchanceindex index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index | eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=(100-((Yesterday/TwoDaysAgo)*100)) | where PercentageDiff&gt;50 AND DataSource!="catch_all" | table DataSource, PercentageDiff | eval tmp="anything" | xyseries tmp DataSource PercentageDiff | fields - tmp | sort PercentageDiff</query>
          <earliest>-48h@h</earliest>
          <latest>now</latest>
          <sampleRatio>1</sampleRatio>
        </search>
        <option name="colorBy">value</option>
        <option name="colorMode">block</option>
        <option name="drilldown">none</option>
        <option name="numberPrecision">0</option>
        <option name="rangeColors">["0x65a637","0xf58f39","0xd93f3c"]</option>
        <option name="rangeValues">[50,80]</option>
        <option name="refresh.display">progressbar</option>
        <option name="showSparkline">1</option>
        <option name="showTrendIndicator">1</option>
        <option name="trellis.enabled">1</option>
        <option name="trellis.scales.shared">0</option>
        <option name="trellis.size">medium</option>
        <option name="trellis.splitBy">DataSource</option>
        <option name="trendColorInterpretation">standard</option>
        <option name="trendDisplayMode">absolute</option>
        <option name="unit">%</option>
        <option name="unitPosition">after</option>
        <option name="useColors">1</option>
        <option name="useThousandSeparators">1</option>
      </single>
    </panel>
  </row>
</dashboard>

4 comments:

  1. This helps out greatly! Keep up the good info.

  2. There are also the Splunk Universal Forwarder heartbeat logs (every 2 minutes in my environment) and each host Windows Security logs are generally pretty active. I have a "last 5 min" search on the SUF heartbeats and should get at least 2 of them (YMMV) which also has tcp_thruput values. Do some baseline calculations and you can setup a stats summary by host for a period of time of how many minimum events you should expect to see. Key is to alert even if there are no heartbeats... so start your search with a lookup table of asset hosts, then "fillnull" with 0 for hosts that don't have matching heartbeat events. Same for WinEventLogs.

  3. Nice! Do you mind sharing the search here when you get a chance? Thanks for the tip.

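
For readers who want to experiment with the approach described in the second comment, here is a rough sketch of the lookup-plus-fillnull pattern (it is not the commenter's actual search). It assumes a hypothetical assets.csv lookup containing a host column, and it uses a simple per-host event count over the last five minutes as a stand-in for the heartbeat or Windows Security log searches mentioned above; hosts that return zero events are flagged.

| inputlookup assets.csv | fields host | join type=left host [| tstats count AS events where index=* earliest=-5m@m by host] | fillnull value=0 events | where events=0 | table host, events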