Sunday, October 7, 2018

Troubleshooting Data Sources with Incorrect Times using Splunk

By Tony Lee

Have you ever suspected that a data source was sending data with the wrong time? This can be a problem because Splunk tries to parse and use the event time instead of the ingest time, which can make ingested data hard to find. If you suspect this is the case, you may be experiencing one of the following scenarios:
  • Systems not using NTP that experience clock drift
  • Systems using broken or faulty NTP
  • Systems using the wrong timezone (e.g., sending events in Central time but labeling them as GMT)
Depending on the time range selected, this can result in data not showing up within Splunk (or any SIEM) because the data may appear to be in the past or the future. For example, events that lag the current time by 5 hours will not show up if "Last 4 hours" is selected for the time range. Similarly, events that are sent with a future date and time will typically show up only when a time range that extends into the future, such as "All Time", is selected.
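
To make the time range effect concrete, here is a small Python sketch (illustrative arithmetic only, not Splunk code). It assumes a hypothetical host whose clock runs five hours behind GMT and which labels its local timestamps as GMT. The function name visible_in_window and the sample times are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

def visible_in_window(event_time, now, window):
    """Return True if event_time falls inside the range [now - window, now]."""
    return now - window <= event_time <= now

# Actual moment the event was generated, expressed in UTC (GMT).
actual = datetime(2018, 10, 7, 17, 0, tzinfo=timezone.utc)

# Hypothetical host clock running five hours behind GMT.
local_tz = timezone(timedelta(hours=-5))
wall_clock = actual.astimezone(local_tz)   # reads 12:00 on the host

# The host sends the local wall-clock reading but labels it as GMT.
parsed = wall_clock.replace(tzinfo=timezone.utc)

lag = actual - parsed                      # how far in the past the event appears
in_last_4h = visible_in_window(parsed, actual, timedelta(hours=4))
in_last_24h = visible_in_window(parsed, actual, timedelta(hours=24))

# An event stamped two hours ahead falls outside any window that ends at "now".
future_event = actual + timedelta(hours=2)
future_visible = visible_in_window(future_event, actual, timedelta(hours=24))

print(lag, in_last_4h, in_last_24h, future_visible)
```

In this sketch the mislabeled event appears five hours old, so a four-hour window misses it while a 24-hour window catches it, and the future-dated event is missed by both.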

Enough about the problems. Let's walk through building one possible solution. As a bonus, we provide the code for the dashboard shown below at the bottom of the article.

Figure 1:  Last Communicated Calculator

Dashboard Components

To assist in usability, we provide a drop-down input at the top that contains a list of the indexes. This list is populated dynamically using the dbinspect command, which returns information about the buckets of existing indexes within Splunk. The following search populates the drop-down input in the dashboard.

| dbinspect index=* | where NOT match(index, "^_") | table index | dedup index | sort index


The upper (host detail) panel consists of columns indicating the host, total count, first written time, last written time, and so on, which is ideal information for diagnosing time issues. This information comes from the metadata command, which can quickly return information about hosts, sources, and sourcetypes. In this case, we care about the hosts.

| metadata index=<index we care about> type=hosts
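
The host detail panel in the dashboard code at the end of this article derives the time since each host last communicated from the lastTime value (epoch seconds) that metadata returns. The following Python sketch mirrors that arithmetic outside of Splunk. The function name last_comm and the sample values are hypothetical.

```python
def last_comm(last_time, now):
    """Return the time since last_time (both epoch seconds) in seconds, minutes, and hours."""
    seconds = now - last_time
    minutes = seconds / 60
    hours = minutes / 60
    return {"seconds": seconds, "minutes": minutes, "hours": hours}

now = 1538931600                       # sample "current time" in epoch seconds
stale = last_comm(now - 18000, now)    # host whose latest event is 5 hours old
future = last_comm(now + 7200, now)    # host stamping events 2 hours ahead

print(stale)
print(future)
```

A large positive value suggests a host that is lagging or silent, while a negative value suggests a host that is stamping events in the future.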


The lower panel (a time-based area chart) represents the volume of data at a given time for a given host. We used the tstats command, which we covered in a previous article; the search looks like the following:

| tstats prestats=t count where index=<index we care about> AND host=<host we care about> by host, _time | timechart useother=false count by host
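
As a rough illustration of what the chart plots, the Python sketch below counts events per host and per hour. It is not how tstats is implemented; the function name count_by_host_and_hour, the host names, and the sample timestamps are assumptions made for this example.

```python
from collections import Counter

def count_by_host_and_hour(events):
    """Count events per (host, hour bucket); events are (host, epoch_seconds) pairs."""
    counts = Counter()
    for host, ts in events:
        bucket = ts - (ts % 3600)   # floor the timestamp to the start of its hour
        counts[(host, bucket)] += 1
    return counts

base = 1538931600                          # sample hour-aligned epoch time
events = [
    ("hostA", base + 5),                   # hostA reports the correct time
    ("hostA", base + 1800),
    ("hostB", base - 5 * 3600 + 10),       # hostB stamps events 5 hours in the past
]
counts = count_by_host_and_hour(events)

for (host, bucket), n in sorted(counts.items()):
    print(host, bucket, n)
```

A host with a bad clock shows its volume in an unexpected bucket, or in no visible bucket at all if the bucket falls outside the selected time range.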


Notably, every search on this dashboard uses metadata rather than raw event data, which is why it discovers these details so quickly. As a result, there is no time wasted waiting for the searches to return, and the data renders almost instantly.

Conclusion

Splunk provides decent visibility into various features within the Monitoring Console / DMC (Distributed Management Console), but we found this flexible and customizable dashboard to be quite helpful for gaining additional insight into the last time a host communicated. It can be used to identify, troubleshoot, and finally confirm the time being reported by devices. We hope this article helps you troubleshoot these very frustrating issues. Enjoy!

Dashboard XML code

Below is the dashboard code needed to see the Last Communicated Times for hosts by Index.  Feel free to modify the dashboard as needed:

<form>
  <label>Last Communicated Calculator</label>
  <description>Select an Index (or Indexes) - High Number is bad...</description>
  <fieldset submitButton="true">
    <input type="time" token="time">
      <label>Time Range</label>
      <default>
        <earliest>-24h@h</earliest>
        <latest>now</latest>
      </default>
    </input>
    <input type="multiselect" token="index">
      <label>Index</label>
      <fieldForLabel>index</fieldForLabel>
      <fieldForValue>index</fieldForValue>
      <search>
        <query>| dbinspect index=* | where NOT match(index, "^_") | table index | dedup index | sort index</query>
        <earliest>-30d@d</earliest>
        <latest>now</latest>
      </search>
      <valuePrefix>index=</valuePrefix>
      <delimiter> OR </delimiter>
    </input>
    <input type="text" token="host">
      <label>Host</label>
      <default>*</default>
    </input>
  </fieldset>
  <row>
    <panel>
      <table>
        <title>Hosts</title>
        <search>
          <query>| metadata $index$ type=hosts | dedup host | eval currentTime=now() | eval seconds=now()-lastTime | eval minutes=(seconds/60) | eval hours=(minutes/60) | convert ctime(lastTime) ctime(firstTime) ctime(currentTime) | table host, totalCount, firstTime, lastTime, currentTime, hours, minutes, seconds | sort - seconds | rename hours AS "Last Comm (in hrs)", minutes AS "Last Comm (in mins)", seconds AS "Last Comm (in secs)" | search host=$host$</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
          <sampleRatio>1</sampleRatio>
        </search>
        <option name="count">20</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="rowNumbers">true</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
  <row>
    <panel>
      <chart>
        <title>Visual (Keep in mind your time range.  Anything beyond the time range will not show up)</title>
        <search>
          <query>| tstats prestats=t count where $index$ AND host=$host$ by host, _time | timechart useother=false count by host</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
        </search>
        <option name="charting.chart">area</option>
        <option name="charting.drilldown">none</option>
        <option name="refresh.display">progressbar</option>
      </chart>
    </panel>
  </row>
</form>

1 comment:

  1. Per customer demand, we updated the dynamic input query to sort by index so they are listed in alphabetical order. To help the reader, we updated the code above.