Sunday, July 8, 2018

Spelunking your Splunk – Part IV (User Metrics)

By Tony Lee

Welcome to the fourth article of the Spelunking your Splunk series, all designed to help you understand your Splunk environment at a quick glance.  Here is a quick recap of the previous articles:

Spelunking your Splunk – Part I (Exploring Your Data)
Spelunking your Splunk – Part II (Disk Usage)
Spelunking your Splunk – Part III (License Usage)

This article focuses on understanding the users within the environment, even when they are spread across a search head cluster. We will show that it is possible to check the number of concurrent Splunk users, how much they are searching, their successful and failed logins, and aged accounts. This information is useful not only from an accountability perspective, but also from a resource perspective: when a search head (or cluster) becomes overloaded with users, it may be a good time to consider horizontal scaling.

Finding and understanding user information

There are at least two places within Splunk to discover user information. The first requires a RESTful call and provides information about authenticated users. The second is a search against the _audit index filtering on user activity. Try copying and pasting the following two searches into your Splunk search bar one at a time to see what data is returned:

| rest /services/authentication/httpauth-tokens splunk_server=*

Figure 1:  Current authenticated users via httpauth-tokens


index=_audit user=*

Figure 2:  _audit index with a focus on user activity

Now that you understand the basics, the sky is the limit. You can audit each user or display the statistics for all users. Take a look at our dashboard below to see what is possible. If you find it useful, we provide the code for it at the bottom of this article. Give it a try and let us know what you think.

Figure 3:  User Metrics dashboard with all panels
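One of the panels above flags aged accounts by measuring the days since each user's last successful login. The calculation, shown here as an illustrative Python sketch (the function name is ours), mirrors the dashboard's round((now()-_time)/(60*60*24)) eval:

```python
import time

def account_age_days(last_login_epoch, now_epoch=None):
    """Days since the last successful login, rounded to whole days,
    matching round((now()-_time)/(60*60*24)) in the dashboard SPL."""
    if now_epoch is None:
        now_epoch = time.time()
    return round((now_epoch - last_login_epoch) / (60 * 60 * 24))

# A user whose last login was 20 days ago exceeds the 15-day threshold
now_epoch = 1_700_000_000
print(account_age_days(now_epoch - 20 * 86400, now_epoch))  # 20
```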



Conclusion

Splunk provides decent visibility into various features within the Monitoring Console / DMC (Distributed Management Console), but we found this flexible and customizable dashboard quite helpful for gaining additional insight.  We hope this helps you too.  Enjoy!


Dashboard XML code


Below is the dashboard code needed to enumerate your user metrics.  Feel free to modify the dashboard as needed:

<form>
  <label>User Metrics</label>
  <description>Displays Interesting Usage Metrics</description>
  <!-- Add time range picker -->
  <fieldset autoRun="true">
    <input type="time" searchWhenChanged="true">
      <default>
        <earliestTime>-24h@h</earliestTime>
        <latestTime>now</latestTime>
      </default>
    </input>
    <input type="text" token="wild">
      <label>Search</label>
      <default>*</default>
      <suffix/>
    </input>
  </fieldset>
  <row>
    <panel>
      <chart>
        <title>Current Active Users</title>
        <search>
          <query>| rest /services/authentication/httpauth-tokens splunk_server=* | where NOT userName="splunk-system-user" | stats dc(userName) AS "Total Users"</query>
          <earliest>$earliest$</earliest>
          <latest>$latest$</latest>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.visibility">visible</option>
        <option name="charting.axisTitleY.visibility">visible</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.enabled">false</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">fillerGauge</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">all</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisMiddle</option>
        <option name="charting.legend.placement">right</option>
      </chart>
    </panel>
    <panel>
      <table>
        <title>Current Logged in Users</title>
        <search>
          <query>| rest /services/authentication/httpauth-tokens splunk_server=* | where NOT userName ="splunk-system-user" | stats max(timeAccessed) AS "Latest Activity" by userName | rename userName AS "User" | sort -"Latest Activity"</query>
          <earliest>$earliest$</earliest>
          <latest>$latest$</latest>
        </search>
        <option name="count">10</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="rowNumbers">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
    <panel>
      <table>
        <title>Total Searches</title>
        <search>
          <query>index=_audit user=* (action="search" AND info="granted") | where NOT user ="splunk-system-user" | stats count(action) AS Searches by user | sort - Searches</query>
          <earliest>$earliest$</earliest>
          <latest>$latest$</latest>
        </search>
        <option name="count">10</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="rowNumbers">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
  <row>
    <panel>
      <table>
        <title>Successful Logins</title>
        <search>
          <query>index=_audit user=* (action="login attempt" AND info="succeeded") | stats count(action) AS Logins by user | rename user AS User, Logins AS Successes | sort - Successes</query>
          <earliest>$earliest$</earliest>
          <latest>$latest$</latest>
        </search>
        <option name="count">10</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="rowNumbers">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
    <panel>
      <table>
        <title>Failed Logins</title>
        <search>
          <query>index=_audit user=* (action="login attempt" AND info="failed") | stats count(action) AS Logins by user | rename user AS User, Logins AS Failures | sort - Failures</query>
          <earliest>$earliest$</earliest>
          <latest>$latest$</latest>
        </search>
        <option name="count">10</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="rowNumbers">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
    <panel>
      <table>
        <title>Aged Accounts (15 days or older)</title>
        <search>
          <query>index=_audit user=* (action="login attempt" AND info="succeeded") | dedup user | eval age_days=round((now()-_time)/(60*60*24)) | where age_days &gt;= 15 | eval time=strftime(_time, "%m/%d/%Y %H:%M:%S") | table user, time, age_days | sort -age_days</query>
          <earliest>0</earliest>
          <latest>now</latest>
        </search>
        <option name="wrap">true</option>
        <option name="rowNumbers">false</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="count">10</option>
      </table>
    </panel>
  </row>
</form>

Wednesday, December 20, 2017

Spelunking your Splunk – Part III (License Usage)

By Tony Lee

In our first article of the series, Spelunking your Splunk Part I (Exploring Your Data), we looked at a clever dashboard that can be used to quickly understand the indexes, sources, sourcetypes, and hosts in any Splunk environment.  In our second article of the series, Spelunking your Splunk – Part II (Disk Usage), we provided a dashboard that can be used to monitor data distribution across multiple indexers.  In this article, we will dive into understanding your license usage.

Finding and understanding license usage information

The easiest way to query your Splunk license information is to use the query below in the search bar:

index=_internal source=*license_usage.log type=Usage

This should return raw license usage data which includes:  index, host, source, sourcetype, and number of bytes as shown in the screenshot below.

Figure 1:  License usage fields

If this search returns nothing, you may need to forward your _internal index to the search peers as described in the article below:

https://docs.splunk.com/Documentation/Splunk/7.0.0/Indexer/Forwardmasterdata

After figuring out the fields, you can get a little fancier and convert the bytes into GB, then display that data over time as shown below.  Try this both as a statistics table and as a column chart.

index=_internal source=*license_usage.log type=Usage | timechart span=1d eval(round(sum(b)/1024/1024/1024,2)) AS "Total GB Used"
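For reference, the byte-to-GB arithmetic inside the eval above, as a minimal Python sketch (the function is ours, purely illustrative):

```python
def bytes_to_gb(num_bytes, places=2):
    """Convert a raw byte count to gigabytes, matching the SPL
    round(sum(b)/1024/1024/1024, 2) expression."""
    return round(num_bytes / 1024 / 1024 / 1024, places)

# 1.5 GB expressed in bytes converts back cleanly
print(bytes_to_gb(1610612736))  # 1.5
```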

Now that you understand the basics, the sky is the limit.  You can display the license usage per index, source, sourcetype, host, etc.  Take a look at our dashboard at the end of this article and give it a try.


Figure 2:  One of our favorite dashboards for license usage

Conclusion

Splunk provides decent visibility into license usage via the Monitoring Console / DMC (Distributed Management Console), but we found this visual representation to be quite helpful for gaining additional insight.  We hope this helps you too.


Dashboard XML code

Below is the dashboard code needed to enumerate your license usage.  Feel free to modify the dashboard as needed:


<form>
  <label>License Usage</label>
  <fieldset submitButton="false" autoRun="true">
    <input type="time" searchWhenChanged="true" token="time1">
      <label></label>
      <default>
        <earliest>-7d@d</earliest>
        <latest>now</latest>
      </default>
    </input>
  </fieldset>
  <row>
    <panel>
      <chart>
        <title>Daily License Usage by Index</title>
        <search>
          <query>index=_internal source=*license_usage.log type=Usage  | rename idx AS index  | timechart span=1d eval(round(sum(b)/1024/1024/1024,2)) AS "Total GB Used" by index</query>
          <earliest>$time1.earliest$</earliest>
          <latest>$time1.latest$</latest>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.text">Date</option>
        <option name="charting.axisTitleX.visibility">visible</option>
        <option name="charting.axisTitleY.text">License Usage</option>
        <option name="charting.axisTitleY.visibility">visible</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.enabled">false</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">column</option>
        <option name="charting.chart.bubbleMaximumSize">50</option>
        <option name="charting.chart.bubbleMinimumSize">10</option>
        <option name="charting.chart.bubbleSizeBy">area</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">all</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisStart</option>
        <option name="charting.legend.placement">right</option>
        <option name="charting.axisLabelsY.majorUnit">10</option>
        <option name="charting.axisY.maximumNumber">60</option>
        <option name="charting.axisY.minimumNumber">0</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <chart>
        <title>Total Daily License  Usage</title>
        <search>
          <query>index=_internal source=*license_usage.log type=Usage  | timechart span=1d eval(round(sum(b)/1024/1024/1024,2)) AS "Total GB Used"</query>
          <earliest>$time1.earliest$</earliest>
          <latest>$time1.latest$</latest>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.text">Date</option>
        <option name="charting.axisTitleX.visibility">visible</option>
        <option name="charting.axisTitleY.text">GB</option>
        <option name="charting.axisTitleY.visibility">visible</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.enabled">0</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">column</option>
        <option name="charting.chart.bubbleMaximumSize">50</option>
        <option name="charting.chart.bubbleMinimumSize">10</option>
        <option name="charting.chart.bubbleSizeBy">area</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">all</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisStart</option>
        <option name="charting.legend.placement">right</option>
        <option name="wrap">true</option>
        <option name="rowNumbers">false</option>
        <option name="dataOverlayMode">none</option>
        <option name="charting.axisLabelsY.majorUnit">25</option>
        <option name="charting.chart.showDataLabels">all</option>
        <option name="charting.layout.splitSeries.allowIndependentYRanges">0</option>
      </chart>
    </panel>
    <panel>
      <table>
        <title>Daily License Usage by Index Stats</title>
        <search>
          <query>index=_internal source=*license_usage.log type=Usage  | rename idx AS index  | timechart span=1d eval(round(sum(b)/1024/1024/1024,2)) AS "Total GB Used" by index</query>
          <earliest>$time1.earliest$</earliest>
          <latest>$time1.latest$</latest>
        </search>
        <option name="wrap">true</option>
        <option name="rowNumbers">false</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="count">10</option>
      </table>
    </panel>
  </row>
  <row>
    <panel>
      <chart>
        <title>License Usage by Host</title>
        <search>
          <query>index=_internal source=*license_usage.log type=Usage | stats sum(b) AS bytes by h | eval GB= round(bytes/1024/1024/1024,2) | fields h GB | rename h as host | sort -GB</query>
          <earliest>$time1.earliest$</earliest>
          <latest>$time1.latest$</latest>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.visibility">visible</option>
        <option name="charting.axisTitleY.visibility">visible</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.enabled">false</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">pie</option>
        <option name="charting.chart.bubbleMaximumSize">50</option>
        <option name="charting.chart.bubbleMinimumSize">10</option>
        <option name="charting.chart.bubbleSizeBy">area</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">all</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisStart</option>
        <option name="charting.legend.placement">right</option>
      </chart>
    </panel>
    <panel>
      <chart>
        <title>License Usage by Sourcetype</title>
        <search>
          <query>index=_internal source=*license_usage.log type=Usage | stats sum(b) AS bytes by st | eval GB= round(bytes/1024/1024/1024,2) | fields st GB | rename st as Sourcetype | sort -GB</query>
          <earliest>$time1.earliest$</earliest>
          <latest>$time1.latest$</latest>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.visibility">visible</option>
        <option name="charting.axisTitleY.visibility">visible</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.enabled">false</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">pie</option>
        <option name="charting.chart.bubbleMaximumSize">50</option>
        <option name="charting.chart.bubbleMinimumSize">10</option>
        <option name="charting.chart.bubbleSizeBy">area</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">all</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisStart</option>
        <option name="charting.legend.placement">right</option>
      </chart>
    </panel>
    <panel>
      <chart>
        <title>License Usage by Source</title>
        <search>
          <query>index=_internal source=*license_usage.log type=Usage | stats sum(b) AS bytes by s | eval GB= round(bytes/1024/1024/1024,2) | fields s GB | rename s as Source | sort -GB</query>
          <earliest>$time1.earliest$</earliest>
          <latest>$time1.latest$</latest>
        </search>
        <option name="charting.chart">pie</option>
        <option name="charting.axisY2.enabled">false</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <table>
        <title>License Usage by Host Stats</title>
        <search>
          <query>index=_internal source=*license_usage.log type=Usage | stats sum(b) AS bytes by h | eval GB= round(bytes/1024/1024/1024,2) | fields h GB | rename h as host | sort -GB</query>
          <earliest>$time1.earliest$</earliest>
          <latest>$time1.latest$</latest>
        </search>
        <option name="wrap">true</option>
        <option name="rowNumbers">false</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="count">10</option>
      </table>
    </panel>
    <panel>
      <table>
        <title>License Usage by Sourcetype Stats</title>
        <search>
          <query>index=_internal source=*license_usage.log type=Usage | stats sum(b) AS bytes by st | eval GB= round(bytes/1024/1024/1024,2) | fields st GB | rename st as Sourcetype | sort -GB</query>
          <earliest>$time1.earliest$</earliest>
          <latest>$time1.latest$</latest>
        </search>
        <option name="wrap">true</option>
        <option name="rowNumbers">false</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="count">10</option>
      </table>
    </panel>
    <panel>
      <table>
        <title>License Usage by Source Stats</title>
        <search>
          <query>index=_internal source=*license_usage.log type=Usage | stats sum(b) AS bytes by s | eval GB= round(bytes/1024/1024/1024,2) | fields s GB | rename s as Source | sort -GB</query>
          <earliest>$time1.earliest$</earliest>
          <latest>$time1.latest$</latest>
        </search>
        <option name="wrap">true</option>
        <option name="rowNumbers">false</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="count">10</option>
      </table>
    </panel>
  </row>
</form>


Wednesday, November 22, 2017

Spelunking your Splunk – Part II (Disk Usage)

By Tony Lee

In our first article of the series, Spelunking your Splunk Part I (Exploring Your Data), we looked at a clever dashboard that can be used to quickly understand the indexes, sources, sourcetypes, and hosts in any Splunk environment.  Now we will examine disk usage!

You may already know that Splunk stores data on indexers. But have you ever wanted to see indexer capacity visually?  Or in a distributed environment, have you ever wondered how well the data is distributed across the indexers?  We have a solution for both and will provide the code at the bottom of the article.

Finding disk usage information

There are a number of ways to query disk utilization within Splunk.  For example, you could create a scripted input that makes a call to the operating system, but Splunk makes it even simpler than that.  Try copying and pasting this RESTful query into the search bar:

| rest splunk_server=* /services/server/status/partitions-space | eval usage = round((capacity - free) / 1024, 2) | eval capacity = round(capacity / 1024, 2) | eval compare_usage = usage." / ".capacity | eval pct_usage = round(usage / capacity * 100, 2)  | table updated, splunk_server, mount_point, fs_type, capacity, compare_usage, pct_usage | rename mount_point as "Mount Point", fs_type as "File System Type", compare_usage as "Disk Usage (GB)", capacity as "Capacity (GB)", pct_usage as "Disk Usage (%)" | sort splunk_server


This should result in something like the following screenshot, which provides the server name, mount point, file system type, drive capacity, disk usage, and percentage of disk usage. If you see results from non-indexers or from mount points unrelated to your actual indexer mount points, you can either ignore them or filter them out of the search.


Figure 1:  The search that starts it all
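To make the eval chain concrete, here is the same arithmetic as a minimal Python sketch. It assumes, as the /1024 divisions imply, that the REST endpoint reports capacity and free in MB (the function name and return keys are ours, purely illustrative):

```python
def partition_usage(capacity_mb, free_mb):
    """Mirror the eval chain above: convert MB to GB, then build
    a 'used / capacity' comparison string and a usage percentage."""
    usage_gb = round((capacity_mb - free_mb) / 1024, 2)
    capacity_gb = round(capacity_mb / 1024, 2)
    return {
        "Disk Usage (GB)": f"{usage_gb} / {capacity_gb}",
        "Disk Usage (%)": round(usage_gb / capacity_gb * 100, 2),
    }

# A 1000 GB mount with 250 GB free is 75% used
print(partition_usage(1024 * 1000, 256 * 1000))
```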

Adding a gauge

This is pretty interesting information, especially in a distributed environment, but let's take it up a notch so we can see a visual representation.  The dashboard code at the bottom of the page will give you the basic building blocks to customize gauges on your disk usage page.

Figure 2:  Adding a filler gauge for each indexer

Note:  For the gauges, you should change two values:  splunk_server to match the value in the splunk_server column and mount_point to match the value in the Mount Point column in our original search.

For environments with clustered indexers, just add a gauge for each indexer.  The end result should look something like the following:

Figure 3:  Filler gauges across the index cluster

In this example, it is very easy to see one indexer that is not properly load balanced. This dashboard can also be used to trigger alerts based on disk usage.

Conclusion

Splunk provides good visibility into indexer health via the Monitoring Console / DMC (Distributed Management Console), but we found this visual representation quite helpful for monitoring disk usage and indexer cluster load balancing.  We hope this helps you too.


Dashboard XML code

Below is the dashboard code needed to enumerate your servers and mount points and to create one gauge.  Copy the gauge code for as many gauges as needed:

<dashboard stylesheet="custom.css">
  <label>Disk Usage</label>
  <row>
    <panel>
      <chart>
        <title>Indexer-1</title>
        <search>
          <query>| rest splunk_server=* /services/server/status/partitions-space | search splunk_server=server_name_here mount_point="/" | eval usage = round((capacity - free) / 1024, 2) | eval capacity = round(capacity / 1024, 2) | eval pct_usage = round(usage / capacity * 100, 2) | table pct_usage | rename pct_usage as "Disk Usage (%)"</query>
          <earliest>0</earliest>
          <latest></latest>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.visibility">visible</option>
        <option name="charting.axisTitleY.visibility">visible</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.enabled">0</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">fillerGauge</option>
        <option name="charting.chart.bubbleMaximumSize">50</option>
        <option name="charting.chart.bubbleMinimumSize">10</option>
        <option name="charting.chart.bubbleSizeBy">area</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.rangeValues">[0,50,75,100]</option>
        <option name="charting.chart.showDataLabels">none</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">all</option>
        <option name="charting.gaugeColors">["0x84E900","0xFFE800","0xBF3030"]</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.layout.splitSeries.allowIndependentYRanges">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisMiddle</option>
        <option name="charting.legend.placement">right</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <table>
        <search>
          <query>| rest splunk_server=* /services/server/status/partitions-space | eval usage = round((capacity - free) / 1024, 2) | eval capacity = round(capacity / 1024, 2) | eval compare_usage = usage." / ".capacity | eval pct_usage = round(usage / capacity * 100, 2)  | table updated, splunk_server, mount_point, fs_type, capacity, compare_usage, pct_usage | rename mount_point as "Mount Point", fs_type as "File System Type", compare_usage as "Disk Usage (GB)", capacity as "Capacity (GB)", pct_usage as "Disk Usage (%)" | sort splunk_server</query>
          <earliest>-15m</earliest>
          <latest>now</latest>
        </search>
        <option name="count">10</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="rowNumbers">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
</dashboard>

Friday, November 10, 2017

Detecting Data Feed Issues with Splunk

by Tony Lee

As a Splunk admin, you don’t always control the devices that generate your data. As a result, you may only have control of the data once it reaches Splunk. But what happens when that data stops being sent to Splunk? How long does it take anyone to notice and how much data is lost in the meantime?

We have seen many customers struggle with monitoring and detecting data feed issues so we figured we would share some of the challenges and also a few possible methods for detecting and alerting on data feed issues.

Challenges

Before we discuss the solution, we want to highlight a few challenges to consider when trying to detect data feed issues:
1) This requires searching over a massive amount of data—thus searches in high volume environments may take a while to return.  We have you covered.
2) Complete loss of traffic may not be required—partial loss in traffic may be enough to warrant alerting.  We still have you covered.
3) There may be legitimate reductions in data (weekends) which may produce false alarms—thus the reduction percentage may need to be adjusted.  Yes, we still have you covered.

Constructing a solution

Given these challenges, we wanted to walk you through the solution we developed (skip straight to Step 4 in the final solution for the sake of time). This solution can be adapted to monitor indexes, sources, or sourcetypes—depending on what makes the most sense to you. If each of your data sources goes in its own index, then index would make the most sense. If multiple data feeds share indexes, but are referenced by different sources or sourcetypes, then it may make the most sense to monitor by source or sourcetype. To change this, just change all instances of “index” (except for the first index=*) to “sourcetype” below.  Our example syntax below shows index monitoring, but the screenshots show sourcetype monitoring--this is very flexible.

The first challenge to consider is the massive amount of data we need to search.  We could use traditional searches such as index=*, but they would never finish even in smaller environments.  For this reason we use the tstats command.  In one fairly large environment, it was able to search through 3,663,760,230 events from two days' worth of traffic in just 28.526 seconds.

The first solution we arrived at was the following:

Step 1)  View data sources and traffic:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index


Figure 1:  Viewing your traffic

Step 2)  Transpose the data:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename "row 1" AS TwoDaysAgo, "row 2" AS Yesterday


Figure 2:  Transposing the data to get the columns where we need them.

Step 3)  Alert Trigger for dead data source (Yesterday=0):

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | where Yesterday=0


Figure 3:  Detecting a complete loss in traffic.  May not be the best solution.

The problem with this solution is that it would not detect partial losses of traffic.  Even if only one event was sent, you would not receive an alert.  Thus we changed this to detect a percentage of drop off.


Final solution:  Alert for percentage of drop off (Example below alerts on reduction of 25% or greater):

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=((Yesterday/TwoDaysAgo)*100) | where PercentageDiff<75
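If you would like the alert results to list the worst drop-offs first, a minor cosmetic variant (an untested sketch) rounds the percentage and sorts ascending:

```spl
| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index | eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=round((Yesterday/TwoDaysAgo)*100,1) | where PercentageDiff<75 | sort PercentageDiff
```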

Figure 4:  Final solution to detect a percentage of decline in traffic

Caveats:

The solution above should get most people to where they need to be.  However, depending on your environment, you may need to make some adjustments, such as tuning the percentage of traffic reduction that triggers the alert; that is a simple change to the 75 in the final search.  We have included some additional caveats below that we have encountered:
1) There may be legitimate indexes with low event counts, or even naturally occurring days with 0 events.  Use “index!=<name>” after the index=* in the | tstats command to ignore these indexes.
2) Reminder:  Maybe you send multiple data feeds into a single index, but separate them out by sourcetype.  No problem: just change the searches above to use sourcetype instead of index.
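As a sketch of caveat 1, the Step 1 search below excludes two low-volume indexes (the index names here are only example placeholders; substitute the quiet indexes in your own environment):

```spl
| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* index!=history index!=summary by index, _time span=1d | timechart useother=false limit=0 span=1d count by index
```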

Conclusion

The final step is to click the “Save As” button and select “Alert”.  The alert can be scheduled to run daily and trigger when the number of results is greater than 0.  There may be a better way to monitor for data feed loss and we would love to hear it!  There is most likely a way to use the _internal logs since Splunk logs information about itself.  😉  If you have that solution, please feel free to share in the comments section.  As you know, with Splunk, there is always more than one way to solve a problem.

Sunday, October 15, 2017

Spelunking your Splunk – Part I (Explore Your Data)

By Tony Lee

Introduction

Have you ever inherited a Splunk instance that you did not build?  This means that you probably have no idea what data sources are being sent into Splunk.  You probably don’t know much about where the data is being stored.  And you certainly do not know who the highest volume hosts are within the environment.

As consultants, this is reality for nearly every engagement we encounter:  we did not build the environment, and documentation is sparse or inaccurate if we are lucky enough to have it at all.  So, what do we do?  We could run some fairly complex queries to figure this out, but many of those queries are not efficient enough to search over vast amounts of data or long periods of time—even on highly optimized environments.  All is not lost, though; we have some tricks (and a handy dashboard) that we would like to share.

Note:  Maybe you did build the environment, but you need a sanity check to make sure you don’t have any misconfigured or run-away hosts.  You will also find value here.

tstats to the rescue!

If you have not discovered or used the tstats command, we recommend that you become familiar with it, even if only at a high level.  In a nutshell, tstats can perform statistical queries on indexed fields—very, very quickly.  By default, these indexed fields are index, source, sourcetype, and host.  It just so happens that these are exactly the fields we need in order to understand the environment.  Best of all, even on an underpowered environment or one that ingests lots of data per day, these commands will still outperform the rest of your typical searches, even over long periods of time.  Ok, time to answer some questions!
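As a quick illustration (a sketch; your exact counts will differ), compare a traditional raw-event search against its tstats equivalent.  The first must read every event off disk; the second should return the same counts from the index files alone, typically orders of magnitude faster:

```spl
index=* earliest=-1d@d latest=@d | stats count by sourcetype

| tstats count where index=* earliest=-1d@d latest=@d by sourcetype
```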

Common questions

These are common questions we ask during consulting engagements, and this is how we get answers FAST.  Most of the time, 7 days' worth of data is enough to give us a good understanding of the environment and weed out anomalies.

How many events are we ingesting per day?
| tstats count where index=* by _time span=1d

Figure 1:  Events per day


What are my most active indexes (events per day)?
| tstats prestats=t count where index=* by index, _time span=1d | timechart span=1d count by index

Figure 2:  Most active indexes


What are my most active sourcetypes (events per day)?
| tstats prestats=t count where index=* by sourcetype, _time span=1d | timechart span=1d count by sourcetype

Figure 3:  Most active sourcetypes


What are my most active sources (events per day)?
| tstats prestats=t count where index=* by source, _time span=1d | timechart span=1d count by source

Figure 4:  Most active sources


What is the noisiest host (events per day)?
| tstats prestats=t count where index=* by host, _time span=1d | timechart span=1d count by host

Figure 5:  Most active hosts


Dashboard Code

To make things even easier for you, try the dashboard below (code at the bottom), which combines the searches we provided above and, as a bonus, adds filters to specify the index and time range.

Figure 6:  Data Explorer dashboard

Conclusion

Splunk is a very powerful search platform but it can grow to be a complicated beast--especially over time.  Feel free to use the searches and dashboard provided to regain control and really understand your environment.  This will allow you to trim the waste and regain efficiency.  Happy Splunking.


Dashboard XML code is below:


<form>
  <label>Data Explorer</label>
  <fieldset submitButton="true" autoRun="true">
    <input type="time" token="time">
      <label>Time Range Selector</label>
      <default>
        <earliest>-7d@h</earliest>
        <latest>now</latest>
      </default>
    </input>
    <input type="text" token="index">
      <label>Index</label>
      <default>*</default>
      <initialValue>*</initialValue>
    </input>
  </fieldset>
  <row>
    <panel>
      <chart>
        <title>Most Active Indexes</title>
        <search>
          <query>| tstats prestats=t count where index=$index$ by index, _time span=1d | timechart span=1d count by index</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <chart>
        <title>Most Active Sourcetypes</title>
        <search>
          <query>| tstats prestats=t count where index=$index$ by sourcetype, _time span=1d | timechart span=1d count by sourcetype</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <chart>
        <title>Most Active Sources</title>
        <search>
          <query>| tstats prestats=t count where index=$index$ by source, _time span=1d | timechart span=1d count by source</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <chart>
        <title>Most Active Hosts</title>
        <search>
          <query>| tstats prestats=t count where index=$index$ by host, _time span=1d | timechart span=1d count by host</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
          <sampleRatio>1</sampleRatio>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
      </chart>
    </panel>
  </row>
</form>