Wednesday, November 22, 2017

Spelunking your Splunk – Part II (Disk Usage)

By Tony Lee

In our first article of the series, Spelunking your Splunk Part I (Exploring Your Data), we looked at a clever dashboard that can be used to quickly understand the indexes, sources, sourcetypes, and hosts in any Splunk environment.  Now we will examine disk usage!

You may know this already--Splunk stores data on indexers. But have you ever wanted to visually see indexer capacity?  Or in a distributed environment, have you ever wondered how well the data is distributed across the indexers?  We have a solution for both and will provide the code at the bottom of the article.

Finding disk usage information

There are a number of ways to query disk utilization within Splunk.  For example, you could create a scripted input that makes a call to the operating system, but Splunk makes it even simpler than that.  Try copying and pasting this RESTful query into the search bar:

| rest splunk_server=* /services/server/status/partitions-space | eval usage = round((capacity - free) / 1024, 2) | eval capacity = round(capacity / 1024, 2) | eval compare_usage = usage." / ".capacity | eval pct_usage = round(usage / capacity * 100, 2)  | table updated, splunk_server, mount_point, fs_type, capacity, compare_usage, pct_usage | rename mount_point as "Mount Point", fs_type as "File System Type", compare_usage as "Disk Usage (GB)", capacity as "Capacity (GB)", pct_usage as "Disk Usage (%)" | sort splunk_server


This should produce something like the following screenshot, which shows the server name, mount point, file system type, drive capacity, disk usage, and percentage of disk usage. The search divides the capacity and free values by 1024 to report them in GB. If you receive information from non-indexers or from mount points that are unrelated to your indexer storage, you can either ignore them or filter them out of the search.


Figure 1:  The search that starts it all
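
If you would rather filter out unrelated servers and mount points than ignore them, one option is to add a search clause right after the REST call, the same way the gauge query in the dashboard code does.  This is only a sketch: idx* and /opt/splunk* are placeholder patterns, not values from a real environment, so replace them with what appears in your own splunk_server and Mount Point columns.

```spl
| rest splunk_server=* /services/server/status/partitions-space
| search splunk_server=idx* mount_point="/opt/splunk*"
| eval usage = round((capacity - free) / 1024, 2)
| eval capacity = round(capacity / 1024, 2)
| eval pct_usage = round(usage / capacity * 100, 2)
| table splunk_server, mount_point, capacity, usage, pct_usage
```

The filter runs before the eval steps, so only the rows you care about are calculated and displayed.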

Adding a gauge

This is pretty interesting information, especially in a distributed environment, but let's take it up a notch so we can see a visual representation.  The dashboard code at the bottom of the page will give you the basic building blocks to customize gauges on your disk usage page.

Figure 2:  Adding a filler gauge for each indexer

Note:  For the gauges, you should change two values:  splunk_server to match the value in the splunk_server column and mount_point to match the value in the Mount Point column in our original search.

For environments with clustered indexers, just add a gauge for each indexer.  The end result should look something like the following:

Figure 3:  Filler gauges across the index cluster

In this example, it is very easy to see one indexer that is not properly load balanced. This dashboard can also be used to trigger alerts based on disk usage.
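
As a rough sketch of the alerting idea, the search below returns only the mount points above a usage threshold.  The value 80 is an arbitrary example threshold, not a recommendation, so pick one that fits your environment.

```spl
| rest splunk_server=* /services/server/status/partitions-space
| eval pct_usage = round((capacity - free) / capacity * 100, 2)
| where pct_usage > 80
| table splunk_server, mount_point, pct_usage
| sort - pct_usage
```

Saved as an alert that triggers when the number of results is greater than 0, this would fire whenever any listed partition crosses the threshold.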

Conclusion

Splunk provides good visibility into indexer health via the Monitoring Console / DMC (Distributed Management Console), but we found this visual representation quite helpful for monitoring disk usage and indexer cluster load balancing.  We hope this helps you too.


Dashboard XML code

Below is the dashboard code needed to enumerate your servers and mount points and to create one gauge.  Copy the gauge panel for as many gauges as you need:

<dashboard stylesheet="custom.css">
  <label>Disk Usage</label>
  <row>
    <panel>
      <chart>
        <title>Indexer-1</title>
        <search>
          <query>| rest splunk_server=* /services/server/status/partitions-space | search splunk_server=server_name_here mount_point="/" | eval usage = round((capacity - free) / 1024, 2) | eval capacity = round(capacity / 1024, 2) | eval pct_usage = round(usage / capacity * 100, 2) | table pct_usage | rename pct_usage as "Disk Usage (%)"</query>
          <earliest>0</earliest>
          <latest></latest>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.visibility">visible</option>
        <option name="charting.axisTitleY.visibility">visible</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.enabled">0</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">fillerGauge</option>
        <option name="charting.chart.bubbleMaximumSize">50</option>
        <option name="charting.chart.bubbleMinimumSize">10</option>
        <option name="charting.chart.bubbleSizeBy">area</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.rangeValues">[0,50,75,100]</option>
        <option name="charting.chart.showDataLabels">none</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">all</option>
        <option name="charting.gaugeColors">["0x84E900","0xFFE800","0xBF3030"]</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.layout.splitSeries.allowIndependentYRanges">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisMiddle</option>
        <option name="charting.legend.placement">right</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <table>
        <search>
          <query>| rest splunk_server=* /services/server/status/partitions-space | eval usage = round((capacity - free) / 1024, 2) | eval capacity = round(capacity / 1024, 2) | eval compare_usage = usage." / ".capacity | eval pct_usage = round(usage / capacity * 100, 2)  | table updated, splunk_server, mount_point, fs_type, capacity, compare_usage, pct_usage | rename mount_point as "Mount Point", fs_type as "File System Type", compare_usage as "Disk Usage (GB)", capacity as "Capacity (GB)", pct_usage as "Disk Usage (%)" | sort splunk_server</query>
          <earliest>-15m</earliest>
          <latest>now</latest>
        </search>
        <option name="count">10</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">cell</option>
        <option name="rowNumbers">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
</dashboard>

Friday, November 10, 2017

Detecting Data Feed Issues with Splunk

by Tony Lee

As a Splunk admin, you don’t always control the devices that generate your data. As a result, you may only have control of the data once it reaches Splunk. But what happens when that data stops being sent to Splunk? How long does it take anyone to notice and how much data is lost in the meantime?

We have seen many customers struggle with monitoring and detecting data feed issues so we figured we would share some of the challenges and also a few possible methods for detecting and alerting on data feed issues.

Challenges

Before we discuss the solution, we want to highlight a few challenges to consider when trying to detect data feed issues:
1) This requires searching over a massive amount of data—thus searches in high volume environments may take a while to return.  We have you covered.
2) Complete loss of traffic may not be required—partial loss in traffic may be enough to warrant alerting.  We still have you covered.
3) There may be legitimate reductions in data (weekends) which may produce false alarms—thus the reduction percentage may need to be adjusted.  Yes, we still have you covered.

Constructing a solution

Given these challenges, we wanted to walk you through the solution we developed (skip straight to the final solution below if you are short on time). This solution can be adapted to monitor indexes, sources, or sourcetypes, depending on what makes the most sense to you. If each of your data sources goes into its own index, then monitoring by index makes the most sense. If multiple data feeds share indexes but are distinguished by source or sourcetype, then it may make more sense to monitor by source or sourcetype. To make this change, just replace all instances of “index” (except for the first index=*) with “sourcetype” below.  Our example syntax below shows index monitoring, but the screenshots show sourcetype monitoring--this is very flexible.

The first challenge to consider in our searches is the massive amount of data we need to search.  We could use traditional searches such as index=*, but they would take far too long to finish, even in smaller environments.  For this reason we use the tstats command.  In one fairly large environment, it was able to search through 3,663,760,230 events from two days' worth of traffic in just 28.526 seconds.
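
To make the difference concrete, here is a simplified comparison (a sketch, not the search we use later in this article).  A traditional search has to retrieve the raw events before it can count them:

```spl
index=* earliest=-2d@d latest=@d
| stats count by index
```

The tstats version asks for the same per-index count, but it works from the indexed fields rather than the raw events, which is why it tends to return so much faster:

```spl
| tstats count where index=* earliest=-2d@d latest=@d by index
```

Running both over a short time range in your own environment is an easy way to see how large the gap is for your data.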

The first solution we arrived at was the following:

Step 1)  View data sources and traffic:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index


Figure 1:  Viewing your traffic

Step 2)  Transpose the data:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename "row 1" AS TwoDaysAgo, "row 2" AS Yesterday


Figure 2:  Transposing the data to get the columns where we need them.

Step 3)  Alert Trigger for dead data source (Yesterday=0):

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | where Yesterday=0


The problem with this solution is that it does not detect partial losses of traffic.  If even one event was sent, you would not receive an alert.  Thus we changed the search to detect a percentage of drop-off.

Figure 3:  Detecting a complete loss in traffic.  May not be the best solution.


Final solution:  Alert for percentage of drop-off (the example below alerts on a reduction of more than 25%):

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=((Yesterday/TwoDaysAgo)*100) | where PercentageDiff<75

Figure 4:  Final solution to detect a percentage of decline in traffic
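
To see how the comparison behaves without touching real data, you can feed it made-up numbers.  In this sketch the DataSource name and both counts are invented for illustration: yesterday's volume is 70% of the volume from two days ago, so PercentageDiff evaluates to 70 and the row passes the filter.

```spl
| makeresults
| eval DataSource="example_index", TwoDaysAgo=1000, Yesterday=700
| eval PercentageDiff=((Yesterday/TwoDaysAgo)*100)
| where PercentageDiff<75
```

Change Yesterday to 800 and the search returns nothing, because 80 is not below the threshold.  Note that PercentageDiff is really yesterday's volume expressed as a percentage of the previous day's volume, so lower values mean a bigger drop.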

Caveats:

The solution above should get most people to where they need to be.  However, depending on your environment, you may need to make some adjustments—such as the percentage of traffic reduction, but that is a simple change of the 75 above.  We have included some additional caveats below that we have encountered:
1) There may be legitimate indexes with low event counts, or that naturally receive 0 events on some days.  Add “index!=<name>” after the index=* in the | tstats command to ignore these indexes.
2) Reminder:  You may send multiple data feeds into a single index and separate them by sourcetype instead.  No problem, just change the searches above to use sourcetype instead of index.
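
Putting both caveats together, a sourcetype-based version of the final search that also skips one index might look like the sketch below.  The name low_volume_index_here is a placeholder for whichever index you want to ignore.

```spl
| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* index!=low_volume_index_here by sourcetype, _time span=1d
| timechart useother=false limit=0 span=1d count by sourcetype
| eval _time=strftime(_time,"%Y-%m-%d")
| transpose
| rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday
| eval PercentageDiff=((Yesterday/TwoDaysAgo)*100)
| where PercentageDiff<75
```

The first index=* stays as it is, and every other reference to index in the original search becomes sourcetype.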

Conclusion

The final step is to click the “Save As” button and select “Alert”.  The alert could be scheduled to run daily and trigger when the number of results is greater than 0.  There may be a better way to monitor for data feed loss and we would love to hear it!  There is most likely a way to use _internal logs since Splunk logs information about itself.  😉  If you have that solution, please feel free to share in the comments section.  As you know, with Splunk, there is always more than one way to solve a problem.