It was great to catch up with Alex Raitz, a former co-worker at the largest restaurant company in the world. Alex and I worked in the Information Security department there. Alex installed Splunk for some log file monitoring we needed to do, and he got so adept at scripting with the tool and interacting on the Splunk forums that Splunk offered him a job. Alex then moved from Louisville, Kentucky to San Francisco, California.
Splunk recently announced XenServer support so I asked Alex to answer some questions about Splunk and how it can really help with monitoring virtual servers and hosts.
Q: Splunk is often thought of as just a security tool. Why do IT server administrators want this for their virtualization platform?
A: It is funny that Splunk has that image! Application and server availability is actually our predominant use case, with security and compliance a good way behind.
Splunk has a number of advantages when used with virtualization platforms. The biggest advantage that we have over traditional database-oriented tools is that we can take in any unstructured data with minimal configuration, including multiline outputs from tools or scripts. When you have a number of different operating systems on multiple virtualization platforms, the ability to index new types of data without extending a database schema or spending a day on configuration is huge.
Another huge advantage is that our licensing model is built on total volume indexed rather than number of hosts or interfaces. Thus, in the typical virtual environment where VMs are allocated generously, it is significantly more cost effective to use Splunk than tools that are licensed per host.
Q: How does it function on the XenServer Platform?
A: Our integration with XenServer is a great example of how Splunk can add value. Using what we call a scripted input, we connect via Python to the Xen API, grab configuration, metrics, status, process, audit, and other IT data, and consume this data into our index.
Once we index this data, we provide some pre-configured saved searches which can function as ad hoc investigations, alerts, and/or reports on the virtual environment. For example, the search “XSM- Metrics- Free memory by XenServer” (included in the Splunk for Citrix XenServer Management application on www.splunkbase.com) provides metrics for available memory on each guest running on a given host. This gives server administrators great visibility into Xen performance and availability information, which is not always highly visible in the native tools.
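As a rough illustration of the scripted input idea (this is not the script shipped with the XenServer application), a script only has to print events to standard output for them to be indexed. The sketch below uses a hypothetical fetch_host_metrics function with made-up values standing in for a real Xen API query, and the field names are invented for the example.

```python
import time


def format_event(record, timestamp=None):
    """Render one record as a timestamped key="value" line."""
    if timestamp is None:
        timestamp = time.time()
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(timestamp))
    # Sort the keys so the output is stable from run to run.
    pairs = " ".join('%s="%s"' % (key, record[key]) for key in sorted(record))
    return "%s %s" % (stamp, pairs)


def fetch_host_metrics():
    """Hypothetical stand-in for a Xen API query; the values are made up."""
    return [
        {"host": "xen01", "memory_free_mb": 2048, "memory_total_mb": 16384},
        {"host": "xen02", "memory_free_mb": 512, "memory_total_mb": 16384},
    ]


if __name__ == "__main__":
    # One event per line on standard output.
    for record in fetch_host_metrics():
        print(format_event(record))
```

Writing key="value" pairs is a design choice for the sketch: it keeps each event self-describing, so no schema has to be declared before the data is searched.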
Q: Can Splunk help with capacity planning a Physical to Virtual Migration Project?
A: Splunk can help with capacity planning by providing detailed data on resource utilization in the physical environment. For example, let’s suppose you intend to P2V several Linux application servers and you need to size the VMs appropriately. Using the Splunk for UNIX application available on www.splunkbase.com, you can capture storage metrics from iostat, df, and du, network metrics from netstat and ifconfig, and CPU and memory metrics from ps, top, free, and vmstat. Capturing this data over several days or weeks will enable you to perform a statistical analysis of the performance of these systems, including the identification of CPU and memory spikes, heavy network or disk utilization, as well as the underlying processes and conditions (time of day, day of week, users logged in) that caused them.
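To make the statistical analysis above concrete, here is a minimal sketch of the kind of summary one might compute from captured samples before sizing a VM. It runs outside of Splunk on a plain list of numbers; the sample values are made up, and the choice of mean, 95th percentile, and peak is an assumption about what is useful, not something prescribed by the Splunk for UNIX application.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def summarize(samples):
    """Return the mean, 95th percentile, and peak of a series of samples."""
    return {
        "mean": statistics.mean(samples),
        "p95": percentile(samples, 95),
        "peak": max(samples),
    }


if __name__ == "__main__":
    # Made-up CPU utilization samples (percent) for one physical server.
    cpu = [12, 15, 11, 80, 14, 13, 95, 16, 12, 10]
    print(summarize(cpu))
```

Sizing to the mean alone would hide the two spikes in the sample data, which is why the sketch also reports a high percentile and the peak.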
Q: Is there a virtual appliance for Splunk?
A: There is not currently an official virtual appliance for Splunk. Although creating and supporting an official appliance has been on our to-do list for some time, I think that this is a good opportunity for us to reach out to the experts in the community-at-large for assistance. Although Splunk is technically closed-source, we do make every attempt to be community and developer oriented.
Q: As you know, many organizations want to put XenServers in the DMZ and have one NIC interface on the external network and one on the internal network. This configuration is usually not approved by the security department. Can Splunk help prove that this configuration can be secure or prove the security department correct?
A: What a loaded question! It feels very familiar for some reason…
(Editor's note: Alex and I had to deal with this very issue when we worked together.) Splunk is probably limited here by the data that is fed to it. If IDS and session data were being indexed by Splunk at both interfaces, this would facilitate identifying any abnormal activity traversing the XenServer.
As far as security of the scenario you described, I am not aware of any vulnerabilities in VMware or XenServer that would allow a malicious party to bridge the two interfaces to gain access to a protected network. On the other hand, I do not have any firsthand knowledge of any Splunk customers that have deployed Xen or VMware in a dual-homed, DMZ-and-Internal configuration such as this.
Q: Does Splunk work with other virtualization platforms, such as VMware Infrastructure 3.5?
A: We are actively working on developing a VMware Infrastructure application, but we do not have anything at the moment. With the Xen application, we worked with Citrix to understand the API and the types of data available through it. It didn’t hurt that we have several large customers running XenServer in their infrastructures.
We need to perform the same exercise with the VMware team. The VI API is quite robust, and the SDK provides some good example code to work with, but there are only so many hours in the day! It is also worth noting that we run VMware internally to replicate customer environments, so we definitely have our own self-serving motivation to get this moving.
Q: I assume that Splunk will allow me to easily view multiple log files from the XenSource host and each VM via a single GUI. It also allows me to configure and receive SNMP traps from the VMs and their applications. Is this correct?
A: All of your assumptions are correct. Using our lightweight forwarder configuration, Splunk can be installed on each host to collect IT data (logs, output from tools or scripts, file system changes, and so on) and securely and reliably forward it to centralized Splunk indexing servers. As described above, we can also collect data from the XenServer hosts themselves via the XenAPI. Splunk can also collect traps from the hosts and guests to supplement the information collected from the lightweight forwarders and the XenAPI. All of this data can be searched, viewed, manipulated, and reported on within SplunkWeb.
Q: Does it allow me to determine if Service Level Agreements are being met?
A: One of our key use cases is the reduction of mean time to resolution to meet service level agreements. Our customers find that the more IT data they introduce into Splunk, the more visibility they have into their infrastructure and the common problems that all enterprises face.
For example, one of our customers has leveraged Splunk’s “form search” (http://www.splunk.com/doc/latest/admin/FormSearch) to facilitate first call resolutions at their help desk. The Splunk administrators have crafted form searches for the most common operations issues (failed mail delivery, proxy availability, VPN connectivity) so that the first level technicians can perform basic troubleshooting by inserting the customer’s information (email or IP address) into a simple form. This approach allows the first level personnel to leverage the power of Splunk while masking some of the complexities of the search language from them. This has enabled them to resolve more issues on the first call as well as to reduce the number of tickets that were incorrectly triaged by the help desk.
As far as determining that SLAs are being met, we are using Splunk to do that internally. Our VP of support leverages our IMAP application, available on www.splunkbase.com, to track the time that support cases are opened and resolved as well as other fields such as severity, customer account, support technician assigned to the case, and so on. Using Splunk’s built-in reporting functionality, he is able to provide detailed metrics on the efficacy of support in meeting our severity-based SLAs, the number of cases that have broken severity by customer or technician, as well as the number of opened and closed cases per quarter and the quarter-by-quarter differential.
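The severity-based report described above boils down to comparing each case's time to resolution against a target for its severity. The sketch below shows that logic on its own, outside of Splunk. The resolution targets in SLA_HOURS, the field names, and the sample cases are all hypothetical; the interview does not state Splunk's actual SLA targets.

```python
from datetime import datetime

# Hypothetical resolution targets in hours, keyed by severity.
SLA_HOURS = {1: 4, 2: 24, 3: 72}


def hours_open(case):
    """Hours between the opened and resolved timestamps of a case."""
    delta = case["resolved"] - case["opened"]
    return delta.total_seconds() / 3600.0


def breached(case, targets=SLA_HOURS):
    """True when a case took longer to resolve than its severity target."""
    return hours_open(case) > targets[case["severity"]]


def breach_counts(cases, key):
    """Count breached cases grouped by a field such as 'customer'."""
    counts = {}
    for case in cases:
        if breached(case):
            counts[case[key]] = counts.get(case[key], 0) + 1
    return counts


if __name__ == "__main__":
    # Made-up support cases for illustration only.
    cases = [
        {"customer": "acme", "severity": 1,
         "opened": datetime(2008, 8, 1, 9, 0),
         "resolved": datetime(2008, 8, 1, 15, 0)},
        {"customer": "globex", "severity": 3,
         "opened": datetime(2008, 8, 1, 9, 0),
         "resolved": datetime(2008, 8, 5, 9, 0)},
    ]
    print(breach_counts(cases, "customer"))
```

Passing "technician" instead of "customer" as the key would give the per-technician view, provided the case records carry such a field.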
Q: Does it help me determine if the host, VMs, and applications are up to date on patches?
A: Since Splunk can index any unstructured data, patch level and package inventory is one of our primary use cases. For example, we can take in system audit data via /var/log/audit and/or /var/log/secure, package inventory data via yast, yum, or RPM, host patch level via XenAPI, and guest patch level via uname, dmesg, etc. For applications, we can tail install logs to monitor upgrades; we can also monitor the application directory for changes to binary and configuration files.
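One way to picture the package inventory use case is as a comparison of two inventory snapshots. The sketch below assumes each snapshot is plain text with one "name version" pair per line (for example, produced by an rpm query with a custom query format); that input format, the function names, and the sample data are assumptions made for illustration, not part of any Splunk application.

```python
def parse_inventory(text):
    """Parse 'name version' lines into a {name: version} dictionary."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines rather than guessing at them.
        if len(parts) == 2:
            packages[parts[0]] = parts[1]
    return packages


def diff_inventory(old, new):
    """Report packages added, removed, or changed between two snapshots."""
    old_names, new_names = set(old), set(new)
    return {
        "added": sorted(new_names - old_names),
        "removed": sorted(old_names - new_names),
        "changed": sorted(
            name for name in old_names & new_names if old[name] != new[name]
        ),
    }


if __name__ == "__main__":
    # Made-up snapshots taken before and after a patch window.
    before = parse_inventory("openssl 0.9.8b-8\nbash 3.2-21\ntelnet 0.17-39\n")
    after = parse_inventory("openssl 0.9.8b-10\nbash 3.2-21\nrsync 2.6.8-3\n")
    print(diff_inventory(before, after))
```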
Q: How scalable is Splunk?
A: Splunk is quite scalable, thanks in large part to our awesome development team. Though it may seem like there is a new Splunk feature once a week, 80-90% of the changes in a given release are implemented to improve performance and scalability.
Because Splunk runs on many hardware platforms and operating systems, we have to qualify any scalability and performance claims. As a rule of thumb, a beefy box (8 core, 16 GB RAM, Fiber or SAS RAID 10 storage with several spindles) running a common (RHEL or SLES) 64-bit distribution will be able to index 200 GB per day while providing adequate resources for 5-10 users to search, alert, and report.
Also of importance are the type of data that we are indexing and the method by which we take the data in. For example, it is more resource intensive for Splunk to index log4j multiline data than standard syslog data. In most cases, Splunk can read data from a socket faster than it can from disk. The largest customers that we have are indexing more than a TB per day across several index servers in several data centers and are able to search them all from any SplunkWeb using our distributed search feature.
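As a back-of-the-envelope illustration of that rule of thumb, the sketch below estimates how many index servers of the class described would be needed for a given daily volume. It treats 200 GB per day as a fixed per-server capacity, which is a simplifying assumption: real capacity depends on the data type, the input method, and the search load, so the result is only a starting point.

```python
import math

# Rule-of-thumb daily indexing capacity of one server, from the text above.
GB_PER_INDEXER_PER_DAY = 200


def indexers_needed(daily_gb, per_indexer_gb=GB_PER_INDEXER_PER_DAY):
    """Smallest number of index servers that covers the daily volume."""
    if daily_gb <= 0:
        return 0
    return math.ceil(daily_gb / per_indexer_gb)


if __name__ == "__main__":
    # 1 TB per day, taken here as 1024 GB.
    print(indexers_needed(1024))
```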
To achieve that kind of scalability, there is a significant advantage to be gained in leveraging Splunk professional services for the requirement gathering and architecture phase through the installation, configuration, and tuning phase. I also can’t stress enough the importance of education – in addition to our web-based training, we offer virtual and physical classroom training.
We held our first developer’s boot camp on August 4th (http://www.splunk.com/index.php/bootcamp08). This is a free boot camp for anyone who can attend, and it includes a free dinner and a Giants game as well. More will come in the future.
Q: Other than cost, what are the primary differences in the free version and the enterprise version?
A: This article provides a good overview of the differences between free and enterprise:
http://www.splunk.com/article/2018
Thanks, Alex, for the great information. More to come in the future.