Previously, I wrote about the need for a more equitable way to price big data software. To help buyers realize the full potential of big data, the industry needs a mechanism that divorces the cost of software from the cost of scaling hardware. In this post, I consider the factors for pricing an example application. Most readers will be familiar with Hadoop, so the application will be based on the Hadoop stack, and I’ll use a canonical Hadoop use case: log file analytics.
The use case for this example application is to give select users access to a history of log files collected from a variety of internal machines. Specifically, this use case generates 500GB of log file data per day, and the retention requirement is six years. To give users fast search access to the log data, the application must index it. Users want to see recent logs, so the indexing must be as close to real time as possible. The users also want charting to look at trends within the log files. Finally, we must ensure that data can be seen only by the people who have been given permission to access it.
Next, we look at what the organization has in place. They do not have a current solution but are considering a range of open source tools that meet these requirements, such as ElasticSearch and SolrCloud. They may have an existing Hadoop cluster available as a shared service, priced on a total-cost-per-byte basis. They likely have skilled administrators and developers, but little or no Hadoop expertise.
Note that figuring out total per-byte pricing is a bit tricky, since the cost factors vary widely among organizations. The total cost depends on variables like organizational efficiencies, enterprise discounts, excess capacity, and many others. Nonetheless, the total cost per byte is an essential part of this application pricing equation. For this example we’ll assume the application requires the capacity of four typical Hadoop nodes allocated out of a larger cluster. The sweet spot for Hadoop node pricing is around $5,000 per node, and leasing that hardware typically runs about $200 per month per node. Adding power, cooling, operations, networking, and overhead, we pessimistically round this up to $500 per month per node, for a total of $2,000 per month in hardware infrastructure.
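The infrastructure arithmetic above can be sketched as a quick calculation; the node count and per-node figure are the assumptions stated in this example, not universal constants:

```python
# Monthly hardware infrastructure cost for the example application.
# Assumptions from the text: 4 Hadoop nodes allocated from a shared
# cluster, pessimistically rounded up to $500/month per node all-in
# (lease, power, cooling, operations, networking, overhead).
NODES = 4
COST_PER_NODE_PER_MONTH = 500  # USD, rounded up from a ~$200/month lease

monthly_infrastructure = NODES * COST_PER_NODE_PER_MONTH
print(f"Monthly infrastructure: ${monthly_infrastructure:,}")  # $2,000
```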
With this much information, we can calculate the cost to build the application internally with some short-term outside assistance. Based on my experience, this project can be delivered by a team of four within four months. Ongoing maintenance requires approximately one half-time engineer.
Development costs are as follows:
4 senior IT staff @ $150,000 per year x 1.5 benefits multiplier x 4 months = $300,000
Hadoop administrator and developer training x 4 = ~$25,000
Expert services = 4 weeks (160 hours) @ $300/hr = $48,000
This application requires a total development cost of approximately $373,000.
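The one-time development figures above add up as follows; the 40-hour work week used to expand "4 weeks" into 160 hours is an assumption:

```python
# One-time development cost for the internally built application,
# using the figures from the cost breakdown above.
STAFF = 4
SALARY = 150_000           # USD per year, per senior IT staff member
BENEFITS_MULTIPLIER = 1.5
PROJECT_MONTHS = 4

staff_cost = STAFF * SALARY * BENEFITS_MULTIPLIER * (PROJECT_MONTHS / 12)
training = 25_000                 # Hadoop admin/developer training x 4
expert_services = 4 * 40 * 300    # 4 weeks x 40 hrs/week x $300/hr

total_development = staff_cost + training + expert_services
print(f"Total development: ${total_development:,.0f}")  # $373,000
```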
Ongoing costs are as follows:
Hardware infrastructure (four Hadoop nodes for the 500GB-per-day workload) at an internal total cost of $2,000/month
1/2 senior IT staff (the half-time maintenance engineer) = $9,375/mo
4 hours per month of ongoing support services at $150/hour = $600 per month
Total ongoing monthly costs = $11,975, or $143,700 per year (38.5% of the initial cost).
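The ongoing-cost totals can be checked the same way; the half-time engineer figure below is derived from the $150,000 salary and 1.5x benefits multiplier used earlier:

```python
# Ongoing monthly and annual costs for the internally built application.
infrastructure = 2_000                        # 4 nodes at $500/month
half_time_engineer = 150_000 * 1.5 / 12 / 2   # half of a loaded salary: $9,375/mo
support_services = 4 * 150                    # 4 hrs/month at $150/hr: $600/mo

monthly = infrastructure + half_time_engineer + support_services
annual = monthly * 12
print(f"Monthly: ${monthly:,.0f}, annual: ${annual:,.0f}")
# Monthly: $11,975, annual: $143,700
```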
So the all-in cost to create such an application would be roughly $373,000, plus eight months of ongoing support for the remainder of the first year ($95,800), and then $143,700 on an ongoing annual basis. While ongoing operational costs will decrease with time as storage prices continue to drop and open source software improves, experience has shown that data grows just as fast, if not faster.
Now that we’ve priced the cost of building an application internally, we can compare it to a proprietary offering. For our example, we’ll compare to Splunk, far and away the most popular log analytics and operational intelligence application. Splunk lists at $6,000 plus maintenance for a perpetual license to index 500MB per day. To get a sense of the order of magnitude, an undiscounted license for 500GB per day, 1,024 times that volume, is $6,144,000. Assuming 20% maintenance, the ongoing annual cost is $1,228,800.
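Scaling the list price works out as follows; note the binary conversion (1GB = 1024MB) implied by the text’s $6,144,000 figure, which is an assumption about how the original numbers were derived:

```python
# Scaling Splunk's list price from 500MB/day to 500GB/day,
# using the figures quoted in the text.
BASE_LICENSE = 6_000       # USD, perpetual license for 500MB/day (list price)
SCALE = 1024               # 500GB / 500MB, treating 1GB as 1024MB
MAINTENANCE_RATE = 0.20    # assumed 20% annual maintenance

license_500gb = BASE_LICENSE * SCALE
maintenance = license_500gb * MAINTENANCE_RATE
print(f"License: ${license_500gb:,}, annual maintenance: ${maintenance:,.0f}")
# License: $6,144,000, annual maintenance: $1,228,800
```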
Obviously, tools like Splunk are rarely sold at full list price, and they come with far more sophisticated management and analytics tools than the open source alternatives, but that gap will only shrink over time. Even at a 90% discount, the upfront cost is still $614,400. Splunk certainly has a “time to value” advantage: with the open source solution, time to value is measured in months, while with Splunk it’s measured in days, assuming the operations team can install the Splunk agents relatively easily and the default dashboards are adequate. But again, this is an advantage that will decline over time.
Using a basic example of a well-known application developed in house and compared to a proprietary offering, it’s easy to see why the industry is struggling with pricing models. What makes sense at small scale (it’s hard to beat $6,000 with internal development teams) becomes a much more difficult decision at larger scale. Organizations now need to consider up front whether the features in a vendor-provided solution are worth the price difference, whether the time to value is compelling enough to forgo the eventual savings, whether there is additional value in training employees to build a solution, and even whether it continues to be a business advantage to keep data in house.
In practice, we see that organizations implement some combination of approaches. They buy just enough of a vendor’s solution to tide them over until they can build an internal implementation; they hire one or two experienced engineers to lead a wide variety of application development efforts; and they start building new applications in the cloud, where it’s easier to use low-cost SaaS services and share in the economies of scale that until now have been reserved primarily for the vendors.
From our perspective, the future direction for pricing in big data is clear. Buyers and vendors must assess the true cost of building and maintaining sufficiently full-featured applications, and use the cost of alternative development as a guideline for pricing. Pricing schemes that scale with data size will be retired as buyers realize they can afford to use more of their data with an alternative approach.