Editor’s Note: Does your brain switch off when you hear industryspeak words like “innovation,” “transformation,” “leading edge,” “disruptive,” and “paradigm shift”? Go on, go ahead and admit it. Ours do, too. That’s why we’re launching the future.ready() series—consisting of blogs, podcasts, webinars, and Twitter chats— with content created by developers, for developers. Nothing fluffy, nothing buzzy. With the future.ready() series, we aim to equip you with tools and knowledge that you can use—not just talk and tweet about.
For the first edition, I’ve invited Frank Ketelaars, an expert in high volume data space, to walk us through seven things to check off when starting a big data development project.
-Michalina Kiera, SoftLayer EMEA senior marketing manager
This year, big data moves from a water cooler discussion to the to-do list. Gartner estimates that more than 75 percent of companies are investing or planning to invest in big data in the next two years.
I have worked on multiple high volume projects in industries that include banking, telecommunications, manufacturing, life sciences, and government, and in roles including architect, big data developer, and streaming analytics specialist. Based on my experience, here’s a checklist I put together that should give developers a good start. Did I miss anything? Join me on the Twitter chat or webinar to share your experience, ask questions, and discuss further. (See details below.)
1. Team up with a person who has a budget and a problem you can solve.
For a successful big data project, you need to solve a business problem that’s keeping somebody awake at night. If there isn’t a business problem and a business owner—ideally one with a budget— your project won’t get implemented. Experimentation is important when learning any new technology. But before you invest a lot of time in your big data platform, find your sponsor. To do so, you’ll need to talk to everyone, including IT, business users, and management. Remember that the technical advantages of analytics at scale might not immediately translate into business value.
2. Get your systems ready to collect the data.
With additional data sources, such as devices, vehicles, and sensors connected to networks and generating data, the variety of information and transportation mechanisms has grown dramatically, posing new challenges for the collection and interpretation of data.
Big data often comes from sources outside the business. External data comes at you in a variety of formats (including XML, JSON, and binary), and using a variety of different APIs. In 2016, you might think that everyone is on REST and JSON, but think again: SOAP still exists! The variety of the data is the primary technical driver behind big data investments, according to a survey of 402 business and IT professionals by management consultancy NewVantage Partners[SM1] . From one day to the next, the API might change or a source might become unavailable.
Maybe one day we’ll see more standardization, but it won’t happen any time soon. For now, developers must plan to spend time checking for changes in APIs and data formats, and be ready to respond quickly to avoid service interruptions. And to expect the unexpected.
3. Make sure you have the right to use that data.
Governance is a business challenge, but it’s going to touch developers more than ever before—from the very start of the project. Much of the data they will be handling is unstructured, such as text records from a call center. That makes it hard to work out what’s confidential, what needs to be masked, and what can be shared freely with external developers. Data will need to be structured before it can be analyzed, but part of that process includes working out where the sensitive data is, and putting measures in place to ensure it is adequately protected throughout its lifecycle.
Developers need to work closely with the business to ensure that they can keep data safe, and provide end users with a guarantee that the right data is being analyzed and that its provenance can be trusted. Part of that process will be about finding somebody who will take ownership of the data and attest to its quality.
4. Pick the right tools and languages.
With no real standards in place yet, there are many different languages and tools used to collect, store, transport, and analyze big data. Languages include R, Python, Julia, Scala, and Go (plus the Java and C++ you might need to work with your existing systems). Technologies include Apache Pig, Hadoop, and Spark, which provide massive parallel processing on top of a file system without Hadoop. There’s a list of 10 popular big data tools here, another 12 here, and a round-up of 45 big data tools here. 451 Research has created a map that classifies data platforms according to the database type, implementation model, and technology. It’s a great resource, but its 18-color key shows how complex the landscape has become.
Not all of these tools and technologies will be right for you, but they hint at one way the developer’s core competency must change. Big data will require developers to be polyglots, conversant in perhaps five languages, who specialize in learning new tools and languages fast—not deep experts in one or two languages.
Nota bene: MapReduce and Pig are among the top highest paid technology skills in the US, and other big data skills are likely to be highly sought-after as the demand for them also grows. Scala is a relatively new functional programming language for data preparation and analysis, and I predict it will be in high demand in the near future.
5. Forget “off-the-shelf.” Experiment and set up a big data solution that fits your needs.
You can think of big data analytics tools like Hadoop as a car. You want to go to the showroom, pay, get in, and drive away. Instead, you’re given the wheels, doors, windows, chassis, engine, steering wheel, and a big bag of nuts and bolts. It’s your job to assemble it.
As InfoWorld notes, DevOps tools can help to create manageable Hadoop solutions. But you’re still faced with a lot of pieces to combine, diverse workloads, and scheduling challenges.
When experimenting with concepts and technologies to solve a certain business problem, also think about successful deployment in the organization. The project does not stop after the proof.
6. Secure resources for changes and updates.
Apache Hadoop and Apache Spark are still evolving rapidly and it is inevitable that the behavior of components will change over time and some may get deprecated shortly after initial release. Implementing new releases will be painful, and developers will need to have an overview of the big data infrastructure to ensure that as components change, their big data projects continue to perform as expected.
The developer team must plan time for updates and deprecated features, and a coordinated approach will be essential for keeping on top of the change.
7. Use infrastructure that’s ready for CPU and I/O intensive workloads.
My preferred definition of big data (and there are many – Forbes found 12) is this: "Big data is when you can no longer afford to bring the data to the processing, and you have to do the processing where the data is."
In traditional database and analytics applications, you get the data, load it onto your reporting server, process it, and post the results to the database.
With big data, you have terabytes of data, which might reside in different places—and which might not even be yours to move. Getting it to the processor is impractical. Big data technologies like Hadoop are based on the concept of data locality—doing the processing where the data resides.
You can run Hadoop in a virtualized environment. Virtual servers don’t have local data, though, so the time taken to transport data between the SAN or other storage device and the server hurts the application’s performance. Noisy neighbors, unpredictable server speeds and contested network connections can have a significant impact on performance in a virtualized environment. As a result, it’s difficult to offer service level agreements (SLAs) to end users, which makes it hard for them to depend on your big data implementations.
The answer is to use bare metal servers on demand, which enable you to predict and guarantee the level of performance your application can achieve, so you can offer an SLA with confidence. Clusters can be set up quickly, so you can accelerate your project really fast. Because performance is predictable and consistent, it’s possible to offer SLAs to business owners that will encourage them to invest in the big data project and rely on it for making business decisions.
How can I learn more?
Add our Twitter chat to your calendar. It happens Thursday, March 31 at 1 p.m. CET. Use the hashtag #SLdevchat to share your views or post your questions to me.
Register for the webinar on Wednesday, Apr 20, at 5 p.m. to 6 p.m. CET.
About the author
Frank Ketelaars has been Big Data Technical Leader in Europe for IBM since August 2013. As an architect, big data developer, and streaming analytics specialist, he has worked on multiple high volume projects in banking, telecommunications, manufacturing, life sciences and government. He is a specialist in Hadoop and real-time analytical processing.