General Electric is involved in nearly every area of the clean energy market: solar engineering, wind manufacturing and development, LED lighting, distributed natural gas, metering and submetering, and grid analytics are just some of its major touch points.
It's also a dominant force in conventional generation, monitoring 1,600 gas and steam turbines that represent nearly one-quarter of the world's power plants.
With such a diverse array of energy assets under its control, GE has become a prodigious producer and consumer of data. Energy engineers at the conglomerate's remote monitoring center are analyzing ten times more data today than they were five years ago -- bringing new insights, while also creating new complications for the data-crunching infrastructure.
The energy market is just one piece of GE's total business, which also includes healthcare, aviation and rail transportation, manufacturing, mining and water processing. As GE expands the industrial internet and uses sensors to track the performance of everything it builds, the company is generating thousands of terabytes of data for customers.
But even a mighty giant like GE is having a hard time keeping up with it all.
“Big data is growing so fast that it is outpacing the ability of current tools to take full advantage of it,” said GE's Vice President of Software Bill Ruh in a statement about the company's new approach to data.
That's why GE made a $105 million equity investment in Pivotal, a big data analytics firm, in April of last year. Over the last sixteen months, the two companies have been working on a new way of sifting through that growing pool of information. And today, they announced the result: a "data lake."
A data lake, enabled by the open-source software Hadoop, is simply a collection of information in its raw format. Rather than process the data and file it away in a rigid way, GE and Pivotal are storing it in its original form and sifting through it when needed.
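To make the "store raw, sift later" idea concrete, here is a minimal Python sketch. It is only an illustration of the general approach, not GE's or Pivotal's implementation: the record fields, the `RAW_LAKE` list and the `query` helper are all hypothetical, and a real data lake would sit on a distributed store such as Hadoop rather than an in-memory list.

```python
import json

# Hypothetical raw sensor records, kept exactly as they arrived (one JSON string each).
# The field names are illustrative assumptions, not GE's actual data format.
RAW_LAKE = [
    '{"engine_id": "E1", "flight": 101, "exhaust_temp_c": 612.5}',
    '{"engine_id": "E2", "flight": 101, "exhaust_temp_c": 655.0, "vibration": 0.8}',
    '{"engine_id": "E1", "flight": 102, "exhaust_temp_c": 640.0, "note": "headwind"}',
]


def query(raw_records, field, predicate):
    """Parse each raw record at read time; yield those whose field satisfies predicate."""
    for line in raw_records:
        record = json.loads(line)  # structure is imposed only when the data is read
        if field in record and predicate(record[field]):
            yield record


# Two different questions asked of the same untouched raw data.
hot = list(query(RAW_LAKE, "exhaust_temp_c", lambda t: t > 630))
vibrating = list(query(RAW_LAKE, "vibration", lambda v: v > 0.5))

print([r["engine_id"] for r in hot])
print([r["engine_id"] for r in vibrating])
```

Nothing is discarded or reshaped when the records are stored, so a field that only some records carry (here, `vibration`) can still be queried later without redesigning a schema up front.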
GE says it can process information 2,000 times faster and ten times cheaper than traditional methods -- reducing analysis run times from months to days or even minutes.
Dave Bartlett, the CTO of GE's aviation business, gave a very helpful -- and biologically specific -- description of the data lake:
Bartlett, who studied biology and ecosystems before he jumped into computer science, uses a biological metaphor to describe the data lake concept. “A data lake is like a pond in the woods -- a richly diverse ecosystem,” he says. “You have complex food webs composed of millions of organisms, from algae and plants all the way up to top predators. Other factors such as water depth, available oxygen, nutrient levels, temperature, salinity and flow create the context of an intricate, interconnected ecosystem. If you throw a line in the water, you never know what you will catch. It is an exciting place to fish! The questions and analytical opportunity are almost limitless.”
“On the other hand,” he says, “a more traditional database is more like a fish farm where all the species have been preclassified and fed the same diet and health supplements. Some intensive tanks even employ biosecurity measures -- a [significant] contrast from the rich, open natural ecosystem. If you throw a line in the water here, you have a pretty good idea of what you will catch! While useful, it has more limitations as to what it can teach us.”
GE's "fishing pole" will be its Predix software, the company's analytics platform connecting devices to the industrial internet.
So far, the data lake is only being used for airlines. GE has analyzed 15,000 flights for its customers, and plans to scale up to 10 million flights by next year. But the method will eventually be expanded to all its major industries, including energy.
The possibilities are endless, given GE's deep reach into energy infrastructure. For example, as the company moves into the grid analytics space and sells software for managing outages, understanding customer behavior and monitoring demand, GE could potentially gain an advantage by offering this kind of analytics service. The data-crunching method also has potential for better understanding energy use in commercial buildings, manufacturing plants and within the home.