CONTRIBUTOR
Chief Content Officer,
Techstrong Group

Synopsis

In this Digital CxO Leadership Insights video, Mike Vizard talks to Ahana CEO Steven Mih about why digital transformation requires a data lakehouse in the cloud.

 

Transcript

Mike Vizard: Hello, and welcome to the latest edition of the Digital CxO Leadership Insights series. I’m your host, Mike Vizard. Today we’re with Steven Mih, who is CEO of Ahana. They are big in data warehouses, data lakes and now, I guess, data lakehouses, which is the next thing in this whole space. And my first question to you, Steven, is: Is there a difference between these things? Or is it just the latest iteration of buzzword bingo? What distinction should people make here?

Steven Mih: There is a difference. And fundamentally it’s driven by the fact that, as companies get more data-driven, they have a lot more data. And they’re looking for an architecture that can get insights on the data with many different types of engines. So while data warehousing fundamentally provides you with fast insights, a data lakehouse is more flexible, more open and much more scalable.

Mike Vizard: So in the sense that it’s more open, does that mean I can mix and match and combine data? Because there are a lot of different data types out there. So how does this massive data lake know what data type does what, and how do I combine different things? Because people are trying to get value out of their data, but they seem to be slightly overwhelmed, shall we say.

Steven Mih: Yeah, data is being generated in all different places and in all different forms. But oftentimes, you’ll need to put it into some aggregated system, and it’s best to have an open data format for that. Apache Parquet and Apache ORC are pretty common open formats, which then let you do anything with that data easily – versus when you put it into a data warehouse, which is mostly proprietary, where the formats are not easy to get to. But, Mike, I thought maybe I could talk first about what a lakehouse is and what it really means from our perspective.
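To make the open-format point concrete, here is a minimal sketch of what “open and directly accessible” looks like in practice: writing a small table to Parquet with the PyArrow library and reading it straight back. The file path and column names are made up for illustration; in a lakehouse the file would typically sit on an S3-compatible object store rather than local disk.

# A minimal sketch: write tabular data to an open format (Parquet)
# and read it back. Any Parquet-aware engine (Presto, Spark, pandas,
# DuckDB, etc.) can consume the same file -- no proprietary lock-in.
# The path and column names below are illustrative only.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "event": ["click", "view", "click"],
})

# Written to local disk here; in a lakehouse this would usually be an
# S3-compatible object store path instead.
pq.write_table(table, "events.parquet")

# Read it back with the same open library -- or any other engine.
print(pq.read_table("events.parquet").to_pandas())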

Mike Vizard: Go for it.

Steven Mih: So fundamentally, a lakehouse is a data lake with data warehouse capabilities. A data lake is the lowest-cost, directly accessible storage – oftentimes now cloud-based, S3-compatible object stores. That is the foundation, or one piece, of the data lakehouse: the storage, kept in a place where, if it’s an S3-based object store, it’s effectively infinite, very low cost and easy to use. And then on top of that, you have data warehousing capabilities – the more traditional analytical database management system and performance features. Those are things like ACID transactions, data versioning, auditing, indexing, caching and query optimization. So that is what a lakehouse is. And because it’s on a lake, it does imply that you can run ML workloads on the data lake as well. That’s our definition, and if you think about your first question of how it compares to a data warehouse, it’s a new architecture based on the lake first and foremost. That means a separation of storage from compute, and having flexible compute engines for the needs that you have.
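One of those warehouse capabilities, data versioning, can be illustrated with a rough sketch using the open source deltalake (delta-rs) Python package, one of the table formats discussed later in the conversation. The path and data below are hypothetical.

# A rough sketch of "warehouse capabilities on a lake": versioned,
# transactional writes over plain files, using the open source
# `deltalake` (delta-rs) package. Paths and data are hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "./sales_delta"  # could equally be an s3:// URI on object storage

# Version 0: initial load.
write_deltalake(path, pd.DataFrame({"region": ["east"], "total": [100]}))

# Version 1: a later write, committed as an atomic transaction.
write_deltalake(path, pd.DataFrame({"region": ["west"], "total": [250]}),
                mode="append")

# Time travel: read the table as of an earlier version, or the latest.
print(DeltaTable(path, version=0).to_pandas())
print(DeltaTable(path).to_pandas())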

Mike Vizard: Kind of sounds like we’re homing in on trying to have the best of both worlds – the lake and the warehouse – and combining them in a way that makes it easier to query all the data in there. Because it seems like one of the challenges we have is that there’s a massive amount of data out there, but it’s not quite clear how you access it within a data lake, per se.

Steven Mih: Yes, that’s right. And I think that’s part of the history of how we got here. If I go back a little bit, I’ll cover the whole element of a lake becoming a swamp. But prior to that, everything started with the data warehouse. There were the OLAP systems, and then the warehousing where you would generate data from different operational databases. Companies got so much data that they kept it in a separate data warehouse as the source of truth, so they could do historical analysis on it, which was better than doing it on their operational systems.

Then, in the mid-2000s and early 2010s, you had internet-scale companies starting to have internet-scale data. As the internet became more and more popular, there was a lot more data than just your simple spreadsheets or ERP systems. That’s what you call the big data era. I was at companies like Couchbase – I was employee 13 there – that were working with non-tabular data in NoSQL databases; working with that data – log data, image data, phone and video data – became a huge thing. And then the rise of data science, ML and AI became a big part of that.

Data lakes came from that trend. Hadoop was the first version of this, where you could say: there’s so much data, I’ll have it in a lake. At that time it was the Hadoop Distributed File System and a bunch of distributed servers, and I could compute against that using commodity servers. But compute and storage were still coupled at that stage, and the data was stored as files. That became the first time SQL started working on the lake. Facebook added Hive on top of the MapReduce paradigm to make a SQL interface that let you query the data lake. And from there, Presto came out to be the very fast version of SQL on the data lake.

So that was how we got to that place of the swamp, right? You have a lot of data there, but how do you know what’s in there? What has evolved now is why “lakehouse” is the new term. And it’s not just a term; it’s a technology that says, “I want this to be as nice and easy to use, and to give me the data management capabilities that the warehouse had.” But it happened in the cloud, because the cloud gave you that separation of compute and storage – storage became cloud storage, and you get it very easily. I don’t know, Mike, I’ve been in the industry for a while – we used to talk about RAID, a redundant array of independent disks, right? A whole bunch of disks, and things like NAS, network-attached storage – we don’t talk about that anymore. Now storage is something you just put in the cloud and don’t worry about. That’s where object storage became a huge factor. And when you started to put data management on top of that, with compute separated on top of that, that’s where lakehouses really came forth. SQL on the data lake was adopted, and there were more expectations for it to behave like a data warehouse. And hence, lakehouses happened.

Mike Vizard: One of the dirty little secrets of enterprise IT is that most organizations have not been especially good at managing data; very few of them would get a Good Housekeeping seal of approval for their data management practices. So is that getting better in the age of the cloud as we look at these lakehouses, because there are more data management capabilities built into the platform versus something I’m trying to layer on top of a mishmash of storage?

Steven Mih: Yeah, so the way we look at this is that the consumption of the data needed to evolve. And that’s now happened, with things like Presto and Spark being able to consume data directly on the lake. Once that became more adopted, companies started saying, “Let’s not just throw the stuff into the lake and figure it out later; let’s put it in formats that those engines can consume easily.” Those are the table formats that then give you the data management capability. So instead of just ETL-ing and aggregating all your data, it is now being ETL-ed into a table format such as Apache Hudi, which lets you do upserts and inserts and adds capability to the object store that you didn’t have before, since the object store is otherwise append-only. There are a couple of other projects that are very popular, such as Apache Iceberg and the Delta format. When you do it that way, it organizes your data from the get-go. And I think that’s just part of the evolution of the maturity, as companies find they don’t want to repeat the mistakes of the data swamp. They want to make it consumable for insights, whether that’s getting the insights out with SQL or using ML model training on your non-tabular data. Does that make sense? Did I answer your question?
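To give a feel for what a table format adds on top of an append-only object store, here is a rough PySpark sketch of an upsert into an Apache Hudi table. The Hudi package coordinates, field names, table name and bucket path are assumptions for illustration; consult the Hudi quickstart for the exact versions and options you run.

# Rough sketch of an upsert into an Apache Hudi table with PySpark.
# Package version, field names and paths are illustrative only.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-sketch")
         # Hudi ships as a Spark bundle jar; the version here is assumed.
         .config("spark.jars.packages",
                 "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.0")
         .config("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.createDataFrame(
    [(1, "alice", "2024-01-02")], ["user_id", "name", "updated_at"])

hudi_options = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.recordkey.field": "user_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# Rows with an existing user_id are updated in place; new ones are
# inserted -- even though the underlying store only appends new files.
(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3a://my-bucket/lakehouse/users"))  # bucket name is hypothetical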

Mike Vizard: Yep, I’m following you closely. As we move down this path, it used to be that I would hire a bunch of administrators and the cost factor was reasonable. Am I now looking for data engineers? And can I find these people? It seems like a whole other skill level. So what is changing on the personnel side of this equation?

Steven Mih: So, yes, you had infrastructure folks, but it was the DBAs that were managing your warehouse. Those folks would be doing a lot of the data wrangling and making sure the hierarchical models – the hierarchical organization of the data – were clean. I think that is shifting to data platform teams that need to work with a lot of open source technologies and be able to integrate them, or get vendor solutions that make that easy to use, and then pipeline the data into the lakehouse and provide the interfaces to the business – which is your community of analysts, data scientists and developers building apps on top of the data. That is a very hot area in terms of finding people who are good at it. And when that goes well, then guess what: now the business can get more and better information to make decisions. Or it can provide dashboards and analytics to its end customers, which lets them consume its products even better. And that’s why it’s such a big deal and such an exciting area.

Mike Vizard: At the risk of mixing some metaphors here, we talk all the time about how data is the new oil, but it seems like we don’t have any place to refine said oil. So is the data lake essentially becoming the place where we refine the data to make it something more valuable and, you know, the equivalent of cooking with gas, as they say?

Steven Mih: You know, I would say that the lake itself is the storage, but the lakehouse gives you the whole capability of compute and storage, and it’s all managed. So that’s where you get what we believe is the best price-performance, compared to having everything copied out of a data lake and put into a data warehouse, which tends to be very locked in and proprietary. So, cooking with gas – yeah, I think that if you have a lakehouse architecture, you can add it alongside a warehouse to start off with. What you’re getting is effectively infinite storage, but also the most cost-effective compute for it, in a flexible way, because you don’t know what your organization may need; suddenly that analytics practice becomes very critical, then you need to add an ML practice to it. These are critical initiatives for every company today. Every company is becoming a data-driven company, and how they manage and consume that data is starting to become a strategic business differentiator.

Mike Vizard: Are we going to get to the point where executives will trust the data more than they do today? Because the second dirty secret of IT is that a lot of times the execs are dubious of what the IT guys present to them in an analytics report because there’s so much conflicting data, and they’re not sure that the data is reliable. So are we going to get to a point where the senior level execs will be able to, with confidence, make a data-driven decision? Or will they always be kind of having a little gut check?

Steven Mih: Well, in the world of business – I’ve been in startups and big companies for over 25 years – you never have enough information to make a decision. So what’s important is that you try to get as much good information as you can, and then you’re going to have to make a gut call on some of these things based on your experience. But being able to get that data in a timely way, and get data that’s as accurate as possible – that’s important. What I see happening, though, is that at a business unit level and a product level, having data about your users and what they want is a huge factor that has changed the trajectories of companies that don’t have that capability. Look at Uber, for example. They can come out with new products, like a monthly subscription for $4.99 where your rides in a geofenced area in your neighborhood are only $3.99 a ride. Back before, it would have been more intuition to come up with something like that. But now you can A/B test it: you can look at historical data, figure it out, test it out on a slow-roll, canary-deployment basis; all of that is happening. Compare that to, say, Hertz Rent-a-Car, which had been around for decades, and the best thing they had was the gold subscription, where you show up at the airport and the car is waiting for you. After that, there wasn’t anything. So it creates new business models, new ways to generate value and a better user experience. These are all part of the wave of the speed of what consumers are looking for, and that’s true for enterprises on a B2B basis – the consumerization of technology is making it easier to deliver that. So I think it happens at the C-suite level, as well as, for sure, within the business units.

Mike Vizard: What is the hardest part about setting up a data lakehouse? Because we don’t really see it everywhere just yet. A lot of progress has been made, and we’ve seen a lot of different flavors of things: initially there were data warehouses, as you mentioned, plus there were data lakes, and then there were data warehouses in the cloud. Things keep progressing, but ultimately, what’s the challenge in front of organizations? What should they go into with their eyes wide open?

Steven Mih: Yeah, so the lakehouse is made up of four key components. You have your storage – you pick a cloud provider or a private cloud provider and make sure you have your storage. Then you have your table formats, which are what I mentioned – you may have a table format like Hudi, and it does more than just define a file format. These are critical components. Then you need your compute engines, and SQL is a big portion of that – Presto happens to be one of the very popular open source engines out there. And finally, a catalog – a metadata catalog. So that’s the extent of it, but it takes a little bit to understand. Once you know that, you can start to pick the pieces, and there are folks that put it all together for you. But once you have that, you have something which is completely scalable, very cost efficient, open and flexible. Compare that to going and getting a data warehouse – it’s locked in, and it’s going to get more and more expensive as you use it, even though it may be very easy and very fast. The lakehouse, however, is getting faster and faster, and we believe it will be on par over time, but with a much better price-performance equation, where you can even run it all for free if you want to; with open source you don’t have that lock-in.
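As a sketch of how those four pieces meet at query time, here is a minimal example using the presto-python-client to run SQL through a Presto coordinator against a table registered in a Hive-style metadata catalog over object storage. The host, catalog, schema and table names are all placeholders for whatever a given deployment uses.

# Minimal sketch of the compute-engine + catalog layer: run SQL through
# a Presto coordinator against a table registered in a Hive-style
# metastore, with the data itself sitting on the lake as open-format
# files. Host, catalog, schema and table names are placeholders.
import prestodb  # pip install presto-python-client

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # hypothetical coordinator address
    port=8080,
    user="analyst",
    catalog="hive",      # the metadata catalog
    schema="analytics",  # schema within that catalog
)

cur = conn.cursor()
cur.execute("""
    SELECT region, SUM(total) AS revenue
    FROM sales            -- a table whose files live on the data lake
    GROUP BY region
    ORDER BY revenue DESC
""")
for row in cur.fetchall():
    print(row)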

Mike Vizard: Alright, ultimately, will there be one single data lakehouse to rule them all? Or are there going to be a lot of small ponds and whatever else hanging around the big, massive data lake, where we have multiple implementations of these things in a more federated model?

Steven Mih: Is there one single cloud vendor in the world? You know, I’ve been in the industry a long time. I’m a big believer in polyglot persistence, meaning it’s probably going to be more of an oligopoly than a complete monopoly. So it’s not winner-take-all, but it’s not completely fractured into hundreds of databases like we used to have before. There will be a few – three to four – very large standalone lakehouse companies out there delivering technology to different parts of the market. It’ll be a great competitive space that will be very exciting to be in. So that’s what we see happening, and it’s a big enough market to support that. That’s what I’d say.

Mike Vizard: Alright, folks, data will always be distributed in one way or another. Steven, thanks for being on the show.

Steven Mih: Thank you, Mike. Thank you for listening, and thanks for having me.

Mike Vizard: All right, you can find this videocast and others on the Digital CxO website. We invite you to check them all out. We think you’ll enjoy them. And thank you for watching this one. And we’ll see you again next time.