In this Digital CxO Leadership Insights series video, Mike Vizard speaks with Vinoth Chandar, CEO of Onehouse, about managed data lakes.

 

Transcript Text

Mike Vizard: Hello, and welcome to the latest edition of the Digital CXO videocast. I’m your host, Mike Vizard. Today, we’re with Vinoth Chandar, who’s CEO of a company called Onehouse. They are a startup that just picked up $8 million in funding, and they have what is known as a managed data lake, for lack of a better explanation, and we’ll get into that in a minute. Vinoth, welcome to the show.

Vinoth Chandar: Alright. It’s a pleasure to be here, and thanks for having me on board.

Mike Vizard: Alright. We have a lot of C-level executives that watch this, so, I’m not sure they all understand the difference between a data lake and a data warehouse. I’m sure they’re all familiar with data warehouses, but data lakes might be a new concept for many of them. What exactly is this subtle difference between what we used to call a data warehouse and what people are now calling a data lake?

Vinoth Chandar: Yeah, so, that’s a great question, actually. So, to understand this, we need to re-align a little bit and look at how data architectures have evolved. So, a data warehouse, everybody understands as, like, a specialized database, if you will, for analytics, your BI reporting, right? And what has happened in the last 10 years, if you look at it, is that the data volumes have grown more and more, and we have built a lot of data products. You know, going back to my time at LinkedIn, we built People You May Know, Jobs You May Be Interested In. For a lot of these more data-driven applications, machine learning, data science, all those things, we needed a place where we could store large amounts of data very cheaply and run horizontally scalable processes on it. And that is how data lakes came to be.

And if you read up on it, what most people cite as the difference between data lakes and data warehouses is that data lakes contain, you know, large amounts of unstructured data, like images, videos, and whatnot, which are used for more advanced machine learning applications, right? Warehouses, meanwhile, stick to the more traditional transactional data that sits in, like, your RDBMS, or data that’s more structured that way. And where we really come in is, our journey started at Uber, where we had a data lake and we had a data warehouse, but we needed to actually borrow a lot of the core transactional capabilities of the data warehouse and adapt them for the data lake in order to solve the business problems at Uber at that time. And that’s actually how Apache Hudi came to be and ________ a layer built on top of that.

Mike Vizard: The data lake itself is not necessarily a new idea, and some folks will remember Hadoop and other platforms, and that turned into a data swamp, so there are probably some folks who have had at least one stab at this. What do people need to think about to prevent a data lake from becoming a data swamp?

Vinoth Chandar: Yeah, that’s another great question. So, if you look at, actually, why data lakes become swamps or why the previous incarnation of the lake did not fully live up to its potential, you can actually trace it back to some data architectural deficiencies. Like, for example, the central idea of the data lake was, we are able to bring in the operational data, like your upstream databases, event data, any type of data in raw form, and then you let any type of analysis run on it, right? That was the promise of the lake.

But there was no structure to it. So, it was too open and too unstructured, and there was a lack of, like, a lot of standard data management on the lake in the first incarnation. And I think that’s what led people to kind of, like, have more frustrating experiences with the data lakes of before.

And the other angle was, in the on-prem world, data lakes were very hard to build and run, because you had to run on-prem HDFS, like, Hadoop clusters, and you had to actually build a team to scale that out, right? The cloud has actually changed the game, because you now have cloud storage, which is ubiquitous and very cheap, and you have a lot of on-demand compute power.

So, I think in this model, projects like Hudi have brought some of the much-needed management on top of the lakes. And now, if you look at even Hudi’s adoption, or generally where the market is at, I think we are kind of reviving or redeeming the lake in many ways with some of these more recent shifts.

Mike Vizard: You guys are a provider of a managed service. I’m not sure everybody knows what that means. Why should organizations look to an external service platform approach to creating a data lake and managing it versus trying to roll their own?

Vinoth Chandar: Yeah. So, I think, actually, this is the single biggest pain point that we wanna solve in the company. So, bridging a little bit to the previous question, right: what does Hudi add that makes lakes better now? It brings updates and deletes, for example, and it brings a layer of services that can make tables more optimized and faster. And this helps people build higher quality lakes, and this is something that, you know, Hudi was a project built at Uber, Uber brought it to the Apache Software Foundation, and we’ve had a community for four years.

But here’s the thing: even though the technology exists, we constantly see companies spending six months to a year rolling out their own data lakes. In stark contrast, people can still get started with warehouses on the cloud much more quickly, because most people start with a fully managed data pipe, and most of the cloud warehouses are fully managed, again. So, that gives them a very easy starting point, whereas on the lake, you have months of build-out, and there is, you know, operational expenditure in terms of hiring and building a high-quality team. They have to come and understand different database concepts, even streaming systems, you know, event buses like Apache Kafka, then pick a product like Hudi and stitch them all together. It takes a while for people to realize the return on investment on the lakes, even today. And that’s the main piece that we want to solve for.
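
To make the "updates and deletes" mentioned in this answer concrete, here is a minimal Python sketch of applying a batch of keyed changes to a table snapshot. It is an illustration only and does not use Hudi's API; the `Change` type, the `apply_changes` function, and the sample keys and rows are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Change:
    """One keyed change to a table (hypothetical type, not a Hudi class)."""
    key: str               # record key, e.g. a user id
    value: Optional[dict]  # new row contents; None means "delete this key"


def apply_changes(table: Dict[str, dict], changes: Iterable[Change]) -> Dict[str, dict]:
    """Return a new table snapshot with the changes applied in order."""
    snapshot = dict(table)  # copy so the previous snapshot stays untouched
    for change in changes:
        if change.value is None:
            snapshot.pop(change.key, None)       # delete; unknown keys are ignored
        else:
            snapshot[change.key] = change.value  # insert or update (upsert)
    return snapshot


table = {"u1": {"city": "SF"}, "u2": {"city": "NYC"}}
batch = [
    Change("u1", {"city": "LA"}),   # update an existing record
    Change("u3", {"city": "SEA"}),  # insert a new record
    Change("u2", None),             # delete a record
]
result = apply_changes(table, batch)
print(result)
```

Without record-level operations like these, a plain append-only store would typically have to rewrite whole files or partitions to change a single row, which is the gap this answer describes.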

Mike Vizard: So, your service is based on Hudi as kind of an abstraction layer for managing the data and the governance, but what is the relationship gonna be between you guys and an internal IT team? Do I need data engineers, still, or do you take care of all of that? Or is there some sort of more co-managed hand in glove, how does this all work?

Vinoth Chandar: Right. So, what we’re building is a cloud SaaS application, which is going to automate and self-manage a good chunk of your data engineering workloads. So, you still need data engineers, but they can actually focus more on the directly business-impacting things instead of the lower-level data infrastructure.

So, our service takes care of, let’s say, managed ________. Like, we can bring data in, we can optimize the data under the hood, we can lay it out nicely for efficient query performance, while your data engineers can actually focus on picking up the open source frameworks that sit on top of Hudi, write the ETL pipelines, you know, be focused on, let’s say, how do I detect fraud for a large bank? How do I make that algorithm better instead of clicking files on top of cloud storage or sizing them, cleaning up tables and this kinda thing, right? So, that is where we are trying to build managed data infrastructure and, while you still, you know, you can put your data engineers to kinda like maximize their impact to your business. That’s how we think about it.

Mike Vizard: We hear a lot about various buzzwords, so, DataOps is, of course, one of the buzzwords of the moment, but DevOps has been around as a concept for a while and you’ve been around the block a couple of times. How do you think DevOps and DataOps might come together in an organization, and how do I kind of bridge maybe a cultural divide between folks who are maybe data scientists and data engineers and then the applications that are gonna consume the models and everything else they create?

Vinoth Chandar: Right. This is like, I think, one of the more open challenges that we have today as an industry. Because data engineers and DataOps in general, they are, in large ways, detached from the backend application engineers who are building the applications, right? So, I’m not sure how they would really, you know, connect in the short term right away, but I do see steady trends on both sides where DataOps people are caring more and more about pipeline reliability, observability, and CI/CD, how did we do?

That’s another thing that we wanna provide is, you know, anything that you do on top of us, we wanna make sure that you can have a good CI/CD experience and, you know, we bring and borrow some of the DevOps principles into DataOps and try to make it higher quality for folks, right?

So, that shift is already happening. There’s tons of work, great open source projects that help in this regard. I think that would be a healthy trend to watch for. What, as a company, for us, we still feel, even with this tooling, just like dealing with these large amounts of data and then restoring data, moving data around between regions—these kinds of things are still very hard, infrastructural pieces that we don’t see good standardization around. So, we are focusing on that, but I think overall, the trends are very healthy, I say, in terms of where we see DataOps going.

Mike Vizard: This is generally more complex than most Digital CXOs appreciate. Is there something you wish they knew about this whole process going in? I wonder, maybe, if they all have some unreasonable expectations about the magic of IT and how all this is gonna come together, or do they need to have patience and it’s gonna take time?

Vinoth Chandar: Yes, and I think the biggest thing, insight I would say is, if you look at it as an analytics problem, I think a lot of people don’t realize that, by solving some of these management problems, you can actually start with more open data formats and keep—you know, start on the lake even right away. So, that is one issue that I feel like, on the other hand, if tools like this exist, maybe there’s broader awareness. That would be an ideal thing for CXOs. When they think about data strategy, if you look at it, right, the different query engines that will come, there will be a newer, better way of analyzing data, there will be new use cases for data that will open up. But there probably needs to be a more central, open way for you to manage and keep your data well maintained and compliant in all that. I think that layer is missing today.

So, I think I would say, if there is one thing I would like to probably see highlighted is that that layer needs to be built in the industry, either by Onehouse or someone else. That, I definitely see as a missing piece and leads to a lot of data silos, in my opinion.

Mike Vizard: Do you think there will ultimately be a gap between organizations that really do know how to harness and master and use their data and those that don’t, and eventually those that don’t really understand this are gonna fall by the wayside?

Vinoth Chandar: I think there’s already proof in the pudding for this, right? If you look at even large retail brands, if you go back to how their apps looked, like, five years ago compared to now, you can see that, right? Like, data-driven approaches to even, you know, what sort of notification do you send people? Because everybody gets so many notifications these days, how do you even optimize for your growth? I think this is gonna be very central going forward. Overall, we definitely have an information overload.

So, the more and more data that you can use and create value for your customers in a more pointed way, I think in general, anything starting from consumer experience to optimizing supply chains to server dealing with, you know, being able to stream and process a lot of sensor data and make, you know, more intelligent ________ around, let’s say, energy usage. I think this is going to be the trend, not the other way around. I think there are already lots of examples for companies who have done really good, just because the usage of data is so intelligent and very focused.

Mike Vizard: What do you think the correlation to the data management and artificial intelligence is gonna be? At least when I look, a lot of these models, they all require some level of data to train. Maybe not as much as they used to be, but the quality of the data matters more than ever, it seems, and in a lot of organizations, to be frank, not many of them would get a Good Housekeeping seal of approval for the way they manage data. So, do we need to kinda tackle this, learn how to walk with data management before we think about AI?

Vinoth Chandar: Yeah, definitely. I think this was actually one of the original motivating factors for us, even at Uber, as you may, if you followed Uber’s engineering. And generally, we’ve put a lot of time into AI and doing more with data from, anywhere from predicting ETAs to rider safety to anything else, right?

Back in the day, just properly ingesting the data, organizing it really well, creating these layers of access and educating the company on what each layer means, that had a very profound impact on the quality of models that you produced. And then, so, just to give you an example, right? In most organizations, people send their, they start with the ________, so, they send the traditional transactional data, like, what sits in the RDBMS sits in a warehouse. Then they start the lake when they start, like, a data science team. And then they usually don’t have a holistic data infrastructure to bring these sources together.

So, you will often find that, for data science, your logging events started producing some ________. You need to correlate that with the database, right? Without proper data ingestion, for example, you will constantly run into missed records, you know, data arriving out of order, and all of these different things that actually fundamentally affect your feature extraction for machine learning and affect the quality of the models.

And I think having stable, standardized data infrastructure that can get all this data to you with high reliability, attach some schemas, enforce some data quality controls, that is, I think, pretty critical for any organization that wants to use AI in a very, you know, externally impactful way. If you want your bots to be making real world ________, I think we need to invest in that. Otherwise, it’s gonna be, like, a house of cards, right? Your model is only as good as your data.
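
The ingestion problems described in this answer (duplicate records, data arriving out of order, missing schema checks) can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not how Hudi or Onehouse implement ingestion: the schema in `REQUIRED_FIELDS`, the `ts` ordering field, and the `validate` and `ingest` functions are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"id": str, "ts": int, "status": str}


def validate(record: dict) -> bool:
    """Basic quality control: every required field is present with the right type."""
    return all(isinstance(record.get(name), typ) for name, typ in REQUIRED_FIELDS.items())


def ingest(records: Iterable[dict]) -> Tuple[Dict[str, dict], List[dict]]:
    """Keep the latest valid record per id, regardless of arrival order."""
    latest: Dict[str, dict] = {}
    rejected: List[dict] = []
    for record in records:
        if not validate(record):
            rejected.append(record)  # set aside instead of silently dropping
            continue
        current = latest.get(record["id"])
        # A late-arriving older event must not overwrite a newer one.
        if current is None or record["ts"] >= current["ts"]:
            latest[record["id"]] = record
    return latest, rejected


events = [
    {"id": "a", "ts": 2, "status": "completed"},
    {"id": "a", "ts": 1, "status": "started"},  # arrives late, older timestamp
    {"id": "b", "ts": 1, "status": "started"},
    {"id": "b", "ts": 1, "status": "started"},  # exact duplicate
    {"id": "c", "status": "started"},           # missing "ts", fails validation
]
latest, rejected = ingest(events)
print(latest)
print(rejected)
```

Here the late "started" event for `a` does not overwrite the newer "completed" one, the duplicate for `b` collapses into a single row, and the malformed record is kept aside for inspection rather than reaching feature extraction.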

Mike Vizard: Alright. Vinoth, thanks for being on the show and sharing your insights. That was awesome.

Vinoth Chandar: Alright. Thank you so much, and a pleasure to be here again.

Mike Vizard: Alright, and thank you, all, for spending some time with us. There are more of these videos on DigitalCXO, so, by all means, come check ’em out.