In this Digital CxO Leadership Insights Series video, Mike Vizard talks with FeatureByte CEO, Razi Raziuddin, about what makes managing data science teams challenging in the enterprise.
Mike Vizard: Hey guys, welcome to the latest edition of the Digital CxO Leadership Insights series. I’m your host, Mike Vizard. Today we’re with Razi Raziuddin, CEO of FeatureByte. And we’re talking about data management, feature engineering, MLOps and how all that drives AI. Razi, welcome to the show.
Razi Raziuddin: Thank you so much. Thanks for having me, Mike.
Mike Vizard: I mean, everybody in the world is now talking about AI to some degree or another, but talking about it and doing it are not really the same thing. And I think a lot of people don’t really appreciate what goes into it from a process standpoint. The first issue, it seems, that people are struggling with is just getting the data together. A lot of the organizations I talk to, you know, they’ve never been really good at data management in the first place. And now we need really high-quality data to drive these AI models. So what are you hearing from folks in terms of how AI projects are really driving data management initiatives?
Razi Raziuddin: Absolutely, Mike, that’s a great question. In fact, there’s a statement that I like to make to anyone who asks this type of question, and it’s a well-known fact in the data science world: great AI starts with great data. This is something that pretty much everyone in the data science space recognizes. But the fact of the matter is that doing great data for AI is very hard. It’s very complex. Just the process of getting data prepared by data scientists, deploying feature pipelines or data pipelines, and ultimately managing all of that data in production – that is super hard. And unless organizations get a handle on their data, getting AI deployed at scale and truly getting the most out of their AI initiatives is going to be super hard. This is one of the challenges that we saw. Both my co-founder and I at FeatureByte were very early employees at DataRobot, which is an AI/ML platform. And what we saw over the past decade or so, and this is true not just of DataRobot and similar platforms but of pretty much the whole data science tooling space, is that there’s been a lot of focus on modeling, on helping data scientists eke out a point or a 1% improvement in performance from just tuning parameters or trying out different algorithms or what have you. But the fact of the matter is that data is really what drives the bulk of the performance and the predictive power of a lot of machine learning models, especially in the enterprise AI space. Organizations have to move from being very model-centric in their approach to AI to being truly data-centric. And that’s exactly what we’re solving with FeatureByte.
So we’re creating a platform, basically, that helps organizations both simplify AI data and industrialize it because you need industrialization in order to be able to scale and deploy AI everywhere within enterprises.
Mike Vizard: We hear a lot about DataOps, and we hear about MLOps. And they are both, in my mind, variations on a theme of best practices similar to what we tried to do with DevOps. What’s your sense? What is the relationship between data engineering and MLOps these days? Because I hear a lot of data scientists are spending a lot of time on data management, but maybe that’s somebody else’s job.
Razi Raziuddin: Yeah, I think it’s interesting you bring up MLOps and DataOps. I think what’s really needed in the market, in the industry right now, is what I would describe more as FeatureOps. I was talking about AI data, but in the data science world, the name for the representation of AI data is features; features are the data that gets fed into models to train them and to get predictions out of those models once they have been trained. Right? And today, if you look at how features are created, and how they’re deployed and managed, it’s a very complex and very expensive process. You have data scientists who come up with features, and coming up with really good features requires a lot of domain knowledge and industry knowledge as well. So the data scientists work with domain experts to come up with features. And then they work a lot with data engineers or ML engineers to deploy those features in production. That process takes a really long time because you have different teams that use different tools. They speak different languages, and they don’t necessarily fully understand each other. And the fact of the matter is that when you’re creating features, there’s a lot of SQL and a lot of Spark involved. Doing features at scale requires time-aware SQL or Spark code, which is very complex to write, to ensure the correctness of, and to debug. Doing all of that just requires a lot of time and energy. From our perspective, we feel like there are just too many different personas involved in the process. Ultimately, features are something that are created and authored by data scientists. And it’s really the data scientists’ responsibility to ensure the correctness of those features when they get deployed in production, and to track how those features either work well or deteriorate over time, which naturally tends to happen as data changes and market conditions change.
And so I think we all need to be thinking about FeatureOps, and really sort of giving a lot of control in the hands of data scientists so that they’re able to, not just create these features, but deploy them very quickly, very effectively, and manage those features at scale – govern them, monitor them, etc, without necessarily having to, you know, do a lot of data engineering either themselves or rely on data engineers. And that’s something that’s going to free up a lot of time and energy from this entire process.
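Editor's note: the "time-aware" requirement mentioned above can be illustrated with a minimal sketch. It is not FeatureByte's implementation or API; the function name, the sample events and the 7-day window are all assumptions made for illustration. The idea is that a feature computed for an observation time may only use events that happened strictly before that time, otherwise the training data leaks information from the future.

```python
from datetime import datetime, timedelta

def count_events_before(events, customer_id, as_of, window_days=7):
    """Count a customer's events in the window [as_of - window_days, as_of).

    Hypothetical helper: events at or after `as_of` are excluded so the
    feature never sees data from the future of the observation time.
    """
    window_start = as_of - timedelta(days=window_days)
    return sum(
        1
        for cust, ts in events
        if cust == customer_id and window_start <= ts < as_of
    )

# Made-up (customer_id, timestamp) events, for illustration only.
events = [
    ("c1", datetime(2023, 1, 1)),
    ("c1", datetime(2023, 1, 5)),
    ("c1", datetime(2023, 1, 9)),
    ("c2", datetime(2023, 1, 6)),
]

# Feature value as of 8 January: the 9 January event must not be counted.
feature_jan_8 = count_events_before(events, "c1", datetime(2023, 1, 8))
print(feature_jan_8)
```

At scale the same logic is usually expressed as windowed joins in SQL or Spark, which is where the complexity described in the interview comes from.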
Mike Vizard: How does FeatureOps integrate with the processes that we’re using for building and deploying applications? Because most of the models gotta get inserted into that somehow, and the developers are working with their own repositories, whether it’s some flavor of Git, or whatever it might be. And then the data scientists are managing features. Some people are arguing that the feature management can be within the Git repository; other people are saying, we need a dedicated platform. What’s your sense of what’s the right mix or approach?
Razi Raziuddin: Yeah, I think the integration between what application developers and data engineers do and what data scientists do is essential. Unless there’s a seamless integration between the work that data scientists are doing to build these features and deploy them, and what’s needed from application developers to build really good applications that leverage these models and these feature pipelines, I think it’s just gonna be challenging. So what we’re doing at FeatureByte, Mike, is helping data scientists create really good, world-class features by just writing very short, simple, you know, half a dozen lines of Python code, which automatically gets translated into complex SQL and Spark, gets implemented inside the platform of your choice, whatever data management platform you use, and gets executed within the data platform itself, where the data lives. And all of this is managed and governed in a way that makes the process really industrialized and hardened. Right? So it’s not just data scientists running experiments from their Python notebooks and those notebooks being deployed in production, which is one of the biggest challenges in ensuring correctness and ensuring the hardening of these feature pipelines that get built. So we feel that you need a platform that’s dedicated to ensuring that the data scientists are the ones who are responsible for the end-to-end process. But at the end of the day, the way these feature pipelines are being built is hardened. It’s engineered very correctly, and at the same time it doesn’t require an army of folks to maintain and support it over time.
Mike Vizard: One of the issues that seems to be occurring is Digital CxOs will launch some sort of AI project; they’ll hire a bunch of data scientists and stick them in a room somewhere, and then everybody thinks that something magic is going to happen. Do we need to kind of have a deeper conversation about how to industrialize this stuff? Because a lot of times the data scientists are coming up with stuff that the business looks at and says, you know, well, that doesn’t actually reflect reality. But nobody told the data scientists, so how do we kind of have that more meaningful conversation between the business and the AI folks?
Razi Raziuddin: It’s a great question. You know, we’ve worked on this quite a bit at companies like DataRobot, and we’re bringing a lot of that learning into FeatureByte as well. Just going back to the challenge of data management in the AI space: there are three key, different skills that are needed to do AI data management really well. From our perspective, as a data scientist, you have to start with a really good knowledge of the domain of the problem that you’re looking to solve. So, to your point, unless you bring these data scientists in and integrate them into the business, so that they understand the business workflows, the needs of the business, and what data exists, you can’t really do data management, and you can’t do feature engineering really well. Because you need to understand what types of signals to derive from the data in order to solve whatever use case you’re after. And I don’t think it stops at the integration of business and data science; you have to involve the infrastructure and the data engineering folks as well. Because it’s one thing to run experiments in a vacuum very effectively, but it’s a whole other process to actually take all of those experiments and make them work in the real world. Right? There’s the R, which is the research part of R&D, which data scientists are very much focused on; then there’s the D, the development, where they also have to interact with data engineers and application developers, to your point earlier. And unless they’re seamlessly integrated on both sides, I don’t think data scientists will be very effective.
So yes, it’s absolutely essential to bring the data scientists in and make them integral on both sides: from the business point of view, so that the data scientists understand business processes and the needs and requirements of the business, and on the infrastructure side, so that they have access to it in a way that doesn’t, again, make them dependent on the IT staff, whether it’s DevOps staff or data engineering staff, and they’re able to do things a lot more independently.
Mike Vizard: We also worry a lot these days about drift in the AI model, which is typically driven because we got new data sources or the data that we were relying on somehow changed. And that impacted the AI model. So is this a conversation about data management, not just before the AI model is deployed, but after, and it becomes something of a continuous process? But we don’t all have our heads wrapped around that, either?
Razi Raziuddin: Absolutely. You know, this has been a challenge. A lot of tools out there look at data drift after the fact, once a model has been deployed in production. I would go so far as to say that a lot of data drift happens even before those data pipelines and data models get deployed in production. In many of the organizations that we’ve been speaking to, it takes literally weeks or months to get some of these data pipelines deployed in production. In fast-moving markets, or even in slower-moving markets, a month or a couple of months is a really long time between when a model gets trained and features get built and when the model actually gets deployed in production. So again, unless that process of going from experimentation to deployment gets reduced significantly, I think we’ll have this problem of data drifting even before things are in production and operation. But definitely, this is another challenge where you need a feature engineering or feature management platform that’s looking at feature drift or data drift not after the fact, but really from the source all the way to when the data shows up for prediction, for any kind of scoring on the model that’s built. And again, that’s one of the key things that we’re focusing on at FeatureByte right now.
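Editor's note: one common way to quantify the feature drift discussed above is the population stability index (PSI), which compares the binned distribution of a feature at training time with its distribution at serving time. The sketch below is a generic illustration, not something described in the interview or taken from FeatureByte; the function name, the equal-width binning, the `1e-6` floor for empty bins and the sample data are all assumptions, and both inputs are assumed to be non-empty.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time sample and a serving-time sample.

    Bins are equal-width over the range of `expected`; values outside
    that range are clipped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion so empty bins do not cause log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Made-up samples: the serving data is the training data shifted by 0.5.
training = [i / 100 for i in range(100)]
serving = [v + 0.5 for v in training]

psi_same = population_stability_index(training, training)
psi_shifted = population_stability_index(training, serving)
print(psi_same, psi_shifted)
```

Identical samples give a PSI of zero, and the value grows as the two distributions diverge. A widely quoted rule of thumb treats values above roughly 0.25 as a significant shift, though the right threshold depends on the use case.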
Mike Vizard: We also hear a lot about the democratization of AI models and that, you know, people who aren’t data scientists will be able to build these things. But it seems to me that that may be a recipe for disaster if I don’t understand the fundamentals of data management. So just how would we democratize AI models in a way that doesn’t wind up with this? One wise guy told me, you know, it’s one thing to be wrong, it’s another thing to be wrong at scale.
Razi Raziuddin: It’s a great one, I’ll make a note of that and use it when I talk to different folks as well. You know, while democratization is challenging and could be done in a way that has a negative impact, I think it is absolutely essential if organizations are looking to scale. And, to your point, Mike, going back again to my experience at DataRobot: DataRobot was a platform that was, at the core, designed to democratize the modeling process. And one of the things we saw over and over again is that it’s one thing to make the entire modeling process, you know, just the click of a button, right? You load your data, click a button, and out come several models, from which you can choose the best one for your use case. The challenge over and over again, in pretty much 98% of the cases, was that preparing the data for modeling was something that organizations didn’t have the wherewithal to do. So it’s one thing to go to data scientists, data analysts or business analysts who understand the data and understand the business problem and say, “Okay, well, you know, formulate a data science problem.” And they’re very capable of doing that, because a lot of the formulation of both features and data science use cases is a very human, intuitive process that doesn’t require a degree in data science or computer science. But then you do need a lot of skills and expertise to be able to translate that intuition into something that machines and algorithms understand. And that’s where you need tools.
That’s where you need both modeling tools like DataRobot, etc., and data management and data preparation tools that are designed to help both expert data scientists and novice data scientists build features, with the governance and the guardrails around the process to do things well at scale. Going back to your statement earlier, you want a lot more personas within the business, people who know the data and know the business problems, to be able to do data management in a distributed, sort of democratized manner. And that’s, yeah, that’s absolutely essential for scaling AI.
Mike Vizard: Do you think we’ll ever get to the point where we’re going to be using AI to build AI? I mean, you know, just how convoluted can everything get?
Razi Raziuddin: On that point, I think we’ll just have to wait and see what ChatGPT or GPT-10 brings us; perhaps we’ll be at that point where, you know, there’s AI creating AI. Definitely, there’s no reason why AI shouldn’t be able to create AI, because it is a structured, sort of fundamental process that AI can learn and do. But again, to your point, if it does it incorrectly, we could do a lot of damage at scale. In many ways, doing AI well, doing really good data management and feature engineering, is a very creative process. It requires an understanding and a view of the world. And it’s very integrated, in some ways, with how the world functions, right? And how business functions. And so yes, if the previous month and the previous year were the same as the next month and the next year, AI would be very good at creating models that do predictions very, very well. But the fact of the matter is that the world keeps changing, and a lot of events are very unpredictable. There’s always new data being generated, there are always new scenarios and situations coming to the fore, and all of that requires a lot of human creativity. So I think ultimately, it’s going to be a combination of humans and AI that’s going to beat AI creating AI.
Mike Vizard: Alright folks, well, you heard it here. If your data hygiene stinks, so does your AI model, so start at the beginning and work from there. Razi, thanks for being on the show.
Razi Raziuddin: It’s a pleasure, Mike. Thank you so much.
Mike Vizard: And thank you all for watching the latest episode of the Digital CxO Leadership Insights series. You can find this episode and others on digitalcxo.com. We invite you to check them all out. And we’ll see you all next time.