Renewed interest in data-centric AI is improving model accuracy and bringing the concept to new applications.
Data-centric AI is gaining momentum as engineers working with AI shift their focus from models to data. Whereas engineers previously took a model-centered approach to improve the prediction outcomes and accuracy of a model, current dynamics are causing many to look to the quality of input data to improve outcomes.
With higher-quality data both going into and emerging from models, data-centric AI is generating new possibilities in environments outside the traditional purview of engineers, including 5G communications, lidar, medical device imaging, and state-of-charge estimation for electric vehicle batteries, among others.
MathWorks’ deep learning product manager David Willingham discussed the evolution of data-centric AI and how engineers can best navigate—and benefit from—the transition to data-focused models within deep learning environments.
Willingham joined MathWorks in 2006 and has since amassed more than 15 years of applied engineering experience supporting a variety of application areas in artificial intelligence, including deep learning, machine learning, reinforcement learning, predictive maintenance, statistics, big data, and cloud computing.
What is the impact of data quality on AI modeling, and how can engineers evaluate and optimize the data entering and emerging from AI models? Can you elaborate on these issues?
There are a couple of areas where improving data quality can help AI, whether it’s removing bad data or adding features. Either improvement leads to more accurate results when training models. With more of the right data in their training workflows, engineers can execute a lot quicker, get better results, and feed better data into a model.
What is MathWorks doing to address these issues?
A lot, actually. What’s interesting to look at are the core users we’re working with and how they are applying AI. MATLAB users are engineers and scientists who are trained in an area of focus and have built up expertise in that specific area.
They may not be AI experts, but they are the domain experts who are well-positioned to take a data-centric approach to building AI models. They can look at raw data for an application and say, “Here’s the type of data,” “Here are the features,” and “Here’s how I would remove unwanted data to build up a high-quality dataset.”
The tools that MathWorks builds are focused on these domain experts. If you’re interested in medical imaging, we have an image labeling app that lets a domain expert label images as quickly as possible. We also have broad capabilities for other areas.
How can engineers align the needs of a particular domain or application to the data needed to run a successful AI model?
With regard to training models, we offer another option that is not image-related; instead, it’s based on signals. Signal-based applications don’t work with raw data alone; you have to do feature extraction and feature engineering, taking the raw data and adding extra features to enhance prediction capabilities. Domain experts in signal processing know best which feature extraction algorithm to choose, or how to take an audio signal and convert it to the frequency domain. They can look at the raw signal and apply wavelet scattering. These domain experts can try and test those features to build a successful model.
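As a rough illustration of what converting a raw signal into features can look like, here is a minimal Python sketch (it is not MathWorks code). It computes a discrete Fourier transform by hand and keeps two features: the dominant frequency and the RMS level. The function names, the synthetic test signal, and the choice of features are all assumptions made for the example; a domain expert would pick features suited to the application.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Return DFT magnitudes for bins 0..n/2 (direct O(n^2) sum, fine for short signals)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(total) / n)
    return spectrum


def extract_features(samples, sample_rate):
    """Hypothetical feature set: dominant frequency (Hz) and RMS amplitude."""
    spectrum = magnitude_spectrum(samples)
    # Skip bin 0 (the DC offset) when looking for the strongest component.
    peak_bin = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {"dominant_hz": peak_bin * sample_rate / len(samples), "rms": rms}


# Synthetic raw signal (assumed): a strong 8 Hz tone plus a weaker 20 Hz tone.
sample_rate = 64
signal = [
    math.sin(2 * math.pi * 8 * t / sample_rate)
    + 0.3 * math.sin(2 * math.pi * 20 * t / sample_rate)
    for t in range(64)
]
features = extract_features(signal, sample_rate)
```

The resulting `features` dictionary, rather than the 64 raw samples, is what would be passed on to model training.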
What are the best data optimization techniques? And how can engineers implement successful data optimization techniques, including image optimization, noise elimination, and code development?
Optimization is an iterative process: you run tests over and over until you find the optimal result. Finding the best possible data to train your model takes this kind of repeated testing, and if you do it manually, building your own testing framework is time-consuming.
Engineers like to do things as efficiently as possible; they have to do more with less and get things done in a short time. With an experiment manager application, they can use a low-code app to set up an experiment with the tests or trials they want to run and the quantities they want to optimize. Once set up, the app will run those trials within the experiment and quickly sort out which one produces the best model. Tools within MathWorks let engineers try different datasets with different features efficiently.
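The idea of running a set of trials and ranking them can be sketched in a few lines of Python. This is an illustrative stand-in, not the MathWorks app: each trial trains a toy nearest-centroid classifier on a different subset of features and scores it on held-out rows. The data, feature names, and scoring rule are invented for the example.

```python
from itertools import combinations

# Hypothetical toy data: (feature values, label). Only "rms" separates the classes.
train = [
    ({"rms": 0.2, "peak_hz": 50, "temp": 21}, 0),
    ({"rms": 0.3, "peak_hz": 52, "temp": 35}, 0),
    ({"rms": 0.9, "peak_hz": 51, "temp": 22}, 1),
    ({"rms": 1.1, "peak_hz": 49, "temp": 34}, 1),
]
holdout = [
    ({"rms": 0.25, "peak_hz": 51, "temp": 33}, 0),
    ({"rms": 1.0, "peak_hz": 50, "temp": 20}, 1),
    ({"rms": 0.35, "peak_hz": 49, "temp": 23}, 0),
    ({"rms": 0.8, "peak_hz": 52, "temp": 36}, 1),
]


def centroids(rows, features):
    """Average the selected features per class label."""
    sums, counts = {}, {}
    for values, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, name in enumerate(features):
            acc[i] += values[name]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, values, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    point = [values[name] for name in features]
    return min(
        model,
        key=lambda label: sum((p - c) ** 2 for p, c in zip(point, model[label])),
    )


def run_trial(features):
    """One trial: train on the chosen features, return hold-out accuracy."""
    model = centroids(train, features)
    hits = sum(predict(model, values, features) == label for values, label in holdout)
    return hits / len(holdout)


names = ["rms", "peak_hz", "temp"]
trials = [c for size in range(1, len(names) + 1) for c in combinations(names, size)]
# Rank by accuracy (highest first); break ties in favor of fewer features.
results = sorted(
    ((run_trial(combo), combo) for combo in trials),
    key=lambda r: (-r[0], len(r[1])),
)
best_score, best_features = results[0]
```

Features are left unscaled here, so `peak_hz` and `temp` swamp `rms` in any trial that includes them together; a real experiment would normally sweep preprocessing choices such as normalization as well.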
How can engineers implement best practices emerging from data-centric AI such as reduced order modeling and data synchronization?
Reduced order modeling is a full topic in itself. It applies where you have a high-fidelity model, such as a computational fluid dynamics model, that takes a long time to simulate because of its complexity and size. Such a model is based on core math principles, takes a lot of compute time, and produces a lot of data. It is also expensive to run.
You can take the rich, high-quality data that a high-fidelity model generates and feed it into an AI model. You can then use deep learning to train that model to reproduce the types of data the high-fidelity model produced. That is the core of reduced order modeling, and it is a big area for us in MATLAB.
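The workflow described above (sample the expensive model, then fit a cheap one to its outputs) can be shown with a deliberately small Python sketch. Two substitutions are assumed for brevity: the "high-fidelity model" is a simple step-by-step numerical integration, and the surrogate is a cubic polynomial fitted by least squares rather than a deep network. All names are hypothetical.

```python
def high_fidelity_model(x, steps=2000):
    """Stand-in for a slow simulation: Euler integration of dy/dt = -x*y over t in [0, 1]."""
    y = 1.0
    dt = 1.0 / steps
    for _ in range(steps):
        y += dt * (-x * y)
    return y


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree=3):
    """Least-squares polynomial fit via the normal equations."""
    n = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    return solve_linear_system(ata, aty)


def surrogate(coeffs, x):
    """Cheap reduced-order stand-in: evaluate the fitted polynomial."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# 1. Run the expensive model on a grid of inputs to generate training data.
train_x = [2.0 * i / 20 for i in range(21)]
train_y = [high_fidelity_model(x) for x in train_x]
# 2. Fit the cheap surrogate to that data.
coeffs = fit_polynomial(train_x, train_y)
# 3. Check the surrogate against the expensive model at unseen inputs.
test_x = [0.05 + 0.1 * i for i in range(20)]
max_error = max(abs(surrogate(coeffs, x) - high_fidelity_model(x)) for x in test_x)
```

Once fitted, each call to `surrogate` costs a handful of multiplications instead of thousands of integration steps, which is the trade that reduced order modeling makes: a small loss of accuracy for a large gain in speed.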
Data synchronization can be employed to increase the quality of data. When the data used to train a model is in sync, you can get a more accurate and successful model. For example, if you have two sensors with two different sets of time stamps, you can align and sync that data together. It is standard practice whenever data has to be merged, and MathWorks offers tools for this kind of data analysis and syncing.
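The two-sensor case can be sketched in plain Python as a nearest-timestamp join. This is a minimal illustration, not MathWorks tooling: the function name, the sample readings, and the 0.25-second tolerance are all assumptions, and other strategies such as interpolation or resampling to a common rate are equally valid.

```python
from bisect import bisect_left


def align_nearest(ref_times, other_times, other_values, tolerance):
    """For each reference timestamp, take the other sensor's nearest reading.

    Assumes other_times is sorted and non-empty. Returns None where the
    nearest reading is further away than `tolerance`.
    """
    aligned = []
    for t in ref_times:
        i = bisect_left(other_times, t)
        # The nearest reading is either just before or just after position i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_times)]
        best = min(candidates, key=lambda j: abs(other_times[j] - t))
        if abs(other_times[best] - t) <= tolerance:
            aligned.append(other_values[best])
        else:
            aligned.append(None)
    return aligned


# Hypothetical readings: sensor A samples once per second, sensor B irregularly.
sensor_a_times = [0.0, 1.0, 2.0, 3.0]
sensor_a_values = [10, 11, 12, 13]
sensor_b_times = [0.1, 0.9, 2.6, 5.0]
sensor_b_values = [100, 110, 120, 130]

aligned_b = align_nearest(sensor_a_times, sensor_b_times, sensor_b_values, tolerance=0.25)
merged = list(zip(sensor_a_times, sensor_a_values, aligned_b))
```

Rows where sensor B has no reading close enough come back as `None`, so the gap stays visible rather than being silently filled with a stale value.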
What examples can MathWorks share on the practical benefits of data-centric AI within beneficial real-world applications? Looking ahead, what does MathWorks see coming down the pipe in this space?
We work closely with users in industry who engineer systems such as planes, automobiles, and manufacturing equipment, as well as users in other industrial areas. Honeywell improved its own workflows with data-centric AI. The company was looking into the best possible way to label audio data so its models could be as accurate as possible.
Their labeling process was manual and labor-intensive, so Honeywell used MathWorks tools to build another model to automate this process. They used MATLAB Signal Labeler to optimize this approach.
Right now, we have a medical imaging label application that is available for labeling medical images. This was inspired by a recent release of a medical imaging toolbox and library for this specific domain. In the future, we see ourselves evolving beyond broad application areas into more specialized domain areas. We already have image and signal applications for particular domains.
Medical imaging is one example; we also have one for audio, and a toolbox with pre-processing techniques for automating feature extraction.