In any AI/ML problem, the data is usually more challenging than the analytics itself. Data are messy, unlabeled, hard to access, and often incomplete.
The data pipelines we build at NT Concepts are the framework that takes care of this problem so that model development can actually happen.
So what is DataOps? We want our data to be discoverable, accessible, and intelligible. That means data needs to be easy to find and easy for analysts and data scientists to use, and the schemas and visuals we produce need to make sense and be useful to them.
This whole process is about preparing data sets. In a data set, we need:
- Data to contain enough information to solve whatever problem is at hand
- The data needs to be complete and accurate
- There has to be buy-in upfront for work to be done this way every time
We all know that no data arrives like that. So we have a process to get it into an acceptable form where it can be used in the ways we need. To do this, we first have to explore our data sets, which means asking:
- What data do we actually care about? (you don’t want to take everything)
- Is there a schema that we can use? Or are we dealing with something that is unstructured and needs to have some sort of structure added to it?
- How will this data be accessed and how often?
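The exploration questions above can be partly automated. Here is a minimal sketch, using only the standard library and a hypothetical sample file, of profiling a raw data set to see which columns exist, how complete they are, and what their values look like:

```python
import csv
import io

# Hypothetical sample of raw, mixed-quality records (illustrative only).
RAW = """id,timestamp,label
1,2021-03-01T12:00:00Z,cat
2,2021-03-01T12:05:00Z,
3,not-a-timestamp,dog
"""

def profile(csv_text):
    """Infer a rough profile: per-column non-empty counts and example values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_empty = [v for v in values if v]
        report[col] = {
            "total": len(values),
            "non_empty": len(non_empty),
            "examples": non_empty[:2],
        }
    return report

report = profile(RAW)
print(report["label"])  # shows the 'label' column is missing in one of three rows
```

A report like this tells you early which columns you actually care about and whether the data carries a usable schema or needs structure added.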
Once we answer those things we can then move on to the processing step where we transform our data. We want to identify the end state and then figure out what processing needs there are to get from point A to point B.
A lot of times this is where you have to start thinking about scale: what kind of processing do you need, and how often will new data need to be [X]? QA is crucial. You want to make sure your output looks like what it is supposed to look like.
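One way to keep QA baked into the transform step is to validate every output record as it is produced. This is a sketch only; the field names and allowed labels are assumptions for illustration:

```python
from datetime import datetime

def transform(record):
    """Normalize a raw record into the target shape (illustrative fields)."""
    return {
        "id": int(record["id"]),
        "ts": datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00")),
        "label": record["label"].strip().lower(),
    }

def qa_check(record):
    """Reject output that doesn't look like what it's supposed to look like."""
    assert record["id"] > 0, "ids must be positive"
    assert record["ts"].tzinfo is not None, "timestamps must be timezone-aware"
    assert record["label"] in {"cat", "dog"}, f"unexpected label {record['label']!r}"
    return record

raw = {"id": "1", "timestamp": "2021-03-01T12:00:00Z", "label": " Cat "}
clean = qa_check(transform(raw))
```

Running checks like these on every record, rather than spot-checking at the end, is what lets a pipeline scale without silently degrading.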
Lastly, you want to get your data ready for production. You should be thinking about this the whole time, because this is what will keep you from taking on extra technical debt.
- How should data be stored based on schema and access needs?
- How do we monitor access, and what logins do we need?
- And how are we going to make this data accessible and useful to others, whether that be through an API, a dashboard, or an SDK?
These are all ways to make your data easy to use.
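The production questions above can be sketched as a tiny serving layer. This is a minimal, assumed design (the class and field names are hypothetical) showing keyed access with an audit log, the kind of wrapper that would sit behind an API, dashboard, or SDK:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dataops.access")

class DataService:
    """Minimal serving-layer sketch: keyed lookup with access logging."""

    def __init__(self, records):
        # Store records keyed by id, matching the access pattern we expect.
        self._by_id = {r["id"]: r for r in records}

    def get(self, record_id, user):
        # Every read is logged, so access can be monitored and audited.
        log.info("user=%s requested id=%s", user, record_id)
        return self._by_id.get(record_id)

svc = DataService([{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}])
print(svc.get(2, user="analyst-7"))
```

Choosing the storage layout to match the access pattern (here, a lookup by id) is exactly the "stored based on schema and access needs" question from the list above.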