With all the investment pouring into Artificial Intelligence (AI) and Machine Learning (ML) development, it’s an understatement to say that data is important. It’s everything. While it’s promising to see the Department of Defense (DoD) turn its focus to establishing a common data standard across the services, data readiness is already behind schedule. As is the case with many federal acquisitions, this issue risks delaying the emergence of critical future AI/ML capabilities. Right now, in fact, this situation is playing out in one of the most important procurements happening at the Pentagon: the $10 billion Joint Enterprise Defense Infrastructure (JEDI) contract with Microsoft.
In its pursuit to adopt and leverage AI/ML capabilities, the DoD is coming face to face with a data issue. It’s not that the DoD lacks data; rather, there is a lot of it, and it’s too messy for the models the department hopes to train. To realize the return on the investment made in algorithm development to date and stay ahead of our near-peers, the DoD must pursue a collaborative and timely data acquisition strategy that engages a Data Readiness Consortium and leverages an open-source development framework and streamlined acquisition approaches such as the Commercial Solutions Opening (CSO).
Data engineering and data custodianship are at the heart of a successful advanced analytic AI/ML practice. The tools and processes developed to enact these capabilities will be most efficiently acquired by the DoD in the most collaborative of methods — by mirroring the open-source community structure that has been so successful for private industry and academic practitioners. If tempered with risk mitigation efforts to maintain security and accountability, the open-source community structure offers the best chance for the success of a truly iterative, scalable, fast-paced data engineering practice in support of ongoing AI/ML development.
An open-source community for DoD data
To understand the challenge faced by the DoD and Joint AI Center (JAIC) is to understand the decentralized nature of each of the Services. It also means understanding the importance of getting all services positioned and ready to leverage the adoption of AI/ML capabilities. There are many stakeholders and missions across the DoD that want to adopt AI/ML with the data they are collecting. Achieving data readiness necessitates a decentralized approach to managing performers and achieving standardization, productization, scalability, and (especially) speed to transaction. Particularly considering its preference for commercial buying, the DoD must consider the open-source development that the commercial industry has adopted as an AI/ML best practice. The community-based model is emblematic of how the open-source development community approaches development in general, and data science and engineering technology in particular. It draws upon a voluntary formation of a stakeholder community that opts in to gain a voice in data governance standards. In so doing, it helps to prevent exceptions and carve-outs in data governance.
The main characteristic of open-source software development is that the code base of a project is publicly exposed, with licensing terms that allow for its reuse and modification.
Actual licensing terms may vary and commonly include limited restrictions, such as requiring that the licensing terms be included with any redistribution. More than just being reviewable, the source code may be “cloned” (i.e., copies made) and “forked” (i.e., new versions spun off) for incorporation into new technologies after review and testing, which may be manual or automated.
Within the proposed community formed by government stakeholders and consortium members, all code will be viewable and transparent, with a secure “perimeter” maintained around it to shield code from those outside the consortium, either on government-owned networks or approved corporate networks. These allowances create near-total transparency for the members of the consortium.
The secondary characteristic of open-source software development is that it is collaborative.
The DoD desires collaboration from government partners, industry, and academia when it comes to data governance and data engineering because it leads to a high-quality code base and democratically vetted data governance. Data engineering, Infrastructure-as-a-Service, and data analytics/ML also benefit from the collaborative nature of open-source software development processes.
Actively developed public-facing projects are commonly collaborative. Developers volunteer to join the development team to have a seat at the table. Because they can come from anywhere across industry and academia, the team may represent many organizations. The many sets of eyes on the code and dependencies among these varied contributors, along with the process of applying code reviews before acceptance and the rapid feedback from users, quickly reveal bad or malicious code that might be introduced.
To equip participants to interact in a manner resembling how the open-source development community interacts across the data services under consideration by JAIC, the DoD would need to establish a Data Readiness Consortium as its default community of developers. The consortium would provide stability and an ideal framework for a community-based model for the DoD acquisition team. The model would offer unlimited participation and collaboration that JAIC desires from a diverse vendor set (Traditional, Non-Traditional, and Academia). It would also enable JAIC to onboard relevant new start-ups focused on data science and Mid-Tier System Integrators with agile teams that are best positioned to support the DoD ML space.
(In contrast, a traditional acquisition approach can generate a Multi-Award contract, but the framework falls short and limits participation to just a subset of vendors at a given point in time after source selection.)
Leveraging the Defense CSO Pilot Program – Class Deviation 2018-O0016
With speed to transaction top of mind, it is imperative to determine what available frameworks and permissible acquisition authorities currently exist. The consortium approach is currently available and widely adopted. Nonetheless, the DoD has eschewed any sort of collaboration model. Instead, it is utilizing frameworks and authorities more like Multi-Award Task Order Contracts (MATOCs). If the models properly align, individual developers (and their organizations) gain an inherent incentive to opt in to gain a “voice at the table” about what is included and omitted in developing technology. Within a modified open-source community, these industry consortiums can enhance and facilitate the multi-party arrangements.
Given the ever-evolving state of the data engineering landscape, we believe that a serial approach to modernizing it (the approach a Multi-Award Contract, or MAC, must follow) is premature. Leveraging the desire for collaboration and competition is more relevant in the near-term. If a CSO posted within a Data Readiness Consortium follows modified open-source protocols, the DoD can leverage agile methods, foster collaboration, and maximize competition to speed and scale novel solutions. This opportunistic acquisition model allows for a rapid approach in evaluating and selecting solutions (speed to award) as well as the ability to quickly transition from solution/prototype to production (scalability).
A fundamental feature of the regulations surrounding the CSO is the breadth of its range. The program allows the DoD to purchase innovative commercial technologies and services that are new at the date of submission. It also enables the DoD to purchase new applications of commercial technologies and services, as well as R&D. These allowances are necessary because state-of-the-art AI/ML and supportive data engineering are changing so rapidly. A flexible vehicle is needed to keep up with the speed of relevance.
Example CSO collaboration model
Vehicle framework
The CSO would be configured similarly to a Broad Agency Announcement (BAA), a competitive solicitation procedure used to obtain proposals for basic and applied research around a “general” long-term solicitation posting. A BAA typically is useful for broad research topics and objectives. Where a BAA is limited to basic and applied R&D, a CSO also allows for commercial research through operational systems development — so long as it is determined “innovative” under the CSO definition. As far as open-source development is concerned, a justification can be made that each line of effort to enhance the data development can be viewed as a new application or adaptation of existing technology, process, or method, per the definition. A CSO is similar to a BAA in that it allows for flexibility in evaluating offers or solutions based on individual merits, without considering trade-offs or value.
Research laboratories have recently adopted “Special Notice” postings, released under a BAA posting, to narrow the definition of efforts within a specific research topic. In these cases, we propose a tiered approach consisting of a centralized CSO posting of umbrella data readiness topics and special notices or solicitations for specific efforts.
Tier One: Centralized CSO posting of umbrella data readiness topics or initiatives — updated no less than annually
To support the modified open-source community and to encourage collaboration, consortium members can submit unsolicited solutions on an ongoing basis, provided they apply to broad topic areas defined in the umbrella posting. The consortium managing member collects the submissions against the umbrella posting and shares them via a central repository for the JAIC and registered decentralized agencies. The DoD decentralized services and the JAIC then can review submissions and invite vendors to formally submit a proposal for further consideration.
Tier Two: Special notices or solicitations for specific efforts published through the consortium manager
The second tier of this CSO approach is the Special Notices and Solicitations process. This is the primary method of solicitation for decentralized DoD entities seeking services for their data science objectives, following the standardized modified open-source purchase process suggested below. The DoD entity would engage the consortium manager to define and tailor the units of work for the problem set. With that input and an understanding of how to purchase, the entity would own the solicitation and evaluation process and execute the award(s). Because the need is for a new, innovative vehicle, it is crucial that decentralized customers and their acquisition and contracting teams follow a clearly defined process for the allowable types of CSOs. They also must have the requisite business acumen to execute, monitor, and administer the vehicles successfully.
A core benefit of the consortium and CSO for this application is its flexibility and opportunistic approach. Unlike a traditional source selection for Task Order Requests under a MAC, where all offers must be evaluated against a set of predetermined criteria, the government is authorized to select offers without having to thoroughly evaluate each one. It can freely negotiate with the offerors of the most advantageous offers and has the option to down-select portions of work from each offer. (For example, if an agency is sourcing units of work A, B, C, and D, it may choose to fund A and D from Offeror 1, B from Offeror 2, and C from Offeror 3.)
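The down-select example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the offeror names, the units of work, and especially the numeric "merit" scores are hypothetical, and nothing in the CSO authority implies that the government reduces its judgment to a single number.

```python
# Hypothetical merit scores (higher is better) for each unit of work an
# offeror proposed. Offerors need not propose every unit.
OFFERS = {
    "Offeror 1": {"A": 9, "B": 6, "D": 8},
    "Offeror 2": {"B": 8, "C": 5},
    "Offeror 3": {"A": 4, "C": 7},
}


def down_select(offers):
    """Return {unit of work: offeror} choosing the top-scoring offer per unit.

    Ties go to the offeror whose name sorts first (an arbitrary assumption).
    """
    best = {}  # unit -> (offeror, score)
    for offeror in sorted(offers):
        for unit, score in offers[offeror].items():
            if unit not in best or score > best[unit][1]:
                best[unit] = (offeror, score)
    return {unit: offeror for unit, (offeror, _) in sorted(best.items())}


if __name__ == "__main__":
    for unit, offeror in down_select(OFFERS).items():
        print(f"Unit {unit}: fund from {offeror}")
```

With these made-up scores, units A and D go to Offeror 1, B to Offeror 2, and C to Offeror 3, mirroring the example in the text.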
Standardizing the modified open-source development acquisition process
Regardless of the award instrument, all efforts under a CSO are considered commercial, so all contract types are limited to Fixed Price. (Cost Reimbursement, Labor Hour, and Time & Materials are prohibited.) Those mandates do not limit the suggested modified open-source development acquisition approach. It is common to purchase commercial industry development work in defined units under a Fixed Price Agreement. To gain buy-in from industry (specifically non-traditional businesses), the DoD must attempt to purchase as close to industry standards as possible.
One suggested method of standardizing development procurement is in the form of “units of work,” set by the consortium based on a predetermined estimate equal to a two-week sprint. With Agile development in mind and based on the size, scope, and complexity of a defined project/objective, teams would propose the number of units (sprints) required to achieve the set of goals. The definition of what comprises a sprint unit is important, as it may vary by contract. Line items in the DoD’s current award instruments can be structured to closely follow industry practice by designating the “Sprint” as the “Unit of Issue.” The estimated number of sprints proposed by the awardee can then be allocated to the line item.
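To make the sprint-based line item concrete, here is a minimal sketch in Python. The class name, field names, week estimate, and price are all hypothetical and chosen purely for illustration; the only details taken from the text are the two-week sprint, the "Sprint" unit of issue, and the fixed-price structure.

```python
from dataclasses import dataclass
from decimal import Decimal


def sprints_needed(estimated_weeks: int, sprint_weeks: int = 2) -> int:
    """Round an effort estimate up to whole sprints (ceiling division)."""
    return -(-estimated_weeks // sprint_weeks)


@dataclass(frozen=True)
class LineItem:
    """A fixed-price contract line item (illustrative, not an official schema)."""
    description: str
    unit_of_issue: str
    quantity: int          # number of sprints proposed by the awardee
    unit_price: Decimal    # fixed price per sprint

    @property
    def extended_price(self) -> Decimal:
        """Total fixed price for the line item."""
        return self.unit_price * self.quantity


if __name__ == "__main__":
    # Hypothetical numbers: an 11-week effort at $50,000 per sprint.
    item = LineItem(
        description="Data pipeline standardization",
        unit_of_issue="Sprint",
        quantity=sprints_needed(11),
        unit_price=Decimal("50000"),
    )
    print(item.quantity, item.unit_of_issue, item.extended_price)
```

Rounding up to whole sprints keeps every line item in the same fixed-price unit, which is the point of treating the sprint as the unit of issue.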
Advancing data readiness is achievable
While the challenges with data volume and management are real for the DoD, these approaches and the historical successes in the commercial industry tell us that data readiness is an achievable objective. Establishing a core cadre of members as the Data Readiness Consortium in the near-term and providing continuous on-ramping to keep that consortium vibrant and growing will allow the DoD to avoid protests that prevent work from ever beginning. If there was ever an occasion for the DoD to leverage the streamlined acquisition authorities available for innovative development, this is it.