A committee under the Department of Telecommunications (DoT) has released a draft framework for an Indian Artificial Intelligence Stack, which seeks to “remove the impediments to AI deployment” and essentially proposes setting up a six-layered stack, with each layer handling different functions, including consent gathering, storage, and AI/Machine Learning (AI/ML) analytics. Once developed, this stack will be applied across all sectors, and will provide for data protection, data minimisation, open algorithm frameworks, defined data structures, trustworthiness and digital rights, and data federation (a single database source for front-end applications), among other things. The paper also said that there is no “uniform” definition of AI.

This committee — the AI Standardisation Committee — had, in October last year, invited papers on Artificial Intelligence addressing different aspects of AI, such as functional network architecture, AI architecture, and the data structures required, among other things. At the time, the DoT had said that as the proliferation of AI increases, there is a need to develop an Indian AI stack to bring interoperability, among other things. Here is a summary of the draft Indian AI Stack; comments on it can be emailed to aigroup-dot@gov.in or diradmnap-dot@gov.in until October 3.

How the proposed Indian AI stack looks, on paper

The stack will be made up of five main “horizontal” layers, and one “vertical” layer:

1. Infrastructure layer

  • Ensures setting up of a common data controller (an entity that determines the purpose and means of processing personal data) including both public and private clouds
  • Ensures encryption and data minimisation at the cloud
  • Ensures monitoring and data privacy

This is the root layer of the Indian AI stack, over which the entire AI functionality is built. The layer will ensure the setting up of a common data controller, and will involve multi-cloud scenarios spanning both private and public clouds. This is where the infrastructure for data collection will be defined. The multilayer cloud services model will define the relations between cloud service models and other functional layers through three frameworks:

  • Inter cloud control and management plane, for controlling and managing inter cloud applications
  • Inter cloud federation framework, which will basically allow independent clouds belonging to different cloud providers and administrative domains to interact
  • Inter cloud operation framework, which includes functionalities for supporting multi-provider infrastructure operation, and defines the basic relations of resource operation, management and ownership.
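The inter cloud federation idea above can be sketched as a thin routing layer over independent providers. The following is a minimal illustrative sketch; every class and method name here is an assumption, since the paper describes the federation framework only conceptually.

```python
from abc import ABC, abstractmethod

# Hypothetical common contract that each independent cloud provider
# (each in its own administrative domain) is assumed to expose.
class CloudProvider(ABC):
    @abstractmethod
    def store(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def fetch(self, key: str) -> bytes: ...

class InMemoryCloud(CloudProvider):
    """Stand-in for a single provider's storage backend."""
    def __init__(self):
        self._objects = {}

    def store(self, key, value):
        self._objects[key] = value

    def fetch(self, key):
        return self._objects[key]  # raises KeyError if absent

class Federation:
    """Lets independent clouds belonging to different providers interact."""
    def __init__(self, clouds: dict):
        self.clouds = clouds  # name -> CloudProvider

    def store(self, cloud_name, key, value):
        self.clouds[cloud_name].store(key, value)

    def fetch_any(self, key):
        # Query each federated cloud until one holds the object.
        for cloud in self.clouds.values():
            try:
                return cloud.fetch(key)
            except KeyError:
                continue
        raise KeyError(key)
```

The point of the sketch is the shape of the contract: the federation layer never needs to know which provider holds an object, only that every provider honours the same interface.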

2. Storage layer

  • Ensures that the data is properly archived and stored in a fashion that allows easy access when queried

This layer will have to define the protocols and interfaces for storing hot data, cold data, and warm data (all three defined below). The paper called this the most important layer in the stack, regardless of the size and type of data, since value from data can only be derived once it is processed, and data can only be processed efficiently when it is stored properly. It is important to store data safely for a very long time while managing all factors of seasonality and trends, ensuring that it is easily accessible and shareable on any device, the paper said.

The paper has created three subcategories of data depending on the relevance of data and its usability:

  • Fast data/Hot data: This requires the fastest and most expensive storage, and this is where frequently used data will be stored. To access this data, it will have to be stored in hybrid storage environments.
  • Cold data: This data is accessed less frequently and can be stored on slower, and consequently less expensive, storage media in-house or in the cloud. This tier is designed to store data for a very long duration or for archival purposes, including data that is no longer in active use and might not be needed for months, years, decades, or perhaps ever.
  • Warm data: The paper didn’t define precisely what this data is, saying only that it sits “between” cold and hot data. It is worth noting that the paper hasn’t given any examples of what these three data types might be.

Categories of data
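The three-tier classification above can be sketched as a simple routing rule. The recency thresholds below are assumptions for illustration only; the paper does not say how data is assigned to a tier.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds -- the paper does not specify how hot/warm/cold
# are decided, so last-access recency is assumed here for illustration.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)

def classify_tier(last_accessed: datetime, now: datetime) -> str:
    """Route a record to a storage tier based on how recently it was accessed."""
    age = now - last_accessed
    if age <= HOT_WINDOW:
        return "hot"    # fastest, most expensive storage; frequently used data
    if age <= WARM_WINDOW:
        return "warm"   # intermediate tier between hot and cold
    return "cold"       # slow, cheap storage for archival data
```

In a real deployment the routing signal could equally be query frequency or a declared retention class; the sketch only shows that the tiering decision can be a single pure function at the storage layer's boundary.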

3. Compute layer

  • Ensures proper AI & ML analytics
  • Ensures templates for data access and processing for the open algorithm framework are in place
  • Ensures Natural Language Processing (a branch of AI that helps computers understand, and contextualise human language)
  • Deep learning (the ability of an AI system to analyse data without human supervision) and neural networks, which make up the backbone of deep learning algorithms

This layer, through a set of defined protocols and templates, ensures an open algorithm framework. The AI/ML processes could include Natural Language Processing (NLP), deep learning, and neural networks. This layer will also define data analytics, including “data engineering”, which focuses on practical applications of data collection and analysis, apart from scaling and data ingestion. Technology mapping and rule execution will also be part of this layer.

The paper acknowledged the need for a proper data protection framework: “…the Compute layer involves analysis to mine vast troves of personal data and find correlations, which will then be used for various computations. This raises various privacy issues, as well as broader issues of lack of due process, discrimination and consumer protection.

The data so collected can shed light on most aspects of individuals’ lives. It can also provide information on their interactions and patterns of movement across physical and networked spaces and even on their personalities. The mining of such large troves of data to seek out new correlations creates many potential uses for Big Personal Data. Hence, there is a need to define proper data protection mechanism in this layer along with suitable data encryption and minimisation.” — from the paper

The compute layer will also define a new way to build and deploy enterprise service-oriented architectures, along with providing transparent computing architecture over which the industry could develop their own analytics. It will have to provide for a distinction between public, shared and private data sources, so that machine learning algorithms can be applied against relevant data fields.

The report also said that the NITI Aayog has proposed an AI-specific cloud compute infrastructure, which will facilitate research and solution development using high-performance and high-throughput AI-specific supercomputing technologies. The broad specifications for this proposed cloud controller architecture may include:

  • Multi-user computing support
  • Resource partitioning and provisioning
  • Machine Learning / Deep Learning software stack — training and inferencing development kit, frameworks, libraries, cloud management software
  • Support for varieties of AI workloads and ML / DL frameworks for user choices
  • Low latency high bandwidth network
  • Multi-layer storage system to ingest and process multi-petabytes of big data
  • Compatibility with National Knowledge Network (NKN), which is a pan-Indian network, capable of providing secure and reliable connectivity, to various entities such as universities, research institutions, libraries, laboratories, healthcare and agricultural institutions.

Proposed architecture of AI specific controller

4. Application layer

  • Ensures that the backend services are properly and legitimately programmed
  • Develops a proper service framework
  • Ensures proper transaction movement, and that proper logging and management are put in place for auditing if required at any point of time.

The paper described this as a “purpose-built” layer through which software and applications can be hosted and executed as a service layer. It will support various backend services for the processing of data, and will provide a proper service framework for the AI engine to function. It will also keep track of all transactions across the stack, helping to log activities for auditing.
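The transaction-tracking idea in the application layer can be sketched as a logging wrapper around every backend service call. Everything below (the decorator name, the log format, the placeholder service) is an illustrative assumption, not something the paper specifies.

```python
import functools
import time

# Hypothetical audit trail: each backend service call is recorded so it
# can be reviewed later, as the application layer is meant to do.
transaction_log = []

def audited(service):
    """Wrap a backend service so every invocation is logged for auditing."""
    @functools.wraps(service)
    def wrapper(*args, **kwargs):
        result = service(*args, **kwargs)
        transaction_log.append({"service": service.__name__, "ts": time.time()})
        return result
    return wrapper

@audited
def score_customer(data: dict) -> int:
    # Placeholder backend computation; a real service would call the
    # compute layer's analytics here.
    return len(data)
```

The decorator pattern keeps the audit concern out of the service bodies themselves, which matches the layered separation the paper describes.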

5. Data / information exchange layer

  • Provides for end customer interface
  • Has consent framework for data consent from/to customers
  • Provides various services through secured gateway services
  • Ensures that digital rights are protected and the ethical standards maintained
  • Provides for open API access to the data, and has chatbot access along with various AI/ML apps.

This layer will define the end customer experience through defined data structures and proper interfaces and protocols. It will have to support a proper consent framework for access to data by/for the customer. Provision for consent can be for individual data fields or for collective fields. This layer will also host gateway services. Typically, different tiers of consent will be made available to accommodate different tiers of permissions, the paper said.
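The tiered consent described above can be sketched as follows. The tier names, the field groupings, and the per-field grant mechanism are all assumptions for illustration; the paper only says consent may cover individual data fields or collective fields, with different tiers of permissions.

```python
from dataclasses import dataclass, field

# Hypothetical consent tiers -- collective groups of data fields.
# The paper does not name tiers; these are invented for illustration.
TIERS = {
    "basic": {"name"},
    "extended": {"name", "email", "location"},
    "full": {"name", "email", "location", "health_records"},
}

@dataclass
class ConsentRecord:
    user_id: str
    tier: str
    # Grants for individual data fields, on top of the collective tier.
    extra_fields: set = field(default_factory=set)

    def permits(self, field_name: str) -> bool:
        """Check whether the customer has consented to sharing this field."""
        return field_name in TIERS[self.tier] or field_name in self.extra_fields
```

The design mirrors the paper's two consent modes: the tier implements consent for collective fields, while `extra_fields` implements consent for individual fields.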

This layer also needs to ensure that ethical standards are followed to ensure digital rights. “In the absence of a clear data protection law in the country, the EU’s General Data Protection Regulation (GDPR) or any of the laws can be applied. This will serve as interim measure until Indian laws are formalised,” the paper said.

6. Security and governance layer (vertical layer)

  • This is a cross-cutting layer across all the above layers that ensures that AI services are safe, secure, privacy-protected, trusted, and assured.

This layer will ensure the process of security and governance for all the preceding five horizontal layers. There will be an “overwhelming flow” of data through the stack, which is why there is a need to “ensure encryption at different levels”, the paper said. This may require setting up the ability for handling multiple queries in an encrypted environment, among other things. Cryptographic support is also an important dimension of the security layer, the paper said.

Why this layer is important, per the paper: “…data aggregated, transmitted, stored, and used by various stakeholders may increase the potential for discriminatory practices and pose substantial privacy and cybersecurity challenges. The data processed and stored in many cases include geolocation information, product-identifying data, and personal information related to use or owner identity, such as biometric data, health information, or smart-home metrics

Data storage in backend systems can present challenges in protection of data from cyberattacks. In addition to personal-information privacy concerns, there could be data used in system operation which may not typically be personal information. Cyber attackers could misuse these data by compromising data availability or changing data, causing data integrity issues, and use big data insights to reinforce or create discriminatory outcomes. When data is not available, causing a system to fail, it can result in damage; for example, a smart home’s furnace overheats or an individual’s medical device cannot function when required.” — from the paper

What the proposed AI stack looks like

Benefits of the AI stack, per the paper

According to the report, the key benefits of this proposed AI stack are:

  • Easy interface (vertical or horizontal) with end user application
  • Secure storage environment that simplifies the archiving and extraction of data based on the data classification
  • Ensures protection of data, data federation, data minimisation, open algorithm framework, defined data structures, interfaces and protocols, monitoring, audit and logging, and trustworthiness, among other things
  • Ensures legitimacy of backend services, and provides services through secured gateway services to the customer
  • Protection of digital rights and maintaining ethical standards
  • Consent for use of data from customers will be taken through “properly framed consent framework”
  • Enables the provision of safe, secure, and trusted AI services to the customer, along with open API integration, and facilitates load balancing, security, failover capabilities, and a multi-tenant architecture for concurrent users
  • Enforces the usage of government Public Key Infrastructure (PKI) services, which is essentially a system for the creation, storage, and distribution of digital certificates which are used to verify that a particular public key belongs to a certain entity.

How data will flow through the AI stack

This is how the paper proposes data flow through the stack:

  1. “Generic” public/private data will be input to the multi-cloud data controller, which will be “monitored” for data privacy concerns and sent to next stage for data encryption verification (The paper doesn’t explain how this “monitoring” will happen)
  2. The input data will be encrypted and stored in the storage layer
  3. The data that flows into the storage layer will then be “cleaned”, “refined”, and categorised depending on the requirement, and data type (hot, cold, warm)
  4. Data from the storage layer will be made available to the compute layer
  5. The data will be processed for various AI/ML analytics through deep learning, machine learning, natural language processing techniques, etc. All movement of data will be recorded
  6. Data will then be finally accessed by the end-user through the data/information exchange layer where “trustworthiness” of the data will be defined for verification. “The data will also be defined to ascertain digital rights and ethical standards”.
  7. The refined data will be available through open APIs (Application Programming Interfaces, software intermediaries that allow two applications to talk to each other), as well as to various apps designed through these open APIs. Various feedback mechanisms, including dashboards, chats, etc., for future processing will be defined here
  8. The data obtained from the compute layer will be used for developing the services framework and the backend services, which reside in the application layer
  9. The processing algorithm will be defined as an open algorithm framework before being accessed by the application layer for movement of transaction

Proposed AI flowchart
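The flow above can be sketched end-to-end as a chain of layer functions. Everything in this sketch is an illustrative assumption: the function names, the checksum standing in for the paper's encryption step, and the placeholder analytics.

```python
import hashlib

# The application layer is meant to record every movement of data.
audit_log = []

def infrastructure_layer(record: dict) -> dict:
    """Ingest via the multi-cloud data controller; checksum stands in for encryption."""
    audit_log.append("ingested")
    record["checksum"] = hashlib.sha256(
        repr(sorted(record.items())).encode()
    ).hexdigest()
    return record

def storage_layer(record: dict, tier: str) -> dict:
    """Clean, refine, and categorise the data into hot/warm/cold tiers."""
    audit_log.append(f"stored:{tier}")
    record["tier"] = tier
    return record

def compute_layer(record: dict) -> dict:
    """Run AI/ML analytics; a trivial placeholder computation here."""
    audit_log.append("analysed")
    record["score"] = len(record)
    return record

def exchange_layer(record: dict) -> dict:
    """Serve the end user, dropping internal fields from the response."""
    audit_log.append("served")
    return {k: v for k, v in record.items() if k != "checksum"}

result = exchange_layer(
    compute_layer(storage_layer(infrastructure_layer({"id": 1}), "hot"))
)
```

The sketch's only real claim is structural: data passes through the layers in order, every hop is logged, and only the exchange layer decides what the end user sees.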

How the paper proposes to tackle algorithmic bias

In AI, the thrust is on how efficiently data is used, the paper said, noting that if the data is “garbage”, then the output will also be garbage. For example, if programmers or AI trainers transfer their biases to the AI, the system will become biased, the paper said. “There is a need for evolving ethical standards, trustworthiness, and consent framework to get data validation from users,” the paper suggested.

“The risks of passive adoption of AI that automates human decision-making are also severe. Such delegation can lead to harmful, unintended consequences, especially when it involves sensitive decisions or tasks and excludes human supervision,” the paper said. It cited Microsoft’s Twitter chatbot Tay as an example of what can happen when “garbage” data is fed into an AI system. Tay had started tweeting racist and misogynist remarks in less than 24 hours of its launch.

Need for openness in AI algorithms: The paper said it was necessary to have an open AI algorithm framework, along with clearly defined data structures. It referenced how the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) software, used by some US courts to predict the likelihood of recidivism in criminal defendants, was demonstrated to be biased, since the AI “black box” was “proprietary”.

  • As AI becomes more intelligent, it becomes more effective at its tasks of prediction and decision-making, but conversely its processes also become less transparent to humans, the paper said. It added that this “opaque” problem leads to a lack of control and supervision by controllers and users of AI, ultimately risking progress. “Thus, there is a need to ensure unbiased open architecture at Application level”.
  • The paper said that the main effect of opening up existing AI, through open-sourcing code and placing related intellectual property into the public domain, will be to accelerate the diffusion and application of current techniques. “Software and knowledge about algorithms are non-rival goods. Making them freely available would enable more people to use them, at low marginal cost,” it said.

“As AI learns to address societal problems, it also develops its own hidden biases. The self learning nature of AI means, the distorted data the AI discovers in search engines, perhaps based upon “unconscious and institutional biases”, and other prejudices, is codified into a matrix that will make decisions for years to come. In the pursuit of being the best at its task, the AI may make decisions it considers the most effective or efficient for its given objective, but because of the wrong data, it becomes unfair to humans,” the report said.
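The “distorted data codified into a matrix” problem the paper describes can be illustrated with a deliberately tiny toy: a trivial majority-vote “model” trained on skewed historical decisions simply reproduces the skew. The data, groups, and labels below are invented purely for illustration.

```python
from collections import Counter

def train_majority(labels):
    """A trivial 'model': predict whatever label was most common in training."""
    return Counter(labels).most_common(1)[0][0]

# Invented historical decisions, biased against group "B": group B was
# denied far more often than approved, regardless of merit.
history = (
    [("A", "approve")] * 80
    + [("B", "deny")] * 80
    + [("B", "approve")] * 20
)

# Train one per-group rule from the biased record. The resulting "matrix"
# of decisions codifies the historical prejudice rather than correcting it.
model_for = {
    group: train_majority([label for g, label in history if g == group])
    for group in ("A", "B")
}
```

However efficient the model is at its stated objective (matching past decisions), the unfairness lives entirely in the training data, which is the paper's point.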

Need to ‘centrally control’ data: Right after the paper made a pitch for having openness in AI algorithms, it proposed that the data fed into the AI system should be controlled centrally. “The data from which the AI learns can itself be flawed or biased, leading to flawed automated AI decisions. This is certainly not the intention of algorithmised decision-making, which is “perhaps a good-faith attempt to remove unbridled discretion — and its inherent biases.” There is thus a need to ensure that the data is centrally controlled including using a single or multiple cloud controllers,” the report said.

Proper storage frameworks for AI: An important factor aiding biases in AI systems is contamination of data, per the paper, which includes missing information, inconsistent data, or simply errors. “This could be because of unstructured storage of data. Thus, there is a need to ensure proper storage frameworks for AI,” it said.

Changing the ‘culture’ of coders and developers: There is a need to change the “culture” so that coders and developers themselves recognise the “harmful and consequential” implication of biases, the paper said, adding that this goes beyond standardisation of the type of algorithmic code and focuses on the programmers of the code. “Since much coding is outsourced, this would place the onus on the company developing the software product to enforce such standards. Such a comprehensive approach would tackle the problem across the industry as a whole, and enable AI software to make fair decisions made on unbiased data, in a transparent manner,” it added.

Why we need an Indian AI stack, per the paper

In the near future, AI will have huge implications for the country’s security, economic activities, and society, the paper said. The risks are unpredictable and unprecedented. Therefore, it is imperative for all countries, including India, to develop a stack that fits into a standard model, one which protects customers, users, business establishments, and the government.

Economic impact: AI will have a major impact on mainly four sectors, per the paper: manufacturing industries, professional services, financial services, and wholesale and retail. The paper also charted out how AI could be used in some specific sectors. For instance, in healthcare, it said in rural areas, which suffer from limited availability of healthcare professionals and facilities, AI could be used for diagnostics, personalised treatment, early identification of potential pandemics, and imaging diagnostics, among others.

Similarly, in the banking and financial services sector, AI can be used for things like the development of credit scores through analysis of bank history or social media data, and fraud analytics for proactive monitoring and prevention of fraud, money laundering, and malpractice, as well as prediction of potential risks, according to the report.

Uses for the government: For governments, for example, cybersecurity attacks can be rectified within hours rather than months, and national spending patterns can be monitored in real time to instantly gauge inflation levels whilst collecting indirect taxes.