Building a production-ready solution in AWS involves a series of trade-offs between resources, time, customer expectation, and business outcome. The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS. By using the Framework, you will learn current operational and architectural recommendations for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in AWS.
An intelligent document processing (IDP) project usually combines optical character recognition (OCR) and natural language processing (NLP) to read and understand a document and extract specific entities or phrases. This IDP Well-Architected Custom Lens provides you with guidance to tackle the common challenges we see in the field. By answering a series of questions in this custom lens, you will identify the potential risks and be able to resolve them by following the improvement plan.
This post focuses on the Security pillar of the IDP solution. We start with an introduction to the Security Pillar and its design principles, and then examine the solution design and implementation across four focus areas: access control, data protection, key and secret management, and workload configuration. By reading this post, you will learn about the Security Pillar in the Well-Architected Framework and how it applies to IDP solutions.
Design principles
The Security Pillar encompasses the ability of an IDP solution to protect input documents, document processing systems, and output assets, taking advantage of AWS technologies to improve security while processing documents intelligently.
All of the AWS AI services (for example, Amazon Textract, Amazon Comprehend, or Amazon Comprehend Medical) used in IDP solutions are fully managed AI services where AWS secures their physical infrastructure, API endpoints, OS, and application code, and handles service resilience and failover within a given Region. As an AWS customer, you can therefore focus on using these services to accomplish your IDP tasks, rather than on securing these elements. The following design principles can help you strengthen your IDP workload security:
Implement a strong identity foundation – Implement the principle of least privilege and enforce separation of duties with appropriate authorization for each interaction with your AWS resources in IDP applications. Centralize identity management, and aim to eliminate reliance on long-term static credentials.
Maintain traceability – AI services used in IDP are integrated with AWS CloudTrail, which enables you to monitor, alert on, and audit actions and changes to your IDP environment with low latency. Their integration with Amazon CloudWatch allows you to integrate log and metric collection with your IDP system to automatically investigate and take action.
Automate current security recommendations – Automated software-based security mechanisms improve your ability to securely scale more rapidly and cost-effectively. Create secured IDP architectures, including the implementation of controls that are defined and managed as code in version-controlled templates by using AWS CloudFormation.
Protect data in transit and at rest – Encryption in transit is supported by default for all of the AI services required for IDP. Pay attention to protection of data at rest and data produced in IDP outputs. Classify your data into sensitivity levels and use mechanisms, such as encryption, tokenization, and access control where appropriate.
Grant least privilege permissions to people – IDP largely reduces the need for direct access and manual processing of documents. Involving only the people needed for case validation or augmentation tasks reduces the risk of document mishandling and human error when dealing with sensitive data.
Prepare for security events – Prepare for an incident by having incident management and investigation policy and processes in place that align to your organizational requirements. Run incident response simulations and use tools with automation to increase your speed for detection, investigation, and recovery.
Focus areas
Before you architect an IDP workload, you need to put practices in place to meet your security requirements. This post focuses on the Security pillar with four focus areas:
Access control – In an IDP application, access control is a key part of ensuring information security. It involves ensuring not only that only authorized users are able to access the application, but also that other services are only able to access the environment and interact with each other in a suitably secure manner.
Data protection – Because encrypting data in transit is supported by default for all of the AI services required for IDP, data protection in an IDP application focuses more on encrypting data at rest and managing sensitive information such as personally identifiable information (PII).
Key and secret management – The encryption approach that you use to secure your IDP workflow may include different keys to encrypt data and authorize users across multiple services and related systems. Applying a comprehensive key and secret management system provides durable and secure mechanisms to further protect your IDP application and data.
Workload configuration – Workload configuration involves multiple design principles, including using monitoring and auditing services to maintain traceability of transactions and data in your IDP workload, setting up incident response procedures, and separating different IDP workloads from each other.
Access control
In the access control focus area, consider the following current recommendations:
Use VPC endpoints to establish a private connection with IDP-related services – You can use Amazon Textract, Amazon Comprehend, and Amazon Simple Storage Service (Amazon S3) APIs through a world-routable network or keep your network traffic within the AWS network by using VPC endpoints. To follow current security recommendations, you should keep your IDP traffic within your VPCs, and establish a private connection between your VPC and Amazon Textract or Amazon Comprehend by creating interface VPC endpoints. You can also access Amazon S3 from your VPC using gateway VPC endpoints.
Set up a centralized identity provider – For authenticating users and systems to your IDP application, setting up a centralized identity provider makes it easier to manage access across multiple IDP applications and services. This reduces the need for multiple sets of credentials and provides an opportunity to integrate with existing human resources (HR) processes.
For federation with individual AWS accounts, you can use centralized identities for AWS with a SAML 2.0-based provider with AWS Identity and Access Management (IAM).
For federation to multiple accounts in your AWS Organizations, you can configure your identity source in AWS IAM Identity Center and specify where your users and groups are managed.
Use IAM roles to control access and enforce least privilege access – To manage user access to IDP services, you should create IAM roles for user access to services in the IDP application and attach the appropriate policies and tags to achieve least privilege access. Roles should then be assigned to appropriate groups as managed in your identity provider. You can also use IAM roles for assigning service usage permissions, thereby employing ephemeral AWS Security Token Service (STS) credentials for calling service APIs. For circumstances where AWS services need to be called for IDP purposes from systems not running on AWS, use AWS IAM Roles Anywhere to obtain temporary security credentials in IAM for workloads running outside of AWS.
Protect Amazon Textract and Amazon Comprehend in your account from cross-service impersonation – An IDP application usually employs multiple AWS services, such that one service may call another service. Therefore, you need to prevent cross-service “confused deputy” scenarios. We recommend using the aws:SourceArn and aws:SourceAccount global condition context keys in resource policies to limit the permissions that Amazon Textract or Amazon Comprehend gives another service to the resource.
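As a rough illustration of the first and last recommendations in this list, the following Python sketch builds the parameters for an interface VPC endpoint and an IAM role trust policy that carries both global condition keys. The helper names, the placeholder IDs, and the choice of a role trust policy as the resource policy are assumptions made for this example; adjust the ARN pattern and account ID to your own environment.

```python
import json


def interface_endpoint_params(vpc_id, region, service, subnet_ids, security_group_ids):
    """Build keyword arguments for the EC2 CreateVpcEndpoint API (interface type).

    `service` is the short service name, for example "textract" or "comprehend".
    """
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        "PrivateDnsEnabled": True,
    }


def textract_trust_policy(account_id):
    """Return a role trust policy (hypothetical helper) that lets Amazon Textract
    assume the role only on behalf of Textract resources in the given account."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "textract.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # Restrict which resource and which account may trigger the call.
                    "ArnLike": {"aws:SourceArn": f"arn:aws:textract:*:{account_id}:*"},
                    "StringEquals": {"aws:SourceAccount": account_id},
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    params = interface_endpoint_params(
        "vpc-0123456789abcdef0", "us-east-1", "textract",
        ["subnet-0123456789abcdef0"], ["sg-0123456789abcdef0"],
    )
    print(json.dumps(params, indent=2))
    print(json.dumps(textract_trust_policy("111122223333"), indent=2))
```

With boto3 installed and credentials configured, the first dictionary could be passed as boto3.client("ec2").create_vpc_endpoint(**params). A gateway endpoint for Amazon S3 would instead use the Gateway endpoint type and route table IDs rather than subnets and security groups.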
Data protection
The following are some current recommendations to consider for data protection:
Follow current recommendations to secure sensitive data in data stores – IDP usually involves multiple data stores. Sensitive data in these data stores needs to be secured. Current security recommendations in this area involve defining IAM controls, multiple ways to implement detective controls on databases, strengthening infrastructure security surrounding your data via network flow control, and data protection through encryption and tokenization.
Encrypt data at rest in Amazon Textract – Amazon Textract uses Transport Layer Security (TLS) and VPC endpoints to encrypt data in transit. The method of encrypting data at rest for use by Amazon Textract is server-side encryption. You can choose from the following options:
Server-side encryption with Amazon S3 managed keys (SSE-S3) – Each object is encrypted with a unique key. As an additional safeguard, this method encrypts the key itself with a primary key that it regularly rotates.
Server-side encryption with AWS KMS (SSE-KMS) – There are separate permissions for the use of an AWS Key Management Service (AWS KMS) key that provide protection against unauthorized access of your objects in Amazon S3. SSE-KMS also provides you with an audit trail in CloudTrail that shows when your KMS key was used, and by whom. Additionally, you can create and manage KMS keys that are unique to you, your service, and your Region.
Encrypt the output from the Amazon Textract asynchronous API in a custom S3 bucket – When you start an asynchronous Amazon Textract job by calling StartDocumentTextDetection or StartDocumentAnalysis, you can pass the optional OutputConfig parameter, which specifies the S3 bucket for storing the output. Another optional input parameter, KMSKeyId, specifies the KMS customer managed key (CMK) to use to encrypt the output.
Use AWS KMS encryption in Amazon Comprehend – Amazon Comprehend works with AWS KMS to provide enhanced encryption for your data. Integration with AWS KMS enables you to encrypt the data in the storage volume for Start* and Create* jobs, and it encrypts the output results of Start* jobs using your own KMS key.
For use via the AWS Management Console, Amazon Comprehend encrypts custom models with its own KMS key.
For use via the AWS Command Line Interface (AWS CLI), Amazon Comprehend can encrypt custom models using either its own KMS key or a provided CMK, and we recommend the latter.
Protect PII in IDP output – For documents including PII, any PII in IDP output also needs to be protected. You can either secure the output PII in your data store or redact the PII in your IDP output.
If you need to store the PII in your IDP downstream, look into defining IAM controls, implementing protective and detective controls on databases, strengthening infrastructure security surrounding your data via network flow control, and implementing data protection through encryption and tokenization.
If you don’t need to store the PII in your IDP downstream, consider redacting the PII in your IDP output. You can design a PII redaction step using Amazon Comprehend in your IDP workflow.
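Two of the recommendations above can be sketched in Python. The first helper assembles the request for an asynchronous Amazon Textract job with OutputConfig and KMSKeyId set; the second shows one possible redaction step that consumes a list shaped like the Entities field of the Amazon Comprehend DetectPiiEntities response. The function names, bucket names, key alias, and score threshold are illustrative assumptions, and the redaction helper assumes non-overlapping entity spans.

```python
def textract_job_request(input_bucket, document_key, output_bucket, output_prefix, kms_key_id):
    """Build parameters for StartDocumentTextDetection with encrypted custom output."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": input_bucket, "Name": document_key}},
        "OutputConfig": {"S3Bucket": output_bucket, "S3Prefix": output_prefix},
        "KMSKeyId": kms_key_id,
    }


def redact_pii(text, entities, min_score=0.5):
    """Replace each detected PII span with its entity type in square brackets.

    `entities` is a list of dicts with BeginOffset, EndOffset, Type, and
    optionally Score, as found in a DetectPiiEntities response.
    """
    redacted = text
    # Process spans from the end of the text so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity.get("Score", 1.0) < min_score:
            continue  # Skip low-confidence detections (threshold is an assumption).
        redacted = (
            redacted[: entity["BeginOffset"]]
            + f"[{entity['Type']}]"
            + redacted[entity["EndOffset"]:]
        )
    return redacted


if __name__ == "__main__":
    # Hand-written sample entities; a real list would come from Amazon Comprehend.
    sample = "Contact Jane Doe at jane@example.com."
    found = [
        {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16, "Score": 0.99},
        {"Type": "EMAIL", "BeginOffset": 20, "EndOffset": 36, "Score": 0.99},
    ]
    print(redact_pii(sample, found))  # Contact [NAME] at [EMAIL].
```

With boto3, the request dictionary could be passed as boto3.client("textract").start_document_text_detection(**request), and the entity list could come from boto3.client("comprehend").detect_pii_entities(Text=text, LanguageCode="en")["Entities"].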
Key and secret management
Consider the following current recommendations for managing keys and secrets:
Use AWS KMS to implement secure key management for cryptographic keys – You need to define an encryption approach that includes the storage, rotation, and access control of keys, which helps provide protection for your content. AWS KMS helps you manage encryption keys and integrates with many AWS services. It provides durable, secure, and redundant storage for your KMS keys.
Use AWS Secrets Manager to implement secret management – An IDP workflow may have secrets such as database credentials in multiple services or stages. You need a tool to store, manage, retrieve, and potentially rotate these secrets. AWS Secrets Manager helps you manage, retrieve, and rotate database credentials, application credentials, and other secrets throughout their lifecycles. Storing the credentials in Secrets Manager helps mitigate the risk of possible credential exfiltration by anyone who can inspect your application code.
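To make the Secrets Manager recommendation concrete, here is a minimal Python sketch that reads database credentials at runtime instead of embedding them in application code. It assumes the secret is stored as a JSON string (as is common for database secrets); the helper name and the secret name in the usage note are hypothetical. The client is passed in, so any object exposing get_secret_value works, including a boto3 Secrets Manager client.

```python
import json


def load_db_credentials(secrets_client, secret_id):
    """Fetch a secret and parse it as JSON.

    `secrets_client` is expected to behave like boto3.client("secretsmanager").
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    # Text secrets are returned in SecretString; binary secrets are not handled here.
    return json.loads(response["SecretString"])
```

For example, load_db_credentials(boto3.client("secretsmanager"), "idp/metadata-db") would return a dictionary such as one with username and password keys, depending on how the secret was created. Because the value is fetched on each call, a rotated secret is picked up without redeploying the application.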
Workload configuration
To configure your workload, follow these current recommendations:
Separate multiple IDP workloads using different AWS accounts – We recommend establishing common guardrails and isolation between environments (such as production, development, and test) and workloads through a multi-account strategy. AWS provides tools to manage your workloads at scale through a multi-account strategy to establish this isolation boundary. When you have multiple AWS accounts under central management, your accounts should be organized into a hierarchy defined by groupings of organizational units (OUs). Security controls can then be organized and applied to the OUs and member accounts, establishing consistent preventative controls on member accounts in the organization.
Log Amazon Textract and Amazon Comprehend API calls with CloudTrail – Amazon Textract and Amazon Comprehend are integrated with CloudTrail. The calls captured include calls from the service console and calls from your own code to the services’ API endpoints.
Establish incident response procedures – Even with comprehensive preventative and detective controls, your organization should still have processes in place to respond to and mitigate the potential impact of security incidents. Putting the tools and controls in place ahead of a security incident, then routinely practicing incident response through simulations, will help you verify that your environment can support timely investigation and recovery.
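As one way to put the CloudTrail recommendation to use, the following Python sketch counts recent API calls per event name for a given event source, such as textract.amazonaws.com or comprehend.amazonaws.com. It assumes these calls are visible through the CloudTrail LookupEvents API in your account, reads only the first page of results, and uses a hypothetical helper name and a 24-hour default window.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def count_api_calls(cloudtrail_client, event_source, hours=24):
    """Summarize recent CloudTrail events for one service by event name.

    `cloudtrail_client` is expected to behave like boto3.client("cloudtrail").
    Only the first page of results is read in this sketch.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    response = cloudtrail_client.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventSource", "AttributeValue": event_source}
        ],
        StartTime=start,
        EndTime=end,
    )
    return Counter(event["EventName"] for event in response.get("Events", []))
```

A call such as count_api_calls(boto3.client("cloudtrail"), "textract.amazonaws.com") returns a Counter keyed by event name, which can feed a simple report or an alert on unexpected actions. For longer windows or busy accounts, follow the NextToken value in the response to read further pages.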
Conclusion
In this post, we shared design principles and current recommendations for the Security Pillar in building well-architected IDP solutions.
To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:
Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence
Build well-architected IDP solutions with a custom lens – Part 2: Security
Build well-architected IDP solutions with a custom lens – Part 3: Reliability
Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency
Build well-architected IDP solutions with a custom lens – Part 5: Cost optimization
Build well-architected IDP solutions with a custom lens – Part 6: Sustainability
For next steps, you can read more about the AWS Well-Architected Framework and refer to our Guidance for Intelligent Document Processing on AWS to design and build your IDP application. Please also reach out to your account team for a Well-Architected review for your IDP workload. If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.
AWS is committed to the IDP Well-Architected Lens as a living tool. As the IDP solutions and related AWS AI services evolve, we will update the IDP Well-Architected Lens accordingly.
About the Authors
Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.
Brijesh Pati is an Enterprise Solutions Architect at AWS. His primary focus is helping enterprise customers adopt cloud technologies for their workloads. He has a background in application development and enterprise architecture and has worked with customers from various industries such as sports, finance, energy and professional services. His interests include serverless architectures and AI/ML.
Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He focuses on AI/ML and IoT. He works with AWS Partners and supports them in developing solutions in AWS. When not working, he enjoys cycling, hiking and learning new things.
Mia Chang is an ML Specialist Solutions Architect for Amazon Web Services. She works with customers in EMEA and shares best practices for running AI/ML workloads on the cloud with her background in applied mathematics, computer science, and AI/ML. She focuses on NLP-specific workloads, and shares her experience as a conference speaker and a book author. In her free time, she enjoys hiking, board games, and brewing coffee.
Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.
Tim Condello is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.