DynamoDB Streams
Part 2: Designing for Failure
When you build serverless systems, one of the most important things to think about is how you deal with failure. This isn't just about DynamoDB; it applies to every serverless service you integrate with. What will you do if a third-party service goes down? What if your API gets throttled? How do you make sure a message is processed only once? Each of these problems can cause real headaches in your app if you don't plan for them.
If you're new to DynamoDB Streams, check out Part 1 of this series. If you're all set, let's tackle one of the most common issues in stream processing: handling duplicate records.
Idempotency
AWS promises that we'll get our data at least once. That means if you're using Kinesis Data Streams for DynamoDB, you might see duplicate records; that's just how distributed systems like this work. If AWS isn't sure a record was processed, it sends it again rather than risk dropping it. So it's on us as developers to handle those duplicates in our application with what's called idempotent processing, which ensures each record in the stream is acted on only once. A standard DynamoDB stream normally won't contain duplicates, but building in idempotent logic anyway means you're covered if anything goes wrong downstream.
To better understand how we can handle these duplicates, let's look at a practical implementation using Python:
import time
from decimal import Decimal

import boto3


class IdempotentStreamProcessor:
    def __init__(self):
        self.dynamodb = boto3.resource('dynamodb')
        self.processed_items_table = self.dynamodb.Table('ProcessedItems')

    def process_record(self, record):
        # Extract the sequence number as a unique identifier for this record
        sequence_number = record['dynamodb']['SequenceNumber']
        try:
            # Try to insert a processing record; the condition fails if we've seen it before
            self.processed_items_table.put_item(
                Item={
                    'sequence_number': sequence_number,
                    'processed_at': Decimal(str(time.time())),
                    'status': 'PROCESSING'
                },
                ConditionExpression='attribute_not_exists(sequence_number)'
            )
            # Process the actual record (your business logic)
            self._process_payment(record)
            # Update the status to completed
            # ('status' is a DynamoDB reserved word, so alias it with ExpressionAttributeNames)
            self.processed_items_table.update_item(
                Key={'sequence_number': sequence_number},
                UpdateExpression='SET #status = :status',
                ExpressionAttributeNames={'#status': 'status'},
                ExpressionAttributeValues={':status': 'COMPLETED'}
            )
        except self.dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
            print(f"Record {sequence_number} already processed, skipping")
            return
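If you were wiring this up to a Lambda function triggered by the stream, the handler might look something like the sketch below. The handler itself isn't part of the example above; it's just an illustration of how the class could be driven:
# Instantiate once per container so the boto3 clients are reused across invocations
processor = IdempotentStreamProcessor()

def lambda_handler(event, context):
    # A DynamoDB Streams event delivers a batch of records under 'Records'
    for record in event['Records']:
        processor.process_record(record)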
While the above example demonstrates the underlying concepts of idempotency, there's no need to reinvent the wheel for most use cases. Libraries like Powertools for AWS Lambda handle this for you natively, and I recommend going that route 99% of the time unless you need something highly customized for your use case. You can learn more about the Powertools idempotency utility here.
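For reference, here's a minimal sketch of what that looks like with the Powertools idempotency utility. The table name, the JMESPath expression used to derive the idempotency key, and the process_payment function are all assumptions you'd adapt to your own event shape:
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent,
)

# Powertools stores its idempotency records in this table (name is an assumption)
persistence_layer = DynamoDBPersistenceLayer(table_name="IdempotencyTable")

# Use the stream record's sequence number as the idempotency key
# (this JMESPath assumes a batch size of 1; adjust for larger batches)
config = IdempotencyConfig(event_key_jmespath="Records[0].dynamodb.SequenceNumber")

@idempotent(persistence_store=persistence_layer, config=config)
def lambda_handler(event, context):
    # process_payment is a placeholder for your business logic
    process_payment(event)
    return {"status": "COMPLETED"}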
Let's see a visual representation of the idempotency flow:

Idempotency-based processing sequence
Now that we understand how to handle duplicates, let's tackle another critical aspect of stream processing: managing errors and failures effectively.
Error Handling and Management
Stream processing can fail for various reasons, such as:
Downstream service unavailability
Data validation errors
Processing timeouts
Resource constraints
Network issues
We must design for failure in our codebase so we can maintain data consistency and ensure no records are lost during processing! Implementation considerations include:
Retrying the record when the error is transient
Forwarding the record to a dead-letter queue (DLQ) for human investigation
Alerting through your monitoring system so the responsible team can respond
Here's a practical implementation that showcases how to handle different types of errors while maintaining system reliability (the Python code below is demoware and is not suitable for a live system!):
import json
import os
import time
from datetime import datetime

import boto3


class TemporaryError(Exception):
    """A transient failure (e.g. throttling) that is worth retrying."""


class PermanentError(Exception):
    """A failure that will never succeed on retry (e.g. invalid data)."""


class StreamErrorHandler:
    def __init__(self):
        self.sqs = boto3.client('sqs')
        self.dynamodb = boto3.resource('dynamodb')
        self.dlq_url = os.environ['DLQ_URL']
        self.retry_table = self.dynamodb.Table('RetryRecords')

    def process_record_with_retries(self, record, max_retries=3):
        retry_count = 0
        while retry_count < max_retries:
            try:
                return self._process_record(record)
            except TemporaryError:
                # Exponential backoff: wait a pre-determined amount of time,
                # increment the retry count, and try again
                wait_time = (2 ** retry_count) * 0.1
                time.sleep(wait_time)
                retry_count += 1
            except PermanentError as e:
                # Retrying won't help, so send the record straight to a DLQ
                # for further inspection
                self.send_to_dlq(record, str(e))
                raise
            except Exception as e:
                # This is a critical, unexpected error; hopefully this alerts
                # your platform or SRE team
                self.handle_unknown_error(record, e)
                raise
        # All retries were unsuccessful, so park the record in the DLQ as well
        self.send_to_dlq(record, f"Max retries ({max_retries}) exceeded")

    def send_to_dlq(self, record, error_message):
        message = {
            'record': record,
            'error': error_message,
            'timestamp': datetime.utcnow().isoformat(),
            'retry_count': 0
        }
        # Send to the DLQ
        self.sqs.send_message(
            QueueUrl=self.dlq_url,
            MessageBody=json.dumps(message)
        )
        # Log to the retry table
        self.retry_table.put_item(
            Item={
                'record_id': record['dynamodb']['SequenceNumber'],
                'error': error_message,
                'status': 'FAILED',
                'timestamp': datetime.utcnow().isoformat()
            }
        )
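Wired into a stream-triggered Lambda handler, the class above might be used roughly like this (again, just a sketch; here the batch keeps going after a permanent failure because the record has already been captured in the DLQ):
error_handler = StreamErrorHandler()

def lambda_handler(event, context):
    for record in event['Records']:
        try:
            error_handler.process_record_with_retries(record)
        except PermanentError:
            # Already parked in the DLQ and retry table; move on to the next record
            continue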
To better visualize this error handling flow and its decision points:

Error handling flow for a DynamoDB stream
We aren't covering observability in depth in this article; however, it's important to watch for trends in your errors so you can evaluate whether a specific error is something to worry about. You'll also want to configure proper alerting for the errors you identify as critical to your system's functionality, so you can triage and fix them accordingly.
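As one concrete (and entirely assumed) example, a CloudWatch alarm on the DLQ depth is a simple way to get paged when records start failing permanently; the queue name and SNS topic ARN below are placeholders:
import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm as soon as any message lands in the DLQ
cloudwatch.put_metric_alarm(
    AlarmName='dynamodb-stream-dlq-not-empty',
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'stream-processing-dlq'}],
    Statistic='Maximum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts'],
)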
Conclusion
Let's recap why these patterns matter for building reliable stream processing systems.
When you're building DynamoDB stream processors, the two main things to focus on are handling duplicates and managing errors. Duplicate processing is easy to run into in distributed systems; to avoid it, track each record's unique sequence number and maintain a processing-state table.
Error handling is just as crucial. You need a strategy that distinguishes temporary failures from permanent ones. Use retry logic with exponential backoff for the former, and set up a dead-letter queue (DLQ) for errors that can't be fixed automatically.
Don't forget monitoring, either. It helps you track how often duplicates happen, what errors you're seeing, and how healthy your processing pipeline is.
Stream processing isn't just about the happy path; you need to handle edge cases and failures too. By following these patterns and steering clear of mistakes like unbounded retries or unhandled error types, you can build robust stream processing systems that keep your data reliable and consistent.