Adam Hopkins

In section 20.3.1, recent deals are truncated by timestamp to shard data across multiple partitions. My question is, is this to avoid the 10 GB partition limit, or is it just so we don't hit one partition so much?

The reason I ask is that in section 3.5.4, Alex mentions that the partition size limit is not a problem for global secondary indexes, and DynamoDB will split data across multiple partitions under the hood for you. Using a GSI seems much simpler than adding the truncating logic to your application/querying -- unless the point really is avoiding a hot partition rather than the fear of running out of space.
I've been mulling this over and believe the book example showed that pattern just to get us thinking, not out of necessity.  To further complicate things, the docs say the 10 GB limit doesn't apply unless you have an LSI.  I'm not 100% sure though as I've never hit the size limit (yet).
Good question, Adam Hopkins! It's more for the latter -- to avoid hitting one partition too much. My assumption is that this would be one of the hotter read paths, so it could cause problems with the 3,000 RCU per-partition limit if you had high adoption.

If you're never going to hit the RCU/WCU limits, I wouldn't worry too much about partition size. The DynamoDB team has indicated that partitions over 10 GB are split across multiple storage nodes, which could add some latency if you queried across them, but it's not something I'd lose a lot of sleep over.
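
For concreteness, here's the rough shape of the pattern I have in mind -- just a sketch, and the table and key names are placeholders rather than the book's exact schema:

from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Deals")  # placeholder table name


def deals_pk(day: datetime) -> str:
    # Truncate the timestamp to the day so each day's deals land in a
    # different partition; a finer truncation spreads load even further.
    return f"DEALS#{day.strftime('%Y-%m-%d')}"


def latest_deals(limit: int = 25, max_days: int = 7) -> list:
    """Walk backwards one day-partition at a time until we have enough items."""
    items, day = [], datetime.now(timezone.utc)
    for _ in range(max_days):
        if len(items) >= limit:
            break
        resp = table.query(
            KeyConditionExpression=Key("PK").eq(deals_pk(day)),
            ScanIndexForward=False,  # newest first within the day
            Limit=limit - len(items),
        )
        items.extend(resp["Items"])
        day -= timedelta(days=1)
    return items

The finer the truncation, the more partitions the reads and writes are spread across, at the cost of more round trips when you need to page back in time.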
Ah, so they don't even have to be global secondary indexes to go above 10 GB. Very cool. Thanks! Makes sense that it's to avoid a hot partition.
I have Items that enter the table with a status of Pending. If they are not marked Completed within 20 minutes, I'd like them to be marked with a status of Incomplete.

In a relational table, I would have a cron script that checks the status column every 20 minutes and takes that action on the Pending items.

However, with DynamoDB/AWS, is there a better way to take some action after a certain amount of time? I would also need to make sure that the action was cancelled if the Item was marked Completed.

This may be hacky, but I was thinking about using expiresAt and setting it for 20 minutes from now, then having a Lambda handle when it expires. But the book points out that expiresAt can actually take up to 48 hours to delete an Item.
There's an article by Yan (@theburningmonk) that considers using DDB TTL for this purpose. However, as you state, there's theoretically a wide variance in accuracy. From his analysis, you're typically looking at around 20 minutes of lag on average -- so the same margin of error as an every-20-minute cron task (PENDING items could be marked INCOMPLETE up to 20 minutes late). I find these numbers to be consistent with my usage.

You would presumably store the TTL on an empty lookup item, which can update the source item upon its REMOVE stream record. These item pairs would need to be in the same item collection for efficient querying, since it's good practice to consider the TTL in any data access: if the TTL has already passed (and DDB has yet to remove the record), then the source item is effectively INCOMPLETE.
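
Very roughly, the stream side of that looks like the sketch below. All of the key shapes and names are made up for illustration (a lookup item at PK = ORDER#<id> / SK = EXPIRY carrying the ttl attribute, next to a source item at SK = ORDER):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Orders")  # placeholder table name


def handler(event, context):
    for record in event.get("Records", []):
        if record.get("eventName") != "REMOVE":
            continue
        # TTL deletions are tagged with this service principal on the stream
        # record, which lets us ignore ordinary application deletes.
        if record.get("userIdentity", {}).get("principalId") != "dynamodb.amazonaws.com":
            continue

        keys = record["dynamodb"]["Keys"]
        if keys["SK"]["S"] != "EXPIRY":
            continue  # not one of our lookup items

        try:
            # Flip the source item to INCOMPLETE only if it is still PENDING.
            table.update_item(
                Key={"PK": keys["PK"]["S"], "SK": "ORDER"},
                UpdateExpression="SET #s = :incomplete",
                ConditionExpression="#s = :pending",
                ExpressionAttributeNames={"#s": "Status"},
                ExpressionAttributeValues={
                    ":incomplete": "INCOMPLETE",
                    ":pending": "PENDING",
                },
            )
        except ClientError as err:
            # A failed condition just means the item was already COMPLETED.
            if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise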

So that's arguably a fair amount of work just to stay in DDB-land.

AFAIK there isn't another mechanism inherent to DDB that could suit your purpose. Other options would be a scheduled task (via CloudWatch Events/EventBridge) or Step Functions; the former is presumably satisfactory, and a cinch to implement.
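
To give a sense of how little wiring the scheduled route needs, something like this would do it (the function name and ARN are placeholders; in practice you'd likely let your deployment tooling create these):

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Fire a (hypothetical) sweeper Lambda every 5 minutes.
rule_arn = events.put_rule(
    Name="mark-stale-pending-items",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)["RuleArn"]

events.put_targets(
    Rule="mark-stale-pending-items",
    Targets=[{"Id": "sweeper", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:sweeper"}],
)

# Allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName="sweeper",
    StatementId="allow-eventbridge-sweeper",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)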
Thanks for the response! Storing an empty lookup item is a good idea; that way you don't need to re-insert the original item. I think I'm okay with a 20-minute lag, and I'm fine with showing the item as PENDING until the lookup item is actually deleted (assuming the average 20-minute lag :) ), rather than looking up the corresponding item to check whether the TTL has expired.

Yeah, for the other option with a scheduled CloudWatch task, I was thinking I could maybe have a sparse index just for all PENDING items and review those.
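
Something roughly like this is what I'm picturing for the scheduled Lambda -- a sketch only, with made-up names, assuming the sparse index is called GSI1, its partition key only exists on PENDING items, and createdAt (epoch seconds) is projected into it:

import time

import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Orders")  # placeholder table name
TWENTY_MINUTES = 20 * 60


def handler(event, context):
    cutoff = int(time.time()) - TWENTY_MINUTES
    resp = table.query(
        IndexName="GSI1",
        KeyConditionExpression=Key("GSI1PK").eq("STATUS#PENDING"),
    )  # production code would also follow LastEvaluatedKey
    for item in resp["Items"]:
        if item["createdAt"] > cutoff:
            continue  # still inside its 20-minute window
        try:
            table.update_item(
                Key={"PK": item["PK"], "SK": item["SK"]},
                # Removing GSI1PK drops the item out of the sparse index.
                UpdateExpression="SET #s = :incomplete REMOVE GSI1PK",
                ConditionExpression="#s = :pending",
                ExpressionAttributeNames={"#s": "Status"},
                ExpressionAttributeValues={
                    ":incomplete": "INCOMPLETE",
                    ":pending": "PENDING",
                },
            )
        except ClientError as err:
            # Condition failure means it was COMPLETED between query and update.
            if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise

Removing the GSI attribute when the status flips keeps the index sparse, so each sweep only ever sees the currently PENDING items.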
I have a table for form submissions that looks like this:

PK: SITE#example.com
SK: SUBMISSION#ksuid

And I have an aggregate item for the total number of form submissions for a site, like this:

PK: SITE#example.com
SK: TOTAL#example.com

If I sort in reverse to get the most recent form submissions (what I normally want) using PK SITE#example.com, the first item is the total and then I have the form submissions.

However, if I wanted to flip it around to get the total and then the least recent form submissions, what's the best way to do this?

Would it be to use a GSI and use the same PKs but give TOTAL#example.com a different SK so that it comes before the submissions? Like:

GSI1PK: SITE#example.com
GSI1SK: _TOTAL#example.com

And then for submissions:

GSI1PK: SITE#example.com
GSI1SK: SUBMISSION#ksuid

And now _TOTAL comes before SUBMISSION?

Using an entire GSI feels like a misuse to me because it duplicates every submission, when all you really need is the rollup. Have you considered creating two aggregate totals, one for each direction? One note on ordering: string sort keys compare by UTF-8 byte value, and '_' sorts after the uppercase letters, so _TOTAL would actually land after the SUBMISSION# items; a prefix like '#' sorts before them.

PK: SITE#example.com
SK: TOTAL

PK: SITE#example.com
SK: #TOTAL
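
Roughly, the reads and the rollup bump would look like this (boto3 sketch; the table name and count attribute are made up). Note that each direction still returns the other rollup at the tail of the result set, which you can drop in code or cut off with a Limit:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("FormSubmissions")  # placeholder name
site = Key("PK").eq("SITE#example.com")

# Descending: TOTAL sorts after SUBMISSION#..., so it comes back first,
# followed by the newest submissions.
newest_first = table.query(KeyConditionExpression=site, ScanIndexForward=False)

# Ascending: #TOTAL sorts before SUBMISSION#..., so it comes back first,
# followed by the oldest submissions.
oldest_first = table.query(KeyConditionExpression=site, ScanIndexForward=True)

# Each new submission bumps both rollups (ideally alongside the put,
# e.g. in a TransactWriteItems call).
for sk in ("TOTAL", "#TOTAL"):
    table.update_item(
        Key={"PK": "SITE#example.com", "SK": sk},
        UpdateExpression="ADD submissionCount :one",
        ExpressionAttributeValues={":one": 1},
    )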
Ah, I hadn't considered that. And I guess I would just update both? That does seem cleaner than duplicating every submission.
I think I've decided to actually have a "reference count" on a parent item as mentioned in the book in section 16.5.