Seth Geoghegan

Developer
I believe I've found a mistake in chapter 20.3.1 (the Deals modeling example).  The discussion of truncated timestamps says

You can truncate down to any granularity you like—hour, day, week, month, year—and your choice will depend on how frequently you’re writing items. 
The issue is with using truncated timestamps to model weeks.  I believe you can only truncate down to the fields the timestamp represents: second, minute, hour, day, month, year (not weeks).  I came across this issue while trying to model something on a weekly boundary.  I'd love to be mistaken, since that would solve my data modeling issue :)


Owain in this specific example, the truncated timestamps are used as PKs.  I don't believe we can do a between query on PKs.  I suppose we could use truncated timestamps as SKs and do what you've described.  However, it's not clear to me that's what was meant when this chapter said this strategy could work on the week boundary.

The example later in the chapter shows how multiple single-day partitions are queried to ensure at least 25 Deals are returned.  Since the example used truncated timestamps as PKs, the application logic needed to make n queries across several single-day partitions.  This pattern is probably sufficient for this example, since querying multiple partitions is the worst-case scenario.  However, if querying by week is an access pattern that your app always supports (e.g. fetch all deals by week), hitting 5 (or 7) day partitions every time seems less than ideal.
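For reference, the fan-out described above might look roughly like this sketch, where `fetch_page` is a hypothetical stand-in for a DynamoDB Query against one `DEALS#<date>` partition (the key format and names are assumptions, not the book's exact code):

```python
from datetime import date, timedelta

def gather_deals(fetch_page, start_day: date, minimum: int = 25, max_partitions: int = 7):
    """Walk backward across single-day partitions until enough deals are found.

    `fetch_page` takes a partition key like "DEALS#2020-08-09" and returns
    a list of deal items for that day.
    """
    deals, day = [], start_day
    for _ in range(max_partitions):
        deals.extend(fetch_page(f"DEALS#{day.isoformat()}"))
        if len(deals) >= minimum:
            break
        day -= timedelta(days=1)
    return deals
```

If each day only holds a handful of deals, this loop is exactly the multi-query worst case the chapter describes.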

I thought the truncated timestamp strategy was meant to build single item collections on minute/hour/day/month/year boundaries, the result being a single partition for each timeframe.  I suppose I could query truncated timestamps in the way you describe, but it seems not terribly different from querying non-truncated timestamps.


Seth Geoghegan You are quite correct that between only works on the SK, not the PK.  I didn't review the code, but there are ways of doing it.  Remember that you are not trying to optimize for storage, just writes and then reads, so data redundancy is your friend.

If you need this as an access pattern, then create a GSI with a PK of YYYY-WW derived from the timestamp, or, if you want to keep timestamps, use the approach I suggested above.

My original suggestion (SK timestamp) gives you the added benefit of giving you more access patterns e.g. previous 7 days, previous 5 days on any date you pass in as an arg.
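A helper for that YYYY-WW partition key might look like this sketch (using ISO week numbering as one consistent choice; any convention works as long as writers and readers agree):

```python
from datetime import datetime

def week_partition_key(ts: datetime) -> str:
    """Build a YYYY-WW key, e.g. "2020-W32", for use as a GSI PK."""
    # isocalendar() returns (ISO year, ISO week, ISO weekday); using the
    # ISO year keeps keys consistent across year boundaries
    iso = ts.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"
```

Note that the ISO year can differ from the calendar year near January 1st, which is exactly why you want one deliberate convention baked into a shared function.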
HTH,

O.

Hey Seth Geoghegan , you could truncate down to the week by choosing the first day of a week (e.g. August 9th, 2020) and truncating the value to the beginning of that day: "2020-08-09T00:00:00.000Z".

The key here isn't that you take a timestamp and zero-out anything after a certain digit. Rather, it's making sure you have a consistent way to group a certain timestamp into some boundary. Here, it would take a timestamp, find the Sunday prior to it, and create a timestamp for 00:00 UTC on that Sunday.

Your application will need to encode the logic to handle this truncation, but you'll basically have a function that takes in a time and finds the truncated week timestamp for that time.
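That helper could be sketched like this, assuming weeks start on Sunday and timestamps are UTC:

```python
from datetime import datetime, timedelta

def truncate_to_week(ts: datetime) -> str:
    """Map any timestamp to 00:00 UTC on the Sunday at or before it."""
    # weekday(): Monday=0 ... Sunday=6; step back to the most recent Sunday
    days_since_sunday = (ts.weekday() + 1) % 7
    sunday = ts - timedelta(days=days_since_sunday)
    return sunday.strftime("%Y-%m-%d") + "T00:00:00.000Z"
```

Every timestamp in a given Sunday-to-Saturday window maps to the same value, so all of that week's items land in one partition.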

I may be misunderstanding, so let me know if that doesn't make sense :)
In Chapter 13 of the DynamoDB Book, Alex DeBrie gives an example of an access pattern that relies on the arrangement of sort keys (section 13.2.2):



I understand this specific example may have been contrived to illustrate the pattern, but it's not clear to me when this pattern would be useful.  

In this example, if you were trying to get all ISSUES for a given REPO, why not query for items where the SK begins with "ISSUE#"?  This access strategy is interesting, but I'm not clear on what it gives us that we don't already get with a "begins with" constraint on the SK.

 
After thinking about this more, I think I've come up with a use case for this pattern.

In this particular example, it's likely that you'd want to get both issues *and* repo information in a single query.  Something a query limiting SK to ISSUE#<id> wouldn't achieve by itself.  By structuring the query this way, you're able to fetch both sides of the one-to-many relationship between repos and issues in a single call.  Querying with SK begins_with ISSUE# only gets you the "many" side of the relationship.
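A query exploiting that arrangement might be built like this sketch of boto3-style parameters (table and key names are hypothetical, not the book's exact schema); it leans on "ISSUE#" sorting lexicographically before "REPO#":

```python
def repo_with_issues_request(repo_name: str, table: str = "GitHubTable") -> dict:
    """Build Query parameters that return the REPO metadata item plus its
    ISSUE items in a single call."""
    return {
        "TableName": table,
        "KeyConditionExpression": "PK = :pk AND SK BETWEEN :issues AND :repo",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"REPO#{repo_name}"},
            ":issues": {"S": "ISSUE#"},   # lower bound: first possible issue
            ":repo": {"S": f"REPO#{repo_name}"},  # upper bound: the repo item itself
        },
        # Read descending so the REPO metadata item comes back first
        "ScanIndexForward": False,
    }
```

Passing these parameters to a DynamoDB Query would fetch both sides of the one-to-many relationship at once, which a begins_with("ISSUE#") condition alone can't do.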
Yeah, I think you nailed it.  This pattern is useful when you want something like a metadata record, as well as N additional records.  I haven't found many places to use this because of the alphabetical requirement, but that's probably my own fault.
First-time DynamoDB user checking in for feedback on my first DynamoDB data modeling attempt!  The DynamoDB Book has been an amazing resource; I'm thankful it exists!

I am building an application that collects and presents running race results (e.g. 5k, 10k, marathon, etc).  This is how I'm modeling the Race, User and Result entities.


One of my basic access patterns is to fetch a sorted list of results by race.  I've implemented this access pattern by defining a secondary index on the RESULT entity, which includes a composite sort key.  The index GSI1PK is RACE#<race_id>, the GSI1SK is TIME#<time_in_seconds>.  This allows me to get sorted race results for a specific race by time:
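That query could be built roughly as follows (the table name is an assumption, and TIME# values are assumed to be zero-padded so they sort lexicographically):

```python
def race_results_request(race_id: str, limit: int = 10) -> dict:
    """Build Query parameters against GSI1 for one race's results,
    fastest first."""
    return {
        "TableName": "RaceResults",  # hypothetical table name
        "IndexName": "GSI1",
        "KeyConditionExpression": "GSI1PK = :pk AND begins_with(GSI1SK, :sk)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"RACE#{race_id}"},
            ":sk": {"S": "TIME#"},
        },
        "Limit": limit,  # e.g. top 10 finishers
    }
```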



This covers the basic needs of my application.  However, I have additional access patterns that require fetching results by gender (top male and top female results per race) and age group results per race (14 years old and under, 15-19, 20-24, etc.).  This is where I'm a bit stuck and could use some guidance.  My thoughts on a few options:

  1. Push this off to the client - It would be simpler if I made this a client-side concern.  If my data layer returns all the required information about a race result (age/gender/time) it would be trivial to create arbitrary groupings.  
  2. Filter existing results - each RESULT can include a user gender/age attribute, which I could then filter.  This would allow me to take advantage of my existing model without creating new indexes or other complicated trickery :)
  3. Create additional indexes - Taking advantage of gender being an enumeration (M or F), I could create a composite sort key that includes gender, e.g. TIME#M#1200.  I could also map age groups to enumerated values (e.g. 2024 for 20-24), or even combine the two (e.g. 2024M, 2024F, etc).  However, this is starting to feel complex.
  4. Create a complex attribute on the RACE item that stores age group/gender lists/maps.  I don't have any access patterns on the values in age/gender groupings.  I'm not sure about this pattern for this use case, since a race has an unbounded list of results.  

  I'm new to modeling in DynamoDB and I feel like I'm wading in an ocean of possibilities!  Any advice or guidance would be greatly appreciated!



To elaborate on option 3 from above, I've come up with a compound sort key that encapsulates gender, age group and performance times: <gender_m_or_f>#<age_group>#<time_in_seconds>.  In this example, age groups would be represented as 4 digits that provide a natural sort order:

14 and under -> 0014
15-19        -> 1519
20-24        -> 2024
25-29        -> 2529
...
95-99        -> 9499
Therefore, my list of results would be sorted/grouped as follows:




This pattern lets me fetch grouped and sorted results in one query.  However, I lose the non-grouped, sorted results I achieved with my former TIME#<time_in_seconds> sort key example.

 I suppose I could just use 2 GSIs to support these access patterns, but I'm not sure if I'm approaching this correctly.
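For concreteness, a key builder for that compound sort key might look like this sketch (five-year bands plus a 14-and-under band, with zero-padded times so lexicographic order matches numeric order):

```python
def result_sort_key(gender: str, age: int, seconds: int) -> str:
    """Build a <gender>#<age_group>#<time> sort key, e.g. "F#2024#001200"."""
    if age <= 14:
        band = "0014"
    else:
        lo = (age // 5) * 5          # 17 -> 15, 22 -> 20, ...
        band = f"{lo:02d}{lo + 4:02d}"  # 15 -> "1519", 20 -> "2024"
    # zero-pad so "000945" sorts before "001200" as strings
    return f"{gender}#{band}#{seconds:06d}"
```

The zero-padding matters: without it, "945" would sort after "1200" lexicographically even though it's the faster time.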


Hi Seth, personally, I think your answer will depend a bit on the scale.  How many people do you expect for each race?

If you're expecting a few hundred records, then returning all of them to the client makes sense to me.  A few thousand seems doable as well.  If you're looking at many thousands of records, you might think about additional server-side filtering.  I'm assuming a race won't have millions of records.

For example, if you have a page that shows only the women's results, you could query by pk: RACE#<id> with a QueryFilter on gender = 'F'.  AWS still charges you for the men, but the client only gets what they asked for.
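That filtered query might be sketched like this, using a FilterExpression (the table name and `gender` attribute are assumptions from the model above):

```python
def womens_results_request(race_id: str) -> dict:
    """Build Query parameters for one race's partition, filtering
    server-side to gender = 'F'.

    DynamoDB applies the filter *after* reading items, so read capacity
    is still consumed for the filtered-out men.
    """
    return {
        "TableName": "RaceResults",  # hypothetical
        "KeyConditionExpression": "PK = :pk",
        "FilterExpression": "gender = :g",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"RACE#{race_id}"},
            ":g": {"S": "F"},
        },
    }
```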

It may also make sense to create some rollups for things like number of participants by age group.  When you add the runner's final score, also do a lookup and update a shared record for their age group, gender, whatever.  These could look something like:

pk: RACE#<id>, sk: STATS#AGE, age: { 7: 1, 18: 240, 19: 280, 20: 212 }

pk: RACE#<id>, sk: STATS#GENDER, gender: { 'male': 2455, 'female': 2674 }

These metadata-level results can answer the general questions where you don't need a specific user, just the counts across all users.
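One way to maintain such a rollup is an UpdateItem per result, sketched here as request parameters (names are hypothetical, and it assumes the STATS#AGE item was created up front with an empty `age` map, since SET on a nested path needs the parent map to exist):

```python
def bump_age_stat_request(race_id: str, age: int, table: str = "RaceResults") -> dict:
    """Build UpdateItem parameters that increment one age's counter on the
    shared STATS#AGE item, initializing it to 0 if absent."""
    return {
        "TableName": table,
        "Key": {"PK": {"S": f"RACE#{race_id}"}, "SK": {"S": "STATS#AGE"}},
        # if_not_exists handles the first runner of a given age
        "UpdateExpression": "SET age.#a = if_not_exists(age.#a, :zero) + :one",
        "ExpressionAttributeNames": {"#a": str(age)},
        "ExpressionAttributeValues": {":zero": {"N": "0"}, ":one": {"N": "1"}},
    }
```

Doing this in the same transaction (or at least the same code path) as the result write keeps the counts honest.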

Just some ideas, I'm curious how others would handle this.