Crate: Why require the PRIMARY KEY column to be present in a PARTITIONED BY clause? -


can me understand reading in documentation? https://crate.io/docs/reference/sql/partitioned_tables.html

in these example tables, column id long not primary_key; indeed, id not primary key here, because noted below "if primary key set, must present in partition by clause"

in app, i've historically had primary key on id string not null, want add partitioning on table, on generated date column in example partition_date timestamp generated date_trunc('day', created_at). i've read partitioning on date column speed of queries scoped time period (counting today's records, example, hit today's partitions), , helping me archive older frames of data (e.g. > 180 days), don't want lose performance of single pk lookups.

so since can't partitioned (partition_date), best if i...

a) remove primary key constraint id? i'm nervous affect performance single row lookups! in context, makes sense pk must in partition key, because lookup where id = "abc-123" should ideally have hit single node.

or

b) use both columns partition key, partitioned (id, partition_date) -- seems weird, because instinctively, want assume id have high cardinality , bad choice partition column, , 'day' or 'month' better, shown in example on docs. in case, pk lookup hitting every partition, or know go? if run aggregate query scoped today only, hit every partition or 1 holding today's data?

so that's great question! since partitions "sub tables" of sort, reducing size of queried data.

a primary key influences routing in cratedb, adding partitioned table (which requires more extensive routing) deny non-primary key column in partitioned clause. thus, options follows:

  • a) while removes ability pk lookups effectively, seems sensible option - regular string lookups can accelerated using fulltext index - going remove read-after-write-consistent primary key lookups add. depending on way generate primary key, might feasible use internal _id column instead (for lookups), or issue refresh table before id-lookup.
  • b) result in many partitions there primary keys (and since unique ...) - option create many partitions

since option b) result in chaos, i'd recommend option a). however, if primary key lookups crucial application , expected amount of data not huge (a couple of million fine - depending on cluster size , machine specs of course), might work fine without partitioning!

cheers, claus


Comments

Popular posts from this blog

python - How to insert QWidgets in the middle of a Layout? -

python - serve multiple gunicorn django instances under nginx ubuntu -

module - Prestashop displayPaymentReturn hook url -