Honestly, storage use would be about the last thing on my mind when deciding what state/region/district/bundesland/etc. should be modelled as. Sometimes those things get renamed, sometimes they are merged, and sometimes they are split. Which means that you may end up in an awkward state when e.g. Mecklenburg-Vorpommern gets split back into Mecklenburg and Western Pomerania, some of your customers have updated their addresses, and some haven't. You have to store all of that anyway, because remember: your DB doesn't represent the current state of the world, it represents your knowledge about the current state of the world. That is where the whole impetus for NULL originated ("I know that the customer has an address, I just don't know what it is"), along with all of its related problems: compare "I know that the customer actually does not have any address at all" and "I know that this address can no longer be correct, but I have no new knowledge about what it should be".
I don't database, but I like to think I have some kind of intuition for storage space requirements, and this article was very confusing.
Ignoring the indexes and just focusing on the main table sizes reported, we have:
- String ("The frequent repetition of these names inflates the size of the table"): 392 MB
- Enum data type ("Internally, an enum type is stored as four-byte floating point number. So it saves space in the table [...]"): 338 MB
- Lookup table ("Also, since a smallint only occupies two bytes, the person_l table can potentially use less storage space than the other solutions"): 338 MB.
I just can't make sense of the numbers, especially given the author's comments that I've quoted.
Is this some kind of typo/editing fail?
I'm also wondering about that. But maybe this could be it?
> Surprisingly, the table is just as big as with the enum type above, even though an enum uses four bytes. The reason is that each table row is aligned at a memory address divisible by eight, so PostgreSQL will add six padding bytes after the smallint. If we had more columns and could arrange them carefully, we could see a difference.
This could be the explanation. If the row is padded to a multiple of 8 and the bigint is 8 bytes, then a smallint or an enum also effectively occupies 8. The entries in the string table will be 8 or 16 bytes, depending on the string length. So one row in person_e and person_l is 16 bytes, while one row in person_s averages about 20. That is a bit closer to reality than my intuition, although the storage savings are still less than what I would have expected.
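You can see the alignment effect directly with pg_column_size over a ROW value (a quick sketch; the 24-byte tuple header is included in each result, and the types are chosen just for illustration):

```sql
-- bigint first, then smallint: 24 (header) + 8 + 2 = 34 bytes
SELECT pg_column_size(ROW(1::bigint, 1::smallint));

-- smallint first, then bigint: the bigint must start at a multiple of 8,
-- so 6 padding bytes are inserted: 24 + 2 + 6 + 8 = 40 bytes
SELECT pg_column_size(ROW(1::smallint, 1::bigint));
```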
edit:
I also tried out the test and dropped the primary key on the table, to compare only the enum and string sizes:
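Roughly like this (a sketch of the setup; the person_s/person_e names come from the article, the values and row count are made up):

```sql
-- one-column tables without a primary key, so only the payload differs
CREATE TYPE state_enum AS ENUM ('Berlin', 'Hamburg', 'Mecklenburg-Vorpommern');

CREATE TABLE person_s (state text);        -- string variant
CREATE TABLE person_e (state state_enum);  -- enum variant

INSERT INTO person_s
SELECT (ARRAY['Berlin', 'Hamburg', 'Mecklenburg-Vorpommern'])[1 + i % 3]
FROM generate_series(1, 1000000) AS g(i);

INSERT INTO person_e
SELECT state::state_enum FROM person_s;

SELECT pg_size_pretty(pg_table_size('person_s')) AS string_size,
       pg_size_pretty(pg_table_size('person_e')) AS enum_size;
```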
Does not look like an amazing saving either.
> Enum type 4-byte floating point number
This is why the storage is weird. Why would you use a float to store distinct values?!
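For what it's worth, as far as I can tell the float only lives in the pg_enum catalog as the sort order (so that new labels can be added between existing ones); the value actually stored in your column is a plain 4-byte OID. Reusing the state_enum type from the sketch above:

```sql
-- enum values on disk are fixed-width 4-byte OIDs
SELECT pg_column_size('Berlin'::state_enum);  -- 4

-- the float4 sort order is catalog metadata, not the stored value
SELECT enumlabel, enumsortorder
FROM pg_enum
WHERE enumtypid = 'state_enum'::regtype;
```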
well, uniformity and homoiconicity are very important: in an ideal db management system (a.k.a. a true RDBMS), everything should be represented as a relation and manipulated with the same set of operators
the separation between types and relations should be limited to core atomic types: string, int, date, etc. (although date is debatable, as it is not usually atomic in most cases, and many dbs end up with one more date relation)
anyway, always use a table .. when it's a choice
couldn't have said it better myself.
Data should be data, queryable, relational. So often I have had to change enums into lookup tables - or worse, duplicate them into lookup tables - because now we need other information attached to the values. Labels, descriptions, colors, etc.
My biggest recommendation, though, is that if you have a lookup table like this, make the value you would have made an enum not just unique, but _the primary key_ (see the sketch below). Now all the places where you would be putting an ID have the value, just like they would with an enum, and oftentimes you won't need to join. The FK makes sure it's valid. The other information is a join away if you need it.
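A minimal sketch of that pattern (table and column names are made up):

```sql
-- the enum-like value itself is the primary key, not a surrogate id
CREATE TABLE task_status (
    status      text PRIMARY KEY,
    label       text NOT NULL,
    description text,
    color       text
);

CREATE TABLE task (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status text NOT NULL REFERENCES task_status
);

-- reads like an enum and needs no join; the FK guarantees validity
SELECT id FROM task WHERE status = 'open';
```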
I do wish though that there were more ways to denote certain tables as configuration data vs domain data, besides naming conventions or schemas.
Edit to add: I will say there is one place where I have begrudgingly used enums, and that's where we have used something like Prisma to get TypeScript types from the schema. It is useful to have types generated for these values. Of course you can do your own generation of those values based on data, but there is a fundamental difference there between "schema" and "data".
well, if DDL (data definition language) and DML (data manipulation language) were unified and both operated on relations, manipulating metadata would be a lot simpler, and more dynamic
you can always create a data dictionary relation where you store the code for table creation, add metadata, and use dynamic SQL to execute the DDL code stored in the DB; I worked somewhere where they did this ... sort of
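Something along these lines, assuming PostgreSQL and a made-up dictionary table:

```sql
-- a data dictionary relation holding DDL as ordinary, queryable data
CREATE TABLE data_dictionary (
    relation_name text PRIMARY KEY,
    create_sql    text NOT NULL
);

INSERT INTO data_dictionary VALUES
    ('color', 'CREATE TABLE color (name text PRIMARY KEY)');

-- dynamic SQL executes whatever the dictionary describes
DO $$
DECLARE
    ddl text;
BEGIN
    FOR ddl IN SELECT create_sql FROM data_dictionary LOOP
        EXECUTE ddl;
    END LOOP;
END $$;
```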
Yeah, that is the idea behind https://tablam.org, where I consider that everything could be a relation, so:
> everything should be represent as a relation
> always use a table .. when its a choice
Everything should be represented as relations (sets of tuples) but you should always use tables (multisets of tuples) when possible? That seems a little contradictory.
how do you want to represent relations in a DBMS, as an enum or a table?
If said DBMS is relational, with relations.
If said DBMS is tablational, like SQL, then you would have to approximate them using tables and constraints.
If said DBMS is of another paradigm, like a document database, there may be no way to represent relations within the DBMS.
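As a small illustration of that approximation (names made up): a bare table is a multiset, and it is the key constraint that makes it behave like a relation, i.e. a set of tuples:

```sql
CREATE TABLE color_bag (name text);              -- multiset: duplicates allowed
CREATE TABLE color_rel (name text PRIMARY KEY);  -- set: duplicates rejected

INSERT INTO color_bag VALUES ('red'), ('red');   -- fine
INSERT INTO color_rel VALUES ('red'), ('red');   -- ERROR: duplicate key
```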
with foreign keys?
I also love the approach of ClickHouse with LowCardinality(String). Flexible, clear semantics, high performance.
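In ClickHouse that looks like this (a sketch; table and column names are made up). The column is declared as a string but stored dictionary-encoded:

```sql
CREATE TABLE person
(
    id    UInt64,
    state LowCardinality(String)  -- a dictionary plus small integer codes under the hood
)
ENGINE = MergeTree
ORDER BY id;
```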
From a maintainability standpoint lookup tables are miles ahead, but from a DX perspective there are a few cases where enums are nice. Honestly, I would probably never use enums again; I feel like they have caused pain every time I've used them.
Enums are great if you're into json/jsonb custom logic and aggregates. It's quite cool to use the constraint system to impose checks on various JSON fields, especially if you're doing extension development, or packaging up procedures for downstream consumption.
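For example, something like this (a sketch; names made up): an enum cast inside a CHECK constraint validates a jsonb field, so an unknown label fails the cast and the row is rejected:

```sql
CREATE TYPE priority AS ENUM ('low', 'medium', 'high');

CREATE TABLE event (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload jsonb NOT NULL,
    -- the cast throws on unknown labels; a missing key yields NULL and fails the check
    CHECK (((payload ->> 'priority')::priority) IS NOT NULL)
);

INSERT INTO event (payload) VALUES ('{"priority": "high"}');    -- ok
INSERT INTO event (payload) VALUES ('{"priority": "urgent"}');  -- rejected
```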
Basically ugly no matter what.
In a lot of web apps this need tends to be related to validation, so many just do these lookups and simple comparisons in their app logic, based on static values from config files, long before any db query is made. Sometimes you just don't need to involve the database, and performance is better for it anyway.
Table with a thread-safe read-through cache in code, imo. But there are places where enums make sense. For instance, things that are specifically in the code's domain.
Who was child 12?