Does 'Cities' deserve a separate table?

WePython 1 Months+

I have the following Customers table:

customer_id - int 
company_name - nvarchar
street - nvarchar
city - nvarchar
comments - nvarchar

The app will only be used in part of one small country (something like 30 cities). A friend told me I should seperate 'city' into a different table 'Cities' and use only city_id in customers table.

Personally I didn't see much of a benefit from it (except for saving some space on Customers table which seems insignificant to me in this case for the cost of creating another table).

He also mentioned that because I have duplicate columns - city: foo, city: bar, city: foo. (few customers in the same city) this is not considered normalized, is this true?

Who's right? Any enlightenment on the issue?

-----------------Answer-----------------

The design doesn't meet third normal form, but not just because of the city. The fields STREET, CITY are functionally dependant on each other (if you change the city, the street should probably change as well and vice-versa). You could also have the same street, city combination represented in different ways (Foo St, Foo; Foo Street, Foo; etc.).

To normalise this you would create a new table ADDRESSES which has the street, city etc. in and link the customer to that via an address id. This would also allow you to list several addresses for a customer (via a link table) if this is what you require.

This still leaves you to decide whether to extract the city into it's own table. To fully meet 3NF you should create a cities table, whether you need or want to depends on the answer to the following questions:

  1. Do you (or will you) have additional attributes of the city (e.g. population, county/state, display name etc.)
  2. Are you executing a query to generate a list of distinct cities to pick from (your comment suggests yes)

If the first point is true, then you should definitely create a CITIES table, otherwise you could end up with one city having different populations etc. If the second is true then it's a very good idea to have a separate table as the query to list all your cities will scale much better - you'll only have to scan this table rather than the (almost certainly larger) CUSTOMERS table and then get the distinct cities from that.

-----------------Answer-----------------

When you get into it -- really get into it -- storing componentized address data is an extremely complicated problem because of all the disparate and varied systems in use globally.

I think whatever you develop needs to be balanced between flexibility, and storing only what your business needs to store.


The biggest piece of the puzzle here is to move all address-related fields out of the Customers table -- addresses are entities unto themselves.

The space cost may not be relevant in a very small system (maybe), but this is more about a technical debt issue. If you need to start adding more address-related fields, you would have to keep adding more and more of them to the Customers table. Sooner or later, you'll realize that this is an inflexible design -- if you need to use multiple addresses for a given customer (billing & shipping addresses being the classic example), now you're in a world of hurt without normalizing, as you can't reuse the existing structure to store the required data.

At an absolute minimum, create a new table Addresses, and then reference address_id from Customers. If you want to go the multiple address route eventually, even doing just this step (as opposed to sticking with the current design) will save a massive headache later.

The address line could go in either the Addresses table directly for simplicity, or in a separate Address_Lines table to handle multiple lines. (The latter is usually preferred.)

After that, a general minimum for being able to slice and dice your data in a meaningful way is to construct normalized Countries, Regions (aka provinces/territories/etc.), and Cities tables, with only the latter appearing as a field in the Addresses table. This lets you ask business questions like "how many products did we sell in city X?" and "how many products did we sell in region Y?". (Note: depending on where you operate, what data you have, and how the data will be sliced, you may require a 4th table in there between Regions and Cities.)

If you need to get more granular ("how many products did we sell to customers on street X?"), then you'll have to start componentizing the address lines themselves, which is the really difficult part. Usually, though, a business won't ask this kind of question. Given that I don't even see a postal code field, I'm guessing this is not something you care about.

-----------------Answer-----------------

Address modeling is not universal. A universal implementation would be too complex for most applications. The different models vary according to model type (OLTP vs OLAP), country rules, customer type (organization vs. individual), how critical the address data is, etc.

As said, you should separate city. Separating City will make the problem of having different city names in the customer table go away. Reasons are:

1 - Separating City names in a separate table allows you to run queries like: give me customer distribution by city and show cities where no customers are there.

2 - Allows you GUI to always refer to the correct list of cities.

3 - Allows you to maintain city information without touching your program code.

4 - Allows duplicate city names (in different states) if you use the diagram below.

If you are sure that your application is for 1 country, then don't add the country in.

Also, I noticed you don't separate the street address information into 2 columns which is common in North America and that you have no Zip Code. Review the country's postal address requirements from postal authorities to make sure your design conforms to them.

Here is a common representation of address in an OLTP application. Here the PK of City is 2 columns, namely, CityID and StateID. A variation on this version would be to use single ID for each table (as a sequence number) and end up with a FK composed of 1 column only at the customer table.

It all boils down to business rules and requirements.

The following is a basic representation.

-----------------Answer-----------------

To normalize the design, you have to seperate the CITY. Besides saving some space, its good design. Also alleviates duplicate city name entries, like Foo, foo, Fooo etc and you can have details specific to city in that table like zip code.

-----------------Answer-----------------

He also mentioned that because I have duplicate columns - city: foo, city: bar, city: foo. (few customers in the same city) this is not considered normalized, is this true?

The Wikipedia article on 3NF says: a table is in 3NF if and only if, for each of its functional dependencies X → A, at least one of the following conditions holds:

  • X contains A (that is, X → A is trivial functional dependency)
  • Or X is a superkey
  • A-X, the set difference between A and X is a prime attribute (i.e., A-X is contained within a candidate key)

City cannot be said to functionally dependent on itself. So the repetition of identical values in different rows does not violate 3NF.

For an example that would violate 3NF, add CityPopulation to the table. Now there is a functional dependency for X = (city) and A = (CityPopulation). This dependency satisfies none of the three conditions, and this table design would violate normal form.

Having said that, I think normalization is a pointless academic exercise. Striving for 3NF compliance in a database is a guaranteed catastrophe.


Previous : How to use MySQL Server 5.6 for GeoIP replication?
Next : Fastest way to copy data from MyISAM to InnoDB