3

CSV and data types

 2 years ago
source link: https://relational-pipes.globalcode.info/v_0/examples-csv-data-types.xhtml
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

CSV and data types

CSV (RFC 4180) is quite good solution when we want to store or share relational data in a simple text format – both, human-readable and well supported by many existing applications and libraries. We have even ready-to-use GUI editors, so called spreadsheets (e.g. LibreOffice Calc). However, such simple formats have usually some drawbacks. CSV may contain only a single relation (table, sheet). This is not a big issue – we can use several files. A more serious problem is the absence of data types – in CSV, everything is just a text string. Thus it was impossible to have loss-less conversion to CSV and back.

$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-tabular
filesystem:
 ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
 │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
 ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
 │ license/        │ d             │              0 │ hacker         │ hacker         │
 │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
 ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2

Data types are missing in CSV by default:

$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv 
"path","type","size","owner","group"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"

The size attribute was integer and now it is mere string:

$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-out-tabular 
csv:
 ╭─────────────────┬───────────────┬───────────────┬────────────────┬────────────────╮
 │ path   (string) │ type (string) │ size (string) │ owner (string) │ group (string) │
 ├─────────────────┼───────────────┼───────────────┼────────────────┼────────────────┤
 │ license/        │ d             │ 0             │ hacker         │ hacker         │
 │ license/gpl.txt │ f             │ 35147         │ hacker         │ hacker         │
 ╰─────────────────┴───────────────┴───────────────┴────────────────┴────────────────╯
Record count: 2

Declare data types in the CSV header

Since Relational pipes v0.18 we can encode the data types (currently strings, integers and booleans) in the CSV header and then recover them while reading. Such „CSV with data types“ is valid CSV according to the RFC specification and can be viewed or edited in any CSV-capable software.

The attribute name and data type are separated by the :: symbol e.g. name::string,age::integer,member::boolean. Attribute names may contain :: (unlike the data type names).

The data type declarations may be added simply by hand or automatically using relpipe-out-csv.

$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true 
"path::string","type::string","size::integer","owner::string","group::string"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"

The relpipe-out-csv + relpipe-in-csv round-trip now does not degrade the data quality:

$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true | relpipe-in-csv | relpipe-out-tabular 
csv:
 ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
 │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
 ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
 │ license/        │ d             │              0 │ hacker         │ hacker         │
 │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
 ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2

So we can put e.g. a CSV editor between them while storing and versioning the data in a different format (like XML or Recfile). Such workflow can be effectively managed by makemake edit will convert versioned data to CSV and launch the editor, make commit will convert data back from the CSV and commit them in Mercurial, Git or other version control system (VCS).

Why put into VCS data in different format than CSV? Formats like XML or Recfile may have each attribute on a separate line which leads to more readable diffs. At a glance we can see which attributes have been changed. While in CSV we see just a changed long line and even with a better tools we need to count the comas to know which attribute it was.

The relpipe-out-csv tool generates data types only when explicitly asked for: --write-types true. The relpipe-in-csv tool automatically looks for these type declarations and if all attributes have valid type declarations, they are used, otherwise they are considered to be a part of the attribute name. This behavior can be disabled by --read-types false (true will require valid type declarations).

Recognize data types using relpipe-tr-infertypes

Sometimes we may also want to infer data types from the values automatically without any explicit declaration. Then we put the relpipe-tr-infertypes tool in our pipeline. It buffers whole relations and checks all values of each attribute. If they contain all integers or all booleans they are converted to given type.

$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-tr-infertypes | relpipe-out-tabular
csv:
 ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
 │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
 ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
 │ license/        │ d             │              0 │ hacker         │ hacker         │
 │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
 ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2

This approach is inefficient and contradicts streaming, however it is sometimes useful and convenient for small data coming from external sources. We can e.g. download some data set from network and pipe it through relpipe-in-csv + relpipe-tr-infertypes and improve the data quality a bit.

We may apply the type inference only on certain relations: --relation "my_relation" or chose different mode: --mode data or metadata or auto. The data mode is described above. In the metadata mode the relpipe-tr-infertypes works similar to relpipe-in-csv --read-types true. The auto mode checks for the metadata in attribute names first and if not found, it fallbacks to the data mode. This tool works with any relational data regardless their original format or source (not only with CSV).

No header? Specify types as CLI parameters

Some CSV files contain just data – have no header line containing the column names. Then we specify the attribute names and data types as CLI parameters of relpipe-in-csv:

$ echo -e "a,b,c\nA,B,C" \
	| relpipe-in-csv \
		--relation 'just_data' \
			--attribute 'x' string \
			--attribute 'y' string \
			--attribute 'z' string \
	| relpipe-out-tabular

just_data:
 ╭────────────┬────────────┬────────────╮
 │ x (string) │ y (string) │ z (string) │
 ├────────────┼────────────┼────────────┤
 │ a          │ b          │ c          │
 │ A          │ B          │ C          │
 ╰────────────┴────────────┴────────────╯
Record count: 2

We may also skip existing header line: tail -n +2 and force our own names and types. However this will not work if there are multiline values in the header – which is not common – in such cases we should use some relpipe-tr-* tool to rewrite the names or types (these tools work with relational data instead of plain text).


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK