GitHub - Yord/pxi: 🧚pxi (pixie) is a small, fast, and magical command-line data...
source link: https://github.com/Yord/pxi
pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.
Installation
Installation is done using npm.
$ npm i -g pxi
Try pxi --help to see if the installation was successful.
Features
- Small: Pixie does one thing and does it well (processing data with JavaScript).
- Fast: pxi is as fast as gawk, 3x faster than jq and mlr, and 15x faster than fx.
- Magical: It is trivial to write your own plugins.
- Playful: Opt-in to more data formats by installing plugins.
- Versatile: Use Ramda, Lodash and any other JavaScript library to process data on the command-line.
- Loving: Pixie is made with love and encourages a positive and welcoming environment.
Getting Started
Pixie reads in big structured text files, transforms them with JavaScript functions, and writes them back to disk. The usage examples in this section are based on the following large JSONL file.
$ head -5 2019.jsonl # 2.6GB, 31,536,000 lines
{"time":1546300800,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":0}
{"time":1546300801,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":1}
{"time":1546300802,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":2}
{"time":1546300803,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":3}
{"time":1546300804,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":4}
Execute any JavaScript function:
$ pxi "json => json.time" < 2019.jsonl
$ pxi "({time}) => time" < 2019.jsonl
Convert between JSON, CSV, SSV, and TSV:
$ pxi --from json --to csv < 2019.jsonl > 2019.csv
$ pxi --deserializer json --serializer csv < 2019.jsonl > 2019.csv
$ pxi -d json -s csv < 2019.jsonl > 2019.csv
Use Ramda, Lodash or any other JavaScript library:
$ pxi "o(obj => _.omit(obj, ['seconds']), evolve({time: parseInt}))" --from csv < 2019.csv
Process data streams from REST APIs and other sources and pipe pixie's output to other commands:
$ curl -s "https://swapi.co/api/films/" | pxi 'json => json.results' --with flatMap --keep '["episode_id", "title"]' | sort
Use pixie's ssv deserializer to work with command line output:
$ ls -ahl / | pxi '([,,,,size,,,,file]) => ({size, file})' --from ssv
See the usage section below for more examples.
Introductory Blogposts
For a quick start, read the following blog posts:
Pixie
Pixie's philosophy is to provide a small, extensible frame for processing large files and streams with JavaScript functions. Different data formats are supported through plugins. JSON, CSV, SSV, and TSV are supported by default, but users can customize their pixie installation by picking and choosing from more available (including third-party) plugins.
Pixie works its magic by chunking, deserializing, applying functions, and serializing data. Expressed in code, it works like this:
function pxi (data) {                  // Data is passed to pxi from stdin.
  const chunks = chunk(data)           // The data is chunked.
  const jsons  = deserialize(chunks)   // The chunks are deserialized into JSON objects.
  const jsons2 = apply(f, jsons)       // f is applied to each object and new JSON objects are returned.
  const string = serialize(jsons2)     // The new objects are serialized to a string.
  process.stdout.write(string)         // The string is written to stdout.
}
For example, chunking, deserializing, and serializing JSON is provided by the pxi-json plugin.
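To make the four stages concrete, here is a minimal, self-contained sketch of the same pipeline for line-delimited JSON. It is an illustration only and not pixie's actual implementation: the names chunk, deserialize, apply, and serialize mirror the pseudocode above, but their bodies (and the run helper) are assumptions, and error handling is omitted.

```javascript
// Illustrative sketch only: not pixie's real code.
// Every stage is a plain function over in-memory data.

// Split raw text into one chunk per non-empty line.
const chunk = data => data.split('\n').filter(line => line.trim() !== '')

// Parse every chunk into a JSON value.
const deserialize = chunks => chunks.map(c => JSON.parse(c))

// Apply the user function f to every value.
const apply = (f, jsons) => jsons.map(f)

// Turn the results back into line-delimited JSON.
const serialize = jsons => jsons.map(j => JSON.stringify(j)).join('\n') + '\n'

// Hypothetical helper that runs all four stages in order.
const run = (f, data) => serialize(apply(f, deserialize(chunk(data))))

const input =
  '{"time":1546300800,"seconds":0}\n' +
  '{"time":1546300801,"seconds":1}\n'

const output = run(json => json.time, input)
process.stdout.write(output)
```

Keeping the stages separate is what lets plugins swap out a single step, for instance a different deserializer, without touching the others.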
Plugins
The following plugins are available:
Plugin | Chunkers | Deserializers | Appliers | Serializers | pxi |
---|---|---|---|---|---|
pxi-dust | line | | map, flatMap, filter | string | ✓ |
pxi-json | jsonObj | json | | json | ✓ |
pxi-dsv | | csv, tsv, ssv, dsv | | csv, tsv, ssv, dsv | ✓ |
pxi-sample | sample | sample | sample | sample | ✕ |
The last column states which plugins come preinstalled in pxi.
Refer to the .pxi Module section to see how to enable more plugins and how to develop plugins.
New experimental pixie plugins are developed, among other places, in the pxi-sandbox repository.
Performance
pxi is very fast and beats several similar tools in performance benchmarks.
Times are given in CPU time (seconds); wall-clock times may deviate by ± 1s.
The benchmarks were run on a 13" MacBook Pro (2019) with a 2.8 GHz Quad-Core i7 and 16GB memory.
Feel free to run the benchmarks on your own machine, and if you do, please open an issue to report your results!
Benchmark | Description | pxi | gawk | jq | mlr | fx |
---|---|---|---|---|---|---|
JSON 1 | Select an attribute on small JSON objects | 11s | 15s | 46s | – | 284s |
JSON 2 | Select an attribute on large JSON objects | 20s | 20s | 97s | – | 301s |
JSON 3 | Pick a single attribute on small JSON objects | 15s | 21s | 68s | 91s | 368s |
JSON 4 | Pick a single attribute on large JSON objects | 26s | 27s | 130s | 257s† | 420s |
JSON to CSV 1 | Convert a small JSON to CSV format | 15s | – | 77s | 60s | – |
JSON to CSV 2 | Convert a large JSON to CSV format | 38s | – | 264s | 237s† | – |
CSV 1 | Select a column from a small csv file | 11s | 8s | 37s | 23s | – |
CSV 2 | Select a column from a large csv file | 19s | 9s | 66s | 72s | – |
CSV to JSON 1 | Convert a small CSV to JSON format | 15s | – | – | 120s | – |
CSV to JSON 2 | Convert a large CSV to JSON format | 42s | – | – | 352s | – |
† mlr appears to load the whole file instead of processing it in chunks when reading JSON.
This is why it fails on large input files.
So in these benchmarks, mlr processes the first 20,000,000 lines in one run and the remaining 11,536,000 lines in a second run.
The times of both runs are summed up.
pxi and gawk notably beat jq, mlr, and fx in every benchmark.
However, due to its different data processing approach, pxi is more versatile than gawk and can, for example, convert one data format into another.
Usage
The examples in this section are based on the following big JSONL file.
$ head -5 2019.jsonl # 2.6GB, 31,536,000 lines
{"time":1546300800,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":0}
{"time":1546300801,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":1}
{"time":1546300802,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":2}
{"time":1546300803,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":3}
{"time":1546300804,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":4}
Select the time:
$ pxi "json => json.time" < 2019.jsonl
Select month and day:
$ pxi '({month, day}) => ({month, day})' < 2019.jsonl
Convert JSON to CSV:
$ pxi --from json --to csv < 2019.jsonl > 2019.csv
Convert JSON to CSV, but keep only time and month:
$ pxi '({time, month}) => [time, month]' --to csv < 2019.jsonl
Rename time to timestamp and convert CSV to TSV:
$ pxi '({time, ...rest}) => ({timestamp: time, ...rest})' --from csv --to tsv < 2019.csv
Convert CSV to JSON:
$ pxi --deserializer csv --serializer json < 2019.csv
Convert CSV to JSON and cast time to integer:
$ pxi '({time, ...rest}) => ({time: parseInt(time), ...rest})' -d csv < 2019.csv
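The rename and cast examples above both rely on object rest/spread. The following sketch shows the pattern in isolation, assuming (as the need for parseInt suggests) that every value arrives as a string after CSV deserialization. The sample row and the names castTime and renameTime are made up for illustration.

```javascript
// A record as it might look after CSV deserialization: every value is a string.
const row = {time: '1546300800', year: '2019', month: '1'}

// Cast only time and keep the remaining fields untouched via rest/spread.
const castTime = ({time, ...rest}) => ({time: parseInt(time, 10), ...rest})

// Rename time to timestamp with the same pattern.
const renameTime = ({time, ...rest}) => ({timestamp: time, ...rest})

console.log(castTime(row))
console.log(renameTime(row))
```

Neither function mutates its input: each returns a new object, so the same record can safely be passed through several functions in sequence.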
Use Ramda (or Lodash):
$ pxi 'evolve({year: parseInt, month: parseInt, day: parseInt})' -d csv < 2019.csv
Select only May the 4th:
$ pxi '({month, day}) => month == 5 && day == 4' --applier filter < 2019.jsonl
Use more than one function:
$ pxi '({month}) => month == 5' '({day}) => day == 4' -a filter < 2019.jsonl
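One plausible reading of the two filter commands above is that a record is kept only if every function returns a truthy value. The sketch below illustrates that assumed semantics in plain JavaScript; keepIfAll is a hypothetical helper and not part of pixie, and the sample records are invented.

```javascript
// Hypothetical helper: keep a record only if all predicates accept it.
const keepIfAll = predicates => jsons =>
  jsons.filter(json => predicates.every(p => p(json)))

const records = [
  {month: 5, day: 3},
  {month: 5, day: 4},
  {month: 6, day: 4}
]

// Two separate predicates, combined the way the second command passes them.
const mayTheFourth = keepIfAll([
  ({month}) => month == 5,
  ({day}) => day == 4
])(records)

console.log(mayTheFourth)
```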
Keep only certain keys and pretty-print JSON:
$ pxi --keep '["time"]' --spaces 2 < 2019.jsonl > pretty.jsonl
Deserialize JSON that is not given line by line:
$ pxi --by jsonObj < pretty.jsonl
Suppose you have to access a web API:
$ curl -s "https://swapi.co/api/people/"
Use pixie to organize the response:
$ curl -s "https://swapi.co/api/people/" | pxi "json => json.results" --with flatMap --keep '["name","height","mass"]'
Compute the BMI of every Star Wars character:
$ curl -s "https://swapi.co/api/people/" | pxi "json => json.results" -a flatMap -K '["name","height","mass"]' | pxi "ch => (ch.bmi = ch.mass / (ch.height / 100) ** 2, ch)" -K '["name","bmi"]'
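The BMI function in the command above uses the comma operator: the assignment is evaluated first, then the character object itself is returned. A small sketch of the same function on a made-up record follows, with height in centimetres and mass in kilograms. The values are given as strings here on the assumption that the API delivers them that way, so the arithmetic relies on JavaScript's implicit string-to-number coercion.

```javascript
// Same function as in the command above: assign bmi, then return the object.
const addBmi = ch => (ch.bmi = ch.mass / (ch.height / 100) ** 2, ch)

// Sample record for illustration only; height in cm, mass in kg.
const character = {name: 'Example', height: '172', mass: '77'}

// BMI = mass / (height in metres)^2 = 77 / 1.72^2
const withBmi = addBmi(character)
console.log(withBmi.name, withBmi.bmi.toFixed(1))
```

Note that addBmi mutates and returns the same object rather than creating a copy, which is why the subsequent -K option can select the new bmi key.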
Identify all obese Star Wars characters:
$ curl -s "https://swapi.co/api/people/" | pxi "json => json.results" -a flatMap -K '["name","height","mass"]' | pxi "ch => (ch.bmi = ch.mass / (ch.height / 100) ** 2, ch)" -K '["name","bmi"]' | pxi "ch => ch.bmi >= 30" -a filter -K '["name"]'
Select PID and CMD from ps:
$ ps | pxi '([pid, tty, time, cmd]) => ({pid, cmd})' --from ssv
Select file size and filename from ls:
$ ls -ahl / | pxi '([,,,,size,,,,file]) => ({size, file})' --from ssv
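The function in the ls example uses array destructuring with holes to skip unwanted columns. The sketch below shows the idea on a single, made-up line of ls output, under the assumption that space-separated deserialization amounts to splitting on runs of whitespace and handing the function an array of fields.

```javascript
// Hypothetical line of `ls -ahl` output.
const line = 'drwxr-xr-x  5 root  wheel  160B Jan  1 00:00 usr'

// Assumption: splitting on runs of whitespace yields an array of fields.
const fields = line.trim().split(/\s+/)

// Holes in the pattern skip fields: index 4 is the size, index 8 the file name.
const pick = ([,,,,size,,,,file]) => ({size, file})

console.log(pick(fields))
```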
Allow JSON objects and lists in CSV:
$ echo '{"a":1,"b":[1,2,3]}\n{"a":2,"b":{"c":2}}' | pxi --to csv --no-fixed-length --allow-list-values
Decode JSON values in CSV:
$ echo '{"a":1,"b":[1,2,3]}\n{"a":2,"b":{"c":2}}' | pxi --to csv --no-fixed-length --allow-list-values | pxi --from csv 'evolve({b: JSON.parse})'
.pxi Module
Users may extend and modify pxi by providing a .pxi module.
If you wish to do that, create a ~/.pxi/index.js file and insert the following base structure:
module.exports = { plugins: [], context: {}, defaults: {} }
The following sections will walk you through all capabilities of .pxi modules.
If you want to skip over the details and instead see sample code, visit pxi-pxi!
Writing Plugins
You may write pixie plugins in ~/.pxi/index.js.
Writing your own extensions is straightforward:
const sampleChunker = {
  name: 'sample',
  desc: 'is a sample chunker.',
  func: ({verbose}) => (data, prevLines, noMoreData) => (
    // * Turn data into an array of chunks
    // * Count lines for better error reporting throughout pxi
    // * Collect error reports: {msg: String, line: Number, info: String}
    //   If verbose > 0, include line in error reports
    //   If verbose > 1, include info in error reports
    // * Return errors, chunks, lines, the last line, and all unchunked data
    {err: [], chunks: [], lines: [], lastLine: 0, rest: ''}
  )
}

const sampleDeserializer = {
  name: 'sample',
  desc: 'is a sample deserializer.',
  func: ({verbose}) => (chunks, lines) => (
    // * Deserialize chunks to jsons
    // * Collect error reports: {msg: String, line: Number, info: Chunk}
    //   If verbose > 0, include line in error reports
    //   If verbose > 1, include info in error reports
    // * Return errors and deserialized jsons
    {err: [], jsons: []}
  )
}

const sampleApplier = {
  name: 'sample',
  desc: 'is a sample applier.',
  func: (functions, {verbose}) => (jsons, lines) => (
    // * Turn jsons into other jsons by applying all functions
    // * Collect error reports: {msg: String, line: Number, info: Json}
    //   If verbose > 0, include line in error reports
    //   If verbose > 1, include info in error reports
    // * Return errors and transformed jsons
    {err: [], jsons: []}
  )
}

const sampleSerializer = {
  name: 'sample',
  desc: 'is a sample serializer.',
  func: ({verbose}) => jsons => (
    // * Turn jsons into a string
    // * Collect error reports: {msg: String, line: Number, info: Json}
    //   If verbose > 0, include line in error reports
    //   If verbose > 1, include info in error reports
    // * Return errors and serialized string
    {err: [], str: ''}
  )
}
The name is used by pixie to select your extension, the desc is displayed in the options section of pxi --help, and the func is called by pixie to transform data.
The sample extensions are bundled into the sample plugin, as follows:
const sample = {
  chunkers:      [sampleChunker],
  deserializers: [sampleDeserializer],
  appliers:      [sampleApplier],
  serializers:   [sampleSerializer]
}
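As a slightly more concrete illustration, here is a sketch of a chunker that splits incoming data at newlines. The field names (name, desc, func) and the shape of the return value follow the sample chunker template in this section. Everything else is an assumption made for this example: the name myLine, the splitting logic, and the interpretation of prevLines as the number of lines seen so far and of noMoreData as an end-of-input flag.

```javascript
// Hypothetical chunker: interface taken from the template in this section,
// behaviour assumed for illustration.
const lineChunker = {
  name: 'myLine',
  desc: 'is a hypothetical chunker that splits data at newlines.',
  func: ({verbose}) => (data, prevLines, noMoreData) => {
    const parts = data.split('\n')
    // The last part may be an incomplete line. It is handed back as rest,
    // unless the input is exhausted and the part is a real final line.
    const last = parts.pop()
    const rest = noMoreData ? '' : last
    if (noMoreData && last !== '') parts.push(last)

    const chunks = []
    const lines = []
    let lastLine = prevLines
    for (const part of parts) {
      lastLine += 1
      if (part !== '') {          // skip empty lines but still count them
        chunks.push(part)
        lines.push(lastLine)
      }
    }
    // This sketch never produces errors, so verbose is not consulted.
    return {err: [], chunks, lines, lastLine, rest}
  }
}

const chunkFn = lineChunker.func({verbose: 0})
const result = chunkFn('{"a":1}\n{"a":2}\n{"a"', 0, false)
console.log(result)
```

In this sketch, the incomplete trailing text is returned as rest so that a caller could prepend it to the next piece of data before chunking again.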
Extending Pixie with Plugins
Plugins can come from two sources:
They are either written by the user, as shown in the previous section, or they are installed in ~/.pxi/ as follows:
$ npm install pxi-sample
If a plugin was installed, it has to be imported into ~/.pxi/index.js:
const sample = require('pxi-sample')
Regardless of whether a plugin was defined by a user or installed from npm, all plugins are added to the .pxi module the same way:
module.exports = { plugins: [sample], context: {}, defaults: {} }
pxi --help should now list the sample plugin extensions in the options section.
Adding plugins may break the pxi command line tool! If this happens, just remove the plugin from the list and pxi should work normally again. Use this feature responsibly.
Including Libraries like Ramda or Lodash
Libraries like Ramda and Lodash are of immense help when writing functions to transform JSON objects, and there have been many heated discussions about which of them is superior. Since different people have different preferences, pixie lets the user decide which library to use.
First, install your preferred libraries in ~/.pxi/:
$ npm install ramda
$ npm install lodash
Next, add the libraries to ~/.pxi/index.js:
const R = require('ramda')
const L = require('lodash')

module.exports = {
  plugins: [],
  context: Object.assign({}, R, {_: L}),
  defaults: {}
}
You may now use all Ramda functions without prefix, and all Lodash functions with prefix _:
$ pxi "prop('time')" < 2019.jsonl
$ pxi "json => _.get(json, 'time')" < 2019.jsonl
Using Ramda and Lodash in your functions may have a negative impact on performance! Use this feature responsibly.
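How pixie makes the context entries visible inside function strings is not described here. One way such a mechanism could work is to compile each function string with every context key bound as a parameter name, as sketched below. The compile helper is hypothetical, and prop and lodashLike are tiny stand-ins so that the example has no dependencies; none of this is taken from pixie's source.

```javascript
// Stand-ins for library functions so that the sketch has no dependencies.
const prop = key => obj => obj[key]
const lodashLike = {get: (obj, key) => obj[key]}

// Same shape as the context above: spread functions are unprefixed, _ is a namespace.
const context = Object.assign({}, {prop}, {_: lodashLike})

// Assumed mechanism: evaluate the function string with each context key
// available as a local name.
const compile = (source, ctx) =>
  new Function(...Object.keys(ctx), `return (${source})`)(...Object.values(ctx))

const json = {time: 1546300800}
const f1 = compile("prop('time')", context)
const f2 = compile("json => _.get(json, 'time')", context)

console.log(f1(json), f2(json))
```

Under this reading, spreading a whole library into the context is what makes its functions usable without a prefix, while wrapping it in an object such as {_: L} keeps them namespaced.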
Including Custom JavaScript Functions
Just as you may extend pixie with third-party libraries like Ramda and Lodash, you may add your own functions.
This is as simple as adding them to the context in ~/.pxi/index.js:
const getTime = json => json.time

module.exports = {
  plugins: [],
  context: {getTime},
  defaults: {}
}
After adding it to the context, you may use your function:
$ pxi "json => getTime(json)" < 2019.jsonl
$ pxi "getTime" < 2019.jsonl
Changing pxi Defaults
You may globally change default chunkers, deserializers, appliers, and serializers in ~/.pxi/index.js, as follows:
module.exports = {
  plugins: [],
  context: {},
  defaults: {
    chunker: 'sample',
    deserializer: 'sample',
    appliers: 'sample',
    serializer: 'sample',
    noPlugins: false
  }
}
Defaults are assigned globally and changing them may break existing pxi scripts! Use this feature responsibly.
id Plugin
pxi includes the id plugin that comes with the following extensions:
Extension | Description |
---|---|
id chunker | Returns each data as a chunk. |
id deserializer | Returns all chunks unchanged. |
id applier | Does not apply any functions and returns the JSON objects unchanged. |
id serializer | Applies Object.prototype.toString to the input and joins without newlines. |
Comparison to Related Tools
Aspect | pxi | jq | mlr | fx | gawk |
---|---|---|---|---|---|
Self-description | Small, fast, and magical command-line data processor similar to awk, jq, and mlr. | Command-line JSON processor | Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON | Command-line tool and terminal JSON viewer | The awk utility interprets a special-purpose programming language that makes it possible to handle simple data-reformatting jobs with just a few lines of code |
Focus | Transforming data with user provided functions and converting between formats | Transforming JSON with user provided functions | Transforming CSV with user provided functions and converting between formats | Transforming JSON with user provided functions | Language for simple data reformatting tasks |
License | MIT | MIT | BSD-3-Clause | MIT | GPL-3.0-only |
Performance | (performance is given relative to pxi) | jq is >3x slower than pxi | mlr is >3x slower than pxi | fx is >15x slower than pxi | pxi is as performant as gawk when processing JSON and CSV |
Processing Language | JavaScript and all JavaScript libraries | jq language | Predefined verbs and custom put/filter DSL | JavaScript and all JavaScript libraries | awk language |
Extensibility | (Third-party) Plugins, any JavaScript library, custom functions | (Third-party) Modules written in jq | Running arbitrary shell commands | Any JavaScript library, custom functions | gawk dynamic extensions |
Similarities | | pxi and jq both heavily rely on JSON | pxi and mlr both convert back and forth between CSV and JSON | pxi and fx both apply JavaScript functions to JSON streams | pxi and gawk both transform data |
Differences | | pxi and jq use different processing languages | While pxi uses a programming language for data processing, mlr uses a custom put/filter DSL. Also, mlr reads in the whole file while pxi processes it in chunks | pxi supports data formats other than JSON, and fx provides a terminal JSON viewer | While pxi functions transform a JSON into another JSON, gawk does not have a strict format other than transforming strings into other strings |
Reporting Issues
Please report issues in the tracker!
Contributing
We are open to, and grateful for, any contributions made by the community. By contributing to pixie, you agree to abide by the code of conduct. Please read the contributing guide.
License
pxi is MIT licensed.