GitHub - Yord/pxi: 🧚pxi (pixie) is a small, fast, and magical command-line data...

pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.

Installation

Installation is done using npm.

$ npm i -g pxi

Try pxi --help to see if the installation was successful.

Features

Small: Pixie does one thing and does it well (processing data with JavaScript).
Fast: pxi is as fast as gawk, 3x faster than jq and mlr, and 15x faster than fx.
Magical: It is trivial to write your own ~~spells~~ plugins.
Playful: Opt-in to more data formats by installing plugins.
Versatile: Use Ramda, Lodash and any other JavaScript library to process data on the command-line.
Loving: Pixie is made with love and encourages a positive and welcoming environment.

Getting Started

Pixie reads in big structured text files, transforms them with JavaScript functions, and writes them back to disk. The usage examples in this section are based on the following large JSONL file.

$ head -5 2019.jsonl # 2.6GB, 31,536,000 lines

{"time":1546300800,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":0}
{"time":1546300801,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":1}
{"time":1546300802,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":2}
{"time":1546300803,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":3}
{"time":1546300804,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":4}

Execute any JavaScript function:

$ pxi "json => json.time" < 2019.jsonl
$ pxi "({time}) => time" < 2019.jsonl

Convert between JSON, CSV, SSV, and TSV:

$ pxi --from json --to csv < 2019.jsonl > 2019.csv
$ pxi --deserializer json --serializer csv < 2019.jsonl > 2019.csv
$ pxi -d json -s csv < 2019.jsonl > 2019.csv

Use Ramda, Lodash or any other JavaScript library:

$ pxi "o(obj => _.omit(obj, ['seconds']), evolve({time: parseInt}))" --from csv < 2019.csv
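Ramda's o composes two unary functions from right to left, so the command above first casts time to an integer and then drops the seconds key. The following dependency-free sketch mimics that transformation. The o, evolve, and omit functions here are simplified hand-written stand-ins, not the library implementations, and the input record is made up for illustration.

```javascript
// Simplified stand-ins for Ramda's `o` and `evolve` and Lodash's `omit`.
const o = (f, g) => x => f(g(x))

const evolve = transforms => obj =>
  Object.fromEntries(
    Object.entries(obj).map(
      ([key, value]) => [key, key in transforms ? transforms[key](value) : value]
    )
  )

const omit = (obj, keys) =>
  Object.fromEntries(Object.entries(obj).filter(([key]) => !keys.includes(key)))

// Same shape as the function passed to pxi above.
const transform = o(obj => omit(obj, ['seconds']), evolve({time: parseInt}))

// CSV fields arrive as strings, hence the cast.
const result = transform({time: '1546300800', seconds: '0'})
console.log(result)
```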

Process data streams from REST APIs and other sources and pipe pixie's output to other commands:

$ curl -s "https://swapi.co/api/films/" |
  pxi 'json => json.results' --with flatMap --keep '["episode_id", "title"]' |
  sort

Use pixie's ssv deserializer to work with command line output:

$ ls -ahl / | pxi '([,,,,size,,,,file]) => ({size, file})' --from ssv

See the usage section below for more examples.

Introductory Blogposts

For a quick start, read the following blog posts:

Pixie

Pixie's philosophy is to provide a small, extensible frame for processing large files and streams with JavaScript functions. Different data formats are supported through plugins. JSON, CSV, SSV, and TSV are supported by default, but users can customize their pixie installation by choosing from additional plugins, including third-party ones.

Pixie works its magic by chunking, deserializing, applying functions, and serializing data. Expressed in code, it works like this:

function pxi (data) {                // Data is passed to pxi from stdin.
  const chunks = chunk(data)         // The data is chunked.
  const jsons  = deserialize(chunks) // The chunks are deserialized into JSON objects. 
  const jsons2 = apply(f, jsons)     // f is applied to each object and new JSON objects are returned.
  const string = serialize(jsons2)   // The new objects are serialized to a string.
  process.stdout.write(string)       // The string is written to stdout.
}

For example, chunking, deserializing, and serializing JSON is provided by the pxi-json plugin.
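As a rough illustration of the four stages, the following self-contained sketch runs them on an in-memory JSONL string. The helpers chunk, deserialize, apply, serialize, and run are hypothetical stand-ins and not pxi's internals, and the sketch works on a complete string rather than on a stream.

```javascript
// Hypothetical stand-ins for the four stages; not pxi's actual implementation.
const chunk       = data   => data.split('\n').filter(line => line !== '')
const deserialize = chunks => chunks.map(c => JSON.parse(c))
const apply       = (f, jsons) => jsons.map(f)
const serialize   = jsons  => jsons.map(j => JSON.stringify(j)).join('\n') + '\n'

// Runs all four stages in order on a string.
const run = (f, data) => serialize(apply(f, deserialize(chunk(data))))

const input  = '{"time":1546300800,"seconds":0}\n{"time":1546300801,"seconds":1}\n'
const output = run(json => json.time, input)

process.stdout.write(output)
```

Keeping each stage a separate function mirrors the plugin categories: a different format only needs a different chunk, deserialize, or serialize step, while the applied function stays the same.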

Plugins

The following plugins are available:

Plugin	Chunkers	Deserializers	Appliers	Serializers	In `pxi`
`pxi-dust`	`line`	–	`map`, `flatMap`, `filter`	`string`	✓
`pxi-json`	`jsonObj`	`json`	–	`json`	✓
`pxi-dsv`	–	`csv`, `tsv`, `ssv`, `dsv`	–	`csv`, `tsv`, `ssv`, `dsv`	✓
`pxi-sample`	`sample`	`sample`	`sample`	`sample`	✕

The last column states which plugins come preinstalled in pxi. Refer to the .pxi Module section to see how to enable more plugins and how to develop plugins. New experimental pixie plugins are developed, among other places, in the pxi-sandbox repository.

Performance

pxi is very fast and beats several similar tools in performance benchmarks. Times are given in CPU time (seconds); wall-clock times may deviate by ± 1s. The benchmarks were run on a 13" MacBook Pro (2019) with a 2.8 GHz Quad-Core i7 and 16 GB memory. Feel free to run the benchmarks on your own machine, and if you do, please open an issue to report your results!

Benchmark	Description	`pxi`	`gawk`	`jq`	`mlr`	`fx`
JSON 1	Select an attribute on small JSON objects	11s	15s	46s	–	284s
JSON 2	Select an attribute on large JSON objects	20s	20s	97s	–	301s
JSON 3	Pick a single attribute on small JSON objects	15s	21s	68s	91s	368s
JSON 4	Pick a single attribute on large JSON objects	26s	27s	130s	257s†	420s
JSON to CSV 1	Convert a small JSON to CSV format	15s	–	77s	60s	–
JSON to CSV 2	Convert a large JSON to CSV format	38s	–	264s	237s†	–
CSV 1	Select a column from a small csv file	11s	8s	37s	23s	–
CSV 2	Select a column from a large csv file	19s	9s	66s	72s	–
CSV to JSON 1	Convert a small CSV to JSON format	15s	–	–	120s	–
CSV to JSON 2	Convert a large CSV to JSON format	42s	–	–	352s	–

† mlr appears to load the whole file instead of processing it in chunks when reading JSON, which is why it fails on large input files. In these benchmarks, the first 20,000,000 lines are therefore processed first, followed by the remaining 11,536,000 lines, and the times of both runs are added up.

pxi and gawk beat jq, mlr, and fx in every benchmark. However, because of its different data processing approach, pxi is more versatile than gawk and can, for example, convert one data format into another.

Usage

The examples in this section are based on the following big JSONL file.

$ head -5 2019.jsonl # 2.6GB, 31,536,000 lines

{"time":1546300800,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":0}
{"time":1546300801,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":1}
{"time":1546300802,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":2}
{"time":1546300803,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":3}
{"time":1546300804,"year":2019,"month":1,"day":1,"hours":0,"minutes":0,"seconds":4}

Select the time:

$ pxi "json => json.time" < 2019.jsonl

Select month and day:

$ pxi '({month, day}) => ({month, day})' < 2019.jsonl

Convert JSON to CSV:

$ pxi --from json --to csv < 2019.jsonl > 2019.csv

Convert JSON to CSV, but keep only time and month:

$ pxi '({time, month}) => [time, month]' --to csv < 2019.jsonl

Rename time to timestamp and convert CSV to TSV:

$ pxi '({time, ...rest}) => ({timestamp: time, ...rest})' --from csv --to tsv < 2019.csv

Convert CSV to JSON:

$ pxi --deserializer csv --serializer json < 2019.csv

Convert CSV to JSON and cast time to integer:

$ pxi '({time, ...rest}) => ({time: parseInt(time), ...rest})' -d csv < 2019.csv

Use Ramda (or Lodash):

$ pxi 'evolve({year: parseInt, month: parseInt, day: parseInt})' -d csv < 2019.csv

Select only May the 4th:

$ pxi '({month, day}) => month == 5 && day == 4' --applier filter < 2019.jsonl

Use more than one function:

$ pxi '({month}) => month == 5' '({day}) => day == 4' -a filter < 2019.jsonl
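The command above is presented as an alternative to the single-function filter for May the 4th, which suggests that several filter functions are combined with a logical AND. A minimal sketch of that behavior in plain JavaScript, under that assumption (filterAll is a hypothetical helper, not part of pxi, and the records are made up):

```javascript
// Hypothetical helper: keep an object only if every predicate accepts it.
const filterAll = predicates => jsons =>
  jsons.filter(json => predicates.every(p => p(json)))

const records = [
  {month: 5, day: 3},
  {month: 5, day: 4},
  {month: 6, day: 4}
]

// The same two predicates as in the command above.
const mayTheFourth = filterAll([
  ({month}) => month == 5,
  ({day})   => day == 4
])

const selected = mayTheFourth(records)
console.log(selected)
```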

Keep only certain keys and pretty-print JSON:

$ pxi --keep '["time"]' --spaces 2 < 2019.jsonl > pretty.jsonl

Deserialize JSON that is not given line by line:

$ pxi --by jsonObj < pretty.jsonl

Suppose you have to access a web API:

$ curl -s "https://swapi.co/api/people/"

Use pixie to organize the response:

$ curl -s "https://swapi.co/api/people/" |
  pxi "json => json.results" --with flatMap --keep '["name","height","mass"]'

Compute the BMI of all Star Wars characters:

$ curl -s "https://swapi.co/api/people/" |
  pxi "json => json.results" -a flatMap -K '["name","height","mass"]' |
  pxi "ch => (ch.bmi = ch.mass / (ch.height / 100) ** 2, ch)" -K '["name","bmi"]'

Identify all obese Star Wars characters:

$ curl -s "https://swapi.co/api/people/" |
  pxi "json => json.results" -a flatMap -K '["name","height","mass"]' |
  pxi "ch => (ch.bmi = ch.mass / (ch.height / 100) ** 2, ch)" -K '["name","bmi"]' |
  pxi "ch => ch.bmi >= 30" -a filter -K '["name"]'
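The BMI function relies on JavaScript's comma operator: the assignment to ch.bmi runs first, and the whole expression then evaluates to ch. The sketch below applies the same functions to two made-up records instead of the API response. Height and mass are written as strings on the assumption that the API returns them that way; division coerces them to numbers.

```javascript
// The same functions as in the pipeline above.
const addBmi  = ch => (ch.bmi = ch.mass / (ch.height / 100) ** 2, ch)
const isObese = ch => ch.bmi >= 30

// Made-up sample records, not real API data.
const sample = [
  {name: 'A', height: '180', mass: '81'},
  {name: 'B', height: '200', mass: '140'}
]

const withBmi = sample.map(addBmi)
const obese   = withBmi.filter(isObese).map(({name}) => name)

console.log(obese)
```

Note that addBmi mutates its argument. This is harmless in a pipeline where each object is used once, but worth keeping in mind when reusing the function elsewhere.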

Select PID and CMD from ps:

$ ps | pxi '([pid, tty, time, cmd]) => ({pid, cmd})' --from ssv

Select file size and filename from ls:

$ ls -ahl / | pxi '([,,,,size,,,,file]) => ({size, file})' --from ssv
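The array pattern in this function skips fields by leaving holes between commas. The following sketch shows the idea on one hand-written line in the style of ls output, assuming that the ssv deserializer splits each line on runs of whitespace (an assumption about its behavior, not taken from the plugin's source). splitSsv is a hypothetical helper.

```javascript
// Hypothetical helper: split a line on runs of whitespace.
const splitSsv = line => line.trim().split(/\s+/)

// Same pattern as above: field 5 is the size, field 9 is the file name.
const pick = ([,,,,size,,,,file]) => ({size, file})

// Hand-written sample line, not real command output.
const line = 'drwxr-xr-x  5 root  wheel  160B Jan  1 00:00 home'

const picked = pick(splitSsv(line))
console.log(picked)
```

File names that contain spaces would be split into several fields under this assumption, so the pattern only works reliably for names without whitespace.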

Allow JSON objects and lists in CSV:

$ printf '{"a":1,"b":[1,2,3]}\n{"a":2,"b":{"c":2}}\n' |
  pxi --to csv --no-fixed-length --allow-list-values

Decode JSON values in CSV:

$ printf '{"a":1,"b":[1,2,3]}\n{"a":2,"b":{"c":2}}\n' |
  pxi --to csv --no-fixed-length --allow-list-values |
  pxi --from csv 'evolve({b: JSON.parse})'

`.pxi` Module

Users may extend and modify pxi by providing a .pxi module. If you wish to do that, create a ~/.pxi/index.js file and insert the following base structure:

module.exports = {
  plugins:  [],
  context:  {},
  defaults: {}
}

The following sections will walk you through all capabilities of .pxi modules. If you want to skip over the details and instead see sample code, visit pxi-pxi!

Writing Plugins

You may write pixie plugins in ~/.pxi/index.js. Writing your own extensions is straightforward:

const sampleChunker = {
  name: 'sample',
  desc: 'is a sample chunker.',
  func: ({verbose}) => (data, prevLines, noMoreData) => (
    // * Turn data into an array of chunks
    // * Count lines for better error reporting throughout pxi
    // * Collect error reports: {msg: String, line: Number, info: String}
    //   If verbose > 0, include line in error reports
    //   If verbose > 1, include info in error reports
    // * Return errors, chunks, lines, the last line, and all unchunked data
    {err: [], chunks: [], lines: [], lastLine: 0, rest: ''}
  )
}

const sampleDeserializer = {
  name: 'sample',
  desc: 'is a sample deserializer.',
  func: ({verbose}) => (chunks, lines) => (
    // * Deserialize chunks to jsons
    // * Collect error reports: {msg: String, line: Number, info: Chunk}
    //   If verbose > 0, include line in error reports
    //   If verbose > 1, include info in error reports
    // * Return errors and deserialized jsons
    {err: [], jsons: []}
  )
}

const sampleApplier = {
  name: 'sample',
  desc: 'is a sample applier.',
  func: (functions, {verbose}) => (jsons, lines) => (
    // * Turn jsons into other jsons by applying all functions
    // * Collect error reports: {msg: String, line: Number, info: Json}
    //   If verbose > 0, include line in error reports
    //   If verbose > 1, include info in error reports
    // * Return errors and transformed jsons
    {err: [], jsons: []}
  )
}

const sampleSerializer = {
  name: 'sample',
  desc: 'is a sample serializer.',
  func: ({verbose}) => jsons => (
    // * Turn jsons into a string
    // * Collect error reports: {msg: String, line: Number, info: Json}
    //   If verbose > 0, include line in error reports
    //   If verbose > 1, include info in error reports
    // * Return errors and serialized string
    {err: [], str: ''}
  )
}

The name is used by pixie to select your extension, the desc is displayed in the options section of pxi --help, and the func is called by pixie to transform data.
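To make the chunker skeleton more concrete, here is a sketch of a line-based chunker that follows the same shape. It rests on several assumptions that the skeleton does not spell out: data is a string, prevLines is the number of lines seen so far, noMoreData is true on the final call, and the caller prepends rest to the next piece of data. The name myLine is hypothetical, and this is not the implementation of the bundled line chunker.

```javascript
// Sketch of a line-based chunker; see the assumptions stated above.
const lineChunker = {
  name: 'myLine',
  desc: 'is a sketch of a line-based chunker.',
  func: ({verbose}) => (data, prevLines, noMoreData) => {
    const parts = data.split('\n')
    // Text after the final newline may be an incomplete line.
    const last = parts.pop()
    const rest = noMoreData ? '' : last
    // At the end of the input, a non-empty remainder is a full line.
    if (noMoreData && last !== '') parts.push(last)

    const chunks = []
    const lines  = []
    let lastLine = prevLines
    for (const part of parts) {
      lastLine += 1
      if (part !== '') {        // empty lines are counted but not emitted
        chunks.push(part)
        lines.push(lastLine)
      }
    }
    // Splitting lines cannot fail, so no error reports are collected here.
    return {err: [], chunks, lines, lastLine, rest}
  }
}

const chunkLines = lineChunker.func({verbose: 0})
const first  = chunkLines('a\nb\nc', 0, false)
const second = chunkLines(first.rest + '\nd\n', first.lastLine, true)
console.log(first, second)
```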

The sample extensions are bundled to the sample plugin, as follows:

const sample = {
  chunkers:      [sampleChunker],
  deserializers: [sampleDeserializer],
  appliers:      [sampleApplier],
  serializers:   [sampleSerializer]
}
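The error-reporting comments in the skeletons can be illustrated with a sketch of a JSON deserializer that could sit in the deserializers list of such a plugin. It assumes that chunks is an array of strings and that lines holds the matching line numbers. The name myJson is hypothetical, and this is not the implementation used by pxi-json.

```javascript
// Sketch of a JSON deserializer with verbosity-dependent error reports.
const jsonDeserializer = {
  name: 'myJson',
  desc: 'is a sketch of a JSON deserializer.',
  func: ({verbose}) => (chunks, lines) => {
    const err   = []
    const jsons = []
    chunks.forEach((chunk, i) => {
      try {
        jsons.push(JSON.parse(chunk))
      } catch (e) {
        const report = {msg: e.message}
        if (verbose > 0) report.line = lines[i]  // include the line number
        if (verbose > 1) report.info = chunk     // include the failing chunk
        err.push(report)
      }
    })
    return {err, jsons}
  }
}

const quiet   = jsonDeserializer.func({verbose: 1})(['{"a":1}', '{oops'], [1, 2])
const verbose = jsonDeserializer.func({verbose: 2})(['{oops'], [7])
console.log(quiet.jsons, quiet.err.length, verbose.err[0].info)
```

Collecting errors in a list instead of throwing lets the remaining chunks be processed even if one of them is malformed.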

Extending Pixie with Plugins

Plugins can come from two sources: They are either written by the user, as shown in the previous section, or they are installed in ~/.pxi/ as follows:

$ npm install pxi-sample

If a plugin was installed, it has to be imported into ~/.pxi/index.js:

const sample = require('pxi-sample')

Regardless of whether a plugin was defined by a user or installed from npm, all plugins are added to the .pxi module the same way:

module.exports = {
  plugins:  [sample],
  context:  {},
  defaults: {}
}

pxi --help should now list the sample plugin extensions in the options section.

Adding plugins may break the pxi command line tool! If this happens, just remove the plugin from the list and pxi should work normally again. Use this feature responsibly.

Including Libraries like Ramda or Lodash

Libraries like Ramda and Lodash are of immense help when writing functions to transform JSON objects, and there have been many heated discussions about which of these libraries is superior. Since different people have different preferences, pixie lets the user decide which library to use.

First, install your preferred libraries in ~/.pxi/:

$ npm install ramda
$ npm install lodash

Next, add the libraries to ~/.pxi/index.js:

const R = require('ramda')
const L = require('lodash')

module.exports = {
  plugins:  [],
  context:  Object.assign({}, R, {_: L}),
  defaults: {}
}

You may now use all Ramda functions without prefix, and all Lodash functions with prefix _:

$ pxi "prop('time')" < 2019.jsonl
$ pxi "json => _.get(json, 'time')" < 2019.jsonl

Using Ramda and Lodash in your functions may have a negative impact on performance! Use this feature responsibly.
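One way to picture how context entries become usable inside function strings is to treat every context key as a named argument of a generated function. The sketch below does this with the Function constructor. It illustrates the idea under that assumption and is not pxi's actual implementation; compile is a hypothetical helper, and two small hand-written functions stand in for Ramda and Lodash so that the example has no dependencies.

```javascript
// Hypothetical helper: evaluates a function given as a string with every
// context key in scope as a variable.
const compile = (source, context) => {
  const names  = Object.keys(context)
  const values = Object.values(context)
  return new Function(...names, `return (${source})`)(...values)
}

// Stand-ins for library functions, merged like R and {_: L} above.
const prop    = key => obj => obj[key]
const _       = {get: (obj, key) => obj[key]}
const context = Object.assign({}, {prop}, {_})

const f = compile("prop('time')", context)
const g = compile("json => _.get(json, 'time')", context)

console.log(f({time: 1546300800}), g({time: 1546300800}))
```

Spreading a whole library into the context, as done with Ramda above, means that any of its function names can shadow a global of the same name inside function strings.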

Including Custom JavaScript Functions

Just as you may extend pixie with third-party libraries like Ramda and Lodash, you may add your own functions. This is as simple as adding them to the context in ~/.pxi/index.js:

const getTime = json => json.time

module.exports = {
  plugins:  [],
  context:  {getTime},
  defaults: {}
}

After adding it to the context, you may use your function:

$ pxi "json => getTime(json)" < 2019.jsonl
$ pxi "getTime" < 2019.jsonl

Changing `pxi` Defaults

You may globally change default chunkers, deserializers, appliers, and serializers in ~/.pxi/index.js, as follows:

module.exports = {
  plugins:  [],
  context:  {},
  defaults: {
    chunker:      'sample',
    deserializer: 'sample',
    appliers:     'sample',
    serializer:   'sample',
    noPlugins:    false
  }
}

Defaults are assigned globally and changing them may break existing pxi scripts! Use this feature responsibly.

`id` Plugin

pxi includes the id plugin that comes with the following extensions:

Extension	Description
`id` chunker	Returns each piece of data as a chunk.
`id` deserializer	Returns all chunks unchanged.
`id` applier	Does not apply any functions and returns the JSON objects unchanged.
`id` serializer	Applies Object.prototype.toString to the input and joins without newlines.

Comparison to Related Tools

Aspect	`pxi`	`jq`	`mlr`	`fx`	`gawk`
Self-description	Small, fast, and magical command-line data processor similar to awk, jq, and mlr.	Command-line JSON processor	Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON	Command-line tool and terminal JSON viewer	The awk utility interprets a special-purpose programming language that makes it possible to handle simple data-reformatting jobs with just a few lines of code
Focus	Transforming data with user provided functions and converting between formats	Transforming JSON with user provided functions	Transforming CSV with user provided functions and converting between formats	Transforming JSON with user provided functions	Language for simple data reformatting tasks
License	MIT	MIT	BSD-3-Clause	MIT	GPL-3.0-only
Performance	(performance is given relative to `pxi`)	`jq` is >3x slower than `pxi`	`mlr` is >3x slower than `pxi`	`fx` is >15x slower than `pxi`	`pxi` is as performant as `gawk` when processing JSON and CSV
Processing Language	JavaScript and all JavaScript libraries	jq language	Predefined verbs and custom put/filter DSL	JavaScript and all JavaScript libraries	awk language
Extensibility	(Third-party) Plugins, any JavaScript library, custom functions	(Third-party) Modules written in jq	Running arbitrary shell commands	Any JavaScript library, custom functions	`gawk` dynamic extensions
Similarities		`pxi` and `jq` both heavily rely on JSON	`pxi` and `mlr` both convert back and forth between CSV and JSON	`pxi` and `fx` both apply JavaScript functions to JSON streams	`pxi` and `gawk` both transform data
Differences		`pxi` and `jq` use different processing languages	While `pxi` uses a programming language for data processing, `mlr` uses a custom put/filter DSL, also, `mlr` reads in the whole file while `pxi` processes it in chunks	`pxi` supports data formats other than JSON, and `fx` provides a terminal JSON viewer	While `pxi` functions transform a JSON into another JSON, `gawk` does not have a strict format other than transforming strings into other strings

Reporting Issues

Please report issues in the tracker!

Contributing

We are open to, and grateful for, any contributions made by the community. By contributing to pixie, you agree to abide by the code of conduct. Please read the contributing guide.

License

pxi is MIT licensed.
