probe-scraper
source link: https://github.com/mozilla/probe-scraper
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
probe-scraper
Scrape Telemetry probe data from Firefox repositories.
This extracts per-version Telemetry probe data for Firefox and mobile products from registry files like Histograms.json and Scalars.yaml. The data allows answering questions like "which Firefox versions is this Telemetry probe in anyway?". Also, probes outside of Histograms.json - like the CSS use counters - are included in the output data.
The data is pulled from two different sources:
A web tool to explore the data is available here.
Adding a New Glean Repository
To scrape a git repository for probe definitions, an entry needs to be added in repositories.yaml
.
description
: A brief description of the repository.channel
: (optional) The release channel. Must be one of "release", "beta", "nightly" or "esr".deprecated
(optional)true
if repository corresponds to a deprecated product or library.prototype
: (optional)true
if the application is still under developement.notification_emails
: Where emails about probe-scraper failures and improper files will be forwarded to. These will be just about your specific repository.url
: The URL of the repository to scrape. It should be able to be cloned directly from that URL.branch
(optional): The branch in the repository to use. (Defaults tomaster
).metrics_files
: A list of relative paths tometrics.yaml
filesping_files
: A list of relative paths topings.yaml
files
Adding an application
For a given application, Glean metrics are emitted by the application itself, any libraries it uses that also use Glean, as well as the Glean library proper. Therefore, probe scraper needs a way to find all of the dependencies to determine all of the metrics emitted by that application.
Therefore, each application should specify a dependencies
parameter, which is a list of Glean-using libraries used by the application. Each entry should be a library name as specified by the library's library_names
parameter.
For Android applications, if you're not sure what the dependencies of the application are, you can run the following command at the root of the project folder:
$ ./gradlew :app:dependencies
Adding a library
All libraries must define library_names
.
Probe scraper also needs a way to map dependencies back to an entry in the
repositories.yaml
file. Therefore, any libraries defined should also include
their build-system-specific library names in the library_names
parameter.
Developing the probe-scraper
You can choose to develop using the container, or locally. Using the container will be slower, since changes will trigger a rebuild of the container. But using the container method will ensure that your PR passes CircleCI build/test phases.
Local development
Install the requirements:
pip install -r requirements.txt
pip install -r test_requirements.txt
python setup.py develop
Run tests. This by default does not run tests that require a web connection:
pytest tests/
To run all tests, including those that require a web connection:
pytest tests/ --run-web-tests
To test whether the code conforms to the style rules, you can run:
flake8 --max-line-length 100 .
Developing using the container
Run tests in container. This does not run tests that require a web connection:
export COMMAND='pytest tests/'
make run
To run all tests, including those that require a web connection:
make test
To test whether the code conforms to the style rules, you can run:
make lint
Tests with Web Dependencies
Any tests that require a web connection to run should be marked with @pytest.mark.web_dependency
.
These will not run by default, but will run on CI.
Performing a Dry-Run
Before opening a PR, it's good to test the code you wrote on the production data. You can specify a specific Firefox
version to run on by using first-version
:
export COMMAND='python -m probe_scraper.runner --firefox-version 65 --dry-run'
make run
or locally via:
python -m probe_scraper.runner --firefox-version 65 --dry-run
Additionally, you can test just on Glean repositories:
export COMMAND='python -m probe_scraper.runner --glean --dry-run'
make run
or locally via:
python -m probe_scraper.runner --glean --dry-run
Including --dry-run
means emails will not be sent.
Module overview
The module is built around the following data flow:
- scrape registry files from mozilla-central, clone files from repositories directory
- extract probe data from the files
- transform probe data into output formats
- save to disk
The code layout consists mainly of:
probe_scraper
runner.py
- the central script, ties the other pieces togetherscrapers
buildhub.py
- pull build info from the BuildHub servicemoz_central_scraper.py
- loads probe registry files for multiple versions from mozilla-centralgit_scraper.py
- loads probe registry files from a git repository (no version or channel support yet, just per-commit)
parsers/
- extract probe data from the registry filesthird_party
- these are imported parser scripts from mozilla-central
transform_*.py
- transform the extracted raw data into output formats
tests/
- the unit tests
File formats
This scraper generates three different JSON file types.
revisions
This file contains the revision hashes of the changesets the probe files were scraped. These hashes are mapped to an human-readable version string.
{
"<channel>": {
"<revision hash>": {
"version": "<human-readable version string>"
},
...
},
...
"aurora": {
"1196bf3032e1bce1fb07a01fd9082a767426c5fb": {
"version": "51"
},
},
...
}
general
This file contains general properties related to the scraping process. As of today, it only contains the lastUpdate
property, which is the day and time the scraping was performed, in ISO 8601 format.
{
"lastUpdate": "2018-01-15T17:57:08.944690+01:00"
}
Probe data file
This file contains the data for the probes. The data might be spread across multiple files. It has the following format:
{
"<probe type>/<probe name>": {
"history": {
"<channel>": [
{
"cpp_guard": <string or null>,
"description": "<string>",
"details": {
"<type specific detail>": "<detail data>",
...
"record_in_processes": [
"<string>",
...
]
},
"expiry_version": "<string>",
"optout": <bool>,
"revisions": {
"first": "<string>",
"last": "<string>"
},
"versions": {
"first": "<string>",
"last": "<string>"
}
},
...
]
},
"name": "<probe name>",
"type": "<probe type>"
},
...
"histogram/A11Y_CONSUMERS": {
"history": {
"nightly": [
{
"cpp_guard": null,
"description": "A list of known accessibility clients that inject into Firefox process space (see https://dxr.mozilla.org/mozilla-central/source/accessible/windows/msaa/Compatibility.h).",
"details": {
"high": 11,
"keyed": false,
"kind": "enumerated",
"low": 1,
"n_buckets": 12
},
"expiry_version": "never",
"optout": true,
"revisions": {
"first": "320642944e42a889db13c6c55b404e32319d4de6",
"last": "6f5fac320fcb6625603fa8a744ffa8523f8b3d71"
},
"versions": {
"first": "56",
"last": "59"
}
}
]
},
"name": "A11Y_CONSUMERS",
"type": "histogram"
},
}
Please refer to the Telemetry data collection documentation for a detailed explanation of the field information reported for each probe (e.g. cpp_guard
).
Glean Metrics Data Files
The format is similar for probe data files, but without the revisions
and versions
keys. Instead it has git-commits
and dates
keys, which contains the
first and last commits that definition has been seen in, and when those commits were committed.
{
"<metric name>": {
"history": [
{
"type": "timespan",
"description": " The duration of the last foreground session.",
"time_unit": "second",
"send_in_pings": ["baseline"],
"bugs": [1497894, 1519120],
"data_reviews": ["https://bugzilla.mozilla.org/show_bug.cgi?id=1512938#c3"],
"notification_emails": ["[email protected]"],
"git-commits": {
"first": "<commit-hash>",
"last": "<commit-hash>"
},
"dates": {
"first": "2019-01-01 12:12:12",
"last": "2019-02-01 14:14:14"
},
},
...
]
"name": "<metric name>",
"type": "<metric type>"
},
...
}
Glean dependencies files
The Glean dependency file contains information about the dependencies of an
application in repositories.yaml
.
The format is similar for Glean metrics data files. Each entry in the top-level
object represents a dependency of the application. The only data point tracked
in the history
log is version
, which is the version number of the
dependency. At the top-level of each entry is:
name
: the name of the dependency (for Android, this is a Maven package name).type
: Always"dependency"
.
{
"<library name>": {
"history": [
{
"dates": {
"first": "2019-05-25 00:39:19",
"last": "2019-05-28 10:24:06"
},
"git-commits": {
"first": "9aa4f48e77001058c05f3d3182228706720bf87a",
"last": "69c485078950fb09ee2cef609b75ea9dd30d249b"
},
"type": "dependency",
"version": "1.1.0-alpha05"
}
],
"name": "<library-name>",
"type": "dependency"
},
...
}
Accessing the data files
The processed probe data is serialized to the disk in a directory hierarchy starting from the provided output directory. The directory layout resembles a REST-friendly structure.
|-- product
|-- general
|-- revisions
|-- channel (or "all")
|-- ping type
|-- probe type (or "all_probes")
For example, all the JSON probe data in the main ping for the Firefox Nightly channel can be accessed with the following path: firefox/nightly/main/all_probes
. The probe data for all the channels (same product and ping) can be accessed instead using firefox/all/main/all_probes
.
The root directory for the output generated from the scheduled job can be found at: https://probeinfo.telemetry.mozilla.org/ . All the probe data for Firefox coming from the main ping can be found here.
Accessing Glean
metrics data
Glean data is generally laid out as follows:
| -- glean
| -- repositories
| -- general
| -- repository-name
| -- general
| -- metrics
For example, the data for a repository called fenix
would be found at /glean/fenix/metrics
. The time the data was last updated for that project can be found at glean/fenix/general
.
A list of available repositories is at /glean/repositories
.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK