1

ClickHouse的简介、安装、使用与应用

 3 years ago
source link: http://www.lzhpo.com/article/145
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

ClickHouse简介

ClickHouse 是由号称“俄罗斯 Google/百度”的 Yandex 公司开源的面向 OLAP 的分布式列式数据库,2016年6月15日开源的,能够使用 SQL 查询生成实时数据报告,覆盖了俄语搜索超过68%的市场。

官方的简介已经十分详细了,所以我直接摘录一些官方的。

ClickHouse是一个用于联机分析(OLAP)的列式数据库管理系统(DBMS)。

在传统的行式数据库系统中,数据按如下顺序存储:

rowwatchIDJavaEnabletitleGoodEventEventTime#0893543506621投资者关系12016-05-18 05:19:20#1903295099580联系我们12016-05-18 08:10:20#2899537060541任务12016-05-18 07:38:00#N……………

处于同一行中的数据总是被物理的存储在一起。

常见的行式数据库系统有: MySQL、Postgres和MS SQL Server。

在列式数据库系统中,数据按如下的顺序存储:

row:#0#1#2#NwatchID:893543506629032950995889953706054…JavaEnable:101…title:投资者关系联系我们任务…GoodEvent:111…EventTime:2016-05-18 05:19:202016-05-18 08:10:202016-05-18 07:38:00…

该示例中只展示了数据在列式数据库中数据的排列方式。
对于存储而言,列式数据库总是将同一列的数据存储在一起,不同列的数据也总是分开存储。

常见的列式数据库有: Vertica、 Paraccel (Actian Matrix,Amazon Redshift)、 Sybase IQ、 Exasol、 Infobright、 InfiniDB、 MonetDB (VectorWise, Actian Vector)、 LucidDB、 SAP HANA、 Google Dremel、 Google PowerDrill、 Druid、 kdb+。

ClickHouse在字节跳动中的应用

它最大的特点,莫过于快!!!但是,不支持事务。。。

被称作是最快的OLAP引擎,彪悍开源的分析数据库。

像国内的字节跳动就有在使用ClickHouse,头条内部第一个使用 ClickHouse 的是用户行为分析系统。该系统在使用 ClickHouse 之前,engine 层已经有两个迭代。他们尝试过 Spark 全内存方案还有一些其他的方案,都存在很多问题。主要因为产品需要比较强的交互能力,页面拖拽的方式能够给分析师展示不同的指标,查询模式比较多变,并且有一些查询的 DSL 描述,也不好用现成的 SQL 去表示,这就需要 engine 有比较好的定制能力。

Bytedance 如何使用 ClickHouse.png

目前头条 ClickHouse 集群的规模大概有几千个节点,最大的集群规模可能有 1200 个节点,这是一个单集群的最大集群节点数。数据总量大概是几十个 PB,日增数据 100TB,落地到 ClickHouse,日增数据总量大概是它的 3 倍,原始数据也就 300T 左右,大多数查询的响应时间是在几秒钟。从交互式的用户体验来说,一般希望把所有的响应控制在 30 秒之内返回,ClickHouse 基本上能够满足大部分要求。覆盖的用户场景包括产品分析师做精细化运营,开发人员定位问题,也有少量的广告类客户。

OLAP(机分析处理)是一种软件技术,它使分析人员能够迅速、一致、交互地从各个方面观察信息,以达到深入理解数据的目的。

OLAP场景特征

  • 大多数是读请求
  • 数据总是以相当大的批(> 1000 rows)进行写入
  • 不修改已添加的数据
  • 每次查询都从数据库中读取大量的行,但是同时又仅需要少量的列
  • 宽表,即每个表包含着大量的列
  • 较少的查询(通常每台服务器每秒数百个查询或更少)
  • 对于简单查询,允许延迟大约50毫秒
  • 列中的数据相对较小: 数字和短字符串(例如,每个URL 60个字节)
  • 处理单个查询时需要高吞吐量(每个服务器每秒高达数十亿行)
  • 事务不是必须的
  • 对数据一致性要求低
  • 每一个查询除了一个大表外都很小
  • 查询结果明显小于源数据,换句话说,数据被过滤或聚合后能够被盛放在单台服务器的内存中

列式数据库更适合OLAP场景的原因

列式数据库更适合于OLAP场景(对于大多数查询而言,处理速度至少提高了100倍),下面详细解释了原因(通过图片更有利于直观理解):

行式:
行式数据库.png

列式:
列式数据库.png

列式数据库VS 行式数据库.png

ClickHouse特性

  • 真正的列式数据库管理系统
  • 数据的磁盘存储
  • 多核心并行处理
  • 多服务器分布式处理
  • 支持SQL
  • 实时的数据更新
  • 适合在线查询
  • 支持近似计算
  • 支持数据复制和数据完整性
  • 限制
    1. 没有完整的事务支持。
    2. 缺少高频率,低延迟的修改或删除已存在数据的能力。仅能用于批量删除或修改数据,但这符合 GDPR
    3. 稀疏索引使得ClickHouse不适合通过其键检索单行的点查询。

Docker安装与测试使用

# 这两个拉不拉都无所谓,在docker run的时候没有的话,会自动去pulldocker pull yandex/clickhouse-serverdocker pull yandex/clickhouse-client
# 不使用主机系统卷进行数据存储docker run -d --name some-clickhouse-server --ulimit nofile=262144:262144 -p 8123:8123 -p 9009:9009 -p 9090:9000 yandex/clickhouse-server# 使用主机系统卷进行数据存储docker run -d --name some-clickhouse-server --ulimit nofile=262144:262144 -p 8123:8123 -p 9009:9009 -p 9090:9000 --volume=D:\ClickHouse-File\DataBase:/var/lib/clickhouse yandex/clickhouse-server
# 使用客户端去连接docker run -it --rm --link some-clickhouse-server:clickhouse-server yandex/clickhouse-client --host clickhouse-server

使用ClickHouse客户端去连接ClickHouse.png

使用DBeaver连接ClickHouse

默认ClickHouse里面是有名称为default的数据库和账号的,账号没有密码。

可以直接连接:
DBeaver连接ClickHouse-连接数据库概览.png

DBeaver连接ClickHouse-成功.png

使用shell生成SHA256密码密文

PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'
[root@lzhpo-ecs ~]# PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'NYOcmLVk0d111f6a247ddb94c2639b39b79407276bb1bf223144a15910070e336e138b80

NYOcmLVk就是明文。

0d111f6a247ddb94c2639b39b79407276bb1bf223144a15910070e336e138b80就是SHA256密文。

使用vim修改ClickHouse密码

vim /etc/clickhouse-server/users.xml

默认应该是没有vim的,需要自己安装。

apt-get updateapt-get install vim

原本是明文配置<password></password>,官方不建议使用明文,建议使用SHA256(十六进制格式)指定,所以,修改为<password_sha256_hex>密码密文</password_sha256_hex>
ClickHouse修改密码.png

修改密码之后,使用明文重新连接ClickHouse

DBeaver连接ClickHouse-修改密码之后-连接成功.png

ClickHouse初始化的结构

ClickHouse初始化的结构.png

刚创建,除system是有数据之外,其它都是空的。

ClickHouse的Sql语法

Sql参考官方文档:https://clickhouse.tech/docs/zh/sql-reference/statements/select/array-join/

测试数据,提取的文件大小约为10GB:

https://clickhouse-datasets.s3.yandex.net/hits/tsv/hits_v1.tsv.xz

https://clickhouse-datasets.s3.yandex.net/visits/tsv/visits_v1.tsv.xz

创建数据库

# 创建数据库(数据库的名字不能带横杆“-”,需要换成下划线“_”)CREATE DATABASE IF NOT EXISTS study_clickhouse
# 在数据库study_clickhouse中创建表hits_v1CREATE TABLE study_clickhouse.hits_v1(    `WatchID` UInt64,    `JavaEnable` UInt8,    `Title` String,    `GoodEvent` Int16,    `EventTime` DateTime,    `EventDate` Date,    `CounterID` UInt32,    `ClientIP` UInt32,    `ClientIP6` FixedString(16),    `RegionID` UInt32,    `UserID` UInt64,    `CounterClass` Int8,    `OS` UInt8,    `UserAgent` UInt8,    `URL` String,    `Referer` String,    `URLDomain` String,    `RefererDomain` String,    `Refresh` UInt8,    `IsRobot` UInt8,    `RefererCategories` Array(UInt16),    `URLCategories` Array(UInt16),    `URLRegions` Array(UInt32),    `RefererRegions` Array(UInt32),    `ResolutionWidth` UInt16,    `ResolutionHeight` UInt16,    `ResolutionDepth` UInt8,    `FlashMajor` UInt8,    `FlashMinor` UInt8,    `FlashMinor2` String,    `NetMajor` UInt8,    `NetMinor` UInt8,    `UserAgentMajor` UInt16,    `UserAgentMinor` FixedString(2),    `CookieEnable` UInt8,    `JavascriptEnable` UInt8,    `IsMobile` UInt8,    `MobilePhone` UInt8,    `MobilePhoneModel` String,    `Params` String,    `IPNetworkID` UInt32,    `TraficSourceID` Int8,    `SearchEngineID` UInt16,    `SearchPhrase` String,    `AdvEngineID` UInt8,    `IsArtifical` UInt8,    `WindowClientWidth` UInt16,    `WindowClientHeight` UInt16,    `ClientTimeZone` Int16,    `ClientEventTime` DateTime,    `SilverlightVersion1` UInt8,    `SilverlightVersion2` UInt8,    `SilverlightVersion3` UInt32,    `SilverlightVersion4` UInt16,    `PageCharset` String,    `CodeVersion` UInt32,    `IsLink` UInt8,    `IsDownload` UInt8,    `IsNotBounce` UInt8,    `FUniqID` UInt64,    `HID` UInt32,    `IsOldCounter` UInt8,    `IsEvent` UInt8,    `IsParameter` UInt8,    `DontCountHits` UInt8,    `WithHash` UInt8,    `HitColor` FixedString(1),    `UTCEventTime` DateTime,    `Age` UInt8,    `Sex` UInt8,    `Income` UInt8,    `Interests` UInt16,    `Robotness` UInt8,    `GeneralInterests` Array(UInt16),    `RemoteIP` UInt32,    `RemoteIP6` FixedString(16),    `WindowName` Int32,    `OpenerName` Int32,    `HistoryLength` Int16,    `BrowserLanguage` FixedString(2),    `BrowserCountry` FixedString(2),    `SocialNetwork` String,    `SocialAction` String,    `HTTPError` UInt16,    `SendTiming` Int32,    `DNSTiming` Int32,    `ConnectTiming` Int32,    `ResponseStartTiming` Int32,    `ResponseEndTiming` Int32,    `FetchTiming` Int32,    `RedirectTiming` Int32,    `DOMInteractiveTiming` Int32,    `DOMContentLoadedTiming` Int32,    `DOMCompleteTiming` Int32,    `LoadEventStartTiming` Int32,    `LoadEventEndTiming` Int32,    `NSToDOMContentLoadedTiming` Int32,    `FirstPaintTiming` Int32,    `RedirectCount` Int8,    `SocialSourceNetworkID` UInt8,    `SocialSourcePage` String,    `ParamPrice` Int64,    `ParamOrderID` String,    `ParamCurrency` FixedString(3),    `ParamCurrencyID` UInt16,    `GoalsReached` Array(UInt32),    `OpenstatServiceName` String,    `OpenstatCampaignID` String,    `OpenstatAdID` String,    `OpenstatSourceID` String,    `UTMSource` String,    `UTMMedium` String,    `UTMCampaign` String,    `UTMContent` String,    `UTMTerm` String,    `FromTag` String,    `HasGCLID` UInt8,    `RefererHash` UInt64,    `URLHash` UInt64,    `CLID` UInt32,    `YCLID` UInt64,    `ShareService` String,    `ShareURL` String,    `ShareTitle` String,    `ParsedParams` Nested(        Key1 String,        Key2 String,        Key3 String,        Key4 String,        Key5 String,        ValueDouble Float64),    `IslandID` FixedString(16),    `RequestNum` UInt32,    `RequestTry` UInt8)ENGINE = MergeTree()PARTITION BY toYYYYMM(EventDate)ORDER BY (CounterID, EventDate, intHash32(UserID))SAMPLE BY intHash32(UserID)SETTINGS index_granularity = 8192
# 在数据库study_clickhouse中创建表visits_v1CREATE TABLE study_clickhouse.visits_v1(    `CounterID` UInt32,    `StartDate` Date,    `Sign` Int8,    `IsNew` UInt8,    `VisitID` UInt64,    `UserID` UInt64,    `StartTime` DateTime,    `Duration` UInt32,    `UTCStartTime` DateTime,    `PageViews` Int32,    `Hits` Int32,    `IsBounce` UInt8,    `Referer` String,    `StartURL` String,    `RefererDomain` String,    `StartURLDomain` String,    `EndURL` String,    `LinkURL` String,    `IsDownload` UInt8,    `TraficSourceID` Int8,    `SearchEngineID` UInt16,    `SearchPhrase` String,    `AdvEngineID` UInt8,    `PlaceID` Int32,    `RefererCategories` Array(UInt16),    `URLCategories` Array(UInt16),    `URLRegions` Array(UInt32),    `RefererRegions` Array(UInt32),    `IsYandex` UInt8,    `GoalReachesDepth` Int32,    `GoalReachesURL` Int32,    `GoalReachesAny` Int32,    `SocialSourceNetworkID` UInt8,    `SocialSourcePage` String,    `MobilePhoneModel` String,    `ClientEventTime` DateTime,    `RegionID` UInt32,    `ClientIP` UInt32,    `ClientIP6` FixedString(16),    `RemoteIP` UInt32,    `RemoteIP6` FixedString(16),    `IPNetworkID` UInt32,    `SilverlightVersion3` UInt32,    `CodeVersion` UInt32,    `ResolutionWidth` UInt16,    `ResolutionHeight` UInt16,    `UserAgentMajor` UInt16,    `UserAgentMinor` UInt16,    `WindowClientWidth` UInt16,    `WindowClientHeight` UInt16,    `SilverlightVersion2` UInt8,    `SilverlightVersion4` UInt16,    `FlashVersion3` UInt16,    `FlashVersion4` UInt16,    `ClientTimeZone` Int16,    `OS` UInt8,    `UserAgent` UInt8,    `ResolutionDepth` UInt8,    `FlashMajor` UInt8,    `FlashMinor` UInt8,    `NetMajor` UInt8,    `NetMinor` UInt8,    `MobilePhone` UInt8,    `SilverlightVersion1` UInt8,    `Age` UInt8,    `Sex` UInt8,    `Income` UInt8,    `JavaEnable` UInt8,    `CookieEnable` UInt8,    `JavascriptEnable` UInt8,    `IsMobile` UInt8,    `BrowserLanguage` UInt16,    `BrowserCountry` UInt16,    `Interests` UInt16,    `Robotness` UInt8,    `GeneralInterests` Array(UInt16),    `Params` Array(String),    `Goals` Nested(        ID UInt32,        Serial UInt32,        EventTime DateTime,        Price Int64,        OrderID String,        CurrencyID UInt32),    `WatchIDs` Array(UInt64),    `ParamSumPrice` Int64,    `ParamCurrency` FixedString(3),    `ParamCurrencyID` UInt16,    `ClickLogID` UInt64,    `ClickEventID` Int32,    `ClickGoodEvent` Int32,    `ClickEventTime` DateTime,    `ClickPriorityID` Int32,    `ClickPhraseID` Int32,    `ClickPageID` Int32,    `ClickPlaceID` Int32,    `ClickTypeID` Int32,    `ClickResourceID` Int32,    `ClickCost` UInt32,    `ClickClientIP` UInt32,    `ClickDomainID` UInt32,    `ClickURL` String,    `ClickAttempt` UInt8,    `ClickOrderID` UInt32,    `ClickBannerID` UInt32,    `ClickMarketCategoryID` UInt32,    `ClickMarketPP` UInt32,    `ClickMarketCategoryName` String,    `ClickMarketPPName` String,    `ClickAWAPSCampaignName` String,    `ClickPageName` String,    `ClickTargetType` UInt16,    `ClickTargetPhraseID` UInt64,    `ClickContextType` UInt8,    `ClickSelectType` Int8,    `ClickOptions` String,    `ClickGroupBannerID` Int32,    `OpenstatServiceName` String,    `OpenstatCampaignID` String,    `OpenstatAdID` String,    `OpenstatSourceID` String,    `UTMSource` String,    `UTMMedium` String,    `UTMCampaign` String,    `UTMContent` String,    `UTMTerm` String,    `FromTag` String,    `HasGCLID` UInt8,    `FirstVisit` DateTime,    `PredLastVisit` Date,    `LastVisit` Date,    `TotalVisits` UInt32,    `TraficSource` Nested(        ID Int8,        SearchEngineID UInt16,        AdvEngineID UInt8,        PlaceID UInt16,        SocialSourceNetworkID UInt8,        Domain String,        SearchPhrase String,        SocialSourcePage String),    `Attendance` FixedString(16),    `CLID` UInt32,    `YCLID` UInt64,    `NormalizedRefererHash` UInt64,    `SearchPhraseHash` UInt64,    `RefererDomainHash` UInt64,    `NormalizedStartURLHash` UInt64,    `StartURLDomainHash` UInt64,    `NormalizedEndURLHash` UInt64,    `TopLevelDomain` UInt64,    `URLScheme` UInt64,    `OpenstatServiceNameHash` UInt64,    `OpenstatCampaignIDHash` UInt64,    `OpenstatAdIDHash` UInt64,    `OpenstatSourceIDHash` UInt64,    `UTMSourceHash` UInt64,    `UTMMediumHash` UInt64,    `UTMCampaignHash` UInt64,    `UTMContentHash` UInt64,    `UTMTermHash` UInt64,    `FromHash` UInt64,    `WebVisorEnabled` UInt8,    `WebVisorActivity` UInt32,    `ParsedParams` Nested(        Key1 String,        Key2 String,        Key3 String,        Key4 String,        Key5 String,        ValueDouble Float64),    `Market` Nested(        Type UInt8,        GoalID UInt32,        OrderID String,        OrderPrice Int64,        PP UInt32,        DirectPlaceID UInt32,        DirectOrderID UInt32,        DirectBannerID UInt32,        GoodID String,        GoodName String,        GoodQuantity Int32,        GoodPrice Int64),    `IslandID` FixedString(16))ENGINE = CollapsingMergeTree(Sign)PARTITION BY toYYYYMM(StartDate)ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID)SAMPLE BY intHash32(UserID)SETTINGS index_granularity = 8192

ClickHouse创建表.png

SELECT * FROM study_clickhouse.hits_v1 hv SELECT * FROM study_clickhouse.visits_v1 vv

ClickHouse查询.png

emmm......由于ClickHouse的官方的数据量太大,我是在windows安装的Docker,所以,10G占用太大了,我还是弄点小数据先测试一下吧。。。

create table study_clickhouse.user (id UInt16, name String, sex String) ENGINE=TinyLog;insert into study_clickhouse.user(id, name, sex) values (1, 'lzhpo', '男');insert into study_clickhouse.user(id, name, sex) values (2, 'lewis', '男');insert into study_clickhouse.user(id, name, sex) values (3, 'emily', '女');select * from study_clickhouse.`user` u

一条一条数据插入:
ClickHouse插入数据-示例.png

查看插入的数据:
ClickHouse查看1-示例.png

ClickHouse查看2-示例.png


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK