Clickhouse Replication without Sharding – Database Administrators Stack Exchange

Never mind, found it in another Altinity blog post.

Just create a docker-compose.yml file:

version: '3.3'
services:
  # from rongfengliang/clickhouse-docker-compose
  # from https://github.com/abraithwaite/clickhouse-replication-example/blob/master/docker-compose.yaml
  # from https://altinity.com/blog/2017/6/5/clickhouse-data-distribution
  ch1:
    image: yandex/clickhouse-server
    restart: always
    volumes:
      - ./config.xml:/etc/clickhouse-server/config.d/local.xml
      - ./macro1.xml:/etc/clickhouse-server/config.d/macros.xml
      - ./data/1:/var/lib/clickhouse    
    ports: 
      - '18123:8123'
      - '19000:9000'
      - '19009:9009'
    ulimits:
      nproc: 65536
      nofile:
        soft: 252144
        hard: 252144
  ch2:
    image: yandex/clickhouse-server
    restart: always
    volumes:
      - ./config.xml:/etc/clickhouse-server/config.d/local.xml
      - ./macro2.xml:/etc/clickhouse-server/config.d/macros.xml
      - ./data/2:/var/lib/clickhouse
    ports: 
      - '28123:8123'
      - '29000:9000'
      - '29009:9009'
    ulimits:
      nproc: 65536
      nofile:
        soft: 252144
        hard: 252144
  ch3:
    image: yandex/clickhouse-server
    restart: always
    volumes:
      - ./config.xml:/etc/clickhouse-server/config.d/local.xml
      - ./macro3.xml:/etc/clickhouse-server/config.d/macros.xml
      - ./data/3:/var/lib/clickhouse
    ports: 
      - '38123:8123'
      - '39000:9000'
      - '39009:9009'
    ulimits:
      nproc: 65536
      nofile:
        soft: 252144
        hard: 252144
  zookeeper:
    image: zookeeper

and a config.xml file:

<yandex>
    <remote_servers>
        <replicated>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>ch1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>ch2</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>ch3</host>
                    <port>9000</port>
                </replica>
            </shard>
        </replicated>
    </remote_servers>
    <zookeeper>
        <node>
            <host>zookeeper</host>
            <port>2181</port>
        </node>
    </zookeeper>
</yandex>

and three macroX.xml files, where X = 1, 2, 3 (replace chX with ch1, ch2, or ch3):

<yandex>
    <macros replace="replace">
        <cluster>cluster1</cluster>
        <replica>chX</replica>
    </macros>
</yandex>

Then create a data directory and run docker-compose up.
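Once the three nodes are up, a replicated table can be created through any of them. Here is a minimal sketch using the clickhouse-driver Python package (an assumption; any client works). The table name events and its columns are made up; the cluster name replicated and the {cluster}/{replica} macros come from the config above:

from clickhouse_driver import Client

# Port 19000 is ch1's native-protocol port as mapped in docker-compose.yml.
client = Client(host='127.0.0.1', port=19000)

# ON CLUSTER runs the DDL on every replica of the "replicated" cluster from
# config.xml; the {cluster} and {replica} macros from macroX.xml give each
# node its own replica name under a shared ZooKeeper path.
client.execute("""
    CREATE TABLE IF NOT EXISTS events ON CLUSTER replicated
    (
        event_date Date,
        event_id   UInt64,
        payload    String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{cluster}/events', '{replica}')
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, event_id)
""")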

partitioning – SAAS Software – MySQL DB – Sharding data

I am working on a SaaS application that uses MySQL on AWS.
The application serves multiple organizations. We use the same database and the same tables to store the data of all organizations, store org_id in most of the tables, and use it in the SELECT, UPDATE, and DELETE queries (WHERE org_id = ?).

DB Model:

Organization: org_id
User: user_id, org_id
client: client_id, org_id
client_contact: client_id, phone_number, phone_type
project: project_id, org_id
user_project: project_id, user_id

The data is growing rapidly and we need to shard it. We have customers all over the world. If I use a different sharding key for each table, joins might have to happen across nodes. For example, users could be sharded on user_id, clients on client_id, and projects on project_id. The problem then is joining project and user: a given user_id and project_id might live on different nodes, so the join crosses nodes.

What is the best approach here? I am thinking of sharding on org_id, since I store org_id in most of the tables.

I see two problems here:

  1. A few child tables do not store org_id; only their parent tables do. Do I need to store org_id in all the tables?
  2. Some orgs might have more data and load, which could lead to a hot spot and more storage on one particular node. Is it possible to scale a single node on its own with AWS RDS?

Please suggest the best approach.

Note: each org might have up to 500,000 records in any table, with around 10 columns.
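For reference, a minimal sketch of the org_id-based routing I have in mind (the shard list and DSNs are made up):

# Route every query by org_id; shard DSNs are made up for illustration.
SHARDS = [
    "mysql://shard0.example.internal/app",
    "mysql://shard1.example.internal/app",
    "mysql://shard2.example.internal/app",
]

def shard_for_org(org_id: int) -> str:
    # Every table that carries org_id is routed the same way, so all joins
    # for a single org stay on one node.
    return SHARDS[org_id % len(SHARDS)]

# Example: every query for org 42 goes to the same shard.
print(shard_for_org(42))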

scalability – Start small, but design in such a way that sharding is possible – how?

The following question is more about best practices than a real problem; nevertheless, I'd like to know the best way to do it.

Given a service that can operate in multiple countries/geo-areas, one would probably start simple, before scaling is even needed. The design would contain a single DB and a single piece of infrastructure. An API endpoint would look like this:

/.../v1/items?geo_area=xyz&page=1&size=100

Now imagine the service has grown a lot, and there is a need to create a separate piece of infrastructure for each country/geo-area where the service operates. Would you:

Option 1)

Stay with the above API format, and route to shards based on the query-string parameter in the API URL?

Option 2)

Create new API endpoints that have the country/geo-area in the URL path, e.g. /.../xyz/v1/items?page=1&size=100

Option 3)

Put the country/geo-area in the server part of the URL, e.g. https://xyz.mydomain.com/api/v1/items?page=1&size=100

I see that Option 1 has the advantage of not breaking contracts, but I'm not sure that routing based on a query-string parameter is a good idea at all.
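For concreteness, a minimal sketch of what the Option 1 routing layer might look like (the geo-area mapping and DSNs are made up):

# Route by the geo_area query-string parameter; mapping and DSNs are made up.
SHARD_BY_GEO_AREA = {
    "xyz": "postgres://db-xyz.internal/items",
    "abc": "postgres://db-abc.internal/items",
}
DEFAULT_SHARD = "postgres://db-main.internal/items"

def resolve_shard(query_params: dict) -> str:
    # Clients keep calling /v1/items?geo_area=xyz&...; only this routing
    # layer knows that different geo-areas live on different shards.
    return SHARD_BY_GEO_AREA.get(query_params.get("geo_area"), DEFAULT_SHARD)

print(resolve_shard({"geo_area": "xyz", "page": "1", "size": "100"}))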

Options 2 and 3 break the previous API contract (clients that use it must update) and force clients to react to the server's infrastructure changes, which is also a design smell in my opinion.

Designing for sharding from the very beginning is also not an option, as you don't know whether you will ever need it.

design – Is consistent hashing required for sharding?

I am reading about database scaling and came across the sharding technique, and I have also read about consistent hashing. How is sharding implemented in practice? Do we arrange nodes in a ring as in consistent hashing, assign servers to positions on the ring, and then assign data to servers? As I see it, if the number of shards can change at run time and consistent hashing is not used, a lot of data placement gets messed up. Can someone please shed some light on this?
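A minimal sketch of the ring arrangement being described (node names are made up; real implementations add virtual nodes and replicas):

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node sits on the ring at the hash of its name.
        self._points = sorted((_hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's hash to the next node on the ring.
        i = bisect.bisect(self._hashes, _hash(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["shard-a", "shard-b", "shard-c"])
print(ring.node_for("user:42"))  # adding or removing a shard only remaps the
                                 # keys that fall between neighbouring points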

transactions – Securing the “weakest link” of a chain with Sharding

In this question I discussed how sharding the private key with Shamir's Secret Sharing Scheme may be a good alternative to MultiSig as a means of reducing the exorbitant fees associated with MultiSig transactions. @Pieter Wuille provided an excellent analysis of the pros and cons of using Sharding vs. MultiSig and revealed some of the work in progress that is trying to address the main "con" of sharding (the "A Chain Only as Strong as Its Weakest Link" problem).

As mentioned, the main problem with employing SSS as an alternative to MultiSig is that the actual secret key has to be reconstructed on a single machine before a transaction can be signed, once again making it vulnerable to a single point of failure (attack). As the saying goes, "a chain is only as strong as its weakest link".
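For context, a minimal sketch of k-of-n Shamir secret sharing over a prime field (simplified, not production code); note that reconstruct() has to see k shares together on one machine, which is exactly the weak link described above:

import random

PRIME = 2**127 - 1  # a prime large enough for a demo secret

def split(secret: int, k: int, n: int):
    # Random polynomial of degree k-1 with the secret as the constant term.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0: every participating share must be
    # present here, so the secret briefly exists on one machine again.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = split(123456789, k=3, n=5)
assert reconstruct(shares[:3]) == 123456789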

While Taproot and Schnorr signatures were discussed as potential alternatives, these two solutions are for the time being just proposals, and they will need to be accepted by the entire Bitcoin network (which could take forever) to work.

Is there an alternative to SSS that can be used right now, where a transaction can be signed by two (or more) entities without the secret key ever having to be assembled on one machine? Are there any alternatives to Shamir's Secret Sharing that might allow this?

In case it makes a difference, all I need is n-of-n (e.g., 3-of-3) rather than k-of-n (e.g., 3-of-4 or 2-of-3).

Thanks

bitcoin core – Is Sharding a Good Alternative to MultiSig?

For security reasons, we require that each withdrawal be "approved" by multiple servers, so that in the event that a single server is compromised the attacker won't be able to siphon funds from the entire wallet. The typical approach to implementing this is a MultiSig address, where the transaction is only approved after it is signed by multiple entities.

However, the fees for a 2-of-3 MultiSig transaction are nearly twice as high as the fees for a regular transaction. If we wanted 3-of-4 (or higher) MultiSig, the fees would be even higher. This got me thinking about a way to get the security benefits of MultiSig transactions without the excessive fees.

The immediate solution that comes to mind is sharding. In short, instead of using MultiSig we can safely break up the PK using Shamir's Secret Sharing Scheme with an arbitrary threshold (such as 4-of-7 or 3-of-4) and store the shards on separate servers, thereby requiring multiple servers to "sign" a withdrawal request.

Does MultiSig provide any (security) benefit over sharding? Is sharding a viable alternative to MultiSig?

Update:

Perhaps the only problem with this proposal is that each server cannot independently "authorize" the transaction using its "shard". At some point, all the shards would need to be brought together to "reassemble" the PK, so an attacker could still use that moment as the "weakest point of attack". Is this correct?

Thanks

replication – How to do horizontal sharding in MySQL?

I'm trying to do horizontal sharding in MySQL. I installed MySQL Cluster and everything works well, but I don't know how to horizontally shard my database.
Details:
I have 2 data nodes: Node1 and Node2.
I created a database Test with one table T1(key(int), field1, ...).
Now I want 50% of the records of T1 stored on Node1 (odd values of key) and the other 50% on Node2 (even values of key).
I was looking through the documentation but have not found an answer to my problem. Please help me!
Best regards.
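For what it's worth, here is a sketch of the odd/even split described above done at the application level, assuming two ordinary MySQL endpoints (the host names are made up). MySQL Cluster itself distributes rows across data nodes automatically by hashing the partition key, so this is only an illustration of the general idea:

# Route rows by key parity; host names are made up for illustration.
NODES = {
    1: "mysql://node1.example.internal/Test",  # odd keys
    0: "mysql://node2.example.internal/Test",  # even keys
}

def node_for_key(key: int) -> str:
    # key % 2 is 1 for odd keys and 0 for even keys.
    return NODES[key % 2]

print(node_for_key(7))  # odd  -> Node1
print(node_for_key(8))  # even -> Node2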

replication – Fault tolerance for Database sharding and Database partitioning

I'm aware that database sharding splits a dataset horizontally across multiple database instances, whereas database partitioning uses a single instance.

In database sharding, what if one of the databases crashes? We would lose that part of the data completely and would not be able to read or write it. I'm assuming we keep a replica of every database we have sharded? Is there a better approach? That seems too expensive, I believe, if we have many database instances.
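A small sketch of that assumption, with each shard paired with its own replica and reads failing over when the primary is down (the names are made up):

# Each shard has a primary and its own replica; names are made up.
SHARDS = {
    "shard0": {"primary": "db0-primary.internal", "replica": "db0-replica.internal"},
    "shard1": {"primary": "db1-primary.internal", "replica": "db1-replica.internal"},
}

def read_host(shard: str, primary_is_up: bool) -> str:
    # Reads go to the shard's primary, or fail over to its replica.
    nodes = SHARDS[shard]
    return nodes["primary"] if primary_is_up else nodes["replica"]

print(read_host("shard0", primary_is_up=False))  # -> db0-replica.internal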

With database partitioning, we could create a replica of the main database (just one replica), since partitioning splits the dataset within the same database.

One last question: why would we go for a master-slave approach? Do the slaves hold the complete data, or is the data partitioned among the slaves? I believe the master database has the complete data, but I'm not sure about the slaves. If the data were partitioned among the slaves, how would fault tolerance work? Would reads just fall back to the master database?

I know this is a lot of questions. Could you please help me? I am interested in this topic, which is why I have so many questions I cannot yet answer on my own.

database – Types of sharding?

Partitioning is a distribution model for data:

  • vertical partitioning splits different kinds of data (e.g. tables or columns) between different nodes, independently of the data values.
  • horizontal partitioning splits the same kind of data (e.g. rows) between different nodes, based on the data values.

Sharding is in principle a horizontal partitioning of the data. So there is no vertical sharding, unless you generalize the word "sharding" to mean partitioning.

Indeed, the different sharding strategies that you describe are all implementations of horizontal distribution, where different techniques are used to decide efficiently on which shard to store or retrieve a piece of data. The choice of the most appropriate sharding strategy may exploit properties of the hash keys or of the algorithms used, for example, to optimize distributed access based on known needs.
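A minimal sketch contrasting the two kinds of partitioning (node and table names are made up):

# Vertical partitioning: different kinds of data on different nodes,
# regardless of the values.
VERTICAL = {
    "node-a": ["users", "sessions"],   # these tables live on node-a
    "node-b": ["orders", "invoices"],  # these tables live on node-b
}

# Horizontal partitioning (sharding): the same kind of data split by value.
def shard_for_row(user_id: int, shards=("node-a", "node-b")) -> str:
    return shards[user_id % len(shards)]

print(VERTICAL["node-a"])     # placement by kind of data
print(shard_for_row(12345))   # placement by data value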