computer science – Do CFGs’ concepts of derivation and derivation tree apply to grammars more general than CFGs?

Hopcroft and Ullman’s Introduction to Automata Theory, Languages, and Computation (1979) defines derivation for a given CFG $G$ as follows:

Two strings are related by $\to_G$ exactly when the second is obtained
from the first by one application of some production.

$\to^*_G$ is the reflexive and transitive closure of $\to_G$. We say $\alpha_1$ derives $\alpha_m$ in grammar $G$ if $\alpha_1 \to^*_G \alpha_m$.

A derivation can then be represented as a tree, called a derivation tree or parse tree.
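The one-step relation quoted above is plain string rewriting, so it can be written down without assuming that the left-hand side of a production is a single nonterminal. Below is a minimal sketch in Python, assuming productions are given as (lhs, rhs) string pairs. The names `one_step` and `derives` are illustrative and not taken from the book, and the search is cut off after `max_steps` steps because an unbounded search need not terminate for general grammars.

```python
def one_step(s, productions):
    """Return every string obtainable from s by one application of one production."""
    results = set()
    for lhs, rhs in productions:
        start = s.find(lhs)
        while start != -1:
            # Replace this single occurrence of lhs by rhs.
            results.add(s[:start] + rhs + s[start + len(lhs):])
            start = s.find(lhs, start + 1)
    return results


def derives(source, target, productions, max_steps):
    """Bounded check of the reflexive-transitive closure of one_step."""
    if source == target:  # reflexive case: zero steps
        return True
    frontier = {source}
    for _ in range(max_steps):
        frontier = {t for s in frontier for t in one_step(s, productions)}
        if target in frontier:
            return True
    return False


# Example with the context-free grammar S -> aSb | ab.
cfg = [("S", "aSb"), ("S", "ab")]
print(sorted(one_step("S", cfg)))
print(derives("S", "aabb", cfg, max_steps=3))
```

Nothing in `one_step` depends on `lhs` being one symbol, so the same function also accepts rules such as `("aB", "ab")`.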

In grammars more general than CFGs, such as CSGs (type 1 grammars) and unrestricted grammars (type 0 grammars),

  • do the concepts of derivation and derivation tree (also called parse tree) still exist, or
  • what concepts are they generalized to?

Thanks.

C++ STL: How does the distance() method work for a set/ multiset (stored internally as a self balancing tree)?

The complexity of std::distance depends on the category of the iterators supplied: in general it is only required to take time linear in the distance, but in the special case of random access iterators the worst-case running time is constant. (I believe this accounts only for the time spent in the function itself and assumes that advancing an iterator takes constant time.)

The C++ specification does not mandate any particular implementation as long as it conforms to the required complexity, so your question cannot be answered without inspecting the particular implementation you are using.
However, just to convey the intuition, here are two possible implementations that would conform to the requirements:

  • Given random access iterators $x$ and $y$, distance($x$, $y$) returns $y-x$.
  • For general iterators increment $x$ until it equals $y$. Return the number of increments performed.

std::set provides bidirectional iterators, not random access iterators, so std::distance may take linear time and the second implementation above can be used. Your question then reduces to “how can the standard library iterate over the elements of a std::set in sorted order?”

The answer to this question once again depends on the implementation, as the standard does not mandate a particular data structure for std::set.
Since you mention red-black trees, which are a special kind of BST, this can be done by observing that the order of iteration coincides with the order in which the vertices of a BST are visited by an in-order traversal.
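To make the intuition concrete, here is a small sketch of the second implementation on a BST with parent pointers. It is written in Python rather than C++ and uses a plain unbalanced BST, so it is only a model of what a library might do, not the code of any actual standard library. All names are hypothetical, and `None` plays the role of the past-the-end iterator.

```python
class Node:
    def __init__(self, key, parent=None):
        self.key = key
        self.left = None
        self.right = None
        self.parent = parent


def insert(root, key):
    """Plain (unbalanced) BST insert; returns the root of the tree."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key, cur)
                return root
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = Node(key, cur)
                return root
            cur = cur.right


def leftmost(node):
    """Smallest node of the subtree, i.e. what begin() would point to."""
    while node.left is not None:
        node = node.left
    return node


def successor(node):
    """Next node in in-order; this models incrementing an iterator."""
    if node.right is not None:
        return leftmost(node.right)
    # Climb until we leave a left subtree; that ancestor comes next.
    while node.parent is not None and node is node.parent.right:
        node = node.parent
    return node.parent


def distance(first, last):
    """Count increments from first to last (last must be reachable from first)."""
    count = 0
    while first is not last:
        first = successor(first)
        count += 1
    return count
```

A single call to `successor` may climb many levels, but a full in-order walk crosses each edge at most twice, so walking the whole tree still takes time linear in the number of elements.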

Notice that the concept of distance abstracts completely from the data structure used to store the elements of the set. It refers only to the positions at which two elements appear when an iterator is used to access the collection’s contents (for std::set, the elements appear in sorted order).

version control – Best practice for organizing build products of dependencies and project code in your repo source tree?

I’ve checked quite a few related questions on source tree organization, but couldn’t find the answer for my exact need:

For a project I’m working on, my source tree is organized this way

  • build: all build scripts and resources required by continuous integration
  • src: all first-party source code and IDE projects of our team
  • test: all the code and data required for automated tests
  • thirdparty: all external dependencies
    • _original_: all downloaded open-source library archives
    • libthis: unzipped open-source lib with possible custom changes
    • libthat: …
    • ….

So far I’ve been building our first-party products right in the src folder, inside each IDE project (Visual Studio and Xcode), and building the third-party products in their own working-copy folders.

Problem

However, this reveals several drawbacks, e.g.:

  • In order to accommodate the variety of dependency locations, the lib search paths of the first-party IDE projects become messy
  • it’s hard to track the output products through the file system

Intentions

So I’d love to centralize all the build products including dependencies and our first-party products, so that

  • the build products don’t mess up the repo SCM tidiness
  • all the targets and intermediates are easy to rebuild, archive, or purge
  • it’s easy to trace each product back to its source subtree from the file system

Current Ideas

I’ve tried to create another folder, e.g., _prebuilt_ under thirdparty folder so that it looks like this

  • thirdparty
    • _original_
    • _prebuilt_: holding build products from all thirdparty libs for all platforms
      • platform1
      • platform2
      • platform3
      • ….
    • libthis
    • libthat

One complaint I have about this scheme: mixing derivatives with working copies (lib…) and archives (_original_) forces me to make the derivative and archive folders stand out by giving them ugly markup in their names, in this case underscores (_).

Another idea is to use a universal folder right at the root of the repo and have all the build products of dependencies and project products sit there in a jumble. But then, it sounds messy and would be hard to track different sources.

Either way, some post-build scripts must be set in action to move artifacts out of their original working copies.

Question

In general, what would be the best practice to organize the build products?

I’d love to achieve at least the goals listed under Intentions above.

binary tree – Is this model of converting integers to Gray code correct?

The model shown in the figure converts all numbers that have k digits in the binary system to Gray code without any calculation, but I have no proof that guarantees this claim.

[Figure: the conversion model]

Here is some information on how to use it.

Conversion model for all integers that have k digits in the binary system to Gray code.

Rules

  1. You will need k columns of numbers. The numbers to be encoded are arranged in the last column.

  2. The conversion is done gradually by transferring groups of numbers to the adjacent columns in the way the arrows show. There are two types of paired arrows, parallel (=) and intersecting (×). Symbolically, the (=) and (×) are considered inverse of each other.

  3. Each column with arrows is formed by an exact copy of the previous column and by a copy of the previous column that has inverted pairs of arrows, which is placed below the exact copy.

Observation

If we set “=” = 0 and “×” = 1, then the successive columns containing arrows form the Thue-Morse sequence, which essentially forms the rules for converting integers to Gray code.
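One way to test the model for a given k is to compare its last column against a reference computed directly. The sketch below assumes that the target is the standard binary-reflected Gray code, which the post does not state explicitly; for that code, the Gray code of n is n XOR (n >> 1). The function names are illustrative, and the Thue-Morse helper is included only so the arrow columns can be compared against it.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_table(k):
    """Gray codes of all k-digit binary numbers, as k-character bit strings."""
    return [format(to_gray(n), "0{}b".format(k)) for n in range(2 ** k)]


def adjacent_codes_differ_by_one_bit(k):
    """Check the defining property: consecutive codes differ in exactly one bit."""
    codes = [to_gray(n) for n in range(2 ** k)]
    return all(bin(a ^ b).count("1") == 1 for a, b in zip(codes, codes[1:]))


def thue_morse(length):
    """First `length` terms of the Thue-Morse sequence (parity of the 1-bits of i)."""
    return [bin(i).count("1") % 2 for i in range(length)]


print(gray_table(3))
print(thue_morse(8))
```

Agreement with `gray_table(k)` for small k is evidence for the model, not a proof; a proof would still need an induction over the column construction in rule 3.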

algorithms – Tight upper bound for forming an $n$ element Red-Black Tree from scratch

I learnt that in an order-statistic tree (an augmented Red-Black Tree in which each node $x$ contains an extra field denoting the number of nodes in the subtree rooted at $x$), finding the $i$th order statistic can be done in $O(\lg(n))$ time in the worst case. If instead an array represents the dynamic set of elements, finding the $i$th order statistic can be achieved in $O(n)$ time in the worst case (where $n$ is the number of elements).

So I wanted to find a tight upper bound for forming an $n$ element Red-Black Tree, so that I could say which alternative is better: “maintain the set elements in an array and perform each query in $O(n)$ time” or “maintain the elements in a Red-Black Tree (forming which takes $O(f(n))$ time, say) and then perform each query in $O(\lg(n))$ time”.


A very rough analysis is as follows: inserting an element into an $n$ element Red-Black Tree takes $O(\lg(n))$ time and there are $n$ elements to insert, so it takes $O(n\lg(n))$ time. This analysis is quite loose, because when there are only a few elements in the Red-Black Tree the height is small, and so is the time to insert into the tree.

I tried to attempt a detailed analysis as follows (but failed however):

Suppose that when inserting the $j$th element, where $j=i+1$, the height of the tree is at most $2\cdot\lg(i+1)+1$. For an appropriate constant $c$, the total running time satisfies

$$T(n)\leq \sum_{j=1}^{n}c\cdot(2\cdot\lg(j)+1)$$

$$=c\cdot\sum_{i=0}^{n-1}(2\cdot\lg(i+1)+1)$$

$$=c\cdot\left(\sum_{i=0}^{n-1}2\cdot\lg(i+1)+\sum_{i=0}^{n-1}1\right)$$

$$=2c\sum_{i=0}^{n-1}\lg(i+1)+cn\tag1$$

Now

$$\sum_{i=0}^{n-1}\lg(i+1)=\lg(1)+\lg(2)+\lg(3)+\dots+\lg(n)=\lg(1\cdot 2\cdot 3\cdots n)\tag2$$

Now $$\prod_{k=1}^{n}k\leq n^n, \text{ which is a very loose upper bound}\tag 3$$

Using $(3)$ in $(2)$ and substituting the result in $(1)$ we have $T(n)=O(n\lg(n))$, which is the same as the rough analysis…

Can I do anything better than $(3)$?
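One way to gauge how much slack $(3)$ leaves in the sum: keeping only the largest half of the factors gives $n! \geq (n/2)^{n/2}$, so $\lg(n!) \geq (n/2)\lg(n/2)$, which means the sum in $(2)$ is $\Theta(n\lg n)$ and $(3)$ loses at most a constant factor. The Python sketch below checks this numerically; the function names are made up for illustration, and it says nothing about other ways of building the tree.

```python
import math


def lg_factorial(n):
    """lg(n!) computed as the sum lg(1) + lg(2) + ... + lg(n), as in (2)."""
    return sum(math.log2(k) for k in range(1, n + 1))


def bounds(n):
    """Return (lower bound, lg(n!), upper bound from (3)) for comparison."""
    lower = (n / 2) * math.log2(n / 2)
    upper = n * math.log2(n)
    return lower, lg_factorial(n), upper


for n in (10, 1000, 100000):
    lower, exact, upper = bounds(n)
    # The ratio exact / upper approaches 1 as n grows.
    print(n, round(lower, 1), round(exact, 1), round(upper, 1), round(exact / upper, 3))
```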


All the nodes referred to are the internal nodes in the Red-Black Tree.

c# – Algorithm to determine if binary tree is a Binary Search Tree (BST)

Continuing with algorithms, I’ve implemented a binary search tree validator. I don’t like the two boolean variables within NodeFollowsBSTContract, as they feel too complicated. I feel the method should be cleaned up, but I don’t see how yet.

Also, before each recursive step down to check child nodes, a new list is created. Is there a better way to implement this check that doesn’t repeatedly create new lists?

using System.Collections.Generic;
using System.Linq;

public class BinaryTreeNode
{
    public BinaryTreeNode Left { get; set; }
    public BinaryTreeNode Right { get; set; }

    public int? Value { get; }

    public BinaryTreeNode(int value)
    {
        Value = value;
    }
}

public class ValidateBST
{
    BinaryTreeNode _root;
    public ValidateBST(BinaryTreeNode root)
    {
        _root = root;
    }

    public bool IsBinarySearchTree()
    {
        if ((_root.Left?.Value ?? 0) <= (_root.Value)
            || (_root.Right?.Value ?? 0) > (_root.Value))
        {
            var listIncludingRootValue = new List<int>()
            {
                _root.Value.Value
            };

            var leftLegValid = NodeFollowsBSTContract(_root.Left, new List<int>(), new List<int>(listIncludingRootValue));

            var rightLegvalid = NodeFollowsBSTContract(_root.Right, new List<int>(listIncludingRootValue), new List<int>());

            return leftLegValid && rightLegvalid;
        }
        else
        {
            return false;
        }   
    }

    private bool NodeFollowsBSTContract(BinaryTreeNode node, List<int> parentSmallerValues, List<int> parentLargerValues)
    {
        if (node == null)
        {
            return true;
        }

        bool isLessThanAllParentLargerValues = !parentLargerValues.Any()
            || parentLargerValues.Where(value => node.Value.Value <= value).Count() == parentLargerValues.Count;

        bool isGreaterThanAllParentSmallerValues = !parentSmallerValues.Any()
            || parentSmallerValues.Where(value => node.Value.Value > value).Count() == parentSmallerValues.Count;

        if (!isLessThanAllParentLargerValues || !isGreaterThanAllParentSmallerValues)
        {
            return false;
        }

        if (node.Left != null)
        {
            var updatedLargerValues = GenerateUpdatedLists(node.Value.Value, parentLargerValues);
            var updatedSmallervalues = new List<int>(parentSmallerValues);

            if (!NodeFollowsBSTContract(node.Left, updatedSmallervalues, updatedLargerValues))
            {
                return false;
            }
        }

        if (node.Right != null)
        {
            var updatedvalues = GenerateUpdatedLists(node.Value.Value, parentSmallerValues);

            if (!NodeFollowsBSTContract(node.Right, updatedvalues, parentLargerValues))
            {
                return false;
            }
        }

        return true;
    }

    private List<int> GenerateUpdatedLists(int addValue, List<int> values)
    {
        var updatedValues = new List<int>(values)
        {
            addValue
        };

        return updatedValues;
    }
}
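On the question of avoiding the lists: only the tightest bound on each side matters, so one commonly used alternative is to pass a single lower and a single upper bound down the recursion. Below is a minimal sketch of that idea in Python rather than C#. It assumes the same convention the code above appears to use (left subtree values are less than or equal to the node, right subtree values are strictly greater), and the names `Node` and `is_bst` are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def is_bst(node, low=None, high=None):
    """Check that every value v in the subtree satisfies low < v <= high.

    low is an exclusive lower bound, high an inclusive upper bound;
    None means the subtree is unbounded on that side.
    """
    if node is None:
        return True
    if low is not None and node.value <= low:
        return False
    if high is not None and node.value > high:
        return False
    # Going left tightens the upper bound; going right tightens the lower bound.
    return (is_bst(node.left, low, node.value)
            and is_bst(node.right, node.value, high))


valid = Node(5, Node(3, Node(1), Node(4)), Node(8))
invalid = Node(5, Node(3, None, Node(6)), Node(8))  # 6 sits in the left subtree of 5
print(is_bst(valid), is_bst(invalid))
```

This visits each node once and allocates nothing per call, at the cost of no longer remembering which ancestor a violating node conflicts with.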

complexity theory – Best case “skew height” of an arbitrary tree

Given an arbitrary binary tree on $n$ nodes, choose an assignment $A$ from each parent to one of its children (the “favored child” as it were). We define the skew height of the tree as $H_A(\mathsf{nil})=0$ and $H_A(\mathsf{node}\;a\;b)=\max(H_A(a), H_A(b)+1)$ if $A(\mathsf{node}\;a\;b)=a$ is the favored child and symmetrically $\max(H_A(a)+1, H_A(b))$ if $b$ is favored.

The question is: For a fixed tree $T$, what is the minimum skew height over all assignments? I would like to get an asymptotic bound on $f(n)=\max_{|T|=n}\min_A H_A(T)$.

Other variations on this problem I am interested in are when the trees are not binary (but there is still one favored child and all others add one to the height), and when there is sharing (i.e. it is a dag), which doesn’t affect the height computation but allows for much wider “trees” while staying under the $n$ node bound.

The obvious bounds are $f(n)=\Omega(\log n)$ and $f(n)=O(n)$. My guess is that $f(n)=\Theta(\log n)$ for binary trees, and $f(n)=\Theta(\sqrt n)$ for dags (with some kind of grid graph as counterexample).
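For experimenting with small binary trees, the inner minimum $\min_A H_A(T)$ can be computed bottom-up. Both branches of the definition are monotone in $H_A(a)$ and $H_A(b)$, so it suffices to minimize each subtree independently and then pick the better favored child at the node. The Python sketch below assumes trees are encoded as nested pairs `(a, b)` with `None` for $\mathsf{nil}$; the encoding and the function names are choices made for this example.

```python
def min_skew_height(tree):
    """Minimum skew height over all favored-child assignments.

    A tree is either None (nil) or a pair (a, b) of subtrees.
    """
    if tree is None:
        return 0
    a, b = tree
    ha, hb = min_skew_height(a), min_skew_height(b)
    favor_a = max(ha, hb + 1)  # a is favored, so b costs one extra
    favor_b = max(ha + 1, hb)  # b is favored, so a costs one extra
    return min(favor_a, favor_b)


def complete(depth):
    """Complete binary tree with `depth` levels of nodes."""
    if depth == 0:
        return None
    sub = complete(depth - 1)
    return (sub, sub)


def path(length):
    """Left-leaning path with `length` nodes."""
    tree = None
    for _ in range(length):
        tree = (tree, None)
    return tree


print(min_skew_height(complete(4)), min_skew_height(path(10)))
```

In this sketch the value only increases at a node whose two subtrees tie, which is why a long path stays at 1 while a complete tree gains one per level.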

Is Merkle tree pruning described in the whitepaper feasible/useful? If not, would there be any alternative?

When I was reading bitcoin-paper-errata-and-details.md written by David A. Harding, I realized that there’s probably a common misunderstanding or over-simplification about Merkle tree pruning. What Nick ODell had said might be a live example:

  • A leaf (transaction) can be pruned when all of its outputs have been spent.

This once seemed true to me, until I read what David had written:

there is currently no way in Bitcoin to prove that a transaction has not been spent

I’m not sure whether I have grasped it, so first I made a diagram to illustrate (part of) my understanding of this problem:

[Figure: inconsistent pruning]

Still, I don’t think this problem alone kills the whole idea of Merkle tree pruning; I think it just means that the reclaimable disk capacity is much lower than expected. In other words, if I’m not mistaken, Nick ODell’s claim could be “corrected” to:

  • A leaf (transaction) can be pruned when all of its outputs have been spent, and all of its previous transactions have been pruned.

However, I then think that, even if the “corrected” claim is taken into consideration, the idea of Merkle tree pruning still doesn’t seem to be feasible/useful:

  1. A new full node joining the network needs to download & verify everything. Even if the problem mentioned above is avoided, a malicious node can still deceive the new full node by hiding/picking some merkle branches. In other words, a malicious node can lie about the actual ownership of coins (spent/unspent state) without breaking the Merkle tree structure at all.

  2. If a full node needs to enable pruning to reduce its own disk space requirement, directly reading and modifying the blockchain files seems much less efficient than the current implementation, in which the UTXO set is completely separate from the blockchain storage, so that a full node (whether pruning or not) only needs to query and update the UTXO set database during the download & validation process. The blockchain itself doesn’t need to be touched again for validation purposes at all, which is why old blocks can simply be deleted when “pruning” (not Merkle tree pruning) is enabled.
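To make point 1 concrete, the sketch below builds a Merkle root and checks a Merkle branch in Python. It follows the double-SHA-256 hashing and the duplicate-the-last-hash rule for odd levels used by Bitcoin’s block Merkle trees, but it ignores Bitcoin’s byte-order conventions and uses made-up leaf data, and all function names are hypothetical. What it illustrates is that a branch proves a leaf is included under a root and nothing more: the check carries no information about whether that transaction’s outputs were spent later.

```python
import hashlib


def dsha256(data):
    """Double SHA-256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def next_level(level):
    """Hash adjacent pairs; an odd level duplicates its last hash first."""
    if len(level) % 2 == 1:
        level = level + [level[-1]]
    return [dsha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves):
    """Root of the tree over a non-empty list of leaf hashes."""
    level = list(leaves)
    while len(level) > 1:
        level = next_level(level)
    return level[0]


def merkle_branch(leaves, index):
    """Sibling hashes needed to recompute the root from leaves[index]."""
    branch = []
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        branch.append(level[index ^ 1])  # sibling sits at the paired position
        level = next_level(level)
        index //= 2
    return branch


def verify_branch(leaf, index, branch, root):
    """Recompute the root from a leaf and its branch; proves inclusion only."""
    h = leaf
    for sibling in branch:
        h = dsha256(h + sibling) if index % 2 == 0 else dsha256(sibling + h)
        index //= 2
    return h == root


leaves = [dsha256(bytes([i])) for i in range(5)]  # stand-ins for transaction ids
root = merkle_root(leaves)
print(verify_branch(leaves[3], 3, merkle_branch(leaves, 3), root))
```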

However, I’m still not sure about this conclusion. Is this related to the idea of fraud proofs, in the sense that as long as there’s still at least one honest full node, the new node would be able to spot which piece of data is the correct one? What if the UTXO set is also committed to the blockchain? What if some more commitments like the block height of previous transaction are also added to the blockchain?

Furthermore, I’ve heard that the Mimblewimble protocol enables secure blockchain pruning. I’m also curious how Mimblewimble achieves this, and whether a similar goal could eventually be achieved in Bitcoin.

What is the upper bound on the number of nodes in a tree with n leaves where each internal node has at least two children?

Is there a way to find the upper bound on the total number of nodes in the tree?

Yes. There is actually a formula for the exact number: see e.g. these notes:

A full $m$-ary tree with $l$ leaves has $n = \frac{ml - 1}{m - 1}$ vertices and $i = \frac{l-1}{m-1}$ internal vertices.

Here, $m$-ary means that every internal node has between $1$ and $m$ children, and full means that every internal node has exactly $m$ children (the maximum). So in your case, for a full binary tree, just plug in $m = 2$ to these formulas.


By the way, to derive such formulas, you can count the number of vertices in the tree in two ways. First, they are either internal or leaves, so
$$
n = i + l
$$

Second, we can obtain the total number of vertices by adding up the number of children over all nodes, plus the root, so
$$
n = 1 + \sum_{v \in V} \text{children}(v) = 1 + mi,
$$

because each internal node has $m$ children and leaves have no children.
Now set the two equal and we have
$$
i + l = 1 + mi,
$$

from which we can derive the above results.
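Spelling out that last step as a worked derivation, solving for $i$ and substituting into $n = i + l$:
$$
i + l = 1 + mi \;\Longrightarrow\; i = \frac{l-1}{m-1}, \qquad n = i + l = \frac{(l-1) + l(m-1)}{m-1} = \frac{ml-1}{m-1}.
$$

For $m = 2$ this gives $i = l - 1$ and $n = 2l - 1$. The same counting argument also covers the case where internal nodes have at least two children rather than exactly two: then $\sum_{v \in V} \text{children}(v) \geq 2i$, so
$$
i + l = n \geq 1 + 2i \;\Longrightarrow\; i \leq l - 1 \;\Longrightarrow\; n \leq 2l - 1,
$$

with equality exactly when every internal node has two children.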

recurrence relation – Clarifying statements involving asymptotic notations in the solution of $T(n) = 3T(\lfloor n/4 \rfloor) + \Theta(n^2)$ using recursion tree and substitution

Below is a problem worked out in Introduction to Algorithms by Cormen et al.

(I have no problem with the proof; I only want to clarify the meaning of a few statements the text makes while solving the recurrence. The statements are given as an ordered list at the end. I ask simply because I want to master the text.)

$$T(n) = 3T(\lfloor n/4 \rfloor) + \Theta(n^2)$$

The authors first try to find a good guess for the solution of the recurrence using the recursion tree method; for that they allow some sloppiness and assume $T(n)=3T(n/4) + cn^2$.

[Figure: recursion tree for $T(n)=3T(n/4) + cn^2$]

The recursion tree above is not strictly required for my question, but I included it to make the background a bit clearer.

The guessed candidate is $T(n)=O(n^2)$. The authors then prove it using the substitution method.

In fact, if $O(n^2)$ is indeed an upper bound for the recurrence (as we shall verify in a moment), then it must be a tight bound. Why? The first recursive call contributes a cost of $\Theta(n^2)$, and so $\Omega(n^2)$ must be a lower bound for the recurrence. Now we can use the substitution method to verify that our guess was correct, that is, $T(n)=O(n^2)$ is an upper bound for the recurrence $T(n) = 3T(\lfloor n/4 \rfloor) + cn^2$. We want to show that $T(n)\leq d n^2$ for some constant $d > 0$.

Now there are a few things which I want to get clarified…

(1) “if $O(n^2)$ is indeed an upper bound for the recurrence”: here the sentence (probably) means that $\exists$ a function $f(n) \in O(n^2)$ such that $T(n)\in O(f(n))$.

(2) “$\Omega(n^2)$ must be a lower bound for the recurrence”: here the sentence probably means that $\exists$ a function $f(n) \in \Omega(n^2)$ such that $T(n)\in \Omega(f(n))$.

(3) “$T(n)=O(n^2)$ is an upper bound for the recurrence $T(n) = 3T(\lfloor n/4 \rfloor) + cn^2$”: this sentence can be interpreted as follows. Assume that $T'(n) = 3T'(\lfloor n/4 \rfloor) + cn^2$; then $\exists$ a function $T(n) \in O(n^2)$ such that $T'(n)\in O(T(n))$.

(4) “$T(n)\leq d n^2$ for some constant $d > 0$”: we are using induction to verify the definition of Big Oh…
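For reference, the inductive step behind statement (4) can be written out as follows, assuming the hypothesis $T(\lfloor n/4 \rfloor)\leq d\lfloor n/4 \rfloor^2$ and with $c$ the constant from the simplified recurrence:
$$
T(n) \leq 3T(\lfloor n/4 \rfloor) + cn^2 \leq 3d\lfloor n/4 \rfloor^2 + cn^2 \leq \frac{3}{16}dn^2 + cn^2 \leq dn^2,
$$

where the last inequality holds whenever $d \geq \frac{16}{13}c$. So the statement being verified is the plain definition of $T(n)=O(n^2)$: one fixed constant $d$ that works for all sufficiently large $n$.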

I feel that the authors could simply have written that $T(n)$ is upper bounded by $n^2$ and lower bounded by $n^2$, or simply $T(n) = O(n^2)$ and $T(n)=\Omega(n^2)$. Did the authors use the style of statements pointed out in $(1),(2),(3)$ just for a clearer explanation, or is some extra meaning conveyed that I am missing?