tools – Automated opt-in accessibility, is this viable?

python – Organizing things together to form a minimum viable Scraper App (part 2)

This is a follow-up of my question over here.

I have corrected the more trivial issues highlighted in @Reinderien’s answer below, as follows. qinghua’s search engine is perpetually down and fudan’s server is super slow, so we use the other two databases for the test.

I would like to seek further advice on how to go about:

class-ify(ing) any session-level state, and keep that state alive across multiple searches for a given database

Also, whether an abstract base class could be used to share common code between the modules in this case, and how that might be implemented.
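For instance, I imagine something roughly along these lines (a sketch only, with made-up class names, not the actual implementation): each database gets one instance that owns its session-level state, such as a requests Session or a Selenium WebDriver, and shared plumbing lives in the base class.

from abc import ABC, abstractmethod
from typing import Iterable

from requests import Session


class Scraper(ABC):
    """Hypothetical base class: one instance per database, so any
    session-level state lives on the instance and survives across searches."""

    @abstractmethod
    def search(self, keyword: str) -> Iterable:
        """Yield result records for a single keyword."""


class FudanScraper(Scraper):
    def __init__(self) -> None:
        self.session = Session()  # kept alive between calls to search()

    def search(self, keyword: str) -> Iterable:
        # the existing fudan.search logic would move here, reusing self.session
        yield from ()

The Selenium-based modules could keep their WebDriver on the instance in the same way.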


main.py

import cnki, fudan, wuhan, qinghua
import json
from typing import Iterable, Tuple
from pathlib import Path


DB_DICT = {
    "cnki": cnki.search,
    "fudan": fudan.search,
    "wuhan": wuhan.search,
    "qinghua": qinghua.search,
    }


def save_articles(articles: Iterable, file_prefix: str) -> None:
    file_path = Path(file_prefix).with_suffix('.json')

    with file_path.open('w') as file:
        file.write('[\n')
        first = True

        for article in articles:

            if first:
                first = False
            else:
                file.write(',\n')
            json.dump(article.as_dict(), file, ensure_ascii=False, indent=4)

        file.write('\n]\n')


def db_search(keyword: str, *args: str):

    if args:

        for db in args:
            yield from DB_DICT[db](keyword)

    else:

        for key in DB_DICT:
            yield from DB_DICT[key](keyword)



def search(keywords: Tuple[str, ...], *args: str):
    for kw in keywords:
        yield from db_search(kw, *args)



if __name__ == '__main__':
    rslt = search(('尹誥','尹至'), 'cnki', 'wuhan')
    save_articles(rslt, 'search_result')

cnki.py

from contextlib import contextmanager
from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import Generator, Iterable, Optional, List, ContextManager, Dict, Tuple
from urllib.parse import unquote
from itertools import chain, count
import re
import json
from math import ceil

# pip install proxy.py
import proxy
from proxy.http.exception import HttpRequestRejected
from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin
from selenium.common.exceptions import (
    NoSuchElementException,
    StaleElementReferenceException,
    TimeoutException,
    WebDriverException,
)
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.common.by import By
from selenium.webdriver.common.proxy import ProxyType
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# from urllib3.packages.six import X


@dataclass
class Result:
    title: str        # Mozi's Theory of Human Nature and Politics
    title_link: str   # http://big5.oversea.cnki.net/kns55/detail/detail.aspx?recid=&FileName=ZDXB202006009&DbName=CJFDLAST2021&DbCode=CJFD
    html_link: Optional[str]  # http%3a%2f%2fkns.cnki.net%2fKXReader%2fDetail%3fdbcode%3dCJFD%26filename%3dZDXB202006009
    author: str       # Xie Qiyang
    source: str       # Vocational University News
    source_link: str  # http://big5.oversea.cnki.net/kns55/Navi/ScdbBridge.aspx?DBCode=CJFD&BaseID=ZDXB&UnitCode=&NaviLink=%e8%81%8c%e5%a4%a7%e5%ad%a6%e6%8a%a5
    date: date   # 2020-12-28
    download: str        #
    database: str     # Periodical

    @classmethod
    def from_row(cls, row: WebElement) -> 'Result':
        number, title, author, source, published, database = row.find_elements_by_xpath('td')

        title_links = title.find_elements_by_tag_name('a')

        if len(title_links) > 1:
            # 'http://big5.oversea.cnki.net/kns55/ReadRedirectPage.aspx?flag=html&domain=http%3a%2f%2fkns.cnki.net%2fKXReader%2fDetail%3fdbcode%3dCJFD%26filename%3dZDXB202006009'
            html_link = unquote(
                title_links[1]
                .get_attribute('href')
                .split('domain=', 1)[1])
        else:
            html_link = None

        dl_links, sno = number.find_elements_by_tag_name('a')
        dl_links = dl_links.get_attribute('href')

        if re.search("javascript:alert.+", dl_links):
            dl_links = None

        published_date = date.fromisoformat(
            published.text.split(maxsplit=1)[0]
        )

        return cls(
            title=title_links[0].text,
            title_link=title_links[0].get_attribute('href'),
            html_link=html_link,
            author=author.text,
            source=source.text,
            source_link=source.get_attribute('href'),
            date=published_date,
            download=dl_links,
            database=database.text,
        )


    def __str__(self):
        return (
            f'題名      {self.title}'
            f'\n作者     {self.author}'
            f'\n來源     {self.source}'
            f'\n發表時間  {self.date}'
            f'\n下載連結 {self.download}'
            f'\n來源數據庫 {self.database}'
        )

    def as_dict(self) -> Dict[str, str]:
        return {
        'author': self.author,
        'title': self.title,
        'publication': self.source,
        'date': self.date.isoformat(),
        'download': self.download,
        'url': self.html_link,
        'database': self.database,
        }


class MainPage:
    def __init__(self, driver: WebDriver):
        self.driver = driver

    def submit_search(self, keyword: str) -> None:
        wait = WebDriverWait(self.driver, 50)
        search = wait.until(
            EC.presence_of_element_located((By.NAME, 'txt_1_value1'))
        )
        search.send_keys(keyword)
        search.submit()

    def switch_to_frame(self) -> None:
        wait = WebDriverWait(self.driver, 100)
        wait.until(
            EC.presence_of_element_located((By.XPATH, '//iframe[@name="iframeResult"]'))
        )
        self.driver.switch_to.default_content()
        self.driver.switch_to.frame('iframeResult')

        wait.until(
            EC.presence_of_element_located((By.XPATH, '//table[@class="GridTableContent"]'))
        )

    def max_content(self) -> None:
        """Maximize the number of items on display in the search results."""
        max_content = self.driver.find_element(
            By.CSS_SELECTOR, '#id_grid_display_num > a:nth-child(3)',
        )
        max_content.click()

    # def get_element_and_stop_page(self, *locator) -> WebElement:
    #     ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
    #     wait = WebDriverWait(self.driver, 30, ignored_exceptions=ignored_exceptions)
    #     elm = wait.until(EC.presence_of_element_located(locator))
    #     self.driver.execute_script("window.stop();")
    #     return elm



class SearchResults:
    def __init__(self, driver: WebDriver):
        self.driver = driver


    def number_of_articles_and_pages(self) -> Tuple[
        int,  # articles
        int,  # pages
    ]:
        articles_elem = self.driver.find_element_by_css_selector('td.TitleLeftCell td')
        n_articles = int(re.search(r"\d+", articles_elem.text)[0])

        page_elem = self.driver.find_element_by_css_selector('font.numNow')
        per_page = int(page_elem.text)

        n_pages = ceil(n_articles / per_page)

        return n_articles, n_pages


    def get_structured_elements(self) -> Iterable[Result]:
        rows = self.driver.find_elements_by_xpath(
            '//table[@class="GridTableContent"]//tr[position() > 1]'
        )

        for row in rows:
            yield Result.from_row(row)


    def get_element_and_stop_page(self, *locator) -> WebElement:
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(self.driver, 30, ignored_exceptions=ignored_exceptions)
        elm = wait.until(EC.presence_of_element_located(locator))
        self.driver.execute_script("window.stop();")
        return elm

    def next_page(self) -> None:
        link = self.get_element_and_stop_page(By.LINK_TEXT, "下頁")

        try:
            link.click()
            print("Navigating to Next Page")
        except (TimeoutException, WebDriverException):
            print("Last page reached")



class ContentFilterPlugin(HttpProxyBasePlugin):
    HOST_WHITELIST = {
        b'ocsp.digicert.com',
        b'ocsp.sca1b.amazontrust.com',
        b'big5.oversea.cnki.net',
    }

    def handle_client_request(self, request: HttpParser) -> Optional[HttpParser]:
        host = request.host or request.header(b'Host')
        if host not in self.HOST_WHITELIST:
            raise HttpRequestRejected(403)

        if any(
            suffix in request.path
            for suffix in (
                b'png', b'ico', b'jpg', b'gif', b'css',
            )
        ):
            raise HttpRequestRejected(403)

        return request

    def before_upstream_connection(self, request):
        return super().before_upstream_connection(request)
    def handle_upstream_chunk(self, chunk):
        return super().handle_upstream_chunk(chunk)
    def on_upstream_connection_close(self):
        pass


@contextmanager
def run_driver() -> ContextManager[WebDriver]:
    prox_type = ProxyType.MANUAL['ff_value']
    prox_host = '127.0.0.1'
    prox_port = 8889

    profile = FirefoxProfile()
    profile.set_preference('network.proxy.type', prox_type)
    profile.set_preference('network.proxy.http', prox_host)
    profile.set_preference('network.proxy.ssl', prox_host)
    profile.set_preference('network.proxy.http_port', prox_port)
    profile.set_preference('network.proxy.ssl_port', prox_port)
    profile.update_preferences()

    plugin = f'{Path(__file__).stem}.{ContentFilterPlugin.__name__}'

    with proxy.start([
        '--hostname', prox_host,
        '--port', str(prox_port),
        '--plugins', plugin,
    ]), Firefox(profile) as driver:
        yield driver


def loop_through_results(driver):
    result_page = SearchResults(driver)
    n_articles, n_pages = result_page.number_of_articles_and_pages()
    
    print(f"{n_articles} found. A maximum of 500 will be retrieved.")

    for page in count(1):

        print(f"Scraping page {page}/{n_pages}")
        print()

        result = result_page.get_structured_elements()
        yield from result

        if page >= n_pages or page >= 10:
            break

        result_page.next_page()
        result_page = SearchResults(driver)


def save_articles(articles: Iterable, file_prefix: str) -> None:
    file_path = Path(file_prefix).with_suffix('.json')

    with file_path.open('w') as file:
        file.write('[\n')
        first = True

        for article in articles:
            if first:
                first = False
            else:
                file.write(',\n')
            json.dump(article.as_dict(), file, ensure_ascii=False, indent=4)

        file.write('\n]\n')


def query(keyword, driver) -> None:

    page = MainPage(driver)
    page.submit_search(keyword)
    page.switch_to_frame()
    page.max_content()


def search(keyword):

    with run_driver() as driver:
        driver.get('http://big5.oversea.cnki.net/kns55/')
        query(keyword, driver)

        print("正在搜尋中國期刊網……")
        print(f"關鍵字:「{keyword}」")

        result = loop_through_results(driver)
        # save_articles(result, 'cnki_search_result.json')

        yield from result


if __name__ == '__main__':
    search('尹至')

qinghua.py

Search functionality is down at the moment. I plan to try it out with Requests as soon as it is back up and running.

from contextlib import contextmanager
from dataclasses import dataclass, asdict, replace
from datetime import datetime, date
from pathlib import Path
from typing import Iterable, Optional, ContextManager
import re
import os
import time
import json

# pip install proxy.py
import proxy
from proxy.http.exception import HttpRequestRejected
from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin
from selenium.common.exceptions import (
    NoSuchElementException,
    StaleElementReferenceException,
    TimeoutException,
    WebDriverException,
)
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.proxy import ProxyType
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


@dataclass
class PrimaryResult:
    captions: str
    date: date
    link: str
    publication: str = "清華大學出土文獻研究與保護中心"

    @classmethod
    def from_row(cls, row: WebElement) -> 'PrimaryResult': 

        caption_elems = row.find_element_by_tag_name('a')
        date_elems = row.find_element_by_class_name('time')

        published_date = date.isoformat(datetime.strptime(date_elems.text, '%Y-%m-%d'))

        return cls(
            captions = caption_elems.text,
            date = published_date,
            link = caption_elems.get_attribute('href'),
        )

    def __str__(self):
        return (
            f'\n標題     {self.captions}'
            f'\n發表時間  {self.date}'
            f'\n文章連結 {self.link}'
        )


class MainPage:
    def __init__(self, driver: WebDriver):
        self.driver = driver
 
    def submit_search(self, keyword: str) -> None:
        driver = self.driver
        wait = WebDriverWait(self.driver, 100)

        xpath = "//form/button/input"
        element_to_hover_over = driver.find_element_by_xpath(xpath)
        hover = ActionChains(driver).move_to_element(element_to_hover_over)
        hover.perform()

        search = wait.until(
            EC.presence_of_element_located((By.ID, 'showkeycode1015273'))
        )
        search.send_keys(keyword)
        search.submit()


    def get_element_and_stop_page(self, *locator) -> WebElement:
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(self.driver, 30, ignored_exceptions=ignored_exceptions)
        elm = wait.until(EC.presence_of_element_located(locator))
        self.driver.execute_script("window.stop();")
        return elm

    def next_page(self) -> None:
        try: 
            link = self.get_element_and_stop_page(By.LINK_TEXT, "下一页")
            link.click()
            print("Navigating to Next Page")

        except (TimeoutException, WebDriverException):
            print("No button with 「下一页」 found.")
            return 0


    # @contextmanager
    # def wait_for_new_window(self):
    #     driver = self.driver
    #     handles_before = driver.window_handles
    #     yield
    #     WebDriverWait(driver, 10).until(
    #         lambda driver: len(handles_before) != len(driver.window_handles))

    def switch_tabs(self):
        driver = self.driver
        print("Current Window:")
        print(driver.title)
        print()

        p = driver.current_window_handle
        
        chwd = driver.window_handles
        time.sleep(3)
        driver.switch_to.window(chwd[1])

        print("New Window:")
        print(driver.title)
        print()


class SearchResults:
    def __init__(self, driver: WebDriver):
        self.driver = driver

    def get_primary_search_result(self):
        
        filePath = os.path.join(os.getcwd(), "qinghua_primary_search_result.json")

        if os.path.exists(filePath):
            os.remove(filePath)    

        rows = self.driver.find_elements_by_xpath('//ul[@class="search_list"]/li')

        for row in rows:
            rslt = PrimaryResult.from_row(row)
            with open('qinghua_primary_search_result.json', 'a') as file:
                json.dump(asdict(rslt), file, ensure_ascii=False, indent=4)
            yield rslt


# class ContentFilterPlugin(HttpProxyBasePlugin):
#     HOST_WHITELIST = {
#         b'ocsp.digicert.com',
#         b'ocsp.sca1b.amazontrust.com',
#         b'big5.oversea.cnki.net',
#         b'gwz.fudan.edu.cn',
#         b'bsm.org.cn/index.php'
#         b'ctwx.tsinghua.edu.cn',
#     }

#     def handle_client_request(self, request: HttpParser) -> Optional(HttpParser):
#         host = request.host or request.header(b'Host')
#         if host not in self.HOST_WHITELIST:
#             raise HttpRequestRejected(403)

#         if any(
#             suffix in request.path
#             for suffix in (
#                 b'png', b'ico', b'jpg', b'gif', b'css',
#             )
#         ):
#             raise HttpRequestRejected(403)

#         return request

#     def before_upstream_connection(self, request):
#         return super().before_upstream_connection(request)
#     def handle_upstream_chunk(self, chunk):
#         return super().handle_upstream_chunk(chunk)
#     def on_upstream_connection_close(self):
#         pass


# @contextmanager
# def run_driver() -> ContextManager(WebDriver):
#     prox_type = ProxyType.MANUAL('ff_value')
#     prox_host = '127.0.0.1'
#     prox_port = 8889

#     profile = FirefoxProfile()
#     profile.set_preference('network.proxy.type', prox_type)
#     profile.set_preference('network.proxy.http', prox_host)
#     profile.set_preference('network.proxy.ssl', prox_host)
#     profile.set_preference('network.proxy.http_port', prox_port)
#     profile.set_preference('network.proxy.ssl_port', prox_port)
#     profile.update_preferences()

#     plugin = f'{Path(__file__).stem}.{ContentFilterPlugin.__name__}'

#     with proxy.start((
#         '--hostname', prox_host,
#         '--port', str(prox_port),
#         '--plugins', plugin,
#     )), Firefox(profile) as driver:
#         yield driver


def search(keyword):
    print("正在搜尋清華大學出土文獻研究與保護中心網……")
    print(f"關鍵字:「{keyword}」")
    with Firefox() as driver:
        driver.get('http://www.ctwx.tsinghua.edu.cn/index.htm')

        page = MainPage(driver)
        # page.select_dropdown_item()
        page.submit_search(keyword)

        time.sleep(5)
        # page.switch_tabs()

        while True:
            primary_result_page = SearchResults(driver)
            primary_results = primary_result_page.get_primary_search_result()
            
            yield from primary_results
            
            # for result in primary_results:
            #     print(result)
            #     print()
                
            if page.next_page() == 0:
                break
            else:
                pass


if __name__ == '__main__':
    search('尹至')

fudan.py

# fudan.py

from dataclasses import dataclass
from itertools import count
from pathlib import Path
from typing import Dict, Iterable, Tuple, List, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from requests import Session
from datetime import date, datetime

import json
import re

BASE_URL = 'http://www.gwz.fudan.edu.cn'


@dataclass
class Link:
    caption: str
    url: str
    clicks: int
    replies: int
    added: date

    @classmethod
    def from_row(cls, props: Dict[str, str], path: str) -> 'Link':
        clicks, replies = props['点击/回复'].split('/')
        # Skip number=int(props['编号']) - this only has meaning within one page

        return cls(
            caption=props['资源标题'],
            url=urljoin(BASE_URL, path),
            clicks=int(clicks),
            replies=int(replies),
            added=datetime.strptime(props['添加时间'], '%Y/%m/%d').date(),
        )

    def __str__(self):
        return f'{self.added} {self.url} {self.caption}'

    def author_title(self) -> Tuple[Optional[str], str]:
        sep = '：'  # full-width colon, U+FF1A

        if sep not in self.caption:
            return None, self.caption

        author, title = self.caption.split(sep, 1)
        author, title = author.strip(), title.strip()

        net_digest = '網摘'
        if author == net_digest:
            return None, self.caption

        return author, title


@dataclass
class Article:
    author: Optional[str]
    title: str
    date: date
    download: Optional[str]
    url: str
    publication: str = "復旦大學出土文獻與古文字研究中心學者文庫"

    @classmethod
    def from_link(cls, link: Link, download: str) -> 'Article':

        author, title = link.author_title()

        download = download.replace("\r", "").replace("\n", "").strip()
        if download == '#_edn1':
            download = None
        elif download[0] != '/':
            download = '/' + download

        return cls(
            author=author,
            title=title,
            date=link.added,
            download=download,
            url=link.url,
        )

    def __str__(self) -> str:
        return (
            f"\n作者   {self.author}"
            f"\n標題   {self.title}"
            f"\n發佈日期 {self.date}"
            f"\n下載連結 {self.download}"
            f"\n訪問網頁 {self.url}"
        )

    def as_dict(self) -> Dict[str, str]:
        return {
            'author': self.author,
            'title': self.title,
            'date': self.date.isoformat(),
            'download': self.download,
            'url': self.url,
            'publication': self.publication
        }


def compile_search_results(session: Session, links: Iterable[Link], category_filter: str) -> Iterable[Article]:

    for link in links:
        with session.get(link.url) as resp:
            resp.raise_for_status()
            doc = BeautifulSoup(resp.text, 'html.parser')

        category = doc.select_one('#_top td a[href="#"]').text
        if category != category_filter:
            continue

        content = doc.select_one('span.ny_font_content')
        dl_tag = content.find(
            'a', {
                'href': re.compile("/?(lunwen/|articles/up/).+")
            }
        )

        yield Article.from_link(link, download=dl_tag['href'])


def get_page(session: Session, query: str, page: int) -> Tuple[List[Link], int]:
    with session.get(
        urljoin(BASE_URL, '/Web/Search'),
        params={
            's': query,
            'page': page,
        },
    ) as resp:
        resp.raise_for_status()
        doc = BeautifulSoup(resp.text, 'html.parser')

    table = doc.select_one('#tab table')
    heads = [h.text for h in table.select('tr.cap td')]
    links = []

    for row in table.find_all('tr', class_=''):
        cells = [td.text for td in row.find_all('td')]
        links.append(Link.from_row(
            props=dict(zip(heads, cells)),
            path=row.find('a')['href'],
        ))

    page_td = doc.select_one('#tab table:nth-child(2) td')  # 共 87 条记录, 页 1/3
    n_pages = int(page_td.text.rsplit('/', 1)[1])

    return links, n_pages


def get_all_links(session: Session, query: str) -> Iterable[Link]:
    for page in count(1):
        links, n_pages = get_page(session, query, page)
        print(f'Scraping page {page}/{n_pages}')
        yield from links

        if page >= n_pages:
            break


def save_articles(articles: Iterable[Article], file_prefix: str) -> None:
    file_path = Path(file_prefix).with_suffix('.json')

    with file_path.open('w') as file:
        file.write('[\n')
        first = True

        for article in articles:
            if first:
                first = False
            else:
                file.write(',\n')
            json.dump(article.as_dict(), file, ensure_ascii=False, indent=4)

        file.write('\n]\n')


def search(keyword):
    print("正在搜尋復旦大學出土文獻與古文字研究中心學者文庫……")
    print(f"關鍵字:「{keyword}」")
    with Session() as session:
        links = get_all_links(session, query=keyword)
        academic_library = '学者文库'
        articles = compile_search_results(
            session, links, category_filter=academic_library)
        # save_articles(articles, 'fudan_search_result')

        yield from articles


if __name__ == '__main__':
    search('尹誥')

wuhan.py

from dataclasses import dataclass, asdict
from itertools import count
from typing import Dict, Iterable, Tuple, List

from bs4 import BeautifulSoup
from requests import post
from datetime import date, datetime

import json
import os
import re

@dataclass
class Result:
    author: str
    title: str
    date: date
    url: str
    publication: str = "武漢大學簡帛網"

    @classmethod
    def from_metadata(cls, metadata: Dict) -> 'Result': 
        author, title = metadata['caption'].split(':')
        published_date = date.isoformat(datetime.strptime(metadata['date'], '%y/%m/%d'))
        url = 'http://www.bsm.org.cn/' + metadata['url']

        return cls(
            author = author,
            title = title,
            date = published_date,
            url = url
        )


    def __str__(self):
        return (
            f'作者    {self.author}'
            f'\n標題     {self.title}'
            f'\n發表時間  {self.date}'
            f'\n文章連結 {self.url}'
            f'\n發表平台  {self.publication}'
        )


def submit_query(keyword: str):
    query = {"searchword": keyword}
    with post('http://www.bsm.org.cn/pages.php?pagename=search', query) as resp:
        resp.raise_for_status()
        doc = BeautifulSoup(resp.text, 'html.parser')
        content = doc.find('div', class_='record_list_main')
        rows = content.select('ul')


    for row in rows:
        if len(row.findAll('li')) != 2:
            print()
            print(row.text)
            print()
        else:
            captions_tag, date_tag = row.findAll('li')
            caption_anchors = captions_tag.findAll('a')
            category, caption = (item.text for item in caption_anchors)
            url = caption_anchors[1]['href']
            date = re.sub("[()]", "", date_tag.text)

            yield {
                "category": category, 
                "caption": caption, 
                "date": date,
                "url": url}


def remove_json_if_exists(filename):
    json_file = filename + ".json"
    filePath = os.path.join(os.getcwd(), json_file)

    if os.path.exists(filePath):
        os.remove(filePath)


def search(query: str):
    remove_json_if_exists('wuhan_search_result')
    rslt = submit_query(query)

    for metadata in rslt:
        article = Result.from_metadata(metadata)
        print(article)
        print()

        with open('wuhan_search_result.json', 'a') as file:
            json.dump(asdict(article), file, ensure_ascii=False, indent=4)



if __name__ == '__main__':
    search('尹至')

python – Organizing things together to form a minimum viable Scraper App

This is a follow-up of my group of scraper questions starting from here.

I have thus far, with the help of @Reinderien, written 4 separate “modules” that expose a search function to scrape bibliographic information from separate online databases. Half of them use Selenium; the other half, Requests.

I would like to know the best way to put them together, possibly organizing them into a single package that can be imported as one, and/or creating a base class so that common code can be shared between them.

I would like the final App to be able to execute the search function for each database, when given a list of search keywords, together with a choice of databases to search on as arguments.
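For illustration, the kind of top-level entry point I have in mind might look roughly like this (a rough sketch only, with placeholder names, not part of the code under review):

import argparse

import cnki, fudan, qinghua, wuhan  # the four scraper modules below

SEARCH_FUNCTIONS = {
    'cnki': cnki.search,
    'fudan': fudan.search,
    'qinghua': qinghua.search,
    'wuhan': wuhan.search,
}


def search_all(keywords, databases):
    """Run every requested database's search for every keyword.

    Assumes each module's search() is reworked to yield its results."""
    for keyword in keywords:
        for name in databases:
            yield from SEARCH_FUNCTIONS[name](keyword)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Scrape bibliographic databases')
    parser.add_argument('keywords', nargs='+')
    parser.add_argument('--db', nargs='*', choices=sorted(SEARCH_FUNCTIONS),
                        default=sorted(SEARCH_FUNCTIONS))
    args = parser.parse_args()

    for result in search_all(args.keywords, args.db):
        print(result)

The four modules written so far follow.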

from contextlib import contextmanager
from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import Generator, Iterable, Optional, List, ContextManager, Dict
from urllib.parse import unquote
from itertools import chain, count
import re
import json
from math import ceil

# pip install proxy.py
import proxy
from proxy.http.exception import HttpRequestRejected
from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin
from selenium.common.exceptions import (
    NoSuchElementException,
    StaleElementReferenceException,
    TimeoutException,
    WebDriverException,
)
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.common.by import By
from selenium.webdriver.common.proxy import ProxyType
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# from urllib3.packages.six import X


@dataclass
class Result:
    title: str        # Mozi's Theory of Human Nature and Politics
    title_link: str   # http://big5.oversea.cnki.net/kns55/detail/detail.aspx?recid=&FileName=ZDXB202006009&DbName=CJFDLAST2021&DbCode=CJFD
    html_link: Optional[str]  # http%3a%2f%2fkns.cnki.net%2fKXReader%2fDetail%3fdbcode%3dCJFD%26filename%3dZDXB202006009
    author: str       # Xie Qiyang
    source: str       # Vocational University News
    source_link: str  # http://big5.oversea.cnki.net/kns55/Navi/ScdbBridge.aspx?DBCode=CJFD&BaseID=ZDXB&UnitCode=&NaviLink=%e8%81%8c%e5%a4%a7%e5%ad%a6%e6%8a%a5
    date: date   # 2020-12-28
    download: str        #
    database: str     # Periodical

    @classmethod
    def from_row(cls, row: WebElement) -> 'Result':
        number, title, author, source, published, database = row.find_elements_by_xpath('td')

        title_links = title.find_elements_by_tag_name('a')

        if len(title_links) > 1:
            # 'http://big5.oversea.cnki.net/kns55/ReadRedirectPage.aspx?flag=html&domain=http%3a%2f%2fkns.cnki.net%2fKXReader%2fDetail%3fdbcode%3dCJFD%26filename%3dZDXB202006009'
            html_link = unquote(
                title_links[1]
                .get_attribute('href')
                .split('domain=', 1)[1])
        else:
            html_link = None

        dl_links, sno = number.find_elements_by_tag_name('a')

        published_date = date.fromisoformat(
            published.text.split(maxsplit=1)[0]
        )

        return cls(
            title=title_links[0].text,
            title_link=title_links[0].get_attribute('href'),
            html_link=html_link,
            author=author.text,
            source=source.text,
            source_link=source.get_attribute('href'),
            date=published_date,
            download=dl_links.get_attribute('href'),
            database=database.text,
        )

    def __str__(self):
        return (
            f'題名      {self.title}'
            f'\n作者     {self.author}'
            f'\n來源     {self.source}'
            f'\n發表時間  {self.date}'
            f'\n下載連結 {self.download}'
            f'\n來源數據庫 {self.database}'
        )

    def as_dict(self) -> Dict[str, str]:
        return {
        'author': self.author,
        'title': self.title,
        'date': self.date.isoformat(),
        'download': self.download,
        'url': self.html_link,
        'database': self.database,
    }


class MainPage:
    def __init__(self, driver: WebDriver):
        self.driver = driver

    def submit_search(self, keyword: str) -> None:
        wait = WebDriverWait(self.driver, 50)
        search = wait.until(
            EC.presence_of_element_located((By.NAME, 'txt_1_value1'))
        )
        search.send_keys(keyword)
        search.submit()

    def switch_to_frame(self) -> None:
        wait = WebDriverWait(self.driver, 100)
        wait.until(
            EC.presence_of_element_located((By.XPATH, '//iframe[@name="iframeResult"]'))
        )
        self.driver.switch_to.default_content()
        self.driver.switch_to.frame('iframeResult')

        wait.until(
            EC.presence_of_element_located((By.XPATH, '//table[@class="GridTableContent"]'))
        )

    def max_content(self) -> None:
        """Maximize the number of items on display in the search results."""
        max_content = self.driver.find_element(
            By.CSS_SELECTOR, '#id_grid_display_num > a:nth-child(3)',
        )
        max_content.click()

    # def get_element_and_stop_page(self, *locator) -> WebElement:
    #     ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
    #     wait = WebDriverWait(self.driver, 30, ignored_exceptions=ignored_exceptions)
    #     elm = wait.until(EC.presence_of_element_located(locator))
    #     self.driver.execute_script("window.stop();")
    #     return elm



class SearchResults:
    def __init__(self, driver: WebDriver):
        self.driver = driver


    def number_of_articles_and_pages(self) -> int:
        elem = self.driver.find_element_by_xpath(
            '//table//tr[3]//table//table//td[1]/table//td[1]'
        )
        n_articles = re.search("共有記錄(.+)條", elem.text).group(1)
        n_pages = ceil(int(n_articles)/50)

        return n_articles, n_pages


    def get_structured_elements(self) -> Iterable[Result]:
        rows = self.driver.find_elements_by_xpath(
            '//table[@class="GridTableContent"]//tr[position() > 1]'
        )

        for row in rows:
            yield Result.from_row(row)


    def get_element_and_stop_page(self, *locator) -> WebElement:
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(self.driver, 30, ignored_exceptions=ignored_exceptions)
        elm = wait.until(EC.presence_of_element_located(locator))
        self.driver.execute_script("window.stop();")
        return elm

    def next_page(self) -> None:
        link = self.get_element_and_stop_page(By.LINK_TEXT, "下頁")

        try:
            link.click()
            print("Navigating to Next Page")
        except (TimeoutException, WebDriverException):
            print("Last page reached")



class ContentFilterPlugin(HttpProxyBasePlugin):
    HOST_WHITELIST = {
        b'ocsp.digicert.com',
        b'ocsp.sca1b.amazontrust.com',
        b'big5.oversea.cnki.net',
    }

    def handle_client_request(self, request: HttpParser) -> Optional[HttpParser]:
        host = request.host or request.header(b'Host')
        if host not in self.HOST_WHITELIST:
            raise HttpRequestRejected(403)

        if any(
            suffix in request.path
            for suffix in (
                b'png', b'ico', b'jpg', b'gif', b'css',
            )
        ):
            raise HttpRequestRejected(403)

        return request

    def before_upstream_connection(self, request):
        return super().before_upstream_connection(request)
    def handle_upstream_chunk(self, chunk):
        return super().handle_upstream_chunk(chunk)
    def on_upstream_connection_close(self):
        pass


@contextmanager
def run_driver() -> ContextManager[WebDriver]:
    prox_type = ProxyType.MANUAL['ff_value']
    prox_host = '127.0.0.1'
    prox_port = 8889

    profile = FirefoxProfile()
    profile.set_preference('network.proxy.type', prox_type)
    profile.set_preference('network.proxy.http', prox_host)
    profile.set_preference('network.proxy.ssl', prox_host)
    profile.set_preference('network.proxy.http_port', prox_port)
    profile.set_preference('network.proxy.ssl_port', prox_port)
    profile.update_preferences()

    plugin = f'{Path(__file__).stem}.{ContentFilterPlugin.__name__}'

    with proxy.start([
        '--hostname', prox_host,
        '--port', str(prox_port),
        '--plugins', plugin,
    ]), Firefox(profile) as driver:
        yield driver


def loop_through_results(driver):
    result_page = SearchResults(driver)
    n_articles, n_pages = result_page.number_of_articles_and_pages()
    
    print(f"{n_articles} found. A maximum of 500 will be retrieved.")

    for page in count(1):

        print(f"Scraping page {page}/{n_pages}")
        print()

        result = result_page.get_structured_elements()
        yield from result

        if page >= n_pages or page >= 10:
            break

        result_page.next_page()
        result_page = SearchResults(driver)


def save_articles(articles: Iterable, file_prefix: str) -> None:
    file_path = Path(file_prefix).with_suffix('.json')

    with file_path.open('w') as file:
        file.write('[\n')
        first = True

        for article in articles:
            if first:
                first = False
            else:
                file.write(',\n')
            json.dump(article.as_dict(), file, ensure_ascii=False, indent=4)

        file.write('\n]\n')


def query(keyword, driver) -> None:

    page = MainPage(driver)
    page.submit_search(keyword)
    page.switch_to_frame()
    page.max_content()


def search(keyword):
    with Firefox() as driver:
        driver.get('http://big5.oversea.cnki.net/kns55/')
        query(keyword, driver)
        result = loop_through_results(driver)
        save_articles(result, 'cnki_search_result.json')


if __name__ == '__main__':
    search('尹至')

qinghua.py

Search functionality is down at the moment. Planning the try out with Requests as soon as it is up and running.

from contextlib import contextmanager
from dataclasses import dataclass, asdict, replace
from datetime import datetime, date
from pathlib import Path
from typing import Iterable, Optional, ContextManager
import re
import os
import time
import json

# pip install proxy.py
import proxy
from proxy.http.exception import HttpRequestRejected
from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin
from selenium.common.exceptions import (
    NoSuchElementException,
    StaleElementReferenceException,
    TimeoutException,
    WebDriverException,
)
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.proxy import ProxyType
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


@dataclass
class PrimaryResult:
    captions: str
    date: date
    link: str

    @classmethod
    def from_row(cls, row: WebElement) -> 'PrimaryResult': 

        caption_elems = row.find_element_by_tag_name('a')
        date_elems = row.find_element_by_class_name('time')

        published_date = date.isoformat(datetime.strptime(date_elems.text, '%Y-%m-%d'))

        return cls(
            captions = caption_elems.text,
            date = published_date,
            link = caption_elems.get_attribute('href')
        )

    def __str__(self):
        return (
            f'\n標題     {self.captions}'
            f'\n發表時間  {self.date}'
            f'\n文章連結 {self.link}'
        )


class MainPage:
    def __init__(self, driver: WebDriver):
        self.driver = driver
 
    def submit_search(self, keyword: str) -> None:
        driver = self.driver
        wait = WebDriverWait(self.driver, 100)

        xpath = "//form/button/input"
        element_to_hover_over = driver.find_element_by_xpath(xpath)
        hover = ActionChains(driver).move_to_element(element_to_hover_over)
        hover.perform()

        search = wait.until(
            EC.presence_of_element_located((By.ID, 'showkeycode1015273'))
        )
        search.send_keys(keyword)
        search.submit()


    def get_element_and_stop_page(self, *locator) -> WebElement:
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(self.driver, 30, ignored_exceptions=ignored_exceptions)
        elm = wait.until(EC.presence_of_element_located(locator))
        self.driver.execute_script("window.stop();")
        return elm

    def next_page(self) -> None:
        try: 
            link = self.get_element_and_stop_page(By.LINK_TEXT, "下一页")
            link.click()
            print("Navigating to Next Page")

        except (TimeoutException, WebDriverException):
            print("No button with 「下一页」 found.")
            return 0


    # @contextmanager
    # def wait_for_new_window(self):
    #     driver = self.driver
    #     handles_before = driver.window_handles
    #     yield
    #     WebDriverWait(driver, 10).until(
    #         lambda driver: len(handles_before) != len(driver.window_handles))

    def switch_tabs(self):
        driver = self.driver
        print("Current Window:")
        print(driver.title)
        print()

        p = driver.current_window_handle
        
        chwd = driver.window_handles
        time.sleep(3)
        driver.switch_to.window(chwd[1])

        print("New Window:")
        print(driver.title)
        print()


class SearchResults:
    def __init__(self, driver: WebDriver):
        self.driver = driver

    def get_primary_search_result(self):
        
        filePath = os.path.join(os.getcwd(), "qinghua_primary_search_result.json")

        if os.path.exists(filePath):
            os.remove(filePath)    

        rows = self.driver.find_elements_by_xpath('//ul[@class="search_list"]/li')

        for row in rows:
            rslt = PrimaryResult.from_row(row)
            with open('qinghua_primary_search_result.json', 'a') as file:
                json.dump(asdict(rslt), file, ensure_ascii=False, indent=4)
            yield rslt


# class ContentFilterPlugin(HttpProxyBasePlugin):
#     HOST_WHITELIST = {
#         b'ocsp.digicert.com',
#         b'ocsp.sca1b.amazontrust.com',
#         b'big5.oversea.cnki.net',
#         b'gwz.fudan.edu.cn',
#         b'bsm.org.cn/index.php'
#         b'ctwx.tsinghua.edu.cn',
#     }

#     def handle_client_request(self, request: HttpParser) -> Optional(HttpParser):
#         host = request.host or request.header(b'Host')
#         if host not in self.HOST_WHITELIST:
#             raise HttpRequestRejected(403)

#         if any(
#             suffix in request.path
#             for suffix in (
#                 b'png', b'ico', b'jpg', b'gif', b'css',
#             )
#         ):
#             raise HttpRequestRejected(403)

#         return request

#     def before_upstream_connection(self, request):
#         return super().before_upstream_connection(request)
#     def handle_upstream_chunk(self, chunk):
#         return super().handle_upstream_chunk(chunk)
#     def on_upstream_connection_close(self):
#         pass


# @contextmanager
# def run_driver() -> ContextManager(WebDriver):
#     prox_type = ProxyType.MANUAL('ff_value')
#     prox_host = '127.0.0.1'
#     prox_port = 8889

#     profile = FirefoxProfile()
#     profile.set_preference('network.proxy.type', prox_type)
#     profile.set_preference('network.proxy.http', prox_host)
#     profile.set_preference('network.proxy.ssl', prox_host)
#     profile.set_preference('network.proxy.http_port', prox_port)
#     profile.set_preference('network.proxy.ssl_port', prox_port)
#     profile.update_preferences()

#     plugin = f'{Path(__file__).stem}.{ContentFilterPlugin.__name__}'

#     with proxy.start((
#         '--hostname', prox_host,
#         '--port', str(prox_port),
#         '--plugins', plugin,
#     )), Firefox(profile) as driver:
#         yield driver


def search(keyword) -> None:
    with Firefox() as driver:
        driver.get('http://www.ctwx.tsinghua.edu.cn/index.htm')

        page = MainPage(driver)
        # page.select_dropdown_item()
        page.submit_search(keyword)

        time.sleep(5)
        # page.switch_tabs()

        while True:
            primary_result_page = SearchResults(driver)
            primary_results = primary_result_page.get_primary_search_result()
            for result in primary_results:
                print(result)
                print()
            if page.next_page() == 0:
                break
            else:
                pass


if __name__ == '__main__':
    search('尹至')

fudan.py

# fudan.py

from dataclasses import dataclass
from itertools import count
from pathlib import Path
from typing import Dict, Iterable, Tuple, List, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from requests import Session
from datetime import date, datetime

import json
import re

BASE_URL = 'http://www.gwz.fudan.edu.cn'


@dataclass
class Link:
    caption: str
    url: str
    clicks: int
    replies: int
    added: date

    @classmethod
    def from_row(cls, props: Dict[str, str], path: str) -> 'Link':
        clicks, replies = props['点击/回复'].split('/')
        # Skip number=int(props['编号']) - this only has meaning within one page

        return cls(
            caption=props['资源标题'],
            url=urljoin(BASE_URL, path),
            clicks=int(clicks),
            replies=int(replies),
            added=datetime.strptime(props['添加时间'], '%Y/%m/%d').date(),
        )
        
    def __str__(self):
        return f'{self.added} {self.url} {self.caption}'

    def author_title(self) -> Tuple[Optional[str], str]:
        sep = '：'  # full-width colon, U+FF1A

        if sep not in self.caption:
            return None, self.caption

        author, title = self.caption.split(sep, 1)
        author, title = author.strip(), title.strip()

        net_digest = '網摘'
        if author == net_digest:
            return None, title

        return author, title


@dataclass
class Article:
    author: Optional[str]
    title: str
    date: date
    download: Optional[str]
    url: str

    @classmethod
    def from_link(cls, link: Link, download: str) -> 'Article':

        author, title = link.author_title()

        download = download.replace("\r", "").replace("\n", "").strip()
        if download == '#_edn1':
            download = None
        elif download[0] != '/':
            download = '/' + download

        return cls(
            author=author,
            title=title,
            date=link.added,
            download=download,
            url=link.url,
        )

    def __str__(self) -> str:
        return (
            f"\n作者   {self.author}"
            f"\n標題   {self.title}"
            f"\n發佈日期 {self.date}"
            f"\n下載連結 {self.download}"
            f"\n訪問網頁 {self.url}"
        )

    def as_dict(self) -> Dict[str, str]:
        return {
            'author': self.author,
            'title': self.title,
            'date': self.date.isoformat(),
            'download': self.download,
            'url': self.url,
        }


def compile_search_results(session: Session, links: Iterable[Link], category_filter: str) -> Iterable[Article]:

    for link in links:
        with session.get(link.url) as resp:
            resp.raise_for_status()
            doc = BeautifulSoup(resp.text, 'html.parser')

        category = doc.select_one('#_top td a[href="#"]').text
        if category != category_filter:
            continue

        content = doc.select_one('span.ny_font_content')
        dl_tag = content.find(
            'a', {
                'href': re.compile("/?(lunwen/|articles/up/).+")
            }
        )

        yield Article.from_link(link, download=dl_tag['href'])


def get_page(session: Session, query: str, page: int) -> Tuple[List[Link], int]:
    with session.get(
        urljoin(BASE_URL, '/Web/Search'),
        params={
            's': query,
            'page': page,
        },
    ) as resp:
        resp.raise_for_status()
        doc = BeautifulSoup(resp.text, 'html.parser')

    table = doc.select_one('#tab table')
    heads = [h.text for h in table.select('tr.cap td')]
    links = []

    for row in table.find_all('tr', class_=''):
        cells = [td.text for td in row.find_all('td')]
        links.append(Link.from_row(
            props=dict(zip(heads, cells)),
            path=row.find('a')['href'],
        ))

    page_td = doc.select_one('#tab table:nth-child(2) td') # 共 87 条记录, 页 1/3
    n_pages = int(page_td.text.rsplit('/', 1)[1])

    return links, n_pages


def get_all_links(session: Session, query: str) -> Iterable[Link]:
    for page in count(1):
        links, n_pages = get_page(session, query, page)
        print(f'{page}/{n_pages}')
        yield from links

        if page >= n_pages:
            break


def save_articles(articles: Iterable[Article], file_prefix: str) -> None:
    file_path = Path(file_prefix).with_suffix('.json')

    with file_path.open('w') as file:
        file.write('[\n')
        first = True

        for article in articles:
            if first:
                first = False
            else:
                file.write(',\n')
            json.dump(article.as_dict(), file, ensure_ascii=False, indent=4)

        file.write('\n]\n')


def search(keyword):
    with Session() as session:
        links = get_all_links(session, query=keyword)
        academic_library = '学者文库'
        articles = compile_search_results(session, links, category_filter=academic_library)
        save_articles(articles, 'fudan_search_result')


if __name__ == '__main__':
    search('尹至')

wuhan.py

from dataclasses import dataclass, asdict
from itertools import count
from typing import Dict, Iterable, Tuple, List

from bs4 import BeautifulSoup
from requests import post
from datetime import date, datetime

import json
import os
import re

@dataclass
class Result:
    author: str
    title: str
    date: date
    url: str
    publication: str = "武漢大學簡帛網"

    @classmethod
    def from_metadata(cls, metadata: Dict) -> 'Result': 
        author, title = metadata['caption'].split(':')
        published_date = date.isoformat(datetime.strptime(metadata['date'], '%y/%m/%d'))
        url = 'http://www.bsm.org.cn/' + metadata['url']

        return cls(
            author = author,
            title = title,
            date = published_date,
            url = url
        )


    def __str__(self):
        return (
            f'作者    {self.author}'
            f'\n標題     {self.title}'
            f'\n發表時間  {self.date}'
            f'\n文章連結 {self.url}'
            f'\n發表平台  {self.publication}'
        )


def submit_query(keyword: str):
    query = {"searchword": keyword}
    with post('http://www.bsm.org.cn/pages.php?pagename=search', query) as resp:
        resp.raise_for_status()
        doc = BeautifulSoup(resp.text, 'html.parser')
        content = doc.find('div', class_='record_list_main')
        rows = content.select('ul')


    for row in rows:
        if len(row.findAll('li')) != 2:
            print()
            print(row.text)
            print()
        else:
            captions_tag, date_tag = row.findAll('li')
            caption_anchors = captions_tag.findAll('a')
            category, caption = (item.text for item in caption_anchors)
            url = caption_anchors[1]['href']
            date = re.sub("[()]", "", date_tag.text)

            yield {
                "category": category, 
                "caption": caption, 
                "date": date,
                "url": url}


def remove_json_if_exists(filename):
    json_file = filename + ".json"
    filePath = os.path.join(os.getcwd(), json_file)

    if os.path.exists(filePath):
        os.remove(filePath)


def search(query: str):
    remove_json_if_exists('wuhan_search_result')
    rslt = submit_query(query)

    for metadata in rslt:
        article = Result.from_metadata(metadata)
        print(article)
        print()

        with open('wuhan_search_result.json', 'a') as file:
            json.dump(asdict(article), file, ensure_ascii=False, indent=4)



if __name__ == '__main__':
    search('尹至')

tcp – What prevents this specific type of attack from being viable?

Imagine a user has an ip of 1.2.3.4

The server the user intends to connect to has an ip of 2.3.4.5

An attacker has a machine with a promiscuous network card on the user’s local network.

The attacker also has a server on a separate network with ip 3.4.5.6

The user sends a request to 2.3.4.5, which the attacker had DDOS’d. As such, 2.3.4.5 will not respond.

The attacker’s machine on the user’s local network sniffs the request and sends it to 3.4.5.6; 3.4.5.6 is set up to take this information to form a request to 1.2.3.4, where it spoofs the IP of 2.3.4.5 and has all the required TCP sequencing information to form a request that looks real.

When the user sends another request, it is once again sniffed by the attacker’s local machine and sent to 3.4.5.6 which can then send another false request. The cycle continues.

Since 3.4.5.6 appears to be 2.3.4.5 and since 3.4.5.6 is NOT located on the user’s local network, the user’s firewall is unable to detect any foul play.

I’m assuming that this type of attack is not actually possible and that somewhere there is a misconception on my part about how networking works. Why would an attack like this not be possible?

❓ASK – Does a Facebook page for affiliate Marketing viable? | Proxies-free


architecture – Is an ECS viable in garbage collected languages?

The Garbage Collector (GC) is not really an obstacle to implement an Entity-Component-System (ECS) architecture.

All you need is a root object for your ECS. It would hold references to the containers you use for your components (and references to your systems, if that makes sense in your implementation). Those containers will likely hold arrays of components, which is how you keep the components contiguous in memory; that matters for cache optimization (and yes, it remains important for performance regardless of GC).

As long as you have a reference to the root, the GC will not collect any of it.

In fact, there isn’t a lot of manual memory management in an ECS. Usually, you will not store your components by de/allocating them on the heap individually (because then you lose control over the memory layout); instead you will store components as elements of arrays. Then the memory management is about allocating the arrays, for which you would use an Object Pool pattern.
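For example, a minimal sketch in Python (a garbage-collected language) of that root-object and component-array layout could look like this; the names and the movement system are purely illustrative:

from array import array


class Positions:
    """Component container: data stored contiguously in typed arrays."""
    def __init__(self, capacity: int):
        self.x = array('f', [0.0] * capacity)  # pre-allocated pool
        self.y = array('f', [0.0] * capacity)


class Velocities:
    def __init__(self, capacity: int):
        self.dx = array('f', [0.0] * capacity)
        self.dy = array('f', [0.0] * capacity)


class World:
    """Root object: as long as it is referenced, the GC keeps everything alive."""
    def __init__(self, capacity: int):
        self.positions = Positions(capacity)
        self.velocities = Velocities(capacity)


def movement_system(world: World, dt: float) -> None:
    """A system: iterates over the component arrays; entities are just indices."""
    pos, vel = world.positions, world.velocities
    for entity in range(len(pos.x)):
        pos.x[entity] += vel.dx[entity] * dt
        pos.y[entity] += vel.dy[entity] * dt


world = World(capacity=1000)
movement_system(world, dt=1 / 60)

The World reference is the only thing you need to keep alive; everything it owns survives, and the typed arrays keep the component data packed together.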


The obstacles an object oriented language could pose to implementing an ECS are different ones. For instance, if the language does not allow you to create custom value types (e.g. C# struct), your arrays would only hold references to the actual components, and those components would be stored elsewhere. Thus, you lose the cache optimization you want.

I also want to mention that depending on the language, you may need to pay attention on whether or not arrays guarantee that values are contiguous in memory. Some runtimes would offer sparse arrays or other structures by default.


An object oriented approach would suggest that an entity is a container of components. And semantically, that is the case. However, implementing it that way means that components of the same kind are not contiguous in memory. Instead, the entity will be an integer, which you would use to find its components. If you can’t have chunks of contiguous memory anyway, then go ahead and implement the ECS on more object oriented grounds.

By the way, if your language is one of those that pose obstacles, chances are it is a dynamic language where you can add components to an object at runtime directly, which means you don’t have to implement that part. Also, you would have a more idiomatic syntax.


Which reminds me, cache optimization is not the only benefit of an ECS. Thus, there is still value in an ECS even if you don’t have chunks of contiguous memory.

What is the value of an ECS without cache optimization? First, composition. For starters, you can add and remove components at runtime, plus you can iterate over all the entities that have components of a given kind. This allows you to have components work both as state and as messages. And second, going against the object oriented approach, having the systems separate from the entities helps keep your code easier to manage. Without that, in order to not repeat code, behavior tends to accumulate in a base class for all entities, a base class which then has to deal with all cases.

By the way, I don’t mean to suggest that ECS is the only solution, nor that everything should be in an ECS. There is also value in having some aspects of a game exist outside of the ECS. Similarly, remember that making games without an ECS is perfectly possible.


I recommend watching the video RustConf 2018 – Closing Keynote – Using Rust For Game Development by Catherine West, and the response video Rant: Entity systems and the Rust borrow checker … or something.. Although these videos are about Rust, they do a good job of explaining the problems of orthodox object oriented approaches and the memory management challenges in implementing an ECS. You will hear about the Rust borrow checker… If you are not familiar with Rust, suffice it to say that the borrow checker is part of the Rust compiler, and it is there to make sure that your memory management is correct and safe. Code has ownership of memory when it is responsible for de-allocating it, and borrowing means accessing some memory without taking ownership. Could the owner de-allocate the memory while it is borrowed? The borrow checker, well, checks that.

compilers – Understanding the concept of “Viable Prefixes”

I was going through the text: Compilers: Principles, Techniques and Tools by Ullman et. al where I came across the concept of viable prefix and I faced some difficultly in grasping the concept. So I present my doubts here in a systematic manner, with a hope for clarification.


Viable Prefixes: The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce parser are called viable prefixes.

This is the actual definition. No problem with it, since it is a definition after all.


An equivalent definition of a viable prefix is that it is a prefix of a right-sentential form that does not continue past the right end of the rightmost handle of that sentential form.

I do not quite understand why “the rightmost handle”? And why not the leftmost handle?

Is it so because of grammars like:

[Grammar (image)]

can have the two possible rightmost derivations for $id_1+id_2*id_3$:

[Derivation (1) (image)]

[Derivation (2) (image)]

where for the right sentential form $E+E*id_3$ there are two handles:

left: $\require{color}\colorbox{pink}{$E+E$}*id_3$ and

right: $E+E*\require{color}\colorbox{pink}{$id_3$}$


By this definition, it is always possible to add terminal symbols to the end of a viable prefix to obtain a right-sentential form.

Is it something like this, Suppose

$$(1)\quad S \xrightarrow[\text{rm}]{\text{*}} ABCDE \xrightarrow[\text{rm}]{\text{*}} ABpqr$$

$$(2)\quad S \xrightarrow[\text{rm}]{\text{*}} ABCDE \xrightarrow[\text{rm}]{\text{*}} ABxyz$$

So adding terminals $pqr$ or $xyz$ to the viable prefix $AB$ we can get a right sentential form…


Therefore, there is apparently no error as long as the portion of the input seen to a given point can be reduced to a viable prefix.

This I feel from the actual definition is true, because a viable prefix is a prefix appearing on the stack…


Is my understanding correct? If not, please correct me.

architecture – Is implementing the logic of a singleplayer game in a dedicated server a viable option?

I want to start writing a singleplayer game and stumbled upon mature game server libraries in my preferred language. Since I am not a designer, I don’t know what’s possible in the future concerning visuals, and having a dedicated server for the game logic seems like a good idea at first. Here are my thoughts on the topic:

Pros

  • there is a clear separation between game logic and graphics, which allows me to switch from 2D to 3D or even pick another game engine without re-writing the important bits
  • I can add a multiplayer mode later on with fairly low effort if my plans change

Cons

  • sending messages is more complex than just calling methods, but can be hidden behind a facade (a rough sketch follows at the end of this question)
  • the communication can hurt performance, even if the server runs on the same machine

I am unsure about this approach, because I wasn’t able to find resources concerning dedicated servers for singleplayer games exclusively. Is it a viable choice or are there any other cons which outweigh the pros?
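To make the facade point concrete, here is a rough Python sketch of what I mean; the class and message names are made up and purely illustrative:

from abc import ABC, abstractmethod


class GameBackend(ABC):
    """Facade: gameplay/UI code calls methods and never sees how they are delivered."""

    @abstractmethod
    def move_player(self, dx: int, dy: int) -> None: ...


class LocalBackend(GameBackend):
    """Singleplayer: plain in-process method calls into the game logic."""

    def __init__(self, game_state: dict) -> None:
        self.state = game_state

    def move_player(self, dx: int, dy: int) -> None:
        self.state['x'] += dx
        self.state['y'] += dy


class RemoteBackend(GameBackend):
    """Multiplayer later on: same interface, but it serialises a message instead."""

    def __init__(self, send) -> None:
        self.send = send  # e.g. a socket or message-queue callable

    def move_player(self, dx: int, dy: int) -> None:
        self.send({'type': 'move', 'dx': dx, 'dy': dy})

The rest of the game would only ever talk to GameBackend, so swapping LocalBackend for RemoteBackend later should not ripple through the codebase.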

c# – Web Components/Redux with .NET Core MVC viable?

I’m currently building several new themes using NopCommerce, a .NET e-commerce platform built on the MVC architecture.

NopCommerce exposes a lot of services to you for communicating with the db, and is itself built with the Entity Framework code first approach.

Overall, I think NopCommerce is pretty well-rounded, the platform is pretty extensive and very easy to get up and running, decent developer experience and whatnot. The only thing I am missing from the platform is a built-in way of creating asynchronous views.

The NopCommerce team currently uses a lot of jQuery for updating some certain DOM-elements, such as a cart drawer, cart quantity indicator, toasters and some other things. I would prefer to exclude jQuery from the project completely, as I don’t feel new developers at work should have to learn jQuery in 2021.

We have discussed using some sort of front-end framework for asynchronous UI components for some time now, but are hesitant to include React or Angular due to 1) the bundle size and 2) the need for our developers to also learn these frameworks. Preact was an option, but pain point 2 still applied. We also don’t want to do everything “vanilla”, but if we did, one concern was how long it would take to craft such an implementation compared to the benefits it would give.

We came across Web Components and noticed that the support is pretty good now, and there is an official polyfill for the V1 spec. It also results in pretty neat looking DOM-structure, and is very lightweight (<0.5kb), and also has the webcomponent-redux npm package available for dead-simple redux-bindings!

Initially, it looks promising, but again, questions are raised: is the time needed to make this work worth it? It probably wouldn’t take very long to make one async component in this manner, but since you can make web components in so many ways (https://webcomponents.dev/blog/all-the-ways-to-make-a-web-component/), I’m very unsure what a good approach would be for our specific case.

As an example: Basically, we are going to take the existing (synchronous) “GetProductsByFilters” endpoint, and rewrite it to return JSON instead, which our Web Component would then request when filters are changed by a customer, then rendering the new product grid asynchronously instead of rerendering the page.

If anyone has any experience with .NET Core and Web Components, and with redux in particular (or a self-built implementation?), I would appreciate any input you have!

Bonus: Also interested in thoughts about a monorepo for said Web Components!

All the best
