From: ricardo_kagawa@disroot.org
To: ~johnnyrichard/olang-devel@lists.sr.ht
Subject: Re: [RFC SPEC] Primitive data types and arrays
Date: Wed, 24 Apr 2024 15:45:16 -0300
Message-ID: <20240424184532.33699-1-ricardo_kagawa@disroot.org>

> > > A olang array is just like a C array, no need to translation. Although it
> >
> > Are you sure about this? I mean, as a contiguous, properly sized chunk
> > of memory with indexed access, it looks fine. But in C, an array
> > variable is a pointer to that chunk of memory, and therefore pointer
> > arithmetics would be required to match C arrays.
> >
> > I'm not sure I'd like to deal with pointers. But it's not like I can't,
> > it's just that I know it opens a nasty can of worms that I'm not sure
> > you'd want to deal with as a language designer.
>
> I really would like to know what you see as nasty. I mean, don't you
> want to deal with pointer in general? Or you want to segregate the
> concept of array and pointers?

Both, in fact.

As soon as you include pointers in the language, all memory positions in
the process become fair game, even for mutation. This also means that
your compiler will not be able to enforce that `const` variables are in
fact immutable. It can at most check that known bindings are not
assigned to, but they could still be mutated through pointers. Even
literal values (such as those coming from preprocessor macros) could be
mutated at runtime.

Any visibility modifier would become advisory rather than compulsory
from this point. You just have to locate the correct memory address and
write new values or instructions to it.

But an array does not need to be handled through pointers as in C,
unless you specifically say that arrays match the C implementation.
Languages without pointers still handle arrays just fine, they just
don't match C arrays.

> > Obviously, `boolean` can be either `true` or `false`, but what should
> > that mean? If `boolean` is mapped to `u8`, then zero and non-zero?
>
> That's what exactly what I had in mind.
>
> Which problems you see with this approach?

My issue lies only in the lack of a proper definition of what a
`boolean` is, not (yet) in how it is implemented (as that was not yet
mentioned).

Processors don't handle single bits very well, so an implementation of
the `boolean` type will likely use at least `u8` (and possibly a full
word). But that would have to map at least 256 values to a type that
only has 2, and obviously there are many ways to do just that.

AFAIK, processors should have a "branch if not zero" instruction to
handle conditionals based on zero/non-zero booleans, which would make
this approach one of the better options (and probably why C uses this
definition, ignoring the fact that C originally had no `boolean` type).

On the other hand, I'm not sure that you should allow implicit
coercions between `boolean` and `u8` (or between almost any types), even
if internally they would be equivalent.

> > But the real question is what would `char` be?
> > If the language should support Unicode properly, then `char` would
> > represent a _code unit_ rather than a "character", which could be
> > considered a misnomer. Since Unicode uses variable-length characters,
> > a Unicode character might be difficult to represent as just `char`.
> >
> > If no Unicode support is planned, then `char` as `u8` is good enough to
> > represent characters in 7-bit ASCII encoding.
>
> Could you please enlighten me the implications of starting with `char`
> as `u8` alias (7-bit ASCII)? What are the problems we could have if we
> don't support Unicode properly?

You don't have to support Unicode at all at any time, but transitioning
from ASCII to Unicode may not be exactly trivial, since Unicode
characters have varying width, unlike ASCII, which is fixed to 7 bits.

And I'm not saying "varying width" as in "there exist the UTF-8/16/32
variants". Even UTF-8 can represent all Unicode characters, despite
assuming characters to be basically ASCII.

That is possible because the first bits of each character are used to
represent its width. If the first bit is zero, then the remaining 7 bits
are in the ASCII range. Otherwise, the number of consecutive 1-bits
before the first 0-bit of the first byte is the number of bytes in the
character, so a character takes anywhere from 1 to 4 bytes in current
UTF-8 (the original design allowed up to 6). The remaining bits of the
first byte, plus the low 6 bits of each following byte (which always
starts with the bits `10`), make up the character's code point, which
may be far outside the ASCII range.

UTF-16 uses a different mechanism (surrogate pairs), but the overall
idea is the same, and it can also represent every Unicode code point.

And besides having varying width, Unicode characters also have the
issues of normalization and validity. Validation seems somewhat simple,
but possibly expensive, since AFAIK the main check is that multi-byte
characters are not truncated (strict validators also reject overlong
encodings and a few forbidden ranges).

Unicode strings are not required to be normalized, but you would need to
implement normalization for users to be able to compare them.
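
Going back to the lead-byte rule and the truncation check for a moment,
here is a rough sketch of both in TypeScript. It is only an
illustration: the names `utf8SequenceLength` and `isNotTruncated` are
made up, it assumes the current 4-byte limit of UTF-8, and it does not
try to be a full validator.

```typescript
// Hypothetical helper: number of bytes in a UTF-8 sequence, judged from
// its first byte alone. Assumes the modern 4-byte limit.
// Returns 0 for a byte that cannot start a sequence.
function utf8SequenceLength(firstByte: number): number {
  if ((firstByte & 0x80) === 0x00) return 1; // 0xxxxxxx: plain ASCII
  if ((firstByte & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((firstByte & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((firstByte & 0xf8) === 0xf0) return 4; // 11110xxx
  return 0; // continuation byte (10xxxxxx) or invalid lead byte
}

// Hypothetical helper: truncation check only. Every lead byte must be
// followed by the announced number of continuation bytes (10xxxxxx).
// Overlong encodings and forbidden ranges are NOT checked here.
function isNotTruncated(bytes: number[]): boolean {
  let i = 0;
  while (i < bytes.length) {
    const len = utf8SequenceLength(bytes[i]);
    if (len === 0 || i + len > bytes.length) return false;
    for (let k = 1; k < len; k++) {
      if ((bytes[i + k] & 0xc0) !== 0x80) return false;
    }
    i += len;
  }
  return true;
}
```

For instance, the bytes `0x61 0xC3 0xA3` ("a" followed by "ã") pass
this check, while `0x61 0xC3` does not, because the second character is
cut off after its first byte.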

Some glyphs may have multiple machine representations, and normalization
converts any representation to one in particular, so that the more
efficient byte-by-byte comparisons yield correct results.

For example, there is a "latin small letter a with tilde" character (ã),
which is equivalent to the character "latin small letter a" followed by
a "combining tilde" character. They represent the same glyph, but have
different binary representations, even of different sizes, and would not
be considered equal in a naive string comparison.

All that is to say that depending on what assumptions you make while
implementing `char` as ASCII, it might be relatively easy or very
demanding to transition from ASCII to Unicode. Despite Unicode being a
superset of ASCII, the way they work is quite different, especially at a
low level.

> > > Let's go with _void_. We are on very early development stage,
> > > everything can change anytime. And _void_ is kind of very well known
> > > keyword.
> >
> > Note that in most languages where there is a `void` type, the `void`
> > type is not actually valid in variable declarations. They are valid only
> > in funtion return types. In C, they are also valid as pointer types
> > (that is, `void* x;` is valid), but IIRC, not as variable types
> > (`void x;` is not valid).
>
> I'm okay of not using void pointers as long as we have a replacement for
> it. I still want to have support to define a raw pointer (untyped).

Or you could also add `void` to a future . Just don't add it to . But
I'd rather not have pointers.

> > In ECMAScript specs, there is a `null` type that uses the `null`
> > value as its unit value.
>
> I think this approach lead us to design a complex type system. I
> understand the value of this, but the cost is high when you want to
> design a simple language.
>
> Regarding `null` I would like to have `null` as an alias to 0 (zero).
> And we could also have semantic analyses on it. In this case `null`
> wont be a proper type.

No, I think if you define `null` as an alias to 0 as in C, you won't
have the ability to perform semantic analysis on it. In C, the
preprocessor will replace the identifier `NULL` with the literal `0`
_before_ semantic analysis.

AFAIK, `NULL` only works in C as an invalid memory position to
intentionally cause segmentation faults when access is attempted. And
that is not even handled by the C compiler, it is an error from the
operating system. C will just let you access the zero address, and if
the operating system says it's okay, then it's okay.

In Java there is no preprocessor (unless you count the annotation
processor as a preprocessor), so `null` is a proper value, despite not
having a proper type. As such, it is available to the Java compiler for
semantic analysis. All classes in Java are assumed "nullable", which
works relatively well since all variables of non-primitive types are
reference types, where a "null pointer" makes sense, even with (or
especially with) C-like semantics. In this case, there is a JVM that can
deny access to the `null` address and throw a `NullPointerException`.

As I usually say, the complexity doesn't really disappear, it is just
moved somewhere else. Even if you don't include this check in the type
system, it will be included somewhere else (perhaps manually, in code),
as people will still need to perform this check all the time, especially
if you decide to incorporate pointers.

For example, in TS (with `strictNullChecks` enabled), types are not
nullable by default, so as long as the type definitions are sound,
testing for `x !== null` is usually not needed. But in JS, where the
interpreter does not check types statically, you always have to check
for `x !== null` at runtime before using `x`. JS is simpler, but this
complexity does not really go away.

On the other hand, if you do incorporate pointers, the non-nullability
of types becomes advisory rather than compulsory, so it is perhaps not
that useful in this case.
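
To make the TS side of that comparison a bit more concrete, here is a
minimal sketch. It assumes the TypeScript compiler runs with
`strictNullChecks` (or `strict`) enabled, and `shout` and `shoutMaybe`
are made-up names used only for illustration.

```typescript
// Hypothetical helpers; assumes `strictNullChecks` is enabled.

// `s` is declared as `string`, which excludes `null`, so the compiler
// rejects callers that pass `null` and no runtime check is needed here.
function shout(s: string): string {
  return s.toUpperCase();
}

// Nullability has to be opted into in the type. The compiler then
// refuses `s.toUpperCase()` until the `null` case has been ruled out.
function shoutMaybe(s: string | null): string {
  if (s !== null) {
    return s.toUpperCase();
  }
  return "";
}
```

The check has not gone away, it has only moved: in `shout` it is done by
the type checker at every call site, and in `shoutMaybe` it is written
by hand, which is what plain JS code would have to do everywhere.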