From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <lists@sr.ht>
Received: from mp2.migadu.com ([2001:41d0:403:4876::])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	by ms5.migadu.com with LMTPS
	id +PZ3JV1TKWZBGAEAe85BDQ:P1
	(envelope-from <lists@sr.ht>)
	for <patches@johnnyrichard.com>; Wed, 24 Apr 2024 20:45:49 +0200
Received: from aspmx1.migadu.com ([2001:41d0:403:4876::])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	by mp2.migadu.com with LMTPS
	id +PZ3JV1TKWZBGAEAe85BDQ
	(envelope-from <lists@sr.ht>)
	for <patches@johnnyrichard.com>; Wed, 24 Apr 2024 20:45:49 +0200
X-Envelope-To: patches@johnnyrichard.com
Authentication-Results: aspmx1.migadu.com;
	dkim=pass header.d=lists.sr.ht header.s=20240113 header.b=ICU7Wn+B;
	dkim=pass header.d=disroot.org header.s=mail header.b=OjCD8hqL;
	dmarc=pass (policy=reject) header.from=disroot.org;
	spf=pass (aspmx1.migadu.com: domain of lists@sr.ht designates 46.23.81.152 as permitted sender) smtp.mailfrom=lists@sr.ht
ARC-Seal: i=1; s=key1; d=johnnyrichard.com; t=1713984349; a=rsa-sha256;
	cv=none;
	b=kCyStf30qe6Cx7imw/ms/f0aUg1ejiYJavS0TFtWo7CBGtO83M95lxxFnMEl3LCUGX7aRM
	s2p3sAQ6XZVtWkE6OXDAOKDxnrWmM1QT04rPwjP4zrvWFSOuU/kpd55ARXqgPX5PAnSoSD
	mikOkBnktmHGh1n0ZS9n4LEMLDqGPJI4vx5S3JGFR3zssKPTR8pB7flalDd6gWcspczvKa
	tIODKVjXMn+aGbGE61/kXN9YQl6dH+hjcvphrvXBuoQ0lmuIyFMr56DH1z6d8QDDQQ0RPZ
	U+pWDhRiT+HjGME0IMo51wgPpSW4bYeSdMRT64PlKAu8QRSvxvmmzH6UUfkeJg==
ARC-Authentication-Results: i=1;
	aspmx1.migadu.com;
	dkim=pass header.d=lists.sr.ht header.s=20240113 header.b=ICU7Wn+B;
	dkim=pass header.d=disroot.org header.s=mail header.b=OjCD8hqL;
	dmarc=pass (policy=reject) header.from=disroot.org;
	spf=pass (aspmx1.migadu.com: domain of lists@sr.ht designates 46.23.81.152 as permitted sender) smtp.mailfrom=lists@sr.ht
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed;
	d=johnnyrichard.com; s=key1; t=1713984349;
	h=from:from:sender:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:list-id:
	 list-unsubscribe:list-subscribe:list-post:dkim-signature;
	bh=Qgh9ygpqlUu5AfWWsQMG82rZYyCxCWGIPg6ATKvZ1iM=;
	b=1cnImAZ9y08/giL6WXJiqo5+BbVoiiMItw1hSmblwnUVn2hZhY7i543BgLe4eIIkBcY1Xy
	fwwIOEW9pR6pB6zC209JAlRdRCP1HALWt5YKTEFFOdT7MiAzhB9DvwJ4XIgiRxcdeTgzqn
	Lh4L11MJ+BiCemtVI5r0/aSi6eWmHXQcqpAebrQVnRB3Y89wSrrrh9qyN7JwDxnxu6oB8u
	uY6MuORrFyMB8qSYlaA6MBEx5x67Sxpsl4XY7cv5KPoMKbsgNmlQlu3RoyjTvHe8p+1hWa
	o6c+rfncL8aBgZ6il8STCE3Y7yX9VXVxIE1D6YwfTSd1SAgRAagaFBw6XLU6VA==
Received: from mail-a.sr.ht (mail-a.sr.ht [46.23.81.152])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
	(No client certificate requested)
	by aspmx1.migadu.com (Postfix) with ESMTPS id 4904E697B5
	for <patches@johnnyrichard.com>; Wed, 24 Apr 2024 20:45:46 +0200 (CEST)
DKIM-Signature: a=rsa-sha256; bh=fXkoyM4eJIQOhmt5U5g0H33WIwhP+yl9mWCAL8LR6HM=;
 c=simple/simple; d=lists.sr.ht;
 h=From:To:Subject:Date:In-Reply-To:References:List-Unsubscribe:List-Subscribe:List-Archive:List-Post:List-ID;
 q=dns/txt; s=20240113; t=1713984345; v=1;
 b=ICU7Wn+BfdalZA8fAO9SiZeavV8PreH+HySCzCUH/I6DFAIIYcETNl6pmWhhd4ZR0Tm+F2RQ
 QoUJ6zsEtttT+Iw2hOpdupmuYR7xdb3ogFEkUeTQlPzXQMp29s9b7gKbswuDyx1VAUIQoBWYoWg
 PTfHb29mHVOlP7X5XMsHokHi9fvasEMERJxp9VlebF1b+2APSu6/fHKkQDvaKjkj13BUwzyKD5B
 p4FJHMkaBHgkOrW7phrFJITvWYwrrRoibkRR4Yf5OWzQ045H7EiO2NB3mgUoi3tNBMxZQMQK9iB
 Tqa8+zFCRVkiSgBEy2KY7uZAPFu7J/mLaLJ9t4jxwQsJA==
Received: from lists.sr.ht (unknown [IPv6:2a03:6000:1813:1337::154])
	by mail-a.sr.ht (Postfix) with ESMTPSA id 8EE012033B
	for <patches@johnnyrichard.com>; Wed, 24 Apr 2024 18:45:45 +0000 (UTC)
Received: from layka.disroot.org (layka.disroot.org [178.21.23.139])
	by mail-a.sr.ht (Postfix) with ESMTPS id BCA2A20322
	for <~johnnyrichard/olang-devel@lists.sr.ht>; Wed, 24 Apr 2024 18:45:44 +0000 (UTC)
X-Virus-Scanned: SPAM Filter at disroot.org
From: ricardo_kagawa@disroot.org
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=disroot.org; s=mail;
	t=1713984338; bh=fXkoyM4eJIQOhmt5U5g0H33WIwhP+yl9mWCAL8LR6HM=;
	h=From:To:Subject:Date:In-Reply-To:References;
	b=OjCD8hqLVLnaaXaU6bF3Czw5nVn4nGvoewBI0btLquKXRwSSzLg8vPzusxyzcjTSt
	 uO5FYBASXW4qc6BplOXSy86pFb+nodPJJvwY8TLYloxCsP1neIAeJaznsU0/LLFyZR
	 o9eczEL9JTP6nDu8I6CYVn0JuycgHNNwTQzOw2iVLY8pbkmh11snlRMktkCmwE4QfC
	 JbZ8F/wc6GjxJNjiDoIY49gNmnIu8nfN91qHTH0LaRV5dlKW0fdYt2FeS8e0DCSstF
	 N/VD6sM50W8xUxjtfn0eU1FOe64CNwUJISuZf8vrQ+XUaBd4HBmvEDxeghlR1x9q+P
	 K8YkBuidGDV/Q==
To: ~johnnyrichard/olang-devel@lists.sr.ht
Subject: Re: [RFC SPEC] Primitive data types and arrays
Date: Wed, 24 Apr 2024 15:45:16 -0300
Message-ID: <20240424184532.33699-1-ricardo_kagawa@disroot.org>
In-Reply-To: <oa5mpbzl5ykffqbsrlarktkchznuiez2ndhxufe6x7czt46ltd@mpvd53nufvko>
References: <oa5mpbzl5ykffqbsrlarktkchznuiez2ndhxufe6x7czt46ltd@mpvd53nufvko>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
List-Unsubscribe: <mailto:~johnnyrichard/olang-devel+unsubscribe@lists.sr.ht?subject=unsubscribe>
List-Subscribe: <mailto:~johnnyrichard/olang-devel+subscribe@lists.sr.ht?subject=subscribe>
List-Archive: <https://lists.sr.ht/~johnnyrichard/olang-devel>
Archived-At: <https://lists.sr.ht/~johnnyrichard/olang-devel/%3C20240424184532.33699-1-ricardo_kagawa%40disroot.org%3E>
List-Post: <mailto:~johnnyrichard/olang-devel@lists.sr.ht>
List-ID: ~johnnyrichard/olang-devel <~johnnyrichard/olang-devel.lists.sr.ht>
Sender: ~johnnyrichard/olang-devel <~johnnyrichard/olang-devel@lists.sr.ht>
X-Migadu-Flow: FLOW_IN
X-Migadu-Country: NL
X-Spam-Score: -9.62
X-Migadu-Queue-Id: 4904E697B5
X-Migadu-Spam-Score: -9.62
X-Migadu-Scanner: mx10.migadu.com
X-TUID: IJkeNyxBr6dm

> > > A olang array is just like a C array, no need to translation. Although it
> >
> > Are you sure about this? I mean, as a contiguous, properly sized chunk
> > of memory with indexed access, it looks fine. But in C, an array
> > variable is a pointer to that chunk of memory, and therefore pointer
> > arithmetics would be required to match C arrays.
> >
> > I'm not sure I'd like to deal with pointers. But it's not like I can't,
> > it's just that I know it opens a nasty can of worms that I'm not sure
> > you'd want to deal with as a language designer.
> 
> I really would like to know what you see as nasty.  I mean, don't you
> want to deal with pointer in general?  Or you want to segregate the
> concept of array and pointers?

Both, in fact.

As soon as you include pointers in the language, all memory positions
in the process become fair game, even for mutation. Which also means
that your compiler will not be able to enforce that `const` variables
are in fact immutable. It can at most check that known bindings are not
assigned to, but they could still be mutated through pointers.
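
To make that concrete, here is a minimal C sketch. The function name is
made up for illustration. It uses a pointer-to-const "view" of an
ordinary `int`, because writing to an object that was itself defined
`const` is undefined behavior in C and so cannot be shown reliably.

```c
/* Hypothetical helper, for illustration only.
 * `view` promises not to write through itself, but it says nothing
 * about other pointers that refer to the same object. */
int observe_change(const int *view, int *alias)
{
    int before = *view;    /* read through the read-only view   */
    *alias = before + 1;   /* write through the other pointer   */
    return *view - before; /* 1 if both point at the same int   */
}
```

Calling `observe_change(&x, &x)` returns 1: the value seen through the
`const` view changed without any assignment through that view, and the
compiler has no general way to rule this out.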

Even literal values (such as those coming from preprocessor macros)
could be mutated at runtime. Any visibility modifier would become
advisory rather than compulsory from this point. You just have to locate
the correct memory address and write new values or instructions to it.

But an array does not need to be handled through pointers as in C,
unless you specifically say that arrays match the C implementation.
Languages without pointers still handle arrays just fine; their arrays
just don't match C arrays.

> > Obviously, `boolean` can be either `true` or `false`, but what should
> > that mean? If `boolean` is mapped to `u8`, then zero and non-zero?
> 
> That's what exactly what I had in mind.
> 
> Which problems you see with this approach?

My issue only lies in the lack of a proper definition of what a
`boolean` is, not (yet) in how they are implemented (as that was not
yet mentioned).

Processors don't handle single bits very well, so an implementation of
the `boolean` type will likely use at least `u8` (and possibly a full
word). But that would have to map at least 256 values to a type that
only has 2, and obviously there are many ways to do just that.

AFAIK, processors should have a "branch if not zero" instruction to
handle conditionals based on zero/non-zero booleans, which would make
this approach one of the better options (and probably why C uses this
definition, even though C had no dedicated boolean type before C99
added `_Bool`).

On the other hand, I'm not sure that you should allow implicit coercions
between `boolean` and `u8` (or between almost any types), even if
internally they would be equivalent.
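
A small C sketch of the zero/non-zero mapping, with made-up helper
names, shows why the implicit coercion is risky: two bytes can both mean
`true` and still compare unequal as integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: any non-zero byte counts as true. */
bool u8_to_bool(uint8_t raw)
{
    return raw != 0;
}

/* Compare two raw bytes as truth values rather than as integers. */
bool same_truth(uint8_t a, uint8_t b)
{
    return u8_to_bool(a) == u8_to_bool(b);
}
```

Here `same_truth(1, 2)` holds even though `1 == 2` does not, so code
that compares booleans stored as raw `u8` values directly would get the
wrong answer.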

> > But the real question is what would `char` be? If the language should
> > support Unicode properly, then `char` would represent a _code unit_
> > rather than a "character", which could be considered a misnomer. Since
> > Unicode uses variable-length characters, a Unicode character might be
> > difficult to represent as just `char`.
> >
> > If no Unicode support is planned, then `char` as `u8` is good enough to
> > represent characters in 7-bit ASCII encoding.
> 
> Could you please enlighten me the implications of starting with `char`
> as `u8` alias (7-bit ASCII)?  What are the problems we could have if we
> don't support Unicode properly?

You don't have to support Unicode at all at any time, but transitioning
from ASCII to Unicode may not be exactly trivial, since Unicode
characters have varying width, unlike ASCII, which is fixed to 7 bits.

And I'm not saying "varying width" as in "there exist the UTF-8/16/32
variants". Even UTF-8 can represent all Unicode characters, despite
assuming characters to be basically ASCII.

That is possible because the first bits of each character's first byte
are used to represent its width. If the first bit is zero, then the
remaining 7 bits are in the ASCII range. Otherwise, the number of
consecutive 1-bits before the first 0-bit is the number of bytes in the
character, and each following byte starts with the bits `10`. The
scheme itself could be extended to longer sequences, but the current
standard caps it at 4 bytes. The remaining bits after reading that many
bytes represent the character's code point, which can be anywhere in
the Unicode range (up to U+10FFFF), well beyond what fits in 16 bits.
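
As a rough sketch in C of the width rule for the first byte (the
function name is hypothetical, and this only looks at the bit pattern,
so it does not reject lead bytes that the standard forbids for other
reasons):

```c
/* Hypothetical helper: length in bytes of the UTF-8 sequence that
 * starts with `lead`, or 0 if `lead` cannot start a sequence. */
int utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: ASCII      */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx             */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx             */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx             */
    return 0; /* 10xxxxxx (continuation byte) or 11111xxx        */
}
```

So a `char` that is an alias for `u8` can hold one of these bytes, but
not necessarily a whole character.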

UTF-16 uses a different mechanism (surrogate pairs), but the overall
idea is the same, and it can also represent code points that do not fit
in 16 bits.

And besides having varying width, Unicode characters also have the
issues of normalization and validity. Validation seems somewhat simple,
but possibly expensive, since AFAIK you mostly need to check that
multi-byte characters are not truncated (a strict validator would also
reject overlong encodings and surrogate code points).
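
A minimal sketch of that truncation check in C, assuming UTF-8 input
and a made-up function name. It only checks structure (each lead byte
is followed by the right number of continuation bytes), so it is not a
full validator.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if no multi-byte sequence in the
 * first `n` bytes of `s` is truncated or malformed, 0 otherwise.
 * Does NOT reject overlong encodings or surrogate code points. */
int utf8_structure_ok(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t need; /* continuation bytes expected after `b` */
        if (b < 0x80)               need = 0;
        else if ((b & 0xE0) == 0xC0) need = 1;
        else if ((b & 0xF0) == 0xE0) need = 2;
        else if ((b & 0xF8) == 0xF0) need = 3;
        else return 0; /* stray continuation or invalid lead byte */

        if (need > n - i - 1) return 0; /* sequence is truncated */
        for (size_t k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += need + 1;
    }
    return 1;
}
```

The cost is a full pass over the string, which is where the "possibly
expensive" part comes from.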

Unicode strings are not required to be normalized, but you would need to
implement normalization for users to be able to compare them. Some
glyphs may have multiple machine representations, and normalization
converts any representation to one in particular, so that the more
efficient byte-by-byte comparisons yield correct results.

For example, there is a "latin small letter a with tilde" character
(ã), which is equivalent to the character "latin small letter a"
followed by a "combining tilde" character. They represent the same
glyph, but have different binary representations, and even different
sizes, and would not be considered equal when testing string equality.
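
In C, with the two UTF-8 byte sequences written out by hand (the
constant and function names are made up for illustration), the
mismatch looks like this:

```c
#include <string.h>

/* U+00E3 LATIN SMALL LETTER A WITH TILDE, precomposed: 2 bytes. */
const char A_TILDE_COMPOSED[] = "\xC3\xA3";

/* U+0061 LATIN SMALL LETTER A followed by U+0303 COMBINING TILDE:
 * 1 + 2 = 3 bytes. */
const char A_TILDE_DECOMPOSED[] = "a\xCC\x83";

/* Plain byte-by-byte comparison, with no normalization step. */
int bytes_equal(const char *x, const char *y)
{
    return strcmp(x, y) == 0;
}
```

`bytes_equal(A_TILDE_COMPOSED, A_TILDE_DECOMPOSED)` is false even
though both render as the same glyph. Only after normalizing both
strings to the same form would the comparison succeed.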

All that is to say that depending on what assumptions you make while
implementing `char` as ASCII, it might be relatively easy or very
demanding to transition from ASCII to Unicode. Despite Unicode being a
superset of ASCII, the way they work is quite different, especially at
low level.

> > > Let's go with _void_.  We are on very early development stage,
> > > everything can change anytime.  And _void_ is kind of very well known
> > > keyword.
> >
> > Note that in most languages where there is a `void` type, the `void`
> > type is not actually valid in variable declarations. They are valid only
> > in function return types. In C, they are also valid as pointer types
> > (that is, `void* x;` is valid), but IIRC, not as variable types
> > (`void x;` is not valid).
> 
> I'm okay of not using void pointers as long as we have a replacement for
> it.  I still want to have support to define a raw pointer (untyped).

Or you could also add `void` to a future <pointer-type>. Just don't add
it to <type>. But I'd rather not have pointers.

> >   In ECMAScript specs, there is a `null` type that uses the `null`
> >   value as its unit value.
> 
> I think this approach lead us to design a complex type system.  I
> understand the value of this, but the cost is high when you want to
> design a simple language.
> 
> Regarding `null` I would like to have `null` as an alias to 0 (zero).
> And we could also have semantic analyses on it.  In this case `null`
> wont be a proper type.

No, I think if you define `null` as an alias to 0 as in C, you won't
have the ability to perform semantic analysis on it. In C, the
preprocessor will replace the identifier `NULL` with the literal `0`
_before_ semantic analysis. AFAIK, NULL only works in C as an invalid
memory position to intentionally cause segmentation faults when access
is attempted. And that is not even handled by the C compiler, it is an
error from the operating system. C will just let you access the zero
address, and if the operating system says it's okay, then it's okay.

In Java there is no preprocessor (unless you count the annotation
processor as a preprocessor), so `null` is a proper value, despite not
having a proper type. As such, it is available for semantic analysis for
the Java compiler. All classes in Java are assumed "nullable", which
works relatively well since all variables for non-primitive types are
reference types, where a "null pointer" makes sense, even with (or
especially with) C-like semantics. In this case, there is a JVM that
can deny access to the `null` address and throw a
`NullPointerException`.

As I usually say, the complexity doesn't really disappear, it is just
moved somewhere else. Even if you don't include this check in the type
system, it will be included somewhere else (perhaps manually, in code),
as people will still need to perform this check all the time.
Especially if you decide to incorporate pointers.
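
In C terms, this is the kind of manual check I mean. The function name
is hypothetical; the point is only that the null test lives in user
code because the type system does not carry it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: like strlen(), but treats a null pointer as
 * an empty string instead of leaving the behavior undefined. */
size_t checked_len(const char *s)
{
    if (s == NULL) /* the check the compiler will not do for us */
        return 0;
    return strlen(s);
}
```

Every function that takes a pointer has to either repeat a check like
this or document that callers must never pass a null pointer.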

For example, in TS (with `strictNullChecks` enabled), types are not
nullable by default, so as long as the type definitions are sound,
testing for `x !== null` is usually not needed. But in JS, where the
interpreter does not check types statically, you always have to check
for `x !== null` at runtime before using `x`. JS is simpler, but this
complexity does not really go away.

On the other hand, if you do incorporate pointers, the non-nullability
of types becomes advisory rather than compulsory, so perhaps not that
useful in this case.