public inbox for ~johnnyrichard/olang-devel@lists.sr.ht
From: ricardo_kagawa@disroot.org
To: ~johnnyrichard/olang-devel@lists.sr.ht
Subject: Re: [RFC SPEC] Primitive data types and arrays
Date: Wed, 24 Apr 2024 15:45:16 -0300	[thread overview]
Message-ID: <20240424184532.33699-1-ricardo_kagawa@disroot.org> (raw)
In-Reply-To: <oa5mpbzl5ykffqbsrlarktkchznuiez2ndhxufe6x7czt46ltd@mpvd53nufvko>

> > > An olang array is just like a C array, no need for translation. Although it
> >
> > Are you sure about this? I mean, as a contiguous, properly sized chunk
> > of memory with indexed access, it looks fine. But in C, an array
> > variable is a pointer to that chunk of memory, and therefore pointer
> > arithmetics would be required to match C arrays.
> >
> > I'm not sure I'd like to deal with pointers. But it's not like I can't,
> > it's just that I know it opens a nasty can of worms that I'm not sure
> > you'd want to deal with as a language designer.
> 
> I really would like to know what you see as nasty.  I mean, don't you
> want to deal with pointers in general?  Or do you want to segregate the
> concepts of arrays and pointers?

Both, in fact.

As soon as you include pointers in the language, all memory positions
in the process become fair game, even for mutation. That also means
your compiler will not be able to enforce that `const` variables are in
fact immutable. It can at most check that known bindings are not
assigned to, but they could still be mutated through pointers.
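
To make that concrete, here is a rough C sketch (illustrative only,
nothing olang-specific): the compiler accepts the cast, and whatever
happens next is undefined behavior rather than something the type
system can guarantee.

    #include <stdio.h>

    int main(void)
    {
        const int answer = 42;    /* declared immutable */
        int *p = (int *)&answer;  /* cast away const: the compiler accepts this */
        *p = 7;                   /* undefined behavior, yet it compiles */
        printf("%d\n", answer);   /* may print 7, 42, or anything else */
        return 0;
    }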

Even literal values (such as those coming from preprocessor macros)
could be mutated at runtime. Any visibility modifier would become
advisory rather than compulsory at that point. You just have to locate
the correct memory address and write new values or instructions to it.

But an array does not need to be handled through pointers as in C,
unless you specifically say that arrays match the C implementation.
Languages without pointers still handle arrays just fine; they just
don't match C arrays.
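
For reference, here is what "matching the C implementation" drags in,
as a small C sketch (just C, nothing about olang): the array variable
itself is not a pointer, but it decays into one as soon as it crosses a
function boundary.

    #include <stdio.h>

    /* The parameter looks like an array, but C rewrites it to int *. */
    static void takes_array(int a[4])
    {
        printf("inside:  %zu bytes\n", sizeof(a));  /* size of a pointer */
    }

    int main(void)
    {
        int xs[4] = {1, 2, 3, 4};
        printf("outside: %zu bytes\n", sizeof(xs)); /* 4 * sizeof(int) */
        takes_array(xs);                            /* xs decays to &xs[0] */
        return 0;
    }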

> > Obviously, `boolean` can be either `true` or `false`, but what should
> > that mean? If `boolean` is mapped to `u8`, then zero and non-zero?
> 
> That's exactly what I had in mind.
> 
> Which problems do you see with this approach?

My issue only lies in the lack of a proper definition of what a
`boolean` is, not (yet) in how it is implemented (as that was not
yet mentioned).

Processors don't handle single bits very well, so an implementation of
the `boolean` type will likely use at least a `u8` (and possibly a full
word). But that would have to map at least 256 values onto a type that
only has 2, and obviously there are many ways to do just that.

AFAIK, processors should have a "branch if not zero" instruction to
handle conditionals based on zero/non-zero booleans, which would make
this approach one of the better options (and probably why C uses this
definition, given that C had no `boolean` type before C99's `_Bool`).
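
As a sketch of what the zero/non-zero definition implies (hypothetical
names, assuming `boolean` is backed by `u8`): without normalization,
1 and 255 both mean "true" yet compare unequal, which is one of the
"many ways to map 256 values onto 2" problems.

    #include <stdint.h>

    typedef uint8_t boolean;  /* hypothetical: olang boolean backed by u8 */

    /* Collapse "any non-zero byte is true" down to exactly 0 or 1, so
     * that comparing two booleans with == behaves as users expect. */
    static inline boolean to_boolean(uint8_t raw)
    {
        return (boolean)(raw != 0);
    }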

On the other hand, I'm not sure that you should allow implicit coercions
between `boolean` and `u8` (or between most types in general), even if
internally they would be equivalent.

> > But the real question is what would `char` be? If the language should
> > support Unicode properly, then `char` would represent a _code unit_
> > rather than a "character", which could be considered a misnomer. Since
> > Unicode uses variable-length characters, a Unicode character might be
> > difficult to represent as just `char`.
> >
> > If no Unicode support is planned, then `char` as `u8` is good enough to
> > represent characters in 7-bit ASCII encoding.
> 
> Could you please enlighten me on the implications of starting with `char`
> as a `u8` alias (7-bit ASCII)?  What are the problems we could have if we
> don't support Unicode properly?

You don't have to support Unicode at all at any time, but transitioning
from ASCII to Unicode may not be exactly trivial, since Unicode
characters have varying width, unlike ASCII, which is fixed to 7 bits.

And I don't mean "varying width" only in the sense that the UTF-8/16/32
variants exist. Even UTF-8 on its own can represent all Unicode
characters, despite looking like plain ASCII in the single-byte case.

That is possible because the first bits of each character's lead byte
encode its width. If the first bit is zero, the remaining 7 bits are an
ASCII character. Otherwise, the number of consecutive 1-bits before the
first 0-bit is the number of bytes in the character, so a character can
span several bytes (the scheme allows effectively arbitrary lengths,
although current Unicode caps it at four bytes). The payload bits of the
lead byte and its continuation bytes together encode the character's
code point, which can lie anywhere in the Unicode range, far beyond what
a single byte can hold.
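
A small C sketch of that lead-byte rule (illustrative only; a real
decoder also has to reject overlong and otherwise malformed sequences):

    #include <stddef.h>
    #include <stdint.h>

    /* Sequence length from the lead byte:
     *   0xxxxxxx -> 1 byte (ASCII)
     *   110xxxxx -> 2 bytes
     *   1110xxxx -> 3 bytes
     *   11110xxx -> 4 bytes
     * Returns 0 for continuation bytes (10xxxxxx) and invalid leads. */
    static size_t utf8_sequence_length(uint8_t lead)
    {
        if ((lead & 0x80) == 0x00) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 0;
    }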

UTF-16 uses a different mechanism (surrogate pairs), but the overall
idea is the same: a variable number of code units per character, enough
to cover the same full Unicode range that UTF-32 covers with a single
fixed-width code unit.

And besides having varying width, Unicode characters also bring the
issues of normalization and validity. Validation seems conceptually
simple, but possibly expensive, since AFAIK you mostly need to check
that multi-byte characters are not truncated or otherwise malformed.

Unicode strings are not required to be normalized, but you would need to
implement normalization for users to be able to compare them. Some
glyphs may have multiple machine representations, and normalization
converts any representation to one in particular, so that the more
efficient byte-by-byte comparisons yield correct results.

For example, there is a "latin small letter a with tilde" character
(ã), which is equivalent to the character "latin small letter a"
followed by a "combining tilde" character. They represent the same
glyph, but have different binary representations, and even different
sizes, so they would not be considered equal by a byte-wise test for
string equality.
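
To illustrate with the actual byte sequences (a C sketch; the escapes
are just the UTF-8 encodings of the two forms):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* U+00E3 LATIN SMALL LETTER A WITH TILDE, precomposed (NFC) */
        const char *composed   = "\xC3\xA3";
        /* U+0061 'a' followed by U+0303 COMBINING TILDE (NFD) */
        const char *decomposed = "a\xCC\x83";

        /* Same glyph on screen, different lengths and bytes, so a
         * byte-wise comparison says they are not equal. */
        printf("%zu %zu %d\n",
               strlen(composed),                     /* 2 */
               strlen(decomposed),                   /* 3 */
               strcmp(composed, decomposed) == 0);   /* 0 (not equal) */
        return 0;
    }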

All that is to say that depending on what assumptions you make while
implementing `char` as ASCII, it might be relatively easy or very
demanding to transition from ASCII to Unicode. Despite Unicode being a
superset of ASCII, the way they work is quite different, especially at
low level.

> > > Let's go with _void_.  We are on very early development stage,
> > > everything can change anytime.  And _void_ is kind of very well known
> > > keyword.
> >
> > Note that in most languages where there is a `void` type, the `void`
> > type is not actually valid in variable declarations. They are valid only
> > in function return types. In C, they are also valid as pointer types
> > (that is, `void* x;` is valid), but IIRC, not as variable types
> > (`void x;` is not valid).
> 
> I'm okay with not using void pointers as long as we have a replacement
> for them.  I still want to have support for defining a raw pointer
> (untyped).

Or you could also add `void` to a future <pointer-type>. Just don't add
it to <type>. But I'd rather not have pointers.

> >   In ECMAScript specs, there is a `null` type that uses the `null`
> >   value as its unit value.
> 
> I think this approach leads us to design a complex type system.  I
> understand the value of this, but the cost is high when you want to
> design a simple language.
> 
> Regarding `null` I would like to have `null` as an alias to 0 (zero).
> And we could also have semantic analysis on it.  In this case `null`
> won't be a proper type.

No, I think if you define `null` as an alias to 0 as in C, you won't
have the ability to perform semantic analysis on it. In C, the
preprocessor will replace the identifier `NULL` with the literal `0`
_before_ semantic analysis. AFAIK, NULL only works in C as an invalid
memory position to intentionally cause segmentation faults when access
is attempted. And that is not even handled by the C compiler; it is an
error from the operating system. C will just let you access the zero
address, and if the operating system says it's okay, then it's okay.
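
A minimal C sketch of that point: the compiler is perfectly happy with
this program, and the failure, if any, only shows up at runtime when
the operating system refuses the access.

    #include <stddef.h>

    int main(void)
    {
        int *p = NULL;  /* NULL is just a macro for 0 (or (void *)0) */
        return *p;      /* compiles; typically crashes only when run */
    }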

In Java there is no preprocessor (unless you count the annotation
processor as a preprocessor), so `null` is a proper value, despite not
having a proper type. As such, it is available for semantic analysis for
the Java compiler. All classes in Java are assumed "nullable", which
works relatively well since all variables for non-primitive types are
reference types, where a "null pointer" makes sense, even with (or
especially with) C-like semantics. In this case, there is a JVM that
can deny access to the `null` address and throw a
`NullPointerException`.

As I usually say, the complexity doesn't really disappear; it is just
moved somewhere else. Even if you don't include this check in the type
system, it will be included somewhere else (perhaps manually, in code),
as people will still need to perform this check all the time.
Especially if you decide to incorporate pointers.
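
For instance, in a C-like setting the check simply migrates into every
function that might receive a null value (a hypothetical example, and
nothing enforces that callers remember to write it):

    #include <stddef.h>
    #include <stdio.h>

    /* Without nullability in the type system, this guard has to be
     * written (and remembered) by hand at every such boundary. */
    static void print_name(const char *name)
    {
        if (name == NULL) {
            fprintf(stderr, "no name given\n");
            return;
        }
        printf("name: %s\n", name);
    }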

For example, in TS, types are not nullable by default, so as long as the
type definitions are sound, testing for `x !== null` is usually not
needed. But in JS, where the interpreter does not check types
statically, you always have to check for `x !== null` at runtime before
using `x`. JS is simpler, but this complexity does not really go away.

On the other hand, if you do incorporate pointers, the non-nullability
of types becomes advisory rather than compulsory, so perhaps not that
useful in this case.

