[RFC SPEC] Primitive data types and arrays

public inbox for ~johnnyrichard/olang-devel@lists.sr.ht

* [RFC SPEC] Primitive data types and arrays
@ 2024-04-08  3:29 Carlos Maniero
  2024-04-12  7:32 ` Johnny Richard
  0 siblings, 1 reply; 11+ messages in thread
From: Carlos Maniero @ 2024-04-08  3:29 UTC (permalink / raw)
  To: ~johnnyrichard/olang-devel

This thread tries to specify the basic data types of olang.

Primitive data types:
=====================

Primitive types are types that can be held in a general-purpose register.

s8   : 8-bit  signed   integer type.
s16  : 16-bit signed   integer type.
s32  : 32-bit signed   integer type.
s64  : 64-bit signed   integer type.
u8   : 8-bit  unsigned integer type.
u16  : 16-bit unsigned integer type.
u32  : 32-bit unsigned integer type.
u64  : 64-bit unsigned integer type.
f32  : 32-bit floating point   type.
f64  : 64-bit floating point   type.

Translation to C:
-----------------

s8   : int8_t
s16  : int16_t
s32  : int32_t
s64  : int64_t
u8   : uint8_t
u16  : uint16_t
u32  : uint32_t
u64  : uint64_t
f32  : float
f64  : double

C also permits type specifiers such as signed int or short int. However,
this specification recommends the fixed-width names above for simplicity.
In my opinion, this approach is more intuitive. While the width of long int
varies between platforms, there is no ambiguity with int32_t.
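
For illustration only, here is a minimal C sketch of the mapping as a set
of typedefs. The alias names simply reuse the olang type names. It assumes
f32 and f64 are meant to map to C's float and double, which are 32 and 64
bits wide on common IEEE 754 platforms; that mapping is an assumption of
this sketch, not a decision of the RFC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical aliases: olang primitive name -> C type. */
typedef int8_t   s8;
typedef int16_t  s16;
typedef int32_t  s32;
typedef int64_t  s64;
typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;
typedef float    f32; /* assumption: IEEE 754 binary32 */
typedef double   f64; /* assumption: IEEE 754 binary64 */

/* Compile-time checks that each alias has the width its name promises. */
static_assert(sizeof(s8)  == 1, "s8 must be 8 bits");
static_assert(sizeof(s16) == 2, "s16 must be 16 bits");
static_assert(sizeof(s32) == 4, "s32 must be 32 bits");
static_assert(sizeof(s64) == 8, "s64 must be 64 bits");
static_assert(sizeof(u8)  == 1, "u8 must be 8 bits");
static_assert(sizeof(u16) == 2, "u16 must be 16 bits");
static_assert(sizeof(u32) == 4, "u32 must be 32 bits");
static_assert(sizeof(u64) == 8, "u64 must be 64 bits");
static_assert(sizeof(f32) == 4, "f32 must be 32 bits");
static_assert(sizeof(f64) == 8, "f64 must be 64 bits");
```

The static assertions make the build fail on a platform where float or
double has an unexpected width, instead of silently miscompiling.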

Example:
--------

const x: u32 = 1

Grammar:
--------

<primitive-type> ::= 's8'| 's16'| 's32'| 's64'| 'u8'|
                     'u16'| 'u32'| 'u64'| 'f32'| 'f64'
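
As a rough sketch of how a parser might recognise this rule, the C
function below checks a token against the ten names. The function and
table names are hypothetical and only illustrate the grammar; they are
not taken from the olang sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The ten names accepted by <primitive-type>. */
static const char *const PRIMITIVE_TYPES[] = {
    "s8", "s16", "s32", "s64", "u8", "u16", "u32", "u64", "f32", "f64",
};

/* Returns true when `name` is exactly one of the primitive type names. */
bool is_primitive_type(const char *name)
{
    size_t count = sizeof(PRIMITIVE_TYPES) / sizeof(PRIMITIVE_TYPES[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(name, PRIMITIVE_TYPES[i]) == 0)
            return true;
    }
    return false;
}

/* Usage sketch: olang names match, C names do not. */
void check_primitive_types(void)
{
    assert(is_primitive_type("u32"));
    assert(!is_primitive_type("int"));
}
```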

Arrays:
=======

An array is a fixed-size collection of items of the same type stored in
contiguous memory locations. It can store primitive types such as u8, u32
or f32, as well as derived and user-defined types such as structures.

Example:
--------

const x: u32[] = [1]
const y: u32[2] = [1, 2]

Grammar:
--------

<array-type> ::= <type> <ows> '[' <ows> <number>* <ows> ']'
<array-assign> ::= '['
                     <ows>
                     (<expression>
                       (<ows> ',' <ows> <expression>)*
                     )?
                     <ows>
                   ']'

Open question:
--------------

I have no idea how to initialize an array with a value. In C I know that this
is allowed:

int arr[20] = {0};

But I think this is ambiguous, since if I remove the number 20 from the
statement above it will give me a one-element array.
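
The ambiguity can be checked directly in C. In this small sketch the
macro and function names are made up for the illustration; the same
initializer produces 20 zeroed elements in one case and a single element
in the other.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements in a true array (not a pointer). */
#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* With an explicit size, {0} zero-fills all 20 elements. */
size_t sized_count(void)
{
    int arr[20] = {0};
    return COUNT_OF(arr);
}

/* Without a size, the same initializer yields a one-element array. */
size_t unsized_count(void)
{
    int arr[] = {0};
    return COUNT_OF(arr);
}

/* Returns 1 when every element of the sized array is zero. */
int sized_is_all_zero(void)
{
    int arr[20] = {0};
    for (size_t i = 0; i < COUNT_OF(arr); i++) {
        if (arr[i] != 0)
            return 0;
    }
    return 1;
}

void check_initializers(void)
{
    assert(sized_count() == 20);
    assert(unsized_count() == 1);
    assert(sized_is_all_zero());
}
```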

Translation to C:
-----------------

An olang array is just like a C array, so no translation is needed. It
differs from C only in using square brackets rather than curly brackets for
array literals. That way we can easily distinguish arrays from structs.

* Re: [RFC SPEC] Primitive data types and arrays
  2024-04-08  3:29 [RFC SPEC] Primitive data types and arrays Carlos Maniero
@ 2024-04-12  7:32 ` Johnny Richard
  2024-04-13  2:51   ` Carlos Maniero
  0 siblings, 1 reply; 11+ messages in thread
From: Johnny Richard @ 2024-04-12  7:32 UTC (permalink / raw)
  To: Carlos Maniero; +Cc: ~johnnyrichard/olang-devel

On Mon, Apr 08, 2024 at 12:29:11AM -0300, Carlos Maniero wrote:
> This thread tries to specify a basic datatypes of olang.
> 
> Primitive data types:
> =====================
> 
> Primitive types are types that can be held on a general purpose register.
> 
> s8   : 8-bit  signed   integer type.
> s16  : 16-bit signed   integer type.
> s32  : 32-bit signed   integer type.
> s64  : 64-bit signed   integer type.
> u8   : 8-bit  unsigned integer type.
> u16  : 16-bit unsigned integer type.
> u32  : 32-bit unsigned integer type.
> u64  : 64-bit unsigned integer type.
> f32  : 32-bit floating point   type.
> f64  : 64-bit floating point   type.
> 
> Translation to C:
> -----------------
> 
> s8   : int8_t
> s16  : int16_t
> s32  : int32_t
> s64  : int64_t
> u8   : uint8_t
> u16  : uint16_t
> u32  : uint32_t
> u64  : uint64_t
> f32  : int32_t
> f64  : int64_t

I loved it.  Out of curiosity, we are going to have _boolean_ and _char_,
I believe.  Shouldn't they also be included in this primitive spec?

> C also permits the use of type qualifiers, such as signed int or short int.
> However, this specification recommends omitting the qualifier for simplicity.
> In my opinion, this approach is more intuitive. While the meaning of long int
> can be ambiguous, there’s no ambiguity with int32_t.

I agree.

> Arrays:
> =======
> 
> An array is a fixed-size collection of similar data items stored in contiguous
> memory locations. It can be used to store the collection of primitive data
> types such as int, char, float, etc., and also derived and user-defined data
> types, structures, etc.
> 
> Example:
> --------
> 
> const x: u32[] = [1]
> const y: u32[2] = [1, 2]
> 
> Grammar:
> --------
> 
> <array-type> ::= <type> <ows> '[' <ows> <number>* <ows> ']'
> <array-assign> ::= '['
>                      <ows>
>                      (<expression>
>                        (<ows> ',' <ows> <expression>)*
>                      )?
>                      <ows>
>                    ']'
> 
> Open question:
> --------------
> 
> I have no idea how to initialize an array with a value. In C I know that this
> is allowed:
> 
> int arr[20] = {0};
> 
> But I think this is ambiguous since if I remove the number 20 from the
> statement above it will give me an one-sized array.

Yeah, I see... With the syntax you proposed, I suggest the following
syntax for initializing all elements to zero:

    const arr: u8[2] = [...0]

It should only work for arrays with an explicitly declared size.
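
One way such a fill initializer could be lowered to C, purely as a sketch:
because the size is declared, the compiler knows how many elements to
write. The helper name olang_fill_u8 is hypothetical, and the idea of
lowering to memset is an assumption of this example rather than anything
decided in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical lowering of `const arr: u8[2] = [...0]`:
 * the declared size is known, so every element can be filled. */
void olang_fill_u8(uint8_t *dst, size_t len, uint8_t value)
{
    memset(dst, value, len);
}

void check_fill(void)
{
    uint8_t arr[2];
    olang_fill_u8(arr, sizeof(arr), 0);
    assert(arr[0] == 0 && arr[1] == 0);
}
```

Without a declared size there is no `len` to pass, which is one reason
the rule above would be needed.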

> Translation to C:
> -----------------
> 
> A olang array is just like a C array, no need to translation. Although it
> differs from C by using square brakets other then curly brakets. That way we
> could easily differ arrays from structs.

Sure.  I like it. I think we can split this RFC up into different RFCs:
one for discussing primitive types and another for discussing arrays in
general.  What do you think?

There are other topics about arrays we can also discuss, for example:

Array access by index
---------------------

The brackets in C behave oddly when an array is accessed by index, as
shown below:

    int xs[] = { 1, 2 };

    xs[0] = 1;
    0[xs] = 1;

    xs[1] = 2;
    1[xs] = 2;

The odd form with the number first works because C lowers indexing to
pointer arithmetic, and addition is commutative:

    xs[i] == *(xs + i)
    i[xs] == *(i + xs)

I like the simplicity of C translating indexing into pointer arithmetic,
but I think number-first access is too weird.  I believe we should avoid
it.
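
A small C sketch that confirms the equivalence; the function names are
invented for the example.

```c
#include <assert.h>

/* Reads element i using the number-first form. */
int read_reversed(const int *xs, int i)
{
    return i[xs]; /* same as xs[i], i.e. *(i + xs) */
}

void check_index_equivalence(void)
{
    int xs[] = { 1, 2 };
    for (int i = 0; i < 2; i++) {
        assert(xs[i] == *(xs + i));
        assert(i[xs] == *(i + xs));
        assert(&xs[i] == &i[xs]); /* both forms name the same object */
    }
}
```

If olang defines indexing as an operation on an array operand and an
integer operand, in that order, the number-first form simply would not
type-check.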

* Re: [RFC SPEC] Primitive data types and arrays
  2024-04-12  7:32 ` Johnny Richard
@ 2024-04-13  2:51   ` Carlos Maniero
  2024-04-13 23:31     ` Johnny Richard
  0 siblings, 1 reply; 11+ messages in thread
From: Carlos Maniero @ 2024-04-13  2:51 UTC (permalink / raw)
  To: Johnny Richard; +Cc: ~johnnyrichard/olang-devel

> I loved it.  Out of curiosity, we are going to have _boolean_ and _char_
> I believe.  Shouldn't they also be included on these primitive spec?

I like it! We could discuss in the near future whether or not they are
just type aliases for u8. But I also agree they must be built in, without
the need for any include.

<primitive-type>      ::= 's8'| 's16'| 's32'| 's64'| 'u8'|
                          'u16'| 'u32'| 'u64'| 'f32'| 'f64'|
                          'char' | 'bool'
<expression>          ::= <integer> | <identifier> | <boolean> | <char>
<boolean>             ::= "true" | "false"
<char>                ::= "'" #'.' "'" 

Are you comfortable with the above grammar?

> Sure.  I like it. I think we can split this RFC up into different RFCs.
> One for discuss primitive types and another one for discussing arrays in
> general.  What do you think?

LGTM! Let's conclude this one first and then I'll start a new thread.

* Re: [RFC SPEC] Primitive data types and arrays
  2024-04-13  2:51   ` Carlos Maniero
@ 2024-04-13 23:31     ` Johnny Richard
  2024-04-16  3:40       ` Carlos Maniero
  0 siblings, 1 reply; 11+ messages in thread
From: Johnny Richard @ 2024-04-13 23:31 UTC (permalink / raw)
  To: Carlos Maniero; +Cc: ~johnnyrichard/olang-devel

On Fri, Apr 12, 2024 at 11:51:51PM -0300, Carlos Maniero wrote:
> > I loved it.  Out of curiosity, we are going to have _boolean_ and _char_
> > I believe.  Shouldn't they also be included on these primitive spec?
> 
> I like it! We could discuss in the near feature if they are or not just
> type alias for u8. But I also agree they must be built-in without the
> need of any include.
> 
> <primitive-type>      ::= 's8'| 's16'| 's32'| 's64'| 'u8'|
>                           'u16'| 'u32'| 'u64'| 'f32'| 'f64'|
>                           'char' | 'bool'
> <expression>          ::= <integer> | <identifier> | <boolean> | <char>
> <boolean>             ::= "true" | "false"
> <char>                ::= "'" #'.' "'" 

Perhaps _char_ SHOULD support escaped chars like \r (carriage return),
\n (line feed)...  Whenever you create the patch, don't forget it.
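
To make the request concrete, here is a hedged C sketch of a char-literal
reader that accepts a single character or one simple escape. The function
name and the exact set of escapes (\n, \r, \t, \0, \\, \') are
assumptions for illustration; the spec has not fixed them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Parses a literal such as 'a' or '\n' from `src` and stores its byte
 * value in `out`. Returns false when the literal is malformed. */
bool parse_char_literal(const char *src, unsigned char *out)
{
    size_t i = 0;
    if (src[i++] != '\'')
        return false;

    unsigned char value;
    if (src[i] == '\\') {
        i++;
        switch (src[i]) {
        case 'n':  value = '\n'; break;
        case 'r':  value = '\r'; break;
        case 't':  value = '\t'; break;
        case '0':  value = '\0'; break;
        case '\\': value = '\\'; break;
        case '\'': value = '\''; break;
        default:   return false; /* unknown escape or end of input */
        }
    } else if (src[i] == '\0' || src[i] == '\'') {
        return false; /* unterminated or empty literal */
    } else {
        value = (unsigned char)src[i];
    }
    i++;

    if (src[i] != '\'' || src[i + 1] != '\0')
        return false; /* missing closing quote or trailing input */
    *out = value;
    return true;
}

void check_char_literals(void)
{
    unsigned char c = 0;
    assert(parse_char_literal("'\\n'", &c) && c == '\n');
    assert(!parse_char_literal("''", &c));
}
```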

> Are you comfortable with the above grammar?

I am wondering if we should also define _void_ as a primitive.

* Re: [RFC SPEC] Primitive data types and arrays
  2024-04-13 23:31     ` Johnny Richard
@ 2024-04-16  3:40       ` Carlos Maniero
  2024-04-16 18:34         ` Johnny Richard
  0 siblings, 1 reply; 11+ messages in thread
From: Carlos Maniero @ 2024-04-16  3:40 UTC (permalink / raw)
  To: Johnny Richard; +Cc: ~johnnyrichard/olang-devel

> Perhaps _char_ SHOULD have support to escaped chars like \r (carried
> return), \n (line feed)...  Whenever you create the patch, don't forget it.

Sure! I will cover this. Thanks!

> > Are you comfortable with the above grammar?
>
> I am wondering if we should also define _void_ as a primitive.

I think so. Do you like the name *void*? I don't like it that much, but I
can't think of any alternative.

* Re: [RFC SPEC] Primitive data types and arrays
  2024-04-16  3:40       ` Carlos Maniero
@ 2024-04-16 18:34         ` Johnny Richard
  2024-04-17  1:30           ` ricardo_kagawa
  0 siblings, 1 reply; 11+ messages in thread
From: Johnny Richard @ 2024-04-16 18:34 UTC (permalink / raw)
  To: Carlos Maniero; +Cc: ~johnnyrichard/olang-devel

On Tue, Apr 16, 2024 at 12:40:50AM -0300, Carlos Maniero wrote:
> > I am wondering if we should also define _void_ as a primitive.
> 
> I think so. Do you like the name *void*? I don't like that much by I
> can't think in any alternative.

Let's go with _void_.  We are at a very early development stage, and
everything can change at any time.  And _void_ is a very well-known
keyword.

* Re: [RFC SPEC] Primitive data types and arrays
  2024-04-16 18:34         ` Johnny Richard
@ 2024-04-17  1:30           ` ricardo_kagawa
  2024-04-18 21:53             ` Carlos Maniero
  2024-04-20 11:45             ` Johnny Richard
  0 siblings, 2 replies; 11+ messages in thread
From: ricardo_kagawa @ 2024-04-17  1:30 UTC (permalink / raw)
  To: ~johnnyrichard/olang-devel

> A olang array is just like a C array, no need to translation. Although it

Are you sure about this? I mean, as a contiguous, properly sized chunk
of memory with indexed access, it looks fine. But in C, an array
variable decays to a pointer to that chunk of memory in most expressions,
and therefore pointer arithmetic would be required to match C arrays.

I'm not sure I'd like to deal with pointers. But it's not like I can't,
it's just that I know it opens a nasty can of worms that I'm not sure
you'd want to deal with as a language designer.
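
For readers less familiar with this corner of C, a short sketch of how an
array turns into a pointer once it leaves its declaring scope. In C an
array and a pointer to its first element are distinct types, but the
array converts to the pointer in most expressions, including function
calls. The function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Inside the declaring scope, sizeof sees the whole array. */
size_t bytes_in_scope(void)
{
    int xs[4] = {0};
    return sizeof(xs);
}

/* Passed to a function, the array has decayed to a pointer,
 * so only the pointer's own size is visible here. */
size_t bytes_seen_by_callee(const int *xs)
{
    return sizeof(xs);
}

void check_decay(void)
{
    int xs[4] = {0};
    assert(bytes_in_scope() == 4 * sizeof(int));
    assert(bytes_seen_by_callee(xs) == sizeof(int *));
}
```

The length is lost at the call boundary, which is why matching C arrays
exactly would pull pointer semantics into olang.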

> > I loved it.  Out of curiosity, we are going to have _boolean_ and _char_
> > I believe.  Shouldn't they also be included on these primitive spec?
> 
> I like it! We could discuss in the near feature if they are or not just
> type alias for u8. But I also agree they must be built-in without the
> need of any include.

I like the idea of treating `boolean` and `char` as primitives, but do
be careful about what they mean.

Obviously, `boolean` can be either `true` or `false`, but what should
that mean? If `boolean` is mapped to `u8`, then zero and non-zero?

But the real question is what would `char` be? If the language should
support Unicode properly, then `char` would represent a _code unit_
rather than a "character", which could be considered a misnomer. Since
Unicode uses variable-length characters, a Unicode character might be
difficult to represent as just `char`.

If no Unicode support is planned, then `char` as `u8` is good enough to
represent characters in 7-bit ASCII encoding.

> > Perhaps _char_ SHOULD have support to escaped chars like \r (carried
> > return), \n (line feed)...  Whenever you create the patch, don't forget it.
> 
> Sure! I will cover this. Thanks!

If you have plans to support Unicode, then I'd also suggest including
hexadecimal and Unicode escapes, like `\x20` and `\uffef`.

> > > I am wondering if we should also define _void_ as a primitive.
> > 
> > I think so. Do you like the name *void*? I don't like that much by I
> > can't think in any alternative.
> 
> Let's go with _void_.  We are on very early development stage,
> everything can change anytime.  And _void_ is kind of very well known
> keyword.

Note that in most languages where there is a `void` type, the `void`
type is not actually valid in variable declarations. It is valid only
in function return types. In C, it is also valid as a pointer target type
(that is, `void* x;` is valid), but IIRC, not as a variable type
(`void x;` is not valid).

In the current version of the spec, it would be included in
<return-type>, rather than <type>, to allow it only as a function
return type.

Also, there are three other types that might be interesting, if I may
suggest: `never` (from TypeScript [1]), `unit` (from functional-like
languages [2]) and `null` (from ECMAScript specs [3]).

[1]: https://www.typescriptlang.org/docs/handbook/2/functions.html#never
[2]: https://en.wikipedia.org/wiki/Unit_type
[3]: https://tc39.es/ecma262/multipage/overview.html#sec-null-value

- `never` would not be that useful without an exception system.
- `unit` and `null` would not make much sense at the same time, so it is
  either one or the other.
- `null` would also be more interesting with union types (TypeScript),
  to define nullable types as the union of a non-nullable type and the
  `null` type. (C has union types, but they are not related to this.)
- I don't really know why an empty tuple would be interesting as the
  value for the `unit` type, but several languages use this convention.
  In ECMAScript specs, there is a `null` type that uses the `null`
  value as its unit value.

* Re: [RFC SPEC] Primitive data types and arrays
  2024-04-17  1:30           ` ricardo_kagawa
@ 2024-04-18 21:53             ` Carlos Maniero
  2024-04-24 16:23               ` ricardo_kagawa
  2024-04-20 11:45             ` Johnny Richard
  1 sibling, 1 reply; 11+ messages in thread
From: Carlos Maniero @ 2024-04-18 21:53 UTC (permalink / raw)
  To: ricardo_kagawa, ~johnnyrichard/olang-devel

About arrays: Johnny has suggested discussing them in a different
thread. I'm just waiting for us to conclude this discussion, and then I'll
start another thread to define Olang's array specification. But you raised
excellent points, and maybe we should define pointers before arrays.

> Obviously, `boolean` can be either `true` or `false`, but what should
> that mean? If `boolean` is mapped to `u8`, then zero and non-zero?

IMO, true should be 1 and false 0, in a way that *1 == true* is true and
*2 == true* is false. Control flow structures may accept anything, not
just booleans, and may apply the non-zero approach you described, but we
can discuss this in their own RFC (which does not exist yet).
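
C with <stdbool.h> happens to behave the same way, which gives a quick
sketch of the two rules side by side. The function names are invented for
the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Strict comparison: only the value 1 equals `true`. */
bool equals_true(int value)
{
    return value == true;
}

/* Control-flow style test: any non-zero value counts as true. */
bool is_truthy(int value)
{
    return value != 0;
}

void check_boolean_semantics(void)
{
    assert(equals_true(1));
    assert(!equals_true(2)); /* 2 == true is false */
    assert(is_truthy(2));    /* but `if (2)` would be taken */
    assert(!is_truthy(0));
}
```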

> But the real question is what would `char` be? If the language should
> support Unicode properly, then `char` would represent a _code unit_
> rather than a "character", which could be considered a misnomer. Since
> Unicode uses variable-length characters, a Unicode character might be
> difficult to represent as just `char`.
>
> If no Unicode support is planned, then `char` as `u8` is good enough to
> represent characters in 7-bit ASCII encoding.

I'll be honest with you: everything you said makes a lot of sense. Making
a char a u8 seems to enforce a Western Eurocentrism in Olang. But I
confess that I never stopped to learn more about Unicode.

While I think we should support a 32-bit Unicode char, I don't want to
make every char a u32; I'd like to keep support for ASCII encoding.

IMO, we should either postpone specifying a char right now or assume
that a char at this point represents an ASCII char and start a new RFC
about Unicode where we may define something like a Unicode char.

BTW, you seem well versed in Unicode theory. Would you like to
propose a mechanism to deal with Unicode?

> Note that in most languages where there is a `void` type, the `void`
> type is not actually valid in variable declarations. [...]
>
> In the current version of the spec, it would be included in
> <return-type>, rather than <type>, to allow it only as a function
> return type.

Agree!

> Also, there are three other types that might be interesting, if I may
> suggest: `never` (from TypeScript [1]), `unit` (from functional-like
> languages [2]) and `null` (from ECMAScript specs [3]).
>
> [1]: https://www.typescriptlang.org/docs/handbook/2/functions.html#never
> [2]: https://en.wikipedia.org/wiki/Unit_type
> [3]: https://tc39.es/ecma262/multipage/overview.html#sec-null-value

They seem to be very specific; we may want to wait until we find a
use for them.

* Re: [RFC SPEC] Primitive data types and arrays
  2024-04-17  1:30           ` ricardo_kagawa
  2024-04-18 21:53             ` Carlos Maniero
@ 2024-04-20 11:45             ` Johnny Richard
  2024-04-24 18:45               ` ricardo_kagawa
  1 sibling, 1 reply; 11+ messages in thread
From: Johnny Richard @ 2024-04-20 11:45 UTC (permalink / raw)
  To: ricardo_kagawa; +Cc: ~johnnyrichard/olang-devel

On Tue, Apr 16, 2024 at 10:30:03PM -0300, ricardo_kagawa@disroot.org wrote:
> > A olang array is just like a C array, no need to translation. Although it
> 
> Are you sure about this? I mean, as a contiguous, properly sized chunk
> of memory with indexed access, it looks fine. But in C, an array
> variable is a pointer to that chunk of memory, and therefore pointer
> arithmetics would be required to match C arrays.
> 
> I'm not sure I'd like to deal with pointers. But it's not like I can't,
> it's just that I know it opens a nasty can of worms that I'm not sure
> you'd want to deal with as a language designer.

I would really like to know what you see as nasty.  I mean, do you not
want to deal with pointers in general?  Or do you want to separate the
concepts of arrays and pointers?

> > > I loved it.  Out of curiosity, we are going to have _boolean_ and _char_
> > > I believe.  Shouldn't they also be included on these primitive spec?
> > 
> > I like it! We could discuss in the near feature if they are or not just
> > type alias for u8. But I also agree they must be built-in without the
> > need of any include.
> 
> I like the idea of treating `boolean` and `char` as primitives, but do
> be careful about what they mean.
> 
> Obviously, `boolean` can be either `true` or `false`, but what should
> that mean? If `boolean` is mapped to `u8`, then zero and non-zero?

That's exactly what I had in mind.

Which problems do you see with this approach?

> But the real question is what would `char` be? If the language should
> support Unicode properly, then `char` would represent a _code unit_
> rather than a "character", which could be considered a misnomer. Since
> Unicode uses variable-length characters, a Unicode character might be
> difficult to represent as just `char`.
> 
> If no Unicode support is planned, then `char` as `u8` is good enough to
> represent characters in 7-bit ASCII encoding.

Could you please enlighten me on the implications of starting with `char`
as a `u8` alias (7-bit ASCII)?  What problems could we have if we don't
support Unicode properly?

> > Let's go with _void_.  We are on very early development stage,
> > everything can change anytime.  And _void_ is kind of very well known
> > keyword.
> 
> Note that in most languages where there is a `void` type, the `void`
> type is not actually valid in variable declarations. They are valid only
> in funtion return types. In C, they are also valid as pointer types
> (that is, `void* x;` is valid), but IIRC, not as variable types
> (`void x;` is not valid).

I'm okay with not using void pointers as long as we have a replacement for
them.  I still want to support defining a raw (untyped) pointer.

> In the current version of the spec, it would be included in
> <return-type>, rather than <type>, to allow it only as a function
> return type.

Yeah, I like it.

> Also, there are three other types that might be interesting, if I may
> suggest: `never` (from TypeScript [1]), `unit` (from functional-like
> languages [2]) and `null` (from ECMAScript specs [3]).
> 
> [1]: https://www.typescriptlang.org/docs/handbook/2/functions.html#never
> [2]: https://en.wikipedia.org/wiki/Unit_type
> [3]: https://tc39.es/ecma262/multipage/overview.html#sec-null-value
> 
> - `never` would not be that useful without an exception system.

The language won't have exceptions.

> - `null` would also be more interesting with union types (TypeScript),
>   to define nullable types as the union of a non-nullable type and the
>   `null` type. (C has union types, but they are not related to this.)
>
> - I don't really know why an empty tuple would be interesting as the
>   value for the `unit` type, but several languages use this convention.
>   In ECMAScript specs, there is a `null` type that uses the `null`
>   value as its unit value.

I think this approach leads us to design a complex type system.  I
understand the value of this, but the cost is high when you want to
design a simple language.

Regarding `null`, I would like to have `null` as an alias for 0 (zero).
And we could also have semantic analysis on it.  In this case `null`
won't be a proper type.

* Re: [RFC SPEC] Primitive data types and arrays
  2024-04-18 21:53             ` Carlos Maniero
@ 2024-04-24 16:23               ` ricardo_kagawa
  0 siblings, 0 replies; 11+ messages in thread
From: ricardo_kagawa @ 2024-04-24 16:23 UTC (permalink / raw)
  To: ~johnnyrichard/olang-devel; +Cc: carlos

> About arrays, Johnny has suggested to talk about arrays in a different
> thread, I'm just waiting us to conclude this discussion and I'll start
> another thread to define Olang's array specification. But we brought
> excellent points, and maybe we should define pointers before arrays.

My point was not that you should include pointers in the language, but
that you should not match C arrays as you have specified. Of course,
you can include pointers if that was your plan all along, but I would
rather you did not.

> > Obviously, `boolean` can be either `true` or `false`, but what should
> > that mean? If `boolean` is mapped to `u8`, then zero and non-zero?
> 
> IMO, true should be 1 and false 0 in a way that *1 == true* is true and
> *2 == true* is false. Control flow structures may accept anything not
> just booleans and may apply the non-zero approach you described, but we
> can discuss this on their own RFC (that does not exists yet).

I have my issues regarding that, but let's wait for that new thread.

> > But the real question is what would `char` be? If the language should
> > support Unicode properly, then `char` would represent a _code unit_
> > rather than a "character", which could be considered a misnomer. Since
> > Unicode uses variable-length characters, a Unicode character might be
> > difficult to represent as just `char`.
> > 
> > If no Unicode support is planned, then `char` as `u8` is good enough to
> > represent characters in 7-bit ASCII encoding.
> 
> I'll be honest with you, It makes a lot of sense all you said, making a
> char a u8 seems to enforce an Western-Eurocentrism in Olang. But I
> confess that I never stopped to learn more about unicode.
> 
> At the same time I think we should support a 32-bit sized unicode char,
> I don't wanna make all chars an u32 keeping the support to ASCII encoding.

This is exactly how I feel, except I would stick to UTF-16 (this is what
JS uses). Unicode would be a lot more complex to deal with, and totally
overkill if you don't have plans to support non-ASCII characters as
primitives. But if you do have plans to support it, it might be better
to at least avoid making assumptions that could make it difficult to
transition to it later.

> IMO, we should either postpone specifying a char right now or assume
> that a char at this point represents an ASCII char and start a new RFC
> about unicode where we may define something like an unicode char.

My intent was actually to make you postpone the definition of the `char`
type until you have considered this carefully enough. You don't have to
decide that right now, but you also don't have to define the `char` type
right now either.

But if you do intend to support Unicode as `char`, then I would not make
it something separate from ASCII, as Unicode is a superset of ASCII. Not
a problem if you intend to support Unicode as a separate library (as in
C), but I feel it would be weird to have both ASCII and Unicode as
primitives if you already have ASCII included in Unicode.

> BTW, you seem well versed on the unicode theory, would you like to
> purpose a mechanism to deal with unicode?

I am not that well versed, I just have a user-level knowledge of
Unicode. What I would propose however, is to look at languages that
natively support Unicode, like JS. More precisely, not just copy what
they do, but also look at what they did wrong and try to do better.

In C, `char` is assumed ASCII (it is not actually, but sort of can be)
and Unicode seems to be supported through a standard library (I have
never used Unicode in C, but I suspect it is related to "wide chars",
at least).

> > Also, there are three other types that might be interesting, if I may
> > suggest: `never` (from TypeScript [1]), `unit` (from functional-like
> >     languages [2]) and `null` (from ECMAScript specs [3]).
> 
> They seems to be very specific, we may wanna to wait until we find an
> use for them.

Yeah, I am not suggesting that you include these right now (or at all),
just to take them into consideration. I don't know where you are
planning to go about your language's design, as details are still
lacking at this point.

* Re: [RFC SPEC] Primitive data types and arrays
  2024-04-20 11:45             ` Johnny Richard
@ 2024-04-24 18:45               ` ricardo_kagawa
  0 siblings, 0 replies; 11+ messages in thread
From: ricardo_kagawa @ 2024-04-24 18:45 UTC (permalink / raw)
  To: ~johnnyrichard/olang-devel

> > > A olang array is just like a C array, no need to translation. Although it
> >
> > Are you sure about this? I mean, as a contiguous, properly sized chunk
> > of memory with indexed access, it looks fine. But in C, an array
> > variable is a pointer to that chunk of memory, and therefore pointer
> > arithmetics would be required to match C arrays.
> >
> > I'm not sure I'd like to deal with pointers. But it's not like I can't,
> > it's just that I know it opens a nasty can of worms that I'm not sure
> > you'd want to deal with as a language designer.
> 
> I really would like to know what you see as nasty.  I mean, don't you
> want to deal with pointer in general?  Or you want to segregate the
> concept of array and pointers?

Both, in fact.

As soon as you include pointers in the language, all memory positions
in the process become fair game, even for mutation. Which also means
that your compiler will not be able to enforce that `const` variables
are in fact immutable. It can at most check that known bindings are not
assigned to, but they could still be mutated through pointers.
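
A minimal C sketch of the aliasing problem described above. Note the
assumption: it shows the well-defined case, where the object itself is
not const and only one of its views is. (Writing to an object that was
defined const, through a cast pointer, is undefined behaviour in C.) The
function names are invented for the example.

```c
#include <assert.h>

/* Reads through a pointer-to-const: this function cannot write. */
int read_view(const int *view)
{
    return *view;
}

/* Writes through an ordinary pointer to the same storage. */
void write_alias(int *alias, int value)
{
    *alias = value;
}

void check_const_view(void)
{
    int counter = 1;            /* the object itself is not const */
    const int *view = &counter; /* read-only view of it */

    assert(read_view(view) == 1);
    write_alias(&counter, 2);   /* well-defined: counter is mutable */
    assert(read_view(view) == 2); /* the "const" view changed anyway */
}
```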

Even literal values (as those coming from preprocessor macros) could be
mutated at runtime. Any visibility modifier would become advisory rather
than compulsory from this point. You just have to locate the correct
memory address and write new values or instructions to it.

But an array does not need to be handled through pointers as in C,
unless you specifically say that arrays match the C implementation.
Languages without pointers still handle arrays just fine, they just
don't match C arrays.

> > Obviously, `boolean` can be either `true` or `false`, but what should
> > that mean? If `boolean` is mapped to `u8`, then zero and non-zero?
> 
> That's what exactly what I had in mind.
> 
> Which problems you see with this approach?

My issue only lies in the lack of a proper definition of what a
`boolean` is, not (yet) in how they are implemented (as that was not
yet mentioned).

Processors don't handle single-bits very well, so an implementation of
the `boolean` type will likely use at least `u8` (and possibly a full
word). But that would have to map at least 256 values to a type that
only has 2, and obviously there are many ways to do just that.

AFAIK, processors should have a "branch if not zero" instruction to
handle conditionals based on zero/non-zero booleans, which would make
this approach one of the better options (and probably why C uses this
definition, ignoring the fact that C has no `boolean` type).

On the other hand, I'm not sure that you should allow implicit coercions
between `boolean` and `u8` (or between most any types), even if
internally they would be equivalent.

> > But the real question is what would `char` be? If the language should
> > support Unicode properly, then `char` would represent a _code unit_
> > rather than a "character", which could be considered a misnomer. Since
> > Unicode uses variable-length characters, a Unicode character might be
> > difficult to represent as just `char`.
> >
> > If no Unicode support is planned, then `char` as `u8` is good enough to
> > represent characters in 7-bit ASCII encoding.
> 
> Could you please enlighten me the implications of starting with `char`
> as `u8` alias (7-bit ASCII)?  What are the problems we could have if we
> don't support Unicode properly?

You don't have to support Unicode at all at any time, but transitioning
from ASCII to Unicode may not be exactly trivial, since Unicode
characters have varying width, unlike ASCII, which is fixed to 7 bits.

And I'm not saying "varying width" as in "there exist the UTF-8/16/32
variants". Even UTF-8 can represent all Unicode characters, despite
treating single bytes as basically ASCII.

That is possible because the first bits of each character are used to
represent its width. If the first bit is zero, then the remaining 7 bits
are in the ASCII range. Otherwise, the number of consecutive 1-bits
before the first 0-bit is the number of bytes in the character (two to
four in UTF-8 as currently standardized), so a Unicode character has a
variable length (aligned to 8 bits). The remaining bits of the first
byte, plus the low 6 bits of each following byte, represent the
character's code point, which can be anywhere in the Unicode range.
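
The leading-byte rule can be sketched in C as follows. The function name
is invented for the example, and the sketch assumes UTF-8 as limited to
four bytes by RFC 3629. It only classifies the first byte; it does not
reject overlong or out-of-range sequences.

```c
#include <assert.h>

/* Returns the total byte length announced by a UTF-8 lead byte,
 * or 0 when the byte cannot start a sequence. */
int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0; /* 10xxxxxx continuation byte, or invalid lead */
}

void check_utf8_lengths(void)
{
    assert(utf8_sequence_length('a') == 1);
    assert(utf8_sequence_length(0xC3) == 2); /* lead byte of U+00E3 */
    assert(utf8_sequence_length(0x80) == 0); /* continuation byte */
}
```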

UTF-16 uses a different mechanism (surrogate pairs), but the overall idea
is the same, and it can also represent code points that do not fit in
16 bits.

And besides having varying width, Unicode characters also have the
issues of normalization and validity. Validation seems somewhat simple,
but possibly expensive, since AFAIK you mostly need to check that
multi-byte characters are not truncated (a strict validator also
rejects overlong encodings and surrogate code points).
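
As a rough illustration of what validation has to catch, this Python
sketch leans on the built-in strict UTF-8 decoder instead of a
hand-written check. The helper name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether data is well-formed UTF-8, using Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"\xc3\xa3"))      # True: complete two-byte sequence
print(is_valid_utf8(b"\xc3"))          # False: truncated sequence
print(is_valid_utf8(b"\xc0\xaf"))      # False: overlong encoding of "/"
print(is_valid_utf8(b"\xed\xa0\x80"))  # False: encoded surrogate
```

The check has to walk every byte of the string, which is where the cost
comes from.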

Unicode strings are not required to be normalized, but you would need to
implement normalization for users to be able to compare them. Some
glyphs may have multiple machine representations, and normalization
converts any representation to one in particular, so that the more
efficient byte-by-byte comparisons yield correct results.

For example, there is a "latin small letter a with tilde" character
(ã), which is equivalent to the character "latin small letter a"
followed by a "combining tilde" character. They represent the same
glyph, but have different binary representations, and even different
sizes, and would not be considered equal when testing string equality.
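
The same example in Python, as a small sketch using the standard
`unicodedata` module (the variable names are only for illustration):

```python
import unicodedata

composed = "\u00e3"     # LATIN SMALL LETTER A WITH TILDE
decomposed = "a\u0303"  # LATIN SMALL LETTER A + COMBINING TILDE

# Same glyph, different code points and different encoded sizes.
print(composed == decomposed)            # False
print(len(composed.encode("utf-8")))     # 2
print(len(decomposed.encode("utf-8")))   # 3

# After normalizing both sides to the same form, plain comparison works.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
print(nfc_equal)                         # True
```

NFC (composed) and NFD (decomposed) are two of the standard
normalization forms; which one a language picks matters less than
applying the same one to both strings before comparing.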

All that is to say that depending on what assumptions you make while
implementing `char` as ASCII, it might be relatively easy or very
demanding to transition from ASCII to Unicode. Despite Unicode being a
superset of ASCII, the way they work is quite different, especially at
low level.

> > > Let's go with _void_.  We are on very early development stage,
> > > everything can change anytime.  And _void_ is kind of very well known
> > > keyword.
> >
> > Note that in most languages where there is a `void` type, the `void`
> > type is not actually valid in variable declarations. They are valid only
> > in function return types. In C, they are also valid as pointer types
> > (that is, `void* x;` is valid), but IIRC, not as variable types
> > (`void x;` is not valid).
> 
> I'm okay with not using void pointers as long as we have a replacement
> for it.  I still want to have support for defining a raw (untyped)
> pointer.

Or you could also add `void` to a future <pointer-type>. Just don't add
it to <type>. But I'd rather not have pointers.

> >   In ECMAScript specs, there is a `null` type that uses the `null`
> >   value as its unit value.
> 
> I think this approach leads us to design a complex type system.  I
> understand the value of this, but the cost is high when you want to
> design a simple language.
> 
> Regarding `null` I would like to have `null` as an alias to 0 (zero).
> And we could also have semantic analysis on it.  In this case `null`
> won't be a proper type.

No, I think if you define `null` as an alias to 0 as in C, you won't
have the ability to perform semantic analysis on it. In C, the
preprocessor will replace the identifier `NULL` with the literal `0`
(or `(void *)0`, depending on the implementation) _before_ semantic
analysis. AFAIK, NULL only works in C as an invalid memory position
that intentionally causes a segmentation fault when access is
attempted. And that is not even handled by the C compiler; it is an
error from the operating system. C will just let you access the zero
address, and if the operating system says it's okay, then it's okay.

In Java there is no preprocessor (unless you count the annotation
processor as a preprocessor), so `null` is a proper value, despite not
having a proper type. As such, it is available for semantic analysis by
the Java compiler. All classes in Java are assumed "nullable", which
works relatively well since all variables for non-primitive types are
reference types, where a "null pointer" makes sense, even with (or
especially with) C-like semantics. In this case, there is a JVM that
can deny access to the `null` address and throw a
`NullPointerException`.

As I usually say, the complexity doesn't really disappear, it is just
moved somewhere else. Even if you don't include this check in the type
system, it will be included somewhere else (perhaps manually, in code),
as people will still need to perform this check all the time.
Especially if you decide to incorporate pointers.

For example, in TS (with `strictNullChecks` enabled), types are not
nullable by default, so as long as the type definitions are sound,
testing for `x !== null` is usually not needed. But in JS, where the
interpreter does not check types statically, you always have to check
for `x !== null` at runtime before using `x`. JS is simpler, but this
complexity does not really go away.
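
The same trade-off can be sketched in Python, which, like JS, does not
check types at runtime but has optional annotations that an external
static checker (mypy, for instance) can enforce. This is only an analogy
to the TS/JS case, and the function names are hypothetical.

```python
from typing import Optional


def length_unchecked(text: str) -> int:
    # Declared non-nullable: a static checker would reject a call with
    # None, so no test is written here. Without the checker, passing
    # None fails at runtime.
    return len(text)


def length_checked(text: Optional[str]) -> int:
    # Declared nullable: the None test has to be written by hand.
    if text is None:
        return 0
    return len(text)


print(length_checked(None))     # 0
print(length_unchecked("abc"))  # 3
```

Either the type system rules `None` out before the program runs, or the
check ends up in the function body.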

On the other hand, if you do incorporate pointers, the non-nullability
of types becomes advisory rather than compulsory, so perhaps not that
useful in this case.

end of thread, other threads:[~2024-04-24 18:45 UTC | newest]

Thread overview: 11+ messages
2024-04-08  3:29 [RFC SPEC] Primitive data types and arrays Carlos Maniero
2024-04-12  7:32 ` Johnny Richard
2024-04-13  2:51   ` Carlos Maniero
2024-04-13 23:31     ` Johnny Richard
2024-04-16  3:40       ` Carlos Maniero
2024-04-16 18:34         ` Johnny Richard
2024-04-17  1:30           ` ricardo_kagawa
2024-04-18 21:53             ` Carlos Maniero
2024-04-24 16:23               ` ricardo_kagawa
2024-04-20 11:45             ` Johnny Richard
2024-04-24 18:45               ` ricardo_kagawa

Code repositories for project(s) associated with this public inbox

	https://git.johnnyrichard.com/olang.git
