[RFC PATCH olang v1] docs: create zero programming language specification

public inbox for ~johnnyrichard/olang-devel@lists.sr.ht

* [RFC PATCH olang v1] docs: create zero programming language specification
@ 2024-03-09  0:05 Johnny Richard
  2024-03-08 23:09 ` [olang/patches/.build.yml] build success builds.sr.ht
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Johnny Richard @ 2024-03-09  0:05 UTC (permalink / raw)
  To: ~johnnyrichard/olang-devel; +Cc: Johnny Richard

This document specifies the semantics and behavior of the Zero Programming
Language so that compiler programmers are informed on how the language is
designed.

This document will help newcomers understand what the language looks
like and serve as a DRAFT guide for the language design discussions.

The grammar was made using an EBNF evaluator tool[1].

[1]: https://mdkrajnak.github.io/ebnftest/

Signed-off-by: Johnny Richard <johnny@johnnyrichard.com>
---
This grammar adds the token SEMICOLON (';') for every statement.  I know we
agreed to make it optional, but the SEMICOLON makes the parser much more
convenient to implement.

And this is the first topic I would like to discuss. Let me know if you
agree; otherwise I can adapt the grammar to make SEMICOLON optional.

 docs/pages/language_specification.md | 41 ++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 docs/pages/language_specification.md

diff --git a/docs/pages/language_specification.md b/docs/pages/language_specification.md
new file mode 100644
index 0000000..9d27eda
--- /dev/null
+++ b/docs/pages/language_specification.md
@@ -0,0 +1,41 @@
+zero programming language specification
+=======================================
+
+ABSTRACT
+--------
+
+This document specifies the semantics and behavior of the Zero Programming
+Language for compiler programmers be informed how the language is designed.
+
+This specification is on DRAFT and will evolve through discussions on olang-dev
+mailing list.
+
+Language Syntax
+---------------
+
+This is the Zero Programming Language EBNF grammar specification 
+
+NOTE: This grammar spec is a DRAFT and it covers only a small portion of the
+language.
+
+```
+<program>               ::= <function-definition>
+<function-definition>   ::= <fn_keyword> <space>+ <identifier> <space>* <f-args> <space>* <colon> <space>* <type> <space>* <block>
+<identifier>            ::= <alpha>+
+                          | <alpha>+ <number>*
+                          ;
+<f-args>                ::= '(' <space>* ')'
+<block>                 ::= <ocurly> <space>* <statement>* <space>* <ccurly>
+<statement>             ::= <return-statement>
+<return-statement>      ::= <return_keyword> <space>* <number>* <space>* <semicolon>
+<semicolon>             ::= ';'
+<ocurly>                ::= '{'
+<ccurly>                ::= '}'
+<type>                  ::= 'u32 '
+<colon>                 ::= ':'
+<alpha>                 ::= #'[a-zA-Z_]'
+<number>                ::= #'[0-9]'
+<fn_keyword>            ::= 'fn'
+<return_keyword>        ::= 'return'
+<space>                 ::= #'[ \t\r\n]'
+```
-- 
2.44.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [olang/patches/.build.yml] build success
  2024-03-09  0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
@ 2024-03-08 23:09 ` builds.sr.ht
  2024-03-14  4:29   ` Ricardo Kagawa
  2024-03-09  0:36 ` [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: builds.sr.ht @ 2024-03-08 23:09 UTC (permalink / raw)
  To: Johnny Richard; +Cc: ~johnnyrichard/olang-devel

olang/patches/.build.yml: SUCCESS in 33s

[docs: create zero programming language specification][0] from [Johnny Richard][1]

[0]: https://lists.sr.ht/~johnnyrichard/olang-devel/patches/50092
[1]: mailto:johnny@johnnyrichard.com

✓ #1164786 SUCCESS olang/patches/.build.yml https://builds.sr.ht/~johnnyrichard/job/1164786


* Re: [olang/patches/.build.yml] build success
  2024-03-08 23:09 ` [olang/patches/.build.yml] build success builds.sr.ht
@ 2024-03-14  4:29   ` Ricardo Kagawa
  2024-03-14 22:43     ` Johnny Richard
  0 siblings, 1 reply; 11+ messages in thread
From: Ricardo Kagawa @ 2024-03-14  4:29 UTC (permalink / raw)
  To: ~johnnyrichard/olang-devel; +Cc: builds.sr.ht

>> This grammar adds the token SEMICOLON (';') for every statement.  I know we
>> agreed make it optional, but the SEMICOLON makes the parser much more
>> convenient to implement.
>>
>> And this is the first topic I would like to discuss. Let me know if you
>> agree otherwise I can adapt the grammar to make SEMICOLON optional.
>
> (...) Therefore, I'm curious about your statement that using a
> semicolon makes the parser much more convenient to implement. Could you
> elaborate on this? Have you encountered any new considerations that might
> complicate the implementation?

My limited understanding is that the semicolon would indeed be more
convenient, as it would be a definitive end-of-statement symbol,
requiring no lookahead to resolve as such. The LF token could be
ambiguous on its own (between end-of-statement and white space), so
some lookahead would be required to resolve it.

But it should be alright, as long as the language remains context-free.
Even if it becomes ambiguous, non-deterministic, or requires a long
lookahead. Ideally it should be deterministic for linear time
performance, but it seems there are parsers that can run close to it in
the average case, as long as the language remains close to
deterministic.

And I don't have a strong opinion on the semicolon issue, except that
it must be an option. But whatever we do, we must avoid the following
pitfall from JavaScript:

```javascript
example
;(x)
```

The semicolon is mandatory here, because otherwise `(x)` is handled as
an argument list, and `example` would be called as a function. That is,
it would be a multi-line statement, instead of two separate statements.

And why would anyone do this?

```javascript
const x = y.example
;(() => {
   console.log(x)
})()
```

Immediately invoked function expressions are a thing in JavaScript, and
it would not be uncommon to have some expression ending with an
identifier right before them.

>> The grammar was made by using a EBNF evaluator tool[1].
>>
>> [1]: https://mdkrajnak.github.io/ebnftest/
>
> I would add this link at the markdown, so then people can play with it.

I would make an even stronger argument for including the link in the
docs. A good language specification also specifies which language
specification grammar is used for the specification itself. And the
EBNF in particular is not properly standardized, so you really need to
specify which EBNF variant you are using.

The link should thus be good enough to refer to the EBNF implementation
used in this specification, although a permanent (version locked) link
would be better.

----

As for my revision of the grammar:

- Separated rules into sections.
- Added optional white space around the program.
- You don't actually need non-terminal symbols for keywords. Especially
   if you are including the keyword in the symbol name.
- You don't need non-terminal symbols for symbols either, unless you
   have a more "semantic" name for it. There should not be another
   "semicolon" besides `;`, for example.
- In Johnny's version the function name is a single identifier. I don't
   know why Carlos's version made it multiple. I have made it single
   again.
- In Johnny's version the space before the return type is optional. I
   don't know why Carlos's version made it mandatory. I have made it
   optional again.
- Replaced `<identifier>` in `<function-definition>` with
   `<function-name>` to express that this identifier is the name of the
   declared function. Then, `<function-name>` is just `<identifier>`.
- Renamed `<fn-args>` to `<function-parameters>`, since parameters are
   the variables in a function declaration, while arguments are the
   values bound to those variables during function calls.
- Replaced `<type>` with `<return-type>` in `<function-definition>` to
   express that this type identifier is the return type of the function.
   Then, `<return-type>` is just `<type>`.
- Replaced `<block>` in `<function-definition>` with `<function-body>` to
   express that this block is the body of the declared function.
- Reworked `<block>`, `<statement>` and `<end-of-statement>` to allow
   for:
     - Single statement followed by optional end-of-statement;
     - Statement list with mandatory end-of-statement between statements;
     - But the statements could be made optional, yet I did not in this
       version, as there is no `void` return type, currently.
- Replaced `<number>` in `<return-statement>` with `<expression>` to
   prepare for them in the future. The only allowed expression is still
   an integer literal, though.
- Renamed `<number>` to `<integer>`, and reworked it to actually
   represent decimal integer literals. Sequences of zero digits are now
   forbidden at the left side, but a lone zero digit is still allowed.
- Reworked `<identifier>` to better express that it starts with
   `<alpha>` or underline, followed by zero or more `<alpha>`, `<digit>`
   or underline.
- Removed `_` from `<alpha>` to better reflect the name (as underline is
   not an alphabetic character).
- Renamed `<space>` to `<ws>` to avoid ambiguity with the character
   U+0020 Space, and made it a one-or-more list. Also introduced `<ows>`
   for "optional white space". Shorter names were preferred here due to
   these symbols in particular being used very frequently.
- Also introduced `<line-break>` as either LF, CR or CRLF. Otherwise the
   CRLF sequence would be parsed as two separate line breaks. Not that it
   would matter that much, except maybe for mapping line numbers.

```
(* Entry Point *)
<program>             ::= <ows> <function-definition> <ows>

(* Functions *)
<function-definition> ::= 'fn' <ws> <function-name> <ows>
                          <function-parameters> <ows> ':' <ows> <return-type> <ows> <function-body>
<function-name>       ::= <identifier>
<function-parameters> ::= '(' <ows> ')'
<return-type>         ::= <type>
<function-body>       ::= <block>

(* Statements *)
<block>               ::= '{' <ows> <statement> <ows>
                          (<end-of-statement> <ows> <statement> <ows>)* <end-of-statement>? <ows> '}'
<end-of-statement>    ::= ';' | <line-break>
<statement>           ::= <return-statement>
<return-statement>    ::= 'return' <ws> <expression>

(* Expressions *)
<expression>          ::= <integer>

(* Identifiers *)
<type>                ::= 'u32'
<identifier>          ::= (<alpha> | '_') (<alpha> | <digit> | '_')*

(* Literals *)
<integer>             ::= <integer-base10>
<integer-base10>      ::= #'[1-9]' <digit>* | '0'

(* Utilities *)
<ws>                  ::= <white-space>+
<ows>                 ::= <white-space>*
<white-space>         ::= <linear-space> | <line-break>
<line-break>          ::= '\n' | '\r' | '\r\n'
<linear-space>        ::= #'[ \t]'
<alpha>               ::= #'[a-zA-Z]'
<digit>               ::= #'[0-9]'
```
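
To make the reworked `<identifier>` and `<integer-base10>` rules concrete,
here is a minimal Python sketch that translates them into regular
expressions. The helper names are hypothetical and are not part of the
patch or of the olang code base; the sketch only illustrates which
strings the two rules accept.

```python
import re

# Hypothetical regex translations of two rules from the revised grammar.
# <identifier>     ::= (<alpha> | '_') (<alpha> | <digit> | '_')*
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# <integer-base10> ::= #'[1-9]' <digit>* | '0'
INTEGER_BASE10 = re.compile(r"[1-9][0-9]*|0")


def is_identifier(text: str) -> bool:
    """True when the whole string matches <identifier>."""
    return IDENTIFIER.fullmatch(text) is not None


def is_integer_base10(text: str) -> bool:
    """True when the whole string matches <integer-base10>."""
    return INTEGER_BASE10.fullmatch(text) is not None
```

Under these rules `_main1` is an identifier while `1abc` is not, and
`0` and `42` are integers while `007` is rejected because of its
leading zeroes.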

Further discussion:

- Is the language going to support Unicode? If so, `<alpha>` could use
   the _L:Letter_ Unicode category instead of being limited to
   `[a-zA-Z]`. But the EBNF tool does not support Unicode categories in
   its regular expressions (it does not support flags). Also don't
   forget to rename it to `<letter>` in that case.
     - It would help developers in non-English speaking countries, but it
       could be difficult to work with multi-byte characters and Unicode
       normalization.
- There are more linear space and line break characters than the ones
   included here, even within ASCII, although they are not all that
   important. Even more in Unicode (some under _Cc:Other/control_,
   others under _Z:Separator_). Should we support them?
- The function definition could accept a single expression as an
   alternative to its `<block>`, similar to Kotlin.
- The integer literal could include optional underline separators for
   readability. Just need to be careful not to start with underline, to
   avoid ambiguity with identifiers.
- I guess we don't have to support the full set of Unicode digits, since
   we don't know if these digits would even be decimal in the first
   place. The numbering system could be very different from our own, so
   it is likely not feasible to support them.
- I have not checked if this syntax would avoid that edge case with
   JavaScript I mentioned in the beginning. I might check that next
   time (I'm still not sure of how).
- It might seem strange that I included semantic non-terminals here,
   despite having removed non-terminals for symbols and keywords. I can't
   say for sure, since this is my first time trying this style, but I
   suspect that besides making the language specification easier to
   understand, the important bits to hook into in the parser will be
   around these symbols. That is, it could simplify some work on the
   parser.


* Re: [olang/patches/.build.yml] build success
  2024-03-14  4:29   ` Ricardo Kagawa
@ 2024-03-14 22:43     ` Johnny Richard
  0 siblings, 0 replies; 11+ messages in thread
From: Johnny Richard @ 2024-03-14 22:43 UTC (permalink / raw)
  To: Ricardo Kagawa; +Cc: ~johnnyrichard/olang-devel

Thank you very much for your contribution, I love it <3.

Nitpicks for future replies:

    1) You've replied to the CI build reply.  Next time try to reply to
       the right thread.

    2) Your message has a few weird line breaks.  I don't know why it's
       happening (noticed you are using Thunderbird).  However, you can
       make sure your setup is correctly configured to send plain text
       emails by visiting this website https://useplaintext.email/

On Thu, Mar 14, 2024 at 01:29:09AM -0300, Ricardo Kagawa wrote:
> >> This grammar adds the token SEMICOLON (';') for every statement.  I know
> we
> >> agreed make it optional, but the SEMICOLON makes the parser much more
> >> convenient to implement.
> >>
> >> And this is the first topic I would like to discuss. Let me know if you
> >> agree otherwise I can adapt the grammar to make SEMICOLON optional.
> >
> > (...) Therefore, I'm curious about your statement that using a
> > semicolon makes the parser much more convenient to implement. Could you
> > elaborate on this? Have you encountered any new considerations that might
> > complicate the implementation?
> 
> My limited understanding is that the semicolon would indeed be more
> convenient, as it would be a definitive end-of-statement symbol,
> requiring no lookahead to resolve as such. The LF token could be
> ambiguous on its own (between end-of-statement and white space), so
> some lookahead would be required to resolve it.

You are right about it.  I had to implement the lookahead capability in
order to skip LF tokens.
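
For readers following along, the kind of lookahead meant here can be
sketched roughly as below. This is only an illustration in Python with
hypothetical names (`Token`, `TokenStream`, the `"LF"` token kind); it is
not the actual olang parser code.

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str  # hypothetical token kinds, e.g. "FN", "IDENT", "LF"
    text: str


class TokenStream:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        """Look at the next token without consuming it (None at the end)."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        """Consume and return the next token (None at the end)."""
        token = self.peek()
        if token is not None:
            self.pos += 1
        return token

    def skip_line_feeds(self):
        """Consume LF tokens for as long as the lookahead token is an LF."""
        while (token := self.peek()) is not None and token.kind == "LF":
            self.pos += 1
```

A parser built on this would call `skip_line_feeds()` at the points where
a line break is plain white space, and leave the LF token in place where
it may end a statement.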

> But it should be alright, as long as the language remains context-free.
> Even if it becomes ambiguous, non-deterministic, or requires a long
> lookahead. Ideally it should be determinitstic for linear time
> performance, but it seems there are parsers that can run close to it in
> the average case, as long as the language remains close to
> deterministic.
> 
> And I don't have a strong opinion on the semicolon issue, except that
> it must be an option. But whatever we do, we must avoid the following
> pitfall from JavaScript:
> 
> ```javascript
> example
> ;(x)
> ```
> 
> The semicolon is mandatory here, because otherwise `(x)` is handled as
> an argument list, and `example` would be called as a function. That is,
> it would be a multi-line statement, instead of two separate statements.
> 
> And why anyone would do this?
> 
> ```javascript
> const x = y.example
> ;(() => {
>   console.log(x)
> })()
> ```

I strongly agree on avoiding that odd JavaScript design.  I think we can
continue with the optional SEMICOLON.  I also think it makes for a better
programmer experience.

> >> The grammar was made by using a EBNF evaluator tool[1].
> >>
> >> [1]: https://mdkrajnak.github.io/ebnftest/
> >
> > I would add this link at the markdown, so then people can play with it.
> 
> I would make an even stronger argument for including the link in the
> docs. A good language specification also specifies which language
> specification grammar is used for the specification itself. And the
> EBNF in particular is not properly standardized, so you really need to
> specify which EBNF variant you are using.
> 
> The link should thus be good enough to refer to the EBNF implementation
> used in this specification, although a permanent (version locked) link
> would be better.

Sure, I can add it to the document.  I'm not sure how you want to
version lock this variant.  Should I add a specific github/git tag
version to the document?

> As for my revision of the grammar:

I liked all the comments, and it definitely seems to be a better version.
In my opinion we can start with your changes and keep this document alive
for future discussion.  Not sure about Carlos.  Let's see his thoughts on
that as well.

> Further discussion:
> 
> - Is the language going to support Unicode? If so, `<alpha>` could use
>   the _L:Letter_ Unicode category instead of being limited to
>   `[a-zA-Z]`. But the EBNF tool does not support Unicode categories in
>   its regular expressions (it does not support flags). Also don't
>   forget to rename it to `<letter>` in that case.
> 
>     - It would help developers in non-English speaking countries, but it
>       could be difficult to work with multi-byte characters and Unicode
>       normalization.

I lack the knowledge to answer this question right now.  I would say to
keep it as simple as we can at this early stage (ASCII only) unless you
have a big concern.

> - There are more linear space and line break characters than the ones
>   included here, even within ASCII, although they are not all that
>   important. Even more in Unicode (some under _Cc:Other/control_,
>   others under _Z:Separator_). Should we support them?

Let's add the remaining ASCII ones meanwhile.

> - The function definition could accept a single expression as an
>   alternative to its `<block>`, similar to Kotlin.

Scala also has this capability.  But I think it doesn't fit well in our
current function declaration:

    fn f(): u32 <statement>
               ^
               If we don't add a token in here like **=** it will be very
               weird.

No strong opinions here, to be honest.

> - The integer literal could include optional underline separators for
>   readability. Just need to be careful not to start with underline, to
>   avoid ambiguity with identifiers.

I like that.  We can have it as well.

> - I guess we don't have to support the full set of Unicode digits, since
>   we don't know if these digits would even be decimal in the first
>   place. The numbering system could be very different from our own, so
>   it is likely not feasible to support them.

Perhaps we could postpone the Unicode support?

> - I have not checked if this syntax would avoid that edge case with
>   JavaScript I mentioned in the beginning. I might check that next
>   time (I'm still not sure of how).

Maybe we are going to discover it during the implementation process.

> - It might seem strange that I included semantic non-terminals here,
>   despite having removed non-terminals for symbols and keywords. I can't
>   say for sure, since this is my first time trying this style, but I
>   suspect that besides making the language specification easier to
>   understand, the important bits to hook into in the parser will be
>   around these symbols. That is, it could simplify some work on the
>   parser.

ACK


* Re: [RFC PATCH olang v1] docs: create zero programming language specification
  2024-03-09  0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
  2024-03-08 23:09 ` [olang/patches/.build.yml] build success builds.sr.ht
@ 2024-03-09  0:36 ` Johnny Richard
  2024-03-09  5:09 ` Carlos Maniero
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Johnny Richard @ 2024-03-09  0:36 UTC (permalink / raw)
  To: ~johnnyrichard/olang-devel

I messed up the grammar; here are a few adjustments.

On Sat, Mar 09, 2024 at 01:05:39AM +0100, Johnny Richard wrote:
> diff --git a/docs/pages/language_specification.md b/docs/pages/language_specification.md
> new file mode 100644
> index 0000000..9d27eda
> --- /dev/null
> +++ b/docs/pages/language_specification.md
> @@ -0,0 +1,41 @@
> +zero programming language specification
> +=======================================
> +
> +ABSTRACT
> +--------
> +
> +This document specifies the semantics and behavior of the Zero Programming
> +Language for compiler programmers be informed how the language is designed.

s/be informed how/be informed on how/

> +This specification is on DRAFT and will evolve through discussions on olang-dev

s/is on DRAFT/is a DRAFT/

> +<identifier>            ::= <alpha>+
> +                          | <alpha>+ <number>*

I missed the recursion here; it should be:

    <identifier> ::= <alpha>+
                   | <alpha>+ <number>* <identifier>
                   ;

> +<type>                  ::= 'u32 '

s/u32 /u32/


* Re: [RFC PATCH olang v1] docs: create zero programming language specification
  2024-03-09  0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
  2024-03-08 23:09 ` [olang/patches/.build.yml] build success builds.sr.ht
  2024-03-09  0:36 ` [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
@ 2024-03-09  5:09 ` Carlos Maniero
  2024-03-19 20:21 ` Johnny Richard
  2024-03-23 23:31 ` Carlos Maniero
  4 siblings, 0 replies; 11+ messages in thread
From: Carlos Maniero @ 2024-03-09  5:09 UTC (permalink / raw)
  To: Johnny Richard, ~johnnyrichard/olang-devel

> This grammar adds the token SEMICOLON (';') for every statement.  I know we
> agreed make it optional, but the SEMICOLON makes the parser much more
> convenient to implement.
> 
> And this is the first topic I would like to discuss. Let me know if you
> agree otherwise I can adapt the grammar to make SEMICOLON optional.

I understand that using line feed (LF) as an end of statement (EOS)
marker can simplify the parsing process. However, off the top of my head
I see two main downsides in using LF as EOS:

1. It could increase the number of tokens that the lexer produces,
   requiring the parser to ignore some of them.
2. It requires a strategy to skip LF to enable multi-line statements.

While the second point does add some complexity, I believe the first one
only slightly alters the implementation without significantly increasing
its complexity. Therefore, I'm curious about your statement that using a
semicolon makes the parser much more convenient to implement. Could you
elaborate on this? Have you encountered any new considerations that might
complicate the implementation?

> The grammar was made by using a EBNF evaluator tool[1].
> 
> [1]: https://mdkrajnak.github.io/ebnftest/

I would add this link to the markdown, so that people can play with it.

Furthermore, I was playing with the grammar you sent and I made a few
adjustments:

- I introduced the <end-of-statement> to make SEMICOLON optional;
- It seems there is a pattern in where you used DASH or UNDERLINE, but IMO
  it reads better when standardized.
- I made a simplification on <identifier>.
- I reordered the non-terminals, keeping the single chars and strings at
  the bottom.

  <program>               ::= <function-definition>
  <function-definition>   ::= <fn-keyword> <space>+ <identifier>+ <space>* <fn-args> <space>* <colon> <space>+ <type> <space>* <block>
  <identifier>            ::= <alpha>+ <number>* <identifier>*
  <fn-args>               ::= '(' <space>* ')'
  <block>                 ::= <ocurly> <space>* <statement> <space>* <ccurly>
  <statement>             ::= <return-statement> <end-of-statement>+
  <return-statement>      ::= <return-keyword> <space>* <number>+ <space>*
  <end-of-statement>      ::= <semicolon>* <space>* '\n'
  <fn-keyword>            ::= 'fn'
  <return-keyword>        ::= 'return'
  <type>                  ::= 'u32'
  <alpha>                 ::= #'[a-zA-Z_]'
  <number>                ::= #'[0-9]'
  <space>                 ::= #'[ \t\r\n]'
  <semicolon>             ::= ';'
  <ocurly>                ::= '{'
  <ccurly>                ::= '}'
  <colon>                 ::= ':'

Other than that, I would like to say thank you for always looking
forward and helping us to improve our process.


* Re: [RFC PATCH olang v1] docs: create zero programming language specification
  2024-03-09  0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
                   ` (2 preceding siblings ...)
  2024-03-09  5:09 ` Carlos Maniero
@ 2024-03-19 20:21 ` Johnny Richard
  2024-03-23 23:31 ` Carlos Maniero
  4 siblings, 0 replies; 11+ messages in thread
From: Johnny Richard @ 2024-03-19 20:21 UTC (permalink / raw)
  To: ~johnnyrichard/olang-devel

A patch v2 has been sent, this patch is SUPERSEDED.



* Re: [RFC PATCH olang v1] docs: create zero programming language specification
  2024-03-09  0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
                   ` (3 preceding siblings ...)
  2024-03-19 20:21 ` Johnny Richard
@ 2024-03-23 23:31 ` Carlos Maniero
  4 siblings, 0 replies; 11+ messages in thread
From: Carlos Maniero @ 2024-03-23 23:31 UTC (permalink / raw)
  To: Johnny Richard, ~johnnyrichard/olang-devel




* Re: [RFC PATCH olang v1] docs: create zero programming language specification
@ 2024-03-15 20:54 Ricardo Kagawa
  2024-03-17 15:41 ` Carlos Maniero
  2024-03-18  9:58 ` Johnny Richard
  0 siblings, 2 replies; 11+ messages in thread
From: Ricardo Kagawa @ 2024-03-15 20:54 UTC (permalink / raw)
  To: ~johnnyrichard/olang-devel

> You've replied to the CI build reply.  Next time try to reply to the
> right thread.

I just opened the "Reply to thread" link from sourcehut's web interface.
It automatically filled the TO, CC, Subject and some thread ID header,
whose values I just trusted. It seems the UI is not to be trusted, but
then, which values should I use? The address in TO seems to have bounced
my reply, so I thought it didn't even make it to sourcehut.

> Your message has few weird line breaks.

I am not sure myself, but I suspect it is an issue with the file format
generated by `vim`. My default `email` file type seems to be forcing the
`dos` format (CRLF line breaks), which might be being interpreted as two
separate line breaks somewhere between Thunderbird, Gmail and sourcehut.
I'll see if I can force it to use the `unix` format (LF only) and if
that fixes things at all.

> I'm not sure how you want to version lock this variant.  Should I add
> a specific github/git tag version to the document?

Yes, sort of. The web tool itself cannot be version locked, since it
simply does not have that option, but it does link to its GitHub
project. The project does not itself contain the description of its
EBNF syntax, but it does have a link to what it uses to implement its
EBNF parser, which in turn describes its syntax. You could include a
version locked link to [that][1].

[1]: https://github.com/Engelberg/instaparse/tree/v1.4.12

> > Is the language going to support Unicode?
>
> I would say to keep it simple as much as we can on this earlier stage
> (ASCII only) unless you have a big concern.

I guess that would be OK. I don't think it would be too difficult to
migrate later. Maybe tricky, but not difficult, since Unicode is a
superset of ASCII. Just need to be careful not to depend too much on
the fact that ASCII characters fit in a single 8-bit variable, as
Unicode encodings such as UTF-8 use variable-length characters (the
length varies within a string, but each character is a multiple of
8 bits).
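
A small Python sketch of the point about variable-length characters,
assuming the source files would be UTF-8 encoded (the thread does not
settle on an encoding, so that is an assumption made here):

```python
# ASCII characters take one byte in UTF-8; other characters take two to four.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, emoji
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# The number of characters and the number of bytes diverge as soon as a
# non-ASCII character appears in an identifier or string.
word = "ol\u00e1"
char_count = len(word)
byte_count = len(word.encode("utf-8"))
```

Here `byte_lengths` maps the four samples to 1, 2, 3 and 4 bytes, and
`word` has 3 characters but 4 bytes, which is why a lexer that indexes
the input byte by byte cannot assume one byte per character.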

> If we don't add a token in here like **=** it will be very weird.

Actually, I mentioned Kotlin also to imply that there would be an
equals sign before the expression.

> > - I have not checked if this syntax would avoid that edge case with
> >   JavaScript I mentioned in the beginning. I might check that next
> >   time (I'm still not sure of how).
>
> Maybe we are going to discovery it on the implementation process.

I _suspect_ it would be enough to give precedence to interpreting line
breaks as end-of-statement, and if so, there might be a way to represent
that precedence in the EBNF grammar (by convention). I would still need
to mull over it for a while to be sure.
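
As a rough illustration of that idea (a sketch under my own assumptions,
not a claim about how the olang parser will work), the following Python
function always treats `;` and line breaks as end-of-statement. The
function name and the regular expression are hypothetical.

```python
import re

# Assumption: ';' and every line break (LF, CR or CRLF) always end a statement.
END_OF_STATEMENT = re.compile(r";|\r\n|\n|\r")


def split_statements(source: str) -> list:
    """Split source text into statements, dropping empty ones."""
    parts = END_OF_STATEMENT.split(source)
    return [part.strip() for part in parts if part.strip()]
```

With this precedence, `split_statements("example\n(x)")` yields two
statements, `example` and `(x)`, so the JavaScript pitfall does not
arise. The price of this naive version is that no statement can span
several lines, so a real grammar would still need to say where a line
break counts as plain white space.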

Another revision:

- Function body now accepts a single expression.
     - This introduced the `<end-of-file>` token, which is not an actual
       sequence of characters. It allows the function body expression at
       the end of the program without a following line break or
       semicolon. Earlier declarations must include a line break or
       semicolon.
- `\v` (vertical tab) and `\f` (form feed) included as line breaks for
   completeness over ASCII (based on `\s` regex class, which agrees with
   Unicode properties over the ASCII range).
- Integer literals can now include underlines as separators.
     - The literal is allowed to terminate with an arbitrarily long
       sequence of separators, though.
     - It would be possible to restrict the last character to be a digit,
       but maybe it is not worth the trouble?
- Introducing hexadecimal integer literals.
     - Hexadecimal literals are allowed to have an arbitrarily long
       sequence of zeroes to the left, after the `0x` prefix. This is
       intentional, and the parser should ignore excess zeroes.
     - The alphabetic characters in the literal should be parsed without
       case-sensitivity. There is no difference between the `0x` and `0X`
       prefixes in literals. If mixed case is not desirable, let the
       linter or formatter restrict that.

```
(* Entry Point *)
<program>             ::= <ows> <function-definition> <ows> <end-of-file>

(* Functions *)
<function-definition> ::= 'fn' <ws> <function-name> <ows>
                          <function-parameters> <ows> ':' <ows> <return-type> <ows> <function-body>
<function-name>       ::= <identifier>
<function-parameters> ::= '(' <ows> ')'
<return-type>         ::= <type>
<function-body>       ::= <block> | '=' <ows> <expression> <ows>
                          (<end-of-statement> | <end-of-file>)

(* Statements *)
<block>               ::= '{' <ows> <statement> <ows>
                          (<end-of-statement> <ows> <statement> <ows>)* <end-of-statement>? <ows> '}'
<end-of-statement>    ::= ';' | <line-break>
<statement>           ::= <return-statement>
<return-statement>    ::= 'return' <ws> <expression>

(* Expressions *)
<expression>          ::= <integer>

(* Identifiers *)
<type>                ::= 'u32'
<identifier>          ::= (<alpha> | '_') (<alpha> | <digit> | '_')*

(* Literals *)
<integer>             ::= <integer-base10> | <integer-base16>
<integer-base10>      ::= #'[1-9]' (<digit> | '_')* | '0'
<integer-base16>      ::= #'0[Xx]' <hex-digit> (<hex-digit> | '_')*

(* Utilities *)
<ws>                  ::= <white-space>+
<ows>                 ::= <white-space>*
<white-space>         ::= <linear-space> | <line-break>
<line-break>          ::= #'[\n\v\f\r]' | '\r\n'
<linear-space>        ::= #'[ \t]'
<alpha>               ::= #'[a-zA-Z]'
<digit>               ::= #'[0-9]'
<hex-digit>           ::= <digit> | #'[a-fA-F]'
<end-of-file>         ::= #'$'
```
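
A minimal sketch of how the `<integer>` rules above could be checked in C,
in case it helps the discussion. The names `hex_digit_value` and
`parse_integer_literal` are hypothetical and not part of olang. Rejecting
values that do not fit in 32 bits is an assumption based on `u32` being the
only type in the grammar. Separators are skipped, leading zeroes after the
`0x` prefix are ignored, and letter case does not matter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: value of a hex digit, or -1 if c is not one. */
static int
hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper: parses a whole NUL-terminated string as <integer>.
 * Returns false when the text does not match the grammar or when the value
 * does not fit in 32 bits (the overflow rule is an assumption).
 */
static bool
parse_integer_literal(const char *text, uint32_t *out)
{
    uint32_t base = 10;
    size_t i = 0;
    uint64_t value = 0;

    if (text[0] == '0' && (text[1] == 'x' || text[1] == 'X')) {
        base = 16;
        i = 2;
    } else if (text[0] == '0') {
        /* <integer-base10> only allows a lone '0' to start with zero. */
        if (text[1] != '\0') return false;
        *out = 0;
        return true;
    }

    /* The first character after the optional prefix must be a digit:
       [1-9] for base 10 (zero was handled above), <hex-digit> for base 16. */
    int first = hex_digit_value(text[i]);
    if (first < 0 || (uint32_t)first >= base) return false;

    for (; text[i] != '\0'; i++) {
        if (text[i] == '_') continue; /* separators carry no value */
        int digit = hex_digit_value(text[i]);
        if (digit < 0 || (uint32_t)digit >= base) return false;
        value = value * base + (uint64_t)digit;
        if (value > UINT32_MAX) return false;
    }

    *out = (uint32_t)value;
    return true;
}
```

With this sketch, `1_000`, `42__` and `0x00FF` are accepted, while `01`,
`_1`, `0x` and `0x_1` are rejected, which matches the grammar as written.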

* Re: [RFC PATCH olang v1] docs: create zero programming language specification
  2024-03-15 20:54 Ricardo Kagawa
@ 2024-03-17 15:41 ` Carlos Maniero
  2024-03-18  9:58 ` Johnny Richard
  1 sibling, 0 replies; 11+ messages in thread
From: Carlos Maniero @ 2024-03-17 15:41 UTC (permalink / raw)
  To: Ricardo Kagawa, ~johnnyrichard/olang-devel

I'd like to begin by echoing Johnny's words of thanks. It's truly
fantastic to see more people getting involved in making olang a
remarkable language.

It's also quite refreshing to have someone on board who is well-versed
in the theory behind creating a programming language. I'm certain that
I, along with others, will learn a great deal from you.

I'll start with a question from your first reply:

> My limited understanding is that the semicolon would indeed be more
> convenient, as it would be a definitive end-of-statement symbol,
> requiring no lookahead to resolve as such. The LF token could be
> ambiguous on its own (between end-of-statement and white space), so
> some lookahead would be required to resolve it.

I had a hard time trying to understand why the "LF token could be
ambiguous on its own", but now I got it. Before, I was just thinking
about the function body where, to me, a blank line in the function's
body is just an empty statement, same as a sequence of semicolons in C.
But I wasn't considering that there are a lot of other places where the
programmer could add a blank space, like in function declarations:

  fn
  main()
  :
  u32
  {
    return 0
  }

Even though the code above is ugly AF, it's still syntactically correct.
And that made me realize why it's more convenient to have an
end-of-statement token. I'm just putting this out here in case someone
else has the same doubt. Although, it seems like we are all on the same
page about wanting to make the language more user-friendly, even if it
means giving the parser a bit of a hard time.

About the "some lookahead would be required to resolve it": we definitely
added some of these in the parser, but on a closer look, we could
easily get rid of them by replacing the function:

  static void
  skip_line_feeds(lexer_t *lexer)
  {
      token_t token;
      lexer_peek_next(lexer, &token);

      while (token.kind == TOKEN_LF) {
          lexer_next_token(lexer, &token);
          lexer_peek_next(lexer, &token);
      }
  }

With:

  static void
  next_non_lf_token(lexer_t *lexer, token_t *token)
  {
      do {
          lexer_next_token(lexer, token);
      } while (token->kind == TOKEN_LF);
  }

I'm sure there may be some corner cases where it cannot be applied, but I
think it reduces backtracking in most cases.
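
To make the idea easier to try in isolation, here is a self-contained
sketch. Only `lexer_t`, `token_t`, `TOKEN_LF`, `lexer_next_token` and
`next_non_lf_token` are names from this thread; their definitions below are
stand-ins that replay a fixed token array, not the real olang lexer, and the
other token kinds are assumptions for illustration.

```c
#include <stddef.h>

/* Stand-in token kinds; only TOKEN_LF is taken from the thread. */
typedef enum {
    TOKEN_FN,
    TOKEN_IDENTIFIER,
    TOKEN_LF,
    TOKEN_EOF
} token_kind_t;

typedef struct {
    token_kind_t kind;
} token_t;

/* Stand-in lexer that replays a pre-built token array. */
typedef struct {
    const token_t *tokens;
    size_t count;
    size_t cursor;
} lexer_t;

/* Consumes one token; yields TOKEN_EOF once the input is exhausted. */
static void
lexer_next_token(lexer_t *lexer, token_t *token)
{
    if (lexer->cursor < lexer->count) {
        *token = lexer->tokens[lexer->cursor++];
    } else {
        token->kind = TOKEN_EOF;
    }
}

/* Consumes tokens until one that is not a line feed is found, so the
   caller never needs to peek ahead to skip blank lines. */
static void
next_non_lf_token(lexer_t *lexer, token_t *token)
{
    do {
        lexer_next_token(lexer, token);
    } while (token->kind == TOKEN_LF);
}
```

Feeding it the tokens `fn LF LF identifier LF` yields `TOKEN_FN`, then
`TOKEN_IDENTIFIER`, then `TOKEN_EOF`, with no call to a peek function.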

> - Function body now accepts a single expression.
> ...
> - `\v` (vertical tab) and `\f` (form feed) included as line breaks for
>    completeness over ASCII (based on `\s` regex class, which agrees with
>    Unicode properties over the ASCII range).
> - Integer literals can now include underlines as separators.
> ...
> - Introducing hexadecimal integer literals.
> ...

From your last email, I'm totally up for discussing these topics. But I
reckon we might want to split them into new threads to avoid trying to
hash out the entire language in one go.

What do you all think about us trying to:

1. Nail down the current state of the language, leaving new features for
   later.
2. Figure out how we're gonna document new features.
3. Kick off a new thread for each feature we're thinking of adding.

Sound good?

* Re: [RFC PATCH olang v1] docs: create zero programming language specification
  2024-03-15 20:54 Ricardo Kagawa
  2024-03-17 15:41 ` Carlos Maniero
@ 2024-03-18  9:58 ` Johnny Richard
  1 sibling, 0 replies; 11+ messages in thread
From: Johnny Richard @ 2024-03-18  9:58 UTC (permalink / raw)
  To: Ricardo Kagawa; +Cc: ~johnnyrichard/olang-devel

Amazing job. Wonderful changes, and thank you so much for your
contribution.

On Fri, Mar 15, 2024 at 05:54:10PM -0300, Ricardo Kagawa wrote:
> > You've replied to the CI build reply.  Next time try to reply to the
> > right thread.
> 
> I just opened the "Reply to thread" link from sourcehut's web interface.
> It automatically filled the TO, CC, Subject and some thread ID header,
> which values I just trusted. It seems the UI is not to be trusted, but
> then, which values should I use? The address in TO seems to have bounced
> my reply, so I thought it didn't even make it to sourcehut.

Sad to hear the Sourcehut reply is misbehaving.  Could you please
report the problem to them?

The good practice is to always **reply all** (so that someone who is not
subscribed to the mailing list can still follow the thread discussion).
Since I wrote the email you replied to, you should reply to me,
**CC** the mailing list, and carry over the **CC** from the previous email
(reply all already behaves this way).  You might want to keep a copy
(optional); in that case you can **BCC** yourself.

To attach your reply to the right thread, you have to set the
**In-Reply-To** header to the **Message-ID** of the message you are
replying to.

It seems very hard to achieve, but your email client should do the
work.  You can download and import the email thread from the archive
right into your email client and **reply all**.

There is another mailing list archive[1] which has instructions on how to
reply to emails using **git-send-email**, which might interest you.

[1]: http://lists.johnnyrichard.com/olang/

For example, you could follow the steps below to reply to the right
thread using **git-send-email**:

    Reply instructions:

    You may reply publicly to this message[2] via plain-text email
    using any one of the following methods:

    * Save the following mbox file, import it into your mail client,
      and reply-to-all from there: mbox[3]

      Avoid top-posting and favor interleaved quoting:
      https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

    * Reply using the --to, --cc, and --in-reply-to
      switches of git-send-email(1):

      git send-email \
        --in-reply-to=y2brhif5vjjbjmd26wwxykhlt2qnnm2k7n2lpiz7dr7pccmoc7@qqx67mvyzl2c \
        --to=johnny@johnnyrichard.com \
        --cc=ricardo.kagawa@gmail.com \
        --cc=~johnnyrichard/olang-devel@lists.sr.ht \
        /path/to/YOUR_REPLY

      https://kernel.org/pub/software/scm/git/docs/git-send-email.html

    * If your mail client supports setting the In-Reply-To header
      via mailto: links, try the mailto: link[4]

    Be sure your reply has a Subject: header at the top and a blank line before the message body. 

[2]: http://lists.johnnyrichard.com/olang/y2brhif5vjjbjmd26wwxykhlt2qnnm2k7n2lpiz7dr7pccmoc7@qqx67mvyzl2c/#t
[3]: http://lists.johnnyrichard.com/olang/y2brhif5vjjbjmd26wwxykhlt2qnnm2k7n2lpiz7dr7pccmoc7@qqx67mvyzl2c/raw
[4]: mailto:johnny@johnnyrichard.com?In-Reply-To=%3Cy2brhif5vjjbjmd26wwxykhlt2qnnm2k7n2lpiz7dr7pccmoc7@qqx67mvyzl2c%3E&Cc=ricardo.kagawa%40gmail.com%2C~johnnyrichard%2Folang-devel%40lists.sr.ht&Subject=Re%3A%20%5Bolang%2Fpatches%2F.build.yml%5D%20build%20success

> > Your message has few weird line breaks.
> 
> I am not myself sure, but I suspect it is an issue with the file format
> generated by `vim`. My default `email` file type seems to be forcing the
> `dos` format (CRLF line breaks), which might be being interpreted as two
> separate line breaks somewhere between Thunderbird, Gmail and sourcehut.
> I'll see if I can force it to use the `unix` format (LF only) and if
> that fixes things at all.

I think you made it; this email looks good.  Perhaps you can set vim
as your default editor in Thunderbird (I know you can do it, but I'm not
sure how). Gmail should not mess up your message.

If you like to live in the terminal, I suggest the **mutt** or **aerc**
clients, which work out of the box with plain-text emails and are very
flexible.

> > I'm not sure how you want to version lock this variant.  Should I add
> > a specific github/git tag version to the document?
> 
> Yes, sort of. The web tool itself cannot be version locked, since it
> simply does not have that option, but it does link to its GitHub
> project. The project does not itself contain the description of its
> EBNF syntax, but it does have a link to what it uses to implement its
> EBNF parser, which in turn describes its syntax. You could include a
> version locked link to [that][1].
> 
> [1]: https://github.com/Engelberg/instaparse/tree/v1.4.12

Sounds good.

> > > Is the language going to support Unicode?
> > 
> > I would say to keep it simple as much as we can on this earlier stage
> > (ASCII only) unless you have a big concern.
> 
> I guess that would be OK. I don't think it would be too difficult to
> migrate later. Maybe tricky, but not difficult, since Unicode is a
> superset of ASCII. Just need to be careful not to depend too much on
> the fact that ASCII characters are stored in 8-bit variables, as
> Unicode uses variable-length characters (variable within a string, but
> characters are multiples of 8 bits).

Nice, let us know whenever we introduce such an ASCII dependency.
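
To illustrate the point about variable-length characters, here is a small
sketch that assumes UTF-8 would be the encoding chosen (the thread only says
"Unicode", so that is an assumption). The helper name
`utf8_sequence_length` is hypothetical. It shows why code that advances one
byte per character stops being correct outside the ASCII range.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of bytes in the UTF-8 sequence that starts
 * with lead byte b, or 0 if b cannot start a sequence.
 */
static size_t
utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1;               /* ASCII, one byte as today */
    if (b >= 0xC2 && b <= 0xDF) return 2; /* two-byte sequence */
    if (b >= 0xE0 && b <= 0xEF) return 3; /* three-byte sequence */
    if (b >= 0xF0 && b <= 0xF4) return 4; /* four-byte sequence */
    return 0; /* continuation byte or invalid lead byte */
}
```

A lexer that wants to stay easy to migrate could route every "advance one
character" step through a helper like this instead of incrementing a
`char *` directly.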

> > If we don't add a token in here like **=** it will be very weird.
> 
> Actually, I mentioned Kotlin also to imply that there would be an
> equals sign before the expression.
> 
> > > - I have not checked if this syntax would avoid that edge case with
> > >   JavaScript I mentioned in the beginning. I might check that next
> > >   time (I'm still not sure of how).
> > 
> > Maybe we are going to discover it during the implementation process.
> 
> I _suspect_ it would be enough to give precedence to interpreting line
> breaks as end-of-statement, and if so, there might be a way to represent
> that precedence in the EBNF grammar (by convention). I would still need
> to mull over it for a while to be sure.

I can test the precedence solution later as well.

> Another revision:
> 
> - Function body now accepts a single expression.

Hmm... I think we still haven't agreed on having it.  We are trying to
make a very simple language, and avoiding multiple ways of achieving the
same thing will probably keep it simpler.

> - Introducing hexadecimal integer literals.
>     - Hexadecimal literals are allowed to have an arbitrarily long
>       sequence of zeroes to the left, after the `0x` prefix. This is
>       intentional, and the parser should ignore excess zeroes.
>     - The alphabetic characters in the literal should be parsed without
>       case-sensitivity. There is no difference between the `0x` and `0X`
>       prefixes in literals. If mixed case is not desirable, let the
>       linter or formatter restrict that.

Nice that you added support for hex. We will need octal in the future as
well (starting with 0).  But let's craft more features in new patches; this
thread is getting out of control.  Perhaps we can merge an initial
version and propose small changes to it (in new discussion threads).

I can set up a PATCH with most of your changes (except the
single-expression function body; we can discuss it in a new PATCH thread).

end of thread, other threads:[~2024-03-23 23:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2024-03-09  0:05 [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
2024-03-08 23:09 ` [olang/patches/.build.yml] build success builds.sr.ht
2024-03-14  4:29   ` Ricardo Kagawa
2024-03-14 22:43     ` Johnny Richard
2024-03-09  0:36 ` [RFC PATCH olang v1] docs: create zero programming language specification Johnny Richard
2024-03-09  5:09 ` Carlos Maniero
2024-03-19 20:21 ` Johnny Richard
2024-03-23 23:31 ` Carlos Maniero
2024-03-15 20:54 Ricardo Kagawa
2024-03-17 15:41 ` Carlos Maniero
2024-03-18  9:58 ` Johnny Richard

Code repositories for project(s) associated with this public inbox

	https://git.johnnyrichard.com/olang.git
