LLMs are revolutionizing code generation, but a deeper question lingers: can they also execute code? This post explores the exciting frontier of LLM code execution, examining how these models leverage code structure, the limitations of their pattern-matching approach, and promising research that aims to make LLMs not just code writers but code runners.
LLMs and the Structure of Text
Large Language Models are fundamentally pattern recognition machines, trained on massive text datasets. They excel at identifying and replicating patterns, but their capabilities extend beyond simply predicting the next word in a sequence. LLMs are also remarkably adept at leveraging the structure of text, which plays a crucial role in their ability to understand context, follow instructions, and generate meaningful output. This understanding of structure is foundational to how they interact with code, as code itself is highly structured.
Understanding Conversational Structure
One way LLMs interact with structure is through the use of prompts. Prompts act as instructions or queries, guiding the LLM's behavior and shaping its responses. These prompts often incorporate special markers, similar to tags in HTML or XML, which provide clear boundaries and context. These markers delineate different parts of the text and guide the LLM's interpretation and generation process. For instance, the documentation for Llama 3 specifies a precise prompt structure using markers like <|begin_of_text|>, <|start_header_id|>, and <|end_header_id|>. These markers are crucial for defining roles and establishing boundaries between different parts of the conversation.
To understand how LLMs interpret text as structured conversations, let's examine the specific prompt structure used for Llama 3. This structure uses special markers to delineate individual messages and define roles within a multi-turn conversation, much like a chat interface. The effectiveness of this structured format is amplified by Reinforcement Learning from Human Feedback (RLHF). This training process uses human feedback to teach the LLM how to generate appropriate and helpful responses within the defined conversational structure.
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
I'm an AI assistant, which means I'm a computer program designed to ... (rest of the assistant's response) ... <|eot_id|>
Let's break down how these markers function:
<|begin_of_text|>: This marker signals the absolute start of the conversation, providing a clear entry point for the LLM.
<|start_header_id|> and <|end_header_id|>: These markers encapsulate role identifiers. system indicates a system-level instruction or context-setting message, user denotes a message from the user posing a question or request, and assistant marks a response generated by the LLM, playing the role of a helpful assistant.
<|eot_id|>: This crucial marker signifies the end of a turn or message. It acts as a delimiter, separating each message chunk within the conversation. This allows the LLM to treat each segment between these markers as a distinct message in the ongoing dialogue.
By using these specific markers, the prompt structures the input as a sequence of discrete messages, mimicking a chat exchange. This structure, combined with RLHF training, allows the LLM to generate contextually appropriate and conversationally coherent responses.
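Because the format is purely textual, it can be assembled programmatically. The following is a minimal sketch of such a prompt builder; the helper name and message schema are illustrative choices, and the exact whitespace handling may differ slightly from the official Llama 3 template.

def build_llama3_prompt(messages):
    # messages: list of {"role": "system" | "user" | "assistant", "content": str}
    prompt = "<|begin_of_text|>"
    for m in messages:
        prompt += f"<|start_header_id|>{m['role']}<|end_header_id|>\n"
        prompt += f"{m['content']}<|eot_id|>"
    # Leave the prompt open at an assistant header so the model generates the reply.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n"
    return prompt

print(build_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]))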
Handling Nested Structures: The JSON Example
Building upon the concept of delimited structures, let's consider a more complex example: JSON (JavaScript Object Notation). JSON is commonly used for data interchange and incorporates nested objects, arrays, and key-value pairs, creating a hierarchical structure. LLMs are remarkably proficient at processing and generating JSON, showcasing their ability to handle more than just flat, delimited text. The following example demonstrates zero-shot function calling, where the LLM is given a description of a function in JSON format and asked to invoke it based on a user query. The JSON provides the LLM with the necessary information about the function's name, parameters, and expected behavior.
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Questions: Can you retrieve the details for the user with the ID 7890, who has black as their special request?
Here is a list of functions in JSON format that you can invoke:
[
  {
    "name": "get_user_info",
    "description": "Retrieve details for a specific user by their unique identifier...",
    "parameters": {
      "type": "dict",
      "required": ["user_id"],
      "properties": {
        "user_id": { "type": "integer", "description": "..." },
        "special": { "type": "string", "description": "...", "default": "none" }
      }
    }
  }
]
...(rest of prompt)...
[get_user_info(user_id=7890, special='black')]<|eot_id|>
This ability to work with nested structures like JSON demonstrates that LLMs can parse and interpret hierarchical data, extracting information from multiple levels of nesting. However, it's important to note that even the nested structure of JSON is still pre-trained and static. This differs significantly from the dynamic and flexible structure of programming code, which presents a greater challenge for LLMs.
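Note that the bracketed call the model emits is itself just text; on the application side it still has to be parsed and dispatched to a real function. Below is a minimal sketch of that step; the get_user_info implementation and the AST-based parsing approach are illustrative assumptions, not part of the Llama 3 format.

import ast

def get_user_info(user_id: int, special: str = "none") -> dict:
    # Hypothetical backend lookup; stands in for a real data source.
    return {"user_id": user_id, "special": special}

def dispatch(call_text: str, registry: dict):
    # The model's output looks like: [get_user_info(user_id=7890, special='black')]
    # Parse it as a Python expression and route each call to a registered function.
    tree = ast.parse(call_text.strip(), mode="eval")
    results = []
    for call in tree.body.elts:  # the outer [...] is a list of calls
        name = call.func.id
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        results.append(registry[name](**kwargs))
    return results

print(dispatch("[get_user_info(user_id=7890, special='black')]",
               {"get_user_info": get_user_info}))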
The Semantic Gap
While LLMs exhibit impressive proficiency with structured code and demonstrate an ability to generate syntactically correct and often functionally relevant code snippets, a critical question remains: Do they truly understand the code they manipulate? Or are they simply mimicking patterns without grasping the underlying meaning? This is where the "parrot" analogy comes into play. Just as a parrot can mimic human speech without comprehending the meaning of the words, LLMs can generate code that looks right without necessarily understanding its purpose or implications.
This limitation becomes apparent when LLMs encounter code that deviates from the common patterns they've learned during training. For instance, they might struggle with code that uses unusual syntax or grammar, even if it's logically sound. They might also falter when faced with domain-specific languages or APIs that aren't widely represented in their training data. In these situations, the LLM's reliance on statistical associations becomes a hindrance, as it lacks the deeper understanding needed to adapt to novel or unexpected scenarios. Furthermore, LLMs can be easily fooled by code containing subtle logical errors. Even if the code is syntactically correct and conforms to familiar patterns, the LLM might not detect flaws that would be immediately obvious to a human programmer who understands the intended behavior of the code. This inability to reason about the code's logic and anticipate its execution flow highlights a fundamental limitation of current LLMs.
The crux of the issue lies in the distinction between semantic understanding and statistical understanding. Human programmers understand code semantically. They grasp the meaning and intent behind the code, can reason about its behavior, and can anticipate its effects. LLMs, on the other hand, operate on a statistical level. They identify patterns and predict the likelihood of certain code elements appearing in specific contexts, but they lack the deeper understanding of why those patterns exist or what they represent in the broader context of the program. This semantic gap is a significant hurdle in the quest for true LLM code comprehension. While LLMs can generate impressive code artifacts, their ability to reason about, debug, or truly understand the code they produce remains limited. This raises crucial questions about the role of LLMs in software development and the need for more robust approaches to code interpretation.
Pushing the Boundaries: From Patterns to Programs
While LLMs excel at leveraging patterns in code, a crucial question remains: can they move beyond pattern recognition to trace the algorithmic steps and reason about program behavior? This is a significant challenge, requiring LLMs to go beyond static structures and grapple with the dynamic processes of code execution. A key area of exploration is moving beyond simply generating code to actually modeling the execution process, enabling LLMs to perform computations and reason about program behavior.
Prompt Engineering and Algorithmic Execution
Prompt engineering plays a crucial role in eliciting desired behavior from LLMs. By crafting carefully structured prompts, we can guide LLMs to perform complex tasks, including generating traces of algorithmic execution. Two key techniques in prompt engineering are few-shot prompting and in-context learning. Few-shot prompting involves providing the LLM with a small number of examples demonstrating the target task. In-context learning, which encompasses few-shot prompting, embeds examples or task instructions directly within the prompt itself, offering immediate context for the LLM.
The idea of using prompts to guide LLMs towards algorithmic execution has been explored in interesting ways. In the paper "GPT is becoming a Turing machine: Here are some ways to program it" (Jojic, Wang, and Jojic, 2023, arXiv:2303.14310v1 [cs.CL]), the authors demonstrate how carefully crafted prompts can induce LLMs to model the step-by-step execution of programs, including those involving loops and other control flow mechanisms. Their work highlights the potential for LLMs to move beyond simply generating code to actually producing a symbolic representation of how a program would run, opening up exciting possibilities for automated program analysis and debugging. The LLM's ability to generate these step-by-step execution traces also hints at the potential for more advanced code generation capabilities in the future. One of the key techniques they introduce is "Iteration by Regimenting Self-Attention" (IRSA), which we'll explore in detail in the next section.
IRSA: Guiding LLMs to Execute Code
IRSA (Iteration by Regimenting Self-Attention), introduced by Jojic, Wang, and Jojic, offers a novel approach to guiding LLMs toward executing algorithms. IRSA prompts provide a detailed, step-by-step trace of an algorithm's execution on a specific input. This trace includes intermediate states, operations performed, and conditions governing the control flow. This “regimentation” of self-attention compels the LLM to attend to the entire execution history when generating subsequent steps, effectively modeling the process of running the code line by line. The core idea is to provide enough structure and guidance in the prompt to enable the LLM to effectively "step through" the algorithm, even though it's not actually executing the code in a traditional sense.
Jojic et al. illustrate the power of IRSA by comparing it to traditional zero-shot prompting using the bubble sort algorithm. They demonstrate the effectiveness of a single-execution prompt, which provides a complete, step-by-step execution trace for a single input. Prompt 1 in the paper exemplifies this approach, meticulously detailing each step of bubble sort on the input 2, 3, 1, 5.
Problem: 2, 3, 1, 5
EXECUTION
Prep
Length of the list: 4
Number of consecutive pairs: 3
a=[2 3 1 5]
set n_swaps=0
EndPrep
Iteration:
set swap_flag=false. The state is:
State: a=[2 3 1 5], n_swaps=0, swap_flag=false EndState
Pair a[1,2] = [2 3] Check if 2<3. Is it true? Yes. Because of that, we leave state as is
State: a=[2 3 1 5], n_swaps=0, swap_flag=false
Pair a[2,3]= [3 1] Check if 3<1. Is it true? No. Thus, we swap_flag=true, increase n_swaps by one, and in the latest a=[2 3 1 5] swap 3 and 1 to get into state:
State: a=[2 1 3 5], n_swaps=1, swap_flag=true EndState
Pair a[3,4]= [3 5] Check if 3<5. Is it true? Yes. Because of that, we leave state as is
State: a=[2 1 3 5], n_swaps=1, swap_flag=true EndState
swap_flag is true, so do another iteration
Iteration:
set swap_flag=false. The state is:
State: a=[2 1 3 5], n_swaps=1, swap_flag=false EndState
Pair a[1,2] = [2 1] Check if 2<1. Is it true? No.
Thus, we set swap_flag=true, increase n_swaps by one,
and in the latest a=[2, 1, 3, 5] swap 2 and 1 to get into state:
State: a=[1 2 3 5], n_swaps=2, swap_flag=true EndState
Pair a[2,3] = [2 3] Check if 2<3. Is it true? Yes.
Because of that, we leave state as is
State: a=[1 2 3 5], n_swaps=2, swap_flag=true EndState
Pair a[3,4] = [3 5] Check if 3<5. Is it true? Yes.
Because of that, we leave state as is
State: a=[1 2 3 5], n_swaps=2, swap_flag=true EndState
swap_flag is true, so do another iteration
Iteration:
set swap_flag=false. The state is:
State: a=[1 2 3 5], n_swaps=2, swap_flag=false EndState
Pair a[1,2] = [1 2] Check if 1<2. Is it true? Yes.
Because of that, we leave state as is
State: a=[1 2 3 5], n_swaps=2, swap_flag=false EndState
Pair a[2,3] = [2 3] Check if 2<3. Is it true? Yes.
Because of that, we leave state as is
State: a=[1 2 3 5], n_swaps=2, swap_flag=false EndState
Pair a[3,4] = [3 5] Check if 3<5. Is it true? Yes.
Because of that, we leave state as is
State: a=[1 2 3 5], n_swaps=2, swap_flag=false EndState
swap_flag is false, so stop the iteration
Final List: 1, 2, 3, 5
Number of swaps: 2
END OF EXECUTION
Problem: 0, 3, 8, 5, 6
EXECUTION
When given Prompt 1, the LLM correctly extends the execution to the new input 0, 3, 8, 5, 6, demonstrating successful few-shot learning of the bubble sort algorithm through a single, detailed execution example.
Prep
Length of the list: 5
Number of consecutive pairs: 4
a=[0 3 8 5 6]
set n_swaps=0
EndPrep
Iteration:
set swap_flag=false. The state is:
State: a=[0 3 8 5 6], n_swaps=0, swap_flag=false EndState
Pair a[1,2] = [0 3] Check if 0<3. Is it true? Yes. Because of that, we leave state as is
State: a=[0 3 8 5 6], n_swaps=0, swap_flag=false EndState
Pair a[2,3] = [3 8] Check if 3<8. Is it true? Yes. Because of that, we leave state as is
State: a=[0 3 8 5 6], n_swaps=0, swap_flag=false EndState
Pair a[3,4] = [8 5] Check if 8<5. Is it true? No. Thus, we set swap_flag=true, increase n_swaps by one, and in the latest a=[0 3 8 5 6] swap 8 and 5 to get into state:
State: a=[0 3 5 8 6], n_swaps=1, swap_flag=true EndState
Pair a[4,5] = [8 6] Check if 8<6. Is it true? No. Thus, we set swap_flag=true, increase n_swaps by one, and in the latest a=[0 3 5 8 6] swap 8 and 6 to get into state:
State: a=[0 3 5 6 8], n_swaps=2, swap_flag=true EndState
swap_flag is true, so do another iteration
Iteration:
set swap_flag=false. The state is:
State: a=[0 3 5 6 8], n_swaps=2, swap_flag=false EndState
Pair a[1,2] = [0 3] Check if 0<3. Is it true? Yes. Because of that, we leave state as is
State: a=[0 3 5 6 8], n_swaps=2, swap_flag=false EndState
Pair a[2,3] = [3 5] Check if 3<5. Is it true? Yes. Because of that, we leave state as is
State: a=[0 3 5 6 8], n_swaps=2, swap_flag=false EndState
Pair a[3,4] = [5 6] Check if 5<6. Is it true? Yes. Because of that, we leave state as is
State: a=[0 3 5 6 8], n_swaps=2, swap_flag=false EndState
Pair a[4,5] = [6 8] Check if 6<8. Is it true? Yes. Because of that, we leave state as is
State: a=[0 3 5 6 8], n_swaps=2, swap_flag=false EndState
swap_flag is false, so stop the iteration
Final List: 0, 3, 5, 6, 8
Number of swaps: 2
END OF EXECUTION
Contrast this with Prompt 11, a zero-shot prompt that provides only the code and the input 8, 7, 6, 4, 9:
# Here is the code for Bubble Sort algorithm that keeps track of the number of needed swaps:
def bubbleSort(arr):
    n = len(arr)
    # initiate the number of swaps to 0
    n_swaps = 0
    # Traverse through all array elements
    for i in range(n):
        # Last i elements are already in place
        for j in range(0, n-i-1):
            # traverse the array from 0 to n-i-1
            # Swap if the element found is greater
            # than the next element
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
                n_swaps = n_swaps+1
Problem: 8, 7, 6, 4, 9
EXECUTION:
However, the LLM's response to Prompt 11 breaks down. Although its trace eventually reaches a sorted list, the model never terminates the outer loop: it keeps incrementing i past the end of the array and never reports a final result. This highlights the limitations of relying on pattern recognition alone, without the explicit step-by-step guidance provided by IRSA.
i=0
j=0
8 > 7
7, 8, 6, 4, 9
j=1
8 > 6
7, 6, 8, 4, 9
j=2
8 > 4
7, 6, 4, 8, 9
j=3
8 > 9
7, 6, 4, 8, 9
j=4
i=1
j=0
7 > 6
6, 7, 4, 8, 9
j=1
7 > 4
6, 4, 7, 8, 9
j=2
7 > 8
6, 4, 7, 8, 9
j=3
i=2
j=0
6 > 4
4, 6, 7, 8, 9
j=1
6 > 7
4, 6, 7, 8, 9
j=2
i=3
j=0
4 > 6
4, 6, 7, 8, 9
j=1
i=4
j=0
i=5
j=0
i=6
j=0
i=7
j=0
i=8
...
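For reference, mechanically executing the code from Prompt 11 shows what the trace should have converged to. Here is a quick check, with the function adapted to return its results since the quoted snippet returns nothing:

def bubble_sort_with_swaps(arr):
    # Same algorithm as Prompt 11, adapted to return the sorted list and swap count.
    arr = list(arr)
    n = len(arr)
    n_swaps = 0
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                n_swaps += 1
    return arr, n_swaps

print(bubble_sort_with_swaps([8, 7, 6, 4, 9]))  # ([4, 6, 7, 8, 9], 6)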
Experiments with Condensed Symbolic Representations
Inspired by IRSA's approach of providing explicit guidance for algorithmic execution, my own experiments explored the use of more concise and abstract symbolic representations. These experiments aimed to test the boundaries of how explicitly LLMs need to be guided to successfully trace the steps of an algorithm.
Experiment 1: GPT-3.5 with Explicit Symbol Mapping (June 2, 2023):
This initial experiment, conducted with GPT-3.5, employed a condensed symbolic representation of the bubble sort algorithm, coupled with a dictionary that explicitly mapped symbols to code elements.
Prompt:
'Problem': '[P]'
'[Initial State]': '[IS]'
'a:': 'a'
'n_swaps:': 'n'
'swap_flag:': 's'
'[Iteration 1]': '[I1]'
'Comparing a[0]': 'C0'
'No swap performed': 'No'
'Updated State:': 'US'
'Comparing a[1]': 'C1'
'3 < 1:': '3 < 1'
'Swap performed:': 'S'
'n_swaps incremented by 1:': 'n+'
'swap_flag set to': 's='
'[Swap Detected]': '[SD]'
'[Swap Not Detected]': '[SND]'
'[Final State]': '[FS]'
'Final List:': 'FL'
'Number of Swaps:': 'NS'
'[End of Execution]': '[EE]'
'Entered another iteration': '[EAI]'
[P]
[IS] a: [2, 3, 1, 5] n: 0 s: False
[I1] C0 (2) C1 (3) 2 < 3: True No US: a: [2, 3, 1, 5] n: 0 s: False
[I2] C1 (3) C2 (1) 3 < 1: False S: [2, 3, 1, 5] -> [2, 1, 3, 5] n+: 0 -> 1 s=: True US: a: [2, 1, 3, 5] n: 1 s: True
[I3] C2 (3) C3 (5) 3 < 5: True No US: a: [2, 1, 3, 5] n: 1 s: True
[SD] s is True, [EAI]
[I4] C0 (2) C1 (1) 2 < 1: False S: [2, 1, 3, 5] -> [1, 2, 3, 5] n+: 1 -> 2 s=: True US: a: [1, 2, 3, 5] n: 2 s: True
[I5] C1 (2) C2 (3) 2 < 3: True No US: a: [1, 2, 3, 5] n: 2 s: True
[I6] C2 (3) C3 (5) 3 < 5: True No US: a: [1, 2, 3, 5] n: 2 s: True
[SD] s is True, [EAI]
[I7] C0 (1) C1 (2) 1 < 2: True No US: a: [1, 2, 3, 5] n: 2 s: False
[I8] C1 (2) C2 (3) 2 < 3: True No US: a: [1, 2, 3, 5] n: 2 s: False
[I9] C2 (3) C3 (5) 3 < 5: True No US: a: [1, 2, 3, 5] n: 2 s: False
[SND] s is False, exiting the iteration
[FS] FL: [1, 2, 3, 5] NS: 2
[EE]
[P]
[IS] a: [14, 0, 2, 15, 3, 6] n: 0 s: False
Output:
[I1] C0 (14) C1 (0) 14 < 0: False S: [14, 0, 2, 15, 3, 6] -> [0, 14, 2, 15, 3, 6] n+: 0 -> 1 s=: True US: a: [0, 14, 2, 15, 3, 6] n: 1 s: True
[I2] C1 (14) C2 (2) 14 < 2: False S: [0, 14, 2, 15, 3, 6] -> [0, 2, 14, 15, 3, 6] n+: 1 -> 2 s=: True US: a: [0, 2, 14, 15, 3, 6] n: 2 s: True
[I3] C2 (14) C3 (15) 14 < 15: True No US: a: [0, 2, 14, 15, 3, 6] n: 2 s: True
[I4] C3 (15) C4 (3) 15 < 3: False S: [0, 2, 14, 15, 3, 6] -> [0, 2, 14, 3, 15, 6] n+: 2 -> 3 s=: True US: a: [0, 2, 14, 3, 15, 6] n: 3 s: True
[I5] C4 (15) C5 (6) 15 < 6: False S: [0, 2, 14, 3, 15, 6] -> [0, 2, 14, 3, 6, 15] n+: 3 -> 4 s=: True US: a: [0, 2, 14, 3, 6, 15] n: 4 s: True
[SD] s is True, [EAI]
[I6] C0 (0) C1 (2) 0 < 2: True No US: a: [0, 2, 14, 3, 6, 15] n: 4 s: True
[I7] C1 (2) C2 (14) 2 < 14: True No US: a: [0, 2, 14, 3, 6, 15] n: 4 s: True
[I8] C2 (14) C3 (3) 14 < 3: False S: [0, 2, 14, 3, 6, 15] -> [0, 2, 3, 14, 6, 15] n+: 4 -> 5 s=: True US: a: [0, 2, 3, 14, 6, 15] n: 5 s: True
[I9] C3 (14) C4 (6) 14 < 6: False S: [0, 2, 3, 14, 6, 15] -> [0, 2, 3, 6, 14, 15] n+: 5 -> 6 s=: True US: a: [0, 2, 3, 6, 14, 15] n: 6 s: True
[I10] C4 (14) C5 (15) 14 < 15: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: True
[SD] s is True, [EAI]
[I11] C0 (0) C1 (2) 0 < 2: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: True
[I12] C1 (2) C2 (3) 2 < 3: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: True
[I13] C2 (3) C3 (6) 3 < 6: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: True
[I14] C3 (6) C4 (14) 6 < 14: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: True
[I15] C4 (14) C5 (15) 14 < 15: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: True
[SND] s is False, exiting the iteration
[FS] FL: [0, 2, 3, 6, 14, 15] NS: 6
[EE]
Despite the more abstract and concise format, GPT-3.5 successfully executed the bubble sort algorithm, demonstrating its ability to interpret and apply symbolic instructions when provided with an explicit mapping. This result suggested that LLMs could potentially work with more compressed representations of code execution, opening up possibilities for more efficient and flexible prompting strategies.
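A useful property of this condensed format is that it can be generated, and therefore checked, mechanically. The sketch below reproduces the trace format used above from a reference bubble sort implementation, which is handy both for building prompts and for verifying model outputs; the function name and exact string formatting are my own choices rather than anything prescribed by the experiment.

def condensed_trace(values):
    # Emit a bubble-sort execution trace in the condensed symbolic format shown above.
    a = list(values)
    lines = [f"[IS] a: {a} n: 0 s: False"]
    n_swaps, step = 0, 0
    swap_flag = True
    while swap_flag:
        swap_flag = False
        for j in range(len(a) - 1):
            step += 1
            x, y = a[j], a[j + 1]
            if x < y:
                lines.append(f"[I{step}] C{j} ({x}) C{j+1} ({y}) {x} < {y}: True No "
                             f"US: a: {a} n: {n_swaps} s: {swap_flag}")
            else:
                before = list(a)
                a[j], a[j + 1] = y, x
                n_swaps += 1
                swap_flag = True
                lines.append(f"[I{step}] C{j} ({x}) C{j+1} ({y}) {x} < {y}: False "
                             f"S: {before} -> {a} n+: {n_swaps - 1} -> {n_swaps} s=: True "
                             f"US: a: {a} n: {n_swaps} s: True")
        if swap_flag:
            lines.append("[SD] s is True, [EAI]")
        else:
            lines.append("[SND] s is False, exiting the iteration")
    lines.append(f"[FS] FL: {a} NS: {n_swaps}")
    lines.append("[EE]")
    return "\n".join(lines)

print(condensed_trace([14, 0, 2, 15, 3, 6]))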
Experiment 2: Gemini 1.5 with Implicit Symbol Interpretation (December 6, 2024):
Building on the first experiment, this subsequent experiment with gemini-1.5-flash-002 aimed to test whether explicit symbol mapping was strictly necessary. The dictionary was removed, leaving only the condensed symbolic representation.
Prompt:
[P]
[IS] a: [2, 3, 1, 5] n: 0 s: False
[I1] C0 (2) C1 (3) 2 < 3: True No US: a: [2, 3, 1, 5] n: 0 s: False
[I2] C1 (3) C2 (1) 3 < 1: False S: [2, 3, 1, 5] -> [2, 1, 3, 5] n+: 0 -> 1 s=: True US: a: [2, 1, 3, 5] n: 1 s: True
[I3] C2 (3) C3 (5) 3 < 5: True No US: a: [2, 1, 3, 5] n: 1 s: True
[SD] s is True, [EAI]
[I4] C0 (2) C1 (1) 2 < 1: False S: [2, 1, 3, 5] -> [1, 2, 3, 5] n+: 1 -> 2 s=: True US: a: [1, 2, 3, 5] n: 2 s: True
[I5] C1 (2) C2 (3) 2 < 3: True No US: a: [1, 2, 3, 5] n: 2 s: True
[I6] C2 (3) C3 (5) 3 < 5: True No US: a: [1, 2, 3, 5] n: 2 s: True
[SD] s is True, [EAI]
[I7] C0 (1) C1 (2) 1 < 2: True No US: a: [1, 2, 3, 5] n: 2 s: False
[I8] C1 (2) C2 (3) 2 < 3: True No US: a: [1, 2, 3, 5] n: 2 s: False
[I9] C2 (3) C3 (5) 3 < 5: True No US: a: [1, 2, 3, 5] n: 2 s: False
[SND] s is False, exiting the iteration
[FS] FL: [1, 2, 3, 5] NS: 2
[EE]
[P]
[IS] a: [14, 0, 2, 15, 3, 6] n: 0 s: False
Output:
[I1] C0 (14) C1 (0) 14 < 0: False S: [14, 0, 2, 15, 3, 6] -> [0, 14, 2, 15, 3, 6] n+: 0 -> 1 s=: True US: a: [0, 14, 2, 15, 3, 6] n: 1 s: True
[I2] C1 (14) C2 (2) 14 < 2: False S: [0, 14, 2, 15, 3, 6] -> [0, 2, 14, 15, 3, 6] n+: 1 -> 2 s=: True US: a: [0, 2, 14, 15, 3, 6] n: 2 s: True
[I3] C2 (14) C3 (15) 14 < 15: True No US: a: [0, 2, 14, 15, 3, 6] n: 2 s: True
[I4] C3 (15) C4 (3) 15 < 3: False S: [0, 2, 14, 15, 3, 6] -> [0, 2, 14, 3, 15, 6] n+: 2 -> 3 s=: True US: a: [0, 2, 14, 3, 15, 6] n: 3 s: True
[I5] C4 (15) C5 (6) 15 < 6: False S: [0, 2, 14, 3, 15, 6] -> [0, 2, 14, 3, 6, 15] n+: 3 -> 4 s=: True US: a: [0, 2, 14, 3, 6, 15] n: 4 s: True
[SD] s is True, [EAI]
[I6] C0 (0) C1 (2) 0 < 2: True No US: a: [0, 2, 14, 3, 6, 15] n: 4 s: True
[I7] C1 (2) C2 (14) 2 < 14: True No US: a: [0, 2, 14, 3, 6, 15] n: 4 s: True
[I8] C2 (14) C3 (3) 14 < 3: False S: [0, 2, 14, 3, 6, 15] -> [0, 2, 3, 14, 6, 15] n+: 4 -> 5 s=: True US: a: [0, 2, 3, 14, 6, 15] n: 5 s: True
[I9] C3 (14) C4 (6) 14 < 6: False S: [0, 2, 3, 14, 6, 15] -> [0, 2, 3, 6, 14, 15] n+: 5 -> 6 s=: True US: a: [0, 2, 3, 6, 14, 15] n: 6 s: True
[SD] s is True, [EAI]
[I10] C0 (0) C1 (2) 0 < 2: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: False
[I11] C1 (2) C2 (3) 2 < 3: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: False
[I12] C2 (3) C3 (6) 3 < 6: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: False
[I13] C3 (6) C4 (14) 6 < 14: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: False
[I14] C4 (14) C5 (15) 14 < 15: True No US: a: [0, 2, 3, 6, 14, 15] n: 6 s: False
[SND] s is False, exiting the iteration
[FS] FL: [0, 2, 3, 6, 14, 15] NS: 6
[EE]
Remarkably, gemini-1.5-flash-002 successfully executed the algorithm without the dictionary, demonstrating an emergent ability for implicit symbol interpretation. This result suggests that more advanced LLMs can infer the meaning of symbols from their usage in context, moving beyond simple pattern matching towards a more nuanced understanding of algorithmic processes. This finding has significant implications for prompt engineering and the potential for developing more concise and flexible ways to interact with LLMs for code execution and other complex tasks. It also raises intriguing questions about the limits of implicit understanding and the potential for LLMs to learn and utilize more formal, symbolic representations of code.
Conclusion
Our exploration of how LLMs interact with code reveals a nuanced and evolving landscape. While these models demonstrate remarkable proficiency with the structure of text, from simple delimiters to the nested hierarchies of JSON, their "understanding" of code remains fundamentally different from that of human programmers. LLMs excel at recognizing and replicating patterns, but they often struggle with the semantic nuances and dynamic behavior inherent in programming languages. This limitation is evident in their susceptibility to unusual syntax, subtle logical errors, and their reliance on familiar patterns.
However, techniques like IRSA, which provides explicit, step-by-step guidance, demonstrate that LLMs can be pushed beyond simple pattern matching toward more sophisticated forms of code interaction, including generating traces of algorithmic execution. Furthermore, experiments with condensed symbolic representations, particularly the successful execution without an explicit symbol dictionary, suggest an emergent ability for implicit symbol interpretation, hinting at a deeper, more nuanced understanding of algorithmic processes.