Intelligence, I believe, is not an isolated phenomenon—it is a projection observed, learned from, and refined through interaction. Modern artificial intelligence systems excel at mimicking specific slices of human behavior—whether the visual recognition learned from large-scale datasets or the linguistic fluency of LLMs—but these slices are merely fragments of a broader projection of human intelligence. Such systems simulate behavior within a bounded plane: recognizing objects, generating coherent text, or responding to prompts. Yet they lack the continuity of understanding that emerges when perception, memory, and action are entangled across time. Human intelligence is built from lifelong observation—of language, motion, context, and consequence—paired with the embodied feedback of acting in the world. It is within these projections, from gesture to gaze, from language to locomotion, that intelligent behavior is first perceived and then replicated.
To build truly intelligent systems, we must move beyond training on isolated modalities and static datasets. We need agents that can learn continuously from rich streams of multimodal observations—spanning vision, language, audio, physical interaction, and social context. More critically, these systems must interact with the world and with other agents to ground their understanding in experience. Just as humans refine their motor skills, social behaviors, and reasoning by navigating a world of constraints and feedback, artificial agents must be exposed to environments that demand adaptation, correction, and learning over time. Only by combining broad-spectrum observation with continual interaction can we hope to approach the flexible, grounded, and adaptive intelligence that we recognize in ourselves.