The total RAM per block is 18KB. Each block has a 72-bit width. I don't really know where you're pulling your numbers from. Even if you put blocks in parallel, 128/18 is about 7.1, so 8 block RAM units are required, each 72 bits wide, which is not a 1024-bit width either.
I think you are misinformed about what is and is not possible.
You can construct whatever width you like by putting multiple units in parallel. This is commonly done, and is a general feature of FPGAs, not unique to Xilinx.
The vendors put them into small blocks like that to improve the granularity / flexibility for the designer. As a result, you effectively lose capacity (bits) when your chosen configuration doesn't map efficiently to the underlying memory organization.
Artix-7 is even better, but to limit the discussion to Spartan-6, which many people have already bought, here is some documentation:
See page two of this:
(a) http://www.xilinx.com/support/documentation/ip_documentation/blk_mem_gen_ds512.pdf
See page nine of this:
(b) http://www.xilinx.com/support/documentation/user_guides/ug383.pdf
See page two of this:
(c) http://www.xilinx.com/support/documentation/data_sheets/ds160.pdf
To get a x1024 memory using (a), you can see from (b) that one possibility might be (32) instances of (x32) width.
As for the capability of the LX150 part commonly used on existing bitcoin mining boards, you will see in (c) that this device has a total of (268) such blocks.
So accommodating the 128KB scratchpad in SCRYPT could be done with (64) blocks configured for (x32) width and (32) units in parallel. The LX150 could possibly hold (4) such memories, but I think you run out of gates for SCRYPT arithmetic well before that.
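The block count above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that a block in (x32) mode is 512 words deep (data bits only, parity bits unused), which is my reading of (b) rather than something stated in this thread, and all the variable names are made up for illustration.

```python
import math

# Figures taken from the posts above.
SCRATCHPAD_BYTES = 128 * 1024   # SCRYPT scratchpad, 128KB
BLOCK_DATA_WIDTH = 32           # each block configured for x32
TARGET_WIDTH = 1024             # desired x1024 memory
LX150_BLOCKS = 268              # total blocks on the LX150 per (c)

# ASSUMPTION: a block in x32 mode is 512 words deep (parity bits unused).
BLOCK_DEPTH = 512

scratchpad_bits = SCRATCHPAD_BYTES * 8
blocks_wide = TARGET_WIDTH // BLOCK_DATA_WIDTH       # units in parallel
words_needed = scratchpad_bits // TARGET_WIDTH       # 1024-bit words to store
blocks_deep = math.ceil(words_needed / BLOCK_DEPTH)  # rows of blocks stacked in depth
blocks_total = blocks_wide * blocks_deep
memories_per_chip = LX150_BLOCKS // blocks_total

print("units in parallel:", blocks_wide)
print("blocks per scratchpad:", blocks_total)
print("scratchpads per LX150:", memories_per_chip)
```

Under that depth assumption the numbers come out to (32) units in parallel, (64) blocks per scratchpad and (4) scratchpads per LX150, before counting any logic for the SCRYPT arithmetic itself.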
I'm sorry, but I still don't follow. Table 4 in (b), which you cited, shows a maximum width of 32 bits for a 9 KB block. With 18 KB data blocks, the maximum width is 64 bits (plus error-check bits).
You can get 32-bit writes in parallel on 32 separate 9 KB blocks, which is sort of like a 1024-bit interface (I guess; a 1024-bit interface really implies that you're writing 1024 bits a cycle through the same memory interface...). I think a direct implementation like this won't achieve a very good speed, though (less than 10 KH/s on most of these chips).
I would think the better implementation would just run in whatever memory is allocated and remake the lookup table entries as needed. See the kernels for cgminer and reaper and their use of the "lookup gap" function, which more or less does this.
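To make the "remake the LUT as needed" idea concrete, here is a rough Python sketch of the trade-off. It is not the cgminer or reaper kernel: mix() is a SHA-256 stand-in for scrypt's real block mixing, integerify() is simplified, and the function names are invented for this example. The only point is that storing every gap-th table entry and recomputing the ones in between gives the same result with 1/gap of the memory.

```python
import hashlib

def mix(block: bytes) -> bytes:
    # ASSUMPTION: stand-in for scrypt's block mixing step, not the real thing.
    return hashlib.sha256(block).digest()

def integerify(block: bytes, n: int) -> int:
    # Simplified: derive a table index from the first 4 bytes of the block.
    return int.from_bytes(block[:4], "little") % n

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def romix_full(seed: bytes, n: int) -> bytes:
    # Store all n table entries, then do n data-dependent lookups.
    table = []
    x = seed
    for _ in range(n):
        table.append(x)
        x = mix(x)
    for _ in range(n):
        j = integerify(x, n)
        x = mix(xor(x, table[j]))
    return x

def romix_gap(seed: bytes, n: int, gap: int) -> bytes:
    # Store only every gap-th entry; rebuild the others on demand.
    table = []
    x = seed
    for i in range(n):
        if i % gap == 0:
            table.append(x)
        x = mix(x)
    for _ in range(n):
        j = integerify(x, n)
        entry = table[j // gap]      # nearest stored entry at or below j
        for _ in range(j % gap):     # recompute forward to entry j
            entry = mix(entry)
        x = mix(xor(x, entry))
    return x

seed = hashlib.sha256(b"example").digest()
same_gap2 = romix_full(seed, 1024) == romix_gap(seed, 1024, 2)
same_gap4 = romix_full(seed, 1024) == romix_gap(seed, 1024, 4)
print(same_gap2, same_gap4)
```

With a gap of 2 the table is half the size, at the cost of up to one extra mix() call per lookup; with a gap of 4 it is a quarter of the size and up to three extra calls. Whether that trade is worth it on an FPGA depends on how many gates are left for the extra arithmetic.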